Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Histogram for words in a MS Word 4

Status
Not open for further replies.

Bruce18W

Technical User
May 9, 2011
27
US
Hi,
Does anyone have code for counting the number of occurrences of each word in an MS Word document?

Example: the code run on a very short document containing only "the boy ran to the home"

would produce
the 2
boy 1
ran 1
etc.

thanks, Bruce
 


hi,

Short or generating code to 1. determine the unique word occurrences and 2) counting the number of occurrences for each unique occurrence and 3) displaying in some way, chart or table the result, check out Concordance in Word Help. In a nutshell, you can COPY your document text to a new document, REPLACE each SPACE with a paragraph, CONVERT the result to a table, SORT the table. I'd copy the table to Excel to analyze, using a PivotTable/PivotChart. A less than 5 minute exersize.

Skip,
[sub]
[glasses]Just traded in my old subtlety...
for a NUANCE![tongue][/sub]
 
It COULD be done via VBA, but it would be kludgey and wildly inefficient. Skip has the fastest and most useable solution with the added (huge) bonus of a built-in easy way of dealing the resulting data - the very strength of Excel.

 
The following macro generates a list of all words used in the active document, and outputs them alphabetically sorted, with frequency of occurrence in a two-column table at the end of that document, starting on a new page. It only lists words in the MainTextStory, not headers/footers/footnotes/endnotes etc. The 'unwanted' character exlusion set (defined by the StrIn = Replace(StrIn, Chr(i), " ") processes) is fairly comprehensive and includes all numbers.

The macro also has provision for an exclusion list, so that various words and phrases can be excluded. The exclusions list (defined by the words & phrases in the StrExcl string variable) lists the words and phrases to be omitted from the concordance. Any phrases should be listed in the exclusions list before any of the single-word exclusions (so that conflicts don’t occur).
Code:
Sub ConcordanceBuilder()
Application.ScreenUpdating = False
Dim StrIn As String, StrOut As String, StrTmp As String, StrExcl As String
Dim i As Long, j As Long, k As Long, Rng As Range
StrExcl = " a,the,and,or,but,not,to,of,i,you,we,he,her,them,she,him,it,they,who"
With ActiveDocument
  StrIn = .Content.Text
  For i = 1 To 255
    Select Case i
      Case 1 To 64, 91 To 96, 123 To 191, 247
      StrIn = Replace(StrIn, Chr(i), " ")
    End Select
  Next
  StrIn = " " & LCase(Trim(StrIn)) & " "
  For i = 0 To UBound(Split(StrExcl, ","))
    StrIn = Replace(StrIn, " " & Split(StrExcl, ",")(i) & " ", " ")
  Next
  While InStr(StrIn, "  ") > 0
    StrIn = Replace(StrIn, "  ", " ")
  Wend
  StrIn = " " & Trim(StrIn) & " "
  j = UBound(Split(StrIn, " "))
  For i = 1 To j
    If Len(Trim(StrIn)) = 0 Then Exit For
    StrTmp = Split(StrIn, " ")(1)
    While InStr(StrIn, " " & StrTmp & " ") > 0
      StrIn = Replace(StrIn, " " & StrTmp & " ", " ")
    Wend
    k = j - UBound(Split(StrIn, " "))
    StrOut = StrOut & StrTmp & vbTab & k & vbCr
    j = UBound(Split(StrIn, " "))
  Next
  Set Rng = .Range.Characters.Last
  With Rng
    .InsertAfter Chr(12) & StrOut
    .Start = .Start + 1
    .ConvertToTable Separator:=vbTab, Numcolumns:=2
    .Tables(1).Sort Excludeheader:=False, FieldNumber:=1, _
      SortFieldType:=wdSortFieldAlphanumeric, _
      SortOrder:=wdSortOrderAscending, CaseSensitive:=False
  End With
End With
Application.ScreenUpdating = True
End Sub

Cheers
Paul Edstein
[MS MVP - Word]
 
This is fantastic! Thanks so much. I just ran it on a manuscript and immediately got some insight into word usage, as in, "wow, did I really use that word 53 times?"

Thanks again,

Bruce
 
Very nice macropod certainly worth another star.

The only problem I see is it counts single letters proceeded with an apostrophe like:

Code:
that's   counts the s
can't    counts the t
I'm      counts the m

and I can't see where it does this.

Otherwise perfect!

Sam
 
I believe it's here:

For i = 1 To 255
Select Case i
Case 1 To 64, 91 To 96, 123 To 191, 247
StrIn = Replace(StrIn, Chr(i), " ")
End Select
Next

The 1 to 64 may need to be split into two groups omitting 39.

The rest of 1 to 64 looks safe to me.

Sam

Again thanks for the clever post

 
Hi mscallisto,

Yes, the code strips out the apostrophes, with the result that whatever's either side of them is treated as separate words. This has both benefits and drawbacks. One, that might be viewed as a benefit, is that the code can separate out posessive verbal contractions like 's','ll', 're' so that they can be counted in their own right - or aggregated with the verbs they logically represent. Of course, some might view this behaviour negatively. Another option would be to add such contractions to the exclusions lits. A drawback is that names like O'Donnell also get split. If you don't want words with apostrophes split, the Case test could be modified to:
Case 1 To 38, 40 To 64, 91 To 96, 123 To 145, 147 To 191, 247

That still leaves trailing apostrophes, which can be managed by inserting:
StrIn = Replace(Replace(StrIn, "’ ", " "), "' ", " ")
before:
StrIn = " " & LCase(Trim(StrIn)) & " "
The 'problem' with this approach, though, is that it doesn't differentiate between trailing apostrophes that were being used posessively and those being used as closing single quotes.

Cheers
Paul Edstein
[MS MVP - Word]
 
Paul,

Could you possibly/please spend a few minutes adding some comments to the code? I think I'd learn a lot by understanding it better, and some comments along the way would help. I ran it on a short sentence, hoping that I'd understand it. I do, sort of. But some comments would, I'm sure, improve understanding.

Thanks again, Bruce
 
Try:
Code:
Sub ConcordanceBuilder()
Application.ScreenUpdating = False
Dim StrIn As String, StrOut As String, StrTmp As String, StrExcl As String
Dim i As Long, j As Long, k As Long, Rng As Range
'Words to be excluded
StrExcl = " a,the,and,or,but,not,to,of,i,you,we,he,her,them,she,him,it,they,who"
With ActiveDocument
  'Load the content into a string
  StrIn = .Content.Text
  'Convert certain characters to spaces
  For i = 1 To 255
    Select Case i
      Case 1 To 64, 91 To 96, 123 To 191, 247
      StrIn = Replace(StrIn, Chr(i), " ")
    End Select
  Next
  'Prefix the string with single spaces
  StrIn = " " & LCase(Trim(StrIn)) & " "
  'Convert excluded words to spaces
  For i = 0 To UBound(Split(StrExcl, ","))
    StrIn = Replace(StrIn, " " & Split(StrExcl, ",")(i) & " ", " ")
  Next
  'Convert all double spaces to single spaces
  While InStr(StrIn, "  ") > 0
    StrIn = Replace(StrIn, "  ", " ")
  Wend
  'Prefix the string with single spaces
  StrIn = " " & Trim(StrIn) & " "
  j = UBound(Split(StrIn, " "))
  'Find each word in the string and replace all occurrences of that word with spaces
  For i = 1 To j
    If Len(Trim(StrIn)) = 0 Then Exit For
    StrTmp = Split(StrIn, " ")(1)
    While InStr(StrIn, " " & StrTmp & " ") > 0
      StrIn = Replace(StrIn, " " & StrTmp & " ", " ")
    Wend
    'Re-calculate the array size.
    'The difference in elements reflects how many words were replaced
    k = j - UBound(Split(StrIn, " "))
    'Output the word and its count, separated by a tab.
    'Each word & count goes on a new line
    StrOut = StrOut & StrTmp & vbTab & k & vbCr
    'Re-set the array size.
    j = UBound(Split(StrIn, " "))
  Next
  'Point to the end of the document
  Set Rng = .Range.Characters.Last
  With Rng
    'Insert a page break, followed by the output string
    .InsertAfter Chr(12) & StrOut
    'Point to the first output character. The end looks after itself
    .Start = .Start + 1
    'Convert the output string to a table
    .ConvertToTable Separator:=vbTab, Numcolumns:=2
    'Sort the table
    .Tables(1).Sort Excludeheader:=False, FieldNumber:=1, _
      SortFieldType:=wdSortFieldAlphanumeric, _
      SortOrder:=wdSortOrderAscending, CaseSensitive:=False
  End With
End With
Application.ScreenUpdating = True
End Sub

Cheers
Paul Edstein
[MS MVP - Word]
 
Are you open for new software? I used free RapidMiner Community Eedition ( with Text Processing extension installed.
The text file is saved in in text format.
Process structure:
1) ReadDocument module with "file" set to processed text file,
2) ProcessDocuments linked to ReadDocument ("doc" port),
3) Tokenize module inside ProcessDocuments (subprocess), in "non letters" mode,
4) WordListToData module linked to ProcessDocuments ("wor" input and output ports),
5) WriteExcel module linked to WordListToData module ("exa" port), excel file pointed eiter existing or name only.
This process executed generates alphabetic list of words with document reference and count of occurences.

I know that it's not worth investing time in lerning this software only for single word counting. I found this application excellent in general data mining and visualisation, it can be used for this purpose too.

combo
 
Thanks for the suggestion, combo. I'll keep it in mind for future needs.

Bruce
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top