word checking tool

Status
Not open for further replies.

Stampertje

Technical User
Oct 28, 2005
13
NL
Hi there ...

Since I got such great help here last time I thought .. let's try again :)

I am looking for the following.

I would like a VBA script that checks a wordlist using an already available wordlist in txt format.

So ...

Open a Word document containing a list of words --> check whether these words exist in a txt wordlist --> put all words that don't exist into a new Word document as a list.

I hope this is understandable :)
 
Hi Ralf,

So what you want is to run the deletion from your earlier thread and get a deduplicated list of what remains?

This code will produce such a list from a document and put it in a new document.
Code:
Sub GetWords()

' Collect each distinct word once; Collection keys reject duplicates
Dim WordsCollection As New Collection
Dim itemWord As Variant
Dim rngWord As Range
Dim strWord As String

For Each rngWord In ActiveDocument.Content.Words
    strWord = Trim(rngWord.Text)
    Select Case strWord
        Case "", ".", ",", ";", vbCr
            ' Skip blanks, punctuation and paragraph marks
        Case Else
            ' Adding a duplicate key raises an error, which is ignored
            On Error Resume Next
            WordsCollection.Add strWord, strWord
            On Error GoTo 0
    End Select
Next rngWord

' Write the distinct words, one per paragraph, to a new document
Documents.Add

With ActiveDocument
    For Each itemWord In WordsCollection
        .Range.InsertAfter itemWord
        .Range.InsertParagraphAfter
    Next
End With

Set WordsCollection = Nothing

End Sub



Enjoy,
Tony

--------------------------------------------------------------------------------------------
We want to help you; help us to do it by reading this: Before you ask a question.
Excel VBA Training and more Help at VBAExpress
 
Almost :)

I will explain what I am using this for, maybe that gets you the entire idea...

I am trying to filter out, and then delete, words that are incorrectly spelled. I am doing it this way because the Word spell checker limits the number of words you can add to a personal spelling list.

So what I want to do is the following:

I have created a list of words extracted from a letter, based on frequency.
I would then like to check this word list against a wordlist I already have of words I know are correct.
All words that are incorrectly spelled according to my own wordlist need to be placed in a new list of words. I then use that list to delete these words from the letter.

Now you have already created one part I needed: the script to remove the words I don't want.

I also have a script to create a complete wordlist including the frequency of these words.

What I now need is a script that creates a list of all words that don't exist in my own wordlist. (this is a txt file)

I hope you understand what I mean :)

 

I need to think about this for a while to get my head round exactly what you're doing, however, ...

As I understand it, there is some question about whether there really is a limit on the size of custom dictionaries and there are some pretty big ones out there. What exactly makes you think your file is too big?

Enjoy,
Tony

 
Yes, I am not sure trying to code your own custom dictionaries is all that great an idea. This seems to be what you are doing, and I am not sure it is needed.

Gerry
 
The custom.dic is limited to about 5000 words or so (maybe more, maybe less). I have wordlists of over 4 million words.

 

I'm sure custom dictionaries can accommodate more than 5000 words but 4 million words! Whatever are you doing?

Nothing that you can write in VBA in Word is going to give any kind of acceptable performance and you're going to start to be pushing the size limit for a Word document. With that many words (more than most languages have) you are going to have to build some intelligence into anything you write.

Enjoy,
Tony

 
Well it's a process I run for a few days .. no rush involved.

But thanks for all the help given :)
I can understand that this may be asking too much of you all ... I will see if I can work it out myself and if I do I will post the code I used :)
 

I am intrigued as to what you are working with that needs to use that many non-dictionary words. That aside, if I understand, you have a (relatively short) list of words in a Word document which you want to check against a long list in a text file so that you can extract all the words from the first list which are not in the second.

A couple of questions - how long is the short list in the Word document? Is either list in any order? And how is the list in the Word document organised? Or have I misunderstood?

Enjoy,
Tony

 
Dear Tony, fumei and all,

Just a small explanation of the difficulties of spellchecking the German language, my mother tongue, which is, a.f.a.i.k., similar to Ralf's Dutch.
Germans, especially journalists and lawyers, like to glue nouns together.
An example in colloquial language:
"The police searched the contents of the bag" could be
1) "Die Polizei untersuchte den Inhalt der Tasche" or
2) "Die Polizei untersuchte den Tascheninhalt"
In case 1) Word 97 would know all the words, in case 2) Word 97 wouldn't know Tascheninhalt.
Another example from chemistry:
"Acetyl salicylic acid" would be "Acetylsalicylsäure". Again Word 97 wouldn't know the compound noun, even if it knows its parts.
In Word 2000 and above, MS changed the spellchecking. In German it now uses a method that I believe is similar to hyphenation: the spellchecker tries to look at syllables and not at whole words. The problem: if it knows the parts, it does not check whether the compound itself makes sense. Example:
I translated an English text and mistyped:
"Ich muß übelregen" (= "I must bad move") instead of
"Ich muß überlegen" (= "I must think about it").
Word 2000 knows "übel" (bad) and "regen" (move) and does not mark the senseless compound as incorrect.
That is why I advise everybody who works with large documents which have to be 100 % correct to stay with Word 97 until MS fixes the spellchecking. I have worked as a secretary for a large law firm and found the spellchecking of Word 2000 annoying.
That is why I suppose Ralf is trying to work around Word XP's spellchecking.
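
To make the compounding problem concrete, here is a minimal sketch of a checker that accepts any word that splits into two known parts. It only illustrates the behaviour described above and is not Word's actual algorithm; the function name `AcceptsAsCompound` and the tiny word list are made up for the example, and it assumes the Windows `Scripting.Dictionary` object is available.

```vba
' Hypothetical illustration only: NOT how Word's spell checker is implemented.
' Accepts a candidate if it is a known word or splits into two known parts.
Function AcceptsAsCompound(ByVal candidate As String, ByVal knownParts As Object) As Boolean
    Dim i As Long

    ' A word that is itself in the list is accepted
    If knownParts.Exists(candidate) Then
        AcceptsAsCompound = True
        Exit Function
    End If

    ' Otherwise try every split point and accept if both halves are known
    For i = 1 To Len(candidate) - 1
        If knownParts.Exists(Left$(candidate, i)) And _
           knownParts.Exists(Mid$(candidate, i + 1)) Then
            AcceptsAsCompound = True
            Exit Function
        End If
    Next i
End Function

Sub DemoCompoundCheck()
    Dim parts As Object
    Dim u As String

    Set parts = CreateObject("Scripting.Dictionary")
    parts.CompareMode = vbTextCompare   ' ignore case; must be set before adding keys
    u = ChrW$(252)                      ' u-umlaut, built with ChrW$ to avoid code page issues

    parts.Add u & "bel", True           ' "übel"
    parts.Add "regen", True
    parts.Add u & "berlegen", True      ' "überlegen"

    Debug.Print AcceptsAsCompound(u & "berlegen", parts)  ' a real word: accepted
    Debug.Print AcceptsAsCompound(u & "belregen", parts)  ' nonsense, but accepted as "übel" + "regen"
    Debug.Print AcceptsAsCompound("xyzregen", parts)      ' rejected: "xyz" is not a known part
End Sub
```

A checker built this way cannot tell a sensible compound from a senseless one, which is the weakness the "übelregen" example shows.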
 
Hi Markus,

The spell checker does not check context in any language - you can get complete nonsense past it in any language provided each word is a correctly spelt word according to its dictionary.

I do appreciate the problems with compound words in German and, I think to a lesser extent, in Dutch but the scale of what Ralf is suggesting horrifies me. The entire Dutch language fits in a pocket size book ("het groene boekje") and 4 million extra words is a lot of combinations even if one includes things like chemical names.

What is being asked for isn't logically difficult but I really can't help thinking there must be another approach. I will wait to see what Ralf posts.

Enjoy,
Tony

 
Ok.. I just got home from work :)

How do I have a vocabulary of 4 million words, you all ask ... It is because I need not only the correct Dutch spelling, including all the different forms in which we can write one word (past tense, present tense, and others I can't easily translate), but also all the medical terms (Latin, English, French etc.) and all the names of medicines ... so together that gives me a huge vocabulary.

Now I don't always need them, so I have lists of words for a specific specialism that I would like to run against normal letters.

Let's give you a simple example.

I have a list with the words:
paracetamol
midalgan
i
often
have
headache
a
and
use
often


Now I make a sentence:

I often have a headace and use paracetamol and midalgan.

My checker would first make a list of words from the entire sentence and then check that list with the words in my correct word list.

This would mean that the word headace would be put in a separate word list, which I call the "unknown" words.

Often unknown words aren't wrong; they could be new terms, Latin terms or others that I don't know of yet.

I then check these words to be either valid or false and when false I add them to a list of false words. Then I use that list of false words to remove these words from the letter.


I already have all the parts.
The only part I still need is a way to check whether my original letter contains words that are not in my correct word list.
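
The check described in the example above can be sketched roughly as follows. This is a simplified illustration, not a finished tool: the names `UnknownWords` and `StripTrailingPunctuation` are invented here, the known words are held in a `Scripting.Dictionary` (assumed to be available, as it normally is on Windows), and the sentence is assumed to be split on single spaces.

```vba
' Removes sentence punctuation from the end of a token
Function StripTrailingPunctuation(ByVal token As String) As String
    Do While Len(token) > 0
        If InStr(".,;:!?", Right$(token, 1)) = 0 Then Exit Do
        token = Left$(token, Len(token) - 1)
    Loop
    StripTrailingPunctuation = token
End Function

' Returns the words of the sentence that are not keys of knownWords
Function UnknownWords(ByVal sentence As String, ByVal knownWords As Object) As Collection
    Dim result As New Collection
    Dim tokens() As String
    Dim stripped As String
    Dim i As Long

    tokens = Split(Trim$(sentence), " ")
    For i = LBound(tokens) To UBound(tokens)
        ' Try the token as written first, so entries that end in a full stop still match
        If Len(tokens(i)) > 0 And Not knownWords.Exists(tokens(i)) Then
            stripped = StripTrailingPunctuation(tokens(i))
            If Len(stripped) > 0 And Not knownWords.Exists(stripped) Then
                result.Add stripped
            End If
        End If
    Next i

    Set UnknownWords = result
End Function

Sub DemoUnknownWords()
    Dim known As Object
    Dim w As Variant

    Set known = CreateObject("Scripting.Dictionary")
    known.CompareMode = vbTextCompare   ' so "I" matches "i"
    For Each w In Array("paracetamol", "midalgan", "i", "often", "have", _
                        "headache", "a", "and", "use")
        known(CStr(w)) = True
    Next w

    For Each w In UnknownWords("I often have a headace and use paracetamol and midalgan.", known)
        Debug.Print w   ' prints: headace
    Next w
End Sub
```

Holding the known words in memory like this suits a specialism-sized list; whether it is practical for the full 4 million words is untested here.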

Does this clear it up a bit?
 
Hi Ralf,

Regardless of context, as far as I can understand all you want to do is to compare two lists and throw out (to a new list) items in one which aren't in the other.

Can you tell me:

(a) The format of both lists - word document, text file, other
(b) How the words in the lists are separated
(c) Whether either of the lists is in any order
(d) The size of the shorter list (is it small enough to be held in, say, an array or must it be processed in its file)

Enjoy,
Tony

 
1 file is in txt (my own list)
1 file is in word (the word list I extracted)

The words are in a list so they are from top to bottom, 1 word per line.
Words can however contain . - ' (like n.v.t.)
They are in alphabetical order
The shorter list is around 4 MB

Hope this helps .. and thanks in advance :)
 
Hi Ralf,

So it's a straightforward two file compare. I haven't properly tested this but I think it should work
Code:
Sub CompareLists()

Dim ListToCheck As Document
Dim ListUnMatched As Document

Dim Para As Paragraph
Dim WordToCheck As String
Dim MasterListWord As String

Set ListToCheck = ActiveDocument
Set ListUnMatched = Documents.Add

Open "C:\Documents and Settings\Tony\Desktop\WordList.txt" For Input As #1

' Word document contains multiple paragraphs. In this case, ..
'  .. processing with For Each is more efficient than ..
'  .. using paragraph numbers, so drive with it

For Each Para In ListToCheck.Paragraphs
    
    ' Get (non-blank) word to check, dropping the paragraph mark
    WordToCheck = Left$(Para.Range.Text, Len(Para.Range.Text) - 1)
    If WordToCheck <> "" Then
        
        ' Skip all words in the master list that sort before the test word
        ' (Line Input reads a whole line, so embedded commas are kept)
        Do While (WordToCheck > MasterListWord) And Not EOF(1)
            Line Input #1, MasterListWord
        Loop

        ' If the word is not in the master list, write it to the new list.
        ' Comparing for inequality also handles a match on the last line of the file.
        If WordToCheck <> MasterListWord Then
            ListUnMatched.Content.InsertAfter WordToCheck
            ListUnMatched.Content.InsertParagraphAfter
        End If
    
    End If
    
Next Para

Close #1

End Sub
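
The loop above only works if both lists are sorted in the same order that the comparison operators use (binary by default in VBA, unless the module says `Option Compare Text`). As a hedged sketch of the same merge-style comparison, here it is on two in-memory arrays with the comparison mode made explicit through `StrComp`; the function name `UnmatchedSorted` and the sample data are made up for illustration.

```vba
' Returns the entries of toCheck that do not occur in master.
' Both arrays must already be sorted in the order implied by "mode".
Function UnmatchedSorted(toCheck As Variant, master As Variant, _
        Optional ByVal mode As VbCompareMethod = vbBinaryCompare) As Collection
    Dim result As New Collection
    Dim i As Long
    Dim j As Long

    j = LBound(master)
    For i = LBound(toCheck) To UBound(toCheck)
        ' Advance past master entries that sort before the current word
        Do While j <= UBound(master)
            If StrComp(master(j), toCheck(i), mode) >= 0 Then Exit Do
            j = j + 1
        Loop

        ' Unmatched if the master list is exhausted or its current entry differs
        If j > UBound(master) Then
            result.Add toCheck(i)
        ElseIf StrComp(master(j), toCheck(i), mode) <> 0 Then
            result.Add toCheck(i)
        End If
    Next i

    Set UnmatchedSorted = result
End Function

Sub DemoUnmatchedSorted()
    Dim w As Variant
    For Each w In UnmatchedSorted( _
            Array("a", "headace", "use"), _
            Array("a", "and", "have", "headache", "i", "use"))
        Debug.Print w   ' prints: headace
    Next w
End Sub
```

Because each list is walked only once, the cost grows with the combined length of the two lists, which matters when the master list is large.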

Enjoy,
Tony

 
I will test it this evening...

Thanks in advance! :p
 
Dear Tony,

I am not arguing with you, I just want to know. You say:
"The spell checker does not check context in any language - you can get complete nonsense past it in any language provided each word is a correctly spelt word according to its dictionary."
But a dictionary consists of words. And "übelregen" (from my example) is not a correct word - even though its parts are correct. So I think my criticism of Word's spellchecker is appropriate.

Anyway - what I'd like to ask you:
In your code you say "Dim Para As Paragraph". In other code I have read "Dim Para As Word.Paragraph". Which should I use?

Thank you

Markus
 
Hi Markus,

I've just gone back and reread your post and I had, with my schoolboy German, misunderstood. I can see, now, what you are saying about how Word, seemingly, accepts any old compounding.

When you are coding in Word there is not normally any reason to qualify every object type with "Word" although there might be occasions when you want to do it - if you are, perhaps, using automation and have two objects with the same name in different libraries; even though Word might get correctly defaulted it might be better to be explicit. Using "Word.Paragraph" is never wrong, and some people do it routinely - it's really just a matter of choice.

Enjoy,
Tony

 