Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Wanet Telecoms Ltd on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Spell Checking Question

Status
Not open for further replies.

docmeizie

Programmer
Aug 5, 2003
326
US
Okay when spell check runs, it has to reference a database of all the words that are words. I really want to know what it is referencing so perhaps I can reference in a project I am working on myself. I know this can be done cause you can reference the spell checker on a form by going through Word. But I am trying not to have to go through Word. I am creating a game that compares the all the words a user comes up with to all viable words for a given set of letters mixed up.

If I take a peek in your Windows, to fix a problem, does that make me a "Peeping Tom"? Hmmmmmmmmmmv [pc1][shocked]
 
You should be aware that the data word uses is almost certainly copyrighted...
 
Ok, this is fine, but you do realise there is a flaw to this. If the user has added names to their dictionary, in word, or even random letters in any order, then the spell check db will contain these as well.

If I was you, I would find a copy of the words you wish to allow (they are out there), and then distribute this with your app. This way, no one can cheat with your game!

Good Luck

BB
 
Copyrighted, how can that be they are used everyday by people. How can you copyright a language? At any rate it is not for monetary gain, it is just something that I wish to practice upon so that I may satisfy my curiosity. I do realize the variance from the original word database and am willing to accept that since it will only be on my system, and it won't really matter to me with added words cause I have added none.

If I take a peek in your Windows, to fix a problem, does that make me a "Peeping Tom"? Hmmmmmmmmmmv [pc1][shocked]
 
No, no, no

The issue is that the particular word list MS use is effectively copyright (not the words but the format). They licence it themselves, and they can't sell on that licence. Sorry, but that's the way it is
 
Interesting reference! Unfortunately it contains a mixture of English, American English and Australian spelling variations.

I'm still looking

________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first

'If we're supposed to work in Hex, why have we only got A fingers?'
 
I am finding that checking a word against a list of thousands of words is taking forever, any ideas how everyone makes theirs practically instantanious?
 
or Binary Search Algorithm

________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first

'If we're supposed to work in Hex, why have we only got A fingers?'
 
As far as I know, one cannot copyright something based on "sweat of the brow" which basically excludes compilations. For example, it is perfectly legal to scan your competitor's white pages phone book and print and sell your own version.

I myself managed to find on the internet a pretty good text list of words for my own Boggle Cube word finder program I wrote. I am also going to use it for the Anagram Explorer program I am about to start. Would you like me to email it to you?

If you're truly interested in getting the "complete" list (but only up to 8-letter-long words) you should try to find a Scrabble or Boggle word list.

As for making the search go faster, using indexes and binary searches are both valid strategies. Do you know how to get started on this?

Let's say your word list is an array of strings, with 50,000 words in it. If you check to see whether a word starting with Z exists by stepping through the list from the start, it's going to take a while to find it.

You could build a quick index by scanning the list once and marking where each next letter starts in an array of Longs like lngMyIndex(1 to 26). When you want to look up "leapfrog" you would look to see where L starts and ends, at M, by inspecting lngMyIndex(12) and (13) which could for example be 23,000 and 25,000, cutting your search down to 2,000 words instead of 50,000.

L is the 12th letter of the alphabet and you can derive it by doing Asc("l") - Asc("a") + 1, which collapses to Asc("l") - 96. Note this only works for lower case, so either do Asc(LCase("L")) - 96 or do Asc(UCase("l")) - 64.

You will of course be replacing my "L" or "l" with Left(strWordToSearchFor, 1). Asc only returns the ascii value of the first character, but I think it is good programming practice to only pass it the string you want because you can get weird results with Unicode and it's good to start good habits now.

To do a binary search, don't check each word for equality in a row, rather, start in the middle of the range. In the example above you might start at 24,000... you don't get a match, but you can also test for which is greater. Let's say "leapfrog" is less than word 24,000 in your wordlist. Now you have a new range that you know leapfrog will be in, if it is in your list: 23,000-24,000. You can then check at 23,500. I will leave the details of implementing this up to you.

Since I wanted speed, I actually created 3 indexes, a one-, two-, and three-letter index of letters starting words. Note that you need to handle cases where an index does not point into the dictionary (for example, there are no words in any English dictionary which start with "qzj").

-E

-----------------

I did make my own dictionary class for my unfinished Boggle program and just got started on rewriting it today. (It solves cubes you enter manually, but doesn't look pretty and doesn't know how to "roll" the dice to actually play the game.) My class will be able to load word lists from at least a text file to begin with, and will supply word lookups, anagrams, and other functions (and yes I will do my best to make it work fast with indexes).

If and when I finish it, I will probably be making it into a DLL. would you like a copy? I'd be happy to supply it. I will be releasing both my Boggle (tm) word finder and my Anagram Explorer for free, just to get my name out there as a programmer and also for the experience.
 
I still believe a true BSA would be faster.

Pseudocode only:

'Assumes wordlist is in aryW(65535)
lngMax = Ubound(aryW)
lngMin = Lbound(aryW)
blnSuccess = False
For lngCount = 1 to 17
lngPtr = lngMin + lngMax \ 2
Select Case aryW(lngPtr)
Case > strSearch
lngMax = lngPtr
Case < strSearch
lngMin = lngPtr
Case = strSearch
blnSuccess = True
Exit For
End Select
Next

________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first

'If we're supposed to work in Hex, why have we only got A fingers?'
 
Let's duke it out, johnwm! When I create my Dictionary class I will try both methods and tell you what results I get on speed: your suggestion to do binary search alone, and mine to combine an index with binary search.

Admittedly, as you put in your sample code, for 50k words the maximum number of mismatches is 17. But if you have an index and can cut your search down to say, 2000 words, the maximum number if mismatches is 11. I'll bet that

dereferencing two longs from an array, then a binary search through 11 maximum steps

is faster than

a binary search through 17 maximum steps.

(As long as I get to keep a copy of the index in memory and not have its build-time at program startup counted.)

This is all in good fun, of course!
 
Esquared,
I'll see what I can do. Suggest we start from a common point and base the search on the list referred above:


It may not be perfect, but it's a common start point.



________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first

'If we're supposed to work in Hex, why have we only got A fingers?'
 
>one cannot copyright something based on &quot;sweat of the brow&quot; which basically excludes compilations

Untrue, as far as I am aware. Traditionally, the &quot;sweat of the brow&quot; doctrine has been exactly what has provided provided copyright protection for databases and compilation - although a Supreme Court ruling in 1996 qualified this to be only applicable if said database or compilation was arranged and selected in an original manner. Consequently:

>it is perfectly legal to scan your competitor's white pages phone book and print and sell your own version

But only because the fairly simple selection and organization of the data are not considered sufficiently original to merit copyright protection. Don't make the mistake of assuming that this means other lists/compilations/databases are also necessarily exempt of copyright protection
 
hmmmmmmmmmmmmm ... mmmmmmmm,


Interesting, and possily a good starting point for a crossword puzzle dictionary?

johnwm, do you know the layout of the dictionalry? I glanced at it briefly, and it appears to have some &quot;organization&quot;. The Top many lines are seperated (dellimited) diffferently for ones which follow, so their meaning / intent is not readily obvious (at least in the brief glance).




MichaelRed
m.red@att.net

Searching for employment in all the wrong places
 
They are not delimited, rather, the characters you see as delimiters are part of the word or indicative that the word is a suffix. I suggest we toss the words with nonalpha characters.
 
Esquared,
I had something like this in mind:
[tt]
'---------------------------------------------------------------------------------------
' Procedure : FindWord
' DateTime : 12/01/2004 17:07
' Author : johnwm
' Purpose : Finds if word is in array
' Takes : strSearch is word to be found
' Returns : True if word is in (public) aryW
' Assumes : A Public Array aryW() contains all the words to be searched for
'---------------------------------------------------------------------------------------
'
Private Function FindWord(strSearch As String) As Boolean
Dim lngMax As Long
Dim lngMin As Long
Dim lngPtr As Long
Dim lngCount As Long
Dim blnSuccess As Boolean
lngMax = UBound(aryW)
lngMin = LBound(aryW)
blnSuccess = False
For lngCount = 1 To 20
lngPtr = (lngMin + lngMax) \ 2
Select Case aryW(lngPtr)
Case Is > strSearch
lngMax = lngPtr + 1
Case Is < strSearch
lngMin = lngPtr - 1
Case Else
blnSuccess = True
Exit For
End Select

Next
FindWord = blnSuccess
End Function
[/tt]

To test, I've run it thus:

[tt]
Private Sub Form_Load()
Dim strFile As Long
Dim strWords As String
Dim temp As Long
Dim lngLength As Long
temp = timeGetTime
strFile = FreeFile
Open &quot;c:\dict.txt&quot; For Input As #strFile
lngLength = LOF(strFile)
strWords = Input(lngLength, #strFile)
aryW = Split(strWords)
MsgBox &quot;Loading + Split Time = &quot; & (timeGetTime - temp) & &quot;ms&quot;
End Sub

Private Sub Command1_Click()
Dim blnYes As Boolean
Dim lngRepeat As Long
Dim temp As Long
temp = timeGetTime
Text2.Text = &quot;False&quot;
For lngRepeat = 1 To 10000
Text2.Text = FindWord(Text1.Text)
Next lngRepeat
MsgBox &quot;Time = &quot; & (timeGetTime - temp) & &quot;ms for 10000 repeats&quot;
End Sub
[/tt]

________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first

'If we're supposed to work in Hex, why have we only got A fingers?'
 
Of course, you could save yourselves the hassle of writing indexing/hashing and search/sort algorithms by using the scripting library's dictionary object, which does it all for you...
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top