Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

General Logic for fuzzy matching

Status
Not open for further replies.

FractalWalk

Technical User
Nov 18, 2002
141
US
In Excel, I am trying to look at a column of text data and find strings that are similar to one another. The idea is to allow a user to choose to overwrite one piece of data with the other to make them match exactly. For example "123 Main St." and "123 main street" need to be standardized to the same thing.

So what I am looking for suggestions in logic. So far I'm comparing string length and setting a maximum length difference. I'm also requiring that the first 2 characters match (not case sensitive). I'm thinking about calculating the number of charcaters within both strings that match and setting a minimum percentage threshold.

Any other ideas? This does not have to be an exact science in that its not critical that every possible match is found. Rather this is just something that prompts a user to make a change when a fuzzy match is found. I'm looking to try and knock out 80% of the work in an automated manner.
 
I've never done anything like this, however I'm interested in what you wind up with, so please post back with what works.

If I were doing this, I'd also have a lookup of common substitutions (i.e. Road = Rd., Street = St., etc.)

[blue]"Well, once again my friend, we find that science is a two headed beast. One head is nice, it gives us aspirin and other modern conveniences,...but the other head of science is BAD! Oh, beware the other head of science, Arthur; it bites!!" - The Tick[/blue]
 
I did a quick and dirty "fuzzy logic" address matching function a while back. Don't have the code anymore, but basically it accepted two strings, then parsed each of them into segments (using space character for delimiter). It assumed that it only needed to check segment 1 of string A against segment 1 of string B, and so forth. It looked for an exact match on any all-number strings, case-insensitive match of the first three letters of character strings, and first letter match of any segment where the A version or B version ended in a period. Then it ranked the string match based on how many segment matches it found.

Certainly wasn't perfect, but it did get me 80-90% there.

VBAjedi [swords]
 
I went with my posted matching criteria: initial charcaters, string length, percent of total characters. It worked well for the data I was using it on. I would guess I cleaned over 90% of the data with it. I used it on name lists for people as well as products. Eventually, I'll try it on addresses, which means that I'll probably have to tweak it a bit.

The code (VBA in Excel) sorts the data and compares each cell with the following row. It tracks if the match is exact or fuzzy and continues until there is no match (either). Then it populates all the selected items that match and populates them into a list box on a user form. The user can then select which items need to be standardized and move them to a second list box. Then the user can select one of those items to be the correct standardized spelling or choose his own spelling for all selected items.


Sub Standardize_Names()
Cells.Sort Key1:=Range("C1"), Order1:=xlAscending, Header:=xlNo
FirstXChar = 2 'The number of starting characters that must match
StrngLen = 3 'The maximum number of characters, by which the strings can differ
MatchPerCent = 0.8 'The percentage of matching charcaters relaltive to total characters
Load Standardize 'User Form To prompt user changes


'Loop line by line while the defined matching conditions are true or exact match.
'Exit when any match is false

x = 1
Do While Cells(x + 1, 3) <> Empty
strt = x
mtch = 0

'store first two characters of target string
'store length of target string
'track if strings are exactly the same

a = Left(UCase(Cells(strt, 3)), FirstXChar)
l = Len(Cells(strt, 3))
Do
Cells(3, 100).Formula = "=EXACT(C" & strt & ",C" & x & ")"
If Cells(3, 100) Then
mtch = mtch + 1
x = x + 1
Else

'check for fuzzy match to comparison string by string length and first char's

b = Left(UCase(Cells(x + 1, 3)), FirstXChar)
m = Len(Cells(x + 1, 3))
If a <> b Or Abs(l - m) > StrngLen Then
Exit Do
Else

'check for fuzzy match by total macthing char's (case insensitive)
'Loop through the entrie ASCII character set
'Find instance of selected ASCII character in target and comparison strings
'Track the number of matching characters and repeat to find multiples

counter = 0
For Y = 0 To 255
i = 0
j = 0
found:
i = InStr(i + 1, UCase(Cells(strt, 3)), Chr(Y))
j = InStr(j + 1, UCase(Cells(x + 1, 3)), Chr(Y))
If i > 0 And j > 0 Then counter = counter + 2: GoTo found
Next Y

'If total matched characters are > than target percent then
'increment x to retrieve next string in list else exit the loop

If counter / (l + m) < MatchPerCent Then Exit Do Else x = x + 1
End If
End If
Loop

'If all 3 match criteria were met at least once, x will have incremented above strt.
'If all potential fuzzy matches are actually exact matches then mtch = x - strt
'Set target cells to values to pass to UserForm named Standardize

If x > strt And mtch < x - strt Then
Cells(1, 100) = strt
Cells(2, 100) = x
Standardize.Show
End If
If counter <> 0 And mtch = 0 Then x = x + 1
Loop
End Sub
 
Algorithms related to partial or approximate text matching which one should certainly be able to find documented online include: soundex (and its host of variants), metaphone (don't forget double metaphone), NYSISS and Levenshtein Distance.

-Predictor
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top