Good day!
Background:
1. Using VB6 (VBA) within Microsoft Access
2. Utilizing 3 functions within the main SimilSearch() routine. These are:
a. Simil()
i. Simil is a "smart string comparison" algorithm. It takes two strings, compares them, then spits out a percentage of likeness.
b. SoundsLike()
ii. SoundsLike is a cousin of the Soundex algorithm. SoundsLike is a tad more strict on it's comparisons, thus making it more efficient.
c. AddSimilarEntry()
iii. Simple routine that adds the similar entry to a database table.
============================
Question:
SimilSearch() searches through customer names within my database for like names. If it finds any, it add a new entry into a seperate table.
The main routine SimilSearch() is nothing more than a progressive linear algorithm. It compares each item against the current key. This take a while.
How can I cut down on execution time? Is there a way I can make this "smart" and exclude comparing items that are obviously not even close?
============================
Code:
SimilSearch() *Main routine
AddSimilarEntry()
Simil()
SoundsLike()
============================
Background:
1. Using VB6 (VBA) within Microsoft Access
2. Utilizing 3 functions within the main SimilSearch() routine. These are:
a. Simil()
i. Simil is a "smart string comparison" algorithm. It takes two strings, compares them, then spits out a percentage of likeness.
b. SoundsLike()
ii. SoundsLike is a cousin of the Soundex algorithm. SoundsLike is a tad more strict on it's comparisons, thus making it more efficient.
c. AddSimilarEntry()
iii. Simple routine that adds the similar entry to a database table.
============================
Question:
SimilSearch() searches through customer names within my database for like names. If it finds any, it add a new entry into a seperate table.
The main routine SimilSearch() is nothing more than a progressive linear algorithm. It compares each item against the current key. This take a while.
How can I cut down on execution time? Is there a way I can make this "smart" and exclude comparing items that are obviously not even close?
============================
Code:
SimilSearch() *Main routine
Code:
Public Function SimilSearch()
Dim db As DAO.Database
Dim rsSearch, rsSearchClone, rsSimilarTable As DAO.Recordset
Dim searchString, comparisonString, searchSound, comparisonSound As String
Dim SimilThreshold, stringRelevance As Double
Dim NumberFound, i, j As Integer
Dim timeStart As Double
Dim timeStop As Double
On Error GoTo errorhandler
Set db = CurrentDb
'Recodset we wish to search
Set rsSearch = db.OpenRecordset("SELECT Customer.Name AS SearchField FROM Customer;", dbOpenSnapshot)
Set rsSearchClone = rsSearch.Clone
'Recordset we wish to place similar records into
Set rsSimilarTable = db.OpenRecordset("SELECT * FROM SimilarStrings;", dbOpenDynaset)
'Threshold for string comparison
'It seems that .83 seems to be the magic number. Any higher or lower will
'result in either too many or too few results, respectivley.
SimilThreshold = 0.83
With rsSearch
.MoveLast
.MoveFirst
'Set the progress bar control max
Me.progBar.max = .RecordCount - 1
Do Until .EOF
Me.progBar.value = Me.progBar.value + 1
'Set the string we with to compare all others to for this iteration
searchString = ![SearchField]
'This is the loop where we compare 'searchString' to all other
'entries within the database.
'NOTE - This is progressive algorithm. If 'searchString' is
'record number 15, then we start the comparison at
'record 16.
i = i + 1
With rsSearchClone
.MoveFirst
.Move i
Do Until .EOF
'stringRelevance holds the comparison percent relevancy
comparisonString = ![SearchField]
stringRelevance = Simil((searchString), (comparisonString))
If stringRelevance >= SimilThreshold Then
'For added strictness, we compare the two strings using
'the SoundsLike algorithm which is related to Soundex.
'Although the threshold was met, this dosen't mean that the
'two strings are completely relevant to one another.
searchSound = SoundsLike(searchString)
comparisonSound = SoundsLike(comparisonString)
If searchSound Like comparisonSound Then
If AddSimilarEntry(rsSimilarTable, searchString, comparisonString _
, stringRelevance, searchSound) = False Then
MsgBox "Error adding new similar entry to database."
End If
End If
End If
'Move to the next comparison item
.MoveNext
Loop
End With
'Move to the next key item
.MoveNext
Loop
End With
errorhandler:
Select Case Err.Number
Case 0 To 3021
Resume Next
Case Else
MsgBox "Error: " & Err.Number & vbCrLf & "Description: " & Err.Description
End Select
End Function
AddSimilarEntry()
Code:
Public Function AddSimilarEntry(rsTemp As DAO.Recordset, ByVal searchString As String, ByVal comparisonString As String _
, ByVal stringRelevance As String, ByVal searchSound As String) As Boolean
On Error GoTo errorhandler
' Adds a new record to a Recordset using the data passed
' by the calling procedure. The new record is then made
' the current record.
With rsTemp
.AddNew
![OriginalString] = searchString
![compareString] = comparisonString
![PercentRelevance] = stringRelevance
![SoundsLike] = searchSound
.Update
.Bookmark = .LastModified
End With
AddSimilarEntry = True
errorhandler:
If Err.Number > 0 Then
AddSimilarEntry = False
End If
End Function
Simil()
Code:
'**************************************
' Name: _A Smart String Comparison
' Description:This takes 2 strings and r
' eturns the percent alike that they are.
' (i.e. "test string number 1" is 86.48% s
' imilar to "teststring numb 2")
' This Function is very useful! You can use it In databases
' To match data that may have errors in it. Examples being people's names,
' company names, addresses, or anything Else where you may encounter misspellings or
' inconsistencies in the data. Your feedback and/or votes are greatly appreciated! --
' NEW - updated to use Byte arrays instead of strings, 50-300% performance improvement!
' An implementation of the , Ratcliff/Obershelp/Levenshtein method.
'
' Inputs:mainstring and checkstring, the
' 2 strings to compare
'
' Returns:how similar the 2 strings are
' (percent, as in .8)
'
' Assumes:This code recursively loops th
' rough the 2 strings, finding the largest
' common substring, then checking the rema
' inder of the string.
Private b1() As Byte
Private b2() As Byte
Public Function Simil(String1 As String, String2 As String) As Double
Dim l1 As Long
Dim l2 As Long
Dim l As Long
Dim r As Double
If UCase(String1) = UCase(String2) Then
r = 1
Else
l1 = Len(String1)
l2 = Len(String2)
If l1 = 0 Or l2 = 0 Then
r = 0
Else
ReDim b1(1 To l1): ReDim b2(1 To l2)
For l = 1 To l1
b1(l) = Asc(UCase(Mid(String1, l, 1)))
Next
For l = 1 To l2
b2(l) = Asc(UCase(Mid(String2, l, 1)))
Next
r = SubSim(1, l1, 1, l2) / (l1 + l2) * 2
End If
End If
Simil = r
Erase b1
Erase b2
End Function
Private Function SubSim(st1 As Long, end1 As Long, st2 As Long, end2 As Long) As Long
Dim c1 As Long
Dim c2 As Long
Dim ns1 As Long
Dim ns2 As Long
Dim i As Long
Dim max As Long
If st1 > end1 Or st2 > end2 Or st1 <= 0 Or st2 <= 0 Then Exit Function
For c1 = st1 To end1
For c2 = st2 To end2
i = 0
Do Until b1(c1 + i) <> b2(c2 + i)
i = i + 1
If i > max Then
ns1 = c1
ns2 = c2
max = i
End If
If c1 + i > end1 Or c2 + i > end2 Then Exit Do
Loop
Next
Next
max = max + SubSim(ns1 + max, end1, ns2 + max, end2)
max = max + SubSim(st1, ns1 - 1, st2, ns2 - 1)
SubSim = max
End Function
SoundsLike()
Code:
Public Function SoundsLike(ByVal pWord As String, Optional pAccuracy As Byte) As String
' char importance "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
Const SoundCodes = "01230120022455012623010202"
Dim lPunctuation As String
Dim lCurChar As String
Dim lWordLen As Integer
Dim lCurValue As Integer
Dim lSoundex As String
Dim x As Integer
'characters not taken into account for Soundex
lPunctuation = ".,/`';][-=<>?:~}{+_)(*&^%$#@!\|1234567890" & Chr(34) & Chr(32)
'default accuracy is 4
If pAccuracy = 0 Then pAccuracy = 4
If pWord <> "" Then
'remove the punctuation and numeric characters from the string
pWord = RemoveChars(pWord, lPunctuation)
'convert to uppercase
pWord = UCase(pWord)
lWordLen = Len(pWord)
'in accordance with soundex rules - start at 2nd
'character
lSoundex = Left(pWord, 1)
x = 2
'perform conversion until desired level of accuracy or
'length of word reached
While Len(lSoundex) < pAccuracy And x <= lWordLen
lCurChar = Mid(pWord, x, 1)
'ignore double characters
If lCurChar <> Mid(pWord, x + 1, 1) Then
'get the Soundex value
lCurValue = Val(Mid(SoundCodes, (Asc(lCurChar) - 64), 1))
'ignore zero Soundex values
If lCurValue > 0 Then lSoundex = lSoundex & lCurValue
End If
x = x + 1
Wend
'if the length is less than the desired accuracy, pad
'with zeroes
lSoundex = lSoundex & String(pAccuracy - Len(lSoundex), "0")
SoundsLike = lSoundex
End If
End Function
Public Function RemoveChars(pMessage As String, pRemovable As String) As String
Dim lMessage As String
Dim lCurChar As String
Dim x As Integer
'cycle through message string
For x = 1 To Len(pMessage)
'get the current character
lCurChar = Mid(pMessage, x, 1)
'if the current character is not in the removable list,
'add it to the local message
If InStr(pRemovable, lCurChar) = 0 Then lMessage = lMessage & lCurChar
Next x
RemoveChars = lMessage
End Function