Better SoundEx implementation? (Very interesting!) 13

RebLazer · Aug 21, 2003

I need to programmatically check if two names are really the same using SoundEx. The code below works - but I'm sure there's a better implementation out there. If you know of one, please let me in on it!

Thanks so much,
Lazer

The code in green calls the stored procedure in red:[tt]

declare @hold1 varchar(4)
declare @hold2 varchar(4)
exec soundexalpha 'robert', @hold1 out
exec soundexalpha 'raburt', @hold2 out
if (@hold1=@hold2) select 'Y'
else select 'N'

ALTER proc dbo.SoundexAlpha
@instring varchar(50),
@soundex varchar(50)=NULL output

/* Extended Soundex
Copied from 'The Guru's Guide to Transact-SQL'
by Ken Henderson pp42-45

Original Soundex algorithm patented by Russel & O'Dell in 1918

Translation to Transact-SQL by Ken Henderson 5/16/98

Example:
declare @soundex varchar(4)
exec SoundexAlpha 'Rodgers', @soundex OUT
@soundex will have the extended soundex of 'rodgers'

*/
as

declare @workstr varchar(10)

set @instring=upper(@instring)
--use @soundex as a work buffer (omitting 1st char)
set @soundex=right(@instring,len(@instring)-1)

--replace vowels with 'A'
set @workstr='EIOUY'
while (@workstr<>'') begin
set @soundex=replace(@soundex,left(@workstr,1),'A')
set @workstr=right(@workstr,len(@workstr)-1)
end

-- reaffix 1st char
set @soundex = left(@instring,1) + @soundex

--translate common prefixes
if (left(@soundex,3)='MAC') set @soundex='MCC'+right(@soundex,len(@soundex)-3)
if (left(@soundex,2)='KN') set @soundex='NN'+right(@soundex,len(@soundex)-2)
if (left(@soundex,1)='K') set @soundex='C'+right(@soundex,len(@soundex)-1)
if (left(@soundex,2)='PF') set @soundex='FF'+right(@soundex,len(@soundex)-2)
if (left(@soundex,3)='SCH') set @soundex='SSS'+right(@soundex,len(@soundex)-3)
if (left(@soundex,2)='PH') set @soundex='FF'+right(@soundex,len(@soundex)-2)

--remove 1st char again
set @instring=@soundex
set @soundex=right(@soundex,len(@soundex)-1)

--translate other infixes
set @soundex=Replace(@soundex,'DG','GG')
set @soundex=Replace(@soundex,'CAAN','TAAN')
set @soundex=Replace(@soundex,'D','T')
set @soundex=Replace(@soundex,'NST','NSS')
set @soundex=Replace(@soundex,'AV','AF')
set @soundex=Replace(@soundex,'Q','G')
set @soundex=Replace(@soundex,'Z','S')
set @soundex=Replace(@soundex,'M','N')
set @soundex=Replace(@soundex,'KN','NN')
set @soundex=Replace(@soundex,'K','C')

set @soundex=Replace(@soundex,'AHA','~~~')
set @soundex=Replace(@soundex,'H','A')
set @soundex=Replace(@soundex,'~~~','AHA')

set @soundex=Replace(@soundex,'AW','A')
set @soundex=Replace(@soundex,'PH','FF')
set @soundex=Replace(@soundex,'SCH','SSS')

--truncate ending A or S
if (right(@soundex,1)='A' or right(@soundex,1)='S')
set @soundex=left(@soundex,len(@soundex)-1)

--translate ending NT to TT
if(right(@soundex,2)='NT') set @soundex=
left(@soundex,len(@soundex)-2)+'TT'

--remove all A
set @soundex=replace(@soundex,'A','')

--reaffix 1st char
set @soundex=left(@instring,1)+@soundex

--remove repeating chars
declare @c int
declare @d varchar(2)

set @c=65
while (@c<91) begin
set @d=char(@c)+char(@c)
while(charindex(@d,@soundex)<>0)
set @soundex=replace(@soundex,@d,char(@c))
set @c=@c+1
end

--return 4 bytes (with space pad if necessary)
set @soundex=left(@soundex,4)
if (len(@soundex)<4) set @soundex=@soundex+space(4-len(@soundex))

return 0

GO

zhavic · Aug 29, 2003

Thaks Lazer,

Glad I could help !!.

I have one little service pack on that function

In last step (24), there is
/* (24) if first char of original surname was a vowel, append it to the code */
IF CHARINDEX( @cFirst_char, @cVowels ) > 0
SELECT @cKey = @cKey + @cFirst_char

In the algorithm is 'append it to the code', I appended it at the end, but after some tests I have found out, that it must be appended at the start of code.
So this is it:

[thumbsup2]

/* (24) if first char of original surname was a vowel, append it to the start of code */
IF CHARINDEX( @cFirst_char, @cVowels ) > 0
SELECT @cKey = @cFirst_char + @cKey

[cheers]

Zhavic

---------------------------------------------------------------
In the 1960s you needed the power of two Comodore64s to get a rocket to the moon. Now you need a machine which is a vast number of times more powerful just to run the most popular GUI.

RebLazer · Sep 2, 2003

Zhavic,

Amazing - when I left work, I had made myself a note to try to figure out why a certain pair of names supposedly matched. I didn't understand why I kept getting a certain false positive. Your "service pack" fixed it!

A million thanks!
Lazer

SQLBill · Sep 2, 2003

Zhavic,

I would like to suggest that you take your final code, go to the FAQ tab and create a FAQ with your code. Title it: A better Soundex code!

-SQLBill

RebLazer · Sep 2, 2003

Ditto!

Lazer

zhavic · Sep 3, 2003

Thanx for suggestion, there is new FAQ about SoundEx now.

Zhavic

---------------------------------------------------------------
In the 1960s you needed the power of two Comodore64s to get a rocket to the moon. Now you need a machine which is a vast number of times more powerful just to run the most popular GUI.

RebLazer · Sep 4, 2003

Zhavic,

Now that I got your Soundex function working smoothly with my application, I was a bit disturbed to see that all of the following
[tt]
jame
james
jan
jane
jean
jeans
jim
john
june
[/tt]
share the same Soundex of "JAN"!

Is there anyway you can think of do make it even more accurate?

Thanks again so much,
Lazer

RebLazer · Sep 4, 2003

Zhavic,

I see that the algorithms call for that M -> N switch. It just seems strange - I am going to play around with removing that line and watch how it affects the results...

There's nothing wrong with your code. I just don't understand why the algorithm would want to do that...

Lazer

RebLazer · Sep 4, 2003

I have e-mailed some of the people whose e-mail addresses are listed in conjunction with the algorithm. I'll post back to here any news I receive.

Lazer

zhavic · Sep 5, 2003

Ok Lazer, I'll take a look there later.

Zhavic

---------------------------------------------------------------
In the 1960s you needed the power of two Comodore64s to get a rocket to the moon. Now you need a machine which is a vast number of times more powerful just to run the most popular GUI.

RebLazer · Sep 8, 2003

Regarding the M/N issue, I got the following three responses - the first two are from people at the National Institute of Standards and Technology and the third is from a private individual whose name I saw associated with the algortihm. I've put them in three different colors for easier reading.

I find the blue response particularly interesting.

Lazer

________________________________________

A classical version of Soundex can be found in the NIST Physics Lab pages*, or in our Dictionary of Algorithms and Data Structures (go to
http://www.nist.gov/dads/#S
and click on Soundex). The latter will also give you links also to variants, including the one you implemented, called NYSIIS, which was developed for a specific purpose that was somewhat different from that of the original Soundex. The purpose is explained at
http://www.choicemaker.com/MEDD_2001_Imm_Reg_Conference.pdf,
and the New York Schools apparently found it useful. There are a lot of variants of the Soundex principle, as it does not work for all names. The original one worked mainly for European names, but there are Oriental languages that it doesn't work so well for. In both Japanese and Chinese, which are very different languages (even though the Japanese adopted a subset of Chinese - Japanese "Kanji" - characters to be used along with their syllable-based forms for writing, which sometimes makes people think the languages themselves are similar), the English R and L are the same sound (linguists would say they are in the same phoneme). In Chinese (not Japanese) tones that we do not hear (nor do the Japanese) differentiate between words and names. In some African languages, clicking sounds that most languages do not have can make a difference. So as new names come to be used, the retrieval purposes of Soundex are changed. The airlines use Soundex to help them find names that may have been misspelled on their lists of passengers, but I doubt that they use the classical form today.

So the answer to your question is that there are a lot of arbitrary decisions made in the any of these schemes, but if they are still around, they have stood the test of time within the arena for which they were invented. It would be interesting to know if the New York Schools still use NYSIIS.
http://physics.nist.gov/cuu/Reference/soundex.html
Larry

The original SOUNDEX algorithm does, indeed, map
M and N to the same "sound". Does it make sense?
SOUNDEX was designed to help in the 1880 census.
Since records were hand-written, "m" and "n" often
looked alike, so it helped recall (the ability to
get everything pertinent vs. precision, getting
little that is not pertinent) to lump them together.
See
http://www.bradandkathy.com/genealogy/overviewofsoundex.html
Newer routines, like double metaphone and NYSIIS,
are more sophisticated, so don't produce as many
of the anomalies you note.

Sincerely,
-paul-

I don't know the why this is done this way, but it is what the algorithm calls for. The whole point of these phonetic coders is to compress the space of proper names and group the like-sounding ones together and you have to admit that they do that.
You know, it may be that each one of the names you listed sound like JAN somewhere in NYC

Sorry I couldn't be more elucidating.

-- matt

RebLazer · Sep 11, 2003

Zhavic and everyone,

Someone pointed me to a site

http://www.imagepartners.co.uk/thesaurus/Forenames.aspx

that does first-name matching. It works fairly.

Zhavic, you developed a really great implemenation - far better than Soundex or Metaphone!

Lazer

RebLazer · Sep 11, 2003

Zhavic,

I noticed something very strange. Here are the Soundex codes for the following words:

steve STAF
ste-ve STA-V
s-te-ve S-TA-V
s-t-e-v-e S-T-A-V-

Look how the hyphens change the Soundex codes - like from an F in the first code, to a V in the second code.

Where in the algorithm/implementation does that happen? I can't find it or understand why it is happening...

Thanks once again,
Lazer

zhavic · Sep 11, 2003

Hi Lazer, thanx again.

To your question:

This is by step ( 4) transcode 'EV' to 'EF' if not at start of name

Word 'STEVE' it transcode to 'STEFE' and then in step ( 7) replace all vowels with 'A',
it changes to 'STAFA' and step (22) removes all trailing vowels, so it trim last 'A' and the result is 'STAF'

But when you add hyphens = 'STE-VE', there is no 'EV' substring in the word and step (4) does nothing.
Then step (7) changes it to 'STA-VA' and then step (22) removes all trailing vowels and you get 'STA-V'

So the difference is in step (4).

From your blue post
"m" and "n" often looked alike, so it helped recall to lump them together.
Good explanation for step (13) of algorythm

Zhavic

---------------------------------------------------------------
In the 1960s you needed the power of two Comodore64s to get a rocket to the moon. Now you need a machine which is a vast number of times more powerful just to run the most popular GUI.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Better SoundEx implementation? (Very interesting!) 13

RebLazer

Programmer

zhavic

Programmer

RebLazer

Programmer

SQLBill

MIS

RebLazer

Programmer

zhavic

Programmer

RebLazer

Programmer

RebLazer

Programmer

RebLazer

Programmer

zhavic

Programmer

RebLazer

Programmer

RebLazer

Programmer

RebLazer

Programmer

zhavic

Programmer

Similar threads

Part and Inventory Search

Sponsor