Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations bkrike on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

*** How to Find Duplicates, Triplicates, and Similar Records?? ***

Status
Not open for further replies.

jonnywah

Programmer
Feb 7, 2004
80
US
I have a table Chemical that has the following fields.

chemicalId - primary key
chemicalShortName - eg. viagrasium
chemicalFullName - eg. viagrasium dio sulphate
chemicalNameUsed - eg. viagra
chemicalCompanyShortName - eg. ABC Pharmaceutical
chemicalCompanyFullName - eg. ABC Pharmaceutical Internation Corporation
chemicalContactEmail1 - jim@abcpharm.com
chemicalContactEmail2 - jerry@abcpharmcorp.com

I have to create a Stored Procedure to return the chemicalId of records that appear to be "duplicates", "triplicates" ("quad ...", etc) or are "VERY SIMILAR".

The procedure should check all the fields above and return a list of IDs of duplicates, triplicates, and records that are very similar. I will then manually eliminate and merge similar / duplicate records. Please help!
 
this is how to find duplicate rows:

SELECT a.au_lname, a.au_fname
FROM table AS a
GROUP BY a.au_lname, a.au_fname
HAVING COUNT(*) > 1
 
The following example will dispaly all multiple records for which chemicalShortName and chemicalFullName are the same:
Code:
select chemicalId , chemicalShortName , chemicalFullName from _tChemical
where chemicalId in (
select chemicalId from _tChemical  t2, 
(select chemicalShortName, chemicalFullName from _tChemical group by chemicalShortName, chemicalFullName having count(*) > 1) as t1
where t2.chemicalShortName = t1.chemicalShortName and t2.chemicalFullName = t1.chemicalFullName) group by chemicalShortName,chemicalFullName, chemicalId
-obislavu-
 
eramgarden and obislavu,

I would like to find "very similar" records, not just identical duplicate records. How can I do this? Thank you in advance.
 
Define "very similar".

You can look for patterns using like
For example
[tt]
myField LIKE 'A%'
[/tt]

Finds everything where 'myField' starts with "A".

If you want to be more technical about the whole thing then check out
That site discusses (and provides code) for computing the "Levenshtein Distance" which is a numerical measure of how "similar" two strings are.
 
I know how to use simple "like" statements, however I would like to only return "very similar" records in terms of chemicalShortName, chemicalFullName, chemicalNameUsed, chemicalCompanyShortName, chemicalCompanyFullName, chemicalContactEmail1, and chemicalContactEmail2.

"Very similar" records are defined as records with very similar fields (see above) and many similar fields.
 
Again ... I don't see an implementable definition of "very similar" and defining a thing in terms of itself doesn't really help.

Can you come up with something like "90% of the characters in corresponding positions match" OR "the first 10 characters of each string match"? If you can then you at least have a starting point.

For example, are
ABC Pharmaceutical and ABC Pharmaceutical International Corporation "very similar" or not?

How About ABC Pharm. Intl. Corp.?

Without some measure that can be defined in code you really can't get there from here.

With respect to email addresses, this doesn't seem realistic.

Obviously ajones@myisp.com and bjones@myisp.com could be regarded as "very similar" because they differ by only one letter but "ajones" and "bjones" could be (probably are) two completely different people.
 
ABC Pharmaceutical, ABC Pharmaceutical International Corporation, and ABC Pharm. Intl. Corp. would be considered similar.

Similar can be defined as the "first 5 characters" in each string and "50% of the characters" match in any two of the fields:

chemicalShortName
chemicalFullName
chemicalNameUsed
chemicalCompanyShortName
chemicalCompanyFullName
chemicalContactEmail1
chemicalContactEmail2

Email address would be harder, but ajones@myisp.com and bjones@myisp.com should be considered similar. I will manually look over the similar records after the stored procedure finds them.



 
So you want something like this?
[tt]
Public Function VerySimilar(Str1 As String, Str2 As String)
As Boolean
Dim L As Integer, n As Integer, Match As Integer
Dim L1 As Integer, L2 As Integer
L1 = Len(Str1)
L2 = Len(Str2)
If L1 = 0 OR L2 = 0 Then
VerySimilar = FALSE
Else
L = IIF ( L1 > L2, L2, L1 )
If L > 5 Then L = 5

If (Left$(Str1,L) <> Left$(Str2,L)) Then
VerySimilar = FALSE
Else
L = IIF ( L1 > L2, L2, L1 )
For n = 1 to L
If Mid$(Str1,n,1) = Mid$(Str2,n,1) Then Match = Match + 1
Next n

VerySimilar = (cSng(Match) / cSng(L)) >= 0.5
End If
End IF

End Function
[/tt]
 
Okay, now is more clear.
I think you can use the SOUNDEX() function to determine when two fields are similar:
Example :
SOUNDEX("ABC Pharmaceutical") will be equal with SOUNDEX("ABC Pharmaceutical International") and also equal to SOUNDEX("ABC Pharm. Intl. Corp.");
Code:
select chemicalId , chemicalShortName , chemicalFullName from _tChemical
where chemicalId in (
select chemicalId from _tChemical  t2, 
(select chemicalShortName, chemicalFullName from _tChemical group by chemicalShortName, chemicalFullName)  as t1
where soundex(t2.chemicalShortName) = soundex(t1.chemicalShortName) and soundex(t2.chemicalFullName) = soundex(t1.chemicalFullName)) group by chemicalShortName,chemicalFullName, chemicalId 
[code]
-obislavu-
 
Soundex function wasn't useful here as almost all the records were returned as "similar"

Golom, I understand the logic for your function but unfortunately this seems to be VB code. I will try to get this working in a SQL Stored procedure, but I am new to this. Any advice or suggestions would be appreciated. Thank you in advance.

 
Okay, in this case you should write your proper function soundex( s1 varchar2, s2 varchar2) to determine whenever the two strings are very similar.
Call this function in the stored procedure.
Of course the VB code provided by Golom is easy to translate in a function recognized by your DB. All above functions (Mid(), Len() have a corresponding function.

-obislavu-
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top