Finding Duplicates when records are exactly the same


dawnd3
I have a table of over 200,000 records of companies and their addresses, phone numbers, etc. that I am trying to clean up. Many are duplicates but do not show up in the duplicates query because they aren't exact. For example, one company name might be missing a comma or an "Inc.", or the address is concatenated in one field in one record and split into the proper fields in the other. It would take me forever to go through and make them match just so they can be pulled as duplicates.

Anyway, what I imagine is some sort of function that can spot duplicates if a certain number of characters match in more than one field. Is this even possible? Say if a string of 10 characters matches in 3 different fields, it would show as a duplicate. What do you think? I use Access 2003.

Thanks,

Dawn

 
There are lots of ways to do this, but there is no magic bullet. There are many ways to ID potential duplicates, but without some rules there is no way to determine that a candidate really is a match. The trick is not to find a single function or query, but to come up with a utility that runs a lot of different queries or algorithms to ID potential matches and then lets the user quickly choose the correct ones and make fixes.

So my first question is: what do you do after IDing a duplicate? Do you combine all data into a single record? Do you delete all duplicate entries and leave a single record? If you are going to do maintenance, then IDing the duplicates is only half the battle.

So this is how I do it.
1) Clean up my data by first searching and replacing.
Examples
I would replace all Inc, Inc. with Incorporated
Co. Co with Company
Lmtd LTD with Limited
etc.
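
To give an idea of what that cleanup could look like in code, here is a minimal VBA sketch that loops through the table and applies the substitutions above. The table and field names (Suppliers2, CompanyName) are only assumptions, borrowed from the query example further down; adjust them to your own schema.

Public Sub NormaliseCompanyNames()
    ' Sketch only: the table name (Suppliers2) and field name (CompanyName) are assumptions.
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim strOriginal As String
    Dim strName As String

    Set db = CurrentDb
    Set rs = db.OpenRecordset("Suppliers2", dbOpenDynaset)

    Do While Not rs.EOF
        strOriginal = Nz(rs!CompanyName, "")
        strName = strOriginal

        ' Spell out the common abbreviations listed above.
        strName = Replace(strName, "Inc.", "Incorporated", Compare:=vbTextCompare)
        strName = Replace(strName, "Lmtd", "Limited", Compare:=vbTextCompare)
        strName = Replace(strName, "Ltd", "Limited", Compare:=vbTextCompare)
        strName = Replace(strName, "Co.", "Company", Compare:=vbTextCompare)

        ' Strip stray commas so "Acme, Incorporated" and "Acme Incorporated" compare equal.
        strName = Replace(strName, ",", "")

        If strName <> strOriginal Then
            rs.Edit
            rs!CompanyName = strName
            rs.Update
        End If

        rs.MoveNext
    Loop

    rs.Close
    Set db = Nothing
End Sub
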
2) Now I would build lots of different types of queries that return a record and its possible duplicates. The reason for this is flexibility: there is no single query that can return all of the possibilities.

One would obviously be where the name is equal

One would be where one name is contained in another.
Example
SELECT A.CompanyName, B.CompanyName
FROM Suppliers2 AS A, Suppliers2 AS B
WHERE (((A.CompanyName) Like "*" & B.[CompanyName] & "*" And (A.CompanyName)<>B.[CompanyName]));

Maybe a join on multiple fields besides name (Address, Phone, etc)

A query that looks for abbreviations. Match NWA with North West Airlines.

A query using a soundex function
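
Access has no built-in Soundex, so the soundex idea means writing one in VBA. Here is a simplified sketch (it skips the H/W rule of the official algorithm). Rather than calling it inside a self-join over 200,000 rows, you would probably store its result in an extra column and then join or group on that column.

Public Function Soundex(ByVal strText As String) As String
    ' Simplified Soundex: the first letter plus up to three digits describing the
    ' consonants that follow, so similar-sounding names get the same code,
    ' e.g. "Smith" and "Smythe" both return "S530".
    Const CODES As String = "01230120022455012623010202"   ' digit for each letter A..Z
    Dim i As Integer
    Dim j As Integer
    Dim ch As String
    Dim digit As String
    Dim prevDigit As String
    Dim result As String

    strText = UCase$(strText)

    ' Find the first letter; give up if there is none.
    For i = 1 To Len(strText)
        ch = Mid$(strText, i, 1)
        If ch >= "A" And ch <= "Z" Then Exit For
    Next i
    If i > Len(strText) Then Exit Function

    result = ch
    prevDigit = Mid$(CODES, Asc(ch) - 64, 1)

    ' Encode the letters after the first one, skipping vowels and repeated codes.
    For j = i + 1 To Len(strText)
        ch = Mid$(strText, j, 1)
        If ch >= "A" And ch <= "Z" Then
            digit = Mid$(CODES, Asc(ch) - 64, 1)
            If digit <> "0" And digit <> prevDigit Then result = result & digit
            prevDigit = digit
            If Len(result) = 4 Then Exit For
        End If
    Next j

    Soundex = Left$(result & "000", 4)
End Function
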

3) Now I would personally use recordsets and read through all of my queries and populate a table. I would read each of my records and then record in my table each possible duplicate and the rule used (query name) to find it.

My table would have these fields
Company_PrimaryKey
possibleDuplicate_PrimaryKey
queryRuleUsed
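
A rough sketch of that step in DAO, assuming the table above is called tblPossibleDuplicates and that each duplicate-finding query has been saved and returns the two primary keys as its first two columns; every object name here is a placeholder.

Public Sub LogPossibleDuplicates()
    Dim db As DAO.Database
    Dim rsIn As DAO.Recordset
    Dim rsOut As DAO.Recordset
    Dim varQuery As Variant

    Set db = CurrentDb
    Set rsOut = db.OpenRecordset("tblPossibleDuplicates", dbOpenDynaset)

    ' Each saved query is assumed to return (Company_PrimaryKey, possibleDuplicate_PrimaryKey).
    For Each varQuery In Array("qryExactName", "qryNameContainsName", "qrySamePhone")
        Set rsIn = db.OpenRecordset(CStr(varQuery), dbOpenSnapshot)
        Do While Not rsIn.EOF
            rsOut.AddNew
            rsOut!Company_PrimaryKey = rsIn.Fields(0)
            rsOut!possibleDuplicate_PrimaryKey = rsIn.Fields(1)
            rsOut!queryRuleUsed = CStr(varQuery)
            rsOut.Update
            rsIn.MoveNext
        Loop
        rsIn.Close
    Next varQuery

    rsOut.Close
    Set db = Nothing
End Sub
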

4) Now I would build a form with two synchronized continuous subforms using the above table in queries. Subform 1 would be a continuous list of all records with possible duplicates. Subform 2 would be the related possible duplicates. I would have some buttons on the form to do "delete all possible duplicates", "delete a specific duplicate", and "copy missing data from duplicates". This will allow me to manage the duplicates.
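
As an illustration of the kind of button code that form could carry, here is a hypothetical click handler for "delete all possible duplicates"; again, every table, field, and control name is an assumption.

Private Sub cmdDeleteAllDuplicates_Click()
    Dim strSQL As String

    ' Delete every record logged as a possible duplicate of the company currently
    ' selected in subform 1 (its key is assumed to be in a control called txtCompanyKey).
    strSQL = "DELETE FROM Suppliers2 " & _
             "WHERE ID IN (SELECT possibleDuplicate_PrimaryKey " & _
             "FROM tblPossibleDuplicates " & _
             "WHERE Company_PrimaryKey = " & Me.txtCompanyKey & ")"

    If MsgBox("Delete all listed possible duplicates for this company?", _
              vbYesNo + vbQuestion) = vbYes Then
        CurrentDb.Execute strSQL, dbFailOnError
        Me.Requery
    End If
End Sub
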


Obviously you need to focus on preventing this in the future. You should not be able to enter a new company without the application searching the DB for possible duplicate records. Different countries have different ways to uniquely ID a business, such as a Tax ID number.
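
For the prevention side, one common approach is a check in the data-entry form's BeforeUpdate event; a sketch follows, with hypothetical form, table, and field names.

Private Sub Form_BeforeUpdate(Cancel As Integer)
    Dim lngMatches As Long

    ' Warn if another company with the same name already exists.
    ' The table (Suppliers2) and fields (ID, CompanyName) are assumptions.
    lngMatches = DCount("*", "Suppliers2", _
        "CompanyName = """ & Replace(Nz(Me.CompanyName, ""), """", """""") & _
        """ AND ID <> " & Nz(Me.ID, 0))

    If lngMatches > 0 Then
        If MsgBox("A company with this name already exists. Save it anyway?", _
                  vbYesNo + vbExclamation) = vbNo Then
            Cancel = True
        End If
    End If
End Sub
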

If interested in this approach I will show some examples.
 
Just to add to MajP's excellent advice:

Try to regularise all the entries so that they're all in the same format.

Do some grouped queries to count the number of similar entries in the major address fields (city, county, country, for UK addresses). This might show up 90% as sensible, 5% as typos and 5% as blank. You might be able to fix half of these with some one-off queries; the same misspellings will come up time and again and you'll have a lot where one of the address fields is blank and the others have all been shifted across one column.

Do some queries sorted on phone and on postcode. These show up a lot of duplicates very quickly.
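
A quick sketch of that kind of grouped check in VBA, listing phone numbers that appear more than once (the table and field names are assumptions; the same pattern works for postcode, city, and so on):

Public Sub ListDuplicatePhones()
    Dim db As DAO.Database
    Dim rs As DAO.Recordset

    Set db = CurrentDb
    ' Group on the phone field and keep only the values that occur more than once.
    Set rs = db.OpenRecordset( _
        "SELECT Phone, Count(*) AS HowMany " & _
        "FROM Suppliers2 " & _
        "WHERE Phone Is Not Null " & _
        "GROUP BY Phone " & _
        "HAVING Count(*) > 1 " & _
        "ORDER BY Count(*) DESC;", dbOpenSnapshot)

    Do While Not rs.EOF
        Debug.Print rs!Phone, rs!HowMany
        rs.MoveNext
    Loop

    rs.Close
    Set db = Nothing
End Sub
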

You'll need somebody who knows the business to help you with the final manual checking. Someone who can say "They've gone out of business." or "That was their old name back in the nineties."

Make a backup after every run of deletions. This is going to be a long process and mistakes are going to be made.

Keep a record of how long this takes you and use it to push for higher standards of data entry.



Geoff Franklin
 
I very much appreciate your responses and I am definitely taking some great ideas from them, but some won't work because, as I said in my post, I am working with over 200,000 records. I upload 2-3000 records at a time, so I can't possibly manually/visually check for dups. Keep the ideas coming though, I can use as many clean up tips as possible.

Dawn

 
"I upload 2-3000 records at a time, so I can't possibly manually/visually check for dups."
I'm afraid I don't know of any way of getting rid of duplicates without a final manual check. You can write code that will pick out possible duplicates and you can automatically delete a lot of them but you won't get 100% accuracy without human intervention.

It depends how much these duplicates are going to cost the company. If you're in a high-value business then it might be worth spending a lot of money to get rid of that last 1% of duplicates. On your figures that would be 2,000 duplicates so if you're sending out some thick, glossy catalogues at $5 each then you could justify sitting down for $10,000 worth of your time to get rid of them.

Geoff Franklin
 
You titled this "finding duplicates when records are exactly the same". Handling that automatically is simple, but unfortunately we are talking about records that we cannot tell are exactly the same. Or at least you have not mentioned a way to tell. Maybe there are some combinations of fields where, even if the names are not equal, there is no way for those additional fields to match without the records being the same company. If that is the case, tell us about them.

However, like everyone said, you can only write algorithms that return some level of confidence. Sorry, but it is your fault for not enforcing strict rules on the front end. I understand sometimes you are at the mercy of the data source and have no control.

Think of Google with some of the most complex search algorithms. Does it ever return things that are not exact matches?

Like I said it will have to be a strategy, not a single function, query, or magic bullet.

When you upload, do you have any control over the data? Or is this coming from some data source where you can institute some business rules?

Let's assume you upload (how often?) 2-3k at a time. Maybe there are a few rules to implement on those that will make IDing duplicates 90% efficient.
1) Require a tax ID or some commonly used unique identifier
2) All names spelled out completely (American Express not AMEX, North West Airlines not NWA, Incorporated not Inc., etc.)

Now you have a chance when importing to automatically handle exact names. Assuming 10% are duplicates, that means there are about 300 duplicates in a 3,000-record upload. If you clean up your rules so that you can find 90% of them as truly exact matches, you will end up with 30 possible but not exact matches. Now you can try some different queries on your data and display another 90% of those possible but not exact matches, so you would have about 27 where you would need to intervene (Acme Distributors Incorporated versus ADI: are they actually the same? Nobody knows but you). So now you deal with about 30 manual checks a day, leaving you a possible 1% error rate.

The current 200k records, using these concepts, might take a few days to review, but the better your interface, the faster it will go. You really need to focus on a three-pronged attack: fix the current data, install business rules to make duplicates less likely, and do an automatic and manual ID pass when data is imported.
 