To find duplicate records, do a GROUP BY with a HAVING clause, like this:
select colA, colB, count(*)
from my_table
group by colA, colB
having count(*) > 1
where colA, colB is the list of columns that need to be distinct. If you have an identity column, you can then delete, for each group returned by that query, every record whose id value is greater than the minimum id value for the same values in the distinct column(s); the record with the minimum id is the one you keep. A cursor might help in this endeavor.
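The described delete can also be done without a cursor as a single correlated DELETE. A minimal sketch, assuming the table is called my_table, the identity column is named id, and colA, colB are the columns that should be unique:
-- Delete every row whose id is not the minimum id for its colA/colB combination
DELETE t
FROM my_table AS t
WHERE t.id > (SELECT MIN(t2.id)
              FROM my_table AS t2
              WHERE t2.colA = t.colA
                AND t2.colB = t.colB);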
I have an article on the same topic; I think it will help you a lot.
Question.
Is there a SQL statement I can use to delete duplicate entries from a data store while leaving a single distinct copy - that is, remove all duplicates except one?
From your question, it is unclear whether your table has a unique key or not. Since you refer to this as a "data store", I'm guessing that your duplicates might be true duplicates, meaning that every value in every column is identical. Let me first address the case where the table does not have a unique key.
**************** NO UNIQUE KEY ******************
In this case, we have a difficult problem if we are trying to solve this with a single SQL Statement. In this situation, I recommend one of the following approaches:
1.) Add a unique key to the table. This is easy: add an integer column called ID and make it an identity column by checking the Identity box in the table design window. Set the Identity Seed to 1 and the Identity Increment to 1. The column will automatically be populated with unique values for each row in the table. Then proceed to the UNIQUE KEY section below.
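If you prefer to script it instead of using the designer, the same thing can be done in one T-SQL statement; a minimal sketch, assuming the table is named Foo1 as in the examples below:
-- Add an auto-incrementing integer key with seed 1 and increment 1
ALTER TABLE Foo1 ADD ID INT IDENTITY(1,1) NOT NULL;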
2.) Write a stored procedure. The strategy here is to write a query that returns one row for each set of duplicates, such as the following:
SELECT Field1, Field2, COUNT(*)
FROM Foo1
GROUP BY Field1, Field2
HAVING COUNT(*) > 1
Use a cursor to loop through the returned rows, then for each set of duplicates, read all rows for that set:
SELECT Field1, Field2
FROM Foo1
WHERE Field1 = @Field1 AND Field2 = @Field2
Then delete each row except the first one returned, for each set of duplicates.
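Putting that together, here is a minimal sketch of what such a stored procedure body might look like. The INT types for Field1 and Field2 are assumptions, and SET ROWCOUNT is used to cap the delete at n-1 rows; on newer SQL Server versions DELETE TOP (n) would be the equivalent.
-- Sketch only: Foo1(Field1, Field2) contains true duplicates and has no key column
DECLARE @Field1 INT, @Field2 INT, @DupCount INT;

DECLARE dup_cursor CURSOR FOR
    SELECT Field1, Field2, COUNT(*)
    FROM Foo1
    GROUP BY Field1, Field2
    HAVING COUNT(*) > 1;

OPEN dup_cursor;
FETCH NEXT FROM dup_cursor INTO @Field1, @Field2, @DupCount;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Cap the DELETE at n-1 rows so exactly one copy of the set survives
    SET @DupCount = @DupCount - 1;
    SET ROWCOUNT @DupCount;
    DELETE FROM Foo1
    WHERE Field1 = @Field1 AND Field2 = @Field2;
    SET ROWCOUNT 0;

    FETCH NEXT FROM dup_cursor INTO @Field1, @Field2, @DupCount;
END
CLOSE dup_cursor;
DEALLOCATE dup_cursor;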
******************* UNIQUE KEY ********************
If you are dealing with a table that does have a unique key, the problem of removing duplicates is much easier and can be accomplished in a single SQL statement such as the following:
DELETE
FROM Foo1
WHERE Foo1.ID IN
    -- List 1 - all rows that have duplicates
    (SELECT F.ID
     FROM Foo1 AS F
     WHERE Exists (SELECT Field1, Field2, Count(ID)
                   FROM Foo1
                   WHERE Foo1.Field1 = F.Field1
                     AND Foo1.Field2 = F.Field2
                   GROUP BY Foo1.Field1, Foo1.Field2
                   HAVING Count(Foo1.ID) > 1))
  AND Foo1.ID NOT IN
    -- List 2 - one row from each set of duplicates
    (SELECT Min(ID)
     FROM Foo1 AS F
     WHERE Exists (SELECT Field1, Field2, Count(ID)
                   FROM Foo1
                   WHERE Foo1.Field1 = F.Field1
                     AND Foo1.Field2 = F.Field2
                   GROUP BY Foo1.Field1, Foo1.Field2
                   HAVING Count(Foo1.ID) > 1)
     GROUP BY Field1, Field2);
Since this may appear complicated, let me explain. My strategy here is to return two lists: The first, List 1, is a list of all rows that have duplicates, and the second, List 2, is a list of one row from each set of duplicates. This query simply deletes all rows that are in List 1 but not in List 2.
There is a simple, comprehensive tutorial article on the subject. It includes code, about 5 lines of it, minus the copious comments, and it makes it look as easy as you suspected it had to be.
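Those few lines presumably boil down to the same min-ID idea as above; a minimal sketch, reusing the Foo1/ID/Field1/Field2 names (this form is fine in SQL Server, though some databases won't let the subquery reference the table being deleted from):
-- Keep the row with the lowest ID in each Field1/Field2 group, delete the rest
DELETE FROM Foo1
WHERE ID NOT IN (SELECT MIN(ID)
                 FROM Foo1
                 GROUP BY Field1, Field2);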
No, I don't agree 100%. At least, for the app I'm creating right now. The "system" includes the users, and my users are a little timid. Having the computer refuse to do things, or beep at them, or send warning messages to them puts them off.
For what I'm doing now, I think it's better to let them work, at least at first, and clean up after them as needed. Later, maybe, when they are more accustomed to the process, then I might tighten up.
Removing dupes is very handy when bringing in legacy data; however, allowing users to input duplicate data and not controlling the content of your data is wrong. While your users may be timid, if the data they input produces errors in reports (bad counts), or it starts taking longer to produce unique values because of the dupes, then the users will stop using the application. I am all for tempting users to use your application, but taking off the safety belts is not the way to keep them coming back. Showing them how your application protects them and prevents them from doing "double" work will help convince them that the application is helpful. I feel that this is where you will get your users coming back. This is only my opinion, but you asked.
I asked because I really wanted to know. I work by myself, and when I saw who had responded, and how, I thought it was a good chance to learn.
Thank you for your help. I think I will put the "removing duplicates" closer to the front of the process (in the "temp" tables), so the user can see it happen, but the dups don't get into the "permanent" tables. That will be both educational and safe, I hope!
While I do agree it is necessary on occasion to deal with duplicates when bringing in data from other sources, you should never design errors into your system deliberately.
If you have timid users, you fix this with training. Specifically, you show them the things that they could do wrong and how to read the error message and how to recover from it. Then you have someone around to hold their hands the first week or so they use the system and help them through the problems. If they are really timid, you could even give them a cheat sheet showing them common error messages and what to do about them. After about a week, they are used to the fact that the system won't let them make certain mistakes and they become more careful about how they enter data.
I don't have a lot of faith in users looking at documentation, per se, but a cheat sheet . . . who can resist cheating?
And, demonstrating the errors sounds like a really good idea -- very bracing for the users to see "the expert" get an error message and recover. Should make the whole process more human and less demeaning for the users.
I might even borrow a page from MS's book and put out a "How to cause this error" sheet as well. Thanks for the ideas.