To find duplicate records, do a GROUP BY with a HAVING clause, like this:
select colA, colB, count(*)
from my_table
group by colA, colB
having count(*) > 1
where colA, colB is the list of columns that need to be distinct. If you have an identity column, you can then delete, for each group returned by that query, every record whose id value is greater than the minimum id value for the same values in the distinct column(s); the record with the minimum id is the one you keep. A cursor might help in this endeavor.
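The described delete can also be done without a cursor as a single correlated DELETE. A minimal sketch, assuming the table is called my_table, the identity column is named id, and colA, colB are the columns that should be unique:
-- Delete every row whose id is not the minimum id for its colA/colB combination
DELETE t
FROM my_table AS t
WHERE t.id > (SELECT MIN(t2.id)
              FROM my_table AS t2
              WHERE t2.colA = t.colA
                AND t2.colB = t.colB);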
I have an article on the same topic; I think it will help you a lot.
Question.
Is there a SQL statement I can use to delete duplicate entries from a data store while leaving a single distinct copy - that is, remove all duplicates except one?
From your question, it is unclear whether your table has a unique key or not. Since you refer to this as a "data store", I'm guessing that your duplicates might be true duplicates, meaning that every value in every column is identical. Let me first address the case where the table does not have a unique key.
**************** NO UNIQUE KEY ******************
In this case, we have a difficult problem if we are trying to solve this with a single SQL Statement. In this situation, I recommend one of the following approaches:
1.) Add a unique key to the table. This is easy: add an integer column called ID and make it an identity column by checking the Identity box in the table design window. Set the Identity Seed to 1 and the Identity Increment to 1. The column will automatically be populated with unique values for each row in the table. Then proceed to the UNIQUE KEY section below.
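If you prefer to script it instead of using the designer, the same thing can be done in one T-SQL statement; a minimal sketch, assuming the table is named Foo1 as in the examples below:
-- Add an auto-incrementing integer key with seed 1 and increment 1
ALTER TABLE Foo1 ADD ID INT IDENTITY(1,1) NOT NULL;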
2.) Write a stored procedure. The strategy here is to write a query that returns one row for each set of duplicates, such as the following:
SELECT Field1, Field2, COUNT(*)
FROM Foo1
GROUP BY Field1, Field2
HAVING COUNT(*) > 1
Use a cursor to loop through the returned rows, then for each set of duplicates, read all rows for that set:
SELECT Field1, Field2
FROM Foo1
WHERE Field1 = @Field1 AND Field2 = @Field2
Then delete each row except the first one returned, for each set of duplicates.
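Putting that together, here is a minimal sketch of what such a stored procedure body might look like. The INT types for Field1 and Field2 are assumptions, and SET ROWCOUNT is used to cap the delete at n-1 rows; on newer SQL Server versions DELETE TOP (n) would be the equivalent.
-- Sketch only: Foo1(Field1, Field2) contains true duplicates and has no key column
DECLARE @Field1 INT, @Field2 INT, @DupCount INT;

DECLARE dup_cursor CURSOR FOR
    SELECT Field1, Field2, COUNT(*)
    FROM Foo1
    GROUP BY Field1, Field2
    HAVING COUNT(*) > 1;

OPEN dup_cursor;
FETCH NEXT FROM dup_cursor INTO @Field1, @Field2, @DupCount;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Cap the DELETE at n-1 rows so exactly one copy of the set survives
    SET @DupCount = @DupCount - 1;
    SET ROWCOUNT @DupCount;
    DELETE FROM Foo1
    WHERE Field1 = @Field1 AND Field2 = @Field2;
    SET ROWCOUNT 0;

    FETCH NEXT FROM dup_cursor INTO @Field1, @Field2, @DupCount;
END
CLOSE dup_cursor;
DEALLOCATE dup_cursor;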
******************* UNIQUE KEY ********************
If you are dealing with a table that does have a unique key, the problem of removing duplicates is much easier and can be accomplished in a single SQL statement such as the following:
DELETE
FROM Foo1
WHERE Foo1.ID IN
    -- List 1 - all rows that have duplicates
    (SELECT F.ID
     FROM Foo1 AS F
     WHERE Exists (SELECT Field1, Field2, Count(ID)
                   FROM Foo1
                   WHERE Foo1.Field1 = F.Field1
                     AND Foo1.Field2 = F.Field2
                   GROUP BY Foo1.Field1, Foo1.Field2
                   HAVING Count(Foo1.ID) > 1))
  AND Foo1.ID NOT IN
    -- List 2 - one row from each set of duplicates
    (SELECT Min(ID)
     FROM Foo1 AS F
     WHERE Exists (SELECT Field1, Field2, Count(ID)
                   FROM Foo1
                   WHERE Foo1.Field1 = F.Field1
                     AND Foo1.Field2 = F.Field2
                   GROUP BY Foo1.Field1, Foo1.Field2
                   HAVING Count(Foo1.ID) > 1)
     GROUP BY Field1, Field2);
Since this may appear complicated, let me explain. My strategy here is to return two lists: The first, List 1, is a list of all rows that have duplicates, and the second, List 2, is a list of one row from each set of duplicates. This query simply deletes all rows that are in List 1 but not in List 2.
There is a simple, comprehensive tutorial article on the subject. It includes code, about 5 lines of it, minus the copious comments, and it makes it look as easy as you suspected it had to be.
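Those few lines presumably boil down to the same min-ID idea as above; a minimal sketch, reusing the Foo1/ID/Field1/Field2 names (this form is fine in SQL Server, though some databases won't let the subquery reference the table being deleted from):
-- Keep the row with the lowest ID in each Field1/Field2 group, delete the rest
DELETE FROM Foo1
WHERE ID NOT IN (SELECT MIN(ID)
                 FROM Foo1
                 GROUP BY Field1, Field2);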
No, I don't agree 100%. At least, for the app I'm creating right now. The "system" includes the users, and my users are a little timid. Having the computer refuse to do things, or beep at them, or send warning messages to them puts them off.
For what I'm doing now, I think it's better to let them work, at least at first, and clean up after them as needed. Later, maybe, when they are more accustomed to the process, then I might tighten up.
Removing dupes is very handy when bringing in legacy data; however, allowing users to input duplicate data and not controlling the content of your data is wrong. While your users may be timid, if the data they input produces errors in reports (bad counts), or it starts taking longer to produce unique values because of the dupes, then the users will stop using the application. I am all for tempting users to use your application, but taking off the safety belts is not the way to keep them coming back. Showing them how your application protects them and prevents them from doing "double" work will help convince them that the application is helpful. I feel that this is where you will get your users coming back. This is only my opinion, but you asked.
I asked because I really wanted to know. I work by myself, and when I saw who had responded, and how, I thought it was a good chance to learn.
Thank you for your help. I think I will put the "removing duplicates" closer to the front of the process (in the "temp" tables), so the user can see it happen, but the dups don't get into the "permanent" tables. That will be both educational and safe, I hope!
While I do agree it is necessary on occasion to deal with duplicates when bringing in data from other sources, you should never design errors into your system deliberately.
If you have timid users, you fix this with training. Specifically, you show them the things that they could do wrong and how to read the error message and how to recover from it. Then you have someone around to hold their hands the first week or so they use the system and help them through the problems. If they are really timid, you could even give them a cheat sheet showing them common error messages and what to do about them. After about a week, they are used to the fact that the system won't let them make certain mistakes and they become more careful about how they enter data.
I don't have a lot of faith in users looking at documentation, per se, but a cheat sheet . . . who can resist cheating?
And, demonstrating the errors sounds like a really good idea -- very bracing for the users to see "the expert" get an error message and recover. Should make the whole process more human and less demeaning for the users.
I might even borrow a page from MS's book and put out a "How to cause this error" sheet as well. Thanks for the ideas.