Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations bkrike on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

I need help with removing duplicates :) 1

Status
Not open for further replies.

tomharrow

Programmer
Mar 3, 2004
7
GB
Hi All

I have a table of approx 500,000 members which I need to dedupe only by email address. and i dont mind which records get removed. as long as there is a unique email address.

e.g.

tom, harrow, tom@tom.com, 30/09/1978, blue, £456
thomas, harrow, tom@tom.com, <NULL>, blue, £456
tom, harrow, tom@tom.com, 30/09/1978, <NULL>, <NULL>

I dont mind which record remains... as long as there are no duplicate email addresses.

I have tried a number of ways but none seem to work properly and the only effective method opnly keeps one field(email) i.e. creates a table with only distinct email addresses.

Any help hugely appreciated.

Thanks

Tom
 
If you had a primary key field it would get kind of easy...


for instance run the following script in QA
Code:
Create database crap
go
use crap
go
create table dups(c1 int identity, email varchar(300))
go
insert into dups (email) values ('rob@mysite.com')
insert into dups (email) values ('rob@mysite.com')

insert into dups (email) values ('rob@Somesite.com')
insert into dups (email) values ('rob@SomeSite.com')
insert into dups (email) values ('rob@SomeSite.com')

insert into dups (email) values ('rob@yoursite.com')
insert into dups (email) values ('rob@yoursite.com')

insert into dups (email) values ('rob@noSite.com')
insert into dups (email) values ('rob@noSite.spam.com')
go

select * from dups
go
delete from dups where c1 not  in 
(select min(c1) from dups d2 where  d2.email = dups.email group by email having count(email) > 1) 
and email in 
(select email 
from dups 
group by email 
having count(email) >1)
go
select * from dups
go
drop table dups

YOu should notice that you only have the 4 distinct rows left...

Rob

PS I work for Stars :)
BUT ONLY PAY me if its worth it..

STAR GAZER in training...


 
Backup database :X

Create identity column.

Run this
Code:
DELETE FROM myTable
WHERE identitycolumn NOT IN 
(	SELECT MIN(identitycolumn)
	FROm myTable
	GROUP by emailcolumn
)

Remove identity column.

------
heisenbug: A bug that disappears or alters its behavior when one attempts to probe or isolate it
schroedinbug: A bug that doesn't appear until someone reads source code and realizes it never should have worked, at which point the program promptly stops working for everybody until fixed.

[ba
 
One note: if emailcolumn can be NULL, exclude these records with extra AND emailcolumn IS NOT NULL.

------
heisenbug: A bug that disappears or alters its behavior when one attempts to probe or isolate it
schroedinbug: A bug that doesn't appear until someone reads source code and realizes it never should have worked, at which point the program promptly stops working for everybody until fixed.

[ba
 
Hi Guys

thanks so much for your help. but the following script (affected 0 records)

am i doing something wrong?

before running this i did the following just to make sure

select count(email) from customer = 809639
select count(distinct(email)) from customer = 444327

DELETE FROM customer
WHERE autoid NOT IN
( SELECT MIN(autoid)
FROm customer
GROUP by autoid
)

thnks

tom
 
NoCoolHandle

please excuse my ameteurness but i am having trouble understanding your pseudo code.

would the following be correct?

autoid is my PK field

select * from customer
go
delete from customer where c1 not in
(select min(c1) from customer d2 where d2.email = customer.email group by email having count(email) > 1)
and email in
(select email
from customer
group by email
having count(email) >1)
go
select * from customer

--i dont understand why wi have to delete this?
--go
--drop table customer
 


I guess, you need also make sure:

select count(*) from customer = select count(distinct autoid) from customer
 
The drop was only for my demo script (cleaning up after I have built things on your server :)

you shouldn't!!!

My c1 = your autoid

however.... USE VONGRUNTS... it is very easy..

The key is the (change your code to include the red sections)

Code:
DELETE FROM customer
WHERE autoid NOT IN 
(    SELECT MIN([red]autoid[/red])
    FROm customer
    GROUP by [red]email[/red]
)
[red]and not email is null[/red]
 
Thanks

that time it worked ... but only removed about 1800 records??
 
Just 1 quick tip...

When I am playing with new updates/delete/inserts on a "Live" database..
Often I will add a single line before the update

Code:
begin tran
THen Run your update...
Code:
DELETE FROM customer
WHERE autoid NOT IN 
(    SELECT MIN(autoid)
    FROm customer
    GROUP by email
)
and not email is null
and if it looks good..

run the following
Code:
commit tran

if it looks bad

run the following
Code:
rollback tran

The key is that until you either commit or rollback the transaction (update) is in question and hasn't been done..

In qa (query analyzer) my total code might look like..

Code:
[blue]Begin Tran
DELETE FROM [/blue]customer
[blue]WHERE [/blue]autoid [gray]NOT IN [/gray]
(    [blue]SELECT MIN[/blue](autoid)
    [blue]FROm [/blue]customer
    [blue]GROUP by [/blue]email
)
[gray]and not[/gray] email [blue]is[/blue] [gray]null[/gray]

[blue]select[/blue] * [blue]from[/blue] customer
[gray]--[/gray][green]Rollback Tran[/green]

Then I would run the script and depending on the results fo the select RUn either the Rollback Tran or Commit tran at the bottom. That could be done by simply selecting (highlighting) the text with a mouse drag and then executing (ALT+X) just that section..

HTH


Rob
 
How many NULL values you have in email column?

------
heisenbug: A bug that disappears or alters its behavior when one attempts to probe or isolate it
schroedinbug: A bug that doesn't appear until someone reads source code and realizes it never should have worked, at which point the program promptly stops working for everybody until fixed.

[ba
 
nocoolhandle - excellent advice i have taken that onboard :)

there are no null values in theemail field
 
OK then... simple test:
Code:
select email
from myTable
group by email
having count(*) > 1
How many rows?

------
heisenbug: A bug that disappears or alters its behavior when one attempts to probe or isolate it
schroedinbug: A bug that doesn't appear until someone reads source code and realizes it never should have worked, at which point the program promptly stops working for everybody until fixed.

[ba
 
Strange... can you post some duplicate rows (only autoid and email columns) ?

------
heisenbug: A bug that disappears or alters its behavior when one attempts to probe or isolate it
schroedinbug: A bug that doesn't appear until someone reads source code and realizes it never should have worked, at which point the program promptly stops working for everybody until fixed.

[ba
 
Here's my version with a derived table:

Code:
DELETE C
FROM
  Customer C
  LEFT JOIN (
    SELECT MinId = Min(AutoID) FROM Customer GROUP BY Email
  ) M ON C.AutoID = M.MinId
WHERE
  M.MinID IS NULL

But do watch out for nulls. And do be careful with deletes and outer joins, they can be tricky. [smile]

-------------------------------------
Only the unvirtuous can be dutiful, for the virtuous already perform duty's requirements by inclination and for pleasure. Where there is no pain, no disinclination, there is no duty.
- Erik E
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top