Removing duplicates from large amount of records

(OP)
Hi all,

It's been quite a few years since I've had to deal with duplicates in MySQL, and I have a large table that I need to remove duplicate records from, so I'm looking for the least resource-intensive (and easiest) way to do it.

I've got the following setup:

id, postcode, street_address, postal_town

Now, I'm looking to remove any duplicates where all 3 of the above (other than id) are duplicated in more than one row, so for example:

Quote:

1,EH4 2BS,example terrace,edinburgh
2,EH4 2BS,example terrace,edinburgh
3,EH52 3SD,example street,broxburn
4,EH52 3SW,example road,broxburn
5,EH4 2BS,example terrace,edinburgh

In the above, rows 1, 2 and 5 are duplicates and I want to remove all but one of them.

Things I need to take into account:

1) Street names can exist in different towns, so I cannot just pull out unique street names or unique postcodes. All 3 fields must be identical before anything is removed.
2) This table has just under 30 million records.
3) Any number of records could be duplicated and there could be any number of duplicates of each.
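
Before deleting anything, it may help to see how many duplicate groups there actually are. A minimal sketch, assuming the table is called `currtable` (the name used in the replies; the original post does not name it) and has the four columns listed above:

```sql
-- List the address triplets that occur more than once, worst offenders first.
-- Table name currtable is an assumption; column names come from the post.
SELECT postcode, street_address, postal_town, COUNT(*) AS copies
FROM currtable
GROUP BY postcode, street_address, postal_town
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 20;
```

Dropping the `LIMIT` and wrapping the query in `SELECT COUNT(*) FROM (...) AS d` would give the total number of duplicated triplets, at the cost of a full scan of the table.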

Thanks in advance,

Wullie

The pessimist complains about the wind. The optimist expects it to change. The leader adjusts the sails. - John Maxwell

Story Choices - Your Story, Your Way

RE: Removing duplicates from large amount of records

The postcode IS THE unique item in your criteria though, EH4 2BS only applies to ONE street (or part of a street) in ONE town. Two postcodes may point to one street in one town, but they cannot point to one street name in different towns.

So all your delete routine has to do is copy ONE row per postcode (GROUP BY postcode) to a temporary table, then delete the original and rename the temporary table.
So, create a duplicate-free temp table with:

CODE --> SQL

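-- Keeps one arbitrary row per postcode; needs ONLY_FULL_GROUP_BY switched off in sql_mode (it is on by default since MySQL 5.7.5).
CREATE TABLE temptable AS SELECT * FROM currtable GROUP BY postcode;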


Delete the original (with duplicates)

CODE --> SQL

DROP TABLE currtable; 


rename the temp table to the same name as the former duplicated one

CODE --> SQL

RENAME TABLE temptable TO currtable; 
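
A slightly more cautious variant of the last two steps is to swap the tables in a single `RENAME TABLE` statement and keep the original around until the result has been checked. This is only a sketch; `currtable_old` is a hypothetical name, not something from the thread:

```sql
-- Swap both names in one statement, so queries never see a missing table.
-- currtable_old is a hypothetical backup name.
RENAME TABLE currtable TO currtable_old,
             temptable TO currtable;

-- Only after the new table has been verified:
-- DROP TABLE currtable_old;
```

The trade-off is disk space: with roughly 30 million rows, both copies exist side by side until the old one is dropped.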



Chris.

Indifference will be the downfall of mankind, but who cares?
Time flies like an arrow, however, fruit flies like a banana.
Webmaster Forum

RE: Removing duplicates from large amount of records

(OP)
Hi Chris,

Thanks for your response. Good to see some of the usual members are still here and posting on a regular basis. :)

I originally went the same route as you, but was then told that a postcode can, technically and within the rules, cover more than one street in some unusual instances. So I cannot use the postcode alone; I also need to check that the other details are the same before removing any duplicates.

Obviously with a very small dataset I could do this easily without worrying about the query, however in this case I'm looking for the least resource consuming way to do it with 30 million records.
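
One in-place option, without a temporary table, might be a self-join delete that compares all three columns. This is a sketch under assumptions: the table name `currtable` and the index name `idx_addr_triplet` are made up for illustration, and on a table this size it would be worth trying on a copy first.

```sql
-- A composite index lets the self-join find matching rows without
-- scanning the whole table for every row. Index name is hypothetical;
-- very long VARCHAR columns may need prefix lengths here.
CREATE INDEX idx_addr_triplet
    ON currtable (postcode, street_address, postal_town);

-- Delete every row that has an identical twin with a lower id,
-- which leaves the lowest id of each duplicate group in place.
DELETE dup
FROM currtable AS dup
JOIN currtable AS keep
  ON  keep.postcode       = dup.postcode
  AND keep.street_address = dup.street_address
  AND keep.postal_town    = dup.postal_town
  AND keep.id             < dup.id;
```

Rows with a NULL in any of the three columns never compare equal, so they would not be treated as duplicates by this join.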

Thanks in advance,

Wullie


RE: Removing duplicates from large amount of records

Try with

CODE --> SQL

-- DISTINCT over the three address columns; note that the id column is not copied.
CREATE TABLE temptable AS SELECT DISTINCT postcode, street_address, postal_town FROM currtable;

That should remove duplicates where postcode, street and town are identical.

Going further than that will need a recursive procedure to test the row criteria against the already selected records.
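
If the original `id` values and the table's indexes need to survive the copy, a variant along these lines may work. It is a sketch, assuming the table has exactly the four columns from the first post and that keeping the lowest `id` of each group is acceptable:

```sql
-- Empty copy with the same column definitions and indexes as the original.
CREATE TABLE temptable LIKE currtable;

-- Copy one row per distinct (postcode, street_address, postal_town),
-- keeping the lowest id found in each group.
INSERT INTO temptable (id, postcode, street_address, postal_town)
SELECT MIN(id), postcode, street_address, postal_town
FROM currtable
GROUP BY postcode, street_address, postal_town;
```

Because every selected column is either aggregated or listed in the `GROUP BY`, this form does not depend on the server's `ONLY_FULL_GROUP_BY` setting.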

Chris.


RE: Removing duplicates from large amount of records

Surely you can just create a compound unique index? With the IGNORE keyword, MySQL will automatically drop the duplicate rows while building it.

CODE

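-- Requires MySQL older than 5.7.4; the IGNORE clause of ALTER TABLE was removed in that release.
ALTER IGNORE TABLE mytable
ADD UNIQUE INDEX myCompoundIndex (
 postcode, street_address, postal_town
);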
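
For servers that reject the `IGNORE` clause on `ALTER TABLE`, a similar effect can probably be had by putting the unique index on an empty copy and filling it with `INSERT IGNORE`. A sketch only; `mytable_dedup` is a hypothetical name:

```sql
-- Empty copy of the table structure; mytable_dedup is a hypothetical name.
CREATE TABLE mytable_dedup LIKE mytable;

-- The unique index goes on while the copy is still empty.
ALTER TABLE mytable_dedup
  ADD UNIQUE INDEX myCompoundIndex (postcode, street_address, postal_town);

-- Rows that would violate the unique index are skipped with a warning.
-- ORDER BY id means the lowest id of each duplicate group is the one kept.
INSERT IGNORE INTO mytable_dedup
SELECT * FROM mytable
ORDER BY id;
```

Once the copy has been checked, it can be renamed over the original. The unique index also stays in place afterwards, so new duplicates are rejected at insert time.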

RE: Removing duplicates from large amount of records

(OP)
Thanks guys. Much appreciated. :)

Wullie

