
Python Code - Eliminate all the duplicated occurrences

(OP)

Hello, I would like to write a Python script that eliminates all duplicated occurrences in the second column, keeping the first match.

For an input like this:

CODE -->

101000249 101000249
101000250 5552931
101000251 101000251
101000254 5552931
101000255 101000255
101000256 101000256
101000257 5552605
101000258 5552605
101000259 101000259
101000260 101000260 


I should get that:

CODE -->

101000249 101000249
101000250 5552931
101000251 101000251
101000255 101000255
101000256 101000256
101000257 5552605
101000259 101000259
101000260 101000260 

The python code that I attempted is the following:

CODE -->

#!/bin/python

file_object=open('file1.txt','r')
file_object2=open('file2.txt','w')

read_data=file_object.readlines()
nd=[]

for line in read_data:
        s=line
        if s[2] not in nd:
                nd.append(s[2])
                line = line.strip('\n')
                file_object2.write(str(line)+"\n") 


Thank you very much for your support!

RE: Python Code - Eliminate all the duplicated occurrences

(OP)
I think I get it: I was filtering on the third character of each line (s[2]) instead of the second field.

CODE -->

#!/bin/python
file_object=open('sample.txt','r')
file_object2=open('sample2.txt','w')

read_data=file_object.readlines()
nd=[]

for line in read_data:
        s=line
        b=s.split()
        if b[1] not in nd:
                nd.append(b[1])
                line = line.strip('\n')
                file_object2.write(str(line)+"\n") 

Does anybody agree?

RE: Python Code - Eliminate all the duplicated occurrences

(OP)

The Python script took about 10 hours to remove the duplicated entries from a file with 1,970,028 lines. Do you guys have some suggestions to make the code a bit faster?
I hear that Python is very fast, but I am wondering if a shell script would be faster in this case.

RE: Python Code - Eliminate all the duplicated occurrences

Hi

Quote (nvhuser)

read_data=file_object.readlines()
You mean you have the script slurp the whole ~2 million line file into memory? Well, I usually avoid such practices.
  • Variable s is not necessary
  • strip()ing the newline just to add it back in the next line is pointless
  • Converting the line into str() is pointless

CODE --> Python

#!/bin/python

file_object=open('sample.txt','r')
file_object2=open('sample2.txt','w')
nd=[]

for line in file_object:
        b=line.split()[1]
        if b not in nd:
                nd.append(b)
                file_object2.write(line)

file_object.close()
file_object2.close() 

But on ~2 million lines maybe a database would perform better. If that processing has to be done frequently, I would give it a try with an SQL database. If nothing else is handy, SQLite may do it too.
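If the SQLite route sounds interesting, here is a minimal sketch using Python's built-in sqlite3 module; the function name, table name and file names are illustrative, not from the thread:

```python
import sqlite3

def dedupe_sqlite(infile, outfile):
    """Keep only the first line for each distinct value in column 2,
    using an in-memory SQLite table instead of a Python dict."""
    conn = sqlite3.connect(':memory:')
    conn.execute('create table sample (id integer primary key, c1 text, c2 text)')
    # id is an alias for SQLite's rowid, so it auto-increments in file order
    with open(infile) as src:
        conn.executemany('insert into sample (c1, c2) values (?, ?)',
                         (line.split() for line in src))
    # for each distinct c2, keep the row with the smallest id (first match)
    with open(outfile, 'w') as dst:
        for c1, c2 in conn.execute(
                'select c1, c2 from sample'
                ' where id in (select min(id) from sample group by c2)'
                ' order by id'):
            dst.write(c1 + ' ' + c2 + '\n')
    conn.close()
```

Called as dedupe_sqlite('sample.txt', 'sample2.txt') it should produce the same output as the dict approach; for repeated processing the database could live in a file instead of :memory:.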

Feherke.
feherke.ga

RE: Python Code - Eliminate all the duplicated occurrences

2
The above code should work faster if you change nd to a dict instead of a list:
searching for an item in a list takes linear time, O(n);
searching for an item in a dictionary takes constant time, O(1).
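To illustrate the difference, a rough self-contained benchmark (timings vary by machine; 5000 keys is just an illustrative size):

```python
import time

keys = [str(i) for i in range(5000)]  # 5000 distinct keys

# list: each "in" test scans the whole list -> O(n) per lookup
nd_list = []
t0 = time.perf_counter()
for k in keys:
    if k not in nd_list:
        nd_list.append(k)
list_time = time.perf_counter() - t0

# dict: each "in" test is a single hash lookup -> O(1) per lookup
nd_dict = {}
t0 = time.perf_counter()
for k in keys:
    if k not in nd_dict:
        nd_dict[k] = None
dict_time = time.perf_counter() - t0

print(list_time, dict_time)  # the dict version is typically far faster
```

The gap widens as the number of unique keys grows, which is why the list version degrades so badly at ~2 million lines.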

RE: Python Code - Eliminate all the duplicated occurrences

Hi

Quote (JustinEzequiel)

above code should work faster if you change nd to a dict instead of a list
Great suggestion!

It sounds scary at first to put so much data into a dict, but this seems to be far below the limit where a dict becomes unfeasible due to its internal storage's memory requirements.

Feherke.
feherke.ga

RE: Python Code - Eliminate all the duplicated occurrences

(OP)
Sorry guys, it seems that I need to study harder.

Are you suggesting to do something like this?

CODE -->

for line in file_object:
    dict['+str(line.split()[0])+']="+str(line.split()[1])+" 

RE: Python Code - Eliminate all the duplicated occurrences

Hi

Definitely not.

CODE --> Python

#!/bin/python

file_object=open('sample.txt','r')
file_object2=open('sample2.txt','w')
nd={}

for line in file_object:
        b=line.split()[1]
        if b not in nd:
                nd[b]=None
                file_object2.write(line)

file_object.close()
file_object2.close() 
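For what it's worth, a Python set expresses the same idea as a dict with None values, with the same O(1) membership test. A sketch wrapped in a function for convenience (the name dedupe_set is illustrative):

```python
def dedupe_set(infile, outfile):
    # Same logic as the dict version: keep a line only the first
    # time its second field is seen; set lookup is a hash lookup.
    seen = set()
    with open(infile) as src, open(outfile, 'w') as dst:
        for line in src:
            key = line.split()[1]
            if key not in seen:
                seen.add(key)
                dst.write(line)
```

The with statement also closes both files automatically, even if an error occurs mid-loop.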

Feherke.
feherke.ga

RE: Python Code - Eliminate all the duplicated occurrences

(OP)
Wow, I tested the Python code using the dictionary and it is really fast.

Thank you, both of you!

Quote (Feherke)

But on ~2 million lines maybe a database would perform better. If that processing has to be done frequently, I would give it a try with an SQL database. If nothing else is handy, SQLite may do it too.

Is SQL easy to use? I have some experience with MySQL, will it help?

RE: Python Code - Eliminate all the duplicated occurrences

Hi

This should do it in MySQL :

CODE --> MySQL

create table sample (
    id integer primary key auto_increment,
    c1 text,
    c2 text
);

load data infile 'sample.txt'
into table sample
fields terminated by ' '
(c1, c2);

select
c1, c2

from sample join (
    select
    min(id) id

    from sample

    group by c2
) foo using (id)

order by id

into outfile 'sample2.txt'
fields terminated by ' ';

drop table sample; 
Note that the input file will be /var/lib/mysql/<database_name>/sample.txt. I have not played much with it, but for your ~2 million records an index on column c2 may help.

Feherke.
feherke.ga

RE: Python Code - Eliminate all the duplicated occurrences

2
Hi

After all these discussions I was curious how well MySQL would perform. So I generated a file with 2 million rows, of which 1 million were unique.
  • Python / dict : ~2 seconds
  • MySQL : ~2 minutes
  • Python / list : ~2 hours for 99%, then the last 167 rows seemed like they would never finish, so I killed it
( While an off-topic solution is already included, AFAIK the shortest code for this task would be the 15 characters of Awk : n[$2]?0:n[$2]=1. With this code gawk does the work in ~2 seconds and mawk in ~2 minutes. )
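Assuming the thread's sample data, the Awk one-liner above can be run like this (with whichever awk is installed; gawk and mawk behave the same here):

```shell
# sample input from the original post
printf '%s\n' \
  '101000249 101000249' '101000250 5552931' '101000251 101000251' \
  '101000254 5552931' '101000255 101000255' '101000256 101000256' \
  '101000257 5552605' '101000258 5552605' '101000259 101000259' \
  '101000260 101000260' > sample.txt

# keep only the first line for each distinct value in column 2:
# n[$2] is unset the first time a value appears, so the assignment
# n[$2]=1 is evaluated; its value (1) is true, triggering the
# default print action. On later occurrences n[$2] is already 1,
# the expression yields 0, and the line is not printed.
awk 'n[$2]?0:n[$2]=1' sample.txt > sample2.txt
```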

Feherke.
feherke.ga
