Python Code - Eliminate all the duplicated occurences
(OP)
Hello, I would like to write a Python script that eliminates all the duplicated occurrences in the second column, keeping the first match.
For an input like this:
CODE -->
101000249 101000249
101000250 5552931
101000251 101000251
101000254 5552931
101000255 101000255
101000256 101000256
101000257 5552605
101000258 5552605
101000259 101000259
101000260 101000260
I should get that:
CODE -->
101000249 101000249
101000250 5552931
101000251 101000251
101000255 101000255
101000256 101000256
101000257 5552605
101000259 101000259
101000260 101000260
The python code that I attempted is the following:
CODE -->
#!/bin/python
file_object=open('file1.txt','r')
file_object2=open('file2.txt','w')
read_data=file_object.readlines()
nd=[]
for line in read_data:
    s=line
    if s[2] not in nd:
        nd.append(s[2])
        line = line.strip('\n')
        file_object2.write(str(line)+"\n")
Thank you very much for your support!

RE: Python Code - Eliminate all the duplicated occurences
CODE -->
#!/bin/python
file_object=open('sample.txt','r')
file_object2=open('sample2.txt','w')
read_data=file_object.readlines()
nd=[]  # second-column values seen so far
for line in read_data:
    s=line
    b=s.split()  # split the line into its columns
    if b[1] not in nd:
        nd.append(b[1])
        line = line.strip('\n')
        file_object2.write(str(line)+"\n")
Does anybody agree?
RE: Python Code - Eliminate all the duplicated occurences
The Python script took about 10 hours to remove the duplicated entries from a file with 1970028 lines. Do you guys have any suggestions to make the code a bit faster?
I hear that Python is very fast, but I am wondering if a shell script would be faster in this case.
RE: Python Code - Eliminate all the duplicated occurences
You mean you have the script slurp the whole ~2 million line file into memory? Well, I usually avoid such practices.
CODE --> Python
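A minimal sketch of what avoiding the slurp could look like, offered as an illustration and not necessarily the code originally posted here. It assumes whitespace-separated columns as in the sample input; the names `filter_first_matches` and `copy_first_matches` are made up for the example. Only the reading changes: the file object is iterated lazily, one line at a time, while the `nd` list lookup stays as in the script above.

```python
def filter_first_matches(lines):
    """Yield each line whose second column has not been seen before.

    `lines` can be any iterable of strings, for example an open file
    object, which Python reads lazily one line at a time.
    """
    nd = []  # second-column values seen so far (still a list, as above)
    for line in lines:
        b = line.split()
        if len(b) < 2:
            continue  # assumption: skip blank or one-column lines
        if b[1] not in nd:
            nd.append(b[1])
            yield line


def copy_first_matches(src_path="sample.txt", dst_path="sample2.txt"):
    """Stream src_path to dst_path without loading the whole file."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(filter_first_matches(src))
```

Calling `copy_first_matches()` with the default names reads `sample.txt` and writes `sample2.txt`, as in the earlier script.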
But on ~2 million lines maybe a database would perform better. If that processing has to be done frequently, I would give it a try with an SQL database. If nothing else is handy, SQLite may do it too.
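For the SQLite route, a minimal sketch using Python's standard `sqlite3` module. The table name `pairs`, the column names and the helper `dedupe_with_sqlite` are made up for this example, and the `UNIQUE` constraint combined with `INSERT OR IGNORE` is just one way to keep the first match; no claim that it is the fastest.

```python
import sqlite3


def dedupe_with_sqlite(rows):
    """Return (first, second) pairs, keeping the first row per second value."""
    con = sqlite3.connect(":memory:")  # assumption: data fits an in-memory DB
    try:
        con.execute(
            "CREATE TABLE pairs ("
            " id INTEGER PRIMARY KEY,"   # grows with insertion order
            " a TEXT NOT NULL,"
            " b TEXT NOT NULL UNIQUE)"   # repeated second-column values rejected
        )
        # OR IGNORE silently skips rows that would violate the UNIQUE
        # constraint, so only the first row for each value of b is stored.
        con.executemany("INSERT OR IGNORE INTO pairs (a, b) VALUES (?, ?)", rows)
        return con.execute("SELECT a, b FROM pairs ORDER BY id").fetchall()
    finally:
        con.close()
```

The `rows` argument can be any iterable of two-item sequences, for instance `(line.split()[:2] for line in open_file)`.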
Feherke.
feherke.ga
RE: Python Code - Eliminate all the duplicated occurences
Searching for an item in a list takes linear time, O(n).
Searching for an item in a dictionary takes constant time on average, O(1).
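A quick way to see the difference, as a rough sketch: time a worst-case lookup (a value that is not present) against a list and against a dict holding the same keys. The size `N` and all names are arbitrary choices for the example, and the timings will vary by machine.

```python
import timeit

N = 100_000  # arbitrary size for the illustration
keys = [str(i) for i in range(N)]

as_list = list(keys)
as_dict = dict.fromkeys(keys, True)

missing = "not-there"  # worst case for the list: every element is compared


def lookup_in_list():
    return missing in as_list  # scans all N elements


def lookup_in_dict():
    return missing in as_dict  # one hash lookup on average


list_time = timeit.timeit(lookup_in_list, number=100)
dict_time = timeit.timeit(lookup_in_dict, number=100)
print(f"list: {list_time:.4f}s  dict: {dict_time:.6f}s  (100 lookups each)")
```

In the de-duplication loop this lookup runs once per input line, which is why the list version degrades so badly on large files.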
RE: Python Code - Eliminate all the duplicated occurences
Great suggestion!
At first it sounds scary to put that much data in a dict, but this seems to be far below the limit where a dict becomes unfeasible because of its internal storage's memory requirements.
Feherke.
feherke.ga
RE: Python Code - Eliminate all the duplicated occurences
Are you suggesting to do something like this?
CODE -->
for line in file_object:
    dict['+str(line.split()[0])+']="+str(line.split()[1])+"
RE: Python Code - Eliminate all the duplicated occurences
Definitely not.
CODE --> Python
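A minimal sketch of the dict approach, as an illustration and not necessarily the code originally posted here: the second column is the key, and a line is written only the first time its key shows up. The names `dedupe_second_column` and `dedupe_file` are made up; the file names are the ones from the earlier script.

```python
def dedupe_second_column(lines):
    """Yield lines whose second column appears for the first time."""
    seen = {}  # second-column value -> True; lookup is O(1) on average
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: ignore blank or malformed lines
        key = fields[1]
        if key not in seen:
            seen[key] = True
            yield line


def dedupe_file(src_path="sample.txt", dst_path="sample2.txt"):
    """Stream src_path to dst_path, dropping repeated second-column values."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(dedupe_second_column(src))
```

A `set` would do the same job with slightly less bookkeeping; a dict is used here because that is what was suggested above.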
Feherke.
feherke.ga
RE: Python Code - Eliminate all the duplicated occurences
Thank you, both of you!
Is SQL easy to use? I have some experience with MySQL; will that help?
RE: Python Code - Eliminate all the duplicated occurences
This should do it in MySQL:
CODE --> MySQL
Feherke.
feherke.ga
RE: Python Code - Eliminate all the duplicated occurences
After all these discussions I was curious how well MySQL would perform. So I generated a file with 2 million rows, of which 1 million are unique.
- Python / dict : ~2 seconds
- MySQL : ~2 minutes
- Python / list : ~2 hours for 99%, then the last 167 rows seemed to never finish so I killed it
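To reproduce this kind of comparison, a test file can be generated along these lines. This is a sketch under assumptions: the exact distribution used above is not stated, so here every second-column value appears at least once and the repeats are drawn at random; the function names and the file name `bench.txt` are made up.

```python
import random


def make_rows(total=2_000_000, unique=1_000_000, seed=0):
    """Return `total` (first, second) pairs with `unique` distinct second values.

    Assumes total >= unique.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    # Each unique value once, then random repeats to fill up to `total`.
    seconds = list(range(unique))
    seconds += [rng.randrange(unique) for _ in range(total - unique)]
    rng.shuffle(seconds)
    # First column: a running number; second column: the possibly repeated value.
    return [(100_000_000 + i, s) for i, s in enumerate(seconds)]


def write_rows(rows, path="bench.txt"):
    """Write the pairs as two whitespace-separated columns, one pair per line."""
    with open(path, "w") as out:
        for first, second in rows:
            out.write(f"{first} {second}\n")
```

Something like `write_rows(make_rows())` then produces a 2 million line input for whichever variant is being timed.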
(Since off-topic solutions have already been included, AFAIK the shortest code for this task would be 15 characters of Awk: n[$2]?0:n[$2]=1. With this code gawk does the work in ~2 seconds and mawk in ~2 minutes.)
Feherke.
feherke.ga