Comparing 2 huge files

leila1983 · Aug 27, 2007

Hi,
I have just started writing Perl programs. I want to compare two files, but they are huge (more than 4 million lines each) and I don't want an exact match. I have written a script, but it is very slow: it takes more than a few days. I believe Perl should be much faster than that, so something must be wrong in my program.
First I read the files and push their lines into hashes. Then, for each line, I split it into fields and apply my matching rules. For example, the first and second fields must match exactly, and the third field has a tolerance of 1 (if the third field is 4 in the first file, then 3, 4 or 5 is acceptable in the second file).
My other problem: if two records in the second file match one record in the first, I want to match only one of them. In other words, I want one-to-one matches.
I need help solving this. Here is my script; its comparison rules are not complete yet.

#!/usr/bin/perl -w

# Read the first file; every distinct line becomes a hash key,
# and the value counts how often that line occurs.
open(mci_1_3_, '/ictedrin/ICTPRD/CdrExtract/26_feb/mednet_2.txt') ||
    die "open: $!";

while (<mci_1_3_>) {
    chomp;
    $lines_1{$_}++;
}
close(mci_1_3_);

print "first file was read and push to \n";

# Read the second file the same way.
open(itc_3_1, '/ictedrin/ICTPRD/CdrExtract/26_feb/ITC_MEDNET_1.txt') ||
    die "open: $!";

while (<itc_3_1>) {
    chomp;
    $lines_2{$_}++;
}
close(itc_3_1);

print "second file was read and push to \n";
$same     = 0;
$not_same = 0;

foreach $key_1 (keys %lines_1) {        # once for each key of %lines_1
    $flag = 0;
    my @lines_1 = split(/,/, $key_1);

    # Inner loop over the whole second file: this nested scan is what
    # makes the script so slow on millions of lines.
    foreach $key_2 (keys %lines_2) {    # once for each key of %lines_2
        if ($flag == 1) {
            last;
        }

        my @lines_2 = split(/,/, $key_2);

        # A value of 2 marks a line that was already matched. Note that a
        # line occurring twice in the second file also has the value 2.
        if ($lines_2{$key_2} == 2) {
            next;
        }

        if (substr($lines_1[1], 1) eq $lines_2[0] &&
            substr($lines_1[2], 1) eq substr($lines_2[1], 2)) {

            $lines_2{$key_2} = '2';
            $same++;
            $flag = 1;

            print " they are same:\n $key_1 $key_2\n";
            print " ############ keys are :\n $lines_1{$key_1} $lines_2{$key_2}\n";
        }
    }
    if ($flag == 0) {
        $not_same++;
    }
}
print "we have $same same lines\n";
print "we have $not_same not same lines\n";

Thanks in advance

prex1 · Aug 27, 2007

Your code can certainly be optimized a lot. For example, it is not efficient to use hashes with very long keys; you can read the files into arrays instead (by the way, if a file contains identical lines, with hashes you lose the duplicates). Another optimization is to read only one file into memory, then read the second one line by line and make your equality checks on a per-line basis. A third is to check the first field for equality first, and go on to the other fields only if it matches, and so on.
However, my main point is that if there is no structure in your data, this operation will inevitably take very long.
It would be better if you first described the structure of your files.
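As a rough sketch of the "one file in memory, the other line by line" idea, the code below indexes the first file by its exact-match fields and then streams the second file, so that each line is only compared against records sharing the same key. It assumes the rule from the first post (fields 1 and 2 exact, field 3 within 1) and uses invented sample data held in strings instead of the real files; the names %index, $file_1 and $file_2 are hypothetical. Removing a record once it has matched gives one-to-one matching, although this first-fit choice is not guaranteed to find the largest possible set of pairs.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two files (not from the thread).
my $file_1 = "100,200,4\n100,200,9\n300,400,7\n";
my $file_2 = "100,200,5\n100,200,3\n300,400,9\n";

# Index file 1 by its exact-match fields (1 and 2); store field 3 per record.
my %index;
open(my $in_1, '<', \$file_1) or die "open: $!";
while (my $line = <$in_1>) {
    chomp $line;
    my ($f1, $f2, $f3) = split /,/, $line;
    push @{ $index{"$f1,$f2"} }, $f3;
}
close($in_1);

# Stream file 2 and test each line only against records with the same key.
my ($same, $not_same) = (0, 0);
open(my $in_2, '<', \$file_2) or die "open: $!";
while (my $line = <$in_2>) {
    chomp $line;
    my ($f1, $f2, $f3) = split /,/, $line;
    my $candidates = $index{"$f1,$f2"} || [];

    # Find the first unused record whose third field is within 1.
    my $hit;
    for my $i (0 .. $#$candidates) {
        if (abs($candidates->[$i] - $f3) <= 1) {
            $hit = $i;
            last;
        }
    }

    if (defined $hit) {
        splice(@$candidates, $hit, 1);   # consume it: one-to-one matching
        $same++;
    }
    else {
        $not_same++;
    }
}
close($in_2);

print "we have $same same lines\n";
print "we have $not_same not same lines\n";
```

Here the "not same" count refers to lines of the second file that found no partner, which differs from the original script (it counted lines of the first file). With real files, the in-memory strings would be replaced by the file paths.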

prex1

http://www.xcalcs.com

: Online tools for structural design

http://www.megamag.it

: Magnetic brakes for fun rides

http://www.levitans.com

: Air bearing pads

leila1983 · Aug 27, 2007

Hi,
Thanks for your reply. I am trying to use arrays instead of hashes. The structure of my files is as follows:
Every line has the same fields. The first, second and third fields are numbers that must be identical. The fourth is a date, which must also be identical. The fifth is a time (hours, minutes, seconds), which I want to match with a tolerance of 1 minute. The last is a duration (in seconds), which can have a tolerance of 10 seconds.

Thanks again
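The rules described above could be written as a single comparison function, roughly as follows. This is only a sketch: the thread does not show real data, so the "HH:MM:SS" time format, the date format, the sample records and the names time_to_seconds and records_match are all assumptions.

```perl
use strict;
use warnings;

# Convert "HH:MM:SS" (an assumed format) to seconds since midnight.
sub time_to_seconds {
    my ($time) = @_;
    my ($h, $m, $s) = split /:/, $time;
    return $h * 3600 + $m * 60 + $s;
}

# Return 1 if two comma-separated records match under the stated rules, else 0.
sub records_match {
    my ($rec_1, $rec_2) = @_;
    my @a = split /,/, $rec_1;
    my @b = split /,/, $rec_2;

    # Fields 1-3 (numbers) and field 4 (date) must be identical.
    # They are compared as strings here.
    for my $i (0 .. 3) {
        return 0 if $a[$i] ne $b[$i];
    }

    # Field 5 (time): at most 60 seconds apart.
    return 0 if abs(time_to_seconds($a[4]) - time_to_seconds($b[4])) > 60;

    # Field 6 (duration in seconds): at most 10 seconds apart.
    return 0 if abs($a[5] - $b[5]) > 10;

    return 1;
}

# Hypothetical records, invented for illustration.
my $r1 = "11,22,33,2007-02-26,10:15:30,120";
my $r2 = "11,22,33,2007-02-26,10:16:10,128";   # 40 s later, duration differs by 8
my $r3 = "11,22,33,2007-02-26,10:17:00,120";   # 90 s later

print records_match($r1, $r2) ? "match\n" : "no match\n";
print records_match($r1, $r3) ? "match\n" : "no match\n";
```

Because the date must be identical, two records that fall just before and just after midnight would not match under this sketch, even if they are less than a minute apart.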

prex1 · Aug 27, 2007

OK, but I meant also: is there any structure connecting or grouping the lines between them or across the two files? Are the lines in each file ordered in some way?
Also: why are you comparing two files? Or what do the two files share or have in common?
Last but not least: could you modify the structure of those files to add some sort of indexing field that would help in the compare operation?
Don't have a definite idea in mind at the moment, just trying to fix the boundaries of the problem.

KevinADC · Aug 27, 2007

Post some sample lines of the data, along with a description of the data.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]