carillonator
Programmer
Hi,
I'm pretty new to Perl, but I have experience with PHP. I've been asked to improve a Perl script, written by a coder of questionable skill, that analyzes a set of data about patents. The data file has 8 million lines that look like this:
Code:
patent #, char1, char2, char3, ... , char480
1234567,1,0,1,0,1,0, ... (480 characteristics)
The script compares each [binary] characteristic of each patent against the corresponding characteristic of every other patent (yes, that's 8 million squared x 480 comparisons!) and counts the number of differences for each patent pair (my attempt at the improved code is below).
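One idea I've been toying with, in case it helps (completely untested, and I may be misreading the pack/unpack docs): pack each row's 0/1 values into a bit string once, then get the difference count for a pair with a single string XOR plus a bit checksum, instead of a 480-iteration loop. Something like:
Code:
#!/usr/bin/perl
use strict;
use warnings;
# two sample rows, shortened to 8 characteristics for illustration
my @record1 = (1,0,1,0,1,0,1,1);
my @record2 = (1,1,1,0,0,0,1,0);
# pack each 0/1 list into a compact bit string (done once per patent)
my $bits1 = pack("b*", join("", @record1));
my $bits2 = pack("b*", join("", @record2));
# XOR the bit strings, then count the 1-bits with a bit checksum:
# that's the number of differing characteristics in one step
my $variance = unpack("%32b*", $bits1 ^ $bits2);
print "$variance\n";   # prints 3
Is that a sane approach for 480 characteristics?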
I've read that for loops are memory hogs and slow the program down, but I don't know what alternative I have in this situation. I've also read that assigning the input file to a bareword filehandle named OUT is a bad idea compared to using a lexical scalar, but again I'm not sure what the best way is.
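If I've understood the advice correctly, the recommended style is a three-argument open into a lexical scalar, something like this (the variable names are just my guesses):
Code:
# read the input through a lexical filehandle instead of the bareword OUT
open(my $in, "<", "patents.csv") or die "Could not open patents.csv: $!";
my @lines = <$in>;
close($in);
# open the output once, up front, rather than once per patent pair
open(my $out, ">", "variance.csv") or die "Could not open variance.csv: $!";
print $out "1234567,7654321,42\n";   # illustrative output line
close($out);
Is that right?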
The program will be run on an 8-core machine with 64 GB of memory. I'm going to add arguments to the script that limit execution to a range of the data (i.e., limit the outer loop), and run seven instances at the same time. Or is there a smarter way to allocate the resources?
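For the range arguments, I was picturing something like this (START and END are placeholders I made up):
Code:
#!/usr/bin/perl
use strict;
use warnings;
# hypothetical usage: perl variance.pl START END
# each of the seven instances gets its own slice of the outer loop
my ($start, $end) = @ARGV;
die "usage: $0 START END\n" unless defined $start && defined $end;
for (my $i = $start; $i <= $end; $i++) {
    # ... compare patent $i against every other patent ...
}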
Since it will take a VERY long time to run this program, the slightest improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated.
Thanks in advance!!
Code:
#!/usr/bin/perl
use strict;
use warnings;

# slurp the whole data file into memory
open(OUT, "<patents.csv") || die("Could not open patents.csv: $!\n");
my @lines = <OUT>;
close(OUT);
chomp @lines;

# clear the variance file if it exists
open(OUT, ">variance.csv") || die("Could not open variance.csv: $!\n");
close(OUT);

# iterate over all patents
for (my $i = 0; $i <= $#lines; $i++) {
    my @record1 = split(/,/, $lines[$i]);
    my $patno1  = shift(@record1);

    # iterate through all other lines to compare
    for (my $j = 0; $j <= $#lines; $j++) {
        # don't compare a record to itself
        next if $i == $j;

        my @record2 = split(/,/, $lines[$j]);
        my $patno2  = shift(@record2);

        # count the characteristics that differ for this pair
        my $variance = 0;
        for (my $k = 0; $k <= $#record1; $k++) {
            $variance++ if $record1[$k] != $record2[$k];
        }

        open(OUT, ">>variance.csv") || die("Could not open variance.csv: $!\n");
        print OUT "$patno1,$patno2,$variance\n";
        close(OUT);
    }
}
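And here's the direction I'm considering for the rewrite, pulling the ideas above together: parse and pack every line exactly once, keep the output handle open for the whole run, and (if I don't actually need both (a,b) and (b,a) in the output) start the inner loop at $i + 1 to halve the work. Completely untested:
Code:
#!/usr/bin/perl
use strict;
use warnings;

open(my $in, "<", "patents.csv") or die "Could not open patents.csv: $!";
chomp(my @lines = <$in>);
close($in);

# parse and pack every line exactly once, up front
my (@patnos, @packed);
for my $line (@lines) {
    my ($patno, @chars) = split(/,/, $line);
    push @patnos, $patno;
    push @packed, pack("b*", join("", @chars));
}

# one output handle for the whole run
open(my $out, ">", "variance.csv") or die "Could not open variance.csv: $!";

for my $i (0 .. $#packed) {
    # start at $i + 1 so each unordered pair is computed only once
    for my $j ($i + 1 .. $#packed) {
        my $variance = unpack("%32b*", $packed[$i] ^ $packed[$j]);
        print $out "$patnos[$i],$patnos[$j],$variance\n";
    }
}
close($out);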