Huge data file and looping best practices


carillonator

Programmer
Apr 25, 2009
2
US
Hi,

I'm pretty new to Perl, but I have experience with PHP. I have been asked to improve a Perl script written by a questionable coder, which analyzes a set of data about patents. The data file has 8 million lines, which look like this:

Code:
patent #, char1, char2, char3, ... , char480 
1234567,1,0,1,0,1,0, ... (480 characteristics)

The script compares each [binary] characteristic of each patent with every other patent (yes, that's 8 million squared x 480 comparisons!) and counts the number of differences for each patent pair (my attempt at the improved code is below).

I've read that for loops are memory hogs and slow the program down, but I don't know what alternative I have in this situation. I've also read that opening the input file on the bareword handle OUT is not a good idea, as opposed to using a scalar (lexical) filehandle, but again I'm not sure what the best way is.
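From what I've read, the lexical-filehandle style would look something like this (not sure if I have it right):

Code:
# sketch: lexical filehandles with the three-argument open, instead of the OUT bareword
open my $in,  '<', 'patents.csv'  or die "Could not open patents.csv: $!\n";
open my $out, '>', 'variance.csv' or die "Could not open variance.csv: $!\n";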

The program will be run on an 8-core machine with 64G memory. I'm going to allow arguments to the script that limit execution to only a range of the data (i.e. limit the first loop), and run 7 different instances at the same time. Or, is there a smarter way to allocate resources?
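For the range idea, I'm picturing something like this (a rough sketch; the start/end arguments are just my guess at how to carve up the work):

Code:
# sketch: limit the outer loop to a slice of the data, e.g.  ./compare.pl 0 1000000
my ($start, $end) = @ARGV;
$end = $#lines if !defined $end || $end > $#lines;

for (my $i = $start; $i <= $end; $i++) {
    # compare $lines[$i] against the rest of the file, as below
}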

Since it will take a VERY long time to run this program, the slightest improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated.

Thanks in advance!!

Code:
#!/usr/bin/perl 
use strict; 
 
my(@lines,$patno1,$patno2,@record1,@record2); 
 
open(OUT, "<patents.csv")|| die("Could not open file!\n"); 
@lines=<OUT>; 
close(OUT); 
 
#clear variance file if it exists 
open(OUT, ">variance.csv")|| die("Could not open file variance.csv!\n"); 
close(OUT); 
 
map(chomp,@lines); 
 
# iterate over all patents 
for(my $i=0;$i<=$#lines;$i++) 
{ 
	@record1=split(/\,/,$lines[$i]); 
	$patno1=shift(@record1);  
	 
	# iterate through all other lines to compare 
	for(my $j=0;$j<=$#lines;$j++) 
	{ 
		@record2=split(/\,/,$lines[$j]); 
		$patno2=shift @record2; 
		 
		# don't compare a record to itself 
		if($i!=$j) 
		{ 
			my $variance=0; 
			 
			# iterate through each characteristic 
			for(my $k=0;$k<=$#record1;$k++) 
			{ 
				if($record1[$k]!=$record2[$k]) 
				{ 
					$variance++;					 
				} 
			} 
		 
		open(OUT, ">>variance.csv")|| die("Could not open file variance.csv!\n"); 
		print OUT $patno1.",".$patno2.",".$variance."\n"; 
		close(OUT);		 
			 
		} 
	} 
}
 
I do see that the whole 6G data file is brought into memory. I'm guessing I need to use while loops, but I'm not sure the best way to do it.
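Something like this is what I'm picturing for the outer loop (again just a sketch, and I realize the inner loop still needs the data in memory somehow):

Code:
# sketch: stream the outer records with a while loop instead of slurping a second copy
open my $in, '<', 'patents.csv' or die "Could not open patents.csv: $!\n";
while (my $line = <$in>) {
    chomp $line;
    my ($patno1, @record1) = split /,/, $line;
    # ... inner comparison loop against the in-memory data goes here ...
}
close $in;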
 
Hi

I would say, the author was not a dedicated documentation reader.
carillonator said:
Code:
map(chomp,@lines);
perldoc said:
If you chomp a list, each element is chomped, and the total number of characters removed is returned.
That is not such a huge slowdown :
Code:
[blue]master #[/blue] time seq 1000000 | perl -e '@l=<>;map(chomp,@l)'
real    0m1.889s
user    0m1.766s
sys     0m0.119s
[blue]master #[/blue] time seq 1000000 | perl -e '@l=<>;chomp @l'
real    0m1.601s
user    0m1.616s
sys     0m0.127s
But it seems to be a bad idea when handling a huge amount of data :
Code:
[blue]master #[/blue] time seq 10000000 | perl -e '@l=<>;map(chomp,@l)'
[red]Out of memory![/red]

real    0m17.970s
user    0m17.813s
sys     0m1.242s

[blue]master #[/blue] time seq 10000000 | perl -e '@l=<>;chomp @l'
real    0m17.122s
user    0m17.249s
sys     0m1.132s
But the real problem is [tt]split[/tt]ting up the lines in both loops. It would probably be better to cache the result.

Maybe I misunderstand the task, but I would say there is no need to loop from 0 in the inner loop, because the number of differences between lines n and m is the same as between lines m and n.
Code:
    [gray]# iterate through all [s]other[/s] [/gray][red]remaining[/red][gray] lines to compare[/gray]
    for (my $j=[red]$i+1[/red];$j<$#lines;$j++)
But maybe the ugliest "trick" is to [tt]open[/tt] and [tt]close[/tt] the output file for each line you output.

So I would start with a rewrite :
Code:
#!/usr/bin/perl
use strict;

open(OUT, "<patents.csv") || die("Could not open file!\n");
my @lines=<OUT>;
close(OUT);

# open variance file
open(OUT, ">variance.csv") || die("Could not open file variance.csv!\n");

chomp @lines;

# split up all lines
for my $s (@lines) { $s=[ split /,/,$s ] }

# iterate over all patents
for (my $i=0;$i<scalar @lines;$i++) {

  # iterate through all remaining lines to compare
  for (my $j=$i+1;$j<scalar @lines;$j++) {

    my $variance=0;

    # iterate through each characteristic
    for (my $k=1;$k<scalar @{$lines[$i]};$k++) {
      $variance++ if $lines[$i][$k]!=$lines[$j][$k];
    }

    print OUT $lines[$i][0],",",$lines[$j][0],",",$variance,"\n";
    print $lines[$i][0],",",$lines[$j][0],",",$variance,"\n";
  }
}

# close variance file
close(OUT);
This should be a little faster than the previous code.

Sorry, but I have no idea how the task itself could be split into separate parts.


Feherke.
 
If the comparison part takes a lot of time, you could use Parallel::ForkManager to fork off processes to do the computing for you, and either have them write the results out somewhere and have the parent pull them back in, or use shared memory to talk back and forth (or any of the other methods, or a database).. :) or whatever method you want to use to have the procs talk.
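Something along these lines maybe (untested sketch; it assumes the rows are already loaded into @lines and has each child write its own output file so they don't clobber each other):

Code:
use Parallel::ForkManager;

my $workers = 7;                               # roughly one per core
my $chunk   = int(@lines / $workers) + 1;
my $pm      = Parallel::ForkManager->new($workers);

for my $n (0 .. $workers - 1) {
  $pm->start and next;                         # parent keeps looping, child falls through

  my $from = $n * $chunk;
  my $to   = $from + $chunk - 1;
  $to = $#lines if $to > $#lines;

  open my $out, '>', "variance.$n.csv" or die "Could not open variance.$n.csv!\n";
  for my $i ($from .. $to) {
    # ... compare $lines[$i] against $lines[$i+1 .. $#lines] and print to $out ...
  }
  close $out;

  $pm->finish;                                 # child exits here
}
$pm->wait_all_children;
Equal-sized chunks won't take equal time if you use the $j = $i + 1 trick (the early chunks do more work), but it's a start.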

I'm not sure how much better threads are than forks but I know that I can use forks everywhere but not threads.

You can use the Benchmark module as you make changes to your code to see what kind of improvements you are getting (I would definitely use a much smaller test file!!)
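For example, something like this with a couple of made-up rows (just a sketch of how Benchmark's cmpthese works, not your real data):

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# two fake rows in the same shape as the patent data
my $line_a = join ',', 1234567, map { int rand 2 } 1 .. 480;
my $line_b = join ',', 7654321, map { int rand 2 } 1 .. 480;
my @rec_a  = split /,/, $line_a;
my @rec_b  = split /,/, $line_b;

cmpthese(-3, {
    split_every_time => sub {
        my @r1 = split /,/, $line_a;
        my @r2 = split /,/, $line_b;
        my $v  = grep { $r1[$_] != $r2[$_] } 1 .. $#r1;
    },
    presplit_rows => sub {
        my $v = grep { $rec_a[$_] != $rec_b[$_] } 1 .. $#rec_a;
    },
});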



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
I guess I'll take back my forking suggestion.. if you fork, it's going to have to duplicate the @lines array, and that's going to be a huge memory resource, unless you store it in shared memory and access it that way. I've never tried to store something that big; IPC::Shareable is the module I use to store hashes/arrays in shared memory.

If this is a Linux machine, I think you're going to have to specifically allow the user to have access to more memory than normal (I think by default it's something like 256 or 512M).


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
I'll carry on from the already good work of feherke: by doing the [tt]split[/tt] only once per line, avoiding the [tt]shift[/tt] on every line, avoiding repeated [tt]open[/tt]s and [tt]close[/tt]s, and comparing only against the lines past the present one, his code is tremendously faster: the last modification alone gives a 50% reduction!
A minor improvement is to assign to variables, once and for all, the length of the [tt]@lines[/tt] array and the length of the array resulting from splitting a line (which I suppose always has the same number of elements), instead of recomputing them in every loop condition.
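For instance (a sketch only, assuming every record really does have the same number of fields):

Code:
# compute the sizes once instead of evaluating them in every loop condition
my $nlines  = scalar @lines;
my $nfields = scalar @{ $lines[0] };   # assumption: all records have the same length

for (my $i = 0; $i < $nlines; $i++) {
  for (my $j = $i + 1; $j < $nlines; $j++) {
    for (my $k = 1; $k < $nfields; $k++) {
      # ... comparison as in feherke's code ...
    }
  }
}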
A further significant improvement may be obtained if you know that, out of the roughly 3.2E13 line comparisons, the proportion of lines that match exactly (apart from the patent number) is significantly higher than 50% (otherwise, BTW, your [tt]variance[/tt] file would become really enormous!). In this case, comparing the lines as whole strings first would save a lot of time.
To do so, one has to extract the patent number first; this operation would be faster if the patent number always had the same length. Some other improvements are also possible if additional constraints are known on how the characteristics are represented (e.g. constant length, only selected values, etc.). You mention binary values: do you mean a true binary representation (like 100 = 4)? I understand they are represented in decimal in the file, since you compare them as numbers, but a clarification would be useful. I guess that comparing them as strings would give a performance improvement, as perl wouldn't need to convert them into numbers, though I'm not so sure of that.
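To give an idea of what I mean, a sketch only (it assumes the patent number is the first field and that the raw, unsplit lines are still available):

Code:
# whole-string comparison first; count per-field differences only when the rows differ
my ($pat1, $chars1) = split /,/, $lines[$i], 2;   # patent number + the rest of the line
my ($pat2, $chars2) = split /,/, $lines[$j], 2;

my $variance = 0;
if ($chars1 ne $chars2) {
    my @r1 = split /,/, $chars1;
    my @r2 = split /,/, $chars2;
    $variance = grep { $r1[$_] != $r2[$_] } 0 .. $#r1;
}
print OUT "$pat1,$pat2,$variance\n";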
So carillonator, please come back with some more information (and examples) on the structure of your data, before we go on further.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
A couple of questions/observations:
1. Depending on how often you need to run this, it probably won't be worth building and testing a slick multi-threaded solution. Just write something crude and simple, and set it running. It will have finished by the time you've tested a threaded solution.
2. It's going to produce a massive amount of output. What are you going to do with it when you have it? Is there some way we can filter it while it's being produced? e.g. are you looking for anything in particular, like maybe combinations where the variance is zero, or less than a certain value? (See the sketch after this list.)
3. If we compare record A with record B, we don't then need to compare record B with record A, as the variance will be the same. Do we even need to produce two output records in this instance?
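On point 2, something like this inside the comparison loops would shrink the output enormously (only a sketch; the cutoff value and the $nfields name are made up):

Code:
# keep only pairs whose variance stays below some threshold
my $cutoff   = 5;
my $variance = 0;
for my $k (1 .. $nfields - 1) {
  $variance++ if $lines[$i][$k] != $lines[$j][$k];
  last if $variance > $cutoff;          # stop counting once the pair can't qualify
}
print OUT $lines[$i][0], ",", $lines[$j][0], ",", $variance, "\n"
  if $variance <= $cutoff;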

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Steve
On #3, I thought about that too, but looking at the code I don't think he'll ever have duplicate compares.
Code:
for (my $i=0;$i<scalar @lines;$i++) {

  # iterate through all remaining lines to compare
  for (my $j=$i+1;$j<scalar @lines;$j++) {
So if he starts on line 1, he then compares lines 2 and up, correct?
Then on line 2 he doesn't even bother to start with line 1, but goes straight to line 3 and starts comparing.





~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
In feherke's revised code, no, but in the original post duplicates were possible...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
If the data is fixed length you can use unpack() instead of split(), which should improve efficiency even more.
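For example, if the patent number were always 7 digits and every characteristic a single digit, something like this (only a sketch, the field widths are assumptions):

Code:
# fixed-width template: 7-char patent number, then 480 repeats of
# "skip the comma, take one character"
my $template = 'A7' . ('xA1' x 480);
my ($patno, @chars) = unpack $template, $line;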

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 