Huge data file and looping best practices


carillonator

Programmer
Apr 25, 2009
2
US
Hi,

I'm pretty new to Perl, but I have experience with PHP. I have been asked to improve a Perl script written by a questionable coder, which analyzes a set of data about patents. The data file has 8 million lines, which look like this:

Code:
patent #, char1, char2, char3, ... , char480 
1234567,1,0,1,0,1,0, ... (480 characteristics)

The script compares each [binary] characteristic of each patent with every other patent (yes, that's 8 million squared x 480 comparisons!) and counts the number of differences for each patent pair (my attempt at the improved code is below).

I've read that for loops are memory hogs and slow the program down, but I don't know what alternative I have in this situation. I've also read that opening the input file on the bareword handle OUT is not a good idea, as opposed to using a scalar (lexical) filehandle, but again I'm not sure what the best way is.
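From what I've read, the lexical-filehandle style would look something like this (not sure if I have it right):

Code:
# sketch: lexical filehandles with the three-argument open, instead of the OUT bareword
open my $in,  '<', 'patents.csv'  or die "Could not open patents.csv: $!\n";
open my $out, '>', 'variance.csv' or die "Could not open variance.csv: $!\n";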

The program will be run on an 8-core machine with 64G memory. I'm going to allow arguments to the script that limit execution to only a range of the data (i.e. limit the first loop), and run 7 different instances at the same time. Or, is there a smarter way to allocate resources?
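For the range idea, I'm picturing something like this (a rough sketch; the start/end arguments are just my guess at how to carve up the work):

Code:
# sketch: limit the outer loop to a slice of the data, e.g.  ./compare.pl 0 1000000
my ($start, $end) = @ARGV;
$end = $#lines if !defined $end || $end > $#lines;

for (my $i = $start; $i <= $end; $i++) {
    # compare $lines[$i] against the rest of the file, as below
}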

Since it will take a VERY long time to run this program, the slightest improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated.

Thanks in advance!!

Code:
#!/usr/bin/perl 
use strict; 
 
my(@lines,$patno1,$patno2,@record1,@record2); 
 
open(OUT, "<patents.csv")|| die("Could not open file!\n"); 
@lines=<OUT>; 
close(OUT); 
 
#clear variance file if it exists 
open(OUT, ">variance.csv")|| die("Could not open file variance.csv!\n"); 
close(OUT); 
 
map(chomp,@lines); 
 
# iterate over all patents 
for(my $i=0;$i<=$#lines;$i++) 
{ 
	@record1=split(/\,/,$lines[$i]); 
	$patno1=shift(@record1);  
	 
	# iterate through all other lines to compare 
	for(my $j=0;$j<=$#lines;$j++) 
	{ 
		@record2=split(/\,/,$lines[$j]); 
		$patno2=shift @record2; 
		 
		# don't compare a record to itself 
		if($i!=$j) 
		{ 
			my $variance=0; 
			 
			# iterate through each characteristic 
			for(my $k=0;$k<=$#record1;$k++) 
			{ 
				if($record1[$k]!=$record2[$k]) 
				{ 
					$variance++;					 
				} 
			} 
		 
		open(OUT, ">>variance.csv")|| die("Could not open file variance.csv!\n"); 
		print OUT $patno1.",".$patno2.",".$variance."\n"; 
		close(OUT);		 
			 
		} 
	} 
}
 
I do see that the whole 6G data file is brought into memory. I'm guessing I need to use while loops, but I'm not sure the best way to do it.
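Something like this is what I'm picturing for the outer loop (again just a sketch, and I realize the inner loop still needs the data in memory somehow):

Code:
# sketch: stream the outer records with a while loop instead of slurping a second copy
open my $in, '<', 'patents.csv' or die "Could not open patents.csv: $!\n";
while (my $line = <$in>) {
    chomp $line;
    my ($patno1, @record1) = split /,/, $line;
    # ... inner comparison loop against the in-memory data goes here ...
}
close $in;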
 
Hi

I would say, the author was not a dedicated documentation reader.
carillonator said:
Code:
map(chomp,@lines);
perldoc said:
If you chomp a list, each element is chomped, and the total number of characters removed is returned.
That is not such a huge slowdown :
Code:
[blue]master #[/blue] time seq 1000000 | perl -e '@l=<>;map(chomp,@l)'
real    0m1.889s
user    0m1.766s
sys     0m0.119s
[blue]master #[/blue] time seq 1000000 | perl -e '@l=<>;chomp @l'
real    0m1.601s
user    0m1.616s
sys     0m0.127s
But it seems to be a bad idea when handling a huge amount of data :
Code:
[blue]master #[/blue] time seq 10000000 | perl -e '@l=<>;map(chomp,@l)'
[red]Out of memory![/red]

real    0m17.970s
user    0m17.813s
sys     0m1.242s

[blue]master #[/blue] time seq 10000000 | perl -e '@l=<>;chomp @l'
real    0m17.122s
user    0m17.249s
sys     0m1.132s
But the real problem is [tt]split[/tt]ting up the lines in both loops. It would probably be better to cache the result.

Maybe I misunderstand the task, but I would say there is no need to loop from 0 in the inner loop, because the number of differences between lines n and m is the same as between lines m and n.
Code:
    [gray]# iterate through all [s]other[/s] [/gray][red]remaining[/red][gray] lines to compare[/gray]
    for (my $j=[red]$i+1[/red];$j<$#lines;$j++)
But maybe the ugliest "trick" is to [tt]open[/tt] and [tt]close[/tt] the output file for each line you output.

So I would start with a rewrite :
Code:
#!/usr/bin/perl
use strict;

open(OUT, "<patents.csv") || die("Could not open file!\n");
my @lines=<OUT>;
close(OUT);

# open variance file
open(OUT, ">variance.csv") || die("Could not open file variance.csv!\n");

chomp @lines;

# split up all lines
for my $s (@lines) { $s=[ split /,/,$s ] }

# iterate over all patents
for (my $i=0;$i<scalar @lines;$i++) {

  # iterate through all remaining lines to compare
  for (my $j=$i+1;$j<scalar @lines;$j++) {

    my $variance=0;

    # iterate through each characteristic
    for (my $k=1;$k<scalar @{$lines[$i]};$k++) {
      $variance++ if $lines[$i][$k]!=$lines[$j][$k];
    }

    print OUT $lines[$i][0],",",$lines[$j][0],",",$variance,"\n";
    print $lines[$i][0],",",$lines[$j][0],",",$variance,"\n";
  }
}

# close variance file
close(OUT);
This should be a little faster than the previous code.

Sorry, but I have no idea how the task itself could be split into separate parts.


Feherke.
 
If the comparison part takes a lot of time, you could use Parallel::ForkManager to fork off processes to do the computing for you, and either have them write the results out somewhere and have the parent pull them back in, or use shared memory to talk back and forth (or any of the other methods, or a database).. :) or whatever method you want to use to have the procs talk.
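Something along these lines maybe (untested sketch; it assumes the rows are already loaded into @lines and has each child write its own output file so they don't clobber each other):

Code:
use Parallel::ForkManager;

my $workers = 7;                               # roughly one per core
my $chunk   = int(@lines / $workers) + 1;
my $pm      = Parallel::ForkManager->new($workers);

for my $n (0 .. $workers - 1) {
  $pm->start and next;                         # parent keeps looping, child falls through

  my $from = $n * $chunk;
  my $to   = $from + $chunk - 1;
  $to = $#lines if $to > $#lines;

  open my $out, '>', "variance.$n.csv" or die "Could not open variance.$n.csv!\n";
  for my $i ($from .. $to) {
    # ... compare $lines[$i] against $lines[$i+1 .. $#lines] and print to $out ...
  }
  close $out;

  $pm->finish;                                 # child exits here
}
$pm->wait_all_children;
Equal-sized chunks won't take equal time if you use the $j = $i + 1 trick (the early chunks do more work), but it's a start.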

I'm not sure how much better threads are than forks but I know that I can use forks everywhere but not threads.

You can use the Benchmark module as you make changes to your code to see what kind of improvements you are getting (I would definitely use a much smaller test file!!)
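For example, something like this with a couple of made-up rows (just a sketch of how Benchmark's cmpthese works, not your real data):

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# two fake rows in the same shape as the patent data
my $line_a = join ',', 1234567, map { int rand 2 } 1 .. 480;
my $line_b = join ',', 7654321, map { int rand 2 } 1 .. 480;
my @rec_a  = split /,/, $line_a;
my @rec_b  = split /,/, $line_b;

cmpthese(-3, {
    split_every_time => sub {
        my @r1 = split /,/, $line_a;
        my @r2 = split /,/, $line_b;
        my $v  = grep { $r1[$_] != $r2[$_] } 1 .. $#r1;
    },
    presplit_rows => sub {
        my $v = grep { $rec_a[$_] != $rec_b[$_] } 1 .. $#rec_a;
    },
});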



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
I guess I'll take back my forking suggestion.. if you fork, it's going to have to duplicate the @lines array, and that's going to be a huge memory resource, unless you store it in shared memory and access it that way. I've never tried to store something that big; IPC::Shareable is the module I use to store hashes/arrays in shared memory.

If this is a Linux machine, I think you're going to have to specifically allow the user to have access to more memory than normal (I think by default it's something like 256 or 512M).


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
I'll carry on from the already good work of feherke: by doing the [tt]split[/tt] only once per line, avoiding the [tt]shift[/tt] on every line, avoiding repeated [tt]open[/tt]s and [tt]close[/tt]s, and comparing only against the lines past the present one, his code is tremendously faster: the last modification alone gives a 50% reduction!
A minor improvement is to assign to variables, once and for all, the length of the [tt]@lines[/tt] array and the length of the array resulting from splitting a line (which I suppose always has the same number of elements), instead of recomputing them in every loop condition.
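For instance (a sketch only, assuming every record really does have the same number of fields):

Code:
# compute the sizes once instead of evaluating them in every loop condition
my $nlines  = scalar @lines;
my $nfields = scalar @{ $lines[0] };   # assumption: all records have the same length

for (my $i = 0; $i < $nlines; $i++) {
  for (my $j = $i + 1; $j < $nlines; $j++) {
    for (my $k = 1; $k < $nfields; $k++) {
      # ... comparison as in feherke's code ...
    }
  }
}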
A further significant improvement may be obtained if you know that, out of the roughly 3.2E13 line comparisons, the proportion of lines that match exactly (apart from the patent number) is significantly higher than 50% (otherwise, BTW, your [tt]variance[/tt] file would become really enormous!). In this case, comparing the lines as whole strings first would save a lot of time.
To do so, one has to extract the patent number first; this operation would be faster if the patent number always had the same length. Some other improvements are also possible if additional constraints are known on how the characteristics are represented (e.g. constant length, only selected values, etc.). You mention binary values: do you mean a true binary representation (like 100 = 4)? I understand they are represented in decimal in the file, since you compare them as numbers, but a clarification would be useful. I guess that comparing them as strings would give a performance improvement, as perl wouldn't need to convert them into numbers, though I'm not so sure of that.
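To give an idea of what I mean, a sketch only (it assumes the patent number is the first field and that the raw, unsplit lines are still available):

Code:
# whole-string comparison first; count per-field differences only when the rows differ
my ($pat1, $chars1) = split /,/, $lines[$i], 2;   # patent number + the rest of the line
my ($pat2, $chars2) = split /,/, $lines[$j], 2;

my $variance = 0;
if ($chars1 ne $chars2) {
    my @r1 = split /,/, $chars1;
    my @r2 = split /,/, $chars2;
    $variance = grep { $r1[$_] != $r2[$_] } 0 .. $#r1;
}
print OUT "$pat1,$pat2,$variance\n";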
So carillonator, please come back with some more information (and examples) on the structure of your data, before we go on further.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
A couple of questions/observations:
1. Depending on how often you need to run this, it probably won't be worth building and testing a slick multi-threaded solution. Just write something crude and simple, and set it running. It will have finished by the time you've tested a threaded solution.
2. It's going to produce a massive amount of output. What are you going to do with it when you have it? Is there some way we can filter it while it's being produced? e.g. are you looking for anything in particular, like maybe combinations where the variance is zero, or less than a certain value? (See the sketch after this list.)
3. If we compare record A with record B, we don't then need to compare record B with record A, as the variance will be the same. Do we even need to produce two output records in this instance?
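On point 2, something like this inside the comparison loops would shrink the output enormously (only a sketch; the cutoff value and the $nfields name are made up):

Code:
# keep only pairs whose variance stays below some threshold
my $cutoff   = 5;
my $variance = 0;
for my $k (1 .. $nfields - 1) {
  $variance++ if $lines[$i][$k] != $lines[$j][$k];
  last if $variance > $cutoff;          # stop counting once the pair can't qualify
}
print OUT $lines[$i][0], ",", $lines[$j][0], ",", $variance, "\n"
  if $variance <= $cutoff;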

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Steve
On #3, I thought about that too, but looking at the code I don't think he'll ever have duplicate compares.
Code:
for (my $i=0;$i<scalar @lines;$i++) {

  # iterate through all remaining lines to compare
  for (my $j=$i+1;$j<scalar @lines;$j++) {
So if he starts on line 1, he then compares lines 2 and up, correct?
Then on line 2 he doesn't even bother to start with line 1, but goes straight to line 3 and starts comparing.





~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
In feherke's revised code, no, but in the original post duplicates were possible...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
If the data is fixed length you can use unpack() instead of split(), which should improve efficiency even more.
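For example, if the patent number were always 7 digits and every characteristic a single digit, something like this (only a sketch, the field widths are assumptions):

Code:
# fixed-width template: 7-char patent number, then 480 repeats of
# "skip the comma, take one character"
my $template = 'A7' . ('xA1' x 480);
my ($patno, @chars) = unpack $template, $line;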

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 