
Processing Big Flat File 4


Hi,

I have a flat file with 90,000 records, one per line, and the processing time is quite lengthy (the file size is almost 10 MB).
I decided to remove the newlines to reduce the number of lines and also the size of the file; the size is now 3.5 MB.
Even after changing the data structure, it takes almost 5-6 full seconds to loop over the data.

Question: Should I change the data structure? Should I convert the data to a hash and then loop through the hash to do the comparison? Your input and expertise would be appreciated.

Data Structure:
OLD:
Code:
93993|AD003|IBG0034591
93994|AB010|IBG0105789
93995|DE012|IBG0124585
93996|EF123|IBG1237833

New:
Code:
93993|AD003|IBG0034591?93994|AB010|IBG0105789?93995|DE012|IBG0124585?93996|EF123|IBG1237833

Code to process new data structure:
Code:
sub GetData {

    open(FILE,"$File") || die "Could not open $File";
    my @Array = <FILE>;
    close(FILE);

    foreach my $Record (@Array) {

        my @DataSets = split(/\?/,$Record);

        foreach my $DataSet (@DataSets) {

            my ($ID, $REF_NUMBER, $SEC_NUMBER) = split(/\|/,$DataSet);

            if ($ID eq $InputID) {
                $Value = $REF_NUMBER;
            }

        }

    }

    return $Value;

}
 
Try converting your data structure into a hash like this:

Code:
$VAR1 = {
   $ID => {
      REF_NUMBER => $REF_NUMBER,
      SEC_NUMBER => $SEC_NUMBER,
   },
   # so-on for each one
};

The code to do this once would be like the code you already have. Just have it create a new hash key for each $ID and put the $REF_NUMBER and $SEC_NUMBER under that hash.

Use Data::Dumper to dump the hash into a file. When you need the data later, just `do` the file, and then a simple `exists $hash->{$ID}` is all that's needed.

Code:
# Run this code once to build the hash ref.
# It will take a few seconds here, but
# it's more efficient in the long run.
sub GetData {

    open(FILE,"$File") || die "Could not open $File";
    my @Array = <FILE>;
    close(FILE);

    my $hash = {};    # new: hash ref to hold every record

    foreach my $Record (@Array) {

        my @DataSets = split(/\?/,$Record);

        foreach my $DataSet (@DataSets) {

            my ($ID, $REF_NUMBER, $SEC_NUMBER) = split(/\|/,$DataSet);

            # new: file each record under its $ID
            $hash->{$ID} = {
                REF_NUMBER => $REF_NUMBER,
                SEC_NUMBER => $SEC_NUMBER,
            };

        }

    }

    # new: dump the structure to a file as Perl code
    use Data::Dumper;
    open (OUT, ">structure.pl");
    print OUT Dumper($hash);
    close (OUT);
    return $Value

}

# In your main program, to load the data structure:
my $hash = do "structure.pl";

if (exists $hash->{$InputID}) {
   print "$hash->{$InputID}->{REF_NUMBER}\n";
}

So, you'd run the hash-generating script once (and run it again whenever the data in the flat file changes), and then in your main program, which might run many times, just load that hash back into memory with "do". Then you can do a quick lookup in the hash instead of looping, so your main program isn't held up for several seconds every time it needs to find the information.

-------------
Cuvou.com | The NEW Kirsle.net
 
Code:
open(FILE,"$File") || die "Could not open $File";
   my @Array = <FILE>;
close(FILE);

Wouldn't this be better, since you don't have to load the huge file into memory?

Code:
open(FILE,"$File") || die "Could not open $File";
   while ( <FILE> )
   {
      # process each line
   }
close(FILE);
 
Kirsle,
I have not understood the part of your code snippet below.
Code:
use Data::Dumper;
open (OUT, ">structure.pl");
print OUT Dumper($hash);
close (OUT);
return $Value

Can you please explain this a bit?


--------------------------------------------------------------------------
I never set a goal because you never know what's going to happen tomorrow.
 
In your loop, you populate the hashref $hash with the data you read in from the file.

Data::Dumper's Dumper() function takes a data structure and turns it into normal Perl code. It will produce something that looks like this:

Code:
$VAR1 = {
   93993 => {
      'REF_NUMBER' => 'AD003',
      'SEC_NUMBER' => 'IBG0034591',
   },
   93994 => {
      'REF_NUMBER' => 'AB010',
      'SEC_NUMBER' => 'IBG0105789',
   },
};

So, that bit of the code uses Data::Dumper, opens the file "structure.pl" for writing, prints Dumper($hash) to it (which writes out something like what I posted above), and closes the file... and that "return" line probably could be deleted, since the sub isn't returning anything useful at this point, just converting your flat file into a Perl structure.


-------------
Cuvou.com | The NEW Kirsle.net
 
TIMTOWTDI....
Code:
print GetData(93995);

sub GetData {
    my $id = shift;
    open(FILE,"$File") || die "Could not open $File";
    $_ = <FILE>;
    close(FILE);
    my $count = (my $ref_number, my $sec_number) = /(?:^|[?])$id\|(.*?)\|(.*?)(?:$|[?])/;
    return ($count) ? $ref_number : "";
}

That'll work with your current data structure...
 

Kirsle: Thanks for your help. I will go through your code to understand it.

brigmar: I've tried your code, and it takes about 5 seconds to process, which is the same as with my current code.

Thanks to everyone for your time and help.
 
Wait... are you complaining that it takes 5 seconds to process 3+ MB of data??? Heck, at that point you're probably hitting the throughput limits of your hard drive.
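If you want to see how much of that time is raw disk I/O versus the parsing loop, the core Time::HiRes module can time a plain slurp of the file. This is just a sketch, and the file name is a placeholder:

Code:
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);    # float-precision time()

my $File = "data.txt";    # placeholder - point this at the real file

my $start = time;
open(my $fh, '<', $File) or die "Could not open $File: $!";
my @lines = <$fh>;
close($fh);

printf "Slurped %d lines in %.3f seconds\n", scalar(@lines), time - $start;

If the slurp alone takes most of the 5 seconds, the bottleneck is the disk, not the code.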
 
assuming your old data structure:

Code:
sub GetData {

   open(my $FH,"$File") || die "Could not open $File";
   while (my $line = <$FH>){
      my ($ID, $REF_NUMBER, undef) = split(/\|/,$line);
      if ($ID eq $InputID) {
        return $REF_NUMBER;
      }
   } 
}


If you use an indirect filehandle ($FH), Perl automatically closes the file when the filehandle goes out of scope, so there is no need to explicitly close the file in the above code. If the $ID is near the end of the file it will still take some time to process, but if it's nearer the start of the file it should be faster. $SEC_NUMBER is "undef" because you are not using it for anything, so there is no need to define it.
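For illustration, here's a minimal sketch of that scoping behavior (the sub name and file name are made up): the lexical filehandle only exists inside the sub, so Perl closes the file when the sub returns.

Code:
use strict;
use warnings;

sub first_line {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Could not open $file: $!";
    my $line = <$fh>;
    return $line;    # $fh goes out of scope here, so Perl closes the file
}

print first_line("data.txt");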

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I agree with travis: 5 seconds is not bad for processing a file that size.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 

travs69: Yes, 5 full seconds is too long. That's why I'm looking for other solutions. The "concept" proposed by Kirsle is good; I need to avoid re-opening the file every time, because even some efficient code would cut 2-3 secs in the processing time.

Thanks
 
Oops! In my last post, I meant to say: "[...] even some efficient code wouldn't cut 2-3 secs in the processing time".
 
I really don't see how you'd improve upon what you have.
No matter how you cut it, you'll still be reading in a multi-megabyte file and processing it.

Which suggestion you go with depends on how often you will be comparing against the file:

Kirsle's suggestion frontloads the process. It will take a certain amount of time to process the original log file, a certain amount of time to write the Dumper file, and a certain amount of time to 'do' the Dumper file. Once that is in place, checks against the data will be VERY FAST, meaning that if you're looking to call your function often, this will pay off big time. If there are any changes to the log file during runtime, they will not be reflected until the frontloading process is repeated.

If you're just looking to call the function a couple of times, you'd do best to stick with what you have.
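If you want hard numbers on that tradeoff, the core Benchmark module can compare a linear scan against a hash lookup. This is only a sketch; the generated sample data and the lookup ID stand in for your real file:

Code:
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data standing in for the real file contents.
my @records = map { "$_|REF$_|SEC$_" } 1 .. 90_000;

my %hash;
for my $rec (@records) {
    my ($id, $ref) = split /\|/, $rec;
    $hash{$id} = $ref;
}

my $want = 45_000;

cmpthese(-2, {
    # Scan every record until the ID turns up.
    linear => sub {
        for my $rec (@records) {
            my ($id, $ref) = split /\|/, $rec;
            last if $id == $want;
        }
    },
    # Single hash lookup.
    hash => sub {
        my $ref = $hash{$want};
    },
});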

Something to consider for the low-usage option is changing the record separator within your function, so that you don't need to load the entire file into memory:

Code:
print GetData('000099');

sub GetData {
  my $id = shift;
  my $found = 0;
  my $refnum = "";
  my $secnum = "";
  local $/  = "?"; ## Set the record separator

  open(FILE,"$File") || die "Could not open $File";
  while(<FILE>) {
    chomp;    ## strip the trailing "?" record separator
    last if $found = ($refnum,$secnum) = /^$id\|(.*?)\|(.*)$/;
  }
  close(FILE);

  return $found ? $refnum : "";
}

 
If five seconds is too long, and even some efficient coding won't cut 2-3 seconds off that time, then use a proper relational database. With an index.

1. It won't open and close the file each time.
2. It will keep the index in a memory buffer most of the time.
3. It won't read the whole file each time, just the record you want.
4. If you don't have a database, there are plenty of good free ones (MySQL, PostgreSQL, etc.). And some expensive ones too.

It should cut at least 4.9 seconds off your current execution time.
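For example, with the DBI module and the file-based SQLite driver (assuming DBD::SQLite is installed; the database, table, and column names here are my own invention), a one-time load plus an indexed key gives near-instant lookups:

Code:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=records.db", "", "",
                       { RaiseError => 1 });

# One-time load: create the table and import the flat file.
# The PRIMARY KEY gives SQLite an index on id for free.
$dbh->do("CREATE TABLE IF NOT EXISTS records
          (id TEXT PRIMARY KEY, ref_number TEXT, sec_number TEXT)");

open(my $fh, '<', "data.txt") or die "Could not open data.txt: $!";
my $ins = $dbh->prepare("INSERT OR REPLACE INTO records VALUES (?, ?, ?)");
while (my $line = <$fh>) {
    chomp $line;
    $ins->execute(split /\|/, $line);
}
close($fh);

# Later, in the main program: read one record, not the whole file.
my ($ref_number) = $dbh->selectrow_array(
    "SELECT ref_number FROM records WHERE id = ?", undef, "93995");
print "$ref_number\n";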

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)[/small]
 
The other thing you could do is just load it into shared memory and leave it there. You would need something that checks whether it's gone for some reason (reboot, memory deletion, or whatever) and reloads it... but heck, on a Unix box you could leave something like that in memory forever. I guess it depends on how often you are accessing it whether that is justified or not.
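As a sketch of that idea, the CPAN module IPC::Shareable can tie a hash into SysV shared memory so it survives between runs of the script. This assumes the module is installed and you're on a Unix box; the glue string and file name are made up, and since nested structures are awkward to share, the two fields are joined into one string:

Code:
use strict;
use warnings;
use IPC::Shareable;

# Tie %lookup to a shared memory segment identified by the glue string.
# With destroy => 0 the segment persists after this process exits.
tie my %lookup, 'IPC::Shareable', 'flat', { create => 1, destroy => 0 };

unless (%lookup) {
    # Empty segment (first run, or gone after a reboot): reload it.
    open(my $fh, '<', "data.txt") or die "Could not open data.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $ref, $sec) = split /\|/, $line;
        $lookup{$id} = "$ref|$sec";
    }
    close($fh);
}

print "$lookup{93995}\n";    # instant lookup, no file read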
 
Kirsle,
Thanks for the explanation. Have a star.
P.S. I am always confused by the use of Data::Dumper



--------------------------------------------------------------------------
I never set a goal because you never know what's going to happen tomorrow.
 
Here's a little test case for Data::Dumper:

Code:
#!/usr/bin/perl -w

# this is script1.pl

use Data::Dumper;

# make some random hashref
my $colors = {
   'red' => 'FF0000',
   'blue' => '0000FF',
   'green' => '00FF00',
};

# Just to see what we just did:
print "The hex code for red = $colors->{red}\n"
   . "Green = $colors->{green}\n"
   . "Blue = $colors->{blue}\n";

# open a txt file for writing
open (OUT, ">output.txt");

# dump the hashref into it
print OUT Dumper($colors);

# close the file
close (OUT);

Run that script, and then open the "output.txt" that it created. You should see your hashref in an easy-to-read, Perlly format. It declares this hashref as $VAR1, but the variable names aren't important here:

Code:
$VAR1 = {
   'red' => 'FF0000',
   'blue' => '0000FF',
   'green' => '00FF00',
};

The Perl functions do and require look for the included Perl file to return some kind of true value. This is why a Perl script that is meant to be require'd by another script oftentimes has a lone "1;" at the very end of the file. This "1;" is just a true value, so the script which includes the file will get a true value.

You could also put this at the end of an included file:
Code:
"a true value";

At any rate, require and do return to the caller the value returned from the included script. So, if your script ended with "a true value" and you did:

Code:
my $returned = require "my-script.pl";

Then $returned = "a true value"

Now, moving on with the test case, create a second Perl script:

Code:
#!/usr/bin/perl -w

# this is script2.pl, which knows nothing
# about the $colors hash you declared in
# script1.pl; it's going to read it from that
# file that Data::Dumper printed to.

my $data = do "output.txt";

# Now $data matches the $colors hash from
# our last script! This proves it:
print "The hex value for red is: $data->{red}\n"
   . "And for blue it is: $data->{blue}\n"
   . "And for green it is: $data->{green}\n";

So, if that doesn't clear up how Data::Dumper can be useful, just think of it like this:

Data::Dumper takes some kind of structure from memory (be it a scalar, array, hash, scalar reference, hash reference, array reference, ...) and prints it out in a human-readable way. It prints it in such a way that what is printed is also perfectly usable Perl code.

It's good for debugging, too; if you have some complicated loops and your hash isn't getting updated, you can print your data structure to STDOUT using Dumper, just to see on screen what's going on.
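For instance, a quick debugging dump might look like this (the hash here is made up):

Code:
use Data::Dumper;

my %totals = ('apples' => 3, 'pears' => 7);

# Peek at the structure to see what's actually in it.
print Dumper(\%totals);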

This method of saving data to be loaded by another Perl script just takes advantage of the fact that do and require return the last value that the included script returns.

-------------
Cuvou.com | The NEW Kirsle.net
 

Wow!! Thank you very much to everyone. Your help was really appreciated. Big * to Kirsle, brigmar and Kevin.

 