
Large File, Complex Processing, Processor Speed & Memory


menkes

Programmer
Nov 18, 2002
47
US
I have a script which performs some complex data manipulation on multiple files. I am looking to decrease the processing time, but have not been very successful.

THE SCRIPT
-Read 6 files directly into hashes. These files are about 14MB in size.

-Read the main file line by line using "while". This file is about 1.25GB.

-Write the output file line by line, creating a final output file that is 1.4GB in size.

-Inside the script I build several summary hashes & arrays for each book of business...the largest book of business would build hashes & arrays holding roughly 700K records, each record about 160 bytes in size. These summaries are written to the output file and cleared after each book of business (there are 12K of these).
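
In outline, the main loop looks something like this (heavily simplified - the field layout and summary logic here are just placeholders, and it assumes the records come grouped by book of business):

use strict;
use warnings;

open my $in,  '<', 'main_input.dat'  or die "open input: $!";
open my $out, '>', 'main_output.dat' or die "open output: $!";

my $current_book = '';
my %summary;                       # cleared after each book of business

while (my $line = <$in>) {
    chomp $line;
    my ($book, $account, $amount) = split /\|/, $line;    # placeholder layout

    if ($book ne $current_book) {
        write_summary($out, $current_book, \%summary) if $current_book ne '';
        %summary      = ();
        $current_book = $book;
    }

    $summary{$account} += $amount;                         # placeholder summary
    print $out "$line\n";                                  # detail line goes straight out
}
write_summary($out, $current_book, \%summary) if $current_book ne '';

close $in  or die $!;
close $out or die $!;

sub write_summary {
    my ($fh, $book, $sum) = @_;
    print $fh "SUMMARY $book $_ $sum->{$_}\n" for sort keys %$sum;
}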

THE COMPUTER
-I originally ran this on a dual Xeon 450, 1GB of RAM, Win2K Server (the OS is not an option, so please save the flames). It took 3 hours and 17 minutes to complete. Of course, the script only uses one of the two processors, and it maxed out the one it was using.

-I moved this to a new box. This one has a 3GHz P4, 2GB of RAM, Win2K Server...and the FSB is 533MHz. The script ran in 2 hours and 13 minutes. Better.

I think I can get much better on the new box. The CPU is maxed, but the script only uses 300MB of RAM. So I decided to try and use more RAM and read the entire input file (the 1.25GB file) into memory using "foreach $line (<BIGFILE>)". This time the script used 1.4GB of RAM and the CPU was maxed. It improved the processing time by a whopping 4 minutes.

Any suggestions, tips?

Sorry for such a long post...if you have made it here, thanks just for staying with me.

Scott
 
Hi Scott,

A few thoughts

1 - Do you know which code in your script is consuming the most time - in other words, where you should be concentrating your optimisation efforts?

I use Time::HiRes for this.
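
Something along these lines, just to see where the time actually goes (what you wrap is up to you):

use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
# ... the section you suspect, e.g. building the lookup hashes ...
my $elapsed = tv_interval($t0);
print STDERR "hash building: ${elapsed}s\n";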

2 - What is your target time? When are you going to stop optimising? Could I suggest that you will never be able to write a script that will run in less time than it takes the command line utility 'copy' to copy your 1.25GB file from here to there - so that's an interesting starting point as a target.

Another starting point would be a Perl version of copy (which is what you're writing, with added nice bits :)). How long does the script below take to run on your 1.25GB file?

while (<>) {
    print;
}

This is the fastest possible version of your script (slightly simplified, I know). If you get within a few percent of this time, stop optimising - it's as good as it will get with Perl.

The question I ask is "How slowly can I get away with this script running?" Optimising past this point might be fun, but you've got to ask yourself why you're doing it.

3 - The six files you're reading into hashes. Are you reading them line by line or slurping them?
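
For 14MB files slurping is usually worth a try. Assuming tab-separated key/value lines (adjust the split to your real layout), the two look roughly like this:

# line by line
open my $fh, '<', 'lookup1.txt' or die $!;
my %lookup;
while (<$fh>) {
    chomp;
    my ($key, $value) = split /\t/, $_, 2;
    $lookup{$key} = $value;
}
close $fh;

# slurped in one go
open my $fh2, '<', 'lookup1.txt' or die $!;
my %lookup2 = map { chomp; split /\t/, $_, 2 } <$fh2>;
close $fh2;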

4 - Disk I/O. You haven't mentioned this anywhere.

Ideally your input and output files should be on separate logical drives and these drives should be as striped as you can manage with the number of disks you have available.

As a minimum I would suggest that you *need* two disks. Put the input file on one and write the output file to the other.

5 - The more memory the better; allow the O/S to do the best it can with regard to disk caching. The total amount of data you're reading and writing is about 2.75GB or thereabouts, so 3GB would be a good number then :)

6 - Application Architecture. It's possible to get speed increases on multiple-processor machines by splitting off some work to other processes. Take the six files you're reading into hashes: you could write a Perl script that did nothing but read them into hashes and then listen on a socket. Your main script would then request the lookup data from the other process as and when it needed it. The trouble with this approach is that it's hard work, and it works best where processing can be simultaneous - in this example your main script would just sit waiting for data to come back from the hash server script.
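
A bare-bones sketch of the hash server idea (the file name, port, and one-key-per-line protocol are all made up for illustration):

use strict;
use warnings;
use IO::Socket::INET;

# load one of the lookup files into a hash
my %lookup;
open my $fh, '<', 'lookup1.txt' or die "open: $!";
while (<$fh>) {
    chomp;
    my ($key, $value) = split /\t/, $_, 2;
    $lookup{$key} = $value;
}
close $fh;

# answer lookups: one key per line in, one value per line back
my $server = IO::Socket::INET->new(
    LocalPort => 9000,
    Proto     => 'tcp',
    Listen    => 5,
    Reuse     => 1,
) or die "listen: $!";

while (my $client = $server->accept) {
    while (my $key = <$client>) {
        chomp $key;
        my $value = exists $lookup{$key} ? $lookup{$key} : '';
        print $client "$value\n";
    }
    close $client;
}

The main script would connect once and ask for values as it needs them, something like:

use IO::Socket::INET;

my $hashsrv = IO::Socket::INET->new(
    PeerAddr => 'localhost',
    PeerPort => 9000,
    Proto    => 'tcp',
) or die "connect: $!";

sub remote_lookup {
    my ($key) = @_;
    print $hashsrv "$key\n";       # send the key
    my $value = <$hashsrv>;        # wait for the value to come back
    chomp $value;
    return $value;
}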

I would suggest doing things in this order:

1 - I/O issues (adding another disk is an easy and quick fix)
2 - Define the target time with the little script above
3 - Benchmark your code
4 - Examine the hash building routines
5 - Memory
6 - Architecture last.

Long answer as well :) sorry...
Mike

Want to get great answers to your Tek-Tips questions? Have a look at faq219-2884

It's like this; even samurai have teddy bears, and even teddy bears get drunk.
 
That's a lot of info, and I really appreciate it. I'll dig through this today. Thanks so much!
 