BrianAtWork
Programmer
I have a script that parses event logs from a database every day. While there are maybe around 4000-8000 actual events (queries) against the database, the log contains about 120 "pieces of data" for each event... (A piece of data being items like the Connection time, CPU Used, Memory used, Disconnection Time, etc.)
About 113 out of the 120 pieces of data are ALWAYS in each event, and 7 of them may or may not be in an event.
For this reason, my perl script is set to skip any lines that contain those 7 pieces of data:
Code:
# Skip blank lines
next if ($thisrec =~ /^\n$/);
# Skip the 7 optional pieces of data
next if ($thisrec =~ /Memory usage:/);
next if ($thisrec =~ /Node Number/);
next if ($thisrec =~ /Memory Pool Type/);
next if ($thisrec =~ /Current size/);
next if ($thisrec =~ /High water mark/);
next if ($thisrec =~ /Maximum size allowed/);
next if ($thisrec =~ /Configured size/);
These 8 lines of code are at the very top of a while loop that processes the full half-million to one million lines of data.
I can't help but wonder if this has a somewhat significant impact on performance. Each of those 8 tests is run on every incoming line from the log file, which works out to roughly 4 million to 8 million regex tests in this script each day.
Is there a way to optimize these exclusions? I could maybe use the beginning-of-line anchor ^, but there is a variable number of spaces before each piece of text I am searching for. I also thought of building one regex with qr// and putting all the patterns inside it, separated by pipes (see the sketch below) - but I don't know whether that is actually better: do 8 separate checks on each line perform better or worse than one check with 8 alternations against the same line?
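Here is roughly what I mean by the single combined pattern (just a sketch - $skip and $log are placeholder names, not my real variables):
Code:
my $skip = qr/Memory usage:|Node Number|Memory Pool Type|Current size|High water mark|Maximum size allowed|Configured size/;

while (my $thisrec = <$log>) {
    next if $thisrec =~ /^\n$/;
    next if $thisrec =~ $skip;
    # ... rest of the per-line processing ...
}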
Just a little background - the format of an event is like this:
Code:
3) Connection Event
Connection Time: 08/18/2004 19:22:03
User Name: username
Node Number: 4
System CPU: 0.4
User CPU: 1.2
Disconnection Time: 08/18/2004 19:23:31
4) Connection Event
...
That is not a real example, as an actual event would be about 115 lines long - but it shows the structure. Each data "title" is separated from the data by a colon. /^[0-9]+\) / is the record separator.
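Just to illustrate the title/value layout, a data line could be pulled apart with something like this (a hypothetical snippet, not my actual parsing code):
Code:
# Split a data line like "User CPU: 1.2" into its title and value
if ($thisrec =~ /^\s*([^:]+):\s*(.*)/) {
    my ($title, $value) = ($1, $2);
    # ... store the pair for the current event ...
}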
Any optimization tips on this? What is the most efficient way to exclude the same 7 pieces of text from a file that contains a million lines of data, when those 7 pieces probably only appear a handful of times?
Any thoughts would be greatly appreciated!
Thanks!
Brian