
Exclusion Optimization

Status
Not open for further replies.

BrianAtWork (Programmer)
Apr 12, 2004
I have a script that parses event logs from a database every day. While there are maybe around 4000-8000 actual events (queries) against the database, the log contains about 120 "pieces of data" for each event... (A piece of data being items like the Connection time, CPU Used, Memory used, Disconnection Time, etc.)

About 113 out of the 120 pieces of data are ALWAYS in each event, and 7 of them may or may not be in an event.

For this reason, my Perl script skips any lines that contain those 7 pieces of data:

Code:
        next if ($thisrec =~ /^\n$/);
        next if ($thisrec =~ /Memory usage:/);
        next if ($thisrec =~ /Node Number/);
        next if ($thisrec =~ /Memory Pool Type/);
        next if ($thisrec =~ /Current size/);
        next if ($thisrec =~ /High water mark/);
        next if ($thisrec =~ /Maximum size allowed/);
        next if ($thisrec =~ /Configured size/);

These 8 lines of code sit at the very top of a while loop that processes all half a million to a million lines of data.

I can't help but wonder if this has a significant impact on performance. Each of those 8 tests is done on every incoming line from the log file; that is about 4 million to 8 million tests done in this script each day.

Is there a way to optimize these exclusions? I could use the beginning-of-line anchor ^, but there is a variable amount of whitespace before each piece of text I am searching for. I also thought of building one regex with qr// and joining them all with pipes - but I don't know whether that is better. Do 8 checks on each line of data perform better than one check with 8 alternations against the same line?
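For what it's worth, the single-regex idea can be sketched like this - the seven heading names are taken from the code above, combined once with qr// into one alternation anchored after optional leading whitespace (the sample lines are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile the seven headings into one alternation, once, up front.
my $skip = qr/^\s*(?:Memory usage:|Node Number|Memory Pool Type|Current size|High water mark|Maximum size allowed|Configured size)/;

# Made-up sample lines standing in for the real log.
my @lines = (
    "  Connection Time: 08/18/2004 19:22:03\n",
    "  Node Number: 4\n",
    "    Current size: 16384\n",
    "  User Name: username\n",
);

my @kept;
for my $thisrec (@lines) {
    next if $thisrec =~ $skip;    # one test per line instead of seven
    push @kept, $thisrec;
}
print @kept;
```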

Just a little background - the format of an event is like this:

Code:
3) Connection Event
  Connection Time: 08/18/2004 19:22:03
  User Name: username
  Node Number: 4
    System CPU: 0.4
    User CPU: 1.2
  Disconnection Time: 08/18/2004 19:23:31

4) Connection Event
...

That is not a real example, as an actual event would be about 115 lines long - but it shows the structure. Each data "title" is separated from the data by a colon. /^[0-9]+\) / is the record separator.
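A minimal sketch of splitting a log into events on that /^[0-9]+\) / record separator, using split with the /m modifier so ^ matches at each line start (the two-event sample is invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented two-event sample in the format described above.
my $log = <<'END';
3) Connection Event
  Connection Time: 08/18/2004 19:22:03
  User Name: username
4) Connection Event
  Connection Time: 08/18/2004 19:25:10
  User Name: username
END

# Split on the numbered header; grep drops the empty leading field
# that split produces when the separator starts the string.
my @events = grep { /\S/ } split /^[0-9]+\) /m, $log;
print scalar(@events), " events\n";
```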

Any optimization tips on this? What is the most efficient way to exclude the same 7 pieces of text from a file that contains a million lines of data, when those 7 pieces probably only appear a handful of times?

Any thoughts would be greatly appreciated!

Thanks!

Brian
 
Brian

What do you do with each 'event'? Do you select specific data items for later processing? Or do you just want to strip out the records that have the 7 items on them and write the file back out?

If it is the former, then won't they be excluded by default, as you haven't explicitly selected them?

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
You might save time by using index() to find the substrings instead of a full-blown regexp, if no wildcard matching is necessary:

Code:
        next if ($thisrec =~ /^\n$/);
        next if (index($thisrec, 'Memory usage:') >= 0);
        next if (index($thisrec, 'Node Number') >= 0);
        next if (index($thisrec, 'Memory Pool Type') >= 0);
        next if (index($thisrec, 'Current size') >= 0);
        next if (index($thisrec, 'High water mark') >= 0);
        next if (index($thisrec, 'Maximum size allowed') >= 0);
        next if (index($thisrec, 'Configured size') >= 0);

- Kevin, perl coder unexceptional!
 
Kevin - I hadn't thought of using index - I will benchmark that version and see if it performs faster. Thanks!

Steve - I write each event out to a csv file, but I want to exclude the 7 fields that only appear in the logs some of the time. The other 113 fields appear in the log for every single event. If I don't exclude the 7 fields, my CSV file gets thrown off because then each row will have 113 through 120 fields. I want each row in the output file to have the same number of fields.

Maybe there's a different way of accomplishing this, but by excluding the 7 "random" fields, I can ensure that each row has the same number of fields in it.
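That different way could be sketched as follows: collect each event into a hash keyed by heading, then emit a fixed column list, so every CSV row has the same field count even when optional fields are missing. The column names and sample event here are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fixed output order; only headings listed here ever reach the CSV.
my @columns = ('Connection Time', 'User Name', 'Disconnection Time');

# One parsed event, keyed by heading.
my %event = (
    'Connection Time' => '08/18/2004 19:22:03',
    'User Name'       => 'username',
    # 'Disconnection Time' deliberately missing
);

# Missing fields become empty strings, so the field count never varies.
my $row = join ',', map { defined $event{$_} ? $event{$_} : '' } @columns;
print "$row\n";
```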
 
First off, in my experience, multiple regexes are nearly always faster than one regex with a bunch of alternations (using |).

But instead of using lots of regexes, you can probably get away with just two: one to check for a blank line, and one to check your line headings. How about something like this:
Code:
my $lineHeading;
my %skipLines = map {($_, 1)} ('Memory usage', 'Node Number',
    'Memory Pool Type','Current size', 'High water mark',
    'Maximum size allowed', 'Configured size');

while (<DATA>) {
    next if /^\s*$/;
    
    if (($lineHeading) = /^\s*([^:]+)\s*:/) {
        next if $skipLines{$lineHeading};
    }
    print;
}
 
That's a good suggestion.

- Kevin, perl coder unexceptional!
 
Yes - that is a very good suggestion!

I'll give it a shot and see what kind of performance increase I get.

Back to Kevin's advice with index(): I tested that method out, and it was consistently 20% faster than using regexes, so the index method is clearly more efficient.
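A comparison like that can be run with the core Benchmark module; this is just a sketch with an invented sample line and an arbitrary iteration count:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented sample line standing in for a record from the log.
my $thisrec = "  Connection Time: 08/18/2004 19:22:03\n";

# Time the two membership tests against each other and print a
# comparison table (iterations per second and relative speed).
cmpthese(50_000, {
    regex => sub { my $hit = ($thisrec =~ /Node Number/) ? 1 : 0 },
    index => sub { my $hit = (index($thisrec, 'Node Number') >= 0) ? 1 : 0 },
});
```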

I'll share the results from running with rharsh's method when I finish testing.

Thanks for the great advice!
 
you might also gain some speed using the 'o' modifier since the regexp never changes:

if (($lineHeading) = /^\s*([^:]+)\s*:/o) {

- Kevin, perl coder unexceptional!
 
Considering the way that logical statements are processed, some of that logic can be flattened..
Code:
my $lineHeading;
my %skipLines = map {($_, 1)} ('Memory usage', 'Node Number',
    'Memory Pool Type','Current size', 'High water mark',
    'Maximum size allowed', 'Configured size');

while (<DATA>) {
    print unless /^\s*$/ or ((($lineHeading) = /^\s*([^:]+)\s*:/o) and $skipLines{$lineHeading});
}

Not adding much, I know...
 
Thanks to both of you. Yep, I had already included the /o switch in a previous test, and left it in for your hash suggestion. In benchmark testing, your hash method was 30%-40% faster than the original script - that is a very nice improvement.

And I personally like brigmar's condensed version (it improves performance a little), but it may be harder to understand for some of the people in my group who aren't as Perl-savvy ;)

Great input all - I'm pretty happy with the performance increases!
 
Have some stars from me too; I learnt a few tricks from this thread.
 
Awesome information rharsh and Kevin. Stars for both of you.
 
Thanks for the stars folks! I'll reach the 300 stars milestone pretty soon! [peace]

- Kevin, perl coder unexceptional!
 