
regular expression efficiency

Status
Not open for further replies.

azzazzello

Technical User
Jan 25, 2005
297
US
Hi,

I have a file with a million lines in the following format.

Code:
field1=val1 field2="val2" field3=""

It is basically a list of key/value pairs separated by spaces. Where a value is NOT enclosed in quotes, it contains only alphanumerics, dots, or minus signs. Where it IS enclosed in double quotes, it is free-form and can contain any character (including a space or an = sign) except a double quote. I need to convert each line into a hash of keys and values.

I have the following to parse them, but something tells me there is a faster way of doing it. Any ideas?

Code:
 my %vals;
 while ($line =~ /([^=\"]+)=((\"(.*?)\")|([\-\w\.]*?))\s+/g)
    {
        ($key,$val) = ($1,$2);
        $val =~ s/"//g if $val;
        $vals{$key} = defined($val) ? $val : "";
    }
 
That's pretty much exactly how I would do it too.

I would probably simplify it a little, though:

Code:
use Data::Dumper;

use strict;
use warnings;

while (my $line = <DATA>) {
	my %vals;
	while ($line =~ /\G(\w+)=("[^"]*"|\S*)\s+/g) {
		my ($key, $val) = ($1, $2);
		$val =~ s/^"|"$//g;
		$vals{$key} = $val;
	}
	print Dumper(\%vals);
}

__DATA__
field1=val1 field2="val2" field3="" 
field1=val1 field2="contains foo=bar and other stuff" field3=""

- Miller
 
This is an attempt at finding something faster (together with the code used to measure the execution time).
[tt]sub rind{}[/tt] uses the same regex method as above; [tt]sub eind{}[/tt] uses an approach based on [tt]index[/tt] and [tt]substr[/tt]. The latter is about twice as fast, though I don't know if the gain is worth the effort of writing a more complex routine.
Code:
use strict;
use warnings;
my$line;
my$num=1000000;
my$ini=time;
for(0..$num){
  $line='field1=val1 field2="val2" field3=""
';
  eind(\$line);
  $line='field1=val1 field2="contains foo=bar and other stuff" field3=""
';
  eind(\$line);
}
print time-$ini,"\n";
$ini=time;
for(0..$num){
  $line='field1=val1 field2="contains foo=bar and other stuff" field3=""
';
  rind(\$line);
  $line='field1=val1 field2="val2" field3=""
';
  rind(\$line);
}  
print time-$ini,"\n";

sub rind{
  my%vals;
  my($refs)=@_;
  my($key,$val);
  while($$refs=~/\G(\w+)=(\"[^\"]*\"|\S*)\s+/g){
    ($key,$val)=($1,$2);
    $val=~s/^"|"$//g;
    $vals{$key}=$val;
  }
}
sub eind{
  my%vals;
  my($refs)=@_;
  my($i,$j,$key,$val);
  my$p=0;
  while(($i=index($$refs,'=',$p))>-1){
    $key=substr($$refs,$p,$i-$p);
    if(substr($$refs,++$i,1)eq'"'){
      $j=index($$refs,'"',++$i);
      $val=substr($$refs,$i,$j++ -$i);
    }else{
      $j=index($$refs,' ',$i);
      if($j<0){
        $val=substr($$refs,$i);
      }else{
        $val=substr($$refs,$i,$j-$i);
      }
    }
    $vals{$key}=$val;
    $p=++$j;
  }
}

Franco
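
For anyone who wants to reproduce the comparison, here is a minimal sketch. It is not the code from the posts above. It restates the two approaches as subs that return a hash reference, checks that they agree on the sample lines, and times them with the core Benchmark module. The names parse_regex, parse_index and flatten are made up for this illustration. The regex uses \s* rather than \s+ so that a final pair with no trailing whitespace is still picked up, and the index/substr version assumes pairs are separated by exactly one space, as in the sample data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Regex approach, modelled on the \G pattern shown earlier in the thread.
sub parse_regex {
    my ($line) = @_;
    my %vals;
    while ($line =~ /\G(\w+)=("[^"]*"|\S*)\s*/g) {
        my ($key, $val) = ($1, $2);
        $val =~ s/^"|"$//g;          # strip the surrounding quotes, if any
        $vals{$key} = $val;
    }
    return \%vals;
}

# index/substr approach. Assumption: exactly one space between pairs.
sub parse_index {
    my ($line) = @_;
    chomp $line;
    my %vals;
    my $len = length $line;
    my $pos = 0;
    while ($pos < $len) {
        my $eq = index($line, '=', $pos);
        last if $eq < 0;
        my $key   = substr($line, $pos, $eq - $pos);
        my $start = $eq + 1;
        my ($val, $end);
        if (substr($line, $start, 1) eq '"') {
            $start++;                            # step past the opening quote
            $end = index($line, '"', $start);
            last if $end < 0;                    # unterminated quote: stop, do not guess
            $val = substr($line, $start, $end - $start);
            $end++;                              # step past the closing quote
        }
        else {
            $end = index($line, ' ', $start);
            $end = $len if $end < 0;             # last value on the line
            $val = substr($line, $start, $end - $start);
        }
        $vals{$key} = $val;
        $pos = $end + 1;                         # skip the separating space
    }
    return \%vals;
}

# Hypothetical helper: turn a hash ref into a comparable string.
sub flatten {
    my ($h) = @_;
    return join '|', map { "$_=$h->{$_}" } sort keys %$h;
}

my @samples = (
    qq{field1=val1 field2="val2" field3=""\n},
    qq{field1=val1 field2="contains foo=bar and other stuff" field3=""\n},
);

# Sanity check: both parsers must produce the same pairs.
for my $line (@samples) {
    die "parsers disagree on: $line"
        unless flatten(parse_regex($line)) eq flatten(parse_index($line));
}

# Run each parser for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { parse_regex($_) for @samples },
    index => sub { parse_index($_) for @samples },
});
```

The ratio will depend on the Perl version and on the shape of the real data, so it is worth running this against a few genuine lines from the production file before committing to the longer routine.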
 
prex1,

Thank you!! Is it worth it? When a vital production script on a deadline takes 1.5 hours to run instead of 3 hours, hell yeah, it's worth it!
 
azzazzello, have you tried running the code in your production environment yet?

While the parsing itself is going to be faster, that code isn't going to fix the system I/O delays. I'd be pretty surprised if your 3-hour run times were actually cut in half.
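
One way to check how much of the run time is I/O rather than parsing is to time a pass that only reads the file against a pass that reads and parses it. The following is a rough sketch under stated assumptions: time_pass and parse_line are hypothetical names, the input is a small temporary file generated on the spot rather than the real production data, and the second pass may be served from the operating system's file cache, which makes the read cost look smaller than it would be on a cold run.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Time::HiRes qw(time);

# Regex parser in the style used earlier in the thread (illustrative only).
sub parse_line {
    my ($line) = @_;
    my %vals;
    while ($line =~ /\G(\w+)=("[^"]*"|\S*)\s*/g) {
        my ($key, $val) = ($1, $2);
        $val =~ s/^"|"$//g;
        $vals{$key} = $val;
    }
    return \%vals;
}

# Hypothetical helper: read every line of $path, calling $parser on each
# line if one is given. Returns (elapsed seconds, number of lines read).
sub time_pass {
    my ($path, $parser) = @_;
    my $t0 = time();
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $parser->($line) if $parser;
        $count++;
    }
    close $fh;
    return (time() - $t0, $count);
}

# Build a throwaway test file; replace this with the real file when measuring.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
my $n_lines = 50_000;
print {$tmp_fh} qq{field1=val1 field2="contains foo=bar and other stuff" field3=""\n}
    for 1 .. $n_lines;
close $tmp_fh;

my ($read_secs,  $read_count)  = time_pass($tmp_path);
my ($parse_secs, $parse_count) = time_pass($tmp_path, \&parse_line);

printf "read only:      %.3f s for %d lines\n", $read_secs,  $read_count;
printf "read and parse: %.3f s for %d lines\n", $parse_secs, $parse_count;
```

If the two numbers are close on the real file, the script is I/O bound and a faster parser will not buy much. If the second is much larger, the parsing is where the time goes.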
 