
Code Optimization for "grep" substitute 3


BrianAtWork (Programmer)
Apr 12, 2004
Hello everyone!
I have a very large directory (over 4,000 files), so "grep string *.sh" does not work: the shell expands *.sh into too many arguments for a single command line. Also, many of the files contain lines longer than 2048 bytes, another condition that causes grep to fail.
This is on a UNIX-AIX box.
I wrote a piece of Perl code to use as a substitute for the native grep command. It works perfectly, but it runs a few seconds slower than grep.
I use a "-l" option if the user wants to display every line that contains the match (the opposite of what grep's -l does). If the user doesn't use the "-l" option, only the file name is displayed and the script jumps to the next file. I use a variable to hold the name of the subroutine the script should call in each case, and because of this I am unable to "use strict;"...

Here is the code:
Code:
#!/usr/bin/perl
# $Id: pgrep.pl,v 1.3 2005/01/03 18:11:04 brian Exp brian$

use Getopt::Std;
use vars qw($opt_s $opt_o $opt_l $opt_i);
my $ONEARG=1;           #Flag: is the user passing just the search string?
if ($#ARGV != 0 or $ARGV[0] =~ /^\-.*[^s].*/ ){
  $ONEARG = 0;
  getopts("s:o:li");    #calls getopt function to store command line options
} #end if
else {                  #if there are no command line flags, set
  $opt_s=$ARGV[0];      #the string variable to the incoming argument.
} #end else

#################DECLARE GLOBALS#######################
my $helpmsg=qq(\nUsage: pgrep.pl 'search string'
 or   pgrep.pl string
 or   pgrep.pl -i -l -s 'search string' -o /path/results.out
****Exiting script!\n\n);
if ( ! $ONEARG  and ! $opt_s ) { #If there are flags, but not the -s flag
  print STDOUT $helpmsg;
  exit;
} #end if
select(STDOUT);#default: print to STDOUT
if ($opt_o) {  #If the user wants output to go to an output file.
  open(OUTFILE,">$opt_o") or die "Unable to open output file: $opt_o\n";
  print STDOUT "Output will be saved to $opt_o\n";
  select(OUTFILE);
} #end if
my $listsub="files_only";  #default: List files that contain the match
$listsub="all_lines" if ($opt_l); #if opt_l, list every line that matches...
my $re = qr/${opt_s}/o; #default: case sensitive search
$re = qr/${opt_s}/io if ($opt_i); #if opt_i, case INsensitive search
print "Files that contain '$opt_s':\n\n";
&test_it; #run the subroutine

#########################################################
sub test_it {
opendir(DIR,'.') or die "Unable to read current directory.\n";
my @allfiles = readdir(DIR); #read in all files in directory.
closedir(DIR);
shift @allfiles if $allfiles[0] =~ /^\.\.?$/; #get rid of . and .. in the list
shift @allfiles if $allfiles[0] =~ /^\.\.?$/;
DIRLOOP: foreach my $file (@allfiles) { #Read in each file.
    next DIRLOOP if (! -r $file); #Check if file is readable or not.
    open(INFILE,"<$file") or die "Unable to open file: $file\n";
    while(<INFILE>) {
                &$listsub($file,$_) if /$re/; #$re contains the regex
    } #end while
    close(INFILE);
} #end DIRLOOP
close(OUTFILE);
}#end sub test_it

##########################################################
sub files_only { #print only the file name, and skip to the next file
        print "$_[0]\n";
        close(INFILE);
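        # "next" here jumps out of this sub, straight to the next DIRLOOP iteration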
        next DIRLOOP;
}#end sub files_only

##########################################################
sub all_lines { #print the filename and the matching line.
        print "$_[0]: $_[1]";
}#end sub all_lines

My main question is: are there any places I can optimize my code to help it run faster?
Please feel free to criticize! [peace]
Plus, maybe this post can help other people too.

Thanks in advance!

Brian
 
Hi Brian,

I can't see much to change; it looks good to me. Nice, actually.

This line here

    &$listsub($file,$_) if /$re/;

could be made to run a little faster if you let the regex engine know that the variable $re doesn't change throughout the life of the loop - tell it to compile the regex only once, like this:

    &$listsub($file,$_) if /$re/o;

As that line runs quite a few times (once for every line of 4,000 files) you may get measurable differences.

You could probably make it run a little faster still by doing away with that ref'd sub altogether and putting its code straight into the loop. The speed advantage may or may not be significant; you'd have to balance it against how clearly the code reads now (quite clear) versus how it would read with that code stuffed into the loop (not quite as clear).
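
Roughly like this, as an untested sketch (it assumes the same $opt_l, $re and @allfiles from your script; the two print branches mirror your files_only and all_lines subs):

Code:
# Untested sketch: the dispatch logic inlined into the loop.
my $want_all_lines = $opt_l ? 1 : 0;
DIRLOOP: foreach my $file (@allfiles) {
    next DIRLOOP if ! -r $file;
    open(INFILE,"<$file") or die "Unable to open file: $file\n";
    while (<INFILE>) {
        next unless /$re/o;
        if ($want_all_lines) {
            print "$file: $_";   # print every matching line
        }
        else {
            print "$file\n";     # print the file name once...
            last;                # ...and move on to the next file
        }
    }
    close(INFILE);
} #end DIRLOOP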

You might also consider using xargs to get around the shell's argument-list limit (something like ls | grep '\.sh$' | xargs grep 'string'), though that wouldn't help with the 2048-byte line limit you mentioned.


Mike

You cannot really appreciate Dilbert unless you've read it in the
original Klingon.

Want great answers to your Tek-Tips questions? Have a look at faq219-2884

 
I would change sub test_it to:

Code:
sub test_it {
opendir(DIR,'.') or die "Unable to read current directory.\n";
my @allfiles = readdir(DIR); #read in all files in directory.
closedir(DIR);
DIRLOOP: foreach my $file (@allfiles) { #Read in each file.
    next DIRLOOP if $file eq '.' or $file eq '..';
    next DIRLOOP if (! -r $file); #Check if file is readable or not.
    open(INFILE,"<$file") or die "Unable to open file: $file\n";
    while(<INFILE>) {
        &$listsub($file,$_) if /$re/; #$re contains the regex
    } #end while
    close(INFILE);
} #end DIRLOOP
close(OUTFILE);
}#end sub test_it

Also, the file test (! -r $file) seems like it might not be necessary, and removing it should speed the script up: each file test costs an extra stat() call per file, or so I have heard.
 
Thanks so much for the advice!
The first thing I changed was &$listsub($file,$_) if /$re/o; - compiling the regex only one time. Great tip, Mike!

I also changed my test_it sub per your recommendation, Kevin. I also removed the -r (is the file readable) check, and that seems to improve the performance - very good idea.

One thing I wonder about in your new subroutine is the check to see if the file is '.' or '..' - If I am scanning 4000 files, that means that I am testing for '.' and '..' 4000 times. Will that reduce performance? That is the reason I had those awkward shift statements outside of the "foreach" loop - so that they would only be tested twice. Maybe there is a better way to eliminate the '.' and '..' files from the array before I get into the foreach loop?

Mike - I am also working on getting rid of my ref'd sub. I just wasn't sure whether an if statement to decide which subroutine to call would be better than the ref'd sub...
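
One idea I'm trying (untested) is a hard code ref instead of the sub's name - a hard ref works under strict, so it would also let me turn "use strict;" back on:

Code:
# Untested sketch: a hard code ref instead of a symbolic one.
my $listsub = $opt_l ? \&all_lines : \&files_only;
# ...and then, inside the while loop:
$listsub->($file, $_) if /$re/o;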

But thank you both for your help! I like the improvements - the code is running faster (I used Benchmark to verify this).

Thanks!

Brian
 
I'm not entirely sure, but aren't . and .. always the first two entries returned by readdir? You could just shift twice, or splice, without checking, if that's always the case.

If that's not the case or you're not that confident, it's certainly faster to check directly against . and .. with eq rather than firing up the regex engine.
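
Either way, you could strip them as you read the directory (untested, but a common idiom):

Code:
my @allfiles = grep { $_ ne '.' and $_ ne '..' } readdir(DIR);

That still tests every entry once, but with a cheap eq rather than the regex engine.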

And as far as -r is concerned, it depends on the behavior you want. If the file isn't readable, then the open on the next line will fail and the script dies. Maybe you can skip the die and just say next, skipping any files you can't read? Again, just depends on the behavior you want.
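
For example (untested), in place of the die:

Code:
open(INFILE,"<$file") or next DIRLOOP;  # quietly skip anything we can't open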

________________________________________
Andrew

I work for a gift card company!
 
Quote (BrianAtWork): "One thing I wonder about in your new subroutine is the check to see if the file is '.' or '..' [...] Maybe there is a better way to eliminate the '.' and '..' files from the array before I get into the foreach loop?"

I would think shifting the array would cause Perl to reindex the entire array two times; I am not sure if that is faster or slower than just checking eq '.' or '..' 4,000 times. 4,000 tests is pretty insignificant, but if you have to squeeze out every millisecond of speed you can benchmark both methods and see which is faster - for example with the sketch below.
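
Something like this would do it (an untested sketch; the 4,000 file names here are just made-up placeholders):

Code:
#!/usr/bin/perl
use strict;
use Benchmark qw(cmpthese);

# Hypothetical directory listing: the two dot entries plus 4,000 names.
my @names = ('.', '..', map { "file$_.sh" } 1 .. 4000);

cmpthese(-2, {
    shift_first => sub {    # shift the dot entries off up front
        my @files = @names;
        shift @files while @files and $files[0] =~ /^\.\.?$/;
        my $count = 0;
        $count++ for @files;
    },
    eq_in_loop => sub {     # test every entry inside the loop
        my @files = @names;
        my $count = 0;
        foreach my $f (@files) {
            next if $f eq '.' or $f eq '..';
            $count++;
        }
    },
});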
 
Is there a reindexing overhead in Perl? I've never heard of one. Most languages just use simple offsets; I don't know whether Perl's dynamic arrays actually require reindexing.

If you're serious about honing the code, check out Devel::Profile (run the script under it with perl -d:Profile). Afterwards you know which chunks consume the most time and are worth your effort. Like all profiling, the instrumented run takes a good bit longer.

________________________________________
Andrew

I work for a gift card company!
 
Try it both ways; benchmarking is the only way to be sure.

Mike

You cannot really appreciate Dilbert unless you've read it in the
original Klingon.

Want great answers to your Tek-Tips questions? Have a look at faq219-2884

 
I ran benchmarks on the different methods (shifting the first two entries off the array before the loop versus an "if" test inside the loop), and the difference is insignificant. Removing the if (! -r) check does make it run about 2-5 seconds faster, though...

The main thing I am going to dig into is removing the subroutine references.

Thank you all very much for your insight! You all helped me greatly, and hopefully this post can help others!

Brian
 