
Efficient File Parsing


drkestrel (MIS)
Sep 25, 2000
I have a Perl script that needs to do the following-
Code:
For all files in a known directory
  Do some validation
  If validation succeeds Then
    Concatenate lines 2..last of this file
    (and of every other file that passes) into a string
  End If
End For
Now print that string into a file

What I did to grab lines 2..n is something like the following:
Code:
$theFiles= "";  #the string variable storing the 
                #concatenated files
forAllFiles
{
  open(ONEOFTHEFILE,$fileName);
   #If vadliation succeeds 
    {
      @aFile = ONEOFTHEFILE;
      splice @aFile,0,1;
      foreach $eachLine(@aFile) 
      {
           $theFiles .= $eachLine ."\n";
      }
  }
}
#open another file and print $theFiles to a new file.
The important thing is that I don't want to alter the original files, but it seems the splicing is not very efficient: processing about 20 files totalling 7 MB takes 19 minutes!

Are there any more efficient methods of doing what I need to do?
It would be nice if I could, for instance, transfer lines 2..N of @aFile to a string without using the foreach loop, as sketched below. Possible or not?
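(Something like this hypothetical slice-and-join, using the variables from the code above:)
Code:
# append lines 2..N in one go -- no explicit loop
$theFiles .= join("", @aFile[1 .. $#aFile]);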
 
NOTICE: in your post, the text "&#64;" is really an "@". The TGML parser seems to mess up when "@"s are inside "code" brackets, seemingly trying to 'escape' them one time too many before the text gets to HTML ("&#64;" turns into an "@" when left raw in HTML). I use the "tt" bracket delimiters and don't seem to get this problem.

Now, as to the question... the issue you want addressed is speed.
[tt]
$theFiles .= join(&quot;&quot;, <ONEOFTHEFILES>);
[/tt]

This is listed in the Perl 5.6 documentation (somewhere in the porting notes) as the best way to read a file into a string: it reads the whole file in at once, but never assigns it to a named array, just processes it straight into the string.
The only thing you have to do first is read the first line of the file and do nothing with it, so:
[tt]
scalar <ONEOFTHEFILES>; # in scalar context this reads (and discards) just line 1
$theFiles .= join("", <ONEOFTHEFILES>);
[/tt]
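
Putting that together with the loop over the directory, a rough, untested sketch (the directory path and the validation step are placeholders) might look like:
[tt]
my $theFiles = "";
opendir(DIR, "/known/directory") or die "can't opendir: $!";
foreach my $name (readdir(DIR)) {
    my $path = "/known/directory/$name";
    next unless -f $path;            # skips ".", ".." and subdirectories
    # ...validation would go here...
    open(ONEOFTHEFILES, $path) or next;
    scalar <ONEOFTHEFILES>;          # throw away line 1
    $theFiles .= join("", <ONEOFTHEFILES>);
    close(ONEOFTHEFILES);
}
closedir(DIR);
[/tt]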

One last thing: although this may seem faster logically, I'm not certain it actually will be. Speed problems can usually only be addressed with side-by-side benchmarks or other speed-testing functions. If you can isolate the potential bottlenecks in your code and put each of them in its own separate benchmark to find out how efficient it is, you can then be certain of where the hangup is. It may be that the function is slow due to something else entirely. "If you think you're too small to make a difference, try spending a night in a closed tent with a mosquito."
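
For instance, the standard Benchmark module can compare the two approaches side by side (sample.txt standing in for one of your real files):
[tt]
use Benchmark qw(cmpthese);

cmpthese(-3, {                        # run each sub for at least 3 CPU seconds
    splice_loop => sub {
        open(FH, "sample.txt") or die $!;
        my @lines = <FH>;
        close(FH);
        splice(@lines, 0, 1);
        my $s = "";
        $s .= $_ foreach @lines;
    },
    skip_and_join => sub {
        open(FH, "sample.txt") or die $!;
        scalar <FH>;                  # discard line 1
        my $s = join("", <FH>);
        close(FH);
    },
});
[/tt]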
 
From the Perl documentation. I've never used it, but it looks good, and I may well do:


File::Slurp -- single call read &amp; write file routines; read directories


--------------------------------------------------------------------------------

SUPPORTED PLATFORMS
Linux
Solaris
Windows
This module is not included with the standard ActivePerl distribution. It is available as a separate download using PPM.
--------------------------------------------------------------------------------

SYNOPSIS
use File::Slurp;
$all_of_it = read_file($filename);
@all_lines = read_file($filename);
write_file($filename, @contents);
overwrite_file($filename, @new_contents);
append_file($filename, @additional_contents);
@files = read_dir($directory);


--------------------------------------------------------------------------------

DESCRIPTION
These are quickie routines that are meant to save a couple of lines of code over and over again. They do not do anything fancy.


read_file() does what you would expect. If you are using its output in array context, then it returns an array of lines. If you are calling it from scalar context, then it returns the entire file in a single string. It croak()s if it can't open the file.

write_file() creates or overwrites files.

append_file() appends to a file.

overwrite_file() does an in-place update of an existing file or creates a new file if it didn't already exist. write_file() will also replace a file. The difference is that the first thing write_file() does is truncate the file, whereas the last thing overwrite_file() does is truncate the file. overwrite_file() should be used in situations where you have a file that always needs to have contents, even in the middle of an update.

read_dir() returns all of the entries in a directory except for ``.'' and ``..''. It croaks if it cannot open the directory.



--------------------------------------------------------------------------------

AUTHOR
David Muir Sharnoff <muir@idiom.com>
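
For the job in this thread, a minimal (untested) sketch using those routines might look like this; the directory and output file names are just placeholders:
Code:
use File::Slurp;

my $theFiles = "";
foreach my $name (read_dir("/known/directory")) {
    # list context: read_file() returns one element per line
    my @lines = read_file("/known/directory/$name");
    shift @lines;                      # drop line 1
    $theFiles .= join("", @lines);
}
write_file("combined.txt", $theFiles);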

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
Cheers for suggesting the join() operation :)
I should have looked at the Efficiency section of Programming Perl!

Anyway, I still have to do the following-
splice(@lines, 0, 1);

$allFiles .= join("", @lines);

as I couldn't get the scalar version working. It turns out that the splicing isn't that slow after all... it is now taking only 45 seconds instead of 18 minutes!!
:) hurray
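
So the whole thing now boils down to something like this (the file list and the validation step are placeholders):
Code:
my $allFiles = "";
foreach my $fileName (@fileNames) {    # @fileNames filled in elsewhere
    open(FH, $fileName) or next;
    my @lines = <FH>;                  # original file is only read, never written
    close(FH);
    splice(@lines, 0, 1);              # drop line 1
    $allFiles .= join("", @lines);
}
open(OUT, ">combined.txt") or die "can't write: $!";
print OUT $allFiles;
close(OUT);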
 