regex and file handling

mailint1 · Nov 10, 2007

I have thousands of files named like these:

c:\input\pumico-home.html
c:\input\ofofo-home.html
c:\input\cimaba-office.html
c:\input\plata-home.html
c:\input\plata-office.html
c:\input\zito-home.html

I need a Perl script that goes through both categories of these files (*-home.html and *-office.html), searching for these regular expressions:

for *-home.html:
regexAone: abc(\d+abc)def
regexAtwo: lmn(\d+ooo)ofg
regexAthree: pqr(\d+kh)stu

for *-office.html:
regexBone: artemis(ao\d+dde)lock
regexBtwo: pretamus(\d+zz)balim

It must write each regular expression match (important: only the part inside the parentheses, not the entire match) to a txt file with the same name as the input html file plus a suffix naming the regular expression that was applied, as defined above, producing output like this:

c:\output\pumico-home-regexAone.txt
c:\output\pumico-home-regexAtwo.txt
c:\output\pumico-home-regexAthree.txt
c:\output\ofofo-home-regexAone.txt
c:\output\ofofo-home-regexAtwo.txt
c:\output\ofofo-home-regexAthree.txt
c:\output\cimaba-office-regexBone.txt
c:\output\cimaba-office-regexBtwo.txt
c:\output\plata-home-regexAone.txt
c:\output\plata-home-regexAtwo.txt
c:\output\plata-home-regexAthree.txt
c:\output\plata-office-regexBone.txt
c:\output\plata-office-regexBtwo.txt
c:\output\zito-home-regexAone.txt
c:\output\zito-home-regexAtwo.txt
c:\output\zito-home-regexAthree.txt
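
The naming rule above can be sketched as a small helper. This is only an illustration, not part of the original post: the name output_path and its parameters are hypothetical, and it assumes Windows-style paths with a backslash separator, as in the listing.

```perl
use strict;
use warnings;

# output_path is a hypothetical helper: it maps an input .html path and a
# regex name to the matching output .txt path inside $out_dir.
sub output_path {
    my ($input, $regex_name, $out_dir) = @_;

    # Take the file name without directory and without the .html extension.
    my ($base) = $input =~ m{([^\\/]+)\.html$}i
        or return;    # not an .html file: nothing to map

    # Assumption: backslash as separator, as in the paths shown in the post.
    return "$out_dir\\$base-$regex_name.txt";
}

print output_path('c:\input\pumico-home.html', 'regexAone', 'c:\output'), "\n";
print output_path('c:\input\cimaba-office.html', 'regexBtwo', 'c:\output'), "\n";
```

Keeping the output directory as a separate argument is what lets input and output live in different places.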

__________________

For example, supposing that the c:\input\cimaba-office.html file contains the following 6 lines:
dfgdfsgdf
setertert
artemisao123456ddelock
garumbzeta
pretamus9999zzbalim
popolissss

c:\output\cimaba-office-regexBone.txt will be generated containing:
ao123456dde

c:\output\cimaba-office-regexBtwo.txt will be generated containing:
9999zz
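
The example above can be checked with a short sketch that keeps only the captured group ($1) for each pattern. The hash and variable names are made up for illustration; the patterns and sample lines are the ones from the post.

```perl
use strict;
use warnings;

# The two *-office.html patterns from the post, keyed by their names.
my %office_regex = (
    regexBone => qr/artemis(ao\d+dde)lock/,
    regexBtwo => qr/pretamus(\d+zz)balim/,
);

# The sample content of cimaba-office.html, one element per line.
my @lines = qw(
    dfgdfsgdf setertert artemisao123456ddelock
    garumbzeta pretamus9999zzbalim popolissss
);

# Collect only the parenthesised part of each match.
my %found;
for my $line (@lines) {
    for my $name (sort keys %office_regex) {
        push @{ $found{$name} }, $1 if $line =~ $office_regex{$name};
    }
}

print "$_: @{ $found{$_} }\n" for sort keys %found;
```

Each pattern has exactly one pair of parentheses, so $1 is always the part that should be written out.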

mailint1 · Nov 10, 2007

I already have two quasi-solutions, but neither supports separate directories for input and output files:

quasi-solution 1:

my $f;
my %f;
while( <DATA> ){
if( /^\s*for\s*(.*):\s*$/ ){
$f=$1;
}elsif( /^\s*regex(\w+):\s*(.*)/ ){
$f{$f}{$1}=qr/$2/;
}elsif( /\S/ ){
warn "unrecognized $_";
}
}
for( values %f ){
my $re = join'|',values %$_;
$_->{':name:'}=['',keys %$_];
$_->{':re:'}=qr/$re/;
}
while( my($k,$v) = each %f ){
local @ARGV=<output/$k>;
my $p='';
my @f;
my $f;
while( <> ){
if( $ARGV ne $p ){
$f=$ARGV;
$f=~s/\.html$//;
$_ && close $_ for @f ;
@f=();
}
if( /$v->{':re:'}/ ){
for my $m( grep defined $+[$_],1..$#- ){
open $f[$m],">>$f$v->{':name:'}[$m]" unless $f[$m];
print {$f[$m]} substr($_,$-[$m],$+[$m]-$-[$m]),"\n";
}
}
}

}
#assuming each regex has exactly one ()
__DATA__
for *-home.html:
regexAone: abc(\d+abc)def
regexAtwo: lmn(\d+ooo)ofg
regexAthree: pqr(\d+kh)stu

for *-office.html:
regexBone: artemis(ao\d+dde)lock
regexBtwo: pretamus(\d+zz)balim

________________________________________

quasi-solution 2:

#!/usr/bin/perl

use strict;

my $path = ".\\";
# my $path = $ARGV[0];
# print "$path\n";

## take all HTML file from the given directory
my @files = glob($path . "*.html");

## for each HTML file
foreach my $file (@files){

## if file name are in given format home
if($file =~ /\-home\.html$/){

## take only the exact file name from the file path
$file =~ s/.*\\//g;
my $op_file = $file;
$op_file =~ s/\.html//g;
print "$file\n";

## open the file and read content for regex matching
open(INFILE, "<$file");
foreach my $line (<INFILE> ){
if($line =~ /abc(\d+abc)def/){
open(OUTFILE, ">>$path\\$op_file"."-regexAone.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
elsif($line =~ /lmn(\d+ooo)ofg/){
open(OUTFILE, ">>$path\\$op_file"."-regexAtwo.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
elsif($line =~ /pqr(\d+kh)stu/){
open(OUTFILE, ">>$path\\$op_file"."-regexAthree.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
}
close(INFILE);

}

## if file name is in the given format office
elsif($file =~ /\-office\.html$/){

## take only the exact file name from the file path
$file =~ s/.*\\//g;
my $op_file = $file;
$op_file =~ s/\.html//g;
print "$file\n";

## open the file and read content for regex matching
open(INFILE, "<$file");
foreach my $line (<INFILE> ){
if($line =~ /artemis(ao\d+dde)lock/){
open(OUTFILE, ">>$path\\$op_file"."-regexBone.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
elsif($line =~ /pretamus(\d+zz)balim/){
open(OUTFILE, ">>$path\\$op_file"."-regexBtwo.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
}
close(INFILE);

}
}

________________________________________

How can I add support for reading and writing the files in two different directories?

PaulTEG · Nov 11, 2007

opendir, readdir and closedir is one way, and File::Find from CPAN is another.

Paul
------------------------------------
Spend an hour a week on CPAN, helps cure all known programming ailments ;-)
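
A minimal sketch of the opendir/readdir route, assuming the rules from the original question. The helper name process_dir and the %rules layout are hypothetical. It appends to the output files as quasi-solution 2 does, so running it twice duplicates lines, and an output file is only created once a pattern actually matches.

```perl
use strict;
use warnings;
use File::Spec;

# Patterns from the question, keyed by the file-name category.
my %rules = (
    home => {
        regexAone   => qr/abc(\d+abc)def/,
        regexAtwo   => qr/lmn(\d+ooo)ofg/,
        regexAthree => qr/pqr(\d+kh)stu/,
    },
    office => {
        regexBone => qr/artemis(ao\d+dde)lock/,
        regexBtwo => qr/pretamus(\d+zz)balim/,
    },
);

# process_dir (hypothetical helper) reads *-home.html and *-office.html from
# $in_dir and appends each capture to a file in $out_dir.
# Returns the number of input files processed.
sub process_dir {
    my ($in_dir, $out_dir) = @_;

    opendir my $dh, $in_dir or die "Cannot open $in_dir: $!";
    my @names = grep { /-(?:home|office)\.html$/ } readdir $dh;
    closedir $dh;

    for my $name (@names) {
        # $base is e.g. "cimaba-office", $kind is "home" or "office".
        my ($base, $kind) = $name =~ /^(.+-(home|office))\.html$/;

        open my $in, '<', File::Spec->catfile($in_dir, $name)
            or die "Cannot read $name: $!";
        while (my $line = <$in>) {
            for my $rx (sort keys %{ $rules{$kind} }) {
                next unless $line =~ $rules{$kind}{$rx};
                my $capture  = $1;
                my $out_file = File::Spec->catfile($out_dir, "$base-$rx.txt");
                open my $out, '>>', $out_file
                    or die "Cannot write $out_file: $!";
                print {$out} "$capture\n";
                close $out;
            }
        }
        close $in;
    }
    return scalar @names;
}

# Example call with the directories from the question:
# process_dir('c:/input', 'c:/output');
```

readdir returns bare file names, so the directory has to be joined back on for both reading and writing; that is where the two different directories come in.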

travs69 · Nov 11, 2007

I second File::Find

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
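
A minimal sketch of the File::Find route for collecting the input files. File::Find ships with the standard Perl distribution. The helper name find_html is hypothetical. Unlike readdir, find also descends into subdirectories, which may or may not be wanted here.

```perl
use strict;
use warnings;
use File::Find;

# find_html (hypothetical helper) returns the full paths of all
# *-home.html and *-office.html files under $in_dir, sorted.
sub find_html {
    my ($in_dir) = @_;
    my @found;

    find(
        sub {
            # Inside the callback $_ is the bare file name and
            # $File::Find::name is the path including the directory.
            push @found, $File::Find::name
                if -f $_ && /-(?:home|office)\.html$/;
        },
        $in_dir,
    );

    my @sorted = sort @found;
    return @sorted;
}

# Example call with the input directory from the question:
# print "$_\n" for find_html('c:/input');
```

Each returned path can then be opened for reading, with the output file name built separately against the output directory.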
