regex and file handling

Status
Not open for further replies.

mailint1

Technical User
Nov 10, 2007
13
IT
I have thousands of files named like these:

c:\input\pumico-home.html
c:\input\ofofo-home.html
c:\input\cimaba-office.html
c:\input\plata-home.html
c:\input\plata-office.html
c:\input\zito-home.html

I need a Perl script that passes through both categories of these files (*-home.html and *-office.html), searching for some regular expressions:

for *-home.html:
regexAone: abc(\d+abc)def
regexAtwo: lmn(\d+ooo)ofg
regexAthree: pqr(\d+kh)stu

for *-office.html:
regexBone: artemis(ao\d+dde)lock
regexBtwo: pretamus(\d+zz)balim

It must output each regular expression match result (important: only the part of it within the parentheses, not the entire match) to a txt file with the same name as the input html file plus a suffix corresponding to the name of the processed regular expression as defined above, producing an output scenario like this:

c:\output\pumico-home-regexAone.txt
c:\output\pumico-home-regexAtwo.txt
c:\output\pumico-home-regexAthree.txt
c:\output\ofofo-home-regexAone.txt
c:\output\ofofo-home-regexAtwo.txt
c:\output\ofofo-home-regexAthree.txt
c:\output\cimaba-office-regexBone.txt
c:\output\cimaba-office-regexBtwo.txt
c:\output\plata-home-regexAone.txt
c:\output\plata-home-regexAtwo.txt
c:\output\plata-home-regexAthree.txt
c:\output\plata-office-regexBone.txt
c:\output\plata-office-regexBtwo.txt
c:\output\zito-home-regexAone.txt
c:\output\zito-home-regexAtwo.txt
c:\output\zito-home-regexAthree.txt

__________________

For example, supposing that the c:\input\cimaba-office.html file contains the following 6 lines:
dfgdfsgdf
setertert
artemisao123456ddelock
garumbzeta
pretamus9999zzbalim
popolissss

c:\output\cimaba-office-regexBone.txt will be generated containing:
ao123456dde

c:\output\cimaba-office-regexBtwo.txt will be generated containing:
9999zz
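
The expected behaviour for this example can be sketched in a few lines. This is only a minimal illustration, assuming the two *-office.html rules from the list above; the helper name extract_captures is hypothetical and the sample lines are held in memory instead of being read from a file.

```perl
use strict;
use warnings;

# Rule table for *-office.html files, copied from the post.
my %office_rules = (
    regexBone => qr/artemis(ao\d+dde)lock/,
    regexBtwo => qr/pretamus(\d+zz)balim/,
);

# extract_captures (hypothetical name) returns a hash reference mapping
# each rule name to the list of captured substrings, i.e. only $1.
sub extract_captures {
    my ($rules, @lines) = @_;
    my %found;
    for my $line (@lines) {
        for my $name (sort keys %$rules) {
            # In list context a successful match returns its capture groups.
            if (my ($capture) = $line =~ $rules->{$name}) {
                push @{ $found{$name} }, $capture;
            }
        }
    }
    return \%found;
}

# The six sample lines of cimaba-office.html.
my @sample = qw(dfgdfsgdf setertert artemisao123456ddelock
                garumbzeta pretamus9999zzbalim popolissss);

my $found = extract_captures(\%office_rules, @sample);
print "$_: @{ $found->{$_} }\n" for sort keys %$found;
```

Run on the sample lines, this prints "regexBone: ao123456dde" and "regexBtwo: 9999zz", which matches the contents expected in the two output files.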
 
I already have two quasi-solutions, but both fail to use two different directories for input and output files:

quasi-solution 1:

my $f;
my %f;
while( <DATA> ){
if( /^\s*for\s*(.*):\s*$/ ){
$f=$1;
}elsif( /^\s*regex(\w+):\s*(.*)/ ){
$f{$f}{$1}=qr/$2/;
}elsif( /\S/ ){
warn "unrecognized $_";
}
}
for( values %f ){
my $re = join'|',values %$_;
$_->{':name:'}=['',keys %$_];
$_->{':re:'}=qr/$re/;
}
while( my($k,$v) = each %f ){
local @ARGV=<output/$k>;
my $p='';
my @f;
my $f;
while( <> ){
if( $ARGV ne $p ){
$p=$ARGV; # remember the current file so handles are reset only when it changes
$f=$ARGV;
$f=~s/\.html$//;
$_ && close $_ for @f ;
@f=();
}
if( /$v->{':re:'}/ ){
for my $m( grep defined $+[$_],1..$#- ){
open $f[$m],">>$f$v->{':name:'}[$m]" unless $f[$m];
print {$f[$m]} substr($_,$-[$m],$+[$m]-$-[$m]),"\n";
}
}
}

}
#assuming each regex has exactly one ()
__DATA__
for *-home.html:
regexAone: abc(\d+abc)def
regexAtwo: lmn(\d+ooo)ofg
regexAthree: pqr(\d+kh)stu

for *-office.html:
regexBone: artemis(ao\d+dde)lock
regexBtwo: pretamus(\d+zz)balim



________________________________________



quasi-solution 2:

#!/usr/bin/perl

use strict;

my $path = ".\\";
# my $path = $ARGV[0];
# print "$path\n";

## take all HTML files from the given directory
my @files = glob($path . "*.html");

## for each HTML file
foreach my $file (@files){

## if the file name is in the -home format
if($file =~ /\-home\.html$/){

## take only the exact file name from the file path
$file =~ s/.*\\//g;
my $op_file = $file;
$op_file =~ s/\.html//g;
print "$file\n";

## open the file and read content for regex matching
open(INFILE, "<$file");
foreach my $line (<INFILE> ){
if($line =~ /abc(\d+abc)def/){
open(OUTFILE, ">>$path\\$op_file"."-regexAone.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
elsif($line =~ /lmn(\d+ooo)ofg/){
open(OUTFILE, ">>$path\\$op_file"."-regexAtwo.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
elsif($line =~ /pqr(\d+kh)stu/){
open(OUTFILE, ">>$path\\$op_file"."-regexAthree.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
}
close(INFILE);

}

## if the file name is in the -office format
elsif($file =~ /\-office\.html$/){

## take only the exact file name from the file path
$file =~ s/.*\\//g;
my $op_file = $file;
$op_file =~ s/\.html//g;
print "$file\n";

## open the file and read content for regex matching
open(INFILE, "<$file");
foreach my $line (<INFILE> ){
if($line =~ /artemis(ao\d+dde)lock/){
open(OUTFILE, ">>$path\\$op_file"."-regexBone.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
elsif($line =~ /pretamus(\d+zz)balim/){
open(OUTFILE, ">>$path\\$op_file"."-regexBtwo.txt ");
print OUTFILE "$1\n";
close(OUTFILE);
}
}
close(INFILE);

}
}


________________________________________


How can I add support for reading and writing the files in two different directories?
 
opendir, readdir and closedir are one way, and File::Find (a core module, documented on CPAN) is another

Paul
------------------------------------
Spend an hour a week on CPAN, helps cure all known programming ailments ;-)
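
A minimal sketch of the opendir/readdir/closedir route, assuming the rule names and patterns from the question. The helper name process_dir is hypothetical, and the demo at the bottom runs in temporary directories so it touches nothing real; for actual use, pass 'c:/input' and 'c:/output' instead. Unlike the quasi-solutions, it opens each output file with '>' (overwrite) once per run instead of appending.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Rules keyed by file-name suffix, copied from the question.
my %rules = (
    '-home.html' => {
        regexAone   => qr/abc(\d+abc)def/,
        regexAtwo   => qr/lmn(\d+ooo)ofg/,
        regexAthree => qr/pqr(\d+kh)stu/,
    },
    '-office.html' => {
        regexBone => qr/artemis(ao\d+dde)lock/,
        regexBtwo => qr/pretamus(\d+zz)balim/,
    },
);

# process_dir is a hypothetical helper: read from $in_dir, write to $out_dir.
sub process_dir {
    my ($in_dir, $out_dir) = @_;

    opendir(my $dh, $in_dir) or die "Cannot open $in_dir: $!";
    my @names = sort grep { /\.html\z/ } readdir $dh;
    closedir $dh;

    for my $name (@names) {
        # Pick the rule set whose suffix ends this file name, if any.
        my ($suffix) = grep { $name =~ /\Q$_\E\z/ } keys %rules;
        next unless defined $suffix;
        (my $base = $name) =~ s/\.html\z//;

        my $in_path = File::Spec->catfile($in_dir, $name);
        open(my $in, '<', $in_path) or die "Cannot read $in_path: $!";

        my %out;    # one output handle per rule, opened on first match
        while (my $line = <$in>) {
            for my $rule (sort keys %{ $rules{$suffix} }) {
                next unless $line =~ $rules{$suffix}{$rule};
                my $capture = $1;    # only the parenthesised part
                unless ($out{$rule}) {
                    my $out_path =
                        File::Spec->catfile($out_dir, "$base-$rule.txt");
                    open($out{$rule}, '>', $out_path)
                        or die "Cannot write $out_path: $!";
                }
                print { $out{$rule} } "$capture\n";
            }
        }
        close $_ for values %out;
        close $in;
    }
    return;
}

# Demo in throwaway directories holding the sample file from the question.
my $in_dir  = tempdir(CLEANUP => 1);
my $out_dir = tempdir(CLEANUP => 1);
my $sample  = File::Spec->catfile($in_dir, 'cimaba-office.html');
open(my $fh, '>', $sample) or die "Cannot write $sample: $!";
print {$fh} "$_\n" for qw(dfgdfsgdf setertert artemisao123456ddelock
                          garumbzeta pretamus9999zzbalim popolissss);
close $fh;

process_dir($in_dir, $out_dir);
```

Because the input and output paths are each built with File::Spec->catfile from their own directory argument, the two locations stay independent. An output file is created only for rules that match at least once.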
 
I second File::Find

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
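
A rough sketch of the File::Find route, for the case where the input files may sit in subdirectories. The helper names find_inputs and output_path are hypothetical, and the demo only builds a throwaway directory tree and prints the output paths it would use; the matching itself is left out.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename qw(basename);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# find_inputs (hypothetical name): collect every *-home.html and
# *-office.html file under $in_dir, descending into subdirectories.
sub find_inputs {
    my ($in_dir) = @_;
    my @found;
    find({
        no_chdir => 1,    # do not chdir; $_ then holds the full path
        wanted   => sub {
            push @found, $File::Find::name
                if -f $_ && /-(?:home|office)\.html\z/;
        },
    }, $in_dir);
    my @sorted = sort @found;
    return @sorted;
}

# output_path (hypothetical name): <out_dir>/<input name minus .html>-<rule>.txt
sub output_path {
    my ($in_path, $out_dir, $rule) = @_;
    my $base = basename($in_path, '.html');
    return File::Spec->catfile($out_dir, "$base-$rule.txt");
}

# Demo: a temporary tree with one nested input and one unrelated file.
my $in_dir = tempdir(CLEANUP => 1);
make_path(File::Spec->catdir($in_dir, 'sub'));
for my $rel ('zito-home.html', 'sub/plata-office.html', 'notes.txt') {
    my $path = "$in_dir/$rel";
    open(my $fh, '>', $path) or die "Cannot create $path: $!";
    close $fh;
}

my @inputs = find_inputs($in_dir);
for my $in_path (@inputs) {
    my @rule_names = $in_path =~ /-home\.html\z/
        ? qw(regexAone regexAtwo regexAthree)
        : qw(regexBone regexBtwo);
    print output_path($in_path, 'c:/output', $_), "\n" for @rule_names;
}
```

Since output_path keeps only the base name of each input, files found in subdirectories all land directly in the output directory; two inputs with the same name in different subdirectories would collide, so this assumes names are unique.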
 