Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Perl regex questions for perl script using file::find 1

Status
Not open for further replies.

darthjef

Technical User
Joined
Jul 23, 2007
Messages
8
Location
CA
darthjef (TechnicalUser)
11 Sep 07 12:54
Perl regex questions for perl script using file::find

Hi Im having a problem getting my perl script to run properly. I'm new to perl and programming in general and am at a loss as to how to proceed.

I am trying to capture
a group variable ($1) from a regex (m/()/), save it to another variable ($date)
and then use the captured variable ($date) in a replace
of another regex (s/$1/g)
using file::find.

I have been trying to save the $1 to another variable and I don't know enough perl
to figure out what I am doing wrong.

I am looking at txt files of bibliographic records in MARC format.
Im trying to capture a date from a line and move it to another line in the record:

=099 \\$a336.2420971 CCH 2008 (CCH 2008 is the pattern I am looking for)
--match this pattern and then capture the date (2008)
m/\w\w\w\s(\d\d\d\d)/ -- I need to match this and then
capture the $1 as $date
...
--replace $o119$ywrqr with $o119$v2008$ywrq
(s/(\$o\d*)(\$yw\w\w*)/$1\$v$date$2/g)

=945 \1$h33420010942390$l120.00$o119$ywrqr
to
=945 \1$h33420010942390$l120.00$o119$v2008$ywrqr


my script is as follows:

#!/usr/bin/perl

use strict;
use diagnostics;
use File::Find;

my $startdir = 'c:/iw/WA/';
my $doctype = 'mrk';
my $regex = m/\w\w\w\s(\d\d\d\d)/; (don't think I am doing this correctly)
my $date = '';
$date = $1;

print qq~Finding 099 yyyy and removing it from 099...\n~;

find(
sub{
return unless (/\.$doctype$/i);
local @ARGV = $_;
local $^I = '.bac';
while( <> ){
if(s/(\$o\d*)(\$yw\w\w*)/$1\$v$date$2/g){
print;
}
else {
print;
}
}
}, $startdir);

When I run the script I get the following error messages:

Use of uninitialized value in pattern match (m//) at c:\PERL\BIN\wattst5.pl
line 9 (#1)
(W uninitialized) An undefined value was used as if it were already
defined. It was interpreted as a "" or a 0, but maybe it was a mistake.
To suppress this warning assign a defined value to your variables.

To help you figure out what was undefined, perl tells you what operation
you used the undefined value in. Note, however, that perl optimizes your
program and the operation displayed in the warning may not necessarily
appear literally in your program. For example, "that $foo" is
usually optimized into "that " . $foo, and the warning will refer to
the concatenation (.) operator, even though there is no . in your
program.

Finding 099 yyyy and removing it from 099...

Im trying to capture the $1 of my m// in $regex as $date but I obviously don't
know enough about perl to get this to work. If I was doing a s/// on the same line of text
I would be fine, but because I am trying to do a match on one line, capture a group variable,
then move that value to another line in a replace s/// I am at a loss as to how to get that to work.

Included is a sample record to process:

=LDR 01103cas 2200313 a 4500
=001 cn\75033109\
=005 20070829142812.0
=008 760206c19669999oncur4p\\\\\\\0\\\a0eng\d
=010 \\$acn 75033109
=020 \\$a9781553677604 (pbk. : 84th ed. : Summer : 2008) :$c{dollar}120.00
=020 \\$a9781553677437 (pbk. : 83rd ed. : Spring : 2007)
=022 \\$a0317-9060
=040 \\$aCaOONL$beng$cCaOONL$dm.c.$dNST$dlbi$dlbi
=042 \\$anlc$aisds/c
=043 \\$an-cn---
=082 04$a343/.71/052$222
=099 \\$a336.2420971 CCH 2008
=110 2\$aCanada.
=222 \0$aCanadian income tax act with income tax regulations
=245 10$aCanadian income tax act with regulations.
=260 \\$aDon Mills, Ont.,$bCCH Canadian,$c
=300 \\$av. ;$c23 cm.
=362 0\$a35th ed.(1966)-
=500 \\$aDescription based on current issue.
=650 \0$aIncome tax$xLaw and legislation$zCanada.
=650 \0$aIncome tax$zCanada$xLaw.
=710 2\$aC C H Canadian Limited.
=780 10$aCanada. Laws, Statutes, etc.$t[Income tax act] Canadian income tax act.
=945 \1$h33420010942390$l120.00$o119$ywrqr


Any help would be greatly appreciated.
 
as you suspect these lines are not correct:

Code:
my $regex = m/\w\w\w\s(\d\d\d\d)/; (don't think I am doing this correctly)
my $date = '';
$date = $1;

right now I don't have the time to look over your lengthy post. I will check back later today and see if I can help if help is still needed.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
That would be great! Thanks
 
If I've understood you correctly:

Code:
[gray]#!/usr/bin/perl[/gray]

[url=http://perldoc.perl.org/functions/use.html][black][b]use[/b][/black][/url] [green]strict[/green][red];[/red]
[black][b]use[/b][/black] [green]diagnostics[/green][red];[/red]
[black][b]use[/b][/black] [green]File::Find[/green][red];[/red]

[url=http://perldoc.perl.org/functions/my.html][black][b]my[/b][/black][/url] [blue]$startdir[/blue] = [red]'[/red][purple]c:/iw/WA/[/purple][red]'[/red][red];[/red]
[black][b]my[/b][/black] [blue]$doctype[/blue]  = [red]'[/red][purple]mrk[/purple][red]'[/red][red];[/red]
[black][b]my[/b][/black] [blue]$regex[/blue]    = [red]'[/red][purple]\w\w\w\s(\d\d\d\d)$[/purple][red]'[/red][red];[/red]

[url=http://perldoc.perl.org/functions/print.html][black][b]print[/b][/black][/url] [red]qq~[/red][purple]Finding 099 yyyy and removing it from 099...\n[/purple][red]~[/red][red];[/red]

[maroon]find[/maroon][red]([/red]
  [url=http://perldoc.perl.org/functions/sub.html][black][b]sub[/b][/black][/url] [red]{[/red]
    [url=http://perldoc.perl.org/functions/return.html][black][b]return[/b][/black][/url] [olive][b]unless[/b][/olive] [red]([/red][red]/[/red][purple][purple][b]\.[/b][/purple][blue]$doctype[/blue]$[/purple][red]/[/red][red]i[/red][red])[/red][red];[/red]
    [black][b]my[/b][/black] [blue]$year[/blue]=[red]'[/red][purple][/purple][red]'[/red][red];[/red]
    [url=http://perldoc.perl.org/functions/local.html][black][b]local[/b][/black][/url] [blue]@ARGV[/blue] = [blue]$_[/blue][red];[/red]
    [black][b]local[/b][/black] [blue]$^I[/blue] = [red]'[/red][purple].bac[/purple][red]'[/red][red];[/red]
    [olive][b]while[/b][/olive][red]([/red] <> [red])[/red] [red]{[/red] 
      [blue]$year[/blue] = [blue]$1[/blue] [olive][b]if[/b][/olive] [red]([/red]![blue]$year[/blue] and [red]/[/red][purple][blue]$regex[/blue][/purple][red]/[/red][red])[/red][red];[/red]
      [red]s/[/red][purple]([purple][b]\$[/b][/purple]o[purple][b]\d[/b][/purple]*)([purple][b]\$[/b][/purple]yw[purple][b]\w[/b][/purple]+)[/purple][red]/[/red][purple][blue]$1[/blue][purple][b]\$[/b][/purple]v[blue]$year[/blue][blue]$2[/blue][/purple][red]/[/red][red]g[/red] [olive][b]if[/b][/olive] [blue]$year[/blue][red];[/red]
      [black][b]print[/b][/black][red];[/red]
    [red]}[/red]
[red]}[/red], [blue]$startdir[/blue][red])[/red][red];[/red]
[tt]------------------------------------------------------------
Pragmas (perl 5.8.8) used :
[ul]
[li]diagnostics - Produce verbose warning diagnostics[/li]
[li]strict - Perl pragma to restrict unsafe constructs[/li]
[/ul]
Core (perl 5.8.8) Modules used :
[ul]
[li]File::Find - Traverse a directory tree.[/li]
[/ul]
[/tt]

The line that was causing you to error was because you were not checking the regex against anything.

I am assuming you want to scan each .mrk file for that year marker, and then update a later line in each such file with the relevant year marker.

This means having an indicator that you have found the year, and performing the substitution after that. I used $year.

I'm wondering if this might throw up false positives for the year and the substitutions though.

Your print comment about the =099 lines seems to indicate that the year occurs on that particular line, in which case, the regex could/should be modified to reflect that.

Likewise with the line you wish to substitute on. Does that occur on particular lines (=945, for example)?

Your print comment also mentions 'removing', where none is done.
 
Thanks so much brigmar!

That is exactly what I want the script to do. the removal of the date from 099 I can handle from here.

Again thanks so much for the help.

darthjef
 
Hi again brigmar. I spoke too soon. I actually need it to pick up the date for each individual record. The files to process contain a number of records each with a potentially different date in the 099 field

the script as it is grabs the date from the first record and then globally processes the file.

would changing $year = $1 if (!$year and /$regex/);
to $year = $1 if (!$year and /$regex/g);
fix my problem? I will test that.

thanks,
darthjef.
 
Just tested that and its still returning the first record's date in all the other records.
Again I am stumped.

Any more help would be appreciated!

Thanks

darthjef.
 
I don't believe that change will work for you.
I think you could just change the line to:
Code:
$year = $1 if /$regex/;

It's still a hack though.
What is the record separator? If that was known, the whole thing (capture/removal/insertion) could be done in one statement. We're reading line by line and having to remember stuff, as we're using CR (or CRLF) as the record separator.

Also, I don't know anything about the MARC format, but I did notice that there are CPAN modules that cover your requirements.
 
The code change didn't work. The file is actually just a text file so I think its just a CR. Here is an example of a 2 record file. (these aren't actually MARC records but text representations of them)

=LDR 00877nam 2200265 a 4500
=001 lbk00003424\
=005 20070829165504.0
=008 070823s2007\\\\onca\\\a\\\\\\000\p\eng\d
=015 \\$aC2007900606X
=020 \\$a9780888997760 (hc) :$c{dollar}17.95
=035 \\$a.b20610993
=040 \\$aCaOONL$beng$cCaOONL$dlbi
=082 00$ajC811/.54$222
=082 04$a819/.154$222
=099 \\$a819.154 Wit 2005
=100 1\$aWithrow, Sarah,$d1966-
=245 10$aBe a baby /$cSarah Withrow ; pictures by Manuel Monroy.
=260 \\$aToronto :$bGroundwood Books/House of Anansi Press,$cc2007.
=300 \\$a1 v. :$bcol. ill. ;$c21 cm.
=650 \5$aLullabies, Canadian (English)
=650 \5$aChildren's poetry, Canadian (English)
=700 1\$aMonroy, Manuel,$d1970-
=945 \1$h33420010944602$l17.95$o107$ywjn
=945 \1$h33420010944610$l17.95$o107$ywjn
=945 \1$h33420010944628$l17.95$o107$ywmjn

=LDR 01014cam 2200301 a 4500
=001 \\2007006989
=003 DLC
=005 20070829165628.0
=008 070319s2007\\\\nyuab\\c\\\\\\000\1\eng\\
=010 \\$a 2007006989
=020 \\$a9780061131646 (hc) :$c{dollar}19.99
=020 \\$a9780061131660 (lib. bdg.)
=020 \\$a9780061131677 (pbk.)
=035 \\$a.b20604658
=040 \\$aDLC$beng$cDLC$dCaOONL$dlbi
=082 00$a[Fic]$222
=099 \\$aHUNTE 1999
=100 1\$aHunter, Erin.
=245 10$aFirestar's quest /$cErin Hunter.
=250 \\$a1st ed., Super ed.
=260 \\$aNew York :$bHarperCollins Childrens Books,$cc2007.
=300 \\$a510, 2 p. :$bill., map ;$c22 cm.
=440 \0$aWarriors ;$vbk. [7]
=520 \\$aFirestar, leader of the Thunder Clan, sets off on a harrowing journey to find a long-lost clan of cats that had been forced to leave the forest many moons ago.
=650 \1$aCats$vFiction.
=650 \1$aFantasy.
=945 \1$h33420010944396$l19.99$o107$ywjf
=945 \1$h33420010944404$l19.99$o107$ywjf
 
I have to catch a train soon, but can I ask this ?
Does the year always appear on the =099 lines, and is there only ever one =099 line per record ?
If so, then :
Code:
[maroon]find[/maroon][red]([/red]
  [url=http://perldoc.perl.org/functions/sub.html][black][b]sub[/b][/black][/url] [red]{[/red]
    [url=http://perldoc.perl.org/functions/return.html][black][b]return[/b][/black][/url] [olive][b]unless[/b][/olive] [red]([/red][red]/[/red][purple][purple][b]\.[/b][/purple][blue]$doctype[/blue]$[/purple][red]/[/red][red]i[/red][red])[/red][red];[/red]
    [url=http://perldoc.perl.org/functions/my.html][black][b]my[/b][/black][/url] [blue]$year[/blue]=[red]'[/red][purple][/purple][red]'[/red][red];[/red]
    [url=http://perldoc.perl.org/functions/local.html][black][b]local[/b][/black][/url] [blue]@ARGV[/blue] = [blue]$_[/blue][red];[/red]
    [black][b]local[/b][/black] [blue]$/[/blue] = [red]"[/red][purple][purple][b]\n[/b][/purple][purple][b]\n[/b][/purple][/purple][red]"[/red][red];[/red]
    [black][b]local[/b][/black] [blue]$^I[/blue] = [red]'[/red][purple].bac[/purple][red]'[/red][red];[/red]
    [olive][b]while[/b][/olive][red]([/red] <> [red])[/red] [red]{[/red] 
      [olive][b]if[/b][/olive] [red]([/red][red]/[/red][purple]=099.*?([purple][b]\d[/b][/purple][purple][b]\d[/b][/purple][purple][b]\d[/b][/purple][purple][b]\d[/b][/purple])[purple][b]\n[/b][/purple][/purple][red]/[/red][red]s[/red][red])[/red] [red]{[/red]
        [blue]$year[/blue] = [blue]$1[/blue][red];[/red]
        [red]s/[/red][purple](=099.*?)([purple][b]\d[/b][/purple][purple][b]\d[/b][/purple][purple][b]\d[/b][/purple][purple][b]\d[/b][/purple])[purple][b]\n[/b][/purple][/purple][red]/[/red][purple][blue]$1[/blue][purple][b]\n[/b][/purple][/purple][red]/[/red][red]s[/red][red];[/red]
        [red]s/[/red][purple]([purple][b]\$[/b][/purple]o[purple][b]\d[/b][/purple]*)([purple][b]\$[/b][/purple]yw[purple][b]\w[/b][/purple]+)[/purple][red]/[/red][purple][blue]$1[/blue][purple][b]\$[/b][/purple]v[blue]$year[/blue][blue]$2[/blue][/purple][red]/[/red][red]g[/red][red];[/red]
        [url=http://perldoc.perl.org/functions/print.html][black][b]print[/b][/black][/url][red];[/red]
      [red]}[/red]
    [red]}[/red]
[red]}[/red], [blue]$startdir[/blue][red])[/red][red];[/red]

By making a local version of the record separator ($/), we can tell perl what defines a record and pull in an entire MARC record content each time instead of a line.
 
Thinks: I should probably have included a \n before the =099 in those regexes...
 
Yes the date would only be in =099 and there is only one per record.

so to include \n before =099 did you mean this:

if (n\/=099.*?(\d\d\d\d)\n/s) {
$year = $1;
s/(\n=099.*?)(\d\d\d\d)\n/$1\n/s; (not sure if that one
is right)
s/(\$o\d*)(\$yw\w+)/$1\$v$year$2/g;

Thanks for all the help on this. It is actually starting to make a little more sense.
 
I just tested it with

find(
sub {
return unless (/\.$doctype$/i);
my $year='';
local @ARGV = $_;
local $/ = "\n\n";
local $^I = '.bac';
while( <> ) {
if (/=099.*?(\d\d\d\d)\n/s) {
$year = $1;
s/(=099.*?)(\d\d\d\d)\n/$1\n/s;
s/(\$o\d*)(\$yw\w+)/$1\$v$year$2/g;
print;
}
}
}, $startdir);

and it seems to be working!

I will experiment with it.

Thanks again for all your help!
 
darthjef,
Code:
while( <> ) {
  if (/[b]\n[/b]=099.*?(\d\d\d\d)\n/s) {
    $year = $1;
    s/([b]\n[/b]=099.*?)(\d\d\d\d)\n/$1\n/s;
    s/(\$o\d*)(\$yw\w+)/$1\$v$year$2/g;
    print;
  }
}
By including the \n directly before the =099, we are ensuring that it will only match where =099 is at the start of a line. I used a \n at the end to make sure it will only pick up the year when it is the last thing on the line.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top