A pattern output

Status
Not open for further replies.

Ramnarayan

Programmer
Jan 15, 2003
56
US
Hi

I have a database file with the following contents:

_t2 AP001653 07382480 AP020003
_vl 1
_is 1
_t2 AP001653 07382480 AP020004
_vl 1
_is 2
_t2 AP001653 07382480 AP020005
_vl 2.01
_is 1/2
_t2 AP001653 07382480 AP020006
_vl 2.02
_is 3/4
_t2 AP001653 07382480 AP020007
_vl 3
_is 1
_t2 AP001653 07382480 AP020008
_vl 3
_is 2
_t2 AP001653 07382480 AP020009
_vl 4
_is 1
_t2 AP001653 07382480 AP020010
_vl 4
_is 2
_t2 AP001653 07382480 AP020011
_vl 5
_is 1
_t2 AP001653 07382480 AP020012
_vl 5
_is 2
_t2 AP001653 07382480 AP020013
_vl 6
_is 1
_t2 AP001653 07382480 AP020014
_vl 6
_is 2

I need a script that produces the following output:

Dataset: AP001653 ISSN# 07382480 (taken from the 2nd and 3rd columns of the _t2 line; must be the same for all _t2 lines)
Volume# 1-6 (range from the first volume to the last volume below)
Issue# 1-2 (range from the first issue to the last issue below)

AP020003 Vol. 1 Iss. 1
AP020004 Vol. 1 Iss. 2
AP020005 Vol. 2.01 Iss. 1/2
AP020006 Vol. 2.02 Iss. 3/4
AP020007 Vol. 3 Iss. 1
AP020008 Vol. 3 Iss. 2
AP020009 Vol. 4 Iss. 1
AP020010 Vol. 4 Iss. 2
AP020011 Vol. 5 Iss. 1
AP020012 Vol. 5 Iss. 2
AP020013 Vol. 6 Iss. 1
AP020014 Vol. 6 Iss. 2

Please help me with a script so that I can understand pattern matching better, as I am very weak at it!
 
Code:
$data =~ s/     # substitution
(?:.*?\s){3}    # three groups of characters separated by whitespace
(.*?)\n         # last group of characters on the line followed by a
                #    newline, saving the character group to $1
_vl\s(.*?)\n    # _vl followed by a space followed by some characters
                #    and a newline, saving the char group to $2
_is\s(.*?)\n    # same as above, but _is into $3
                # end match and generate output from saved values
/$1 Vol. $2 Iss. $3\n/gxi;
                # g: global (repeat the pattern search and replacement)
                # x: ignore whitespace in the regex, makes for multi-
                #    line patterns which are easier to read
                # i: performs a case insensitive search, probably not
                #    needed, but what the hey
I'll go through the first piece in more depth; the same details carry over to the rest.
(?:.*?\s){3}
A '.' can match any single character. The '*' matches zero or more of the character or group before it, in this case, '.'.

The '?' usually means matching zero or one of the character or group before it. Here, and in a few other cases, it curbs the greediness of a quantifier instead. By itself, '.*' is greedy: it tries to match as much as it can, up to the end of the line (by default '.' does not match a newline). If you attach a '?' and make it '.*?', it matches as few characters as it can while still letting the rest of the pattern match.
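To make the greedy vs. non-greedy difference concrete, here is a small sketch using one of the _t2 lines from the question. The variable names ($line, $greedy, $lazy) are just for illustration.

```perl
use strict;
use warnings;

my $line = "_t2 AP001653 07382480 AP020003";

# Greedy: '.*' grabs as much as it can, then backs off to the LAST space.
my ($greedy) = $line =~ /^(.*)\s/;

# Non-greedy: '.*?' takes as little as it can, stopping at the FIRST space.
my ($lazy) = $line =~ /^(.*?)\s/;

print "greedy: '$greedy'\n";   # greedy: '_t2 AP001653 07382480'
print "lazy:   '$lazy'\n";     # lazy:   '_t2'
```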

'\s' is shorthand for any whitespace character: tabs, newlines, spaces, and a few others.
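A quick way to see what '\s' accepts is to try it against a few made-up strings (the sample strings here are hypothetical, not from the data file):

```perl
use strict;
use warnings;

# A space, a tab, and a newline between 'a' and 'b', then no whitespace at all.
my @samples = ("a b", "a\tb", "a\nb", "ab");

# 1 where /a\sb/ matches, 0 where it does not.
my @results = map { /a\sb/ ? 1 : 0 } @samples;

print "@results\n";   # 1 1 1 0
```

This is also why '\s' in the substitution can match the newline at the end of a line, not just the spaces between the columns.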

You can group characters with parentheses so that they are treated as one thing to match. Parentheses also capture whatever matched between them (a backreference). In the above regex, that's how each of the words you wanted got into the $1, $2, and $3 variables. If you want to group characters without saving the backreference, start the parenthesized expression with '?:'. It works the same as other () groups, just without saving the data.
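Here is a minimal sketch of the difference between a capturing group and a '?:' group, again on one line of the sample data. In list context a match returns the captured groups, which makes the difference easy to see.

```perl
use strict;
use warnings;

my $line = "_t2 AP001653 07382480 AP020003";

# Two capturing groups: both words are saved.
my @captured = $line =~ /(\S+)\s(\S+)/;

# The first group is non-capturing, so only the second word is saved.
my @grouped = $line =~ /(?:\S+)\s(\S+)/;

print "captured: @captured\n";   # captured: _t2 AP001653
print "grouped:  @grouped\n";    # grouped:  AP001653
```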

We want to treat all these characters as one unit because we want to repeat that pattern a few times. That's what the number inside the curly braces does. {3} works like '*', but instead of matching zero or more of what came before it, it matches exactly three. You can also use ranges here: if you want at least two but no more than five, use {2,5}; if you want at least four and maybe more, use {4,}.
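The three forms of the curly-brace quantifier can be compared side by side. This is only a sketch on made-up strings of 'a' characters; the patterns are anchored with ^ and $ so that the whole string has to fit the count, otherwise a{3} would happily match inside a longer run.

```perl
use strict;
use warnings;

my %summary;
for my $s ('a', 'aaa', 'aaaa', 'aaaaaa') {
    my @hits;
    push @hits, '{3}'   if $s =~ /^a{3}$/;     # exactly three
    push @hits, '{2,5}' if $s =~ /^a{2,5}$/;   # two to five
    push @hits, '{4,}'  if $s =~ /^a{4,}$/;    # four or more
    $summary{$s} = join ' ', @hits;
    printf "%-6s matches: %s\n", $s, $summary{$s} || 'none';
}
```

So 'aaa' satisfies {3} and {2,5}, 'aaaa' satisfies {2,5} and {4,}, 'aaaaaa' satisfies only {4,}, and a single 'a' satisfies none of them.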

This was a lot more long-winded than I originally expected. Let's hope the intent wasn't lost. Read more details about Perl's regular expressions at:
----------------------------------------------------------------------------------
...but I'm just a C man trying to see the light
 
Hi ICRF,

I tried to put all the bits and pieces together into a script, but the script never works. Could you please give me a complete script that does the pattern matching? I am not clear on how to use the pattern matching in a script.

Thanks for your time.
 
You mean something like this?
Code:
use strict;
use warnings;

my $data = qq~_t2 AP001653 07382480 AP020003
_vl 1
_is 1
_t2 AP001653 07382480 AP020004
_vl 1
_is 2
_t2 AP001653 07382480 AP020005
_vl 2.01
_is 1/2
_t2 AP001653 07382480 AP020006
_vl 2.02
_is 3/4
_t2 AP001653 07382480 AP020007
_vl 3
_is 1
_t2 AP001653 07382480 AP020008
_vl 3
_is 2
_t2 AP001653 07382480 AP020009
_vl 4
_is 1
_t2 AP001653 07382480 AP020010
_vl 4
_is 2
_t2 AP001653 07382480 AP020011
_vl 5
_is 1
_t2 AP001653 07382480 AP020012
_vl 5
_is 2
_t2 AP001653 07382480 AP020013
_vl 6
_is 1
_t2 AP001653 07382480 AP020014
_vl 6
_is 2
~;

$data =~ s/     # substitution
(?:.*?\s){3}    # three groups of characters separated by whitespace
(.*?)\n         # last group of characters on the line followed by a
                #    newline, saving the character group to $1
_vl\s(.*?)\n    # _vl followed by a space followed by some characters
                #    and a newline, saving the char group to $2
_is\s(.*?)\n    # same as above, but _is into $3
                # end match and generate output from saved values
/$1 Vol. $2 Iss. $3\n/gxi;
                # g: global (repeat the pattern search and replacement)
                # x: ignore whitespace in the regex, makes for multi-
                #    line patterns which are easier to read
                # i: performs a case insensitive search, probably not
                #    needed, but what the hey

print $data;
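The substitution above prints only the per-item lines. The header asked for in the original question (Dataset, ISSN#, Volume# and Issue# ranges) could be produced along these lines. This is only a sketch under some assumptions: the data is shortened to three records, each range is taken as "value from the first record to value from the last record" as the question describes, and the names @records and $report are made up for the example. It also does not check that the dataset and ISSN are the same on every _t2 line.

```perl
use strict;
use warnings;

# Shortened sample in the same three-line record layout as the question.
my $data = <<'END';
_t2 AP001653 07382480 AP020003
_vl 1
_is 1
_t2 AP001653 07382480 AP020005
_vl 2.01
_is 1/2
_t2 AP001653 07382480 AP020014
_vl 6
_is 2
END

# Collect one hash per _t2/_vl/_is record instead of rewriting the string.
my @records;
while ($data =~ /^_t2\s+(\S+)\s+(\S+)\s+(\S+)\n_vl\s+(\S+)\n_is\s+(\S+)$/mg) {
    push @records, { dataset => $1, issn => $2, id => $3, vol => $4, iss => $5 };
}
die "no records found\n" unless @records;

# Assumption: ranges run from the first record's value to the last record's.
my ($first, $last) = @records[0, -1];

my $report = "Dataset: $first->{dataset} ISSN# $first->{issn}\n"
           . "Volume# $first->{vol}-$last->{vol}\n"
           . "Issue# $first->{iss}-$last->{iss}\n\n";
$report .= "$_->{id} Vol. $_->{vol} Iss. $_->{iss}\n" for @records;

print $report;
```

Keeping the records in an array rather than doing an in-place s/// makes it easy to look at the first and last entries for the ranges before printing the detail lines.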
One thing I've noticed about these forums is that when you copy code from a post, runs of multiple spaces turn into \xA0 instead of the expected \x20. If you're getting an error along the lines of "Unrecognized character \xA0 at test.pl line 51." then you need to replace these characters somehow (I use UltraEdit, which can be switched into hex editing mode, and do a find/replace there).
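If you don't have a hex editor handy, Perl itself can do the replacement. The sketch below works on a string holding some hypothetical pasted code and assumes the \xA0 characters are single bytes, as the error message suggests; in a UTF-8 file a non-breaking space is the two bytes \xC2\xA0, so this would need adjusting there.

```perl
use strict;
use warnings;

# Hypothetical pasted snippet whose indentation came through as \xA0.
my $pasted = "if (\$x) {\n\xA0\xA0\xA0\xA0print \$x;\n}\n";

# tr/// swaps each \xA0 for an ordinary space and returns how many it changed.
my $clean = $pasted;
my $count = ($clean =~ tr/\xA0/ /);

print "replaced $count non-breaking space(s)\n";   # replaced 4 non-breaking space(s)
```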

If it's web-based and all you're getting are 500 errors, add
Code:
use CGI::Carp qw(fatalsToBrowser);
to the top; it'll let you know what killed the program. For more details on that, check out
----------------------------------------------------------------------------------
...but I'm just a C man trying to see the light
 