Complicated regex's


(OP)
Hello:

I have a question about a complicated regex that I cannot seem to get working.

Here is the code:


#!usr/bin/perl

open(HTML, "link.html") || die ("can't open file: $!");
@links = <HTML>;
close(FILE);

@links = m/<A[^>]+?HREF\s*=\s*["´]?([^´" >]+?)[ ´"]?>/sig;

foreach $link (@links) {

print "$link\n";

}


And here is the HTML file that it reads from:



<html>
<head>
<title>this is some links</title>
</head>
<body>
<A HREF="Link1.html">blah</A>
<A HREF="Link1.html">blah2</A>
<A HREF="Link1.html">blah3</A>
<A HREF="Link1.html">blah4</A>
<A HREF="Link1.html">blah5</A>
</body>
</html>


The regex is supposed to parse out all of the A HREFs in the HTML, but whenever I run it, I don't get anything, just a blank screen.

Any help on how this works is appreciated.

Thank you.


-Vic

vic cherubini
malice365@hotmail.com
epic software
====
Knows: Perl, HTML, JavScript, C/C++, PHP, Flash, Director
Wants to Know: Java, Cold Fusion, Tcl/TK
====

RE: Complicated regex's

The main problems I see with your code are:
1 - the match statement lacks a binding operator (=~).  Consequently, the pattern is never matched against your file contents: a bare m/.../ matches against $_, which is empty here, and the (empty) result is assigned to your array, @links, overwriting the lines you just read.

2 - I think the match needs to go inside your loop, or you could change the match to a replacement:       $str =~ s/find_this/replace_with_this/gis;

This works,

#!/usr/local/bin/perl -w
open(HTML, "link.html") || die "can't open file: $!";
@links = <HTML>;
close(HTML);

foreach $link (@links)
    {
    # test the match itself; $1 and $2 are only meaningful after a successful match
    if ($link =~ /<a href\s*=\s*"(.*?)">(.*?)<\/a>/i)
        { print "LINK - $1\n\tLabel - $2\n\n"; }
    }
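
On point 1, here is a minimal sketch of how the original one-shot approach might be made to work: join the file's lines into one string and bind the pattern to it with =~. In list context, a /g match returns every captured group. The sub name extract_hrefs and the inline sample HTML are illustrative assumptions, not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every HREF value found in a string of HTML.
sub extract_hrefs {
    my ($html) = @_;
    # =~ binds the pattern to $html; in list context, /g returns all captures.
    my @hrefs = $html =~ /<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)/gi;
    return @hrefs;
}

# Inline sample standing in for the contents of link.html.
my $html = join '', (
    qq{<A HREF="Link1.html">blah</A>\n},
    qq{<A HREF="Link2.html">blah2</A> <A HREF="Link3.html">blah3</A>\n},
);

print "$_\n" for extract_hrefs($html);
```

Because the whole file is searched as one string, several links on the same line are all picked up.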


'hope this helps.




keep the rudder amid ship and beware the odd typo

RE: Complicated regex's

One more thing: the way I wrote that, you will only match one link (the first) in each line from your input file.  If that needs to be addressed, we can get a little trickier.





RE: Complicated regex's

(OP)
goBoating:

Thanks a ton for that. You definitely get my vote for TipMaster. Thanks a lot.

Now, like you said, that will only address one line of the HTML code from the page. I know it will be extremely tricky to do it to all lines, but if you have the time, could you please tell me how to do so?

Thanks a bunch for your help.

-Vic


RE: Complicated regex's

Humbly submitted,
Remembering that there's always more than one way to do it......

My favorite way to catch all occurrences of a pattern in a stream is to use an evaluating replacement....huh?

In a simple match, you have a string and a pattern you want to match.... like this....

$str = 'some words that might contain a pattern to find';
if ($str =~ /a pattern/) { print "Found a pattern\n"; }


To do a replacement......

$str = 'some words that might contain a pattern1 to find';
$str =~ s/pattern1/pattern2/;
# now the word 'pattern1' has been changed to 'pattern2'.
# as written, the replacement above finds and replaces the first occurrence only.

$str =~ s/pattern1/pattern2/gs;
# the 'g' says do it globally and the 's' lets '.' match newlines, so a pattern can work across line boundaries.
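
To make the effect of the two modifiers concrete, here is a small sketch under assumed sample data (the string in $text and the variable names are made up for illustration). It compares a replacement with and without /g, and a match with and without /s.

```perl
use strict;
use warnings;

my $text = "pattern1 here\nand pattern1 there";

# Without /g only the first occurrence is replaced.
(my $first_only = $text) =~ s/pattern1/pattern2/;

# With /g every occurrence is replaced.
(my $all = $text) =~ s/pattern1/pattern2/g;

# /s changes what '.' matches: with it, '.' also matches a newline,
# so 'here.and' can match across the line break.
my $without_s = ($text =~ /here.and/)  ? 1 : 0;
my $with_s    = ($text =~ /here.and/s) ? 1 : 0;

print "first only: $first_only\n";
print "all:        $all\n";
print "without /s: $without_s, with /s: $with_s\n";
```

Note that /s only matters when the pattern contains a '.'; a literal pattern such as pattern1 behaves the same with or without it.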


a little further......
you can do a replace that evaluates the right side and uses a subroutine to supply the replacement text....

# read the entire file into a var
open(HTML,"<link.html") or die "$!";
while (<HTML>) { $str .= $_; }
close HTML;

$str =~ s/(pattern1)/&getNewString($1)/egs;
# the 'e' says evaluate the replacement, then use it.

sub getNewString
{
my $var = $_[0];
# $var is now 'pattern1'
print $var; # change this to print to some previously opened output file.
return('pattern2');
}
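
As a self-contained illustration of the same idea, the sketch below uses /e to call a subroutine for each match and substitute whatever it returns. The sub name double_it, the @seen array and the sample string are assumptions made up for this example.

```perl
use strict;
use warnings;

my @seen;    # every match the callback was handed, in order

# Hypothetical callback: record the match and return its replacement text.
sub double_it {
    my ($number) = @_;
    push @seen, $number;
    return $number * 2;
}

my $str = "3 apples, 10 pears\nand 25 plums";

# /e evaluates the right-hand side as Perl code for each match;
# /g repeats this for every match in the string.
(my $doubled = $str) =~ s/(\d+)/double_it($1)/eg;

print "$doubled\n";
print "seen: @seen\n";
```

The copy-then-substitute idiom, (my $doubled = $str) =~ s/.../.../, leaves the original $str untouched.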


more to the point....

# read the entire file into a var
open(HTML,"<link.html") or die "$!";
while (<HTML>) { $str .= $_; }
close HTML;

$str =~ s/<A HREF=["'](.*?)['"]>(.*?)<\/A>/&catchParts($1,$2)/egis;

sub catchParts
{
my $link = $_[0];
my $label = $_[1];
print "LINK - $link and LABEL - $label\n";
return('replace_string_is_not_important_here');
}



So, you can see that we can pass each occurrence of a pattern into the subroutine.  This is probably overkill for some situations, but, after playing with this trick a little, I find myself using it more and more.  In your situation, we really don't care about what we pass back to the replace statement, only that in the subroutine we get each occurrence of the wanted pattern.  At that point, we can do anything we want with it.  I'm sure there must be a more concise way to do this trick, but I find so much utility in this approach that I keep going back to it.  If any of this does not make sense, please ask.
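
One candidate for a more concise way is a plain match with /g in a while loop, which visits each occurrence without rewriting the string. This is only a sketch: the sub name find_links and the sample HTML are hypothetical, and the pattern assumes quoted HREF values as in the example file.

```perl
use strict;
use warnings;

# Hypothetical helper: return [link, label] pairs without modifying the input.
sub find_links {
    my ($html) = @_;
    my @pairs;
    # In scalar context, /g resumes after the previous match on each pass,
    # so the loop body runs once per link; /s lets a label span lines.
    while ($html =~ /<a\s+href\s*=\s*["'](.*?)["']\s*>(.*?)<\/a>/gis) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my $html = qq{<A HREF="Link1.html">blah</A> <A HREF='Link2.html'>two\nlines</A>\n};

for my $pair (find_links($html)) {
    my ($link, $label) = @$pair;
    print "LINK - $link and LABEL - $label\n";
}
```

The trade-off against the s///e trick is that nothing is replaced, so this form fits when the matches only need to be read.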

'hope this helps....





RE: Complicated regex's

(OP)
goBoating:

Thank you so much for the help. You don't know how much it does help, really.

Yeah, as always, we have TMTOWTDI in the wonderful language of Perl.

Thanks for all the help.

-Vic


RE: Complicated regex's

(OP)
Also:

I looked at that in school during class, so I didn't have time to look at it in a lot of detail, but as soon as I get home, I will.

Thanks again for the help.
-Vic

