HTML regexes

cgilover · Nov 10, 2004

I am parsing a Google page that shows when a page was last cached. That part works fine; the problem is that I am also taking the source code from the cached page and trying to show the user what their page looked like when Google saw it.

On some pages it works (like mine!), but for some lazy coders out there who used partial URLs in their links and images, it fails because it's looking on MY server for their stuff (rrrr).

What I tried to do was come up with a few regexes to force the full URL onto any links (I'll work on images once this is done and working). The script runs without errors but, unfortunately, the source code doesn't change. There is a link that's <a href="page.html">click here</a> and it won't build the domain onto it.

Any help with fixing this (without pushing me towards a module)?

<code>

my $url_no_slash = $url;
my $url_with_slash = "$url/";
$url =~ s/^http//;

##########
# LWP
##########
my $ua = LWP::UserAgent->new();
$ua->agent("");
my $parse_url = "http://www.google.com/search?sourceid=navclient&ie=UTF-8&q=cache:http://$url/";

my $content = $ua->get($parse_url)->content();

my @content = split(/\n/, $content);
foreach my $key (@content)
{

    if ($key =~ m/<a href="([^"]+)"> /gi)
    {
        my $link = $1;
        print "TEST: $1";

        if ($link !~ m/$url/i)
        {
            # If our $link doesn't contain our
            # original url, we need to build it
            print "test1\n";

            if ($link =~ m/^\//)
            {
                # If our $link begins with a slash
                # we'll add the full url without a
                # trailing slash

                print "test2\n";
                $key =~ s/$link/$url_no_slash/;
            }
            else
            {
                # Our $link doesn't begin with a slash
                # so we'll have to add it ourselves
                print "test3\n";

                $key =~ s/$link/$url_with_slash/;
            }
        }

    }
}
</code>

rharsh · Nov 10, 2004

I threw this together in a hurry so I'm sure it can be improved. Also, I haven't extensively tested this - but it may give you a direction to go in.

Code:

my $url = 'http://www.somedomain.com';
my @content;
push @content, '<b><a href="http://www.somedomain.com/index2.html">Click Here</a></b>';
push @content, '<b><a href="/pages/index3.html">Click Here Too</a></b>';

$url =~ s#^http://##i;

foreach (@content) {
    if ($_ =~ m/a href="(?!.*?$url.*?")/i) {
        s#(href=")(/)#$1#i;     # Removes leading slash
        s#(href=")(.*?")#$1http://$url/$2#i;    # Adds missing info
    }
}
    }
}
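To see what the negative lookahead is doing, here is a small self-contained run of the same idea on the two sample lines. This is only a sketch: the `absolutize` sub name, the `\Q...\E` quoting of the host, and the use of `[^"]*` in place of `.*?` are assumptions added for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): prefix http://$host/ to an
# href that does not already mention $host. $host is the bare domain.
sub absolutize {
    my ($line, $host) = @_;
    # The negative lookahead (?!...) succeeds only when $host does NOT
    # appear between the opening quote and the closing quote.
    if ($line =~ m/a href="(?![^"]*\Q$host\E)/i) {
        $line =~ s#(href=")/#$1#i;                           # drop a leading slash
        $line =~ s#(href=")([^"]*")#${1}http://$host/$2#i;   # add scheme and host
    }
    return $line;
}

my $host = 'www.somedomain.com';
print absolutize('<b><a href="http://www.somedomain.com/index2.html">Click Here</a></b>', $host), "\n";
print absolutize('<b><a href="/pages/index3.html">Click Here Too</a></b>', $host), "\n";
```

The first line is printed unchanged because the host is found inside the quotes; the second comes back with `http://www.somedomain.com/` in front of `pages/index3.html`.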

cgilover · Nov 10, 2004

Thank you!! That worked very, very well! It doesn't catch everything, and it adds an extra / if $url ends in a slash, but it fixes 95% of the broken images and links.

Thank you!

rharsh · Nov 11, 2004

If you can post some lines that it misses, or where it adds the extra /, I'll see what I can do to fix it.
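Going by the description above, the doubled slash would appear when the base already ends in `/` before another `/` is added, and the misses may include `src` attributes or several links on one line. The sketch below is one possible way to cover those cases without a module. The `fix_links` name is hypothetical, it only handles double-quoted attributes, and like the loop above it resolves everything against the site root rather than the page's own directory.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: rewrite every relative
# href/src value on a line against a base such as "http://host/".
sub fix_links {
    my ($line, $base) = @_;
    $base =~ s{/+\z}{};    # "http://host/" -> "http://host", avoids "//"
    # Match href="..." or src="..." whose value does not start with a
    # scheme ("http:", "mailto:", ...), with "//", or with "#".
    $line =~ s{\b(href|src)="(?![a-z][a-z0-9+.-]*:|//|\#)/?([^"]*)"}
              {$1="$base/$2"}gi;
    return $line;
}

my $line = '<a href="page.html"><img src="/img/logo.gif"></a> <a href="http://other.example/">x</a>';
print fix_links($line, 'http://www.somedomain.com/'), "\n";
```

The `/g` flag is what lets one call fix both the `href` and the `src` on the sample line; the link to `other.example` is left alone because its value starts with a scheme.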

mlibeson · Nov 12, 2004

cgilover

Your original code is probably not working because of your $url test. If the URL is

http://www.google.com/index.htm?search=string

then you are trying to match every reference in HREF against the whole string

http://www.google.com/index.htm?search=string

instead of just the site root

http://www.google.com/

or

http://www.google.com:80/

I hope this helps shed some light.

Michael Libeson
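A minimal sketch of that point, assuming the goal is to compare links against the site root only. The `site_root` name is hypothetical. Note also that a `?` in an unquoted `$url` is treated as a regex quantifier, which is why the test below wraps the root in `\Q...\E`.

```perl
use strict;
use warnings;

# Hypothetical helper: return scheme://host[:port] from a full URL,
# or undef if the string does not look like an http(s) URL.
sub site_root {
    my ($url) = @_;
    return $url =~ m{^(https?://[^/?#]+)}i ? $1 : undef;
}

my $root = site_root('http://www.google.com/index.htm?search=string');
print "$root\n";    # http://www.google.com

# \Q...\E keeps the dots and any other metacharacters literal in the test.
for my $link ('http://www.google.com/about.html', 'page.html') {
    my $kind = $link =~ m/^\Q$root\E/i ? 'absolute' : 'relative';
    print "$link is $kind\n";
}
```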

Haunter · Nov 12, 2004

cgilover

A comment:

It is not lazy to use relative urls in coding pages. It makes a page much more portable for the user. It is really a good coding practice.

Cheers

haunter@battlestrata.com
