HTML regexes

Status
Not open for further replies.

cgilover

Programmer
Oct 8, 2004
32
US
I am parsing a page from Google that shows when the page was last cached. That works fine, but the problem is that I am also taking the source code from the cached page and trying to show the user what their page looked like when Google saw it.

On some pages it works (like mine!), but for some lazy coders out there who used partial URLs in their links and images, it fails because it's looking on MY server for their stuff (rrrr).

What I tried to do was come up with a few regexes and force the full URL onto any links (I'll work on images when this is done and working). The script runs without errors but, unfortunately, the source code doesn't change. There is a link that's <a href="page.html">click here</a> and it won't build the domain onto it.

Any help with fixing this (without pushing me towards a module)?

<code>

my $url_no_slash = $url;
my $url_with_slash = "$url/";
$url =~ s/^http//;

##########
# LWP
##########
my $ua = LWP::UserAgent->new();
$ua->agent("");
my $parse_url = " my $content = $ua->get($parse_url)->content();


my @content = split(/\n/, $content);
foreach my $key (@content)
{
    if ($key =~ m/<a href="([^"]+)"> /gi)
    {
        my $link = $1;
        print "TEST: $1";

        if ($link !~ m/$url/i)
        {
            # If our $link doesn't contain our
            # original url, we need to build it
            print "test1\n";

            if ($link =~ m/^\//)
            {
                # If our $link begins with a slash
                # we'll add the full url without a
                # trailing slash
                print "test2\n";
                $key =~ s/$link/$url_no_slash/;
            }
            else
            {
                # Our $link doesn't begin with a slash
                # so we'll have to add it ourselves
                print "test3\n";
                $key =~ s/$link/$url_with_slash/;
            }
        }
    }
}
</code>
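
Two things in the snippet above may explain why the source never changes. First, the pattern `m/<a href="([^"]+)"> /` ends with a space, so a link whose text starts right after `>` is skipped. Second, the `foreach` edits the elements of `@content`, not the `$content` string, so printing `$content` afterwards would still show the old markup (whether the script prints `$content` is an assumption, since that part isn't shown). The sketch below is only an illustration of both points, using the sample link from the post; the `changed.html` substitution is made up for the demo.

```perl
use strict;
use warnings;

# Sample line taken from the post; everything else is illustrative.
my $content = qq{<a href="page.html">click here</a>\n};
my @content = split /\n/, $content;

# 1. The pattern as posted requires a space right after the closing ">".
my $posted  = ($content[0] =~ m/<a href="([^"]+)"> /i) ? 1 : 0;
my $relaxed = ($content[0] =~ m/<a href="([^"]+)">/i)  ? 1 : 0;
print "posted pattern matched:  $posted\n";     # prints 0
print "relaxed pattern matched: $relaxed\n";    # prints 1

# 2. foreach aliases the elements of @content, so the edit lands in the
#    array; the original $content string is left as it was.
foreach my $key (@content) {
    $key =~ s/page\.html/changed.html/;
}
print "array:  $content[0]\n";    # now contains changed.html
print "string: $content";         # still contains page.html
```

If that second point applies, joining the array back together (for example `join("\n", @content)`) before printing would carry the edits through.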
 
I threw this together in a hurry so I'm sure it can be improved. Also, I haven't extensively tested this - but it may give you a direction to go in.

Code:
my $url = 'http://www.somedomain.com';
my @content;
push @content, '<b><a href="http://www.somedomain.com/index2.html">Click Here</a></b>';
push @content, '<b><a href="/pages/index3.html">Click Here Too</a></b>';

$url =~ s#^http://##i;

foreach (@content) {
    if ($_ =~ m/a href="(?!.*?$url.*?")/i) {
        s#(href=")(/)#$1#i;     # Removes leading slash
        s#(href=")(.*?")#$1http://$url/$2#i;    # Adds missing info
    }
}
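
A variation on the loop above, offered as a rough sketch rather than a tested replacement. It assumes double-quoted attribute values, trims any trailing slash from the base URL before joining, and also covers `src=` so images are handled the same way. The helper name `absolutize` is made up for this example. Like the loop above, it treats every relative link as relative to the site root and does not resolve `../` segments.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix relative href/src values with $base.
sub absolutize {
    my ($html, $base) = @_;

    # Drop trailing slash(es) so the join below never produces "//".
    $base =~ s{/+\z}{};

    # Skip values that already carry a scheme (http:, mailto:, ...),
    # protocol-relative values (//host/...), and fragment-only values (#top).
    # An optional leading slash is consumed and re-added in the replacement.
    $html =~ s{\b(href|src)\s*=\s*"(?![a-z][a-z0-9+.-]*:|//|#)/?([^"]*)"}
              {$1="$base/$2"}gi;

    return $html;
}

my $base = 'http://www.somedomain.com/';    # trailing slash on purpose

print absolutize('<a href="/pages/index3.html">Click Here Too</a>', $base), "\n";
# prints: <a href="http://www.somedomain.com/pages/index3.html">Click Here Too</a>

print absolutize('<img src="logo.gif">', $base), "\n";
# prints: <img src="http://www.somedomain.com/logo.gif">

print absolutize('<a href="http://www.somedomain.com/index2.html">Click Here</a>', $base), "\n";
# prints the line unchanged
```

Testing for a scheme instead of looking for the domain name inside the link avoids interpolating `$url` into the pattern, where its dots would act as wildcards.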
 
Thank you!! That worked very, very well! It doesn't catch everything, and it appends an extra / if the $url ends in a slash, but it fixes about 95% of the broken images and links :)

Thank you!
 
If you can post some lines that it misses, or where it adds the extra /, I'll see what I can do to fix it.
 
cgilover

A comment:

It is not lazy to use relative URLs when coding pages. It makes a page much more portable for the user. It is really a good coding practice.

Cheers


haunter@battlestrata.com
 