Perl Global Replace 4

drifter300 · Aug 29, 2007

Hi

I'm trying to replace all the relative links generated by our CMS with absolute ones. Using a base tag is not an option as the page fragment will be hosted at another domain, so it would cause that domain's relative links not to work.

My script requests the fragment, stores it in strPage, uses the URI package to process it, and then returns the fragment strPageTemp with the relative links replaced by absolute ones.

Code:

$strPageTemp = $strPage;
$base_url = "[URL unfurl="true"]http://www.mysite.co.uk/";[/URL]

while( $strPage =~ /((?:src|href|option value|form action)=\")(\/*)([^\"]*)(")/g ) {

	next if $3 =~ /http(s)?:\/\//;
	next if $3 =~ /#/;

	$abs_url = URI->new_abs($3, $base_url);

	$sNeedsReplacing = "\/" . $3;
	$sNeedsReplacing =~ s/\?/\\?/;
	$sNeedsReplacing =~ s/\&/\\&/;

	$strPageTemp =~ s/$sNeedsReplacing/$abs_url/g;
}

The problem is with the last line of the loop.
If I include the /g for global replace, URLs which occur more than once on the page are processed every time an occurence is found (i.e. if it's there 3 times,

http://...

is added to it's front 3 times). If I don't include it, only the first occurence is replaced, and the others remain relative.

Any ideas on how to sort this out would be much appreciated.
Thanks

KevinADC · Aug 29, 2007

hard to help when a program is not compiled with "strict". You seem to have isolated the section of code casuing the problem but without "strict" being used the problem can be anywhere in the program.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

Kirsle · Aug 29, 2007

I would have the code replace it with something temporarily that can be stripped out later, just so that it won't continue to match again and again.

Like, i.e. if you're looking for

Code:

href="[URL unfurl="true"]http://______"[/URL]

Then the end result might be,

Code:

href="http[COLOR=red]%%FILTERED_LINK%%[/color]://[COLOR=red]%%FILTERED_LINK%%[/color]______[COLOR=red]%%FILTERED_LINK%%[/color]

So, with all that extra "crap" added in there, the code won't match in your while() loop again. And then, when you're done with all the substitutions, just strip out the %%FILTERED_LINK%% text globally, and everything is exactly how you wanted it in the end.

-------------
Cuvou.com | My personal homepage
Project Fearless | My web blog

drifter300 · Aug 30, 2007

Thanks all, but I managed to sort it.
For anyone's future reference:

Code:

$base_url = "[URL unfurl="true"]http://www.gwynedd.gov.uk/";[/URL]

while( $strPage =~ /((?:src|href|option value|form action)=\")(\/*)([^\"]*)(")/g ) {

	next if $3 =~ /http(s)?:\/\//;
	next if $3 =~ /^#/;

	$abs_url = URI->new_abs($3, $base_url);

	$sNeedsReplacing = $2 . $3 . "\"";
	$sNeedsReplacing =~ s/\?/\\?/;
	$sNeedsReplacing =~ s/\&/\\&/;
	$sNeedsReplacing =~ s/\%/\\%/;
	$sNeedsReplacing =~ s/\+/\\+/;

	$strPage =~ s/$sNeedsReplacing/$abs_url\"/g;
}

The use of a temporary variable was the main problem.

Also some URLs which were embedded in others were replaced twice e.g. /page.asp?cat=3333 and /page.asp?cat=3333&doc=23.
The use of the " in sNeedsReplacing prevented this.

MillerH · Aug 30, 2007

Glad you were able to get a working product. However, I have a few suggestions for you.

1) Always give captured substrings proper names.

Code:

[url=http://perldoc.perl.org/functions/my.html][black][b]my[/b][/black][/url] [red]([/red][blue]$tag[/blue], [blue]$backslash[/blue], [blue]$relurl[/blue], [blue]$quote[/blue][red])[/red] = [red]([/red][blue]$1[/blue], [blue]$2[/blue], [blue]$3[/blue], [blue]$4[/blue][red])[/red][red];[/red]

When you first create a regex, you know what each of the substrings references because it's on your mind. However, when you come back to your code even a week later, you don't want to have to waste time analyzing your regex in order to determine what each substring is referencing. This is especially true if there is a bug and you're trying to analyze why it's happening.

Also, if you were to use a second regex after the first, there is a chance that your substrings could be clobbered. There was actually a risk of that with your test for absolute url's, but you do next so it's not an issue in that specific case. Nevertheless, it will help you out a lot in the long run to get in that practice.

2) Brace Delimiters

Use Brace Delimiters for multiline regex's or any regex that contains a forward slash.

Code:

[olive][b]next[/b][/olive] [olive][b]if[/b][/olive] [blue]$relurl[/blue] =~ [red]m{[/red][purple]https?://[/purple][red]}[/red][red];[/red]

It's simply easier to read than all those leaning towers.

3) use \Q..\E or quotemeta to escape literal strings in a regex

Code:

[blue]$strPage[/blue] =~ [red]s/[/red][purple][purple][b]\Q[/b][/purple][blue]$sNeedsReplacing[/blue][purple][b]\E[/b][/purple][/purple][red]/[/red][purple][blue]$abs_url[/blue][purple][b]\"[/b][/purple][/purple][red]/[/red][red]g[/red][red];[/red]

You created 4 regex's in order to escape the special characters in your string that you were replacing. It's good that you did that. However, that is what quotemeta is for, and it will escape ALL of the regex special characters instead of just those that one remembers.

4) Use a single regex by using the 'e' option

Currently, you're searching for stuff to replace, and then doing replacement on the whole string based off the results. This causes you do pass over the entire page for each element you need replaced. Instead, use the execute option to do this in a single regex statement.

Code:

[url=http://perldoc.perl.org/functions/use.html][black][b]use[/b][/black][/url] [green]URI[/green][red];[/red]
[black][b]use[/b][/black] [green]strict[/green][red];[/red]

[url=http://perldoc.perl.org/functions/my.html][black][b]my[/b][/black][/url] [blue]$base_url[/blue] = [red]"[/red][purple][URL unfurl="true"]http://www.gwynedd.gov.uk/[/URL][/purple][red]"[/red][red];[/red]

[black][b]my[/b][/black] [blue]$strPage[/blue] = [url=http://perldoc.perl.org/functions/do.html][black][b]do[/b][/black][/url] [red]{[/red][url=http://perldoc.perl.org/functions/local.html][black][b]local[/b][/black][/url] [blue]$/[/blue][red];[/red] <DATA>[red]}[/red][red];[/red]

[blue]$strPage[/blue] =~ [red]s{[/red][purple][/purple]
[purple]    (    # Whole String (used for no change)[/purple]
[purple]        ( (?:src|href|option [purple][b]\s[/b][/purple]+ value|form [purple][b]\s[/b][/purple]+ action) [purple][b]\s[/b][/purple]* = [purple][b]\s[/b][/purple]* [purple][b]\"[/b][/purple])   # Pre Info[/purple]
[purple]        ([^[purple][b]\"[/b][/purple]]*)                                                        # URL[/purple]
[purple]        ([purple][b]\"[/b][/purple])                                                            # Post Info[/purple]
[purple]    )[/purple]
[purple][/purple][red]}[/red][red]{[/red][purple][/purple]
[purple]	my ([blue]$wholestring[/blue], [blue]$pre[/blue], [blue]$relurl[/blue], [blue]$post[/blue]) = ([blue]$1[/blue], [blue]$2[/blue], [blue]$3[/blue], [blue]$4[/blue]);[/purple]

[purple]	# No Change if has scheme or is fragment.[/purple]
[purple]	if ([blue]$relurl[/blue] =~ m{^[purple][b]\w[/b][/purple]+:} || [blue]$relurl[/blue] =~ m{^#}) {[/purple]
[purple]		[blue]$wholestring[/blue];[/purple]

[purple]	} else {[/purple]
[purple]		my [blue]$absurl[/blue] = URI->new_abs([blue]$relurl[/blue], [blue]$base_url[/blue]);[/purple]
[purple]		"[blue]$pre[/blue][blue]$absurl[/blue][blue]$post[/blue]";[/purple]
[purple]	}[/purple]
[purple][/purple][red]}[/red][red]xeg[/red][red];[/red]


[url=http://perldoc.perl.org/functions/print.html][black][b]print[/b][/black][/url] [red]"[/red][purple][blue]$strPage[/blue][/purple][red]"[/red][red];[/red]

[teal]__DATA__[/teal]
[teal]<html>[/teal]
[teal]<body>[/teal]
[teal]<h1>Replace these</h1>[/teal]
[teal]Image: <img src="foo.gif">[/teal]
[teal]Link: <a href="/relative.html">stuff</a>[/teal]
[teal]Form: <form action="bob.cgi" method="POST"></form>[/teal]

[teal]<h1>No Change</h1>[/teal]
[teal]mailto: <a href="mailto: bob@asdf.com">stuff</a>[/teal]
[teal]javascript: <a href="javascript:alert('yo')">stuff</a>[/teal]
[teal]</body>[/teal]
[teal]</html>[/teal]

[tt]------------------------------------------------------------
Pragmas (perl 5.8.8) used :
[ul]
[li]strict - Perl pragma to restrict unsafe constructs[/li]
[/ul]
Other Modules used :
[ul]
[li]URI[/li]
[/ul]
[/tt]

5) Don't use a regex at all for parsing of html

Even though the style is improved, the above code is far from perfect. There are a lot of exceptions where it fails. What if the quote is a single quote instead of a double? Or no quotes at all? What if some other attribute like method occurs before action in a form tag? What other tags are we forgetting? What if the casing of the tags varies. Fortunately that is at least fixed easily using the 'i' option. However, there are an immense number of exceptions when working with html.

Therefore don't use a regex at all, but instead an actual parser, such as HTML:

arser

http://search.cpan.org/search?query=HTML::Parser

There is even example script that lets you apply any expression to every link in an html file.

http://search.cpan.org/src/GAAS/HTML-Parser-3.56/eg/hrefsub

Obviously a single expression is not likely to completely serve. However, you should be able to adapt the code without much difficulty.

- Miller

brigmar · Aug 30, 2007

Nice tips Miller. Have a

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Perl Global Replace 4

drifter300

Programmer

KevinADC

Technical User

Kirsle

Programmer

drifter300

Programmer

MillerH

Programmer

brigmar

Programmer

Similar threads

Part and Inventory Search

Sponsor