Ampersand regexp?

Status
Not open for further replies.

Kirsle

Programmer
Jan 21, 2006
1,179
US
A problem I run into sometimes on my websites is that they fail the W3C HTML Validator because I tend to use a lot of &'s in hyperlinks without escaping them as &amp;

e.g.
Code:
<a href="file.cgi?color=blue&font=Arial">

they want it to be:

<a href="file.cgi?color=blue&amp;font=Arial">

And by the time I think to validate, there are many pages to go through (IMHO, it shouldn't complain when an & is used within an href), so I decided to run a regexp over the content before it gets printed to the browser (this is a CGI website).

Does anybody know of a simple regexp? The one I had to settle with is kinda blech:

Code:
$cuvou->{template} =~ s/&#/__amp__number__/ig;
$cuvou->{template} =~ s/&lt;/__amp__lt__/ig;
$cuvou->{template} =~ s/&gt;/__amp__gt__/ig;
$cuvou->{template} =~ s/&amp;/__amp__amp__/ig;
$cuvou->{template} =~ s/&quot;/__amp__quot__/ig;
$cuvou->{template} =~ s/&apos;/__amp__apos__/ig;
$cuvou->{template} =~ s/&/&amp;/ig;
$cuvou->{template} =~ s/__amp__number__/&#/ig;
$cuvou->{template} =~ s/__amp__lt__/&lt;/ig;
$cuvou->{template} =~ s/__amp__gt__/&gt;/ig;
$cuvou->{template} =~ s/__amp__amp__/&amp;/ig;
$cuvou->{template} =~ s/__amp__quot__/&quot;/ig;
$cuvou->{template} =~ s/__amp__apos__/&apos;/ig;

Thanks in advance.

-------------
Cuvou.com | The NEW Kirsle.net
 
I revised that one with a three-liner:

Code:
$cuvou->{template} =~ s/&(amp|lt|gt|copy|#\d+);/__goodamp__$1__/ig;
$cuvou->{template} =~ s/&/&amp;/ig;
$cuvou->{template} =~ s/__goodamp__(.+?)__/&$1;/ig;

But there's gotta be a good one-liner out there. :|

-------------
Cuvou.com | The NEW Kirsle.net
 
Not to rain on your regex parade (I love such problems), but what about &nbsp; and the literally hundreds of other HTML character entities? What about JavaScript code and its need for AND logic (&&)?

It is not going to be possible to create a series of regexes, let alone a single one, that fixes this particular problem. Instead, you should focus your attack: use HTML::Parser or some other parsing module to apply a regex to link URLs only.
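As a rough illustration of applying the fix to link URLs only, here is a minimal sketch that escapes bare ampersands inside double-quoted href values and leaves everything else (JavaScript included) alone. Note that it swaps in a plain regex for a real parser such as HTML::Parser, so it will miss single-quoted or unquoted attributes. The helper name escape_href_ampersands is hypothetical, not something from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: escape bare "&" only inside double-quoted href values.
# Assumption: attributes are written as href="..."; other quoting styles are ignored.
sub escape_href_ampersands {
    my ($html) = @_;
    $html =~ s{(\bhref\s*=\s*")([^"]*)(")}{
        my ($open, $url, $close) = ($1, $2, $3);
        # Leave "&" alone when it already starts a named, decimal, or hex entity.
        $url =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
        $open . $url . $close;
    }gie;
    return $html;
}

my $page = '<script>if (a && b) {}</script><a href="x.cgi?a=1&b=2">x</a>';
print escape_href_ampersands($page), "\n";
# prints: <script>if (a && b) {}</script><a href="x.cgi?a=1&amp;b=2">x</a>
```

The captures are copied into lexicals before the inner substitution runs, because that substitution resets $1, $2, and $3.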

However, if you insist on wanting a regex, just use a negative lookahead assertion.

Code:
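# Entities to leave alone; any other & gets escaped.
# (Plain quoted strings avoid the "comments in qw() list" warning that # triggers.)
my $entities = join '|', 'amp', 'lt', 'gt', 'copy', 'nbsp', '#\d+';
$cuvou->{template} =~ s/&(?!(?:$entities);)/&amp;/ig;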
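To show how the negative lookahead behaves without listing every entity by name, here is a minimal sketch that treats anything shaped like an entity (a name, decimal digits, or hex digits, followed by a semicolon) as already escaped. This is an assumption layered on top of the approach above, not the original poster's code, and escape_bare_ampersands is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every "&" that does not already start
# something shaped like an entity (&name; &#123; &#x1F;).
sub escape_bare_ampersands {
    my ($html) = @_;
    $html =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $html;
}

my $in = '<a href="file.cgi?color=blue&font=Arial">R&amp;D &#169;&nbsp;&#xA9; AT&T</a>';
print escape_bare_ampersands($in), "\n";
# prints: <a href="file.cgi?color=blue&amp;font=Arial">R&amp;D &#169;&nbsp;&#xA9; AT&amp;T</a>
```

Two caveats: it accepts any well-formed name, so a made-up entity like &foo; is left alone, and it still rewrites && in inline JavaScript, which is why restricting the substitution to link URLs is the safer route.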
 
Or just use semicolons instead of ampersands for your URLs. Assuming your CGI script is written correctly, it should support that just as well, without the encoding issues.
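For a script that parses its own query string, accepting semicolons is a one-character change to the split pattern. The sketch below is only an illustration of that idea: parse_query is a hypothetical helper, and it assumes simple key=value pairs where the last value wins for a repeated key.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a query string whose pairs are separated
# by either "&" or ";".
sub parse_query {
    my ($query) = @_;
    my %param;
    for my $pair (split /[&;]/, $query) {
        next unless length $pair;
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($key, $value) {
            tr/+/ /;                               # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # decode %XX escapes
        }
        $param{$key} = $value;                     # last value wins
    }
    return \%param;
}

my $p = parse_query('color=blue;font=Arial');
print "$p->{color} $p->{font}\n";
# prints: blue Arial
```

With links written as file.cgi?color=blue;font=Arial, there is no ampersand left in the href for the validator to complain about.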
 
Yeah, I tend to use semicolons myself these days for similar reasons, and because it's the newer, recommended way of delimiting URI query strings.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 