Super-simple HTML parsing.

Status
Not open for further replies.

mpalmer12345

Programmer
Feb 16, 2004
59
US
I'd like to create an ultra-simple HTML parser/replacement tool that finds all alphanumeric text between > and < and replaces each character with an x. So a string containing
$text = "<TITLE>This is a title<\TITLE>";
would become
<TITLE>xxxx xx x xxxxx<\TITLE>.

$text = "<TITLE>This is a title<\\TITLE>";
$text =~ s/(.*?)<(.*?)>(.*?)<\\(.*)>/$1<$2>/;
$tex2 = $3;
$tex3 = $4;
$tex2 =~ s/\w/x/g;
$text .= "$tex2<\\$tex3>";

This code works for the $text sample above, but it doesn't work for

$text = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of /</TITLE>';

and I can't figure out why. Maybe the newlines are screwing it up?
 
Your code as spec'd doesn't handle the sample text either.

--Paul
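
One possible reading of the point above, as a minimal sketch: in a double-quoted Perl string "\T" is not a recognized escape, so the backslash is dropped, while the substitution pattern from the question insists on a literal backslash after the "<". Real HTML closes tags with a forward slash, so the pattern cannot match that either. The helper name closing_tag_matches is hypothetical and only wraps the pattern from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: does the pattern from the question match at all?
sub closing_tag_matches {
    my ($text) = @_;
    return $text =~ /(.*?)<(.*?)>(.*?)<\\(.*)>/ ? 1 : 0;
}

my $double;
{
    # "\T" is not a recognized escape, so Perl drops the backslash.
    # Under "use warnings" it reports this; the warning is silenced here.
    no warnings 'misc';
    $double = "<TITLE>This is a title<\TITLE>";
}
my $escaped = "<TITLE>This is a title<\\TITLE>";  # keeps one literal backslash
my $real    = '<TITLE>Index of /</TITLE>';         # real HTML uses a forward slash

print "double-quoted: $double\n";
printf "matches: %d %d %d\n",
    map { closing_tag_matches($_) } $double, $escaped, $real;
```

This prints "double-quoted: <TITLE>This is a title<TITLE>" followed by "matches: 0 1 0": only the string with the escaped backslash satisfies the pattern.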
 
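$text = '<TITLE>This is a title</TITLE>';

print "before: $text\n";

if ($text =~ m/^(<[^>]+>)([^<]+)(<[^>]+>)$/) {
    # Copy the captures right away; a later successful match can reset $1..$3.
    my ($open, $find, $close) = ($1, $2, $3);
    $find =~ s/[^ ]/x/g;
    # Print only when the pattern matched, so nothing undefined is interpolated.
    print " after: $open$find$close\n";
}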


Kind Regards
Duncan
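
For the multi-line sample from the original question, the same idea can be applied to every run of text between a ">" and the next "<". The sketch below is one way to do that, not a full HTML parser: it assumes no ">" appears inside attribute values, comments or scripts. The name mask_text is hypothetical, and it masks \w characters as the question asked rather than every non-space character.

```perl
use strict;
use warnings;

# Hypothetical helper: replace word characters in text between tags with "x".
sub mask_text {
    my ($html) = @_;
    # [^<]+ matches newlines too, so no /s flag is needed.
    # /g visits every text run; /e evaluates the replacement as Perl code.
    $html =~ s{>([^<]+)<}{
        (my $masked = $1) =~ s/\w/x/g;   # copy $1 first, then mask the copy
        ">$masked<";
    }ge;
    return $html;
}

my $text = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of /</TITLE>';

print mask_text($text), "\n";
```

On the sample, the tags and the DOCTYPE line come through unchanged and the title line becomes <TITLE>xxxxx xx /</TITLE>. For anything beyond simple pages, a dedicated module such as HTML::Parser from CPAN is likely the safer route.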
 