Super-simple HTML parsing.

mpalmer12345 · Mar 11, 2004

I'd like to create an ultra-simple HTML parser/replacement tool that looks for any and all alphanumeric text between > and < and blanks it out with x's. So a string containing
$text = "<TITLE>This is a title<\TITLE>";
would become
<TITLE>xxxx xx x xxxxx<\TITLE>.

$text = "<TITLE>This is a title<\\TITLE>";
$text =~ s/(.*?)<(.*?)>(.*?)<\\(.*)>/$1<$2>/;
$tex2 = $3;
$tex3 = $4;
$tex2 =~ s/\w/x/g;
$text .= "$tex2<\\$tex3>";

This code works for the $text sample above, but it doesn't work for

$text = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of /</TITLE>';

and I can't figure out why. Maybe the newlines are screwing it up?

PaulTEG · Mar 12, 2004

Your code as specified doesn't handle the sample text either.

--Paul

duncdude · Mar 12, 2004

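my $text = '<TITLE>This is a title</TITLE>';

print "before: $text\n";

# Capture the opening tag, the text between the tags, and the closing tag.
if ($text =~ m/^(<[^>]+>)([^<]+)(<[^>]+>)$/) {
    # Copy the captures before the next regex runs.
    my ($open, $find, $close) = ($1, $2, $3);
    # Replace every non-space character in the text with an x.
    $find =~ s/[^ ]/x/g;
    print " after: $open$find$close\n";
}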

Kind Regards
Duncan
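
For the multi-line sample in the first post, two things seem to get in the way of the original pattern: it looks for a literal backslash (<\) while the sample closes its tag with </, and . does not match a newline unless the /s modifier is given. A minimal sketch that sidesteps both, assuming closing tags use a forward slash and that no tag attribute contains a literal > or <, is to substitute globally over every run of text between > and <. The helper name mask_text is hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): replace every word character
# in the text between tags with an x, leaving the tags themselves untouched.
sub mask_text {
    my ($html) = @_;
    # Match each run of non-< characters that sits between a > and a <.
    # [^<]+ also matches newlines, so multi-line input needs no /s modifier.
    $html =~ s{>([^<]+)<}{ my $t = $1; $t =~ s/\w/x/g; ">$t<" }ge;
    return $html;
}

my $text = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of /</TITLE>';

print mask_text($text), "\n";
```

With this sketch, the last line of the sample should come out as <TITLE>xxxxx xx /</TITLE>, and the tags and DOCTYPE are left alone. It is still only pattern matching: comments, scripts, and attributes containing > or < would need a real parser such as HTML::Parser from CPAN.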
