Newline recognition problem

mpalmer12345 · Mar 15, 2004

I am having a devil of a time figuring this out!

I am taking as my data the source code brought in from a website via LWP::Simple, such as the following:

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq (Win98; U) [Netscape]">
<meta name="Author" content="AS">
<title>Contents</title>
</head>

I then want to write a bit of code that removes the newlines between HTML and HEAD etc. so that it comes out as

<!doctype html public "-//w3c//dtd html 4.0 transitional//en"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq (Win98; U) [Netscape]">
<meta name="Author" content="AS">
<title>Contents</title></head>

I am using

$text =~ s/>\n+?</></ig;

which works well in my Perl program at home (I am on a Mac), but when I test the identical code on the webpage, it doesn't remove the newlines! Why isn't it doing on the webpage what it does at home?

PaulTEG · Mar 16, 2004

It could be CR/LF pairs: the page may end its lines with \r\n rather than a bare \n.
In that case \r\n is what you should be looking for instead of just \n

HTH
--Paul
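One way to check this theory before changing the regex is to count the line endings actually present in the fetched text. The following is only a minimal sketch: count_line_endings is a hypothetical helper (not part of LWP::Simple), and $sample is a stand-in for the real page content. It uses hex escapes because \n itself does not mean the same byte on every platform.

```perl
use strict;
use warnings;

# Hypothetical helper: count each kind of line ending in a string.
# \x0D is CR and \x0A is LF, spelled out so the result does not
# depend on what "\n" means on the local platform.
sub count_line_endings {
    my ($text) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    $count{crlf}++ while $text =~ /\x0D\x0A/g;        # CR followed by LF
    $count{lf}++   while $text =~ /(?<!\x0D)\x0A/g;   # LF not preceded by CR
    $count{cr}++   while $text =~ /\x0D(?!\x0A)/g;    # CR not followed by LF
    return %count;
}

# Stand-in for the text returned by LWP::Simple (assumed CRLF here).
my $sample = "<html>\x0D\x0A<head>\x0D\x0A</head>";
my %seen   = count_line_endings($sample);
print "$_: $seen{$_}\n" for sort keys %seen;
```

If the crlf count is non-zero, a pattern that only looks for \n between the tags will not match.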

ishnid · Mar 16, 2004

You could look for both \n and \r\n, for portability:

Code:

$text =~ s/>(?:\r?\n)+?</></ig;
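As a rough check of the idea, the sketch below runs a variant of that substitution against LF, CRLF, and bare CR input. join_tags is a hypothetical wrapper, and the bare-CR branch is an addition that the pattern above does not cover. The /i flag and the non-greedy +? are dropped here because the pattern contains no letters and the run of line endings must be followed by < either way.

```perl
use strict;
use warnings;

# Hypothetical wrapper: remove line endings that sit directly
# between a closing ">" and the next "<".
sub join_tags {
    my ($text) = @_;
    # Accept CRLF, LF, or a bare CR (the last is an extra assumption).
    $text =~ s/>(?:\x0D\x0A|\x0A|\x0D)+</></g;
    return $text;
}

# Try the same markup with each line-ending convention.
for my $eol ("\x0A", "\x0D\x0A", "\x0D") {
    my $html = "<html>${eol}<head>${eol}${eol}</head>";
    print join_tags($html), "\n";
}
```

Line endings that follow ordinary text rather than a > are left alone, so only the gaps between adjacent tags are closed up.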

duncdude · Mar 16, 2004

$text = '<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq  (Win98; U) [Netscape]">
   <meta name="Author" content="AS">
   <title>Contents</title>
</head>';

print "$text\n\n";

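# Collapse whatever non-tag text (here, the line breaks) separates
# the doctype, <html>, and <head> so the three tags sit on one line.
$text =~ s|>[^<]+<html>[^<]+<head>|><html><head>|;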

print "$text\n";

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq  (Win98; U) [Netscape]">
   <meta name="Author" content="AS">
   <title>Contents</title>
</head>

<!doctype html public "-//w3c//dtd html 4.0 transitional//en"><html><head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq  (Win98; U) [Netscape]">
   <meta name="Author" content="AS">
   <title>Contents</title>
</head>

Kind Regards
Duncan

mpalmer12345 · Mar 16, 2004

Thanks! There did seem to be a CR that crept into the code somehow. The program now works great!

Thanks again! You guys are invaluable!
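For anyone hitting the same stray CR, another option is to normalize the line endings once, right after fetching, so that later patterns only need to deal with LF. This is just a sketch under assumptions: normalize_newlines is a hypothetical helper, and $page stands in for whatever LWP::Simple returned.

```perl
use strict;
use warnings;

# Hypothetical helper: turn CRLF pairs and bare CRs into a single LF.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\x0D\x0A?/\x0A/g;   # CR, optionally followed by LF, becomes LF
    return $text;
}

# Stand-in for fetched page content with CRLF line endings.
my $page = "<html>\x0D\x0A<head>\x0D\x0A</head>";

my $text = normalize_newlines($page);
$text =~ s/>\x0A+</></g;           # now only LF needs to be matched
print "$text\n";
```

The trade-off is that the original line endings are lost, which matters only if the text has to be written back out byte for byte.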
