Newline recognition problem

Status
Not open for further replies.

mpalmer12345

Programmer
Feb 16, 2004
59
US
I am having a devil of a time figuring this out!

I am taking as my data the source code brought in from a website via LWP::Simple, such as the following:

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq (Win98; U) [Netscape]">
<meta name="Author" content="AS">
<title>Contents</title>
</head>

I then want to write a bit of code that removes the newlines between HTML and HEAD etc. so that it comes out as

<!doctype html public "-//w3c//dtd html 4.0 transitional//en"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq (Win98; U) [Netscape]">
<meta name="Author" content="AS">
<title>Contents</title></head>


I am using

$text =~ s/>\n+?</></ig;

which works well in my Perl program at home (I am using a Mac), but when I test the identical code on the webpage, it doesn't remove the newlines! Why isn't it doing on the webpage what it does at home?
 
It could be CR/LF pairs. If so, \r\n is what you should be looking for instead of just \n.
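
One way to check this suspicion is to count the kinds of line endings in the fetched text before blaming the regex. The following is only a sketch: count_line_endings and the sample strings are made up for illustration, and in practice $text from LWP::Simple would be passed in instead.

```perl
use strict;
use warnings;

# Hypothetical helper: count each kind of line ending in a string.
sub count_line_endings {
    my ($text) = @_;
    my %count;
    $count{crlf} = () = $text =~ /\r\n/g;        # Windows-style pairs
    $count{lf}   = () = $text =~ /(?<!\r)\n/g;   # LF not preceded by CR
    $count{cr}   = () = $text =~ /\r(?!\n)/g;    # CR not followed by LF
    return %count;
}

# Made-up samples: the same markup with LF-only and with CRLF endings.
my $lf = "<html>\n<head>\n<title>Contents</title>\n</head>";
(my $crlf = $lf) =~ s/\n/\r\n/g;

for my $sample ($lf, $crlf) {
    my %n = count_line_endings($sample);
    (my $joined = $sample) =~ s/>\n+?</></g;     # the pattern from the question
    my $result = $joined ne $sample ? "changed" : "unchanged";
    print "CRLF=$n{crlf} LF=$n{lf} CR=$n{cr}: $result\n";
}
```

On the CRLF sample the pattern from the question leaves the text unchanged, because every > is followed by \r rather than \n.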

HTH
--Paul
 
You could look for both \n and \r\n, for portability:
Code:
$text =~ s/>(?:\r?\n)+?</></ig;
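
An alternative, sketched here under the assumption that it is acceptable to change every line ending in the text, is to normalize to \n first and then use the simple pattern. This also covers a lone \r, as used by classic Mac OS. The name join_adjacent_tags and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize line endings, then join adjacent tags.
sub join_adjacent_tags {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;   # CRLF or lone CR becomes LF
    $text =~ s/>\n+</></g;   # drop newlines sitting directly between tags
    return $text;
}

# Made-up sample with CRLF line endings.
my $html = "<html>\r\n<head>\r\n<title>Contents</title>\r\n</head>";
print join_adjacent_tags($html), "\n";
```

Note that this rewrites line endings everywhere in the text, not only between tags, which may or may not be what you want.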
 
$text = '<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq  (Win98; U) [Netscape]">
   <meta name="Author" content="AS">
   <title>Contents</title>
</head>';

print "$text\n\n";

$text =~ s|>[^<]+<html>[^<]+<head>|><html><head>|;

print "$text\n";

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq  (Win98; U) [Netscape]">
   <meta name="Author" content="AS">
   <title>Contents</title>
</head>

<!doctype html public "-//w3c//dtd html 4.0 transitional//en"><html><head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.61 [en]C-compaq  (Win98; U) [Netscape]">
   <meta name="Author" content="AS">
   <title>Contents</title>
</head>


Kind Regards
Duncan
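
A possible variation on the substitution above, offered only as a sketch: [^<]+ would also swallow any real text sitting between the tags, so this version removes whitespace-only gaps (including \r\n) in front of the html and head tags and nothing else. The name tighten_head and the shortened sample are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove whitespace-only gaps that come right before
# <html>, </html>, <head> or </head>. Other line breaks are left alone.
sub tighten_head {
    my ($text) = @_;
    $text =~ s{>\s+(?=</?(?:html|head)\b)}{>}gi;
    return $text;
}

# Made-up, shortened sample with mixed line endings.
my $text = "<!doctype html>\r\n<html>\n<head>\n   <title>Contents</title>\n</head>";
print tighten_head($text), "\n";
```

The lookahead keeps the tag itself out of the match, so only the whitespace is replaced, and the \b stops it from matching longer names such as header.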
 
Thanks! There did seem to be a CR that crept into the code somehow. The program now works great!

Thanks again! You guys are invaluable!
 