
Special characters

Status
Not open for further replies.

janvdk

IS-IT--Management
Jan 15, 2005
8
BE

Hi all,

I have a problem: in the following code some characters get "changed" by the script in some way. I gave the string to be searched a fixed value to illustrate my problem. In reality $htmlline gets variable values.

-----------
$htmlline = "<td valign='top' >König der Sterne<br>Contatos Cósmicos</td>";


$htmlline =~ s/<td(.*?)>//g;
$htmlline =~ s/<\/td(.*?)>//g;

#following function removes blanks before and after string (not included here)
$htmlline=trimwhitespace($htmlline);

print "$htmlline\n";
# actual output:
# K÷nig der Sterne<br>Contatos C¾smicos
-----------

Does anybody have an idea how I can fix this?

Thanks,

Jan
 
Code:
@chars=split //, $htmlline;
foreach (@chars) {
   print "$_ = ". ord($_)."\n";
}

I'm guessing it has something to do with the character set. Try the snippet above before and after your regexes and see whether any of the data that shouldn't have changed did change, if you know what I mean.

HTH
--Paul

cigless ...
 
Paul,

Thanks for your answer.
I did what you suggested. Running your code before and after the regexes gives the same result for the specific characters: the bad characters are printed in both places. So your code already shows bad characters before the first regex is applied.

Any further ideas starting from this?

Thanks !

Jan
 
Jan

Looks like a Unicode problem. Put a 'use warnings;' at the start of your code; perl should then issue a warning about wide characters. I suspect that even if you read the HTML and print it straight out without doing anything to it, you will still have a problem.

Try
Code:
open(HTMLOUT, '>:utf8', 'my.out') or die "Cannot open my.out: $!";
print HTMLOUT "$htmlline\n";
and see if it fixes the problem.
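
A slightly fuller sketch of the same idea, in case it helps: decode the incoming bytes first, then write them through an explicit output layer. The input encoding (ISO-8859-1) is an assumption here, since the thread does not say how the HTML is actually encoded; 'my.out' is just the file name from the snippet above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the raw input is ISO-8859-1 (Latin-1). Adjust to the real source encoding.
my $bytes = "K\xF6nig der Sterne<br>Contatos C\xF3smicos";

# Turn the raw bytes into a Perl character string.
my $htmlline = decode('ISO-8859-1', $bytes);

# Write the string with an explicit output encoding.
open(my $out, '>:encoding(UTF-8)', 'my.out') or die "Cannot open my.out: $!";
print {$out} "$htmlline\n";
close($out) or die "Cannot close my.out: $!";
```

':encoding(UTF-8)' is used here instead of ':utf8' because it validates the data as well as encoding it; either layer should do for a quick test.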
 
It's got to be down to the character set.

BTW the characters aren't necessarily bad ;-)

If you're taking this data from an HTML form, you may have to specify a different character set. However, if you save these values to a database and read them back out in the intended character set, there shouldn't be a problem (... he thinks).

Can you post a link to the page you're parsing?
--Paul

cigless ...
 
stevexff and PaulTEG,

Thanks for your valuable help.
I think I worried a bit too early, basing myself on some "DOS-box" screen output. When I write the values to an HTML file there is no issue, and the contents are shown correctly in the browser.

Thanks !

Jan
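
For anyone hitting the same thing: the garbled output in the first post is consistent with Latin-1 bytes being displayed by a console that uses code page 850, where byte 0xF6 ('ö' in Latin-1) is shown as '÷' and 0xF3 ('ó') as '¾'. The sketch below illustrates that, assuming the data is ISO-8859-1 and the DOS box uses cp850; neither is confirmed in the thread, and the helper names console_view and for_console are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: how a console with code page $cp interprets raw bytes.
sub console_view {
    my ($bytes, $cp) = @_;
    return decode($cp, $bytes);
}

# Hypothetical helper: re-encode Latin-1 bytes (assumed input) for a console using $cp.
sub for_console {
    my ($bytes, $cp) = @_;
    return encode($cp, decode('ISO-8859-1', $bytes));
}

my $latin1 = "K\xF6nig";   # 'ö' is byte 0xF6 in ISO-8859-1

# Code point the cp850 console displays for the second character.
printf "console shows U+%04X\n", ord(substr(console_view($latin1, 'cp850'), 1, 1));

# Byte the console would need in order to display 'ö'.
printf "console needs 0x%02X\n", ord(substr(for_console($latin1, 'cp850'), 1, 1));
```

So the string itself is intact; only the console's interpretation of the bytes differs, which is why the HTML file looks fine in the browser.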
 