utf8

Status
Not open for further replies.

newbie707

Programmer
Apr 19, 2005
4
DE
Hi all,

I am trying to read a French HTML page that is encoded in UTF-8, extract the text from the HTML code, and write it to a text file.
The problem is that some characters are converted to strange characters. Stranger still, not every occurrence of these characters is converted.

for example:
Du temps du président Sukarno, l’Indonésie affichait sans équivoque son mépris pour les contraintes du développement économique et les relations qu’il implique avec les pays occidentaux.
is converted to:
Du temps du président Sukarno, l’Indonésie affichait sans équivoque son mépris pour les contraintes du développement économique et les relations qu’il implique avec les pays occidentaux.

Please note that the "é" in président is converted to "Ã©" but the "é" in Indonésie isn't.

thanks for help
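
A minimal sketch of how this kind of garbling typically arises, assuming the cause is UTF-8 bytes being interpreted as Latin-1 somewhere along the way (the thread does not confirm this for the program in question). It uses only the core Encode module, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "é" is the single character U+00E9; in UTF-8 it is the two bytes 0xC3 0xA9.
my $char  = "\x{E9}";
my $bytes = encode("UTF-8", $char);

# Misreading those two bytes as Latin-1 yields two characters,
# U+00C3 and U+00A9, which display as "Ã©".
my $garbled = decode("ISO-8859-1", $bytes);

# Show the values as hex so the output does not depend on the terminal.
my $hex_bytes   = join " ", map { sprintf "%02X", ord } split //, $bytes;
my $hex_garbled = join " ", map { sprintf "%02X", ord } split //, $garbled;

print "UTF-8 bytes:         $hex_bytes\n";
print "garbled code points: $hex_garbled\n";
print "length before: ", length($char), ", after: ", length($garbled), "\n";
```

If only some strings in a program pass through such a misreading, only some occurrences of a character end up garbled, which would match the symptom described.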
 
Hey 707,

Are you using any file disciplines (I/O layers) to tell Perl which encoding it is reading and which encoding it should write? Perl is pretty good at figuring out the encoding to use, but occasionally it can get confused. For good measure, try the following snippet:


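use Carp;

# Open the input with the :utf8 layer so Perl decodes it as UTF-8.
if (!open IN, "<:utf8", $datname) {
    carp "Could not open $datname for reading.";
}

my @allLines = <IN>;
close IN;

# Open the output with the :utf8 layer so Perl encodes it as UTF-8.
if (!open OUT, ">:utf8", $outname) {
    carp "Could not open $outname for writing.";
}

# Each line still ends with its own newline, so print the lines unchanged.
print OUT @allLines;

close OUT;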

Sonny.
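
A minimal round-trip sketch of the layer approach above, assuming a throwaway temporary file in place of $datname and $outname and a made-up sample string. It uses the stricter `:encoding(UTF-8)` layer rather than `:utf8`; that is a substitution, not what the reply uses.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample text; \x{E9} is "é" and \x{2019} is "’".
my $text = "pr\x{E9}sident, l\x{2019}Indon\x{E9}sie\n";

# Create a temporary file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write through an encoding layer so characters become UTF-8 bytes.
open my $out, ">:encoding(UTF-8)", $path or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# Read back through the same layer so the bytes become characters again.
open my $in, "<:encoding(UTF-8)", $path or die "Cannot read $path: $!";
my $round_trip = <$in>;
close $in;

# Read once more without decoding to see the raw bytes on disk.
open my $raw, "<:raw", $path or die "Cannot read $path: $!";
my $bytes = <$raw>;
close $raw;

print "characters: ", length($round_trip), ", bytes: ", length($bytes), "\n";
print "round trip ", ($round_trip eq $text ? "matches" : "differs"), "\n";
```

The byte count is larger than the character count because each accented or typographic character takes more than one byte in UTF-8. Reading or writing the same file without a layer is where the two views can get mixed up.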
 
Hi Sonny,

The program I use is the following:
Code:
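#!/usr/bin/perl -w

use HTML::Element;
use HTML::TreeBuilder;

$tree = HTML::TreeBuilder->new();
$fn = '/work/...';
# Read the page through the :utf8 layer.
open($R,"<:utf8",$fn) or die "Could not open $fn: $!";

$tree->parse_file($R);
# Collect every <p class="spip" align="justify"> element.
@text_nodes = $tree->look_down("_tag","p",
          "class","spip",
          "align","justify");
# Join the text of the matching paragraphs, one per line.
my $txt = "";
foreach my $node (@text_nodes){
  $txt .= $node->as_text();
  $txt .= "\n";
}

# Write the extracted text through the :utf8 layer.
open(W,">:utf8","out.txt") or die "Could not open out.txt: $!";
print W $txt;
close(W);
close($R);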
 
Hey 707,

Try removing the HTML::TreeBuilder object from your code and just do a simple read and then write.


Sonny.
 
A simple read and write (without the ":utf8" mode) works fine, but the result is still HTML code, so the tags and the scripts are still there, and some characters are encoded as HTML entities; the apostrophe, for example, is encoded as &#8217; and so on.
 
Hey, I got it.
One has to eliminate the tags using regular expressions and then convert things like &#8217; to Unicode using the module HTML::Entities. [thumbsup2]
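
A minimal sketch of that approach, assuming a short made-up HTML fragment in place of the real page (the actual regular expressions used were not posted). It relies on `decode_entities` from HTML::Entities, which is not a core module and has to be installed; the tag-stripping patterns are illustrative only.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical fragment standing in for the downloaded page.
my $html = '<p class="spip" align="justify">l&#8217;Indon&eacute;sie</p>'
         . '<script>var x = 1;</script>';

# Drop script blocks first, then any remaining tags. A regex is only a
# rough approximation of HTML parsing and can fail on unusual markup.
(my $text = $html) =~ s{<script\b.*?</script>}{}gis;
$text =~ s{<[^>]*>}{}g;

# Turn numeric and named entities into Unicode characters.
$text = decode_entities($text);

# Encode on output so the characters are written as UTF-8 bytes.
binmode STDOUT, ":encoding(UTF-8)";
print "$text\n";
```

Decoding the entities after the tags are gone keeps any "<" or ">" produced by entities such as &lt; from being mistaken for markup.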
 