Hey, I got it.
You have to strip the tags using regular expressions and then convert entities such as &rsquo; (the ’ character) to Unicode using the module HTML::Entities. [thumbsup2]
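For anyone landing here later, the approach described above could look roughly like this. It is only a minimal sketch: the helper name html_to_text and the sample string are made up for illustration, the tag-stripping regexes are crude (they break on a ">" inside an attribute value), and the UTF-8 output layer is my own assumption about how the text should be written out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# html_to_text is a hypothetical helper name, not something from the thread.
sub html_to_text {
    my ($html) = @_;
    # Drop <script> and <style> blocks together with their contents.
    $html =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{}gis;
    # Remove the remaining tags (crude: fails on ">" inside attribute values).
    $html =~ s{<[^>]+>}{}g;
    # Turn entities such as &rsquo; or &eacute; into Unicode characters.
    return decode_entities($html);
}

# Made-up sample input, only for illustration.
my $sample = '<p class="spip">l&rsquo;&eacute;t&eacute;</p><script>var x = 1;</script>';
my $text   = html_to_text($sample);

# decode_entities returns a character string, so encode it on output.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

Because the decoded text contains characters above U+00FF, printing it without an encoding layer would trigger a "Wide character" warning, which is why the sketch sets one on STDOUT.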
A simple read and write (without using the ":utf8" mode) works fine, but the result is still HTML code, so the tags and the scripts are still there, and some characters are encoded as HTML entities; for example, the apostrophe ’ is encoded as an entity such as &rsquo;, and so on.
Hi sunny,
The program I use is the following:
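#!/usr/bin/perl
use strict;
use warnings;
use HTML::Element;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new();
my $fn = '/work/...';
# Decode the file as UTF-8 while reading it
open(my $R, "<:utf8", $fn) or die "Cannot open $fn: $!";
$tree->parse_file($R);
# Look for <p class="spip"> elements
my @text_nodes = $tree->look_down("_tag", "p",
"class", "spip"...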
Hi all,
I am trying to read a French HTML page, which is encoded in UTF-8, and then extract the text from the HTML code to write it to a txt file.
The problem is that some characters are converted to strange characters; stranger still, not every occurrence of these characters is converted.
for...