What's the proper way to handle a Unicode text file on Windows?

Status
Not open for further replies.

cheer8923

Programmer
Aug 7, 2006
230
US
The files start with the byte order mark FFFE or FEFF. I tried something like this:

my $fh = new FileHandle("< $file");
if (! $fh) {
    die "failed to open list file '$file': $!";
}
my $marker;
if (2 != read($fh, $marker, 2)) {
    die "Failed to read the first 2 bytes from $file";
}
if ($marker eq $UNICODE_FFFE) {
    binmode($fh, ":encoding(utf8)");
}
else {
    $fh->seek(0, 0);
}

But the following read

$line = <$fh>;

still generates a lot of errors.

Printing $line produces letters alternating with spaces.

The script deals with just ASCII text.

1. What's the proper way to detect Unicode in a file?

2. How do I deal with Unicode strings in regular expression matching?

3. Do I need to convert Unicode strings to non-Unicode strings to do string operations, including matching?

Thanks!
 
Seeing as no-one else is replying - maybe try:

Code:
    use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);

...then something like:

Code:
# clean up the main RDF file...will NOT work with BIG5, but only UTF8
sub run_rdf_cleanup {

  print "Cleaning up RDF file... \n";

  `mv content.rdf.u8 content.rdf.u8.2`;

  open (CONTENT,"/path/to/file.txt") || die $!;
  open (WRITEIT,">/path/to/file2.txt") || die $!;
    while (<CONTENT>) {
      if (/[\200-\377]/) { 
         s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'ISO-8859-1'})/eg; 
      }
      print WRITEIT $_;       
    }
  close(WRITEIT);
  close(CONTENT);

}

Only a wild stab in the dark (it's something I've used before with the DMOZ RDF file, as I had similar problems with non-standard English characters).

Hope that helps.

Cheers

Andy
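
Another angle on the original question, offered as a rough sketch rather than a definitive fix: a file that starts with the bytes FF FE or FE FF is UTF-16 (little-endian or big-endian respectively), not UTF-8, so an :encoding(utf8) layer will not decode it. The "letters alternating with spaces" are most likely the zero high bytes of UTF-16 code units. The helper name open_text_file below is made up for illustration, and the code assumes the file is either BOM-marked UTF-16/UTF-8 or plain single-byte text.

```perl
use strict;
use warnings;

# Hypothetical helper: open a text file and choose a decoding layer
# based on the byte order mark (BOM) at the start of the file.
sub open_text_file {
    my ($file) = @_;

    # Open in raw mode so the BOM bytes can be inspected unmodified.
    open(my $fh, '<:raw', $file)
        or die "failed to open '$file': $!";

    # Read up to 3 bytes: enough for a UTF-16 (2-byte) or UTF-8 (3-byte) BOM.
    my $head = '';
    defined(read($fh, $head, 3))
        or die "failed to read from '$file': $!";

    my ($encoding, $bom_len) = ('', 0);
    if    ($head =~ /^\xFF\xFE/)     { ($encoding, $bom_len) = ('UTF-16LE', 2) }
    elsif ($head =~ /^\xFE\xFF/)     { ($encoding, $bom_len) = ('UTF-16BE', 2) }
    elsif ($head =~ /^\xEF\xBB\xBF/) { ($encoding, $bom_len) = ('UTF-8',    3) }

    # Reposition just past the BOM, or back to the start if there is none.
    seek($fh, $bom_len, 0) or die "seek failed on '$file': $!";

    # Decode on read. The :crlf layer sits on top of the decoder so that
    # Windows line endings come back as a plain "\n".
    binmode($fh, ":encoding($encoding):crlf") if $encoding;

    return $fh;
}

# Usage: pass a file name on the command line.
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    my $fh = open_text_file($ARGV[0]);
    while (my $line = <$fh>) {
        chomp $line;
        # Ordinary regex matching works on the decoded string.
        print "$line\n" if $line =~ /\w/;
    }
    close($fh);
}
```

Once the handle has a decoding layer, each $line is a normal Perl character string, so regular expressions and other string operations should work on it directly without converting it to anything else first.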
 