What's the proper way to handle Unicode text files on Windows?

cheer8923 · Nov 1, 2008

The files start with the byte order mark FFFE or FEFF. I tried something like this:

my $fh = new FileHandle("< $file");
if (! $fh) {
    die "failed to open list file '$file': $!";
}
my $marker;
if (2 != read($fh, $marker, 2)) {
    die "Failed to read the first 2 bytes from $file";
}
if ($marker eq $UNICODE_FFFE) {
    binmode($fh, ":encoding(utf8)");
}
else {
    $fh->seek(0, 0);
}

But the following read

$line = <$fh>;

still generates a lot of errors.

print $line produces letters alternating with spaces.

The script only deals with ASCII text.

1. What's the proper way to detect Unicode in a file?

2. How do I deal with Unicode strings in regular expression matching?

3. Do I need to convert Unicode strings to non-Unicode strings to do string operations, including matching?
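
For what it's worth, here is a minimal sketch of one way to approach the first question, assuming the files are UTF-16 (BOM bytes FF FE or FE FF) or UTF-8 with a BOM. A UTF-16 file read as raw bytes would likely also explain the letters alternating with spaces, since each ASCII character is stored with an extra zero byte. The helper name `open_text_file` is made up for illustration, and the layer string is a commonly used idiom rather than anything confirmed in this thread.

```perl
use strict;
use warnings;

# Map each byte order mark to the PerlIO encoding it implies.
# UTF-32 BOMs are deliberately left out of this sketch.
my @BOMS = (
    [ "\xEF\xBB\xBF", 'UTF-8'    ],
    [ "\xFF\xFE",     'UTF-16LE' ],
    [ "\xFE\xFF",     'UTF-16BE' ],
);

# Hypothetical helper: open $file for reading and pick a decoding
# layer from its BOM. Returns the handle and the detected encoding
# (undef when no BOM was found).
sub open_text_file {
    my ($file) = @_;

    open(my $fh, '<:raw', $file)
        or die "failed to open '$file': $!";

    # Read up to 3 bytes; a very short file simply yields fewer.
    my $head = '';
    read($fh, $head, 3);

    for my $bom (@BOMS) {
        my ($bytes, $encoding) = @$bom;
        if (index($head, $bytes) == 0) {
            # Position just past the BOM, then decode what follows.
            seek($fh, length($bytes), 0);
            binmode($fh, ":raw:encoding($encoding):crlf");
            return ($fh, $encoding);
        }
    }

    # No BOM: rewind and read the content as plain bytes (ASCII here).
    seek($fh, 0, 0);
    binmode($fh, ':crlf');
    return ($fh, undef);
}
```

Once the handle has a decoding layer, each line read from it is an ordinary Perl character string, so regular expressions and other string operations should work on it directly without converting it back first. Printing non-ASCII characters would additionally need an encoding layer on the output handle.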

Thanks!

youradds · Nov 4, 2008

Seeing as no-one else is replying - maybe try:

Code:

    use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);

...then something like:

Code:

# Clean up the main RDF file. Works only for UTF-8 input, NOT for BIG5.
sub run_rdf_cleanup {

  print "Cleaning up RDF file... \n";

  `mv content.rdf.u8 content.rdf.u8.2`;

  open(my $in,  '<', '/path/to/file.txt')  or die $!;
  open(my $out, '>', '/path/to/file2.txt') or die $!;
  while (<$in>) {
    # Convert each run of high-bit (non-ASCII) bytes from UTF-8 to ISO-8859-1.
    s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'ISO-8859-1' })/eg;
    print $out $_;
  }
  close($out);
  close($in);

}

Only a wild stab in the dark (it's something I've used before with the DMOZ RDF file, as I had similar problems with non-standard English characters).
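
The UTF-8 to ISO-8859-1 conversion in the cleanup routine above can probably also be done with the core Encode module instead of Unicode::MapUTF8. This is only a sketch of that alternative, not what was used with the DMOZ file, and the helper name `utf8_to_latin1` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert a UTF-8 byte string to ISO-8859-1 bytes.
# Characters with no Latin-1 equivalent are replaced by Encode's
# default substitution character.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> Perl characters
    return encode('ISO-8859-1', $chars);    # characters -> Latin-1 bytes
}
```

Like the original routine, this assumes the input really is UTF-8; it does nothing useful for UTF-16 files such as the ones in the question.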

Hope that helps.

Cheers

Andy
