Unicode Text Files

timesign · Jul 9, 2002

1) Does anyone know how I can determine how a text file is encoded, i.e. whether the text file I am opening is UTF-8 or not? Word usually knows when it opens a file. How does it know?
Is there a byte that is set to a specific value?

Nosferatu · Jul 10, 2002

You are talking about text files here, so usually no additional data is stored in them to tell you whether you're dealing with 16-bit or 8-bit characters.

The only way to do it (that I can think of, and how I would do it) is to check for bytes inside the text file that are not printable US-ASCII.
Most likely, if you encounter 0 bytes at every other position (which is what you get when the text is in any Latin alphabet), then you have a UTF-16 text file. Bytes above 0x7F with no 0's around them do not settle it: that could be UTF-8 or a plain 8-bit code page.
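
The zero-byte check described above could be sketched roughly like this in Python. This is only a heuristic under the stated assumption of mostly-Latin text; the function name `guess_utf16_by_nuls` and the 0.3 threshold are made up for illustration and are not from any library.

```python
from typing import Optional


def guess_utf16_by_nuls(data: bytes, threshold: float = 0.3) -> Optional[str]:
    """Guess UTF-16 byte order from where the zero bytes sit.

    Only meaningful for mostly-Latin text; the threshold is an arbitrary
    illustrative value, not a tuned one.
    """
    if len(data) < 2:
        return None
    even = data[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = data[1::2]   # bytes at offsets 1, 3, 5, ...
    even_nuls = even.count(0) / len(even)
    odd_nuls = odd.count(0) / len(odd)
    if odd_nuls >= threshold and even_nuls < threshold:
        return "utf-16-le"  # character byte first, zero byte second
    if even_nuls >= threshold and odd_nuls < threshold:
        return "utf-16-be"  # zero byte first, character byte second
    return None  # no UTF-16 pattern found; could be UTF-8 or an 8-bit code page
```

A `None` result does not prove the file is UTF-8; it only means the alternating-zero pattern was not seen.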

UTF-16 uses 16-bit units for characters, while UTF-8 uses 8-bit units and so has no byte-order variants. Depending on the byte ordering of the machine that wrote the file, this is how the text would look in UTF-8 and in UTF-16, big and little endian:

UTF-8
"I like to scare dogs".

UTF-16 big Endian:
"0I0 0l0i0k0e0 0t0o0 0s0c0a0r0e0 0d0o0g0s".
UTF-16 little Endian:
"I0 0l0i0k0e0 0t0o0 0s0c0a0r0e0 0d0o0g0s0".

The 0's there are simply bytes with the value 0, and they will usually display as black rectangles in a non-UTF-16 viewer.
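
The byte layouts can be reproduced with a few lines of Python (chosen here only for illustration). The helper `show` is a made-up name; it prints each zero byte as the character 0, the same convention used in the strings above, and assumes the text is plain ASCII.

```python
def show(encoded: bytes) -> str:
    # Render a zero byte as '0' and every other byte as its character,
    # which is only sensible for ASCII text like the sample here.
    return "".join("0" if b == 0 else chr(b) for b in encoded)


text = "I like"
print(show(text.encode("utf-8")))      # I like
print(show(text.encode("utf-16-be")))  # 0I0 0l0i0k0e
print(show(text.encode("utf-16-le")))  # I0 0l0i0k0e0
```

The `utf-16-be` and `utf-16-le` codec names write no byte-order mark, so the output shows only the character data.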
HTH. Nosferatu
We are what we eat...
There's no such thing as free meal...
once stated: methane@personal.ro

chiph · Jul 16, 2002

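Many Unicode files have a byte-order mark on them. A UTF-8 file may start with 0xEF 0xBB 0xBF, which is the same on every CPU. For UTF-16, look for 0xFF 0xFE at the start of a file written in little-endian order (Intel CPUs); big-endian files (Motorola CPUs, of which I don't have an example, sorry) would have the bytes in the opposite order, 0xFE 0xFF. Not every file carries the mark, so its absence tells you nothing.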
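
A minimal sketch of the byte-order-mark check in Python, using the BOM constants from the standard `codecs` module. The names `sniff_bom` and `sniff_file` are hypothetical, and only the UTF-8 and UTF-16 marks discussed in this thread are covered.

```python
import codecs
from typing import Optional

# (mark, codec name) pairs; "utf-8-sig" is Python's codec that skips the UTF-8 mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with a known mark, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path: str) -> Optional[str]:
    # Three bytes are enough for the longest mark checked here.
    with open(path, "rb") as f:
        return sniff_bom(f.read(3))
```

When `sniff_bom` returns `None`, the file simply has no mark, and some other check is still needed to decide what it is.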

Chip H.
