Unicode Text Files

Status
Not open for further replies.

timesign

Programmer
May 7, 2002
53
US
1) Does anyone know how I can determine how a text file is encoded, i.e. whether the text file I am opening is UTF-8 or not? Word usually knows when it opens a file. How does it know?
Is there a byte that is set to a specific value?
 
You are talking about plain text files here, so they usually carry no additional data to tell you whether you're dealing with 16-bit or 8-bit characters.

The only way I can think of (and how I would do it) is to check the file for bytes that should not appear in plain US-ASCII text, that is, non-printable values.
If you encounter such bytes (mainly 0's, if the text is in a Latin alphabet), then you most likely have a UTF-16 text file. This is a heuristic, not a guarantee.
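
As a rough illustration of that zero-byte check, here is a minimal Python sketch. The function name guess_utf16_by_nuls and the 0.3 threshold are my own assumptions, not anything standard, and it only works for text that is mostly Latin letters:

```python
from typing import Optional


def guess_utf16_by_nuls(data: bytes, threshold: float = 0.3) -> Optional[str]:
    """Guess the UTF-16 byte order from where the zero bytes fall.

    Heuristic only: for mostly-Latin text, every second byte of UTF-16 is 0.
    Returns "utf-16-le", "utf-16-be", or None when there is no clear pattern.
    """
    if len(data) < 2:
        return None
    even = data[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = data[1::2]   # bytes at offsets 1, 3, 5, ...
    even_zero = even.count(0) / len(even)
    odd_zero = odd.count(0) / len(odd)
    # Little endian stores the low byte first, so the zeros sit at odd offsets.
    if odd_zero >= threshold and even_zero < threshold:
        return "utf-16-le"
    # Big endian stores the high byte first, so the zeros sit at even offsets.
    if even_zero >= threshold and odd_zero < threshold:
        return "utf-16-be"
    return None
```

You would read the first few kilobytes of the file in binary mode and pass them in. A result of None does not prove the file is UTF-8; it only means no UTF-16 pattern was seen.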

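Unicode text is often stored as UTF-16, which uses 16-bit units for characters, while UTF-8 uses 8-bit units (one byte per character for plain ASCII text). For UTF-16 the byte ordering of the machine matters. This is how text would look in UTF-8 and in UTF-16, big and little endian: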

UTF-8
"I like to scare dogs".

UTF-16 big Endian:
"0I0 0l0i0k0e0 0t0o0 0s0c0a0r0e0 0d0o0g0s".
UTF-16 little Endian:
"I0 0l0i0k0e0 0t0o0 0s0c0a0r0e0 0d0o0g0s0".

The 0's there are bytes with the value 0 (not the digit "0"), and each one will typically display as a black rectangle in a viewer that is not UTF-16 aware.
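
If you want to see the byte layouts for yourself, something like the following Python sketch should do it. The helper name render_bytes is made up for this example; it prints each zero byte as the digit 0 so the pattern is visible, and it uses the explicit-endian codecs so no byte-order mark is added:

```python
def render_bytes(text: str, encoding: str) -> str:
    """Encode text and show each byte as its character, with zero bytes as '0'."""
    raw = text.encode(encoding)
    return "".join("0" if b == 0 else chr(b) for b in raw)


sample = "I like to scare dogs"
for enc in ("utf-8", "utf-16-be", "utf-16-le"):
    # utf-16-be / utf-16-le encode without a leading byte-order mark.
    print(enc.ljust(10), render_bytes(sample, enc))
```

In big endian the zero byte comes before each letter, in little endian it comes after.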
HTH. [red]Nosferatu[/red]
We are what we eat...
There's no such thing as free meal...
once stated: methane@personal.ro
 
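Many Unicode files have a byte-order mark (BOM) on them. For UTF-16, look for 0xFF 0xFE at the start of a file written on an Intel (little-endian) CPU. Motorola (big-endian) CPUs (of which I don't have an example, sorry) would have the bytes in the opposite order, 0xFE 0xFF. A UTF-8 file may instead start with 0xEF 0xBB 0xBF, which is the same on every CPU because UTF-8 has no byte order. Not every file has a BOM, so its absence proves nothing.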
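
A minimal Python sketch of a byte-order-mark check might look like this. The names sniff_bom and sniff_file are hypothetical; the BOM constants come from the standard codecs module. It only covers UTF-8 and UTF-16, and it returns None when no BOM is present:

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding name based on a leading byte-order mark, or None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None  # no BOM: could still be UTF-8, ASCII, or a legacy code page


def sniff_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and check them for a BOM."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))
```

Note that a UTF-32 little-endian file also begins with FF FE, which this sketch would report as UTF-16.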

Chip H.
 