1) Does anyone know how I can determine how a text file is encoded, i.e. whether the text file I am opening is UTF-8 or not? Word usually knows when it opens a file. How does it know?
Is there a byte that is set to a specific value?
You are talking about plain text files here, so there is usually no extra data in them to tell you whether you're dealing with 16-bit or 8-bit characters.
The only way I can think of (and how I would do it) is to check the file for bytes that are not printable US-ASCII.
If you encounter such bytes (mainly 0's at every other position, if the text is in a Latin alphabet), then you most likely have a UTF-16 Unicode text file.
UTF-16 uses 16 bits per character for most text, while UTF-8 uses 8 bits for plain ASCII characters. Depending on the byte order the file was written in, this is how the text would look in UTF-8 and in UTF-16, big and little endian:
UTF-8:
"I like to scare dogs".
UTF-16 big endian (the zero byte comes first in each pair):
"0I0 0l0i0k0e0 0t0o0 0s0c0a0r0e0 0d0o0g0s".
UTF-16 little endian (the zero byte comes second in each pair):
"I0 0l0i0k0e0 0t0o0 0s0c0a0r0e0 0d0o0g0s0".
The 0's there are bytes with the value 0 (not the digit "0"), and each one will typically display as a black rectangle or a blank in a viewer that does not understand UTF-16.
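If you want to try the zero-byte check in code, here is a rough sketch in Python. The function name guess_by_nul_bytes is made up for this example, and the check is only a heuristic: it assumes mostly Latin text, and it cannot tell UTF-8 apart from other 8-bit encodings.

```python
def guess_by_nul_bytes(data: bytes) -> str:
    """Guess an encoding family from where 0x00 bytes fall (heuristic only)."""
    if b"\x00" not in data:
        return "no NUL bytes: ASCII, UTF-8, or another 8-bit encoding"
    even = data[0::2].count(0)  # NUL bytes at offsets 0, 2, 4, ...
    odd = data[1::2].count(0)   # NUL bytes at offsets 1, 3, 5, ...
    if even > odd:
        return "probably UTF-16 big-endian"
    if odd > even:
        return "probably UTF-16 little-endian"
    return "undetermined"


sample = "I like to scare dogs"
for codec in ("utf-8", "utf-16-be", "utf-16-le"):
    raw = sample.encode(codec)
    # Show the first few bytes so the position of the zero bytes is visible.
    print(codec, raw[:6], "->", guess_by_nul_bytes(raw))
```

The "-be" and "-le" codecs are used here because they write no byte-order mark, so the output shows only the character bytes.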
HTH. Nosferatu
We are what we eat...
There's no such thing as free meal... once stated: methane@personal.ro
Many Unicode files have a byte-order mark (BOM) at the start of the file. For UTF-8, look for 0xEF 0xBB 0xBF; those three bytes are the same on every CPU. For UTF-16, look for 0xFF 0xFE if the file was written little-endian (as on Intel CPUs) or 0xFE 0xFF if it was written big-endian (as on Motorola CPUs).
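A minimal sketch of the byte-order-mark check in Python, using the BOM constants from the standard codecs module. The name sniff_encoding is made up for this example. When no mark is present, the sketch falls back to trying a strict UTF-8 decode, which is an extra step not mentioned above; note that plain ASCII also passes that test.

```python
import codecs

# Check the longer marks first: the UTF-32 LE mark (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes):
    """Return a codec name from the byte-order mark, or None if unknown."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        # No mark found: see whether the bytes are at least valid UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

To use it on a file, open the file in binary mode and pass in its contents, for example sniff_encoding(open(path, "rb").read()), where path is whatever file you want to test. A result of None means the file has no mark and is not valid UTF-8, so it is probably in some legacy 8-bit code page.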