Trying to get unicode working

wilsonian · Aug 25, 2003

Hi there.

I am trying to convert my program from reading and writing files with ANSI encoding to reading and writing files with Unicode encoding. I am using Notepad to create and edit the files. I have looked at the new files in a hex editor, and they are being saved as Unicode when I type text into Notepad.

I am using Microsoft Visual C++ 6. Within my program I am using _T, _ftprintf etc. I have defined UNICODE and _UNICODE for all configurations under both the C/C++ and the Resources settings, and I have wWinMainCRTStartup defined as my entry point.

However, the program is still reading and writing one-byte characters. E.g., when it reads in "Hello" it gets FF FE 48 00 (the 48 00 being the Unicode H) and reads this as a three-character string made up of 1) y-umlaut, 2) something that looks like the Old English letter thorn, 3) the H, and then some whitespace.

If you have any idea why this is happening I would be very grateful; I'm at a complete loss.

Thank you very much.

Cagliostro · Aug 25, 2003

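To use Unicode, use L instead of _T.
The string will then be a wchar_t* (or BSTR) instead of a char*.
If you use fread to read, use:
wchar_t x[xxx];

// file should be opened in binary mode ("rb");
// n is the number of wide characters actually read
size_t n = fread(
x,
sizeof(wchar_t),
xxx,
file);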

Ion Filipski

ICQ: 95034075
AIM: IonFilipski
filipski@excite.com

JohnLac · Aug 26, 2003

I have been doing the same thing but I don't seem to have a problem. The only difference I can see between what you have done and what I have done is that I defined _UNICODE under Project->Settings->C/C++->Preprocessor definitions as opposed to the Resources tab. Also, whenever I want to define a string I use TCHAR as opposed to char or wchar_t. Then when you write a string into a TCHAR string, such as:

///////main//////

TCHAR szString[MAX_PATH] = _T("");

_tcscpy("Hello World");

//////////////////

the memory windows should show the string as:

H.e.l.l.o. .W.o.r.l.d

as opposed to

Hello World

-This has worked for me in the past

-John

JohnLac · Aug 26, 2003

Sorry, that "_tcscpy" should have looked like this:

_tcscpy(szString, _T("Hello World"));

sorry,
John

wilsonian · Aug 26, 2003

IonFilipski, I thought the whole purpose of _T was to choose char or wchar_t depending on whether UNICODE was defined. I was under the impression that it was equivalent to L.

JohnLac, I am using TCHAR, _ftprintf, _tcscpy etc. E.g., here is the code I have been testing with:

//Reading from file with "Hello" in it
TCHAR* hello = new TCHAR[50];
FILE* fileInPtr = _tfopen(_T("C:\\temp\\uni.txt"), _T("r+"));
_ftscanf(fileInPtr, _T("%s"), hello);
fclose(fileInPtr);

Thank you for your suggestions.
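
On the _T question raised at the top of this post, a minimal sketch may help. MY_UNICODE, MY_T, my_tchar and my_tchar_size are hypothetical stand-ins (not the real tchar.h names) so that the snippet builds with any compiler; the real header does essentially the same token pasting. The point it illustrates is that _T only decides the type of string literals in memory. It says nothing about how bytes in a file are interpreted.

```cpp
#include <cstddef>

// Hypothetical stand-in for the _UNICODE project setting.
#define MY_UNICODE 1

#ifdef MY_UNICODE
typedef wchar_t my_tchar;   // plays the role of TCHAR in a Unicode build
#define MY_T(x) L##x        // pastes an L prefix: MY_T("H") becomes L"H"
#else
typedef char my_tchar;      // plays the role of TCHAR in an ANSI build
#define MY_T(x) x           // leaves the literal untouched
#endif

// Size in bytes of one character of whatever MY_T produces.
inline std::size_t my_tchar_size()
{
    return sizeof(MY_T("H")[0]);
}
```

With MY_UNICODE defined, MY_T("H") and L"H" are the same literal, so the two spellings are equivalent in a Unicode build, as wilsonian suspected.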

JohnLac · Aug 26, 2003

I have just done the same thing.

Are you still having the problem? I tried plugging that code in and it seemed to work fine. Looking in the memory window in debug mode, it keeps the string in memory as:

h.e.l.l.o...

wilsonian · Aug 27, 2003

Yes, the program is still reading the string in the file as the ASCII representation of the two bytes that mark the file as Unicode, followed by "H".

Just to clarify, I am not having trouble getting the program to store and manipulate two-byte chars. I am only having trouble reading from (and writing to) files stored with Unicode encoding. The program simply reads from the file as though it was stored in ANSI. Indeed, if I save my "hello" file with ANSI encoding I do get h.e.l.l.o. as my read-in string. Can I just check that you were reading your "hello" from a Unicode-encoded file?

Thank you very much.
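
One way around the behaviour described above is to bypass text-mode translation entirely: open the file in binary mode and decode the byte pairs by hand. The following is only a sketch under stated assumptions: the file is little-endian two-byte Unicode as Notepad saves it, there are no surrogate pairs, and the helper names read_all_bytes and decode_utf16le are made up for illustration. As far as I know, the text-mode stream functions in Visual C++ 6 treat file contents as single-byte or multibyte characters even in a UNICODE build, which would explain the symptoms.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Decode a little-endian two-byte buffer into a wide string.
// A leading FF FE mark is skipped if present. Surrogate pairs are not handled.
std::wstring decode_utf16le(const std::vector<unsigned char>& bytes)
{
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2;  // skip the byte order mark
    std::wstring out;
    for (; i + 1 < bytes.size(); i += 2)
        out += static_cast<wchar_t>(bytes[i] | (bytes[i + 1] << 8));
    return out;
}

// Read a whole file as raw bytes; "rb" keeps the runtime from translating them.
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> read_all_bytes(const char* path)
{
    std::vector<unsigned char> bytes;
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return bytes;
    unsigned char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        bytes.insert(bytes.end(), buf, buf + n);
    std::fclose(f);
    return bytes;
}
```

Usage would be along the lines of decode_utf16le(read_all_bytes("C:\\temp\\uni.txt")), which for the bytes FF FE 48 00 ... yields a string starting with H rather than y-umlaut, thorn, H.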

Cagliostro · Aug 27, 2003

There is no file property you can check to see whether a file is Unicode or ASCII. To read it, you need to know exactly what the encoding is. If you do not know, you can guess programmatically: for example, if you know the file contains only English letters, then instead of ASCII hello you will see .h.e.l.l.o (big-endian Unicode) or h.e.l.l.o. (plain, little-endian Unicode), where each dot is a zero byte.
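
A complementary check, not part of the suggestion above: the FF FE pair wilsonian saw in the hex editor is a byte order mark, and files saved by Notepad as Unicode normally start with one. When it is present it can be tested directly before falling back to guessing. This is a rough sketch; the Encoding enum and detect_bom are hypothetical names, and UTF-32 marks are deliberately ignored.

```cpp
#include <cstddef>

enum Encoding { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_LE, ENC_UTF16_BE };

// Inspect the first bytes of a file for a byte order mark.
// Returns ENC_UNKNOWN when no mark is found, in which case the caller
// still has to know or guess the encoding.
Encoding detect_bom(const unsigned char* p, std::size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ENC_UTF16_LE;   // the FF FE seen in the original post
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}
```

A file without a mark returns ENC_UNKNOWN, so the zero-byte guess is still useful as a fallback.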


wilsonian · Aug 27, 2003

OK, it's not big endian. How do I use that knowledge to get the program to parse the file properly?

Thank you.

Cagliostro · Aug 27, 2003

For example, suppose you know the file contains only English letters. Here are two pieces of pseudocode. You just read a piece of the file into a byte (char) buffer:

Code:

//conditions:
//1. file should contain characters from 0 to 127
//2. file is not binary
//3. file is bigger than test buffer size
//4. buffer size is at least 10

bool test_ascii()
{
   bool ret = true;//assume ASCII until a zero byte is found
   const int sz = ...;
   char x[sz];
   FILE* file = fopen ...//open in binary mode ("rb")
   int rd = (int)fread(x, 1, sizeof(x), file);//number of bytes read
   fclose(file);
   for(int i = 0; i < rd; i++)
   {
      if(x[i] == 0)//ASCII text has no zero bytes
      {
          ret = false;
          break;
      }
   }
   return ret;
}

bool test_le_unicode()//test little endian
{
   bool ret = true;
   const int sz = ...;
   char x[sz];
   FILE* file = fopen ...//open in binary mode ("rb")
   int rd = (int)fread(x, 1, sizeof(x), file);//number of bytes read
   fclose(file);
   //start at 2: the first two bytes may be the FF FE mark
   for(int i = 2; i + 1 < rd; i += 2)//add 2
   {
      if(x[i + 1] != 0)//for little endian the second byte of each pair is zero
      {//for big endian you will use if(x[i] != 0)
          ret = false;
          break;
      }
   }
   return ret;
}


wilsonian · Aug 28, 2003

Hi there,

I just found the CStdioFile class which I'd overlooked before and it seems to be dealing with unicode nicely.

Thank you very much both of you for your help.
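
For readers who cannot use MFC, the writing half of the original question can be handled the same way as reading: open the file in binary mode and emit the bytes yourself. This is a sketch only, assuming little-endian two-byte output with a leading FF FE mark (the layout Notepad produced in the first post) and characters that fit in 16 bits. write_utf16le is a hypothetical helper name, not a library function.

```cpp
#include <cstdio>
#include <string>

// Write text as little-endian two-byte Unicode, preceded by the FF FE mark.
// Characters above 0xFFFF are not handled. Returns true on success.
bool write_utf16le(const char* path, const std::wstring& text)
{
    std::FILE* f = std::fopen(path, "wb");   // binary: no byte translation
    if (!f)
        return false;
    const unsigned char bom[2] = { 0xFF, 0xFE };
    bool ok = std::fwrite(bom, 1, 2, f) == 2;
    for (std::size_t i = 0; ok && i < text.size(); ++i) {
        unsigned char pair[2];
        pair[0] = static_cast<unsigned char>(text[i] & 0xFF);         // low byte first
        pair[1] = static_cast<unsigned char>((text[i] >> 8) & 0xFF);  // then high byte
        ok = std::fwrite(pair, 1, 2, f) == 2;
    }
    // Always close the file, then report whether every step succeeded.
    return std::fclose(f) == 0 && ok;
}
```

A file written this way should open in Notepad as Unicode, since it begins with the same FF FE bytes described at the start of the thread.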
