Using Arabic Script in a C program


(OP)
Hi, I would like to write a program to keep track of a fairly large number of Arabic vocabulary words and their English equivalents. The required functionality is quite simple, but I am having major difficulties with Arabic script. I have never used Unicode in a C program before, and although I have done a lot of research, I have not yet been able to display a single Arabic character on the screen. I have tried wide and multibyte characters under UTF-8, UTF-16 and UCS-2, and many of the functions in wchar.h of course, but there is something I have misunderstood or am missing altogether. I have been able to print wide characters using their Unicode code points for the basic Latin alphabet, and about 100 or 200 symbols thereafter, but at a certain point the characters begin to repeat the same sequence over and over again. Can anyone lend me some help with this problem? Thank you.

RE: Using Arabic Script in a C program

> but at a certain point the characters begin to repeat the same sequence over and over again.
Sounds like a bug in your code, nothing more nothing less.

Bugs in your memory allocation, or bugs in your string handling are common.
 

--
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.

RE: Using Arabic Script in a C program

(OP)
You've misunderstood the problem I'm having. I don't have any code; I haven't begun to write a program yet. Before I can do that, I have to figure out how to display Arabic characters in C, which I have not yet been able to do. That is not because of any particular bug, but rather a lack of experience in using anything but the Latin alphabet in a program. I have made numerous attempts, and I ask here for advice and pointers, perhaps from someone who has used C to display international characters and who knows the area better than I.

RE: Using Arabic Script in a C program

> but at a certain point the characters begin to repeat the same sequence over and over again.
This implies you've got code which doesn't work.
So post it.

The fact that you've managed to display some characters successfully means you're probably on the right track.  The rest is detail.
 


RE: Using Arabic Script in a C program

Where are you printing them?  Is it on the console or in a graphical program?

Is this for Linux or Windows?

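Most Arabic letters have up to 4 forms: standalone (isolated), joined to a buddy on the left, joined to a buddy on the right, and joined to buddies on both sides.  These should all be in the Unicode character set.  Normally when you type them in, they will be the base letters (0x61F-6DF, inside the Arabic block U+0600 to U+06FF).  The printing forms are from FE70-FEFC, in the Arabic Presentation Forms-B block (U+FE70 to U+FEFF).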
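As a rough illustration of the two ranges mentioned above, the sketch below classifies a code point by Unicode block. The function names are made up for this example, and the block boundaries used are the full Arabic block (U+0600 to U+06FF) and Arabic Presentation Forms-B (U+FE70 to U+FEFF) rather than the narrower ranges quoted in the post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Base letters as typed or stored: the Unicode "Arabic" block. */
bool is_arabic_base(uint32_t cp)
{
    return cp >= 0x0600 && cp <= 0x06FF;
}

/* Positional glyphs: the "Arabic Presentation Forms-B" block. */
bool is_arabic_presentation_form_b(uint32_t cp)
{
    return cp >= 0xFE70 && cp <= 0xFEFF;
}

/* Quick self-check using the letter jeem as the example. */
void arabic_ranges_self_check(void)
{
    assert(is_arabic_base(0x062C));                /* JEEM, base letter   */
    assert(is_arabic_presentation_form_b(0xFE9D)); /* JEEM, isolated form */
    assert(!is_arabic_base('A'));
}
```

Text files normally store the base letters; choosing the joined glyph for each position is usually left to the display layer.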

RE: Using Arabic Script in a C program

(OP)
Oh yes, this is on the console in Windows or Linux; I've tried both. Here is my code, or what is left of a large amount of experimentation:

CODE


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

#define NUM_ARGS 1
#define INPUT_LENGTH 256
FILE *popen(const char *command, const char *type);
FILE *input = NULL;

int main(int argc, char * argv[])
{
  
  setlocale(LC_ALL, "ar_AE");
  
  //setlocale(LC_ALL, "ar_AE.utf8");
  //fwide(stdout, 1);
  //wchar_t c = \u0639;
  
  //1607  When wchar is used instead of an int, many of the symbols are
  // displayed as the same 'u' character with an umlaut.
  
  int b = 2290;
  int i = 0;

  // This displays (most of) the basic latin alphabet 0x0000-0x007F
  for(i = 33; i <= 126; i++){
    fprintf(stdout, "%c  ", i);
  }

  fprintf(stdout, "\n\n\nBreak\n\n\n\n");

  // This displays most of the Latin-1 supplement 0x0080-0x00FF
  for(i = 49825; i <= 50000; i++){
    fprintf(stdout, "%c  ", i);
  }

  fprintf(stdout, "\n\n\nBreak\n\n\n\n");

  // This character does not display
  fprintf(stdout, "\n\n\n%c\n", 639);

  // This loop merely tries to display characters above basic latin-1
  // supplement, but repeats the same string of basic latin.
  for(i = 50342; i <= 60000; i++){
    fprintf(stdout, "%c  ", i);
  }
  
  return EXIT_SUCCESS;
  
}

The repeating sequence of characters and the repeating umlauted u's led me to believe that I lack a certain font or something, but the same thing happens when I try printing Arabic characters in the same way, and the computers I'm working on have properly working Arabic fonts and capabilities.

RE: Using Arabic Script in a C program

Using a text editor, I created a simple file containing a ج character and saved it in a variety of encodings.

CODE

test2-ucs2.txt
000000 48 00 65 00 6c 00 6c 00 6f 00 0a 00 2c 06 0a 00  >H.e.l.l.o...,...<
000010
test2-utf-16be.txt
000000 00 48 00 65 00 6c 00 6c 00 6f 00 0a 06 2c 00 0a  >.H.e.l.l.o...,..<
000010
test2-utf-16le.txt
000000 48 00 65 00 6c 00 6c 00 6f 00 0a 00 2c 06 0a 00  >H.e.l.l.o...,...<
000010
test2-utf8.txt
000000 48 65 6c 6c 6f 0a d8 ac 0a                       >Hello....<
000009
$ cat test2-utf8.txt
Hello
ج
$
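The UTF-8 dump above can be reproduced by hand. The following is a minimal sketch of a UTF-8 encoder, not code from the thread; `utf8_encode` is a name chosen for this example. It shows how the code point U+062C (the 2c 06 pair in the UTF-16LE dump) becomes the two bytes d8 ac.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written to out (1 to 4),
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x062C, buf)` returns 2 and leaves 0xD8, 0xAC in `buf`. Bytes produced this way can be written with plain `fwrite()`, which does not depend on the program's locale.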

I then replicated the same, using program code.

CODE

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main ( ) {
    int     n;
    wchar_t myChar = 0x062c;
    FILE    *fp;

    // I don't have AE, but it's just the utf8 we're after
    if ( setlocale(LC_ALL, "en_GB.utf8") == NULL ) {
        fprintf(stderr,"Failed to set locale\n" );
        return 1;
    }

    fp = fopen("test2-byprog.txt","w");
    if ( fp == NULL ) {
        perror("Unable to open file");
        return 1;
    }

    n = fwide( fp, 1 );
    if ( n <= 0 ) {
        fprintf(stderr,"Failed to set wide mode, result=%d\n", n );
        return 1;
    }

    //fwprintf(fp, L"Hello\n\u062c\n" );      // universal character encoded in string
    //fwprintf(fp, L"Hello\n%C\n", myChar );  // uppercase-C for a single wide char
    fwprintf(fp, L"Hello\n%lc\n", myChar ); // modifier on lowercase-c

    fclose( fp );
    return 0;
}
Any of the three fwprintf() lines will have the same effect, so you have a number of choices as to how to generate your output strings.

The resulting file matches the UTF8 encoded file from step 1.

CODE

$ gcc -std=c99 prog.c ; ./a.out ; od -Ax -t x1z test2-byprog.txt
000000 48 65 6c 6c 6f 0a d8 ac 0a                       >Hello....<
000009
$ cat test2-byprog.txt
Hello
ج
$
I think the key thing to note is that you need to use the 'w' (wide) output functions.  The old narrow functions with a plain %c convert their argument to unsigned char, so anything larger is truncated to a single byte.
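The truncation can be shown without printing anything. This is a small sketch, assuming 8-bit bytes; `narrow_c_byte` is a name invented here. It mirrors what a narrow `%c` conversion does to its `int` argument and is applied in the comments to the numbers used in the earlier experimental program.

```c
#include <assert.h>
#include <limits.h>

/* The byte a narrow "%c" conversion writes for an int argument:
 * the value converted to unsigned char, i.e. its low 8 bits here. */
unsigned char narrow_c_byte(int value)
{
    assert(CHAR_BIT == 8);   /* assumption made by this sketch */
    return (unsigned char)value;
}

/*
 * Worked values:
 *   narrow_c_byte(0x062C) == 0x2C   (ASCII ',', not jeem)
 *   narrow_c_byte(49825)  == 0xA1   (49825 is 0xC2A1)
 *   narrow_c_byte(639)    == 0x7F   (DEL, which has no visible glyph)
 * Adding 256 to the argument never changes the result, so a loop over
 * thousands of consecutive integers cycles through the same 256 bytes.
 */
```

That cycling would account for the sequence that repeats over and over in the original loops.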


RE: Using Arabic Script in a C program

(OP)
Thank you very much for the help. I ran the code on Linux as well as Windows XP, but neither displayed the jeem character. Under Linux with either "en_GB.utf8" or "ar_AE.utf8", 'd8 ac' was displayed as Ø¬, as 'd8' and 'ac' in the Latin-1 Supplement. Windows did not recognize the same locales, but under the "arabic" locale, the following was displayed:

CODE

Hello
u062c
Hello
,
Hello
Ì

I'm confused as to why this is; do you have any idea? Thanks again for the help. Also, I should mention that I am running Linux on Windows XP; I'm not sure if that makes a difference.

RE: Using Arabic Script in a C program

> was displayed as Ø¬, as 'd8' and 'ac' in the Latin-1 Supplement.
Well it certainly isn't being interpreted as a UTF8 stream in that case.
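To make the two readings of the same bytes concrete, here is a hedged sketch with helper names invented for this example. Decoded as one 2-byte UTF-8 sequence, d8 ac is U+062C (jeem). Decoded byte by byte as ISO-8859-1, it is U+00D8 followed by U+00AC, which is the pair of Latin-1 Supplement characters reported.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a single 2-byte UTF-8 sequence.
 * Returns 0 if the bytes do not form a valid 2-byte sequence. */
uint32_t utf8_decode2(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return 0;                          /* wrong lead or continuation byte */
    uint32_t cp = ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
    return cp >= 0x80 ? cp : 0;            /* reject overlong encodings */
}

/* In ISO-8859-1 (Latin-1) every byte is its own code point. */
uint32_t latin1_decode(unsigned char b)
{
    return b;
}

/* The bytes from the hex dump, read both ways. */
void jeem_bytes_self_check(void)
{
    assert(utf8_decode2(0xD8, 0xAC) == 0x062C); /* ARABIC LETTER JEEM */
    assert(latin1_decode(0xD8) == 0x00D8);      /* O WITH STROKE      */
    assert(latin1_decode(0xAC) == 0x00AC);      /* NOT SIGN           */
}
```

If the file holds d8 ac, the program wrote correct UTF-8, and the remaining question is which encoding the terminal or viewer assumes.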

For your Linux box, enter the 'locale' command at the prompt.  I'm guessing it's just the 'C' locale.  Here, it's en_US.utf8

My Linux rig at the moment is a vmware instance of Ubuntu 8.04.
gcc is 4.2.4

Which compiler/version are you using on windows?
 


RE: Using Arabic Script in a C program

Do you have "Install files for complex scripts" from Regional Settings installed?  You can't display joined up Arabic without it.

 

RE: Using Arabic Script in a C program

(OP)
It's using the POSIX locale, and 'locale -a' returns a million locales including all those used above. I'm also using a VMware image; gcc is 3.4.6.

On Windows I'm using Microsoft Visual C++ 6.0. Also, yes, the files for complex scripts have been installed under Regional Settings, and Arabic works properly on both Linux and Windows.

RE: Using Arabic Script in a C program

I'm pretty sure you're going to need a UTF8 locale to be able to simply print UTF8 encoded text streams.
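One way to check this from inside a program on a POSIX system is sketched below. It adopts the locale named by the environment and asks for its character encoding with `nl_langinfo(CODESET)`. The function names are invented for this example, and accepting the spelling "utf8" alongside "UTF-8" is an assumption, since the exact string can differ between platforms.

```c
#include <assert.h>
#include <langinfo.h>   /* POSIX: nl_langinfo(), CODESET */
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* True if a codeset name, as returned by nl_langinfo(CODESET), means UTF-8.
 * glibc reports "UTF-8"; the "utf8" spelling is accepted as an assumption. */
bool codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0;
}

/* Adopt the locale from the environment (LANG, LC_ALL, ...) and report
 * whether its character encoding is UTF-8. */
bool environment_locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return false;   /* the environment names a locale that is not installed */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Under the POSIX or "C" locale, glibc reports an ASCII codeset, so `environment_locale_is_utf8()` would return false there.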

As for Windows, VC6 is pretty old (it was released over a decade ago).

Simple "Express" versions of more up-to-date Microsoft compilers are available for free.
http://en.wikipedia.org/wiki/Visual_Studio_Express_Edition
The 2008 version is the latest.  I would guess this has much better locale information.
 


RE: Using Arabic Script in a C program

(OP)
I've downloaded the 2008 Express edition, but with only a slightly different result: when running the same code under the locale 'arabic', the following is printed to file:

CODE

Hello
Ì
Hello
,
Hello
Ì

It still fails to load any .utf8 locale I can think of, and I don't know how to find a list of the available locales on Windows. What I don't understand, however, is why this fails on Linux as well, where the locales (UTF-8 or not) are definitely available and setting a UTF-8 locale does not fail, yet the characters are still printed strangely and not as UTF-8.
