
Terminology Question (UTF-8 "vs" Unicode)


(OP)
Hello,

I am having a semantic argument at work.

We have code that converts UTF-8 characters (as const char* in C++) into a wchar_t* array of characters. My colleagues are arguing that we are converting "from" UTF-8 "to" Unicode.

My argument is that this is incorrect. We are converting from a single-byte representation of UTF-8 to a multi-byte representation of UTF-8. Both are Unicode. Is this correct?

The follow-on is that, in my understanding, there is no "pure" Unicode format, you must be using some sort of UTF methodology.

Any comments or corrections are greatly welcome.

-Dan

RE: Terminology Question (UTF-8 "vs" Unicode)

Unicode is a set of characters (a "character set"). How those characters are represented as bytes in a string is called a "character encoding". There are several encodings for Unicode: UTF-8, UTF-16 (LE and BE), UTF-32, and even UTF-7 if you like hackable systems.

For more info, here's a good read:
http://www.joelonsoftware.com/articles/Unicode.html

+++ Despite being wrong in every important aspect, that is a very good analogy +++
   Hex (in Darwin's Watch)

RE: Terminology Question (UTF-8 "vs" Unicode)

Hi Jedi-Dan.

Although Unicode has several encodings, one (particularly in Windows documentation) usually refers to UTF-16 as the "Unicode" encoding.
UTF-8 uses a single byte where sufficient and multiple bytes where not, whereas UTF-16 uses two bytes per code unit (four bytes, via surrogate pairs, for characters outside the Basic Multilingual Plane).

Hence you are converting from a variable-width encoding into UTF-16.
So saying that you convert from UTF-8 to Unicode is not wrong.
Saying "from a single-byte representation of UTF-8 to a multi-byte representation of UTF-8", however, IS wrong.

Check your output in a hex editor. If you see a mixture of plain Latin characters and multi-byte sequences (which render as "Â" or "Ã©"-style garbage when misinterpreted as Latin-1), you have UTF-8.
If each plain Latin character is followed by a NULL byte (on a little-endian system), you have UTF-16, colloquially referred to as "Unicode".

;-)

"We had to turn off that service to comply with the CDA Bill."
- The Bastard Operator From Hell
