Computer Codes & Sorts
I've always found these confusing - especially the way in which the same data would sort in a different sequence on a PC and on a mainframe. Having got a basic idea, I thought it would be worth sharing.
Back in the 1970s, most computers standardised on eight-bit bytes. These can be viewed as numbers in the range 0 to 255, or as two-digit hexadecimal numbers in the range 00 to FF. Hexadecimal is a base-16 numbering system, with the numbers 10 to 15 normally shown as A to F. So 1D is (16*1)+13=29, and FF is (16*15)+15=255.
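The base-16 arithmetic can be checked with a few lines of Python. This is only a sketch: the helper name hex_byte_to_decimal is my own, and Python's built-in int(text, 16) does the same job.

```python
HEX_DIGITS = "0123456789ABCDEF"

def hex_byte_to_decimal(text):
    """Convert a two-digit hexadecimal string such as '1D' to a decimal number."""
    high = HEX_DIGITS.index(text[0].upper())  # the "sixteens" digit
    low = HEX_DIGITS.index(text[1].upper())   # the "units" digit
    return high * 16 + low

print(hex_byte_to_decimal("1D"))  # 29
print(hex_byte_to_decimal("FF"))  # 255
print(int("1D", 16))              # 29, using the built-in conversion
print(format(255, "02X"))         # FF, going the other way
```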
The eight-bit system is a 'frozen accident'. Six-bit bytes had also been used, which can be viewed as numbers in the range 0 to 63, and that is why octal still has a role in computing. Octal is base-eight numbering, so that decimal 63 is octal 77. In those days IBM was dominant and they imposed the standard of eight-bit bytes. They did also talk about a new type of machine with nine-bit bytes, but I don't think it ever materialised.
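The ranges for six-bit and eight-bit bytes, and the octal form of 63, can be confirmed in the same way. The function name max_value is just an illustrative choice.

```python
def max_value(bits):
    """Largest unsigned number that fits in the given number of bits."""
    return 2 ** bits - 1

print(max_value(6), oct(max_value(6)))  # 63 0o77
print(max_value(8), hex(max_value(8)))  # 255 0xff
print(int("77", 8))                     # 63
```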
Codes which are numbers in the range 0 to 255 can have many meanings. The most commonly encountered systems are ASCII and EBCDIC. Some common characters, with their byte values shown as decimal numbers, are:
Character   ASCII   EBCDIC
Null            0        0
Space          32       64
#              35      123
-              45       96
0              48      240
9              57      249
A              65      193
Z              90      233
a              97      129
z             122      169
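These values can be reproduced with Python's built-in codecs. One assumption to flag: EBCDIC comes in several variants, and I have picked the "cp037" code page that ships with Python as a representative one; other EBCDIC code pages may place some punctuation differently.

```python
SAMPLES = ["\x00", " ", "#", "-", "0", "9", "A", "Z", "a", "z"]

def code_values(ch):
    """Return (ASCII value, EBCDIC value) for one character.

    'cp037' is an assumed stand-in for "EBCDIC" in general.
    """
    return ch.encode("ascii")[0], ch.encode("cp037")[0]

for ch in SAMPLES:
    ascii_value, ebcdic_value = code_values(ch)
    print(f"{ch!r:8}{ascii_value:6}{ebcdic_value:8}")
```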
Note that this has an influence on sorts, and also on names copied from the web. ASCII sorts digits before letters and upper case before lower case; EBCDIC sorts digits after letters and lower case before upper case.
Most PC functions sort numbers before letters, and this is the order in which screens and reports appear for LDA. Note that the mainframe does the opposite - numbers come after letters.
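To see the two orders side by side, the sketch below sorts the same names by their byte values in each code. The sample names are made up, and "cp037" is again my assumed choice of EBCDIC code page.

```python
names = ["bob", "Alice", "42nd", "Zed", "7up"]

def sort_by_code(items, encoding):
    """Sort strings by the byte values they have in the given encoding."""
    return sorted(items, key=lambda s: s.encode(encoding))

ascii_order = sort_by_code(names, "ascii")    # PC-style order
ebcdic_order = sort_by_code(names, "cp037")   # mainframe-style order

print(ascii_order)   # ['42nd', '7up', 'Alice', 'Zed', 'bob']
print(ebcdic_order)  # ['bob', 'Alice', 'Zed', '42nd', '7up']
```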
In some cases, the software is more sophisticated than that. Windows NT would sort File11 before File2, but Windows XP sorts it after. It also treats lower-case and upper-case as the same.
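I don't know how Windows actually implements this, but the effect can be imitated. The sketch below is one common way to get a 'natural', case-insensitive order: split each name into runs of text and runs of digits, and compare the digit runs as numbers. The function name natural_key is my own.

```python
import re

def natural_key(name):
    """Sort key that compares digit runs as numbers and ignores case."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

files = ["File11", "File2", "file1"]

plain_order = sorted(files)                     # character-by-character
natural_order = sorted(files, key=natural_key)  # numbers compared by value

print(plain_order)    # ['File11', 'File2', 'file1']
print(natural_order)  # ['file1', 'File2', 'File11']
```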
ASCII is also relevant to the Internet, where web addresses (URLs) use ASCII codes. Names from the web often contain sequences like %20, which is a percent sign followed by the ASCII code for space (decimal 32) expressed in hexadecimal. This happens for embedded spaces, so that a name like Autumn Sunset would become Autumn%20Sunset. Not all operating systems will allow embedded spaces in file names.
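Python's standard urllib.parse module can do this conversion in both directions, which makes it a handy way to tidy up names copied from the web. A short sketch:

```python
from urllib.parse import quote, unquote

encoded = quote("Autumn Sunset")      # space becomes %20
decoded = unquote("Autumn%20Sunset")  # and back again

print(encoded)                # Autumn%20Sunset
print(decoded)                # Autumn Sunset
print(f"%{ord(' '):02X}")     # %20, since space is decimal 32 = hex 20
```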
"Today domain names code specifications limit the permissible code points to a restricted subset of 38 signs: the letters a-z (upper and lower case alike, 26 signs), the digits 0-9, the hyphen-minus "-" (so called "LDH"), plus the label-separating period (with additional rules such as no minus at the beginning or at the end of a label)." (weblink here).
ASCII, EBCDIC & UNICODE
"ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '@' or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose. Below is the ASCII character table and this includes descriptions of the first 32 non-printing characters. ASCII was actually designed for use with teletypes and so the descriptions are somewhat obscure. If someone says they want data in ASCII format, all this means is they want 'plain' text with no formatting such as tabs, bold or underscoring - the raw format that any computer can understand. This is usually so they can easily import the file into their own applications without issues. Notepad.exe creates ASCII text, or in MS Word you can save a file as 'text only'...
"As people gradually required computers to understand additional characters and non-printing characters the ASCII set became restrictive. As with most technology, it took a while to get a single standard for these extra characters and hence there are few varying 'extended' sets.
"ASCII was very simplistic, and so was extended by adding 'extended' sets by various manufacturers. Apart from being confusing this was still restricted to 256 characters. Now computers are more widely established around the world the need to show other characters such as Japanese and Chinese languages along with various symbols became more important. Unicode is an attempt to standardise every character that anyone anywhere might need to process using a computer."
"IBM adopted EBCDIC (Extended Binary Coded Decimal Interchange Code) developed for punched cards in the early 1960s and still uses it on mainframes today. It is probably the next most well known character set due to the proliferation of IBM mainframes. It comes in at least six slightly differing forms." (weblink here).
Unicode began in the 1980s, as a standard system for Chinese characters (which are numerous, and have several different systems favoured by different countries). It was soon realised that a two-byte system could include all world alphabets as well as Chinese characters. The original idea was for two-byte characters, a 16-bit encoding for over 60,000 graphic characters. Since then, the Unicode Standard has grown beyond 16 bits. (See weblink here for more details.)
Unicode Basic Latin covers hexadecimal code points 0000 to 007F, and is identical to basic (7-bit) ASCII. But the Latin-1 Supplement, code points 0080 to 00FF, is quite different from many of the older 'extended' ASCII sets.
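Both halves of that statement can be tested. In the sketch below, code page 437 (the old IBM PC set) is my own choice of an 'extended' set to compare against; the text does not name a particular one.

```python
def same_as_ascii(code_point):
    """True if the Unicode code point encodes to the same byte value in ASCII."""
    return chr(code_point).encode("ascii") == bytes([code_point])

# Every code point from 0000 to 007F should match its ASCII byte.
basic_latin_matches = all(same_as_ascii(cp) for cp in range(0x80))

# Above 7F the answer depends on which 8-bit set is in use.
byte = bytes([0xE9])
as_latin1 = byte.decode("latin-1")  # matches Unicode code point 00E9
as_cp437 = byte.decode("cp437")     # assumed example of an 'extended' set

print(basic_latin_matches)          # True
print(hex(ord(as_latin1)))          # 0xe9
print(as_latin1 == as_cp437)        # False
```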
"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
"These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
"Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language... It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.
"Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption." (weblink here).
"Q. I understand that all Unicode characters are 16 bits... Is that correct? "A. Absolutely not! Unicode characters may be encoded at any code point from U+0000 to U+10FFFF. The size of the code unit used for expressing those code points may be 8 bits (for UTF-8), 16 bits (for UTF-16), or 32 bits (for UTF-32)." (weblink here) .
"The Unicode is the only existing, very recent, table of international character sets produced originally by printer's industry in late 1980's. The origins of Unicode are rooted in works on unified Han, a subset of Chinese, Japanese and Korean (CJK) characters, which (1) have identical internal computer code point, (2) print in Chinese, Japanese or Korean design, according to a language context, (3) may have a similar meaning or not, according to a language context." (weblink here).