C Study Guide

St. Gabriel's College


UNICODE

You've probably noticed that all of our programs have been doing input and output with only English letters (in programming we call them Latin characters). However, many programs you use in daily life allow you to do input and output with Thai letters. In this section we're going to talk about UNICODE, which is the way your computer thinks about Thai letters.

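We'll have to cover a few topics on this page: why you should learn this, ASCII, hexadecimal numbers, UNICODE itself, and how to write a UNICODE program.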

Why should I learn this?

If you do decide to become a programmer, you will have many opportunities doing Localization, translating big programs so that Thai people can use them. Big companies like Microsoft, Apple, Google and Facebook all need people who can:

Many of these companies have a hard time finding people with all three of these skills. Not many foreigners can speak Thai well enough, so the companies need Thai people to help them. Every day more websites and programs are being translated into Thai, and there are tons of jobs available. However, in order to do these kinds of jobs, you'll need to know how the computer thinks about Thai characters.

ASCII

To store Latin characters, the computer uses a system called ASCII (American Standard Code for Information Interchange). ASCII assigns every letter, digit, punctuation mark, etc. to a different number. For example, 'A' is 65, 'a' is 97, '0' is 48 and a space is 32.

You can find the complete list at asciitable.com. Standard ASCII defines 128 (2^7) characters, and extended tables use all 256 (2^8) values that fit in one byte, so a normal char only needs 8 bits to store a character. Let's look at how the computer thinks about the string "Hello World!":

"Hello World!"

Index       0    1    2    3    4    5    6    7    8    9    10   11   12
Character   'H'  'e'  'l'  'l'  'o'  ' '  'W'  'o'  'r'  'l'  'd'  '!'  '\0'
ASCII code  72   101  108  108  111  32   87   111  114  108  100  33   0

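In fact, a char is just a small integer, so we could even build a String from an array of numbers, and the program would work normally! (The array has to be char, not int: an int usually takes 4 bytes, so printf would hit a zero byte and stop almost immediately.)

char hello[] = {72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33, 0};

printf("%s", hello);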

Hexadecimal Numbers

Normally when we count, we use the decimal number system, which is base-10. This means, there are 10 digits that we use to count: 0,1,2,3,4,5,6,7,8,9.

When our computer stores data--int, char, float or anything else--it uses binary, a base-2 number system. This means, there are only 2 digits: 1 and 0.

Because binary is base-2, we need to use more digits. This means that numbers quickly become very long. For example, the number 61 is 111101 in binary. When we want to use less space, we use a number system called Hexadecimal, which is base-16. The digits are 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F. In hexadecimal, 61 is just 3D.

In the following table, we'll look at how each number system writes the same numbers.

Decimal  Binary  Hexadecimal
1        1       1
2        10      2
3        11      3
4        100     4
5        101     5
6        110     6
7        111     7
8        1000    8
9        1001    9
10       1010    A
11       1011    B
12       1100    C
13       1101    D
14       1110    E
15       1111    F
16       10000   10
17       10001   11
18       10010   12
19       10011   13
20       10100   14

Hexadecimal numbers are very important. If you continue programming, you will see them very often.

UNICODE

While ASCII has enough space for the Latin characters (and extended versions add many European accented characters), there isn't enough room for other languages. We can solve this problem with UNICODE, a huge table with room for more than a million characters, of which over 110,000 are already assigned. With UNICODE, we can write programs that input and output in languages like Chinese, Arabic, Sanskrit and Thai.

The UNICODE table is broken up into many pieces, and Thai has its own page (the numbers 0E00 to 0E7F). Each letter has a Hexadecimal number assigned to it. For example, ก is assigned 0E01, which would be 3585 in decimal numbers.

Let's look at our "Hello World!" example, but now in Thai using UNICODE numbers.

Index      0     1     2     3     4     5     6     7     8     9
Character  'ส'   'ว'   'ั'   'ส'   'ด'   'ี'   'โ'   'ล'   'ก'   '\0'
UNICODE    0E2A  0E27  0E31  0E2A  0E14  0E35  0E42  0E25  0E01  0000

Writing a UNICODE program

In order to use UNICODE in our program, we'll have to change the way we've been dealing with Strings.

We have to use a new type called wchar_t, the wide character type. These "wide" strings have to be written with an L before them. Each character is written as \uxxxx, where the x's are the UNICODE number. We have to use special "wide" input and output functions, like wprintf() and fgetws().

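wchar_t hello[1000] = L"\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35\u0E42\u0E25\u0E01";
setlocale(LC_ALL, "");   /* needs <locale.h>; uses the computer's language settings */
wprintf(L"%ls", hello);  /* %ls is the format for a wide string */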

Depending on what computer you use, writing UNICODE programs might be easy or very difficult. On Mac or Linux computers, UNICODE programs are usually easy to write. On Windows, the console will often show Thai output incorrectly, but you can still read and write UNICODE files if you set things up carefully.