This document provides an introduction to Unicode and character encoding standards. It explains that Unicode is a character set standard that supports all languages worldwide. It describes different character encoding schemes like UTF-8 and UTF-16 that are used to represent Unicode characters in binary. It highlights issues with older single-byte encodings and the benefits of adopting a Unicode encoding to support globalization.
This article is part of Lingoport.com; the original article can be found at
http://www.lingoport.com/software-internationalization-articles/unicode-primer-for-the-uninitiated/
Unicode Primer for the Uninitiated
Among our friends and clients at Lingoport, we regularly see everything from confusion to a complete lack
of awareness of what Unicode is. So for the less- or under-informed, perhaps this article will help. The
advent of Unicode is a key underpinning for global software applications and websites so that they can
support worldwide language scripts. So it’s a very important standard to be aware of, whether you’re a
localization specialist, an engineer or a business manager.
Firstly, Unicode is a character set standard used for
displaying and processing language data in computer
applications. The Unicode character set is the entire
world’s set of characters, including letters, numbers,
currencies, symbols and the like, supporting a number
of character encodings to make that all happen. Before
your eyes glaze over, let me explain what character
encoding means. You have to remember that for a
computer, all information is represented in zeros and
ones (i.e. binary values). In the ASCII standard, the letter A
looks like this: 1000001. That is, a 1, then five zeros, then
a 1, for a total of 7 bits. The number those bits stand for
(65 in decimal) is A’s code point, and the mapping between
characters and the zeros and ones that represent them is
called the character encoding. In the early days of computing, unless you
did something very special, ASCII (7 bits per character) was how your data got managed. The problem is
that ASCII doesn’t leave you enough zeros and ones to represent extended characters, like accents and
characters specific to non-English alphabets, such as you find in European languages. You certainly can’t
support the complex characters that make up Chinese, Korean and Japanese languages. These languages
require 8-bit (single-byte) or 16-bit (double-byte) character encodings. One important note is that nearly
all of these single- and double-byte encodings are supersets of 7-bit ASCII, which means that the code
points for unaccented English letters and digits stay the same regardless of the encoding.
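To make the code point and encoding ideas concrete, here is a minimal sketch in Python (my choice for illustration; the article itself is language-neutral). It shows the code point and 7-bit pattern for the letter A, then checks that A comes out as the same single byte under a handful of ASCII-compatible encodings. The helper name `ascii_bits` and the list of encodings are assumptions made for this example.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII pattern for one ASCII character."""
    code_point = ord(ch)  # the number assigned to the character, e.g. 65 for 'A'
    if code_point > 0x7F:
        raise ValueError(f"{ch!r} is outside 7-bit ASCII")
    return format(code_point, "07b")

# A few encodings (illustrative, not exhaustive) that keep ASCII
# characters at their ASCII byte values.
ASCII_COMPATIBLE = ["ascii", "latin-1", "cp1252", "shift_jis", "utf-8"]

# Encode 'A' under each one; every result should be the single byte 0x41.
a_bytes = {name: "A".encode(name) for name in ASCII_COMPATIBLE}

print(ord("A"), ascii_bits("A"))  # 65 1000001
print(a_bytes)
```

An accented letter such as é has no 7-bit pattern, so `ascii_bits` rejects it; that is the gap the 8-bit and 16-bit encodings were created to fill.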
The Bad Old Days
In the early computing days, specific single- and
double-byte character encodings were developed to
support various languages. That was very bad, as it
meant that software developers needed to build a
version of their application for every language they
wanted to support that used a different encoding.
You’d have the Japanese version, the Western
European language version, the English-only version
and so on. You’d end up with a horde of individual
software code bases, each needing its own testing,
updating and ongoing maintenance and support, which is very expensive, and pretty near impossible for
businesses to realistically support without the various language versions seriously diverging over
time. You don’t see this problem very often in newly developed applications, but there are plenty of
holdovers. We typically see it when a new client has turned over their source code to a country
partner or marketing agent that was responsible for adapting the code to multiple languages. The worst
case I saw was in 2004, when a client, whom I will leave unnamed, had a legacy product with
18 separate language versions and no longer had any real idea how functionality varied from
language to language. That’s no way to grow a corporate empire!
ISO Latin
A single-byte character set that we often see in applications is
ISO Latin 1, which shows up in closely related (though not
byte-for-byte identical) encodings such as ISO-8859-1 on UNIX,
Windows-1252 on Windows and MacRoman on guess what platform. This
character set supports characters used in Western European
languages such as French, Spanish, German, and U.K. English.
Since each character requires only a single byte, this character
set provides support for multiple languages, while avoiding the
work required to support either Unicode or a double-byte
encoding. Trouble is that still leaves out much of the world.
For example, to support Eastern European languages you need to use a different character set, often
referred to as Latin 2, which provides the characters that are uniquely needed for these languages. There
are also separate character sets for Baltic languages, Turkish, Arabic, Hebrew, and on and on. When
having to internationalize software for the first time, sometimes companies will start with just supporting
ISO Latin 1 if it meets their immediate marketing requirements and deal with the more extensive work of
supporting other languages later. The reason is that it’s likely these software applications will need major
reworking of the encoding support in their database and functions, methods and classes within their
source code to go beyond ISO Latin support, which means more time and more money – often cascading
into later releases and foregone revenues. However, if the software company has truly global ambitions,
they will need to take that plunge and provide Unicode support. I’ll argue that if companies are
supporting global customers, even without doing a bit of translation/localization for the interface, they
still need to support Unicode so they can process their customers’ global data.
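A short Python sketch shows why juggling separate single-byte character sets is painful. The same byte means different things depending on which character set you assume, and text from one language group may simply not fit in another group's set. The byte value, the sample word and the helper `can_encode` are illustrative choices, not anything prescribed by the standards.

```python
# One byte, with nothing attached to say which character set it belongs to.
raw = bytes([0xB3])

as_latin1 = raw.decode("latin-1")    # read as Western European (Latin 1)
as_latin2 = raw.decode("iso8859_2")  # read as Eastern European (Latin 2)

def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

polish_city = "\u0141\u00f3d\u017a"  # the Polish place name Lodz, with its diacritics

print(as_latin1, as_latin2)                  # two different characters
print(can_encode(polish_city, "latin-1"))    # Latin 1 lacks some of the letters
print(can_encode(polish_city, "iso8859_2"))  # Latin 2 has them all
```

With a Unicode encoding, the question of which single-byte set a given byte belongs to goes away, because every character has exactly one code point.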
Unicode
We come back to Unicode, which, as we mentioned above, is a character set created to enable support of
any written language worldwide. Now you might find a language or two lacking Unicode support for its
script, but that is becoming extremely rare. For instance, at the time of writing, Javanese, Loma, and
Tai Viet were among the scripts not yet supported. Arcane until you need them, I suppose. I remember a
few years ago when we were developing a multilingual site that needed support for Khmer and Armenian,
and we were thankful that Unicode had added support for them just a few months prior. If you have a marketing
requirement for your software to support Japanese or Chinese, think Unicode. That’s because you will
need to move to a double-byte encoding at the very least, and as soon as you go to the trouble of
doing that, you might as well support Unicode and get the added benefit of support for all languages.
UTF-8
Once you’ve chosen to support Unicode, you must decide on the specific character encoding you want to
use, which will depend on the application requirements and technologies. UTF-8 is one of the
commonly used character encodings defined within the Unicode Standard. It uses a single byte for
each character unless it needs more, in which case it can expand up to 4 bytes. People sometimes refer
to this as a variable-width encoding since the width of the character in bytes varies depending upon the
character. The advantage of this character encoding is that all English (ASCII) characters will remain as
single-bytes, saving data space. This is especially desirable for web content, since the underlying HTML
markup will remain in single-byte ASCII. In general, UNIX platforms are optimized for UTF-8 character
encoding. Concerning databases, where large amounts of application data are integral to the application,
a developer may choose a UTF-8 encoding to save space if most of the data in the database does not
need translation and so can remain in English (which requires only a single byte in UTF-8 encoding). Note
that not every database has supported UTF-8; Microsoft’s SQL Server, for instance, had no UTF-8 storage
option before SQL Server 2019.
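The variable-width behavior is easy to see in a few lines of Python. This sketch, with sample characters of my own choosing, measures how many bytes UTF-8 spends on characters from different ranges, and compares the size of a small piece of ASCII-only HTML markup in UTF-8 and UTF-16.

```python
# Sample characters, written as escapes so the code points are unambiguous:
# Latin letter A, e with acute accent, a CJK ideograph, and an emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Number of bytes UTF-8 needs for each sample character.
utf8_widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, width in utf8_widths.items():
    print(f"U+{ord(ch):04X}: {width} byte(s) in UTF-8")

# ASCII-only markup stays at one byte per character in UTF-8,
# but takes two bytes per character in UTF-16.
markup = "<p>Hello</p>"
markup_utf8_size = len(markup.encode("utf-8"))
markup_utf16_size = len(markup.encode("utf-16-le"))  # little-endian, no byte-order mark
print(markup_utf8_size, markup_utf16_size)  # 12 24
```

The trade-off runs the other way for text that is mostly East Asian: those characters take three bytes each in UTF-8 but only two in UTF-16.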
UTF-16
UTF-16 is another widely adopted encoding within the Unicode standard. It assigns two bytes for each
character whether you need them or not. So the letter A is 00000000 01000001, or 9 zeros, a one, followed
by 5 zeros and a one. If a character needs more than 2 bytes, two 16-bit units are combined into four
bytes (a surrogate pair); however, you must adapt your software to be capable of handling this
four-byte combination. Java and .NET internally process strings (text and messages) as UTF-16.
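Here is a small Python sketch of both UTF-16 cases: a character that fits in one 16-bit unit, and one that needs two units combined into four bytes (a so-called surrogate pair). The helper name `utf16_units` is hypothetical, and the emoji is simply a convenient example of a character beyond the 16-bit range.

```python
from typing import List

def utf16_units(ch: str) -> List[int]:
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    # Split the encoded bytes into 2-byte groups and read each as an integer.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# 'A' fits in a single 16-bit unit.
print([format(u, "016b") for u in utf16_units("A")])   # ['0000000001000001']

# U+1F600 lies beyond the 16-bit range, so it becomes two units (four bytes).
print([hex(u) for u in utf16_units("\U0001F600")])     # ['0xd83d', '0xde00']
```

Code that assumes one 16-bit unit always equals one character will miscount or split characters like the second one, which is the adaptation work mentioned above.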
For many applications, you can actually support multiple Unicode encodings so that for example your
data is stored in your database as UTF-8 but is handled within your code as UTF-16, or vice versa. There
are various reasons to do this, such as software limitations (different software components supporting
different Unicode encodings), storage or performance advantages, etc. But whether that’s a good idea is
one of those “it depends” kinds of questions. Implementing can be tricky and clients pay us good money
to solve this.
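In its simplest form, moving between the two encodings is a decode followed by an encode. The following Python sketch assumes UTF-8 bytes coming out of storage and a component that wants little-endian UTF-16; the function names and the sample text are made up for illustration, and a real system would also have to deal with byte-order marks, invalid input and database driver settings.

```python
def utf8_to_utf16(stored: bytes) -> bytes:
    """Re-encode UTF-8 bytes (e.g. read from storage) as UTF-16 little-endian."""
    return stored.decode("utf-8").encode("utf-16-le")

def utf16_to_utf8(in_memory: bytes) -> bytes:
    """Re-encode UTF-16 little-endian bytes as UTF-8 (e.g. before writing back)."""
    return in_memory.decode("utf-16-le").encode("utf-8")

# Sample text mixing Latin and Japanese characters (escapes keep it unambiguous).
text = "Z\u00fcrich \u6771\u4eac"

stored = text.encode("utf-8")        # what the database would hold
in_memory = utf8_to_utf16(stored)    # what a UTF-16 component would handle

print(len(stored), len(in_memory))   # 14 18
print(utf16_to_utf8(in_memory) == stored)  # True: the round trip is lossless
```

Because both encodings cover the same character set, the conversion loses nothing; the tricky parts in practice are knowing which encoding each component actually expects and converting at every boundary.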
Microsoft’s SQL Server is a bit of a special case, in that it supports UCS-2, which is like UTF-16 but
without the 4-byte characters (only the 16-bit characters are supported).
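A quick way to tell whether a piece of text would survive in a UCS-2-only store is to check that every code point fits in 16 bits. This is a minimal Python sketch of that check; the function name `fits_in_ucs2` is hypothetical, and it says nothing about how any particular database actually behaves.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every character's code point fits in a single 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)

print(fits_in_ucs2("Z\u00fcrich"))   # True: all characters are in the 16-bit range
print(fits_in_ucs2("\u4e2d\u6587"))  # True: common Chinese characters fit as well
print(fits_in_ucs2("\U0001F600"))    # False: this character needs four bytes in UTF-16
```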
GB 18030
There’s also a special-case character set, required by the Chinese government, to consider when
engineering software intended for sale in China (PRC). This character set is GB 18030, and it
covers the full Unicode repertoire, supporting both simplified and traditional Chinese. Like UTF-16, the
GB 18030 character encoding allows 4 bytes per character to support characters beyond Unicode’s “basic”
(16-bit) range, and in practice supporting UTF-16 (or UTF-8) is considered an acceptable approach to
supporting GB 18030 (the UCS-2 encoding just mentioned is not, however).
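Python ships a `gb18030` codec, so the byte widths can be checked directly. This sketch, using sample characters of my own choosing, shows a one-byte, a two-byte and a four-byte case, and confirms that text round-trips between GB 18030 and Unicode strings without loss.

```python
# Sample characters: an ASCII letter, a common Chinese character, and a
# character outside the 16-bit range (written as escapes for clarity).
samples = ["A", "\u4e2d", "\U0001F600"]

# Number of bytes the GB 18030 encoding uses for each sample.
gb18030_widths = {ch: len(ch.encode("gb18030")) for ch in samples}

for ch, width in gb18030_widths.items():
    print(f"U+{ord(ch):04X}: {width} byte(s) in GB 18030")

# Round trip: Unicode text -> GB 18030 bytes -> Unicode text.
text = "".join(samples)
round_tripped = text.encode("gb18030").decode("gb18030")
print(round_tripped == text)  # True
```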
Now, all of this considered, a converse question might be: what happens when you try to make your
application support complex scripts that need Unicode, and the support isn’t there? Depending upon your
system, you get anything from meaningless gibberish, where data or messages turn into corrupted
characters or weird square boxes, to an application crash that forces a restart. Not good.
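Each of those failure modes can be reproduced in a few lines of Python. The sketch below uses invented sample text to show garbling from decoding with the wrong character set, silent data loss when unsupported characters are replaced, and the exception that would crash an application that does not handle it. The helper `strict_encode_fails` is hypothetical.

```python
# 1. Gibberish: UTF-8 bytes read back as if they were Latin 1.
original = "caf\u00e9"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # the accented letter turns into two unrelated characters

# 2. Data loss: characters the target encoding lacks become '?'.
japanese = "\u65e5\u672c\u8a9e"
replaced = japanese.encode("ascii", errors="replace")
print(replaced)  # b'???'

# 3. Bytes that are invalid in the assumed encoding become U+FFFD, the
#    replacement character, which many fonts draw as a box or question mark.
placeholder = b"\xff".decode("utf-8", errors="replace")

# 4. Crash: with default (strict) error handling, encoding raises an exception.
def strict_encode_fails(text: str, encoding: str) -> bool:
    """Return True if strict encoding raises UnicodeEncodeError."""
    try:
        text.encode(encoding)
        return False
    except UnicodeEncodeError:
        return True

print(strict_encode_fails(japanese, "ascii"))  # True
```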
If your application supports Unicode, you are ready to take on the world.
About Lingoport
Founded in 2001, Lingoport provides extensive software localization and internationalization consulting
services. Lingoport’s Globalyzer software, a market leading software internationalization tool, helps entire
enterprises and development teams to effectively internationalize existing and newly developed source
code and to prepare their applications for localization.