6. What is Character Encoding?
The set of available characters is called a character repertoire.
The location of a given character within a repertoire is known as its code position, or code point.
The method of representing a code point in digital form is called the character encoding.
7. What is Character Encoding?
Example:
In the repertoire ISO 8859-1,
the “code point” of "Latin capital letter A" is 0x41 (“0x” indicating that 41 is hexadecimal).
The “encoding” of “A” is the single byte 01000001.
8. What is Character Encoding?
So, in general, we can say,
Character encoding is the way that letters,
digits and other symbols are expressed as
binary values that a computer can
understand.
9. Why is Character Encoding required?
Computers at their most basic level just
deal with binary numbers, 0 and 1. They
store letters, numerals and other
characters by assigning a binary number for
each one.
10. Pre-Unicode Character Repertoires
EBCDIC
ASCII (American Standard Code for Information Interchange)
US-ASCII (U.S. version, standardised as ISO 646)
ISO 8859
ISO 8859-1 (most common version for Western languages)
ISO 8859-15 (replacement of ISO 8859-1)
11. A Brief History
In the pre-Unicode environment, we had single 8-bit character repertoires, which limited us to a maximum of 256 characters. No single encoding could contain enough characters to cover all the languages.
So, hundreds of different encoding systems were developed for assigning numbers to characters.
12. A Brief History
As a result, these coding systems conflicted with each other. That is, two encodings could use the same number for two different characters, or different numbers for the same character.
Any given computer needed to support many different encodings.
Yet whenever data was passed between different encodings or platforms, that data ran the risk of corruption.
So, a solution to this problem was necessary.
13. A Brief History
And so, the solution was Unicode — a
character repertoire that contains most of
the characters used in the languages of the
world.
15. What is Unicode?
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
16. Where is Unicode Used?
The Unicode Standard has been adopted by many software and hardware vendors.
Most OSs support Unicode.
Unicode is required for international document and data interchange.
Several modern standards use Unicode, such as:
Programming languages, such as Java, C#, Perl, Python and JavaScript.
Markup languages, such as XML, HTML and XHTML.
Other standards, such as LDAP and CORBA.
17. How many languages are covered by Unicode?
The Unicode Standard encodes characters on a per script
basis.
That is, Unicode encodes scripts for languages, rather
than languages themselves.
The reason is, many scripts (especially the Latin script)
are used to write a large number of languages.
Therefore, the easiest answer is that Unicode covers all of
the languages that can be written in the Unicode
supported scripts.
19. How can Unicode cover so many characters?
As we have already seen, in the pre-Unicode environment we had single 8-bit character repertoires, which limited us to a maximum of 256 characters, i.e., code points.
Unicode code points, in contrast, are logically divided into 17 planes, each with 65,536 (= 2^16) code points, for a total of 1,114,112 code points (U+0000 to U+10FFFF).
In the Unicode standard, a plane is a contiguous group of 65,536 code points.
Planes are identified by the numbers 0 to 16, which correspond to the possible values 0x00–0x10 of the first two positions in the six-position hexadecimal format (hhhhhh).
In most circumstances only the first plane is used; it is known as the Basic Multilingual Plane, or BMP.
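As a rough illustration of the plane arithmetic, the plane of a code point can be read off by discarding its low 16 bits. The helper name plane_of is made up for this sketch and is not part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 0x10000 (65,536) code points, so the plane number
    # is the code point with its low 16 bits shifted away.
    return code_point >> 16


print(plane_of(ord("A")))   # 0  (Basic Multilingual Plane)
print(plane_of(0x1F600))    # 1  (a supplementary plane)
print(plane_of(0x10FFFF))   # 16 (the last plane)
print(17 * 2**16)           # 1114112 code points in total
```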
21. How to name a Code Point?
Normally, a Unicode code point is referred to by writing
"U+" followed by its hexadecimal number.
For code points in the Basic Multilingual Plane (BMP), four
digits are used
- e.g. U+0058 for the character Latin Capital Letter “X”.
For code points outside the BMP, five or six digits are
used, as required
- e.g. U+E0001 for the character 'LANGUAGE TAG'
- e.g. U+10FFFD for the character ‘<Plane 16 Private Use,
Last>’
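The naming convention above can be sketched in a few lines of Python. The function name u_plus is an assumption made for this example; it pads to at least four hexadecimal digits and lets longer values grow to five or six.

```python
def u_plus(ch: str) -> str:
    """Format a single character in the conventional U+hhhh notation."""
    # ord() gives the code point; 04X pads to four uppercase hex digits
    # and uses more digits automatically for code points outside the BMP.
    return f"U+{ord(ch):04X}"


print(u_plus("X"))              # U+0058
print(u_plus(chr(0xE0001)))     # U+E0001
print(u_plus(chr(0x10FFFD)))    # U+10FFFD
```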
22. Unicode and ISO
A version of Unicode that has been standardised by ISO (International Organization for Standardization) is called ISO 10646.
There are minor differences between Unicode and ISO 10646, but none that need concern us here.
23. Mapping of Unicode characters
Unicode/ISO 10646 is only a repertoire.
Therefore, we need an encoding to go with it.
Those encodings are UTF and UCS.
25. What are UTF and UCS?
Unicode defines two mapping methods:
- Unicode Transformation Format (UTF) encodings
- Universal Character Set (UCS) encodings
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence.
The ISO 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.
26. UTF vs Unicode
Say an application reads the following from the disk:
01101000 01100101 01101100 01101100 01101111
The application knows this data represents a Unicode string encoded with UTF-8 and must show this as text to the user.
27. UTF vs Unicode
The first step is to convert the binary data to numbers.
The app uses the UTF-8 algorithm to decode the data.
In this case, the decoder returns this:
0x68 0x65 0x6C 0x6C 0x6F
Since the app knows this is a Unicode string, it can assume each number represents a character’s code point.
So, it uses the Unicode character repertoire to translate each number to a corresponding character.
The resulting string is:
hello
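The same two steps can be tried out in Python, whose built-in bytes.decode performs the UTF-8 step and ord the code point lookup. This is only a sketch of the idea; the variable names are chosen for the example. The last lines add a non-ASCII character to show that the stored bytes and the code point are not always the same number.

```python
# The five bytes read from disk, written in binary as on the slide.
data = bytes([0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111])

# Step 1: UTF-8 turns the bytes into code points (wrapped in a str).
text = data.decode("utf-8")
code_points = [ord(ch) for ch in text]
print([f"0x{cp:02X}" for cp in code_points])  # ['0x68', '0x65', '0x6C', '0x6C', '0x6F']

# Step 2: the Unicode repertoire maps each code point to a character.
print(text)  # hello

# For a non-ASCII character, the bytes and the code point differ.
e_acute = "\u00e9"                  # LATIN SMALL LETTER E WITH ACUTE
print(e_acute.encode("utf-8"))      # b'\xc3\xa9' (two bytes)
print(hex(ord(e_acute)))            # 0xe9 (one code point)
```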
28. UTF vs Unicode
Therefore,
UTF is an encoding used to translate binary data into numbers and vice-versa.
Unicode is a character repertoire used to translate numbers into characters and vice-versa.
29. UTF Formats
UTF encodings include:
UTF-1 – a retired predecessor of UTF-8 that maximizes compatibility with ISO 2022; no longer part of The Unicode Standard.
UTF-7 – a 7-bit encoding sometimes used in e-mail, often considered obsolete; not part of The Unicode Standard.
UTF-EBCDIC – an 8-bit variable-width encoding similar to UTF-8, but designed for compatibility with EBCDIC; not part of The Unicode Standard.
30. UTF Formats
UTF-8 – an 8-bit variable-width encoding which maximizes compatibility with ASCII, used by 77.3% of all the websites.
UTF-16 – a 16-bit, variable-width encoding, used by less than 0.1% of all the websites.
UTF-32 – a 32-bit, fixed-width encoding, very rarely used.
31. UTF Formats
So, at present, the Unicode standard defines three encoding forms that allow the same character data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit).
These three encoding forms are called UTF-8, UTF-16 and UTF-32 respectively.
32. UTF-8
UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It encodes each of the 1,112,064 valid code points (all 1,114,112 code points except the 2,048 surrogates) using one to four 8-bit bytes (termed "octets" in the Unicode Standard).
It has the following properties:
1. Characters U+0000 to U+007F (ASCII) are encoded as a single byte 0x00 to 0x7F. This means that any ASCII text is also valid UTF-8.
2. All characters above U+007F are encoded as a sequence of two to four bytes, each of which is above 0x7F (that is, none is an ASCII byte). This makes it unambiguous whether a byte belongs to a multi-byte character or is an ASCII character.
3. The first byte of a multi-byte sequence is always in the range 0xC2 to 0xF4. All further bytes in the sequence are in the range 0x80 to 0xBF. This makes the boundaries of multi-byte characters unambiguous (the first byte also indicates how many bytes follow for the character).
4. The bytes 0xC0, 0xC1 and 0xF5 to 0xFF are never used in the UTF-8 encoding.
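To make the byte layout concrete, here is a minimal hand-written encoder for a single code point, checked against Python's built-in codec. The function name utf8_encode is hypothetical and the code is a teaching sketch; real programs should simply call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:
        # ASCII range: one byte, identical to the code point.
        return bytes([cp])
    if cp <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    # Compare with the built-in codec to confirm the sketch agrees.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

Every continuation byte produced here starts with the bits 10, so it falls in 0x80 to 0xBF, while the leading byte alone tells a decoder how long the sequence is.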
33. UTF-8
As mentioned before, UTF-8 is the most popular encoding form for Unicode.
The reason is that all ASCII characters are encoded as a single byte in UTF-8, which makes it not only backward compatible with ASCII but also space efficient for US and many European users.
In general, compared with the legacy encodings for each script, UTF-8 costs no extra space for US-ASCII, only a few percent more for ISO 8859-1 (also known as Latin-1, which covers most Western European languages), 50% more for Chinese/Japanese/Korean, and 100% more for Greek and Cyrillic.
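These ratios can be checked informally by encoding short samples both ways. The sample strings and the choice of legacy codecs (latin-1, iso8859_7, shift_jis) are assumptions made for this sketch; other legacy encodings give similar figures.

```python
# (sample text, legacy single- or double-byte codec for that script)
samples = {
    "English": ("hello", "ascii"),
    "French": ("caf\u00e9", "latin-1"),       # "cafe" with an acute accent
    "Greek": ("ΕΛΛΑΔΑ", "iso8859_7"),
    "Japanese": ("日本語", "shift_jis"),
}

sizes = {}
for name, (text, legacy) in samples.items():
    legacy_len = len(text.encode(legacy))
    utf8_len = len(text.encode("utf-8"))
    sizes[name] = (legacy_len, utf8_len)
    print(f"{name}: {legacy_len} bytes in {legacy}, {utf8_len} bytes in UTF-8")
```

For these samples the Greek text doubles in size and the Japanese text grows by half, while the English text is unchanged.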
34. UTF-16
Like UTF-8, UTF-16 is a variable-width encoding that can represent every character in the Unicode character set.
The difference is that it encodes the 1,112,064 code points in the Unicode character set using one or two 16-bit code units (two or four octets).
Properties:
The Unicode code space is divided into seventeen planes of 2^16 code points each.
The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane or BMP. UTF-16 encodes code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.
35. UTF-16
Code points from the other planes (called Supplementary Planes) are encoded in UTF-16 by pairs of 16-bit code units called a surrogate pair, by the following scheme:
1. 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
2. The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or lead surrogate, which will be in the range 0xD800..0xDBFF.
3. The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or trail surrogate, which will be in the range 0xDC00..0xDFFF.
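The scheme above translates almost line for line into code. The function name to_surrogate_pair is made up for this sketch; the result is compared with Python's own big-endian UTF-16 codec.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into lead and trail surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = cp - 0x10000                 # 20-bit number, 0..0xFFFFF
    lead = 0xD800 + (offset >> 10)        # top ten bits
    trail = 0xDC00 + (offset & 0x3FF)     # low ten bits
    return lead, trail


lead, trail = to_surrogate_pair(0x1F600)
print(f"{lead:04X} {trail:04X}")          # D83D DE00

# Cross-check against the built-in codec (big-endian, no BOM).
expected = chr(0x1F600).encode("utf-16-be")
assert expected == lead.to_bytes(2, "big") + trail.to_bytes(2, "big")
```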
36. UTF-16
The Unicode standard permanently reserves the code
point values U+D800 to U+DFFF for UTF-16 encoding of the
lead and trail surrogates, and they will never be assigned
a character, so there should be no reason to encode them.
37. UTF-16
Use in major operating systems and environments:
UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE.
UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.
Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UTF-16.
The J2SE 5.0 version of Java uses UTF-16.
38. UTF-32
UTF-32 is a character encoding that uses exactly 32 bits (four octets) per Unicode code point.
It is a fixed-length encoding, whereas all other Unicode transformation formats are variable-length.
Advantage:
The main advantage of UTF-32 over variable-length encodings is that the Unicode code points are directly indexable.
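A small sketch of what "directly indexable" means in practice: the n-th code point always sits at byte offset 4 × n. The helper code_point_at and the sample string are assumptions made for this example.

```python
# One ASCII letter, one BMP character (euro sign), one non-BMP character.
text = "a\u20ac\U0001F600"
data = text.encode("utf-32-be")     # big-endian, no BOM

# Every code point takes exactly four bytes.
print(len(text), len(data))         # 3 12


def code_point_at(buf: bytes, index: int) -> int:
    """Read the code point at the given index without scanning the buffer."""
    return int.from_bytes(buf[4 * index:4 * index + 4], "big")


print(hex(code_point_at(data, 1)))  # 0x20ac
print(hex(code_point_at(data, 2)))  # 0x1f600
```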
39. UTF-32
Disadvantage:
The main disadvantage of UTF-32 is that it is space inefficient, using four bytes per code point. Non-BMP characters are so rare in most texts that they may as well be considered non-existent for sizing purposes, making UTF-32 twice the size of UTF-16 and up to four times the size of UTF-8.
Though a fixed number of bytes per code point appears convenient, it is not as useful as it seems.
It does not make it faster to find a particular offset in the string, because an "offset" can be measured in the fixed-size code units of any encoding. Nor does it make calculating the displayed width of a string easier except in limited cases, since even with a “fixed-width” font there may be more than one code point per character position or more than one character position per code point.
For these reasons, UTF-32 is very rarely used.
40. Now, let’s see a problem
Computers speak different languages, like people!
Some write data "left-to-right" and others "right-to-left".
A machine can read its own data just fine; problems happen when one computer stores data and a different type tries to read it.
41. Then, what may be the solution?
Agree to a common format (i.e., all network traffic follows a single format), or
Always include a header that describes the format of the data. If the header appears backwards, it means data was stored in the other format and needs to be converted.
42. Which one is taken?
Practically, there's no rule that all computers must use
the same language, just like there's no rule all humans
need to. Each type of computer is internally consistent (it
can read back its own data), but there are no guarantees
about how another type of computer will interpret the
data it created.
So, the solution is to use a header.
In Unicode, this header is known as Byte Order Mark, or
BOM.
44. What is a BOM?
The byte order mark (BOM) is a Unicode character used to signal the “endianness” of a text file or stream.
It consists of the character code U+FEFF at the beginning of a data stream.
45. What does ‘endian’ mean?
Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last.
The former is called big-endian, the latter little-endian.
Therefore, it can be said that the terms endian and endianness refer to how the bytes of a data word are ordered within memory.
46. Explanation
Each byte of memory is associated with an index,
called its address, which indicates its position.
Bytes of a single data word (such as a 32 bit
integer datatype) are generally stored in
consecutive memory addresses (a 32 bit integer
needs 4 such locations). Big-endian systems are
systems in which the most significant byte of the
word is stored in the smallest address given and
the least significant byte is stored in the largest.
In contrast, little endian systems are those in
which the least significant byte is stored in the
smallest address.
47. Example
Say the data word is "0A 0B 0C 0D" (a set of 4 bytes) and the memory addresses start at a, with offsets 0, 1, 2 and 3. Then, in big-endian systems, byte 0A is placed at offset 0, 0B at 1, 0C at 2 and 0D at 3. In little-endian systems the order is reversed: 0D is placed at offset 0, 0C at 1, 0B at 2 and 0A at 3 (see the diagram).
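The example can be reproduced with Python's int.to_bytes, which lays out an integer in either byte order. This is a sketch for illustration; sys.byteorder simply reports what the machine running the code uses natively.

```python
import sys

value = 0x0A0B0C0D

big = value.to_bytes(4, "big")         # most significant byte first
little = value.to_bytes(4, "little")   # least significant byte first

print(big.hex(" "))      # 0a 0b 0c 0d
print(little.hex(" "))   # 0d 0c 0b 0a

# Reading bytes back with the wrong byte order gives a different number.
print(hex(int.from_bytes(big, "little")))   # 0xd0c0b0a

# Native byte order of this machine: 'little' or 'big'.
print(sys.byteorder)
```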
48. Where is a BOM useful?
A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format.
The BOM character may also indicate which of the several Unicode representations the text is encoded in.
49. When is a BOM used?
BOM use is optional.
The Unicode Standard permits the BOM in UTF-8, but does
not require or recommend its use.
In UTF-16, a BOM may be placed as the first character of a
file or character stream to indicate the endianness (byte
order) of all the 16-bit code units of the file or stream.
Although a BOM could be used with UTF-32, this encoding
is rarely used for transmission. Otherwise the same rules
as for UTF-16 are applicable.
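A reader that does not know the encoding in advance can look for the encoded form of U+FEFF at the start of the data. The sketch below uses the BOM constants from Python's standard codecs module; the function name sniff_bom and its return format are assumptions made for this example.

```python
import codecs

# UTF-32 must be tested before UTF-16, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# A big-endian UTF-16 stream that starts with a BOM.
stream = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
encoding, skip = sniff_bom(stream)
print(encoding)                          # utf-16-be
print(stream[skip:].decode(encoding))    # hi
```

Data without a BOM returns (None, 0), in which case the encoding has to be known some other way.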