Unicode 101

Bouvet søkekollokvie, 2011-10-26
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

1

Agenda

• Scripts and languages
– background for what follows
• Character encodings
– basic concepts
• Unicode
– it's bigger than you think
• Programming
– some practical lessons

2

Scripts and languages

A quick intro to grammatology

3

The beginning of writing

Chinese writing

Egyptian hieroglyphics

Sumerian cuneiform

4000 BCE 3000 BCE 2000 BCE 1000 BCE

4

Logographic scripts

• All of these first scripts are logographic
– sometimes called “ideographic”, too
• One character – one word
– 山 = mountain
– 口 = mouth / door
– 門 = gate
• Compounds for complicated concepts
– 問 = question (mouth inside gate)
– 酒店 = hotel (literally: alcohol shop)
– 日本 = Japan (literally: sun root)
5

Simplified Chinese

• In the 1950s and 60s the Communist Chinese government simplified the shapes of many characters
– these simplified characters are used in mainland China
– Taiwan and Hong Kong still use the traditional shapes
– Japan introduced its own, more limited simplifications
Meaning   Traditional   Simplified
Gate      門            门
Country   國            国
Vehicle   車            车
East      東            东

6

Logographic scripts (2)

Good
• Iconic and striking
• Compact
• Language-independent (to some degree)

Bad
• Hard to learn
• Hard to write on computers
• Works poorly for inflected languages
• Sorting is hard

7

The next step: abjads

• The Egyptians later developed a script called
“hieratic”
– meaning “priestly writing”
– oldest known example from 1600 BCE, but must be
older
– actual origin kind of obscure
• It is an abjad
– that is, an alphabet with only consonant signs
– (and some logographic elements)
• A precursor to our own alphabet
8

Why only consonants?

• In Semitic languages everything revolves around
“word roots”
– these consist of three consonants
– many, many words can be derived from the same root in a
systematic way
– (the same applies to Egyptian)
• Example
– s l m = peace
– salaam = peace (related to Hebrew shalom)
– islam = to have peace
– moslem = one who has peace
• Abjads therefore work well for Semites
– and not quite as well for others

9

Abjad family tree
Much simplified. This family has many, many more scripts.

Hieratic
  Proto-Sinaitic
    Ethiopic
    Phoenician
      Greek
      Aramaic
        Kharoshthi
        Arabic
        Hebrew

10

The invention of the alphabet

• The Greeks didn't much like not having signs for vowels
– so they invented them, thus giving us the alphabet
• Salaam
– in Arabic: سلم (m l s, read right to left)
– in Greek: σαλαμ (s a l a m)
• Grammatologists consider only scripts with
both consonant and vowel signs alphabets
– thus, there is no “Arabic alphabet” or “Chinese alphabet”

11

Alphabet family tree
Again much simplified.

Greek
  Etruscan
    Latin
  Armenian
  Cyrillic
  Georgian

12

ABG?

• The Etruscans had no g sound, so they
pronounced the third character k
– the Romans inherited this
• Thus, in Latin “C” was used for both k and g
– this left Spurius Carvilius Ruga with a bit of a
problem
– he had to write his surname as “Ruca”
– (Latin “ruca”: to fart)
– so he invented “G” (C with a stroke)

14

Backtracking
Hieratic
  Proto-Sinaitic
    Ethiopic
    Phoenician
      Greek
      Aramaic
        Kharoshthi
        Arabic
        Hebrew

15

Kharoshthi

• A script developed for writing Sanskrit in India
– 4th century BCE, or thereabouts
• Sanskrit is Indo-European, like Greek
– hence, the absence of vowels is a major problem
– however, the Indians came up with a different
solution
• This is an abugida
– Ethiopic, too

16

The descendants of Kharoshthi

• The Indian subcontinent has a large number of
scripts
– all are descended from Kharoshthi and follow the
same basic model
• These scripts also spread into Indo-China and
beyond to Indonesia
– Thai, Khmer, Javanese, Tibetan, etc etc
• Total number of users must approach 2 billion

17

Back to China
Chinese

Japanese Korean Vietnamese

• Japanese is an inflected language
– therefore developed extra characters for writing grammatical
endings
• The Vietnamese later moved on to Latin characters under
French influence
• The Koreans did something completely different...
• Leaving out a number of minor ethnic scripts here

18

Hangul

• On October 9, 1446, King Sejong announced the introduction of a new script: Hangul
– it was designed to be easier to learn than Chinese characters
– it was also designed to be a better fit to the Korean language
• Hangul is like an alphabet, but better
– Wikipedia: Numerous linguists have praised Hangul for its featural design, describing it as "remarkable", "the most perfect phonetic system devised", and "brilliant, so deliberately does it fit the language like a glove."

19

Hangul design

• Consonants have different shapes from vowels
– these follow the position of the speech organs (tongue, lips, throat) when pronouncing the sound
• Vowels have a different system
– built from a dot, a horizontal line, and a vertical line
Follows Chinese convention of equal-sized
boxes for all characters. Each box contains
hangul letters for one syllable.

20

Syllabaries

• Like abugidas, but unsystematic
– that is, every letter combination must be learned by
rote

21

1800 – present

• Cover this?

22

Kinds of scripts

Kind of script        What basic shapes correspond to
featural              feature
alphabet, abjad       letter
abugida, syllabary    syllable
logographic           word

23

Text directions

• LTR top-down
– Latin, Greek, Cyrillic, ...
• RTL top-down
– Arabic, Hebrew, Syriac, ...
• Top to bottom, left to right
– Mongolian, Uighur, Buryat, ...
• Top to bottom, right to left
– Nushu, Rong
• Bottom up, left to right
– Some minor Indonesian abugidas
• Bottom up, right to left
– one minor Moslem Chinese abjad
• Upwards boustrophedon
– Ogham (ancient Irish runes)

24

Combining characters

• In Arabic, the same letter can have up to four
different shapes
– depending on its position in the word

25

Character encoding

Basics

26

Two key concepts

• Character set
– a function number -> character
– usually with a limited, fixed number of characters
• Character encoding
– a mapping from a bit stream to a sequence of
numbers
– the numbers, of course, refer to characters in some
character set
• Example
– UTF-8 is an encoding for Unicode
– UTF-16 is another
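
A minimal sketch of the distinction, not taken from the slides: one character (å, number U+00E5 in Unicode) turned into bytes by three different encodings. The class name `EncodingDemo` and the `hex` helper are made up for illustration. Note that ISO 8859-1 is a separate character set that happens to give å the same number.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Formats bytes as space-separated hex, e.g. "C3 A5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00E5"; // the single character U+00E5
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A5
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 E5
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E5
    }
}
```

The number stays the same in all three cases; only the byte representation changes.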

27

Kinds of encodings

• Single-byte
– each byte is a number 0-255. end of story
– ISO 8859-x
• Double-byte
– each word (2 bytes) is a number 0-65535. end of story
– UCS-2
• Variable length
– more complex rules (UTF-8)
• Escape code-based
– uses escape codes to change between different modes
– ISO 2022

28

ASCII

• The mother of all character sets
– 7-bit
• Nearly all character sets today are ASCII supersets
• The exception is EBCDIC
– mostly used by IBM mainframes
– also a terrible design

29

Code pages

• Primitive character set solution on IBM PCs
• Basically, changing the code page would
change the system font
– 65 would always be 'A'
– 216 could be 'Ø', Cyrillic, Greek, ... depending on code page
• Essentially, swapping code page would change
the contents of all text files...
– awful for text processing software

30

ISO 8859-x series

• Lower 128 characters are ASCII
• Higher 128 characters are language-specific
• Now obsolete, thanks to Unicode
• Microsoft has its own extensions
– Windows-12xx, which add extra characters where 8859 has obsolete control codes

 1 Western Europe    5 Cyrillic    9 Turkish          13 Baltic
 2 Central Europe    6 Arabic     10 Nordic (Sami)    14 Celtic
 3 South Europe      7 Greek      11 Thai             15 Latin-1 ++
 4 North Europe      8 Hebrew     12 Doesn't exist    16 South-Eastern Europe

31

The Far East

• Generally, one character set per country
– Japan JIS X 0208
– Korea KS X 1001 (and 1003)
– China GB 2312
– Taiwan CNS 11643
• Combined with different character encodings
– ISO 2022 (-JP, -KR, -CN)
– EUC (-JP, -KR, -TW)
• Additional variants
– Shift-JIS (Japanese, from Microsoft)
– Big5 (Taiwanese)
– ...
32

China

• Doesn't want to use Unicode
• Instead introduced GB 18030
– takes GB 2312, then adds Unicode after the GB 2312
part
– requires a mapping table for lower part
– higher part can be mapped to Unicode
algorithmically

33

VISCII
• Vietnamese character set
– tries hard to maintain ASCII compatibility, but there
are just too many Vietnamese characters...

34

Unicode

• The character set to end all character sets
• Before, there was at least one character set for
each script
– generally, it would have Latin plus one more script
– software therefore had to support many different
internal representations of text
• Now, Unicode supports every character that's ever appeared in a character set anywhere
– therefore, it's the only character set you need

36

Origin
• 1987
– Engineers at Xerox and Apple discuss the possibility of a universal
character set
– they investigate, and decide it's feasible
• 1988
– tentative proposal for a 16-bit character set
• 1989
– Unicode Working Group set up
– all of ISO 8859 added
• 1990
– many more people join
– Chinese characters added
• 1991
– Unicode Consortium founded
– Unicode 1.0 released
• 1992
– the original ISO 10646 design is dropped, and the standard is aligned with Unicode
37

Design goals

• Universal
– should be the only character set ever needed
• Semantics
– characters should have well-defined semantics
– Ø≠∅
• Dynamic composition
– characters can be composed dynamically
• Convertibility
– every character in an existing character set must have a single corresponding character in Unicode
38

Structure

• Originally intended to be 16-bit
– 0x0000 – 0xFFFF
– explicit rationale: enough to encode all characters in
daily use
– implicit: not excessive use of space
• Unfortunately, this is not nearly enough
– decided to expand it in 1996
– keep the 16-bit structure
– original range becomes Basic Multilingual Plane
– each stretch of 0x10000 (65,536) code points is a plane
– 17 planes (0-16) in all
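
As a rough illustration of the plane arithmetic (the class and method names are invented for this sketch): since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits, and the highest code point is 0x10FFFF.

```java
class Planes {
    // Returns the plane (0-16) that a Unicode code point belongs to.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16; // each plane spans 0x10000 code points
    }

    public static void main(String[] args) {
        System.out.println(planeOf('A'));      // 0, the BMP
        System.out.println(planeOf(0x1D11E));  // 1
        System.out.println(planeOf(0x20000));  // 2
        System.out.println(planeOf(Character.MAX_CODE_POINT)); // 16
        System.out.println(17 * 0x10000);      // 1114112 code points in total
    }
}
```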

39

Contents

• As of Unicode 6.0
– more than 109,000 characters
– 93 different scripts

40

The planes

• Plane 0: BMP
– Nearly the only one needed
• Plane 1: Supplementary Multilingual Plane
– mostly historical scripts and weird symbols
• Plane 2: Supplementary Ideographic Plane
– historical Chinese characters
• Plane 3: Tertiary Ideographic Plane
– not in use, reserved for ancient Chinese characters
• Planes 4-13: Unused
• Plane 14: Supplementary Special-purpose Plane
• Planes 15-16: Private Use Area
41

Basic Multilingual Plane
[Figure: block map of the Basic Multilingual Plane, columns 0xxx to Fxxx, rows of 0x200 code points. Labeled areas: Latin, scripts, symbols, Braille, syllabaries, CJK, Yi, Hangul, surrogates, private use, misc.]

42

Plane 1
[Figure: block map of Plane 1, columns 0xxx to Fxxx, rows of 0x200 code points. Labeled areas: LTR, RTL, Indic, Near East, African, North American, Large Asian, hieroglyphics, Sumerian, notational, conlang, undeciphered.]

43

Encodings

• UCS-2
– from ISO 10646
– two bytes per character
– can only encode the BMP
• UCS-4
– also from ISO 10646
– four bytes per character
– can encode the whole thing
• UTF-32
– same as UCS-4
44

UTF-16

• Like UCS-2
– but extended with a trick to cover the full set
• Surrogates
– a block of code points set aside specifically for UTF-16
• Each non-BMP character is written as two surrogates, a high one followed by a low one
– first subtract 0x10000 from the code point, leaving a 20-bit number
– top 10 bits (0-0x3FF) added to 0xD800 = high surrogate (first two bytes)
– bottom 10 bits added to 0xDC00 = low surrogate (next two bytes)
• So, the two code units become:
– 1101 10xx xxxx xxxx 1101 11xx xxxx xxxx
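
The surrogate arithmetic can be sketched as follows. This is an illustrative helper (the name `toSurrogates` is made up); in real code `Character.toChars` does the same job.

```java
class Surrogates {
    // Splits a non-BMP code point into a high and a low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D11E); // MUSICAL SYMBOL G CLEF
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
    }
}
```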
45

UTF-8

• Cleverly designed variable-length encoding
• ASCII is encoded as ASCII
• Can encode all of Unicode in at most 4 bytes per character
– whether more or less compact than UTF-16 depends
on the text being coded
– for files UTF-8 is usually far more compact
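
The slides do not spell out the bit layout, so the following is a hand-written sketch of the standard UTF-8 scheme (1 to 4 bytes, with the lead byte announcing the length). The class name `Utf8` is invented here; production code would just call `String.getBytes(StandardCharsets.UTF_8)`.

```java
class Utf8 {
    // Encodes one Unicode code point as UTF-8.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("cannot encode: " + cp);
        }
        if (cp < 0x80) {           // 0xxxxxxx (plain ASCII)
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {          // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {        // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 'A', 0xE5, 0x20AC, 0x1D11E }) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, encode(cp).length);
        }
    }
}
```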

46

UTF-8

• Far and away most used Unicode encoding
– because of compatibility with ASCII
• Easy to recognize bit patterns
– Lett Ã¥ kjenne igjen = “Lett å kjenne igjen” (Norwegian: “easy to recognize”) in UTF-8, interpreted as 8859-1
– æøå = Ã¦Ã¸Ã¥
– ÆØÅ = Ã†Ã˜Ã…
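
The garbling is easy to reproduce. This is a small sketch with invented names (`Mojibake`, `misread`, `repair`): it encodes text as UTF-8 and decodes the bytes as ISO 8859-1. As long as nothing was lost along the way, reversing the two steps restores the text.

```java
import java.nio.charset.StandardCharsets;

class Mojibake {
    // Decodes UTF-8 bytes with the wrong charset (ISO 8859-1).
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undoes misread(): back to bytes via ISO 8859-1, then decode as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E6\u00F8\u00E5"; // the three Norwegian letters ae, o-slash, a-ring
        String garbled = misread(original);
        System.out.println(garbled.length());                  // 6: every letter became two characters
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```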

47

Reading UTF-8 as UTF-16

• Two bytes get treated as a single character
– effectively turns it into a random character

[Figure: the same text displayed as UTF-8 and as UTF-16]
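
A quick way to see this effect, sketched with an invented class name and assuming big-endian UTF-16: the two ASCII bytes 0x61 0x62 ("ab") are read as the single code unit 0x6162, which lands in the CJK range.

```java
import java.nio.charset.StandardCharsets;

class Utf8AsUtf16 {
    // Decodes the UTF-8 bytes of s as if they were big-endian UTF-16.
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String garbled = misread("ab");
        System.out.println(garbled.length());                    // 1
        System.out.printf("U+%04X%n", (int) garbled.charAt(0));  // U+6162
    }
}
```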
48

Han unification

• Not only are there many Chinese characters
– there are different variants of each character
– differences between China (traditional & simplified),
Japan, and Korea (also Vietnam)
• Unicode has decided to encode these only once
– different renderings are considered visual differences
only
• This is quite unpopular in the Far East
– particularly in Japan

49

Too many characters

• Latin characters have a nearly infinite number
of variants:
a á à â
ã ä å ā
ą ă ậ ẩ
ạ aː ȧ ...

• New ones pop up all the time
– these can't all be encoded

50

Solution: combining characters

• What if I needed Z with stroke, cedilla, and
umlaut?
• Simple, encode as
– U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
– U+0327 COMBINING CEDILLA
– U+0308 COMBINING DIAERESIS
• Norwegian å should actually be written
– a + U+030A COMBINING RING ABOVE

51

Many ways to write a character

• Unicode has inherited precomposed characters
(like å) from older character sets
– these are all included, for ease of roundtripping
• Unicode normalization provides ways of
streamlining this
– unfortunately, it's complex, with numerous different normalization forms
– won't go further into it
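
Without going into the details of the forms, here is a minimal sketch of what normalization buys you, using the JDK's `java.text.Normalizer`. The class and method names are made up for illustration. The precomposed å and the sequence a + combining ring are different strings until both are normalized to the same form.

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Compares two strings after normalizing both to NFC (composed form).
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00E5";    // precomposed a-ring
        String decomposed = "a\u030A"; // a + COMBINING RING ABOVE
        System.out.println(composed.equals(decomposed));   // false
        System.out.println(sameText(composed, decomposed)); // true
        // NFD goes the other way and splits the precomposed character
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD).length()); // 2
    }
}
```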

52

Unicode Character Database

• Unicode contains more than just the characters
– there is a whole database of characters with many fields
• It contains things like
– names for each character
– decomposition mappings
– deprecation mappings
– case mappings
– breakdown into blocks
– category for each character
– numeric value
– what script the character belongs to
– ...
53

Uses for UCD

• Matching in regular expressions
– by character category
– by script
– ...
• Upper- and lower-casing of strings
– beware, this is complicated...
• Stripping accents
– use decomposition mappings in UCD
• ...
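
A few of these uses, sketched in Java with invented class and method names. The regex properties (`\p{Lu}`, `\p{IsGreek}`, `\p{M}`) and `Character.getName` are standard JDK features backed by the UCD. The accent stripper is deliberately naive: it only removes combining marks, so letters like ø, which have no decomposition, pass through unchanged.

```java
import java.text.Normalizer;

class UcdDemo {
    // Strips accents: decompose (NFD), then drop all combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // character name lookup
        System.out.println(Character.getName(0x00E5)); // LATIN SMALL LETTER A WITH RING ABOVE
        // matching by general category (Lu = uppercase letter)
        System.out.println("\u00C5".matches("\\p{Lu}"));      // true
        // matching by script
        System.out.println("\u03A9".matches("\\p{IsGreek}")); // true
        // accent stripping via decomposition mappings
        System.out.println(stripAccents("\u00E5\u00E9\u00F1")); // aen
    }
}
```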

54

More stuff in the Unicode standard

• Guidance on upper/lower-casing
– tricky, because there are national variations
• Unicode Normalization
• Sorting algorithm
• Regular expression guidelines
• Line breaking algorithm
• Bidirectional text display (Arabic)
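
To give one concrete case of the national variations in casing, here is a small sketch (class and method names are illustrative): Turkish has a dotted capital İ, and German ß has traditionally upper-cased to two letters, so both the result and its length depend on the language.

```java
import java.util.Locale;

class CaseDemo {
    // Upper-cases s using the casing rules of the given language tag.
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        System.out.println(upper("i", "en").equals("I"));       // true
        System.out.println(upper("i", "tr").equals("\u0130"));  // true: capital I with dot above
        System.out.println(upper("\u00DF", "de"));              // SS: one character becomes two
    }
}
```

This is why calling `toUpperCase()` without an explicit locale can behave differently depending on the machine's settings.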

55

Programming

Dealing with characters

56

The key principle

• One internal representation for text
– used everywhere, with no deviations
– text from outside must always be converted
• Modern programming languages enforce this
– char/String vs byte (in Java)
– Stream vs Reader/Writer (also Java)
• Older languages do not
– C char has no defined representation

57

In an ideal world

• The internal encoding of strings should not be
visible
– they should just be sequences of Unicode characters
• In practice, this turns out to be difficult
– string.charAt(): what should this return?
– string.length(): what should this return?
– etc

58

Internal encodings

• C: strings are byte arrays
• C++: bytes, UTF-16, or UTF-32
• Java: UTF-16
• .NET: UTF-16
• Python: UTF-16 or UTF-32, depending on how the interpreter was built
• Ruby: each string carries its own encoding (since 1.9)
• JavaScript: UTF-16 or UCS-2 (poorly defined)
• PHP: strings are byte arrays 1)
• Perl: UTF-8
1) http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#which-charset-encoding-do-strings-have-in-php

59

Java

• String.charAt(int ix)
– returns the UTF-16 code unit (word) at that index
• String.codePointAt(int ix)
– like charAt, but if it's a high surrogate, returns code point by combining with charAt(ix + 1)
• String.length()
– returns the number of UTF-16 code units in the string
representation
Few programming languages document the
behaviour of the String class well enough for
this to be clear...
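
A short sketch of the difference (the class name and the `codePointLength` helper are made up): a string containing one character outside the BMP, inspected with the methods above.

```java
class CodeUnits {
    // Number of Unicode code points, as opposed to UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E MUSICAL SYMBOL G CLEF, which is outside the BMP
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(s.length());                  // 3 code units
        System.out.println(codePointLength(s));          // 2 code points
        System.out.printf("%04X%n", (int) s.charAt(1));  // D834, a high surrogate
        System.out.printf("%X%n", s.codePointAt(1));     // 1D11E, the full code point
        System.out.printf("%X%n", s.codePointAt(2));     // DD1E, the low surrogate on its own
        // iterating by code point instead of by index
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```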

60

How long is a string?

String str = "\u01B5\u0327\u0308";
System.out.println(str.length());

Output is 3, even though there is just a single,
combined character.
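
If what you want is the number of characters as a user would see them, one option in the JDK is `java.text.BreakIterator`. The following is a sketch with an invented class name; how well it handles newer sequences (emoji, for instance) depends on the JDK version, so treat it as an approximation.

```java
import java.text.BreakIterator;

class Graphemes {
    // Counts user-perceived characters by walking character boundaries.
    static int count(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String str = "\u01B5\u0327\u0308"; // Z with stroke + cedilla + diaeresis
        System.out.println(str.length()); // 3
        System.out.println(count(str));   // 1
    }
}
```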

61

Find the bug
Q: What encoding are we reading?
A: We have no idea.

import java.io.*;

public class Cat {

    public static void main(String[] argv)
        throws IOException {
        // FileReader decodes with the platform default encoding,
        // whatever that happens to be on this machine
        BufferedReader in =
            new BufferedReader(new FileReader(argv[0]));
        String line = in.readLine();
        while (line != null) {
            System.out.println(line);
            line = in.readLine();
        }
    }
}

62

How to solve

• Either find some way to auto-detect encoding
– requires you to know the syntax of the file
– and that syntax to have auto-detect rules
• Or find some way for the user to specify the
encoding
– for example a command-line parameter
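
A sketch of the second option, where the user names the encoding on the command line. This is one possible rewrite of the Cat example, not the author's; the class name, the `readLines` helper and the argument order are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class CatWithEncoding {
    // Reads all lines from the stream, decoding with an explicit charset.
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] argv) throws IOException {
        if (argv.length != 2) {
            System.err.println("usage: CatWithEncoding <file> <encoding>");
            return;
        }
        Charset charset = Charset.forName(argv[1]); // e.g. UTF-8 or ISO-8859-1
        try (InputStream in = new FileInputStream(argv[0])) {
            for (String line : readLines(in, charset)) {
                System.out.println(line);
            }
        }
    }
}
```

The same problem remains on the output side: `System.out` still encodes with the platform default.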

63

Find the bug, 2
Q: How do we know this is UTF-8?
A: We don’t.

public HttpResponse get(String request) throws IOException {
    InputStream responseContent = null;
    HttpGet httpGet = new HttpGet(request);
    HttpResponse response =
        new DefaultHttpClient().execute(httpGet);
    responseContent = response.getEntity().getContent();
    // the encoding is hard-coded here, regardless of what the server sent
    return new HttpResponse(response.getStatusCode(),
        response.getStatusLine().getReasonPhrase(),
        read(new InputStreamReader(responseContent, "UTF-8")));
}

64

How to solve

• Need to look at the Content-Type header
– “Content-Type: text/plain; charset=iso-8859-1”
• Occasionally, it can be even harder
– MIME-type specific rules for deciding charset if not specified in the response
• Best solution (for a generic Get class) is to
return the stream (not the reader)
– and provide enough info for clients to figure out the
encoding
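
Extracting the charset from a Content-Type value can be sketched roughly like this. The helper is hypothetical and deliberately naive: it splits on semicolons, so it would mishandle quoted parameter values containing one, and it knows nothing about the MIME-type specific fallback rules mentioned above. The caller has to supply the fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or the fallback.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the MIME type
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0 || !param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1); // strip quotes
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) { // unknown or malformed name
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=iso-8859-1",
                                     StandardCharsets.UTF_8)); // ISO-8859-1
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8)); // UTF-8
    }
}
```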

65

URI vs IRI

• Originally, the character encoding of URIs was
not defined
– characters must be ASCII, or %-escaped
– however, character set of %-escapes not defined
– this was RFC 2396
• Then, RFC 3986 defined %-escapes as being
UTF-8
– explicit characters must still be ASCII only
• RFC 3987 introduced IRIs
– here, everything is UTF-8
– non-ASCII characters do not need to be escaped
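
In Java, `java.net.URI` can be used to go from the IRI-style form to the escaped URI form. A small sketch, with an invented class name and a made-up example address: the multi-argument constructor accepts non-ASCII characters, and `toASCIIString()` turns them into UTF-8 based %-escapes.

```java
import java.net.URI;
import java.net.URISyntaxException;

class IriToUri {
    // Builds a URI from its parts and returns the ASCII-only form.
    static String toUri(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // the path contains a-ring and ae, written as escapes in the source
        System.out.println(toUri("http", "example.org", "/bl\u00E5b\u00E6r"));
        // http://example.org/bl%C3%A5b%C3%A6r
    }
}
```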
66

How to distinguish IRIs and URIs

• Well, uh, you can't, really...
– if there are non-ASCII characters it's probably an IRI
• Specifications can decide to support IRIs
– for example, in XTM it's all IRIs
• So the context can tell you, in some cases

If it sounds like a mess, that’s because it is...

67

XML and Unicode

• One of the good things about XML is that it
gets Unicode right
– text coming out of an XML parser is always Unicode
– unless the author has made a stupid mistake, there
will be no character encoding problems
• Does this via
– syntax for declaring encoding in the document,
– careful, explicit rules for detecting encoding,
– escape syntax for Unicode characters, and
– well-designed rules for Unicode characters in
identifiers and elsewhere
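
To illustrate the first point, here is a sketch using the JDK's built-in DOM parser (the class and method names are made up): the same one-character document in three byte-level forms. The parser works out the encoding itself and hands back the same Unicode string each time.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlEncodingDemo {
    // Parses the bytes as XML and returns the text of the root element.
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // declared encoding: ISO 8859-1
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00E5</a>"
            .getBytes(StandardCharsets.ISO_8859_1);
        // no declaration: the parser assumes UTF-8
        byte[] utf8 = "<a>\u00E5</a>".getBytes(StandardCharsets.UTF_8);
        // pure ASCII with a numeric character reference
        byte[] ascii = "<a>&#xE5;</a>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(rootText(latin1).equals(rootText(utf8))); // true
        System.out.println(rootText(utf8).equals(rootText(ascii)));  // true
    }
}
```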
68

Representing characters in XML

• Don't ever use entity references
– &aring; and similar are the spawn of the devil
– they require the DOCTYPE to be downloaded
• Use UTF-8 as the character encoding
– and declare it in the XML declaration (<?xml ... ?>), or omit it entirely, since UTF-8 is the default
– this way all characters can be expressed directly
• If you absolutely must use stupid tricks, use &#XXXX; character references
– it's better to avoid these, but in special cases (human authoring) they can be useful

69
