Unicode 101

Bouvet søkekollokvie, 2011-10-26
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga




1
Agenda

• Scripts and languages
    – background for what follows
• Character encodings
    – basic concepts
• Unicode
    – it's bigger than you think
• Programming
    – some practical lessons



2
Scripts and languages

A quick intro to grammatology




3
The beginning of writing


                                           Chinese writing



            Egyptian hieroglyphics




           Sumerian cuneiform



4000 BCE                 3000 BCE    2000 BCE                1000 BCE


   4
Logographic scripts

• All of these first scripts are logographic
    – sometimes called “ideographic”, too
• One character – one word
    –   = mountain
    –   = mouth / door
    – 門 = gate
• Compounds for complicated concepts
    – 問 = question (mouth inside gate)
    – 酒店 = hotel (literally: alcohol shop)
    – 日本 = Japan (literally: sun root)
5
Simplified Chinese

• In the 1950s and 60s the Communist Chinese
  government simplified the shapes of many
  characters
    – these modified characters are used in mainland
      China
    – Japan and Taiwan still use the original shapes
         Character     Traditional   Simplified
         Gate          門             门
         Country       國             国
         Vehicle       車             车
         East          東             东


6
Logographic scripts (2)

Good
• Iconic and striking
• Compact
• Language-independent (to some degree)

Bad
• Hard to learn
• Hard to write on computers
• Works poorly for inflected languages
• Sorting is hard




7
The next step: abjads

• The Egyptians later developed a script called
  “hieratic”
    – meaning “priestly writing”
    – oldest known example from 1600 BCE, but must be
      older
    – actual origin kind of obscure
• It is an abjad
    – that is, an alphabet with only consonant signs
    – (and some logographic elements)
• A precursor to our own alphabet
8
Why only consonants?

• In Semitic languages everything revolves around
  “word roots”
    – these consist of three consonants
    – many, many words can be derived from the same root in a
      systematic way
    – (the same applies to Egyptian)
• Example
    –   s l m = peace
    –   salaam = peace (related to Hebrew shalom)
    –   islam = to have peace
    –   moslem = one who has peace
• Abjads therefore work well for Semites
    – and not quite as well for others

9
Abjad family tree
Much simplified. This family has many, many more scripts.

Hieratic
    Proto-Sinaitic
        Ethiopic
        Phoenician
            Greek
            Aramaic
                Kharoshthi
                Arabic
                Hebrew

10
The invention of the alphabet

• The Greeks didn't much like not having signs
  for vowels
     – so they invented them, thus giving us the alphabet
• Salaam
     – in Arabic:    (m l s)
     – in Greek: σαλαμ (s a l a m)
• Grammatologists consider only scripts with
  both consonant and vowel signs to be alphabets
     – thus, there is no “Arabic alphabet” or “Chinese
       alphabet”

11
Alphabet family tree
Again much simplified.

Greek
    Etruscan
        Latin
    Armenian
    Cyrillic
    Georgian




12
Alphabetic order




13
ABG?

• The Etruscans had no g sound, so they
  pronounced the third character k
     – the Romans inherited this
• Thus, in Latin “C” was used for both k and g
     – this left Spurius Carvilius Ruga with a bit of a
       problem
     – he had to write his surname as “Ruca”
     – (Latin “ruca”: to fart)
     – so he invented “G” (C with a stroke)


14
Backtracking
Hieratic
    Proto-Sinaitic
        Ethiopic
        Phoenician
            Greek
            Aramaic
                Kharoshthi
                Arabic
                Hebrew

15
Kharoshthi

• A script developed for writing Sanskrit in India
     – 4th century BCE, or thereabouts
• Sanskrit is Indo-European, like Greek
     – hence, the absence of vowels is a major problem
     – however, the Indians came up with a different
       solution
• This is an abugida
     – Ethiopic, too




16
The descendants of Kharoshthi

• The Indian subcontinent has a large number of
  scripts
     – all are descended from Kharoshthi and follow the
       same basic model
• These scripts also spread into Indo-China and
  beyond to Indonesia
     – Thai, Khmer, Javanese, Tibetan, etc etc
• Total number of users must approach 2 billion



17
Back to China
                                 Chinese



                      Japanese     Korean      Vietnamese



• Japanese is an inflected language
     – therefore developed extra characters for writing grammatical
       endings
• The Vietnamese later moved on to Latin characters under
  French influence
• The Koreans did something completely different...
• Leaving out a number of minor ethnic scripts here

18
Hangul

• On October 9, 1446, King Sejong announced the
  introduction of a new script: Hangul
     – it was designed to be easier to learn than Chinese
       characters
     – it was also designed to be a better fit to the Korean
       language
• Hangul is like an alphabet, but better
     – Wikipedia: Numerous linguists have praised Hangul
       for its featural design, describing it as “remarkable”,
       “the most perfect phonetic system devised”, and
       “brilliant, so deliberately does it fit the language like a
       glove.”

19
Hangul design

• Consonants have different shapes from vowels
     – the basic ones follow the shape of the speech
       organs when pronouncing the consonant
• Vowels have a different system



                                 Follows Chinese convention of equal-sized
                                 boxes for all characters. Each box contains
                                 hangul letters for one syllable.



20
Syllabaries

• Like abugidas, but unsystematic
     – that is, every letter combination must be learned by
       rote




21
1800 – present

• Cover this?




22
Kinds of scripts




 Kind of script       What basic shapes correspond to
 featural             feature
 abjad, alphabet      letter
 abugida              between letter and syllable
 syllabary            syllable
 logographic          word

23
Text directions

• LTR top-down
     – Latin, Greek, Cyrillic, ...
• RTL top-down
     – Arabic, Hebrew, Syriac, ...
• Top to bottom, left to right
     – Mongolian, Uighur, Buryat, ...
• Top to bottom, right to left
     – Nushu, Rong
• Bottom up, left to right
     – Some minor Indonesian abugidas
• Bottom up, right to left
     – one minor Moslem Chinese abjad
• Upwards boustrophedon
     – Ogham (ancient Irish runes)




24
Combining characters

• In Arabic, the same letter can have up to four
  different shapes
     – depending on its position in the word




25
Character encoding

Basics




26
Two key concepts

• Character set
     – a function number -> character
     – usually with a limited, fixed number of characters
• Character encoding
     – a mapping from a bit stream to a sequence of
       numbers
     – the numbers, of course, refer to characters in some
       character set
• Example
     – UTF-8 is an encoding for Unicode
     – UTF-16 is another
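
The distinction can be made concrete with a minimal sketch, assuming Java and its standard charsets. The class name EncodingDemo and the helper hex are made up for this illustration. One character (one number in the character set) comes out as different byte sequences under different encodings:

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hex dump of a byte array, for example "C3 A5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One character: a-ring, number U+00E5 in the Unicode character set.
        String s = "\u00e5";
        // Three encodings of that same number.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A5
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 E5
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E5
    }
}
```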

27
Kinds of encodings

• Single-byte
     – each byte is a number 0-255. end of story
     – ISO 8859-x
• Double-byte
     – each word (2 bytes) is a number 0-65535. end of story
     – UCS-2
• Variable length
     – more complex rules (UTF-8)
• Escape code-based
     – uses escape codes to change between different modes
     – ISO 2022

28
ASCII

• The mother of all character sets
     – 7-bit
• Nearly all character sets today are ASCII
  supersets
• The exception is EBCDIC
     – mostly used by IBM mainframes
     – also a terrible design




29
Code pages

• Primitive character set solution on IBM PCs
• Basically, changing the code page would
  change the system font
     – 65 would always be 'A'
     – 216 could be 'Ø', Cyrillic, Greek, ... depending on code
       page
• Essentially, swapping code page would change
  the contents of all text files...
     – awful for text processing software


30
ISO 8859-x series

•    Lower 128 characters are ASCII
•    Higher 128 characters are language-specific
•    Now obsolete, thanks to Unicode
•    Microsoft has its own extensions
     – Windows-125x, which add extra characters where 8859
       has obsolete control codes

     Part   Covers
     1      Western Europe
     2      Central Europe
     3      South Europe
     4      North Europe
     5      Cyrillic
     6      Arabic
     7      Greek
     8      Hebrew
     9      Turkish
     10     Nordic (Sami)
     11     Thai
     12     Doesn’t exist
     13     Baltic
     14     Celtic
     15     Latin-1 ++
     16     South-Eastern Europe (Latin-10)
31
The Far East

• Generally, one character set per country
     –   Japan    JIS X 0208
     –   Korea    KS X 1001 (and 1003)
     –   China    GB 2312
     –   Taiwan   ???
• Combined with different character encodings
     – ISO 2022 (-JP, -KR, -CN)
     – EUC (-JP, -KR, -TW)
• Additional variants
     – Shift-JIS (Japanese, from Microsoft)
     – Big5 (Taiwanese)
     – ...
32
China

• Doesn't want to use Unicode
• Instead introduced GB 18030
     – takes GB 2312, then adds Unicode after the GB 2312
       part
     – requires a mapping table for lower part
     – higher part can be mapped to Unicode
       algorithmically




33
VISCII
• Vietnamese character set
     – tries hard to maintain ASCII compatibility, but there
       are just too many Vietnamese characters...




34
Unicode




35
Unicode

• The character set to end all character sets
• Before, there was at least one character set for
  each script
     – generally, it would have Latin plus one more script
     – software therefore had to support many different
       internal representations of text
• Now, Unicode supports every character that's
  ever appeared in a character set anywhere
     – therefore, it's the only character set you need


36
Origin
• 1987
     – Engineers at Xerox and Apple discuss the possibility of a universal
       character set
     – they investigate, and decide it's feasible
• 1988
     – tentative proposal for a 16-bit character set
• 1989
     – Unicode Working Group set up
     – all of ISO 8859 added
• 1990
     – many more people join
     – Chinese characters added
• 1991
     – Unicode Consortium founded
     – Unicode 1.0 released
• 1992
     – ISO 10646 killed off, and replaced by Unicode
37
Design goals

• Universal
     – should be the only character set ever needed
• Semantics
     – characters should have well-defined semantics
     – Ø≠∅
• Dynamic composition
     – characters can be composed dynamically
• Convertibility
     – every character in an existing character set, must
       have a single corresponding character in Unicode
38
Structure

• Originally intended to be 16-bit
     – 0x0000 – 0xFFFF
     – explicit rationale: enough to encode all characters in
       daily use
     – implicit: not excessive use of space
• Unfortunately, this is not nearly enough
     –   decided to expand it in 1996
     –   keep the 16-bit structure
     –   original range becomes Basic Multilingual Plane
     –   each stretch of 0x10000 (65,536) code points is a plane
     –   17 planes (0-16) in all
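
Since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, assuming Java; the class name Planes and the method plane are invented for this example:

```java
class Planes {
    // Plane number (0-16) of a Unicode code point.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16; // each plane holds 0x10000 code points
    }

    public static void main(String[] args) {
        System.out.println(plane(0x00E5));   // 0  (BMP)
        System.out.println(plane(0x1D11E));  // 1  (musical symbol G clef)
        System.out.println(plane(0x10FFFF)); // 16 (last code point)
    }
}
```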

39
Contents

• As of Unicode 6.0
     – more than 109,000 characters
     – 93 different scripts




40
The planes

• Plane 0: BMP
     – Nearly the only one needed
• Plane 1: Supplementary Multilingual Plane
     – mostly historical scripts and weird symbols
• Plane 2: Supplementary Ideographic Plane
     – historical Chinese characters
• Plane 3: Tertiary Ideographic Plane
     – not in use, reserved for ancient Chinese characters
• Planes 4-13: Unused
• Plane 14: Supplementary Special-purpose Plane
• Planes 15-16: Private Use Area
41
Basic Multilingual Plane
[Figure: layout of the Basic Multilingual Plane, columns 0xxx to Fxxx. Labelled areas: Latin, scripts, symbols, Braille, misc, CJK, Yi, Hangul, surrogates, private use.]

  42
Plane 1
[Figure: layout of Plane 1, columns 0xxx to Fxxx. Labelled areas: LTR, RTL, Indic, African, Near East, conlang, undeciphered, North Am., hieroglyphics, Sumerian, large Asian, notational.]

43
Encodings

• UCS-2
     – from ISO 10646
     – two bytes per character
     – can only encode the BMP
• UCS-4
     – also from ISO 10646
     – four bytes per character
     – can encode the whole thing
• UTF-32
     – same as UCS-4
44
UTF-16

• Like UCS-2
     – but extended with a trick to cover the full set
• Surrogates
     – a block of code points set aside specifically for UTF-
       16
• Each non-BMP character is written as two
  surrogates, first a high one, then a low one
     – subtract 0x10000 from the code point, leaving 20 bits
     – top 10 bits (0-03FF) added to D800 = first two bytes
     – bottom 10 bits added to DC00 = next two bytes
• So, the characters become:
     – 1101 10xx xxxx xxxx 1101 11xx xxxx xxxx
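
The surrogate arithmetic can be sketched in a few lines of Java. This is an illustration only: the class Surrogates and its encode method are names invented here, and in real code Character.toChars does the same job. Note that 0x10000 is subtracted from the code point before the 20 remaining bits are split.

```java
import java.util.Arrays;

class Surrogates {
    // Encode one supplementary (non-BMP) code point as a UTF-16 surrogate pair.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (v >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1D11E); // musical symbol G clef
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        // The JDK's own conversion gives the same pair.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1D11E))); // true
    }
}
```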
45
UTF-8

• Cleverly designed variable-length encoding
• ASCII is encoded as ASCII
• Can encode all of Unicode in at most 4 bytes per character
     – whether more or less compact than UTF-16 depends
       on the text being coded
     – for typical files (mostly ASCII) UTF-8 is usually far
       more compact
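
How the size comparison plays out per character can be checked with a small sketch, assuming Java. Utf8Length, utf8Bytes and utf16Bytes are names made up for this example, and the four sample code points are arbitrary picks:

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Number of bytes one code point takes in UTF-8.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes one code point takes in UTF-16 (one or two code units).
    static int utf16Bytes(int codePoint) {
        return 2 * Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // ASCII letter, Latin letter with ring, CJK character, non-BMP symbol.
        int[] samples = { 0x41, 0xE5, 0x65E5, 0x1D11E };
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    cp, utf8Bytes(cp), utf16Bytes(cp));
        }
    }
}
```

ASCII takes 1 byte in UTF-8 against 2 in UTF-16, while a CJK character takes 3 against 2, which is why the answer depends on the text.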




46
UTF-8

• Far and away most used Unicode encoding
     – because of compatibility with ASCII
• Easy to recognize bit patterns
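     – Lett Ã¥ kjenne igjen = UTF-8 interpreted as 8859-1
       (Norwegian “Lett å kjenne igjen”: easy to recognize)
     – Ã¦Ã¸Ã¥ = æøå
     – Ã†Ã˜Ã… = ÆØÅ (second bytes shown as in Windows-1252)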
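
The pattern is easy to reproduce. A minimal sketch, assuming Java; the class Mojibake and the method utf8ReadAsLatin1 are invented for this illustration. It encodes text as UTF-8 and then decodes the bytes with the wrong charset:

```java
import java.nio.charset.StandardCharsets;

class Mojibake {
    // Encode text as UTF-8, then wrongly decode those bytes as ISO 8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        String original = "\u00e6\u00f8\u00e5"; // ae-ligature, o-slash, a-ring
        String garbled = utf8ReadAsLatin1(original);
        // Each two-byte UTF-8 sequence turns into two characters.
        System.out.println(original.length() + " -> " + garbled.length()); // 3 -> 6
        // The first of each pair is U+00C3 (A with tilde), from lead byte 0xC3.
        System.out.println(garbled.charAt(0) == '\u00c3'); // true
    }
}
```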




47
Reading UTF-8 as UTF-16

• Two bytes get treated as a single character
     – effectively turns it into a random character




[Figure: the same text displayed as UTF-8 and as UTF-16]
48
Han unification

• Not only are there many Chinese characters
     – there are different variants of each character
     – differences between China (traditional & simplified),
       Japan, and Korea (also Vietnam)
• Unicode has decided to encode these only once
     – different renderings are considered visual differences
       only
• This is quite unpopular in the Far East
     – particularly in Japan


49
Too many characters

• Latin characters have a nearly infinite number
  of variants:
          a          á          à   â
          ã          ä          å   ā
          ą          ă          ậ   ẩ
          ạ          aː         ȧ   ...

• New ones pop up all the time
     – these can't all be encoded




50
Solution: combining characters

• What if I needed Z with stroke, cedilla, and
  umlaut?
• Simple, encode as
     – U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
     – U+0327 COMBINING CEDILLA
     – U+0308 COMBINING DIAERESIS
• Norwegian å can likewise be written as
     – a + U+030A COMBINING RING ABOVE



51
Many ways to write a character

• Unicode has inherited precomposed characters
  (like å) from older character sets
     – these are all included, for ease of roundtripping
• Unicode normalization provides ways of
  streamlining this
     – unfortunately, it's complex, with numerous different
       normalization forms
     – won't go further into it
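
Without going into the details, the two most common forms can be tried out with java.text.Normalizer. A minimal sketch, assuming Java; NormalizationDemo and its helper methods are names chosen for this example. NFC composes where possible, NFD decomposes:

```java
import java.text.Normalizer;

class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e5"; // a-ring as one code point
        String decomposed = "a\u030a"; // a + COMBINING RING ABOVE

        // Same character to a human reader, different strings to the computer.
        System.out.println(precomposed.equals(decomposed));      // false
        // After normalizing both to the same form they compare equal.
        System.out.println(nfc(decomposed).equals(precomposed)); // true
        System.out.println(nfd(precomposed).equals(decomposed)); // true
    }
}
```

Normalizing both sides to the same form before comparing or indexing text avoids this kind of mismatch.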




52
Unicode Character Database

• Unicode contains more than just the characters
     – there is a whole database of characters with many fields
• It contains things like
     –   names for each character
     –   decomposition mappings
     –   deprecation mappings
     –   case mappings
     –   breakdown into blocks
     –   category for each character
     –   numeric value
     –   what script the character belongs to
     –   ...
53
Uses for UCD

• Matching in regular expressions
     – by character category
     – by script
     – ...
• Upper- and lower-casing of strings
     – beware, this is complicated...
• Stripping accents
     – use decomposition mappings in UCD
• ...
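
Two of these uses can be sketched in Java, whose regex engine and Normalizer are backed by UCD data. UcdDemo, isAllGreek and stripAccents are names made up for this illustration. Accent stripping here means: decompose, then drop the combining marks (category Mn). Letters without a decomposition, such as ø or æ, pass through unchanged.

```java
import java.text.Normalizer;

class UcdDemo {
    // Match by script: true if every character belongs to the Greek script.
    static boolean isAllGreek(String s) {
        return s.matches("\\p{IsGreek}+");
    }

    // Strip accents: decompose (NFD), then remove nonspacing marks (Mn).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // sigma alpha lambda alpha mu, written with Unicode escapes
        System.out.println(isAllGreek("\u03c3\u03b1\u03bb\u03b1\u03bc")); // true
        System.out.println(isAllGreek("salaam"));                         // false
        System.out.println(stripAccents("\u00c5lesund"));                 // Alesund
    }
}
```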

54
More stuff in the Unicode standard

• Guidance on upper/lower-casing
     – tricky, because there are national variations
•    Unicode Normalization
•    Sorting algorithm
•    Regular expression guidelines
•    Line breaking algorithm
•    Bidirectional text display (Arabic)
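
The national variations in casing are easy to trip over. A minimal sketch, assuming Java; CaseDemo and lower are names invented here. In Turkish, capital I lowercases to dotless ı (U+0131), and German ß uppercases to two letters:

```java
import java.util.Locale;

class CaseDemo {
    // Lower-case a string using the rules of the given language tag.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        System.out.println(lower("TITLE", "en").equals("title"));      // true
        // In Turkish, I lowercases to dotless i (U+0131).
        System.out.println(lower("TITLE", "tr").equals("t\u0131tle")); // true
        // Casing can change the length: sharp s becomes SS.
        System.out.println("stra\u00dfe".toUpperCase(Locale.ROOT));    // STRASSE
    }
}
```

Passing an explicit locale (Locale.ROOT for language-neutral text) keeps the result from depending on the machine's default locale.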



55
Programming

Dealing with characters




56
The key principle

• One internal representation for text
     – used everywhere, with no deviations
     – text from outside must always be converted
• Modern programming languages enforce this
     – char/String vs byte (in Java)
     – Stream vs Reader/Writer (also Java)
• Older languages do not
     – C char has no defined representation



57
In an ideal world

• The internal encoding of strings should not be
  visible
     – they should just be sequences of Unicode characters
• In practice, this turns out to be difficult
     – string.charAt(): what should this return?
     – string.length(): what should this return?
     – etc




58
Internal encodings

•    C                     strings are byte arrays
•    C++                   bytes, UTF-16, or UTF-32
•    Java                  UTF-16
•    .NET                  UTF-16
•    Python                UTF-16 or UTF-32 (depends on the build)
•    Ruby                  ???
•    JavaScript            UTF-16 or UCS-2 (poorly defined)
•    PHP                   strings are byte arrays 1)
•    Perl                  UTF-8
         1) http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#which-charset-encoding-do-strings-have-in-php
59
Java

• String.charAt(int ix)
     – returns the UTF-16 code unit (word) at that index
• String.codePointAt(int ix)
     – like charAt, but if it's a high surrogate, returns code
       point by combining with charAt(ix + 1)
• String.length()
     – returns the number of UTF-16 code units in the string
       representation
                          Few programming languages document the
                          behaviour of the String class well enough for
                          this to be clear...
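
The difference shows up as soon as a string contains a non-BMP character. A minimal sketch; CodePointDemo and codePoints are names made up for this example, and U+1D11E is just a convenient test character:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDemo {
    // Walk a string code point by code point instead of char by char.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // a, U+1D11E (G clef), b
        System.out.println(s.length());                            // 4
        System.out.println(Integer.toHexString(s.charAt(1)));      // d834
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
        System.out.println(s.codePointCount(0, s.length()));       // 3
        System.out.println(codePoints(s).size());                  // 3
    }
}
```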

60
How long is a string?

String str = "\u01B5\u0327\u0308";
System.out.println(str.length());


         Output is 3, even though there is just a single,
         combined character.
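
If what you want is the number of characters as a reader sees them, one option in Java is java.text.BreakIterator. A minimal sketch; GraphemeCount and graphemes are names invented for this illustration, and the exact boundaries found can vary with the JDK version for more exotic sequences:

```java
import java.text.BreakIterator;

class GraphemeCount {
    // Count user-perceived characters (base character plus combining marks).
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "\u01B5\u0327\u0308"; // Z with stroke + cedilla + diaeresis
        System.out.println(str.length());                         // 3 code units
        System.out.println(str.codePointCount(0, str.length()));  // 3 code points
        System.out.println(graphemes(str));                       // 1 visible character
    }
}
```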




61
Find the bug
                             Q: What encoding are we reading?
                             A: We have no idea.
import java.io.*;

public class Cat {

     public static void main(String[] argv)
                                 throws IOException {
       BufferedReader in =
          new BufferedReader(new FileReader(argv[0]));
       String line = in.readLine();
       while (line != null) {
         System.out.println(line);
         line = in.readLine();
       }
     }
}

62
How to solve

• Either find some way to auto-detect encoding
     – requires you to know the syntax of the file
     – and that syntax to have auto-detect rules
• Or find some way for the user to specify the
  encoding
     – for example a command-line parameter
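
A minimal sketch of the second option, assuming the encoding name arrives as a second command-line argument. CatWithEncoding and readLines are names chosen for this example, not part of the original program. The point is that the Reader is constructed with an explicit charset instead of the platform default:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class CatWithEncoding {
    // Decode a byte stream into lines using the charset the caller names.
    static List<String> readLines(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        List<String> lines = new ArrayList<>();
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] argv) throws IOException {
        if (argv.length != 2) {
            System.err.println("usage: java CatWithEncoding <file> <encoding>");
            return;
        }
        Charset charset = Charset.forName(argv[1]); // e.g. UTF-8 or ISO-8859-1
        try (InputStream in = new FileInputStream(argv[0])) {
            for (String line : readLines(in, charset)) {
                System.out.println(line);
            }
        }
    }
}
```

Note that System.out still encodes its output with the platform default, so the output side has the same problem in mirror image.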




63
Find the bug, 2                    Q: How do we know this is UTF-8?
                                   A: We don’t.


public HttpResponse get(String request) throws IOException {
   InputStream responseContent = null;
   HttpGet httpGet = new HttpGet(request);
   HttpResponse response =
      new DefaultHttpClient().execute(httpGet);
   responseContent = response.getEntity().getContent();
   return new HttpResponse(response.getStatusCode(),
       response.getStatusLine().getReasonPhrase(),
       read(new InputStreamReader(responseContent, "UTF-8")));
 }




64
How to solve

• Need to look at the Content-type
     – “Content-type: text/plain; charset=iso-8859-1”
• Occasionally, it can be even harder
     – MIME-type specific rules for deciding charset if not
       specified in the response
• Best solution (for a generic Get class) is to
  return the stream (not the reader)
     – and provide enough info for clients to figure out the
       encoding
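
Pulling the charset out of a Content-type value could look roughly like this. It is a sketch only: ContentTypeCharset and charsetOf are names invented here, the regular expression is a simplification of the real header grammar, and the fallback is whatever the caller decides (the MIME-type specific rules are not implemented):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
    // Simplified pattern: "; charset=NAME", optionally quoted, case-insensitive.
    private static final Pattern CHARSET =
            Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    // Return the declared charset, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return fallback; // malformed or unsupported charset name
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8; // the caller's choice
        System.out.println(charsetOf("text/plain; charset=iso-8859-1", fallback)); // ISO-8859-1
        System.out.println(charsetOf("text/html", fallback));                      // UTF-8
    }
}
```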


65
URI vs IRI

• Originally, the character encoding of URIs was
  not defined
     – characters must be ASCII, or %-escaped
     – however, character set of %-escapes not defined
     – this was RFC 2396
• Then, RFC 3986 defined %-escapes as being
  UTF-8
     – explicit characters must still be ASCII only
• RFC 3987 introduced IRIs
     – here, everything is UTF-8
     – non-ASCII characters do not need to be escaped
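
Going from an IRI-style string to the URI form means UTF-8 encoding the text and %-escaping the bytes. A minimal sketch for a single path segment, assuming Java; PercentEncode and encode are names made up for this example, and it escapes everything except the unreserved ASCII characters, so it is not suitable for a whole URI with its delimiters:

```java
import java.nio.charset.StandardCharsets;

class PercentEncode {
    // %-escape one URI component: UTF-8 bytes, unreserved ASCII kept as-is.
    static String encode(String component) {
        StringBuilder sb = new StringBuilder();
        for (byte b : component.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "blabaer" spelled with a-ring and ae-ligature (Norwegian for blueberry)
        System.out.println(encode("bl\u00e5b\u00e6r")); // bl%C3%A5b%C3%A6r
    }
}
```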
66
How to distinguish IRIs and URIs

• Well, uh, you can't, really...
     – if there are non-ASCII characters it's probably an IRI
• Specifications can decide to support IRIs
     – for example, in XTM it's all IRIs
• So the context can tell you, in some cases

                 If it sounds like a mess, that’s because it is...




67
XML and Unicode

• One of the good things about XML is that it
  gets Unicode right
     – text coming out of an XML parser is always Unicode
     – unless the author has made a stupid mistake, there
       will be no character encoding problems
• Does this via
     –   syntax for declaring encoding in the document,
     –   careful, explicit rules for detecting encoding,
     –   escape syntax for Unicode characters, and
     –   well-designed rules for Unicode characters in
         identifiers and elsewhere
68
Representing characters in XML

• Don't ever use entity references
     – &aring; and similar are the spawn of the devil
     – they require the DOCTYPE to be downloaded
• Use UTF-8 as the character encoding
     – and declare it in the <?xml ...?> declaration, or omit
       the declaration entirely (UTF-8 is then assumed)
     – this way all characters can be expressed directly
• If you absolutely must use stupid tricks, use
  &#XXXX; character references
     – it's better to avoid these, but in special cases (human
       authoring) they can be useful

69

Manulife - Insurer Transformation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Unicode 101

  • 1. Unicode 101 Bouvet søkekollokvie, 2011-10-26 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga 1
  • 2. Agenda • Scripts and languages – background for what follows • Character encodings – basic concepts • Unicode – it’s bigger than you think • Programming – some practical lessons
  • 3. Scripts and languages A quick intro to grammatology 3
  • 4. The beginning of writing [Timeline figure, 4000 BCE to 1000 BCE: Sumerian cuneiform, Egyptian hieroglyphics, Chinese writing]
  • 5. Logographic scripts • All of these first scripts are logographic – sometimes called “ideographic”, too • One character – one word – 山 = mountain – 口 = mouth / door – 門 = gate • Compounds for complicated concepts – 問 = question (mouth inside gate) – 酒店 = hotel (literally: alcohol shop) – 日本 = Japan (literally: sun root)
  • 6. Simplified Chinese • In the 1950s and 60s the Communist Chinese government simplified the shapes of many characters – these modified characters are used in mainland China – Japan and Taiwan still use the original shapes • Character (Traditional → Simplified): Gate (門 → 门); Country (國 → 国); Vehicle (車 → 车); East (東 → 东)
  • 7. Logographic scripts (2) Good: • Iconic and striking • Compact • Language-independent (to some degree) Bad: • Hard to learn • Hard to write on computers • Works poorly for inflected languages • Sorting is hard
  • 8. The next step: abjads • The Egyptians later developed a script called “hieratic” – meaning “priestly writing” – oldest known example from 1600 BCE, but must be older – actual origin kind of obscure • It is an abjad – that is, an alphabet with only consonant signs – (and some logographic elements) • A precursor to our own alphabet 8
  • 9. Why only consonants? • In Semitic languages everything revolves around “word roots” – these consist of three consonants – many, many words can be derived from the same root in a systematic way – (the same applies to Egyptian) • Example – s l m = peace – salaam = peace (related to Hebrew shalom) – islam = to have peace – moslem = one who has peace • Abjads therefore work well for Semites – and not quite as well for others 9
  • 10. Abjad family tree (much simplified; this family has many, many more scripts) [Tree figure: Hieratic, Proto-Sinaitic, Ethiopic, Phoenician, Greek, Aramaic, Kharoshthi, Arabic, Hebrew]
  • 11. The invention of the alphabet • The Greeks didn’t much like not having signs for vowels – so they invented them, thus giving us the alphabet • Salaam – in Arabic: (m l s) – in Greek: σαλαμ (s a l a m) • Grammatologists consider only scripts with both consonant and vowel signs to be alphabets – thus, there is no “Arabic alphabet” or “Chinese alphabet”
  • 12. Alphabet family tree Again much simplified. Greek Etruscan Latin Armenian Cyrillic Georgian 12
  • 14. ABG? • The Etruscans had no g sound, so they pronounced the third character k – the Romans inherited this • Thus, in Latin “C” was used for both k and g – this left Spurius Carvilius Ruga with a bit of a problem – he had to write his surname as “Ruca” – (Latin “ruca”: to fart) – so he invented “G” (C with a stroke) 14
  • 15. Backtracking Hieratic Proto-Sinaitic Ethiopic Phoenician Greek Aramaic Kharoshthi Arabic Hebrew 15
  • 16. Kharoshthi • A script developed for writing Sanskrit in India – 4th century BCE, or thereabouts • Sanskrit is Indo-European, like Greek – hence, the absence of vowels is a major problem – however, the Indians came up with a different solution • This is an abugida – Ethiopic, too 16
  • 17. The descendants of Kharoshthi • The Indian subcontinent has a large number of scripts – all are descended from Kharoshthi and follow the same basic model • These scripts also spread into Indo-China and beyond to Indonesia – Thai, Khmer, Javanese, Tibetan, etc etc • Total number of users must approach 2 billion 17
  • 18. Back to China Chinese Japanese Korean Vietnamese • Japanese is an inflected language – therefore developed extra characters for writing grammatical endings • The Vietnamese later moved on to Latin characters under French influence • The Koreans did something completely different... • Leaving out a number of minor ethnic scripts here 18
  • 19. Hangul • On October 9, 1446, King Sejong announced the introduction of a new script: Hangul – it was designed to be easier to learn than Chinese characters – it was also designed to be a better fit to the Korean language • Hangul is like an alphabet, but better – Wikipedia: Numerous linguists have praised Hangul for its featural design, describing it as “remarkable”, “the most perfect phonetic system devised”, and “brilliant, so deliberately does it fit the language like a glove.”
  • 20. Hangul design • Vowels have different shapes from consonants – these follow the shape of the vocal cords when speaking the vowel • Consonants have a different system Follows Chinese convention of equal-sized boxes for all characters. Each box contains hangul letters for one syllable. 20
  • 21. Syllabaries • Like abugidas, but unsystematic – that is, every letter combination must be learned by rote 21
  • 22. 1800 – present • Cover this? 22
  • 23. Kinds of scripts abjad featural alphabet abugida syllabary logographic feature letter syllable word what basic shapes correspond to 23
  • 24. Text directions • LTR top-down – Latin, Greek, Cyrillic, ... • RTL top-down – Arabic, Hebrew, Syriac, ... • Top to bottom, left to right – Mongolian, Uighur, Buryat, ... • Top to bottom, right to left – Nushu, Rong • Bottom up, left to right – some minor Indonesian abugidas • Bottom up, right to left – one minor Moslem Chinese abjad • Upwards boustrophedon – Ogham (ancient Irish runes)
  • 25. Combining characters • In Arabic, the same letter can have up to four different shapes – depending on its position in the word 25
  • 27. Two key concepts • Character set – a function number -> character – usually with a limited, fixed number of characters • Character encoding – a mapping from a bit stream to a sequence of numbers – the numbers, of course, refer to characters in some character set • Example – UTF-8 is an encoding for Unicode – UTF-16 is another 27
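The split between character set and encoding can be seen directly in Java. The sketch below is only an illustration: the class name `EncodingDemo` and the helper `hex` are made up for this example. It takes one character (one number in the Unicode character set) and shows that two encodings turn it into different bytes.

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Formats bytes as space-separated upper-case hex, e.g. "C3 A5". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LATIN SMALL LETTER A WITH RING ABOVE: number 0xE5 in the Unicode character set
        String aRing = new String(Character.toChars(0xE5));
        // Same character, two encodings, two different byte sequences
        System.out.println("UTF-8:    " + hex(aRing.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + hex(aRing.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

The character set fixes the number (0xE5); the encoding decides whether that number is stored as `C3 A5` or `00 E5`.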
  • 28. Kinds of encodings • Single-byte – each byte is a number 0-255. end of story – ISO 8859-x • Double-byte – each word (2 bytes) is a number 0-65535. end of story – UCS-2 • Variable length – more complex rules (UTF-8) • Escape code-based – uses escape codes to change between different modes – ISO 2022
  • 29. ASCII • The mother of all character sets – 7-bit • Nearly all character sets today are ASCII supersets • The exception is EBCDIC – mostly used by IBM mainframes – also a terrible design
  • 30. Code pages • Primitive character set solution on IBM PCs • Basically, changing the code page would change the system font – 65 would always be ‘A’ – 216 could be ‘Ø’, Cyrillic, Greek, ... depending on code page • Essentially, swapping code page would change the contents of all text files... – awful for text processing software
  • 31. ISO 8859-x series • Lower 128 characters are ASCII • Higher 128 characters are language-specific • Now obsolete, thanks to Unicode • Microsoft have their own extensions – Windows-12xx, which add extra characters where 8859 has obsolete control codes • Parts: 1 Western Europe; 2 Central Europe; 3 South Europe; 4 North Europe; 5 Cyrillic; 6 Arabic; 7 Greek; 8 Hebrew; 9 Turkish; 10 Nordic (Sami); 11 Thai; 12 Doesn’t exist; 13 Baltic; 14 Celtic; 15 Latin-1++; 16 South-Eastern Europe (Latin-10)
  • 32. The Far East • Generally, one character set per country – Japan JIS X 0208 – Korea KS X 1001 (and 1003) – China GB 2312 – Taiwan ??? • Combined with different character encodings – ISO 2022 (-JP, -KR, -CN) – EUC (-JP, -KR, -TW) • Additional variants – Shift-JIS (Japanese, from Microsoft) – Big5 (Taiwanese) – ... 32
  • 33. China • Doesn’t want to use Unicode • Instead introduced GB 18030 – takes GB 2312, then adds Unicode after the GB 2312 part – requires a mapping table for lower part – higher part can be mapped to Unicode algorithmically
  • 34. VISCII • Vietnamese character set – tries hard to maintain ASCII compatibility, but there are just too many Vietnamese characters... 34
  • 36. Unicode • The character set to end all character sets • Before, there was at least one character set for each script – generally, it would have Latin plus one more script – software therefore had to support many different internal representations of text • Now, Unicode supports every character that’s ever appeared in a character set anywhere – therefore, it’s the only character set you need
  • 37. Origin • 1987 – Engineers at Xerox and Apple discuss the possibility of a universal character set – they investigate, and decide it’s feasible • 1988 – tentative proposal for a 16-bit character set • 1989 – Unicode Working Group set up – all of ISO 8859 added • 1990 – many more people join – Chinese characters added • 1991 – Unicode Consortium founded – Unicode 1.0 released • 1992 – ISO 10646 killed off, and replaced by Unicode
  • 38. Design goals • Universal – should be the only character set ever needed • Semantics – characters should have well-defined semantics – Ø≠∅ • Dynamic composition – characters can be composed dynamically • Convertibility – every character in an existing character set, must have a single corresponding character in Unicode 38
  • 39. Structure • Originally intended to be 16-bit – 0x0000 – 0xFFFF – explicit rationale: enough to encode all characters in daily use – implicit: not excessive use of space • Unfortunately, this is not nearly enough – decided to expand it in 1996 – keep the 16-bit structure – original range becomes Basic Multilingual Plane – each stretch of 0x10000 (65,536) code points is a plane – 17 planes (0-16) in all, so the full range is 0x0000 – 0x10FFFF
  • 40. Contents • As of Unicode 6.0 – more than 109,000 characters – 93 different scripts 40
  • 41. The planes • Plane 0: BMP – Nearly the only one needed • Plane 1: Supplementary Multilingual Plane – mostly historical scripts and weird symbols • Plane 2: Supplementary Ideographic Plane – historical Chinese characters • Plane 3: Tertiary Ideographic Plane – not in use, reserved for ancient Chinese characters • Planes 4-13: Unused • Plane 14: Supplementary Special-purpose Plane • Planes 15-16: Private Use Area
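Since every plane covers 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with the class name `Planes` and method `planeOf` invented for this example:

```java
final class Planes {
    /** Returns the plane number (0 to 16) that a Unicode code point belongs to. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        // Each plane spans 0x10000 code points, so the plane is everything above the low 16 bits
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf('A'));      // a BMP character
        System.out.println(planeOf(0x10400));  // a Supplementary Multilingual Plane character
        System.out.println(planeOf(0x20000));  // a Supplementary Ideographic Plane character
    }
}
```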
  • 42. Basic Multilingual Plane [Block map figure, rows 0xxx to Fxxx: Latin, scripts, symbols, Braille, CJK, Yi syllables, Hangul, surrogates, private use, misc]
  • 43. Plane 1 [Block map figure, rows 0xxx to Fxxx: LTR scripts, RTL scripts, Indic, large Asian scripts, hieroglyphics, Sumerian, Near East, African, North American, notational systems, conlang, undeciphered]
  • 44. Encodings • UCS-2 – from ISO 10646 – two bytes per character – can only encode the BMP • UCS-4 – also from ISO 10646 – four bytes per character – can encode the whole thing • UTF-32 – same as UCS-4 44
  • 45. UTF-16 • Like UCS-2 – but extended with a trick to cover the full set • Surrogates – a block of code points set aside specifically for UTF-16 • Each non-BMP character is written as two surrogates, first one high and then one low – subtract 0x10000 from the code point, leaving a 20-bit number – top 10 bits (0-0x3FF) added to D800 = first two bytes – bottom 10 bits added to DC00 = next two bytes • So, the characters become: – 1101 10xx xxxx xxxx 1101 11xx xxxx xxxx
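The surrogate arithmetic is short enough to write out by hand. The sketch below is an illustration only (class and method names are made up); it subtracts 0x10000 from the code point, then puts the top 10 bits of the remainder on D800 and the bottom 10 bits on DC00. In real code `Character.toChars` already does this.

```java
final class Surrogates {
    /** Splits a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair (high, low). */
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not outside the BMP: " + codePoint);
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x10400);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
    }
}
```

For U+10400 the offset is 0x400, so the pair is D801 DC00, the same result the JDK produces.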
  • 46. UTF-8 • Cleverly designed variable-length encoding • ASCII is encoded as ASCII • Can encode all of Unicode in at most 4 bytes per character – whether more or less compact than UTF-16 depends on the text being coded – for files UTF-8 is usually far more compact
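How compact UTF-8 is compared to UTF-16 depends on which characters the text contains. A small sketch (class and method names are invented for the example) that measures the encoded size of a single code point in both encodings:

```java
import java.nio.charset.StandardCharsets;

final class EncodedLength {
    /** Number of bytes needed to store one code point in UTF-8. */
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes needed to store one code point in UTF-16 (no byte order mark). */
    static int utf16Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // ASCII letter, Latin letter with ring, euro sign, a CJK character, a non-BMP character
        int[] samples = { 'A', 0xE5, 0x20AC, 0x65E5, 0x10400 };
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```

ASCII takes one byte in UTF-8 against two in UTF-16, while CJK characters take three against two, which is why markup-heavy files tend to favour UTF-8 and pure CJK text can favour UTF-16.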
  • 47. UTF-8 • Far and away most used Unicode encoding – because of compatibility with ASCII • Easy to recognize bit patterns – “Lett Ã¥ kjenne igjen” (Norwegian: “easy to recognize”) = UTF-8 interpreted as 8859-1 – Ã¦Ã¸Ã¥ = æøå – Ã†Ã˜Ã… = ÆØÅ
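The characteristic garbling is easy to reproduce: encode as UTF-8, then decode the bytes as ISO 8859-1. The sketch below (names are made up for the example) does that, and also shows that this particular mistake is reversible, because ISO 8859-1 maps every byte to exactly one character.

```java
import java.nio.charset.StandardCharsets;

final class Mojibake {
    /** Encodes text as UTF-8, then wrongly decodes the bytes as ISO 8859-1. */
    static String misreadUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the mistake above by going back to bytes and decoding them as UTF-8. */
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three lower-case Norwegian letters ae, o-slash, a-ring (U+00E6, U+00F8, U+00E5)
        String original = new String(new int[] { 0xE6, 0xF8, 0xE5 }, 0, 3);
        String garbled = misreadUtf8AsLatin1(original);
        // Each letter has turned into two characters, the first always U+00C3
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```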
  • 48. Reading UTF-8 as UTF-16 • Two bytes get treated as a single character – effectively turns it into a random character As UTF-8 As UTF-16 48
  • 49. Han unification • Not only are there many Chinese characters – there are different variants of each character – differences between China (traditional & simplified), Japan, and Korea (also Vietnam) • Unicode has decided to encode these only once – different renderings are considered visual differences only • This is quite unpopular in the Far East – particularly in Japan 49
  • 50. Too many characters • Latin characters have a nearly infinite number of variants: a á à â ã ä å ā ą ă ậ ẩ ạ aː ȧ ... • New ones pop up all the time – these can’t all be encoded
  • 51. Solution: combining characters • What if I needed Z with stroke, cedilla, and umlaut? • Simple, encode as – U+01B5 LATIN CAPITAL LETTER Z WITH STROKE – U+0327 COMBINING CEDILLA – U+0308 COMBINING DIAERESIS • Norwegian å should actually be written – a + combining ring 51
  • 52. Many ways to write a character • Unicode has inherited precomposed characters (like å) from older character sets – these are all included, for ease of roundtripping • Unicode normalization provides ways of streamlining this – unfortunately, it’s complex, with numerous different normalization forms – won’t go further into it
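Without going into the details of the normalization forms, the basic effect can be shown with `java.text.Normalizer`. In this sketch (class and method names are invented for the example) the precomposed å and the sequence a + combining ring compare as different strings until both are normalized to the same form.

```java
import java.text.Normalizer;

final class NormalizeDemo {
    /** Normalization Form C: composes base + combining marks where a precomposed character exists. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Normalization Form D: decomposes precomposed characters into base + combining marks. */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = new String(Character.toChars(0xE5));       // a-ring as one code point
        String decomposed = "a" + new String(Character.toChars(0x30A)); // a + COMBINING RING ABOVE
        System.out.println(precomposed.equals(decomposed));             // plain comparison
        System.out.println(nfc(precomposed).equals(nfc(decomposed)));   // comparison after NFC
    }
}
```

A reasonable habit is to pick one form (NFC is the common choice) and normalize text when it enters the system, so that comparisons and lookups behave.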
  • 53. Unicode Character Database • Unicode contains more than just the characters – there is a whole database of characters with many fields • It contains things like – names for each character – decomposition mappings – deprecation mappings – case mappings – breakdown into blocks – category for each character – numeric value – what script the character belongs to – ... 53
  • 54. Uses for UCD • Matching in regular expressions – by character category – by script – ... • Upper- and lower-casing of strings – beware, this is complicated... • Stripping accents – use decomposition mappings in UCD • ... 54
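Two of these uses can be sketched with the Java standard library, which ships with the relevant UCD data. The class and method names below are made up for the example. `stripAccents` decomposes the text and then removes everything in the category Mn (nonspacing mark); the two `matches` helpers use regular-expression classes for a general category and for a script.

```java
import java.text.Normalizer;

final class UcdUses {
    /** Removes accents by decomposing (NFD) and dropping nonspacing combining marks. */
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    /** True if the string is non-empty and consists only of letters (general category L). */
    static boolean isAllLetters(String s) {
        return s.matches("\\p{L}+");
    }

    /** True if the string is non-empty and consists only of characters in the Greek script. */
    static boolean isAllGreek(String s) {
        return s.matches("\\p{IsGreek}+");
    }

    public static void main(String[] args) {
        // "e" with acute accent followed by "a" with ring
        String accented = new String(new int[] { 0xE9, 0xE5 }, 0, 2);
        System.out.println(stripAccents(accented));
        System.out.println(isAllGreek("salam"));
    }
}
```

Note that this only strips marks that the UCD decomposition mappings expose: letters such as ø and æ have no decomposition, so they pass through unchanged.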
  • 55. More stuff in the Unicode standard • Guidance on upper/lower-casing – tricky, because there are national variations • Unicode Normalization • Sorting algorithm • Regular expression guidelines • Line breaking algorithm • Bidirectional text display (Arabic) 55
  • 57. The key principle • One internal representation for text – used everywhere, with no deviations – text from outside must always be converted • Modern programming languages enforce this – char/String vs byte (in Java) – Stream vs Reader/Writer (also Java) • Older languages do not – C char has no defined representation 57
  • 58. In an ideal world • The internal encoding of strings should not be visible – they should just be sequences of Unicode characters • In practice, this turns out to be difficult – string.charAt(): what should this return? – string.length(): what should this return? – etc 58
  • 59. Internal encodings • C: strings are byte arrays • C++: bytes, UTF-16, or UTF-32 • Java: UTF-16 • .NET: UTF-16 • Python: UTF-16 • Ruby: ??? • JavaScript: UTF-16 or UCS-2 (poorly defined) • PHP: strings are byte arrays (1) • Perl: UTF-8 (1) http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#which-charset-encoding-do-strings-have-in-php
  • 60. Java • String.charAt(int ix) – returns the UTF-16 code unit (word) at that index • String.codePointAt(int ix) – like charAt, but if it’s a high surrogate, returns code point by combining with charAt(ix + 1) • String.length() – returns the number of UTF-16 code units in the string representation Few programming languages document the behaviour of the String class well enough for this to be clear...
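The difference between code units and code points only shows up once a string contains a character outside the BMP. A minimal sketch (the class and helper names are invented for the example) using U+10400, which UTF-16 stores as two surrogates:

```java
final class CodeUnits {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points, counting each surrogate pair once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+10400, a character outside the BMP
        String s = "a" + new String(Character.toChars(0x10400));
        System.out.println(codeUnits(s));                           // counts the surrogates separately
        System.out.println(codePoints(s));                          // counts the pair as one
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // charAt sees half a character
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // codePointAt combines the pair
    }
}
```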
  • 61. How long is a string? String str = "\u01B5\u0327\u0308"; System.out.println(str.length()); Output is 3, even though there is just a single, combined character.
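If what you want is the number of characters as a user perceives them, one option in the JDK is `java.text.BreakIterator`, whose character instance treats a base letter plus its combining marks as one unit. This is a sketch under that assumption, with made-up class and method names; the exact boundaries depend on the rules shipped with the JDK in use.

```java
import java.text.BreakIterator;

final class Graphemes {
    /** Counts user-perceived characters (base character plus any combining marks). */
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        // Each call to next() moves to the following character boundary
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Z WITH STROKE + COMBINING CEDILLA + COMBINING DIAERESIS
        String str = new String(new int[] { 0x01B5, 0x0327, 0x0308 }, 0, 3);
        System.out.println(str.length());          // UTF-16 code units
        System.out.println(countGraphemes(str));   // user-perceived characters
    }
}
```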
  • 62. Find the bug Q: What encoding are we reading? A: We have no idea. import java.io.*; public class Cat { public static void main(String[] argv) throws IOException { BufferedReader in = new BufferedReader(new FileReader(argv[0])); String line = in.readLine(); while (line != null) { System.out.println(line); line = in.readLine(); } } } 62
  • 63. How to solve • Either find some way to auto-detect encoding – requires you to know the syntax of the file – and that syntax to have auto-detect rules • Or find some way for the user to specify the encoding – for example a command-line parameter 63
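The second option, letting the user specify the encoding, might look like the sketch below. The class name `CatWithCharset`, the helper `readAll`, and the convention of passing the charset name as a second command-line argument are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

final class CatWithCharset {
    /** Decodes the whole stream with the given charset; every line ends with a newline in the result. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder out = new StringBuilder();
        // The charset is passed explicitly, so the platform default is never used for decoding
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] argv) throws IOException {
        // Hypothetical usage: java CatWithCharset <file> <charset-name>
        Charset charset = Charset.forName(argv[1]);
        try (InputStream in = Files.newInputStream(Paths.get(argv[0]))) {
            System.out.print(readAll(in, charset));
        }
    }
}
```

The output side has the same problem in mirror image: `System.out` encodes with the platform default, so a complete tool would let the user choose that encoding too.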
  • 64. Find the bug, 2 Q: How do we know this is UTF-8? A: We don’t. public HttpResponse get(String request) throws IOException { InputStream responseContent = null; HttpGet httpGet = new HttpGet(request); HttpResponse response = new DefaultHttpClient().execute(httpGet); responseContent = response.getEntity().getContent(); return new HttpResponse(response.getStatusCode(), response.getStatusLine().getReasonPhrase(), read(new InputStreamReader(responseContent, "UTF-8"))); } 64
  • 65. How to solve • Need to look at the Content-type – “Content-type: text/plain; charset=iso-8859-1” • Occasionally, it can be even harder – MIME-type specific rules for deciding charset if not specified in request • Best solution (for a generic Get class) is to return the stream (not the reader) – and provide enough info for clients to figure out the encoding 65
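Looking at the Content-type header could be done roughly as follows. This is a deliberately simple sketch, not a full MIME parser: the class and method names are hypothetical, parameters are split on ";" (so a quoted value containing a semicolon would be mishandled), and the caller supplies the fallback to use when no usable charset is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    /** Returns the charset named in a Content-type header value, or the fallback if there is none. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {   // parts[0] is the media type itself
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = param.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);   // strip surrounding quotes
            }
            if (name.equals("charset")) {
                try {
                    return Charset.forName(value);
                } catch (IllegalArgumentException e) {
                    return fallback;   // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=iso-8859-1", StandardCharsets.UTF_8));
    }
}
```

Which fallback is right depends on the MIME type, which is exactly why returning the raw stream plus the header is the safer design for a generic client.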
  • 66. URI vs IRI • Originally, the character encoding of URIs was not defined – characters must be ASCII, or %-escaped – however, character set of %-escapes not defined – this was RFC 2396 • Then, RFC 3986 defined %-escapes as being UTF-8 – explicit characters must still be ASCII only • RFC 3987 introduced IRIs – here, everything is UTF-8 – non-ASCII characters do not need to be escaped 66
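Going from an IRI to a URI means %-escaping the UTF-8 bytes of every non-ASCII character. In Java, `java.net.URI` accepts such characters and `toASCIIString()` performs the escaping. The sketch below uses a made-up class name and a made-up example address; it only covers the path part, since non-ASCII host names need separate (IDN) handling.

```java
import java.net.URI;

final class IriToUri {
    /** Converts an IRI with non-ASCII path characters into an ASCII-only URI. */
    static String toAsciiUri(String iri) {
        // toASCIIString() %-escapes non-ASCII characters as their UTF-8 bytes
        return URI.create(iri).toASCIIString();
    }

    public static void main(String[] args) {
        // Path is the Norwegian word for blueberries, spelled with a-ring and ae
        String iri = "http://example.org/bl"
                + new String(Character.toChars(0xE5)) + "b"
                + new String(Character.toChars(0xE6)) + "r";
        System.out.println(toAsciiUri(iri));
    }
}
```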
  • 67. How to distinguish IRIs and URIs • Well, uh, you can’t, really... – if there are non-ASCII characters it’s probably an IRI • Specifications can decide to support IRIs – for example, in XTM it’s all IRIs • So the context can tell you, in some cases If it sounds like a mess, that’s because it is...
  • 68. XML and Unicode • One of the good things about XML is that it gets Unicode right – text coming out of an XML parser is always Unicode – unless the author has made a stupid mistake, there will be no character encoding problems • Does this via – syntax for declaring encoding in the document, – careful, explicit rules for detecting encoding, – escape syntax for Unicode characters, and – well-designed rules for Unicode characters in identifiers and elsewhere 68
  • 69. Representing characters in XML • Don’t ever use entity references – &aring; and similar are the spawn of the devil – they require the DOCTYPE to be downloaded • Use UTF-8 as the character encoding – and declare it in the <?xml ...?> declaration, or omit the declaration entirely (UTF-8 is then the default) – this way all characters can be expressed directly • If you absolutely must use stupid tricks, use &#XXXX; character references – it’s better to avoid these, but in special cases (human authoring) they can be useful