Unicode is ubiquitous: nearly all modern computers use it to store and transmit text. Yet most developers know very little about it, or panic when they see the message "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92". Panic no more! Eddie will give the introduction to Unicode that you never got, along with the best practices and knowledge you need to handle text from all around the world!
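For reference, the error quoted above can be reproduced in a few lines. This is only a sketch: the sample bytes are made up, and the guess that byte 0x92 came from Windows-1252 (where it is a curly apostrophe) is an assumption, not something the message itself tells you.

```python
# Hypothetical input: "It's" with a curly apostrophe, saved as Windows-1252.
raw = b"It\x92s"

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x92 is a UTF-8 continuation byte, so it cannot start a character.
    error = exc

# Assumption: the bytes were really Windows-1252 (cp1252).
fixed = raw.decode("cp1252")

print(error)
print(fixed)
```

The bytes are not broken; they were simply decoded with the wrong codec.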
18. A € ৪ ‽
U+0041 U+20AC U+09EA U+203D
code points
assigns numbers to characters
19. A € ৪ ‽
U+0041 U+20AC U+09EA U+203D
code points
a string is a sequence of code points
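A minimal Python sketch of the same idea, assuming Python 3, where `str` is a sequence of code points. The variable names are illustrative only.

```python
# The four characters from the slide, written as escapes to avoid ambiguity.
text = "A\u20ac\u09ea\u203d"

# ord() maps a one-character string to its code point number.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

# chr() goes the other way: number back to character.
rebuilt = "".join(chr(int(cp[2:], 16)) for cp in code_points)
print(rebuilt == text)
```

Note that `len(text)` counts code points here, not bytes.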
20. A = U+0041
€ = U+20AC
৪ = U+09EA
‽ = U+203D
A code point is like a database primary key into the Unicode Character Database.
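The primary-key analogy can be tried out with Python's standard `unicodedata` module, which ships a copy of the Unicode Character Database. This is a small sketch, not a full tour of the module.

```python
import unicodedata

# Look up a record by its "primary key" (the code point).
euro_name = unicodedata.name(chr(0x20AC))
print(euro_name)

# Look up in the other direction, from the official name to the character.
interrobang = unicodedata.lookup("INTERROBANG")
print(f"U+{ord(interrobang):04X}")
```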
Every code point has
properties.
A
22. General category
A  U+0041  LATIN CAPITAL LETTER A: Letter, uppercase
€  U+20AC  EURO SIGN: Symbol, currency
৪  U+09EA  BENGALI DIGIT FOUR: Digit, decimal
‽  U+203D  INTERROBANG: Punctuation, other
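The general category of each character on this slide can be checked with `unicodedata.category`, which returns the two-letter abbreviation (for example `Lu` for "Letter, uppercase"). A short sketch, with illustrative variable names:

```python
import unicodedata

# A, EURO SIGN, BENGALI DIGIT FOUR, INTERROBANG
samples = ["A", "\u20ac", "\u09ea", "\u203d"]

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

`Sc` is "Symbol, currency", `Nd` is "Digit, decimal" (the slide's wording for the category the standard calls "Number, decimal digit"), and `Po` is "Punctuation, other".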
23. General category
A  U+0041  LATIN CAPITAL LETTER A: Letter, uppercase
a  U+0061  LATIN SMALL LETTER A: Letter, lowercase
ا  U+0627  ARABIC LETTER ALEF: Letter, other
あ  U+3042  HIRAGANA LETTER A: Letter, other
26. Bidirectionality Class
A  U+0041  LATIN CAPITAL LETTER A: Strong left-to-right
€  U+20AC  EURO SIGN: Weak
ا  U+0627  ARABIC LETTER ALEF: Strong right-to-left
😳  U+1F633  FLUSHED FACE: Neutral
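The bidirectional class is another property that `unicodedata` exposes. The sketch below prints the short class codes; the mapping of those codes to the slide's "strong / weak / neutral" wording is given in the comments.

```python
import unicodedata

# A, EURO SIGN, ARABIC LETTER ALEF, FLUSHED FACE
samples = ["A", "\u20ac", "\u0627", "\U0001F633"]

bidi_classes = [unicodedata.bidirectional(ch) for ch in samples]
for ch, cls in zip(samples, bidi_classes):
    print(f"U+{ord(ch):04X}: {cls}")

# L  = Left-to-Right (strong)
# ET = European Terminator (weak)
# AL = Arabic Letter (strong right-to-left)
# ON = Other Neutral (neutral)
```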
48. Put a 💩 in your test suite
• most emoji use a part of Unicode (code points above U+FFFF) that is poorly supported
• insert a test that encodes a string containing an emoji and decodes it again
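One possible shape for such a test, as a sketch. The helper `roundtrip` and the test name are hypothetical; in a real suite the round trip would go through your own storage or transport layer rather than a bare encode and decode.

```python
def roundtrip(text, encoding="utf-8"):
    """Encode text to bytes and decode it back (hypothetical helper)."""
    return text.encode(encoding).decode(encoding)


def test_emoji_roundtrip():
    pile = "\U0001F4A9"  # PILE OF POO, a code point above U+FFFF
    assert ord(pile) > 0xFFFF
    for encoding in ("utf-8", "utf-16", "utf-32"):
        assert roundtrip(pile, encoding) == pile
    # In UTF-8 this single code point takes four bytes.
    assert len(pile.encode("utf-8")) == 4


test_emoji_roundtrip()
```

A test runner such as pytest would pick up `test_emoji_roundtrip` by name; the direct call at the end just makes the snippet runnable on its own.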
49. Use your head
• chardet (encoding detection) and ftfy (fixing mis-decoded text) are great tools, but they make guesses, and they can sometimes be wrong!
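Why can a detector be wrong? The same bytes are often valid in more than one encoding, so any tool has to guess. The sketch below uses only the standard library (it does not call chardet or ftfy), and the sample bytes are made up for illustration.

```python
# Hypothetical input whose real encoding we do not know.
raw = b"caf\xe9"

# Both decodes succeed without error, but they disagree on the last letter.
as_latin1 = raw.decode("latin-1")  # ends in LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = raw.decode("cp1251")   # ends in CYRILLIC SMALL LETTER SHORT I

print(as_latin1)
print(as_cp1251)
print(as_latin1 == as_cp1251)
```

Nothing in the bytes says which reading is right; only knowledge of where the data came from can settle it.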