Introduction to Unicode (that you never got) [PyLadies Dublin, August 2023]

Unicode is ubiquitous: nearly all modern computers use it to store and transmit text. Yet most developers know very little about it, or panic when they see the message "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92". Panic no more! Eddie will give the introduction to Unicode that you never got, along with the best practices and knowledge you need to handle text from all around the world!

Introduction to Unicode
(that you never got)
Eddie Antonio Santos
eddieantonio.ca

Eddie Antonio Santos
PhD Candidate
Assistant lecturer,
software localisation
Software developer
Indigenous languages tech

Beliefs about Unicode
Unicode is a character
encoding
False

Beliefs about Unicode
My app is only in English, so I
don't need to worry about
Unicode
False

Beliefs about Unicode
“Isn't Unicode that emoji
company?”
Yes…

Beliefs about Unicode
Unicode is frustrating
It can be…
but I'm here to help!

• An international
standard for the
transmission and
storage of text on
computers
• A consortium
• a database of
characters and their
properties

A € ৪ ‽
Letters Symbols Digits Punctuation

A 🤔 ৪ ‽
Letters Emoji Digits Punctuation

A € ৪ ‽
U+0041 U+20AC U+09EA U+203D
code points
assigns numbers to characters

A € ৪ ‽
U+0041 U+20AC U+09EA U+203D
code points
a string is a sequence of code points
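
A minimal sketch of this idea in Python (chosen here because the error messages quoted in the talk are Python's). The built-in ord() and chr() convert between a one-character string and its code point; the variable names are illustrative only.

```python
# The four example characters from the slide, written as escapes so the
# code points are explicit: A, EURO SIGN, BENGALI DIGIT FOUR, INTERROBANG.
text = "\u0041\u20ac\u09ea\u203d"

# ord() maps a one-character string to its code point (an int).
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

# chr() goes the other way, so a string can be rebuilt from code points.
rebuilt = "".join(chr(cp) for cp in (0x0041, 0x20AC, 0x09EA, 0x203D))
print(rebuilt == text, len(text))
```

The length of the string is the number of code points (4 here), not the number of bytes needed to store it.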

A = U+0041
€ = U+20AC
৪ = U+09EA
‽ = U+203D
A code point is like a database primary key into the Unicode Character Database.
Every code point has properties.

Name
A  U+0041  LATIN CAPITAL LETTER A
€  U+20AC  EURO SIGN
৪  U+09EA  BENGALI DIGIT FOUR
‽  U+203D  INTERROBANG

General category
A  U+0041  LATIN CAPITAL LETTER A  (Letter, uppercase)
€  U+20AC  EURO SIGN  (Symbol, currency)
৪  U+09EA  BENGALI DIGIT FOUR  (Digit, decimal)
‽  U+203D  INTERROBANG  (Punctuation, other)

General category
A  U+0041  LATIN CAPITAL LETTER A  (Letter, uppercase)
a  U+0061  LATIN SMALL LETTER A  (Letter, lowercase)
ا  U+0627  ARABIC LETTER ALEF  (Letter, other)
あ  U+3042  HIRAGANA LETTER A  (Letter, other)

Bidirectionality Class
A  U+0041  LATIN CAPITAL LETTER A  (Strong left-to-right)
€  U+20AC  EURO SIGN  (Weak)
ا  U+0627  ARABIC LETTER ALEF  (Strong right-to-left)
😳  U+1F633  FLUSHED FACE  (Neutral)
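
The properties shown on these slides can be looked up from Python's standard unicodedata module. This is only a sketch: the describe() helper is a hypothetical name, not something from the talk. Python reports the properties as short codes: Lu is "Letter, uppercase", Sc is "Symbol, currency", Lo is "Letter, other", So is "Symbol, other"; for the bidirectional class, L is strong left-to-right, AL is a strong right-to-left Arabic letter, ET is a weak class, and ON is neutral.

```python
import unicodedata


def describe(ch):
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


# A, EURO SIGN, ARABIC LETTER ALEF, FLUSHED FACE
for ch in "A\u20ac\u0627\U0001F633":
    print(describe(ch))
```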

• Unicode assigns code
points to characters
• code points != bytes
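
To make "code points != bytes" concrete, the following sketch encodes one and the same code point with three different encodings. The choice of encodings is just for illustration.

```python
text = "\u20ac"  # EURO SIGN: one code point, so len(text) == 1

# The same code point becomes a different byte sequence in each encoding.
encodings = ("utf-8", "utf-16-le", "utf-32-le")
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    print(name, len(data), data.hex(" "))
```

One code point turns into 3, 2, and 4 bytes respectively, which is why bytes are meaningless as text until you know which encoding produced them.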

A (very) brief history of character
encodings
before computers
• Morse code (~1837)
• Baudot code (1870)
• BCDIC (1928)
computers invented
• EBCDIC (1963)
• ASCII (1967)
• ISO 8859 (1987)
• Latin-1, Latin-2, Windows
CP1252
Cze¶æ, jestem
Stanis³aw

UCS-2
Universal character set
(1991)
🥴🙄😬

UTF-8
👑😻💅🏼✨
Unicode Transformation Format, 8-bit
(~1993)

UTF-8
• backwards
compatible with
ASCII
• supports all
languages
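
Both bullets can be checked in a few lines of Python. This is a sketch with illustrative sample characters: pure ASCII text produces identical bytes under ASCII and UTF-8, while other characters take two to four bytes each.

```python
ascii_text = "Hello"

# Backwards compatibility: ASCII text is byte-for-byte identical in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Variable width: A, ARABIC LETTER ALEF, EURO SIGN, FLUSHED FACE.
widths = {ch: len(ch.encode("utf-8")) for ch in "A\u0627\u20ac\U0001F633"}
print(widths)
```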

â€œDia duitâ€•, said RoÃsÃn
Cze¶æ, jestem Stanis³aw
L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ
UnicodeDecodeError: 'utf-8'
codec can't decode byte 0x89 in
position 1: invalid start byte

mojibake
gÃ£rblÃ«d tÃªxt, due to mixed character encodings

mojibake
garbled text, due to mixed character encodings
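
Mojibake is easy to reproduce on purpose. The sketch below (the sample word is an assumption, not from the slides) writes text as UTF-8 and then reads the bytes back as Latin-1, which is the classic mix-up. Reversing the two steps repairs it, but only because we know exactly which wrong encoding was applied.

```python
original = "caf\u00e9"  # "café", with a precomposed e-acute

# Encode with one encoding, decode with another: mojibake.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Undo the mistake by reversing the two steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```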

UnicodeDecodeError: 'utf-8' codec
can't decode byte 0x89 in position
1: invalid start byte
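
An error of this kind can be triggered with any byte sequence that is not valid UTF-8. In this sketch the byte string is made up for illustration; 0x89 can only appear in UTF-8 as a continuation byte, so it is rejected when it shows up where a new character should start.

```python
data = b"a\x89bc"  # hypothetical input: 0x89 at index 1 is not valid UTF-8 here

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)
    position = err.start  # index of the offending byte
    print(message)

# errors="replace" substitutes U+FFFD instead of raising. The original
# byte is lost, so this hides the problem rather than fixing it.
lenient = data.decode("utf-8", errors="replace")
print(lenient)
```

The error says the bytes are not UTF-8; it does not say which encoding they actually are.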

Put a 💩 in your test suite
• most emoji live in a part of Unicode (beyond U+FFFF) that is poorly supported
• add a test that encodes a string containing an emoji and decodes it again
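
A minimal sketch of such a test follows. Both round_trip() and test_emoji_round_trip() are hypothetical names; in a real project, round_trip() would be replaced by whatever path your text actually takes (a database write and read, an HTTP request, a file).

```python
def round_trip(text, encoding="utf-8"):
    """Encode text to bytes and decode it again with the same encoding."""
    return text.encode(encoding).decode(encoding)


def test_emoji_round_trip():
    text = "pile of poo: \U0001F4A9"
    # The text must survive unchanged.
    assert round_trip(text) == text
    # The emoji is beyond U+FFFF, so it needs four bytes in UTF-8.
    assert len("\U0001F4A9".encode("utf-8")) == 4


test_emoji_round_trip()
```

With a legacy encoding such as Latin-1, the same round trip raises UnicodeEncodeError, which is exactly the kind of failure this test is meant to surface early.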

Use your head
• chardet and ftfy are great tools, but they can sometimes be
wrong!
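
The sketch below does not call chardet or ftfy. It only uses the standard library to illustrate why any detection tool has to guess: the same bytes are valid in several single-byte encodings, and each one yields different text. The greeting is the Polish example from the history slide.

```python
# "Cześć" stored as ISO 8859-2 (Latin-2) bytes.
raw = "Cze\u015b\u0107".encode("iso-8859-2")

# All three decodings succeed without error; only one is what the author wrote.
candidates = ("iso-8859-2", "latin-1", "cp1252")
readings = {enc: raw.decode(enc) for enc in candidates}

for enc, reading in readings.items():
    print(enc, reading)
```

Nothing in the bytes themselves says which reading is right. Knowing where the data came from is what settles it.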

12. What even is

14. database of characters

27. How this works in

29. (demo)

31. Bytes Code points Encoding

32. character encodings

37. society post utf-8

44. (demo)

45. recommendations

46. Always explicitly specify encoding

47. Always use UTF-8
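
These two recommendations can be combined in one habit: pass encoding="utf-8" every time a text file is opened. The sketch below writes to a temporary directory; the file name and sample text are assumptions for illustration.

```python
import os
import tempfile

text = "Dia duit, R\u00f3is\u00edn"

# Hypothetical file location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Say which encoding to use on the way out...
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# ...and on the way back in.
with open(path, encoding="utf-8") as f:
    read_back = f.read()

# On disk, the file holds exactly the UTF-8 bytes of the text.
with open(path, "rb") as f:
    raw = f.read()

print(read_back == text, raw == text.encode("utf-8"))
```

When the encoding argument is left out, the encoding that open() picks can depend on the platform and locale, so the same script may behave differently on another machine.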

50. (demo)

51. ASK ME QUESTIONS!
