Unicode: The hero or villain?
Input Validation of free-form Unicode text in Web Applications
Pawel Krawczyk
About
In application security since 90’s - pentesting, security architecture, SSDLC, DevSecOps
Active developer Python, C, Java https://github.com/kravietz
OWASP - SAML, PL/SQL, authentication cheatsheets
WebCookies.org - web privacy and security scanner
Immusec.com - competitive pentesting & incident response in UK
Contact
pkrawczyk@immusec.com
+44 7879 180015
https://www.linkedin.com/in/pawelkrawczyk
Definition of the problem
Free-form text validation
Unicode Primer
The rise and fall of letter “Ą”
Ą
Official name: A-OGONEK
“a letter in the Polish, Kashubian, Lithuanian, Creek, Navajo, Western Apache, Chiricahua, Osage, Hocąk, Mescalero, Gwich'in, Tutchone, and Elfdalian alphabets” (Wikipedia)
ASCII: just write “Ą” as “A”
(Pol.) KĄT = (Eng.) ANGLE
(Pol.) KAT = (Eng.) HANGMAN
Contextual guessing, confusion, misunderstandings: we had lots of fun on IRC back in the ’90s...
Windows-1250: “Ą” is 0xa5
ISO-8859-2: “Ą” is 0xa1
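The two code-page slides can be checked with a short sketch. Python is an assumption here (the deck later cites “Fluent Python”); `cp1250` and `iso8859_2` are Python's standard codec names for Windows-1250 and ISO-8859-2. The point is that the same letter maps to a different byte in each code page, and reading the byte with the other code page silently produces a different character.

```python
# Same character, different byte value in each legacy 8-bit code page.
letter = "Ą"

win1250 = letter.encode("cp1250")       # Windows-1250
latin2 = letter.encode("iso8859_2")     # ISO-8859-2
print(win1250.hex(), latin2.hex())      # a5 a1

# Decoding with the "other" code page raises no error;
# it simply yields a different character.
misread_as_latin2 = win1250.decode("iso8859_2")
misread_as_win1250 = latin2.decode("cp1250")
print(misread_as_latin2, misread_as_win1250)
```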
To możliwe! (“It’s possible!”)
Source: Wikipedia
Confused beyond diacritics
Confusing Unicode
Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
Stay Calm and Unicode
Rule #1: Forget everything you’ve learned about pre-Unicode characters and strings*
*including MBCS and UCS
Ą
Character: LATIN CAPITAL LETTER A WITH OGONEK
Code point: U+0104 (the “character’s catalog number”). This is not an encoding!
These are encodings:
0xC4 0x84 (UTF-8)
0x01 0x04 (UTF-16, big-endian)
0x04 0x01 (UTF-16, little-endian)
0x00 0x00 0x01 0x04 (UTF-32, big-endian)
0x04 0x01 0x00 0x00 (UTF-32, little-endian)
+AQQ- (UTF-7)
The same UTF-7 scheme applied to a script tag:
+ADw-script+AD4-alert(+ACc-xss+ACc-)+ADw-+AC8-script+AD4-
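A minimal Python sketch of the distinction between a code point and its encodings. The codec names are Python's standard ones; the variable names are illustrative only. It reproduces the byte sequences listed for U+0104 and decodes the UTF-7 payload to show what a filter looking only for literal angle brackets would miss.

```python
letter = "\u0104"  # LATIN CAPITAL LETTER A WITH OGONEK

# The code point is a number identifying the character, not a byte sequence.
print(f"U+{ord(letter):04X}")

# Each encoding turns that one code point into a different byte sequence.
names = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "utf-7"]
encoded = {name: letter.encode(name) for name in names}
for name, data in encoded.items():
    print(f"{name:10} {data!r}")

# The UTF-7 payload from the slide contains no literal "<" or ">" bytes,
# yet decodes to markup.
payload = b"+ADw-script+AD4-alert(+ACc-xss+ACc-)+ADw-+AC8-script+AD4-"
decoded_payload = payload.decode("utf-7")
print(decoded_payload)
```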
Homoglyphs
ՍnᎥⅽоԁе
Ս ARMENIAN CAPITAL LETTER SEH
n LATIN SMALL LETTER N
Ꭵ CHEROKEE LETTER V
ⅽ SMALL ROMAN NUMERAL ONE HUNDRED
о CYRILLIC SMALL LETTER O
ԁ CYRILLIC SMALL LETTER KOMI DE
е CYRILLIC SMALL LETTER IE
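The character names above can be listed programmatically. This is a small sketch using Python's standard `unicodedata` module; the look-alike word is written with escapes so the code points are explicit, and the variable names are illustrative.

```python
import unicodedata

# The look-alike word from the slide, one escape per non-Latin character.
spoof = "\u054dn\u13a5\u217d\u043e\u0501\u0435"

names = [unicodedata.name(ch) for ch in spoof]
for ch, name in zip(spoof, names):
    print(f"U+{ord(ch):04X} {name}")

# It may look like the Latin word, but the strings are not equal.
print(spoof == "Unicode")
```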
Surviving in the Unicode world
Rule #2: Inside your application, think of text as composed of characters; forget about bytes
Rule #3: Decode bytes into text on input; encode text into bytes on output
Input decoding example
[Code screenshot; labels: transport data, transport metadata, decoding, internal representation, exceptions!]
Text: Jolanta Kozak “Dziaberlak” (“Jabberwocky” by Lewis Carroll)
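The original slide showed code that is not preserved in this transcript. As a stand-in, here is a minimal sketch of input decoding in Python. The function name `decode_input`, the sample text and the idea of passing the declared charset as a plain string are all assumptions made for illustration; real transport metadata would typically come from something like an HTTP Content-Type header.

```python
def decode_input(raw: bytes, declared_charset: str) -> str:
    """Decode transport bytes into text, failing loudly on invalid data."""
    # errors="strict" is the default; it is spelled out to make the policy visible.
    return raw.decode(declared_charset, errors="strict")


sample = "Zażółć gęślą jaźń"       # hypothetical sample text (a Polish pangram)
raw = sample.encode("utf-8")        # transport data
text = decode_input(raw, "utf-8")   # internal representation: str

# Wrong transport metadata leads to an exception, not to silent corruption.
try:
    decode_input(raw, "ascii")
    rejected = False
except UnicodeDecodeError as exc:
    rejected = True
    print("rejected:", exc.reason)
```

An unknown charset name raises `LookupError` instead, which a caller would need to handle separately.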
Text processing, persistence* and fun
Example text processing
* do not persist before watching this presentation till the end
Output encoding
U+FEFF BYTE ORDER MARK (BOM)
“Unicode signature”
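A short sketch of what the byte order mark looks like on output, using Python's standard `utf-8-sig` codec and the `codecs.BOM_UTF8` constant. It also shows that a plain UTF-8 decoder keeps U+FEFF as a character in the text, which may matter for later validation.

```python
import codecs

text = "Ą"

plain = text.encode("utf-8")        # no signature
signed = text.encode("utf-8-sig")   # prefixed with the UTF-8 form of U+FEFF
print(signed == codecs.BOM_UTF8 + plain)

# A plain UTF-8 decoder keeps the BOM as a leading character;
# the utf-8-sig decoder strips it.
kept = signed.decode("utf-8")
stripped = signed.decode("utf-8-sig")
print(len(kept), len(stripped))
```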
When things go south
Wrong decoder
● Client told us so
● Client told us nothing, so we assumed so
● Client told us ‘utf-8’, but we ignored it and assumed ‘ascii’ because we have been writing this software since 1986
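A sketch of the two ways a wrong decoder can behave, in Python. The input word and the choice of `cp1250` as the wrongly assumed code page are illustrative assumptions. An 8-bit code page accepts almost any byte, so the mistake passes silently; ASCII at least fails visibly.

```python
data = "KĄT".encode("utf-8")   # what the client actually sent (4 bytes)

# Wrong 8-bit decoder: no error, just different text (mojibake).
garbled = data.decode("cp1250")
print(garbled)

# Wrong ASCII decoder: the non-ASCII bytes raise an exception.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```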
What to do - a policy decision!
Reject incorrect information (fail closed)
Partially lose information (fail open)
Recover information (fail pretending it’s fine)
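The three policies map fairly directly onto decoder error handling. The sketch below is one possible reading in Python, not the author's code: the function names and the `cp1250` fallback are assumptions, and "recover" here simply means guessing a legacy code page when UTF-8 decoding fails.

```python
data = "KĄT".encode("cp1250")   # legacy bytes arriving where UTF-8 was expected


def fail_closed(raw: bytes) -> str:
    # Reject: invalid input raises UnicodeDecodeError.
    return raw.decode("utf-8")


def fail_open(raw: bytes) -> str:
    # Partially lose information: bad bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


def recover(raw: bytes, fallback: str = "cp1250") -> str:
    # Guess a legacy encoding; the guess may be wrong for other inputs.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


try:
    fail_closed(data)
    closed_rejected = False
except UnicodeDecodeError:
    closed_rejected = True

lossy = fail_open(data)
guessed = recover(data)
print(closed_rejected, lossy, guessed)
```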
Validation techniques
Character category enforcement
Source: Unicode Standard Annex #44
Rule #5: Enforce Unicode categories
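A minimal sketch of category enforcement with Python's `unicodedata.category`, which returns the two-letter General Category code defined in UAX #44. The helper name `rejected_chars` and the particular allowlist are assumptions chosen for illustration; the right allowlist is a policy decision.

```python
import unicodedata


def rejected_chars(text, allowed_categories):
    """Return the characters whose General Category is not allowed."""
    return [ch for ch in text if unicodedata.category(ch) not in allowed_categories]


# Hypothetical policy: letters of any case plus space separators.
LETTERS_AND_SPACES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Zs"}

ok = rejected_chars("Portia Sutcliffe", LETTERS_AND_SPACES)
apostrophe = rejected_chars("Cynthia O'Keefe", LETTERS_AND_SPACES)
markup = rejected_chars("<script>", LETTERS_AND_SPACES)
print(ok, apostrophe, markup)
```

The apostrophe is rejected because its category is Po (other punctuation), which this allowlist leaves out.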
Character script enforcement
Scripts used in the example:
● LATIN
● CYRILLIC
● CJK
● ARABIC
Rule #6: Enforce Unicode scripts
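Python's standard `unicodedata` module does not expose the Unicode Script property, so the sketch below approximates it from the first word of each letter's character name. This is a heuristic and an assumption of this example, not necessarily the method used in the talk; a dedicated library with real script data would be more reliable. The helper name `letter_scripts` is hypothetical.

```python
import unicodedata


def letter_scripts(text):
    """Approximate the set of scripts used by the letters in text."""
    scripts = set()
    for ch in text:
        # Only letters (General Category L*) are considered.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


latin = letter_scripts("Unicode")
spoofed = letter_scripts("\u054dn\u13a5\u217d\u043e\u0501\u0435")  # look-alike word
cjk = letter_scripts("\u4e2d\u6587")
arabic = letter_scripts("\u0633\u0644\u0627\u0645")
print(latin, spoofed, cjk, arabic)

# A hypothetical single-script policy: accept only text written in one script.
print(len(latin) == 1, len(spoofed) == 1)
```

The Roman numeral in the look-alike word is skipped here because its category is Nl rather than a letter, which is one reason to combine this check with category enforcement.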
Text direction enforcement
RIGHT-TO-LEFT OVERRIDE U+202E
Visual spoof*
*now prevented by many client programs
Source: Unicode Standard Annex #44
Rule #7: Enforce consistent text direction
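A sketch of direction checks using `unicodedata.bidirectional`, which returns the Bidi_Class value from the Unicode Character Database. The helper names, the sample file name and the decision to treat any explicit directional control as suspicious are assumptions of this example.

```python
import unicodedata

# Bidi classes of explicit directional formatting characters.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def has_bidi_controls(text):
    """True if text contains an explicit embedding, override or isolate."""
    return any(unicodedata.bidirectional(ch) in BIDI_CONTROLS for ch in text)


def strong_directions(text):
    """Return the set of strong directions found in text."""
    found = set()
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            found.add("LTR")
        elif cls in ("R", "AL"):
            found.add("RTL")
    return found


spoofed_name = "invoice\u202efdp.exe"   # hypothetical file name with U+202E
print(has_bidi_controls(spoofed_name))
print(strong_directions("abc"), strong_directions("\u05e9\u05dc\u05d5\u05dd"))
print(strong_directions("abc \u05e9"))
```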
Normalization
When is café not café?
Single character U+00E9
Two characters, U+0065 and U+0301, combined on display
When is it a problem?
● Collation
● Sorting
● Comparison
● Persistence of data with different composition
Unicode normalization
● NFC = Normalization Form “C”
● Converts Unicode characters to a single, consistent form
● U+0065 U+0301 and U+00E9 are Unicode “canonical equivalents”
NFC, NFD, NFKC, NFKD?
● NFC will compose, make shorter - é becomes U+00E9
● NFD will decompose, make longer - é becomes U+0065 U+0301
● NFKC and NFKD will also replace “compatibility characters”
○ Possible loss of information!
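The canonical forms can be tried out with `unicodedata.normalize` from the Python standard library. This small sketch uses the café example: the two spellings compare unequal until both are brought to the same normalization form.

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same appearance, different code point sequences.
print(composed == decomposed, len(composed), len(decomposed))

nfc = unicodedata.normalize("NFC", decomposed)   # compose
nfd = unicodedata.normalize("NFD", composed)     # decompose
print(nfc == composed, nfd == decomposed)
```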
NFKC, NFKD
This is a significant loss of information!
Source: Luciano Ramalho “Fluent Python”
Compatibility normalization
Ⅻ
● Precomposed Roman numerals Ⅻ U+216B
● Ⅹ U+2169 Ⅰ U+2160 Ⅰ U+2160 (Roman numerals)
● X U+0058 I U+0049 I U+0049 (Latin letters)
Another typical example:
● ¼ U+00BC → 1 U+0031 ⁄ U+2044 4 U+0034
NFKC, NFKD
NFKC replaces the single character Ⅻ U+216B ROMAN NUMERAL TWELVE with three Roman digits represented by the Latin letters X and I
Useful for search and comparison
● is there “X” in “XII”?
● is there “f” in “ffi”?
Many typesetting programs will replace popular “compatibility sequences” with appropriate Unicode characters:
--> replaced by → U+2192 RIGHTWARDS ARROW
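A short sketch of compatibility normalization with `unicodedata.normalize`. It uses the Roman numeral and fraction from the slides plus the ﬃ ligature (U+FB03), which is an assumption about what the “ffi” example originally showed. Canonical NFC leaves these characters alone; NFKC replaces them, which helps searching but loses the original character.

```python
import unicodedata

twelve = "\u216b"     # ROMAN NUMERAL TWELVE
quarter = "\u00bc"    # VULGAR FRACTION ONE QUARTER
ligature = "\ufb03"   # LATIN SMALL LIGATURE FFI

# NFC keeps compatibility characters as they are.
print(unicodedata.normalize("NFC", twelve) == twelve)

# NFKC replaces them with their compatibility decompositions.
twelve_k = unicodedata.normalize("NFKC", twelve)
quarter_k = unicodedata.normalize("NFKC", quarter)
ligature_k = unicodedata.normalize("NFKC", ligature)
print(twelve_k, quarter_k, ligature_k)

# Substring search only works after compatibility normalization.
print("X" in twelve, "X" in twelve_k)
print("f" in ligature, "f" in ligature_k)
```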
Rule #8: Normalize Unicode text
Validation strategies
Policy decisions
Let’s talk about your user base for a moment…
● Are they expected to communicate in Chinese, English, Polish, Arabic…?
● Do we expect text in Linear-B, Ugaritic, Klingon?
○ Can you process data in any of these languages?
● What kind of text is expected where?
● E.g. names - are they composed of letters only (“Portia Sutcliffe”)
○ Or maybe we also expect digits and punctuation (“Cynthia O'Keefe”, “Howard Upperton-Wildingham III”)
● Define what is valid text
● Define appropriate valid categories, scripts and text directions
● Normalize Unicode input prior to validation
● Do we have free-text fields?
○ If yes, how free is the “free text”?
○ Letters, digits, punctuation?
○ Symbols?
■ Because if someone explains “I clicked File > Properties > General” it takes symbols
● Is all this part of localization for a given region?
○ Along with database collation, currency, numbers...
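The policy questions above can be turned into a small validator. The sketch below is one hypothetical policy for a name field, written in Python: normalize to NFC first, then allow letters, spaces and a short list of punctuation. The function name, the allowlists and the error message are all assumptions, not a recommended universal rule.

```python
import unicodedata

# Hypothetical policy for a "name" field.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Zs"}
ALLOWED_EXTRA = {"'", "-", "."}


def validate_name(raw: str) -> str:
    """Normalize, then reject any character outside the policy."""
    text = unicodedata.normalize("NFC", raw)
    bad = [
        ch for ch in text
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
        and ch not in ALLOWED_EXTRA
    ]
    if bad:
        codes = ", ".join(f"U+{ord(ch):04X}" for ch in bad)
        raise ValueError(f"characters not allowed in a name: {codes}")
    return text


print(validate_name("Cynthia O'Keefe"))
print(validate_name("Howard Upperton-Wildingham III"))

# Normalizing first lets a decomposed accent pass as a single letter.
accented = validate_name("Jose\u0301")

try:
    validate_name("File > Properties > General")
    symbols_rejected = False
except ValueError:
    symbols_rejected = True
```

A free-text field would need a wider allowlist (symbols such as “>”), which is exactly the kind of per-field decision the slide asks for.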
Questions?
pawel.krawczyk@hush.com
+44 7879 180015
https://www.linkedin.com/in/pawelkrawczyk/
🤔

Más contenido relacionado

Más de Pawel Krawczyk

Paweł Krawczyk - Ekonomia bezpieczeństwa
Paweł Krawczyk - Ekonomia bezpieczeństwaPaweł Krawczyk - Ekonomia bezpieczeństwa
Paweł Krawczyk - Ekonomia bezpieczeństwaPawel Krawczyk
 
Are electronic signature assumptions realistic
Are electronic signature assumptions realisticAre electronic signature assumptions realistic
Are electronic signature assumptions realisticPawel Krawczyk
 
Dlaczego przejmować się bezpieczeństwem aplikacji (pol)
Dlaczego przejmować się bezpieczeństwem aplikacji (pol)Dlaczego przejmować się bezpieczeństwem aplikacji (pol)
Dlaczego przejmować się bezpieczeństwem aplikacji (pol)Pawel Krawczyk
 
Filtrowanie sieci - Panoptykon
Filtrowanie sieci - PanoptykonFiltrowanie sieci - Panoptykon
Filtrowanie sieci - PanoptykonPawel Krawczyk
 
Pragmatic view on Electronic Signature directive 1999 93
Pragmatic view on Electronic Signature directive 1999 93Pragmatic view on Electronic Signature directive 1999 93
Pragmatic view on Electronic Signature directive 1999 93Pawel Krawczyk
 
Why care about application security
Why care about application securityWhy care about application security
Why care about application securityPawel Krawczyk
 
Krawczyk Ekonomia Bezpieczenstwa 2
Krawczyk   Ekonomia Bezpieczenstwa 2Krawczyk   Ekonomia Bezpieczenstwa 2
Krawczyk Ekonomia Bezpieczenstwa 2Pawel Krawczyk
 
Audyt Wewnetrzny W Zakresie Bezpieczenstwa
Audyt Wewnetrzny W Zakresie BezpieczenstwaAudyt Wewnetrzny W Zakresie Bezpieczenstwa
Audyt Wewnetrzny W Zakresie BezpieczenstwaPawel Krawczyk
 
Kryptografia i mechanizmy bezpieczenstwa
Kryptografia i mechanizmy bezpieczenstwaKryptografia i mechanizmy bezpieczenstwa
Kryptografia i mechanizmy bezpieczenstwaPawel Krawczyk
 
Zaufanie W Systemach Informatycznych
Zaufanie W Systemach InformatycznychZaufanie W Systemach Informatycznych
Zaufanie W Systemach InformatycznychPawel Krawczyk
 
Real Life Information Security
Real Life Information SecurityReal Life Information Security
Real Life Information SecurityPawel Krawczyk
 
Europejskie Ramy Interoperacyjności 2.0
Europejskie Ramy Interoperacyjności 2.0Europejskie Ramy Interoperacyjności 2.0
Europejskie Ramy Interoperacyjności 2.0Pawel Krawczyk
 

Más de Pawel Krawczyk (13)

Paweł Krawczyk - Ekonomia bezpieczeństwa
Paweł Krawczyk - Ekonomia bezpieczeństwaPaweł Krawczyk - Ekonomia bezpieczeństwa
Paweł Krawczyk - Ekonomia bezpieczeństwa
 
Are electronic signature assumptions realistic
Are electronic signature assumptions realisticAre electronic signature assumptions realistic
Are electronic signature assumptions realistic
 
Dlaczego przejmować się bezpieczeństwem aplikacji (pol)
Dlaczego przejmować się bezpieczeństwem aplikacji (pol)Dlaczego przejmować się bezpieczeństwem aplikacji (pol)
Dlaczego przejmować się bezpieczeństwem aplikacji (pol)
 
Filtrowanie sieci - Panoptykon
Filtrowanie sieci - PanoptykonFiltrowanie sieci - Panoptykon
Filtrowanie sieci - Panoptykon
 
Pragmatic view on Electronic Signature directive 1999 93
Pragmatic view on Electronic Signature directive 1999 93Pragmatic view on Electronic Signature directive 1999 93
Pragmatic view on Electronic Signature directive 1999 93
 
Why care about application security
Why care about application securityWhy care about application security
Why care about application security
 
Source Code Scanners
Source Code ScannersSource Code Scanners
Source Code Scanners
 
Krawczyk Ekonomia Bezpieczenstwa 2
Krawczyk   Ekonomia Bezpieczenstwa 2Krawczyk   Ekonomia Bezpieczenstwa 2
Krawczyk Ekonomia Bezpieczenstwa 2
 
Audyt Wewnetrzny W Zakresie Bezpieczenstwa
Audyt Wewnetrzny W Zakresie BezpieczenstwaAudyt Wewnetrzny W Zakresie Bezpieczenstwa
Audyt Wewnetrzny W Zakresie Bezpieczenstwa
 
Kryptografia i mechanizmy bezpieczenstwa
Kryptografia i mechanizmy bezpieczenstwaKryptografia i mechanizmy bezpieczenstwa
Kryptografia i mechanizmy bezpieczenstwa
 
Zaufanie W Systemach Informatycznych
Zaufanie W Systemach InformatycznychZaufanie W Systemach Informatycznych
Zaufanie W Systemach Informatycznych
 
Real Life Information Security
Real Life Information SecurityReal Life Information Security
Real Life Information Security
 
Europejskie Ramy Interoperacyjności 2.0
Europejskie Ramy Interoperacyjności 2.0Europejskie Ramy Interoperacyjności 2.0
Europejskie Ramy Interoperacyjności 2.0
 

Último

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile EnvironmentVictorSzoltysek
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 

Último (20)

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 

Unicode the hero or villain

  • 1. Unicode: The hero or villain? Input Validation of free-form Unicode text in Web Applications Pawel Krawczyk
  • 2. Unicode: The hero or villain? In application security since 90’s - pentesting, security architecture, SSDLC, DevSecOps Active developer Python, C, Java https://github.com/kravietz OWASP - SAML, PL/SQL, authentication cheatsheets WebCookies.org - web privacy and security scanner Immusec.com - competitive pentesting & incident response in UK About Contact pkrawczyk@immusec.com +44 7879 180015 https://www.linkedin.com/in/pawelkrawczyk Pawel Krawczyk
  • 4. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  • 5. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  • 6. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  • 7. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  • 9. Author name her Official name: A-OGONEK “a letter in the Polish, Kashubian, Lithuanian, Creek, Navajo, Western Apache, Chiricahua, Osage, Hocąk, Mescalero, Gwich'in, Tutchone, and Elfdalian alphabets” (Wikipedia) The rise and fall of letter “Ą” Author name here This is the abstract title Ą
  • 10. Unicode: The hero or villain? ASCII: just write “Ą” as “A” Pawel Krawczyk
  • 11. Unicode: The hero or villain? ASCII: just write “Ą” as “A” Pawel Krawczyk (Pol.) KĄT = (Eng.) ANGLE (Pol.) KAT = (Eng.) HANGMAN Contextual guessing, confusion, misunderstandings, we had lots of fun on IRC back in 90’s...
  • 12. Unicode: The hero or villain? Windows-1250: “Ą” is 0xa5 Pawel Krawczyk
  • 13. Unicode: The hero or villain? ISO-8859-2: “Ą” is 0xa1 Pawel Krawczyk
  • 14. Unicode: The hero or villain? To możliwe! Pawel Krawczyk Source: Wikipedia
  • 15. Unicode: The hero or villain? Confused beyond diacritics Pawel Krawczyk
  • 16. Unicode: The hero or villain? Confused beyond diacritics Pawel Krawczyk
  • 17. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  • 18. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  • 19. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  • 20. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  • 21. Stay Calm and Unicode
  • 22. Unicode: The hero or villain? Forget everything you’ve learned about pre-Unicode characters and strings* *including MBCS and UCS Rule #1 Pawel Krawczyk
  • 23. Unicode: The hero or villain? Pawel Krawczyk Ą
  • 24. Unicode: The hero or villain? Pawel Krawczyk Ą Character
  • 25. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK
  • 26. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK
  • 27. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104
  • 28. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is not encoding
  • 29. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is not encoding
  • 30. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is not encoding!!!
  • 31. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is encoding 0xC4 0x84
  • 32. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04
  • 33. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01
  • 34. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04
  • 35. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04 0x04 0x01 0x00 0x00
  • 36. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04 0x04 0x01 0x00 0x00 +AQQ- UTF-7
  • 37. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04 0x04 0x01 0x00 0x00 +AQQ- UTF-7 +ADw-script+AD4-alert(+ACc- xss+ACc-)+ADw-+AC8-script+AD4-
  • 39. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе
  • 40. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH
  • 41. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N
  • 42. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V
  • 43. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED
  • 44. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED CYRILLIC SMALL LETTER O
  • 45. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED CYRILLIC SMALL LETTER O CYRILLIC SMALL LETTER KOMI DE
  • 46. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED CYRILLIC SMALL LETTER O CYRILLIC SMALL LETTER KOMI DE CYRILLIC SMALL LETTER IE
  • 47. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе
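The spoofed word from these slides can be taken apart with the standard `unicodedata` module. A small sketch, with the string written as escapes so the code points are unambiguous:

```python
import unicodedata

# Renders much like "Unicode", but only the "n" is a Latin letter.
spoof = "\u054d\u006e\u13a5\u217d\u043e\u0501\u0435"

names = [unicodedata.name(ch) for ch in spoof]
for ch, name in zip(spoof, names):
    print(f"U+{ord(ch):04X}  {name}")

# A plain string comparison is not fooled, but a human reader is.
looks_equal_to_code = (spoof == "Unicode")
print(looks_equal_to_code)
```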
  • 49. Unicode: The hero or villain? Inside your application think text composed of characters; forget about bytes Rule #2 Pawel Krawczyk
  • 50. Unicode: The hero or villain? Decode bytes into text on input Encode text into bytes on output Rule #3 Pawel Krawczyk
  • 51. Unicode: The hero or villain? Input decoding example Pawel Krawczyk Transport metadata Decoding Internal representation Transport data Text: Jolanta Kozak “Dziaberlak” (“Jabberwocky” by Lewis Carroll) Exceptions!
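Input decoding could look roughly like the sketch below. It assumes the transport metadata is an HTTP-style `Content-Type` header; `decode_input` and its `default` parameter are hypothetical names, not part of any framework.

```python
def decode_input(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode transport bytes into text using the declared charset."""
    charset = default
    # Parameters follow the media type, e.g. "text/plain; charset=iso-8859-2".
    for param in content_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key.strip().lower() == "charset" and value.strip():
            charset = value.strip().strip('"')
    # errors="strict" raises UnicodeDecodeError when the bytes do not fit the
    # declared charset; an unknown charset name raises LookupError.
    return body.decode(charset, errors="strict")


raw = "K\u0104T".encode("iso-8859-2")   # b"K\xa1T"
text = decode_input(raw, "text/plain; charset=iso-8859-2")
print(text)
```

The exceptions are the useful part: they are the signal that the metadata and the data disagree.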
  • 52. Unicode: The hero or villain? Text processing, persistence* and fun Pawel Krawczyk Example text processing * do not persist before watching this presentation till the end
  • 53. Unicode: The hero or villain? Output encoding Pawel Krawczyk U+FEFF BYTE ORDER MARK (BOM) “Unicode signature”
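On output, the byte order mark is just U+FEFF encoded at the start of the stream. A short sketch using Python's standard codecs; whether to emit a BOM at all depends on what the consumer expects.

```python
import codecs

text = "K\u0104T"

plain = text.encode("utf-8")        # no signature
signed = text.encode("utf-8-sig")   # UTF-8 form of U+FEFF, then the text
utf16 = text.encode("utf-16")       # BOM first, then native byte order

print(signed[:3] == codecs.BOM_UTF8)
print(utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# A decoder that does not expect the signature keeps it as a character.
with_bom = signed.decode("utf-8")
without_bom = signed.decode("utf-8-sig")
```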
  • 54. When things go south
  • 55. Unicode: The hero or villain? Wrong decoder Pawel Krawczyk ● Client told us so ● Client told us nothing, so we assumed so ● Client told us ‘utf-8’, but we ignored it and assumed ‘ascii’ because we have been writing this software since 1986
  • 56. Unicode: The hero or villain? What to do - a policy decision! Pawel Krawczyk Reject incorrect information (fail closed) Partially lose information (fail open) Recover information (fail pretending it’s fine)
  • 57. Unicode: The hero or villain? Policy decision Pawel Krawczyk Reject incorrect information (fail closed) Partially lose information (fail open) Recover information (fail pretending it’s fine)
  • 58. Unicode: The hero or villain? Policy decision Pawel Krawczyk Reject incorrect information (fail closed) Partially lose information (fail open) Recover information (fail pretending it’s fine)
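The three policies map fairly directly onto Python's decoder error handlers. The sketch below assumes the "wrong decoder" is ASCII applied to UTF-8 input; `decode_with_fallbacks` and its candidate list are hypothetical.

```python
data = "K\u0104T".encode("utf-8")   # b"K\xc4\x84T", but we assume ASCII

# 1. Fail closed: reject the input.
try:
    data.decode("ascii", errors="strict")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# 2. Fail open: keep going and lose part of the information.
lossy_replace = data.decode("ascii", errors="replace")  # bad bytes become U+FFFD
lossy_ignore = data.decode("ascii", errors="ignore")    # bad bytes are dropped


# 3. Recover: guess another encoding and pretend it is fine.
def decode_with_fallbacks(raw, candidates=("ascii", "utf-8", "iso-8859-2")):
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


recovered, guessed = decode_with_fallbacks(data)
```

Note that `errors="ignore"` silently turns KĄT (angle) into KAT (hangman), and that a single-byte codec such as ISO-8859-2 accepts any byte sequence, so a fallback list ending in one never fails; it only guesses.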
  • 60. Unicode: The hero or villain? Character category enforcement Pawel Krawczyk
  • 61. Unicode: The hero or villain? Character category enforcement Pawel Krawczyk Source: Unicode Standard Annex #44
  • 62. Unicode: The hero or villain? Enforce Unicode categories Rule #5 Pawel Krawczyk
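Category enforcement can be sketched with `unicodedata.category`. The allow-list here is an assumption chosen for a personal-name field, not a recommendation; `disallowed_chars` is a hypothetical helper.

```python
import unicodedata

# Letters, combining marks, spaces, dashes and "other punctuation".
# Po is broad: it covers the apostrophe but also "/", "!" and others.
ALLOWED_NAME_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Zs", "Pd", "Po"}


def disallowed_chars(text, allowed=ALLOWED_NAME_CATEGORIES):
    """Return (character, category) pairs that fall outside the allow-list."""
    found = []
    for ch in text:
        category = unicodedata.category(ch)
        if category not in allowed:
            found.append((ch, category))
    return found


print(disallowed_chars("Cynthia O'Keefe"))
print(disallowed_chars("R2-D2"))
print(disallowed_chars("<script>"))
```

Reporting the offending characters, rather than returning a bare boolean, makes it easier to explain a rejection to the user.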
  • 63. Unicode: The hero or villain? Character script enforcement Pawel Krawczyk Scripts used in the example: ● LATIN ● CYRILLIC ● CJK ● ARABIC
  • 64. Unicode: The hero or villain? Enforce Unicode scripts Rule #6 Pawel Krawczyk
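Python's standard library does not expose the Unicode Script property, so the sketch below falls back on a heuristic: the first word of the character name. That is an assumption that works for the four labels listed on the previous slide but is not the real Script property (for example, it says CJK where the property says Han). `scripts_used` and `single_allowed_script` are hypothetical names.

```python
import unicodedata


def scripts_used(text):
    """Approximate the scripts of the letters in text by character-name prefix."""
    scripts = set()
    for ch in text:
        if ch.isalpha():  # skips digits, spaces, punctuation and number letters
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def single_allowed_script(text, allowed):
    """True when all letters share one script and that script is allowed."""
    used = scripts_used(text)
    return len(used) <= 1 and used <= set(allowed)


ALLOWED = {"LATIN", "CYRILLIC", "CJK", "ARABIC"}
print(scripts_used("p\u0430ypal"))          # Cyrillic "a" hidden in a Latin word
print(single_allowed_script("p\u0430ypal", ALLOWED))
```

Because `isalpha()` skips characters such as U+217D SMALL ROMAN NUMERAL ONE HUNDRED, a check like this complements category enforcement rather than replacing it.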
  • 65. Unicode: The hero or villain? Text direction enforcement Pawel Krawczyk RIGHT-TO-LEFT OVERRIDE U+202E Visual spoof* *now prevented by many client programs
  • 66. Unicode: The hero or villain? Text direction enforcement Pawel Krawczyk Source: Unicode Standard Annex #44
  • 67. Unicode: The hero or villain? Enforce consistent text direction Rule #7 Pawel Krawczyk
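Text direction can be checked through the bidirectional class that `unicodedata.bidirectional` returns for each character. A minimal sketch, assuming the policy is "no explicit direction controls and no mixing of strong left-to-right and right-to-left characters"; `direction_problems` is a hypothetical helper.

```python
import unicodedata

# Explicit embedding, override and isolate controls, e.g. U+202E is "RLO".
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def direction_problems(text):
    """Return a list of direction policy violations found in text."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    problems = []
    if any(c in BIDI_CONTROLS for c in classes):
        problems.append("explicit bidi control")
    if any(c in STRONG_LTR for c in classes) and any(c in STRONG_RTL for c in classes):
        problems.append("mixed direction")
    return problems


# Displays as "invoiceexe.pdf" in a renderer that honours the override.
print(direction_problems("invoice\u202efdp.exe"))
```

A real policy for users who legitimately mix, say, Arabic and English would need to be looser than this one.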
  • 69. Unicode: The hero or villain? When is café not café? Pawel Krawczyk Single character U+00E9 Two characters, U+0065 and U+0301, combined on display
  • 70. Unicode: The hero or villain? When is it a problem? Pawel Krawczyk ● Collation ● Sorting ● Comparison ● Persistence of data with different composition
  • 71. Unicode: The hero or villain? Unicode normalization Pawel Krawczyk ● NFC = Normalization Form “C” ● Converts Unicode characters to a single, consistent form ● U+0065 U+0301 and U+00E9 are Unicode “canonical equivalents”
  • 72. Unicode: The hero or villain? NFC, NFD, NFKC, NFKD? Pawel Krawczyk ● NFC will compose, make shorter - é becomes U+00E9 ● NFD will decompose, make longer - é becomes U+0065 U+0301 ● NFKC and NFKD will also replace “compatibility characters” ○ Possible loss of information!
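The café case can be reproduced with `unicodedata.normalize`. A short sketch, where the decomposed form is the letter e (U+0065) followed by U+0301 COMBINING ACUTE ACCENT:

```python
import unicodedata

composed = "caf\u00e9"      # 4 code points
decomposed = "cafe\u0301"   # 5 code points, same appearance

print(composed == decomposed)           # compared code point by code point
print(len(composed), len(decomposed))

nfc = unicodedata.normalize("NFC", decomposed)   # compose, make shorter
nfd = unicodedata.normalize("NFD", composed)     # decompose, make longer
print(nfc == composed, nfd == decomposed)
```

Comparison, sorting and uniqueness constraints only behave once both sides have been brought to the same form.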
  • 73. Unicode: The hero or villain? NFKC, NFKD Pawel Krawczyk This is a significant loss of information! Source: Luciano Ramalho “Fluent Python”
  • 74. Unicode: The hero or villain? Compatibility normalization Pawel Krawczyk ● Precomposed Roman numerals Ⅻ U+216B Ⅻ
  • 75. Unicode: The hero or villain? Compatibility normalization Pawel Krawczyk ● Precomposed Roman numerals Ⅻ U+216B ● Ⅹ U+2169 Ⅰ U+2160 Ⅰ U+2160 (Roman numerals) Ⅻ
  • 76. Unicode: The hero or villain? Compatibility normalization Pawel Krawczyk ● Precomposed Roman numerals Ⅻ U+216B ● Ⅹ U+2169 Ⅰ U+2160 Ⅰ U+2160 (Roman numerals) ● X U+0058 I U+0049 I U+0049 (Latin letters) Another typical example: ● ¼ U+00BC → 1 U+0031 ⁄ U+2044 4 U+0034 Ⅻ
  • 77. Unicode: The hero or villain? NFKC, NFKD Pawel Krawczyk NFKC replaces the single character Ⅻ U+216B ROMAN NUMERAL TWELVE with three Roman digits represented by Latin letters X and I Useful for search and comparison ● is there “X” in “XII”? ● is there “f” in “ffi”? Many typesetting programs will replace popular “compatibility sequences” with appropriate Unicode characters: --> replaced by → U+2192 RIGHTWARDS ARROW
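Compatibility normalization can be tried out the same way. In this sketch the Roman numeral and the fraction come from the slides; the ligature U+FB03 and the superscript in "4²" are extra illustrations added here as assumptions about typical input.

```python
import unicodedata


def nfkc(text):
    return unicodedata.normalize("NFKC", text)


twelve = "\u216b"      # ROMAN NUMERAL TWELVE
quarter = "\u00bc"     # VULGAR FRACTION ONE QUARTER
ligature = "\ufb03"    # LATIN SMALL LIGATURE FFI

print(nfkc(twelve))    # three Latin letters
print(nfkc(quarter))   # digit, U+2044 FRACTION SLASH, digit
print(nfkc(ligature))

# Search works after NFKC, but not before.
print("X" in twelve, "X" in nfkc(twelve))

# The loss of information: a superscript becomes an ordinary digit.
print(nfkc("4\u00b2"))
```

NFC leaves all of these characters untouched, which is why NFKC is usually applied to a search key or comparison copy rather than to the stored text.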
  • 78. Unicode: The hero or villain? Normalize Unicode text Rule #8 Pawel Krawczyk
  • 80. Unicode: The hero or villain? Policy decisions Pawel Krawczyk Let’s talk about your user base for a moment… ● Are they expected to communicate in Chinese, English, Polish, Arabic…? ● Do we expect text in Linear-B, Ugaritic, Klingon? ○ Can you process data in any of these languages? ● What kind of text is expected where? ● E.g. names - are they composed of letters only (“Portia Sutcliffe”) ○ Or maybe we also expect digits and punctuation (“Cynthia O'Keefe”, “Howard Upperton-Wildingham III”) ● Define what is valid text ● Define appropriate valid categories, scripts and text directions ● Normalize Unicode input prior to validation
  • 81. Unicode: The hero or villain? Policy decisions Pawel Krawczyk Let’s talk about your user base for a moment… ● Are they expected to communicate in Chinese, English, Polish, Arabic…? ● Do we expect text in Linear-B, Ugaritic, Klingon? ○ Can you process data in any of these languages? ● What kind of text is expected where? ● E.g. names - are they composed of letters only (“Portia Sutcliffe”) ○ Or maybe we also expect digits and punctuation (“Cynthia O'Keefe”, “Howard Upperton-Wildingham III”) ● Define what is valid text ● Define appropriate valid categories, scripts and text directions ● Normalize Unicode input prior to validation ● Do we have free-text fields? ○ If yes, how free is the “free text”? ○ Letters, digits, punctuation? ○ Symbols? ■ Because if someone explains “I clicked File > Properties > General” it takes symbols ● Is all this part of localization for a given region? ○ Along with database collation, currency, numbers...
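One way to write such policy decisions down is a per-field policy object that normalizes first and validates second. This is a sketch under assumptions: `FieldPolicy`, `NAME`, `FREE_TEXT` and `validate` are hypothetical names, and the category sets are examples rather than recommendations.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPolicy:
    categories: frozenset          # allowed Unicode general categories
    normalization: str = "NFC"     # form applied before validation


NAME = FieldPolicy(frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Zs", "Pd", "Po"}))
# Free text additionally takes digits, symbols, brackets and quotes.
FREE_TEXT = FieldPolicy(NAME.categories | {"Nd", "Sm", "Sc", "Ps", "Pe", "Pi", "Pf"})


def validate(text, policy):
    """Normalize text, then reject any character outside the policy."""
    normalized = unicodedata.normalize(policy.normalization, text)
    for ch in normalized:
        category = unicodedata.category(ch)
        if category not in policy.categories:
            raise ValueError(f"U+{ord(ch):04X} ({category}) is not allowed here")
    return normalized


print(validate("Howard Upperton-Wildingham III", NAME))
print(validate("I clicked File > Properties > General", FREE_TEXT))
```

Returning the normalized string means the caller persists exactly what was validated.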
  • 82. Unicode: The hero or villain? Questions? Pawel Krawczyk pawel.krawczyk@hush.com +44 7879 180015 https://www.linkedin.com/in/pawelkrawczyk/ 🤔