Unicode and Character Sets
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
    - Joel Spolsky, co-founder of Stack Overflow and author of 《More Joel on Software》

To a person: A
To a computer: 0100 0001

ASCII: 7 bits, codes 0~127 (the printable characters are 32~126)
8-bit charsets: ISO-8859-1, ISO-8859-2, ISO-8859-3, ..., ISO-8859-16
In ISO-8859-1, 0xC0 is À
In ISO-8859-7, 0xC0 is ΐ
The same octet has different meanings in different charsets!

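The point about 0xC0 can be checked directly. The following is a minimal Python sketch (Python is used here only for illustration; the variable names are made up for this example) that decodes the same octet under the two charsets named on the slide.

```python
# One octet, interpreted under two different ISO-8859 charsets.
octet = bytes([0xC0])

as_latin1 = octet.decode("iso-8859-1")   # Western European
as_greek = octet.decode("iso-8859-7")    # Greek

# The octet is identical; only the charset used to read it differs.
print(f"ISO-8859-1: U+{ord(as_latin1):04X} {as_latin1}")   # U+00C0 À
print(f"ISO-8859-7: U+{ord(as_greek):04X} {as_greek}")     # U+0390 ΐ
```

Nothing in the octet itself says which reading is correct, which is why the charset has to travel with the data.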
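Unicode
Not a charset in the ISO-8859 sense: it assigns a code point (a number) to every character in every writing system, and by itself says nothing about how that number is stored as octets.
A -> U+0041
http://www.unicode.org/charts/
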
How do we use Unicode in a computer?
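UCS-2 (and its successor UTF-16)
A -> U+0041 -> 0x00 0x41 (big-endian)
PROS:
  Maps code points U+0000~U+FFFF directly to two octets
CONS:
  Incompatible with ASCII
  Wastes memory when the code point is <= U+007F
  UCS-2 cannot represent code points > U+FFFF; UTF-16 encodes them as surrogate pairs (two 16-bit units)
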
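As a rough illustration of the 16-bit mapping, here is a minimal Python sketch. The helper `utf16_units` is a hypothetical name written for this example; it follows the standard UTF-16 surrogate-pair rule for code points above U+FFFF, which plain UCS-2 does not have. U+1F600 is just an arbitrary code point outside the 16-bit range.

```python
def utf16_units(code_point):
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point <= 0xFFFF:
        # Same as UCS-2: the code point is stored as-is in one unit.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

a_units = utf16_units(0x0041)
big_units = utf16_units(0x1F600)

print([f"{u:04X}" for u in a_units])     # ['0041']
print([f"{u:04X}" for u in big_units])   # ['D83D', 'DE00']

# Cross-check against Python's built-in big-endian UTF-16 codec.
print("A".encode("utf-16-be").hex(" "))           # 00 41
print("\U0001F600".encode("utf-16-be").hex(" "))  # d8 3d de 00
```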
UCS-4 (UTF-32)
A -> U+0041 -> 0x00 0x00 0x00 0x41 (big-endian)
PROS:
  Maps every code point directly to four octets (four octets could hold values up to 0xFFFFFFFF, but Unicode stops at U+10FFFF)
CONS:
  Incompatible with ASCII
  Wastes a lot of memory

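To make the memory cost concrete, the sketch below (Python, for illustration only; the sample strings and the `sizes` dictionary are assumptions of this example) encodes the same text with the fixed-width encodings and with UTF-8 and counts the octets.

```python
samples = ["Hello", "神马?"]
codecs = ["utf-8", "utf-16-be", "utf-32-be"]

# sizes[text][codec] = number of octets needed for that text
sizes = {
    text: {codec: len(text.encode(codec)) for codec in codecs}
    for text in samples
}

for text, row in sizes.items():
    print(text, row)
# Hello {'utf-8': 5, 'utf-16-be': 10, 'utf-32-be': 20}
# 神马? {'utf-8': 7, 'utf-16-be': 6, 'utf-32-be': 12}
```

For pure ASCII text the 32-bit form is four times larger than the ASCII octets, which is the waste the slide refers to.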
UTF-8
Code point range      Octets
0000 ~ 007F           0xxxxxxx
0080 ~ 07FF           110xxxxx 10xxxxxx
0800 ~ FFFF           1110xxxx 10xxxxxx 10xxxxxx
010000 ~ 10FFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
A  => U+0041 => 1000001 => 01000001 => 0x41
神 => U+795E => 01111001 01011110 => 11100111 10100101 10011110 => 0xE7 0xA5 0x9E

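The bit patterns in the table translate almost line for line into code. Below is a minimal Python sketch of an encoder; `utf8_encode` is a hypothetical helper written for this example, not a library function, and it also covers the four-octet form that standard UTF-8 uses for U+10000~U+10FFFF.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 octets, following the table."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0x0041).hex(" "))   # 41
print(utf8_encode(0x795E).hex(" "))   # e7 a5 9e
```

In real code the built-in `str.encode("utf-8")` does the same job; the hand-written version is only meant to show where each bit goes.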
UTF-8
PROS:
  Compatible with ASCII
  Can map every code point to octets
CONS:
  The algorithm is a little complicated

"It does not make sense to have a string without knowing what encoding it uses."
    - Joel Spolsky
Programs communicate with each other through octet streams.
A sends B: E7 A5 9E E9 A9 AC 3F
A must tell B that the octets are encoded as UTF-8. Only then can B understand that the received message is "神马?"

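The exchange between A and B can be sketched in a few lines of Python (for illustration only; the variable names are assumptions of this example). The same seven octets are decoded once with the charset A intended and once with a wrong guess.

```python
# The octets A puts on the wire.
octets = bytes.fromhex("E7 A5 9E E9 A9 AC 3F")

# B was told the charset: the message comes out as intended.
understood = octets.decode("utf-8")
print(understood, len(understood))   # 神马? 3

# B guesses ISO-8859-1 instead: every octet becomes its own character.
garbled = octets.decode("iso-8859-1")
print(len(garbled))                  # 7
```

The wrong guess does not raise an error, because every octet is a valid ISO-8859-1 character. The damage only shows up when someone reads the text.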
Charsets in Perl
Two ways to get a string in Perl
  1. A literal string (its octets depend on the encoding of your source file)
  2. From I/O
    # source file saved as UTF-8
    my $a1 = "神马?";
    my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
    my $a3 = <FH>;
Either way, to Perl it is a string of 7 octets. ISO-8859-1 or UTF-8? Perl cannot tell.

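By default, Perl treats it as just a sequence of octets
    # source file saved as UTF-8, without "use utf8"
    my $a1 = "神马?";
    print length($a1);    # output is 7
How do we make Perl treat it as a sequence of characters? Use any one of the three calls below:
    # source file saved as UTF-8
    use Encode;
    my $a1 = "神马?";
    $a1 = Encode::decode_utf8($a1);       # returns the decoded string, so assign it
    $a1 = Encode::decode("utf8", $a1);    # the same, with the encoding named explicitly
    Encode::_utf8_on($a1);                # only flips the flag in place, no validation; best avoided
    print length($a1);    # output is 3
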
What happens inside?
  1. Decode the sequence of octets into code points, as UTF-8 (or another charset)
  2. Encode the code points into Perl's internal format (utf8)
  3. Turn the string's UTF8 flag ON
  4. Because the UTF8 flag is on, Perl treats the string as a sequence of characters
UTF-8? utf8? UTF8?

UTF-8
  The standard encoding, designed by Ken Thompson and Rob Pike
utf8
  Perl's internal encoding
  A lax superset of UTF-8 (it also accepts code points that strict UTF-8 rejects)
UTF8
  The name of the flag that tells Perl whether to treat the string as a sequence of characters

More Examples
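    # source file saved as UTF-8
    use Devel::Peek;    # Dump writes to STDERR
    use Encode;
    Dump("神"); Dump("\xE7\xA5\x9E");
    Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));
    Dump("\x{795E}" . "神");
Octet strings, "神" and "\xE7\xA5\x9E":
    FLAGS = (PADMY,POK,pPOK)
    PV = 0x16189d8 "\347\245\236"\0
Character strings, "\x{795E}" and the decoded octets:
    FLAGS = (PADMY,POK,pPOK,UTF8)
    PV = 0x2e7478 "\347\245\236"\0 [UTF8 "\x{795e}"]
Character string concatenated with octet string:
    FLAGS = (PADMY,POK,pPOK,UTF8)
    PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236"\0 [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]
\303\247 = 11000011 10100111 is the utf8 encoding of \x{e7} = 11100111: the octets of "神" were upgraded one by one, as if each were an ISO-8859-1 character, so the result is no longer "神神".
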
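The last dump, where a character string is joined with an undecoded octet string, can be reproduced outside Perl. The following Python sketch is only an analogy (Python refuses to mix bytes and text implicitly, so the Latin-1 upgrade that Perl performs silently is written out by hand; all names are assumptions of this example).

```python
raw = "神".encode("utf-8")          # the three octets E7 A5 9E
chars = "\u795e"                    # the decoded character U+795E

# What Perl does implicitly on concatenation: each octet of the
# undecoded string is treated as one ISO-8859-1 character.
upgraded = raw.decode("iso-8859-1")
mixed = chars + upgraded

print([f"{ord(c):x}" for c in mixed])    # ['795e', 'e7', 'a5', '9e']
print(mixed.encode("utf-8").hex(" "))    # e7 a5 9e c3 a7 c2 a5 c2 9e
```

The second line of output corresponds to the octal PV bytes in the dump (`\303\247` is `c3 a7`). Decoding the octets before concatenating avoids the problem.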
Convert "神" from UTF-8 to GBK
  神  E7A59E (UTF-8 encoded)                                UTF8 flag = off
      decode
  神  U+795E (Unicode), stored as E7A59E (utf8 encoded)     UTF8 flag = on
      encode
  神  C9F1 (GBK encoded)                                    UTF8 flag = off

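The same decode-then-encode pipeline, sketched in Python for illustration (the variable names are assumptions of this example; in Perl the corresponding calls would come from the Encode module). The key point is that conversion always passes through code points, never straight from one set of octets to another.

```python
utf8_octets = bytes.fromhex("E7A59E")    # "神" as UTF-8 octets

# decode: octets -> code points
text = utf8_octets.decode("utf-8")
print(f"U+{ord(text):04X}")              # U+795E

# encode: code points -> octets in the target charset
gbk_octets = text.encode("gbk")
print(gbk_octets.hex().upper())          # C9F1
```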
Charsets in MySQL
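Character set defaults are inherited: server -> database -> table
CREATE TABLE XXX …… …… …… DEFAULT CHARSET = utf8
(MySQL spells the charset name utf8, not UTF-8. Its utf8 stores at most three octets per character; utf8mb4 covers all of Unicode.)
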
SET NAMES X
is equivalent to:
  SET CHARACTER_SET_CLIENT = X
  SET CHARACTER_SET_CONNECTION = X
  SET CHARACTER_SET_RESULTS = X

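MySQL server: data stored as UTF-8, Connection_charset = shiftJIS

Shell client (UTF-8): Client_charset = UTF-8, Results_charset = UTF-8
  statement:  UTF-8 -> shiftJIS (client to connection), then shiftJIS -> UTF-8 (connection to storage)
  result:     UTF-8 <- UTF-8 (no conversion)

Perl client (euc-jp): Client_charset = euc-jp, Results_charset = euc-jp
  statement:  euc-jp -> shiftJIS (client to connection), then shiftJIS -> UTF-8 (connection to storage)
  result:     euc-jp <- UTF-8
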
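The chain of conversions for the euc-jp client can be simulated with ordinary codecs. This is a rough Python sketch under stated assumptions: `transcode` is a hypothetical helper, the sample text is arbitrary, and the real server does this work internally based on the three session variables.

```python
def transcode(data, source, target):
    """Re-encode octets from one charset to another via code points."""
    return data.decode(source).encode(target)

text = "日本語"
from_perl = text.encode("euc_jp")        # what the euc-jp client sends

# client charset -> connection charset -> storage charset
on_connection = transcode(from_perl, "euc_jp", "shift_jis")
stored = transcode(on_connection, "shift_jis", "utf-8")

# storage charset -> each client's results charset
to_perl = transcode(stored, "utf-8", "euc_jp")
to_shell = stored                        # already UTF-8, no conversion

print(stored == text.encode("utf-8"))    # True
print(to_perl == from_perl)              # True
```

The round trip is lossless here only because every character in the sample exists in all three charsets. A character missing from the connection charset would be lost at the first hop.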
Q & A
Thank you!
