Unicode and character sets

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky, co-founder of Stack Overflow and author of "More Joel on Software"

To a person: A
To a computer: 0100 0001

ASCII: 7 bits, codes 0~127 (printable characters from 32 to 126)
8-bit charsets: ISO-8859-1, ISO-8859-2, ISO-8859-3, ... up to ISO-8859-16
In ISO-8859-1, 0xC0 is À
In ISO-8859-7, 0xC0 is ΐ
The same octet has different meanings in different charsets!
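A short Python sketch of the point on this slide: the single octet 0xC0 decoded under two of the ISO-8859 charsets. The codec names are Python's standard aliases; the variable names are only illustrative.

```python
# One octet, two charsets: the meaning depends entirely on the decoder.
octet = bytes([0xC0])

as_latin1 = octet.decode("iso-8859-1")  # Western European charset
as_greek = octet.decode("iso-8859-7")   # Greek charset

print(as_latin1, hex(ord(as_latin1)))  # À 0xc0
print(as_greek, hex(ord(as_greek)))    # ΐ 0x390
```

The octet never changes; only the table used to interpret it does.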

Unicode
Not a charset like the ones above: a code point is an abstract number, not a sequence of octets.
Its job is to assign a code point to every character in the world.
A -> U+0041
http://www.unicode.org/charts/

How is Unicode used in a computer?

UCS-2 (UTF-16)
A -> U+0041 -> 0x00 0x41
PROS: maps code points (U+0000~U+FFFF) directly to octets
CONS: incompatible with ASCII; wastes memory when the code point is <= U+007F; UCS-2 cannot represent code points > U+FFFF (UTF-16 reaches them with surrogate pairs)
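A minimal Python sketch of the two-octet mapping, using the standard "utf-16-be" codec (big-endian, no byte order mark). The second character, U+1F600, is an arbitrary example chosen here because it lies above U+FFFF; it is not from the slides.

```python
# A code point up to U+FFFF maps straight to two octets.
bmp = "A".encode("utf-16-be")
print(bmp.hex())  # 0041

# Above U+FFFF, UTF-16 needs a surrogate pair: four octets, not two.
astral = "\U0001F600".encode("utf-16-be")
print(astral.hex())  # d83dde00
```

Pure UCS-2 has no surrogate mechanism, which is why it stops at U+FFFF.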

UCS-4 (UTF-32)
A -> U+0041 -> 0x00 0x00 0x00 0x41
PROS: maps every code point directly to four octets (the 32-bit range is U+00000000~U+FFFFFFFF; Unicode itself stops at U+10FFFF)
CONS: incompatible with ASCII; wastes a huge amount of memory
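To put a number on the memory cost, here is a small Python comparison. The sample string "hello" is an assumption picked for illustration; the codec names are Python's standard ones.

```python
# UTF-32 always spends four octets per code point.
utf32 = "A".encode("utf-32-be")
print(utf32.hex())  # 00000041

# Octets needed for the same plain ASCII text in each encoding.
sample = "hello"
sizes = {name: len(sample.encode(name))
         for name in ("utf-8", "utf-16-be", "utf-32-be")}
print(sizes)  # {'utf-8': 5, 'utf-16-be': 10, 'utf-32-be': 20}
```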

UTF-8
Code point range -> octet layout
U+0000 ~ U+007F -> 0xxxxxxx
U+0080 ~ U+07FF -> 110xxxxx 10xxxxxx
U+0800 ~ U+FFFF -> 1110xxxx 10xxxxxx 10xxxxxx
A => U+0041 => 1000001 => 01000001 => 0x41
神 => U+795E => 01111001 01011110 => 11100111 10100101 10011110 => 0xE7 0xA5 0x9E
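The three rows of the layout can be turned into code almost literally. The following is a sketch only: the function name `utf8_encode_bmp` is made up for this example, it covers just the ranges listed on the slide, and it does not reject surrogate code points as a real encoder would.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a code point in U+0000..U+FFFF following the three rows above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code points above U+FFFF are not covered by this sketch")


print(utf8_encode_bmp(0x0041).hex())  # 41
print(utf8_encode_bmp(0x795E).hex())  # e7a59e
```

Both results agree with Python's built-in encoder, for example `"神".encode("utf-8")`.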

UTF-8
PROS: compatible with ASCII; can map every code point to octets
CONS: the algorithm is a little complicated

“It does not make sense to have a string without knowing what encoding it uses.” - Joel Spolsky
Software communicates through octet streams.
A sends B the octets E7 A5 9E E9 A9 AC 3F.
A must also tell B that the octets are encoded in UTF-8. Only then can B understand that the received message is “神马?”
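A Python sketch of B's side of this exchange. The variable names are illustrative; the octets are the ones from the slide.

```python
# What B receives: seven octets and nothing else.
received = bytes.fromhex("E7 A5 9E E9 A9 AC 3F")
print(len(received))  # 7

# Told "this is UTF-8", B recovers three characters.
as_utf8 = received.decode("utf-8")
print(as_utf8, len(as_utf8))  # 神马? 3

# Guessing ISO-8859-1 also "works", but yields seven unrelated characters.
as_latin1 = received.decode("iso-8859-1")
print(len(as_latin1))  # 7
```

Neither decode raises an error, so B cannot detect a wrong guess from the octets alone.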

Two ways to get a string in Perl: as a literal, or from I/O.
A literal string depends on the encoding of your source code.
# encoding UTF-8
my $a1 = "神马?";                          # literal in a UTF-8 source file
my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";   # the same octets, spelled out
my $a3 = <FH>;                             # a line read from a filehandle
Either way, in Perl's eyes it is a string of 7 octets. ISO-8859-1 or UTF-8? Perl cannot tell.

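By default, Perl treats it as just a sequence of octets.
# encoding UTF-8
my $a1 = "神马?";
print length($a1);   # output is 7
How do we make Perl treat it as a sequence of characters? Use any one of the three options:
# encoding UTF-8
use Encode;
my $a1 = "神马?";
$a1 = Encode::decode_utf8($a1);         # option 1: returns the decoded string, so assign it
# $a1 = Encode::decode("utf8", $a1);    # option 2: same, naming the encoding
# Encode::_utf8_on($a1);                # option 3: flips the flag in place; internal, no validation
print length($a1);   # output is 3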

What has happened inside?
1. Decode the sequence of octets into code points, as UTF-8 (or another charset).
2. Encode the code points into Perl's internal format (utf8).
3. Turn the string's UTF8 flag ON.
4. From then on, because the UTF8 flag is set, Perl treats the string as a sequence of characters.
UTF-8? utf8? UTF8?

UTF-8: the standard encoding, designed by Ken Thompson (with Rob Pike)
utf8: Perl's internal encoding, a superset of UTF-8
UTF8: the name of the flag that indicates whether Perl should treat a string as a sequence of characters

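# encoding UTF-8
use Encode;
use Devel::Peek;   # Dump writes its report to STDERR
Dump("神"); Dump("\xE7\xA5\x9E");
Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));
Dump("神" . "\x{795E}");
Output for the first pair (plain octets, no UTF8 flag):
FLAGS = <PADMY,POK,pPOK>
PV = 0x16189d8 "\347\245\236"
Output for the second pair (characters, UTF8 flag on):
FLAGS = <PADMY,POK,pPOK,UTF8>
PV = 0x2e7478 "\347\245\236" [UTF8 "\x{795e}"]
Output for the concatenation:
FLAGS = <PADMY,POK,pPOK,UTF8>
PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236" [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]
\303\247 = 11000011 10100111, while \x{e7} = 11100111
The octets of the unflagged "神" were upgraded one by one as if they were ISO-8859-1 characters, so E7 became the character U+00E7 and was stored as the two octets C3 A7.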
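The doubled octets in the last dump can be reproduced outside Perl. This Python sketch is only an analogy for what Perl does when it joins an unflagged octet string to a flagged one: it reads each octet as an ISO-8859-1 character and stores the result as UTF-8. The variable names are illustrative.

```python
octets = "神".encode("utf-8")
print(octets.hex())  # e7a59e

# Misread every octet as one ISO-8859-1 character: U+00E7, U+00A5, U+009E.
misread = octets.decode("iso-8859-1")
print([hex(ord(ch)) for ch in misread])  # ['0xe7', '0xa5', '0x9e']

# Storing those three characters as UTF-8 doubles the octet count.
stored = misread.encode("utf-8")
print(stored.hex())  # c3a7c2a5c29e
```

In octal, c3 a7 c2 a5 c2 9e is \303\247\302\245\302\236, the tail of the PV shown in the concatenation dump.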

Convert “神” from UTF-8 to GBK
神 = E7 A5 9E (UTF-8 encoded octets), UTF8 flag = off
-> decode ->
神 = U+795E (Unicode code point), held as E7 A5 9E (utf8 encoded), UTF8 flag = on
-> encode ->
神 = C9 F1 (GBK encoded octets), UTF8 flag = off
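The same decode-then-encode path, sketched in Python with its standard "utf-8" and "gbk" codecs. Python has no UTF8 flag; the `bytes` and `str` types play the roles of "flag off" and "flag on". Variable names are illustrative.

```python
utf8_octets = bytes.fromhex("E7A59E")  # octets as they arrive

text = utf8_octets.decode("utf-8")     # decode: octets -> code points
print(hex(ord(text)))                  # 0x795e

gbk_octets = text.encode("gbk")        # encode: code points -> GBK octets
print(gbk_octets.hex())                # c9f1
```

There is no direct UTF-8 to GBK step: the conversion always passes through code points.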

Server -> database -> table
CREATE TABLE XXX …… …… …… DEFAULT CHARSET = utf8
(MySQL spells the charset name utf8, not UTF-8)

SET NAMES X
is equivalent to
SET CHARACTER_SET_CLIENT = X
SET CHARACTER_SET_CONNECTION = X
SET CHARACTER_SET_RESULTS = X

MySQL stores UTF-8; Connection_charset = shiftJIS
Shell (UTF-8): Client_charset = UTF-8, Results_charset = UTF-8
Request: UTF-8 -> shiftJIS, then shiftJIS -> UTF-8
Result: UTF-8 <- UTF-8
Perl (euc-jp): Client_charset = euc-jp, Results_charset = euc-jp
Request: euc-jp -> shiftJIS, then shiftJIS -> UTF-8
Result: euc-jp <- UTF-8
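The conversion chain for the euc-jp client can be imitated in Python. This is only a simulation of the charset hops named on the slide, not MySQL client code; the sample text "日本語" and all variable names are assumptions made for the example.

```python
text = "日本語"  # hypothetical value the Perl client wants to store

# Client -> connection: euc-jp octets are converted to shiftJIS.
from_client = text.encode("euc_jp")
on_connection = from_client.decode("euc_jp").encode("shift_jis")

# Connection -> storage: shiftJIS octets are converted to UTF-8.
stored = on_connection.decode("shift_jis").encode("utf-8")

# Storage -> results: UTF-8 octets are converted back to euc-jp.
result = stored.decode("utf-8").encode("euc_jp")

print(stored == text.encode("utf-8"))  # True
print(result == from_client)           # True
```

The round trip is lossless here because every character in the sample exists in all three charsets; a character missing from the connection charset would be damaged on the way in.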
