128 chars\nEnglish upper case, lower case, symbols + some control characters\n
128 chars = 7 bits.\n1 byte = 8 bits.\nWe have a whole 128 other characters to play with.\n
Everyone wants a different 128 chars.\nEncodings were born.\nLatin-1 (ISO-8859-1) is one of the most popular; it uses the other 128 chars mostly for accented letters.\n
It would be simple if we all spoke English!\nJapanese, Chinese, etc. need far more than 256 characters.\n
The only way we can deal with this is to build characters out of more than 1 byte.\nThis is a big issue for programming, as we now have to be very mindful not to split in the middle of a char.\n
Ideally we would only ever have 1 encoding to deal with.\nThis isn’t the case, but Unicode is as close as it gets.\n
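A rough sketch of "everyone wants a different 128 chars", using Ruby 1.9+ string methods. The byte value 0xE9 and the choice of ISO-8859-5 as the second encoding are just illustrative picks, not something from the slides.

```ruby
# One byte above 127, read through two different single-byte encodings.
byte = [0xE9].pack("C")   # a single raw byte (233)

latin1   = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")
cyrillic = byte.dup.force_encoding("ISO-8859-5").encode("UTF-8")

puts latin1     # e with acute accent (U+00E9)
puts cyrillic   # Cyrillic letter shcha (U+0449)
```

Same byte, different character: the byte alone does not say which one was meant.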
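To make the "split in the middle of a char" problem concrete, here is a small sketch in Ruby 1.9.3+ (it relies on String#byteslice). The sample text is an arbitrary choice.

```ruby
s = "\u65e5\u672c"            # two Japanese characters, 3 bytes each in UTF-8

puts s.length                 # 2 characters
puts s.bytesize               # 6 bytes

broken = s.byteslice(0, 4)    # cut at a byte offset inside the second character
puts broken.valid_encoding?   # false: the last byte is an orphaned lead byte

safe = s[0, 1]                # character-based slicing keeps characters whole
puts safe.valid_encoding?     # true
```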
There different\n\n
Character sets are just internal mappings from numbers to characters\n\nhex - int - char\n
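The hex - int - char mapping can be poked at directly from Ruby. A minimal sketch; the two code points used (0x41 and U+20AC) are only examples.

```ruby
code = 0x41                   # hex literal
puts code                     # the same number as a decimal int: 65
puts code.chr                 # the character at that position: A
puts "A".ord.to_s(16)         # and back again, char -> int -> hex: 41

euro = [0x20AC].pack("U")     # code point U+20AC packed as a UTF-8 string
puts euro.ord.to_s(16)        # 20ac
```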
here we are writing a lot of null chars\n
Unicode encoding that is 100% compat with US-ASCII\nuses variable length characters to maintain compat\nBest current solution for compat, size, support\n
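A quick way to see the ASCII compatibility and the variable-length characters, assuming the encoding described here is UTF-8. UTF-32BE is used as the fixed-width contrast; that choice is mine and may not match the slide.

```ruby
ascii = "abc"

# ASCII text has exactly the same bytes in US-ASCII and in UTF-8.
p ascii.encode("US-ASCII").bytes   # [97, 98, 99]
p ascii.encode("UTF-8").bytes      # [97, 98, 99]

# A fixed-width encoding pads every ASCII character with null bytes.
p ascii.encode("UTF-32BE").bytes   # [0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

# UTF-8 only spends extra bytes on the characters that need them.
samples = ["a", "\u00e9", "\u20ac", "\u{1F600}"]
samples.each do |ch|
  puts "U+%04X takes %d byte(s)" % [ch.ord, ch.bytesize]
end
```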
As we’ve seen, multibyte sets are common and useful.\nRuby doesn’t care about encodings; it just displays characters based on a mapping.\nreverse(), split(), size() all break multibyte strings very easily.\nBut Ruby doesn’t check, so you’ll never really know!\n
Ruby 1.8 supports UTF-8, right?\nKinda.\nUnicode support extends to understanding character boundaries, so magically reverse(), split(), match() all work.\nUnfortunately it’s very basic support, so transforms like upcase() will bite you in the arse.\nAlso $KCODE is global, so there is no way to deal with more than 1 encoding at a time.\n
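The breakage can be reproduced on a current Ruby by doing the operations on raw bytes by hand. This is only a simulation of encoding-unaware behaviour, written for Ruby 1.9+, not code that was run on an old interpreter.

```ruby
s = "caf\u00e9"   # the last character is 2 bytes in UTF-8

# Reverse the bytes, as a String with no notion of characters would.
bytewise = s.bytes.reverse.pack("C*").force_encoding("UTF-8")
puts bytewise.valid_encoding?    # false: the 2 bytes of the accented e got swapped

# A byte-wise size overcounts as well.
puts s.bytesize                  # 5
puts s.length                    # 4

# The encoding-aware reverse keeps the character intact.
puts s.reverse.valid_encoding?   # true
```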
Ruby 1.9 brings full string encoding support for over 80 encodings.\nString objects now reference both raw bytes and an Encoding object.\n\n
\n
A string is aware of its encoding, character length and byte length\n
force_encoding will retag a string with a different encoding; however, the actual bytes are not modified.\nIf the bytes are not valid in the encoding the string is tagged with, manipulating the string’s characters can raise ArgumentError.\n
Calling encode() actually transcodes: it returns a new string whose underlying bytes are converted to the target encoding (encode!() does the same in place).\n
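Those three pieces of information can be read straight off a string in Ruby 1.9+. A tiny sketch with an arbitrary sample word:

```ruby
s = "na\u00efve"    # one non-ASCII character in the middle

p s.encoding        # the Encoding object the string is tagged with
puts s.length       # 5 characters
puts s.bytesize     # 6 bytes
```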
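A sketch of both halves of that note: retagging leaves the bytes alone, and a wrong tag only blows up once something works at the character level. The regexp match is just one example of such an operation; the variable names are made up for the demo.

```ruby
s = "\u00e9"                                    # 1 character, 2 bytes in UTF-8
retagged = s.dup.force_encoding("ISO-8859-1")

p retagged.bytes == s.bytes                     # true: same bytes
puts retagged.length                            # 2: now read as two Latin-1 characters

# A lone 0xE9 byte is not valid UTF-8, but force_encoding does not check.
bad = [0xE9].pack("C").force_encoding("UTF-8")
puts bad.valid_encoding?                        # false

raised = false
begin
  bad =~ /e/                                    # character-level operation
rescue ArgumentError => e
  raised = true
  puts "ArgumentError: #{e.message}"
end
```

Checking valid_encoding? right after force_encoding is a cheap way to catch the problem before it surfaces somewhere less obvious.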
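For contrast with force_encoding, a small sketch of what encode() does to the bytes (Ruby 1.9+; the Latin-1 target is an arbitrary choice):

```ruby
utf8   = "\u00e9"                     # 2 bytes in UTF-8
latin1 = utf8.encode("ISO-8859-1")    # transcoded copy, original untouched

p utf8.bytes                          # [195, 169]
p latin1.bytes                        # [233]
puts utf8 == latin1.encode("UTF-8")   # true: same character after a round trip

copy = utf8.dup
copy.encode!("ISO-8859-1")            # in-place variant
p copy.encoding
```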
\n
String#each is gone in 1.9\nReplaced with explicit each_* methods (each_char, each_byte, each_line) and Enumerator functions\n
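What the explicit iterators look like in practice, as a minimal Ruby 1.9+ sketch with a made-up sample string:

```ruby
s = "h\u00e9\nllo"

puts s.respond_to?(:each)   # false on 1.9 and later

p s.each_char.to_a          # one entry per character
p s.each_byte.to_a          # one entry per raw byte
p s.each_line.to_a          # one entry per line, terminators included

# Called without a block, each_* hands back an Enumerator.
p s.each_char.class
```

Having to pick each_char, each_byte or each_line forces the caller to say which unit is meant, which is the whole point once characters and bytes stop being the same thing.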
\n
\n
\n
http://unicode-utils.rubyforge.org/ for locale management\n