SlideShare una empresa de Scribd logo
1 de 65
Descargar para leer sin conexión
Session: Character Sets and Firebird
       Firebird Conference 2011 · Luxembourg                Speaker: Stefan Heymann Page: 1




           Character Sets and Unicode
                   in Firebird


                                Stefan Heymann

                                      www.consic.de
                                    heymann@consic.de




After a short introduction to the world of Character Sets and Unicode, this session will show
you how to bring it all to work in Firebird. You will learn what all those character sets and
collations are and how you can properly use them to get the right characters into the
database and onto your screen.
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 2




Topics

●
    Characters
●
    Character Sets
●
    Unicode
●
    Firebird
●
    Examples
Session: Character Sets and Firebird
Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 3




                     Characters
Session: Character Sets and Firebird
   Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 4




Glyphs vs. Characters




                       Latin uppercase A
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 5




Glyph, Character, Character Set

●
    A Glyph is something you can see with your eyes
●
    A Character is an abstract concept
●
    Rendering of characters as Glyphs is the job of the
    rendering machine (Postscript, GDI, TrueType, Web
    Browser, etc.)
●
    We mostly care for processing the characters
●
    A Character Set assigns a number to a character:
            Uppercase A = 65
            Uppercase B = 66
            etc.
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 6




Glyphs

●
    Not all languages display glyphs as a string of left-
    to-right, contiguous rectangles
●
    Right-to-left (Arabic, Hebrew), top-to-bottom
    (Japanese, Chinese)
●
    Several characters can „melt“ into one glyph
Session: Character Sets and Firebird
Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 7




                  Character Sets
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 8




ASCII: The Mother of Character Sets

●
    American Standard Code for Information
    Interchange: ASCII, ISO 646
●
    7 bits, characters ranging from 0 to 127 (00..7F)
●
    32 invisible control characters
    (NUL, TAB, CR, LF, FF, BEL, ESC, ...)
●
    A..Z, a..z, Digits 0..9, Punctuation (;.-?)
●
    Optimized for English
●
    Only Latin characters, no accents, no umlauts
●
    MIME code: US-ASCII
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 9




ASCII for Europe

●
    Language specific characters get a new assignment
●   [=Ä   =Ö     ]=Ü
●
    Problem: Printer and screen must have same setting
●
    Impossible to mix French and German in one text:

    „Amélie knackt gerne die Kruste von Crème Brulé
    mit dem Löffel“
●
    Died together with DOS
Session: Character Sets and Firebird
        Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 10




Use the 8th Bit

●
    128 new characters: 128 to 255 (00..FF)
●
    A lot of 8-Bit character sets have evolved
    ●
         ISO 8859-x = ASCII + 160..255
    ●
         ISO 8859-1 = Latin-1
         (Western European languages)
    ●
         Windows 1252 = ISO 8859-1 + 128..159
    ●
         etc.
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 11




ISO 8859

●
    Characters 0..127 identical to ASCII
●
    128..159 undefined control characters
●
    160..255 individually assigned
Session: Character Sets and Firebird
    Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 12




ISO 8859-1 / Latin-1




Afrikaans, Albanese, Basque, Danish, German, English,
Faroese, Finnish, French, Icelandic, Italian, Catalan,
Dutch, Norwegian, Portuguese, Rhaeto-Romance,
Scottish Gaelic, Schwedish, Spanish, Suaheli
Large parts of the world – wide-spread use
Session: Character Sets and Firebird
    Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 13




ISO 8859-2 / Latin-2




Central and Eastern Europe (Czech, Polish, etc.)
Session: Character Sets and Firebird
    Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 14




ISO 8859-4 / Latin-4




Northern Europe, Baltic, Greenlanic, Sami
Session: Character Sets and Firebird
    Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 15




ISO 8859-5 / Cyrillic




Cyrillic (Russia, Ukraine, etc.)
More important: KOI8-R (Russian), KOI8-U (Ukrainian)
Session: Character Sets and Firebird
    Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 16




ISO 8859-9 / Latin-5




Turkish
Session: Character Sets and Firebird
        Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 17




Windows Character Sets („Codepages“)

●
    Partly congruent to ISO-8859
●
    Additional assignment of characters 128..159 with
    visible (non-control) characters
    ●
         Hyphens n length – and m length — (vs. Dash -)
    ●
         Typographic      „quotation marks“, etc.
●
    Windows character sets officially registered at IANA
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 18




Microsoft Windows Codepages
●
    874    Thai
●
    932    Japanese
●
    936    Simplified Chinese
●
    949    Korean
●
    950    Traditional Chinese
●
    1250   Central European
●
    1251   Cyrillic
●
    1252   Western European
●
    1253   Greek
●
    1254   Turkish
●
    1255   Hebrew
●
    1256   Arabic
●
    1257   Baltic
●
    1258   Vietnamese
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 19




windows-1252

●
    Congruent to ISO 8859-1 (Western European
    languages, including English)
●
    Additional characters in the 128..159 range:

    €‚ƒ„…†‡ˆ‰Š‹ŒŽ
    ‘’“”•–—˜™š›œžŸ

    Euro sign € since 2000 (Windows 1252-2000)
Session: Character Sets and Firebird
        Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 20




Multi-Byte Character Sets MBCS

●
    Multiple Bytes per Character
●
    Eastern Asian Languages (CJK)
●
    String Length <> Length of character chain
●
    Making extraction of sub-strings more difficult
●
    Firebird functions:
    ●
        BIT_LENGTH: length of a string in bits (!)
    ●
        CHAR_LENGTH/CHARACTER_LENGTH: length
        of a string in characters
    ●
        OCTET_LENGTH: length of a string in bytes
Session: Character Sets and Firebird
Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 21




                       Unicode
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 22




Why Unicode?

●
    One single Character Set for all languages/scripts
●
    No code overlaps
●
    Hardware and OS independant
●
    Standardisation ISO 10646 (vs. ASCII = ISO 646)
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 23




Unicode

●
    Started with 16 Bits/Character, now 32 Bits/Char
●
    Ability to code 1,114,112 characters
●
    Currently only a fraction is used
●
    „Basic Multilingual Plane“ (BMP): 0..U+FFFF
    Can be encoded in 16 bits
●
    Current version: 6.0.0 (February 2011)
●
    Defines Characters, not Glyphs
●
    Practically equal to ISO/IEC 10646
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 24




Character Definition

●
    Unicode defines a numerical Code Point (scalar
    value) and an Identifier for every character

    0041    LATIN CAPITAL LETTER A
    00E4    LATIN SMALL LETTER A WITH DIAERESIS
    0391    GREEK CAPITAL LETTER ALPHA
    05D0    HEBREW LETTER ALEF
    0950    DEVANAGARI OM
    1D56C   MATHEMATICAL BOLD FRAKTUR CAPITAL A
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 25




Unicode Code Points

●
    Codespace: 0..10FFFF
●
    Usual notation is hexadecimal with preceding U+
    and 4 or 5 digits
●
    U+0020
●
    U+0041
●
    U+1D56C
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 26




Unicode Character Names

●
    Consisting of the uppercase letters A..Z, digits 0..9,
    hyphen (-) and whitespace.
●
    BYZANTINE MUSICAL SYMBOL LEIMMA ENOS
    CHRONOU
●
    DESERET CAPITAL LETTER OW
●
    BRAILLE PATTERN DOTS-1245
Session: Character Sets and Firebird
        Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 27




Unicode Coding

●
    Storage of Code Points in memory
●
    32 Bits/Character easy but too fat
●
    There are several codings around:
    ●
          8-Bit (UTF-8, formerly called „FSS“)
    ●
        16-Bit (UCS-2, UTF-16)
    ●
        32-Bit (UCS-4, UTF-32)
    ●
        PunyCode for international Domain names
    ●
        „Exotic“: UTF-7, UTF-1, etc.
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 28




UCS-2

●
    16 Bits per Character
●
    Code Area: 0000..FFFF
    = Basic Multilingual Plane (BMP)
●
    Characters beyond the BMP (defined since
    Unicode 3.1) can not be encoded
●
    Replaced by UTF-16
●
    „Unicode“ often used as a synonym for UCS-2
    (which is wrong and can lead to misunderstanding)
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 29




UTF-16

●
    16 Bits per Character
●
    All characters > FFFF must be coded as a „Surrogate
    Pair“ and occupy 2 consecutive 16-Bit words
●
    Complete Code Space can be encoded
●
    Difficult to calculate string length or substrings
●
    Used by Windows 2000 and later
●
    Used by Delphi 2009/2010/XE (UnicodeString type)
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 30




Endianness

●
    Problem with UCS-2, UTF-16: Low/High-Byte
    ordering
●
    Differentiate in UTF-16BE and UTF-16LE in metadata
●
    Byte Order Mark           BOM        U+FEFF
●
    U+FEFF set at the very beginning of each text
●
    U+FFFE is (and will be) an invalid code point
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 31




UCS-2 vs. UTF-16

●
    UTF-16 ist backwards compatible
●
    „Unicode“ is used as a (bad) synonym for UCS-2 or
    UTF-16
●
    WideString, wchar_t
●
    Windows NT3, NT4: UCS-2
●
    Since Windows 2000: UTF-16
●
    Delphi 2009/etc. UnicodeString: UTF-16
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 32




UTF-8

●
    Coding as 8-Bit strings
●
    Called „File System Safe“ (FSS) in its early days
●
    7-bit US-ASCII characters untouched, all others
    occupy 2 to 4 consecutive bytes
●
    Complete codespace can be encoded
●
    Advantage: „Latin“ texts quite compact and readable
    in unaware editors
●
    Problem: string length, substrings, etc.
●
    No BOM necessary, but sometimes used as an
    indicator for UTF-8 text
Session: Character Sets and Firebird
        Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 33




UTF-8 coding

●
    US-ASCII characters untouched, Bit 7 reset 0xxxxxxx
●
    All others are sequences of bytes with Bit 7 set
    1xxxxxxx
●
    Head Byte: As many leading bits set as length of
    sequence: 110xxxxx
●   Tail Bytes: Bit 7 set, Bit 6 reset: 10xxxxxx
●
    Recognize the type of byte:
    ●
        Complete character:    0xxxxxxx
    ●
        Part of a sequence:    1xxxxxxx
    ●
        Sequence head:         11xxxxxx
    ●
        Sequence tail:         10xxxxxx
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg           Speaker: Stefan Heymann Page: 34




UTF-8 coding

●
    Bits after the type bits remain for encoding the
    Code Point

ä – LATIN SMALL LETTER A WITH DIAERESIS
00E416 = 22810 = 111001002

                               00000   –   00007F:   0xxxxxxx
110xxxxx      10xxxxxx         00080   –   0007FF:   110xxxxx 10xxxxxx
                               00800   –   00FFFF:   1110xxxx 10xxxxxx 10xxxxxx
    ooo11        100100        10000   –   1FFFFF:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
--------      --------
11000011      10100100      bin
C3            A4            hex
195           164           dez
à             ¤             Latin-1
Session: Character Sets and Firebird
   Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 35




UTF-8 with ISO 8859-1 Rendering




                    Jürgen Klinsmann
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 36




UCS-4, UTF-32

●
    32 Bits per Character
●
    Every code point as one 32-bit word in memory
●
    Fat but handy (1 word = 1 character)
●
    Endianness issues (UTF-32BE, UTF-32LE, BOM)
●
    Complete codespace can be coded
●
    No practical differences between UCS-4 (ISO) and
    UTF-32 (Unicode)
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 37




What is Plain Text?

●
    For every string, for every text (file, e-mail,
    attachment, download, etc) the encoding MUST be
    known.
●
    Plain text can begin with a BOM

           There Ain't No Such Thing As Plain Text.
           -- Joel Spolsky
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 38




Transcoding / Transliteration

●
    Transformation from special character sets to
    Unicode and back
●
    e.g. Windows-1252 –> Unicode –> ISO 8859-1
●
    Translation tables: www.unicode.org
●
    Characters can get lost (‫ ﻙ‬becomes ?)
●
    Characters can get transformed (ç becomes c)
Session: Character Sets and Firebird
        Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 39




Sorting

●
    Sorting rules also apply for searching
●
    There are cultural differences
    ●
        treat ä like a
    ●
        treat ä like ae
    ●
        treat ä as a seperate character after z
●
    Unicode defines algorithms and delivers tables for
    sorting
Session: Character Sets and Firebird
      Firebird Conference 2011 · Luxembourg     Speaker: Stefan Heymann Page: 40




Case Mappings

●
    Not available in every language
●
    Not always reversible
●   Turkish: ı –> I, i –> İ   English: i –> I
●
    Not necessarily 1:1 ß –> SS
●
    Case Mappings are language dependant
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 41




Comparisons, Sorting

●
    Case Insensitive
      Firebird = FIREBIRD ?
      river = RiVeR ?
      Fluß = FLUSS ?

      a B c <--> B a c

●
    Accent Insensitive
       Amélie = Amelie ?

      a é i o ù <--> a i o é ù
Session: Character Sets and Firebird
Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 42




       Firebird and Character Sets
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 43




CHAR / VARCHAR Fields

●
    Every CHAR or VARCHAR column has a character set
    applied by Firebird
    (Remember? „There is no such thing as plain text“)
●
    This character set will be used for storage
●
    The character set can be defined when declaring the
    column:
    create table persons (
       pers_id integer not null primary key,
       last_name varchar (50) character set iso8859_1,
       first_name varchar (50) character set iso8859_1
    );
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 44




Default Character Set

●
    A default character set for all string columns can be
    defined together with CREATE DATABASE:
    create database 'employee.gdb'
      default character set ISO8859_1;


●
    You can override the default character set:
    create table persons (
       pers_id integer not null primary key,
       last_name varchar (50),
       first_name varchar (50),
       czech_name varchar (50) character set iso8859_2
    );
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 45




Text Blobs

●
    Character Sets also apply to text blobs

    create table persons (
       pers_id integer not null primary key,
       last_name varchar (50),
       first_name varchar (50),
       resume blob sub_type text
    );
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 46




Client Character Set

●
    The Client application defines a character set for its
    connection
●
    All strings will be transliterated to/from this
    Client Character Set
●
    Firebird 1.5: „Unable to transliterate between
    character sets“
●
    Firebird 2.x: ServerCS <–> Unicode <–> ClientCS
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 47




How to define the Client Character Set

●
    IBObjects:
    IB_Connection1.CharSet := 'ISO8859_1';
●
    IBX:
    IbDatabase1.Params.Add ('lc_ctype=ISO8859_1');
●
    IBDAC:
    Connection.Options.CharSet     := 'ISO8859_1';
    Connection.Options.UseUnicode := false;
    Connection.Options.EnableMemos := false;
●
    PHP:
    $db = ibase_connect ($Name, $Usr, $Pwd, "ISO8859_1");
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 48




Collations
●
    Define sort ordering for ORDER BY
●
    Define casing rules for UPPER(), LOWER()
    SELECT *
    FROM ...
    WHERE UPPER (NAME COLLATE DE_DE) = :SEARCHNAME
    ORDER BY LASTNAME COLLATE FR_CA
●
    Define comparison rules (A = B, A <> B)
    WHERE NAME COLLATE UNICODE_CI =
     :PERSNAME COLLATE UNICODE_CI
●
    Collations defined per Character Set
●
    Examples:
    ISO8859_1: DE_DE, DU_NL, FR_FR, FR_CA, PT_PT, PT_BR
    WIN1250: WIN1250, BS_BA, WIN_CZ, WIN_CZ_CI_AI
    UTF8: UCS_BASIC, UNICODE, UNICODE_CI, UNICODE_CI_AI
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 49




Defining the Collation for a Column

●
    There is no such thing as a default collation
●
    Oh, wait: there IS since Firebird 2.5
●
    You can define the standard collation for a column:
    create table persons (
      pers_id integer not null primary key,
      last_name varchar (50) collate de_de,
      first_name varchar (50) collate de_de, ...);
●
    Or you can define it with every string usage:
    where name collate unicode_ci = myname
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 50




Default Collations

●
    Feature introduced in Firebird 2.5 / ODS 11.2
●
    Define the default collation for the default character
    set in CREATE DATABASE:
    create database <file name>
        [ page_size <page size> ]
        [ length = <length> ]
        [ user <user name> ]
        [ password <user password> ]
        [ set names <charset name> ]
        [ default character set <charset name>
           [ [ collation <collation name> ] ]

    create database elvis:presley
      page_size 16384
      user presley
      password guitar
      default character set iso8859_1
        collation de_de;
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 51




Default Collations (continued)

●
    New syntax to change the default collation for
    existing databases:


    ALTER CHARACTER SET <charset_name>
        SET DEFAULT COLLATION <collation_name>


    Example:

    ALTER CHARACTER SET ISO8859_1
      SET DEFAULT COLLATION DE_DE;
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 52




User-Defined Collations

●
    CREATE COLLATION
●
    Poorly documented
●
    Examples don't work
●
    Dead? Badly documented? More investigation
    necessary
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 53




UPPER(), LOWER()

●
    Firebird 1.0/1.5: UPPER() only works correctly if
    there is a collation defined for the parameter field.
    Without collation no uppercasing of letters outside
    the a..z range.
●
    Firebird 2.x: UPPER() will return uppercased
    characters for all characters, no collation required
●
    Firebird 2.x: New LOWER() function
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 54




Case insensitive Searching

●
    WHERE LIKE, WHERE STARTING WITH, WHERE =
    select * from persons
    where upper (last_name) like '%MITR%'

    select * from persons
    where upper (last_name) starting with 'DMIT'

    select * from persons
    where upper (last_name collate de_de) like '%Ä%'

    select * from persons
    where last_name collate unicode_ci like
         '%MÜLLER%' collate unicode_ci
Session: Character Sets and Firebird
   Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 55




Indexed Case Insensitive Searches
Firebird 1.5: Shadow Field and Trigger

  create table persons (
    name       varchar (50) collate de_de,
    name_upper varchar (50) collate de_de);

  create index idx_person_name on persons (name_upper);

  create or replace trigger biu_persons for persons
  before insert or update as
  begin
    new.name_upper = upper (new.name);
  end;

  select * from persons
  where name_upper = 'LÖW'
Session: Character Sets and Firebird
   Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 56




Indexed Case Insensitive Searches
Firebird 2.x: Expression Index

  create table persons (
    name varchar (50) collate de_de,
    city varchar (50));

  create index idx_person_name on persons
    computed by ( upper (name) );

  select * from persons
  where upper (name) = 'FIREBIRD'


  create index idx_person_city on persons
    computed by ( upper (city collate de_de) );

  select * from persons
  where upper (city collate de_de) = 'MÜNCHEN'
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 57




Firebird and Unicode
●
    Unicode internally stored as UTF-8
●
    Character Set UNICODE_FSS: an old version of
    UTF8 that accepts malformed strings and does not
    enforce correct maximum string length.
    Fixed in Firebird 2.5.
●
    Character Set UTF8 since Firebird 2.0: Replacement
    for UNICODE_FSS
●
    Unicode used for transliteration between character
    sets: CS <-- Unicode --> CS
●
    Unicode collation implemented for comparisons and
    casings (ICU DLLs)
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 58




Unicode Collations for UTF8

●
    UCS_BASIC: sorts in Unicode Code Point order
●
    UNICODE: uses the Unicode Collation Algorithm
●
    UNICODE_CI: Case-insensitive [FB2.1]
●
    UNICODE_CI_AI: Case-/Accent-insensitive [FB2.5]

select * from t order by c collate ucs_basic;
A B a b á

select * from t order by c collate unicode;
a A á b B
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 59




Special Character Sets

●
    NONE: Plain octets, no character set applied. With
    this character set setting, Firebird is unable to
    perform conversion operations like UPPER() correctly
    on anything other than the standard 26 latin letters.
●
    OCTETS: Same as NONE. Cannot be used as client
    connection character set. Space character is #x00
●
    ASCII: US-ASCII (English only)
●
    UNICODE_FSS: an old version of UTF8 that accepts
    malformed strings and does not enforce correct
    maximum string length. Replaced by UTF8 in FB2.0.
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 60




Character Sets and Collations

●
    ISO8859_x (e.g. ISO8895_1, ISO8859_2)
    Collations: DE_DE, FR_FR, CS_CZ, etc.
●
    WIN125x (e.g. WIN1252, WIN1250)
    Collations: WIN1252, WIN_PTBR, PXW_CSY, etc.
●
    DOSxxx (e.g. DOS850, DOS852)
    Collations: DB_DEU850, PDOX_CSY




●
    Complete List:
    www.destructor.de/firebird/charsets.htm
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 61




Other Character Sets

●
    BIG_5: Chinese
●
    CYRL, KOI8-R, KOI8-U: Cyrillic, Russian, Ukrainian
●
    KSC_5601: Korean
●
    SJIS_0208: Japanese
●
    EUCJ_0208: Japanese
●
    GB_2312: Chinese
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 62




Which one to select?

●
    DOS, dBASE and Paradox only for legacy support
●
    WINxxx is extension of corresponding ISOxxx, but
    may lead to problems on non-Windows systems.
●
    ISOxxx is missing a few characters of WINxxx (e.g.
    typographic dash signs) –> prepare to handle this
●
    If you expect to store several languages, use
    Unicode UTF8
Session: Character Sets and Firebird
     Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 63




Links

●
    The Unicode Consortium. The Unicode Standard
    www.unicode.org
●
    My Firebird Website and Conference Blog/Gallery
    www.destructor.de/firebird
Session: Character Sets and Firebird
Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 64




                    Questions?


                    Stefan Heymann

                     www.consic.de
                   heymann@consic.de

                   www.destructor.de
                  stefan@destructor.de
Session: Character Sets and Firebird
Firebird Conference 2011 · Luxembourg   Speaker: Stefan Heymann Page: 65




            Thank You!
    Danke! Merci! Grazie!
       ¡Gracias! Obrigado!
     Děkuji! Paldies! Hvala!
             СПАСИБО

Más contenido relacionado

La actualidad más candente

Resolving Firebird performance problems
Resolving Firebird performance problemsResolving Firebird performance problems
Resolving Firebird performance problemsAlexey Kovyazin
 
Ibm pure data system for analytics n200x
Ibm pure data system for analytics n200xIbm pure data system for analytics n200x
Ibm pure data system for analytics n200xIBM Sverige
 
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storyApache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storySunil Govindan
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird databaseFabio Codebue
 
Optimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptxOptimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptxJasonTuran2
 
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-ReloadedRNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-ReloadedChristoph Adler
 
SQLite: um motor de bases de dados relacional open source
SQLite: um motor de bases de dados relacional open sourceSQLite: um motor de bases de dados relacional open source
SQLite: um motor de bases de dados relacional open sourceLuis Borges Gouveia
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerMongoDB
 
Estructura fisica y logica ad c arlos iberico
Estructura fisica y logica ad  c arlos ibericoEstructura fisica y logica ad  c arlos iberico
Estructura fisica y logica ad c arlos ibericoCarlos Iberico
 
Understand and optimize Linux I/O
Understand and optimize Linux I/OUnderstand and optimize Linux I/O
Understand and optimize Linux I/OAndrea Righi
 
Built in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat GulecBuilt in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat GulecFIRAT GULEC
 
MongoDB World 2019: Becoming an Ops Manager Backup Superhero!
MongoDB World 2019: Becoming an Ops Manager Backup Superhero!MongoDB World 2019: Becoming an Ops Manager Backup Superhero!
MongoDB World 2019: Becoming an Ops Manager Backup Superhero!MongoDB
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesCharles Nutter
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialJean-François Gagné
 
MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB plc
 
PostGreSQL Performance Tuning
PostGreSQL Performance TuningPostGreSQL Performance Tuning
PostGreSQL Performance TuningMaven Logix
 
MySQL Group Replication
MySQL Group ReplicationMySQL Group Replication
MySQL Group ReplicationUlf Wendel
 
PgBouncer: Pool, Segurança e Disaster Recovery | Felipe Pereira
PgBouncer: Pool, Segurança e Disaster Recovery | Felipe PereiraPgBouncer: Pool, Segurança e Disaster Recovery | Felipe Pereira
PgBouncer: Pool, Segurança e Disaster Recovery | Felipe PereiraPGDay Campinas
 
Deploying openstack using ansible
Deploying openstack using ansibleDeploying openstack using ansible
Deploying openstack using ansibleopenstackindia
 

La actualidad más candente (20)

Resolving Firebird performance problems
Resolving Firebird performance problemsResolving Firebird performance problems
Resolving Firebird performance problems
 
Ibm pure data system for analytics n200x
Ibm pure data system for analytics n200xIbm pure data system for analytics n200x
Ibm pure data system for analytics n200x
 
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storyApache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration story
 
Huong dan dung index_oracle
Huong dan dung index_oracleHuong dan dung index_oracle
Huong dan dung index_oracle
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird database
 
Optimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptxOptimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptx
 
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-ReloadedRNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
 
SQLite: um motor de bases de dados relacional open source
SQLite: um motor de bases de dados relacional open sourceSQLite: um motor de bases de dados relacional open source
SQLite: um motor de bases de dados relacional open source
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTiger
 
Estructura fisica y logica ad c arlos iberico
Estructura fisica y logica ad  c arlos ibericoEstructura fisica y logica ad  c arlos iberico
Estructura fisica y logica ad c arlos iberico
 
Understand and optimize Linux I/O
Understand and optimize Linux I/OUnderstand and optimize Linux I/O
Understand and optimize Linux I/O
 
Built in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat GulecBuilt in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat Gulec
 
MongoDB World 2019: Becoming an Ops Manager Backup Superhero!
MongoDB World 2019: Becoming an Ops Manager Backup Superhero!MongoDB World 2019: Becoming an Ops Manager Backup Superhero!
MongoDB World 2019: Becoming an Ops Manager Backup Superhero!
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for Dummies
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication Tutorial
 
MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & Optimization
 
PostGreSQL Performance Tuning
PostGreSQL Performance TuningPostGreSQL Performance Tuning
PostGreSQL Performance Tuning
 
MySQL Group Replication
MySQL Group ReplicationMySQL Group Replication
MySQL Group Replication
 
PgBouncer: Pool, Segurança e Disaster Recovery | Felipe Pereira
PgBouncer: Pool, Segurança e Disaster Recovery | Felipe PereiraPgBouncer: Pool, Segurança e Disaster Recovery | Felipe Pereira
PgBouncer: Pool, Segurança e Disaster Recovery | Felipe Pereira
 
Deploying openstack using ansible
Deploying openstack using ansibleDeploying openstack using ansible
Deploying openstack using ansible
 

Destacado

Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)Feihong Hsu
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals SamiHsDU
 
Oracle Globalization Support, NLS_LENGTH_SEMANTICS, Unicode
Oracle Globalization Support, NLS_LENGTH_SEMANTICS, UnicodeOracle Globalization Support, NLS_LENGTH_SEMANTICS, Unicode
Oracle Globalization Support, NLS_LENGTH_SEMANTICS, UnicodeMarkus Flechtner
 
Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Project Student
 
Fundamental Unicode in Perl
Fundamental Unicode in PerlFundamental Unicode in Perl
Fundamental Unicode in PerlNova Patch
 
Coding system
Coding systemCoding system
Coding systemChuc Cao
 
Chapter 2 computer system
Chapter 2 computer systemChapter 2 computer system
Chapter 2 computer systemAten Kecik
 
Chapter 04 computer codes
Chapter 04 computer codesChapter 04 computer codes
Chapter 04 computer codesHareem Aslam
 
Ch 4 ict in-day2_day_life
Ch 4 ict in-day2_day_lifeCh 4 ict in-day2_day_life
Ch 4 ict in-day2_day_lifeCANOSSAMAHIM
 
Ict – information & communication technology
Ict – information & communication technologyIct – information & communication technology
Ict – information & communication technologyDerek Ramdatt
 
Curriculum development at primary and secondary level
Curriculum development at primary and secondary levelCurriculum development at primary and secondary level
Curriculum development at primary and secondary levelInternational advisers
 
Data Encoding
Data EncodingData Encoding
Data EncodingLuka M G
 

Destacado (20)

data representation
data representationdata representation
data representation
 
Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals
 
Unicode 101
Unicode 101Unicode 101
Unicode 101
 
Unicode
UnicodeUnicode
Unicode
 
W 9 numbering system
W 9 numbering systemW 9 numbering system
W 9 numbering system
 
Oracle Globalization Support, NLS_LENGTH_SEMANTICS, Unicode
Oracle Globalization Support, NLS_LENGTH_SEMANTICS, UnicodeOracle Globalization Support, NLS_LENGTH_SEMANTICS, Unicode
Oracle Globalization Support, NLS_LENGTH_SEMANTICS, Unicode
 
Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)
 
Fundamental Unicode in Perl
Fundamental Unicode in PerlFundamental Unicode in Perl
Fundamental Unicode in Perl
 
Coding system
Coding systemCoding system
Coding system
 
Chapter 2 computer system
Chapter 2 computer systemChapter 2 computer system
Chapter 2 computer system
 
Chapter 04 computer codes
Chapter 04 computer codesChapter 04 computer codes
Chapter 04 computer codes
 
Unicode
UnicodeUnicode
Unicode
 
Unicode
UnicodeUnicode
Unicode
 
Unicode
UnicodeUnicode
Unicode
 
Ch 4 ict in-day2_day_life
Ch 4 ict in-day2_day_lifeCh 4 ict in-day2_day_life
Ch 4 ict in-day2_day_life
 
Ict – information & communication technology
Ict – information & communication technologyIct – information & communication technology
Ict – information & communication technology
 
Curriculum development at primary and secondary level
Curriculum development at primary and secondary levelCurriculum development at primary and secondary level
Curriculum development at primary and secondary level
 
21st Century Education
21st Century Education21st Century Education
21st Century Education
 
Data Encoding
Data EncodingData Encoding
Data Encoding
 

Más de Mind The Firebird

Tips for using Firebird system tables
Tips for using Firebird system tablesTips for using Firebird system tables
Tips for using Firebird system tablesMind The Firebird
 
Using Azure cloud and Firebird to develop applications easily
Using Azure cloud and Firebird to develop applications easilyUsing Azure cloud and Firebird to develop applications easily
Using Azure cloud and Firebird to develop applications easilyMind The Firebird
 
A year in the life of Firebird .Net provider
A year in the life of Firebird .Net providerA year in the life of Firebird .Net provider
A year in the life of Firebird .Net providerMind The Firebird
 
How Firebird transactions work
How Firebird transactions workHow Firebird transactions work
How Firebird transactions workMind The Firebird
 
Using ТРСС to study Firebird performance
Using ТРСС to study Firebird performanceUsing ТРСС to study Firebird performance
Using ТРСС to study Firebird performanceMind The Firebird
 
Creating logs for data auditing in FirebirdSQL
Creating logs for data auditing in FirebirdSQLCreating logs for data auditing in FirebirdSQL
Creating logs for data auditing in FirebirdSQLMind The Firebird
 
Firebird Performance counters in details
Firebird Performance counters in detailsFirebird Performance counters in details
Firebird Performance counters in detailsMind The Firebird
 
Understanding Numbers in Firebird SQL
Understanding Numbers in Firebird SQLUnderstanding Numbers in Firebird SQL
Understanding Numbers in Firebird SQLMind The Firebird
 
Threading through InterBase, Firebird, and beyond
Threading through InterBase, Firebird, and beyondThreading through InterBase, Firebird, and beyond
Threading through InterBase, Firebird, and beyondMind The Firebird
 
New SQL Features in Firebird 3, by Vlad Khorsun
New SQL Features in Firebird 3, by Vlad KhorsunNew SQL Features in Firebird 3, by Vlad Khorsun
New SQL Features in Firebird 3, by Vlad KhorsunMind The Firebird
 
Orphans, Corruption, Careful Write, and Logging
Orphans, Corruption, Careful Write, and LoggingOrphans, Corruption, Careful Write, and Logging
Orphans, Corruption, Careful Write, and LoggingMind The Firebird
 
Firebird release strategy and roadmap for 2015/2016
Firebird release strategy and roadmap for 2015/2016Firebird release strategy and roadmap for 2015/2016
Firebird release strategy and roadmap for 2015/2016Mind The Firebird
 
Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...
Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...
Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...Mind The Firebird
 
Working with Large Firebird databases
Working with Large Firebird databasesWorking with Large Firebird databases
Working with Large Firebird databasesMind The Firebird
 
Superchaging big production systems on Firebird: transactions, garbage, maint...
Superchaging big production systems on Firebird: transactions, garbage, maint...Superchaging big production systems on Firebird: transactions, garbage, maint...
Superchaging big production systems on Firebird: transactions, garbage, maint...Mind The Firebird
 

Más de Mind The Firebird (20)

Tips for using Firebird system tables
Tips for using Firebird system tablesTips for using Firebird system tables
Tips for using Firebird system tables
 
Using Azure cloud and Firebird to develop applications easily
Using Azure cloud and Firebird to develop applications easilyUsing Azure cloud and Firebird to develop applications easily
Using Azure cloud and Firebird to develop applications easily
 
A year in the life of Firebird .Net provider
A year in the life of Firebird .Net providerA year in the life of Firebird .Net provider
A year in the life of Firebird .Net provider
 
How Firebird transactions work
How Firebird transactions workHow Firebird transactions work
How Firebird transactions work
 
SuperServer in Firebird 3
SuperServer in Firebird 3SuperServer in Firebird 3
SuperServer in Firebird 3
 
Copycat presentation
Copycat presentationCopycat presentation
Copycat presentation
 
Using ТРСС to study Firebird performance
Using ТРСС to study Firebird performanceUsing ТРСС to study Firebird performance
Using ТРСС to study Firebird performance
 
Overview of RedDatabase 2.5
Overview of RedDatabase 2.5Overview of RedDatabase 2.5
Overview of RedDatabase 2.5
 
Creating logs for data auditing in FirebirdSQL
Creating logs for data auditing in FirebirdSQLCreating logs for data auditing in FirebirdSQL
Creating logs for data auditing in FirebirdSQL
 
Firebird Performance counters in details
Firebird Performance counters in detailsFirebird Performance counters in details
Firebird Performance counters in details
 
Understanding Numbers in Firebird SQL
Understanding Numbers in Firebird SQLUnderstanding Numbers in Firebird SQL
Understanding Numbers in Firebird SQL
 
Threading through InterBase, Firebird, and beyond
Threading through InterBase, Firebird, and beyondThreading through InterBase, Firebird, and beyond
Threading through InterBase, Firebird, and beyond
 
New SQL Features in Firebird 3, by Vlad Khorsun
New SQL Features in Firebird 3, by Vlad KhorsunNew SQL Features in Firebird 3, by Vlad Khorsun
New SQL Features in Firebird 3, by Vlad Khorsun
 
Orphans, Corruption, Careful Write, and Logging
Orphans, Corruption, Careful Write, and LoggingOrphans, Corruption, Careful Write, and Logging
Orphans, Corruption, Careful Write, and Logging
 
Firebird release strategy and roadmap for 2015/2016
Firebird release strategy and roadmap for 2015/2016Firebird release strategy and roadmap for 2015/2016
Firebird release strategy and roadmap for 2015/2016
 
Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...
Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...
Nbackup and Backup: Internals, Usage strategy and Pitfalls, by Dmitry Kuzmenk...
 
Working with Large Firebird databases
Working with Large Firebird databasesWorking with Large Firebird databases
Working with Large Firebird databases
 
Firebird on Linux
Firebird on LinuxFirebird on Linux
Firebird on Linux
 
Superchaging big production systems on Firebird: transactions, garbage, maint...
Superchaging big production systems on Firebird: transactions, garbage, maint...Superchaging big production systems on Firebird: transactions, garbage, maint...
Superchaging big production systems on Firebird: transactions, garbage, maint...
 
Firebird meets NoSQL
Firebird meets NoSQLFirebird meets NoSQL
Firebird meets NoSQL
 

Último

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Último (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Character Sets and Unicode in Firebird

  • 1. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 1 Character Sets and Unicode in Firebird Stefan Heymann www.consic.de heymann@consic.de After a short introduction to the world of Character Sets and Unicode, this session will show you how to bring it all to work in Firebird. You will learn what all those character sets and collations are and how you can properly use them to get the right characters into the database and onto your screen.
  • 2. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 2 Topics ● Characters ● Character Sets ● Unicode ● Firebird ● Examples
  • 3. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 3 Characters
  • 4. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 4 Glyphs vs. Characters Latin uppercase A
  • 5. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 5 Glyph, Character, Character Set ● A Glyph is something you can see with your eyes ● A Character is an abstract concept ● Rendering of characters as Glyphs is the job of the rendering machine (Postscript, GDI, TrueType, Web Browser, etc.) ● We mostly care for processing the characters ● A Character Set assigns a number to a character: Uppercase A = 65 Uppercase B = 66 etc.
  • 6. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 6 Glyphs ● Not all languages display glyphs as a string of left- to-right, contiguous rectangles ● Right-to-left (Arabic, Hebrew), top-to-bottom (Japanese, Chinese) ● Several characters can „melt“ into one glyph
  • 7. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 7 Character Sets
  • 8. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 8 ASCII: The Mother of Character Sets ● American Standard Code for Information Interchange: ASCII, ISO 646 ● 7 bits, characters ranging from 0 to 127 (00..7F) ● 32 invisible control characters (NUL, TAB, CR, LF, FF, BEL, ESC, ...) ● A..Z, a..z, Digits 0..9, Punctuation (;.-?) ● Optimized for English ● Only Latin characters, no accents, no umlauts ● MIME code: US-ASCII
  • 9. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 9 ASCII for Europe ● Language specific characters get a new assignment ● [=Ä =Ö ]=Ü ● Problem: Printer and screen must have same setting ● Impossible to mix French and German in one text: „Amélie knackt gerne die Kruste von Crème Brulé mit dem Löffel“ ● Died together with DOS
  • 10. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 10 Use the 8th Bit ● 128 new characters: 128 to 255 (00..FF) ● A lot of 8-Bit character sets have evolved ● ISO 8859-x = ASCII + 160..255 ● ISO 8859-1 = Latin-1 (Western European languages) ● Windows 1252 = ISO 8859-1 + 128..159 ● etc.
  • 11. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 11 ISO 8859 ● Characters 0..127 identical to ASCII ● 128..159 undefined control characters ● 160..255 individually assigned
  • 12. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 12 ISO 8859-1 / Latin-1 Afrikaans, Albanese, Basque, Danish, German, English, Faroese, Finnish, French, Icelandic, Italian, Catalan, Dutch, Norwegian, Portuguese, Rhaeto-Romance, Scottish Gaelic, Schwedish, Spanish, Suaheli Large parts of the world – wide-spread use
  • 13. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 13 ISO 8859-2 / Latin-2 Central and Eastern Europe (Czech, Polish, etc.)
  • 14. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 14 ISO 8859-4 / Latin-4 Northern Europe, Baltic, Greenlanic, Sami
  • 15. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 15 ISO 8859-5 / Cyrillic Cyrillic (Russia, Ukraine, etc.) More important: KOI8-R (Russian), KOI8-U (Ukrainian)
  • 16. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 16 ISO 8859-9 / Latin-5 Turkish
  • 17. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 17 Windows Character Sets („Codepages“) ● Partly congruent to ISO-8859 ● Additional assignment of characters 128..159 with visible (non-control) characters ● Hyphens n length – and m length — (vs. Dash -) ● Typographic „quotation marks“, etc. ● Windows character sets officially registered at IANA
  • 18. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 18 Microsoft Windows Codepages ● 874 Thai ● 932 Japanese ● 936 Simplified Chinese ● 949 Korean ● 950 Traditional Chinese ● 1250 Central European ● 1251 Cyrillic ● 1252 Western European ● 1253 Greek ● 1254 Turkish ● 1255 Hebrew ● 1256 Arabic ● 1257 Baltic ● 1258 Vietnamese
  • 19. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 19 windows-1252 ● Congruent to ISO 8859-1 (Western European languages, including English) ● Additional characters in the 128..159 range: €‚ƒ„…†‡ˆ‰Š‹ŒŽ ‘’“”•–—˜™š›œžŸ Euro sign € since 2000 (Windows 1252-2000)
  • 20. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 20 Multi-Byte Character Sets MBCS ● Multiple Bytes per Character ● Eastern Asian Languages (CJK) ● String Length <> Length of character chain ● Making extraction of sub-strings more difficult ● Firebird functions: ● BIT_LENGTH: length of a string in bits (!) ● CHAR_LENGTH/CHARACTER_LENGTH: length of a string in characters ● OCTET_LENGTH: length of a string in bytes
  • 21. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 21 Unicode
  • 22. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 22 Why Unicode? ● One single Character Set for all languages/scripts ● No code overlaps ● Hardware and OS independant ● Standardisation ISO 10646 (vs. ASCII = ISO 646)
  • 23. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 23 Unicode ● Started with 16 Bits/Character, now 32 Bits/Char ● Ability to code 1,114,112 characters ● Currently only a fraction is used ● „Basic Multilingual Plane“ (BMP): 0..U+FFFF Can be encoded in 16 bits ● Current version: 6.0.0 (February 2011) ● Defines Characters, not Glyphs ● Practically equal to ISO/IEC 10646
  • 24. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 24 Character Definition ● Unicode defines a numerical Code Point (scalar value) and an Identifier for every character 0041 LATIN CAPITAL LETTER A 00E4 LATIN SMALL LETTER A WITH DIAERESIS 0391 GREEK CAPITAL LETTER ALPHA 05D0 HEBREW LETTER ALEF 0950 DEVANAGARI OM 1D56C MATHEMATICAL BOLD FRAKTUR CAPITAL A
  • 25. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 25 Unicode Code Points ● Codespace: 0..10FFFF ● Usual notation is hexadecimal with preceding U+ and 4 or 5 digits ● U+0020 ● U+0041 ● U+1D56C
  • 26. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 26 Unicode Character Names ● Consisting of the uppercase letters A..Z, digits 0..9, hyphen (-) and whitespace. ● BYZANTINE MUSICAL SYMBOL LEIMMA ENOS CHRONOU ● DESERET CAPITAL LETTER OW ● BRAILLE PATTERN DOTS-1245
  • 27. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 27 Unicode Coding ● Storage of Code Points in memory ● 32 Bits/Character easy but too fat ● There are several codings around: ● 8-Bit (UTF-8, formerly called „FSS“) ● 16-Bit (UCS-2, UTF-16) ● 32-Bit (UCS-4, UTF-32) ● PunyCode for international Domain names ● „Exotic“: UTF-7, UTF-1, etc.
  • 28. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 28 UCS-2 ● 16 Bits per Character ● Code Area: 0000..FFFF = Basic Multilingual Plane (BMP) ● Characters beyond the BMP (defined since Unicode 3.1) can not be encoded ● Replaced by UTF-16 ● „Unicode“ often used as a synonym for UCS-2 (which is wrong and can lead to misunderstanding)
  • 29. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 29 UTF-16 ● 16 Bits per Character ● All characters > FFFF must be coded as a „Surrogate Pair“ and occupy 2 consecutive 16-Bit words ● Complete Code Space can be encoded ● Difficult to calculate string length or substrings ● Used by Windows 2000 and later ● Used by Delphi 2009/2010/XE (UnicodeString type)
  • 30. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 30 Endianness ● Problem with UCS-2, UTF-16: Low/High-Byte ordering ● Differentiate in UTF-16BE and UTF-16LE in metadata ● Byte Order Mark BOM U+FEFF ● U+FEFF set at the very beginning of each text ● U+FFFE is (and will be) an invalid code point
  • 31. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 31 UCS-2 vs. UTF-16 ● UTF-16 ist backwards compatible ● „Unicode“ is used as a (bad) synonym for UCS-2 or UTF-16 ● WideString, wchar_t ● Windows NT3, NT4: UCS-2 ● Since Windows 2000: UTF-16 ● Delphi 2009/etc. UnicodeString: UTF-16
  • 32. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 32 UTF-8 ● Coding as 8-Bit strings ● Called „File System Safe“ (FSS) in its early days ● 7-bit US-ASCII characters untouched, all others occupy 2 to 4 consecutive bytes ● Complete codespace can be encoded ● Advantage: „Latin“ texts quite compact and readable in unaware editors ● Problem: string length, substrings, etc. ● No BOM necessary, but sometimes used as an indicator for UTF-8 text
  • 33. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 33 UTF-8 coding ● US-ASCII characters untouched, Bit 7 reset 0xxxxxxx ● All others are sequences of bytes with Bit 7 set 1xxxxxxx ● Head Byte: As many leading bits set as length of sequence: 110xxxxx ● Tail Bytes: Bit 7 set, Bit 6 reset: 10xxxxxx ● Recognize the type of byte: ● Complete character: 0xxxxxxx ● Part of a sequence: 1xxxxxxx ● Sequence head: 11xxxxxx ● Sequence tail: 10xxxxxx
  • 34. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 34 UTF-8 coding ● Bits after the type bits remain for encoding the Code Point ä – LATIN SMALL LETTER A WITH DIAERESIS 00E416 = 22810 = 111001002 00000 – 00007F: 0xxxxxxx 110xxxxx 10xxxxxx 00080 – 0007FF: 110xxxxx 10xxxxxx 00800 – 00FFFF: 1110xxxx 10xxxxxx 10xxxxxx ooo11 100100 10000 – 1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx -------- -------- 11000011 10100100 bin C3 A4 hex 195 164 dez à ¤ Latin-1
  • 35. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 35 UTF-8 with ISO 8859-1 Rendering Jürgen Klinsmann
  • 36. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 36 UCS-4, UTF-32 ● 32 Bits per Character ● Every code point as one 32-bit word in memory ● Fat but handy (1 word = 1 character) ● Endianness issues (UTF-32BE, UTF-32LE, BOM) ● Complete codespace can be coded ● No practical differences between UCS-4 (ISO) and UTF-32 (Unicode)
  • 37. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 37 What is Plain Text? ● For every string, for every text (file, e-mail, attachment, download, etc) the encoding MUST be known. ● Plain text can begin with a BOM There Ain't No Such Thing As Plain Text. -- Joel Spolsky
  • 38. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 38 Transcoding / Transliteration ● Transformation from special character sets to Unicode and back ● e.g. Windows-1252 –> Unicode –> ISO 8859-1 ● Translation tables: www.unicode.org ● Characters can get lost (‫ ﻙ‬becomes ?) ● Characters can get transformed (ç becomes c)
  • 39. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 39 Sorting ● Sorting rules also apply for searching ● There are cultural differences ● treat ä like a ● treat ä like ae ● treat ä as a seperate character after z ● Unicode defines algorithms and delivers tables for sorting
  • 40. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 40 Case Mappings ● Not available in every language ● Not always reversible ● Turkish: ı –> I, i –> İ English: i –> I ● Not necessarily 1:1 ß –> SS ● Case Mappings are language dependant
  • 41. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 41 Comparisons, Sorting ● Case Insensitive Firebird = FIREBIRD ? river = RiVeR ? Fluß = FLUSS ? a B c <--> B a c ● Accent Insensitive Amélie = Amelie ? a é i o ù <--> a i o é ù
  • 42. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 42 Firebird and Character Sets
  • 43. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 43 CHAR / VARCHAR Fields ● Every CHAR or VARCHAR column has a character set applied by Firebird (Remember? „There is no such thing as plain text“) ● This character set will be used for storage ● The character set can be defined when declaring the column: create table persons ( pers_id integer not null primary key, last_name varchar (50) character set iso8859_1, first_name varchar (50) character set iso8859_1 );
  • 44. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 44 Default Character Set ● A default character set for all string columns can be defined together with CREATE DATABASE: create database 'employee.gdb' default character set ISO8859_1; ● You can override the default character set: create table persons ( pers_id integer not null primary key, last_name varchar (50), first_name varchar (50), czech_name varchar (50) character set iso8859_2 );
  • 45. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 45 Text Blobs ● Character Sets also apply to text blobs create table persons ( pers_id integer not null primary key, last_name varchar (50), first_name varchar (50), resume blob sub_type text );
  • 46. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 46 Client Character Set ● The Client application defines a character set for its connection ● All strings will be transliterated to/from this Client Character Set ● Firebird 1.5: „Unable to transliterate between character sets“ ● Firebird 2.x: ServerCS <–> Unicode <–> ClientCS
  • 47. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 47 How to define the Client Character Set ● IBObjects: IB_Connection1.CharSet := 'ISO8859_1'; ● IBX: IbDatabase1.Params.Add ('lc_ctype=ISO8859_1'); ● IBDAC: Connection.Options.CharSet := 'ISO8859_1'; Connection.Options.UseUnicode := false; Connection.Options.EnableMemos := false; ● PHP: $db = ibase_connect ($Name, $Usr, $Pwd, "ISO8859_1");
  • 48. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 48 Collations ● Define sort ordering for ORDER BY ● Define casing rules for UPPER(), LOWER() SELECT * FROM ... WHERE UPPER (NAME COLLATE DE_DE) = :SEARCHNAME ORDER BY LASTNAME COLLATE FR_CA ● Define comparison rules (A = B, A <> B) WHERE NAME COLLATE UNICODE_CI = :PERSNAME COLLATE UNICODE_CI ● Collations defined per Character Set ● Examples: ISO8859_1: DE_DE, DU_NL, FR_FR, FR_CA, PT_PT, PT_BR WIN1250: WIN1250, BS_BA, WIN_CZ, WIN_CZ_CI_AI UTF8: UCS_BASIC, UNICODE, UNICODE_CI, UNICODE_CI_AI
  • 49. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 49 Defining the Collation for a Column ● There is no such thing as a default collation ● Oh, wait: there IS since Firebird 2.5 ● You can define the standard collation for a column: create table persons ( pers_id integer not null primary key, last_name varchar (50) collate de_de, first_name varchar (50) collate de_de, ...); ● Or you can define it with every string usage: where name collate unicode_ci = myname
  • 50. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 50 Default Collations ● Feature introduced in Firebird 2.5 / ODS 11.2 ● Define the default collation for the default character set in CREATE DATABASE: create database <file name> [ page_size <page size> ] [ length = <length> ] [ user <user name> ] [ password <user password> ] [ set names <charset name> ] [ default character set <charset name> [ [ collation <collation name> ] ] create database elvis:presley page_size 16384 user presley password guitar default character set iso8859_1 collation de_de;
  • 51. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 51 Default Collations (continued) ● New syntax to change the default collation for existing databases: ALTER CHARACTER SET <charset_name> SET DEFAULT COLLATION <collation_name> Example: ALTER CHARACTER SET ISO8859_1 SET DEFAULT COLLATION DE_DE;
  • 52. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 52 User-Defined Collations ● CREATE COLLATION ● Poorly documented ● Examples don't work ● Dead? Badly documented? More investigation necessary
  • 53. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 53 UPPER(), LOWER() ● Firebird 1.0/1.5: UPPER() only works correctly if there is a collation defined for the parameter field. Without collation no uppercasing of letters outside the a..z range. ● Firebird 2.x: UPPER() will return uppercased characters for all characters, no collation required ● Firebird 2.x: New LOWER() function
  • 54. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 54 Case insensitive Searching ● WHERE LIKE, WHERE STARTING WITH, WHERE = select * from persons where upper (last_name) like '%MITR%' select * from persons where upper (last_name) starting with 'DMIT' select * from persons where upper (last_name collate de_de) like '%Ä%' select * from persons where last_name collate unicode_ci like '%MÜLLER%' collate unicode_ci
  • 55. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 55 Indexed Case Insensitive Searches Firebird 1.5: Shadow Field and Trigger create table persons ( name varchar (50) collate de_de, name_upper varchar (50) collate de_de); create index idx_person_name on persons (name_upper); create or replace trigger biu_persons for persons before insert or update as begin new.name_upper = upper (new.name); end; select * from persons where name_upper = 'LÖW'
  • 56. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 56 Indexed Case Insensitive Searches Firebird 2.x: Expression Index create table persons ( name varchar (50) collate de_de, city varchar (50)); create index idx_person_name on persons computed by ( upper (name) ); select * from persons where upper (name) = 'FIREBIRD' create index idx_person_city on persons computed by ( upper (city collate de_de) ); select * from persons where upper (city collate de_de) = 'MÜNCHEN'
  • 57. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 57 Firebird and Unicode ● Unicode internally stored as UTF-8 ● Character Set UNICODE_FSS: an old version of UTF8 that accepts malformed strings and does not enforce correct maximum string length. Fixed in Firebird 2.5. ● Character Set UTF8 since Firebird 2.0: Replacement for UNICODE_FSS ● Unicode used for transliteration between character sets: CS <-- Unicode --> CS ● Unicode collation implemented for comparisons and casings (ICU DLLs)
  • 58. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 58 Unicode Collations for UTF8 ● UCS_BASIC: sorts in Unicode Code Point order ● UNICODE: uses the Unicode Collation Algorithm ● UNICODE_CI: Case-insensitive [FB2.1] ● UNICODE_CI_AI: Case-/Accent-insensitive [FB2.5] select * from t order by c collate ucs_basic; A B a b á select * from t order by c collate unicode; a A á b B
  • 59. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 59 Special Character Sets ● NONE: Plain octets, no character set applied. With this character set setting, Firebird is unable to perform conversion operations like UPPER() correctly on anything other than the standard 26 latin letters. ● OCTETS: Same as NONE. Cannot be used as client connection character set. Space character is #x00 ● ASCII: US-ASCII (English only) ● UNICODE_FSS: an old version of UTF8 that accepts malformed strings and does not enforce correct maximum string length. Replaced by UTF8 in FB2.0.
  • 60. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 60 Character Sets and Collations ● ISO8859_x (e.g. ISO8895_1, ISO8859_2) Collations: DE_DE, FR_FR, CS_CZ, etc. ● WIN125x (e.g. WIN1252, WIN1250) Collations: WIN1252, WIN_PTBR, PXW_CSY, etc. ● DOSxxx (e.g. DOS850, DOS852) Collations: DB_DEU850, PDOX_CSY ● Complete List: www.destructor.de/firebird/charsets.htm
  • 61. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 61 Other Character Sets ● BIG_5: Chinese ● CYRL, KOI8-R, KOI8-U: Cyrillic, Russian, Ukrainian ● KSC_5601: Korean ● SJIS_0208: Japanese ● EUCJ_0208: Japanese ● GB_2312: Chinese
  • 62. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 62 Which one to select? ● DOS, dBASE and Paradox only for legacy support ● WINxxx is extension of corresponding ISOxxx, but may lead to problems on non-Windows systems. ● ISOxxx is missing a few characters of WINxxx (e.g. typographic dash signs) –> prepare to handle this ● If you expect to store several languages, use Unicode UTF8
  • 63. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 63 Links ● The Unicode Consortium. The Unicode Standard www.unicode.org ● My Firebird Website and Conference Blog/Gallery www.destructor.de/firebird
  • 64. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 64 Questions? Stefan Heymann www.consic.de heymann@consic.de www.destructor.de stefan@destructor.de
  • 65. Session: Character Sets and Firebird Firebird Conference 2011 · Luxembourg Speaker: Stefan Heymann Page: 65 Thank You! Danke! Merci! Grazie! ¡Gracias! Obrigado! Děkuji! Paldies! Hvala! СПАСИБО