Unicode Fundamentals

Presenters
• Md. Sami Hassan (Ek-29)
• Md. Nazmul Islam (SH-64)
• Nafi Md. Kamrul Haque Chowdhury Sadique (SH-04)

Character Encoding
What is a Character?

A Character is the smallest
unit of writing which is
capable of conveying
information.

Examples of characters
• Letters
• Digits
• Hyphen
• Mathematical symbols
• Punctuation
• Control characters (typically not visible)

What is Character Encoding?
• The set of available characters is called a character repertoire.
• The location of a given character within a repertoire is known as its code position, or code point.
• The method of representing a code point in digital form within a given repertoire is called the character encoding.

What is Character Encoding?
Example:
In the repertoire ISO 8859-1,
• The “code point” of "Latin capital letter A" is 0x41 (“0x” indicating that 41 is hexadecimal).
• The “encoding” of “A” is the single byte 01000001.
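
A minimal Python sketch of the example above may help separate the two ideas. The variable names are hypothetical and only illustrate the difference between a code point (a number) and its encoding (the stored byte):

```python
# Hypothetical illustration; the variable names are not from the slides.
ch = "A"

code_point = ord(ch)                # position of the character in the repertoire
encoded = ch.encode("iso-8859-1")   # the byte(s) that represent it in this encoding
bits = format(encoded[0], "08b")    # the same byte written out as eight binary digits

print(hex(code_point), encoded, bits)
```

In ISO 8859-1 every character is stored as exactly one byte whose value equals the code point, which is why the two numbers coincide here.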
What is Character Encoding?
So, in general, we can say,
Character encoding is the way that letters,
digits and other symbols are expressed as
binary values that a computer can
understand.
Why is Character Encoding required?
Computers at their most basic level just
deal with binary numbers, 0 and 1. They
store letters, numerals and other
characters by assigning a binary number for
each one.

Pre-Unicode Character Repertoires
• EBCDIC
• ASCII (American Standard Code for Information Interchange)
• US-ASCII (U.S. version, standardised as ISO 646)
• ISO 8859
• ISO 8859-1 (most common version for Western languages)
• ISO 8859-15 (update of ISO 8859-1 that adds the euro sign)

A Brief History
• In the pre-Unicode environment, we had single-byte (8-bit) character repertoires, which limited us to a maximum of 256 characters. No single encoding could contain enough characters to cover all languages.
• So, hundreds of different encoding systems were developed for assigning numbers to characters.

A Brief History


As a result, these coding systems conflicted with each
other. That is, two encodings could use the same number
for two different characters or different numbers for the
same character.



Any given computer needed to support many different
encodings.



Yet whenever data was passed between different
encodings or platforms, that data ran the risk of
corruption.



So, a solution to this problem was necessary.
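
The conflict between encodings can be reproduced with a short Python sketch. The three code pages chosen here are only examples picked for illustration, not ones named in the slides:

```python
# One stored byte, interpreted under three different pre-Unicode encodings.
raw = b"\xe9"

as_latin1 = raw.decode("iso-8859-1")  # Western European
as_cp1251 = raw.decode("cp1251")      # Cyrillic (Windows)
as_cp437 = raw.decode("cp437")        # original IBM PC code page

# The same number yields three different characters.
print(as_latin1, as_cp1251, as_cp437)
```

A file written under one of these encodings and read under another shows the wrong characters, which is the kind of corruption described above.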
A Brief History

And so, the solution was Unicode — a
character repertoire that contains most of
the characters used in the languages of the
world.

Unicode
What is Unicode?
• Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
• Unicode provides a unique number for every character,
  no matter what the platform,
  no matter what the program,
  no matter what the language.
Where is Unicode Used?
• The Unicode Standard has been adopted by many software and hardware vendors.
• Most OSs support Unicode.
• Unicode is required for international document and data interchange.
• Several modern standards use Unicode, such as:
  - Programming languages, such as Java, C#, Perl, Python.
  - Markup languages and other standards, such as XML, HTML, XHTML, JavaScript, LDAP, CORBA, etc.
How many languages are covered by
Unicode?


The Unicode Standard encodes characters on a per script
basis.



That is, Unicode encodes scripts for languages, rather
than languages themselves.



The reason is, many scripts (especially the Latin script)
are used to write a large number of languages.



Therefore, the easiest answer is that Unicode covers all of
the languages that can be written in the Unicode
supported scripts.
Unicode Supported Scripts
How can Unicode cover so many
characters?

As we have already seen, in the pre-Unicode environment, we had
single-byte (8-bit) character repertoires, which limited us to a maximum
of 256 characters, i.e., code points.

In contrast, Unicode code points are logically divided into 17 planes, each
with 65,536 (= 2^16) code points.

In the Unicode standard, planes are groups of code points that map
to specific characters.

Planes are identified by the numbers 0 to 16, which correspond to
the possible values 0x00–0x10 of the first two positions in the six-position
format (hhhhhh).

The only one used in most circumstances is the first plane, known as
the Basic Multilingual Plane, or BMP.
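
Since each plane holds 2^16 code points, the plane number is simply the code point divided by 65,536. A small sketch, assuming a hypothetical helper named plane_of:

```python
def plane_of(code_point):
    """Return the plane number (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # drop the low 16 bits; each plane holds 2**16 code points

print(plane_of(ord("A")))   # a BMP character
print(plane_of(0x1F600))    # a supplementary-plane character
print(plane_of(0x10FFFD))   # a code point in the last plane
```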
Unicode Planes
How to name a Code Point?


Normally, a Unicode code point is referred to by writing
"U+" followed by its hexadecimal number.



For code points in the Basic Multilingual Plane (BMP), four
digits are used
- e.g. U+0058 for the character Latin Capital Letter “X”.



For code points outside the BMP, five or six digits are
used, as required
- e.g. U+E0001 for the character 'LANGUAGE TAG'
- e.g. U+10FFFD for the character ‘<Plane 16 Private Use,
Last>’
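
The naming rule can be sketched in Python with a format specifier that pads to at least four hexadecimal digits. The function name u_plus is an assumption made for this illustration:

```python
def u_plus(code_point):
    """Format a code point in U+ notation: at least four hex digits, more if needed."""
    return f"U+{code_point:04X}"

print(u_plus(ord("X")))   # a BMP code point, four digits
print(u_plus(0xE0001))    # five digits
print(u_plus(0x10FFFD))   # six digits
```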
Unicode and ISO
• A version of Unicode that has been standardised by ISO (International Organization for Standardization) is called ISO 10646.
• There are minor differences between Unicode and ISO 10646, but not enough to worry about.
Mapping of Unicode characters
• Unicode/ISO 10646 is only a repertoire.
• Therefore, we need an encoding to go with it.
• And those encodings are UTF and UCS.
UTF and UCS
What are UTF and UCS?
• Unicode defines two mapping methods:
  - Unicode Transformation Format (UTF) encodings
  - Universal Character Set (UCS) encodings
• A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence.
• The ISO 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.
UTF vs Unicode
• Say an application reads the following from the disk:
  01101000 01100101 01101100 01101100 01101111
• The application knows this data represents a Unicode string encoded with UTF-8 and must show this as text to the user.
UTF vs Unicode

The first step is to convert the binary data to numbers.

The app uses the UTF-8 algorithm to decode the data.

In this case, the decoder returns this:
0x68 0x65 0x6C 0x6C 0x6F

Since the app knows this is a Unicode string, it can assume
each number represents a character’s code point.

So, it uses the Unicode character repertoire to translate each
number to a corresponding character.

The resulting string is:
hello
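
The same walk-through can be followed in Python. This is only a sketch: it assumes the five groups read from disk are full 8-bit bytes, and the variable names are hypothetical:

```python
# The data as read from disk, one group of bits per byte.
bit_groups = "01101000 01100101 01101100 01101100 01101111".split()
raw = bytes(int(group, 2) for group in bit_groups)

# Step 1: the UTF-8 decoder turns the bytes into code points (and a string).
text = raw.decode("utf-8")
code_points = [ord(ch) for ch in text]

# Step 2: each code point names one character in the Unicode repertoire.
print([hex(cp) for cp in code_points])
print(text)
```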
UTF vs Unicode
Therefore,
• UTF is an encoding used to translate binary data into numbers and vice-versa.
• Unicode is a character repertoire used to translate numbers into characters and vice-versa.
UTF Formats
UTF encodings include:


UTF-1 – a retired predecessor of UTF-8, maximizes
compatibility with ISO 2022, no longer part of The Unicode
Standard.



UTF-7 – a 7-bit encoding sometimes used in e-mail, often
considered obsolete, no longer part of The Unicode Standard.



UTF-EBCDIC – an 8-bit variable-width encoding similar to UTF-8,
but designed for compatibility with EBCDIC, not part of The
Unicode Standard.
UTF Formats
• UTF-8 – an 8-bit variable-width encoding which maximizes compatibility with ASCII, used by 77.3% of all the websites.
• UTF-16 – a 16-bit, variable-width encoding, used by less than 0.1% of all the websites.
• UTF-32 – a 32-bit, fixed-width encoding, very rarely used.
UTF Formats
• So, at present, the Unicode Standard defines three encoding forms that allow the same character data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16, or 32 bits per code unit).
• These three encoding forms are called UTF-8, UTF-16 and UTF-32 respectively.
UTF-8


UTF-8 is a variable-width encoding that can represent every character in the
Unicode character set. It encodes each of the 1,112,064 code points in the
Unicode character set using one to four 8-bit bytes (termed "octets" in the
Unicode Standard).



It has the following properties:
1. Characters U+0000 to U+007F (ASCII) are encoded as a single byte 0x00 to 0x7F; this means UTF-8 is fully backward compatible with ASCII.
2. All characters greater than U+007F are encoded as a sequence of several bytes, all of which are above 0x7F (that is, none is an ASCII byte); this makes it unambiguous whether a byte belongs to a multi-byte character or is an ASCII character.
3. The first byte of a multi-byte sequence (that represents a non-ASCII character) is always in the range 0xC2 to 0xF4. All further bytes in the sequence are in the range 0x80 to 0xBF. This makes it unambiguous to determine the boundaries of multi-byte characters (in fact, the first byte also carries redundant information about how many bytes follow for the character).
4. The bytes 0xC0, 0xC1 and 0xF5 to 0xFF are never used in the UTF-8 encoding.
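
These properties can be checked on a few sample characters. The sketch below uses a hypothetical helper, describe, that reports the sequence length, the first byte, and whether every following byte lies in the range 0x80 to 0xBF:

```python
def describe(ch):
    """Return (byte count, first byte, all later bytes are continuation bytes)."""
    encoded = ch.encode("utf-8")
    lead, rest = encoded[0], encoded[1:]
    return len(encoded), lead, all(0x80 <= b <= 0xBF for b in rest)

# One sample each for the 1-, 2-, 3- and 4-byte cases.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    count, lead, continuation_ok = describe(ch)
    print(f"U+{ord(ch):04X}", count, hex(lead), continuation_ok)
```

Because the first byte and the following bytes never share a range, a reader that lands in the middle of a stream can skip bytes in 0x80 to 0xBF until it reaches the start of the next character.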
UTF-8

As mentioned before, UTF-8 is the most popular encoding
form for Unicode.

The reason lies in the fact that all ASCII characters
are encoded as a single byte in UTF-8, which is not
only fully backward compatible, but also space
efficient for US and many European users.

In general, compared with the legacy encodings for each
language, UTF-8 costs no extra space for US-ASCII,
only a few percent more for ISO 8859-1 (AKA Latin-1,
covers most West European languages), 50% more for
Chinese/Japanese/Korean, and 100% more for Greek and
Cyrillic.
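
The space figures can be reproduced by encoding the same text twice. This is a rough sketch; the legacy encodings used for comparison (ISO 8859-7, KOI8-R, GBK) are assumptions chosen for illustration, not ones named in the slides:

```python
def sizes(text, legacy):
    """Return (bytes in the legacy encoding, bytes in UTF-8) for the same text."""
    return len(text.encode(legacy)), len(text.encode("utf-8"))

print(sizes("hello", "ascii"))       # English: no extra space
print(sizes("αβγ", "iso-8859-7"))    # Greek: one byte per letter becomes two
print(sizes("привет", "koi8-r"))     # Cyrillic: one byte per letter becomes two
print(sizes("中文", "gbk"))           # Chinese: two bytes per character become three
```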
UTF-16

Like UTF-8, UTF-16 is also a variable-width encoding that
can represent every character in the Unicode character set.

The difference is that it encodes the 1,112,064 code points in the
Unicode character set using one or two 16-bit code units (two or
four octets).

Properties:

The Unicode code space is divided into seventeen planes of
2^16 code points each.

The first plane (code points U+0000 to U+FFFF) contains the
most frequently used characters and is called the Basic
Multilingual Plane or BMP. UTF-16 encodes code points in this
range as single 16-bit code units that are numerically equal
to the corresponding code points.
UTF-16


Code points from the other planes (called Supplementary
Planes) are encoded in UTF-16 by pairs of 16-bit code
units called a surrogate pair, by the following scheme:

o

0x010000 is subtracted from the code point, leaving a 20
bit number in the range 0.. 0x0FFFFF.

o

The top ten bits (a number in the range 0.. 0x03FF) are
added to 0xD800 to give the first code unit or lead
surrogate, which will be in the range 0xD800.. 0xDBFF.

o

The low ten bits (also in the range 0.. 0x03FF) are added
to 0xDC00 to give the second code unit or trail surrogate,
which will be in the range 0xDC00.. 0xDFFF.
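
The three steps of the scheme translate almost line for line into Python. The function name to_surrogate_pair is hypothetical; the arithmetic follows the steps listed above:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit number in the range 0..0xFFFFF
    lead = 0xD800 + (offset >> 10)      # top ten bits
    trail = 0xDC00 + (offset & 0x3FF)   # low ten bits
    return lead, trail

lead, trail = to_surrogate_pair(0x1F600)
print(hex(lead), hex(trail))
```

As a cross-check, Python's own big-endian UTF-16 encoder produces the same two code units for U+1F600.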
UTF-16


The Unicode standard permanently reserves the code
point values U+D800 to U+DFFF for UTF-16 encoding of the
lead and trail surrogates, and they will never be assigned
a character, so there should be no reason to encode them.
UTF-16
Use in major operating systems and environments:
• UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE.
• UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.
• Symbian OS used in Nokia S60 handsets and Sony Ericsson UIQ handsets uses UTF-16.
• J2SE 5.0 version of Java uses UTF-16.
UTF-32


UTF-32 is a character encoding to encode Unicode
characters that uses exactly 32 bits (four octets) per
Unicode code point.



It uses fixed-length encodings, where all other Unicode
transformation formats use variable-length encodings.



Advantage :

The main advantage of UTF-32, versus variable length encodings, is that
the Unicode code points are directly indexable.
UTF-32
Disadvantage:
The main disadvantage of UTF-32 is that it is space inefficient,
using four bytes per code point. Non-BMP characters are so rare
in most texts that they may as well be considered non-existent for
sizing purposes, making UTF-32 twice the size of UTF-16 and up to
four times the size of UTF-8.

Though a fixed number of bytes per code point appears convenient, it
is not as useful as it appears.

It does not make it faster to find a particular offset in the
string, as an "offset" can be measured in the fixed-size code units of
any encoding. It does not make calculating the displayed width of a
string easier except in limited cases, since even with a “fixed width”
font there may be more than one code point per character position or
more than one character position per code point.

For these reasons, UTF-32 is very rarely used.
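
Both points can be seen in a short Python sketch. The sample strings are assumptions chosen for illustration:

```python
text = "hello"
size_utf8 = len(text.encode("utf-8"))        # one byte per ASCII character
size_utf16 = len(text.encode("utf-16-le"))   # two bytes per BMP character
size_utf32 = len(text.encode("utf-32-le"))   # always four bytes per code point
print(size_utf8, size_utf16, size_utf32)

# "e" followed by COMBINING ACUTE ACCENT: two code points, one displayed character.
accented = "e\u0301"
code_point_count = len(accented)
print(code_point_count, len(accented.encode("utf-32-le")))
```

The explicit "-le" codec names are used so that Python does not prepend a byte order mark, which would otherwise add to the byte counts.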
Now, let’s see a problem
• Computers speak different languages, like people!
• Some write data "left-to-right" and others "right-to-left".
• A machine can read its own data just fine; problems happen when one computer stores data and a different type tries to read it.
Then, what may be the solution?
• Agree to a common format (i.e., all network traffic follows a single format), or
• Always include a header that describes the format of the data. If the header appears backwards, it means data was stored in the other format and needs to be converted.
Which one is taken?


Practically, there's no rule that all computers must use
the same language, just like there's no rule all humans
need to. Each type of computer is internally consistent (it
can read back its own data), but there are no guarantees
about how another type of computer will interpret the
data it created.



So, the solution is to use a header.



In Unicode, this header is known as Byte Order Mark, or
BOM.
Byte Order Mark (BOM)
What is a BOM?
• The byte order mark (BOM) is a Unicode character used to signal the “endianness” of a text file or stream.
• It consists of the character code U+FEFF at the beginning of a data stream.
What does ‘endian’ mean?
• Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last.
• The former is called big-endian, the latter little-endian.
• Therefore, it can be said that the terms endian and endianness refer to how the bytes of a data word are ordered within memory.
Explanation
Each byte of memory is associated with an index,
called its address, which indicates its position.
Bytes of a single data word (such as a 32 bit
integer datatype) are generally stored in
consecutive memory addresses (a 32 bit integer
needs 4 such locations). Big-endian systems are
systems in which the most significant byte of the
word is stored in the smallest address given and
the least significant byte is stored in the largest.
In contrast, little endian systems are those in
which the least significant byte is stored in the
smallest address.
Example
Say the data word was "0A 0B 0C 0D" (a set of 4 bytes) and
memory addresses starting at a with offsets 0, 1, 2 and 3 are
given. Then, in big endian systems, byte 0A is placed in offset 0,
0B in 1, 0C in 2 and 0D in 3. In little-endian systems, the order is
the reverse of it (see the diagram).
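
The same example can be tried in Python, where int.to_bytes lays out a number in either byte order. The variable names are hypothetical:

```python
word = 0x0A0B0C0D

big = word.to_bytes(4, byteorder="big")        # most significant byte at offset 0
little = word.to_bytes(4, byteorder="little")  # least significant byte at offset 0

for offset in range(4):
    print(offset, f"{big[offset]:02X}", f"{little[offset]:02X}")
```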
Where is a BOM useful?
• A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big- or little-endian format.
• The BOM character may also indicate which of the several Unicode representations the text is encoded in.
When is a BOM used?


BOM use is optional.



The Unicode Standard permits the BOM in UTF-8, but does
not require or recommend its use.



In UTF-16, a BOM may be placed as the first character of a
file or character stream to indicate the endianness (byte
order) of all the 16-bit code units of the file or stream.



Although a BOM could be used with UTF-32, this encoding
is rarely used for transmission. Otherwise the same rules
as for UTF-16 are applicable.
Representations of byte order marks
by encoding
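
The byte sequences can be listed with the constants in Python's standard codecs module. This is a sketch that prints the values Python defines; it is not a reproduction of the original slide's table:

```python
import codecs

# U+FEFF as it appears on disk in each encoding form and byte order.
boms = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 (big-endian)": codecs.BOM_UTF16_BE,
    "UTF-16 (little-endian)": codecs.BOM_UTF16_LE,
    "UTF-32 (big-endian)": codecs.BOM_UTF32_BE,
    "UTF-32 (little-endian)": codecs.BOM_UTF32_LE,
}

for name, bom in boms.items():
    print(f"{name:24}{bom.hex(' ').upper()}")
```

Each entry is simply the character U+FEFF encoded in that form, so a reader that finds FF FE instead of FE FF at the start of UTF-16 data knows the bytes are in little-endian order.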
THAT’S IT FOR TODAY!!!

THANK YOU
ALL

Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 

Unicode Fundamentals

  • 2. Presenters: Md. Sami Hassan (Ek-29); Md. Nazmul Islam (SH-64); Nafi Md. Kamrul Haque Chowdhury Sadique (SH-04)
  • 4. What is a Character? A character is the smallest unit of writing that is capable of conveying information.
  • 5. Examples of characters
      - Letters
      - Digits
      - Hyphen
      - Mathematical symbols
      - Punctuation
      - Control characters (typically not visible)
  • 6. What is Character Encoding?
      - The set of available characters is called a character repertoire.
      - The location of a given character within a repertoire is known as its code position, or code point.
      - The method of representing a code point in digital form within a given repertoire is called the character encoding.
  • 7. What is Character Encoding? Example: in the repertoire ISO 8859-1,
      - The "code point" of "Latin capital letter A" is 0x41 ("0x" indicating that 41 is hexadecimal).
      - The "encoding" of "A" is the single byte 01000001 (0x41 written in binary).
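The example on slide 7 can be checked with a few lines of Python. This is only a small sketch; the variable names are illustrative and not part of the slides.

```python
# Code point and single-byte encoding of "A" in ISO 8859-1 (Latin-1).
ch = "A"

code_point = ord(ch)                 # position of the character in the repertoire
encoded = ch.encode("iso-8859-1")    # the byte(s) actually stored
bits = format(encoded[0], "08b")     # that byte written as eight binary digits

print(hex(code_point), bits)
```

For characters in ISO 8859-1 the code point and the stored byte have the same numeric value, which is why the two are easy to confuse.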
  • 8. What is Character Encoding? So, in general, we can say that character encoding is the way that letters, digits and other symbols are expressed as binary values that a computer can understand.
  • 9. Why is Character Encoding required? Computers at their most basic level just deal with binary numbers, 0 and 1. They store letters, numerals and other characters by assigning a binary number to each one.
  • 10. Pre-Unicode Character Repertoires
      - EBCDIC
      - ASCII (American Standard Code for Information Interchange)
      - US-ASCII (U.S. version, standardised as ISO 646)
      - ISO 8859
      - ISO 8859-1 (most common version for Western languages)
      - ISO 8859-15 (a revision of ISO 8859-1 that adds the euro sign, among other changes)
  • 11. A Brief History
      - In the pre-Unicode environment, we had single-byte (8-bit) character repertoires, which limited us to a maximum of 256 characters. No single encoding could contain enough characters to cover all languages.
      - So, hundreds of different encoding systems were developed for assigning numbers to characters.
  • 12. A Brief History
      - As a result, these coding systems conflicted with each other. That is, two encodings could use the same number for two different characters, or different numbers for the same character.
      - Any given computer needed to support many different encodings.
      - Yet whenever data was passed between different encodings or platforms, that data ran the risk of corruption.
      - So, a solution to this problem was necessary.
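The conflict described on slide 12 is easy to reproduce. The sketch below picks three legacy code pages (ISO 8859-1, Windows-1251 and CP437) purely as illustrations; the slides do not name specific conflicting encodings.

```python
# Same number, different characters: byte 0xE9 under two legacy encodings.
raw = bytes([0xE9])
as_latin1 = raw.decode("iso-8859-1")   # Latin small letter e with acute
as_cp1251 = raw.decode("cp1251")       # Cyrillic small letter short i

# Same character, different numbers: e-acute under two legacy encodings.
e_acute = "\u00e9"
in_latin1 = e_acute.encode("iso-8859-1")
in_cp437 = e_acute.encode("cp437")

print(as_latin1, as_cp1251, in_latin1.hex(), in_cp437.hex())
```

A file written in one of these encodings and read in another shows the wrong letters without raising any error, which is the silent corruption the slide warns about.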
  • 13. A Brief History And so, the solution was Unicode: a character repertoire that contains most of the characters used in the languages of the world.
  • 15. What is Unicode?
      - Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
      - Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
  • 16. Where is Unicode Used?
      - The Unicode Standard has been adopted by many software and hardware vendors.
      - Most operating systems support Unicode.
      - Unicode is required for international document and data interchange.
      - Several modern standards and technologies use Unicode, such as:
          - Programming languages, such as Java, C#, Perl and Python.
          - Markup languages and other standards, such as XML, HTML, XHTML, JavaScript, LDAP and CORBA.
  • 17. How many languages are covered by Unicode?
      - The Unicode Standard encodes characters on a per-script basis.
      - That is, Unicode encodes scripts for languages, rather than the languages themselves.
      - The reason is that many scripts (especially the Latin script) are used to write a large number of languages.
      - Therefore, the easiest answer is that Unicode covers all of the languages that can be written in the scripts Unicode supports.
  • 19. How can Unicode cover this many characters?
      - As we have already seen, in the pre-Unicode environment we had single-byte (8-bit) character repertoires, which limited us to a maximum of 256 characters, i.e., code points.
      - Unicode code points, by contrast, are logically divided into 17 planes, each with 65,536 (= 2^16) code points, for a total of 1,114,112 code points.
      - In the Unicode Standard, a plane is a contiguous group of 65,536 code points.
      - Planes are identified by the numbers 0 to 16, which correspond to the possible values 0x00–0x10 of the first two positions in the six-position hexadecimal format (hhhhhh).
      - The plane used in most circumstances is the first one, Plane 0, known as the Basic Multilingual Plane, or BMP.
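Because each plane holds 2^16 code points, the plane number is simply the code point with its low 16 bits dropped. A minimal sketch (the helper name `plane_of` is made up for this illustration):

```python
PLANE_COUNT = 17
PLANE_SIZE = 2 ** 16          # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane (0..16) that a Unicode code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16   # drop the four low hex digits


total_code_points = PLANE_COUNT * PLANE_SIZE
print(plane_of(0x0058), plane_of(0xE0001), plane_of(0x10FFFD), total_code_points)
```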
  • 21. How to name a Code Point?
      - Normally, a Unicode code point is referred to by writing "U+" followed by its hexadecimal number.
      - For code points in the Basic Multilingual Plane (BMP), four digits are used, e.g. U+0058 for the character LATIN CAPITAL LETTER X.
      - For code points outside the BMP, five or six digits are used, as required, e.g. U+E0001 for the character LANGUAGE TAG, or U+10FFFD for the last code point of the Plane 16 private use area.
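The "U+" notation amounts to uppercase hexadecimal padded to at least four digits. A short sketch, with `u_plus` as a hypothetical helper name:

```python
import unicodedata


def u_plus(code_point: int) -> str:
    """Format a code point in U+ notation: at least four uppercase hex digits."""
    return f"U+{code_point:04X}"


print(u_plus(ord("X")), unicodedata.name("X"))
print(u_plus(0xE0001))
print(u_plus(0x10FFFD))
```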
  • 22. Unicode and ISO
      - The version of Unicode that has been standardised by ISO (International Organization for Standardization) is called ISO 10646.
      - There are minor differences between Unicode and ISO 10646, but none significant enough to matter here.
  • 23. Mapping of Unicode characters
      - Unicode/ISO 10646 is only a repertoire.
      - Therefore, we need an encoding to go with it.
      - Those encodings are UTF and UCS.
  • 25. What are UTF and UCS?
      - Unicode defines two mapping methods: Unicode Transformation Format (UTF) encodings and Universal Character Set (UCS) encodings.
      - A Unicode Transformation Format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence.
      - The ISO 10646 standard uses the term "UCS transformation format" for UTF; the two terms are synonyms for the same concept.
  • 26. UTF vs Unicode
      - Say an application reads the following bytes from the disk: 01101000 01100101 01101100 01101100 01101111
      - The application knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user.
  • 27. UTF vs Unicode
      - The first step is to convert the binary data to numbers.
      - The app uses the UTF-8 algorithm to decode the data.
      - In this case, the decoder returns: 0x68 0x65 0x6C 0x6C 0x6F
      - Since the app knows this is a Unicode string, it can assume each number represents a character's code point.
      - So, it uses the Unicode character repertoire to translate each number to the corresponding character.
      - The resulting string is: hello
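The two steps on slides 26 and 27 can be traced in Python. This is a sketch of the idea only; Python's built-in UTF-8 decoder stands in for "the app", and the variable names are illustrative.

```python
# The five bytes from slide 26, written as 8-bit binary groups.
bit_groups = "01101000 01100101 01101100 01101100 01101111".split()

# Step 1: binary data -> bytes on disk.
data = bytes(int(group, 2) for group in bit_groups)

# Step 2: UTF-8 decoding turns the bytes into code points (a str in Python).
text = data.decode("utf-8")
code_points = [hex(ord(ch)) for ch in text]

print(data.hex(), code_points, text)
```

Every byte here is below 0x80, so each one decodes to exactly one code point; with non-ASCII text, several bytes would combine into a single code point.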
  • 28. UTF vs Unicode Therefore,
      - UTF is an encoding used to translate binary data into numbers and vice versa.
      - Unicode is a character repertoire used to translate numbers into characters and vice versa.
  • 29. UTF Formats UTF encodings include:
      - UTF-1: a retired predecessor of UTF-8 that maximizes compatibility with ISO 2022; no longer part of The Unicode Standard.
      - UTF-7: a 7-bit encoding sometimes used in e-mail, often considered obsolete; not part of The Unicode Standard.
      - UTF-EBCDIC: an 8-bit variable-width encoding similar to UTF-8, but designed for compatibility with EBCDIC; not part of The Unicode Standard.
  • 30. UTF Formats
      - UTF-8: an 8-bit variable-width encoding which maximizes compatibility with ASCII, used by 77.3% of all websites.
      - UTF-16: a 16-bit variable-width encoding, used by less than 0.1% of all websites.
      - UTF-32: a 32-bit fixed-width encoding, very rarely used.
  • 31. UTF Formats
      - So, at present, the Unicode Standard defines three encoding forms that allow the same character data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit).
      - These three encoding forms are called UTF-8, UTF-16 and UTF-32 respectively.
  • 32. UTF-8
      - UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It encodes each of the 1,112,064 valid code points using one to four 8-bit bytes (termed "octets" in the Unicode Standard).
      - It has the following properties:
          1. Characters U+0000 to U+007F (ASCII) are encoded as a single byte 0x00 to 0x7F. This means UTF-8 is fully compatible with ASCII.
          2. All characters greater than U+007F are encoded as a sequence of several bytes, all of which are above 0x7F (so none is an ASCII byte). This makes it unambiguous whether a byte belongs to a multi-byte character or is an ASCII character.
          3. The first byte of a multi-byte sequence (representing a non-ASCII character) is always in the range 0xC2 to 0xF4. All further bytes in the sequence are in the range 0x80 to 0xBF. This makes the boundaries of multi-byte characters unambiguous (in fact, the first byte also encodes how many bytes follow for the character).
          4. The bytes 0xC0, 0xC1 and 0xF5 to 0xFF (which includes 0xFE and 0xFF) never appear in valid UTF-8.
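The properties above can be observed directly. In the sketch below, the four sample characters and the helper `utf8_length_from_lead` are choices made for this illustration; the helper reads the sequence length off the first byte, assuming the one-to-four-byte form of UTF-8 described on the slide.

```python
def utf8_length_from_lead(byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if byte <= 0x7F:
        return 1                      # ASCII, single byte
    if 0xC2 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF4:
        return 4
    raise ValueError("continuation byte or byte never used as a lead byte")


def is_continuation(byte: int) -> bool:
    """Bytes after the first in a multi-byte sequence are always 0x80..0xBF."""
    return 0x80 <= byte <= 0xBF


# One sample per sequence length: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X}", hex_bytes, utf8_length_from_lead(encoded[0]))
```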
  • 33. UTF-8
      - As mentioned before, UTF-8 is the most popular encoding form for Unicode.
      - The reason is that all ASCII characters are encoded as a single byte in UTF-8, which makes it not only fully backward compatible with ASCII but also space efficient for US and many European users.
      - In general, compared with the encodings it replaces, UTF-8 costs no extra space for US-ASCII, only a few percent more for ISO 8859-1 (also known as Latin-1, which covers most West European languages), about 50% more for Chinese/Japanese/Korean (three bytes per character instead of two) and about 100% more for Greek and Cyrillic (two bytes per character instead of one).
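A rough way to see these ratios is to encode the same text both ways and compare byte counts. The sample strings and the legacy encodings chosen for comparison (ISO 8859-1, ISO 8859-7, UTF-16) are assumptions for this sketch, not figures from the slides.

```python
def size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


ascii_text = "hello"
latin_text = "caf\u00e9"                                        # "cafe" with e-acute
greek_text = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # Greek word, 8 letters
cjk_text = "\u65e5\u672c\u8a9e"                                 # three CJK ideographs

print(size(ascii_text, "ascii"), size(ascii_text, "utf-8"))
print(size(latin_text, "iso-8859-1"), size(latin_text, "utf-8"))
print(size(greek_text, "iso8859_7"), size(greek_text, "utf-8"))
print(size(cjk_text, "utf-16-le"), size(cjk_text, "utf-8"))
```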
  • 34. UTF-16
      - Like UTF-8, UTF-16 is a variable-width encoding that can represent every character in the Unicode character set.
      - The difference is that it encodes the 1,112,064 valid code points using one or two 16-bit code units (two or four octets).
      - Properties:
          - The Unicode code space is divided into seventeen planes of 2^16 code points each.
          - The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane, or BMP. UTF-16 encodes code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.
  • 35. UTF-16
      - Code points from the other planes (called supplementary planes) are encoded in UTF-16 by a pair of 16-bit code units called a surrogate pair, using the following scheme:
          o 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
          o The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit, or lead surrogate, which will be in the range 0xD800..0xDBFF.
          o The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit, or trail surrogate, which will be in the range 0xDC00..0xDFFF.
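The surrogate-pair scheme above translates almost line for line into code. The function name `to_surrogate_pair` is made up for this sketch, and U+1F600 is just a convenient supplementary-plane example.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000          # 20-bit number, 0..0xFFFFF
    lead = 0xD800 + (offset >> 10)         # top ten bits
    trail = 0xDC00 + (offset & 0x3FF)      # low ten bits
    return lead, trail


lead, trail = to_surrogate_pair(0x1F600)
print(f"{lead:04X} {trail:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
builtin = "\U0001F600".encode("utf-16-be")
print(builtin.hex())
```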
  • 36. UTF-16 The Unicode Standard permanently reserves the code point values U+D800 to U+DFFF for UTF-16 encoding of the lead and trail surrogates. They will never be assigned a character, so there should be no reason to encode them.
  • 37. UTF-16 Use in major operating systems and environments:
      - UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE.
      - UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.
      - Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UTF-16.
      - The J2SE 5.0 version of Java uses UTF-16.
  • 38. UTF-32
      - UTF-32 is a character encoding for Unicode that uses exactly 32 bits (four octets) per code point.
      - It is a fixed-length encoding, whereas all other Unicode transformation formats are variable-length.
      - Advantage: the main advantage of UTF-32 over variable-length encodings is that Unicode code points are directly indexable.
  • 39. UTF-32
      - Disadvantage: the main disadvantage of UTF-32 is that it is space inefficient, using four bytes per code point. Non-BMP characters are so rare in most texts that they can be ignored for sizing purposes, which makes UTF-32 about twice the size of UTF-16 and up to four times the size of UTF-8.
      - Although a fixed number of bytes per code point appears convenient, it is not as useful as it seems.
      - It does not make it faster to find a particular offset in a string, since an "offset" can be measured in the fixed-size code units of any encoding. Nor does it make calculating the displayed width of a string easier, except in limited cases: even with a "fixed width" font, there may be more than one code point per character position, or more than one character position per code point.
      - For these reasons, UTF-32 is very rarely used.
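Both points can be illustrated briefly: the size difference between the three encoding forms, and the fact that one displayed character may consist of several code points even in UTF-32. The sample strings are chosen for this sketch only.

```python
text = "hello"
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

# "e" followed by COMBINING ACUTE ACCENT: displayed as one character,
# but stored as two code points, so fixed-width code points do not
# give fixed-width displayed characters.
combined = "e\u0301"
print(len(combined), len(combined.encode("utf-32-le")))
```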
  • 40. Now, let's see a problem
      - Computers speak different languages, like people!
      - Some write data "left-to-right" and others "right-to-left". A machine can read its own data just fine; problems happen when one computer stores data and a different type of computer tries to read it.
  • 41. Then, what may be the solution?
      - Agree on a common format (i.e., all network traffic follows a single format), or
      - Always include a header that describes the format of the data. If the header appears backwards, the data was stored in the other format and needs to be converted.
  • 42. Which one is taken?
      - In practice, there is no rule that all computers must use the same language, just as there is no rule that all humans must. Each type of computer is internally consistent (it can read back its own data), but there are no guarantees about how another type of computer will interpret the data it created.
      - So, the solution is to use a header.
      - In Unicode, this header is known as the Byte Order Mark, or BOM.
  • 44. What is a BOM?
      - The byte order mark (BOM) is a Unicode character used to signal the "endianness" of a text file or stream.
      - It consists of the code point U+FEFF placed at the beginning of a data stream.
  • 45. What does 'endian' mean?
      - Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last.
      - The former is called big-endian, the latter little-endian.
      - Therefore, the terms endian and endianness refer to how the bytes of a data word are ordered within memory.
  • 46. Explanation Each byte of memory is associated with an index, called its address, which indicates its position. The bytes of a single data word (such as a 32-bit integer) are generally stored in consecutive memory addresses (a 32-bit integer needs 4 such locations). Big-endian systems store the most significant byte of the word at the smallest address and the least significant byte at the largest. In contrast, little-endian systems store the least significant byte at the smallest address.
  • 47. Example Say the data word is "0A 0B 0C 0D" (a set of 4 bytes) and memory addresses starting at a with offsets 0, 1, 2 and 3 are given. Then, in big-endian systems, byte 0A is placed at offset 0, 0B at 1, 0C at 2 and 0D at 3. In little-endian systems the order is reversed: 0D is placed at offset 0, 0C at 1, 0B at 2 and 0A at 3.
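The same example can be reproduced with Python's integer-to-bytes conversion, which lets the byte order be chosen explicitly. The variable names are illustrative.

```python
word = 0x0A0B0C0D

big = word.to_bytes(4, "big")         # most significant byte at offset 0
little = word.to_bytes(4, "little")   # least significant byte at offset 0

for offset in range(4):
    print(offset, f"{big[offset]:02X}", f"{little[offset]:02X}")

# Reading little-endian bytes as if they were big-endian gives a different number.
misread = int.from_bytes(little, "big")
print(hex(misread))
```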
  • 48. Where is a BOM useful?
      - A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big-endian or little-endian format.
      - The BOM may also indicate which of the several Unicode representations the text is encoded in.
  • 49. When is a BOM used?
      - BOM use is optional.
      - The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use.
      - In UTF-16, a BOM may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
      - A BOM can also be used with UTF-32, following the same rules as for UTF-16, although this encoding is rarely used for transmission.
  • 50. Representations of byte order marks by encoding
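The byte sequences for U+FEFF in each encoding form are available as constants in Python's standard `codecs` module. The sketch below prints them and adds a small detector; `sniff_bom` is a hypothetical helper written for this illustration, not a standard function.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


for bom, name in BOMS:
    print(name, " ".join(f"{b:02X}" for b in bom))

print(sniff_bom("\ufeffhi".encode("utf-16-le")))
```

One caveat of this simple approach: a UTF-16 little-endian stream whose first real character is U+0000 would be reported as UTF-32, so a BOM is a hint rather than proof of the encoding.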
  • 51. THAT’S IT FOR TODAY!!! THANK YOU ALL