New Privacy Technologies for
Unicode and International Data
Standards
2
Agenda
• Old approaches, the journey and the new approach
• Performance and memory footprint
• Portability, security, and language preservation
• Granular protection for all Unicode languages and customizable alphabets
• Byte length preserving and character preserving protection
• Data privacy and security
• Use cases
3
3
Major Issues
Protecting the increasing use of international Unicode characters is required by a growing number of privacy laws in many countries and by general privacy concerns about private data.
Less granular old approach with multiple conversion steps:
1. Old approaches to protect International Unicode characters will increase the size and change the data formats
• This will break many applications and slow down business operations
2. Old approaches are randomly returning data in new and unexpected languages
4
Preserving Language, Size, and Performance
5
Unicode standard for consistent encoding defines
• 143,859 characters
• 154 modern and historic scripts and languages
Increasingly common to mix multiple languages in the same input string
6
Unicode standard
• 154 modern and historic scripts and languages
7
katakana
Kanji group 1
romaji
Hiragana
Punctuation
Japanese Address Label with 5 Different Languages
Jean XYZ-ABC / 高 ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234
5 Different Japanese Unicode Character Sets: JISX0208 1990, Unicode 5.0, MS Std JPN, JISX0212 1990, Windows 31
8
Script UTF-8 Characters Selected 1,2 2,1 2,2 1&2 2,2,2 3,3 4,4
US ASCII 128 60
Latin 1 Suppl 60 60 4 k 4 k
Greek 128 16 k 2 mill
Cyrillic 256 256 65 k 17 mill
Cyrillic Suppl 304 92 k
Cyrillic Suppl+ 336 113 k
Asian & CJK + 13 100 k 100 k
Math 4-bytes 1 k 1 mill
# rows
8 mill
Tokenization of Unicode
J Ö R G E N
a ß x G E N
a Ü a y E N
a Ü b c d N
a Ü b e f g
a Ü j i h g
a Ä l z h g
m ö D z h g
m ä D z h g
Ю Щ Ъ Ы Ь Э
Ы Ы Ь Э Ь Э
Ы Ъ Э Ы Ь Э
Ы Ъ Ь Э Ь Э
Ы Ъ Ы Ъ Э Э
Ы Ъ Ы Ъ Э Щ
Ы Ъ Ы Э Щ Щ
Ы Ъ Ъ Э Щ Щ
Ы Ъ Э Э Щ Щ
Щ Ы Э Э Щ Щ
Щ Ы Э Э Щ Щ
Ρ ϲ ϳ ϴ ϵ
ϳ ϴ ϵ G E
ϳ Ρ ϲ ϳ E
ϳ Ρ ϲ ϳ ϴ
ϳ ϳ ϴ ϵ ϴ
ϲ ϴ ϵ ϵ ϴ
ϲ ϴ ϵ ϵ ϴ
Character Chaining for Greek
Character Chaining for Cyrillic / Russian
2- & 3-Character Chaining blocks
Character Chaining for Japanese / Chinese
Character Chaining for Special Characters
0 0 0 0 0 1 2 3 4 5 2 0 9 8 8
9 8 7 0 0 3 8 7 4 5 7 8 7 4 5
9 4 5 6 0 3 4 5 6 5 7 4 5 6 5
9 4 7 2 1 3 4 7 2 1 7 4 7 2 5
9 2 3 1 3 8 9 4 1 7 5 8 6 5
5 3 3 2 1 6 7 8 4 1 9 8 7 6 5
5 4 3 2 1 6 7 8 4 1 9 8 7 6 5
两 並 丧
Sizes
Character Chaining for Latin
Derived Table for US ASCII + Latin: 6 bytes × 8 million rows = 48 million bytes
9
Character Fabric
Tokenization of Unicode
Character Chaining
10
Unicode Project
POC v1 POC v2 Product
A 2 Basic lookup tables (SLT) - tokenization of everything in the table UTF8 y y
B 4 Basic lookup tables y y
C Basic lookup tables based on one codepage (e.g. Latin, Greek: https://unicode.org/charts/) y y
D Basic lookup tables based on multiple codepages (e.g. Latin-1 Supplement + Greek Extended + etc.: https://unicode.org/charts/) y y
E Lookup tables based on specific user-selected characters y y
F 1. Tokenize all Unicode space. (Equivalent to no. 3: 4 basic lookup tables.)
G 2a. Tokenize using UTF byte-groups (1-byte, 2-byte). (Equivalent to no. 4: basic lookup tables on one codepage.)
H 2b. Tokenize using UTF byte-groups (1-byte, 2-byte, 3-byte, 4-byte). (Equivalent to no. 3: 4 basic lookup tables.)
I 3. Tokenize using charsets/codegroups (we first tokenize using one set, then using another, so “Denis Денис” will become “YAsKp РЪюцЙ”). There could be characters of different byte lengths in each set, as in German: [a-zA-Z0-9] = 1 byte and [öäüÖÄÜß] = 2 bytes. (Plus warn customers that the performance will be bad. Clyde has concerns about security.) y y
J 4. Tokenize using individual user-defined sets of characters that need to be protected (e.g. set no. 1 [a-zA-Z], set no. 2 [öäüÖÄÜß], or just set 1 [a-zA-ZöäüÖÄÜß]). Characters do not migrate from one custom set to another. y y
K Chaining 1,2 bytes UTF-8 sequences y y
L Chaining 2,1 bytes UTF-8 sequences y y
M Chaining 1,1 bytes UTF-8 sequences y y
N Chaining 1,1,1 bytes UTF-8 sequences y y
O UTF-8 support y y
P UTF-16 support y y
Q Portability of tokens and lookup tables UTF-8/UTF-16 y y
R Shuffle final token (for language separation option) ?
S Padding of short fields ? y
T Chaining with one IV character input y y
Other
User customizations
UTF support
Unicode tokenization feature
Derived lookup tables (merge/split tokenization)
Basic lookup tables
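The set-based options in rows I and J can be illustrated with a toy sketch. It assumes two hypothetical, disjoint character sets and one substitution table that permutes each set onto itself, so characters never migrate between sets. The seeded shuffle, the function names, and the seed value are illustrative assumptions. This is a plain per-character substitution, not the chained lookup-table scheme of the project, and it is not secure.

```python
import random

# Hypothetical, disjoint character sets modeled on the example in row J.
SETS = [
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",  # 1 byte each in UTF-8
    "öäüÖÄÜß",                                               # 2 bytes each in UTF-8
]


def build_table(sets, seed):
    """Build one substitution table in which every set is permuted onto itself."""
    rng = random.Random(seed)  # seeded PRNG for repeatability; not cryptographically secure
    table = {}
    for alphabet in sets:
        shuffled = list(alphabet)
        rng.shuffle(shuffled)
        table.update(zip(alphabet, shuffled))
    return table


def tokenize(text, table):
    """Replace each protected character; anything outside the sets passes through."""
    return "".join(table.get(ch, ch) for ch in text)


def detokenize(token, table):
    """Invert the substitution table to recover the original text."""
    inverse = {v: k for k, v in table.items()}
    return "".join(inverse.get(ch, ch) for ch in token)


table = build_table(SETS, seed=2021)
token = tokenize("Jörg", table)
print(token, detokenize(token, table))
```

Because every character in a set has the same UTF-8 length, the token keeps both the character count and the byte length of the input. The position of the umlaut remains visible, which is the leak the later slides discuss.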
11
Examples of Scripts with one to two bytes characters
12
Cyrillic
Tokenization of the Russian alphabet may include the green (dotted lines) characters and use the red characters for special purposes
13
East Asian Scripts
Examples of Scripts with three to four bytes characters
Language preservation can be achieved in groups of Scripts
• Group X: Kanji and Hiragana
• Group Y: Katakana and Punctuation
• Group Z: CJK Unified Ideographs Extension
Kana /
Kanji
Hangul
Hanzi
14
Unicode code points for the scripts can be stored in UTF-8 in one to four bytes
1) 1 byte: 128 characters (US-ASCII)
2) 2 bytes: 1,920 characters covering the Latin-script, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana, and N'Ko alphabets
3) 3 bytes: characters in common use, including most Chinese, Japanese, and Korean characters**
4) 4 bytes: less common CJK characters
** The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF and take 3 bytes in UTF-8.
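The four UTF-8 length classes can be checked directly in Python. This is a minimal sketch; the sample characters and the helper name `utf8_length` are chosen here for illustration and do not come from the slides.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative samples: ASCII, Latin-1 Supplement, Cyrillic,
# CJK Unified Ideographs, Katakana, CJK Extension B (U+20000).
samples = ["A", "ö", "Я", "高", "ブ", "\U00020000"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```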
15
Avoiding leaks about unusual characters from input
Randomization
Input string is the German name “Jörg”
Output Token “züaB”
Leaks about unusual characters from input
German umlauts: ä, ö, ü
Randomization Randomization Randomization
J ö r g
z ü a B
16
Avoiding leaks about unusual characters from input
Approach 1: Randomly Shuffle the output string
Randomization
Input string is the German name “Jörg”
Output Token “züaB”
Leaks about unusual character in specific position of input
Randomization Randomization Randomization
J ö r g
z ü a B
Randomly
Shuffle the
output string
Final Token “azüB”: still leaks that the input contains an unusual character
17
Avoiding leaks about unusual characters from input
Approach 2: Randomize the full input string
Input string is the German name “Jörg”
• 4 characters
• 5 bytes
Output Token “züäB”
• 4 characters
• 6 bytes
Hides information about unusual characters from input
German umlauts: ä, ö, ü
Randomization
J ö r g
z ü ä B
But the length of the string increases from 5 bytes (input) to 6 bytes (output token)
18
Avoiding leaks about unusual characters from input
Randomization
Input string of 4 characters (1+2+3+4= 10 bytes)
Output Token of 4 characters (3+3+4+4=14 bytes)
Avoiding leaks about unusual characters from input
Output Token 40% longer
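The byte counts on the last two slides can be reproduced with a short sketch. The first pair uses “Jörg” and the output row z ü ä B from the Approach 2 slide. The second pair is hypothetical: the strings were picked here only to match the stated byte profiles (1+2+3+4 and 3+3+4+4) and do not appear in the deck.

```python
def byte_growth_percent(source: str, token: str) -> int:
    """Growth of the UTF-8 byte length from source to token, in whole percent."""
    before = len(source.encode("utf-8"))
    after = len(token.encode("utf-8"))
    return 100 * (after - before) // before


# Approach 2: same number of characters, one more byte.
print(len("Jörg"), len("züäB"))             # 4 4
print(byte_growth_percent("Jörg", "züäB"))  # 20

# Hypothetical strings with byte profiles 1+2+3+4 and 3+3+4+4.
source = "aö高\U00020000"
token = "高野\U00020000\U00020001"
print(byte_growth_percent(source, token))   # 40
```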
19
Preserving the number of Characters and the byte-length
Example of tokenizing 3-byte and 4-byte Unicode characters
A range of Kanji and Kana characters with different string lengths
Randomly mix characters of different length and replace the longer characters
• Prevents leaks about unusual character in input
• No length increase
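One way to state the goal of this slide is as an acceptance check on a token: it should have the same number of characters and the same UTF-8 byte length as the input. The sketch below shows such a check only. The function name `preserves_lengths` is an assumption, and the sketch does not implement the mixing and replacement step described above.

```python
def preserves_lengths(original: str, token: str) -> bool:
    """True if the token keeps both the character count and the UTF-8 byte length."""
    same_chars = len(original) == len(token)
    same_bytes = len(original.encode("utf-8")) == len(token.encode("utf-8"))
    return same_chars and same_bytes


print(preserves_lengths("Jörg", "züaB"))  # True: 4 characters, 5 bytes on both sides
print(preserves_lengths("Jörg", "züäB"))  # False: the token grows to 6 bytes
print(preserves_lengths("高野", "裕子"))    # True: 2 characters, 6 bytes on both sides
```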
20
UTF-8 is the character encoding used by 95.4% of websites
21
UTF-8: 1-byte, 2-byte, and 3-byte characters
Portability between UTF-8 and UTF-16 (used by Teradata and some other large databases): this example starts with 1-byte, 2-byte, and 3-byte characters, using three sample characters:
22
• full-width input/output
• optional input/output
katakana
Kanji group 1
romaji
Hiragana
Punctuation
Tokenization Table 3
katakana
romaji
Hiragana
Tokenization Table 4
Input string
Output string of tokens
Japanese Examples
高 え 2 ヲ
ブ 〄
野 ぉ A
ィ ル 〳
Input string
Output string of tokens
え 2 ヲ
ブ
ぉ A
ィ ル
Jean XYZ-ABC / 高 ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234
Address
katakana
romaji
Tokenization Table 5
Input string
Output string of tokens
2 ヲ
ブ
A
ィ ル
23
Distinguishable Tokens - UTF-8
256 code points 256 code points
256 code points
256 code points
256 code
points
256 code points
Hiragana
46
30 40
Kanji group 2
7 k
34 00
裕子
え ぉ お
2 A / Romaji & half-
width katakana
239
FF 00
ヲ ァ ィ
katakana
111 – 304
30 A0
ブルーノ
Punctuation
70
30 00
〳
Distinguishable Scripts
Tokenization Scripts - Japanese
Kanji group 3
43 k
2 00 00
Kanji group 1
21 k
4E 00
高野
UTF-8 Encoding: 0xF0 0xA0 0x80 0x80
UTF-8 (hex) 0xEF 0xBC 0x81
UTF-8 Encoding: 0xE3 0x90 0x80
UTF-8 Encoding: 0xE4 0xB8 0x80
UTF-8 Encoding: 0xE3 0x82 0xA0
UTF-8 (hex) 0xE3 0x81 0x81
UTF-8 Encoding: 0xE3 0x80 0x81
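The UTF-8 byte sequences listed on this slide can be reproduced from code points. In the sketch below the code points are inferred from the byte sequences themselves, which is an assumption. Which script label each one belongs to is not recoverable from the slide text, so no labels are attached.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 bytes of one code point as space-separated hex."""
    return " ".join(f"0x{b:02X}" for b in chr(code_point).encode("utf-8"))


# Code points inferred from the byte sequences on the slide (an assumption).
for cp in (0x3001, 0x3041, 0x30A0, 0x3400, 0x4E00, 0xFF01, 0x20000):
    print(f"U+{cp:04X}: {utf8_hex(cp)}")
```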
24
Distinguishable Tokens UTF-8
Group Y Distinguishable*
Data Discovery
Group Y
> # code points than in Group X
*: Distinguishable tokens project
3521 code points
Code points not in DBs
1:1 code point mapping
Group X
Customer selects scripts to use for distinguishable characters
64336 code
points
256 code points
256 code points
256 code points
256 code points
256 code points
256 code points
256 code points
239 code points 256 code points
11 184 Code points
6 111 code points
25
Portability of tokens and lookup tables based on ISO 10646
Encoding UTF-16 4-bytes Unicode
Code points in tokens are portable between UTF-8 and UTF-16 without conversion when they take up to 3 bytes in UTF-8, because each maps to a single 16-bit UTF-16 code unit. 4-byte code points need to be converted. The conversion is defined by the UTF-16 encoding of ISO 10646, including the two byte orders, UTF-16BE and UTF-16LE. Encoding a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
 U' = yy yyyy yyyy xx xxxx xxxx
 W1 = 110110 yy yyyy yyyy
 W2 = 110111 xx xxxx xxxx
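The four encoding steps translate almost line by line into code. The sketch below is a minimal illustration; the function name `utf16_code_units` is an assumption. It returns 16-bit code units and leaves the byte order (UTF-16BE or UTF-16LE) to the caller. As in the steps above, it does not reject values in the surrogate range 0xD800 to 0xDFFF.

```python
def utf16_code_units(u: int) -> list:
    """Encode one character number U as one or two 16-bit UTF-16 code units."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("character number out of range")
    # Step 1: values below 0x10000 fit in a single 16-bit unit.
    if u < 0x10000:
        return [u]
    # Step 2: U' fits in 20 bits.
    u_prime = u - 0x10000
    # Steps 3 and 4: the high 10 bits go into W1, the low 10 bits into W2.
    w1 = 0xD800 | (u_prime >> 10)
    w2 = 0xDC00 | (u_prime & 0x3FF)
    return [w1, w2]


print([hex(unit) for unit in utf16_code_units(0x9AD8)])   # 高, a 3-byte UTF-8 character
print([hex(unit) for unit in utf16_code_units(0x20000)])  # a 4-byte UTF-8 character
```

The result can be cross-checked against Python's built-in codec, for example `"\U00020000".encode("utf-16-be")`.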
26
The Token Fabric
The IV Pool
• The IV Pool is a set of pre-generated “randomized initialization vectors” to be used in different steps when creating the encoded fabric.
• Substrings of records in the IV Pool will be used in each step of the tokenization process.
27
Data Privacy and
Security
28
28
Organization’s Risk Context of Privacy
Source: Gartner
29
29
What Are Others Spending on Security?
30
GDPR under "Schrems II" – Lacking “Additional Safeguards”
https://www.jdsupra.com/legalnews/navigating-eu-data-transfers-effects-of-8348955/
• In March 2021, the Bavarian DPA found there was an unlawful transfer of personal data from a German controller to the e-mail marketing service Mailchimp in the U.S.
• Failed to assess whether any supplementary measures were needed in relation to the transfer of personal data to Mailchimp.
• In April 2021, the Portuguese DPA ordered a public authority to suspend all transfers of personal data to the U.S. and other third countries.
• The safeguards in place with Cloudflare were insufficient to protect the data (which included religious and health data), and the parties did not implement any supplementary measures to provide adequate protection for the data.
• Suspend the transfer of data to the U.S. or any other third country without first establishing adequate protection for the data.
31
https://iapp.org/news/a/why-this-french-court-decision-has-far-reaching-consequences-for-many-businesses/
GDPR under "Schrems II"
Legal safeguards:
• AWS Sarl guarantees in its contract with Doctolib, a French company, that it will challenge any general access request from a public authority.
Technical safeguards:
• Technically the data hosted by AWS Sarl is encrypted.
• AWS Sarl, a Luxembourg registered company.
• The key is held by a trusted third party in France, not by AWS.
Other guarantees taken:
• No health data.
• The data hosted relates only to the identification of individuals for the purpose of making appointments.
• Data is deleted after three months.
Doctolib
AWS Sarl
AWS will challenge any general
access request from a public
authority
32
Big Data Protection with Granular Field Level Protection for Google Cloud
Protection throughout the lifecycle of data in Hadoop
Big Data Protector tokenizes or encrypts sensitive data fields
Enterprise Policies
Policies may be managed on-prem or on Google Cloud Platform (GCP)
Policy Enforcement Point
Protected data fields
Separation of Duties
Encryption Key Managem.
Security Officer
33
Major Financial Institution Global Use Case
34
GDPR Security Requirements Framework
Encryption and
Tokenization
Discover Data
Assets
Security by
Design
Source: IBM
35
Different Data Protection Techniques
Data Store
Dynamic Masking
2-way 1-way
Format Preserving Computing on encrypted data Format Preserving
Tokenization
Format Preserving
Encryption
(FPE)
Homomorphic Encryption
(HE)
Hashing
Static
Masking
Differential Privacy
(DP)
K-anonymity Model
Random Algorithmic Noise Added
Fast Slow Very Slow Fast Fast
Fastest
Clear Text
Synthetic Data
Derivation
Fast
Anonymization
Of Attributes
Pseudonymization
Of Identifiers
36
Different Data Protection Techniques
37
Attacks on Data
38
Data Protection Integration Layers
39
Data Protection Integration Layers
40
Data Protection Integration Layers
41
Data Protection Integration Layers
42
Multi-Cloud Considerations
43
Security Compliance
Privacy
Controls
Regulations
Policies
GDPR
CCPA/CPRA
Data
Security
PCI DSS
HIPAA
Identity and
Access
Management
Risk Management
Data
Privacy
How, What and Why
Balance
RBAC,
MAC,
DAC, and
Attribute
Based
Access
Control
Threats
Industry
Standards
US NIST
Ransomware
Zero Trust
Architecture (ZTA)
Data
Security
Policy
Data Security and
US NIST Zero Trust
Architecture (ZTA)
44
Privacy Standards
11 Published International Privacy Standards (ISO)
Techniques
Management
Cloud
Framework
Impact
Requirements
Process
20889 IS Privacy enhancing de-identification terminology and classification of techniques
27701 IS Security techniques - Extension to ISO/IEC 27001 and ISO/IEC 27002 for privacy information management - Requirements and guidelines
27018 IS Code of practice for protection of PII in public clouds acting as PII processors
29100 IS Privacy framework
29101 IS Privacy architecture framework
29134 IS Guidelines for Privacy impact assessment
29190 IS Privacy capability assessment model
29191 IS Requirements for partially anonymous, partially unlinkable authentication
29151 IS Code of Practice for PII Protection
19608 TS Guidance for developing security and privacy functional requirements based on 15408
27550 TR Privacy engineering for system life cycle processes
45
45
Thank You

 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Último (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Jun 29 new privacy technologies for unicode and international data standards 2021 jun28b

  • 1. 1 1 New Privacy Technologies for Unicode and International Data Standards
  • 2. 2 Agenda • Old approaches, the journey and the new approach • Performance and memory footprint • Portability, security, and language preservation • Granular protection for all Unicode languages and customizable alphabets • Byte length preserving and character preserving protection • Data privacy and security • Use cases
  • 3. 3 3 Major Issues. Protecting the growing use of international Unicode characters is required by a growing number of privacy laws in many countries and by general privacy concerns about private data. The less granular old approach uses multiple conversion steps: 1. Old approaches to protecting international Unicode characters increase the size and change the format of the data • This breaks many applications and slows down business operations 2. Old approaches randomly return data in new and unexpected languages
  • 5. 5 Unicode standard for consistent encoding defines • 143,859 characters • 154 modern and historic scripts and languages Increasingly common to mix multiple languages in the same input string
  • 7. 7 katakana Kanji group 1 romaji Hiragana Punctuation Japanese Address Label with 5 Different Languages Jean XYZ-ABC / 高 ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234 JISX0208 1990 Unicode 5.0 MS Std JPN JISX0212 1990 Windows 31 5 Different Japanese Unicode Character Sets
  • 8. 8 Tokenization of Unicode. Table header: Script UTF-8 Characters Selected 1,2 2,1 2,2 1&2 2,2,2 3,3 4,4. Rows: US ASCII 128 60; Latin 1 Suppl 60 60 4 k 4 k; Greek 128 16 k 2 mill; Cyrillic 256 256 65 k 17 mill; Cyrillic Suppl 304 92 k; Cyrillic Suppl+ 336 113 k; Asian & CJK + 13 100 k 100 k; Math 4-bytes 1 k 1 mill; # rows 8 mill. [Figure: 2- & 3-character chaining blocks, with character chaining examples for Latin (J Ö R G E N), Greek, Cyrillic / Russian, Japanese / Chinese (两 並 丧), and special characters.] Sizes: derived table for US ASCII + Latin: 6 bytes × 8 million rows = 48 million bytes.
  • 9. 9 Character Fabric Tokenization of Unicode Character Chaining
  • 10. 10 Unicode Project POC v1 POC v2 Product A 2 Basic lookup tables (SLT) - tokenization of everything in the table UTF8 y y B 4 Basic lookup tables y y C Basic lookup tables based on one codepage (e.g. Latin, Greek: https://unicode.org/charts/) y y D Basic lookup tables based on multiple codepages (e.g. Latin-1 Supplement + Greek Extended + etc: https://unicode.org/charts/) y y E Lookup tables based on specific user-selected characters y y F 1. Tokenize all Unicode space. (Equal to no. 3: 4 basic lookup tables.) G 2a. Tokenize using UTF byte-groups (1-byte, 2-byte). (Equal to no. 4: basic lookup tables on one codepage) H 2b. Tokenize using UTF byte-groups (1-byte, 2-byte, 3-byte, 4-byte). (Equal to no. 3: 4 basic lookup tables.) I 3. Tokenize using charsets/codegroups (we first tokenize using one set, then using another, so “Denis Денис“ will become “YAsKp РЪюцЙ”). There could be characters of different byte-lengths in each set as in German: [a-zA-Z0-9]=1byte and [öäüÖÄÜß]=2byte. (Plus warn customers that the performance will be bad. Clyde has concerns about security.) y y J 4. Tokenize using individual user-defined sets of characters that need to be protected (e.g. set no. 1 [a-zA-Z], set no. 2 [öäüÖÄÜß] or just set 1 [a-zA-ZöäüÖÄÜß]). Characters do not migrate from one custom set to another. y y K Chaining 1,2 bytes UTF-8 sequences y y L Chaining 2,1 bytes UTF-8 sequences y y M Chaining 1,1 bytes UTF-8 sequences y y N Chaining 1,1,1 bytes UTF-8 sequences y y O UTF-8 support y y P UTF-16 support y y Q Portability of tokens and lookup tables UTF-8/UTF-16 y y R Shuffle final token (for language separation option) ? S Padding of short fields ? y T Chaining with one IV character input y y Other User customizations UTF support Unicode tokenization feature Derived lookup tables (merge/split tokenization) Basic lookup tables
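As an illustration of item J above (user-defined character sets, where characters do not migrate from one set to another), the following is a minimal Python sketch. The alphabets, the seed, and the names `build_tables` and `substitute` are hypothetical. It uses a static one-to-one substitution per set only to show the "no migration" property; it is not the chained lookup-table scheme listed on the slide and it is not a secure tokenizer.

```python
import random

# Hypothetical user-defined sets: set 1 is Latin letters, set 2 is Russian letters.
ALPHABETS = [
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "абвгдежзийклмнопрстуфхцчшщъыьэюяАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
]

def build_tables(alphabets, seed):
    """Build one shuffled substitution per alphabet (toy stand-in for lookup tables)."""
    rng = random.Random(seed)
    forward = {}
    for alphabet in alphabets:
        shuffled = list(alphabet)
        rng.shuffle(shuffled)
        # Each character maps to a character of the same alphabet only.
        forward.update(zip(alphabet, shuffled))
    reverse = {token: clear for clear, token in forward.items()}
    return forward, reverse

def substitute(table, text):
    """Replace characters found in the table; pass everything else through."""
    return "".join(table.get(ch, ch) for ch in text)

forward, reverse = build_tables(ALPHABETS, seed=2021)
token = substitute(forward, "Denis Денис")
restored = substitute(reverse, token)
```

In this sketch the Latin part of the token stays Latin and the Cyrillic part stays Cyrillic, which is the language-preservation property the slide describes; the space is outside both sets and is left unchanged.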
  • 11. 11 Examples of Scripts with one to two bytes characters
  • 12. 12 Cyrillic Tokenization of the Russian alphabet may include the green (dotted lines) characters and use the red characters for special purposes
  • 13. 13 East Asian Scripts Examples of Scripts with three to four bytes characters Language preservation can be achieved in groups of Scripts • Group X: Kanji and Hiragana • Group Y: Katakana and Punctuation • Group Z: CJK Unified Ideographs Extension Kana / Kanji Hangul Hanzi
  • 14. 14 Unicode code points for the scripts can be stored in UTF-8 in one to four bytes: 1) 128 characters (US-ASCII). 2) 1,920 characters covering the Latin-script, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets. 3) Characters in common use, including most Chinese, Japanese and Korean characters. 4) Less common CJK characters. (The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8.)
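The byte-length classes on this slide can be checked with a short Python sketch. The sample characters and the names `utf8_length`, `RANGES`, and `range_size` are chosen here for illustration; the code point ranges are the standard UTF-8 boundaries.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# Code point range covered by each UTF-8 byte length.
# (Surrogates U+D800..U+DFFF sit inside the 3-byte range but are not characters.)
RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}

def range_size(n_bytes):
    """How many code points fall in the range for a given byte length."""
    low, high = RANGES[n_bytes]
    return high - low + 1

samples = ["A", "ö", "Ы", "高", "\U00020000"]
lengths = [utf8_length(ch) for ch in samples]

one_byte_count = range_size(1)   # 128, the US-ASCII figure on the slide
two_byte_count = range_size(2)   # 1920, the figure quoted for the 2-byte alphabets
```

The sample list gives one ASCII letter, a German umlaut, a Cyrillic letter, a common Kanji, and a code point above U+FFFF, so `lengths` covers all four classes.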
  • 15. 15 Avoiding leaks about unusual characters from input. Input string is the German name “Jörg”. Output token “züaB” leaks information about unusual characters in the input (German umlauts: ä, ö, ü). [Figure: J ö r g → z ü a B, with a separate randomization for each character.]
  • 16. 16 Avoiding leaks about unusual characters from input. Approach 1: randomly shuffle the output string. Input string is the German name “Jörg”. Output token “züaB” leaks information about an unusual character in a specific position of the input. [Figure: J ö r g → z ü a B, with a separate randomization for each character, followed by a random shuffle of the output string.] Final token “azüB” still leaks information about an unusual character in the input.
  • 17. 17 Avoiding leaks about unusual characters from input. Approach 2: randomize the full input string. Input string is the German name “Jörg” • 4 characters • 5 bytes. Output token “züäB” • 4 characters • 6 bytes. This hides information about unusual characters in the input (German umlauts: ä, ö, ü), but the length increases from 5 bytes for the input string to 6 bytes for the token. [Figure: J ö r g → z ü ä B, with a single randomization.]
  • 18. 18 Avoiding leaks about unusual characters from input. Input string of 4 characters (1+2+3+4 = 10 bytes). Output token of 4 characters (3+3+4+4 = 14 bytes). The output token is 40% longer.
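The byte counts on slides 17 and 18 can be reproduced with a short Python sketch. The token is written here as “züäB” to match the six-byte figure, and the two four-character strings for slide 18 are hypothetical examples picked only to have the stated per-character byte lengths.

```python
def sizes(text):
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

# Slide 17: same character count, one extra byte.
name_size = sizes("Jörg")    # J=1, ö=2, r=1, g=1
token_size = sizes("züäB")   # z=1, ü=2, ä=2, B=1

# Slide 18: hypothetical strings with 1+2+3+4 and 3+3+4+4 byte characters.
mixed_input = "Aö高\U00020000"
mixed_token = "高野\U00020000\U00020001"
in_chars, in_bytes = sizes(mixed_input)
out_chars, out_bytes = sizes(mixed_token)

# Relative growth in percent, using integer arithmetic.
growth_percent = (out_bytes - in_bytes) * 100 // in_bytes
```

The character count stays at 4 in every case; only the byte length changes, which is why fixed-width columns and byte-limited fields are affected.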
  • 19. 19 Preserving the number of characters and the byte length. Example of tokenizing 3-byte and 4-byte Unicode characters: a range of Kanji and Kana characters with different string lengths. Randomly mix characters of different length and replace the longer characters • Prevents leaks about unusual characters in the input • No length increase
  • 21. 21 UTF-8: 1-byte, 2-byte, and 3-byte characters. Portability aspects between UTF-8 and UTF-16 (used by Teradata and some other large databases); this example starts with 1-byte, 2-byte, and 3-byte characters, using three sample characters:
  • 22. 22 • full-width input/output • optional input/output katakana Kanji group 1 romaji Hiragana Punctuation Tokenization Table 3 katakana romaji Hiragana Tokenization Table 4 Input string Output string of tokens Japanese Examples 高 え 2 ヲ ブ 〄 野 ぉ A ィ ル 〳 Input string Output string of tokens え 2 ヲ ブ ぉ A ィ ル Jean XYZ-ABC / 高 ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234 Address katakana romaji Tokenization Table 5 Input string Output string of tokens 2 ヲ ブ A ィ ル
  • 23. 23 Distinguishable Tokens - UTF -8 256 code points 256 code points 256 code points 256 code points 256 code points 256 code points Hiragana 46 30 40 Kanji group 2 7 k 34 00 裕子 え ぉ お 2 A / Romaji & half- width katakana 239 FF 00 ヲ ァ ィ katakana 111 – 304 30 A0 ブルーノ Punctuation 70 30 00 〳 Distinguishable Scripts Tokenization Scripts - Japanese Kanji group 3 43 k 2 00 00 Kanji group 1 21 k 4E 00 高野 UTF-8 Encoding: 0xF0 0xA0 0x80 0x80 UTF-8 (hex) 0xEF 0xBC 0x81 UTF-8 Encoding: 0xE3 0x90 0x80 UTF-8 Encoding: 0xE4 0xB8 0x80 UTF-8 Encoding: 0xE3 0x82 0xA0 UTF-8 (hex) 0xE3 0x81 0x81 UTF-8 Encoding: 0xE3 0x80 0x81
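The UTF-8 byte sequences quoted on this slide can be decoded to see which code points they represent. The Python sketch below does only that; the list name `SEQUENCES` is an assumption, and matching each decoded code point to a script block (for example U+4E00 to Kanji group 1) is a reading of the flattened slide text rather than something the slide states line by line.

```python
# Byte sequences exactly as listed on the slide, in order of appearance.
SEQUENCES = [
    bytes([0xF0, 0xA0, 0x80, 0x80]),
    bytes([0xEF, 0xBC, 0x81]),
    bytes([0xE3, 0x90, 0x80]),
    bytes([0xE4, 0xB8, 0x80]),
    bytes([0xE3, 0x82, 0xA0]),
    bytes([0xE3, 0x81, 0x81]),
    bytes([0xE3, 0x80, 0x81]),
]

# Each sequence encodes exactly one character; ord() gives its code point.
code_points = [ord(seq.decode("utf-8")) for seq in SEQUENCES]

# Human-readable form such as "U+4E00".
labels = ["U+%04X" % cp for cp in code_points]
```

The decoded values fall in the blocks the slide names by their leading hex digits (2 00 00, FF 00, 34 00, 4E 00, 30 A0, 30 40, 30 00), with the only 4-byte sequence belonging to the block above U+FFFF.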
  • 24. 24 Distinguishable Tokens UTF -8 Group Y Distinguishable* Data Discovery Group Y > # code points than in Group X *: Distinguishable tokens project 3521 code points Code points not in DBs 1:1 code point mapping Group X Customer select Scripts to use for Distinguishable characters 64336 code points 256 code points 256 code points 256 code points 256 code points 256 code points 256 code points 256 code points 239 code points 256 code points 11 184 Code points 6 111 code points
  • 25. 25 Portability of tokens and lookup tables based on ISO 10646: UTF-16 encoding of 4-byte Unicode. Code points in tokens can be mapped directly when they take up to 3 bytes in UTF-8; 4-byte code points need to be converted. The conversion is defined in the ISO 10646 specification of UTF-16 and its endian variants, UTF-16BE and UTF-16LE. Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF. 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. 2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits. 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits. 4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate. Graphically, steps 2 through 4 look like: U' = yyyyyyyyyyxxxxxxxxxx; W1 = 110110yyyyyyyyyy; W2 = 110111xxxxxxxxxx
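The four encoding steps above can be sketched in Python as follows. The function names `utf16_code_units` and `builtin_code_units` are hypothetical, and the rejection of surrogate code points is an added guard that the slide does not mention; the second function uses Python's built-in `utf-16-be` codec only as a cross-check.

```python
def utf16_code_units(u):
    """Encode one code point U as a list of 16-bit UTF-16 code units."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("U must be no greater than 0x10FFFF")
    if 0xD800 <= u <= 0xDFFF:
        # Assumption beyond the slide: surrogate values are not characters.
        raise ValueError("surrogate code points cannot be encoded")
    if u < 0x10000:
        return [u]                        # step 1: a single 16-bit unit
    u_prime = u - 0x10000                 # step 2: fits in 20 bits
    w1 = 0xD800 | (u_prime >> 10)         # steps 3-4: 10 high-order bits
    w2 = 0xDC00 | (u_prime & 0x3FF)       # steps 3-4: 10 low-order bits
    return [w1, w2]

def builtin_code_units(u):
    """Same result via Python's own UTF-16 big-endian codec."""
    raw = chr(u).encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp_units = utf16_code_units(0x4E00)      # 3 bytes in UTF-8, one unit in UTF-16
pair_units = utf16_code_units(0x20000)    # 4 bytes in UTF-8, surrogate pair
```

A code point that takes up to 3 bytes in UTF-8 maps to one 16-bit unit, while a 4-byte code point becomes two units, which is the conversion case the slide singles out.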
  • 26. 26 The Token Fabric The IV Pool • The IV Pool is a set of pre-generated “Randomized initialization vectors” to be used in different steps when creating the encoded fabric. • Substrings of records in IV Pool will be used in each step of the tokenization process.
  • 28. 28 28 Organization’s Risk Context of Privacy. Source: Gartner
  • 29. 29 29 What Are Others Spending on Security?
  • 30. 30 GDPR under "Schrems II" – Lacking “Additional Safeguards” https://www.jdsupra.com/legalnews/navigating-eu-data-transfers-effects-of-8348955/ • In March 2021, the Bavarian DPA found there was an unlawful transfer of personal data from a German controller to the e-mail marketing service Mailchimp in the U.S. • Failed to assess whether any supplementary measures were needed in relation to the transfer of personal data to Mailchimp. • In April 2021, the Portuguese DPA ordered a public authority to suspend all transfers of personal data to the U.S. and other third countries. • Cloudflare were insufficient to protect the data (which included religious and health data), and the parties did not implement any supplementary measures to provide adequate protection for the data. • Suspend the transfer of data to the U.S. or any other third country without first establishing adequate protection for the data.
  • 31. 31 GDPR under "Schrems II" https://iapp.org/news/a/why-this-french-court-decision-has-far-reaching-consequences-for-many-businesses/ Legal safeguards: • AWS Sarl guarantees in its contract with Doctolib, a French company, that it will challenge any general access request from a public authority. Technical safeguards: • Technically the data hosted by AWS Sarl is encrypted. • AWS Sarl, a Luxembourg registered company. • The key is held by a trusted third party in France, not by AWS. Other guarantees taken: • No health data. • The data hosted relates only to the identification of individuals for the purpose of making appointments. • Data is deleted after three months.
  • 32. 32 Big Data Protection with Granular Field Level Protection for Google Cloud. Protection throughout the lifecycle of data in Hadoop. Big Data Protector tokenizes or encrypts sensitive data fields. Enterprise Policies: policies may be managed on-prem or on Google Cloud Platform (GCP). Policy Enforcement Point. Protected data fields. Separation of Duties. Encryption Key Managem. Security Officer
  • 34. 34 GDPR Security Requirements Framework: Encryption and Tokenization, Discover Data Assets, Security by Design. Source: IBM
  • 35. 35 Different Data Protection Techniques Data Store Dynamic Masking 2-way 1-way Format Preserving Computing on encrypted data Format Preserving Tokenization Format Preserving Encryption (FPE) Homomorphic Encryption (HE) Hashing Static Masking Differential Privacy (DP) K-anonymity Model Random Algorithmic Noise Added Fast Slow Very Slow Fast Fast Fastest Clear Text Synthetic Data Derivation Fast Anonymization Of Attributes Pseudonymization Of Identifiers
  • 43. 43 Security Compliance Privacy Controls Regulations Policies GDPR CCPA/CPRA Data Security PCI DSS HIPAA Identity and Access Management Risk Management Data Privacy How, What and Why Balance RBAC, MAC, DAC, and Attribute Based Access Control Threats Industry Standards US NIST Ransomware Zero Trust Architecture (ZTA) Data Security Policy Data Security and US NIST Zero Trust Architecture (ZTA)
  • 44. 44 Privacy Standards: 11 Published International Privacy Standards (ISO). Techniques Management Cloud Framework Impact Requirements Process. 20889 IS Privacy enhancing de-identification terminology and classification of techniques; 27701 IS Security techniques - Extension to ISO/IEC 27001 and ISO/IEC 27002 for privacy information management - Requirements and guidelines; 27018 IS Code of practice for protection of PII in public clouds acting as PII processors; 29100 IS Privacy framework; 29101 IS Privacy architecture framework; 29134 IS Guidelines for Privacy impact assessment; 29190 IS Privacy capability assessment model; 29191 IS Requirements for partially anonymous, partially unlinkable authentication; 29151 IS Code of Practice for PII Protection; 19608 TS Guidance for developing security and privacy functional requirements based on 15408; 27550 TR Privacy engineering for system life cycle processes