Protecting the growing volume of international Unicode characters is required by an increasing number of privacy laws in many countries and by general privacy concerns about private data. Current approaches to protecting international Unicode characters increase the size of the data and change its format. This breaks many applications and slows down business operations. Current approaches can also randomly return data in new and unexpected languages. A new approach with significantly higher performance and a smaller memory footprint can be customized to fit on small IoT devices.
We will discuss new approaches that achieve portability, security, performance, a small memory footprint and language preservation for privacy protection of Unicode data. These approaches provide granular protection for all Unicode languages, customizable alphabets, and byte-length-preserving protection of the protected characters.
Old Approaches
Major Issues
Protecting the growing volume of international Unicode characters is required by an increasing number of privacy laws in many countries and by general privacy concerns about private data.
Old approaches to protecting international Unicode characters typically increase the size of the data and change its format.
This breaks many applications and slows down business operations. Some old approaches also randomly return data in new and unexpected languages.
2. 2
Agenda
• Old approaches, the journey and the new approach
• Performance and memory footprint
• Portability, security, and language preservation
• Granular protection for all Unicode languages and customizable alphabets
• Byte length preserving and character preserving protection
• Data privacy and security
• Use cases
3. 3
Major Issues
Protecting the growing volume of international Unicode characters is required by an increasing number of privacy laws in many countries and by general privacy concerns about private data.
Less granular old approach with multiple conversion steps:
1. Old approaches to protecting international Unicode characters increase the size of the data and change its format
• This breaks many applications and slows down business operations
2. Old approaches randomly return data in new and unexpected languages
5. 5
The Unicode standard for consistent encoding defines
• 143,859 characters
• 154 modern and historic scripts and languages
It is increasingly common to mix multiple languages in the same input string
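As a rough illustration of mixed-script input, the sketch below lists the scripts found in one short string. Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name is used as an approximate label; that heuristic and the helper name `scripts_in` are assumptions made for this example.

```python
import unicodedata

def scripts_in(text):
    """Return approximate script labels in order of first appearance.

    The label is the first word of the character's Unicode name
    (e.g. "LATIN", "CJK", "KATAKANA"); this is a heuristic, not the
    official Unicode Script property.
    """
    seen = []
    for ch in text:
        if ch.isspace():
            continue
        label = unicodedata.name(ch, "UNKNOWN").split()[0]
        if label not in seen:
            seen.append(label)
    return seen

sample = "Jean 高 ブルノ"
print(scripts_in(sample))
```

A tokenizer that wants to preserve language has to make this kind of per-character decision before it can pick the right lookup table.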
7. 7
Japanese Address Label with 5 Different Languages
Katakana, Kanji group 1, Romaji, Hiragana, Punctuation
Jean XYZ-ABC / 高 ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234
5 Different Japanese Unicode Character Sets
JISX0208 1990, Unicode 5.0, MS Std JPN, JISX0212 1990, Windows 31
8. 8
Script UTF-8 Characters Selected 1,2 2,1 2,2 1&2 2,2,2 3,3 4,4
US ASCII 128 60
Latin 1 Suppl 60 60 4 k 4 k
Greek 128 16 k 2 mill
Cyrillic 256 256 65 k 17 mill
Cyrillic Suppl 304 92 k
Cyrillic Suppl+ 336 113 k
Asian & CJK + 13 100 k 100 k
Math 4-bytes 1 k 1 mill
# rows
8 mill
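The rounded counts in the table appear consistent with one lookup-table row per possible block, that is, the product of the alphabet sizes at each chained position (for example 128 × 128 ≈ 16 k). The sketch below reproduces that arithmetic; which alphabet sizes belong to which column is my reading of the flattened table, not something the slide states.

```python
import math

def chain_table_rows(alphabet_sizes):
    """Rows needed for a lookup table over blocks of chained characters:
    one row per combination, i.e. the product of the alphabet sizes."""
    return math.prod(alphabet_sizes)

# Alphabet sizes are taken from the table; pairing them with chaining
# columns is an assumption.
examples = {
    "60 selected ASCII x 60 selected Latin-1": [60, 60],
    "Greek, 2 chained": [128, 128],
    "Greek, 3 chained": [128, 128, 128],
    "Cyrillic, 2 chained": [256, 256],
    "Cyrillic, 3 chained": [256, 256, 256],
    "Cyrillic Suppl, 2 chained": [304, 304],
    "Cyrillic Suppl+, 2 chained": [336, 336],
}
for label, sizes in examples.items():
    print(f"{label}: {chain_table_rows(sizes):,} rows")
```

The growth is exponential in the chain length, which is why the choice of alphabet size and block length dominates the memory footprint.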
Tokenization of Unicode
J Ö R G E N
a ß x G E N
a Ü a y E N
a Ü b c d N
a Ü b e f g
a Ü j i h g
a Ä l z h g
m ö D z h g
m ä D z h g
Ю Щ Ъ Ы Ь Э
Ы Ы Ь Э Ь Э
Ы Ъ Э Ы Ь Э
Ы Ъ Ь Э Ь Э
Ы Ъ Ы Ъ Э Э
Ы Ъ Ы Ъ Э Щ
Ы Ъ Ы Э Щ Щ
Ы Ъ Ъ Э Щ Щ
Ы Ъ Э Э Щ Щ
Щ Ы Э Э Щ Щ
Щ Ы Э Э Щ Щ
Ρ ϲ ϳ ϴ ϵ
ϳ ϴ ϵ G E
ϳ Ρ ϲ ϳ E
ϳ Ρ ϲ ϳ ϴ
ϳ ϳ ϴ ϵ ϴ
ϲ ϴ ϵ ϵ ϴ
ϲ ϴ ϵ ϵ ϴ
Character Chaining for Greek
Character Chaining for Cyrillic / Russian
2- & 3-Character Chaining blocks
Character Chaining for Japanese / Chinese
Character Chaining for Special Characters
0 0 0 0 0 1 2 3 4 5 2 0 9 8 8
9 8 7 0 0 3 8 7 4 5 7 8 7 4 5
9 4 5 6 0 3 4 5 6 5 7 4 5 6 5
9 4 7 2 1 3 4 7 2 1 7 4 7 2 5
9 2 3 1 3 8 9 4 1 7 5 8 6 5
5 3 3 2 1 6 7 8 4 1 9 8 7 6 5
5 4 3 2 1 6 7 8 4 1 9 8 7 6 5
两 並 丧
Sizes
Character Chaining for Latin
Derived table for US ASCII + Latin: 6 bytes × 8 mill rows = 48 million bytes
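The step-by-step rows in the chaining diagrams suggest a sliding window that replaces a few characters at a time, so that each output character depends on its neighbours. The sketch below is one possible reading, not the product's algorithm: it uses a window of two characters, a single forward pass, and a lookup table that is a random permutation of all two-character blocks. The alphabet, the seed, and the function names are assumptions, and Python's `random` module is used only for reproducibility; it is not a cryptographically secure source.

```python
import itertools
import random
import string

def build_pair_table(alphabet, seed):
    """Random 1:1 mapping over all 2-character blocks of the alphabet."""
    blocks = ["".join(p) for p in itertools.product(alphabet, repeat=2)]
    shuffled = blocks[:]
    random.Random(seed).shuffle(shuffled)  # demo only, not secure randomness
    return dict(zip(blocks, shuffled))

def chain_tokenize(text, table):
    """Replace overlapping 2-character windows from left to right."""
    chars = list(text)
    for i in range(len(chars) - 1):
        chars[i], chars[i + 1] = table[chars[i] + chars[i + 1]]
    return "".join(chars)

def chain_detokenize(token, table):
    """Undo the windows in reverse order using the inverse table."""
    inverse = {v: k for k, v in table.items()}
    chars = list(token)
    for i in range(len(chars) - 2, -1, -1):
        chars[i], chars[i + 1] = inverse[chars[i] + chars[i + 1]]
    return "".join(chars)

ALPHABET = string.ascii_uppercase + "ÄÖÜ"   # 29 characters -> 841 table rows
TABLE = build_pair_table(ALPHABET, seed=2021)

word = "JÖRGEN"
token = chain_tokenize(word, TABLE)
print(word, "->", token, "->", chain_detokenize(token, TABLE))
```

Because every window overlaps the previous one, the table stays small (alphabet size squared) while the token for each position still depends on the characters before it. Strings shorter than two characters pass through unchanged in this sketch.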
10. 10
Unicode Project
POC v1 POC v2 Product
A 2 Basic lookup tables (SLT) - tokenization of everything in the table UTF8 y y
B 4 Basic lookup tables y y
C Basic lookup tables based on one codepage (e.g. Latin, Greek: https://unicode.org/charts/) y y
D Basic lookup tables based on multiple codepages (e.g. Latin-1 Supplement + Greek Extended + etc: https://unicode.org/charts/) y y
E Look up tables based on specific user-selected characters y y
F 1. Tokenize all Unicode space. (Equivalent to nr. 3: 4 basic lookup tables.)
G 2a. Tokenize using UTF byte-groups (1-byte, 2-byte). (Equivalent to nr. 4: basic lookup tables on one codepage.)
H 2b. Tokenize using UTF byte-groups (1-byte, 2-byte, 3-byte, 4-byte). (Equivalent to nr. 3: 4 basic lookup tables.)
I 3. Tokenize using charsets/codegroups (we first tokenize using one set, then using another, so “Denis Денис” becomes “YAsKp РЪюцЙ”). Each set may contain characters of different byte lengths, as in German: [a-zA-Z0-9] = 1 byte and [öäüÖÄÜß] = 2 bytes. (Plus warn customers that the performance will be bad. Clyde has concerns about security.) y y
J 4. Tokenize using individual user-defined sets of characters that need to be protected (e.g. set no. 1 [a-zA-Z], set no. 2 [öäüÖÄÜß], or just set 1 [a-zA-ZöäüÖÄÜß]). Characters do not migrate from one custom set to another. y y
K Chaining 1,2 bytes UTF-8 sequences y y
L Chaining 2,1 bytes UTF-8 sequences y y
M Chaining 1,1 bytes UTF-8 sequences y y
N Chaining 1,1,1 bytes UTF-8 sequences y y
O UTF-8 support y y
P UTF-16 support y y
Q Portability of tokens and lookup tables UTF-8/UTF-16 y y
R Shuffle final token (for language separation option) ?
S Padding of short fields ? y
T Chaining with one IV character input y y
Other
User customizations
UTF support
Unicode tokenization feature
Derived lookup tables (merge/split tokenization)
Basic lookup tables
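Rows I and J describe tokenizing with separate character sets so that a character never migrates from one set to another. A minimal sketch of that idea, assuming two disjoint sets (Latin letters and the basic Russian letters U+0410 to U+044F) and a plain per-character substitution table, is shown below. The set definitions, seed, and function names are hypothetical, and a fixed per-character substitution is much weaker than the chained lookup tables in the project; it only illustrates the "no migration" property.

```python
import random

def build_set_tables(char_sets, seed):
    """One random permutation per set; sets are assumed to be disjoint."""
    rng = random.Random(seed)  # demo only, not secure randomness
    table = {}
    for chars in char_sets:
        shuffled = list(chars)
        rng.shuffle(shuffled)
        table.update(zip(chars, shuffled))
    return table

def substitute(text, table):
    """Map each character through the table; others pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

LATIN = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
CYRILLIC = "".join(chr(cp) for cp in range(0x0410, 0x0450))  # А..я

TABLE = build_set_tables([LATIN, CYRILLIC], seed=7)
INVERSE = {v: k for k, v in TABLE.items()}

text = "Denis Денис"
token = substitute(text, TABLE)
print(text, "->", token, "->", substitute(token, INVERSE))
```

Latin input stays Latin and Cyrillic input stays Cyrillic, so the token keeps the language of each part of the string; the space is not in any set and is left as is.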
12. 12
Cyrillic
Tokenization of the Russian alphabet may include the green (dotted lines) characters and use the red characters for special purposes
13. 13
East Asian Scripts
Examples of scripts with three- and four-byte characters: Kana / Kanji, Hangul, Hanzi
Language preservation can be achieved in groups of scripts
• Group X: Kanji and Hiragana
• Group Y: Katakana and Punctuation
• Group Z: CJK Unified Ideographs Extension
14. 14
Unicode code points for the scripts can be stored in UTF-8 in one to four bytes
1) One byte: 128 characters (US-ASCII)
2) Two bytes: 1,920 characters, covering Latin-script, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets.
3) Three bytes: characters in common use, including most Chinese, Japanese and Korean characters**.
4) Four bytes: less common CJK characters.
**: The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF and take 3 bytes in UTF-8.
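The four length classes can be checked directly by encoding sample characters. The sample characters below are my own choice; the boundaries (U+0080, U+0800, U+10000) follow from the UTF-8 definition.

```python
def utf8_len(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

samples = ["J", "ö", "高", chr(0x20000)]  # ASCII, Latin-1, CJK Unified, CJK Extension B
for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s)")

# The two-byte class covers U+0080..U+07FF, which is where 1,920 comes from.
two_byte_code_points = 0x0800 - 0x0080
print(two_byte_code_points)
```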
15. 15
Avoiding leaks about unusual characters from input
Randomization
Input string is the German name “Jörg”
Output token “züaB”
Leaks information about unusual characters in the input
German umlauts: ä, ö, ü
Randomization Randomization Randomization
J ö r g
z ü a B
16. 16
Avoiding leaks about unusual characters from input
Approach 1: Randomly Shuffle the output string
Randomization
Input string is the German name “Jörg”
Output token “züaB”
Leaks the presence of an unusual character at a specific position of the input
Randomization Randomization Randomization
J ö r g
z ü a B
Randomly shuffle the output string
Final token “azüB” still leaks that the input contains an unusual character
17. 17
Avoiding leaks about unusual characters from input
Approach 2: Randomize the full input string
Input string is the German name “Jörg”
• 4 characters
• 5 bytes
Output token “züäB”
• 4 characters
• 6 bytes
Hides information about unusual characters in the input
German umlauts: ä, ö, ü
Randomization
J ö r g
z ü ä B
But the byte length of the string increases from 5 bytes to 6 bytes
18. 18
Avoiding leaks about unusual characters from input
Randomization
Input string of 4 characters (1+2+3+4 = 10 bytes)
Output token of 4 characters (3+3+4+4 = 14 bytes)
Output token is 40% longer
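The character and byte counts in these two examples can be reproduced with a few lines. The “Jörg” strings come from the slides; the four-character strings are hypothetical stand-ins chosen only so that their UTF-8 lengths are 1+2+3+4 and 3+3+4+4 bytes.

```python
def sizes(s):
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(s), len(s.encode("utf-8"))

def growth_percent(original, token):
    """Byte-length growth of the token relative to the original, in percent."""
    before = len(original.encode("utf-8"))
    after = len(token.encode("utf-8"))
    return (after - before) * 100 / before

name, name_token = "Jörg", "züäB"
print(sizes(name), sizes(name_token), growth_percent(name, name_token))

# Hypothetical characters with 1, 2, 3, 4 bytes and 3, 3, 4, 4 bytes.
mixed = "Jö高" + chr(0x20000)
mixed_token = "高野" + chr(0x20000) + chr(0x20001)
print(sizes(mixed), sizes(mixed_token), growth_percent(mixed, mixed_token))
```

The character count stays at four in both cases; only the byte length grows, which is what breaks fixed-width columns and buffers downstream.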
19. 19
Preserving the number of characters and the byte length
Example of tokenizing 3-byte and 4-byte Unicode characters
A range of Kanji and Kana characters with different string lengths
Randomly mix characters of different lengths and replace the longer characters
• Prevents leaks about unusual characters in the input
• No length increase
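One simple way to guarantee that both the character count and the byte length are preserved is to substitute only within the same UTF-8 length class. The sketch below does that; it is a simpler technique than the mixing of different lengths described on the slide, since it never moves a character into a different length class. The alphabet ranges, seed, and function names are assumptions for illustration.

```python
import random
from collections import defaultdict

def build_length_preserving_table(alphabet, seed):
    """Random 1:1 mapping that only swaps characters of equal UTF-8 length."""
    by_len = defaultdict(list)
    for ch in alphabet:
        by_len[len(ch.encode("utf-8"))].append(ch)
    rng = random.Random(seed)  # demo only, not secure randomness
    table = {}
    for chars in by_len.values():
        shuffled = chars[:]
        rng.shuffle(shuffled)
        table.update(zip(chars, shuffled))
    return table

def substitute(text, table):
    return "".join(table.get(ch, ch) for ch in text)

ALPHABET = (
    [chr(cp) for cp in range(0x3041, 0x3097)]      # Hiragana letters, 3 bytes each
    + [chr(cp) for cp in range(0x30A1, 0x30FB)]    # Katakana letters, 3 bytes each
    + [chr(cp) for cp in range(0x4E00, 0x4E80)]    # 128 CJK Unified Ideographs, 3 bytes each
    + [chr(cp) for cp in range(0x20000, 0x20040)]  # 64 CJK Extension B ideographs, 4 bytes each
)
TABLE = build_length_preserving_table(ALPHABET, seed=19)
INVERSE = {v: k for k, v in TABLE.items()}

text = "あア一" + chr(0x20000)
token = substitute(text, TABLE)
print(len(text), len(text.encode("utf-8")), len(token), len(token.encode("utf-8")))
```

Because Hiragana, Katakana and the common ideographs all take three bytes, they share one class here, so a Hiragana input character may come back as a Katakana or Kanji character while the byte length stays the same.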
21. 21
UTF-8: 1-byte, 2-byte, and 3-byte characters
Portability between UTF-8 and UTF-16 (used by Teradata and some other large databases), starting with 1-byte, 2-byte, and 3-byte characters. This example uses three sample characters.
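To make the portability point concrete, the sketch below encodes three sample characters (my own choice, one per UTF-8 length class) in both UTF-8 and UTF-16BE and checks that the code point survives a round trip between the two encodings.

```python
def encodings(ch):
    """Return the UTF-8 and UTF-16BE encodings of one character as hex strings."""
    return ch.encode("utf-8").hex(), ch.encode("utf-16-be").hex()

def survives_round_trip(text):
    """Re-encode UTF-8 -> str -> UTF-16BE -> str and compare code points."""
    via_utf8 = text.encode("utf-8").decode("utf-8")
    via_utf16 = via_utf8.encode("utf-16-be").decode("utf-16-be")
    return via_utf16 == text

for ch in ["A", "é", "高"]:
    utf8_hex, utf16_hex = encodings(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_hex}  UTF-16BE: {utf16_hex}")

print(survives_round_trip("Aé高"))
```

All three characters fit in one 16-bit UTF-16 code unit even though their UTF-8 lengths differ, so a token or lookup table defined in terms of code points can be stored in either encoding.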
22. 22
Japanese Examples
• full-width input/output
• optional input/output
Address: Jean XYZ-ABC / 高 ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234
Tokenization Table 3: Katakana, Kanji group 1, Romaji, Hiragana, Punctuation
Input string and output string of tokens: 高 え 2 ヲ ブ 〄 野 ぉ A ィ ル 〳
Tokenization Table 4: Katakana, Romaji, Hiragana
Input string and output string of tokens: え 2 ヲ ブ ぉ A ィ ル
Tokenization Table 5: Katakana, Romaji
Input string and output string of tokens: 2 ヲ ブ A ィ ル
24. 24
Distinguishable Tokens UTF-8
Group Y Distinguishable*
Data Discovery
Group Y: more code points than in Group X
*: Distinguishable tokens project
3521 code points
Code points not in DBs
1:1 code point mapping
Group X
Customer selects the scripts to use for distinguishable characters
64336 code points
256 code points
256 code points
256 code points
256 code points
256 code points
256 code points
256 code points
239 code points 256 code points
11 184 Code points
6 111 code points
25. 25
Portability of tokens and lookup tables based on ISO 10646
Encoding 4-byte Unicode characters in UTF-16
Code points that take up to 3 bytes in UTF-8 map directly to a single 16-bit UTF-16 code unit. 4-byte code points need to be converted. The conversion is defined in the UTF-16 encoding of the ISO 10646 specifications, covering UTF-16 and its endian-specific forms, UTF-16BE and UTF-16LE. Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
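Steps 1 through 4 translate almost line for line into code. The sketch below follows them and cross-checks the result against Python's built-in UTF-16BE encoder; the function names are my own.

```python
def utf16_code_units(u):
    """Encode character number U as one or two 16-bit UTF-16 code units."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("character number out of range")
    if u < 0x10000:                      # step 1: fits in one code unit
        return [u]
    u_prime = u - 0x10000                # step 2: 20-bit value
    w1 = 0xD800 | (u_prime >> 10)        # steps 3-4: 10 high-order bits into W1
    w2 = 0xDC00 | (u_prime & 0x3FF)      # steps 3-4: 10 low-order bits into W2
    return [w1, w2]

def builtin_code_units(u):
    """Same result obtained from Python's own UTF-16BE encoder."""
    data = chr(u).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

for u in (0x9AD8, 0x12345, 0x1D11E):
    units = utf16_code_units(u)
    print(f"U+{u:04X} ->", " ".join(f"{w:04X}" for w in units))
```

A 3-byte UTF-8 character such as U+9AD8 stays a single code unit, while a 4-byte character such as U+12345 becomes the surrogate pair D808 DF45.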
26. 26
The Token Fabric
The IV Pool
• The IV Pool is a set of pre-generated “randomized initialization vectors” that are used in different steps when creating the encoded fabric.
• Substrings of records in the IV Pool are used in each step of the tokenization process.
30. 30
GDPR under "Schrems II" – Lacking “Additional Safeguards”
https://www.jdsupra.com/legalnews/navigating-eu-data-transfers-effects-of-8348955/
• In March 2021, the Bavarian DPA found there was an unlawful transfer of personal data from a German controller to the e-mail marketing service Mailchimp in the U.S.
• The controller failed to assess whether any supplementary measures were needed in relation to the transfer of personal data to Mailchimp.
• In April 2021, the Portuguese DPA ordered a public authority to suspend all transfers of personal data to the U.S. and other third countries.
• The safeguards with Cloudflare were insufficient to protect the data (which included religious and health data), and the parties did not implement any supplementary measures to provide adequate protection for the data.
• The authority was ordered to suspend the transfer of data to the U.S. or any other third country unless adequate protection for the data is first established.
31. 31
GDPR under "Schrems II"
https://iapp.org/news/a/why-this-french-court-decision-has-far-reaching-consequences-for-many-businesses/
Legal safeguards:
• AWS Sarl guarantees in its contract with Doctolib, a French company, that it will challenge any general access request from a public authority.
Technical safeguards:
• Technically, the data hosted by AWS Sarl is encrypted.
• AWS Sarl is a Luxembourg-registered company.
• The key is held by a trusted third party in France, not by AWS.
Other guarantees taken:
• No health data.
• The data hosted relates only to the identification of individuals for the purpose of making appointments.
• Data is deleted after three months.
Figure: Doctolib and AWS Sarl; AWS will challenge any general access request from a public authority
32. 32
Big Data Protection with Granular Field Level Protection for Google Cloud
Protection throughout the lifecycle of data in Hadoop
Big Data Protector tokenizes or encrypts sensitive data fields
Enterprise Policies
Policies may be managed on-prem or on Google Cloud Platform (GCP)
Policy Enforcement Point
Protected data fields
Separation of Duties
Encryption Key Management
Security Officer
35. 35
Different Data Protection Techniques
Data Store
Dynamic Masking
2-way 1-way
Format Preserving, Computing on encrypted data, Format Preserving
Tokenization
Format Preserving Encryption (FPE)
Homomorphic Encryption (HE)
Hashing
Static Masking
Differential Privacy (DP)
K-anonymity Model
Random, Algorithmic, Noise Added
Fast, Slow, Very Slow, Fast, Fast
Fastest
Clear Text
Synthetic Data
Derivation
Fast
Anonymization of Attributes
Pseudonymization of Identifiers
44. 44
Privacy Standards
11 Published International Privacy Standards (ISO)
Categories: Techniques, Management, Cloud, Framework, Impact, Requirements, Process
20889 IS Privacy enhancing de-identification terminology and classification of techniques
27701 IS Security techniques - Extension to ISO/IEC 27001 and ISO/IEC 27002 for privacy information management - Requirements and guidelines
27018 IS Code of practice for protection of PII in public clouds acting as PII processors
29100 IS Privacy framework
29101 IS Privacy architecture framework
29134 IS Guidelines for Privacy impact assessment
29190 IS Privacy capability assessment model
29191 IS Requirements for partially anonymous, partially unlinkable authentication
29151 IS Code of Practice for PII Protection
19608 TS Guidance for developing security and privacy functional requirements based on 15408
27550 TR Privacy engineering for system lifecycle processes