Jun 29 new privacy technologies for unicode and international data standards 2021 jun28b

1
New Privacy Technologies for Unicode and International Data Standards

2
Agenda
• Old approaches, the journey and the new approach
• Performance and memory footprint
• Portability, security, and language preservation
• Granular protection for all Unicode languages and customizable alphabets
• Byte length preserving and character preserving protection
• Data privacy and security
• Use cases

3
Major Issues
Protecting the increasing use of international Unicode characters is required by a growing number of privacy laws in many countries, and by general privacy concerns about private data.
Less granular old approach with multiple conversion steps:
1. Old approaches to protecting international Unicode characters increase the size of the data and change the data formats
• This breaks many applications and slows down business operations
2. Old approaches randomly return data in new and unexpected languages

4
Preserving Language, Size, and Performance

5
Unicode standard for consistent
encoding defines
• 143,859 characters
• 154 modern and historic scripts and
languages
Increasingly common to mix multiple languages in the
same input string

6
Unicode standard
• 154 modern and historic scripts and languages

7
Japanese Address Label with 5 Different Languages
Jean XYZ-ABC / 高ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234
Languages in the label: katakana, Kanji group 1, romaji, Hiragana, Punctuation
5 Different Japanese Unicode Character Sets
JISX0208 1990, JISX0212 1990, Unicode 5.0, MS Std JPN, Windows 31

8
Script UTF-8 Characters Selected 1,2 2,1 2,2 1&2 2,2,2 3,3 4,4
US ASCII 128 60
Latin 1 Suppl 60 60 4 k 4 k
Greek 128 16 k 2 mill
Cyrillic 256 256 65 k 17 mill
Cyrillic Suppl 304 92 k
Cyrillic Suppl+ 336 113 k
Asian & CJK + 13 100 k 100 k
Math 4-bytes 1 k 1 mill
# rows
8 mill
Tokenization of Unicode
J Ö R G E N
a ß x G E N
a Ü a y E N
a Ü b c d N
a Ü b e f g
a Ü j i h g
a Ä l z h g
m ö D z h g
m ä D z h g
Ю Щ Ъ Ы Ь Э
Ы Ы Ь Э Ь Э
Ы Ъ Э Ы Ь Э
Ы Ъ Ь Э Ь Э
Ы Ъ Ы Ъ Э Э
Ы Ъ Ы Ъ Э Щ
Ы Ъ Ы Э Щ Щ
Ы Ъ Ъ Э Щ Щ
Ы Ъ Э Э Щ Щ
Щ Ы Э Э Щ Щ
Щ Ы Э Э Щ Щ
Ρ ϲ ϳ ϴ ϵ
ϳ ϴ ϵ G E
ϳ Ρ ϲ ϳ E
ϳ Ρ ϲ ϳ ϴ
ϳ ϳ ϴ ϵ ϴ
ϲ ϴ ϵ ϵ ϴ
ϲ ϴ ϵ ϵ ϴ
Character Chaining for Greek
Character Chaining for Cyrillic / Russian
2- & 3-Character Chaining blocks
Character Chaining for Japanese / Chinese
Character Chaining
for Special
Characters
0 0 0 0 0 1 2 3 4 5 2 0 9 8 8
9 8 7 0 0 3 8 7 4 5 7 8 7 4 5
9 4 5 6 0 3 4 5 6 5 7 4 5 6 5
9 4 7 2 1 3 4 7 2 1 7 4 7 2 5
9 2 3 1 3 8 9 4 1 7 5 8 6 5
5 3 3 2 1 6 7 8 4 1 9 8 7 6 5
5 4 3 2 1 6 7 8 4 1 9 8 7 6 5
两並丧
Character Chaining for Latin
Sizes
Derived table for US ASCII + Latin: 6 bytes × 8 million rows = 48 million bytes

9
Character Fabric
Tokenization of Unicode
Character Chaining

10
Unicode Project
POC v1 POC v2 Product
A 2 Basic lookup tables (SLT) - tokenization of everything in the table UTF8 y y
B 4 Basic lookup tables y y
C Basic lookup tables based on one codepage (e.g. Latin, Greek: https://unicode.org/charts/) y y
D Basic lookup tables based on multiple codepages (e.g. Latin-1 Supplement + Greek Extended + etc: https://unicode.org/charts/) y y
E Look up tables based on specific user-selected characters y y
F 1. Tokenize all Unicode space. (Equivalent to no. 3: 4 basic lookup tables.)
G 2a. Tokenize using UTF byte-groups (1-byte, 2-byte). (Equivalent to no. 4: basic lookup tables on one codepage.)
H 2b. Tokenize using UTF byte-groups (1-byte, 2-byte, 3-byte, 4-byte). (Equivalent to no. 3: 4 basic lookup tables.)
I 3. Tokenize using charsets/codegroups (we first tokenize using one set, then using another, so “Denis Денис” will become “YAsKp РЪюцЙ”). There could be characters of different byte-lengths in each set, as in German: [a-zA-Z0-9] = 1 byte and [öäüÖÄÜß] = 2 bytes. (Plus warn customers that the performance will be bad. Clyde has concerns about security.) y y
J 4. Tokenize using individual user-defined sets of characters that need to be protected (e.g. set no. 1 [a-zA-Z], set no. 2 [öäüÖÄÜß], or just set 1 [a-zA-ZöäüÖÄÜß]). Characters do not migrate from one custom set to another. y y
K Chaining 1,2 bytes UTF-8 sequences y y
L Chaining 2,1 bytes UTF-8 sequences y y
M Chaining 1,1 bytes UTF-8 sequences y y
N Chaining 1,1,1 bytes UTF-8 sequences y y
O UTF-8 support y y
P UTF-16 support y y
Q Portability of tokens and lookup tables UTF-8/UTF-16 y y
R Shuffle final token (for language separation option) ?
S Padding of short fields ? y
T Chaining with one IV character input y y
Other
User customizations
UTF support
Unicode tokenization feature
Derived lookup tables
(merge/split
tokenization)
Basic lookup tables
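Rows I and J above describe tokenizing each character set separately, so that characters never migrate from one set to another. The following is a minimal sketch of that idea only. The names `build_table`, `TABLES`, and `tokenize`, the seeds, and the use of a seeded shuffle as a stand-in for a lookup table are all assumptions made for illustration; this is a toy substitution, not the product's method, and it is not secure.

```python
import random
import string

def build_table(alphabet: str, seed: int) -> dict:
    """Toy lookup table: a seeded random permutation of one character set."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(alphabet, shuffled))

# Hypothetical user-defined sets, following the examples in rows I and J.
LATIN = string.ascii_letters                                # [a-zA-Z], 1 byte each
GERMAN_EXTRA = "öäüÖÄÜß"                                    # 2 bytes each
CYRILLIC = "".join(chr(cp) for cp in range(0x0410, 0x0450)) # А..я, 2 bytes each

TABLES = [
    build_table(LATIN, seed=1),
    build_table(GERMAN_EXTRA, seed=2),
    build_table(CYRILLIC, seed=3),
]

def tokenize(text: str) -> str:
    """Replace each character using the table of the set it belongs to."""
    out = []
    for ch in text:
        for table in TABLES:
            if ch in table:
                out.append(table[ch])
                break
        else:
            out.append(ch)  # not in any protected set: passed through unchanged
    return "".join(out)

token = tokenize("Denis Денис")
print(token)
```

Because every set here contains characters of a single UTF-8 byte length, the token has the same number of characters and the same number of bytes as the input, and Latin stays Latin while Cyrillic stays Cyrillic.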

11
Examples of Scripts with one to two bytes characters

12
Cyrillic
Tokenization of the Russian alphabet may include the green
(dotted lines) characters and use the red characters for
special purposes

13
East Asian Scripts
Examples of Scripts with three to four bytes characters
Language preservation can be achieved in groups of Scripts
• Group X: Kanji and Hiragana
• Group Y: Katakana and Punctuation
• Group Z: CJK Unified Ideographs Extension
Kana /
Kanji
Hangul
Hanzi

14
Unicode Code points for the Scripts can be stored in UTF-8 in one to four bytes
1) 128 characters (US-ASCII)
2) 1,920 characters Latin-script,
Greek, Cyrillic, Coptic, Armenian,
Hebrew, Arabic, Syriac, Thaana
and N'Ko alphabets.
3) Characters in common use,
including most Chinese,
Japanese and Korean
characters**.
4) Less common CJK characters. (The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8.)
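The four byte-length classes above can be checked directly with Python's built-in UTF-8 codec. This is a small illustrative sketch; the sample characters and the helper name `utf8_len` are chosen here and are not from the slides.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes needed to store one character in UTF-8."""
    return len(ch.encode("utf-8"))

# One or two sample characters per byte-length class (samples are assumptions).
samples = {
    "A": "US-ASCII",
    "ö": "Latin-1 Supplement",
    "Щ": "Cyrillic",
    "高": "CJK Unified Ideographs (U+4E00..U+9FFF)",
    "\U00020000": "CJK Unified Ideographs Extension B",
}

for ch, block in samples.items():
    print(f"U+{ord(ch):04X}  {utf8_len(ch)} byte(s)  {block}")

# Size of the first two classes, counted from the UTF-8 range boundaries.
one_byte = 0x80           # U+0000..U+007F
two_byte = 0x800 - 0x80   # U+0080..U+07FF
print(one_byte, two_byte)
```

The last line reproduces the 128 and 1,920 figures quoted in items 1) and 2).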

15
Avoiding leaks about unusual characters from input
Randomization
Input string is the German name “Jörg“
Output Token “züaB”
Leaks about unusual characters from input
German umlauts: ä, ö, ü
Randomization Randomization Randomization
J ö r g
z ü a B

16
Approach 1: Randomly Shuffle the output string
Randomization
Leaks about unusual character in specific position of input
Randomization Randomization Randomization
J ö r g
z ü a B
Randomly shuffle the output string
Final Token “azüB”: still leaks that the input contains an unusual character

17
Approach 2: Randomize the full input string
• Input: 4 characters, 5 bytes
• Output: 4 characters, 6 bytes
Hides information about unusual characters from input
German umlauts: ä, ö, ü
Randomization
J ö r g
z ü ä B
But the length increases from 5 bytes (input string) to 6 bytes (output token)

18
Randomization
Input string of 4 characters (1+2+3+4= 10 bytes)
Output Token of 4 characters (3+3+4+4=14 bytes)
Output Token 40% longer
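The byte counts on this slide and the previous one can be reproduced with a few lines of Python. The strings “Jörg” and “züäB” come from the slides; the two mixed-length strings are hypothetical stand-ins picked only to have the 1+2+3+4 and 3+3+4+4 byte patterns.

```python
def sizes(text: str) -> tuple:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

def growth_percent(original: str, token: str) -> float:
    """Byte-length growth of the token relative to the input, in percent."""
    before = sizes(original)[1]
    after = sizes(token)[1]
    return 100.0 * (after - before) / before

# Example from Approach 2: same character count, one extra byte.
print(sizes("J\u00f6rg"))         # "Jörg"
print(sizes("z\u00fc\u00e4B"))    # "züäB"

# Hypothetical strings: 1+2+3+4 bytes in, 3+3+4+4 bytes out.
mixed_in = "A\u00f6\u9ad8\U00020000"
mixed_out = "\u9ad8\u91ce\U00020000\U00020001"
print(sizes(mixed_in), sizes(mixed_out), growth_percent(mixed_in, mixed_out))
```

A fixed-width column sized for the input would overflow in both cases, which is the application breakage the earlier slides warn about.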

19
Preserving the number of Characters and the byte-length
Example of tokenizing 3-byte and 4-byte Unicode characters
A range of Kanji and Kana characters with different string lengths
Randomly mix characters of different length and replace the longer characters
• Prevents leaks about unusual character in input
• No length increase
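One possible reading of this slide, sketched under stated assumptions: keep the input's mix of 3-byte and 4-byte characters, move the lengths to random positions, and draw each replacement from a pool of characters of that length. The pools, the function name `length_preserving_random`, and the seed are hypothetical. The sketch is not reversible and is not a tokenization scheme; it only demonstrates the two invariants (same character count, same byte length). It hides where a longer character was, but not how many there were.

```python
import random

# Hypothetical replacement pools, keyed by UTF-8 byte length.
POOLS = {
    3: [chr(cp) for cp in range(0x3041, 0x3097)]       # Hiragana
       + [chr(cp) for cp in range(0x4E00, 0x4F00)],    # some Kanji
    4: [chr(cp) for cp in range(0x20000, 0x20100)],    # CJK Extension B
}

def utf8_len(ch: str) -> int:
    return len(ch.encode("utf-8"))

def length_preserving_random(text: str, rng: random.Random) -> str:
    """Random output with the same character count and total byte length.

    Only 3-byte and 4-byte input characters are supported in this sketch;
    anything else raises KeyError.
    """
    lengths = [utf8_len(ch) for ch in text]
    rng.shuffle(lengths)  # moves the longer characters to random positions
    return "".join(rng.choice(POOLS[n]) for n in lengths)

source = "\u9ad8\u3048\U00020000\u3049"   # 高 え 𠀀 ぉ : 3+3+4+3 bytes
result = length_preserving_random(source, random.Random(2021))
print(len(source), len(source.encode("utf-8")))
print(len(result), len(result.encode("utf-8")))
```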

20
UTF-8 is the character encoding used by 95.4% of websites

21
UTF-8: 1-byte, 2-byte, and 3-byte characters
Portability aspects between UTF-8 and UTF-16 (used by Teradata and some other large databases). This example starts with 1-byte, 2-byte, and 3-byte characters, using three sample characters:

22
• full-width input/output
• optional input/output
katakana
Kanji group 1
romaji
Hiragana
Punctuation
Tokenization Table 3
katakana
romaji
Hiragana
Input string
Output string of tokens
Japanese Examples
高え 2 ｦ
ブ〄
野ぉ A
ｨル〳
Input string
え 2 ｦ
ブ
ぉ A
ｨル
Jean XYZ-ABC / 高ブルノ Japan Business Leader ジャン・ビネス・リダー t: +81 (0)80 1234 1234
Address
katakana
romaji
Input string
2 ｦ
ブ
A
ｨル

23
Distinguishable Tokens - UTF-8
Distinguishable Scripts
Tokenization Scripts - Japanese (drawn as blocks of 256 code points)
Script: number of code points; start code point; UTF-8 encoding shown
Punctuation: 70; U+3000; 0xE3 0x80 0x81
Hiragana: 46; U+3040; 0xE3 0x81 0x81
Katakana: 111 – 304; U+30A0; 0xE3 0x82 0xA0
Kanji group 2: 7 k; U+3400; 0xE3 0x90 0x80
Kanji group 1: 21 k; U+4E00; 0xE4 0xB8 0x80
Romaji & half-width katakana: 239; U+FF00; 0xEF 0xBC 0x81
Kanji group 3: 43 k; U+20000; 0xF0 0xA0 0x80 0x80
Sample characters shown: 裕子, えぉお, 2 A ／, ｦｧｨ, ブルーノ, 〳, 高野
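The UTF-8 byte sequences on this slide can be decoded to see which script block each one belongs to. The pairing of each byte sequence with a script name below is derived by decoding the bytes and comparing the result with the start codes on the slide; the dictionary name `ENCODINGS` is an assumption for illustration.

```python
# Byte sequences as printed on the slide, labelled with the script whose
# start code they fall at or just after.
ENCODINGS = {
    "Punctuation": bytes([0xE3, 0x80, 0x81]),
    "Hiragana": bytes([0xE3, 0x81, 0x81]),
    "Katakana": bytes([0xE3, 0x82, 0xA0]),
    "Kanji group 2": bytes([0xE3, 0x90, 0x80]),
    "Kanji group 1": bytes([0xE4, 0xB8, 0x80]),
    "Romaji & half-width katakana": bytes([0xEF, 0xBC, 0x81]),
    "Kanji group 3": bytes([0xF0, 0xA0, 0x80, 0x80]),
}

decoded = {}
for script, raw in ENCODINGS.items():
    ch = raw.decode("utf-8")          # each sequence is exactly one character
    decoded[script] = ord(ch)
    print(f"{script:30} U+{ord(ch):04X}  {len(raw)} bytes")
```

Only Kanji group 3 needs four bytes; every other group on the slide fits in three, which matters for the UTF-16 portability discussed later.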

24
Distinguishable Tokens UTF -8
Group Y Distinguishable*
Data Discovery
Group Y
> # code points than in Group X
*: Distinguishable tokens project
3521 code points
Code points not in DBs
1:1 code point mapping
Group X
Customers select the scripts to use for
distinguishable characters
64336 code
points
256 code points
256 code points
256 code points
256 code points
256 code points
256 code points
256 code points
239 code points 256 code points
11 184 Code points
6 111 code points

25
Portability of tokens and lookup tables based on ISO 10646
Encoding 4-byte Unicode characters in UTF-16
Code points in tokens that take up to 3 bytes in UTF-8 map directly to a single 16-bit unit in UTF-16. Code points that take 4 bytes in UTF-8 need to be converted. The conversion is defined in the ISO 10646 specification of UTF-16, which also covers the two endian formats, UTF-16BE and UTF-16LE. Encoding a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
 U' = yyyyyyyyyyxxxxxxxxxx
 W1 = 110110yyyyyyyyyy
 W2 = 110111xxxxxxxxxx
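Steps 1 through 4 translate almost line for line into code. The sketch below follows the steps as written on the slide; the function name `utf16_units` is chosen here for illustration, and the comparison with Python's built-in `utf-16-be` codec is included only as a cross-check.

```python
def utf16_units(u: int) -> list:
    """Encode one character number U as a list of 16-bit UTF-16 units."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("character number out of range")
    # Step 1: values below 0x10000 fit in a single 16-bit unit.
    if u < 0x10000:
        return [u]
    # Step 2: U' fits in 20 bits.
    u_prime = u - 0x10000
    # Steps 3 and 4: high 10 bits go into W1, low 10 bits into W2.
    w1 = 0xD800 | (u_prime >> 10)
    w2 = 0xDC00 | (u_prime & 0x3FF)
    return [w1, w2]

def to_bytes_be(units: list) -> bytes:
    """Serialize 16-bit units in big-endian order (UTF-16BE)."""
    return b"".join(w.to_bytes(2, "big") for w in units)

for cp in (0x4E00, 0x20000):
    units = utf16_units(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{w:04X}" for w in units))
    # Cross-check against the built-in codec.
    print(to_bytes_be(units) == chr(cp).encode("utf-16-be"))
```

For UTF-16LE the same units are written with the two bytes of each unit swapped, for example by passing `"little"` to `to_bytes`.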

26
The Token Fabric
The IV Pool
• The IV Pool is a set of pre-generated “Randomized
initialization vectors” to be used in different steps when
creating the encoded fabric.
• Substrings of records in IV Pool will be used in each step
of the tokenization process.

28
Organization’s Risk Context of Privacy
Source: Gartner

29
What Are Others Spending on Security?

30
GDPR under "Schrems II" – Lacking “Additional Safeguards”
https://www.jdsupra.com/legalnews/navigating-eu-data-transfers-effects-of-8348955/
• In March 2021, the Bavarian DPA found there was an unlawful transfer of personal data from a German controller to the e-mail marketing service Mailchimp in the U.S.
• The controller failed to assess whether any supplementary measures were needed in relation to the transfer of personal data to Mailchimp.
• In April 2021, the Portuguese DPA ordered a public authority to suspend all transfers of personal data to the U.S. and other third countries.
• The safeguards in place with Cloudflare were insufficient to protect the data (which included religious and health data), and the parties did not implement any supplementary measures to provide adequate protection for the data.
• The authority was ordered to suspend the transfer of data to the U.S. or any other third country without first establishing adequate protection for the data.

31
https://iapp.org/news/a/why-this-french-court-decision-has-far-reaching-consequences-for-many-businesses/
GDPR under "Schrems II"
Legal safeguards:
• AWS Sarl guarantees in its contract with Doctolib, a French company, that it will challenge any general access request from a public authority.
Technical safeguards:
• Technically the data hosted by AWS Sarl is encrypted.
• AWS Sarl is a Luxembourg registered company.
• The key is held by a trusted third party in France, not by AWS.
Other guarantees taken:
• No health data.
• The data hosted relates only to the identification of individuals for the purpose of making appointments.
• Data is deleted after three months.
Doctolib
AWS Sarl

32
Big Data Protection with Granular Field Level Protection for Google Cloud
Protection throughout the lifecycle of data in Hadoop
Big Data Protector tokenizes or encrypts sensitive data fields
Enterprise Policies
Policies may be managed on-prem or on Google Cloud Platform (GCP)
Policy Enforcement Point
Protected data fields
Separation of Duties
Encryption Key Managem.
Security Officer

33
Major Financial Institution Global Use Case

34
GDPR Security Requirements Framework
Encryption and Tokenization
Discover Data Assets
Security by Design
Source: IBM

35
Different Data Protection Techniques
Data Store
Dynamic Masking
2-way 1-way
Format Preserving Computing on encrypted data Format Preserving
Tokenization
Format Preserving Encryption (FPE)
Homomorphic Encryption (HE)
Hashing
Static Masking
Differential Privacy (DP)
K-anonymity Model
Random Algorithmic Noise Added
Fast Slow Very Slow Fast Fast
Fastest
Clear Text
Synthetic Data
Derivation
Fast
Anonymization Of Attributes
Pseudonymization Of Identifiers

36
Different Data Protection Techniques

38
Data Protection Integration Layers

39

40

41

43
Security Compliance
Privacy
Controls
Regulations
Policies
GDPR
CCPA/CPRA
Data
Security
PCI DSS
HIPAA
Identity and
Access
Management
Risk Management
Data
Privacy
How, What and Why
Balance
RBAC,
MAC,
DAC, and
Attribute
Based
Access
Control
Threats
Industry
Standards
US NIST
Ransomware
Zero Trust
Architecture (ZTA)
Data
Security
Policy
Data Security and
US NIST Zero Trust
Architecture (ZTA)

44
Privacy Standards
11 Published International Privacy Standards (ISO)
Techniques
Management
Cloud
Framework
Impact
Requirements
Process
20889 IS Privacy enhancing de-identification terminology and classification of techniques
27701 IS Security techniques - Extension to ISO/IEC 27001 and ISO/IEC 27002 for privacy information management - Requirements and guidelines
27018 IS Code of practice for protection of PII in public clouds acting as PII processors
29100 IS Privacy framework
29101 IS Privacy architecture framework
29134 IS Guidelines for Privacy impact assessment
29190 IS Privacy capability assessment model
29191 IS Requirements for partially anonymous, partially unlinkable authentication
29151 IS Code of Practice for PII Protection
19608 TS Guidance for developing security and privacy functional requirements based on 15408
27550 TR Privacy engineering for system lifecycle processes
