Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper follows the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence that the non-sensitivity of data can be exploited to overcome limitations of existing encryption-based approaches. We propose a new secure approach, called query binning (QB), that allows non-sensitive parts of the data to be outsourced in clear-text while guaranteeing that no information is leaked by the joint processing of non-sensitive data (in clear-text) and sensitive data (in encrypted form). QB maps a query to a set of queries over the sensitive and non-sensitive data in such a way that no leakage occurs due to the joint processing over sensitive and non-sensitive data. Interestingly, in addition to improving performance, we show that QB actually strengthens the security of the underlying cryptographic technique by preventing size, frequency-count, and workload-skew attacks.
Partitioned Data Security on Outsourced Sensitive and Non-sensitive Data -- ICDE 2019
1. Partitioned Data Security on Outsourced Sensitive and Non-sensitive Data
Sharad Mehrotra¹, Shantanu Sharma¹, Jeffrey D. Ullman², and Anurag Mishra¹
¹University of California, Irvine, USA
²Stanford University, USA
IEEE International Conference on Data Engineering (ICDE), 2019.
2. Secure Data Outsourcing
Can we design an outsourcing solution that is simultaneously
• Efficient – significantly better than downloading encrypted data, and
• Secure – similar to downloading the data and processing it locally?
Use cryptographic mechanisms to protect sensitive data on the cloud
3. Roadmap
• State-of-the-art in secure data outsourcing
• Partitioned computing and corresponding security properties
• Binning algorithm to achieve partitioned security
• Performance results
4. Data/Computation Outsourcing over the Years
Keyword Search over Encrypted Documents [IEEE SP 2000, ACNS 04, Crypto 08, Crypto 09, …]
SQL over Encrypted Data [ICDE 02, SIGMOD 02, VLDB 04, Eurocrypt 03, SIGMOD 04, Crypto 11, STOC 09, SOSP 11, …]
MPC and Secret Sharing [CACM 79, Eurocrypt 14, 15, 17, VLDB 17, Tech 19]
[Figure: secure-hardware setting. Components shown: OS, Process 1, Process 2, trusted enclave, encrypted data, cache, page table, Ecall, Ocall. The adversary can observe cache-line and page-table accesses.]
Secure Hardware [CIDR 13, Usenix Security 15, IEEE SP 15, 17, NSDI 18]
Solutions represent points in the spectrum of possibilities
– Explore tradeoffs between generality, security, and efficiency.
Cryptographic techniques: more secure, but orders of magnitude worse in performance than plaintext processing.
Secure hardware: not secure, and software techniques to make such solutions secure are inefficient
• Coarse-grained page faults, branch shadowing, cache-line attacks
5. Cryptographic Techniques: Security Threats & Performance
A check mark indicates that a technique is resilient to a given attack.
DSSE: Distributed Searchable Symmetric Encryption (PULSAR by Stealth)
MPC: Multi-party computation (Jana by Galois)
Opaque: SGX-based solution [Zheng et al., NSDI, 2017]
Selecting a single row from the TPC-H Customer table of 1.5M rows and 8 columns
• Cryptographic overheads:
• Searchable encryption: ~2 orders of magnitude
• Secure hardware: ~3-4 orders of magnitude
• MPC-based solution: ~5-6 orders of magnitude
6. Data Sensitivity & Outsourcing
• Organizational data is often only partially sensitive [refs in paper]
• Sensitivity is dictated by policies
• Sensitivity dictates what data is outsourced and in what form
• E.g., general office emails are possibly not sensitive (hence outsourced)
• Information related to a sensitive project is sensitive (hence not outsourced in plaintext)
• Can we exploit the partially sensitive nature of data to scale cryptographic solutions without compromising the security of sensitive data?
• Commercial encrypted database solutions (e.g., Jana by Galois) are beginning to explore such solutions
7. Key Insight: Partial Sensitivity of Data (1)
• Data about entry/exit from buildings is possibly sensitive (inference about time spent at work)
• Location within an office building is possibly not sensitive
• Surveillance video is not sensitive
• Surveillance video is sensitive if a visitor prefers not to be monitored (OK to know that the visitor is not in frame, but not that the visitor is in frame!)
Partial sensitivity is also true for other domains
http://cybersecurity.ieee.org/blog/2015/11/13/identify-sensitive-data-and-how-they-should-be-handled/
https://digitalguardian.com/
Can we exploit partial sensitivity to develop efficient (yet secure) solutions to scale secure computing and/or data sharing?
8. Key Insight: Partial Sensitivity of Data (2)
• Existing work on data classification
• Inference detection using graph-based semantic data modeling [Hinke, IEEE SP, 88]
• User-defined relationships between sensitive and non-sensitive data [Smith, IEEE SP, 90]
• Sensitive patterns hiding using sanitization matrix [Lee et al., COMPSAC, 2004]
• Common knowledge-based association rules [Li et al., DASFAA, 2007]
• Constraints-based mechanisms
• Objectives of finding data-sensitivity
• Data-sharing while keeping sensitive data at the trusted user
• Multi-level secure data accessing
• Allowing data for mining purposes while also preserving the confidentiality of the data
9. Partitioned Computations
Sensitive Data Ds
      Name       Department
t1    E(Adam)    E(Defense)
t2    E(John)    E(Security)
t3    E(Clark)   E(Crypto)
t4    E(Lisa)    E(Defense)
Non-sensitive Data Dns
      Name       Department
t5    Adam       Testing
t6    John       Testing
t7    Lisa       Design
t8    Clark      Design
Query Q is split into Query Qs over Ds and Query Qns over Dns; Answer As and Answer Ans are combined into Answer A.
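The split of Q into Qs and Qns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyed-hash `token` function is a toy stand-in for whatever encrypted-search scheme protects Ds (it is not the technique used in the paper), and the names `OWNER_KEY`, `run_qs`, `run_qns`, and `query` are hypothetical.

```python
import hashlib
import hmac

OWNER_KEY = b"demo-key"  # hypothetical secret held only by the data owner


def token(value: str) -> str:
    """Toy deterministic search token; a stand-in for a real encrypted-search scheme."""
    return hmac.new(OWNER_KEY, value.encode(), hashlib.sha256).hexdigest()


# Cloud-side storage. Sensitive rows expose only a search token for Name.
D_S = [("t1", token("Adam")), ("t2", token("John")),
       ("t3", token("Clark")), ("t4", token("Lisa"))]
D_NS = [("t5", "Adam", "Testing"), ("t6", "John", "Testing"),
        ("t7", "Lisa", "Design"), ("t8", "Clark", "Design")]


def run_qs(search_token: str) -> list[str]:
    """Qs: executed by the cloud over the encrypted relation Ds."""
    return [tid for tid, tok in D_S if tok == search_token]


def run_qns(name: str) -> list[str]:
    """Qns: executed by the cloud over the cleartext relation Dns."""
    return [tid for tid, n, _dept in D_NS if n == name]


def query(name: str) -> list[str]:
    """Owner side: split Q into Qs and Qns, then merge As and Ans into A."""
    return run_qs(token(name)) + run_qns(name)


print(query("John"))  # ['t2', 't6']
```

Note that the cloud sees t2 and t6 returned for the same query, which is exactly the kind of joint observation that partitioned data security has to rule out.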
10. Leakage due to Partitioned Computing…
Name Department
t1 E(Adam) E(Defense)
t2 E(John) E(Security)
t3 E(Clark) E(Crypto)
t4 E(Lisa) E(Defense)
Name Department
t5 Adam Testing
t6 John Testing
t7 Lisa Design
t8 Clark Design
Sensitive Data Ds
Non-sensitive Data Dns
Query: Retrieve John’s rows
Adversarial view:
Query value   Tuples retrieved from sensitive side   Tuples retrieved from non-sensitive side
John          t2                                     t6
The adversary learns that t2 is John’s row.
11. What if we use access-pattern-hiding techniques?
Name Department
t1 E(Adam) E(Defense)
t2 E(John) E(Security)
t3 E(Clark) E(Crypto)
t4 E(Lisa) E(Defense)
Name Department
t5 Adam Testing
t6 John Testing
t7 Lisa Design
t8 Clark Design
Sensitive Data Ds
Non-sensitive Data Dns
Query: Retrieve John’s rows
Adversarial view:
Query value   Tuples retrieved from sensitive side   Tuples retrieved from non-sensitive side
John          E(….)                                  t6
Output size reveals that one of John’s records is sensitive.
12. Partitioned Data Security
• Non-linkability
• The adversary does not learn the relationship between any encrypted and plaintext value
• Ciphertext indistinguishability
• The adversary does not learn any relationships between encrypted values
• unless the underlying cryptographic technique allows such relationships to be learnt (e.g., OPE)
13. Secure Partitioned Computation (1)
• Data partitioned into bins
• Non-sensitive data partitioned into non-sensitive bins (NSB)
• Sensitive data partitioned into sensitive bins (SB)
• Query Q for value y mapped to all values in the bin corresponding to y
• Retrieves all data in NSB(y) over non-sensitive data
• Retrieves all data in SB(y) over sensitive data
[Figure: Ds holds E(x), E(y), E(z) in sensitive bins SB(x), SB(y), SB(z); Dns holds x, y, z in non-sensitive bins NSB(x), NSB(y), NSB(z).]
Adversarial view:
Query value   Tuples retrieved from sensitive side   Tuples retrieved from non-sensitive side
John          SB(y)                                  NSB(y)
14. Secure Partitioned Computation (2)
• Bins are created such that for each pair of sensitive and non-sensitive bins s and ns, there exists a value v such that s = SB(v) and ns = NSB(v)
[Figure: Ds holds E(x), E(y), E(z) in sensitive bins SB(x), SB(y), SB(z); Dns holds x, y, z in non-sensitive bins NSB(x), NSB(y), NSB(z).]
The adversarial view does not allow learning linkability between sensitive and non-sensitive records
15. Secure Partitioned Computation (3)
• Association between each sensitive bin and each non-sensitive bin prevents
• Leakage through joint access of data
• Output size attacks
• Workload-skew attacks can be prevented through (careful) addition of (minimal) fake queries
[Figure: Ds holds E(x), E(y), E(z) in sensitive bins SB(x), SB(y), SB(z); Dns holds x, y, z in non-sensitive bins NSB(x), NSB(y), NSB(z).]
16. Query Binning
• Assumptions
• Equal number of sensitive and non-sensitive attribute values
• Each distinct attribute value appears in at most one tuple in the sensitive data and one tuple in the non-sensitive data
• The number of values is a product of approximately equal factors
The paper relaxes all of these assumptions
17. The Algorithm: One Tuple Per Value
Bin Creation: Inputs: S and NS
• Permute all sensitive values
• Find approximately square factors x and y of |NS|, i.e., |NS| = x × y such that x ≥ y
• Create x sensitive bins, each containing at most y values
• Create |NS|/x non-sensitive bins
• Assign the ith sensitive value to the (i mod x)th sensitive bin
• Assigning non-sensitive values: assign the non-sensitive value associated with the sensitive value at the jth position of the ith sensitive bin to the ith position of the jth non-sensitive bin
• NSB[j][i] ← allocateNS(SB[i][j])
• Fill remaining NS values
Example: |S| = 6, |NS| = 6, x = 3, y = 2
S = {S1, S2, S3, S4, S5, S6}
NS = {NS1, NS2, NS3, NS4, NS6, NS7}
[Figure: sensitive bins SB1 = {S1, S4}, SB2 = {S2, S5}, SB3 = {S3, S6}; non-sensitive bins NSB1 = {NS1, NS2, NS3}, NSB2 = {NS4, NS6, NS7}.]
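The bin-creation steps can be sketched in Python as follows. This is a minimal sketch under the slide's base-case assumptions (one tuple per value, |S| ≤ |NS|, |NS| ≥ 1), not the authors' implementation. It additionally assumes that a sensitive and a non-sensitive value are "associated" when the same underlying value occurs on both sides; the function names are hypothetical.

```python
import math
import random


def approx_square_factors(n: int) -> tuple[int, int]:
    """Return (x, y) with x * y == n and x >= y, with y as close to sqrt(n) as possible."""
    y = math.isqrt(n)
    while n % y != 0:
        y -= 1
    return n // y, y


def create_bins(sensitive, non_sensitive, rng=None):
    """Build x sensitive bins (at most y values each) and y non-sensitive bins (x slots each)."""
    rng = rng or random.Random()
    s_values = list(sensitive)
    rng.shuffle(s_values)                      # permute the sensitive values
    x, y = approx_square_factors(len(non_sensitive))
    assert len(s_values) <= x * y, "sketch assumes |S| <= |NS|"

    sb = [[] for _ in range(x)]
    for i, v in enumerate(s_values):
        sb[i % x].append(v)                    # i-th value goes to bin (i mod x)

    nsb = [[None] * x for _ in range(y)]
    leftover = set(non_sensitive)
    for i, bin_i in enumerate(sb):
        for j, v in enumerate(bin_i):
            if v in leftover:                  # v also occurs in the non-sensitive data
                nsb[j][i] = v                  # NSB[j][i] <- allocateNS(SB[i][j])
                leftover.discard(v)

    fill = iter(sorted(leftover))              # fill remaining slots with unmatched values
    for row in nsb:
        for i in range(x):
            if row[i] is None:
                row[i] = next(fill)
    return sb, nsb


sb, nsb = create_bins(["v1", "v2", "v3", "v4", "v5", "v6"],
                      ["v1", "v2", "v3", "v4", "v6", "v7"],
                      random.Random(0))
print(sb, nsb)
```

Because the sensitive values are permuted, the exact layout depends on the random seed; what stays fixed is the invariant that the value in position j of sensitive bin i, if it also occurs in the non-sensitive data, sits in position i of non-sensitive bin j.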
18. The Algorithm: One Tuple Per Value
• Bin Retrieval: Input: Query(w)
• If w is in a sensitive bin SB[i][j], then
• Retrieve ith sensitive bin and jth non-sensitive bin
• If w is in a non-sensitive bin NSB[i][j], then
• Retrieve ith non-sensitive bin and jth sensitive bin
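Example: |S| = 6, |NS| = 6, x = 3, y = 2
S = {S1, S2, S3, S4, S5, S6}
NS = {NS1, NS2, NS3, NS4, NS6, NS7}
Query: S2 → SB2, NSB1
Query: NS7 → NSB2, SB2
[Figure: sensitive bins SB1 = {S1, S4}, SB2 = {S2, S5}, SB3 = {S3, S6}; non-sensitive bins NSB1 = {NS1, NS2, NS3}, NSB2 = {NS4, NS6, NS7}.]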
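The retrieval rule can be illustrated with a small Python sketch over the example bins. The position of each value inside NSB2 is an assumption derived from the allocation rule NSB[j][i] ← allocateNS(SB[i][j]); the helper names `locate` and `bins_to_retrieve` are hypothetical.

```python
# Example bins; positions within NSB2 are assumed from the allocation rule.
SB = [["S1", "S4"], ["S2", "S5"], ["S3", "S6"]]
NSB = [["NS1", "NS2", "NS3"], ["NS4", "NS7", "NS6"]]


def locate(bins, w):
    """Return the 0-based (bin index, position) of w, or None if w is absent."""
    for i, b in enumerate(bins):
        if w in b:
            return i, b.index(w)
    return None


def bins_to_retrieve(w):
    """Return the names of the (sensitive, non-sensitive) bins fetched for Query(w)."""
    hit = locate(SB, w)
    if hit is not None:
        i, j = hit                    # w = SB[i][j]: fetch i-th SB and j-th NSB
        return f"SB{i + 1}", f"NSB{j + 1}"
    hit = locate(NSB, w)
    if hit is not None:
        i, j = hit                    # w = NSB[i][j]: fetch i-th NSB and j-th SB
        return f"SB{j + 1}", f"NSB{i + 1}"
    raise KeyError(w)


print(bins_to_retrieve("S2"))   # ('SB2', 'NSB1')
print(bins_to_retrieve("NS2"))  # ('SB2', 'NSB1')
```

Queries for S2 and NS2 fetch the same pair of bins, so the retrieved bins alone do not tell the adversary whether the queried value was sensitive or non-sensitive.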
19. Query Execution Cost on Outsourced Data
Techniques Time Resilient to attacks
Size Workload-skew Access-patterns
SGX 10500x
Query Binning + SGX (60% sensitivity) 8929x
Multi-party computations-Jana 954363x
Query Binning + Jana (60% sensitivity) 680131x
x is the time to search for a predicate in cleartext.
A check mark indicates that a technique is resilient to a given attack.
Experiments were conducted over 1.5M rows.
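As a quick worked check of what these numbers imply, the relative saving from adding query binning at 60% sensitivity can be computed directly from the reported slowdown factors. The snippet below only does that arithmetic; the dictionary layout and function name are illustrative.

```python
# Slowdown factors relative to cleartext search: (crypto only, with query binning).
costs = {
    "SGX": (10500, 8929),
    "Jana (MPC)": (954363, 680131),
}


def relative_saving(crypto_only: float, with_qb: float) -> float:
    """Fraction of query time saved by query binning."""
    return 1 - with_qb / crypto_only


savings = {name: relative_saving(*pair) for name, pair in costs.items()}
for name, s in savings.items():
    print(f"{name}: QB reduces query time by {s:.1%}")
```

Even with 60% of the data sensitive, this works out to roughly a 15% reduction for SGX and roughly a 29% reduction for Jana.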
20. Experimental Results (Selection Query)
• X-axis = Data sensitivity (1%, 2%, 20%, 40%, 60%)
• Y-axis = time
SGX Opaque + Partition computing vs SGX Opaque
Data set size = 6M rows
Jana MPC + Partition computing vs Jana MPC
Data set size = 1M rows
21. Analytical Model
• When is query binning better than a purely cryptographic approach?
• The model considers the ratio of the cost of QB to the cost of a crypto-only approach
• After several rounds of simplifications (see paper) and under ideal assumptions, QB is better than a cryptographic-only solution if a condition on the following parameters holds (see paper)
• Ratio of the computation cost of cryptographic techniques vs. plaintext, per tuple
• Ratio of cryptographic computation cost vs. communication cost, per tuple (typically much greater than 1 for strong cryptographic techniques)
• Average query selectivity
• Ratio of sensitive data
22. Query Binning Extensions
• What if there are no approximately square factors?
• Select the nearest square number
• What if there is no 1-to-1 mapping between sensitive and non-sensitive values, and the values differ in size?
• Bin-packing algorithm
• What about range queries?
• With the help of a modified B-tree created over non-sensitive bins
• What about join queries?
• Keep pseudo-sensitive data with sensitive data
• What about aggregation queries?
• Execute like a selection query without tuple fetching
23. Distinct Values are not a Product of Approximately Square Factors (1)
• What happens when the number of distinct values is not a product of approximately square factors?
• Communication cost increases
• For example, 82 non-sensitive values result in 41 sensitive bins and 2 non-sensitive bins
[Figure: sensitive bins SB1, SB2, …, SB41 hold E(s1), E(s2), …, E(s41), at most 1 value in a sensitive bin; non-sensitive bins NSB1 = {ns1, ns2, …, ns41} and NSB2 = {ns42, ns43, …, ns82}, at most 41 values in a non-sensitive bin.]
Communication cost = 42
24. Distinct Values are not a Product of Approximately Square Factors (2)
• Reducing communication cost by finding the nearest square number
• In the case of 82 non-sensitive values, 81 is the nearest square number
• Thus, create 9 sensitive and 9 non-sensitive bins
[Figure: 41 sensitive values in sensitive bins SB1, SB2, …, SB9, at most 5 values in a sensitive bin; 82 non-sensitive values in non-sensitive bins NSB1 = {ns1, ns2, …, ns10}, NSB2 = {ns11, ns12, …, ns19}, …, NSB9 = {ns74, ns75, …, ns82}, at most 10 values in a non-sensitive bin.]
Communication cost = 15
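The two communication costs (42 and 15) can be reproduced with a short Python sketch. It assumes, as the figures suggest, that the cost of a query is the largest number of values in one sensitive bin plus the largest number of values in one non-sensitive bin; the function names are hypothetical.

```python
import math


def exact_factor_bins(n_ns: int) -> tuple[int, int]:
    """Most balanced exact factorisation of n_ns: (sensitive bins, non-sensitive bins)."""
    y = math.isqrt(n_ns)
    while n_ns % y != 0:
        y -= 1
    return n_ns // y, y


def nearest_square_bins(n_ns: int) -> tuple[int, int]:
    """Bin counts taken from the root of the square number nearest to n_ns."""
    r = math.isqrt(n_ns)
    if (r + 1) ** 2 - n_ns < n_ns - r ** 2:
        r += 1
    return r, r


def communication_cost(n_s: int, n_ns: int, num_sb: int, num_nsb: int) -> int:
    """Largest sensitive bin plus largest non-sensitive bin, counted in values."""
    return math.ceil(n_s / num_sb) + math.ceil(n_ns / num_nsb)


print(communication_cost(41, 82, *exact_factor_bins(82)))    # 42
print(communication_cost(41, 82, *nearest_square_bins(82)))  # 15
```

With 82 = 41 × 2 the only exact factorisation is very lopsided, which is why rounding to the nearest square (81 = 9 × 9) and letting a few bins hold one extra value pays off.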
25. The Algorithm: General Case: Multiple Tuples per Value (1)
• What happens if values have different numbers of tuples?
• The size of each sensitive bin is now different
• Assumption: more non-sensitive values have more associated sensitive tuples.
• From tuple retrieval, the adversary learns which bin contains the sensitive value corresponding to a non-sensitive value
• E.g., retrieval of SB1 and NSB1 reveals that S1 is allocated to SB1
Example: |S| = 6, |NS| = 6, x = 3, y = 2
[Figure: same bin layout as in the one-tuple-per-value example.]
Tuples per sensitive value: S1 = 10, S2 = 2, S3 = 1, S4 = 15, S5 = 2, S6 = 1
Tuples per non-sensitive value: NS1 = 200, NS2 = 20, NS3 = 10, NS4 = 150, NS5 = 10, NS7 = 10
Size of sensitive bins: SB1 = 25, SB2 = 4, SB3 = 2
Size of non-sensitive bins: NSB1 = 230, NSB2 = 170
26. The Algorithm: General Case: Multiple Tuples per Value (2)
• What happens if values have different numbers of tuples?
• Solution: simply add fake tuples to sensitive bins
• Problem: too many fake tuples, leading to increased communication cost
• So how can we overcome this problem?
Example: |S| = 6, |NS| = 6, x = 3, y = 2
[Figure: same bin layout as in the one-tuple-per-value example.]
Tuples per sensitive value: S1 = 10, S2 = 2, S3 = 1, S4 = 15, S5 = 2, S6 = 1
Tuples per non-sensitive value: NS1 = 200, NS2 = 20, NS3 = 10, NS4 = 150, NS5 = 10, NS7 = 10
Size of sensitive bins: SB1 = 25, SB2 = 4, SB3 = 2
Size of non-sensitive bins: NSB1 = 230, NSB2 = 170
Added fake tuples: SB1 = 0, SB2 = 21, SB3 = 23
We add 44 fake tuples to sensitive data
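The figure of 44 fake tuples follows from padding every sensitive bin up to the size of the largest one. A minimal Python sketch of that arithmetic, assuming the round-robin (i mod x) allocation without permutation so that the bins match the example; `pad_bins` is a hypothetical name:

```python
def pad_bins(tuples_per_value: dict, x: int):
    """Round-robin allocation into x bins, then pad each bin up to the largest bin size."""
    values = list(tuples_per_value)
    bins = [values[i::x] for i in range(x)]            # i-th value -> bin (i mod x)
    sizes = [sum(tuples_per_value[v] for v in b) for b in bins]
    fakes = [max(sizes) - s for s in sizes]            # fake tuples needed per bin
    return bins, sizes, fakes


COUNTS = {"S1": 10, "S2": 2, "S3": 1, "S4": 15, "S5": 2, "S6": 1}
bins, sizes, fakes = pad_bins(COUNTS, x=3)
print(sizes, fakes, sum(fakes))  # [25, 4, 2] [0, 21, 23] 44
```

The cost is dominated by SB1, which happens to hold the two heaviest values (S1 and S4), so the other two bins need almost as many fake tuples as SB1 has real ones.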
27. The Algorithm: General Case: Multiple Tuples per Value (3)
• What happens if values have different numbers of tuples?
• Solution: bin-packing-based approach
• Sorting: sort all the values in decreasing order of the number of tuples
• Allocate sensitive values
• Add fake tuples
• Allocate non-sensitive values as shown previously
Example: |S| = 6, |NS| = 6, x = 3, y = 2
Tuples per sensitive value: S1 = 10, S2 = 2, S3 = 1, S4 = 15, S5 = 2, S6 = 1
Tuples per non-sensitive value: NS1 = 200, NS2 = 20, NS3 = 10, NS4 = 150, NS5 = 10, NS7 = 10
After sorting: S4 = 15, S1 = 10, S2 = 2, S5 = 2, S3 = 1, S6 = 1
[Figure: sensitive bins SB1 = {S4, S6}, SB2 = {S1, S3}, SB3 = {S2, S5}; non-sensitive bins drawn as {NS1, NS2, NS7} and {NS3, NS5, NS6}.]
Size of sensitive bins before adding fake tuples: SB1 = 16, SB2 = 11, SB3 = 4
Added fake tuples: SB1 = 0, SB2 = 5, SB3 = 12
We add fewer fake tuples than the simple solution of padding the original bins: 17 vs. 44 fake tuples
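One heuristic that reproduces the bins and the 17 fake tuples of this example is sketched below: sort values by decreasing tuple count, then place each value into the currently lightest sensitive bin that still has fewer than y values. This is an assumed greedy strategy chosen because it matches the example; the paper's bin-packing algorithm may differ in its details, and `pack_sensitive_values` is a hypothetical name.

```python
def pack_sensitive_values(tuples_per_value: dict, x: int, y: int):
    """Greedy packing: heaviest value first, into the lightest bin with a free slot."""
    assert len(tuples_per_value) <= x * y, "not enough slots for all values"
    # Stable sort: values with equal counts keep their original order.
    order = sorted(tuples_per_value, key=tuples_per_value.get, reverse=True)
    bins = [[] for _ in range(x)]
    loads = [0] * x
    for v in order:
        open_bins = [b for b in range(x) if len(bins[b]) < y]
        target = min(open_bins, key=lambda b: loads[b])   # ties go to the lowest index
        bins[target].append(v)
        loads[target] += tuples_per_value[v]
    fakes = [max(loads) - load for load in loads]         # pad up to the largest bin
    return bins, loads, fakes


COUNTS = {"S1": 10, "S2": 2, "S3": 1, "S4": 15, "S5": 2, "S6": 1}
bins, loads, fakes = pack_sensitive_values(COUNTS, x=3, y=2)
print(bins)               # [['S4', 'S6'], ['S1', 'S3'], ['S2', 'S5']]
print(loads, sum(fakes))  # [16, 11, 4] 17
```

Separating the two heaviest values (S4 and S1) into different bins lowers the largest bin size from 25 to 16, which is what shrinks the padding from 44 to 17 fake tuples.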
28. Range Queries
• A full binary tree is constructed over all non-sensitive values
• Bins are created for each level of the tree, except the root node
• Bins are retrieved based on least-matching
• For example, a range query from ns8 to ns12 → bins as per nodes ns23 and ns8
[Figure: bins for each node of each level of the tree]
29. Conclusion
• Existing cryptographic techniques are orders of magnitude slower than cleartext processing
• Differentiating between sensitive and non-sensitive data can make cryptographic techniques faster
• By avoiding expensive cryptographic operations on non-sensitive data
• However, naïve query execution on partitioned data can lead to information leakage
• Partitioned security
• Query binning
• Implements partitioned security
• While ensuring efficiency
• Interesting side-effect of QB:
• Makes existing cryptographic techniques more secure