This document describes Opaque, a secure distributed data analytics framework that allows complex analytics to run on sensitive data while preserving data privacy and functionality. Opaque utilizes hardware enclaves to protect computation and data within enclaves from a malicious operating system or cloud provider. It implements various oblivious primitives and oblivious operators to execute Spark SQL queries without leaking data access patterns. This allows sensitive data to be analyzed using existing Spark SQL queries while preventing privacy leaks from memory, network or computation access patterns.
24. Prior work
• Computation on encrypted data
– A cryptographic approach using homomorphic encryption
– Either impractically slow (FHE), or limited functionality (CryptDB)
• Hardware-based systems
– Use trusted hardware
– Only single machine computation (Haven), or weaker security guarantees (VC3)
Opaque utilizes trusted hardware
79. Self-verifying computation
Invariant: if the computation does not abort, the execution completed so far is correct
If the computation is complete, then the entire query was executed correctly
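The slide states only the invariant, not the mechanism. As a minimal sketch of how "abort unless everything so far is correct" could be enforced, assume each step tags its output with a MAC and the next step verifies that tag before computing. The key, the `tag` and `run_step` helpers, and the step numbering are all hypothetical and are not taken from Opaque.

```python
import hashlib
import hmac

# Assumption: a secret shared only by the enclaves (hypothetical value).
KEY = b"hypothetical-shared-enclave-key"


class Abort(Exception):
    """Raised when a step receives input that fails verification."""


def tag(step, payload):
    # Bind the payload to the step that produced it.
    msg = step.to_bytes(4, "big") + payload
    return hmac.new(KEY, msg, hashlib.sha256).digest()


def run_step(step, payload, mac, fn):
    # Verify that `payload` is exactly what step - 1 produced, then compute.
    if not hmac.compare_digest(mac, tag(step - 1, payload)):
        raise Abort(f"input to step {step} failed verification")
    out = fn(payload)
    return out, tag(step, out)


def sort_csv(data):
    # Toy workload: sort comma-separated byte fields.
    return b",".join(sorted(data.split(b",")))


data = b"3,1,2"
mac0 = tag(0, data)
out1, mac1 = run_step(1, data, mac0, sort_csv)
```

If an untrusted party swaps `out1` for other bytes before step 2, `run_step(2, ...)` raises `Abort`, so a run that reaches the end has only consumed verified inputs.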
91. Problem: access pattern leakage
SELECT count(*) FROM medical
GROUP BY disease
medical table:
ID     Name               Age  Disease
12809  Amanda D. Edwards  40   Diabetes
29489  Robert R. McGowan  56   Diabetes
13744  Kimberly R. Seay   51   Cancer
18740  Dennis G. Bates    32   Diabetes
98329  Ronald S. Ogden    53   Cancer
32591  Donna R. Bridges   26   Diabetes
95. Problem: access pattern leakage
12809 … Diabetes
29489 … Diabetes
13744 … Cancer
18740 … Diabetes
98329 … Cancer
32591 … Diabetes
Attack viable for both memory and network access patterns!
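To make the leak concrete, here is a small sketch of a non-oblivious GROUP BY. It assumes a hash-style aggregation in which each distinct disease gets its own memory slot. The function name and the `trace` list, which stands in for what an observer of memory addresses or network destinations could record, are illustrative only.

```python
from collections import defaultdict


def leaky_group_count(rows, trace):
    """Hash aggregation whose access pattern depends on the data."""
    counts = defaultdict(int)
    slot_of = {}  # group key -> slot index, assigned in first-seen order
    for _id, disease in rows:
        slot = slot_of.setdefault(disease, len(slot_of))
        trace.append(slot)  # the observer sees which slot is touched
        counts[disease] += 1
    return dict(counts)


rows = [
    (12809, "Diabetes"), (29489, "Diabetes"), (13744, "Cancer"),
    (18740, "Diabetes"), (98329, "Cancer"), (32591, "Diabetes"),
]
trace = []
counts = leaky_group_count(rows, trace)
print(counts)  # {'Diabetes': 4, 'Cancer': 2}
print(trace)   # [0, 0, 1, 0, 1, 0]
```

Even if every row were encrypted, the trace alone shows which rows share a disease and how large each group is.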
106-125. Oblivious aggregation
SELECT count(*) FROM medical GROUP BY disease
Input rows:
12809 … Diabetes
29489 … Diabetes
13744 … Cancer
18740 … Diabetes
98329 … Cancer
32591 … Diabetes
Stages, in the order the slides introduce them:
• Map
• Sort (oblivious sort)
• Scan, producing Statistics (shown twice)
• Boundary processing (Result size, Offset); values shown: 2; 1 and 2; 2
• Scan, producing partial results: Cancer:2, Diabetes:3, Diabetes:1, DUMMY
• Final result: Cancer:2, Diabetes:4
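The sort-then-scan idea can be sketched on a single partition as follows. This is a simplified illustration, not Opaque's implementation: it uses odd-even transposition sort as a stand-in oblivious sort, it omits the boundary processing across partitions, and Python itself gives no instruction-level obliviousness. All function names are hypothetical.

```python
DUMMY = ("DUMMY", 0)


def compare_exchange(a, i, j, key):
    # Always reads and rewrites both positions, whatever the comparison says.
    lo, hi = (a[i], a[j]) if key(a[i]) <= key(a[j]) else (a[j], a[i])
    a[i], a[j] = lo, hi


def oblivious_sort(items, key):
    """Odd-even transposition sort: the schedule depends only on len(items)."""
    a = list(items)
    n = len(a)
    for rnd in range(n):
        for i in range(rnd % 2, n - 1, 2):
            compare_exchange(a, i, i + 1, key)
    return a


def oblivious_group_count(rows):
    """Emit exactly one output per input row: a count or a DUMMY."""
    sorted_rows = oblivious_sort(rows, key=lambda r: r[1])
    out, running, n = [], 0, len(sorted_rows)
    for i in range(n):
        disease = sorted_rows[i][1]
        running += 1
        is_last = (i == n - 1) or (sorted_rows[i + 1][1] != disease)
        out.append((disease, running) if is_last else DUMMY)
        if is_last:
            running = 0
    return out


rows = [
    (12809, "Diabetes"), (29489, "Diabetes"), (13744, "Cancer"),
    (18740, "Diabetes"), (98329, "Cancer"), (32591, "Diabetes"),
]
out = oblivious_group_count(rows)
result = [r for r in out if r != DUMMY]  # done by the trusted client
print(result)  # [('Cancer', 2), ('Diabetes', 4)]
```

Because the scan writes one record per input row, the output length reveals only the input size, not the number of groups. The dummies are dropped after decryption.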
145. Query optimization - oblivious
SELECT count(*)
FROM medical
WHERE age > 30
GROUP BY disease
[Plan diagram; operators shown: medical, Scan, Filter, Project-filter, Obliv. sort, Aggregate, Low-card. obliv. agg.]
Reduced # of oblivious sorts by 1
158. Query optimization - mixed sensitivity
Schema:
Patient (P_): P_ID, NAME, AGE, D_ID
Treatment Plan (TP_): T_ID, DOCTOR, COMMENT, PID, START_DATE, END_DATE
Treatment Record (TR_): TR_ID, DATE, COMMENT, T_ID, M_ID, START_TIME, END_TIME, DOSAGE
Disease (D_): D_ID, NAME, G_ID
Medication (M_): M_ID, NAME, D_ID, COST
Gene (G_): G_ID, NAME, COMMENT
SELECT p_name, d_name, med_cost
FROM patient, disease,
(SELECT d_id, min(cost) AS med_cost
FROM medication
GROUP BY d_id) AS med
WHERE disease.d_id = patient.d_id
AND disease.d_id = med.d_id
SQL join order: (Patient ⨝ Disease) ⨝ γ(Medication)
Opaque join order: Patient ⨝ (Disease ⨝ γ(Medication))
|P| < |D| < |M|
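One way to see why the join order matters is to count how many joins must run obliviously. The sketch below rests on several assumptions that the slides do not state: that only Patient is sensitive, that an operator becomes sensitive as soon as any input is, and that the two plans are (Patient ⨝ Disease) ⨝ γ(Medication) and Patient ⨝ (Disease ⨝ γ(Medication)). The classes and functions are hypothetical, not Opaque's planner API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Table:
    name: str
    sensitive: bool


@dataclass(frozen=True)
class Aggregate:
    child: "Plan"


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"


Plan = Union[Table, Aggregate, Join]


def is_sensitive(plan):
    # Sensitivity propagates upward from any sensitive input.
    if isinstance(plan, Table):
        return plan.sensitive
    if isinstance(plan, Aggregate):
        return is_sensitive(plan.child)
    return is_sensitive(plan.left) or is_sensitive(plan.right)


def oblivious_joins(plan):
    # Count joins that touch sensitive data and so must run obliviously.
    if isinstance(plan, Table):
        return 0
    if isinstance(plan, Aggregate):
        return oblivious_joins(plan.child)
    here = 1 if is_sensitive(plan) else 0
    return here + oblivious_joins(plan.left) + oblivious_joins(plan.right)


patient = Table("patient", sensitive=True)  # assumption
disease = Table("disease", sensitive=False)  # assumption
medication = Table("medication", sensitive=False)  # assumption

sql_order = Join(Join(patient, disease), Aggregate(medication))
opaque_order = Join(patient, Join(disease, Aggregate(medication)))
print(oblivious_joins(sql_order), oblivious_joins(opaque_order))  # 2 1
```

Under these assumptions, joining the non-sensitive tables first leaves a single oblivious join, even though a size-based optimizer would start from the smallest table.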
169. Evaluation
• How does Opaque compare to Spark SQL?
– Big Data Benchmark (BDB)
• Queries 1, 2, 3: filter, aggregation, join
• 1 million records
• How does Opaque compare to state-of-the-art oblivious systems?
– GraphSC (graph analytics)
• PageRank
178. Big Data Benchmark (encryption mode)
[Figure: two bar charts, Single machine and Distributed, comparing Spark SQL and Opaque. x-axis: Query number (Query 1, Query 2, Query 3); y-axis: Runtime (s), ticks from 0.01 to 100]
With very little cost, you will have data encryption, authentication and computation protection!
194. Open source release
• Available at github.com/ucbrise/opaque
• Opaque is implemented as a Spark package
• Features
– Supports DataFrame select, filter, group by, join
– Allows users to specify DataFrames in encryption/oblivious modes
• Automatic sensitivity propagation in mixed sensitivity
205. Open source release
• Extension
– More functionality requires rewriting operators in C++
– No UDF support yet
– Possible solutions
• Automatically generate C++
• Run JVM in the enclave
• Deployment
– Master must be trusted
– SGX available now on Skylake processors
• Cloud providers have no support yet
207. Conclusion
Opaque is a secure distributed analytics platform
[Diagram: Opaque underlying SQL, Machine Learning, and Graph Analytics]
Try it out at github.com/ucbrise/opaque
Wenting Zheng - wzheng@eecs.berkeley.edu