BlinkDB and G-OLA:
Supporting Approximate Answers in SparkSQL
Sameer Agarwal and Kai Zeng
Spark Summit | San Francisco, CA | June 15th 2015
About Us
1. Sameer Agarwal
- Software Engineer at Databricks
- PhD in Databases (UC Berkeley)
- Research on Approximate Query Processing (BlinkDB)
2. Kai Zeng
- Post-doc in AMPLab / Intern at Databricks
- PhD in Databases (UCLA)
- Research on Approximate Query Processing (ABM)
100 TB on 1000 machines
Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
?: 1 second
Continuous Query Execution on Samples of Data
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Continuous Query Execution on Samples
What is the average latency in the table?
On the full table: 34.6667
On a sample: 35
On a sample, with an error bar: 35 ± 2.1
Continuously, as the sample grows: 35 ± 2.1, then 33.83 ± 1.3, then 34.6667 ± 0.0
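The running means shown for this query can be reproduced by averaging a growing prefix of the 12-row table: the first 5 rows average 35, the first 6 rows average 33.83, and all 12 rows average 34.6667. The sketch below assumes the sample is simply the first n rows, which the slides do not state, and it does not try to reproduce the slide's error bars. `PrefixAverage` and `prefixMean` are illustrative names, not part of BlinkDB or G-OLA.

```scala
object PrefixAverage {
  // Latency column of the 12-row example table, in row order.
  val latencies: Vector[Double] =
    Vector(30, 38, 34, 36, 37, 28, 32, 38, 36, 35, 38, 34).map(_.toDouble)

  // Average over the first n rows, i.e. the answer after seeing n rows.
  // Assumption: the "sample" is a prefix of the table.
  def prefixMean(n: Int): Double = {
    require(n >= 1 && n <= latencies.length, "n must be between 1 and 12")
    latencies.take(n).sum / n
  }

  def main(args: Array[String]): Unit = {
    // Print the refined answer after 5, 6 and 12 rows.
    for (n <- Seq(5, 6, 12))
      println(s"rows seen = $n, avg(latency) = ${prefixMean(n)}")
  }
}
```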
Demo
[Diagram: SELECT foo(*) FROM TABLE → A ± ε; layers: Error Estimation, Query Execution, Data Storage]
G-OLA
Interface
val dataFrame =
  sqlCtx.sql("select avg(latency) from log")
// batch processing: one exact answer after a full pass
val result = dataFrame.collect() // 34.6667
// online processing: each call returns a refined answer with an error bar
val onlineDataFrame = dataFrame.online
onlineDataFrame.collectNext() // 35 ± 2.1
onlineDataFrame.collectNext() // 33.83 ± 1.3
Interface
val dataFrame =
  sqlCtx.sql("select avg(latency) from log")
// online processing: keep refining while more answers are available
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext()) {
  onlineDataFrame.collectNext()
}
Interface
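import scala.concurrent.duration._
val dataFrame =
  sqlCtx.sql("select avg(latency) from log")
// online processing: stop once a 10-second time budget is exceeded
// (responseTime stands for the elapsed time; it is not defined on the slide)
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() &&
    responseTime <= 10.seconds) {
  onlineDataFrame.collectNext()
}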
Interface
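val dataFrame =
  sqlCtx.sql("select avg(latency) from log")
// online processing: stop once the error bound drops below 0.01
// (errorBound stands for the current error; it is not defined on the slide)
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() &&
    errorBound >= 0.01) {
  onlineDataFrame.collectNext()
}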
Interface
val dataFrame =
  sqlCtx.sql("select avg(latency) from log")
// online processing: keep refining until the user cancels
// (userEvent is a placeholder; it is not defined on the slide)
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() &&
    !userEvent.cancelled()) {
  onlineDataFrame.collectNext()
}
AGGREGATES / UDAFs
JOINS / GROUP BYs
NESTED QUERIES
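The loop pattern on the Interface slides can be tried without Spark using a small local mock. This is a sketch under stated assumptions: `OnlineAverage` is a hypothetical stand-in that only borrows the method names `hasNext` and `collectNext` from the slides, it is not the G-OLA implementation, and the step budget `maxSteps` replaces the time, error, or cancellation condition shown there.

```scala
// Hypothetical stand-in for the online DataFrame on the slides: it is NOT the
// G-OLA implementation, only a local mock with the same two method names.
final class OnlineAverage(rows: Vector[Double], batchSize: Int) {
  require(batchSize > 0, "batchSize must be positive")
  private var seen = 0   // number of rows consumed so far
  private var sum = 0.0  // running sum of the consumed rows

  def hasNext(): Boolean = seen < rows.length

  // Consume one more mini-batch and return the refined running average.
  def collectNext(): Double = {
    val batch = rows.slice(seen, seen + batchSize)
    sum += batch.sum
    seen += batch.length
    sum / seen
  }
}

object OnlineAverageDemo {
  val latencies: Vector[Double] =
    Vector(30, 38, 34, 36, 37, 28, 32, 38, 36, 35, 38, 34).map(_.toDouble)

  // Drive the mock until the data runs out or a budget of refinement steps
  // is used up (a stand-in for the stopping conditions on the slides).
  def run(maxSteps: Int): Vector[Double] = {
    val online = new OnlineAverage(latencies, batchSize = 4)
    val answers = Vector.newBuilder[Double]
    var steps = 0
    while (online.hasNext() && steps < maxSteps) {
      answers += online.collectNext()
      steps += 1
    }
    answers.result()
  }

  def main(args: Array[String]): Unit =
    run(maxSteps = 10).foreach(println)
}
```

With batches of four rows, the mock yields three answers, the last of which equals the exact average over all 12 rows.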
[Diagram: SELECT foo(*) FROM TABLE → A ± ε; layers: Query Interface, Error Estimation, Query Execution, Data Storage]
Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys 2013.
Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan. A General Bootstrap Performance Diagnostic. In ACM KDD 2013.
Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica. Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems. In ACM SIGMOD 2014.
Continuous Query Execution on Samples
Focused on estimating aggregate errors given representative samples
Central Limit Theorem (CLT) Error Estimation using Bootstrap
HOE:ASTAT63, BIL:WILEY86, CGL:ASTAT83, PH:IBM96, EFRON:JAS82, EFRON:JAS87, VP:TPMS80, FGK:IJCAI99, ET:CH93
Error Estimation on a Sample of Data
… predicate for the query)
The following results are (asymptotically in sample size) true, but not directly useful, since they depend on unknown properties of the underlying distribution. In all cases we just plug in the sample values. For example, instead of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$, where $X_i$ is the $i$th sample value.
Note that for estimators other than sum and count, I assume no filtering ($p = 1$). Filtering will increase variance a bit, or potentially a lot for extremely selective queries ($p = 0$). I can compute the filtering-adjusted values if you like.
1. Count: $N(np,\ n(1-p)p)$
2. Sum: $N(np\mu,\ np(\sigma^2 + (1-p)\mu^2))$
3. Mean: $N(\mu,\ \sigma^2/n)$
4. Variance: $N(\sigma^2,\ (\mu_4 - \sigma^4)/n)$
5. Stddev: $N(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n))$
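As a minimal sketch of the plug-in idea for the Mean case, the code below estimates the standard error sqrt(σ²/n) from the sample itself and forms a normal-approximation interval. It assumes no filtering (p = 1) and uses the unbiased sample variance (dividing by n − 1), which is a choice made here rather than something the slide specifies. `CltMeanError` and its method names are illustrative.

```scala
object CltMeanError {
  // Plug-in estimate of the mean and its standard error sqrt(sigma^2 / n),
  // using the sample mean and the unbiased sample variance.
  def meanWithStdError(xs: Seq[Double]): (Double, Double) = {
    require(xs.length >= 2, "need at least two values to estimate variance")
    val n = xs.length
    val mean = xs.sum / n
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    (mean, math.sqrt(variance / n))
  }

  // Normal-approximation interval: mean ± z * standard error
  // (z = 1.96 corresponds to roughly 95% coverage).
  def confidenceInterval(xs: Seq[Double], z: Double = 1.96): (Double, Double) = {
    val (mean, se) = meanWithStdError(xs)
    (mean - z * se, mean + z * se)
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(30.0, 38.0, 34.0, 36.0, 37.0) // first five latencies
    val (mean, se) = meanWithStdError(sample)
    println(s"mean = $mean, standard error = $se")
    println(s"approx. 95% interval = ${confidenceInterval(sample)}")
  }
}
```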
[Diagram: sampling draws S from the data D; resampling draws S1 … S100 from S; θ(S1) … θ(S100) around θ(S) give a 95% confidence interval.]
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
What is the average latency in
the table?
Resample 1:
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
θ1 = 34
Resample 2:
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
θ2 = 32.5
...
Resample 100:
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
θ100 = 35
Result: 34.5 ± 2
Resample 1:
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
θ1 = 34.6
Resample 2:
ID City Latency
1 SLC 37
2 NYC 30
3 SLC 34
4 LA 36
5 NYC 30
θ2 = 33.4
...
Resample 100:
ID City Latency
1 SLC 34
2 SLC 37
3 SLC 34
4 LA 36
5 LA 36
θ100 = 35.4
Result: 35 ± 1.6
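The bootstrap procedure on these slides (resample with replacement, recompute the average on each resample, read an interval off the spread) can be sketched as follows. This is not BlinkDB's code: `BootstrapSketch` and its methods are illustrative names, and the interval is taken from empirical percentiles, which is one common choice and may differ from how the slide's "±" values were derived.

```scala
import scala.util.Random

object BootstrapSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Draw one resample of the same size as `sample`, with replacement.
  def resample(sample: Vector[Double], rng: Random): Vector[Double] =
    Vector.fill(sample.length)(sample(rng.nextInt(sample.length)))

  // Compute the mean on `b` resamples and return the empirical 2.5th and
  // 97.5th percentiles as an approximate 95% interval.
  def percentileInterval(sample: Vector[Double], b: Int, rng: Random): (Double, Double) = {
    require(sample.nonEmpty && b >= 1, "need a non-empty sample and b >= 1")
    val estimates = Vector.fill(b)(mean(resample(sample, rng))).sorted
    val lo = estimates(math.floor(0.025 * (b - 1)).toInt)
    val hi = estimates(math.ceil(0.975 * (b - 1)).toInt)
    (lo, hi)
  }

  def main(args: Array[String]): Unit = {
    val sample = Vector(30.0, 38.0, 34.0, 36.0, 37.0)
    val (lo, hi) = percentileInterval(sample, b = 100, rng = new Random(42))
    println(s"theta(S) = ${mean(sample)}, approx. 95% interval = [$lo, $hi]")
  }
}
```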
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
Error Estimation in BlinkDB
Leverage Poissonized Resampling to generate samples with replacement
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
Sample from a Poisson(1) distribution
θ1 = 33.8
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
6 SF 28 2
Incremental Error Estimation
ID City Latency #1 #2
1 NYC 30 2 1
2 NYC 38 1 0
3 SLC 34 0 2
4 SLC 34 1 2
5 SLC 37 1 0
6 SF 28 2 1
Construct all Resamples in a Single Pass
0.2-0.5% additional overhead
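Poissonized resampling, as shown in the weight columns #1 and #2, attaches an independent Poisson(1) count to each row per resample, so all resamples can be built in one pass while keeping only running sums. The sketch below is an illustration under assumptions, not BlinkDB's implementation: `PoissonizedResampling` and its methods are hypothetical names, and the Poisson draw uses Knuth's multiplication method. The fixed weights 2, 1, 0, 1, 1 from column #1 reproduce θ1 = 33.8.

```scala
import scala.util.Random

object PoissonizedResampling {
  // Weighted mean: each row counts `weight` times in the resample.
  def weightedMean(values: Seq[Double], weights: Seq[Int]): Double = {
    val total = weights.sum
    require(total > 0, "resample is empty")
    values.zip(weights).map { case (v, w) => v * w }.sum / total
  }

  // Knuth's multiplication method for a Poisson(1) draw.
  def poisson1(rng: Random): Int = {
    val limit = math.exp(-1.0)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) { k += 1; p *= rng.nextDouble() }
    k
  }

  // Single pass over the rows: draw one weight per row per resample and keep
  // only a running (weighted sum, weight total) pair for each resample.
  def resampleMeans(values: Seq[Double], b: Int, rng: Random): Vector[Double] = {
    val sums = Array.fill(b)(0.0)
    val counts = Array.fill(b)(0)
    for (v <- values; j <- 0 until b) {
      val w = poisson1(rng)
      sums(j) += v * w
      counts(j) += w
    }
    // A resample whose weights are all zero has no mean; skip it.
    (0 until b).filter(j => counts(j) > 0).map(j => sums(j) / counts(j)).toVector
  }

  def main(args: Array[String]): Unit = {
    // Weights from the slide's column #1 reproduce theta_1 = 33.8.
    println(weightedMean(Seq(30.0, 38.0, 34.0, 34.0, 37.0), Seq(2, 1, 0, 1, 1)))
    // 100 random resamples of the six-row sample, built in one pass.
    println(resampleMeans(Seq(30.0, 38.0, 34.0, 34.0, 37.0, 28.0), b = 100, rng = new Random(7)))
  }
}
```

Because each resample only needs a sum and a count, adding a row to the sample updates every resample in place, which matches the incremental error estimation idea on the slides.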
High Level Take-away:
Bootstrap and Poissonized Resampling Techniques are the key towards achieving quick and continuous error bars for a general set of queries
G-OLA
Kai Zeng, Sameer Agarwal, Ankur Dave, Michael Armbrust and Ion Stoica. G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015.
Query Execution: Under The Hood
[Diagram: Data → Query → Answer ± ε. A after 10 sec, A’ after 10+10 sec, A” after 10+10+10 sec.]
Overall Quadratic Cost!
[Diagram: A after 10 sec, A’ after 10+10 sec; A is reused by a Delta Update Query.]
Delta Update Queries
[Diagram: Data → Query → Answer ± ε, shown for successive batches.]
Delta Update: Simple Queries
SELECT avg(latency)
FROM log
[Plan diagram: SCAN → AVG → A, drawn a second time for the next batch.]
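For a simple aggregate such as AVG, a delta update only needs a small running state. The sketch below keeps a (sum, count) pair and folds each new batch into it, so earlier batches are never rescanned. `DeltaAverage`, `AvgState`, and `applyDelta` are illustrative names chosen here; this is one plausible reading of the slide, not G-OLA's actual operator.

```scala
object DeltaAverage {
  // Running state for AVG: enough to answer without rescanning old batches.
  final case class AvgState(sum: Double, count: Long) {
    def answer: Double = sum / count
    // Delta update: fold one new batch into the existing state.
    def applyDelta(batch: Seq[Double]): AvgState =
      AvgState(sum + batch.sum, count + batch.length)
  }

  val empty: AvgState = AvgState(0.0, 0L)

  // The 12 example latencies, split into three batches of four rows.
  val batches: Seq[Seq[Double]] = Seq(
    Seq(30.0, 38.0, 34.0, 36.0),
    Seq(37.0, 28.0, 32.0, 38.0),
    Seq(36.0, 35.0, 38.0, 34.0))

  // State after each batch; the initial empty state is dropped.
  def states: Seq[AvgState] =
    batches.scanLeft(empty)((s, b) => s.applyDelta(b)).tail

  def main(args: Array[String]): Unit =
    states.foreach(s => println(s"rows = ${s.count}, avg = ${s.answer}"))
}
```

Each step costs time proportional to the new batch only, which avoids the quadratic total cost of recomputing every answer from scratch.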
Delta Update: Nested Queries
SELECT avg(latency)
FROM log
WHERE latency >
(
SELECT avg(latency)
FROM log
)
[Plan diagram: inner SCAN → AVG yields A; outer SCAN → JOIN → FILTER (latency > A) → AVG. Stage (I) uses A; stage (II) uses the refined A’.]
With the inner result as A ± ε, for example 10 ± 2, the filter latency > A separates:
latency < 8
latency > 12
8 < latency < 12
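One way to read the three ranges on these slides: when the inner average is only known as 10 ± 2, rows below 8 can be rejected for good, rows above 12 can be accepted for good, and only rows between 8 and 12 need to be revisited once the inner average is refined. The sketch below illustrates that classification. It is an interpretation of the slides, with hypothetical names (`UncertainFilter`, `classify`), and it treats the boundary values 8 and 12 as uncertain, which the slides leave unspecified.

```scala
object UncertainFilter {
  sealed trait Verdict
  case object Rejected extends Verdict   // latency below the whole interval
  case object Accepted extends Verdict   // latency above the whole interval
  case object Uncertain extends Verdict  // inside the interval: decide later

  // Classify a row against the predicate `latency > A` when A is only known
  // to lie in [estimate - eps, estimate + eps].
  def classify(latency: Double, estimate: Double, eps: Double): Verdict =
    if (latency < estimate - eps) Rejected
    else if (latency > estimate + eps) Accepted
    else Uncertain

  // Rows that must be kept around and re-checked when A is refined.
  def uncertainRows(rows: Seq[Double], estimate: Double, eps: Double): Seq[Double] =
    rows.filter(r => classify(r, estimate, eps) == Uncertain)

  def main(args: Array[String]): Unit = {
    val rows = Seq(5.0, 9.0, 11.0, 15.0)
    rows.foreach(r => println(s"$r -> ${classify(r, estimate = 10.0, eps = 2.0)}"))
  }
}
```

As ε shrinks with each batch, the uncertain range narrows, so fewer rows have to be re-evaluated in later stages.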
High Level Take-away:
Introduce Delta Update Queries as a First Class Citizen in Query Execution
Check out our code!
1. Code Preview: http://github.com/amplab/bootstrap-sql. Send us an email at kaizeng@cs.berkeley.edu and sameer@databricks.com to get access!
2. Spark Package in July ’15
3. Gradual Native SparkSQL Integration in 1.5, 1.6 and beyond
Conclusion
1. Continuous Query Execution on Samples of Data is an important means to achieve interactivity in processing large datasets
2. New SparkSQL Libraries:
- BlinkDB for Continuous Error Bars
- G-OLA for Continuous Partial Answers
Thank you.
Sameer Agarwal (sameer@databricks.com)
Kai Zeng (kaizeng@cs.berkeley.edu)

Más contenido relacionado

La actualidad más candente

Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Somnath Mazumdar
 

La actualidad más candente (20)

Bakers and Philosophers
Bakers and PhilosophersBakers and Philosophers
Bakers and Philosophers
 
Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)
 
Managing Memory
Managing MemoryManaging Memory
Managing Memory
 
From Trill to Quill and Beyond
From Trill to Quill and BeyondFrom Trill to Quill and Beyond
From Trill to Quill and Beyond
 
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and TasksSegmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
JVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixJVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, Wix
 
Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm
 
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPDiscretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
 
Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace
 
Solr sparse faceting
Solr sparse facetingSolr sparse faceting
Solr sparse faceting
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 
Synchronization
SynchronizationSynchronization
Synchronization
 
Ember
EmberEmber
Ember
 
JVM Mechanics
JVM MechanicsJVM Mechanics
JVM Mechanics
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoff
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
 

Similar a BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL-(Sameer Agarwal and Kai Zeng, Databricks and AMPLab UC Berkeley

Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
Databricks
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for C
Susumu Tokumoto
 
Big Data-Driven Applications with Cassandra and Spark
Big Data-Driven Applications  with Cassandra and SparkBig Data-Driven Applications  with Cassandra and Spark
Big Data-Driven Applications with Cassandra and Spark
Artem Chebotko
 

Similar a BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL-(Sameer Agarwal and Kai Zeng, Databricks and AMPLab UC Berkeley (20)

Open Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second eraOpen Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second era
 
Open Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second eraOpen Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second era
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL Engine
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for C
 
Modeling computer networks by colored Petri nets
Modeling computer networks by colored Petri netsModeling computer networks by colored Petri nets
Modeling computer networks by colored Petri nets
 
pipeline and vector processing
pipeline and vector processingpipeline and vector processing
pipeline and vector processing
 
Binary Analysis - Luxembourg
Binary Analysis - LuxembourgBinary Analysis - Luxembourg
Binary Analysis - Luxembourg
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
RxJava In Baby Steps
RxJava In Baby StepsRxJava In Baby Steps
RxJava In Baby Steps
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdf
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
 
Model-counting Approaches For Nonlinear Numerical Constraints
Model-counting Approaches For Nonlinear Numerical ConstraintsModel-counting Approaches For Nonlinear Numerical Constraints
Model-counting Approaches For Nonlinear Numerical Constraints
 
Big Data-Driven Applications with Cassandra and Spark
Big Data-Driven Applications  with Cassandra and SparkBig Data-Driven Applications  with Cassandra and Spark
Big Data-Driven Applications with Cassandra and Spark
 
Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...
Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...
Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...
 
Self-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingSelf-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processing
 
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
 
Correctness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLCorrectness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQL
 

Más de Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Último

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL (Sameer Agarwal and Kai Zeng, Databricks and AMPLab UC Berkeley)

  • 1. BlinkDB and G-OLA: Supporting Approximate Answers in SparkSQL Sameer Agarwal and Kai Zeng Spark Summit | San Francisco, CA | June 15th 2015
  • 2. About Us
      1. Sameer Agarwal
         - Software Engineer at Databricks
         - PhD in Databases (UC Berkeley)
         - Research on Approximate Query Processing (BlinkDB)
      2. Kai Zeng
         - Post-doc in AMP Lab / Intern at Databricks
         - PhD in Databases (UCLA)
         - Research on Approximate Query Processing (ABM)
  • 3. 100 TB on 1000 machines. Hard Disks: ½ to 1 hour. Memory: 1 to 5 minutes. ?: 1 second. Continuous Query Execution on Samples of Data.
  • 4. Continuous Query Execution on Samples. What is the average latency in the table? 34.6667

        ID  City  Latency
         1  NYC   30
         2  NYC   38
         3  SLC   34
         4  LA    36
         5  SLC   37
         6  SF    28
         7  NYC   32
         8  NYC   38
         9  LA    36
        10  SF    35
        11  NYC   38
        12  LA    34

  • 5. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35
  • 6. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35 ± 2.1
  • 7. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35 ± 2.1, then 33.83 ± 1.3
  • 8. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35 ± 2.1, then 33.83 ± 1.3, then 34.6667 ± 0.0
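The point estimates on slides 4 to 8 can be reproduced by averaging a growing prefix of the table. The sketch below assumes the samples are simply the first 5, the first 6 and then all 12 rows; the slides do not say this explicitly, but those prefixes give 35, 33.83 and 34.6667. The object and method names are illustrative only, and the ± values are not computed here.

```scala
object PrefixAverages {
  // Latency column of the 12-row example table from the slides.
  val latencies: Vector[Double] =
    Vector(30, 38, 34, 36, 37, 28, 32, 38, 36, 35, 38, 34).map(_.toDouble)

  /** Mean of the first `k` rows, i.e. the answer after seeing a k-row sample. */
  def prefixMean(k: Int): Double = {
    require(k >= 1 && k <= latencies.length, "k must be between 1 and the table size")
    latencies.take(k).sum / k
  }

  def main(args: Array[String]): Unit =
    for (k <- Seq(5, 6, 12)) println(s"first $k rows: ${prefixMean(k)}")
}
```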
  • 10. Continuous Query Execution on Samples. Diagram: the query SELECT foo(*) FROM TABLE returns an answer A ± ε; the layers shown are Error Estimation, Query Execution and Data Storage.
  • 11. Continuous Query Execution on Samples. Same diagram as slide 10, with the label G-OLA added.
  • 12. Interface

        val dataFrame = sqlCtx.sql("select avg(latency) from log")

        // batch processing
        val result = dataFrame.collect() // 34.6667

  • 13. Interface

        val dataFrame = sqlCtx.sql("select avg(latency) from log")

        // online processing
        val onlineDataFrame = dataFrame.online
        onlineDataFrame.collectNext() // 35 ± 2.1
        onlineDataFrame.collectNext() // 33.83 ± 1.3

  • 14. Interface

        val dataFrame = sqlCtx.sql("select avg(latency) from log")

        // online processing
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext()) {
          onlineDataFrame.collectNext()
        }

  • 15. Interface

        val dataFrame = sqlCtx.sql("select avg(latency) from log")

        // online processing
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext() && responseTime <= 10.seconds) {
          onlineDataFrame.collectNext()
        }

  • 16. Interface

        val dataFrame = sqlCtx.sql("select avg(latency) from log")

        // online processing
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext() && errorBound >= 0.01) {
          onlineDataFrame.collectNext()
        }
  • 17. Interface

        val dataFrame = sqlCtx.sql("select avg(latency) from log")

        // online processing: keep refining until the user cancels
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext() && !userEvent.cancelled()) {
          onlineDataFrame.collectNext()
        }

  • 18. Interface. Same code as slide 17, with the labels: AGGREGATES/UDAFs, JOINS/GROUP BYs, NESTED QUERIES.
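To make the stopping conditions on slides 14 to 17 concrete, here is a minimal, Spark-free sketch. `OnlineAverage` is a hypothetical stand-in for the slides' online DataFrame, not the BlinkDB or G-OLA implementation: it averages one more batch of rows per `collectNext()` call and attaches a CLT-style 95% bound (an assumption; the talk estimates errors with the bootstrap). `runUntil` combines the error-bound and response-time conditions in one loop.

```scala
object OnlineAggregationSketch {
  /** One partial answer: point estimate plus an error bound (hypothetical shape). */
  final case class Estimate(value: Double, errorBound: Double)

  /** Hypothetical stand-in for an online DataFrame computing avg over `rows`. */
  final class OnlineAverage(rows: Vector[Double], batchSize: Int) {
    require(rows.nonEmpty && batchSize >= 1, "need data and a positive batch size")
    private var seen = 0
    private var sum = 0.0
    private var sumSq = 0.0

    def hasNext(): Boolean = seen < rows.length

    /** Folds in the next batch and returns the refreshed estimate. */
    def collectNext(): Estimate = {
      val batch = rows.slice(seen, seen + batchSize)
      seen += batch.length
      sum += batch.sum
      sumSq += batch.map(x => x * x).sum
      val mean = sum / seen
      val bound =
        if (!hasNext()) 0.0                       // all rows seen: answer is exact
        else if (seen < 2) Double.PositiveInfinity // no variance estimate yet
        else {
          val variance = math.max(0.0, (sumSq - seen * mean * mean) / (seen - 1))
          1.96 * math.sqrt(variance / seen)
        }
      Estimate(mean, bound)
    }
  }

  /** Refine until data runs out, the error target is met, or the deadline passes. */
  def runUntil(q: OnlineAverage, targetError: Double, deadlineMillis: Long): Estimate = {
    val start = System.currentTimeMillis()
    var latest = q.collectNext()
    while (q.hasNext() && latest.errorBound > targetError &&
           System.currentTimeMillis() - start <= deadlineMillis) {
      latest = q.collectNext()
    }
    latest
  }
}
```

A caller would typically pick whichever condition matters for the application (time budget, error target, or user cancellation) rather than all of them at once.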
  • 19. Continuous Query Execution on Samples. Diagram: the query SELECT foo(*) FROM TABLE returns an answer A ± ε; the layers shown are Error Estimation, Query Execution and Data Storage.
  • 20. Continuous Query Execution on Samples. Same diagram as slide 19, with a Query Interface layer added. References:
      - Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys 2013.
      - Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan. A General Bootstrap Performance Diagnostic. In ACM KDD 2013.
      - Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica. Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems. In ACM SIGMOD 2014.
  • 21. Error Estimation on a Sample of Data. Focused on estimating aggregate errors given representative samples.
      - Central Limit Theorem (CLT) [HOE:ASTAT63, BIL:WILEY86, CGL:ASTAT83, PH:IBM96]
      - Error Estimation using Bootstrap [EFRON:JAS82, EFRON:JAS87, VP:TPMS80, FGK:IJCAI99, ET:CH93]

    Text excerpt shown on the slide: "...d predicate for the query) The following results are (asymptotically in sample size) true, but not directly useful, since they depend on unknown properties of the underlying distribution. In all cases we just plug in the sample values. For example, instead of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$ where $X_i$ is the $i$th sample value. Note that for estimators other than sum and count, I assume no filtering ($p = 1$). Filtering will increase variance a bit, or potentially a lot for extremely selective queries ($p = 0$). I can compute the filtering-adjusted values if you like."
      1. Count: $N(np,\ n(1-p)p)$
      2. Sum: $N(np\mu,\ np(\sigma^2 + (1-p)\mu^2))$
      3. Mean: $N(\mu,\ \sigma^2/n)$
      4. Variance: $N(\sigma^2,\ (\mu_4 - \sigma^4)/n)$
      5. Stddev: $N(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n))$

    Bootstrap diagram: Sampling takes the data D to a sample S; Resampling takes S to S1 ... S100; the estimates θ(S1) ... θ(S100), together with θ(S), give a 95% confidence interval.
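As a small illustration of the "plug in the sample values" remark for the Mean case, $N(\mu, \sigma^2/n)$, the following sketch computes a 95% error bar as 1.96 · s / √n, where s is the sample standard deviation. The 1.96 multiplier and the n − 1 variance denominator are standard choices assumed here, not taken from the slides, and the result is not meant to reproduce the illustrative ± values shown earlier in the deck.

```scala
object CltErrorBars {
  /** Returns (sample mean, 95% CLT half-width) using plug-in estimates. */
  def meanWithErrorBar(sample: Seq[Double]): (Double, Double) = {
    require(sample.length >= 2, "need at least two values to estimate the variance")
    val n = sample.length
    val mean = sample.sum / n
    // Unbiased sample variance stands in for the unknown sigma^2.
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    (mean, 1.96 * math.sqrt(variance / n))
  }
}
```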
  • 22. Error Estimation using Bootstrap (same table as slide 4). What is the average latency in the table?
  • 23. Error Estimation using Bootstrap (same table as slide 4). Resamples of 4 rows each:
      - Resample 1: NYC 30, NYC 38, SLC 34, SLC 34; θ1 = 34
      - Resample 2: NYC 30, NYC 30, SLC 34, LA 36; θ2 = 32.5
      - ...
      - Resample 100: SLC 34, LA 36, SLC 34, LA 36; θ100 = 35
    Result: 34.5 ± 2
  • 24. Same as slide 23.
  • 25. Error Estimation using Bootstrap (same table as slide 4). Resamples of 5 rows each:
      - Resample 1: NYC 30, NYC 38, SLC 34, SLC 34, SLC 37; θ1 = 34.6
      - Resample 2: SLC 37, NYC 30, SLC 34, LA 36, NYC 30; θ2 = 33.4
      - ...
      - Resample 100: SLC 34, SLC 37, SLC 34, LA 36, LA 36; θ100 = 35.4
    Result: 35 ± 1.6
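The bootstrap procedure on slides 23 to 25 can be sketched as follows: draw 100 resamples with replacement from the sample, compute the mean of each, and read the error bar off the spread of those means. Taking half the width of the central 95% of resampled means is one common convention and an assumption here; the slides do not state how their ± values were derived, so the output will not necessarily match them. All names are illustrative.

```scala
import scala.util.Random

object BootstrapSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  /** Means of `numResamples` resamples, each drawn with replacement at the sample's size. */
  def resampleMeans(sample: Vector[Double], numResamples: Int, rng: Random): Vector[Double] =
    Vector.fill(numResamples) {
      mean(Vector.fill(sample.length)(sample(rng.nextInt(sample.length))))
    }

  /** Returns (point estimate, half-width of the central 95% of resampled means). */
  def errorBar(sample: Vector[Double], numResamples: Int = 100, seed: Long = 42L): (Double, Double) = {
    require(sample.nonEmpty && numResamples >= 1, "need data and at least one resample")
    val sorted = resampleMeans(sample, numResamples, new Random(seed)).sorted
    val lo = sorted(((numResamples - 1) * 0.025).round.toInt)
    val hi = sorted(((numResamples - 1) * 0.975).round.toInt)
    (mean(sample), (hi - lo) / 2)
  }
}
```

Calling `BootstrapSketch.errorBar(Vector(30, 38, 34, 36, 37))` gives a point estimate of 35 with a seed-dependent error bar.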
  • 26. Error Estimation in BlinkDB (same table as slide 4). What is the average latency in the table? Leverage Poissonized Resampling to generate samples with replacement. Sample shown: NYC 30, NYC 38, SLC 34, SLC 34, SLC 37.
  • 27. Error Estimation in BlinkDB (same table as slide 4). Sample from a Poisson(1) distribution; column #1 is each row's count in resample 1. θ1 = 33.8

        ID  City  Latency  #1
         1  NYC   30       2
         2  NYC   38       1
         3  SLC   34       0
         4  SLC   34       1
         5  SLC   37       1

  • 28. Error Estimation in BlinkDB (same table as slide 4). Incremental Error Estimation: a sixth row joins the sample.

        ID  City  Latency  #1
         1  NYC   30       2
         2  NYC   38       1
         3  SLC   34       0
         4  SLC   34       1
         5  SLC   37       1
         6  SF    28       2

  • 29. Error Estimation in BlinkDB (same table as slide 4). Construct all Resamples in a Single Pass. 0.2-0.5% additional overhead.

        ID  City  Latency  #1  #2
         1  NYC   30       2   1
         2  NYC   38       1   0
         3  SLC   34       0   2
         4  SLC   34       1   2
         5  SLC   37       1   0
         6  SF    28       2   1
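A minimal sketch of the idea on slides 26 to 29, under stated assumptions: instead of materialising each resample, every incoming row draws an independent Poisson(1) count per resample, and only a running weighted sum and weight total are kept for each resample. The sampler uses Knuth's method; class and method names are hypothetical and this is not BlinkDB's code. `weightedMean` with the weights from slide 27 reproduces θ1 = 33.8.

```scala
import scala.util.Random

object PoissonizedResampling {
  /** Knuth's method for one draw from Poisson(1). */
  def poisson1(rng: Random): Int = {
    val limit = math.exp(-1.0)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) { k += 1; p *= rng.nextDouble() }
    k
  }

  /** Mean of a resample in which row i appears `weights(i)` times. */
  def weightedMean(values: Seq[Double], weights: Seq[Int]): Double = {
    val total = weights.sum
    require(total > 0, "resample is empty")
    values.zip(weights).map { case (v, w) => v * w }.sum / total
  }

  /** Running (weighted sum, weight total) for every resample, updated row by row. */
  final class Resamples(numResamples: Int, rng: Random) {
    private val sums = Array.fill(numResamples)(0.0)
    private val counts = Array.fill(numResamples)(0)

    /** Single pass: the new row is weighted independently in each resample. */
    def add(value: Double): Unit =
      for (r <- 0 until numResamples) {
        val w = poisson1(rng)
        sums(r) += w * value
        counts(r) += w
      }

    /** Current estimates θ_r for the resamples that are non-empty so far. */
    def means: Vector[Double] =
      sums.indices.collect { case r if counts(r) > 0 => sums(r) / counts(r) }.toVector
  }
}
```

Because `add` touches only per-resample accumulators, the estimates can be refreshed after every new row or batch without revisiting earlier rows, which is what makes the error estimation incremental.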
  • 30. High Level Take-away: Bootstrap and Poissonized Resampling techniques are the key to quick, continuous error bars for a general set of queries
  • 31. Continuous Query Execution on Samples. Same diagram as slide 20 (Query Interface, Error Estimation, Query Execution, Data Storage), with the label G-OLA added. Reference:
      - Kai Zeng, Sameer Agarwal, Ankur Dave, Michael Armbrust and Ion Stoica. G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015.
  • 32. Query Execution: Under The Hood. Diagram: a Query over the Data returns Answer A ± ε (10 sec).
  • 33. Query Execution: Under The Hood. As slide 32, plus a second answer A′ (10+10 sec).
  • 34. Query Execution: Under The Hood. As slide 33, plus a third answer A″ (10+10+10 sec).
  • 35. Query Execution: Under The Hood. As slide 34, with the caption: Overall Quadratic Cost!
  • 36. Query Execution: Under The Hood. Diagram: Answer A ± ε (10 sec) and A′ (10+10 sec).
  • 37. Same as slide 36.
  • 38. Query Execution: Under The Hood. As slide 36, with the answer A shown a second time.
  • 39. Same as slide 38.
  • 40. Query Execution: Under The Hood. As slide 38, with the caption: Delta Update Query.
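The "Overall Quadratic Cost" caption on slide 35 follows from simple arithmetic. Assuming, as the 10 sec labels suggest, that every batch takes the same time $t$ to process and that each refreshed answer reprocesses all batches seen so far, the total time for $k$ answers is compared below with an idealised delta update that only processes the new batch each time:

```latex
T_{\mathrm{rerun}}(k) = \sum_{i=1}^{k} i\,t = \frac{k(k+1)}{2}\,t,
\qquad
T_{\mathrm{delta}}(k) = \sum_{i=1}^{k} t = k\,t .
```

With $t = 10$ sec, three answers cost $10 + 20 + 30 = 60$ sec when rerun from scratch, against $30$ sec in the idealised delta case. $T_{\mathrm{delta}}$ is a lower bound under this assumption; it ignores any rows a delta update has to revisit.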
  • 47. Delta Update: Simple Queries. Query: SELECT avg(latency) FROM log. Plan operators: AVG, SCAN. Label: A.
  • 48. Same as slide 47.
  • 49. Delta Update: Simple Queries. As slide 47, with a second AVG, SCAN plan labelled A drawn beside the first.
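For a simple aggregate such as `SELECT avg(latency) FROM log`, a delta update only needs a small running state. The sketch below keeps (sum, count) and folds each new batch into it, so earlier rows are never rescanned. `State` and `update` are illustrative names and this is an assumption about how such a state could look, not G-OLA's internal representation.

```scala
object DeltaAverage {
  /** State carried between batches: enough to recompute AVG without rescanning. */
  final case class State(sum: Double, count: Long) {
    def average: Double = {
      require(count > 0, "no rows seen yet")
      sum / count
    }

    /** Delta update: fold only the new batch into the existing state. */
    def update(batch: Seq[Double]): State =
      State(sum + batch.sum, count + batch.length)
  }

  val empty: State = State(0.0, 0L)
}
```

For example, `DeltaAverage.empty.update(Seq(30, 38, 34, 36, 37))` has average 35, and a further `.update(Seq(28))` gives 203 / 6, matching the 33.83 shown earlier in the deck.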
  • 50. Delta Update: Nested Queries. Query: SELECT avg(latency) FROM log WHERE latency > (SELECT avg(latency) FROM log). Plan operators: FILTER, JOIN, AVG, SCAN, SCAN, AVG. Labels: A; latency > A; (I).
  • 51. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: A; latency > A; A′; A′; (I); (II).
  • 52. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: A; latency > A; (I).
  • 53. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: latency > A; A ± ε; (I).
  • 54. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: latency > A; 10 ± 2; (I).
  • 55. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: latency > A; 10 ± 2; latency < 8; (I).
  • 56. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: latency > A; 10 ± 2; latency > 12; (I).
  • 57. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: latency > A; 10 ± 2; 8 < latency < 12; (I).
  • 58. Delta Update: Nested Queries. Same query and plan as slide 50. Labels: latency > A; 10 ± 2; 8 < latency < 12; (I); (II).
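One reading of slides 53 to 58, sketched under that assumption: when the inner average is only known as a range such as 10 ± 2, the outer predicate `latency > A` is already decided for rows above 12 (they pass) and rows below 8 (they fail), while rows between 8 and 12 stay uncertain and may need to be revisited when the inner estimate is refined. The names below are hypothetical, and rows exactly on a boundary are conservatively treated as uncertain.

```scala
object NestedDeltaSketch {
  /** Inner aggregate as a range, e.g. Range(10, 2) covers [8, 12]. */
  final case class Range(estimate: Double, error: Double) {
    val lower: Double = estimate - error
    val upper: Double = estimate + error
  }

  /** Rows split by whether `latency > A` is decided for every A in the range. */
  final case class Partition(pass: Vector[Double], fail: Vector[Double], uncertain: Vector[Double])

  def classify(latencies: Seq[Double], inner: Range): Partition = {
    // Above the upper end: the predicate holds whatever A turns out to be.
    val (pass, rest) = latencies.toVector.partition(_ > inner.upper)
    // Below the lower end: the predicate fails whatever A turns out to be.
    val (fail, uncertain) = rest.partition(_ < inner.lower)
    Partition(pass, fail, uncertain)
  }
}
```

Only the uncertain rows need to be kept around for later re-evaluation; as the error bar on the inner average shrinks, that set shrinks with it.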
  • 59. High Level Take-away: Introduce Delta Update Queries as a First Class Citizen in Query Execution
  • 60. Check out our code!
      1. Code Preview: http://github.com/amplab/bootstrap-sql. Send us an email at kaizeng@cs.berkeley.edu and sameer@databricks.com to get access!
      2. Spark Package in July ’15
      3. Gradual Native SparkSQL Integration in 1.5, 1.6 and beyond
  • 61. Conclusion
      1. Continuous Query Execution on Samples of Data is an important means to achieve interactivity in processing large datasets
      2. New SparkSQL Libraries:
         - BlinkDB for Continuous Error Bars
         - G-OLA for Continuous Partial Answers