SlideShare una empresa de Scribd logo
1 de 21
StreamApprox
Approximate  Stream  Analytics  in  
Apache  Flink
Sep  2017
Do  Le  Quoc,  Pramod Bhatotia,  
Ruichuan Chen,    Christof  Fetzer,  Volker  Hilt, Thorsten  Strufe
Modern  online  services
1
Stream  
Aggregator
Stream  
Analytics  
System
Useful
Information
2
Modern  online  services
Low  latency  
Tension
Approximate  computing
Efficient  resource
utilization
Approximate  Computing
3
Many  applications:
Approximate  output  is  good  enough!
E.g.  :  Google  Trends  -­‐-­‐-­‐ Big  Data  vs  Machine  Learning  (Sep/2012  – Sep/2017)
The  trend  of  data  is  more  important  than the  precise  numbers
Approximate  Computing
4
Take  a  
sample Approximate  output  
± Error  bound  Compute
Approximate  computing
Idea:  To  achieve  low  latency,  compute  over  a  sub-­‐set  of  data  items  
instead  of  the  entire  data-­‐set
State-­‐of-­‐the-­‐art  systems
5
ApproxHadoop [ASPLOS’15] Using  multi-­‐stage  sampling
BlinkDB [EuroSyS’13] Using  pre-­‐existing  samples
Quickr [SIGMOD’16] Injecting  samplers  into  query  plan
Not  designed  
for  
stream  analytics
StreamApprox:  Design  goals
6
Practical Supports  adaptive  execution  based  on  query  budget
Efficient Employs  online  sampling  techniques
Transparent Targets  existing  applications  w/  minor  code  changes
Outline
• Motivation
• Design
• Evaluation
7
StreamApprox:  Overview
8
Input  data  stream
Approximate  output  
± Error  bound
StreamApprox
Streaming  query Query  budget  
Stream  
Aggregator  
(E.g Kafka)
S1
S2
Sn
…
data  stream
Query  budget:  
• Latency/throughput guarantees
• Desired  computing  resources  for  query  processing
• Desired  accuracy
Key  idea:  Sampling
9
Simple  random  sampling  (SRS):
Stratified  sampling  (STS):
SRS
SRS
SRS
SRS
Key  idea:  Sampling
Reservoir  sampling  (RS):
10
i
Size  of  reservoir  =  k
Replace  by  item  i
Drop  item  i
Spark-­‐based  Sampling
11
Step  
#1
Create  strata  
using  groupByKey()
Step  
#3
Synchronize  between  
worker  nodes
to  select  a  
sample  of  size  k
Step  
#2
Apply  SRS
to  each  stratum  Si
These  steps  are  very  expensive
Spark-­‐based  Stratified  Sampling  (Spark-­‐based  STS)
StreamApprox:  Core  idea
12
Easy  to  parallelize,  doesn't  
need  any  synchronization  
between  workers
RS Weight  =  #items/k  =  6/4S2
RS
S3
Weight  =  1
S1
RS
Size  of  reservoir  =  k
Weight  =  #items/k  =  8/4
RS  :  Reservoir  Sampling
k  =  4
Online  Adaptive  Stratified  Reservoir  Sampling  (OASRS)
StreamApprox:  Core  idea
13
Weight  =  2
Weight  =  1.5
Weight  =  1
OASRS
Worker  1
Weight  =  1
Weight  =  2
Weight  =  1.5
Size  of  reservoir  =  4
OASRS
Worker  2
Implementation
14
Approximate  output  
± Error  bound
StreamApprox
Stream  
Aggregator
S1
S2
Sn
…
data  stream
Implementation
15
Sampling  
module
Flink Computation  
Engine
Output
± Error  bound
Error  
Estimation  
moduleRefined  sampling  
parameters
Stream  
aggregator  
S1
S2
Sn
…
Flink-­‐based  StreamApprox
Outline
• Motivation
• Design
• Evaluation
16
Experimental  setup
• Evaluation  questions
• Throughput  vs  sample  size
• Throughput  vs  accuracy
• Testbed
• Cluster:  17  nodes  
• Datasets:  
• Synthesis:  Gaussian  distribution,  Poisson  distribution  datasets  
• CAIDA  Network  traffic  traces;  NYC  Taxi  ride  records
17
See  the  paper  
for  more  
results!
Throughput
0
2
4
6
8
10 20 40 60 80
Throughput  (M)
#items/s
Sampling  fraction  (%)
Flink-­‐based  StreamApprox
Spark-­‐based  StreamApprox
Spark-­‐based  STS
18
Higher
the  better
Spark-­‐based  StreamApprox: ~2X  higher  throughput  over  Spark-­‐based  STS
Flink-­‐based  StreamApprox:  1.3X  higher  throughput  over  Spark-­‐based  StreamApprox
With  sampling  fraction  <  60%
Throughput  vs  Accuracy
0
1
2
3
4
5
0.5 1
Throughput  (M)
#items/s
Accuracy  loss  (%)
Flink-­‐based  StreamApprox
Spark-­‐based  StreamApprox
Spark-­‐based  STS
19
Higher
the  better
Spark-­‐based  StreamApprox: ~1.32X  higher  throughput  over  Spark-­‐based  STS
Flink-­‐based  StreamApprox:  1.62X  higher  throughput  over  Spark-­‐based  StreamApprox
With  the  same  accuracy  loss
Conclusion
20
StreamApprox:  Approximate  computing  for  stream  analytics
Practical Adaptive  execution  based  on  query  budget
Efficient Online  stratified  sampling  technique
Thank  you!
Details:  StreamApprox [Middleware’17]  
https://streamapprox.github.io
Transparent Supports  applications  w/  minor  code  changes

Más contenido relacionado

La actualidad más candente

Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Flink Forward
 
Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015
Andra Lungu
 

La actualidad más candente (20)

Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
Apache flink
Apache flinkApache flink
Apache flink
 
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
 
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
 
Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
 
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
 
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
 
Stateful stream processing with Apache Flink
Stateful stream processing with Apache FlinkStateful stream processing with Apache Flink
Stateful stream processing with Apache Flink
 

Similar a Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approximate computing for stream analytics in Apache Flink

IPLC Analytic Dashboard - Mohd Rizal bin Mohd Ramly
IPLC Analytic Dashboard - Mohd Rizal bin Mohd RamlyIPLC Analytic Dashboard - Mohd Rizal bin Mohd Ramly
IPLC Analytic Dashboard - Mohd Rizal bin Mohd Ramly
MyNOG
 

Similar a Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approximate computing for stream analytics in Apache Flink (20)

Edge Comp.pptx
Edge Comp.pptxEdge Comp.pptx
Edge Comp.pptx
 
approxIoT.pptx
approxIoT.pptxapproxIoT.pptx
approxIoT.pptx
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
Delivering Application-Layer​ Traffic Optimization​ (ALTO) Services based on ...
Delivering Application-Layer​ Traffic Optimization​ (ALTO) Services based on ...Delivering Application-Layer​ Traffic Optimization​ (ALTO) Services based on ...
Delivering Application-Layer​ Traffic Optimization​ (ALTO) Services based on ...
 
Geospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning DataGeospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning Data
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
 
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
IEEE P2P 2013 - Bootstrapping Skynet: Calibration and Autonomic Self-Control ...
 
RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion Stoica
RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion StoicaRISELab: Enabling Intelligent Real-Time Decisions keynote by Ion Stoica
RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion Stoica
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
 
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics EcosystemXDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
IPLC Analytic Dashboard - Mohd Rizal bin Mohd Ramly
IPLC Analytic Dashboard - Mohd Rizal bin Mohd RamlyIPLC Analytic Dashboard - Mohd Rizal bin Mohd Ramly
IPLC Analytic Dashboard - Mohd Rizal bin Mohd Ramly
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
 

Más de Flink Forward

Más de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Último

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 

Último (20)

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 

Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approximate computing for stream analytics in Apache Flink

  • 1. StreamApprox Approximate  Stream  Analytics  in   Apache  Flink Sep  2017 Do  Le  Quoc,  Pramod Bhatotia,   Ruichuan Chen,    Christof  Fetzer,  Volker  Hilt, Thorsten  Strufe
  • 2. Modern  online  services 1 Stream   Aggregator Stream   Analytics   System Useful Information
  • 3. 2 Modern  online  services Low  latency   Tension Approximate  computing Efficient  resource utilization
  • 4. Approximate  Computing 3 Many  applications: Approximate  output  is  good  enough! E.g.  :  Google  Trends  -­‐-­‐-­‐ Big  Data  vs  Machine  Learning  (Sep/2012  – Sep/2017) The  trend  of  data  is  more  important  than the  precise  numbers
  • 5. Approximate  Computing 4 Take  a   sample Approximate  output   ± Error  bound  Compute Approximate  computing Idea:  To  achieve  low  latency,  compute  over  a  sub-­‐set  of  data  items   instead  of  the  entire  data-­‐set
  • 6. State-­‐of-­‐the-­‐art  systems 5 ApproxHadoop [ASPLOS’15] Using  multi-­‐stage  sampling BlinkDB [EuroSyS’13] Using  pre-­‐existing  samples Quickr [SIGMOD’16] Injecting  samplers  into  query  plan Not  designed   for   stream  analytics
  • 7. StreamApprox:  Design  goals 6 Practical Supports  adaptive  execution  based  on  query  budget Efficient Employs  online  sampling  techniques Transparent Targets  existing  applications  w/  minor  code  changes
  • 9. StreamApprox:  Overview 8 Input  data  stream Approximate  output   ± Error  bound StreamApprox Streaming  query Query  budget   Stream   Aggregator   (E.g Kafka) S1 S2 Sn … data  stream Query  budget:   • Latency/throughput guarantees • Desired  computing  resources  for  query  processing • Desired  accuracy
  • 10. Key  idea:  Sampling 9 Simple  random  sampling  (SRS): Stratified  sampling  (STS): SRS SRS SRS SRS
  • 11. Key  idea:  Sampling Reservoir  sampling  (RS): 10 i Size  of  reservoir  =  k Replace  by  item  i Drop  item  i
  • 12. Spark-­‐based  Sampling 11 Step   #1 Create  strata   using  groupByKey() Step   #3 Synchronize  between   worker  nodes to  select  a   sample  of  size  k Step   #2 Apply  SRS to  each  stratum  Si These  steps  are  very  expensive Spark-­‐based  Stratified  Sampling  (Spark-­‐based  STS)
  • 13. StreamApprox:  Core  idea 12 Easy  to  parallelize,  doesn't   need  any  synchronization   between  workers RS Weight  =  #items/k  =  6/4S2 RS S3 Weight  =  1 S1 RS Size  of  reservoir  =  k Weight  =  #items/k  =  8/4 RS  :  Reservoir  Sampling k  =  4 Online  Adaptive  Stratified  Reservoir  Sampling  (OASRS)
  • 14. StreamApprox:  Core  idea 13 Weight  =  2 Weight  =  1.5 Weight  =  1 OASRS Worker  1 Weight  =  1 Weight  =  2 Weight  =  1.5 Size  of  reservoir  =  4 OASRS Worker  2
  • 15. Implementation 14 Approximate  output   ± Error  bound StreamApprox Stream   Aggregator S1 S2 Sn … data  stream
  • 16. Implementation 15 Sampling   module Flink Computation   Engine Output ± Error  bound Error   Estimation   moduleRefined  sampling   parameters Stream   aggregator   S1 S2 Sn … Flink-­‐based  StreamApprox
  • 18. Experimental  setup • Evaluation  questions • Throughput  vs  sample  size • Throughput  vs  accuracy • Testbed • Cluster:  17  nodes   • Datasets:   • Synthesis:  Gaussian  distribution,  Poisson  distribution  datasets   • CAIDA  Network  traffic  traces;  NYC  Taxi  ride  records 17 See  the  paper   for  more   results!
  • 19. Throughput 0 2 4 6 8 10 20 40 60 80 Throughput  (M) #items/s Sampling  fraction  (%) Flink-­‐based  StreamApprox Spark-­‐based  StreamApprox Spark-­‐based  STS 18 Higher the  better Spark-­‐based  StreamApprox: ~2X  higher  throughput  over  Spark-­‐based  STS Flink-­‐based  StreamApprox:  1.3X  higher  throughput  over  Spark-­‐based  StreamApprox With  sampling  fraction  <  60%
  • 20. Throughput  vs  Accuracy 0 1 2 3 4 5 0.5 1 Throughput  (M) #items/s Accuracy  loss  (%) Flink-­‐based  StreamApprox Spark-­‐based  StreamApprox Spark-­‐based  STS 19 Higher the  better Spark-­‐based  StreamApprox: ~1.32X  higher  throughput  over  Spark-­‐based  STS Flink-­‐based  StreamApprox:  1.62X  higher  throughput  over  Spark-­‐based  StreamApprox With  the  same  accuracy  loss
  • 21. Conclusion 20 StreamApprox:  Approximate  computing  for  stream  analytics Practical Adaptive  execution  based  on  query  budget Efficient Online  stratified  sampling  technique Thank  you! Details:  StreamApprox [Middleware’17]   https://streamapprox.github.io Transparent Supports  applications  w/  minor  code  changes