SlideShare a Scribd company logo
1 of 46
Getafix: Using Popularity to Curtail Memory Costs in
Interactive Analytics Engines
Distributed Protocols Research Group
http://dprg.cs.uiuc.edu/
Mainak Ghosh*, Ashwini Raina*, Le Xu*, Xiaoyao
Qian*, Indranil Gupta*, Himanshu Gupta#
*University of Illinois at Urbana Champaign, #Oath Inc.
Real-Time Analytics
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 2
Distributed
OLAP
Streaming
Analytics
Worth 13.7 billion dollars by
2021[1]
Interactive Analytics Engine
[1] Research and Markets. 2015. Streaming Analytics Market by Verticals –
Worldwide Market Forecast & Analysis (2015 - 2020). Report. (June 2015).
Interactive Analytics Engines
6/11/2018
Popular is Cheaper: Curtailing Memory Costs in Interactive
Analytics Engines
3
Data Ingestion
Deep Storage
Batch
Data
Real-time
Data
Query Compute
Router
Compute Nodes
Prefetched, Cached
Replicated
Handles petabytes of
data and millions of
queries per day
Requires large
amount of memorySegments
Less replication => less memory
Memory is not cheap …
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 4
Managing data is important in these
systems
Memory Performance Tradeoff
• Uniform Replication
Strategy
• Difficult to determine
the knee offline
• Can we do better?
6/11/2018
Popular is Cheaper: Curtailing Memory Costs in Interactive
Analytics Engines
5
Observed Knee
Skewed Access Pattern
6/11/2018
Popular is Cheaper: Curtailing Memory Costs in Interactive
Analytics Engines
6
Top 1%
segments
43% of
query
accesses
Reducing Memory
• Selective Replication
• Scarlett
[Ananthanarayanan’11]
• Data Compression
• Blowfish [Khandelwal’16]
• EC-Cache[Rashmi’16]
6/11/2018
Popular is Cheaper: Curtailing Memory Costs in Interactive
Analytics Engines
7
Getafix: Goals
• Minimize memory
required
• Reduce cost of cluster
provisioning
• Minimize Makespan
• Small impact on
average query latency
6/11/2018
Popular is Cheaper: Curtailing Memory Costs in Interactive
Analytics Engines
8
Goal
Agenda
• Motivation
• Problem Formulation
• System Design
• Evaluation
• Takeaway
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 9
Colored Blocks and Bins …
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines
CN3 CN4
Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average)
Goal is to provide
a load balanced
assignment with
the least amount
of replication
S1 (5 AM – 6 AM)
S2 (6 AM – 7 AM)
S3 (7 AM – 8 AM)
10
CN1 CN2
Segments
Compute Nodes
CN1
An arbitrary mapping ...
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines
CN3 CN4
Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average)
S1 (5 AM – 6 AM)
S2 (6 AM – 7 AM)
S3 (7 AM – 8 AM)
11
CN1
CN2
Segments
Compute Nodes
CN1
2 22 1
Load Balanced
Assignment.
Number of
Replicas: 7
Solution which minimizes number of replicas …
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines
CN3 CN4
Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average)
S1 (5 AM – 6 AM)
S2 (6 AM – 7 AM)
S3 (7 AM – 8 AM)
12
CN1
CN2
Segments
Compute Nodes
CN1
1 21 1
Load Balanced
Assignment.
Number of
Replicas: 5
Static Problem: Assumptions (Relaxed later)
1. Blocks are of uniform size.
• All queries take same time to execute on segments
2. The number of blocks and bins are static.
• All queries and segments have arrived.
3. Bins are of uniform capacity.
• Compute nodes are homogeneous in computation power.
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 13
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 14
CN1 CN2 CN3
Capacity: 4
𝟔 + 𝟑 + 𝟐 + 𝟏
𝟑
S1 S2 S3 S4
6 3 2 1
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 15
CN1 CN2 CN3
S1 S2 S3 S4
6 3 2 1
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Capacity: 4
Compute Node Selection Policy
• Bin Packing heuristics like First Fit, Largest Fit and Best Fit
applicable here
• In our running example, we use First Fit
• We choose the next CN (lowest CN id) that is not yet full.
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 16
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 17
CN1 CN2 CN3
S1 S2 S3 S4
6 3 2 1
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Capacity: 4
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 18
CN1 CN2 CN3
S2 S1 S3 S4
3 2 2 1
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Capacity: 4
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 19
CN1 CN2 CN3
S2 S1 S3 S4
3 2 2 1
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Continue till the queue is
empty
Capacity: 4
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 20
CN1 CN2 CN3
S1 S3 S4 S2
2 2 1 0
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Continue till the queue is
empty
Capacity: 4
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 21
CN1 CN2 CN3
S3 S1 S4 S2
2 1 1 0
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Continue till the queue is
empty
Capacity: 4
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 22
CN1 CN2 CN3
S1 S4 S2 S3
1 1 0 0
S1 S4 S2 S3
1 1 0 0
S1 S4 S2 S3
1 1 0 0
S1 S4 S2 S3
1 1 0 0
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Continue till the queue is
empty
Capacity: 4
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 23
CN1 CN2 CN3
S1 S2 S3 S4
0 0 0 0
Load Balanced
but not optimal
in replica count
1 2 3
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Continue till the queue is
empty
Capacity: 4
Segment Placement Algorithm
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 24
CN1 CN2 CN3
S1 S2 S3 S4
0 0 0 0
Assign bin capacity as total
number of query segment
pairs by number of compute
nodes
Create a priority queue on
segments by access count
Pick from the head, assign to
compute node based on a
policy and return the rest
Continue till the queue is
empty
Best Fit provably
achieves this
minimal replica
count
1 2 2
Capacity: 4
Agenda
• Motivation
• Problem Formulation
• System Design
• Evaluation
• Takeaway
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 25
Algorithm Execution
• Executes in periodic rounds
• Small interval length
• Tracks total access time
• Total time all queries spend accessing
a particular segment
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 26
CN1
Access Logger
S1 (5 AM – 6 AM)
S2 (6 AM – 7 AM)
Coordinator
Best Fit Algorithm
Segment
Access
(CN1):
S1: 1, S2: 2
Queries
Segment Management
• Aggressively removes replicas of
unpopular segments except one
• One replica avoids network fetch on
access
• Single replica garbage collected
under resource constraint
• Bootstrap loads new segments
• Avoids segment loads in fast path
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 27
Deep
Storage
CN1
Access Logger
S1 (5 AM – 6 AM)
S2 (6 AM – 7 AM)
Coordinator
Best Fit Algorithm
CN1:
Insert S3
Remove S1
Bootstrap Loader
Garbage
Collection
Fetch S3
Cluster Heterogeneity
• Production clusters are tiered – hot, warm and cold data
• Unexpected slowdowns (stragglers)
• Bins are assigned unequal capacities in the algorithm
• Auto-Tiering: Match popular segments to powerful nodes.
• Done manually today by sys-admin.
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 28
Other Details …
• Query Routing
• Segment Load Balancer
• Minimizing Network Transfer
• Fault Tolerance
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 29
Agenda
• Motivation
• Problem Formulation
• System Design
• Evaluation
• Takeaway
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 30
Memory Latency Tradeoff
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 31
• Getafix uses ~50% less
memory compared to
Scarlett
• Small impact on
average latency
Getafix
Memory Savings -- Scalability
• Baseline: Scarlett[2]
• Total memory savings
increases with client
load
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 32
Dollar Cost Savings
Memory Cost Model
Working Set Data Size x Replication Factor x Cost of Memory
Cost Saving Compared to Scarlett
= 100 TB x (4.2 – 1.9) x $0.005 per GB per hour (EC2)
= $1150 per hour
= $10 million annually
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 33
Tiering Accuracy
Getafix achieves 75%
tiering accuracy
compared to 42% for
Getafix-B
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 34
`
16
8
4
`
16
8
4
Getafix-B Getafix
High Moderate Low
Rounds
Query Processing
Volume (CPU Time)
Takeaway
• Data management in interactive analytics engine –
an important problem
• Best fit based segment replication algorithm
minimizes replica count under assumptions
• Our system, Getafix built on top of Druid relaxes
these assumptions
• Getafix out-performs known baselines by reducing
memory and cost
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 35
Evaluation Summary
Getafix minimizes memory and cost
with little impact on performance
Getafix adapts well to heterogeneity
in clusters by effectively auto-tiering
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 36
Memory Savings -- Scalability
• Baseline: Scarlett[2]
• Total memory savings
increases with client
load
• 1.72x – 1.92x reduction
in maximum memory
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 37
[2] Ganesh Ananthanarayanan, et.al. Scarlett: Coping with Skewed Content
Popularity in Mapreduce Clusters. (EuroSys ’11).
Heterogeneity Improvements
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 38
• Baseline: (Getafix-B)
Getafix w/o
Heterogeneity
• Improves on all
metrics
• Total Memory Savings
improves 20 – 30%
Heterogeneity Improvements
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 39
• Baseline: Vanilla Getafix
(Getafix-B)
• Improves on all metrics
• Total Memory Savings
improves 20 – 30%
• ~20% improvement in
tail latency observed
Evaluation Goals
Memory and Cost Savings
Effectiveness of Auto-Tiering
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 40
Getafix: Goals
• Minimize memory required
• Minimize cost of cluster provisioning
• Minimize makespan
• Minimal impact on average query latency
6/11/2018
Popular is Cheaper: Curtailing Memory Costs in Interactive
Analytics Engines
45
Memory Latency Tradeoff
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 46
• Baseline: Uniform and
Scarlett
• 2.15X worse query
performance for
Uniform
Memory Latency Tradeoff
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 47
• Baseline: Uniform and
Scarlett
• 2.15X worse query
performance for
Uniform
• Getafix uses least
memory
Evaluation Goals
Getafix minimizes memory and cost
with little impact on performance
Support for Heterogeneity
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 48
Implementation
• Tracks total access time
• Total time all queries spend accessing a
particular segment
• Executes the algorithm in periodic rounds
• Small interval length to catch popularity change
early
• Segment Popularity at round K+1 for
segment sj estimated as:
Popularity(sj, K + 1)
=
1
2
∗ Popularity(sj, K) + AccessTime(sj, K + 1)
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 49
Segment Management
• Getafix follows the segment placement plan
• Segments are bootstrap loaded
• Avoids segment loads in fast path
• Aggressively removes replicas of unpopular segments except one
• One replica avoids network fetch on access
• Single replica garbage collected under resource constraint
6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 50

More Related Content

What's hot

Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018Streamlio
 
AI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceAI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceDatabricks
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Henry Saputra
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
 
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...Flink Forward
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingApache Apex
 
Five Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig KerstiensFive Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig KerstiensCitus Data
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Present & Future of Greenplum Database A massively parallel Postgres Database...
Present & Future of Greenplum Database A massively parallel Postgres Database...Present & Future of Greenplum Database A massively parallel Postgres Database...
Present & Future of Greenplum Database A massively parallel Postgres Database...VMware Tanzu
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...
apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...
apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...apidays
 
Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...
Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...
Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...VMware Tanzu
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
vJUG - Introduction to data streaming
vJUG - Introduction to data streamingvJUG - Introduction to data streaming
vJUG - Introduction to data streamingNicolas Fränkel
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of dataconfluent
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkTugdual Grall
 

What's hot (20)

Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018
 
AI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceAI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer Experience
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
 
Deep Learning at Scale
Deep Learning at ScaleDeep Learning at Scale
Deep Learning at Scale
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
 
Five Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig KerstiensFive Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Present & Future of Greenplum Database A massively parallel Postgres Database...
Present & Future of Greenplum Database A massively parallel Postgres Database...Present & Future of Greenplum Database A massively parallel Postgres Database...
Present & Future of Greenplum Database A massively parallel Postgres Database...
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...
apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...
apidays LIVE Singapore 2021 - REST the Events - REST APIs for Event-Driven Ar...
 
Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...
Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...
Pivotal Greenplum: Postgres-Based. Multi-Cloud. Built for Analytics & AI - Gr...
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
vJUG - Introduction to data streaming
vJUG - Introduction to data streamingvJUG - Introduction to data streaming
vJUG - Introduction to data streaming
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache Flink
 

Similar to Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines

Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfAlbert Wong
 
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfBuild User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfAlbert Wong
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Databricks
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSChristoforos Kachris
 
Deep Learning on Everyday Devices
Deep Learning on Everyday DevicesDeep Learning on Everyday Devices
Deep Learning on Everyday DevicesBrodmann17
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkDataWorks Summit
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliData Driven Innovation
 
Recurrent Neural Networks for Recommendations and Personalization with Nick P...
Recurrent Neural Networks for Recommendations and Personalization with Nick P...Recurrent Neural Networks for Recommendations and Personalization with Nick P...
Recurrent Neural Networks for Recommendations and Personalization with Nick P...Databricks
 
Ibm pure data system for analytics n200x
Ibm pure data system for analytics n200xIbm pure data system for analytics n200x
Ibm pure data system for analytics n200xIBM Sverige
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014Dylan Tong
 
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/HardScaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/HardPaul Brebner
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko GlobalLogic Ukraine
 
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics EcosystemXDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics EcosystemDan Eaton
 

Similar to Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines (20)

Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdf
 
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfBuild User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWS
 
Deep Learning on Everyday Devices
Deep Learning on Everyday DevicesDeep Learning on Everyday Devices
Deep Learning on Everyday Devices
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache Flink
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
 
Recurrent Neural Networks for Recommendations and Personalization with Nick P...
Recurrent Neural Networks for Recommendations and Personalization with Nick P...Recurrent Neural Networks for Recommendations and Personalization with Nick P...
Recurrent Neural Networks for Recommendations and Personalization with Nick P...
 
Ibm pure data system for analytics n200x
Ibm pure data system for analytics n200xIbm pure data system for analytics n200x
Ibm pure data system for analytics n200x
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
 
The DEBS Grand Challenge 2017
The DEBS Grand Challenge 2017The DEBS Grand Challenge 2017
The DEBS Grand Challenge 2017
 
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/HardScaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/Hard
 
Dst
DstDst
Dst
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics EcosystemXDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
 

Recently uploaded

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 

Recently uploaded (20)

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 

Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines

  • 1. Getafix: Using Popularity to Curtail Memory Costs in Interactive Analytics Engines Distributed Protocols Research Group http://dprg.cs.uiuc.edu/ Mainak Ghosh*, Ashwini Raina*, Le Xu*, Xiaoyao Qian*, Indranil Gupta*, Himanshu Gupta# *University of Illinois at Urbana Champaign, #Oath Inc.
  • 2. Real-Time Analytics 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 2 Distributed OLAP Streaming Analytics Worth 13.7 billion dollars by 2021[1] Interactive Analytics Engine [1] Research and Markets. 2015. Streaming Analytics Market by Verticals – Worldwide Market Forecast & Analysis (2015 - 2020). Report. (June 2015).
  • 3. Interactive Analytics Engines 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 3 Data Ingestion Deep Storage Batch Data Real-time Data Query Compute Router Compute Nodes Prefetched, Cached Replicated Handles petabytes of data and millions of queries per day Requires large amount of memorySegments Less replication => less memory
  • 4. Memory is not cheap … 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 4 Managing data is important in these systems
  • 5. Memory Performance Tradeoff • Uniform Replication Strategy • Difficult to determine the knee offline • Can we do better? 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 5 Observed Knee
  • 6. Skewed Access Pattern 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 6 Top 1% segments 43% of query accesses
  • 7. Reducing Memory • Selective Replication • Scarlett [Ananthanarayanan’11] • Data Compression • Blowfish [Khandelwal’16] • EC-Cache[Rashmi’16] 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 7
  • 8. Getafix: Goals • Minimize memory required • Reduce cost of cluster provisioning • Minimize Makespan • Small impact on average query latency 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 8 Goal
  • 9. Agenda • Motivation • Problem Formulation • System Design • Evaluation • Takeaway 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 9
  • 10. Colored Blocks and Bins … 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines CN3 CN4 Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average) Goal is to provide a load balanced assignment with the least amount of replication S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) S3 (7 AM – 8 AM) 10 CN1 CN2 Segments Compute Nodes CN1
  • 11. An arbitrary mapping ... 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines CN3 CN4 Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average) S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) S3 (7 AM – 8 AM) 11 CN1 CN2 Segments Compute Nodes CN1 2 22 1 Load Balanced Assignment. Number of Replicas: 7
  • 12. Solution which minimizes number of replicas … 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines CN3 CN4 Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average) S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) S3 (7 AM – 8 AM) 12 CN1 CN2 Segments Compute Nodes CN1 1 21 1 Load Balanced Assignment. Number of Replicas: 5
  • 13. Static Problem: Assumptions (Relaxed later) 1. Blocks are of uniform size. • All queries take same time to execute on segments 2. The number of blocks and bins are static. • All queries and segments have arrived. 3. Bins are of uniform capacity. • Compute nodes are homogeneous in computation power. 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 13
  • 14. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 14 CN1 CN2 CN3 Capacity: 4 𝟔 + 𝟑 + 𝟐 + 𝟏 𝟑 S1 S2 S3 S4 6 3 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count
  • 15. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 15 CN1 CN2 CN3 S1 S2 S3 S4 6 3 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Capacity: 4
  • 16. Compute Node Selection Policy • Bin Packing heuristics like First Fit, Largest Fit and Best Fit applicable here • In our running example, we use First Fit • We choose the next CN (lowest CN id) that is not yet full. 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 16
  • 17. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 17 CN1 CN2 CN3 S1 S2 S3 S4 6 3 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Capacity: 4
  • 18. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 18 CN1 CN2 CN3 S2 S1 S3 S4 3 2 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Capacity: 4
  • 19. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 19 CN1 CN2 CN3 S2 S1 S3 S4 3 2 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  • 20. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 20 CN1 CN2 CN3 S1 S3 S4 S2 2 2 1 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  • 21. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 21 CN1 CN2 CN3 S3 S1 S4 S2 2 1 1 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  • 22. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 22 CN1 CN2 CN3 S1 S4 S2 S3 1 1 0 0 S1 S4 S2 S3 1 1 0 0 S1 S4 S2 S3 1 1 0 0 S1 S4 S2 S3 1 1 0 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  • 23. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 23 CN1 CN2 CN3 S1 S2 S3 S4 0 0 0 0 Load Balanced but not optimal in replica count 1 2 3 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  • 24. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 24 CN1 CN2 CN3 S1 S2 S3 S4 0 0 0 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Best Fit provably achieves this minimal replica count 1 2 2 Capacity: 4
  • 25. Agenda • Motivation • Problem Formulation • System Design • Evaluation • Takeaway 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 25
  • 26. Algorithm Execution • Executes in periodic rounds • Small interval length • Tracks total access time • Total time all queries spend accessing a particular segment 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 26 CN1 Access Logger S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) Coordinator Best Fit Algorithm Segment Access (CN1): S1: 1, S2: 2 Queries
  • 27. Segment Management • Aggressively removes replicas of unpopular segments except one • One replica avoids network fetch on access • Single replica garbage collected under resource constraint • Bootstrap loads new segments • Avoids segment loads in fast path 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 27 Deep Storage CN1 Access Logger S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) Coordinator Best Fit Algorithm CN1: Insert S3 Remove S1 Bootstrap Loader Garbage Collection Fetch S3
  • 28. Cluster Heterogeneity • Production clusters are tiered – hot, warm and cold data • Unexpected slowdowns (stragglers) • Bins are assigned unequal capacities in the algorithm • Auto-Tiering: Match popular segments to powerful nodes. • Done manually today by sys-admin. 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 28
  • 29. Other Details … • Query Routing • Segment Load Balancer • Minimizing Network Transfer • Fault Tolerance 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 29
  • 30. Agenda • Motivation • Problem Formulation • System Design • Evaluation • Takeaway 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 30
  • 31. Memory Latency Tradeoff 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 31 • Getafix uses ~50% less memory compared to Scarlett • Small impact on average latency Getafix
  • 32. Memory Savings -- Scalability • Baseline: Scarlett[2] • Total memory savings increases with client load 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 32
  • 33. Dollar Cost Savings Memory Cost Model Working Set Data Size x Replication Factor x Cost of Memory Cost Saving Compared to Scarlett = 100 TB x (4.2 – 1.9) x $0.005 per GB per hour (EC2) = $1150 per hour = $10 million annually 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 33
  • 34. Tiering Accuracy Getafix achieves 75% tiering accuracy compared to 42% for Getafix-B 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 34 ` 16 8 4 ` 16 8 4 Getafix-B Getafix High Moderate Low Rounds Query Processing Volume (CPU Time)
  • 35. Takeaway • Data management in interactive analytics engine – an important problem • Best fit based segment replication algorithm minimizes replica count under assumptions • Our system, Getafix built on top of Druid relaxes these assumptions • Getafix out-performs known baselines by reducing memory and cost 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 35
  • 36. Evaluation Summary Getafix minimizes memory and cost with little impact on performance Getafix adapts well to heterogeneity in clusters by effectively auto-tiering 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 36
  • 37. Memory Savings -- Scalability • Baseline: Scarlett[2] • Total memory savings increases with client load • 1.72x – 1.92x reduction in maximum memory 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 37 [2] Ganesh Ananthanarayanan, et.al. Scarlett: Coping with Skewed Content Popularity in Mapreduce Clusters. (EuroSys ’11).
  • 38. Heterogeneity Improvements 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 38 • Baseline: (Getafix-B) Getafix w/o Heterogeneity • Improves on all metrics • Total Memory Savings improves 20 – 30%
  • 39. Heterogeneity Improvements 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 39 • Baseline: Vanilla Getafix (Getafix-B) • Improves on all metrics • Total Memory Savings improves 20 – 30% • ~20% improvement in tail latency observed
  • 40. Evaluation Goals Memory and Cost Savings Effectiveness of Auto-Tiering 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 40
  • 41. Getafix: Goals • Minimize memory required • Minimize cost of cluster provisioning • Minimize makespan • Minimal impact on average query latency 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 45
  • 42. Memory Latency Tradeoff 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 46 • Baseline: Uniform and Scarlett • 2.15X worse query performance for Uniform
  • 43. Memory Latency Tradeoff 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 47 • Baseline: Uniform and Scarlett • 2.15X worse query performance for Uniform • Getafix uses least memory
  • 44. Evaluation Goals Getafix minimizes memory and cost with little impact on performance Support for Heterogeneity 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 48
  • 45. Implementation • Tracks total access time • Total time all queries spend accessing a particular segment • Executes the algorithm in periodic rounds • Small interval length to catch popularity change early • Segment Popularity at round K+1 for segment sj estimated as: Popularity(sj, K + 1) = 1 2 ∗ Popularity(sj, K) + AccessTime(sj, K + 1) 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 49
  • 46. Segment Management • Getafix follows the segment placement plan • Segments are bootstrap loaded • Avoids segment loads in fast path • Aggressively removes replicas of unpopular segments except one • One replica avoids network fetch on access • Single replica garbage collected under resource constraint 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 50

Editor's Notes

  1. Thanks <session chair> for the introduction. Today we will talk about our system Getafix which minimizes memory required in interactive analytics engines
  2. Before diving into the solution, we will briefly describe the space and the motivation. The real time analytics space can be divided into two verticals – streaming analytics and distributed OLAP. Systems like Samza, Heron fall into the streaming space while Apache Kylin, SQL Server are some examples of OLAP systems. In the intersection lies interactive analytics engine. Some popular systems in this space are Druid, Presto. This area is expected to grow to 13.7 billion dollars by 2021
  3. I will introduce the different modules in interactive analytics engine next. A typical use-case is ad clicks pipeline. Historical ad data and real time ad click events are collected, batched by time and stored in a deep storage like Amazon S3. The individual blocks are called segments and they are immutable. Queries are computed by a separate set of machines, typically VMs from Amazon EC2. They have powerful processors and large amount of memory. To reduce query response time, data is prefetched from the backend data store and cached in memory. Further segments are also replicated to better load balance query load to popular segments. We will primarily focus on replication in this talk as that is critical to reducing memory. A typical production deployment can handle petabytes of data and millions of queries per day. This means the front end VMs require hundreds of gigabytes in memory capacity.
  4. We all know memory is a precious resource. The cost of memory has seen a steady increase in the last 2 years because of high demand. This has been observed in multiple online marketplaces and holds true for multiple products. This makes managing data in memory intensive systems like interactive analytics engines an extremely important problem to curtail costs.
  5. Reducing memory in these systems will impact performance. We try to understand this tradeoff next. We start with a simple replication strategy where all segments are replicated uniformly. We ran experiments on a fixed cluster size of 20 frontend nodes with replication factors 4, 7 and 10. We measure total memory used on the x-axis and average query latency observed on the y-axis. The curve has a knee when all segments are replicated 7 times. We do not observe a steady improvement in performance because over-replication often leads to query hotspots because of data colocation. We have two observations here: Finding the knee of this curve depends on cluster size and query rate and cannot be intuitively determined offline. Second, is this the best that we can do?
  6. Actually, we can do better by exploiting the fact that segment access patterns are skewed. This is a CDF over the segment accesses collected from a production trace in Yahoo!. We observe that top 1% of segments get 43% of the query accesses. Also they see an order of magnitude more access than the bottom 40%.
  7. These access patterns have also been observed in batch processing systems and they have used selective replication schemes to significantly reduce memory. Scarlett is a system in this space which replicates blocks in HDFS files based on the number of concurrent accesses it sees. We implemented a version of this algorithm and ran an experiment. We observed a 2.15x reduction in query performance with 1.75x less memory. The improvement in query performance comes from the fact that popular segments now have more replicas to divide the query load. Other than selective replication, compressing unpopular data has also been explored as a technique to reduce memory. We consider this to be orthogonal to our work and focus on finding better selective replication schemes in the rest of the talk.
  8. Concretely, the goals of our system Getafix is to occupy this portion of the tradeoff space. To break it down even further, we want to minimize memory required with small impact on average query latency. We want to minimize makespan. Lastly, by minimizing memory, we want to reduce cost of cluster provisioning.
  9. In the rest of the talk, I will first describe the algorithm that powers Getafix. Next, I will briefly talk about some of the design components. Finally, I will show how Getafix does on the goals.
  10. A typical query in an interactive analytics engine can touch multiple segments. In this example we show 4 queries. Query 1 computes on segment S1 to S3 and they can be executed in parallel as long as the segments are present in 3 different compute nodes. We consider this query segment pair as colored blocks. They represent a unit of computation. Their size stands for the time it will take to run the query on that segment. Here, all the blocks have the same size or they take unit time. The color stands for the segment it is being executed on, blue stands for segment S1 and so on. Compute nodes are bins. You create a replica of a segment when you have two compute nodes with a ball of the same color. Remember our goal is to minimize the memory required while minimizing Makespan. This means we want to load balance the blocks with the least amount of replication.
  11. This is an arbitrary load balanced solution. Each compute nodes has this many number of replicas. Total number of replicas required in this mapping is 7.
  12. The mapping which minimizes the number of replicas is this one. Here the number of replicas is 5.
  13. To make the problem tractable we make the following 3 assumptions. We relax all of these assumptions while designing the system. We expect that blocks are of uniform size. We solve for the static version of the problem first where all blocks and bins are known upfront. Finally, bins have uniform capacity.
  14. Next we will describe our algorithm. We use a different example for this. The table here lists the segments in the system and the number represents popularity which is the number of queries accessing it. We have 3 compute nodes. A balanced allocation means each bin will have a capacity of 4. Note that this ensures that we minimize Makespan. We assign this capacity to each bin. Next, we create a priority queue with segments ordered by descending order of popularity.
  15. The algorithm runs in iterations. In each iteration, it picks the most popular segment and assigns it to a compute node based on some pre-defined policy. If the node does not have sufficient capacity to fit all the blocks, then the rest is returned to the queue.
  16. We use bin packing heuristics for compute node selection. In this example, we are going to use first fit, i.e, we will choose the compute node with the lowest id that is empty.
  17. We pick the blue blocks, fit 4 of the queries and return the other 2. Note its position changes in the priority queue
  18. We continue iterating till the queue is empty
  19. Next we select the orange blocks. They all fit in CN2
  20. We pick blue again, fit 1 of the balss
  21. Next we fit the yellow blocks
  22. Finally the last two available slots get filled by the left blue and green blocks. The total number of replicas in this assignment is 6. If you look closely, this is not an assignment with the least replication. You can reduce replication by 1 if you swap the blue ball in CN2 with the green ball in CN3
  23. The final allocation looks like this. It can be shown that best fit will provably achieve the minimum replication count in this algorithm. We refer the audience to the paper for the formal proof.
  24. While designing Getafix, we relaxed the assumptions of the last section. We will talk about it next
  25. The segment replication algorithm is run inside a coordinator node which is tasked with running and enforcing data management policies. It executes the algorithm in periodic rounds. We implemented an access logger module in compute nodes to track popularity of each segment and send it to the coordinator. It then runs the algorithm to calculate the segment placement plan. In a round, while the new placement is being planned and executed, the old placement pattern can still be used for ongoing queries, i.e., queries are not stopped. We chose to react to popularity changes to eliminate the overhead of batching in the fast path. Additionally, we use a small interval length between algorithm execution to catch popularity changes early. This ensures we can smoothly handle increased query load on a particular segment. Finally, access logger in Getafix tracks total access time instead of number of queries. In the last section we assumed that all blocks have the same size. In reality, queries can have different execution profile based on semantics, filter parameters, etc. As a result, counting the number of queries is not an accurate representation of popularity.
  26. Based on the generated segment placement plan, the coordinator instructs the compute nodes to load or remove segments for the next round. To load a new segment, compute nodes fetches it from deep storage. To keep memory utilization high, Getafix aggressively removes unpopular segment replicas. All but one replicas can be potentially removed. One replica for each segment is maintained in the frontend cluster to avoid demand fetch on access. Coordinator can instruct a compute node to remove even this single replica under resource constraint. For freshly minted segments, which have not seen queries yet, Getafix bootstrap loads these segments. This is done to avoid demand loading the segment during query execution.
  27. We relax our last assumption about homogeneous compute nodes here. Production clusters are typically tiered. Hot data is stored in VMs having sufficiently large memory to fit the full data size. They also have higher compute capacity. Warm data is stored in VMs where 30-40% data can fit in memory and rest resides on faster disks like SSDs and so on. Additionally, stragglers can be a problem too. To handle this heterogeneity, we assign capacities to the compute nodes based on the amount of query load it has taken in the past. The rest of the algorithm works as is. At the end, we further match the popular segments to more powerful nodes. This is called Auto-Tiering. This represents a better management alternative to the manual auto-tiering done today by sys admins
  28. There are several other design components for which we refer the audience to our paper
  29. Next we evaluate Getafix against some of the goals that we had set out earlier
  30. First, we want to revisit the tradeoff plot and see how Getafix does compared to the goals. We observe a 50% reduction in memory usage with 5% increase in average latency compared to scarlett. Getafix performs better at the tail. You will find the plot in paper.
  31. Next, we want to quantify how the memory saving scales with increase in query load. We fix the number of compute nodes to 20. We vary the query load by adding more clients plotted on the x-axis. On the y-axis, we plot the memory used by Scarlett divided by Getafix. It is a ratio and higher is better. We observe a 1.45 – 2.15X reduction in total memory for Getafix and it increases as the number of clients grow. Scarlett alleviates query hotspots by creating more replicas of popular segments, while Getafix carefully balances replicas of popular and unpopular segments to keep overall replication (and memory usage) low.
  32. Finally, we do some back-of-the-envelope calculations to estimate cost savings. We calculate memory costs as a product of working set data size, effective replication factor and memory cost. We use numbers for the first and the last parameters from data publicly available. We know EC2 couples memory and processing core costs, but for the purposes of back of the envelope calculation we talk about only memory. For effective replication factor, we use our observed numbers from the evaluation. Compared to Scarlett, we estimate a cost saving of 10 millions dollars annually.
  33. In this last plot, we measure how good our auto-tiering implementation is. We ran our experiment on Cluster-1. It has 3 hot, 6 warm and 6 cold tier nodes. We represent them on the x-axis. On the y-axis we plot rounds. We classify each compute node as high, moderate and low based on the query processing volume it sees. Each division is color coded. For vanilla Getafix we observe lot of hot tier nodes getting moderate and low query volume. This is because they do not have popular segments assigned to them. With our auto-tiering optimizations, we observe a much better tiering.
  34. To summarize, today we talked about the data management problem in interactive analytics engine. I hope I have been able to convince you that it is an important problem today. We showed an algorithm that can provably reduce the memory requirements under some simplifying assumptions. We built our system Getafix which relaxes these assumptions. Through multiple evaluations, we show that Getafix can reduce memory in practice with little impact on performance. Lastly, I would like to leave you with this adage: Has rising provisioning costs put you in a fix, you can solve it all by using Getafix  . Thanks you and I am now happy to take questions.
  35. To summarize, Getafix performs well on all the goals that we had set for it. We have more evaluation in the paper and we highly encourage the audience to take a look.
  36. Reduction in maximum memory required in a single compute node is also significant and reduces slightly as the number of client grows. The maximum memory provides an estimate of how much VM memory to provision. Thus, this saving has a direct correlation to cost.
  37. For these experiments, we use 3 different types of VMs with different core count and memory to create a tiered cluster. We ran our experiments with 2 such clusters configurations. Cluster-1 has larger fraction of hot tier i.e more powerful nodes compared to Cluster-2. We compare to vanilla Getafix without the optimizations. Getafix improves on all the metrics we measured for. We saw 20% reduction in memory and it increased from Cluster-1 to Cluster-2
  38. Our optimizations improves observed tail latency by 20%
  39. In today’s talk, we are going to see how much memory and cost saving Getafix can provide. Additionally, we will see how Getafix adapts to heterogeneity in cluster
  40. This brings us to the goals of our system Getafix. Our primary goal is to minimize the amount of memory required. By reducing memory used, we intent to minimize costs. We want to achieve this while minimizing Makespan or maximizing query throughput. We also want to have minimal impact on average query latency
  41. We study the tradeoff between amount of memory and query performance. The goal of the experiment is to see where Getafix lies in the tradeoff space. We compare Getafix to Scarlett and Uniform replication. In uniform replication, you replicate all segments some pre-defined number of times. We ran 4 experiments for Uniform with replication factors 4, 7, 10, 13. On the x-axis we plot the effective replication for the strategies which directly relates to memory used. On the y axis we plot performance as average query latency. Non uniform schemes out-perform by 2.15x because it either over-replicates unpopular segments or under-replicates popular segments leading to hotspots.
  42. Getafix has the least effective replication factor
  43. In the next set of experiments, we evaluate the heterogeneity optimizations for Getafix.
  44. In the last section we assumed that all blocks have the same size. In reality, queries can have different execution profile based on semantics, filter parameters, etc. Counting the number of queries is not accurate representation of popularity. Instead, we look at the total time all queries spent accessing a particular segment. In a production system, queries and segments arrive continuously. We run the algorithm in periodic rounds. We keep the interval small enough to catch changes early. The popularity of a segment is calculated by maintaining a weighted average of popularity values collected in the past round. Latest popularity numbers are given more priority than older ones.
  45. In each round, Getafix follows the segment placement plan generated by the algorithm. Segments replicas are inserted, moved and removed based on the plan. For freshly minted segments, which have not seen queries yet, Getafix bootstrap loads these segments. This is done to avoid demand loading the segment during query execution. To keep memory utilization high, Getafix aggressively removes unpopular segment replicas. All but one replicas can be potentially removed. One replica for each segment is maintained in the frontend cluster to avoid demand fetch on access. This single replica can also be garbage collected under resource constraint