SlideShare una empresa de Scribd logo
1 de 42
Descargar para leer sin conexión
Spark+AI Summit
April 24, 2019
Smart join algorithms
for fighting skew at scale
Andrew Clegg, Applied Machine Learning Group @ Yelp
@andrew_clegg
Data skew, outliers, power laws, and their symptomsOverview
Joining skewed data more efficiently
Extensions and tips
How do joins work in Spark, and why skewed data hurts
Data skew
Classic skewed distributions
DATA SKEW
Image by Rodolfo Hermans, CC BY-SA 3.0, via Wikiversity
Outliers
DATA SKEW
Image from Hedges & Shah, “Comparison of mode estimation methods and application in molecular clock analysis”, CC BY 4.0
Power laws
DATA SKEW
Word rank vs. word frequency in an English text corpus
Electrostatic and gravitational forces (inverse square law)Power laws are
everywhere
(approximately)
‘80/20 rule’ in distribution of income (Pareto principle)
Relationship between body size and metabolism
(Kleiber’s law)
Distribution of earthquake magnitudes
Word frequencies in natural language corpora (Zipf’s law)Power laws are
everywhere
(approximately)
Participation inequality on wikis and forum sites (‘1% rule’)
Popularity of websites and their content
Degree distribution in social networks (‘Bieber problem’)
Why is this a problem?
Hot shards in databases — salt keys, change schemaPopularity hotspots:
symptoms and fixes
Hot mappers in map-only tasks — repartition randomly
Hot reducers during joins and aggregations … ?
Slow load times for certain users — look for O(n2) mistakes
Diagnosing hot executors
POPULARITY HOTSPOTS
!?
Spark joins
under the hood
Shuffled hash join
SPARK JOINS
DataFrame 1
DataFrame 2
Partitions
Shuffle Shuffle
Shuffled hash join with very skewed data
SPARK JOINS
DataFrame 1
DataFrame 2
Partitions
Shuffle Shuffle
Broadcast join can help
SPARK JOINS
DataFrame 2
Broadcast
DataFrame 1
Map
Joined
Broadcast join can help sometimes
SPARK JOINS
DataFrame 2
Broadcast
DataFrame 1
Map
Joined
Must fit in RAM on
each executor
Joining skewed
data faster
Append random int in [0, R) to each key in skewed dataSplitting a single key
across multiple tasks
Append replica ID to original key in non-skewed data
Join on this newly-generated key
Replicate each row in non-skewed data, R times
R = replication factor
Replicated join
FASTER JOINS
DataFrame 1
DataFrame 2_0
Shuffle Shuffle
DataFrame 2_1
DataFrame 2_2
0
0
0
1
2
1
2
1
2
Logical Partitions
Replica IDs
replication_factor = 10
# spark.range creates an 'id' column: 0 <= id < replication_factor
replication_ids = F.broadcast(
spark.range(replication_factor).withColumnRenamed('id', 'replica_id')
)
# Replicate uniform data, one copy of each row per bucket
# composite_key looks like: 12345@3 (original_id@replica_id)
uniform_data_replicated = (
uniform_data
.crossJoin(replication_ids)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
def randint(limit):
return F.least(
F.floor(F.rand() * limit),
F.lit(limit - 1), # just to avoid unlikely edge case
)
# Randomly assign rows in skewed data to buckets
# composite_key has same format as in uniform data
skewed_data_tagged = (
skewed_data
.withColumn(
'composite_key',
F.concat(
'original_id',
F.lit('@'),
randint(replication_factor),
)
)
)
# Join them together on the composite key
joined = skewed_data_tagged.join(
uniform_data_replicated,
on='composite_key',
how='inner',
)
# Join them together on the composite key
joined = skewed_data_tagged.join(
uniform_data_replicated,
on='composite_key',
how='inner',
) WARNING
Inner and left outer
joins only
Remember duplicates…
FASTER JOINS
All different rows
Same row
Same row
Same row
100 million rows of data with uniformly-distributed keysExperiments with
synthetic data
Standard inner join ran for 7+ hours then I killed it!
10x replicated join completed in 1h16m
100 billion rows of data with Zipf-distributed keys
Can we do better?
Very common keys should be replicated many timesDifferential replication
Identify frequent keys before replication
Use different replication policy for those
Rare keys don’t need to be replicated as much (or at all?)
replication_factor_high = 50
replication_high = F.broadcast(
spark
.range(replication_factor_high)
.withColumnRenamed('id', 'replica_id')
)
replication_factor_low = 10
replication_low = … # as above
# Determine which keys are highly over-represented
top_keys = F.broadcast(
skewed_data
.freqItems(['original_id'], 0.0001) # return keys with frequency > this
.select(
F.explode('id_freqItems').alias('id_freqItems')
)
)
uniform_data_top_keys = (
uniform_data
.join(
top_keys,
uniform_data.original_id == top_keys.id_freqItems,
how='inner',
)
.crossJoin(replication_high)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
uniform_data_rest = (
uniform_data
.join(
top_keys,
uniform_data.original_id == top_keys.id_freqItems,
how='leftanti',
)
.crossJoin(replication_low)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
uniform_data_replicated = uniform_data_top_keys.union(uniform_data_rest)
FASTER JOINS
DataFrame 1
DataFrame 2_0
freqItems
DataFrame 2_1
Frequently
occurring keys
Broadcast
DataFrame 2_2
Replicate very frequent keys more often
skewed_data_tagged = (
skewed_data
.join(
top_keys,
skewed_data.id == top_keys.id_freqItems,
how='left',
)
.withColumn(
'replica_id',
F.when(
F.isnull(F.col('id_freqItems')), randint(replication_factor_low),
)
.otherwise(randint(replication_factor_high))
)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
100 million rows of data with uniformly-distributed keysExperiments with
synthetic data
10x replicated join completed in 1h16m
10x/50x differential replication completed in under 1h
100 billion rows of data with Zipf-distributed keys
Other ideas worth mentioning
Identify very common keys in skewed dataPartial broadcasting
Rare keys are joined the traditional way (without replication)
Union the resulting joined DataFrames
Select these rows from uniform data and broadcast join
Get approximate frequency for every key in skewed dataDynamic replication
Intuition: Sliding scale between rarest and most common
Can be hard to make this work in practice!
Replicate uniform data proportional to key frequency
Append random replica_id_that for other side, in [0, Rthat)Double-sided skew
Composite key on left: id, replica_id_that, replica_id_this
Composite key on right: id, replica_id_this, replica_id_that
Replicate each row Rthis times, and append replica_id_this
Summing up
Is the problem just outliers? Can you safely ignore them?Checklist
Look at your data to get an idea of the distribution
Start simple (fixed replication factor) then iterate if necessary
Try broadcast join if possible
Warren & Karau, High Performance Spark (O’Reilly 2017)Credits
Scalding developers: github.com/twitter/scalding
Spark, ML, and distributed systems community @ Yelp
Blog by Maria Restre: datarus.wordpress.com
Thanks!
@andrew_clegg

Más contenido relacionado

La actualidad más candente

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 

La actualidad más candente (20)

Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
SQLd360
SQLd360SQLd360
SQLd360
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Similar a Smart Join Algorithms for Fighting Skew at Scale

CJ 560 Module Seven Short Pa
CJ 560 Module Seven Short PaCJ 560 Module Seven Short Pa
CJ 560 Module Seven Short Pa
VinaOconner450
 

Similar a Smart Join Algorithms for Fighting Skew at Scale (20)

Why Our Code Smells
Why Our Code SmellsWhy Our Code Smells
Why Our Code Smells
 
Advanced SQL Webinar
Advanced SQL WebinarAdvanced SQL Webinar
Advanced SQL Webinar
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful wedding
 
Sql alchemy bpstyle_4
Sql alchemy bpstyle_4Sql alchemy bpstyle_4
Sql alchemy bpstyle_4
 
CJ 560 Module Seven Short Pa
CJ 560 Module Seven Short PaCJ 560 Module Seven Short Pa
CJ 560 Module Seven Short Pa
 
Django - sql alchemy - jquery
Django - sql alchemy - jqueryDjango - sql alchemy - jquery
Django - sql alchemy - jquery
 
Taking Perl to Eleven with Higher-Order Functions
Taking Perl to Eleven with Higher-Order FunctionsTaking Perl to Eleven with Higher-Order Functions
Taking Perl to Eleven with Higher-Order Functions
 
Bc0041
Bc0041Bc0041
Bc0041
 
Php & my sql
Php & my sqlPhp & my sql
Php & my sql
 
Error Reporting in ZF2: form messages, custom error pages, logging
Error Reporting in ZF2: form messages, custom error pages, loggingError Reporting in ZF2: form messages, custom error pages, logging
Error Reporting in ZF2: form messages, custom error pages, logging
 
Code smells (1).pptx
Code smells (1).pptxCode smells (1).pptx
Code smells (1).pptx
 
Python_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdfPython_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdf
 
Python_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdfPython_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdf
 
Drupal7 dbtng
Drupal7  dbtngDrupal7  dbtng
Drupal7 dbtng
 
Java 8 Lambda Expressions
Java 8 Lambda ExpressionsJava 8 Lambda Expressions
Java 8 Lambda Expressions
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
Everything is composable
Everything is composableEverything is composable
Everything is composable
 
Fnt Software Solutions Pvt Ltd Placement Papers - PHP Technology
Fnt Software Solutions Pvt Ltd Placement Papers - PHP TechnologyFnt Software Solutions Pvt Ltd Placement Papers - PHP Technology
Fnt Software Solutions Pvt Ltd Placement Papers - PHP Technology
 
A "M"ind Bending Experience. Power Query for Power BI and Beyond.
A "M"ind Bending Experience. Power Query for Power BI and Beyond.A "M"ind Bending Experience. Power Query for Power BI and Beyond.
A "M"ind Bending Experience. Power Query for Power BI and Beyond.
 
Contracts in-clojure-pete
Contracts in-clojure-peteContracts in-clojure-pete
Contracts in-clojure-pete
 

Más de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Último

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Último (20)

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

Smart Join Algorithms for Fighting Skew at Scale

  • 2. Smart join algorithms for fighting skew at scale Andrew Clegg, Applied Machine Learning Group @ Yelp @andrew_clegg
  • 3. Data skew, outliers, power laws, and their symptomsOverview Joining skewed data more efficiently Extensions and tips How do joins work in Spark, and why skewed data hurts
  • 5. Classic skewed distributions DATA SKEW Image by Rodolfo Hermans, CC BY-SA 3.0, via Wikiversity
  • 6. Outliers DATA SKEW Image from Hedges & Shah, “Comparison of mode estimation methods and application in molecular clock analysis”, CC BY 4.0
  • 7. Power laws DATA SKEW Word rank vs. word frequency in an English text corpus
  • 8. Electrostatic and gravitational forces (inverse square law)Power laws are everywhere (approximately) ‘80/20 rule’ in distribution of income (Pareto principle) Relationship between body size and metabolism (Kleiber’s law) Distribution of earthquake magnitudes
  • 9. Word frequencies in natural language corpora (Zipf’s law)Power laws are everywhere (approximately) Participation inequality on wikis and forum sites (‘1% rule’) Popularity of websites and their content Degree distribution in social networks (‘Bieber problem’)
  • 10. Why is this a problem?
  • 11. Hot shards in databases — salt keys, change schemaPopularity hotspots: symptoms and fixes Hot mappers in map-only tasks — repartition randomly Hot reducers during joins and aggregations … ? Slow load times for certain users — look for O(n2) mistakes
  • 14. Shuffled hash join SPARK JOINS DataFrame 1 DataFrame 2 Partitions Shuffle Shuffle
  • 15. Shuffled hash join with very skewed data SPARK JOINS DataFrame 1 DataFrame 2 Partitions Shuffle Shuffle
  • 16. Broadcast join can help SPARK JOINS DataFrame 2 Broadcast DataFrame 1 Map Joined
  • 17. Broadcast join can help sometimes SPARK JOINS DataFrame 2 Broadcast DataFrame 1 Map Joined Must fit in RAM on each executor
  • 19. Append random int in [0, R) to each key in skewed dataSplitting a single key across multiple tasks Append replica ID to original key in non-skewed data Join on this newly-generated key Replicate each row in non-skewed data, R times R = replication factor
  • 20. Replicated join FASTER JOINS DataFrame 1 DataFrame 2_0 Shuffle Shuffle DataFrame 2_1 DataFrame 2_2 0 0 0 1 2 1 2 1 2 Logical Partitions Replica IDs
  • 21. replication_factor = 10 # spark.range creates an 'id' column: 0 <= id < replication_factor replication_ids = F.broadcast( spark.range(replication_factor).withColumnRenamed('id', 'replica_id') ) # Replicate uniform data, one copy of each row per bucket # composite_key looks like: 12345@3 (original_id@replica_id) uniform_data_replicated = ( uniform_data .crossJoin(replication_ids) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) )
  • 22. def randint(limit): return F.least( F.floor(F.rand() * limit), F.lit(limit - 1), # just to avoid unlikely edge case ) # Randomly assign rows in skewed data to buckets # composite_key has same format as in uniform data skewed_data_tagged = ( skewed_data .withColumn( 'composite_key', F.concat( 'original_id', F.lit('@'), randint(replication_factor), ) ) )
  • 23. # Join them together on the composite key joined = skewed_data_tagged.join( uniform_data_replicated, on='composite_key', how='inner', )
  • 24. # Join them together on the composite key joined = skewed_data_tagged.join( uniform_data_replicated, on='composite_key', how='inner', ) WARNING Inner and left outer joins only
  • 25. Remember duplicates… FASTER JOINS All different rows Same row Same row Same row
  • 26. 100 million rows of data with uniformly-distributed keysExperiments with synthetic data Standard inner join ran for 7+ hours then I killed it! 10x replicated join completed in 1h16m 100 billion rows of data with Zipf-distributed keys
  • 27. Can we do better?
  • 28. Very common keys should be replicated many timesDifferential replication Identify frequent keys before replication Use different replication policy for those Rare keys don’t need to be replicated as much (or at all?)
  • 29. replication_factor_high = 50 replication_high = F.broadcast( spark .range(replication_factor_high) .withColumnRenamed('id', 'replica_id') ) replication_factor_low = 10 replication_low = … # as above # Determine which keys are highly over-represented top_keys = F.broadcast( skewed_data .freqItems(['original_id'], 0.0001) # return keys with frequency > this .select( F.explode('id_freqItems').alias('id_freqItems') ) )
  • 30. uniform_data_top_keys = ( uniform_data .join( top_keys, uniform_data.original_id == top_keys.id_freqItems, how='inner', ) .crossJoin(replication_high) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) )
  • 31. uniform_data_rest = ( uniform_data .join( top_keys, uniform_data.original_id == top_keys.id_freqItems, how='leftanti', ) .crossJoin(replication_low) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) ) uniform_data_replicated = uniform_data_top_keys.union(uniform_data_rest)
  • 32. FASTER JOINS DataFrame 1 DataFrame 2_0 freqItems DataFrame 2_1 Frequently occurring keys Broadcast DataFrame 2_2 Replicate very frequent keys more often
  • 33. skewed_data_tagged = ( skewed_data .join( top_keys, skewed_data.id == top_keys.id_freqItems, how='left', ) .withColumn( 'replica_id', F.when( F.isnull(F.col('id_freqItems')), randint(replication_factor_low), ) .otherwise(randint(replication_factor_high)) ) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) )
  • 34. 100 million rows of data with uniformly-distributed keysExperiments with synthetic data 10x replicated join completed in 1h16m 10x/50x differential replication completed in under 1h 100 billion rows of data with Zipf-distributed keys
  • 35. Other ideas worth mentioning
  • 36. Identify very common keys in skewed dataPartial broadcasting Rare keys are joined the traditional way (without replication) Union the resulting joined DataFrames Select these rows from uniform data and broadcast join
  • 37. Get approximate frequency for every key in skewed dataDynamic replication Intuition: Sliding scale between rarest and most common Can be hard to make this work in practice! Replicate uniform data proportional to key frequency
  • 38. Append random replica_id_that for other side, in [0, Rthat)Double-sided skew Composite key on left: id, replica_id_that, replica_id_this Composite key on right: id, replica_id_this, replica_id_that Replicate each row Rthis times, and append replica_id_this
  • 40. Is the problem just outliers? Can you safely ignore them?Checklist Look at your data to get an idea of the distribution Start simple (fixed replication factor) then iterate if necessary Try broadcast join if possible
  • 41. Warren & Karau, High Performance Spark (O’Reilly 2017)Credits Scalding developers: github.com/twitter/scalding Spark, ML, and distributed systems community @ Yelp Blog by Maria Restre: datarus.wordpress.com