How Adobe Does 2 Million Records Per Second Using Apache Spark!
Yeshwanth Vijayakumar
Project Lead/ Architect – Adobe Experience Platform
Goals
Share tips/experiences from our use cases
Hopefully saves at least an hour of your time! :)
What do you mean by Processing? Agenda!
Ingestion
▪ Structured Streaming - Know thy Lag
▪ Microbatch the right way
Evaluation
▪ The Art of How I learned to cache my physical Plans
▪ Know Thy Join
▪ Skew Phew! And Sample Sample Sample
Redis – The Ultimate Swiss Army Knife!
Ingestion Scenario
Unified Profile Data Ingestion
Unified Profile
Experience Data Model
Adobe Campaign
AEM
Adobe Analytics
Adobe AdCloud
Structured Streaming - Know thy Lag
What/How to measure?
Having a streaming-based ingestion mechanism makes lag that much harder to track.
Is that enough?
Reference: https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37
Use Burrow to keep track of the Lag
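As a rough illustration of the number a lag monitor reports, per-partition consumer lag is the distance between the latest offset in the topic and the last offset the job has processed. The sketch below is plain Python with hypothetical offset dictionaries; it does not call Kafka or Burrow.

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag = latest offset in the topic - last processed offset."""
    lag = {}
    for partition, latest in latest_offsets.items():
        # A partition with no committed offset yet is treated as fully lagging.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(latest - committed, 0)
    return lag


# Hypothetical offsets for a two-partition topic.
latest = {0: 1_500_000, 1: 1_480_000}
committed = {0: 1_499_000, 1: 1_200_000}

lag = consumer_lag(latest, committed)
total_lag = sum(lag.values())
```

A steadily growing total suggests the job is falling behind, while a large value on a single partition points at skew rather than overall capacity.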
Structured Streaming
Optimizing The Ingestion
Generic Flow
[Diagram: Kafka Topic (Partition 1, Partition 2) read by Executor 1, Executor 2, and Executor 3, each running the Business Logic]
Read In
What can we optimize way upstream?
maxOffsetsPerTrigger
Determine what QPS you want to hit
Observe your QPS
minPartitions
▪ Enables a Fan-Out processing pattern
▪ Maps 1 Kafka Partition to multiple sub
Executor Resources
Keep this constant
Rinse and repeat till you have Throughput per Core
Make sure processingTime <= triggerInterval
If it's << triggerInterval, you have headroom to grow in QPS
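The tuning loop on this slide comes down to a few pieces of arithmetic. The sketch below shows one way to write them down; the function names, the 10-second trigger interval, the 5-second processing time, and the 400 cores are all illustrative assumptions, not figures from the talk.

```python
def max_offsets_per_trigger(target_qps, trigger_interval_s):
    """Records to admit per micro-batch so that the target QPS is reachable."""
    return int(target_qps * trigger_interval_s)


def throughput_per_core(records_in_batch, processing_time_s, total_cores):
    """Records per second handled by each core during one micro-batch."""
    return records_in_batch / processing_time_s / total_cores


def headroom(processing_time_s, trigger_interval_s):
    """Fraction of the trigger interval left idle; negative means falling behind."""
    return 1.0 - processing_time_s / trigger_interval_s


# Hypothetical run: 2M records/s target, 10 s trigger, batch done in 5 s on 400 cores.
offsets = max_offsets_per_trigger(2_000_000, 10)
per_core = throughput_per_core(offsets, 5.0, 400)
spare = headroom(5.0, 10.0)
```

With executor resources held constant, repeating the measurement at different `maxOffsetsPerTrigger` values gives a throughput-per-core figure that can then be used to size the cluster.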
Flow with minPartitions > partitions on Kafka
[Diagram: Kafka Topic (Partition 1, Partition 2) fanned out into sub-partitions 1.1, 1.2, 1.3, 2.1, 2.2, 2.3 across Executor 1, Executor 2, and Executor 3]
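The fan-out can be pictured as cutting each Kafka partition's offset range for a micro-batch into several contiguous sub-ranges, one per Spark task. The sketch below splits every partition evenly, which is a simplifying assumption; Spark's own splitting logic may weight partitions by how much data each holds. All names here are hypothetical.

```python
def split_offset_range(start, end, pieces):
    """Split the half-open offset range [start, end) into contiguous sub-ranges."""
    base, extra = divmod(end - start, pieces)
    ranges, cursor = [], start
    for i in range(pieces):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((cursor, cursor + size))
        cursor += size
    return ranges


def fan_out(partition_ranges, min_partitions):
    """Give each Kafka partition an equal share of the requested task count."""
    per_partition = -(-min_partitions // len(partition_ranges))  # ceiling division
    return {
        partition: split_offset_range(start, end, per_partition)
        for partition, (start, end) in partition_ranges.items()
    }


# Two Kafka partitions, six requested Spark partitions -> three sub-ranges each.
tasks = fan_out({1: (0, 900), 2: (100, 1000)}, min_partitions=6)
```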
MicroBatch Hard! Logic Best Practices
map() + foreach()
Pros
Easy to code
Cons
Slow!
No local aggregation, specify explicit combiner
Too many individual tasks
Hard to get Connection Management right
mapPartitions() + foreachBatch()
Pros
Explicit Connection Management
▪ Allows for good batching and re-use
Local Aggregations using HashMaps at partition level
Cons
Needs more upfront memory
▪ OOM till tuning is done
Uglier to visualize
Might need some extra CPU per task
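The trade-off between the two styles can be seen without Spark. The sketch below simulates partitions as Python lists and uses a hypothetical `FakeSink` class in place of a real external store, so the only thing it measures is how many connections and write calls each style makes.

```python
from collections import Counter


class FakeSink:
    """Hypothetical stand-in for an external store; records usage in `stats`."""

    def __init__(self, stats):
        self.stats = stats
        self.stats["connections"] += 1

    def write_batch(self, items):
        self.stats["writes"] += 1

    def close(self):
        pass


def write_per_record(partitions):
    """map() + foreach() style: one connection and one write per record."""
    stats = {"connections": 0, "writes": 0}
    for partition in partitions:
        for key in partition:
            sink = FakeSink(stats)
            sink.write_batch([(key, 1)])
            sink.close()
    return stats


def write_per_partition(partitions, batch_size=2):
    """mapPartitions() style: one connection per partition, local aggregation, batched writes."""
    stats = {"connections": 0, "writes": 0}
    for partition in partitions:
        sink = FakeSink(stats)
        try:
            counts = Counter(partition)  # local aggregation in a hash map
            items = list(counts.items())
            for start in range(0, len(items), batch_size):
                sink.write_batch(items[start:start + batch_size])
        finally:
            sink.close()  # tear the connection down even if a write fails
    return stats


partitions = [["a", "b", "a", "c"], ["b", "b", "d"]]
per_record_stats = write_per_record(partitions)
per_partition_stats = write_per_partition(partitions)
```

The per-partition version holds the whole `Counter` in memory before writing, which is the "needs more upfront memory" cost listed above.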
An Example
From another SAIS2020 talk: Every Day Probabilistic Data Structures For Humans
Speculate Away!
What can we optimize way upstream?
SparkConf | Value | Description
spark.speculation | true | If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
spark.speculation.multiplier | 5 | How many times slower a task is than the median to be considered for speculation.
spark.speculation.quantile | 0.9 | Fraction of tasks which must be complete before speculation is enabled for a particular stage.
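To make the interaction between the multiplier and the quantile concrete, the sketch below applies the rule described in the table to a hypothetical stage. It is a simplified model written for illustration, not Spark's scheduler code, and it ignores any other conditions Spark applies before re-launching a task.

```python
import statistics


def speculation_candidates(finished_durations, running_durations, total_tasks,
                           multiplier=5.0, quantile=0.9):
    """Return running tasks that are slow enough to be considered for speculation."""
    # Speculation only starts once enough of the stage has completed.
    if len(finished_durations) < quantile * total_tasks:
        return []
    threshold = multiplier * statistics.median(finished_durations)
    return [task for task, elapsed in running_durations.items() if elapsed > threshold]


# Hypothetical stage: 20 tasks, 19 finished in 10 s each, one still running.
finished = [10.0] * 19
slow = speculation_candidates(finished, {"task-19": 65.0}, total_tasks=20)
not_slow_enough = speculation_candidates(finished, {"task-19": 30.0}, total_tasks=20)
too_early = speculation_candidates([10.0] * 5, {"task-19": 65.0}, total_tasks=20)
```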
Evaluation Scenario
What are we processing?
Run as many queries as possible in parallel on top of a denormalized dataframe
Queries: Query 1, Query 2, Query 3, Query 1000
ProfileIds | field1 | field1000 | eventsArray
a@a.com | a | x | [e1,2,3]
b@g.com | b | x | [e1]
d@d.com | d | y | [e1,2,3]
z@z.com | z | y | [e1,2,3,5,7]
Interactive Processing!
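The evaluation pattern, many independent queries over one shared denormalized dataset, can be sketched in plain Python. The rows below mirror the shape of the example table, while the query names, predicates, and the thread pool are illustrative assumptions rather than the production setup.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the denormalized dataframe.
rows = [
    {"profileId": "a@a.com", "field1": "a", "field1000": "x", "events": ["e1", "e2", "e3"]},
    {"profileId": "b@g.com", "field1": "b", "field1000": "x", "events": ["e1"]},
    {"profileId": "d@d.com", "field1": "d", "field1000": "y", "events": ["e1", "e2", "e3"]},
    {"profileId": "z@z.com", "field1": "z", "field1000": "y",
     "events": ["e1", "e2", "e3", "e5", "e7"]},
]

# Hypothetical queries, each expressed as a row predicate.
queries = {
    "field1000_is_x": lambda row: row["field1000"] == "x",
    "three_or_more_events": lambda row: len(row["events"]) >= 3,
    "field1_is_d": lambda row: row["field1"] == "d",
}


def run_queries(rows, queries, max_workers=4):
    """Evaluate every query against the same shared rows, in parallel."""
    def count_matches(predicate):
        return sum(1 for row in rows if predicate(row))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(count_matches, pred) for name, pred in queries.items()}
        return {name: future.result() for name, future in futures.items()}


results = run_queries(rows, queries)
```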
The Art of How I learned to Cache My Physical Plans
For Repeated Queries Over Same DF
Prepared Statements in RDBMS
▪ Avoids repeated query Planning by taking in a template
▪ Compile (parse->optimize/translate to plan) ahead of time
Similarly we obtain the internal execution plan for a DF query
Taking inspiration from RDBMS Land
df.cache() This ^
Main Overhead
Dataframe has 1000s of nested columns
Printing the query plan caused an overflow while printing to logs in debug mode
Time for query planning = 2-3 seconds or more
Significant impact while submitting interactive queries when total runtime < 10s
Ref: https://stackoverflow.com/questions/49583401/how-to-avoid-query-preparation-parsing-planning-and-optimizations-every-time
https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/util/StateCache.scala#L4
Cached
ahead of time
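The prepared-statement idea is to pay the planning cost once per query shape and reuse the result for every later call. The sketch below shows only that caching pattern with `functools.lru_cache`; the `plan_equals` function is a hypothetical stand-in for expensive planning and does not touch Spark's planner or physical plans.

```python
from functools import lru_cache

planning_calls = 0


@lru_cache(maxsize=None)
def plan_equals(field):
    """Stand-in for expensive planning; runs once per distinct field."""
    global planning_calls
    planning_calls += 1

    def execute(rows, value):
        return [row for row in rows if row.get(field) == value]

    return execute


rows = [
    {"profileId": "a@a.com", "field1": "a"},
    {"profileId": "b@g.com", "field1": "b"},
]

first = plan_equals("field1")(rows, "a")   # plans, then executes
second = plan_equals("field1")(rows, "b")  # reuses the cached plan
```

With planning taking 2-3 seconds against a total runtime under 10 seconds, skipping it on repeated queries removes a large share of the latency.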
Know thy Join
Join Optimization For Interactive Queries
(Opinionated)
Avoid Them by de-normalizing if possible!
Broadcast the join table if it's small enough!
▪ Can simulate a HashJoin
If too big to broadcast, see if the join info can be replicated into Redis-like KV stores
▪ You still get the characteristics of Hash Join
Once you get into real large data, Shuffles will hate you and vice versa!
Sort-Merge is your friend Until it isn’t
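What a broadcast join buys is a hash lookup in place of a shuffle: the small table becomes an in-memory map that every task probes locally. A minimal sketch of that idea in plain Python, with hypothetical table contents and column names:

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: build a hash map from the small side, probe it with the large side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


events = [
    {"profile": "a@a.com", "event": "e1"},
    {"profile": "b@g.com", "event": "e2"},
    {"profile": "x@x.com", "event": "e3"},  # no match on the small side
]
segments = [
    {"profile": "a@a.com", "segment": "gold"},
    {"profile": "b@g.com", "segment": "silver"},
]

joined = broadcast_hash_join(events, segments, key="profile")
```

Replacing the `lookup` dictionary with reads against a key-value store gives the same probe pattern when the small side no longer fits in memory.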
Skew! Phew!
Skew is Real!
Default Partitioning of the dataframe might not be ideal
▪ Some partitions can have too much data
▪ Processing those can cause OOM/connection failures
Repartition is your friend
Might still not be enough, add some salt!
9997/10000 tasks don't matter. The 3/10000 that fail are what matter
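Salting spreads a single hot key over several partitions by pairing it with a small extra value. The sketch below uses a toy partitioner (a stable CRC32 hash shifted by the salt) and a round-robin salt so the outcome is deterministic; real jobs often use a random salt appended to the key, and none of these names come from Spark.

```python
import zlib
from collections import Counter


def partition_for(key, salt, num_partitions):
    """Toy partitioner: stable hash of the key, shifted by the salt."""
    return (zlib.crc32(key.encode("utf-8")) + salt) % num_partitions


def partition_loads(keys, num_partitions, num_salts=1):
    """Count how many records land on each partition."""
    loads = Counter()
    for i, key in enumerate(keys):
        salt = i % num_salts  # round-robin salt keeps the example deterministic
        loads[partition_for(key, salt, num_partitions)] += 1
    return loads


# 1000 records that all share one hot key.
keys = ["hot"] * 1000
unsalted = partition_loads(keys, num_partitions=8)
salted = partition_loads(keys, num_partitions=8, num_salts=8)
```

After salting, any aggregation per original key needs a second step that merges the partial results from the salted copies.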
How to get the magic targetPartitionCount?
When reading/writing to Parquet on HDFS, many recommendations say to mimic the HDFS block size (default: 128MB)
Sample a small portion of your large DF
▪ df.head might suffice too with a large enough sample
Estimate size of each row and extrapolate
Sample here and sample there!
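The sample-and-extrapolate estimate can be written as a short calculation. In the sketch below the sampled row sizes are made-up numbers and the function name is hypothetical; only the 128MB target and the 13 million row count come from the slides.

```python
import math


def target_partition_count(sample_row_sizes, total_rows,
                           target_partition_bytes=128 * 1024 * 1024):
    """Extrapolate total size from sampled row sizes and divide by the target size."""
    avg_row_bytes = sum(sample_row_sizes) / len(sample_row_sizes)
    estimated_total_bytes = avg_row_bytes * total_rows
    return max(1, math.ceil(estimated_total_bytes / target_partition_bytes))


# Hypothetical sample: two rows measured at 1000 and 3000 bytes, 13 million rows overall.
count = target_partition_count([1000, 3000], total_rows=13_000_000)
```

A larger sample gives a steadier average, which matters most when row sizes vary widely, as they tend to with nested event arrays.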
All put together!
Dataframe Size: 13 Million entries
Schema: 800 nested fields
Before
After
Redis – The Ultimate Swiss Army Knife!
Using Redis With Spark Uncommonly
Maintain Bloom Filters/HLL on Redis
Interactive Counting while processing results using mapPartitions()
Accumulator Replacement
Event Queue to Convert any normal batch Spark to Interactive Spark
Best Practices
Use Pipelining + Batching!
Tear down connections diligently
Turn Off Speculative Execution
Depends whom you ask
Digging into Redis Pipelining + Spark
From https://redis.io/topics/pipelining
Without Pipelining With Pipelining
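The gain from pipelining is fewer network round trips for the same set of commands. The sketch below counts round trips with a hypothetical `FakeRedis` class; it is not the API of any real Redis client, and `incr_many` simply stands in for sending one pipelined batch.

```python
class FakeRedis:
    """Hypothetical in-memory client that counts network round trips."""

    def __init__(self):
        self.round_trips = 0
        self.counts = {}

    def incr(self, key):
        self.round_trips += 1  # one round trip per command
        self.counts[key] = self.counts.get(key, 0) + 1

    def incr_many(self, keys):
        self.round_trips += 1  # one round trip for the whole batch
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1


def count_without_pipelining(client, keys):
    for key in keys:
        client.incr(key)


def count_with_pipelining(client, keys, batch_size):
    for start in range(0, len(keys), batch_size):
        client.incr_many(keys[start:start + batch_size])


keys = ["a", "b", "a", "c", "a"]
plain, pipelined = FakeRedis(), FakeRedis()
count_without_pipelining(plain, keys)
count_with_pipelining(pipelined, keys, batch_size=2)
```

Both clients end with identical counts; only the number of round trips differs, which is why pipelining pairs well with per-partition batching inside mapPartitions().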
More Questions?
Feel free to reach out to me at
https://www.linkedin.com/in/yeshwanth-vijayakumar-75599431
yvijayak@adobe.com