Optimizing Spark-based data pipelines - Are you up for it?
Etti Gur & Itai Yaffe
Nielsen
@ItaiYaffe, @ettigur
Introduction
Etti Gur
● Senior Big Data developer
● Building data pipelines using Spark, Kafka, Druid, Airflow and more
Itai Yaffe
● Tech Lead, Big Data group
● Dealing with Big Data challenges since 2012
● Women in Big Data Israeli chapter co-founder
Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Working with Spark?
● First time at a Women in Big Data meetup?
Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ The challenges
● The business use-case and our data pipeline
● Optimizing Spark resource allocation & utilization
○ Tools and examples
● Parallelizing Spark output phase with dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
Nielsen Marketing Cloud in numbers
● >10B events/day
● >20TB/day (S3)
● 1000s nodes/day
● 10s of TB ingested/day (Druid)
● $100Ks/month
The challenges
Scalability
Cost Efficiency
Fault-tolerance
The business use-case - measure campaigns in-flight
What are the logical phases of a campaign?
The business use-case - measure campaigns in-flight
What does a funnel look like?
● AD EXPOSURE: 100M (drop-off: 85M)
● HOMEPAGE: 15M (drop-off: 5M)
● PRODUCT PAGE: 10M (drop-off: 7M)
● CHECKOUT: 3M
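As a rough illustration of the funnel arithmetic on this slide, the following Scala sketch derives each stage's drop-off from the stage counts. The `FunnelStage` type and the `dropOffs` helper are hypothetical names used only for this example; they are not part of the pipeline described in the talk.

```scala
// Hypothetical types for illustration; counts are in millions, taken from the slide.
final case class FunnelStage(name: String, count: Long)

object FunnelSketch {
  val funnel: List[FunnelStage] = List(
    FunnelStage("AD EXPOSURE", 100L),
    FunnelStage("HOMEPAGE", 15L),
    FunnelStage("PRODUCT PAGE", 10L),
    FunnelStage("CHECKOUT", 3L)
  )

  // Drop-off of a stage = its count minus the count of the next stage.
  def dropOffs(stages: List[FunnelStage]): List[(String, Long)] =
    stages.zip(stages.drop(1)).map { case (from, to) => (from.name, from.count - to.count) }

  def main(args: Array[String]): Unit =
    dropOffs(funnel).foreach { case (stage, lost) => println(s"$stage: ${lost}M dropped off") }
}
```

The last stage (CHECKOUT) has no following stage, so it has no drop-off entry.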
In-flight analytics pipeline - high-level architecture
● The Data Lake is partitioned by date (e.g. date=2019-12-16, date=2019-12-17, date=2019-12-18)
● Mart Generator
○ 1. Read files of last day from the Data Lake
○ 2. Write files by campaign,date to the Campaigns' marts
● Enricher
○ 3. Read files per campaign from the Campaigns' marts
○ 4. Write files by date,campaign as Enriched data
● 5. Load data by campaign
In-flight analytics pipeline - problems
● Growing execution time: >24 hours/day
● Stability: sporadic failures
● High costs: $33,000/month
● Exhausting recovery (“babysitting”): many hours/incident
In-flight analytics pipeline - Mart Generator
● Steps 1-2 of the pipeline: read files of last day from the Data Lake, write files by campaign,date to the Campaigns' marts
Mart Generator problems
● Execution time: ran for over 7 hours
● Stability: experienced sporadic OOM failures
Digging deeper into resource allocation & utilization
There are various ways to examine Spark resource allocation and utilization:
● Spark UI (e.g. the Executors tab)
● Spark metrics system, e.g.:
○ JMX
○ Graphite
● YARN UI (if applicable)
● Cluster-wide monitoring tools, e.g. Ganglia
Resource allocation - YARN UI
Resource utilization - Ganglia
Mart Generator - initial resource allocation
● EMR cluster with 32 x i3.8xlarge worker nodes
○ Each with 32 cores, 244GB RAM, NVMe SSD
● spark.executor.cores=6
● spark.executor.memory=40g
● spark.executor.memoryOverhead=4g (0.10 * executorMemory)
● Executors per node = 32/6 = 5 (with 2 cores left over)
● Unused resources per node = 244GB - 5 x (40GB + 4GB) = 24GB mem, 2 cores
● Unused resources across the cluster = 32 x (24GB mem, 2 cores) = 768GB mem, 64 cores
○ Remember our OOM failures?
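The arithmetic on this slide can be reproduced with a small Scala sketch. It assumes, as the slide does, that all of a node's cores and memory are available to executors (in practice the resource manager reserves some of both). `NodeLayout` and `layout` are hypothetical names introduced for illustration.

```scala
// Hypothetical helper: how many executors fit on one node, and what is left unused.
final case class NodeLayout(executors: Int, unusedCores: Int, unusedMemGb: Int)

object AllocationSketch {
  def layout(nodeCores: Int, nodeMemGb: Int, execCores: Int, execMemGb: Int, overheadGb: Int): NodeLayout = {
    val perExecMemGb = execMemGb + overheadGb      // executor memory plus memoryOverhead
    val byCores = nodeCores / execCores            // executors that fit by core count
    val byMem   = nodeMemGb / perExecMemGb         // executors that fit by memory
    val n       = math.min(byCores, byMem)         // the tighter limit wins
    NodeLayout(n, nodeCores - n * execCores, nodeMemGb - n * perExecMemGb)
  }

  def main(args: Array[String]): Unit = {
    val nodes = 32
    // Initial Mart Generator settings on i3.8xlarge: 6 cores, 40g memory, 4g overhead.
    val initial = layout(nodeCores = 32, nodeMemGb = 244, execCores = 6, execMemGb = 40, overheadGb = 4)
    println(initial)
    println(s"cluster-wide unused: ${initial.unusedMemGb * nodes}GB mem, ${initial.unusedCores * nodes} cores")
  }
}
```

With the initial settings this yields 5 executors per node with 2 cores and 24GB unused, matching the figures above.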
How to better allocate resources?
● i3.8xlarge (32 vCore, 244 GiB mem, 4 x 1,900 GB NVMe SSD)
○ Best for: memory & storage optimized
○ Cores per executor: 8
○ Memory per executor: 50g
○ Overhead per executor: 8g
○ Executors per node: 32/8 = 4
● r4.8xlarge (32 vCore, 244 GiB mem)
○ Best for: memory optimized
○ Cores per executor: 8
○ Memory per executor: 50g
○ Overhead per executor: 8g
○ Executors per node: 32/8 = 4
● c4.8xlarge (36 vCore, 60 GiB mem)
○ Best for: compute optimized
○ Cores per executor: 6
○ Memory per executor: 7g
○ Overhead per executor: 2g
○ Executors per node: 36/6 = 6
Number of available executors = (total cores/num-cores-per-executor)
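The per-executor values in this table correspond to the same Spark configuration keys shown on the initial-allocation slide (`spark.executor.cores`, `spark.executor.memory`, `spark.executor.memoryOverhead`). As a minimal sketch, the following plain Scala turns one row of the table into `--conf` arguments for `spark-submit`. `ExecutorSpec`, `toSubmitArgs` and `executorsPerNode` are hypothetical names used for illustration only.

```scala
// Hypothetical type: one row of the table (per-executor settings).
final case class ExecutorSpec(cores: Int, memoryGb: Int, overheadGb: Int) {
  // Render the settings as "--conf key=value" pairs for spark-submit.
  def toSubmitArgs: Seq[String] = Seq(
    s"spark.executor.cores=$cores",
    s"spark.executor.memory=${memoryGb}g",
    s"spark.executor.memoryOverhead=${overheadGb}g"
  ).flatMap(kv => Seq("--conf", kv))
}

object SubmitArgsSketch {
  val i3OrR4 = ExecutorSpec(cores = 8, memoryGb = 50, overheadGb = 8) // i3.8xlarge / r4.8xlarge row
  val c4     = ExecutorSpec(cores = 6, memoryGb = 7, overheadGb = 2)  // c4.8xlarge row

  // The slide's formula, applied per node: total cores / cores per executor.
  def executorsPerNode(nodeCores: Int, spec: ExecutorSpec): Int = nodeCores / spec.cores

  def main(args: Array[String]): Unit = {
    println(i3OrR4.toSubmitArgs.mkString(" "))
    println(s"c4.8xlarge executors per node: ${executorsPerNode(36, c4)}")
  }
}
```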
Mart Generator - better resource allocation
Mart Generator - better resource utilization, but...
Mart Generator requirement - overwrite latest date only
● The Data Lake is partitioned by date (e.g. date=2019-11-22, date=2019-11-23, date=2019-11-24)
● Mart Generator
○ 1. Read files of last day from the Data Lake
○ 2. Write files by campaign,date to the Campaigns’ marts
Overwrite partitions - the “trivial” Spark implementation
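import org.apache.spark.sql.SaveMode

// Default (static) overwrite: everything under folderPath is replaced
dataframe.write
  .partitionBy("campaign", "date")
  .mode(SaveMode.Overwrite)
  .parquet(folderPath)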
The result:
● Output written in parallel
● Overwriting the entire root folder
Overwrite specific partitions - our “naive” implementation
// dataframesMap is of type Map[campaignCode, campaignDataframe]
dataframesMap.foreach { case (campaignCode, campaignDataframe) =>
  // One write per campaign, issued one after the other from the driver
  val outputPath = rootPath + "campaign=" + campaignCode + "/date=" + date
  campaignDataframe.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
The result:
● Overwriting only relevant folders
● An extremely long tail (w.r.t. execution time)
Overwrite specific partitions - Spark 2.3 implementation
// Dynamic mode: only the partitions present in the dataframe are overwritten
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataframe.write
  .partitionBy("campaign", "date")
  .mode(SaveMode.Overwrite)
  .parquet(folderPath)
The result:
● Output written in parallel
● Overwriting only relevant folders
● Possible side effect: sequential S3 MV commands issued by the driver
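To make the difference between the two overwrite behaviors concrete, here is a toy model in plain Scala (no Spark involved). It treats the output folder as a map from partition (campaign, date) to rows and models only which partitions survive a write. `Folder`, `staticOverwrite` and `dynamicOverwrite` are hypothetical names for this illustration; real Spark writes files, not map entries.

```scala
object OverwriteModel {
  type PartitionKey = (String, String)            // (campaign, date)
  type Folder       = Map[PartitionKey, Seq[String]]

  // Models the "trivial" implementation: the whole root folder is replaced by the new output.
  def staticOverwrite(existing: Folder, incoming: Folder): Folder = incoming

  // Models partitionOverwriteMode=dynamic: only partitions present in the new output are replaced.
  def dynamicOverwrite(existing: Folder, incoming: Folder): Folder = existing ++ incoming

  def main(args: Array[String]): Unit = {
    val existing: Folder = Map(
      ("c1", "2019-11-23") -> Seq("old"),
      ("c1", "2019-11-24") -> Seq("old"))
    val incoming: Folder = Map(("c1", "2019-11-24") -> Seq("new"))

    println(staticOverwrite(existing, incoming))  // the 2019-11-23 partition is gone
    println(dynamicOverwrite(existing, incoming)) // the 2019-11-23 partition is kept
  }
}
```

In this model the Mart Generator requirement (overwrite the latest date only) is met by `dynamicOverwrite`, while `staticOverwrite` loses the earlier dates.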
Mart Generator - optimal resource utilization
Mart Generator - summary
● Better resource allocation & utilization
● Execution time decreased from 7+ hours to ~40 minutes
● No sporadic OOM failures
● Overwriting only relevant folders (i.e. partitions)
In-flight analytics pipeline - Enricher
● Steps 3-4 of the pipeline: read files per campaign from the Campaigns' marts, write files by date,campaign as Enriched data
Enricher problem - execution time
● Grew from 9 hours to 18 hours
● Sometimes took more than 20 hours
Enricher - initial resource utilization
Running multiple Spark “jobs” within a single Spark application
● Create one Spark application with one SparkContext
● Create a thread pool
○ Thread pool size is configurable
● Each thread should execute a separate Spark “job” (i.e. action)
● “Jobs” wait in a queue and are executed based on available resources
○ This is managed by Spark’s scheduler
import java.util.concurrent.{Callable, Executors}

// Fixed-size pool: at most numOfThreads campaigns are processed concurrently
val executorService = Executors.newFixedThreadPool(numOfThreads)
// Submit one task per campaign; each task triggers its own Spark "job" (action)
val futures = campaigns.map { campaign =>
  executorService.submit(new Callable[Result] {
    override def call(): Result = {
      val ans = processCampaign(campaign, appConf, log)
      Result(campaign.code, ans)
    }
  })
}
// Wait for every campaign; a failed one is logged and replaced by an empty Result
val completedCampaigns = futures.map { future =>
  try {
    future.get()
  } catch {
    case e: Exception =>
      log.info("Some thread caused exception : " + e.getMessage)
      Result("", "", false, false)
  }
}
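The same pattern can also be sketched with Scala's `Future` on top of a fixed-size pool. This is an alternative formulation, not the authors' code: `processCampaign` here is a stand-in that only returns a string, whereas in the real pipeline it would trigger a Spark action on the shared Spark application. `CampaignResult` and `runAll` are hypothetical names.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

final case class CampaignResult(code: String, succeeded: Boolean)

object ConcurrentJobsSketch {
  // Stand-in for the real per-campaign work (which would run a Spark action).
  def processCampaign(code: String): String = {
    if (code.isEmpty) throw new IllegalArgumentException("empty campaign code")
    s"processed $code"
  }

  def runAll(campaigns: Seq[String], numOfThreads: Int): Seq[CampaignResult] = {
    val pool = Executors.newFixedThreadPool(numOfThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // One Future per campaign; at most numOfThreads of them run at the same time.
      val futures = campaigns.map { code =>
        Future(processCampaign(code))
          .map(_ => CampaignResult(code, succeeded = true))
          .recover { case _: Exception => CampaignResult(code, succeeded = false) }
      }
      // Block until all campaigns finish; results keep the input order.
      Await.result(Future.sequence(futures), 1.minute)
    } finally {
      pool.shutdown() // release the pool's threads so the JVM can exit
    }
  }

  def main(args: Array[String]): Unit =
    runAll(Seq("c1", "", "c2"), numOfThreads = 2).foreach(println)
}
```

A failing campaign is turned into a `CampaignResult` with `succeeded = false` instead of aborting the whole run, which mirrors the try/catch around `future.get()` in the original snippet. The one-minute timeout is an arbitrary value for the example.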
Spark UI - multiple Spark “jobs” within a single Spark application
Enricher - optimal resource utilization
Enricher - summary
● Running multiple Spark “jobs” within a single Spark app
● Better resource utilization
● Execution time decreased from 20+ hours to ~1 hour 20 minutes
In-flight analytics pipeline - before & after
● Growing execution time - before: >24 hours/day, after: 2 hours/day
● Stability - before: sporadic failures, after: improved
● High costs - before: $33,000/month, after: $3,000/month
● Exhausting recovery (“babysitting”) - before: many hours/incident, after: 2 hours/incident
> 90% improvement
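The "> 90% improvement" figure can be checked against the before/after numbers with a short Scala sketch. `improvementPct` is a hypothetical helper, and using exactly 24 hours as the "before" execution time is an assumption (the slide says ">24 hours"), so the time improvement computed here is a lower bound.

```scala
object ImprovementSketch {
  // Relative improvement in percent: (before - after) / before * 100
  def improvementPct(before: Double, after: Double): Double =
    (before - after) / before * 100.0

  def main(args: Array[String]): Unit = {
    val timePct = improvementPct(before = 24, after = 2)        // hours/day
    val costPct = improvementPct(before = 33000, after = 3000)  // $/month
    println(s"execution time improved by ${math.round(timePct)}%")
    println(s"monthly cost improved by ${math.round(costPct)}%")
  }
}
```

Both the execution time (22/24) and the monthly cost (30,000/33,000) come out above 90%.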
What have we learned?
● You too can optimize Spark resource allocation & utilization
○ Leverage the tools at hand to deep-dive into your cluster
● Spark output phase can be parallelized even when overwriting specific partitions
○ Use dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application can be useful
● Optimizing data pipelines is an ongoing effort (not a one-off)
Want to know more?
● Women in Big Data Israel YouTube channel - https://tinyurl.com/y5jozqpg
● Marketing Performance Analytics Using Druid - https://tinyurl.com/t3dyo5b
● NMC Tech Blog - https://medium.com/nmc-techblog
QUESTIONS
THANK YOU
Itai Yaffe
Etti Gur

Optimizing Spark-based data pipelines - are you up for it?

  • 1. Optimizing Spark-based data pipelines - Are you up for it? Etti Gur & Itai Yaffe Nielsen
  • 2. @ItaiYaffe, @ettigur Introduction Etti Gur Itai Yaffe ● Senior Big Data developer ● Building data pipelines using Spark, Kafka, Druid, Airflow and more ● Tech Lead, Big Data group ● Dealing with Big Data challenges since 2012 ● Women in Big Data Israeli chapter co-founder
  • 3. Introduction - part 2 (or: “your turn…”) ● Data engineers? Data architects? Something else? ● Working with Spark? ● First time at a Women in Big Data meetup?
  • 4. Agenda ● Nielsen Marketing Cloud (NMC) ○ About ○ The challenges ● The business use-case and our data pipeline ● Optimizing Spark resource allocation & utilization ○ Tools and examples ● Parallelizing Spark output phase with dynamic partition inserts ● Running multiple Spark "jobs" within a single Spark application
  • 5. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen in March 2015 ● A Data company ● Machine learning models for insights ● Targeting ● Business decisions
  • 6. Nielsen Marketing Cloud in numbers >10B events/day >20TB/day S3 1000s nodes/day 10s of TB ingested/day druid $100Ks/month
  • 9. What are the logical phases of a campaign? The business use-case - measure campaigns in-flight
  • 10. What does a funnel look like? AD EXPOSURE 100M → (85M drop-off) → HOMEPAGE 15M → (5M drop-off) → PRODUCT PAGE 10M → (7M drop-off) → CHECKOUT 3M. The business use-case - measure campaigns in-flight
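As a small worked example of the funnel arithmetic on this slide, the drop-off at each step is the difference between consecutive stage counts. The sketch below uses the figures from the slide; `Stage` and `dropOffs` are hypothetical names chosen for illustration and are not part of the talk.

```scala
object Funnel {
  // One funnel stage and the number of users who reached it (illustrative model).
  final case class Stage(name: String, users: Long)

  // Drop-off between each pair of consecutive stages: (from, to, usersLost).
  def dropOffs(stages: Seq[Stage]): Seq[(String, String, Long)] =
    stages.zip(stages.drop(1)).map { case (from, to) =>
      (from.name, to.name, from.users - to.users)
    }

  def main(args: Array[String]): Unit = {
    val stages = Seq(
      Stage("AD EXPOSURE", 100000000L),
      Stage("HOMEPAGE", 15000000L),
      Stage("PRODUCT PAGE", 10000000L),
      Stage("CHECKOUT", 3000000L)
    )
    dropOffs(stages).foreach(println)
  }
}
```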
  • 11. In-flight analytics pipeline - high-level architecture: Data Lake (date=2019-12-16, date=2019-12-17, date=2019-12-18) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign
  • 12. In-flight analytics pipeline - problems
      The Problem              | Metric
      Growing execution time   | >24 hours/day
      Stability                | Sporadic failures
      High costs               | $33,000/month
      Exhausting recovery      | Many hours/incident (“babysitting”)
  • 13. In-flight analytics pipeline - Mart Generator: Data Lake (date=2019-12-16, date=2019-12-17, date=2019-12-18) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign
  • 14. Mart Generator problems ● Execution time: ran for over 7 hours ● Stability: experienced sporadic OOM failures
  • 15. Digging deeper into resource allocation & utilization There are various ways to examine Spark resource allocation and utilization: ● Spark UI (e.g. Executors Tab) ● Spark metrics system, e.g.: ○ JMX ○ Graphite ● YARN UI (if applicable) ● Cluster-wide monitoring tools, e.g. Ganglia
  • 20. Mart Generator - initial resource allocation ● EMR cluster with 32 X i3.8xlarge worker nodes ○ Each with 32 cores, 244GB RAM, NVMe SSD ● spark.executor.cores=6 ● spark.executor.memory=40g ● spark.executor.memoryOverhead=4g (0.10 * executorMemory) ● Executors per node = 32/6 = 5, with 2 cores left over ● Unused resources per node = 24GB mem (244GB - 5 * 44GB), 2 cores ● Unused resources across the cluster (32 nodes) = 768GB mem, 64 cores ○ Remember our OOM failures?
  • 21. How to better allocate resources?
      EC2 instance type | Best for | Cores per executor | Memory per executor | Overhead per executor | Executors per node
      i3.8xlarge (32 vCore, 244 GiB mem, 4 x 1,900 GB NVMe SSD) | Memory & storage optimized | 8 | 50g | 8g | 32/8 = 4
      r4.8xlarge (32 vCore, 244 GiB mem) | Memory optimized | 8 | 50g | 8g | 32/8 = 4
      c4.8xlarge (36 vCore, 60 GiB mem) | Compute optimized | 6 | 7g | 2g | 36/6 = 6
      Number of available executors = (total cores / num-cores-per-executor)
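The per-node arithmetic on slides 20 and 21 can be reproduced with a few lines of Scala. This is a minimal sketch, not part of the original talk: `NodeSpec`, `ExecutorSpec`, `Plan`, and `plan` are hypothetical names, and the model follows the slides' simplification that all of a node's cores and memory are available to executors (in practice the OS and the cluster manager reserve part of them).

```scala
// Hypothetical helper (not from the talk): how many executors fit on a node,
// and what is left unused, under the simplification used on the slides.
object ResourceAllocation {
  final case class NodeSpec(cores: Int, memoryGb: Int)
  final case class ExecutorSpec(cores: Int, memoryGb: Int, overheadGb: Int)
  final case class Plan(executorsPerNode: Int, unusedCores: Int, unusedMemoryGb: Int)

  def plan(node: NodeSpec, executor: ExecutorSpec): Plan = {
    // Each executor occupies its heap plus its memory overhead.
    val footprintGb = executor.memoryGb + executor.overheadGb
    // An executor needs both its cores and its full memory footprint.
    val executors = math.min(node.cores / executor.cores, node.memoryGb / footprintGb)
    Plan(
      executorsPerNode = executors,
      unusedCores = node.cores - executors * executor.cores,
      unusedMemoryGb = node.memoryGb - executors * footprintGb
    )
  }

  def main(args: Array[String]): Unit = {
    val i3 = NodeSpec(cores = 32, memoryGb = 244)
    // Initial allocation from slide 20
    println(plan(i3, ExecutorSpec(cores = 6, memoryGb = 40, overheadGb = 4)))
    // Suggested allocation from slide 21
    println(plan(i3, ExecutorSpec(cores = 8, memoryGb = 50, overheadGb = 8)))
  }
}
```

With the initial settings, this model yields 5 executors per node with 2 cores and 24GB unused, which over 32 nodes adds up to the 64 cores and 768GB quoted on slide 20. With 8 cores, 50g memory, and 8g overhead per executor, it yields 4 executors per node with no idle cores and 12GB left over.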
  • 22. Mart Generator - better resource allocation
  • 23. Mart Generator - better resource utilization, but...
  • 24. Mart Generator requirement - overwrite latest date only: Data Lake (date=2019-11-22, date=2019-11-23, date=2019-11-24) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns’ marts
  • 25. Overwrite partitions - the “trivial” Spark implementation
      dataframe.write
        .partitionBy("campaign", "date")
        .mode(SaveMode.Overwrite)
        .parquet(folderPath)
      The result: ● Output written in parallel ● Overwriting the entire root folder
  • 26. Overwrite specific partitions - our “naive” implementation
      dataframesMap is of type <campaignCode, campaignDataframe>
      dataframesMap.foreach(campaign => {
        val outputPath = rootPath + "campaign=" + campaign.code + "/date=" + date
        campaign.dataframe.write.mode(SaveMode.Overwrite).parquet(outputPath)
      })
      The result: ● Overwriting only relevant folders ● An extremely long tail (w.r.t. execution time)
  • 27. Overwrite specific partitions - Spark 2.3 implementation
      sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
      dataframe.write
        .partitionBy("campaign", "date")
        .mode(SaveMode.Overwrite)
        .parquet(folderPath)
      The result: ● Output written in parallel ● Overwriting only relevant folders ● Possible side effect: the driver issues the S3 move (MV) commands sequentially
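The difference between the static overwrite on slide 25 and the dynamic mode on slide 27 comes down to which existing partitions are dropped before the new output lands. The following is a toy, in-memory model of that behaviour only. It is not Spark code and does not reflect how Spark implements it: `PartitionKey`, `Table`, `staticOverwrite`, and `dynamicOverwrite` are hypothetical names, and a `Map` stands in for the output folder.

```scala
object OverwriteModes {
  // A partition folder such as campaign=A/date=2019-11-24 (illustrative model).
  final case class PartitionKey(campaign: String, date: String)
  type Table = Map[PartitionKey, Seq[String]]

  // Static mode: everything under the root folder is replaced by the new output.
  def staticOverwrite(existing: Table, incoming: Table): Table = incoming

  // Dynamic mode: only partitions present in the new output are replaced;
  // all other existing partitions are left untouched.
  def dynamicOverwrite(existing: Table, incoming: Table): Table = existing ++ incoming

  def main(args: Array[String]): Unit = {
    val existing: Table = Map(
      PartitionKey("A", "2019-11-23") -> Seq("old-a-23"),
      PartitionKey("A", "2019-11-24") -> Seq("old-a-24")
    )
    val incoming: Table = Map(PartitionKey("A", "2019-11-24") -> Seq("new-a-24"))
    println(staticOverwrite(existing, incoming).keySet)
    println(dynamicOverwrite(existing, incoming).keySet)
  }
}
```

In this model, a static overwrite of the latest date loses the 2019-11-23 partition, while the dynamic variant keeps it and replaces only 2019-11-24, which matches the requirement stated on slide 24.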
  • 28. Mart Generator - optimal resource utilization
  • 29. Mart Generator - summary ● Better resource allocation & utilization ● Execution time decreased from 7+ hours to ~40 minutes ● No sporadic OOM failures ● Overwriting only relevant folders (i.e. partitions)
  • 30. In-flight analytics pipeline - Enricher: Data Lake (date=2019-12-16, date=2019-12-17, date=2019-12-18) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign
  • 31. Enricher problem - execution time ● Grew from 9 hours to 18 hours ● Sometimes took more than 20 hours
  • 32. Enricher - initial resource utilization
  • 33. Running multiple Spark “jobs” within a single Spark application ● Create one Spark application with one SparkContext ● Create a thread pool ○ Thread pool size is configurable ● Each thread should execute a separate Spark “job” (i.e. action) ● “Jobs” are waiting in a queue and are executed based on available resources ○ This is managed by Spark’s scheduler
  • 34. Running multiple Spark “jobs” within a single Spark application
      import java.util.concurrent.{Callable, Executors}
      val executorService = Executors.newFixedThreadPool(numOfThreads)
      // Submit one task per campaign; each task triggers its own Spark "job"
      val futures = campaigns map (campaign => {
        executorService.submit(new Callable[Result]() {
          override def call(): Result = {
            val ans = processCampaign(campaign, appConf, log)
            Result(campaign.code, ans)
          }
        })
      })
      // Wait for every task; a failed task is logged and mapped to an empty Result
      val completedCampaigns = futures map (future => {
        try {
          future.get()
        } catch {
          case e: Exception => {
            log.info("Some thread caused exception : " + e.getMessage)
            Result("", "", false, false)
          }
        }
      })
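A self-contained variant of the pattern on this slide, runnable without a cluster, might look as follows. It is a sketch under stated assumptions rather than the authors' code: `MultiJobRunner`, `runAll`, and the `process` function are hypothetical, `processCampaign` is a stand-in for the real per-campaign Spark action, `Result` is reduced to two fields, and the pool is shut down explicitly once all futures have completed (a step the slide does not show).

```scala
import java.util.concurrent.{Callable, Executors, Future}

object MultiJobRunner {
  final case class Result(campaignCode: String, success: Boolean)

  // Hypothetical stand-in for the per-campaign work; in a real application
  // this would trigger a Spark action on the shared SparkSession.
  def processCampaign(campaign: String): Boolean = campaign.nonEmpty

  def runAll(campaigns: Seq[String], numOfThreads: Int)(process: String => Boolean): Seq[Result] = {
    val executorService = Executors.newFixedThreadPool(numOfThreads)
    try {
      // Submit one task per campaign; tasks beyond the pool size wait in the queue.
      // toVector forces eager submission even if the input is a lazy collection.
      val futures: Vector[(String, Future[Result])] = campaigns.toVector.map { campaign =>
        campaign -> executorService.submit(new Callable[Result] {
          override def call(): Result = Result(campaign, process(campaign))
        })
      }
      // Collect results in submission order; a failed task becomes a failed Result.
      futures.map { case (campaign, future) =>
        try future.get()
        catch {
          case e: Exception =>
            Console.err.println(s"Campaign $campaign failed: ${e.getMessage}")
            Result(campaign, success = false)
        }
      }
    } finally {
      executorService.shutdown() // release the worker threads
    }
  }

  def main(args: Array[String]): Unit =
    runAll(Seq("campaignA", "campaignB", "campaignC"), numOfThreads = 2)(processCampaign)
      .foreach(println)
}
```

Keeping the campaign code next to each future means a failure can still be attributed to its campaign. Note also that Spark schedules jobs submitted from different threads in FIFO order by default; setting `spark.scheduler.mode` to `FAIR` is a documented alternative when concurrent jobs should share resources more evenly, though whether it helps depends on the workload.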
  • 35. Spark UI - multiple Spark “jobs” within a single Spark application
  • 36. Enricher - optimal resource utilization
  • 37. Enricher - summary ● Running multiple Spark “jobs” within a single Spark app ● Better resource utilization ● Execution time decreased from 20+ hours to ~1:20 hours
  • 39. In-flight analytics pipeline - before & after (> 90% improvement)
      The Problem              | Before                              | After
      Growing execution time   | >24 hours/day                       | 2 hours/day
      Stability                | Sporadic failures                   | Improved
      High costs               | $33,000/month                       | $3,000/month
      Exhausting recovery      | Many hours/incident (“babysitting”) | 2 hours/incident
  • 40. What have we learned? ● You too can optimize Spark resource allocation & utilization ○ Leverage the tools at hand to deep-dive into your cluster ● Spark output phase can be parallelized even when overwriting specific partitions ○ Use dynamic partition inserts ● Running multiple Spark "jobs" within a single Spark application can be useful ● Optimizing data pipelines is an ongoing effort (not a one-off)
  • 41. DRUID ES Want to know more? ● Women in Big Data Israel YouTube channel - https://tinyurl.com/y5jozqpg ● Marketing Performance Analytics Using Druid - https://tinyurl.com/t3dyo5b ● NMC Tech Blog - https://medium.com/nmc-techblog
  • 43. THANK YOU Itai Yaffe, Etti Gur