Leveraging Databricks for Spark Pipelines
How Coatue Management saved time and money
Rose Toomey
AWS Data Analytics Week
27 February 2020
About me
Finance. Technology. Code. And data.
So much data.
A software engineer making up for a lifetime of trying to fix data
problems at the wrong end of the pipeline.
• Software Engineer at Coatue Management
• Lead API Developer at Gemini Trust
• Director at Novus Partners
Apache Spark At Scale In The Cloud
Spark+AI Summit Europe 2019, Amsterdam
video / slides
Moonshot Spark: Serverless Spark with GraalVM
Scale By The Bay 2019, Oakland
video / slides
Upcoming! ✨
A Field Mechanic's Guide To Integration Testing Your Apache Spark App
ScalaDays 2020, Seattle
Recent Talks
About Coatue Management
What we do: data engineering at Coatue
• Terabyte scale, billions of rows
• Lambda architecture
• Functional programming
Our stack
• Scala (cats, shapeless, fs2, http4s)
• Spark / Hadoop
• Data warehouses
• Python / R / Tableau
Our Challenges
Coatue Management deals with a massive volume of data.
1. Many Spark pipelines, many different sizes of Spark pipelines. Small,
medium, large: each pipeline size has its own tuning and processing
issues.
2. All running in the cloud: each cloud provider presents separate
operational challenges.
Improving the cost efficiency and reliability of our Spark pipelines is a
huge win for our data engineering team.
We want to focus on the input and output of our pipelines, not the
operational details surrounding them.
• Standardize how pipelines are deployed
• Run faster
• Spend less money
• Don't get paged over nonsense
• That goes double for being rate limited when using S3
• Become BFF with our 🌈🦄💯 Databricks account team
OK, it wasn't a goal but it happened anyway 💕
Our goals
• Deployment and scaling toolbox: autoscale the cluster, autoscale the local storage,
define jobs and clusters or just go ad hoc. Multiple cloud providers through a single
API. Bigger master, smaller workers.
• We wanted to see if Databricks Runtime join and filter optimizations could make
our jobs faster relative to what's offered in Apache Spark
• Superior, easy-to-use tools. Spark history server (only recently available elsewhere),
historical Ganglia screenshots 🥰, easy access to logs from a browser.
• Optimized cloud storage access
• Run our Spark jobs in the same environment where we run our notebooks
Why Databricks?
What happened next
We began migrating over our pipelines in order of size and production impact.
This was not a direct transposition but brought about a useful re-evaluation of how
we were running our Spark jobs.
• Submitting a single large job using the Databricks Jobs API instead of multiple
smaller jobs using spark-submit
• Making the most of the Databricks Runtime
• Improving the performance and reliability of reading from and writing to S3 by
switching to instance types that support Delta Cache
• Optimizing join and filter operations
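As a sketch of the single-job approach: the payload below is an illustrative Databricks Jobs API 2.0 `runs/submit` request. The run name, instance types, Spark version, jar path, and main class are all placeholders, not the pipeline's actual values; i3-family workers are shown because their NVMe SSDs back the Delta cache.

```json
{
  "run_name": "example-pipeline",
  "new_cluster": {
    "spark_version": "6.3.x-scala2.11",
    "node_type_id": "i3.2xlarge",
    "driver_node_type_id": "i3.4xlarge",
    "autoscale": { "min_workers": 50, "max_workers": 100 },
    "enable_elastic_disk": true
  },
  "libraries": [{ "jar": "s3://example-bucket/jars/pipeline-assembly.jar" }],
  "spark_jar_task": {
    "main_class_name": "com.example.pipeline.Main",
    "parameters": ["--date", "2020-02-27"]
  }
}
```

POSTing this to `/api/2.0/jobs/runs/submit` stands up the cluster, runs the jar as a single job, and releases the cluster when it finishes. The same spec works across cloud providers, which is what replaces the per-provider spark-submit tooling.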
Case Study 1: The Doge Pipeline
The "Doge" pipeline
A sample large data pipeline that drags.
• ~20 billion rows
• File skew. Highly variable row length.
• CSV that needs multiline support. Occasional bad rows.
• Cluster is 200 nodes with 1600 vCPU, 13 TB memory
• Each run takes four hours
...and growing
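A hedged sketch of how a CSV ingest with multiline support and occasional bad rows might be configured in Spark 2.4 (the S3 path is a placeholder, and DROPMALFORMED is one reasonable policy, not necessarily the one the pipeline uses):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// multiLine lets quoted fields span physical lines;
// DROPMALFORMED silently discards the occasional bad row
// instead of failing the whole job
val raw = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .option("mode", "DROPMALFORMED")
  .csv("s3://example-bucket/doge/input/")
```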
• Instead of multiple jobs (ingestion, multiple processing
segments, writing final output to S3) pipeline now runs as a
single job
• Due to complex query lineages, we ran into an issue
where the whole cluster bogged down 🙀
• And thanks to using an instance type that supports
Databricks Delta Caching, we found a novel workaround
The changes
1. Dataset foo (+ lineage 🧳) is in memory
2. Write dataset foo to S3 (delta cache not populated on write)
3. Read dataset foo (- lineage 👋) from S3 (delta cache lazily
populated by read)
4. Optional… some time later, read dataset foo from S3 again
- this time we would expect to hit delta cache instead
Counterintuitively, this is faster than checkpointing. Even if
your local storage is NVMe SSD.
Truncating lineage faster 🏎
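The four steps above can be sketched as a small helper. The name `truncateLineage` and the Parquet output format are illustrative assumptions; the point is the write-then-read round trip, used in place of `Dataset.checkpoint`:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Persist a dataset to S3 and immediately read it back, discarding the
// accumulated query lineage. The write does not populate the Delta cache;
// the read-back populates it lazily, so later re-reads of the same path
// can be served from local NVMe SSD.
def truncateLineage(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
  df.write.mode("overwrite").parquet(path) // lineage still attached here
  spark.read.parquet(path)                 // fresh plan, no upstream lineage
}
```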
The outcome
Runs almost twice as fast on about half the cluster.
• Pipeline now completes in around 2.5 hours
• Uses a cluster of 100 nodes
• Larger master, 16 vCPU and 120 GB memory
• Worker nodes total 792 vCPUs and 5.94 TB memory
• Big pickup provided by NVMe SSD local storage
Case Study 2: The Hamster Pipeline
A medium sized older pipeline that turns out to have been running on its
hamster wheel most of the time.
• 5 billion rows.
• CSVs. Wide rows push the dataset to medium size despite the
relatively small row count.
• Processing stages like Jurassic Park, but for regexes 🦖
• Cluster is 100 nodes with 800 vCPU, 6.5 TB memory
• Each run takes three hours, but recently started failing due to
resource issues even though "nothing changed"
The "Hamster" Pipeline
We were hoping for a straightforward win by dumping the Hamster
job on Databricks.
That didn't happen. We made the same superficial changes as for Doge:
single job, NVMe SSD local storage. The job didn't die but it was really
slow.
Examining the query planner results, we saw cost-based optimizer
results all over the place. 😿
Some of the queries made choices that were logical when they were written
but counterproductive on Spark 2.4 with the Databricks Runtime.
Time to rewrite those older queries!
Goodbye Hamster Wheel
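One common class of rewrite (not necessarily the exact change made to the Hamster queries, which the deck does not detail) is pinning a join strategy explicitly rather than trusting cost-based-optimizer statistics:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Assumed inputs: a large fact table and a small dimension table joined
// on a shared "key" column. broadcast() forces a broadcast-hash join even
// when stale statistics would lead the optimizer to a shuffle join.
def joinWithHint(facts: DataFrame, dims: DataFrame): DataFrame =
  facts.join(broadcast(dims), Seq("key"))
```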
The Hamster pipeline uses roughly the same amount of
resources as before but now completes more than 5x faster
than it previously did. 🚀
• Pipeline now completes in 35 minutes
• Uses a cluster of 200 nodes
• Larger master, 16 vCPU and 120 GB memory
• Worker nodes total 792 vCPUs and 5.94 TB memory
Hello Hamster Rocket
Results
• Migrating Coatue Management’s Spark pipelines to
Databricks has reduced operational overhead while saving
time and money.
• Our cloud storage reads and writes are now more reliable
• Jobs and clusters are now managed through a single simple
REST API instead of a Tower of Babel toolchain for different
cloud providers
Interested in finding out more about data engineering at
Coatue Management? Come talk to us!
