New Developments in Spark
Matei Zaharia
August 18th, 2015
About Databricks
Founded by the creators of Spark in 2013 and remains the top
contributor
End-to-end service for Spark on EC2
• Interactive notebooks, dashboards,
and production jobs
Our Goal for Spark
Unified engine across data workloads and platforms
(SQL, Streaming, ML, Graph, Batch, …)
Past 2 Years
Fast growth in libraries and
integration points
• New library for SQL + DataFrames
• 10x growth of ML library
• Pluggable data source API
• R language
Result: very diverse use of Spark
• Only 40% of users on Hadoop YARN
• Most users use at least 2 of Spark’s
built-in libraries
• 98% of Databricks customers use
SQL, 60% use Python
Beyond Libraries
Best thing about basing Spark’s libraries on a high-level API is
that we can also make big changes underneath them
Now working on some of the largest changes to Spark Core
since the project began
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Hardware Trends (2010 vs. 2015)
Storage: 50+ MB/s (HDD) in 2010, 500+ MB/s (SSD) in 2015 (10x)
Network: 1 Gbps in 2010, 10 Gbps in 2015 (10x)
CPU: ~3 GHz in 2010, ~3 GHz in 2015 (no change)
Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms
Interfaces to Tungsten
Frontends: DataFrames (Python, Java, Scala, R), Spark SQL, RDDs, …
Each hands Tungsten a data schema + query plan
Tungsten backends: JVM, LLVM, GPU, NVRAM, …
DataFrame API
Single-node tabular structure in R and Python, with APIs for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
Google Trends for “data frame”
DataFrame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#>     year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1   2013     1   1      517         2      830        11      UA  N14228
#> 2   2013     1   1      533         4      850        20      UA  N24211
#> 3   2013     1   1      542         2      923        33      AA  N619AA
#> 4   2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..   ...   ... ...      ...       ...      ...       ...     ...     ...
Spark DataFrames
Structured data collections with a similar API to R/Python
• DataFrame = RDD + schema
Capture many operations as expressions in a DSL
• Enables rich optimizations
df = jsonFile("tweets.json")
df(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")
[Chart: running time (scale 0 to 10) comparing the same computation written against the Python RDD API, the Scala RDD API, and the DataFrame API]
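
To make the DSL concrete, here is a hedged Scala sketch of the same kind of aggregation written two ways. It uses the modern SparkSession API rather than the 2015-era jsonFile helper, and the tweets.json path, column names, and column types are assumptions for illustration only.

import org.apache.spark.sql.SparkSession

object DataFrameVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-vs-rdd").getOrCreate()
    import spark.implicits._

    // DataFrame version: filter / groupBy / sum are declarative expressions
    // that the optimizer can inspect and rearrange.
    val df = spark.read.json("tweets.json")   // assumed sample file
    val byDate = df.filter($"user" === "matei")
      .groupBy("date")
      .sum("retweets")

    // RDD version: the same logic hidden inside opaque closures,
    // so the engine can only run it exactly as written.
    val byDateRdd = df.select("user", "date", "retweets").rdd
      .filter(row => row.getString(0) == "matei")
      .map(row => (row.getString(1), row.getLong(2)))
      .reduceByKey(_ + _)

    byDate.show()
    byDateRdd.take(10).foreach(println)
    spark.stop()
  }
}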
How does Tungsten help?
1. Off-Heap Memory Management
Store data outside JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: binary format we compute on directly
2-10x space saving, especially for strings, nested objects
Can use new RAM-like devices, e.g. flash, 3D XPoint
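
As a rough illustration of the off-heap idea, here is a minimal sketch using a direct ByteBuffer; this is not Tungsten's actual binary row format or memory manager, just the principle that values stored outside the heap carry no per-object overhead and no GC cost.

import java.nio.ByteBuffer

// Toy off-heap column of ints: the JVM heap holds only this small wrapper object;
// the values live in a direct (off-heap) buffer, so they have no per-object
// header overhead and add nothing to garbage-collection pressure.
class OffHeapIntColumn(capacity: Int) {
  private val buf = ByteBuffer.allocateDirect(capacity * 4)
  def set(i: Int, value: Int): Unit = buf.putInt(i * 4, value)
  def get(i: Int): Int = buf.getInt(i * 4)
}

object OffHeapDemo {
  def main(args: Array[String]): Unit = {
    val col = new OffHeapIntColumn(1000)
    (0 until 1000).foreach(i => col.set(i, i * 2))
    println(col.get(42))  // prints 84
  }
}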
2. Runtime Code Generation
Generate Java code for DataFrame and
SQL expressions requested by the user
Avoids virtual calls and generics/boxing
Can do same in core, ML and graph
• Code-gen serializers, fused functions,
math expressions
Evaluating "SELECT a+a+a" (time in seconds):
hand-written code: 9.3, generated code: 9.4, interpreted projection: 36.7
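
A minimal sketch of why generated code wins, assuming a toy expression tree rather than Catalyst's real interpreter and code generator: the interpreted path pays for virtual dispatch and boxing on every row, while the "generated" version collapses to a single straight-line primitive expression.

// Interpreted evaluation: every node is a virtual call and Ints get boxed into Any.
sealed trait Expr { def eval(row: Array[Any]): Any }
case class Col(i: Int) extends Expr { def eval(row: Array[Any]): Any = row(i) }
case class Add(l: Expr, r: Expr) extends Expr {
  def eval(row: Array[Any]): Any =
    l.eval(row).asInstanceOf[Int] + r.eval(row).asInstanceOf[Int]
}

object CodegenSketch {
  def main(args: Array[String]): Unit = {
    val row: Array[Any] = Array(7)

    // "SELECT a + a + a" as an interpreted tree:
    val tree = Add(Add(Col(0), Col(0)), Col(0))
    println(tree.eval(row))  // 21

    // What code generation collapses this to: one primitive expression,
    // no virtual dispatch, no boxing.
    def generated(a: Int): Int = a + a + a
    println(generated(7))    // 21
  }
}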
3. Cache-Aware Algorithms
Use custom memory layout to better leverage the CPU cache
Example: AlphaSort-style prefix sort
• Store prefixes of sort keys inside the pointer array
• Compare prefixes to avoid full record fetches + comparisons
Naïve layout: pointer -> record
Cache-friendly layout: [key prefix | pointer] -> record
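
A toy sketch of the prefix-sort idea, illustrative only (Tungsten's sorter works on binary rows, not Strings): most comparisons touch a compact array of (prefix, pointer) entries, and full records are fetched only to break ties between equal prefixes.

object PrefixSortSketch {
  // Prefix = first 8 bytes of the key packed into a Long (zero-padded).
  def prefixOf(s: String): Long =
    s.getBytes("UTF-8").take(8).padTo(8, 0.toByte)
      .foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xFF))

  def main(args: Array[String]): Unit = {
    val records = Array("banana split", "apple pie", "apricot jam", "cherry tart")

    // (prefix, pointer) pairs: a small, cache-friendly array to sort.
    val entries = records.indices.map(i => (prefixOf(records(i)), i)).toArray

    val sorted = entries.sortWith { (a, b) =>
      if (a._1 != b._1) java.lang.Long.compareUnsigned(a._1, b._1) < 0
      else records(a._2) < records(b._2)  // rare tie: fetch the full records
    }

    sorted.foreach(e => println(records(e._2)))
  }
}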
Tungsten Performance Results
[Chart: run time in seconds (0 to 1200) vs. relative data set size (1x to 16x) for four configurations: Default, Code Gen, Tungsten on-heap, Tungsten off-heap]
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Motivation
Network and storage speeds have improved 10x, but this
speed isn’t always easy to leverage!
Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across the cluster
• Doing all this on many cores and many disks
Sort Benchmark
Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software
Participated in the largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner
Set a new world record (tied with UCSD)
• Saturated 8 SSDs and a 10 Gbps network per node
• 1st time public cloud + open source won
On-Disk Sort Record
Time to sort 100 TB
2013 record (Hadoop): 2100 machines, 72 minutes
2014 record (Spark): 207 machines, 23 minutes
Also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
Saturating the Network
1.1 GB/sec per node
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Motivation
Query planning is crucial to performance in a distributed setting
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)
Hard to do well for big data even with cost-based optimization
• Unindexed data => don’t have statistics
• User-defined functions => hard to predict
Solution: let Spark change the query plan adaptively
Traditional Spark Scheduling
file.map(word => (word, 1)).reduceByKey(_ + _)
    .sortByKey()

Stages: map -> reduce -> sort (the full plan is laid out up front)
Adaptive Planning
file.map(word => (word, 1)).reduceByKey(_ + _)
    .sortByKey()

Stages are launched one at a time (map, then reduce, then sort), so each later stage can be planned using statistics gathered from the stages that have already finished.
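
As a hedged sketch of what "adaptive" buys here, the snippet below picks the reduce stage's task count from the map stage's observed output size instead of fixing it up front; the 64 MB target per task and the 2000-task cap are assumed tuning constants, not Spark's actual defaults.

// Adaptive parallelism sketch: choose the number of reduce tasks after the map
// stage has run, based on how much shuffle data it actually produced.
object AdaptiveParallelismSketch {
  val targetBytesPerTask: Long = 64L * 1024 * 1024  // assumed target per task

  def chooseNumReducers(mapOutputBytes: Seq[Long], maxTasks: Int = 2000): Int = {
    val total = mapOutputBytes.sum
    val wanted = math.ceil(total.toDouble / targetBytesPerTask).toInt
    math.min(math.max(wanted, 1), maxTasks)
  }

  def main(args: Array[String]): Unit = {
    val smallJob = Seq.fill(100)(2L * 1024 * 1024)         // 200 MB total
    val largeJob = Seq.fill(100)(2L * 1024 * 1024 * 1024)  // 200 GB total
    println(chooseNumReducers(smallJob))  // 4 reducers
    println(chooseNumReducers(largeJob))  // capped at 2000 (3200 wanted)
  }
}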
Advanced Example: Join
Goal: Bring together data items with the same key
• Shuffle join: good if both datasets are large
• Broadcast join: good if one dataset is small
• Hybrid join: broadcast the popular keys, shuffle the rest
More details: SPARK-9850
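
A hedged sketch of runtime join selection under assumed thresholds (the 10 MB broadcast limit and the 5% popularity test are illustrative, not Spark's real cost model): once both sides' sizes and key frequencies are known from completed stages, the planner can pick broadcast, shuffle, or the hybrid plan.

// Runtime join choice sketch, driven by statistics observed during execution.
object JoinChoiceSketch {
  sealed trait JoinPlan
  case object BroadcastJoin extends JoinPlan
  case object ShuffleJoin extends JoinPlan
  case class HybridJoin(broadcastKeys: Set[String]) extends JoinPlan

  val broadcastThresholdBytes: Long = 10L * 1024 * 1024  // assumed: 10 MB

  def choose(leftBytes: Long, rightBytes: Long,
             keyFrequencies: Map[String, Long]): JoinPlan = {
    val smaller = math.min(leftBytes, rightBytes)
    if (smaller <= broadcastThresholdBytes) BroadcastJoin
    else {
      // Keys that dominate the data are broadcast; the long tail is shuffled.
      val total = keyFrequencies.values.sum.max(1L)
      val popular = keyFrequencies.filter { case (_, n) => n.toDouble / total > 0.05 }.keySet
      if (popular.nonEmpty) HybridJoin(popular) else ShuffleJoin
    }
  }

  def main(args: Array[String]): Unit = {
    println(choose(50L << 30, 5L << 20, Map.empty))                       // BroadcastJoin
    println(choose(50L << 30, 40L << 30, Map("us" -> 900L, "fi" -> 5L)))  // HybridJoin(Set(us))
  }
}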
Impact of Adaptive Planning
Level of parallelism: 2-3x
Choice of join algorithm: as much as 10x
Follow it at SPARK-9850
Effect of Optimizations in Core
Often, when we made one optimization, we saw all of the
Spark components get faster
• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => speed up all comm-intensive libraries
• Tungsten => DataFrames, SQL and parts of ML
Same applies to other changes in core, e.g. debug tools
Conclusion
Spark has grown a lot, but it still remains the most active open
source project in big data
Small core + high-level API => can make changes quickly
New hardware => exciting optimizations at all levels
Learn More: sparkhub.databricks.com
