SlideShare una empresa de Scribd logo
1 de 29
Descargar para leer sin conexión
Spark Application
Development Made
Fast and Easy
Shivnath Babu
Lance Co Ting Keh
Lance Co Ting Keh
Machine Learning @ Box
Distributed ML Infrastructure
Go Blue Devils!
Shivnath Babu
Associate Professor @ Duke
Chief Scientist at Unravel Data Sys.
R&D in Management of Data Systems
Spark @Box
What’s so great about Spark?
From:
https://weminoredinfilm.files.wordpress.com/2014/04/59911951.jpg
Complexity
Spark Execution
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
map filter reduceBykey map saveAsTextFile
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
RDD0 RDD1 RDD2 RDD3 RDD4 HDFS
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
Spark Execution
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
map filter reduceBykey map saveAsTextFile
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
HDFS
Stage 0 Stage 1
RDD0 RDD1 RDD2 RDD3 RDD4
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
map filter reduceBykey map saveAsTextFile
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
RDD0 RDD1 RDD2 RDD3 RDD4 HDFS
Stage 0 Stage 1
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
Spark Execution
Spark Execution
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
map filter reduceBykey map saveAsTextFile
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
Exec0Exec1Exec2Exec3
RDD0 RDD1 RDD2 RDD3 RDD4 HDFS
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
Stage 0 Stage 1
Spark Execution
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
map filter reduceBykey map saveAsTextFile
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
Exec0Exec1Exec2Exec3
RDD0 RDD1 RDD2 RDD3 RDD4 HDFS
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
Stage 0 Stage 1
Spark Execution
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
map filter reduceBykey map saveAsTextFile
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
part-0
part-1
part-2
part-3
Exec0Exec1Exec2Exec3
RDD0 RDD1 RDD2 RDD3 RDD4 HDFS
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
Stage 0 Stage 1
Anything that can go wrong,
will go wrong (at some point)
What can go wrong?
Failures
•  My query failed after 6 hours!
•  What does this exception mean?
What can go wrong?
•  Failures
•  My query failed after 6 hours!
•  What does this exception mean?
•  Wrong results
•  Result of my job looks wrong
•  Bad performance
•  My app is very slow
•  Pipeline is not meeting the 4hr SLA
•  Poor scalability
•  Oh, but it worked on the dev cluster!
•  Bad App(le)s
•  Tom’s query brought the cluster down
•  Application Problems
•  Poor choice of transformations
•  Ineffective caching
•  Bloated data structures
•  Data/Storage Problems
•  Skewed data, load imbalance
•  Small files, poor data partitioning
•  Spark Problems
•  Shuffle
•  Lazy evaluation causes confusion
•  Resource Problems
•  Resource contention
•  Performance degradation
And Why?
How do application developers
detect & fix these problems today?
SparkApplicationDevMadeEasy_Spark_Summit_2015
Too much to look at!
Look at Logs?
Logs in distributed systems are spread out, incomplete,
& usually very difficult to understand
There has to be a better way for
application developers to
detect & fix problems
Visualize:
Show me all relevant data in one place
Optimize:
Analyze the data for me and give me
diagnoses and fixes
Strategize:
Help me prevent the problems from
happening and meet my goals
Demo
Visualize:
Show me all relevant data in one place
Visualize:
Show me all relevant data in one place
Optimize:
Analyze the data for me and give me diagnoses
and fixes
Strategize:
Help me prevent the problems from happening
and meet my goals
Strategize:
Help me prevent the problems from happening
and meet my goals
We are hiring: jobs@unraveldata.com
Sign up for early access at:
bit.ly/unravelspark

Más contenido relacionado

La actualidad más candente

Patterns and Anti-Patterns for Memorializing Data Science Project Artifacts
Patterns and Anti-Patterns for Memorializing Data Science Project ArtifactsPatterns and Anti-Patterns for Memorializing Data Science Project Artifacts
Patterns and Anti-Patterns for Memorializing Data Science Project ArtifactsDatabricks
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&MDatabricks
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingKeira Zhou
 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)Jasjeet Thind
 
Hyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven HafenegerHyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven Hafenegersparktc
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Databricks
 
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Databricks
 
Serverless data pipelines gcp
Serverless data pipelines gcpServerless data pipelines gcp
Serverless data pipelines gcpCatherine Kimani
 
H2O intro at Dallas Meetup
H2O intro at Dallas MeetupH2O intro at Dallas Meetup
H2O intro at Dallas MeetupSri Ambati
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
Scalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2OScalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2OSri Ambati
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleMateusz Dymczyk
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Databricks
 

La actualidad más candente (16)

Patterns and Anti-Patterns for Memorializing Data Science Project Artifacts
Patterns and Anti-Patterns for Memorializing Data Science Project ArtifactsPatterns and Anti-Patterns for Memorializing Data Science Project Artifacts
Patterns and Anti-Patterns for Memorializing Data Science Project Artifacts
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&M
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark Streaming
 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
 
Hyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven HafenegerHyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven Hafeneger
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
 
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
 
Serverless data pipelines gcp
Serverless data pipelines gcpServerless data pipelines gcp
Serverless data pipelines gcp
 
H2O intro at Dallas Meetup
H2O intro at Dallas MeetupH2O intro at Dallas Meetup
H2O intro at Dallas Meetup
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Scalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2OScalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2O
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scale
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
 

Similar a SparkApplicationDevMadeEasy_Spark_Summit_2015

Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made EasyDataWorks Summit
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
Tech days 2011 - database design patterns for keeping your database applicati...
Tech days 2011 - database design patterns for keeping your database applicati...Tech days 2011 - database design patterns for keeping your database applicati...
Tech days 2011 - database design patterns for keeping your database applicati...Charley Hanania
 
Pass chapter meeting - november - partitioning for database availability - ch...
Pass chapter meeting - november - partitioning for database availability - ch...Pass chapter meeting - november - partitioning for database availability - ch...
Pass chapter meeting - november - partitioning for database availability - ch...Charley Hanania
 
Sql interview question part 9
Sql interview question part 9Sql interview question part 9
Sql interview question part 9kaashiv1
 
Sql interview-question-part-9
Sql interview-question-part-9Sql interview-question-part-9
Sql interview-question-part-9kaashiv1
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 
Agile Business Intelligence
Agile Business IntelligenceAgile Business Intelligence
Agile Business IntelligenceEvan Leybourn
 
Bb world 2011 capacity planning
Bb world 2011 capacity planningBb world 2011 capacity planning
Bb world 2011 capacity planningSteve Feldman
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarNilesh Shah
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015Chip Childers
 
Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12Enkitec
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningSergey Karayev
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Pythondidip
 
Designing and developing your database for application availability
Designing and developing your database for application availabilityDesigning and developing your database for application availability
Designing and developing your database for application availabilityCharley Hanania
 
Discovering the Plan Cache (#SQLSat 206)
Discovering the Plan Cache (#SQLSat 206)Discovering the Plan Cache (#SQLSat 206)
Discovering the Plan Cache (#SQLSat 206)Jason Strate
 

Similar a SparkApplicationDevMadeEasy_Spark_Summit_2015 (20)

Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
Tech days 2011 - database design patterns for keeping your database applicati...
Tech days 2011 - database design patterns for keeping your database applicati...Tech days 2011 - database design patterns for keeping your database applicati...
Tech days 2011 - database design patterns for keeping your database applicati...
 
Pass chapter meeting - november - partitioning for database availability - ch...
Pass chapter meeting - november - partitioning for database availability - ch...Pass chapter meeting - november - partitioning for database availability - ch...
Pass chapter meeting - november - partitioning for database availability - ch...
 
Ebook9
Ebook9Ebook9
Ebook9
 
Sql interview question part 9
Sql interview question part 9Sql interview question part 9
Sql interview question part 9
 
Ebook9
Ebook9Ebook9
Ebook9
 
Sql interview-question-part-9
Sql interview-question-part-9Sql interview-question-part-9
Sql interview-question-part-9
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
Agile Business Intelligence
Agile Business IntelligenceAgile Business Intelligence
Agile Business Intelligence
 
Bb world 2011 capacity planning
Bb world 2011 capacity planningBb world 2011 capacity planning
Bb world 2011 capacity planning
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015
 
Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep Learning
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Python
 
Os Solomon
Os SolomonOs Solomon
Os Solomon
 
Designing and developing your database for application availability
Designing and developing your database for application availabilityDesigning and developing your database for application availability
Designing and developing your database for application availability
 
Discovering the Plan Cache (#SQLSat 206)
Discovering the Plan Cache (#SQLSat 206)Discovering the Plan Cache (#SQLSat 206)
Discovering the Plan Cache (#SQLSat 206)
 

SparkApplicationDevMadeEasy_Spark_Summit_2015