From pipelines to refineries:
scaling big data applications
Tim Hunter
About Me
• Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
• Contributor to MLlib
• Author of TensorFrames and GraphFrames
Introduction
• Spark 2.2 in the release process
• Spark 2.3 being thought about
• This is the time to discuss Spark 3
• This presentation is a personal perspective on a future Spark
Introduction
As Spark applications grow in
complexity, what challenges lie
ahead?
Can we find some good
foundations for building big data
frameworks?
"There is nothing more practical than a good theory."
— James Maxwell
What category does Spark live in?
An applicative?
A monad?
A functor?
Introduction
• Anything presented here can be implemented in a modern
programming language
• No need to know Haskell or category theory
Outline
• State of the union
• What is good about Spark?
• What are the trends?
• Classics to the rescue
• Fighting the four horsemen of the datapocalypse
• Laziness to the rescue
• From theory to practice
• Making data processing great again
State of the union
• What we strive for
(Diagram: static data and an input stream feed a continuous application, which serves ad-hoc queries and writes to an output sink.)
State of the union
• What we deal with:
• Coordinating a few tasks
State of the union
• The (rapidly approaching) future
• Thousands of input sources
• Thousands of concurrent requests
• Mixing interactive, batch, streaming
• How do we enable this?
The state of the union
• The image of a pipeline gives you the illusion of
simplicity
• One input and one output
• Current big data systems: the tree paradigm
• Combine multiple inputs into a single output
• The SQL paradigm
• Followed by Spark
• A forest is more than a set of trees
• Multiple inputs, multiple outputs
• The DAG paradigm
Data processing as a complex system
• Orchestrating complex flows is nothing new:
• Tracking and scheduling: Gantt, JIRA, etc.
• Build tools: Bazel (Blaze), Pants, Buck
• General schedulers: AirFlow, FBLearner, Luigi
• Successfully used for orchestrating complex data processing
• But they are general tools:
• They miss the fact that we are dealing with data pipelines
• Separate systems for scheduling and for expressing transforms
• No notion of schema, no control over side effects
Data processing as a complex system
• Any software becomes an exercise in managing complexity.
• Complexity comes from interactions
• How to reduce interactions?
• How to build more complex logic based on simpler elements?
• How to compose data pipelines?
• Continuous evolution
• Let's change the engines in mid-flight, because it sounds great
The ideal big data processing system:
• Scalability
• in quantity (big data) and diversity (lots of sources)
• Chaining
• express the dependencies between the datasets
• Composition
• assemble more complex programs out of simpler ones
• Determinism
• given a set of input data, the output should be unique*
• Coffee break threshold
• quick feedback to the user
How is Spark faring so far?
• You can do it, but it is not easy
• Spark includes multiple programming models:
• The RDD model: low-level manipulation of distributed datasets
• Mostly imperative with some lazy aspects
• The Dataset/DataFrames model:
• Same as RDDs, with some restrictions and a domain-specific language
• The SQL model: only one output, only one query
What can go wrong with this program?
all_clicks = session.read.json("/tables/clicks/year=2017")
all_clicks.cache()
max_session_duration = all_clicks("session_duration").max()
top_sessions = all_clicks.filter(
all_clicks("session_duration") >= 0.9 * max_session_duration)
top_ad_served = top_sessions("ad_idd")
top_ad_served.write.parquet("/output_tables/top_ads")
leak: the dataset is cached but never unpersisted
typo: "ad_idd" instead of "ad_id"
missing directory: the input path may not exist
a few hours… before any of these errors surfaces, because evaluation is eager
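The misspelled column above is only discovered once Spark evaluates the query. A toy sketch in plain Python (the `LazyFrame` class and its API are hypothetical, not part of Spark) of how a pipeline graph that carries its schema can reject the typo at construction time:

```python
class LazyFrame:
    """A lazily built pipeline node that carries its schema."""

    def __init__(self, columns):
        self.columns = set(columns)

    def column(self, name):
        # The check runs while the graph is being built: a misspelled
        # column such as "ad_idd" fails in seconds, instead of after
        # hours of computation on a cluster.
        if name not in self.columns:
            raise KeyError(f"unknown column {name!r}; "
                           f"available: {sorted(self.columns)}")
        return LazyFrame({name})


clicks = LazyFrame({"session_duration", "ad_id"})
clicks.column("session_duration")      # accepted
try:
    clicks.column("ad_idd")            # the typo from the program above
except KeyError as err:
    print(f"rejected at build time: {err}")
```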
The 4 horsemen of the datapocalypse
• Eager evaluation
• Missing source or sink
• Resource leak
• Typing (schema) mismatch
Classics to the rescue
Theoretical foundations for a data system
• A dataset is a collection of elements, all of the same type
• Scala: Dataset[T]
• Principle: the content of a dataset cannot be accessed directly
• A dataset can be queried
• An observable is a single element, with a type
• intuition: a dataset with a single row
• Scala: Observable[T]
Theoretical foundations for a data system
(Diagram: a transform labels each element of a three-image dataset — President, Kitesurfer, Still senator — and an observe step reports count = 3 and largest hand = …)
Theoretical foundations for a data system
• Principle: the observation only depends on the content of the
dataset
• You cannot observe partitions, ordering of elements, location on disk, etc.
• Mathematical consequence: all reduction operations on
datasets are monoids:
• f(A ∪ B) = f(A) + f(B) = f(B) + f(A)
• f(empty) = 0
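The two laws can be checked with a few lines of plain Python (a sketch of the principle, not Spark's implementation): a reduction is given by a per-element mapping, an associative and commutative combine, and a neutral element, and the result is then independent of how the data is partitioned.

```python
from functools import reduce

def observe(partitions, map_fn, combine, empty):
    """Reduce each partition independently, then merge the partial
    results; associativity and commutativity of `combine` make the
    outcome independent of partitioning and of merge order."""
    partials = [reduce(combine, map(map_fn, part), empty)
                for part in partitions]
    return reduce(combine, partials, empty)

# count: map every element to 1, combine with +, empty = 0
# max:   identity mapping, combine with max, empty = -inf
data_a = [[3, 1, 4], [1, 5, 9]]
data_b = [[3], [1, 4, 1], [5, 9]]   # same content, different partitioning

count_a = observe(data_a, lambda x: 1, lambda u, v: u + v, 0)
count_b = observe(data_b, lambda x: 1, lambda u, v: u + v, 0)
max_a = observe(data_a, lambda x: x, max, float("-inf"))
print(count_a, count_b, max_a)  # 6 6 9
```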
Theoretical foundations for a data system
• Principle: closed world assumption
• All the effects are modeled within the framework
• The inputs and the transforms are sufficient to generate the outputs
• Practical consequence: strong checks and sci-fi optimizations
Examples of operations
• They are what you expect:
• Dataset[Int]: a dataset of integers
• Observable[Int]: an observation on a dataset
• max: Dataset[Int] => Observable[Int]
• union: (Dataset[Int], Dataset[Int]) => Dataset[Int]
• collect: Dataset[Int] => Observable[List[Int]]
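These signatures can be mimicked in a few lines of Python (hypothetical wrapper types mirroring the Scala signatures above; in the real system the content of a Dataset would not be exposed directly):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dataset:
    elements: tuple       # content; never directly observable in the real system

@dataclass(frozen=True)
class Observable:
    value: object         # a single, typed element

def ds_max(ds: Dataset) -> Observable:
    # max: Dataset[Int] => Observable[Int]
    return Observable(max(ds.elements))

def union(a: Dataset, b: Dataset) -> Dataset:
    # union: (Dataset[Int], Dataset[Int]) => Dataset[Int]
    return Dataset(a.elements + b.elements)

def collect(ds: Dataset) -> Observable:
    # collect: Dataset[Int] => Observable[List[Int]]
    return Observable(list(ds.elements))

xs = Dataset((1, 7, 3))
ys = Dataset((4, 2))
print(ds_max(union(xs, ys)).value)   # 7
print(collect(ys).value)             # [4, 2]
```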
Karps
• An implementation of these principles on top of Spark (Haskell + Scala)
• It outputs a graph of logical plans for Spark (or other systems)
• Follows the compiler principle:
• I will try to prove you wrong until you have a good chance of being right
• Karps makes a number of correctness checks for your program
• It acts as a global optimizer for your pipeline
Demo 1
This is useless!
• Lazy construction of very complex programs
• Most operations in Spark can be translated to a small set of primitive
actions with well-defined composition rules.
• The optimizer can then rewrite the program without changing the
outcome
• Optimizations can leverage further SQL optimizations
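As a toy illustration of such a rewrite (not Karps itself): the program is built as a graph of nodes, and because count is a monoid reduction, the optimizer may replace count(union(a, b)) with count(a) + count(b) without changing the observable outcome.

```python
# Toy expression nodes for a lazily constructed program.
class Node:
    pass

class Source(Node):
    def __init__(self, rows):
        self.rows = rows

class Union(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

class Count(Node):
    def __init__(self, child):
        self.child = child

class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

def optimize(node):
    """Rewrite count(union(a, b)) -> count(a) + count(b); since count
    is a monoid reduction, the rewrite cannot change the outcome."""
    if isinstance(node, Count) and isinstance(node.child, Union):
        u = node.child
        return Add(optimize(Count(u.left)), optimize(Count(u.right)))
    return node

def run(node):
    if isinstance(node, Source):
        return node.rows
    if isinstance(node, Union):
        return run(node.left) + run(node.right)
    if isinstance(node, Count):
        return len(run(node.child))
    if isinstance(node, Add):
        return run(node.left) + run(node.right)

prog = Count(Union(Source([1, 2, 3]), Source([4, 5])))
print(run(prog), run(optimize(prog)))  # 5 5
```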
Dealing with In and Out
• The only type of I/O: read and write datasets
• This is an observable
• Operations are deterministic + results are cached
• -> only recompute when the data changes
• Demo
Example: Caching
(Diagram: a computation graph that reads data.json and computes a count. Each node carries a content hash derived from its inputs — e.g. data.json at timestamp=2 hashes to 3ab5, and the count node's hash changes from 6e08 to 6e02 when its input changes — so only the nodes whose hashes change need to be recomputed.)
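The mechanism can be sketched with content signatures (a simplified model; the operation names below are illustrative): each node's signature is derived from its operation and its inputs' signatures, so a change in data.json invalidates exactly the nodes downstream of it.

```python
import hashlib

def sig(op, *input_sigs):
    """A node signature derived from the operation name and the
    signatures of its inputs (for a source, its content timestamp)."""
    h = hashlib.sha256()
    h.update(op.encode())
    for s in input_sigs:
        h.update(s.encode())
    return h.hexdigest()[:8]

cache = {}

def cached_run(signature, compute):
    # Recompute only for signatures never seen before.
    if signature not in cache:
        cache[signature] = compute()
    return cache[signature]

def fail():
    raise AssertionError("should not recompute")

# First run: data.json at timestamp=1.
src1 = sig("read:data.json", "timestamp=1")
n1 = cached_run(sig("count", src1), lambda: 42)

# Second run, data unchanged: same signature, nothing recomputed.
n2 = cached_run(sig("count", src1), fail)

# Data changed: the source signature changes, and so do all
# signatures downstream of it, forcing recomputation of the count.
src2 = sig("read:data.json", "timestamp=2")
n3 = cached_run(sig("count", src2), lambda: 43)

print(n1, n2, n3)  # 42 42 43
```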
Example: graph introspection
Future directions
• Python interface (interface + pandas backend)
• Writer module
• Finish GroupBy (cool stuff ahead)
• SQL (simple and cool stuff to do in this area)
Conclusion: trends in data processing
• How to manage the complexity of data flows?
• Taking inspiration from the functional world
• Spark provides a solid foundation
• Laziness, declarative APIs alleviate complexity
Trying this demo
• https://github.com/krapsh/kraps-haskell
• Notebooks + Docker image:
• https://github.com/krapsh/kraps-haskell/tree/master/notebooks
• Server part (scala):
• https://github.com/krapsh/kraps-server
SPARK SUMMIT 2017
DATA SCIENCE AND ENGINEERING AT SCALE
JUNE 5 – 7 | MOSCONE CENTER | SAN FRANCISCO
ORGANIZED BY spark-summit.org/2017
Thank You
