SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Accelerating Data
Processing in Spark
SQL with Pandas UDFs
Michael Tong, Quantcast
Machine Learning Engineer
Agenda
Review of Pandas UDFs
Review what they are and go over some
development tips
Modeling at Quantcast
How we use Spark SQL in production
Example Problem
Introduce a real problem from our model training
pipeline that will be the main focus of our
optimization efforts for this talk.
Optimization tips and tricks.
Iteratively and aggressively optimize this problem
with pandas UDFs
Optimization Tricks
Do more things in memory
Loops > spark SQL intermediate rows. Look for
ways to do as many things in memory as possible
Aggregate Keys
Try to reduce the number of unique keys in your
data and/or process multiple keys in a single UDF
call.
Use inverted indices
Works especially well with sparse data.
Use python libraries
Pandas is easy to work with but slow, use other
python libraries for better performance
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Review of Pandas UDFs
What are Pandas UDFs?
▪ UDF = User Defined Function
▪ Pandas UDFs are part of Spark SQL,
which is Apache spark’s module for
working with structured data.
▪ Pandas UDFs are a great way of
writing custom data-processing
logic in a developer-friendly
environment.
Summary
What are Pandas UDFs?
▪ Scalar UDFs. One-to-one mapping
function that supports simple return
types (no Map/Struct types)
▪ Grouped Map UDFs. Requires a
groupby operation but can return a
variable number of output rows with
complicated return types.
▪ Grouped Agg UDFs. I recommend you
use Grouped Map UDFs instead.
Types of Pandas UDFs
Development tips and tricks
▪ Use an interactive development framework.
▪ At Quantcast we use jupyter notebooks.
▪ Develop with mock data.
▪ Pandas UDFs call python functions. Develop in your local environment using mock data in you interactive environment to quickly
iterate quickly on ideas when developing code.
▪ Use magic commands (if you are using jupyter)
▪ Useful commands like timeit, time and prun allow for easy profiling and performance tuning to squeeze every bit of performance out
of your pandas UDFs.
▪ Use python’s debugging tool
▪ The module is pdb (python debugger).
Modeling at Quantcast
Modeling at Quantcast
▪ Train tens of thousands of models
that refresh daily-weekly
▪ Models trained off first party data
from global internet traffic.
▪ We have a lot of data
▪ Around 400TB raw logs written/day
▪ Data is cleaned and compressed to
about 4-5 TB/day
▪ Typically train off of several days-
months of data for each model.
Scale, scale, and even more scale
Example Problem
Example Problem
▪ We have about 10k models that we want to train.
▪ Each of them cover different geo regions
▪ Some of them are over large regions (i.e. everybody in the US)
▪ Some of them are over specific regions (i.e. everybody in San Francisco)
▪ Some of them are over large regions but exclude specific regions (i.e. everybody in the US except people in San Francisco)
▪ For each model, we want to know how many unique ids (users) were
found in each region over a certain period of time.
A high level overview
Example Problem
▪ Each model will have a set of inclusion regions (i.e. US, San Francisco)
where each id must be in one of these regions to be considered part
of the model.
▪ Each model will have a set of exclusion regions (i.e. San Francisco)
where each id must be in none of these regions to be considered part
of the model.
▪ Each id only needs to satisfy the geo constraints once to be part of
the model (i.e., an id that moves from the US to Canada during the
training timeframe is considered valid for a US model)
More details
Example Problem
With some example tables
Feature Feature Id
US 0 or 100
San Francisco 1 or 101
Feature Map
Model Data and Result
Model Geo Incl. Geo Excl. # unique
Ids
ids
Model-0 (US) [0, 100] [] 4 A, B, C, D
Model-1 (US, not SF) [0, 100] [1, 101] 2 A, B
Id Timestamp Feature ids Model ids
A ts-1 [0] [0, 1]
B ts-2 [0, 1] [0, 1]
C ts-3 [100, 101] [0]
D ts-4 [0, 1, 2, 3, 4] [0]
D ts-5 [999] []
Feature Store
Optimization:
Tips and Tricks
Naive approach: Use Spark SQL
▪ Spark has built in functions to do everything we need
▪ Get all (row, model) pairs using a cross join
▪ Use functions.array_intersect for the inclusion/exclusion logic.
▪ Use groupby and aggregate to get the counts..
▪ Code is really simple (<10 lines of code)
▪ Will test this on a sample of 100k rows.
Naive approach: Use Spark SQL
Source code
Naive approach: Use Spark SQL
▪ Only processes about 25
rows/CPU/second
▪ To see why, look at the graph.
▪ We generate about 700x the number
of intermediate rows as our input
data to process this.
▪ This is because every row on average
belongs to several models.
▪ There has to be a better way.
This solution is terrible
10,067
(# models)
100,000
(# input rows)
69,697,819
(# rows)
Optimization: Use Pandas UDFs for Looping
▪ One reason why spark is really slow is
because of the large number of
intermediate rows.
▪ What if we wrote a simple UDF that
would iterate over all of the rows in
memory instead?
▪ For this example problem, it speeds
things up by ~1.8x
Optimization: Use Pandas UDFs for Looping
▪ Store the model data
(model_data_df) in a pandas
dataframe.
▪ Use a pandas GROUPED_MAP UDF to
process the data for each id.
▪ Figure out which models belong to an
id in a nested for loop
▪ This is faster because we do not
have to generate intermediate rows.
The code in a nutshell
Optimization: Aggregate keys
▪ In model training, there are some
commonly used filters.
▪ Instead of counting the number of
unique models, count the number of
unique filters.
▪ In this data set, there are 473 unique
model filters, which is much less
than 10k models.
▪ ~9.82x faster than the previous
solution.
Most common geo
inclusions/exclusions
Idea
Count Inclusion Exclusion
2035 [US] []
409 [GB] []
389 [CA] []
358 [AU] []
274 [DE] []
Optimization: Aggregate Keys
▪ Create a UDF that iterates over the
unique filters (by filterId) instead of
model ids.
▪ In order to get the model ids, create
a table that contains the mapping
from model ids to filter ids
(filter_model_id_pairs) and use a
broadcast hash join.
The code in a nutshell
Optimization: Aggregate Keys in Batches
▪ What if we grouped things by
something bigger than an id?
▪ Generate less intermediate rows.
▪ Take advantage of python
vectorization.
▪ We can rewrite a UDF that takes in
batches of ~10k ids per UDF call.
▪ ~2.9x faster than the previous
solution.
▪ ~51.3x faster than the naive one.
Idea
Optimization: Aggregate Keys in Batches
▪ Group things into batches based off
the hash of the id.
▪ Have the UDF group each batch by id
and count the number of ids that
satisfy each filter, returning a partial
count for each filter id.
▪ The group by operation becomes a
sum instead of a count because we
do partial counts in the batches.
The code in a nutshell
Optimization: Inverted Indexes
▪ Each feature store row has relatively
few unique features.
▪ Feature store rows have 10-20
features/row.
▪ There are ~500 unique filters.
▪ Use an inverted index to iterate over
the unique features instead of filters.
▪ Use set operations for
inclusion/exclusion logic
▪ ~6.44x faster than previous solution.
Idea
Optimization: Inverted Indexes
▪ Create maps for filter id to
inclusion/exclusion filters.
▪ Use those maps to get the set of
inclusion/exclusion filters each row
belongs to.
▪ Use set operations to perform the
inclusion/exclusion logic.
▪ Have each UDF call process batches
of ids.
The code in a nutshell
Optimization: Use python libraries
▪ Pandas is optimized for ease of use,
not speed.
▪ Use python libraries (itertools) to
make python code run faster.
▪ reduce, and numpy are also good
candidates to consider for other
UDFs.
▪ ~2.6x faster than previous solution.
▪ ~860x faster than naive solution!
Idea
Optimization: Use python libraries
▪ Use .values to extract the columns
from a pandas dataframe.
▪ Use itertools to iterate through for
loops faster than default for loops.
▪ itertools.group_by is used to sort and
group the data.
▪ itertools.chain.from_iterable is to iterate
through a nested for loop.
The code in a nutshell
Optimization: Summary
▪ Pandas UDFs are extremely flexible
and can be used to speed up spark
SQL.
▪ We discussed a problem where we
could apply optimization tricks for
almost 1000x speedup.
▪ Apply these tricks to your own
problems and watch things
accelerate.
Key takeaways
Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Thank you

Más contenido relacionado

La actualidad más candente

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkDatabricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight OverviewJacques Nadeau
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Databricks
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextRafał Kuć
 

La actualidad más candente (20)

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - Sematext
 

Similar a Accelerating Data Processing in Spark SQL with Pandas UDFs

Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studiooptimizatiodirectdirect
 
Converting Your Legacy Data to S1000D
Converting Your Legacy Data to S1000DConverting Your Legacy Data to S1000D
Converting Your Legacy Data to S1000Ddclsocialmedia
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Performant Django - Ara Anjargolian
Performant Django - Ara AnjargolianPerformant Django - Ara Anjargolian
Performant Django - Ara AnjargolianHakka Labs
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity ManagementEDB
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Databricks
 
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningLeveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningDatabricks
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programmingJuggernaut Liu
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud EraMydbops
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineNicolas Morales
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 

Similar a Accelerating Data Processing in Spark SQL with Pandas UDFs (20)

Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
 
Converting Your Legacy Data to S1000D
Converting Your Legacy Data to S1000DConverting Your Legacy Data to S1000D
Converting Your Legacy Data to S1000D
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Performant Django - Ara Anjargolian
Performant Django - Ara AnjargolianPerformant Django - Ara Anjargolian
Performant Django - Ara Anjargolian
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningLeveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
 
Large scalecplex
Large scalecplexLarge scalecplex
Large scalecplex
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programming
 
Sap abap
Sap abapSap abap
Sap abap
 
Sap abap
Sap abapSap abap
Sap abap
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud Era
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop Engine
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx9to5mart
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Último (20)

➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Accelerating Data Processing in Spark SQL with Pandas UDFs

  • 1.
  • 2. Accelerating Data Processing in Spark SQL with Pandas UDFs Michael Tong, Quantcast Machine Learning Engineer
  • 3. Agenda Review of Pandas UDFs Review what they are and go over some development tips Modeling at Quantcast How we use Spark SQL in production Example Problem Introduce a real problem from our model training pipeline that will be the main focus of our optimization efforts for this talk. Optimization tips and tricks. Iteratively and aggressively optimize this problem with pandas UDFs
  • 4. Optimization Tricks Do more things in memory Loops > spark SQL intermediate rows. Look for ways to do as many things in memory as possible Aggregate Keys Try to reduce the number of unique keys in your data and/or process multiple keys in a single UDF call. Use inverted indices Works especially well with sparse data. Use python libraries Pandas is easy to work with but slow, use other python libraries for better performance
  • 5. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 7. What are Pandas UDFs? ▪ UDF = User Defined Function ▪ Pandas UDFs are part of Spark SQL, which is Apache spark’s module for working with structured data. ▪ Pandas UDFs are a great way of writing custom data-processing logic in a developer-friendly environment. Summary
  • 8. What are Pandas UDFs? ▪ Scalar UDFs. One-to-one mapping function that supports simple return types (no Map/Struct types) ▪ Grouped Map UDFs. Requires a groupby operation but can return a variable number of output rows with complicated return types. ▪ Grouped Agg UDFs. I recommend you use Grouped Map UDFs instead. Types of Pandas UDFs
  • 9. Development tips and tricks ▪ Use an interactive development framework. ▪ At Quantcast we use jupyter notebooks. ▪ Develop with mock data. ▪ Pandas UDFs call python functions. Develop in your local environment using mock data in you interactive environment to quickly iterate quickly on ideas when developing code. ▪ Use magic commands (if you are using jupyter) ▪ Useful commands like timeit, time and prun allow for easy profiling and performance tuning to squeeze every bit of performance out of your pandas UDFs. ▪ Use python’s debugging tool ▪ The module is pdb (python debugger).
  • 11. Modeling at Quantcast ▪ Train tens of thousands of models that refresh daily-weekly ▪ Models trained off first party data from global internet traffic. ▪ We have a lot of data ▪ Around 400TB raw logs written/day ▪ Data is cleaned and compressed to about 4-5 TB/day ▪ Typically train off of several days- months of data for each model. Scale, scale, and even more scale
  • 13. Example Problem ▪ We have about 10k models that we want to train. ▪ Each of them cover different geo regions ▪ Some of them are over large regions (i.e. everybody in the US) ▪ Some of them are over specific regions (i.e. everybody in San Francisco) ▪ Some of them are over large regions but exclude specific regions (i.e. everybody in the US except people in San Francisco) ▪ For each model, we want to know how many unique ids (users) were found in each region over a certain period of time. A high level overview
  • 14. Example Problem ▪ Each model will have a set of inclusion regions (i.e. US, San Francisco) where each id must be in one of these regions to be considered part of the model. ▪ Each model will have a set of exclusion regions (i.e. San Francisco) where each id must be in none of these regions to be considered part of the model. ▪ Each id only needs to satisfy the geo constraints once to be part of the model (i.e., an id that moves from the US to Canada during the training timeframe is considered valid for a US model) More details
  • 15. Example Problem With some example tables Feature Feature Id US 0 or 100 San Francisco 1 or 101 Feature Map Model Data and Result Model Geo Incl. Geo Excl. # unique Ids ids Model-0 (US) [0, 100] [] 4 A, B, C, D Model-1 (US, not SF) [0, 100] [1, 101] 2 A, B Id Timestamp Feature ids Model ids A ts-1 [0] [0, 1] B ts-2 [0, 1] [0, 1] C ts-3 [100, 101] [0] D ts-4 [0, 1, 2, 3, 4] [0] D ts-5 [999] [] Feature Store
  • 17. Naive approach: Use Spark SQL ▪ Spark has built in functions to do everything we need ▪ Get all (row, model) pairs using a cross join ▪ Use functions.array_intersect for the inclusion/exclusion logic. ▪ Use groupby and aggregate to get the counts.. ▪ Code is really simple (<10 lines of code) ▪ Will test this on a sample of 100k rows.
  • 18. Naive approach: Use Spark SQL Source code
  • 19. Naive approach: Use Spark SQL ▪ Only processes about 25 rows/CPU/second ▪ To see why, look at the graph. ▪ We generate about 700x the number of intermediate rows as our input data to process this. ▪ This is because every row on average belongs to several models. ▪ There has to be a better way. This solution is terrible 10,067 (# models) 100,000 (# input rows) 69,697,819 (# rows)
  • 20. Optimization: Use Pandas UDFs for Looping ▪ One reason why spark is really slow is because of the large number of intermediate rows. ▪ What if we wrote a simple UDF that would iterate over all of the rows in memory instead? ▪ For this example problem, it speeds things up by ~1.8x
  • 21. Optimization: Use Pandas UDFs for Looping ▪ Store the model data (model_data_df) in a pandas dataframe. ▪ Use a pandas GROUPED_MAP UDF to process the data for each id. ▪ Figure out which models belong to an id in a nested for loop ▪ This is faster because we do not have to generate intermediate rows. The code in a nutshell
  • 22. Optimization: Aggregate keys ▪ In model training, there are some commonly used filters. ▪ Instead of counting the number of unique models, count the number of unique filters. ▪ In this data set, there are 473 unique model filters, which is much less than 10k models. ▪ ~9.82x faster than the previous solution. Most common geo inclusions/exclusions Idea Count Inclusion Exclusion 2035 [US] [] 409 [GB] [] 389 [CA] [] 358 [AU] [] 274 [DE] []
  • 23. Optimization: Aggregate Keys ▪ Create a UDF that iterates over the unique filters (by filterId) instead of model ids. ▪ In order to get the model ids, create a table that contains the mapping from model ids to filter ids (filter_model_id_pairs) and use a broadcast hash join. The code in a nutshell
  • 24. Optimization: Aggregate Keys in Batches ▪ What if we grouped things by something bigger than an id? ▪ Generate less intermediate rows. ▪ Take advantage of python vectorization. ▪ We can rewrite a UDF that takes in batches of ~10k ids per UDF call. ▪ ~2.9x faster than the previous solution. ▪ ~51.3x faster than the naive one. Idea
  • 25. Optimization: Aggregate Keys in Batches ▪ Group things into batches based off the hash of the id. ▪ Have the UDF group each batch by id and count the number of ids that satisfy each filter, returning a partial count for each filter id. ▪ The group by operation becomes a sum instead of a count because we do partial counts in the batches. The code in a nutshell
  • 26. Optimization: Inverted Indexes ▪ Each feature store row has relatively few unique features. ▪ Feature store rows have 10-20 features/row. ▪ There are ~500 unique filters. ▪ Use an inverted index to iterate over the unique features instead of filters. ▪ Use set operations for inclusion/exclusion logic ▪ ~6.44x faster than previous solution. Idea
  • 27. Optimization: Inverted Indexes ▪ Create maps for filter id to inclusion/exclusion filters. ▪ Use those maps to get the set of inclusion/exclusion filters each row belongs to. ▪ Use set operations to perform the inclusion/exclusion logic. ▪ Have each UDF call process batches of ids. The code in a nutshell
  • 28. Optimization: Use python libraries ▪ Pandas is optimized for ease of use, not speed. ▪ Use python libraries (itertools) to make python code run faster. ▪ reduce, and numpy are also good candidates to consider for other UDFs. ▪ ~2.6x faster than previous solution. ▪ ~860x faster than naive solution! Idea
  • 29. Optimization: Use python libraries ▪ Use .values to extract the columns from a pandas dataframe. ▪ Use itertools to iterate through for loops faster than default for loops. ▪ itertools.group_by is used to sort and group the data. ▪ itertools.chain.from_iterable is to iterate through a nested for loop. The code in a nutshell
  • 30. Optimization: Summary ▪ Pandas UDFs are extremely flexible and can be used to speed up spark SQL. ▪ We discussed a problem where we could apply optimization tricks for almost 1000x speedup. ▪ Apply these tricks to your own problems and watch things accelerate. Key takeaways
  • 32. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.