Spark SQL provides a convenient layer of abstraction that lets users express a query's intent while Spark handles the harder task of query optimization. Since Spark 2.3, Pandas UDFs have let users define arbitrary Python functions that execute in batches, giving them the flexibility to write queries for niche cases the built-in functions don't cover.
3. Agenda
Review of Pandas UDFs
Review what they are and go over some development tips
Modeling at Quantcast
How we use Spark SQL in production
Example Problem
Introduce a real problem from our model training pipeline that will be the main focus of our optimization efforts in this talk
Optimization Tips and Tricks
Iteratively and aggressively optimize this problem with Pandas UDFs
4. Optimization Tricks
Do more things in memory
In-memory loops beat generating Spark SQL intermediate rows. Look for ways to do as much work in memory as possible.
Aggregate keys
Try to reduce the number of unique keys in your data and/or process multiple keys in a single UDF call.
Use inverted indices
Works especially well with sparse data.
Use Python libraries
Pandas is easy to work with but slow; use other Python libraries for better performance.
7. What are Pandas UDFs?
▪ UDF = User Defined Function
▪ Pandas UDFs are part of Spark SQL, Apache Spark's module for working with structured data.
▪ Pandas UDFs are a great way to write custom data-processing logic in a developer-friendly environment.
Summary
8. What are Pandas UDFs?
▪ Scalar UDFs. One-to-one mapping functions that support simple return types (no Map/Struct types).
▪ Grouped Map UDFs. Require a groupby operation but can return a variable number of output rows with complicated return types.
▪ Grouped Agg UDFs. I recommend you use Grouped Map UDFs instead.
Types of Pandas UDFs
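For concreteness, here is a minimal hedged sketch of the two main flavors; the dataframe `df` and its columns `id` and `value` are made up for illustration (the GROUPED_MAP decorator style matches the Spark 2.3/2.4 API):

```python
# Minimal sketch of the two recommended flavors (column names are illustrative).
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Scalar UDF: a pandas Series in, a same-length pandas Series out.
@F.pandas_udf(DoubleType())
def times_two(v: pd.Series) -> pd.Series:
    return v * 2.0

df = df.withColumn("doubled", times_two("value"))

# Grouped Map UDF: one pandas DataFrame per group in, any number of
# rows out, with an arbitrary (possibly complex) output schema.
@F.pandas_udf("id string, n long", F.PandasUDFType.GROUPED_MAP)
def count_rows(pdf):
    return pd.DataFrame({"id": [pdf["id"].iloc[0]], "n": [len(pdf)]})

counts = df.groupby("id").apply(count_rows)
```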
9. Development tips and tricks
▪ Use an interactive development framework.
▪ At Quantcast we use Jupyter notebooks.
▪ Develop with mock data.
▪ Pandas UDFs call Python functions. Develop locally against mock data in your interactive environment to iterate quickly on ideas.
▪ Use magic commands (if you are using Jupyter)
▪ Useful commands like %timeit, %time, and %prun allow for easy profiling and performance tuning to squeeze every bit of performance out of your Pandas UDFs.
▪ Use Python's debugging tool
▪ The module is pdb (the Python debugger).
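Putting these tips together, a hedged local-development loop might look like this; `process_id_group` is a hypothetical stand-in for whatever plain Python function your grouped-map UDF wraps:

```python
# Exercise the UDF body locally on mock data before wiring it into Spark.
import pandas as pd

def process_id_group(pdf):
    # Toy body: count rows per id (stand-in for real UDF logic).
    return pd.DataFrame({"id": [pdf["id"].iloc[0]], "n": [len(pdf)]})

mock = pd.DataFrame({
    "id": ["A", "A", "B"],
    "feature_ids": [[0], [0, 1], [100, 101]],
})

# Plain function call, no cluster needed:
print(process_id_group(mock[mock["id"] == "A"]))

# In Jupyter, profile with magic commands:
#   %timeit process_id_group(mock)
#   %prun process_id_group(mock)
# And step through failures interactively:
#   import pdb; pdb.run("process_id_group(mock)")
```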
11. Modeling at Quantcast
▪ Train tens of thousands of models that refresh daily to weekly
▪ Models are trained on first-party data from global internet traffic.
▪ We have a lot of data
▪ Around 400TB raw logs written/day
▪ Data is cleaned and compressed to about 4-5 TB/day
▪ Typically train on several days to months of data for each model.
Scale, scale, and even more scale
13. Example Problem
▪ We have about 10k models that we want to train.
▪ Each of them covers different geo regions
▪ Some of them are over large regions (e.g., everybody in the US)
▪ Some of them are over specific regions (e.g., everybody in San Francisco)
▪ Some of them are over large regions but exclude specific regions (e.g., everybody in the US except people in San Francisco)
▪ For each model, we want to know how many unique ids (users) were found in each region over a certain period of time.
A high level overview
14. Example Problem
▪ Each model has a set of inclusion regions (e.g., US, San Francisco); each id must be in at least one of these regions to be considered part of the model.
▪ Each model has a set of exclusion regions (e.g., San Francisco); each id must be in none of these regions to be considered part of the model.
▪ Each id only needs to satisfy the geo constraints once to be part of the model (e.g., an id that moves from the US to Canada during the training timeframe is considered valid for a US model).
More details
15. Example Problem
With some example tables

Feature Map:

  Feature       | Feature Id
  US            | 0 or 100
  San Francisco | 1 or 101

Feature Store:

  Id | Timestamp | Feature ids     | Model ids
  A  | ts-1      | [0]             | [0, 1]
  B  | ts-2      | [0, 1]          | [0, 1]
  C  | ts-3      | [100, 101]      | [0]
  D  | ts-4      | [0, 1, 2, 3, 4] | [0]
  D  | ts-5      | [999]           | []

Model Data and Result:

  Model                | Geo Incl. | Geo Excl. | # unique ids | Ids
  Model-0 (US)         | [0, 100]  | []        | 4            | A, B, C, D
  Model-1 (US, not SF) | [0, 100]  | [1, 101]  | 2            | A, B
17. Naive approach: Use Spark SQL
▪ Spark has built-in functions to do everything we need
▪ Get all (row, model) pairs using a cross join
▪ Use functions.array_intersect for the inclusion/exclusion logic.
▪ Use groupby and aggregate to get the counts.
▪ Code is really simple (<10 lines of code)
▪ Will test this on a sample of 100k rows.
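In outline, the naive plan might look like this hedged sketch; the dataframe and column names here are assumed for illustration, not the exact production code:

```python
# Naive Spark SQL sketch; `feature_store_df` has (id, feature_ids) and
# `models_df` has (model_id, geo_incl, geo_excl) as array columns (assumed names).
from pyspark.sql import functions as F

pairs = feature_store_df.crossJoin(models_df)

# A row matches a model if it hits at least one inclusion region
# and none of the exclusion regions.
matched = pairs.where(
    (F.size(F.array_intersect("feature_ids", "geo_incl")) > 0)
    & (F.size(F.array_intersect("feature_ids", "geo_excl")) == 0)
)

result = matched.groupBy("model_id").agg(
    F.countDistinct("id").alias("num_unique_ids")
)
```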
19. Naive approach: Use Spark SQL
▪ Only processes about 25 rows/CPU/second
▪ To see why, look at the diagram below.
▪ We generate about 700x as many intermediate rows as input rows to process this.
▪ This is because every row, on average, belongs to several models.
▪ There has to be a better way.
This solution is terrible
Diagram: 100,000 input rows joined against 10,067 models produce 69,697,819 intermediate rows.
20. Optimization: Use Pandas UDFs for Looping
▪ One reason Spark SQL is really slow here is the large number of intermediate rows.
▪ What if we wrote a simple UDF that iterated over all of the rows in memory instead?
▪ For this example problem, it speeds things up by ~1.8x
21. Optimization: Use Pandas UDFs for Looping
▪ Store the model data (model_data_df) in a pandas dataframe.
▪ Use a pandas GROUPED_MAP UDF to process the data for each id.
▪ Figure out which models belong to an id with a nested for loop.
▪ This is faster because we do not have to generate intermediate rows.
The code in a nutshell
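A hedged reconstruction of the approach described above; the schema and column names are assumed:

```python
# Grouped-map loop sketch (schema/names assumed).
import pandas as pd
from pyspark.sql import functions as F

# model_data_df: small pandas DataFrame with columns (model_id, incl, excl),
# where incl/excl are Python sets of feature ids. It is captured in the
# UDF's closure and shipped to the executors.

@F.pandas_udf("id string, model_id long", F.PandasUDFType.GROUPED_MAP)
def models_for_id(pdf):
    # pdf holds every feature-store row for a single id.
    matched = set()
    for feature_ids in pdf["feature_ids"]:  # outer loop: this id's rows
        features = set(feature_ids)
        for model_id, incl, excl in model_data_df.itertuples(index=False):
            # Satisfying the constraints once is enough: any row that hits
            # an inclusion region and no exclusion region qualifies the id.
            if features & incl and not features & excl:
                matched.add(model_id)
    return pd.DataFrame({"id": pdf["id"].iloc[0], "model_id": sorted(matched)})

id_model_pairs = feature_store_df.groupby("id").apply(models_for_id)
result = id_model_pairs.groupBy("model_id").agg(
    F.countDistinct("id").alias("num_unique_ids")
)
```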
22. Optimization: Aggregate keys
▪ In model training, there are some commonly used filters.
▪ Instead of counting unique ids per model, count unique ids per filter.
▪ In this data set, there are 473 unique model filters, far fewer than the 10k models.
▪ ~9.82x faster than the previous solution.
Idea

Most common geo inclusions/exclusions:

  Count | Inclusion | Exclusion
  2035  | [US]      | []
  409   | [GB]      | []
  389   | [CA]      | []
  358   | [AU]      | []
  274   | [DE]      | []
23. Optimization: Aggregate Keys
▪ Create a UDF that iterates over the unique filters (by filter id) instead of model ids.
▪ To get back to model ids, create a table with the mapping from model ids to filter ids (filter_model_id_pairs) and use a broadcast hash join.
The code in a nutshell
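A hedged sketch of the filter-count-plus-join step; names like `id_filter_pairs` are assumed for illustration:

```python
# Count unique ids per *filter*, then fan the counts back out to models
# via a broadcast hash join.
from pyspark.sql import functions as F

# id_filter_pairs: output of a grouped-map UDF like the previous one,
# but iterating over the ~473 unique filters instead of ~10k models.
per_filter = id_filter_pairs.groupBy("filter_id").agg(
    F.countDistinct("id").alias("num_unique_ids")
)

# filter_model_id_pairs: small (filter_id, model_id) mapping table.
result = per_filter.join(
    F.broadcast(filter_model_id_pairs), on="filter_id"
).select("model_id", "num_unique_ids")
```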
24. Optimization: Aggregate Keys in Batches
▪ What if we grouped things by something bigger than a single id?
▪ Generates fewer intermediate rows.
▪ Takes advantage of Python vectorization.
▪ We can rewrite the UDF to take in batches of ~10k ids per UDF call.
▪ ~2.9x faster than the previous solution.
▪ ~51.3x faster than the naive one.
Idea
25. Optimization: Aggregate Keys in Batches
▪ Group rows into batches based on the hash of the id.
▪ Have the UDF group each batch by id and count the ids that satisfy each filter, returning a partial count for each filter id.
▪ The final group-by operation becomes a sum instead of a count because the batches return partial counts.
The code in a nutshell
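A hedged sketch of the batched version; the batch count and names are illustrative:

```python
# Batch ~10k ids per UDF call by hashing the id.
import collections
import pandas as pd
from pyspark.sql import functions as F

NUM_BATCHES = 1000  # illustrative; pick so each batch holds ~10k ids

batched = feature_store_df.withColumn(
    "batch", F.abs(F.hash("id")) % NUM_BATCHES
)

# filters: small pandas DataFrame (filter_id, incl, excl), incl/excl as sets.
@F.pandas_udf("filter_id long, partial_count long", F.PandasUDFType.GROUPED_MAP)
def count_batch(pdf):
    counts = collections.Counter()
    for _, rows in pdf.groupby("id"):  # split the batch back into ids
        for filter_id, incl, excl in filters.itertuples(index=False):
            if any(set(f) & incl and not set(f) & excl
                   for f in rows["feature_ids"]):
                counts[filter_id] += 1
    return pd.DataFrame(
        {"filter_id": list(counts), "partial_count": list(counts.values())}
    )

partials = batched.groupby("batch").apply(count_batch)

# Partial counts from each batch are summed, not re-counted.
per_filter = partials.groupBy("filter_id").agg(
    F.sum("partial_count").alias("num_unique_ids")
)
```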
26. Optimization: Inverted Indexes
▪ Each feature store row has relatively few unique features.
▪ Feature store rows have 10-20 features/row.
▪ There are ~500 unique filters.
▪ Use an inverted index to iterate over a row's features instead of the filters.
▪ Use set operations for the inclusion/exclusion logic.
▪ ~6.44x faster than the previous solution.
Idea
27. Optimization: Inverted Indexes
▪ Create maps from feature id to the inclusion/exclusion filters it appears in.
▪ Use those maps to get the set of inclusion/exclusion filters each row belongs to.
▪ Use set operations to perform the inclusion/exclusion logic.
▪ Have each UDF call process batches of ids.
The code in a nutshell
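A hedged sketch of the inverted-index logic as plain Python, the way it would run inside the UDF; the `filters` structure is assumed:

```python
# Invert the filter definitions so we iterate over a row's 10-20
# features instead of the ~500 filters.
import collections

# filters: iterable of (filter_id, incl_set, excl_set); values echo the
# example tables earlier in the talk.
filters = [
    (0, {0, 100}, set()),      # Model-0 style: US
    (1, {0, 100}, {1, 101}),   # Model-1 style: US, not SF
]

incl_index = collections.defaultdict(set)  # feature id -> filters including it
excl_index = collections.defaultdict(set)  # feature id -> filters excluding it
for filter_id, incl, excl in filters:
    for feature in incl:
        incl_index[feature].add(filter_id)
    for feature in excl:
        excl_index[feature].add(filter_id)

def filters_for(feature_ids):
    # Look up only the features this row actually has.
    included = set().union(*(incl_index[f] for f in feature_ids))
    excluded = set().union(*(excl_index[f] for f in feature_ids))
    # Set difference implements "in some inclusion region, in no exclusion region".
    return included - excluded

print(filters_for([0, 1]))  # -> {0}: filter 1 is knocked out by feature 1
```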
28. Optimization: Use python libraries
▪ Pandas is optimized for ease of use, not speed.
▪ Use Python libraries (e.g., itertools) to make the Python code run faster.
▪ functools.reduce and NumPy are also good candidates to consider for other UDFs.
▪ ~2.6x faster than the previous solution.
▪ ~860x faster than the naive solution!
Idea
29. Optimization: Use python libraries
▪ Use .values to extract raw arrays from a pandas dataframe.
▪ Use itertools to iterate faster than plain Python for loops.
▪ itertools.groupby is used to group the sorted data (it only merges consecutive equal keys).
▪ itertools.chain.from_iterable is used to flatten a nested loop into a single iterator.
The code in a nutshell
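A hedged sketch of the itertools-style inner loop; column names are assumed:

```python
# Replace pandas groupby with sort + itertools.groupby on raw numpy arrays.
import itertools
import numpy as np
import pandas as pd

def feature_sets_by_id(pdf: pd.DataFrame) -> dict:
    # .values pulls raw numpy arrays out of the dataframe, skipping
    # pandas overhead on the hot path.
    ids = pdf["id"].values
    feats = pdf["feature_ids"].values

    # itertools.groupby only merges *consecutive* equal keys, so sort first.
    order = np.argsort(ids)
    rows = zip(ids[order], feats[order])

    # chain.from_iterable flattens each id's feature lists into one stream.
    return {
        id_: set(itertools.chain.from_iterable(f for _, f in group))
        for id_, group in itertools.groupby(rows, key=lambda r: r[0])
    }

pdf = pd.DataFrame({"id": ["B", "A", "A"],
                    "feature_ids": [[0, 1], [0], [0, 1]]})
print(feature_sets_by_id(pdf))  # {'A': {0, 1}, 'B': {0, 1}}
```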
30. Optimization: Summary
▪ Pandas UDFs are extremely flexible and can be used to speed up Spark SQL.
▪ We walked through a problem where applying these optimization tricks yielded an almost 1000x speedup.
▪ Apply these tricks to your own problems and watch things accelerate.
Key takeaways