Last year we decided to build an in-house solution for Funnel analysis which should be accessible to our business user through our BI tool. Backend part should run on Ap;ache Spark and since the BI tool can only run SQL queries that implies that the solution is a pure Spark SQL implementation of Funnel analysis. In this talk we will cover various Spark SQL features we have used to optimize query performance and implement various filters which enable end users to get actionable insights. KEY TAKEAWAYS: – single query approach to Funnel analysis (can be applied to any funnel-like problem) – using window functions to ensure ordering of the events in the funnel – examples of higher order functions to calculate funnel metrics
3. About me
Software engineer for the past 13 years and
started working with data 12 years ago with
Oracle and moved to reporting and BI over the
years. Last 4 years enabling business users at
GetYourGuide to make better decisions with data.
Senior BI Engineer, Data Platform
3
4. Agenda
1. Intro to GetYourGuide
2. Introduction to Funnels
3. Deep Dive
4. Further Possibilities
5. End Result
4
6. We make it simple to book and enjoy
incredible experiences
7. Europe’s largest marketplace
for travel experiences
50k+
Products in 150+
countries
25M+
Tickets sold
$650M+
In VC funding
600+
Strong global team
150+
Traveler nationalities
9. Requirements
1. Looker as frontend and Spark as backend
2. Respect the order of events
3. Each step can consist of multiple events
4. Anything can happen between two steps of the Funnel
5. Support for Funnel wide and step specific filters
6. Sessions based on Touch-points (for some use cases)
7. Performance: 4 weeks of data in under 60 sec ideally under 30 sec
8. Option to ignore the order of events :)
9#UnifiedDataAnalytics #SparkAISummit
10. Internal vs External solution
1. Performance
○ External is faster because it’s custom built bottom up
2. Where is the data?
○ With internal solution we are not sending data to 3rd parties
3. Flexibility
○ With internal solution we can join it to all of our internal data
to bring more insights to our stakeholders
○ Adding it to internal dashboards
4. Cost
○ Differs for each company :)
12. Funnel filtering explained
If you define that you want see visitors in the funnel A - B - C - D - E
How can the actual funnel look like?
1) xyzDBz A CyxD B ABAB C BC DE
2) xyzlmnop A CCCC
How many steps do we match in this cases
1) A B C D E
2) A B A B E
3) A B C E D
14. How do we build a Funnel
1) We filter for only selected Events
2) Concatenate all the events in a single session into a string
a) We use an alias for each step (A,B,C…)
i) It streamlines the rest of the query
ii) And nothing changed when we added step specific
filters
3) We compare the generated string with the Funnel specified by
filters in our BI tool
15. Pseudo SQL of the implementation
SELECT CASE … WHEN FROM (
SELECT concat_ws('', collect_list(
CASE
WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’
WHEN event = ‘ProductPage’ AND product_id = ‘123’ THEN ‘B’
...
END)OVER(PARTITION BY session ORDER BY timestamp ROWS …)
AS funnel
// funnel => “ABABBBCDE...”
FROM event_log
WHERE ((event IN (‘LandingPage’,‘HomePage’))
OR (event = ‘ProductPage’ AND product_id = ‘123’) …)
AND date = ...
) t
WHERE t.funnel RLIKE '...'
16. Pseudo SQL - Innere Where
SELECT CASE … WHEN FROM (
SELECT concat_ws('', collect_list(
CASE
WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’
WHEN event = ‘ProductPage’ AND product_id = ‘123’ THEN ‘B’
...
END)OVER(PARTITION BY session ORDER BY timestamp ROWS …)
AS funnel
FROM event_log
WHERE 1=1
AND ((event IN (‘LandingPage’,‘HomePage’))
OR (event = ‘ProductPage’ AND product_id = ‘123’) …)
AND date BETWEEN ...
) t
WHERE t.funnel RLIKE '...'
17. Explain plan before the fix
+ - Filter ( ((event = Landing Page) || (event = Home Page)) || ((event = ProductPage) && (product_id = …
+- FileScan parquet … Location: PrunedInMemoryFileIndex[dbfs:…/event=AboutView ... PartitionCount: 1502,
PartitionFilters: [isnotnull(date), (date >= 18135), (date < 18142)], …
What do we see in the explain plan?
● Event partition listed is not for any of the filtered events
● Partition count is really high (date range is 7 days)
● Date partition pruning is applied
18. Pseudo SQL - Innere Where Fixed
SELECT CASE … WHEN FROM (
SELECT concat_ws('', collect_list(
CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’
….
FROM event_log
WHERE 1=1
AND date BETWEEN ...
AND ( (event IN (‘LandingPage’,‘HomePage’))
OR (event = ‘ProductPage’)
OR (event …) )
AND ( (event IN (‘LandingPage’,‘HomePage’))
OR (event = ‘ProductPage’ AND product_id = ‘123’) …)
) t
WHERE t.funnel RLIKE '...'
19. Explain plan after the fix
+ - Filter ( ((event = LandingPage) || (event = HomePage)) || ((event = ProductPage) && (product_id = …
+- FileScan parquet … Location: PrunedInMemoryFileIndex[dbfs:…/event=LandingPage ... PartitionCount: 21,
PartitionFilters: [isnotnull(date), (date >= 18135), (date < 18142)], (event_name = Landing…
What do we see in the explain plan?
● Event name partitions are pruned
● Partition count is a lot lower (7 days x 3 events)
● Date partition pruning is still applied
20. Sample of rows from inner to
outer query
Event Alias Funnel Visitor ID
A BBABABABCBC A1B1
B BBABABABCBC A1B1
C BBABABABCBC A1B1
A BAAA C9D9
B BAAA C9D9
21. Pseudo SQL - Outer Where
SELECT CASE … WHEN FROM (
SELECT concat_ws('', collect_list(
CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’
….
FROM event_log
WHERE 1=1
AND date BETWEEN ...
) t
WHERE t.funnel RLIKE 'A.*B.*C.*D.*E'
22. Pseudo SQL - Outer Where
SELECT CASE … WHEN FROM (
SELECT concat_ws('', collect_list(
CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’
….
FROM event_log
WHERE 1=1
AND date BETWEEN ...
) t
WHERE locate('A', funnel) > 0
23. Pseudo SQL - Outer Select
SELECT CASE
WHEN alias = ‘A’ THEN 1
WHEN alias = ‘B’ AND funnel RLIKE ‘A.*B’ THEN 2
WHEN alias = ‘C’ AND funnel RLIKE ‘A.*B.*C’ THEN 3
ELSE -1 END AS step
FROM (
SELECT concat_ws('', collect_list(
CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’
….
FROM event_log
WHERE 1=1
AND date BETWEEN ...
) t
WHERE locate('A', funnel) > 0
24. Pseudo SQL - Outer Select
SELECT CASE
WHEN alias = ‘A’ THEN 1
WHEN alias = ‘B’ AND locate('B', funnel, locate('A', funnel)) > 0 THEN 2
WHEN alias = ‘C’ AND locate('C', funnel, locate('B', funnel, locate('A', funnel))) > 0
THEN 3
ELSE -1 END AS step
FROM (
SELECT concat_ws('', collect_list(
CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’
….
FROM event_log
WHERE 1=1
AND date BETWEEN ...
) t
WHERE locate('A', funnel) > 0
25. Sample of rows from outer to
BI tool query
Event Alias Funnel Visitor ID Step
A BBABABABCBC A1B1 1
B BBABABABCBC A1B1 2
C BBABABABCBC A1B1 3
A BAAA C9D9 1
B BAAA C9D9 -1
28. Slicing the funnel
● First we get all the values that satisfy the filters
● Then we collect them into an array with a window function to
apply them to every step of the funnel
● Last step is to explode using LATERAL VIEW to support multiple
dimensions
● And now we can expose it as a dimension to users
29. Slicing – Inner select
SELECT CASE … WHEN FROM (
SELECT …
collect_set(CASE WHEN product_id IN (1,2,3) THEN product_id END)
OVER(PARTITION BY session ORDER BY timestamp ROWS …)
AS product_id_array // Distinct values
…
FROM event_log
WHERE ((event IN (‘LandingPage’,‘HomePage’))
OR (event = ‘ProductPage’ AND product_id = ‘123’) …)
AND date = ...
) t
WHERE t.funnel RLIKE '...'
30. Sample of rows from inner to
outer query
Event Alias Funnel Visitor ID Product Array
A BBABABABCBC A1B1 [1,2]
A BAAA C9D9 [3]
31. Slicing – Outer query
SELECT visitor_id …
, product_id
, CASE WHEN …
…
FROM ( /* INNER QUERY */ )
LATERAL VIEW OUTER explode (product_id_arr) products AS product_id
LATERAL VIEW OUTER explode (category_id_arr) categories AS category_id
…
● With LATERAL VIEW we can explode multiple arrays
● This will multiply number of rows sent to the BI tool generated outer
query
● Benefit for end users -> they don’t have to run multiple funnels
analysis to get the same data
32. Sample of rows from outer to
BI tool query
Event Alias Funnel Visitor ID Product ID
A BBABABABCBC A1B1 1
A BBABABABCBC A1B1 2
A BAAA C9D9 3
33. Further optimisations
We have optimised the query in its current form and there is one
more part that allows further optimisations
FROM event_log
● We are reading data directly from the partitioned table
● We can consider partitioned table as a union of tables where
each partition is a table
● Could we optimise the query by replacing the table with unions?
What would that look like
34. Further optimisations
SELECT …
FROM (
SELECT *
FROM event_log
WHERE ((event IN (‘LandingPage’,‘HomePage’))
AND date BETWEEN ...
) t
UNION ALL (
SELECT *
FROM event_log e
INNER JOIN (SELECT visitor_id, min(timestamp) as timestamp
FROM event_log
WHERE ((event IN (‘LandingPage’,‘HomePage’)) AND date BETWEEN ...)
GROUP BY 1
) s1 ON e.visitor_id = s1.visitor_id AND e.timestamp >= s1.timestamp …
WHERE event = ‘ProductPage’ AND product_id = ‘123’ AND date BETWEEN ...
) t
39. How to ensure performance in the BI tool
1. We are using columnar storage for our data
2. Therefore we are using a feature to modify the generated SQL
a. To select only needed fields
b. And to include only the joins needed for those fields
For simpler use cases BI tools provide this out of the box and since we
need to use query with subqueries we had to use additional feature
which allows us to modify custom queries