Roman Zeyde explains how to optimize Presto joins in selective use cases.
Roman is a Talpiot graduate and an ex-Googler, today working as a Presto architect at Varada.
3. Existing join optimization techniques
These happen during the planning phase:
• Join reordering
• Join distribution type (distributed vs. broadcast)
Both depend on the cost-based optimizer (CBO), which needs column statistics:
• Should be enabled via session parameters (see the example below)
• Statistics can be collected using the ANALYZE statement
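For example (a sketch; the property names follow the prestosql documentation and may vary by version), both optimizations can be enabled per session, and statistics collected per table:
SET SESSION join_reordering_strategy = 'AUTOMATIC'; -- let the CBO reorder joins
SET SESSION join_distribution_type = 'AUTOMATIC';   -- let the CBO pick broadcast vs. partitioned
ANALYZE items; -- collect column statistics for the items table
ANALYZE sales;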
4. Example: join reordering
SELECT * FROM items JOIN sales ON sales.item_id = items.id;
Prefer keeping the "smaller" table on the right-hand side of the join:
[Diagram: two candidate plans for Join(item_id=id) over Scan(sales) and Scan(items); the preferred plan keeps the smaller items table on the right-hand (build) side]
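The chosen join order can be inspected with EXPLAIN (a sketch using the tables above):
EXPLAIN
SELECT * FROM items JOIN sales ON sales.item_id = items.id;
-- With CBO enabled and fresh statistics, the smaller items table
-- should appear on the build (right-hand) side of the join.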
5. Example: broadcast join
If the right-hand side table is "small", it can be replicated to all join workers, saving the CPU and network cost of left-hand side repartitioning:
[Diagram: each join worker receives its share of the left-hand side and a full replica of the right-hand side]
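For example (assuming the join_distribution_type session property described in the Presto documentation), a broadcast join can be requested explicitly:
-- Replicate the right-hand (build) side to all workers:
SET SESSION join_distribution_type = 'BROADCAST';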
6. Example: distributed join
Otherwise, both tables are repartitioned using the join key, allowing joins with larger right-hand side tables:
[Diagram: both the left-hand side and the right-hand side are repartitioned by the join key across the join workers]
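Correspondingly (same hypothetical session setup as above), a partitioned join can be forced when the build side is too large to replicate:
-- Repartition both sides by the join key:
SET SESSION join_distribution_type = 'PARTITIONED';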
7. Dynamic filtering - introduction
Consider the following query:
SELECT * FROM sales JOIN items
ON sales.item_id = items.id
WHERE items.price > 1000;
Assumptions:
● sales table is large
● items scan results in a few rows (due to predicate pushdown)
Most of the scanned sales rows will be discarded during the join (i.e. high selectivity).
How can we optimize this use-case?
[Diagram: Join(item_id=id) over Scan(sales) and Scan(items)[price>1000]]
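Conceptually, dynamic filtering behaves as if we had hand-written an extra semi-join predicate on sales (a sketch of the semantics, not the actual plan):
SELECT * FROM sales JOIN items
ON sales.item_id = items.id
WHERE items.price > 1000
  -- the dynamic filter acts like this predicate, pushed into the sales scan:
  AND sales.item_id IN (SELECT id FROM items WHERE price > 1000);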
8. Dynamic filtering - description
1. Collect relevant id values during items scan
2. Construct dynamic filter F using the collected ids
3. Apply dynamic predicate pushdown using F to the sales scan
Benefits:
• Connector may optimize the scan given F
• Most sales rows are not touched by Presto
• CPU & network savings for large tables
Requirements:
• F cannot be too large (memory-wise)
• F needs to "back-propagate" into the sales scan at runtime
[Diagram: (1) Scan(items)[price>1000] feeds the "Construct dynamic filter F" step; (2) F is built from the collected ids; (3) F is pushed down into Scan(sales)[item_id∈F], whose output feeds Join(item_id=id)]
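For illustration (the id values here are made up), if the items scan yields ids 17, 42 and 96, then after step (3) the sales scan effectively becomes:
-- F = {17, 42, 96} is pushed into the connector, which can now skip
-- partitions/splits containing no matching item_id values:
SELECT * FROM sales WHERE item_id IN (17, 42, 96);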
9. Implementation details - Qubole et al.
Supports both distributed and broadcast joins, but requires significant changes in Presto:
• Add plan nodes and optimizer rules for dynamic filter collection and application
• New coordinator REST endpoint for dynamic filter collection from worker nodes
• Allow connectors to prune partitions during split generation (when the dynamic filter is ready)
More details can be found here:
qubole.com/blog/sql-join-optimizations-qubole-presto
(https://docs.google.com/document/d/1TOlxS8ZAXSIHR5ftHbPsgUkuUA-ky-odwmPdZARJrUQ)
10. Implementation - Varada
When a broadcast join is used, the sales ScanFilterAndProject and items HashBuilder operators run in the same process:
• Add a "pass-through" operator to collect build-side ids.
• When ready, push down the resulting predicate F into the sales page source.
No changes are needed in the planner, optimizer or coordinator!
Implemented as a patch on top of github.com/prestosql/presto (currently work in progress).
[Diagram: build side — ScanFilterAndProject(items)[price>1000] → Collect F:=F∪{id} → HashBuilder[id]; probe side — ScanFilterAndProject(sales)[item_id∈F] → LookupJoin[item_id=id] → TaskOutput, with Exchange stages between tasks]
11. Performance analysis - benchmark
Consider the following query (based on the TPC-DS sf10000 dataset):
SELECT ss_item_sk FROM store_sales JOIN customer
ON ss_customer_sk = c_customer_sk
WHERE c_customer_id = 'AAAAAAAAMCOOKLCA';
• store_sales contains 27.7B rows
• customer contains 65M rows
• Query result contains 334 rows
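Such measurements can be reproduced with EXPLAIN ANALYZE, which reports per-operator CPU time and row counts:
EXPLAIN ANALYZE
SELECT ss_item_sk FROM store_sales JOIN customer
ON ss_customer_sk = c_customer_sk
WHERE c_customer_id = 'AAAAAAAAMCOOKLCA';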
12. Performance analysis - results
Regular join Dynamic filtering improvement
Execution time 25 sec 0.9 sec x27 faster
CPU time 57.4 min 7.8 sec x440 lower
Peak total memory 261 MB 2.2 MB x118 lower
Data read (from connector) 258 GB 3.3 kB x78M lower
Tested on Varada cluster (with CBO enabled):
13. Up next in Presto improvements
• Distributed Joins - extend dynamic filtering
• Aggregation Pushdown
• Coordinator HA
CBO is supported today by the Presto Hive connector (using Hive statistics).
Since hash-join requires reading the right-hand side table into memory, we would like to estimate the expected sizes and reorder the join accordingly.
It can be done manually, or automatically (using CBO) via connector-provided statistics.
Broadcast join optimization saves the network cost of LHS repartitioning at the expense of RHS replication.
Can be set manually, or via CBO (by enumerating the possible join types and choosing the one with lowest cost).
If we knew the item IDs during the planning, we could use predicate pushdown to propagate them into the connector.
So instead, we need to construct the predicate in run-time (before starting the LHS scan).
We re-use existing predicate pushdown mechanism, which allows us to skip most of the LHS table (can be done efficiently in our case).
The coordination problem is much simpler in this case.
Note: we don't know the join key during the plan, so regular predicate pushdown doesn't work.
There is a single customer that matches the RHS predicate, so the results are highly selective.
Same query, same hardware - without / with dynamic filtering.
These results show that dynamic filtering may significantly improve the performance of highly-selective queries, by making relatively small changes in Presto.
We are planning to continue the work on dynamic filtering, as well as adding support for aggregation pushdown and coordinator high-availability.