The document discusses using machine learning techniques to learn vector representations of SQL queries that can then be used for various workload management tasks without requiring manual feature engineering. It shows that representations learned from SQL strings using models like Doc2Vec and LSTM autoencoders can achieve high accuracy for tasks like predicting query errors, auditing users, and summarizing workloads for index recommendation. These learned representations allow workload management to be database agnostic and avoid maintaining database-specific feature extractors.
10. Workload Management and Analytics
Pick your favorite next challenge:
● Workload Summarization
● Index Selection
● Query Routing / Resource Allocation
● Query Recommendation
● Query Forensics
● Multi-Query Optimization
● Self-Tuning Databases
● Predicting Cache Performance
● Modeling User Behavior
11. Jain et al., CIDR 2019
○ Extract query type, count joins, etc. [Chaudhuri et al. 2002]
○ Extract fragments [Khoussainova et al. 2010]
○ Extract operators and sql functions [Jain et al. 2016]
○ etc.
Every workload management task => feature engineering
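To make the cost of per-task feature engineering concrete, here is a minimal sketch of a hand-written extractor for a few of the features listed above (query type, join count, SQL functions). The function name `extract_features` and the regular expressions are illustrative assumptions, not the extractors used in the cited papers; a real extractor would need a dialect-specific parser.

```python
import re

def extract_features(sql):
    """Hand-written features for one SQL string (hypothetical, regex-based sketch)."""
    upper = sql.strip().upper()
    # Query type: the first keyword of the statement.
    query_type = upper.split(None, 1)[0] if upper else ""
    # Explicit joins: occurrences of the JOIN keyword.
    explicit_joins = len(re.findall(r"\bJOIN\b", upper))
    # Implicit joins: commas between tables in the FROM clause.
    from_match = re.search(
        r"\bFROM\b(.*?)(\bWHERE\b|\bGROUP\b|\bORDER\b|$)", upper, re.S
    )
    implicit_joins = from_match.group(1).count(",") if from_match else 0
    # Function calls: a name followed by "(" (rough; also catches e.g. "IN (").
    functions = sorted(set(re.findall(r"\b([A-Z_]+)\s*\(", upper)))
    return {
        "query_type": query_type,
        "num_joins": explicit_joins + implicit_joins,
        "functions": functions,
    }

features = extract_features(
    "SELECT A FROM tableA, tableB WHERE tableA.B = tableB.A "
    "AND tableA.C LIKE '%something%'"
)
print(features)
```

Every rule here encodes an assumption about one SQL dialect, which is why this approach multiplies quickly across tasks and databases.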
12. Manual feature engineering is hopeless
M SQL dialects: PostgreSQL, Snowflake, SQL Server, and so on...
N tasks: Summarization, Error Prediction, Query Routing, Security audits
=> N * M feature extractors (more if tenant-specific features are important)
● Many databases, many tasks
● Maybe ~10 database services, each with different dialects of SQL
● The dialects may change frequently, at different rates:
○ Ex: Snowflake SQL parser changes ~10 times / month on average
● 100s of millions of SQL-like queries per day (hour/minute/sec)...
● Workloads are diverse (yet structured) due to multi-tenancy
13. We want a query representation that can support all these learning tasks
Given a query:
SELECT A
FROM tableA, tableB
WHERE tableA.B = tableB.A
AND tableA.C LIKE '%something%'
Find a vector in k-dimensional space that represents it:
[0.2, 1, 23, 0.01 … … … … …]
14. Doc2Vec / Word2Vec
SELECT D,E,F,G FROM tableA, tableB WHERE tableA.A = tableB.B AND tableA.C = 4Q23
Totally novel automatic feature learning: predict a token from its context; use the learned weights as a vector to represent the predicted token
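The "predict a token from its context" idea can be sketched as the step that turns a SQL string into (context, target) training pairs. This is only the data-preparation half of a Word2Vec-style model; the tokenizer regex, the window size of 2, and the function names are assumptions made for illustration, and the neural network that would consume these pairs is omitted.

```python
import re

def tokenize(sql):
    """Split SQL into identifiers (dots allowed), numbers, and single symbols."""
    return re.findall(r"[A-Za-z_][\w.]*|\d+|[^\s\w]", sql)

def context_target_pairs(tokens, window=2):
    """For each token, pair it with up to `window` neighbors on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

tokens = tokenize("SELECT A FROM tableA WHERE tableA.C = 4")
for context, target in context_target_pairs(tokens):
    print(context, "->", target)
```

A model trained to predict `target` from `context` ends up with one weight vector per token; Doc2Vec extends the same idea with an additional vector per query.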
16. Lots of generic representations…
● Treat queries (or plans) as sentences (natural language text)
● Use representation learning methods for text
○ Doc2Vec
○ LSTM autoencoders
○ LSTM encoder-classifiers
○ TreeLSTM encoder-classifiers on query plans
○ CNNs
17. Sanity check: TPC-H
Do generic NLP representations produce anything meaningful?
[Figure: query representations for a TPC-H workload projected onto two dimensions using t-SNE. Each color is a different TPC-H query template.]
The learned representations are at least minimally coherent
18. Error Prediction
Big, real SQL workload
[Figure: each point is a query that generated an error. Random sample of 4200 error-generating queries over a 7-day period. Colors are selected error codes: OOM Error, Unknown Timezone in Date, Date Parse Error, Divide by Zero.]
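If queries with the same error code cluster together in vector space, even a very simple classifier over the vectors can act as an error predictor. The sketch below uses a k-nearest-neighbor vote with cosine similarity; the choice of classifier, the function names, and the toy 2-dimensional vectors are assumptions for illustration and are not taken from the talk.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_error(query_vec, labeled_vecs, k=3):
    """Majority error code among the k most similar labeled query vectors.

    labeled_vecs is a list of (vector, error_code) pairs.
    """
    ranked = sorted(
        labeled_vecs, key=lambda item: cosine(query_vec, item[0]), reverse=True
    )
    votes = Counter(code for _, code in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy labeled embeddings (made-up values, not real query vectors).
history = [
    ([1.0, 0.0], "OOM Error"),
    ([0.9, 0.1], "OOM Error"),
    ([0.0, 1.0], "Divide by Zero"),
    ([0.1, 0.9], "Divide by Zero"),
]
print(predict_error([0.8, 0.2], history))
```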
25. Workload Summarization for Index Recommendation
A lot of queries => Apply filters (Account_name = 'xyz') => Workload => Uniform sample => 100 queries (output workload)
26. Workload Summarization for Index Recommendation
A lot of queries => Apply filters (Account_name = 'xyz') => Workload => Summarization using query vectors => 100 queries (output workload)
** Jiaqi Yan, Qiuye Jin, Shrainik Jain, Stratis D. Viglas, Allison Lee, “Snowtrail: Testing with Production Queries on a Cloud Database”, DBTEST 2018
** Jiaqi Yan, Qiuye Jin, Shrainik Jain, Stratis D. Viglas, Allison Lee, “Snowtrail: Testing with Production Queries on a Cloud Database”, US Patent Application No. 62/646,817
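One way to realize "summarization using query vectors" is to cluster the vectors and keep the query nearest each cluster center as a representative. The slides do not spell out the algorithm at this point, so the sketch below is an assumption: a small pure-Python k-means with farthest-point initialization, where `summarize` and its parameters are hypothetical names.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def summarize(vectors, k, iterations=20):
    """Return sorted indices of representative queries (at most k of them)."""
    # Farthest-point initialization keeps this sketch deterministic.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(
            max(vectors, key=lambda v: min(dist(v, c) for c in centers))
        )
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)), key=lambda i: dist(v, centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    # Representative = the actual query closest to each center.
    # A set is used, so duplicates collapse and fewer than k may be returned.
    return sorted(
        {min(range(len(vectors)), key=lambda j: dist(vectors[j], c)) for c in centers}
    )

# Toy 2-D "query vectors" forming two obvious groups.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(summarize(vectors, k=2))
```

Compared with the uniform sample on the previous slide, this picks one query per region of the vector space, so rare query patterns are less likely to be dropped.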
27. Evaluation of workload summary: index recommendation
○ Run the full workload with no indexes, record the time (t1)
○ Recommend and create indexes on the FULL workload
○ Run the full workload again, record the time (t2)
○ Generate small workload summary
○ Recommend and create indexes on the SUMMARY workload
○ Run the full workload again, record the time (t3)
○ Set a time budget for the recommender
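The three recorded runtimes can be combined into a few comparison numbers. The slides only define t1, t2, and t3; the specific metrics below (speedups and the fraction of the full-workload benefit retained by the summary) and the name `evaluate_summary` are assumptions chosen for illustration.

```python
def evaluate_summary(t1, t2, t3):
    """Compare index recommendations from the full workload and the summary.

    t1: runtime of the full workload with no indexes
    t2: runtime with indexes recommended from the FULL workload
    t3: runtime with indexes recommended from the SUMMARY workload
    """
    if min(t1, t2, t3) <= 0:
        raise ValueError("runtimes must be positive")
    return {
        "speedup_full": t1 / t2,
        "speedup_summary": t1 / t3,
        # 1.0 means the summary recovers all of the full-workload improvement.
        "fraction_of_full_benefit": (
            (t1 - t3) / (t1 - t2) if t1 != t2 else float("nan")
        ),
    }

# Made-up runtimes in seconds, purely for illustration.
print(evaluate_summary(t1=100.0, t2=50.0, t3=60.0))
```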
28. Workload Summarization for Index Selection
Transfer learning: we can even learn the model on a Snowflake workload and use it to infer representations for the TPC-H workload
31. Last slide
● Every workload management task is query labeling
● You don’t need fancy features
● You can’t maintain fancy features anyway
● SQL strings (and plans) have a lot of signal
● There is tons of training data
● Your workload is not “all possible queries” – use the patterns
● Transfer learning works – you can train on one workload and use it on another
● Opens up a lot of simple interesting little applications
○ User behavior modeling, resource allocation, …
● External “query labeling service” keeps everything organized
Shrainik Jain
I will be using analytics and management interchangeably.
Management means operationalizing a set of analysis and decision tasks
Predicting Cache Performance. [Sapia 2000, Dan et al. 1995]
Modeling User Behavior [Yu et al. 1992, Tran et al. 2015, Jain et al. 2016]
Also called:
Embedding
Vector representation
Distributed representation
Low-hanging fruit for Shrainik’s PhD: this is already a solved problem in NLP.
Why just stop at these?
“Heavy” means the query involves top-20 tables by size.
(Or Better-than-random sampling for an application within Snowflake**, maybe)
Let’s compare against the gold standard.
Caveat: SQL Server does summarization no matter what. We couldn’t find a way to turn this off.
System architecture. Queries arrive for three different applications X, Y, and Z and are processed by one or more (embedder, labeler) pairs before being sent on to the database, centralized for offline labeling tasks, or both.
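The (embedder, labeler) pairing can be sketched as a small pipeline object: an embedder maps a SQL string to a vector, and a labeler maps that vector to a label for one application. Everything below is a hypothetical illustration; the class `LabelingPipeline`, the toy embedder, and the threshold-based labelers stand in for learned models and are not part of the described system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]

@dataclass
class LabelingPipeline:
    """One (embedder, labeler) pair serving a single application."""
    name: str
    embedder: Callable[[str], Vector]
    labeler: Callable[[Vector], str]

    def label(self, sql: str) -> str:
        return self.labeler(self.embedder(sql))

def process(sql: str, pipelines: Sequence[LabelingPipeline]) -> Dict[str, str]:
    """Run a query through every registered pipeline and collect the labels."""
    return {p.name: p.label(sql) for p in pipelines}

def toy_embedder(sql: str) -> Vector:
    # Stand-in for a learned model: [token count, number of JOIN keywords].
    tokens = sql.upper().split()
    return [float(len(tokens)), float(tokens.count("JOIN"))]

pipelines = [
    LabelingPipeline("routing", toy_embedder,
                     lambda v: "large" if v[1] >= 1 else "small"),
    LabelingPipeline("audit", toy_embedder,
                     lambda v: "review" if v[0] > 50 else "ok"),
]
print(process("SELECT * FROM a JOIN b ON a.x = b.x", pipelines))
```

Keeping the embedder separate from the labeler means one learned representation can be shared across applications, with only the lightweight labeler differing per task.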