Database-Agnostic Workload Management
Shrainik Jain, Jiaqi Yan*, Thierry Cruanes*, Bill Howe
1/21/2019
Workload Management and Analytics
● Workload Summarization
● Index Selection
● Query Routing / Resource Allocation
● Query Recommendation
Pick your favorite next challenge:
● Query Forensics
● Multi-Query Optimization
● Self-Tuning Databases
● Predicting Cache Performance
● Modeling User Behavior
Jain et al., CIDR 2019 3
Labeling function: High priority? · Input: Q · Outputs: (Q, priority), (Q, normal) · Destination: Fast server
Labeling function: Heavy hitter? · Input: Q · Outputs: (Q, heavy), (Q, normal) · Destination: Big cluster
Labeling function: Likely error? · Input: Q · Outputs: (Q, error), (Q, no error) · Destination: Instrumented cluster
Labeling function: Atypical query? · Input: Q · Outputs: (Q, atypical), (Q, typical) · Destination: Workload summary for periodic index recommendation
Labeling function: Suspicious query? · Input: Q · Outputs: (Q, suspicious), (Q, not suspicious) · Destination: Audit log
Labeling function: optimizer · Input: Q · Output: (Q, estimated cost) · Destination: big cluster
Q → heavy → (Q, heavy) → suspicious → (Q, heavy, suspicious) → atypical → (Q, heavy, suspicious, atypical) → priority → (Q, heavy, suspicious, atypical, priority) → RDS
Workload Management = Learning and
operationalizing a set of query labeling functions
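The chained labeling picture above can be sketched in a few lines. This is only an illustration: the `heavy` and `suspicious` functions below are hypothetical keyword rules standing in for learned classifiers, and `label_query` is an assumed name, not something from the talk.

```python
from typing import Callable, List, Optional, Tuple

# A labeling function maps a SQL string to a label, or None for "no label".
Labeler = Callable[[str], Optional[str]]

# Hypothetical stand-ins: real labelers would be models trained on query
# representations, not keyword rules like these.
def heavy(query: str) -> Optional[str]:
    return "heavy" if query.upper().count("JOIN") >= 3 else None

def suspicious(query: str) -> Optional[str]:
    return "suspicious" if "DROP TABLE" in query.upper() else None

def label_query(query: str, labelers: List[Labeler]) -> Tuple[str, ...]:
    """Run each labeler in order and return (Q, label1, label2, ...)."""
    labels = []
    for labeler in labelers:
        label = labeler(query)
        if label is not None:
            labels.append(label)
    return (query, *labels)
```

A router, audit log, or summarizer would then act on the returned tuple rather than on the raw query text.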
○ Extract query type, count joins, etc. [Chaudhuri et al. 2002]
○ Extract fragments [Khoussainova et al. 2010]
○ Extract operators and SQL functions [Jain et al. 2016]
○ etc.
Every workload management task => feature engineering
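To make the cost of hand-written features concrete, here is a minimal sketch of an extractor in the spirit of the items above (query type, join count). The function name `manual_features` and the regular expressions are assumptions for illustration; they are not taken from the cited papers.

```python
import re

def manual_features(query: str) -> dict:
    """Hand-written features of the kind listed above (illustrative only)."""
    text = query.strip().upper()
    first = text.split(None, 1)[0] if text else ""
    return {
        "query_type": first,  # e.g. SELECT, INSERT
        "num_joins": len(re.findall(r"\bJOIN\b", text)),  # explicit JOINs only
        "has_window_function": bool(re.search(r"\bOVER\s*\(", text)),
    }
```

Even this small extractor is brittle: it misses implicit comma joins such as `FROM tableA, tableB`, and each SQL dialect would need its own adjustments.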
M SQL Dialects: PostgreSQL, Snowflake, SQL Server, and so on...
N Tasks: Summarization, Error Prediction, Query Routing, Security audits
N * M feature extractors (more if tenant-specific features are important)
Manual feature engineering is hopeless
● Many databases, many tasks
● Maybe ~10 database services, each with different dialects of SQL
● The dialects may change frequently, at different rates:
○ Ex: Snowflake SQL parser changes ~10 times / month on average
● 100s of millions of SQL-like queries per day (hour/minute/sec)...
● Workloads are diverse (yet structured) due to multi-tenancy
We want a query representation that can
support all these learning tasks
SELECT A
FROM tableA, tableB
WHERE tableA.B = tableB.A
AND tableA.C LIKE '%something%'
[0.2, 1, 23, 0.01 … … … … …]
Given a query, find a vector in k-dimensional space that represents it.
predict
SELECT D,E,F,G FROM tableA, tableB WHERE tableA.A = tableB.B AND tableA.C = 4Q23
Doc2Vec
Word2Vec
Totally novel automatic feature learning:
Predict a token from its context; use the learned weights as a vector to represent the predicted token
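The "predict a token from its context" setup can be sketched as the construction of training pairs. This shows only how (context, target) examples might be built from a SQL string, not the model that learns from them; `tokenize`, `context_pairs`, and the `window` parameter are assumed names, and the tokenizer is deliberately crude.

```python
import re
from typing import List, Tuple

def tokenize(query: str) -> List[str]:
    """Split a SQL string into identifier, number, and punctuation tokens."""
    return re.findall(r"[A-Za-z_][\w.]*|\d+|[^\s\w]", query)

def context_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Pair each token with up to `window` neighbors on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs
```

A Word2Vec- or Doc2Vec-style model would be trained to predict `target` from the context tokens, and its learned weights would then serve as the representation.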
Lots of generic representations…
● Treat queries (or plans) as sentences (natural language text)
● Use representation learning methods for text
○ Doc2Vec
○ LSTM autoencoders
○ LSTM encoder-classifiers
○ TreeLSTM encoder-classifiers on query plans
○ CNNs
Sanity check: TPC-H
Query representations for a TPC-H workload, projected onto two dimensions using t-SNE.
Each color is a different TPC-H query template.
The learned representations are at least minimally coherent.
Do generic NLP representations produce anything meaningful?
Error Prediction
[Figure: big, real SQL workload. Each point is a query that generated an error: a random sample of 4200 error-generating queries over a 7-day period. Colors are selected error codes: OOM Error, Unknown Timezone in Date, Date Parse Error, Divide by Zero.]
Clusters are repeated syntactic patterns in the workload; they’re meaningful
DOES THIS ACTUALLY WORK?
Datasets used
● Datasets for training embedders
Workload Total Queries Distinct Queries
Snowflake 500000 175958
TPC-H 4200 2180
● Datasets for training classifiers
Workload Total Queries Distinct Queries
Snowflake-MultiError 100000 17311
Snowflake-OOM 4491 2501
Predicting OOM Errors
Method Precision Recall f1-score
Contains heavy joins 0.729 0.115 0.198
Contains window functions 0.762 0.377 0.504
Contains heavy joins OR window functions 0.724 0.403 0.518
Contains heavy joins AND window functions 0.931 0.162 0.162
Query2Vec-LSTM 0.983 0.977 0.980
Query2Vec-Doc2Vec 0.919 0.823 0.869
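The f1-score column is the harmonic mean of precision and recall. A small helper (the name `f1_score` is chosen here for illustration) makes the relationship explicit, and recomputing it from the other two columns is a quick way to cross-check a row.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For the Query2Vec-LSTM row, `f1_score(0.983, 0.977)` rounds to 0.980, matching the table.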
Predicting Other Errors
ErrorCode Precision Recall f1-score #queries
-1 (No Error) 0.986 0.992 0.989 7464
604 0.878 0.927 0.902 1106
606 0.929 0.578 0.712 45
608 0.996 0.993 0.995 3119
630 0.894 0.864 0.879 88
2031 0.765 0.667 0.712 39
90030 1 0.998 0.999 1529
100035 1 0.71 0.83 31
100037 1 0.417 0.588 12
100038 0.981 0.968 0.975 1191
100040 0.952 0.833 0.889 48
100046 1 0.923 0.96 13
100051 0.941 0.913 0.927 104
100069 0.857 0.5 0.632 12
100071 0.857 0.5 0.632 12
100078 1 0.974 0.987 77
100094 0.833 0.921 0.875 38
100097 0.923 0.667 0.774 18
~90% P/R
Security Audits:
Predict user, compare with actual user
#queries #users Accuracy
73881 28 49.30%
55333 10 37.40%
18487 46 31.80%
5471 21 96.20%
4213 6 58.50%
3894 12 99.70%
3373 9 99.80%
2867 6 99.80%
1953 15 89.10%
1924 4 98.10%
1776 9 95.20%
1699 5 99.80%
1108 12 98.20%
Method Account Labeling User Labeling
Doc2Vec 78.8% 39%
LSTM Autoencoder 99.1% 55.4%
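The audit idea on this slide (predict the user, compare with the actual user) reduces to a mismatch check. The sketch below assumes a `predict_user` callable is available; `toy_predictor` is a hypothetical rule used only so the example runs, where the talk uses a classifier over learned query representations.

```python
from typing import Callable, Iterable, List, Tuple

def audit(log: Iterable[Tuple[str, str]],
          predict_user: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (query, actual_user, predicted_user) for every mismatch."""
    flagged = []
    for query, actual in log:
        predicted = predict_user(query)
        if predicted != actual:
            flagged.append((query, actual, predicted))
    return flagged

# Hypothetical predictor for illustration only.
def toy_predictor(query: str) -> str:
    return "etl" if query.upper().startswith("INSERT") else "analyst"
```

With accuracies like those in the table, a mismatch is a signal for review rather than proof of misuse, especially for accounts where the classifier is weak.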
Workload Summarization for Index Recommendation
Uniform sampling: Workload (a lot of queries) → Apply Filters (Account_name = 'xyz') → Uniform Sample → Output Workload (100 queries)
Summarization using query vectors: Workload (a lot of queries) → Apply Filters (Account_name = 'xyz') → Summarization using query vectors → Output Workload (100 queries)
** Jiaqi Yan, Qiuye Jin, Shrainik Jain, Stratis D. Viglas, Allison Lee, “Snowtrail: Testing with Production Queries on a Cloud Database”, DBTEST 2018
** Jiaqi Yan, Qiuye Jin, Shrainik Jain, Stratis D. Viglas, Allison Lee, “Snowtrail: Testing with Production Queries on a Cloud Database”, US Patent Application No. 62/646,817
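The slides do not say how a summary is chosen from the query vectors. As one possible stand-in (a swapped-in technique, not the authors' method), greedy farthest-point selection picks queries whose vectors are spread out, so that rare patterns are less likely to be dropped than under uniform sampling. The function name `summarize` is an assumption.

```python
import math
from typing import List, Sequence

def summarize(vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedy farthest-point selection: indices of up to k spread-out vectors."""
    if not vectors or k <= 0:
        return []
    chosen = [0]  # start from the first query (arbitrary choice)
    # dist[i] = distance from vector i to its nearest chosen vector
    dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=dist.__getitem__)
        if dist[nxt] == 0:
            break  # every remaining vector duplicates a chosen one
        chosen.append(nxt)
        dist = [min(d, math.dist(v, vectors[nxt])) for d, v in zip(dist, vectors)]
    return chosen
```

The returned indices identify the queries to keep in the output workload; a clustering-based choice would be an equally reasonable alternative.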
Workload Summarization for Index Recommendation
Evaluation of workload summary: index recommendation
○ Run the full workload with no indexes, record the time (t1)
○ Recommend and create indexes on the FULL workload
○ Run the full workload again, record the time (t2)
○ Generate small workload summary
○ Recommend and create indexes on the SUMMARY workload
○ Run the full workload again, record the time (t3)
○ Set a time budget for the recommender
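The steps above record three timings (t1, t2, t3) but the slide does not state how they are combined. One assumed way to score a summary is the fraction of the full-workload speedup that the summary's indexes retain; the name `summary_quality` and the formula are illustrative assumptions, not the authors' metric.

```python
def summary_quality(t1: float, t2: float, t3: float) -> float:
    """Fraction of the full-workload speedup retained by the summary's indexes.

    t1: runtime with no indexes
    t2: runtime with indexes recommended from the FULL workload
    t3: runtime with indexes recommended from the SUMMARY workload
    """
    full_gain = t1 - t2
    if full_gain <= 0:
        raise ValueError("indexes from the full workload gave no speedup")
    return (t1 - t3) / full_gain
```

A value near 1.0 would mean the summary is about as useful to the recommender as the full workload, at a fraction of the recommender's time budget.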
Transfer learning: we can even learn the model on the Snowflake workload and use it to infer representations for the TPC-H workload.
Workload Summarization for Index Selection
How good is this summary?
Querc: Query Classifier
● Reuse embeddings where possible
● Collect training labels from the databases (cost, error codes)
● Retrain models periodically, or online
Last slide
● Every workload management task is query labeling
● You don’t need fancy features
● You can’t maintain fancy features anyway
● SQL strings (and plans) have a lot of signal
● There is tons of training data
● Your workload is not “all possible queries” – use the patterns
● Transfer learning works – you can train on one workload and use on another
● Opens up a lot of simple interesting little applications
○ User behavior modeling, resource allocation, …
● External “query labeling service” keeps everything organized
Shrainik Jain
Query recommendation:
Predict next query in a session
[Figure: up is good.]
Learned features are about as good as manual features, even with generous assumptions

Más contenido relacionado

La actualidad más candente

Pyclustering tutorial - K-means
Pyclustering tutorial - K-meansPyclustering tutorial - K-means
Pyclustering tutorial - K-meansAndrei Novikov
 
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)Hansol Kang
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
Kaggle talk series top 0.2% kaggler on amazon employee access challenge
Kaggle talk series  top 0.2% kaggler on amazon employee access challengeKaggle talk series  top 0.2% kaggler on amazon employee access challenge
Kaggle talk series top 0.2% kaggler on amazon employee access challengeVivian S. Zhang
 
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTaegyun Jeon
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regressionAkhilesh Joshi
 
Deep Convolutional GANs - meaning of latent space
Deep Convolutional GANs - meaning of latent spaceDeep Convolutional GANs - meaning of latent space
Deep Convolutional GANs - meaning of latent spaceHansol Kang
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxDatabricks
 
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...Hansol Kang
 
Machine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application DevelopersMachine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application DevelopersEtsuji Nakai
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMarjan Sterjev
 
Analysis of algorithms
Analysis of algorithmsAnalysis of algorithms
Analysis of algorithmsiqbalphy1
 
Queue Data Structure
Queue Data StructureQueue Data Structure
Queue Data StructureZidny Nafan
 
Time Series Analysis for Network Secruity
Time Series Analysis for Network SecruityTime Series Analysis for Network Secruity
Time Series Analysis for Network Secruitymrphilroth
 
Geek Time Janvier 2017 : Quiz Java
Geek Time Janvier 2017 : Quiz JavaGeek Time Janvier 2017 : Quiz Java
Geek Time Janvier 2017 : Quiz JavaOLBATI
 
Broom: Converting Statistical Models to Tidy Data Frames
Broom: Converting Statistical Models to Tidy Data FramesBroom: Converting Statistical Models to Tidy Data Frames
Broom: Converting Statistical Models to Tidy Data FramesWork-Bench
 

La actualidad más candente (20)

Pyclustering tutorial - K-means
Pyclustering tutorial - K-meansPyclustering tutorial - K-means
Pyclustering tutorial - K-means
 
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Project PPT
Project PPTProject PPT
Project PPT
 
Kaggle talk series top 0.2% kaggler on amazon employee access challenge
Kaggle talk series  top 0.2% kaggler on amazon employee access challengeKaggle talk series  top 0.2% kaggler on amazon employee access challenge
Kaggle talk series top 0.2% kaggler on amazon employee access challenge
 
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
 
Aaex3 group2
Aaex3 group2Aaex3 group2
Aaex3 group2
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
 
Deep Convolutional GANs - meaning of latent space
Deep Convolutional GANs - meaning of latent spaceDeep Convolutional GANs - meaning of latent space
Deep Convolutional GANs - meaning of latent space
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
 
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
 
PA1_template
PA1_templatePA1_template
PA1_template
 
Machine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application DevelopersMachine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application Developers
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark Examples
 
Analysis of algorithms
Analysis of algorithmsAnalysis of algorithms
Analysis of algorithms
 
Queue Data Structure
Queue Data StructureQueue Data Structure
Queue Data Structure
 
Time Series Analysis for Network Secruity
Time Series Analysis for Network SecruityTime Series Analysis for Network Secruity
Time Series Analysis for Network Secruity
 
Geek Time Janvier 2017 : Quiz Java
Geek Time Janvier 2017 : Quiz JavaGeek Time Janvier 2017 : Quiz Java
Geek Time Janvier 2017 : Quiz Java
 
Broom: Converting Statistical Models to Tidy Data Frames
Broom: Converting Statistical Models to Tidy Data FramesBroom: Converting Statistical Models to Tidy Data Frames
Broom: Converting Statistical Models to Tidy Data Frames
 

Similar a Database Agnostic Workload Management (CIDR 2019)

Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoDatabricks
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereSAP Technology
 
22-4_PerformanceTuningUsingtheAdvisorFramework.pdf
22-4_PerformanceTuningUsingtheAdvisorFramework.pdf22-4_PerformanceTuningUsingtheAdvisorFramework.pdf
22-4_PerformanceTuningUsingtheAdvisorFramework.pdfyishengxi
 
Task Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and WorkflowsTask Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and WorkflowsRafael Ferreira da Silva
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Peter Tröger
 
Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013Valeriy Kravchuk
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfDuy-Hieu Bui
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsScott Clark
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsSigOpt
 
Anomaly Detection using Neural Networks with Pandas, Keras and Python
Anomaly Detection using Neural Networks with Pandas, Keras and PythonAnomaly Detection using Neural Networks with Pandas, Keras and Python
Anomaly Detection using Neural Networks with Pandas, Keras and PythonDean Langsam
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAmazon Web Services
 
Cs 568 Spring 10 Lecture 5 Estimation
Cs 568 Spring 10  Lecture 5 EstimationCs 568 Spring 10  Lecture 5 Estimation
Cs 568 Spring 10 Lecture 5 EstimationLawrence Bernstein
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
SQL Server Query Optimization, Execution and Debugging Query Performance
SQL Server Query Optimization, Execution and Debugging Query PerformanceSQL Server Query Optimization, Execution and Debugging Query Performance
SQL Server Query Optimization, Execution and Debugging Query PerformanceVinod Kumar
 
Workshop: Your first machine learning project
Workshop: Your first machine learning projectWorkshop: Your first machine learning project
Workshop: Your first machine learning projectAlex Austin
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cRachelBarker26
 
SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?Brent Ozar
 

Similar a Database Agnostic Workload Management (CIDR 2019) (20)

Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
 
22-4_PerformanceTuningUsingtheAdvisorFramework.pdf
22-4_PerformanceTuningUsingtheAdvisorFramework.pdf22-4_PerformanceTuningUsingtheAdvisorFramework.pdf
22-4_PerformanceTuningUsingtheAdvisorFramework.pdf
 
Task Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and WorkflowsTask Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and Workflows
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
 
SQL Optimizer vs Hive
SQL Optimizer vs Hive SQL Optimizer vs Hive
SQL Optimizer vs Hive
 
Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
 
Anomaly Detection using Neural Networks with Pandas, Keras and Python
Anomaly Detection using Neural Networks with Pandas, Keras and PythonAnomaly Detection using Neural Networks with Pandas, Keras and Python
Anomaly Detection using Neural Networks with Pandas, Keras and Python
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing Performance
 
Cs 568 Spring 10 Lecture 5 Estimation
Cs 568 Spring 10  Lecture 5 EstimationCs 568 Spring 10  Lecture 5 Estimation
Cs 568 Spring 10 Lecture 5 Estimation
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
SQL Server Query Optimization, Execution and Debugging Query Performance
SQL Server Query Optimization, Execution and Debugging Query PerformanceSQL Server Query Optimization, Execution and Debugging Query Performance
SQL Server Query Optimization, Execution and Debugging Query Performance
 
Workshop: Your first machine learning project
Workshop: Your first machine learning projectWorkshop: Your first machine learning project
Workshop: Your first machine learning project
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19c
 
Maestro_Abstract
Maestro_AbstractMaestro_Abstract
Maestro_Abstract
 
SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?
 

Más de University of Washington

Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 

Más de University of Washington (20)

Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 

Database Agnostic Workload Management (CIDR 2019)

  • 1. Database-Agnostic Workload Management. Shrainik Jain, Jiaqi Yan*, Thierry Cruanes*, Bill Howe. 1/21/2019
  • 2. Workload Management and Analytics: Workload Summarization; Index Selection; Query Routing / Resource Allocation; Query Recommendation. Pick your favorite next challenge: Query Forensics; Multi Query optimization; Self-Tuning Databases; Predicting Cache Performance; Modeling User Behavior
  • 3. Jain et al., CIDR 2019. Diagram: a query Q is checked by "High priority?" and labeled (Q, priority) or (Q, normal). Destination shown: Fast server.
  • 4. Diagram: a query Q is checked by "Heavy hitter?" and labeled (Q, heavy) or (Q, normal). Destination shown: Big cluster.
  • 5. Diagram: a query Q is checked by "Likely Error?" and labeled (Q, error) or (Q, no error). Destination shown: Instrumented cluster.
  • 6. Diagram: a query Q is checked by "Atypical query?" and labeled (Q, atypical) or (Q, typical). Destination shown: Workload summary for periodic index recommendation.
  • 7. Diagram: a query Q is checked by "Suspicious query?" and labeled (Q, suspicious) or (Q, not suspicious). Destination shown: Audit Log.
  • 8. Diagram: a query Q is labeled (Q, estimated cost). Elements shown: optimizer, big cluster.
  • 9. Workload Management = Learning and operationalizing a set of query labeling functions. Diagram: a query Q passes through the labelers heavy, suspicious, atypical, and priority, accumulating labels: (Q, heavy) → (Q, heavy, suspicious) → (Q, heavy, suspicious, atypical) → (Q, heavy, suspicious, atypical, priority) → RDS
  • 10. Workload Management and Analytics: Workload Summarization; Index Selection; Query Routing / Resource Allocation; Query Recommendation. Pick your favorite next challenge: Query Forensics; Multi Query optimization; Self-Tuning Databases; Predicting Cache Performance; Modeling User Behavior
  • 11. Every workload management task => feature engineering
      ○ Extract query type, count joins, etc. [Chaudhuri et al. 2002]
      ○ Extract fragments [Khoussainova et al. 2010]
      ○ Extract operators and SQL functions [Jain et al. 2016]
      ○ etc.
  • 12. Manual feature engineering is hopeless
      ● Many databases, many tasks
      ● Maybe ~10 database services, each with different dialects of SQL
      ● The dialects may change frequently, at different rates:
        ○ Ex: Snowflake SQL parser changes ~10 times / month on average
      ● 100s of millions of SQL-like queries per day (hour/minute/sec)...
      ● Workloads are diverse (yet structured) due to multi-tenancy
      Diagram: M SQL dialects (PostgreSQL, Snowflake, SQL Server, and so on...) and N tasks (Summarization, Error Prediction, Query Routing, Security audits) give N * M feature extractors; more if tenant-specific features are important.
  • 13. We want a query representation that can support all these learning tasks. Given a query, find a vector in k-dimensional space that represents it. Example: SELECT A FROM tableA, tableB WHERE tableA.B = tableB.A AND tableA.C LIKE '%something%' → [0.2, 1, 23, 0.01, …]
  • 14. Totally novel automatic feature learning: predict a token from its context; use the learned weights as a vector to represent the predicted token. Diagram labels: Word2Vec, Doc2Vec, predict. Example query (Q23): SELECT D,E,F,G FROM tableA, tableB WHERE tableA.A = tableB.B AND tableA.C = 4
  • 15. 15
  • 16. Lots of generic representations…
      ● Treat queries (or plans) as sentences (natural language text)
      ● Use representation learning methods for text
        ○ Doc2Vec
        ○ LSTM autoencoders
        ○ LSTM encoder-classifiers
        ○ TreeLSTM encoder-classifiers on query plans
        ○ CNNs
  • 17. Sanity check: TPC-H. Do generic NLP representations produce anything meaningful? Figure: query representations for a TPC-H workload projected onto two dimensions using t-SNE; each color is a different TPC-H query template. The learned representations are at least minimally coherent.
  • 18. Error Prediction: big, real SQL workload. Each point is a query that generated an error. Random sample of 4200 error-generating queries over a 7-day period. Colors are selected error codes: OOM Error, Unknown Timezone in Date, Date Parse Error, Divide by Zero.
  • 19. Error Prediction. Clusters are repeated syntactic patterns in the workload; they’re meaningful
  • 20. DOES THIS ACTUALLY WORK?
  • 21. Datasets used
      ● Datasets for training embedders
          Workload                Total Queries    Distinct Queries
          Snowflake               500000           175958
          TPC-H                   4200             2180
      ● Datasets for training classifiers
          Workload                Total Queries    Distinct Queries
          Snowflake-MultiError    100000           17311
          Snowflake-OOM           4491             2501
  • 22. Predicting OOM Errors
          Method                                       Precision   Recall   f1-score
          Contains heavy joins                         0.729       0.115    0.198
          Contains window functions                    0.762       0.377    0.504
          Contains heavy joins OR window functions     0.724       0.403    0.518
          Contains heavy joins AND window functions    0.931       0.162    0.162
          Query2Vec-LSTM                               0.983       0.977    0.980
          Query2Vec-Doc2Vec                            0.919       0.823    0.869
  • 23. Predicting Other Errors (~90% P/R)
          ErrorCode        Precision   Recall   f1-score   #queries
          -1 (No Error)    0.986       0.992    0.989      7464
          604              0.878       0.927    0.902      1106
          606              0.929       0.578    0.712      45
          608              0.996       0.993    0.995      3119
          630              0.894       0.864    0.879      88
          2031             0.765       0.667    0.712      39
          90030            1           0.998    0.999      1529
          100035           1           0.71     0.83       31
          100037           1           0.417    0.588      12
          100038           0.981       0.968    0.975      1191
          100040           0.952       0.833    0.889      48
          100046           1           0.923    0.96       13
          100051           0.941       0.913    0.927      104
          100069           0.857       0.5      0.632      12
          100071           0.857       0.5      0.632      12
          100078           1           0.974    0.987      77
          100094           0.833       0.921    0.875      38
          100097           0.923       0.667    0.774      18
  • 24. Security Audits: Predict user, compare with actual user
          #queries   #users   Accuracy
          73881      28       49.30%
          55333      10       37.40%
          18487      46       31.80%
          5471       21       96.20%
          4213       6        58.50%
          3894       12       99.70%
          3373       9        99.80%
          2867       6        99.80%
          1953       15       89.10%
          1924       4        98.10%
          1776       9        95.20%
          1699       5        99.80%
          1108       12       98.20%

          Method              Account Labeling   User Labeling
          Doc2Vec             78.8%              39%
          LSTM Autoencoder    99.1%              55.4%
  • 25. Workload Summarization for Index Recommendation. Diagram: Workload (a lot of queries) → Apply Filters (Account_name = 'xyz') → Uniform Sample → Output Workload (100 queries)
  • 26. Workload Summarization for Index Recommendation. Diagram: Workload (a lot of queries) → Apply Filters (Account_name = 'xyz') → Summarization using query vectors → Output Workload (100 queries)
      ** Jiaqi Yan, Qiuye Jin, Shrainik Jain, Stratis D. Viglas, Allison Lee, “Snowtrail: Testing with Production Queries on a Cloud Database”, DBTEST 2018
      ** Jiaqi Yan, Qiuye Jin, Shrainik Jain, Stratis D. Viglas, Allison Lee, “Snowtrail: Testing with Production Queries on a Cloud Database”, US Patent Application No. 62/646,817
  • 27. Evaluation of workload summary: index recommendation
      ○ Run the full workload with no indexes, record the time (t1)
      ○ Recommend and create indexes on the FULL workload
      ○ Run the full workload again, record the time (t2)
      ○ Generate small workload summary
      ○ Recommend and create indexes on the SUMMARY workload
      ○ Run the full workload again, record the time (t3)
      ○ Set a time budget for the recommender
  • 28. Workload Summarization for Index Selection. Transfer learning: we can even learn the model on the Snowflake workload and use it to infer representations for the TPC-H workload.
  • 29. How good is this summary?
  • 30. Querc: Query Classifier. Reuse embeddings where possible. Collect training labels from the databases (cost, error codes). Retrain models periodically, or online.
  • 31. Last slide (Shrainik Jain)
      ● Every workload management task is query labeling
      ● You don’t need fancy features
      ● You can’t maintain fancy features anyway
      ● SQL strings (and plans) have a lot of signal
      ● There is tons of training data
      ● Your workload is not “all possible queries” – use the patterns
      ● Transfer learning works – you can train on one workload and use on another
      ● Opens up a lot of simple interesting little applications
        ○ User behavior modeling, resource allocation, …
      ● External “query labeling service” keeps everything organized
  • 32. Query recommendation: Predict next query in a session
  • 33. Up is good. Learned features are about as good as manual features, even with generous assumptions.
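
Slides 13 and 26 describe mapping each query to a vector and then summarizing a workload using those vectors. The sketch below illustrates that idea under loud assumptions. The talk uses learned embeddings (Doc2Vec, LSTM autoencoders); this stand-in instead hashes tokens into a fixed-size count vector. The greedy farthest-point selection is one plausible way to pick a diverse summary, not necessarily the method used in the talk. The names `embed`, `distance`, `summarize`, and `K` are hypothetical.

```python
import math
import re
import zlib

K = 16  # dimensionality of the toy vector space (an assumption)


def embed(query, k=K):
    """Toy stand-in for a learned embedder: hashed bag of tokens, L2-normalized."""
    vec = [0.0] * k
    # Identifiers become one token each; any other non-space character is its own token.
    for token in re.findall(r"[a-z_][a-z_0-9.]*|\S", query.lower()):
        vec[zlib.crc32(token.encode()) % k] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize(queries, size):
    """Greedy farthest-point selection: keep queries that are far apart in vector space."""
    if not queries or size <= 0:
        return []
    vectors = [embed(q) for q in queries]
    chosen = [0]  # start from the first query
    # min_dist[i] = distance from query i to its nearest already-chosen query
    min_dist = [distance(v, vectors[0]) for v in vectors]
    while len(chosen) < min(size, len(queries)):
        nxt = max(range(len(queries)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # everything left has the same vector as a chosen query
        chosen.append(nxt)
        min_dist = [min(d, distance(v, vectors[nxt])) for d, v in zip(min_dist, vectors)]
    return [queries[i] for i in chosen]
```

Swapping `embed` for a learned model would leave `summarize` unchanged, which is the point of a shared query representation: the downstream task only sees vectors.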

Editor's notes

  1. I will be using analytics and management interchangeably. Management means operationalizing a set of analysis and decision tasks. Predicting Cache Performance: [Sapia 2000, Dan et al. 1995]. Modeling User Behavior: [Yu et al. 1992, Tran et al. 2015, Jain et al. 2016].
  2. I will be using analytics and management interchangeably. Management means operationalizing a set of analysis and decision tasks. Predicting Cache Performance: [Sapia 2000, Dan et al. 1995]. Modeling User Behavior: [Yu et al. 1992, Tran et al. 2015, Jain et al. 2016].
  3. Also called: embedding, vector representation, distributed representation. Low-hanging fruit for Shrainik’s PhD; this is already a solved problem in NLP.
  4. Also called: embedding, vector representation, distributed representation. Low-hanging fruit for Shrainik’s PhD; this is already a solved problem in NLP.
  5. Why just stop at these? "Heavy" means the query involves the top 20 tables by size.
  6. Why just stop at these?
  7. Why just stop at these?
  8. (Or Better-than-random sampling for an application within Snowflake**, maybe)
  9. Let's compare against the gold standard. Caveat: SQL Server does summarization no matter what. We couldn’t find a way to turn this off.
  10. System architecture. Queries arrive for three different applications X, Y, and Z and are processed by one or more (embedder, labeler) pairs before being sent on to the database, centralized for offline labeling tasks, or both.
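
The architecture in note 10, together with the view of workload management as a set of query labeling functions, can be sketched as a chain of (embedder, labeler) pairs that each attach a label to a query before it is routed. This is a minimal illustration and not Querc's actual interface: `LabeledQuery`, `run_pipeline`, `STAGES`, the keyword-based `toy_embedder`, and both labelers are hypothetical stand-ins for learned models. The two toy rules only echo the "heavy joins" and "window functions" baselines from the OOM results.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Embedder = Callable[[str], Vector]
Labeler = Callable[[Vector], str]


@dataclass
class LabeledQuery:
    sql: str
    labels: Dict[str, str] = field(default_factory=dict)


def toy_embedder(sql: str) -> Vector:
    """Hypothetical embedder: [number of JOIN keywords, 1.0 if a window function appears]."""
    text = sql.upper()
    has_window = " OVER " in text or " OVER(" in text
    return [float(text.count(" JOIN ")), 1.0 if has_window else 0.0]


def heavy_labeler(vec: Vector) -> str:
    """Toy rule standing in for a trained classifier: two or more joins means heavy."""
    return "heavy" if vec[0] >= 2 else "normal"


def error_labeler(vec: Vector) -> str:
    """Toy rule standing in for a trained error predictor."""
    return "error" if vec[1] > 0 else "no error"


def run_pipeline(sql: str, stages: List[Tuple[str, Embedder, Labeler]]) -> LabeledQuery:
    """Apply each (embedder, labeler) pair, computing each embedding at most once."""
    query = LabeledQuery(sql)
    cache: Dict[Embedder, Vector] = {}
    for name, embedder, labeler in stages:
        if embedder not in cache:  # reuse embeddings where possible
            cache[embedder] = embedder(sql)
        query.labels[name] = labeler(cache[embedder])
    return query


# Two labeling tasks that share one embedder.
STAGES = [
    ("size", toy_embedder, heavy_labeler),
    ("error", toy_embedder, error_labeler),
]
```

A router could then inspect `run_pipeline(sql, STAGES).labels` to choose a destination such as a big cluster or an instrumented cluster, and the same labeled queries could be logged centrally for offline tasks.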