SlideShare una empresa de Scribd logo
1 de 43
© 2019 Snowflake Computing Inc. All Rights Reserved
AUTOMATIC
CLUSTERING
PRASANNA RAJAPERUMAL I MARCH 2019
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
snowflake-automatic-clustering/
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
© 2018 Snowflake Computing Inc. All Rights Reserved
SNOWFLAKE
2
Our vision
Allow our customers to access all their
data in one place so they can make
actionable decisions anytime, anywhere,
with any number of users.
Our solution
Next-generation data warehouse
built from the ground up for the
cloud to address today’s data and
analytics challenges.
Delivered as
a service
Built for
the cloud
SQL Data
Warehouse
• Memory Management
• Degree of parallelism
• Failure recovery
• Scale Clusters
….
• Clustering
Automatic
ARCHITECTURE
Shared disk architecture
Data stored in S3
Metadata stored separately
Compute on-demand
TABLE DATA
Partitioned horizontally into files (!)
Columnar (Hybrid) storage (PAX)
Column values grouped
Compressed
Header contains index of column start
Name Age Country
Sam 30 USA
Trevor 35 Canada
Anna 38 England
Raj 40 India
Row Oriented StorageColumn Oriented StorageHybrid Columnar Storage
TABLE DATA
Immutable
File is a unit of
Update
Every change is
files deleted and files added
Concurrency
Sized to be few 10’s of MB at the most
10’s of millions of files
Sam 30 USA
Trevor 35 Canada
Sam 30 USA
Trevor 45 Canada
File 1:
Update table T set age = 45 where
name = “Trevor”
File 2:
METADATA
Stats on every column for every file
Zone maps (Netezza)
Min/Max, #distinct, #nulls etc
Improve Query performance
Pruning: Reduce scan
File1
Id: 3, 5, 9
date: 4/1, 4/2, 4/3
File2
Id: 6, 1, 10
date: 4/4, 4/3, 4/4
File3
Id: 7, 9, 10
date: 4/5, 4/5, 4/5
File4
Id: 1, 9, 10
date: 4/5, 4/6, 4/6
File
Name
Col Min Max
File1 Id
date
3
4/1
9
4/3
File2 Id
date
1
4/3
10
4/4
File3 Id
date
7
4/5
10
4/5
File4 Id
date
1
4/5
10
4/6
INDEX COMPARISON
Why not B-Trees?
Uni-dimensional
Index Maintenance
IO Cost on large queries
source
© 2019 Snowflake Computing Inc. All Rights Reserved
DEFAULT CLUSTERING
8
Partitioned on ingestion
• Based on physical property: size
• Clustered based on load time
• Not optimized for other dimensions
01/01/2018
02/01/2018
03/01/2018
04/01/2018
05/01/2018
06/01/2018
07/01/2018
08/01/2018
09/01/2018
10/01/2018
11/01/2018
12/01/2018
Pruned
Partitions
Query:
Sept-Dec
Select … from orders where
date between
’09-01-2018’ and ’12-01-2018’
Select … from orders where
product = ‘IPhone’
© 2019 Snowflake Computing Inc. All Rights Reserved
PARTITIONING COMPARISON
9
Hash based
• Not efficient for range scans
Range based
• Strict boundaries
• Range Splitting when it gets too big
Snowflake
• Key not used for data distribution
• Overlaps in ranges are okay
• Achieve good pruning with zone maps (min/max)
Partition 1 Partition 2 Partition 3
Hash (key)
Partition 1 Partition 2 Partition 3
< Key <
0 - 25 26 - 50 51 - 75
Partition 1
Partition 2
Partition 3
< Column <
0 - 30
20 - 60
50 - 75
© 2019 Snowflake Computing Inc. All Rights Reserved
DEFAULT CLUSTERED TABLE
10
Micro-partition 1 Micro-partition 2 Micro-partition 3 Micro-partition 4
Name
Country
Rows 1-6 Rows 7-12 Rows 13-18 Rows 19-24
2
4
3
2
3
2
3
2
4
5
1
5
2
4
2
2
5
3
1
4
5
5
3
2
A
C
C
B
A
C
Z
B
C
X
A
A
X
Z
Y
B
X
A
C
Z
Y
B
X
Z
ID Name Country Date
UK
SP
DE
DE
FR
SP
DE
UK
NL
FR
NL
FR
FR
NL
SP
SP
DE
UK
FR
NL
SP
SP
DE
UK
11/2
11/2
11/2
11/2
11/2
11/2
11/2
11/2
11/2
11/3
11/3
11/3
11/2
11/2
11/2
11/3
11/3
11/4
11/3
11/4
11/4
11/5
11/5
11/5
Query Select count(*) from T where id = 2 and date = ’11/2’;
Natural ordering by date
11/2 11/2 11/2
11/2 11/2 11/2
11/2 11/2 11/2
11/3 11/3 11/3
11/2 11/2 11/2
11/3 11/3 11/4
11/3 11/4 11/4
11/5 11/5 11/5
Date
ID
2 4 3
2 4 3
3 2 4
5 1 5
2 4 2
2 5 3
1 4 5
5 3 2
Scans 3 partitions
© 2019 Snowflake Computing Inc. All Rights Reserved
EXPLICIT CLUSTERING
AUTOMATIC CLUSTERING
Automatically reorganizes table storage
from natural order to
Clustering keys order
Clustering Keys
Identify interesting
expressions
Should be order
preserving
© 2019 Snowflake Computing Inc. All Rights Reserved
PROBLEM DEFINTION
CONTINIOUS SORTING
AT PETABYTE SCALE
© 2019 Snowflake Computing Inc. All Rights Reserved
EXPLICITLY CLUSTERED TABLE
13
Micro-partition 1 Micro-partition 2 Micro-partition 3 Micro-partition 4
Name
Country
Rows 1-6 Rows 7-12 Rows 13-18 Rows 19-24
2
4
3
2
3
2
3
2
4
5
1
5
2
4
2
2
5
3
1
4
5
5
3
2
A
C
C
B
A
C
Z
B
C
X
A
A
X
Z
Y
B
X
A
C
Z
Y
B
X
Z
ID Name Country Date
UK
SP
DE
DE
FR
SP
DE
UK
NL
FR
NL
FR
FR
NL
SP
SP
DE
UK
FR
NL
SP
SP
DE
UK
11/2
11/2
11/2
11/2
11/2
11/2
11/2
11/2
11/2
11/3
11/3
11/3
11/2
11/2
11/2
11/3
11/3
11/4
11/3
11/4
11/4
11/5
11/5
11/5
11/2 11/2 11/2
11/2 11/2 11/2
11/2 11/2 11/2
11/2 11/2 11/2
11/3 11/3 11/3
11/3 11/3 11/3
11/4 11/4 11/4
11/5 11/5 11/5
Date
ID
2 2 2
2 2 2
3 3 3
4 4 4
5 5 5
1 2 1
3 4 5
5 3 2
Clustering keys (date, id) Scans only 1 partition
Query Select count(*) from T where id = 2 and date = ’11/2’;
© 2019 Snowflake Computing Inc. All Rights Reserved
WHY EXPLICIT CLUSTERING?
GOOD
CLUSTERING
BETTER
PRUNING
QUERY
PERFORMANCE
JOIN
OPTIMIZATIONS
BETTER
REDUCTION
© 2019 Snowflake Computing Inc. All Rights Reserved
CHALLENGES: KEEPING UP WITH CHANGES
15
05/01/2018,
[0, 100]
New File
[0, 10)
[10, 20)
[20, 30)
[30, 40)
[40, 50)
[50, 60)
[60, 70)
[70, 80)
[80, 90)
[90, 100)
[0, 10)
[10, 20)
[20, 30)
[30, 40)
[40, 50)
[50, 60)
[60, 70)
[70, 80)
[80, 90)
[90, 100)
NEW DATA ADDED
PARTITION REWRITE
Original Immutable Files All Files Updated
© 2019 Snowflake Computing Inc. All Rights Reserved
APPROACHES TO KEEP UP
16
Batch re-cluster
Re-cluster inline
with the changes
Extremely expensive
Block other changes
Huge variation in Query
performance
Not practical on PB tables
Full table re-write
High write amplification
Bound by re-cluster speed
Not practical for TB/PB
tables
© 2019 Snowflake Computing Inc. All Rights Reserved
REQUIREMENTS
17
Actual sort order is not a requirement
Goal – Good pruning with min/max
Overlap of value ranges are acceptable
Background Service
Find incremental work to improve query performance
Should not interfere with other changes
Can change clustering keys
© 2019 Snowflake Computing Inc. All Rights Reserved
PROBLEM DEFINTION
CONTINIOUS SORTING
AT PETABYTE SCALE
Approximate
© 2019 Snowflake Computing Inc. All Rights Reserved
CLUSTERING METRICS - WIDTH
19
Clustering value range
Width of a partition
Width of the line connecting the
min and max value in the
clustering values domain range
1 100
Sam 18 USA
Trevor 78 Canada
Anna 30 England
Raj 35 India
File1 : Wide File
File2 : Narrow File
Clustering key = AGE
18 78
File1 Width = 60
File 2 Width = 5
30 35
© 2019 Snowflake Computing Inc. All Rights Reserved
CLUSTERING METRICS - DEPTH
20
Clustering value range
Depth at a single value (or range)
Number of partitions overlapping at
a certain value in the clustering key
domain value range
1 100
Sam 18 USA
Trevor 78 Canada
Anna 30 England
Raj 35 India
File1
File2
Clustering key = AGE
18 78
30 35
Query: 32
Depth: 3
27 100
Query: 70
Depth: 2
Query: 20
Depth: 1
© 2019 Snowflake Computing Inc. All Rights Reserved
GOAL
Reduce Worst Clustering Depth
below an
acceptable threshold
to get
Predictable Query Performance
© 2019 Snowflake Computing Inc. All Rights Reserved
CLUSTERING SCENARIO
A B C D E F G H I J K L
Clustering Value Range
A B B C D
D E F G H
I J J K L
A C E H K
depth=2 depth=2
width=11
Start State
width=4
width=5
width=4
© 2019 Snowflake Computing Inc. All Rights Reserved
CLUSTERING SCENARIO
A B C D E F G H I J K L
A B B C D
D E F G H
I J J K L
A C E H K
depth=3 depth=4
width=11
Iteration 1: Pick 3 files to recluster
width=4
width=5
width=4
C F B E H
E I D L K
width=8
width=9
Clustering Value Range
© 2019 Snowflake Computing Inc. All Rights Reserved
CLUSTERING SCENARIO
A B C D E F G H I J K L
A B B C D
D E F G H
I J J K L
depth=2 depth=2
Improved Clustering state
Iteration 2:
width=4
width=5
width=4
E E E F H
H H I K L
A B C C D width=4
width=4
width=5
Clustering Value Range
© 2019 Snowflake Computing Inc. All Rights Reserved
INSIGHT
Reduce Worst Depth by
Reducing overlapping
(reduce width)
© 2019 Snowflake Computing Inc. All Rights Reserved 28
FILE SELECTION: INTUITION
Clustering Value Range
ClusteringDepth
© 2019 Snowflake Computing Inc. All Rights Reserved
FILE SELECTION: INTUITION
29
Peak1
Clustering Value Range
ClusteringDepth
Target Depth
Peak2
© 2019 Snowflake Computing Inc. All Rights Reserved
AFTER RECLUSTERING
30
Partitions selected
for clustering
Clustering Value Range
ClusteringDepth
© 2019 Snowflake Computing Inc. All Rights Reserved
ALGORITHM OUTLINE
31
Re-cluster
Execution
Optimistic
Commit
Partition
Batches
CHANGES
Partition
Selection
Stop
Keep re-clustering
until table is well
clustered enough
New Changes triggers
partition selection
Decouple Partition
Selection from
Execution
Divide work into
batches
© 2019 Snowflake Computing Inc. All Rights Reserved
FILE SELECTION ALGORITHM INTERNALS
Intuition: Work on the peaks
• Sort the micro-partition endpoints
• First pass: Compute peak ranges and calculate the
number of overlapping micro-partitions
• Stabbing count array
• Second pass: For the peak ranges – compute list of
partitions ordered by depth
• MinMaxPriorityQueue
© 2019 Snowflake Computing Inc. All Rights Reserved
CLUSTERING LEVELS
33
Clustering Key Domain
LEVEL N…
Reduces clustering width with levels
More partitions with shorter range of data
LEVEL 2:
LEVEL 3:
LEVEL 0:
LEVEL 1:
© 2019 Snowflake Computing Inc. All Rights Reserved
REVIEW
Wake up when data arrives
Split the table files into levels (clustering state)
Run clustering partition selection algorithm on each level
Queue up “recluster task” for execution
Scale up and execute re-cluster tasks
• Sized just right – optimized not to spill
• Non-blocking – optimistic commit
If clustering depth is not “good enough” repeat
Else sleep
© 2019 Snowflake Computing Inc. All Rights Reserved
EFFECT ON QUERY PERFORMANCE
35
Time
Average Query Response time
Background Recluster tasks
Files Added
T0 T1 T2 T3 T4 T5
© 2019 Snowflake Computing Inc. All Rights Reserved
FUTURE WORK
36
Multi Dimension Keys
• “Fake wide partition” problem
• High cardinality column renders
subsequent columns useless
• Avoid excessive churn
Manual cluster keys
Partition selection based on
usage
Other linearization functions
• Z-order, Grey-order,
Hilbert curves
Analyze workload to pick
best clustering keys
automatically
Partition selection Phase 2
© 2019 Snowflake Computing Inc. All Rights Reserved
JOIN US!
One of the “Best Places to Work” every year
Where else can you compete with Amazon, Google and Microsoft? ☺
Ambitious Projects Smart People Immediate Impact Fun Culture
THANK YOU
© 2018 Snowflake Computing Inc. All Rights Reserved
© 2019 Snowflake Computing Inc. All Rights Reserved
Jiaqi Yan
© 2019 Snowflake Computing Inc. All Rights Reserved
“FAKE WIDE PARTITIONS”
40
A B C D
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5Actual Min/Max Column Metadata
Min/Max
A-1 A-3 A-A 1-3
A-5 B-1 A-B 1-5
B-2 B-4 B-B 2-4
B-4 C-2 B-C 2-4
C-3 C-4 C-C 3-4
C-5 D-2 C-D 2-5
© 2019 Snowflake Computing Inc. All Rights Reserved
CASE STUDY
41
> 1PB table, trickle load every 3 minutes
Provides dashboard service to customers
Sub-second query time SLA
Query by application IDs
Huge skews cross applications
Cluster by (app_id, time, name)
© 2019 Snowflake Computing Inc. All Rights Reserved
DASHBOARD
42
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
snowflake-automatic-clustering/

Más contenido relacionado

Más de C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

Más de C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Último

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Automatic Clustering at Snowflake

  • 1. © 2019 Snowflake Computing Inc. All Rights Reserved AUTOMATIC CLUSTERING PRASANNA RAJAPERUMAL I MARCH 2019
  • 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ snowflake-automatic-clustering/ • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. © 2018 Snowflake Computing Inc. All Rights Reserved SNOWFLAKE 2 Our vision Allow our customers to access all their data in one place so they can make actionable decisions anytime, anywhere, with any number of users. Our solution Next-generation data warehouse built from the ground up for the cloud to address today’s data and analytics challenges. Delivered as a service Built for the cloud SQL Data Warehouse • Memory Management • Degree of parallelism • Failure recovery • Scale Clusters …. • Clustering Automatic
  • 5. ARCHITECTURE Shared disk architecture Data stored in S3 Metadata stored separately Compute on-demand
  • 6. TABLE DATA Partitioned horizontally into files (!) Columnar (Hybrid) storage (PAX) Column values grouped Compressed Header contains index of column start Name Age Country Sam 30 USA Trevor 35 Canada Anna 38 England Raj 40 India Row Oriented StorageColumn Oriented StorageHybrid Columnar Storage
  • 7. TABLE DATA Immutable File is a unit of Update Every change is files deleted and files added Concurrency Sized to be few 10’s of MB at the most 10’s of millions of files Sam 30 USA Trevor 35 Canada Sam 30 USA Trevor 45 Canada File 1: Update table T set age = 45 where name = “Trevor” File 2:
  • 8. METADATA Stats on every column for every file Zone maps (Netezza) Min/Max, #distinct, #nulls etc Improve Query performance Pruning: Reduce scan File1 Id: 3, 5, 9 date: 4/1, 4/2, 4/3 File2 Id: 6, 1, 10 date: 4/4, 4/3, 4/4 File3 Id: 7, 9, 10 date: 4/5, 4/5, 4/5 File4 Id: 1, 9, 10 date: 4/5, 4/6, 4/6 File Name Col Min Max File1 Id date 3 4/1 9 4/3 File2 Id date 1 4/3 10 4/4 File3 Id date 7 4/5 10 4/5 File4 Id date 1 4/5 10 4/6
  • 9. INDEX COMPARISON Why not B-Trees? Uni-dimensional Index Maintenance IO Cost on large queries source
  • 10. © 2019 Snowflake Computing Inc. All Rights Reserved DEFAULT CLUSTERING 8 Partitioned on ingestion • Based on physical property: size • Clustered based on load time • Not optimized for other dimensions 01/01/2018 02/01/2018 03/01/2018 04/01/2018 05/01/2018 06/01/2018 07/01/2018 08/01/2018 09/01/2018 10/01/2018 11/01/2018 12/01/2018 Pruned Partitions Query: Sept-Dec Select … from orders where date between ’09-01-2018’ and ’12-01-2018’ Select … from orders where product = ‘IPhone’
  • 11. © 2019 Snowflake Computing Inc. All Rights Reserved PARTITIONING COMPARISON 9 Hash based • Not efficient for range scans Range based • Strict boundaries • Range Splitting when it gets too big Snowflake • Key not used for data distribution • Overlaps in ranges are okay • Achieve good pruning with zone maps (min/max) Partition 1 Partition 2 Partition 3 Hash (key) Partition 1 Partition 2 Partition 3 < Key < 0 - 25 26 - 50 51 - 75 Partition 1 Partition 2 Partition 3 < Column < 0 - 30 20 - 60 50 - 75
  • 12. © 2019 Snowflake Computing Inc. All Rights Reserved DEFAULT CLUSTERED TABLE 10 Micro-partition 1 Micro-partition 2 Micro-partition 3 Micro-partition 4 Name Country Rows 1-6 Rows 7-12 Rows 13-18 Rows 19-24 2 4 3 2 3 2 3 2 4 5 1 5 2 4 2 2 5 3 1 4 5 5 3 2 A C C B A C Z B C X A A X Z Y B X A C Z Y B X Z ID Name Country Date UK SP DE DE FR SP DE UK NL FR NL FR FR NL SP SP DE UK FR NL SP SP DE UK 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/3 11/3 11/3 11/2 11/2 11/2 11/3 11/3 11/4 11/3 11/4 11/4 11/5 11/5 11/5 Query Select count(*) from T where id = 2 and date = ’11/2’; Natural ordering by date 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/3 11/3 11/3 11/2 11/2 11/2 11/3 11/3 11/4 11/3 11/4 11/4 11/5 11/5 11/5 Date ID 2 4 3 2 4 3 3 2 4 5 1 5 2 4 2 2 5 3 1 4 5 5 3 2 Scans 3 partitions
  • 13. © 2019 Snowflake Computing Inc. All Rights Reserved EXPLICIT CLUSTERING AUTOMATIC CLUSTERING Automatically reorganizes table storage from natural order to Clustering keys order Clustering Keys Identify interesting expressions Should be order preserving
  • 14. © 2019 Snowflake Computing Inc. All Rights Reserved PROBLEM DEFINTION CONTINIOUS SORTING AT PETABYTE SCALE
  • 15. © 2019 Snowflake Computing Inc. All Rights Reserved EXPLICITLY CLUSTERED TABLE 13 Micro-partition 1 Micro-partition 2 Micro-partition 3 Micro-partition 4 Name Country Rows 1-6 Rows 7-12 Rows 13-18 Rows 19-24 2 4 3 2 3 2 3 2 4 5 1 5 2 4 2 2 5 3 1 4 5 5 3 2 A C C B A C Z B C X A A X Z Y B X A C Z Y B X Z ID Name Country Date UK SP DE DE FR SP DE UK NL FR NL FR FR NL SP SP DE UK FR NL SP SP DE UK 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/3 11/3 11/3 11/2 11/2 11/2 11/3 11/3 11/4 11/3 11/4 11/4 11/5 11/5 11/5 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/2 11/3 11/3 11/3 11/3 11/3 11/3 11/4 11/4 11/4 11/5 11/5 11/5 Date ID 2 2 2 2 2 2 3 3 3 4 4 4 5 5 5 1 2 1 3 4 5 5 3 2 Clustering keys (date, id) Scans only 1 partition Query Select count(*) from T where id = 2 and date = ’11/2’;
  • 16. © 2019 Snowflake Computing Inc. All Rights Reserved WHY EXPLICIT CLUSTERING? GOOD CLUSTERING BETTER PRUNING QUERY PERFORMANCE JOIN OPTIMIZATIONS BETTER REDUCTION
  • 17. © 2019 Snowflake Computing Inc. All Rights Reserved CHALLENGES: KEEPING UP WITH CHANGES 15 05/01/2018, [0, 100] New File [0, 10) [10, 20) [20, 30) [30, 40) [40, 50) [50, 60) [60, 70) [70, 80) [80, 90) [90, 100) [0, 10) [10, 20) [20, 30) [30, 40) [40, 50) [50, 60) [60, 70) [70, 80) [80, 90) [90, 100) NEW DATA ADDED PARTITION REWRITE Original Immutable Files All Files Updated
  • 18. © 2019 Snowflake Computing Inc. All Rights Reserved APPROACHES TO KEEP UP 16 Batch re-cluster Re-cluster inline with the changes Extremely expensive Block other changes Huge variation in Query performance Not practical on PB tables Full table re-write High write amplification Bound by re-cluster speed Not practical for TB/PB tables
  • 19. © 2019 Snowflake Computing Inc. All Rights Reserved REQUIREMENTS 17 Actual sort order is not a requirement Goal – Good pruning with min/max Overlap of value ranges are acceptable Background Service Find incremental work to improve query performance Should not interfere with other changes Can change clustering keys
  • 20. © 2019 Snowflake Computing Inc. All Rights Reserved PROBLEM DEFINTION CONTINIOUS SORTING AT PETABYTE SCALE Approximate
  • 21. © 2019 Snowflake Computing Inc. All Rights Reserved CLUSTERING METRICS - WIDTH 19 Clustering value range Width of a partition Width of the line connecting the min and max value in the clustering values domain range 1 100 Sam 18 USA Trevor 78 Canada Anna 30 England Raj 35 India File1 : Wide File File2 : Narrow File Clustering key = AGE 18 78 File1 Width = 60 File 2 Width = 5 30 35
  • 22. © 2019 Snowflake Computing Inc. All Rights Reserved CLUSTERING METRICS - DEPTH 20 Clustering value range Depth at a single value (or range) Number of partitions overlapping at a certain value in the clustering key domain value range 1 100 Sam 18 USA Trevor 78 Canada Anna 30 England Raj 35 India File1 File2 Clustering key = AGE 18 78 30 35 Query: 32 Depth: 3 27 100 Query: 70 Depth: 2 Query: 20 Depth: 1
  • 23. © 2019 Snowflake Computing Inc. All Rights Reserved GOAL Reduce Worst Clustering Depth below an acceptable threshold to get Predictable Query Performance
  • 24. © 2019 Snowflake Computing Inc. All Rights Reserved CLUSTERING SCENARIO A B C D E F G H I J K L Clustering Value Range A B B C D D E F G H I J J K L A C E H K depth=2 depth=2 width=11 Start State width=4 width=5 width=4
  • 25. © 2019 Snowflake Computing Inc. All Rights Reserved CLUSTERING SCENARIO A B C D E F G H I J K L A B B C D D E F G H I J J K L A C E H K depth=3 depth=4 width=11 Iteration 1: Pick 3 files to recluster width=4 width=5 width=4 C F B E H E I D L K width=8 width=9 Clustering Value Range
  • 26. © 2019 Snowflake Computing Inc. All Rights Reserved CLUSTERING SCENARIO A B C D E F G H I J K L A B B C D D E F G H I J J K L depth=2 depth=2 Improved Clustering state Iteration 2: width=4 width=5 width=4 E E E F H H H I K L A B C C D width=4 width=4 width=5 Clustering Value Range
  • 27. © 2019 Snowflake Computing Inc. All Rights Reserved INSIGHT Reduce Worst Depth by Reducing overlapping (reduce width)
  • 28. © 2019 Snowflake Computing Inc. All Rights Reserved 28 FILE SELECTION: INTUITION Clustering Value Range ClusteringDepth
  • 29. © 2019 Snowflake Computing Inc. All Rights Reserved FILE SELECTION: INTUITION 29 Peak1 Clustering Value Range ClusteringDepth Target Depth Peak2
  • 30. © 2019 Snowflake Computing Inc. All Rights Reserved AFTER RECLUSTERING 30 Partitions selected for clustering Clustering Value Range ClusteringDepth
  • 31. © 2019 Snowflake Computing Inc. All Rights Reserved ALGORITHM OUTLINE 31 Re-cluster Execution Optimistic Commit Partition Batches CHANGES Partition Selection Stop Keep re-clustering until table is well clustered enough New Changes triggers partition selection Decouple Partition Selection from Execution Divide work into batches
  • 32. © 2019 Snowflake Computing Inc. All Rights Reserved FILE SELECTION ALGORITHM INTERNALS Intuition: Work on the peaks • Sort the micro-partition endpoints • First pass: Compute peak ranges and calculate the number of overlapping micro-partitions • Stabbing count array • Second pass: For the peak ranges – compute list of partitions ordered by depth • MinMaxPriorityQueue
  • 33. © 2019 Snowflake Computing Inc. All Rights Reserved CLUSTERING LEVELS 33 Clustering Key Domain LEVEL N… Reduces clustering width with levels More partitions with shorter range of data LEVEL 2: LEVEL 3: LEVEL 0: LEVEL 1:
  • 34. © 2019 Snowflake Computing Inc. All Rights Reserved REVIEW Wake up when data arrives Split the table files into levels (clustering state) Run clustering partition selection algorithm on each level Queue up “recluster task” for execution Scale up and execute re-cluster tasks • Sized just right – optimized not to spill • Non-blocking – optimistic commit If clustering depth is not “good enough” repeat Else sleep
  • 35. © 2019 Snowflake Computing Inc. All Rights Reserved EFFECT ON QUERY PERFORMANCE 35 Time Average Query Response time Background Recluster tasks Files Added T0 T1 T2 T3 T4 T5
  • 36. © 2019 Snowflake Computing Inc. All Rights Reserved FUTURE WORK 36 Multi Dimension Keys • “Fake wide partition” problem • High cardinality column renders subsequent columns useless • Avoid excessive churn Manual cluster keys Partition selection based on usage Other linearization functions • Z-order, Grey-order, Hilbert curves Analyze workload to pick best clustering keys automatically Partition selection Phase 2
  • 37. © 2019 Snowflake Computing Inc. All Rights Reserved JOIN US! One of the “Best Places to Work” every year Where else can you compete with Amazon, Google and Microsoft? ☺ Ambitious Projects Smart People Immediate Impact Fun Culture
  • 38. THANK YOU © 2018 Snowflake Computing Inc. All Rights Reserved
  • 39. © 2019 Snowflake Computing Inc. All Rights Reserved Jiaqi Yan
  • 40. © 2019 Snowflake Computing Inc. All Rights Reserved “FAKE WIDE PARTITIONS” 40 A B C D 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5Actual Min/Max Column Metadata Min/Max A-1 A-3 A-A 1-3 A-5 B-1 A-B 1-5 B-2 B-4 B-B 2-4 B-4 C-2 B-C 2-4 C-3 C-4 C-C 3-4 C-5 D-2 C-D 2-5
  • 41. © 2019 Snowflake Computing Inc. All Rights Reserved CASE STUDY 41 > 1PB table, trickle load every 3 minutes Provides dashboard service to customers Sub-second query time SLA Query by application IDs Huge skews cross applications Cluster by (app_id, time, name)
  • 42. © 2019 Snowflake Computing Inc. All Rights Reserved DASHBOARD 42
  • 43. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ snowflake-automatic-clustering/