Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2OKHJt9.
Prasanna Rajaperumal presents Snowflake’s clustering capabilities, including their algorithm for incremental maintenance of approximate clustering of partitioned tables, as well as their infrastructure to perform such maintenance automatically. He also covers some real-world problems they run into and their solutions. Filmed at qconlondon.com.
Prasanna Rajaperumal is a senior engineer at Snowflake, working on the Snowflake database's query engine. Before Snowflake, he worked on building the next-generation data infrastructure at Uber. Before that, over the last decade, he built data systems that scale at Cloudera, Cisco and a few other companies.
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/snowflake-automatic-clustering/
Presented at QCon London
www.qconlondon.com
6. TABLE DATA
Partitioned horizontally into files (!)
Columnar (Hybrid) storage (PAX)
Column values grouped
Compressed
Header contains index of column start
Name Age Country
Sam 30 USA
Trevor 35 Canada
Anna 38 England
Raj 40 India
Row Oriented Storage | Column Oriented Storage | Hybrid Columnar Storage
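The hybrid columnar (PAX) idea on this slide can be sketched in a few lines: all rows of one file stay together, but inside the file the values of each column are grouped and compressed, and a header records where each column starts. The byte layout, the JSON encoding, and the names `write_pax_file` and `read_column` below are assumptions made for illustration; they are not Snowflake's actual file format.

```python
import json
import struct
import zlib


def write_pax_file(rows, columns):
    """Serialize rows into one hybrid-columnar (PAX-style) blob.

    Assumed layout, for illustration only:
    [4-byte header length][JSON header][column chunk 0][column chunk 1]...
    """
    # Group the values of each column together and compress each group.
    chunks = []
    for i, _ in enumerate(columns):
        values = [row[i] for row in rows]
        chunks.append(zlib.compress(json.dumps(values).encode("utf-8")))

    # The header is an index of where each column chunk starts.
    index = {}
    offset = 0
    for name, chunk in zip(columns, chunks):
        index[name] = {"start": offset, "length": len(chunk)}
        offset += len(chunk)

    header = json.dumps(index).encode("utf-8")
    return struct.pack("<I", len(header)) + header + b"".join(chunks)


def read_column(blob, name):
    """Read a single column by seeking to its chunk via the header index."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    index = json.loads(blob[4:4 + header_len].decode("utf-8"))
    body_start = 4 + header_len
    entry = index[name]
    start = body_start + entry["start"]
    chunk = blob[start:start + entry["length"]]
    return json.loads(zlib.decompress(chunk).decode("utf-8"))


columns = ["Name", "Age", "Country"]
rows = [
    ("Sam", 30, "USA"),
    ("Trevor", 35, "Canada"),
    ("Anna", 38, "England"),
    ("Raj", 40, "India"),
]
blob = write_pax_file(rows, columns)
ages = read_column(blob, "Age")  # only the Age chunk is decompressed
```

A query that touches only `Age` reads the header and one chunk, which is the benefit of grouping column values inside each horizontally partitioned file.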
7. TABLE DATA
Immutable
File is the unit of:
Update: every change is files deleted and files added
Concurrency
Sized to be a few 10s of MB at most
10s of millions of files
File 1:
Sam 30 USA
Trevor 35 Canada
Update table T set age = 45 where name = 'Trevor'
File 2:
Sam 30 USA
Trevor 45 Canada
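The update on this slide can be sketched as a copy-on-write operation: no file is modified in place, and the change is expressed as a set of deleted files plus a set of added files. The `DataFile` class and the `update_age` helper are hypothetical names used only for this sketch, and the file-numbering scheme is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """An immutable file: its rows can never change after creation."""
    file_id: int
    rows: tuple  # tuple of (name, age, country) tuples


def update_age(table, target_name, new_age):
    """Apply `set age = new_age where name = target_name`.

    Returns (new_table, deleted, added). Existing files are never mutated;
    every file containing a matching row is replaced by a rewritten copy.
    """
    next_id = max(f.file_id for f in table) + 1
    deleted, added = [], []
    for f in table:
        if any(name == target_name for name, _, _ in f.rows):
            new_rows = tuple(
                (name, new_age if name == target_name else age, country)
                for name, age, country in f.rows
            )
            deleted.append(f)
            added.append(DataFile(next_id, new_rows))
            next_id += 1
    new_table = [f for f in table if f not in deleted] + added
    return new_table, deleted, added


file1 = DataFile(1, (("Sam", 30, "USA"), ("Trevor", 35, "Canada")))
table = [file1]
new_table, deleted, added = update_age(table, "Trevor", 45)
```

Note that the unchanged row for Sam is rewritten too, because the file, not the row, is the unit of update. A reader still holding the old table version keeps seeing File 1 untouched.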
8. METADATA
Stats on every column for every file
Zone maps (Netezza)
Min/Max, #distinct, #nulls etc
Improve Query performance
Pruning: Reduce scan
File1
Id: 3, 5, 9
date: 4/1, 4/2, 4/3
File2
Id: 6, 1, 10
date: 4/4, 4/3, 4/4
File3
Id: 7, 9, 10
date: 4/5, 4/5, 4/5
File4
Id: 1, 9, 10
date: 4/5, 4/6, 4/6
File Name  Col   Min  Max
File1      Id    3    9
File1      date  4/1  4/3
File2      Id    1    10
File2      date  4/3  4/4
File3      Id    7    10
File3      date  4/5  4/5
File4      Id    1    10
File4      date  4/5  4/6
9. INDEX COMPARISON
Why not B-Trees?
Uni-dimensional
Index Maintenance
IO Cost on large queries