Daniel Abadi -- Yale University
Column-Stores vs. Row-Stores:
How Different Are They
Really?
Daniel Abadi (Yale),
Samuel Madden (MIT),
Nabil Hachem (AvantGarde Consulting)
June 12th, 2008
Row vs. Column-Stores
[Figure: the same table (Last Name, First Name, E-mail, Phone #, Street Address) stored row by row and column by column]
Row-Store
+ Easy to add a new record
− Might read in unnecessary data
Column-Store
+ Only need to read in relevant data
− Tuple writes might require multiple seeks
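A minimal sketch of the two layouts, using made-up records with the five fields from the figure. All names here (`FIELDS`, `rows`, `columns`, and the helper functions) are illustrative assumptions, not part of the talk, and in-memory lists only stand in for on-disk pages.

```python
# Hypothetical records using the five fields shown in the figure.
FIELDS = ("last_name", "first_name", "email", "phone", "street")

# Row-store: each record is stored contiguously.
rows = [
    ("Smith", "Ann", "ann@example.com", "555-0100", "1 Elm St"),
    ("Jones", "Bob", "bob@example.com", "555-0101", "2 Oak St"),
]

# Column-store: each attribute is stored contiguously.
columns = {name: [r[i] for r in rows] for i, name in enumerate(FIELDS)}


def emails_from_rows(table):
    """Walks every whole record just to pull out one attribute."""
    position = FIELDS.index("email")
    return [record[position] for record in table]


def emails_from_columns(cols):
    """Reads only the one column the query needs."""
    return list(cols["email"])


def append_to_rows(table, record):
    """Adding a record is a single contiguous write."""
    table.append(record)


def append_to_columns(cols, record):
    """Adding a record is one write per column (on disk, possibly one seek each)."""
    for name, value in zip(FIELDS, record):
        cols[name].append(value)
```

Both layouts return the same answers; what differs is how much unrelated data a read touches and how many places a write lands.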
Column-Stores
• Really good for read-mostly data
warehouses
 Lots of column scans and aggregations
 Writes tend to be in batch
 [CK85], [SAB+05], [ZBN+05], [HLA+06],
[SBC+07] all verify this
 Top 3 in TPC-H rankings (Exasol, ParAccel,
and Kickfire) are column-stores
 Factor of 5 faster on performance
 Factor of 2 superior on price/performance
Data Warehouse DBMS
Software
• $4.5 billion industry (out of total $16 billion
DBMS software industry)
• Growing 10% annually
Momentum
• Right solution for growing market → $$$$
• Vertica, ParAccel, Kickfire, Calpont,
Infobright, and Exasol new entrants
• Sybase IQ’s profits rapidly increasing
• Yahoo’s data warehouse, the world’s largest
(multi-petabyte), is a column-store (from
Mahat Technologies acquisition)
Paper Looks At Key Question
• How much of the buzz around column-
stores is just marketing hype?
 Do you really need to buy Sybase
IQ/Vertica/ParAccel?
 How far will your current row-store take you?
 Can you get column-store performance from a row-
store?
 Can you simulate a column-store in a row-store?
Paper Methodology
• Comparing row-store vs. column-store is
dangerous/borderline meaningless
• Instead, compare row-store vs. row-store
and column-store vs. column-store
 Simulate a column-store inside of a row-store
 Remove column-oriented features from
column-store until it behaves like a row-store
Simulate Column-Store
Inside Row-Store
[Figure: a table with columns Last Name, First Name, E-mail, Phone #, Street Address, simulated two ways]
Option A: Vertical Partitioning
 One table per column (Last Name, First Name, E-mail, …), each row paired with a tuple ID (1, 2, 3)
Option B: Index Every Column
 Last Name Index, First Name Index, …
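The two options can be sketched as follows. This is only an illustration under assumptions: the sample rows and the function names (`vertical_partitions`, `column_indexes`, `join_on_tid`) are hypothetical, and Python dictionaries stand in for the row-store's tables and indexes.

```python
from collections import defaultdict

# Hypothetical three-row table; tuple IDs (TIDs) are 1, 2, 3 as in the figure.
table = [
    {"last_name": "Smith", "first_name": "Ann", "email": "ann@example.com"},
    {"last_name": "Jones", "first_name": "Bob", "email": "bob@example.com"},
    {"last_name": "Smith", "first_name": "Cy", "email": "cy@example.com"},
]


def vertical_partitions(rows):
    """Option A: one two-column (tid, value) table per attribute."""
    parts = defaultdict(list)
    for tid, row in enumerate(rows, start=1):
        for name, value in row.items():
            parts[name].append((tid, value))
    return dict(parts)


def column_indexes(rows):
    """Option B: one index per attribute, mapping value -> list of TIDs."""
    indexes = defaultdict(lambda: defaultdict(list))
    for tid, row in enumerate(rows, start=1):
        for name, value in row.items():
            indexes[name][value].append(tid)
    return {name: dict(index) for name, index in indexes.items()}


def join_on_tid(parts, names):
    """Rebuild the requested attributes by joining partitions on TID."""
    by_tid = defaultdict(dict)
    for name in names:
        for tid, value in parts[name]:
            by_tid[tid][name] = value
    return [by_tid[tid] for tid in sorted(by_tid)]
```

With Option A a query reads only the partitions it needs and joins them on TID; with Option B every access goes through a value-to-TID index.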
Star Schema Benchmark
• Fact table contains 17 columns and
60,000,000 rows
• 4 dimension tables, biggest one has
80,000 rows
• Queries touch 3-4 foreign keys in fact
table, 1-2 numeric columns
SSBM Averages
[Chart: average query time in seconds]
Normal Row-Store: 25.7
Vertically Partitioned Row-Store: 79.9
Row-Store With All Indexes: 221.2
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Reconstruction
Tuple Size
[Figure: each vertical partition stores (TID, Column Data) pairs for TIDs 1, 2, 3, and every stored tuple also carries a Tuple Header]
• Queries touch 3-4 foreign keys in fact table, 1-2 numeric
columns
• Complete fact table takes up ~4 GB (compressed)
• Vertically partitioned tables take up 0.7-1.1 GB
(compressed)
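The sizes on this slide can be turned into a rough back-of-the-envelope check. The sketch below assumes that every touched column is read as one full vertical partition, and that the fewest columns pair with the smallest partitions; the function and constant names are made up for illustration.

```python
def scan_gb(columns_touched, gb_per_partition):
    """Data scanned if each touched column is read as one full vertical partition."""
    return columns_touched * gb_per_partition


FULL_FACT_TABLE_GB = 4.0          # "~4 GB (compressed)" from the slide
PARTITION_GB = (0.7, 1.1)         # per-partition size range from the slide
COLUMNS_TOUCHED = (3 + 1, 4 + 2)  # 3-4 foreign keys plus 1-2 numeric columns

best_case = scan_gb(COLUMNS_TOUCHED[0], PARTITION_GB[0])   # 4 columns * 0.7 GB
worst_case = scan_gb(COLUMNS_TOUCHED[1], PARTITION_GB[1])  # 6 columns * 1.1 GB
```

Under these assumptions a query scans roughly 2.8 to 6.6 GB of vertical partitions, against about 4 GB for the complete fact table, so touching only a few of the 17 columns saves little I/O and can even cost more.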
Horizontal Partitioning
• Fact table horizontally partitioned on year
 Year is an element of the ‘Date’ dimension
table
 Most queries in SSBM have a predicate on
year
 Since vertically partitioned tables do not
contain the ‘Date’ foreign key, row-store could
not similarly partition them
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Construction
Tuple Construction
• Pretty much all queries require a column
to be extracted (in the SELECT clause)
that has not yet been accessed, e.g.:
 SELECT store_name, SUM(revenue)
FROM Facts, Stores
WHERE Facts.store_id = Stores.store_id
AND Stores.area = 'NEW ENGLAND'
GROUP BY store_name
Tuple Construction
• Result of lower part of query plan is a set
of TIDs that passed all predicates
• Need to extract SELECT attributes at
these TIDs
 BUT: index maps value to TID
 You really want to map TID to value (i.e., a
vertical partition)
 → Tuple construction is SLOW
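A small sketch of why the direction of the mapping matters. The column contents and function names are hypothetical, and a real row-store would use index scans and joins on TID rather than Python loops; the point is only that an index answers "which TIDs have this value?", not "what value is at this TID?".

```python
# Hypothetical store_name column; the list position plays the role of the TID.
store_name_by_tid = ["Boston", "Hartford", "Boston", "Providence", "Hartford"]


def build_index(column):
    """value -> list of TIDs: the direction an index provides."""
    index = {}
    for tid, value in enumerate(column):
        index.setdefault(value, []).append(tid)
    return index


def values_via_index(index, tids):
    """Extract values at the given TIDs using only the index.

    Every index entry has to be visited, because the index is
    organized by value rather than by TID.
    """
    wanted = set(tids)
    found = {}
    for value, posting in index.items():
        for tid in posting:
            if tid in wanted:
                found[tid] = value
    return [found[tid] for tid in tids]


def values_via_partition(column, tids):
    """TID -> value: a direct positional lookup in a vertical partition."""
    return [column[tid] for tid in tids]
```

Both functions return the same values, but the index-based one scans the whole index for every extraction, while the partition-based one touches only the requested positions.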
So….
• All indexes approach is pretty obviously a
poor way to simulate a column-store
• Problems with vertical partitioning are
NOT fundamental
 Store tuple header in a separate partition
 Allow virtual TIDs
 Allow horizontal partitioning (HP) using a foreign key stored in a different vertical partition (VP)
• So can row-stores simulate column-
stores?
Row-Store vs. Column-Store
[Chart: average query time in seconds]
Row-Store: 25.7
Row-Store (MV): 11.7
C-Store: 4.4
Column-Store Experiments
• Start with column-store (C-Store)
• Remove column-store-specific
performance optimizations
• End with column-store that behaves like a
row-store
Compression
• Higher data value locality
in column-stores
 Better ratio → reduced I/O
• Can use schemes like
run-length encoding
 Easy to operate on directly
for improved performance
([AMF06])
[Figure: a Quarter column (Q1, Q1, …, Q2, Q2, …) run-length encoded as (value, start position, run length) triples: (Q1, 1, 300), (Q2, 301, 350), (Q3, 651, 500), (Q4, 1151, 600)]
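Run-length encoding of the Quarter column can be sketched as below. This is a minimal illustration, not C-Store's implementation: the function names are assumptions, and the triples follow the (value, start position, run length) layout of the figure.

```python
from itertools import groupby


def rle_encode(column, first_position=1):
    """Encode a column as (value, start_position, run_length) triples."""
    triples = []
    position = first_position
    for value, run in groupby(column):
        length = sum(1 for _ in run)
        triples.append((value, position, length))
        position += length
    return triples


def count_equal(triples, target):
    """Count matching rows directly on the compressed form, without decoding."""
    return sum(length for value, _, length in triples if value == target)


# The column from the figure: 300 Q1s, 350 Q2s, 500 Q3s, 600 Q4s.
quarter = ["Q1"] * 300 + ["Q2"] * 350 + ["Q3"] * 500 + ["Q4"] * 600
encoded = rle_encode(quarter)
```

Here 1,750 values shrink to four triples, and `count_equal` answers a predicate by looking at four entries instead of 1,750, which is what operating directly on compressed data buys.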
Early vs. Late Materialization
• Early Materialization:
create rows first. But:
 Poor memory bandwidth
utilization
 Lose opportunity for
vectorized operation
QUERY:
SELECT custID, SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1)
GROUP BY custID
[Figure: the four columns are stitched into rows (Construct) before Select + Aggregate runs]
prodID  storeID  custID  price
4       2        2       7
4       1        3       13
4       3        3       42
4       1        3       80
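The two strategies can be compared on the values from the figure. This sketch assumes the four value groups in the figure are prodID, storeID, custID, and price in that order; the function names are illustrative, and plain Python lists stand in for column blocks.

```python
from collections import defaultdict

# Column values as read from the figure (column order is an assumption).
prod_id = [4, 4, 4, 4]
store_id = [2, 1, 3, 1]
cust_id = [2, 3, 3, 3]
price = [7, 13, 42, 80]


def early_materialization():
    """Construct full rows first, then select and aggregate row by row."""
    rows = list(zip(prod_id, store_id, cust_id, price))  # Construct
    totals = defaultdict(int)
    for p, s, c, amount in rows:                         # Select + Aggregate
        if p == 4 and s == 1:
            totals[c] += amount
    return dict(totals)


def late_materialization():
    """Filter each column into a position list; fetch other columns only there."""
    positions = [i for i, p in enumerate(prod_id) if p == 4]
    positions = [i for i in positions if store_id[i] == 1]
    totals = defaultdict(int)
    for i in positions:  # custID and price are read only at surviving positions
        totals[cust_id[i]] += price[i]
    return dict(totals)
```

Both return the same answer for the query. The late version never builds the two rows that fail the storeID predicate, and each filter step runs over a single column, which is the shape that suits vectorized operation.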
Other Column-Store
Optimizations
• Invisible join
 Column-store specific join
 Optimizations for star schemas
 Similar to a semi-join
• Block Processing
Simplified Version of Results
[Chart: average query time in seconds]
Original C-Store: 4.4
C-Store, No Compression: 14.9
C-Store, Early Materialization: 40.7
Conclusion
• Might be possible to simulate a column-store
in a row-store, BUT:
 Need better support for vertical partitioning at
the storage layer
 Need support for column-specific
optimizations at the executor level
• Working with HP-Labs to find out
Come Join the Yale DB
Group!

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Column-Stores vs. Row-Stores: How Different are they Really?

  • 1. Daniel Abadi -- Yale University. Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting). June 12th, 2008
  • 2. Row vs. Column-Stores. [Figure: a table with columns Last Name, First Name, E-mail, Phone #, Street Address, laid out as a row-store and as a column-store.] Row-Store: + Easy to add a new record; − Might read in unnecessary data. Column-Store: + Only need to read in relevant data; − Tuple writes might require multiple seeks.
  • 3. Column-Stores • Really good for read-mostly data warehouses: lots of column scans and aggregations; writes tend to be in batch; [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this; top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores; factor of 5 faster on performance; factor of 2 superior on price/performance
  • 4. Data Warehouse DBMS Software • $4.5 billion industry (out of total $16 billion DBMS software industry) • Growing 10% annually
  • 5. Momentum • Right solution for growing market → $$$$ • Vertica, ParAccel, Kickfire, Calpont, Infobright, and Exasol are new entrants • Sybase IQ’s profits rapidly increasing • Yahoo’s data warehouse, the world’s largest (multi-petabyte), is a column-store (from the Mahat Technologies acquisition)
  • 6. Paper Looks At Key Question • How much of the buzz around column-stores is just marketing hype? Do you really need to buy Sybase IQ/Vertica/ParAccel? How far will your current row-store take you? Can you get column-store performance from a row-store? Can you simulate a column-store in a row-store?
  • 7. Paper Methodology • Comparing row-store vs. column-store is dangerous/borderline meaningless • Instead, compare row-store vs. row-store and column-store vs. column-store: simulate a column-store inside of a row-store; remove column-oriented features from a column-store until it behaves like a row-store
  • 8. Simulate Column-Store Inside Row-Store. [Figure: a table with columns Last Name, First Name, E-mail, Phone #, Street Address.] Option A: Vertical Partitioning (one table per column, e.g. Last Name, First Name, E-mail, each with record positions 1, 2, 3, …). Option B: Index Every Column (Last Name Index, First Name Index, …)
  • 9. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  • 10. SSBM Averages (bar chart, average query time in seconds): Normal Row-Store 25.7; Vertically Partitioned Row-Store 79.9; Row-Store With All Indexes 221.2
  • 11. What’s Going On? • Vertically Partitioned Case: Tuple Sizes; Horizontal Partitioning • All Indexes Case: Tuple Reconstruction
  • 12. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  • 13. Tuple Size. [Figure: each vertical partition stores a Tuple Header, a TID, and the Column Data for rows 1, 2, 3.] • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns • Complete fact table takes up ~4 GB (compressed) • Vertically partitioned tables each take up 0.7-1.1 GB (compressed)
  • 14. Horizontal Partitioning • Fact table horizontally partitioned on year: year is an element of the ‘Date’ dimension table; most queries in SSBM have a predicate on year; since vertically partitioned tables do not contain the ‘Date’ foreign key, the row-store could not similarly partition them
  • 15. What’s Going On? • Vertically Partitioned Case: Tuple Sizes; Horizontal Partitioning • All Indexes Case: Tuple Construction
  • 16. Tuple Construction • Pretty much all queries require a column to be extracted (in the SELECT clause) that has not yet been accessed, e.g.: SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE Facts.store_id = Stores.store_id AND Stores.area = 'NEW ENGLAND' GROUP BY store_name
  • 17. Tuple Construction • Result of lower part of query plan is a set of TIDs that passed all predicates • Need to extract SELECT attributes at these TIDs, BUT: the index maps value to TID; you really want to map TID to value (i.e., a vertical partition) → Tuple construction is SLOW
  • 18. So… • The all-indexes approach is pretty obviously a poor way to simulate a column-store • Problems with vertical partitioning are NOT fundamental: store the tuple header in a separate partition; allow virtual TIDs; allow horizontal partitioning (HP) using a foreign key on a different vertical partition (VP) • So can row-stores simulate column-stores?
  • 20. Column-Store Experiments • Start with column-store (C-Store) • Remove column-store-specific performance optimizations • End with column-store that behaves like a row-store
  • 21. Compression • Higher data value locality in column-stores: better ratio → reduced I/O • Can use schemes like run-length encoding: easy to operate on directly for improved performance ([AMF06]). [Figure: a Quarter column of repeated values Q1, Q1, …, Q2, Q2, … is run-length encoded as (value, start position, run length) triples: (Q1, 1, 300), (Q2, 301, 350), (Q3, 651, 500), (Q4, 1151, 600).]
  • 22. Early vs. Late Materialization • Early Materialization: create rows first. But: poor memory bandwidth utilization; lose opportunity for vectorized operation. QUERY: SELECT custID, SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) GROUP BY custID. [Figure: the prodID, storeID, custID, and price columns pass through a Construct step into rows, followed by Select + Aggregate.]
  • 23. Other Column-Store Optimizations • Invisible join: column-store specific join; optimizations for star schemas; similar to a semi-join • Block Processing
  • 24. Simplified Version of Results (bar chart, average query time in seconds): Original C-Store 4.4; C-Store, No Compression 14.9; C-Store, Early Materialization 40.7
  • 25. Conclusion • Might be possible to simulate a column-store in a row-store, BUT: need better support for vertical partitioning at the storage layer; need support for column-specific optimizations at the executor level • Working with HP-Labs to find out
  • 26. Come Join the Yale DB Group!
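
Slide 8's Option A (vertical partitioning) can be sketched with any row-store. The example below uses Python's built-in `sqlite3` module purely as a stand-in row-store; the table names (`facts_store_id`, `facts_revenue`, `stores`) and the sample rows are hypothetical and not from the paper. Each fact column lives in its own two-column table keyed by a tuple ID (`tid`), and a query in the style of slide 16 has to join the partitions back together on `tid`.

```python
import sqlite3

# In-memory SQLite database standing in for a generic row-store (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facts_store_id (tid INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE facts_revenue  (tid INTEGER PRIMARY KEY, revenue  INTEGER);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, store_name TEXT, area TEXT);
""")

# Hypothetical sample data: three fact rows split across two vertical partitions.
conn.executemany("INSERT INTO facts_store_id VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 10)])
conn.executemany("INSERT INTO facts_revenue VALUES (?, ?)",
                 [(1, 100), (2, 250), (3, 50)])
conn.executemany("INSERT INTO stores VALUES (?, ?, ?)",
                 [(10, "Boston", "NEW ENGLAND"), (20, "Austin", "SOUTH")])

# The query only touches two fact columns, but it must pay for a join on tid
# to line the partitions up again.
result = conn.execute("""
SELECT s.store_name, SUM(r.revenue)
FROM facts_store_id AS f
JOIN facts_revenue  AS r ON r.tid = f.tid
JOIN stores         AS s ON s.store_id = f.store_id
WHERE s.area = 'NEW ENGLAND'
GROUP BY s.store_name
""").fetchall()

print(result)  # [('Boston', 150)]
```

Every partition repeats the `tid` (and, in a real row-store, a tuple header), which is the per-tuple overhead that slide 13 points to.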
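
The point of slide 17, that an index maps value to TID while tuple construction needs TID to value, can be illustrated with a toy model. This is a minimal sketch using plain Python containers, not how any particular DBMS implements its indexes; the column contents and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical column; the list position plays the role of the TID.
store_name_column = ["Boston", "Austin", "Boston", "Denver"]

# A value -> TIDs mapping, which is the direction an index serves well.
index = defaultdict(list)
for tid, value in enumerate(store_name_column):
    index[value].append(tid)

def value_from_index(index, tid):
    """Recover the value at a TID from a value->TID index by searching it."""
    for value, tids in index.items():
        if tid in tids:
            return value
    raise KeyError(tid)

def value_from_partition(column, tid):
    """Recover the value at a TID from a vertical partition: one direct lookup."""
    return column[tid]

# TIDs that survived the predicates in the lower part of the query plan.
matching_tids = [0, 2]
via_index = [value_from_index(index, t) for t in matching_tids]
via_partition = [value_from_partition(store_name_column, t) for t in matching_tids]
print(via_index, via_partition)
```

Both paths return the same values, but the index path has to search for each TID, whereas the positional lookup goes straight to it.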
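
The run-length encoding on slide 21 can be sketched in a few lines. The triples follow the slide's (value, start position, run length) layout with 1-based positions; the function names `rle_encode` and `count_equal` are illustrative, not C-Store's API.

```python
from itertools import groupby

def rle_encode(column):
    """Encode a column as (value, start_position, run_length) triples (1-based)."""
    runs = []
    position = 1
    for value, group in groupby(column):
        length = sum(1 for _ in group)
        runs.append((value, position, length))
        position += length
    return runs

def count_equal(runs, target):
    """Count matching rows by operating directly on the runs, without decompressing."""
    return sum(length for value, _start, length in runs if value == target)

# A sorted Quarter column with the run lengths shown on the slide.
quarter = ["Q1"] * 300 + ["Q2"] * 350 + ["Q3"] * 500 + ["Q4"] * 600
runs = rle_encode(quarter)

print(runs)                     # [('Q1', 1, 300), ('Q2', 301, 350), ('Q3', 651, 500), ('Q4', 1151, 600)]
print(count_equal(runs, "Q2"))  # 350
```

A predicate such as `Quarter = 'Q2'` is answered by inspecting four triples rather than 1,750 values, which is what "operate on directly" buys.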
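
Slide 22's contrast between early and late materialization can be sketched for its query (`SELECT custID, SUM(price) ... WHERE prodID = 4 AND storeID = 1 GROUP BY custID`). The column values below are illustrative assumptions, and plain Python lists stand in for column storage; this shows the two strategies, not C-Store's actual executor.

```python
# Four columns of one table, stored separately (values are illustrative).
prod_id  = [4, 4, 4, 4]
store_id = [2, 1, 3, 1]
cust_id  = [2, 3, 3, 3]
price    = [7, 13, 42, 80]

def early_materialization():
    """Construct full rows first, then select and aggregate row by row."""
    rows = list(zip(prod_id, store_id, cust_id, price))
    totals = {}
    for p, s, c, amount in rows:
        if p == 4 and s == 1:
            totals[c] = totals.get(c, 0) + amount
    return totals

def late_materialization():
    """Evaluate each predicate on its own column, intersect the position
    lists, and only then fetch custID and price at surviving positions."""
    prod_hits = {i for i, p in enumerate(prod_id) if p == 4}
    store_hits = {i for i, s in enumerate(store_id) if s == 1}
    totals = {}
    for i in sorted(prod_hits & store_hits):
        totals[cust_id[i]] = totals.get(cust_id[i], 0) + price[i]
    return totals

early = early_materialization()
late = late_materialization()
print(early, late)  # {3: 93} {3: 93}
```

Both strategies return the same answer; the late version never builds the two rows that fail the `storeID` predicate, and its per-column loops are the kind of tight scans that lend themselves to vectorized execution.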