DATA STORAGE FORMATS
in Hadoop
Botond Balázs balazsbotond@gmail.com @botond_balazs
OUR MAIN CONCERNS
• Read performance (improve)
• Disk usage (reduce)
• Splittability (provide)
• Failure behavior
• Write performance (keep reasonable)
Disks are so slow that it is worth sacrificing a lot of
CPU cycles to reduce disk I/O.
In a distributed system, reducing network traffic is also important.
3 WAYS OF REPRESENTING THIS TABLE ON DISK
CourseId Title Instructor CategoryId
25 Databases 1 Jennifer Widom 10
27 Databases 2 Jennifer Widom 10
28 Algorithms Charles Leiserson 12
30 Discrete Math Donald Knuth 12
35 Operating Systems A.Tanenbaum 40
ROW-ORIENTED
• Fields of a row are stored contiguously
• Quick and easy:
• Retrieve an entire row
• Insert, update
• Drawbacks:
• Without indexing, filtering is slower
• Entire row has to be read even if we only need a few columns
25 | Databases 1 | Jennifer Widom | 10 | 27 | Databases 2 | Jennifer Widom | 10 | 28 | ...
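A minimal sketch of the row-oriented layout, using the course table from the slides. The function names `to_row_oriented` and `read_column` are illustrative assumptions, and a flat Python list stands in for the byte stream on disk.

```python
# Course table from the slides; a flat list stands in for the on-disk stream.
COURSES = [
    (25, "Databases 1", "Jennifer Widom", 10),
    (27, "Databases 2", "Jennifer Widom", 10),
    (28, "Algorithms", "Charles Leiserson", 12),
    (30, "Discrete Math", "Donald Knuth", 12),
    (35, "Operating Systems", "A. Tanenbaum", 40),
]


def to_row_oriented(rows):
    """Lay the table out so that the fields of each row are contiguous."""
    return [field for row in rows for field in row]


def read_column(flat, n_cols, col):
    """Pick one column out of the row-oriented stream.

    The stride walks across every row, which mirrors the drawback above:
    the whole stream is touched even though only one column is wanted.
    """
    return flat[col::n_cols]


flat_rows = to_row_oriented(COURSES)
titles = read_column(flat_rows, n_cols=4, col=1)
print(flat_rows[:5])
print(titles)
```

Appending a row is a single write at the end of the stream, which is why inserts are cheap in this layout.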
COLUMN-ORIENTED
• Fields of a column are stored contiguously
• Benefits:
• Each column can serve as an index (fast filtering operations on the whole dataset)
• Only selected columns are read
• Drawbacks:
• Whole-row operations require a lot of disk I/O
• Inserts and updates are slow and hard
• The same row can be stored on different nodes in a distributed environment
25 | 27 | 28 | 30 | 35
Databases 1 | Databases 2 | Algorithms | Discrete M. | Operating S.
J.Widom | J.Widom | C. Leiserson:003 | D. Knuth:004 | A.Tanenbaum:005
10 | 10 | 12 | 12 | 40
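The column-oriented layout can be sketched the same way. This is an illustration only: `to_column_oriented` and `read_row` are hypothetical helper names, and in-memory lists stand in for column files.

```python
# Course table from the slides.
COURSES = [
    (25, "Databases 1", "Jennifer Widom", 10),
    (27, "Databases 2", "Jennifer Widom", 10),
    (28, "Algorithms", "Charles Leiserson", 12),
    (30, "Discrete Math", "Donald Knuth", 12),
    (35, "Operating Systems", "A. Tanenbaum", 40),
]


def to_column_oriented(rows):
    """Store the fields of each column contiguously (one list per column)."""
    return [list(column) for column in zip(*rows)]


def read_row(columns, index):
    """Rebuild one row; this needs one lookup in every column."""
    return tuple(column[index] for column in columns)


columns = to_column_oriented(COURSES)
category_ids = columns[3]       # only the selected column is read
third_row = read_row(columns, 2)
print(category_ids)
print(third_row)
```

Reading `category_ids` touches a single list, while `read_row` has to visit all of them, which is the whole-row cost listed under the drawbacks.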
RECORD COLUMNAR
Horizontal partitioning splits the table into row groups.

CourseId  Title              Instructor         CategoryId
25        Databases 1        Jennifer Widom     10
27        Databases 2        Jennifer Widom     10
28        Algorithms         Charles Leiserson  12
30        Discrete Math      Donald Knuth       12
35        Operating Systems  A.Tanenbaum        40

Row group 1
CourseId  Title              Instructor         CategoryId
25        Databases 1        Jennifer Widom     10
27        Databases 2        Jennifer Widom     10

Row group 2
CourseId  Title              Instructor         CategoryId
28        Algorithms         Charles Leiserson  12
30        Discrete Math      Donald Knuth       12
35        Operating Systems  A.Tanenbaum        40
RECORD COLUMNAR
Within each row group, the fields of a column are stored contiguously.

Row group 1: 25 27 | Databases 1, Databases 2 | Jennifer Widom, Jennifer Widom | 10 10
Row group 2: 28 30 35 | Algorithms, Discrete Math, Operating Sys. | C. Leiserson, Donald Knuth, A.Tanenbaum | 12 12 40

High redundancy in columns
Compress them!
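The record columnar idea can be sketched as follows: cut the table into row groups, store each group column by column, and compress every column chunk on its own. The helper names, the JSON encoding of a column, and the use of `zlib` are assumptions made for this illustration; real formats such as RCFile, ORC and Parquet use their own encodings.

```python
import json
import zlib

# Course table from the slides.
COURSES = [
    (25, "Databases 1", "Jennifer Widom", 10),
    (27, "Databases 2", "Jennifer Widom", 10),
    (28, "Algorithms", "Charles Leiserson", 12),
    (30, "Discrete Math", "Donald Knuth", 12),
    (35, "Operating Systems", "A. Tanenbaum", 40),
]


def to_row_groups(rows, group_size):
    """Horizontal partitioning: cut the table into groups of whole rows."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


def encode_group(group):
    """Inside one row group, store each column contiguously and compress it."""
    columns = [list(column) for column in zip(*group)]
    return [zlib.compress(json.dumps(col).encode("utf-8")) for col in columns]


def decode_group(chunks):
    """Decompress every column chunk and stitch the rows back together."""
    columns = [json.loads(zlib.decompress(c).decode("utf-8")) for c in chunks]
    return [tuple(row) for row in zip(*columns)]


groups = to_row_groups(COURSES, group_size=2)
encoded = [encode_group(group) for group in groups]
decoded = [row for chunks in encoded for row in decode_group(chunks)]
print(len(groups), decoded == COURSES)
```

Because a row group holds whole rows, every row stays on one node, while the per-column chunks keep similar values next to each other for the compressor. With `group_size=2` the sketch produces three groups rather than the two shown on the slide.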
SERIALIZATION FORMATS
Row-Oriented Record Columnar
Neither
RCFile
Thrift
SequenceFile
ORC
SEQUENCEFILE
Header
version: 3-byte magic "SEQ" followed by 1 version byte, e.g. "SEQ6"
keyClassName: String, Java class name of keys
valueClassName: String, Java class name of values
compression: Bool, true if record compression is on
blockCompression: Bool, true if block compression is on
compressorClass: String, Java class name of compressor
metadata: SequenceFile.Metadata (key-value pairs)
sync: A sync marker to denote end of header
Java-only format!
SEQUENCEFILE
Header | SYNC | Record Record Record | SYNC | Record Record Record | SYNC | Record Record Record
The SYNC markers are the split points.
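How sync markers make a file splittable can be sketched with a toy layout. Everything here is an assumption for illustration: the fixed `SYNC` value, the `HEADER` placeholder and the 4-byte length prefix are not the real SequenceFile encoding, where the marker is stored in the file header. The idea is that a reader handed an arbitrary byte range skips forward to the first marker in its range and reads whole blocks from there.

```python
SYNC = b"\xff" * 16  # hypothetical 16-byte marker, chosen so it cannot occur in the toy records


def build_file(blocks):
    """Write a header, then each block as SYNC followed by length-prefixed records."""
    out = bytearray(b"HEADER")
    for records in blocks:
        out += SYNC
        for record in records:
            out += len(record).to_bytes(4, "big") + record
    return bytes(out)


def read_split(data, start, end):
    """Return the records of every block whose sync marker begins in [start, end)."""
    records = []
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        body = pos + len(SYNC)
        nxt = data.find(SYNC, body)
        stop = len(data) if nxt == -1 else nxt
        while body < stop:
            size = int.from_bytes(data[body:body + 4], "big")
            records.append(data[body + 4:body + 4 + size])
            body += 4 + size
        pos = nxt
    return records


data = build_file([
    [b"r1", b"r2", b"r3"],
    [b"r4", b"r5", b"r6"],
    [b"r7", b"r8", b"r9"],
])
middle = len(data) // 2  # an arbitrary cut that lands inside the second block
first_half = read_split(data, 0, middle)
second_half = read_split(data, middle, len(data))
print(first_half, second_half)
```

Each block is read by exactly one split, even though the cut point falls in the middle of a block.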
SEQUENCEFILE FAILURE BEHAVIOR
• Readable up to the first failed row
• Not recoverable after that point
AVRO
JSON schema:
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
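To make the recursive `LongList` schema concrete, here is a small sketch that checks Python values against it. It uses only the standard `json` module, not an Avro library, and the `conforms` function is a hypothetical helper that handles just the constructs this schema uses (records, `long`, `null`, unions and references to named types).

```python
import json

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
""")


def conforms(value, schema, named=None):
    """Check a value against the subset of Avro schema features used above."""
    named = {} if named is None else named
    if isinstance(schema, list):  # a union: any branch may match
        return any(conforms(value, branch, named) for branch in schema)
    if isinstance(schema, dict) and schema.get("type") == "record":
        named[schema["name"]] = schema  # remember the name for recursive references
        fields = schema["fields"]
        return (
            isinstance(value, dict)
            and set(value) == {f["name"] for f in fields}
            and all(conforms(value[f["name"]], f["type"], named) for f in fields)
        )
    if schema == "null":
        return value is None
    if schema == "long":
        return (isinstance(value, int) and not isinstance(value, bool)
                and -2**63 <= value < 2**63)
    if isinstance(schema, str) and schema in named:
        return conforms(value, named[schema], named)
    raise ValueError(f"unsupported schema: {schema!r}")


linked = {"value": 1, "next": {"value": 2, "next": None}}
print(conforms(linked, SCHEMA))
print(conforms({"value": "x", "next": None}, SCHEMA))
```

The union `["null", "LongList"]` is what lets the list terminate: the last node carries `None` in its `next` field.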
AVRO
• Schema is stored in the header
• Supports writing and reading with a different schema (schema evolution)
• Supports nested types
• Block-based splittable format (SYNC marker)
• Optional block compression (Snappy, Deflate)
• Excellent failure behavior: only the failed block is lost, and reading continues at the next SYNC marker
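The recovery behavior in the last bullet can be sketched with a toy block format. This is not Avro's actual container layout: the fixed `SYNC` value, the length prefix and the CRC32 checksum are assumptions made so that a damaged block can be detected in a few lines. The point is only the control flow: when a block fails, the reader jumps to the next marker and carries on.

```python
import zlib

SYNC = b"\xff" * 16  # hypothetical marker; it cannot occur in the ASCII payloads below


def write_blocks(payloads):
    """Toy layout: SYNC, 4-byte length, 4-byte CRC32, then the payload."""
    out = bytearray()
    for payload in payloads:
        out += SYNC
        out += len(payload).to_bytes(4, "big")
        out += zlib.crc32(payload).to_bytes(4, "big")
        out += payload
    return bytes(out)


def read_blocks(data):
    """Return every intact payload; a damaged block is skipped, not fatal."""
    good = []
    pos = data.find(SYNC)
    while pos != -1:
        body = pos + len(SYNC)
        size = int.from_bytes(data[body:body + 4], "big")
        crc = int.from_bytes(data[body + 4:body + 8], "big")
        payload = data[body + 8:body + 8 + size]
        if len(payload) == size and zlib.crc32(payload) == crc:
            good.append(payload)
        pos = data.find(SYNC, body)  # resume at the next marker either way
    return good


data = write_blocks([b"block-one", b"block-two", b"block-three"])
hit = data.index(b"block-two")
damaged = data[:hit] + b"X" + data[hit + 1:]  # corrupt one byte of the second block
recovered = read_blocks(damaged)
print(recovered)
```

Only the block containing the corrupted byte is dropped; the blocks before and after it are still returned.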
RCFILE
First widespread record columnar format
Much better alternatives exist today: ORC, Parquet
PARQUET
• ORC is designed specifically for Hive
• Parquet is a general purpose format
• Supports complex nested data structures
• Stores full metadata at the end of files
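Storing the metadata at the end of the file can be sketched like this. The `PAR1` magic bytes and the 4-byte little-endian footer length before the trailing magic match the Parquet layout, but the JSON footer and JSON row groups are stand-ins chosen for brevity (Parquet encodes its footer with Thrift), and all function names are hypothetical.

```python
import io
import json
import struct

MAGIC = b"PAR1"


def write_file(row_groups):
    """Write the data first, then a footer that records where each row group lives."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    index = []
    for group in row_groups:
        chunk = json.dumps(group).encode("utf-8")
        index.append({"offset": buf.tell(), "length": len(chunk)})
        buf.write(chunk)
    footer = json.dumps({"row_groups": index}).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer length, little-endian
    buf.write(MAGIC)
    return buf.getvalue()


def read_footer(data):
    """Start from the end: trailing magic, then footer length, then the footer."""
    if data[-4:] != MAGIC:
        raise ValueError("missing trailing magic; the file may be truncated")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return json.loads(data[-8 - footer_len:-8])


def read_row_group(data, footer, i):
    """Jump straight to one row group using the offsets from the footer."""
    entry = footer["row_groups"][i]
    return json.loads(data[entry["offset"]:entry["offset"] + entry["length"]])


data = write_file([[[25, 27], [10, 10]], [[28, 30, 35], [12, 12, 40]]])
footer = read_footer(data)
second = read_row_group(data, footer, 1)
print(second)
```

Writing the footer last lets the writer stream row groups without knowing their offsets in advance. The cost shows up in this sketch as well: a file cut off before the footer is written fails the magic check and cannot be read.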
FAILURE BEHAVIOR OF RECORD COLUMNAR FORMATS
Failure can lead to incomplete rows
They don’t handle failure well
COMPRESSION
Format   Splittable        Write speed   Read speed   Compression ratio
gzip     ✖                 ★★            ★★★          ★★★
bzip2    ✔                 ★             ★            ★★★
Snappy   ✖                 ★★★           ★★★          ★
LZO      ✔ (if indexed)    ★★★           ★★★          ★
Each of these is splittable when used inside a container format.
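Why a container format restores splittability can be sketched with gzip from the standard library. A single gzip stream has to be decompressed from the start, but if the container compresses each block separately, any block can be decoded on its own. The helper names and the newline-joined record encoding are assumptions for this illustration.

```python
import gzip


def pack_blocks(records, per_block):
    """Compress each block of records independently, as a container format would."""
    return [
        gzip.compress(b"\n".join(records[i:i + per_block]))
        for i in range(0, len(records), per_block)
    ]


def unpack_block(block):
    """Decode one block without touching any other block."""
    return gzip.decompress(block).split(b"\n")


records = [b"row-%d" % i for i in range(10)]
blocks = pack_blocks(records, per_block=4)
middle = unpack_block(blocks[1])  # the first block is never decompressed
print(len(blocks), middle)
```

Snappy and LZO are not in the Python standard library, so gzip stands in for them here; the same per-block approach applies to any codec.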
RECOMMENDATION
              Analytics     Archival
Format        Parquet       Avro
Compression   Snappy/gzip   bzip2
The End.
