Data Storage Formats in HDFS
Evaluation Criteria
- The processing tools
  - e.g., Cloudera (Impala) historically lacked ORC support
- Whether the data's schema changes over time
- Splittability
  - XML is not splittable
- Compression
  - Speeds up I/O operations
  - Saves storage
  - Increases processing time: the data must be decompressed before use
- The data size
- Processing and query performance
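The compression trade-off in the criteria above can be seen in a small sketch. It uses Python's standard `zlib` module purely for illustration; the sample payload is hypothetical, and the codecs typically used in a Hadoop cluster may behave differently.

```python
import zlib

# Hypothetical sample: a repetitive CSV-like payload standing in for log data.
rows = [f"{i % 10},user_{i % 10},click\n" for i in range(10_000)]
raw = "".join(rows).encode("utf-8")

# Fewer bytes to store and to move through I/O ...
compressed = zlib.compress(raw, level=6)

# ... at the cost of an extra CPU step on every read.
restored = zlib.decompress(compressed)

ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}x")
```

The storage and I/O savings come from `compressed` being smaller than `raw`; the added processing time is the `decompress` call that every reader has to pay.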
Common File Formats
- Standard
  - Sequence data
  - Structured data
- Columnar
  - Parquet
  - ORC
- Serialization
  - Avro
Summary of some file formats’ features

Data Format     Type of Format  Splittable  Schema Changes  Compression  Metadata
JSON, XML       Standard        -           +               -            +
CSV File        Standard        +           -               -            -
JSON Records    Standard        +           +               -            +
Sequence Files  Standard        +           -               +            -
Avro Files      Serialization   +           +               +            +
ORC Files       Columnar        +           +               +            +
Parquet Files   Columnar        +           +               +            +
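To illustrate what "splittable" means for a record-per-line format such as JSON Records, here is a minimal sketch. The function `read_split` and the split size are hypothetical, and the rule used (a record belongs to the split that contains its first byte) is a simplification of what line-oriented readers in Hadoop do. A single large JSON or XML document has no such record boundary to resynchronize on, which is why it cannot be divided this way.

```python
def read_split(data, start, end):
    """Return the records whose first byte lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # A record begins right after a newline, so look for the first
        # newline at or after start - 1 and resume just past it.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        # A record may run past `end`; the next split will skip its tail.
        records.append(data[pos:newline])
        pos = newline + 1
    return records


# Hypothetical input: one small JSON record per line.
data = b"".join(b'{"id": %d}\n' % i for i in range(100))

# Cut the file at arbitrary byte offsets, as a block-based file system would.
split_size = 37
splits = [(s, min(s + split_size, len(data))) for s in range(0, len(data), split_size)]

# Each split can be processed independently (for example, by a separate task).
per_split = [read_split(data, s, e) for s, e in splits]
merged = [rec for part in per_split for rec in part]
```

Every record is read exactly once even though the split boundaries fall in the middle of records.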
Sequence File
- A good fit for storing many small files
- Data is saved as <key, value> pairs
- Supports compression
  - Record level
  - Block level
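The idea of packing many small files into one container of <key, value> records can be sketched as follows. This is not the actual SequenceFile binary layout; `pack_files` and `unpack_files` are hypothetical helpers that use a simple length-prefixed encoding, with the file name as the key and the file content as the value.

```python
import io
import struct


def pack_files(files):
    """Pack {name: content} into one blob of length-prefixed key/value records."""
    buf = io.BytesIO()
    for name, content in files.items():
        key = name.encode("utf-8")
        # Header: key length and value length as big-endian unsigned 32-bit ints.
        buf.write(struct.pack(">II", len(key), len(content)))
        buf.write(key)
        buf.write(content)
    return buf.getvalue()


def unpack_files(blob):
    """Yield (name, content) pairs back out of a packed blob."""
    offset = 0
    while offset < len(blob):
        key_len, val_len = struct.unpack_from(">II", blob, offset)
        offset += 8
        key = blob[offset:offset + key_len].decode("utf-8")
        offset += key_len
        value = blob[offset:offset + val_len]
        offset += val_len
        yield key, value


# Hypothetical small files that would otherwise each occupy their own HDFS entry.
small_files = {
    "logs/a.txt": b"first file",
    "logs/b.txt": b"second file",
    "logs/c.txt": b"",
}
container = pack_files(small_files)
restored = dict(unpack_files(container))
```

One container file replaces many tiny ones, which is the reason this kind of layout suits small-file workloads.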
Parquet
- Optimized for Impala
- Used by Twitter
- Data Structure
  - Data is partitioned into row groups
  - Pages can be compressed
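A rough picture of the columnar layout: rows are cut into groups, and within each group the values of one column are stored together. The sketch below models that with plain Python dictionaries; `to_row_groups` and `read_column` are hypothetical names, and real Parquet files additionally split each column chunk into pages with encoding and compression.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups and store each group column by column."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        # One list of values per column: the "column chunk" of this group.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        groups.append(columns)
    return groups


def read_column(groups, name):
    """Read a single column without touching the other columns."""
    return [value for group in groups for value in group[name]]


# Hypothetical table with three columns.
rows = [
    {"id": 1, "city": "Tehran", "fare": 4.5},
    {"id": 2, "city": "Paris", "fare": 7.0},
    {"id": 3, "city": "Tehran", "fare": 3.2},
    {"id": 4, "city": "Berlin", "fare": 9.1},
    {"id": 5, "city": "Paris", "fare": 6.4},
]
groups = to_row_groups(rows, group_size=2)
fares = read_column(groups, "fare")
```

A query that needs only `fare` reads one list per group and skips `id` and `city` entirely, which is why columnar formats favor analytical reads.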
ORC
- Optimized for Hive and Presto
- Data Structure
  - Indexes contain basic statistics
  - The file footer contains a list of stripe information
  - The postscript holds compression parameters
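The value of keeping basic statistics in the index can be sketched like this: if each stripe records the minimum and maximum of a column, a reader can skip stripes that cannot contain matching rows. The functions `build_stripes` and `scan_greater_than` are hypothetical, and the stripe size is far smaller than anything a real ORC writer would use.

```python
def build_stripes(values, stripe_size):
    """Group values into stripes and record min/max statistics for each."""
    stripes = []
    for i in range(0, len(values), stripe_size):
        chunk = values[i:i + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def scan_greater_than(stripes, threshold):
    """Return values > threshold and the number of stripes actually scanned."""
    matches = []
    scanned = 0
    for stripe in stripes:
        # The statistics alone prove that no row in this stripe can match.
        if stripe["max"] <= threshold:
            continue
        scanned += 1
        matches.extend(v for v in stripe["rows"] if v > threshold)
    return matches, scanned


# Hypothetical sorted column: 10 stripes of 100 values each.
stripes = build_stripes(list(range(1000)), stripe_size=100)
matches, scanned = scan_greater_than(stripes, threshold=949)
```

Here only the last stripe is read; the other nine are skipped based on their `max` value alone.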
Avro
- Row-based storage
- Commonly used with Apache Kafka
- Robust support for schema changes
- Data Structure
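Support for changing schemas roughly means that data written with an old schema can still be read with a newer one. The sketch below is a simplified imitation of that idea, not the Avro library or its full resolution rules: `resolve` is a hypothetical helper that fills fields missing from the written record with the reader schema's defaults and drops fields the reader no longer knows about.

```python
def resolve(record, reader_fields):
    """Project a record written with an older schema onto the reader's schema."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: use the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the record but absent from reader_fields are ignored.
    return resolved


# Hypothetical record written with schema v1 (fields: id, name, legacy_flag).
old_record = {"id": 7, "name": "trip-7", "legacy_flag": True}

# Hypothetical schema v2: drops legacy_flag, adds currency with a default.
reader_fields_v2 = [
    {"name": "id"},
    {"name": "name"},
    {"name": "currency", "default": "USD"},
]
new_view = resolve(old_record, reader_fields_v2)
```

Old files stay readable after the schema gains or loses fields, as long as added fields carry a default.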
Avro vs Parquet
- Avro is ideal for ETL
- Parquet is ideal for analytical queries
- Read performance is better in Parquet
- Write performance is better in Avro
- Avro fully supports schema changes
- Parquet only supports appending new columns
Parquet vs ORC
- Parquet is better for nested data
- ORC compresses more efficiently
Uber Use Case
The End
