EFFICIENT COLUMNAR STORAGE
WITH
APACHE PARQUET
Apache: Big Data North America 2017
Ranganathan Balashanmugam, Aconex
“The Tables Have Turned.”
Apache: Big Data North America 2017
“The Tables Have Turned.” - 90°
Apache: Big Data North America 2017
Apache: Big Data North America 2017
ANALYTICAL OVER
TRANSACTIONAL
Context
Apache: Big Data North America 2017
SIMPLE STRUCTURE
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a1 b1 c1 a2 b2 c2 a3 b3 c3
row
a1 a2 a3 b1 b2 b3 c1 c2 c3
columnar
Apache: Big Data North America 2017
“Optimizing the disk seeks.”
“The Tables Have Turned.” - 90°
Apache: Big Data North America 2017
Efficient writes
Efficient reads
Apache: Big Data North America 2017
SIMPLE STRUCTURE
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a1 b1 c1 a2 b2 c2 a3 b3 c3
row
a1 a2 a3 b1 b2 b3 c1 c2 c3
columnar
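A minimal sketch (not part of the original slides) of the two layouts above: the same three-row table serialized row by row versus column by column, so a query touching only column A can skip the bytes of B and C.

```python
# Hypothetical illustration of row vs. columnar layout for the A/B/C table above.
rows = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b2", "C": "c2"},
    {"A": "a3", "B": "b3", "C": "c3"},
]

# Row layout: the values of one record stay together on disk.
row_layout = [v for record in rows for v in (record["A"], record["B"], record["C"])]
# ['a1', 'b1', 'c1', 'a2', 'b2', 'c2', 'a3', 'b3', 'c3']

# Columnar layout: the values of one column stay together on disk.
columnar_layout = [record[col] for col in ("A", "B", "C") for record in rows]
# ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']
```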
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
Apache: Big Data North America 2017
NESTED AND REPEATED STRUCTURES
How to preserve in column store?
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
Url: ‘http://a'
Name:
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
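One way to picture this schema outside the slides (an assumption, expressed with pyarrow rather than anything shown in the talk): repeated fields become lists, groups become structs, and required fields become non-nullable.

```python
import pyarrow as pa

# Sketch only: the Document schema as an Arrow type, assuming pyarrow is available.
document_type = pa.struct([
    pa.field("DocId", pa.int64(), nullable=False),
    pa.field("Links", pa.struct([
        pa.field("Backward", pa.list_(pa.int64())),
        pa.field("Forward", pa.list_(pa.int64())),
    ])),
    pa.field("Name", pa.list_(pa.struct([
        pa.field("Language", pa.list_(pa.struct([
            pa.field("Code", pa.string(), nullable=False),
            pa.field("Country", pa.string()),
        ]))),
        pa.field("Url", pa.string()),
    ]))),
])
```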
Apache: Big Data North America 2017
Columnar
Apache: Big Data North America 2017
“Get these performance benefits for nested structures into the Hadoop ecosystem.”
problem statement
Apache: Big Data North America 2017
MOTIVATION
* Allow complex nested data structures
* Very efficient compression and encoding schemes
* Support many frameworks
Apache: Big Data North America 2017
“Columnar storage format available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model or
programming language.”
Apache: Big Data North America 2017
DESIGN GOALS
* Interoperability
* Space efficiency
* Query efficiency
Apache: Big Data North America 2017
parquet
avro thrift protobuf pig hive ….
avro thrift protobuf pig hive ….
object model converters
storage format
object models
parquet binary file
src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/
column readers
scalding
Apache: Big Data North America 2017
Page 1
Page 2
… Page n
parquet file
header - Magic number (4 bytes) : “PAR1”
row group 0
column a column b column c
row group 1
…row group n
footer
Page 1
Page 2
… Page n
Page 1
Page 2
… Page n
src: https://github.com/Parquet/parquet-format
page:
* good enough for compression efficiency
* good enough to read
* 8 KB - 1 MB
row group:
* group of rows
* max size buffered while writing
* 50 MB - 1 GB
column chunk:
* data of one column in a row group
* can be read independently
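A quick way to see this layout on a real file (a sketch assuming pyarrow; example.parquet is a hypothetical file name): both ends of the file carry the 4-byte magic number “PAR1”, and the footer lists the row groups with their sizes.

```python
import pyarrow.parquet as pq

path = "example.parquet"  # hypothetical file

# The first and last 4 bytes of a Parquet file are the magic number "PAR1".
with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)
    tail = f.read(4)
print(head, tail)  # b'PAR1' b'PAR1'

# The footer describes row groups; each row group holds one chunk per column.
meta = pq.ParquetFile(path).metadata
print(meta.num_row_groups, meta.num_columns, meta.num_rows)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
```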
Apache: Big Data North America 2017
footer
src: https://github.com/Parquet/parquet-format
file metadata (ThriftCompactProtocol)
- version
- schema
- row group 0 metadata
  - total byte size
  - total rows
  - column 0
    * type/path/encodings/codec
    * num of values
    * compressed/uncompressed size
    * data page offset
    * index page offset
  - column 1
    * …
- row group … metadata
  - column 0
    * …
  - column 1
    * …
footer length (4 bytes)
Magic number (4 bytes): “PAR1”
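The per-column metadata in the footer is also exposed programmatically; a hedged sketch with pyarrow (example.parquet is hypothetical):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata  # hypothetical file

# Column chunk metadata mirrors the footer fields above:
# type/path/encodings/codec, number of values, sizes, and page offsets.
col = meta.row_group(0).column(0)
print(col.physical_type, col.path_in_schema, col.encodings, col.compression)
print(col.num_values, col.total_compressed_size, col.total_uncompressed_size)
print(col.data_page_offset, col.dictionary_page_offset)
```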
Apache: Big Data North America 2017
page
src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/
repetition levels
definition levels
values
page header (ThriftCompactProtocol)
encoded and compressed
Apache: Big Data North America 2017
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
Schema
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
Url: ‘http://a'
Name:
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
Documents
Apache: Big Data North America 2017
“Can we represent it in columnar format efficiently and read it back into its original nested data structure?”
problem statement
Apache: Big Data North America 2017
“Can we represent it in columnar format efficiently and read it back into its original nested data structure?”
Dremel encoding
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
fill all the nulls
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
COLUMN  R  D  VALUE
Links.Forward 0 2 20
Links.Forward[1] 1 2 40
Links.Forward[2] 1 2 60
Links.Forward 0 2 80
R: In the path to the field, what is the last repeated field?
D: In the path to the field, how many defined fields?
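A small sketch (not from the talk) of how R and D could be derived for the Links.Forward column of one document, assuming the documents are plain Python dicts: the maximum definition level is 2 (optional Links plus repeated Forward) and the maximum repetition level is 1.

```python
def links_forward_levels(doc):
    """(repetition, definition, value) triples for the Links.Forward column."""
    links = doc.get("Links")
    if links is None:
        return [(0, 0, None)]        # Links missing: nothing on the path is defined
    forwards = links.get("Forward", [])
    if not forwards:
        return [(0, 1, None)]        # Links defined, but no Forward values
    triples = []
    for i, value in enumerate(forwards):
        r = 0 if i == 0 else 1       # first value starts a new record; the rest repeat Forward
        triples.append((r, 2, value))
    return triples

doc1 = {"DocId": 10, "Links": {"Forward": [20, 40, 60]}}
doc2 = {"DocId": 20, "Links": {"Backward": [10, 30], "Forward": [80]}}
print(links_forward_levels(doc1))  # [(0, 2, 20), (1, 2, 40), (1, 2, 60)]
print(links_forward_levels(doc2))  # [(0, 2, 80)]
```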
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
R: In the path to the field, what is the last repeated field?
D: In the path to the field, how many defined fields?
COLUMN  R  D  VALUE
Name.Language.Code en-us
Name.Language[1].Code en
Name[1].[Language].[Code] null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
R: In the path to the field, what is the last repeated field?
D: In the path to the field, how many defined fields?
COLUMN  R  D  VALUE
Name.Language.Code 0 2 en-us
Name.Language[1].Code en
Name[1].[Language].[Code] null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
R: In the path to the field, what is the last repeated field?
D: In the path to the field, how many defined fields?
COLUMN  R  D  VALUE
Name.Language.Code 0 2 en-us
Name.Language[1].Code 2 2 en
Name[1].[Language].[Code] null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
R: In the path to the field, what is the last repeated field?
D: In the path to the field, how many defined fields?
COLUMN  R  D  VALUE
Name.Language.Code 0 2 en-us
Name.Language[1].Code 2 2 en
Name[1].[Language].[Code] 1 1 null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
R: In the path to the field, what is the last repeated field?
D: In the path to the field, how many defined fields?
COLUMN  R  D  VALUE
Name.Language.Code 0 2 en-us
Name.Language[1].Code 2 2 en
Name[1].[Language].[Code] 1 1 null
Name[2].Language.Code 1 2 en-gb
Name.Language.Code 0 1 null
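The same idea for the deeper Name.Language.Code column, again as a hedged sketch over plain dicts (doc1_names below is a hypothetical variable holding Document 1's Name list): the maximum definition level is 2 (repeated Name plus repeated Language; Code is required) and the maximum repetition level is 2.

```python
def name_language_code_levels(doc):
    """(repetition, definition, value) triples for the Name.Language.Code column."""
    names = doc.get("Name", [])
    if not names:
        return [(0, 0, None)]
    triples = []
    for i, name in enumerate(names):
        languages = name.get("Language", [])
        if not languages:
            # Name is defined (level 1) but has no Language entries.
            triples.append((0 if i == 0 else 1, 1, None))
            continue
        for j, language in enumerate(languages):
            if i == 0 and j == 0:
                r = 0        # first value of this column in the record
            elif j == 0:
                r = 1        # a new Name repeats at level 1
            else:
                r = 2        # another Language inside the same Name repeats at level 2
            triples.append((r, 2, language["Code"]))  # Code is required once Language exists
    return triples

doc1_names = {"Name": [
    {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}], "Url": "http://a"},
    {"Url": "http://b"},
    {"Language": [{"Code": "en-gb", "Country": "gb"}]},
]}
print(name_language_code_levels(doc1_names))
# [(0, 2, 'en-us'), (2, 2, 'en'), (1, 1, None), (1, 2, 'en-gb')]
```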
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
COLUMN  R  D  VALUE
DocId 0 0 10
Links.Forward 0 2 20
Links.Forward[1] 1 2 40
Links.Forward[2] 1 2 60
Links.[Backward] 0 1 null
Name.Language.Code 0 2 en-us
Name.Language.Country 0 3 us
Name.Language[1].Code 2 2 en
Name.Language[1].[Country] 2 2 null
Name.Url 0 2 http://a
Name[1].[Language].[Code] 1 1 null
Name[1].[Language].[Country] 1 1 null
Name[1].Url 1 2 http://b
Name[2].Language.Code 1 2 en-gb
Name[2].Language.Country 1 3 gb
Name[2].[Url] 1 1 null
Apache: Big Data North America 2017
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
(the slide shows this table three times, once per optimization below)
OPTIMIZATIONS
projection push down: fetch required columns
predicate push down: filter records while reading
read required data
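In practice these optimizations surface directly in reader APIs; for example with pyarrow (a sketch, where example.parquet and the column names are assumptions, and filter push-down behaviour depends on the pyarrow version):

```python
import pyarrow.parquet as pq

# Projection push down: only the requested columns are read from disk.
table = pq.read_table("example.parquet", columns=["A", "C"])

# Predicate push down: row groups whose statistics cannot match are skipped,
# and the remaining rows are filtered while reading.
filtered = pq.read_table("example.parquet", columns=["A"], filters=[("B", ">", "b2")])
```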
Apache: Big Data North America 2017
ENCODING
* Plain
* Dictionary Encoding
* Run Length Encoding/ Bit-Packing Hybrid
* Delta Encoding
* Delta-length byte array
* Delta Strings
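A toy sketch (for intuition only, not Parquet's actual implementation) of the two most common ideas, dictionary encoding and run-length encoding, over a low-cardinality column:

```python
# Toy dictionary encoding and run-length encoding over a repetitive column.
values = ["gb", "us", "us", "us", "gb", "gb", "gb", "us"]

# Dictionary encoding: store each distinct value once, then small integer indices.
dictionary = sorted(set(values))                 # ['gb', 'us']
indices = [dictionary.index(v) for v in values]  # [0, 1, 1, 1, 0, 0, 0, 1]

# Run-length encoding: collapse runs of equal indices into (index, count) pairs.
runs = []
for idx in indices:
    if runs and runs[-1][0] == idx:
        runs[-1][1] += 1
    else:
        runs.append([idx, 1])
print(dictionary, runs)  # ['gb', 'us'] [[0, 1], [1, 3], [0, 3], [1, 1]]
```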
Apache: Big Data North America 2017
COMPRESSION TRADE-OFF
* Storage
* IO
* Bandwidth
* CPU time
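The trade-off can be explored by writing the same table with different codecs and comparing file sizes; a sketch assuming pyarrow (codec availability, e.g. zstd, depends on how pyarrow was built):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# A deliberately repetitive column, so encoding and compression have something to do.
table = pa.table({"country": ["us", "gb"] * 50_000})

for codec in ["none", "snappy", "gzip", "zstd"]:
    path = f"/tmp/example_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```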
Thank you
Apache: Big Data North America 2017
Ran.ga.na.than B
@ran_than