Parquet

Open columnar storage for Hadoop
Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter
http://parquet.io

Twitter Context

Twitter’s data:
• 200M+ monthly active users generating and consuming 500M+ tweets a day.
• 100TB+ a day of compressed data.
• Scale is huge: Instrumentation, User graph, Derived data, ...

Analytics infrastructure:
• Log collection pipeline
• Several 1K+ node Hadoop clusters
• Processing tools

Role of Twitter’s analytics infrastructure team:
• Platform for the whole company.
• Manages the data and enables analysis.
• Optimizes the cluster’s workload as a whole.

(Image: “The Parquet Planers”, Gustave Caillebotte)

- Hundreds of TB
- Hadoop: a distributed file system and a parallel execution engine focused on data locality
- clusters: ad hoc, prod, ...
- Log collection pipeline (Scribe, Kafka, ...)
- Processing tools (Pig...)
- Every group at Twitter uses the infrastructure: spam, ads, mobile, search, recommendations, ...
- optimize space/io/cpu usage.
- the clusters are constantly growing and never big enough.
- We ingest hundreds of terabytes with IO-bound processes: improved compression saves space, and column storage enables better scans.
Twitter’s use case
• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.

struct LogEvent {
1: optional logbase.LogBase log_base
2: optional i64 event_value
3: optional string context
4: optional string referring_event
...
18: optional EventNamespace event_namespace
19: optional list<Item> items
20: optional map<AssociationType,Association> associations
21: optional MobileDetails mobile_details
22: optional WidgetDetails widget_details
23: optional map<ExternalService,string> external_ids
}

struct LogBase {
1: string transaction_id,
2: string ip_address,
...
15: optional string country,
16: optional string pid,
}

- Logs stored using Thrift with heavily nested data structure.
- Example: aggregate event counts per IP address for a given event_type.
We just need to access the ip_address and event_name columns

Goal
To have a state-of-the-art columnar storage format available across the Hadoop platform.
Hadoop is very reliable for big, long-running queries, but it is also IO heavy.
• Incrementally take advantage of column-based storage in existing frameworks.
• Not tied to any framework in particular.

Columnar storage
• Limits the IO to only the data that is needed.
• Saves space: columnar layout compresses better.
• Enables better scans: load only the columns that need to be accessed.
• Enables vectorized execution engines.

@EmrgencyKittens

Collaboration between Twitter and Cloudera:
• Common file format definition:
  • Language independent
  • Formally specified.
• Implementation in Java for Map/Reduce:
  • https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
  • https://github.com/cloudera/impala

Twitter: Initial results
Data converted: similar to access logs. 30 columns.
Original format: Thrift binary in block compressed files.

(Charts: relative Space and Scan time for Thrift vs. Parquet, scanning 1 column and all 30 columns.)

Space saving: 28% using the same compression algorithm.
Scan + assembly time compared to original:
One column: 10%
All columns: 114%

Additional gains with dictionary
13 out of the 30 columns are suitable for dictionary encoding:
they represent 27% of raw data but only 3% of compressed data.

(Charts: relative Space for Parquet compressed (LZO), Parquet Dictionary uncompressed, and Parquet Dictionary compressed (LZO); Scan time relative to Thrift for Parquet compressed vs. Parquet dictionary compressed, scanning 1 and 13 columns.)

Space saving: another 52% using the same compression algorithm (on top of the original columnar storage gains).
Scan + assembly time compared to plain Parquet:
All 13 columns: 48% (of the already faster columnar scan)
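
To make the dictionary gains concrete, here is a minimal sketch of dictionary encoding (not the parquet-mr implementation; the column values are hypothetical): each value is replaced by a small integer index into a per-chunk dictionary, and the indices can then be bit packed and compressed cheaply.

def dictionary_encode(values):
    """Replace each value by an index into a dictionary of distinct values."""
    dictionary, index = [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
    return dictionary, [index[v] for v in values]

# Hypothetical low-cardinality column (e.g. an event name repeated many times).
dictionary, ids = dictionary_encode(["login", "login", "tweet", "login", "follow", "tweet"])
assert dictionary == ["login", "tweet", "follow"]
assert ids == [0, 0, 1, 0, 2, 1]   # small integers: cheap to bit pack and compress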

Format

(Diagram: a row group contains one column chunk per column (a, b, c); each column chunk is made of pages.)

• Row group: A group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50MB < row group < 1GB.
• Column chunk: The data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: Unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8KB < page < 1MB.
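
A rough sketch of that hierarchy as plain data classes (illustrative only; the real layout is defined by the Parquet Thrift metadata, and the size bounds in the comments are the rules of thumb from this slide):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Page:
    compressed_bytes: bytes          # unit of access; roughly 8KB < page < 1MB

@dataclass
class ColumnChunk:
    column_path: str                 # e.g. "Links.Backward"
    pages: List[Page] = field(default_factory=list)

@dataclass
class RowGroup:
    chunks: List[ColumnChunk] = field(default_factory=list)   # one chunk per column
    # A row group is buffered in memory while writing; roughly 50MB < row group < 1GB.

@dataclass
class ParquetFile:
    row_groups: List[RowGroup] = field(default_factory=list)
    # The footer (not modeled here) stores the schema and the column chunk offsets,
    # so a reader can seek directly to the chunks of the columns it needs.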

Format
Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
Language independent: well-defined format. Hadoop and Cloudera Impala support.

Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bound by the depth of the schema: stored in a compact form.

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Columns (leaves of the schema) and their maximum levels:
DocId:          max rep. level 0, max def. level 0
Links.Backward: max rep. level 1, max def. level 2
Links.Forward:  max rep. level 1, max def. level 2

Record:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80

Resulting column data (value, R, D):
DocId:          20 (R=0, D=0)
Links.Backward: 10 (R=0, D=2), 30 (R=1, D=2)
Links.Forward:  80 (R=0, D=2)

- We convert from a nested data structure to flat columnar storage, so we need to store extra information to maintain the record structure.
- We can picture the schema as a tree. The columns are the leaves, identified by their path in the schema.
- In addition to the values, we store two integers called the repetition level and the definition level.
- When the value is NULL, the definition level captures up to which level in the path the value was defined.
- The repetition level is the last level in the path where there was a repetition. 0 means we have a new record.
- Both levels are bound by the depth of the tree. They can be encoded efficiently using bit packing and run length encoding.
- A flat schema with all fields required does not have any overhead, as R and D are always 0 in that case.
Reference: http://research.google.com/pubs/pub36632.html
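
As a sanity check of the example above, here is a hand-rolled shredder for this specific Document schema (a sketch, not the general Dremel algorithm used by parquet-mr); it emits the (R, D, value) triplets shown on the slide:

def shred_document(doc):
    """Shred one Document record into (repetition, definition, value) triplets per column."""
    cols = {"DocId": [], "Links.Backward": [], "Links.Forward": []}
    # DocId is required at the root: R and D are always 0.
    cols["DocId"].append((0, 0, doc["DocId"]))
    links = doc.get("Links")                      # optional group (definition level 1)
    for name in ("Backward", "Forward"):          # repeated fields (definition level 2)
        values = (links or {}).get(name, [])
        if not values:
            # Links missing entirely -> D=0; Links present but list empty -> D=1.
            cols["Links." + name].append((0, 0 if links is None else 1, None))
        else:
            for i, v in enumerate(values):
                # The first value of a record has R=0; later values repeat at level 1.
                cols["Links." + name].append((0 if i == 0 else 1, 2, v))
    return cols

cols = shred_document({"DocId": 20, "Links": {"Backward": [10, 30], "Forward": [80]}})
assert cols["DocId"] == [(0, 0, 20)]
assert cols["Links.Backward"] == [(0, 2, 10), (1, 2, 30)]
assert cols["Links.Forward"] == [(0, 2, 80)]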

Repetition level
record 1: [[a, b, c], [d, e, f, g]]
record 2: [[h], [i, j]]

Level: 0,2,2,1,2,2,2,0,1,2
Data: a,b,c,d,e,f,g,h,i,j
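
A minimal sketch that reproduces the levels above for a two-level nested list (list of lists), assuming the maximum repetition level is 2:

def repetition_levels(records):
    """Flatten list<list<value>> records into values plus their repetition levels."""
    levels, values = [], []
    for record in records:
        rep = 0                       # 0 marks the first value of a new record
        for inner_list in record:
            for value in inner_list:
                levels.append(rep)
                values.append(value)
                rep = 2               # next value repeats at the innermost level
            rep = 1                   # next inner list repeats at the outer level
    return levels, values

levels, values = repetition_levels([
    [["a", "b", "c"], ["d", "e", "f", "g"]],   # record 1
    [["h"], ["i", "j"]],                       # record 2
])
assert levels == [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
assert values == list("abcdefghij")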

Differences between Parquet and ORC
Parquet:
• Repetition/definition levels capture the structure:
  => one column per leaf in the schema.
• Array<int> is one column.
• Nullity/repetition of an inner node is stored in each of its children:
  => one column independently of nesting, with some redundancy.

ORC:
• An extra column for each Map or List to record their size:
  => one column per node in the schema.
• Array<int> is two columns: array size and content.
• => An extra column per nesting level.
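
A schematic illustration of that difference for a repeated int field; the column representations below are made up for readability (neither format literally stores Python tuples or lists):

records = [[1, 2], [], [3]]          # three records, each an Array<int>

# Parquet-style: a single column of (repetition level, value) pairs per leaf;
# an empty array is recorded through the levels only (value None here).
parquet_column = []
for record in records:
    if not record:
        parquet_column.append((0, None))
    else:
        parquet_column.extend((0 if i == 0 else 1, v) for i, v in enumerate(record))

# ORC-style: an extra lengths column per nesting level plus a flat values column.
orc_lengths = [len(record) for record in records]       # [2, 0, 1]
orc_values = [v for record in records for v in record]  # [1, 2, 3]

assert parquet_column == [(0, 1), (1, 2), (0, None), (0, 3)]
assert orc_lengths == [2, 0, 1] and orc_values == [1, 2, 3]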
Iteration on fully assembled records
• To integrate with existing row based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles a projection for any subset of the columns: only those are loaded from disk.

(Diagram: columns a and b reassembled into records (a1, b1), (a2, b2), (a3, b3); the Document example assembled either fully or with projections that keep only the Links.Backward and/or Links.Forward columns.)

Explain in the context of example.
- columns: for implementing column based engines
- record based: for integrating with existing row based engines (Hive, Pig, M/R).

Iteration on columns
• To implement a column based execution engine.
• Iteration on triplets: repetition level (R), definition level (D), value (V).
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.

Example (D < 1 => null, R = 1 => same row):

Row   R  D  V
0     0  1  A
1     0  1  B
      1  1  C
2     0  0  (null)
3     0  1  D
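
A small sketch of that iteration model, regrouping the triplets from the table above into rows (assuming a maximum definition level of 1):

def rows_from_triplets(triplets, max_def=1):
    """Regroup a flat stream of (R, D, V) triplets into rows.
    R == 0 starts a new row; D < max_def means the value is null."""
    rows = []
    for r, d, v in triplets:
        if r == 0:
            rows.append([])
        rows[-1].append(v if d == max_def else None)
    return rows

triplets = [(0, 1, "A"), (0, 1, "B"), (1, 1, "C"), (0, 0, None), (0, 1, "D")]
assert rows_from_triplets(triplets) == [["A"], ["B", "C"], [None], ["D"]]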

APIs
• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however Impala, Pig, Hive, Thrift, Avro and ProtocolBuffers do.
  • Event-based, SAX-style record materialization layer. No double conversion.
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading and Scalding
  • Cascading tuples
  • Avro
  • Hive

- They all define a schema in a different way.

Encodings
• Bit packing:
  1 3 2 0 0 2 2 0  ->  01|11|10|00 00|10|10|00
  • Small integers encoded in the minimum bits required.
  • Useful for repetition levels, definition levels and dictionary keys.
• Run Length Encoding:
  1 1 1 1 1 1 1 1  ->  8 x 1
  • Cheap compression.
  • Works well for the definition levels of sparse columns.
• Dictionary encoding:
  • Useful for columns with few (< 50,000) distinct values.
  • Used in combination with bit packing.
• Extensible:
  • Defining new encodings is supported by the format.
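
A sketch of the bit-packing example above, packing most-significant bits first within each byte as drawn on the slide (the final Parquet hybrid RLE/bit-packing encoding packs bits in a different order, so this only illustrates the idea):

def bit_pack(values, width):
    """Pack small non-negative integers into bytes, most significant bits first."""
    out, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc = (acc << width) | v
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)   # pad the last byte with zeros
    return bytes(out)

# 8 values that fit in 2 bits each pack into 2 bytes: 01|11|10|00 00|10|10|00
assert bit_pack([1, 3, 2, 0, 0, 2, 2, 0], width=2) == bytes([0b01111000, 0b00101000])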

Contributors
Main contributors:
• Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
• Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
• Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): Encodings, projection push down
• Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
• Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
• Tom White (Cloudera): Avro integration
• Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
• Matt Massie (Berkeley AMP lab): predicate and projection push down
• David Chen (LinkedIn): Avro integration improvements

Parquet 2.0
- More encodings: more compact storage without heavyweight compression
  • Delta encodings: for integers, strings and dictionary
  • Improved encoding for strings and booleans
- Statistics: to be used by query planners and predicate push down
- New page format: faster predicate push down
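
As a hint of why a delta encoding helps, here is a toy version for sorted or slowly-changing integers, with made-up timestamps (the actual Parquet 2.0 delta encodings split values into blocks and bit-pack the deltas; this only shows the core idea):

def delta_encode(values):
    """Store the first value plus the differences between consecutive values."""
    if not values:
        return None, []
    return values[0], [b - a for a, b in zip(values, values[1:])]

def delta_decode(first, deltas):
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

timestamps = [1367507632, 1367507633, 1367507635, 1367507635, 1367507640]
first, deltas = delta_encode(timestamps)
assert deltas == [1, 2, 0, 5]          # small numbers: far fewer bits than the raw values
assert delta_decode(first, deltas) == timestamps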

How to contribute
Questions? Ideas?
Contribute at: github.com/Parquet
Look for tickets marked as “pick me up!”

http://parquet.io

