2. About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
xhochy
uwe@apache.org
4. About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. Top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
5. Why use Parquet?
1. Columnar format
→ vectorized operations
2. Efficient encodings and compressions
→ small size without the need for a fat CPU
3. Query push-down
→ bring computation to the I/O layer
4. Language-independent format
→ libs in Java / Scala / C++ / Python / …
8. Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
• NYC TLC Trip Record data for January 2016
• 1629 MiB as CSV
• columns: bool(1), datetime(2), float(12), int(4)
• Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
9. Encodings – PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• → 1499 MiB
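The PLAIN encoding described above can be sketched in a few lines of Python with the standard `struct` module: the values are simply dumped as their raw little-endian binary representation, back to back (`plain_encode`/`plain_decode` are illustrative names, not part of any Parquet library).

```python
import struct

def plain_encode(values):
    """PLAIN encoding sketch: write the raw little-endian binary
    representation of each float64 back to back, no compression."""
    return struct.pack("<%dd" % len(values), *values)

def plain_decode(buf):
    """Inverse: reinterpret the raw bytes as float64 values again."""
    return list(struct.unpack("<%dd" % (len(buf) // 8), buf))

encoded = plain_encode([1.5, 2.5, 3.5])
# 3 float64 values take exactly 3 * 8 = 24 bytes on disk
```

Reading is just the reverse reinterpretation, which is why PLAIN is cheap on CPU but bounded by raw I/O throughput.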
10. Encodings – RLE & Bit Packing
• Bit packing: only use the necessary bits
• Run-length encoding: store 378 repetitions of "12" as a single run
• Hybrid: dynamically choose the better of the two
• Used for definition & repetition levels
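A minimal run-length-encoding sketch (illustrative only; the real Parquet RLE/bit-packing hybrid operates on bit level) shows how a long run like the 378 repeated "12"s collapses into one pair:

```python
from itertools import groupby

def rle_encode(values):
    """RLE sketch: collapse runs of equal values into (run_length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]

# 378 repeated "12"s followed by two 7s
runs = rle_encode([12] * 378 + [7, 7])
# -> [(378, 12), (2, 7)]
```

The hybrid encoding additionally bit-packs short runs, since a (length, value) pair only pays off when the run is long enough.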
11. Encodings – Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• Every value is assigned a code
• Dictionary: store a map of code → value
• Data: store only the codes, and apply RLE to them
• → 329 MiB (22%)
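The idea behind dictionary encoding can be sketched in plain Python (`dictionary_encode` is a hypothetical helper, not a pyarrow API): each distinct value gets a small integer code, the dictionary is stored once, and the data pages hold only the codes.

```python
def dictionary_encode(values):
    """Dictionary-encoding sketch: map each distinct value to a small
    integer code; store the dictionary once, then only codes per row."""
    dictionary = {}
    codes = []
    for value in values:
        # assign the next free code on first sight of a value
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

dictionary, codes = dictionary_encode(["cash", "cash", "card", "cash"])
# dictionary == {"cash": 0, "card": 1}; codes == [0, 0, 1, 0]
```

Since the codes are small integers with many repeats, RLE and bit packing on top of them is what produces the large size reduction quoted above.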
12. Compression
1. Shrinks the data independent of its content
2. More CPU-intensive than encoding
3. Encoding + compression performs better than compression alone, at lower CPU cost
4. LZO, Snappy, GZIP, Brotli
→ If in doubt: use Snappy
5. GZIP: 174 MiB (11%)
Snappy: 216 MiB (14%)
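The effect of general-purpose compression is easy to demonstrate with the standard library's `zlib` (the DEFLATE algorithm used inside GZIP); the payload here is a made-up repetitive string, not the TLC dataset:

```python
import zlib

# a highly repetitive payload, loosely mimicking a serialized taxi-trip column
raw = b"passenger_count=1;" * 10000

compressed = zlib.compress(raw, level=6)  # DEFLATE, as used inside GZIP
ratio = len(compressed) / len(raw)
# repetitive content shrinks to a tiny fraction of its raw size
```

In Parquet this choice is just a writer option per file, with Snappy trading a somewhat larger output for much cheaper (de)compression than GZIP.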
13. Query pushdown
1. Only load the data that is used
1. Skip columns that are not needed
2. Skip (chunks of) rows that are not relevant
2. Saves I/O load, as the data is not transferred
3. Saves CPU, as the data is not decoded
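Row-chunk skipping works because Parquet stores min/max statistics per row group. A pure-Python sketch (hypothetical structures, not the Parquet API) of predicate pushdown over such statistics:

```python
def scan_with_pushdown(row_groups, column, lower_bound):
    """Pushdown sketch: each row group carries per-column min/max stats,
    as Parquet row groups do. Groups whose max is below the predicate
    are skipped entirely: no I/O, no decoding."""
    result = []
    for stats, data in row_groups:
        if stats[column]["max"] < lower_bound:
            continue  # whole chunk pruned via its statistics
        result.extend(v for v in data[column] if v >= lower_bound)
    return result

# hypothetical row groups for a "fare" column
row_groups = [
    ({"fare": {"min": 1, "max": 9}}, {"fare": [3, 5, 9]}),
    ({"fare": {"min": 10, "max": 52}}, {"fare": [10, 23, 52]}),
]
matches = scan_with_pushdown(row_groups, "fare", 10)
```

The first group is never touched: with the predicate `fare >= 10` its statistics already prove no row can match.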
17. Read & Write Parquet
https://arrow.apache.org/docs/python/parquet.html
Alternative implementation: https://fastparquet.readthedocs.io/en/latest/
18. Apache Arrow?
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM
• This brought Parquet to Pandas without any Python code in parquet-cpp
Just released 0.3
20. Blue Yonder
Best decisions, delivered daily

Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0

Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360

Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA