SlideShare a Scribd company logo
1 of 20
Download to read offline
1
Eļ¬ƒcient and portable DataFrame
storage with Apache Parquet
Uwe L. Korn, PyData London 2017
2
ā€¢ Data Scientist at Blue Yonder
(@BlueYonderTech)
ā€¢ Apache {Arrow, Parquet} PMC
ā€¢ Work in Python, Cython, C++11 and SQL
ā€¢ Heavy Pandas User
About me
xhochy
uwe@apache.org
3
Agenda
ā€¢ History of Apache Parquet
ā€¢ The format in detail
ā€¢ Use it in Python
4
About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. top-level Apache project
5. Fall 2016: Python & C++ support
6. State of the art format in the Hadoop ecosystem
ā€¢ often used as the default I/O option
5
Why use Parquet?
1. Columnar formatā€Ø
ā€”> vectorized operations
2. Eļ¬ƒcient encodings and compressionsā€Ø
ā€”> small size without the need for a fat CPU
3. Query push-downā€Ø
ā€”> bring computation to the I/O layer
4. Language independent formatā€Ø
ā€”> libs in Java / Scala / C++ / Python /ā€¦
6
Who uses Parquet?
ā€¢ Query Engines
ā€¢ Hive
ā€¢ Impala
ā€¢ Drill
ā€¢ Presto
ā€¢ ā€¦
ā€¢ Frameworks
ā€¢ Spark
ā€¢ MapReduce
ā€¢ ā€¦
ā€¢ Pandas
ā€¢ Dask
File Structure
File
RowGroup
Column Chunks
Page
Statistics
Encodings
ā€¢ Know the data
ā€¢ Exploit the knowledge
ā€¢ Cheaper than universal compression
ā€¢ Example dataset:
ā€¢ NYC TLC Trip Record data for January 2016
ā€¢ 1629 MiB as CSV
ā€¢ columns: bool(1), datetime(2), float(12), int(4)
ā€¢ Source: http://www.nyc.gov/html/tlc/html/about/
trip_record_data.shtml
Encodings ā€” PLAIN
ā€¢ Simply write the binary representation to disk
ā€¢ Simple to read & write
ā€¢ Performance limited by I/O throughput
ā€¢ ā€”> 1499 MiB
Encodings ā€” RLE & Bit Packing
ā€¢ bit-packing: only use the necessary bit
ā€¢ RunLengthEncoding: 378 times ā€ž12ā€œ
ā€¢ hybrid: dynamically choose the best
ā€¢ Used for Definition & Repetition levels
Encodings ā€” Dictionary
ā€¢ PLAIN_DICTIONARY / RLE_DICTIONARY
ā€¢ every value is assigned a code
ā€¢ Dictionary: store a map of code ā€”> value
ā€¢ Data: store only codes, use RLE on that
ā€¢ ā€”> 329 MiB (22%)
Compression
1. Shrink data size independent of its content
2. More CPU intensive than encoding
3. encoding+compression performs better than
compression alone with less CPU cost
4. LZO, Snappy, GZIP, Brotliā€Ø
ā€”> If in doubt: use Snappy
5. GZIP: 174 MiB (11%)ā€Ø
Snappy: 216 MiB (14 %)
Query pushdown
1. Only load used data
1. skip columns that are not needed
2. skip (chunks of) rows that not relevant
2. saves I/O load as the data is not transferred
3. saves CPU as the data is not decoded
Benchmarks (size)
Benchmarks (time)
Benchmarks (size vs time)
Read & Write Parquet
17
https://arrow.apache.org/docs/python/parquet.html
Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/
18
Apache Arrow?
ā€¢ Specification for in-memory columnar data layout
ā€¢ No overhead for cross-system communication
ā€¢ Designed for eļ¬ƒciency (exploit SIMD, cache locality, ..)
ā€¢ Exchange data without conversion between Python, C++, C(glib),
Ruby, Lua, R and the JVM
ā€¢ This brought Parquet to Pandas without any Python code in
parquet-cpp
Just released 0.3
Cross language DataFrame library
ā€¢ Website: https://arrow.apache.org/
ā€¢ ML: dev@arrow.apache.org
ā€¢ Issues & Tasks: https://issues.apache.org/jira/
browse/ARROW
ā€¢ Slack: https://
apachearrowslackin.herokuapp.com/
ā€¢ Github mirror: https://github.com/apache/
arrow
Apache Arrow Apache Parquet
Famous columnar file format
ā€¢ Website: https://parquet.apache.org/
ā€¢ ML: dev@parquet.apache.org
ā€¢ Issues & Tasks: https://issues.apache.org/jira/
browse/PARQUET
ā€¢ Slack: https://parquet-slack-
invite.herokuapp.com/
ā€¢ C++ Github mirror: https://github.com/
apache/parquet-cpp
19
Get Involved!
Blue Yonder GmbH
OhiostraƟe 8
76149 Karlsruhe
Germany
+49 721 383117 0
Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360
Blue Yonder
Best decisions,
delivered daily
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA
20

More Related Content

What's hot

ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
Ā 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
Ā 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
Ā 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
Ā 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
Ā 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
Ā 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
Ā 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
Ā 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
Ā 
Rust is for "Big Data"
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"Andy Grove
Ā 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
Ā 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowJulien Le Dem
Ā 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)Wes McKinney
Ā 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
Ā 
Ibis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceWes McKinney
Ā 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
Ā 
If you have your own Columnar format, stop now and use Parquet šŸ˜›
If you have your own Columnar format,  stop now and use Parquet  šŸ˜›If you have your own Columnar format,  stop now and use Parquet  šŸ˜›
If you have your own Columnar format, stop now and use Parquet šŸ˜›Julien Le Dem
Ā 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed databaseJulien Le Dem
Ā 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
Ā 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FutureWes McKinney
Ā 

What's hot (20)

ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Ā 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
Ā 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Ā 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
Ā 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Ā 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Ā 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Ā 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Ā 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
Ā 
Rust is for "Big Data"
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"
Ā 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
Ā 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
Ā 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
Ā 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Ā 
Ibis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data Experience
Ā 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
Ā 
If you have your own Columnar format, stop now and use Parquet šŸ˜›
If you have your own Columnar format,  stop now and use Parquet  šŸ˜›If you have your own Columnar format,  stop now and use Parquet  šŸ˜›
If you have your own Columnar format, stop now and use Parquet šŸ˜›
Ā 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
Ā 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
Ā 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
Ā 

Similar to Efficient DataFrame storage with Apache Parquet

ApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detail
ApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detailApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detail
ApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detailUwe Korn
Ā 
PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...Uwe Korn
Ā 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
Ā 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
Ā 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
Ā 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
Ā 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computingThe BioTeam Inc.
Ā 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataHakka Labs
Ā 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataHitoshi Sato
Ā 
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB
Ā 
CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...
CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...
CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...javier ramirez
Ā 
Silicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDBSilicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDBDaniel Coupal
Ā 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSandeep Patil
Ā 
Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Trishali Nayar
Ā 
Running MongoDB 3.0 on AWS
Running MongoDB 3.0 on AWSRunning MongoDB 3.0 on AWS
Running MongoDB 3.0 on AWSMongoDB
Ā 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)Nicolas Poggi
Ā 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
Ā 

Similar to Efficient DataFrame storage with Apache Parquet (20)

ApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detail
ApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detailApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detail
ApacheCon Europe Big Data 2016 ā€“ Parquet in practice & detail
Ā 
PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 ā€“ Connecting PyData to other Big Data Landsca...
Ā 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
Ā 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
Ā 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
Ā 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
Ā 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
Ā 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computing
Ā 
Storage in hadoop
Storage in hadoopStorage in hadoop
Storage in hadoop
Ā 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Ā 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
Ā 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
Ā 
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
Ā 
CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...
CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...
CĆ³mo se diseƱa una base de datos que pueda ingerir mĆ”s de cuatro millones de ...
Ā 
Silicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDBSilicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDB
Ā 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN Caching
Ā 
Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...
Ā 
Running MongoDB 3.0 on AWS
Running MongoDB 3.0 on AWSRunning MongoDB 3.0 on AWS
Running MongoDB 3.0 on AWS
Ā 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
Ā 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
Ā 

More from Uwe Korn

Going beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settingsGoing beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settingsUwe Korn
Ā 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastUwe Korn
Ā 
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsUwe Korn
Ā 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Uwe Korn
Ā 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyUwe Korn
Ā 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with DaskUwe Korn
Ā 
PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...
PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...
PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...Uwe Korn
Ā 

More from Uwe Korn (7)

Going beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settingsGoing beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settings
Ā 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fast
Ā 
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
Ā 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Ā 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Ā 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with Dask
Ā 
PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...
PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...
PyData Amsterdam 2018 ā€“Ā Building customer-visible data science dashboards wit...
Ā 

Recently uploaded

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
Ā 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
Ā 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
Ā 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
Ā 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
Ā 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
BDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort Service
BDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort ServiceBDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort Service
BDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort ServiceDelhi Call girls
Ā 
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...amitlee9823
Ā 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
Ā 
Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...amitlee9823
Ā 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptDr. Soumendra Kumar Patra
Ā 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
Ā 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
Ā 
Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...amitlee9823
Ā 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
Ā 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
Ā 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
Ā 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
Ā 

Recently uploaded (20)

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Ā 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
Ā 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Ā 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Ā 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
Ā 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
Ā 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Ā 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
Ā 
BDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort Service
BDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort ServiceBDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort Service
BDSMāš”Call Girls in Mandawali Delhi >ą¼’8448380779 Escort Service
Ā 
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Ā 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
Ā 
Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Ser...
Ā 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
Ā 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Ā 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
Ā 
Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ba...
Ā 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
Ā 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
Ā 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
Ā 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Ā 

Efficient DataFrame storage with Apache Parquet

  • 1. 1 Eļ¬ƒcient and portable DataFrame storage with Apache Parquet Uwe L. Korn, PyData London 2017
  • 2. 2 ā€¢ Data Scientist at Blue Yonder (@BlueYonderTech) ā€¢ Apache {Arrow, Parquet} PMC ā€¢ Work in Python, Cython, C++11 and SQL ā€¢ Heavy Pandas User About me xhochy uwe@apache.org
  • 3. 3 Agenda ā€¢ History of Apache Parquet ā€¢ The format in detail ā€¢ Use it in Python
  • 4. 4 About Parquet 1. Columnar on-disk storage format 2. Started in fall 2012 by Cloudera & Twitter 3. July 2013: 1.0 release 4. top-level Apache project 5. Fall 2016: Python & C++ support 6. State of the art format in the Hadoop ecosystem ā€¢ often used as the default I/O option
  • 5. 5 Why use Parquet? 1. Columnar formatā€Ø ā€”> vectorized operations 2. Eļ¬ƒcient encodings and compressionsā€Ø ā€”> small size without the need for a fat CPU 3. Query push-downā€Ø ā€”> bring computation to the I/O layer 4. Language independent formatā€Ø ā€”> libs in Java / Scala / C++ / Python /ā€¦
  • 6. 6 Who uses Parquet? ā€¢ Query Engines ā€¢ Hive ā€¢ Impala ā€¢ Drill ā€¢ Presto ā€¢ ā€¦ ā€¢ Frameworks ā€¢ Spark ā€¢ MapReduce ā€¢ ā€¦ ā€¢ Pandas ā€¢ Dask
  • 8. Encodings ā€¢ Know the data ā€¢ Exploit the knowledge ā€¢ Cheaper than universal compression ā€¢ Example dataset: ā€¢ NYC TLC Trip Record data for January 2016 ā€¢ 1629 MiB as CSV ā€¢ columns: bool(1), datetime(2), float(12), int(4) ā€¢ Source: http://www.nyc.gov/html/tlc/html/about/ trip_record_data.shtml
  • 9. Encodings ā€” PLAIN ā€¢ Simply write the binary representation to disk ā€¢ Simple to read & write ā€¢ Performance limited by I/O throughput ā€¢ ā€”> 1499 MiB
  • 10. Encodings ā€” RLE & Bit Packing ā€¢ bit-packing: only use the necessary bit ā€¢ RunLengthEncoding: 378 times ā€ž12ā€œ ā€¢ hybrid: dynamically choose the best ā€¢ Used for Definition & Repetition levels
  • 11. Encodings ā€” Dictionary ā€¢ PLAIN_DICTIONARY / RLE_DICTIONARY ā€¢ every value is assigned a code ā€¢ Dictionary: store a map of code ā€”> value ā€¢ Data: store only codes, use RLE on that ā€¢ ā€”> 329 MiB (22%)
  • 12. Compression 1. Shrink data size independent of its content 2. More CPU intensive than encoding 3. encoding+compression performs better than compression alone with less CPU cost 4. LZO, Snappy, GZIP, Brotliā€Ø ā€”> If in doubt: use Snappy 5. GZIP: 174 MiB (11%)ā€Ø Snappy: 216 MiB (14 %)
  • 13. Query pushdown 1. Only load used data 1. skip columns that are not needed 2. skip (chunks of) rows that not relevant 2. saves I/O load as the data is not transferred 3. saves CPU as the data is not decoded
  • 17. Read & Write Parquet 17 https://arrow.apache.org/docs/python/parquet.html Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/
  • 18. 18 Apache Arrow? ā€¢ Specification for in-memory columnar data layout ā€¢ No overhead for cross-system communication ā€¢ Designed for eļ¬ƒciency (exploit SIMD, cache locality, ..) ā€¢ Exchange data without conversion between Python, C++, C(glib), Ruby, Lua, R and the JVM ā€¢ This brought Parquet to Pandas without any Python code in parquet-cpp Just released 0.3
  • 19. Cross language DataFrame library ā€¢ Website: https://arrow.apache.org/ ā€¢ ML: dev@arrow.apache.org ā€¢ Issues & Tasks: https://issues.apache.org/jira/ browse/ARROW ā€¢ Slack: https:// apachearrowslackin.herokuapp.com/ ā€¢ Github mirror: https://github.com/apache/ arrow Apache Arrow Apache Parquet Famous columnar file format ā€¢ Website: https://parquet.apache.org/ ā€¢ ML: dev@parquet.apache.org ā€¢ Issues & Tasks: https://issues.apache.org/jira/ browse/PARQUET ā€¢ Slack: https://parquet-slack- invite.herokuapp.com/ ā€¢ C++ Github mirror: https://github.com/ apache/parquet-cpp 19 Get Involved!
  • 20. Blue Yonder GmbH OhiostraƟe 8 76149 Karlsruhe Germany +49 721 383117 0 Blue Yonder Software Limited 19 Eastbourne Terrace London, W2 6LG United Kingdom +44 20 3626 0360 Blue Yonder Best decisions, delivered daily Blue Yonder Analytics, Inc. 5048 Tennyson Parkway Suite 250 Plano, Texas 75024 USA 20