2. About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
xhochy
uwe@apache.org
4. About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. Top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
5. Why use Parquet?
1. Columnar format
→ vectorized operations
2. Efficient encodings and compressions
→ small size without the need for a fat CPU
3. Query push-down
→ bring computation to the I/O layer
4. Language-independent format
→ libs in Java / Scala / C++ / Python / …
8. Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
• NYC TLC Trip Record data for January 2016
• 1629 MiB as CSV
• columns: bool(1), datetime(2), float(12), int(4)
• Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
9. Encodings – PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• → 1499 MiB
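The PLAIN encoding described above can be sketched in a few lines of Python with the standard `struct` module: the values are simply dumped as their raw little-endian binary representation, back to back (`plain_encode`/`plain_decode` are illustrative names, not part of any Parquet library).

```python
import struct

def plain_encode(values):
    """PLAIN encoding sketch: write the raw little-endian binary
    representation of each float64 back to back, no compression."""
    return struct.pack("<%dd" % len(values), *values)

def plain_decode(buf):
    """Inverse: reinterpret the raw bytes as float64 values again."""
    return list(struct.unpack("<%dd" % (len(buf) // 8), buf))

encoded = plain_encode([1.5, 2.5, 3.5])
# 3 float64 values take exactly 3 * 8 = 24 bytes on disk
```

Reading is just the reverse reinterpretation, which is why PLAIN is cheap on CPU but bounded by raw I/O throughput.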
10. Encodings – RLE & Bit Packing
• Bit packing: only use the necessary bits
• Run-length encoding: store 378 repetitions of "12" as a single run
• Hybrid: dynamically choose the better of the two
• Used for definition & repetition levels
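A minimal run-length-encoding sketch (illustrative only; the real Parquet RLE/bit-packing hybrid operates on bit level) shows how a long run like the 378 repeated "12"s collapses into one pair:

```python
from itertools import groupby

def rle_encode(values):
    """RLE sketch: collapse runs of equal values into (run_length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]

# 378 repeated "12"s followed by two 7s
runs = rle_encode([12] * 378 + [7, 7])
# -> [(378, 12), (2, 7)]
```

The hybrid encoding additionally bit-packs short runs, since a (length, value) pair only pays off when the run is long enough.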
11. Encodings – Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• Every value is assigned a code
• Dictionary: store a map of code → value
• Data: store only the codes, and apply RLE to them
• → 329 MiB (22%)
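The idea behind dictionary encoding can be sketched in plain Python (`dictionary_encode` is a hypothetical helper, not a pyarrow API): each distinct value gets a small integer code, the dictionary is stored once, and the data pages hold only the codes.

```python
def dictionary_encode(values):
    """Dictionary-encoding sketch: map each distinct value to a small
    integer code; store the dictionary once, then only codes per row."""
    dictionary = {}
    codes = []
    for value in values:
        # assign the next free code on first sight of a value
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

dictionary, codes = dictionary_encode(["cash", "cash", "card", "cash"])
# dictionary == {"cash": 0, "card": 1}; codes == [0, 0, 1, 0]
```

Since the codes are small integers with many repeats, RLE and bit packing on top of them is what produces the large size reduction quoted above.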
12. Compression
1. Shrinks the data independent of its content
2. More CPU-intensive than encoding
3. Encoding + compression performs better than compression alone, at lower CPU cost
4. LZO, Snappy, GZIP, Brotli
→ If in doubt: use Snappy
5. GZIP: 174 MiB (11%)
Snappy: 216 MiB (14%)
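The effect of general-purpose compression is easy to demonstrate with the standard library's `zlib` (the DEFLATE algorithm used inside GZIP); the payload here is a made-up repetitive string, not the TLC dataset:

```python
import zlib

# a highly repetitive payload, loosely mimicking a serialized taxi-trip column
raw = b"passenger_count=1;" * 10000

compressed = zlib.compress(raw, level=6)  # DEFLATE, as used inside GZIP
ratio = len(compressed) / len(raw)
# repetitive content shrinks to a tiny fraction of its raw size
```

In Parquet this choice is just a writer option per file, with Snappy trading a somewhat larger output for much cheaper (de)compression than GZIP.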
13. Query pushdown
1. Only load the data that is used
1. Skip columns that are not needed
2. Skip (chunks of) rows that are not relevant
2. Saves I/O load, as the data is not transferred
3. Saves CPU, as the data is not decoded
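Row-chunk skipping works because Parquet stores min/max statistics per row group. A pure-Python sketch (hypothetical structures, not the Parquet API) of predicate pushdown over such statistics:

```python
def scan_with_pushdown(row_groups, column, lower_bound):
    """Pushdown sketch: each row group carries per-column min/max stats,
    as Parquet row groups do. Groups whose max is below the predicate
    are skipped entirely: no I/O, no decoding."""
    result = []
    for stats, data in row_groups:
        if stats[column]["max"] < lower_bound:
            continue  # whole chunk pruned via its statistics
        result.extend(v for v in data[column] if v >= lower_bound)
    return result

# hypothetical row groups for a "fare" column
row_groups = [
    ({"fare": {"min": 1, "max": 9}}, {"fare": [3, 5, 9]}),
    ({"fare": {"min": 10, "max": 52}}, {"fare": [10, 23, 52]}),
]
matches = scan_with_pushdown(row_groups, "fare", 10)
```

The first group is never touched: with the predicate `fare >= 10` its statistics already prove no row can match.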
17. Read & Write Parquet
https://arrow.apache.org/docs/python/parquet.html
Alternative implementation: https://fastparquet.readthedocs.io/en/latest/
18. Apache Arrow?
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM
• This brought Parquet to Pandas without any Python code in parquet-cpp
Just released 0.3
20. Blue Yonder
Best decisions, delivered daily

Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0

Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360

Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA