
How Apache Arrow and Parquet boost cross-language interoperability

Talk given at PyData Paris 2016 about the importance of, and recent developments on, the Python side of Apache Arrow and Apache Parquet.


  1. Uwe L. Korn, PyData Paris, 14th June 2016: How Apache Arrow and Parquet boost cross-language interop
  2. About me
     • Data Scientist at Blue Yonder (@BlueYonderTech)
     • We optimize Replenishment and Pricing for the Retail industry with Predictive Analytics
     • Contributor to Apache {Arrow, Parquet}
     • Work in Python, Cython, C++11 and SQL
  3. Agenda
     • The Problem
     • Arrow
     • Parquet
     • Outlook
  4. Why is columnar better?
     Image source: https://arrow.apache.org/img/simd.png (https://arrow.apache.org/)
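To make the columnar argument a little more concrete, here is a minimal sketch using only the Python standard library. The field names (`session_id`, `price`) and the values are made up for illustration, and this is not Arrow code; it only contrasts the two layouts.

```python
from array import array

# Hypothetical records, stored row-wise: one tuple per row.
rows = [(1, 9.5), (2, 3.0), (3, 7.5)]

# The same data stored column-wise: one contiguous typed buffer per column.
session_id = array("q", (r[0] for r in rows))  # signed 64-bit integers
price = array("d", (r[1] for r in rows))       # 64-bit floats

# Row-wise aggregation has to step over every field of every row.
total_from_rows = sum(r[1] for r in rows)

# Columnar aggregation scans one contiguous buffer, which is the access
# pattern that CPU caches and SIMD units handle well.
total_from_column = sum(price)
```

Pure Python will not show the speedup itself; the point is only that the column lives in a single typed buffer that native code can scan without skipping over unrelated fields.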
  5. Different Systems - Varying Python Support
     • Various levels of Python support
       • Built in Python
       • Python API
       • No Python at all
     • Each tool/algorithm works on columnar data
     • Separate conversion routines for each pair
       • causes overhead
       • there's no one-size-fits-all solution
     Image source: https://arrow.apache.org/img/copy2.png (https://arrow.apache.org/)
  6. Apache Arrow
     • Specification for in-memory columnar data layout
     • No overhead for cross-system / cross-language communication
     • Designed for efficiency (exploit SIMD, cache locality, ..)
     • Supports nested data structures
     Image source: https://arrow.apache.org/img/shared2.png (https://arrow.apache.org/)
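What an in-memory columnar layout can look like is sketched below for a nullable string column: a validity bitmap, an offsets buffer, and one data buffer. This is a simplified illustration in the spirit of the Arrow layout, written with the standard library only; the helper names `build_string_column` and `get` are hypothetical and not part of any Arrow API, and details such as padding and alignment are left out.

```python
from array import array

def build_string_column(values):
    """Pack a list of str-or-None into three flat buffers (simplified sketch)."""
    validity = bytearray((len(values) + 7) // 8)  # 1 bit per slot, 1 = not null
    offsets = array("i", [0])                     # len(values) + 1 entries
    data = bytearray()                            # all UTF-8 bytes, back to back
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)      # least-significant bit first
            data += value.encode("utf-8")
        offsets.append(len(data))                 # nulls add a zero-length slot
    return validity, offsets, data

def get(column, i):
    """Read slot i back using only the bitmap and two offsets."""
    validity, offsets, data = column
    if not validity[i // 8] & (1 << (i % 8)):
        return None
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")

column = build_string_column(["joe", None, "mark"])
```

Because the three buffers are flat byte sequences with no language-specific objects inside, any language that agrees on the layout can read them in place, which is the property the slide calls "no overhead for cross-language communication".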
  7. Apache Arrow - The Impact
     • An example: Retrieve a dataset from an MPP database and analyze it in Pandas
       • Run a query in the DB
       • Pass it in columnar form to the DB driver
       • The ODBC layer transforms it into row-wise form
       • Pandas makes it columnar again
     • Ugly real-life solution: export as CSV, bypass ODBC
     • In future: Use Arrow as interface between the DB and Pandas
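The round trip described on this slide can be mimicked in a few lines of plain Python. The column names and values are invented, and the two steps only stand in for what a row-oriented driver layer and a data frame library do; no real ODBC or Pandas calls are involved.

```python
# Hypothetical query result as the database holds it: column-wise.
db_columns = {"store": [1, 2, 3], "sales": [10.0, 12.5, 8.0]}

# Step 1: a row-oriented driver layer turns the columns into rows.
names = list(db_columns)
rows = list(zip(*(db_columns[name] for name in names)))

# Step 2: the data frame library turns the rows back into columns.
frame_columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}
```

Both steps touch every value, yet `frame_columns` ends up equal to `db_columns`. A shared columnar interface would let the second system use the first system's columns directly.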
  8. Apache Arrow
     • Top-level Apache project from the beginning
     • Not only a specification: also includes C++ / Java / Python / .. code
       • Arrow structures / classes
       • RPC (upcoming) & IPC (alpha) support
       • Conversion code for Parquet, Pandas, ..
     • Combined effort from developers of over 13 major OSS projects
       • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
     • Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
  9. Arrow in Action: Feather
     • Language-agnostic file format for binary data frame storage
     • Read performance close to raw disk I/O
     • by Wes McKinney (Python) and Hadley Wickham (R)
     • Julia support in progress
     Diagram labels: Arrow Arrays; Feather Metadata (flatbuffers)
  10. Apache Parquet
  11. Apache Parquet
      • Binary file format for nested columnar data
      • Inspired by the Google Dremel paper
      • Space and query efficient
        • multiple encodings
        • predicate pushdown
        • column-wise compression
      • Many tools use Parquet as the default input format
      • Very popular in the JVM/Hadoop-based world
  12. The Basics
      • 1 File, includes metadata
      • Several row groups
        • all with the same number of column chunks
        • n pages per column chunk
      • Benefits:
        • pre-partitioned for fast distributed access
        • statistics in the metadata for predicate pushdown
      Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
      Diagram labels: File, Row Group, Column Chunk, Page
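How statistics in the metadata enable predicate pushdown can be illustrated with a toy model. The `RowGroup` class and both functions below are hypothetical stand-ins written for this sketch; they are not the Parquet format or any Parquet library, and a single integer column is assumed.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Toy row group: one column chunk plus its min/max statistics."""
    values: list
    min_value: int
    max_value: int

def make_row_groups(values, rows_per_group):
    """Split a column into row groups and record statistics for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups

def scan_greater_than(groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row can match: skip without reading
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read

groups = make_row_groups([1, 4, 2, 9, 7, 8, 3, 5, 6], rows_per_group=3)
matches, groups_read = scan_greater_than(groups, threshold=6)
```

In this run only one of the three row groups is read, because the other two have a maximum that cannot satisfy the filter. In a real file the same decision is made from the metadata before any column data is fetched from disk.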
  13. Using Parquet in Python
      • You can use it already today with Python:
        • sqlContext.read.parquet("..").toPandas()
        • Needs to pass through Spark, very slow
      • Native Python support on its way:
        • Parquet I/O to Arrow
        • Arrow provides NumPy conversion
  14. State of Arrow & Parquet
      Arrow: in-memory spec for columnar data
      • Java (beta)
      • C++ (in progress)
      • Python (in progress)
      • Planned: Julia, R
      Parquet: columnar on-disk storage
      • Java (mature)
      • C++ (in progress)
      • Python (in progress)
      • Planned: Julia, R
  15. Upcoming
      • Parquet <-Arrow-> Pandas
      • IPC on its way
        • alpha implementation using memory mapped files
        • JVM <-> native with shared reference counting
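The idea behind IPC over memory mapped files can be sketched with the standard library: one side writes a column as a raw buffer, the other side maps the file and reads the values in place. This is only an illustration of the mechanism, not Arrow's IPC implementation; the file name is hypothetical, and it assumes both sides run on the same machine and agree on the element type and native byte order.

```python
import mmap
import os
import tempfile
from array import array

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")  # hypothetical file name

    # "Producer": write a column of doubles as one raw buffer.
    column = array("d", [1.5, 2.5, 4.0])
    with open(path, "wb") as f:
        f.write(column.tobytes())

    # "Consumer": map the file and view the bytes as doubles, without
    # parsing them or copying them into new Python containers.
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    raw = memoryview(mapped)
    view = raw.cast("d")
    total = sum(view)
    first = view[0]

    # Release the views before closing the mapping.
    view.release()
    raw.release()
    mapped.close()
```

No serialization step appears anywhere: the consumer interprets the very bytes the producer wrote. That only works when both sides share a layout specification, which is the role Arrow plays.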
  16. Get Involved!
      • dev@arrow.apache.org & dev@parquet.apache.org
      • https://apachearrowslackin.herokuapp.com/
      • https://arrow.apache.org/
      • https://parquet.apache.org/
      • @ApacheArrow & @ApacheParquet
  17. Questions?!
