Apache Arrow -- Cross-language development platform for in-memory data

Slides from NYC R Conference, April 20, 2018

  1. Wes McKinney. Apache Arrow: Cross-Language Development Platform for In-Memory Analytics. NYC R Conference, 20 April 2018
  2. Wes McKinney
     • Created Python pandas project (~2008), lead developer/maintainer until 2013
     • PMC Apache Arrow, Apache Parquet, ASF Member
     • Wrote Python for Data Analysis (1e 2012, 2e 2017)
     • Formerly Co-founder / CEO of DataPad (acquired by Cloudera in 2014)
     • Other OSS work: Ibis, Feather, Apache Kudu, statsmodels
  3. • Raise money to support full-time open source developers
     • Grow Apache Arrow ecosystem
     • Build cross-language, portable computational libraries for data science
     • Build relationships across industry
  4. People
  5. Initial Sponsors and Partners. Prospective sponsors / partners, please reach out:
  6. Apache Arrow
     • Open source community initiative started in 2016
     • Backed by ~13 major OSS projects at start, significantly more now
     • Shared standards and systems for memory interoperability and computation
     • Cross-language libraries
  7. Defragmenting Data Access
  8. “Portable” Data Frames
     [Diagram: Non-Portable Data Frames (pandas, R, JVM) vs. Arrow Portable Data Frames …]
     Share data and algorithms at ~zero cost
  9. Some Arrow Use Cases
     • Runtime in-memory format for analytical query engines
     • Zero-copy (no deserialization) interchange via shared memory
     • Low-overhead streaming messaging / RPC
     • Serialization format implementation
     • Zero-copy random access to on-disk data (example: Feather files)
     • Data ingest / data access
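The "zero-copy random access to on-disk data" use case on slide 9 can be illustrated with the Python standard library alone. This is a minimal sketch, not the Feather or Arrow file format: the raw native-endian int64 file layout and the helper names `write_int64_column` and `read_slot` are assumptions made for the example. The point it shows is that a memory-mapped buffer can be reinterpreted in place, so reading one value does not require parsing or copying the rest of the file.

```python
import mmap
import os
import tempfile
from array import array


def write_int64_column(path, values):
    """Write values as raw native-endian 64-bit integers (hypothetical layout)."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def read_slot(path, index):
    """Read one value through a memory map without deserializing the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as int64 in place; no copy is made.
            with memoryview(mm) as raw, raw.cast("q") as view:
                return view[index]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_int64_column(path, range(1000))
    value = read_slot(path, 742)   # touches only the page holding slot 742
```

Because the on-disk bytes already match the in-memory layout, the operating system can page data in on demand, which is the property the slide attributes to Arrow-style formats.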
  10. Arrow’s Columnar Memory Format
      • Runtime memory format for analytical query processing
      • Companion to serialization tech like Apache {Parquet, ORC}
      • “Fully shredded” columnar, supports flat and nested schemas
      • Organized for cache-efficient access on CPUs/GPUs
      • Optimized for data locality, SIMD, parallel processing
      • Accommodates both random access and scan workloads
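To make the columnar layout on slide 10 concrete, here is a simplified sketch in standard-library Python. It is an illustration of the general idea (a validity bitmap plus a contiguous values buffer, and an offsets buffer for variable-length strings), not an implementation of the Arrow specification: it skips padding, alignment, and nested types, and all function names are hypothetical.

```python
from array import array


def build_int64_column(values):
    """Pack ints (None = null) into a validity bitmap and a values buffer."""
    n = len(values)
    validity = bytearray((n + 7) // 8)   # one bit per slot, least-significant bit first
    data = array("q", [0] * n)           # contiguous 64-bit signed integers
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)
            data[i] = v
    return validity, data


def is_valid(validity, i):
    """Random access: check the null bit for slot i."""
    return bool(validity[i // 8] & (1 << (i % 8)))


def column_sum(validity, data):
    """Scan workload: sum every non-null slot in order."""
    return sum(v for i, v in enumerate(data) if is_valid(validity, i))


def build_string_column(strings):
    """Variable-length values: an offsets buffer plus one contiguous bytes buffer."""
    offsets = array("i", [0])
    chunks = []
    for s in strings:
        chunks.append(s.encode("utf-8"))
        offsets.append(offsets[-1] + len(chunks[-1]))
    return offsets, b"".join(chunks)


def get_string(offsets, data, i):
    """Slot i spans data[offsets[i]:offsets[i + 1]]."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


validity, data = build_int64_column([1, None, 3, 4])
offsets, chars = build_string_column(["arrow", "", "parquet"])
```

Keeping each column in a few flat buffers is what allows both cheap random access (index arithmetic) and cache-friendly scans over the same memory.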
  11. Arrow Implementations and Bindings
      Upcoming: Rust (native), R (binding), Julia (native)
  12. Example use: Ray ML framework from Berkeley RISELab
      Source:
      • Shared memory-based object store
      • Zero-copy tensor reads using Arrow libraries
  13. Some Industry Contributors in Apache Arrow ClearCode
  14. Arrow Project Growth
      • 138 Contributors on GitHub
      • > 1900 Resolved JIRAs
      • > 100K binary package downloads per month
      [Chart: JIRA Burndown since Project Inception]
  15. Current Project Status
      • 0.9.0 Release: March 21, 2018
      • Some focus areas
        • Columnar format stability / forward compatibility
        • Streaming messaging / RPC procedure
        • Language implementations / interop
        • Data access (e.g. Parquet input/output, ORC)
        • Downstream integrations (Apache Spark, Python/pandas, …)
  16. Upcoming Roadmap
      • Software development lifecycle improvements
      • Data ingest / access / export
      • Computational libraries (CPU + GPU)
      • Expanded language support
      • Richer RPC / messaging
      • More system integrations
  17. The current data science stack’s computational foundation is severely dated, rooted in 1980s / 1990s FORTRAN-style semantics
      • Single-core / single-threaded algorithms
      • Naïve execution model, eager evaluation
      • Primitive memory management, expensive data access
      • Fragmented language ecosystems, “Proprietary” memory models
      • …
  18. Data scientists working with “small” data have not experienced great pain
      • Small Data (< ~10GB)
      • Medium Data (~10 - ~100GB)
      • Big Data (> ~100GB-1TB)
      [Diagram annotations: “Users doing fine here”; “Current Python/R stack begins to ‘fail’ around this point”]
  19. We can do so much better through modern systems techniques
      • Multi-core algorithms, GPU acceleration, Code generation (LLVM)
      • Lazy evaluation, “query” optimization
      • Sophisticated memory management, Efficient access to huge data sets
      • Interoperable memory models, zero-copy interchange between system components
      Note 1: Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges
      Note 2: Most of these methods have already been widely employed in analytic databases. Limited “novel” research needed
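The contrast between eager evaluation (slide 17) and lazy evaluation with "query" optimization (slide 19) can be sketched in a few lines. The `LazyColumn` class below is hypothetical and not part of Arrow or any library: it records `map` and `filter` steps instead of running them, then executes all of them in a single fused pass when `collect` is called. Fusing steps is only one simple kind of optimization that deferring execution makes possible.

```python
class LazyColumn:
    """Records operations instead of running them (hypothetical, for illustration)."""

    def __init__(self, values, steps=()):
        self._values = values
        self._steps = steps

    def map(self, fn):
        # Nothing is computed here; the step is only appended to the plan.
        return LazyColumn(self._values, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyColumn(self._values, self._steps + (("filter", pred),))

    def collect(self):
        """Run all recorded steps in one pass, with no intermediate lists."""
        out = []
        for v in self._values:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    v = fn(v)
                elif not fn(v):
                    keep = False
                    break
            if keep:
                out.append(v)
        return out


plan = LazyColumn([1, 2, 3, 4, 5, 6]).map(lambda x: x * 10).filter(lambda x: x > 20)
result = plan.collect()   # work happens only here
```

An eager equivalent would materialize a full intermediate list after the `map` step; the lazy plan touches each input value once.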
  20. Computational libraries
      • “Kernel functions” performing vectorized analytics on Arrow memory format
      • Select CPU or GPU variant based on data location
      • Operator graphs (compose multiple operators)
      • Subgraph compiler (using LLVM)
      • Runtime engine: execute operator graphs
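The pieces named on slide 20 (kernel functions, CPU or GPU variants selected by data location, operator graphs, and a runtime engine) fit together roughly as in the sketch below. Everything here is a hypothetical stand-in written for illustration: `Column`, `Node`, `KERNELS`, and `execute` are not Arrow APIs, the "gpu" kernel is an ordinary Python function, and the LLVM subgraph compiler is not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class Column:
    values: list
    device: str = "cpu"   # hypothetical tag for where the data lives


# Registry of kernels keyed by (operation name, device).
KERNELS = {}


def kernel(name, device):
    """Decorator that registers a kernel variant for one device."""
    def register(fn):
        KERNELS[(name, device)] = fn
        return fn
    return register


@kernel("add", "cpu")
def add_cpu(a, b):
    return Column([x + y for x, y in zip(a.values, b.values)], "cpu")


@kernel("add", "gpu")
def add_gpu(a, b):
    # Stand-in only: a real variant would launch device code.
    return Column([x + y for x, y in zip(a.values, b.values)], "gpu")


@kernel("multiply", "cpu")
def multiply_cpu(a, b):
    return Column([x * y for x, y in zip(a.values, b.values)], "cpu")


@dataclass
class Node:
    op: str
    inputs: tuple   # each input is a Node or a Column


def execute(node):
    """Runtime engine: evaluate inputs depth-first, then dispatch on data location."""
    if isinstance(node, Column):
        return node
    args = [execute(i) for i in node.inputs]
    return KERNELS[(node.op, args[0].device)](*args)


a = Column([1, 2, 3])
b = Column([10, 20, 30])
graph = Node("multiply", (Node("add", (a, b)), b))   # (a + b) * b
result = execute(graph)
```

Separating the graph description from kernel selection is the design point: the same graph can run against whichever kernel variant matches where the data currently lives.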
  21. Data Access / Ingest
      • Apache Avro
      • Apache Parquet nested data support
      • Apache ORC
      • CSV
      • JSON
      • ODBC / JDBC
      • … and likely other data access points
  22. Arrow-powered Data Science Systems
      • Portable runtime libraries, usable from multiple programming languages
      • Decoupled front ends
      • Companion to distributed systems like Dask, Ray
  23. Getting involved
      • Join
      • PRs to
      • Learn more about the Ursa Labs vision for Arrow-powered data science: