3. Wes McKinney
• Director of Ursa Labs, a not-for-profit development group working on Apache Arrow
• Created the Python pandas project (~2008); lead developer/maintainer until 2013
• PMC member for Apache Arrow and Apache Parquet; ASF Member
• Wrote Python for Data Analysis (1e 2012, 2e 2017)
• Formerly: Two Sigma, Cloudera, DataPad, AQR
5. Aside: Increasing dialogue amongst OSS devs
• Many important discussions either do not happen or
only happen in face-to-face meetings
• New Discourse forum: discuss.ossdata.org
• Topics: design & architecture, developer tools,
collaboration opportunities, ...
7. Overview
• Review Arrow Mission Statement
• “Shallow dive” into columnar format + protocol
• Some success stories
• Update on latest projects + initiatives for 2020 and
beyond
8. Apache Arrow
● Open source community project launched in 2016
● Intersection of database systems, big data, and data
science tools
● Purpose: Language-independent open standards and
libraries to accelerate and simplify in-memory computing
● https://github.com/apache/arrow
9. Personal motivations
● Interoperability problems with other data processing
systems
● Awareness of fundamental computational problems in
pandas or R data frames
○ Limited data types
○ Memory use problems
○ Slow processing efficiency
○ Difficulty with larger-than-memory datasets
10. Apache Arrow Big Picture
● Language-agnostic in-memory columnar format for
analytical query engines, data frames
● Binary protocol for IPC / RPC
● “Batteries included” development platform for building
data processing applications
13. 2020 Development Status
● 16 major releases
● Over 400 unique contributors
● Over 50M package installs in
2019
● ASF roster: 50 committers, 28
PMC members
● 11 programming languages
represented
14. Relationship with data science libraries
● Retrofit existing packages with faster IO or faster
processing code
● Superior computational foundation for new projects
● Aside: many “scalable data frame” projects have
limited investments in improving single-node
processing efficiency
16. Arrow’s Columnar Memory Format
• Runtime memory format for analytical query processing
• Ideal companion to columnar storage like Apache Parquet
• “Fully shredded” columnar, supports flat and nested schemas
• Organized for cache-efficient access on CPUs/GPUs
• Optimized for data locality, SIMD, parallel processing
• Accommodates both random access and scan workloads
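The layout described above can be illustrated with a minimal sketch of a nullable int32 column: one validity bitmap plus one contiguous values buffer. This is a simplified illustration using only the Python standard library, not the Arrow library itself; the function names are hypothetical, and alignment and padding are ignored.

```python
import struct

def build_int32_column(values):
    """Encode a list of ints/None as (validity bitmap, values buffer, null count)."""
    n = len(values)
    bitmap = bytearray((n + 7) // 8)   # one bit per slot, 1 = valid
    data = bytearray(4 * n)            # fixed-width little-endian int32 slots
    null_count = 0
    for i, v in enumerate(values):
        if v is None:
            null_count += 1            # slot stays zeroed, bit stays 0
        else:
            bitmap[i // 8] |= 1 << (i % 8)   # least-significant-bit numbering
            struct.pack_into("<i", data, 4 * i, v)
    return bytes(bitmap), bytes(data), null_count

def read_value(bitmap, data, i):
    """Random access: constant-time lookup of slot i."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, 4 * i)[0]

def column_sum(bitmap, data):
    """Scan access: walk the contiguous values buffer, skipping nulls."""
    total = 0
    for i, (v,) in enumerate(struct.iter_unpack("<i", data)):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            total += v
    return total

bitmap, data, null_count = build_int32_column([1, None, 3, 4])
```

Because every value sits at a fixed offset in one buffer, the same bytes serve both the random-access and the scan code paths.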
17. Arrow Binary Protocol
• Record batch: ordered collection of named arrays
• Streaming wire format for transferring datasets between address
spaces
• Intended for both IPC / shared memory and RPC use cases
[Diagram: stream from sender to receiver, in order: SCHEMA, DICTIONARY, DICTIONARY, RECORD BATCH, RECORD BATCH, ...]
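The message ordering in the stream (schema first, then dictionaries, then record batches) can be sketched with plain Python generators. This is a toy model under stated assumptions: messages are Python tuples rather than binary, a single dictionary is sent for one string column, and `write_stream` / `read_stream` are hypothetical names, not Arrow APIs.

```python
def write_stream(field_name, batches):
    """Sender: yield (kind, payload) messages in protocol order."""
    dictionary, index_of, encoded = [], {}, []
    for batch in batches:
        indices = []
        for value in batch:
            if value not in index_of:          # first time we see this value
                index_of[value] = len(dictionary)
                dictionary.append(value)
            indices.append(index_of[value])
        encoded.append(indices)
    yield ("SCHEMA", {field_name: "dictionary<values=string, indices=int32>"})
    yield ("DICTIONARY", dictionary)
    for indices in encoded:
        yield ("RECORD BATCH", {field_name: indices})

def read_stream(messages):
    """Receiver: rebuild each batch from indices plus the dictionary."""
    messages = iter(messages)
    kind, _schema = next(messages)
    if kind != "SCHEMA":
        raise ValueError("stream must start with a schema message")
    dictionary = None
    for kind, payload in messages:
        if kind == "DICTIONARY":
            dictionary = payload
        elif kind == "RECORD BATCH":
            yield {name: [dictionary[i] for i in idx]
                   for name, idx in payload.items()}

messages = list(write_stream("city", [["NYC", "SF", "NYC"], ["SF", "LA"]]))
decoded = list(read_stream(messages))
```

The receiver can decode each record batch as soon as it arrives, because everything a batch depends on was sent earlier in the stream.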
18. Encapsulated protocol (“IPC”) messages
• Serialization wire format suitable for stream-based parsing
[Diagram: message layout: metadata (metadata size or end-of-stream marker, “Message” Flatbuffer (see format/Message.fbs), padding) followed by body]
• Metadata contains memory addresses within body to reconstruct data structures
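The framing idea (a size prefix, metadata, padding, then a body) can be sketched as follows. This is a deliberately simplified stand-in, not the actual Arrow wire format: JSON replaces the Flatbuffer metadata, a size of 0 is assumed as the end-of-stream marker, body alignment is not enforced, and all function names are hypothetical.

```python
import io
import json
import struct

def pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7

def write_message(out, metadata, body):
    meta = dict(metadata, body_length=len(body))
    raw = json.dumps(meta).encode("utf-8")          # stand-in for the Flatbuffer
    # Pad so that the 4-byte prefix plus metadata ends on an 8-byte boundary.
    padded = raw.ljust(pad8(4 + len(raw)) - 4, b"\x00")
    out.write(struct.pack("<i", len(padded)))       # metadata size prefix
    out.write(padded)
    out.write(body)

def write_end_of_stream(out):
    out.write(struct.pack("<i", 0))                 # assumed end-of-stream marker

def read_messages(stream):
    """Parse messages one at a time without knowing the total stream length."""
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return
        (size,) = struct.unpack("<i", prefix)
        if size == 0:
            return
        meta = json.loads(stream.read(size).rstrip(b"\x00"))
        body = stream.read(meta["body_length"])
        yield meta, body

buf = io.BytesIO()
write_message(buf, {"type": "schema"}, b"")
write_message(buf, {"type": "record_batch"}, b"\x01\x02\x03\x04\x05\x06\x07\x08")
write_end_of_stream(buf)
buf.seek(0)
received = list(read_messages(buf))
```

The reader only ever needs the next size prefix to know how many bytes to consume, which is what makes the format suitable for stream-based parsing.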
19. Record Batch serialization
• IPC message body contains buffers
concatenated end-to-end
• Serialized metadata records memory
offset and size of each buffer, for
later pointer arithmetic
schema {
  a: int32,
  b: list<item: binary>
}
[Diagram: BODY laid out end-to-end as a: buffer 0 | a: buffer 1 | b: buffer 0 | b: buffer 1 | b.item: buffer 0 | b.item: buffer 1 | b.item: buffer 2]
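A minimal sketch of this layout for the schema above: seven buffers are concatenated into one body while the offset and size of each is recorded, and the receiver slices them back out without copying. The 8-byte padding between buffers, the helper names, and the interpretation of each buffer (validity bitmap, offsets, values) are assumptions made for illustration.

```python
import struct

def pad8(n):
    return (n + 7) & ~7

def serialize_body(buffers):
    """Concatenate buffers end-to-end; return (body, [(offset, size), ...])."""
    body = bytearray()
    layout = []
    for buf in buffers:
        layout.append((len(body), len(buf)))
        body += buf
        body += b"\x00" * (pad8(len(body)) - len(body))  # align the next buffer
    return bytes(body), layout

def slice_buffers(body, layout):
    """Rebuild buffer views by pointer arithmetic; memoryview slices do not copy."""
    view = memoryview(body)
    return [view[offset:offset + size] for offset, size in layout]

# Three rows: a = [1, 2, 3], b = [[b"x", b"yz"], [], [b"w"]]
buffers = [
    b"\x07",                          # a: buffer 0 (validity bitmap)
    struct.pack("<3i", 1, 2, 3),      # a: buffer 1 (int32 values)
    b"\x07",                          # b: buffer 0 (validity bitmap)
    struct.pack("<4i", 0, 2, 2, 3),   # b: buffer 1 (list offsets)
    b"\x07",                          # b.item: buffer 0 (validity bitmap)
    struct.pack("<4i", 0, 1, 3, 4),   # b.item: buffer 1 (binary offsets)
    b"xyzw",                          # b.item: buffer 2 (binary data)
]
body, layout = serialize_body(buffers)
views = slice_buffers(body, layout)
```

Only `layout` needs to travel in the metadata; the receiver never parses the body itself.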
20. Value type metadata
• Reasonably comprehensive set of built-in value types
• Application-defined logical types can be defined and transmitted
using special custom_metadata fields in the Schema
• Extension data stored using a built-in type
• Examples
• UUID stored as FixedSizeBinary<16>
• LatitudeLongitude stored as struct<x: double, y: double>
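The UUID example can be sketched as a fixed-width 16-byte storage buffer plus a field description whose custom metadata names the logical type. The field is a plain Python dict here, not an Arrow object; `ARROW:extension:name` is the metadata key Arrow uses for extension types, while the type name `example.uuid` and the helper functions are hypothetical.

```python
import uuid

# Hypothetical field description: built-in storage type plus custom metadata.
uuid_field = {
    "name": "id",
    "storage_type": "FixedSizeBinary<16>",
    "custom_metadata": {"ARROW:extension:name": "example.uuid"},
}

def encode_uuids(values):
    """Pack UUIDs into one contiguous buffer, 16 bytes per value."""
    return b"".join(u.bytes for u in values)

def decode_uuids(storage, byte_width=16):
    """Reinterpret the fixed-width storage as UUID objects."""
    return [uuid.UUID(bytes=storage[i:i + byte_width])
            for i in range(0, len(storage), byte_width)]

ids = [uuid.UUID(int=1), uuid.UUID(int=2)]
storage = encode_uuids(ids)
round_tripped = decode_uuids(storage)
```

A receiver that does not recognize the extension name can still read the column as plain 16-byte binary values.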
21. Columnar Format Future Directions
• In-memory encoding, compression, sparseness
• e.g. run-length encoding
• See the mailing list discussions; we need your
feedback!
• Expansion of logical types
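For readers unfamiliar with run-length encoding, mentioned above as one possible in-memory encoding, here is a minimal generic sketch. It shows the technique only; it is not a proposed or actual Arrow layout, and the function names are hypothetical.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into parallel (run values, run lengths) lists."""
    run_values, run_lengths = [], []
    for value, run in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in run))
    return run_values, run_lengths

def rle_decode(run_values, run_lengths):
    """Expand the runs back into the original sequence."""
    out = []
    for value, length in zip(run_values, run_lengths):
        out.extend([value] * length)
    return out

column = ["a", "a", "a", "b", "c", "c"]
run_values, run_lengths = rle_encode(column)
```

Keeping run values and run lengths in two separate arrays mirrors the columnar approach: each part is itself a flat, contiguous sequence.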
28. Arrow C++ development platform
[Diagram: platform components; in the original slide, red marks planned / under-construction work]
• Allocators and Buffers
• Columnar Data Structures and Builders
• File Format Interfaces: PARQUET, CSV, JSON, ORC, AVRO
• Binary IPC Protocol
• Gandiva: LLVM Expr Compiler
• Compute Kernels
• IO / Filesystem Platform: localfs, AWS S3, HDFS, mmap, GCP, Azure
• Plasma: Shared Mem Object Store
• Multithreading Runtime
• Datasets Framework
• Data Frame Interface
• Embeddable Query Engine
• Compressor Interfaces
• CUDA Interop
• Flight RPC
• … and much more
29. Arrow C++ Platform
[Diagram: platform layers: Multi-core Work Scheduler; Core Data Platform; Query Processing; Datasets Framework; Arrow Flight RPC; Network; Storage]
30. Example: use in R libraries
    flights %>%
      group_by(year, month, day) %>%
      select(arr_delay, dep_delay) %>%
      summarise(
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
      filter(arr > 30 | dep > 30)

• flights can be a massive Arrow dataset
• dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
• R expressions can be JIT-compiled with LLVM
32. Arrow Flight Overview
• A gRPC-based framework for defining custom data
services that send and receive Arrow columnar data
natively
• Uses Protocol Buffers v3 for client protocol
• Pluggable command execution layer, authentication
• Low-level gRPC optimizations to avoid unnecessary
serialization
33. Arrow Flight - Parallel Get
[Diagram: Client sends GetFlightInfo to the Planner and receives FlightInfo; Client then sends DoGet to each of several Data Nodes, and each returns FlightData]
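The parallel-get flow can be sketched as an in-process simulation: ask a planner where the data lives, then fetch from every data node concurrently. Nothing here uses gRPC or the Flight API; the functions only borrow the RPC method names from the slide, and the node names, tickets, and batch contents are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data nodes: location -> ticket -> list of record batches.
DATA_NODES = {
    "node-a": {"ticket-0": [[1, 2], [3]]},
    "node-b": {"ticket-1": [[4, 5, 6]]},
}

def get_flight_info(descriptor):
    """Planner stand-in: return the endpoints (location, ticket) for a dataset."""
    return [("node-a", "ticket-0"), ("node-b", "ticket-1")]

def do_get(location, ticket):
    """Data node stand-in: stream the batches belonging to one ticket."""
    yield from DATA_NODES[location][ticket]

def parallel_get(descriptor):
    endpoints = get_flight_info(descriptor)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # One worker per endpoint; map() preserves endpoint order.
        streams = pool.map(lambda ep: list(do_get(*ep)), endpoints)
        return [batch for stream in streams for batch in stream]

batches = parallel_get("my_dataset")
```

The planner never touches the data itself, so throughput scales with the number of data nodes the client reads from.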
34. Arrow Flight - Efficient gRPC transport
[Diagram: Client sends DoGet to a Data Node, which streams back a sequence of FlightData messages, each carrying a Row Batch]
• Data is transported in a Protocol Buffer, but reads can be made zero-copy by writing a custom gRPC “deserializer”