© 2017 Dremio Corporation @DremioHQ
Apache Arrow: In Theory, In Practice
Apache Arrow Meetup @ Enigma
November 1, 2017
Jacques Nadeau
Who?
Jacques Nadeau
@intjesus
• CTO & Co-founder of Dremio
• Apache member
• VP Apache Arrow
• PMCs: Arrow, Calcite, Incubator, Heron (incubating)
Arrow In Theory
The Apache Arrow Project
• Started Feb 17, 2016 (Apache TLP)
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
Committers & Contributors from: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
Arrow goals
• Well-documented and cross language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
• Interoperable
Arrow In Memory Columnar Format
• Shredded Nested Data Structures
• Randomly Accessible
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
High Performance Sharing & Interchange
Before
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
Common Processing Libraries (soon)
• High Performance Canonical processing for Arrow Data Structures
– Sort
– Hash Table
– Dictionary encoding
– Predicate application & masking
• Multiple Media and Processing Paradigms
– Memory, NVMe, 3D XPoint
– x86, GPU, Many Core (Phi), etc.
Arrow Data Types
• Scalars
– Boolean
– [u]int[8,16,32,64], Decimal, Float, Double
– Date, Time, Timestamp
– UTF8 String, Binary
• Complex
– Struct, Map, List
• Advanced
– Union (sparse & dense)
Common Message Pattern
• Schema Negotiation
– Logical Description of structure
– Identification of dictionary encoded Nodes
• Dictionary Batch
– Dictionary ID, Values
• Record Batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
[Diagram: Schema Negotiation, then Dictionary Batch, then a sequence of Record Batches; batch-count labels on the slide: "1..N Batches" and "0..N Batches"]
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
Record Batch Construction
[Diagram: the message stream (Schema Negotiation, Dictionary Batch, Record Batches) with one record batch expanded into its buffers for the record { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }]
• data header (describes offsets into data)
• name: bitmap, offset, data
• age: bitmap, data
• phones: bitmap, list offset, offset, data
Each box (vector) is contiguous memory
The entire record batch is contiguous on wire
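The buffer breakdown above can be sketched with plain Java arrays standing in for Arrow's contiguous buffers. This is a minimal illustration, not the Arrow Java API: the class `ShredSketch` and its helpers are hypothetical names, and validity bitmaps are left out because the two sample records contain no nulls.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: plain Java arrays stand in for Arrow's contiguous buffers. */
final class ShredSketch {

    /** A variable-width column: value i occupies data[offsets[i] .. offsets[i + 1]). */
    static final class VarWidth {
        final int[] offsets;
        final byte[] data;

        VarWidth(int[] offsets, byte[] data) {
            this.offsets = offsets;
            this.data = data;
        }

        String get(int i) {
            return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
        }
    }

    /** Packs strings back to back and records where each one starts and ends. */
    static VarWidth encode(List<String> values) {
        int[] offsets = new int[values.size() + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < values.size(); i++) {
            byte[] bytes = values.get(i).getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            offsets[i + 1] = out.size();
        }
        return new VarWidth(offsets, out.toByteArray());
    }

    /** Reads one record's list by walking the list offsets into the child column. */
    static List<String> listAt(int[] listOffsets, VarWidth child, int row) {
        List<String> result = new ArrayList<>();
        for (int i = listOffsets[row]; i < listOffsets[row + 1]; i++) {
            result.add(child.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // name (offset) + name (data)
        VarWidth names = encode(Arrays.asList("Joe", "Jack"));
        // age (data)
        int[] ages = {18, 37};
        // phones (list offset): record 0 owns child values [0, 2), record 1 owns [2, 3)
        int[] phoneListOffsets = {0, 2, 3};
        // phones (offset) + phones (data): all phone strings of all records, back to back
        VarWidth phones = encode(Arrays.asList("555-111-1111", "555-222-2222", "555-333-3333"));

        System.out.println(Arrays.toString(names.offsets));
        System.out.println(names.get(1) + " " + ages[1] + " " + listAt(phoneListOffsets, phones, 1));
    }
}
```

Each array here corresponds to one box in the slide; reading a record never touches another column's memory.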
Arrow Components
• Core Libraries
• Within Project Integrations
• Extended Integrations
Arrow: Core Components
• Java Library
• C++ Library
• C Library
• Ruby Library
• Python Library
• JavaScript Library
In-Project Arrow Building Blocks/Applications
• Plasma:
– Shared memory caching layer, originally created in Ray
• Feather:
– Fast ephemeral format for movement of data between R/Python
• ArrowRest (soon):
– RPC/IPC interchange library (active development)
• ArrowRoutines (soon):
– Common data manipulation components
Arrow Integrations
• Pandas
– Move seamlessly to and from Arrow as a means for communication, serialization, fast processing
• GOAI (GPU Open Analytics Initiative), libgdf and the GPU dataframe
– Leverages Arrow as internal representation
• Parquet
– Read and write Parquet quickly to/from Arrow. C++ library builds directly on Arrow.
• Spark
– Supports conversion to Pandas via Arrow construction using Arrow Java Library
• Dremio
– OSS project, Sabot Engine executes entirely on Arrow memory
Arrow In Practice
Real World Arrow: Sabot
• Dremio is an OSS data fabric product
• The core engine is “Sabot”
– Built entirely on top of Arrow libraries, runs in JVM
Sabot: Arrow in Practice
• Memory Management
• Vector sizing
• RPC Communication
• Filtering/Sorting
• Row-wise Algorithms: Hash Tables
• Vector-wise Algorithms
– Aggregation
– Unnesting
Practice: Memory Management
• Arrow includes a chunk-based managed allocator
– Built on top of Netty’s JEMalloc implementation
• Create a tree of allocators
– Support both reservation and local limits
– Include leak detection, debug ownership logs and location accounting
• Size allocators (reservation and maximum) based on workload management, when to trigger spilling, etc.
• All Arrow Vectors hold one or more off-heap buffers
• Everything is manually reference managed
– Makes some code more complex
– Provides strong memory availability understanding
[Diagram: allocator tree (res = reservation, max = limit)]
Root (res: 0, max: 20g)
  Job 1 (res: 10m, max: 1g)
    Task 1 (res: 1m, max: -1)
    Task 2 (res: 5m, max: 20m)
  Job 2 (res: 10m, max: 1g)
    Task 1 (res: 1m, max: -1)
    Task 2 (res: 5m, max: 20m)
IntVector: Validity buffer, Data buffer
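A tree of allocators with reservations and local limits can be sketched as pure byte accounting. This is a hypothetical illustration, not Arrow's `BufferAllocator`: the class `AllocatorSketch`, its methods, and the reading of `max: -1` as "no local limit" are all assumptions made for the example. A child claims its reservation from the parent when it is created, and only usage beyond that reservation is charged further up the tree.

```java
/** Hypothetical accounting-only allocator tree; it is not Arrow's BufferAllocator. */
final class AllocatorSketch {
    static final long MB = 1L << 20;
    static final long GB = 1L << 30;

    final String name;
    final AllocatorSketch parent;
    final long reservation; // bytes claimed from the parent up front
    final long limit;       // local maximum; -1 is treated as "no local limit" in this sketch
    long used;              // bytes currently accounted to this allocator and its children

    AllocatorSketch(String name, AllocatorSketch parent, long reservation, long limit) {
        this.name = name;
        this.parent = parent;
        this.reservation = reservation;
        this.limit = limit;
        if (parent != null && !parent.allocate(reservation)) {
            throw new IllegalStateException("cannot reserve " + reservation + " bytes for " + name);
        }
    }

    AllocatorSketch child(String childName, long childReservation, long childLimit) {
        return new AllocatorSketch(childName, this, childReservation, childLimit);
    }

    /** Bytes of the given usage that are not covered by this allocator's own reservation. */
    private long overReservation(long usage) {
        return Math.max(0, usage - reservation);
    }

    boolean allocate(long bytes) {
        if (limit >= 0 && used + bytes > limit) {
            return false; // local limit hit
        }
        long fromParent = overReservation(used + bytes) - overReservation(used);
        if (parent != null && fromParent > 0 && !parent.allocate(fromParent)) {
            return false; // an ancestor is out of budget
        }
        used += bytes;
        return true;
    }

    void release(long bytes) {
        long toParent = overReservation(used) - overReservation(used - bytes);
        used -= bytes;
        if (parent != null && toParent > 0) {
            parent.release(toParent);
        }
    }

    public static void main(String[] args) {
        // Numbers taken from the slide's tree.
        AllocatorSketch root = new AllocatorSketch("root", null, 0, 20 * GB);
        AllocatorSketch job1 = root.child("job1", 10 * MB, 1 * GB);
        AllocatorSketch task2 = job1.child("task2", 5 * MB, 20 * MB);

        System.out.println(task2.allocate(16 * MB)); // fits under task2's 20m limit
        System.out.println(task2.allocate(5 * MB));  // would exceed the 20m limit
        System.out.println(root.used / MB);
        task2.release(16 * MB);
        System.out.println(root.used / MB);
    }
}
```

Because a failed request is rejected before any counter changes, an operator can treat a `false` return as the signal to spill rather than as a fatal error.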
Practice: Memory Management Cont’d
• Data moves through data pipelines
• Ownership needs to be clear (to plan/control execution)
– Allocated memory can be referenced by many consumers
– One allocator ‘owns’ the accounted memory
– Consumers can use Vector’s transfer capability to leverage transfer semantics and hand off data ownership
https://goo.gl/HN9nCH
[Diagram: Scan (res: 10m, max: 1g) transfers ownership to Aggregate (res: 10m, max: 1g)]
Practice: Vector Sizing
• Batches are the smallest work unit
• Batches of records can be 1..64k records in size.
• Optimization Problem
– Larger batches improve processing performance
– Larger batches cause pipeline problems
– Smaller batches cause more heap overhead
• Execution-Level Adaptive Resizing for wide records (100s-1000s of fields)
[Diagram: Narrow Batch: 4095 records; Wide Batch: 127 records]
Practice: RPC Communication
• Goals
– Leverage Gathering Writes
– Ensure connection resilience despite memory pressure
• Custom Netty-based RPC protocol
– All messages include a structured (proto) message and a sidecar memory message
– Out of memory is handled at message consumption time, ensuring a fail-ack as opposed to a connection disconnect
Send:
Listener listener
Proto structuredMessage
ArrowBuf... dataBodies
https://goo.gl/XWyrc1
[Diagram labels: Structured message; Gathering write]
Filtering & Sorting
• For filtering and sorting, create a selection vector
– Describes valid values and ordering without reorganizing underlying data.
– Two bytes for filter purposes (single batch horizon)
– Four bytes for sort purposes (multi-batch horizon)
• 4-Byte selection vector pattern frequently used by other operations
• 6-Byte selection vector used in some cases (to manage wide batches)
• Defer copy/compacting
[Diagram: sv2 entries: 2, 14, 35, 99; sv4 entries: 1-2, 2-14, 1-35, 2-99]
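The two selection vector widths can be sketched on plain arrays. This is an illustration under assumptions, not Sabot's code: `SelectionVectorSketch` is a hypothetical class, a Java `char` stands in for a two-byte sv2 entry, and the sv4 packing (batch index in the upper two bytes, record index in the lower two) is assumed from the "batch-record" pairs shown on the slide. In both cases the underlying values are never moved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of 2-byte and 4-byte selection vectors over plain int columns. */
final class SelectionVectorSketch {

    /** sv2: two-byte record indexes of the rows in one batch that pass the filter. */
    static char[] filterGreaterThan(int[] values, int threshold) {
        char[] sv2 = new char[values.length];
        int n = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > threshold) {
                sv2[n++] = (char) i; // record index only; the value stays where it is
            }
        }
        return Arrays.copyOf(sv2, n);
    }

    /** sv4: entries packed as (batch << 16 | record), ordered by value across all batches. */
    static int[] sortAcrossBatches(int[][] batches) {
        List<Integer> entries = new ArrayList<>();
        for (int b = 0; b < batches.length; b++) {
            for (int r = 0; r < batches[b].length; r++) {
                entries.add((b << 16) | r);
            }
        }
        entries.sort(Comparator.comparingInt(e -> batches[e >>> 16][e & 0xFFFF]));
        int[] sv4 = new int[entries.size()];
        for (int i = 0; i < sv4.length; i++) {
            sv4[i] = entries.get(i);
        }
        return sv4;
    }

    public static void main(String[] args) {
        char[] sv2 = filterGreaterThan(new int[] {18, 37, 25, 40}, 30);
        for (char index : sv2) {
            System.out.println("passes filter: record " + (int) index);
        }
        int[] sv4 = sortAcrossBatches(new int[][] {{30, 10}, {20, 40}});
        for (int entry : sv4) {
            System.out.println("batch " + (entry >>> 16) + ", record " + (entry & 0xFFFF));
        }
    }
}
```

A downstream operator reads values through these indexes, so the copy or compaction step can be deferred until it is actually needed.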
Row-wise Algorithms: Hash Table + Aggregation
For generating a hash table, maintaining a columnar structure for keys slows hashing insertion and lookup
• Break data into fixed and variable values
• Use consistent fixed value insertion
• Use dynamic variable output
• Pivot data
– Vector at a time for fixed values
– All variable at same time for variable vectors
• Hash and equality as bucket of bytes
• Avoids excessive indirection
• Maintain Aggregation tables in columnar format
[Diagram: keys pivoted into row-wise blocks next to columnar aggregation tables]
Fixed Block Vector (one row per key): validity|fixed1|fixed2|varlen|varoffset
Variable Block Vector (packed): len|data|len|data|...
Aggregation Tables: Partial-agg1 ... Partial-agg6
Arrow labels: pivot fixed, pivot variable, unpivot (x2), direct projection
Example Pivot Code
• Takes advantage of runs of nullable values, working a word at a time
– ALL_SET, NONE_SET, SOME_SET
• Ensure canonicalization of values based on validity
– Typically validity data is zeroed on allocation, other vectors are not.
– Vector data has to be cleared when pivoting nulled values
• Conditions are avoided
static void pivot8Bytes(
    VectorPivotDef def,
    FixedBlockVector fixedBlock,
    final int count
){
  ...
  // decode word at a time.
  while (srcDataAddr < finalWordAddr) {
    final long bitValues = PlatformDependent.getLong(srcBitsAddr);
    if (bitValues == NONE_SET) {
      // noop (all nulls).
      bitTargetAddr += (WORD_BITS * blockLength);
      valueTargetAddr += (WORD_BITS * blockLength);
      srcDataAddr += (WORD_BITS * EIGHT_BYTE);
    } else if (bitValues == ALL_SET) {
      // all set, set the bit values using a constant OR. Independently set the data values without transformation.
      final int bitVal = 1 << bitOffset;
      for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength) {
        PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | bitVal);
      }
      for (int i = 0; i < WORD_BITS; i++, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
        PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr));
      }
    } else {
      // some nulls, some not, update each value to zero or the value, depending on the null bit.
      for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength, valueTargetAddr += blockLength, srcDataAddr += E
        final int bitVal = ((int) (bitValues >>> i)) & 1;
        PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | (bitVal << bitOffset));
        // multiplying by the 0/1 validity bit zeroes null values without a branch
        PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr) * bitVal);
      }
    }
    srcBitsAddr += WORD_BYTES;
  }
https://goo.gl/EgLy9r
Practice: Parallel Columnar Shuffle
• Partition data based on a hashed key
• Avoid excessive batch buffering cost
• Steps
1. Consolidate node-local streams
• Allow reduction in buffering memory in large clusters (k*n instead of n*n)
2. Hash the key(s) to determine bucket offset
• Generate bucket vector
3. Pre-allocate output buffers at target output size
• Sized depending on narrow/wide batches
4. Do columnar copies per vector
• Written in C-like low overhead pattern with no abstraction
[Diagram: Node 1 and Node 2, each with Thread 1 and Thread 2; stages: generate bucket vector, do bucket-level copies, mux'd gathering write]
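Steps 2 to 4 can be sketched on plain arrays: build a bucket vector once from the key column, then reuse it to copy every other column. This is a minimal illustration, not Sabot's implementation; `ShuffleSketch` is a hypothetical class, the hash is simply `Long.hashCode`, and output arrays are sized by exact per-bucket counts, whereas the slide describes pre-allocating at a target output size.

```java
import java.util.Arrays;

/** Hypothetical sketch of a columnar hash partition over plain long columns. */
final class ShuffleSketch {

    /** Step 2: hash each key once and record the bucket it belongs to. */
    static int[] bucketVector(long[] keys, int numBuckets) {
        int[] buckets = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            buckets[i] = Math.floorMod(Long.hashCode(keys[i]), numBuckets);
        }
        return buckets;
    }

    /** Steps 3-4: size one output array per bucket, then copy the column in a single tight pass. */
    static long[][] partitionColumn(long[] column, int[] buckets, int numBuckets) {
        int[] counts = new int[numBuckets];
        for (int b : buckets) {
            counts[b]++;
        }
        long[][] out = new long[numBuckets][];
        for (int b = 0; b < numBuckets; b++) {
            out[b] = new long[counts[b]];
        }
        int[] cursor = new int[numBuckets];
        for (int i = 0; i < column.length; i++) {
            int b = buckets[i];
            out[b][cursor[b]++] = column[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] keys = {0, 1, 2, 3, 4, 5};
        long[] amounts = {10, 11, 12, 13, 14, 15};
        int[] buckets = bucketVector(keys, 2);
        // The same bucket vector drives the copy of every column in the batch.
        System.out.println(Arrays.deepToString(partitionColumn(keys, buckets, 2)));
        System.out.println(Arrays.deepToString(partitionColumn(amounts, buckets, 2)));
    }
}
```

Hashing happens once per record, while the per-column copy loop touches only one input array and the bucket vector, which keeps it cache friendly.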
Example Copier Code
• Two byte offset addresses (sv2)
• Tight loop focused on
• Far more efficient than runtime-generated row-wise code
– Also has faster startup time
public void copy(long offsetAddr, int count) {
  final List<ArrowBuf> sourceBuffers = source.getFieldBuffers();
  targetAlt.allocateNew(count);
  final List<ArrowBuf> targetBuffers = target.getFieldBuffers();
  // end address of the selection vector entries to process
  final long max = offsetAddr + count * STEP_SIZE;
  final long srcAddr = sourceBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
  long dstAddr = targetBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
  // read each two-byte index (the char cast keeps it unsigned) and copy the value it points at
  for (long addr = offsetAddr; addr < max; addr += STEP_SIZE, dstAddr += SIZE) {
    PlatformDependent.putLong(dstAddr,
        PlatformDependent.getLong(srcAddr + ((char) PlatformDependent.getShort(addr)) * SIZE));
  }
}
https://goo.gl/fZEsfy
Unnesting List Vectors
• Common Pattern: a list of objects that needs to be unrolled into separate records.
• Arrow’s representation allows a direct unroll (no inner data copies required)
• Since leaf vectors can be larger (up to 2B), may need to split apart inner vectors
– Makes use of SplitAndTransfer necessary
– Keep SplitAndTransfer as cheap as possible
• Noop for fixed data
• Offset rewrite for variable width vectors, noop for variable data
• Bit rewrite & shifting for Validity vectors
[Diagram: List Vector composed of an OffsetVector and a Struct Vector, which holds the Inner Vectors]
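Two pieces of the unnest can be sketched with offsets alone. The inner values are already stored as one flat run, so unrolling only needs to know which parent record each inner value came from, and slicing a variable-width vector only needs its offsets rebased. This is an illustration on plain arrays; `UnnestSketch` and its method names are hypothetical and are not the Arrow `splitAndTransfer` API.

```java
import java.util.Arrays;

/** Hypothetical sketch of offset handling when unnesting a list column. */
final class UnnestSketch {

    /** For each inner value, the index of the parent record (list) it belongs to. */
    static int[] parentRows(int[] listOffsets) {
        int rows = listOffsets.length - 1;
        int[] parents = new int[listOffsets[rows]];
        for (int row = 0; row < rows; row++) {
            Arrays.fill(parents, listOffsets[row], listOffsets[row + 1], row);
        }
        return parents;
    }

    /** Offset rewrite for a slice of `length` values starting at `start`; the data bytes are not touched. */
    static int[] rebaseOffsets(int[] offsets, int start, int length) {
        int[] out = new int[length + 1];
        for (int i = 0; i <= length; i++) {
            out[i] = offsets[start + i] - offsets[start];
        }
        return out;
    }

    public static void main(String[] args) {
        // List offsets for the two sample persons: Joe owns phones [0, 2), Jack owns [2, 3).
        System.out.println(Arrays.toString(parentRows(new int[] {0, 2, 3})));
        // Offsets of three 12-byte phone strings, sliced to the last two values.
        System.out.println(Arrays.toString(rebaseOffsets(new int[] {0, 12, 24, 36}, 1, 2)));
    }
}
```

The parent index array plays the role of a selection vector over the outer columns, so outer values such as `name` can be repeated per phone without copying the phone data.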
What’s Coming
• Arrow RPC/REST
– Generic way to retrieve data in Arrow format
– Generic way to serve data in Arrow format
– Simplify integrations across the ecosystem
• Arrow Routines
– GPU and LLVM
Get Involved
• Join the community
– dev@arrow.apache.org
– Slack:
• https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– Follow @ApacheArrow, @DremioHQ, @intjesus

Más contenido relacionado

La actualidad más candente

Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...Databricks
 

La actualidad más candente (20)

Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 

Destacado

Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too muchJulian Hyde
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketDremio Corporation
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 

Destacado (8)

Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 

Similar a Apache Arrow: In Theory, In Practice

Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Spark Summit
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Spark Summit
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowJulien Le Dem
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Community
 
GEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use CasesGEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use Casesinside-BigData.com
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at ScaleDataWorks Summit
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesVladimír Schreiner
 
Orchestrating stateful applications with PKS and Portworx
Orchestrating stateful applications with PKS and PortworxOrchestrating stateful applications with PKS and Portworx
Orchestrating stateful applications with PKS and PortworxVMware Tanzu
 

Apache Arrow: In Theory, In Practice

  • 1. © 2017 Dremio Corporation @DremioHQ Apache Arrow: In Theory, In Practice Apache Arrow Meetup @ Enigma November 1, 2017 Jacques Nadeau
  • 2. Who? Jacques Nadeau (@intjesus)
      • CTO & Co-founder of Dremio
      • Apache member
      • VP Apache Arrow
      • PMCs: Arrow, Calcite, Incubator, Heron (incubating)
  • 3. Arrow In Theory
  • 4. The Apache Arrow Project
      • Started Feb 17, 2016 (Apache TLP)
      • Focused on Columnar In-Memory Analytics
          1. 10-100x speedup on many workloads
          2. Common data layer enables companies to choose best-of-breed systems
          3. Designed to work with any programming language
          4. Support for both relational and complex data as-is
      • Committers & Contributors from: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  • 5. Arrow goals
      • Well-documented and cross-language compatible
      • Designed to take advantage of modern CPU characteristics
      • Embeddable in execution engines, storage layers, etc.
      • Interoperable
  • 6. Arrow In Memory Columnar Format
      • Shredded Nested Data Structures
      • Randomly Accessible
      • Maximize CPU throughput
          – Pipelining
          – SIMD
          – Cache locality
      • Scatter/gather I/O
  • 7. High Performance Sharing & Interchange
      • Before
          – Each system has its own internal memory format
          – 70-80% CPU wasted on serialization and deserialization
          – Functionality duplication and unnecessary conversions
      • With Arrow
          – All systems utilize the same memory format
          – No overhead for cross-system communication
          – Projects can share functionality (e.g., Parquet-to-Arrow reader)
  • 8. Common Processing Libraries (soon)
      • High Performance Canonical processing for Arrow Data Structures
          – Sort
          – Hash Table
          – Dictionary encoding
          – Predicate application & masking
      • Multiple Media and Processing Paradigms
          – Memory, NVMe, 3D XPoint
          – x86, GPU, Many Core (Phi), etc.
  • 9. Arrow Data Types
      • Scalars
          – Boolean
          – [u]int[8,16,32,64], Decimal, Float, Double
          – Date, Time, Timestamp
          – UTF8 String, Binary
      • Complex
          – Struct, Map, List
      • Advanced
          – Union (sparse & dense)
  • 10. Common Message Pattern
      • Schema Negotiation
          – Logical description of structure
          – Identification of dictionary-encoded nodes
      • Dictionary Batch
          – Dictionary ID, Values
      • Record Batch
          – Batches of records up to 64K
          – Leaf nodes up to 2B values
      [Diagram: message stream of Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch; labels: 1..N Batches, 0..N Batches]
  • 11. Columnar data
      persons = [{
        name: 'Joe',
        age: 18,
        phones: ['555-111-1111', '555-222-2222']
      }, {
        name: 'Jack',
        age: 37,
        phones: ['555-333-3333']
      }]
  • 12. Record Batch Construction
      • Each box (vector) is contiguous memory
      • The entire record batch is contiguous on wire
      Example record: { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }
      [Diagram: message stream of Schema Negotiation, Dictionary Batch, Record Batch ×3; one record batch expanded into a data header (describes offsets into data) followed by the buffers name (bitmap), name (offset), name (data), age (bitmap), age (data), phones (bitmap), phones (list offset), phones (offset), phones (data)]
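The buffer names on slide 12 can be made concrete with the two `persons` records from slide 11. The following is a minimal sketch that uses plain Java arrays as stand-ins for Arrow buffers; it does not use the Arrow Java library, and the class name `PersonsColumnarSketch` and its helpers are hypothetical. For brevity it keeps a single validity bitmap, whereas the slide shows one bitmap per field.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: plain arrays standing in for Arrow buffers. */
final class PersonsColumnarSketch {
    // Validity bitmap: bit i is 1 when row i is non-null (both rows are valid here).
    static final byte VALIDITY = 0b0000_0011;

    // name: offsets[i]..offsets[i+1] delimits row i inside one contiguous byte buffer.
    static final int[] NAME_OFFSETS = {0, 3, 7};
    static final byte[] NAME_DATA = "JoeJack".getBytes(StandardCharsets.UTF_8);

    // age: fixed-width values, one per row.
    static final int[] AGE_DATA = {18, 37};

    // phones: list offsets pick a range of entries in the child string vector,
    // which in turn has its own offsets and data buffers.
    static final int[] PHONES_LIST_OFFSETS = {0, 2, 3};
    static final int[] PHONES_OFFSETS = {0, 12, 24, 36};
    static final byte[] PHONES_DATA =
        "555-111-1111555-222-2222555-333-3333".getBytes(StandardCharsets.UTF_8);

    static boolean isValid(int row) {
        return ((VALIDITY >> row) & 1) == 1;
    }

    static String name(int row) {
        int start = NAME_OFFSETS[row];
        return new String(NAME_DATA, start, NAME_OFFSETS[row + 1] - start, StandardCharsets.UTF_8);
    }

    static List<String> phones(int row) {
        List<String> result = new ArrayList<>();
        for (int i = PHONES_LIST_OFFSETS[row]; i < PHONES_LIST_OFFSETS[row + 1]; i++) {
            int start = PHONES_OFFSETS[i];
            result.add(new String(PHONES_DATA, start, PHONES_OFFSETS[i + 1] - start,
                StandardCharsets.UTF_8));
        }
        return result;
    }
}
```

Reading any row is a matter of offset arithmetic, which is what makes the format randomly accessible without per-record parsing.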
  • 13. Arrow Components
      • Core Libraries
      • Within Project Integrations
      • Extended Integrations
  • 14. Arrow: Core Components
      • Java Library
      • C++ Library
      • C Library
      • Ruby Library
      • Python Library
      • JavaScript Library
  • 15. In-Project Arrow Building Blocks/Applications
      • Plasma:
          – Shared memory caching layer, originally created in Ray
      • Feather:
          – Fast ephemeral format for movement of data between R/Python
      • ArrowRest (soon):
          – RPC/IPC interchange library (active development)
      • ArrowRoutines (soon):
          – Common data manipulation components
  • 16. Arrow Integrations
      • Pandas
          – Move seamlessly to and from Arrow as a means for communication, serialization, fast processing
      • GOAI (GPU Open Analytics Initiative), libgdf and the GPU dataframe
          – Leverages Arrow as internal representation
      • Parquet
          – Read and write Parquet quickly to/from Arrow. C++ library builds directly on Arrow.
      • Spark
          – Supports conversion to Pandas via Arrow construction using Arrow Java Library
      • Dremio
          – OSS project, Sabot Engine executes entirely on Arrow memory
  • 17. Arrow In Practice
  • 18. Real World Arrow: Sabot
      • Dremio is an OSS data fabric product
      • The core engine is “Sabot”
          – Built entirely on top of Arrow libraries, runs in JVM
  • 19. Sabot: Arrow in Practice
      • Memory Management
      • Vector sizing
      • RPC Communication
      • Filtering/Sorting
      • Row-wise Algorithms: Hash Tables
      • Vector-wise Algorithms
          – Aggregation
          – Unnesting
  • 20. Practice: Memory Management
      • Arrow includes chunk-based managed allocator
          – Built on top of Netty’s JEMalloc implementation
      • Create a tree of allocators
          – Support both reservation and local limits
          – Include leak detection, debug ownership logs and location accounting
      • Size allocators (reservation and maximum) based on workload management, when to trigger spilling, etc.
      • All Arrow Vectors hold one or more off-heap buffers
      • Everything is manually reference managed
          – Some code more complex
          – Provides strong memory availability understanding
      [Diagram: allocator tree]
          Root (res: 0, max: 20g)
              Job 1 (res: 10m, max: 1g)
                  Task 1 (res: 1m, max: -1)
                  Task 2 (res: 5m, max: 20m)
              Job 2 (res: 10m, max: 1g)
                  Task 1 (res: 1m, max: -1)
                  Task 2 (res: 5m, max: 20m)
      [Diagram: IntVector holding a Validity buffer and a Data buffer]
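The allocator tree on slide 20 can be approximated by a small accounting model. This is only a sketch under stated assumptions, not Arrow's allocator: the class `AllocatorNode` is hypothetical, a reservation is modeled as bytes claimed from the parent when the child is created, `max = -1` is taken to mean "no local limit", and leak detection and ownership logs are left out.

```java
/** Hypothetical accounting model of a reservation/limit allocator tree. */
final class AllocatorNode {
    private final AllocatorNode parent;
    private final long reservation; // bytes pre-claimed from the parent at creation
    private final long max;         // local limit, -1 means unlimited (assumption)
    private long used;              // bytes currently accounted to this node

    AllocatorNode(AllocatorNode parent, long reservation, long max) {
        this.parent = parent;
        this.reservation = reservation;
        this.max = max;
        if (parent != null && reservation > 0 && !parent.allocate(reservation)) {
            throw new IllegalStateException("parent cannot satisfy reservation");
        }
    }

    /** Returns false instead of throwing so callers can react, e.g. by spilling. */
    boolean allocate(long bytes) {
        long newUsed = used + bytes;
        if (max >= 0 && newUsed > max) {
            return false; // local limit hit
        }
        // Only the part that exceeds the reservation has to be granted by the parent.
        long fromParent = Math.max(0, newUsed - reservation) - Math.max(0, used - reservation);
        if (fromParent > 0 && parent != null && !parent.allocate(fromParent)) {
            return false; // some ancestor is out of budget
        }
        used = newUsed;
        return true;
    }

    void release(long bytes) {
        long newUsed = used - bytes;
        long toParent = Math.max(0, used - reservation) - Math.max(0, newUsed - reservation);
        if (toParent > 0 && parent != null) {
            parent.release(toParent);
        }
        used = newUsed;
    }

    long used() {
        return used;
    }
}
```

With the numbers from the slide, a task capped at 20m can take 16m (drawing past its own and its job's reservation, up to the root) but is refused a further 5m locally, without ever asking the parent.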
  • 21. Practice: Memory Management Cont’d
      • Data moves through data pipelines
      • Ownership needs to be clear (to plan/control execution)
          – Allocated memory can be referenced by many consumers
          – One allocator ‘owns’ the accounted memory
          – Consumers can use Vector’s transfer capability to leverage transfer semantics and hand off data ownership
      https://goo.gl/HN9nCH
      [Diagram: Scan operator (allocator res: 10m, max: 1g) feeding Aggregate operator (allocator res: 10m, max: 1g), with an arrow labeled "transfer ownership"]
  • 22. Practice: Vector Sizing
      • Batches are the smallest work unit
      • Batches of records can be 1..64k records in size.
      • Optimization Problem
          – Larger batches improve processing performance
          – Larger batches cause pipeline problems
          – Smaller batches cause more heap overhead
      • Execution-Level Adaptive Resizing for wide records (100s-1000s of fields)
      [Diagram: Narrow Batch, 4095 records; Wide Batch, 127 records]
  • 23. Practice: RPC Communication
      • Goals
          – Leverage Gathering Writes
          – Ensure connection resilience despite memory pressure
      • Custom Netty-based RPC protocol
          – All messages include a structured (proto) message and a sidecar memory message
          – Out of memory is handled at message consumption time, ensuring a fail-ack as opposed to a connection disconnect
      Send: Listener listener, Proto structuredMessage, ArrowBuf... dataBodies
      https://goo.gl/XWyrc1
      [Diagram labels: Structured message, Gathering write]
  • 24. Filtering & Sorting
      • For filtering and sorting, create a selection vector
          – Describes valid values and ordering without reorganizing underlying data.
          – Two bytes for filter purposes (single batch horizon)
          – Four bytes for sort purposes (multi-batch horizon)
      • 4-Byte selection vector pattern frequently used by other operations
      • 6-Byte selection vector used in some cases (to manage wide batches)
      • Defer copy/compacting
      [Diagram: sv2 entries 2, 14, 35, 99; sv4 entries 1-2, 2-14, 1-35, 2-99]
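Slide 24's two selection-vector widths can be sketched on plain arrays. The code below is illustrative only: `SelectionVectorSketch` is a hypothetical class, and the sv4 packing (batch index in the upper 16 bits, record index in the lower 16 bits) is an assumption chosen to match the "batch-record" pairs in the slide's diagram.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of 2-byte and 4-byte selection vectors. */
final class SelectionVectorSketch {

    /** sv2: two-byte indices of the rows in ONE batch that pass the filter. */
    static char[] filterSv2(int[] values, int threshold) {
        char[] sv2 = new char[values.length];
        int selected = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > threshold) {
                sv2[selected++] = (char) i; // the data itself is not moved
            }
        }
        return Arrays.copyOf(sv2, selected);
    }

    /** sv4: four-byte (batch, record) references ordered by value across batches. */
    static int[] sortSv4(int[][] batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        Integer[] refs = new Integer[total];
        int n = 0;
        for (int b = 0; b < batches.length; b++) {
            for (int r = 0; r < batches[b].length; r++) {
                refs[n++] = (b << 16) | r; // assumed packing: batch high, record low
            }
        }
        Arrays.sort(refs, Comparator.comparingInt((Integer e) -> batches[batchOf(e)][recordOf(e)]));
        int[] sv4 = new int[total];
        for (int i = 0; i < total; i++) {
            sv4[i] = refs[i];
        }
        return sv4;
    }

    static int batchOf(int ref) {
        return ref >>> 16;
    }

    static int recordOf(int ref) {
        return ref & 0xFFFF;
    }
}
```

Downstream operators read through the selection vector, so the copy or compaction of the underlying vectors can be deferred until it is actually needed.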
  • 25. Row-wise Algorithms: Hash Table + Aggregation
      For generating a hash table, maintaining a columnar structure for keys slows hash insertion and lookup.
      • Break data into fixed and variable values
      • Use consistent fixed value insertion
      • Use dynamic variable output
      • Pivot data
          – Vector at a time for fixed values
          – All variable at same time for variable vectors
      • Hash and equality as bucket of bytes
      • Avoids excessive indirection
      • Maintain Aggregation tables in columnar format
      [Diagram: input vectors go through "pivot fixed" into a Fixed Block Vector (rows laid out as validity|fixed1|fixed2|varlen|varoffset) and through "pivot variable" into a Variable Block Vector (len|data|len|data|...); both can be unpivoted; a "direct projection" feeds the columnar Aggregation Tables (Partial-agg1 through Partial-agg6)]
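The fixed-width half of the pivot on slide 25 can be sketched as follows. This is an illustration under assumptions, not Sabot's code: `PivotSketch` is hypothetical, it handles two nullable int key columns only, and each row block is laid out as 4 bytes of validity bits followed by the two 4-byte values. Null values are written as zero so that rows hash and compare as plain bytes.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: pivot two nullable int columns into row-wise blocks. */
final class PivotSketch {
    static final int BLOCK_WIDTH = 12; // 4 validity bytes + 2 x 4 value bytes

    static byte[] pivot(int[] col1, boolean[] valid1, int[] col2, boolean[] valid2) {
        int count = col1.length;
        ByteBuffer block = ByteBuffer.allocate(count * BLOCK_WIDTH); // zero-filled
        pivotColumn(block, col1, valid1, count, 0); // one whole vector at a time
        pivotColumn(block, col2, valid2, count, 1);
        return block.array();
    }

    private static void pivotColumn(ByteBuffer block, int[] col, boolean[] valid,
                                    int count, int ordinal) {
        for (int row = 0; row < count; row++) {
            int base = row * BLOCK_WIDTH;
            int bit = valid[row] ? 1 : 0;
            // Set this column's validity bit in the row block.
            block.putInt(base, block.getInt(base) | (bit << ordinal));
            // Multiplying by the bit zeroes the value for nulls (canonical form).
            block.putInt(base + 4 + 4 * ordinal, col[row] * bit);
        }
    }

    /** Hash over the raw bytes of one row block. */
    static int hashRow(byte[] block, int row) {
        int h = 1;
        for (int i = row * BLOCK_WIDTH; i < (row + 1) * BLOCK_WIDTH; i++) {
            h = 31 * h + block[i];
        }
        return h;
    }

    /** Byte-wise equality of two row blocks. */
    static boolean rowsEqual(byte[] block, int a, int b) {
        for (int i = 0; i < BLOCK_WIDTH; i++) {
            if (block[a * BLOCK_WIDTH + i] != block[b * BLOCK_WIDTH + i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because null slots are canonicalized, two rows whose first key is null compare equal even if the source data vector held different leftover values in those slots.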
  • 26. Example Pivot Code
      • Takes advantage of runs of nullable values, working a word at a time
          – ALL_SET, NONE_SET, SOME_SET
      • Ensure canonicalization of values based on validity
          – Typically validity data is zeroed on allocation, other vectors are not.
          – Vector data has to be cleared when pivoting nulled values
      • Conditions are avoided
      https://goo.gl/EgLy9r

      static void pivot8Bytes(
          VectorPivotDef def,
          FixedBlockVector fixedBlock,
          final int count) {
        ...
        // decode word at a time.
        while (srcDataAddr < finalWordAddr) {
          final long bitValues = PlatformDependent.getLong(srcBitsAddr);
          if (bitValues == NONE_SET) {
            // noop (all nulls).
            bitTargetAddr += (WORD_BITS * blockLength);
            valueTargetAddr += (WORD_BITS * blockLength);
            srcDataAddr += (WORD_BITS * EIGHT_BYTE);
          } else if (bitValues == ALL_SET) {
            // all set, set the bit values using a constant OR.
            // Independently set the data values without transformation.
            final int bitVal = 1 << bitOffset;
            for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength) {
              PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | bitVal);
            }
            for (int i = 0; i < WORD_BITS; i++, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
              PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr));
            }
          } else {
            // some nulls, some not, update each value to zero or the value, depending on the null bit.
            for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength, valueTargetAddr += blockLength, srcDataAddr += E [line cut off on the slide]
              final int bitVal = ((int) (bitValues >>> i)) & 1;
              PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | (bitVal << bitOffset));
              PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr) * bitVal);
            }
          }
          srcBitsAddr += WORD_BYTES;
        }
  • 27. Practice: Parallel Columnar Shuffle
      • Partition data based on a hashed key
      • Avoid excessive batch buffering cost
      • Steps
          1. Consolidate node-local streams
              • Allow reduction in buffering memory in large clusters (k*n instead of n*n)
          2. Hash the key(s) to determine bucket offset
              • Generate bucket vector
          3. Pre-allocate output buffers at target output size
              • Sized depending on narrow/wide batches
          4. Do columnar copies per vector
              • Written in C-like low overhead pattern with no abstraction
      [Diagram: Node 1 with Thread 1 and Thread 2 feeding a Mux'd stream; stages labeled "generate bucket vector", "Do bucket-level copies", "Gathering Write"; Node 2 with Thread 1 and Thread 2]
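Steps 2 to 4 of slide 27 can be sketched on plain arrays. The class `ShuffleSketch` below is hypothetical and simplified: the hash is just the key modulo the bucket count, and output buffers are sized exactly from a counting pass rather than pre-allocated at a fixed target size as the slide describes.

```java
/** Hypothetical sketch of bucket-vector generation and columnar partition copies. */
final class ShuffleSketch {

    /** Step 2: one bucket id per record, computed once from the key column. */
    static int[] bucketVector(int[] keys, int numBuckets) {
        int[] buckets = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            buckets[i] = Math.floorMod(Integer.hashCode(keys[i]), numBuckets);
        }
        return buckets;
    }

    /** Steps 3 and 4: size the outputs, then copy ONE column in a tight loop. */
    static int[][] partitionColumn(int[] column, int[] buckets, int numBuckets) {
        int[] sizes = new int[numBuckets];
        for (int b : buckets) {
            sizes[b]++;
        }
        int[][] out = new int[numBuckets][];
        for (int b = 0; b < numBuckets; b++) {
            out[b] = new int[sizes[b]];
        }
        int[] cursor = new int[numBuckets];
        for (int i = 0; i < column.length; i++) {
            int b = buckets[i];
            out[b][cursor[b]++] = column[i];
        }
        return out;
    }
}
```

The bucket vector is computed once per batch and then reused for every column, so each column copy is a simple loop with no per-row hashing.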
  • 28. Example Copier Code
      • Two byte offset addresses (sv2)
      • Tight loop focused on
      • Far more efficient than runtime-generated row-wise code
          – Also has faster startup time
      https://goo.gl/fZEsfy

      public void copy(long offsetAddr, int count) {
        final List<ArrowBuf> sourceBuffers = source.getFieldBuffers();
        targetAlt.allocateNew(count);
        final List<ArrowBuf> targetBuffers = target.getFieldBuffers();
        final long max = offsetAddr + count * STEP_SIZE;
        final long srcAddr = sourceBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
        long dstAddr = targetBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
        for (long addr = offsetAddr; addr < max; addr += STEP_SIZE, dstAddr += SIZE) {
          PlatformDependent.putLong(dstAddr,
              PlatformDependent.getLong(srcAddr + ((char) PlatformDependent.getShort(addr)) * SIZE));
        }
      }
  • 29. Unnesting List Vectors
      • Common Pattern: List of objects that want to be unrolled to separate records.
      • Arrow’s representation allows a direct unroll (no inner data copies required)
      • Since leaf vectors can be larger (up to 2B), may need to split apart inner vectors
          – Makes use of SplitAndTransfer necessary
          – SplitAndTransfer as cheap as possible
              • Noop for fixed data
              • Offset rewrite for variable width vectors, noop for variable data
              • Bit rewrite & shifting for Validity vectors
      [Diagram: List Vector made up of an OffsetVector and a Struct Vector, which holds the Inner Vectors]
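The two ideas on slide 29 can be illustrated with offset arithmetic alone. This sketch uses plain arrays and a hypothetical class `UnnestSketch`; it is not Arrow's `splitAndTransfer`. `parentRows` shows that unrolling a list only needs the offsets (the inner values are reused as they are), and `sliceOffsets` shows the offset rewrite needed when a variable-width vector is split, while its data buffer is left untouched.

```java
/** Hypothetical sketch of list unnesting and offset rewriting on a split. */
final class UnnestSketch {

    /**
     * For each inner value, the index of the parent row it came from.
     * Assumes listOffsets starts at 0, with one more entry than there are rows.
     */
    static int[] parentRows(int[] listOffsets) {
        int rows = listOffsets.length - 1;
        int[] parents = new int[listOffsets[rows]];
        for (int row = 0; row < rows; row++) {
            for (int i = listOffsets[row]; i < listOffsets[row + 1]; i++) {
                parents[i] = row;
            }
        }
        return parents;
    }

    /**
     * Offsets for a slice of `length` values starting at `start`, rebased to zero.
     * The slice's data is the original buffer from byte offsets[start] onward.
     */
    static int[] sliceOffsets(int[] offsets, int start, int length) {
        int[] out = new int[length + 1];
        for (int i = 0; i <= length; i++) {
            out[i] = offsets[start + i] - offsets[start];
        }
        return out;
    }
}
```

For the `persons` example from slide 11, list offsets {0, 2, 3} yield parent rows {0, 0, 1}: Joe's two phone numbers and Jack's one become three separate records.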
  • 30. What’s Coming
      • Arrow RPC/REST
          – Generic way to retrieve data in Arrow format
          – Generic way to serve data in Arrow format
          – Simplify integrations across the ecosystem
      • Arrow Routines
          – GPU and LLVM
  • 31. Get Involved
      • Join the community
          – dev@arrow.apache.org
          – Slack: https://apachearrowslackin.herokuapp.com/
          – http://arrow.apache.org
          – Follow @ApacheArrow, @DremioHQ, @intjesus