SlideShare una empresa de Scribd logo
1 de 49
The columnar roadmap:
Apache Parquet and Apache Arrow
Julien Le Dem
VP Apache Parquet, Apache Arrow PMC, Principal Engineer WeWork
@J_
julien.ledem@wework.com
June 2018
Julien Le Dem
@J_
Principal Engineer
Data Platform
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Julien
Agenda
• Community Driven Standard
• Benefits of Columnar representation
• Vertical integration: Parquet to Arrow
• Arrow based communication
CommunityDriven Standard
An open source standard
• Parquet: Common need for on disk columnar.
• Arrow: Common need for in memory columnar.
• Arrow is building on the success of Parquet.
• Top-level Apache project
• Standard from the start:
– Members from 13+ major open source projects involved
• Benefits:
– Share the effort
– Create an ecosystem
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Interoperability and Ecosystem
Before With Arrow
• Each system has its own internal memory
format
• 70-80% CPU wasted on serialization and
deserialization
• Functionality duplication and unnecessary
conversions
• All systems utilize the same memory
format
• No overhead for cross-system
communication
• Projects can share functionality (eg:
Parquet-to-Arrow reader)
Benefits of Columnar representation
Columnar layout
Logical table
representation
Row layout
Column layout
@EmrgencyKittens
On Disk and in Memory
• Different trade offs
– On disk: Storage.
• Accessed by multiple queries.
• Priority to I/O reduction (but still needs good CPU throughput).
• Mostly Streaming access.
– In memory: Transient.
• Specific to one query execution.
• Priority to CPU throughput (but still needs good I/O).
• Streaming and Random access.
Parquet on disk columnar format
Parquet on disk columnar format
• Nested data structures
• Compact format:
– type aware encodings
– better compression
• Optimized I/O:
– Projection push down (column pruning)
– Predicate push down (filters based on stats)
Parquet nested representation
Document
DocId Links Name
Backward Forward Language Url
Code Country
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Access only the data you need
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
+ =
Columnar Statistics
Read only the
data you need!
Parquet file layout
Arrow in memory columnar format
Arrow goals
• Well-documented and cross language compatible
• Designed to take advantage of modern CPU
• Embeddable
– in execution engines, storage layers, etc.
• Interoperable
Arrow in memory columnar format
• Nested Data Structures
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
CPU pipeline
Columnar data
persons = [{
name: ’Joe',
age: 18,
phones: [
‘555-111-1111’,
‘555-222-2222’
]
}, {
name: ’Jack',
age: 37,
phones: [ ‘555-333-3333’ ]
}]
Record Batch Construction
Schema
Negotiation
Dictionary
Batch
Record
Batch
Record
Batch
Record
Batch
name (offset)
name (data)
age (data)
phones (list offset)
phones (data)
data header (describes offsets into data)
name (bitmap)
age (bitmap)
phones (bitmap)
phones (offset)
{
name: ’Joe',
age: 18,
phones: [
‘555-111-1111’,
‘555-222-2222’
]
}
Each box (vector) is contiguous memory
The entire record batch is contiguous on wire
Vertical integration: Parquet to Arrow
Representation comparison for flat schema
Naïve conversion
Naïve conversion Data dependent branch
Peeling away abstraction layers
Vectorized read
(They have layers)
Bit packing case
Bit packing case No branch
Run length encoding case
Run length encoding case
Block of defined values
Run length encoding case
Block of null values
Predicate pushdown
Example: filter and projection
Naive filter and projection implementation
Peeling away abstractions
Arrow based communication
Universal high performance UDFs
SQL engine
Python
process
User
defined
function
SQL
Operator
1
SQL
Operator
2
reads reads
Arrow RPC/REST API
• Generic way to retrieve data in Arrow format
• Generic way to serve data in Arrow format
• Simplify integrations across the ecosystem
• Arrow based pipe
RPC: arrow based storage interchange
The memory
representation is sent
over the wire.
No serialization
overhead.
Scanner
projection/predicate
push down
Operator
Arrow batches
Storage
Mem
Disk
SQL
execution
Scanner Operator
Scanner Operator
Storage
Mem
Disk
Storage
Mem
Disk
…
RPC: arrow based cache
The memory
representation is sent
over the wire.
No serialization
overhead.
projection
push down
Operator
Arrow-based
Cache
SQL
execution
Operator
Operator
…
RPC: Single system execution
The memory
representation is sent
over the wire.
No serialization
overhead.
Scanner
Scanner
Scanner
Parquet files
projection push down
read only a and b
Partial
Agg
Partial
Agg
Partial
Agg
Agg
Agg
Agg
Shuffle
Arrow batches
Result
Results
- PySpark Integration:
53x speedup (IBM spark work on SPARK-13534)
http://s.apache.org/arrowresult1
- Streaming Arrow Performance
7.75GB/s data movement
http://s.apache.org/arrowresult2
- Arrow Parquet C++ Integration
4GB/s reads
http://s.apache.org/arrowresult3
- Pandas Integration
9.71GB/s
http://s.apache.org/arrowresult4
Language Bindings
Parquet
• Target Languages
– Java
– CPP
– Python & Pandas
• Engines integration:
– Many!
Arrow
• Target Languages
– Java
– CPP, Python
– R (underway)
– C, Ruby, JavaScript
• Engines integration:
– Drill
– Pandas, R
– Spark (underway)
Arrow Releases
237
131
76
17
62
22 34
178
195
311
89
138
93
137
October 10, 2016 February 18,
2017
May 5, 2017 May 22, 2017 July 23, 2017 August 14, 2017 September 17,
2017
0.1.0 0.2.0 0.3.0 0.4.0 0.5.0 0.6.0 0.7.0
Days Changes
Current activity:
• Spark Integration (SPARK-13534)
• Pages index in Parquet footer (PARQUET-922)
• Arrow REST API (ARROW-1077)
• Bindings:
– C, Ruby (ARROW-631)
–JavaScript (ARROW-541)
Get Involved
• Join the community
– dev@{arrow,parquet}.apache.org
– http://{arrow,parquet}.apache.org
– Follow @Apache{Parquet,Arrow}
THANK YOU!
julien.ledem@wework.com
Julien Le Dem @J_
June 2018
We’re hiring!
Contact: julien.ledem@wework.com
We’re growing
We’re hiring!
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron
School, Meetup, WeLive, …
WeWork in numbers
Technology jobs
Locations
• San Francisco: Platform teams
• New York
• Tel Aviv
• Singapore
• WeWork has 283 locations, in 75 cities internationally
• Over 253,000 members worldwide
• More than 40,000 member companies
Contact: julien.ledem@wework.com
Questions?
Julien Le Dem @J_ julien.ledem@wework.com
June 2018

Más contenido relacionado

La actualidad más candente

Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
DataWorks Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 

La actualidad más candente (20)

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 

Similar a The columnar roadmap: Apache Parquet and Apache Arrow

Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 

Similar a The columnar roadmap: Apache Parquet and Apache Arrow (20)

Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 

Más de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

The columnar roadmap: Apache Parquet and Apache Arrow

  • 1. The columnar roadmap: Apache Parquet and Apache Arrow Julien Le Dem VP Apache Parquet, Apache Arrow PMC, Principal Engineer WeWork @J_ julien.ledem@wework.com June 2018
  • 2. Julien Le Dem @J_ Principal Engineer Data Platform • Author of Parquet • Apache member • Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez • Used Hadoop first at Yahoo in 2007 • Formerly Twitter Data platform and Dremio Julien
  • 3. Agenda • Community Driven Standard • Benefits of Columnar representation • Vertical integration: Parquet to Arrow • Arrow based communication
  • 5. An open source standard • Parquet: Common need for on disk columnar. • Arrow: Common need for in memory columnar. • Arrow is building on the success of Parquet. • Top-level Apache project • Standard from the start: – Members from 13+ major open source projects involved • Benefits: – Share the effort – Create an ecosystem Calcite Cassandra Deeplearning4j Drill Hadoop HBase Ibis Impala Kudu Pandas Parquet Phoenix Spark Storm R
  • 6. Interoperability and Ecosystem Before With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on serialization and deserialization • Functionality duplication and unnecessary conversions • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (eg: Parquet-to-Arrow reader)
  • 7. Benefits of Columnar representation
  • 8. Columnar layout Logical table representation Row layout Column layout @EmrgencyKittens
  • 9. On Disk and in Memory • Different trade offs – On disk: Storage. • Accessed by multiple queries. • Priority to I/O reduction (but still needs good CPU throughput). • Mostly Streaming access. – In memory: Transient. • Specific to one query execution. • Priority to CPU throughput (but still needs good I/O). • Streaming and Random access.
  • 10. Parquet on disk columnar format
  • 11. Parquet on disk columnar format • Nested data structures • Compact format: – type aware encodings – better compression • Optimized I/O: – Projection push down (column pruning) – Predicate push down (filters based on stats)
  • 12. Parquet nested representation Document DocId Links Name Backward Forward Language Url Code Country Columns: docid links.backward links.forward name.language.code name.language.country name.url Borrowed from the Google Dremel paper https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 13. Access only the data you need a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 + = Columnar Statistics Read only the data you need!
  • 15. Arrow in memory columnar format
  • 16. Arrow goals • Well-documented and cross language compatible • Designed to take advantage of modern CPU • Embeddable – in execution engines, storage layers, etc. • Interoperable
  • 17. Arrow in memory columnar format • Nested Data Structures • Maximize CPU throughput – Pipelining – SIMD – cache locality • Scatter/gather I/O
  • 19. Columnar data persons = [{ name: ’Joe', age: 18, phones: [ ‘555-111-1111’, ‘555-222-2222’ ] }, { name: ’Jack', age: 37, phones: [ ‘555-333-3333’ ] }]
  • 20. Record Batch Construction Schema Negotiation Dictionary Batch Record Batch Record Batch Record Batch name (offset) name (data) age (data) phones (list offset) phones (data) data header (describes offsets into data) name (bitmap) age (bitmap) phones (bitmap) phones (offset) { name: ’Joe', age: 18, phones: [ ‘555-111-1111’, ‘555-222-2222’ ] } Each box (vector) is contiguous memory The entire record batch is contiguous on wire
  • 24. Naïve conversion Data dependent branch
  • 25. Peeling away abstraction layers Vectorized read (They have layers)
  • 27. Bit packing case No branch
  • 29. Run length encoding case Block of defined values
  • 30. Run length encoding case Block of null values
  • 32. Example: filter and projection
  • 33. Naive filter and projection implementation
  • 36. Universal high performance UDFs SQL engine Python process User defined function SQL Operator 1 SQL Operator 2 reads reads
  • 37. Arrow RPC/REST API • Generic way to retrieve data in Arrow format • Generic way to serve data in Arrow format • Simplify integrations across the ecosystem • Arrow based pipe
  • 38. RPC: arrow based storage interchange The memory representation is sent over the wire. No serialization overhead. Scanner projection/predicate push down Operator Arrow batches Storage Mem Disk SQL execution Scanner Operator Scanner Operator Storage Mem Disk Storage Mem Disk …
  • 39. RPC: arrow based cache The memory representation is sent over the wire. No serialization overhead. projection push down Operator Arrow-based Cache SQL execution Operator Operator …
  • 40. RPC: Single system execution The memory representation is sent over the wire. No serialization overhead. Scanner Scanner Scanner Parquet files projection push down read only a and b Partial Agg Partial Agg Partial Agg Agg Agg Agg Shuffle Arrow batches Result
  • 41. Results - PySpark Integration: 53x speedup (IBM spark work on SPARK-13534) http://s.apache.org/arrowresult1 - Streaming Arrow Performance 7.75GB/s data movement http://s.apache.org/arrowresult2 - Arrow Parquet C++ Integration 4GB/s reads http://s.apache.org/arrowresult3 - Pandas Integration 9.71GB/s http://s.apache.org/arrowresult4
  • 42. Language Bindings Parquet • Target Languages – Java – CPP – Python & Pandas • Engines integration: – Many! Arrow • Target Languages – Java – CPP, Python – R (underway) – C, Ruby, JavaScript • Engines integration: – Drill – Pandas, R – Spark (underway)
  • 43. Arrow Releases 237 131 76 17 62 22 34 178 195 311 89 138 93 137 October 10, 2016 February 18, 2017 May 5, 2017 May 22, 2017 July 23, 2017 August 14, 2017 September 17, 2017 0.1.0 0.2.0 0.3.0 0.4.0 0.5.0 0.6.0 0.7.0 Days Changes
  • 44. Current activity: • Spark Integration (SPARK-13534) • Pages index in Parquet footer (PARQUET-922) • Arrow REST API (ARROW-1077) • Bindings: – C, Ruby (ARROW-631) –JavaScript (ARROW-541)
  • 45. Get Involved • Join the community – dev@{arrow,parquet}.apache.org – http://{arrow,parquet}.apache.org – Follow @Apache{Parquet,Arrow}
  • 48. We’re growing We’re hiring! • WeWork doubled in size last year and the year before • We expect to double in size this year as well • Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, … WeWork in numbers Technology jobs Locations • San Francisco: Platform teams • New York • Tel Aviv • Singapore • WeWork has 283 locations, in 75 cities internationally • Over 253,000 members worldwide • More than 40,000 member companies Contact: julien.ledem@wework.com
  • 49. Questions? Julien Le Dem @J_ julien.ledem@wework.com June 2018