This document discusses how Apache Arrow enables sharing data between Python and Java without copying. It summarizes Arrow's efficient in-memory columnar format and its ability to exchange data between programming languages without conversion. It then outlines how Arrow's Java and Python libraries allow data held in the JVM to be accessed from Python without a copy, by passing memory addresses between the two environments. This enables faster data science workflows that mix Python with Java/Scala.
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
1. 1
Fulfilling Apache Arrow's Promises:
Pandas on JVM memory without a copy
PyCon.DE Karlsruhe 2018
Uwe L. Korn
2. 2
• Senior Data Scientist at Blue Yonder
(@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy
focus around Pandas
About me
xhochy
mail@uwekorn.com
3. 3
What’s Apache Arrow?
• Published in February 2016
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ..)
• Exchange data without conversion between Python, C++, C(glib), Ruby,
Lua, R, JavaScript, Go, Rust, Matlab and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
5. 5
Data Science Workflow in 2018
[Diagram: SQL Engine → pre-processing with pandas → machine learning model → probability density function (PDF); everything downstream of the SQL engine runs in Python]
6. 6
Looks simple?
• It isn’t.
• „Data“ is a very heterogeneous landscape
• Most common setup:
• Java/Scala, i.e. JVM, for data processing
• Python for machine learning
7. 7
Data Science Workflow in 2018
[Diagram: SQL Engine → JDBC Driver (JDBC rows) → JayDeBeApi (Python rows) → pre-processing with pandas → machine learning model]
8. 8
org.apache.arrow.adapter.jdbc
• Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
• Do conversion of rows to columns in the JVM
• Data is stored „off-heap“, i.e.:
• not managed by the JVM
• native memory layout, same as in pyarrow
9. 9
Workflow in 2018 with Arrow
[Diagram: SQL Engine → JDBC Driver (JDBC rows) → org.apache.arrow.adapter.jdbc (Arrow) → ? → pre-processing with pandas → machine learning model; the bridge from the JVM into Python is still missing]
10. 10
So we’re done? No.
• We still only have Arrow data in the JVM
• Arrow and Pandas have a slightly different memory layout
• We have this today in PySpark
• It’s fast
• Still involves a copy over the network
• Arrow → pandas conversion is tuned but still a copy
11. 11
pyarrow.jvm
• Access Arrow data created in the JVM from Python
• Involves no copy of the data
• Translation of the helper objects
• Actually passes memory addresses around
No copy between the JVM and Python!
12. NumPy & the BlockManager
Photo by Susan Holt Simpson on Unsplash
13. 13
Pandas Shortcomings
• Limited to NumPy data types; everything else becomes object
• Columns are not stored separately, but grouped by dtype into blocks
• Nullability is not type-safe (yet)
—> Arrow memory does not match Pandas memory
—> Copy 😢
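Both shortcomings are visible in plain pandas (a quick sketch; behaviour as of the NumPy-backed pandas of this talk):

```python
import numpy as np
import pandas as pd

# Strings fall back to the generic object dtype
s = pd.Series(["a", "b"])
print(s.dtype)

# Nullability is not type-safe: a single missing value
# turns an integer column into float64
t = pd.Series([1, 2, np.nan])
print(t.dtype)
```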
14. 14
Pandas ExtensionArrays
• Introduced new interfaces in 0.23
• ExtensionDtype
• What type of scalars?
• ExtensionArray
• Implement basic array ops
• Pandas provides algorithms on top
• Still experimental; wait for 0.24
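The first ExtensionArray shipped by pandas itself, the nullable integer type landing in 0.24, shows the interface in action (a sketch; requires pandas >= 0.24):

```python
import pandas as pd

# A nullable integer ExtensionArray: missing values
# no longer force a cast to float64
arr = pd.array([1, 2, None], dtype="Int64")
s = pd.Series(arr)
print(s.dtype)           # an ExtensionDtype, not a NumPy dtype
print(s.isna().tolist())
```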
16. 16
fletcher
• https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
• Needs {pandas, Arrow, …} master
No copy between Apache Arrow and pandas!
17. 17
Workflow in 2018 with Arrow
[Diagram: SQL Engine → JDBC Driver (JDBC rows) → org.apache.arrow.adapter.jdbc (Arrow) → pyarrow.jvm / fletcher → pre-processing with pandas → machine learning model]