Apache Arrow and
Pandas UDF on Apache Spark
Takuya UESHIN
2018-12-08, Apache Arrow Tokyo Meetup 2018
2
About Me
- Software Engineer @databricks
- Apache Spark Committer
- Twitter: @ueshin
- GitHub: github.com/ueshin
3
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Python worker
• Work In Progress
• Follow-up Events
5
Apache Spark and PySpark
“Apache Spark™ is a unified analytics engine for large-scale data
processing.”
https://spark.apache.org/
• The latest release:
2.4.0 (2018/11/02)
• PySpark is a Python API
• SparkR is an R API
6
PySpark and Pandas
“pandas is an open source, BSD-licensed library providing
high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.”
• https://pandas.pydata.org/
• The latest release: v0.23.4 Final (2018/08/03)
• PySpark supports Pandas >= "0.19.2"
7
PySpark and Pandas
PySpark can convert data between PySpark DataFrame and
Pandas DataFrame.
• pdf = df.toPandas()
• df = spark.createDataFrame(pdf)
We can use Arrow as an intermediate format by setting config:
“spark.sql.execution.arrow.enabled” to “true” (“false” by default).
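A minimal sketch of the conversion above, assuming the caller already has a SparkSession `spark` and a Spark DataFrame `df`. The helper name `roundtrip_with_arrow` is hypothetical; only the config key and the `toPandas` / `createDataFrame` calls come from the slide.

```python
ARROW_ENABLED = "spark.sql.execution.arrow.enabled"  # config key quoted on the slide


def roundtrip_with_arrow(spark, df):
    """Convert a Spark DataFrame to pandas and back with Arrow enabled.

    `spark` is an existing SparkSession and `df` a Spark DataFrame, both
    supplied by the caller; this sketch does not start Spark itself.
    """
    previous = spark.conf.get(ARROW_ENABLED, "false")
    spark.conf.set(ARROW_ENABLED, "true")
    try:
        pdf = df.toPandas()                 # Spark -> pandas, Arrow as intermediate format
        return spark.createDataFrame(pdf)   # pandas -> Spark, Arrow as intermediate format
    finally:
        spark.conf.set(ARROW_ENABLED, previous)  # restore the caller's setting
```

Restoring the previous value keeps the change local to this call, which may matter when other code in the same session relies on the default.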
8
Python UDF and Pandas UDF
• UDF: User Defined Function
• Python UDF
• Serialize/Deserialize data with Pickle
• Fetch data block, but invoke UDF row by row
• Pandas UDF
• Serialize/Deserialize data with Arrow
• Fetch data block, and invoke UDF block by block
• PandasUDFType: SCALAR, GROUPED_MAP, GROUPED_AGG
We don’t need any config, but the declaration is different.
9
Python UDF and Pandas UDF
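from pyspark.sql.functions import udf, pandas_udf, PandasUDFType

# Python UDF: invoked once per row; v is a single value
@udf('double')
def plus_one(v):
    return v + 1

# Pandas UDF: invoked once per batch; v is a pandas.Series
@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1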
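The difference in invocation can be illustrated without Spark. The sketch below is not how Spark calls the functions; it only counts calls when the same three values are processed row by row and as one pandas Series. The names `plus_one_row`, `plus_one_batch` and `calls` are made up for the illustration.

```python
import pandas as pd

calls = {"row": 0, "batch": 0}


def plus_one_row(v):
    """Stands in for a Python UDF: receives one value per call."""
    calls["row"] += 1
    return v + 1


def plus_one_batch(v):
    """Stands in for a SCALAR Pandas UDF: receives a whole pandas.Series per call."""
    calls["batch"] += 1
    return v + 1


batch = pd.Series([1.0, 2.0, 3.0])

row_result = pd.Series([plus_one_row(x) for x in batch])  # one call per row
batch_result = plus_one_batch(batch)                      # one call per block
```

Both paths produce the same values; the row path makes three function calls and the batch path one, which is where the vectorized version saves interpreter overhead.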
10
Python UDF and Pandas UDF
• SCALAR
• A transformation: One or more Pandas Series -> One Pandas Series
• The returned Pandas Series must have the same length as the
input Pandas Series
• GROUPED_MAP
• A transformation: One Pandas DataFrame -> One Pandas DataFrame
• The length of the returned Pandas DataFrame can be arbitrary
• GROUPED_AGG
• A transformation: One or more Pandas Series -> One scalar
• The returned value type should be a primitive data type
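The three shapes can be sketched in plain pandas, with no Spark involved. The function names and the tiny `pdf` frame are assumptions for illustration; in Spark each function would be wrapped with `pandas_udf` and the matching `PandasUDFType`, and Spark would do the grouping.

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 1, 2], "v": [1.0, 3.0, 5.0]})


def scalar_plus_one(v: pd.Series) -> pd.Series:
    """SCALAR shape: Series in, Series of the same length out."""
    return v + 1


def grouped_map_subtract_mean(g: pd.DataFrame) -> pd.DataFrame:
    """GROUPED_MAP shape: one DataFrame per group in, one DataFrame out."""
    return g.assign(v=g["v"] - g["v"].mean())


def grouped_agg_mean(v: pd.Series) -> float:
    """GROUPED_AGG shape: Series in, one scalar out."""
    return float(v.mean())


scalar_out = scalar_plus_one(pdf["v"])

# Emulate the per-group invocation with pandas groupby.
map_out = pd.concat(grouped_map_subtract_mean(g) for _, g in pdf.groupby("id"))
agg_out = {key: grouped_agg_mean(g["v"]) for key, g in pdf.groupby("id")}
```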
11
Performance: Python UDF vs Pandas UDF
From a blog post: Introducing Pandas UDF for PySpark
• Plus One
• Cumulative Probability
• Subtract Mean
“Pandas UDFs perform much
better than Python UDFs,
ranging from 3x to over 100x.”
13
Apache Arrow
“A cross-language development platform for in-memory data”
https://arrow.apache.org/
• The latest release
- 0.11.0 (2018/10/08)
• Columnar In-Memory
• docs/memory_layout.html
PySpark supports Arrow >= "0.8.0"
• "0.10.0" is recommended
14
Apache Arrow and Pandas UDF
• Use Arrow to Serialize/Deserialize data
• Streaming format for Interprocess messaging / communication (IPC)
• ArrowWriter and ArrowColumnVector
• Communicate between the JVM and the Python worker via a socket
• ArrowPythonRunner
• worker.py
• Physical Operators for each PandasUDFType
• ArrowEvalPythonExec
• FlatMapGroupsInPandasExec
• AggregateInPandasExec
15
Overview of Pandas UDF execution
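[Diagram: a PhysicalOperator passes groups of rows to ArrowPythonRunner; ArrowWriter turns them into Arrow RecordBatches, which ArrowStreamPandasSerializer converts to Pandas before the UDF is invoked. Results flow back through ArrowStreamPandasSerializer as RecordBatches and reach the PhysicalOperator as ColumnarBatches of ArrowColumnVectors.]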
16
Arrow IPC format and Converters
[Diagram: Pandas UDF execution overview, repeated.]
17
Encapsulated message format
• https://arrow.apache.org/docs/ipc.html
• Messages
• Schema, RecordBatch, DictionaryBatch, Tensor
• Formats
• Streaming format
– Schema + (DictionaryBatch + RecordBatch)+
• File format
– header + (Streaming format) + footer
Pandas UDFs use Streaming format.
18
Arrow Converters in Spark
in Java/Scala
• ArrowWriter [src]
• A wrapper for writing VectorSchemaRoot and ValueVectors
• ArrowColumnVector [src]
• A wrapper for reading ValueVectors, works with ColumnarBatch
in Python
• ArrowStreamPandasSerializer [src]
• A wrapper for RecordBatchReader and RecordBatchWriter
19
Handling Communication
[Diagram: Pandas UDF execution overview, repeated.]
20
Handling Communication
ArrowPythonRunner [src]
• Handle the communication between JVM and the Python
worker
• Create or reuse a Python worker
• Open a Socket to communicate
• Write data to the socket with ArrowWriter in a separate thread
• Read data from the socket
• Return an iterator of ColumnarBatch of ArrowColumnVectors
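The pattern of writing in a separate thread while the caller reads results can be sketched with the standard library. This is not Spark's wire protocol: the length-prefixed framing, the function names, and the use of a thread in place of a separate Python worker process are all assumptions made to keep the example self-contained.

```python
import socket
import struct
import threading


def send_batches(sock, batches):
    """Send each non-empty payload with a 4-byte length prefix, then a 0-length end marker."""
    for payload in batches:
        sock.sendall(struct.pack(">I", len(payload)) + payload)
    sock.sendall(struct.pack(">I", 0))


def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed mid-message")
        buf += chunk
    return buf


def recv_batches(sock):
    """Yield payloads until the end marker arrives."""
    while True:
        (size,) = struct.unpack(">I", recv_exact(sock, 4))
        if size == 0:
            return
        yield recv_exact(sock, size)


def worker(sock):
    """Stand-in for the Python worker: upper-case each payload and send it back."""
    send_batches(sock, (p.upper() for p in recv_batches(sock)))


jvm_side, worker_side = socket.socketpair()
worker_thread = threading.Thread(target=worker, args=(worker_side,))
worker_thread.start()

# Writer thread feeds the input while the main thread consumes the results.
inputs = [b"batch-1", b"batch-2"]
writer_thread = threading.Thread(target=send_batches, args=(jvm_side, inputs))
writer_thread.start()
results = list(recv_batches(jvm_side))

writer_thread.join()
worker_thread.join()
jvm_side.close()
worker_side.close()
```

Writing from a separate thread means the reader can start draining results before all input has been sent, so neither side stalls on a full socket buffer.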
21
Physical Operators
[Diagram: Pandas UDF execution overview, repeated.]
22
Physical Operators
Create an RDD to execute the UDF.
• There is one operator per PandasUDFType
• Group input data and pass to ArrowPythonRunner
• SCALAR: every configured number of rows
– “spark.sql.execution.arrow.maxRecordsPerBatch” (10,000 by default)
• GROUPED_XXX: every group
• Read the result iterator of ColumnarBatch
• Return the iterator of rows over ColumnarBatches
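The two grouping strategies can be sketched with plain Python iterators. The helper names are hypothetical and the batch size of 2 merely stands in for `spark.sql.execution.arrow.maxRecordsPerBatch`. The grouped variant assumes rows with the same key are already adjacent.

```python
from itertools import groupby, islice


def batches_of(rows, max_records_per_batch):
    """SCALAR-style batching: cut the row iterator every N rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, max_records_per_batch))
        if not batch:
            return
        yield batch


def batches_by_group(rows, key):
    """GROUPED-style batching: one batch per group of adjacent rows with the same key."""
    for _, group in groupby(rows, key=key):
        yield list(group)


rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]

scalar_batches = list(batches_of(rows, 2))
grouped_batches = list(batches_by_group(rows, key=lambda r: r[0]))
```

With a fixed row count a batch may span groups, which is harmless for a SCALAR UDF; the grouped UDF types need each group delivered whole.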
23
Python worker
[Diagram: Pandas UDF execution overview, repeated.]
24
Python worker
worker.py [src]
• Open a Socket to communicate
• Set up a UDF execution for each PandasUDFType
• Create a map function
– prepare the arguments
– invoke the UDF
– check and return the result
• Execute the map function over the input iterator of Pandas
DataFrame
• Write back the results
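The map-function steps above can be sketched as follows. This is not the code in worker.py: `wrap_scalar_udf`, `arg_offsets` and `add_columns` are illustrative names, and plain lists stand in for pandas Series so the example runs without pandas.

```python
def wrap_scalar_udf(udf, arg_offsets):
    """Build the per-batch map function for a SCALAR-style UDF (hypothetical helper)."""
    def mapper(columns):
        args = [columns[i] for i in arg_offsets]   # prepare the arguments
        result = udf(*args)                        # invoke the UDF
        if len(result) != len(args[0]):            # check the result
            raise ValueError(
                "UDF returned %d values for %d input rows"
                % (len(result), len(args[0]))
            )
        return result                              # return the result
    return mapper


def add_columns(a, b):
    """Example UDF: element-wise sum of two columns."""
    return [x + y for x, y in zip(a, b)]


# Each input batch is a list of columns; plain lists stand in for pandas.Series.
input_batches = iter([
    [[1, 2], [10, 20]],
    [[3], [30]],
])

mapper = wrap_scalar_udf(add_columns, arg_offsets=[0, 1])
outputs = list(map(mapper, input_batches))  # collecting stands in for writing back
```

Checking the length inside the wrapper surfaces a misbehaving UDF as a clear error in the worker rather than as a corrupted batch on the JVM side.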
26
Work In Progress
We can track issues related to Pandas UDF.
• [SPARK-22216] Improving PySpark/Pandas interoperability
• 37 subtasks in total
• 3 subtasks are in progress
• 4 subtasks are open
27
Work In Progress
• Window Pandas UDF
• [SPARK-24561] User-defined window functions with pandas udf
(bounded window)
• Performance Improvement of toPandas -> merged!
• [SPARK-25274] Improve toPandas with Arrow by sending out-of-order
record batches
• SparkR
• [SPARK-25981] Arrow optimization for conversion from R DataFrame
to Spark DataFrame
29
Follow-up Events
Spark Developers Meetup
• 2018/12/15 (Sat) 10:00-18:00
• @ Yahoo! LODGE
• https://passmarket.yahoo.co.jp/event/show/detail/01a98dzxfauj.html
30
Follow-up Events
Hadoop/Spark Conference Japan 2019
• 2019/03/14 (Thu)
• @ Oi-machi
• http://hadoop.apache.jp/
31
Follow-up Events
Spark+AI Summit 2019
• 2019/04/23 (Tue) - 04/25 (Thu)
• @ Moscone West Convention Center, San Francisco
• https://databricks.com/sparkaisummit/north-america
Thank you!
33
Appendix
How to contribute?
• See: Contributing to Spark
• Open an issue on JIRA
• Send a pull-request at GitHub
• Communicate with committers and reviewers
• Congratulations!
Thanks for your contributions!
34
Appendix
• PySpark Usage Guide for Pandas with Apache Arrow
• https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html
• Vectorized UDF: Scalable Analysis with Python and PySpark
• https://databricks.com/session/vectorized-udf-scalable-analysis-with-python-and-pyspark
• Demo for Apache Arrow Tokyo Meetup 2018
• https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/142158605138935/3546232059139201/7497868276316206/latest.html

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

Apache Arrow and Pandas UDF on Apache Spark

  • 1. Apache Arrow and Pandas UDF on Apache Spark Takuya UESHIN 2018-12-08, Apache Arrow Tokyo Meetup 2018
  • 2. About Me - Software Engineer @databricks - Apache Spark Committer - Twitter: @ueshin - GitHub: github.com/ueshin
  • 3. Agenda • Apache Spark and PySpark • PySpark and Pandas • Python UDF and Pandas UDF • Pandas UDF and Apache Arrow • Arrow IPC format and Converters • Handling Communication • Physical Operators • Python worker • Work In Progress • Follow-up Events
  • 5. Apache Spark and PySpark “Apache Spark™ is a unified analytics engine for large-scale data processing.” https://spark.apache.org/ • The latest release: 2.4.0 (2018/11/02) • PySpark is a Python API • SparkR is an R API
  • 6. PySpark and Pandas “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.” • https://pandas.pydata.org/ • The latest release: v0.23.4 Final (2018/08/03) • PySpark supports Pandas >= "0.19.2"
  • 7. PySpark and Pandas PySpark can convert data between PySpark DataFrame and Pandas DataFrame. • pdf = df.toPandas() • df = spark.createDataFrame(pdf) We can use Arrow as an intermediate format by setting config: “spark.sql.execution.arrow.enabled” to “true” (“false” by default).
  • 8. Python UDF and Pandas UDF • UDF: User Defined Function • Python UDF • Serialize/Deserialize data with Pickle • Fetch data block, but invoke UDF row by row • Pandas UDF • Serialize/Deserialize data with Arrow • Fetch data block, and invoke UDF block by block • PandasUDFType: SCALAR, GROUPED_MAP, GROUPED_AGG We don’t need any config, but the declaration is different.
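  • 9. Python UDF and Pandas UDF
        from pyspark.sql.functions import udf, pandas_udf, PandasUDFType

        # Python UDF: invoked once per row
        @udf('double')
        def plus_one(v):
            return v + 1

        # Pandas UDF: invoked once per block, with v as a Pandas Series
        @pandas_udf('double', PandasUDFType.SCALAR)
        def pandas_plus_one(v):
            return v + 1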
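The difference in invocation granularity can be sketched without Spark. The following is a minimal illustration only: plain Python lists stand in for the fetched data blocks, and the helper names (`run_row_by_row`, `run_block_by_block`) are hypothetical, not Spark APIs. It shows that both styles produce the same values while the block-wise style makes far fewer calls into the user function.

```python
def plus_one_row(v):
    """Python UDF style: receives a single value."""
    return v + 1


def plus_one_block(values):
    """Pandas UDF style: receives a whole block of values at once."""
    return [v + 1 for v in values]


def run_row_by_row(blocks, udf):
    """Fetch data block by block, but call the UDF once per row."""
    out, calls = [], 0
    for block in blocks:
        for v in block:
            out.append(udf(v))
            calls += 1
    return out, calls


def run_block_by_block(blocks, udf):
    """Fetch data block by block, and call the UDF once per block."""
    out, calls = [], 0
    for block in blocks:
        out.extend(udf(block))
        calls += 1
    return out, calls


# Two blocks holding five rows in total (lists stand in for Arrow/Pandas data).
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0]]
row_result, row_calls = run_row_by_row(blocks, plus_one_row)
block_result, block_calls = run_block_by_block(blocks, plus_one_block)
```

Here the row-wise runner calls the function five times and the block-wise runner twice; with real blocks of thousands of rows the gap in call overhead grows accordingly.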
  • 10. Python UDF and Pandas UDF • SCALAR • A transformation: One or more Pandas Series -> One Pandas Series • The length of the returned Pandas Series must be the same as that of the input Pandas Series • GROUPED_MAP • A transformation: One Pandas DataFrame -> One Pandas DataFrame • The length of the returned Pandas DataFrame can be arbitrary • GROUPED_AGG • A transformation: One or more Pandas Series -> One scalar • The returned value type should be a primitive data type
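The three input/output contracts above can be mimicked in plain Python. This is a sketch under stated assumptions: lists stand in for Pandas Series, lists of dicts stand in for Pandas DataFrames, and the `apply_*` helpers are hypothetical names rather than anything in PySpark.

```python
from collections import defaultdict


def apply_scalar(udf, *columns):
    """SCALAR: one or more columns in, one column of the same length out."""
    result = udf(*columns)
    if len(result) != len(columns[0]):
        raise ValueError("SCALAR result length must match the input length")
    return result


def apply_grouped_map(udf, rows, key):
    """GROUPED_MAP: one group of rows in, any number of rows out."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for group in groups.values():
        out.extend(udf(group))
    return out


def apply_grouped_agg(udf, rows, key, column):
    """GROUPED_AGG: one column per group in, one scalar per group out."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {k: udf(values) for k, values in groups.items()}


def subtract_mean(group):
    """Example GROUPED_MAP function: center column v within its group."""
    mean = sum(r["v"] for r in group) / len(group)
    return [{"id": r["id"], "v": r["v"] - mean} for r in group]


rows = [{"id": 1, "v": 1.0}, {"id": 1, "v": 3.0}, {"id": 2, "v": 10.0}]

scalar_out = apply_scalar(lambda a, b: [x + y for x, y in zip(a, b)], [1, 2], [10, 20])
mapped = apply_grouped_map(subtract_mean, rows, "id")
means = apply_grouped_agg(lambda vs: sum(vs) / len(vs), rows, "id", "v")
```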
  • 11. Performance: Python UDF vs Pandas UDF From a blog post: Introducing Pandas UDF for PySpark • Plus One • Cumulative Probability • Subtract Mean “Pandas UDFs perform much better than Python UDFs, ranging from 3x to over 100x.”
  • 13. Apache Arrow “A cross-language development platform for in-memory data” https://arrow.apache.org/ • The latest release: 0.11.0 (2018/10/08) • Columnar In-Memory • docs/memory_layout.html • PySpark supports Arrow >= "0.8.0" • "0.10.0" is recommended
  • 14. Apache Arrow and Pandas UDF • Use Arrow to Serialize/Deserialize data • Streaming format for Interprocess messaging / communication (IPC) • ArrowWriter and ArrowColumnVector • Communication between the JVM and the Python worker via a Socket • ArrowPythonRunner • worker.py • Physical Operators for each PythonUDFType • ArrowEvalPythonExec • FlatMapGroupsInPandasExec • AggregateInPandasExec
  • 15. Overview of Pandas UDF execution [Diagram. Labels: PhysicalOperator, groups of rows, ArrowWriter, ArrowPythonRunner, Arrow RecordBatches, ArrowStreamPandasSerializer, Pandas, Invoke UDF, ArrowColumnVectors, ColumnarBatches]
  • 16. Arrow IPC format and Converters [Diagram repeated from slide 15]
  • 17. Encapsulated message format • https://arrow.apache.org/docs/ipc.html • Messages • Schema, RecordBatch, DictionaryBatch, Tensor • Formats • Streaming format – Schema + (DictionaryBatch + RecordBatch)+ • File format – header + (Streaming format) + footer Pandas UDFs use Streaming format.
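The shape of a streaming format, one schema message followed by one or more record batches, can be illustrated with a toy framing. This is a deliberately simplified sketch: the length-prefixed JSON encoding, the zero-length end marker, and the function names below are assumptions made for illustration and are not Arrow's actual wire encoding. Dictionary batches are omitted.

```python
import io
import json
import struct


def write_message(stream, kind, body):
    """Write one length-prefixed message (toy encoding, not Arrow's)."""
    payload = json.dumps({"kind": kind, "body": body}).encode("utf-8")
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)


def write_stream(stream, schema, batches):
    """Streaming layout: the schema first, then the record batches."""
    write_message(stream, "schema", schema)
    for batch in batches:
        write_message(stream, "record_batch", batch)
    stream.write(struct.pack("<I", 0))  # toy end-of-stream marker


def read_stream(stream):
    """Return the schema and a lazy iterator over the record batches."""
    def messages():
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack("<I", header)
            if length == 0:
                return
            yield json.loads(stream.read(length).decode("utf-8"))

    it = messages()
    first = next(it, None)
    if first is None or first["kind"] != "schema":
        raise ValueError("a stream must start with a schema message")
    return first["body"], (m["body"] for m in it)


buf = io.BytesIO()
write_stream(buf, {"fields": ["v"]}, [{"v": [1.0, 2.0]}, {"v": [3.0]}])
buf.seek(0)
schema, batch_iter = read_stream(buf)
batches = list(batch_iter)
```

Because the reader only needs the schema before it can start consuming batches, a layout like this suits a socket where the total number of batches is not known in advance.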
  • 18. Arrow Converters in Spark • In Java/Scala • ArrowWriter • A wrapper for writing VectorSchemaRoot and ValueVectors • ArrowColumnVector • A wrapper for reading ValueVectors, works with ColumnarBatch • In Python • ArrowStreamPandasSerializer • A wrapper for RecordBatchReader and RecordBatchWriter
  • 19. Handling Communication [Diagram repeated from slide 15]
  • 20. Handling Communication ArrowPythonRunner • Handle the communication between JVM and the Python worker • Create or reuse a Python worker • Open a Socket to communicate • Write data to the socket with ArrowWriter in a separate thread • Read data from the socket • Return an iterator of ColumnarBatch of ArrowColumnVectors
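The communication pattern listed above (write input on a separate thread, read results on the calling thread, expose them as an iterator) can be sketched with the standard library alone. Everything here is a stand-in chosen for illustration: a `socket.socketpair()` replaces the real JVM-to-Python connection, length-prefixed JSON replaces Arrow, and the names `run_udf` and `worker` are hypothetical. One plausible reason for a dedicated writer thread, assumed here rather than stated in the slides, is that it keeps either side from blocking forever once socket buffers fill up.

```python
import json
import socket
import struct
import threading


def send_msg(sock, obj):
    """Send one length-prefixed JSON message."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("<I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closed first."""
    data = bytearray()
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            return None
        data.extend(chunk)
    return bytes(data)


def recv_msg(sock):
    """Receive one message, or None at end of stream."""
    header = recv_exact(sock, 4)
    if header is None:
        return None
    (length,) = struct.unpack("<I", header)
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def worker(sock):
    """Stand-in for the Python worker: apply plus-one to every batch."""
    with sock:
        while True:
            batch = recv_msg(sock)
            if batch is None:
                break
            send_msg(sock, [v + 1 for v in batch])


def run_udf(batches):
    """Stand-in for the runner: returns an iterator of result batches."""
    runner_sock, worker_sock = socket.socketpair()
    threading.Thread(target=worker, args=(worker_sock,), daemon=True).start()

    def write_input():
        for batch in batches:
            send_msg(runner_sock, batch)
        runner_sock.shutdown(socket.SHUT_WR)  # signal end of input

    writer = threading.Thread(target=write_input)
    writer.start()
    try:
        while True:
            result = recv_msg(runner_sock)
            if result is None:
                break
            yield result
    finally:
        writer.join()
        runner_sock.close()


results = list(run_udf([[1, 2], [3]]))
```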
  • 22. Physical Operators Create an RDD to execute the UDF. • There are several operators for each PythonUDFType • Group input data and pass to ArrowPythonRunner • SCALAR: every configured number of rows – “spark.sql.execution.arrow.maxRecordsPerBatch” (10,000 by default) • GROUPED_XXX: every group • Read the result iterator of ColumnarBatch • Return the iterator of rows over ColumnarBatches
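For the SCALAR case, grouping the input every fixed number of rows and later flattening the result batches back into rows can be sketched as below. This is a minimal illustration: `batch_rows` and `flatten` are hypothetical helper names, the default of 10,000 mirrors the configuration default quoted above, and the sketch assumes a positive batch size.

```python
from itertools import islice


def batch_rows(rows, max_records_per_batch=10_000):
    """Yield lists of at most max_records_per_batch rows, lazily."""
    if max_records_per_batch <= 0:
        raise ValueError("this sketch assumes a positive batch size")
    it = iter(rows)
    while True:
        batch = list(islice(it, max_records_per_batch))
        if not batch:
            return
        yield batch


def flatten(batches):
    """Return an iterator of rows over an iterator of batches."""
    for batch in batches:
        yield from batch


# 25,000 input rows are split into batches of 10,000, 10,000 and 5,000.
sizes = [len(b) for b in batch_rows(range(25_000))]
round_trip = list(flatten(batch_rows(range(7), max_records_per_batch=3)))
```

Both helpers are generators, so only one batch needs to be held in memory at a time.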
  • 23. Python worker [Diagram repeated from slide 15]
  • 24. Python worker worker.py • Open a Socket to communicate • Set up a UDF execution for each PythonUDFType • Create a map function – prepare the arguments – invoke the UDF – check and return the result • Execute the map function over the input iterator of Pandas DataFrame • Write back the results
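The "create a map function, then execute it over the input iterator" steps can be sketched for the SCALAR case as follows. This is an illustration under assumptions, not the code in worker.py: each input batch is modeled as a list of columns, and `make_scalar_map`, `run_worker`, and the `arg_offsets` parameter are names chosen for this sketch.

```python
def make_scalar_map(udf, arg_offsets):
    """Build a function that applies udf to selected columns of one batch."""
    def map_batch(columns):
        # Prepare the arguments: pick the columns the UDF was declared on.
        args = [columns[i] for i in arg_offsets]
        # Invoke the UDF once for the whole batch.
        result = udf(*args)
        # Check the result: a SCALAR UDF must preserve the number of rows.
        if len(result) != len(args[0]):
            raise RuntimeError(
                "result length %d does not match input length %d"
                % (len(result), len(args[0]))
            )
        return result
    return map_batch


def run_worker(input_batches, map_batch):
    """Lazily execute the map function over the input iterator."""
    return map(map_batch, input_batches)


# Each batch has three columns; the UDF adds columns 0 and 2.
add = make_scalar_map(lambda a, b: [x + y for x, y in zip(a, b)], [0, 2])
input_batches = [
    [[1, 2], ["x", "y"], [10, 20]],
    [[3], ["z"], [30]],
]
results = list(run_worker(input_batches, add))
```

A UDF that drops rows, for example `lambda a: a[:1]` applied to a two-row batch, fails the check and raises `RuntimeError` instead of silently returning misaligned data.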
  • 26. Work In Progress We can track issues related to Pandas UDF. • [SPARK-22216] Improving PySpark/Pandas interoperability • 37 subtasks in total • 3 subtasks are in progress • 4 subtasks are open
  • 27. Work In Progress • Window Pandas UDF • [SPARK-24561] User-defined window functions with pandas udf (bounded window) • Performance Improvement of toPandas -> merged! • [SPARK-25274] Improve toPandas with Arrow by sending out-of-order record batches • SparkR • [SPARK-25981] Arrow optimization for conversion from R DataFrame to Spark DataFrame
  • 29. Follow-up Events Spark Developers Meetup • 2018/12/15 (Sat) 10:00-18:00 • @ Yahoo! LODGE • https://passmarket.yahoo.co.jp/event/show/detail/01a98dzxfauj.html
  • 30. Follow-up Events Hadoop/Spark Conference Japan 2019 • 2019/03/14 (Thu) • @ Oi-machi • http://hadoop.apache.jp/
  • 31. Follow-up Events Spark+AI Summit 2019 • 2019/04/23 (Tue) - 04/25 (Thu) • @ Moscone West Convention Center, San Francisco • https://databricks.com/sparkaisummit/north-america
  • 33. Appendix How to contribute? • See: Contributing to Spark • Open an issue on JIRA • Send a pull-request at GitHub • Communicate with committers and reviewers • Congratulations! Thanks for your contributions!
  • 34. Appendix • PySpark Usage Guide for Pandas with Apache Arrow • https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html • Vectorized UDF: Scalable Analysis with Python and PySpark • https://databricks.com/session/vectorized-udf-scalable-analysis-with-python-and-pyspark • Demo for Apache Arrow Tokyo Meetup 2018 • https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/142158605138935/3546232059139201/7497868276316206/latest.html