User-defined functions (UDFs) are an important feature of Spark SQL that extends the language with custom constructs. UDFs are very useful for extending Spark's vocabulary but come with significant performance overhead. They are black boxes to the Spark optimizer, blocking several helpful optimizations such as WholeStageCodegen and null optimizations. They also carry a heavy processing cost for String functions, which require UTF-8 to UTF-16 conversions that slow down Spark jobs and increase memory requirements. In this talk, we will go over how, at Informatica, we optimized UDFs to be as performant as Spark native functions in both time and memory, and allowed these functions to participate in Spark's optimization steps.
4. Agenda
▪ Introduction to UDF
▪ Example and benefits of UDF
▪ Performance considerations
▪ Suggested Alternatives
▪ Performance results
▪ Questions/Feedback
5. Introduction to UDFs
▪ User-defined functions (UDFs) are a feature of Spark SQL for defining new Column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets.
▪ UDFs are a key feature of most SQL environments and extend the system's built-in functionality.
▪ Custom functions can be defined and registered as UDFs in Spark SQL with an associated alias that is made available to SQL
queries.
6. Typical Example
▪ Define a UDF in Scala
val plusOne = udf((x: Int) => x + 1)
▪ Register the UDF
spark.udf.register("plusOne", plusOne)
▪ Usage
spark.sql("SELECT plusOne(5)").show()
// +------+
// |UDF(5)|
// +------+
// | 6|
// +------+
7. Benefits of UDFs
▪ Extends the in-built capabilities of Spark SQL.
▪ Simple and straightforward to implement.
▪ Plug and play architecture.
▪ Define once and use across multiple dataframes.
▪ Backward compatible. Stable API not impacted by version upgrades.
8. Performance concerns with UDFs
▪ UDFs are a black box to Spark's optimizer.
▪ UDFs block many Spark optimizations, such as:
▪ WholeStageCodegen
▪ Null Optimizations
▪ Predicate Pushdown
▪ More optimizations from Catalyst Optimizer
▪ String Handling within UDFs
▪ UTF-8 to UTF-16 conversion: Spark maintains strings in UTF-8 encoding, whereas the Java runtime encodes strings in UTF-16.
▪ Any String input to a UDF requires a UTF-8 to UTF-16 conversion.
▪ Conversely, a String output requires a UTF-16 to UTF-8 conversion.
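To make the conversion cost concrete, here is a minimal, Spark-free sketch of what happens per row when a UDF takes and returns a String. The `udfRoundTrip` name and the `toUpperCase` body are illustrative, not from the talk; Spark's actual row data lives in `UTF8String`, but the decode/encode round trip is the same shape:

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Spark stores row data as UTF-8 bytes (UTF8String); Java Strings are UTF-16.
// A String-typed UDF therefore forces a decode on input and an encode on
// output for every row -- two allocations plus a charset conversion each way.
def udfRoundTrip(utf8Bytes: Array[Byte]): Array[Byte] = {
  val javaString = new String(utf8Bytes, UTF_8) // UTF-8 -> UTF-16 decode (input)
  val result = javaString.toUpperCase            // the actual UDF work
  result.getBytes(UTF_8)                         // UTF-16 -> UTF-8 encode (output)
}

val out = udfRoundTrip("hello".getBytes(UTF_8))
new String(out, UTF_8) // "HELLO"
```

Both temporary objects (`javaString` and `result`) become garbage immediately, which is why GC pressure grows with row count.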
13. Redesign UDF implementation
▪ Implement UDFs as Spark native functions.
▪ Design goals for the reimplementation:
▪ Extend Spark's capabilities with minimal changes to the existing Spark code.
▪ Ability to upgrade to later Spark versions without significant engineering effort.
15. Reimplementing Spark UDFs as Spark native
▪ Extend from Spark’s Expression class
▪ UnaryExpression – single-argument expressions
▪ BinaryExpression / TernaryExpression – two- and three-argument expressions
▪ Satisfy the Expression contract (dataType, nullSafeEval and/or doGenCode)
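A minimal sketch of what satisfying that contract can look like for the earlier plusOne example, assuming the Spark Catalyst APIs (`UnaryExpression`, `nullSafeEval`, `defineCodeGen`). The `PlusOneNative` class is illustrative, and exact signatures vary across Spark versions:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{DataType, IntegerType}

// Hypothetical native reimplementation of the earlier plusOne UDF.
// Extending UnaryExpression provides null handling for free and lets the
// expression take part in whole-stage code generation.
case class PlusOneNative(child: Expression) extends UnaryExpression {

  override def dataType: DataType = IntegerType

  // Interpreted path: invoked only for non-null input.
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[Int] + 1

  // Codegen path: emits Java source that is fused into the generated stage.
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"($c) + 1")

  // Required by the tree API on Spark 3.2+; a harmless extra method earlier.
  protected def withNewChildInternal(newChild: Expression): PlusOneNative =
    copy(child = newChild)
}
```

Because both paths are visible to Catalyst, the null check and codegen fusion that a black-box UDF forfeits are applied automatically.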
16. Reimplementing Spark UDFs as Spark native
▪ Examples of existing Spark functions can be found under:
▪ sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/
18. Reimplementing Spark UDFs as Spark native functions
▪ Add the new function definition file to the org.apache.spark.sql package.
▪ Compile and add the jar to the spark/jars folder.
20. Reimplementing - SQL
▪ Functions require editing FunctionRegistry.scala if SQL support is needed.
▪ sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
▪ Register your function
▪ expression[PlusOneNative]("plusonenative")
▪ Recompile spark code to generate spark-catalyst_*.jar
▪ Edit the pom to add a dependency on the previously created jar.
▪ Replace under spark/jars location
23. More Tips and Tricks
▪ Make a conscious effort to avoid temporary object allocation.
▪ Use Scala's while construct instead of for.
▪ for creates temporary objects (a Range and a closure).
▪ Prefer imperative style over functional style.
▪ Consider using a thread-local variable to allocate a reusable temporary buffer.
▪ When a UTF-8 to UTF-16 conversion is required, consider converting lazily, based on whether the string actually contains characters that need it.
▪ Inspect the string's UTF-8 bytes for non-ASCII characters.
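The while-over-for and byte-inspection tips above can be sketched in plain Scala (function names are illustrative):

```scala
// Scala's for-comprehension desugars to Range.foreach with a closure,
// adding per-iteration overhead; a while loop compiles to a plain JVM loop.
def sumFor(xs: Array[Int]): Long = {
  var total = 0L
  for (i <- xs.indices) total += xs(i) // closure-based iteration
  total
}

def sumWhile(xs: Array[Int]): Long = {
  var total = 0L
  var i = 0
  while (i < xs.length) { // imperative style: no closures, no temporaries
    total += xs(i)
    i += 1
  }
  total
}

// Lazy-conversion check: if every UTF-8 byte is below 0x80 the string is
// pure ASCII, so it can be processed byte-wise and the UTF-16 decode skipped.
def isAscii(bytes: Array[Byte]): Boolean = {
  var i = 0
  while (i < bytes.length) {
    if ((bytes(i) & 0x80) != 0) return false
    i += 1
  }
  true
}

sumWhile(Array(1, 2, 3)) // 6L
```

`isAscii` is itself written with a while loop over the raw bytes, so the inspection adds no allocation of its own.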
24. Performance Comparison – String function
▪ Performance improvement of about 20% to 200%.
▪ Overhead of UTF-8 to UTF-16 conversion avoided.

Time (hh:mm:ss) by scale factor:
                        SF1      SF10     SF50
optimization disabled   0:02:57  0:04:13  0:18:29
optimization enabled    0:02:38  0:03:26  0:06:21

[Chart: String function runtime vs. scale factor, optimization disabled vs. enabled]
25. Performance Comparison – Date function
▪ Performance improvement of about 20% to 100%.
▪ Avoided creating temporary objects.
▪ Used imperative-style programming.
▪ Used while instead of for.

Time (hh:mm:ss) by scale factor:
                        SF1      SF10     SF50
optimization disabled   0:03:10  0:04:28  0:10:09
optimization enabled    0:02:38  0:02:41  0:05:39

[Chart: Date function runtime vs. scale factor, optimization disabled vs. enabled]
26. Performance Comparison – Numeric function
▪ Performance improvement of about 15% to 50%.

Time (hh:mm:ss) by scale factor:
                        SF1      SF10     SF50
optimization disabled   0:04:15  0:06:07  0:10:41
optimization enabled    0:03:36  0:04:04  0:09:15

[Chart: Numeric function runtime vs. scale factor, optimization disabled vs. enabled]
27. Performance comparison - Summary
▪ 200% faster for certain String functions with large datasets (50GB).
▪ 50%-100% faster for date and numeric functions.
▪ The performance difference becomes more noticeable with larger datasets:
▪ Conversion and optimization costs go up.
▪ Garbage collection overhead becomes a significant contributor to the overall execution time.