User-defined functions (UDFs) are an important feature of Spark SQL that extends the language with custom constructs. UDFs are very useful for extending Spark's vocabulary but come with significant performance overhead. They are black boxes to the Spark optimizer, blocking several helpful optimizations such as WholeStageCodegen and null optimizations. They also carry a heavy processing cost for String functions, which require UTF-8 to UTF-16 conversions that slow down Spark jobs and increase memory requirements. In this talk, we will go over how, at Informatica, we optimized UDFs to be as performant as Spark native functions in both time and memory, and allowed these functions to participate in Spark's optimization steps.
4. Agenda
▪ Introduction to UDF
▪ Example and benefits of UDF
▪ Performance considerations
▪ Suggested Alternatives
▪ Performance results
▪ Questions/Feedback
5. Introduction to UDFs
▪ User-defined functions (UDFs) are a feature of Spark SQL for defining new Column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets.
▪ UDFs are a key feature of most SQL environments and extend the system's built-in functionality.
▪ Custom functions can be defined and registered as UDFs in Spark SQL with an associated alias that is made available to SQL
queries.
6. Typical Example
▪ Define a UDF in Scala
val plusOne = udf((x: Int) => x + 1)
▪ Register the UDF
spark.udf.register("plusOne", plusOne)
▪ Usage
spark.sql("SELECT plusOne(5)").show()
// +------+
// |UDF(5)|
// +------+
// | 6|
// +------+
7. Benefits of UDFs
▪ Extends the in-built capabilities of Spark SQL.
▪ Simple and straightforward to implement.
▪ Plug and play architecture.
▪ Define once and use across multiple dataframes.
▪ Backward compatible. Stable API not impacted by version upgrades.
8. Performance concerns with UDFs
▪ UDFs are a black box to Spark's optimizer.
▪ UDFs block many Spark optimizations, such as:
▪ WholeStageCodegen
▪ Null Optimizations
▪ Predicate Pushdown
▪ More optimizations from Catalyst Optimizer
▪ String Handling within UDFs
▪ UTF-8 to UTF-16 conversion: Spark maintains strings in UTF-8 encoding, whereas the Java runtime encodes strings in UTF-16.
▪ Any String input to a UDF requires a UTF-8 to UTF-16 conversion.
▪ Conversely, a String output requires a UTF-16 to UTF-8 conversion.
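To make the conversion cost concrete, here is a minimal, Spark-free sketch of what happens per row when a UDF takes and returns a String. The `udfRoundTrip` name and the `toUpperCase` body are illustrative, not from the talk; Spark's actual row data lives in `UTF8String`, but the decode/encode round trip is the same shape:

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Spark stores row data as UTF-8 bytes (UTF8String); Java Strings are UTF-16.
// A String-typed UDF therefore forces a decode on input and an encode on
// output for every row -- two allocations plus a charset conversion each way.
def udfRoundTrip(utf8Bytes: Array[Byte]): Array[Byte] = {
  val javaString = new String(utf8Bytes, UTF_8) // UTF-8 -> UTF-16 decode (input)
  val result = javaString.toUpperCase            // the actual UDF work
  result.getBytes(UTF_8)                         // UTF-16 -> UTF-8 encode (output)
}

val out = udfRoundTrip("hello".getBytes(UTF_8))
new String(out, UTF_8) // "HELLO"
```

Both temporary objects (`javaString` and `result`) become garbage immediately, which is why GC pressure grows with row count.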
13. Redesign UDF implementation
▪ Implement UDFs as Spark native functions.
▪ Design goals for the reimplementation:
▪ Extend Spark's capabilities with minimal changes to the existing Spark code.
▪ Ability to upgrade to later Spark versions without significant engineering effort.
15. Reimplementing Spark UDFs as Spark native
▪ Extend from Spark’s Expression class
▪ UnaryExpression – single-argument expressions
▪ BinaryExpression / TernaryExpression – two- and three-argument expressions
▪ Satisfy the Expression contract (dataType, nullSafeEval and/or doGenCode)
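A minimal sketch of what satisfying that contract can look like for the earlier plusOne example, assuming the Spark Catalyst APIs (`UnaryExpression`, `nullSafeEval`, `defineCodeGen`). The `PlusOneNative` class is illustrative, and exact signatures vary across Spark versions:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{DataType, IntegerType}

// Hypothetical native reimplementation of the earlier plusOne UDF.
// Extending UnaryExpression provides null handling for free and lets the
// expression take part in whole-stage code generation.
case class PlusOneNative(child: Expression) extends UnaryExpression {

  override def dataType: DataType = IntegerType

  // Interpreted path: invoked only for non-null input.
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[Int] + 1

  // Codegen path: emits Java source that is fused into the generated stage.
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"($c) + 1")

  // Required by the tree API on Spark 3.2+; a harmless extra method earlier.
  protected def withNewChildInternal(newChild: Expression): PlusOneNative =
    copy(child = newChild)
}
```

Because both paths are visible to Catalyst, the null check and codegen fusion that a black-box UDF forfeits are applied automatically.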
16. Reimplementing Spark UDFs as Spark native
▪ Examples of existing Spark functions can be found under:
▪ sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/
18. Reimplementing Spark UDFs as Spark native functions
▪ Add the new function definition file to the org.apache.spark.sql package.
▪ Compile and add the jar to the spark/jars folder.
20. Reimplementing - SQL
▪ Functions require editing FunctionRegistry.scala if SQL support is needed.
▪ sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
▪ Register your function
▪ expression[PlusOneNative]("plusonenative")
▪ Recompile spark code to generate spark-catalyst_*.jar
▪ Edit the pom to add a dependency on the previously created jar.
▪ Replace under spark/jars location
23. More Tips and Tricks
▪ Make a conscious effort to avoid temporary object allocation.
▪ Use Scala's while construct instead of for.
▪ for creates temporary objects (a Range and a closure).
▪ Prefer imperative style over functional style.
▪ Consider using a thread-local variable to allocate a reusable temporary buffer.
▪ When a UTF-8 to UTF-16 conversion is required, consider converting lazily, based on whether the string actually contains characters that need it.
▪ Inspect the string's UTF-8 bytes for non-ASCII characters.
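The while-over-for and byte-inspection tips above can be sketched in plain Scala (function names are illustrative):

```scala
// Scala's for-comprehension desugars to Range.foreach with a closure,
// adding per-iteration overhead; a while loop compiles to a plain JVM loop.
def sumFor(xs: Array[Int]): Long = {
  var total = 0L
  for (i <- xs.indices) total += xs(i) // closure-based iteration
  total
}

def sumWhile(xs: Array[Int]): Long = {
  var total = 0L
  var i = 0
  while (i < xs.length) { // imperative style: no closures, no temporaries
    total += xs(i)
    i += 1
  }
  total
}

// Lazy-conversion check: if every UTF-8 byte is below 0x80 the string is
// pure ASCII, so it can be processed byte-wise and the UTF-16 decode skipped.
def isAscii(bytes: Array[Byte]): Boolean = {
  var i = 0
  while (i < bytes.length) {
    if ((bytes(i) & 0x80) != 0) return false
    i += 1
  }
  true
}

sumWhile(Array(1, 2, 3)) // 6L
```

`isAscii` is itself written with a while loop over the raw bytes, so the inspection adds no allocation of its own.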
24. Performance Comparison – String function
▪ Performance improvement of about 20% to 200%.
▪ Overhead of UTF-8 to UTF-16 conversion avoided.

Time (hh:mm:ss) by scale factor:
                        SF1      SF10     SF50
optimization disabled   0:02:57  0:04:13  0:18:29
optimization enabled    0:02:38  0:03:26  0:06:21

[Chart: String function runtime vs. scale factor, optimization disabled vs. enabled]
25. Performance Comparison – Date function
▪ Performance improvement of about 20% to 100%.
▪ Avoided creating temporary objects.
▪ Used imperative-style programming.
▪ Used while instead of for.

Time (hh:mm:ss) by scale factor:
                        SF1      SF10     SF50
optimization disabled   0:03:10  0:04:28  0:10:09
optimization enabled    0:02:38  0:02:41  0:05:39

[Chart: Date function runtime vs. scale factor, optimization disabled vs. enabled]
26. Performance Comparison – Numeric function
▪ Performance improvement of about 15% to 50%.

Time (hh:mm:ss) by scale factor:
                        SF1      SF10     SF50
optimization disabled   0:04:15  0:06:07  0:10:41
optimization enabled    0:03:36  0:04:04  0:09:15

[Chart: Numeric function runtime vs. scale factor, optimization disabled vs. enabled]
27. Performance comparison - Summary
▪ 200% faster for certain String functions with large datasets (50GB).
▪ 50%-100% faster for date and numeric functions.
▪ The performance difference becomes more noticeable with larger datasets:
▪ Conversion and optimization costs go up.
▪ Garbage collection overhead becomes a significant contributor to the overall execution time.