This deck shows how to extend Apache Spark APIs without modifying Spark source code, using Scala's "Enrich My Library" pattern. It walks through adding a .validate() method to Dataset objects to enable validation checks. The pattern defines an implicit class that augments existing types with new methods, which lets validation classes integrate seamlessly with Spark jobs while keeping code concise, isolated, and testable. Other uses, such as metrics collection and logging, are also covered.
2. What This Talk is About
• Scala programming constructs
• Functional programming paradigms
• Tips for organizing code in production systems
2#DevSAIS19
3. Who am I
• Lead Data Engineer at Target since 2016
• Deep love of all things Target
• Primary career focus has been building backend
systems with a personal passion for Machine Learning
problems
• Started working in Spark in 2015
6. Motivation
Let’s go through an example…
• We have a system of Authors, Articles, and
Comments on those Articles
• As the example shows, Spark/Scala lends itself
well to functional programming paradigms
• What happens when the system grows in
size/complexity and it becomes necessary
to inject more custom code into the mix?
• Can we keep things concise, readable, and
efficient using the same functional style of
code development?
7. Motivation
Functional Programming Refresher
• Declarative style of writing code (vs.
Imperative)
• Favors composition with functions
• Avoids shared state, mutability, and side
effects.
8. Motivation
A Validation Framework was born…
• Tasked with building an on-demand
computation system consuming various
data sources
• There were many ways for this data to go
wrong
• Needed a way to fail fast and in a
predictable way when a certain bar for
quality was not being met
9. Motivation
A Validation Framework was born…
• Desired ability to “sprinkle” .validate() calls
throughout our existing Spark ETL code
This is possible with Scala’s “Enrich My Library” Pattern
11. “Enrich My Library”
A Scala programming pattern…
• Allows us to augment existing APIs
• Analogous features exist in other languages
(e.g., extension methods in C# and Kotlin)
• Also known as “Pimp My Library” for
Googling purposes
• Syntactic sugar that uses implicit classes
to guide the compiler
Reference: https://docs.scala-lang.org/overviews/core/implicit-classes.html
12. “Enrich My Library”
What are implicits?
Scala’s “implicit” keyword lets the compiler
make connections at compile time instead of
requiring an explicit function call or argument.
Scala supports implicit values, parameters,
functions, and classes.
What is an implicit class?
Introduced formally in Scala 2.10, although the
same effect was achievable in earlier versions
through other constructs (an implicit conversion
plus a wrapper class). It lets you add methods to
classes you normally wouldn’t have access to,
such as third-party or standard-library types.
Reference: https://docs.scala-lang.org/overviews/core/implicit-classes.html
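The mechanics can be seen in a small, Spark-free sketch modeled on the `IntWithTimes` example from the referenced docs page (the names here are illustrative):

```scala
object IntEnrichment {
  // Adds a .times method to Int without modifying Int itself
  implicit class IntWithTimes(val n: Int) {
    def times(body: => Unit): Unit = (1 to n).foreach(_ => body)
  }
}

import IntEnrichment._

var greetings = 0
// The compiler rewrites 3.times { ... } as new IntWithTimes(3).times { ... }
3.times { greetings += 1 }
println(greetings) // prints 3
```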
16. An Example
Step 1: Build a Validation class to
work with
• Abstract class parameterized with type T
representing the object type that we plan to
validate
• Contains metadata relevant to running a
validation
• Has an abstract .execute() method to be filled
in by concrete subclasses
• Contains a concrete implementation
.performValidation() that calls on the abstract
execute method
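The talk's actual class isn't reproduced in this outline, so here is a hypothetical, Spark-free sketch of what Step 1 could look like; the shape of `Validation`, `execute`, and `performValidation` follows the slide's description, while the constructor details are assumptions:

```scala
// Hypothetical sketch (names assumed; the talk's real class carries
// more metadata than a single name). T is the type being validated.
abstract class Validation[T](val name: String) {
  // Abstract: concrete subclasses define what "valid" means for T
  def execute(data: T): Boolean

  // Concrete: fail fast, predictably, when the quality bar is not met
  def performValidation(data: T): Unit =
    if (!execute(data))
      throw new IllegalStateException(s"Validation failed: $name")
}

// Trivial concrete subclass to exercise the contract
object NonEmpty extends Validation[Seq[Int]]("non-empty") {
  def execute(data: Seq[Int]): Boolean = data.nonEmpty
}

NonEmpty.performValidation(Seq(1, 2, 3)) // passes silently
```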
17. An Example
Step 2: Add an implicit class to allow
the decoration of existing types with
new methods
• The class can be named anything
• It must be defined inside another object, class,
or trait (it cannot be top-level in Scala 2)
• Its constructor takes exactly one non-implicit
parameter, which defines the type being augmented
• Extra arguments can be passed through an
implicit parameter list
• .validate() delegates back to the validation object
being passed into the method and uses the
object being decorated to carry out the
validation.
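A hedged sketch of what Step 2's implicit class might look like; `ValidationSyntax` and `ValidatableOps` are made-up names, a minimal `Validation` from Step 1 is repeated so the snippet stands alone, and returning the data so calls can chain is a design assumption, not taken from the talk:

```scala
// Minimal Validation from Step 1 so this snippet stands alone
abstract class Validation[T](val name: String) {
  def execute(data: T): Boolean
  def performValidation(data: T): Unit =
    if (!execute(data))
      throw new IllegalStateException(s"Validation failed: $name")
}

object ValidationSyntax {
  // Exactly one non-implicit constructor parameter: the type being augmented
  implicit class ValidatableOps[T](data: T) {
    // Delegates to the validation object passed in, using the decorated
    // value to carry out the validation; returns the data for chaining
    def validate(v: Validation[T]): T = { v.performValidation(data); data }
  }
}

import ValidationSyntax._

object AllPositive extends Validation[Seq[Int]]("all-positive") {
  def execute(data: Seq[Int]): Boolean = data.forall(_ > 0)
}

// The compiler rewrites this as new ValidatableOps(Seq(1, 2, 3)).validate(...)
val checked = Seq(1, 2, 3).validate(AllPositive)
```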
18. An Example
Step 3: Define a validation
• Our validation extends a Validation typed with
Dataset[Article]
• It fills in the abstract method .execute() which
defines what the validation is checking for
• This means that any time the compiler finds a
Dataset[Article] type, we can call .validate() on
it with this validation supplied because of our
implicit class
• Roughly 20 lines of concise, isolated code are
neatly separated from the core ETL job
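A hypothetical version of such a concrete validation; the talk types it with Dataset[Article], but Seq[Article] stands in here so the sketch runs without a Spark session, and the rule itself (non-empty titles) is invented for illustration:

```scala
// Minimal Validation from Step 1 so this snippet stands alone
abstract class Validation[T](val name: String) {
  def execute(data: T): Boolean
  def performValidation(data: T): Unit =
    if (!execute(data))
      throw new IllegalStateException(s"Validation failed: $name")
}

case class Article(id: Long, authorId: Long, title: String)

// Step 3: fill in execute() with what the validation checks for.
// The real code would extend Validation[Dataset[Article]].
object ArticlesHaveTitles extends Validation[Seq[Article]]("articles-have-titles") {
  def execute(articles: Seq[Article]): Boolean =
    articles.forall(_.title.trim.nonEmpty)
}
```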
19. An Example
Step 4: Instantiate your validation
and pull it in scope
• This is what triggers the compiler to link
Datasets of Articles to the .validate()
method through the defined implicit class
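Putting the earlier sketches together, Step 4 might look like this; all names are hypothetical, and the import of the syntax object is what brings the implicit class into scope:

```scala
// Minimal framework pieces from Steps 1-3 so this snippet stands alone
// (the talk uses Dataset[Article]; Seq[Article] stands in here)
abstract class Validation[T](val name: String) {
  def execute(data: T): Boolean
  def performValidation(data: T): Unit =
    if (!execute(data))
      throw new IllegalStateException(s"Validation failed: $name")
}

object ValidationSyntax {
  implicit class ValidatableOps[T](data: T) {
    def validate(v: Validation[T]): T = { v.performValidation(data); data }
  }
}

case class Article(id: Long, authorId: Long, title: String)

object ArticlesHaveTitles extends Validation[Seq[Article]]("articles-have-titles") {
  def execute(as: Seq[Article]): Boolean = as.forall(_.title.nonEmpty)
}

// Step 4: pulling the implicit class into scope is what lets the
// compiler resolve .validate() on a Seq[Article]
import ValidationSyntax._

val articles  = Seq(Article(1L, 10L, "Enriching Spark"))
val validated = articles.validate(ArticlesHaveTitles)
```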
20. An Example
Step 5: Don’t forget Unit Tests
• It is straightforward to write a concise,
isolated unit test for each validation
• ScalaTest with FunSpec is used to
achieve BDD-style tests
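The deck names ScalaTest's FunSpec for BDD-style suites; as a dependency-free sketch, the same assertions are shown below with the describe/it structure in comments (a real suite would extend FunSpec and use describe/it blocks, and the validation under test is the hypothetical one from earlier):

```scala
// Validation under test (hypothetical, Spark-free stand-in)
abstract class Validation[T](val name: String) {
  def execute(data: T): Boolean
  def performValidation(data: T): Unit =
    if (!execute(data))
      throw new IllegalStateException(s"Validation failed: $name")
}

case class Article(id: Long, authorId: Long, title: String)

object ArticlesHaveTitles extends Validation[Seq[Article]]("articles-have-titles") {
  def execute(as: Seq[Article]): Boolean = as.forall(_.title.nonEmpty)
}

// describe("ArticlesHaveTitles")
//   it("passes when every article has a title")
val good = Seq(Article(1L, 10L, "Enrich My Library"))
assert(ArticlesHaveTitles.execute(good))

//   it("rejects a batch containing an empty title")
val bad = good :+ Article(2L, 10L, "")
assert(!ArticlesHaveTitles.execute(bad))

//   it("fails fast through performValidation")
val threw =
  try { ArticlesHaveTitles.performValidation(bad); false }
  catch { case _: IllegalStateException => true }
assert(threw)
```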
21. An Example
Step 6: And we’re done!
• We have been able to develop concise,
isolated, testable code that can fit
seamlessly into existing Spark jobs
• Data is messy, and we have the ability to
address this problem in an elegant way
• “Enrich my library” has allowed us to
extend Spark APIs so we can stay true to
functional programming paradigms
24. Other Uses
Support other common functionalities
used in production systems
✓ Validations
• Metrics Collection
• Logging
• Checkpointing
• Notifications
• …
25. Disclaimer
These are powerful programming constructs that
can greatly increase productivity and enable the
buildout of concise and elegant framework code.
Overuse can lead to cryptic and esoteric systems
that can cause engineers great pain and suffering.
Find the right balance!
26. Takeaways
• The “Enrich My Library” programming pattern
enables concise, clean, and readable code
• It enabled us to create a framework that supports
rapid development of new validations with a
relatively small amount of code
• The resulting code is isolated, testable, and easy to
understand
27. Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems ranging from supply chain
logistics to smart stores to personalization and so on
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
work somewhere you ♥
28. Acknowledgements
• Thank you Spark Summit
• Thank you Target
• Thank you wonderful team members at Target
• Thank you vibrant Spark and Scala communities