2. Vlad Ureche
PhD in the Scala Team @ EPFL. Soon to graduate ;)
● Working on program transformations focusing on data representation
● Author of miniboxing, which improves generics performance by up to 20x
● Contributed to the Scala compiler and to the scaladoc tool.
@VladUreche
vlad.ureche@gmail.com
scala-miniboxing.org
7. Motivation
Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data, used with permission.
Performance gap between RDDs and DataFrames
24. Object Composition
class Employee(...)
class Vector[T] { ... }
[Diagram: Vector[Employee] stores rows of ID | NAME | SALARY, one heap object per employee]
Traversal requires dereferencing a pointer for each employee.
27. A Better Representation
● more efficient heap usage
● faster iteration
[Diagram: EmployeeVector stores one column per field: ID ID ..., NAME NAME ..., SALARY SALARY ...]
[Diagram: Vector[Employee] stores rows of ID | NAME | SALARY]
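The two layouts can be sketched in plain Scala. This is a hand-written illustration, not generated code; the field names (id, name, salary) and the fromVector helper are assumptions for the example:

```scala
// Row-oriented: one heap object per employee, reached through a pointer.
final case class Employee(id: Int, name: String, salary: Float)

// Column-oriented sketch: one flat array per field. Traversal scans arrays
// instead of chasing a pointer per employee, and the heap holds three
// arrays instead of N small objects.
final class EmployeeVector(
    val ids: Array[Int],
    val names: Array[String],
    val salaries: Array[Float]) {
  def size: Int = ids.length
  // Materialize a row on demand.
  def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))
}

object EmployeeVector {
  // Hypothetical conversion from the object representation.
  def fromVector(v: Vector[Employee]): EmployeeVector =
    new EmployeeVector(
      v.map(_.id).toArray,
      v.map(_.name).toArray,
      v.map(_.salary).toArray)
}
```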
33. The Problem
● Vector[T] is unaware of Employee
  – Which makes Vector[Employee] suboptimal
● Not limited to Vector: other classes are affected too
  – Spark pain point: functions/closures
  – We'd like a "structured" representation throughout
Challenge: there is no means of communicating this to the compiler.
58. Scenario
class Employee(...)
class Vector[T] { ... }
class NewEmployee(...) extends Employee(...)
[Diagram: Vector[Employee] stores rows of ID | NAME | SALARY]
[Diagram: EmployeeVector stores one column per field: ID ID ..., NAME NAME ..., SALARY SALARY ...]
[Diagram: NewEmployee adds a field: ID | NAME | SALARY | DEPT]
Oooops...
62. Open World Assumption
● Globally, anything can happen
● Locally, you have full control:
  – Make class Employee final, or
  – Limit the transformation to code that uses Employee
How? Using Scopes!
65. Scopes
transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Now the method operates on the EmployeeVector representation.
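Conceptually, the transformed method could look like the following hand-written sketch, where indexSalary operates on a columnar EmployeeVector and touches only the salary column. The EmployeeVector fields shown are assumptions for illustration, not the tool's actual output:

```scala
// Assumed columnar representation: one array per Employee field.
final case class EmployeeVector(
    ids: Array[Int],
    names: Array[String],
    salaries: Array[Float])

// What the transformed indexSalary could amount to: a scan over the
// salary column alone; ids and names are shared untouched.
def indexSalary(employees: EmployeeVector, by: Float): EmployeeVector =
  employees.copy(salaries = employees.salaries.map(s => (1 + by) * s))
```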
67. Scopes
● Can wrap statements, methods, even entire classes
  – Inlined immediately after the parser
  – Definitions are visible outside the "scope"
● Mark locally closed parts of the code
  – Incoming/outgoing values go through conversions
  – You can reject unexpected values
73. Best ...?
It depends.
EmployeeJSON:
{
  "id": 123,
  "name": "John Doe",
  "salary": 100
}
Tungsten repr.: <compressed binary blob>
[Diagram: EmployeeVector stores one column per field: ID ID ..., NAME NAME ..., SALARY SALARY ...]
[Diagram: Vector[Employee] stores rows of ID | NAME | SALARY]
80. Composition
● Code can be
  – Left untransformed (using the original representation)
  – Transformed using different representations
● Calling combinations:
  – original code calling original code
  – original code calling transformed code (and vice versa)
  – transformed code calling code under the same transformation
  – transformed code calling code under a different transformation
87. Composition
● Calling between original and transformed code, or between code under different transformations:
  – The compiler automatically introduces conversions between values in the two representations
  – e.g. EmployeeVector → Vector[Employee] or back
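A minimal sketch of what such boundary conversions might look like, assuming a columnar EmployeeVector with one array per field. The names toStructured and toObjects are hypothetical; in the actual system these conversions are compiler-generated, not hand-written:

```scala
final case class Employee(id: Int, name: String, salary: Float)
final case class EmployeeVector(
    ids: Array[Int],
    names: Array[String],
    salaries: Array[Float])

// Inserted when a value flows INTO a transformed scope.
def toStructured(v: Vector[Employee]): EmployeeVector =
  EmployeeVector(
    v.map(_.id).toArray,
    v.map(_.name).toArray,
    v.map(_.salary).toArray)

// Inserted when a value flows OUT of a transformed scope.
def toObjects(ev: EmployeeVector): Vector[Employee] =
  ev.ids.indices
    .map(i => Employee(ev.ids(i), ev.names(i), ev.salaries(i)))
    .toVector
```

The two functions are inverses, so a round trip through the transformed scope preserves the value.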
97. Scopes
trait Printer[T] {
  def print(elements: Vector[T]): Unit
}
class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}
Method print in the class implements method print in the trait.
101. Scopes
trait Printer[T] {
  def print(elements: Vector[T]): Unit
}
transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}
The signature of method print changes according to the transformation → it no longer implements the trait.
Taken care of by the compiler for you!
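One way to picture what the compiler does here is a hand-written bridge: a method with the trait's original signature converts the incoming value and delegates to the transformed version. This is only a sketch; printStructured and the EmployeeVector fields are hypothetical names, not what the compiler actually emits:

```scala
trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

final case class Employee(id: Int, name: String, salary: Float)
final case class EmployeeVector(
    ids: Array[Int],
    names: Array[String],
    salaries: Array[Float])

class EmployeePrinter extends Printer[Employee] {
  // "Transformed" implementation: operates on the structured representation.
  def printStructured(employees: EmployeeVector): Unit =
    employees.ids.indices.foreach { i =>
      println(s"${employees.ids(i)}: ${employees.names(i)} (${employees.salaries(i)})")
    }

  // Bridge (sketch of what the compiler could generate): keeps the trait's
  // signature, converts the incoming Vector[Employee], and delegates.
  def print(elements: Vector[Employee]): Unit =
    printStructured(EmployeeVector(
      elements.map(_.id).toArray,
      elements.map(_.name).toArray,
      elements.map(_.salary).toArray))
}
```

Because the bridge keeps the original signature, EmployeePrinter still conforms to Printer[Employee] from the outside.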
108. Retrofitting value class status
Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be.
(3, 5) as an object: [Header | reference] pointing at fields 3 and 5
(3, 5) as a value: packed into a single Long, (3L << 32) + 5
14x faster, lower heap requirements
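The packed encoding can be illustrated with small helpers that squeeze an (Int, Int) pair into one Long, so no heap object or pointer is needed. These helpers (pack, fst, snd) are illustration only, not the actual transformation output:

```scala
// Pack two 32-bit Ints into one 64-bit Long:
// the first component in the high 32 bits, the second in the low 32 bits.
// Masking the low half avoids sign-extension of negative second components.
def pack(a: Int, b: Int): Long = (a.toLong << 32) | (b & 0xFFFFFFFFL)

// Recover the components.
def fst(p: Long): Int = (p >> 32).toInt
def snd(p: Long): Int = p.toInt
```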
120. Research ahead*
* This may not make it into a product, but you can play with it nevertheless.
121. Spark
● Optimizations
  – DataFrames do deforestation
  – DataFrames do predicate push-down
  – DataFrames do code generation
● Code is specialized for the data representation
● Functions are specialized for the data representation
123. Spark
● Optimizations
  – RDDs don't do deforestation
  – RDDs don't do predicate push-down
  – RDDs don't do code generation
● Code is not specialized for the data representation
● Functions are not specialized for the data representation
This is what makes them slower.
124. Spark
● Optimizations
  – Datasets do deforestation
  – Datasets do predicate push-down
  – Datasets do code generation
● Code is specialized for the data representation
● Functions are specialized for the data representation
138. Challenge: Transformation not possible
● Example: calling an outside (untransformed) method
● Solution: issue compiler warnings
  – Explain why it's not possible: due to the method call
  – Suggest how to fix it: enclose the method in a scope
● Reuse the machinery in miniboxing: scala-miniboxing.org
141. Challenge: Internal API changes
● Spark internals rely on Iterator[T]
  – Requires materializing values
  – Needs to be replaced throughout the code base by rather complex buffers
● Solution: extensive refactoring/rewrite
145. Challenge: Automation
● Existing code should run out of the box
● Solution:
  – Adapt data-centric metaprogramming to Spark
  – Trade generality for simplicity
  – Do the right thing for most of the cases
Where are we now?
149. Prototype Hack
● Modified version of Spark core
  – RDD data representation is configurable
● It's very limited:
  – Custom data representation only in map, filter and flatMap
  – Otherwise we revert to costly objects
  – Large parts of the automation still need to be done
159. Conclusion
● Object-oriented composition → inefficient representation
● Solution: data-centric metaprogramming
  – Opaque data → structured data
  – Is it possible? Yes.
  – Is it easy? Not really.
  – Is it worth it? You tell me!
161. Deforestation and Language Semantics
● Notice that we changed the language semantics:
  – Before: collections were eager
  – After: collections are lazy
  – This can lead to effects reordering
162. Deforestation and Language Semantics
● Such transformations are only acceptable with programmer consent
  – JIT compilers/staged DSLs can't change semantics
  – Metaprogramming (macros) can, but it should be documented/opt-in
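The effects reordering is easy to observe by modeling deforestation with a lazy view: the eager pipeline runs each map phase to completion before starting the next, while the fused pipeline interleaves the two side effects per element. Both produce the same result; only the order of effects differs:

```scala
import scala.collection.mutable.ListBuffer

val log = ListBuffer[String]()

// Eager: the first map runs over ALL elements, then the second map runs.
val eager = Vector(1, 2)
  .map { x => log += s"a$x"; x * 2 }
  .map { x => log += s"b${x * 2 / 2 * 2}"; x + 1 } // logs b<doubled value>
val eagerLog = log.toList // phase order: a1, a2, b2, b4

log.clear()

// "Deforested" (modeled with a lazy view): the two functions are fused,
// so for each element both effects fire before moving to the next one.
val fused = Vector(1, 2).view
  .map { x => log += s"a$x"; x * 2 }
  .map { x => log += s"b$x"; x + 1 }
  .toVector
val fusedLog = log.toList // element order: a1, b2, a2, b4
```

Same values out, different effect order: exactly the semantic change the slide warns about.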
163. Code Generation
● Also known as
  – Deep embedding
  – Multi-stage programming
● Awesome speedups, but restricted to small DSLs
● SparkSQL uses code generation to improve performance
  – By 2-4x over Spark
165. Low-level Optimizers
● Java JIT compiler
  – Access to the low-level code
  – Can assume a (local) closed world
  – Can speculate based on profiles
● The best optimizations break semantics
  – You can't do this in the JIT compiler!
  – Only the programmer can decide to break semantics
167. Scala Macros
● Many optimizations can be done with macros
  – :) Lots of power
  – :( Lots of responsibility: macros can break Scala compiler invariants, the object-oriented model, and modularity
● Can we restrict macros so they're safer?
  – Data-centric metaprogramming