Apache Spark has been a great driver not only of Scala adoption, but also of introducing a new generation of developers to functional programming concepts. As Spark places more emphasis on its newer DataFrame & Dataset APIs, it’s important to ask ourselves how we can benefit from this while still keeping our fun functional roots. We will explore the cases where the Dataset APIs empower us to do cool things we couldn’t before, what the different approaches to serialization mean, and how to figure out when the shiny new API is actually just trying to steal your lunch money (aka CPU cycles).
6. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
○ Currently out of print, discussing a reprint re-run with my wife
● On twitter @BooProgrammer
7. Why does Google Cloud care about Spark?
● Lots of data!
○ We mostly use different, although similar FP inspired, tools internally
● We have two hosted solutions for using Spark (dataproc & GKE)
○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the next RCs (2.1.3 / 2.4 probably) -
https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
8. Who do I think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● May or may not know some Scala
○ If you’re new to Scala welcome to the community!
● Might know some Spark
● Want to keep things functional
● Ok with things getting a little bit silly
Lori Erickson
9. What will be covered?
● What is Spark (super brief) & how it’s helped drive FP to enterprise
● What Datasets mean for Spark instead of RDDs
● Current limitations of Datasets (and the sad implications as a result)
● What Datasets let us accomplish that we couldn’t* before
● What we can do to make this more awesome for future generations
● We’re going to talk about a lot of things we need to fix, but please remember everything has lots of things that need fixing too.
10. What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
11. Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
12. Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
14. What is the “magic” of Spark?
● Automatically distributed functional programming :)
● DAG / “query plan” is the root of much of it
● Optimizer to combine steps
● Resiliency: recover from failures rather than protecting
from failures.
● “In-memory” + “spill-to-disk”
● Functional programming to build the DAG for “free”
● Select operations without deserialization
● The best way to trick people into learning functional
programming
Richard Gillin
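To make the DAG / “query plan” point concrete, here is a minimal sketch, assuming Spark 2.x on the classpath and a local master (both assumptions for the demo, not something the slides specify). The object and method names are made up for illustration. The transformations only describe work; `explain(true)` prints the plans, and nothing runs until the `reduce` action.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names are illustrative, not from the talk.
object LazyPlanSketch {
  def doubledEvenSum(spark: SparkSession): Long = {
    import spark.implicits._
    // Transformations: these only build up the query plan, no job runs yet.
    val doubledEvens = spark.range(0, 1000)
      .filter($"id" % 2 === 0)
      .select(($"id" * 2).as[Long])
    // Print the parsed, analyzed, optimized, and physical plans.
    doubledEvens.explain(true)
    // reduce is an action, so this is where the work actually happens.
    doubledEvens.reduce((x, y) => x + y)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-plan-sketch")
      .getOrCreate()
    try println(doubledEvenSum(spark)) finally spark.stop()
  }
}
```

In the printed plans, the optimized plan is where you can see how the optimizer combined the steps.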
15. The different pieces of Spark
Apache Spark
● SQL, DataFrames & Datasets
● Structured Streaming
● Spark ML
● bagel & Graph X
● MLLib
● Streaming
● Graph Frames
● Language labels: “Scala, Java, Python, & R” and “Scala, Java, Python”
Paul Hudson
16. What Spark got right (for Scala/FP):
● Strong enforced[ish] requirement for immutable data
○ Recompute is used for failure recovery, so immutability is a core part of the logic
● Functional operators (map, filter, flatMap, etc.)
● Lambdas for everyone!
○ Sometimes too many….
● Solved a “business need”
○ Even if that need was imaginary
● Made it hard to have side effects against external variables without being very
explicit & verbose
○ Even then discouraged strongly
Stuart
17. What Spark got … less right (for Scala/FP):
● Serialization… complications
○ Makes people think closures are more limited than they can be
● Lots of Map[String, String] (equivalent) settings
○ Hey buddy can you spare a type checker?
● Hard to debug, which could be confused with Scala being hard to debug
○ Not completely unjustified sometimes
● New ML & SQL APIs without “any” types (initially)
indamage
18. What are these “new” APIs?
● First off, what is “new”? It replaces an old, not-yet-removed working thing with something that might work
● DataFrames - not that new, kind of superseded-ish by Datasets (yay)
● “New” ML API (called ML) - Look ma no types :(
○ We “forgot” to add a serving layer. We started, but then got bored.
● Structured Streaming
○ Hey buddy, want to try a new execution engine? It might not lose your data. Don’t pay any
attention to the missing/broken windows, self-joins, changing APIs, and…. yeah maybe give it
a few months
Susanne Nilsson
19. DataFrames/Datasets
● DataFrames: Everything is a Row. Even case classes are Rows.
● Datasets: Oh shit, types were useful, let’s add those back.
● More SQL inspired than functional inspired
○ select etc.
● Started out with no functional operations or types; they were added later (and it shows)
● Schema (not type) inference
○ “How many people know the types of their JSON data?”/ eskati everyone say “fuck json”
○ If you don’t get that reference listen to lil’ pump (or not)
● No automatic tuple magic on read instead “Row” of pretty much anything
● Overhead to apply strict types
● Many, many operations throw away types
● Required for much of Spark’s new functionality
○ RDDs will still be around, but… the cool new toys are in Datasets :(
Paul Harrison
20. Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
● Nice performance of Spark SQL with the flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
23. What about compared to Kryo?
● Depends on who you listen to
○ According to the people who wrote it, still better
● Nominally also allows sort operations directly on serialized data
○ Some restrictions do apply
● Custom classes with complex types require custom work :(
laurenbeth93
24. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
25. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
● toDF(): convert a Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda]: convert the DataFrame back to a Dataset
● select(...as[Double]): a typed query (specifies the return type)
● reduce: traditional functional reduction, arbitrary Scala code :)
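For anyone who wants to run the mixed relational/functional query, here is a self-contained sketch, assuming Spark 2.x and a local master. The `RawPanda` below is a simplified stand-in (the class used in the talk likely has more fields), and the object name and sample data are made up.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Simplified, hypothetical stand-in for the talk's RawPanda.
case class RawPanda(id: Long, happy: Boolean, attributes: Array[Double])

object HappyPandaSketch {
  def happiness(spark: SparkSession, ds: Dataset[RawPanda]): Double = {
    import spark.implicits._
    ds.filter($"happy" === true)            // relational: an expression Catalyst can inspect
      .select($"attributes"(0).as[Double])  // typed select, yields Dataset[Double]
      .reduce((x, y) => x + y)              // functional: arbitrary Scala code
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("happy-panda-sketch")
      .getOrCreate()
    import spark.implicits._
    val pandas = Seq(
      RawPanda(1L, happy = true, Array(0.5, 1.0)),
      RawPanda(2L, happy = false, Array(9.0)),
      RawPanda(3L, happy = true, Array(1.5))
    ).toDS()
    try println(happiness(spark, pandas)) finally spark.stop()
  }
}
```

Note that `reduce` throws on an empty Dataset, so a real job would want to guard against no happy pandas.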
26. And functional style maps:
/**
* Functional map + Dataset, sums the positive attributes for the pandas
*/
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
27. A Word count w/Datasets (ish)
val df = spark.read.load(src).select("text")
val ds = df.as[String]
// Returns a Dataset!
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
● Can’t push down filters from here
● If it’s a simple type we don’t have to define a case class
● Lose type information
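One way to keep the types through the grouping step is `groupByKey` instead of the untyped `groupBy`. This is a minimal sketch, assuming Spark 2.x; the object and method names are made up. The trade-off is that the key function is an opaque lambda, so the optimizer can see even less than in the version above.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical demo object; names are illustrative, not from the talk.
object TypedWordCountSketch {
  def wordCount(spark: SparkSession, lines: Dataset[String]): Dataset[(String, Long)] = {
    import spark.implicits._
    lines
      .flatMap(line => line.split(" "))  // still a Dataset[String]
      .filter(word => word.nonEmpty)     // drop empty tokens from repeated spaces
      .groupByKey(word => word)          // typed grouping: KeyValueGroupedDataset[String, String]
      .count()                           // Dataset[(String, Long)], types intact
  }
}
```

Usage would look like `TypedWordCountSketch.wordCount(spark, spark.read.textFile(src))`, where `src` is whatever input path you have.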
33. UDFS: Adding custom code
// Scala
sqlContext.udf.register("strLen", (s: String) => s.length())
# Python
sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
34. Using UDF on a table:
First register the table:
df.registerTempTable("myTable")
sqlContext.sql("SELECT firstCol, strLen(stringCol) from myTable")
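On Spark 2.x the same flow is usually written against `SparkSession`, with `createOrReplaceTempView` taking over from the deprecated `registerTempTable`. A minimal sketch follows; the object and method names are made up, and the column names follow the slide.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object; names are illustrative, not from the talk.
object StrLenUdfSketch {
  def withLengths(spark: SparkSession, df: DataFrame): DataFrame = {
    // Register the UDF under a SQL-visible name.
    // Caveat: this throws on null strings; guard for nulls in real code.
    spark.udf.register("strLen", (s: String) => s.length)
    // Expose the DataFrame to SQL as a temporary view.
    df.createOrReplaceTempView("myTable")
    spark.sql("SELECT firstCol, strLen(stringCol) AS len FROM myTable")
  }
}
```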
35. Aggregates - Classes are fun right?
abstract class UserDefinedAggregateFunction {
def initialize(buffer: MutableAggregationBuffer): Unit
def update(buffer: MutableAggregationBuffer, input: Row): Unit
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit
def evaluate(buffer: Row): Any
}
Sil Silv
36. Spark SQL Aggregates
● We could make a functional version, but we haven’t yet
● Maybe simple good PR for someone looking to help us keep it functional :p
○ Although to be fair there might be pushback
● Hint hint :)
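To give a feel for what a functional version could look like, here is a plain-Scala sketch with no Spark dependency. `FunctionalAggregate` is entirely hypothetical and not part of Spark’s API; it recasts initialize/update/merge/evaluate as immutable values and pure functions, and `run` simulates per-partition folds followed by a merge.

```scala
// Hypothetical type, NOT part of Spark: the four aggregate steps as pure functions.
final case class FunctionalAggregate[In, Buf, Out](
    zero: Buf,                   // plays the role of initialize
    update: (Buf, In) => Buf,    // fold one input into a buffer
    merge: (Buf, Buf) => Buf,    // combine buffers from two partitions
    evaluate: Buf => Out) {      // produce the final result

  // Simulates a distributed run: fold each partition, then merge the partial buffers.
  def run(partitions: Seq[Seq[In]]): Out =
    evaluate(partitions.map(_.foldLeft(zero)(update)).foldLeft(zero)(merge))
}

object FunctionalAggregateSketch {
  // Mean as a (sum, count) buffer; returns NaN for empty input.
  val mean: FunctionalAggregate[Double, (Double, Long), Double] =
    FunctionalAggregate[Double, (Double, Long), Double](
      zero = (0.0, 0L),
      update = (buf, x) => (buf._1 + x, buf._2 + 1),
      merge = (a, b) => (a._1 + b._1, a._2 + b._2),
      evaluate = buf => if (buf._2 == 0) Double.NaN else buf._1 / buf._2
    )
}
```

No mutable buffers are involved, so each step can be tested in isolation. Whether this shape would perform acceptably inside Spark is a separate question.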
37. Using UDFs Programmatically
def dateTimeFunction(format : String ): UserDefinedFunction = {
import org.apache.spark.sql.functions.udf
udf((time : Long) => new Timestamp(time * 1000))
}
val format = "dd-mm-yyyy"
df.select(df(firstCol),
dateTimeFunction(format)(df(unixTimeStamp).cast(TimestampType)))
38. Functions.scala: Everything is a string (or column)
● Lots of operators, yay!
● Mini sadness
● Frameless brings typed columns! -
https://github.com/typelevel/frameless/blob/master/dataset/src/main/scala/frameless/TypedColumn.scala
39. Spark ML pipelines
● Scikit inspired
● No types :(
○ Instead kind of hokey runtime schema checking that isn’t always correct
○ When it fails you can have a job fail after 8+ hours :(
● Frameless to the (optional) rescue -
https://github.com/typelevel/frameless/tree/master/ml/src/main/scala/frameless/ml/feature
● Also similar efforts exist inside of certain companies
○ Which I wish they would open source
george erws
40. Basic Dataprep pipeline for “ML”
// Combines a list of double input features into a vector
val assembler = new VectorAssembler().setInputCols(Array("age",
"education-num")).setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category")
.setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang Yun Chung
41. So a bit more about that pipeline
● Each of our previous components has a “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
val model = pipeline.fit(df)
val prepared = model.transform(df)
Andrey
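Putting the data prep pieces together, here is a minimal end-to-end sketch, assuming Spark 2.x with spark-mllib on the classpath. The object and method names are made up; the column names are the ones from the slides. The point is that a single `fit` produces a `PipelineModel` that can then `transform` the training data or any future data with the same columns.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical demo object; names are illustrative, not from the talk.
object DataPrepPipelineSketch {
  def fitPrep(train: DataFrame): PipelineModel = {
    // Plain transformer: combines numeric columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "education-num"))
      .setOutputCol("features")
    // Estimator: has to see the data to learn the string-to-index mapping.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("category-index")
    // One fit call fits every stage that needs fitting.
    new Pipeline().setStages(Array(assembler, indexer)).fit(train)
  }
}
```

Since none of this is checked at compile time, a misspelled column name only shows up when the job runs.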
42. What does our pipeline look like so far?
Input Data -> Assembler -> Input Data + Vectors -> StringIndexer -> Input Data + Cat ID + Vectors
● Assembler: this is a regular transformer - no fitting required.
● StringIndexer: while not an ML learning algorithm this still needs to be fit
43. Adding some ML (no longer cool -- DL)
// Specify model
val dt = new DecisionTreeClassifier()
.setLabelCol("category-index")
.setFeaturesCol("features")
// Add it to the pipeline
val pipeline_and_model = new Pipeline().setStages(
Array(assembler, indexer, dt))
val pipeline_model = pipeline_and_model.fit(df)
45. What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Trust but verify.
46. Arrow powered magic (numeric :p):
add = pandas_udf(lambda x, y: x + y, IntegerType())
James Willamor
47. And now we can use it for streaming too!
● StructuredStreaming - new to Spark 2.0
○ Emphasis on new - be cautious when using
● New execution engine option in 2.3
● Extends the Dataset & DataFrame APIs to represent continuous tables
● Still early stages - but now have flexibility to change engines (sort of)
48. Get a streaming dataset
// Read a streaming dataframe
val schema = new StructType()
.add("happiness", "double")
.add("coffees", "integer")
val streamingDS = spark
.readStream
.schema(schema)
.format("parquet")
.load(path)
[Diagram: streaming source -> Dataset (isStreaming = true)]
49. Build the recipe for each query
val happinessByCoffee = streamingDS
.groupBy($"coffees")
.agg(avg($"happiness"))
[Diagram: streaming source -> Dataset (isStreaming = true) -> Aggregate (groupBy = “coffees”, expr = avg(“happiness”))]
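The recipe only does something once a sink is attached and the query is started. Here is a runnable sketch, assuming Spark 2.3 or later. To stay self-contained it swaps the talk’s parquet source for the built-in “rate” test source and derives made-up `coffees` and `happiness` columns from it. The console sink, the 10 second timeout, and the object and method names are also assumptions for the demo.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

// Hypothetical demo object; names are illustrative, not from the talk.
object StreamingHappinessSketch {
  def happinessByCoffee(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // "rate" emits (timestamp, value) rows; we fake the two columns from value.
    val streamingDS = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select(
        ($"value" % 3).cast("integer").as("coffees"),
        ($"value" % 10).cast("double").as("happiness"))
    // Same recipe as the slide: still lazy, still just a plan.
    streamingDS.groupBy($"coffees").agg(avg($"happiness"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("streaming-happiness-sketch")
      .getOrCreate()
    // Streaming aggregations without a watermark need "complete" or "update" output mode.
    val query = happinessByCoffee(spark).writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination(10000L) // run for roughly ten seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```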
50. Scala might matter “less”
● I float between Python & Scala so I’ll still have a job
● But I _like_ functional programming & types
● Traditionally (for better or worse) large overhead to work in Python on
distributed data
○ The overhead is quickly going down
○ As Kelly mentioned in her talk this morning, PySpark folks sometimes used to learn (some) Scala for performance -- we’ll have to offer new shiny things instead
KLMircea
51. Key takeaways
● Datasets are a functional API
○ With easier “support” for window operations and similar compared to
RDDs
○ We can still sell enterprise support contracts and training to banks.
● Spark ML still uses DataFrames (no types)
○ Frameless has types for (some of) it!
○ Yes you can use deep learning with it. No I didn’t talk about that, it’s
extra.
● We have some important work to do to keep functional
programming competitive with SQL in Spark.
○ And with Python, seriously.
jeffreyw
52. Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
53. High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
54. And some upcoming talks:
● June
○ Live streams (this Friday & weekly*) - follow me on twitch & YouTube
● July
○ Possible PyData Meetup in Amsterdam (tentative)
○ Curry on Amsterdam
○ OSCON Portland
● August
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
55. k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout
(holden@pigscanfly.ca or http://bit.ly/holdenTalkFeedback ) if
you feel comfortable doing so :)