The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API compared to the untyped DataFrame API.
Beyond Wordcount: With Datasets & Scaling
1. Beyond Wordcount:
With Datasets & Scaling
With a Wordcount example as well because guild
@holdenkarau
Presented at Nike PDX Jan 2018
2. Who am I?
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
4. Who do I think you all are?
● Nice people*
● Possibly some knowledge of Apache Spark?
● Interested in using Spark Datasets
● Familiar-ish with Scala or Java or Python
Amanda
5. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when the data is too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
6. What we are going to explore together!
● What are Spark Datasets
● Where it fits into the Spark ecosystem
● How DataFrames & Datasets are different from RDDs
● How Datasets are different than DataFrames
● The limitations of Datasets :( and how to work beyond them
Ryan McGilchrist
7. The different pieces of Spark
Apache Spark core (APIs in Scala, Java, Python & R)
● SQL, DataFrames & Datasets
● Structured Streaming & Spark Streaming
● Spark ML & MLlib
● Bagel & GraphX, plus GraphFrames
Paul Hudson
8. Why should we consider Datasets?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations (sketched after this slide)
○ Multi-column & multi-type aggregates
Rikki's Refuge
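Roughly what a windowed operation looks like with this API - a sketch only, assuming Spark 2.x in Scala; the sales data and column names are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("window-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sales data: (store, day, amount)
val sales = Seq(("pdx", 1, 10.0), ("pdx", 2, 7.5), ("sea", 1, 3.0))
  .toDF("store", "day", "amount")

// Running total per store ordered by day: painful to hand-roll on RDDs,
// a couple of lines with the windowing API
val byStore = Window.partitionBy("store").orderBy("day")
sales.withColumn("running_total", sum("amount").over(byStore)).show()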
9. Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
○ Nice performance of Spark SQL + the flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
11. How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
● For non-JVM languages: more of the computation happens in the JVM
Andrew Skudder
13. And in Python it’s even more important!
Andrew Skudder
*Note: do not compare absolute #s with previous graph -
different dataset sizes because I forgot to write it down when I
made the first one.
14. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
15. Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
16. Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
18. Magic has its limits: key-skew + black boxes
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum...
_torne
19. Bad word count RDD :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
21. Mini “fix”: Datasets (aka DataFrames)
● Still super powerful
● Still allow arbitrary lambdas
● But give you more options to “help” the optimizer
● groupBy returns a GroupedData structure and offers special aggregates
● Selects can push filters down for us*
● Etc.
22. Getting started:
Our window to the world:
● Core Spark has the SparkContext
● Spark Streaming has the StreamingContext
● SQL has:
○ SQLContext and HiveContext (pre-2.0)
○ Unified in SparkSession post 2.0 (see the sketch below)
Petful
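A minimal sketch of getting that window in Spark 2.x (the app name is just an example):

import org.apache.spark.sql.SparkSession

// Post-2.0 the SQLContext & HiveContext are unified behind SparkSession
val spark = SparkSession.builder()
  .appName("beyond-wordcount")
  .getOrCreate()

// The classic SparkContext is still there if you need it
val sc = spark.sparkContext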
23. Spark Dataset Data Types
● Requires element types have Spark SQL encoder
○ Many common basic types already have encoders
○ case classes of supported types don’t require their own encoder
○ RDDs support any serializable object
● Many common “native” data types are directly supported
● Can add encoders for others (see the sketch below)
loiez Deniel
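A quick sketch of where encoders come from, assuming the SparkSession from the previous sketch is in scope; the Coffee and LegacyThing types are made up for illustration:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

import spark.implicits._  // encoders for common types & case classes of them

// Basic types already have encoders
val words: Dataset[String] = Seq("panda", "coffee").toDS()

// Case classes of supported types work without writing an encoder by hand
case class Coffee(name: String, strength: Double)
val coffees: Dataset[Coffee] = Seq(Coffee("espresso", 9.0)).toDS()

// For other serializable classes we can fall back to a Kryo-based encoder
class LegacyThing(val id: Int) extends Serializable
implicit val legacyEncoder: Encoder[LegacyThing] = Encoders.kryo[LegacyThing]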
24. How to start?
● Load Data in DataFrames & Datasets - use
SparkSession
○ Using the new DataSource API, raw SQL queries, etc.
● Register tables
○ Run SQL queries against them
● Write programmatic queries against DataFrames
● Apply functional transformations to Datasets
U-nagi
25. Loading data with Spark SQL (Datasets)
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc.
● load(“path”) (putting these together is sketched below)
Jess Johnson
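Putting those reader pieces together, roughly (a sketch; the path is hypothetical):

// header & inferSchema are the spark-csv style options mentioned above
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/pandas.csv")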
26. Loading some simple JSON data
df = sqlContext.read.format("json").load("sample.json")
Jess Johnson
27. What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ csv*
○ Hive
● Available as packages
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ +34 at
http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*built in starting in 2.0, prior to that use a package
31. Sample case class for schema:
case class RawPanda(id: Long, zip: String, pt: String,
  happy: Boolean, attributes: Array[Double])

case class PandaPlace(name: String, pandas: Array[RawPanda])
Orangeaurochs
32. Then apply some type magic
// Convert our runtime typed DataFrame into a Dataset
// This makes functional operations much easier to perform
// and fails fast if the schema does not match the type
val pandas: Dataset[RawPanda] = df.as[RawPanda]
33. We can also convert RDDs
def fromRDD(rdd: RDD[RawPanda]): Dataset[RawPanda] = {
rdd.toDS
}
Nesster
34. What do relational transforms look like?
Many familiar faces are back with a twist:
● filter
● join
● groupBy - Now safe!
And some new ones:
● select
● window
● sql (register as a table and run “arbitrary” SQL)
● etc. (a few of these are sketched below)
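A small sketch of a few of these against the RawPanda Dataset defined a couple of slides back (the aggregate column names here are made up for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

// filter + groupBy + agg on the typed Dataset from the earlier slides
val happyByZip = pandas
  .filter($"happy" === true)
  .groupBy($"zip")
  .agg(avg($"attributes"(0)).as("avg_attr0"), count("*").as("num_pandas"))

// Or register it and run "arbitrary" SQL against it
pandas.createOrReplaceTempView("pandas")
val viaSQL = spark.sql("SELECT zip, count(*) FROM pandas WHERE happy GROUP BY zip")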
36. Ok word count RDD (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
37. Word count w/Dataframes
from pyspark.sql import Row

df = spark.read.load(src)
# flatMap via the underlying RDD - returns an RDD, not a DataFrame
words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double serialization here :(
38. Word count w/Datasets
val df = spark.read.load(src).select("text")
// Returns a Dataset! (a simple type, so we don't have to define a case class)
val ds = df.as[String]
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
Note: filters can’t be pushed down past the flatMap from here.
39. What can the optimizer do now?
● Sort on the serialized data
● Understand the aggregate (“partial aggregates”)
○ Could sort of do this before, but not as awesomely, and only if we used reduceByKey - not groupByKey
● Pack them bits nice and tight
40. So what’s this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it easy to perform multiple aggregations
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
● Allows the optimizer to see what aggregates are being
performed
Sherrie Thai
41. Computing some aggregates by age:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions._
df.groupBy("age").agg(min("hours-per-week"))
43. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
44. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
● toDF() converts the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda] converts the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]) is a typed query (it specifies the return type)
● reduce is a traditional functional reduction: arbitrary Scala code :)
45. And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
46. But where do DataFrames explode?
● Iterative algorithms - large plans
○ Use your escape hatch to RDDs!
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data (200 partitions; see the sketch below)
● Default partition size when reading in is also sad
47. DF & RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory, cache it in memory
○ Otherwise persist at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
○ Checkpointing
● Noisy clusters
○ _2 storage levels & checkpointing can help
● Persist first before checkpointing (see the sketch below)
Richard Gillin
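A rough sketch of those options, assuming the pandas Dataset and sc from earlier (the checkpoint dir is hypothetical):

import org.apache.spark.storage.StorageLevel

val pandasRDD = pandas.rdd

// Fits nicely in memory? plain cache() is MEMORY_ONLY
// pandasRDD.cache()

// Otherwise pick a level explicitly; the _2 variants keep a second replica,
// which can help on noisy clusters
pandasRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Checkpointing: persist first so the data isn't recomputed just to write it out
sc.setCheckpointDir("/tmp/checkpoints")
pandasRDD.checkpoint()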
48. RDDs: They have issues too (especially with skew)
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (and it's pretty easy to trigger)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
49. So what does groupByKey look like?
Input records keyed by zip code:
(94110, A, B), (94110, A, C), (10003, D, E), (94110, E, F), (94110, A, R),
(10003, A, R), (94110, D, R), (94110, E, R), (94110, E, R), (67843, T, R),
(94110, T, R), (94110, T, R), (67843, T, R), (10003, A, R)
After groupByKey, a single task has to hold everything for the hot key:
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
50. So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn't require the types to be the same (e.g. computing a stats model or similar)
Both allow Spark to pipeline the reduction & skip materializing the list, and we also get a map-side reduction (note the difference in shuffled read). Both are sketched below.
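Roughly what the two look like on a pair RDD - a Scala sketch assuming a wordPairs: RDD[(String, Int)] like the (word, 1) pairs from the earlier word count; amountsByZip is made up:

// reduceByKey: value type in == value type out (our summing word count)
val counts = wordPairs.reduceByKey(_ + _)

// aggregateByKey: the result type can differ from the value type,
// e.g. build (sum, count) per key so we can compute a mean afterwards
val sumAndCount = amountsByZip.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),     // fold a value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))     // merge two accumulators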
51. What is DS functional perf like?
● Often better than RDD
● Generally not as good as the relational operations - the optimizer has less understanding
● SPARK-14083 is working on doing bytecode analysis
● Can still be faster than RDD transformations because of serialization improvements
52. The “future*”: Awesome UDFs
● Work going on in Scala land to translate simple Scala into SQL expressions
● POC w/Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count: 5% slower than a native Scala UDF, close to 65% faster than regular Python
● Arrow - faster interchange between languages
● Willing to share your Python UDFs for benchmarking? http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it's ok!
54. Summary: Why to use Datasets
● We can solve problems tricky to solve with RDDs
○ Window operations
○ Multiple aggregations
● Fast
○ Awesome optimizer
○ Super space efficient data format
● We can solve problems tricky/frustrating to solve with DataFrames
○ Writing UDFs and UDAFs can really break your flow
Lindsay Attaway
55. And some upcoming talks:
● Jan
○ If interest tomorrow: Office Hours? @holdenkarau
○ Wellington Spark + BEAM meetup
○ LinuxConf AU
○ Sydney Spark meetup
○ Data Day Texas
● Feb
○ FOSDEM - One on testing one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
56. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and or my boss) how
I’m doing?
http://bit.ly/holdenTalkFeedback
57. Bonus/support slides
● “Bonus” like the coupons you get from bed bath and beyond :p
● Not as many cat pictures (I’m sorry)
● The talk really should be finished at this point… but maybe someone asked a
question?