The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API compared to the untyped DataFrame API.
Beyond Wordcount: With Datasets & Scaling
1. Beyond Wordcount:
With Datasets & Scaling
With a Wordcount example as well because guild
@holdenkarau
Presented at Nike PDX Jan 2018
2. Who am I?
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
4. Who do I think you all are?
● Nice people*
● Possibly some knowledge of Apache Spark?
● Interested in using Spark Datasets
● Familiar-ish with Scala or Java or Python
Amanda
5. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when the data is too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
6. What we are going to explore together!
● What are Spark Datasets
● Where it fits into the Spark ecosystem
● How DataFrames & Datasets are different from RDDs
● How Datasets are different than DataFrames
● The limitations of Datasets :( and how to work beyond them
Ryan McGilchrist
7. The different pieces of Spark
Apache Spark core (APIs in Scala, Java, Python & R)
● SQL, DataFrames & Datasets
● Structured Streaming & Spark Streaming
● Spark ML & MLlib
● Bagel & GraphX, plus GraphFrames
Paul Hudson
8. Why should we consider Datasets?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations (sketched after this slide)
○ Multi-column & multi-type aggregates
Rikki's Refuge
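Roughly what a windowed operation looks like with this API - a sketch only, assuming Spark 2.x in Scala; the sales data and column names are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("window-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sales data: (store, day, amount)
val sales = Seq(("pdx", 1, 10.0), ("pdx", 2, 7.5), ("sea", 1, 3.0))
  .toDF("store", "day", "amount")

// Running total per store ordered by day: painful to hand-roll on RDDs,
// a couple of lines with the windowing API
val byStore = Window.partitionBy("store").orderBy("day")
sales.withColumn("running_total", sum("amount").over(byStore)).show()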
9. Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
○ Nice performance of Spark SQL + the flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
11. How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
● For non-JVM languages: more of the computation happens in the JVM
Andrew Skudder
13. And in Python it’s even more important!
Andrew Skudder
*Note: do not compare absolute #s with previous graph -
different dataset sizes because I forgot to write it down when I
made the first one.
14. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
15. Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
16. Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
18. Magic has its limits: key-skew + black boxes
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum...
_torne
19. Bad word count RDD :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
21. Mini “fix”: Datasets (aka DataFrames)
● Still super powerful
● Still allow arbitrary lambdas
● But give you more options to “help” the optimizer
● groupBy returns a GroupedData structure and offers special aggregates
● Selects can push filters down for us*
● Etc.
22. Getting started:
Our window to the world:
● Core Spark has the SparkContext
● Spark Streaming has the StreamingContext
● SQL has:
○ SQLContext and HiveContext (pre-2.0)
○ Unified in SparkSession post 2.0 (see the sketch below)
Petful
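A minimal sketch of getting that window in Spark 2.x (the app name is just an example):

import org.apache.spark.sql.SparkSession

// Post-2.0 the SQLContext & HiveContext are unified behind SparkSession
val spark = SparkSession.builder()
  .appName("beyond-wordcount")
  .getOrCreate()

// The classic SparkContext is still there if you need it
val sc = spark.sparkContext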
23. Spark Dataset Data Types
● Requires element types have Spark SQL encoder
○ Many common basic types already have encoders
○ case classes of supported types don’t require their own encoder
○ RDDs support any serializable object
● Many common “native” data types are directly supported
● Can add encoders for others (see the sketch below)
loiez Deniel
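A quick sketch of where encoders come from, assuming the SparkSession from the previous sketch is in scope; the Coffee and LegacyThing types are made up for illustration:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

import spark.implicits._  // encoders for common types & case classes of them

// Basic types already have encoders
val words: Dataset[String] = Seq("panda", "coffee").toDS()

// Case classes of supported types work without writing an encoder by hand
case class Coffee(name: String, strength: Double)
val coffees: Dataset[Coffee] = Seq(Coffee("espresso", 9.0)).toDS()

// For other serializable classes we can fall back to a Kryo-based encoder
class LegacyThing(val id: Int) extends Serializable
implicit val legacyEncoder: Encoder[LegacyThing] = Encoders.kryo[LegacyThing]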
24. How to start?
● Load Data in DataFrames & Datasets - use
SparkSession
○ Using the new DataSource API, raw SQL queries, etc.
● Register tables
○ Run SQL queries against them
● Write programmatic queries against DataFrames
● Apply functional transformations to Datasets
U-nagi
25. Loading data with Spark SQL (Datasets)
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc.
● load(“path”) (putting these together is sketched below)
Jess Johnson
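Putting those reader pieces together, roughly (a sketch; the path is hypothetical):

// header & inferSchema are the spark-csv style options mentioned above
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/pandas.csv")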
26. Loading some simple JSON data
df = sqlContext.read.format("json").load("sample.json")
Jess Johnson
27. What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ csv*
○ Hive
● Available as packages
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ +34 at
http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*built in starting in 2.0, prior to that use a package
31. Sample case class for schema:
case class RawPanda(id: Long, zip: String, pt: String,
  happy: Boolean, attributes: Array[Double])

case class PandaPlace(name: String, pandas: Array[RawPanda])
Orangeaurochs
32. Then apply some type magic
// Convert our runtime typed DataFrame into a Dataset
// This makes functional operations much easier to perform
// and fails fast if the schema does not match the type
val pandas: Dataset[RawPanda] = df.as[RawPanda]
33. We can also convert RDDs
def fromRDD(rdd: RDD[RawPanda]): Dataset[RawPanda] = {
rdd.toDS
}
Nesster
34. What do relational transforms look like?
Many familiar faces are back with a twist:
● filter
● join
● groupBy - Now safe!
And some new ones:
● select
● window
● sql (register as a table and run “arbitrary” SQL)
● etc. (a few of these are sketched below)
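A small sketch of a few of these against the RawPanda Dataset defined a couple of slides back (the aggregate column names here are made up for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

// filter + groupBy + agg on the typed Dataset from the earlier slides
val happyByZip = pandas
  .filter($"happy" === true)
  .groupBy($"zip")
  .agg(avg($"attributes"(0)).as("avg_attr0"), count("*").as("num_pandas"))

// Or register it and run "arbitrary" SQL against it
pandas.createOrReplaceTempView("pandas")
val viaSQL = spark.sql("SELECT zip, count(*) FROM pandas WHERE happy GROUP BY zip")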
36. Ok word count RDD (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
37. Word count w/Dataframes
from pyspark.sql import Row

df = spark.read.load(src)
# flatMap via the underlying RDD - returns an RDD, not a DataFrame
words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double serialization here :(
38. Word count w/Datasets
val df = spark.read.load(src).select("text")
// Returns a Dataset! (a simple type, so we don't have to define a case class)
val ds = df.as[String]
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
Note: filters can’t be pushed down past the flatMap from here.
39. What can the optimizer do now?
● Sort on the serialized data
● Understand the aggregate (“partial aggregates”)
○ Could sort of do this before, but not as awesomely, and only if we used reduceByKey - not groupByKey
● Pack them bits nice and tight
40. So what’s this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it easy to perform multiple aggregations
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
● Allows the optimizer to see what aggregates are being
performed
Sherrie Thai
41. Computing some aggregates by age:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions._
df.groupBy("age").agg(min("hours-per-week"))
43. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
44. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
● toDF() converts the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda] converts the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]) is a typed query (it specifies the return type)
● reduce is a traditional functional reduction: arbitrary Scala code :)
45. And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
46. But where do DataFrames explode?
● Iterative algorithms - large plans
○ Use your escape hatch to RDDs!
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data (200 partitions; see the sketch below)
● Default partition size when reading in is also sad
47. DF & RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory, cache it in memory
○ Otherwise persist at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
○ Checkpointing
● Noisy clusters
○ _2 storage levels & checkpointing can help
● Persist first before checkpointing (see the sketch below)
Richard Gillin
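A rough sketch of those options, assuming the pandas Dataset and sc from earlier (the checkpoint dir is hypothetical):

import org.apache.spark.storage.StorageLevel

val pandasRDD = pandas.rdd

// Fits nicely in memory? plain cache() is MEMORY_ONLY
// pandasRDD.cache()

// Otherwise pick a level explicitly; the _2 variants keep a second replica,
// which can help on noisy clusters
pandasRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Checkpointing: persist first so the data isn't recomputed just to write it out
sc.setCheckpointDir("/tmp/checkpoints")
pandasRDD.checkpoint()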
48. RDDs: They have issues too (especially with skew)
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (and it's pretty easy to trigger)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
49. So what does groupByKey look like?
Input records keyed by zip code:
(94110, A, B), (94110, A, C), (10003, D, E), (94110, E, F), (94110, A, R),
(10003, A, R), (94110, D, R), (94110, E, R), (94110, E, R), (67843, T, R),
(94110, T, R), (94110, T, R), (67843, T, R), (10003, A, R)
After groupByKey, a single task has to hold everything for the hot key:
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
50. So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn't require the types to be the same (e.g. computing a stats model or similar)
Both allow Spark to pipeline the reduction & skip materializing the list, and we also get a map-side reduction (note the difference in shuffled read). Both are sketched below.
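Roughly what the two look like on a pair RDD - a Scala sketch assuming a wordPairs: RDD[(String, Int)] like the (word, 1) pairs from the earlier word count; amountsByZip is made up:

// reduceByKey: value type in == value type out (our summing word count)
val counts = wordPairs.reduceByKey(_ + _)

// aggregateByKey: the result type can differ from the value type,
// e.g. build (sum, count) per key so we can compute a mean afterwards
val sumAndCount = amountsByZip.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),     // fold a value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))     // merge two accumulators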
51. What is DS functional perf like?
● Often better than RDD
● Generally not as good as the relational operations - the optimizer has less understanding
● SPARK-14083 is working on doing bytecode analysis
● Can still be faster than RDD transformations because of serialization improvements
52. The “future*”: Awesome UDFs
● Work going on in Scala land to translate simple Scala into SQL expressions
● POC w/Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count: 5% slower than a native Scala UDF, close to 65% faster than regular Python
● Arrow - faster interchange between languages
● Willing to share your Python UDFs for benchmarking? http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it's ok!
54. Summary: Why to use Datasets
● We can solve problems tricky to solve with RDDs
○ Window operations
○ Multiple aggregations
● Fast
○ Awesome optimizer
○ Super space efficient data format
● We can solve problems tricky/frustrating to solve with DataFrames
○ Writing UDFs and UDAFs can really break your flow
Lindsay Attaway
55. And some upcoming talks:
● Jan
○ If interest tomorrow: Office Hours? @holdenkarau
○ Wellington Spark + BEAM meetup
○ LinuxConf AU
○ Sydney Spark meetup
○ Data Day Texas
● Feb
○ FOSDEM - One on testing one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
56. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and or my boss) how
I’m doing?
http://bit.ly/holdenTalkFeedback
57. Bonus/support slides
● “Bonus” like the coupons you get from bed bath and beyond :p
● Not as many cat pictures (I’m sorry)
● The talk really should be finished at this point… but maybe someone asked a
question?