Over the past decade, solutions have emerged to tackle the processing of huge amounts of data with new tools that exploit horizontal scaling, Hadoop first and foremost. Today that need is joined by the need to process uninterrupted streams of data in real time, and Apache Spark is a cluster computing framework, alternative to MapReduce, that aims to provide the tools to make this task easy. In this talk we introduce Spark and its ecosystem, with a few short examples.
1. From Big Data to Fast Data
An introduction to Apache Spark
Stefano Baghino
Codemotion Milan 2015
2. From Big Data to Fast Data with Functional Reactive Containerized Microservices and AI-driven Monads in a galaxy far far away…
3. Hello!
I am Stefano Baghino
Software Engineer @ DATABIZ
stefano.baghino@databiz.it
@stefanobaghino
Favorite PL: Scala
My hero: XKCD’s Beret Guy
What I fear: [object Object]
4. Agenda
◎ Big Data?
◎ Fast Data?
◎ What do we have now?
◎ How can we do better?
◎ What is Spark?
◎ What does it do?
◎ How does it work?
And also code, somewhere here and there.
10. Let’s look at MapReduce
Disk I/O all the time: each step reads input from and writes output to disk.
Limited model: it’s difficult to fit all algos into the MapReduce model.
11. Ok, so what is so good about Spark?
◎ May sit on top of an existing Hadoop deployment.
◎ Builds heavily on simple functional programming ideas.
◎ Computes and caches data in-memory to deliver blazing performance.
17. Deploy on the cluster manager of your choice
◎ Local (127.0.0.1)
◎ Standalone
◎ Hadoop YARN
◎ Mesos
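As a sketch of how this choice looks in code (the app name and host names below are placeholders, not from the talk), the master URL passed to SparkConf selects the cluster manager while the application code stays the same:

import org.apache.spark.{SparkConf, SparkContext}

// The master URL picks the cluster manager; the rest of the app is unchanged.
val conf = new SparkConf()
  .setAppName("my-app")               // placeholder name
  .setMaster("local[*]")              // local mode, using all local cores
  // .setMaster("spark://host:7077")  // Standalone cluster (placeholder host)
  // .setMaster("mesos://host:5050")  // Mesos (placeholder host)
  // .setMaster("yarn-client")        // Hadoop YARN (Spark 1.x style)
val sc = new SparkContext(conf)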
18. Working with Spark
◎ Resilient Distributed Dataset
◎ Closely resembles a Scala collection
◎ Very natural to use for Scala devs
From the user’s point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
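A minimal sketch of the resemblance (data and variable names are made up for illustration): the same combinators work on a local Scala collection and on an RDD, with Spark distributing the work partition by partition.

// `sc` is the SparkContext provided by the Spark shell or the application
val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))

// On a plain Scala collection:
Seq("Alice", "Bob", "Andy").filter(_.startsWith("A")).map(_.toUpperCase)

// The same chain on the distributed RDD:
names.filter(_.startsWith("A")).map(_.toUpperCase).collect()
// Array(ALICE, ANDY)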
22. What is an RDD, really?
[Diagram: a lineage DAG of operations; two create nodes, each followed by a filter, merging into a join, ending with a collect.]
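As a rough illustration, the lineage in the diagram could be built like this (the datasets and variable names are hypothetical):

// Two created RDDs, each filtered, joined, then collected
val users  = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))            // create
val orders = sc.parallelize(Seq((1, 42.0), (3, 7.5)))                 // create
val activeUsers = users.filter { case (_, name) => name.nonEmpty }    // filter
val largeOrders = orders.filter { case (_, amount) => amount > 10 }   // filter
val joined = activeUsers.join(largeOrders)                            // join
joined.collect()  // action: Array((1, (Alice, 42.0)))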
23. What can I do with an RDD?
Transformations
Produce a new RDD, extending the execution graph at each step.
e.g.:
◎ map
◎ flatMap
◎ filter
Actions
They are “terminal” operations, actually calling for the execution to extract a value.
e.g.:
◎ collect
◎ reduce
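A small sketch with made-up data: the transformations below only build the graph, and nothing runs until an action is called.

val numbers = sc.parallelize(1 to 10)

val doubledEvens = numbers
  .filter(_ % 2 == 0)  // transformation: nothing runs yet
  .map(_ * 2)          // transformation: still nothing runs

doubledEvens.collect()      // action: Array(4, 8, 12, 16, 20)
doubledEvens.reduce(_ + _)  // action: 60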
24. The execution model
1. Create a DAG of RDDs to represent the computation
2. Create a logical execution plan for the DAG
3. Schedule and execute individual tasks
25. The execution model in action
Let’s count distinct names grouped by their initial
sc.textFile("hdfs://...")       // read one name per line
  .map(n => (n.charAt(0), n))   // key each name by its initial
  .groupByKey()                 // shuffle: group the names by initial
  .mapValues(n => n.toSet.size) // count the distinct names per initial
  .collect()                    // action: bring the result to the driver
27. Step 2: Create the execution plan
◎ Pipeline as much as possible
◎ Split into “stages” based on the need to “shuffle” data
Stage 1 (pipelined, no shuffle needed):
  HadoopRDD:       Alice, Bob, Andy
  MappedRDD:       (A, Alice), (B, Bob), (A, Andy)
Stage 2 (after the shuffle):
  ShuffledRDD:     (A, (Alice, Andy)), (B, Bob)
  MappedValuesRDD: (A, 2), (B, 1)
Result: res0 = [(A, 2), (B, 1)], an Array[(Char, Int)]
28. So, how is it a Resilient Distributed Dataset?
Being a lazy, immutable representation of a computation, rather than an actual collection of data, RDDs achieve resiliency by simply being re-executed when their results are lost*.
* because distributed systems and Murphy’s Law are best buddies.
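A minimal sketch of what this means in practice (made-up data): an RDD is a recipe Spark can replay, and caching is the explicit way to trade memory for recomputation.

val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))
val initials = names.map(_.charAt(0))

// If a partition of `initials` is lost, Spark replays its lineage
// (parallelize -> map) to rebuild just that partition.
initials.cache()
initials.count()  // first action computes and caches the partitions
initials.count()  // second action is served from the cache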
29. The ecosystem
On top of Spark Core:
◎ Spark SQL: structured data
◎ Spark Streaming: real-time
◎ MLlib: machine learning
◎ GraphX: graph processing
◎ SparkR: statistical analysis
Spark Core runs on the Standalone Scheduler, YARN, or Mesos.
30. What we’ll see today: Spark Streaming
34. “Mini-batches” are DStreams
These “mini-batches” are DStreams, or discretized streams, and they are basically a collection of RDDs.
DStreams can be created from streaming sources or by applying transformations to an existing DStream.
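As a sketch, here is the classic streaming word count (the socket source on localhost:9999 is a placeholder): each batch interval produces one RDD in the DStream, and transformations on the DStream apply to every underlying RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // A 1-second batch interval: each batch becomes one RDD of the DStream
    val ssc = new StreamingContext(conf, Seconds(1))

    // A DStream from a streaming source (placeholder TCP socket)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations on the DStream apply to each underlying RDD
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()          // output operation: triggers execution per batch

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // run until stopped
  }
}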