This document provides an overview of Spark, including its history, use cases, architecture, and ecosystem. Some key points:
- Spark is an open-source cluster computing framework for processing large datasets in parallel across a cluster. It was developed at UC Berkeley in 2009 and became a top-level Apache project in 2013.
- Spark can be used for tasks like log analysis, text processing, analytics, search, and fraud detection on large datasets distributed across clusters. It offers APIs in Scala, Java, and Python, and integrates with the Hadoop ecosystem.
- Spark uses Resilient Distributed Datasets (RDDs) as its basic abstraction, allowing data to be processed in parallel. Transformations on RDDs are lazy and produce new RDDs; actions trigger the computation and return results.
2. Spark
● Processing of large volumes of data
● Distributed processing on commodity hardware
● Written in Scala, with Java and Python bindings
3. History
● 2009: AMPLab, UC Berkeley
● June 2013: top-level project of the Apache Software Foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0
4. Use cases
● Log analysis
● Processing of text files
● Analytics
● Distributed search (as Google once did)
● Fraud detection
● Product recommendation
5. Proximity with Hadoop
● Same use cases
● Same development model: MapReduce
● Integration with the ecosystem
6. Simpler than Hadoop
● API simpler to learn
● “Relaxed” MapReduce
● Spark Shell: interactive processing
7. Faster than Hadoop
Spark officially sets a new record in large-scale sorting (5 November 2014)
● Sorting 100 TB of data
● Hadoop MR: 72 minutes
○ With 2,100 nodes (50,400 cores)
● Spark: 23 minutes
○ With 206 nodes (6,592 cores)
11. RDD
● Resilient Distributed Dataset
● Abstraction of a collection processed in parallel
● Fault tolerant
● Can work with tuples:
○ Key-Value
○ Tuples must be independent from each other
12. Sources
● Files on HDFS
● Local files
● In-memory collections
● Amazon S3
● NoSQL databases
● ...
● Or a custom implementation of InputFormat
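As a minimal Java sketch (the local file path is hypothetical), an RDD can be built from an in-memory collection with parallelize() or from a file with textFile():

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext("local", "rdd-sources");

// RDD from an in-memory collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// RDD from a local file; an hdfs:// or s3n:// URL would target HDFS or Amazon S3
JavaRDD<String> lines = sc.textFile("data/sample.txt");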
13. Transformations
● Processes an RDD, returns another RDD
● Lazy!
● Examples:
○ map(): one value → another value
○ mapToPair(): one value → a tuple
○ filter(): filters values/tuples given a condition
○ groupByKey(): groups values by key
○ reduceByKey(): aggregates values by key
○ join(), cogroup()...: joins two RDDs
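To make the laziness concrete, here is a minimal sketch (file name hypothetical, sc being a JavaSparkContext as above): each call below only records a step in the RDD's lineage, and no data is read or processed yet.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("data/sample.txt");                    // lazy
JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());          // lazy
JavaPairRDD<String, Integer> pairs =
        nonEmpty.mapToPair(line -> new Tuple2<>(line, 1));                 // lazy
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);  // still lazy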
14. Actions
● Does not return an RDD; an action triggers the evaluation of the lazy transformations
● Examples:
○ count(): counts values/tuples
○ saveAsHadoopFile(): saves results in Hadoop’s format
○ foreach(): applies a function on each item
○ collect(): retrieves values in a list (List<T>)
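Continuing the sketch above, calling an action is what forces Spark to evaluate the whole pipeline:

import java.util.List;
import scala.Tuple2;

long total = counts.count();                                   // runs the job, returns the number of tuples
List<Tuple2<String, Integer>> result = counts.collect();       // brings all tuples back to the driver
counts.foreach(t -> System.out.println(t._1 + " : " + t._2));  // applies a function to each tuple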
16. Spark - Example
● Trees of Paris: CSV file, Open Data
● Count of trees by species
geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...
17. Spark - Example
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local", "arbres");
sc.textFile("data/arbresalignementparis2010.csv")
.filter(line -> !line.startsWith("geom"))                        // skip the header line
.map(line -> line.split(";"))                                    // split the CSV fields
.mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))  // (species, 1)
.reduceByKey((x, y) -> x + y)                                    // count per species
.sortByKey()
.foreach(t -> System.out.println(t._1 + " : " + t._2));
[Diagram: the CSV lines flow through textFile → filter (dropping the "geom;..." header) → map → mapToPair → reduceByKey → sortByKey → foreach, with per-species counts accumulated along the way]
20. Topology & Terminology
● One master / several workers
○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver
21. Spark in a cluster
Several options
● YARN
● Mesos
● Standalone
○ Workers started manually
○ Workers started by the master
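As a minimal sketch (host names and ports hypothetical), the same application can target any of these cluster managers simply by changing the master URL set on its SparkConf:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("arbres")
        .setMaster("spark://master-host:7077");   // standalone cluster
        // .setMaster("yarn-client")              // YARN
        // .setMaster("mesos://master-host:5050") // Mesos
JavaSparkContext sc = new JavaSparkContext(conf);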
28. Spark SQL
● Use an RDD with SQL
● SQL engine: converts SQL statements into low-level instructions
29. Spark SQL
Prerequisites:
● Use tabular data
● Describe the schema → SchemaRDD
Describing the schema:
● Programmatic description of the data
● Schema inference through reflection (POJO), as sketched below
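A sketch of the reflection approach with the Java API as it looked around Spark 1.2 (JavaSQLContext / JavaSchemaRDD); the Tree bean and the field index follow the tree-counting example above:

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

// A plain POJO: Spark SQL infers the schema from its getters/setters
public static class Tree implements Serializable {
    private String espece;
    public String getEspece() { return espece; }
    public void setEspece(String espece) { this.espece = espece; }
}

JavaSQLContext sqlContext = new JavaSQLContext(sc);
JavaRDD<Tree> trees = sc.textFile("data/arbresalignementparis2010.csv")
        .filter(line -> !line.startsWith("geom"))
        .map(line -> {
            Tree tree = new Tree();
            tree.setEspece(line.split(";")[4]);
            return tree;
        });
// Infer the schema through reflection and register the RDD as a table
JavaSchemaRDD schemaRDD = sqlContext.applySchema(trees, Tree.class);
schemaRDD.registerTempTable("tree");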
32. Spark SQL - Example
● Counting trees by species
sqlContext.sql("SELECT espece, COUNT(*)
FROM tree
WHERE espece <> ''
GROUP BY espece
ORDER BY espece")
.foreach(row -> System.out.println(row.getString(0)+" : "+row.getLong(1)));
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
39. Spark Streaming Demo
● Receive Tweets with hashtag #Android
○ Twitter4J
● Detection of the language of the Tweet
○ Language Detection
● Indexing with Elasticsearch
● Reporting with Kibana 4
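A minimal sketch of the receive-and-transform part of the demo, assuming the spark-streaming-twitter artifact and Twitter4J OAuth credentials provided as system properties; the language detection, Elasticsearch indexing and Kibana reporting steps are left as placeholders:

import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.Status;

// Two local threads: one for the Twitter receiver, one for processing
JavaStreamingContext jssc =
        new JavaStreamingContext("local[2]", "tweets", new Duration(2000));

TwitterUtils.createStream(jssc, new String[] { "#Android" })
        .map(Status::getText)
        // language detection and Elasticsearch indexing would go here
        .print();  // print a sample of each batch instead

jssc.start();
jssc.awaitTermination();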