This document provides an overview of Spark, including its history, use cases, architecture, and ecosystem. Some key points:
- Spark is an open-source cluster computing framework for processing large datasets in parallel across a cluster. It was developed at UC Berkeley in 2009 and became a top-level Apache project in 2013.
- Spark can be used for tasks like log analysis, text processing, analytics, search, and fraud detection on large datasets distributed across clusters. It offers APIs in Scala, Java, and Python, and integrates with the Hadoop ecosystem.
- Spark uses Resilient Distributed Datasets (RDDs) as its basic abstraction, allowing data to be processed in parallel. Transformations on RDDs are lazy and produce new RDDs; actions trigger the computation and return results.
2. Spark
● Processing of large volumes of data
● Distributed processing on commodity hardware
● Written in Scala, with Java and Python bindings
3. History
● 2009: AMPLab, UC Berkeley
● June 2013: top-level project of the Apache Software Foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0
4. Use cases
● Log analysis
● Processing of text files
● Analytics
● Distributed search (as Google once did)
● Fraud detection
● Product recommendation
5. Proximity with Hadoop
● Same use cases
● Same development model: MapReduce
● Integration with the ecosystem
6. Simpler than Hadoop
● API simpler to learn
● “Relaxed” MapReduce
● Spark Shell: interactive processing
7. Faster than Hadoop
Spark officially sets a new record in large-scale sorting (5 November 2014)
● Sorting 100 TB of data
● Hadoop MR: 72 minutes
○ With 2,100 nodes (50,400 cores)
● Spark: 23 minutes
○ With 206 nodes (6,592 cores)
11. RDD
● Resilient Distributed Dataset
● Abstraction of a collection processed in parallel
● Fault tolerant
● Can work with tuples:
○ Key-Value
○ Tuples must be independent from each other
12. Sources
● Files on HDFS
● Local files
● In-memory collections
● Amazon S3
● NoSQL databases
● ...
● Or a custom implementation of InputFormat
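As a minimal Java sketch (the local file path is hypothetical), an RDD can be built from an in-memory collection with parallelize() or from a file with textFile():

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext("local", "rdd-sources");

// RDD from an in-memory collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// RDD from a local file; an hdfs:// or s3n:// URL would target HDFS or Amazon S3
JavaRDD<String> lines = sc.textFile("data/sample.txt");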
13. Transformations
● Processes an RDD, returns another RDD
● Lazy!
● Examples:
○ map(): one value → another value
○ mapToPair(): one value → a tuple
○ filter(): filters values/tuples given a condition
○ groupByKey(): groups values by key
○ reduceByKey(): aggregates values by key
○ join(), cogroup()...: joins two RDDs
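To make the laziness concrete, here is a minimal sketch (file name hypothetical, sc being a JavaSparkContext as above): each call below only records a step in the RDD's lineage, and no data is read or processed yet.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("data/sample.txt");                    // lazy
JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());          // lazy
JavaPairRDD<String, Integer> pairs =
        nonEmpty.mapToPair(line -> new Tuple2<>(line, 1));                 // lazy
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);  // still lazy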
14. Actions
● Does not return an RDD; an action triggers the evaluation of the lazy transformations
● Examples:
○ count(): counts values/tuples
○ saveAsHadoopFile(): saves results in Hadoop’s format
○ foreach(): applies a function on each item
○ collect(): retrieves values in a list (List<T>)
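Continuing the sketch above, calling an action is what forces Spark to evaluate the whole pipeline:

import java.util.List;
import scala.Tuple2;

long total = counts.count();                                   // runs the job, returns the number of tuples
List<Tuple2<String, Integer>> result = counts.collect();       // brings all tuples back to the driver
counts.foreach(t -> System.out.println(t._1 + " : " + t._2));  // applies a function to each tuple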
16. Spark - Example
● Trees of Paris: CSV file, Open Data
● Count of trees by species
geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...
17. Spark - Example
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local", "arbres");
sc.textFile("data/arbresalignementparis2010.csv")
.filter(line -> !line.startsWith("geom"))                        // skip the header line
.map(line -> line.split(";"))                                    // split the CSV fields
.mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))  // (species, 1)
.reduceByKey((x, y) -> x + y)                                    // count per species
.sortByKey()
.foreach(t -> System.out.println(t._1 + " : " + t._2));
[Diagram: the CSV lines flow through textFile → filter (dropping the "geom;..." header) → map → mapToPair → reduceByKey → sortByKey → foreach, with per-species counts accumulated along the way]
20. Topology & Terminology
● One master / several workers
○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver
21. Spark in a cluster
Several options
● YARN
● Mesos
● Standalone
○ Workers started manually
○ Workers started by the master
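As a minimal sketch (host names and ports hypothetical), the same application can target any of these cluster managers simply by changing the master URL set on its SparkConf:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("arbres")
        .setMaster("spark://master-host:7077");   // standalone cluster
        // .setMaster("yarn-client")              // YARN
        // .setMaster("mesos://master-host:5050") // Mesos
JavaSparkContext sc = new JavaSparkContext(conf);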
28. Spark SQL
● Use an RDD with SQL
● SQL engine: converts SQL statements into low-level instructions
29. Spark SQL
Prerequisites:
● Use tabular data
● Describe the schema → SchemaRDD
Describing the schema:
● Programmatic description of the data
● Schema inference through reflection (POJO), as sketched below
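A sketch of the reflection approach with the Java API as it looked around Spark 1.2 (JavaSQLContext / JavaSchemaRDD); the Tree bean and the field index follow the tree-counting example above:

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

// A plain POJO: Spark SQL infers the schema from its getters/setters
public static class Tree implements Serializable {
    private String espece;
    public String getEspece() { return espece; }
    public void setEspece(String espece) { this.espece = espece; }
}

JavaSQLContext sqlContext = new JavaSQLContext(sc);
JavaRDD<Tree> trees = sc.textFile("data/arbresalignementparis2010.csv")
        .filter(line -> !line.startsWith("geom"))
        .map(line -> {
            Tree tree = new Tree();
            tree.setEspece(line.split(";")[4]);
            return tree;
        });
// Infer the schema through reflection and register the RDD as a table
JavaSchemaRDD schemaRDD = sqlContext.applySchema(trees, Tree.class);
schemaRDD.registerTempTable("tree");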
32. Spark SQL - Example
● Counting trees by species
sqlContext.sql("SELECT espece, COUNT(*)
FROM tree
WHERE espece <> ''
GROUP BY espece
ORDER BY espece")
.foreach(row -> System.out.println(row.getString(0)+" : "+row.getLong(1)));
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
39. Spark Streaming Demo
● Receive Tweets with hashtag #Android
○ Twitter4J
● Detection of the language of the Tweet
○ Language Detection
● Indexing with Elasticsearch
● Reporting with Kibana 4
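A minimal sketch of the receive-and-transform part of the demo, assuming the spark-streaming-twitter artifact and Twitter4J OAuth credentials provided as system properties; the language detection, Elasticsearch indexing and Kibana reporting steps are left as placeholders:

import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.Status;

// Two local threads: one for the Twitter receiver, one for processing
JavaStreamingContext jssc =
        new JavaStreamingContext("local[2]", "tweets", new Duration(2000));

TwitterUtils.createStream(jssc, new String[] { "#Android" })
        .map(Status::getText)
        // language detection and Elasticsearch indexing would go here
        .print();  // print a sample of each batch instead

jssc.start();
jssc.awaitTermination();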