Apache spark linkedin

Ericsson Internal | 2015-08-11 | Page 2
AGENDA
• What?
• Why?
• How?
• Demo
• EDR Analytics

Spark eco-system
Technology landscape
Spark eco-system

WHAT?
“Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics (machine learning)”
Originally developed in 2009 in UC Berkeley’s AMPLab
Fully open sourced in 2010; now at the Apache Software Foundation
http://spark.apache.org

Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in past year; projects shown include Giraph, Storm, and Tez]

The Spark Community
Distributors | Applications

2015 SNAPSHOT

WHY SPARK?
• Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Ease of Use: Supports different languages for developing applications using Spark.
• Generality: Combine SQL, streaming, and complex analytics into one platform.
• Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.

Easy: Get Started Immediately
Interactive Shell

Monitoring

FEATURE COMPARISON
Source: Daytona GraySort benchmark, sortbenchmark.org

WORD COUNT

Spark eco-system
Libraries: Spark Streaming, Spark SQL, GraphX, MLlib
Spark Core Engine (Scala/Java/Python)
Cluster Manager: Local, Standalone cluster, YARN, Mesos
Persistence: …

SPARK ON HDFS

ECOSYSTEM
                      HADOOP          SPARK
SQL Query interface   HIVE            SPARK SQL
Machine Learning      APACHE MAHOUT   MLLIB
Graph processing      APACHE GIRAPH   GRAPHX
Streaming             APACHE STORM    SPARK STREAMING

HOW?

So, HOW is it BETTER?

THE BIG QUESTION
Is Spark going to replace Hadoop?
Answer: Yes, in part. Spark will be used on top of Hadoop and will replace MapReduce as the processing engine.
Reasons:
1. Hadoop MapReduce is batch-oriented and cannot handle real-time processing
2. Hadoop MapReduce is slower than Spark running on Hadoop
3. With the rise of IoT and its continuous data streams, Spark is a must

RDD & SPARK COMPONENTS

RESILIENT DISTRIBUTED DATASET (RDD)
RDDs track lineage information that can be used to efficiently re-compute lost data
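The lineage idea can be sketched in plain Scala, without Spark. This is a simplified model under stated assumptions, not Spark's actual RDD classes: the names Lineage, Source, Mapped and Filtered are hypothetical, and the data is made up. Each derived dataset stores only its parent and the function applied to it, so a lost partition can be rebuilt by replaying those functions on the parent's matching partition.

```scala
// Hypothetical, simplified model of lineage; not Spark's actual RDD classes.
sealed trait Lineage[A] {
  // Recompute the records of one partition from the lineage alone.
  def compute(partition: Int): Vector[A]
}

// Leaf of the lineage graph: data that can always be re-read from stable storage.
final case class Source[A](partitions: Vector[Vector[A]]) extends Lineage[A] {
  def compute(partition: Int): Vector[A] = partitions(partition)
}

// A derived dataset remembers only its parent and the function applied to it.
final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
}

final case class Filtered[A](parent: Lineage[A], p: A => Boolean) extends Lineage[A] {
  def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
}

// Made-up sample data split into two partitions.
val source = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
val evens  = Filtered(Mapped(source, (x: Int) => x * 10), (x: Int) => x % 20 == 0)

// Pretend the node holding partition 1 failed: only that partition is rebuilt.
val recovered = evens.compute(1) // Vector(40, 60)
```

Only partition 1 is recomputed here; partition 0 is never touched, which is why recovery through lineage can be cheaper than replicating every intermediate result.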

Partitions in the cluster
[Diagram: the partitions of an RDD distributed across cluster nodes labeled SparkM and SparkW. Credit: @doanduy]

RDD TRANSFORMATIONS & ACTIONS

PARTITION TRANSFORMATION
map(tuple => (tuple._3, tuple))
groupByKey()
countByKey()
[Diagram legend: partition, RDD, direct transformation, shuffle]

Stages
Stage 1 → Shuffle operation → Stage 2
Stages delimit "shuffle" frontiers
Credit: @doanduy
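The split between a direct transformation and a shuffle can be imitated with plain Scala collections. This is only a sketch under stated assumptions: nested Vectors stand in for RDD partitions, the records and the owner function are made up, and the hash routing is a simplification of what a real partitioner does.

```scala
// Plain Scala collections stand in for RDD partitions; the records are made up.
type Record = (String, Int, String) // the third field is the grouping key

val partitions: Vector[Vector[Record]] = Vector(
  Vector(("a", 1, "fr"), ("b", 2, "us")),
  Vector(("c", 3, "fr"), ("d", 4, "de"), ("e", 5, "us"))
)

// Stage 1: a direct transformation, applied to each partition independently.
val keyed: Vector[Vector[(String, Record)]] =
  partitions.map(_.map(tuple => (tuple._3, tuple)))

// Shuffle: route every record to the partition that owns its key.
val numPartitions = 2
def owner(key: String): Int = math.abs(key.hashCode % numPartitions)

val shuffled: Vector[Vector[(String, Record)]] =
  Vector.tabulate(numPartitions) { i =>
    keyed.flatten.filter { case (key, _) => owner(key) == i }
  }

// Stage 2: all records of a key are now co-located, so grouping is local again.
val grouped: Vector[Map[String, Vector[Record]]] =
  shuffled.map(_.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) })

// Action: merge the per-partition counts into one result for the driver.
val countByKey: Map[String, Int] =
  grouped.flatMap(_.map { case (k, records) => k -> records.size }).toMap
```

The map step never looks outside its own partition, so it stays inside Stage 1. Grouping needs every record of a key in one place, which forces the data movement that starts Stage 2.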

SPARK COMPONENTS

SPARK STREAMING

SPARK SQL

Let’s try some examples…

Spark Shell
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run
locally with one thread, or local[N] to run locally with N threads. You should start by
using local for testing.

Basic operations…
scala> val textFile = sc.textFile("../README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Simpler:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

Map - Reduce
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
scala> import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
scala> wordCounts.collect()

With Caching…
scala> linesWithSpark.cache()
res7: spark.RDD[String] =
spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
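Both counts return the same number; what caching changes is how often the filter runs. The sketch below imitates that in plain Scala. MiniDataset and evaluationsFor are hypothetical names, not Spark's API, and the three sample lines are made up.

```scala
// Hypothetical stand-in for a dataset; not Spark's RDD API.
final class MiniDataset[A](compute: () => Vector[A]) {
  private var cached: Option[Vector[A]] = None
  private var cachingEnabled = false

  // Like cache(): only marks the dataset; nothing is computed yet.
  def cache(): this.type = { cachingEnabled = true; this }

  private def materialize(): Vector[A] = cached match {
    case Some(data) => data
    case None =>
      val data = compute()
      if (cachingEnabled) cached = Some(data)
      data
  }

  // An action: forces evaluation.
  def count(): Long = materialize().size.toLong
}

// Runs count() twice and reports how many times the filter was evaluated.
def evaluationsFor(useCache: Boolean): Int = {
  var evaluations = 0
  val lines = Vector("# Apache Spark", "Spark is fast", "Hello") // made-up input
  def scan(): Vector[String] = {
    evaluations += 1
    lines.filter(_.contains("Spark"))
  }
  val linesWithSpark = new MiniDataset(() => scan())
  if (useCache) linesWithSpark.cache()
  linesWithSpark.count()
  linesWithSpark.count()
  evaluations
}

val evaluationsWithoutCache = evaluationsFor(useCache = false) // 2
val evaluationsWithCache    = evaluationsFor(useCache = true)  // 1
```

As in the shell session, cache() itself does no work; the first action fills the cache and later actions reuse it.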

With HDFS…
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(line => line.startsWith("ERROR"))
println("Total errors: " + errors.count())

Job Submission
$SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar

Configuration
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

SQL to RDD Translation
Projection & selection
SELECT name, age
FROM people
WHERE age >= 13 AND age <= 19

val people: RDD[Person] // given
val teenagers: RDD[(String, Int)]
  = people
  .filter(p => p.age >= 13 && p.age <= 19) // selection (WHERE)
  .map(p => (p.name, p.age)) // projection (SELECT)
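The translation can be tried without a cluster by letting a plain Scala Seq stand in for the RDD. In this sketch the Person case class and the sample rows are assumptions made for illustration. It shows both operator orders: selection then projection, and projection then selection, where the predicate has to read the age from the tuple instead of from the Person.

```scala
// Hypothetical record type and sample rows, for illustration only.
final case class Person(name: String, age: Int)

val people: Seq[Person] = Seq(
  Person("Ann", 12), Person("Bob", 13), Person("Cid", 19), Person("Dee", 20)
)

// WHERE first, then SELECT: the predicate can still use the field name.
val filterThenMap: Seq[(String, Int)] =
  people
    .filter(p => p.age >= 13 && p.age <= 19)
    .map(p => (p.name, p.age))

// SELECT first, then WHERE: the predicate must address the tuple's second slot.
val mapThenFilter: Seq[(String, Int)] =
  people
    .map(p => (p.name, p.age))
    .filter { case (_, age) => age >= 13 && age <= 19 }
```

Both orders give the same rows here. Filtering first means the projection only touches rows that survive the predicate.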
