Hands-On Apache Spark

Hands-On
@samklr & @nivdul
2015-03-10

Apache Spark is a fast and general engine for large-scale
data processing
• big data analytics in memory/disk
• complements Hadoop
• faster and more flexible
• Resilient Distributed Datasets (RDD)
• shared variables
• interactive shell (Scala & Python)
• lambda expressions (Java 8)
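
To make the "shared variables" bullet concrete, here is a minimal sketch of a broadcast variable, one of the two kinds of shared variable in Spark (the other being accumulators). The object name `BroadcastSketch`, the function `expandCodes`, and the lookup table are hypothetical and only serve as illustration.

```scala
import org.apache.spark.SparkContext

object BroadcastSketch {
  // Replaces country codes with country names using a broadcast lookup table.
  def expandCodes(sc: SparkContext, codes: Seq[String]): Seq[String] = {
    // Hypothetical lookup table; broadcasting ships it once per executor
    // instead of once per task.
    val names = sc.broadcast(Map("fr" -> "France", "de" -> "Germany"))

    sc.parallelize(codes)
      // Tasks read the shared, read-only value through .value
      .map(code => names.value.getOrElse(code, "unknown"))
      .collect()
      .toSeq
  }
}
```

From the interactive shell, where `sc` is already defined, this could be called as `BroadcastSketch.expandCodes(sc, Seq("fr", "xx"))`.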

RDD
(Resilient Distributed Dataset)
• process in parallel
• controllable persistence (memory, disk…)
• higher-level operations (transformations & actions)
• rebuilt automatically after a failure
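
The difference between transformations, actions and persistence can be sketched as follows. This is an illustrative example, not part of the workshop code: the names `RddSketch` and `evenSquares` and the choice of 4 partitions are assumptions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object RddSketch {
  // Returns (count, sum) of the even squares of 1..upTo.
  def evenSquares(sc: SparkContext, upTo: Int): (Long, Int) = {
    // Distribute the numbers over 4 partitions so they are processed in parallel
    val numbers = sc.parallelize(1 to upTo, 4)

    // Transformations are lazy: nothing is computed yet
    val squares = numbers.map(n => n * n).filter(_ % 2 == 0)

    // Controllable persistence: keep in memory, spill to disk if needed
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    // Actions trigger the computation; the second one reuses the persisted data
    val count = squares.count()
    val sum = squares.reduce(_ + _)

    squares.unpersist()
    (count, sum)
  }
}
```

Because the RDD is persisted before the first action, `reduce` does not recompute the `map` and `filter` steps.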

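Example: word count
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair RDD operations (needed before Spark 1.3)

// create the Spark configuration and the context
val conf = new SparkConf()
  .setAppName("Spark word count")
  .setMaster("local")

val sc = new SparkContext(conf)

// load the data
val data = sc.textFile("filepath/wordcount.txt")

// map then reduce step: split each line on whitespace, then sum the counts per word
val wordCounts = data.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// keep the result in memory for later actions
wordCounts.cache()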
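
The word count above only declares transformations, so nothing runs until an action is called. A minimal sketch of one such action, assuming in-memory input lines instead of a file (the names `TopWords` and `topWords` are hypothetical):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair RDD operations (needed before Spark 1.3)

object TopWords {
  // Returns the n most frequent words, most frequent first.
  def topWords(sc: SparkContext, lines: Seq[String], n: Int): Array[(String, Int)] = {
    val wordCounts = sc.parallelize(lines)
      .flatMap(line => line.split("\\s+"))
      .filter(_.nonEmpty) // drop empty tokens produced by leading whitespace
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // top is an action: it runs the job and returns the n largest
    // elements according to the given ordering (here, the count)
    wordCounts.top(n)(Ordering.by[(String, Int), Int](_._2))
  }
}
```

Other common actions on the same RDD are `collect()`, `count()` and `saveAsTextFile(path)`.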

Spark SQL unifies access to structured data
import org.apache.spark.sql.SQLContext

// create a SQL context from the Spark context
val sqlContext = new SQLContext(sc)

// load the JSON data
val tweets = sqlContext.jsonFile(pathToFile)

// register tweets as a table so it can be queried later
// (registerTempTable replaces registerAsTable, deprecated since Spark 1.1)
tweets.registerTempTable("tweet")

// run a SQL query against the registered table
val nb = sqlContext.sql("SELECT user, COUNT(*) AS c FROM tweet " +
  "WHERE user <> '' " +
  "GROUP BY user " +
  "ORDER BY c")
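
To read the query result back on the driver, the rows have to be collected. The sketch below assumes the Spark 1.x API used in these slides (`SQLContext`, `jsonRDD`, `registerTempTable`), so it will not compile against Spark 2.0 or later. It feeds a few in-memory JSON strings instead of a file, and the names `TweetsPerUser` and `countByUser` are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object TweetsPerUser {
  // Returns (user, number of tweets), ordered by ascending count.
  def countByUser(sc: SparkContext, jsonLines: Seq[String]): Seq[(String, Long)] = {
    val sqlContext = new SQLContext(sc)

    // Assumption: Spark 1.x, where jsonRDD builds a table from an RDD of JSON strings
    val tweets = sqlContext.jsonRDD(sc.parallelize(jsonLines))
    tweets.registerTempTable("tweet")

    sqlContext.sql("SELECT user, COUNT(*) AS c FROM tweet " +
      "WHERE user <> '' " +
      "GROUP BY user " +
      "ORDER BY c")
      .collect() // action: brings the rows back to the driver
      .map(row => (row.getString(0), row.getLong(1)))
      .toSeq
  }
}
```

Each input string is one JSON object, for example `{"user": "alice", "text": "hello"}`.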

Spark Streaming makes it easy to build scalable fault-tolerant streaming
applications
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// create a streaming context with a 10-second batch interval
val ssc = new StreamingContext(conf, Seconds(10))

// create our DStream (a sequence of RDDs) of tweets
val tweetsStream = TwitterUtils.createStream(ssc, StreamUtils.getAuth())

// extract the user of each tweet
val tweetUser = tweetsStream.map(tweetStatus => tweetStatus.getUser())
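
A DStream pipeline only runs once the streaming context is started. The following sketch shows the full lifecycle without Twitter credentials: it replaces the Twitter source with a `queueStream` holding a single hand-made batch, which is an assumption made purely for illustration, as are the names `StreamingSketch` and `countUsers`. It also assumes Spark 1.3 or later for `awaitTerminationOrTimeout`.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations (needed before Spark 1.3)

object StreamingSketch {
  // Counts how many times each user name appears in one batch.
  def countUsers(batch: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setAppName("Streaming sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

    // Stand-in for the Twitter stream: a queue of RDDs, one RDD per batch
    val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(batch))
    val results = mutable.Map[String, Int]()

    ssc.queueStream(queue)
      .map(user => (user, 1))
      .reduceByKey(_ + _)
      // foreachRDD runs on the driver for every batch
      .foreachRDD { rdd =>
        rdd.collect().foreach { case (user, n) => results.synchronized { results(user) = n } }
      }

    ssc.start()                         // start processing
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    results.synchronized { results.toMap }
  }
}
```

With a real source such as the Twitter stream, the usual pattern is `ssc.start()` followed by `ssc.awaitTermination()`, which blocks until the application is stopped.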

MLlib is Apache Spark's scalable machine learning library
• regression
• classification
• clustering
• optimization
• collaborative filtering
• feature extraction (TF-IDF, Word2Vec…)
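
As an illustration of the clustering bullet, here is a minimal k-means sketch using the RDD-based `org.apache.spark.mllib` API. The names `ClusteringSketch` and `clusterIds` and the limit of 20 iterations are assumptions chosen for the example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  // Returns the cluster index assigned to each input point, in input order.
  def clusterIds(sc: SparkContext, points: Seq[Array[Double]], k: Int): Seq[Int] = {
    // MLlib works on RDDs of vectors; cache because k-means is iterative
    val vectors = sc.parallelize(points.map(p => Vectors.dense(p))).cache()

    // Train a model with k clusters and at most 20 iterations
    val model = KMeans.train(vectors, k, 20)

    // Assign every point to its nearest cluster center
    model.predict(vectors).collect().toSeq
  }
}
```

The cluster indices themselves are arbitrary; only which points share an index is meaningful.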

Exercises
Part 1: Spark API
Part 2: Spark Streaming
Part 3: Spark SQL
Part 4: MLlib

Let’s go!
Clone the project from the Duchess France GitHub repository

Java
https://github.com/DuchessFrance/Hands-On-Spark-java
Scala
https://github.com/DuchessFrance/Hands-On-Spark-scala

All about Spark
http://spark.apache.org/

And ask if you have any questions :)

Have fun!
