5. What Do We Need?
• Spark as the data-processing engine on the
cluster; originally written in Scala, which allows
concise function syntax and interactive use
• Mesos as the cluster manager
• ZooKeeper as a highly reliable distributed
coordinator
• HDFS as distributed storage
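To make the stack concrete, here is a minimal sketch of a standalone Scala program wired to all four pieces; the ZooKeeper host names are hypothetical placeholders, and the HDFS path reuses the deck's example file.

import org.apache.spark.{SparkConf, SparkContext}

// Connect to a Mesos master registered in ZooKeeper
// (zk1/zk2 are hypothetical hosts).
val conf = new SparkConf()
  .setAppName("StackDemo")
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
val sc = new SparkContext(conf)

// Read from HDFS, the distributed storage layer.
val lines = sc.textFile("hdfs://localhost/test/tobe.txt")
println(lines.count())
sc.stop()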
6. What Do We Need?
• Pure functions
• Atomic operations
• Parallel patterns or skeletons
• Lightweight algorithms
The only thing that works for parallel programming
is functional programming.
--Carnegie Mellon Professor Bob Harper
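As a small illustration of why purity matters (the function names here are mine, not from the deck): a pure function can be mapped over a collection in any order, on any machine, while an impure one races on shared state.

// Pure: the result depends only on the input, so calls can run
// in parallel with no locks -- exactly what Spark exploits.
def square(x: Int): Int = x * x

// Impure: reads and writes shared mutable state, so parallel
// calls would race and the result depends on execution order.
var total = 0
def addToTotal(x: Int): Int = { total += x; total }

List(1, 2, 3).map(square)      // always List(1, 4, 9)
List(1, 2, 3).map(addToTotal)  // order-dependent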
8. FP Quick Tour In Scala
• Creating collections:
val array = new Array[Int](10)
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• Indexing:
array(0) = 1
println(list(0))
• Anonymous functions:
val multiply = (x: Int, y: Int) => x * y
val procedure = { x: Int =>
println("Hello, " + x)
println(x * 10)
}
9. FP Quick Tour In Scala
• Scala closure syntax:
(x: Int) => x * 10 // full version
x => x * 10 // type inference
_ * 10 // underscore syntax
x => { // body is block of code
val y = 10
x * y
}
10. FP Quick Tour In Scala
• Processing collections:
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
list.foreach(x => println(x))
list.map(_ * 10)
list.filter(x => x % 2 == 0)
list.reduce((x, y) => x + y)
list.reduce(_ + _)
def f(x: Int) = List(x-1, x, x+1)
list.map(x => f(x))
list.map(f(_))
list.flatMap(x => f(x))
list.map(x => f(x)).reduce(_ ++ _)
11. Spark Quick Tour
• Spark context:
• Entry point to Spark functionality
• In spark-shell, created automatically as sc
• In a standalone Spark program, we must create it ourselves
• Resilient distributed datasets (RDDs):
• A distributed memory abstraction
• A logically centralized entity, but physically partitioned across multiple
machines in a cluster, based on some notion of key
• Immutable
• Automatically rebuilt on failure
• Cached partitions evicted with an LRU (Least Recently Used) policy
(see the sketch below)
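A short sketch of these properties in spark-shell (the file is the deck's example; asking for 4 partitions is my assumption):

// Partitioned: ask for 4 partitions, spread across the cluster.
val rdd = sc.textFile("hdfs://localhost/test/tobe.txt", 4)
println(rdd.partitions.length) // 4

// Immutable: map returns a new RDD; rdd itself never changes.
val upper = rdd.map(_.toUpperCase)

// If a cached partition is evicted or lost, Spark rebuilds it
// from the lineage of transformations rather than replicating data.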
14. Spark Quick Tour
• Transformations:
• Lazy operations to build RDDs from other RDDs
• Narrow transformations (involve no data shuffling):
• map
• flatMap
• filter
• Wide transformations (involve data shuffling):
• sortByKey
• reduceByKey
• groupByKey
• Actions:
• Return a result or write it to storage
• collect
• count
• take(n)
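The distinction shows up directly in code; a minimal sketch (the sample words are mine): nothing runs until the action, and only reduceByKey forces a shuffle.

val words = sc.parallelize(Seq("to", "be", "or", "not", "to", "be"))
val pairs = words.map(w => (w, 1))     // narrow: no shuffle
val counts = pairs.reduceByKey(_ + _)  // wide: shuffles data by key
counts.collect()                       // action: triggers the lazy
                                       // transformations above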
16. Spark Quick Tour
• Creating RDDs:
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val textFile = sc.textFile("hdfs://localhost/test/tobe.txt")
val textFile = sc.textFile("hdfs://localhost/test/*.txt")
• Basic transformations:
val squares = numbers.map(x => x * x)
val evens = squares.filter(_ % 2 == 0)
val mapto = numbers.flatMap(x => 1 to x)
val words = textFile.flatMap(_.split(" ")).cache()
(Slide callouts: parallelize turns a local collection into a base RDD;
map, filter, and flatMap produce transformed RDDs.)
17. Spark Quick Tour
• Basic actions:
words.collect()
words.take(5)
words.count()
words.reduce(_ + _)
words.filter(_ == "be").count()
words.filter(_ == "or").count()
words.saveAsTextFile("hdfs://localhost/test/result")
(Slide callout: because words was cached, the repeated filter-and-count
actions reuse the in-memory data instead of re-reading the file from HDFS.)
18. Spark Quick Tour
• Pair syntax:
val pair = ("cat", 1)
• Accessing pair elements:
pair._1
pair._2
• Key-value operations:
val pets = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))
pets.reduceByKey(_ + _)
pets.groupByKey()
pets.sortByKey()
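A sketch of what collect() would return for the pets RDD above (exact ordering across partitions may vary):

pets.reduceByKey(_ + _).collect()  // sums per key: (cat,4), (dog,2)
pets.groupByKey().collect()        // values grouped per key: (cat,[1, 3]), (dog,[2])
pets.sortByKey().collect()         // ordered by key: (cat,1), (cat,3), (dog,2)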
19. Hello World
val logFile = "hdfs://localhost/test/tobe.txt"
val logData = sc.textFile(logFile).cache()
val wordCount = logData.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs://localhost/wordcount/result")
sc.stop()