14. SCALDING
class WordCount(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { group => group.size }
    .write(Tsv(args("output")))
}
15. SCALDING : Clustering with Mahout
// Streaming k-means from Mahout: a fast "sloppy" first pass over the input
lazy val clust = new StreamingKMeans(
  new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
  args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

var count = 0  // mutable: assigns an index to each centroid
val sloppyClusters =
  TextLine(args("input"))
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
      cl.cluster(cent)
      cl
    }
    .flatMap(c => c.iterator.asScala.toIterable)
16. SCALDING : Clustering with Mahout
// Final pass: ball k-means over the sloppy centroids, on a single reducer
val finalClusters = sloppyClusters.groupAll
  .mapValueStream { centList =>
    lazy val bclusterer = new BallKMeans(
      new BruteSearch(new EuclideanDistanceMeasure),
      args("numclusters").toInt, 100)
    bclusterer.cluster(centList.toList.asJava)
    bclusterer.iterator.asScala
  }
  .values
17. Scalding
- Two APIs: the Fields-based API and the Typed API
- Fields API: project, map, discard, groupBy, …
- Typed API: TypedPipe[T], which works like scala.collection.Iterator[T] (a minimal sketch follows below)
- Matrix library
- ALGEBIRD: abstract algebra library … we'll talk about it later
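A minimal sketch of the slide-14 word count rewritten against the Typed API (assuming Scalding's stock TypedPipe, TextLine and TypedTsv; the job name is illustrative):

import com.twitter.scalding._

class TypedWordCount(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.split("\\s+") }   // TypedPipe[String] of words
    .map { word => (word, 1L) }
    .sumByKey                                 // needs an implicit Semigroup[Long], provided by Algebird
    .write(TypedTsv[(String, Long)](args("output")))
}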
20. STORM
- Distributed, fault-tolerant, real-time stream computation engine.
- Four concepts:
  - Streams: infinite sequences of tuples
  - Spouts: sources of streams
  - Bolts: process and produce streams; can do filtering, aggregations, joins, …
  - Topologies: define a flow or network of spouts and bolts (a minimal wiring sketch follows)
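A minimal topology-wiring sketch in Scala (assuming the pre-Apache backtype.storm API; SentenceSpout is a hypothetical spout emitting a "sentence" field, and the ScalaStorm SplitSentence bolt from slide 24 should plug in the same way):

import backtype.storm.{Config, LocalCluster}
import backtype.storm.topology.TopologyBuilder

object WordSplitTopology extends App {
  val builder = new TopologyBuilder
  builder.setSpout("sentences", new SentenceSpout, 1)   // hypothetical spout
  builder.setBolt("split", new SplitSentence, 4).shuffleGrouping("sentences")
  // run locally for testing; a production topology would use StormSubmitter instead
  new LocalCluster().submitTopology("word-split", new Config, builder.createTopology())
}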
23. Trident
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(),
            new Count(), new Fields("count"))
        .parallelismHint(6);
24. ScalaStorm by Evan Chan
class SplitSentence extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(line: String) =>
      line.split(" ").foreach { word => using anchor t emit (word) }
      t ack
  }
}
37. What is Spark?
- Fast and expressive cluster computing system, compatible with Apache Hadoop but an order of magnitude faster
- Improves efficiency through:
  - General execution graphs
  - In-memory storage
- Improves usability through:
  - Rich APIs in Java, Scala, Python
  - Interactive shell
38. Key idea
- Write programs in terms of transformations on distributed datasets
- Concept: resilient distributed datasets (RDDs), sketched below
  - Collections of objects spread across a cluster
  - Built through parallel transformations (map, filter, etc.)
  - Automatically rebuilt on failure
  - Controllable persistence (e.g. caching in RAM)
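A minimal sketch of these properties in code (standard Spark core API; master and app name are illustrative):

import org.apache.spark.SparkContext

object RddSketch extends App {
  val sc = new SparkContext("local[2]", "rdd-sketch")
  val nums = sc.parallelize(1 to 1000000)   // collection spread across the cluster
  val evens = nums.filter(_ % 2 == 0)       // built through a parallel transformation
  evens.cache()                             // controllable persistence: keep in RAM
  println(evens.count())                    // lost partitions are rebuilt from lineage
}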
41. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                 // base RDD
errors = lines.filter(s => s.startsWith("ERROR"))    // transformed RDD
messages = errors.map(s => s.split("\t"))
messages.cache()

messages.filter(s => s.contains("foo")).count()      // action: triggers computation on the cluster
messages.filter(s => s.contains("bar")).count()      // second query reads from the cache

[Diagram: the driver ships tasks to workers; each worker reads its block of the file (Block 1-3) and keeps its partition of messages in memory (Cache 1-3)]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5 sec (vs 180 sec for on-disk data)
42. Fault Recovery
RDDs track lineage information that can be
used to efficiently recompute lost data
Ex:
msgs = textFile.filter(_.startsWith("ERROR"))
               .map(_.split("\t"))

[Lineage diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
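One way to see the lineage Spark tracks (toDebugString is part of the public RDD API):

println(msgs.toDebugString)   // prints the chain of parent RDDs this RDD was built from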
43. Spark Streaming
- Extends Spark's capabilities to large-scale stream processing
- Scales to hundreds of nodes and achieves second-scale latencies
- Efficient and fault-tolerant stateful stream processing
- Simple batch-like API for implementing complex algorithms
44. Discretized Stream Processing
- Chop up the live data stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- The processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
45. Discretized Stream Processing
- Batch sizes as low as ½ second, latency of about 1 second
- Potential for combining batch processing and stream processing in the same system

[Diagram: same pipeline as slide 44]
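A minimal sketch of setting up the streaming context (using the pre-1.0 constructor style that also appears on slide 51; master and app name are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "stream-sketch", Seconds(1))   // 1-second batches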
46. Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream as batch @ t, batch @ t+1, batch @ t+2; each batch is stored in memory as an RDD (immutable, distributed)]
47. Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream ([#cat, #dog, …]); new RDDs are created for every batch]
48. Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data

[Diagram: flatMap on each batch, then foreach on each resulting RDD. Write to a database, update an analytics UI, do whatever you want]
49. Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: flatMap on each batch, then save; every batch is saved to HDFS]
50. Window-based Transformations
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window length = Minutes(1), sliding interval = Seconds(5)

[Diagram: a window of the given length slides over the DStream by the sliding interval]
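The same computation can also be written with the fused windowed operator from the DStream API (note that windowed stateful operations require a checkpoint directory, e.g. ssc.checkpoint(dir)):

val tagCounts = hashTags.countByValueAndWindow(Minutes(1), Seconds(5))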
51. Compute TopK IP addresses
val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY_ONLY, ..)
val addresses = stream.map(ipAddress => ipAddress.getText)

// Count-Min Sketch monoid from Algebird: approximate frequency counts in bounded memory
val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
var globalCMS = cms.zero              // running sketch, merged across batches below
val mm = new MapMonoid[Long, Int]()

// sketch each id per partition, then merge the partial sketches with the monoid's ++
val topAddresses = addresses.mapPartitions(ids => {
  ids.map(id => cms.create(id))
})
.reduce(_ ++ _)
52. topAddresses.foreach(rdd => {
  if (rdd.count() != 0) {
    // first() is this batch's merged sketch
    val partial = rdd.first()
    val partialTopK = partial.heavyHitters.map(id =>
        (id, partial.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    globalCMS ++= partial             // fold this batch into the running global sketch
    val globalTopK = globalCMS.heavyHitters.map(id =>
        (id, globalCMS.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    println(globalTopK.mkString("[", ",", "]"))
  }
})
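The constants above are never defined on the slides; a plausible set of values (hypothetical, modeled on common Count-Min Sketch settings):

// Hypothetical values, not from the slides:
val DELTA = 1E-3   // probability that the error bound is exceeded
val EPS = 0.01     // error bound, as a fraction of the total count
val SEED = 1       // seed for the sketch's hash functions
val PERC = 0.001   // heavy-hitter threshold: track ids above this fraction of the count
val TOPK = 10      // number of heavy hitters to display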