3. Who am I?
• Wisely Chen ( thegiive@gmail.com )
• Sr. Engineer in Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013,
PHPConf 2012, RubyConf 2012
9. Opinion from Cloudera
• The leading candidate for “successor to
MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch
up. Chasing Spark would be a waste of time,
and would delay availability of real-time analytic
and processing services for no good reason.
• From http://0rz.tw/y3OfM
10. What is Spark
• From UC Berkeley AMP Lab
• The most active big-data open source project since Hadoop
15. Spark vs Hadoop
• Spark runs on YARN, Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
19. 3X~25X faster than the MapReduce framework
From Matei's paper: http://0rz.tw/VVqgP

Running time (s)        MR    Spark
Logistic regression     76        3
KMeans                 106       33
PageRank               171       23
20. What is Spark
• Apache Spark™ is a fast and general
engine for large-scale data processing
22. HDFS
• ~100X slower than memory
• Stores data across network and disk
• The network is ~100X slower than memory
• Implements fault tolerance via replication
25. PageRank on 1 billion URL records
• 1st iteration takes 200 sec
• 2nd iteration takes 20 sec
• 3rd iteration takes 20 sec
26. RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster,
stored in RAM or on Disk
• Built through parallel transformations
28. RDD
val a = sc.textFile("hdfs://....")                   // RDD a
val b = a.filter( line => line.contains("Spark") )   // RDD b  (transformation)
val c = b.count()                                    // Value c (action)
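The transformation/action split above can be mimicked with plain Scala (this is an analogy, not the Spark API): like an RDD transformation, `Iterator.filter` is lazy and does no work until an action-like call consumes it.

```scala
// Lazy "transformation" vs eager "action" sketch (plain Scala, not Spark).
var linesRead = 0
def lines: Iterator[String] =
  Iterator("INFO ok", "Spark is fast", "use Spark").map { l => linesRead += 1; l }

val b = lines.filter(_.contains("Spark")) // "transformation": nothing runs yet
assert(linesRead == 0)

val c = b.count(_ => true)                // "action": forces the whole pipeline
assert(c == 2 && linesRead == 3)
```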
29. Log mining
val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") )
           .count()
val ap = err.filter( t => t.contains("APACHE") )
            .count()

(Diagram: the driver dispatches tasks to three workers)
30. Log mining
(Same code as slide 29. Diagram: each worker reads one HDFS block, Block1-3, into a partition of RDD a)
31. Log mining
(Same code as slide 29. Diagram: the filter transformations build the RDD err partitions from Block1-3)
32. Log mining
(Same code as slide 29. Diagram: err.count() runs as tasks over the RDD err partitions and the result returns to the driver)
33. Log mining
(Same code as slide 29. Diagram: err.cache() keeps each RDD err partition in worker memory, Cache1-3)
34. Log mining
(Same code as slide 29. Diagram: the MYSQL filter builds RDD m directly from the cached partitions)
35. Log mining
(Same code as slide 29. Diagram: the APACHE filter is likewise computed from the cached partitions, with no HDFS re-read)
37. RDD Cache
• Data locality
• Cache
• Self-join of 5 billion records: the big shuffle takes 20 min; after caching, it takes only 265 ms
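The cache effect can be sketched in plain Scala (not the Spark API): without caching, every action rescans the source; caching materializes the filtered data once and later actions reuse it.

```scala
// Cache sketch: actions re-run the pipeline unless results are materialized.
var scans = 0
def source: Iterator[String] = { scans += 1; Iterator("ERROR a", "INFO b", "ERROR c") }

// Uncached: each "action" rescans the source.
val n1 = source.count(_.contains("ERROR"))
val n2 = source.count(_.contains("ERROR"))
assert(scans == 2 && n1 == 2 && n2 == 2)

// "Cached": materialize the filtered data once, then reuse it.
val cached = source.filter(_.contains("ERROR")).toList
val n3 = cached.size
val n4 = cached.size
assert(scans == 3 && n3 == 2 && n4 == 2)
```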
38. Easy to use
• Interactive Shell
• Multi-language API
• JVM: Scala, Java
• PySpark: Python
39. Scala Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
40. Step by Step
• file.flatMap(line => line.split(" ")) => (aaa, bb, cc)
• .map(word => (word, 1)) => ((aaa,1), (bb,1), ...)
• .reduceByKey(_ + _) => ((aaa,123), (bb,23), ...)
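The same three steps run on ordinary Scala collections (sample input is made up; `reduceByKey` is replaced by a local `groupBy` + sum, since plain collections have no such method):

```scala
// Word count on plain Scala collections, mirroring the RDD pipeline above.
val file = List("aaa bb", "bb cc aaa")
val counts = file
  .flatMap(line => line.split(" "))                  // List(aaa, bb, bb, cc, aaa)
  .map(word => (word, 1))                            // List((aaa,1), (bb,1), ...)
  .groupBy(_._1)                                     // local stand-in for reduceByKey
  .map { case (w, ones) => (w, ones.map(_._2).sum) }
assert(counts == Map("aaa" -> 2, "bb" -> 2, "cc" -> 1))
```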
41. Java Wordcount
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
42. Java vs Scala
• Scala: file.flatMap(line => line.split(" "))
• Java version:
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});
43. Python
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
45. FYI
• Combiner: reduceByKey(_ + _)
• Typical word count without a combiner:
  groupByKey().mapValues { arr =>
    var r = 0; arr.foreach { i => r += i }; r
  }
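Both versions produce the same counts; the difference is that reduceByKey can combine partial sums map-side before the shuffle, while groupByKey ships every (word, 1) pair and materializes full value lists. The two reduction styles, sketched locally in plain Scala on hypothetical data:

```scala
val pairs = List(("aaa", 1), ("bb", 1), ("aaa", 1))

// groupByKey-style: collect all values per key, then sum the lists.
val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// reduceByKey-style: fold into running totals, never holding full value lists.
val reduced = pairs.foldLeft(Map.empty[String, Int]) {
  case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v)
}
assert(grouped == reduced && reduced == Map("aaa" -> 2, "bb" -> 1))
```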
48. • Check in on Facebook with the Yahoo! hiring message to receive a Yahoo! rubber duck
• Check in on Facebook saying "The Yahoo! app is awesome!!" with a screenshot of the Shopping Mall or News app to receive a duck wristband or a shopping bag
49. Just memory?
• From Matei’s paper: http://0rz.tw/VVqgP
• HBM: stores data in an in-memory HDFS instance.
• SP: Spark
• HBM'1, SP'1: first run
• Storage: HDFS with 256 MB blocks
• Node information
• m1.xlarge EC2 nodes
• 4 cores
• 15 GB of RAM
50. 100GB data on 100 node cluster

Running time (s)        HBM'1   HBM   SP'1   SP
Logistic regression       139    62     46    3
KMeans                    182    87     82   33
51. There is more
• General DAG scheduler
• Control over partitioning and shuffle
• Fast driver-to-worker RPC to launch tasks
• For more info, check http://0rz.tw/jwYwI