AI and Big Data Processing: Hands-on Spark (AI與大數據數據處理 Spark實戰, 2017-12-16)

  1. 1. Practical Machine Learning in Spark Chih-Chieh Hung Tamkang University
  2. 2. Chih-Chieh Hung 洪智傑 • Tamkang University (Assistant Professor) 2016- • Rakuten Inc., Japan (Data Scientist) 2013-2015 • Yahoo! Inc., Taiwan (Research Engineer) 2011-2013 • Microsoft Research Asia, China (Research Intern) 2010
  3. 3. Something About Big Data
  4. 4. Big Data Definition • No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
  5. 5. Scale (Volume) • Data Volume • 44x increase from 2009 to 2020 • From 0.8 ZB to 35 ZB • Data volume is increasing exponentially
  6. 6. Complexity (Variety) • Various formats, types, and structures • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc… • Static data vs. streaming data • A single application can be generating/collecting many types of data • To extract knowledge, all these types of data need to be linked together
  7. 7. Speed (Velocity) • Data is being generated fast and needs to be processed fast • Online Data Analytics • Late decisions → missing opportunities
  8. 8. Four V Challenges in Big Data *. http://www-05.ibm.com/fr/events/netezzaDM_2012/Solutions_Big_Data.pdf
  9. 9. Apache Hadoop Stack
  10. 10. Apache Hadoop • The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. • Three major modules: • Hadoop Distributed File System (HDFS™): A distributed file system that provides high- throughput access to application data. • Hadoop YARN: A framework for job scheduling and cluster resource management. • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  11. 11. Hadoop Components: HDFS • File system • Sits on top of a native file system • Based on Google’s GFS • Provides redundant storage • Read/Write • Good at large, sequential reads • Files are “write once” • Components • NameNode: stores the metadata of files • DataNodes: store the actual blocks • Secondary NameNode: merges the fsimage and the edits log files periodically and keeps the edits log size within a limit
  12. 12. Hadoop Components: YARN • Manage resource (Data operating system). • YARN = Yet Another Resource Negotiator • Manage and monitor workloads • Maintain a multi-tenant platform. • Implement security control. • Support multiple processing models in addition to MapReduce.
  13. 13. Hadoop Components: MapReduce • Process data in the cluster. • Two phases: Map + Reduce • Between the two is the “shuffle-and-sort” stage • Map • Operates on a discrete portion of the overall dataset • Reduce • After all maps are complete, the intermediate data are shuffled to the nodes that perform the Reduce phase.
  14. 14. The MapReduce Framework
  15. 15. MapReduce Algorithm For Word Count • Input and Output
  16. 16. Step 1: Design Mapper (Must Implement) • Write the mapper: output the key-value pair <word, 1>
  17. 17. Step 2: Sort and Shuffle (Don’t Need to Do) • The values with the same key will send to the same reducer.
  18. 18. Step 3: Design Reducer (Must Implement) • Write reducer as: (word, sum of all the values)
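Slides 16–18 describe the mapper and reducer without showing code. Below is a minimal Hadoop Streaming–style sketch in Python (the file names mapper.py and reducer.py are hypothetical, not from the deck), just to make steps 1 and 3 concrete:

# mapper.py -- step 1: emit the key-value pair <word, 1> for every word on stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py -- step 3: input arrives sorted by key; sum all values for each word
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))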
  19. 19. Spark
  20. 20. What is Spark? Efficient • General execution graphs • In-memory storage Usable • Rich APIs in Java, Scala, Python • Interactive shell • Fast and Expressive Cluster Computing System Compatible with Apache Hadoop
  21. 21. Key Concepts Resilient Distributed Datasets • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations • Automatically rebuilt on failure Operations • Transformations (e.g. map, filter, groupBy) • Actions (e.g. count, collect, save) • Write programs in terms of transformations on distributed datasets
  22. 22. Language Support Standalone Programs • Python, Scala, & Java Interactive Shells • Python & Scala Performance • Java & Scala are faster due to static typing • …but Python is often fine Python lines = sc.textFile(...) lines.filter(lambda s: "ERROR" in s).count() Scala val lines = sc.textFile(...) lines.filter(x => x.contains("ERROR")).count() Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains("error"); } }).count();
  23. 23. Spark Ecosystem
  24. 24. A Simple Example of a Spark App import sys from pyspark import SparkContext if __name__ == "__main__": sc = SparkContext("local", "WordCount", sys.argv[0], None) lines = sc.textFile(sys.argv[1]) counts = lines.flatMap(lambda s: s.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda x, y: x + y) counts.saveAsTextFile(sys.argv[2])
  25. 25. SparkContext • Main entry point • SparkContext is the object that manages the connection to the clusters in Spark and coordinates running processes on the clusters themselves. SparkContext connects to cluster managers, which manage the actual executors that run the specific computations
  26. 26. SparkContext • Main entry point to Spark functionality • Available in shell as variable sc • In standalone programs, you’d make your own (see later for details)
  27. 27. Create SparkContext: Local Mode • Very simple
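The local-mode slide above is an image in the original deck; a minimal PySpark sketch (the app name "MyApp" is just a placeholder):

from pyspark import SparkContext
# Run Spark locally, using all available cores ("local[*]")
sc = SparkContext(master="local[*]", appName="MyApp")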
  28. 28. Create SparkContext: Cluster Mode • Need to write SparkConf about the clusters
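A hedged sketch of cluster mode, where a SparkConf describes the cluster (the master URL and memory setting below are placeholders, not values from the deck):

from pyspark import SparkConf, SparkContext
# Point setMaster at your own cluster manager (standalone, YARN, Mesos, ...)
conf = (SparkConf()
        .setMaster("spark://master-host:7077")
        .setAppName("MyClusterApp")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)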
  29. 29. Resilient Distributed Datasets (RDD) • An RDD is Spark's representation of a dataset that is distributed across the RAM, or memory, of lots of machines. • An RDD object is essentially a collection of elements that you can use to hold lists of tuples, dictionaries, lists, etc. • Lazy evaluation: the ability to lazily evaluate code, postponing running a calculation until absolutely necessary.
  30. 30. Working with RDDs
  31. 31. Transformations and Actions in Spark • RDDs have actions, which return values, and transformations, which return pointers to new RDDs. • An RDD’s contents are only computed when that RDD is evaluated as part of an action.
  32. 32. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() messages.filter(lambda s: "php" in s).count() (Diagram: the driver sends tasks to workers, each holding one block of the file; results come back and the messages RDD stays cached on each worker. Base RDD → transformed RDD → action.) Full-text search of Wikipedia: 60 GB on 20 EC2 machines, 0.5 s from cache vs. 20 s on disk
  33. 33. Creating RDDs # Turn a Python collection into an RDD >sc.parallelize([1, 2, 3]) # Load text file from local FS, HDFS, or S3 >sc.textFile("file.txt") >sc.textFile("directory/*.txt") >sc.textFile("hdfs://namenode:9000/path/file") # Use existing Hadoop InputFormat (Java/Scala only) >sc.hadoopFile(keyClass, valClass, inputFmt, conf)
  34. 34. Most Widely-Used Action and Transformation
  35. 35. Transformation
  36. 36. Basic Transformations >nums = sc.parallelize([1, 2, 3]) # Pass each element through a function >squares = nums.map(lambda x: x*x) // {1, 4, 9} # Keep elements passing a predicate >even = squares.filter(lambda x: x % 2 == 0) // {4} # Map each element to zero or more others >nums.flatMap(lambda x: range(x)) # => {0, 0, 1, 0, 1, 2} range(x) yields the sequence of numbers 0, 1, …, x-1
  37. 37. map() and flatMap() • map() The map() transformation applies a function to each line of the RDD and returns the transformed RDD as an iterable of iterables, i.e. each line maps to one iterable and the entire RDD is itself a list of them
  38. 38. map() and flatMap() • flatMap() This transformation applies a function to each line, as map() does, but the result is not an iterable of iterables: the outputs are flattened into a single iterable holding the entire RDD contents.
  39. 39. map() and flatMap() examples >lines.take(2) ['#good d#ay #', '#good #weather'] >words = lines.map(lambda line: line.split(' ')) [['#good', 'd#ay', '#'], ['#good', '#weather']] >words = lines.flatMap(lambda line: line.split(' ')) ['#good', 'd#ay', '#', '#good', '#weather']
  40. 40. Filter() • The filter() transformation builds a new RDD containing only the elements of the old RDD that satisfy a given condition.
  41. 41. Filter() example • How to filter out hashtags from words >hashtags = words.filter(lambda word: word.startswith("#")).filter(lambda word: word != "#") ['#good', '#good', '#weather']
  42. 42. Join() • Return an RDD containing all pairs of elements having the same key in the original RDDs
  43. 43. Join() Example
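The Join() example slide is an image in the original deck; here is a small hedged sketch of what join() does (the RDD contents are made up for illustration; output ordering may differ):

# Two pair RDDs keyed by user id
ratings = sc.parallelize([(1, 4.0), (2, 3.5), (1, 5.0)])
names   = sc.parallelize([(1, "alice"), (2, "bob")])
# join() pairs up values that share the same key
print(ratings.join(names).collect())
# => [(1, (4.0, 'alice')), (1, (5.0, 'alice')), (2, (3.5, 'bob'))]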
  44. 44. KeyBy() • Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-defined function.
  45. 45. KeyBy() examples
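The KeyBy() examples slide is also an image; a minimal sketch with illustrative data, keying each word by its first letter:

words = sc.parallelize(["apple", "banana", "avocado"])
# keyBy() builds a pair RDD: the key is computed by the user function, the value is the original item
print(words.keyBy(lambda w: w[0]).collect())
# => [('a', 'apple'), ('b', 'banana'), ('a', 'avocado')]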
  46. 46. GroupBy() • Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.
  47. 47. GroupBy() example
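Again the example slide is an image; a hedged sketch of groupBy(), grouping numbers by parity (illustrative data, output order may vary):

nums = sc.parallelize([1, 2, 3, 4, 5])
# groupBy() keys each element by the output of the user function (here: x % 2)
groups = nums.groupBy(lambda x: x % 2).mapValues(list)
print(groups.collect())   # e.g. [(0, [2, 4]), (1, [1, 3, 5])]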
  48. 48. GroupByKey() • Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.
  49. 49. GroupByKey() example
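A small sketch of groupByKey(), reusing the hashtag-style pairs from the earlier slides (illustrative data):

pairs = sc.parallelize([("#good", 1), ("#good", 1), ("#weather", 1)])
# groupByKey() collects all values for each key into one group
print(pairs.groupByKey().mapValues(list).collect())
# => [('#good', [1, 1]), ('#weather', [1])]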
  50. 50. ReduceByKey() • reduceByKey(f) combines tuples with the same key using the function f we specify. >hashtagsNum = hashtags.map(lambda word: (word, 1)) [('#good', 1), ('#good', 1), ('#weather', 1)] >hashtagsCount = hashtagsNum.reduceByKey(lambda a, b: a + b) [('#good', 2), ('#weather', 1)]
  51. 51. The Difference between GroupByKey() and ReduceByKey()
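The slide above is a diagram in the original deck. A hedged sketch of the difference: both calls below produce the same counts, but reduceByKey() combines values on each partition before the shuffle, while groupByKey() ships every value across the network first.

pairs = sc.parallelize([("#good", 1), ("#good", 1), ("#weather", 1)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # combines map-side, then shuffles partial sums
print(pairs.groupByKey().mapValues(sum).collect())       # shuffles all values, then sums on the reducer side
# Both print [('#good', 2), ('#weather', 1)], but reduceByKey moves far less data.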
  52. 52. Example: Word Count > lines = sc.textFile("hamlet.txt") > counts = lines.flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda x, y: x + y) "to be or" "not to be" → "to" "be" "or" "not" "to" "be" → (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) → (be, 2) (not, 1) (or, 1) (to, 2)
  53. 53. Actions
  54. 54. Basic Actions >nums = sc.parallelize([1, 2, 3]) # Retrieve RDD contents as a local collection >nums.collect() # => [1, 2, 3] # Return first K elements >nums.take(2) # => [1, 2] # Count number of elements >nums.count() # => 3 # Merge elements with an associative function >nums.reduce(lambda x, y: x + y) # => 6 # Write elements to a text file >nums.saveAsTextFile("hdfs://file.txt")
  55. 55. Collect() • Return all elements of the RDD to the driver as a single list • Avoid this on a big RDD, since everything must fit in the driver’s memory.
  56. 56. Reduce() • Aggregate all the elements of the RDD by applying a user function pairwise to elements and partial results, and return the result to the driver.
  57. 57. Aggregate() • Aggregate all elements of the RDD by: • Applying a user function seqOp to combine elements with user-supplied objects • Then combining those user-defined results via a second user function combOp • And finally returning a result to the driver
  58. 58. Aggregate(): Using the seqOp in each partition
  59. 59. Aggregate(): Using combOp among Partitions
  60. 60. Aggregate() example
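Slides 58–60 are diagrams in the original deck; a hedged sketch of aggregate() computing a (sum, count) pair over two partitions (the data and partition count are illustrative):

nums = sc.parallelize([1, 2, 3, 4], 2)
# seqOp folds each element into the per-partition accumulator; combOp merges partition accumulators
seqOp  = lambda acc, x: (acc[0] + x, acc[1] + 1)
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])
print(nums.aggregate((0, 0), seqOp, combOp))   # => (10, 4), i.e. sum 10 over 4 elements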
  61. 61. More RDD Operators • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip sample take first partitionBy mapWith pipe save ...
  62. 62. Lab 1
  63. 63. Example: PageRank • Good example of a more complex algorithm • Multiple stages of map & reduce • Benefits from Spark’s in-memory caching • Multiple iterations over the same data
  64. 64. Basic Idea Give pages ranks (scores) based on links to them • Links from many pages → high rank • Link from a high-rank page → high rank Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
  65. 65. Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3. Set each page’s rank to 0.15 + 0.85 × contribs (diagram: all four pages start with rank 1.0)
  66. 66. Algorithm (same three steps; diagram: each page sends contributions of 0.5 or 1 to its neighbors)
  67. 67. Algorithm (same three steps; diagram: ranks after one iteration are 0.58, 1.0, 1.85, 0.58)
  68. 68. Algorithm (same three steps; diagram: the new ranks again send out contributions such as 0.29 and 0.5)
  69. 69. Algorithm (same three steps; diagram: ranks after further iterations are 0.39, 1.72, 1.31, 0.58, …)
  70. 70. Algorithm (same three steps; diagram, final state: ranks converge to 0.46, 1.37, 1.44, 0.73)
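Before Lab 2, here is a minimal PySpark sketch of the algorithm above, assuming a tiny hand-made link structure (the page names and the number of iterations are illustrative, not from the deck):

# Adjacency list: page -> list of pages it links to
links = sc.parallelize([("A", ["B", "C"]), ("B", ["C"]), ("C", ["A"]), ("D", ["C"])]).cache()
ranks = links.mapValues(lambda _: 1.0)           # step 1: every page starts at rank 1

for _ in range(10):                               # step 2: iterate
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    # step 3: new rank = 0.15 + 0.85 * sum of received contributions
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(lambda c: 0.15 + 0.85 * c)

print(ranks.collect())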
  71. 71. Lab 2
  72. 72. Machine Learning in 30 min
  73. 73. Machine Learning is… • Machine learning is about predicting the future based on the past. -- Hal Daume III (Diagram: past training data → train a model/predictor; the model then predicts on future testing data.)
  74. 74. Machine Learning Types
  75. 75. Supervised vs. Unsupervised Learning
  76. 76. Reinforcement Learning
  77. 77. General Flow for Machine Learning
  78. 78. Training Data, Testing Data, Validation Data • Training data: used to train a model (we have) • Testing data: test the performance of a model (we don’t have) • Validation data: “artificial” testing data (we have)
  79. 79. Model Evaluation: What Are We Seeking? • Minimize the error between training data and the model
  80. 80. Example: The Error of The Model
  81. 81. General Flow of Training and Testing
  82. 82. Classification Concept
  83. 83. Supervised Learning in A Nutshell • Try to think how you learn when you were a baby. Mom taught you…
  84. 84. Supervised Learning in A Nutshell • What is it?
  85. 85. Supervised Learning in A Nutshell • Training data • Testing data Label Features Features Rabbit! Label (We Guessed)
  86. 86. Handwritten Recognition • Input: 1. hand-written words and labels, 2. a hand-written word W • Output: the label of W ?
  87. 87. General Classification Flow
  88. 88. Before Hands-on
  89. 89. What is MLlib • MLlib is an Apache Spark component focusing on machine learning: • MLlib is Spark’s core ML library • Developed by the MLbase team in AMPLab • 80+ contributors from various organizations • Supports Scala, Python, and Java APIs
  90. 90. Spark Ecosystem
  91. 91. Algorithms in MLlib • Statistics: Description, correlation • Clustering: k-means • Classification: SVMs, naive Bayes, decision tree, logistic regression • Regression: linear regression (+lasso, +ridge) • Dimensionality: SVD, PCA • Optimization Primitives: SGD, Parallel Gradient • Collaborative filtering: ALS
  92. 92. Why MLlib • Scalability • Performance • User-friendly documentation and APIs • Cost of maintenance
  93. 93. Performance
  94. 94. Data Type • Dense vector • Sparse vector • Labeled point
  95. 95. Dense & Sparse • Raw Data:
ID  A  B  C  D  E  F
 1  1  0  0  0  0  3
 2  0  1  0  1  0  2
 3  1  1  1  0  1  1
  96. 96. Dense vs Sparse • A case study - number of examples: 12 million - number of features: 500 - sparsity: 10% • The sparse format not only saves storage, but also gave a 4x speed-up
         Dense   Sparse
Storage  47 GB   7 GB
Time     240 s   58 s
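A hedged sketch of how row 1 of the table two slides back could be written densely and sparsely with MLlib vectors:

from pyspark.mllib.linalg import Vectors
# Dense: store every value, including the zeros
dense_row = Vectors.dense([1.0, 0.0, 0.0, 0.0, 0.0, 3.0])
# Sparse: store only (size, indices of non-zeros, non-zero values)
sparse_row = Vectors.sparse(6, [0, 5], [1.0, 3.0])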
  97. 97. Labeled Point • Dummy variable (1,0) • Categorical variable (0, 1, 2, …) from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint # Create a labeled point with a positive label and a dense feature vector. pos = LabeledPoint(1.0, [1.0, 0.0, 3.0]) # Create a labeled point with a negative label and a sparse feature vector. neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
  98. 98. Descriptive Statistics • Supported function: - count - max - min - mean - variance … • Supported data types - Dense - Sparse - Labeled Point
  99. 99. Example from pyspark.mllib.stat import Statistics from pyspark.mllib.linalg import Vectors from math import sqrt import numpy as np ## example data (2 x 2 matrix at least) data = np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]]) ## to RDD distData = sc.parallelize(data) ## Compute Statistic Value summary = Statistics.colStats(distData) print "Duration Statistics:" print " Mean: {}".format(round(summary.mean()[0],3)) print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3)) print " Max value: {}".format(round(summary.max()[0],3)) print " Min value: {}".format(round(summary.min()[0],3)) print " Total value count: {}".format(summary.count()) print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
  100. 100. Classification Algorithms
  101. 101. 1. Naïve Bayesian Classification • Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes’ theorem: P(h|D) = P(D|h) P(h) / P(D) • MAP (maximum a posteriori) hypothesis: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
  102. 102. Play-Tennis Example • Given the training set below and an unseen sample X = <rain, hot, high, false>, what class will X be?
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
  103. 103. Training Step: Compute Probabilities • From the training set (same table as the previous slide) we can compute: P(p) = 9/14, P(n) = 5/14
Outlook:     P(sunny|p) = 2/9,  P(sunny|n) = 3/5;  P(overcast|p) = 4/9,  P(overcast|n) = 0;  P(rain|p) = 3/9,  P(rain|n) = 2/5
Temperature: P(hot|p) = 2/9,  P(hot|n) = 2/5;  P(mild|p) = 4/9,  P(mild|n) = 2/5;  P(cool|p) = 3/9,  P(cool|n) = 1/5
Humidity:    P(high|p) = 3/9,  P(high|n) = 4/5;  P(normal|p) = 6/9,  P(normal|n) = 2/5
Windy:       P(true|p) = 3/9,  P(true|n) = 3/5;  P(false|p) = 6/9,  P(false|n) = 2/5
  104. 104. Prediction Step • An unseen sample X = <rain, hot, high, false> 1. P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582 2. P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286 • Sample X is classified in class n (don’t play)
  105. 105. Try It on Spark • Download Experimental Data: https://raw.githubusercontent.com/apache/spark/master/data/mllib/s ample_naive_bayes_data.txt • Download the Example Code of Naïve Bayes Classification: https://raw.githubusercontent.com/apache/spark/master/examples/sr c/main/python/mllib/naive_bayes_example.py
  106. 106. Experimental Data Each line has the form "label,f1 f2 f3":
0,1 0 0
0,2 0 0
0,3 0 0
0,4 0 0
1,0 1 0
1,0 2 0
1,0 3 0
1,0 4 0
2,0 0 1
2,0 0 2
2,0 0 3
2,0 0 4
For example, the line "1,0 2 0" has class label 1 and feature vector (0, 2, 0).
  107. 107. Naïve Bayes in Spark • Step 1: Prepare data • Step 2: NaiveBayes.train() • Step 3: NaiveBayes.predict() • Step 4: Evaluation *. Full Version: https://spark.apache.org/docs/latest/mllib-naive-bayes.html
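A condensed sketch of the four steps, loosely following the naive_bayes_example.py linked two slides earlier (the local file path is a placeholder):

from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parse_line(line):                       # Step 1: "label,f1 f2 f3" -> LabeledPoint
    label, features = line.split(',')
    return LabeledPoint(float(label), Vectors.dense([float(x) for x in features.split(' ')]))

data = sc.textFile("sample_naive_bayes_data.txt").map(parse_line)
training, test = data.randomSplit([0.6, 0.4])
model = NaiveBayes.train(training, 1.0)     # Step 2: train (1.0 is the smoothing parameter)
pred_and_label = test.map(lambda p: (model.predict(p.features), p.label))           # Step 3
accuracy = pred_and_label.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())  # Step 4
print("accuracy = %.3f" % accuracy)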
  108. 108. 2. Decision Tree • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Use of decision tree: Classifying an unknown sample • Test the attribute values of the sample against the decision tree
  109. 109. Example: Predict the Buys_Computer
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
  110. 110. Decision Tree (diagram of the learned tree) age? — if <=30: check student? (no → no, yes → yes); if 30..40: yes; if >40: check credit rating? (excellent → no, fair → yes)
  111. 111. Build A Decision Tree • Step 1: All data in Root • Step 2: Split the node which can lead to more pure sub-nodes • Step 3: Repeat until terminal conditions meet
  112. 112. Measures for Purity • Information Gain, Gini Index,… • Example
  113. 113. Terminal Conditions
  114. 114. Decision Tree in Spark • Step 1: Prepare data • Step 2: DT.trainClassifier() • Step 3: DT.predict() • Step 4: Evaluation *. Full Version: https://spark.apache.org/docs/latest/mllib-decision-tree.html
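A hedged sketch of the four steps, following the MLlib decision-tree guide linked above (the data path and the tree parameters are illustrative, not the deck's own values):

from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

# Step 1: load LIBSVM-formatted data (placeholder path)
data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
training, test = data.randomSplit([0.7, 0.3])
# Step 2: train a classifier (no categorical features, gini impurity)
model = DecisionTree.trainClassifier(training, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)
# Step 3: predict on the test features
predictions = model.predict(test.map(lambda p: p.features))
# Step 4: evaluate the test error
labels_and_preds = test.map(lambda p: p.label).zip(predictions)
test_err = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("test error = %.3f" % test_err)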
  115. 115. Ensemble Decision-Tree-based Algorithms • Random Forest Pick random subsets to build trees • AdaBoost Improve trees sequentially
  116. 116. 3. Logistic Regression • A classification algorithm
  117. 117. Hypothesis Function • hypothesis: h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x}), where g is the sigmoid (logistic) function that squashes any real value into (0, 1)
  118. 118. When outcome is only 1/0
  119. 119. Logistic Regression in Spark • Step 1: Prepare data • Step 2: LR.train() • Step 3: LR.predict() • Step 4: Evaluation *. Full Version: https://spark.apache.org/docs/latest/mllib-linear-methods.html#classification
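A minimal sketch of the four steps, assuming the LogisticRegressionWithLBFGS trainer from the MLlib guide linked above stands in for the slide's "LR.train()" (the data path is a placeholder):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")      # Step 1 (placeholder path)
training, test = data.randomSplit([0.7, 0.3])
model = LogisticRegressionWithLBFGS.train(training)              # Step 2
pred_and_label = test.map(lambda p: (float(model.predict(p.features)), p.label))   # Step 3
accuracy = pred_and_label.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())  # Step 4
print("accuracy = %.3f" % accuracy)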
  120. 120. 4. Support Vector Machine (SVM) • SVMs maximize the margin around the separating hyperplane. • The decision function is fully specified by a subset of training samples, the support vectors.
  121. 121. How About Data That Are Not Linearly Separable? • General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable, via a kernel function.
  122. 122. Kernels • Why use kernels? • Make a non-separable problem separable • Map data into a better representational space • Common kernels • Linear • Polynomial: K(x, z) = (1 + x^T z)^d • Radial basis function (RBF)
  123. 123. SVM with Different Kernels
  124. 124. SVM in Spark • Step 1: Prepare data • Step 2: SVM.train() • Step 3: SVM.predict() • Step 4: Evaluation *. Full Version: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms
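A minimal sketch of the four steps, assuming SVMWithSGD from the MLlib guide linked above stands in for the slide's "SVM.train()" (the data path and iteration count are placeholders):

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")      # Step 1 (placeholder path)
training, test = data.randomSplit([0.7, 0.3])
model = SVMWithSGD.train(training, iterations=100)               # Step 2
pred_and_label = test.map(lambda p: (float(model.predict(p.features)), p.label))   # Step 3
err = pred_and_label.filter(lambda pl: pl[0] != pl[1]).count() / float(test.count())  # Step 4
print("test error = %.3f" % err)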
  125. 125. Lab 3
