
Lecture 07 - CS-5040 - modern database systems

Spark Programming
For the course "Modern Database Systems" taught at Aalto University, Spring 2016.


  1. Modern Database Systems. Lecture 7. Aristides Gionis, Michael Mathioudakis. Spring 2016.
  2. Logistics. Assignment 1: currently marking; might upload best student solutions? will provide annotated pdfs with marks. VirtualBox has the same path & filename as the previous one.
  3. Continuing from last lecture: the original paper on Spark.
  4. Previously... on modern database systems: a generalization of MapReduce, more suitable for iterative workflows (iterative algorithms and repetitive querying). RDD: resilient distributed dataset; read-only, lazily evaluated, easily re-created; ephemeral, unless we need to keep them in memory.
  5. Example: text search. Suppose that a web service is experiencing errors and you want to search over terabytes of logs to find the cause. The logs are stored in the Hadoop Filesystem (HDFS). Errors are written in the logs as lines that start with the keyword "ERROR".
  6. Example: text search, in Scala.
     lines = spark.textFile("hdfs://...")           // an RDD created from a file
     errors = lines.filter(_.startsWith("ERROR"))   // a transformation; the argument to filter is Scala syntax for a closure
     errors.persist()                               // hint: keep it in memory!
     No work has been performed on the cluster so far.
     errors.count()                                 // an action!
     Note that lines is not loaded into RAM.
  7. Example: text search, ctd. Let us find the errors related to "MySQL".
  8. Example: text search, ctd.
     // Count errors mentioning MySQL:
     errors.filter(_.contains("MySQL")).count()     // a transformation followed by an action
  9. Example: text search, ctd. again. Let us find errors related to "HDFS" and extract their time field, assuming time is field no. 3 in a tab-separated format.
  10. Example: text search, ctd. again.
     // Return the time fields of errors mentioning
     // HDFS as an array (assuming time is field
     // number 3 in a tab-separated format):
     errors.filter(_.contains("HDFS"))
           .map(_.split('\t')(3))
           .collect()
     filter and map are transformations; collect is an action. After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM.
  11. Example: text search. Lineage of the time fields:
     lines --filter(_.startsWith("ERROR"))--> errors --filter(_.contains("HDFS"))--> HDFS errors --map(_.split('\t')(3))--> time fields
     (Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations.)
     errors is cached; the transformations are pipelined. If a partition of errors is lost, filter is applied only to the corresponding partition of lines.
  12. Representing RDDs. Internal information about RDDs: partitions & partitioning scheme; dependencies on parent RDDs; a function to compute it from its parents.
  13. RDD dependencies. Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD. Otherwise, wide dependencies.
  14. RDD dependencies. Narrow dependencies: map, filter, union, join with co-partitioned inputs. Wide dependencies: groupByKey, join with inputs that are not co-partitioned. (Figure 4: Examples of narrow and wide dependencies.)
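To make the distinction concrete, here is a small pyspark sketch (assuming a live SparkContext sc, as created later in these slides; the variable names are illustrative): map keeps a narrow, per-partition dependency, while groupByKey introduces a wide dependency that shuffles records across partitions.
     rdd = sc.parallelize(range(8), 4)         # 4 partitions
     squares = rdd.map(lambda x: x * x)        # narrow: each output partition reads exactly one parent partition
     pairs = rdd.map(lambda x: (x % 2, x))     # still narrow
     grouped = pairs.groupByKey()              # wide: records with the same key must be brought together (shuffle)
     grouped.mapValues(list).collect()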
  15. Scheduling. When an action is performed (e.g., count() or save()), the scheduler examines the lineage graph and builds a DAG of stages to execute. Each stage is a maximal pipeline of transformations over narrow dependencies.
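As an illustrative sketch (assuming a SparkContext sc; the exact output of toDebugString varies across Spark versions), the lineage that the scheduler examines can be printed from the driver; the shuffle introduced by reduceByKey is, roughly, where one pipelined stage ends and the next begins.
     pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x))
     counts = pairs.reduceByKey(lambda a, b: a + b)
     print(counts.toDebugString().decode())    # lineage of counts; the shuffled RDD marks a stage boundary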
  16. Scheduling. (Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory.)
  17. Memory management. When there is not enough memory, Spark applies an LRU eviction policy at the RDD level: it evicts a partition from the least recently used RDD.
  18. Performance. Logistic regression and k-means on Amazon EC2: 10 iterations on 100GB datasets, 100-node clusters.
  19. Performance. (Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster.)
  20. Performance. Logistic regression, running time vs. number of iterations: Hadoop takes about 110 s per iteration; Spark takes 80 s for the first iteration and about 1 s for further iterations.
  21. Spark programming with Python.
  22. Spark programming: creating RDDs; transformations & actions; lazy evaluation & persistence; passing custom functions; working with key-value pairs; data partitioning; accumulators & broadcast variables; pyspark.
  23. Driver program: contains the main function of the application, defines RDDs, and applies operations on them. E.g., the Spark shell itself is a driver program. Driver programs access Spark through a SparkContext object.
  24. Example.
     import pyspark
     sc = pyspark.SparkContext(master="local", appName="tour")   # SparkContext; created automatically in the spark shell
     text = sc.textFile("myfile.txt")   # load data: create an rdd
     text.count()                       # count lines: an operation
     The SparkContext part is assumed from now on. If we are running on a cluster of machines, different machines might count different parts of the file.
  25. Example: an operation with a custom function. On a cluster, Spark ships the function to all workers.
     text = sc.textFile("myfile.txt")   # load data
     # keep only lines that mention "Spark"
     spark_lines = text.filter(lambda line: 'Spark' in line)
     spark_lines.count()                # count lines
  26. Lambda functions in Python.
     f = (lambda line: 'Spark' in line)
     f("we are learning Spark")
     is equivalent to
     def f(line):
         return 'Spark' in line
     f("we are learning Spark")
  27. Stopping.
     text = sc.textFile("myfile.txt")   # load data
     # keep only lines that mention "Spark"
     spark_lines = text.filter(lambda line: 'Spark' in line)
     spark_lines.count()                # count lines
     sc.stop()
  28. RDDs: resilient distributed datasets. Resilient: easy to recover. Distributed: different partitions materialize on different nodes. Read-only (immutable), but can be transformed into other RDDs.
  29. Creating RDDs. Loading an external dataset: text = sc.textFile("myfile.txt"). Distributing a collection of objects: data = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]). Transforming other RDDs: text_spark = text.filter(lambda line: 'Spark' in line); data_length = data.map(lambda num: num ** 2).
  30. RDD operations. Transformations return a new RDD; actions extract information from an RDD or save it to disk.
     inputRDD = sc.textFile("logfile.txt")
     errorsRDD = inputRDD.filter(lambda x: "error" in x)
     warningsRDD = inputRDD.filter(lambda x: "warning" in x)
     badlinesRDD = errorsRDD.union(warningsRDD)
     print("Input had", badlinesRDD.count(), "concerning lines.")
     print("Here are some of them:")
     for line in badlinesRDD.take(10):
         print(line)
  31. RDD operations. Transformations return a new RDD; actions extract information from an RDD or save it to disk. What if primes does not fit in memory?
     import numpy as np

     def is_prime(num):
         if num < 1 or num % 1 != 0:
             raise Exception("invalid argument")
         if num == 1:
             return False   # 1 is not prime
         for d in range(2, int(np.sqrt(num)) + 1):
             if num % d == 0:
                 return False
         return True

     numbersRDD = sc.parallelize(list(range(1, 1000000)))   # create RDD
     primesRDD = numbersRDD.filter(is_prime)                 # transformation
     primes = primesRDD.collect()                            # action
     print(primes[:100])                                     # operation in the driver
  32. RDD operations. Transformations return a new RDD; actions extract information from an RDD or save it to disk.
     def is_prime(num):
         if num < 1 or num % 1 != 0:
             raise Exception("invalid argument")
         if num == 1:
             return False   # 1 is not prime
         for d in range(2, int(np.sqrt(num)) + 1):
             if num % d == 0:
                 return False
         return True

     numbersRDD = sc.parallelize(list(range(1, 1000000)))
     primesRDD = numbersRDD.filter(is_prime)
     primesRDD.saveAsTextFile("primes.txt")   # an action that saves to disk
  33. RDDs are evaluated lazily and are ephemeral, but can persist in memory (or on disk) if we ask.
  34. Lazy evaluation.
     numbersRDD = sc.parallelize(range(1, 1000000))
     primesRDD = numbersRDD.filter(is_prime)
     # no cluster activity until here
     primesRDD.saveAsTextFile("primes.txt")

     numbersRDD = sc.parallelize(range(1, 1000000))
     primesRDD = numbersRDD.filter(is_prime)
     # no cluster activity until here
     primes = primesRDD.collect()
     print(primes[:100])
  35. Persistence. RDDs can persist in memory, if we ask politely.
     numbersRDD = sc.parallelize(list(range(1, 1000000)))
     primesRDD = numbersRDD.filter(is_prime)
     primesRDD.persist()
     primesRDD.count()    # causes the RDD to materialize
     primesRDD.take(10)   # the RDD is already in memory
  36. Persistence: why? (screenshot from a Jupyter notebook)
  37. Persistence. We can ask Spark to maintain RDDs on disk, or even keep replicas on different nodes:
     data.persist(pyspark.StorageLevel(useDisk=True, useMemory=True, useOffHeap=False, deserialized=False, replication=2))
     To cease persistence, data.unpersist() removes the RDD from memory and disk.
  38. Passing functions: lambda functions or function references.
     text = sc.textFile("myfile.txt")
     text_spark = text.filter(lambda line: 'Spark' in line)

     def f(line):
         return 'Spark' in line
     text = sc.textFile("myfile.txt")
     text_spark = text.filter(f)
  39. Passing functions: warning! If a function is a member of an object (self.method) or references fields of an object (e.g., self.field), Spark serializes and sends the entire object to the worker nodes. This can be very inefficient.
  40. Passing functions. Where is the problem in the code below?
     class SearchFunctions(object):
         def __init__(self, query):
             self.query = query
         def is_match(self, s):
             return self.query in s
         def get_matches_in_rdd_v1(self, rdd):
             return rdd.filter(self.is_match)
         def get_matches_in_rdd_v2(self, rdd):
             return rdd.filter(lambda x: self.query in x)
  41. Passing functions. Where is the problem in the code below?
     class SearchFunctions(object):
         def __init__(self, query):
             self.query = query
         def is_match(self, s):
             return self.query in s
         def get_matches_in_rdd_v1(self, rdd):
             return rdd.filter(self.is_match)                # reference to an object method
         def get_matches_in_rdd_v2(self, rdd):
             return rdd.filter(lambda x: self.query in x)    # reference to an object field
  42. Passing functions: a better implementation.
     class SearchFunctions(object):
         def __init__(self, query):
             self.query = query
         def is_match(self, s):
             return self.query in s
         def get_matches_in_rdd(self, rdd):
             query = self.query                       # copy the field to a local variable
             return rdd.filter(lambda x: query in x)  # the lambda no longer references self
  43. Common RDD operations: element-wise transformations, map and filter.
     inputRDD {1, 2, 3, 4}
     mappedRDD = inputRDD.map(lambda x: x**2)         -> {1, 4, 9, 16}
     filteredRDD = inputRDD.filter(lambda x: x != 1)  -> {2, 3, 4}
     map's return type can be different than its input's.
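The same picture as a short runnable pyspark sketch (the variable names mirror the diagram above):
     inputRDD = sc.parallelize([1, 2, 3, 4])
     mappedRDD = inputRDD.map(lambda x: x ** 2)         # element-wise transformation
     filteredRDD = inputRDD.filter(lambda x: x != 1)    # element-wise transformation
     print(mappedRDD.collect())     # [1, 4, 9, 16]
     print(filteredRDD.collect())   # [2, 3, 4]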
  44. Common RDD operations: element-wise transformations that produce multiple elements per input element: flatMap.
     phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
     words = phrases.flatMap(lambda phrase: phrase.split(" "))
     words.count()   # 9
  45. Common RDD operations. How is the result different?
     phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
     words = phrases.flatMap(lambda phrase: phrase.split(" "))
     words.collect()
     # ['hello', 'world', 'how', 'are', 'you', 'how', 'do', 'you', 'do']

     phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
     words = phrases.map(lambda phrase: phrase.split(" "))
     words.collect()
     # [['hello', 'world'], ['how', 'are', 'you'], ['how', 'do', 'you', 'do']]
  46. Common RDD operations: (pseudo) set operations.
     oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
     oneRDD.persist()
     otherRDD = sc.parallelize([1, 4, 4, 7])
     otherRDD.persist()
  47. Common RDD operations: (pseudo) set operations, union (using oneRDD and otherRDD defined above).
     oneRDD.union(otherRDD).collect()
     # [1, 1, 1, 2, 3, 3, 4, 4, 1, 4, 4, 7]
  48. Common RDD operations: (pseudo) set operations, subtraction.
     oneRDD.subtract(otherRDD).collect()
     # [2, 3, 3]
  49. Common RDD operations: (pseudo) set operations, duplicate removal.
     oneRDD.distinct().collect()
     # [1, 2, 3, 4]
  50. Common RDD operations: (pseudo) set operations, intersection. Removes duplicates!
     oneRDD.intersection(otherRDD).collect()
     # [1, 4]
  51. Common RDD operations: (pseudo) set operations, cartesian product.
     oneRDD.cartesian(otherRDD).collect()[:5]
     # [(1, 1), (1, 4), (1, 4), (1, 7), (1, 1)]
  52. Common RDD operations: (pseudo) set operations: union, subtraction, duplicate removal, intersection, cartesian product. There is a big difference in implementation (and efficiency): union involves no partition shuffling, while the others move data across partitions (shuffling).
  53. Common RDD operations: sortBy. How is sortBy implemented? We'll see later...
     data = sc.parallelize(np.random.rand(10))
     data.sortBy(lambda x: x)
  54. Common RDD operations: actions. reduce successively operates on two elements of the RDD and returns a new element of the same type; use commutative & associative functions.
     data = sc.parallelize([1, 43, 62, 23, 52])
     data.reduce(lambda x, y: x + y)   # 181
     data.reduce(lambda x, y: x * y)   # 3188536
  55. Commutative & associative function f(x, y), e.g., add(x, y) = x + y.
     Commutative: f(x, y) = f(y, x); e.g., add(x, y) = x + y = y + x = add(y, x).
     Associative: f(x, f(y, z)) = f(f(x, y), z); e.g., add(x, add(y, z)) = x + (y + z) = (x + y) + z = add(add(x, y), z).
  56. Common RDD operations: actions. reduce successively operates on two elements of the RDD and produces a single aggregate. Compute the sum of squares of data: is the following correct? No. Why?
     data = sc.parallelize([1, 43, 62, 23, 52])
     data.reduce(lambda x, y: x**2 + y**2)   # 137823683725010149883130929
  57. Common RDD operations: actions. reduce successively operates on two elements of the RDD and produces a single aggregate. Compute the sum of squares of data: is the following correct? Yes. Why?
     data = sc.parallelize([1, 43, 62, 23, 52])
     data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2   # 8927.0
  58. Common RDD operations: actions. aggregate generalizes reduce. The user provides: a zero value, the identity element for the aggregation; a sequential operation (function), to update the aggregate with one more element within one partition; a combining operation (function), to combine aggregates from different partitions.
  59. Common RDD operations: actions. aggregate generalizes reduce. What does the following compute? The average value of data.
     data = sc.parallelize([1, 43, 62, 23, 52])
     aggr = data.aggregate(zeroValue=(0, 0),                                   # zero value
                           seqOp=(lambda x, y: (x[0] + y, x[1] + 1)),          # sequential operation
                           combOp=(lambda x, y: (x[0] + y[0], x[1] + y[1])))   # combining operation
     aggr[0] / aggr[1]
  60. Common RDD operations: actions.
     operation: return
     collect: all elements
     take(num): num elements; tries to minimize disk access (e.g., by accessing only one partition)
     takeSample: a random sample of elements
     count: number of elements
     countByValue: number of times each element appears; computed first on each partition, then partition results are combined
     top(num): num maximum elements; sorts partitions and merges
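A small sketch exercising the actions listed above on toy data (the elements returned by takeSample will vary from run to run):
     data = sc.parallelize([5, 3, 1, 3, 2, 5, 5])
     data.collect()               # [5, 3, 1, 3, 2, 5, 5]
     data.take(3)                 # 3 elements, e.g., [5, 3, 1]
     data.takeSample(False, 2)    # 2 random elements, sampled without replacement
     data.count()                 # 7
     data.countByValue()          # {5: 3, 3: 2, 1: 1, 2: 1}
     data.top(2)                  # [5, 5]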
  61. Common RDD operations. All operations we have described so far apply to all RDDs; that is why the word "common" has been in the title.
  62. Pair RDDs: elements are key-value pairs. They come from the MapReduce model and are practical in many cases; Spark provides operations tailored to pair RDDs.
     pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
     pairRDD.collect()[:5]
     # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
  63. Transformations on pair RDDs: keys and values.
     pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
     pairRDD.keys().collect()[:5]     # [0, 1, 2, 3, 4]
     pairRDD.values().collect()[:5]   # [0, 1, 4, 9, 16]
  64. Transformations on pair RDDs: reduceByKey. Note that reduceByKey is a transformation, unlike reduce (an action).
     pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                               ('$APPL', 100.52), ('$AMZN', 552.32)])
     volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
     volumePerKey.collect()
     # [('$APPL', 201.16), ('$AMZN', 1104.64), ('$GOOG', 706.2)]
  65. Transformations on pair RDDs: combineByKey generalizes reduceByKey. The user provides: a createCombiner function, which provides the zero value for each key; a mergeValue function, which combines the current aggregate in one partition with a new value; and a mergeCombiners function, to combine aggregates from different partitions.
  66. Transformations on pair RDDs: combineByKey generalizes reduceByKey. What does the following produce? (The average value per key.)
     pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                               ('$APPL', 100.52), ('$AMZN', 552.32)])
     aggr = pairRDD.combineByKey(createCombiner=lambda x: (x, 1),
                                 mergeValue=lambda x, y: (x[0] + y, x[1] + 1),
                                 mergeCombiners=lambda x, y: (x[0] + y[0], x[1] + y[1]))
     avgPerKey = aggr.map(lambda x: (x[0], x[1][0] / x[1][1]))
     avgPerKey.collect()
  67. Transformations on pair RDDs: sortByKey. It samples values from the RDD to estimate sorted partition boundaries, shuffles the data, and sorts by external sorting. It is used to implement the common sortBy; idea: create a pair RDD with (sort-key, item) elements and apply sortByKey on that.
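A minimal sketch of that idea (not Spark's actual implementation of sortBy): build (sort-key, item) pairs with keyBy, sort them with sortByKey, and drop the keys again.
     data = sc.parallelize([("b", 3), ("a", 1), ("c", 2)])
     keyfunc = lambda item: item[1]             # sort by the second field
     sorted_rdd = (data.keyBy(keyfunc)          # (sort-key, item) pairs
                       .sortByKey()
                       .values())               # back to the original items
     sorted_rdd.collect()                       # [('a', 1), ('c', 2), ('b', 3)]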
  68. Transformations on pair RDDs: (inner) join, implemented with a variant of hash join. Spark also has functions for left outer join, right outer join, and full outer join.
  69. Transformations on pair RDDs: (inner) join.
     course_a = sc.parallelize([("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
     course_b = sc.parallelize([("Leena", 10), ("Tuukka", 10)])
     result = course_a.join(course_b)
     result.collect()
     # [('Tuukka', (10, 10)), ('Leena', (9, 10))]
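For comparison, a short sketch of leftOuterJoin on the same data: keys that appear only in course_a are kept, with None in place of the missing value (the order of the output may differ).
     result = course_a.leftOuterJoin(course_b)
     result.collect()
     # e.g., [('Antti', (8, None)), ('Tuukka', (10, 10)), ('Leena', (9, 10))]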
  70. Transformations on pair RDDs: other transformations, groupByKey. For grouping together multiple RDDs: cogroup and groupWith.
     pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                               ('$APPL', 100.52), ('$AMZN', 552.32)])
     pairRDD.groupByKey().collect()
     # [('$APPL', <values>), ('$AMZN', <values>), ('$GOOG', <values>)]
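The <values> above are returned as iterable objects; a short sketch to materialize them as lists for inspection (output order may differ):
     pairRDD.groupByKey().mapValues(list).collect()
     # e.g., [('$APPL', [100.64, 100.52]), ('$AMZN', [552.32, 552.32]), ('$GOOG', [706.2])]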
  71. Actions on pair RDDs: countByKey, collectAsMap, lookup(key).
     pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                               ('$APPL', 100.52), ('$AMZN', 552.32)])
     pairRDD.countByKey()
     # {'$AMZN': 2, '$APPL': 2, '$GOOG': 1}
     pairRDD.collectAsMap()
     # {'$AMZN': 552.32, '$APPL': 100.52, '$GOOG': 706.2}
     pairRDD.lookup("$APPL")
     # [100.64, 100.52]
  72. Shared variables. Accumulators: write-only for workers. Broadcast variables: read-only for workers.
  73. Accumulators. Note the lazy evaluation: the accumulator is only updated when an action runs.
     text = sc.textFile("myfile.txt")
     long_lines = sc.accumulator(0)

     def line_len(line):
         global long_lines
         length = len(line)
         if length > 30:
             long_lines += 1
         return length

     llengthRDD = text.map(line_len)   # lazy!
     llengthRDD.count()    # 95
     long_lines.value      # 45
  74. Accumulators and fault tolerance. Spark executes updates in actions only once, e.g., in foreach() (foreach is a special action). This is not guaranteed for transformations; in transformations, use accumulators only for debugging purposes!
  75. Accumulators + foreach.
     text = sc.textFile("myfile.txt")
     long_lines = sc.accumulator(0)

     def line_len(line):
         global long_lines
         length = len(line)
         if length > 30:
             long_lines += 1

     text.foreach(line_len)
     long_lines.value   # 45
  76. Broadcast variables: sent to workers only once; read-only: even if you change the value on a worker, the change does not propagate to other workers (actually the broadcast object is written to a file and read from there by each worker). Release with unpersist().
  77. Broadcast variables.
     def load_address_table():
         return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
                 "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}

     address_table = sc.broadcast(load_address_table())

     def find_address(name):
         res = None
         if name in address_table.value:
             res = address_table.value[name]
         return res

     data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
     pairRDD = data.map(lambda name: (name, find_address(name)))
     pairRDD.collectAsMap()
  78. Partitioning. Certain operations take advantage of partitioning, e.g., reduceByKey and join. With anRDD.partitionBy(numPartitions, partitionFunc), users can set the number of partitions and the partitioning function.
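A small sketch with made-up data: hash-partition a pair RDD into two partitions and persist it, so that subsequent key-based operations (e.g., reduceByKey, join) can reuse the partitioning; glom() is used here only to inspect the contents of each partition.
     pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
     partitioned = pairs.partitionBy(2, partitionFunc=lambda key: hash(key))
     partitioned.persist()                 # keep the partitioned RDD around for reuse
     partitioned.glom().collect()          # list of elements per partition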
  79. Working on a per-partition basis. Spark provides operations that operate at the partition level, e.g., mapPartitions; these are used in the implementation of Spark itself.
     rdd = sc.parallelize(range(100), 4)

     def f(iterator):
         yield sum(iterator)

     rdd.mapPartitions(f).collect()
  80. See the implementation of Spark at https://github.com/apache/spark/
  81. References.
     1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010).
     2. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
     3. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.
     4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
     5. Spark implementation: https://github.com/apache/spark/
     6. "Making Big Data Processing Simple with Spark," Matei Zaharia, https://youtu.be/d9D-Z3-44F8
