Bill howe 3_mapreduce

scalability


  1. 1. WHAT DOES SCALABLE MEAN? Operationally: • In the past: “Works even if data doesn’t fit in main memory” • Now: “Can make use of 1000s of cheap computers” Formally: • In the past: If you have N data items, you must do no more than N^k operations (“polynomial time algorithms”) • Soon: If you have N data items, you must do no more than N * log(N) operations (“linearithmic time algorithms”) • As data sizes go up, you’ll only get one pass at the data • So you better make that one pass count 1/16/16 Bill Howe, eScience Institute 1
  2. 2. EXAMPLE: FIND MATCHING DNA SEQUENCES Given a set of sequences, find all sequences equal to “GATTACGATATTA”
  3. 3. GATTACGATATTATACCTGCCGTAA
  4. 4. time = 0: TACCTGCCGTAA = GATTACGATATTA? No.
  5. 5. time = 1: CCCCCAATGAC = GATTACGATATTA? No.
  6. 6. time = 17: GATTACGATATTA contains GATTACGATATTA? Yes! Send it to the output.
  7. 7. 40 records, 40 comparisons. N records, N comparisons. The algorithmic complexity is order N: O(N)
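The scan walked through on the preceding slides can be sketched in Python. This is a minimal illustration, assuming the sequences sit in an in-memory list; the helper name `linear_scan` and the sample records are hypothetical and not from the slides.

```python
def linear_scan(sequences, query):
    """Return every sequence equal to query, plus the number of comparisons made."""
    matches = []
    comparisons = 0
    for seq in sequences:
        comparisons += 1  # exactly one comparison per record, so O(N) overall
        if seq == query:
            matches.append(seq)
    return matches, comparisons


# Hypothetical sample data: four records, one of which matches.
sequences = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA", "TTTACGTCAA"]
matches, comparisons = linear_scan(sequences, "GATTACGATATTA")
```

Every record is inspected regardless of where the match sits, which is why the comparison count always equals the number of records.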
  8. 8. GATTACGATATTA AAAATCCTGCA AAACGCCTGCA TTTACGTCAA TTTTCGTAATT What if we sort the sequences?
  9. 9. Query: GATTACGATATTA. time = 0: Start at the 50% mark. CTGTACACAACCT < GATTACGATATTA. No match. Skip to the 75% mark.
  10. 10. time = 1: GGATACACATTTA > GATTACGATATTA. No match. Go back to the 62.5% mark.
  11. 11. GATATTTTAAGC < GATTACGATATTA. No match. Skip ahead to the 68.75% mark.
  12. 12. GATTACGATATTA = GATTACGATATTA. Match! Walk through the records until we fail to match.
  13. 13. How many comparisons did we do? 40 records, only 4 comparisons. N records, log(N) comparisons. This algorithm is O(log(N)): far better scalability.
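The sorted lookup can be sketched as a binary search followed by a short walk over equal records. This is only a sketch: the function name `binary_search` and the synthetic 40-record dataset are assumptions made for illustration.

```python
from itertools import product


def binary_search(sorted_seqs, query):
    """Find all records equal to query in a sorted list; also count the probes."""
    lo, hi = 0, len(sorted_seqs)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1  # one comparison per halving step
        if sorted_seqs[mid] < query:
            lo = mid + 1  # the match, if any, lies to the right
        else:
            hi = mid  # the match, if any, lies at mid or to the left
    # lo is now the first position whose record is >= query.
    matches = []
    i = lo
    while i < len(sorted_seqs) and sorted_seqs[i] == query:
        matches.append(sorted_seqs[i])  # walk until we fail to match
        i += 1
    return matches, probes


# Hypothetical data: 39 short synthetic sequences plus the query, 40 records total.
records = sorted(["".join(p) for p in product("ACGT", repeat=3)][:39] + ["GATTACGATATTA"])
matches, probes = binary_search(records, "GATTACGATATTA")
```

The exact probe count depends on the search variant; this one always halves down to a single position, so it needs at most 6 probes for 40 records.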
  14. 14. RELATIONAL DATABASES Databases are good at “Needle in Haystack” problems: • Extracting small results from big datasets • Transparently provide “old style” scalability • Your query will always* finish, regardless of dataset size (*almost always) • Indexes are easily built and automatically used when appropriate. CREATE INDEX seq_idx ON sequence(seq); SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA';
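To see the index being picked up automatically, the two statements from the slide can be run against an in-memory database. The slide does not name a database system; SQLite (through Python's standard `sqlite3` module) is an assumption here, as are the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany(
    "INSERT INTO sequence (seq) VALUES (?)",
    [("TACCTGCCGTAA",), ("CCCCCAATGAC",), ("GATTACGATATTA",)],  # hypothetical rows
)

# The index from the slide.
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

# The needle-in-haystack query from the slide.
rows = conn.execute(
    "SELECT seq FROM sequence WHERE seq = ?", ("GATTACGATATTA",)
).fetchall()

# Ask SQLite how it intends to run the query; the last column describes the step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA'"
).fetchall()
uses_index = any("seq_idx" in step[-1] for step in plan)
```

The query text never mentions `seq_idx`; the planner chooses it, which is the "automatically used when appropriate" point.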
  15. 15. NEW TASK: READ TRIMMING Given a set of DNA sequences, trim the final n bps of each sequence and generate a new dataset.
  16. 16. GATTACGATATTATACCTGCCGTAA
  17. 17. TACCTGCCGTAA becomes TACCT time = 0
  18. 18. CCCCCAATGAC becomes CCCCC time = 1
  19. 19. GATTACGATATTA becomes GATTA time = 17
  20. 20. Can we use an index? No. We have to touch every record no matter what. The task is fundamentally O(N) Can we do any better?
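A trimming function is a one-liner, but it has to be applied to every record. The sketch below assumes a fixed n = 7 for all reads (an assumption; the slides do not state n) and uses the hypothetical helper name `trim_read`.

```python
def trim_read(read, n):
    """Drop the final n base pairs of a read."""
    return read[:-n] if n > 0 else read


reads = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA"]

# No index helps here: the function must touch every record, so the work is O(N).
trimmed = [trim_read(r, 7) for r in reads]
```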
  21. 21. time = 0
  22. 22. time = 1
  23. 23. time = 2
  24. 24. time = 3
  25. 25. time = 7 How much time did this take? 7 cycles: 40 records, 6 workers. O(N/k)
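The 7-cycle figure follows from dividing the records among the workers. A small sketch, assuming each worker handles one record per cycle and the records are dealt out round-robin (the helper names are hypothetical):

```python
import math


def partition(records, k):
    """Deal records out to k workers round-robin."""
    return [records[i::k] for i in range(k)]


def parallel_cycles(num_records, num_workers):
    """Cycles needed when each worker processes one record per cycle."""
    return math.ceil(num_records / num_workers)


shares = partition(list(range(40)), 6)
cycles = parallel_cycles(40, 6)  # ceil(40 / 6) = 7
```

The busiest worker holds 7 records, so the job finishes after 7 cycles instead of 40: the total work is still O(N), but the elapsed time is O(N/k).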
  26. 26. SCHEMATIC OF A PARALLEL “READ TRIMMING” TASK You are given short “reads”: genomic sequences about 35-75 characters each. Distribute the reads among k computers. f is a function to trim a read; apply it to every item. Now we have a big distributed set of trimmed reads.
  27. 27. NEW TASK: CONVERT 405K TIFF IMAGES TO PNG You are given TIFF images. Distribute the images among k computers. f is a function to convert TIFF to PNG; apply it to every item. Now we have a big distributed set of converted images. http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
  28. 28. NEW TASK: RUN THOUSANDS OF SIMULATIONS You have sets of parameters for thousands of small simulations. Divide the parameter sets among k computers. f runs the simulation and produces some output; apply it to every item. Now we have a big distributed set of simulation results. http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud
  29. 29. FIND THE MOST COMMON WORD IN EACH DOCUMENT You have millions of documents. Distribute the documents among k computers. f finds the most common word in a single document. Now we have a big distributed list of (doc_id, word) pairs.
  30. 30. Consider a slightly more general program to compute the word frequency of every word in a single document Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. 
Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. (people, 2) (government, 6) (assume, 1) (history, 2) …
  31. 31. COMPUTE THE WORD FREQUENCY OF 5M DOCUMENTS You have millions of documents. Distribute the documents among k computers. For each document, f returns a set of (word,freq) pairs. Now we have a big distributed list of sets of word freqs.
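The per-document function f can be sketched with `collections.Counter`. The tokenization rule (lowercase, letters and apostrophes only) and the two sample documents are assumptions for illustration.

```python
import re
from collections import Counter


def word_freq(document):
    """f: map one document to its (word, freq) pairs."""
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)


# Hypothetical documents; in the slides these would be spread over k computers.
documents = {
    "doc1": "the history of his present majesty is a history of injuries",
    "doc2": "we hold these truths to be self-evident",
}

# Each document is processed independently, so the loop parallelizes trivially.
freqs = {doc_id: word_freq(text) for doc_id, text in documents.items()}
```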
  32. 32. THERE’S A PATTERN HERE…. A function that maps a read to a trimmed read A function that maps a TIFF image to a PNG image A function that maps a set of parameters to a simulation result A function that maps a document to its most common word A function that maps a document to a histogram of word frequencies
  33. 33. WHAT IF WE WANT TO COMPUTE THE WORD FREQUENCY ACROSS ALL DOCUMENTS? US Constitution, Declaration of Independence, Articles of Confederation: (people, 78) (government, 123) (assume, 23) (history, 38) …
  34. 34. COMPUTE THE WORD FREQUENCY ACROSS 5M DOCUMENTS You have millions of documents. Distribute the documents among k computers. map: For each document, return a set of (word,freq) pairs. Now what?
  35. 35. COMPUTE THE WORD FREQUENCY ACROSS 5M DOCUMENTS Distribute the documents among k computers. map: For each document, return a set of (word,freq) pairs. Now we have a big distributed list of sets of word freqs. reduce: Now count the occurrences of each word. We have our distributed histogram.
  36. 36. SOME DISTRIBUTED ALGORITHM… 1/16/16 Bill Howe, eScience Institute 38 Map (Shuffle) Reduce
  37. 37. MAPREDUCE PROGRAMMING MODEL Input & Output: each a set of key/value pairs. Programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) • Processes input key/value pair • Produces set of intermediate pairs. reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one). Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell. slide source: Google, Inc.
  38. 38. EXAMPLE: WHAT DOES THIS DO?
      map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, 1);
      reduce(String output_key, Iterator intermediate_values):
        // output_key: word
        // output_values: ????
        int result = 0;
        for each v in intermediate_values:
          result += v;
        Emit(result);
      slide source: Google, Inc.
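One way to explore the pseudocode is to simulate it on a single machine. The sketch below is not Hadoop or Google's implementation: `run_job` is a hypothetical in-memory driver that stands in for the framework, and splitting on whitespace is an assumed tokenization.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """input_key: document name; input_value: document contents."""
    for word in input_value.split():
        yield (word, 1)  # plays the role of EmitIntermediate(w, 1)


def reduce_fn(output_key, intermediate_values):
    """output_key: a word; intermediate_values: the 1s emitted for that word."""
    result = 0
    for v in intermediate_values:
        result += v
    return result  # plays the role of Emit(result)


def run_job(documents):
    """Hypothetical driver: map every document, group by key, then reduce."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)  # the shuffle: collect values per key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


output = run_job({"d1": "a b a", "d2": "b c"})
```

Running it on two tiny documents shows that the result is one total per distinct word across all inputs.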
  39. 39. EXAMPLE: DOCUMENT PROCESSING Abridged Declaration of Independence (same text as on slide 30).
  40. 40. EXAMPLE: WORD LENGTH HISTOGRAM Abridged Declaration of Independence (same text as on slide 30). How many “big”, “medium”, and “small” words are used?
  41. 41. EXAMPLE: WORD LENGTH HISTOGRAM Abridged Declaration of Independence (same text as on slide 30), with each word colored by length: Big = Yellow = 10+ letters; Medium = Red = 5..9 letters; Small = Blue = 2..4 letters; Tiny = Pink = 1 letter.
  42. 42. EXAMPLE: WORD LENGTH HISTOGRAM Abridged Declaration of Independence (same text as on slide 30). Split the document into chunks and process each chunk on a different computer: Chunk 1, Chunk 2.
  43. 43. EXAMPLE: WORD LENGTH HISTOGRAM Each map task emits (key, value) pairs for its chunk. Map Task 1 (204 words): (yellow, 17) (red, 77) (blue, 107) (pink, 3). Map Task 2 (190 words): (yellow, 20) (red, 71) (blue, 93) (pink, 6).
  44. 44. EXAMPLE: WORD LENGTH HISTOGRAM Map task 1 emits (yellow, 17) (red, 77) (blue, 107) (pink, 3). Map task 2 emits (yellow, 20) (red, 71) (blue, 93) (pink, 6). “Shuffle step” groups the pairs by key: (yellow, 17) (yellow, 20); (red, 77) (red, 71); (blue, 93) (blue, 107); (pink, 6) (pink, 3). Reduce tasks sum each group: (yellow, 37) (red, 148) (blue, 200) (pink, 9).
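The word length histogram can be sketched as a map task per chunk plus a reduce step that sums the partial histograms. The bucket boundaries come from the slides; the function names and the tokenization (runs of letters and apostrophes) are assumptions, so counts on the real text may differ slightly from the slide's numbers.

```python
import re
from collections import Counter


def bucket(word):
    """Assign a word to a color bucket by its length."""
    n = len(word)
    if n >= 10:
        return "yellow"  # big
    if n >= 5:
        return "red"  # medium
    if n >= 2:
        return "blue"  # small
    return "pink"  # tiny


def map_task(chunk):
    """Produce a partial histogram for one chunk of the document."""
    return Counter(bucket(w) for w in re.findall(r"[A-Za-z']+", chunk))


def reduce_task(partials):
    """Sum the partial histograms key by key."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


# Partial results copied from the slides for the two map tasks.
task1 = Counter(yellow=17, red=77, blue=107, pink=3)
task2 = Counter(yellow=20, red=71, blue=93, pink=6)
histogram = reduce_task([task1, task2])
```

Summing the two partial results from the slides reproduces the final histogram shown there.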
  45. 45. MR PHASES Each Map and Reduce task has multiple phases. [Figure: task phases, including local storage]
  46. 46. Hadoop in One Slide src: Huy Vo, NYU Poly
  47. 47. TAXONOMY OF PARALLEL ARCHITECTURES [Figure annotations: “Easiest to program, but $$$$”; “Scales to 1000s of computers”]
  48. 48. DESIGN SPACE [Figure: axes Throughput vs. Latency and Internet vs. Private data center; regions Data-parallel and Shared memory; annotations “The area we’re discussing” and “In a few weeks”] Inspired by a slide by Michael Isard at Microsoft Research
  49. 49. 1/16/16 Bill Howe, UW 51
  50. 50. LARGE-SCALE DATA PROCESSING 1/16/16 Bill Howe, eScience Institute 52 Many tasks process big data, produce big data Want to use hundreds or thousands of CPUs •  ... but this needs to be easy •  Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes. MapReduce is a lightweight framework, providing: •  Automatic parallelization and distribution •  Fault-tolerance •  I/O scheduling •  Status and monitoring
  51. 51. KEY IDEA: DECLARATIVE LANGUAGES Find all orders from today, along with the items ordered: SELECT * FROM Order o, Item i WHERE o.item = i.item AND o.date = today() Query plan: join (o.item = i.item) over select (date = today()) on scan Order o, and scan Item i.
  52. 52. TWO NOTIONS OF PARALLEL QUERY PROCESSING “Distributed Query” •  Rewrite the query as a union of subqueries •  Workers communicate through standard interfaces, so compatible with federated, heterogeneous, or distributed databases “Parallel Query” •  Each operator is implemented with a parallel algorithm 1/16/16 Bill Howe, UW 54
  53. 53. DISTRIBUTED QUERY EXAMPLE CREATE VIEW Sales AS SELECT * FROM JanSales UNION ALL SELECT * FROM FebSales UNION ALL SELECT * FROM MarSales; CREATE TABLE MarSales( OrderID INT, CustomerID INT NOT NULL, OrderDate DATETIME NULL CHECK (DATEPART(mm, OrderDate) = 3), CONSTRAINT OrderIDMonth PRIMARY KEY(OrderID) );
  54. 54. DISTRIBUTED FILE SYSTEM (DFS) For very large files: TBs, PBs. Each file is partitioned into chunks, typically 64MB. Each chunk is replicated several times (≥3), on different racks, for fault tolerance. Implementations: • Google’s DFS: GFS, proprietary • Hadoop’s DFS: HDFS, open source CSE 344 – Winter 2012 56
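The chunking arithmetic is easy to check. The sketch below treats 64MB as 64 MiB and uses a replication factor of exactly 3; both are assumptions layered on the slide's "typically 64MB" and "≥3", and the function names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024**2  # 64 MiB, assumed interpretation of "64MB"
REPLICATION = 3  # the minimum replication factor from the slide


def chunk_count(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Number of chunks a file is split into (the last chunk may be partial)."""
    return (file_size_bytes + chunk_size - 1) // chunk_size


def replica_count(file_size_bytes, replication=REPLICATION):
    """Total chunk copies stored across the cluster."""
    return chunk_count(file_size_bytes) * replication


one_tib = 1024**4
chunks = chunk_count(one_tib)  # 2**40 / 2**26 = 2**14 chunks
copies = replica_count(one_tib)
```

Under these assumptions a 1 TiB file becomes 16,384 chunks and 49,152 stored chunk copies.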
  55. 55. PARALLEL QUERY EXAMPLE:TERADATA 1/16/16 Bill Howe, eScience Institute 57 AMP = unit of parallelism
  56. 56. EXAMPLE SYSTEM: TERADATA Find all orders from today, along with the items ordered: SELECT * FROM Orders o, Lines i WHERE o.item = i.item AND o.date = today() Query plan: join (o.item = i.item) over select (date = today()) on scan Order o, and scan Item i.
  57. 57. EXAMPLE SYSTEM: TERADATA AMP 1, AMP 2, AMP 3: each runs scan Order o, then select date=today(), then hash h(item), and sends the result to AMP 4, AMP 5, AMP 6.
  58. 58. EXAMPLE SYSTEM: TERADATA AMP 1, AMP 2, AMP 3: each runs scan Item i, then hash h(item), and sends the result to AMP 4, AMP 5, AMP 6.
  59. 59. EXAMPLE SYSTEM: TERADATA AMP 4, AMP 5, AMP 6: each runs join o.item = i.item. Each of these AMPs contains all orders and all lines for one hash value: hash(item) = 1, hash(item) = 2, and hash(item) = 3, respectively.
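The three Teradata slides describe a hash-partitioned parallel join. The sketch below simulates it in one process and is not Teradata's implementation: the sample rows, column names, and the use of Python's built-in `hash` as h(item) are assumptions.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def parallel_join(orders, items, today, num_amps=3):
    # Scan and select: keep only today's orders.
    todays_orders = [o for o in orders if o["date"] == today]
    # Hash both inputs on item so matching rows land on the same join AMP.
    order_parts = hash_partition(todays_orders, "item", num_amps)
    item_parts = hash_partition(items, "item", num_amps)
    # Each join AMP works only on its own partition, independently of the others.
    result = []
    for amp in range(num_amps):
        for o in order_parts[amp]:
            for i in item_parts[amp]:
                if o["item"] == i["item"]:
                    result.append({**o, **i})
    return result


# Hypothetical sample data.
orders = [
    {"order_id": 1, "item": "pen", "date": "2016-01-16"},
    {"order_id": 2, "item": "ink", "date": "2016-01-15"},
    {"order_id": 3, "item": "cup", "date": "2016-01-16"},
]
items = [
    {"item": "pen", "price": 2},
    {"item": "cup", "price": 5},
    {"item": "ink", "price": 9},
]
joined = parallel_join(orders, items, today="2016-01-16")
```

Because both inputs are hashed with the same function, two rows with equal item values can never end up on different join AMPs, so no cross-AMP comparison is needed.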
  60. 60. MAPREDUCE CONTEMPORARIES Dryad (Microsoft) •  Relational Algebra Pig (Yahoo) •  Near Relational Algebra over MapReduce HIVE (Facebook) •  SQL over MapReduce Cascading •  Relational Algebra Clustera •  U of Wisconsin Hbase •  Indexing on HDFS 1/16/16 Bill Howe, eScience Institute 62
  61. 61. MAPREDUCE VS RDBMS RDBMS • Declarative query languages • Schemas • Logical Data Independence • Indexing • Algebraic Optimization • Caching/Materialized Views • ACID/Transactions MapReduce • High Scalability • Fault-tolerance • “One-person deployment” [Figure annotations: HIVE, Pig, DryadLINQ; DryadLINQ, Pig, HIVE; Pig, (Dryad, HIVE); Hbase]
  62. 62. COMPARISON
      System | Data Model | Prog. Model | Services
      GPL | * | * | Typing (maybe)
      Workflow | * | dataflow | typing, provenance, scheduling, caching, task parallelism, reuse
      Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
      MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
      MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
      MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
  63. 63. MAP REDUCE PHASES Each Map and Reduce task has multiple phases. [Figure: task phases, including local storage]
  64. 64. MAP: input (did1,v1) (did2,v2) (did3,v3) … produces (w1,1) (w2,1) (w1,1) … Shuffle: (w1, (1,1,1,…,1)) (w2, (1,1,…)) (w3,(1…)) … REDUCE: (w1, 25) (w2, 77) (w3, 12) … Same word appears twice. Why not just send (w1,2)?
  65. 65. COUNT WORD OCCURRENCES ACROSS ALL DOCUMENTS [Figure: map tasks feeding reduce tasks]
  66. 66. MAP: input (did1,v1) (did2,v2) (did3,v3) … produces (w1,1) (w2,1) (w3,1) … Shuffle: (w1, (1,1,1,…,1)) (w2, (1,1,…)) (w3,(1…)) … REDUCE: (w1, 25) (w2, 77) (w3, 12) …
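The question about sending (w1,2) points at local pre-aggregation, commonly called a combiner. A minimal sketch of the difference, with hypothetical function names and whitespace tokenization assumed:

```python
from collections import Counter


def map_plain(document):
    """Emit (word, 1) for every occurrence, repeats included."""
    return [(word, 1) for word in document.split()]


def map_with_combiner(document):
    """Pre-aggregate on the map side: emit each word once with its local count."""
    return sorted(Counter(document.split()).items())


def reduce_counts(pairs):
    """Sum the values per word, whichever form the pairs arrive in."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


doc = "w1 w2 w1"
plain = map_plain(doc)
combined = map_with_combiner(doc)
```

Both forms reduce to the same totals because addition is associative and commutative; the combined form simply moves fewer pairs through the shuffle.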
  67. 67. MAP REDUCE Google: paper published 2004 Free variant: Hadoop Map-reduce = high-level programming model and implementation for large-scale parallel data processing slide src: Dan Suciu and Magda Balazinska 69
  68. 68. DATA MODEL Files! A file = a bag of (key, value) pairs. A map-reduce program: Input: a bag of (input key, value) pairs. Output: a bag of (output key, value) pairs. slide src: Dan Suciu and Magda Balazinska 70
  69. 69. STEP 1: THE MAP PHASE User provides the MAP function: Input: (input key, value). Output: bag of (intermediate key, value). System applies the map function in parallel to all (input key, value) pairs in the input file.
  70. 70. STEP 2: THE REDUCE PHASE User provides the REDUCE function: Input: (intermediate key, bag of values). Output: bag of output (values). The system will group all pairs with the same intermediate key, and pass the bag of values to the REDUCE function.
  71. 71. MAPREDUCE PROGRAMMING MODEL Input & Output: each a set of key/value pairs. Programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) • Processes input key/value pair • Produces set of intermediate pairs. reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one). Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell. slide source: Google, Inc.
  72. 72. EXAMPLE: WHAT DOES THIS DO?
      map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, 1);
      reduce(String output_key, Iterator intermediate_values):
        // output_key: word
        // output_values: ????
        int result = 0;
        for each v in intermediate_values:
          result += v;
        Emit(result);
      slide source: Google, Inc.
  73. 73. MORE EXAMPLES: BUILD AN INVERTED INDEX 1/16/16 Bill Howe, UW 75 Input: tweet1, (“I love pancakes for breakfast”) tweet2, (“I dislike pancakes”) tweet3, (“What should I eat for breakfast?”) tweet4, (“I love to eat”) Desired output: “pancakes”, (tweet1, tweet2) “breakfast”, (tweet1, tweet3) “eat”, (tweet3, tweet4) “love”, (tweet1, tweet4) …
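One way to build the inverted index for the four tweets above is to have the map function emit (word, tweet id) pairs and the reduce function collect the ids for each word. The sketch below is a single-process illustration. The tokenization rule (lowercase, letters only) and the `run_mapreduce` helper are assumptions, not part of the slides.

```python
import re
from collections import defaultdict

tweets = [
    ("tweet1", "I love pancakes for breakfast"),
    ("tweet2", "I dislike pancakes"),
    ("tweet3", "What should I eat for breakfast?"),
    ("tweet4", "I love to eat"),
]

def map_fn(tweet_id, text):
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    # set() ensures each word is emitted once per tweet.
    for word in set(re.findall(r"[a-z]+", text.lower())):
        yield word, tweet_id

def reduce_fn(word, tweet_ids):
    # Collect every tweet id that contains this word.
    yield word, sorted(tweet_ids)

def run_mapreduce(records, map_fn, reduce_fn):
    # Hypothetical in-memory driver: map, group by key (shuffle), reduce.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output

index = dict(run_mapreduce(tweets, map_fn, reduce_fn))
print(index["pancakes"], index["breakfast"])
```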
  74. 74. MORE EXAMPLES: RELATIONAL JOIN 1/16/16 Bill Howe, UW 76 Employee (Name, SSN): (Sue, 999999999), (Tony, 777777777) Assigned Departments (EmpSSN, DepName): (999999999, Accounts), (777777777, Sales), (777777777, Marketing) Employee ⨝ Assigned Departments (Name, SSN, EmpSSN, DepName): (Sue, 999999999, 999999999, Accounts), (Tony, 777777777, 777777777, Sales), (Tony, 777777777, 777777777, Marketing)
  75. 75. RELATIONAL JOIN IN MAPREDUCE: BEFORE MAP PHASE 1/16/16 Bill Howe, UW 77 Employee (Name, SSN): (Sue, 999999999), (Tony, 777777777) Assigned Departments (EmpSSN, DepName): (999999999, Accounts), (777777777, Sales), (777777777, Marketing) Key idea: Lump all the tuples together into one dataset Employee, Sue, 999999999 Employee, Tony, 777777777 Department, 999999999, Accounts Department, 777777777, Sales Department, 777777777, Marketing What is this for?
  76. 76. RELATIONAL JOIN IN MAPREDUCE: MAP PHASE 1/16/16 Bill Howe, UW 78 Employee, Sue, 999999999 Employee, Tony, 777777777 Department, 999999999, Accounts Department, 777777777, Sales Department, 777777777, Marketing key=999999999, value=(Employee, Sue, 999999999) key=777777777, value=(Employee, Tony, 777777777) key=999999999, value=(Department, 999999999, Accounts) key=777777777, value=(Department, 777777777, Sales) key=777777777, value=(Department, 777777777, Marketing) Why do we use this as the key?
  77. 77. RELATIONAL JOIN IN MAPREDUCE: REDUCE PHASE 1/16/16 Bill Howe, UW 79 Reducer input: key=999999999, values=[(Employee, Sue, 999999999), (Department, 999999999, Accounts)] key=777777777, values=[(Employee, Tony, 777777777), (Department, 777777777, Sales), (Department, 777777777, Marketing)] Reducer output: Sue, 999999999, 999999999, Accounts Tony, 777777777, 777777777, Sales Tony, 777777777, 777777777, Marketing
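The three join slides above (lump the tagged tuples together, key each tuple by SSN, pair employees with departments in the reducer) can be sketched in a single process as follows. The function names and the in-memory shuffle are assumptions for illustration; the data is the Employee / Assigned Departments example from the slides.

```python
from collections import defaultdict

# All tuples lumped into one dataset, each tagged with its relation name.
tuples = [
    ("Employee", "Sue", "999999999"),
    ("Employee", "Tony", "777777777"),
    ("Department", "999999999", "Accounts"),
    ("Department", "777777777", "Sales"),
    ("Department", "777777777", "Marketing"),
]

def map_fn(record):
    # Key every tuple by the join attribute (the SSN), wherever it sits in the tuple.
    relation = record[0]
    ssn = record[2] if relation == "Employee" else record[1]
    yield ssn, record

def reduce_fn(ssn, records):
    # The relation tag lets the reducer split the group back into its two sides.
    employees = [r for r in records if r[0] == "Employee"]
    departments = [r for r in records if r[0] == "Department"]
    # Emit the cross product of the two sides for this key.
    for _, name, emp_ssn in employees:
        for _, dep_ssn, dep_name in departments:
            yield (name, emp_ssn, dep_ssn, dep_name)

# In-memory stand-in for the shuffle: group mapped values by key.
groups = defaultdict(list)
for record in tuples:
    for key, value in map_fn(record):
        groups[key].append(value)

joined = [row for key in sorted(groups) for row in reduce_fn(key, groups[key])]
for row in joined:
    print(row)
```

Because every tuple with the same SSN reaches the same reducer, each reducer can produce its part of the join without seeing any other key.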
  78. 78. RELATIONAL JOIN IN MAPREDUCE, AGAIN 1/16/16 Bill Howe, Data Science, Autumn 2012 80 Order(orderid, account, date): (1, aaa, d1), (2, aaa, d2), (3, bbb, d3) LineItem(orderid, itemid, qty): (1, 10, 1), (1, 20, 3), (2, 10, 5), (2, 50, 100), (3, 20, 1) Map (each tuple is keyed by orderid and tagged with its relation name): Order: 1, aaa, d1 → 1 : “Order”, (1,aaa,d1) 2, aaa, d2 → 2 : “Order”, (2,aaa,d2) 3, bbb, d3 → 3 : “Order”, (3,bbb,d3) Line: 1, 10, 1 → 1 : “Line”, (1, 10, 1) 1, 20, 3 → 1 : “Line”, (1, 20, 3) 2, 10, 5 → 2 : “Line”, (2, 10, 5) 2, 50, 100 → 2 : “Line”, (2, 50, 100) 3, 20, 1 → 3 : “Line”, (3, 20, 1) Reducer for key 1: input “Order”, (1,aaa,d1) “Line”, (1, 10, 1) “Line”, (1, 20, 3) output (1, aaa, d1, 1, 10, 1) (1, aaa, d1, 1, 20, 3)
  79. 79. SIMPLE SOCIAL NETWORK ANALYSIS: COUNT FRIENDS 1/16/16 Bill Howe, UW 81 Input: (Jim, Sue) (Sue, Jim) (Lin, Joe) (Joe, Lin) (Jim, Kai) (Kai, Jim) (Jim, Lin) (Lin, Jim) MAP: Jim,1 Sue,1 Lin,1 Joe,1 Jim,1 Kai,1 Jim,1 Lin,1 SHUFFLE: Jim,(1,1,1) Lin,(1,1) Sue,(1) Kai,(1) Joe,(1) REDUCE (desired output): Jim, 3 Lin, 2 Sue, 1 Kai, 1 Joe, 1
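The friend-count pipeline above has the same shape as word count: map emits (person, 1) for each friendship edge, the shuffle groups the 1s by person, and reduce sums them. A minimal single-process sketch, with the function names and the in-memory shuffle as illustrative assumptions:

```python
from collections import defaultdict

# Friendship edges from the slide; each friendship appears once in each direction.
friend_edges = [
    ("Jim", "Sue"), ("Sue", "Jim"), ("Lin", "Joe"), ("Joe", "Lin"),
    ("Jim", "Kai"), ("Kai", "Jim"), ("Jim", "Lin"), ("Lin", "Jim"),
]

def map_fn(person, friend):
    # One edge contributes one friend to the person on its left side.
    yield person, 1

def reduce_fn(person, ones):
    yield person, sum(ones)

# Shuffle: group the emitted 1s by person.
shuffled = defaultdict(list)
for person, friend in friend_edges:
    for key, value in map_fn(person, friend):
        shuffled[key].append(value)

# Reduce: one call per person.
friend_counts = {}
for person, ones in shuffled.items():
    for key, total in reduce_fn(person, ones):
        friend_counts[key] = total

print(friend_counts)
```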
  80. 80. MATRIX MULTIPLY IN MAPREDUCE C = A X B A has dimensions L,M B has dimensions M,N In the map phase: •  for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N •  for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L In the reduce phase, emit •  key = (i,k) •  value = $\sum_j A[i,j] \cdot B[j,k]$ 1/16/16 Bill Howe, Data Science, Autumn 2012 82
  81. 81. 1/16/16 Bill Howe, Data Science, Autumn 2012 83 A X B = AB •  One reducer per output cell •  Each reducer computes: $\sum_j A_{i,j} B_{j,k}$
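The one-reducer-per-output-cell scheme can be sketched in a single process as below. The slides emit only the matrix value; this sketch additionally tags each value with its source matrix and its index j so the reducer can pair A[i,j] with B[j,k], which is an assumption about a detail the slides leave out. Indices are 0-based here, and the sample matrices are made up.

```python
from collections import defaultdict

# Made-up sample matrices: A is L x M, B is M x N.
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9, 10], [11, 12, 13, 14]]
L, M, N = 3, 2, 4

def map_a(i, j, value):
    # A[i][j] is needed by every output cell in row i.
    for k in range(N):
        yield (i, k), ("A", j, value)

def map_b(j, k, value):
    # B[j][k] is needed by every output cell in column k.
    for i in range(L):
        yield (i, k), ("B", j, value)

def reduce_fn(cell, values):
    # Pair up A and B entries that share the same inner index j, then sum the products.
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return cell, sum(a[j] * b[j] for j in a if j in b)

# In-memory stand-in for the shuffle: one group per output cell (i, k).
groups = defaultdict(list)
for i in range(L):
    for j in range(M):
        for key, val in map_a(i, j, A[i][j]):
            groups[key].append(val)
for j in range(M):
    for k in range(N):
        for key, val in map_b(j, k, B[j][k]):
            groups[key].append(val)

C = [[0] * N for _ in range(L)]
for cell, values in groups.items():
    (i, k), total = reduce_fn(cell, values)
    C[i][k] = total

print(C)
```

Each element of A is copied N times and each element of B is copied L times, so this formulation trades extra communication for reducers that need no coordination with one another.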
  82. 82. CLUSTER COMPUTING Large number of commodity servers, connected by a high-speed commodity network Rack: holds a small number of servers Data center: holds many racks CSE 344 – Winter 2012 84
  83. 83. CLUSTER COMPUTING Massive parallelism: •  100s, or 1000s, or 10000s of servers •  Many hours Failure: •  If mean time between failures (MTBF) is 1 year •  Then 10000 servers have about one failure / hour slide src: Dan Suciu and Magda Balazinska 85
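The failure estimate above is simple arithmetic: with a per-server failure interval of one year (8,760 hours), 10,000 servers fail at a combined rate of a little over one per hour. The sketch below assumes independent servers with a constant failure rate; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def expected_failures_per_hour(num_servers, mtbf_years=1.0):
    # Assumes servers fail independently at a constant rate of 1 / MTBF each.
    return num_servers / (mtbf_years * HOURS_PER_YEAR)

rate = expected_failures_per_hour(10_000)
print(f"{rate:.2f} expected failures per hour")
```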
  84. 84. DISTRIBUTED FILE SYSTEM (DFS) For very large files: TBs, PBs Each file is partitioned into chunks, typically 64MB Each chunk is replicated several times (≥3), on different racks, for fault tolerance Implementations: •  Google’s DFS: GFS, proprietary •  Hadoop’s DFS: HDFS, open source slide src: Dan Suciu and Magda Balazinska 86
  85. 85. MAP REDUCE VS. DATABASES 1/16/16 87 Bill Howe, UW
  86. 86. HADOOP VS. RDBMS Comparison of 3 systems •  Hadoop •  Vertica (a column-oriented database) •  DBMS-X (a row-oriented database) •  rhymes with “schmoracle” Qualitative •  Programming model, ease of setup, features, etc. Quantitative •  Data loading, different types of queries 1/16/16 Bill Howe, eScience Institute 88 Pavlo et al. 2009
  87. 87. GREP TASK •  Find 3-byte pattern in 100-byte record •  1 match per 10,000 records •  Data set: •  10-byte unique key, 90-byte value •  1 TB spread across 25, 50, or 100 nodes •  10 billion records •  Task taken from the original MR paper (Dean et al. 2004) 89
  88. 88. GREP TASK LOADING RESULTS 90 (chart; axis label: Seconds)
  89. 89. GREP TASK EXECUTION RESULTS 91 (chart; axis label: Seconds)
  90. 90. SELECTION TASK 92 SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X 1 GB / node
  91. 91. ANALYTICAL TASKS •  Simple web programming schema •  Data set •  600k HTML Documents (6GB/node) •  155 million UserVisit records (20GB/node) •  18 million Rankings records (1GB/node) 93
  92. 92. AGGREGATE TASK •  Simple query to find adRevenue by IP prefix 94 SELECT SUBSTR(sourceIP, 1, 7), SUM (adRevenue) FROM userVisits GROUP BY SUBSTR (sourceIP, 1, 7)
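For comparison with the SQL above, the same aggregation can be written as a map function that emits (IP prefix, adRevenue) and a reduce function that sums per prefix. This is a single-process sketch under stated assumptions: the sample rows are hypothetical, not benchmark data, and the in-memory grouping stands in for the shuffle.

```python
from collections import defaultdict

# Hypothetical sample rows (sourceIP, adRevenue); not data from the benchmark.
user_visits = [
    ("10.0.0.1", 0.5),
    ("10.0.0.2", 0.25),
    ("192.168.1.9", 1.0),
    ("192.168.7.3", 2.0),
]

def map_fn(source_ip, ad_revenue):
    # SQL's SUBSTR(sourceIP, 1, 7) is 1-indexed; the slice [:7] takes the same 7 characters.
    yield source_ip[:7], ad_revenue

def reduce_fn(prefix, revenues):
    # Equivalent of SUM(adRevenue) within one GROUP BY group.
    yield prefix, sum(revenues)

# In-memory stand-in for the shuffle: group revenues by prefix.
groups = defaultdict(list)
for source_ip, ad_revenue in user_visits:
    for key, value in map_fn(source_ip, ad_revenue):
        groups[key].append(value)

revenue_by_prefix = {}
for prefix, revenues in groups.items():
    for key, total in reduce_fn(prefix, revenues):
        revenue_by_prefix[key] = total

print(revenue_by_prefix)
```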
  93. 93. AGGREGATE TASK RESULTS 95
  94. 94. JOIN TASK •  Find the sourceIP that generated the most adRevenue along with its average pageRank •  Implementations: •  DBMSs – complex SQL using a temporary table •  MapReduce – Three separate MR programs 96
  95. 95. JOIN TASK 97 SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22' GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
  96. 96. JOIN TASK RESULTS 98
  97. 97. PROBLEMS WITH THIS ANALYSIS? Other ways to avoid sequential scans? Fault-tolerance in large clusters? Tasks that cannot be expressed as queries? 1/16/16 Bill Howe, UW 99
  98. 98. GOOGLE’S RESPONSE:CLUSTER SIZE •  Largest known database installations •  Greenplum – 96 nodes – 4.5 PB (eBay)[1] •  Teradata – 72 nodes – 2+PB (eBay)[1] •  Largest known MR installations: •  Hadoop – 3658 nodes – 1 PB (Yahoo)[2] •  Hive – 600+ nodes – 2.5 PB (Facebook)[3] [1] eBay’s two enormous data warehouses – April 30th, 2009 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/ [2] Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds – May 11th, 2009 https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html [3] Hive – A Petabyte Scale DataWarehouse using Hadoop – June 10th, 2009 https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919 100
  99. 99. CONCLUDING REMARKS •  What can MapReduce learn from Databases? •  Declarative languages are a good thing •  Schemas are important •  What can Databases learn from MapReduce? •  Query fault-tolerance •  Support for in situ data •  Embrace open-source 101
  100. 100. SLIDES CAN BE FOUND AT: TEACHINGDATASCIENCE.ORG
