Computing PageRank Using Hadoop (+ Introduction to MapReduce) Alexander Behm, Ajey Shah University of California, Irvine Instructor: Prof. Chen Li
Outline
Motivation for MapReduce
Motivation for MapReduce: Parallel Programming Models and MapReduce
MapReduce Goals
MapReduce is NOT… MapReduce is a programming paradigm!
MapReduce Flow
Input: split the input into key-value pairs; for each K-V pair, call Map. Each Map produces a new set of K-V pairs.
Sort: group the Map output by key.
Reduce(K, V[ ]): for each distinct key, call Reduce; this produces one K-V pair per distinct key.
Output: a set of key-value pairs.
MapReduce WordCount Example
Input: file containing words
Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop
Output: number of occurrences of each word
Bye 3, Hadoop 4, Hello 3, World 2
How can we do this within the MapReduce framework? Basic idea: parallelize on lines in the input file!
MapReduce WordCount Example
Input:
1, "Hello World Bye World"
2, "Hello Hadoop Bye Hadoop"
3, "Bye Hadoop Hello Hadoop"
Map(K, V) {
  For each word w in V
    Collect(w, 1);
}
Map Output (one Map call per line):
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
<Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
MapReduce WordCount Example
Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}
Map Output: <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
Internal Grouping: <Bye → 1, 1, 1> <Hadoop → 1, 1, 1, 1> <Hello → 1, 1, 1> <World → 1, 1>
Reduce Output (one Reduce call per key): <Bye, 3> <Hadoop, 4> <Hello, 3> <World, 2>
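The WordCount flow above can be checked outside Hadoop. Here is a minimal plain-Java simulation of the map, grouping, and reduce steps; the class and method names are ours, not Hadoop's:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the WordCount map/group/reduce flow (no Hadoop needed). */
public class WordCountSim {

    /** Map phase: emit a (word, 1) pair for every word in every input line. */
    public static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    /** Grouping + reduce phase: group the pairs by key and sum the values. */
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // keys sorted, like Hadoop's sort step
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "Hello World Bye World",
            "Hello Hadoop Bye Hadoop",
            "Bye Hadoop Hello Hadoop");
        System.out.println(reduce(map(input)));
        // prints {Bye=3, Hadoop=4, Hello=3, World=2}
    }
}
```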
Hadoop @Yahoo! Some Webmap size data:
Number of links between pages in the index: roughly 1 trillion links
Size of output: over 300 TB, compressed!
Number of cores used to run a single MapReduce job: over 10,000
Raw disk used in the production cluster: over 5 petabytes
(source: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html)
Typical Hadoop Setup
Our Hadoop Setup
MASTER: peach (Namenode, JobTracker, TaskTracker, DataNode)
SLAVES, connected via a switch: watermelon, cherry, avocado, blueberry (each running a DataNode and a TaskTracker)
Our Hadoop Setup Demo: Hadoop Admin Pages!
Storage: HDFS
Job Execution Diagram: Run Application → Job Tracker → Task Trackers, each running multiple Tasks (the Hadoop "black box").
Processing: Hadoop MapReduce
Using Hadoop To Program: your Map and Reduce classes extend MapReduceBase and implement the Mapper(…) and Reducer(…) interfaces.
Sample Map Class
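The code on this slide did not survive conversion. A hypothetical reconstruction, in the same old Hadoop API as the Reduce class on the next slide and in the style of the standard WordCount mapper, might look like this (it needs the Hadoop libraries to compile, so treat it as a sketch):

```java
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);   // emit <word, 1>
    }
  }
}
```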
Sample Reduce Class
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Running a Job Demo: Show WordCount Example
Project5: PageRank on Hadoop
Link Analysis
Crawled Pages → Link Extractor → Output
Output format: #|colNum|NumOfRows|<R,val>…..<R,val>|#.....
PageRank on MapReduce
Very Basic PageRank Algorithm
Input: PageRankVector, DistributionMatrix
ComputePageRank {
  Until converged {
    PageRankVector = DistributionMatrix * PageRankVector;
  }
}
Output: PageRankVector
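As a runnable, non-parallel sketch of this loop: the code below iterates rank = matrix * rank until the vector stops changing, on a tiny hypothetical 3-page graph. All names and the example graph are ours; there is no damping factor, matching the "very basic" algorithm above.

```java
import java.util.Arrays;

/** Minimal dense power-iteration sketch of the basic PageRank loop. */
public class PageRankSketch {

    /** One iteration step: result = matrix * rank. */
    public static double[] multiply(double[][] matrix, double[] rank) {
        double[] result = new double[rank.length];
        for (int i = 0; i < matrix.length; i++)
            for (int j = 0; j < rank.length; j++)
                result[i] += matrix[i][j] * rank[j];
        return result;
    }

    /** Iterate until converged (simple L1 distance test; assumes convergence). */
    public static double[] computePageRank(double[][] matrix, double[] rank) {
        while (true) {
            double[] next = multiply(matrix, rank);
            double diff = 0;
            for (int i = 0; i < rank.length; i++) diff += Math.abs(next[i] - rank[i]);
            rank = next;
            if (diff < 1e-10) return rank;
        }
    }

    public static void main(String[] args) {
        // Column-stochastic distribution matrix for a hypothetical 3-page graph:
        // page 1 links to 2 and 3; page 2 links to 3; page 3 links to 1.
        double[][] m = {
            {0.0, 0.0, 1.0},
            {0.5, 0.0, 0.0},
            {0.5, 1.0, 0.0}};
        double[] r = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        System.out.println(Arrays.toString(computePageRank(m, r)));
        // converges to ranks 0.4, 0.2, 0.4
    }
}
```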
PageRank on MapReduce
Why is storage a challenge?
UCI domain: 500,000 pages, assuming 4 bytes per entry.
Size of vector: 500,000 * 4 = 2,000,000 bytes = 2 MB
Size of matrix: 500,000 * 500,000 * 4 = 10^12 bytes = 1 TB
This assumes a fully connected graph. Clearly this is very unrealistic for web pages!
Solution: a sparse matrix. But row-wise or column-wise? It depends on usage patterns (i.e., how we do parallel matrix multiplication, update the matrix, etc.).
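The back-of-the-envelope arithmetic above can be spot-checked; a trivial sketch (method names are ours):

```java
/** Check the slide's storage arithmetic: 4-byte entries for 500,000 pages. */
public class StorageMath {

    /** One entry per page. */
    public static long vectorBytes(long pages, long bytesPerEntry) {
        return pages * bytesPerEntry;
    }

    /** One entry per (row, column) pair in a dense matrix. */
    public static long denseMatrixBytes(long pages, long bytesPerEntry) {
        return pages * pages * bytesPerEntry;
    }

    public static void main(String[] args) {
        System.out.println(vectorBytes(500_000, 4));      // 2,000,000 bytes = 2 MB
        System.out.println(denseMatrixBytes(500_000, 4)); // 10^12 bytes = 1 TB
    }
}
```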
PageRank on MapReduce
Parallel Matrix Multiplication
Requirement: make it work! A simple but practical solution.
X = M x V: every row of M is "combined" with V, yielding one element of X each.
Intuition:
- Parallelize on rows: each parallel task computes one final value.
- Use a row-wise sparse matrix, so the above can be done easily (column-wise is actually better for PageRank).
PageRank on MapReduce
Original Matrix (6 x 6):
Row 1: 0 0 0 0 1 0
Row 2: 0 1 0 1 0 0
Row 3: 1 1 0 0 0 0
Row 4: 0 0 0 1 1 0
Row 5: 1 1 0 0 0 0
Row 6: 0 1 0 0 0 1
Stored as a Row-Wise Sparse Matrix, keeping only (column, value) pairs for the non-zeros:
Row 1: (5, 1)
Row 2: (2, 1) (4, 1)
Row 3: (1, 1) (2, 1)
Row 4: (4, 1) (5, 1)
Row 5: (1, 1) (2, 1)
Row 6: (2, 1) (6, 1)
New Storage Requirements
UCI domain: 500,000 pages, assuming 4 bytes per entry and at most 100 outgoing links per page.
Size of matrix: 500,000 * 100 * (4 + 4) = 400 * 10^6 bytes = 400 MB
Notice: no more random access!
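The dense-to-sparse conversion above can be sketched in a few lines; the `Entry` record and method names are ours, and columns are 1-based as on the slide:

```java
import java.util.ArrayList;
import java.util.List;

/** Build the row-wise sparse form of the 6x6 example matrix. */
public class SparseRows {

    /** One stored entry: (column, value), column 1-based. */
    public record Entry(int column, int value) {}

    /** Keep only the non-zero entries of each row. */
    public static List<List<Entry>> toSparse(int[][] dense) {
        List<List<Entry>> rows = new ArrayList<>();
        for (int[] denseRow : dense) {
            List<Entry> row = new ArrayList<>();
            for (int j = 0; j < denseRow.length; j++)
                if (denseRow[j] != 0)
                    row.add(new Entry(j + 1, denseRow[j]));
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        int[][] m = {
            {0, 0, 0, 0, 1, 0},
            {0, 1, 0, 1, 0, 0},
            {1, 1, 0, 0, 0, 0},
            {0, 0, 0, 1, 1, 0},
            {1, 1, 0, 0, 0, 0},
            {0, 1, 0, 0, 0, 1}};
        System.out.println(toSparse(m)); // row 1 becomes [(5, 1)], row 2 becomes [(2, 1), (4, 1)], ...
    }
}
```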
PageRank on MapReduce
Map(Key, Row) {
  Vector v = getVector();
  Int sum = 0;
  For each Element e in Row
    sum += e.value * v.at(e.columnNumber);
  collect(Key, sum);
}
Reduce(Key, Value) {
  collect(Key, Value);
}
MapReduce procedures for parallel matrix*vector multiplication using a row-wise sparse matrix. The Reduce is the identity: each row's dot product is already the final value.
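This Map step can be simulated outside Hadoop. Below is a sketch over the slide's row-wise sparse representation, one "task" per row; the class and record names are ours, and the example rows are rows 1 and 2 of the 6x6 matrix from the previous slide:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simulate the Map step: each sparse row yields one element of M x V. */
public class SparseMultiply {

    /** One non-zero matrix element: (columnNumber, value), column 1-based. */
    public record Element(int columnNumber, double value) {}

    /** For each row key, compute sum of e.value * v[e.columnNumber]. */
    public static Map<Integer, Double> mapPhase(Map<Integer, List<Element>> sparseRows, double[] v) {
        Map<Integer, Double> out = new TreeMap<>();
        sparseRows.forEach((key, row) -> {
            double sum = 0;
            for (Element e : row)
                sum += e.value() * v[e.columnNumber() - 1];
            out.put(key, sum); // collect(Key, sum); the Reduce is the identity
        });
        return out;
    }

    public static void main(String[] args) {
        // Rows 1 and 2 of the 6x6 example matrix, in sparse form.
        Map<Integer, List<Element>> m = Map.of(
            1, List.of(new Element(5, 1.0)),
            2, List.of(new Element(2, 1.0), new Element(4, 1.0)));
        double[] v = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6};
        System.out.println(mapPhase(m, v)); // row 1 -> 0.5, row 2 -> 0.2 + 0.4
    }
}
```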
Matrix Vector Multiplication Demo: Show Matrix-Vector Multiplication
Hadoop: Implementing Your Own File Format
An HDFS file is divided into InputSplits; each InputSplit carries:
- Filename
- Start Offset
- End Offset
- Hosts in HDFS
Each InputSplit is read by a RecordReader, which feeds key-value records to a Map task.
References
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Sixth Symposium on Operating System Design and Implementation (OSDI '04), San Francisco, CA, December 2004.
[2] http://www.cs.cmu.edu/~knigam/15-505/HW1.html
[3] http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html
[4] http://lucene.apache.org/hadoop/
Flow
TextInputFormat implements InputFormat: getSplits(), getRecordReader()
FileInputFormat implements InputFormat: getSplits(), getRecordReader()
FileSplit implements InputSplit: File, Start Offset, End Offset, Hosts where chunks of File live
LineRecordReader implements RecordReader: one for each Split; Next(Key, Value)