Computing PageRank Using Hadoop (+ Introduction to MapReduce) Alexander Behm, Ajey Shah University of California, Irvine Instructor: Prof. Chen Li
Outline
Motivation for MapReduce
Motivation for MapReduce: Parallel Programming Models and MapReduce
MapReduce Goals
MapReduce is NOT… MapReduce is a programming paradigm!
MapReduce Flow
Input: split the input into key-value pairs; for each K-V pair, call Map. Each Map produces a new set of K-V pairs.
Sort: group the Map output by key.
Reduce(K, V[ ]): for each distinct key, call Reduce; this produces one K-V pair per distinct key.
Output: a set of key-value pairs.
MapReduce WordCount Example
Input: file containing words
Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop
Output: number of occurrences of each word
Bye 3, Hadoop 4, Hello 3, World 2
How can we do this within the MapReduce framework? Basic idea: parallelize on lines in the input file!
MapReduce WordCount Example
Input:
1, "Hello World Bye World"
2, "Hello Hadoop Bye Hadoop"
3, "Bye Hadoop Hello Hadoop"
Map(K, V) {
  For each word w in V
    Collect(w, 1);
}
Map Output (one Map call per line):
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
<Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
MapReduce WordCount Example
Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}
Map Output: <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
Internal Grouping: <Bye → 1, 1, 1> <Hadoop → 1, 1, 1, 1> <Hello → 1, 1, 1> <World → 1, 1>
Reduce Output (one Reduce call per key): <Bye, 3> <Hadoop, 4> <Hello, 3> <World, 2>
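The WordCount flow above can be checked outside Hadoop. Here is a minimal plain-Java simulation of the map, grouping, and reduce steps; the class and method names are ours, not Hadoop's:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the WordCount map/group/reduce flow (no Hadoop needed). */
public class WordCountSim {

    /** Map phase: emit a (word, 1) pair for every word in every input line. */
    public static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    /** Grouping + reduce phase: group the pairs by key and sum the values. */
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // keys sorted, like Hadoop's sort step
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "Hello World Bye World",
            "Hello Hadoop Bye Hadoop",
            "Bye Hadoop Hello Hadoop");
        System.out.println(reduce(map(input)));
        // prints {Bye=3, Hadoop=4, Hello=3, World=2}
    }
}
```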
Hadoop @Yahoo! Some Webmap size data:
Number of links between pages in the index: roughly 1 trillion links
Size of output: over 300 TB, compressed!
Number of cores used to run a single MapReduce job: over 10,000
Raw disk used in the production cluster: over 5 petabytes
(source: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html)
Typical Hadoop Setup
Our Hadoop Setup
MASTER: peach (Namenode, JobTracker, TaskTracker, DataNode)
SLAVES, connected via a switch: watermelon, cherry, avocado, blueberry (each running a DataNode and a TaskTracker)
Our Hadoop Setup Demo: Hadoop Admin Pages!
Storage: HDFS
Job Execution Diagram: Run Application → Job Tracker → Task Trackers, each running multiple Tasks (the Hadoop "black box").
Processing: Hadoop MapReduce
Using Hadoop To Program: your Map and Reduce classes extend MapReduceBase and implement the Mapper(…) and Reducer(…) interfaces.
Sample Map Class
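The code on this slide did not survive conversion. A hypothetical reconstruction, in the same old Hadoop API as the Reduce class on the next slide and in the style of the standard WordCount mapper, might look like this (it needs the Hadoop libraries to compile, so treat it as a sketch):

```java
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);   // emit <word, 1>
    }
  }
}
```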
Sample Reduce Class
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Running a Job Demo: Show WordCount Example
Project5: PageRank on Hadoop
Link Analysis
Crawled Pages → Link Extractor → Output
Output format: #|colNum|NumOfRows|<R,val>…..<R,val>|#.....
PageRank on MapReduce
Very Basic PageRank Algorithm
Input: PageRankVector, DistributionMatrix
ComputePageRank {
  Until converged {
    PageRankVector = DistributionMatrix * PageRankVector;
  }
}
Output: PageRankVector
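As a runnable, non-parallel sketch of this loop: the code below iterates rank = matrix * rank until the vector stops changing, on a tiny hypothetical 3-page graph. All names and the example graph are ours; there is no damping factor, matching the "very basic" algorithm above.

```java
import java.util.Arrays;

/** Minimal dense power-iteration sketch of the basic PageRank loop. */
public class PageRankSketch {

    /** One iteration step: result = matrix * rank. */
    public static double[] multiply(double[][] matrix, double[] rank) {
        double[] result = new double[rank.length];
        for (int i = 0; i < matrix.length; i++)
            for (int j = 0; j < rank.length; j++)
                result[i] += matrix[i][j] * rank[j];
        return result;
    }

    /** Iterate until converged (simple L1 distance test; assumes convergence). */
    public static double[] computePageRank(double[][] matrix, double[] rank) {
        while (true) {
            double[] next = multiply(matrix, rank);
            double diff = 0;
            for (int i = 0; i < rank.length; i++) diff += Math.abs(next[i] - rank[i]);
            rank = next;
            if (diff < 1e-10) return rank;
        }
    }

    public static void main(String[] args) {
        // Column-stochastic distribution matrix for a hypothetical 3-page graph:
        // page 1 links to 2 and 3; page 2 links to 3; page 3 links to 1.
        double[][] m = {
            {0.0, 0.0, 1.0},
            {0.5, 0.0, 0.0},
            {0.5, 1.0, 0.0}};
        double[] r = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        System.out.println(Arrays.toString(computePageRank(m, r)));
        // converges to ranks 0.4, 0.2, 0.4
    }
}
```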
PageRank on MapReduce
Why is storage a challenge?
UCI domain: 500,000 pages, assuming 4 bytes per entry.
Size of vector: 500,000 * 4 = 2,000,000 bytes = 2 MB
Size of matrix: 500,000 * 500,000 * 4 = 10^12 bytes = 1 TB
This assumes a fully connected graph. Clearly this is very unrealistic for web pages!
Solution: a sparse matrix. But row-wise or column-wise? It depends on usage patterns (i.e., how we do parallel matrix multiplication, update the matrix, etc.).
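The back-of-the-envelope arithmetic above can be spot-checked; a trivial sketch (method names are ours):

```java
/** Check the slide's storage arithmetic: 4-byte entries for 500,000 pages. */
public class StorageMath {

    /** One entry per page. */
    public static long vectorBytes(long pages, long bytesPerEntry) {
        return pages * bytesPerEntry;
    }

    /** One entry per (row, column) pair in a dense matrix. */
    public static long denseMatrixBytes(long pages, long bytesPerEntry) {
        return pages * pages * bytesPerEntry;
    }

    public static void main(String[] args) {
        System.out.println(vectorBytes(500_000, 4));      // 2,000,000 bytes = 2 MB
        System.out.println(denseMatrixBytes(500_000, 4)); // 10^12 bytes = 1 TB
    }
}
```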
PageRank on MapReduce
Parallel Matrix Multiplication
Requirement: make it work! A simple but practical solution.
X = M x V: every row of M is "combined" with V, yielding one element of X each.
Intuition:
- Parallelize on rows: each parallel task computes one final value.
- Use a row-wise sparse matrix, so the above can be done easily (column-wise is actually better for PageRank).
PageRank on MapReduce
Original Matrix (6 x 6):
Row 1: 0 0 0 0 1 0
Row 2: 0 1 0 1 0 0
Row 3: 1 1 0 0 0 0
Row 4: 0 0 0 1 1 0
Row 5: 1 1 0 0 0 0
Row 6: 0 1 0 0 0 1
Stored as a Row-Wise Sparse Matrix, keeping only (column, value) pairs for the non-zeros:
Row 1: (5, 1)
Row 2: (2, 1) (4, 1)
Row 3: (1, 1) (2, 1)
Row 4: (4, 1) (5, 1)
Row 5: (1, 1) (2, 1)
Row 6: (2, 1) (6, 1)
New Storage Requirements
UCI domain: 500,000 pages, assuming 4 bytes per entry and at most 100 outgoing links per page.
Size of matrix: 500,000 * 100 * (4 + 4) = 400 * 10^6 bytes = 400 MB
Notice: no more random access!
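The dense-to-sparse conversion above can be sketched in a few lines; the `Entry` record and method names are ours, and columns are 1-based as on the slide:

```java
import java.util.ArrayList;
import java.util.List;

/** Build the row-wise sparse form of the 6x6 example matrix. */
public class SparseRows {

    /** One stored entry: (column, value), column 1-based. */
    public record Entry(int column, int value) {}

    /** Keep only the non-zero entries of each row. */
    public static List<List<Entry>> toSparse(int[][] dense) {
        List<List<Entry>> rows = new ArrayList<>();
        for (int[] denseRow : dense) {
            List<Entry> row = new ArrayList<>();
            for (int j = 0; j < denseRow.length; j++)
                if (denseRow[j] != 0)
                    row.add(new Entry(j + 1, denseRow[j]));
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        int[][] m = {
            {0, 0, 0, 0, 1, 0},
            {0, 1, 0, 1, 0, 0},
            {1, 1, 0, 0, 0, 0},
            {0, 0, 0, 1, 1, 0},
            {1, 1, 0, 0, 0, 0},
            {0, 1, 0, 0, 0, 1}};
        System.out.println(toSparse(m)); // row 1 becomes [(5, 1)], row 2 becomes [(2, 1), (4, 1)], ...
    }
}
```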
PageRank on MapReduce
Map(Key, Row) {
  Vector v = getVector();
  Int sum = 0;
  For each Element e in Row
    sum += e.value * v.at(e.columnNumber);
  collect(Key, sum);
}
Reduce(Key, Value) {
  collect(Key, Value);
}
MapReduce procedures for parallel matrix*vector multiplication using a row-wise sparse matrix. The Reduce is the identity: each row's dot product is already the final value.
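This Map step can be simulated outside Hadoop. Below is a sketch over the slide's row-wise sparse representation, one "task" per row; the class and record names are ours, and the example rows are rows 1 and 2 of the 6x6 matrix from the previous slide:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simulate the Map step: each sparse row yields one element of M x V. */
public class SparseMultiply {

    /** One non-zero matrix element: (columnNumber, value), column 1-based. */
    public record Element(int columnNumber, double value) {}

    /** For each row key, compute sum of e.value * v[e.columnNumber]. */
    public static Map<Integer, Double> mapPhase(Map<Integer, List<Element>> sparseRows, double[] v) {
        Map<Integer, Double> out = new TreeMap<>();
        sparseRows.forEach((key, row) -> {
            double sum = 0;
            for (Element e : row)
                sum += e.value() * v[e.columnNumber() - 1];
            out.put(key, sum); // collect(Key, sum); the Reduce is the identity
        });
        return out;
    }

    public static void main(String[] args) {
        // Rows 1 and 2 of the 6x6 example matrix, in sparse form.
        Map<Integer, List<Element>> m = Map.of(
            1, List.of(new Element(5, 1.0)),
            2, List.of(new Element(2, 1.0), new Element(4, 1.0)));
        double[] v = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6};
        System.out.println(mapPhase(m, v)); // row 1 -> 0.5, row 2 -> 0.2 + 0.4
    }
}
```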
Matrix Vector Multiplication Demo: Show Matrix-Vector Multiplication
Hadoop: Implementing Your Own File Format
An HDFS file is divided into InputSplits; each InputSplit carries:
- Filename
- Start Offset
- End Offset
- Hosts in HDFS
Each InputSplit is read by a RecordReader, which feeds key-value records to a Map task.
References
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Sixth Symposium on Operating System Design and Implementation (OSDI '04), San Francisco, CA, December 2004.
[2] http://www.cs.cmu.edu/~knigam/15-505/HW1.html
[3] http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html
[4] http://lucene.apache.org/hadoop/
Flow
TextInputFormat implements InputFormat: getSplits(), getRecordReader()
FileInputFormat implements InputFormat: getSplits(), getRecordReader()
FileSplit implements InputSplit: File, Start Offset, End Offset, Hosts where chunks of File live
LineRecordReader implements RecordReader: one for each Split; Next(Key, Value)