SlideShare a Scribd company logo
1 of 4
Download to read offline
Pagerank on Mapreduce (With Example)
Song Liu, sl9885@bristol.ac.uk


Pagerank is a Good Thing.

It can indicate the internal authority of each node in a network by investigating the relationships
between them.

The original formula is can be seen here
http://en.wikipedia.org/wiki/PageRank

For simplicity, it can be written in this way: Rv=beta * Sigma(Rn/Nn)+ (1-beta)/N

Where Rv is the Pagerank of PageV and Rn represents the Pagerank of PageN which links to
PageV, and Nn is the number of total links of PageN. By summing up (Rn/Nn) for all inlinks of
PageV, we can get the initial Rank of PageV. To normalize this rank, we use two parameters. Beta
is a parameter which is called "damping coefficient", and N is the total number of webpages.

Again, for simplicity, we don’t consider the normalization: the "damping coefficient" beta and
total number N.


Let's Parallelize It

We assume the input follows the logical pattern below:

PageA -> Page1, Page2, Page3...
PageB -> page1', Page2', Page3'...

Before the first iteration, we need to assign the initial rank for each webpage. To my experience,
0.5 is a good number for small website or network. In real calculation, this assignment can be
done by a single-pass mapreduce procedure, after which, the input becomes to the following
structure:

<PageA, 0.5> -> Page1, Page2, Page3...
<PageB, 0.5> -> page1', Page2', Page3'...

This data structure indicates the outlinks for PageA, PageB... But in Pagerank, we concern more
about the inlinks. So the first thing we need to do in Mapper function is to invert this data structure.
Second, the factor Rn/Nn can be also easily calculated at this stage for each webpage.
>>>>>Pseudo Code Starts<<<<<
function Mapper
input <PageN, RankN> -> PageA, PageB, PageC...
begin
     Nn := the number of outlinks for PageN;
     for each outlink PageK
           output PageK-> <PageN, RankN/Nn>
     // output the outlinks for PageN.
     output PageN-> PageA, PageB, PageC…
end
>>>>>Pseudo Code Ends<<<<<

So after Mapper function, we have the following data structure:
PageK-> <PageN1, RankN1/Nn1>
           <PageN2, RankN2/Nn2>
           ...
           <PageAk, PageBk, PageCk…>

where PageN1, PageN2 is the inlinks of PageK, and PageAk, PageBk… are the outlinks. These
outlinks are saved because we need them to rebuild the input format for the next iteration.

The job for Reducer function is to update the pagerank for PageK, using the ranks of inlinks.

>>>>>Pseudo Code Starts<<<<<
function Reducer
input PageK-> <PageN1, RankN1/Nn1>
                <PageN2, RankN2/Nn2>
                ...
                <PageAk, PageBk, PageCk…> //outlinks of PageK
begin
     RankK := 0;
     for each inlink PageNi
           RankK += RankNi/Nni * beta
     // output the PageK and its new Rank for the next iteration
     output <PageK, RankK> -> <PageAk, PageBk, PageCk…>
end
>>>>>Pseudo Code Ends<<<<<

It's almost done! All you need is to run this Map-Reduce procedure iteratively till convergence. To
my knowledge, 15 iterations are enough in small cases, but larger case may take hundreds of
iterations.
Let’s Taste It

Here is some top pageranks of today (23/Feb)'s New York Times website (I only crawled world
news). There are around 14000 pages in total.
The statistics is interesting, and surely, you will find the same distribution pattern if you try to
rank more website and plot more.

http://www.nytimes.com/    1307.420286
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html 834.3881531
http://www.nytimes.com/2010/02/22/world/americas/22gays.html?bl 372.152869
http://www.nytimes.com/2010/02/24/world/asia/24afghan.html?hp      119.3507215
http://www.nytimes.com/2010/02/23/world/asia/23islamabad.html      81.86431592
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=all 70.64134337
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=2 60.43903422
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?hpw 59.92536076
http://www.nytimes.com/2010/02/23/world/asia/23afghan.html?hpw 59.92536076
http://www.nytimes.com/2010/02/23/world/asia/23islamabad.html?hp        59.92536076
http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html?hpw         59.92536076
http://www.nytimes.com/2010/02/22/world/americas/22gays.html       47.20198605
http://www.nytimes.com/2010/02/22/world/americas/22gays.html?bl=&pagewanted=all          41.90279295
http://www.nytimes.com/2010/02/24/world/asia/24afghan.html 10.56957042
http://www.nytimes.com/2010/02/23/world/asia/23afghan.html 10.1774012
http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html       10.02345548
http://www.nytimes.com/2010/02/24/world/asia/24afghan.html?hp=&pagewanted=all 9.718268616
http://www.nytimes.com/2010/02/17/world/asia/17afghan.html?fta=y 7.693364285
http://www.nytimes.com/2010/02/20/world/asia/20drones.html?fta=y 7.615689791
http://www.nytimes.com/2002/07/07/world/afghan-official-is-assassinated-blow-to-karzai.html   6.967736047
http://www.nytimes.com/2010/01/08/world/asia/08afghan.html?fta=y 6.29221417
http://www.nytimes.com/2010/01/07/world/asia/07afghan.html?fta=y 5.909407591
http://www.nytimes.com/2010/02/22/world/americas/22gays.html?pagewanted=all         5.799193096
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?hpw=&pagewanted=all           5.657414027
http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html?hpw=&pagewanted=all         5.534888339
http://www.nytimes.com/2010/02/23/world/asia/23taliban.html?ref=asia    5.3387006
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=2&hpw 4.840375361
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=1 4.816480315
http://www.nytimes.com/2010/02/16/world/asia/16intel.html    4.671784654
http://www.nytimes.com/2010/02/15/world/asia/15afghan.html?fta=y 4.292737953
Here is the rank distribution (only top 30):




Tools

The ranks I plotted above were calculated by a small search engine which contains the Pagerank
calculation on Mapreduce. Fortunately, it seems working fine with the Bluecrystal cluster.
Further detailed usage can be found in its website: http://joycrawler.googlecode.com
Or feel free to email me: sl9885@bristol.ac.uk

More Related Content

Similar to Pagerank is a good thing

Random web surfer pagerank algorithm
Random web surfer pagerank algorithmRandom web surfer pagerank algorithm
Random web surfer pagerank algorithmalexandrelevada
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)James Arnold
 
Done reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcompleteDone reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcompleteJames Arnold
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)James Arnold
 
Local Approximation of PageRank
Local Approximation of PageRankLocal Approximation of PageRank
Local Approximation of PageRanksjuyal
 
Eric Fan Insight Project Demo
Eric Fan Insight Project DemoEric Fan Insight Project Demo
Eric Fan Insight Project DemoEric Fan
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibEl Habib NFAOUI
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
A hadoop implementation of pagerank
A hadoop implementation of pagerankA hadoop implementation of pagerank
A hadoop implementation of pagerankChengeng Ma
 
Incremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESIncremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESSubhajit Sahu
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInChris Riccomini
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInChris Riccomini
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data miningMai Mustafa
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentationNoha Elprince
 

Similar to Pagerank is a good thing (20)

Random web surfer pagerank algorithm
Random web surfer pagerank algorithmRandom web surfer pagerank algorithm
Random web surfer pagerank algorithm
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
 
Done reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcompleteDone reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcomplete
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
 
Local Approximation of PageRank
Local Approximation of PageRankLocal Approximation of PageRank
Local Approximation of PageRank
 
Eric Fan Insight Project Demo
Eric Fan Insight Project DemoEric Fan Insight Project Demo
Eric Fan Insight Project Demo
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_Habib
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
J046045558
J046045558J046045558
J046045558
 
A hadoop implementation of pagerank
A hadoop implementation of pagerankA hadoop implementation of pagerank
A hadoop implementation of pagerank
 
T180304125129
T180304125129T180304125129
T180304125129
 
Pagerank
PagerankPagerank
Pagerank
 
Incremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESIncremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTES
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedIn
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedIn
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data mining
 
I04015559
I04015559I04015559
I04015559
 
Page Rank Link Farm Detection
Page Rank Link Farm DetectionPage Rank Link Farm Detection
Page Rank Link Farm Detection
 
~Ns2~
~Ns2~~Ns2~
~Ns2~
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 

Recently uploaded

What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 

Recently uploaded (20)

What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 

Pagerank is a good thing

  • 1. Pagerank on Mapreduce (With Example) Song Liu, sl9885@bristol.ac.uk Pagerank is a Good Thing. It can indicate the internal authority of each node in a network by investigating the relationships between them. The original formula is can be seen here http://en.wikipedia.org/wiki/PageRank For simplicity, it can be written in this way: Rv=beta * Sigma(Rn/Nn)+ (1-beta)/N Where Rv is the Pagerank of PageV and Rn represents the Pagerank of PageN which links to PageV, and Nn is the number of total links of PageN. By summing up (Rn/Nn) for all inlinks of PageV, we can get the initial Rank of PageV. To normalize this rank, we use two parameters. Beta is a parameter which is called "damping coefficient", and N is the total number of webpages. Again, for simplicity, we don’t consider the normalization: the "damping coefficient" beta and total number N. Let's Parallelize It We assume the input follows the logical pattern below: PageA -> Page1, Page2, Page3... PageB -> page1', Page2', Page3'... Before the first iteration, we need to assign the initial rank for each webpage. To my experience, 0.5 is a good number for small website or network. In real calculation, this assignment can be done by a single-pass mapreduce procedure, after which, the input becomes to the following structure: <PageA, 0.5> -> Page1, Page2, Page3... <PageB, 0.5> -> page1', Page2', Page3'... This data structure indicates the outlinks for PageA, PageB... But in Pagerank, we concern more about the inlinks. So the first thing we need to do in Mapper function is to invert this data structure. Second, the factor Rn/Nn can be also easily calculated at this stage for each webpage.
  • 2. >>>>>Pseudo Code Starts<<<<< function Mapper input <PageN, RankN> -> PageA, PageB, PageC... begin Nn := the number of outlinks for PageN; for each outlink PageK output PageK-> <PageN, RankN/Nn> // output the outlinks for PageN. output PageN-> PageA, PageB, PageC… end >>>>>Pseudo Code Ends<<<<< So after Mapper function, we have the following data structure: PageK-> <PageN1, RankN1/Nn1> <PageN2, RankN2/Nn2> ... <PageAk, PageBk, PageCk…> where PageN1, PageN2 is the inlinks of PageK, and PageAk, PageBk… are the outlinks. These outlinks are saved because we need them to rebuild the input format for the next iteration. The job for Reducer function is to update the pagerank for PageK, using the ranks of inlinks. >>>>>Pseudo Code Starts<<<<< function Reducer input PageK-> <PageN1, RankN1/Nn1> <PageN2, RankN2/Nn2> ... <PageAk, PageBk, PageCk…> //outlinks of PageK begin RankK := 0; for each inlink PageNi RankK += RankNi/Nni * beta // output the PageK and its new Rank for the next iteration output <PageK, RankK> -> <PageAk, PageBk, PageCk…> end >>>>>Pseudo Code Ends<<<<< It's almost done! All you need is to run this Map-Reduce procedure iteratively till convergence. To my knowledge, 15 iterations are enough in small cases, but larger case may take hundreds of iterations.
  • 3. Let’s Taste It Here is some top pageranks of today (23/Feb)'s New York Times website (I only crawled world news). There are around 14000 pages in total. The statistics is interesting, and surely, you will find the same distribution pattern if you try to rank more website and plot more. http://www.nytimes.com/ 1307.420286 http://www.nytimes.com/2010/02/23/world/americas/23amputee.html 834.3881531 http://www.nytimes.com/2010/02/22/world/americas/22gays.html?bl 372.152869 http://www.nytimes.com/2010/02/24/world/asia/24afghan.html?hp 119.3507215 http://www.nytimes.com/2010/02/23/world/asia/23islamabad.html 81.86431592 http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=all 70.64134337 http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=2 60.43903422 http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?hpw 59.92536076 http://www.nytimes.com/2010/02/23/world/asia/23afghan.html?hpw 59.92536076 http://www.nytimes.com/2010/02/23/world/asia/23islamabad.html?hp 59.92536076 http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html?hpw 59.92536076 http://www.nytimes.com/2010/02/22/world/americas/22gays.html 47.20198605 http://www.nytimes.com/2010/02/22/world/americas/22gays.html?bl=&pagewanted=all 41.90279295 http://www.nytimes.com/2010/02/24/world/asia/24afghan.html 10.56957042 http://www.nytimes.com/2010/02/23/world/asia/23afghan.html 10.1774012 http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html 10.02345548 http://www.nytimes.com/2010/02/24/world/asia/24afghan.html?hp=&pagewanted=all 9.718268616 http://www.nytimes.com/2010/02/17/world/asia/17afghan.html?fta=y 7.693364285 http://www.nytimes.com/2010/02/20/world/asia/20drones.html?fta=y 7.615689791 http://www.nytimes.com/2002/07/07/world/afghan-official-is-assassinated-blow-to-karzai.html 6.967736047 http://www.nytimes.com/2010/01/08/world/asia/08afghan.html?fta=y 6.29221417 http://www.nytimes.com/2010/01/07/world/asia/07afghan.html?fta=y 5.909407591 http://www.nytimes.com/2010/02/22/world/americas/22gays.html?pagewanted=all 5.799193096 http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?hpw=&pagewanted=all 5.657414027 http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html?hpw=&pagewanted=all 5.534888339 http://www.nytimes.com/2010/02/23/world/asia/23taliban.html?ref=asia 5.3387006 http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=2&hpw 4.840375361 http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=1 4.816480315 http://www.nytimes.com/2010/02/16/world/asia/16intel.html 4.671784654 http://www.nytimes.com/2010/02/15/world/asia/15afghan.html?fta=y 4.292737953
  • 4. Here is the rank distribution (only top 30): Tools The ranks I plotted above were calculated by a small search engine which contains the Pagerank calculation on Mapreduce. Fortunately, it seems working fine with the Bluecrystal cluster. Further detailed usage can be found in its website: http://joycrawler.googlecode.com Or feel free to email me: sl9885@bristol.ac.uk