PageRank is an algorithm developed by Larry Page and Sergey Brin at Stanford University to rank web pages for Google search results. It determines a page's importance by counting the number and quality of links to a page. PageRank is calculated through an iterative process and spreads importance across the web like a "random surfer" clicking links. Key factors that influence PageRank include internal site structure, external links, and minimizing links out of the site from high PageRank pages. PageRank helped Google revolutionize search and become a multi-billion dollar company by providing more relevant results.
PageRank & Searching
Rahul Bindra, Aditya Nigudkar
B.E, Computer Science, 2nd Year
Madhav Institute of Technology and Science, Gwalior
rahulbindra@gmail.com
adimits@gmail.com
ABSTRACT
Information on the web is increasing exponentially, and so is the number of people seeking it. To
minimize the time spent searching, we rely on search engines for almost every piece of
information on the web. Earlier search engines such as AltaVista, Inktomi and Overture
struggled to rank their results usefully. In the late 1990s Google, led by Sergey Brin and Larry Page,
revolutionized the search engine industry with its PageRank system for ranking search results,
which helped Google grow from a struggling two-person startup into a multi-billion dollar search
enterprise. This paper gives an insight into the pros and cons of the PageRank system and how it
changed the meaning of searching for billions of people across the globe.
1. Introduction
PageRank was developed at Stanford University by Larry Page (hence the name Page-Rank) and
Sergey Brin as part of a research project about a new kind of search engine. The project started in
1995 and led to a functional prototype, named Google, in 1998. Shortly after, Page and Brin founded
Google Inc., the company behind the Google search engine. While just one of many factors which
determine the ranking of Google search results, PageRank continues to provide the basis for all of
Google's web search tools.
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an
indicator of an individual page's value. PageRank is built on an intuitive model of a web surfer's
behaviour: it estimates the probability that a web page will be visited by a surfer who follows links
at random. It also takes into consideration the pages that link to the page. Using Yahoo
as an example, the justification is that if a page like Yahoo links directly to another page,
that page is very likely to be of high quality (Brin, S., Page, L., 2000).
2. PageRank
2.1 What is PageRank?
PageRank is a link analysis algorithm which assigns a numerical weighting to each element of a
hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its
relative importance within the set. The algorithm may be applied to any collection of entities with
reciprocal quotations and references. The numerical weight that it assigns to any given element E is
also called the PageRank of E and denoted by PR(E).
Google figures that when one page links to another page, it is effectively casting a vote for the other
page. The more votes that are cast for a page, the more important the page must be. Also, the
importance of the page that is casting the vote determines how important the vote itself is. Google
calculates a page's importance from the votes cast for it, and the importance of each vote is taken into
account when a page's PageRank is calculated. PageRank is Google's method of measuring a page's
"importance." Once all other factors such as the Title tag and keywords have been taken into account,
Google uses PageRank to adjust results so that sites that are deemed more "important" move up in the
results page of a user's search accordingly. A basic overview of how Google ranks pages in its
search engine results pages (SERPs) follows:
1) Find all pages matching the keywords of the search.
2) Rank accordingly using "on the page factors" such as keywords.
3) Calculate in the inbound anchor text.
4) Adjust the results by PageRank scores.
2.2 How is PageRank Calculated?
Quoting from the original Google paper, PageRank is defined like this:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a
damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details
about d in the next section. Also C(A) is defined as the number of links going out of page A. The
PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages'
PageRanks will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the
principal eigenvector of the normalized link matrix of the web.
That is not too helpful on its own, so let’s break it down into sections.
PR(Tn) - Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page that
links to page A, all the way up to “PR(Tn)” for the last one.
C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The count, or
number, of outgoing links for page T1 is “C(T1)”, and so on up to “C(Tn)” for page Tn.
PR(Tn)/C(Tn) - So if our page (page A) has a backlink from page “n”, the share of the vote page A
will get is “PR(Tn)/C(Tn)”.
d(... - All these fractions of votes are added together but, to stop the other pages having too much
influence, this total vote is “damped down” by multiplying it by 0.85 (the factor “d”).
(1 - d) - The (1 - d) term at the beginning is a bit of probability math magic so that the “sum of all
web pages' PageRanks will be one”: it adds back the part lost by the d(... term. It also means that even
a page with no links to it (no backlinks) still gets a small PR of 0.15 (i.e. 1 - 0.85). (Aside: the
Google paper says “the sum of all pages” but, with the formula as written, what equals one is the
“normalised sum”, otherwise known as the average.)
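As a rough illustration of the breakdown above, the formula for a single page can be written as a small Python function. The function name page_rank_of and the sample backlink values are assumptions made for this sketch; they do not come from the Google paper.

```python
def page_rank_of(backlinks, d=0.85):
    """Apply PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) once.

    backlinks is a list of (PR(T), C(T)) pairs, one per page T linking to A.
    """
    # Each linking page passes on its PageRank split evenly over its outgoing links.
    total_share = sum(pr / out_links for pr, out_links in backlinks)
    # The summed shares are damped by d, and the (1 - d) term is added on top.
    return (1 - d) + d * total_share


# Hypothetical example: two backlinks, each from a page with PR 1;
# the first linking page has 2 outgoing links, the second has only 1.
example = page_rank_of([(1.0, 2), (1.0, 1)])

# A page with no backlinks keeps only the (1 - d) term.
no_backlinks = page_rank_of([])
```

Here example works out to roughly 0.15 + 0.85 * 1.5 = 1.425, and no_backlinks to roughly 0.15, matching the minimum PR mentioned above.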
We can think of it in a simpler way:-
a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)
"share" = the linking page's PageRank divided by the number of outbound links on the page.
2.3 Need Of Iterations
The equation above clearly shows how a page's PageRank is arrived at. But what isn't immediately
obvious is that it can't work if the calculation is done just once. Suppose we have 2 pages, A and B,
which link to each other, and neither has any other links of any kind. This is what happens:-
Step 1: Calculate page A's PageRank from the value of its inbound links
Page A now has a new PageRank value. The calculation used the value of the inbound link from page
B. But page B has an inbound link (from page A) and its new PageRank value hasn't been worked out
yet, so page A's new PageRank value is based on inaccurate data and can't be accurate.
Step 2: Calculate page B's PageRank from the value of its inbound links
Page B now has a new PageRank value, but it can't be accurate because the calculation used the new
PageRank value of the inbound link from page A, which is inaccurate.
It's a Catch 22 situation. We can't work out A's PageRank until we know B's PageRank, and we can't
work out B's PageRank until we know A's PageRank.
The problem is overcome by repeating the calculations many times. Each time produces slightly more
accurate values. In fact, total accuracy can never be achieved because the calculations are always
based on inaccurate values. 40 to 50 iterations are sufficient to reach a point where any further
iterations wouldn't produce enough of a change to the values to matter. This is precisely what Google
does at each update, and it's the reason why the updates take so long.
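The repeated calculation described above can be sketched in Python for the two-page case. This is a minimal sketch, not Google's implementation: the function name iterate_pagerank, the starting value of 0 and the stopping tolerance of 0.0001 are all assumptions chosen for illustration.

```python
def iterate_pagerank(links, d=0.85, start=0.0, tolerance=1e-4, max_rounds=1000):
    """Repeat the PageRank calculation until no value changes by more than tolerance.

    links maps each page to the list of pages it links to.
    Returns the final PageRank values and the number of rounds used.
    """
    pr = {page: start for page in links}
    for rounds in range(1, max_rounds + 1):
        new_pr = {}
        for page in links:
            # Sum the shares from every page that links to this one,
            # using only the values from the previous round.
            shares = sum(pr[src] / len(out) for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * shares
        change = max(abs(new_pr[p] - pr[p]) for p in links)
        pr = new_pr
        if change < tolerance:
            break
    return pr, rounds


# Pages A and B link only to each other.
two_pages = {"A": ["B"], "B": ["A"]}
ranks, rounds_used = iterate_pagerank(two_pages)
```

With these assumed settings both pages settle close to a PageRank of 1 after a few dozen rounds, and each round changes the values less than the one before. The sketch updates every page from the previous round's values, whereas the step-by-step description above reuses page A's new value straight away; both approaches settle on the same result.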
2.4 Examples
1. Let's consider a 3-page site (pages A, B and C) with no links coming in from the outside. We will
allocate each page an initial PageRank of 1, although it makes no difference whether we start each
page with 1, 0 or 99. Apart from a few millionths of a PageRank point, after many iterations the end
result is always the same. Starting with 1 requires fewer iterations for the PageRanks to converge to a
suitable result than when starting with 0 or any other number.
[Figure: pages A, B and C, each with PR 1 and no links between them]
The site's maximum PageRank is the amount of PageRank in the site. In this case, we have 3 pages so
the site's maximum is 3. At the moment, none of the pages link to any other pages and none link to
them. If you make the calculation once for each page, you'll find that each of them ends up with a
PageRank of 0.15. No matter how many iterations you run, each page's PageRank remains at 0.15. The
total PageRank in the site = 0.45, whereas it could be 3. The site is seriously wasting most of its
potential PageRank.
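The unlinked three-page case above is easy to check with a short Python sketch. The helper name pagerank and the dictionary layout are assumptions made for this illustration.

```python
def pagerank(links, d=0.85, iterations=50, start=1.0):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over every page.

    links maps each page to the list of pages it links to.
    """
    pr = {page: start for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Shares received from every page that links to this one.
            shares = sum(pr[src] / len(out) for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * shares
        pr = new_pr
    return pr


# Three pages with no links at all, each starting at PR 1.
three_unlinked = {"A": [], "B": [], "C": []}
ranks = pagerank(three_unlinked)
total = sum(ranks.values())
```

Because no page receives any share, every page ends at roughly 0.15 and the total is roughly 0.45, however many iterations are run.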
2. Now let's add some links to the above example and start again with a PR of 1 all round. After 1 iteration:-
Page A = 1.425
Page B = 1
Page C = 0.575
By comparison to the 1 iteration figures in the previous example, page A has lost some PageRank,
page B has gained some and page C stayed the same. Page C now shares its "vote" between A and B.
Previously A received all of it. That's why page A has lost out and why page B has gained. After
100 iterations:-
Page A = 1.298245
Page B = 0.9999999
Page C = 0.7017543
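The figures quoted above can be reproduced with a short Python sketch. The link structure used here (A links to B and C, B links to A, C links to A and B) is an assumption: it is one arrangement consistent with the quoted numbers, since the original diagram is not available. The helper name pagerank is likewise chosen only for this illustration.

```python
def pagerank(links, d=0.85, iterations=50, start=1.0):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over every page.

    links maps each page to the list of pages it links to.
    """
    pr = {page: start for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Shares received from every page that links to this one,
            # taken from the previous iteration's values.
            shares = sum(pr[src] / len(out) for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * shares
        pr = new_pr
    return pr


# Assumed link structure, consistent with the figures in the text.
site = {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]}
after_1 = pagerank(site, iterations=1)
after_100 = pagerank(site, iterations=100)
```

Under this assumed structure, after_1 comes out at about 1.425, 1 and 0.575 for A, B and C, and after_100 at about 1.298245, 1 and 0.701754, in line with the values listed above. The total stays at 3 because every page has at least one outgoing link.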
[Figure: pages A, B and C, each starting with PR 1, with the added links]
3. Another example involving multiple links is as follows:
The calculated PR values are depicted in the following diagram:
As you’d expect, the home page has the most PR – after all, it has the most incoming links! But what’s
happened to the average? It’s only 0.378. Take a look at the “external site” pages – what’s happening
to their PageRank? They’re not passing it on, they’re not voting for anyone, they’re wasting their PR.
A possible remedy to this problem is suggested in the following figure:
Thus it can be inferred from the above example that the way a site is interlinked with
other sites plays a very important role in determining its “Average PR”, which in turn helps to
improve its position in Google’s search results.
2.5 PageRank Feedback
Another important aspect of PageRank is "PageRank Feedback." Pages linking to each other can
create a feedback effect that can increase the PageRank of those pages.
Let's say that page A currently links to nowhere. If we add a link from page A to page B, then page A
is saying that page B is important. This means the measure of page B's votes is also increased. Page B
is now saying that the pages it links to are more important than they otherwise would be. So the
measure of those pages' votes will be increased... and so on (with the pages they link to through the
link structure). The effect is diluted as it moves down through the links. If we could point our web
browser at page B and, by clicking on the links, get to page A, then so could the Google algorithm
(at least one of the pages linking to page A has become more important). If a page linking to page
A is more important, then so is its vote, and subsequently page A becomes more important! So by
linking to page B, page A has made itself more important, thus creating PageRank Feedback.
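The feedback effect described above can be seen in a small Python sketch. It assumes the simplest possible case, in which page B links straight back to page A; the helper name pagerank and the two-page setup are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=50, start=1.0):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over every page.

    links maps each page to the list of pages it links to.
    """
    pr = {page: start for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Shares received from every page that links to this one.
            shares = sum(pr[src] / len(out) for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * shares
        pr = new_pr
    return pr


# Before: B links to A, but A links to nowhere.
before = pagerank({"A": [], "B": ["A"]})

# After: A adds a link to B, so the two pages now feed each other.
after = pagerank({"A": ["B"], "B": ["A"]})
```

In this assumed setup page A's PageRank rises from roughly 0.2775 to roughly 1 once it links to page B, even though nothing else about the two pages has changed.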
An interesting thing about PR Feedback is that you can use it to your advantage via the internal
navigational structure of your site. It is very important to keep as much PageRank within your site as
possible. Therefore, only link out to other sites from low PageRank pages of your site.
3. Optimization Of PageRank
There are three fundamental areas to look at when trying to optimize the PageRank for your site:
1. The links to your site, i.e., which ones you choose to pursue, and how much effort you put
in to getting them.
2. Who you choose to link out to from your site, and from which page of your site you place their link.
(Maximizing PageRank Feedback and minimizing PageRank leakage).
3. The internal navigational structure and linkage of your pages, in order to best distribute PageRank
within your site.
3.1 Links To Your Site
When looking for links to your site, from a purely PageRank point of view, one might think you
should simply look for pages that have the highest Toolbar PageRank. However, this way of thinking
is incorrect. As more and more people try to get links from only high PageRank sites, it becomes less
and less of a winning proposition.
The actual PageRank from an individual page is shared out amongst the links on that page. So, a link
from a page with a PageRank of 4 might be better than a link from a page with a PageRank of 6 if
there are fewer total links on the PR 4 page.
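A quick calculation makes the point concrete. The link counts below (5 links on the PR 4 page, 30 on the PR 6 page) are hypothetical, the function name link_value is chosen for this sketch, and the PageRank values are treated as plain numbers, as in the text above.

```python
def link_value(page_rank, total_links, d=0.85):
    """PageRank passed through one link: the damped, evenly split share."""
    return d * page_rank / total_links


# Hypothetical PR 4 page carrying 5 links in total.
from_pr4 = link_value(4, total_links=5)

# Hypothetical PR 6 page carrying 30 links in total.
from_pr6 = link_value(6, total_links=30)
```

With these assumed counts the link from the PR 4 page passes on about four times as much PageRank as the link from the PR 6 page.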
3.2 Links Out Of Your Site
When considering links out of your site there is one golden rule:
Generally, you will want to keep PageRank within your own site.
This does not mean that you will lose PageRank from your site by linking out, but that the total
PageRank within it may be lower than it could have been had you not linked out.
The best outbound link PageRank scenario occurs when the outbound link comes from a page that has
both:
a) A low PageRank
b) A lot of links to pages on your site.
How can this best be achieved? One way would be by writing reviews of the sites we link out to on a
separate page of our site, and by providing a link to those reviews along with each hyperlink to the
external site. We will have to make sure that the review page also links back to a page in our own site
that is high up its structure. (It’s best if this is our home page, but any important page will do.) By
doing this, we’ve significantly reduced the amount of PageRank we’ve let out of our site. We’ve
targeted the distribution of PageRank to the home page to ensure that less is passed back through our
links page (which would be a wasted opportunity), and more is put elsewhere in our site. This concept
can be illustrated diagrammatically as follows:
If we perform the same calculations but include the review pages, here’s what we get:
Without the Review Pages:
The HomePage PR is: 0.9536152797
B,C,D PR is: 0.4201909959
Total is: 2.2141882674
With the Review Pages:
The HomePage PR is: 2.439718935
B,C,D PR is: 0.8412536982
Total is: 4.9634800296
Thus it can be clearly observed from the above results that adding review pages to the link
structure helps to improve the site's PR by keeping more of the PR within the site.
3.3 Internal Structures And Linkages
To get a high PageRank, it is not enough to have tens of thousands of pages. Those pages must also be
in Google’s index. To achieve this they must contain enough content for Google’s algorithms to
consider them worthy of being added to the index. As you develop content for your site, you are also
creating more PageRank for your site. It’s hard work, and you’re creating it slowly – but if you’re also
creating pages that people will want to link to, then you’re doing yourself two favors at once:
PageRank from both directions. Or, to put it in very basic terms: The best “internal” thing that can be
done to build PageRank is to write lots of good content. Ensure that pages aren’t overly short or overly
long and break the content into several pages where necessary. There are three different ways in which
pages can be interlinked within a site:
Hierarchical
Total PageRank in this site is 4
Looping
Total PageRank in this site is 4
Extensive Interlinking
Total PageRank in this site is 4
The maximum benefit from PageRank is derived when this methodology is applied towards pages that
want to rank high for highly competitive keyword phrases, or towards pages that must compete on a
large number of keyword phrases. The Extensive Interlinking strategy retains the most PageRank
within the site. Following this is the Hierarchical strategy and lastly the Looping strategy.
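The totals of 4 quoted for the three structures can be checked with a Python sketch. The exact link layouts below are assumptions, since the original diagrams are not available: a home page linked both ways with three child pages, a one-way loop through four pages, and four pages that all link to each other. None of them includes links to or from other sites.

```python
def pagerank(links, d=0.85, iterations=50, start=1.0):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over every page.

    links maps each page to the list of pages it links to.
    """
    pr = {page: start for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Shares received from every page that links to this one.
            shares = sum(pr[src] / len(out) for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * shares
        pr = new_pr
    return pr


pages = ["Home", "B", "C", "D"]

# Assumed layouts for the three interlinking styles.
structures = {
    "hierarchical": {"Home": ["B", "C", "D"], "B": ["Home"], "C": ["Home"], "D": ["Home"]},
    "looping": {"Home": ["B"], "B": ["C"], "C": ["D"], "D": ["Home"]},
    "extensive": {p: [q for q in pages if q != p] for p in pages},
}

# Total PageRank held by each structure.
totals = {name: sum(pagerank(links).values()) for name, links in structures.items()}
```

In each assumed layout every page has at least one outgoing link and nothing leaves the site, so the total stays at 4. The structures only start to differ in how much PageRank they retain once links out of the site are added, which this sketch does not model.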
4. Significance Of PageRank
Of course, important pages mean nothing to you if they don't match your query. So, Google combines
PageRank with sophisticated text-matching techniques to find pages that are both important and
relevant to your search. Google goes far beyond the number of times a term appears on a page and
examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's
a good match for your query. In this way, PageRank helps Google to bring the best-ranked
search results corresponding to your query.
A version of PageRank has recently been proposed as a replacement for the traditional ISI impact
factor. Instead of merely counting citations of a journal, the "quality" of a citation is determined in a
PageRank fashion.
A Web crawler may use PageRank as one of a number of importance metrics it uses to determine
which URL to visit next during a crawl of the web. One of the early working papers used in
the creation of Google is “Efficient crawling through URL ordering”, which discusses the use of a
number of different importance metrics to determine how deeply, and how much of, a site Google will
crawl. PageRank is presented as one of a number of these importance metrics, though there are others
listed such as the number of inbound and outbound links for a URL, and the distance from the root
directory on a site to the URL.
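The idea of choosing the next URL by an importance metric can be sketched with a priority queue in Python. This is only an illustration of the ordering idea, not the method from the paper mentioned above: the function name crawl_order, the example URLs and their scores are all hypothetical.

```python
import heapq


def crawl_order(importance):
    """Return URLs in the order a crawler would visit them, highest score first.

    importance maps each known URL to its importance score (for example, PageRank).
    """
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-score, url) for url, score in importance.items()]
    heapq.heapify(frontier)
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)
    return order


# Hypothetical URLs with made-up importance scores.
scores = {
    "http://example.com/about": 0.2,
    "http://example.com/": 1.5,
    "http://example.com/news": 0.7,
}
visit_order = crawl_order(scores)
```

A real crawler would keep adding newly discovered URLs to the queue and updating scores as it goes; the sketch only shows that the most important known URL is always taken next.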
5. Conclusion
To sum up, PageRank is a technique that has revolutionized searching for every web
surfer across the globe, putting the world's knowledge just a click away for anyone. It has
assisted in the unification of cultures and the sharing of ideas by providing an easier, faster and
more relevant way to answer a user's query. This technique, along with Google’s huge database of
information, has become a one-stop point for every user around the world with any type of
query.
6. References
1. David A. Vise and Mark Malseed, “The Google Story”, Delacorte Press.
2. Nancy Blachman, “The Google Guide”
3. http://www.google.com/support/librariancenter/Article 12-2005-1
4. http://www.google.com/technology/pigeonrank.html
5. http://infolab.stanford.edu/~backrub/google.html
6. Sergey Brin and Lawrence Page, “The Anatomy of a search engine”.
7. http://www.centaurhosting.com /Understanding Google's Page Rank System.htm
8. Chris Ridings and Mike Shishigin, “PageRank Uncovered”.
9. http://www.WebWorkshop.com/ Pagerank Explained_ Google's PageRank and how to make the
most of it.htm
10. Google Support Center, “Google Technology”.
11. Wikipedia, “How PageRank Works”.
12. Ian Rogers, “The Google Pagerank Algorithm and How It Works”.