Review of "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
1. KOSURU SAI MALLESWAR; SC09B093; SEM-6.
Sergey Brin and Lawrence Page designed 'Google' to build a search engine that can crawl and
index the web quickly and efficiently, and that can cope with large, uncontrolled hypertext
collections. One of the main goals was to improve the quality and scalability of search. Another
was to set up a system that supports novel research on large-scale web data, one that a
reasonable number of people can actually use for their academic research.
Google makes efficient use of storage space to store the index, which allows search quality to
scale with the size of the web as it grows. Its data structures are optimized for fast, efficient
access. To achieve high precision, Google uses the link structure of the web to compute a quality
ranking for each page, called PageRank. A page's PageRank is the probability that a 'random
surfer' visits it. The ranking also involves a damping factor: the probability that, at each page,
the random surfer gets bored and requests another random page. PageRank allows for
personalization and makes it nearly impossible to deliberately mislead the system into giving a
higher ranking. The text of a link is associated both with the page the link is on and with the
page the link points to. This idea of anchor-text propagation improves search quality, but using
it efficiently was a challenge because of the heavy data processing involved. Along with
PageRank, Google keeps track of the location of all hits and some visual presentation details,
and stores the full raw HTML of pages in a repository.
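The random-surfer model described above can be sketched as a small power iteration. This is only an illustration: the three-page graph, the damping factor of 0.85, and the iteration count are assumptions for the example, not parameters from the paper.

```python
# Minimal PageRank sketch via power iteration (illustrative, not Google's code).
# `links` maps each page to the pages it links to; `d` is the damping factor:
# the probability the random surfer follows a link rather than jumping at random.
# Assumes every page has at least one outgoing link (no dangling nodes).

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}  # random-jump share for every page
        for p, outs in links.items():
            share = d * rank[p] / len(outs)      # surfer follows one outgoing link
            for q in outs:
                new[q] += share
        rank = new
    return rank

# Hypothetical three-page web: A links to B and C; B and C link back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})
```

Since both B and C link only to A, A accumulates the highest rank, while B and C end up equal by symmetry; the ranks always sum to 1.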
Most of Google's architecture is implemented in C or C++ for efficiency and runs on either
Solaris or Linux. Google's data structures include BigFiles, a document index, a lexicon,
forward and inverted indexes, and a huge repository; they are optimized for cost by avoiding
disk seeks whenever possible. Google has a fast distributed crawling system in which the URL
server and the crawlers are implemented in Python. Each crawler maintains its own DNS cache
to reduce the number of DNS lookups, and uses asynchronous IO and a number of queues. The
steps involved in indexing are parsing, indexing documents into barrels using multiple indexers
running in parallel, and sorting. Google's ranking system is designed so that no particular factor
can have too much influence. The dot product of the vector of count-weights with the vector of
type-weights gives an IR score for the document. Finally, the IR score is combined with
PageRank to produce the document's final rank. For multi-word searches, Google uses a more
complex algorithm. Google also considers feedback from trusted users when updating the ranks
of webpages.
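The dot-product ranking described above can be sketched as follows. The specific type-weight values, the count-weight cap, and the way the IR score is combined with PageRank are all assumptions for illustration; the paper does not publish its actual weights or combination function.

```python
# Illustrative single-word IR-score sketch; the numeric weights below are
# assumptions, not values from the paper.

# Hit types roughly as in the paper: title, anchor, URL, and plain text,
# each with its own type-weight.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}

def count_weight(count, cap=8):
    # Count-weights grow with occurrences but are capped, so sheer
    # repetition of a term cannot dominate the score.
    return min(count, cap)

def ir_score(hit_counts):
    # Dot product of the count-weight vector with the type-weight vector.
    return sum(count_weight(c) * TYPE_WEIGHTS[t] for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank):
    # The paper combines the IR score with PageRank; a weighted sum is one
    # plausible (hypothetical) choice.
    return ir_score(hit_counts) + 10.0 * pagerank

# One title hit and three plain-text hits, on a page with PageRank 0.15.
score = final_rank({"title": 1, "plain": 3}, pagerank=0.15)
```

The cap on count-weights reflects the design goal stated above: no single factor, not even raw term frequency, should have too much influence on the final rank.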
Google can produce better results than the major commercial search engines for most searches.
Google has evolved to overcome a number of bottlenecks in CPU, memory access, memory
capacity, disk seeks, disk throughput, disk capacity, and network IO during various operations.
Thanks to the efficient crawling and indexing Google performs, the index can be kept up to date
and major changes can be tested relatively quickly. Google does not yet have optimizations such
as query caching or sub-indices on common terms. The authors intended to speed Google up
considerably through distribution and through hardware, software, and algorithmic
improvements. They wished to make Google a high-quality search tool for searchers and
researchers all around the world, sparking the next generation of search engine technology.