Search Engine and Web Crawler
Abstract
The World Wide Web is a rapidly growing and changing information source. Due to the dynamic nature of the Web, it becomes harder to find relevant and recent information.
Search engines are the primary gateways of information access on the Web. Today, search engines have become a necessity for most people in day-to-day life, whether for navigating the Internet or for finding anything at all. Search engines answer millions of queries every day. Whatever comes to mind, we just enter a keyword or a combination of keywords to trigger the search and get relevant results in seconds, without knowing the technology behind it. I searched for "search engine" and it returned 68,900 results. In addition, the engine returned some sponsored results along the side of the page, as well as some spelling suggestions. All in 0.36 seconds. For popular queries the engine is even faster: for example, searches for World Cup or dance shows (both recent events) took less than 0.2 seconds each.
To engineer a search engine is a challenging task. Web crawler is an
indispensable part of search engine. A web crawler is a program that, given
one or more seed URLs, downloads the web pages associated with these
URLs, extracts any hyperlinks contained in them, and recursively continues
to download the web pages identified by these hyperlinks. Web crawlers are
an important component of web search engines, where they are used to
collect the corpus of web pages indexed by the search engine. Moreover,
they are used in many other applications that process large numbers of web
pages, such as web data mining, comparison shopping engines, and so on.
Introduction to Search Engine
A search engine is a tool that allows people to find information on the World Wide Web. It is a website you can use to look up web pages, like the yellow pages for the Internet. More formally, a web search engine is a software system designed to search for information on the World Wide Web.
Assume you are reading a book and want to find references to a specific word in it. What do you do? You turn to the end of the book and look in the index! You locate the word in the index, find the page numbers listed there, and flip to the corresponding pages.
Search engines work in a similar way.
Figure 1: Telephone directory
Search engines are constantly building and updating their index of the World Wide Web. They do this by using "spiders" that "crawl" the web and fetch web pages. The words used in these web pages are then added to the index along with a record of where the words came from. [1]
How It Works
A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching
Web search engines work by storing information about many web pages. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated program that follows every link it encounters on a site. The search engine then analyzes the contents of each page to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields called meta tags).
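The indexing step described above can be sketched as a tiny inverted index that maps each word to the set of page URLs containing it. This is an illustrative sketch only; the page contents and URLs below are hypothetical stand-ins for text a crawler would actually fetch.

```java
import java.util.*;

// Minimal inverted-index sketch: each word maps to the set of
// page URLs in which it appears.
public class InvertedIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Tokenize the page text and record each word against the page URL.
    public void addPage(String url, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
            }
        }
    }

    // Look up the pages that contain a given word.
    public Set<String> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addPage("http://a.example", "web crawler basics");
        idx.addPage("http://b.example", "search engine and web index");
        System.out.println(idx.lookup("web"));     // both pages contain "web"
        System.out.println(idx.lookup("crawler")); // only the first page
    }
}
```

Answering a query then reduces to looking up the query words in this map, which is why an index, like the index of a book, is so much faster than scanning every page.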
Figure 2: Working flow of a search engine
When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data.
From 2007 the Google.com search engine has allowed one to search by date
by clicking 'Show search tools' in the leftmost column of the initial search
results page, and then selecting the desired date range.
Most search engines support the use of the boolean operators AND, OR and
NOT to further specify the search query. Boolean operators are for literal
searches that allow the user to refine and extend the terms of the search.
As well, natural language queries allow the user to type a question in the same form one would ask it of a human; ask.com is an example of such a site.
The usefulness of a search engine depends on the relevance of the result set
it gives back. While there may be millions of web pages that include a
particular word or phrase, some pages may be more relevant, popular, or
authoritative than others. Most search engines employ methods to rank the
results to provide the "best" results first.
Search engines that do not accept money for their search results make
money by running search related ads alongside the regular search engine
results. The search engines make money every time someone clicks on one
of these ads. [2]
Major Search Engines - A Comparison
Today there are many search engines available to web searchers. What makes one search engine different from another? The following are some important measures. [3]
The contents of the search engine's database are a crucial factor in determining whether or not we will succeed in finding the information we need, because when we search, we are not actually searching the Web directly. Rather, we are searching a cache of the web, a database that contains information about all the Web sites visited by that search engine's spider or crawler.
Size is also an important measure. How many Web pages has the spider visited, scanned, and stored in the database? Some of the larger search engines have databases covering over three billion Web pages, while the databases of smaller search engines cover half a billion or fewer.
Another important measure is how up to date the database is. As we know, the Web is continuously changing and growing. New websites appear, old sites vanish, and existing sites modify their content. So the information stored in the database will become out of date unless the search engine's spider keeps up with these changes.
In addition, the ranking algorithm used by the search engine determines whether or not the most relevant search results appear towards the top of the results list.
Figure 3: Google logo
Google has been in the search game a long time and holds the largest market share among search engines (about 81%) [3].
1) Its Web crawler-based service provides both comprehensive coverage of the Web and great relevancy.
2) Google is much better than the other engines at determining whether a link is an artificial link or a true editorial link.
3) Google gives much importance to sites that add fresh content on a regular basis. This is why Google likes blogs, especially popular ones.
4) Google prefers informational pages to commercial sites.
5) A page on a site, or on a subdomain of a site, with significant age or link weight can rank much better than it otherwise would, even with no external citations.
6) It has aggressive duplicate content filters that filter out many pages with similar content.
7) Crawl depth is determined not only by link quantity, but also by link quality. Excessive low-quality links may make your site less likely to be crawled deeply, or even included in the index.
8) In addition, we can search for twelve different file formats, cached pages, images, news, and Usenet group postings.
Figure 4: Yahoo logo
Yahoo has been in the search game for many years [3].
1) It holds the second-largest share of the search engine market (about 12%).
2) When it comes to counting backlinks, Yahoo is the most accurate search engine.
3) Yahoo is better than MSN, but not as good as Google, at determining whether a link is artificial or natural.
4) The crawl rate of Yahoo's spiders is at least 3 times faster than that of Google's spiders.
5) Yahoo tends to prefer commercial pages to informational pages, compared with Google.
6) The Yahoo search engine gives "exact matching" more importance than "concept matching", which makes it slightly more susceptible to spamming.
7) Yahoo gives more importance to meta keywords and description tags.
Figure 5: MSN logo
1) MSN has a share of 3% of the total search engine market [3].
2) MSN Search uses its own Web database and also has separate News, Images, and Local databases.
3) Its strengths include its large unique database, its query-building "Search Builder" and Boolean searching, cached copies of Web pages including the date cached, and automatic local search options.
4) Its spider crawls only the beginning of each page (as opposed to the other two search engines, which crawl the entire content), and the number of pages found in its index or database is extremely low.
5) It is bad at determining whether a link is natural or artificial in nature.
6) Because of this weakness in link analysis, it places too much weight on the page content.
7) New sites that are generally untrusted in other systems can rank quickly in MSN Search. But this also makes it more susceptible to spam.
8) Another downside of this search engine is its habit of supplying results based on geo-targeting, which makes it extremely hard to determine whether the results we see are the same ones everybody sees.
Figure 6: Ask Jeeves logo
1) The Ask search engine has the lowest share (about 1%) of the total search engine market [3].
2) Ask is a topical search site. It gives more importance to sites that are linked to topical communities.
3) Ask is more susceptible to spamming.
4) Since Ask is smaller and more specialized than the other search engines, it is wise to approach it more from a networking or marketing perspective.
Figure 7: Live Search logo
1) Launched in September 2006.
2) Live Search (formerly Windows Live Search) is the name of Microsoft's web search engine, the successor to MSN Search, designed to compete with the industry leaders Google and Yahoo.
3) It also allows the user to save searches and see them updated automatically on Live.com.
Figure 8: Bing logo
1) Launched in July 2009 by Microsoft, building on MSN Search.
2) Features like 'wiki' suggestions, 'visual search', and 'related searches' can be very useful.
Introduction to Web Crawler
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. It is also known as a "spider" or a "bot" (short for "robot").
Spider – a program that, like a browser, downloads web pages.
Crawler – a program that automatically follows the links of web pages.
Robot – an automated computer program that can visit websites, guided by search engine algorithms. It can combine the tasks of a crawler and a spider, helping to index web pages for the search engines. [4]
Why Crawlers?
Figure 9: Result of searching for the term "web crawler" in Google
Crawling means gathering pages from the Internet in order to index them. It has two main objectives:
• fast gathering
• efficient gathering [5]
The Internet has a wide expanse of information, and finding relevant information requires an efficient mechanism. Web crawlers provide that capability to the search engine.
Features
Features a crawler must provide:
Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
Features a crawler should provide:
Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
Performance and efficiency: The crawl system should make efficient use of various system resources, including processor, storage, and network bandwidth.
Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.
Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages.
Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular. [5]
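The politeness requirement above can be sketched as a per-host rate limiter that enforces a minimum delay between successive requests to the same server. This is an illustrative sketch under stated assumptions: the delay value is arbitrary, and a real crawler would also consult the site's robots.txt file.

```java
import java.util.*;

// Sketch of a politeness policy: allow a fetch from a host only if
// at least minDelayMillis have passed since the last fetch from it.
public class PolitenessPolicy {
    private final long minDelayMillis;
    private final Map<String, Long> lastVisit = new HashMap<>();

    public PolitenessPolicy(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Returns true if the host may be fetched now, and records the visit.
    public synchronized boolean mayFetch(String host, long nowMillis) {
        Long last = lastVisit.get(host);
        if (last != null && nowMillis - last < minDelayMillis) {
            return false; // too soon: respect the politeness interval
        }
        lastVisit.put(host, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        PolitenessPolicy p = new PolitenessPolicy(1000); // 1 s between hits
        System.out.println(p.mayFetch("example.com", 0));    // true
        System.out.println(p.mayFetch("example.com", 500));  // false, too soon
        System.out.println(p.mayFetch("example.com", 1500)); // true again
    }
}
```

A crawler would consult such a policy before dequeuing a URL, requeuing URLs whose host is still in its politeness window.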
Architecture of Crawler
Flow of basic sequential crawler
Web crawlers are mainly used to index the links of all the visited
pages for later processing by a search engine. Such search engines
rely on massive collections of web pages that are acquired with the
help of web crawlers, which traverse the web by following hyperlinks
and storing downloaded pages in a large database that is later indexed
for efficient execution of user queries. Despite the numerous
applications for Web crawlers, at the core they are all fundamentally
the same. The following is the process by which Web crawlers work [6]:
1) Download the Web page.
2) Parse through the downloaded page and retrieve all the links.
3) For each link retrieved, repeat the process.
Figure 10 shows the flow of a basic sequential crawler. The crawler
maintains a list of unvisited URLs called the frontier.
The list is initialized with seed URLs which may be provided by a
user or another program. Each crawling loop involves picking the next
URL to crawl from the frontier, fetching the page corresponding to the
URL through HTTP, parsing the retrieved page to extract the URLs
and application specific information, and finally adding the unvisited
URLs to the frontier.
Before the URLs are added to the frontier they may be assigned a
score that represents the estimated benefit of visiting the page
corresponding to the URL. The crawling process may be terminated
when a certain number of pages have been crawled. If the crawler is
ready to crawl another page and the frontier is empty, the situation
signals a dead-end for the crawler. The crawler has no new page to
fetch and hence it stops. [6]
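The crawling loop described above can be sketched as follows. Since a real fetch would require the network, page fetching and link extraction are simulated here with a small fixed link graph; the graph, URLs, and page limit are hypothetical values chosen for illustration.

```java
import java.util.*;

// Sketch of the basic sequential crawl loop: a frontier of unvisited
// URLs is seeded, then consumed until it is empty or a page limit
// is reached.
public class SequentialCrawler {
    // Hypothetical link graph standing in for the live Web.
    static final Map<String, List<String>> WEB = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("c"),
        "c", List.of("a"));

    public static List<String> crawl(String seed, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        Set<String> visited = new LinkedHashSet<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();        // pick next URL to crawl
            if (!visited.add(url)) continue;     // skip already-crawled URLs
            // "Fetch" the page and extract its links into the frontier.
            for (String link : WEB.getOrDefault(url, List.of())) {
                if (!visited.contains(link)) frontier.add(link);
            }
        }
        return new ArrayList<>(visited);         // the crawl order
    }

    public static void main(String[] args) {
        System.out.println(crawl("a", 10)); // [a, b, c]
    }
}
```

The loop terminates exactly as the text describes: either the page limit is hit, or the frontier runs dry and the crawler has nothing left to fetch.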
Figure 10: Flow of a basic sequential crawler
The multi-threaded crawler model needs to deal with an empty
frontier just like a sequential crawler [6].
Figure 11: A multi-threaded crawler model
High level architecture
Here, the multi-threaded downloader downloads web pages from the WWW, and parsers decompose the web pages into URLs, contents, titles, etc. The URLs are queued and sent to the downloader using a scheduling algorithm. The downloaded data are stored in a database [7].
Figure 12: High-level architecture of a web crawler
The design of the downloader scheduling algorithm is crucial: too many downloader objects will exhaust system resources and make the system slow, while too few downloaders will degrade performance. The scheduling algorithm is as follows [7]:
1) The system allocates a pre-defined number of downloader objects.
2) The user inputs a new URL to start the crawler.
3) If any downloader is busy and there are new URLs to be processed, then a check is made to see if any downloader object is free. If so, assign a new URL to it and set its status as busy; else go to 6.
4) After a downloader object downloads the contents of a web page, set its status as free.
5) If any downloader object runs longer than an upper time limit, abort it and set its status as free.
6) If there are more than the predefined number of downloaders, or if all the downloader objects are busy, then allocate new threads and distribute the downloaders among them.
7) Continue allocating new threads and free threads to the downloaders until the number of downloaders becomes less than the threshold value, provided the number of threads being used is kept under a limit.
8) Go to 3.
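The core of the scheduler above, a bounded pool of downloader threads where any download exceeding an upper time limit is aborted (step 5), can be sketched with Java's standard concurrency utilities. The fetch is simulated with a sleep, and the pool size and timeout values are illustrative assumptions, not values from the original system.

```java
import java.util.concurrent.*;

// Sketch of the downloader scheduler: a fixed pool of downloader
// threads, with each download cancelled if it exceeds a time limit.
public class DownloaderScheduler {
    // Stand-in for the real HTTP fetch; workMillis simulates page size.
    static String download(String url, long workMillis) throws InterruptedException {
        Thread.sleep(workMillis);
        return "contents of " + url;
    }

    // Submit a download to the pool and wait at most timeoutMillis for it.
    public static String fetchWithTimeout(ExecutorService pool, String url,
                                          long workMillis, long timeoutMillis) {
        Future<String> task = pool.submit(() -> download(url, workMillis));
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            task.cancel(true); // step 5: abort a long-running downloader
            return null;
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Pre-defined number of downloader objects (step 1), chosen arbitrarily.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        System.out.println(fetchWithTimeout(pool, "http://a.example", 10, 500));
        System.out.println(fetchWithTimeout(pool, "http://slow.example", 2000, 100)); // aborted
        pool.shutdown();
    }
}
```

The fixed-size pool caps resource use exactly as the text warns: too many downloaders exhaust resources, too few leave the crawl underutilized.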
Crawling Strategies
There are mainly four types of crawling strategies, as below [8]:
1) Breadth-First Crawling
Figure 13: Breadth-first crawling
This algorithm starts at the root URL and searches all the neighbouring URLs at the same level. If the goal is reached, success is reported and the search terminates. If not, the search proceeds down to the next level, sweeping across the neighbouring URLs at that level, and so on until the goal is reached. When all the URLs have been searched but the objective is not met, failure is reported.
2) Depth-First Crawling
Figure 14: Depth-first crawling
This algorithm starts at the root URL and traverses deeper through the child URLs. If there is more than one child, priority is given to the left-most child, and traversal goes deep until no more children are available. The crawler then backtracks to the next unvisited node and continues in a similar manner.
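The two strategies above differ only in how the frontier is consumed: breadth-first treats it as a FIFO queue, depth-first as a LIFO stack. The sketch below shows both orders on a small hypothetical link graph (the graph and URL names are made up for illustration).

```java
import java.util.*;

// Sketch contrasting breadth-first and depth-first crawl order:
// the same frontier structure, consumed from opposite ends.
public class CrawlOrder {
    // Hypothetical link graph: root links to a and b, etc.
    static final Map<String, List<String>> WEB = Map.of(
        "root", List.of("a", "b"),
        "a", List.of("a1", "a2"),
        "b", List.of("b1"));

    static List<String> traverse(String seed, boolean breadthFirst) {
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>(List.of(seed));
        while (!frontier.isEmpty()) {
            // FIFO (queue) for breadth-first, LIFO (stack) for depth-first.
            String url = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
            order.add(url);
            List<String> links = new ArrayList<>(WEB.getOrDefault(url, List.of()));
            // For depth-first, push children reversed so the left-most
            // child is popped first, as described above.
            if (!breadthFirst) Collections.reverse(links);
            for (String link : links) {
                if (seen.add(link)) frontier.addLast(link);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(traverse("root", true));  // [root, a, b, a1, a2, b1]
        System.out.println(traverse("root", false)); // [root, a, a1, a2, b, b1]
    }
}
```

Breadth-first sweeps level by level, while depth-first exhausts one branch before backtracking, matching the two descriptions above.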
3) Repetitive Crawling
Once pages have been crawled, some systems require the process to be repeated periodically so that the indexes are kept up to date. This may be achieved by launching a second crawl in parallel; to keep results current, the "index list" should be constantly updated.
4) Targeted Crawling
Here the main objectiveive is to retrieve the greatest number of pages relating to a particular subject while using the minimum bandwidth. Most search engines use heuristics in the crawling process in order to target certain types of pages on specific topics.
Crawling Policies
The characteristics of the Web that make crawling difficult are:
1) its large volume, and
2) its fast rate of change.
To deal with these difficulties, a web crawler follows these policies [5]:
A Selection Policy that states which pages to download.
A Re-Visit Policy that states when to check for changes in pages.
A Politeness Policy that states how to avoid overloading web sites.
A Parallelization Policy that states how to coordinate distributed web crawlers.
Implementation
I have developed a Web crawler application in Java that works on the Windows operating system. It can be run using NetBeans or any Java-compatible IDE. For database connectivity it uses a MySQL/WAMP server interface. The proposed web crawler uses breadth-first crawling to search the links, and is deployed on a client machine.
Once we start the IDE and run the program, an automated browsing process is initiated. The HTML contents of the rediffmail.com homepage are given to the parser. The parser puts them into a suitable format as described above, and the URLs in the HTML page are listed and stored in the frontier. The URLs are picked up from the frontier and each URL is assigned to a downloader. The status of each downloader, whether busy or free, can be queried. After a page is downloaded it is added to the database, and the particular downloader is then set as free (i.e. released). The implementation details are given in Table 1.
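The parsing step mentioned above, extracting the URLs from a page's HTML so they can be stored in the frontier, can be sketched as follows. This is not the application's actual parser: a regular expression is enough for illustration, though a real parser handles malformed HTML far more robustly.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of link extraction: pull the URLs out of href="..." attributes.
public class LinkExtractor {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the URL inside href="..."
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://a.example\">A</a> "
                    + "<a HREF=\"http://b.example/page\">B</a>";
        System.out.println(extractLinks(page)); // [http://a.example, http://b.example/page]
    }
}
```

Each extracted URL would then be checked for validity and, if unvisited, appended to the frontier.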
Figure 15: Main program of the web crawler application
Figure 16: Output in the IDE
Table 1: Functionality of the web crawler application on the client machine.
Feature                                           Support
Search for a search string                        Yes
Help manual                                       No
Integration with other applications               Yes
Specifying case sensitivity for a search string   No
Specifying start URL                              Yes
Support for breadth-first crawling                Yes
Check for validity of the URL specified           Yes
Figure 17: Webpage content in the database
Conclusion
The Web crawler forms the backbone of applications that facilitate Web information retrieval. In this report I have presented the architecture and implementation details of my crawling system, which can be deployed on a client machine to browse the web concurrently and autonomously. It combines the simplicity of an asynchronous downloader with the advantage of using multiple threads. It reduces resource consumption, since it is not implemented on mainframe servers as other crawlers are, which also reduces server management. The proposed architecture uses the available resources efficiently to accomplish the task otherwise done by high-cost mainframe servers.
A major open issue for future work is a detailed study of how the system could become even more distributed while retaining the quality of the content of the crawled pages. Due to the dynamic nature of the Web, the average freshness or quality of the pages downloaded needs to be checked; the crawler can be enhanced to check this, to detect links written in JavaScript or VBScript, and to support file formats such as XML, RTF, PDF, Microsoft Word, and Microsoft PowerPoint.
References
[1] "Basic search handout", www.digitallearn.org
[2] "Web search engine", www.wikipedia.org
[3] Krishan Kant Lavania, Sapna Jain, Madhur Kumar Gupta, and Nicy Sharma, "Google: A Case Study (Web Searching and Crawling)", International Journal of Computer Theory and Engineering, Vol. 5, No. 2, April 2013.
[4] "Web crawler", www.wikipedia.org
[5] "Web crawling and indexes", online edition, Cambridge University Press, April 1, 2009.
[6] G. Pant, P. Srinivasan, and F. Menczer, "Crawling the Web".
[7] Rajashree Shettar and Dr. Shobha G, "Web Crawler on Client Machine", IMECS 2008, Vol. II, 19-21 March 2008, Hong Kong.
[8] Rashmi Janbandhu, Prashant Dahiwale, and M. M. Raghuwanshi, "Analysis of Web Crawling Algorithms", International Journal on Recent and Innovation Trends in Computing and Communication, ISSN: 2321-8169, Volume 2, Issue 3.