2. Overview
OBJECTIVE
INTRODUCTION
PROBLEM STATEMENT
ARCHITECTURE OF WEB CRAWLER
APPROACHES FOR CRAWLING PROCESS
POLICIES USED
UTILITIES OF WEB CRAWLER
CONCLUSION
SCOPE FOR FUTURE
REFERENCES
3. Objective
The growth of Internet users and accessible web pages.
The web as a hypertext system.
Crawlers are among the most crucial components of search engines, and their optimization has a great effect on improving searching efficiency.
4. Introduction
Web crawlers are programs that exploit the graph structure of the web to
move from page to page.
A crawler browses the World Wide Web in a methodical, automated manner.
Search Engines:
Crawlers are their most crucial components.
They improve searching efficiency.
5. Literature survey
Literature survey paper 1
“Distributed Ontology-Driven Focused Crawling”
•Vertical search technologies.
•Focused crawling.
•Ontological structure.
The web crawler architecture uses URL scoring functions, a scheduler,
a DOM parser, and a page ranker to download web pages.
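The scoring-and-scheduling part of this architecture can be sketched as follows; the names (`score_url`, `Scheduler`) and the term-counting heuristic are illustrative assumptions, not taken from the paper.

```python
import heapq

def score_url(url, topic_terms):
    """Toy URL scoring function: count occurrences of topic terms in the URL."""
    return sum(url.lower().count(t) for t in topic_terms)

class Scheduler:
    """Priority queue of (negated score, url); the highest-scoring URL pops first."""
    def __init__(self):
        self._heap = []

    def add(self, url, score):
        heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1]

scheduler = Scheduler()
for url in ["http://example.com/sports/news",
            "http://example.com/ontology/crawler"]:
    scheduler.add(url, score_url(url, ["ontology", "crawler"]))

print(scheduler.next_url())  # the ontology/crawler URL scores higher
```

In a full crawler, the downloaded page would then go to the DOM parser and page ranker; those stages are omitted here.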
6. Literature survey paper 2
“Efficient Focused Crawling based on Best First Search”
•Seek out pages that are relevant to given keywords.
•A focused crawler analyzes the links that are likely to be most
relevant.
•A crawler using the "best-first" search strategy is identified as a "focused crawler".
A focused crawler has two main components:
(i) one to find specific web pages, and
(ii) one to proceed from the seed pages.
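Best-first focused crawling can be sketched as below: the frontier is a priority queue ordered by estimated relevance, so the "best" link is expanded first. The tiny in-memory `WEB` dictionary is a stand-in for real HTTP fetches, and the keyword-count relevance measure is an assumption for illustration.

```python
import heapq

WEB = {
    "seed": {"text": "crawler overview", "links": ["a", "b"]},
    "a":    {"text": "focused crawler ontology crawler", "links": []},
    "b":    {"text": "cooking recipes", "links": []},
}

def relevance(text, keywords):
    """Toy relevance estimate: count keyword occurrences in the page text."""
    return sum(text.count(k) for k in keywords)

def best_first_crawl(seeds, keywords, limit=10):
    frontier, seen, order = [], set(), []
    for s in seeds:
        heapq.heappush(frontier, (0, s))          # seed pages start the crawl
    while frontier and len(order) < limit:
        _, page = heapq.heappop(frontier)         # most relevant page first
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        for link in WEB[page]["links"]:
            score = relevance(WEB[link]["text"], keywords)
            heapq.heappush(frontier, (-score, link))
    return order

print(best_first_crawl(["seed"], ["crawler"]))  # visits "a" before "b"
```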
7. Literature survey paper 3
“Design of an Ontology based Adaptive Crawler for
Hidden Web”.
•Deep web/ invisible web / hidden web.
•Accessing deep web using ontology.
•Download relevant hidden web pages.
8. Literature survey paper 4
“URL Rule Based Focused Crawlers.”
• Use of URL regular expressions.
• Retrieving Topic-specific Pages.
To search for topic-specific information, the crawler needs to crawl only
a small part of the data and therefore uses fewer server resources.
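A minimal sketch of such URL-rule-based filtering, assuming hypothetical topic patterns (`/sports/`, `/football/\d+`): only URLs matching a rule are queued, so the crawler touches a small part of the site.

```python
import re

# Illustrative topic rules; a real crawler would load these from configuration.
TOPIC_RULES = [re.compile(p) for p in (r"/sports/", r"/football/\d+")]

def matches_topic(url):
    """Return True if the URL matches any topic-specific rule."""
    return any(rule.search(url) for rule in TOPIC_RULES)

urls = ["http://example.com/sports/news",
        "http://example.com/about",
        "http://example.com/football/42"]
print([u for u in urls if matches_topic(u)])
```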
9. Literature survey paper 5
“A Topic-Specific Web Crawler with Web Page
Hierarchy Based on HTML Dom-Tree.”
•Representation of data in a hierarchical DOM tree.
•The DOM tree is a structural representation of HTML pages.
•Uses the concept of ontology.
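Building such a hierarchical DOM tree can be sketched with the standard library's `html.parser`; each node keeps its tag and children, mirroring the structural representation the paper relies on (the dictionary-based node format is an assumption for illustration).

```python
from html.parser import HTMLParser

class DomTreeBuilder(HTMLParser):
    """Build a nested dict tree of {tag, children} nodes from an HTML page."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self.stack = [self.root]        # path from the root to the open tag

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

builder = DomTreeBuilder()
builder.feed("<html><body><h1>Topic</h1><p>text</p></body></html>")
print(builder.root["children"][0]["tag"])  # "html"
```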
10. Problem statement
The most prominent challenge with current web crawlers is the
selection of important pages for downloading:
A crawler cannot download all pages from the web.
It is therefore important for the crawler
"to select the pages and to visit "important" pages first by
prioritizing the URLs in the queue properly."
This minimizes the load on the crawled websites, together with
parallelization of the crawling process.
12. Approaches for Crawling process
Broadly, there are two types of crawlers:
A priori
Follow a predefined path.
Not a priori
Do not follow a specific path.
13. Policies Used
A selection policy that states which pages to download.
A politeness policy that states how to avoid overloading
web sites.
A parallelization policy that states how to coordinate a
distributed web crawl.
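The politeness policy can be sketched as a minimum delay between successive requests to the same host; the `PolitenessGate` class and the fixed per-host delay are assumptions for illustration, not the paper's mechanism.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between requests to the same host."""
    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_hit = {}              # host -> timestamp of last request

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_hit.get(host, 0.0) + self.delay
        if now < earliest:
            time.sleep(earliest - now)  # sleep only if we are too early
        self.last_hit[host] = time.monotonic()

gate = PolitenessGate(delay_seconds=0.1)
gate.wait("http://example.com/a")   # first hit on this host: no wait
gate.wait("http://example.com/b")   # same host again: sleeps ~0.1 s
```

Distinct hosts are tracked independently, so a parallelized crawl can stay busy on other sites while one host's delay elapses.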
14. Utilities of Web Crawler
Gather pages from the Web.
Support a search engine.
Perform data mining.
Improve web sites (web site analysis).
15. Conclusion
The number of extracted documents is reduced: links are
analyzed, and a great deal of irrelevant web pages are deleted.
Crawling time is reduced, and once the irrelevant pages are
deleted, the crawling load is reduced as well.