2. Overview
OBJECTIVE
INTRODUCTION
PROBLEM STATEMENT
ARCHITECTURE OF WEB CRAWLER
APPROACHES FOR CRAWLING PROCESS
POLICIES USED
UTILITIES OF WEB CRAWLER
CONCLUSION
SCOPE FOR FUTURE
REFERENCES
3. Objective
The growth of Internet users and accessible web pages.
The web as a hypertext system.
Crawlers are among the most crucial components of search engines, and their optimization has a great effect on improving searching efficiency.
4. Introduction
Web crawlers are programs that exploit the graph structure of the web to
move from page to page.
A crawler browses the World Wide Web in a methodical, automated manner.
Search Engines:
Crawlers are their most crucial components.
They improve searching efficiency.
5. Literature survey
Literature survey paper 1
“Distributed Ontology-Driven Focused Crawling”
•Vertical search technologies.
•Focused crawling.
•Ontological structure.
The web crawler architecture uses URL scoring functions, a scheduler,
a DOM parser, and a page ranker to download web pages.
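The scoring-and-scheduling part of this architecture can be sketched as follows; the names (`score_url`, `Scheduler`) and the term-counting heuristic are illustrative assumptions, not taken from the paper.

```python
import heapq

def score_url(url, topic_terms):
    """Toy URL scoring function: count occurrences of topic terms in the URL."""
    return sum(url.lower().count(t) for t in topic_terms)

class Scheduler:
    """Priority queue of (negated score, url); the highest-scoring URL pops first."""
    def __init__(self):
        self._heap = []

    def add(self, url, score):
        heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1]

scheduler = Scheduler()
for url in ["http://example.com/sports/news",
            "http://example.com/ontology/crawler"]:
    scheduler.add(url, score_url(url, ["ontology", "crawler"]))

print(scheduler.next_url())  # the ontology/crawler URL scores higher
```

In a full crawler, the downloaded page would then go to the DOM parser and page ranker; those stages are omitted here.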
6. Literature survey paper 2
“Efficient Focused Crawling based on Best First Search”
•Seek out pages that are relevant to given keywords.
•A focused crawler analyzes the links that are likely to be most
relevant.
•A crawler using the "best-first" search strategy is identified as a "focused crawler".
A focused crawler has two main components:
(i) one to find specific web pages, and
(ii) one to proceed from the seed pages.
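Best-first focused crawling can be sketched as below: the frontier is a priority queue ordered by estimated relevance, so the "best" link is expanded first. The tiny in-memory `WEB` dictionary is a stand-in for real HTTP fetches, and the keyword-count relevance measure is an assumption for illustration.

```python
import heapq

WEB = {
    "seed": {"text": "crawler overview", "links": ["a", "b"]},
    "a":    {"text": "focused crawler ontology crawler", "links": []},
    "b":    {"text": "cooking recipes", "links": []},
}

def relevance(text, keywords):
    """Toy relevance estimate: count keyword occurrences in the page text."""
    return sum(text.count(k) for k in keywords)

def best_first_crawl(seeds, keywords, limit=10):
    frontier, seen, order = [], set(), []
    for s in seeds:
        heapq.heappush(frontier, (0, s))          # seed pages start the crawl
    while frontier and len(order) < limit:
        _, page = heapq.heappop(frontier)         # most relevant page first
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        for link in WEB[page]["links"]:
            score = relevance(WEB[link]["text"], keywords)
            heapq.heappush(frontier, (-score, link))
    return order

print(best_first_crawl(["seed"], ["crawler"]))  # visits "a" before "b"
```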
7. Literature survey paper 3
“Design of an Ontology based Adaptive Crawler for
Hidden Web”.
•Deep web/ invisible web / hidden web.
•Accessing deep web using ontology.
•Download relevant hidden web pages.
8. Literature survey paper 4
“URL Rule Based Focused Crawlers.”
• Use of URL regular expressions.
• Retrieving Topic-specific Pages.
To search for topic-specific information, the crawler needs to crawl only
a small part of the data and therefore uses fewer server resources.
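A minimal sketch of such URL-rule-based filtering, assuming hypothetical topic patterns (`/sports/`, `/football/\d+`): only URLs matching a rule are queued, so the crawler touches a small part of the site.

```python
import re

# Illustrative topic rules; a real crawler would load these from configuration.
TOPIC_RULES = [re.compile(p) for p in (r"/sports/", r"/football/\d+")]

def matches_topic(url):
    """Return True if the URL matches any topic-specific rule."""
    return any(rule.search(url) for rule in TOPIC_RULES)

urls = ["http://example.com/sports/news",
        "http://example.com/about",
        "http://example.com/football/42"]
print([u for u in urls if matches_topic(u)])
```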
9. Literature survey paper 5
“A Topic-Specific Web Crawler with Web Page
Hierarchy Based on HTML Dom-Tree.”
•Representation of data in a hierarchical DOM tree.
•The DOM tree is a structural representation of HTML pages.
•Uses the concept of ontology.
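Building such a hierarchical DOM tree can be sketched with the standard library's `html.parser`; each node keeps its tag and children, mirroring the structural representation the paper relies on (the dictionary-based node format is an assumption for illustration).

```python
from html.parser import HTMLParser

class DomTreeBuilder(HTMLParser):
    """Build a nested dict tree of {tag, children} nodes from an HTML page."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self.stack = [self.root]        # path from the root to the open tag

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

builder = DomTreeBuilder()
builder.feed("<html><body><h1>Topic</h1><p>text</p></body></html>")
print(builder.root["children"][0]["tag"])  # "html"
```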
10. Problem statement
The most prominent challenge with current web crawlers is the
selection of important pages for downloading:
A crawler cannot download all pages from the web.
It is therefore important for the crawler
"to select the pages and to visit "important" pages first by
prioritizing the URLs in the queue properly."
This minimizes the load on the crawled websites, together with
parallelization of the crawling process.
12. Approaches for Crawling process
Broadly, there are two types of crawlers:
A priori
Follow a predefined path.
Not a priori
Do not follow a specific path.
13. Policies Used
A selection policy that states which pages to download.
A politeness policy that states how to avoid overloading
web sites.
A parallelization policy that states how to coordinate a
distributed web crawl.
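The politeness policy can be sketched as a minimum delay between successive requests to the same host; the `PolitenessGate` class and the fixed per-host delay are assumptions for illustration, not the paper's mechanism.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between requests to the same host."""
    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_hit = {}              # host -> timestamp of last request

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_hit.get(host, 0.0) + self.delay
        if now < earliest:
            time.sleep(earliest - now)  # sleep only if we are too early
        self.last_hit[host] = time.monotonic()

gate = PolitenessGate(delay_seconds=0.1)
gate.wait("http://example.com/a")   # first hit on this host: no wait
gate.wait("http://example.com/b")   # same host again: sleeps ~0.1 s
```

Distinct hosts are tracked independently, so a parallelized crawl can stay busy on other sites while one host's delay elapses.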
14. Utilities of Web Crawler
Gather pages from the Web.
Support a search engine.
Perform data mining.
Improve web sites (web site analysis).
15. Conclusion
The number of extracted documents is reduced: links are
analyzed, and a great deal of irrelevant web pages are deleted.
Crawling time is reduced, and once the irrelevant pages are
deleted, the crawling load is reduced as well.