Search Engine & Web Crawling
Presented By:-
Vinay Arora
Assistant Professor
CSED, Thapar University
Patiala (Punjab)
Contents
• What is a search engine
• Examples of and need for a search engine
• How a search engine works
• Web crawler
• Web crawling
▫ Factors affecting web crawling
robots.txt
sitemap.xml
manual submission of websites into the database of a specific search engine
amendment of the <a> tag (href attribute, rel="nofollow")
• Areas related to web crawling
▫ Indexing
▫ Searching algorithms
▫ Data mining and analysis
• Web crawler as Add On
▫ Downloading whole website (offline dump)
Demo Tool – httrack
• Examples of Web crawler
▫ Open source
What is a search engine
• A search engine is a system that collects information about
web pages on the Internet.
• It indexes that information and stores the result in a large
database where it can be searched quickly.
• The search engine provides an interface for searching the
database.
• When you enter a keyword, the search engine looks through
its index of billions of web pages to help you find the ones
you are looking for.
Examples of search engines
Need for a search engine
• Variety: An Internet search can return many kinds of sources.
Results from online encyclopedias, news stories, university
studies, discussion boards, and even personal blogs can come up in a basic
Internet search. This variety lets anyone searching for information
choose the types of sources they would like to use, or combine several
sources to gain a fuller understanding of a subject.
• Organization: Internet search engines help to organize the Internet and
individual websites. They gather information that may be scattered across
various places on the same site into an organized list that is easier
to use.
• Precision: Search engines can provide refined, more precise results.
Searching more precisely cuts down the amount of information
returned by your search.
Searching for the keyword “thapar university” on Google
How does a search engine work?
A search engine has three parts.
• Spider: A robot program (also
called a spider or robot) designed
to track down web pages. It
follows the links these pages
contain and adds information to
the search engine's database.
Example: Googlebot (Google’s
robot program)
• Index: Database containing a
copy of each web page gathered
by the spider.
• Search engine software:
Technology that enables users to
query the index and that returns
results in a systematic (ranked) order.
How search engine works? (Conti…)
Web crawler
• A Web crawler is a computer program that browses the World
Wide Web in a methodical, automated manner.
• Other names
Crawler
Spider
Robot (or bot)
Web agent
Wanderer, worm
• Examples: Googlebot, msnbot, etc.
Sequential crawler
• A sequential crawler fetches
one page at a time in a loop
• Seeds can be any list of
starting URLs
• The order of page visits is
determined by the frontier data
structure
• The stop criterion can be
anything (for example, a page limit)
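The loop described above can be sketched as follows. This is a minimal illustration and not the deck's own code: the in-memory FAKE_WEB dictionary, the fetch_links helper, and the max_pages limit are assumptions made so the example runs without network access. A FIFO queue is used as the frontier, which gives breadth-first visiting order; a different frontier structure would give a different order.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for 'fetch and parse': return the outgoing links of a page."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, max_pages=10):
    """Visit pages one at a time, starting from the seed URLs."""
    frontier = deque(seeds)   # FIFO frontier -> breadth-first visiting order
    seen = set(seeds)         # URLs already added to the frontier
    visited = []
    # Stop criterion (an assumption): frontier empty or page limit reached.
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://example.com/"])
print(order)
```

Replacing the deque with a priority queue would turn this into a "best-first" crawler without changing the rest of the loop.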
Architecture of a crawler
Architecture of a crawler (Conti…)
• URL Frontier: contains the URLs yet to be fetched in the
current crawl. At first, a seed set is stored in the URL frontier,
and the crawler begins by taking a URL from the seed set.
• DNS: Domain Name System resolution. Looks up the IP address
for a domain name.
• Fetch: generally uses the HTTP protocol to fetch the page at the URL.
• Parse: the page is parsed. Text (and references to images, videos, etc.)
and links are extracted.
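As a rough sketch of the parse step, the following uses Python's standard html.parser module to pull the visible text and the outgoing links from a page. The LinkExtractor class, the parse_page helper, and the sample HTML are illustrative assumptions; a production crawler would also need to handle scripts, styles, and malformed markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects text fragments and absolute link URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Only <a href="..."> is treated as a link in this sketch.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep non-empty text fragments; whitespace-only runs are skipped.
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return " ".join(parser.text_parts), parser.links


SAMPLE_HTML = (
    "<html><body><h1>Home</h1>"
    '<a href="/about">About us</a> '
    '<a href="http://example.org/x">External</a>'
    "</body></html>"
)

text, links = parse_page("http://example.com/", SAMPLE_HTML)
print(text)
print(links)
```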
Architecture of a crawler (Conti…)
• Content Seen?: tests whether a web page with the same
content has already been seen at another URL. This requires
a way to compute a fingerprint of a web page.
• URL Filter:
▫ Decides whether the extracted URL should be excluded from the
frontier (for example, because robots.txt disallows it).
▫ The URL should also be normalized.
• Duplicate URL Elimination: the URL is checked against those
already seen, and duplicates are discarded.
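One simple way to realize these three checks is sketched below. It is an assumption-laden illustration: the fingerprint here is a SHA-256 hash of whitespace-normalized, lowercased text, which only detects exact duplicates (near-duplicate detection needs other techniques), and the normalization rules in normalize_url are a small, hypothetical subset of what a real crawler would apply.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def fingerprint(page_text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    canonical = " ".join(page_text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def normalize_url(url):
    """Lowercase scheme and host, drop default port and fragment,
    and use "/" for an empty path. Any user:password part is dropped."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


seen_fingerprints = set()
seen_urls = set()


def content_seen(page_text):
    """Return True if identical content was already recorded."""
    fp = fingerprint(page_text)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False


def url_seen(url):
    """Return True if the normalized URL was already recorded."""
    key = normalize_url(url)
    if key in seen_urls:
        return True
    seen_urls.add(key)
    return False
```

With these helpers, a crawler would call url_seen before adding a link to the frontier and content_seen after fetching a page.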
Web crawling & factors affecting it
• Crawling (spidering): finding and downloading web pages
automatically.
• Factors are the things that divert or restrict the crawler
as it crawls.
▫ robots.txt
▫ sitemap.xml
▫ manual submission of websites into the database of a specific
search engine
▫ amendment of the <a> tag (href attribute, rel="nofollow")
robots.txt
• The robots exclusion standard, also known as the robots
exclusion protocol or robots.txt protocol, is a standard used
by websites to communicate with web crawlers and other web
robots.
• The standard specifies the instruction format to be used to
inform the robot about which areas of the website should not
be processed or scanned.
• Robots are often used by search engines to categorize and
archive web sites, or by webmasters to proofread source code.
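To illustrate how a crawler might consult these rules, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module. The file contents, the "MyCrawler" user-agent name, and the example.com URLs are made up for the example; a real crawler would download robots.txt from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all robots are asked to stay out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")

print(allowed)  # the home page is not covered by any Disallow rule
print(blocked)  # /private/ is disallowed for every user agent
```

Note that robots.txt is advisory: it tells well-behaved robots what to skip, but it does not enforce anything by itself.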
robots.txt (Conti…)
sitemap.xml
• The Sitemaps protocol allows a webmaster to inform search
engines about URLs on a website that are available for
crawling.
• A Sitemap is an XML file that lists the URLs for a site.
• It allows webmasters to include additional information about
each URL: when it was last updated, how often it changes, and
how important it is in relation to other URLs in the site.
• This allows search engines to crawl the site more intelligently.
Sitemaps are a URL inclusion protocol and
complement robots.txt, a URL exclusion protocol.
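As a rough illustration, a crawler could read a sitemap with Python's standard xml.etree.ElementTree module. The sample document and the parse_sitemap helper are assumptions for this sketch; the element names (loc, lastmod, changefreq, priority) and the default priority of 0.5 come from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap with two URLs; the second omits <lastmod>.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/about</loc>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url>, highest priority first."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            # The protocol's default priority is 0.5 when the tag is absent.
            "priority": float(url.findtext("sm:priority", default="0.5", namespaces=NS)),
        })
    entries.sort(key=lambda e: e["priority"], reverse=True)
    return entries


entries = parse_sitemap(SITEMAP_XML)
for entry in entries:
    print(entry["priority"], entry["loc"])
```

Sorting by priority is only one possible policy; a crawler could equally schedule revisits from lastmod or changefreq.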
sitemap.xml (Conti…)
Manual submission of websites into
database of specific search engine
amendment of the <a> tag (href attribute,
rel="nofollow")
• The <a> tag defines a hyperlink, which is used to link from
one page to another.
• A normal link, displayed as "Visit W3Schools.com!":
<a href="http://www.w3schools.com">Visit W3Schools.com!</a>
• The same link with rel="nofollow", which asks crawlers not to follow it:
<a rel="nofollow" href="http://www.w3schools.com">Visit W3Schools.com!</a>
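A crawler that honors this hint has to inspect the rel attribute while extracting links. The sketch below, built on Python's standard html.parser module, separates links into "follow" and "nofollow" lists. The FollowableLinks class name and the two-list design are assumptions for illustration; the two anchors are the ones shown on the slide.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Sorts <a href> targets by whether rel contains "nofollow"."""

    def __init__(self):
        super().__init__()
        self.follow = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.follow.append(href)


PAGE = (
    '<a href="http://www.w3schools.com">Visit W3Schools.com!</a>'
    '<a rel="nofollow" href="http://www.w3schools.com">Visit W3Schools.com!</a>'
)

extractor = FollowableLinks()
extractor.feed(PAGE)
print(extractor.follow)
print(extractor.nofollow)
```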
Areas related to web crawling -
Indexing
• Search engine indexing collects, parses, and stores data to
facilitate fast and accurate information retrieval.
• The purpose of storing an index is to optimize speed and
performance in finding relevant documents for a search
query.
• Without an index, the search engine would scan every
document in the corpus, which would require considerable
time and computing power.
Areas related to web crawling –
Indexing (Conti…)
• Search engine architectures vary in the way indexing is
performed and in methods of index storage to meet the
various design factors.
• Index data structures
▫ Suffix tree
▫ Inverted index
▫ Citation index
▫ N-gram index
▫ Document-term matrix
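Of the structures listed above, the inverted index is probably the easiest to sketch: it maps each term to the set of documents that contain it, so a query never has to scan the whole corpus. The three sample documents, the whitespace tokenizer, and the AND-only search function are simplifying assumptions for this example.

```python
from collections import defaultdict

# Hypothetical corpus: document id -> text.
DOCS = {
    1: "web crawler downloads web pages",
    2: "search engine indexes pages",
    3: "crawler follows links",
}


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_inverted_index(DOCS)
print(search(index, "crawler"))
print(search(index, "web pages"))
```

A lookup touches only the posting sets of the query terms, which is where the speed advantage over scanning every document comes from.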
Areas related to web crawling -
Searching algorithms
• String Matching Algorithms
▫ Brute Force Algorithm
▫ Rabin-Karp Algorithm
▫ Knuth-Morris-Pratt Algorithm
▫ Boyer-Moore Algorithm
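As one worked example from this list, here is a compact sketch of the Knuth-Morris-Pratt algorithm. The function names and sample strings are chosen for illustration. KMP precomputes, for each prefix of the pattern, the length of the longest proper prefix that is also a suffix, and uses that table to avoid re-examining text characters after a mismatch.

```python
def build_failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]      # fall back without moving backwards in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]      # keep going to find overlapping matches
    return matches


print(build_failure_table("abab"))
print(kmp_search("abababa", "aba"))
print(kmp_search("thapar university", "uni"))
```

The brute-force approach would instead restart the comparison at every text position, which is simpler but can take time proportional to the product of the two lengths.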
Areas related to web crawling - Data
mining and analysis
• Graph Mining
▫ Apriori-based Approach
▫ Pattern-Growth Approach
▫ Pattern growth-based frequent substructure mining
Web crawler as Add On
• Downloading whole website (offline dump) - httrack
Httrack (Conti…)
Httrack (Conti…)
Httrack (Conti…)
Examples of Web crawler – Open source
crawler4j
Application of crawling concepts
SEO – Search Engine Optimization