Web Robots


 ISHAN MISHRA
www.IshanTech.org



Outline
   Robot applications
   How it works
   Cycle Avoidance




Applications
   Behavior of web robots
       Wander from web site to web site (recursively):
       1. Fetching content,
       2. Following hyperlinks,
       3. Processing the data they find.

   Colorful names
       Crawlers,
       Spiders,
       Worms,
       Bots
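
The fetch-and-follow behavior listed above depends on pulling hyperlinks out of each fetched page. The following is a minimal sketch of that step, assuming Python's standard html.parser module; the class name LinkExtractor, the helper extract_links, and the example.com URLs are illustrative and do not come from the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real robot would fetch this over HTTP.
page = '<html><body><a href="/b.html">B</a> <a href="c.html">C</a></body></html>'
links = extract_links("http://www.example.com/dir/a.html", page)
print(links)
```

A robot would add each returned URL to its list of pages still to visit.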


Where to Start: The “Root Set”
[Figure: a web of pages labeled A through U connected by hyperlinks; the crawl starts from a root set of these pages.]

Cycle Avoidance
[Figure: three panels showing a robot crawling linked pages A, B, C, D, and E.]
(a) Robot fetches page A, follows link, fetches B
(b) Robot follows link and fetches page C
(c) Robot follows link and is back to A

Loops
   Cycles are bad for crawlers for three
    reasons:
       They waste the robot’s time and space.
       They can overwhelm the web site.
       They produce duplicate content.




Data structure for robot
   Trees and hash tables
   Lossy presence bit maps
   Checkpoints
       Save the list of visited URLs to disk, in case the
        robot crashes
   Partitioning
       Robot farms
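
As an illustration of the lossy presence bit map idea, here is a minimal sketch in which each URL is hashed to a single bit position. The class name PresenceBitmap, the bitmap size, and the use of MD5 as the hash are assumptions made for this example. The structure is lossy because two different URLs can map to the same bit, so a URL may occasionally be reported as visited when it was not.

```python
import hashlib


class PresenceBitmap:
    """Lossy record of visited URLs: one bit per hash bucket."""

    def __init__(self, num_bits=1 << 20):  # size chosen arbitrarily for the sketch
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8 + 1)

    def _position(self, url):
        # Hash the URL and fold the digest into a bit position.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        pos = self._position(url)
        self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        pos = self._position(url)
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))


seen = PresenceBitmap()
was_seen_before = "http://www.example.com/a.html" in seen
seen.add("http://www.example.com/a.html")
is_seen_after = "http://www.example.com/a.html" in seen
```

The bitmap uses a fixed amount of memory no matter how many URLs are recorded, which is the trade made in exchange for occasional false positives.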


Canonicalizing URLs
       Most web robots try to eliminate the
        obvious aliases by “canonicalizing” URLs
        into a standard form, by:
         Adding “:80” to the hostname, if the port
          isn’t specified.
         Converting all %xx escaped characters into
          their character equivalents.
         Removing # fragments (the part after “#”).
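
The three canonicalization steps above can be sketched with Python's standard urllib.parse module. The function name canonicalize is illustrative. The sketch also lowercases the hostname, which is an extra assumption beyond the three listed steps, and it unescapes every %xx sequence in the path, which a production robot might want to restrict for reserved characters.

```python
from urllib.parse import urlsplit, urlunsplit, unquote


def canonicalize(url):
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    port = parts.port
    # Step 1: add the default port for http if none was given.
    if port is None and parts.scheme == "http":
        port = 80
    netloc = f"{host}:{port}" if port is not None else host
    # Step 2: convert %xx escapes in the path to their characters.
    path = unquote(parts.path)
    # Step 3: drop the fragment by passing an empty string.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


a = canonicalize("http://www.example.com/%7Efred/index.html#top")
b = canonicalize("http://www.example.com:80/~fred/index.html")
print(a, a == b)
```

Both inputs reduce to the same string, so a visited-URL set keyed on the canonical form treats them as one page.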


Symbolic link cycles
[Figure: two file system trees rooted at /, each containing index.html and subdir.]
(a) subdir is a directory (containing index.html and logo.gif)
(b) subdir is an upward symbolic link

Dynamic Virtual Web Spaces
   It is possible to publish a URL that looks like a normal
    file but is really a gateway application.
   This application can generate HTML on the fly that
    contains links to imaginary URLs on the same server.
    When these imaginary URLs are requested, new imaginary
    URLs are generated.

   Such a malicious web server takes the poor robot on
    an Alice-in-Wonderland journey through an infinite virtual
    space, even if the web server doesn’t really contain any
    files. The trap can be hard for the robot to detect,
    because the HTML and URLs may look different every
    time.

   For example, a CGI-based calendaring program
Malicious dynamic web space
example




Techniques for avoiding loops
   Canonicalizing URLs
   Breadth-first crawling
   Throttling
       Limit the number of pages the robot can fetch from a
        web site in a period of time.
   Limit URL size
       Avoids the symbolic link cycle problem.
       Problem: many sites use URLs to maintain user state.
   URL/site blacklist
       vs. “excluding Robot”

Techniques for avoiding loops
   Pattern detection
       e.g., “subdir/subdir/subdir…”
       e.g., “subdir/images/subdir/images/subdir/…”

   Content fingerprinting
       Compute a checksum of each page’s content; the odds of two
        different pages having the same checksum are small.
       Message digest functions such as MD5 are popular for this
        purpose.

   Human monitoring
       Design your robot with diagnostics and logging, so
        human beings can easily monitor the robot’s progress and be
        warned quickly if something unusual is happening.
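
Pattern detection and content fingerprinting can both be sketched in a few lines. The helper names has_repeating_segments and is_duplicate, and the threshold of three repeats, are assumptions for this example; MD5 is used because the slide names it.

```python
import hashlib


def has_repeating_segments(url_path, min_repeats=3):
    """True if a unit of one or two path segments repeats back to back."""
    segments = [s for s in url_path.split("/") if s]
    for size in (1, 2):  # length of the repeating unit, in segments
        last_start = len(segments) - size * min_repeats
        for start in range(last_start + 1):
            unit = segments[start:start + size]
            if all(
                segments[start + k * size:start + (k + 1) * size] == unit
                for k in range(min_repeats)
            ):
                return True
    return False


seen_fingerprints = set()


def is_duplicate(content):
    """True if a page with identical bytes has been seen before."""
    fingerprint = hashlib.md5(content).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False


loop1 = has_repeating_segments("/subdir/subdir/subdir/index.html")
loop2 = has_repeating_segments("/subdir/images/subdir/images/subdir/images/x")
normal = has_repeating_segments("/a/b/c/index.html")
first = is_duplicate(b"<html>same page</html>")
second = is_duplicate(b"<html>same page</html>")
```

Fingerprinting only catches byte-identical pages; a page that embeds a changing date or counter would produce a different digest each time.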
Robotic HTTP
   Robots are no different from any other HTTP client program.
   Many robots try to implement the minimum
    amount of HTTP needed to request the content
    they seek.

   It is recommended that robot implementers
    send some basic header information to notify
    the site of the capabilities of the robot, the robot’s
    identity, and where it originated.

Identifying Request Header
   User-Agent
       Tells the server the robot’s name.
   From
       Gives the email address of the robot’s user/administrator.
   Accept
       Tells the server what media types are okay to send
        (e.g., only fetch text and sound).
   Referer
       Tells the server how the robot found links to this site’s
        content.
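
A request carrying these four headers might be built as follows with Python's standard urllib.request module. The robot name ExampleBot/1.0, the email address, and the URLs are placeholders invented for the example. Nothing is sent over the network here; the sketch only constructs the request object.

```python
from urllib.request import Request


def build_robot_request(url, referer=None):
    headers = {
        "User-Agent": "ExampleBot/1.0",        # hypothetical robot name
        "From": "robot-admin@example.com",     # hypothetical contact address
        "Accept": "text/html, text/plain",     # media types the robot can use
    }
    if referer:
        # The page on which the robot found the link to this URL.
        headers["Referer"] = referer
    return Request(url, headers=headers)


req = build_robot_request(
    "http://www.example.com/index.html",
    referer="http://www.example.com/",
)
# Request stores header names capitalized as "User-agent", "From", and so on.
print(req.header_items())
```

When such a request is actually opened, urllib adds a Host header derived from the URL, which matters for the virtual hosting problem.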


Virtual docroots cause trouble if no Host header is sent

Robot tries to request index.html from www.csie.ncnu.edu.tw, but does
not include a Host header.
Server is configured to serve both sites (www.ncnu.edu.tw and
www.csie.ncnu.edu.tw), but serves www.ncnu.edu.tw by default.

Web robot client
Request message
GET /index.html HTTP/1.0
User-agent: ShopBot 1.0

Response message
HTTP/1.0 200 OK
[…]
<HTML>
<TITLE>National Chi Nan University</TITLE>
[…]
What else a robot should support
   Support Virtual Hosting
        Not sending the Host header can lead to robots associating the wrong content with
         a particular URL.

   Conditional Requests
        Minimize the amount of content retrieved by using conditional HTTP
         requests (as in cache revalidation).

   Response Handling
        Status codes: 200 OK, 404 Not Found, 304 Not Modified
        Entities: <meta http-equiv="refresh" content="1; URL=index.html">

   User-Agent Targeting
        Webmasters should keep in mind that many robots will visit their site.
         Many sites optimize content for various user agents (e.g., Internet Explorer or Netscape).
        Problem: “your browser does not support frames.”
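
Conditional requests and status-code handling fit together as sketched below. This is an illustration under assumptions: the function names, the example URL, and the date string are invented, and the sketch builds the request and interprets a status code without contacting any server.

```python
from urllib.request import Request


def build_conditional_request(url, last_modified=None):
    """Ask for the page only if it changed since the robot's stored copy."""
    req = Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def choose_body(status, cached_body, response_body):
    """Pick the content to index based on the response status code."""
    if status == 304:   # Not Modified: the stored copy is still good
        return cached_body
    if status == 200:   # OK: use the fresh content
        return response_body
    return None         # e.g. 404 Not Found: nothing to index


req = build_conditional_request(
    "http://www.example.com/index.html",
    last_modified="Sat, 01 Jan 2000 00:00:00 GMT",
)
kept = choose_body(304, b"old copy", b"")
fresh = choose_body(200, b"old copy", b"new copy")
missing = choose_body(404, b"old copy", b"")
```

A 304 response carries no body, so the robot saves the transfer and keeps using what it already has.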


Misbehaving Robots
   Runaway robots
       Robots issue HTTP requests as fast as they can, which can overload a server.
   Stale URLs
       Robots visit old lists of URLs, requesting pages that may no longer exist.
   Long, wrong URLs
       May reduce the web server’s performance, clutter the server’s access
        logs, and even crash the server.
   Nosy robots
       Some robots may get URLs that point to private data and make
        that data easily accessible through a search engine.
   Dynamic gateway access
       Robots don’t always know what they are accessing.


Excluding Robots
[Figure: a robot fetches robots.txt from www.ncnu.edu.tw before requesting a page.]
Robot parses the robots.txt file and
determines if it is allowed to access
the acetylene-torches.html file.

It is, so it proceeds with the request.

robots.txt format
   # Allow googlebot and csiebot to crawl the public parts
   # of our site, but no other robots are allowed to
   # crawl anything on our site.
   User-Agent: googlebot
   User-Agent: csiebot
   Disallow: /private

   User-Agent: *
   Disallow: /
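
A robot can check rules like these with Python's standard urllib.robotparser module. In the sketch below the rules are parsed from a string instead of being fetched, and the catch-all record uses "Disallow: /" so that it matches the stated intent of shutting out all other robots (an empty Disallow value would allow everything). The helper load_rules, the robot name otherbot, and the page names are invented for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
"""


def load_rules(text):
    parser = RobotFileParser()
    parser.parse(text.splitlines())  # parse() takes an iterable of lines
    return parser


rules = load_rules(ROBOTS_TXT)
public_ok = rules.can_fetch("csiebot", "http://www.ncnu.edu.tw/index.html")
private_ok = rules.can_fetch("csiebot", "http://www.ncnu.edu.tw/private/data.html")
other_ok = rules.can_fetch("otherbot", "http://www.ncnu.edu.tw/index.html")
print(public_ok, private_ok, other_ok)
```

A real robot would call set_url() and read() on the parser to fetch the site's own robots.txt before requesting any page from that site.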
Robots Exclusion Standard versions

Version  Title and description                                      Date
0.0      A Standard for Robot Exclusion: Martijn Koster’s           June 1994
         original robots.txt mechanism with Disallow directive
1.0      A Method for Web Robots Control: Martijn Koster’s          Nov. 1996
         IETF draft with additional support for Allow
2.0      An Extended Standard for Robot Exclusion: Sean Conner’s    Nov. 1996
         extension including regex and timing information;
         not widely supported

Robots.txt path matching examples

Rule path          URL path           Match?   Comments
/tmp               /tmp               Yes      Rule path == URL path
/tmp               /tmpfile.html      Yes      Rule path is a prefix of URL path
/tmp               /tmp/a.html        Yes      Rule path is a prefix of URL path
/tmp/              /tmp               No       /tmp/ is not a prefix of /tmp
(empty)            README.TXT         Yes      Empty rule path matches everything
/~fred/hi.html     %7Efred/hi.html    Yes      %7E is treated the same as ~
/%7Efred/hi.html   /~fred/hi.html     Yes      %7E is treated the same as ~
/%7efred/hi.html   /%7Efred/hi.html   Yes      Case isn’t significant in escapes
/~fred/hi.html     ~fred%2Fhi.html    No       %2F is slash, but slash is a special case that must match exactly
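
The matching rules in the table can be approximated by unescaping both paths, except for %2F, and then testing for a prefix. The following is a rough sketch with invented function names; it handles ASCII escapes only and writes URL paths with a leading slash.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize(path):
    """Decode %xx escapes, but keep an encoded slash distinct from '/'."""
    def decode(match):
        code = match.group(1).upper()
        if code == "2F":
            return "%2F"  # slash is the special case: never unescape it
        return chr(int(code, 16))
    return _ESCAPE.sub(decode, path)


def rule_matches(rule_path, url_path):
    """A rule applies when it is a prefix of the URL path after normalizing."""
    return normalize(url_path).startswith(normalize(rule_path))


results = [
    rule_matches("/tmp", "/tmp"),
    rule_matches("/tmp", "/tmpfile.html"),
    rule_matches("/tmp", "/tmp/a.html"),
    rule_matches("/tmp/", "/tmp"),
    rule_matches("", "/README.TXT"),
    rule_matches("/~fred/hi.html", "/%7Efred/hi.html"),
    rule_matches("/%7efred/hi.html", "/%7Efred/hi.html"),
    rule_matches("/~fred/hi.html", "/~fred%2Fhi.html"),
]
print(results)
```

Because every string starts with the empty string, the empty rule path matches all URL paths without any special handling.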
HTML Robot-control Meta Tags
   e.g.
        <META NAME="ROBOTS" CONTENT=directive-list>

   Directive-list
        NOINDEX
             Robots should not process (index) this document’s content
        NOFOLLOW
             Robots should not crawl any outgoing links from this page

        INDEX
        FOLLOW
        NOARCHIVE
             Robots should not cache a local copy of the page
        ALL (equivalent to INDEX, FOLLOW)
        NONE (equivalent to NOINDEX, NOFOLLOW)
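
A robot could read these directives from a fetched page roughly as follows. This sketch uses Python's standard html.parser module; the class RobotsMetaParser, the function robot_permissions, and the sample page are invented for the example, and pages without a ROBOTS tag are assumed to allow everything.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for directive in (attributes.get("content") or "").split(","):
            directive = directive.strip().upper()
            if directive:
                self.directives.add(directive)


def robot_permissions(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "index": not (found & {"NOINDEX", "NONE"}),
        "follow": not (found & {"NOFOLLOW", "NONE"}),
        "archive": "NOARCHIVE" not in found,
    }


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
permissions = robot_permissions(page)
print(permissions)
```

The INDEX, FOLLOW, and ALL directives need no special handling here because they only restate the defaults.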


Additional META tag directives

name=                content=      Description
DESCRIPTION          <text>        Allows an author to define a short text summary of the web
                                   page. Many search engines look at META DESCRIPTION
                                   tags, allowing page authors to specify appropriate short
                                   abstracts to describe their web pages.
                                   <meta name="description"
                                       content="Welcome to Mary’s Antiques web site">
KEYWORDS             <comma        Associates a comma-separated list of words that describe the
                     list>         web page, to assist in keyword searches.
                                   <meta name="keywords"
                                      content="antiques,mary,furniture,restoration">

REVISIT-AFTER*       <no. days>    Instructs the robot or search engine that the page should be
                                   revisited, presumably because it is subject to change, after the
                                   specified number of days.
                                   <meta name="revisit-after" content="10 days">

*   This directive is not likely to have wide support.
Guidelines for web robot
operators (Robot Etiquette)




Modern Search Engine Architecture
[Figure: web search users send queries to the web search gateway (query engine), which consults the full-text index database; the search engine crawler/indexer fetches pages from web servers (crawling and indexing) and populates that database.]

Full-Text Index




Posting the Query
User fills out HTML search form (with a GET action HTTP method) on site in
browser and hits Submit.

Client request message (sent to www.csie.ncnu.edu.tw)
GET /search.html?query=drills HTTP/1.1
Host: www.csie.ncnu.edu.tw
Accept: *
User-agent: ShopBot

Search gateway
Query: "drills"
Results: File "BD.html"

Response message
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 1037

<HTML>
<HEAD><TITLE>Search Results</TITLE>
[…]

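
The GET form submission amounts to encoding the search terms into the query string of the URL, which the gateway then decodes. A minimal sketch with Python's standard urllib.parse module follows; the function name build_search_url is invented, while the host, path, and parameter name follow the slide's example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base, terms):
    """Client side: encode the form field into the URL's query string."""
    return base + "?" + urlencode({"query": terms})


url = build_search_url("http://www.csie.ncnu.edu.tw/search.html", "drills")

# Gateway side: recover the search terms from the request URL.
query = parse_qs(urlsplit(url).query)["query"][0]

# Spaces and other special characters are escaped by urlencode.
two_words = build_search_url("http://www.csie.ncnu.edu.tw/search.html", "cordless drills")
print(url, query, two_words)
```

The gateway would pass the decoded terms to the query engine and format the matching documents as an HTML results page.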
Reference (HW#4)
 paper reading: “searching the Web”
 paper reading: “Hyperlink analysis for the Web,” IEEE Internet Computing, 2001.
http://www.searchtools.com
  Search Tools for Web Sites and Intranets-resources for search tools and
  robots.
http://www.robotstxt.org/wc/robots.html
  The Web Robots Pages-resources for robot developers, including the
  registry of Internet Robots.
http://www.searchengineworld.com
  Search Engine World-resource for search engines and robots.
http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm
 RobotRules Perl source.
http://www.conman.org/people/spc/robots2.html
 An Extended Standard for Robot Exclusion.
Managing Gigabytes: Compressing and Indexing Documents and Images
  Witten, I., Moffat, A., and Bell, T., Morgan Kaufmann.

Más contenido relacionado

Destacado

How to create favicon
How to   create    faviconHow to   create    favicon
How to create faviconOM Maurya
 
how to create a blog on wordpress
how to create  a blog  on  wordpress how to create  a blog  on  wordpress
how to create a blog on wordpress OM Maurya
 
How to create rss feed for your website
How to create  rss feed  for  your  websiteHow to create  rss feed  for  your  website
How to create rss feed for your websiteOM Maurya
 
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...Robert M Chapple
 
How to create rss feed
How to create rss feedHow to create rss feed
How to create rss feedTanuja Talekar
 
How to track website visitors using Google analytics
How to track website visitors using Google analyticsHow to track website visitors using Google analytics
How to track website visitors using Google analyticsTanuja Talekar
 
how to setup Google analytics tracking code for website
how to setup  Google analytics tracking code for websitehow to setup  Google analytics tracking code for website
how to setup Google analytics tracking code for websiteOM Maurya
 
How to create sitemap for website
How to create sitemap for websiteHow to create sitemap for website
How to create sitemap for websiteOM Maurya
 

Destacado (10)

R&amp;b history
R&amp;b historyR&amp;b history
R&amp;b history
 
How to create favicon
How to   create    faviconHow to   create    favicon
How to create favicon
 
how to create a blog on wordpress
how to create  a blog  on  wordpress how to create  a blog  on  wordpress
how to create a blog on wordpress
 
How to create rss feed for your website
How to create  rss feed  for  your  websiteHow to create  rss feed  for  your  website
How to create rss feed for your website
 
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
 
How to create rss feed
How to create rss feedHow to create rss feed
How to create rss feed
 
How to track website visitors using Google analytics
How to track website visitors using Google analyticsHow to track website visitors using Google analytics
How to track website visitors using Google analytics
 
how to setup Google analytics tracking code for website
how to setup  Google analytics tracking code for websitehow to setup  Google analytics tracking code for website
how to setup Google analytics tracking code for website
 
How to create sitemap for website
How to create sitemap for websiteHow to create sitemap for website
How to create sitemap for website
 
Evareporte
EvareporteEvareporte
Evareporte
 

Similar a Introduction to "robots.txt

Web Development Presentation
Web Development PresentationWeb Development Presentation
Web Development PresentationTurnToTech
 
HTML5 Real-Time and Connectivity
HTML5 Real-Time and ConnectivityHTML5 Real-Time and Connectivity
HTML5 Real-Time and ConnectivityPeter Lubbers
 
WEB I - 01 - Introduction to Web Development
WEB I - 01 - Introduction to Web DevelopmentWEB I - 01 - Introduction to Web Development
WEB I - 01 - Introduction to Web DevelopmentRandy Connolly
 
Top 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud DevelopersTop 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud DevelopersBrian Huff
 
Of CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills securityOf CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills securityJohn Varghese
 
Technical SEO | Joomla Day Chicago 2012
Technical SEO | Joomla Day Chicago 2012 Technical SEO | Joomla Day Chicago 2012
Technical SEO | Joomla Day Chicago 2012 Jessica Dunbar
 
Publishing strategies for API documentation
Publishing strategies for API documentationPublishing strategies for API documentation
Publishing strategies for API documentationTom Johnson
 
Browser Internals-Same Origin Policy
Browser Internals-Same Origin PolicyBrowser Internals-Same Origin Policy
Browser Internals-Same Origin PolicyKrishna T
 
improve website performance
improve website performanceimprove website performance
improve website performanceamit Sinha
 
Web development using ASP.NET MVC
Web development using ASP.NET MVC Web development using ASP.NET MVC
Web development using ASP.NET MVC Adil Mughal
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Henry S
 
Drupal is not your Website
Drupal is not your Website Drupal is not your Website
Drupal is not your Website Phase2
 
Search Engine Spiders
Search Engine SpidersSearch Engine Spiders
Search Engine SpidersCJ Jenkins
 
Rendering: Or why your perfectly optimized content doesn't rank
Rendering: Or why your perfectly optimized content doesn't rankRendering: Or why your perfectly optimized content doesn't rank
Rendering: Or why your perfectly optimized content doesn't rankWeLoveSEO
 
Kotlin server side frameworks
Kotlin server side frameworksKotlin server side frameworks
Kotlin server side frameworksKen Yee
 
From ZERO to REST in an hour
From ZERO to REST in an hour From ZERO to REST in an hour
From ZERO to REST in an hour Cisco DevNet
 

Similar a Introduction to "robots.txt (20)

Web Development Presentation
Web Development PresentationWeb Development Presentation
Web Development Presentation
 
HTML5 Real-Time and Connectivity
HTML5 Real-Time and ConnectivityHTML5 Real-Time and Connectivity
HTML5 Real-Time and Connectivity
 
WEB I - 01 - Introduction to Web Development
WEB I - 01 - Introduction to Web DevelopmentWEB I - 01 - Introduction to Web Development
WEB I - 01 - Introduction to Web Development
 
Top 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud DevelopersTop 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud Developers
 
Of CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills securityOf CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills security
 
Technical SEO | Joomla Day Chicago 2012
Technical SEO | Joomla Day Chicago 2012 Technical SEO | Joomla Day Chicago 2012
Technical SEO | Joomla Day Chicago 2012
 
Publishing strategies for API documentation
Publishing strategies for API documentationPublishing strategies for API documentation
Publishing strategies for API documentation
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Browser Internals-Same Origin Policy
Browser Internals-Same Origin PolicyBrowser Internals-Same Origin Policy
Browser Internals-Same Origin Policy
 
Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Webbasics
WebbasicsWebbasics
Webbasics
 
improve website performance
improve website performanceimprove website performance
improve website performance
 
Web development using ASP.NET MVC
Web development using ASP.NET MVC Web development using ASP.NET MVC
Web development using ASP.NET MVC
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
 
Drupal is not your Website
Drupal is not your Website Drupal is not your Website
Drupal is not your Website
 
Search Engine Spiders
Search Engine SpidersSearch Engine Spiders
Search Engine Spiders
 
Rendering: Or why your perfectly optimized content doesn't rank
Rendering: Or why your perfectly optimized content doesn't rankRendering: Or why your perfectly optimized content doesn't rank
Rendering: Or why your perfectly optimized content doesn't rank
 
Kotlin server side frameworks
Kotlin server side frameworksKotlin server side frameworks
Kotlin server side frameworks
 
From ZERO to REST in an hour
From ZERO to REST in an hour From ZERO to REST in an hour
From ZERO to REST in an hour
 
Unit 02: Web Technologies (1/2)
Unit 02: Web Technologies (1/2)Unit 02: Web Technologies (1/2)
Unit 02: Web Technologies (1/2)
 

Más de Ishan Mishra

Political Strategist India | Significance of social media in political campaign
Political Strategist India | Significance of social media in political campaignPolitical Strategist India | Significance of social media in political campaign
Political Strategist India | Significance of social media in political campaignIshan Mishra
 
Social Media Agency & Digital Marketing Company in Indore
Social Media Agency & Digital Marketing Company in IndoreSocial Media Agency & Digital Marketing Company in Indore
Social Media Agency & Digital Marketing Company in IndoreIshan Mishra
 
Best Off-page-SEO Techniques for 2020
Best Off-page-SEO Techniques for 2020Best Off-page-SEO Techniques for 2020
Best Off-page-SEO Techniques for 2020Ishan Mishra
 
SEO Services Indore, SEO Indore, SEO Company Indore
SEO Services Indore, SEO Indore, SEO Company IndoreSEO Services Indore, SEO Indore, SEO Company Indore
SEO Services Indore, SEO Indore, SEO Company IndoreIshan Mishra
 
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...Ishan Mishra
 
Top 15 personal finance tips in 2015
Top 15 personal finance tips in 2015Top 15 personal finance tips in 2015
Top 15 personal finance tips in 2015Ishan Mishra
 
Buy vs rent 2015 in India | Real Estate Guide 2015 India
Buy vs rent 2015 in India | Real Estate Guide 2015 India Buy vs rent 2015 in India | Real Estate Guide 2015 India
Buy vs rent 2015 in India | Real Estate Guide 2015 India Ishan Mishra
 
AdSense Optimization Tips for increased ad Revenue
AdSense Optimization Tips for increased ad RevenueAdSense Optimization Tips for increased ad Revenue
AdSense Optimization Tips for increased ad RevenueIshan Mishra
 
Online Travel Agency Report on Social Media Habits of Trave
Online Travel Agency Report on Social Media Habits of TraveOnline Travel Agency Report on Social Media Habits of Trave
Online Travel Agency Report on Social Media Habits of TraveIshan Mishra
 
Management lesson from Mahabharat
Management lesson from MahabharatManagement lesson from Mahabharat
Management lesson from MahabharatIshan Mishra
 
Atif Aslam's Biography
Atif Aslam's BiographyAtif Aslam's Biography
Atif Aslam's BiographyIshan Mishra
 
Inbound Marketing Agency India | ISHAN-Tech
Inbound Marketing Agency India  | ISHAN-TechInbound Marketing Agency India  | ISHAN-Tech
Inbound Marketing Agency India | ISHAN-TechIshan Mishra
 
Crystal IT Park Indore IT ccompanies
Crystal IT Park Indore IT ccompaniesCrystal IT Park Indore IT ccompanies
Crystal IT Park Indore IT ccompaniesIshan Mishra
 
Global Management Consulting, Technology and Outsourcing Services from ISHAN...
 Global Management Consulting, Technology and Outsourcing Services from ISHAN... Global Management Consulting, Technology and Outsourcing Services from ISHAN...
Global Management Consulting, Technology and Outsourcing Services from ISHAN...Ishan Mishra
 
ISHAN-TECH Consulting
ISHAN-TECH ConsultingISHAN-TECH Consulting
ISHAN-TECH ConsultingIshan Mishra
 
Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...
Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...
Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...Ishan Mishra
 

Más de Ishan Mishra (16)

Political Strategist India | Significance of social media in political campaign
Political Strategist India | Significance of social media in political campaignPolitical Strategist India | Significance of social media in political campaign
Political Strategist India | Significance of social media in political campaign
 
Social Media Agency & Digital Marketing Company in Indore
Social Media Agency & Digital Marketing Company in IndoreSocial Media Agency & Digital Marketing Company in Indore
Social Media Agency & Digital Marketing Company in Indore
 
Best Off-page-SEO Techniques for 2020
Best Off-page-SEO Techniques for 2020Best Off-page-SEO Techniques for 2020
Best Off-page-SEO Techniques for 2020
 
SEO Services Indore, SEO Indore, SEO Company Indore
SEO Services Indore, SEO Indore, SEO Company IndoreSEO Services Indore, SEO Indore, SEO Company Indore
SEO Services Indore, SEO Indore, SEO Company Indore
 
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
 
Top 15 personal finance tips in 2015
Top 15 personal finance tips in 2015Top 15 personal finance tips in 2015
Top 15 personal finance tips in 2015
 
Buy vs rent 2015 in India | Real Estate Guide 2015 India
Buy vs rent 2015 in India | Real Estate Guide 2015 India Buy vs rent 2015 in India | Real Estate Guide 2015 India
Buy vs rent 2015 in India | Real Estate Guide 2015 India
 
AdSense Optimization Tips for increased ad Revenue
AdSense Optimization Tips for increased ad RevenueAdSense Optimization Tips for increased ad Revenue
AdSense Optimization Tips for increased ad Revenue
 
Online Travel Agency Report on Social Media Habits of Trave
Online Travel Agency Report on Social Media Habits of TraveOnline Travel Agency Report on Social Media Habits of Trave
Online Travel Agency Report on Social Media Habits of Trave
 
Management lesson from Mahabharat
Management lesson from MahabharatManagement lesson from Mahabharat
Management lesson from Mahabharat
 
Atif Aslam's Biography
Atif Aslam's BiographyAtif Aslam's Biography
Atif Aslam's Biography
 
Inbound Marketing Agency India | ISHAN-Tech
Inbound Marketing Agency India  | ISHAN-TechInbound Marketing Agency India  | ISHAN-Tech
Inbound Marketing Agency India | ISHAN-Tech
 
Crystal IT Park Indore IT ccompanies
Crystal IT Park Indore IT ccompaniesCrystal IT Park Indore IT ccompanies
Crystal IT Park Indore IT ccompanies
 
Global Management Consulting, Technology and Outsourcing Services from ISHAN...
 Global Management Consulting, Technology and Outsourcing Services from ISHAN... Global Management Consulting, Technology and Outsourcing Services from ISHAN...
Introduction to "robots.txt"

  • 1. Web Robots (ISHAN MISHRA, www.IshanTech.org)
  • 2. Outline
      • Robot applications
      • How it works
      • Cycle avoidance
  • 3. Applications
      • Behavior of web robots: they wander from web site to web site (recursively),
          1. fetching content,
          2. following hyperlinks,
          3. processing the data they find.
      • Colorful names: crawlers, spiders, worms, bots.
  • 4. Where to Start: The “Root Set”
      [Figure: graph of linked pages labeled A through U]
  • 5. Cycle Avoidance
      [Figure: three panels showing pages A through E and the robot's path]
      (a) Robot fetches page A, follows link, fetches B
      (b) Robot follows link and fetches page C
      (c) Robot follows link and is back to A
  • 6. Loops
      • Cycles are bad for crawlers for three reasons:
          • They waste the robot’s time and space.
          • They can overwhelm the web site.
          • They produce duplicate content.
  • 7. Data structures for robots
      • Trees and hash tables
      • Lossy presence bit maps
      • Checkpoints: save the list of visited URLs to disk, in case the robot crashes
      • Partitioning: robot farms
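A lossy presence bit map can be sketched in a few lines. The version below is only an illustration under stated assumptions: the class name `PresenceBitmap`, the default size of 2^20 bits, and the use of MD5 to pick a bit are all choices made here, not details from the slides. Each URL is hashed to one bit; because two URLs can land on the same bit, the structure may occasionally report an unvisited URL as visited, which is the "lossy" part.

```python
import hashlib


class PresenceBitmap:
    """Lossy 'have I seen this URL?' set backed by a fixed-size bit array.

    Hypothetical helper for illustration: collisions can make an unseen
    URL look seen, but a URL that was added is always reported as seen.
    """

    def __init__(self, num_bits=1 << 20):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8 + 1)

    def _index(self, url):
        # Map the URL to a bit position using the first 8 bytes of its MD5.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        i = self._index(url)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, url):
        i = self._index(url)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


seen = PresenceBitmap()
seen.add("http://www.example.com/index.html")
print("http://www.example.com/index.html" in seen)
```

The memory cost is fixed (about 128 KB here) no matter how many URLs are recorded, which is the trade-off against an exact hash table.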
  • 8. Canonicalizing URLs
      • Most web robots try to eliminate the obvious aliases by “canonicalizing” URLs into a standard form, by:
          • adding “:80” to the hostname, if the port isn’t specified;
          • converting all %xx escaped characters into their character equivalents;
          • removing # tags (fragments).
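The three canonicalization steps can be sketched with Python's standard `urllib.parse` module. The function name `canonicalize` is hypothetical, and lowercasing the hostname is an extra assumption made here (hostnames are case-insensitive); the rest follows the slide.

```python
from urllib.parse import urlsplit, urlunsplit, unquote


def canonicalize(url):
    """Return a canonical form of an http URL (illustrative sketch)."""
    parts = urlsplit(url)
    # Assumption: lowercase the hostname, since hostnames are case-insensitive.
    host = (parts.hostname or "").lower()
    port = parts.port
    # Step 1: add ":80" when no port is specified.
    if port is None and parts.scheme == "http":
        port = 80
    netloc = f"{host}:{port}" if port is not None else host
    # Step 2: convert %xx escapes into their character equivalents.
    path = unquote(parts.path)
    # Step 3: drop the "#fragment" by passing an empty fragment.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


print(canonicalize("http://www.example.com/%7Efred/x.html#top"))
# → http://www.example.com:80/~fred/x.html
```

This sketch decodes every escape, including `%2F`; a production crawler would probably leave escaped reserved characters alone, since decoding them can change what the path means.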
  • 9. Symbolic link cycles
      [Figure: two directory trees rooted at /, each containing index.html and subdir]
      (a) subdir is a directory
      (b) subdir is an upward symbolic link
  • 10. Dynamic Virtual Web Spaces
      • It is possible to publish a URL that looks like a normal file but is really a gateway application.
      • This application can generate HTML on the fly that contains links to imaginary URLs on the same server. When these imaginary URLs are requested, new imaginary URLs are generated.
      • A malicious web server of this kind takes the poor robot on an Alice-in-Wonderland journey through an infinite virtual space, even if the web server doesn’t really contain any files. The trap can be hard for the robot to detect, because the HTML and URLs may look different every time.
      • For example, a CGI-based calendaring program.
  • 11. Malicious dynamic web space example
  • 12. Techniques for avoiding loops
      • Canonicalizing URLs
      • Breadth-first crawling
      • Throttling: limit the number of pages the robot can fetch from a web site in a period of time.
      • Limit URL size
          • Avoids the symbolic link cycle problem.
          • Problem: many sites use URLs to maintain user state.
      • URL/site blacklist (vs. “excluding robots”)
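Several of these techniques combine naturally in one crawl loop. The sketch below is an assumption-laden illustration, not a reference implementation: `crawl_bfs`, `get_links`, and both limits are hypothetical names, the link graph is an in-memory dictionary instead of real HTTP fetches, and throttling is simplified to a per-crawl page cap per site where a real robot would limit pages per unit of time.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_bfs(root_set, get_links, max_pages_per_site=100, max_url_len=1024):
    """Breadth-first crawl with a visited set, a per-site cap, and a URL size limit."""
    queue = deque(root_set)
    visited = set(root_set)   # cycle avoidance: never enqueue a URL twice
    per_site = {}             # pages fetched so far from each site
    order = []
    while queue:
        url = queue.popleft()             # FIFO queue gives breadth-first order
        site = urlsplit(url).netloc
        if per_site.get(site, 0) >= max_pages_per_site:
            continue                      # simplified throttle: skip over-quota sites
        per_site[site] = per_site.get(site, 0) + 1
        order.append(url)
        for link in get_links(url):
            # Overly long URLs are a common symptom of a loop, so drop them.
            if len(link) <= max_url_len and link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# Hypothetical link graph containing a cycle: a/1 -> a/2 -> a/1.
graph = {
    "http://a/1": ["http://a/2", "http://b/1"],
    "http://a/2": ["http://a/1", "http://a/3"],
    "http://a/3": [],
    "http://b/1": ["http://a/1"],
}
print(crawl_bfs(["http://a/1"], lambda u: graph.get(u, []), max_pages_per_site=2))
# → ['http://a/1', 'http://a/2', 'http://b/1']
```

Breadth-first order matters here because a trap on one site consumes only that site's share of the queue, while the robot still reaches pages on other sites.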
  • 13. Techniques for avoiding loops (cont.)
      • Pattern detection
          • e.g., “subdir/subdir/subdir…”
          • e.g., “subdir/images/subdir/images/subdir/…”
      • Content fingerprinting
          • Compute a checksum of each page’s content; the odds of two different pages having the same checksum are small.
          • Message digest functions such as MD5 are popular for this purpose.
      • Human monitoring
          • Design your robot with diagnostics and logging, so that human beings can easily monitor the robot’s progress and be warned quickly if something unusual is happening.
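Pattern detection and content fingerprinting are both easy to prototype. In the sketch below, `looks_like_cycle`, `is_duplicate`, and the threshold of three repeats are hypothetical choices made for illustration. The first function flags a path in which the same run of segments repeats back to back; the second remembers the MD5 digest of every page body seen so far.

```python
import hashlib


def looks_like_cycle(path, min_repeats=3):
    """True if some run of path segments repeats min_repeats times in a row."""
    segs = [s for s in path.split("/") if s]
    n = len(segs)
    for period in range(1, n // min_repeats + 1):
        for start in range(0, n - period * min_repeats + 1):
            unit = segs[start:start + period]
            if all(
                segs[start + k * period:start + (k + 1) * period] == unit
                for k in range(min_repeats)
            ):
                return True
    return False


seen_digests = set()


def is_duplicate(body: bytes) -> bool:
    """True if a page with identical content has already been fingerprinted."""
    digest = hashlib.md5(body).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False


print(looks_like_cycle("/subdir/subdir/subdir/index.html"))                  # True
print(looks_like_cycle("/subdir/images/subdir/images/subdir/images/a.gif"))  # True
print(looks_like_cycle("/docs/guide/intro.html"))                            # False
```

An exact digest only catches byte-identical pages; a page that embeds a timestamp or counter would produce a fresh fingerprint on every fetch.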
  • 14. Robotic HTTP
      • Robots are no different from any other HTTP client program.
      • Many robots try to implement the minimum amount of HTTP needed to request the content they seek.
      • It is recommended that robot implementers send some basic header information to notify the site of the capabilities of the robot, the robot’s identity, and where it originated.
  • 15. Identifying Request Headers
      • User-Agent: tells the server the robot’s name.
      • From: gives the email address of the robot’s user/administrator.
      • Accept: tells the server what media types are okay to send (e.g., only fetch text and sound).
      • Referer: tells the server how the robot found links to this site’s content.
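A robot written with Python's standard `urllib.request` could attach these headers as sketched below. The robot name `ExampleBot/1.0`, the contact address, and the helper name `build_robot_request` are placeholders invented for this example; the request is only constructed, not sent.

```python
import urllib.request


def build_robot_request(url, referer=None):
    """Build (but do not send) a request carrying the identifying headers."""
    headers = {
        "User-Agent": "ExampleBot/1.0",     # hypothetical robot name
        "From": "bot-admin@example.com",    # hypothetical contact address
        "Accept": "text/html, text/plain",  # media types the robot can handle
    }
    if referer:
        # The page on which the robot found the link to `url`.
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


req = build_robot_request(
    "http://www.example.com/index.html",
    referer="http://www.example.com/",
)
for name, value in req.header_items():
    print(f"{name}: {value}")
```

When such a request is eventually sent, `urllib` fills in the `Host` header from the URL, which covers the virtual-hosting case as well.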
  • 16. Virtual docroots cause trouble if no Host header is sent
      • The robot tries to request index.html from www.csie.ncnu.edu.tw, but does not include a Host header.
      • The server is configured to serve both sites (www.ncnu.edu.tw and www.csie.ncnu.edu.tw), but serves www.ncnu.edu.tw by default.
      Request message (web robot client):
          GET /index.html HTTP/1.0
          User-agent: ShopBot 1.0
      Response message:
          HTTP/1.0 200 OK
          […]
          <HTML>
          <TITLE>National Chi Nan University</TITLE>
          […]
  • 17. What else a robot should support
      • Virtual hosting
          • Not sending a Host header can lead robots to associate the wrong content with a particular URL.
      • Conditional requests
          • Use conditional HTTP requests (as in cache revalidation) to minimize the amount of content retrieved.
      • Response handling
          • Status codes: 200 OK, 404 Not Found, 304 Not Modified
          • Entities: <meta http-equiv="refresh" content="1; URL=index.html">
      • User-Agent targeting
          • Webmasters should keep in mind that many robots will visit their sites. Many sites optimize content for various user agents (Internet Explorer or Netscape).
          • Problem: “your browser does not support frames.”
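A conditional request can be sketched as follows, again using only the standard library. The helper name `build_conditional_request` and its parameters are hypothetical; the request is built but not sent. `If-Modified-Since` carries the time of the robot's previous fetch, and `If-None-Match` carries a previously received entity tag if the robot stored one.

```python
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_fetch_epoch, etag=None):
    """Build a revalidation request: the server may answer 304 with no body."""
    req = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT.
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    if etag:
        req.add_header("If-None-Match", etag)
    return req


req = build_conditional_request("http://www.example.com/index.html", 0)
print(req.get_header("If-modified-since"))
# → Thu, 01 Jan 1970 00:00:00 GMT
```

If the page has not changed, the server can reply with 304 and no body; with `urllib.request.urlopen` that reply typically surfaces as an `HTTPError` whose `code` is 304, so a robot would catch it and keep its stored copy.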
  • 18. Misbehaving Robots
      • Runaway robots: robots issue HTTP requests as fast as they can.
      • Stale URLs: robots visit old lists of URLs.
      • Long, wrong URLs: may reduce the web server’s performance, clutter the server’s access logs, or even crash the server.
      • Nosy robots: some robots may get URLs that point to private data and make that data easily accessible through search engines.
      • Dynamic gateway access: robots don’t always know what they are accessing.
  • 19. Excluding Robots
      [Figure: a robot requesting pages from www.ncnu.edu.tw]
      The robot parses the robots.txt file and determines whether it is allowed to access the acetylene-torches.html file. It is, so it proceeds with the request.
  • 20. robots.txt format
      # Allow google and csiebot to crawl the public parts of our site,
      # but no other robots are allowed to crawl anything on our site.
      User-Agent: googlebot
      User-Agent: csiebot
      Disallow: /private

      User-Agent: *
      Disallow: /
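Python's standard library ships a parser for this format, `urllib.robotparser`, which makes the record semantics easy to check. The sketch below feeds it a robots.txt text directly instead of fetching one; the host `www.example.com` and the helper `build_rules` are placeholders. Note that the catch-all record uses `Disallow: /`, because an empty `Disallow:` value means that nothing is disallowed.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# googlebot and csiebot may crawl everything except /private;
# every other robot is excluded from the whole site.
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
"""


def build_rules(text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
print(rules.can_fetch("csiebot", "http://www.example.com/index.html"))      # True
print(rules.can_fetch("csiebot", "http://www.example.com/private/a.html"))  # False
print(rules.can_fetch("otherbot", "http://www.example.com/index.html"))     # False
```

A live robot would more likely call `set_url()` and `read()` on the parser to download the file from the site before crawling it.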
  • 21. Robots Exclusion Standard versions
      Version | Title and description | Date
      0.0 | A Standard for Robot Exclusion: Martijn Koster’s original robots.txt mechanism with Disallow directive | June 1994
      1.0 | A Method for Web Robots Control: Martijn Koster’s IETF draft with additional support for Allow | Nov. 1996
      2.0 | An Extended Standard for Robot Exclusion: Sean Conner’s extension including regex and timing information; not widely supported | Nov. 1996
  • 22. Robots.txt path matching examples
      Rule path | URL path | Match? | Comments
      /tmp | /tmp | yes | Rule path == URL path
      /tmp | /tmpfile.html | yes | Rule path is a prefix of URL path
      /tmp | /tmp/a.html | yes | Rule path is a prefix of URL path
      /tmp/ | /tmp | no | /tmp/ is not a prefix of /tmp
      (empty) | README.TXT | yes | Empty rule path matches everything
      /~fred/hi.html | %7Efred/hi.html | yes | %7E is treated the same as ~
      /%7Efred/hi.html | /~fred/hi.html | yes | %7E is treated the same as ~
      /%7efred/hi.html | /%7Efred/hi.html | yes | Case isn’t significant in escapes
      /~fred/hi.html | ~fred%2Fhi.html | no | %2F is slash, but slash is a special case that must match exactly
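The matching rule behind these examples is a prefix test applied after escapes are normalized. The sketch below is one possible reading of it, with hypothetical function names: every `%xx` escape is decoded except `%2F`, which is kept (uppercased) so that an escaped slash never equals a literal one. The URL paths in the usage lines are written with a leading slash.

```python
import re
from urllib.parse import unquote

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize(path):
    """Decode %xx escapes, but keep %2F so it stays distinct from '/'."""
    def decode(match):
        if match.group(1).upper() == "2F":
            return "%2F"
        return unquote(match.group(0))
    return _ESCAPE.sub(decode, path)


def rule_matches(rule_path, url_path):
    """True if the rule path is a prefix of the URL path after normalization."""
    return _normalize(url_path).startswith(_normalize(rule_path))


print(rule_matches("/tmp", "/tmpfile.html"))                 # True
print(rule_matches("/tmp/", "/tmp"))                         # False
print(rule_matches("/%7efred/hi.html", "/%7Efred/hi.html"))  # True
print(rule_matches("/~fred/hi.html", "/~fred%2Fhi.html"))    # False
```

Because the empty string is a prefix of every string, an empty rule path matches every URL path without any special casing.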
  • 23. HTML Robot-control Meta Tags
      • e.g., <META NAME="ROBOTS" CONTENT=directive-list>
      • Directive list:
          • NOINDEX: do not process this document’s content.
          • NOFOLLOW: do not crawl any outgoing links from this page.
          • INDEX
          • FOLLOW
          • NOARCHIVE: do not cache a local copy of the page.
          • ALL (equivalent to INDEX, FOLLOW)
          • NONE (equivalent to NOINDEX, NOFOLLOW)
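A robot can read these directives with the standard `html.parser` module. In the sketch below the class `RobotsMetaParser` and the function `robots_policy` are hypothetical names; the function returns a pair `(may_index, may_follow)` and assumes that a page without a robots META tag permits both.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for directive in (attr.get("content") or "").split(","):
            directive = directive.strip().upper()
            if directive:
                self.directives.add(directive)


def robots_policy(html):
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"NOINDEX", "NONE"} & found)
    may_follow = not ({"NOFOLLOW", "NONE"} & found)
    return may_index, may_follow


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
print(robots_policy(page))  # (False, True)
```

NOARCHIVE is collected in `directives` as well but is not part of the returned pair; a robot that keeps cached copies would check for it separately.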
  • 24. Additional META tag directives (name=, content=, description)
      • DESCRIPTION (content=<text>): allows an author to define a short text summary of the web page. Many search engines look at META DESCRIPTION tags, allowing page authors to specify appropriate short abstracts to describe their web pages.
          <meta name="description" content="Welcome to Mary’s Antiques web site">
      • KEYWORDS (content=<comma list>): associates a comma-separated list of words that describes the web page, to assist in keyword searches.
          <meta name="keywords" content="antiques,mary,furniture,restoration">
      • REVISIT-AFTER* (content=<no. days>): instructs the robot or search engine that the page should be revisited, presumably because it is subject to change, after the specified number of days.
          <meta name="revisit-after" content="10 days">
      * This directive is not likely to have wide support.
  • 25. Guidelines for web robot operators (Robot Etiquette)
  • 26. Guidelines for web robot operators (cont.)
  • 27. Guidelines for web robot operators (cont.)
  • 28. Guidelines for web robot operators (cont.)
  • 29. Guidelines for web robot operators (cont.)
  • 30. Modern Search Engine Architecture
      [Figure labels: web search users, web search gateway, query engine, full-text index database, search engine crawler/indexer, web servers; the two halves of the diagram are labeled “Query engine” and “Crawling and indexing”]
  • 32. Posting the Query
      • The user fills out an HTML search form (with a GET action HTTP method) on the site in a browser and hits Submit.
      • Query: “drills”; results: file “BD.html”
      Request message (client):
          GET /search.html?query=drills HTTP/1.1
          Host: www.csie.ncnu.edu.tw
          Accept: *
          User-agent: ShopBot
      Response message (search gateway at www.csie.ncnu.edu.tw):
          HTTP/1.1 200 OK
          Content-type: text/html
          Content-length: 1037
          <HTML>
          <HEAD><TITLE>Search Results</TITLE>
          […]
  • 33. Reference (HW#4)
      • Paper reading: “Searching the Web”
      • Paper reading: “Hyperlink analysis for the Web,” IEEE Internet Computing, 2001.
      • http://www.searchtools.com: Search Tools for Web Sites and Intranets, resources for search tools and robots.
      • http://www.robotstxt.org/wc/robots.html: The Web Robots Pages, resources for robot developers, including the registry of Internet Robots.
      • http://www.searchengineworld.com: Search Engine World, a resource for search engines and robots.
      • http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm: RobotRules Perl source.
      • http://www.conman.org/people/spc/robots2.html: An Extended Standard for Robot Exclusion.
      • Managing Gigabytes: Compressing and Indexing Documents and Images, Witten, I., Moffat, A., and Bell, T., Morgan Kaufmann.