2. Outline
Robot applications
How it works
Cycle Avoidance
3. Applications
Behavior of web robots
Wander from web site to web site (recursively):
1. Fetching content,
2. Following hyperlinks,
3. Processing the data they find.
Colorful names
Crawlers,
Spiders,
Worms,
Bots
4. Where to Start: The “Root Set”
[Figure: a web of linked documents A through U. A root set such as {A, G, L, S} is chosen so that every page is reachable from at least one starting point, even though no single page links to all the others.]
5. Cycle Avoidance
[Figure: crawling pages A, B, and C, where C links back to A. (a) The robot fetches page A and follows its link, fetching B. (b) The robot follows the next link and fetches page C. (c) The robot follows the next link and is back at A: a cycle.]
6. Loops
Cycles are bad for crawlers for three reasons:
They waste the robot's time and space.
They can overwhelm the web site with repeated requests.
They fill the crawl results with duplicate content.
7. Data structures for robots
Trees and hash tables
Lossy presence bit maps (see the sketch below)
Checkpoints
Save the list of visited URLs to disk, in case the robot crashes.
Partitioning
Robot farms, each robot assigned its own slice of the URL space.
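
A minimal Python sketch of a lossy presence bit map, assuming MD5 as the hash; LossyPresenceMap is an illustrative name, not a standard class. The structure trades accuracy for space: a hash collision can make the robot skip a URL it has never actually visited.

import hashlib

class LossyPresenceMap:
    """Fixed-size presence bit map for visited URLs."""
    def __init__(self, nbits=1 << 20):
        self.nbits = nbits
        self.bits = bytearray(nbits // 8)

    def _slot(self, url):
        h = int(hashlib.md5(url.encode()).hexdigest(), 16) % self.nbits
        return h // 8, 1 << (h % 8)          # (byte index, bit mask)

    def add(self, url):
        byte, mask = self._slot(url)
        self.bits[byte] |= mask

    def __contains__(self, url):
        byte, mask = self._slot(url)
        return bool(self.bits[byte] & mask)

seen = LossyPresenceMap()
seen.add("http://www.ncnu.edu.tw/index.html")
print("http://www.ncnu.edu.tw/index.html" in seen)   # True
print("http://www.ncnu.edu.tw/other.html" in seen)   # False, barring a collision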
8. Canonicalizing URLs
Most web robots try to eliminate the obvious
aliases by “canonicalizing” URLs into a standard
form (a sketch follows this list), by:
Adding “:80” to the hostname, if the port
isn't specified.
Converting all %xx escaped characters into
their character equivalents.
Removing “#” fragment tags.
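
A minimal canonicalization sketch in Python implementing just the three rules above; real robots apply more rules (e.g., lowercasing the hostname), which this sketch omits.

from urllib.parse import urlsplit, urlunsplit, unquote

def canonicalize(url):
    parts = urlsplit(url)
    netloc = parts.netloc
    if ":" not in netloc:               # add ":80" if no port is specified
        netloc += ":80"
    path = unquote(parts.path)          # decode %xx escapes
    # drop the "#fragment": it names a spot inside the same document
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

print(canonicalize("http://www.ncnu.edu.tw/%7Efred/hi.html#intro"))
# http://www.ncnu.edu.tw:80/~fred/hi.html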
9. Symbolic link cycles
[Figure: two filesystem trees rooted at “/”, each containing index.html and subdir. (a) subdir is a real directory holding index.html and logo.gif. (b) subdir is an upward symbolic link back to “/”, so /subdir/index.html, /subdir/subdir/index.html, … all resolve to the same file, creating a cycle.]
10. Dynamic Virtual Web Spaces
It is possible to publish a URL that looks like a normal
file but really is a gateway application.
This application can generate HTML on the fly that
contains links to imaginary URLs on the same server.
When these imaginary URLs are requested, new imaginary
URLs are generated.
Such a malicious web server takes the poor robot on
an Alice-in-Wonderland journey through an infinite virtual
space, even if the web server doesn't really contain any
files. The robot may find this trap hard to detect,
because the HTML and URLs can look different every
time.
For example, a CGI-based calendaring program whose
every page links to the next month, forever.
12. Techniques for avoiding loops
Canonicalizing URLs
Breadth-first crawling
Throttling (a sketch follows this list)
Limit the number of pages the robot can fetch from a
web site in a period of time.
Limit URL size
Avoids the symbolic-link cycle problem.
Problem: many sites use long URLs to maintain user state.
URL/site blacklist
Maintained by hand; distinct from the voluntary “Excluding
Robots” mechanism covered later.
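
A minimal per-site throttling sketch in Python; the one-request-per-two-seconds rate is an illustrative assumption, not a standard.

import time

class Throttle:
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last_fetch = {}            # hostname -> time of last request

    def wait(self, host):
        now = time.monotonic()
        elapsed = now - self.last_fetch.get(host, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_fetch[host] = time.monotonic()

throttle = Throttle()
for host in ["www.ncnu.edu.tw", "www.ncnu.edu.tw"]:
    throttle.wait(host)                 # the second call sleeps ~2 seconds
    # ... fetch one page from host here ...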
13. Techniques for avoiding loops
Pattern detection
e.g., “subdir/subdir/subdir…”
e.g., “subdir/images/subdir/images/subdir/…”
Content fingerprinting (a sketch follows this list)
A checksum computed over the page content; the odds of two
different pages having the same checksum are small.
Message digest functions such as MD5 are popular for this
purpose.
Human monitoring
Design your robot with diagnostics and logging, so
human beings can easily monitor the robot's progress and be
warned quickly if something unusual is happening.
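
A minimal content-fingerprinting sketch in Python using MD5, as the slide suggests; the in-memory seen-set is an illustrative simplification.

import hashlib

seen_fingerprints = set()

def is_duplicate(content: bytes) -> bool:
    fp = hashlib.md5(content).hexdigest()
    if fp in seen_fingerprints:
        return True                     # same checksum: almost surely the same page
    seen_fingerprints.add(fp)
    return False

print(is_duplicate(b"<html>hello</html>"))   # False: first sighting
print(is_duplicate(b"<html>hello</html>"))   # True: duplicate content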
14. Robotic HTTP
No different from any other HTTP client program.
Many robots try to implement the minimum
amount of HTTP needed to request the content
they seek.
It is recommended that robot implementers
send some basic header information to notify
the site of the robot's capabilities, the robot's
identity, and where it originated.
15. Identifying Request Headers
User-Agent
Tells the server the robot's name.
From
Gives the email address of the robot's user/administrator.
Accept
Tells the server what media types are okay to send
(e.g., only fetch text and sound).
Referer
Tells the server how the robot found links to this site's
content.
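
A minimal sketch in Python of a robot request carrying these identifying headers; the robot name "ShopBot" is taken from the slides, while the From and Referer values are illustrative assumptions.

import urllib.request

req = urllib.request.Request(
    "http://www.csie.ncnu.edu.tw/index.html",
    headers={
        "User-Agent": "ShopBot/1.0",              # the robot's name
        "From": "shopbot-admin@example.com",      # responsible party (illustrative)
        "Accept": "text/*",                       # only text media types
        "Referer": "http://www.example.com/links.html",  # where the link was found
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))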
16. Virtual docroots cause trouble if no Host header is sent
[Figure: a web robot client requests index.html from
www.csie.ncnu.edu.tw but does not include a Host header. The server
is configured to serve both www.ncnu.edu.tw and www.csie.ncnu.edu.tw,
and serves www.ncnu.edu.tw by default, so the robot silently gets the
wrong site's content.]
Request message:
GET /index.html HTTP/1.0
User-Agent: ShopBot 1.0
Response message:
HTTP/1.0 200 OK
[…]
<HTML>
<TITLE>National Chi Nan University</TITLE>
[…]
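
A minimal sketch in Python of the fix: send a Host header so the server can select the right virtual site; the hostnames are taken from the slides.

import http.client

conn = http.client.HTTPConnection("www.csie.ncnu.edu.tw")
conn.request("GET", "/index.html",
             headers={"Host": "www.csie.ncnu.edu.tw",   # selects the virtual site
                      "User-Agent": "ShopBot/1.0"})
resp = conn.getresponse()
print(resp.status, resp.reason)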
17. What else a robot should support
Support Virtual Hosting
Not sending the Host header can lead robots to associate the
wrong content with a particular URL.
Conditional Requests
Minimize the amount of content retrieved by issuing conditional
HTTP requests, as in cache revalidation (a sketch follows this list).
Response Handling
Status codes: 200 OK, 404 Not Found, 304 Not Modified.
Entities: <meta http-equiv="refresh" content="1; URL=index.html">
User-Agent Targeting
Webmasters should keep in mind that many robots will visit their
sites. Many sites optimize content for particular user agents
(e.g., IE or Netscape).
Problem: “your browser does not support frames.”
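
A minimal conditional-request sketch in Python using If-Modified-Since; the URL and date are illustrative assumptions.

import urllib.request, urllib.error

req = urllib.request.Request(
    "http://www.ncnu.edu.tw/index.html",
    headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"},
)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()          # 200 OK: content changed, re-crawl it
except urllib.error.HTTPError as err:
    if err.code == 304:             # 304 Not Modified: skip, saving bandwidth
        body = None
    else:
        raise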
18. Misbehaving Robots
Runaway robots
Robots that issue HTTP requests as fast as they can.
Stale URLs
Robots visit old lists of URLs, requesting pages that no
longer exist.
Long, wrong URLs
May reduce the web server's performance, clutter the server's
access logs, and even crash the server.
Nosy robots
Some robots fetch URLs that point to private data and make
that data easily accessible through search engines.
Dynamic gateway access
Robots don't always know what they are accessing; they may
request URLs whose content a gateway application must compute.
19. Excluding Robots
[Figure: before crawling www.ncnu.edu.tw, the robot first fetches and parses the site's robots.txt file and determines whether it is allowed to access the acetylene-torches.html file. It is, so it proceeds with the request.]
20. robots.txt format
# Allow googlebot and csiebot to crawl the public parts
# of our site; no other robots may crawl anything.
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
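
A minimal sketch of honoring robots.txt with Python's standard library; the hostname is taken from the slides, and the example URL is an illustrative assumption.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.ncnu.edu.tw/robots.txt")
rp.read()                               # fetch and parse robots.txt

url = "http://www.ncnu.edu.tw/private/report.html"
if rp.can_fetch("csiebot", url):
    print("allowed: proceed with the request")
else:
    print("disallowed: skip this URL")  # /private is disallowed above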
21. Robots Exclusion Standard versions
Version  Title and description                                  Date
0.0      A Standard for Robot Exclusion: Martijn Koster's       June 1994
         original robots.txt mechanism with the Disallow
         directive
1.0      A Method for Web Robots Control: Martijn Koster's      Nov. 1996
         IETF draft with additional support for Allow
2.0      An Extended Standard for Robot Exclusion: Sean         Nov. 1996
         Conner's extension including regex and timing
         information; not widely supported
22. Robots.txt path matching examples
Rule path         URL path           Match?  Comments
/tmp              /tmp               yes     Rule path == URL path
/tmp              /tmpfile.html      yes     Rule path is a prefix of URL path
/tmp              /tmp/a.html        yes     Rule path is a prefix of URL path
/tmp/             /tmp               no      /tmp/ is not a prefix of /tmp
(empty)           README.TXT         yes     Empty rule path matches everything
/~fred/hi.html    /%7Efred/hi.html   yes     %7E is treated the same as ~
/%7Efred/hi.html  /~fred/hi.html     yes     %7E is treated the same as ~
/%7efred/hi.html  /%7Efred/hi.html   yes     Case isn't significant in escapes
/~fred/hi.html    /~fred%2Fhi.html   no      %2F is slash, but slash is a special
                                             case that must match exactly
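
A minimal Python sketch of the prefix-matching rule above; protecting %2F before decoding mirrors the slash special case, and the function name is illustrative.

from urllib.parse import unquote

def norm(path: str) -> str:
    # Protect %2F (an escaped slash) so it stays distinct from a real
    # slash, then decode the remaining %xx escapes (case-insensitive).
    path = path.replace("%2F", "\x00").replace("%2f", "\x00")
    return unquote(path).replace("\x00", "%2F")

def rule_matches(rule_path: str, url_path: str) -> bool:
    return norm(url_path).startswith(norm(rule_path))

print(rule_matches("/tmp", "/tmp/a.html"))                  # True
print(rule_matches("/tmp/", "/tmp"))                        # False
print(rule_matches("/%7efred/hi.html", "/%7Efred/hi.html")) # True
print(rule_matches("/~fred/hi.html", "/~fred%2Fhi.html"))   # False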
23. HTML Robot-control Meta Tags
e.g.
<META NAME="ROBOTS" CONTENT=directive-list>
Directive-list (a parsing sketch follows):
NOINDEX
Do not process this document's content.
NOFOLLOW
Do not crawl any outgoing links from this page.
INDEX (opposite of NOINDEX)
FOLLOW (opposite of NOFOLLOW)
NOARCHIVE
Do not cache a local copy of the page.
ALL (equivalent to INDEX, FOLLOW)
NONE (equivalent to NOINDEX, NOFOLLOW)
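
A minimal sketch of honoring robot-control META tags with Python's standard html.parser; real robots would handle more edge cases than this.

from html.parser import HTMLParser

class RobotMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").upper() == "ROBOTS":
            for d in a.get("content", "").upper().split(","):
                self.directives.add(d.strip())

parser = RobotMetaParser()
parser.feed('<html><head><meta name="robots" '
            'content="noindex,nofollow"></head></html>')
print("NOINDEX" in parser.directives)   # True: do not index this page
print("NOFOLLOW" in parser.directives)  # True: do not follow its links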
24. Additional META tag directives
name=           content=      Description
DESCRIPTION     <text>        Allows an author to define a short text summary
                              of the web page. Many search engines look at
                              META DESCRIPTION tags, allowing page authors to
                              specify short abstracts describing their pages.
                              <meta name="description"
                                    content="Welcome to Mary's Antiques web site">
KEYWORDS        <comma list>  Associates a comma-separated list of words that
                              describe the web page, to assist in keyword
                              searches.
                              <meta name="keywords"
                                    content="antiques,mary,furniture,restoration">
REVISIT-AFTER*  <no. days>    Instructs the robot or search engine that the
                              page should be revisited, presumably because it
                              is subject to change, after the specified number
                              of days.
                              <meta name="revisit-after" content="10 days">

* This directive is not likely to have wide support.
30. Modern Search Engine Architecture
[Figure: on one side, many web search users post queries through a web
search gateway to the query engine; on the other, the search engine's
crawler/indexer visits many web servers. Both halves share the
full-text index database: crawling and indexing fill it, and the query
engine reads from it.]
32. Posting the Query
[Figure: the user fills out an HTML search form (with a GET-action
HTTP method) in a browser and hits Submit. The client sends the query
"drills" to the search gateway at www.csie.ncnu.edu.tw, which replies
with the results file "BD.html".]
Request message:
GET /search.html?query=drills HTTP/1.1
Host: www.csie.ncnu.edu.tw
Accept: *
User-Agent: ShopBot
Response message:
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1037

<HTML>
<HEAD><TITLE>Search Results</TITLE>
[…]
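
A minimal sketch in Python of posting the query the same way the form does: URL-encode the parameters and send them with GET; the hostname and path are taken from the slides.

from urllib.parse import urlencode
import urllib.request

params = urlencode({"query": "drills"})            # -> "query=drills"
url = "http://www.csie.ncnu.edu.tw/search.html?" + params
with urllib.request.urlopen(url) as resp:
    results_page = resp.read()                     # the "Search Results" HTML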
33. References (HW#4)
Paper reading: “Searching the Web.”
Paper reading: “Hyperlink Analysis for the Web,” IEEE Internet Computing, 2001.
http://www.searchtools.com
Search Tools for Web Sites and Intranets: resources for search tools and robots.
http://www.robotstxt.org/wc/robots.html
The Web Robots Pages: resources for robot developers, including the registry of Internet Robots.
http://www.searchengineworld.com
Search Engine World: resources for search engines and robots.
http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm
RobotRules Perl source.
http://www.conman.org/people/spc/robots2.html
An Extended Standard for Robot Exclusion.
Witten, I., Moffat, A., and Bell, T., Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann.