2. Outline
Robot applications
How it works
Cycle Avoidance
3. Applications
Behavior of web robots
Wander from web site to web site (recursively):
1. Fetching content,
2. Following hyperlinks,
3. Processing the data they find.
Colorful names
Crawlers,
Spiders,
Worms,
Bots
4. Where to Start: The “Root Set”
[Figure: a web of linked documents A through U. A root set such as {A, G, L, S} is chosen so that every page is reachable from at least one starting point, even though no single page links to all the others.]
5. Cycle Avoidance
[Figure: crawling pages A, B, and C, where C links back to A. (a) The robot fetches page A and follows its link, fetching B. (b) The robot follows the next link and fetches page C. (c) The robot follows the next link and is back at A: a cycle.]
6. Loops
Cycles are bad for crawlers for three reasons:
They waste the robot's time and space.
They can overwhelm the web site with repeated requests.
They fill the crawl results with duplicate content.
7. Data structures for robots
Trees and hash tables
Lossy presence bit maps (see the sketch below)
Checkpoints
Save the list of visited URLs to disk, in case the robot crashes.
Partitioning
Robot farms, each robot assigned its own slice of the URL space.
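
A minimal Python sketch of a lossy presence bit map, assuming MD5 as the hash; LossyPresenceMap is an illustrative name, not a standard class. The structure trades accuracy for space: a hash collision can make the robot skip a URL it has never actually visited.

import hashlib

class LossyPresenceMap:
    """Fixed-size presence bit map for visited URLs."""
    def __init__(self, nbits=1 << 20):
        self.nbits = nbits
        self.bits = bytearray(nbits // 8)

    def _slot(self, url):
        h = int(hashlib.md5(url.encode()).hexdigest(), 16) % self.nbits
        return h // 8, 1 << (h % 8)          # (byte index, bit mask)

    def add(self, url):
        byte, mask = self._slot(url)
        self.bits[byte] |= mask

    def __contains__(self, url):
        byte, mask = self._slot(url)
        return bool(self.bits[byte] & mask)

seen = LossyPresenceMap()
seen.add("http://www.ncnu.edu.tw/index.html")
print("http://www.ncnu.edu.tw/index.html" in seen)   # True
print("http://www.ncnu.edu.tw/other.html" in seen)   # False, barring a collision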
8. Canonicalizing URLs
Most web robots try to eliminate the obvious
aliases by “canonicalizing” URLs into a standard
form (a sketch follows this list), by:
Adding “:80” to the hostname, if the port
isn't specified.
Converting all %xx escaped characters into
their character equivalents.
Removing “#” fragment tags.
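
A minimal canonicalization sketch in Python implementing just the three rules above; real robots apply more rules (e.g., lowercasing the hostname), which this sketch omits.

from urllib.parse import urlsplit, urlunsplit, unquote

def canonicalize(url):
    parts = urlsplit(url)
    netloc = parts.netloc
    if ":" not in netloc:               # add ":80" if no port is specified
        netloc += ":80"
    path = unquote(parts.path)          # decode %xx escapes
    # drop the "#fragment": it names a spot inside the same document
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

print(canonicalize("http://www.ncnu.edu.tw/%7Efred/hi.html#intro"))
# http://www.ncnu.edu.tw:80/~fred/hi.html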
9. Symbolic link cycles
[Figure: two filesystem trees rooted at “/”, each containing index.html and subdir. (a) subdir is a real directory holding index.html and logo.gif. (b) subdir is an upward symbolic link back to “/”, so /subdir/index.html, /subdir/subdir/index.html, … all resolve to the same file, creating a cycle.]
10. Dynamic Virtual Web Spaces
It is possible to publish a URL that looks like a normal
file but really is a gateway application.
This application can generate HTML on the fly that
contains links to imaginary URLs on the same server.
When these imaginary URLs are requested, new imaginary
URLs are generated.
Such a malicious web server takes the poor robot on
an Alice-in-Wonderland journey through an infinite virtual
space, even if the web server doesn't really contain any
files. The robot may find this trap hard to detect,
because the HTML and URLs can look different every
time.
For example, a CGI-based calendaring program whose
every page links to the next month, forever.
12. Techniques for avoiding loops
Canonicalizing URLs
Breadth-first crawling
Throttling (a sketch follows this list)
Limit the number of pages the robot can fetch from a
web site in a period of time.
Limit URL size
Avoids the symbolic-link cycle problem.
Problem: many sites use long URLs to maintain user state.
URL/site blacklist
Maintained by hand; distinct from the voluntary “Excluding
Robots” mechanism covered later.
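
A minimal per-site throttling sketch in Python; the one-request-per-two-seconds rate is an illustrative assumption, not a standard.

import time

class Throttle:
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last_fetch = {}            # hostname -> time of last request

    def wait(self, host):
        now = time.monotonic()
        elapsed = now - self.last_fetch.get(host, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_fetch[host] = time.monotonic()

throttle = Throttle()
for host in ["www.ncnu.edu.tw", "www.ncnu.edu.tw"]:
    throttle.wait(host)                 # the second call sleeps ~2 seconds
    # ... fetch one page from host here ...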
13. Techniques for avoiding loops
Pattern detection
e.g., “subdir/subdir/subdir…”
e.g., “subdir/images/subdir/images/subdir/…”
Content fingerprinting (a sketch follows this list)
A checksum computed over the page content; the odds of two
different pages having the same checksum are small.
Message digest functions such as MD5 are popular for this
purpose.
Human monitoring
Design your robot with diagnostics and logging, so
human beings can easily monitor the robot's progress and be
warned quickly if something unusual is happening.
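
A minimal content-fingerprinting sketch in Python using MD5, as the slide suggests; the in-memory seen-set is an illustrative simplification.

import hashlib

seen_fingerprints = set()

def is_duplicate(content: bytes) -> bool:
    fp = hashlib.md5(content).hexdigest()
    if fp in seen_fingerprints:
        return True                     # same checksum: almost surely the same page
    seen_fingerprints.add(fp)
    return False

print(is_duplicate(b"<html>hello</html>"))   # False: first sighting
print(is_duplicate(b"<html>hello</html>"))   # True: duplicate content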
14. Robotic HTTP
No different from any other HTTP client program.
Many robots try to implement the minimum
amount of HTTP needed to request the content
they seek.
It is recommended that robot implementers
send some basic header information to notify
the site of the robot's capabilities, the robot's
identity, and where it originated.
15. Identifying Request Headers
User-Agent
Tells the server the robot's name.
From
Gives the email address of the robot's user/administrator.
Accept
Tells the server what media types are okay to send
(e.g., only fetch text and sound).
Referer
Tells the server how the robot found links to this site's
content.
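
A minimal sketch in Python of a robot request carrying these identifying headers; the robot name "ShopBot" is taken from the slides, while the From and Referer values are illustrative assumptions.

import urllib.request

req = urllib.request.Request(
    "http://www.csie.ncnu.edu.tw/index.html",
    headers={
        "User-Agent": "ShopBot/1.0",              # the robot's name
        "From": "shopbot-admin@example.com",      # responsible party (illustrative)
        "Accept": "text/*",                       # only text media types
        "Referer": "http://www.example.com/links.html",  # where the link was found
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))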
16. Virtual docroots cause trouble if no Host header is sent
[Figure: a web robot client requests index.html from
www.csie.ncnu.edu.tw but does not include a Host header. The server
is configured to serve both www.ncnu.edu.tw and www.csie.ncnu.edu.tw,
and serves www.ncnu.edu.tw by default, so the robot silently gets the
wrong site's content.]
Request message:
GET /index.html HTTP/1.0
User-Agent: ShopBot 1.0
Response message:
HTTP/1.0 200 OK
[…]
<HTML>
<TITLE>National Chi Nan University</TITLE>
[…]
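
A minimal sketch in Python of the fix: send a Host header so the server can select the right virtual site; the hostnames are taken from the slides.

import http.client

conn = http.client.HTTPConnection("www.csie.ncnu.edu.tw")
conn.request("GET", "/index.html",
             headers={"Host": "www.csie.ncnu.edu.tw",   # selects the virtual site
                      "User-Agent": "ShopBot/1.0"})
resp = conn.getresponse()
print(resp.status, resp.reason)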
17. What else a robot should support
Support Virtual Hosting
Not sending the Host header can lead robots to associate the
wrong content with a particular URL.
Conditional Requests
Minimize the amount of content retrieved by issuing conditional
HTTP requests, as in cache revalidation (a sketch follows this list).
Response Handling
Status codes: 200 OK, 404 Not Found, 304 Not Modified.
Entities: <meta http-equiv="refresh" content="1; URL=index.html">
User-Agent Targeting
Webmasters should keep in mind that many robots will visit their
sites. Many sites optimize content for particular user agents
(e.g., IE or Netscape).
Problem: “your browser does not support frames.”
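
A minimal conditional-request sketch in Python using If-Modified-Since; the URL and date are illustrative assumptions.

import urllib.request, urllib.error

req = urllib.request.Request(
    "http://www.ncnu.edu.tw/index.html",
    headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"},
)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()          # 200 OK: content changed, re-crawl it
except urllib.error.HTTPError as err:
    if err.code == 304:             # 304 Not Modified: skip, saving bandwidth
        body = None
    else:
        raise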
18. Misbehaving Robots
Runaway robots
Robots that issue HTTP requests as fast as they can.
Stale URLs
Robots visit old lists of URLs, requesting pages that no
longer exist.
Long, wrong URLs
May reduce the web server's performance, clutter the server's
access logs, and even crash the server.
Nosy robots
Some robots fetch URLs that point to private data and make
that data easily accessible through search engines.
Dynamic gateway access
Robots don't always know what they are accessing; they may
request URLs whose content a gateway application must compute.
19. Excluding Robots
[Figure: before crawling www.ncnu.edu.tw, the robot first fetches and parses the site's robots.txt file and determines whether it is allowed to access the acetylene-torches.html file. It is, so it proceeds with the request.]
20. robots.txt format
# Allow googlebot and csiebot to crawl the public parts
# of our site; no other robots may crawl anything.
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
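
A minimal sketch of honoring robots.txt with Python's standard library; the hostname is taken from the slides, and the example URL is an illustrative assumption.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.ncnu.edu.tw/robots.txt")
rp.read()                               # fetch and parse robots.txt

url = "http://www.ncnu.edu.tw/private/report.html"
if rp.can_fetch("csiebot", url):
    print("allowed: proceed with the request")
else:
    print("disallowed: skip this URL")  # /private is disallowed above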
21. Robots Exclusion Standard versions
Version  Title and description                                  Date
0.0      A Standard for Robot Exclusion: Martijn Koster's       June 1994
         original robots.txt mechanism with the Disallow
         directive
1.0      A Method for Web Robots Control: Martijn Koster's      Nov. 1996
         IETF draft with additional support for Allow
2.0      An Extended Standard for Robot Exclusion: Sean         Nov. 1996
         Conner's extension including regex and timing
         information; not widely supported
22. Robots.txt path matching examples
Rule path         URL path           Match?  Comments
/tmp              /tmp               yes     Rule path == URL path
/tmp              /tmpfile.html      yes     Rule path is a prefix of URL path
/tmp              /tmp/a.html        yes     Rule path is a prefix of URL path
/tmp/             /tmp               no      /tmp/ is not a prefix of /tmp
(empty)           README.TXT         yes     Empty rule path matches everything
/~fred/hi.html    /%7Efred/hi.html   yes     %7E is treated the same as ~
/%7Efred/hi.html  /~fred/hi.html     yes     %7E is treated the same as ~
/%7efred/hi.html  /%7Efred/hi.html   yes     Case isn't significant in escapes
/~fred/hi.html    /~fred%2Fhi.html   no      %2F is slash, but slash is a special
                                             case that must match exactly
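
A minimal Python sketch of the prefix-matching rule above; protecting %2F before decoding mirrors the slash special case, and the function name is illustrative.

from urllib.parse import unquote

def norm(path: str) -> str:
    # Protect %2F (an escaped slash) so it stays distinct from a real
    # slash, then decode the remaining %xx escapes (case-insensitive).
    path = path.replace("%2F", "\x00").replace("%2f", "\x00")
    return unquote(path).replace("\x00", "%2F")

def rule_matches(rule_path: str, url_path: str) -> bool:
    return norm(url_path).startswith(norm(rule_path))

print(rule_matches("/tmp", "/tmp/a.html"))                  # True
print(rule_matches("/tmp/", "/tmp"))                        # False
print(rule_matches("/%7efred/hi.html", "/%7Efred/hi.html")) # True
print(rule_matches("/~fred/hi.html", "/~fred%2Fhi.html"))   # False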
23. HTML Robot-control Meta Tags
e.g.
<META NAME="ROBOTS" CONTENT=directive-list>
Directive-list (a parsing sketch follows):
NOINDEX
Do not process this document's content.
NOFOLLOW
Do not crawl any outgoing links from this page.
INDEX (opposite of NOINDEX)
FOLLOW (opposite of NOFOLLOW)
NOARCHIVE
Do not cache a local copy of the page.
ALL (equivalent to INDEX, FOLLOW)
NONE (equivalent to NOINDEX, NOFOLLOW)
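
A minimal sketch of honoring robot-control META tags with Python's standard html.parser; real robots would handle more edge cases than this.

from html.parser import HTMLParser

class RobotMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").upper() == "ROBOTS":
            for d in a.get("content", "").upper().split(","):
                self.directives.add(d.strip())

parser = RobotMetaParser()
parser.feed('<html><head><meta name="robots" '
            'content="noindex,nofollow"></head></html>')
print("NOINDEX" in parser.directives)   # True: do not index this page
print("NOFOLLOW" in parser.directives)  # True: do not follow its links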
24. Additional META tag directives
name=           content=      Description
DESCRIPTION     <text>        Allows an author to define a short text summary
                              of the web page. Many search engines look at
                              META DESCRIPTION tags, allowing page authors to
                              specify short abstracts describing their pages.
                              <meta name="description"
                                    content="Welcome to Mary's Antiques web site">
KEYWORDS        <comma list>  Associates a comma-separated list of words that
                              describe the web page, to assist in keyword
                              searches.
                              <meta name="keywords"
                                    content="antiques,mary,furniture,restoration">
REVISIT-AFTER*  <no. days>    Instructs the robot or search engine that the
                              page should be revisited, presumably because it
                              is subject to change, after the specified number
                              of days.
                              <meta name="revisit-after" content="10 days">

* This directive is not likely to have wide support.
30. Modern Search Engine Architecture
[Figure: on one side, many web search users post queries through a web
search gateway to the query engine; on the other, the search engine's
crawler/indexer visits many web servers. Both halves share the
full-text index database: crawling and indexing fill it, and the query
engine reads from it.]
32. Posting the Query
[Figure: the user fills out an HTML search form (with a GET-action
HTTP method) in a browser and hits Submit. The client sends the query
"drills" to the search gateway at www.csie.ncnu.edu.tw, which replies
with the results file "BD.html".]
Request message:
GET /search.html?query=drills HTTP/1.1
Host: www.csie.ncnu.edu.tw
Accept: *
User-Agent: ShopBot
Response message:
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1037

<HTML>
<HEAD><TITLE>Search Results</TITLE>
[…]
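
A minimal sketch in Python of posting the query the same way the form does: URL-encode the parameters and send them with GET; the hostname and path are taken from the slides.

from urllib.parse import urlencode
import urllib.request

params = urlencode({"query": "drills"})            # -> "query=drills"
url = "http://www.csie.ncnu.edu.tw/search.html?" + params
with urllib.request.urlopen(url) as resp:
    results_page = resp.read()                     # the "Search Results" HTML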
33. References (HW#4)
Paper reading: “Searching the Web.”
Paper reading: “Hyperlink Analysis for the Web,” IEEE Internet Computing, 2001.
http://www.searchtools.com
Search Tools for Web Sites and Intranets: resources for search tools and robots.
http://www.robotstxt.org/wc/robots.html
The Web Robots Pages: resources for robot developers, including the registry of Internet Robots.
http://www.searchengineworld.com
Search Engine World: resources for search engines and robots.
http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm
RobotRules Perl source.
http://www.conman.org/people/spc/robots2.html
An Extended Standard for Robot Exclusion.
Witten, I., Moffat, A., and Bell, T., Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann.