Tutorial given at ICWE'13, Aalborg, Denmark on 08.07.2013
Abstract:
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we introduce the audience to five topics: architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.
To cite this tutorial:
Please refer to http://dx.doi.org/10.1007/978-3-642-39200-9_49
1. ICWE’13 Tutorial:
CURRENT CHALLENGES IN WEB CRAWLING
Denis Shestakov (denshe at gmail-dot-com)
Department of Media Technology
School of Science, Aalto University, Finland
Version 1.5: 09.03.2015
Version 1.4: 08.07.2013
2. References to this tutorial
To cite please use:
D. Shestakov, "Current Challenges in Web Crawling," in Proc. ICWE 2013, 2013, pp. 518-521.
[BibTeX]
3. Speaker’s Bio
(2009-2013) Postdoc in Web Services Group, Aalto University, Finland
PhD thesis (2008) on limited coverage of web crawlers
Over ten years of experience in web crawling
[Photo: Web Services Group in 2011]
4. Speaker’s Info
As of 2013 / Current:
http://www.linkedin.com/in/dshestakov
http://www.mendeley.com/profiles/denis-shestakov/
http://www.researchgate.net/profile/Denis_Shestakov
https://mediatech.aalto.fi/~denis/
5. Tutorial Outline
OVERVIEW
Web crawling in a nutshell
Web structure & statistics
Large-scale crawling
Break
CHALLENGES
Collaborative web crawling
Crawling the deep Web
Crawling the multimedia content
Future directions
6. PART I: OVERVIEW
Visualization of http://media.tkk.fi/webservices by the aharef.info applet
7. Outline of Part I
Overview of Web Crawling
Web Crawling in a Nutshell
Applications
Industry vs. Academia
Web Ecosystem and Crawling
Web Structure & Statistics
Large-scale crawling
Basic architecture
Implementations
Design issues and considerations
8. Web Crawling in a Nutshell
Automatic harvesting of web content
Done by web crawlers (also known as robots, bots or spiders)
Follow a link from a set of links (the URL queue), download the page, extract all links, eliminate those already visited, add the rest to the queue
Then repeat
A set of policies is involved (e.g., ’ignore links to images’)
9. Web Crawling in a Nutshell
Example:
1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1
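As a rough illustration of this loop, a minimal sketch in Python; fetch_page and extract_links are hypothetical helpers, not part of the original slides:

    from collections import deque

    def crawl(seed_urls, max_pages=1000):
        frontier = deque(seed_urls)            # URL queue
        visited = set(seed_urls)               # URLs already queued
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            page = fetch_page(url)             # hypothetical helper: download the page
            if page is None:
                continue                       # fetch failed; skip this URL
            fetched += 1
            for link in extract_links(page):   # hypothetical helper: extract <a> href URLs
                if link not in visited:        # remove already visited (Step 3)
                    visited.add(link)
                    frontier.append(link)      # the rest go back to the queue (Step 4)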
10. Web Crawling in a Nutshell
In essence: a simple and naive process
However, a number of imposed ’restrictions’ make it much more complicated
Most complexities are due to the operating environment (the Web)
For example, do not overload web servers (challenging, as the distribution of web pages across web servers is non-uniform)
Or avoid web spam (not only useless, but it consumes resources and often spoils the collected content)
11. Web Crawling in a Nutshell
Crawler Agents
First in 1993: the Wanderer (written in Perl)
Over 1100 different crawler signatures (User-Agent strings in HTTP request headers) mentioned at http://www.crawltrack.net/crawlerlist.php
Educated guess on the overall number of different crawlers: at least several thousand
Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
12. Web Crawling in a Nutshell
Crawler Agents
For advanced needs, you may modify the code of existing projects in your preferred programming language
Crawlers play a big role on the Web
Bring more traffic to certain web sites than human visitors do
Generate a sizeable portion of the traffic to any (public) web site
Crawler traffic is important for emerging web sites
13. Web Crawling in a Nutshell
Classification
General/universal crawlers
- Not so many of them; lots of resources required
- Big web search engines
Topical/focused crawlers
- Pages/sites on a certain topic
- Crawling everything in one specific (e.g., national) web segment is rather general, though
Batch crawling
- One or several (static) snapshots
Incremental/continuous crawling
- Re-visiting
- Resources divided between fetching newly discovered pages and re-downloading previously crawled pages
- Search engines
14. Applications of Web Crawling
Web Search Engines
Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
Crawling is one of their three underlying technology stacks
15. Applications of Web Crawling
Web Search Engines
One of three underlying technology stacks
BTW, what are the other two and which is the most ’crucial’?
16. Applications of Web Crawling
Web Search Engines
What are the other two and which is the most ’crucial’?
The other two: indexing and query processing; the most ’crucial’ one is the query processor (particularly, ranking)
17. Applications of Web Crawling
Web Archiving
Digital preservation
A ’librarian’s’ view of the Web
The biggest: Internet Archive
Quite huge collections
Batch crawls
Primarily, collections of national web sites: sites at country-specific TLDs or physically hosted in a country
There are quite a few, and some are huge! See the list of Web Archiving Initiatives on Wikipedia
18. Applications of Web Crawling
Vertical Search Engines
Aggregating data from many sources on a certain topic
E.g., apartment search, car search
19. Applications of Web Crawling
Web Data Mining
“To get data to be actually mined”
Usually using focused crawlers
For example, opinion mining
Or digests of current happenings on the Web (e.g., what music people are listening to now)
20. Applications of Web Crawling
Web Monitoring
Monitoring sites/pages for changes and updates
21. Applications of Web Crawling
Detection of malicious web sites
Typically part of an anti-virus, firewall, search engine, etc. service
Building a list of such web sites and informing the user about the potential threat of visiting them
22. Applications of Web Crawling
Web site/application testing
Crawl a web site to check navigation through it, the validity of the links, etc.
Regression/security/... testing of a rich internet application (RIA) via crawling
Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out)
23. Applications of Web Crawling
Fighting crime! :) well, copyright violations
Crawl to find (media) items under copyright, or links to them
Regularly re-visiting ’suspicious’ web sites, forums, etc.
Tasks like finding terrorist chat rooms also go here
24. Applications of Web Crawling
Web Scraping
Extracting particular pieces of information from a group of typically similar pages
Used when an API to the data is not available
Interestingly, scraping might be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API
25. Applications of Web Crawling
Web Mirroring
Copying of web sites
Often hosting copies on different servers to ensure
constant accessibility
26. Industry vs. Academia
In the web crawling domain
Huge lag between industrial and academic web crawlers
- Research-wise and development-wise
- Algorithms, techniques and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
Industrial crawlers operate at web scale (= dozens of billions of pages)
- Only a few (three?) academic crawlers have dealt with more than one billion pages
- The academic scale is rather hundreds of millions
27. Industry vs. Academia
Re-crawling
- Batch crawls in academia
- Regular re-crawls by industrial crawlers
Evaluation of crawled data
- And hence corrections/improvements to crawlers
- Direct evaluation by users of search engines
- To some extent, artificial evaluation of academic crawls
28. Industry vs. Academia
Industrial (search engines’) crawlers are much more appreciated
- Eventually they attract visitors (= revenue/prestige/influence/...)
- It makes perfect sense to trick them
Academic crawlers just consume resources (e.g., network bandwidth)
- They don’t bring anything
- No point in playing tricks on them (assuming the site administrator bothers to differentiate them from search engines’ bots)
29. Web Ecosystem and Crawling
Pull vs. Push model
Web Content Provider (site owners)
Web Aggregators (crawler operators)
Aggregator pulls content
Content is not pushed to aggregators
30. Web Ecosystem and Crawling
Why not Push?
Pull is just easier for both parties
No ’agreement’ needed between provider and aggregator
No specific protocols for content providers – serving content is enough
Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed
31. Web Ecosystem and Crawling
Why not Push?
Still, the pull model has several disadvantages
What are they?
32. Web Ecosystem and Crawling
Why not Push?
Still, the pull model has several disadvantages
Push would avoid redundant requests from crawlers and give content providers more control over the content
33. Web Ecosystem and Crawling
Crawler politeness
Content providers possess some control over crawlers
Via special protocols to define access to parts of a site
Via direct banning of agents hitting a site too often
34. Web Ecosystem and Crawling
Crawler politeness
Robots.txt says what can(not) be crawled
Sitemaps is a newer protocol listing the URLs available for crawling, together with extra metadata (e.g., last modification, change frequency)
Example: no agent may visit any URL under yoursite/notcrawldir, except the agent called "goodsearcher"
User-agent: *
Disallow: /notcrawldir/
User-agent: goodsearcher
Disallow:
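A sketch of honouring such rules from Python, using the standard-library urllib.robotparser (the site URL below is a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://yoursite.example/robots.txt")   # placeholder location of robots.txt
    rp.read()                                          # fetch and parse the rules

    # generic agents are kept out of /notcrawldir/, goodsearcher is allowed in
    print(rp.can_fetch("*", "http://yoursite.example/notcrawldir/page.html"))            # False
    print(rp.can_fetch("goodsearcher", "http://yoursite.example/notcrawldir/page.html")) # True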
35. Web Structure & Statistics
Some numbers
The number of pages per host is not uniform: most hosts contain only a few pages, while others contain millions
Roughly 100 links on a page
Must try to keep all crawling threads busy
According to Google statistics (over 4 billion pages, 2010): fetching a page transfers about 320KB (textual content plus all embedded resources)
A page has 10-100KB of textual (HTML) content on average
One trillion URLs known by Google/Yahoo in 2008
36. Web Structure & Statistics
Some numbers
20 million web pages in 1995 (indexed by AltaVista)
One trillion (10^12) URLs known by Google/Yahoo in 2008
- The ’independent’ search engine Majestic12 (P2P crawling) confirms one billion items
This doesn’t mean one trillion indexed pages
Supposedly, the index has dozens of times fewer pages
Cool crawler fact: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months
- Throughput: 1000-1500 pages per second
- Over 30 billion discovered URLs
37. Web Structure & Statistics
Bow-tie model of the Web
Illustration taken from http://dx.doi.org/doi:10.1038/35012155
38. Basic Crawler Architecture
Crawler crawls the Web
Illustration taken from CMSC 476/676 course slides by Charles Nicholas
39. Basic Crawler Architecture
Typically in a distributed fashion
Illustration taken from CMSC 476/676 course slides by Charles Nicholas
40. Basic Crawler Architecture
URL Frontier
May include multiple URLs from the same host
Must avoid trying to fetch them all at the same time
Must try to keep all crawling threads busy
Prioritization also helps
41. Basic Crawler Architecture
Crawler Architecture
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
42. Basic Crawler Architecture
DNS
Given a URL, resolve its host name to an IP address
A distributed service – lookup latencies can be high (seconds)
A critical component
Common implementations of DNS lookup (e.g., nslookup) are synchronous: one request at a time
Remedies: asynchronous DNS resolving, pre-caching, batch DNS resolving
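A rough sketch of pre-caching lookups and keeping many of them in flight with a thread pool (the host names are placeholders; large crawlers typically use dedicated asynchronous resolvers instead):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    dns_cache = {}

    def resolve(host):
        # cache lookups so each host is resolved only once
        if host not in dns_cache:
            try:
                dns_cache[host] = socket.gethostbyname(host)
            except socket.gaierror:
                dns_cache[host] = None      # resolution failed
        return dns_cache[host]

    hosts = ["example.com", "example.org"]  # placeholder host list
    with ThreadPoolExecutor(max_workers=50) as pool:
        # many lookups in flight at once instead of one blocking request at a time
        for host, ip in zip(hosts, pool.map(resolve, hosts)):
            print(host, ip)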
43. Basic Crawler Architecture
Content seen?
If the fetched page is already in the base/index, don’t process it again
Document fingerprints (shingles)
Filtering
Filter out URLs due to ’politeness’ or restrictions on the crawl
Fetched robots.txt files are cached to avoid fetching them repeatedly
Duplicate URL Elimination
Check whether an extracted and filtered URL has already been passed to the frontier (batch crawling)
More complicated in continuous crawling (different URL frontier implementation)
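A toy sketch of a shingle-based "content seen?" test (the k and threshold values are arbitrary choices, not from the slides): hash overlapping word k-grams and compare the resulting fingerprint sets.

    import hashlib

    def shingles(text, k=5):
        # set of hashed overlapping k-word sequences (shingles)
        words = text.split()
        return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
                for i in range(max(len(words) - k + 1, 1))}

    def near_duplicate(text_a, text_b, threshold=0.9):
        a, b = shingles(text_a), shingles(text_b)
        if not a or not b:
            return False
        # Jaccard similarity of the two shingle sets
        return len(a & b) / len(a | b) >= threshold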
44. Basic Crawler Architecture
Distributed Crawling
Run multiple crawl threads, under different processes
(often at different nodes)
Nodes can be geographically distributed
Partition hosts being crawled into nodes
45. Basic Crawler Architecture
Host Splitter
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
46. Implementations
Popular languages: Perl, Java, Python, C/C++
Libraries for HTTP fetching, HTML parsing and asynchronous DNS resolving
Open-source, in Java: Heritrix, Nutch
47. Implementations
Simple code example in Perl
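The Perl listing from the original slide is not reproduced in this transcript; below is a comparable minimal sketch using only the Python standard library (urllib.request for fetching, html.parser for link extraction), with the example URL taken from the earlier slides:

    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            # collect href attributes of <a> tags, resolved against the base URL
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    url = "http://media.tkk.fi/webservices"      # example URL from the slides
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    print(parser.links)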
48. Large-scale Crawling
Objectives
High web coverage
High page freshness
High content quality
High download rate
Internal (I) and external (E) factors
Amount of hardware (I)
Network bandwidth (I)
Rate of web growth (E)
Rate of web change (E)
Amount of malicious content (e.g., spam, duplicates) (E)
49. Large-scale Crawling
Architecture of a sequential crawler
Seeds – list of starting URLs
Order of page visits determined by the frontier data structure
Stop condition (e.g., X pages fetched)
Illustration taken from Ch. 8 "Web Crawling" by Filippo Menczer in Bing Liu’s Web Data Mining (Springer, 2007)
50. Large-scale Crawling
Graph Traversal
Breadth-first search
- Implemented with a QUEUE (FIFO)
- Reaches pages with the shortest paths first
Depth-first search
- Implemented with a STACK (LIFO)
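In code, the only difference is which end of the frontier is popped; a tiny sketch (the seed URL is a placeholder):

    from collections import deque

    frontier = deque(["http://example.com"])   # placeholder seed

    # Breadth-first: FIFO queue, shortest paths from the seeds come first
    next_url = frontier.popleft()

    # Depth-first: LIFO stack, dives deep along one branch
    # next_url = frontier.pop()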
51. Large-scale Crawling
Some implementation notes
Get only the first part of pages (10-100KB)
Detect redirection loops
Handle all possible errors (e.g., server not responding), timeouts, etc.
Deal with lots of invalid HTML
Take care of dynamic pages
- Some are ’spider traps’ (think of the “Next month” link on a calendar)
- E.g., limit the number of pages per host
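A hedged sketch of two of these notes – reading only the first part of a page and bounding the wait with a timeout – using the standard urllib; the 100KB cap mirrors the slide, the timeout value is arbitrary:

    from urllib.request import urlopen
    from urllib.error import URLError

    def fetch_head_of_page(url, max_bytes=100 * 1024, timeout=10):
        # read at most max_bytes and give up after `timeout` seconds
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read(max_bytes)
        except (URLError, OSError):
            return None        # server not responding, DNS failure, timeout, ...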
52. Large-scale Crawling
Delays in crawling
Resolving host to IP address
Connecting a socket to server and sending request
Receiving requested page in response
Overlap delays by fetching many pages concurrently
53. Large-scale Crawling
Architecture of a concurrent crawler
Illustration taken from Ch. 8 "Web Crawling" by Filippo Menczer in Bing Liu’s Web Data Mining (Springer, 2007)
54. Large-scale Crawling
Design points: frontier data structure
Most links on a page refer to the same site/server
- Note: remember virtual hosting
Problem with a plain FIFO queue – too many requests to the same server
Common policy: delay the next request to a server by, say, 10 x the time it took to download the last page from that server
’Mercator’ scheme – maintain additional per-host queues alongside the main frontier queue
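A much-simplified sketch of that idea (not Mercator itself): one queue per host plus a "not before" timestamp, so a URL is handed out only after its host's delay has expired; the delay factor of 10 mirrors the policy above.

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class PoliteFrontier:
        def __init__(self, delay_factor=10.0):
            self.queues = defaultdict(deque)   # one FIFO queue per host
            self.not_before = {}               # host -> earliest next request time
            self.delay_factor = delay_factor   # e.g. 10 x last download time

        def add(self, url):
            self.queues[urlparse(url).netloc].append(url)

        def next_url(self):
            now = time.time()
            for host, queue in self.queues.items():
                if queue and self.not_before.get(host, 0) <= now:
                    return queue.popleft()
            return None                        # every host is currently 'cooling down'

        def report_fetch(self, url, download_time):
            # delay the next request to this host proportionally to the last download time
            host = urlparse(url).netloc
            self.not_before[host] = time.time() + self.delay_factor * download_time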
55. Large-scale Crawling
Design points: URL-seen test
Goal: do not add multiple instances of a URL to the frontier
For batch crawling, two operations are required: insertion and membership testing
For continuous crawling, one more operation: deletion
URLs are compressed (e.g., to a 10-byte hash value)
In-memory implementations: hash table, Bloom filter
Search engines keep all URLs in memory in the crawling cluster (hash table partitioned across nodes; partitioning can be based on the host part of the URL)
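A sketch of the in-memory variant with URLs compressed to 10-byte hashes, as on the slide (hashlib is from the standard library; a Bloom filter would trade exactness for even less memory):

    import hashlib

    seen = set()                       # in practice partitioned across cluster nodes by host

    def url_seen(url):
        # a 10-byte digest instead of the full URL keeps the table small
        digest = hashlib.sha1(url.encode()).digest()[:10]
        if digest in seen:
            return True                # membership test
        seen.add(digest)               # insertion
        return False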
56. Large-scale Crawling
Design points: URL-seen test
If in-memory storage is not possible, a disk-based hash table with caching is used
This limits the crawling rate to tens of pages per second, since disk lookups are slow
To scale, sequential reads/writes are faster and thus used instead
’Mercator/IRLbot’ scheme: merge (read and rewrite) the sorted hashes of visited URLs on disk with the hashes of just-extracted URLs
The delay due to batch merging is manageable
58. Outline of Part II
Challenges in Web Crawling
Collaborative Crawling
Deep Web Crawling
Crawling content behind search forms
Crawling JavaScript-rich web sites
Crawling Multimedia
Other Challenges in Crawling
Future Directions
References
59. Collaborative Crawling
Main considerations
Lots of redundant crawling
To get data (often on a specific topic) one needs to crawl broadly
- Often a lack of expertise when a large crawl is required
- Often, a lot is crawled but only a small subset is used
Too many redundant requests hit content providers
Idea: have one crawler do a very broad and intensive crawl, with many parties accessing the crawled data via an API
- Parties specify filters to select the required pages
Crawler as a common service
60. Collaborative Crawling
Some requirements
A filter language for specifying conditions
Efficient filter processing (millions of filters to process)
Efficient fetching (hundreds of pages per second)
Support for real-time requests
61. Collaborative Crawling
New component
Process a stream of documents against a filter index
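A toy sketch of that component (the filter conditions here are just URL substrings and keywords, far simpler than a real filter language; all identifiers are illustrative):

    filters = {
        # filter_id -> (url_substring, keyword); purely illustrative conditions
        "apartments-fi": ("rental", "helsinki"),
        "car-ads":       ("cars",   "diesel"),
    }

    def match_filters(url, text):
        # return the subscribers whose conditions the fetched document satisfies
        hits = []
        text_lower = text.lower()
        for filter_id, (url_part, keyword) in filters.items():
            if url_part in url and keyword in text_lower:
                hits.append(filter_id)
        return hits

    # every document coming off the fetch stream is routed to the matching subscribers
    print(match_filters("http://example.com/rental/listings", "Apartment in Helsinki ..."))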
62. Collaborative Crawling
Filter processing architecture
63. Collaborative Crawling
Filter processing architecture
64. Collaborative Crawling
Based on ’The Architecture and Implementation of an Extensible Web Crawler’ by Hsieh, Gribble and Levy, 2010 (illustrations on slides 61-62 are from Hsieh’s slides)
E.g., 80legs provides similar crawling services
In a way, this reconsiders the pull/push model of content delivery on the Web
65. Deep Web Crawling
Visualization of http://amazon.com by aharef.info applet
66. Deep Web Crawling
In a nutshell
The problem is in the yellow nodes (designating web form elements)
67. Deep Web Crawling
See slides on deep Web crawling at http://goo.gl/Oohoo
68. Crawling Multimedia Content
The Web is now a multimedia platform
Images, video and audio are an integral part of web pages (not just supplementing them)
Almost all crawlers, however, treat the Web as a textual repository
One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications
Hence, no real need to harvest multimedia
But state-of-the-art multimedia retrieval/computer vision techniques already provide adequate search quality
E.g., search for images with a cat and a man based on the actual image content (not the text around/close to the image)
In the case of video: a set of frames plus audio (which can be converted to textual form)
69. Crawling Multimedia Content
Challenges in crawling multimedia
Bigger load on web sites, since the files are bigger
More apparent copyright issues
More resources (e.g., bandwidth, storage space) required from a crawler
More complicated duplicate resolution
Re-visiting policy
70. Crawling Multimedia Content
Approaches
Utilize metadata info (fetch and analyse a small metadata file to decide on a full download); a sketch follows this list
Intelligent crawling: better ranking of URLs in the frontier (based on the specified domain of the crawl)
Move from the pull to the push model
API-directed crawling
- Access to data via predefined APIs
- Requires annotation/discovery of such APIs
Technically: use an additional component for the multimedia crawl
- With its own URL queue
- The main crawler component provides it with URLs to multimedia
- In return, it sends feedback to the main crawler to better score links in the frontier
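One possible reading of the metadata point (not necessarily what the slide’s authors had in mind) is to issue an HTTP HEAD request and inspect the headers before committing to a full download; a sketch with urllib, where the size threshold is a placeholder:

    from urllib.request import Request, urlopen

    def worth_downloading(url, max_bytes=50 * 1024 * 1024):
        # fetch only the headers, then decide whether to download the full object
        response = urlopen(Request(url, method="HEAD"), timeout=10)
        content_type = response.headers.get("Content-Type", "")
        content_length = int(response.headers.get("Content-Length", 0))
        # conservative: skip objects that are not media or whose declared size is too large/missing
        return content_type.startswith(("image/", "video/", "audio/")) and 0 < content_length <= max_bytes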
71. Crawling Multimedia Content
Scalable Multimedia Web Observatory of the ARCOMEM project (http://www.arcomem.eu)
Focus on web archiving issues
Uses several crawlers
- A ’standard’ crawler for regular web pages
- An API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube)
- A deep Web crawler able to extract information from pre-defined web sites
Data can be exported in WARC (Web ARChive) files and in RDF
72. Other Crawling Challenges
Ordering policy
Resources are limited, while the number of pages to visit is essentially infinite
The decision has to be made based on the URL itself
PageRank-like metrics can be used (a prioritized-frontier sketch follows this list)
More complicated in the case of incremental crawls
Focused crawling
Avoid links leading to content outside the topic of interest
The content of a page can be taken into account when deciding where a particular link leads
Setting a good seed is a challenge
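A sketch of ordering the frontier by such a score with a heap; the score function below is a placeholder (a real crawler would plug in PageRank-like or topical scores):

    import heapq

    frontier = []                      # min-heap of (-score, url)

    def score(url):
        # placeholder: prefer short URLs close to the site root
        return 1.0 / (1 + url.count("/"))

    def push(url):
        heapq.heappush(frontier, (-score(url), url))   # negate for highest-score-first

    def pop():
        return heapq.heappop(frontier)[1]

    push("http://example.com/")
    push("http://example.com/a/very/deep/page.html")
    print(pop())                       # the root page comes out first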
73. Other Crawling Challenges
Re-visiting policy
Generating good seed URLs
Avoiding redundant content
Avoid visiting duplicate pages (different URLs leading to identical or near-identical content)
- Near-duplicates might be very tricky (think of a news item propagating across the Web)
Avoid crawler traps
Avoid useless content (i.e., web spam)
74. Future Directions
Collaborative crawling, mixed pull-push model
Understanding site structure
Deep Web crawling
Media content crawling
Social network crawling
75. References: Crawl Datasets
Use for building your crawls, web graph analysis, web data mining tasks, etc.
ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TB compressed
- Hosted at several cloud services (free license required), or a copy can be ordered on hard disks (pay for the disks)
ClueWeb12:
- Almost 900 million English web pages
76. References: Crawl Datasets
Use for building your crawls, web graph analysis, web data mining tasks, etc.
Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as an Amazon Web Services public dataset (pay for processing)
77. References: Crawl Datasets
Use for building your crawls, web graph analysis, web data mining tasks, etc.
Internet Archive:
- See http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-resea
- Crawl of 2011
- 80TB of WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request
78. References: Crawl Datasets
LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- A variety of web graph datasets (nodes, arcs, etc.), including basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications
ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 mln blog posts and 230 mln social media publications
- 2TB compressed
Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities’ web sites
79. References: Literature
For beginners: Udacity CS101 course; http://www.udacity.com/overview/Course/cs101
Intermediate: Chapter 20 of the Introduction to Information Retrieval book by Manning, Raghavan and Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
80. References: Literature
See relevant publications at Mendeley:
http://www.mendeley.com/groups/531771/web-crawling/
Feel free to join the group!
Check ’Deep Web’ group too
http://www.mendeley.com/groups/601801/deep-web/