Tutorial given at ICWE'13, Aalborg, Denmark on 08.07.2013
Abstract:
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we introduce the audience to five topics: architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.
To cite this tutorial:
Please refer to http://dx.doi.org/10.1007/978-3-642-39200-9_49
1. ICWE’13 Tutorial:
CURRENT CHALLENGES IN WEB CRAWLING
Denis Shestakov (denshe at gmail-dot-com)
Department of Media Technology
School of Science, Aalto University, Finland
Version 1.5: 09.03.2015
Version 1.4: 08.07.2013
2. References to this tutorial
To cite please use:
D. Shestakov, "Current Challenges in Web Crawling," in Proc. ICWE 2013, 2013, pp. 518-521.
[BibTeX]
3. Speaker’s Bio
(2009-2013) Postdoc in Web Services Group, Aalto University, Finland
PhD thesis (2008) on limited coverage of web crawlers
Over ten years of experience in web crawling
[Photo: Web Services Group in 2011]
4. Speaker’s Info
As of 2013 / Current:
http://www.linkedin.com/in/dshestakov
http://www.mendeley.com/profiles/denis-shestakov/
http://www.researchgate.net/profile/Denis_Shestakov
https://mediatech.aalto.fi/~denis/
5. Tutorial Outline
OVERVIEW
Web crawling in a nutshell
Web structure & statistics
Large-scale crawling
Break
CHALLENGES
Collaborative web crawling
Crawling the deep Web
Crawling the multimedia content
Future directions
6. PART I: OVERVIEW
Visualization of http://media.tkk.fi/webservices by the aharef.info applet
7. Outline of Part I
Overview of Web Crawling
Web Crawling in a Nutshell
Applications
Industry vs. Academia
Web Ecosystem and Crawling
Web Structure & Statistics
Large-scale crawling
Basic architecture
Implementations
Design issues and considerations
8. Web Crawling in a Nutshell
Automatic harvesting of web content
Done by web crawlers (also known as robots, bots or spiders)
Follow a link from a set of links (the URL queue), download the page, extract all links, eliminate those already visited, add the rest to the queue
Then repeat
A set of policies is involved (e.g., ’ignore links to images’)
9. Web Crawling in a Nutshell
Example:
1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1
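As a rough illustration of this loop, a minimal sketch in Python; fetch_page and extract_links are hypothetical helpers, not part of the original slides:

    from collections import deque

    def crawl(seed_urls, max_pages=1000):
        frontier = deque(seed_urls)            # URL queue
        visited = set(seed_urls)               # URLs already queued
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            page = fetch_page(url)             # hypothetical helper: download the page
            if page is None:
                continue                       # fetch failed; skip this URL
            fetched += 1
            for link in extract_links(page):   # hypothetical helper: extract <a> href URLs
                if link not in visited:        # remove already visited (Step 3)
                    visited.add(link)
                    frontier.append(link)      # the rest go back to the queue (Step 4)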
10. Web Crawling in a Nutshell
In essence: a simple and naive process
However, a number of imposed ’restrictions’ make it much more complicated
Most complexities are due to the operating environment (the Web)
For example, do not overload web servers (challenging, as the distribution of web pages across web servers is non-uniform)
Or avoid web spam (not only useless, but it consumes resources and often spoils the collected content)
11. Web Crawling in a Nutshell
Crawler Agents
First in 1993: the Wanderer (written in Perl)
Over 1100 different crawler signatures (User-Agent strings in HTTP request headers) mentioned at http://www.crawltrack.net/crawlerlist.php
Educated guess on the overall number of different crawlers: at least several thousand
Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
12. Web Crawling in a Nutshell
Crawler Agents
For advanced needs, you may modify the code of existing projects in your preferred programming language
Crawlers play a big role on the Web
Bring more traffic to certain web sites than human visitors do
Generate a sizeable portion of the traffic to any (public) web site
Crawler traffic is important for emerging web sites
13. Web Crawling in a Nutshell
Classification
General/universal crawlers
- Not so many of them; lots of resources required
- Big web search engines
Topical/focused crawlers
- Pages/sites on a certain topic
- Crawling everything in one specific (e.g., national) web segment is rather general, though
Batch crawling
- One or several (static) snapshots
Incremental/continuous crawling
- Re-visiting
- Resources divided between fetching newly discovered pages and re-downloading previously crawled pages
- Search engines
14. Applications of Web Crawling
Web Search Engines
Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
Crawling is one of their three underlying technology stacks
15. Applications of Web Crawling
Web Search Engines
One of three underlying technology stacks
BTW, what are the other two and which is the most ’crucial’?
16. Applications of Web Crawling
Web Search Engines
What are the other two and which is the most ’crucial’?
The other two: indexing and query processing; the most ’crucial’ one is the query processor (particularly, ranking)
17. Applications of Web Crawling
Web Archiving
Digital preservation
A ’librarian’s’ view of the Web
The biggest: Internet Archive
Quite huge collections
Batch crawls
Primarily, collections of national web sites: sites at country-specific TLDs or physically hosted in a country
There are quite a few, and some are huge! See the list of Web Archiving Initiatives on Wikipedia
18. Applications of Web Crawling
Vertical Search Engines
Aggregating data from many sources on a certain topic
E.g., apartment search, car search
19. Applications of Web Crawling
Web Data Mining
“To get data to be actually mined”
Usually using focused crawlers
For example, opinion mining
Or digests of current happenings on the Web (e.g., what music people are listening to now)
20. Applications of Web Crawling
Web Monitoring
Monitoring sites/pages for changes and updates
21. Applications of Web Crawling
Detection of malicious web sites
Typically part of an anti-virus, firewall, search engine, etc. service
Building a list of such web sites and informing the user about the potential threat of visiting them
22. Applications of Web Crawling
Web site/application testing
Crawl a web site to check navigation through it, the validity of the links, etc.
Regression/security/... testing of a rich internet application (RIA) via crawling
Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out)
23. Applications of Web Crawling
Fighting crime! :) well, copyright violations
Crawl to find (media) items under copyright, or links to them
Regularly re-visiting ’suspicious’ web sites, forums, etc.
Tasks like finding terrorist chat rooms also go here
24. Applications of Web Crawling
Web Scraping
Extracting particular pieces of information from a group of typically similar pages
Used when an API to the data is not available
Interestingly, scraping might be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API
25. Applications of Web Crawling
Web Mirroring
Copying of web sites
Often hosting copies on different servers to ensure
constant accessibility
26. Industry vs. Academia
In the web crawling domain
Huge lag between industrial and academic web crawlers
- Research-wise and development-wise
- Algorithms, techniques and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
Industrial crawlers operate at web scale (= dozens of billions of pages)
- Only a few (three?) academic crawlers have dealt with more than one billion pages
- The academic scale is rather hundreds of millions
27. Industry vs. Academia
Re-crawling
- Batch crawls in academia
- Regular re-crawls by industrial crawlers
Evaluation of crawled data
- And hence corrections/improvements to crawlers
- Direct evaluation by users of search engines
- To some extent, artificial evaluation of academic crawls
28. Industry vs. Academia
Industrial (search engines’) crawlers are much more appreciated
- Eventually they attract visitors (= revenue/prestige/influence/...)
- It makes perfect sense to trick them
Academic crawlers just consume resources (e.g., network bandwidth)
- They don’t bring anything
- No point in playing tricks on them (assuming the site administrator bothers to differentiate them from search engines’ bots)
29. Web Ecosystem and Crawling
Pull vs. Push model
Web Content Provider (site owners)
Web Aggregators (crawler operators)
Aggregator pulls content
Content is not pushed to aggregators
30. Web Ecosystem and Crawling
Why not Push?
Pull is just easier for both parties
No ’agreement’ needed between provider and aggregator
No specific protocols for content providers – serving content is enough
Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed
31. Web Ecosystem and Crawling
Why not Push?
Still, the pull model has several disadvantages
What are they?
32. Web Ecosystem and Crawling
Why not Push?
Still, the pull model has several disadvantages
Push would avoid redundant requests from crawlers and give content providers more control over the content
33. Web Ecosystem and Crawling
Crawler politeness
Content providers possess some control over crawlers
Via special protocols to define access to parts of a site
Via direct banning of agents hitting a site too often
34. Web Ecosystem and Crawling
Crawler politeness
Robots.txt says what can(not) be crawled
Sitemaps is a newer protocol listing the URLs available for crawling, together with extra metadata (e.g., last modification, change frequency)
Example: no agent may visit any URL under yoursite/notcrawldir, except the agent called "goodsearcher"
User-agent: *
Disallow: /notcrawldir/
User-agent: goodsearcher
Disallow:
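A sketch of honouring such rules from Python, using the standard-library urllib.robotparser (the site URL below is a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://yoursite.example/robots.txt")   # placeholder location of robots.txt
    rp.read()                                          # fetch and parse the rules

    # generic agents are kept out of /notcrawldir/, goodsearcher is allowed in
    print(rp.can_fetch("*", "http://yoursite.example/notcrawldir/page.html"))            # False
    print(rp.can_fetch("goodsearcher", "http://yoursite.example/notcrawldir/page.html")) # True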
35. Web Structure & Statistics
Some numbers
The number of pages per host is not uniform: most hosts contain only a few pages, while others contain millions
Roughly 100 links on a page
Must try to keep all crawling threads busy
According to Google statistics (over 4 billion pages, 2010): fetching a page transfers about 320KB (textual content plus all embedded resources)
A page has 10-100KB of textual (HTML) content on average
One trillion URLs known by Google/Yahoo in 2008
36. Web Structure & Statistics
Some numbers
20 million web pages in 1995 (indexed by AltaVista)
One trillion (10^12) URLs known by Google/Yahoo in 2008
- The ’independent’ search engine Majestic12 (P2P crawling) confirms one billion items
This doesn’t mean one trillion indexed pages
Supposedly, the index has dozens of times fewer pages
Cool crawler fact: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months
- Throughput: 1000-1500 pages per second
- Over 30 billion discovered URLs
37. Web Structure & Statistics
Bow-tie model of the Web
Illustration taken from http://dx.doi.org/doi:10.1038/35012155
38. Basic Crawler Architecture
Crawler crawls the Web
Illustration taken from CMSC 476/676 course slides by Charles Nicholas
39. Basic Crawler Architecture
Typically in a distributed fashion
Illustration taken from CMSC 476/676 course slides by Charles Nicholas
40. Basic Crawler Architecture
URL Frontier
May include multiple URLs from the same host
Must avoid trying to fetch them all at the same time
Must try to keep all crawling threads busy
Prioritization also helps
41. Basic Crawler Architecture
Crawler Architecture
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
42. Basic Crawler Architecture
DNS
Given a URL, resolve its host name to an IP address
A distributed service – lookup latencies can be high (seconds)
A critical component
Common implementations of DNS lookup (e.g., nslookup) are synchronous: one request at a time
Remedies: asynchronous DNS resolving, pre-caching, batch DNS resolving
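A rough sketch of pre-caching lookups and keeping many of them in flight with a thread pool (the host names are placeholders; large crawlers typically use dedicated asynchronous resolvers instead):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    dns_cache = {}

    def resolve(host):
        # cache lookups so each host is resolved only once
        if host not in dns_cache:
            try:
                dns_cache[host] = socket.gethostbyname(host)
            except socket.gaierror:
                dns_cache[host] = None      # resolution failed
        return dns_cache[host]

    hosts = ["example.com", "example.org"]  # placeholder host list
    with ThreadPoolExecutor(max_workers=50) as pool:
        # many lookups in flight at once instead of one blocking request at a time
        for host, ip in zip(hosts, pool.map(resolve, hosts)):
            print(host, ip)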
43. Basic Crawler Architecture
Content seen?
If the fetched page is already in the base/index, don’t process it again
Document fingerprints (shingles)
Filtering
Filter out URLs due to ’politeness’ or restrictions on the crawl
Fetched robots.txt files are cached to avoid fetching them repeatedly
Duplicate URL Elimination
Check whether an extracted and filtered URL has already been passed to the frontier (batch crawling)
More complicated in continuous crawling (different URL frontier implementation)
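A toy sketch of a shingle-based "content seen?" test (the k and threshold values are arbitrary choices, not from the slides): hash overlapping word k-grams and compare the resulting fingerprint sets.

    import hashlib

    def shingles(text, k=5):
        # set of hashed overlapping k-word sequences (shingles)
        words = text.split()
        return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
                for i in range(max(len(words) - k + 1, 1))}

    def near_duplicate(text_a, text_b, threshold=0.9):
        a, b = shingles(text_a), shingles(text_b)
        if not a or not b:
            return False
        # Jaccard similarity of the two shingle sets
        return len(a & b) / len(a | b) >= threshold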
44. Basic Crawler Architecture
Distributed Crawling
Run multiple crawl threads, under different processes
(often at different nodes)
Nodes can be geographically distributed
Partition hosts being crawled into nodes
45. Basic Crawler Architecture
Host Splitter
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
46. Implementations
Popular languages: Perl, Java, Python, C/C++
Libraries for HTTP fetching, HTML parsing and asynchronous DNS resolving
Open-source, in Java: Heritrix, Nutch
47. Implementations
Simple code example in Perl
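The Perl listing from the original slide is not reproduced in this transcript; below is a comparable minimal sketch using only the Python standard library (urllib.request for fetching, html.parser for link extraction), with the example URL taken from the earlier slides:

    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            # collect href attributes of <a> tags, resolved against the base URL
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    url = "http://media.tkk.fi/webservices"      # example URL from the slides
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    print(parser.links)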
48. Large-scale Crawling
Objectives
High web coverage
High page freshness
High content quality
High download rate
Internal (I) and external (E) factors
Amount of hardware (I)
Network bandwidth (I)
Rate of web growth (E)
Rate of web change (E)
Amount of malicious content (e.g., spam, duplicates) (E)
49. Large-scale Crawling
Architecture of a sequential crawler
Seeds – list of starting URLs
Order of page visits determined by the frontier data structure
Stop condition (e.g., X pages fetched)
Illustration taken from Ch. 8 "Web Crawling" by Filippo Menczer in Bing Liu’s Web Data Mining (Springer, 2007)
50. Large-scale Crawling
Graph Traversal
Breadth-first search
- Implemented with a QUEUE (FIFO)
- Reaches pages with the shortest paths first
Depth-first search
- Implemented with a STACK (LIFO)
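In code, the only difference is which end of the frontier is popped; a tiny sketch (the seed URL is a placeholder):

    from collections import deque

    frontier = deque(["http://example.com"])   # placeholder seed

    # Breadth-first: FIFO queue, shortest paths from the seeds come first
    next_url = frontier.popleft()

    # Depth-first: LIFO stack, dives deep along one branch
    # next_url = frontier.pop()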
51. Large-scale Crawling
Some implementation notes
Get only the first part of pages (10-100KB)
Detect redirection loops
Handle all possible errors (e.g., server not responding), timeouts, etc.
Deal with lots of invalid HTML
Take care of dynamic pages
- Some are ’spider traps’ (think of the “Next month” link on a calendar)
- E.g., limit the number of pages per host
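A hedged sketch of two of these notes – reading only the first part of a page and bounding the wait with a timeout – using the standard urllib; the 100KB cap mirrors the slide, the timeout value is arbitrary:

    from urllib.request import urlopen
    from urllib.error import URLError

    def fetch_head_of_page(url, max_bytes=100 * 1024, timeout=10):
        # read at most max_bytes and give up after `timeout` seconds
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read(max_bytes)
        except (URLError, OSError):
            return None        # server not responding, DNS failure, timeout, ...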
52. Large-scale Crawling
Delays in crawling
Resolving host to IP address
Connecting a socket to server and sending request
Receiving requested page in response
Overlap delays by fetching many pages concurrently
53. Large-scale Crawling
Architecture of a concurrent crawler
Illustration taken from Ch. 8 "Web Crawling" by Filippo Menczer in Bing Liu’s Web Data Mining (Springer, 2007)
54. Large-scale Crawling
Design points: frontier data structure
Most links on a page refer to the same site/server
- Note: remember virtual hosting
Problem with a plain FIFO queue – too many requests to the same server
Common policy: delay the next request to a server by, say, 10 x the time it took to download the last page from that server
’Mercator’ scheme – maintain additional per-host queues alongside the main frontier queue
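A much-simplified sketch of that idea (not Mercator itself): one queue per host plus a "not before" timestamp, so a URL is handed out only after its host's delay has expired; the delay factor of 10 mirrors the policy above.

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class PoliteFrontier:
        def __init__(self, delay_factor=10.0):
            self.queues = defaultdict(deque)   # one FIFO queue per host
            self.not_before = {}               # host -> earliest next request time
            self.delay_factor = delay_factor   # e.g. 10 x last download time

        def add(self, url):
            self.queues[urlparse(url).netloc].append(url)

        def next_url(self):
            now = time.time()
            for host, queue in self.queues.items():
                if queue and self.not_before.get(host, 0) <= now:
                    return queue.popleft()
            return None                        # every host is currently 'cooling down'

        def report_fetch(self, url, download_time):
            # delay the next request to this host proportionally to the last download time
            host = urlparse(url).netloc
            self.not_before[host] = time.time() + self.delay_factor * download_time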
55. Large-scale Crawling
Design points: URL-seen test
Goal: do not add multiple instances of a URL to the frontier
For batch crawling, two operations are required: insertion and membership testing
For continuous crawling, one more operation: deletion
URLs are compressed (e.g., to a 10-byte hash value)
In-memory implementations: hash table, Bloom filter
Search engines keep all URLs in memory in the crawling cluster (hash table partitioned across nodes; partitioning can be based on the host part of the URL)
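A sketch of the in-memory variant with URLs compressed to 10-byte hashes, as on the slide (hashlib is from the standard library; a Bloom filter would trade exactness for even less memory):

    import hashlib

    seen = set()                       # in practice partitioned across cluster nodes by host

    def url_seen(url):
        # a 10-byte digest instead of the full URL keeps the table small
        digest = hashlib.sha1(url.encode()).digest()[:10]
        if digest in seen:
            return True                # membership test
        seen.add(digest)               # insertion
        return False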
56. Large-scale Crawling
Design points: URL-seen test
If in-memory storage is not possible, a disk-based hash table with caching is used
This limits the crawling rate to tens of pages per second, since disk lookups are slow
To scale, sequential reads/writes are faster and thus used instead
’Mercator/IRLbot’ scheme: merge (read and rewrite) the sorted hashes of visited URLs on disk with the hashes of just-extracted URLs
The delay due to batch merging is manageable
58. Outline of Part II
Challenges in Web Crawling
Collaborative Crawling
Deep Web Crawling
Crawling content behind search forms
Crawling JavaScript-rich web sites
Crawling Multimedia
Other Challenges in Crawling
Future Directions
References
59. Collaborative Crawling
Main considerations
Lots of redundant crawling
To get data (often on a specific topic) one needs to crawl broadly
- Often a lack of expertise when a large crawl is required
- Often, a lot is crawled but only a small subset is used
Too many redundant requests hit content providers
Idea: have one crawler do a very broad and intensive crawl, with many parties accessing the crawled data via an API
- Parties specify filters to select the required pages
Crawler as a common service
60. Collaborative Crawling
Some requirements
A filter language for specifying conditions
Efficient filter processing (millions of filters to process)
Efficient fetching (hundreds of pages per second)
Support for real-time requests
61. Collaborative Crawling
New component
Process a stream of documents against a filter index
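A toy sketch of that component (the filter conditions here are just URL substrings and keywords, far simpler than a real filter language; all identifiers are illustrative):

    filters = {
        # filter_id -> (url_substring, keyword); purely illustrative conditions
        "apartments-fi": ("rental", "helsinki"),
        "car-ads":       ("cars",   "diesel"),
    }

    def match_filters(url, text):
        # return the subscribers whose conditions the fetched document satisfies
        hits = []
        text_lower = text.lower()
        for filter_id, (url_part, keyword) in filters.items():
            if url_part in url and keyword in text_lower:
                hits.append(filter_id)
        return hits

    # every document coming off the fetch stream is routed to the matching subscribers
    print(match_filters("http://example.com/rental/listings", "Apartment in Helsinki ..."))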
62. Collaborative Crawling
Filter processing architecture
63. Collaborative Crawling
Filter processing architecture
64. Collaborative Crawling
Based on ’The Architecture and Implementation of an Extensible Web Crawler’ by Hsieh, Gribble and Levy, 2010 (illustrations on slides 61-62 are from Hsieh’s slides)
E.g., 80legs provides similar crawling services
In a way, this reconsiders the pull/push model of content delivery on the Web
65. Deep Web Crawling
Visualization of http://amazon.com by aharef.info applet
66. Deep Web Crawling
In a nutshell
The problem is in the yellow nodes (designating web form elements)
67. Deep Web Crawling
See slides on deep Web crawling at http://goo.gl/Oohoo
68. Crawling Multimedia Content
The Web is now a multimedia platform
Images, video and audio are an integral part of web pages (not just supplementing them)
Almost all crawlers, however, treat the Web as a textual repository
One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications
Hence, no real need to harvest multimedia
But state-of-the-art multimedia retrieval/computer vision techniques already provide adequate search quality
E.g., search for images with a cat and a man based on the actual image content (not the text around/close to the image)
In the case of video: a set of frames plus audio (which can be converted to textual form)
69. Crawling Multimedia Content
Challenges in crawling multimedia
Bigger load on web sites, since the files are bigger
More apparent copyright issues
More resources (e.g., bandwidth, storage space) required from a crawler
More complicated duplicate resolution
Re-visiting policy
70. Crawling Multimedia Content
Approaches
Utilize metadata info (fetch and analyse a small metadata file to decide on a full download); a sketch follows this list
Intelligent crawling: better ranking of URLs in the frontier (based on the specified domain of the crawl)
Move from the pull to the push model
API-directed crawling
- Access to data via predefined APIs
- Requires annotation/discovery of such APIs
Technically: use an additional component for the multimedia crawl
- With its own URL queue
- The main crawler component provides it with URLs to multimedia
- In return, it sends feedback to the main crawler to better score links in the frontier
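One possible reading of the metadata point (not necessarily what the slide’s authors had in mind) is to issue an HTTP HEAD request and inspect the headers before committing to a full download; a sketch with urllib, where the size threshold is a placeholder:

    from urllib.request import Request, urlopen

    def worth_downloading(url, max_bytes=50 * 1024 * 1024):
        # fetch only the headers, then decide whether to download the full object
        response = urlopen(Request(url, method="HEAD"), timeout=10)
        content_type = response.headers.get("Content-Type", "")
        content_length = int(response.headers.get("Content-Length", 0))
        # conservative: skip objects that are not media or whose declared size is too large/missing
        return content_type.startswith(("image/", "video/", "audio/")) and 0 < content_length <= max_bytes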
71. Crawling Multimedia Content
Scalable Multimedia Web Observatory of the ARCOMEM project (http://www.arcomem.eu)
Focus on web archiving issues
Uses several crawlers
- A ’standard’ crawler for regular web pages
- An API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube)
- A deep Web crawler able to extract information from pre-defined web sites
Data can be exported in WARC (Web ARChive) files and in RDF
72. Other Crawling Challenges
Ordering policy
Resources are limited, while the number of pages to visit is essentially infinite
The decision has to be made based on the URL itself
PageRank-like metrics can be used (a prioritized-frontier sketch follows this list)
More complicated in the case of incremental crawls
Focused crawling
Avoid links leading to content outside the topic of interest
The content of a page can be taken into account when deciding where a particular link leads
Setting a good seed is a challenge
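A sketch of ordering the frontier by such a score with a heap; the score function below is a placeholder (a real crawler would plug in PageRank-like or topical scores):

    import heapq

    frontier = []                      # min-heap of (-score, url)

    def score(url):
        # placeholder: prefer short URLs close to the site root
        return 1.0 / (1 + url.count("/"))

    def push(url):
        heapq.heappush(frontier, (-score(url), url))   # negate for highest-score-first

    def pop():
        return heapq.heappop(frontier)[1]

    push("http://example.com/")
    push("http://example.com/a/very/deep/page.html")
    print(pop())                       # the root page comes out first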
73. Other Crawling Challenges
Re-visiting policy
Generating good seed URLs
Avoiding redundant content
Avoid visiting duplicate pages (different URLs leading to identical or near-identical content)
- Near-duplicates might be very tricky (think of a news item propagating across the Web)
Avoid crawler traps
Avoid useless content (i.e., web spam)
74. Future Directions
Collaborative crawling, mixed pull-push model
Understanding site structure
Deep Web crawling
Media content crawling
Social network crawling
75. References: Crawl Datasets
Use for building your crawls, web graph analysis, web data mining tasks, etc.
ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TB compressed
- Hosted at several cloud services (free license required), or a copy can be ordered on hard disks (pay for the disks)
ClueWeb12:
- Almost 900 million English web pages
76. References: Crawl Datasets
Use for building your crawls, web graph analysis, web data mining tasks, etc.
Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as an Amazon Web Services public dataset (pay for processing)
77. References: Crawl Datasets
Use for building your crawls, web graph analysis, web data mining tasks, etc.
Internet Archive:
- See http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-resea
- Crawl of 2011
- 80TB of WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request
78. References: Crawl Datasets
LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- A variety of web graph datasets (nodes, arcs, etc.), including basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications
ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 mln blog posts and 230 mln social media publications
- 2TB compressed
Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities’ web sites
79. References: Literature
For beginners: Udacity CS101 course; http://www.udacity.com/overview/Course/cs101
Intermediate: Chapter 20 of the Introduction to Information Retrieval book by Manning, Raghavan and Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
80. References: Literature
See relevant publications at Mendeley:
http://www.mendeley.com/groups/531771/web-crawling/
Feel free to join the group!
Check ’Deep Web’ group too
http://www.mendeley.com/groups/601801/deep-web/