ICWE’13 Tutorial:
CURRENT CHALLENGES IN WEB CRAWLING
Denis Shestakov (denshe at gmail-dot-com)
Department of Media Technology
School of Science, Aalto University, Finland
Version 1.5: 09.03.2015
Version 1.4: 08.07.2013
Denis Shestakov
Current Challenges in Web Crawling
ICWE’13, Aalborg, Denmark, 08.07.2013
References to this tutorial
To cite please use:
D. Shestakov, "Current Challenges in Web Crawling," in
Proc. ICWE 2013, 2013, pp. 518-521.
[BibTeX]
Speaker’s Bio
(2009-2013) Postdoc in
Web Services Group,
Aalto University, Finland
PhD thesis (2008) on
limited coverage of web
crawlers
Over ten years of
experience in web
crawling
[Photo: Web Services Group in 2011]
Speaker’s Info
As of 2013: Current:
http://www.linkedin.com/in/dshestakov
http://www.mendeley.com/profiles/
denis-shestakov/
http://www.researchgate.net/profile/
Denis_Shestakov
https://mediatech.aalto.fi/~denis/
Tutorial Outline
OVERVIEW
Web crawling in a nutshell
Web structure & statistics
Large-scale crawling
Break
CHALLENGES
Collaborative web crawling
Crawling the deep Web
Crawling the multimedia content
Future directions
PART I: OVERVIEW
Visualization of http://media.tkk.fi/webservices by aharef.info
applet
Outline of Part I
Overview of Web Crawling
Web Crawling in a Nutshell
Applications
Industry vs. Academia
Web Ecosystem and Crawling
Web Structure & Statistics
Large-scale crawling
Basic architecture
Implementations
Design issues and considerations
Web Crawling in a Nutshell
Automatic harvesting of web content
Done by web crawlers (also known as robots, bots or
spiders)
Follow a link from a set of links (URL queue), download a
page, extract all links, eliminate already visited, add the
rest to the queue
Then repeat
A set of policies involved (like ’ignore links to images’, etc.)
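The loop described above can be sketched in a few dozen lines of Python using only the standard library. This is a minimal sketch, not a production crawler: the names `crawl`, `fetch` and `LinkExtractor` are illustrative, `max_pages` is an assumed stop condition, and the only policy applied is "keep http(s) links, drop fragments". Politeness and robots.txt handling are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text (real network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=100):
    """Follow links from the seeds; return URLs in the order fetched."""
    queue = deque(seeds)      # URL queue
    visited = set(seeds)      # URLs already queued or fetched
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue          # skip pages that fail to download
        fetched.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            # Example policy: only http(s) links that were not seen before
            if link.startswith(("http://", "https://")) and link not in visited:
                visited.add(link)
                queue.append(link)
    return fetched
```

Calling `crawl(["http://media.tkk.fi/webservices"])` would hit the live Web; passing a custom `fetch_page` function makes the loop easy to try offline.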
Web Crawling in a Nutshell
Example:
1. Follow http://media.tkk.fi/webservices (visualization of its
HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1
Web Crawling in a Nutshell
In essence: simple and naive process
However, a number of ’restrictions’ imposed make it much
more complicated
Most complexities due to operating environment (Web)
For example, do not overload web servers (challenging as
distribution of web pages on web servers is non-uniform)
Or avoiding web spam (not only useless but consumes
resources and often spoils the collected content)
Web Crawling in a Nutshell
Crawler Agents
First in 1993: the Wanderer (written in Perl)
Over 1100 different crawler signatures (User-Agent string
in HTTP request header) mentioned at
http://www.crawltrack.net/crawlerlist.php
Educated guess on overall number of different crawlers –
at least several thousand
Write your own in a few dozen lines of code (using
libraries for URL fetching and HTML parsing)
Or use existing agent: e.g., wget tool (developed from
1996; http://www.gnu.org/software/wget/)
Web Crawling in a Nutshell
Crawler Agents
For advanced things, you may modify the code of existing
projects written in your preferred programming language
Crawlers play a big role on the Web
Bring more traffic to certain web sites than human visitors
Generate sizeable portion of traffic to any (public) web site
Crawler traffic important for emerging web sites
Web Crawling in a Nutshell
Classification
General/universal crawlers
- Not so many of them, lots of resources required
- Big web search engines
Topical/focused crawlers
- Pages/sites on certain topic
- Crawling all in one specific (i.e., national) web segment is
rather general, though
Batch crawling
- One or several (static) snapshots
Incremental/continuous crawling
- Re-visiting
- Resources divided between fetching newly discovered
pages and re-downloading previously crawled pages
- Search engines
Applications of Web Crawling
Web Search Engines
Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex,
Ask, ...
One of three underlying technology stacks
Applications of Web Crawling
Web Search Engines
One of three underlying technology stacks
BTW, what are the other two and which is the most
’crucial’?
Applications of Web Crawling
Web Search Engines
What are the other two and which is the most ’crucial’?
Query processor (particularly, ranking)
Applications of Web Crawling
Web Archiving
Digital preservation
’Librarian’ view of the Web
The biggest: Internet Archive
Quite huge collections
Batch crawls
Primarily, collection of national web sites - web sites at
country-specific TLDs or physically hosted in a country
There are quite a few and some are huge! See the list of
Web Archiving Initiatives at Wikipedia
Applications of Web Crawling
Vertical Search Engines
Aggregating data from many sources on a certain topic
E.g., apartment search, car search
Applications of Web Crawling
Web Data Mining
“To get data to be actually mined”
Usually using focused crawlers
For example, opinion mining
Or digests of current happenings on the Web (e.g., what
music people listen to now)
Applications of Web Crawling
Web Monitoring
Monitoring sites/pages for changes and updates
Applications of Web Crawling
Detection of malicious web sites
Typically a part of anti-virus, firewall, search engine, etc.
service
Building a list of such web sites and informing a user about
the potential threat of visiting them
Applications of Web Crawling
Web site/application testing
Crawl a web site to check navigation through it, validity
of the links, etc.
Regression/security/... testing a rich internet application
(RIA) via crawling
Checking different application states by simulating possible
user interaction events (e.g., mouse click, time-out)
Applications of Web Crawling
Fighting crime! :) well, copyright violations
Crawl to find (media) items under copyright or links to them
Regular re-visiting ’suspicious’ web sites, forums, etc.
Tasks like finding terrorist chat rooms also go here
Applications of Web Crawling
Web Scraping
Extracting particular pieces of information from a group of
typically similar pages
When API to data is not available
Interestingly, scraping might be preferable even with an
API available, as scraped data is often cleaner and more
up-to-date than data-via-API
Applications of Web Crawling
Web Mirroring
Copying of web sites
Often hosting copies on different servers to ensure
constant accessibility
Industry vs. Academia
In web crawling domain
Huge lag between industrial and academic web crawlers
- Research-wise and development-wise
- Algorithms, techniques, strategies used in industrial
crawlers (namely, operated by search engines) poorly
known
Industrial crawlers operate on a web-scale (=tens of
billions of pages)
- Only a few (three?) academic crawlers dealt with more
than one billion pages
- Academic scale is rather hundreds of millions
Industry vs. Academia
Re-crawling
- Batch crawls in academia
- Regular re-crawls by industrial crawlers
Evaluation of crawled data
- And hence corrections/improvements into crawlers
- Direct evaluation by users of search engines
- To some extent, artificial evaluation of academic crawls
Industry vs. Academia
Industrial (search engines’) crawlers are much more
appreciated
- Eventually they attract visitors
(=revenue/prestige/influence/...)
- It makes perfect sense to trick them
Academic crawlers just consume resources (e.g., network
bandwidth)
- Don’t bring anything
- No point to do tricks with them (assuming site
administrator bothers to differentiate them from search
engines’ bots)
Web Ecosystem and Crawling
Pull vs. Push model
Web Content Provider (site owners)
Web Aggregators (crawler operators)
Aggregator pulls content
Content is not pushed to aggregators
Web Ecosystem and Crawling
Why not Push?
Pull is just easier for both parties
No ’agreement’ between provider and aggregator
No specific protocols for content providers – serving
content is enough
Perhaps the pull model is the reason why the Web
succeeded while earlier hypertext systems failed
Web Ecosystem and Crawling
Why not Push?
Still pull model has several disadvantages
What are these?
Web Ecosystem and Crawling
Why not Push?
Still pull model has several disadvantages
Avoiding redundant requests from crawlers, more control
over the content from providers
Web Ecosystem and Crawling
Crawler politeness
Content providers possess some control over crawlers
Via special protocols to define access to parts of a site
Via direct banning of agents hitting a site too often
Web Ecosystem and Crawling
Crawler politeness
Robots.txt says what can(not) be crawled
Sitemaps is a newer protocol: a site lists the URLs it wants
crawled, plus other info (e.g., last modification time)
Example: no agent should visit any URL starting with
“yoursite/notcrawldir”, except an agent called
“goodsearcher”
User-agent: *
Disallow: /notcrawldir

User-agent: goodsearcher
Disallow:
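A crawler can evaluate such rules with Python's standard `urllib.robotparser` module. The sketch below is an illustration under assumptions: the helper name `allowed` and the agent name `somebot` are made up, and the rule is written as the path `/notcrawldir` because `Disallow` values are paths relative to the site root.

```python
from urllib.robotparser import RobotFileParser

# The example rules, with the disallowed directory given as a path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /notcrawldir

User-agent: goodsearcher
Disallow:
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://yoursite/notcrawldir/page.html"
    print(allowed("goodsearcher", page))  # the exception named in the rules
    print(allowed("somebot", page))       # any other agent
```

A real crawler would download `http://yoursite/robots.txt` once per host and cache the parsed result instead of parsing a string constant.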
Web Structure & Statistics
Some numbers
Number of pages per host is not uniform: most hosts
contain only a few pages, others contain millions
Roughly 100 links on a page
Must try to keep all crawling threads busy
According to Google statistics (over 4 billion pages,
2010): fetching a page takes 320KB (textual content plus
all embeddings)
Page has 10-100KB of textual (HTML) content on average
One trillion URLs known by Google/Yahoo in 2008
Web Structure & Statistics
Some numbers
20 million web pages in 1995 (indexed by AltaVista)
One trillion (10^12) URLs known by Google/Yahoo in 2008
- ’Independent’ search engine called Majestic12
(P2P-crawling) confirms one billion items
Doesn’t mean one trillion indexed pages
Supposedly, the index has dozens of times fewer pages
Cool crawler facts: IRLbot crawler (running on one server)
downloaded 6.4 billion pages over 2 months
- Throughput: 1000-1500 pages per second
- Over 30 billion discovered URLs
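The IRLbot throughput figure can be checked with simple arithmetic. A small sketch, assuming "2 months" means roughly 60 days of continuous crawling (the helper name `pages_per_second` is illustrative):

```python
def pages_per_second(pages, days):
    """Average download rate for a crawl of `pages` pages lasting `days` days."""
    return pages / (days * 24 * 3600)


if __name__ == "__main__":
    # 6.4 billion pages over an assumed 60 days
    print(round(pages_per_second(6.4e9, 60)))
```

Under that assumption the average lands inside the quoted 1000-1500 pages per second range.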
Web Structure & Statistics
Bow-tie model of the Web
Illustration taken from http://dx.doi.org/doi:10.1038/35012155
Basic Crawler Architecture
Crawler crawls the Web
Illustration taken from CMSC 476/676 course slides by Charles Nicholas
Basic Crawler Architecture
Typically in a distributed fashion
Illustration taken from CMSC 476/676 course slides by Charles Nicholas
Basic Crawler Architecture
URL Frontier
Include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must try to keep all crawling threads busy
Prioritization also helps
Basic Crawler Architecture
Crawler Architecture
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
Basic Crawler Architecture
DNS
Given a URL, retrieve its IP address
Distributed service – lookup latencies can be high
(seconds)
Critical component
Common implementations of DNS lookup (e.g., nslookup)
are synchronous: one request at a time
Asynchronous DNS resolving
Pre-caching
Batch DNS resolving
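Batch resolving with caching can be approximated by running blocking lookups in a thread pool. This is a minimal sketch under assumptions: `DnsCache` and `resolve` are illustrative names, threads stand in for a truly asynchronous resolver library, and cache expiry (TTL) is ignored.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve(host):
    """Blocking lookup: first IP address of a host, or None on failure."""
    try:
        return socket.getaddrinfo(host, None)[0][4][0]
    except socket.gaierror:
        return None


class DnsCache:
    """Resolves batches of hosts concurrently and remembers the answers."""

    def __init__(self, resolver=resolve, workers=32):
        self._resolver = resolver
        self._workers = workers
        self._cache = {}

    def resolve_batch(self, hosts):
        # Only hosts not yet in the cache cause a lookup
        missing = [h for h in sorted(set(hosts)) if h not in self._cache]
        if missing:
            with ThreadPoolExecutor(max_workers=self._workers) as pool:
                for host, ip in zip(missing, pool.map(self._resolver, missing)):
                    self._cache[host] = ip
        return {host: self._cache[host] for host in hosts}
```

Pre-caching then amounts to calling `resolve_batch` on the hosts of newly extracted URLs before those URLs reach the fetcher.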
Basic Crawler Architecture
Content seen?
If page fetched is already in the base/index, don’t process it
Document fingerprints (shingles)
Filtering
Filter out URLs – due to ’politeness’, restrictions on crawl
Fetched robots.txt are cached to avoid fetching them
repeatedly
Duplicate URL Elimination
Check if an extracted+filtered URL has been already
passed to frontier (batch crawling)
More complicated in continuous crawling (different URL
frontier implementation)
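The "content seen?" test with shingles can be sketched as follows. Assumptions: shingles are runs of `k` consecutive words (k=4 is an arbitrary choice), each shingle is hashed with SHA-1, and two pages are compared by the Jaccard similarity of their shingle sets. The function names are illustrative; large crawlers keep compact sketches of these sets instead of the full sets.

```python
import hashlib


def shingles(text, k=4):
    """Set of hashed k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        grams = [" ".join(words)]
    else:
        grams = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:16] for g in grams}


def resemblance(a, b, k=4):
    """Jaccard similarity of two texts' shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    page = "the quick brown fox jumps over the lazy dog"
    near_copy = "the quick brown fox jumps over the lazy cat"
    print(resemblance(page, near_copy))
```

A crawler would treat a page as already seen when its resemblance to a stored page exceeds a chosen threshold.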
Basic Crawler Architecture
Distributed Crawling
Run multiple crawl threads, under different processes
(often at different nodes)
Nodes can be geographically distributed
Partition hosts being crawled into nodes
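One common way to partition hosts among nodes is to hash the host name. A minimal sketch, assuming a fixed number of nodes and using MD5 only as a stable, well-mixed hash (the function name `node_for_url` is illustrative):

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL to a crawl node so that all URLs of one host go to one node."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Because the assignment depends only on the host, per-host politeness can be enforced locally on each node. Changing `num_nodes` reshuffles most hosts, which a scheme such as consistent hashing would reduce.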
Basic Crawler Architecture
Host Splitter
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
Implementations
Popular languages: Perl, Java, Python, C/C++
HTTP fetching, HTML parser, asynchronous DNS
resolving libraries
Open-source, in Java: Heritrix, Nutch
Implementations
Simple code example in Perl
Large-scale Crawling
Objectives
High web coverage
High page freshness
High content quality
High download rate
Internal and External factors
Amount of hardware (I)
Network bandwidth (I)
Rate of web growth (E)
Rate of web change (E)
Amount of malicious content (i.e., spam, duplicates) (E)
Large-scale Crawling
Architecture of
sequential crawler
Seeds – list of starting
URLs
Order of page visits
determined by frontier
data structure
Stop condition (e.g., X
pages fetched)
Illustration taken from Ch.8 Web Crawling by Filippo
Menczer in Bing Liu’s Web Data Mining (Springer, 2007)
Large-scale Crawling
Graph Traversal
Breadth first search
- Implemented with
QUEUE (FIFO)
- Pages with shortest
paths
Depth first search
- Implemented with
STACK (LIFO)
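The only difference between the two traversals is which end of the frontier the next page is taken from. A small sketch on a toy link graph (the graph and the function name `traverse` are made up for illustration):

```python
from collections import deque


def traverse(graph, seed, breadth_first=True):
    """Visit pages reachable from `seed`; return them in visiting order."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # FIFO (queue) gives breadth-first, LIFO (stack) gives depth-first
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
    print(traverse(graph, "A", breadth_first=True))   # ['A', 'B', 'C', 'D', 'E']
    print(traverse(graph, "A", breadth_first=False))  # ['A', 'C', 'E', 'B', 'D']
```

Breadth-first order reaches pages close to the seed first, which is why it finds pages with the shortest paths.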
Large-scale Crawling
Some implementation notes
Get only the first part of pages (10-100KB)
Detect redirection loops
Handle all possible errors (e.g., server not responding),
timeouts, etc.
Deal with lots of invalid HTML
Take care of dynamic pages
- Some are ’spider traps’ (think of Next month link on a
calendar)
- E.g., limit number of pages per host
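The per-host page limit mentioned as a guard against spider traps can be sketched as a simple counter. The class name `HostBudget` and the default limit of 1000 pages are assumptions for illustration:

```python
from collections import Counter
from urllib.parse import urlsplit


class HostBudget:
    """Admits at most `max_pages_per_host` URLs for any single host."""

    def __init__(self, max_pages_per_host=1000):
        self.max_pages_per_host = max_pages_per_host
        self.counts = Counter()

    def admit(self, url):
        host = (urlsplit(url).hostname or "").lower()
        if self.counts[host] >= self.max_pages_per_host:
            return False      # budget used up: likely a trap or a huge site
        self.counts[host] += 1
        return True
```

A crawler would call `admit(url)` before adding a URL to the frontier, so an endless "Next month" chain stops after the budget is spent.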
Large-scale Crawling
Delays in crawling
Resolving host to IP address
Connecting a socket to server and sending request
Receiving requested page in response
Overlap delays by fetching many pages concurrently
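Overlapping these delays can be sketched with a thread pool: while one thread waits for a DNS answer or a slow server, others keep downloading. The function name `fetch_all` and the worker count are assumptions; high-throughput crawlers typically use asynchronous I/O instead of threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_page, workers=16):
    """Fetch many URLs concurrently; map each URL to its content or its error."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_page, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:
                results[url] = error   # keep the failure instead of crashing
    return results
```

Here `fetch_page` is any function that takes a URL and returns the page, for example a thin wrapper around `urllib.request.urlopen`.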
Large-scale Crawling
Architecture of
concurrent crawler
Illustration taken from Ch.8 Web Crawling by Filippo Menczer
in Bing Liu’s Web Data Mining (Springer, 2007)
Large-scale Crawling
Design points: frontier data structure
Most links on a page refer to the same site/server
- Note: remember virtual hosting (many sites may share one server)
Problem with a FIFO queue: too many requests to the
same server
Common policy is to delay the next request by, say, 10 x the time
it took to download the last page from the server
’Mercator’ scheme: keep additional queues alongside the
frontier queue
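The delay policy can be sketched with one queue per host plus a heap of "earliest next contact" times. This is a simplified illustration in the spirit of per-host queues, not Mercator's actual design; the class name `PoliteFrontier`, the explicit `now` arguments and the factor 10 are assumptions.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    def __init__(self, delay_factor=10.0):
        self.delay_factor = delay_factor
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.heap = []                    # (earliest contact time, host)
        self.waiting = set()              # hosts currently in the heap
        self.busy = set()                 # hosts with a download in flight
        self.not_before = {}              # host -> earliest next contact time

    def _schedule(self, host, now):
        if self.queues[host] and host not in self.waiting and host not in self.busy:
            when = max(now, self.not_before.get(host, 0.0))
            heapq.heappush(self.heap, (when, host))
            self.waiting.add(host)

    def add(self, url, now=0.0):
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)
        self._schedule(host, now)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        self.waiting.discard(host)
        self.busy.add(host)
        return self.queues[host].popleft()

    def done(self, url, finished_at, download_time):
        """Report a finished download; the host rests for 10 x download time."""
        host = urlsplit(url).hostname or ""
        self.busy.discard(host)
        self.not_before[host] = finished_at + self.delay_factor * download_time
        self._schedule(host, finished_at)
```

Worker threads call `next_url` to get work and `done` when a download finishes, so at most one request per host is in flight and slow servers are contacted less often.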
Large-scale Crawling
Design points: URL seen test
To not add multiple instances of URL to the frontier
For batch crawling, two operations required: insertion and
membership testing
For continuous crawling, one more operation: deletion
URLs compressed (e.g., 10-byte hash value)
In-memory implementations: hash table, Bloom filter
Search engines keep all URLs in-memory in the crawling
cluster (hash table partitioned across nodes; partitioning
can be based on host part of URL)
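An in-memory URL-seen test with a Bloom filter can be sketched as below. Assumptions: the bit-array size and number of hash functions are arbitrary defaults, bit positions are derived from one SHA-256 digest by double hashing, and `url_key` shows one possible 10-byte compressed form of a URL. A plain Bloom filter supports insertion and membership testing only, so it fits batch crawling; it can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def url_key(url):
    """10-byte hash standing in for the full URL."""
    return hashlib.sha1(url.encode("utf-8")).digest()[:10]
```

Usage: `if url not in seen: seen.add(url); frontier.append(url)`, where `seen` is a `BloomFilter` instance.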
Large-scale Crawling
Design points: URL seen test
If in-memory not possible, disk-based hash table used with
caching
Limits crawling rate to tens of pages per second – disk
lookups are slow
To scale, sequential read/writes are faster and thus used
’Mercator/IRLbot’ scheme: combining (reading-writing)
sorted URL (visited) hashes on disk with hashes of ’just
extracted’ URLs
Delay due to batch merging manageable
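The batch-merge idea can be sketched as a single sequential pass over two sorted sequences. In this illustration plain Python lists stand in for the sorted hash file on disk and for the in-memory batch of just-extracted URL hashes; the function name `merge_batch` is made up, and this is not the actual Mercator or IRLbot code.

```python
def merge_batch(seen_sorted, new_hashes):
    """Merge a batch into the sorted 'seen' list in one sequential pass.

    Returns (updated sorted list, hashes from the batch not seen before).
    """
    batch = sorted(set(new_hashes))
    merged, fresh = [], []
    i = j = 0
    while i < len(seen_sorted) or j < len(batch):
        if j == len(batch) or (i < len(seen_sorted) and seen_sorted[i] < batch[j]):
            merged.append(seen_sorted[i])      # old hash, copy through
            i += 1
        elif i == len(seen_sorted) or batch[j] < seen_sorted[i]:
            merged.append(batch[j])            # new hash: URL goes to frontier
            fresh.append(batch[j])
            j += 1
        else:                                  # equal: URL was already seen
            merged.append(seen_sorted[i])
            i += 1
            j += 1
    return merged, fresh
```

Only the URLs behind the `fresh` hashes are passed on to the frontier; the larger the batch, the fewer passes over the on-disk file are needed.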
PART II: CHALLENGES
Outline of Part II
Challenges in Web Crawling
Collaborative Crawling
Deep Web Crawling
Crawling content behind search forms
Crawling JavaScript-rich web sites
Crawling Multimedia
Other Challenges in Crawling
Future Directions
References
Collaborative Crawling
Main considerations
Lots of redundant crawling
To get data (often on a specific topic) need to crawl broadly
- Often lack of expertise when large crawl required
- Often, crawl a lot, use only a small subset
Too many redundant requests for content providers
Idea: have one crawler doing very broad and intensive
crawl and many parties accessing the crawled data via API
- Specify filters to select required pages
Crawler as a common service
Collaborative Crawling
Some requirements
Filter language for specifying conditions
Efficient filter processing (millions of filters to process)
Efficient fetching (hundreds of pages per second)
Support real-time requests
Collaborative Crawling
New component
Process a stream of documents against a filter index
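Matching each document against a filter index, instead of testing every filter in turn, can be illustrated with a toy inverted index. The sketch assumes a deliberately simple filter language (a filter is a set of words that must all occur in the document); the class name `FilterIndex` is made up, and the system described in the cited work is considerably more elaborate.

```python
from collections import defaultdict


class FilterIndex:
    def __init__(self):
        self.required = {}                 # filter id -> set of required terms
        self.by_term = defaultdict(set)    # term -> ids of filters using it

    def add_filter(self, filter_id, terms):
        terms = {t.lower() for t in terms}
        self.required[filter_id] = terms
        for term in terms:
            self.by_term[term].add(filter_id)

    def match(self, document_text):
        """Return ids of filters whose terms all occur in the document."""
        words = set(document_text.lower().split())
        hits = defaultdict(int)
        for word in words:
            for filter_id in self.by_term.get(word, ()):
                hits[filter_id] += 1
        return sorted(f for f, n in hits.items() if n == len(self.required[f]))
```

The work per document depends on the words it contains rather than on the total number of registered filters, which is the point of indexing the filters.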
Collaborative Crawling
Filter processing architecture
Collaborative Crawling
Based on ’The architecture and implementation of an
extensible web crawler’ by Hsieh, Gribble, Levy, 2010
(illustrations on the preceding ’New component’ and ’Filter
processing architecture’ slides are from Hsieh’s slides)
E.g., 80legs provides similar crawling services
In a way, it is reconsidering pull/push model of content
delivery on the Web
Deep Web Crawling
Visualization of http://amazon.com by aharef.info applet
Deep Web Crawling
In a nutshell
Problem is in yellow nodes (designating web form
elements)
Deep Web Crawling
See slides on deep Web crawling at http://goo.gl/Oohoo
Crawling Multimedia Content
The web is now a multimedia platform
Images, video, audio are an integral part of web pages (not
just supplementing them)
Almost all crawlers, however, consider it as a textual
repository
One reason: indexing techniques for multimedia have not
yet reached the maturity required by interesting use
cases/applications
Hence, no real need to harvest multimedia
But state-of-the-art multimedia retrieval/computer vision
techniques already provide adequate search quality
E.g., search for images with a cat and a man based on
actual image content (not text around/close to image)
In case of video: set of frames plus audio (can be converted
to textual form)
Crawling Multimedia Content
Challenges in crawling multimedia
Bigger load on web sites since files are bigger
More apparent copyright issues
More resources (e.g., bandwidth, storage place) required
from a crawler
More complicated duplicate resolving
Re-visiting policy
Crawling Multimedia Content
Approaches
Utilize metadata info (fetch and analyse small metadata file
to decide on full download)
Intelligent crawling: better ranking of URLs in frontier
(based on specified domain of crawl)
Move from pull to push model
API-directed crawling
- Access to data via predefined APIs
- Need for annotation/discovery of such APIs
Technically: use additional component for multimedia crawl
- With its own URL queue
- Main crawler component provides it with URLs to
multimedia
- In return, it sends feedback to main crawler to better
score links in frontier
Crawling Multimedia Content
Scalable Multimedia Web Observatory of ARCOMEM
project (http://www.arcomem.eu)
Focus on web archiving issues
Uses several crawlers
- ’Standard’ crawler for regular web pages
- API crawler to mine social media sources (e.g., Twitter,
Facebook, YouTube, etc.)
- Deep Web crawler able to extract information from
pre-defined web sites
Data can be exported in WARC (Web ARChive) files and in
RDF
Other Crawling Challenges
Ordering policy
Resources are limited, while number of pages to visit
essentially infinite
Decision should be made based on the URL itself
PageRank-like metrics can be used
More complicated in case of incremental crawls
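An ordering policy can be sketched as a priority queue over discovered URLs. The example below uses the number of links seen so far to a URL as a cheap stand-in for a PageRank-like score; that choice, the class name `PriorityFrontier` and the lazy handling of outdated heap entries are assumptions for illustration.

```python
import heapq
from collections import defaultdict


class PriorityFrontier:
    def __init__(self):
        self.inlinks = defaultdict(int)   # URL -> links to it seen so far
        self.heap = []                    # (-score, URL); may hold stale entries
        self.done = set()                 # URLs already handed out

    def add_link(self, url):
        """Record one more discovered link pointing at `url`."""
        if url in self.done:
            return
        self.inlinks[url] += 1
        heapq.heappush(self.heap, (-self.inlinks[url], url))

    def pop(self):
        """Return the not-yet-visited URL with the highest score, or None."""
        while self.heap:
            neg_score, url = heapq.heappop(self.heap)
            # Skip entries made stale by a later, higher score
            if url not in self.done and -neg_score == self.inlinks[url]:
                self.done.add(url)
                return url
        return None
```

Swapping the in-link count for another score (for example a topical relevance estimate in focused crawling) only changes what is pushed onto the heap.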
Focused crawling
Avoid links leading to content out of the topic of interest
Content of a page can be taken into account when deciding
if a particular link leads to relevant content
Setting a good seed is a challenge
Other Crawling Challenges
Re-visiting policy
Generating good seed URLs
Avoiding redundant content
Avoid visiting duplicate pages (different URLs leading to
identical or near-identical content)
- Near-duplicates might be very tricky (think of a news item
propagation on the Web)
Avoid crawler traps
Avoid useless content (i.e., web spam)
Future Directions
Collaborative crawling, mixed pull-push model
Understanding site structure
Deep Web crawling
Media content crawling
Social network crawling
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TB compressed
- Hosted at several cloud services (free license required) or
a copy can be ordered on hard disks (pay for disks)
ClueWeb12:
- Almost 900 million English web pages
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/
and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as Amazon Web Services’ public dataset (pay for
processing)
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
Internet Archive:
- See http://blog.archive.org/2012/10/26/
80-terabytes-of-archived-web-crawl-data-available-for-resea
- Crawl of 2011
- 80TB WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request
References: Crawl Datasets
LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- Variety of web graph datasets (nodes, arcs, etc.) including
basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications
ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130mln blog posts and 230mln social media publications
- 2TB compressed
Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities’ web sites
References: Literature
For beginners: Udacity/CS101 course;
http://www.udacity.com/overview/Course/cs101
Intermediate: Chapter 20 of Introduction to Information
Retrieval book by Manning, Raghavan, Schütze;
http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Advanced: Web Crawling by Olston and Najork;
http://www.nowpublishers.com/product.aspx?product=
INR&doi=1500000017
References: Literature
See relevant publications at Mendeley:
http://www.mendeley.com/groups/531771/web-crawling/
Feel free to join the group!
Check ’Deep Web’ group too
http://www.mendeley.com/groups/601801/deep-web/

Más contenido relacionado

La actualidad más candente

When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDBMongoDB
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 
Asynchronous javascript
 Asynchronous javascript Asynchronous javascript
Asynchronous javascriptEman Mohamed
 
OVN DBs HA with scale test
OVN DBs HA with scale testOVN DBs HA with scale test
OVN DBs HA with scale testAliasgar Ginwala
 
Large scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsLarge scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsHan Zhou
 
OVN operationalization at scale at eBay
OVN operationalization at scale at eBayOVN operationalization at scale at eBay
OVN operationalization at scale at eBayAliasgar Ginwala
 
Introduction to AngularJS
Introduction to AngularJSIntroduction to AngularJS
Introduction to AngularJSDavid Parsons
 
Angular App Presentation
Angular App PresentationAngular App Presentation
Angular App PresentationElizabeth Long
 
Distributed Hash Table and Consistent Hashing
Current challenges in web crawling

  • 1. ICWE’13 Tutorial: CURRENT CHALLENGES IN WEB CRAWLING. Denis Shestakov (denshe at gmail-dot-com), Department of Media Technology, School of Science, Aalto University, Finland. Version 1.5: 09.03.2015; Version 1.4: 08.07.2013.
  • 2. Denis Shestakov Current Challenges in Web Crawling ICWE’13, Aalborg, Denmark, 08.07.2013 2/80 References to this tutorial To cite please use: D. Shestakov, "Current Challenges in Web Crawling," in Proc. ICWE 2013, 2013, pp. 518-521. [BibTeX]
  • 3. Speaker’s Bio: (2009-2013) Postdoc in Web Services Group, Aalto University, Finland. PhD thesis (2008) on limited coverage of web crawlers. Over ten years of experience in web crawling. Image: Web Services Group in 2011.
  • 4. Speaker’s Info (as of 2013 and current): http://www.linkedin.com/in/dshestakov; http://www.mendeley.com/profiles/denis-shestakov/; http://www.researchgate.net/profile/Denis_Shestakov; https://mediatech.aalto.fi/~denis/
  • 5. Tutorial Outline. OVERVIEW: Web crawling in a nutshell; Web structure & statistics; Large-scale crawling. Break. CHALLENGES: Collaborative web crawling; Crawling the deep Web; Crawling multimedia content; Future directions.
  • 6. PART I: OVERVIEW. Visualization of http://media.tkk.fi/webservices by the aharef.info applet.
  • 7. Outline of Part I: Overview of Web Crawling; Web Crawling in a Nutshell; Applications; Industry vs. Academia; Web Ecosystem and Crawling; Web Structure & Statistics; Large-scale crawling; Basic architecture; Implementations; Design issues and considerations.
  • 8. Web Crawling in a Nutshell: Automatic harvesting of web content. Done by web crawlers (also known as robots, bots or spiders). Follow a link from a set of links (the URL queue), download the page, extract all links, eliminate those already visited, add the rest to the queue; then repeat. A set of policies is involved (like ‘ignore links to images’, etc.).
  • 9. Web Crawling in a Nutshell. Example: 1. Follow http://media.tkk.fi/webservices (the slide shows a visualization of its HTML DOM tree). 2. Extract URLs inside the blue bubbles (designating <a> tags). 3. Remove already visited URLs. 4. For each non-visited URL, start at Step 1.
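The four steps above can be sketched in a few dozen lines. This is a minimal illustration, not the tutorial's own code: the names `LinkExtractor` and `crawl` are hypothetical, and `fetch` is an assumed callable that returns a page's HTML (or `None` on failure), so the loop can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags (Step 2)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at seed; returns URLs in download order."""
    queue = deque([seed])          # the URL queue
    visited = set()
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:         # Step 3: skip already visited URLs
            continue
        visited.add(url)
        html = fetch(url)          # Step 1: download the page
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in visited:
                queue.append(absolute)   # Step 4: repeat for new URLs
    return order
```

In a real crawler `fetch` would wrap an HTTP client; here any mapping's `get` method works, e.g. `crawl("http://example.org/", pages.get)` for a dict `pages` of URL to HTML.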
  • 10. Web Crawling in a Nutshell: In essence, a simple and naive process. However, a number of imposed ‘restrictions’ make it much more complicated. Most complexities are due to the operating environment (the Web). For example, do not overload web servers (challenging, as the distribution of web pages over web servers is non-uniform). Or avoid web spam (not only useless, but it also consumes resources and often spoils the collected content).
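One common way to avoid overloading servers is to enforce a minimum delay between requests to the same host. The sketch below assumes that approach; the class name `PolitenessScheduler` and the default delay are illustrative choices, not something the slides prescribe. Time is passed in explicitly so the logic can be checked without sleeping.

```python
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay   # seconds between hits on one host (assumed value)
        self.next_allowed = {}       # host -> earliest allowed request time

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before fetching url."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Call right after a request is sent to url at time now."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay
```

Because many pages sit on few hosts, a crawler using such a scheduler typically keeps one queue per host and picks the next URL from whichever host is currently allowed.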
  • 11. Web Crawling in a Nutshell. Crawler Agents: First in 1993: the Wanderer (written in Perl). Over 1100 different crawler signatures (the User-Agent string in the HTTP request header) are listed at http://www.crawltrack.net/crawlerlist.php. Educated guess on the overall number of different crawlers: at least several thousand. Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing). Or use an existing agent, e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/).
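A crawler's signature is simply the User-Agent header it sends. A minimal sketch with Python's standard library follows; the crawler name and contact URL are made up for illustration.

```python
from urllib.request import Request

# Hypothetical crawler name and info page, used only for illustration.
USER_AGENT = "ExampleCrawler/0.1 (+http://example.org/bot.html)"


def build_request(url):
    """Build an HTTP request that identifies the crawler via User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to `urllib.request.urlopen` would send the request with that signature; including a contact URL lets site owners find out who is crawling them.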
  • 12. Web Crawling in a Nutshell. Crawler Agents: For advanced things, you may modify the code of existing projects written in your preferred programming language. Crawlers play a big role on the Web: they bring more traffic to certain web sites than human visitors; they generate a sizeable portion of traffic to any (public) web site; crawler traffic is important for emerging web sites.
  • 13. Web Crawling in a Nutshell. Classification: General/universal crawlers: not so many of them, lots of resources required; big web search engines. Topical/focused crawlers: pages/sites on a certain topic; crawling everything in one specific (i.e., national) web segment is rather general, though. Batch crawling: one or several (static) snapshots. Incremental/continuous crawling: re-visiting; resources divided between fetching newly discovered pages and re-downloading previously crawled pages; search engines.
  • 14. Applications of Web Crawling. Web Search Engines: Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ... One of three underlying technology stacks.
  • 15. Applications of Web Crawling. Web Search Engines: One of three underlying technology stacks. BTW, what are the other two and which is the most ‘crucial’?
  • 16. Applications of Web Crawling. Web Search Engines: What are the other two and which is the most ‘crucial’? Query processor (particularly, ranking).
  • 17. Applications of Web Crawling. Web Archiving: Digital preservation; a ‘librarian’ look at the Web. The biggest: Internet Archive. Quite huge collections. Batch crawls. Primarily, collections of national web sites: web sites at country-specific TLDs or physically hosted in a country. There are quite many and some are huge! See the list of Web Archiving Initiatives on Wikipedia.
  • 18. Applications of Web Crawling: Vertical Search Engines. Aggregating data from many sources on a certain topic, e.g., apartment search, car search.
  • 19. Applications of Web Crawling: Web Data Mining. “To get data to be actually mined.” Usually done with focused crawlers. For example, opinion mining, or digests of current happenings on the Web (e.g., what music people are listening to now).
  • 20. Applications of Web Crawling: Web Monitoring. Monitoring sites/pages for changes and updates.
  • 21. Applications of Web Crawling: Detection of malicious web sites. Typically a part of an anti-virus, firewall, search engine, etc. service. Builds a list of such web sites and informs a user about the potential threat of visiting them.
  • 22. Applications of Web Crawling: Web site/application testing. Crawl a web site to check navigation through it, validity of the links, etc. Regression/security/... testing of a rich internet application (RIA) via crawling. Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out).
  • 23. Applications of Web Crawling: Fighting crime! :) Well, copyright violations. Crawl to find (media) items under copyright or links to them. Regular re-visiting of ’suspicious’ web sites, forums, etc. Tasks like finding terrorist chat rooms also go here.
  • 24. Applications of Web Crawling: Web Scraping. Extracting particular pieces of information from a group of typically similar pages. Used when an API to the data is not available. Interestingly, scraping might be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API.
  • 25. Applications of Web Crawling: Web Mirroring. Copying of web sites, often hosting the copies on different servers to ensure constant accessibility.
  • 26. Industry vs. Academia in the web crawling domain. Huge lag between industrial and academic web crawlers - research-wise and development-wise - algorithms, techniques, strategies used in industrial crawlers (namely, those operated by search engines) are poorly known. Industrial crawlers operate on a web scale (= tens of billions of pages) - only a few (three?) academic crawlers have dealt with more than one billion pages - academic scale is rather hundreds of millions.
  • 27. Industry vs. Academia. Re-crawling - batch crawls in academia - regular re-crawls by industrial crawlers. Evaluation of crawled data - and hence corrections/improvements to crawlers - direct evaluation by users of search engines - to some extent, artificial evaluation of academic crawls.
  • 28. Industry vs. Academia. Industrial (search engines’) crawlers are much more appreciated - eventually they attract visitors (= revenue/prestige/influence/...) - it makes perfect sense to trick them. Academic crawlers just consume resources (e.g., network bandwidth) - they don’t bring anything - no point in playing tricks on them (assuming the site administrator bothers to differentiate them from search engines’ bots).
  • 29. Web Ecosystem and Crawling: Pull vs. Push model. Web content providers (site owners) and web aggregators (crawler operators). The aggregator pulls content; content is not pushed to aggregators.
  • 30. Web Ecosystem and Crawling: Why not Push? Pull is just easier for both parties. No ’agreement’ between provider and aggregator. No specific protocols for content providers: serving content is enough. Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed.
  • 31. Web Ecosystem and Crawling: Why not Push? Still, the pull model has several disadvantages. What are these?
  • 32. Web Ecosystem and Crawling: Why not Push? Still, the pull model has several disadvantages. A push model would avoid redundant requests from crawlers and give providers more control over their content.
  • 33. Web Ecosystem and Crawling: Crawler politeness. Content providers possess some control over crawlers: via special protocols that define access to parts of a site, and via direct banning of agents that hit a site too often.
  • 34. Web Ecosystem and Crawling: Crawler politeness. Robots.txt says what can(not) be crawled. Sitemaps is a newer protocol through which a site lists the URLs it offers for crawling, along with other info about them. Example: no agent should visit any URL starting with “yoursite/notcrawldir”, except an agent called “goodsearcher”:
      User-agent: *
      Disallow: /notcrawldir

      User-agent: goodsearcher
      Disallow:
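The robots.txt rules on this slide can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the agent name `somebot` and the page URLs are made up for illustration, and the rules are parsed from a string rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, with the path written in robots.txt form.
ROBOTS_TXT = """\
User-agent: *
Disallow: /notcrawldir

User-agent: goodsearcher
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Return True if the given agent may fetch the URL under ROBOTS_TXT."""
    return parser.can_fetch(agent, url)


# "somebot" is a hypothetical crawler name; the URLs are placeholders.
print(allowed("somebot", "http://yoursite/notcrawldir/page.html"))
print(allowed("goodsearcher", "http://yoursite/notcrawldir/page.html"))
print(allowed("somebot", "http://yoursite/index.html"))
```

An empty `Disallow:` line means "nothing is disallowed", which is why the `goodsearcher` agent is let through.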
  • 35. Web Structure & Statistics: Some numbers. Number of pages per host is not uniform: most hosts contain only a few pages, others contain millions. Roughly 100 links on a page. Must try to keep all crawling threads busy. According to Google statistics (over 4 billion pages, 2010): fetching a page transfers 320KB (textual content plus all embeddings). A page has 10-100KB of textual (HTML) content on average. One trillion URLs known by Google/Yahoo in 2008.
  • 36. Web Structure & Statistics: Some numbers. 20 million web pages in 1995 (indexed by AltaVista). One trillion (10^12) URLs known by Google/Yahoo in 2008 - ’independent’ search engine called Majestic12 (P2P-crawling) confirms one billion items. This doesn’t mean one trillion indexed pages; supposedly, the index has dozens of times fewer pages. Cool crawler facts: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months - throughput: 1000-1500 pages per second - over 30 billion discovered URLs.
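The IRLbot figures are easy to cross-check with a little arithmetic. The sketch below assumes that "2 months" means roughly 60 days of continuous crawling, which the slide does not state.

```python
pages = 6.4e9                      # pages downloaded, from the slide
seconds = 60 * 24 * 3600           # assumption: "2 months" is about 60 days
rate = pages / seconds             # average pages per second

print(f"{rate:.0f} pages per second on average")
```

Under that assumption the average lands inside the quoted 1000-1500 pages per second range.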
  • 37. Web Structure & Statistics: Bow-tie model of the Web. Illustration taken from http://dx.doi.org/doi:10.1038/35012155
  • 38. Basic Crawler Architecture: Crawler crawls the Web. Illustration taken from CMSC 476/676 course slides by Charles Nicholas.
  • 39. Basic Crawler Architecture: Typically in a distributed fashion. Illustration taken from CMSC 476/676 course slides by Charles Nicholas.
  • 40. Basic Crawler Architecture: URL Frontier. Can include multiple pages from the same host. Must avoid trying to fetch them all at the same time. Must try to keep all crawling threads busy. Prioritization also helps.
  • 41. Basic Crawler Architecture: Crawler Architecture. Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
  • 42. Basic Crawler Architecture: DNS. Given a URL, retrieve the IP address of its host. Distributed service: lookup latencies can be high (seconds). Critical component. Common implementations of DNS lookup (e.g., nslookup) are synchronous: one request at a time. Remedies: asynchronous DNS resolving, pre-caching, batch DNS resolving.
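Caching and batch resolving can be combined in a few lines. The sketch below is an illustration, not how any particular crawler does it: the class name `CachingResolver` is made up, and a thread pool stands in for a truly asynchronous DNS library. The lookup function is injectable, so the default `socket.gethostbyname` can be swapped for a stub.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class CachingResolver:
    """Hypothetical helper: resolves hosts in parallel and remembers answers."""

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname,
                 workers: int = 16):
        self._lookup = lookup
        self._workers = workers
        self._cache: Dict[str, str] = {}

    def resolve_batch(self, hosts: List[str]) -> Dict[str, str]:
        # Deduplicate and skip hosts that are already cached.
        missing = [h for h in dict.fromkeys(hosts) if h not in self._cache]
        # Blocking lookups run in parallel threads, so their latencies overlap.
        with ThreadPoolExecutor(max_workers=self._workers) as pool:
            for host, ip in zip(missing, pool.map(self._lookup, missing)):
                self._cache[host] = ip
        return {h: self._cache[h] for h in hosts}
```

A failed lookup raises from `resolve_batch` in this sketch; a real crawler would record the failure and move on.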
  • 43. Basic Crawler Architecture: Content seen? If the fetched page is already in the base/index, don’t process it. Document fingerprints (shingles). Filtering: filter out URLs due to ’politeness’ and restrictions on the crawl. Fetched robots.txt files are cached to avoid fetching them repeatedly. Duplicate URL elimination: check if an extracted and filtered URL has already been passed to the frontier (batch crawling). More complicated in continuous crawling (different URL frontier implementation).
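To make the shingle idea concrete, here is a minimal sketch of word-level shingling with Jaccard similarity. The shingle length `k=4` and the sample sentences are arbitrary choices for illustration; production systems typically hash and sample the shingles rather than compare full sets.

```python
def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-word shingles (as tuples) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(doc1), shingles(doc2))
print(f"{similarity:.3f}")
```

The two sample sentences share five of seven distinct shingles, so a threshold on the similarity would flag them as near-duplicates.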
  • 44. Basic Crawler Architecture: Distributed Crawling. Run multiple crawl threads, under different processes (often at different nodes). Nodes can be geographically distributed. Partition the hosts being crawled among the nodes.
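One simple way to partition hosts among nodes is to hash the host name. The function name `node_for` and the choice of SHA-1 are assumptions for this sketch; the slide does not say which partitioning function is used.

```python
import hashlib
from urllib.parse import urlsplit


def node_for(url: str, num_nodes: int) -> int:
    """Map a URL to a crawl node so that all URLs of one host share a node."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


print(node_for("http://example.org/a.html", 8))
print(node_for("http://EXAMPLE.org/b/c.html", 8))  # same host, same node
```

Keeping a whole host on one node lets that node enforce per-host politeness without coordinating with the others.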
  • 45. Basic Crawler Architecture: Host Splitter. Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
  • 46. Implementations. Popular languages: Perl, Java, Python, C/C++. Libraries for HTTP fetching, HTML parsing, asynchronous DNS resolving. Open-source, in Java: Heritrix, Nutch.
  • 47. Implementations: Simple code example in Perl.
  • 48. Large-scale Crawling. Objectives: high web coverage, high page freshness, high content quality, high download rate. Internal (I) and external (E) factors: amount of hardware (I), network bandwidth (I), rate of web growth (E), rate of web change (E), amount of malicious content (e.g., spam, duplicates) (E).
  • 49. Large-scale Crawling: Architecture of a sequential crawler. Seeds: list of starting URLs. Order of page visits is determined by the frontier data structure. Stop condition (e.g., X pages fetched). Illustration taken from Ch. 8 ’Web Crawling’ by Filippo Menczer in Bing Liu’s Web Data Mining (Springer, 2007).
  • 50. Large-scale Crawling: Graph Traversal. Breadth-first search - implemented with a QUEUE (FIFO) - visits pages with the shortest paths first. Depth-first search - implemented with a STACK (LIFO).
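The difference between the two traversals comes down to which end of the frontier the next URL is taken from. The sketch below runs on a made-up in-memory link graph (`TOY_WEB`) instead of the real Web, and `max_pages` plays the role of the stop condition.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def crawl(seeds, links, max_pages, breadth_first=True):
    """Visit pages starting from the seeds; return them in visiting order."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        # FIFO (queue) gives breadth-first, LIFO (stack) gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order


print(crawl(["A"], TOY_WEB, max_pages=10, breadth_first=True))
print(crawl(["A"], TOY_WEB, max_pages=10, breadth_first=False))
```

Breadth-first reaches both pages one link away from the seed before going deeper, while depth-first follows one branch to its end first.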
  • 51. Large-scale Crawling: Some implementation notes. Get only the first part of pages (10-100KB). Detect redirection loops. Handle all possible errors (e.g., server not responding), timeouts, etc. Deal with lots of invalid HTML. Take care of dynamic pages - some are ’spider traps’ (think of a ’Next month’ link on a calendar) - e.g., limit the number of pages per host.
  • 52. Large-scale Crawling: Delays in crawling. Resolving a host to an IP address. Connecting a socket to the server and sending the request. Receiving the requested page in response. Overlap delays by fetching many pages concurrently.
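Overlapping the delays can be illustrated with `asyncio`. In this sketch `fake_fetch` only sleeps to imitate network latency (no real requests are made), and the `example.org` URLs and the concurrency limit are placeholders.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> str:
    """Stand-in for a real download: waits, then returns dummy content."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_all(urls, delay=0.05, max_concurrent=10):
    """Fetch all URLs with at most max_concurrent downloads in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with gate:
            return await fake_fetch(url, delay)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(one(u) for u in urls))


urls = [f"http://example.org/{i}" for i in range(20)]
pages = asyncio.run(fetch_all(urls))
print(len(pages), "pages fetched")
```

With ten downloads in flight, twenty simulated fetches take about two delay periods instead of twenty.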
  • 53. Large-scale Crawling: Architecture of a concurrent crawler. Illustration taken from Ch. 8 ’Web Crawling’ by Filippo Menczer in Bing Liu’s Web Data Mining (Springer, 2007).
  • 54. Large-scale Crawling: Design points: frontier data structure. Most links on a page refer to the same site/server - note: remember virtual hosting. Problem with a FIFO queue: too many requests to the same server. A common policy is to delay the next request by, say, 10 x the time it took to download the last page from that server. ’Mercator’ scheme: add further queues behind the frontier queue.
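The "10 x download time" policy can be sketched with one queue per host plus a heap ordered by the earliest time each host may be contacted again. This is a simplified illustration in the spirit of per-host queues, not the actual Mercator design; the class name `PoliteFrontier` and its methods are made up, and time is passed in explicitly to keep the example deterministic.

```python
import heapq
from collections import defaultdict, deque


class PoliteFrontier:
    """Hypothetical frontier: per-host queues with a politeness delay."""

    def __init__(self, factor: float = 10.0):
        self.factor = factor
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready = []                   # heap of (next_allowed_time, host)
        self.active = set()               # hosts in the heap or being fetched

    def add(self, host, url, now=0.0):
        self.queues[host].append(url)
        if host not in self.active:
            self.active.add(host)
            heapq.heappush(self.ready, (now, host))

    def next_url(self, now):
        """Return (host, url) if some host may be contacted now, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        return host, self.queues[host].popleft()

    def done(self, host, now, download_time):
        """Report a finished download; reschedule the host after the delay."""
        if self.queues[host]:
            wake = now + self.factor * download_time
            heapq.heappush(self.ready, (wake, host))
        else:
            self.active.discard(host)
```

A slow server is revisited less often under this policy, since its own download time stretches the delay.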
  • 55. Large-scale Crawling: Design points: URL seen test. Purpose: avoid adding multiple instances of a URL to the frontier. For batch crawling, two operations are required: insertion and membership testing. For continuous crawling, one more operation: deletion. URLs are stored compressed (e.g., as a 10-byte hash value). In-memory implementations: hash table, Bloom filter. Search engines keep all URLs in memory in the crawling cluster (hash table partitioned across nodes; partitioning can be based on the host part of the URL).
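A Bloom filter supports exactly the two batch-crawling operations, insertion and membership testing, in fixed memory. The sketch below is a minimal version; the bit-array size, the number of hash functions, and the use of SHA-256 with double hashing are illustrative choices rather than anything prescribed by the slides.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter for a URL-seen test (no deletion support)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url: str):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(url))


seen = BloomFilter()
seen.add("http://example.org/a")
print("http://example.org/a" in seen)
```

A Bloom filter can report a URL as seen when it was not (a false positive), so a small fraction of pages may be skipped; it never forgets a URL that was added. The lack of deletion is one reason continuous crawling needs a different structure.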
  • 56. Large-scale Crawling: Design points: URL seen test. If an in-memory structure is not possible, a disk-based hash table with caching is used. This limits the crawling rate to tens of pages per second, as disk lookups are slow. To scale, sequential reads/writes are used, since they are faster. ’Mercator/IRLbot’ scheme: merging (reading and writing) the sorted hashes of visited URLs on disk with the hashes of ’just extracted’ URLs. The delay due to batch merging is manageable.
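The batch-merge idea can be shown with plain sorted lists standing in for the on-disk file of visited URL hashes. This is a simplified sketch of the general technique, not the Mercator or IRLbot code; `merge_batch` is a made-up name and small integers stand in for hash values.

```python
def merge_batch(seen_sorted, batch):
    """Merge a batch of new hashes into the sorted list of seen hashes.

    Returns (merged, fresh): the updated sorted list and the hashes from
    the batch that were not seen before. Both inputs are read sequentially.
    """
    new = sorted(set(batch))
    merged, fresh = [], []
    i = j = 0
    while i < len(seen_sorted) or j < len(new):
        if j == len(new) or (i < len(seen_sorted) and seen_sorted[i] < new[j]):
            merged.append(seen_sorted[i])
            i += 1
        elif i < len(seen_sorted) and seen_sorted[i] == new[j]:
            merged.append(seen_sorted[i])  # already seen: keep one copy
            i += 1
            j += 1
        else:
            merged.append(new[j])          # unseen hash: goes to the frontier
            fresh.append(new[j])
            j += 1
    return merged, fresh


merged, fresh = merge_batch([1, 4, 7], [4, 2, 9, 2])
print(merged, fresh)
```

Only the `fresh` hashes correspond to URLs that should enter the frontier; the pass over the seen list is sequential, which is what makes the disk-based variant scale.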
  • 57. PART II: CHALLENGES
  • 58. Outline of Part II. Challenges in Web Crawling: Collaborative Crawling; Deep Web Crawling (crawling content behind search forms, crawling JavaScript-rich web sites); Crawling Multimedia; Other Challenges in Crawling. Future Directions. References.
  • 59. Collaborative Crawling: Main considerations. Lots of redundant crawling. To get data (often on a specific topic), one needs to crawl broadly - often a lack of expertise when a large crawl is required - often, crawl a lot, use only a small subset. Too many redundant requests for content providers. Idea: have one crawler doing a very broad and intensive crawl and many parties accessing the crawled data via an API - specify filters to select required pages. Crawler as a common service.
  • 60. Collaborative Crawling: Some requirements. Filter language for specifying conditions. Efficient filter processing (millions of filters to process). Efficient fetching (hundreds of pages per second). Support for real-time requests.
  • 61. Collaborative Crawling: New component. Process a stream of documents against a filter index.
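One way to picture a filter index is as an inverted index from terms to the filters that require them, so each document only touches the filters that share a term with it. The sketch below assumes a very simple filter language (a filter matches when all of its terms occur in the document); the class `FilterIndex` is hypothetical and is not the design from the work these slides are based on.

```python
from collections import defaultdict


class FilterIndex:
    """Hypothetical index: filters are sets of terms that must all occur."""

    def __init__(self):
        self.by_term = defaultdict(set)  # term -> ids of filters requiring it
        self.size = {}                   # filter id -> number of required terms

    def add_filter(self, filter_id, required_terms):
        terms = {t.lower() for t in required_terms}
        self.size[filter_id] = len(terms)
        for term in terms:
            self.by_term[term].add(filter_id)

    def match(self, document: str):
        """Return the ids of all filters fully satisfied by the document."""
        hits = defaultdict(int)
        for term in set(document.lower().split()):
            for filter_id in self.by_term.get(term, ()):
                hits[filter_id] += 1
        return sorted(f for f, n in hits.items() if n == self.size[f])


index = FilterIndex()
index.add_filter("cars", ["used", "car"])
index.add_filter("flats", ["apartment", "rent"])
print(index.match("Used car for sale in Aalborg"))
```

The cost per document depends on how many filters mention its terms, not on the total number of registered filters, which is the property that matters with millions of filters.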
  • 62. Collaborative Crawling: Filter processing architecture.
  • 63. Collaborative Crawling: Filter processing architecture.
  • 64. Collaborative Crawling. Based on ’The architecture and implementation of an extensible web crawler’ by Hsieh, Gribble, Levy, 2010 (illustrations on slides 61-62 from Hsieh’s slides). E.g., 80legs provides similar crawling services. In a way, it reconsiders the pull/push model of content delivery on the Web.
  • 65. Deep Web Crawling. Visualization of http://amazon.com by aharef.info applet.
  • 66. Deep Web Crawling: In a nutshell. The problem is in the yellow nodes (designating web form elements).
  • 67. Deep Web Crawling. See slides on deep Web crawling at http://goo.gl/Oohoo
  • 68. Crawling Multimedia Content. The Web is now a multimedia platform. Images, video, audio are an integral part of web pages (not just supplementing them). Almost all crawlers, however, treat the Web as a textual repository. One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications. Hence, no real need to harvest multimedia. But state-of-the-art multimedia retrieval/computer vision techniques already provide adequate search quality. E.g., search for images with a cat and a man based on actual image content (not text around/close to the image). In the case of video: a set of frames plus audio (can be converted to textual form).
  • 69. Crawling Multimedia Content: Challenges in crawling multimedia. Bigger load on web sites since files are bigger. More apparent copyright issues. More resources (e.g., bandwidth, storage space) required from a crawler. More complicated duplicate resolution. Re-visiting policy.
  • 70. Crawling Multimedia Content: Approaches. Utilize metadata info (fetch and analyse a small metadata file to decide on the full download). Intelligent crawling: better ranking of URLs in the frontier (based on the specified domain of the crawl). Move from the pull to the push model. API-directed crawling - access to data via predefined APIs - need for annotation/discovery of such APIs. Technically: use an additional component for the multimedia crawl - with its own URL queue - the main crawler component provides it with URLs to multimedia - in return, it sends feedback to the main crawler to better score links in the frontier.
  • 71. Crawling Multimedia Content. Scalable Multimedia Web Observatory of the ARCOMEM project (http://www.arcomem.eu). Focus on web archiving issues. Uses several crawlers - ’standard’ crawler for regular web pages - API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube, etc.) - deep Web crawler able to extract information from pre-defined web sites. Data can be exported in WARC (Web ARChive) files and in RDF.
  • 72. Other Crawling Challenges. Ordering policy: resources are limited, while the number of pages to visit is essentially infinite. Decisions should be made based on the URL itself. PageRank-like metrics can be used. More complicated in the case of incremental crawls. Focused crawling: avoid links leading to content outside the topic of interest. The content of a page can be taken into account when deciding if a particular link leads to on-topic content. Setting a good seed is a challenge.
  • 73. Other Crawling Challenges. Re-visiting policy. Generating good seed URLs. Avoiding redundant content: avoid visiting duplicate pages (different URLs leading to identical or near-identical content) - near-duplicates might be very tricky (think of a news item propagating on the Web). Avoid crawler traps. Avoid useless content (e.g., web spam).
  • 74. Future Directions. Collaborative crawling, mixed pull-push model. Understanding site structure. Deep Web crawling. Media content crawling. Social network crawling.
  • 75. References: Crawl Datasets. Use for building your crawls, web graph analysis, web data mining tasks, etc. ClueWeb09 Dataset: - http://lemurproject.org/clueweb09.php/ - one billion web pages, in ten languages - 5TB compressed - hosted at several cloud services (free license required) or a copy can be ordered on hard disks (pay for disks). ClueWeb12: - almost 900 million English web pages.
  • 76. References: Crawl Datasets. Use for building your crawls, web graph analysis, web data mining tasks, etc. Common Crawl Corpus: - see http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740 - around six billion web pages - over 100TB uncompressed - available as Amazon Web Services’ public dataset (pay for processing).
  • 77. References: Crawl Datasets. Use for building your crawls, web graph analysis, web data mining tasks, etc. Internet Archive: - see http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-resea - crawl of 2011 - 80TB of WARC files - 2.7 billion pages - includes multimedia data - available by request.
  • 78. References: Crawl Datasets. LAW Datasets: - http://law.dsi.unimi.it/datasets.php - variety of web graph datasets (nodes, arcs, etc.) including basic properties of recent Facebook graphs (!) - thoroughly studied in a number of publications. ICWSM 2011 Spinn3r Dataset: - http://www.icwsm.org/data/ - 130 million blog posts and 230 million social media publications - 2TB compressed. Academic Web Link Database Project: - http://cybermetrics.wlv.ac.uk/database/ - crawls of national universities’ web sites.
  • 79. References: Literature. For beginners: Udacity/CS101 course; http://www.udacity.com/overview/Course/cs101 Intermediate: Chapter 20 of the Introduction to Information Retrieval book by Manning, Raghavan, Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
  • 80. References: Literature. See relevant publications at Mendeley: http://www.mendeley.com/groups/531771/web-crawling/ Feel free to join the group! Check the ’Deep Web’ group too: http://www.mendeley.com/groups/601801/deep-web/