It is all about data.
Having the right data at the right time can make the difference between you and your competitors. Google can show you only what it manages to catch. If you know where to find the data you are interested in, let's go deeper and roll your own web crawler framework.
Taking advantage of the latest cool technologies, I will show you how to build a distributed web crawler based on Redis and MongoDB.
1. Get the Data you want, because you want the Data now!
Francesco Laurita
RubyDay 2013, Milan - Italy
Roll your own Web Crawler
2. What is a web crawler?
“A Web crawler is an Internet bot that systematically browses the World
Wide Web, typically for the purpose of Web indexing.”
http://en.wikipedia.org/wiki/Web_crawler
3. How does it work?
1. Start with a list of URLs to visit (the seeds)
2. Fetch each URL and extract all of the hyperlinks in the page, adding them to the list of URLs to visit (push)
   1. The page content is stored somewhere
   2. The visited URL is marked as visited
3. URLs are recursively visited
Directed graph
Queue (FIFO)
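A minimal single-process sketch of this loop in Ruby, assuming open-uri and nokogiri are available (the seed URL and the in-memory queue/visited set are illustrative, not Polipus code):

require "open-uri"
require "nokogiri"
require "set"

queue   = ["http://example.com/"]      # 1. start with the seeds
visited = Set.new

until queue.empty?
  url = queue.shift                    # FIFO pop
  next if visited.include?(url)
  page = Nokogiri::HTML(URI.open(url)) # fetch and parse the page
  visited << url                       # mark the URL as visited
  page.css("a[href]").each do |a|      # 2. extract hyperlinks and push them
    begin
      queue << URI.join(url, a["href"]).to_s
    rescue URI::Error
      next                             # skip malformed links
    end
  end
end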
4. How does it work?
A Web Crawler is able to “walk” a “WebGraph”.
A WebGraph is a directed graph whose vertices are pages, with a directed edge from page A to page B if page A links to page B.
Directed graph
Queue (FIFO)
5. Generic Web Crawler Infrastructure
While it's fairly easy to write a standalone single-instance crawler, building a distributed and scalable system that can download millions of pages over weeks is not.
6. Why should you roll your own Web Crawler?
Universal Crawlers:
* General purpose
* Most interesting content first (PageRank)
Focused Crawlers:
* Better accuracy
* Only certain topics
* Highly selective
* Not only for search engines
Ready to be used for a machine-learning engine as a service, a data warehouse and so on
12. Polipus (because octopus was taken)
A distributed, easy-to-use, DSL-ish web crawler framework written in Ruby (usage sketch below)
* Distributed and scalable
* Easy to use
https://github.com/taganaka/polipus
Heavily inspired by Anemone
* Well designed
* Easy to use
* Not distributed
* Not Scalable
https://github.com/chriskite/anemone
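To give a flavor of Polipus's DSL, here is a minimal usage sketch along the lines of the project README (job name, seed URL and callback body are illustrative):

require "polipus"

Polipus.crawler("rubygems", "https://rubygems.org/") do |crawler|
  # invoked for every downloaded page; page wraps the URL, the body
  # and a parsed Nokogiri document
  crawler.on_page_downloaded do |page|
    puts "Downloaded: #{page.url}"
  end
end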
14. Polipus: Under the hood
Redis
(What is it?)
* Is a NoSQL DB
* Is an advanced Key/Value Store
* Is a caching server
* Is a lot of things...
15. Polipus: Under the hood
Redis
(What is it?)
* It is a way to share memory over TCP/IP
It can share in-memory data structures between different processes (sketched below):
* List (LinkedList) --> queue.pop, queue.push
* Hash --> {}
* Set --> Set
* SortedSet --> SortedSet.new
* ....
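A quick sketch of that mapping using the redis gem (key names are illustrative):

require "redis"
redis = Redis.new

redis.rpush("urls", "http://example.com/")  # List: queue.push
url = redis.lpop("urls")                    # List: queue.pop

redis.hset("page:1", "status", "fetched")   # Hash: {}
redis.hget("page:1", "status")              # => "fetched"

redis.sadd("visited", url)                  # Set
redis.sismember("visited", url)             # => true

redis.zadd("ranking", 1.0, url)             # SortedSet with scores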
16. Polipus: Under the hood
Redis
* Reliable and Distributed Queue
1) A producer pushes a URL to visit into the queue
RPUSH
2) A consumer fetches the URL and at the same time pushes it into a processing list
RPOPLPUSH (non-blocking) / BRPOPLPUSH (blocking)
An additional client may monitor the processing list for items that remain there too long, pushing timed-out items back into the queue when needed.
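A sketch of this pattern with the redis gem; fetch_and_parse is a hypothetical worker step and the key names are illustrative:

require "redis"
redis = Redis.new

# producer: enqueue a URL to visit
redis.rpush("urls:pending", "http://example.com/")

# consumer: atomically move the URL onto a processing list (blocking pop)
url = redis.brpoplpush("urls:pending", "urls:processing", timeout: 0)
begin
  fetch_and_parse(url)                    # hypothetical fetch-and-parse step
  redis.lrem("urls:processing", 1, url)   # done: drop it from the processing list
rescue StandardError
  # on failure the URL stays on urls:processing, where a watchdog
  # client can spot stale items and RPUSH them back onto urls:pending
end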
17. Polipus: Under the hood
Redis
* Reliable and Distributed Queue
https://github.com/taganaka/redis-queue
18. Polipus: Under the hood
Redis
* URL Tracker
A crawler should know whether a URL has already been visited or is about to be visited
* SET
(a = Set.new; a << url; a.include?(url))
* Bloom Filter (SETBIT / GETBIT)
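The exact-tracking option maps directly onto a Redis SET; a minimal sketch (key name illustrative):

require "redis"
redis = Redis.new

redis.sadd("urls:seen", "http://example.com/")      # mark as visited
redis.sismember("urls:seen", "http://example.com/") # => true: skip it next time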
19. Polipus: Under the hood
Redis
Bloom Filter:
“A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.”
http://en.wikipedia.org/wiki/Bloom_filter
20. Polipus: Under the hood
Redis
Bloom Filter:
* Very space efficient! 1,000,000 elements take ~2MB on Redis
* With a cost: false positives are possible, while false negatives are not
With a false-positive probability of 0.1%, for every 1M pages about 1k of them might be erroneously marked as already visited
Using a SET: no errors at all, but 1,000,000 elements take ~150MB on Redis
https://github.com/taganaka/redis-bloomfilter
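To illustrate the SETBIT/GETBIT idea, here is a hand-rolled sketch (not the redis-bloomfilter gem itself); the class name, bit count m and hash count k are illustrative and should be tuned for the target false-positive rate:

require "redis"
require "digest"

class TinyBloom
  def initialize(redis, key, m: 10_000_000, k: 7)
    @redis, @key, @m, @k = redis, key, m, k
  end

  def add(value)
    positions(value).each { |pos| @redis.setbit(@key, pos, 1) }
  end

  def include?(value)
    positions(value).all? { |pos| @redis.getbit(@key, pos) == 1 }
  end

  private

  # derive k bit positions from SHA1 digests of the value
  def positions(value)
    (0...@k).map { |i| Digest::SHA1.hexdigest("#{i}:#{value}").to_i(16) % @m }
  end
end

bloom = TinyBloom.new(Redis.new, "urls:bloom")
bloom.add("http://example.com/")
bloom.include?("http://example.com/") # => true (rarely, a false positive)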
21. Polipus: Under the hood
MongoDB
1) MongoDB is used mainly for storing pages
2) Pages are stored using an upsert command, so that a document can easily be updated when the same content is crawled again
3) By default the body of the page is compressed in order to save disk space
4) No lookup query is needed to check for already-stored pages, thanks to the Bloom filter
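A sketch of the storage step with the Ruby mongo driver; the pages collection and the digest-of-URL key are illustrative, not Polipus's exact schema:

require "mongo"
require "zlib"
require "digest"

client = Mongo::Client.new("mongodb://localhost:27017/crawler")

def store_page(client, url, body)
  client[:pages].update_one(
    { _id: Digest::MD5.hexdigest(url) },                    # stable key per URL
    { "$set" => {
        url:  url,
        body: BSON::Binary.new(Zlib::Deflate.deflate(body)) # compressed body
      } },
    upsert: true  # insert if new, update on a fresh crawl of the same URL
  )
end

store_page(client, "http://example.com/", "<html>...</html>")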
23. Is it so easy?!
Not really...
1) Redis is an in-memory database
2) A queue of URLs can grow very fast
3) A queue of 1M URLs takes about 370MB on Redis (about 400 chars per entry)
4) MongoDB will eat your disk space: 50M saved pages take around 400GB
Suggested Redis conf:
maxmemory 2.5GB (or whatever your instance can handle)
maxmemory-policy noeviction
After 6M items I got Redis to refuse writes
24. An experiment using the currently available code
Setup:
6x t1.micro (web crawlers, 5 workers each)
1x m1.medium (Redis and MongoDB)
MongoDB with default settings
Redis
maxmemory 2.5GB
maxmemory-policy noeviction
~4,700,000 pages downloaded in 24h
...then I ran out of disk space because of MongoDB
25. TODO
• Redis memory guard
  • Should be able to move items from the Redis queue to MongoDB if the queue size hits a threshold, and move them back onto Redis at some point
• Honor the robots.txt file
  • So that we respect Disallow directives, if any
• Add support for Ruby Mechanize
  • Maintain browsing sessions
  • Fill in and submit forms