Get the Data you want, because you want the Data now!
Francesco Laurita
RubyDay 2013, Milan - Italy
Roll your own Web Crawler
Friday, June 14, 13
What is a web crawler?
“A Web crawler is an Internet bot that systematically browses the World
Wide Web, typically for the purpose of Web indexing.”
http://en.wikipedia.org/wiki/Web_crawler
How does it work?
1. Start with a list of URLs to visit (seeds)
2. Fetch each page, get all of its hyperlinks, and add them to the list of URLs to visit (push)
   1. The page content is stored somewhere
   2. The visited URL is marked as visited
3. URLs are recursively visited
Directed graph
Queue (FIFO)
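The steps above can be sketched as a breadth-first walk. This is a minimal illustration, not Polipus code: the `WEB` hash is a hypothetical in-memory stand-in for real HTTP fetching and link extraction, and `crawl` is a name made up for this example.

```ruby
require "set"

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
WEB = {
  "http://example.com/"  => ["http://example.com/a", "http://example.com/b"],
  "http://example.com/a" => ["http://example.com/b", "http://example.com/c"],
  "http://example.com/b" => ["http://example.com/"], # cycle back to the seed
  "http://example.com/c" => []
}.freeze

def crawl(seeds, web)
  queue = seeds.dup      # FIFO queue of URLs still to visit
  seen  = Set.new(seeds) # URLs already visited or already queued
  store = {}             # "somewhere" to keep page content: url => links

  until queue.empty?
    url = queue.shift            # take the oldest URL first (FIFO)
    links = web.fetch(url, [])   # stands in for download + link extraction
    store[url] = links           # store the page content
    links.each do |link|
      # Set#add? returns nil when the link is already known, so each URL
      # is queued at most once even when the graph contains cycles.
      queue.push(link) if seen.add?(link)
    end
  end

  store.keys # URLs in the order they were visited
end

puts crawl(["http://example.com/"], WEB)
```

The loop is iterative rather than literally recursive; the queue plays the role of the recursion, and the `seen` set is what keeps a cyclic graph from being walked forever.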
How does it work?
A web crawler is able to “walk” a “WebGraph”
A WebGraph is a directed graph whose vertices are pages; a directed edge
connects page A to page B if page A contains a link to page B
Directed graph
Queue (FIFO)
Generic Web Crawler Infrastructure
While it’s fairly easy to build a standalone single-instance crawler,
building a distributed and scalable system that can download millions of
pages over weeks is not
Why should you roll your own Web Crawler?
Universal Crawlers:
* General purpose
* Most interesting content (PageRank)
Focused Crawlers:
* Better accuracy
* Only certain topics
* Highly selective
* Not only for search engines
Ready to be used for a Machine Learning Engine as a Service,
a data warehouse, and so on
Sentiment Analysis
Finance
A.I., Machine Learning, Recommendation Engine as a Service
Last but not least....
Polipus (because octopus was taken)
A distributed, easy-to-use, DSL-ish web crawler framework written in Ruby
* Distributed and scalable
* Easy to use
https://github.com/taganaka/polipus
Heavily inspired by Anemone
* Well designed
* Easy to use
* Not distributed
* Not Scalable
https://github.com/chriskite/anemone
Polipus in action
Polipus: Under the hood
Redis
(What is it?)
* Is a NoSQL DB
* Is an advanced Key/Value Store
* Is a caching server
* Is a lot of things...
Polipus: Under the hood
Redis
(What is it?)
* It is a way to share memory over TCP/IP
It can share memory (data structures) between different processes
* List (LinkedList) --> queue.pop, queue.push
* Hash --> {}
* Set --> Set
* SortedSet --> SortedSet.new
* ....
Polipus: Under the hood
Redis
* Reliable and Distributed Queue
1) A producer pushes a URL to visit into the queue
LPUSH
2) A consumer fetches the URL and at the same time pushes
it into a processing list
RPOPLPUSH (non-blocking) / BRPOPLPUSH (blocking)
An additional client may monitor the processing list for
items that remain there for too long, and push
those timed-out items back into the queue if needed.
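The pattern can be sketched without a Redis server. What follows is an in-memory stand-in for the two Redis lists, not the redis-queue gem and not redis-rb; the class name `ReliableQueue`, its methods, and the `timeout:` parameter are assumptions made for this illustration. The sketch inserts at the head and pops from the tail so that ordering stays FIFO.

```ruby
# In-memory model of the reliable-queue pattern (hypothetical names).
class ReliableQueue
  attr_reader :queue, :processing

  def initialize(timeout:)
    @queue = []       # pending URLs; index 0 is the head of the list
    @processing = []  # [url, claimed_at] pairs currently held by consumers
    @timeout = timeout
  end

  # Producer: insert at the head of the pending list.
  def push(url)
    @queue.unshift(url)
  end

  # Consumer: take from the tail and record the item as in-flight in one
  # step. On a real server RPOPLPUSH does this atomically.
  def pop(now: Time.now)
    url = @queue.pop
    @processing.unshift([url, now]) if url
    url
  end

  # Consumer: acknowledge a finished URL so it is never re-queued.
  def commit(url)
    index = @processing.index { |u, _| u == url }
    @processing.delete_at(index) if index
  end

  # Monitor: move items that stayed in-flight longer than the timeout
  # back to the tail of the pending list, so they are retried first.
  def requeue_stale(now: Time.now)
    stale, fresh = @processing.partition { |_, claimed_at| now - claimed_at > @timeout }
    @processing = fresh
    stale.each { |url, _| @queue.push(url) }
    stale.map(&:first)
  end
end

q = ReliableQueue.new(timeout: 10)
q.push("http://example.com/a")
url = q.pop      # a worker claims the URL
q.commit(url)    # ...and acknowledges it after a successful fetch
```

If a worker crashes between `pop` and `commit`, the URL is still sitting in the processing list, which is what lets the monitor recover it.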
Polipus: Under the hood
Redis
* Reliable and Distributed Queue
https://github.com/taganaka/redis-queue
Polipus: Under the hood
Redis
* URL Tracker
A crawler should know whether a URL has already been visited or is
about to be visited
* SET
(a = Set.new, a << url ; a.include?(url))
* Bloom Filter (SETBIT / GETBIT)
Polipus: Under the hood
Redis
Bloom Filter:
“A Bloom filter is a space-efficient probabilistic data structure that is used
to test whether an element is a member of a set.”
http://en.wikipedia.org/wiki/Bloom_filter
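A small pure-Ruby sketch of the idea follows. It is not the redis-bloomfilter gem: the bitmap lives in a local byte string instead of a Redis key, the private `setbit`/`getbit` helpers only mimic what SETBIT / GETBIT would do on the server, and the MD5-based double hashing is an assumption chosen to keep the example stdlib-only.

```ruby
require "digest"

# Illustrative Bloom filter backed by a local byte string (hypothetical class).
class BloomFilter
  def initialize(bits:, hashes:)
    @bits = bits
    @hashes = hashes
    @bitmap = "\0".b * ((bits + 7) / 8) # one bit per slot, all zero
  end

  # Mark an item as seen by setting its k bit positions.
  def add(item)
    positions(item).each { |pos| setbit(pos) }
    self
  end
  alias_method :<<, :add

  # true  => item was probably added (false positives are possible)
  # false => item was definitely never added
  def include?(item)
    positions(item).all? { |pos| getbit(pos) }
  end

  private

  # Derive k positions from two 64-bit halves of one MD5 digest
  # (double hashing: h1 + i * h2).
  def positions(item)
    h1, h2 = Digest::MD5.digest(item.to_s).unpack("Q<Q<")
    (0...@hashes).map { |i| (h1 + i * h2) % @bits }
  end

  def setbit(pos)
    byte = @bitmap.getbyte(pos / 8)
    @bitmap.setbyte(pos / 8, byte | (1 << (pos % 8)))
  end

  def getbit(pos)
    @bitmap.getbyte(pos / 8)[pos % 8] == 1
  end
end

visited = BloomFilter.new(bits: 10_000, hashes: 7)
visited << "http://example.com/a"
visited.include?("http://example.com/a")
```

The interface deliberately mirrors the `Set` usage on the previous slide (`<<` and `include?`), so the tracker can be swapped without touching the crawl loop.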
Polipus: Under the hood
Redis
Bloom Filter:
* Very space efficient! 1,000,000 elements take ~2 MB on Redis
* With a cost: false positives are possible, while false negatives are not
With a false-positive probability of 0.1%, about 1k out of every 1M pages
might be erroneously marked as already visited
Using SET: no errors at all, but 1,000,000 elements occupy ~150 MB
on Redis
https://github.com/taganaka/redis-bloomfilter
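The ~2 MB figure is consistent with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The helper below is only a back-of-the-envelope check; `bloom_size` is a hypothetical name, and decimal megabytes are assumed.

```ruby
# Size a Bloom filter for n elements at false-positive probability p.
def bloom_size(n, p)
  bits   = (-(n * Math.log(p)) / (Math.log(2)**2)).ceil # optimal bitmap size
  hashes = (bits.to_f / n * Math.log(2)).round          # optimal hash count
  {
    bits: bits,
    hashes: hashes,
    megabytes: bits / 8.0 / 1_000_000,
    expected_false_positives: (n * p).round
  }
end

p bloom_size(1_000_000, 0.001)
```

For 1M elements at 0.1% this gives roughly 14.4 bits per element, which is a little under 2 MB in total.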
Polipus: Under the hood
MongoDB
1) MongoDB is used mainly for storing pages
2) Pages are stored using the upsert command, so that a document can be easily
updated when the same content is crawled again
3) By default the body of the page is compressed in order to save disk space
4) No query() is needed because of the Bloom filter
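The storage behaviour described above can be illustrated roughly as follows. This is a sketch under stated assumptions: a plain Hash keyed by URL stands in for the MongoDB collection, `PageStore` and its methods are hypothetical names, and zlib is only one possible compression choice (the slides do not say which one Polipus uses).

```ruby
require "zlib"

# Hypothetical page store: a Hash stands in for a MongoDB collection.
class PageStore
  def initialize
    @docs = {} # url => document
  end

  # Upsert semantics: insert when the URL is new, overwrite when it exists,
  # so a fresh crawl of the same page never creates a second document.
  def save(url, body)
    @docs[url] = {
      url: url,
      body: Zlib::Deflate.deflate(body), # compressed to save disk space
      fetched_at: Time.now
    }
  end

  # Return the decompressed body, or nil when the URL was never stored.
  def body(url)
    doc = @docs[url]
    doc && Zlib::Inflate.inflate(doc[:body])
  end

  def stored_bytes(url)
    @docs.fetch(url)[:body].bytesize
  end

  def count
    @docs.size
  end
end

store = PageStore.new
store.save("http://example.com/", "<html>first crawl</html>")
store.save("http://example.com/", "<html>second crawl</html>")
store.count # still one document for that URL
```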
Polipus: The infrastructure
Is it so easy?!
Not really...
1) Redis is an in-memory database
2) A queue of URLs can grow very fast
3) A queue of 1M URLs occupies about 370 MB on Redis (about 400 chars
for each entry)
4) MongoDB will eat your disk space: 50M saved pages take around 400 GB
Suggested Redis conf:
maxmemory 2.5GB (or whatever your instance can handle)
maxmemory-policy noeviction
After 6M I got Redis to refuse writes
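The figures on this slide can be cross-checked with a little arithmetic. The helper names below are made up for the example, decimal units (1 GB = 10^9 bytes) are assumed, and Redis overhead beyond the per-URL figure from the slide is ignored.

```ruby
BYTES_PER_QUEUED_URL = 370_000_000.0 / 1_000_000 # 370 MB per 1M URLs (from the slide)
MAXMEMORY_BYTES      = 2.5 * 1_000_000_000       # maxmemory 2.5GB

# How many queued URLs fit before Redis hits maxmemory (queue only).
def max_queued_urls(maxmemory_bytes, bytes_per_url)
  (maxmemory_bytes / bytes_per_url).floor
end

# Average stored size of one page, from "50M pages are around 400GB".
def bytes_per_stored_page(total_bytes, pages)
  total_bytes / pages
end

puts max_queued_urls(MAXMEMORY_BYTES, BYTES_PER_QUEUED_URL)
puts bytes_per_stored_page(400 * 1_000_000_000, 50_000_000)
```

Under these assumptions the queue alone would fill 2.5 GB at somewhat under 7M URLs, and each stored page averages about 8 KB.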
An experiment using the currently available code
Setup:
6x t1.micro (web crawlers, 5 workers each)
1x m1.medium (Redis and MongoDB)
MongoDB with default settings
Redis
maxmemory 2.5GB
maxmemory-policy noeviction
~4,700,000 pages downloaded in 24h
...then I ran out of disk because of MongoDB
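To put the result in perspective, the reported total can be turned into an average rate. This is plain arithmetic on the numbers above; the constant and method names are invented for the example, and it assumes all 30 workers were active for the full 24 hours.

```ruby
PAGES   = 4_700_000
SECONDS = 24 * 60 * 60
WORKERS = 6 * 5 # 6 instances, 5 workers each

# Average crawl rate overall and per worker.
def throughput(pages, seconds, workers)
  per_second = pages.to_f / seconds
  { per_second: per_second, per_worker: per_second / workers }
end

rate = throughput(PAGES, SECONDS, WORKERS)
puts format("%.1f pages/s overall, %.2f pages/s per worker", rate[:per_second], rate[:per_worker])
```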
TODO
• Redis memory guard
  • Should be able to move items from the Redis queue to MongoDB if the
    queue size hits a threshold, and move them back to Redis at some
    point
• Honor the robots.txt file
  • So that we can respect Disallow directives, if any
• Add support for Ruby Mechanize
  • Maintain browsing sessions
  • Fill in and submit forms
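One possible shape for the memory guard item is sketched below. It is not an existing Polipus feature: `GuardedQueue` is a hypothetical class, and two in-memory arrays stand in for the Redis queue and for the MongoDB overflow area.

```ruby
# Hypothetical queue that spills to a secondary store past a threshold.
class GuardedQueue
  def initialize(threshold:)
    @threshold = threshold
    @hot = []      # stand-in for the in-memory Redis queue
    @overflow = [] # stand-in for items parked in MongoDB
  end

  def push(url)
    # Once anything has spilled, keep appending to the overflow so that
    # FIFO order is preserved across the two stores.
    if @overflow.empty? && @hot.size < @threshold
      @hot.push(url)
    else
      @overflow.push(url)
    end
  end

  def pop
    url = @hot.shift
    refill if @hot.empty? # move parked items back when there is room
    url
  end

  def sizes
    { hot: @hot.size, overflow: @overflow.size }
  end

  private

  # Move up to `threshold` parked items back into the hot queue.
  def refill
    @hot.concat(@overflow.shift(@threshold))
  end
end

guarded = GuardedQueue.new(threshold: 2)
(1..5).each { |i| guarded.push("http://example.com/#{i}") }
p guarded.sizes
```

Refilling only when the hot queue is empty keeps the sketch short; a real guard would more likely refill at a low-water mark to avoid stalling the workers.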
Questions?
francesco@gild.com
facebook.com/francesco.laurita
www.gild.com