This document discusses using MongoDB as a data repository for storing data scraped by Gyes crawlers. Gyes is a web scraping platform that uses JavaScript and jQuery scrapers. Storing scraped data in MongoDB provides a flexible data model and semi-structured access to the data. Key advantages of MongoDB include flexible storage of dynamic scraped data, powerful querying, and scalability. The document explores earlier approaches using key-value pairs and JSON, and explains how MongoDB meets goals of flexible storage and access to scraped web data.
MongoDB and Web Scraping with the Gyes Platform
1. MongoDB and Web Scraping with the Gyes Platform
Jesus Diaz
jesus.diaz@infinithread.com
2. What is Gyes?
● Aggregation Platform for the Web
○ Finance (Mint.com, Manilla.com)
○ Travel (kayak.com)
○ Shopping (nextag.com)
3. What is Gyes? (cont)
● Domain-specific Scrapers
● Gyes scrapers = JavaScript + jQuery
● Full Web context access
4. Goals
● Decouple Data Extraction from Data Consumption
● Provide a Flexible Data Model
● Provide a Semi-structured Model to Access Scraped Data
5. Overall Architecture
REST API (latest, run, collect)
UI (www.gyeslab.com)
● Develop crawlers
● Check scraped data
Schedule
bot.open('http://somesite.com',
  function(status) {
    ...
    return {a: 3, b: 4};
  });
Data Repository
6. From Goals to Challenges
● Flexible Data Model
● Flexible, Semi-structured Means to Access Data
7. Take 1: Key-Value pairs (Tuple spaces)
Crawler returns:
result.source = "Newspaper A"
result.date = "1/20/2013"
result.news[0].id = 8
result.news[0].text = "Headline #1"
result.news[1].id = 9
result.news[1].text = "Headline #2"
....
Data gets stored as:
key1    key2     key3  ...  value
result  source              Newspaper A
result  date                1/20/2013
result  news[0]  id         8
result  news[0]  text       Headline #1
result  news[1]  id         9
result  news[1]  text       Headline #2
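The mapping from a crawler result to key-value tuples can be sketched in a few lines of JavaScript. This is only an illustration of the layout shown on this slide, not the actual Gyes storage code; the function name `flattenToTuples` and the choice to fold array indexes into the key (`news[0]`) are assumptions made to reproduce the slide's rows.

```javascript
// Hypothetical helper: flatten a nested crawler result into
// [key1, key2, ..., value] tuples, as in the slide's table.
function flattenToTuples(value, path) {
  if (Array.isArray(value)) {
    // Fold the element index into the last key: "news" becomes "news[0]".
    const head = path.slice(0, -1);
    const last = path[path.length - 1];
    return value.flatMap((item, i) =>
      flattenToTuples(item, [...head, `${last}[${i}]`])
    );
  }
  if (value !== null && typeof value === "object") {
    // Descend into each property, extending the key path.
    return Object.entries(value).flatMap(([key, child]) =>
      flattenToTuples(child, [...path, key])
    );
  }
  // Leaf: the accumulated keys followed by the value.
  return [[...path, value]];
}

const result = {
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
};

const tuples = flattenToTuples(result, ["result"]);
for (const tuple of tuples) {
  console.log(tuple.join(" "));
}
```

Each printed line corresponds to one stored row, for example `result news[0] id 8`.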
8. Key-Value pairs (cont)
Advantages
● Flexible Data Model
Disadvantages
● Cumbersome to "rebuild" data
● Hard to handle versioning
● Lack of great commercial implementations (diy?)
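To make the "cumbersome to rebuild" point concrete, here is a rough sketch of what reassembling the original object from stored tuples might involve. The tuple layout follows the Take 1 slide; the function name `rebuildFromTuples` and the `name[index]` key convention are assumptions for illustration only.

```javascript
// Hypothetical helper: rebuild a nested object from
// [key1, key2, ..., value] tuples.
function rebuildFromTuples(tuples) {
  const root = {};
  for (const tuple of tuples) {
    const keys = tuple.slice(0, -1);
    const value = tuple[tuple.length - 1];
    let node = root;
    keys.forEach((key, depth) => {
      const isLast = depth === keys.length - 1;
      // Keys such as "news[0]" address an element of an array.
      const match = /^(.+)\[(\d+)\]$/.exec(key);
      if (match) {
        const name = match[1];
        const index = Number(match[2]);
        node[name] = node[name] || [];
        if (isLast) {
          node[name][index] = value;
        } else {
          node[name][index] = node[name][index] || {};
          node = node[name][index];
        }
      } else if (isLast) {
        node[key] = value;
      } else {
        node[key] = node[key] || {};
        node = node[key];
      }
    });
  }
  return root;
}

const storedTuples = [
  ["result", "source", "Newspaper A"],
  ["result", "date", "1/20/2013"],
  ["result", "news[0]", "id", 8],
  ["result", "news[0]", "text", "Headline #1"],
  ["result", "news[1]", "id", 9],
  ["result", "news[1]", "text", "Headline #2"],
];

const rebuilt = rebuildFromTuples(storedTuples).result;
console.log(JSON.stringify(rebuilt, null, 2));
```

Even this small case needs key parsing and special handling for arrays, which is the kind of bookkeeping the slide flags as a disadvantage.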
9. Take 2: Enter JSON
We are scraping the web, using JavaScript + jQuery. Why don't we use JSON?
Thank you, captain obvious!
Crawler returns:
{
  "source": "Newspaper A",
  "date": "1/20/2013",
  "news": [
    { "id": 8, "text": "Headline #1"},
    { "id": 9, "text": "Headline #2"}
    ....
  ]
}
Data gets stored as plain text.
10. Enter JSON (cont)
Advantages
● Flexible Data Model
Disadvantages
● Plain text
What about that flexible, semi-structured mechanism to access the data we wanted to provide?
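The limitation of plain-text JSON can be illustrated with a small sketch: because the repository holds an opaque string, every read has to parse the whole blob and filter it by hand. The variable and function names below (`storedBlob`, `headlinesFromBlob`) are hypothetical and only serve the example.

```javascript
// What the repository holds in this approach: one opaque string.
const storedBlob = JSON.stringify({
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
});

// Hypothetical consumer: any "query" means parsing everything,
// then filtering in application code.
function headlinesFromBlob(blob, minId) {
  const doc = JSON.parse(blob);
  return doc.news
    .filter((item) => item.id >= minId)
    .map((item) => item.text);
}

const headlines = headlinesFromBlob(storedBlob, 9);
console.log(headlines);
```

The storage layer cannot help with selection here; the consumer does all the work after fetching the full text.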
11. MongoDB to the rescue
● No tricks, store data as-is
● Flexible (structure of scraped data can
change, MongoDB doesn't care)
● Semi-structured model allows users to convert
data to strongly typed objects
● Powerful query mechanisms
● Scalable (oh yeah)
● Again, store data as-is, consume as-is.
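As a minimal sketch of "store as-is, consume as-is", the snippet below inserts a scraped object unchanged and queries it by a nested field using the official MongoDB Node.js driver. The connection URI, the database name `gyes`, the collection name `results`, and both function names are assumptions; the deck does not show how Gyes actually talks to MongoDB.

```javascript
// Hypothetical filter: documents from one source that contain at
// least one news item with id >= minId (dot notation reaches into
// the embedded array).
function buildHeadlineQuery(source, minId) {
  return { source: source, "news.id": { $gte: minId } };
}

// Hypothetical round trip: insert the scraped object unchanged,
// then read matching documents back.
async function storeAndQuery(uri, scraped) {
  // Loaded lazily so buildHeadlineQuery can be used without the driver.
  const { MongoClient } = require("mongodb");
  const client = new MongoClient(uri);
  try {
    await client.connect();
    // "gyes" and "results" are assumed names.
    const results = client.db("gyes").collection("results");
    await results.insertOne(scraped); // note: the driver adds an _id to scraped
    const filter = buildHeadlineQuery(scraped.source, 9);
    return await results.find(filter).toArray();
  } finally {
    await client.close();
  }
}

const exampleQuery = buildHeadlineQuery("Newspaper A", 9);
console.log(JSON.stringify(exampleQuery));
```

Calling `storeAndQuery("mongodb://localhost:27017", scrapedObject)` would require a running MongoDB instance and the `mongodb` package; no flattening or serialization step sits between the crawler's return value and the stored document.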
13. What's Next with Gyes and MongoDB
● Scale Data Repository + API
○ Sharding
○ Get data closer to users
● Add support for querying data by projection
○ Slices of data
○ Arbitrary attribute subset selection.
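What "querying by projection" could return is sketched below in plain JavaScript, operating on an in-memory document rather than on MongoDB itself. The function name `project`, its parameters, and the "first n elements" meaning of a slice are all assumptions; the deck does not specify the planned API.

```javascript
// Hypothetical in-memory projection:
//   fields: top-level attribute names to keep
//   slices: { arrayField: n } keeps only the first n elements
function project(doc, fields, slices = {}) {
  const out = {};
  for (const field of fields) {
    if (!(field in doc)) continue; // silently skip unknown attributes
    const value = doc[field];
    const limit = slices[field];
    out[field] =
      Array.isArray(value) && limit !== undefined
        ? value.slice(0, limit)
        : value;
  }
  return out;
}

const scrapedDoc = {
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
};

// Keep only "source" and the first news item.
const view = project(scrapedDoc, ["source", "news"], { news: 1 });
console.log(JSON.stringify(view));
```

MongoDB's own `find` projections (field inclusion and the `$slice` operator) offer comparable behavior on the server side, which is presumably what such an API would build on.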