MongoDB and Web Scraping with the Gyes Platform

Jesus Diaz
jesus.diaz@infinithread.com
What is Gyes?
● Aggregation Platform for the Web
  ○ Finance (Mint.com, Manilla.com)
  ○ Travel (kayak.com)
  ○ Shopping (nextag.com)
What is Gyes? (cont)
● Domain-specific Scrapers
● Gyes scrapers = JavaScript + jQuery
● Full Web context access
Goals
● Decouple Data Extraction from Data
  Consumption
● Provide a Flexible Data Model
● Provide a Semi-structured Model to Access
  Scraped Data
Overall Architecture
● REST API (latest, run, collect)
● UI (www.gyeslab.com)
  ○ Develop crawlers
  ○ Check scraped data
● Schedule
● Data Repository

Crawler, as written in the UI:

 bot.open('http://somesite.com',
     function(status) {
       ...
       return {a: 3, b: 4};
 });
From Goals to Challenges
● Flexible Data Model
● Flexible, semi-structured Means to Access
  Data
Take 1: Key-Value pairs (Tuple spaces)

Crawler returns:

    result.source = "Newspaper A"
    result.date   = "1/20/2013"
    result.news[0].id = 8
    result.news[0].text = "Headline #1"
    result.news[1].id = 9
    result.news[1].text = "Headline #2"
    ....

Data gets stored as:

    key1      key2      key3   ...   value
    result    source                 Newspaper A
    result    date                   1/20/2013
    result    news[0]   id           8
    result    news[0]   text         Headline #1
    result    news[1]   id           9
    result    news[1]   text         Headline #2
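The mapping from a nested crawler result to key-path rows can be sketched roughly as follows. This is only an illustration: `flatten` and the `{ keys, value }` row shape are hypothetical names chosen here, not part of the Gyes platform.

```javascript
// Hypothetical helper: flatten a nested crawler result into key-path rows.
function flatten(value, path, rows = []) {
  if (Array.isArray(value)) {
    // Array elements extend the last key with an index, e.g. news[0].
    value.forEach((item, i) => {
      const head = path.slice(0, -1);
      const last = path[path.length - 1];
      flatten(item, [...head, `${last}[${i}]`], rows);
    });
  } else if (value !== null && typeof value === "object") {
    // Each object property adds one more key to the path.
    for (const [key, child] of Object.entries(value)) {
      flatten(child, [...path, key], rows);
    }
  } else {
    // Scalars end the path and become the stored value.
    rows.push({ keys: path, value });
  }
  return rows;
}

const result = {
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
};

const rows = flatten(result, ["result"]);
for (const row of rows) {
  console.log(row.keys.join("  "), "=>", row.value);
}
```

Every scalar becomes its own row, so the shape of the data can change freely; the cost is that the original nesting only survives inside the key names.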
Key-Value pairs (cont)

Advantages
● Flexible Data Model

Disadvantages
● Cumbersome to "rebuild" data
● Hard to handle versioning
● Lack of great commercial implementations (diy?)
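To give a feel for why "rebuilding" is cumbersome, here is a minimal sketch of the reverse step: turning stored key-path rows back into a nested object. The `rebuild` function and the row shape are assumptions made for this example, not Gyes code.

```javascript
// Hypothetical helper: rebuild a nested object from key-path rows.
function rebuild(rows) {
  const root = {};
  for (const { keys, value } of rows) {
    let node = root;
    keys.forEach((key, depth) => {
      const isLeaf = depth === keys.length - 1;
      // Split a key such as "news[0]" into a name and an array index.
      const match = /^(.+)\[(\d+)\]$/.exec(key);
      if (match) {
        const name = match[1];
        const index = Number(match[2]);
        node[name] = node[name] || [];
        if (isLeaf) {
          node[name][index] = value;
        } else {
          node[name][index] = node[name][index] || {};
          node = node[name][index];
        }
      } else if (isLeaf) {
        node[key] = value;
      } else {
        node[key] = node[key] || {};
        node = node[key];
      }
    });
  }
  return root;
}

const stored = [
  { keys: ["result", "source"], value: "Newspaper A" },
  { keys: ["result", "date"], value: "1/20/2013" },
  { keys: ["result", "news[0]", "id"], value: 8 },
  { keys: ["result", "news[0]", "text"], value: "Headline #1" },
  { keys: ["result", "news[1]", "id"], value: 9 },
  { keys: ["result", "news[1]", "text"], value: "Headline #2" },
];

const rebuilt = rebuild(stored);
console.log(JSON.stringify(rebuilt.result, null, 2));
```

Even this small case needs key parsing and special handling for arrays, and every consumer of the data would have to repeat it.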
Take 2: Enter JSON
We are scraping the web using JavaScript + jQuery. Why don't we use JSON? Thank you, Captain Obvious!

Crawler returns:

    {
      "source": "Newspaper A",
      "date": "1/20/2013",
      "news": [
        { "id": 8, "text": "Headline #1" },
        { "id": 9, "text": "Headline #2" }
        ....
      ]
    }

Data gets stored as: plain text.
Enter JSON (cont)

Advantages
● Flexible Data Model

Disadvantages
● Plain text

What about that flexible, semi-structured mechanism to access the data we wanted to provide?
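A small sketch of what the plain-text disadvantage means in practice: the store only sees a string, so any question about the contents has to be answered by the consumer after parsing the whole document. The variable names and the sample question are made up for this example.

```javascript
// The repository holds only this string; it cannot look inside it.
const storedText = JSON.stringify({
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
});

// To answer "which headlines have an id greater than 8?" the consumer
// must fetch and parse the full text, then filter in application code.
const parsed = JSON.parse(storedText);
const matches = parsed.news
  .filter((item) => item.id > 8)
  .map((item) => item.text);

console.log(matches);
```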
MongoDB to the rescue
● No tricks, store data as-is
● Flexible (structure of scraped data can
  change, MongoDB doesn't care)
● Semi-structured model allows users to convert
  data to strongly typed objects
● Powerful query mechanisms
● Scalable (oh yeah)
● Again, store data as-is, consume as-is.
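As one possible reading of "convert data to strongly typed objects", a consumer could wrap a stored document in classes that validate it. The class names, the validation rules, and the month/day/year date format are all assumptions made for this sketch; it does not call MongoDB.

```javascript
// Hypothetical consumer-side types for a stored crawler document.
class NewsItem {
  constructor(id, text) {
    if (!Number.isInteger(id)) throw new TypeError("id must be an integer");
    if (typeof text !== "string") throw new TypeError("text must be a string");
    this.id = id;
    this.text = text;
  }
}

class NewsReport {
  constructor(source, date, news) {
    this.source = source;
    this.date = date;
    this.news = news;
  }

  // Assumes dates are month/day/year strings, as in "1/20/2013".
  static fromDocument(doc) {
    const [month, day, year] = doc.date.split("/").map(Number);
    const date = new Date(Date.UTC(year, month - 1, day));
    const news = (doc.news || []).map((n) => new NewsItem(n.id, n.text));
    return new NewsReport(String(doc.source), date, news);
  }
}

// A document shaped like the one the crawler returned.
const storedDoc = {
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
};

const report = NewsReport.fromDocument(storedDoc);
console.log(report.source, report.date.toISOString(), report.news.length);
```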
Overall Architecture (2)
● Clients <- JSON -> REST API (latest, run, collect)
● REST API <- BSON/JSON -> Data Repository (MongoDB)
● UI (www.gyeslab.com)
  ○ Develop crawlers
  ○ Check scraped data
● Schedule

Crawler, as written in the UI:

 bot.open('http://somesite.com',
     function(status) {
       ...
       return {a: 3, b: 4};
 });
What's Next with Gyes and MongoDB
● Scale Data Repository + API
  ○ Sharding
  ○ Get data closer to users
● Add support for querying data by projection
  ○ Slices of data
  ○ Arbitrary attribute subset selection
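To illustrate what "querying by projection" could return, here is a minimal in-memory sketch. MongoDB's own find projections offer field selection and a `$slice` operator for arrays; the `project` helper below is a hypothetical stand-in that only mimics the idea and does not touch a database.

```javascript
// Hypothetical helper: keep only the requested top-level attributes.
// An optional per-field slice limits how many array elements come back.
function project(doc, fields, slices = {}) {
  const out = {};
  for (const field of fields) {
    if (!(field in doc)) continue; // silently skip unknown attributes
    const value = doc[field];
    out[field] =
      Array.isArray(value) && field in slices
        ? value.slice(0, slices[field])
        : value;
  }
  return out;
}

const fullDoc = {
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
};

// Attribute subset (source, news) plus a slice of the first news item.
const view = project(fullDoc, ["source", "news"], { news: 1 });
console.log(JSON.stringify(view));
```

A client asking only for the latest headline would then receive a small subset of the document rather than the whole scrape.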
The End

Questions?
