MongoDB and Web Scraping with the Gyes Platform


This document discusses using MongoDB as a data repository for storing data scraped by Gyes crawlers. Gyes is a web scraping platform that uses JavaScript and jQuery scrapers. Storing scraped data in MongoDB provides a flexible data model and semi-structured access to the data. Key advantages of MongoDB include flexible storage of dynamic scraped data, powerful querying, and scalability. The document explores earlier approaches using key-value pairs and JSON, and explains how MongoDB meets goals of flexible storage and access to scraped web data.

MongoDB and Web Scraping
with the Gyes Platform

Jesus Diaz
jesus.diaz@infinithread.com

What is Gyes?
● Aggregation Platform for the Web
○ Finance (Mint.com, Manilla.com)
○ Travel (kayak.com)
○ Shopping (nextag.com)

What is Gyes? (cont)
● Domain-specific Scrapers
● Gyes scrapers = JavaScript + jQuery
● Full Web context access

Goals
● Decouple Data Extraction from Data Consumption
● Provide a Flexible Data Model
● Provide a Semi-structured Model to Access Scraped Data

Overall Architecture

Components:
● REST API (latest, run, collect)
● UI (www.gyeslab.com)
○ Develop crawlers
○ Check scraped data
● Schedule
● Data Repository

Example crawler (its return value is what gets stored in the Data Repository):
bot.open('http://somesite.com', function(status) {
    ...
    return {a: 3, b: 4};
});
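
The decoupling goal can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: `repository`, `runCrawler`, and `latest` are names invented for illustration, and the slides do not show how the Gyes `bot` object or the REST endpoints are actually implemented.

```javascript
// Hypothetical in-memory stand-in for the Data Repository (not Gyes code).
const repository = new Map();

// Extraction side: run a crawler function and store whatever it returns.
function runCrawler(name, crawler) {
  const result = crawler();
  const runs = repository.get(name) || [];
  runs.push({ seq: runs.length, data: result });
  repository.set(name, runs);
  return result;
}

// Consumption side: read the most recent result without knowing how it was scraped.
function latest(name) {
  const runs = repository.get(name) || [];
  return runs.length > 0 ? runs[runs.length - 1].data : null;
}

runCrawler('somesite', () => ({ a: 3, b: 4 }));
console.log(latest('somesite')); // { a: 3, b: 4 }
```

The consumer only ever touches `latest`, so the crawler can change what it returns without the consumer's code changing.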

From Goals to Challenges
● Flexible Data Model
● Flexible, semi-structured Means to Access Data

Take 1: Key-Value pairs (Tuple spaces)

Crawler returns:
result.source = "Newspaper A"
result.date = "1/20/2013"
result.news[0].id = 8
result.news[0].text = "Headline #1"
result.news[1].id = 9
result.news[1].text = "Headline #2"
....

Data gets stored as:
key1    key2     key3  ...  value
result  source              Newspaper A
result  date                1/20/2013
result  news[0]  id         8
result  news[0]  text       Headline #1
result  news[1]  id         9
result  news[1]  text       Headline #2
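
The flattening step, and the rebuild step that makes this approach cumbersome, can be sketched as follows. This is an illustration under assumptions, not the Gyes implementation: `flatten` and `rebuild` are hypothetical helpers, and the sketch keeps each array index as its own numeric key (`'news', 0`) rather than the combined `news[0]` key shown on the slide.

```javascript
// Flatten a nested result into [key1, key2, ..., value] tuples.
// Note: empty objects and empty arrays produce no tuples in this sketch.
function flatten(value, path = []) {
  if (value !== null && typeof value === 'object') {
    return Object.entries(value).flatMap(([key, child]) =>
      flatten(child, path.concat(Array.isArray(value) ? Number(key) : key))
    );
  }
  return [path.concat([value])];
}

// Rebuild the nested object from the tuples: the "cumbersome" step.
function rebuild(tuples) {
  const root = {};
  for (const tuple of tuples) {
    const keys = tuple.slice(0, -1);
    const value = tuple[tuple.length - 1];
    let node = root;
    keys.forEach((key, i) => {
      if (i === keys.length - 1) {
        node[key] = value;
      } else {
        if (node[key] === undefined) {
          // A numeric next key means the child is an array.
          node[key] = typeof keys[i + 1] === 'number' ? [] : {};
        }
        node = node[key];
      }
    });
  }
  return root;
}

const result = {
  source: 'Newspaper A',
  date: '1/20/2013',
  news: [
    { id: 8, text: 'Headline #1' },
    { id: 9, text: 'Headline #2' },
  ],
};

const tuples = flatten({ result });
console.log(tuples[2]); // [ 'result', 'news', 0, 'id', 8 ]
console.log(JSON.stringify(rebuild(tuples)) === JSON.stringify({ result })); // true
```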

Key-Value pairs (cont)

Advantages
● Flexible Data Model

Disadvantages
● Cumbersome to "rebuild" data
● Hard to handle versioning
● Lack of great commercial implementations (diy?)

Take 2: Enter JSON
We are scraping the web, using JavaScript + jQuery. Why don't we use JSON? Thank you, Captain Obvious!

Crawler returns:
{
  "source": "Newspaper A",
  "date": "1/20/2013",
  "news": [
    { "id": 8, "text": "Headline #1" },
    { "id": 9, "text": "Headline #2" }
    ....
  ]
}

Data gets stored as: Plain Text.

Enter JSON (cont)

Advantages
● Flexible Data Model

Disadvantages
● Plain text

What about that flexible, semi-structured mechanism to access the data we wanted to provide?
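
One way to see the "plain text" drawback: when results are kept as JSON strings, even a simple question requires parsing every stored document in application code. A minimal sketch, assuming a hypothetical in-memory `stored` array standing in for the plain-text store; the "Newspaper B" entry is made-up sample data.

```javascript
// Hypothetical plain-text store: each crawler result is kept as a JSON string.
const stored = [
  JSON.stringify({
    source: 'Newspaper A',
    date: '1/20/2013',
    news: [{ id: 8, text: 'Headline #1' }, { id: 9, text: 'Headline #2' }],
  }),
  JSON.stringify({
    source: 'Newspaper B', // made-up sample data
    date: '1/21/2013',
    news: [{ id: 10, text: 'Headline #3' }],
  }),
];

// To find headlines from one source, every document must be parsed first;
// the store itself cannot filter on "source" because it only sees text.
function headlinesFrom(source) {
  return stored
    .map((text) => JSON.parse(text))
    .filter((doc) => doc.source === source)
    .flatMap((doc) => doc.news.map((item) => item.text));
}

console.log(headlinesFrom('Newspaper A')); // [ 'Headline #1', 'Headline #2' ]
```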

MongoDB to the rescue
● No tricks, store data as-is
● Flexible (structure of scraped data can change, MongoDB doesn't care)
● Semi-structured model allows users to convert data to strongly typed objects
● Powerful query mechanisms
● Scalable (oh yeah)
● Again, store data as-is, consume as-is.
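
To give a feel for querying data stored as-is, the sketch below imitates, in memory, a dot-notation equality filter that reaches into an array of embedded documents. It is not MongoDB driver code: `valuesAt` and `matches` are hypothetical helpers, they handle only equality (no operators, no numeric array indexes in paths), and the "Newspaper B" document is made-up sample data.

```javascript
// Collect every value reachable by a dotted path, descending into arrays
// of embedded documents along the way. Equality-only, simplified imitation.
function valuesAt(doc, path) {
  let current = [doc];
  for (const key of path.split('.')) {
    current = current
      .flatMap((node) => (Array.isArray(node) ? node : [node]))
      .filter((node) => node !== null && typeof node === 'object' && key in node)
      .map((node) => node[key]);
  }
  return current.flatMap((node) => (Array.isArray(node) ? node : [node]));
}

// A document matches when every path in the query reaches the expected value.
function matches(doc, query) {
  return Object.entries(query).every(([path, expected]) =>
    valuesAt(doc, path).includes(expected)
  );
}

// Documents are kept exactly as the crawler returned them.
const collection = [
  {
    source: 'Newspaper A',
    date: '1/20/2013',
    news: [{ id: 8, text: 'Headline #1' }, { id: 9, text: 'Headline #2' }],
  },
  {
    source: 'Newspaper B', // made-up sample data
    date: '1/21/2013',
    news: [{ id: 10, text: 'Headline #3' }],
  },
];

const hits = collection.filter((doc) => matches(doc, { 'news.id': 9 }));
console.log(hits.map((doc) => doc.source)); // [ 'Newspaper A' ]
```

With MongoDB itself, the same filter document, `{ "news.id": 9 }`, would be passed to `find` on the collection holding the scraped results.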

Overall Architecture (2)
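Components:
● Clients (several), exchanging JSON with the REST API
● REST API (latest, run, collect)
● BSON/JSON between the REST API and the components below it
● UI (www.gyeslab.com)
○ Develop crawlers
○ Check scraped data
● Schedule
● Data Repository (MongoDB)

Example crawler (its return value is what gets stored in the Data Repository):
bot.open('http://somesite.com', function(status) {
    ...
    return {a: 3, b: 4};
});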

What's Next with Gyes and MongoDB
● Scale Data Repository + API
○ Sharding
○ Get data closer to users
● Add support for querying data by projection
○ Slices of data

○ Arbitrary attribute subset selection.
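
The two projection ideas above, attribute subsets and slices, can be sketched in memory as follows. The helper names `pick` and `sliceField` are hypothetical and this is not how Gyes implements the feature; MongoDB offers field projections and a `$slice` projection operator for this purpose.

```javascript
// Keep only the listed top-level attributes of a document.
function pick(doc, fields) {
  return Object.fromEntries(
    fields.filter((field) => field in doc).map((field) => [field, doc[field]])
  );
}

// Return a copy of the document with one array attribute cut to its first `count` items.
function sliceField(doc, field, count) {
  return { ...doc, [field]: doc[field].slice(0, count) };
}

const doc = {
  source: 'Newspaper A',
  date: '1/20/2013',
  news: [
    { id: 8, text: 'Headline #1' },
    { id: 9, text: 'Headline #2' },
  ],
};

console.log(pick(doc, ['source', 'date'])); // { source: 'Newspaper A', date: '1/20/2013' }
console.log(sliceField(doc, 'news', 1).news); // [ { id: 8, text: 'Headline #1' } ]
```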



The End

Questions?
