SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
Web and REST API + Scraping
Marco Brambilla, Riccardo Volonterio
@marcobrambi
@manudellavalleWeb Science #WebSciCo
The Data Analysis Pipeline
The 5 3 Ws of (Social) API & Scraping
1) What
2) Why
3) Who
4) When? Now.
5) Where? Everywhere.
6) How
7) Use cases
a. UrbanScope
b. La Scala
What
“An API (Application Program Interface) is a set of routines, protocols and
tools for building software applications”
– Wikipedia
A WebAPI = HTTP based API
Allows a programmatic access to data and platform
And what if no API is available?
What
“An API (Application Program Interface) is a set of routines, protocols and
tools for building software applications”
– Wikipedia
A WebAPI = HTTP based API
Allows a programmatic access to data and platform
And what if no API is available?
Scraping!
Why
• Separation between model and presentation
• Regulate access to the data
• Traceable accounts
• Enable access to only a portion of the available data
• By Time
• By Location
• Avoid direct access to the platform website
• Request throttling
• Avoid server congestion
• Avoid DOS/DDOS
• Provide paid access to full (or higher volume) data
Who
More	than	14.000		APIs	available	(see	programmableweb.com)
How
• URLs
• API types
• RESTful APIs
• Authentication
• Scraping
How - URL
URL format (RFC 2396):
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
E.G.:
git@github.com:nodejs/node.git
mongodb://root:pass@localhost:27017/TestDB?options?replicaSet=test
http://example.com
abc://user:pass@example.com:123/path/data?key=value&k2=v2#fragid1
scheme credentials domain port path querystring fragment
How – HTTP request
When the scheme is HTTP(s), we can make HTTP requests as:
VERB URL
The VERB (or method) can be one of:
• GET
• POST
• DELETE
• OPTIONS
• and more: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
How - Endpoints
Each API exposes a set of HTTP(s) endpoints (URLs) to call in order to get
data or perform action on the platform.
Example (retrieve data):
GET https://graph.facebook.com/search
GET https://api.twitter.com/1.1/search/tweets.json
GET https://api.instagram.com/v1/tags/nofilter/media/recent
Example (platform actions):
POST https://api.instagram.com/v1/media/{media-id}/comments
POST https://api.twitter.com/1.1/statuses/update.json
POST https://graph.facebook.com/me/feed
Legend:
PROTOCOL
DOMAIN
API	VERSION
ENDPOINT
HTTP	METHOD
How - Parameters
Most of the endpoints can be “tweaked” via one or more parameters. Parameters
can be added to an URL by appending to the endpoint some (URL encoded) strings.
Example (Twitter):
…/search/tweets.json
?q=%23expo2015milano
&lang=it
&result_type=recent
&count=100
&token=1235abcd
The request in plain English is: “Twitter, I am 1235abcd, can you please give me the
100 most recent tweets, in italian, by the user @expo2015milano?”
How - REST
REST
“Representational state transfer (REST) is a style of software architecture.”
- Roy T. Fielding
Client-Server: a uniform interface separates clients from servers.
Stateless: the client-server communication is constrained by no client context
being stored on the server.
Cacheable: clients and intermediaries can cache responses.
Layered system: clients cannot tell whether are connected directly to the
server.
Uniform interface: simplifies and decouples the architecture.
How - APIs
WebAPI vs RESTful API
WebAPI: An API that uses the HTTP protocol. Non standard/unpredictable
URL format.
POST http://example.com/getUserData.do?user=1234
RESTful API: Resource based WebAPI. Standard/predictable URL format.
GET http://example.com/user/1234
How – RESTful API
The RESTful API uses the available HTTP verbs to perform CRUD operations
based on the “context”:
• Collection: A set of items (e.g.: /users)
• Item: A specific item in a collection (e.g.: /users/{id})
VERB Collection Item
POST Create a	new	item. Not	used
GET Get list	of	elements. Get	the	selected	item.
PUT Not	used Update	the	selected item.
DELETE Not	used Delete the	selected	 item.
How - Authentication
Almost all the APIs require a kind of user authentication. The user must
register to the developer of the provider to obtain the keys to access the API.
Most of the main API providers use the OAuth protocol to authenticate a
user.
The user registers to the developer portal to obtain a key and a secret.
Platform authenticates the user via key/secret pair and supply a token that
can be used to perform HTTP(S) requests to the desired API endpoint.
Challenges
• Pagination
• Timelines
• Parallelization
• Multiple accounts
• API gaps
Challenges - Pagination
Like SQL queries, most APIs support data pagination to split huge chunks of
data into smaller set of data.
Smaller chunks of data are easier to create, transfer (avoid long response
time), cache and require less server computation time.
Challenges - Timeline
Most of the Social Networks leverage on the concept of timelines instead of
standard pagination to iterate through results.
An ideal pagination will work like the following (Twitter example):
Challenges – Timeline 2
The problem is that data
are continuously added to
the top of the Twitter
timeline.
So the next request
request can lead to
retrieving the already
processed tweets.
Challenges – Timeline 3
The solution is to use the
cursoring technique.
Instead of reading from
the top of the timeline we
read the data relative to
the already processed
ids.
Challenges - Parallelization
When possible make parallel requests to gather more data in less time.
Example:
Get all the media from Instagram about @potus and #obama.
We can make two parallel requests (one for each tag) and paginate the results
instead of waiting for the first tag to finish.
Some time this is not possible due to the requests interdependency like in
pagination; each request depends on the result of previous one.
Challenges – Multiple accounts
Handling multiple accounts can increase the system throughput but increases
the complexity of the system.
Strategies to handle multiple accounts:
• Request based
• Round robin
• Account pool
• Account based
• Requests stack
Challenges – Multi accounts – Round robin
The	account	through	which	the	request	is	made	is	chosen	using	the	round-robin	strategy.
Challenges – Multi accounts – Account pool
The account through which the request is made is chosen sequentially until it’s rate limit
is reached, then we use the next available one.
Challenges – Multi accounts – Requests Stack
This strategy overturn the roles. Now the accounts, in
parallel, get the next request to do from the pool. Once an
account completes a request the cycle restarts until the pool
is empty.
Challenges – Fill Gaps
Sometimes the API does not give us all the data we want, so we need to fill
those gaps.
On quantity or time (Twitter) or on quality/content (Foursquare).
Example: Twitter time limited data. We need to access data older then 10
days.
Example:
Foursquare check-ins timeline not available.
Facebook like timestamps not available.
Build	your	own	timeline.
Use case – UrbanScope (Crawler)
“An API Crawler is a software that methodically interacts with a WebAPI to
download data or to take some actions at predefined time intervals.”
http://urbanscope.polimi.it/
Use case – UrbanScope (Twitter)
Goal: Get all the geolocated tweets in Milan and analyze the languages to detect
anomalies.
Problem: Twitter does not provide a simple endpoint “give me all tweets in Milan”.
Solution: Create a grid of points (50m distance) over Milan and ask Twitter for each point.
Problem: 70.000 points, with 1 account is 4 days.
Solution: Use more keys!
Problem: Handle multiple keys.
Solution: Cry… Then handle multiple keys.
Use case - UrbanScope - Explore Tweets
Use case – UrbanScope (Foursquare)
Goal: See how the check-ins progress in time to see where people go and visualize
the pulse of Milan.
Problem: Foursquare does not provide the timeline of the check-ins.
Solution: Build our own timeline by collecting the number of check-ins every 4 hours
for 90.000 venues.
Problem: 90.000 api requests to do in 4 hours with an API time limit of 5000 requests
per hour.
Solution: See twitter… with more tears and more keys (24)
Problem: Foursquare ban accounts that seems to crawl data automatically -.-
Use case – UrbanScope – City magnets
Use case – UrbanScope- Top Venues
0
200
400
600
800
1000
1200
1400
Use case – Teatro Alla Scala (Milan)
Scraping
“Web scraping means collecting information from websites by extracting them
directly from the HTML source code.”
Use case – La Scala (Scraper)
Goal: Steal Get data from twitter for 1 year ago.
Problem: API allow access to 10 days in the past.
Solution: The Twitter homepage allows to search without a time limit. Crete a web
scraper that collect data from the HTML.
Problem: The page needs to scroll to load more data.
Solution: Use a (headless) fake browser to simulate the user interaction.
Problem: You need to “start a browser” to get data, can be memory greedy.
Solution: No. Maybe.
Use case – La Scala (Scraper)
Once we have all the data we can extract the required data using
the following pipeline:
Gather	tweets
Extract	
Sentiment
Train	classifiers
Automatic	
classification
Extract	users
Calculate	
Popularity	
score
Scraping strategies
Pre-packaged headless browsers
– PhantomJS for JavaScript
– Study and replicate/simulate navigation
programmatively
Manual implementation of navigation
– Replicate JS (AJAX) invocations
– Capture the data from the page
– Marshall the data
Guidelines and example
• Scrape the minimum possible
• Turn to API invocations as much as possible
• Example: Twitter Scraping
• Grab the tweet IDs
• Get the detailed data about tweets via API
• https://github.com/Volox/TwitterScraper
Marco Brambilla, @marcobrambi, marco.brambilla@polimi.it
Emanuele Della Valle, @manudellavalle, emanuele.dellavalle@polimi.it
http://datascience.deib.polimi.it/course

Más contenido relacionado

Más de Marco Brambilla

Trigger.eu: Cocteau game for policy making - introduction and demo
Trigger.eu: Cocteau game for policy making - introduction and demoTrigger.eu: Cocteau game for policy making - introduction and demo
Trigger.eu: Cocteau game for policy making - introduction and demoMarco Brambilla
 
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...Marco Brambilla
 
Analyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projectsAnalyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projectsMarco Brambilla
 
Analysis of On-line Debate on Long-Running Political Phenomena. The Brexit C...
Analysis of On-line Debate on Long-Running Political Phenomena.The Brexit C...Analysis of On-line Debate on Long-Running Political Phenomena.The Brexit C...
Analysis of On-line Debate on Long-Running Political Phenomena. The Brexit C...Marco Brambilla
 
Community analysis using graph representation learning on social networks
Community analysis using graph representation learning on social networksCommunity analysis using graph representation learning on social networks
Community analysis using graph representation learning on social networksMarco Brambilla
 
Available Data Science M.Sc. Thesis Proposals
Available Data Science M.Sc. Thesis Proposals Available Data Science M.Sc. Thesis Proposals
Available Data Science M.Sc. Thesis Proposals Marco Brambilla
 
Data Cleaning for social media knowledge extraction
Data Cleaning for social media knowledge extractionData Cleaning for social media knowledge extraction
Data Cleaning for social media knowledge extractionMarco Brambilla
 
Iterative knowledge extraction from social networks. The Web Conference 2018
Iterative knowledge extraction from social networks. The Web Conference 2018Iterative knowledge extraction from social networks. The Web Conference 2018
Iterative knowledge extraction from social networks. The Web Conference 2018Marco Brambilla
 
Driving Style and Behavior Analysis based on Trip Segmentation over GPS Info...
Driving Style and Behavior Analysis based on Trip Segmentation over GPS  Info...Driving Style and Behavior Analysis based on Trip Segmentation over GPS  Info...
Driving Style and Behavior Analysis based on Trip Segmentation over GPS Info...Marco Brambilla
 
Myths and challenges in knowledge extraction and analysis from human-generate...
Myths and challenges in knowledge extraction and analysis from human-generate...Myths and challenges in knowledge extraction and analysis from human-generate...
Myths and challenges in knowledge extraction and analysis from human-generate...Marco Brambilla
 
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...Marco Brambilla
 
Model-driven Development of User Interfaces for IoT via Domain-specific Comp...
Model-driven Development of  User Interfaces for IoT via Domain-specific Comp...Model-driven Development of  User Interfaces for IoT via Domain-specific Comp...
Model-driven Development of User Interfaces for IoT via Domain-specific Comp...Marco Brambilla
 
A Model-Based Method for Seamless Web and Mobile Experience. Splash 2016 conf.
A Model-Based Method for  Seamless Web and Mobile Experience. Splash 2016 conf.A Model-Based Method for  Seamless Web and Mobile Experience. Splash 2016 conf.
A Model-Based Method for Seamless Web and Mobile Experience. Splash 2016 conf.Marco Brambilla
 
Big Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoBig Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoMarco Brambilla
 
Web Science. An introduction
Web Science. An introductionWeb Science. An introduction
Web Science. An introductionMarco Brambilla
 
On the Quest for Changing Knowledge. Capturing emerging entities from social ...
On the Quest for Changing Knowledge. Capturing emerging entities from social ...On the Quest for Changing Knowledge. Capturing emerging entities from social ...
On the Quest for Changing Knowledge. Capturing emerging entities from social ...Marco Brambilla
 
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...Studying Multicultural Diversity of Cities and Neighborhoods through Social M...
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...Marco Brambilla
 
Model driven software engineering in practice book - Chapter 9 - Model to tex...
Model driven software engineering in practice book - Chapter 9 - Model to tex...Model driven software engineering in practice book - Chapter 9 - Model to tex...
Model driven software engineering in practice book - Chapter 9 - Model to tex...Marco Brambilla
 
Model driven software engineering in practice book - chapter 7 - Developing y...
Model driven software engineering in practice book - chapter 7 - Developing y...Model driven software engineering in practice book - chapter 7 - Developing y...
Model driven software engineering in practice book - chapter 7 - Developing y...Marco Brambilla
 
Automatic code generation for cross platform, multi-device mobile apps. An in...
Automatic code generation for cross platform, multi-device mobile apps. An in...Automatic code generation for cross platform, multi-device mobile apps. An in...
Automatic code generation for cross platform, multi-device mobile apps. An in...Marco Brambilla
 

Más de Marco Brambilla (20)

Trigger.eu: Cocteau game for policy making - introduction and demo
Trigger.eu: Cocteau game for policy making - introduction and demoTrigger.eu: Cocteau game for policy making - introduction and demo
Trigger.eu: Cocteau game for policy making - introduction and demo
 
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ...
 
Analyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projectsAnalyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projects
 
Analysis of On-line Debate on Long-Running Political Phenomena. The Brexit C...
Analysis of On-line Debate on Long-Running Political Phenomena.The Brexit C...Analysis of On-line Debate on Long-Running Political Phenomena.The Brexit C...
Analysis of On-line Debate on Long-Running Political Phenomena. The Brexit C...
 
Community analysis using graph representation learning on social networks
Community analysis using graph representation learning on social networksCommunity analysis using graph representation learning on social networks
Community analysis using graph representation learning on social networks
 
Available Data Science M.Sc. Thesis Proposals
Available Data Science M.Sc. Thesis Proposals Available Data Science M.Sc. Thesis Proposals
Available Data Science M.Sc. Thesis Proposals
 
Data Cleaning for social media knowledge extraction
Data Cleaning for social media knowledge extractionData Cleaning for social media knowledge extraction
Data Cleaning for social media knowledge extraction
 
Iterative knowledge extraction from social networks. The Web Conference 2018
Iterative knowledge extraction from social networks. The Web Conference 2018Iterative knowledge extraction from social networks. The Web Conference 2018
Iterative knowledge extraction from social networks. The Web Conference 2018
 
Driving Style and Behavior Analysis based on Trip Segmentation over GPS Info...
Driving Style and Behavior Analysis based on Trip Segmentation over GPS  Info...Driving Style and Behavior Analysis based on Trip Segmentation over GPS  Info...
Driving Style and Behavior Analysis based on Trip Segmentation over GPS Info...
 
Myths and challenges in knowledge extraction and analysis from human-generate...
Myths and challenges in knowledge extraction and analysis from human-generate...Myths and challenges in knowledge extraction and analysis from human-generate...
Myths and challenges in knowledge extraction and analysis from human-generate...
 
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...
 
Model-driven Development of User Interfaces for IoT via Domain-specific Comp...
Model-driven Development of  User Interfaces for IoT via Domain-specific Comp...Model-driven Development of  User Interfaces for IoT via Domain-specific Comp...
Model-driven Development of User Interfaces for IoT via Domain-specific Comp...
 
A Model-Based Method for Seamless Web and Mobile Experience. Splash 2016 conf.
A Model-Based Method for  Seamless Web and Mobile Experience. Splash 2016 conf.A Model-Based Method for  Seamless Web and Mobile Experience. Splash 2016 conf.
A Model-Based Method for Seamless Web and Mobile Experience. Splash 2016 conf.
 
Big Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoBig Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di Milano
 
Web Science. An introduction
Web Science. An introductionWeb Science. An introduction
Web Science. An introduction
 
On the Quest for Changing Knowledge. Capturing emerging entities from social ...
On the Quest for Changing Knowledge. Capturing emerging entities from social ...On the Quest for Changing Knowledge. Capturing emerging entities from social ...
On the Quest for Changing Knowledge. Capturing emerging entities from social ...
 
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...Studying Multicultural Diversity of Cities and Neighborhoods through Social M...
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...
 
Model driven software engineering in practice book - Chapter 9 - Model to tex...
Model driven software engineering in practice book - Chapter 9 - Model to tex...Model driven software engineering in practice book - Chapter 9 - Model to tex...
Model driven software engineering in practice book - Chapter 9 - Model to tex...
 
Model driven software engineering in practice book - chapter 7 - Developing y...
Model driven software engineering in practice book - chapter 7 - Developing y...Model driven software engineering in practice book - chapter 7 - Developing y...
Model driven software engineering in practice book - chapter 7 - Developing y...
 
Automatic code generation for cross platform, multi-device mobile apps. An in...
Automatic code generation for cross platform, multi-device mobile apps. An in...Automatic code generation for cross platform, multi-device mobile apps. An in...
Automatic code generation for cross platform, multi-device mobile apps. An in...
 

Último

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Último (20)

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Web API, REST API and Web Scraping

  • 1. Web and REST API + Scraping Marco Brambilla, Riccardo Volonterio @marcobrambi @manudellavalleWeb Science #WebSciCo
  • 2. The Data Analysis Pipeline
  • 3. The 5 3 Ws of (Social) API & Scraping 1) What 2) Why 3) Who 4) When? Now. 5) Where? Everywhere. 6) How 7) Use cases a. UrbanScope b. La Scala
  • 4. What “An API (Application Program Interface) is a set of routines, protocols and tools for building software applications” – Wikipedia A WebAPI = HTTP based API Allows a programmatic access to data and platform And what if no API is available?
  • 5. What “An API (Application Program Interface) is a set of routines, protocols and tools for building software applications” – Wikipedia A WebAPI = HTTP based API Allows a programmatic access to data and platform And what if no API is available? Scraping!
  • 6. Why • Separation between model and presentation • Regulate access to the data • Traceable accounts • Enable access to only a portion of the available data • By Time • By Location • Avoid direct access to the platform website • Request throttling • Avoid server congestion • Avoid DOS/DDOS • Provide paid access to full (or higher volume) data
  • 8. How • URLs • API types • RESTful APIs • Authentication • Scraping
  • 9. How - URL URL format (RFC 2396): scheme:[//[user:password@]host[:port]][/]path[?query][#fragment] E.G.: git@github.com:nodejs/node.git mongodb://root:pass@localhost:27017/TestDB?options?replicaSet=test http://example.com abc://user:pass@example.com:123/path/data?key=value&k2=v2#fragid1 scheme credentials domain port path querystring fragment
  • 10. How – HTTP request When the scheme is HTTP(s), we can make HTTP requests as: VERB URL The VERB (or method) can be one of: • GET • POST • DELETE • OPTIONS • and more: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
  • 11. How - Endpoints Each API exposes a set of HTTP(s) endpoints (URLs) to call in order to get data or perform action on the platform. Example (retrieve data): GET https://graph.facebook.com/search GET https://api.twitter.com/1.1/search/tweets.json GET https://api.instagram.com/v1/tags/nofilter/media/recent Example (platform actions): POST https://api.instagram.com/v1/media/{media-id}/comments POST https://api.twitter.com/1.1/statuses/update.json POST https://graph.facebook.com/me/feed Legend: PROTOCOL DOMAIN API VERSION ENDPOINT HTTP METHOD
  • 12. How - Parameters Most of the endpoints can be “tweaked” via one or more parameters. Parameters can be added to an URL by appending to the endpoint some (URL encoded) strings. Example (Twitter): …/search/tweets.json ?q=%23expo2015milano &lang=it &result_type=recent &count=100 &token=1235abcd The request in plain English is: “Twitter, I am 1235abcd, can you please give me the 100 most recent tweets, in italian, by the user @expo2015milano?”
  • 13. How - REST REST “Representational state transfer (REST) is a style of software architecture.” - Roy T. Fielding Client-Server: a uniform interface separates clients from servers. Stateless: the client-server communication is constrained by no client context being stored on the server. Cacheable: clients and intermediaries can cache responses. Layered system: clients cannot tell whether are connected directly to the server. Uniform interface: simplifies and decouples the architecture.
  • 14. How - APIs WebAPI vs RESTful API WebAPI: An API that uses the HTTP protocol. Non standard/unpredictable URL format. POST http://example.com/getUserData.do?user=1234 RESTful API: Resource based WebAPI. Standard/predictable URL format. GET http://example.com/user/1234
  • 15. How – RESTful API The RESTful API uses the available HTTP verbs to perform CRUD operations based on the “context”: • Collection: A set of items (e.g.: /users) • Item: A specific item in a collection (e.g.: /users/{id}) VERB Collection Item POST Create a new item. Not used GET Get list of elements. Get the selected item. PUT Not used Update the selected item. DELETE Not used Delete the selected item.
  • 16. How - Authentication Almost all the APIs require a kind of user authentication. The user must register to the developer of the provider to obtain the keys to access the API. Most of the main API providers use the OAuth protocol to authenticate a user. The user registers to the developer portal to obtain a key and a secret. Platform authenticates the user via key/secret pair and supply a token that can be used to perform HTTP(S) requests to the desired API endpoint.
  • 17. Challenges • Pagination • Timelines • Parallelization • Multiple accounts • API gaps
  • 18. Challenges - Pagination Like SQL queries, most APIs support data pagination to split huge chunks of data into smaller set of data. Smaller chunks of data are easier to create, transfer (avoid long response time), cache and require less server computation time.
  • 19. Challenges - Timeline Most of the Social Networks leverage on the concept of timelines instead of standard pagination to iterate through results. An ideal pagination will work like the following (Twitter example):
  • 20. Challenges – Timeline 2 The problem is that data are continuously added to the top of the Twitter timeline. So the next request request can lead to retrieving the already processed tweets.
  • 21. Challenges – Timeline 3 The solution is to use the cursoring technique. Instead of reading from the top of the timeline we read the data relative to the already processed ids.
  • 22. Challenges - Parallelization When possible make parallel requests to gather more data in less time. Example: Get all the media from Instagram about @potus and #obama. We can make two parallel requests (one for each tag) and paginate the results instead of waiting for the first tag to finish. Some time this is not possible due to the requests interdependency like in pagination; each request depends on the result of previous one.
  • 23. Challenges – Multiple accounts Handling multiple accounts can increase the system throughput but increases the complexity of the system. Strategies to handle multiple accounts: • Request based • Round robin • Account pool • Account based • Requests stack
  • 24. Challenges – Multi accounts – Round robin The account through which the request is made is chosen using the round-robin strategy.
  • 25. Challenges – Multi accounts – Account pool The account through which the request is made is chosen sequentially until it’s rate limit is reached, then we use the next available one.
  • 26. Challenges – Multi accounts – Requests Stack This strategy overturn the roles. Now the accounts, in parallel, get the next request to do from the pool. Once an account completes a request the cycle restarts until the pool is empty.
  • 27. Challenges – Fill Gaps Sometimes the API does not give us all the data we want, so we need to fill those gaps. On quantity or time (Twitter) or on quality/content (Foursquare). Example: Twitter time limited data. We need to access data older then 10 days. Example: Foursquare check-ins timeline not available. Facebook like timestamps not available. Build your own timeline.
  • 28. Use case – UrbanScope (Crawler) “An API Crawler is a software that methodically interacts with a WebAPI to download data or to take some actions at predefined time intervals.” http://urbanscope.polimi.it/
  • 29. Use case – UrbanScope (Twitter) Goal: Get all the geolocated tweets in Milan and analyze the languages to detect anomalies. Problem: Twitter does not provide a simple endpoint “give me all tweets in Milan”. Solution: Create a grid of points (50m distance) over Milan and ask Twitter for each point. Problem: 70.000 points, with 1 account is 4 days. Solution: Use more keys! Problem: Handle multiple keys. Solution: Cry… Then handle multiple keys.
  • 30. Use case - UrbanScope - Explore Tweets
  • 31. Use case – UrbanScope (Foursquare) Goal: See how the check-ins progress in time to see where people go and visualize the pulse of Milan. Problem: Foursquare does not provide the timeline of the check-ins. Solution: Build our own timeline by collecting the number of check-ins every 4 hours for 90.000 venues. Problem: 90.000 api requests to do in 4 hours with an API time limit of 5000 requests per hour. Solution: See twitter… with more tears and more keys (24) Problem: Foursquare ban accounts that seems to crawl data automatically -.-
  • 32. Use case – UrbanScope – City magnets
  • 33. Use case – UrbanScope- Top Venues
  • 35. Scraping “Web scraping means collecting information from websites by extracting them directly from the HTML source code.”
  • 36. Use case – La Scala (Scraper) Goal: Steal Get data from twitter for 1 year ago. Problem: API allow access to 10 days in the past. Solution: The Twitter homepage allows to search without a time limit. Crete a web scraper that collect data from the HTML. Problem: The page needs to scroll to load more data. Solution: Use a (headless) fake browser to simulate the user interaction. Problem: You need to “start a browser” to get data, can be memory greedy. Solution: No. Maybe.
  • 37. Use case – La Scala (Scraper) Once we have all the data we can extract the required data using the following pipeline: Gather tweets Extract Sentiment Train classifiers Automatic classification Extract users Calculate Popularity score
  • 38. Scraping strategies Pre-packaged headless browsers – PhantomJS for JavaScript – Study and replicate/simulate navigation programmatively Manual implementation of navigation – Replicate JS (AJAX) invocations – Capture the data from the page – Marshall the data
  • 39. Guidelines and example • Scrape the minimum possible • Turn to API invocations as much as possible • Example: Twitter Scraping • Grab the tweet IDs • Get the detailed data about tweets via API • https://github.com/Volox/TwitterScraper
  • 40. Marco Brambilla, @marcobrambi, marco.brambilla@polimi.it Emanuele Della Valle, @manudellavalle, emanuele.dellavalle@polimi.it http://datascience.deib.polimi.it/course