Web Scraping is BS
John Downs - Software Engineer at Yodle
My Scraper
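import requests
from functools import partial
from bs4 import BeautifulSoup


def get_front_page():
    # Download the Hacker News front page and parse it.
    target = "https://news.ycombinator.com"
    frontpage = requests.get(target)
    # Name the parser explicitly so results do not depend on what is installed.
    news_soup = BeautifulSoup(frontpage.text, "html.parser")
    return news_soup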

def find_interesting_links(soup):
    # Each story row starts with a right-aligned "title" cell.
    search_attrs = {'align': 'right',
                    'class': 'title'}
    items = soup.find_all('td', search_attrs)
    links = []
    for i in items:
        # The next two cells hold the vote link and the story link.
        siblings = i.find_next_siblings(limit=2)
        # The vote link's id appears to look like "up_<post id>";
        # [3:] drops that three-character prefix.
        post_id = siblings[0].a.attrs['id'][3:]
        link = siblings[1].a.attrs['href']
        title = siblings[1].a.text
        score = get_score(soup, post_id)
        comments = get_comments(soup, post_id)

        if ('python' in title.lower()
                or (score > 50 and comments > 10)):
            links.append({'link': link,
                          'title': title,
                          'score': score,
                          'comments': comments})
    return links

def get_score(soup, post_id):
    # The score span is expected to read like "123 points".
    span_tag = soup.find('span',
                         id='score_' + post_id)
    return int(span_tag.text.split()[0])


def get_comments(soup, post_id):
    # The comments link is expected to read like "45 comments".
    a_tag = soup.find('a',
                      href='item?id=' + post_id)
    return int(a_tag.text.split()[0])
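
def add_to_pocket(consumer_key, access_token, url):
    # Save one URL to Pocket through its v3 "add" endpoint.
    target = 'https://getpocket.com/v3/add'
    request_params = {
        'url': url,
        'consumer_key': consumer_key,
        'access_token': access_token}
    result = requests.post(target,
                           data=request_params)
    return result.text


if __name__ == '__main__':
    # Placeholders: the slides do not show where the credentials come from.
    consumer_key = 'YOUR_CONSUMER_KEY'
    access_token = 'YOUR_ACCESS_TOKEN'
    soup = get_front_page()
    # partial is functools.partial; it pre-fills the two credentials.
    pocket = partial(add_to_pocket,
                     consumer_key,
                     access_token)
    results = find_interesting_links(soup)
    print(results)
    for r in results:
        print(pocket(r['link']))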
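The main block uses partial to bind the two credential arguments once, so the loop only has to pass a URL. A minimal sketch of that pattern with no network access; fake_add and the credential strings are hypothetical stand-ins, not part of the original scraper.

```python
from functools import partial


def fake_add(consumer_key, access_token, url):
    # Stand-in for a real HTTP call: it only formats its arguments.
    return "{} {} {}".format(consumer_key, access_token, url)


# Bind the first two positional arguments; only url remains to be supplied.
save = partial(fake_add, "key-123", "token-456")

print(save("http://example.com/a"))
```

The bound function behaves like fake_add with the first two arguments already filled in, which keeps the credentials out of the loop body.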
Searching with BS
• find() / find_all()
• find_parent() / find_parents()
• find_next_sibling() / find_next_siblings()
• find_previous_sibling() / find_previous_siblings()
• find_next() / find_all_next()
• find_previous() / find_all_previous()
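A short sketch of several of these methods on a made-up table row. The markup, ids, and variable names are assumptions chosen for illustration; the singular methods return one tag (or None) and the plural ones return a list.

```python
from bs4 import BeautifulSoup

# Hypothetical three-cell table row used only for this illustration.
html = """
<table><tr id="row">
<td class="rank">1.</td>
<td class="vote"><a id="up_42" href="vote?id=42">^</a></td>
<td class="title"><a href="http://example.com/post">A post</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

rank = soup.find("td", class_="rank")               # first matching tag
cells = soup.find_all("td")                         # every matching tag
row = rank.find_parent("tr")                        # nearest enclosing <tr>
next_cell = rank.find_next_sibling("td")            # next <td> at the same level
later_cells = rank.find_next_siblings("td")         # all following <td> siblings
prev_cell = next_cell.find_previous_sibling("td")   # back to the rank cell
next_link = rank.find_next("a")                     # next <a> later in the document

print(len(cells), row["id"], next_cell["class"], len(later_cells))
print(prev_cell.text, next_link["id"])
```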
Filters
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
Filters - Strings
>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
Filters - Regex
import re

# Match every tag whose name starts with "b".
regex = re.compile("^b")
for tag in soup.find_all(regex):
    print(tag.name)

body
b
Filters - Lists
soup.find_all(["a", "b"])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Filters - Functions
def has_class_but_no_id(tag):
    # A filter function takes a tag and returns True to keep it.
    return (tag.has_attr('class') and
            not tag.has_attr('id'))

soup.find_all(has_class_but_no_id)

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were...</p>,
 <p class="story">...</p>]
The API
find_all(name, attrs, recursive, text, limit, **kwargs)
name: a string that matches tag names, e.g. "a" matches every <a> tag
attrs: a dictionary of HTML attributes to match
soup.find_all("a", attrs={"class": "sister"})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
recursive: a boolean; True (the default) searches all descendants, False searches only direct children
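A minimal sketch of the difference, using a made-up fragment with one <p> as a direct child and another nested one level deeper; the markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<div id='outer'><p>child</p><section><p>grandchild</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# Default behaviour: walk every descendant of the tag.
all_paragraphs = outer.find_all("p")
# recursive=False: look only at the tag's direct children.
direct_paragraphs = outer.find_all("p", recursive=False)

print([p.text for p in all_paragraphs])     # ['child', 'grandchild']
print([p.text for p in direct_paragraphs])  # ['child']
```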
text: search for strings instead of tags (Beautiful Soup 4.4.0 and later name this argument string)
soup.find_all(text=re.compile("Dormouse"))
["The Dormouse's story", "The Dormouse's story"]
limit: an int to control the number of items returned
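A small sketch of limit on a made-up list. It also shows find(), which returns the first match directly, or None when nothing matches.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Stop searching once two matches have been collected.
first_two = soup.find_all("li", limit=2)
# find() returns a single tag rather than a list.
first = soup.find("li")
# With no match, find() returns None instead of an empty list.
missing = soup.find("table")

print([li.text for li in first_two])  # ['one', 'two']
print(first.text)                     # one
print(missing)                        # None
```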
keyword: a keyword argument will search for a tag with that attribute

>>> soup.find_all(id='link2')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
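Keyword filters combine with each other and accept regular expressions as values. Because class is a reserved word in Python, Beautiful Soup spells that keyword class_. A minimal sketch on a two-link fragment; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ stands in for the reserved word class.
by_class = soup.find_all("a", class_="sister")
# A compiled regex is searched against the attribute value.
by_href = soup.find_all(href=re.compile("elsie"))
# Several keyword filters must all match.
by_both = soup.find_all("a", class_="sister", id="link2")

print([a["id"] for a in by_class])  # ['link1', 'link2']
print([a.text for a in by_href])    # ['Elsie']
print([a.text for a in by_both])    # ['Lacie']
```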
Navigating with BS
The easiest way to navigate down the tree is dot notation, which returns the first tag with that name.
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.title
<title>The Dormouse's story</title>
You can look at the children of an element with .contents
>>> head_tag = soup.head
>>> head_tag
<head><title>The Dormouse's story</title></head>
>>> head_tag.contents
[<title>The Dormouse's story</title>]
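A brief sketch of a few related navigation attributes on a made-up fragment: .contents is a list of direct children (strings included), .children iterates over the same nodes, and .parent walks back up. The markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<div><p>first <b>bold</b></p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Dot notation stops at the first <p> inside the first <div>.
first_p = soup.div.p
# Direct children only: a text node followed by the <b> tag.
children = first_p.contents
# .children yields the same nodes one at a time.
kinds = [type(child).__name__ for child in first_p.children]
# .parent moves one level up the tree.
parent_name = first_p.b.parent.name

print(children)
print(kinds)        # ['NavigableString', 'Tag']
print(parent_name)  # p
```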
A Bigger Challenge?
Fifty Years Ago
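import requests
import datetime
from bs4 import BeautifulSoup

today = datetime.date.today()
month = today.strftime('%B')
fyo = str(today.year - 50)
url = 'http://en.wikipedia.org/wiki/' + month + '_' + fyo
# Prefix of the id expected on today's heading, e.g. "July_4,".
# This id format is an assumption about the page layout.
target = month + '_' + str(today.day) + ','

data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')

# Find the first tag whose id starts with the target prefix.
heading = soup.find(
    lambda tag: tag.has_attr('id') and
    tag['id'].startswith(target))

# The day's events are assumed to be the first list after the heading.
events = heading.find_next('ul') if heading is not None else None
if events is None:
    print('No events found for ' + target)
else:
    for event in events.find_all('li'):
        print(event.text)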
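Logic like this is easier to check against a saved fragment than against the live page. A minimal sketch, assuming a hypothetical fragment in which each day's heading carries an id beginning with the month and day and is followed by a list of events. The events_for helper and the markup are illustrative; they are not taken from the slides or from Wikipedia's actual layout.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: dated headings, each followed by a list of events.
html = """
<h3 id="July_4,_1964_(Saturday)">July 4, 1964 (Saturday)</h3>
<ul><li>First event</li><li>Second event</li></ul>
<h3 id="July_14,_1964_(Tuesday)">July 14, 1964 (Tuesday)</h3>
<ul><li>Another event</li></ul>
"""


def events_for(soup, prefix):
    """Return the text of each list item that follows the matching heading."""
    heading = soup.find(
        lambda tag: tag.has_attr("id") and tag["id"].startswith(prefix))
    if heading is None:
        return []
    events = heading.find_next("ul")
    if events is None:
        return []
    return [li.text for li in events.find_all("li")]


soup = BeautifulSoup(html, "html.parser")
print(events_for(soup, "July_4,"))  # ['First event', 'Second event']
print(events_for(soup, "July_1,"))  # []
```

Ending the prefix with a comma keeps "July_1," from matching the heading for July 14.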
RTFM
• http://docs.python-requests.org/en/latest
• http://www.crummy.com/software/BeautifulSoup/
• http://lxml.de/
• http://jakeaustwick.me/python-web-scraping-resource/
• https://github.com/jdowns/scraper
• john.downs@acm.org
