Web Scraping 1-2-3 with
   Python + Scrapy
        Sammy Fung
    sammy.hk, gownjob.com
Today's Agenda
● Some Cases
● Python and Scrapy
Web Scraping
● a computer software technique of extracting
  information from websites. (Wikipedia)
● for business, hobbies, research, ...
● Business cases are NOT covered today.
CableTV & NOWTV Programme
(Past)
● 2004.
● slow, slow, slow, or worse: can't connect.
● used Flash.
HK Observatory and Joint Typhoon
Warning Center
● no easy data exchange format, e.g.
  RSS/Atom.
● We won't check websites every day.
Transportation - KMB, PTES
● no map view on KMB website for a bus route
  in the past.
● Extremely poor, ugly (or much worse) map
  UI on PTES.
My experiences on web scraping
● 2004: php
● year after: python
● recent years: python with scrapy
Document Types
● HTML, XML,......
● Text
● Others, e.g. pictures, videos, ...
Web Scraping
●   Look for the right URLs to scrape.
●   Look for the right content in the webpages.
●   Save the data into a data store.
●   When should the web scraping program run?
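The steps above can be sketched with the Python standard library alone. This is only an illustration, not the code used in the talk: the sample page, the `readings` table and the helper names are all made up, and the download step is replaced by an inline string so the sketch runs offline.

```python
import sqlite3
from html.parser import HTMLParser

# Step 1 (assumed): a real program would download this from the chosen URL.
# The page content below is a hypothetical sample.
SAMPLE_HTML = "<html><body><pre>Station A 25\nStation B 27\n</pre></body></html>"


class PreTextParser(HTMLParser):
    """Collect the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_rows(html):
    """Step 2: pick the wanted content out of the page."""
    parser = PreTextParser()
    parser.feed(html)
    parser.close()
    rows = []
    for line in "".join(parser.chunks).splitlines():
        # Split "Station A 25" into the name and the trailing number.
        name, _, value = line.rpartition(" ")
        if name and value.isdigit():
            rows.append((name, int(value)))
    return rows


def save_rows(conn, rows):
    """Step 3: save the extracted data into a data store."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (station TEXT, value INTEGER)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = extract_rows(SAMPLE_HTML)
save_rows(conn, rows)
print(rows)
```

Step 4, deciding when to run, is left outside the script here; one common choice is to have an external scheduler such as cron start it at a fixed interval.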
What is Scrapy?
● An open source web scraping framework for
  Python.
● Scrapy is a fast high-level screen scraping
  and web crawling framework, used to crawl
  websites and extract structured data from
  their pages. It can be used for a wide range
  of purposes, from data mining to monitoring
  and automated testing.
Features of Scrapy
● define the data you want to scrape
● write a spider to extract the data
● Built-in: selecting and extracting data from
  HTML and XML
● Built-in: JSON, CSV, XML output
● Interactive shell console
● Built-in: web service, telnet console, logging
● Others
Installation of Scrapy
●   pip
●   APT repo
●   RPM
●   tarball (binary/source)
Create new scrapy project
$ scrapy startproject mybot
mybot/
mybot/scrapy.cfg
mybot/mybot/items.py
mybot/mybot/pipelines.py
mybot/mybot/settings.py
mybot/mybot/spiders/myspider.py
etc.......
items.py
from scrapy.item import Item, Field

class HKOCurrentItem(Item):
 time = Field()
 station = Field()
 temperature = Field()
 humidity = Field()
 #......
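A Scrapy Item behaves like a dict that only accepts the fields declared on the class; assigning to an undeclared field raises a KeyError. The sketch below mimics that behaviour with plain Python so it runs without Scrapy installed. `DeclaredFieldsItem` and `CurrentWeather` are hypothetical stand-ins, not Scrapy's implementation, and the values are made up.

```python
class DeclaredFieldsItem(dict):
    """Dict that only accepts keys listed in `fields` (hypothetical stand-in)."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class CurrentWeather(DeclaredFieldsItem):
    # Same field names as the HKOCurrentItem slide.
    fields = ("time", "station", "temperature", "humidity")


reading = CurrentWeather()
reading["station"] = "HKO"
reading["temperature"] = 28

try:
    reading["wind"] = 10  # not declared, so it is rejected
    rejected = False
except KeyError:
    rejected = True

print(dict(reading), rejected)
```

Declaring the fields up front means a typo in a field name fails loudly in the spider instead of silently producing a new column in the output.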
spiders/hko_spider.py (1/5)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from weatherhk.items import HKOCurrentItem

import datetime, re
spiders/hko_spider.py (2/5)
class HKOCurrentSpider(BaseSpider):
 name = "HKOCurrentSpider"
 #allowed_domains = ["www.weather.gov.hk"]
 start_urls = [
   "http://www.weather.gov.hk/textonly/forecast/chinesewx.htm"
 ]
spiders/hko_spider.py (3/5)
 def parse(self, response):
   hxs = HtmlXPathSelector(response)

   stations = []
   # Get the weather data for each station: split the first <pre> block into lines.
   tx = hxs.select("//pre[1]/text()").re('[^\n]*\n')
spiders/hko_spider.py (4/5)
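   for i in tx:
     # Keep only lines that contain a digit followed by " 度" (degrees).
     if re.search(u'\d 度', i):
       data = HKOCurrentItem()
       # dt and self.station are not defined on the slides.
       data['time'] = int(dt)
       data['station'] = self.station.code(i)
       data['temperature'] = int(re.findall(u'\d+', i)[0])
       stations.append(data)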
spiders/hko_spider.py (5/5)
   return stations
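The line-splitting and temperature-matching logic of the spider can be tried without Scrapy. The sketch below is an illustration under assumptions: the sample text is invented (the real page layout may differ), the station name is simply taken as the first word of the line, and `parse_temperatures` is a hypothetical helper, not part of the talk's code.

```python
import re

# Hypothetical sample of the plain-text report; values are made up.
SAMPLE_TEXT = (
    "香港天文台 28 度\n"
    "京士柏 27 度\n"
    "本港地區今日天氣預測\n"
)


def parse_temperatures(text):
    readings = []
    # Split the text into lines, each ending with a newline.
    for line in re.findall(r"[^\n]*\n", text):
        # Keep only lines that contain a digit followed by " 度" (degrees).
        if re.search(r"\d 度", line):
            station = line.split()[0]  # assumption: name is the first word
            temperature = int(re.findall(r"\d+", line)[0])
            readings.append({"station": station, "temperature": temperature})
    return readings


readings = parse_temperatures(SAMPLE_TEXT)
print(readings)
```

The third sample line has no digit before 度, so it is skipped, which is how the spider filters out headings and forecast text.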
pipelines.py (1/2)
class HKOCurrentPipeline(object):
 def process_item(self, item, spider):
   station = self.db[item['station']]
   storeditem = dict(item)  # plain-dict copy of the item's fields
pipelines.py (2/2)
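   try:
     if 'temperature' in storeditem:
       lasttime = station.find({'temperature': {'$gt': 0}}).sort('time', -1).limit(1)
     if lasttime[0]['time'] != storeditem['time']:
       id = self.insert(storeditem)
   except IndexError:
     # Assumption (the except clause is not on the slide): no earlier
     # reading is stored for this station, so store this one.
     id = self.insert(storeditem)

   return item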
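The idea behind the pipeline, storing a reading only when its timestamp differs from the latest one already stored for that station, can be sketched without a database. `InMemoryStore` and `process_reading` are hypothetical stand-ins for the collection and the pipeline method, and the station code and timestamps are made up.

```python
class InMemoryStore:
    """Hypothetical stand-in for a per-station collection in a database."""

    def __init__(self):
        self.rows = {}  # station code -> list of stored readings

    def latest_time(self, station):
        readings = self.rows.get(station, [])
        # None when nothing has been stored for this station yet.
        return max((r["time"] for r in readings), default=None)

    def insert(self, station, reading):
        self.rows.setdefault(station, []).append(reading)


def process_reading(store, reading):
    """Store the reading unless the latest stored one has the same timestamp."""
    station = reading["station"]
    if "temperature" in reading and store.latest_time(station) != reading["time"]:
        store.insert(station, dict(reading))
    return reading


store = InMemoryStore()
process_reading(store, {"station": "HKO", "time": 1200, "temperature": 28})
process_reading(store, {"station": "HKO", "time": 1200, "temperature": 28})  # repeat, skipped
process_reading(store, {"station": "HKO", "time": 1300, "temperature": 29})
print(len(store.rows["HKO"]))
```

Returning the reading unchanged mirrors `process_item`, which hands the item on to any later pipeline. In a Scrapy project, a pipeline class only runs once it is listed in the `ITEM_PIPELINES` setting in settings.py.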

Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
