My first-crawler-in-python
Viller Hsiao
My first-crawler-in-python
1.
Viller Hsiao
2.
Fetching financial report data with Python
• Practice Python
• Practice writing Python well
3.
Fetching financial report data with Python
• Practice Python
• Practice writing Python well
• Understand web architecture
• Calculate stock value
4.
Steps
• Fetch the web page
• Parse the content
• Compute on the data
5.
Data source, table type, stock ID
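The three labels appear to name the parts of the source URL used later in the deck (`http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm`). A minimal sketch of composing that URL, assuming `zch` is the table type and `2330` the stock ID. The name `build_url` and the URL template are illustrative guesses, not taken from the slides.

```python
# Assumption: the URL shown in the deck follows the pattern
# <data source>/z/zc/<table>/<table>_<stock id>.djhtm.
BASE_URL = 'http://jdata.yuanta.com.tw/z/zc'


def build_url(table, stock_id):
    """Return the report URL for one table type and one stock ID."""
    return '{base}/{table}/{table}_{stock_id}.djhtm'.format(
        base=BASE_URL, table=table, stock_id=stock_id)


print(build_url('zch', '2330'))
```

With these inputs the sketch reproduces the URL from the deck. Whether other table codes follow the same pattern is not confirmed by the slides.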
6.
Inspect Element
7.
Developer tools
• Practice the Google Python Style Guide
8.
Life of a Middle-Aged Py (a play on the film title Life of Pi) http://static.ettoday.net/images/206/206484.jpg
9.
Python Modules
• Parse DOM
    • urllib + SGMLParser
    • requests + BeautifulSoup4
• Excel
    • xlutils
10.
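urllib
    import urllib  # Python 2; in Python 3 the call is urllib.request.urlopen

    # Presumably an excerpt from a method of the crawler class, hence `self`.
    url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'
    webcode = urllib.urlopen(url)
    if webcode.code == 200:
        self.webpage = webcode.read()
    webcode.close()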
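The slide's snippet targets Python 2. A rough Python 3 equivalent might look like the sketch below. The function name `fetch_page`, the `timeout` value, and the `opener` parameter (a hook that lets the function be tried without network access) are assumptions for illustration, not part of the original deck.

```python
import urllib.request


def fetch_page(url, opener=urllib.request.urlopen, timeout=10):
    """Return the raw bytes of the page, or None if the status is not 200.

    `opener` is a hypothetical hook so the function can be exercised offline.
    """
    # The context manager closes the connection even if read() fails.
    with opener(url, timeout=timeout) as response:
        # urlopen raises HTTPError for most error statuses, so this check
        # mainly mirrors the structure of the slide's code.
        if response.status == 200:
            return response.read()
    return None
```

Calling `fetch_page('http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm')` would return bytes, which still have to be decoded before parsing.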
11.
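SGMLParser
    from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

    class AccountTable(SGMLParser):
        # Method bodies are not shown on the slide.
        def feed(self, data): pass
        def start_tr(self, attrs): pass
        def end_tr(self): pass
        def handle_data(self, data): pass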
12.
Oops
    def start_table(self, attrs):
        if len(attrs) > 0:
            for at in attrs:
                if at[0] == 'id' and at[1] == 'oMainTable':
                    self.isTargetTbl = True
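Since `sgmllib` no longer exists in Python 3, the same event-driven idea can be sketched with the standard library's `html.parser.HTMLParser`. This is an illustrative rewrite, not the author's code: the attribute names `in_target`, `in_cell`, and `rows` are made up, the sample HTML is a stand-in, and nested tables are not handled.

```python
from html.parser import HTMLParser


class AccountTable(HTMLParser):
    """Collect the cell text of every row in the table with id 'oMainTable'."""

    def __init__(self):
        super().__init__()
        self.in_target = False   # True while inside the target table
        self.in_cell = False     # True while inside a <td> of that table
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() makes lookup simple.
        if tag == 'table' and dict(attrs).get('id') == 'oMainTable':
            self.in_target = True
        elif self.in_target and tag == 'tr':
            self.rows.append([])
        elif self.in_target and tag == 'td' and self.rows:
            self.in_cell = True
            self.rows[-1].append('')

    def handle_endtag(self, tag):
        if tag == 'table':
            self.in_target = False
        elif tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


parser = AccountTable()
parser.feed('<table id="oMainTable"><tr><td>104/05</td>'
            '<td>70,154,763</td></tr></table>')
print(parser.rows)
```

The `dict(attrs).get('id')` lookup replaces the length check and the loop over attribute pairs shown on the "Oops" slide.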
13.
Chinese character transcoding
    line.encode('big5').decode('utf8')
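The slide gives only this one expression and does not say what type `line` has. As a hedged illustration of converting between Big5 and UTF-8 in Python 3, the sketch below assumes the page arrives as Big5-encoded bytes, which the slide does not confirm. The sample string is made up.

```python
# Assumption: the downloaded page is a sequence of Big5-encoded bytes.
raw = '營業收入'.encode('big5')      # stand-in for bytes read from the page

text = raw.decode('big5')            # bytes -> str (Unicode text)
utf8_bytes = text.encode('utf-8')    # str -> UTF-8 bytes, e.g. for writing a file

print(text)
print(len(raw), len(utf8_bytes))
```

In Python 3 the decoded `str` is usually what the parser should receive. Re-encoding is only needed when writing bytes out.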
14.
v2.0
• Coding style refinement
• Google Python Style Guide
• Python idioms
15.
g0v project
16.
requests
    import requests

    def parse_url(url):
        r = requests.get(url)
        if r.status_code == requests.codes.ok:
            parse_html(r.text)
17.
BeautifulSoup
    from bs4 import BeautifulSoup

    def parse_html(html_text):
        soup = BeautifulSoup(html_text, 'html.parser')  # name the parser explicitly
        table = soup.find('table', class_='t01')  # `class` is a keyword, so bs4 uses `class_`
        rows = table.find_all('tr')
        data = []
        for row in rows:
            cols = row.find_all('td')
            cols = [e.text.encode('utf-8').strip() for e in cols]
            data.append(cols)
        return data  # added so callers can use the result

Sample cells:
    <td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>
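Before any calculation, the extracted cell strings need to become numbers. A small sketch, assuming `104/05` is a year and month in the ROC (Minguo) calendar, where adding 1911 gives the Gregorian year, and that the cells are already decoded text. The helper names `parse_period` and `parse_amount` are hypothetical.

```python
def parse_period(period):
    """Convert 'YYY/MM' in the ROC calendar to (Gregorian year, month)."""
    roc_year, month = period.split('/')
    return int(roc_year) + 1911, int(month)


def parse_amount(amount):
    """Convert a string such as '70,154,763' to an int."""
    return int(amount.replace(',', ''))


row = ['104/05', '70,154,763']   # cell text from the sample on the slide
print(parse_period(row[0]), parse_amount(row[1]))
```

What the amount represents, and its unit, is not stated on the slide, so the sketch only converts the type.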
18.
Future Plan
• concurrent / gevent
• fake browser header
• free proxy
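Two of these plans can be sketched with the standard library alone. The example below uses a thread pool from `concurrent.futures` rather than gevent, and sets a browser-like `User-Agent` header. The header string, the function names, and the `fetch_one` hook (which allows trying the code without network access) are all assumptions for illustration. Proxies are not covered.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Assumption: any browser-like User-Agent string would do; this one is made up.
BROWSER_HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}


def make_request(url):
    """Build a request that carries a browser-like header."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url):
    """Download one page and return its bytes."""
    with urllib.request.urlopen(make_request(url), timeout=10) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=4):
    """Fetch several pages in parallel; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Threads suit this job because the work is dominated by waiting on the network. A modest `max_workers` also keeps the load on the source site low.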