
My first-crawler-in-python


  1. Viller Hsiao
  2. Fetching financial report data with Python • Practice Python • Practice writing Python well
  3. Fetching financial report data with Python • Practice Python • Practice writing Python well • Understand web architecture • Calculate stock value
  4. Steps • Fetch the web page • Parse the content • Compute on the data
  5. Data source • Table type • Stock ID
  6. Inspect Element
  7. Developer Tools • Practice the Google Python Style Guide
  8. The Fantastic Drift of a Middle-Aged Py http://static.ettoday.net/images/206/206484.jpg
  9. Python Modules • Parse DOM: urllib + SGMLParser, requests + BeautifulSoup4 • Excel: xlutils
  10. urllib
         import urllib  # Python 2; urlopen lives in urllib.request in Python 3
         url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'
         webcode = urllib.urlopen(url)
         if webcode.code == 200:
             self.webpage = webcode.read()
         webcode.close()
  11. SGMLParser (method outline; the slide shows signatures only)
         from sgmllib import SGMLParser  # Python 2 only
         class AccountTable(SGMLParser):
             def feed(self, data):
             def start_tr(self, attrs):
             def end_tr(self):
             def handle_data(self, data):
  12. Oops
         def start_table(self, attrs):
             if len(attrs) > 0:
                 for at in attrs:
                     if at[0] == 'id' and at[1] == 'oMainTable':
                         self.isTargetTbl = True
  13. Chinese encoding conversion
         line.encode('big5').decode('utf8')
  14. v2.0 • Coding style refinement • Google Python Style Guide • Python idioms
  15. g0v project
  16. requests
         import requests
         def parse_url(url):
             r = requests.get(url)
             if r.status_code == requests.codes.ok:
                 parse_html(r.text)
  17. BeautifulSoup
         from bs4 import BeautifulSoup
         def parse_html(html_text):
             soup = BeautifulSoup(html_text, 'html.parser')
             # class is a reserved word, so bs4 takes the keyword class_
             rows = soup.find('table', class_='t01')
             rows = rows.find_all('tr')
             data = []
             for row in rows:
                 cols = row.find_all('td')
                 cols = [e.text.encode('utf-8').strip() for e in cols]
                 data.append(cols)
      Sample cells from the page: <td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>
  18. Future Plan • concurrent / gevent • fake browser header • free proxy
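
The SGMLParser approach on slides 11 and 12 depends on `sgmllib`, which exists only in Python 2. A rough Python 3 counterpart can be sketched with the standard library's `html.parser`. The class name `TableRowParser`, the helper `parse_rows`, and the sample markup are hypothetical; the sample combines the table id from slide 12 with the cells from slide 17, which is an assumption about how the real page is laid out.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> inside one target <table>."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id  # id attribute of the table to read
        self.in_target = False    # True while inside the target table
        self.depth = 0            # <table> nesting depth inside the target
        self.in_cell = False      # True while inside a <td>
        self.row = None           # cells of the row being read
        self.rows = []            # finished rows

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.in_target:
                self.depth += 1
            elif dict(attrs).get("id") == self.table_id:
                self.in_target = True
                self.depth = 1
        elif self.in_target and tag == "tr":
            self.row = []
        elif self.in_target and tag == "td" and self.row is not None:
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if not self.in_target:
            return
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.in_target = False
        elif tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
            self.in_cell = False

    def handle_data(self, data):
        # Only text inside a cell of the target table is kept.
        if self.in_cell and self.row:
            self.row[-1] += data.strip()


def parse_rows(html_text, table_id="oMainTable"):
    """Return the target table's rows as lists of cell strings."""
    parser = TableRowParser(table_id)
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical sample markup, not a copy of the real page.
SAMPLE = (
    '<table id="other"><tr><td>skip</td></tr></table>'
    '<table id="oMainTable"><tr>'
    '<td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>'
    '</tr></table>'
)
rows = parse_rows(SAMPLE)
```

Tracking the table depth is one way to avoid the "Oops" case on slide 12, where rows from tables other than the target would otherwise be collected.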
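
Slide 13 deals with Chinese text encoding. In Python 3 the usual direction is to decode the raw bytes of the page into `str` first and re-encode only when writing output. The sketch below assumes the page is served as Big5, as the slide suggests; the function names are hypothetical.

```python
def decode_big5(raw_bytes):
    """Turn Big5-encoded bytes into a Python 3 str."""
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return raw_bytes.decode("big5", errors="replace")


def to_utf8(raw_bytes):
    """Re-encode Big5 bytes as UTF-8 bytes, e.g. before writing a file."""
    return decode_big5(raw_bytes).encode("utf-8")


# Stand-in for bytes read from the page (assumed to be Big5).
sample_bytes = "財報".encode("big5")
sample_text = decode_big5(sample_bytes)
```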
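
The "compute on the data" step from slide 4 needs the scraped strings turned into numbers first. A minimal sketch follows, using the two sample cells from slide 17. It assumes that `104/05` is a Republic of China (Minguo) year and month, so that adding 1911 gives the Gregorian year, and that the second cell is a comma-grouped integer; the slides do not state either point or the unit of the amount, and all function names are hypothetical.

```python
from datetime import date

ROC_YEAR_OFFSET = 1911  # Minguo year + 1911 = Gregorian year


def parse_roc_month(text):
    """Convert 'YYY/MM' in the Minguo calendar to the first day of that month."""
    year, month = text.strip().split("/")
    return date(int(year) + ROC_YEAR_OFFSET, int(month), 1)


def parse_amount(text):
    """Convert a comma-grouped figure such as '70,154,763' to int."""
    return int(text.strip().replace(",", ""))


def clean_row(cells):
    """Map one [period, amount] row of strings to typed values."""
    period, amount = cells
    return parse_roc_month(period), parse_amount(amount)


cleaned = clean_row(["104/05", "70,154,763"])
```

With typed values in hand, later calculations such as month-over-month comparisons can work on `date` and `int` objects rather than on raw cell text.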
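
Two items from the Future Plan slide, concurrent fetching and a browser-like header, can be sketched with the Python 3 standard library alone. This is an illustration under assumptions, not the author's implementation: the URL template assumes that the number in the slide 10 URL is the stock ID, the `User-Agent` value is an arbitrary example, and `build_url`, `fetch`, and `fetch_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Assumption: the digits in the slide 10 URL are the stock ID.
URL_TEMPLATE = "http://jdata.yuanta.com.tw/z/zc/zch/zch_{stock_id}.djhtm"

# Illustrative header only; any browser-like value could be used.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_url(stock_id):
    """Fill the stock ID into the URL template."""
    return URL_TEMPLATE.format(stock_id=stock_id)


def fetch(url, timeout=10):
    """Download one page as raw bytes, sending a browser-like header."""
    request = urllib.request.Request(url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(stock_ids, fetcher=fetch, max_workers=4):
    """Fetch several stock pages in parallel threads.

    fetcher is injectable so the function can be tried without network access.
    """
    stock_ids = list(stock_ids)
    urls = [build_url(stock_id) for stock_id in stock_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetcher, urls))  # map preserves input order
    return dict(zip(stock_ids, pages))
```

For example, `fetch_all(["2330", "2317"])` would request both pages concurrently, while passing a stub such as `fetcher=lambda url: url.encode()` exercises the same code path offline.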
