3. Some sites suck - "for your own good"
For security reasons, each button is a randomly placed image,
dynamically generated by a hash and wrapped in a mess of JavaScript
4. ...but they work in a web browser!
Let's use the web browser to scrape them
6. Selenium can...
● navigate (windows, frames, links)
● find elements and parse attributes
● interact and trigger events (click, type, ...)
● capture screenshots
● run JavaScript
● let the browser take care of the hard stuff
(cookies, JavaScript, sessions, profiles, DOM)
Comes with various components and bindings
... including Python
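A minimal sketch of a few of these capabilities (navigate, find elements, parse attributes, run JavaScript, capture a screenshot). The function name summarise_page and the example.com URL are placeholders invented for illustration, not part of the talk; the driver is passed in so the function itself needs no browser to import.

```python
def summarise_page(driver, url):
    """Visit url and return (title, link_hrefs, screenshot_png).

    `driver` is any Selenium WebDriver instance (hypothetical helper,
    not from the original slides).
    """
    driver.get(url)  # navigate

    # find elements and parse attributes;
    # "tag name" is the string value behind selenium's By.TAG_NAME
    links = driver.find_elements("tag name", "a")
    hrefs = [link.get_attribute("href") for link in links]
    hrefs = [href for href in hrefs if href]  # drop anchors without an href

    # run javascript in the page and get the result back in Python
    title = driver.execute_script("return document.title;")

    # capture a screenshot as raw PNG bytes
    png = driver.get_screenshot_as_png()
    return title, hrefs, png


# Typical use (needs selenium plus a local Firefox and its driver):
#     from selenium import webdriver
#     driver = webdriver.Firefox()
#     try:
#         title, hrefs, png = summarise_page(driver, "https://example.com/")
#     finally:
#         driver.quit()
```

The browser still does the hard work (cookies, sessions, script execution); the Python side only asks for the results.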
7. General Recipe
Ingredients:
● Firefox (or Chrome)
● Firebug (or Chrome dev tools)
● Selenium IDE
○ record a session, write less code
● Python and its batteries
● python-selenium
● xvfb and pyvirtualdisplay (optional)
● other libraries to taste
○ e.g. image manipulation, database access, DOM parsing, OCR
8. General Recipe
Method:
● Install requirements (apt-get, pip etc)
○ sudo apt-get install xvfb firefox
○ pip install selenium pyvirtualdisplay
● Start up Firefox and Selenium IDE
● Record a "test" run through site
○ Add in some assertions along the way
● Export test as Python script
● Hack from there
○ Loops
○ Image/data extraction
○ Wrangling data into a database
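For the last step, "wrangling data into a database", a rough sketch using only Python's batteries (sqlite3 and decimal). The table layout, the function names and the sample account names are assumptions made up for illustration; the talk does not say which database or schema was used.

```python
import sqlite3
from decimal import Decimal


def parse_balance(text):
    """Turn a scraped string such as '$1,234.56' into integer cents."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(Decimal(cleaned) * 100)


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS balances ("
        " scraped_at TEXT NOT NULL,"
        " account TEXT NOT NULL,"
        " cents INTEGER NOT NULL)"
    )
    return conn


def record_balances(conn, scraped_at, rows):
    """Store one scrape; rows is an iterable of (account, balance_text)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO balances VALUES (?, ?, ?)",
            [(scraped_at, account, parse_balance(text)) for account, text in rows],
        )


# Hypothetical data, as it might come back from a scraping loop
conn = open_store()
record_balances(conn, "2012-06-29", [("Savings", "$1,234.56"), ("Everyday", "$0.99")])
```

Storing cents as integers sidesteps floating-point rounding when the numbers are summed or graphed later.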
10. Example from Selenium IDE
import unittest
from selenium import webdriver
from selenium.webdriver.common.by import By

# Exported by Selenium IDE; uses the Selenium 2-era API
# (Selenium 4 spells these find_element(By.ID, ...) and switch_to.frame(...))
class Ingdirect2(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://www.ingdirect.com.au"
        self.verificationErrors = []

    def test_ingdirect2(self):
        driver = self.driver
        driver.get(self.base_url + "/client/index.aspx")
        driver.switch_to_frame('body')  # Had to add this
        driver.find_element_by_id("txtCIF").clear()
        driver.find_element_by_id("txtCIF").send_keys("12345678")
        driver.find_element_by_id("objKeypad_B1").click()
        driver.find_element_by_id("objKeypad_B2").click()
        driver.find_element_by_id("objKeypad_B3").click()
        driver.find_element_by_id("objKeypad_B4").click()
        driver.find_element_by_id("btnLogin").click()
        # is_element_present is a helper from the exported script (not shown on this slide)
        self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))

But what about that dang keypad? ...
11. PIL saves the day
import base64
import hashlib
import io
from PIL import Image

# Get screenshot for extraction of button images
screenshot = driver.get_screenshot_as_base64()
im = Image.open(io.BytesIO(base64.b64decode(screenshot)))
table = driver.find_element_by_xpath('//*[@id="objKeypad_divShowAll"]/table')
all_buttons = table.find_elements_by_tag_name("input")

# Determine md5sum of each button by cropping based on element positions
# (getcropbox is a helper not shown on this slide)
button_mapping = {}
for button in all_buttons:
    button_image = im.crop(getcropbox(button))
    hexid = hashlib.md5(button_image.tobytes()).hexdigest()
    button_mapping[hexid] = button.get_attribute("id")

# Now we know which button is which (based on previous lookup), enter the PIN
for char in self.pin:
    driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()
driver.find_element_by_id("btnLogin").click()
# We're in!!!11one
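The slide calls getcropbox without showing it. One plausible version is sketched below, assuming it simply converts a WebElement's location and size dictionaries into the (left, upper, right, lower) box that PIL's crop expects; the real helper may differ. image_fingerprint and pin_to_button_ids are hypothetical names that restate the hashing and lookup steps as plain functions.

```python
import hashlib


def getcropbox(element):
    """Return a PIL-style (left, upper, right, lower) box for an element.

    Assumption: `element` exposes Selenium's `location` ({'x', 'y'}) and
    `size` ({'width', 'height'}) dictionaries, in screenshot pixels.
    """
    left = int(element.location["x"])
    upper = int(element.location["y"])
    right = left + int(element.size["width"])
    lower = upper + int(element.size["height"])
    return (left, upper, right, lower)


def image_fingerprint(pixel_bytes):
    """md5 hex digest of raw pixel data, e.g. from Image.tobytes()."""
    return hashlib.md5(pixel_bytes).hexdigest()


def pin_to_button_ids(pin, hex_mapping, button_mapping):
    """Translate each PIN character into the id of the button to click.

    hex_mapping:    character -> fingerprint (built once, by hand)
    button_mapping: fingerprint -> element id (rebuilt on every page load)
    """
    return [button_mapping[hex_mapping[char]] for char in pin]
```

The two-step lookup is the point of the trick: button ids and positions are shuffled on every load, but the pixels of the "7" image hash to the same value each time.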
12. But why do all this?
It's my data! ... and I'll graph if I want to
* Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
13. That's all folks
Slides
● http://bit.ly/scrapium
Code
● https://gist.github.com/3015852
Me
● https://twitter.com/mindsocket
● https://github.com/mindsocket
● roger@mindsocket.com.au