SlideShare una empresa de Scribd logo
1 de 13
Descargar para leer sin conexión
Scraping recalcitrant web sites
              with Python & Selenium
                     Roger Barnes




SyPy July 2012
Some sites suck
Some sites suck - "for your own good"




For security reasons, each button is
an image, dynamically generated by
a hash wrapped in a mess of
javascript, randomly placed
...but they work in a web browser!




  Let's use the web browser to scrape them
Enter Selenium



      Selenium automates browsers

                 That's it
Selenium can...
●   navigate (windows, frames, links)
●   find elements and parse attributes
●   interact and trigger events (click, type, ...)
●   capture screenshots
●   run javascript
●   let the browser take care of the hard stuff
    (cookies, javascript, sessions, profiles,
    DOM)

Comes with various components and bindings
                         ... including python
General Recipe
Ingredients:
● firefox (or chrome)
● firebug (or chrome dev tools)
● Selenium IDE
    ○ record a session, write less code
●   python and its batteries
●   python-selenium
●   xvfb and pyvirtualdisplay (optional)
●   other libraries to taste
    ○ eg image manipulation, database access, DOM
      parsing, OCR
General Recipe
Method:
● Install requirements (apt-get, pip etc)
   ○ sudo apt-get install xvfb firefox
   ○ pip install selenium pyvirtualdisplay
● Start up Firefox and Selenium IDE
● Record a "test" run through site
   ○ Add in some assertions along the way
● Export test as Python script
● Hack from there
   ○ Loops
   ○ Image/data extraction
   ○ Wrangling data into a database
Example from Selenium IDE
class Ingdirect2(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait( 30)
        self.base_url = "https://www.ingdirect.com.au"
        self.verificationErrors = []

   def test_ingdirect2(self):
       driver = self.driver
                                                           But what about
       driver.get( self.base_url + "/client/index.aspx")
                                                           that dang
       driver.switch_to_frame( 'body') # Had to add this keypad? ...
       driver.find_element_by_id( "txtCIF").clear()
       driver.find_element_by_id( "txtCIF").send_keys( "12345678")
       driver.find_element_by_id( "objKeypad_B1").click()
       driver.find_element_by_id( "objKeypad_B2").click()
       driver.find_element_by_id( "objKeypad_B3").click()
       driver.find_element_by_id( "objKeypad_B4").click()
       driver.find_element_by_id( "btnLogin").click()
       self.assertTrue( self.is_element_present(By.ID, "ctl2_lblBalance"))
PIL saves the day
# Get screenshot for extraction of button images
screenshot = driver.get_screenshot_as_base64()
im = Image.open(StringIO.StringIO(base64.decodestring(screenshot)))

table = driver.find_element_by_xpath( '//*[@id="objKeypad_divShowAll"]/table')
all_buttons = table.find_elements_by_tag_name( "input")

# Determine md5sum of each button by cropping based on element positions
for button in all_buttons:
    button_image = im.crop(getcropbox(button))
    hexid = hashlib.md5(button_image.tostring()).hexdigest()
    button_mapping[hexid] = button.get_attribute( "id")


# Now we know which button is which ( based on previous lookup), enter the PIN
for char in self.pin:
    driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()

driver.find_element_by_id( "btnLogin").click()

# We're in!!!11one
But why do all this?
It's my data!                                  ... and I'll graph if i want to




       * Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
That's all folks
Slides
● http://bit.ly/scrapium

Code
● https://gist.github.com/3015852

Me
● https://twitter.com/mindsocket
● https://github.com/mindsocket
● roger@mindsocket.com.au

Más contenido relacionado

La actualidad más candente

Selenium Automation Using Ruby
Selenium Automation Using RubySelenium Automation Using Ruby
Selenium Automation Using Ruby
Kumari Warsha Goel
 
Jenkins and Groovy
Jenkins and GroovyJenkins and Groovy
Jenkins and Groovy
Kiyotaka Oku
 

La actualidad más candente (20)

Web driver training
Web driver trainingWeb driver training
Web driver training
 
Selenium Automation Using Ruby
Selenium Automation Using RubySelenium Automation Using Ruby
Selenium Automation Using Ruby
 
Selenium webdriver
Selenium webdriverSelenium webdriver
Selenium webdriver
 
Automated Testing With Watir
Automated Testing With WatirAutomated Testing With Watir
Automated Testing With Watir
 
Detecting headless browsers
Detecting headless browsersDetecting headless browsers
Detecting headless browsers
 
Webdriver.io
Webdriver.io Webdriver.io
Webdriver.io
 
Session on Selenium 4 : What’s coming our way? by Hitesh Prajapati
Session on Selenium 4 : What’s coming our way? by Hitesh PrajapatiSession on Selenium 4 : What’s coming our way? by Hitesh Prajapati
Session on Selenium 4 : What’s coming our way? by Hitesh Prajapati
 
Introduction to Protractor
Introduction to ProtractorIntroduction to Protractor
Introduction to Protractor
 
Jenkins and Groovy
Jenkins and GroovyJenkins and Groovy
Jenkins and Groovy
 
Zombiejs
ZombiejsZombiejs
Zombiejs
 
Introduction To Ruby Watir (Web Application Testing In Ruby)
Introduction To Ruby Watir (Web Application Testing In Ruby)Introduction To Ruby Watir (Web Application Testing In Ruby)
Introduction To Ruby Watir (Web Application Testing In Ruby)
 
High Performance JavaScript 2011
High Performance JavaScript 2011High Performance JavaScript 2011
High Performance JavaScript 2011
 
Code ceptioninstallation
Code ceptioninstallationCode ceptioninstallation
Code ceptioninstallation
 
An introduction to PhantomJS: A headless browser for automation test.
An introduction to PhantomJS: A headless browser for automation test.An introduction to PhantomJS: A headless browser for automation test.
An introduction to PhantomJS: A headless browser for automation test.
 
Protractor framework – how to make stable e2e tests for Angular applications
Protractor framework – how to make stable e2e tests for Angular applicationsProtractor framework – how to make stable e2e tests for Angular applications
Protractor framework – how to make stable e2e tests for Angular applications
 
Session on Launching Selenium Grid and Running tests using docker compose and...
Session on Launching Selenium Grid and Running tests using docker compose and...Session on Launching Selenium Grid and Running tests using docker compose and...
Session on Launching Selenium Grid and Running tests using docker compose and...
 
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
 
Integrační testy - Selenium
Integrační testy - SeleniumIntegrační testy - Selenium
Integrační testy - Selenium
 
Protractor Tutorial Quality in Agile 2015
Protractor Tutorial Quality in Agile 2015Protractor Tutorial Quality in Agile 2015
Protractor Tutorial Quality in Agile 2015
 
淺談 Groovy 與 AWS 雲端應用開發整合
淺談 Groovy 與 AWS 雲端應用開發整合淺談 Groovy 與 AWS 雲端應用開發整合
淺談 Groovy 與 AWS 雲端應用開發整合
 

Similar a Scraping recalcitrant web sites with Python & Selenium

Top100summit 谷歌-scott-improve your automated web application testing
Top100summit  谷歌-scott-improve your automated web application testingTop100summit  谷歌-scott-improve your automated web application testing
Top100summit 谷歌-scott-improve your automated web application testing
drewz lin
 
Automating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on CloudAutomating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on Cloud
Jonghyun Park
 
Server Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yetServer Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yet
Tom Croucher
 
JSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating FrameworksJSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
Mario Heiderich
 

Similar a Scraping recalcitrant web sites with Python & Selenium (20)

Web UI test automation instruments
Web UI test automation instrumentsWeb UI test automation instruments
Web UI test automation instruments
 
Testing web application with Python
Testing web application with PythonTesting web application with Python
Testing web application with Python
 
iOS Automation Primitives
iOS Automation PrimitivesiOS Automation Primitives
iOS Automation Primitives
 
Owasp orlando, april 13, 2016
Owasp orlando, april 13, 2016Owasp orlando, april 13, 2016
Owasp orlando, april 13, 2016
 
How to execute Automation Testing using Selenium
How to execute Automation Testing using SeleniumHow to execute Automation Testing using Selenium
How to execute Automation Testing using Selenium
 
Top100summit 谷歌-scott-improve your automated web application testing
Top100summit  谷歌-scott-improve your automated web application testingTop100summit  谷歌-scott-improve your automated web application testing
Top100summit 谷歌-scott-improve your automated web application testing
 
Automating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on CloudAutomating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on Cloud
 
Nightwatch 101 - Salvador Molina
Nightwatch 101 - Salvador MolinaNightwatch 101 - Salvador Molina
Nightwatch 101 - Salvador Molina
 
Javascript Everywhere
Javascript EverywhereJavascript Everywhere
Javascript Everywhere
 
UI Testing Best Practices - An Expected Journey
UI Testing Best Practices - An Expected JourneyUI Testing Best Practices - An Expected Journey
UI Testing Best Practices - An Expected Journey
 
Server Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yetServer Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yet
 
End-to-end testing with geb
End-to-end testing with gebEnd-to-end testing with geb
End-to-end testing with geb
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
 
Writing automation tests with python selenium behave pageobjects
Writing automation tests with python selenium behave pageobjectsWriting automation tests with python selenium behave pageobjects
Writing automation tests with python selenium behave pageobjects
 
OpenCms Days 2014 - User Generated Content in OpenCms 9.5
OpenCms Days 2014 - User Generated Content in OpenCms 9.5OpenCms Days 2014 - User Generated Content in OpenCms 9.5
OpenCms Days 2014 - User Generated Content in OpenCms 9.5
 
Testing ASP.NET - Progressive.NET
Testing ASP.NET - Progressive.NETTesting ASP.NET - Progressive.NET
Testing ASP.NET - Progressive.NET
 
JSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating FrameworksJSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
 
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFGStHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
 
Automation - web testing with selenium
Automation - web testing with seleniumAutomation - web testing with selenium
Automation - web testing with selenium
 
Catch a spider monkey
Catch a spider monkeyCatch a spider monkey
Catch a spider monkey
 

Más de Roger Barnes

Más de Roger Barnes (6)

The life of a web request - techniques for measuring and improving Django app...
The life of a web request - techniques for measuring and improving Django app...The life of a web request - techniques for measuring and improving Django app...
The life of a web request - techniques for measuring and improving Django app...
 
Building data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemyBuilding data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemy
 
Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013
 
Poker, packets, pipes and Python
Poker, packets, pipes and PythonPoker, packets, pipes and Python
Poker, packets, pipes and Python
 
Towards Continuous Deployment with Django
Towards Continuous Deployment with DjangoTowards Continuous Deployment with Django
Towards Continuous Deployment with Django
 
Intro to Pinax: Kickstarting Your Django Apps
Intro to Pinax: Kickstarting Your Django AppsIntro to Pinax: Kickstarting Your Django Apps
Intro to Pinax: Kickstarting Your Django Apps
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Scraping recalcitrant web sites with Python & Selenium

  • 1. Scraping recalcitrant web sites with Python & Selenium Roger Barnes SyPy July 2012
  • 3. Some sites suck - "for your own good" For security reasons, each button is an image, dynamically generated by a hash wrapped in a mess of javascript, randomly placed
  • 4. ...but they work in a web browser! Let's use the web browser to scrape them
  • 5. Enter Selenium Selenium automates browsers That's it
  • 6. Selenium can... ● navigate (windows, frames, links) ● find elements and parse attributes ● interact and trigger events (click, type, ...) ● capture screenshots ● run javascript ● let the browser take care of the hard stuff (cookies, javascript, sessions, profiles, DOM) Comes with various components and bindings ... including python
  • 7. General Recipe Ingredients: ● firefox (or chrome) ● firebug (or chrome dev tools) ● Selenium IDE ○ record a session, write less code ● python and its batteries ● python-selenium ● xvfb and pyvirtualdisplay (optional) ● other libraries to taste ○ eg image manipulation, database access, DOM parsing, OCR
  • 8. General Recipe Method: ● Install requirements (apt-get, pip etc) ○ sudo apt-get install xvfb firefox ○ pip install selenium pyvirtualdisplay ● Start up Firefox and Selenium IDE ● Record a "test" run through site ○ Add in some assertions along the way ● Export test as Python script ● Hack from there ○ Loops ○ Image/data extraction ○ Wrangling data into a database
  • 9.
  • 10. Example from Selenium IDE class Ingdirect2(unittest.TestCase): def setUp(self): self.driver = webdriver.Firefox() self.driver.implicitly_wait( 30) self.base_url = "https://www.ingdirect.com.au" self.verificationErrors = [] def test_ingdirect2(self): driver = self.driver But what about driver.get( self.base_url + "/client/index.aspx") that dang driver.switch_to_frame( 'body') # Had to add this keypad? ... driver.find_element_by_id( "txtCIF").clear() driver.find_element_by_id( "txtCIF").send_keys( "12345678") driver.find_element_by_id( "objKeypad_B1").click() driver.find_element_by_id( "objKeypad_B2").click() driver.find_element_by_id( "objKeypad_B3").click() driver.find_element_by_id( "objKeypad_B4").click() driver.find_element_by_id( "btnLogin").click() self.assertTrue( self.is_element_present(By.ID, "ctl2_lblBalance"))
  • 11. PIL saves the day # Get screenshot for extraction of button images screenshot = driver.get_screenshot_as_base64() im = Image.open(StringIO.StringIO(base64.decodestring(screenshot))) table = driver.find_element_by_xpath( '//*[@id="objKeypad_divShowAll"]/table') all_buttons = table.find_elements_by_tag_name( "input") # Determine md5sum of each button by cropping based on element positions for button in all_buttons: button_image = im.crop(getcropbox(button)) hexid = hashlib.md5(button_image.tostring()).hexdigest() button_mapping[hexid] = button.get_attribute( "id") # Now we know which button is which ( based on previous lookup), enter the PIN for char in self.pin: driver.find_element_by_id(button_mapping[hex_mapping[char]]).click() driver.find_element_by_id( "btnLogin").click() # We're in!!!11one
  • 12. But why do all this? It's my data! ... and I'll graph if i want to * Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
  • 13. That's all folks Slides ● http://bit.ly/scrapium Code ● https://gist.github.com/3015852 Me ● https://twitter.com/mindsocket ● https://github.com/mindsocket ● roger@mindsocket.com.au