the SEO’s guide to: SCRAPING EVERYTHING

  @eppievojt
  digital marketing consultant, JPL

NEXT LEVEL XPATH-ING

  Use Case 1:
  Does site x link to any page on eppie.net?

  Scrape partial matches using XPath’s “contains” function to find
  inexact data.

  What we know:
  1)  The link will contain http://www.eppie.net in the href attribute
  2)  Some people like to hurt the internet by capitalizing URLs, so
      we’ll need to account for that
  3)  People who link to you don’t care about your desire for
      canonicalization

DO YOU LINK TO ME?

  //a[contains(@href,'http://www.eppie.net')]

  PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY

Add translate() to normalize case

  //a[contains(
      translate(@href,
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                'abcdefghijklmnopqrstuvwxyz'),
      'http://www.eppie.net')]

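A minimal sketch of the same case-insensitive check outside an XPath engine, assuming Python and its standard-library html.parser as a stand-in for the XPath expression. The names LinkCollector and links_to are made up for illustration. Lowercasing both sides with str.lower() plays the role of translate(); note that it also folds non-ASCII letters, which a 26-letter translate() mapping does not.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to(html, target="http://www.eppie.net"):
    """Return every href that contains `target`, ignoring case."""
    parser = LinkCollector()
    parser.feed(html)
    return [h for h in parser.hrefs if target.lower() in h.lower()]


sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/About">hi</a> '
    '<a href="http://example.com/">x</a></p>'
)
print(links_to(sample))  # ['HTTP://WWW.EPPIE.NET/About']
```

An empty list means the page no longer links to you, which is the signal the monitoring ideas on the next slide depend on.
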
How you can use this:
Get notified when a link is removed
+ Make contact to potentially save dropping link (friendly
  reminder, buy expiring domain, recreate dead resource)

Integrate into link outreach process
+ Get notification when link goes live




NEXT LEVEL XPATH-ING

  Use Case 2:
  Find every external link from cnn.com

  Combine attribute selectors to more accurately target useful
  information.

  What we know:
  1)  External links all contain http://
  2)  Internal links can also use http://
  3)  So we need to exclude http:// links to the current domain

SCRAPE ALL EXTERNAL LINKS

  //a[contains(@href,'http://') and not(contains(@href,'cnn.com'))]

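A hedged sketch of the same filter in Python's standard library, not the XPath itself. It deliberately differs in two ways that are assumptions of this example: it compares hostnames instead of substrings (a substring test also throws away any external URL that merely mentions cnn.com in its path), and it accepts https:// links as well as http://. The names LinkCollector and external_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, site_domain="cnn.com"):
    """Return absolute http(s) links whose host is outside `site_domain`."""
    parser = LinkCollector()
    parser.feed(html)
    site_domain = site_domain.lower()
    external = []
    for href in parser.hrefs:
        parts = urlparse(href)
        host = parts.hostname or ""
        if parts.scheme not in ("http", "https") or not host:
            continue  # relative, mailto:, javascript: and similar links
        if host == site_domain or host.endswith("." + site_domain):
            continue  # same site, including subdomains
        external.append(href)
    return external


sample = (
    '<a href="http://www.cnn.com/world">world</a>'
    '<a href="/us">us</a>'
    '<a href="http://example.org/story">story</a>'
    '<a href="https://twitter.com/cnn">twitter</a>'
)
print(external_links(sample))
# ['http://example.org/story', 'https://twitter.com/cnn']
```

The length of the returned list is the external link count used in the ideas that follow.
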
How you can use this:
Identify if a page is too spammed out to bother with by
   pulling external link counts

Find expired or expiring domains being linked to from
   authority sites. Purchase and rebuild or redirect those
   sites.

Broken link building automation




LINK TYPE IDENTIFICATION

  Use Case 3:
  How are they ranking? What kind of links do they have?

  XPath’s ancestor axis lets us leverage semantic markup to ID link
  types.

  What we know:
  A link inside a containing element with an id or class name
  including the word “comment,” “footer,” or “blogroll” is highly
  suggestive of type

  "//a[@href='http://randfishkin.com/blog']/
    ancestor::*[contains(@id,'comment') or
    contains(@class,'comment')]"

  Was Rand comment-spamming his way to the top? This tells the
  story...

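A rough sketch of the ancestor check in plain Python, assuming the standard-library html.parser and a stack of open elements as a stand-in for XPath's ancestor axis. The class name LinkTypeParser, the helper classify_links, and the HINTS tuple are illustrative, not part of the original deck's tooling.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
HINTS = ("comment", "footer", "blogroll")


class LinkTypeParser(HTMLParser):
    """Records each link with the first hint found among its ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, lowercased "id class" text) per open element
        self.results = []  # (href, hint or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.results.append((attrs["href"], self._hint()))
        if tag not in VOID_TAGS:
            label = f"{attrs.get('id') or ''} {attrs.get('class') or ''}"
            self.stack.append((tag, label.lower()))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def _hint(self):
        # Walk from the nearest ancestor outwards.
        for _tag, label in reversed(self.stack):
            for hint in HINTS:
                if hint in label:
                    return hint
        return None


def classify_links(html):
    parser = LinkTypeParser()
    parser.feed(html)
    return parser.results


sample = """
<div id="content"><a href="http://randfishkin.com/blog">post</a></div>
<ol class="commentlist"><li><a href="http://randfishkin.com/blog">Rand</a></li></ol>
<div id="Footer"><a href="http://example.com/">x</a></div>
"""
for href, hint in classify_links(sample):
    print(href, hint)
```

Links that come back with a hint of None are the ones worth a closer manual look; the rest are likely placed links.
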
Why you might use this:
Analyze competitors’ strategies for acquiring links

Find what types of links are being used to get good anchor
   text

Improve workflow: Ignore placed links (comments, directory
  submissions, article submissions, blog networks, etc) and
  work on a smaller subset of EARNED links for manual
  analysis




REGEX TO THE RESCUE

  Use Case 4:
  I’ve scraped some data; now I need to extract a small portion of
  it that XPath can’t (easily) isolate on its own.

  Use regular expressions to pattern match structured text.

  Example:
  Extract all @mentions of a specific user from a tweet or page

EXTRACT @MENTIONS

  /(?:^|\s)@([a-z0-9_]+)/gi

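A small sketch of the mention extraction using Python's re module. The pattern here uses \s for whitespace and spells the character class out as [A-Za-z0-9_]; findall() stands in for the g flag. The function name extract_mentions and its optional user filter are assumptions added for illustration.

```python
import re

# An @name counts only at the start of the text or after whitespace,
# which keeps email addresses like me@example.com from matching.
MENTION_RE = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")


def extract_mentions(text, user=None):
    """Return mentioned usernames; if `user` is given, keep only that one."""
    names = MENTION_RE.findall(text)
    if user is not None:
        names = [n for n in names if n.lower() == user.lower()]
    return names


tweet = "@eppievojt great deck! cc @jplcreative, mail me at me@example.com"
print(extract_mentions(tweet))               # ['eppievojt', 'jplcreative']
print(extract_mentions(tweet, "EppieVojt"))  # ['eppievojt']
```
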
Why you might use this:
Pull contact information from a web site (Twitter username,
  email address) to improve outreach efforts

Extract code fragments (like Analytics IDs and AdSense IDs)
  for improved competitive research




BEYOND THE SPREADSHEET

  Use Case 5:
  I want to chain processes together, process lots of data, or allow
  multiple users to leverage what I build.

  Scraping outside the spreadsheet allows for more complex systems
  to be built.

  PHP Scraping Overview:
  1)  cURL the target page
  2)  Convert to DOM object
  3)  Run XPath queries
  4)  Store data or hit an API

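The four steps can be sketched end to end. This is an illustration in Python rather than PHP, and it is not the scraper class referenced in this deck: urllib stands in for cURL, html.parser stands in for the DOM and XPath steps, and CSV text stands in for storage. All names (LinkCollector, fetch, scrape_links) are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Step 1: download the target page."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_links(urls, fetch_page=fetch):
    """Steps 2-4: parse each page, pull its links, store rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["page", "href"])
    for url in urls:
        parser = LinkCollector()          # step 2: parse the markup
        parser.feed(fetch_page(url))
        for href in parser.hrefs:         # step 3: the query (all links)
            writer.writerow([url, href])  # step 4: store the result
    return out.getvalue()
```

Passing fetch_page as a parameter keeps the parsing and storage steps testable without touching the network, for example scrape_links(["http://a.test/"], fetch_page=lambda url: saved_html) with a saved copy of the page.
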

  Simple PHP Scraper Class:
  http://www.scrapeeverything.com

SHOW SOME LOVE

  I’m @eppievojt and I work for @jplcreative

  eppie.net
  linkdetective.com
  jplcreative.com
