SlideShare una empresa de Scribd logo
1 de 81
Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa   [email_address] Six Apart, Ltd. / Shibuya Perl Mongers Shibuya.pm Tech Talks #8
[object Object],[object Object]
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus,  screen scrapers  were reborn in the web era to  extract machine-friendly data from HTML  and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
[object Object],[object Object]
 
 
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
 
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br />
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
[object Object]
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
<span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Shibuya
<span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl  –MHTML::Entities  –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print  decode_entities ($1)' I  ♥  Shibuya
<span class=&quot;message&quot;>Perl が大好き! </span> > perl –MHTML::Entities  –MEncode  –e  '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. Perl が大好き!
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
[object Object],[object Object],[object Object]
CSS Selectors ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath =  selector_to_xpath  &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object],[object Object],[object Object]
Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
Example (after) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Basics ,[object Object],[object Object],[object Object],[object Object],[object Object]
process ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot; http://vienna.openguides.org/ &quot;>OpenGuides</a></li> <li><a href=&quot; http://vienna.yapceurope.org/ &quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;> OpenGuides </a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;> YAPC::Europe </a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
result ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],my $s = scraper { process …; process …; result 'foo', 'bar'; };
[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],<span class=&quot;entry-date&quot;>October 1st, 2007 17:13:31 +0900</span> process &quot;.entry-date&quot;, date => 'TEXT :rfc822 ';
[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object],[object Object]

Más contenido relacionado

La actualidad más candente

Ultra fast web development with sinatra
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatraSérgio Santos
 
Javascript - The Stack and Beyond
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and BeyondAll Things Open
 
Stop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case StudyStop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case StudyAll Things Open
 
Browser Extensions for Web Hackers
Browser Extensions for Web HackersBrowser Extensions for Web Hackers
Browser Extensions for Web HackersMark Wubben
 
今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?Sho Yoshida
 
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App EngineRotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App Enginegeehwan
 
Eve - REST API for Humans™
Eve - REST API for Humans™Eve - REST API for Humans™
Eve - REST API for Humans™Nicola Iarocci
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0IBM Cloud Data Services
 
HPPG - high performance photo gallery
HPPG - high performance photo galleryHPPG - high performance photo gallery
HPPG - high performance photo galleryRemigijus Kiminas
 
Android webservices
Android webservicesAndroid webservices
Android webservicesKrazy Koder
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES SystemsChris Birchall
 
Non-Relational Databases
Non-Relational DatabasesNon-Relational Databases
Non-Relational Databaseskchodorow
 
yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909Yusuke Wada
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud
 
Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Odoo
 

La actualidad más candente (20)

Ultra fast web development with sinatra
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatra
 
CouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: MangoCouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: Mango
 
Javascript - The Stack and Beyond
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and Beyond
 
Stop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case StudyStop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case Study
 
Browser Extensions for Web Hackers
Browser Extensions for Web HackersBrowser Extensions for Web Hackers
Browser Extensions for Web Hackers
 
今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?
 
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App EngineRotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
 
Eve - REST API for Humans™
Eve - REST API for Humans™Eve - REST API for Humans™
Eve - REST API for Humans™
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
 
CouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce ViewsCouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce Views
 
HPPG - high performance photo gallery
HPPG - high performance photo galleryHPPG - high performance photo gallery
HPPG - high performance photo gallery
 
Android webservices
Android webservicesAndroid webservices
Android webservices
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES Systems
 
Nodejs meetup-12-2-2015
Nodejs meetup-12-2-2015Nodejs meetup-12-2-2015
Nodejs meetup-12-2-2015
 
Non-Relational Databases
Non-Relational DatabasesNon-Relational Databases
Non-Relational Databases
 
Analyse Yourself
Analyse YourselfAnalyse Yourself
Analyse Yourself
 
CouchDB Day NYC 2017: JSON Documents
CouchDB Day NYC 2017: JSON DocumentsCouchDB Day NYC 2017: JSON Documents
CouchDB Day NYC 2017: JSON Documents
 
yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
 
Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...
 

Similar a Web Scraper Shibuya.pm tech talk #8

Similar a Web Scraper Shibuya.pm tech talk #8 (20)

Web::Scraper
Web::ScraperWeb::Scraper
Web::Scraper
 
Introduction To Lamp
Introduction To LampIntroduction To Lamp
Introduction To Lamp
 
A brief history of the web
A brief history of the webA brief history of the web
A brief history of the web
 
PHP Presentation
PHP PresentationPHP Presentation
PHP Presentation
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
 
Introducing Modern Perl
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern Perl
 
Php 3 1
Php 3 1Php 3 1
Php 3 1
 
Html5 Overview
Html5 OverviewHtml5 Overview
Html5 Overview
 
Forum Presentation
Forum PresentationForum Presentation
Forum Presentation
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perl
 
Ubi comp27nov04
Ubi comp27nov04Ubi comp27nov04
Ubi comp27nov04
 
introduction to web technology
introduction to web technologyintroduction to web technology
introduction to web technology
 
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
 
PHP Presentation
PHP PresentationPHP Presentation
PHP Presentation
 
RESTful design
RESTful designRESTful design
RESTful design
 
WordPress APIs
WordPress APIsWordPress APIs
WordPress APIs
 
Mojolicious on Steroids
Mojolicious on SteroidsMojolicious on Steroids
Mojolicious on Steroids
 
Lecture1 B Frames&Forms
Lecture1 B  Frames&FormsLecture1 B  Frames&Forms
Lecture1 B Frames&Forms
 
SlideShare Instant
SlideShare InstantSlideShare Instant
SlideShare Instant
 
SlideShare Instant
SlideShare InstantSlideShare Instant
SlideShare Instant
 

Más de Tatsuhiko Miyagawa

Carton CPAN dependency manager
Carton CPAN dependency managerCarton CPAN dependency manager
Carton CPAN dependency managerTatsuhiko Miyagawa
 
Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Tatsuhiko Miyagawa
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversTatsuhiko Miyagawa
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Asynchronous programming with AnyEvent
Asynchronous programming with AnyEventAsynchronous programming with AnyEvent
Asynchronous programming with AnyEventTatsuhiko Miyagawa
 
Building a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryBuilding a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Tatsuhiko Miyagawa
 
20 modules i haven't yet talked about
20 modules i haven't yet talked about20 modules i haven't yet talked about
20 modules i haven't yet talked aboutTatsuhiko Miyagawa
 

Más de Tatsuhiko Miyagawa (20)

Carton CPAN dependency manager
Carton CPAN dependency managerCarton CPAN dependency manager
Carton CPAN dependency manager
 
Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011
 
Plack at OSCON 2010
Plack at OSCON 2010Plack at OSCON 2010
Plack at OSCON 2010
 
cpanminus at YAPC::NA 2010
cpanminus at YAPC::NA 2010cpanminus at YAPC::NA 2010
cpanminus at YAPC::NA 2010
 
Plack at YAPC::NA 2010
Plack at YAPC::NA 2010Plack at YAPC::NA 2010
Plack at YAPC::NA 2010
 
PSGI/Plack OSDC.TW
PSGI/Plack OSDC.TWPSGI/Plack OSDC.TW
PSGI/Plack OSDC.TW
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and servers
 
Plack - LPW 2009
Plack - LPW 2009Plack - LPW 2009
Plack - LPW 2009
 
Tatsumaki
TatsumakiTatsumaki
Tatsumaki
 
Intro to PSGI and Plack
Intro to PSGI and PlackIntro to PSGI and Plack
Intro to PSGI and Plack
 
CPAN Realtime feed
CPAN Realtime feedCPAN Realtime feed
CPAN Realtime feed
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Asynchronous programming with AnyEvent
Asynchronous programming with AnyEventAsynchronous programming with AnyEvent
Asynchronous programming with AnyEvent
 
Building a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryBuilding a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Remedie OSDC.TW
Remedie OSDC.TWRemedie OSDC.TW
Remedie OSDC.TW
 
Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008
 
20 modules i haven't yet talked about
20 modules i haven't yet talked about20 modules i haven't yet talked about
20 modules i haven't yet talked about
 
XML::Liberal
XML::LiberalXML::Liberal
XML::Liberal
 
Test::Base
Test::BaseTest::Base
Test::Base
 
Hacking Vox and Plagger
Hacking Vox and PlaggerHacking Vox and Plagger
Hacking Vox and Plagger
 

Último

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Último (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Web Scraper Shibuya.pm tech talk #8

  • 1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers Shibuya.pm Tech Talks #8
  • 2.
  • 3. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 4. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 5.
  • 6.  
  • 7.  
  • 8.
  • 9.
  • 10.
  • 11.  
  • 12. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 13. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 14.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. <span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Shibuya
  • 24. <span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –MHTML::Entities –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities ($1)' I ♥ Shibuya
  • 25. <span class=&quot;message&quot;>Perl が大好き! </span> > perl –MHTML::Entities –MEncode –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. Perl が大好き!
  • 26.
  • 27.
  • 28.
  • 29.
  • 30. XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 31.
  • 32.
  • 33.
  • 34. CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 35. Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 36.
  • 37. Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 38. Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 39.
  • 40.
  • 41.
  • 42. Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49. <ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.