wget.pl
Yasuhiro Onishi
1.
ALL YOUR PAGE ARE BELONG TO US
Every web page in our hands
2012/11/16, Hatena Co., Ltd., Yasuhiro Onishi (id:onishi)
2.
id:onishi, Yasuhiro Onishi (大西康裕)
ONISHI @yasuhiro_onishi
Hatena Co., Ltd., Hatena Blog
3.
I want to save web pages
4.
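I want to save web pages
• Web pages change from day to day
• I want to keep a copy on hand
• Competitor research
• Web snapshots (gyotaku)
• I want to save images and other assets together with the page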
5.
Google Chrome
6.
7.
HTML::Parser
    my $result;
    my $parser = HTML::Parser->new(
        start_h   => [ sub {}, 'self,tagname,attr,text' ],
        default_h => [ sub {}, 'self,text' ],
    );
    $parser->parse($content);
    print $result;
• text
• start
• end
• process
• declaration
• comment
• default
8.
HTML::Parser
    start_h => [
        sub {
            my($self, $tagname, $attr, $text) = @_;
            $result .= "<$tagname";
            for my $key (sort keys %$attr) {
                my $value = $attr->{$key};
                if ($key =~ /^(?:src)$/i) {
                    # HTTP GET the resource, save it, and use the local path instead
                    $value = get_src($value);
                }
                $result .= qq{ $key="$value"};
            }
            $result .= ">";
        },
        'self,tagname,attr,text',
    ],
9.
HTML::Parser
    default_h => [
        sub {
            my($self, $text) = @_;
            $result .= $text;
        },
        'self,text',
    ],
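Slides 7 to 9 can be combined into one small runnable sketch. It assumes HTML::Parser is installed. The `get_src` below is a hypothetical stub that only maps a URL to a local-looking path and downloads nothing, and `rewrite_html` and the sample HTML are names invented for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical stand-in for the talk's get_src(): no HTTP request is made,
# the URL is only turned into a local-looking path.
sub get_src {
    my $src = shift;
    (my $file = $src) =~ s{[^A-Za-z0-9.]+}{-}g;
    return "file/$file";
}

# Re-serialize the HTML, rewriting every src attribute through get_src().
sub rewrite_html {
    my ($content) = @_;
    my $result = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [
            sub {
                my ($self, $tagname, $attr, $text) = @_;
                $result .= "<$tagname";
                for my $key (sort keys %$attr) {
                    next if $key eq '/';    # a trailing slash shows up as an attribute named "/"
                    my $value = $attr->{$key};
                    $value = get_src($value) if $key =~ /^src$/i;
                    $result .= qq{ $key="$value"};
                }
                $result .= '>';
            },
            'self,tagname,attr,text',
        ],
        # Everything without its own handler (text, end tags, comments, ...)
        # is copied through unchanged.
        default_h => [
            sub {
                my ($self, $text) = @_;
                $result .= $text if defined $text;
            },
            'self,text',
        ],
    );
    $parser->parse($content);
    $parser->eof;
    return $result;
}

print rewrite_html('<p>logo: <img src="http://example.com/a.png" alt="A"></p>'), "\n";
```

Because the start handler rebuilds each tag from the sorted attribute hash, attribute order and quoting in the output can differ from the input.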
10.
The End
11.
12.
References from CSS
    $content =~ s{url\(([^)]+)\)}{
        my $link = $1;
        # relative link (from HTML::ResolveLink)
        my $u = URI->new($link);
        unless (defined $u->scheme) {
            my $old = $u;
            $u = $u->abs($url);
        }
        $link = get_src($u); # HTTP GET the resource, save it, and use the local path
        "url($link)";
    }eg;
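The `url(...)` substitution above can be tried in isolation. This is a minimal sketch, assuming the URI module is installed. `get_src` is again a hypothetical stub that downloads nothing, `rewrite_css` and the example URLs are invented for illustration, and the quote stripping is borrowed from the full script on slide 17.

```perl
use strict;
use warnings;
use URI;

# Hypothetical stub: the talk's get_src() downloads the resource; this one
# only derives a flat local file name from the URL.
sub get_src {
    my $src = shift;
    (my $file = "$src") =~ s{[^A-Za-z0-9.]+}{-}g;
    return "file/$file";
}

# Rewrite every url(...) in a stylesheet; $css_url is the stylesheet's own
# URL, used as the base for relative references.
sub rewrite_css {
    my ($css, $css_url) = @_;
    $css =~ s{url\(([^)]+)\)}{
        my $link = $1;
        $link =~ s{^[\s"']+}{};    # strip leading whitespace and quotes
        $link =~ s{[\s"']+$}{};    # strip trailing whitespace and quotes
        my $u = URI->new($link);
        $u = $u->abs($css_url) unless defined $u->scheme;
        'url(' . get_src($u) . ')';
    }eg;
    return $css;
}

print rewrite_css(
    q{body { background: url("../img/bg.png") }},
    'http://example.com/css/site.css',
), "\n";
```

Relative references are resolved against the stylesheet URL rather than the page URL, which is why the base is passed in explicitly.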
13.
Kill script tags
    # excerpts of the handlers passed to HTML::Parser->new
    my $context = { disallow => 0 };
    my $disallow_tag = qr{script};
    start_h => [sub {
        if ($tagname =~ /^(?:$disallow_tag)$/i) {
            $context->{disallow}++;
            return;
        }
    }],
    end_h => [sub {
        if ($tagname =~ /^(?:$disallow_tag)$/i) {
            $context->{disallow}--;
            return;
        }
    }],
    default_h => [sub {
        if ($context->{disallow} > 0) {
            return;
        }
    }],
14.
Keep the contents of noscript
    # excerpts of the handlers passed to HTML::Parser->new
    my $nodisplay_tag = qr{noscript};
    start_h => [sub {
        if ($tagname =~ /^(?:$nodisplay_tag)$/i) {
            return;
        }
    }],
    end_h => [sub {
        if ($tagname =~ /^(?:$nodisplay_tag)$/i) {
            return;
        }
    }],
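The two excerpts on slides 13 and 14 can be assembled into a self-contained sketch, assuming HTML::Parser is installed. The function name `strip_scripts` and the sample input are invented for illustration. Unlike the talk's script, tags are copied through verbatim instead of being rebuilt from their attributes.

```perl
use strict;
use warnings;
use HTML::Parser;

# Drop <script> elements entirely and unwrap <noscript> so its contents stay.
sub strip_scripts {
    my ($content) = @_;
    my $result        = '';
    my $context       = { disallow => 0 };
    my $disallow_tag  = qr{script};
    my $nodisplay_tag = qr{noscript};

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [
            sub {
                my ($self, $tagname, $text) = @_;
                return if $tagname =~ /^(?:$nodisplay_tag)$/i;    # drop the wrapper tag only
                if ($tagname =~ /^(?:$disallow_tag)$/i) {
                    $context->{disallow}++;                       # entering a script element
                    return;
                }
                $result .= $text;
            },
            'self,tagname,text',
        ],
        end_h => [
            sub {
                my ($self, $tagname, $text) = @_;
                return if $tagname =~ /^(?:$nodisplay_tag)$/i;
                if ($tagname =~ /^(?:$disallow_tag)$/i) {
                    $context->{disallow}--;                       # leaving the script element
                    return;
                }
                $result .= $text;
            },
            'self,tagname,text',
        ],
        default_h => [
            sub {
                my ($self, $text) = @_;
                return if $context->{disallow} > 0;               # swallow script bodies
                $result .= $text if defined $text;
            },
            'self,text',
        ],
    );
    $parser->parse($content);
    $parser->eof;
    return $result;
}

print strip_scripts('<p>a</p><script>alert(1)</script><noscript><img src="x.png"></noscript>'), "\n";
```

A counter is used instead of a boolean so that the same mechanism would keep working if more tag names were added to `$disallow_tag`.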
15.
base
    # excerpt from the attribute loop in start_h
    start_h => [sub {
        if ($tagname =~ /^(?:base)$/i and $key =~ /^(?:href)$/i) {
            $value = "./";
        }
    }],
16.
It's done!
gist.github.com/4071196
17.
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use utf8;
    use DateTime;
    use Digest::SHA1 qw(sha1_hex);
    use Encode;
    use File::Path qw/make_path/;
    use HTML::Parser;
    use HTML::ResolveLink;
    use HTTP::Request::Common qw/GET/;
    use IO::All;
    use LWP::UserAgent;
    use URI;

    my $path = './';
    my $uri = URI->new(shift) or die;
    my $now = DateTime->now;
    my $ymd = $now->ymd;
    my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)');
    my $resolver = HTML::ResolveLink->new(base => $uri);
    my $res = $ua->request(GET $uri);
    my $content = $resolver->resolve($res->decoded_content);

    # Output directory: <sanitized URL>/<YYYY-MM-DD>/
    my $dir = $uri;
    $dir =~ s{[^A-Za-z0-9.]+}{-}g;
    $dir =~ s{-+$}{};
    $dir = "$path/$dir/$ymd/";
    $dir =~ s{/+}{/}g;
    make_path($dir);

    my $disallow_tag = qr{script};
    my $nodisplay_tag = qr{noscript};
    my $result;
    my $context = { disallow => 0 };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [
            sub {
                my($self, $tagname, $attr, $text) = @_;
                if ($tagname =~ /^(?:$nodisplay_tag)$/i) {
                    return;
                } elsif ($tagname =~ /^(?:$disallow_tag)$/i) {
                    $context->{disallow}++;
                    return;
                }
                $result .= "<$tagname";
                for my $key (sort keys %$attr) {
                    $key eq '/' and next;
                    my $value = $attr->{$key};
                    if ($key =~ /^(?:src)$/i) {
                        $value = get_src($value);
                    } elsif ($tagname =~ /^(?:link)$/i and $key =~ /^(?:href)$/i) {
                        $value = get_link($value);
                    } elsif ($tagname =~ /^(?:base)$/i and $key =~ /^(?:href)$/i) {
                        $value = $path;
                    }
                    $result .= qq{ $key="$value"};
                }
                $result .= ">";
            },
            'self,tagname,attr,text',
        ],
        end_h => [
            sub {
                my($self, $tagname, $text) = @_;
                if ($tagname =~ /^(?:$nodisplay_tag)$/i) {
                    return;
                } elsif ($tagname =~ /^(?:$disallow_tag)$/i) {
                    $context->{disallow}--;
                    return;
                }
                $result .= $text;
            },
            'self,tagname,text',
        ],
        default_h => [
            sub {
                my($self, $text) = @_;
                if ($context->{disallow} > 0) {
                    return;
                }
                $result .= $text;
            },
            'self,text',
        ],
    );
    $parser->parse($content);

    $result =~ s{(<head[^>]*>)}{$1<meta http-equiv="Content-Type" content="text/html; charset=utf-8">}i; # XXX
    $result = Encode::encode('utf-8', $result);
    $result > io("${dir}index.html");
    print "${dir}index.html\n";

    # Download $src into ${dir}file/ (once) and return its path relative to $dir.
    sub get_src {
        my $src = shift or return;
        unless (-e "${dir}file") {
            make_path("${dir}file");
        }
        my $file = $src;
        $file =~ s{[^A-Za-z0-9.]+}{-}g;
        if (length($file) > 255) {
            $file = sha1_hex($file);
        }
        $file = "file/$file";
        $file =~ s{/+}{/}g;
        unless (-e "$dir$file") {
            $ua->request(GET $src)->content >> io("$dir$file");
            sleep(1); # avoid tripping the server's DoS protection
        }
        $file;
    }

    # Download a stylesheet, then download and rewrite every url(...) inside it.
    sub get_link {
        my $url = shift or return;
        my $file = get_src($url);
        my $io = io("$dir$file");
        my $content = $io->slurp;
        $content =~ s{url\(([^)]+)\)}{
            my $link = $1;
            $link =~ s{^[\s"']+}{};
            $link =~ s{[\s"']+$}{};
            # relative link (from HTML::ResolveLink)
            my $u = URI->new($link);
            unless (defined $u->scheme) {
                my $old = $u;
                $u = $u->abs($url);
            }
            $link = get_src($u);
            $link =~ s{^file/}{};
            "url($link)";
        }eg;
        $content > $io;
        return $file;
    }
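The file-naming step inside `get_src` can be looked at on its own. The sketch below is an illustration, not the script's code: `local_name` is a hypothetical helper, and it uses the core `Digest::SHA` module where the script uses `Digest::SHA1`. The 255 cutoff presumably reflects the file-name length limit common to many filesystems.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);    # core module; the script itself uses Digest::SHA1

# Hypothetical helper mirroring the naming rule in get_src():
# collapse every run of unsafe characters to "-", and fall back to a
# SHA-1 hex digest when the name would exceed 255 characters.
sub local_name {
    my ($src) = @_;
    (my $file = $src) =~ s{[^A-Za-z0-9.]+}{-}g;
    $file = sha1_hex($file) if length($file) > 255;
    return "file/$file";
}

print local_name('http://example.com/img/logo.png?v=2'), "\n";
print local_name('http://example.com/' . ('a' x 300)), "\n";
```

Two different URLs can collapse to the same name under this rule (for example `a/b` and `a?b`), so the first one downloaded would win.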
18.
Google Chrome
19.
20.
wget.pl
21.
22.
Please feel free to use it!
gist.github.com/4071196
23.
Thank you for your attention