Screen scraping

Cloud Computing: Application Development and Web Mining
University extension program, Universidad de Oviedo
Miguel Fernández Fernández
miguel@ThirdWay.es

Why screen scraping?
The Web is fundamentally for humans (HTML)

<table width="100%" cellspacing="1" cellpadding="1" border="0" align="center">
  <tbody>
    <tr>
      <td valign="middle" align="center" colspan="5">
      </td>
    </tr>
    <tr>
      <td align="center" class="cabe"> Hora Salida </td>
      <td align="center" class="cabe"> Hora Llegada </td>
      <td align="center" class="cabe"> Línea </td>
      <td align="center" class="cabe"> Tiempo de Viaje </td>
      <td align="center" class="cabe"> </td>
    </tr>
    <tr>
      ...
      <td align="center" class="color1">06.39</td>
      <td class="color3">C1 </td>
      <td align="center" class="rojo3"> </td>
    </tr>
  </tbody>
</table>

But it is not designed to be processed by machines (XML, JSON, CSV...)

<horario>
  <viaje>
    <salida format="hh:mm">06:39</salida>
    <llegada format="hh:mm">07:15</llegada>
    <duracion format="minutes">36</duracion>
    <linea>C1</linea>
  </viaje>
</horario>
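
The gap between the two representations can be sketched in a few lines of Ruby. This is only an illustration: the HTML row below is a hypothetical complete version of the elided timetable row (the arrival time and duration are borrowed from the XML example, and the cell order is assumed to follow the header row), and `parse_row` is an invented helper, not part of any library.

```ruby
require 'json'

# Hypothetical complete timetable row; cell order assumed from the header row.
ROW_HTML = <<~HTML
  <tr>
  <td align="center" class="color1">06.39</td>
  <td align="center" class="color1">07.15</td>
  <td class="color3">C1 </td>
  <td align="center" class="color1">36</td>
  </tr>
HTML

# Pull the text of every <td> cell and map it onto named fields.
def parse_row(html)
  cells = html.scan(%r{<td[^>]*>(.*?)</td>}m).flatten.map(&:strip)
  salida, llegada, linea, duracion = cells
  {
    'salida'   => salida.tr('.', ':'),   # 06.39 becomes 06:39
    'llegada'  => llegada.tr('.', ':'),
    'linea'    => linea,
    'duracion' => duracion.to_i          # minutes
  }
end

viaje = parse_row(ROW_HTML)
puts JSON.generate(viaje)
```

A regular expression is enough for a single, regular row like this one; for whole pages a real HTML parser is the safer choice.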

An API is not always available

We need to simulate human behavior

Parse HTML
Perform interactions (navigate)
Be a ninja
Avoid DoS
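
One way to avoid hammering a site is to leave a minimum delay between consecutive requests. The sketch below is a possible approach, not something from the original slides: the `Throttle` class, its parameters and the 0.2 second interval are all illustrative assumptions.

```ruby
# Enforces a minimum delay between consecutive requests to the same site.
class Throttle
  # min_interval: seconds to leave between requests (illustrative value).
  # clock and sleeper are injectable so the class can be tested without waiting.
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_request_at = nil
  end

  # Waits if the previous request was too recent, then runs the block.
  def call
    if @last_request_at
      remaining = @min_interval - (@clock.call - @last_request_at)
      @sleeper.call(remaining) if remaining > 0
    end
    @last_request_at = @clock.call
    yield
  end
end

throttle = Throttle.new(0.2)
%w[/page/1 /page/2 /page/3].each do |path|
  # replace the block body with a real request
  throttle.call { puts "fetching #{path}" }
end
```

A fixed delay is the simplest policy; a real scraper would probably also want to back off when the server starts answering with errors.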

Tool selection
Which language are we going to work with?

URL fetching
  Java:   java.net.URL
  .NET:   System.Net.HttpWebRequest
  Ruby:   net/http, open-uri, rest-open-uri*
  Python: urllib, urllib2
DOM parsing / traversing
  Java:   javax.swing.text.html, TagSoup*, NekoHTML*
  .NET:   HTMLAgilityPack*
  Ruby:   HTree* / REXML, Hpricot*, RubyfulSoup*
  Python: BeautifulSoup*
Regexp
  Java:   java.util.regex
  .NET:   System.Text.RegularExpressions
  Ruby:   Regexp
  Python: re
* Third-party libraries. They are not part of the language's standard API.

Dynamic languages make coding easier
Duck typing + reflection = syntactic sugar
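
A small sketch of what this equation can mean in practice. The class and method names here (`CannedPage`, `FormFields`, `title_of`) are invented for illustration and do not come from any of the libraries discussed.

```ruby
require 'stringio'

# Duck typing: anything that responds to #read can act as a page source.
class CannedPage
  def initialize(html)
    @html = html
  end

  def read
    @html
  end
end

def title_of(source)
  source.read[%r{<title>(.*?)</title>}m, 1]
end

# Reflection: method_missing turns unknown method names into field lookups,
# so form.q = "ruby" works without declaring a q accessor.
class FormFields
  def initialize
    @fields = {}
  end

  def method_missing(name, *args)
    key = name.to_s
    if key.end_with?('=')
      @fields[key.chomp('=')] = args.first
    elsif @fields.key?(key)
      @fields[key]
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    key = name.to_s
    key.end_with?('=') || @fields.key?(key) || super
  end
end

html = '<html><head><title>Horarios</title></head></html>'
puts title_of(CannedPage.new(html)) # a custom object
puts title_of(StringIO.new(html))   # a standard library object, same call

form = FormFields.new
form.q = 'ruby'
puts form.q
```

The same two mechanisms are what let scraping libraries offer short, readable calls instead of explicit lookups and casts.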

Java
import javax.swing.text.html.*;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import java.net.URL;
import java.io.InputStreamReader;
import java.io.Reader;

public class HTMLParser
{
    public static void main( String[] argv ) throws Exception
    {
        URL url = new URL( "http://java.sun.com" );
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // ignore the charset declared in the page so the parser does not abort
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        Reader HTMLReader = new InputStreamReader(url.openConnection().getInputStream());
        kit.read(HTMLReader, doc, 0);
        // walk every element and print the src attribute of each <img>
        ElementIterator it = new ElementIterator(doc);
        Element elem;
        while( (elem = it.next()) != null )
        {
            if( elem.getName().equals( "img") )
            {
                String s = (String) elem.getAttributes().getAttribute(HTML.Attribute.SRC);
                if( s != null )
                    System.out.println( s );
            }
        }
        System.exit(0);
    }
}

Ruby
require 'rubygems'
require 'open-uri'
require 'htree'
require 'rexml/document'

open("http://java.sun.com", :proxy=>"http://localhost:8080") do |page|
  page_content = page.read()
  doc = HTree(page_content).to_rexml
  # print the src attribute of every <img>
  doc.root.each_element('//img') {|elem| puts elem.attribute('src').value }
end

Tool selection: Ruby
rest-open-uri
HTree + REXML
RubyfulSoup
WWW::Mechanize
Hpricot

rest-open-uri
Lets us make requests to URLs and extract their content
Extends open-uri to support more HTTP verbs

HTree + REXML
HTree builds a tree of objects from HTML code
HTree#to_rexml converts that tree into a REXML tree
REXML can be navigated with XPath 1.0

HTree + REXML
require 'rubygems'
require 'open-uri'
require 'htree'
require 'rexml/document'

open("http://www.google.es/search?q=ruby", :proxy=>"http://localhost:8080") do |page|
  page_content = page.read()
  doc = HTree(page_content).to_rexml
  # print the href of every link whose class is "l"
  doc.root.each_element("//a[@class='l']") {|elem| puts elem.attribute('href').value }
end
Runtime: 7.06s
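
The XPath part of this approach can be tried offline with REXML alone. The sketch below assumes a small, well-formed page invented for the purpose, so HTree is not needed to clean it up first; the URLs and class names in it are placeholders.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a results page.
PAGE = <<~XHTML
  <html>
    <body>
      <a class="l" href="http://www.ruby-lang.org/">Ruby</a>
      <a class="nav" href="/images">Images</a>
      <a class="l" href="http://rubyonrails.org/">Rails</a>
    </body>
  </html>
XHTML

doc = REXML::Document.new(PAGE)
# The attribute value is quoted so that XPath treats it as a string literal.
hrefs = REXML::XPath.match(doc, "//a[@class='l']").map { |a| a.attributes['href'] }
puts hrefs
```

Real pages are rarely well-formed XML, which is why the slides put HTree in front of REXML.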

Hpricot
http://hpricot.com/
Scanner implemented in C (very fast)
Builds a DOM with its own navigation system (CSS selectors and XPath*), like jQuery
Functionality equivalent to HTree + REXML

Hpricot
require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://www.google.com/search?q=ruby', :proxy=>'http://localhost:8080'))
# print the href of every link whose class is "l"
links = doc/"//a[@class=l]"
links.each {|link| puts link.attributes['href'] }
Runtime: 3.71s

RubyfulSoup
Offers the same functionality as HTree + REXML

Rubyful Soup
require 'rubygems'
require 'rubyful_soup'
require 'open-uri'

open("http://www.google.com/search?q=ruby", :proxy=>"http://localhost:8080") do |page|
  soup = BeautifulSoup.new(page.read)
  # print the href of every link whose class is "l"
  result = soup.find_all('a', :attrs => {'class' => 'l'})
  result.each { |tag| puts tag['href'] }
end
Runtime: 4.71s

RubyfulSoup
Lower performance than Hpricot
CSS selectors are not supported

HTree + REXML
CSS selectors are not supported

WWW::Mechanize
Lets us perform interactions
  Fill in and submit forms
  Follow links
Reaches documents in the Deep Web

WWW::Mechanize
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
agent.set_proxy("localhost", 8080)
page = agent.get('http://www.google.com')
# find the search form and fill in the query field
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = "ruby"
search_results = agent.submit(search_form)
search_results.links.each { |link| puts link.href if link.attributes["class"] == "l" }
Runtime: 5.23s

Tuenti
No public API
More than 8,000 new users every day
2 h average session
Data, data, data!

Step 1: Access our profile
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
# claim to be Firefox by changing the User-Agent header
agent.user_agent_alias = 'Mac FireFox'
login_page = agent.get('http://m.tuenti.com/?m=login')
# take the login form
login_form = login_page.forms.first
# and fill in the user and password fields
login_form.fields.select{|f| f.name=="email"}.first.value = "miguelfernandezfernandez@gmail.com"
login_form.fields.select{|f| f.name=="input_password"}.first.value = "xxxxx"
pagina_de_inicio = agent.submit(login_form) # do we get the home page?

Second attempt: mobile version
require 'rubygems'
require 'mechanize'
...
# and fill in the user and password fields
login_form.fields.select{|f| f.name=="tuentiemail"}.first.value = "miguelfernandezfernandez@gmail.com"
login_form.fields.select{|f| f.name=="password"}.first.value = "xxxxxx"
pagina_de_inicio = agent.submit(login_form)

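require 'rubygems'
require 'mechanize'

class TuentiAPI
  def initialize(login, password)
    @login = login
    @password = password
    # agent shared by every request (step 3 relies on @agent)
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Mac FireFox'
  end

  def inicio()
    # fetch the login form as in step 1 (login URL taken from that step)
    login_form = @agent.get('http://m.tuenti.com/?m=login').forms.first
    # and fill in the user and password fields
    login_form.fields.select{|f| f.name=="tuentiemail"}.first.value = @login
    login_form.fields.select{|f| f.name=="password"}.first.value = @password
    pagina_de_inicio = @agent.submit(login_form)
  end
end

pagina_de_inicio = TuentiAPI.new("miguelfernandezfernandez@gmail.com", "xxxxxx").inicio()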

Step 2: Get the photos
require 'hpricot'

class TuentiAPI
  ...
  def fotos_nuevas()
    tree = Hpricot(inicio().content)
    # images inside links whose alt text is "Foto"
    fotos = tree / "//a//img[@alt=Foto]"
    # return each image URL only once
    fotos.map{|foto| foto.attributes["src"] }.uniq
  end

  private

  def inicio()
    ...
  end
end

Step 3: Set the status
class TuentiAPI
  ...
  def actualizar_estado(msg)
    # take the first form on the home page and fill in its status field
    form_actualizacion = inicio.forms.first
    form_actualizacion.fields.select{|f| f.name=="status"}.first.value = msg
    @agent.submit(form_actualizacion)
  end
end

Tor: browsing anonymously
https://www.torproject.org/vidalia/
A network of chained proxies
N requests leave from M servers
Provides anonymity at the IP level
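
From Ruby, routing requests through Tor usually comes down to pointing the HTTP client at a local proxy. The sketch below assumes an HTTP proxy that forwards into Tor is listening on 127.0.0.1:8118 (Privoxy's default address); the constants and the `http_via_tor` helper are illustrative names, and the actual request is left commented out because it needs Tor running.

```ruby
require 'net/http'

# Assumed local HTTP proxy that forwards traffic into Tor
# (127.0.0.1:8118 is Privoxy's default listen address).
TOR_PROXY_HOST = '127.0.0.1'
TOR_PROXY_PORT = 8118

# Builds an HTTP client for the given host whose requests go through the proxy.
def http_via_tor(host, port = 80)
  Net::HTTP.new(host, port, TOR_PROXY_HOST, TOR_PROXY_PORT)
end

http = http_via_tor('example.org')
puts "proxy in use: #{http.proxy_address}:#{http.proxy_port}"
# response = http.get('/')   # uncomment with Tor and the proxy running
```

With Mechanize the equivalent setting is `agent.set_proxy`, as used in the WWW::Mechanize example earlier in these slides.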
