Screen scraping

Universidad de Oviedo, University Extension Programme

CLOUD COMPUTING:
APPLICATION DEVELOPMENT AND
WEB MINING

Miguel Fernández Fernández
miguelff@innova.uniovi.es

Why screen scraping
The Web is primarily built for
humans (HTML)

<table width="100%" cellspacing="1" cellpadding="1" border="0" align="center">
<tbody>
<tr>
<td valign="middle" align="center" colspan="5">
</td></tr><tr>
<td align="center" class="cabe"> Hora Salida </td>
<td align="center" class="cabe"> Hora Llegada </td>
<td align="center" class="cabe"> Línea </td>
<td align="center" class="cabe"> Tiempo de Viaje </td>
<td align="center" class="cabe"> </td>
</tr>

<tr>
...
<td align="center" class="color1">06.39</td>
<td class="color3">C1 </td>
<td align="center" class="rojo3"> </td>
</tr>
</tbody>

But it is not designed to be processed
by machines (XML, JSON, CSV...)

<horario>
<viaje>
<salida format="hh:mm">06:39</salida>
<llegada format="hh:mm">07:15</llegada>
<duracion format="minutes">36</duracion>
<linea>C1</linea>
</viaje>
</horario>
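To illustrate why a machine-oriented format is easier to consume, here is a minimal sketch that reads the XML above with REXML from Ruby's standard distribution. The constant `SCHEDULE_XML` and the helper `parse_trips` are hypothetical names introduced only for this example.

```ruby
require 'rexml/document'

# The <horario> document shown above, embedded so the example is self-contained.
SCHEDULE_XML = <<~DATA
  <horario>
    <viaje>
      <salida format="hh:mm">06:39</salida>
      <llegada format="hh:mm">07:15</llegada>
      <duracion format="minutes">36</duracion>
      <linea>C1</linea>
    </viaje>
  </horario>
DATA

# Hypothetical helper: turns every <viaje> element into a plain Hash.
def parse_trips(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//viaje').map do |trip|
    {
      departure: trip.elements['salida'].text,
      arrival:   trip.elements['llegada'].text,
      minutes:   trip.elements['duracion'].text.to_i,
      line:      trip.elements['linea'].text
    }
  end
end

parse_trips(SCHEDULE_XML).each do |t|
  puts "#{t[:departure]} -> #{t[:arrival]} (#{t[:minutes]} min, line #{t[:line]})"
end
```

Every value is addressed by element name, with no dependence on layout attributes such as `class="color1"`.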

An API is not always available


We need to simulate human behaviour



Interpret HTML
Perform interactions (navigate)
Be a ninja: avoid DoS
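One simple way to avoid overloading the target server is to leave a pause between consecutive requests. The sketch below is an assumption about how that could be done, not something taken from the slides: the `Throttle` class, its one-second default and the injectable `clock` and `sleeper` arguments are all hypothetical.

```ruby
# Hypothetical helper: enforces a minimum pause between consecutive requests.
class Throttle
  # min_interval: seconds to leave between requests (an assumed default).
  # clock and sleeper are injectable so the class can be exercised without waiting.
  def initialize(min_interval = 1.0,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last = nil
  end

  # Runs the block, first waiting if the previous call started too recently.
  def call
    if @last
      remaining = @min_interval - (@clock.call - @last)
      @sleeper.call(remaining) if remaining > 0
    end
    @last = @clock.call
    yield
  end
end

throttle = Throttle.new(0.05)
%w[/page1 /page2 /page3].each do |path|
  # A real scraper would issue the HTTP request inside this block.
  throttle.call { puts "fetching #{path}" }
end
```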

Choosing the tools
Which language are we going to work with?

                 Java                    .NET                             Ruby               Python

URL fetching     java.net.URL            System.Net.HttpWebRequest        net/http           urllib
                                                                          open-uri           urllib2
                                                                          rest-open-uri*

DOM parsing /    javax.swing.text.html   HTMLAgilityPack*                 HTree* / REXML     BeautifulSoup*
traversing       TagSoup*                                                 Hpricot*
                 NekoHTML*                                                RubyfulSoup*

Regexp           java.util.regex         System.Text.RegularExpressions   Regexp             re

* Third-party libraries. They are not part of the language's API.
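As a rough illustration of the Regexp row, the sketch below pulls cell values out of a timetable row using only Ruby's built-in `Regexp`. The constant `ROW` and the helper `cell_texts` are hypothetical; the markup mirrors the HTML shown earlier.

```ruby
# One row of the timetable, in the shape of the HTML shown earlier.
ROW = '<tr><td align="center" class="color1">06.39</td><td class="color3">C1 </td></tr>'

# Hypothetical helper: returns the trimmed text of every <td> cell in a fragment.
def cell_texts(html)
  html.scan(/<td[^>]*>(.*?)<\/td>/m).flatten.map(&:strip)
end

p cell_texts(ROW)
```

Regular expressions of this kind tend to break as soon as the markup changes or cells are nested, which is one reason to prefer the DOM parsing libraries listed in the table.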



Duck typing + reflection = syntactic sugar



Dynamic languages make coding easier


Java
import javax.swing.text.html.*;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import java.net.URL;
import java.io.InputStreamReader;
import java.io.Reader;

public class HTMLParser
{
    public static void main( String[] argv ) throws Exception
    {
        URL url = new URL( "http://java.sun.com" );
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Ignore the charset declared in the page so that parsing does not abort
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        Reader htmlReader = new InputStreamReader(url.openConnection().getInputStream());
        kit.read(htmlReader, doc, 0);

        // Walk every element of the document and print the src of each <img>
        ElementIterator it = new ElementIterator(doc);
        Element elem;

        while( (elem = it.next()) != null )
        {
            if( elem.getName().equals( "img") )
            {
                String s = (String) elem.getAttributes().getAttribute(HTML.Attribute.SRC);
                if( s != null )
                    System.out.println( s );
            }
        }
        System.exit(0);
    }
}

Ruby
require 'rubygems'
require 'open-uri'
require 'htree'
require 'rexml/document'

open("http://java.sun.com",:proxy=>"http://localhost:8080") do |page|
  page_content = page.read()
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//img') {|elem| puts elem.attribute('src').value }
end
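The "duck typing + reflection" claim can be illustrated with a small sketch. It is not part of the original slides: the `Record` class, the `Trip` struct and the `describe` helper are hypothetical names. `Record` uses `method_missing` (reflection) to expose Hash keys as methods, and `describe` accepts any object that answers `salida` and `llegada` (duck typing).

```ruby
# Hypothetical wrapper: exposes Hash keys as methods via method_missing.
class Record
  def initialize(fields)
    @fields = fields
  end

  # Called by Ruby when no regular method matches; looks the name up in the Hash.
  def method_missing(name, *args)
    key = name.to_s
    return @fields[key] if args.empty? && @fields.key?(key)
    super
  end

  # Keeps respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    @fields.key?(name.to_s) || super
  end
end

# A completely unrelated type that happens to offer the same two methods.
Trip = Struct.new(:salida, :llegada)

# Duck typing: no common interface is declared, only the methods matter.
def describe(trip)
  "#{trip.salida} -> #{trip.llegada}"
end

puts describe(Record.new('salida' => '06:39', 'llegada' => '07:15'))
puts describe(Trip.new('08:00', '08:30'))
```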

Choosing the tools: Ruby
Candidates: rest-open-uri, HTree + REXML, Hpricot, RubyfulSoup, WWW::Mechanize

rest-open-uri lets us make requests to URLs and extract their content.
It extends open-uri to support more HTTP verbs.

HTree builds a tree of objects from HTML code.
HTree#to_rexml converts that tree into a REXML tree.
REXML trees can be navigated with XPath 1.0.
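The XPath navigation step can be tried without HTree on a well-formed fragment. The sketch below makes that assumption: `PAGE` is a hypothetical stand-in for a results page and `hrefs_with_class` is an illustrative helper. Note the quotes around the attribute value, which XPath requires for string literals.

```ruby
require 'rexml/document'

# A small, well-formed stand-in for a results page (real pages need HTree first).
PAGE = <<~HTML
  <html><body>
    <a class="l" href="http://www.ruby-lang.org/">Ruby</a>
    <a class="nav" href="/next">Next</a>
    <a class="l" href="http://rubyonrails.org/">Rails</a>
  </body></html>
HTML

# Hypothetical helper: href of every <a> whose class attribute equals css_class.
def hrefs_with_class(xml, css_class)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//a[@class='#{css_class}']").map { |a| a.attributes['href'] }
end

puts hrefs_with_class(PAGE, 'l')
```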

HTree+REXML
require 'rubygems'
require 'open-uri'
require 'htree'
require 'rexml/document'

open("http://www.google.es/search?q=ruby",:proxy=>"http://localhost:8080") do |page|
  page_content = page.read()
  doc = HTree(page_content).to_rexml
  # XPath string literals must be quoted
  doc.root.each_element('//a[@class="l"]') {|elem| puts elem.attribute('href').value }
end

Runtime: 7.06s.

Hpricot has a scanner implemented in C (very fast).
It builds a DOM with its own navigation system, similar to jQuery (CSS selectors and XPath*).
Its functionality is equivalent to HTree + REXML.

http://hpricot.com/

Hpricot

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://www.google.com/search?q=ruby',:proxy=>'http://localhost:8080'))
links = doc/"//a[@class=l]"
links.each {|link| puts link.attributes['href']}

Runtime: 3.71s

RubyfulSoup offers the same functionality as HTree + REXML.

Rubyful Soup
require 'rubygems'
require 'rubyful_soup'
require 'open-uri'

open("http://www.google.com/search?q=ruby",:proxy=>"http://localhost:8080") do |page|
  page_content = page.read()
  soup = BeautifulSoup.new(page_content)
  result = soup.find_all('a', :attrs => {'class' => 'l'})
  result.each { |tag| puts tag['href'] }
end

Runtime: 4.71s

RubyfulSoup drawbacks:
Lower performance than Hpricot
CSS selectors are not supported

WWW::Mechanize lets us perform interactions:
Fill in and submit forms
Follow links

It can reach documents in the Deep Web.
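Underneath, these two interactions come down to building the right URL. The sketch below uses only Ruby's standard `uri` library and makes no network request. The helpers `form_get_url` and `resolve_link` are hypothetical and cover just the simplest case (a GET form and a relative link); they do not show how Mechanize works internally.

```ruby
require 'uri'

# Hypothetical helper: the URL a browser would request when a GET form is submitted.
def form_get_url(action, fields)
  uri = URI.parse(action)
  uri.query = URI.encode_www_form(fields)
  uri.to_s
end

# Hypothetical helper: turns the href of a link into an absolute URL.
def resolve_link(page_url, href)
  URI.join(page_url, href).to_s
end

results_url = form_get_url('http://www.google.com/search', 'q' => 'ruby')
puts results_url
puts resolve_link(results_url, '/next')
```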

WWW::Mechanize
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
agent.set_proxy("localhost",8080)
page = agent.get('http://www.google.com')

search_form = page.forms.select{|f| f.name=="f"}.first
search_form.fields.select {|f| f.name=='q'}.first.value="ruby"
search_results = agent.submit(search_form)
search_results.links.each { |link| puts link.href if link.attributes["class"] == "l" }

Runtime: 5.23s
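The slides do not say how the runtimes were measured. One plausible way, shown here purely as an assumption, is to wrap the scraping code in a timing helper. The `timed` method is a hypothetical name and the block body is a stand-in for a real scraping run.

```ruby
# Hypothetical helper: runs the block and returns [result, elapsed_seconds].
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Stand-in for a scraping run that returns a list of links.
links, seconds = timed { %w[a b c].map(&:upcase) }
puts format('Runtime: %.2fs (%d links)', seconds, links.size)
```

Most of the time measured this way goes into network latency, so differences of a second or two between libraries should be read with caution.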

Tuenti has no public API

More than 8,000 new users every day

2 h average session

Data, data, data!

Step 1: Access our profile
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
# claim to be Firefox by changing the User-Agent header
agent.user_agent_alias='Mac FireFox'
login_page = agent.get('http://m.tuenti.com/?m=login')

# grab the login form
login_form = login_page.forms.first
# and fill in the user and password fields
login_form.fields.select{|f| f.name=="email"}.first.value="miguelfernandezfernandez@gmail.com"
login_form.fields.select{|f| f.name=="input_password"}.first.value="xxxxx"

# submit the form; the result should be the home page
pagina_de_inicio=agent.submit(login_form)

Second attempt: mobile version
require 'rubygems'
require 'mechanize'

# ...

# and fill in the user and password fields
login_form.fields.select{|f| f.name=="tuentiemail"}.first.value="miguelfernandezfernandez@gmail.com"
login_form.fields.select{|f| f.name=="password"}.first.value="xxxxxx"

pagina_de_inicio=agent.submit(login_form)

require 'rubygems'
require 'mechanize'

class TuentiAPI

  def initialize(login,password)
    @login=login
    @password=password
  end

  def inicio()
    # ...

    # and fill in the user and password fields
    login_form.fields.select{|f| f.name=="tuentiemail"}.first.value=@login
    login_form.fields.select{|f| f.name=="password"}.first.value=@password

    pagina_de_inicio=agent.submit(login_form)
  end

end

pagina_de_inicio=TuentiAPI.new("miguelfernandezfernandez@gmail.com","xxxxxx").inicio()

Step 2: Get the photos
class TuentiAPI

  ...

  def fotos_nuevas()
    tree=Hpricot(inicio().content)
    # images with alt="Foto" nested inside links on the home page
    fotos = tree / "//a//img[@alt=Foto]"
    # keep only the image URLs, without duplicates
    fotos.map{|foto| foto.attributes["src"]}.uniq
  end

  private

  def inicio()
    ...
  end
end

Step 3: Set the status

class TuentiAPI
  ...

  def actualizar_estado(msg)
    form_actualizacion=inicio.forms.first
    form_actualizacion.fields.select{|f| f.name=="status"}.first.value=msg
    @agent.submit(form_actualizacion)
  end

end

Tor: browsing anonymously
A network of chained proxies

N requests leave from M servers
Provides anonymity at the IP level

https://www.torproject.org/vidalia/
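From the scraper's point of view, going through Tor usually means sending requests through a local proxy. The sketch below assumes that a local HTTP proxy listening on localhost:8080 (the address used in the earlier examples) forwards traffic into the Tor network. That setup is an assumption, not something stated in the slides, and `client_via_proxy` is a hypothetical helper. No request is actually sent.

```ruby
require 'net/http'

# Assumption: a local HTTP proxy on localhost:8080 forwards traffic into Tor.
PROXY_HOST = 'localhost'
PROXY_PORT = 8080

# Hypothetical helper: an HTTP client for host whose requests go through the proxy.
def client_via_proxy(host, port = 80)
  Net::HTTP.new(host, port, PROXY_HOST, PROXY_PORT)
end

http = client_via_proxy('www.google.com')
puts "proxy in use: #{http.proxy_address}:#{http.proxy_port}"
# http.get('/search?q=ruby') would now leave through the proxy.
```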
