Don't scrape, Glean!
tommorris
Lacks the demo part, alas, but it's the slides I used
Don't scrape, Glean!
1.
2.
Scraping sucks.
3.
def lastlogin
  # Grab the tenth <br />-separated chunk of the profile cell; its last 10 characters are the date.
  date = (@hmodel / "//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end
4.
Hpricot for ‘Last login’ date on MySpace.
5.
try :
lastlogin = self.soup.findAll( True , { "width" : "193" })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r " [0-9] / [0-9] +/ [0-9]* ") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None : self.lastlogin = loginregex_inst.group() except : pass pass pass pass pass pass pass pass
6.
Taken from a Python/BeautifulSoup library.
7.
(The Ruby is prettier, but who’s counting?)
8.
getElementsByClassName("foo")[0].children
9.
It’s an edge case. MySpace’s HTML is worse than average.
10.
But it is an ugly recipe for mental turmoil.
11.
The alternative?
12.
flickr.getPhotos()
13.
And you get back nice XML or JSON (or even SOAP!)
14.
But ‘D.R.Y.’! APIs break that principle.
15.
This is the data equivalent of the ‘accessible version’.
16.
Enter GRDDL.
17.
GRDDL defines a transformation process for XHTML » RDF.
18.
XHTML? That’s what the spec says.
19.
HTML 4 works too. Tidy!
20.
RDF? Yes. Trust me. It’s not evil.
21.
GRDDL can work like a data stylesheet on top of your HTML.
22.
You simply use HTML (or XML) in the normal way...
23.
...and define how the data is transformed.
24.
You can even use it as a bridge for existing APIs and services.
25.
Could even be used for formats other than RDF. Atom?
26.
Simple example: ‘Not Safe For Work’
27.
<a href="http://tubgirl.com" class="nsfw">
28.
I can write that. I can’t write xFolk by hand.
29.
Is ‘nsfw’ a good class name? No.
30.
Do I care? No.
31.
The data layer becomes separate, just as CSS is separate from HTML.
32.
That’s the theory. Now for the demo.
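The demo is not part of these slides. As a stand-in (not the author's demo), here is a minimal Python sketch of what gleaning the ‘nsfw’ example could look like: it reads ordinary HTML, finds links marked with class="nsfw", and emits one RDF triple per link in N-Triples form. Everything here is an assumption made for illustration: the NsfwGleaner and glean names, the example.org class URI, and the use of Python's html.parser in place of the XSLT transformation that GRDDL would normally point to.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical class URI, made up for this sketch.
NSFW_CLASS = "http://example.org/vocab#NotSafeForWork"


class NsfwGleaner(HTMLParser):
    """Collect the href of every <a> whose class list contains 'nsfw'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # class is a space-separated token list; a valueless attribute gives None.
        classes = (attributes.get("class") or "").split()
        href = (attributes.get("href") or "").strip()
        if "nsfw" in classes and href:
            self.links.append(href)


def glean(html_text):
    """Return one N-Triples line per nsfw link found in html_text."""
    parser = NsfwGleaner()
    parser.feed(html_text)
    parser.close()
    return [f"<{url}> <{RDF_TYPE}> <{NSFW_CLASS}> ." for url in parser.links]


PAGE = """
<p>Some links:
<a href="http://example.com/fine">fine</a>
<a href=" http://example.com/risky " class="external nsfw">risky</a>
</p>
"""

for triple in glean(PAGE):
    print(triple)
```

The author only has to write the class attribute; the rule that turns it into data lives in one place, which is the ‘data stylesheet’ idea from the earlier slides.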
33.
irc.freenode.net #swig #swhack
34.
getsemantic.com [email_address]
35.
[email_address] http://tommorris.org
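As a footnote to slide 17: in the GRDDL specification, an XHTML page announces its transformation by putting the profile URI http://www.w3.org/2003/g/data-view on its head element and adding a link with rel="transformation". The sketch below shows only that discovery step, under some assumptions: the find_transformations name and the example.org stylesheet URL are invented for illustration, only link elements in the head are checked, and the XSLT itself is not run (Python's standard library has no XSLT processor).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def find_transformations(xhtml_text):
    """Return the transformation URLs declared in the head of an XHTML page."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # profile is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    urls = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        # rel is a space-separated list of tokens too.
        if "transformation" in link.get("rel", "").split():
            href = link.get("href")
            if href:
                urls.append(href)
    return urls


# Hypothetical page; the stylesheet URL does not point at a real file.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Bookmarks</title>
    <link rel="transformation" href="http://example.org/nsfw2rdf.xsl" />
  </head>
  <body><p>Hello</p></body>
</html>"""

print(find_transformations(SAMPLE))
```

A GRDDL-aware agent would then fetch each returned stylesheet and apply it to the page to get RDF. Because this sketch uses an XML parser, tag-soup HTML 4 would need to be cleaned up first, which is where Tidy comes in.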