Data and Information
                        Extraction on the Web
                           Gestione delle Informazioni su Web - 2009/2010
                                            Tommaso Teofili
                                    tommaso [at] apache [dot] org




Monday, 12 April 2010
Agenda
                        Search

                          Goals

                          Problems

                        Data extraction

                        Information extraction

                        Mixing things together


Search - Goals

                        Find what we are looking for

                          Quickly

                          Easily

                        Have suggestions on other interesting related
                        stuff

                        Turn results into useful knowledge


What are you looking for?
Problems when googling


                         Where to search for what we are looking for

                         How to write good queries (e.g., what are the
                         relations between terms?)

                         How to evaluate when a query is good




Search sources


                        Redundant, heterogeneous, widespread,
                        public, noisy, free, sometimes standard, semi-
                        structured, linked, reachable...

                        in one word:   the Web


Focused search sources

                         Address interesting sources for the desired
                         domain

                         Where possible, filter out the unclean and
                         fragmented ones

                         Choose the most standard and well
                         structured ones



Fragmented sources
Structured sources
Data extraction

                        Automatically collect data from the Web

                        Crawl data from domain specific sources

                        Aggregate homogeneous data (e.g., using
                        equivalence classes)

                        Save (portions of downloaded) data to a
                        convenient separate storage (DB, file system,
                        repository, etc.)


Data extraction - Crawling

                        From scratch (good luck!)

                        Leveraging existing facilities (wget, HtmlUnit,
                        Selenium, Apache HttpClient, Ning’s Async
                        HttpClient, etc.)

                        Playing with existing projects (RoadRunner,
                        Webpipe, Apache Nutch, etc.)
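
To give a feel for the "from scratch" option, here is a minimal sketch in Python using only the standard library. The names `LinkParser`, `http_fetch`, and `crawl` are hypothetical and not from the slides; the crawl is breadth-first and stays on the seed's host. A real crawler would also need politeness delays, robots.txt handling, and error handling, which is why the existing tools listed above are usually the better choice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def http_fetch(url):
    """Default fetcher: download a page over HTTP (assumes UTF-8 for brevity)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=http_fetch, max_pages=50):
    """Breadth-first crawl restricted to the seed's host; returns {url: html}."""
    host = urlparse(seed).netloc
    queue, pages = deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue
        pages[url] = fetch(url)
        parser = LinkParser()
        parser.feed(pages[url])
        for href in parser.links:
            target = urljoin(url, href)
            # Follow only links that stay on the same host and are still unseen.
            if urlparse(target).netloc == host and target not in pages:
                queue.append(target)
    return pages
```

Passing `fetch` as a parameter keeps the traversal logic separate from the HTTP client, so the same loop could sit on top of any of the libraries named above.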



Data extraction - HttpClient
Data extraction - HtmlUnit
Data extraction - Aggregating

                        Downloaded resources can be assigned to
                        equivalence classes

                        Crawling process is inherently defining page
                        classes to which pages belong automatically

                        Relations between page classes

                        RoadRunner, Webpipe, etc.
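
One naive way to assign downloaded pages to equivalence classes is to group them by a structural signature. The sketch below uses the set of root-to-element tag paths as that signature; this heuristic and the names `PathCollector`, `signature`, and `group_by_structure` are assumptions made for illustration, and this is not how RoadRunner or Webpipe work internally.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Records the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Structural fingerprint of a page: its set of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def group_by_structure(pages):
    """Group URLs whose pages share the same structural signature."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[signature(html)].append(url)
    return list(classes.values())
```

Because the signature is a set, two player pages with a different number of table rows still land in the same class, while a page with a different layout does not.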



Data extraction - EC






[Figure: example page classes: “teams indexes” class, “teams” class, “players” class, “coaches” class]

Data extraction - Relevance


                        What do we really need?

                          Depending on the specific domain

                        Not every page in every class is relevant

                        We could be interested only in a subset of
                        the found page classes



Data extraction - Example



                         We may be interested
                         in retrieving only
                         information regarding
                         players (Player class)




Data extraction - Problems
                        Unavailable resources and unexpected server responses (HTTP 404, 403, 303, etc.)

                        Security and bandwidth filters (don’t get your crawler
                        machine’s IP banned!)

                        Client unavailability (memory and storage space are
                        unlimited only in theory)

                        Encoding

                        Legal issues

                        ...
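
Some of these problems can be softened in the fetching code itself. The following is a sketch under stated assumptions: the set of retryable status codes, the fallback charsets, the size cap, and the names `decode_body` and `polite_fetch` are all illustrative choices, not recommendations from the slides.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumption: status codes that are worth a second attempt.
RETRYABLE = {429, 500, 502, 503, 504}


def decode_body(raw, declared_charset=None):
    """Decode bytes, trying the declared charset first, then common fallbacks."""
    for charset in (declared_charset, "utf-8", "cp1252"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("utf-8", errors="replace")


def polite_fetch(url, retries=3, delay=1.0, max_bytes=2_000_000):
    """Fetch a page with a size cap, a pause between attempts, and retries."""
    request = Request(url, headers={"User-Agent": "course-crawler-sketch"})
    for attempt in range(retries):
        try:
            with urlopen(request, timeout=10) as response:
                raw = response.read(max_bytes)  # bound memory use per page
                return decode_body(raw, response.headers.get_content_charset())
        except HTTPError as error:
            if error.code not in RETRYABLE:
                return None  # e.g. 403 or 404: give up on this URL
        except URLError:
            pass  # network-level failure: fall through and retry
        time.sleep(delay * 2 ** attempt)  # exponential backoff
    return None
```

The pause between attempts addresses bandwidth filters, the byte cap addresses client-side limits, and `decode_body` addresses encoding; legal issues cannot be solved in code.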


From Data to Information
Data vs Information
                        Data                    Information

                          Rough                   Clean

                          Semi-structured         Structured

                          Mixed content           Focused

                          Immutable               Managed

                          Navigation oriented     Domain oriented



From Data to Information


                        We have crawled a lot of data

                        We may have some rough structure
                        (page classes and relations)

                        We want to pick only what we need




Information extraction - Pruning

                        We want to filter out at least:

                          Banners, advertisement, etc.

                          Headers/Footers

                          Navigation bars/Search boxes

                          Everything else not related to the content

                        We may use XPath
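
A minimal sketch of XPath-based pruning, using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed input such as XHTML. The expressions in `NOISE_XPATHS` and the function name `prune` are hypothetical; real pages need site-specific expressions and usually an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath expressions for the non-content regions of a page.
NOISE_XPATHS = [
    ".//div[@id='header']",
    ".//div[@id='footer']",
    ".//div[@class='banner']",
    ".//ul[@class='nav']",
    ".//form",
]


def prune(root, xpaths=NOISE_XPATHS):
    """Remove every element matched by one of the XPath expressions, in place."""
    for xpath in xpaths:
        # ElementTree elements do not know their parent, so build a lookup.
        parent_of = {child: parent for parent in root.iter() for child in parent}
        for element in root.findall(xpath):
            parent = parent_of.get(element)
            if parent is not None:
                parent.remove(element)
    return root
```

What remains after pruning is the content region, which is the input for the extraction steps that follow.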


Information extraction

                        Once we have extracted content

                        We are now interested in getting useful
                        information from it -> knowledge

                        Look for matches between extracted
                        data and our domain model




Information extraction - Example

                        Navigate XML (HTML DOM) nodes with XPath

                        Navigate content and find specific
                        “parts” (nodes or sub-trees)

                        Tag such “parts” as objects or properties
                        inside a (specific) domain model

                        May need to traverse the DOM multiple
                        times


Information extraction - Name

Information extraction - Date of Birth

Information extraction - Team

Information extraction - Example


                        A Player (taken from the Player pageclass)

                        with name, date of birth and belonging to a
                        team

                        We now know that “Francesco Totti” is a Player
                        of “Italy” team and was born on “27/09/1976”

                        We can apply such XPaths to all PageClass
                        instances and get information about each player
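
The mapping from XPath matches to a domain object can be sketched as follows. The page layout, the expressions in `PLAYER_XPATHS`, and the `Player` dataclass are assumptions made for illustration; the actual XPaths on the original slides were shown as images and are not reproduced here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical XPaths for a hypothetical "player" page layout.
PLAYER_XPATHS = {
    "name": ".//h1[@class='name']",
    "dob": ".//td[@class='dob']",
    "team": ".//td[@class='team']",
}


@dataclass
class Player:
    name: str
    dob: date
    team: str


def extract_player(root):
    """Apply one XPath per property and build a Player domain object."""
    values = {}
    for prop, xpath in PLAYER_XPATHS.items():
        node = root.find(xpath)
        if node is None or not (node.text or "").strip():
            raise ValueError(f"no match for {prop!r}: {xpath}")
        values[prop] = node.text.strip()
    # Dates on the example page are assumed to be in day/month/year form.
    values["dob"] = datetime.strptime(values["dob"], "%d/%m/%Y").date()
    return Player(**values)
```

Raising on a missing match, rather than returning a half-filled object, makes layout changes in the source visible as soon as they happen.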



Information extraction - Wrapper


                        Context navigation

                           RoadRunner

                           Webpipe

                        Statistical analysis

                           ExAlg

                        Other...



Information extraction - Problems



                        Not well structured sources

                        Frequently changing sources

                        False positives

                        Corrupted extracted data




False positives
Information extraction - Relevance


                        Using wrappers we can get a lot of
                        information

                        We could rank what is relevant in:

                          the “page” context

                          the domain model

                        For efficiency and “reasoning” purposes


Information extraction - Metadata


                        Stream extracted information into our
                        domain model

                        Extracted information -> Metadata

                        Populated domain objects contain

                          interesting semantics

                          relations


Store Metadata
                        DB (with classic relational schema)

                        Filesystem (XML)

                        Key-Value repository

                        Index

                        Triple Store

                        ...


Query enriched data

                        Exploit the semantics of the acquired metadata
                        to build SQL-like queries (using the attributes
                        and relations of our domain model) on
                        previously unstructured data

                        Extract hidden knowledge by querying
                        aggregated metadata




Sample queries
                        Get “young players”

                          SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'

                        Aggregate queries

                          Find the average age in each team

                          Find the average age of World Cup
                          players
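
These queries can be tried out against a small relational store. The sketch below uses Python's built-in `sqlite3`; the table name `giocatore` and the `dob` column come from the slide, while the `name` and `team` columns, the two made-up rows, and the reference date are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO giocatore VALUES (?, ?, ?)",
    [
        ("Francesco Totti", "1976-09-27", "Italy"),  # from the slides
        ("Player A", "1994-03-10", "Italy"),  # made-up row
        ("Player B", "1990-06-01", "Spain"),  # made-up row
    ],
)


def young_players(conn, born_after="1993-01-01"):
    """Names of players born after a date; ISO dates compare correctly as text."""
    rows = conn.execute(
        "SELECT name FROM giocatore WHERE dob > ? ORDER BY name", (born_after,)
    )
    return [name for (name,) in rows]


def average_age_by_team(conn, today="2010-04-12"):
    """Average age in years per team, approximated as elapsed days / 365.25."""
    rows = conn.execute(
        "SELECT team, AVG((julianday(?) - julianday(dob)) / 365.25) "
        "FROM giocatore GROUP BY team ORDER BY team",
        (today,),
    )
    return dict(rows.fetchall())
```

Dropping the `GROUP BY team` clause from the second query would give the average age over all stored players instead of per team.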


Information extraction
                             on the Web
References
                        http://www.w3.org/TR/xpath/

                        http://www.w3.org/DOM/

                        http://www.dia.uniroma3.it/db/roadRunner/

                        http://www.slideshare.net/n0on3/exalg-overview

                        http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm

                        http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html

                        http://en.wikipedia.org/wiki/Web_scraping

                        http://www.alchemyapi.com/api/scrape/




lunedì 12 aprile 2010

Más contenido relacionado

La actualidad más candente

Ch. 16 Database Case Study: XML/XSLT
Ch. 16 Database Case Study: XML/XSLTCh. 16 Database Case Study: XML/XSLT
Ch. 16 Database Case Study: XML/XSLT
mh-108
 
Linked Open Data Fundamentals for Libraries, Archives and Museums
Linked Open Data Fundamentals for Libraries, Archives and MuseumsLinked Open Data Fundamentals for Libraries, Archives and Museums
Linked Open Data Fundamentals for Libraries, Archives and Museums
trevorthornton
 
20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums
andrea huang
 

La actualidad más candente (10)

Demo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataDemo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open Data
 
Ch. 16 Database Case Study: XML/XSLT
Ch. 16 Database Case Study: XML/XSLTCh. 16 Database Case Study: XML/XSLT
Ch. 16 Database Case Study: XML/XSLT
 
A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"
 
Implementing Linked Data in Low-Resource Conditions
Implementing Linked Data in Low-Resource ConditionsImplementing Linked Data in Low-Resource Conditions
Implementing Linked Data in Low-Resource Conditions
 
Linked Open Data Fundamentals for Libraries, Archives and Museums
Linked Open Data Fundamentals for Libraries, Archives and MuseumsLinked Open Data Fundamentals for Libraries, Archives and Museums
Linked Open Data Fundamentals for Libraries, Archives and Museums
 
Study Support and Integration of Cultural Information Resources with Linked Data
Study Support and Integration of Cultural Information Resources with Linked DataStudy Support and Integration of Cultural Information Resources with Linked Data
Study Support and Integration of Cultural Information Resources with Linked Data
 
20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums
 
Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?
 
A structured catalog of open educational datasets
A structured catalog of open educational datasetsA structured catalog of open educational datasets
A structured catalog of open educational datasets
 
Experiences Evolving a New Analytical Platform: What Works and What's Missing
Experiences Evolving a New Analytical Platform: What Works and What's MissingExperiences Evolving a New Analytical Platform: What Works and What's Missing
Experiences Evolving a New Analytical Platform: What Works and What's Missing
 

Destacado

Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
Benjamin Habegger
 
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Cataldo Musto
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extraction
R A Akerkar
 
Production proposal – meeting minutes
Production proposal – meeting minutesProduction proposal – meeting minutes
Production proposal – meeting minutes
hamdi_jama
 

Destacado (20)

Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesEnterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Information Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram DataInformation Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram Data
 
Data presentation 2
Data presentation 2Data presentation 2
Data presentation 2
 
Presentation of data
Presentation of dataPresentation of data
Presentation of data
 
SystemT: Declarative Information Extraction
SystemT: Declarative Information ExtractionSystemT: Declarative Information Extraction
SystemT: Declarative Information Extraction
 
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
 
Start with Small Data: How to Understand your Visitors, Capture Data, and Pro...
Start with Small Data: How to Understand your Visitors, Capture Data, and Pro...Start with Small Data: How to Understand your Visitors, Capture Data, and Pro...
Start with Small Data: How to Understand your Visitors, Capture Data, and Pro...
 
Compile, Clean, Connect: The mantra of data journalism (Future Everything 2011)
Compile, Clean, Connect: The mantra of data journalism (Future Everything 2011)Compile, Clean, Connect: The mantra of data journalism (Future Everything 2011)
Compile, Clean, Connect: The mantra of data journalism (Future Everything 2011)
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extraction
 
Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介
Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介
Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介
 
Compensation ahmed's
Compensation ahmed'sCompensation ahmed's
Compensation ahmed's
 
5 tactics to personalize your email message for better results final
5 tactics to personalize your email message for better results final5 tactics to personalize your email message for better results final
5 tactics to personalize your email message for better results final
 
Minutes of meeting form
Minutes of meeting formMinutes of meeting form
Minutes of meeting form
 
Production proposal – meeting minutes
Production proposal – meeting minutesProduction proposal – meeting minutes
Production proposal – meeting minutes
 
Meeting Minute Unit 28
Meeting Minute Unit 28Meeting Minute Unit 28
Meeting Minute Unit 28
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 
Job analysis
Job analysisJob analysis
Job analysis
 

Similar a Data and Information Extraction on the Web

WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
Stefan Dietze
 
Everything you always wanted to know about search in typo3
Everything you always wanted to know about search in typo3Everything you always wanted to know about search in typo3
Everything you always wanted to know about search in typo3
Olivier Dobberkau
 
03 Custom Classes
03 Custom Classes03 Custom Classes
03 Custom Classes
Mahmoud
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Stefan Dietze
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2
rusersla
 
Linking Open Data with Drupal
Linking Open Data with DrupalLinking Open Data with Drupal
Linking Open Data with Drupal
emmanuel_jamin
 
SemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise DataSemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise Data
3 Round Stones
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
vafopoulos
 

Similar a Data and Information Extraction on the Web (20)

Introducing Riak and Ripple
Introducing Riak and RippleIntroducing Riak and Ripple
Introducing Riak and Ripple
 
20130206 open refine
20130206  open refine20130206  open refine
20130206 open refine
 
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Lucene revolution with Data Harmony
Lucene revolution with Data HarmonyLucene revolution with Data Harmony
Lucene revolution with Data Harmony
 
Everything you always wanted to know about search in typo3
Everything you always wanted to know about search in typo3Everything you always wanted to know about search in typo3
Everything you always wanted to know about search in typo3
 
Open Data Commons - OSSAT 14 April 2010
Open Data Commons - OSSAT 14 April 2010Open Data Commons - OSSAT 14 April 2010
Open Data Commons - OSSAT 14 April 2010
 
03 Custom Classes
03 Custom Classes03 Custom Classes
03 Custom Classes
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
 
An On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemAn On-line Collaborative Data Management System
An On-line Collaborative Data Management System
 
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the Web
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2
 
SharePoint 2010 Data View webparts - Advanced editing methods
SharePoint 2010 Data View webparts - Advanced editing methodsSharePoint 2010 Data View webparts - Advanced editing methods
SharePoint 2010 Data View webparts - Advanced editing methods
 
OER Search
OER SearchOER Search
OER Search
 
Ili structuredauthoring
Ili structuredauthoringIli structuredauthoring
Ili structuredauthoring
 
Linking Open Data with Drupal
Linking Open Data with DrupalLinking Open Data with Drupal
Linking Open Data with Drupal
 
Linked Data at the Open University: From Technical Challenges to Organization...
Linked Data at the Open University: From Technical Challenges to Organization...Linked Data at the Open University: From Technical Challenges to Organization...
Linked Data at the Open University: From Technical Challenges to Organization...
 
SemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise DataSemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise Data
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
 
SharePoint Taxonomy and Metadata 11-19-09
SharePoint Taxonomy and Metadata 11-19-09SharePoint Taxonomy and Metadata 11-19-09
SharePoint Taxonomy and Metadata 11-19-09
 

Más de Tommaso Teofili

Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
Tommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
Tommaso Teofili
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
Tommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
Tommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
Tommaso Teofili
 

Más de Tommaso Teofili (17)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Data and Information Extraction on the Web

  • 1. Data and Information Extraction on the Web. Gestione delle Informazioni su Web (Web Information Management), 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org. Monday, 12 April 2010
  • 2. Agenda: Search (goals, problems); Data extraction; Information extraction; Mixing things together
  • 3. Search - Goals: find what we are looking for, quickly and easily; get suggestions on other interesting, related material; turn results into useful knowledge
  • 4. What are you looking for?
  • 5. Problems when googling: where to search for what we are looking for; how to write good queries (e.g., which relations between terms?); how to evaluate whether a query is good
  • 6. Search sources: redundant, heterogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable... In one word: the Web
  • 7. Focused search sources: address interesting sources for the desired domain; where possible, filter out the unclean and fragmented ones; choose the most standard and well-structured ones
  • 10. Data extraction: automatically collect data from the Web; crawl data from domain-specific sources; aggregate homogeneous data (e.g., using equivalence classes); save (portions of) the downloaded data to a convenient separate storage (DB, file system, repository, etc.)
  • 11. Data extraction - Crawling: from scratch (good luck!); leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.); playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.)
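To give an idea of what the "from scratch" option involves, here is a minimal sketch of a breadth-first crawler restricted to a single host. It is written in Python with only the standard library rather than with any of the tools named above, and the names `LinkCollector`, `http_fetch` and `crawl` are invented for this illustration. The `fetch` parameter is an assumption of the sketch: it lets the download step be replaced, for example by a stub while testing.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def http_fetch(url):
    """Default fetcher: downloads a page over HTTP and decodes it as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(seed, fetch=http_fetch, max_pages=50):
    """Breadth-first crawl limited to the seed's host; returns {url: html}."""
    host = urlparse(seed).netloc
    queue, pages = deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop fragments before queueing.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in pages:
                queue.append(target)
    return pages
```

A real crawler would also need politeness delays, robots.txt handling and retries, which is why the existing facilities and projects listed above are usually the better starting point.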
  • 12. Data extraction - HttpClient
  • 13. Data extraction - HtmlUnit
  • 14. Data extraction - Aggregating: downloaded resources can be assigned to equivalence classes; the crawling process inherently defines the page classes to which pages automatically belong; relations between page classes; RoadRunner, Webpipe, etc.
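As a rough illustration of assigning downloaded pages to equivalence classes, the sketch below groups pages whose HTML uses exactly the same set of tag paths. This is a deliberately naive heuristic chosen for brevity; it is not the algorithm used by RoadRunner or Webpipe. The names `TagPathCollector`, `signature` and `group_by_structure` are invented for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Records the set of root-to-element tag paths found in a page."""

    # Elements that never have a closing tag and must not be pushed.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Structural fingerprint of a page: the set of its tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def group_by_structure(pages):
    """Groups the URLs of {url: html} by identical structural signature."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[signature(html)].append(url)
    return list(classes.values())
```

Because the signature is a set, two player pages with a different number of list items still land in the same class, while a team page with a table lands in another. Real sites would need a tolerant similarity measure instead of exact equality.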
  • 15. Data extraction - EC
  • 16. Data extraction - EC: “teams indexes” class; “teams” class; “players” class; “coaches” class
  • 17. Data extraction - Relevance: what do we really need? It depends on the specific domain; not all pages in all classes are necessarily relevant; we could be interested in only a subset of the page classes found
  • 18. Data extraction - Example: we may be interested in retrieving only information regarding players (Player class)
  • 19. Data extraction - Problems: server unavailability (HTTP 404, 403, 303, etc.); security and bandwidth filters (don’t get your crawler machine’s IP banned!); client unavailability (memory and storage space are unlimited only in theory); encoding; legal issues; ...
  • 20. From Data to Information
  • 21. Data vs Information: rough vs clean; semi-structured vs structured; mixed content vs focused; immutable vs managed; navigation-oriented vs domain-oriented
  • 22. From Data to Information: we have crawled a lot of data; we may have some rough structure (page classes and relations); we want to pick only what we need
  • 23. Information extraction - Pruning: we want to filter out at least: banners, advertisements, etc.; headers/footers; navigation bars/search boxes; everything else not related to the content. We may use XPath
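The pruning step can be sketched as follows, assuming the page is well-formed XHTML so that Python's `xml.etree.ElementTree` can parse it (that module supports only a subset of XPath). The expressions in `NOISE_XPATHS` are hypothetical: the real ones depend on the markup of the site being crawled.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath expressions for the parts of a page we do not want.
NOISE_XPATHS = [
    ".//div[@id='header']",
    ".//div[@id='footer']",
    ".//div[@class='banner']",
    ".//ul[@class='navbar']",
    ".//form[@class='search']",
]


def prune(root, xpaths=NOISE_XPATHS):
    """Removes from the tree, in place, every element matched by the XPaths."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for element in root.findall(xpath):
            parent = parent_of.get(element)
            # Skip the root and elements already detached by an earlier match.
            if parent is not None and element in list(parent):
                parent.remove(element)
    return root
```

Pages found in the wild are rarely well-formed, so in practice a tolerant HTML parser would have to clean them up before XPath expressions like these can be applied.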
  • 24. Information extraction - Pruning
  • 25. Information extraction - Pruning
  • 26. Information extraction: once we have extracted the content, we are interested in getting useful information from it (-> knowledge); look for matches between the extracted data and our domain model
  • 27. Information extraction - Example: navigate XML (HTML DOM) nodes with XPath; navigate the content and find specific “parts” (nodes or sub-trees); tag such “parts” as objects or properties inside a (specific) domain model; we may need to traverse the DOM multiple times
  • 28. Information extraction - Name
  • 29. Information extraction - Date of Birth
  • 30. Information extraction - Team
  • 31. Information extraction - Example: a Player (taken from the Player page class) with a name, a date of birth and a team; we now know that “Francesco Totti” is a Player of the “Italy” team and was born on “27/09/1976”; we can apply such XPaths to all instances of the page class and get information about each player
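A minimal sketch of this example: one XPath per property, applied to a page of the Player class to populate a domain object. The `Player` dataclass, the `PLAYER_XPATHS` expressions and the page markup they assume are all hypothetical, since the actual XPaths shown on the slides are not part of this transcript. As before, the page is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    date_of_birth: str
    team: str


# Hypothetical XPaths, one per property of the domain object.
PLAYER_XPATHS = {
    "name": ".//h1[@class='player-name']",
    "date_of_birth": ".//td[@class='dob']",
    "team": ".//a[@class='team']",
}


def extract_player(page_source):
    """Builds a Player from one page of the Player class."""
    root = ET.fromstring(page_source)
    values = {}
    for field, xpath in PLAYER_XPATHS.items():
        node = root.find(xpath)
        # Fail loudly instead of storing empty or missing values.
        if node is None or not (node.text or "").strip():
            raise ValueError(f"no value found for {field!r}")
        values[field] = node.text.strip()
    return Player(**values)
```

Calling `extract_player` on every page of the class yields one `Player` per page. Raising on a missing node is a design choice of this sketch: it surfaces a changed page layout early rather than silently producing corrupted data.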
  • 32. Information extraction - Wrapper: context navigation (RoadRunner, Webpipe); statistical analysis (ExAlg); other...
  • 33. Information extraction - Problems: poorly structured sources; frequently changing sources; false positives; corrupted extracted data
  • 35. Information extraction - Relevance: using wrappers we can get a lot of information; we could rank what is relevant in the “page” context and in the domain model, for efficiency and “reasoning” purposes
  • 36. Information extraction - Relevance
  • 37. Information extraction - Metadata: stream the extracted information into our domain model; extracted information -> metadata; populated domain objects contain interesting semantic relations
  • 38. Store Metadata: DB (with a classic relational schema); filesystem (XML); key-value repository; index; triple store; ...
  • 39. Query enriched data: exploit the semantics of the acquired metadata to build SQL-like queries (using the attributes and relations of our domain model) on previously unstructured data; extract hidden knowledge by querying the aggregated metadata
  • 40. Sample queries. Get “young players”: SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'. Aggregate queries: find the average age in each team; find the average age of World Cup players
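The sample queries can be tried against an in-memory SQLite database, as in the sketch below. The `giocatore` table name comes from the slide; its columns, the ISO date format, and the rows for "Player A" and "Player B" are assumptions made up for illustration. Only the Totti row reflects data from the slides.

```python
import sqlite3


def build_db():
    """Creates an in-memory store holding a few extracted players."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
    conn.executemany(
        "INSERT INTO giocatore VALUES (?, ?, ?)",
        [
            ("Francesco Totti", "1976-09-27", "Italy"),
            ("Player A", "1994-03-15", "Italy"),  # made-up row
            ("Player B", "1990-01-01", "Spain"),  # made-up row
        ],
    )
    return conn


def young_players(conn, born_after="1993-01-01"):
    """Names of the players born after the given ISO date."""
    rows = conn.execute(
        "SELECT name FROM giocatore WHERE dob > ? ORDER BY name",
        (born_after,),
    )
    return [name for (name,) in rows]


def average_age_by_team(conn, today):
    """Average age in years per team, measured at the ISO date `today`."""
    rows = conn.execute(
        "SELECT team, AVG((julianday(?) - julianday(dob)) / 365.25) "
        "FROM giocatore GROUP BY team ORDER BY team",
        (today,),
    )
    return dict(rows.fetchall())
```

Storing dates as ISO strings (YYYY-MM-DD) is what makes the plain `>` comparison work, because they sort chronologically as text. Dates extracted in the 27/09/1976 form would have to be normalised before loading.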
  • 41. Information extraction on the Web
  • 42. References: http://www.w3.org/TR/xpath/ http://www.w3.org/DOM/ http://www.dia.uniroma3.it/db/roadRunner/ http://www.slideshare.net/n0on3/exalg-overview http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html http://en.wikipedia.org/wiki/Web_scraping http://www.alchemyapi.com/api/scrape/