Heterogeneous Web Data Search
Using Relevance-based On The Fly Data Integration
Daniel M. Herzig, Thanh Tran
WWW2012

INSTITUTE OF APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS (AIFB)




KIT – University of the State of Baden-Württemberg and
National Research Center of the Helmholtz Association              www.kit.edu
Agenda

     Motivation
     Problem Definition
     Existing Solutions
     Our Approach
         Entity Relevance Model (ERM)
         Ranking
         On-The-Fly alignment
     Experiments
     Conclusion




2    WWW2012   Daniel M. Herzig - Institute AIFB
Company running a movie shopping website




     Movies Shopping Website

     Company’s dataset Ds

Users search the website via forms.
    Search request is internally executed as a
    structured query




     Structured query qs over Ds (e.g. SQL, SPARQL), shown for IMDb (i:):
         ?x  i:directors  "Steven Spielberg"
         ?x  i:year       1982
         ?x  type         i:movie




                Screenshot of http://www.imdb.com/search/title
Company discovers the plethora of Linked Data
    available on the Web and identifies Data
    Sources beneficial for its business




     [Figure: qs and Ds next to the Linked Data on the Web, http://richard.cyganiak.de/2007/10/lod/]


Zero Star Mugs!




                                                    vs.




Problems of Data Integration arise…


      qs does not return results
      No links, no integration
      No knowledge about the external data schema
      External data might change often

Problem Definition

      Find relevant entities in a set of target datasets Dt, given a
      source dataset Ds and a structured entity query qs adhering to
      the vocabulary of Ds.


      [Figure: structured entity query qs over the source dataset Ds; the
      matching entities in the target datasets Dt1, Dt2 are unknown (?)]


Problem Setting

      Data Model is labeled directed graph
          Directly related to RDF
          RDF specifics, e.g. blank nodes, are omitted


      Entity query: SPARQL BGP query with one select variable
          Entity queries are the most frequent type of web search queries
          (Pound et al., WWW2010)


      Web Data scenario:
          Data exhibits heterogeneity at the schema level and the data level
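
      The problem setting can be illustrated with a minimal sketch. It assumes
      the labeled directed graph is held as a set of (subject, predicate, object)
      triples and that the entity query is star-shaped, i.e. every pattern has
      the form ?x predicate object. The function name and the two toy entities
      e1 and e2 are hypothetical and not taken from the talk.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]      # (subject, predicate, object)
Pattern = Tuple[str, str]          # (predicate, object) of a pattern "?x predicate object"


def evaluate_entity_query(graph: Set[Triple], patterns: List[Pattern]) -> Set[str]:
    """Return every subject that satisfies all patterns.

    Only star-shaped queries with the single select variable ?x are handled,
    which is a simplification of general SPARQL basic graph patterns.
    """
    subjects = {s for s, _, _ in graph}
    return {s for s in subjects if all((s, p, o) in graph for p, o in patterns)}


# Hypothetical toy data using the i: vocabulary from the slides.
graph = {
    ("e1", "i:directors", "Steven Spielberg"),
    ("e1", "i:year", "1982"),
    ("e1", "type", "i:movie"),
    ("e2", "i:directors", "Steven Spielberg"),
    ("e2", "i:year", "1993"),
    ("e2", "type", "i:movie"),
}

qs = [("i:directors", "Steven Spielberg"), ("i:year", "1982"), ("type", "i:movie")]
result = evaluate_entity_query(graph, qs)
```

      In this toy graph only e1 satisfies all three patterns. The same query
      posed against a dataset with another vocabulary would match nothing,
      which is the heterogeneity problem described above.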




Heterogeneous Web Data



      Amazon (a:), entity ea
          a:Actors        "Daniel Craig, Eric Bana"
          a:Directors     "Steven Spielberg"
          a:ReleaseDate   2005
          a:Title         "Munich"
          a:Binding       "DVD"
          type            a:Movie
      IMDb (i:), entity ei
          i:actors        "Coyote, Peter"
          i:directors     "Spielberg, Steven (I)"
          i:producer      "Spielberg, Steven (I)"
          i:title         "E.T. (1994)"
          type            i:movie
      DBpedia (db:), entity ed
          db:director     db:Steven_Spielberg
          db:starring     db:John_Candy_(actor)
          rdfs:label      "1941 (film)"
          type            db:Film




       Schema-level:   actors vs. starring
       Data-level:     Steven Spielberg vs. Spielberg, Steven
       Varying number of attributes per entity

Aim: Integrate External Data into the Search
     Process

     How to get from qs over Ds to results in Dt? Two existing options:

     Keyword Search
         Wang et al.: Semplore: A scalable IR approach to search the Web of Data.
         In: Journal of Web Semantics. (2009)

     Query rewriting based on up-front data integration
         Calì et al.: Query Rewriting and Answering under Constraints in Data
         Integration Systems. In: IJCAI. (2003)



Existing Strategies – Keyword Search

       Seed query qs, Amazon (a:)
           ?x  a:Directors               "Rainer Werner Fassbinder"
           ?x  a:TheatricalReleaseDate   1982
           ?x  type                      a:Movie
       Keyword query derived from qs
           directors rainerwernerfassbinder theatrical release date 1982 type movie
       Entities in IMDb (i:), each followed by its bag-of-words representation
           e1: i:title "Veronika Voss", i:director "Rainer Werner Fassbinder", i:released 1982
               e1 title veronikavoss director rainerwernerfassbinder released 1982
           e2: i:title "SchindlersListe (1994)", i:director "Spielberg, Steven (I)", type i:movie
               e2 title schindlersliste 1994 director spielbergsteveni type movie

        Transform qs into keyword query
        Match against bag-of-words representation of entities
        Bridges schema differences by neglecting the structure
        Baseline 1 (KW), IR baseline using Semplore (Lucene)
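
        The keyword search strategy can be sketched as follows. This is only an
        illustration under stated assumptions: plain keyword overlap stands in
        for the Semplore (Lucene) scoring used by the baseline, and the
        tokenization (prefix stripping, camel-case splitting) is a guess that
        does not reproduce the exact keywords shown on the slide. All function
        names are hypothetical.

```python
import re


def tokens(term):
    """Lowercase word tokens of a predicate or value, without its prefix."""
    term = re.sub(r"^\w+:", "", term)                  # drop a prefix such as "a:"
    term = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", term)   # split camel case
    return re.findall(r"[a-z0-9]+", term.lower())


def bag_of_words(pairs):
    """Flatten (predicate, value) pairs into one set of keywords."""
    bag = set()
    for predicate, value in pairs:
        bag.update(tokens(predicate))
        bag.update(tokens(value))
    return bag


def keyword_rank(query, entities):
    """Rank entity ids by how many query keywords their bag of words contains."""
    keywords = bag_of_words(query)
    overlap = {eid: len(keywords & bag_of_words(pairs)) for eid, pairs in entities.items()}
    return sorted(overlap, key=lambda eid: overlap[eid], reverse=True)


# Seed query in the Amazon vocabulary and the two IMDb entities from the slide.
qs = [
    ("a:Directors", "Rainer Werner Fassbinder"),
    ("a:TheatricalReleaseDate", "1982"),
    ("type", "a:Movie"),
]
entities = {
    "e1": [("i:title", "Veronika Voss"),
           ("i:director", "Rainer Werner Fassbinder"),
           ("i:released", "1982")],
    "e2": [("i:title", "SchindlersListe (1994)"),
           ("i:director", "Spielberg, Steven (I)"),
           ("type", "i:movie")],
}
ranking = keyword_rank(qs, entities)
```

        Because structure is ignored, e1 is found although a:Directors and
        i:director are different predicates. The same property lets e2 match on
        the generic words "type" and "movie" alone.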
Existing Strategies – Query Rewriting

       Seed query, Amazon (a:)
           ?x  a:Directors               "Rainer Werner Fassbinder"
           ?x  a:TheatricalReleaseDate   1982
           ?x  type                      a:Movie
       Mappings from the ontology alignment tool (schema Amazon = schema DBpedia)
           a:Directors  =  db:director
           a:Title      =  db:name
           a:Actor      =  db:starring
           …            =  …
       Rewritten query, DBpedia (db:)
           ?x  db:director  ?y
           ?x  type         ?z


       Create mappings using ontology alignment tools (Falcon AO)
       Rewrite query using the mappings, omit missing mappings,
       replace constants with variables
       Reduces the search space, perform keyword search on top
       Baseline 2 (QR), database-style baseline
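
       The rewriting step can be sketched as below. The mapping table is the
       one shown on the slide; the function name, the fresh variable names
       ?v1, ?v2 and the choice to treat "type" as a predicate shared by both
       vocabularies are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, str]  # (predicate, object) of a pattern "?x predicate object"


def rewrite_query(patterns: List[Pattern],
                  mappings: Dict[str, str],
                  shared: Tuple[str, ...] = ("type",)) -> List[Pattern]:
    """Rewrite a star-shaped query into the target vocabulary.

    Predicates are translated through the mappings, patterns without a
    mapping are omitted, and every constant is replaced by a fresh variable.
    """
    rewritten: List[Pattern] = []
    for predicate, _constant in patterns:
        if predicate in shared:
            target = predicate
        elif predicate in mappings:
            target = mappings[predicate]
        else:
            continue  # no mapping available: omit this pattern
        rewritten.append((target, f"?v{len(rewritten) + 1}"))
    return rewritten


mappings = {
    "a:Directors": "db:director",
    "a:Title": "db:name",
    "a:Actor": "db:starring",
}
qs = [
    ("a:Directors", "Rainer Werner Fassbinder"),
    ("a:TheatricalReleaseDate", "1982"),
    ("type", "a:Movie"),
]
rewritten = rewrite_query(qs, mappings)
```

       The release date pattern disappears because no mapping exists for it,
       and the remaining patterns only restrict the structure of candidate
       entities. This is why the slide lists keyword search on top of the
       reduced search space.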

Heterogeneous Web Data Search
     Using Relevance-based On The Fly Data Integration




Contributions

       (1) Novel approach for querying heterogeneous Web
       data sources
           No upfront data integration necessary
           Uses an Entity Relevance Model (ERM) for ranking and
           for computing mappings on the fly


       (2) Implementation of the approach
           Construction of an ERM and usage for alignment and ranking
           Best-effort algorithm for creating mappings during runtime


       (3) Large-scale evaluation with 3 real-world datasets
           Experiments show our approach exceeds the KW and QR baselines by
           120% and 54%, respectively, in terms of Mean Average Precision.

Overview of our Approach

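       [Figure: overview of the approach]
       qs is run against Ds; its results Rs serve as relevance feedback for
       building the Entity Relevance Model, a model leveraging the structure
       of the data.
       qs is also turned into a keyword query; keyword search over Dt crosses
       vocabulary mismatches and yields candidate entities et.
       Matching and ranking: the candidates et are compared to the model.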

Entity Relevance Model (ERM)

       Based on Structured Relevance Model (Lavrenko et al. 2007)

       Entity Relevance Model:
          Query specific model
          Captures structure and content of relevant results
          Composite model consisting of language models weighted by
          occurrence.


       Based on
           Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL.
           (2007)




ERM (2)


       Results of qs: Rs = {e1, e2}
           e1
               label      "World on Wires"
               released   1973
               starring   "Klaus Löwitsch"
               starring   "Barbara Valentin"
               director   "Rainer Werner Fassbinder"
               type       Film
           e2
               label      "Veronika Voss"
               released   1982
               language   "German"
               director   "Rainer Werner Fassbinder"
               type       Film

       [Figure: the ERM built from Rs]


Modelling Target Entities



       Target entity ei, IMDb (i:)
           i:actors      "Coyote, Peter"
           i:directors   "Spielberg, Steven (I)"
           i:producer    "Spielberg, Steven (I)"
           i:title       "E.T. (1994)"
           type          i:movie




       Modeled the same way as ERM
       Language Model for each attribute



Ranking

       [Formula: ranking score of a candidate entity et. Its annotated factors
       are a boost for seed query attributes, the frequency of as, and the
       cross entropy.]

       Idea:
       Rank candidate entities according to their similarity to ERM

       Note: Alignment between ERM and et needed
           If no mapping available, use max H.
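
       The ranking idea can be sketched as follows. The exact formula of the
       talk is not recoverable from the slide text, so this sketch assumes the
       score is a sum over ERM attributes of boost times occurrence weight
       times cross entropy, where a lower score means a more similar
       candidate. The probability floor EPS, the default b = 2.0 and all
       names are assumptions.

```python
import math

EPS = 1e-6               # assumed floor probability for words unseen in a model
MAX_H = -math.log(EPS)   # largest cross entropy possible under that floor


def cross_entropy(p, q):
    """Cross entropy of model q with respect to model p (lower = more similar)."""
    return -sum(p_w * math.log(max(q.get(w, 0.0), EPS)) for w, p_w in p.items())


def score(erm, candidate, mapping, seed_attributes, b=2.0):
    """Distance of a candidate entity to the ERM.

    erm:        attribute -> {"lm": word distribution, "weight": occurrence}
    candidate:  target attribute -> word distribution
    mapping:    ERM attribute -> target attribute (may be incomplete)
    """
    total = 0.0
    for a_s, field in erm.items():
        boost = b if a_s in seed_attributes else 1.0   # boost seed query attributes
        a_t = mapping.get(a_s)
        if a_t in candidate:
            h = cross_entropy(field["lm"], candidate[a_t])
        else:
            h = MAX_H                                  # no mapping available: use max H
        total += boost * field["weight"] * h
    return total


erm = {
    "director": {"lm": {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3}, "weight": 1.0},
    "language": {"lm": {"german": 1.0}, "weight": 0.5},
}
candidates = {
    "et1": {"i:directors": {"fassbinder": 1 / 3, "rainer": 1 / 3, "werner": 1 / 3}},
    "et2": {"i:directors": {"spielberg": 0.5, "steven": 0.5}},
}
mapping = {"director": "i:directors"}

scores = {name: score(erm, lms, mapping, {"director"}) for name, lms in candidates.items()}
ranking = sorted(scores, key=scores.get)   # most similar candidate first
```

       Both toy candidates lack a counterpart for "language" and are charged
       MAX_H for it, so the ordering is decided by the boosted "director"
       attribute.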

On The Fly Alignment

       as ~ at ?

       Idea:
       Compare all language models of et to a field of ERM using
       cross entropy H.
       Establish a mapping if the lowest value of H is below a
       threshold t.

       Worst case: n · r comparisons
           n and r are usually small

       Allows reuse of the computed cross entropies for the subsequent
       ranking
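
       A minimal sketch of this alignment step. It assumes that every ERM
       field is compared against every attribute of the candidate et, which
       gives the product of the two attribute counts as the number of
       comparisons. The probability floor EPS, the function names and the toy
       language models are assumptions of this illustration.

```python
import math

EPS = 1e-6  # assumed floor probability for words unseen in a model


def cross_entropy(p, q):
    """Cross entropy of model q with respect to model p (lower = more similar)."""
    return -sum(p_w * math.log(max(q.get(w, 0.0), EPS)) for w, p_w in p.items())


def align(erm_lms, candidate_lms, t):
    """Map each ERM field to the closest candidate attribute, if close enough.

    Returns the mapping and a cache of all computed cross entropies, so the
    values can be reused when the candidate is ranked afterwards.
    """
    mapping, cache = {}, {}
    for a_s, p in erm_lms.items():
        best_attribute, best_h = None, float("inf")
        for a_t, q in candidate_lms.items():
            h = cross_entropy(p, q)
            cache[(a_s, a_t)] = h
            if h < best_h:
                best_attribute, best_h = a_t, h
        if best_attribute is not None and best_h < t:
            mapping[a_s] = best_attribute   # lowest H is below the threshold
    return mapping, cache


erm_lms = {
    "director": {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3},
    "released": {"1982": 1.0},
}
candidate_lms = {
    "i:directors": {"fassbinder": 1 / 3, "rainer": 1 / 3, "werner": 1 / 3},
    "i:year": {"1982": 1.0},
    "i:title": {"veronika": 0.5, "voss": 0.5},
}
mapping, cache = align(erm_lms, candidate_lms, t=2.0)
strict_mapping, _ = align(erm_lms, candidate_lms, t=0.5)
```

       With t = 2.0 both fields find a counterpart. With the stricter t = 0.5
       only "released" is mapped, since the cross entropy of the three-word
       director model against itself is log 3, which is above 0.5.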
EXPERIMENTS


Datasets




       Three real-world, heterogeneous Web datasets:
       (1) DBpedia 3.5.1, structured representation of Wikipedia

       (2) IMDb, information about movies

       (3) Amazon, information about DVD/Videos



       Datasets (2) and (3) were crawled and transformed to RDF; both were provided by L3S.

Ground Truth

       The same information need, expressed once per dataset:

       Amazon (a:)
           ?x  a:Directors               "Rainer Werner Fassbinder"
           ?x  a:TheatricalReleaseDate   1982
           ?x  type                      a:Movie
       DBpedia (db:)
           ?x  db:director   db:Rainer_Werner_Fassbinder
           ?x  db:released   1982
           ?x  type          db:Film
       IMDb (i:)
           ?x  i:directors   "Fassbinder, Rainer Werner"
           ?x  i:year        1982
           ?x  type          i:movie

       Goal is to find relevant entities in the target datasets
       The seed query qs is rewritten manually to obtain the relevant
       entities in the target datasets.
       3 query sets, each with 23 corresponding entity BGP
       SPARQL queries

IR Experiments

       Baseline KW – Keyword Search
       Baseline QR – Query Rewriting
       Three configurations of ERM:
           ERM – computes alignments on the fly
           ERMa – uses pre-computed alignments only
           ERMq – uses pre-computed alignments and creates mappings on
           top


       Six different retrieval settings.




Results (1) – Mean Average Precision




       ERM improves over KW by 120% and over QR by 54%
       ERMa performs slightly better than ERM
       ERMq performs best.
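
       For reference, Mean Average Precision and a relative improvement such
       as "by 120%" can be computed as in the sketch below. The rankings and
       the two MAP values at the end are hypothetical numbers chosen for the
       arithmetic only; they are not results from the experiments.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at the ranks of the relevant items."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


def relative_improvement(new, baseline):
    """Improvement of new over baseline as a fraction (1.2 means 120%)."""
    return (new - baseline) / baseline


# Hypothetical rankings for two queries.
runs = [
    (["a", "x", "b"], {"a", "b"}),   # relevant items at ranks 1 and 3
    (["x", "a"], {"a"}),             # relevant item at rank 2
]
map_value = mean_average_precision(runs)

# Hypothetical MAP values: 0.22 against a baseline of 0.10.
improvement = relative_improvement(0.22, 0.10)
```

       An improvement of 120% therefore means that the MAP of the approach is
       2.2 times the MAP of the baseline.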
Results (2) – On The Fly Alignment




       Pooled mappings for n = 115k entities
       Average Precision = 0.7, Average Recall = 0.3 for relevant entities
       Pearson correlation ρ(MAP, Precision-Rel) = 0.98


Results (3) – Parameter and Runtime Analysis

       Analysis on the parameters of the model
           Sensitivity of retrieval performance in terms of MAP for varying
           parameter configurations


       Runtime analysis
           Execution takes less than 13s on average
           Can be improved by moving tasks (e.g. computation of language
           models) to index time.




Conclusion

       Novel approach for searching entities in a target dataset Dt
       with a structured query qs adhering to the vocabulary of Ds.

       Entity Relevance Model used for ranking and creating
       mappings during runtime.

       Experiments showed that our approach is effective and
       exceeds the baselines substantially.




THANK YOU!

     Heterogeneous Web Data Search
     Using Relevance-based On The Fly
     Data Integration

     Daniel M. Herzig, Thanh Tran
     herzig@kit.edu
     Institute AIFB, Karlsruhe Institute of Technology,
     Germany

     [Slide thumbnails: Scenario, Overview, Baseline Keyword Search, Query Rewriting]

     ACKNOWLEDGEMENTS:
     We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the
     L3S Research Center for providing us their versions of the IMDb and Amazon datasets.
     This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project
     (grant 01IA08005K).

Execution Process of our Approach

       [Figure: qs over Ds yields Rs, from which the Entity Relevance Model is
       built; candidate entities et come from Dt]

       Run qs against Ds to obtain results Rs
       Build ERM from Rs
       Obtain candidate entities et
       Compare et to ERM
       Rank et according to similarity to ERM
Runtime Analysis




       Average execution time less than 13 sec for the parameter setting used in the IR
       experiments.
       Increasing parameter c (i.e. reducing the number of fields of ERM) improves
       runtime performance
       Our implementation performed some tasks at runtime, which can be moved to index time
       Improvements are easily possible

Parameter Analysis




       Model is robust in certain parameter ranges
       Boosting b: beneficial for similar datasets, not for
       diverse ones
       Pruning c: small effect on effectiveness, larger effect on
       efficiency

Boosting Parameter b




       If attribute as is present in the seed query, the boosting
       parameter is set to b, in order to increase its influence
       during ranking.


Alignment

       [Figure: language models of the ERM next to language models of et]

       Compare LMs (probability distributions) by cross entropy
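
       For reference, the comparison can be written with the standard
       definition of cross entropy. The notation is an assumption of this
       note: P_{a_s} is the language model of an ERM field a_s, P_{a_t} the
       language model of an attribute a_t of et, and V their shared
       vocabulary. The formula on the original slide is not recoverable and
       may use a different sign or smoothing convention.

```latex
H\bigl(P_{a_s}, P_{a_t}\bigr) \;=\; -\sum_{w \in V} P_{a_s}(w)\,\log P_{a_t}(w)
```

       With this convention a smaller H means more similar distributions, and
       H is smallest, equal to the entropy of P_{a_s}, when both models
       coincide.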




Related Work (excerpt)

       Keyword Search
           Wang et al.: Semplore: A scalable IR approach to search the Web of Data.
           In: Journal of Web Semantics. (2009)
       Query rewriting
           Calì et al.: Query Rewriting and Answering under Constraints in Data
           Integration Systems. In: IJCAI. (2003)


       Our approach is based on
           Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL.
           (2007)
           Madhavan et al.: Web-scale Data Integration: You can afford to pay as
           you go. In: CIDR. (2007)





Más contenido relacionado

Destacado (8)

Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
 
The Information Workbench -
The Information Workbench -  The Information Workbench -
The Information Workbench -
 
Diesel Exhaust Fluid (DEF) Fact Sheets
Diesel Exhaust Fluid (DEF) Fact SheetsDiesel Exhaust Fluid (DEF) Fact Sheets
Diesel Exhaust Fluid (DEF) Fact Sheets
 
Tips on how to use AdBlue for fleet operators and drivers by air1_yara
Tips on how to use AdBlue for fleet operators and drivers by air1_yaraTips on how to use AdBlue for fleet operators and drivers by air1_yara
Tips on how to use AdBlue for fleet operators and drivers by air1_yara
 
TYPifier: Inferring the Type Semantics of Structured Data (icde2013)
TYPifier: Inferring the Type Semantics of Structured Data (icde2013)TYPifier: Inferring the Type Semantics of Structured Data (icde2013)
TYPifier: Inferring the Type Semantics of Structured Data (icde2013)
 
Genetically Modified Food
Genetically Modified FoodGenetically Modified Food
Genetically Modified Food
 
Faculty forum presentation march 2012
Faculty forum presentation  march 2012Faculty forum presentation  march 2012
Faculty forum presentation march 2012
 
20120410 aiming水口 ドイツゲームのすゝめ
20120410 aiming水口 ドイツゲームのすゝめ20120410 aiming水口 ドイツゲームのすゝめ
20120410 aiming水口 ドイツゲームのすゝめ
 

Similar a Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Property-based Access of RDF Data
Property-based Access of RDF DataProperty-based Access of RDF Data
Property-based Access of RDF Data
Gerd Groener
 
Keyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsKeyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance Models
Thanh Tran
 
Capturing emerging relations between schema ontologies on the Web of Data
Capturing emerging relations between schema ontologies on the Web of DataCapturing emerging relations between schema ontologies on the Web of Data
Capturing emerging relations between schema ontologies on the Web of Data
Andriy Nikolov
 
IASSIST identifiers By Joan Starr
IASSIST identifiers By Joan StarrIASSIST identifiers By Joan Starr
IASSIST identifiers By Joan Starr
Carly Strasser
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
OlafGoerlitz
 

Similar a Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration (11)

Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
 
Property-based Access of RDF Data
Property-based Access of RDF DataProperty-based Access of RDF Data
Property-based Access of RDF Data
 
Similarity on DBpedia
Similarity on DBpediaSimilarity on DBpedia
Similarity on DBpedia
 
Drupal and the Semantic Web
Drupal and the Semantic WebDrupal and the Semantic Web
Drupal and the Semantic Web
 
Keyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsKeyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance Models
 
Capturing emerging relations between schema ontologies on the Web of Data
Capturing emerging relations between schema ontologies on the Web of DataCapturing emerging relations between schema ontologies on the Web of Data
Capturing emerging relations between schema ontologies on the Web of Data
 
dsnotify presentation at www2010
dsnotify presentation at www2010 dsnotify presentation at www2010
dsnotify presentation at www2010
 
Extracting Multilingual Natural-Language Patterns for RDF Predicates
Extracting Multilingual Natural-Language Patterns for RDF PredicatesExtracting Multilingual Natural-Language Patterns for RDF Predicates
Extracting Multilingual Natural-Language Patterns for RDF Predicates
 
IASSIST identifiers By Joan Starr
IASSIST identifiers By Joan StarrIASSIST identifiers By Joan Starr
IASSIST identifiers By Joan Starr
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
 

Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

  • 1. Heterogeneous Web Data Search UsingRelevance-based On TheFly Data Integration Daniel M. Herzig, Thanh Tran WWW2012 INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN KIT – Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft www.kit.edu
  • 2. Agenda Motivation Problem Definition Existing Solutions Our Approach Entity Relevance Model (ERM) Ranking On-The-Fly alignment Experiments Conclusion 2 WWW2012 Daniel M. Herzig - Institute AIFB
  • 3. Company running a movie shopping website Movies Shopping Website Company’s dataset Ds 3 WWW2012 Daniel M. Herzig - Institute AIFB
  • 4. Users search the website via forms. Search request is internally executed as a structured query Steven Spielberg i:directors i:year ?x type qs Ds 1982 i:movie Structured Query IMdbi: (e.g. SQL, SPARQL) 4 WWW2012 Daniel M. Herzig - Institute AIFB Screenshot of http://www.imdb.com/search/title
  • 5. Company discovers the plethora of Linked Data available on the Web and identifies Data Sources beneficial for its business qs Ds Linked Data on the Web http://richard.cyganiak.de/2007/10/lod/ 5 WWW2012 Daniel M. Herzig - Institute AIFB
  • 6. Zero Star Mugs! vs. 6 WWW2012 Daniel M. Herzig - Institute AIFB
  • 7. Problems of Data Integration arise… qs does not return results qs No links, no integration No knowledge about the Ds external data schema External data might change often 7 WWW2012 Daniel M. Herzig - Institute AIFB
  • 8. Problem Definition Find relevant entities in a set of targetdatasetsDtgiven a sourcedatasetDsand an structuredentityqueryqsadhering to thevocabulary of Ds. Structured entity query qs ? Ds Dt1 Dt2 Source Dataset Target Datasets 8 WWW2012 Daniel M. Herzig - Institute AIFB
  • 9. Problem Setting Data Model is labeled directed graph Directly related to RDF RDF specifics, e.g. blank nodes, are omitted Entity query: SPARQL BGP query with one select variable Entityqueriesarethemostfrequenttype of web searchqueries, Pound et al. WWW2010 Web Data scenario: Data exhibits a heterogeneity on the schema- and data-level 9 WWW2012 Daniel M. Herzig - Institute AIFB
  • 10. Heterogeneous Web Data Daniel Craig, Steven Spielberg Coyote, Peter Spielberg, Steven (I) db:Film db:Steven_Spielberg Eric Bana a:Actors a:Directors i:actors i:directors type db:director a:ReleaseDate type ea 2005 ei i:movie ed a:Title type a:Binding i:title i:producer rdfs:label db:starring E.T. Munich a:Movie DVD Spielberg, Steven (I) 1941 (film) db:John_Candy_(actor) (1994) Amazon a: IMdbi: DBpedia db: Schema-level: actors vs. starring Data-level: Steven Spielberg vs. Spielberg, Steven Varying number of attributes per entity 10 WWW2012 Daniel M. Herzig - Institute AIFB
  • 11. Aim: Integrate External Data into the Search Process ? qs Keyword Search Wang et al.: Semplore: A scalable IR Dt approach to searchthe Web of Data. In: Journal of Web Semantics. (2009) Query rewriting based Ds on up-front data integration Dt Calì et al.: Query Rewriting and AnsweringunderConstraints in Data Integration Systems. In: IJCAI. (2003) 11 WWW2012 Daniel M. Herzig - Institute AIFB
  • 12. Existing Strategies – Keyword Search directors rainerwernerfassbinder theatrical release “Rainer Werner Fassbinder” date 1982 type movie (2) a:Directors (3) i:title Veronika Voss e1 title veronikavoss i:director a:Theatrical ?x e1 Rainer Werner Fassbinder director rainerwernerfassbinde ReleaseDate type i:released 1982 r released 1982 i:title SchindlersListe (1994) 1982 a:Movie e2 title i:director schindlersliste 1994 e2 Spielberg, Steven (I) director Amazon a: (1) IMDB i: type i:movie spielbergsteveni type movie Transform qs into keyword query Match against bag-of-words representation of entities Bridges schema differences by neglecting the structure Baseline 1 (KW), IR baseline using Semplore (Lucene) 12 WWW2012 Daniel M. Herzig - Institute AIFB
  • 13. Existing Strategies – Query Rewriting Schema Schema “Rainer Maria Amazon DBpedia Fassbinder” ?y a:Directors Ontology db:director a:Theatrical Alignment Tool ReleaseDate ?x type Amazon a: Dbpedia db: ?x a:Directors = db:director type 1982 a:Movie a:Title = db:name A:Actor = db:starring ?z Amazon a: … = … DBpedia db: Create mappings using ontology alignment tools (Falcon AO) Rewrite query using the mappings, omit missing mappings, replace constants with variables Reduces the search space, perform keyword search on top Baseline 2 (QR), database-style baseline 13 WWW2012 Daniel M. Herzig - Institute AIFB
  • 14. Heterogeneous Web Data Search UsingRelevance-based On TheFly Data Integration 14 WWW2012 Daniel M. Herzig - Institute AIFB
  • 15. Contributions (1) Novelapproachforqueryingheterogeneous Web datasources No upfrontdataintegrationnecessary Uses an EntityRelevance Model (ERM) forranking and forcomputingmappings on thefly (2) Implementation of the approach Construction of an ERM and usage for alignment and ranking Best-effort algorithm for creating mappings during runtime (3) Large-scale evaluation with 3 real-world datasets Experiments show our approach exceeds KW and QR baseline by 120%, respectively 54% in terms of Mean Average Precision. 15 WWW2012 Daniel M. Herzig - Institute AIFB
  • 16. Overview of our Approach Keyword search to cross vocabulary mismatches keyword query qs et Dt Entity Rs Relevance et Model Ds Model et leveraging the Dt Relevance Feedback structure of the data et Matching and Ranking 16 WWW2012 Daniel M. Herzig - Institute AIFB
  • 17. Entity Relevance Model (ERM) Based on Structured Relevance Model (Lavrenkoet.al 2007) Entity Relevance Model: Query specific model Captures structure and content of relevant results Composite model consisting of language models weighted by occurrence. Based on Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT- NAACL. (2007) 17 WWW2012 Daniel M. Herzig - Institute AIFB
  • 18. ERM (2) World on Wires Klaus Löwitsch label starring released starring 1973 e1 Barbara Valentin type director Film Rainer Werner Fassbinder type director released language 1982 e2 German label Veronika Voss qs Rs = {e1,e2} ERM 18 WWW2012 Daniel M. Herzig - Institute AIFB
  • 19. Modelling Target Entities Coyote, Peter Spielberg, Steven (I) i:actors i:directors type ei i:movie i:title i:producer E.T. Spielberg, Steven (I) (1994) IMdbi: Modeled the same way as ERM Language Model for each attribute 19 WWW2012 Daniel M. Herzig - Institute AIFB
  • 20. Ranking boosting seed query attributes cross entropy frequency of as Idea: Rank candidate entities according to their similarity to ERM Note: Alignment between ERM and et needed If no mapping available, use max H. 20 WWW2012 Daniel M. Herzig - Institute AIFB
  • 21. On The Fly Alignment as ~ at ?? Idea: Compare all language models of et to a field of the ERM using cross entropy H. Establish a mapping if the lowest value of H is lower than a threshold t. Worst case: n·r comparisons; n and r are usually small. Allows reuse of the computed cross entropies for subsequent ranking.
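The alignment idea on this slide can be sketched as below. It is a minimal illustration under assumptions: the function name `align`, the floor probability used to avoid log(0), and the returned cache (standing in for the reuse of cross entropies during ranking) are hypothetical choices, and the threshold value in the example is arbitrary.

```python
import math

def cross_entropy(p, q, eps=1e-6):
    """H(p, q) = -sum_w p(w) * log q(w); eps is an assumed floor for unseen terms."""
    return -sum(pw * math.log(max(q.get(w, 0.0), eps)) for w, pw in p.items())

def align(erm_lms, target_lms, threshold):
    """Map each ERM field to the target attribute with the lowest cross entropy.

    A mapping is only established if that lowest value is below `threshold`.
    All n*r computed values are kept in `cache` so ranking can reuse them.
    """
    mapping, cache = {}, {}
    for a_s, p in erm_lms.items():
        best_attr, best_h = None, math.inf
        for a_t, q in target_lms.items():
            h = cross_entropy(p, q)
            cache[(a_s, a_t)] = h
            if h < best_h:
                best_attr, best_h = a_t, h
        if best_h < threshold:
            mapping[a_s] = best_attr
    return mapping, cache

erm_lms = {
    "director": {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3},
    "released": {"1973": 0.5, "1982": 0.5},
}
target_lms = {
    "i:directors": {"fassbinder": 1 / 3, "rainer": 1 / 3, "werner": 1 / 3},
    "i:year": {"1982": 1.0},
    "i:title": {"veronika": 0.5, "voss": 0.5},
}
strict_mapping, cache = align(erm_lms, target_lms, threshold=5.0)
loose_mapping, _ = align(erm_lms, target_lms, threshold=8.0)
```

With the stricter threshold only `director` is mapped, because `released` matches `i:year` on just half of its probability mass; the looser threshold admits both mappings. The cache holds 2 × 3 = 6 entries, the n·r worst case named on the slide.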
  • 22. EXPERIMENTS
  • 23. Datasets Three real-world, heterogeneous Web datasets: (1) DBpedia 3.5.1, a structured representation of Wikipedia; (2) IMdb, information about movies; (3) Amazon, information about DVDs/videos. Datasets (2) and (3) were crawled and transformed to RDF, and were provided by L3S.
  • 24. Ground Truth [Example figure, one query per dataset. Amazon (prefix a:): ?x type a:Movie; a:Directors "Rainer Werner Fassbinder"; a:TheatricalReleaseDate 1982. DBpedia (prefix db:): ?x type db:Film; db:director db:Rainer_Werner_Fassbinder; db:released 1982. IMdb (prefix i:): ?x type i:movie; i:directors "Fassbinder, Rainer Werner"; i:year 1982.] Goal is to find the relevant entities in the target datasets. The seed query qs is manually rewritten to obtain the relevant entities in the target datasets. 3 query sets, each with 23 corresponding SPARQL BGP entity queries.
  • 25. IR Experiments Baseline KW: keyword search. Baseline QR: query rewriting. Three configurations of ERM: ERM computes alignments on the fly; ERMa uses pre-computed alignments only; ERMq uses pre-computed alignments and creates mappings on top. Six different retrieval settings.
  • 26. Results (1) – Mean Average Precision ERM improves over KW by 120% and over QR by 54%. ERMa performs slightly better than ERM. ERMq performs best.
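For readers unfamiliar with the metric, Mean Average Precision can be computed as in the following sketch. This is the standard textbook definition, not code from the evaluation; the function names and the toy rankings are illustrative only.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked result list, set of relevant items), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries; the entity names are placeholders.
runs = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),            # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```

A relative improvement of 120% then means that the MAP of the approach is 2.2 times the MAP of the baseline.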
  • 27. Results (2) – On The Fly Alignment Pooled mappings for n = 115k entities. Average Precision = 0.7 and Average Recall = 0.3 for relevant entities. Pearson correlation ρ(MAP, Precision-Rel) = 0.98.
  • 28. Results (3) – Parameter and Runtime Analysis Analysis of the model parameters: sensitivity of retrieval performance, in terms of MAP, to varying parameter configurations. Runtime analysis: execution takes less than 13 s on average; this can be improved by moving tasks (e.g. computation of language models) to index time.
  • 29. Conclusion Novel approach for searching entities in a target dataset Dt with a structured query qs adhering to the vocabulary of Ds. The Entity Relevance Model is used for ranking and for creating mappings at runtime. Experiments showed that our approach is effective and exceeds the baselines substantially.
  • 30. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration Daniel M. Herzig, Thanh Tran, herzig@kit.edu, Institute AIFB, Karlsruhe Institute of Technology, Germany. THANK YOU! [Backup slide thumbnails: Scenario Overview; Baseline Keyword Search; Query Rewriting.] ACKNOWLEDGEMENTS: We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMdb and Amazon datasets. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).
  • 31. Execution Process of our Approach (1) Run qs against Ds to obtain results Rs. (2) Build the ERM from Rs. (3) Obtain candidate entities et from Dt. (4) Compare et to the ERM. (5) Rank et according to its similarity to the ERM.
  • 32. Runtime Analysis Average execution time is less than 13 s for the parameter setting used in the IR experiments. Increasing parameter c (i.e. reducing the number of fields of the ERM) reduces execution time. Our implementation performed some tasks at runtime that can be moved to index time, so improvements are easily possible.
  • 33. Parameter Analysis The model is robust in certain parameter ranges. Boosting b: beneficial for similar datasets, not so for diverse ones. Pruning c: small effect on effectiveness, larger effect on efficiency.
  • 34. Boosting Parameter b If attribute as is present in the seed query, its boosting parameter is set to b in order to increase its influence during ranking.
  • 35. Alignment Compare the LMs (probability distributions) of the ERM and et by cross entropy.
  • 36. Related Work (excerpt) Keyword search: Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009) Query rewriting: Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003) Our approach is based on: Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007) Madhavan et al.: Web-scale Data Integration: You can afford to pay as you go. In: CIDR. (2007)