Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
1. Heterogeneous Web Data Search
Using Relevance-based On The Fly Data Integration
Daniel M. Herzig, Thanh Tran
WWW2012
INSTITUTE OF APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS
KIT – University of the State of Baden-Württemberg and
National Research Center of the Helmholtz Association www.kit.edu
2. Agenda
Motivation
Problem Definition
Existing Solutions
Our Approach
Entity Relevance Model (ERM)
Ranking
On-The-Fly alignment
Experiments
Conclusion
2 WWW2012 Daniel M. Herzig - Institute AIFB
3. Company running a movie shopping website
Movies Shopping
Website
Company’s dataset
Ds
4. Users search the website via forms.
Search request is internally executed as a
structured query
[Figure: structured query qs over Ds (IMDb, prefix i:): ?x type i:movie; ?x i:directors "Steven Spielberg"; ?x i:year 1982. Structured query, e.g. SQL or SPARQL.]
Screenshot of http://www.imdb.com/search/title
5. Company discovers the plethora of Linked Data
available on the Web and identifies Data
Sources beneficial for its business
qs
Ds
Linked Data on the Web
http://richard.cyganiak.de/2007/10/lod/
6. Zero Star Mugs!
vs.
7. Problems of Data Integration arise…
qs does not return results
No links, no integration
No knowledge about the external data schema
External data might change often
8. Problem Definition
Find relevant entities in a set of target datasets Dt, given a
source dataset Ds and a
structured entity query qs adhering to the vocabulary of Ds.
[Figure: structured entity query qs, posed against the source dataset Ds, with a "?" toward the target datasets Dt1 and Dt2]
9. Problem Setting
Data model is a labeled directed graph
Directly related to RDF
RDF specifics, e.g. blank nodes, are omitted
Entity query: SPARQL BGP query with one select variable
Entity queries are the most frequent type of web search queries (Pound et al., WWW 2010)
Web Data scenario:
Data exhibits heterogeneity at both the schema level and the data level
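To make the problem setting concrete, here is a minimal Python sketch of an entity query over a labeled directed graph. It assumes a simplification that the slides do not state: every triple pattern has the single select variable in subject position (a star-shaped BGP). The toy triples and the function name `evaluate_entity_query` are hypothetical.

```python
# Hypothetical toy graph: (subject, predicate, object) triples,
# i.e. labeled directed edges as in the data model above.
TRIPLES = {
    ("m1", "type", "i:movie"),
    ("m1", "i:directors", "Spielberg, Steven (I)"),
    ("m1", "i:year", "1982"),
    ("m2", "type", "i:movie"),
    ("m2", "i:directors", "Fassbinder, Rainer Werner"),
    ("m2", "i:year", "1982"),
}


def evaluate_entity_query(patterns, triples):
    """Return all subjects that satisfy every (?x, predicate, object) pattern.

    Assumption: the one select variable is always the subject of each pattern.
    """
    subjects = {s for s, _, _ in triples}
    return sorted(
        s for s in subjects
        if all((s, p, o) in triples for _, p, o in patterns)
    )


# Entity query with one select variable ?x.
query = [
    ("?x", "type", "i:movie"),
    ("?x", "i:directors", "Spielberg, Steven (I)"),
    ("?x", "i:year", "1982"),
]
result = evaluate_entity_query(query, TRIPLES)
```

A real system would hand the query to a SPARQL engine; the sketch only illustrates what "one select variable" means for the result set.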
10. Heterogeneous Web Data
[Figure: one movie entity from each dataset.
Amazon (a:): ea, type a:Movie, a:Title "Munich", a:Actors "Daniel Craig, Eric Bana", a:Directors "Steven Spielberg", a:ReleaseDate 2005, a:Binding "DVD".
IMDb (i:): ei, type i:movie, i:title "E.T. (1994)", i:actors "Coyote, Peter", i:directors "Spielberg, Steven (I)", i:producer "Spielberg, Steven (I)".
DBpedia (db:): ed, type db:Film, rdfs:label "1941 (film)", db:director db:Steven_Spielberg, db:starring db:John_Candy_(actor).]
Schema-level: actors vs. starring
Data-level: Steven Spielberg vs. Spielberg, Steven
Varying number of attributes per entity
11. Aim: Integrate External Data into the Search Process
[Figure: qs over Ds, with a "?" toward the target datasets Dt]
Keyword Search
Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)
Query rewriting based on up-front data integration
Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)
12. Existing Strategies – Keyword Search
[Figure: (1) structured query over Amazon (a:): ?x a:Directors "Rainer Werner Fassbinder", a:TheatricalReleaseDate 1982, type a:Movie.
(2) Derived keyword query: directors rainerwernerfassbinder theatrical release date 1982 type movie.
(3) IMDb (i:) entities and their bag-of-words representations: e1 (i:title "Veronika Voss", i:director "Rainer Werner Fassbinder", i:released 1982, type i:movie) becomes "title veronikavoss director rainerwernerfassbinder released 1982 type movie"; e2 (i:title "SchindlersListe (1994)", i:director "Spielberg, Steven (I)", type i:movie) becomes "title schindlersliste 1994 director spielbergsteveni type movie".]
Transform qs into keyword query
Match against bag-of-words representation of entities
Bridges schema differences by neglecting the structure
Baseline 1 (KW), IR baseline using Semplore (Lucene)
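The keyword-search strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: term overlap stands in for the Semplore/Lucene ranking, values are tokenized into single words (the slide fuses names into one token), and all function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def strip_prefix(term):
    """Drop a vocabulary prefix such as 'a:' or 'i:' if present."""
    return term.split(":", 1)[-1]


def query_to_keywords(patterns):
    """Flatten a structured query into keywords, ignoring variables and structure."""
    words = set()
    for _, predicate, obj in patterns:
        words.update(tokens(strip_prefix(predicate)))
        words.update(tokens(strip_prefix(obj)))
    return words


def bag_of_words(entity):
    """Bag-of-words representation of an entity (attribute names and values)."""
    bag = set()
    for attribute, value in entity.items():
        bag.update(tokens(strip_prefix(attribute)))
        bag.update(tokens(strip_prefix(value)))
    return bag


def keyword_search(patterns, entities):
    """Rank entities by term overlap (a stand-in for a real IR ranking function)."""
    keywords = query_to_keywords(patterns)
    scores = {name: len(keywords & bag_of_words(e)) for name, e in entities.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


query = [
    ("?x", "a:Directors", "Rainer Werner Fassbinder"),
    ("?x", "a:TheatricalReleaseDate", "1982"),
    ("?x", "type", "a:Movie"),
]
entities = {
    "e1": {"i:title": "Veronika Voss", "i:director": "Rainer Werner Fassbinder",
           "i:released": "1982", "type": "i:movie"},
    "e2": {"i:title": "SchindlersListe (1994)", "i:director": "Spielberg, Steven (I)",
           "type": "i:movie"},
}
ranking = keyword_search(query, entities)
```

Because the structure is discarded, the Amazon attribute name a:Directors never has to be aligned with i:director; the match comes from the shared value terms.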
13. Existing Strategies – Query Rewriting
[Figure: Amazon query (a:): ?x a:Directors "Rainer Werner Fassbinder", a:TheatricalReleaseDate 1982, type a:Movie.
An ontology alignment tool maps the Amazon schema to the DBpedia schema: a:Directors = db:director, a:Title = db:name, a:Actor = db:starring, … = …
Rewritten DBpedia query (db:): db:director pattern over the variables ?x, ?y, ?z.]
Create mappings using ontology alignment tools (Falcon AO)
Rewrite query using the mappings, omit missing mappings,
replace constants with variables
Reduces the search space; keyword search is then performed on top
Baseline 2 (QR), database-style baseline
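The rewriting step can be illustrated with a short Python sketch. It assumes the mappings are already available as a dictionary (the slides obtain them from Falcon AO, which is not called here); the function name `rewrite_query` and the fresh-variable naming are hypothetical.

```python
def rewrite_query(patterns, mappings):
    """Rewrite source-vocabulary patterns into the target vocabulary.

    Patterns whose attribute has no mapping are omitted, and every constant
    object is replaced by a fresh variable, as described on the slide.
    """
    rewritten = []
    for subject, predicate, _constant in patterns:
        if predicate not in mappings:
            continue  # omit patterns with a missing mapping
        fresh_variable = f"?v{len(rewritten) + 1}"
        rewritten.append((subject, mappings[predicate], fresh_variable))
    return rewritten


# Mappings as shown in the figure (Amazon a: to DBpedia db:).
mappings = {
    "a:Directors": "db:director",
    "a:Title": "db:name",
    "a:Actor": "db:starring",
}
query = [
    ("?x", "a:Directors", "Rainer Werner Fassbinder"),
    ("?x", "a:TheatricalReleaseDate", "1982"),
    ("?x", "type", "a:Movie"),
]
rewritten = rewrite_query(query, mappings)
```

The rewritten query only restricts candidates to entities that have a db:director attribute; the dropped constants are what the subsequent keyword search has to recover.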
14. Heterogeneous Web Data Search
Using Relevance-based On The Fly Data Integration
15. Contributions
(1) Novel approach for querying heterogeneous Web
data sources
No upfront data integration necessary
Uses an Entity Relevance Model (ERM) for ranking and
for computing mappings on the fly
(2) Implementation of the approach
Construction of an ERM and usage for alignment and ranking
Best-effort algorithm for creating mappings during runtime
(3) Large-scale evaluation with 3 real-world datasets
Experiments show our approach exceeds the KW and QR baselines by
120% and 54%, respectively, in terms of Mean Average Precision.
16. Overview of our Approach
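Keyword search to cross vocabulary mismatches
Model leveraging the structure of the data
[Figure: qs is run against Ds to obtain the results Rs (relevance feedback), from which the Entity Relevance Model is built; a keyword query retrieves candidate entities et from the target datasets Dt; matching and ranking compares the candidates with the model.]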
17. Entity Relevance Model (ERM)
Based on Structured Relevance Model (Lavrenko et al. 2007)
Entity Relevance Model:
Query specific model
Captures structure and content of relevant results
Composite model consisting of language models weighted by
occurrence.
Based on
Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)
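One plausible reading of "language models weighted by occurrence" is sketched below in Python. It assumes that the ERM has one field per attribute seen in the results, that each field holds a maximum-likelihood unigram language model over the attribute's values, and that the field weight is the fraction of results carrying the attribute. These estimation details and the name `build_erm` are assumptions, not taken from the slides.

```python
from collections import Counter, defaultdict


def build_erm(results):
    """Build a composite model from result entities.

    results: list of entities, each a dict mapping attribute -> list of values.
    Returns: attribute -> {"weight": occurrence share, "lm": term -> probability}.
    """
    term_counts = defaultdict(Counter)
    occurrences = Counter()
    for entity in results:
        for attribute, values in entity.items():
            occurrences[attribute] += 1
            for value in values:
                term_counts[attribute].update(value.lower().split())

    erm = {}
    for attribute, counts in term_counts.items():
        total = sum(counts.values())
        erm[attribute] = {
            # Assumed weighting: share of results that have this attribute.
            "weight": occurrences[attribute] / len(results),
            # Unsmoothed unigram language model over the attribute's values.
            "lm": {term: n / total for term, n in counts.items()},
        }
    return erm


# Two result entities for a Fassbinder query (toy data).
results = [
    {"label": ["World on Wires"], "type": ["Film"], "released": ["1973"],
     "director": ["Rainer Werner Fassbinder"],
     "starring": ["Klaus Löwitsch", "Barbara Valentin"]},
    {"label": ["Veronika Voss"], "type": ["Film"], "released": ["1982"],
     "director": ["Rainer Werner Fassbinder"], "language": ["German"]},
]
erm = build_erm(results)
```

In this toy example the director field gets weight 1.0 because both results have it, while language and starring each get 0.5.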
18. ERM (2)
[Figure: running qs yields Rs = {e1, e2}, from which the ERM is built.
e1: label "World on Wires", type Film, released 1973, director Rainer Werner Fassbinder, starring Klaus Löwitsch, starring Barbara Valentin.
e2: label "Veronika Voss", type Film, released 1982, director Rainer Werner Fassbinder, language German.]
19. Modelling Target Entities
[Figure: IMDb (i:) entity ei of type i:movie with i:title "E.T. (1994)", i:actors "Coyote, Peter", i:directors "Spielberg, Steven (I)", i:producer "Spielberg, Steven (I)"]
Modeled the same way as ERM
Language Model for each attribute
20. Ranking
[Equation: ranking formula not recoverable from the slide. Its annotated terms: boosting of seed query attributes, frequency of attribute as, cross entropy]
Idea:
Rank candidate entities according to their similarity to ERM
Note: Alignment between ERM and et needed
If no mapping is available, use the maximum value of H.
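Since the ranking formula itself did not survive, the following Python sketch only combines the three ingredients named on the slide in one assumed way: a boost for seed query attributes, the field weight (frequency of as), and the cross entropy H between the ERM field and the aligned attribute of the candidate. The additive combination, the smoothing floor `EPSILON` and all names are assumptions.

```python
import math

EPSILON = 1e-6              # assumed probability floor for unseen terms
MAX_H = -math.log(EPSILON)  # largest value cross_entropy can return


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())


def score(erm, entity_lms, mapping, seed_attributes, b=2.0):
    """Dissimilarity of a candidate to the ERM; lower means more similar."""
    total = 0.0
    for a_s, field in erm.items():
        a_t = mapping.get(a_s)
        if a_t in entity_lms:
            h = cross_entropy(field["lm"], entity_lms[a_t])
        else:
            h = MAX_H  # no mapping available: use the maximum H
        boost = b if a_s in seed_attributes else 1.0
        total += boost * field["weight"] * h
    return total


third = 1.0 / 3.0
erm = {
    "director": {"weight": 1.0,
                 "lm": {"rainer": third, "werner": third, "fassbinder": third}},
    "released": {"weight": 1.0, "lm": {"1982": 1.0}},
}
candidates = {
    "c1": {"i:directors": {"fassbinder": third, "rainer": third, "werner": third},
           "i:year": {"1982": 1.0}},
    "c2": {"i:directors": {"spielberg": 0.5, "steven": 0.5},
           "i:year": {"1982": 1.0}},
}
mapping = {"director": "i:directors", "released": "i:year"}
seed = {"director", "released"}
ranked = sorted(candidates, key=lambda c: score(erm, candidates[c], mapping, seed))
```

Candidate c1 shares the director terms with the ERM and therefore gets a much lower cross entropy than c2, which only matches on the year.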
21. On The Fly Alignment
as ~ at ??
Idea:
Compare all language models of et to a field of ERM using
cross entropy H.
Establish a mapping if the lowest value of H is below a
threshold t.
Worst case: n · r comparisons
n and r are usually small
Allows reuse of computed cross entropies for subsequent
ranking
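A minimal Python sketch of this best-effort alignment follows. It assumes unigram language models stored as dictionaries, a probability floor `EPSILON` for unseen terms, and an illustrative threshold value; none of these specifics come from the slides.

```python
import math

EPSILON = 1e-6  # assumed probability floor for unseen terms


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())


def align(erm_lms, entity_lms, threshold):
    """Map each ERM field to the candidate attribute with the lowest H.

    A mapping is only established if that lowest H is below the threshold.
    The cache of all computed cross entropies is returned for reuse in ranking.
    """
    mapping, cache = {}, {}
    for a_s, p in erm_lms.items():
        for a_t, q in entity_lms.items():
            cache[(a_s, a_t)] = cross_entropy(p, q)
        if not entity_lms:
            continue
        best = min(entity_lms, key=lambda a_t: cache[(a_s, a_t)])
        if cache[(a_s, best)] < threshold:
            mapping[a_s] = best
    return mapping, cache


third = 1.0 / 3.0
erm_lms = {
    "director": {"rainer": third, "werner": third, "fassbinder": third},
    "released": {"1982": 1.0},
    "language": {"german": 1.0},
}
entity_lms = {
    "i:directors": {"fassbinder": third, "rainer": third, "werner": third},
    "i:year": {"1982": 1.0},
    "i:title": {"veronika": 0.5, "voss": 0.5},
}
mapping, cache = align(erm_lms, entity_lms, threshold=2.0)
```

Every ERM field is compared with every attribute of the candidate, so this toy run performs 3 × 3 = 9 comparisons; the language field stays unmapped because no attribute falls below the threshold.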
22. EXPERIMENTS
23. Datasets
Three real-world, heterogeneous Web datasets:
(1) DBpedia 3.5.1, structured representation of Wikipedia
(2) IMDb, information about movies
(3) Amazon, information about DVD/Videos
Datasets (2) and (3) were crawled and transformed to RDF; they were provided by L3S
24. Ground Truth
[Figure: a seed query and its manual rewrites.
Amazon (a:): ?x a:Directors "Rainer Werner Fassbinder", a:TheatricalReleaseDate 1982, type a:Movie.
DBpedia (db:): ?x db:director db:Rainer_Werner_Fassbinder, db:released 1982, type db:Film.
IMDb (i:): ?x i:directors "Fassbinder, Rainer Werner", i:year 1982, type i:movie.]
Goal is to find relevant entities in the target datasets
Manually rewriting the seed query qs to obtain the relevant
entities in the target datasets.
3 query sets each with 23 corresponding entity BGP
SPARQL queries
25. IR Experiments
Baseline KW – Keyword Search
Baseline QR – Query Rewriting
Three configurations of ERM:
ERM – computes alignments on the fly
ERMa – uses pre-computed alignments only
ERMq – uses pre-computed alignments and creates mappings on
top
Six different retrieval settings.
26. Results (1) – Mean Average Precision
ERM improves over KW by 120% and over QR by 54%
ERMa performs slightly better than ERM
ERMq performs best.
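For readers unfamiliar with the metric, Mean Average Precision can be computed as in the following Python sketch. The rankings and relevance sets are made-up toy data, not results from the experiments.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical rankings for two queries.
runs = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),            # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```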
27. Results (2) – On The Fly Alignment
Pooled mappings for n = 115k entities
Average Precision = 0.7, Average Recall = 0.3 for relevant entities
Pearson correlation ρ(MAP, Precision-Rel) = 0.98
28. Results (3) – Parameter and Runtime Analysis
Analysis on the parameters of the model
Sensitivity of retrieval performance in terms of MAP for varying
parameter configurations
Runtime analysis
Execution takes less than 13s on average
Can be improved by moving tasks (e.g. computation of language
models) to index time.
29. Conclusion
Novel approach for searching entities in a target dataset Dt
with a structured query qs adhering to the vocabulary of Ds.
Entity Relevance Model used for ranking and creating
mappings during runtime.
Experiments showed that our approach is effective and
exceeds the baselines substantially.
30. Scenario Overview
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
Daniel M. Herzig, Thanh Tran
herzig@kit.edu
Institute AIFB, Karlsruhe Institute of Technology, Germany
THANK YOU!
[Figure: scenario overview with the two baselines, Keyword Search and Query Rewriting]
ACKNOWLEDGEMENTS:
We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMDb and Amazon datasets.
This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).
31. Execution Process of our Approach
[Figure: qs, Ds, Rs, Entity Relevance Model, and candidate entities et from the target datasets Dt]
Run qs against Ds to obtain results Rs
Build ERM from Rs
Obtain candidate entities et
Compare et to ERM
Rank et according to similarity to ERM
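The five steps can be read as a small pipeline. The Python sketch below wires them together with pluggable functions; the toy stand-ins (a term set as the model, term overlap as similarity) are deliberately simplistic assumptions and are not the language-model components of the approach.

```python
def search(seed_query, run_query, build_model, get_candidates, similarity):
    """Execute the process step by step; a higher similarity ranks first."""
    results = run_query(seed_query)          # 1. run qs against Ds to obtain Rs
    model = build_model(results)             # 2. build the model from Rs
    candidates = get_candidates(seed_query)  # 3. obtain candidate entities et
    scored = [(similarity(model, c), c) for c in candidates]  # 4. compare et to model
    scored.sort(key=lambda pair: pair[0], reverse=True)       # 5. rank by similarity
    return [c for _, c in scored]


# --- Hypothetical toy stand-ins for the four components ---
def terms(text):
    return set(text.lower().replace(",", " ").split())


def run_query(seed_query):
    return ["Rainer Werner Fassbinder 1982"]  # pretend results Rs from Ds


def build_model(results):
    return set().union(*(terms(r) for r in results))


def get_candidates(seed_query):
    return ["Spielberg, Steven (I) 1982", "Fassbinder, Rainer Werner 1982"]


def similarity(model, candidate):
    return len(model & terms(candidate))


ranked = search("qs", run_query, build_model, get_candidates, similarity)
```

Keeping the components pluggable mirrors the slide: the same control flow applies whether candidates come from keyword search or from a rewritten query.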
32. Runtime Analysis
Average execution time less than 13 sec for the parameter setting used in the IR
experiments.
Increasing parameter c (i.e. reducing the number of fields of ERM) improves
runtime performance
Our implementation performed some tasks at runtime, which can be moved to index time
Improvements are easily possible
33. Parameter Analysis
Model is robust in certain parameter ranges
Boosting b: Beneficial for similar datasets, less so for
diverse ones
Pruning c: Small effect on effectiveness, larger effect on
efficiency
34. Boosting Parameter b
If attribute as is present in the seed query, the boosting
parameter is set to b, in order to increase its influence
during ranking.
35. Alignment
Compare the language models (probability distributions) of ERM and et by cross entropy
36. Related Work (excerpt)
Keyword Search
Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)
Query rewriting
Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)
Our approach is based on
Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)
Madhavan et al.: Web-scale Data Integration: You can afford to pay as you go. In: CIDR. (2007)