Keyword Search on Structured Data using Relevance Models*

Veli Bicer

FZI Research Center for Information Technology
Karlsruhe, Germany

Joint work with Thanh Tran from the Semantic Search Group, AIFB Institute, KIT

* Based on papers at the 20th ACM Conference on Information and Knowledge
Management (CIKM’11) and the 10th International Semantic Web Conference (ISWC’11)
© FZI Forschungszentrum Informatik 1

About the presenter

Veli Bicer
• Research Scientist at FZI Research Center for Information Technology, Karlsruhe, Germany
• Associated Researcher at Karlsruhe Service Research Institute (KSRI)
• KSRI founded by IBM Germany
Research Interests
• Semantic Data Management/Search
• Relational Learning
• Software Engineering (for Services)
Projects
• German Internet Research Programme THESEUS
• KOIOS Semantic Search in Core Technology Cluster
• TEXO Internet-of-Services Use-case
• Previously, EU ICT Artemis, Satine, Saphire and Ride

10.04.2012 © FZI Forschungszentrum Informatik 2

Agenda

Introduction
• Keyword search on structured data
• Relevance models
Approach
• Ranking scheme using relevance models
• Top-k Query processing
Experiments
Application
• Search on environmental data
Conclusion


Introduction


Keyword Search on Structured Data

Rationale
• 4 billion web searches daily
• Data-driven websites have relational database backend
• Predefined search forms constrain retrieval
• SQL difficult to learn
• simplify data retrieval by not using SQL

Example
• Who is the character played by Audrey Hepburn in Roman Holiday?
Query result
• A tree of tuples that is reduced with respect to the query.

Person
id   name
p1   Audrey Hepburn
p3   Kate Winslet
…    …

Character
id   name            pid   mid
c1   Princess Ann    p1    m1
c3   Iris Simpkins   p3    m2
…    …               …     …

Movie
id   title           plot
m1   Roman Holiday   Princess Ann is a royal princess of unknow of an …
m2   The Holiday     Iris swaps her cottage for the holiday along the next two …
m3   The Aviator     Hughes and Hepburn go to a holiday and fly together ..
…    …               …

Which would you rather write?

    SELECT Character.name
    FROM Person, Character, Movie
    WHERE Person.id = Character.pid
      AND Character.mid = Movie.id
      AND Person.name = 'Audrey Hepburn'
      AND Movie.title = 'Roman Holiday';

• or “Hepburn Holiday”
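To make the comparison concrete, the SQL variant can be run against the toy tables from this slide. The following is a minimal sketch using Python's built-in sqlite3 module; the in-memory database and the shortened plot strings are assumptions made for illustration, not part of the original system.

```python
import sqlite3

# In-memory database holding the toy tuples shown on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person    (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Movie     (id TEXT PRIMARY KEY, title TEXT, plot TEXT);
    CREATE TABLE Character (id TEXT PRIMARY KEY, name TEXT, pid TEXT, mid TEXT);
    INSERT INTO Person VALUES ('p1', 'Audrey Hepburn'), ('p3', 'Kate Winslet');
    INSERT INTO Movie VALUES
        ('m1', 'Roman Holiday', 'Princess Ann is a royal princess ...'),
        ('m2', 'The Holiday', 'Iris swaps her cottage for the holiday ...'),
        ('m3', 'The Aviator', 'Hughes and Hepburn go to a holiday ...');
    INSERT INTO Character VALUES
        ('c1', 'Princess Ann', 'p1', 'm1'),
        ('c3', 'Iris Simpkins', 'p3', 'm2');
""")

# The structured query a user would otherwise have to write by hand.
rows = conn.execute("""
    SELECT Character.name
    FROM Person, Character, Movie
    WHERE Person.id = Character.pid
      AND Character.mid = Movie.id
      AND Person.name = 'Audrey Hepburn'
      AND Movie.title = 'Roman Holiday'
""").fetchall()
print(rows)  # [('Princess Ann',)]
```

The keyword query “Hepburn Holiday” targets the same joined tuple without requiring the user to know the table names or join keys.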


Many approaches have been proposed recently
• Performance focus
• Less consideration of ranking

Recent study (Coffman and Weaver, CIKM 2010)
• effectiveness of previous work is below expectations
• the problem lies in ranking strategies, not performance

Two major types of ranking schemes:
• IR-inspired TF-IDF ranking
• (Liu et al, 2006) (SPARK, 2007)
• Proximity based approaches
• (BANKS, 2002) (Bidirectional, 2005)

Problem:
• Missing a robust and principled approach!


Relevance Models

Proposed by Lavrenko and Croft (SIGIR 01)
Assumes that
• queries and documents are samples from a hidden representation space and
• generated from the same generative model
Initial representation of relevance is unknown
• Estimated from query

[Figure: three diagrams relating query Q, document D and relevance R: Classical Model, Language Model, Relevance Model]

Approach


Overview of Approach

Steps: 1 Query, 2 PRF, 3 Query RM, 4 Res. RM, 5 Res. Score, 6 Query Generation, 7 Structured Queries, 8 Top-k Query Proc., 9 Result Ranking

Word probabilities (p):
words       Query   Query RM   Res. RM
hepburn     0.5     0.21       0.12
holiday     0.5     0.15       0.18
audrey              0.13       0.11
katharine           0.09       0.05
princess            0.01       0.00
roman               0.01       0.06
…                   …          …

Res. Score: D(RMQ||RMR)

Feedback results (PRF):
Title                Name
Roman Holiday        Audrey Hepburn
Breakfast at Tiff.   Audrey Hepburn
The Aviator          Katharine Hepburn
The Holiday          Kate Winslet


Data Model

Different kinds of data
• e.g. relational, XML and RDF data
Data Graph of nodes and edges (G=(V,E))
Resource nodes, attribute nodes
• Every resource is typed
• Resources have unique ids (e.g. primary keys)


Edge-Specific Relevance Models (steps 1, 2, 3)

A set of feedback resources FR is retrieved from an inverted keyword index:
• E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}
Edge-specific relevance model for each unique edge e, combining:
• Probability of word at resource
• Importance of resource w.r.t. query

Inverted Index
princess → m1, c1
breakfast → m3
hepburn → m3, p1, p4, c2
melbourne → p2
iris → c3
holiday → m1, m2, m3
ann → m1, c2
…

[Figure: feedback resources FR feeding edge-specific relevance models, e.g. p1 with name “Audrey Hepburn” and birthplace “Ixelles Belgium” for keyword hepburn, and m3 with title “The Holiday” and plot “Iris swaps her cottage for the holiday along the next two …” for keyword holiday]
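The two ingredients named above (probability of a word at a resource, and importance of a resource with respect to the query) can be combined as in the following sketch. It assumes the usual relevance-model form, in which each feedback resource contributes its word distribution weighted by a query likelihood. The additive `eps` smoothing, the restriction to the two resources p1 and p4, and all function names are assumptions for illustration and not the authors' exact estimator.

```python
from math import prod

# Inverted keyword index as shown on the slide (keyword -> resource ids).
INDEX = {
    "princess": {"m1", "c1"},
    "breakfast": {"m3"},
    "hepburn": {"m3", "p1", "p4", "c2"},
    "melbourne": {"p2"},
    "iris": {"c3"},
    "holiday": {"m1", "m2", "m3"},
    "ann": {"m1", "c2"},
}

# Attribute text of two feedback resources (values from the slides);
# the other feedback resources are skipped to keep the example small.
ATTRS = {
    "p1": {"name": "audrey hepburn", "birthplace": "ixelles belgium"},
    "p4": {"name": "katharine hepburn", "birthplace": "connecticut usa"},
}

def feedback_resources(query):
    """Union of the index postings of all query keywords."""
    return set().union(*(INDEX.get(q, set()) for q in query))

def p_word(word, words):
    """Maximum-likelihood probability of `word` in a list of words."""
    return words.count(word) / len(words) if words else 0.0

def importance(resource, query, eps=0.01):
    """Assumed importance weight: product over keywords of the word
    probability in the resource plus eps (eps avoids zero products)."""
    words = " ".join(ATTRS[resource].values()).split()
    return prod(p_word(q, words) + eps for q in query)

def edge_rm(edge, query):
    """Relevance model for one edge: word probability at each resource's
    attribute, weighted by the resource's importance, then normalised."""
    rm = {}
    for r in sorted(feedback_resources(query)):
        if r not in ATTRS:
            continue  # no text available for this resource in the sketch
        words = ATTRS[r].get(edge, "").split()
        weight = importance(r, query)
        for w in set(words):
            rm[w] = rm.get(w, 0.0) + p_word(w, words) * weight
    total = sum(rm.values())
    return {w: p / total for w, p in rm.items()}

query = ["hepburn", "holiday"]
fr = feedback_resources(query)
rm_name = edge_rm("name", query)
```

With this data, `fr` matches the FR set given on the slide, and the relevance model for the `name` edge puts the most mass on "hepburn" because it occurs in both feedback resources.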

Edge-Specific Resource Models (steps 4, 5)

Each resource (a tuple) is also represented as an RM
• …as final results (joint tuples) are obtained by combining resources
Edge-specific resource model (ResM)

The score of a resource: cross-entropy of edge-specific RM and ResM.
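As an illustration of scoring by cross-entropy, the sketch below compares one query-side relevance model with two resource models. The distributions are hypothetical, and the `floor` parameter is an assumed guard against log(0) for unsmoothed models; a lower cross-entropy means the resource model is closer to the relevance model.

```python
from math import log

def cross_entropy(rm, res_m, floor=1e-6):
    """H(RM, ResM) = -sum_w P(w|RM) * log P(w|ResM).
    `floor` replaces zero or missing probabilities (an assumption)."""
    return -sum(p * log(max(res_m.get(w, 0.0), floor)) for w, p in rm.items())

# Hypothetical relevance model and two hypothetical resource models.
rm = {"hepburn": 0.5, "audrey": 0.3, "holiday": 0.2}
res_a = {"hepburn": 0.5, "audrey": 0.4, "holiday": 0.1}
res_b = {"winslet": 0.5, "kate": 0.4, "holiday": 0.1}

scores = {"a": cross_entropy(rm, res_a), "b": cross_entropy(rm, res_b)}
```

Here resource `a` shares most of its vocabulary with the relevance model and therefore obtains the lower (better) cross-entropy.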


Smoothing

Well-known technique to address data sparseness and improve accuracy of RMs (and LMs)
• The word probability at a resource attribute, e.g. P_name(v | p1), is the core probability for both query and resource RM
Local smoothing

Neighborhood of attribute a is another attribute a’:
• a and a’ share the same resources
• resources of a and a’ are of the same type
• resources of a and a’ are connected over a FK

Smoothing

[Figure: neighborhood of a. Person p1 (name “Audrey Hepburn”, birthplace “Ixelles Belgium”), Person p4 (name “Katharine Hepburn”, birthplace “Connecticut USA”), Character c1 (name “Princess Ann”) linked to p1 via pid_fk]

P_name(v | p1), unsmoothed and after each successive smoothing step:
words         unsmoothed   step 1   step 2   step 3
audrey        0.5          0.4      0.37     0.36
hepburn       0.5          0.4      0.39     0.38
ixelles                    0.1      0.09     0.08
belgium                    0.1      0.09     0.08
katharine                           0.02     0.01
connecticut                         0.02     0.01
usa                                 0.02     0.01
princess                                     0.035
ann                                          0.035

Smoothing of each type is controlled by weights,
where γ1, γ2, γ3 are control parameters set in experiments
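One common way to realise weighted smoothing is linear interpolation between an attribute's own word distribution and the distributions of its neighbors. The sketch below assumes that form; the function names and the value γ1 = 0.2 are assumptions (0.2 is chosen only because it reproduces the second column of the table for p1, it is not stated on the slide).

```python
def mle(text):
    """Maximum-likelihood word distribution of a text value."""
    words = text.split()
    return {w: words.count(w) / len(words) for w in set(words)}

def smooth(base, neighbours):
    """Linear interpolation (an assumed form of the smoothing):
    (1 - sum of gammas) * base + sum of gamma_i * neighbour_i.
    `neighbours` is a list of (gamma, model) pairs."""
    rest = 1.0 - sum(g for g, _ in neighbours)
    vocab = set(base).union(*(m for _, m in neighbours))
    return {
        w: rest * base.get(w, 0.0) + sum(g * m.get(w, 0.0) for g, m in neighbours)
        for w in vocab
    }

# Attribute values of p1 from the slide.
name_p1 = mle("audrey hepburn")
birthplace_p1 = mle("ixelles belgium")

# First smoothing type: attributes sharing the same resource.
smoothed = smooth(name_p1, [(0.2, birthplace_p1)])
```

With these inputs, "audrey" and "hepburn" drop from 0.5 to 0.4, while "ixelles" and "belgium" each receive 0.1. Same-type and FK-connected neighbors would be added as further `(gamma, model)` pairs.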


Ranking JRTs (step 9)

Ranking aggregated JRTs:
• Cross entropy between edge-specific RM (Query Model) and geometric mean of combined edge-specific ResM

The proposed score is monotonic w.r.t. individual resource scores
• …a desired property for most top-k algorithms
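The monotonicity claim can be checked with a short derivation. The sketch below assumes a single edge, n combined resource models ResM_1 … ResM_n, and an unnormalised geometric mean; these simplifications are assumptions and not the exact formulation from the papers.

```latex
\begin{aligned}
\mathrm{Score}(JRT)
  &= -\sum_{w} P(w \mid RM)\,
     \log\Big(\prod_{i=1}^{n} P(w \mid ResM_i)\Big)^{1/n} \\
  &= \frac{1}{n}\sum_{i=1}^{n}
     \Big(-\sum_{w} P(w \mid RM)\,\log P(w \mid ResM_i)\Big) \\
  &= \frac{1}{n}\sum_{i=1}^{n} H(RM, ResM_i)
\end{aligned}
```

Under these assumptions the aggregated score is the average of the individual resource scores, so improving any single resource score can never worsen the score of the combined result.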


Query Translation* (steps 6, 7)

Mapping of keywords to data elements
• Result in a set of keyword elements
Data Graph exploration
• Search for substructures (query graph) connecting keyword elements
• Bi-directional exploration of query graphs operates on summary of data graph only
Top-k computation
• Search guided by a scoring function to output only the top-k queries
Query graphs to be processed
• Free vs. Non-free variables

[Figure: keywords Hepburn and Holiday mapped to name and title elements (p4, p1, m1, m3); Summary Graph with types Person, Character, Movie, Location, Producer, Studio and edges pid_fk, mid_fk, bornIn, Is-a, hasDist, hasLoc, worksFor; resulting query graph ?p (type Person, name Hepburn), pid_fk, ?c (type Character), mid_fk, ?m (type Movie, title Holiday)]

*[Tran et al. ICDE’09]

Top-k Query Processing (step 8)

Top-k query processing (TQP) is highly common in Web-accessible databases
• return K highest-ranked answers
• avoid unnecessary accesses to database
TQP assumes
• Scoring function and attribute values to be known a-priori (e.g. RankJoin)
• Combine attribute values by aggregation function
• Sorted access (SA), random access (RA) probes
How to adapt TQP to return top-k relevant results?
• Results are joined set of resources
• Scores are query-dependent
• No indexing is possible
Idea:
• Retrieve resources for non-free variables and rank
• Use SA on those initially retrieved resources
• Use RA to find other resources

Top-k Query Processing
Result candidate c = <(x1,…,xk), score>
• complete when all variables are bound to some resources
• xi = * indicates unbound
Binding operator
• c’ = (c, xi → ri)
Threshold determines upper bound for unseen resources
• Scheduling between SA and RA
• Tight bound is desired

Query graph: ?p (type Person, name Hepburn), pid_fk, ?c (type Character), mid_fk, ?m (type Movie, title Holiday)
Output: K=1

Person
id   name                S(r)
p1   Audrey Hepburn      0.20
p3   Katharine Hepburn   0.18
p5   Philip Hepburn      0.13
p6   Anna Hepburn        0.12

Character (bound for unseen resources: 0.11)
id   name                S(r)
c1   Princess Ann        0.10
c2   Katharine Hepburn
c3   Iris Simpkins       0.05
c4   Louise

Movie
id   title               S(r)
m2   The Holiday         0.19
m1   Roman Holiday       0.18
m3   Holiday Blues       0.09
m4   Family Holiday      0.08

Priority queue after each step (threshold 0.50 at the start):
1. <(p1,*,*),0.50>, <(*,*,m2),0.50>
2. <(p1,*,*),0.50>, <(*,*,m2),0.50>, <(p3,*,*),0.48>
3. <(*,*,m2),0.50>, <(p1,c1,*),0.49>, <(p3,*,*),0.48>
4. <(p1,c1,*),0.49>, <(p3,*,*),0.48>, <(*,c3,m2),0.44>
5. <(p3,*,*),0.48>, <(*,c3,m2),0.44>; output <(p1,c1,m1),0.48>
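The walkthrough above can be approximated by a best-first search over partially bound candidates. The following is a simplified sketch, not the algorithm from the papers: it seeds the queue with all Person and Movie resources at once instead of interleaving sorted and random accesses against a threshold, it sums resource scores as the walkthrough's numbers suggest, and the foreign-key links in `LINKS` are hypothetical.

```python
import heapq

# Resource scores S(r) from the walkthrough.
PERSON = {"p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12}
CHARACTER = {"c1": 0.10, "c3": 0.05}
MOVIE = {"m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08}
TABLES = (PERSON, CHARACTER, MOVIE)
# Upper bounds substituted for unbound variables ("*").
BOUND = (0.20, 0.11, 0.19)
# Hypothetical foreign keys: character id -> (person id, movie id).
LINKS = {"c1": ("p1", "m1"), "c3": ("p3", "m2")}

def upper_bound(cand):
    """Sum of known scores, using the bound for every unbound slot."""
    return sum(BOUND[i] if x == "*" else TABLES[i][x] for i, x in enumerate(cand))

def expand(cand):
    """Binding operator: bind one more variable by following foreign keys."""
    p, c, m = cand
    if c == "*":
        for cid, (pid, mid) in LINKS.items():
            if p in ("*", pid) and m in ("*", mid):
                yield (p, cid, m)
    else:
        pid, mid = LINKS[c]
        if p == "*" and pid in PERSON:
            yield (pid, c, m)
        elif m == "*" and mid in MOVIE:
            yield (p, c, mid)

def top_k(k):
    """Pop candidates by descending upper bound; a popped complete
    candidate cannot be beaten by anything still in the queue."""
    seeds = [(p, "*", "*") for p in PERSON] + [("*", "*", m) for m in MOVIE]
    heap = [(-upper_bound(c), c) for c in seeds]
    heapq.heapify(heap)
    seen, results = set(seeds), []
    while heap and len(results) < k:
        neg, cand = heapq.heappop(heap)
        if "*" not in cand:
            results.append((cand, -neg))
            continue
        for nxt in expand(cand):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-upper_bound(nxt), nxt))
    return results

best = top_k(1)
```

Because the bounds never underestimate the score of a completed candidate, the first complete candidate popped from the queue is the top-1 result, here (p1, c1, m1) with a score of about 0.48, matching the walkthrough.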


Experiments


Experiments

Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases
Queries: 50 queries for each dataset including “TREC style” queries and “single resource” queries
Metrics: (1) the number of top-1 relevant results, (2) Reciprocal rank and (3) Mean Average Precision (MAP)
Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)
RM-S: Our approach
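For reference, the two rank-based metrics can be computed as in the following sketch. It assumes binary relevance judgments given in rank order and normalises average precision by the number of relevant results that appear in the ranking; the evaluation in the papers may normalise differently.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance):
    """Mean of precision@rank over the ranks holding a relevant result."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Hypothetical judgments for two queries (1 = relevant, 0 = not relevant).
runs = [[1, 0, 1], [0, 1]]
map_score = mean_average_precision(runs)
```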


Experiments

[Figure: MAP scores for all queries]

[Figure: Reciprocal rank for single resource queries]


Experiments

[Figure: Precision-recall for TREC-style queries on Wikipedia]

Application


Large amount of environmental data

Environmental issues stir public interest
• Increase transparency, awareness, responsibility, protection
Growing amount of data
• Public access through EU directive 2003/4/EC
• PortalU (Germany) http://www.portalu.de/
• EDP (UK) http://www.edp.nerc.ac.uk
• Envirofacts (USA) http://www.epa.gov/enviro/index.html
Linking data in international context
• Local government environmental databases as part of the LOD cloud
• Linked environment data for the life sciences


Opportunity: mass dissemination and
consumption of environmental data
The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it!
Complex results
• CO emission values around Karlsruhe area in Germany
Analytics
• CO emission values around Karlsruhe area in Germany
• Sorted by year
• Bar chart
• Emission values of US and Germany
• Compare average
• Timeline visualization


KOIOS – Overview

A semantic search system
• Exploit semantics in the data for keyword interpretation to hide complexity of query languages and data representation
• Keyword search for searching structured data
• Lower access barriers while enabling richness of data to be fully harnessed
Contribution
• Transfer research results to commercial EIS
• Selector mechanism
Process
• Input: keywords
• Facet-based refinement
• Selector (result and view template) initialization
• Output: query results embedded in specific views


KOIOS – Architecture


Facets generation
Derive facets from query results (not from query!) for refinement
• Attributes serve as facet categories
• Attribute values as facet values
E.g. for ?s
• Statistics.description: “CO-Emission , PKW”, “CO-Emission , LKW”…
• Value.year: 2005,2006,…
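Deriving facets from results can be sketched as grouping the distinct attribute values found in the result rows. The row layout (one dictionary per result) and the function name `derive_facets` are assumptions for illustration; the attribute names and values follow the example above.

```python
from collections import defaultdict

def derive_facets(results):
    """Attributes become facet categories, their distinct values facet values."""
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    return {attribute: sorted(values) for attribute, values in facets.items()}

# Hypothetical result rows for the variable ?s.
results = [
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, LKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2006},
]
facets = derive_facets(results)
```

Because the facets come from the results rather than from the query, only values that actually occur in the current result set are offered for refinement.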


Selectors

Selector: parameterized, predefined result and view templates
• Data parameters: specify scope of information need, initialized to particular values based on facet categories and values
• Query parameter: additional data processing for analysis tasks (GROUP-BY, SORT, MIN, MAX, AVERAGE etc.)
• Presentation parameter: visualization types (data value, data series, data table, map-based, specific diagram type, etc.)


Selector initialization

Selectors
• capture templates for information needs and presentation of their results
Map facets to selectors and initialize them
• Applicable selectors: cover facet categories
• Initialize selectors based on facet values
• Initialized values are captured in the WHERE clause
• Non-initialized parameters are included in the SELECT clause
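The split between WHERE and SELECT can be illustrated with a small query builder. This is a sketch under assumptions: the function `build_query`, the table name `emissions` and the parameter names are hypothetical, and the real system may generate a different query language than SQL.

```python
def build_query(table, parameters, facet_values):
    """Initialised parameters become WHERE conditions; the remaining
    parameters are returned in the SELECT clause."""
    bound = {p: facet_values[p] for p in parameters if p in facet_values}
    free = [p for p in parameters if p not in facet_values]
    select = ", ".join(free) if free else "*"
    sql = f"SELECT {select} FROM {table}"
    if bound:
        sql += " WHERE " + " AND ".join(f"{p} = ?" for p in bound)
    return sql, list(bound.values())

# The user picked the facet value year = 2005; the other parameters stay open.
sql, args = build_query("emissions", ["description", "year", "value"], {"year": 2005})
print(sql)   # SELECT description, value FROM emissions WHERE year = ?
print(args)  # [2005]
```

Passing the facet values as bound arguments instead of formatting them into the string keeps the generated query safe to execute.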


Deployment
Hippolytos project (Theseus)
• Easy access to spatial data warehouse (disy Cadenza) built for domain of environmental administration
Data about
• Emission and waste
• From Baden-Württemberg
• Provided by: Umweltinformationssystem (UIS) Baden-Württemberg, Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg and Statistisches Landesamt Baden-Württemberg


Facets and selectors


Conclusions

Keyword search on structured data is a popular problem for
which various solutions exist.

We focus on the aspect of result ranking, providing a principled
approach that employs relevance models.

Experiments show that RMs are promising for searching
structured data.

Top-k query processing is proposed to retrieve only the most relevant results.

Application on environmental data enables intuitive
• Access
• Visualization
• Analysis of environmental information!


Thank you for your attention!
Questions?


Challenges: intuitive access and visualization of structured environmental data and analytics
The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it!

Complex structured queries
• Knowledge of the underlying data / query language
Complex structured data
• Heterogeneity and distribution of environmental data is overwhelming
Complex structured results
• Understanding results and extracting relevant information / analytics are difficult tasks


KOIOS

Semantic search system, KOIOS, for intuitive access, analysis, and visualization of structured environmental information

Overview and architecture
Structured query generation from keywords
Facet-based browsing and refinement
Selector initialization for final result and view construction
Implementation and deployment
Conclusions


Conclusions

Replace predefined forms and hard-coded visualization
Semantic search using lightweight semantics in data and schema to dynamically
• Translate keywords to queries
• Generate facets for results
• Initialize result and presentation templates
Enables intuitive
• Access
• Visualization
• Analysis of environmental information!



Ranking Schemes

Proximity between keyword nodes
• EASE:

• XRank:
• w is the smallest text window in n that contains all search keywords

2012-4-10
SIGMOD09 Tutorial 50

Ranking Schemes

Based on graph structure
• BANKS
• Nodes:
• Edges:
• PageRank-like methods
• XRank [Guo et al, SIGMOD03]
• ObjectRank [Balmin et al, VLDB04]: considers both Global ObjectRank and Keyword-specific ObjectRank


Ranking Schemes

TF*IDF based:

Score(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N + 1}{df}

• Discover/EASE
• [Liu et al, SIGMOD06]

• SPARK
• but not at the node level


Relevance Models

Sample query: q1 = israeli, q2 = palestinian, q3 = raids

Relevance model probabilities:
P(w|Q)   w
.077     palestinian
.055     israel
.034     jerusalem
.033     protest
.027     raid
.011     clash
.010     bank
.010     west
.010     troop
…

P(w \mid q_1 \ldots q_k) = \frac{P(w)\, P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}, \qquad P(q_1 \ldots q_k \mid w) = \prod_{q} P(q \mid w), \qquad P(q \mid w) = \sum_{M} P(q \mid M)\, P(M \mid w)
