Index Structures and Top-k Joins for Native Keyword Search Databases

Günter Ladwig, Thanh Tran
Conference on Information and Knowledge Management (CIKM 2011)

Institute of Applied Informatics and Formal Description Methods (AIFB)

KIT – University of the State of Baden-Württemberg and
National Large-scale Research Center of the Helmholtz Association www.kit.edu

Contents

Introduction
  Native keyword search
  Contributions
Index Structures
  d-length 2-Hop Cover
  Path indexes
Keyword Query Processing
  Integrated Query Plan
  Operator Ranking
Evaluation
Conclusion

October 25th, 2011, CIKM 2011, Glasgow

Keyword Search on Graph-Structured Data
[Figure: data graph whose elements match the keywords “john”, “2009”, “acme”, “steve”, “mary”, and “alice”; example queries: “steve 2009” and “john steve alice”]

Keyword queries over structured data
Approaches
Query translation (based on schema exploration)
Native keyword search (based on data graph exploration)


Native Keyword Search
[Figure: data graph with the keyword-matching elements for the example queries “steve 2009” and “john steve alice”]

Match keywords to elements of the data graphs
Find structures connecting these elements (Steiner graphs)
More expensive than query translation approaches
Preprocess data to reduce online effort


Native Keyword Search: EASE

Indexes at the level of r-maximal subgraphs
Given keyword query find relevant subgraphs using index
Explore subgraphs to construct Steiner graphs
[Figure: indexed subgraphs for the query “steve 2009” and their exploration]

High redundancy

Requires special operations: exploration, pruning


Native Keyword Search using Top-k Joins

Fine-grained indexing at the level of paths
[Figure: path-level index entries for the query “steve 2009”, combined by joins]

More pruning, less redundancy: less storage required
Enables use of database query processing concepts
Data access and top-k joins
Keyword search is now a “traditional” query processing problem


Contributions

We propose a new processing strategy for the keyword search problem based on the standard database operations data access and join
For efficient data access, we extend the 2-hop cover to precompute and materialize neighborhoods of data elements, indexing the data at the level of paths
Keyword search requires consideration of a large number of query plans: a push-based top-k join procedure ranks query plans during processing


INDEX STRUCTURES



d-length 2-Hop Cover

Compact representation of connections in a graph
Used to find paths between two nodes
Extension of 2-Hop Cover to store only paths of length d or less
2-Hop Cover labels every node u with a neighborhood NBu
If two nodes u, v are connected via paths of length d or less, then NBu ∩ NBv ≠ ∅
All paths of length d or less between center nodes u and v are of the form u, …, w, …, v with w ∈ NBu ∩ NBv
w is called a hop node
Construction prunes redundant entries from neighborhoods to reduce the size of the cover

Finding Paths Using Joins

To find paths between two nodes u and v
Retrieve neighborhoods NBu and NBv
Intersect NBu and NBv to obtain all hop nodes
Reconstruct paths between u and v through hop nodes
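
The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each neighborhood is a dict mapping a hop node to its distance from the center node, and the names `hop_nodes` and `connections` as well as the toy data are hypothetical.

```python
def hop_nodes(nb_u, nb_v):
    """Hop nodes shared by two neighborhoods (dicts: hop node -> distance)."""
    return set(nb_u) & set(nb_v)


def connections(nb_u, nb_v, d):
    """Return (hop node, total length) pairs for connections of length <= d.

    Each pair stands for the paths u ... w ... v through hop node w.
    The result is ordered by total length, then by hop node name.
    """
    found = [(w, nb_u[w] + nb_v[w]) for w in hop_nodes(nb_u, nb_v)]
    within_d = [pair for pair in found if pair[1] <= d]
    return sorted(within_d, key=lambda pair: (pair[1], pair[0]))


# Toy neighborhoods (hypothetical data, assumed for illustration only)
nb_p4 = {"o1": 1, "p3": 1, "l1": 2}
nb_p2 = {"o1": 1, "p3": 1, "o3": 1}
result = connections(nb_p4, nb_p2, d=2)
```

Reconstructing the actual paths through each hop node would additionally need the stored paths, which this sketch leaves out.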

[Figure: neighborhoods of two center nodes in the example graph; the nodes they share are the hop nodes]

Intersection is performed as rank join
Rank join requires input to be sorted
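
To illustrate why sorted input matters, here is a small rank join sketch in Python. It is a simplified threshold-based variant written for this summary, not the operator from the paper. Assumptions: each input is a list of (hop node, cost) pairs sorted by ascending cost, lower cost is better, costs are non-negative, each hop node appears at most once per input, and the combined cost is the sum of both costs. The function name `rank_join` and the data are hypothetical.

```python
import heapq


def rank_join(left, right, k):
    """Top-k join on hop node over two cost-sorted lists of (hop node, cost)."""
    seen_l, seen_r = {}, {}      # hop node -> cost, entries read so far
    candidates = []              # min-heap of (combined cost, hop node)
    results = []
    i = j = 0                    # read positions in left and right
    inf = float("inf")
    while len(results) < k:
        # Lower bounds for join results that need an entry not read yet:
        # next unread entry of one input plus the cheapest entry of the other.
        t_left = left[i][1] + right[0][1] if i < len(left) and right else inf
        t_right = right[j][1] + left[0][1] if j < len(right) and left else inf
        threshold = min(t_left, t_right)
        if candidates and candidates[0][0] <= threshold:
            results.append(heapq.heappop(candidates))   # safe to emit
        elif threshold == inf:
            break                                       # inputs exhausted
        elif t_left <= t_right:
            w, cost = left[i]
            i += 1
            seen_l[w] = cost
            if w in seen_r:
                heapq.heappush(candidates, (cost + seen_r[w], w))
        else:
            w, cost = right[j]
            j += 1
            seen_r[w] = cost
            if w in seen_l:
                heapq.heappush(candidates, (seen_l[w] + cost, w))
    return results


left_entries = [("w1", 1.0), ("w2", 2.0), ("w3", 2.0)]
right_entries = [("w3", 1.0), ("w2", 1.0), ("w9", 3.0)]
top2 = rank_join(left_entries, right_entries, k=2)
```

Because both inputs are sorted, the threshold is a valid lower bound for every result not yet produced, so the join can stop as soon as k results are emitted without reading the remaining entries.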

Index Storage

Pruned neighborhoods are stored as path entries
Path entry (w,s) for each hop node w in NBu

Path entry index
Maps nodes to their path entries

Node   Path entries (sorted)
u1     (w1, 1.0), (w2, 2.0), (w3, 2.0)
u2     (w5, 1.0)
…

Path index
Stores paths for all center nodes and their path entries
Used to reconstruct paths
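
The two indexes can be pictured with a small in-memory sketch. This is an illustration under assumptions, not the storage layout used in the paper: the class `PathEntryIndex`, the plain dict used as a path index, and the stored paths are hypothetical, and entries are assumed to be sorted by ascending score.

```python
from bisect import insort
from collections import defaultdict


class PathEntryIndex:
    """Maps each center node to its path entries, kept sorted by score."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, center, hop, score):
        # Stored as (score, hop) so that insort keeps the list ordered by score.
        insort(self._entries[center], (score, hop))

    def entries(self, center):
        """Path entries of a center node as (hop, score), ascending by score."""
        return [(hop, score) for score, hop in self._entries.get(center, [])]


index = PathEntryIndex()
for center, hop, score in [("u1", "w3", 2.0), ("u1", "w1", 1.0),
                           ("u2", "w5", 1.0), ("u1", "w2", 2.0)]:
    index.add(center, hop, score)

# Hypothetical path index: (center node, hop node) -> stored paths
path_index = {("u1", "w1"): [["u1", "w1"]]}
```

Keeping the entries sorted at insertion time means a lookup can hand them directly to a join that expects sorted input.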


KEYWORD QUERY
PROCESSING


Keyword Query Processing

Use joins to find connections between matching elements
for all keywords
Base inputs: keyword neighborhood for each keyword
Union of matching elements’ neighborhoods
Process
Data access to retrieve keyword
neighborhoods
Joins to connect keyword matching
elements
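
A keyword neighborhood, as described above, can be sketched as the union of the neighborhoods of all elements that match the keyword. The following Python snippet is a minimal illustration; the function `keyword_neighborhood`, the keyword index, and the data are hypothetical, and scores are assumed to sort in ascending order.

```python
def keyword_neighborhood(keyword, keyword_index, neighborhoods):
    """Union of the neighborhoods of all elements matching a keyword.

    Returns (hop node, center node, score) entries sorted by score, so the
    result can feed a join that expects sorted input.
    """
    entries = []
    for center in keyword_index.get(keyword, ()):
        for hop, score in neighborhoods.get(center, ()):
            entries.append((hop, center, score))
    return sorted(entries, key=lambda e: (e[2], e[0], e[1]))


# Hypothetical data: two elements match "steve"
keyword_index = {"steve": ["p1", "p2"]}
neighborhoods = {
    "p1": [("o1", 1.0)],
    "p2": [("o1", 1.0), ("p3", 2.0)],
}
nb_steve = keyword_neighborhood("steve", keyword_index, neighborhoods)
```

Each entry keeps its center node so that a later join can tell which matching element a connection belongs to.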

[Figure: query plan joining the keyword neighborhoods of steve, john, and alice]
Are all possible plans valid?


Query Plans

[Figure: example graph with “john”, “steve”, and “alice” and a query plan over alice, john, steve; with d=2 a join order can produce no results]

Join order matters
No single join order delivers all results (some might even be empty)
We do not know in advance which orders deliver results
Consider all possible join orders
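
Considering all join orders can be illustrated by enumerating one plan per permutation of the keywords. The sketch below assumes left-deep plans, which is an assumption made here for illustration; the function `left_deep_plans` and the tuple representation are hypothetical.

```python
from itertools import permutations


def left_deep_plans(keywords):
    """Yield one left-deep plan per join order as a nested tuple.

    ("join", ("join", "a", "b"), "c") reads: join a with b, then join the
    intermediate result with c.
    """
    for order in permutations(keywords):
        plan = order[0]
        for keyword in order[1:]:
            plan = ("join", plan, keyword)
        yield plan


plans = list(left_deep_plans(["alice", "john", "steve"]))
```

For three keywords this yields six plans, each with two join operators.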



Join operators in all query plans: N’(K)
Query plans for different join orders overlap
Share as many operators as possible
Join operators with sharing: N(|K|, K)

|K|   N’(K)   N(|K|, K)
2     2       1
3     12      6
4     72      24
5     480     100
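
As a quick check of the table, the N’(K) column is consistent with |K|! join orders that each use |K| − 1 join operators. This closed form is an observation that fits the four rows shown, not a formula quoted from the slides, and the function name is hypothetical. The shared count N(|K|, K) is not reconstructed here.

```python
from math import factorial


def operators_without_sharing(num_keywords):
    """Join operators over all join orders: |K|! orders, |K| - 1 joins each."""
    return factorial(num_keywords) * (num_keywords - 1)


# N'(K) values as listed in the table
table_without_sharing = {2: 2, 3: 12, 4: 72, 5: 480}
computed = {n: operators_without_sharing(n) for n in table_without_sharing}
```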


Top-k Keyword-Join Processing

High number of operators
Terminate early after computing top-k instead of all results
Rank join operators
Top-k union operator
Integrated Query Plan is a composition of many sub-plans
Some sub-plans might produce no results
Pull-based operators will block until result can be produced
Use push-based operators: execution driven by inputs instead of
results
Some sub-plans might produce results earlier than others
Rank not only results, but also rank operators


Operator Ranking

Prefer operators that have “promising” results
Global score of rank join operator, based on current results
and upper bounds for subsequent join operations
R: intermediate results
NBK: keyword neighborhoods not yet covered
Global score defined as

Join operators have a global score when they have results ready
Only the operator with the highest global score can push
results to subsequent operators
Otherwise, lower level data access operators are activated
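
The scheduling rule above can be sketched as follows. The global score formula is not reproduced in this text, so the additive score below (best ready result plus upper bounds for the keyword neighborhoods not yet covered, higher is better) is purely an assumption for illustration; the names `global_score` and `next_action` are hypothetical.

```python
def global_score(best_result_score, uncovered_upper_bounds):
    """ASSUMED score: best ready result plus optimistic bounds for the
    keyword neighborhoods the operator has not covered yet."""
    return best_result_score + sum(uncovered_upper_bounds)


def next_action(operators):
    """Decide what runs next.

    `operators` maps an operator name to a pair: (score of its best ready
    result, or None if it has no result ready; upper bounds of the keyword
    neighborhoods it has not covered yet).
    """
    ready = {
        name: global_score(best, bounds)
        for name, (best, bounds) in operators.items()
        if best is not None
    }
    if not ready:
        # No join operator has results ready: activate data access instead.
        return ("data_access", None)
    winner = max(ready, key=lambda name: (ready[name], name))
    return ("push", winner)


operators = {"j1": (0.5, [0.4]), "j2": (0.7, []), "j3": (None, [0.9])}
action = next_action(operators)
```

Only operators with a ready result take part in the ranking, which mirrors the statement that join operators have a global score once they have results ready.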


EVALUATION


Evaluation

Four approaches
EASE: indexing at the level of graphs
KJ: keyword join approach
KJU: keyword join approach without operator ranking
Datasets
BTC: 10M triples
DBLP1/5/10: 1M, 5M, 10M triples (from SP2Bench)
9 keyword queries for each dataset

Reduction of index storage size
50% (DBLP1) – 79% (DBLP10)


Results

KJ, KJU outperform EASE
Operator ranking is beneficial


Results

Benefit of operator ranking more pronounced for larger
queries as these need more join operators


Conclusion

Native keyword search based on data access and join
Index at the level of paths, instead of graphs
Top-k Keyword Join
Exploration transformed into series of join operators
Operator ranking
Reduces storage requirement and increases performance


Thank you for your attention! Questions?

Günter Ladwig, guenter.ladwig@kit.edu


BACKUP SLIDES


Introduction

Keyword search on graph-structured data (RDF)
Query Translation
Translate keywords into structured query using schema knowledge
Native Keyword Search
No translation
Match keywords to elements of the data graphs
Find structures connecting these elements (Steiner graphs)
More expensive than query translation approaches

Preprocess data and create special indexes
Reduces search space during online query processing
Requires offline preprocessing and storage


Example Query: “alice malta peter”

[Figure: example data graph with location nodes labeled Malta, organization nodes o1 and o2 labeled ABC Corp, and person nodes p1 to p4 labeled Alice, Peter, Mary, and Richard, connected by locatedIn, worksAt, and knows edges]

Match keyword elements
Find connections between keyword elements


Problem Definition

Given a graph GE=(NE,ER)
Find Steiner graphs connecting keyword elements


Scoring

Assumption: more compact Steiner graphs are more
relevant
Scoring function
GS: Steiner graph
P: set of paths connecting its keyword elements

Other functions possible, but not part of this work


Approaches

Bidirectional Search
Explore graph from keyword elements to find connections
Does not scale well
EASE
Indexes neighborhood graphs to restrict search space for
exploration
Our approach
Use database operations: data access and join
Transform graph exploration into a series of join operations
Improves storage requirements and performance


d-Length 2-Hop Cover

Preliminaries

Compact representation of connections in a graph
Used to find paths between two nodes in a graph


Construction

The trivial d-length 2-hop cover is the set of all d-neighborhoods of GE, but it contains redundancies
Finding a minimal 2-hop cover is NP-hard (Minimum Set Cover)
Approximation algorithm
Select a “best” node covering a large number of paths
Use its neighborhood to prune redundant paths from all other neighborhoods
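
The trivial cover mentioned above can be sketched with a breadth-first search that collects the d-neighborhood of every node. This is a minimal illustration that covers only the unpruned starting point, not the approximation algorithm; it assumes an undirected graph given as an adjacency dict, and the function names and toy graph are hypothetical.

```python
from collections import deque


def d_neighborhood(graph, start, d):
    """Nodes within d hops of `start`, with their distances (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue  # do not expand beyond distance d
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def trivial_cover(graph, d):
    """Unpruned cover: the d-neighborhood of every node."""
    return {node: d_neighborhood(graph, node, d) for node in graph}


# Toy undirected graph (hypothetical data)
graph = {
    "p4": ["o1"],
    "o1": ["p4", "p2", "l1"],
    "p2": ["o1"],
    "l1": ["o1"],
}
cover = trivial_cover(graph, d=2)
```

The pruning step would then remove entries from these neighborhoods that are already covered through the selected "best" nodes.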


Example: Pruning
[Figure: pruning example with d=2 on the example graph, showing center nodes, hop nodes, and pruned paths over knows, worksAt, and locatedIn edges]

Pruned paths between two nodes can be reconstructed by
intersecting their neighborhoods
Store each pruned neighborhood as a list of path entries


Neighborhood Join

[Figure: neighborhoods of the center nodes p4 and p2, joined on shared hop nodes such as o1 and p3]

Result: Keyword Graphs
p4 o1 p2 (stands for all paths of length d between p4 and p2 through o1)
p4 p3 p2
...


Graph Join

Expand keyword graphs to keyword graph neighborhoods

Keyword Graph   Keyword Graph Neighborhood
p4 o1 p2        p4 o1 p2 o3
                p4 o1 p2 l2
                l1 p4 o1 p2
                ...

Graph Join: joins keyword graph neighborhood with
keyword neighborhood



Number of join operators without operator sharing: N’(K)

Number of join operators with operator sharing: N(|K|, K)

|K|   N’(K)   N(|K|, K)
2     2       1
3     12      6
4     72      24
5     480     100

