1. Index Structures and Top-k Joins for Native Keyword Search Databases
Günter Ladwig, Thanh Tran
Conference on Information and Knowledge Management (CIKM2011)
Institute of Applied Informatics and Formal Description Methods (AIFB)
KIT – University of the State of Baden-Württemberg and
National Large-scale Research Center of the Helmholtz Association www.kit.edu
2. Contents
Introduction:
Native keyword search
Contributions
Index Structures
d-length 2-Hop Cover
Path indexes
Keyword Query Processing
Integrated Query Plan
Operator Ranking
Evaluation
Conclusion
2 October 25th, 2011 CIKM 2011, Glasgow Institute of Applied Informatics and Formal Description Methods (AIFB)
3. Keyword Search on Graph-Structured Data
[Figure: example data graph with nodes labeled "john", "steve", "mary", "alice", "acme", and "2009"]
Queries: "steve 2009", "john steve alice"
Keyword queries over structured data
Approaches
Query translation (based on schema exploration)
Native keyword search (based on data graph exploration)
4. Native Keyword Search
[Figure: example data graph with nodes labeled "john", "steve", "mary", "alice", "acme", and "2009", with keyword matches highlighted]
Queries: "steve 2009", "john steve alice"
Match keywords to elements of the data graphs
Find structures connecting these elements (Steiner graphs)
More expensive than query translation approaches
Preprocess data to reduce online effort
5. Native Keyword Search: EASE
Indexes at the level of r-maximal subgraphs
Given keyword query find relevant subgraphs using index
Explore subgraphs to construct Steiner graphs
[Figure: the example data graph split into overlapping indexed subgraphs; the query "steve 2009" selects relevant subgraphs, which are then explored]
High redundancy
Requires special operations: exploration, pruning
6. Native Keyword Search using Top-k Joins
Fine-grained indexing at the level of paths
[Figure: the example data graph indexed as paths; the query "steve 2009" is answered by joins over the indexed paths]
More pruning, less redundancy: less storage required
Enables use of database query processing concepts
Data access and top-k joins
Keyword search is now a “traditional” query processing problem
7. Contributions
We propose a new processing strategy for the keyword search problem based on standard database operations: data access and join
For efficient data access, we extend the 2-hop cover to precompute and materialize neighborhoods of data elements, indexing the data at the level of paths
Keyword search requires consideration of a large number of query plans: a push-based top-k join procedure ranks query plans during processing
8. INDEX STRUCTURES
9. d-length 2-Hop Cover
Compact representation of connections in a graph
Used to find paths between two nodes
Extension of 2-Hop Cover to store only paths of length d or less
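The 2-hop cover labels every node u with a neighborhood $NB_u$
If two nodes u, v are connected via paths of length d or less, then $NB_u \cap NB_v \neq \emptyset$
All paths of length d or less between center nodes u and v are of the form $u \rightsquigarrow w \rightsquigarrow v$ with $w \in NB_u \cap NB_v$
w is called a hop node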
Construction prunes redundant entries from
neighborhoods to reduce size of the cover
10. Finding Paths Using Joins
To find paths between two nodes u and v
Retrieve neighborhoods NBu and NBv
Intersect NBu and NBv to obtain all hop nodes
Reconstruct paths between u and v through hop nodes
[Figure: neighborhoods of two center nodes in the example graph ("steve", "mary", "john", "alice", "acme", "2009"), with a shared hop node marked]
Intersection is performed as rank join
Rank join requires input to be sorted
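The steps above can be sketched as a small rank join over two neighborhoods. This is an illustration under stated assumptions, not the authors' implementation: each neighborhood is assumed to be a list of (hop node, score) pairs sorted by ascending score with each hop node appearing once, lower scores are assumed to be better, and the score of a connection is assumed to be the sum of the two entry scores. The name `rank_join_hops` is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Entry = Tuple[str, float]  # (hop node, score); lower score is assumed better


def rank_join_hops(nb_u: List[Entry], nb_v: List[Entry], k: int) -> List[Tuple[float, str]]:
    """Return up to k (combined score, hop node) pairs in ascending score order."""
    if not nb_u or not nb_v:
        return []
    seen_u: Dict[str, float] = {}
    seen_v: Dict[str, float] = {}
    candidates: List[Tuple[float, str]] = []  # min-heap of joined hop nodes
    results: List[Tuple[float, str]] = []
    inf = float("inf")
    i = j = 0
    while len(results) < k:
        # Read alternately from the two sorted inputs while entries remain.
        if i < len(nb_u) and (i <= j or j >= len(nb_v)):
            w, s = nb_u[i]
            i += 1
            seen_u[w] = s
            if w in seen_v:
                heapq.heappush(candidates, (s + seen_v[w], w))
        elif j < len(nb_v):
            w, s = nb_v[j]
            j += 1
            seen_v[w] = s
            if w in seen_u:
                heapq.heappush(candidates, (seen_u[w] + s, w))
        # Lower bound on the score of any join result not yet discovered:
        # it must use an unread entry from at least one input.
        next_u = nb_u[i][1] if i < len(nb_u) else inf
        next_v = nb_v[j][1] if j < len(nb_v) else inf
        threshold = min(next_u + nb_v[0][1], nb_u[0][1] + next_v)
        # Emit candidates that no undiscovered result can beat.
        while candidates and len(results) < k and candidates[0][0] <= threshold:
            results.append(heapq.heappop(candidates))
        if i >= len(nb_u) and j >= len(nb_v):
            break
    return results
```

Because both inputs are sorted, the threshold lets the join stop as soon as k hop nodes are confirmed, without reading the remaining entries. Reconstructing the actual paths through each returned hop node is left out of this sketch.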
11. Index Storage
Pruned neighborhoods are stored as path entries
Path entry (w,s) for each hop node w in NBu
Path entry index maps each node to its path entries
Node    Path entries (sorted)
u1      (w1, 1.0), (w2, 2.0), (w3, 2.0)
u2      (w5, 1.0)
…       …
Path index
Stores paths for all center nodes and their path entries
Used to reconstruct paths
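A minimal sketch of the path entry index as an in-memory mapping, using the example entries from this slide. The input format (center node, hop node, score), the tie-break on the hop node name, and the name `build_path_entry_index` are assumptions made for illustration; the slides do not specify the physical storage layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PathEntry = Tuple[str, float]  # (hop node w, score s)


def build_path_entry_index(
    entries: Iterable[Tuple[str, str, float]]
) -> Dict[str, List[PathEntry]]:
    """Group (center node, hop node, score) triples by center node.

    Each list is sorted by ascending score (ties broken by hop node name),
    so it can be fed directly into a join that expects sorted input.
    """
    index: Dict[str, List[PathEntry]] = defaultdict(list)
    for center, hop, score in entries:
        index[center].append((hop, score))
    for path_entries in index.values():
        path_entries.sort(key=lambda e: (e[1], e[0]))
    return dict(index)
```

Sorting once at index construction time moves that cost offline, which fits the stated goal of preprocessing data to reduce online effort.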
12. KEYWORD QUERY PROCESSING
13. Keyword Query Processing
Use joins to find connections between matching elements
for all keywords
Base inputs: keyword neighborhood for each keyword
Union of matching elements’ neighborhoods
Process
Data access to retrieve keyword
neighborhoods
Joins to connect keyword matching
elements
[Figure: keyword neighborhoods for "steve", "john", and "alice" as join inputs]
Are all possible plans valid?
14. Query Plans
[Figure: example with d=2 over the keywords "alice", "john", and "steve"; one join order is annotated "No results!"]
Join order matters
No single join order delivers all results (some might even be empty)
We do not know in advance which orders deliver results
Consider all possible join orders
15. Integrated Query Plan
Join operators in all query plans: N’(K)
Query plans for different join orders overlap
Share as many operators as possible
Join operators with sharing: N(|K|, K)
|K|    N’(K) (without sharing)    N(|K|, K) (with sharing)
2      2                          1
3      12                         6
4      72                         24
5      480                        100
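The N’(K) column is consistent with one simple reading, given here as an assumption rather than as the authors' definition: every permutation of the keywords is treated as a separate left-deep join order, and each order needs |K| - 1 binary joins. The sketch below counts joins under that assumption; the function name is hypothetical, and no formula for the shared count N(|K|, K) is attempted.

```python
from itertools import permutations
from typing import Sequence


def count_joins_without_sharing(keywords: Sequence[str]) -> int:
    """Sum the binary joins over all left-deep join orders (assumed model).

    Each permutation of the keywords is one join order, and a left-deep
    plan over n inputs contains n - 1 binary joins.
    """
    total = 0
    for order in permutations(keywords):
        total += len(order) - 1
    return total
```

For keyword sets of size 2 to 5 this yields 2, 12, 72, and 480, matching the first column of the table.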
16. Top-k Keyword-Join Processing
High number of operators
Terminate early after computing top-k instead of all results
Rank join operators
Top-k union operator
Integrated Query Plan is a composition of many sub-plans
Some sub-plans might produce no results
Pull-based operators will block until result can be produced
Use push-based operators: execution driven by inputs instead of
results
Some sub-plans might produce results earlier than others
Rank not only results, but also rank operators
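The difference between pull-based and push-based execution can be illustrated with a small operator that reacts to arriving inputs. This is only a sketch: a symmetric hash join stands in for the rank join operators of the slides, operator ranking is omitted, and the class name `PushJoin` and its interface are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Row = Tuple[str, ...]


class PushJoin:
    """Binary join driven by its inputs: work happens when a row is pushed in."""

    def __init__(self, emit: Callable[[str, Row], None]) -> None:
        # One hash table per input side, keyed by the join key (e.g. a hop node).
        self.tables: Tuple[DefaultDict[str, List[Row]], DefaultDict[str, List[Row]]] = (
            defaultdict(list),
            defaultdict(list),
        )
        self.emit = emit  # callback of the consuming operator

    def push(self, side: int, key: str, row: Row) -> None:
        """Store the row, probe the opposite side, and forward any join results."""
        self.tables[side][key].append(row)
        for other in self.tables[1 - side].get(key, []):
            left, right = (row, other) if side == 0 else (other, row)
            self.emit(key, left + right)
```

An operator whose inputs never match simply emits nothing; it does not stall the rest of the plan, since no consumer is waiting on it to produce a result.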
17. Operator Ranking
Prefer operators that have “promising” results
Global score of rank join operator, based on current results
and upper bounds for subsequent join operations
R: intermediate results
NBK: keyword neighborhoods not yet covered
Global score defined as
Join operators have a global score when they have results ready
Only the operator with the highest global score can push
results to subsequent operators
Otherwise, lower level data access operators are activated
18. EVALUATION
19. Evaluation
Four approaches
EASE: indexing at the level of graphs
KJ: keyword join approach
KJU: keyword join approach without operator ranking
Datasets
BTC: 10M triples
DBLP1/5/10: 1M, 5M, 10M triples (from SP2Bench)
9 keyword queries for each dataset
Reduction of index storage size
50% (DBLP1) – 79% (DBLP10)
20. Results
KJ, KJU outperform EASE
Operator ranking is beneficial
21. Results
Benefit of operator ranking more pronounced for larger
queries as these need more join operators
22. Conclusion
Native keyword search based on data access and join
d-length 2-Hop Cover
Index at the level of paths, instead of graphs
Top-k Keyword Join
Exploration transformed into series of join operators
Operator ranking
Reduces storage requirement and increases performance
23. Thank you for your attention! Questions?
Günter Ladwig, guenter.ladwig@kit.edu
24. BACKUP SLIDES
25. Introduction
Keyword search on graph-structured data (RDF)
Query Translation
Translate keywords into structured query using schema knowledge
Native Keyword Search
No translation
Match keywords to elements of the data graphs
Find structures connecting these elements (Steiner graphs)
More expensive than query translation approaches
Preprocess data and create special indexes
Reduces search space during online query processing
Requires offline preprocessing and storage
26. Example Query: “alice malta peter”
[Figure: example RDF graph with location nodes labeled "Malta", organization nodes o1 and o2 labeled "ABC Corp", and person nodes p1 to p4 (labels "Alice", "Richard", "Peter", "Mary"), connected by locatedIn, worksAt, and knows edges]
Match keyword elements
Find connections between keyword elements
27. Problem Definition
Given a graph GE=(NE,ER)
Find Steiner graphs connecting keyword elements
28. Scoring
Assumption: more compact Steiner graphs are more
relevant
Scoring function
GS: Steiner graph
P: set of paths connecting its keyword elements
Other functions possible, but not part of this work
29. Approaches
Bidirectional Search
Explore graph from keyword elements to find connections
Does not scale well
EASE
Indexes neighborhood graphs to restrict search space for
exploration
Our approach
Use database operations: data access and join
Transform graph exploration into a series of join operations
Improves storage requirements and performance
30. d-Length 2-Hop Cover
Preliminaries
Compact representation of connections in a graph
Used to find paths between two nodes in a graph
31. Construction
The trivial d-length 2-hop cover is the set of all d-neighborhoods of GE, but it contains redundancies
Finding a minimal 2-hop cover is NP-hard (Minimum Set Cover)
Approximation algorithm
Select a “best” node covering a large number of paths
Use its neighborhood to prune redundant paths from all other neighborhoods
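The trivial cover mentioned at the top of this slide can be sketched with a bounded breadth-first search. The sketch makes several simplifying assumptions: the graph is given as undirected adjacency lists, a neighborhood records only reachable nodes and their hop distance (not the paths themselves), and the pruning step of the approximation algorithm is not shown. The names `d_neighborhood` and `trivial_cover` are hypothetical.

```python
from collections import deque
from typing import Dict, List


def d_neighborhood(graph: Dict[str, List[str]], start: str, d: int) -> Dict[str, int]:
    """Return every node within d hops of start, mapped to its hop distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue  # do not expand beyond the distance bound
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def trivial_cover(graph: Dict[str, List[str]], d: int) -> Dict[str, Dict[str, int]]:
    """Unpruned cover: one d-neighborhood per node of the graph."""
    return {node: d_neighborhood(graph, node, d) for node in graph}
```

Every node reappears in the neighborhood of each node within d hops of it, which is the redundancy the pruning step is meant to reduce.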
32. Example: Pruning
[Figure: pruning example for d=2 with a center node and a hop node marked; paths over persons (p1 to p4), organizations (o1, o2), and locations (l1, l2) via knows, worksAt, and locatedIn edges, with a redundant path marked "prune"]
Pruned paths between two nodes can be reconstructed by
intersecting their neighborhoods
Store each pruned neighborhood as a list of path entries
33. Neighborhood Join
[Figure: neighborhoods of two center nodes joined on their shared hop nodes (nodes shown: o1, o3, l1, l2, p2, p3, p4)]
Result: Keyword Graphs
p4 o1 p2: stands for all paths of length d between p4 and p2 through o1
p4 p3 p2
...
34. Graph Join
Expand keyword graphs to keyword graph neighborhoods
Keyword Graph: p4 o1 p2
Keyword Graph Neighborhood:
p4 o1 p2 o3
p4 o1 p2 l2
l1 p4 o1 p2
...
Graph Join: joins keyword graph neighborhood with
keyword neighborhood
35. Integrated Query Plan
Number of join operators without operator sharing: N’(K)
Number of join operators with operator sharing: N(|K|, K)
|K|    N’(K) (without sharing)    N(|K|, K) (with sharing)
2      2                          1
3      12                         6
4      72                         24
5      480                        100