This document presents an algorithm for matching properties between linked databases using Kullback-Leibler divergence (KL-Divergence). It first creates documents representing the distributions of objects linked to properties in each database. It then computes the normalized KL-Divergence between all document pairs to identify the most similar properties. The property with the lowest KL-Divergence score to a given property is returned as its match. Experimental results on real linked datasets found the algorithm could accurately match properties over 90% of the time.
1. Property Matching and Query Expansion on
Linked Data Using Kullback-Leibler Divergence
Sean Golliher, Nathan Fortier, Logan Perreault
December 12, 2013
3. Definition: Query Expansion
Query expansion (QE) is the process of reformulating a seed
query to improve retrieval performance in information retrieval
operations.
7. Property Matching Problem
How do we find all actors in both databases?
Don’t want to manually inspect all databases
Can we use the SPARQL query language to infer matches across all datasets?
SELECT ?p
WHERE { ?s ?p ?o }
This alone can only match the total sizes of the returned triple sets
8. Original Bayesian Approach
Problems with the Bayesian approach:
Had to create and track a large vocabulary for training
Smoothing issues with very sparse text
Underflow issues from very small confidence values
Complexity of the likelihood kept growing: n different features in feature set X, c classes, plus tunable parameters
9. KL-Divergence
Original paper from 1951 entitled "On Information and Sufficiency"
Also referred to as "relative entropy"
A system gains entropy when it moves to a state with more possible
arrangements, for example a liquid becoming a gas.
Used in a paper from 2003 for text categorization:
"Using KL-Distance for Text Categorization"
An elegant and efficient method for plagiarism detection
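As a quick illustration of relative entropy, here is a minimal Python sketch (the function name and example distributions are my own): a distribution compared with itself diverges by zero, and the score grows as the distributions drift apart.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) between two discrete
    distributions given as dicts of outcome -> probability."""
    return sum(pk * math.log(pk / q[k]) for k, pk in p.items() if pk > 0)

p = {"heads": 0.5, "tails": 0.5}
q = {"heads": 0.9, "tails": 0.1}

print(kl_divergence(p, p))      # a distribution vs. itself: 0.0
print(kl_divergence(p, q) > 0)  # diverging distributions score > 0: True
```

Note that this classic form is asymmetric in p and q; the matching algorithm later uses a symmetrized and normalized variant.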
14. Formal Problem Statement
Given:
Two databases DB1 and DB2
A predicate p1 ∈ DB1
An object type S1 where some triple "s p1 o" exists in DB1
with s ∈ S1
Find: a predicate p2 ∈ DB2 where p2 is equivalent to p1
15. High-Level Description
Create a document d1 containing labels of all objects linked
by p1
Find an object type S2 ∈ DB2 where S2 is equivalent to S1
For each predicate p2 used by S2, create a document d2
containing labels of all objects linked by p2
Remove stop words and language tags from d1 and each d2
For each document compute the normalized KL-Divergence
KLD*(d1, d2)
Return the predicate corresponding to the document with the
lowest KL-Divergence
16. Algorithm 1 FindPredicate(DB1, DB2, p1, S1)
Create document d1 containing labels of all objects linked by p1
Find an object type S2 ∈ DB2 where S2 is equivalent to S1
for each predicate pi used by S2 do
Create document di containing labels of all objects linked by pi
end for
Remove stop words and language tags from d1 and each di
min ← 1
for each predicate pi used by S2 do
k ← KLD*(d1, di)
if k < min then
min ← k
pmap ← pi
end if
end for
return pmap
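A minimal Python sketch of Algorithm 1's selection loop, under some assumptions: documents are given as term lists, EPS is an assumed back-off probability for unseen terms, and min starts at infinity rather than 1 in case the symmetrized score exceeds 1. The example predicates and data are illustrative only.

```python
import math

EPS = 1e-6  # assumed back-off probability for terms absent from a document

def _probs(doc):
    # Relative frequency of each term in a document (a list of terms)
    return {t: doc.count(t) / len(doc) for t in set(doc)}

def kld_star(di, dj):
    # Normalized, symmetrized KL-divergence between two term lists
    pi, pj = _probs(di), _probs(dj)
    num = sum((pi.get(t, EPS) - pj.get(t, EPS))
              * math.log(pi.get(t, EPS) / pj.get(t, EPS))
              for t in set(pi) | set(pj))
    den = sum((p - EPS) * math.log(p / EPS) for p in pi.values())
    return num / den

def find_predicate(d1, candidates):
    """candidates maps each predicate of S2 to its document di;
    return the predicate whose document is closest to d1."""
    best, p_map = float("inf"), None
    for pred, doc in candidates.items():
        score = kld_star(d1, doc)
        if score < best:
            best, p_map = score, pred
    return p_map

# Illustrative data: actor labels linked by two candidate predicates
d1 = ["tom", "hanks", "meg", "ryan"]
candidates = {
    "dbo:starring": ["tom", "hanks", "meg", "ryan"],
    "dbo:location": ["seattle", "new", "york"],
}
print(find_predicate(d1, candidates))  # dbo:starring
```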
17. Computing KL-Divergence
KL-Divergence is computed as

  KLD(di, dj) = Σ_{tk ∈ V} ( P(tk, di) − P(tk, dj) ) × log( P(tk, di) / P(tk, dj) )   (1)

where

  P(tk, di) = tf(tk, di) / Σ_{tx ∈ di} tf(tx, di)   (2)

If tk does not occur in di, then P(tk, di) is replaced by a small default probability.
KL-Divergence is then normalized as follows:

  KLD*(di, dj) = KLD(di, dj) / KLD(di, 0)   (3)
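The divergence and its normalization can be sketched in Python as follows; EPS is an assumed back-off probability for terms absent from a document, and the example documents are illustrative:

```python
import math

EPS = 1e-6  # assumed back-off probability for a term absent from a document

def probs(doc):
    # P(t, d): the relative frequency of term t in document d
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kld(di, dj):
    # Symmetrized KL-divergence over the combined vocabulary
    pi, pj = probs(di), probs(dj)
    return sum((pi.get(t, EPS) - pj.get(t, EPS))
               * math.log(pi.get(t, EPS) / pj.get(t, EPS))
               for t in set(pi) | set(pj))

def kld_star(di, dj):
    # Normalize by the divergence from an empty document
    return kld(di, dj) / kld(di, [])

doc_a = ["tom", "hanks", "meg", "ryan"]
doc_b = ["steel", "concrete", "glass"]

print(kld_star(doc_a, doc_a))      # identical documents: 0.0
print(kld_star(doc_a, doc_b) > 0)  # dissimilar documents: True
```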
18. Algorithm 2 tf (tk , di )
tf ← 0
for each term tx in di do
if sim(tk , tx ) > τ then
tf ← tf + 1
end if
end for
return tf
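Algorithm 2's soft term frequency might be sketched as below; the slides leave sim() and the threshold τ unspecified, so difflib's ratio and τ = 0.8 are assumptions of mine:

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Assumed similarity measure; the slides leave sim() unspecified
    return SequenceMatcher(None, a, b).ratio()

def tf(tk, di, tau=0.8):
    # Count the terms of di whose similarity to tk exceeds tau
    return sum(1 for tx in di if sim(tk, tx) > tau)

doc = ["colour", "color", "shape"]
print(tf("color", doc))  # "color" and the near-spelling "colour" both count: 2
```

Counting near-matches rather than exact matches makes the term frequencies robust to spelling variants and label differences across databases.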