This document presents an algorithm for matching properties between linked databases using Kullback-Leibler divergence (KL-Divergence). It first creates documents representing the distributions of objects linked to properties in each database. It then computes the normalized KL-Divergence between all document pairs to identify the most similar properties. The property with the lowest KL-Divergence score to a given property is returned as its match. Experimental results on real linked datasets found the algorithm could accurately match properties over 90% of the time.
1. Property Matching and Query Expansion on
Linked Data Using Kullback-Leibler Divergence
Sean Golliher, Nathan Fortier, Logan Perreault
December 12, 2013
3. Definition: Query Expansion
Query expansion (QE) is the process of reformulating a seed
query to improve retrieval performance in information retrieval
operations.
7. Property Matching Problem
How do we find all actors in both databases?
Don’t want to manually inspect all databases
Can we use the SPARQL query language to infer matches across all datasets?
SELECT ?p
WHERE { ?s ?p ?o }
This alone can only match the total sizes of the returned triple sets
8. Original Bayesian Approach
Problems with the Bayesian approach:
Had to create and track a large vocabulary for training
Smoothing issues with very sparse text
Underflow issues from very small confidence values
Complexity of the likelihood kept growing: n different features in feature set X, c classes, plus tunable parameters
9. KL-Divergence
Original paper from 1951 entitled "On Information and Sufficiency"
Also referred to as "relative entropy"
A system gains entropy when it moves to a state with more possible
arrangements, for example a liquid becoming a gas.
Used in a paper from 2003 for text categorization:
"Using KL-Distance for Text Categorization"
An elegant and efficient method for plagiarism detection
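As a quick illustration of relative entropy, here is a minimal Python sketch (the function name and example distributions are my own): a distribution compared with itself diverges by zero, and the score grows as the distributions drift apart.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) between two discrete
    distributions given as dicts of outcome -> probability."""
    return sum(pk * math.log(pk / q[k]) for k, pk in p.items() if pk > 0)

p = {"heads": 0.5, "tails": 0.5}
q = {"heads": 0.9, "tails": 0.1}

print(kl_divergence(p, p))      # a distribution vs. itself: 0.0
print(kl_divergence(p, q) > 0)  # diverging distributions score > 0: True
```

Note that this classic form is asymmetric in p and q; the matching algorithm later uses a symmetrized and normalized variant.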
14. Formal Problem Statement
Given:
Two databases DB1 and DB2
A predicate p1 ∈ DB1
An object type S1 where some triple "s p1 o" exists in DB1
with s ∈ S1
Find: a predicate p2 ∈ DB2 where p2 is equivalent to p1
15. High-Level Description
Create a document d1 containing labels of all objects linked
by p1
Find an object type S2 ∈ DB2 where S2 is equivalent to S1
For each predicate p2 used by S2, create a document d2
containing labels of all objects linked by p2
Remove stop words and language tags from d1 and each d2
For each document compute the normalized KL-Divergence
KLD*(d1, d2)
Return the predicate corresponding to the document with the
lowest KL-Divergence
16. Algorithm 1 FindPredicate(DB1, DB2, p1, S1)
Create document d1 containing labels of all objects linked by p1
Find an object type S2 ∈ DB2 where S2 is equivalent to S1
for each predicate pi used by S2 do
Create document di containing labels of all objects linked by pi
end for
Remove stop words and language tags from d1 and each di
min ← 1
for each predicate pi used by S2 do
k ← KLD*(d1, di)
if k < min then
min ← k
pmap ← pi
end if
end for
return pmap
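A minimal Python sketch of Algorithm 1's selection loop, under some assumptions: documents are given as term lists, EPS is an assumed back-off probability for unseen terms, and min starts at infinity rather than 1 in case the symmetrized score exceeds 1. The example predicates and data are illustrative only.

```python
import math

EPS = 1e-6  # assumed back-off probability for terms absent from a document

def _probs(doc):
    # Relative frequency of each term in a document (a list of terms)
    return {t: doc.count(t) / len(doc) for t in set(doc)}

def kld_star(di, dj):
    # Normalized, symmetrized KL-divergence between two term lists
    pi, pj = _probs(di), _probs(dj)
    num = sum((pi.get(t, EPS) - pj.get(t, EPS))
              * math.log(pi.get(t, EPS) / pj.get(t, EPS))
              for t in set(pi) | set(pj))
    den = sum((p - EPS) * math.log(p / EPS) for p in pi.values())
    return num / den

def find_predicate(d1, candidates):
    """candidates maps each predicate of S2 to its document di;
    return the predicate whose document is closest to d1."""
    best, p_map = float("inf"), None
    for pred, doc in candidates.items():
        score = kld_star(d1, doc)
        if score < best:
            best, p_map = score, pred
    return p_map

# Illustrative data: actor labels linked by two candidate predicates
d1 = ["tom", "hanks", "meg", "ryan"]
candidates = {
    "dbo:starring": ["tom", "hanks", "meg", "ryan"],
    "dbo:location": ["seattle", "new", "york"],
}
print(find_predicate(d1, candidates))  # dbo:starring
```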
17. Computing KL-Divergence
KL-Divergence is computed as

  KLD(di, dj) = Σ_{tk ∈ V} ( P(tk, di) − P(tk, dj) ) × log( P(tk, di) / P(tk, dj) )   (1)

where

  P(tk, di) = tf(tk, di) / Σ_{tx ∈ di} tf(tx, di)   (2)

If tk does not occur in di, then P(tk, di) is replaced by a small default probability.
KL-Divergence is then normalized as follows:

  KLD*(di, dj) = KLD(di, dj) / KLD(di, 0)   (3)
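The divergence and its normalization can be sketched in Python as follows; EPS is an assumed back-off probability for terms absent from a document, and the example documents are illustrative:

```python
import math

EPS = 1e-6  # assumed back-off probability for a term absent from a document

def probs(doc):
    # P(t, d): the relative frequency of term t in document d
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kld(di, dj):
    # Symmetrized KL-divergence over the combined vocabulary
    pi, pj = probs(di), probs(dj)
    return sum((pi.get(t, EPS) - pj.get(t, EPS))
               * math.log(pi.get(t, EPS) / pj.get(t, EPS))
               for t in set(pi) | set(pj))

def kld_star(di, dj):
    # Normalize by the divergence from an empty document
    return kld(di, dj) / kld(di, [])

doc_a = ["tom", "hanks", "meg", "ryan"]
doc_b = ["steel", "concrete", "glass"]

print(kld_star(doc_a, doc_a))      # identical documents: 0.0
print(kld_star(doc_a, doc_b) > 0)  # dissimilar documents: True
```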
18. Algorithm 2 tf (tk , di )
tf ← 0
for each term tx in di do
if sim(tk , tx ) > τ then
tf ← tf + 1
end if
end for
return tf
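Algorithm 2's soft term frequency might be sketched as below; the slides leave sim() and the threshold τ unspecified, so difflib's ratio and τ = 0.8 are assumptions of mine:

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Assumed similarity measure; the slides leave sim() unspecified
    return SequenceMatcher(None, a, b).ratio()

def tf(tk, di, tau=0.8):
    # Count the terms of di whose similarity to tk exceeds tau
    return sum(1 for tx in di if sim(tk, tx) > tau)

doc = ["colour", "color", "shape"]
print(tf("color", doc))  # "color" and the near-spelling "colour" both count: 2
```

Counting near-matches rather than exact matches makes the term frequencies robust to spelling variants and label differences across databases.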