Presentation given at the International Conference on Scalable Uncertainty Management, 2-5 Oct 2018, Milan, Italy.
Paper: https://research.utwente.nl/en/publications/rule-based-conditioning-of-probabilistic-data
Data interoperability is a major issue in data management for data science and big data analytics. Probabilistic data integration (PDI) is a specific kind of data integration where extraction and integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation. This allows a data integration process with two phases: (1) a quick partial integration, where data quality problems are represented as uncertainty in the resulting integrated data, and (2) meaningful use of the uncertain data while continuously improving its quality as more evidence is gathered. The main contribution of this paper is an iterative approach for incorporating evidence of users in the probabilistically integrated data. Evidence can be specified as hard or soft rules (i.e., rules that are uncertain themselves).
CONTENTS
- Motivation and context
  - Data integration problems
  - Why a probabilistic approach to data integration?
- Background
  - Our probabilistic database model
- The paper
  - Conditioning a probabilistic database based on evidence expressed as hard or soft rules
- Conclusions

4 Oct 2018, SUM 2018 - Rule-based Conditioning of Probabilistic Data
DATA INTEGRATION
"Data integration involves combining data residing in different sources and providing users with a unified view of them" (Lenzerini)
Problems:
- It may be hard to extract information from certain kinds of sources (e.g., natural language, websites).
- Information in a source may be missing, of bad quality, or unclear in meaning.
- It may be unclear which data items in the sources should be combined.
- Sources may be inconsistent, complicating a unified view.
2-PHASE PROBABILISTIC DATA INTEGRATION PROCESS
Let's go for an initial integration that can readily and meaningfully be used: "good is good enough" for meaningful use in many applications (and can be achieved N times earlier). Then let it improve during use.
- Phase 1, initial integration: a 'quick-and-dirty' initial data integration; make the remaining problems explicit and estimate their likelihoods; store the result as a probabilistic representation for "data with problems" (PDB).
- Phase 2, continuous improvement: use the data, gather evidence, and improve data quality. This paper addresses the improvement step.
M. van Keulen, Probabilistic Data Integration. Encyclopedia of Big Data Technologies, Springer, 2018. DOI 10.1007/978-3-319-63962-8_18-1
OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (1/3)
Example: data items a1, a2, and a3, where it is unclear whether they should be in the database.
- Problem X with 3 cases: a1 is in the database in the first two cases, a2 only in the first and third.
- Problem Y with 2 cases: a3 is only in the database if a certain condition holds, a2 only if it doesn't.
X and Y are partitionings; atoms such as X=1 are labels; each combination of labels identifies one possible world containing a set of assertions:

            X=1 (70%)   X=2 (20%)   X=3 (10%)
Y=1 (60%)   {a1, a2}    {a1}        {a2}
Y=2 (40%)   {a1, a3}    {a1, a3}    {a3}
OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (2/3)
Compact representation of the set of possible worlds W(CPDB):
- Abstract notion of data item: we call them assertions (for a probabilistic relational model: assertion = tuple).
- Associate each assertion ai with a sentence 𝜑i. The pair (ai, 𝜑i) means: ai exists in all worlds for which 𝜑i is true.
- 𝜑i is a propositional formula over labels; labels are atoms of the form ω = v.
- Partitionings ω are mutually independent; the labels of one ω are mutually exclusive.
Example: <a2, ¬X=2 ⋀ Y=1>
B. Wanders, M. van Keulen, Revisiting the formal foundation of Probabilistic Databases. EUSFLAT 2015.
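The sentence mechanism can be sketched in a few lines of Python (a toy illustration of the model, not the paper's implementation; sentences are encoded as predicates over label assignments):

```python
from itertools import product

# A world is a total assignment of one value to each partitioning.
# An assertion is a pair (name, sentence); the sentence decides in
# which worlds the assertion exists.
partitionings = {"X": [1, 2, 3], "Y": [1, 2]}

db = [
    ("a1", lambda w: w["X"] != 3),                  # <a1, ¬X=3>
    ("a2", lambda w: w["X"] != 2 and w["Y"] == 1),  # <a2, ¬X=2 ⋀ Y=1>
    ("a3", lambda w: w["Y"] == 2),                  # <a3, Y=2>
]

def worlds():
    names = sorted(partitionings)
    for values in product(*(partitionings[n] for n in names)):
        yield dict(zip(names, values))

def contents(world):
    # ai exists in a world exactly when its sentence is true there.
    return {a for a, phi in db if phi(world)}

for w in worlds():
    print(w, sorted(contents(w)))
```

Enumerating all six worlds reproduces the table on the previous slide, e.g., the world X=1, Y=1 contains {a1, a2}.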
OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (3/3)
A probabilistic database is a 3-tuple CPDB = <DB, Ω, P>:
- (data) DB = { (a1,𝜑1), …, (an,𝜑n) }
- (partitionings) Ω is a set of partitionings ω
- (probabilities) function P assigns probabilities to labels
A world w is identified by a fully described sentence 𝜑:
- 𝜑 is a conjunction of one label from each partitioning
- 𝜑 can be seen as a name/identifier for world w
- Assertion ai exists in world w iff 𝜑 ⇒ 𝜑i
Example:
CPDB = <{<a1,¬X=3>, <a2,¬X=2⋀Y=1>, <a3,Y=2>}, {X,Y}, P>
P(X=1)=0.7; P(X=2)=0.2; P(X=3)=0.1; P(Y=1)=0.6; P(Y=2)=0.4
𝜑 = (X=1⋀Y=2) identifies w = {a1,a3} with P(w) = P(𝜑) = 0.7 × 0.4 = 0.28
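Because partitionings are independent, the probability of a world is the product of its label probabilities. A minimal numeric sketch of the example CPDB:

```python
from itertools import product

# P assigns a probability to every label of every partitioning.
P = {("X", 1): 0.7, ("X", 2): 0.2, ("X", 3): 0.1,
     ("Y", 1): 0.6, ("Y", 2): 0.4}
partitionings = {"X": [1, 2, 3], "Y": [1, 2]}

def world_probability(world):
    # Partitionings are independent: P(w) is the product of its labels.
    p = 1.0
    for name, value in world.items():
        p *= P[(name, value)]
    return p

names = sorted(partitionings)
all_worlds = [dict(zip(names, vs))
              for vs in product(*(partitionings[n] for n in names))]

# The world identified by 𝜑 = (X=1 ⋀ Y=2) has P(w) = 0.7 × 0.4 = 0.28,
# and the probabilities of all six worlds sum to 1.
print(world_probability({"X": 1, "Y": 2}))
print(sum(world_probability(w) for w in all_worlds))
```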
SCALABLE QUERYING
- Theory (query semantics): a query Q maps the possible worlds to the possible answers.
- Implementation: an extended query Q' works directly on the compact representation and produces a compact representation of the possible answers.
GENERAL APPROACH TO OBTAIN A PROBABILISTIC QUERY IMPLEMENTATION
Given any data model with its query language:
- Choose the data item to associate with a sentence.
- For every query operator ⨂ in the language's algebra, define an extended operator ⨷ = (⨂, 𝜏⨂), where 𝜏⨂ is a function that produces the sentence of a result from the sentences of the operands, in a manner appropriate for operation ⨂.
This produces a probabilistic variant of that data model + query language. It has been done for relational, XML, and Datalog; the paper has its example in JudgeD, a probabilistic Datalog.
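For the relational natural join, 𝜏⨂ simply conjoins the operand sentences, since a joined tuple exists exactly in the worlds where both operands exist. A hedged sketch (the function names and sentence encoding are assumed, not from the paper):

```python
# Each probabilistic tuple is a pair (tuple-as-dict, sentence); the
# sentence is kept symbolic as a nested formula such as
# ("and", s1, s2), with atoms like ("x", 2) for the label x=2.
def tau_join(phi1, phi2):
    # 𝜏⨂ for natural join: conjoin the operand sentences.
    return ("and", phi1, phi2)

def prob_join(r1, r2, on):
    # The extended operator ⨷ = (join, tau_join): join the tuples as
    # usual, and attach the combined sentence to each result.
    out = []
    for t1, phi1 in r1:
        for t2, phi2 in r2:
            if all(t1[k] == t2[k] for k in on):
                out.append(({**t1, **t2}, tau_join(phi1, phi2)))
    return out

r1 = [({"id": 1, "type": "person"}, ("x", 2))]
r2 = [({"id": 1, "ref": "France"}, ("z", 1))]
print(prob_join(r1, r2, on=["id"]))
```

The joined tuple carries the sentence ("and", ("x", 2), ("z", 1)): it exists only in worlds where both operand sentences hold.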
EXAMPLE: INFORMATION EXTRACTION FROM NATURAL LANGUAGE
Information extraction / natural language processing turns unstructured text into structured data.
"We humans happily deal with doubt and misinterpretation every day; why shouldn't computers?"
"Paris Hilton stayed in the Paris Hilton"
- Paris: Firstname
- Paris: City
- Paris Hilton: Person
- Paris Hilton: Hotel
- Paris Hilton: Fragrance
Which Paris? There are 60+ Parises (source: GeoNames). Which person? More people have that name! Which hotel? There are 3 in the capital of France.
OUR WORK
Instead of: NE detection → NE disambiguation
Do: NE candidate extraction → indeterministic NE disambiguation → cleaning (with enriched data)
Go for high recall at the expense of low precision; this leaves a lot of 'noise' to be cleaned later! The result is a pipeline where the intermediate results are probabilistic data.
Habib, M.B. and van Keulen, M. (2016) TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweets. Natural Language Engineering, 22(03), pp. 423-456. ISSN 1351-3249.
PROBABILISTIC REPRESENTATION FOR ANNOTATIONS AND REFERENCES
"Paris Hilton stayed in the Paris Hilton" (token positions 1-7)

Annotations:
ID  b  e  type       sentence
1   1  1  City       T1=1
2   1  2  Hotel      T2=1⋀X=0
3   1  2  Person     T2=1⋀X=1
4   1  2  Fragrance  T2=1⋀X=2
5   6  6  City       T3=1
6   6  7  Hotel      T4=1⋀Y=0
7   6  7  Person     T4=1⋀Y=1
:   :  :  :          :

References:
RID  ID  URL         sentence
1    1   France      Z=1
2    1   Texas, US   Z=2
3    2   La Defense  X=0⋀Z=1⋀A=1
4    2   Opera       X=0⋀Z=1⋀A=2
5    2   Orly        X=0⋀Z=1⋀A=3
6    5   France      B=1
7    5   Texas, US   B=2
:    :   :           :

This paper: if I have this, how do I clean it given some evidence?
CLEANING PROBABILISTIC DATA
Evidence comes from users, context, or analytics:
- Given a phrase that is a person, a part is never a city (hard knowledge rule).
- If a city is part of a hotel name, then it is more likely to refer to a city containing such a hotel (soft knowledge rule).
- "stayed in" suggests that what precedes it is more likely a person and what follows it is more likely a hotel (soft knowledge rule learnt from corpora).
Data integration problems produce "noise" in the data. Cleaning is aimed at filtering/reducing this noise = removing worlds or improving probabilities.
INTUITION: CLEANING PROBABILISTIC DATA = CONDITIONING = BAYESIAN UPDATING
"Paris Hilton stayed in the Paris Hilton"
Person --- dnc --- City: an inconsistency between two annotations:
[a] T1=1 ("Paris" is a City), P(a)=0.6
[b] T2=1⋀X=1 ("Paris Hilton" is a Person), P(b)=0.8
Before conditioning, a and b are independent:
  a∧b 0.48   a∧¬b 0.12   b∧¬a 0.32   ∅ 0.08
After conditioning, a and b are mutually exclusive (a∧b is not possible), and the remaining worlds are renormalized:
  a∧¬b 0.23   b∧¬a 0.62   ∅ 0.15
The annotations involved:
ID  b  e  type    sentence
1   1  1  City    T1=1
2   1  2  Hotel   T2=1⋀X=0
3   1  2  Person  T2=1⋀X=1
Their sentences (T1=1 for City, T2=1⋀X=1 for Person) become mutually exclusive.
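The renormalization on this slide can be reproduced numerically; a minimal sketch (the world names are just strings here):

```python
# a ("Paris" is a City) and b ("Paris Hilton" is a Person) start out
# independent with P(a)=0.6 and P(b)=0.8; the evidence makes a∧b
# impossible, and the remaining worlds are renormalized.
pa, pb = 0.6, 0.8
prior = {
    "a∧b":  pa * pb,               # 0.48
    "a∧¬b": pa * (1 - pb),         # 0.12
    "b∧¬a": (1 - pa) * pb,         # 0.32
    "∅":    (1 - pa) * (1 - pb),   # 0.08
}

# Bayesian updating: drop the impossible world, renormalize the rest.
posterior = {w: p for w, p in prior.items() if w != "a∧b"}
mass = sum(posterior.values())     # 0.52
posterior = {w: p / mass for w, p in posterior.items()}

for w, p in posterior.items():
    print(w, round(p, 2))          # a∧¬b 0.23, b∧¬a 0.62, ∅ 0.15
```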
OVERVIEW OF CONDITIONING APPROACH
1. Represent the integrated data as probabilistic facts and rules.
2. Represent the evidence as hard/soft rules (hard: the evidence is absolutely true; soft: the evidence is likely).
3. Incorporate the evidence by updating the database, deleting worlds that do not correspond with the evidence:
   a) Evaluate the rule to obtain an evidence sentence 𝜑e.
   b) Remap the partitionings in 𝜑e to a fresh one ω.
   c) Exclude inconsistent labels and renumber.
   d) Pe(𝜑e) is the remaining probability mass, to be distributed over the remaining worlds.
This constructs an updated CPDB' = <DB', Ω', P'>.
Notes: this is different from "observe" as in ProbLog, and it works directly on the compact representation.
THE PROCESS STEP BY STEP, IN THE PAPER: PROBABILISTIC DATALOG (JUDGED)
1. Represent the data integration result as facts and rules.
"Paris Hilton" is a hotel, person, or fragrance (x); "Paris" is a firstname or city (y).

annot:
ID  b  e  type       sentence   P
a1  1  2  hotel      x=1        P(x=1)=0.5
a2  1  2  person     x=2        P(x=2)=0.4
a3  1  2  fragrance  x=3        P(x=3)=0.1
a4  1  1  firstname  y=1        P(y=1)=0.3
a5  1  1  city       y=2        P(y=2)=0.7
THE PROCESS STEP BY STEP, IN THE PAPER: PROBABILISTIC DATALOG (JUDGED)
2. Represent the evidence as rules (a7, which uses a6), over the annot facts of the previous slide:

a6  contained(B1,E1,B2,E2) :- B1<=B2, E1<=E2.
a7  hardrule :- annot(Ph1,B1,E1,city),
                annot(Ph2,B2,E2,person),
                contained(B1,E1,B2,E2).

Rule a7 captures the hard knowledge "Person --- dnc --- City": it fires whenever a city annotation is contained in a person annotation.
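The rule evaluation can be sketched in plain Python (an illustrative rendering of a6/a7, not JudgeD itself; the fact encoding with an attached label is assumed):

```python
# Each annot fact carries its sentence as a label (partitioning, value).
annot = [
    ("Paris Hilton", 1, 2, "hotel",     ("x", 1)),
    ("Paris Hilton", 1, 2, "person",    ("x", 2)),
    ("Paris Hilton", 1, 2, "fragrance", ("x", 3)),
    ("Paris",        1, 1, "firstname", ("y", 1)),
    ("Paris",        1, 1, "city",      ("y", 2)),
]

def contained(b1, e1, b2, e2):        # rule a6
    return b1 <= b2 and e1 <= e2

def hardrule_sentences():             # rule a7
    # hardrule holds in every world where a city annotation is
    # contained in a person annotation; collect those label pairs.
    out = []
    for ph1, b1, e1, t1, s1 in annot:
        for ph2, b2, e2, t2, s2 in annot:
            if t1 == "city" and t2 == "person" and contained(b1, e1, b2, e2):
                out.append((s1, s2))
    return out

# hardrule holds exactly where y=2 and x=2 hold together, so the
# evidence sentence is the negation: 𝜑e = ¬(x=2 ⋀ y=2).
print(hardrule_sentences())
```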
THE PROCESS STEP BY STEP
3. Incorporate the evidence by updating the database: delete worlds that do not correspond with the evidence.
"Paris Hilton" is a hotel, person, or fragrance (x); "Paris" is a firstname or city (y). The rule assertions a6 and a7 are present in every world:

            x=1 (0.5)        x=2 (0.4)        x=3 (0.1)
y=1 (0.3)   {a1,a4,a6,a7}    {a2,a4,a6,a7}    {a3,a4,a6,a7}
y=2 (0.7)   {a1,a5,a6,a7}    {a2,a5,a6,a7}    {a3,a5,a6,a7}
THE PROCESS STEP BY STEP
3(a). Evaluate the rule to obtain the evidence sentence 𝜑e.
The query not(hardrule)? yields 𝜑e = ¬(x=2⋀y=2).
The inconsistent world (Person --- dnc --- City) is identified by (x=2⋀y=2); in general, 𝜑e represents many possible worlds.
THE PROCESS STEP BY STEP
3. Deleting the world identified by (x=2⋀y=2) is straightforward on the enumerated set of possible worlds, but we want to do this directly on the compact representation CPDB.
THE PROCESS STEP BY STEP
3(b). Remap the partitionings in 𝜑e to a fresh one ω.

On the possible worlds:
World 𝜑   W(𝜑)            P(𝜑)  Remapped  Consistent
x=1⋀y=1   {a1,a4,a6,a7}   0.15  z=1       ✓
x=2⋀y=1   {a2,a4,a6,a7}   0.12  z=2       ✓
x=3⋀y=1   {a3,a4,a6,a7}   0.03  z=3       ✓
x=1⋀y=2   {a1,a5,a6,a7}   0.35  z=4       ✓
x=2⋀y=2   {a2,a5,a6,a7}   0.28  z=5       ✕
x=3⋀y=2   {a3,a5,a6,a7}   0.07  z=6       ✓

On the CPDB:
- For every label, find a logical equivalent, e.g., x=1 ⇔ (z=1 ∨ z=4)
- DB: for every sentence 𝜑 in DB, replace the x- and y-labels, and simplify
- Ω: remove x and y from Ω and add z
- P: remove x and y from the domain of P and add P(z=1) … P(z=6)
No change in semantics: the set of possible worlds stays the same. 𝜑e = ¬(x=2⋀y=2) becomes ¬(z=5).
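The remapping step can be sketched as follows (a minimal illustration; the enumeration order of the combinations is an implementation choice, so the concrete z numbers differ from the ones on the slide):

```python
from itertools import product

# Step (b): introduce a fresh partitioning z with one label per
# combination of the partitionings mentioned in 𝜑e (here x and y),
# and record the equivalences needed to rewrite the sentences in DB.
old = {"x": [1, 2, 3], "y": [1, 2]}
names = sorted(old)                                  # ["x", "y"]
combos = list(product(*(old[n] for n in names)))

z_of = {c: k + 1 for k, c in enumerate(combos)}      # labels z=1..6
equiv = {}
for i, n in enumerate(names):
    for v in old[n]:
        # e.g. x=1 is equivalent to the disjunction of all z labels
        # whose underlying combination has x=1.
        equiv[(n, v)] = [z_of[c] for c in combos if c[i] == v]

print(equiv[("x", 1)])   # x=1 ⇔ z=1 ∨ z=2 under this enumeration
print(equiv[("y", 1)])   # y=1 ⇔ z=1 ∨ z=3 ∨ z=5
```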
THE PROCESS STEP BY STEP
3(c). Exclude inconsistent labels and renumber.

On the possible worlds:
World 𝜑   W(𝜑)            P(𝜑)  Remapped ↦ Renumbered  Consistent
x=1⋀y=1   {a1,a4,a6,a7}   0.15  z=1 ↦ z'=1             ✓
x=2⋀y=1   {a2,a4,a6,a7}   0.12  z=2 ↦ z'=2             ✓
x=3⋀y=1   {a3,a4,a6,a7}   0.03  z=3 ↦ z'=3             ✓
x=1⋀y=2   {a1,a5,a6,a7}   0.35  z=4 ↦ z'=4             ✓
x=2⋀y=2   {a2,a5,a6,a7}   0.28  z=5                    ✕
x=3⋀y=2   {a3,a5,a6,a7}   0.07  z=6 ↦ z'=5             ✓

On the CPDB (𝜑e = ¬(z=5)):
- DB: for every sentence 𝜑 in DB, replace z=5 by ⊥, and simplify; if 𝜑 ≡ ⊥, then delete <a,𝜑> from DB
- Ω: remove z from Ω and add z'
THE PROCESS STEP BY STEP
3(d). Pe(𝜑e) is the remaining probability mass, to be distributed over the remaining worlds.

On the possible worlds:
World 𝜑   W(𝜑)            P(𝜑)  Remapped ↦ Renumbered  Consistent  Pe
x=1⋀y=1   {a1,a4,a6,a7}   0.15  z=1 ↦ z'=1             ✓           0.2083
x=2⋀y=1   {a2,a4,a6,a7}   0.12  z=2 ↦ z'=2             ✓           0.1667
x=3⋀y=1   {a3,a4,a6,a7}   0.03  z=3 ↦ z'=3             ✓           0.0417
x=1⋀y=2   {a1,a5,a6,a7}   0.35  z=4 ↦ z'=4             ✓           0.4861
x=2⋀y=2   {a2,a5,a6,a7}   0.28  z=5                    ✕
x=3⋀y=2   {a3,a5,a6,a7}   0.07  z=6 ↦ z'=5             ✓           0.0972

On the CPDB:
- DB: for every sentence 𝜑 in DB, replace z=5 by ⊥, and simplify; if 𝜑 ≡ ⊥, then delete <a,𝜑> from DB
- Ω: remove z from Ω and add z'
- P: remove z from the domain of P and add P(z'=1) … P(z'=5)
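Steps (c) and (d) can be sketched numerically on the six-world table (a minimal illustration using the slide's probabilities):

```python
# Remove the label of the inconsistent world (z=5), renumber the
# remaining labels consecutively to z', and renormalize so that the
# lost probability mass is distributed proportionally.
Pz = {1: 0.15, 2: 0.12, 3: 0.03, 4: 0.35, 5: 0.28, 6: 0.07}
inconsistent = {5}

remaining = sorted((z, p) for z, p in Pz.items() if z not in inconsistent)
mass = sum(p for _, p in remaining)                 # Pe(𝜑e) = 0.72
renumber = {z: i + 1 for i, (z, _) in enumerate(remaining)}
P_new = {renumber[z]: p / mass for z, p in remaining}

for z, p in P_new.items():
    print(f"z'={z}: {p:.4f}")   # 0.2083, 0.1667, 0.0417, 0.4861, 0.0972
```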
THE PROCESS STEP BY STEP: THE RESULT AFTER CONDITIONING
"Paris Hilton"

Before:
ID  b  e  type       sentence   P
a1  1  2  hotel      x=1        P(x=1)=0.5
a2  1  2  person     x=2        P(x=2)=0.4
a3  1  2  fragrance  x=3        P(x=3)=0.1
a4  1  1  firstname  y=1        P(y=1)=0.3
a5  1  1  city       y=2        P(y=2)=0.7

After:
ID  b  e  type       sentence         P
a1  1  2  hotel      z=1 ∨ z=4        P(z=1)=0.2083
a2  1  2  person     z=2              P(z=2)=0.1667
a3  1  2  fragrance  z=3 ∨ z=5        P(z=3)=0.0417
a4  1  1  firstname  z=1 ∨ z=2 ∨ z=3  P(z=4)=0.4861
a5  1  1  city       z=4 ∨ z=5        P(z=5)=0.0972
SOFT RULES: WHAT IF THE RULE IS UNCERTAIN ITSELF?

a'7 softrule :- annot(Ph1,B1,E1,city),
                annot(Ph2,B2,E2,person),
                contained(B1,E1,B2,E2) [r=1].

We obtain 𝜑e = ¬(x=2⋀y=2⋀r=1).
Approach: condition as if it were a hard rule, but only effectuate it for the worlds W(r=1).
SOFT RULES: RESULT
The result looks scary, but it has the same number of assertions; there are only more partitionings and longer sentences.
Example: a1 with (r=0⋀x=1) ⋁ (r=1⋀(z=1⋁z=4))
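The effect on a1's probability can be sketched as follows. P(r=1) is the rule's confidence; the value 0.8 below is illustrative (an assumption, not from the paper), and the conditioned label probabilities come from the hard-rule result:

```python
# Soft-rule sketch: a1 now carries (r=0 ⋀ x=1) ⋁ (r=1 ⋀ (z=1 ⋁ z=4)).
# The two disjuncts are mutually exclusive on r, and r is independent
# of x and z, so the probabilities combine as a simple mixture.
p_r1 = 0.8                       # assumed rule confidence (illustrative)
P_x1 = 0.5                       # original P(x=1)
P_z1, P_z4 = 0.2083, 0.4861      # conditioned P(z=1), P(z=4)

p_a1 = (1 - p_r1) * P_x1 + p_r1 * (P_z1 + P_z4)
print(round(p_a1, 4))
```

With full confidence (P(r=1)=1) this degenerates to the hard-rule result; with P(r=1)=0 the original probability is kept.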
CONCLUSIONS
Probabilistic data integration is a two-phase data integration process:
1. Quick-and-dirty data integration with late cleaning.
2. Meaningful use with evidence gathering and cleaning.
Data integration problems are handled by means of a probabilistic data representation. The paper proposes an approach for cleaning with evidence:
- Evidence is expressed as hard and soft rules.
- Evidence is incorporated by updating the database.
- The approach is iterative and scalable.
- It allows for continuous data quality improvement.
(Quotes slide; only the attributions survive in this export: Francis Bacon, 1605; Jorge Luis Borges, 1979; and a saying often attributed to John Maynard Keynes, but due to Carveth Read, 1898.)
PROPERTIES OF THE APPROACH
Scalability:
- Obtaining the evidence sentence has the same complexity as querying.
- Remapping and redistribution of probabilities is exponential in the number of partitionings in 𝜑e. Assumption: uncertainty is local; a splitting approach is given in the paper.
- Simplification of sentences is also local.
- Database updating is worst case linear in the size of the DB.
- The resulting DB has the same size, but longer sentences; querying complexity doesn't change significantly.
- Iterative: one piece of evidence at a time.