Presentation given at the International Conference on Scalable Uncertainty Management, 2-5 Oct 2018, Milan, Italy.
Paper: https://research.utwente.nl/en/publications/rule-based-conditioning-of-probabilistic-data
Data interoperability is a major issue in data management for data science and big data analytics. Probabilistic data integration (PDI) is a specific kind of data integration where extraction and integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation. This allows a data integration process with two phases: (1) a quick partial integration, where data quality problems are represented as uncertainty in the resulting integrated data, and (2) meaningful use of the uncertain data while continuously improving its quality as more evidence is gathered. The main contribution of this paper is an iterative approach for incorporating evidence of users in the probabilistically integrated data. Evidence can be specified as hard or soft rules (i.e., rules that are uncertain themselves).
CONTENTS
- Motivation and context
  - Data integration problems
  - Why a probabilistic approach to data integration?
- Background
  - Our probabilistic database model
- The paper
  - Conditioning a probabilistic database based on evidence expressed as hard or soft rules
- Conclusions

4 Oct 2018, SUM 2018 - Rule-based Conditioning of Probabilistic Data
DATA INTEGRATION
"Data integration involves combining data residing in different sources and providing users with a unified view of them" (Lenzerini)
Problems:
- It may be hard to extract information from certain kinds of sources (e.g., natural language, websites).
- Information in a source may be missing, of bad quality, or unclear in meaning.
- It may be unclear which data items in the sources should be combined.
- Sources may be inconsistent, complicating a unified view.
2-PHASE PROBABILISTIC DATA INTEGRATION PROCESS
Let's go for an initial integration that can readily and meaningfully be used: "good is good enough" for meaningful use in many applications (and can be achieved N times earlier). Then let it improve during use.
- Phase 1, initial integration: a 'quick-and-dirty' initial data integration; make the remaining problems explicit and estimate their likelihoods; store the result as a probabilistic representation for "data with problems" (PDB).
- Phase 2, continuous improvement: use the data, gather evidence, and improve data quality. This paper addresses the improvement step.
M. van Keulen, Probabilistic Data Integration. Encyclopedia of Big Data Technologies, Springer, 2018. DOI 10.1007/978-3-319-63962-8_18-1
OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (1/3)
Example: data items a1, a2, and a3, where it is unclear whether they should be in the database.
- Problem X with 3 cases: a1 is in the database in the first two cases, a2 only in the first and third.
- Problem Y with 2 cases: a3 is only in the database if a certain condition holds, a2 only if it doesn't.
X and Y are partitionings; atoms such as X=1 are labels; each combination of labels identifies one possible world containing a set of assertions:

            X=1 (70%)   X=2 (20%)   X=3 (10%)
Y=1 (60%)   {a1, a2}    {a1}        {a2}
Y=2 (40%)   {a1, a3}    {a1, a3}    {a3}
OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (2/3)
Compact representation of the set of possible worlds W(CPDB):
- Abstract notion of data item: we call them assertions (for a probabilistic relational model: assertion = tuple).
- Associate each assertion ai with a sentence 𝜑i. The pair (ai, 𝜑i) means: ai exists in all worlds for which 𝜑i is true.
- 𝜑i is a propositional formula over labels; labels are atoms of the form ω = v.
- Partitionings ω are mutually independent; the labels of one ω are mutually exclusive.
Example: <a2, ¬X=2 ⋀ Y=1>
B. Wanders, M. van Keulen, Revisiting the formal foundation of Probabilistic Databases. EUSFLAT 2015.
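The sentence mechanism can be sketched in a few lines of Python (a toy illustration of the model, not the paper's implementation; sentences are encoded as predicates over label assignments):

```python
from itertools import product

# A world is a total assignment of one value to each partitioning.
# An assertion is a pair (name, sentence); the sentence decides in
# which worlds the assertion exists.
partitionings = {"X": [1, 2, 3], "Y": [1, 2]}

db = [
    ("a1", lambda w: w["X"] != 3),                  # <a1, ¬X=3>
    ("a2", lambda w: w["X"] != 2 and w["Y"] == 1),  # <a2, ¬X=2 ⋀ Y=1>
    ("a3", lambda w: w["Y"] == 2),                  # <a3, Y=2>
]

def worlds():
    names = sorted(partitionings)
    for values in product(*(partitionings[n] for n in names)):
        yield dict(zip(names, values))

def contents(world):
    # ai exists in a world exactly when its sentence is true there.
    return {a for a, phi in db if phi(world)}

for w in worlds():
    print(w, sorted(contents(w)))
```

Enumerating all six worlds reproduces the table on the previous slide, e.g., the world X=1, Y=1 contains {a1, a2}.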
OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (3/3)
A probabilistic database is a 3-tuple CPDB = <DB, Ω, P>:
- (data) DB = { (a1,𝜑1), …, (an,𝜑n) }
- (partitionings) Ω is a set of partitionings ω
- (probabilities) function P assigns probabilities to labels
A world w is identified by a fully described sentence 𝜑:
- 𝜑 is a conjunction of one label from each partitioning
- 𝜑 can be seen as a name/identifier for world w
- Assertion ai exists in world w iff 𝜑 ⇒ 𝜑i
Example:
CPDB = <{<a1,¬X=3>, <a2,¬X=2⋀Y=1>, <a3,Y=2>}, {X,Y}, P>
P(X=1)=0.7; P(X=2)=0.2; P(X=3)=0.1; P(Y=1)=0.6; P(Y=2)=0.4
𝜑 = (X=1⋀Y=2) identifies w = {a1,a3} with P(w) = P(𝜑) = 0.7 × 0.4 = 0.28
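Because partitionings are independent, the probability of a world is the product of its label probabilities. A minimal numeric sketch of the example CPDB:

```python
from itertools import product

# P assigns a probability to every label of every partitioning.
P = {("X", 1): 0.7, ("X", 2): 0.2, ("X", 3): 0.1,
     ("Y", 1): 0.6, ("Y", 2): 0.4}
partitionings = {"X": [1, 2, 3], "Y": [1, 2]}

def world_probability(world):
    # Partitionings are independent: P(w) is the product of its labels.
    p = 1.0
    for name, value in world.items():
        p *= P[(name, value)]
    return p

names = sorted(partitionings)
all_worlds = [dict(zip(names, vs))
              for vs in product(*(partitionings[n] for n in names))]

# The world identified by 𝜑 = (X=1 ⋀ Y=2) has P(w) = 0.7 × 0.4 = 0.28,
# and the probabilities of all six worlds sum to 1.
print(world_probability({"X": 1, "Y": 2}))
print(sum(world_probability(w) for w in all_worlds))
```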
SCALABLE QUERYING
- Theory (query semantics): a query Q maps the possible worlds to the possible answers.
- Implementation: an extended query Q' works directly on the compact representation and produces a compact representation of the possible answers.
GENERAL APPROACH TO OBTAIN A PROBABILISTIC QUERY IMPLEMENTATION
Given any data model with its query language:
- Choose the data item to associate with a sentence.
- For every query operator ⨂ in the language's algebra, define an extended operator ⨷ = (⨂, 𝜏⨂), where 𝜏⨂ is a function that produces the sentence of a result from the sentences of the operands, in a manner appropriate for operation ⨂.
This produces a probabilistic variant of that data model + query language. It has been done for relational, XML, and Datalog; the paper has its example in JudgeD, a probabilistic Datalog.
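For the relational natural join, 𝜏⨂ simply conjoins the operand sentences, since a joined tuple exists exactly in the worlds where both operands exist. A hedged sketch (the function names and sentence encoding are assumed, not from the paper):

```python
# Each probabilistic tuple is a pair (tuple-as-dict, sentence); the
# sentence is kept symbolic as a nested formula such as
# ("and", s1, s2), with atoms like ("x", 2) for the label x=2.
def tau_join(phi1, phi2):
    # 𝜏⨂ for natural join: conjoin the operand sentences.
    return ("and", phi1, phi2)

def prob_join(r1, r2, on):
    # The extended operator ⨷ = (join, tau_join): join the tuples as
    # usual, and attach the combined sentence to each result.
    out = []
    for t1, phi1 in r1:
        for t2, phi2 in r2:
            if all(t1[k] == t2[k] for k in on):
                out.append(({**t1, **t2}, tau_join(phi1, phi2)))
    return out

r1 = [({"id": 1, "type": "person"}, ("x", 2))]
r2 = [({"id": 1, "ref": "France"}, ("z", 1))]
print(prob_join(r1, r2, on=["id"]))
```

The joined tuple carries the sentence ("and", ("x", 2), ("z", 1)): it exists only in worlds where both operand sentences hold.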
EXAMPLE: INFORMATION EXTRACTION FROM NATURAL LANGUAGE
Information extraction / natural language processing turns unstructured text into structured data.
"We humans happily deal with doubt and misinterpretation every day; why shouldn't computers?"
"Paris Hilton stayed in the Paris Hilton"
- Paris: Firstname
- Paris: City
- Paris Hilton: Person
- Paris Hilton: Hotel
- Paris Hilton: Fragrance
Which Paris? There are 60+ Parises (source: GeoNames). Which person? More people have that name! Which hotel? There are 3 in the capital of France.
OUR WORK
Instead of: NE detection → NE disambiguation
Do: NE candidate extraction → indeterministic NE disambiguation → cleaning (with enriched data)
Go for high recall at the expense of low precision; this leaves a lot of 'noise' to be cleaned later! The result is a pipeline where the intermediate results are probabilistic data.
Habib, M.B. and van Keulen, M. (2016) TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweets. Natural Language Engineering, 22(03), pp. 423-456. ISSN 1351-3249.
PROBABILISTIC REPRESENTATION FOR ANNOTATIONS AND REFERENCES
"Paris Hilton stayed in the Paris Hilton" (token positions 1-7)

Annotations:
ID  b  e  type       sentence
1   1  1  City       T1=1
2   1  2  Hotel      T2=1⋀X=0
3   1  2  Person     T2=1⋀X=1
4   1  2  Fragrance  T2=1⋀X=2
5   6  6  City       T3=1
6   6  7  Hotel      T4=1⋀Y=0
7   6  7  Person     T4=1⋀Y=1
:   :  :  :          :

References:
RID  ID  URL         sentence
1    1   France      Z=1
2    1   Texas, US   Z=2
3    2   La Defense  X=0⋀Z=1⋀A=1
4    2   Opera       X=0⋀Z=1⋀A=2
5    2   Orly        X=0⋀Z=1⋀A=3
6    5   France      B=1
7    5   Texas, US   B=2
:    :   :           :

This paper: if I have this, how do I clean it given some evidence?
CLEANING PROBABILISTIC DATA
Evidence comes from users, context, or analytics:
- Given a phrase that is a person, a part is never a city (hard knowledge rule).
- If a city is part of a hotel name, then it is more likely to refer to a city containing such a hotel (soft knowledge rule).
- "stayed in" suggests that what precedes it is more likely a person and what follows it is more likely a hotel (soft knowledge rule learnt from corpora).
Data integration problems produce "noise" in the data. Cleaning is aimed at filtering/reducing this noise = removing worlds or improving probabilities.
INTUITION: CLEANING PROBABILISTIC DATA = CONDITIONING = BAYESIAN UPDATING
"Paris Hilton stayed in the Paris Hilton"
Person --- dnc --- City: an inconsistency between two annotations:
[a] T1=1 ("Paris" is a City), P(a)=0.6
[b] T2=1⋀X=1 ("Paris Hilton" is a Person), P(b)=0.8
Before conditioning, a and b are independent:
  a∧b 0.48   a∧¬b 0.12   b∧¬a 0.32   ∅ 0.08
After conditioning, a and b are mutually exclusive (a∧b is not possible), and the remaining worlds are renormalized:
  a∧¬b 0.23   b∧¬a 0.62   ∅ 0.15
The annotations involved:
ID  b  e  type    sentence
1   1  1  City    T1=1
2   1  2  Hotel   T2=1⋀X=0
3   1  2  Person  T2=1⋀X=1
Their sentences (T1=1 for City, T2=1⋀X=1 for Person) become mutually exclusive.
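The renormalization on this slide can be reproduced numerically; a minimal sketch (the world names are just strings here):

```python
# a ("Paris" is a City) and b ("Paris Hilton" is a Person) start out
# independent with P(a)=0.6 and P(b)=0.8; the evidence makes a∧b
# impossible, and the remaining worlds are renormalized.
pa, pb = 0.6, 0.8
prior = {
    "a∧b":  pa * pb,               # 0.48
    "a∧¬b": pa * (1 - pb),         # 0.12
    "b∧¬a": (1 - pa) * pb,         # 0.32
    "∅":    (1 - pa) * (1 - pb),   # 0.08
}

# Bayesian updating: drop the impossible world, renormalize the rest.
posterior = {w: p for w, p in prior.items() if w != "a∧b"}
mass = sum(posterior.values())     # 0.52
posterior = {w: p / mass for w, p in posterior.items()}

for w, p in posterior.items():
    print(w, round(p, 2))          # a∧¬b 0.23, b∧¬a 0.62, ∅ 0.15
```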
OVERVIEW OF CONDITIONING APPROACH
1. Represent the integrated data as probabilistic facts and rules.
2. Represent the evidence as hard/soft rules (hard: the evidence is absolutely true; soft: the evidence is likely).
3. Incorporate the evidence by updating the database, deleting worlds that do not correspond with the evidence:
   a) Evaluate the rule to obtain an evidence sentence 𝜑e.
   b) Remap the partitionings in 𝜑e to a fresh one ω.
   c) Exclude inconsistent labels and renumber.
   d) Pe(𝜑e) is the remaining probability mass, to be distributed over the remaining worlds.
This constructs an updated CPDB' = <DB', Ω', P'>.
Notes: this is different from "observe" as in ProbLog, and it works directly on the compact representation.
THE PROCESS STEP BY STEP, IN THE PAPER: PROBABILISTIC DATALOG (JUDGED)
1. Represent the data integration result as facts and rules.
"Paris Hilton" is a hotel, person, or fragrance (x); "Paris" is a firstname or city (y).

annot:
ID  b  e  type       sentence   P
a1  1  2  hotel      x=1        P(x=1)=0.5
a2  1  2  person     x=2        P(x=2)=0.4
a3  1  2  fragrance  x=3        P(x=3)=0.1
a4  1  1  firstname  y=1        P(y=1)=0.3
a5  1  1  city       y=2        P(y=2)=0.7
THE PROCESS STEP BY STEP, IN THE PAPER: PROBABILISTIC DATALOG (JUDGED)
2. Represent the evidence as rules (a7, which uses a6), over the annot facts of the previous slide:

a6  contained(B1,E1,B2,E2) :- B1<=B2, E1<=E2.
a7  hardrule :- annot(Ph1,B1,E1,city),
                annot(Ph2,B2,E2,person),
                contained(B1,E1,B2,E2).

Rule a7 captures the hard knowledge "Person --- dnc --- City": it fires whenever a city annotation is contained in a person annotation.
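The rule evaluation can be sketched in plain Python (an illustrative rendering of a6/a7, not JudgeD itself; the fact encoding with an attached label is assumed):

```python
# Each annot fact carries its sentence as a label (partitioning, value).
annot = [
    ("Paris Hilton", 1, 2, "hotel",     ("x", 1)),
    ("Paris Hilton", 1, 2, "person",    ("x", 2)),
    ("Paris Hilton", 1, 2, "fragrance", ("x", 3)),
    ("Paris",        1, 1, "firstname", ("y", 1)),
    ("Paris",        1, 1, "city",      ("y", 2)),
]

def contained(b1, e1, b2, e2):        # rule a6
    return b1 <= b2 and e1 <= e2

def hardrule_sentences():             # rule a7
    # hardrule holds in every world where a city annotation is
    # contained in a person annotation; collect those label pairs.
    out = []
    for ph1, b1, e1, t1, s1 in annot:
        for ph2, b2, e2, t2, s2 in annot:
            if t1 == "city" and t2 == "person" and contained(b1, e1, b2, e2):
                out.append((s1, s2))
    return out

# hardrule holds exactly where y=2 and x=2 hold together, so the
# evidence sentence is the negation: 𝜑e = ¬(x=2 ⋀ y=2).
print(hardrule_sentences())
```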
THE PROCESS STEP BY STEP
3. Incorporate the evidence by updating the database: delete worlds that do not correspond with the evidence.
"Paris Hilton" is a hotel, person, or fragrance (x); "Paris" is a firstname or city (y). The rule assertions a6 and a7 are present in every world:

            x=1 (0.5)        x=2 (0.4)        x=3 (0.1)
y=1 (0.3)   {a1,a4,a6,a7}    {a2,a4,a6,a7}    {a3,a4,a6,a7}
y=2 (0.7)   {a1,a5,a6,a7}    {a2,a5,a6,a7}    {a3,a5,a6,a7}
THE PROCESS STEP BY STEP
3(a). Evaluate the rule to obtain the evidence sentence 𝜑e.
The query not(hardrule)? yields 𝜑e = ¬(x=2⋀y=2).
The inconsistent world (Person --- dnc --- City) is identified by (x=2⋀y=2); in general, 𝜑e represents many possible worlds.
THE PROCESS STEP BY STEP
3. Deleting the world identified by (x=2⋀y=2) is straightforward on the enumerated set of possible worlds, but we want to do this directly on the compact representation CPDB.
THE PROCESS STEP BY STEP
3(b). Remap the partitionings in 𝜑e to a fresh one ω.

On the possible worlds:
World 𝜑   W(𝜑)            P(𝜑)  Remapped  Consistent
x=1⋀y=1   {a1,a4,a6,a7}   0.15  z=1       ✓
x=2⋀y=1   {a2,a4,a6,a7}   0.12  z=2       ✓
x=3⋀y=1   {a3,a4,a6,a7}   0.03  z=3       ✓
x=1⋀y=2   {a1,a5,a6,a7}   0.35  z=4       ✓
x=2⋀y=2   {a2,a5,a6,a7}   0.28  z=5       ✕
x=3⋀y=2   {a3,a5,a6,a7}   0.07  z=6       ✓

On the CPDB:
- For every label, find a logical equivalent, e.g., x=1 ⇔ (z=1 ∨ z=4)
- DB: for every sentence 𝜑 in DB, replace the x- and y-labels, and simplify
- Ω: remove x and y from Ω and add z
- P: remove x and y from the domain of P and add P(z=1) … P(z=6)
No change in semantics: the set of possible worlds stays the same. 𝜑e = ¬(x=2⋀y=2) becomes ¬(z=5).
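The remapping step can be sketched as follows (a minimal illustration; the enumeration order of the combinations is an implementation choice, so the concrete z numbers differ from the ones on the slide):

```python
from itertools import product

# Step (b): introduce a fresh partitioning z with one label per
# combination of the partitionings mentioned in 𝜑e (here x and y),
# and record the equivalences needed to rewrite the sentences in DB.
old = {"x": [1, 2, 3], "y": [1, 2]}
names = sorted(old)                                  # ["x", "y"]
combos = list(product(*(old[n] for n in names)))

z_of = {c: k + 1 for k, c in enumerate(combos)}      # labels z=1..6
equiv = {}
for i, n in enumerate(names):
    for v in old[n]:
        # e.g. x=1 is equivalent to the disjunction of all z labels
        # whose underlying combination has x=1.
        equiv[(n, v)] = [z_of[c] for c in combos if c[i] == v]

print(equiv[("x", 1)])   # x=1 ⇔ z=1 ∨ z=2 under this enumeration
print(equiv[("y", 1)])   # y=1 ⇔ z=1 ∨ z=3 ∨ z=5
```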
THE PROCESS STEP BY STEP
3(c). Exclude inconsistent labels and renumber.

On the possible worlds:
World 𝜑   W(𝜑)            P(𝜑)  Remapped ↦ Renumbered  Consistent
x=1⋀y=1   {a1,a4,a6,a7}   0.15  z=1 ↦ z'=1             ✓
x=2⋀y=1   {a2,a4,a6,a7}   0.12  z=2 ↦ z'=2             ✓
x=3⋀y=1   {a3,a4,a6,a7}   0.03  z=3 ↦ z'=3             ✓
x=1⋀y=2   {a1,a5,a6,a7}   0.35  z=4 ↦ z'=4             ✓
x=2⋀y=2   {a2,a5,a6,a7}   0.28  z=5                    ✕
x=3⋀y=2   {a3,a5,a6,a7}   0.07  z=6 ↦ z'=5             ✓

On the CPDB (𝜑e = ¬(z=5)):
- DB: for every sentence 𝜑 in DB, replace z=5 by ⊥, and simplify; if 𝜑 ≡ ⊥, then delete <a,𝜑> from DB
- Ω: remove z from Ω and add z'
THE PROCESS STEP BY STEP
3(d). Pe(𝜑e) is the remaining probability mass, to be distributed over the remaining worlds.

On the possible worlds:
World 𝜑   W(𝜑)            P(𝜑)  Remapped ↦ Renumbered  Consistent  Pe
x=1⋀y=1   {a1,a4,a6,a7}   0.15  z=1 ↦ z'=1             ✓           0.2083
x=2⋀y=1   {a2,a4,a6,a7}   0.12  z=2 ↦ z'=2             ✓           0.1667
x=3⋀y=1   {a3,a4,a6,a7}   0.03  z=3 ↦ z'=3             ✓           0.0417
x=1⋀y=2   {a1,a5,a6,a7}   0.35  z=4 ↦ z'=4             ✓           0.4861
x=2⋀y=2   {a2,a5,a6,a7}   0.28  z=5                    ✕
x=3⋀y=2   {a3,a5,a6,a7}   0.07  z=6 ↦ z'=5             ✓           0.0972

On the CPDB:
- DB: for every sentence 𝜑 in DB, replace z=5 by ⊥, and simplify; if 𝜑 ≡ ⊥, then delete <a,𝜑> from DB
- Ω: remove z from Ω and add z'
- P: remove z from the domain of P and add P(z'=1) … P(z'=5)
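Steps (c) and (d) can be sketched numerically on the six-world table (a minimal illustration using the slide's probabilities):

```python
# Remove the label of the inconsistent world (z=5), renumber the
# remaining labels consecutively to z', and renormalize so that the
# lost probability mass is distributed proportionally.
Pz = {1: 0.15, 2: 0.12, 3: 0.03, 4: 0.35, 5: 0.28, 6: 0.07}
inconsistent = {5}

remaining = sorted((z, p) for z, p in Pz.items() if z not in inconsistent)
mass = sum(p for _, p in remaining)                 # Pe(𝜑e) = 0.72
renumber = {z: i + 1 for i, (z, _) in enumerate(remaining)}
P_new = {renumber[z]: p / mass for z, p in remaining}

for z, p in P_new.items():
    print(f"z'={z}: {p:.4f}")   # 0.2083, 0.1667, 0.0417, 0.4861, 0.0972
```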
THE PROCESS STEP BY STEP: THE RESULT AFTER CONDITIONING
"Paris Hilton"

Before:
ID  b  e  type       sentence   P
a1  1  2  hotel      x=1        P(x=1)=0.5
a2  1  2  person     x=2        P(x=2)=0.4
a3  1  2  fragrance  x=3        P(x=3)=0.1
a4  1  1  firstname  y=1        P(y=1)=0.3
a5  1  1  city       y=2        P(y=2)=0.7

After:
ID  b  e  type       sentence         P
a1  1  2  hotel      z=1 ∨ z=4        P(z=1)=0.2083
a2  1  2  person     z=2              P(z=2)=0.1667
a3  1  2  fragrance  z=3 ∨ z=5        P(z=3)=0.0417
a4  1  1  firstname  z=1 ∨ z=2 ∨ z=3  P(z=4)=0.4861
a5  1  1  city       z=4 ∨ z=5        P(z=5)=0.0972
SOFT RULES: WHAT IF THE RULE IS UNCERTAIN ITSELF?

a'7 softrule :- annot(Ph1,B1,E1,city),
                annot(Ph2,B2,E2,person),
                contained(B1,E1,B2,E2) [r=1].

We obtain 𝜑e = ¬(x=2⋀y=2⋀r=1).
Approach: condition as if it were a hard rule, but only effectuate it for the worlds W(r=1).
SOFT RULES: RESULT
The result looks scary, but it has the same number of assertions; there are only more partitionings and longer sentences.
Example: a1 with (r=0⋀x=1) ⋁ (r=1⋀(z=1⋁z=4))
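The effect on a1's probability can be sketched as follows. P(r=1) is the rule's confidence; the value 0.8 below is illustrative (an assumption, not from the paper), and the conditioned label probabilities come from the hard-rule result:

```python
# Soft-rule sketch: a1 now carries (r=0 ⋀ x=1) ⋁ (r=1 ⋀ (z=1 ⋁ z=4)).
# The two disjuncts are mutually exclusive on r, and r is independent
# of x and z, so the probabilities combine as a simple mixture.
p_r1 = 0.8                       # assumed rule confidence (illustrative)
P_x1 = 0.5                       # original P(x=1)
P_z1, P_z4 = 0.2083, 0.4861      # conditioned P(z=1), P(z=4)

p_a1 = (1 - p_r1) * P_x1 + p_r1 * (P_z1 + P_z4)
print(round(p_a1, 4))
```

With full confidence (P(r=1)=1) this degenerates to the hard-rule result; with P(r=1)=0 the original probability is kept.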
CONCLUSIONS
Probabilistic data integration is a two-phase data integration process:
1. Quick-and-dirty data integration with late cleaning.
2. Meaningful use with evidence gathering and cleaning.
Data integration problems are handled by means of a probabilistic data representation. The paper proposes an approach for cleaning with evidence:
- Evidence is expressed as hard and soft rules.
- Evidence is incorporated by updating the database.
- The approach is iterative and scalable.
- It allows for continuous data quality improvement.
(Quotes slide; only the attributions survive in this export: Francis Bacon, 1605; Jorge Luis Borges, 1979; and a saying often attributed to John Maynard Keynes, but due to Carveth Read, 1898.)
PROPERTIES OF THE APPROACH
Scalability:
- Obtaining the evidence sentence has the same complexity as querying.
- Remapping and redistribution of probabilities is exponential in the number of partitionings in 𝜑e. Assumption: uncertainty is local; a splitting approach is given in the paper.
- Simplification of sentences is also local.
- Database updating is worst case linear in the size of the DB.
- The resulting DB has the same size, but longer sentences; querying complexity doesn't change significantly.
- Iterative: one piece of evidence at a time.