RULE-BASED CONDITIONING
OF PROBABILISTIC DATA
MAURICE VAN KEULEN1, BENJAMIN KAMINSKI2
CHRISTOPH MATHEJA2, JOOST-PIETER KATOEN1,2
1 University of Twente
2 RWTH Aachen
CONTENTS

Motivation and Context
 Data integration problems
 Why a probabilistic approach to data integration?
Background
 Our probabilistic database model
The paper
 Conditioning a probabilistic database based on
evidence expressed as hard or soft rules
Conclusions

4 Oct 2018, SUM 2018 - Rule-based Conditioning of Probabilistic Data
DATA INTEGRATION
 It may be hard to extract information from certain kinds of
sources (e.g., natural language, websites).
 Information in a source may be missing, of bad quality, or
its meaning may be unclear.
 It may be unclear which data items in the sources should
be combined.
 Sources may be inconsistent, complicating a unified view.
DATA INTEGRATION
“Data integration involves combining data
residing in different sources
and providing users with a unified view of them”
(Lenzerini, 2002)
Let’s go for an initial
integration that can readily
and meaningfully be used
“Good is good enough” for
meaningful use in many
applications
(can be achieved
N times earlier)
Let it improve during use
2-PHASE PROBABILISTIC DATA INTEGRATION PROCESS
Use
Gather
evidence
Improve
data quality
‘Quick-and-dirty’
initial data integration
Make remaining problems
explicit & estimate likelihoods
Probabilistic representation
for “data with problems”
Initial integration / Continuous improvement
PDB
M. van Keulen, Probabilistic Data Integration.
Encyclopedia of Big Data Technologies, Springer,
2018. DOI 10.1007/978-3-319-63962-8_18-1
This paper
OUR PROBABILISTIC
DATABASE MODEL
Similar to probabilistic c-tables
Inspired by MayBMS
(C. Koch et al)
Example
 Data items a1, a2, and a3
unclear whether they
should be in the database
 Problem X with 3 cases:
a1 is in the database in
the first two cases, a2
only in the first and third.
 Problem Y with 2 cases:
a3 is only in the database if a
certain condition holds,
a2 only if it doesn't.
OUR PROBABILISTIC DATA MODEL
IS BASED ON POSSIBLE WORLDS THEORY (1/3)
[Figure: possible-worlds grid. Columns X=1, X=2, X=3 (70%, 20%, 10%);
rows Y=1 (60%), Y=2 (40%). a1 appears in all worlds with X=1 or X=2;
a2 in the worlds X=1⋀Y=1 and X=3⋀Y=1; a3 in all worlds with Y=2.
Callouts mark a partitioning, a label, a possible world, and an assertion.]
Compact representation of set of possible worlds W(CPDB)
Abstract notion of data item: we call them assertions
(for a probabilistic relational model: assertion = tuple)
Associate each assertion ai with a sentence 𝜑i
Meaning: ai exists in all worlds for which 𝜑i is true
 (ai,𝜑i)
where 𝜑i is a propositional formula over labels,
and labels are atoms of the form ω = v.
Partitionings ω are independent; the labels of one ω are mutually exclusive.
Example
 < a2, ¬X=2⋀Y=1 >
OUR PROBABILISTIC DATA MODEL
IS BASED ON POSSIBLE WORLDS THEORY (2/3)
B. Wanders, M. van Keulen, Revisiting
the formal foundation of Probabilistic
Databases. EUSFLAT 2015.
Compact representation of set of possible worlds W(CPDB)
A probabilistic database is a 3-tuple CPDB = <DB, Ω, P>
 (data) DB={ (a1,𝜑1), …, (an,𝜑n) }
(partitionings) Ω is a set of partitionings ω
(probabilities) function P assigns probabilities to labels
 A world w is identified by a fully described sentence 𝜑
𝜑: conjunction of one label from each partitioning
𝜑 can be seen as a name/identifier for world w
Assertion ai exists in world w iff 𝜑 ⇒ 𝜑i
Example
 CPDB=<{<a1,¬X=3>,<a2,¬X=2⋀Y=1>,<a3,Y=2>},{X,Y},P>
P(X=1)=0.7; P(X=2)=0.2; P(X=3)=0.1 P(Y=1)=0.6; P(Y=2)=0.4
 𝜑=(X=1⋀Y=2) w={a1,a3} P(w) = P(𝜑) = 0.7 x 0.4 = 0.28
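The example CPDB can be sketched in Python (a minimal illustrative encoding, not the paper's implementation: sentences are predicates over worlds, and names like `world_content` are ours):

```python
from itertools import product

# Sketch of the example CPDB = <DB, Omega, P> (illustrative encoding).
P = {"X": {1: 0.7, 2: 0.2, 3: 0.1}, "Y": {1: 0.6, 2: 0.4}}
DB = [  # (assertion a_i, sentence phi_i)
    ("a1", lambda w: w["X"] != 3),                  # ¬X=3
    ("a2", lambda w: w["X"] != 2 and w["Y"] == 1),  # ¬X=2 ⋀ Y=1
    ("a3", lambda w: w["Y"] == 2),                  # Y=2
]

def worlds():
    """Enumerate fully described worlds (one label per partitioning)."""
    names = list(P)
    for labels in product(*(P[n] for n in names)):
        w = dict(zip(names, labels))
        prob = 1.0
        for n in names:
            prob *= P[n][w[n]]  # partitionings are independent
        yield w, prob

def world_content(w):
    """a_i exists in world w iff its sentence phi_i holds in w."""
    return sorted(a for a, phi in DB if phi(w))

print(world_content({"X": 1, "Y": 2}))  # ['a1', 'a3']
p = next(p for w, p in worlds() if w == {"X": 1, "Y": 2})
print(round(p, 2))                      # 0.28 = P(X=1) * P(Y=2)
```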
OUR PROBABILISTIC DATA MODEL
IS BASED ON POSSIBLE WORLDS THEORY (3/3)
B. Wanders, M. van Keulen, Revisiting
the formal foundation of Probabilistic
Databases. EUSFLAT 2015.
SCALABLE QUERYING
[Diagram: query semantics vs. query implementation. In theory, query Q maps
the possible worlds to the possible answers; in the implementation, an
extended query Q' maps the compact representation to a compact representation
of the possible answers, and the two commute.]
Given any data model with its query language
 Choose data item to associate with sentence
 For every query operator ⨂ in language’s algebra
 Define extended operator ⨷=(⨂,𝜏⨂)
 Where 𝜏⨂ is a function that produces the sentence of
a result based on the sentences of the operands in a
manner that is appropriate for operation ⨂
This produces a probabilistic variant
for that data model + query language
 Done for relational, XML, and DataLog
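The extended-operator construction above can be sketched in Python for two relational operators (tuples as dicts, sentences as frozensets of label atoms read as conjunctions; all names here are ours, not the paper's):

```python
# Sketch of two extended operators ⨷ = (⨂, tau): the operator itself plus a
# sentence-combining function tau appropriate for that operator.
def ext_select(rel, pred):
    # tau for selection: a result tuple keeps its operand's sentence unchanged
    return [(t, phi) for t, phi in rel if pred(t)]

def ext_join(r, s, on):
    # tau for join: a result tuple's sentence conjoins both operands' sentences
    return [({**t, **u}, phi1 | phi2)
            for t, phi1 in r for u, phi2 in s if t[on] == u[on]]

R = [({"id": 2, "type": "Person"}, frozenset({"X=1"}))]
S = [({"id": 2, "url": "France"}, frozenset({"Z=1"}))]
joined = ext_join(R, S, "id")
print(joined[0][1] == frozenset({"X=1", "Z=1"}))  # True
```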
GENERAL APPROACH
TO OBTAIN PROBABILISTIC QUERY IMPLEMENTATION
B. Wanders, M. van Keulen, Revisiting
the formal foundation of Probabilistic
Databases. EUSFLAT 2015.
The paper has its
example in JudgeD,
a probabilistic
Datalog
THE PAPER
RULE-BASED CONDITIONING OF
PROBABILISTIC DATA
EXAMPLE:
INFORMATION EXTRACTION FROM NATURAL LANGUAGE
Unstructured
Text
Structured data
Information extraction /
natural language processing
“We humans happily deal with doubt and misinterpretation every day;
Why shouldn’t computers?”
“Paris Hilton stayed in the Paris Hilton”
Paris: Firstname
Paris: City
Paris Hilton: Person
Paris Hilton: Hotel
Paris Hilton: Fragrance
Which Paris? 60+ Parises (Source: geonames)
Which Person? More people with that name!
Which Hotel? There are 3 in capital of France
OUR WORK
Instead of: NE detection → NE disambiguation
Do: NE candidate extraction → Indeterministic NE disambiguation → Cleaning (with enriched data)
Habib, M.B. and van Keulen, M. (2016) TwitterNEED: a
hybrid approach for named entity extraction and
disambiguation for tweets. Natural language
engineering, 22 (03). pp. 423-456. ISSN 1351-3249
Go for high recall
at the expense of low
precision => a lot
of 'noise' to be
cleaned later!
Pipeline where intermediate results are
probabilistic data
PROBABILISTIC REPRESENTATION FOR
ANNOTATIONS AND REFERENCES
Annotations
“Paris Hilton stayed in the Paris Hilton”
   1     2      3     4   5    6     7

ID b e type      sentence
1  1 1 City      T1=1
2  1 2 Hotel     T2=1⋀X=0
3  1 2 Person    T2=1⋀X=1
4  1 2 Fragrance T2=1⋀X=2
5  6 6 City      T3=1
6  6 7 Hotel     T4=1⋀Y=0
7  6 7 Person    T4=1⋀Y=1
:  : : :         :

References
RID ID URL        sentence
1   1  France     Z=1
2   1  Texas, US  Z=2
3   2  La Defense X=0⋀Z=1⋀A=1
4   2  Opera      X=0⋀Z=1⋀A=2
5   2  Orly       X=0⋀Z=1⋀A=3
6   5  France     B=1
7   5  Texas, US  B=2
:   :  :          :
This paper:
If I have this, how
do I clean it given
some evidence?
Evidence from users or context or analytics
 Given a phrase that is a person, a part is never a city
(hard knowledge rule)
 If a city is part of a hotel name, then it is more likely to
refer to a city containing such a hotel
(soft knowledge rule)
 “stayed in” suggests that what precedes it is more
likely a person and what follows it is more likely a hotel
(soft knowledge rule learnt from corpora)
Data integration problems produce “noise” in the data.
Cleaning is aimed at filtering/reducing “noise”
= removing worlds or improving probabilities.
CLEANING PROBABILISTIC DATA
INTUITION: CLEANING PROBABILISTIC DATA
= CONDITIONING = BAYESIAN UPDATING
Paris Hilton stayed in the Paris Hilton
Person --- dnc --- City
inconsistency
T1=1 (“Paris” is a City) [a]
T2=1⋀X=1 (“Paris Hilton” is a Person) [b]
sentences become mutually exclusive
a and b independent: P(a)=0.6, P(b)=0.8
  a∧b: 0.48   a∧¬b: 0.12   b∧¬a: 0.32   ∅: 0.08
a and b mutually exclusive (a∧b is not possible), after conditioning:
  a∧¬b: 0.23   b∧¬a: 0.62   ∅: 0.15
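The Bayesian update on the four joint outcomes can be computed directly (a hedged sketch using the slide's P(a)=0.6, P(b)=0.8; the dict encoding is ours):

```python
# Conditioning on the evidence "a and b are mutually exclusive":
# drop the outcome a&b and renormalize the remaining mass.
pa, pb = 0.6, 0.8
joint = {
    "a&b":  pa * pb,              # 0.48, ruled out by the evidence
    "a&!b": pa * (1 - pb),        # 0.12
    "b&!a": (1 - pa) * pb,        # 0.32
    "none": (1 - pa) * (1 - pb),  # 0.08
}
evidence_mass = 1 - joint.pop("a&b")  # 0.52 remains
posterior = {k: v / evidence_mass for k, v in joint.items()}
print({k: round(v, 2) for k, v in posterior.items()})
# {'a&!b': 0.23, 'b&!a': 0.62, 'none': 0.15}
```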
ID b e type   sentence
1  1 1 City   T1=1
2  1 2 Hotel  T2=1⋀X=0
3  1 2 Person T2=1⋀X=1
1. Represent integrated data as probabilistic facts and rules
2. Represent evidence as hard/soft rules
 Hard: evidence is absolutely true
 Soft: evidence is likely
3. Incorporate evidence by updating the database
deleting worlds that do not correspond with evidence
a) Evaluate rule to obtain evidence sentence 𝜑e
b) Remap partitionings in 𝜑e to a fresh one ω
c) Exclude inconsistent labels and renumber
d) Pe(𝜑e) is remaining probability mass to be distributed
over the remaining worlds
This constructs an updated CPDB’=<DB’, Ω’, P’>
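The effect of step 3 can be sketched at the possible-worlds level (the paper's contribution is doing this directly on the compact representation; the encoding below is illustrative):

```python
# Minimal sketch of conditioning: delete worlds that violate the evidence
# and redistribute the lost probability mass over the remaining worlds.
def condition(world_probs, evidence):
    kept = {w: p for w, p in world_probs.items() if evidence(w)}
    mass = sum(kept.values())               # P(evidence)
    return {w: p / mass for w, p in kept.items()}

# Running example: x in {1,2,3}, y in {1,2}
worlds = {(x, y): px * py
          for x, px in [(1, 0.5), (2, 0.4), (3, 0.1)]
          for y, py in [(1, 0.3), (2, 0.7)]}
posterior = condition(worlds, lambda w: w != (2, 2))  # evidence ¬(x=2 ⋀ y=2)
print(round(posterior[(1, 2)], 4))                    # 0.4861
```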
OVERVIEW OF CONDITIONING APPROACH
(Different from “observe” as in ProbLog; works directly on the compact representation.)
1. Represent data integration result as facts and rules
THE PROCESS STEP BY STEP
IN THE PAPER: PROBABILISTIC DATALOG (JUDGED)
“Paris Hilton” is a hotel, person, or fragrance (x)
“Paris” is a firstname or city (y)
“Paris Hilton”
annot
ID b e type      sentence
a1 1 2 hotel     X=1
a2 1 2 person    X=2
a3 1 2 fragrance X=3
a4 1 1 firstname Y=1
a5 1 1 city      Y=2

P: X=1: 0.5, X=2: 0.4, X=3: 0.1; Y=1: 0.3, Y=2: 0.7
2. Represent evidence as rules (a7 which uses a6)
THE PROCESS STEP BY STEP
IN THE PAPER: PROBABILISTIC DATALOG (JUDGED)
a6 contained(B1,E1,B2,E2) :- B1<=B2, E1<=E2.
a7 hardrule :- annot(Ph1,B1,E1,city),
annot(Ph2,B2,E2,person),
contained(B1,E1,B2,E2).
Person
--- dnc ---
City
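A hedged Python rendering of rules a6/a7 (the list encoding and variable names are ours): enumerate the city/person annotation pairs whose spans satisfy `contained`, and collect the sentence under which the rule fires.

```python
from itertools import product

annots = [  # (id, begin, end, type, sentence as a set of label atoms)
    ("a2", 1, 2, "person", {"x=2"}),
    ("a5", 1, 1, "city",   {"y=2"}),
]

def contained(b1, e1, b2, e2):  # a6: B1<=B2, E1<=E2
    return b1 <= b2 and e1 <= e2

fires = [c[4] | p[4]            # a7: conjoin the operands' sentences
         for c, p in product(annots, annots)
         if c[3] == "city" and p[3] == "person"
         and contained(c[1], c[2], p[1], p[2])]
# hardrule holds exactly when x=2 ⋀ y=2; the evidence is its negation:
print(fires == [{"x=2", "y=2"}])  # True -> phi_e = ¬(x=2 ⋀ y=2)
```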
[Figure: possible-worlds grid over x (x=1: 0.5, x=2: 0.4, x=3: 0.1) and
y (y=1: 0.3, y=2: 0.7). Each world contains one of a1/a2/a3 (x), one of
a4/a5 (y), and always a6 and a7. “Paris Hilton” is a hotel, person, or
fragrance (x); “Paris” is a firstname or city (y).]
THE PROCESS STEP BY STEP
3. INCORPORATE EVIDENCE BY UPDATING THE DATABASE
deleting worlds that do not correspond with evidence
not(hardrule)?
𝜑e = ¬(x=2⋀y=2)
[Figure: the same possible-worlds grid. Person --- dnc --- City: the world
identified by (x=2⋀y=2), containing both a2 (person) and a5 (city), is
inconsistent. In general, 𝜑e represents many possible worlds.]
THE PROCESS STEP BY STEP
3. INCORPORATE EVIDENCE BY UPDATING THE DATABASE
(a) evaluate rule to obtain evidence sentence 𝜑e
not(hardrule)?
𝜑e = ¬(x=2⋀y=2)
THE PROCESS STEP BY STEP
3. INCORPORATE EVIDENCE BY UPDATING THE DATABASE
deleting worlds that do not correspond with evidence (𝜑e)
[Figure: the same grid with the inconsistent world (x=2⋀y=2) crossed out.
Goal: do this directly on the compact representation CPDB.
In general, 𝜑e represents many possible worlds.]
Worlds 𝜑  W(𝜑)           P(𝜑)  Remapped ↦ Renumbered  Consistent  Pe
x=1⋀y=1   {a1,a4,a6,a7}  0.15  z=1                    ✓
x=2⋀y=1   {a2,a4,a6,a7}  0.12  z=2                    ✓
x=3⋀y=1   {a3,a4,a6,a7}  0.03  z=3                    ✓
x=1⋀y=2   {a1,a5,a6,a7}  0.35  z=4                    ✓
x=2⋀y=2   {a2,a5,a6,a7}  0.28  z=5                    ✕
x=3⋀y=2   {a3,a5,a6,a7}  0.07  z=6                    ✓
THE PROCESS STEP BY STEP
3. INCORPORATE EVIDENCE BY UPDATING THE DATABASE
(b) remap partitionings in 𝜑e to a fresh one ω
• For every label, find logical equivalent, e.g., x=1 ⇔ (z=1 ∨ z=4)
• DB : For every sentence 𝜑 in DB, replace x- and y-labels, and simplify
• Ω : Remove x,y from Ω and add z
• P: Remove x,y from domain of P and add P(z=1) … P(z=6)
No change in semantics: same set of possible worlds.
𝜑e = ¬(x=2⋀y=2) becomes ¬(z=5)
(Shown both on the possible worlds and directly on the CPDB.)
Worlds 𝜑  W(𝜑)           P(𝜑)  Remapped ↦ Renumbered  Consistent  Pe
x=1⋀y=1   {a1,a4,a6,a7}  0.15  z=1 ↦ z'=1             ✓
x=2⋀y=1   {a2,a4,a6,a7}  0.12  z=2 ↦ z'=2             ✓
x=3⋀y=1   {a3,a4,a6,a7}  0.03  z=3 ↦ z'=3             ✓
x=1⋀y=2   {a1,a5,a6,a7}  0.35  z=4 ↦ z'=4             ✓
x=2⋀y=2   {a2,a5,a6,a7}  0.28  z=5                    ✕
x=3⋀y=2   {a3,a5,a6,a7}  0.07  z=6 ↦ z'=5             ✓
THE PROCESS STEP BY STEP
3. INCORPORATE EVIDENCE BY UPDATING THE DATABASE
(c) exclude inconsistent labels and renumber
• DB : For every sentence 𝜑 in DB, replace z=5 by ⊥, and simplify
if 𝜑 ≡ ⊥, then delete <a,𝜑> from DB
• Ω : Remove z from Ω and add z’
𝜑e = ¬(z=5)
Worlds 𝜑  W(𝜑)           P(𝜑)  Remapped ↦ Renumbered  Consistent  Pe
x=1⋀y=1   {a1,a4,a6,a7}  0.15  z=1 ↦ z'=1             ✓           0.2083
x=2⋀y=1   {a2,a4,a6,a7}  0.12  z=2 ↦ z'=2             ✓           0.1667
x=3⋀y=1   {a3,a4,a6,a7}  0.03  z=3 ↦ z'=3             ✓           0.0417
x=1⋀y=2   {a1,a5,a6,a7}  0.35  z=4 ↦ z'=4             ✓           0.4861
x=2⋀y=2   {a2,a5,a6,a7}  0.28  z=5                    ✕
x=3⋀y=2   {a3,a5,a6,a7}  0.07  z=6 ↦ z'=5             ✓           0.0972
THE PROCESS STEP BY STEP
3. INCORPORATE EVIDENCE BY UPDATING THE DATABASE
(d) Pe(𝜑e) is remaining probability mass to be distributed over the remaining worlds
• DB : For every sentence 𝜑 in DB, replace z=5 by ⊥, and simplify
if 𝜑 ≡ ⊥, then delete <a,𝜑> from DB
• Ω : Remove z from Ω and add z’
• P: Remove z from domain of P and add P(z’=1) … P(z’=5)
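Steps (b)-(d) on the running example can be sketched as follows (the string/list encoding is ours; each consistent world is remapped to one label of a fresh partitioning z', then P is rescaled):

```python
# Remap, exclude the inconsistent label, renumber, and redistribute the mass.
old = [("x=1,y=1", 0.15), ("x=2,y=1", 0.12), ("x=3,y=1", 0.03),
       ("x=1,y=2", 0.35), ("x=2,y=2", 0.28), ("x=3,y=2", 0.07)]
consistent = [(w, p) for w, p in old if w != "x=2,y=2"]  # exclude z=5
mass = sum(p for _, p in consistent)                     # 0.72 remains
P_new = {"z'=%d" % (i + 1): round(p / mass, 4)           # renumber + rescale
         for i, (_, p) in enumerate(consistent)}
print(P_new["z'=1"], P_new["z'=5"])  # 0.2083 0.0972
```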
THE PROCESS STEP BY STEP
THE RESULT AFTER CONDITIONING
“Paris Hilton”

Before:
ID b e type      sentence
a1 1 2 hotel     x=1
a2 1 2 person    x=2
a3 1 2 fragrance x=3
a4 1 1 firstname y=1
a5 1 1 city      y=2
P: x=1: 0.5, x=2: 0.4, x=3: 0.1; y=1: 0.3, y=2: 0.7

After:
ID b e type      sentence
a1 1 2 hotel     z=1 ∨ z=4
a2 1 2 person    z=2
a3 1 2 fragrance z=3 ∨ z=5
a4 1 1 firstname z=1 ∨ z=2 ∨ z=3
a5 1 1 city      z=4 ∨ z=5
P: z=1: 0.2083, z=2: 0.1667, z=3: 0.0417, z=4: 0.4861, z=5: 0.0972
SOFT RULES
WHAT IF THE RULE IS UNCERTAIN ITSELF?

a'7 softrule :- annot(Ph1,B1,E1,city),
                annot(Ph2,B2,E2,person),
                contained(B1,E1,B2,E2) [r=1].

We obtain 𝜑e = ¬(x=2⋀y=2⋀r=1)
Approach
 Condition as if it were a hard rule
 Only effectuate it for the worlds W(r=1)
SOFT RULES
RESULT
Looks scary, but the same number of assertions,
only more partitionings and longer sentences.
Example: a1 with (r=0⋀x=1) ⋁ (r=1⋀(z=1⋁z=4))
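The longer sentence of a1 is a mixture over the rule label r, which can be sketched as a predicate over worlds (illustrative encoding; in the compact representation the world assigns a label to every partitioning):

```python
# Evaluating a1's conditioned sentence: (r=0 ⋀ x=1) ⋁ (r=1 ⋀ (z=1 ⋁ z=4)).
# r=0: the original x-labels apply; r=1: the conditioned z-labels apply.
def a1_sentence(w):
    return (w["r"] == 0 and w["x"] == 1) or (w["r"] == 1 and w["z"] in (1, 4))

print(a1_sentence({"r": 1, "x": 2, "z": 4}))  # True  (conditioned branch)
print(a1_sentence({"r": 0, "x": 2, "z": 4}))  # False (original branch, x != 1)
```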
Probabilistic data integration
 Two-phase data integration process
1. Quick-and-dirty data integration with late cleaning
2. Meaningful use with evidence gathering & cleaning
 Data integration problems handled by means of a
probabilistic data representation
Paper proposes approach for cleaning with evidence
 Evidence expressed as hard and soft rules
 Incorporates evidence by updating the database
 Iterative and scalable
Allows for continuous data quality improvement
CONCLUSIONS
 Scalability
 Obtaining the evidence sentence has the same complexity
as querying
 Remapping and redistribution of probabilities is
exponential in the number of partitionings in 𝜑e
 Assumption: uncertainty is local
 Splitting approach in the paper
 Simplification of sentences is also local
 Database updating is worst case linear in the size of the DB
 Resulting DB has the same size, but longer sentences
 Querying complexity doesn't change significantly
 Iterative: one piece of evidence at a time
PROPERTIES OF THE APPROACH
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 

PDI Conditioning - SUM 2018, Milan, 4 October 2018

  • 1. RULE-BASED CONDITIONING OF PROBABILISTIC DATA. MAURICE VAN KEULEN1, BENJAMIN KAMINSKI2, CHRISTOPH MATHEJA2, JOOST-PIETER KATOEN1,2 (1 University of Twente, 2 RWTH Aachen)
  • 2. CONTENTS: Motivation and Context  Data integration problems  Why a probabilistic approach to data integration? Background  Our probabilistic database model The paper  Conditioning a probabilistic database based on evidence expressed as hard or soft rules Conclusions (4 Oct 2018, SUM 2018 - Rule-based Conditioning of Probabilistic Data)
  • 4. DATA INTEGRATION: “Data integration involves combining data residing in different sources and providing users with a unified view of them” (Lenzerini).  It may be hard to extract information from certain kinds of sources (e.g., natural language, websites).  Information in a source may be missing, of bad quality, or its meaning may be unclear.  It may be unclear which data items in the sources should be combined.  Sources may be inconsistent, complicating a unified view.
  • 5. 2-PHASE PROBABILISTIC DATA INTEGRATION PROCESS. Let's go for an initial integration that can readily and meaningfully be used: “good is good enough” for meaningful use in many applications (and can be achieved N times earlier); let it improve during use. Initial integration: ‘quick-and-dirty’ initial data integration; make remaining problems explicit & estimate likelihoods; probabilistic representation for “data with problems” (PDB). Continuous improvement: use, gather evidence, improve data quality. This paper addresses the evidence-based improvement step. M. van Keulen, Probabilistic Data Integration. Encyclopedia of Big Data Technologies, Springer, 2018. DOI 10.1007/978-3-319-63962-8_18-1.
  • 6. OUR PROBABILISTIC DATABASE MODEL: similar to probabilistic c-tables; inspired by MayBMS (C. Koch et al.).
  • 7. OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (1/3). Example:  Data items a1, a2, and a3; it is unclear whether they should be in the database.  Problem X with 3 cases: a1 is in the database in the first two cases, a2 only in the first and third.  Problem Y with 2 cases: a3 is only in the database if a certain condition holds, a2 only if it doesn't. [diagram: six possible worlds spanned by partitioning labels X=1 (70%), X=2 (20%), X=3 (10%) and Y=1 (60%), Y=2 (40%); each world asserts a subset of {a1, a2, a3}]
  • 8. OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (2/3). Compact representation of the set of possible worlds W(CPDB). Abstract notion of data item: we call them assertions (for a probabilistic relational model: assertion = tuple). Associate each assertion ai with a sentence 𝜑i. Meaning: ai exists in all worlds for which 𝜑i is true.  (ai,𝜑i), where 𝜑i is a propositional formula over labels l, and labels are atoms of the form ω = v. Partitionings ωi are independent; the labels of one ω are mutually exclusive. Example:  < a2, ¬X=2 ⋀ Y=1 >. B. Wanders, M. van Keulen, Revisiting the formal foundation of Probabilistic Databases. EUSFLAT 2015.
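The sentence-annotated assertions above can be sketched in a few lines of Python. This is an illustrative sketch, not the JudgeD implementation; the name `in_world` and the lambda encoding of sentences are my own.

```python
# A world assigns one value to each partitioning, e.g. {"X": 1, "Y": 2}.
# A sentence is encoded here as a predicate over such assignments, and an
# assertion pairs a data item with its sentence: the item exists in
# exactly those worlds where the sentence is true.

def in_world(assertion, world):
    """Return True iff the assertion's sentence holds in the given world."""
    _, sentence = assertion
    return sentence(world)

# The slide's example: < a2, not(X=2) and Y=1 >
a2 = ("a2", lambda w: w["X"] != 2 and w["Y"] == 1)

print(in_world(a2, {"X": 1, "Y": 1}))  # a2 exists: X is not 2 and Y = 1
print(in_world(a2, {"X": 2, "Y": 1}))  # a2 does not exist: X = 2
```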
  • 9. OUR PROBABILISTIC DATA MODEL IS BASED ON POSSIBLE WORLDS THEORY (3/3). A probabilistic database is a 3-tuple CPDB = <DB, Ω, P>:  (data) DB = { (a1,𝜑1), …, (an,𝜑n) }; (partitionings) Ω is a set of partitionings ω; (probabilities) function P assigns probabilities to labels.  A world w is identified by a fully described sentence 𝜑: a conjunction of one label from each partitioning; 𝜑 can be seen as a name/identifier for world w. Assertion ai exists in world w iff 𝜑 ⇒ 𝜑i. Example:  CPDB = <{<a1,¬X=3>, <a2,¬X=2⋀Y=1>, <a3,Y=2>}, {X,Y}, P>; P(X=1)=0.7; P(X=2)=0.2; P(X=3)=0.1; P(Y=1)=0.6; P(Y=2)=0.4.  𝜑=(X=1⋀Y=2) identifies w={a1,a3} with P(w) = P(𝜑) = 0.7 × 0.4 = 0.28. B. Wanders, M. van Keulen, Revisiting the formal foundation of Probabilistic Databases. EUSFLAT 2015.
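The example CPDB can be checked mechanically. A minimal Python sketch (an assumed encoding, not the authors' code) that enumerates the fully described worlds, the assertions they contain, and their probabilities:

```python
from itertools import product

# The slide's CPDB: assertions with sentences over partitionings X and Y.
DB = {
    "a1": lambda w: w["X"] != 3,
    "a2": lambda w: w["X"] != 2 and w["Y"] == 1,
    "a3": lambda w: w["Y"] == 2,
}
P = {("X", 1): 0.7, ("X", 2): 0.2, ("X", 3): 0.1,
     ("Y", 1): 0.6, ("Y", 2): 0.4}

def worlds():
    """Yield (world, contained assertions, probability) for each world."""
    for x, y in product([1, 2, 3], [1, 2]):
        w = {"X": x, "Y": y}
        prob = P[("X", x)] * P[("Y", y)]  # partitionings are independent
        items = {a for a, phi in DB.items() if phi(w)}
        yield w, items, prob

for w, items, prob in worlds():
    print(w, sorted(items), round(prob, 2))
# The world X=1, Y=2 contains {a1, a3} with probability 0.7 * 0.4 = 0.28.
```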
  • 10. SCALABLE QUERYING. [diagram: commuting square. Theory: query semantics Q maps the possible worlds to the possible answers. Implementation: query implementation Q' maps the compact representation to a representation of the possible answers.]
  • 11. GENERAL APPROACH TO OBTAIN A PROBABILISTIC QUERY IMPLEMENTATION. Given any data model with its query language:  Choose the data item to associate with a sentence.  For every query operator ⨂ in the language's algebra, define an extended operator ⨷=(⨂,𝜏⨂), where 𝜏⨂ is a function that produces the sentence of a result from the sentences of the operands in a manner appropriate for operation ⨂. This yields a probabilistic variant of that data model + query language.  Done for relational, XML, and Datalog. B. Wanders, M. van Keulen, Revisiting the formal foundation of Probabilistic Databases. EUSFLAT 2015. The paper has an example in JudgeD = probabilistic Datalog.
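To illustrate the extended-operator idea (a sketch under assumed encodings, not the paper's formal definition): for a natural join, 𝜏⨂ is simply the conjunction of the operand sentences.

```python
# Each tuple of a probabilistic relation carries (data, sentence). The
# extended join pairs the ordinary join with tau: the result's sentence
# is the conjunction of the operands' sentences, so the joined tuple
# exists exactly in the worlds where both operands exist.

def tau_join(phi1, phi2):
    """tau for the join operator."""
    return lambda w: phi1(w) and phi2(w)

def ext_join(r1, r2, key):
    out = []
    for t1, phi1 in r1:
        for t2, phi2 in r2:
            if t1[key] == t2[key]:
                out.append(({**t1, **t2}, tau_join(phi1, phi2)))
    return out

# Tiny usage example (hypothetical data in the style of the slides):
annots = [({"id": 1, "type": "city"}, lambda w: w["X"] == 1)]
refs = [({"id": 1, "url": "France"}, lambda w: w["Z"] == 1)]
(joined, phi), = ext_join(annots, refs, "id")
print(joined, phi({"X": 1, "Z": 1}), phi({"X": 1, "Z": 2}))
```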
  • 12. THE PAPER: RULE-BASED CONDITIONING OF PROBABILISTIC DATA
  • 13. EXAMPLE: INFORMATION EXTRACTION FROM NATURAL LANGUAGE. Unstructured text → structured data via information extraction / natural language processing. “We humans happily deal with doubt and misinterpretation every day; why shouldn't computers?” “Paris Hilton stayed in the Paris Hilton”: Paris: firstname / city; Paris Hilton: person / hotel / fragrance. Which Paris? 60+ Parises (source: GeoNames). Which person? More people with that name! Which hotel? There are 3 in the capital of France.
  • 14. OUR WORK. Instead of: NE detection → NE disambiguation. Do: NE candidate extraction → indeterministic NE disambiguation → cleaning (with enriched data): a pipeline whose intermediate results are probabilistic data. Go for high recall at the expense of low precision => a lot of 'noise' to be cleaned later! Habib, M.B. and van Keulen, M. (2016) TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweets. Natural Language Engineering, 22(03), pp. 423-456. ISSN 1351-3249.
  • 15. PROBABILISTIC REPRESENTATION FOR ANNOTATIONS AND REFERENCES. “Paris Hilton stayed in the Paris Hilton” (token positions 1-7). Annotations (ID, b, e, type, sentence): 1 1 1 City (T1=1); 2 1 2 Hotel (T2=1⋀X=0); 3 1 2 Person (T2=1⋀X=1); 4 1 2 Fragrance (T2=1⋀X=2); 5 6 6 City (T3=1); 6 6 7 Hotel (T4=1⋀Y=0); 7 6 7 Person (T4=1⋀Y=1); … References (RID, ID, URL, sentence): 1 1 France (Z=1); 2 1 Texas, US (Z=2); 3 2 La Defense (X=0⋀Z=1⋀A=1); 4 2 Opera (X=0⋀Z=1⋀A=2); 5 2 Orly (X=0⋀Z=1⋀A=3); 6 5 France (B=1); 7 5 Texas, US (B=2); … This paper: if I have this, how do I clean it given some evidence?
  • 16. CLEANING PROBABILISTIC DATA. Evidence from users, context, or analytics:  Given a phrase that is a person, a part of it is never a city (hard knowledge rule).  If a city is part of a hotel name, then it more likely refers to a city containing such a hotel (soft knowledge rule).  “stayed in” suggests that what precedes it is more likely a person and what follows it is more likely a hotel (soft knowledge rule learnt from corpora). Data integration problems produce “noise” in the data. Cleaning is aimed at filtering/reducing “noise” = removing worlds or improving probabilities.
  • 17. INTUITION: CLEANING PROBABILISTIC DATA = CONDITIONING = BAYESIAN UPDATING. “Paris Hilton stayed in the Paris Hilton”: Person --- dnc --- City inconsistency. a: T1=1 (“Paris” is a City); b: T2=1⋀X=1 (“Paris Hilton” is a Person). With a and b independent, P(a)=0.6 and P(b)=0.8: a∧b 0.48, a∧¬b 0.12, b∧¬a 0.32, ∅ 0.08. The evidence makes a and b mutually exclusive (a∧b is not possible), so the sentences become mutually exclusive and the mass is redistributed: a∧¬b 0.23, b∧¬a 0.62, ∅ 0.15. Annotations (ID, b, e, type, sentence): 1 1 1 City (T1=1); 2 1 2 Hotel (T2=1⋀X=0); 3 1 2 Person (T2=1⋀X=1).
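The redistribution on this slide can be reproduced in a few lines of Python (a sketch; the world encoding is mine):

```python
from itertools import product

# P(a)=0.6 ("Paris" is a City), P(b)=0.8 ("Paris Hilton" is a Person);
# initially independent. Evidence: a and b are mutually exclusive.
p_a, p_b = 0.6, 0.8
joint = {(a, b): (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
         for a, b in product([True, False], repeat=2)}

# Conditioning: delete the inconsistent world (a and b), renormalize.
consistent = {w: p for w, p in joint.items() if not (w[0] and w[1])}
mass = sum(consistent.values())                  # 1 - 0.48 = 0.52
posterior = {w: p / mass for w, p in consistent.items()}

for (a, b), p in posterior.items():
    print(a, b, round(p, 2))
# a-only: 0.23, b-only: 0.62, neither: 0.15 -- as on the slide.
```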
  • 18. OVERVIEW OF CONDITIONING APPROACH. 1. Represent integrated data as probabilistic facts and rules. 2. Represent evidence as hard/soft rules (hard: the evidence is absolutely true; soft: the evidence is likely). 3. Incorporate the evidence by updating the database, deleting worlds that do not agree with the evidence: a) evaluate the rule to obtain the evidence sentence 𝜑e; b) remap the partitionings in 𝜑e to a fresh one ω; c) exclude inconsistent labels and renumber; d) Pe(𝜑e) is the remaining probability mass, to be distributed over the remaining worlds. This constructs an updated CPDB' = <DB', Ω', P'>. (Different from “observe” as in ProbLog.) (Operates directly on the compact representation.)
  • 19. THE PROCESS STEP BY STEP IN THE PAPER: PROBABILISTIC DATALOG (JUDGED). 1. Represent the data integration result as facts and rules. “Paris Hilton” is a hotel, person, or fragrance (x); “Paris” is a firstname or city (y). annot (ID, b, e, type, sentence): a1 1 2 hotel (X=1); a2 1 2 person (X=2); a3 1 2 fragrance (X=3); a4 1 1 firstname (Y=1); a5 1 1 city (Y=2). P: X=1 0.5; X=2 0.4; X=3 0.1; Y=1 0.3; Y=2 0.7.
  • 20. THE PROCESS STEP BY STEP IN THE PAPER: PROBABILISTIC DATALOG (JUDGED). 2. Represent evidence as rules (a7, which uses a6): a6 contained(B1,E1,B2,E2) :- B1<=B2, E1<=E2. a7 hardrule :- annot(Ph1,B1,E1,city), annot(Ph2,B2,E2,person), contained(B1,E1,B2,E2). (Person --- dnc --- City; annot table and P as on the previous slide.)
  • 21. THE PROCESS STEP BY STEP. 3. Incorporate the evidence by updating the database, deleting worlds that do not agree with the evidence. “Paris Hilton” is a hotel, person, or fragrance (x); “Paris” is a firstname or city (y). [diagram: six worlds spanned by x=1 (0.5), x=2 (0.4), x=3 (0.1) and y=1 (0.3), y=2 (0.7); every world contains a6 and a7, plus one of a1/a2/a3 together with a4 (y=1) or a5 (y=2)]
  • 22. THE PROCESS STEP BY STEP. 3(a). Evaluate the rule to obtain the evidence sentence 𝜑e: not(hardrule)? gives 𝜑e = ¬(x=2⋀y=2). (Person --- dnc --- City.) The inconsistent world is identified by (x=2⋀y=2); in general 𝜑e represents many possible worlds. [diagram: the six worlds as before, with the world x=2⋀y=2 marked inconsistent]
  • 23. THE PROCESS STEP BY STEP. 3. Incorporate the evidence by updating the database: delete the worlds that do not agree with the evidence (𝜑e = ¬(x=2⋀y=2)). I want to do this directly on the compact representation CPDB. [diagram: the six worlds with the world x=2⋀y=2 removed]
  • 24. THE PROCESS STEP BY STEP. 3(b). Remap the partitionings in 𝜑e to a fresh one ω. Worlds (𝜑 | W(𝜑) | P(𝜑) | remapped | consistent): x=1⋀y=1 | {a1,a4,a6,a7} | 0.15 | z=1 | ✓; x=2⋀y=1 | {a2,a4,a6,a7} | 0.12 | z=2 | ✓; x=3⋀y=1 | {a3,a4,a6,a7} | 0.03 | z=3 | ✓; x=1⋀y=2 | {a1,a5,a6,a7} | 0.35 | z=4 | ✓; x=2⋀y=2 | {a2,a5,a6,a7} | 0.28 | z=5 | ✕; x=3⋀y=2 | {a3,a5,a6,a7} | 0.07 | z=6 | ✓. On the CPDB: for every label, find a logical equivalent, e.g., x=1 ⇔ (z=1 ∨ z=4). DB: for every sentence 𝜑 in DB, replace x- and y-labels and simplify. Ω: remove x,y from Ω and add z. P: remove x,y from the domain of P and add P(z=1) … P(z=6). No change in semantics: same set of possible worlds. 𝜑e = ¬(x=2⋀y=2) becomes ¬(z=5).
  • 25. THE PROCESS STEP BY STEP. 3(c). Exclude inconsistent labels and renumber: z=1 ↦ z'=1; z=2 ↦ z'=2; z=3 ↦ z'=3; z=4 ↦ z'=4; z=5 excluded (✕); z=6 ↦ z'=5. 𝜑e = ¬(z=5). On the CPDB: DB: for every sentence 𝜑 in DB, replace z=5 by ⊥ and simplify; if 𝜑 ≡ ⊥, then delete <a,𝜑> from DB. Ω: remove z from Ω and add z'.
  • 26. THE PROCESS STEP BY STEP. 3(d). Pe(𝜑e) is the remaining probability mass, distributed over the remaining worlds: z'=1: 0.15/0.72 = 0.2083; z'=2: 0.12/0.72 = 0.1667; z'=3: 0.03/0.72 = 0.0417; z'=4: 0.35/0.72 = 0.4861; z'=5: 0.07/0.72 = 0.0972. On the CPDB: P: remove z from the domain of P and add P(z'=1) … P(z'=5).
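Steps (b)-(d) can be checked on the possible-worlds view with a short sketch. This is illustrative only: it enumerates the worlds, whereas the paper's approach operates directly on the compact representation.

```python
# Worlds over partitionings x and y, with the slides' probabilities.
P_x = {1: 0.5, 2: 0.4, 3: 0.1}
P_y = {1: 0.3, 2: 0.7}
worlds = {(x, y): P_x[x] * P_y[y] for x in P_x for y in P_y}

# Evidence sentence phi_e = not(x=2 and y=2).
def phi_e(x, y):
    return not (x == 2 and y == 2)

# Remap each consistent world to a fresh label z', drop the inconsistent
# one, and redistribute the remaining mass Pe over the survivors.
survivors = [(w, p) for w, p in worlds.items() if phi_e(*w)]
mass = sum(p for _, p in survivors)              # 1 - 0.28 = 0.72
P_z = {i + 1: p / mass for i, (_, p) in enumerate(survivors)}

for z, p in P_z.items():
    print(f"z'={z}: {p:.4f}")
# The renormalized masses are 0.15/0.72, 0.12/0.72, 0.03/0.72,
# 0.35/0.72, and 0.07/0.72 -- the Pe column on the slide.
```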
  • 27. THE PROCESS STEP BY STEP: THE RESULT AFTER CONDITIONING. Before: a1 1 2 hotel (x=1); a2 1 2 person (x=2); a3 1 2 fragrance (x=3); a4 1 1 firstname (y=1); a5 1 1 city (y=2), with P: x=1 0.5; x=2 0.4; x=3 0.1; y=1 0.3; y=2 0.7. After: a1 (z=1 ∨ z=4); a2 (z=2); a3 (z=3 ∨ z=5); a4 (z=1 ∨ z=2 ∨ z=3); a5 (z=4 ∨ z=5), with P: z=1 0.2083; z=2 0.1667; z=3 0.0417; z=4 0.4861; z=5 0.0972.
  • 28. SOFT RULES: WHAT IF THE RULE IS UNCERTAIN ITSELF? a'7 softrule :- annot(Ph1,B1,E1,city), annot(Ph2,B2,E2,person), contained(B1,E1,B2,E2) [r=1]. We obtain 𝜑e = ¬(x=2⋀y=2⋀r=1). Approach: condition as if it were a hard rule, but only effectuate it for the worlds in W(r=1).
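A sketch of the soft-rule case. The rule strength P(r=1)=0.8 below is an assumed value for illustration, not taken from the paper.

```python
# The soft rule carries its own label r; conditioning removes only the
# worlds where the rule applies (r=1) AND is violated (x=2 and y=2).
P_x = {1: 0.5, 2: 0.4, 3: 0.1}
P_y = {1: 0.3, 2: 0.7}
P_r = {0: 0.2, 1: 0.8}  # assumed rule strength

worlds = {(x, y, r): P_x[x] * P_y[y] * P_r[r]
          for x in P_x for y in P_y for r in P_r}

def phi_e(x, y, r):
    return not (x == 2 and y == 2 and r == 1)

mass = sum(p for w, p in worlds.items() if phi_e(*w))
posterior = {w: p / mass for w, p in worlds.items() if phi_e(*w)}

# Marginal that "Paris Hilton" is a person (x=2): reduced, but by less
# than under the hard rule, since r=0 worlds keep x=2, y=2 possible.
p_person = sum(p for (x, _, _), p in posterior.items() if x == 2)
print(round(p_person, 4))
```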
  • 29. SOFT RULES: RESULT. Looks scary, but there is the same number of assertions; there are only more partitionings and longer sentences. Example: a1 with (r=0⋀x=1) ⋁ (r=1⋀(z=1⋁z=4)).
  • 30. CONCLUSIONS. Probabilistic data integration:  a two-phase data integration process: 1. quick-and-dirty data integration with late cleaning; 2. meaningful use with evidence gathering & cleaning.  Data integration problems are handled by means of a probabilistic data representation. The paper proposes an approach for cleaning with evidence:  evidence expressed as hard and soft rules;  evidence incorporated by updating the database;  iterative and scalable. This allows for continuous data quality improvement.
  • 31. (Francis Bacon, 1605) (Jorge Luis Borges, 1979) (often attributed to John Maynard Keynes, but Carveth Read, 1898)
  • 32. PROPERTIES OF THE APPROACH.  Scalability:  Obtaining the evidence sentence has the same complexity as querying.  Remapping and redistribution of probabilities is exponential in the number of partitionings in 𝜑e (assumption: uncertainty is local; splitting approach in the paper; simplification of sentences is also local).  Updating the database is worst case linear in the size of the DB.  The resulting DB has the same size, but longer sentences.  Querying complexity doesn't change significantly.  Iterative: one piece of evidence at a time.