The slides are based on the poster on "Reconstructing Provenance" from the ISWC 2012 doctoral consortium. They try to summarize how I picture the next 3 years of my PhD.
1. Reconstructing Provenance Sara Magliacane - VU University Amsterdam
Advisors: Paul Groth and Frank van Harmelen
Problem Statement
The provenance of a data item is the metadata describing how, when and by whom the data item was produced.
Provenance is crucial in many settings, but often it is not tracked, resulting in collections of files with only basic filesystem metadata, e.g. timestamps.
In this case, is it possible to reconstruct provenance post hoc?
[Figure: a file activity log whose entries only say "you edited the file poster.ppt", the question "How can we get more information?", and the same entries enriched with details such as "adding an image from the file paper.pdf"]

An initial prototype implementation
As a first step we focus on dependencies between files instead of sequences of operations.
We implemented a prototype of the pipeline using open-source components, like Apache Lucene, Apache Tika and the Dropbox API.
As signal detectors we used well-known similarity measures.
[Figure: prototype pipeline. Docs -> Preprocessing (extract metadata, index content, infer semantic types) -> Hypotheses Generation (text similarity, image similarity, domain-specific similarity, metadata similarity) -> Hypotheses Pruning (filter temporal incoherence, similarity thresholds, domain-specific filtering) -> Aggregation and ranking (weighted sum)]
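As a rough illustration of how similarity measures can act as signal detectors, the sketch below scores pairs of files with a bag-of-words cosine similarity, prunes weak pairs with a threshold, and combines several signals with a weighted sum. It is a minimal sketch under stated assumptions, not the prototype's code: the function names, the threshold value and the weights are hypothetical, and the prototype itself relies on Apache Lucene and Apache Tika instead of this hand-written measure.

```python
import math
import re
from collections import Counter
from itertools import combinations


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    tf_a = Counter(re.findall(r"\w+", text_a.lower()))
    tf_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_text_signal(docs, threshold=0.3):
    """Hypothesis generation plus pruning for a single signal.

    docs maps a file name to its extracted text. The result maps
    unordered file pairs to a similarity score, keeping only pairs
    whose score reaches the (hypothetical) threshold.
    """
    hypotheses = {}
    for a, b in combinations(sorted(docs), 2):
        score = cosine_similarity(docs[a], docs[b])
        if score >= threshold:
            hypotheses[(a, b)] = score
    return hypotheses


def aggregate(signals, weights):
    """Weighted sum of several signals, ranked by decreasing confidence."""
    total = sum(weights.values())
    combined = Counter()
    for name, hypotheses in signals.items():
        for pair, score in hypotheses.items():
            combined[pair] += weights[name] * score / total
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy input; the file contents are made up for the example.
docs = {
    "paper.pdf": "provenance reconstruction of files",
    "poster.ppt": "provenance reconstruction of files in a folder",
    "notes.txt": "grocery list",
}
text_signal = detect_text_signal(docs)
metadata_signal = {("paper.pdf", "poster.ppt"): 1.0}  # assumed second signal
ranked = aggregate(
    {"text": text_signal, "metadata": metadata_signal},
    {"text": 1.0, "metadata": 1.0},
)
```

Similarity alone is symmetric, so the pairs here are unordered. Orienting a dependency (which file was derived from which) would need an additional signal such as timestamps, in the spirit of the temporal filter in the pruning stage.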
Initial (encouraging) results
We performed an experiment with a small set of biomedical publications, annotated manually by two domain experts.
[Figure: two graphs over the publications (nodes 0 to 24), grouped into Cluster 1: Blood Cultures EvidenceQ||, Cluster 2: Markers EvidenceQX and Cluster 3: General Guideline]
F1-score of 0.49 for only text similarity
F1-score of 0.70 for the aggregation of various similarities

Research Question
How can one automatically, accurately and efficiently reconstruct a plausible provenance of files in a shared folder, understood as the sequences of operations connecting the files?

Approach & Methodology
We propose a multi-signal pipeline approach that reconstructs plausible provenance traces using the contents of the files and metadata as evidence of the relationships between files.
The pipeline consists of four stages, each containing several components that can be executed in parallel:
[Figure: generic pipeline. Docs -> Preprocessing (extract metadata, index content, ...) -> Hypotheses Generation (Signal Detector 1..n) -> Hypotheses Pruning (Signal Filter 1..n) -> Aggregation and ranking (Aggregator 1..n), producing "copy" relations between files, each with a confidence value]
The research methodology is an iterative process that will incrementally integrate existing approaches from the literature and evaluate their performance on benchmark corpora.

Future work
Following the planned methodology, we will explore additional components for each of the pipeline phases and also consider computational efficiency.

Bibliography
(1) Sara Magliacane: Reconstructing Provenance, ISWC Doctoral Consortium 2012
(2) Paul Groth, Yolanda Gil, Sara Magliacane: Automatic Metadata Annotation through Reconstructing Provenance, Third International Workshop on the role of Semantic Web in Provenance Management, ESWC 2012
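The F1-scores reported under the initial results can be read as the harmonic mean of precision and recall over dependency edges. The sketch below assumes that both the reconstructed and the manually annotated dependencies are represented as sets of (source, target) pairs. This representation and the function name are assumptions for illustration, since the poster does not spell out the evaluation procedure.

```python
def f1_score(predicted, annotated):
    """F1 of predicted dependency edges against annotated (gold) edges."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(annotated)
    return 2 * precision * recall / (precision + recall)


# Made-up edges between numbered documents, not the poster's data.
predicted_edges = {(0, 1), (1, 2), (3, 4)}
annotated_edges = {(0, 1), (1, 2), (2, 3), (5, 6)}
score = f1_score(predicted_edges, annotated_edges)  # precision 2/3, recall 1/2
```

With these toy edges the score is 4/7, roughly 0.57. Comparing such a score for a single signal against the score for the aggregated ranking is the kind of comparison the two reported figures express.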