An incremental learning method to support the annotation of workflows with data-to-data relations

Enrico Daga, Mathieu d’Aquin, Aldo Gangemi, Enrico Motta
Feedback: @enridaga
20th International Conference on Knowledge Engineering
and Knowledge Management
Bologna, Italy
19-23 November 2016
http://link.springer.com/chapter/10.1007/978-3-319-49004-5_9

“LipidMaps Query”, from http://www.myexperiment.org/workflows/1052

Workflow models are focused on actions, to support multiple and parametric executions.

There are scenarios in which we need to focus on the data…

… and understand how the data is affected by the actions of the workflow.

Data flow (DF): to express the implications of the actions on the data.

Datanode, a taxonomy of the relations between data objects, used for example to support reasoning on policy propagation.
http://purl.org/datanode/ns/
Daga, E., d’Aquin, M., Gangemi, A., Motta, E.: Propagation of policies in rich data flows. In: Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015. http://doi.acm.org/10.1145/2815833.2815839

8
Our objective is to derive such data flows from the representation of existing workflows.

9
APPROACH: learn how to label data-to-data relations using the description of the actions in the workflow.
ASSUMPTION: there is a correlation between the features of a workflow action and the labels.
PROBLEM: cold start. This requires a pre-existing training set, which we do not have!

10
Incremental learning method

11
HYPOTHESIS: the quality of the recommendations improves over time

13
WORKFLOW to DATA FLOW
Arcs = I/O port pairs (1->3 ; 2->3)
1234 workflows from www.myexperiment.org = 30612 I/O port pairs
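The arc extraction above can be sketched in a few lines. This is a minimal illustration, assuming a processor is described by plain lists of input and output port ids; the dictionary layout and the function name `io_port_pairs` are hypothetical and are not taken from the Dinowolf code base.

```python
from itertools import product


def io_port_pairs(processor):
    """Return every (input port, output port) pair of one processor."""
    return list(product(processor["inputs"], processor["outputs"]))


# Hypothetical processor mirroring the slide: input ports 1 and 2, output port 3.
processor = {"name": "reformat list", "inputs": [1, 2], "outputs": [3]}
pairs = io_port_pairs(processor)
print(pairs)  # [(1, 3), (2, 3)]
```

Each pair becomes one arc of the data flow, to be labelled with a data-to-data relation.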

14
FEATURES
Direct:
About the ports and
processors involved:
ids, data types,
annotations, scripts …
Derived:
From annotations: Bag
of words, NER/DBPedia
entities plus types and
categories.
Table 2. Example of derived features (bag of words and DBPedia entities) generated for the IO port pair 1 → 3.
Type                                   Value
From/FromPortName-word                 string
To/ToPortName-word                     split
From/FromLinkedPortDescription-word    single
From/FromLinkedPortDescription-word    possibilities
From/FromLinkedPortDescription-word    orb
From/FromLinkedPortDescription-word    mass
FromToPorts/DbPediaType                wgs84:SpatialThing
FromToPorts/DbPediaType                resource:Text file
FromToPorts/DbPediaType                resource:Mass
FromToPorts/DbPediaType                Category:State functions
FromToPorts/DbPediaType                Category:Physical quantities
FromToPorts/DbPediaType                Category:Mathematical notation
Distribution (30612 I/O port pairs):
Fig. 4. Distribution of features extracted from the workflow descriptions: 80% (< 10), 18% (10 ∼ 100), 2% (> 100).
Fig. 5. Distribution of features (including derived features): 68% (< 10), 28% (10 ∼ 100), 4% (> 100).

15
FEATURES
This processor has three ports: two input ports (1 and 2) and one output port (3). We can translate this model into a graph connecting the data objects of the inputs to the one of the output.
Table 1. Sample of the features extracted for the IO port pair 1 → 3 in the example of Figure 3.
Type                              Value
From/FromPortName                 string
To/ToPortName                     split
Activity/ActivityConfField        script
Activity/ActivityType             http://ns.taverna.org.uk/2010/activity/beanshell
Activity/ActivityName             reformat list
Activity/ConfField/derivedFrom    http://ns.taverna.org.uk/2010/activity/localworker/org.embl.ebi.escience.scuflworkers.java.SplitByRegex
Activity/ConfField/script         List split = new ArrayList();if (!string.equals("")) { String regexString = ","; if (regex != void) ...
Processor/ProcessorType           Processor
Processor/ProcessorName           reformat list
However, the objective of these feature sets is to support the clustering of the annotated IO port pair through finding similarities with IO port pairs to be annotated. At this stage of the study we performed a preliminary evaluation of the distribution of the features extracted. We discovered that very few of them were shared between a significant number of port pairs (see Figure 4). In order to increase the number of shared features we generated a set of derived features by extracting bags of words from lexical feature values and by performing Named Entity Recognition on the features that constituted textual annotations ([…]s and comments), when present. Moreover, from the extracted entities we added the related DBPedia categories and types as additional features. As an example, Table 2 shows a sample of the bag of words and entities extracted from the features listed in the previous Table 1.
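The bag-of-words part of the derived features can be sketched as follows. This is a minimal illustration only: the tokenisation rule and the function name `bag_of_words` are assumptions, the `-word` suffix follows the naming visible in Table 2, and the Named Entity Recognition step (DBPedia Spotlight in the tool) is left out.

```python
import re


def bag_of_words(features):
    """Derive one '<type>-word' feature per alphabetic token of each value."""
    derived = set()
    for ftype, value in features:
        for word in re.findall(r"[A-Za-z]+", value):
            derived.add((ftype + "-word", word.lower()))
    return sorted(derived)


# A few direct features of the port pair 1 -> 3, taken from Table 1.
direct = [
    ("From/FromPortName", "string"),
    ("To/ToPortName", "split"),
    ("Activity/ActivityName", "reformat list"),
]
derived = bag_of_words(direct)
for ftype, word in derived:
    print(ftype, word)
```

Splitting values into words makes it more likely that two port pairs share a feature, which is what the clustering step needs.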
Distribution (30612 I/O port pairs):
Direct: 80% (< 10), 18% (10 ∼ 100), 2% (> 100)
Derived: 68% (< 10), 28% (10 ∼ 100), 4% (> 100)

17
Formal Concept Analysis (FCA)
• FCA is a clustering method for association rule mining
• Lattice of ordered closed item sets - concepts
• Item: I/O port pair <-> features + annotations
• FCA Concept:
• Extent (I/O port pairs)
• Intent (features, annotations)
• Incremental lattice construction (Godin algorithm).
• Lattice is reconstructed on each item addition.
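To make the notions of extent and intent concrete, here is a minimal sketch that enumerates the concepts of a tiny context. It is a naive, exponential enumeration written for illustration and is not the incremental Godin algorithm used in the tool; the item names, feature names and function names are hypothetical.

```python
from itertools import combinations


def extent(context, attrs):
    """Items that have all the given attributes."""
    return frozenset(i for i, a in context.items() if attrs <= a)


def intent(context, items):
    """Attributes shared by all the given items."""
    sets = [context[i] for i in items]
    if not sets:
        # By convention, the empty set of items shares every attribute.
        return frozenset().union(*context.values())
    return frozenset.intersection(*sets)


def concepts(context):
    """Enumerate all (extent, intent) pairs by closing every item subset."""
    found = set()
    items = list(context)
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            shared = intent(context, subset)
            found.add((extent(context, shared), shared))
    return found


# Hypothetical context: three I/O port pairs with features (f*) and annotations (a*).
context = {
    "io1": frozenset({"f1", "f2", "a1"}),
    "io2": frozenset({"f1", "f2", "a1"}),
    "io3": frozenset({"f2", "f3", "a2"}),
}
lattice = concepts(context)
for ext, itt in sorted(lattice, key=lambda c: len(c[0])):
    print(sorted(ext), sorted(itt))
```

Here io1 and io2 end up in the same concept because they share f1, f2 and a1, which is the kind of grouping that association rules are later read from.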

18
Step 0
At the beginning, the user adds a single item, without
support. The lattice contains a single concept.

19
Step 1
By adding new annotations, the lattice allows association rules to be derived.
(f1, f2, ..., fn) → (a1, a2, ..., an)

20
Step 2
By adding new annotations, the lattice grows… allowing recommendations to be generated.
(f1, f2, ..., fn) → (a1, a2, ..., an)

21
Step 3
By adding new annotations, the lattice grows… allowing more recommendations to be generated.
(f1, f2, ..., fn) → (a1, a2, ..., an)

22
Step 4
By adding new annotations, the lattice grows… allowing many recommendations to be generated.
(f1, f2, ..., fn) → (a1, a2, ..., an)

23
ASSOCIATION RULE MINING
Generating all association rules on each iteration is expensive.
We query the lattice to retrieve only the rules applicable to a given I/O port pair:
• only rules that have annotations in the rule consequent:
• This: (f1, f2, ..., fn) → (a1, a2, ..., an)
• Not these: (f1, f2, a6) → (f3, f4), (f1, f2, a6) → (f3, a4)
• avoid redundancies (select the best rule for a given head)
• rank the rules according to support, confidence and relevance.
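The filtering and ranking criteria above can be sketched with the standard definitions of support and confidence. This is a minimal illustration under stated assumptions: the reading that a valid rule has only features in its body and only annotations in its head is an interpretation of the slide, the `f`/`a` name prefixes are a hypothetical convention, and the relevance score is omitted because the slides do not define it.

```python
def support(items, attrs):
    """Fraction of items that contain all the given attributes."""
    return sum(attrs <= item for item in items) / len(items)


def confidence(items, body, head):
    """Share of the items matching the body that also match the head."""
    return support(items, body | head) / support(items, body)


def is_annotation_rule(body, head):
    """Assumed filter: features only in the body, annotations only in the head."""
    return all(x.startswith("f") for x in body) and all(
        x.startswith("a") for x in head
    )


# Hypothetical annotated I/O port pairs.
items = [
    {"f1", "f2", "a1"},
    {"f1", "f2", "a1"},
    {"f1", "f2", "a2"},
    {"f3", "a2"},
]
rules = [
    ({"f1", "f2"}, {"a1"}),
    ({"f1", "f2"}, {"a2"}),
    ({"f1", "a1"}, {"f2"}),  # discarded: annotation in body, feature in head
]
valid = [r for r in rules if is_annotation_rule(*r)]
ranked = sorted(
    valid,
    key=lambda r: (support(items, r[0] | r[1]), confidence(items, *r)),
    reverse=True,
)
best_body, best_head = ranked[0]
print(sorted(best_body), "->", sorted(best_head))
```

In this toy data the rule (f1, f2) → (a1) ranks first, with support 0.5 and confidence 2/3.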

24
io6: f7,f8,f9,f10,f11,a?
(f7,f8) →(a0) (f8,f9) →(a2)
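The example above can be read as a simple containment check: a rule applies to io6 when all the features in its body are among the features of io6. A minimal sketch, in which the function name `recommend` and the third rule are hypothetical additions for illustration:

```python
def recommend(item_features, rules):
    """Return the heads of the rules whose body is contained in the item's features."""
    return [head for body, head in rules if body <= item_features]


io6 = {"f7", "f8", "f9", "f10", "f11"}
rules = [
    ({"f7", "f8"}, "a0"),
    ({"f8", "f9"}, "a2"),
    ({"f1", "f2"}, "a5"),  # hypothetical rule that does not apply to io6
]
recommended = recommend(io6, rules)
print(recommended)  # ['a0', 'a2']
```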

25
EVALUATION
• Expectation: the quality of the recommendations improves over time.
• EXPERIMENT:
• Dinowolf (Datanode in workflows)
http://github.com/enridaga/dinowolf
Uses SCUFL2, Apache Taverna, Apache Lucene, DBPedia Spotlight
• 6 users annotating 20 workflows from www.myexperiment.org, for a total of 260 I/O port pairs.

26
RESULTS
The average rank of selected recommendations (Fig. 8) confirms our hypothesis that the quality of recommendations increases, stabilizing within the upper region after a critical mass of annotated items is produced, reflecting the same behavior observed in Fig. 7.
Time required to make a choice:
Fig. 6. Evolution of the time spent by each user on a given annotation page of the tool before a decision was made.
Selections from recommendations:
Fig. 7. Progress of the ratio of annotations selected from recommendations.
Effort reduced.
Cold start problem tackled.

27
RESULTS
Rank of selected recommendations:
Fig. 8. Average rank of selected recommendations. The vertical axis represents the score, placing the first position at the top.
Relevance score of selected recommendations:
Fig. 9. Progress of the average relevance score of picked recommendations.
Quality of recommendations increases.

28
CONCLUSIONS
• Supporting users in annotating workflows with data-to-data relations through recommendations is problematic because of the lack of an initial training set (cold start problem). We tackled this issue by means of an incremental learning process that leverages FCA and an information retrieval approach to association rule mining (ARM).
• Future work:
• Integrate this approach in Data Hub metadata management to support policy propagation.
• Study the quality and consistency of annotations.
• Agreement/disagreement between users.
• The solution is domain independent and can be applied to other scenarios.

29
Thank you
Enrico Daga
Feedback: @enridaga
http://link.springer.com/chapter/10.1007/978-3-319-49004-5_9

30
REFERENCES
• Daga, E., d’Aquin, M., Adamou, A., Motta, E.: Addressing exploitability of smart city data. In: 2016 IEEE Second International Smart Cities Conference (ISC2). IEEE (2016)
• Daga, E., d’Aquin, M., Gangemi, A., Motta, E.: Describing semantic web applications through relations between data nodes. Technical report kmi-14-05, Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes (2014). http://kmi.open.ac.uk/publications/techreport/kmi-14-05
• Daga, E., d’Aquin, M., Gangemi, A., Motta, E.: Propagation of policies in rich data flows. In: Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, New York, NY, USA, pp. 5:1–5:8 (2015). http://doi.acm.org/10.1145/2815833.2815839
• Godin, R., Missaoui, R., Alaoui, H.: Incremental concept formation algorithms based on Galois (concept) lattices. Comput. Intell. 11(2), 246–267 (1995)
• Poelmans, J., Elzinga, P., Viaene, S., Dedene, G.: Formal concept analysis in knowledge discovery: a survey. In: Croitoru, M., Ferré, S., Lukose, D. (eds.) ICCS 2010. LNCS (LNAI), vol. 6208, pp. 139–153. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14197-3_15
