Simon Walk's talk at CIKM '14 about our paper titled "Sequential Action Patterns in Collaborative Ontology Engineering Projects: A Case-study in the Biomedical Domain"
1. 1
SCIENCE PASSION TECHNOLOGY
Sequential Action Patterns in
Collaborative Ontology-Engineering Projects:
A Case-Study in the Biomedical Domain
Simon Walk (1), Philipp Singer (2) and Markus Strohmaier (2, 3)
(1) Graz University of Technology
(2) GESIS – Leibniz Institute for the Social Sciences
(3) University of Koblenz
Graz University of Technology, CIKM 2014
2. 2
Introduction & Motivation
The importance of collaborative ontology-engineering
projects has grown in recent years due to an
increase in
• complexity of the modeled domains
• requirements for the resulting ontology
No individual is able to single-handedly cover the increased
complexity and requirements.
Hence, it is crucial to better understand and steer the
underlying processes of how users collaboratively
work on an ontology (i.e., via predictive models).
3. 3
Approach & Objective
To that end, we analyzed five collaborative ontology-engineering
projects from the biomedical domain to:
1. explore regularities and common patterns in user
action sequences
2. fit and select models using Markov chains of
varying order
3. predict user actions via the fitted Markov chains
Our main objective is to predict future user actions
in collaborative ontology-engineering projects.
4. 4
Datasets
Five collaborative ontology-engineering projects from
the biomedical domain that vary in size and features.
Note that all ontologies were created with WebProtégé
or derivatives of WebProtégé!
5. 5
Types of Action Paths
9. 9
Extracted Action Paths
1. Users for Classes
Sequences of users that changed a class.
2. Change Types for Users & Classes
Sequences of change types performed by a user / on
a class.
3. Properties for Users & Classes
Sequences of properties changed by a user / for a
class.
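The three path types above could be derived from a change log roughly as follows. This is only a minimal sketch: the record fields (`timestamp`, `user`, `class_id`, `change_type`, `prop`), the helper `extract_paths` and the toy log are hypothetical and are not taken from the WebProtégé data used in the talk.

```python
from collections import defaultdict

def extract_paths(changes, group_by, state):
    """Group change records by one field and list another field in time order.

    changes  -- iterable of dicts (hypothetical log records)
    group_by -- field that defines one path, e.g. "class_id" or "user"
    state    -- field whose values form the sequence, e.g. "user"
    """
    paths = defaultdict(list)
    for change in sorted(changes, key=lambda c: c["timestamp"]):
        paths[change[group_by]].append(change[state])
    return dict(paths)

# Toy log; all values are made up for illustration.
log = [
    {"timestamp": 1, "user": "u1", "class_id": "c1", "change_type": "create", "prop": "title"},
    {"timestamp": 2, "user": "u2", "class_id": "c1", "change_type": "edit", "prop": "definition"},
    {"timestamp": 3, "user": "u1", "class_id": "c2", "change_type": "create", "prop": "title"},
    {"timestamp": 4, "user": "u1", "class_id": "c1", "change_type": "edit", "prop": "title"},
]

users_for_classes = extract_paths(log, "class_id", "user")
change_types_for_users = extract_paths(log, "user", "change_type")
properties_for_classes = extract_paths(log, "class_id", "prop")
```

Each call answers one of the prediction questions: for example, `users_for_classes` holds, per class, the ordered sequence of users who changed it.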
11. 11
Exploring Regularities
Randomness & Regularities
Wald-Wolfowitz runs test
Adapted by O’Brien and Dyck (1985)
For ~60% of our paths, regularities could be detected. [1]
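To illustrate the idea behind a runs test, here is a minimal sketch of the classic two-category Wald-Wolfowitz test with a normal approximation. It is not the O'Brien and Dyck adaptation used in the talk (see the linked repository for that); the function name `runs_test` and the toy sequence are assumptions made for illustration.

```python
import math

def runs_test(sequence):
    """Classic two-category Wald-Wolfowitz runs test (normal approximation)."""
    categories = sorted(set(sequence))
    if len(categories) != 2:
        raise ValueError("this sketch handles exactly two categories")
    n1 = sum(1 for s in sequence if s == categories[0])
    n2 = len(sequence) - n1
    n = n1 + n2
    # A run is a maximal block of identical consecutive symbols.
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    # Expected number of runs and its variance under randomness.
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    if var <= 0.0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return runs, mean, z, p_value

# A strictly alternating sequence has far more runs than chance would suggest.
runs, mean, z, p = runs_test(list("ABABABABAB"))
```

A small p-value indicates that the number of runs deviates from what a random ordering would produce, i.e., that the path shows regularities.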
Sequential Pattern Mining
PrefixSpan to investigate commonly used sequential
patterns.
Only immediately succeeding states build patterns.
E.g., “A B C” contains “A B” and “B C” but not “A C”
[1] https://github.com/psinger/RunsTest
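The restriction to immediately succeeding states can be sketched with a simple contiguous-subsequence count. Note that this is a simplified stand-in, not PrefixSpan itself; the function name `contiguous_pattern_support` and the toy paths are hypothetical. Support is taken here as the fraction of paths that contain a pattern.

```python
from collections import Counter

def contiguous_pattern_support(paths, min_len=2, max_len=4):
    """Return {pattern: fraction of paths containing it as a contiguous run}."""
    counts = Counter()
    for path in paths:
        seen = set()
        for length in range(min_len, max_len + 1):
            for i in range(len(path) - length + 1):
                seen.add(tuple(path[i:i + length]))
        counts.update(seen)  # each path counts at most once per pattern
    return {pattern: c / len(paths) for pattern, c in counts.items()}

paths = [["A", "B", "C"], ["A", "B"], ["B", "C"]]
support = contiguous_pattern_support(paths)
```

In this toy input, "A B" and "B C" are found, while "A C" is never counted because A and C are not adjacent in any path.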
12. 12
Results for the Sequential Pattern Analysis
Users for Classes Paths
14. 14
Model Fitting & Selection
15. 15
Model Fitting
Markov chains are stochastic processes
representing transition probabilities between
a countable number of known states.
A state space: listing all possible states
A transition matrix: listing all transition-probabilities
between states
A Markov chain of n-th order means that n previous
states contain predictive information about the next
state.
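As a minimal sketch of how such a chain could be fitted, the snippet below estimates transition probabilities by counting how often each transition occurs in the paths (maximum likelihood). The function name `fit_markov_chain` and the toy paths are assumptions for illustration and not the authors' implementation.

```python
from collections import Counter, defaultdict

def fit_markov_chain(paths, order=1):
    """Maximum-likelihood transition probabilities for an order-n chain.

    Returns {context_tuple: {next_state: probability}}. For order 0 the only
    context is the empty tuple, i.e. plain state frequencies.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(order, len(path)):
            context = tuple(path[i - order:i])  # the n previous states
            counts[context][path[i]] += 1
    return {
        context: {state: c / sum(nxt.values()) for state, c in nxt.items()}
        for context, nxt in counts.items()
    }

paths = [["A", "B", "A", "B"], ["A", "A", "B"]]
first_order = fit_markov_chain(paths, order=1)
zero_order = fit_markov_chain(paths, order=0)
```

Each key of the result corresponds to one row of the transition matrix; the zero-order model ignores history and acts as a frequency-weighted baseline.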
16. 16
Model Fitting & Selection
We fitted models of orders zero to five. [2]
Lower order models are nested within higher order
models.
Higher orders need exponentially more parameters
and may result in overfitting.
Bayesian model selection (Singer et al. 2014) [2]
Higher order models receive a penalty due to higher
complexity.
[2] https://github.com/psinger/PathTools
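One way to compare orders in a Bayesian fashion is to compute the log marginal likelihood (evidence) of each model. The sketch below assumes a symmetric Dirichlet prior with `alpha=1.0` on every row of the transition matrix and scores all orders on the same symbols by skipping the first `max_order` states of each path. Both choices, as well as the function name `log_evidence`, are assumptions of this example; the PathTools code linked above may differ.

```python
from collections import Counter, defaultdict
from math import lgamma

def log_evidence(paths, order, states, max_order, alpha=1.0):
    """Log marginal likelihood of an order-n chain under Dirichlet(alpha) priors."""
    counts = defaultdict(Counter)
    for path in paths:
        # Start at max_order so every order is scored on the same symbols.
        for i in range(max_order, len(path)):
            counts[tuple(path[i - order:i])][path[i]] += 1
    total = 0.0
    prior_mass = alpha * len(states)
    for nxt in counts.values():
        n_row = sum(nxt.values())
        # Dirichlet-multinomial evidence for one row of the transition matrix.
        total += lgamma(prior_mass) - lgamma(prior_mass + n_row)
        total += sum(lgamma(alpha + nxt[s]) - lgamma(alpha) for s in states)
    return total

# Toy data: a strictly alternating path, so one step of memory suffices.
paths = [list("AB" * 20)]
states = ["A", "B"]
evidence = {k: log_evidence(paths, k, states, max_order=2) for k in range(3)}
best_order = max(evidence, key=evidence.get)
```

Because the evidence integrates over all parameter values, every additional row that has to be learned from data lowers the score unless it improves the fit, which is how more complex models end up penalized in this setup.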
17. 17
Results of Bayesian Model Selection
19. 19
K-Fold Cross-Validation Prediction Experiment
1. Fit Markov chain model.
• Split paths into training and test set (stratified).
• Rank transitions for each row in the transition matrix.
2. Determine position of test set transition in the fitted
Markov chain model.
3. Calculate average over all positions.
An average position of 1 equals the best prediction
accuracy.
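The evaluation steps above can be sketched as follows for a first-order chain. This is an illustrative simplification: the split is a plain (unstratified) k-fold over paths, unseen transitions are assigned the worst possible rank, and the names `average_position` and `k_fold_average_position` are hypothetical.

```python
from collections import Counter, defaultdict

def average_position(train_paths, test_paths, n_states):
    """Average rank of the true next state under a first-order chain."""
    counts = defaultdict(Counter)
    for path in train_paths:
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    # Rank the successors of each state, most frequent first.
    ranking = {s: [t for t, _ in c.most_common()] for s, c in counts.items()}
    positions = []
    for path in test_paths:
        for prev, nxt in zip(path, path[1:]):
            ranked = ranking.get(prev, [])
            # Unseen transitions get the worst rank (an assumption of this sketch).
            positions.append(ranked.index(nxt) + 1 if nxt in ranked else n_states)
    return sum(positions) / len(positions)

def k_fold_average_position(paths, n_states, k=5):
    """Plain (unstratified) k-fold split over paths; returns the mean score."""
    scores = []
    for fold in range(k):
        test = paths[fold::k]
        train = [p for i, p in enumerate(paths) if i % k != fold]
        scores.append(average_position(train, test, n_states))
    return sum(scores) / k

# Toy data: perfectly predictable paths.
paths = [list("ABAB") for _ in range(10)]
score = k_fold_average_position(paths, n_states=2, k=5)
unseen = average_position([list("AB")], [list("BA")], n_states=2)
```

For higher orders, the same ranking would be built per context of n previous states instead of per single state.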
20. 20
K-Fold Cross-Validation Prediction Results
21. 21
Results for the Prediction Task
22. 22
Conclusions
A number of sequences were produced in a non-random
way and frequent patterns can be extracted.
Memory effects (serial dependence) can increase
prediction accuracy.
The resulting prediction models can (potentially) be
used for
the creation of various recommendations as well as
to assess the impact of potential changes on the
ontology and the community.
23. 23
Future Work
Include additional data sources (e.g., Semantic
MediaWikis).
Analyze higher order patterns and compare patterns
of different data sources.
Conduct live-lab experiments with generated
prediction-models (recommendations).
25. 25
Thank you for your attention!
26. 26
References
A. Wald and J. Wolfowitz. On a test whether two samples are from
the same population. The Annals of Mathematical Statistics,
11(2):147–162, 1940.
P. C. O’Brien and P. J. Dyck. A runs test based on run lengths.
Biometrics, pages 237–244, 1985.
P. Singer, D. Helic, B. Taraghi, and M. Strohmaier. Detecting
memory and structure in human navigation patterns using
Markov chain models of varying order. PLoS ONE,
9(7):e102070, 2014.
Editor's notes
ICD-11: Classification scheme for encoding diseases; it informs decision makers about health-related spending and insurance companies about what to charge.
ICTM: The counterpart of ICD-11 for traditional medicine, planned to be merged into ICD-11. Multilingual (Japanese, Chinese, Korean, English and Traditional Chinese).
NCIt: A reference vocabulary covering clinical care, translational and basic research, and cancer biology.
BRO: A controlled terminology of resources, which is used to improve the sensitivity and specificity of web searches used for Biositemaps.
Who is going to change a class next?
What kind of change is a user going to perform next?
What kind of change is performed next on a class?
What property will be changed next by a user?
What property will be changed next for a class?
Support = the percentage of paths that exhibit a certain pattern.
Pattern: Users for Classes
45,844 sequences for ICD-11. If a pattern occurs in only 200 of them (e.g., a user changes only 200 concepts), its support is 200 / 45,844 ≈ 0.0044.
Given that there are sequential patterns of lengths 2 to 4, we argue that such patterns play a crucial role in the contributor logs of the collaborative ontology-engineering projects at hand.
Support: small number of 1 patterns, as they have a support close to 0. In Figure (a), look at the 0.2 - 0.4 support range. The aggregated number of patterns per dataset is what we see in Figure (a).
Interested in identifying higher order Markov chain models.
Zero-order Markov chain as weighted random selection baseline.
Zero Order Models => Random Baseline
To calculate the transition probabilities between different states we calculate how often they occur in the data.
Higher order models will fit at least as well as lower order models.
Number of parameters for order n: states^n * (states - 1)
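To make the exponential growth concrete, the following small sketch counts the free parameters of an order-n chain: there are states^n possible contexts (rows), and each row is a probability distribution with states - 1 free values. The state-space size of 10 is a made-up example, and `n_parameters` is a hypothetical helper.

```python
def n_parameters(n_states, order):
    """Free parameters of an order-n Markov chain over n_states states."""
    # n_states**order contexts (rows), each a distribution with
    # n_states - 1 free probabilities (the last one is fixed by normalization).
    return n_states ** order * (n_states - 1)

# Parameter counts for orders zero to five with a hypothetical 10-state space.
counts = {order: n_parameters(10, order) for order in range(6)}
```

With 10 states, each additional order multiplies the number of parameters by 10, which is why higher orders need much more data and are prone to overfitting.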