Slides of ACL 2013 presentation of:
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: what replication failure teaches us. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1691–1701, Sofia, Bulgaria, August 4-9, 2013
Offspring from Reproduction Problems: what replication failure teaches us
1. NewsReader is funded by the European Union’s
7th Framework Programme (ICT-316404)
BiographyNet is funded by the Netherlands
eScience Center. Partners in BiographyNet are
Huygens/ING Institute of the Dutch Academy of
Sciences and VU University Amsterdam.
Wednesday, August 7, 13
2. OFFSPRING FROM
REPRODUCTION PROBLEMS:
WHAT REPLICATION FAILURE TEACHES US
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen,
Piek Vossen and Nuno Freire
3. NER EXPERIMENTS
• Nuno Freire, José Borbinha, and Pável Calado (2012) present an
approach to recognise named entities in the cultural heritage
domain with a small amount of training data
• The dataset used in Freire et al. (2012) is available
• Their software was not available
• The paper describes the feature set used as well as the (open
source) machine learning package and the experimental setup
4. • Nuno Freire provided additional feedback, but had finished his
PhD and changed jobs: he had no access to his original
experimental setup or his code
• The paper did not provide information on tokenisation, exact
preprocessing steps, cuts for 10-fold cross validation, number
of decimals used for rounding weights, etc.
NER EXPERIMENTS
5. NER EXPERIMENTS
                 Freire et al. (2012)          Van Erp and Van der Meij (2013)
                 Precision  Recall  F-score    Precision  Recall  F-score
LOC (388)        92%        55%     69         77.8%      39.2%   52.1
ORG (157)        90%        57%     70         65.8%      30.6%   41.7
PER (614)        91%        56%     69         73.3%      37.6%   49.7
Overall (1,159)  91%        55%     69         73.3%      37.1%   49.5
6. • Variations on tokenisation yielded a 15-point drop in overall
F-score
• Results on individual folds differed by up to 25 points in F-score
• Experimenting with a different implementation of the CRF
algorithm yielded significantly different scores (almost attaining
those of Freire et al. (2012) without the complex features)
NER EXPERIMENTS
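How much tokenisation alone can shift token-level results can be illustrated with a minimal sketch; the two tokenisers and the example sentence below are hypothetical stand-ins, not the preprocessing actually used in the experiments:

```python
import re

def tokenize_split(text):
    """Naive scheme 1: split on whitespace only."""
    return text.split()

def tokenize_punct(text):
    """Naive scheme 2: also split punctuation off as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The Rijksmuseum, in Amsterdam, opened in 1885."
a = tokenize_split(sentence)
b = tokenize_punct(sentence)

# The same sentence yields different token sequences (7 vs. 10 tokens),
# so token-level NER labels, features and evaluation counts are no
# longer directly comparable between the two runs.
print(a)  # ['The', 'Rijksmuseum,', 'in', 'Amsterdam,', 'opened', 'in', '1885.']
print(b)  # ['The', 'Rijksmuseum', ',', 'in', 'Amsterdam', ',', 'opened', 'in', '1885', '.']
```

Any entity boundary that falls next to a comma or period is scored against different tokens under the two schemes, which is one route to the fold-level swings reported above.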
7. • Preprocessing of the data set and the extra resources used
probably influenced our experiments
• Encoding the features as multivariate vs. boolean as input for
the machine learner may have made a difference
• Without additional information (exact output, data/resources
after preprocessing), it is hard to find out what causes these
differences, or which experiment provides the results most
indicative of the potential of the approach.
NER EXPERIMENTS
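The multivariate vs. boolean encoding point can be made concrete with a toy sketch; the "case" feature and its values here are hypothetical, not Freire et al.'s actual feature set. The same categorical feature can be passed to a learner either as one multi-valued feature or expanded into one boolean feature per value, and implementations may weight the two differently:

```python
FEATURE_VALUES = ("upper", "lower", "mixed")

def encode_multivariate(value):
    """One feature whose value is the category itself."""
    return {"case": value}

def encode_boolean(value):
    """One boolean feature per possible value (one-hot expansion)."""
    return {f"case={v}": (v == value) for v in FEATURE_VALUES}

print(encode_multivariate("upper"))  # {'case': 'upper'}
print(encode_boolean("upper"))
# {'case=upper': True, 'case=lower': False, 'case=mixed': False}
```

With the boolean encoding each value gets its own learned weight; with the multivariate encoding the learner's handling of the single feature decides, so the two runs are not guaranteed to converge to the same model.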
8. WORDNET SIMILARITY EXPERIMENTS
• Marten Postma & Piek Vossen wanted to run a WordNet
similarity experiment for Dutch previously done for English by
Patwardhan & Pedersen (2006) and Pedersen (2010)
• This experiment ranks word pairs by WordNet similarity
measures and compares the resulting rankings to human
rankings
• Step 1: replicate the WordNet calculations of the original
experiments
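To show what a path-based WordNet similarity measure computes, here is a toy sketch over a hand-made mini-taxonomy; the original experiments used the actual WordNet::Similarity package, so both the tiny hypernym graph and the simplified measure below are illustrative assumptions, not the real code:

```python
from collections import deque

# Toy hypernym edges (child -> parent), loosely WordNet-like.
HYPERNYM = {
    "cat": "feline", "feline": "carnivore",
    "dog": "canine", "canine": "carnivore",
    "carnivore": "animal", "bird": "animal",
}

def path_length(a, b):
    """Shortest number of edges between two concepts in the
    (undirected) hypernym graph, via breadth-first search."""
    neighbours = {}
    for child, parent in HYPERNYM.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for n in neighbours.get(node, ()):
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 1))
    return None

def path_similarity(a, b):
    """1 / (shortest path length + 1), the shape of a path-based measure."""
    d = path_length(a, b)
    return None if d is None else 1 / (d + 1)

print(path_similarity("cat", "feline"))  # 0.5  (one edge apart)
print(path_similarity("cat", "dog"))     # 0.2  (cat-feline-carnivore-canine-dog)
```

Even in this toy setting, the score depends entirely on the taxonomy's edges, which is why the WordNet version used can change every similarity value.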
9. WORDNET SIMILARITY EXPERIMENTS
• The code used in the original experiments is open source, but
the results still differed from the original
• Pedersen pointed out that the version of WordNet (and
possibly Perl packages) may influence results
• Experiments were repeated with the exact same versions as
the original: results still differed
10. WORDNET SIMILARITY EXPERIMENTS
• Together with Ted Pedersen, we ran the experiment step by
step, comparing outcomes until we obtained the same results
• We identified the following factors that had led to
differences:
• Restriction on PoS tags
• Gold Standard used
• Ranking coefficient used
• How much can these factors actually influence results?
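Spearman's rho and Kendall's tau, the two ranking coefficients involved here, can be computed by hand; a minimal sketch on illustrative rankings (not the experimental data):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings:
    1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(rank_a, rank_b):
    """Kendall's tau: (concordant - discordant pairs) / total pairs."""
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            agree = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if agree > 0:
                concordant += 1
            elif agree < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical ranks for five word pairs: a measure vs. human judges.
measure = [1, 2, 3, 4, 5]
humans  = [2, 1, 3, 5, 4]
print(spearman_rho(measure, humans))  # 0.8
print(kendall_tau(measure, humans))   # 0.6
```

The two coefficients already disagree on this small example (0.8 vs. 0.6); reporting one or the other changes the headline number for the very same system output.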
13. WORDNET SIMILARITY EXPERIMENTS
Variation         Spearman rho   Kendall tau   Different rank
WordNet version   0.44           0.42          88%
Gold Standard     0.24           0.21          71%
PoS-tag           0.09           0.08          41%
Configuration     0.08           0.60          41%
14. WORDNET SIMILARITY EXPERIMENTS
• Performance of similarity measures can vary significantly
• Influential factors interact differently with individual scores
(i.e. comparative performance changes)
• Apart from WordNet version, these factors have (to our
knowledge) not been discussed in previous literature, despite
the fact that similarity scores are used very frequently
• Open question: what is the impact of influential factors when
similarity measures are used for other tasks?
15. DISCUSSION
• The results in this paper point to two main issues:
(1) Our methodological descriptions often do not contain
the details needed to reproduce our results
(2) These details can have such a high impact on our results
that it is hard to distinguish the contribution of the approach
from the contribution of preprocessing, the exact versions of
tools/resources used, the evaluation set chosen, etc.
16. CONCLUSION
• It is easier to find out how an approach really works if you
have:
• the original code (even if it contains hacks or is unclean and/or undocumented)
• a clear description of each individual step
• the exact output on evaluation data (not just the overall numbers)
• the preprocessed/modified/improved versions of standard resources
17. CONCLUSION
• Systematic testing can help to gain insight into the expected
variation of a specific approach:
• what is the performance of individual tools?
• what are the best and worst results using different parameters?
• how does performance compare using different evaluation metrics?
• As a community, we should know where our
approaches fail as much as, if not more than, where
they succeed
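The systematic testing suggested above can be sketched as a simple parameter sweep that reports the best and worst settings; the parameter grid and the scores returned by `evaluate` are hypothetical stand-ins for a real system's configuration and evaluation runs:

```python
from itertools import product

# Hypothetical parameters of an NER pipeline.
GRID = {"tokeniser": ["whitespace", "punct"],
        "crf_impl": ["A", "B"],
        "rounding": [2, 4]}

def evaluate(tokeniser, crf_impl, rounding):
    """Placeholder: in practice, train/evaluate and return an F-score."""
    fake_scores = {("whitespace", "A", 2): 54.0, ("whitespace", "A", 4): 55.1,
                   ("whitespace", "B", 2): 62.3, ("whitespace", "B", 4): 63.0,
                   ("punct", "A", 2): 48.7, ("punct", "A", 4): 49.5,
                   ("punct", "B", 2): 68.9, ("punct", "B", 4): 69.2}
    return fake_scores[(tokeniser, crf_impl, rounding)]

keys = sorted(GRID)
runs = [(dict(zip(keys, values)), evaluate(**dict(zip(keys, values))))
        for values in product(*(GRID[k] for k in keys))]
best = max(runs, key=lambda r: r[1])
worst = min(runs, key=lambda r: r[1])
print("best: ", best)   # highest-scoring setting
print("worst:", worst)  # lowest-scoring setting; the gap is the
                        # expected variation of the approach
```

Reporting the best-worst gap alongside the headline number makes explicit how much of a score is the approach and how much is the configuration.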
18. THANK YOU & THANKS
• To Ted Pedersen and Nuno Freire for writing this paper with us!
• To Ruben Izquierdo, Lourens van der Meij, Christoph Zwirello,
Rebecca Dridan and the Semantic Web group at VU University
for their help and feedback
• To the anonymous reviewers who really helped to make this a
better paper