NewsReader is funded by the European Union’s
7th Framework Programme (ICT-316404)
BiographyNet is funded by the Netherlands
eScience Center. Partners in BiographyNet are
Huygens/ING Institute of the Dutch Academy of
Sciences and VU University Amsterdam.
Wednesday, August 7, 13
OFFSPRING FROM
REPRODUCTION PROBLEMS:
WHAT REPLICATION FAILURE TEACHES US
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen,
Piek Vossen and Nuno Freire
NER EXPERIMENTS
• Nuno Freire, José Borbinha, and Pável Calado (2012) present an
approach to recognise named entities in the cultural heritage
domain with a small amount of training data
• The dataset used in Freire et al. (2012) is available
• Their software was not available
• The paper describes the feature set used as well as the (open
source) machine learning package and the experimental setup
NER EXPERIMENTS
• Nuno Freire provided additional feedback, but had finished his
PhD and changed jobs: he had no access to his original
experimental setup or his code
• The paper did not provide information on tokenisation, exact
preprocessing steps, cuts for 10-fold cross-validation, number
of decimals used for rounding weights, etc.
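Details such as the cross-validation cuts can be fixed and stored with an experiment. A minimal sketch, assuming a hypothetical list of sentence identifiers and an arbitrary seed (neither comes from Freire et al. (2012)); it derives a deterministic 10-fold split that can be saved next to the results:

```python
import json
import random


def make_folds(item_ids, k=10, seed=13):
    """Assign items to k folds deterministically (the seed is an arbitrary choice)."""
    ids = sorted(item_ids)          # sort first so input order cannot change the split
    rng = random.Random(seed)       # local RNG: does not touch the global random state
    rng.shuffle(ids)
    # The item at shuffled position i goes to fold i mod k.
    return [ids[i::k] for i in range(k)]


# Hypothetical sentence identifiers, for illustration only.
sentence_ids = [f"sent-{n:04d}" for n in range(95)]
folds = make_folds(sentence_ids)

# Storing the split makes the exact cuts available to anyone repeating the run.
manifest = json.dumps({"k": 10, "seed": 13, "folds": folds})
```

Publishing the stored split itself, rather than only the seed, avoids depending on one particular random number generator.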
NER EXPERIMENTS
                  Freire et al. (2012)            Van Erp and Van der Meij (2013)
                  Precision  Recall  F-score      Precision  Recall  F-score
LOC (388)         92%        55%     69           77.8%      39.2%   52.1
ORG (157)         90%        57%     70           65.8%      30.6%   41.7
PER (614)         91%        56%     69           73.3%      37.6%   49.7
Overall (1,159)   91%        55%     69           73.3%      37.1%   49.5
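The F-score column can be related to precision and recall as follows. This is a sketch that assumes the reported F-score is the balanced F1 (the slides do not say so explicitly); it recomputes the LOC row from the reported percentages, giving roughly 69 and 52.1:

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# LOC row of the table, percentages as reported on the slide.
freire_loc = f_score(92.0, 55.0)
van_erp_loc = f_score(77.8, 39.2)
print(round(freire_loc), round(van_erp_loc, 1))
```

Values recomputed this way for other rows can deviate by a few tenths from the reported ones, for instance if the published scores were averaged over folds (an assumption; the slides do not specify this).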
NER EXPERIMENTS
• Variations in tokenisation yielded a 15-point drop in overall F-score
• Results on individual folds differed by up to 25 points in F-score
• Experimenting with a different implementation of the CRF
algorithm yielded significantly different scores (almost attaining
those of Freire et al. (2012) without the complex features)
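To see why tokenisation alone can move scores, compare two plausible tokenisers on the same text. Both functions and the example sentence below are invented for illustration; neither is the tokeniser used in the experiments:

```python
import re


def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def tokenize_punct(text):
    """Split off every non-word, non-space character as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up example sentence, not taken from the evaluation data.
sentence = "Portrait of J. Smith, Amsterdam (1650)."
a = tokenize_whitespace(sentence)
b = tokenize_punct(sentence)
print(len(a), a)
print(len(b), b)
```

The first tokeniser yields "Smith," as a single token while the second yields "Smith" and "," separately, so token-level features and entity boundaries no longer line up between the two runs.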
NER EXPERIMENTS
• Preprocessing of the data set and the extra resources used
probably influenced our experiments
• Encoding the features as multivariate vs. boolean as input for
the machine learner may have made a difference
• Without additional information (exact output, data/resources
after preprocessing), it is hard to find out what causes these
differences or which experiment provides the most indicative
results of the potential of the approach.
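The difference between the two feature encodings can be sketched as follows. This is one possible reading of "multivariate vs. boolean", with hypothetical feature names and values; it is not the feature set of either experiment:

```python
def encode_multivariate(token_features):
    """Keep each feature as one attribute with a symbolic value."""
    return dict(token_features)


def encode_boolean(token_features, value_inventory):
    """Expand each feature into one true/false indicator per known value."""
    return {
        f"{name}={value}": token_features.get(name) == value
        for name, values in sorted(value_inventory.items())
        for value in values
    }


# Hypothetical feature names and values, for illustration only.
inventory = {"pos": ["NOUN", "VERB", "PROPN"], "case": ["lower", "title", "upper"]}
token = {"pos": "PROPN", "case": "title"}

multi = encode_multivariate(token)      # 2 attributes
boolean = encode_boolean(token, inventory)  # 6 indicators, 2 of them True
```

A learner sees two attributes in the first case and six in the second, which can change how weights are estimated even though the underlying information is the same.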
WORDNET SIMILARITY EXPERIMENTS
• Marten Postma & Piek Vossen wanted to run for Dutch a WordNet
similarity experiment previously done for English by
Patwardhan & Pedersen (2006) and Pedersen (2010)
• This experiment ranks similarities between words based on
WordNet similarity measures and compares the rankings to human
rankings
• Step 1: replicate the WordNet calculations of the original
experiments
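The comparison between a measure's ranking and a human ranking can be sketched with Spearman's rho. The scores below are invented, and the rank-difference formula used here is only valid when there are no tied scores:

```python
def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented scores for five word pairs, for illustration only.
human = [3.9, 3.1, 2.4, 1.0, 0.2]
measure = [0.90, 0.40, 0.70, 0.30, 0.10]
rho = spearman_rho(human, measure)
```

Here the measure swaps the second and third word pairs relative to the human ranking, which gives a rho of 0.9.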
WORDNET SIMILARITY EXPERIMENTS
• The code used in the original experiments is open source, but
our results still differed from the original ones
• Pedersen pointed out that the version of WordNet (and
possibly Perl packages) may influence results
• Experiments were repeated with the exact same versions as
the original: results still differed
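Recording versions is cheap, even though the experience above shows it is not sufficient on its own. A minimal sketch in Python (the original experiments used Perl packages); the package names passed in are placeholders, and resource versions such as the WordNet release would still have to be noted separately:

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(package_names):
    """Collect interpreter and package versions for an experiment log."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None   # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": packages,
    }


# "pip" is only a stand-in; list the packages an experiment really uses.
record = environment_record(["pip", "surely-not-installed-example"])
print(json.dumps(record, indent=2))
```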
WORDNET SIMILARITY EXPERIMENTS
• Together with Ted Pedersen, we ran the experiment step by
step, comparing outcomes until we obtained the same results
• We identified the following factors that had led to
differences:
• Restriction on PoS tags
• Gold Standard used
• Ranking coefficient used
• How much can these factors actually influence results?
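The choice of ranking coefficient alone changes the reported number. A small sketch of Kendall's tau (the tau-a variant, assuming no ties) on invented scores: it gives 0.8 for the data below, whereas Spearman's rho for the same data is 0.9:

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs; no ties assumed."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented scores for five word pairs, for illustration only.
human = [3.9, 3.1, 2.4, 1.0, 0.2]
measure = [0.90, 0.40, 0.70, 0.30, 0.10]
tau = kendall_tau(human, measure)
```

Of the ten possible pairs, nine are ordered the same way by both lists and one is reversed, hence (9 - 1) / 10.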
WORDNET SIMILARITY EXPERIMENTS
VARIATIONS IN OUTPUT
Measure   rho    tau    rank
path      0.08   0.07   1-8
wup       0.09   0.08   1-6
lch       0.08   0.07   1-7
res       0.10   0.31   4-11
lin       0.24   0.17   6-10
jcn       0.27   0.23   5,7-11
hso       0.07   0.05   1-3,5-10
vpairs    0.30   0.24   7-11
vector    0.44   0.43   1,2,4,6-11
lesk      0.17   0.63   1-8,11,12
WORDNET SIMILARITY EXPERIMENTS
Variation         Spearman rho   Kendall tau   Different rank
WordNet version   0.44           0.42          88%
Gold Standard     0.24           0.21          71%
PoS-tag           0.09           0.08          41%
Configuration     0.08           0.60          41%
WORDNET SIMILARITY EXPERIMENTS
• Performance of similarity measures can vary significantly
• Influential factors interact differently with individual scores
(i.e. comparative performance changes)
• Apart from WordNet version, these factors have (to our
knowledge) not been discussed in previous literature, despite
the fact that similarity scores are used very frequently
• Open question: what is the impact of influential factors when
similarity measures are used for other tasks?
DISCUSSION
• The results in this paper point to two main issues:
(1) Our methodological descriptions often do not contain
the details needed to reproduce our results
(2) These details can have such a high impact on our results
that it is hard to distinguish the contribution of the approach
from the contribution of preprocessing, the exact versions of
tools/resources used, the evaluation set chosen, etc.
CONCLUSION
• It is easier to find out how an approach really works if you
have:
• the original code (even if it contains hacks or is unclean and/or undocumented)
• a clear description of each individual step
• the exact output on evaluation data (not just the overall numbers)
• the preprocessed/modified/improved versions of standard resources
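Sharing exact output can be as simple as one line per evaluated token. A minimal sketch with invented tokens and labels; the column layout is an assumption for illustration, not a format used in the experiments:

```python
import csv
import io


def write_predictions(rows, handle):
    """Write one line per token: id, token, gold label, predicted label."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "token", "gold", "predicted"])
    writer.writerows(rows)


# Invented tokens and labels, for illustration only.
rows = [
    ("s1-t1", "Rembrandt", "B-PER", "B-PER"),
    ("s1-t2", "Amsterdam", "B-LOC", "O"),
]
buffer = io.StringIO()          # a real run would open a file instead
write_predictions(rows, buffer)
output = buffer.getvalue()
```

With such a file, someone repeating the experiment can compare individual decisions rather than only overall precision, recall and F-score.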
CONCLUSION
• Systematic testing can help to gain insight into the expected
variation of a specific approach:
• what is the performance on individual tools?
• what are the best and worst results using different parameters?
• how does performance compare using different evaluation metrics?
• As a community, we should know where our
approaches fail as much as (if not more than) where
they succeed
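Systematic testing of this kind can be sketched as a loop over parameter combinations that reports the spread rather than a single number. The evaluation function and the parameter grid below are stand-ins with invented scores:

```python
from itertools import product
from statistics import mean


def summarise_runs(evaluate, grid):
    """Run evaluate() on every parameter combination and summarise the spread."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(**params), params))
    scores = [score for score, _ in results]
    best = max(results, key=lambda item: item[0])
    worst = min(results, key=lambda item: item[0])
    return {"best": best, "worst": worst, "mean": mean(scores), "runs": len(results)}


def toy_evaluate(tokenizer, window):
    """Stand-in for a real train-and-test run; returns an invented score."""
    base = {"whitespace": 0.50, "punct": 0.60}[tokenizer]
    return base + 0.01 * window


# Hypothetical parameter grid, for illustration only.
grid = {"tokenizer": ["whitespace", "punct"], "window": [1, 2, 3]}
summary = summarise_runs(toy_evaluate, grid)
```

Reporting the best, worst and mean score together shows how much of a result depends on the chosen settings.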
THANK YOU & THANKS
• To Ted Pedersen and Nuno Freire for writing this paper with us!
• To Ruben Izquierdo, Lourens van der Meij, Christoph Zwirello,
Rebecca Dridan and the Semantic Web group at VU University
for their help and feedback
• To the anonymous reviewers who really helped to make this a
better paper
http://wordpress.let.vupr.nl/reproducingnlpresearch/
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Offspring from Reproduction Problems: what replication failure teaches us

  • 1. NewsReader is funded by the European Union’s 7th Framework Programme (ICT-316404). BiographyNet is funded by the Netherlands eScience Center. Partners in BiographyNet are Huygens/ING Institute of the Dutch Academy of Sciences and VU University Amsterdam. Wednesday, August 7, 13
  • 2. OFFSPRING FROM REPRODUCTION PROBLEMS: WHAT REPLICATION FAILURE TEACHES US
      Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire
  • 3. NER EXPERIMENTS
      • Nuno Freire, José Borbinha, and Pável Calado (2012) present an approach to recognise named entities in the cultural heritage domain with a small amount of training data
      • The dataset used in Freire et al. (2012) is available
      • Their software was not available
      • The paper describes the feature set used as well as the (open source) machine learning package and the experimental setup
  • 4. NER EXPERIMENTS
      • Nuno Freire provided additional feedback, but had finished his PhD and changed jobs: he had no access to his original experimental setup or his code
      • The paper did not provide information on tokenisation, exact preprocessing steps, cuts for 10-fold cross validation, number of decimals used for rounding weights, etc.
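One of the missing details listed above, the cuts for 10-fold cross validation, is cheap to record. The following is a minimal sketch, not the procedure used in either experiment: the function name `make_folds`, the seed, and the item count of 50 are all hypothetical. It deals shuffled item indices into folds and serialises the exact assignment, so the cuts themselves can be shared rather than only the seed.

```python
import json
import random


def make_folds(n_items, n_folds=10, seed=13):
    """Shuffle item indices with a fixed seed and deal them into n_folds folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i receives every n_folds-th shuffled index, starting at position i.
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]


# Hypothetical corpus of 50 items (for example sentences or records).
folds = make_folds(n_items=50)

# Store the exact cuts; a seed alone only helps if the shuffling code is identical.
fold_manifest = json.dumps({"seed": 13, "n_folds": 10, "folds": folds})
```

Publishing `fold_manifest` next to the paper would let others evaluate on exactly the same splits, independent of the random number generator they use.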
  • 5. NER EXPERIMENTS
      Entity type | Freire et al. (2012): Precision, Recall, F-score | Van Erp and Van der Meij (2013): Precision, Recall, F-score
      LOC (388) | 92%, 55%, 69 | 77.8%, 39.2%, 52.1
      ORG (157) | 90%, 57%, 70 | 65.8%, 30.6%, 41.7
      PER (614) | 91%, 56%, 69 | 73.3%, 37.6%, 49.7
      Overall (1,159) | 91%, 55%, 69 | 73.3%, 37.1%, 49.5
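The slide does not state how the F-scores were computed. Assuming the usual balanced F-score (the harmonic mean of precision and recall), the per-type figures of the reproduction can be recomputed from the precision and recall values on the slide. The helper name `f_score` is ours, and this is only a sanity-check sketch.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (same scale as inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) of the reproduction, copied from the slide.
reproduction = {
    "LOC": (77.8, 39.2),
    "ORG": (65.8, 30.6),
    "PER": (73.3, 37.6),
}

recomputed = {
    label: round(f_score(p, r), 1) for label, (p, r) in reproduction.items()
}
```

Because the inputs are themselves already rounded, a recomputed value can differ from the reported one in the last decimal. That is the kind of detail (number of decimals, order of rounding) that the previous slide lists as undocumented.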
  • 6. NER EXPERIMENTS
      • Variations on tokenisation yielded a 15 point drop in overall F-score
      • Results on individual folds differed by up to 25 points in F-score
      • Experimenting with a different implementation of the CRF algorithm yielded significantly different scores (almost attaining those of Freire et al. (2012) without the complex features)
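To make the tokenisation point concrete, here is a small sketch of how two reasonable tokenisers disagree on the same input. Both functions and the sample string are invented for illustration; neither is the tokeniser used by Freire et al. (2012) or in the reproduction.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def punct_split_tokens(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical record title, not taken from the dataset.
sample = "Portrait of J. Smith, Amsterdam (ca. 1650)"

coarse = whitespace_tokens(sample)  # 'Smith,' keeps its trailing comma
fine = punct_split_tokens(sample)   # 'Smith' and ',' become separate tokens
```

A token-level tagger sees a different number of tokens and different surface forms in each case, so features, labels, and scores are no longer comparable unless the tokenisation is reported.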
  • 7. NER EXPERIMENTS
      • Preprocessing of the data set and the extra resources used probably influenced our experiments
      • Encoding the features as multivariate vs. boolean as input for the machine learner may have made a difference
      • Without additional information (exact output, data/resources after preprocessing), it is hard to find out what causes these differences or which experiment provides the most indicative results of the potential of the approach.
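The difference between multivariate and boolean feature encoding can be sketched as follows. The "shape" feature and its four values are hypothetical and chosen only for illustration; the slides do not specify which features were encoded which way.

```python
SHAPES = ("ALLCAPS", "Capitalised", "digits", "lower")


def multivariate_features(token):
    """One feature ('shape') that takes exactly one of several values."""
    if token.isupper():
        shape = "ALLCAPS"
    elif token[:1].isupper():
        shape = "Capitalised"
    elif token.isdigit():
        shape = "digits"
    else:
        shape = "lower"
    return {"shape": shape}


def boolean_features(token):
    """The same information as one true/false feature per possible value."""
    shape = multivariate_features(token)["shape"]
    return {f"shape={value}": shape == value for value in SHAPES}


as_multivariate = multivariate_features("Amsterdam")
as_boolean = boolean_features("Amsterdam")
```

Both encodings carry the same information, but a learner may treat them differently (for example in how weights are assigned or regularised), which is one way two faithful implementations of the same feature set can diverge.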
  • 8. WORDNET SIMILARITY EXPERIMENTS
      • Marten Postma & Piek Vossen wanted to run a WordNet similarity experiment for Dutch previously done for English by Patwardhan & Pedersen (2006) and Pedersen (2010)
      • This experiment ranks similarities between words based on WordNet similarity measures and compares it to human rankings
      • Step 1: replicate the WordNet calculations of the original experiments
  • 9. WORDNET SIMILARITY EXPERIMENTS
      • The code used in the original experiments is open source, but our results still differed from the original ones
      • Pedersen pointed out that the version of WordNet (and possibly Perl packages) may influence results
      • Experiments were repeated with the exact same versions as the original: results still differed
  • 10. WORDNET SIMILARITY EXPERIMENTS
      • Together with Ted Pedersen, we ran the experiment step by step, comparing outcomes until we obtained the same results
      • We identified the following factors that had led to differences:
          • Restriction on PoS tags
          • Gold Standard used
          • Ranking coefficient used
      • How much can these factors actually influence results?
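The "ranking coefficient used" factor is easy to illustrate: Spearman's rho and Kendall's tau summarise the same pair of rankings with different numbers. The sketch below implements both in plain Python for tie-free data; the two score lists are invented and do not come from the gold standards or measures in the experiments.

```python
from itertools import combinations


def ranks(scores):
    """Rank positions (1 = highest score); assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the squared rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical human judgements and similarity scores for five word pairs.
human = [3.9, 3.5, 3.0, 1.2, 0.4]
measure = [0.80, 0.90, 0.30, 0.10, 0.20]

rho = spearman_rho(human, measure)
tau = kendall_tau(human, measure)
```

On this toy input two adjacent swaps give rho = 0.8 but tau = 0.6, so a paper that reports "correlation" without naming the coefficient leaves its numbers open to more than one reading.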
  • 12. WORDNET SIMILARITY EXPERIMENTS: VARIATIONS IN OUTPUT
      Measure | rho | tau | rank
      path | 0.08 | 0.07 | 1-8
      wup | 0.09 | 0.08 | 1-6
      lch | 0.08 | 0.07 | 1-7
      res | 0.10 | 0.31 | 4-11
      lin | 0.24 | 0.17 | 6-10
      jcn | 0.27 | 0.23 | 5,7-11
      hso | 0.07 | 0.05 | 1-3,5-10
      vpairs | 0.30 | 0.24 | 7-11
      vector | 0.44 | 0.43 | 1,2,4,6-11
      lesk | 0.17 | 0.63 | 1-8,11,12
  • 13. WORDNET SIMILARITY EXPERIMENTS
      Variation | Spearman rho | Kendall tau | Different rank
      WordNet version | 0.44 | 0.42 | 88%
      Gold Standard | 0.24 | 0.21 | 71%
      PoS-tag | 0.09 | 0.08 | 41%
      Configuration | 0.08 | 0.60 | 41%
  • 14. WORDNET SIMILARITY EXPERIMENTS
      • Performance of similarity measures can vary significantly
      • Influential factors interact differently with individual scores (i.e. comparative performance changes)
      • Apart from WordNet version, these factors have (to our knowledge) not been discussed in previous literature, despite the fact that similarity scores are used very frequently
      • Open question: what is the impact of influential factors when similarity measures are used for other tasks?
  • 15. DISCUSSION
      • The results in this paper point to two main issues:
          (1) Our methodological descriptions often do not contain the details needed to reproduce our results
          (2) These details can have such high impact on our results that it is hard to distinguish the contribution of the approach from the contribution of preprocessing, the exact versions of tools/resources used, the evaluation set chosen, etc.
  • 16. CONCLUSION
      • It is easier to find out how an approach really works if you have:
          • the original code (even if containing hacks, unclean and/or undocumented)
          • a clear description of each individual step
          • the exact output on evaluation data (not just the overall numbers)
          • the preprocessed/modified/improved versions of standard resources
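Several items on this list (exact versions, preprocessed resources) can be captured automatically when an experiment runs. The sketch below is one possible way to do so, not a tool used by the authors: `build_manifest`, the resource name, its placeholder content, and the version string "x.y" are all hypothetical.

```python
import hashlib
import json
import platform
import sys


def sha256_of(data: bytes) -> str:
    """Hex digest that identifies the exact bytes of a resource."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(resources, versions):
    """Collect interpreter, platform, tool versions, and resource checksums."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "tool_versions": dict(versions),
        "resource_sha256": {
            name: sha256_of(data) for name, data in sorted(resources.items())
        },
    }


# Placeholder inputs; in practice, read the bytes of the files actually used.
manifest = build_manifest(
    resources={"gold_standard.txt": b"word1\tword2\t0.0\n"},
    versions={"wordnet": "x.y"},
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Storing `manifest_json` with the system output makes it possible to tell later whether a differing result comes from a changed resource, a changed tool version, or the approach itself.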
  • 17. CONCLUSION
      • Systematic testing can help to gain insight into the expected variation of a specific approach:
          • what is the performance on individual tools?
          • what are the best and worst results using different parameters?
          • how does performance compare using different evaluation metrics?
      • As a community, we should know where our approaches fail as much as (if not more than) where they succeed
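The "best and worst results using different parameters" question can be answered with a small sweep that reports the spread instead of a single number. This is a sketch under stated assumptions: `sweep` and `summarise` are our own helper names, and `toy_evaluate` is a synthetic stand-in that returns made-up scores, not the outcome of any real experiment.

```python
from itertools import product
from statistics import mean


def sweep(evaluate, grid):
    """Run evaluate(**config) for every combination of parameter values."""
    names = sorted(grid)
    results = {}
    for values in product(*(grid[name] for name in names)):
        results[values] = evaluate(**dict(zip(names, values)))
    return names, results


def summarise(results):
    """Report best, worst, mean, and spread of the collected scores."""
    best = max(results, key=results.get)
    worst = min(results, key=results.get)
    return {
        "best": (best, results[best]),
        "worst": (worst, results[worst]),
        "mean": mean(results.values()),
        "spread": results[best] - results[worst],
    }


def toy_evaluate(c, window):
    # Synthetic formula standing in for a full train-and-test run.
    return 50.0 + 10.0 * c - 2.0 * window


# Hypothetical parameter grid.
names, results = sweep(toy_evaluate, {"c": [0.1, 1.0], "window": [1, 3]})
summary = summarise(results)
```

Reporting `summary["spread"]` alongside the headline score shows readers how much of a claimed improvement could be explained by parameter choice alone.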
  • 18. THANK YOU & THANKS
      • To Ted Pedersen and Nuno Freire for writing this paper with us!
      • To Ruben Izquierdo, Lourens van der Meij, Christoph Zwirello, Rebecca Dridan and the Semantic Web group at VU University for their help and feedback
      • To the anonymous reviewers who really helped to make this a better paper