EFO tools - the good, the great, and the evil

Tomasz Adamusiak

An overview of some of the tooling used in maintaining the Experimental Factor Ontology

1. EFO tools – the good, the great, and the evil Tomasz Adamusiak MD PhD

2. Huge ontology developed by a tiny team

3. We have means to assign blame when things go wrong (definition_editor)

4. We need richness and consistency for EFO based query expansion

5. New terms come from GXA and external users (GXA, Zooma, OLS, BioPortal, similarity_match.pl, OWL::Simple::Parser, MeSH::Parser::ASCII)

6. Xrefs are acquired by lexical cross-match to other ontologies (similarity_match.pl, OWL::Simple::Parser, MeSH::Parser::ASCII)

7. Definitions and synonyms are pulled in from external ontologies via NCBO BioPortal + provenance (BioPortal, metadata, xrefs, BioportalImporter)

8. Regression testing is essential as these are massive updates

9. We need better concept recognition because clean_ontology_terms.pl is evil

10. We need fuzziness, because input data is extremely dirty

11. There are different levels of fuzziness (similarity_match.pl, metaphone & double metaphone, Levenshtein distance, n-grams, clean_ontology_terms.pl)

12. N-grams are a simple and relatively unknown method of approximate string matching

13. N-grams are extremely effective in practice: "The quick brown fox" vs. A. "brown quick The fox", B. "The quiet swine flu" (18%, 90%, 19%, 40%)

14. The King is dead. Long live the Queen.

15. OntoCAT is a great success and generated a lot of interest within the community

16. Natalja & Misha hit the mother lode

17. Which diseases affect heart components? Kurbatova N et al. Bioinformatics 2011;27:2468-2470

18. Acknowledgments: Morris A. Swertz's group at the Genomics Coordination Center (GCC), University of Groningen; K Joeri van der Velde; Despoina Antonakaki; Dasha Zhernakova; James Malone; Helen Parkinson; Emma Hastings; Niran Abeygunawardena; Ele Holloway; Tim Rayner. Zooma: Tony Burdett. Bioconductor/R package: Natalja Kurbatova, Pavel Kurnosov, Misha Kapushesky. This work was supported by the European Community's Seventh Framework Programmes GEN2PHEN [grant number 200754], SLING [grant number 226073], and SYBARIS [grant number 242220], the European Molecular Biology Laboratory, the Netherlands Organisation for Scientific Research [NWO/Rubicon grant number 825.09.008], and the Netherlands Bioinformatics Centre [BioAssist/Biobanking platform and BioRange grant SP1.2.3]. OntoCAT logo courtesy of Eamonn Maguire. Special thanks go to the NCBO BioPortal and EBI OLS support teams for all the comprehensive help they provide.

Editor's Notes

The Experimental Factor Ontology (EFO) is a great application ontology, hugely popular among internal and external collaborators and featured among the top 10 most accessed ontologies within NCBO BioPortal, which provides access to hundreds of different ontology resources. It is a pleasure to be involved in this project.
I joined the EFO team around January 2008, working in parallel on GEN2PHEN, into which some of this work was fed back. My first task was designing and implementing a workflow for pulling in metadata (synonyms and definitions) for cross-referenced (xrefed) terms from external ontologies. We now have nearly 5,000 classes and 20,000 synonyms, and growth continues steadily.
A Venn diagram representing who edited or added which class. Where the sets overlap, the same class was touched by more than one person. Three people interact with the ontology directly: Helen, James and I. Ele and Jie submit large term requests, so they added those classes indirectly through one of us.
This is how we're leveraging all the rich metadata within the ontology. Here is an example of querying ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) for CML and getting all experiments that are also annotated with chronic myeloid leukemia and chronic myelogenous leukemia. Querying for leukemia or blood cancer would also give you these results. Any inconsistency in the ontology would negatively influence this outcome.
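The expansion described above can be sketched roughly as follows. This is a minimal illustration over a toy in-memory dictionary, not the ArrayExpress implementation: the data layout, the function names, and the choice to model "blood cancer" as a synonym of leukemia are all assumptions made for the example.

```python
# Toy ontology fragment. The structure and the placement of "blood cancer"
# as a synonym are assumptions for illustration, not EFO's real data model.
ONTOLOGY = {
    "leukemia": {
        "synonyms": ["blood cancer"],
        "children": ["chronic myeloid leukemia"],
    },
    "chronic myeloid leukemia": {
        "synonyms": ["CML", "chronic myelogenous leukemia"],
        "children": [],
    },
}


def find_class(term):
    """Return the label of the class whose label or synonym equals term."""
    needle = term.lower()
    for label, entry in ONTOLOGY.items():
        names = [label] + entry["synonyms"]
        if needle in (name.lower() for name in names):
            return label
    return None


def expand_query(term):
    """Expand a query term with synonyms and all descendant classes."""
    start = find_class(term)
    if start is None:
        return {term}  # unknown term: search for it verbatim
    expanded, seen, stack = set(), set(), [start]
    while stack:
        label = stack.pop()
        if label in seen:
            continue
        seen.add(label)
        expanded.add(label)
        expanded.update(ONTOLOGY[label]["synonyms"])
        stack.extend(ONTOLOGY[label]["children"])
    return expanded


print(sorted(expand_query("CML")))
```

In this sketch a query for "CML" expands to the class label and both synonyms, while a query for "leukemia" additionally pulls in everything below it, which is why a misplaced class or a wrong synonym directly degrades the search results.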
Here's a typical workflow. Annotations in the Gene Expression Atlas (http://www.ebi.ac.uk/gxa/) that are not yet mapped to EFO are discovered by Zooma (zooma.sf.net). Zooma first checks whether a mapping already exists within the Atlas; if not, it tries to map the annotation to EFO or to other ontologies in OLS and BioPortal via OntoCAT. The output is then fed into the similarity_match.pl script to double-check that no similar terms are already in EFO (Zooma performs only exact matching), and the vetted terms are finally added to EFO, either via James' tab_to_owl script or manually. Another source of new terms is requests from external users, who usually supply a flat list of terms they would like to see in the ontology. These are mapped via similarity_match.pl to check whether they are already in EFO, and then added. similarity_match.pl has custom, dedicated dependencies for parsing OWL ontologies and MeSH.
Before metadata from external resources can be imported into EFO, we need to add appropriate xrefs. These are stored in a dedicated annotation, 'definition_citation', on the mapped term within EFO. The xrefs are discovered by using similarity_match.pl to align other ontologies (e.g. MeSH, OMIM, NCI Thesaurus, Brenda, Cell Type, etc.) lexically to EFO. Note that other tools in this domain rely on information content to align ontologies. As far as I know they use exact matching only, so our approach could in fact be more efficient, and in my experience the information content approach does not add much value to the alignment.
Once we have the xrefs in, we can use a separate application, BioportalImporter, which follows each xref to the respective external term via BioPortal and imports all the missing synonyms and definitions into EFO, recording the source in a dedicated 'bioportal_provenance' annotation. With OWL2 it would also be possible to annotate the annotation directly.
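The import step can be pictured with the sketch below. It is not the BioportalImporter code, which is not shown in this talk: external terms are a plain dictionary rather than BioPortal responses, the identifiers such as EXT:0001 are hypothetical, and the provenance is stored as simple (value, source) pairs purely for illustration. Only the annotation names 'definition_citation' and 'bioportal_provenance' come from the notes above.

```python
def import_metadata(efo_class, external_terms):
    """Copy missing synonyms and definition from xrefed external terms.

    efo_class: dict with label, synonyms, definition, definition_citation
               (list of xrefs) and bioportal_provenance (list of pairs).
    external_terms: dict mapping an xref to its synonyms and definition;
                    a stand-in for a lookup via BioPortal (assumption).
    """
    known = {efo_class["label"].lower()}
    known.update(s.lower() for s in efo_class["synonyms"])

    for xref in efo_class["definition_citation"]:
        external = external_terms.get(xref)
        if external is None:
            continue  # xref could not be resolved; nothing to import

        for synonym in external["synonyms"]:
            if synonym.lower() not in known:
                efo_class["synonyms"].append(synonym)
                known.add(synonym.lower())
                # Record where the imported value came from.
                efo_class["bioportal_provenance"].append((synonym, xref))

        if efo_class["definition"] is None and external.get("definition"):
            efo_class["definition"] = external["definition"]
            efo_class["bioportal_provenance"].append(
                (external["definition"], xref)
            )
    return efo_class
```

Keeping the source next to every imported value is what later makes it possible to compare two versions of an import and report what changed upstream.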
Part of the BioportalImporter code base is consistency checking, which performs 13 different tests once the import is completed. Most importantly, it reports any changes in external resources by cross-referencing provenance information between two versions of the import, and it also alerts on potentially duplicated terms by looking for metadata shared between two distinct terms within EFO. BioportalImporter is not in the public domain, as it is tied quite heavily to EFO specifics, but most of the ontology handling code is actually in OntoCAT. Overview of the tests: malformed EFO URIs; changed ontology annotations; changed classes; obsoleted classes; renamed classes; duplicated xrefs; duplicated synonyms or labels; duplicated xrefs same as URI; local EFO URIs on external classes; changed features; changed external classes; circular references; non-English characters in annotations.
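As an illustration of one of these checks, the sketch below flags labels or synonyms shared by two distinct classes. It is a simplified stand-in, not the actual test from BioportalImporter; the function name, the data layout, and the class identifiers used in the example are hypothetical.

```python
from collections import defaultdict


def find_duplicated_names(classes):
    """Return {normalised name: sorted class ids} for shared names.

    classes maps a class id to a dict with a label and a list of synonyms.
    Names are compared case-insensitively after trimming whitespace.
    """
    owners = defaultdict(set)
    for class_id, entry in classes.items():
        for name in [entry["label"], *entry["synonyms"]]:
            owners[name.strip().lower()].add(class_id)
    # Only names claimed by more than one class are suspicious.
    return {name: sorted(ids) for name, ids in owners.items() if len(ids) > 1}


example = {
    "CLASS_A": {"label": "heart", "synonyms": ["cardiac muscle organ"]},
    "CLASS_B": {"label": "cardiac tissue", "synonyms": ["Heart"]},
    "CLASS_C": {"label": "liver", "synonyms": []},
}
print(find_duplicated_names(example))
```

A hit does not prove that two classes are duplicates, only that a curator should look at them, which is why the notes describe these as alerts on potentially duplicated terms.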
clean_ontology_terms.pl relies on the metaphone and double metaphone algorithms. Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It uses a larger set of rules for English pronunciation. The aim of Metaphone is to match words or names that are pronounced similarly, according to a criterion of similarity that ignores any non-initial vowels and treats voiced and unvoiced versions of consonants as the same. Its latest version, Metaphone 3, achieves an unparalleled level of accuracy in producing correct lookup keys for English words, non-English words familiar to English speakers, and names commonly found in the United States, within the criterion of similarity defined above, but it is not designed to match words which are clearly pronounced differently. Recently published: "Anatomy ontologies and potential users: bridging the gap", Ravensara S Travillian, Tomasz Adamusiak, Tony Burdett, Michael Gruenberger, John Hancock, Ann-Marie Mallon, James Malone, Paul Schofield and Helen Parkinson. While the original aim of the article was to show how difficult it is to align the two anatomy ontologies, FMA and Uberon, the other conclusion that can be reached is that metaphone algorithms are inapplicable to this particular use case. Most importantly, clean_ontology_terms.pl performed only marginally better than Zooma doing exact matching, with an enormous hit to precision (~0.07), because for lack of better matches the script would present all the phrases that merely start with the same word. This is a side effect of double metaphone being misapplied to a whole phrase rather than to individual words, and it differs from the behaviour of classic metaphone.
Our input data is rarely about differences in spelling, such as British tumour and American tumor, but rather about different grammatical number (cell vs. cells), digits, typos, and differently ordered words in similar phrases. Here the left column shows example unmapped annotations from the Atlas, and the right-hand column shows existing terms in EFO to which we would like to map them semi-automatically. The ontology is too big to handle manually, and it is no longer possible to remember whether a particular term has already been added; that's why we need to automate this.
First of all, clean_ontology_terms.pl is not that fuzzy at all. Tim Rayner, the original developer of clean_ontology_terms.pl, had already considered a fuzzier approach, and there is a comment in the code suggesting the use of Levenshtein distance. Rather than extending the script further, I rewrote it from scratch as similarity_match.pl. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. The algorithm is an example of bottom-up dynamic programming, a method for solving complex problems by breaking them down into simpler subproblems. Similar approaches have already been studied extensively in DNA sequence alignment, and the edit distance approach is further generalised by the local and global alignment algorithms, Smith–Waterman and Needleman–Wunsch, but they don't offer much improvement for transpositions, i.e. different ordering of words in a phrase. And this is where n-grams excel.
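For reference, the bottom-up dynamic programming formulation of the Levenshtein distance can be written in a few lines. This is the textbook algorithm in Python, shown only to make the definition concrete; it is not the code used in similarity_match.pl, and the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("tumour", "tumor"))  # 1
print(levenshtein("cell", "cells"))    # 1
```

Spelling variants and plural forms are a single edit apart, which is why the measure handles those cases well; reordering whole words, on the other hand, costs many edits.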
An n-gram is basically a fragment of length n from a given sequence. This idea can be traced to Claude Shannon's work in information theory in the 1940s, but it was Gravano et al. who first suggested it for string querying in database applications.
N-grams work particularly well for transpositions. This surprisingly simple and easy-to-implement approach allows some powerful fuzzy matching. The general idea is that you split the two strings in question into all their possible 2-character fragments (2-grams) and treat the number of n-grams shared between the two strings as their similarity metric. This can easily be normalised by dividing the shared number by the total number of n-grams in the longer string. Here we have three strings, each 19 characters long. The two surprising things about using Levenshtein distance in this case are that both strings score quite low on similarity, and that the completely different one is actually scored as more similar. N-grams, on the other hand, deliver exactly the result we expect, with sentence A being the most similar to the template: almost identical, sharing 18 out of 20 possible 2-grams. Note that there is a variation of Levenshtein distance called Damerau–Levenshtein, but it only allows for transposition of two adjacent characters.
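The 2-gram comparison above can be reproduced with the short sketch below. It assumes each string is padded with one boundary character on either side, which is one way a 19-character string yields 20 2-grams; the padding character, the function names, and the use of Python rather than the Perl of similarity_match.pl are all choices made for this illustration.

```python
from collections import Counter


def ngrams(text, n=2, pad="#"):
    """Split text into overlapping n-grams, padding both ends (assumption)."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by the n-gram count of the longer string."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    # Multiset intersection: a repeated n-gram only counts as often
    # as it occurs in both strings.
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return shared / max(len(grams_a), len(grams_b))


template = "The quick brown fox"
print(ngram_similarity(template, "brown quick The fox"))  # 0.9
print(ngram_similarity(template, "The quiet swine flu"))  # 0.4
```

Under these assumptions sentence A shares 18 of the 20 2-grams with the template and sentence B shares 8, because reordering words only breaks the few 2-grams that span a word boundary.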
clean_ontology_terms.pl is being retired in favour of similarity_match.pl. Emma (emma@ebi.ac.uk) refactored all the code and repackaged it, for easier integration and reuse, into a dedicated set of modules, EBI::FGPT::FuzzyRecogniser (http://search.cpan.org/dist/EBI-FGPT-FuzzyRecogniser/), available on CPAN.
Blowing my own trumpet here. The OntoCAT article was featured in the top 10 most accessed articles at BMC Bioinformatics a few months ago. The website (http://www.ontocat.org) sees about 1,000 pageviews monthly.
But it was Natalja and Misha who stole the show with the ontoCAT R package included in Bioconductor. Googling for 'ontology R' returns the wiki page for the package as the first hit, and the actual article as the fourth. This is no small feat considering the dedicated Gene Ontology R packages that otherwise dominate this space.
An example of a directed acyclic graph representing all the relationships in the ontology for a particular EFO term, 'EFO_0000815' (heart). Edges are labelled according to the relationship. Organism part classes are represented as ellipses and disease classes are shown as rectangles. The ontoCAT package was used to compute the relationships, which were later processed in Cytoscape (Cline et al., 2007). Converting the whole ontology to what is effectively RDF triples is a computationally intensive task, and takes about 30 minutes when run on 200 cluster nodes and parallelised by multiprocessing. It is demonstrated in Example 16 in the online documentation (http://www.ontocat.org/browser/trunk/ontoCAT/src/uk/ac/ebi/ontocat/examples/Example16.java).
