Brief description of standardised sequence metadata and their importance in comparative/integrative analysis. Thorough description of the ENVIRONMENTS tagger. Demonstration of a browser extension able to list, on demand, the Diseases, Tissues, Environments, and Organisms identified in a selected piece of text on a web page (thanks to the contribution of Dr. Lars Juhl Jensen and members of his group).
Text Mining and Environmental Metadata Suggestion
1. Text Mining and Environmental Metadata Suggestion
Evangelos Pafilis
Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC)
Hellenic Centre for Marine Research (HCMR), Heraklio Crete, Greece
pafilis@hcmr.gr, http://epafilis.info
ENA – 1st Dec 2014 – EBI, UK
5. Slide by Dr. P. Yilmaz, http://www.arb-silva.de/projects/contextual-data/
6. Essential Context Information
Metadata
Meta- = Μετά (“after”)
=> data “after” data
=> data describing data
7. a clear definition, that can be interpreted
in many, sometimes conflicting, ways
8. a clear definition, that can be interpreted
in many, sometimes conflicting, ways
Essential Context Information
9. Community Standards
• Standards (such as MIxS, MIMARKS)
see http://gensc.org/gc_wiki/index.php/GSC_Publications
for a comprehensive list of publications
• Capture genomic/metagenomic and other types of sequence contextual information
• Include detailed guidelines on how to annotate a sample
(e.g. Yilmaz P et al. (2011) The ISME Journal 5: 1565–1567)
http://gensc.org/
10. P. Yilmaz et al., Nat Biotech 29, 415–420 (2011)
13. • Project descriptions
• Scientific-content web pages
• Full text scientific articles
• Literature abstracts
• In-house documents
14. Microbes are key players in both healthy and
degraded coral reefs. A combination of
metagenomics, microscopy, culturing, and
water chemistry were used to characterize
microbial communities on four coral atolls in
the Northern Line Islands, central Pacific.
Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3
(“Project Description”)
15. Looking up terms:
Intensive, learning curve
19. ENVIRONMENTS: ENVO term identification in text
terrestrial, aquatic,
marine, lagoon, coral reef,
sediment, freshwater, soil
20. ENVIRONMENTS: ENVO term identification in text
Microbes are key players in both healthy and
degraded coral reefs. A combination of
metagenomics, microscopy, culturing, and
water chemistry were used to characterize
microbial communities on four coral atolls in
the Northern Line Islands, central Pacific.
Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3
(“Project Description”)
21. ENVIRONMENTS: ENVO term identification in text
ID: ENVO:00000150
Name: coral reef
Microbes are key players in both healthy and
degraded coral reefs. A combination of
metagenomics, microscopy, culturing, and
water chemistry were used to characterize
microbial communities on four coral atolls in
the Northern Line Islands, central Pacific.
Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3
(“Project Description”)
23. ENVIRONMENTS
http://environments.hcmr.gr
http://environments-eol.blogspot.gr/
● Dictionary based
● Open source
● Environment Ontology
● Fast performance
● 4000 PubMed abstracts / second *
● Based on the SPECIES name recognition tagger (Pafilis et al., PLoS ONE)
● E600 gold standard: ENVO-based corpus of EOL species pages
● Recognition accuracy, mention level:
- F1: 82.0%
- 87.1% of the TPs: exact ID among the predicted ones
● Submitted preprint: http://biorxiv.org/content/early/2014/11/13/011403
Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390
*: based on a single-thread run on an Intel 2.27 GHz, 24 GB RAM machine processing a set of 536,052 abstracts
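To make "dictionary based" concrete, here is a minimal sketch of dictionary matching that reports character offsets. It is not the ENVIRONMENTS tagger's actual implementation: the function name `tag`, the regular-expression approach, and the tiny `TOY_DICTIONARY` are assumptions made for illustration. Only the two name/ID pairs (coral reef, lagoon) come from these slides.

```python
import re

# Toy dictionary (assumption): a real one would hold all ENVO names and synonyms.
# The two IDs below are the ones shown in this deck.
TOY_DICTIONARY = {
    "coral reef": "ENVO:00000150",
    "coral reefs": "ENVO:00000150",
    "lagoon": "ENVO:00000038",
}


def tag(text, dictionary=TOY_DICTIONARY):
    """Return (start, end, matched_text, envo_id) tuples; end is inclusive."""
    # Try longer names first so they win over shorter names they contain.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,  # case-insensitive matching
    )
    hits = []
    for match in pattern.finditer(text):
        envo_id = dictionary[match.group(0).lower()]
        hits.append((match.start(), match.end() - 1, match.group(0), envo_id))
    return hits


text = "Microbes are key players in both healthy and degraded coral reefs."
for hit in tag(text):
    print(hit)
```

A regular expression keeps the sketch short; it says nothing about how the real tagger reaches the throughput quoted on this slide.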
24. ENVO: source of environment descriptor names and synonyms
http://environmentontology.org
~1600 terms, June 2013
Figure (hierarchy excerpt): biome, environmental feature, environmental material, environmental condition, …, habitat, …
Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany
25. ENVIRONMENTS – Improving Accuracy
● Increasing matches in text
● Orthographic variation supported
e.g. freshwater, fresh water, and fresh-water
● Case-insensitive matching
● Synonym generation to reflect the way environment descriptive
terms are mentioned in text (both generic and ENVO specific)
Action                                    Example
Add a variant in which non-informative    epipelagic zone → epipelagic
words have been removed                   estuarine biome → estuarine
Plural form addition                      sediment → sediments
Adjective form addition                   lagoon → lagoonal
● Preventing overmatching (i.e. avoiding increased FP)
● “Stopword list” (e.g. spring, well, range)
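The dictionary-expansion actions on this slide can be sketched as follows. This is an illustrative approximation, not the tagger's code: the function `variants`, the word lists, and the naive "+s" plural rule are assumptions, and the sets contain only the examples shown on the slide. How the real tagger applies its stopword list is not stated here; the sketch simply drops such names.

```python
# All three collections are assumptions seeded with the slide's own examples.
NON_INFORMATIVE = {"zone", "biome"}        # words removed to form a shorter variant
STOPWORDS = {"spring", "well", "range"}    # ambiguous names that would overmatch
ADJECTIVE_FORMS = {"lagoon": "lagoonal"}   # hand-listed adjective forms


def variants(name):
    """Return a set of lower-cased surface forms for one dictionary name."""
    name = name.lower()
    if name in STOPWORDS:
        return set()  # prevent overmatching: the name is not used at all
    forms = {name}
    words = name.split()
    if len(words) > 1:
        # Orthographic variation: "fresh water", "fresh-water", "freshwater".
        forms.add("-".join(words))
        forms.add("".join(words))
        # Variant without non-informative words: "epipelagic zone" -> "epipelagic".
        kept = [word for word in words if word not in NON_INFORMATIVE]
        if kept and len(kept) < len(words):
            forms.add(" ".join(kept))
    if not name.endswith("s"):
        forms.add(name + "s")  # naive plural: "sediment" -> "sediments"
    if name in ADJECTIVE_FORMS:
        forms.add(ADJECTIVE_FORMS[name])  # "lagoon" -> "lagoonal"
    return forms


for example in ("fresh water", "epipelagic zone", "sediment", "lagoon", "spring"):
    print(example, "->", sorted(variants(example)))
```

The sketch deliberately overgenerates (joining every multi-word name, for instance); a production dictionary would need curation of the generated forms.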
26. Scope
ENVO parts not included:
species
tissues
foods
Limitations – Known Issues
negation not supported
conflicts with anatomy terms (e.g. mouth, blowhole)
27. ENVIRONMENTS – Sample Output
File Name                        Start coord  End coord  Match text  ENVO ID
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000192
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000043
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000012
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:01000001
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:00010483
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000180
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000191
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000176
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000477
Tags corresponding to the “Habitat” text data object (http://eol.org/data_objects/31415353)
of the EOL taxon Phoenicopterus ruber (Greater Flamingo): http://eol.org/pages/913221
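Output in this five-column layout is straightforward to post-process. The sketch below parses a few of the rows shown on this slide and groups the ENVO IDs per matched mention. The names `Tag`, `parse_line`, and `ids_per_mention` are hypothetical, and the parser assumes whitespace-separated columns and a file name without spaces; the actual delimiter used by the tagger is not visible on the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Tag(NamedTuple):
    file_name: str
    start: int
    end: int
    text: str
    envo_id: str


def parse_line(line):
    """Parse one output row: file name, start, end, match text, ENVO ID."""
    tokens = line.split()
    # The match text may contain spaces ("mud flats"), so it is everything
    # between the two coordinates and the trailing ENVO ID.
    return Tag(tokens[0], int(tokens[1]), int(tokens[2]),
               " ".join(tokens[3:-1]), tokens[-1])


sample = """\
eol_documents_ascii_nonHTML.txt 346289845 346289853 mud flats ENVO:00000192
eol_documents_ascii_nonHTML.txt 346289845 346289853 mud flats ENVO:00002297
eol_documents_ascii_nonHTML.txt 346289871 346289873 mud ENVO:01000001
eol_documents_ascii_nonHTML.txt 346289905 346289910 mounds ENVO:00000180
"""

tags = [parse_line(line) for line in sample.splitlines() if line.strip()]

# Group the ENVO IDs reported for each mention (same file and coordinates).
ids_per_mention = defaultdict(list)
for tag in tags:
    ids_per_mention[(tag.file_name, tag.start, tag.end, tag.text)].append(tag.envo_id)

for (_, start, end, text), envo_ids in ids_per_mention.items():
    # In the rows shown here, end - start + 1 equals the length of the match text.
    print(text, start, end, end - start + 1 == len(text), envo_ids)
```

In these sample rows the end coordinate is inclusive, which the length check in the loop makes visible.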
28. ENVIRONMENTS – Sample Output
Same sample output as on slide 27, annotated: traversing all IS_A, PART_OF relationships in ENVO (hence several ENVO IDs per matched text).
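The IS_A / PART_OF traversal noted on this slide amounts to collecting all ancestors of a matched term in the ontology graph. Below is a minimal sketch of that idea on a made-up graph. The `TOY:` identifiers, the `TOY_PARENTS` structure, and the function names are hypothetical and do not reproduce real ENVO terms or the tagger's internals.

```python
from collections import deque

# Hypothetical toy ontology: child term -> list of (relation, parent term).
TOY_PARENTS = {
    "TOY:mud_flat": [("is_a", "TOY:flat"), ("part_of", "TOY:coast")],
    "TOY:flat": [("is_a", "TOY:feature")],
    "TOY:coast": [("is_a", "TOY:feature")],
    "TOY:feature": [],
}


def ancestors(term, parents=TOY_PARENTS, relations=("is_a", "part_of")):
    """Breadth-first walk up the selected relations; returns all ancestors."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in parents.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def expand(term):
    """A matched term plus every term reachable via IS_A / PART_OF."""
    return {term} | ancestors(term)


print(sorted(expand("TOY:mud_flat")))
```

Reporting the expanded set for every mention would explain why a single match such as "mud flats" appears with several ENVO IDs in the sample output.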
29. Download
ENVIRONMENTS
• Home Page: http://environments.hcmr.gr/
• Tagger Software:
http://download.jensenlab.org/environments_tagger.tar.gz
41. Summary
• Importance of standardized metadata and annotations
• ENVO: standardized, hierarchically organized descriptions of environment types
• Literature, project, and other scientific-content web pages may describe the environmental context of a metagenomics sample
• ENVIRONMENTS:
- Dictionary-based identification of environment descriptive terms
- Ontological community standards, e.g. ENVO, as the name source
- Command line application
- Browser extension, a user-friendly interface
- Highly interactive
- Can be used while browsing the web
- Extracts ENVO terms from a selected part of a web page
- Extended for: organism, disease, and tissue mention identification
43. BioCreative: Metagenomics Track
Critical Assessment of Information Extraction in Biology
• Preparing a Metagenomics Track as part of the BioCreative 2015 challenge
• Aim: improve the environmental-context annotation of sequences in major
metagenomics repositories.
• Track coordinator: Dr. L. Hirschman, MITRE
• BioCreative (www.biocreative.org)
44. Biodiversity – Genomics
ENVIRONMENTS-EOL
http://environments-eol.blogspot.com/
Encyclopedia of Life (EOL) http://www.eol.org
• process EOL taxon pages
• extract environmental context (ENVO terms)
• EOL Taxon Page: Quick Facts, Data tab
• integrated in Traitbank
• large scale biological questions
Rubenstein Fellowship 2013
In collab: Jennifer Hammock, Patrick Leary, Katja Schulz, Cyndy Parr
Hexanchus griseus EOL page, http://eol.org/pages/212027
SEQenv http://environments.hcmr.gr/seqenv.html
• annotate microbial sequences with ENVO terms
• sequence analysis, literature mining, visualization
• GenBank isolation source, PubMed Abstracts
• sample comparison, temporal/spatial pattern analysis
• extension: proteins, protein families, 3D visualization
Reused: Analysis of America bird habitats, http://blog.eol.org/
(NoPlaceLikeHome, in collab: Rob Stevenson, Carl Nordman)
ACTION ES1103
45. http://jensenlab.org/
Santos A et al. (under review),
preprint: http://biorxiv.org/content/early/2014/11/10/010975
Frankild S et al. (under review),
preprint: http://biorxiv.org/content/early/2014/08/25/008425
Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of
Taxonomic Names in Text. PLoS ONE 8(6): e65390
46. Acknowledgements
Thank You!
HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou
Lucia Fanini, Sarah Faulwetter, Anastasis Oulas
NNF CPR: Lars Juhl Jensen, Sune Frankild
U Mass: Rob Stevenson
Uni Glasgow: Christopher Quince, Umer Ijaz
EOL: Cynthia Parr, Jennifer Hammock, Patrick Leary, Katja Schulz
MM-MPI: J. Schnetzer, AWI: Dr P. Buttigieg, HITS: Dr. S. Berger and more
Funding: EOL Rubenstein Fellowship, LifeWatch Greece, MARBIGEN,
NNF-CPR, EOL-BHL NESCent Research Sprint 2014, “SEQenv” Hackathons (COST ES1103)
Amvrakikos Lagoons, May 2011
ACTION ES1103
47. Acknowledgements
Amvrakikos Lagoons, May 2011
id: ENVO:00000038
name: lagoon
48. Tutorial
• Start Firefox
• Install the “megx-seqenv-bar.xpi”
• Drag and drop
• “Install Now” and “Restart”
• Visit a couple of PubMed abstracts or article web pages of your preference
• Annotate the complete abstract
• Annotate selected sentences only