Metagenomic Data Provenance and Management using the ISA infrastructure - overview, implementation patterns & software tools
Slides presented at EBI Metagenomics Bioinformatics course: http://www.ebi.ac.uk/training/course/metagenomics2014
1. Metagenomic Data Provenance and
Management using the ISA infrastructure
overview, implementation patterns & software tools
Alejandra Gonzalez-Beltran, PhD
Eamonn Maguire
alejandra.gonzalezbeltran@oerc.ox.ac.uk
eamonn.maguire@oerc.ox.ac.uk
Metagenomics Bioinformatics,
EMBL-EBI, Hinxton, UK
September 2014
University of Oxford e-Research Centre, UK
8. Experimental Metadata
Notes in lab notebooks (information for humans)
Spreadsheets & tables
RDF statements (information for machines)
It is all about structuring experimental information to make it available to computers and software agents to enable:
• provenance tracking
• assessment and evaluation
• accountability, reliability, trust, evidence
• conservation, preservation, storage, archiving and mining
12. A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• also by communities working to build a library of cellular signatures
14. Why ISA format and Tools?
investigation: high level concept to link related studies
study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays
assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
data: pointers to data file names/locations; external files in native or other formats
ISA metadata specifications:
• workflow and process orientated
• compatible with checklist enforcement
• compatible with external vocabulary resources
• compatible by design with existing schemas
[Figure: worked example. Sources H1 and H2 (H. Sapiens; ages 33 and 35 Years) give samples H1.sample1, H1.sample2 and H2.sample1; a Labeling step gives H1.sample1.labeled and H2.sample1.labeled; data files h1-s1.cel, h1-s2.cel and h2-s1.cel. Format labels shown: MAGE-Tab, Pride-xml, SRA-xml]
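The investigation / study / assay nesting described on this slide can be sketched as plain data classes. This is only an illustration of the hierarchy, not the ISA tools' own object model: the class and field names (`Investigation`, `Study`, `Assay`, `measurement`, `all_data_files`) are assumptions made for this sketch, and the values reuse the H1/H2 labels from the slide's example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    # A test performed on material from the subject; it points at data files.
    measurement: str  # hypothetical free-text label for the assay
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    # The central unit: subjects under study plus the assays run on them.
    identifier: str
    subjects: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    # High level concept linking related studies.
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        """Walk investigation -> study -> assay and collect data file names."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


investigation = Investigation(
    title="example investigation",
    studies=[
        Study(
            identifier="study-1",
            subjects=["H1", "H2"],
            assays=[Assay("labeling", ["h1-s1.cel", "h1-s2.cel", "h2-s1.cel"])],
        )
    ],
)
print(investigation.all_data_files())
```

The assay only stores file names, mirroring the slide's point that ISA holds pointers to external data files rather than the data itself.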
15. Essentials about ISA syntax
• 3 types of files
• Investigation file: at most 1 (think executive summary)
–Why? general study description
–How? methods / protocol declaration
–How? variable declarations (factors and response variables)
–Who? contact and affiliation information
• Study file: true table (think sorting, filtering)
–What? Lists all biological materials collected over the course of the study.
• Assay file: true table (think sorting, filtering)
–Results! Lists all data files collected by a given assay
–n files, as many as there are assay types declared
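A minimal sketch of the three file types, assuming the usual ISA-Tab naming convention in which investigation, study and assay files start with `i_`, `s_` and `a_` respectively (the slide does not state this convention). The function name `classify_isatab_files` and the example file names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed filename prefixes for the three ISA-Tab file types.
PREFIX_TO_KIND = {"i_": "investigation", "s_": "study", "a_": "assay"}


def classify_isatab_files(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group file names by ISA file type and enforce 'at most one investigation'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        groups[PREFIX_TO_KIND.get(name[:2], "other")].append(name)
    if len(groups["investigation"]) > 1:
        raise ValueError("an ISA-Tab archive holds at most one investigation file")
    return dict(groups)


archive = [
    "i_investigation.txt",
    "s_study.txt",
    "a_metagenome.txt",
    "a_metatranscriptome.txt",
    "reads_1.fastq",
]
layout = classify_isatab_files(archive)
print(layout["assay"])
```

Here the two `a_` files stand for two declared assay types, matching the "n files" bullet.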
16. Essentials about ISA syntax
• Material Transformations:
– Inputs and Outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name).
[Diagram: Material → Protocol (Process) → Material → DATA]
Material Node: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]
Protocol (Process): Parameter Value[…], Performer (operator effect), Date (day effect)
Data File Node: Raw Data File, Derived Data File
18. Essentials about ISA syntax
–Branching events: Tabular Representation
[Diagram: Source Material (human volunteer 1) → Sample Material (peripheral blood, muscle biopsy, liver biopsy)]
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
volunteer 1 | Homo sapiens | sample collection | heparinated tube, room temperature | volunteer 1 - sample1 | peripheral blood
volunteer 1 | Homo sapiens | sample collection | liquid nitrogen | volunteer 1 - sample2 | muscle
volunteer 1 | Homo sapiens | sample collection | liquid nitrogen | volunteer 1 - sample3 | liver
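In the tabular form, a branching event shows up as one Source Name repeated across several rows, each with a different Sample Name. The sketch below reads such a table as tab-delimited text and lists the samples derived from each source. The helper name `samples_per_source` is hypothetical, and the rows are transcribed from the slide's example.

```python
import csv
import io
from collections import defaultdict

HEADER = ["Source Name", "Characteristics[organism]", "Protocol REF",
          "Parameter Value[storage condition]", "Sample Name", "Characteristics[organ]"]
ROWS = [
    ["volunteer 1", "Homo sapiens", "sample collection",
     "heparinated tube, room temperature", "volunteer 1 - sample1", "peripheral blood"],
    ["volunteer 1", "Homo sapiens", "sample collection",
     "liquid nitrogen", "volunteer 1 - sample2", "muscle"],
    ["volunteer 1", "Homo sapiens", "sample collection",
     "liquid nitrogen", "volunteer 1 - sample3", "liver"],
]
# Study tables are tab-delimited, so commas inside a cell need no quoting.
STUDY_TABLE = "\n".join("\t".join(cells) for cells in [HEADER] + ROWS)


def samples_per_source(table_text):
    """Map each Source Name to the Sample Names derived from it."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    branches = defaultdict(list)
    for row in reader:
        branches[row["Source Name"]].append(row["Sample Name"])
    return dict(branches)


branches = samples_per_source(STUDY_TABLE)
print(branches)
```

A source with more than one entry in the result is a branching point.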
19. Essentials about ISA syntax
–Pooling events: Tabular Representation
[Diagram: Source Material (animal 1, animal 2, animal 3) → Sample Material (salivary glands)]
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
animal 1 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
animal 2 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
animal 3 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
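Pooling is the mirror image of branching: several Source Names share one Sample Name. A minimal sketch, assuming the rows have already been read into dictionaries keyed by column header; `find_pools` is a hypothetical helper name.

```python
from collections import defaultdict

# Only the two columns needed to detect pooling are kept here.
ROWS = [
    {"Source Name": "animal 1", "Sample Name": "pool1"},
    {"Source Name": "animal 2", "Sample Name": "pool1"},
    {"Source Name": "animal 3", "Sample Name": "pool1"},
]


def find_pools(rows):
    """Return samples that were produced from more than one source."""
    sources_by_sample = defaultdict(set)
    for row in rows:
        sources_by_sample[row["Sample Name"]].add(row["Source Name"])
    return {
        sample: sorted(sources)
        for sample, sources in sources_by_sample.items()
        if len(sources) > 1
    }


pools = find_pools(ROWS)
print(pools)
```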
20. Essentials about ISA syntax
Tagging with Terminologies
• Implicit column order matters:
Source Name | Characteristics[organism] | Factor Value[compound agent] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration] | Factor Value[washout period] | Factor Value[duration] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
individual1 | human
• ISA tools (ISAcreator - ISAconfigurator) provide ontology term selection and term tagging facilities to help users.
Source Name | Characteristics[organism] | Term Source REF | Term Accession Number | Characteristics[duration] | Unit | Term Source REF | Term Accession Number | Factor Value[compound (htppt://purl] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwerwta | aspirin | CHEBI | 1231354
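Because qualifier columns such as Term Source REF and Term Accession Number carry no name of their own, they can only be interpreted by position: each one applies to the nearest attribute column to its left. The sketch below illustrates that reading of "column order matters". It is a simplification made for this example (it attaches Unit and term columns flatly to the preceding attribute and does not handle repeated column names), and `annotate_row` is a hypothetical helper, not part of the ISA tools.

```python
QUALIFIERS = {"Unit", "Term Source REF", "Term Accession Number"}


def annotate_row(header, values):
    """Attach each qualifier column to the attribute column that precedes it."""
    annotated = {}
    current = None
    for column, value in zip(header, values):
        if column in QUALIFIERS:
            if current is None:
                raise ValueError(f"{column!r} has no preceding column to qualify")
            annotated[current][column] = value
        else:
            current = column
            annotated[current] = {"value": value}
    return annotated


header = ["Source Name", "Characteristics[organism]", "Term Source REF",
          "Term Accession Number", "Characteristics[duration]", "Unit"]
values = ["individual1", "Homo sapiens", "NCBITax", "9606", "12", "week"]
row = annotate_row(header, values)
print(row["Characteristics[organism]"])
```

Swapping two columns in `header` changes which attribute a term reference is attached to, which is why the order cannot be treated as cosmetic.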
22. Parallel group design
source: http://dx.doi.org/10.1016/S1569-9056(02)00115-X; figure 1
23. Essentials about ISA syntax
Representing interventions and treatments
• expressing treatments as sets of factor levels
• example: the treatment is a tadalafil supplementation
• Factors will be ‘compound’, ‘dose’ and ‘duration’
• (what?, how much?, how long for?)
Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration]
volunteer 1 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
volunteer 2 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
volunteer 3 | Homo sapiens | treatment | placebo | 20 mg/day | 12 weeks
• Implicit column order matters but this is independent from the ISA syntax specification
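Treating a treatment as a set of factor levels means two subjects belong to the same treatment group exactly when all their Factor Value cells agree. A minimal sketch under that reading, using the three volunteers from the slide; the shortened keys (`compound`, `dose`, `duration`) and the function name `treatment_groups` are assumptions made for this example.

```python
from collections import defaultdict

FACTORS = ("compound", "dose", "duration")  # what? how much? how long for?

SUBJECTS = [
    {"Source Name": "volunteer 1", "compound": "tadalafil",
     "dose": "250 mg/day", "duration": "12 weeks"},
    {"Source Name": "volunteer 2", "compound": "tadalafil",
     "dose": "250 mg/day", "duration": "12 weeks"},
    {"Source Name": "volunteer 3", "compound": "placebo",
     "dose": "20 mg/day", "duration": "12 weeks"},
]


def treatment_groups(rows, factors=FACTORS):
    """Group subjects by their tuple of factor levels."""
    groups = defaultdict(list)
    for row in rows:
        treatment = tuple(row[name] for name in factors)
        groups[treatment].append(row["Source Name"])
    return dict(groups)


groups = treatment_groups(SUBJECTS)
for treatment, members in groups.items():
    print(treatment, members)
```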
24. Cross-over design
source: Roberts et al. Journal of the International Society of Sports Nutrition 2007 4:25 doi:10.1186/1550-2783-4-25
34. ISA configurations
Available from:
http://isa-tools.org/configurations.html
https://github.com/ISA-tools/Configuration-Files
• Assembling workflow archetypes
• Setting annotation requirements
–for compliance with database schemas (SRA, MAGE, PRIDE)
–for compliance with community based requirements (MIAME,
MIAPE, MIMS, MIxS, …)
• Guide users
–Provide pre-assembled templates
–Specify vocabulary support
ISAconfigurator: Supporting tool
https://github.com/ISA-tools/ISAconfigurator
35. ISA configurations
Available from:
http://isa-tools.org/configurations.html
https://github.com/ISA-tools/Configuration-Files
• Minimum information about any (x) sequence (MIxS) Guidelines as
issued by Genomic Standards Consortium
• ENA-GSC-MIxS checklist XML document:
–based on MIxS guidelines
–augmented with a number of regular expressions to further validate/
regularize input
–fixing a number of units used to report measurement
–issued July 2013 (version 3.0), July 2014 (version 4.0)
• SRA 1.5 schema requirements (mandatory information and required
terminology, e.g. Library Selection or Library Strategy)
• All this information is used to derive ISA MIxS configurations allowing all
those annotation requirements to be embedded in spreadsheet tables
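The idea of validating spreadsheet input with regular expressions taken from a checklist can be sketched as follows. The field names and patterns here are hypothetical stand-ins chosen for illustration; they are not copied from the ENA-GSC-MIxS checklist or from any ISA configuration.

```python
import re

# Hypothetical checklist: field name -> pattern the whole value must match.
CHECKLIST = {
    "collection date": re.compile(r"\d{4}(-\d{2}(-\d{2})?)?"),  # YYYY[-MM[-DD]]
    "depth": re.compile(r"\d+(\.\d+)?"),  # plain number, unit fixed elsewhere
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field_name, pattern in CHECKLIST.items():
        value = record.get(field_name)
        if value is None:
            problems.append(f"{field_name}: missing")
        elif not pattern.fullmatch(value):
            problems.append(f"{field_name}: {value!r} does not match the expected format")
    return problems


good = validate_record({"collection date": "2014-09-15", "depth": "5.5"})
bad = validate_record({"collection date": "15/09/2014"})
print(good)
print(bad)
```

Using `fullmatch` rather than `search` means a value is rejected if it merely contains a valid fragment.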
38. Things to bear in mind with NGS data
Important considerations for managing data and submitting to public repositories
–be aware of supported file formats
• FASTA, FASTQ, SFF, ...
–be aware of the need to demultiplex reads
–the SRA schema evolves and updates are needed
• e.g. Study replaced by Project
• Updates to the ISAconverter
• Mapping from ISA is straightforward, as the schema brings in a number of elements that ISA already supported
40. ISAcreator
Java desktop application
Developed to be a user-friendly way to enter standards-compliant metadata: it has lots of features...
But these are just some of them… we also have a data entry wizard and an import utility...
42. ISAcreator Wizard: automatic template generation
Prerequisites and Conditions of use:
-supports factorial design experiments, meaning sets of discrete factor levels combined together to define a treatment
2x2 factorial design as in 2 compounds and 2 time points
2x2x3 factorial design as in 2 compounds, 2 time points, 3 doses
-assumes one sample collection event (all samples collected at sacrifice time)
-supports some but not all currently available assay types
-supports fractional factorial design
-supports unbalanced factor group population sizes (ethical considerations for high dose toxic exposures)
-automatically generates sample identifiers, human-readable and meaningful labels and, if requested, barcodes
-does not support ‘crossover designs’, which have to be coded manually
-does not support sample collection timeline management (under development)
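A full factorial design is the Cartesian product of the factor levels, so a 2x2x3 design yields 2 × 2 × 3 = 12 treatment groups. The sketch below enumerates them with `itertools.product`; it illustrates the concept only and is not how the ISAcreator wizard is implemented. The factor names and level labels are hypothetical.

```python
from itertools import product


def factorial_design(factors):
    """Enumerate every combination of factor levels as one treatment group."""
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in product(*(factors[name] for name in names))
    ]


design = factorial_design({
    "compound": ["compound A", "compound B"],
    "time point": ["24 h", "48 h"],
    "dose": ["low", "medium", "high"],
})
print(len(design))
print(design[0])
```

A fractional factorial design would keep only a chosen subset of these combinations.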
44. ISAcreator features: visualizing experimental workflows
Work completed while investigating a new approach to glyph design that uses a taxonomy for guidance. See Maguire et al., Taxonomy-Based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments, IEEE Transactions on Visualization and Computer Graphics, 2012
45. OntoMaton: a BioPortal powered
Ontology widget for Google Spreadsheets
Maguire et al, 2013, Bioinformatics
Tools for creating ISA-Tab documents
http://www.slideshare.net/proccaserra/ontomaton-icbo2013alternative-ordertwv3
http://isatools.wordpress.com/2012/07/13/introducing-ontomaton-ontology-search-tagging-for-google-spreadsheets/
46. Potential Issues and known hurdles
• The problem of conflicting versions
–especially high when working with big consortia
–distributed, decentralised groups of users
• Lack of version control and history
• Absence of collaborative features
–Looking for new solutions while retaining the features
60. Pre-requirements:
– registration to ENA/EBI Metagenomics
– data upload by one of the methods provided by ENA
http://www.ebi.ac.uk/ena/about/sra_data_upload
73. • New open-access, online-only publication for descriptions of scientifically valuable datasets
• Only content type: Data Descriptor, narrative + structured parts
• Initially focused on the life, environmental and biomedical sciences
• Data Descriptor will be complementary to traditional research journals and data repositories
• Designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery
www.nature.com/scientificdata
74. Data Descriptors served by Scientific Data
Narrative Section
A brief article-like document with:
•Title
•Abstract
•Background & Summary
•Methods
•Technical Validation
•Usage Notes
•Figures & Tables
•References
Structured Section
Detailed descriptions of the experimental procedures used to produce the data
•Following community-defined minimum information requirements
• for a level of detail sufficient to reproduce the experiments
•Using ontologies & controlled vocabularies
• to maximise consistency of the descriptions
77. http://isa-tools.org/training.html
Hands-on Material
• Software:
–ISAcreator 1.7.8 (see pre-release)
–ISAconfigurator 1.6
• Configurations:
–ISA-ENA-MIxS Configuration
–default MultiAssay Configuration
• ISA-Tab formatted datasets
–BII-S-3: Western Channel Water Samples metagenome and metatranscriptome
–BII-S-7: Human gut microbiome targeted gene survey
• Google Templates and OntoMaton
• ISA mapping file
82. Thanks for your attention!
Questions?
You can email us...
isatools@googlegroups.com
View our websites
View our Git repo & contribute
http://github.com/ISA-tools
View our blog
http://isatools.wordpress.com
Follow us on Twitter
@isatools