1. Bio-GraphIIn: a graph-based,
integrative and semantically enabled
repository for life science
experimental data
Alejandra González-Beltrán, PhD
Oxford e-Research Centre, University of Oxford
alejandra.gonzalezbeltran@oerc.ox.ac.uk
@alegonbel
NETTAB 2013
October 16-18, 2013
Venice Lido, Italy
5. Experimental workflow
[Figure: experimental workflow cycle. Labels: Planning; Perform new experiment; Use existing data; Data Collection; Data Management; Analysis; Visualization; Publication; Data; Scientist; Data Reusability]
6. Outline
• Motivation for an integrative and semantically-enabled metadata repository in life sciences
  • retrospective data submissions
  • heterogeneous experimental data
  • fragmentation of formats and databases
  • semantic queries leading to integrative analysis
• Context: the ISA infrastructure
• Bio-GraphIIn requirements
• Bio-GraphIIn design & architecture
• Bio-GraphIIn graph queries
• Bio-GraphIIn prototype
• Summary and future work
8. Motivation 1/4
retrospective data submissions
[Figure: experimental workflow cycle. Labels: Planning; Perform new experiment; Use existing data; Data Collection; Data Management; Analysis; Visualization; Publication; Data; Scientist]
9. Motivation 1/4
retrospective data submissions
[Figure: experimental workflow cycle (Planning, Data Collection, Data Management, Analysis, Visualization, Publication), annotated with the labels "metadata" and "retrospective"]
10. Motivation 1/4
retrospective data submissions
[Figure: experimental workflow cycle (Planning, Data Collection, Data Management, Analysis, Visualization, Publication), annotated with the labels "metadata" and "retrospective"]
Metadata edits to repositories are not straightforward, often requiring deleting the submission and re-submitting the data
11. Motivation 1/4
prospective
retrospective data submissions
[Figure: experimental workflow cycle (Planning, Data Collection, Data Management, Analysis, Visualization, Publication), with "metadata" labels throughout the cycle]
12. Motivation 1/4
prospective
retrospective data submissions
[Figure: experimental workflow cycle (Planning, Data Collection, Data Management, Analysis, Visualization, Publication), with "metadata" labels throughout the cycle]
Support incremental data deposition + metadata edits
15. Motivation 4/4
semantic queries leading to integrative analysis
• support for a rich and uniform query interface across studies, enabling integrative data analysis to provide new insights at the systems biology level
  • e.g. find all data files associated with samples from a particular organism (e.g. Homo sapiens) and a particular tissue type (e.g. liver)
• allow selecting a set of samples/data files through browsing and semantic filtering
• provide links to analysis and visualisation platforms
[Figure: life science experiments repository, linked to Visualization and Analysis]
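The example query on this slide (data files for samples of a given organism and tissue type) can be sketched in plain Python. This is only an illustration of the kind of cross-study filter being asked for, not Bio-GraphIIn's query interface: the `Sample` record, its field names, the helper `find_data_files` and all sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """Hypothetical sample record; the field names are illustrative assumptions."""
    name: str
    organism: str
    tissue: str
    data_files: List[str] = field(default_factory=list)


def find_data_files(samples, organism, tissue):
    """Collect data files of samples matching organism and tissue (case-insensitive)."""
    matches = []
    for sample in samples:
        same_organism = sample.organism.lower() == organism.lower()
        same_tissue = sample.tissue.lower() == tissue.lower()
        if same_organism and same_tissue:
            matches.extend(sample.data_files)
    return matches


# Made-up records spanning two hypothetical studies.
SAMPLES = [
    Sample("studyA.s1", "Homo sapiens", "liver", ["a-s1.cel"]),
    Sample("studyA.s2", "Homo sapiens", "kidney", ["a-s2.cel"]),
    Sample("studyB.s1", "Mus musculus", "liver", ["b-s1.raw"]),
    Sample("studyB.s2", "Homo sapiens", "liver", ["b-s2.raw", "b-s2.mzML"]),
]

liver_files = find_data_files(SAMPLES, "Homo sapiens", "liver")
```

The point of the sketch is that one filter works across studies only because every record exposes organism and tissue in the same way, which is what a uniform query interface would have to guarantee.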
17. The Investigation/Study/Assay (ISA) infrastructure
generic format for experimental
description and data exchange
community engagement
open source software tools
18. investigation
investigation: high-level concept to link related studies
study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.
assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
data: assays hold pointers to data file names/locations; the data are external files in native or other formats
Application domains and communities:
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• communities working to build a library of cellular signatures
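The investigation/study/assay hierarchy described on this slide can be sketched as nested Python dataclasses. This is a simplified illustration, not the ISA-TAB format or any ISA tool's API: the class and field names are assumptions, and the measurement name is made up. The subject and file names are borrowed from the workflow example used later in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    """A test producing measurements; stores only pointers to external data files."""
    measurement: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """Central unit: subjects, their characteristics, treatments, and associated assays."""
    subjects: Dict[str, Dict[str, str]]
    treatments: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level concept linking related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self):
        """Walk investigation -> study -> assay and collect the data file pointers."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


investigation = Investigation(
    title="example investigation (hypothetical)",
    studies=[
        Study(
            subjects={
                "H1": {"organism": "H. Sapiens", "age": "35 Years"},
                "H2": {"organism": "H. Sapiens", "age": "33 Years"},
            },
            assays=[Assay("example measurement", ["h1-s1.cel", "h1-s2.cel", "h2-s1.cel"])],
        )
    ],
)
```

Keeping only file names in `Assay.data_files` mirrors the slide's point that the metadata layer points to data files rather than embedding them.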
19. Experimental workflow - graph representation
[Figure: workflow graph. Sources H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years); samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps yielding H1.sample1.labeled and H2.sample1.labeled; Scanning steps yielding the data files h1-s1.cel, h1-s2.cel and h2-s1.cel]
20. Experimental workflow - graph representation
[Figure: workflow graph. Sources H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years); samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps yielding H1.sample1.labeled and H2.sample1.labeled; Scanning steps yielding the data files h1-s1.cel, h1-s2.cel and h2-s1.cel]
Spreadsheets for end-users
[Table: the same graph flattened into a spreadsheet, one row per sample (H1.sample1, H1.sample2, H2.sample1), repeating the source (H1 or H2), organism (H. Sapiens) and age (35 or 33 Years) and listing the Labeling and Scanning steps, the labeled materials and the .cel data files]
vocabulary for the description of the experimental workflow
syntactic interoperability
across biological experiments of different types
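The relation between the workflow graph and the end-user spreadsheet can be sketched as a path enumeration: each path from a source to a data file becomes one row. This is a minimal illustration, not the ISA tools' implementation. The edge wiring below is an assumption read off the node names on the slide (in particular that H1.sample2 is scanned without a labeling step), and the `#n` suffixes are a made-up device to keep repeated process names unique.

```python
from typing import Dict, List

# Assumed child links of the workflow graph; "#n" keeps process nodes distinct.
EDGES: Dict[str, List[str]] = {
    "H1": ["H1.sample1", "H1.sample2"],
    "H2": ["H2.sample1"],
    "H1.sample1": ["Labeling#1"],
    "Labeling#1": ["H1.sample1.labeled"],
    "H1.sample1.labeled": ["Scanning#1"],
    "Scanning#1": ["h1-s1.cel"],
    "H1.sample2": ["Scanning#2"],
    "Scanning#2": ["h1-s2.cel"],
    "H2.sample1": ["Labeling#2"],
    "Labeling#2": ["H2.sample1.labeled"],
    "H2.sample1.labeled": ["Scanning#3"],
    "Scanning#3": ["h2-s1.cel"],
}


def paths_from(node, edges):
    """Return every path from `node` down to a leaf (a node without children)."""
    children = edges.get(node, [])
    if not children:
        return [[node]]
    return [[node] + rest for child in children for rest in paths_from(child, edges)]


def to_rows(roots, edges):
    """Flatten the graph: one spreadsheet row per root-to-leaf path."""
    rows = []
    for root in roots:
        for path in paths_from(root, edges):
            # Drop the "#n" suffix so process cells read "Labeling" / "Scanning".
            rows.append([name.split("#")[0] for name in path])
    return rows


ROWS = to_rows(["H1", "H2"], EDGES)
for row in ROWS:
    print("\t".join(row))
```

Because H1 has two samples, it appears in two rows, which matches the repeated H1 entries in the spreadsheet view.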
22. Machine-readable representation
Graph + Semantics
[Figure: semantic graph of the example workflow. Nodes: H1, H1.sample1, H1.sample2, labeling1, labeling2, H1.sample1.labeled, H1.sample2.labeled, scanning1, scanning2, h1-s1.cel, h1-s2.cel, labeling protocol. Classes: obi:material entity, obi:material sample, tax:homo sapiens, obi:material processing, obi:processed material, obi:planned process, obi:protocol, isa:raw data file. Edge labels: obi:is_specified_input_of, obi:is_specified_output_of, bfo:derives_from, isa:executes]
semantic interoperability
across biological experiments of different types
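The "Graph + Semantics" idea can be sketched with plain subject/predicate/object triples in Python. This is an illustration under stated assumptions, not the ISA2OWL mapping itself: the predicate names are taken from the slide, but the direction and wiring of each edge are guesses from the node names, `rdf:type` is added here to mark raw data files, and the helper functions are hypothetical.

```python
# Assumed triples for source H1; edge directions are guesses from the slide.
TRIPLES = [
    ("H1.sample1", "bfo:derives_from", "H1"),
    ("H1.sample2", "bfo:derives_from", "H1"),
    ("H1.sample1", "obi:is_specified_input_of", "labeling1"),
    ("H1.sample1.labeled", "obi:is_specified_output_of", "labeling1"),
    ("H1.sample1.labeled", "obi:is_specified_input_of", "scanning1"),
    ("h1-s1.cel", "obi:is_specified_output_of", "scanning1"),
    ("H1.sample2", "obi:is_specified_input_of", "labeling2"),
    ("H1.sample2.labeled", "obi:is_specified_output_of", "labeling2"),
    ("H1.sample2.labeled", "obi:is_specified_input_of", "scanning2"),
    ("h1-s2.cel", "obi:is_specified_output_of", "scanning2"),
    ("labeling1", "isa:executes", "labeling protocol"),
    ("labeling2", "isa:executes", "labeling protocol"),
    ("h1-s1.cel", "rdf:type", "isa:raw data file"),
    ("h1-s2.cel", "rdf:type", "isa:raw data file"),
]


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def downstream(triples, material):
    """Outputs of every process that takes `material` as a specified input."""
    outputs = []
    for process in objects(triples, material, "obi:is_specified_input_of"):
        outputs.extend(subjects(triples, "obi:is_specified_output_of", process))
    return outputs


def raw_files_for_source(triples, source):
    """Breadth-first walk from a source's samples to the raw data files."""
    files = []
    frontier = subjects(triples, "bfo:derives_from", source)
    while frontier:
        node = frontier.pop(0)
        if "isa:raw data file" in objects(triples, node, "rdf:type"):
            files.append(node)
        frontier.extend(downstream(triples, node))
    return files


H1_FILES = raw_files_for_source(TRIPLES, "H1")
```

The traversal never mentions labeling or scanning by name; it only follows the generic input/output relations, which is what lets the same query run over experiments of different types.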
27. Bio-GraphIIn Requirements
• support prospective annotation of experiments
  • support Create Read Update Delete (CRUD) operations
• manage heterogeneous biological and biomedical metadata
  • relying on ISA-TAB
• support data integration & semantic queries
  • relying on ISA2OWL
• links to analysis and visualisation platforms
• take advantage of experimental design information, improving metadata such as including study groups
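The first requirement (prospective annotation backed by CRUD operations) can be sketched with a small in-memory store. This is a hypothetical stand-in for illustration only, not Bio-GraphIIn's design or API: the class `MetadataStore`, its method names, the record id and the metadata values are all made up.

```python
class MetadataStore:
    """In-memory stand-in for a metadata repository (illustrative assumption)."""

    def __init__(self):
        self._records = {}

    def create(self, record_id, metadata):
        """Register a new record; refuses to overwrite an existing one."""
        if record_id in self._records:
            raise KeyError(f"record already exists: {record_id}")
        self._records[record_id] = dict(metadata)

    def read(self, record_id):
        """Return a copy so callers cannot mutate the stored record."""
        return dict(self._records[record_id])

    def update(self, record_id, **changes):
        """Edit metadata in place; no delete-and-resubmit cycle is needed."""
        self._records[record_id].update(changes)

    def delete(self, record_id):
        del self._records[record_id]

    def exists(self, record_id):
        return record_id in self._records


# Prospective annotation: metadata accumulates as the experiment progresses.
store = MetadataStore()
store.create("study-1", {"organism": "Homo sapiens"})   # at planning time
store.update("study-1", tissue="liver")                  # after data collection
store.update("study-1", data_files=["h1-s1.cel"])        # after the assay
snapshot = store.read("study-1")
```

The `update` step is the part that existing repositories make hard, according to the motivation slides: here an edit changes the record in place instead of forcing the submission to be deleted and re-submitted.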
28. Functionality provided by existing repositories
& Bio-GraphIIn requirements
Repository | Data Types | Format | Browsing/Searching | Programmatic submission | Programmatic access | CRUD operations | Community curation | RDF
BioSample DB | sample info | SampleTAB | browse/search | X (email submission) | REST API | X | X | YES
ArrayExpress/GEO | sequencing | MAGE-TAB | browse/filter/search/advanced search | MAGE-TAB spreadsheet/MIAMExpress | REST API | X | X | X*
SRA/ENA | next generation sequencing | SRA-XML | browse/text/sequence/advanced search | Webin, REST | REST API | X | X | X
PRIDE | mass spectrometry | PRIDE-ML | PRIDE inspector/PRIDE Biomart | X (FTP upload) | Java API | X | X | X
BII | All | ISA-TAB | browse/text search/filtering | X | SOAP web services | X | X | X
Bio-GraphIIn | All | ISA-TAB | browse/filter/search/advanced search | YES (upload, REST) | REST API | YES | YES | YES
*We are referring to the ArrayExpress repository, not to the Expression Atlas, which is available in RDF
29. Functionality provided by existing repositories
& Bio-GraphIIn requirements
[Table: same repository comparison as on the previous slide, with "prototype" labels (including "browsing prototype") overlaid on it]
32. semantic representation of the graph,
rich queries over a common
semantic framework, enabling
integration with other repositories
59. Summary and future work
• Bio-GraphIIn - the new integrative and semantically-enabled repository for the ISA infrastructure: motivation, requirements, design & architecture, prototype
• Support for data integration, uniform semantic queries across experiments enabled by a common semantic framework (ISA2OWL)
• More work required on
  • Querying: performance analysis, support for ad hoc queries
  • Extension/improvement of prototype
  • Interfaces to services (e.g. BioPortal) and analysis/visualisation platforms (e.g. R/Bioconductor & Refinery)
61. Thanks for your attention!
Questions?
You can email us...
isatools@googlegroups.com
View our website
http://www.isa-tools.org
View our Git repo & contribute
http://github.com/ISA-tools
View our blog
http://isatools.wordpress.com
Follow us on Twitter
@isatools