Delivering reproducible bioscience data by enabling biocuration at the source
1. www.slideshare.net/SusannaSansone
Delivering reproducible bioscience data by enabling
biocuration at the source
Susanna-Assunta Sansone, PhD
Principal Investigator and Team Leader,
University of Oxford e-Research Centre, Oxford, UK
Academic Consultant, Open Access Data Products,
Nature Publishing Group
Digital Curation Centre (DCC)
13th Regional Data Management Roadshow, London, 20 November 2012
3. University of Oxford e-Research Centre
• Providing research computing, high-performance computing
• Integrating with national and international infrastructure
• Supporting leading-edge facilities through education and training
4. University of Oxford e-Research Centre
Collaborating with European and wider
international groups in, e.g.:
• energy,
• radio astronomy,
• biological data federation,
• life sciences simulation,
• biodiversity,
• computational chemistry,
• neuroscience,
• digital humanities tools,
• digital music analysis
• visualization
• …
5. My team’s activities and stakeholders we work with
data management and biocuration, collaborative development of software
and database, standards and ontology
• environmental genomics • stem cell discovery
• metabolomics • systems biology
• metagenomics • transcriptomics
• nanotechnology • toxicogenomics
• proteomics • environmental health
6. Outline
“The buzz around reproducible bioscience data:
the communities and the standards”
“The reality from the buzz:
challenges and exemplar project”
8. COMPREHENSIBLE
http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY
9. COMPREHENSIBLE, INTEROPERABLE
http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY
10. COMPREHENSIBLE, INTEROPERABLE, REUSABLE
http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY
11. experimental design
sample characteristic(s)
experimental variable(s)
technology(ies)
measurement(s)
protocol(s)
data file(s)
......
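The elements listed on this slide can be pictured as the fields of a single structured record. The following is a minimal sketch in Python. The class name, field names and example values are hypothetical, chosen only to mirror the list above, and are not taken from any particular standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentDescription:
    """Hypothetical record mirroring the elements listed on the slide."""
    experimental_design: str
    sample_characteristics: Dict[str, str] = field(default_factory=dict)
    experimental_variables: Dict[str, str] = field(default_factory=dict)
    technologies: List[str] = field(default_factory=list)
    measurements: List[str] = field(default_factory=list)
    protocols: List[str] = field(default_factory=list)
    data_files: List[str] = field(default_factory=list)

    def missing_elements(self) -> List[str]:
        """Return the names of elements that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only.
record = ExperimentDescription(
    experimental_design="dose response",
    technologies=["mass spectrometry"],
    data_files=["run01.raw"],
)
print(record.missing_elements())
```

Listing the empty elements makes gaps in a description visible before the data leave the lab.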
11 The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone
www.ebi.ac.uk/net-project
12. § We must strike a balance between
• depth and breadth of information; and
• sufficient information required to reuse the data
§ Capture all salient features of the experimental workflow
§ Make annotation explicit and discoverable
§ Structure the descriptions for consistency, tracking
13. Growing, worldwide movement for reproducible research
• esoteric formats
• lack of sufficient contextual information
• ad hoc or proprietary terminologies
comprehensible? interoperable? reusable?
Source: http://ebbailey.wordpress.com
§ Researchers and bioinformaticians in both academic and commercial
science, along with funding agencies and publishers, embrace the
concept that community-developed standards are pivotal to structure
and enrich the annotation of
• entities of interest (e.g., genes, metabolites, phenotypes) and
• experimental steps (e.g., provenance of study materials,
technology and measurement types)
14. Community mobilization to develop standards, e.g.:
• allow data to flow from one system to another
• use the same word and refer to the same ‘thing’
• report the same core, essential information
15. Is this general mobilization good or bad?
• allow data to flow from one system to another
• use the same word and refer to the same ‘thing’
• report the same core, essential information
§ Fragmentation of the standards is a major issue
• A focus on particular communities’ interests, be they individual
technologies or biological/biomedical disciplines, leads to duplication of effort
and, more seriously, to the development of (largely arbitrarily) different standards
• This severely hinders the interoperability of databases and tools and ultimately
the integration of datasets
18. But how much do we know about these standards?
• Which tools and databases implement which standards?
• I use high-throughput sequencing technologies; which ones are applicable to me?
• How can I get involved to propose extensions or modifications?
• What are the criteria to evaluate their status and value?
• Which ones are mature enough for me to use or recommend?
• I work on plants; are these just for biomedical applications?
19. A catalogue to map the landscape of standards and the systems implementing them:
Over 400 bio-standards (public and in curation)
Field*, Sansone* et al., Omics data sharing. Science 326, 234-36 (2009) doi:10.1126/science.1180598
20. • A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
22. Social engineering
23. Ownership of open standards can be problematic in broad,
grass-roots collaborations; it requires improved models to
encourage maintenance of and contributions to these efforts,
supporting their evolution
24. The extensive community
liaison needs to be managed
and funded; rewards and
incentives need to be identified
for all contributors
25. The cost of implementing a
standards-supported data
sharing vision is as large as the
number of stakeholders that
must operate synchronously
32. reasoning, visualization, analysis, browsing, integration, exchange, retrieval
Community Standards; Software Tools
Well-annotated & Structured Data
Reproducible & Reusable Bioscience Research
33. An exemplar approach to the status quo
§ A grass-roots collaborative that works to facilitate collection, curation
and sharing of experiments using a common, structured representation
of the experiments that
• transcends individual biological and technological domains and
• can be ‘configured’ to implement (several of) the community
standards
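One way to picture a representation that can be ‘configured’ for several community standards is a generic record checked against interchangeable lists of required fields. The sketch below is in Python and is illustrative only. The configuration names and field names are hypothetical and do not reproduce any real checklist or the framework's own configuration format.

```python
from typing import Dict, List

# Hypothetical configurations: each lists the fields one community
# checklist might require. None of these reflect a real standard.
CONFIGURATIONS: Dict[str, List[str]] = {
    "transcriptomics_example": ["organism", "array_design", "protocol"],
    "metabolomics_example": ["organism", "instrument", "protocol"],
}


def missing_fields(record: Dict[str, str], configuration: str) -> List[str]:
    """Return required fields that are absent or empty in the record."""
    required = CONFIGURATIONS[configuration]
    return [name for name in required if not record.get(name)]


# The same generic record, checked against two configurations.
record = {
    "organism": "Homo sapiens",
    "protocol": "extraction v2",
    "instrument": "LC-MS",
}
print(missing_fields(record, "metabolomics_example"))
print(missing_fields(record, "transcriptomics_example"))
```

Keeping the requirements in data rather than in code means one representation can serve several domains by swapping the configuration.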
34. metadata tracking framework
user community
36. A growing ecosystem of over 30 public and internal resources using the
ISA metadata tracking framework to facilitate standards-compliant
collection, curation, management and reuse of investigations in an
increasingly diverse set of life science domains, including:
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• also by communities working to build a library of cellular signatures
TOWARDS INTEROPERABLE BIOSCIENCE DATA Feb 2012
Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W,
Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA,
Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo C, Forster M, Gaudet P, Gilbert J,
Goble C, Griffin J, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S,
Marshall S, Merrill E, McGrath A, Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A,
Williams-Jones B, Wolstencroft K, Xenarios J, Hide W.
40. Implementation at the EBI
41. Extensions of the
Nanotechnology
Informatics Working Group
42. We must increase the level of annotation
• Notes in Lab Books (information for humans)
• Spreadsheets and Tables (the compromise)
• Facts as RDF statements (information for machines)
§ Invest in curating and managing data at the source:
• use a common metadata tracking framework, such as ISA
• use publicly available and community-developed terminologies
• record sufficient contextual information about the experimental steps
§ Progressively, datasets will become more comprehensible, interoperable,
reproducible and (re)usable, underpinning future investigations
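The step from tables to machine-readable facts can be sketched as turning each cell of a tab-delimited sheet into a subject-predicate-object statement. The following Python sketch uses only the standard library. The namespace, column names and values are hypothetical, and the output is a list of plain tuples rather than a real RDF serialization.

```python
import csv
import io
from typing import List, Tuple

BASE = "http://example.org/"  # hypothetical namespace, for illustration only

# A tiny tab-delimited table with made-up columns and values.
TABLE = "sample\torganism\tprotocol\ns1\tHomo sapiens\textraction v2\n"


def rows_to_triples(tsv_text: str, key_column: str) -> List[Tuple[str, str, str]]:
    """Turn each non-empty cell into a (subject, predicate, object) tuple."""
    triples = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        subject = BASE + row[key_column]
        for column, value in row.items():
            if column != key_column and value:
                triples.append((subject, BASE + column, value))
    return triples


triples = rows_to_triples(TABLE, "sample")
for triple in triples:
    print(triple)
```

In practice the predicates would come from community-developed terminologies rather than from raw column headers, which is where curation at the source pays off.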
43. Collaborative approaches are highly valuable but take time
Development timeline, 2007-2012
§ Community involvement and uptake
• 1st ISA-Tab workshop
• 2nd ISA-Tab workshop
• 3rd ISA-Tab workshop
• Other tools implement ISA-Tab
• User workshops/visits start
• 1st public instance: Harvard Stem Cell Discovery Engine
• Growing number of systems start to adopt the ISA framework
§ Core developments
• Strawman ISA-Tab spec
• Final ISA-Tab spec
• ISA software v1
• Database instance at EBI
• Conversions to Pride-XML/SRA-XML/MAGE-Tab and more
• RDF format starts
• Links to analysis tools start
§ Publications
• Workshop reports
• ISA-Tab and Omics data sharing (Science)
• ISA software suite (Bioinformatics)
• Stem Cell Discovery Engine (NAR)
• ISA Commons (Nature Genetics)