Susanna-Assunta Sansone: An Overview of the Evolving Portfolio of Data Sharing Enablers: BioSharing
1. Policies and standards for reproducible research:
from theory to practice
§ How do we make a standards-compliant data sharing culture
functional and efficient?
• Several data management, sharing policies and plans have
emerged; the number of data journals is growing and guidelines to
authors for reporting data are being enriched; there are thousands of
biological databases and a wealth of community standards
• Although funders, journal editors, data producers, consumers
and service providers agree in principle that shared, annotated
research data and methods offer new discovery opportunities,
compliance is challenging in practice
§ Starting from the genomics domain and extending to other areas
of the life sciences, we are looking to highlight the success stories and
existing problems
2. About this session - speakers
§ Representatives from stakeholders involved in the complete data cycle
• from funding and regulation, to production, release and re-use
§ Setting the scene:
• Susanna-Assunta Sansone, University of Oxford, UK
• Scott Edmunds, GigaScience BGI Shenzhen, China
§ Funders
• Rita Colwell, University of Maryland, USA
• Paula J. Olsiewski, Sloan Foundation
§ Service providers and/or data producers
• Philippe Rocca-Serra, University of Oxford, UK
• Folker Meyer, Argonne National Laboratory, USA
• Srikrishna Subramanian, IMTECH, India
§ Editors
• Clare Garvey, Genome Biology/BioMed Central
• Craig Mak, Nature Biotechnology
3. About this session - topics
§ Data management, preservation and sharing policies – viewpoints
• formulation and enforcement, or
• uptake and compliance
§ Reporting standards – experiences and challenges
• evolution of standards, costs of compliance, rewards for complying,
etc.
• usability of standards when working across disciplines, which all
have differing community norms
• challenges in integrating data types and how standards can help
§ Tackling the challenges – approaches and lessons learned
• balance needs and expectations (data producers, consumers,
reviewers, service providers etc.)
• potential role of each stakeholder
• new ways forward
4. The evolving portfolio of data sharing enablers
Susanna-Assunta Sansone, PhD
University of Oxford,
Oxford e-Research Centre, Oxford, UK
http://uk.linkedin.com/in/sasansone
GSC13th, Shenzhen, China, March 5-7, 2012
5. From reusable data to reproducible research
To make the datasets comprehensible, interoperable and reusable,
underpinning future investigations, we need common ways to report and
share the experimental details and the associated results.
Consistent reporting will have a positive and long-lasting impact on the value
of collective scientific outputs.
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone
www.ebi.ac.uk/net-project
6. A ‘general mobilization’ to develop standards, e.g.:
• allow data to flow from one system to another
• use the same word and refer to the same ‘thing’
• report the same core, essential information
7. A ‘general mobilization’ to develop standards… BUT
§ Fragmentation of the standards is a major issue!
• Being focused on particular communities’ interests, be they individual technologies
or biological/biomedical disciplines, leads to duplication of effort, and more
seriously, the development of (largely arbitrarily) different standards
• This severely hinders the interoperability of databases and tools and ultimately the
integration of datasets
10. But how much do we know about these standards?
MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, SRAxml, OBI, MIRIAM, VO, SOFT, MIQAS,
FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN, GELML, MOD, SBRML, MIAPE, MIQE,
TEDDY, MITAB, MzML, XAO, CIMR, CONSORT, BTO, ISA-Tab, SEDML, DO, PRO, IDO, MIASE,
MISFISHIE…
11. But how much do we know about these standards?
• Which tools and databases implement which standards?
• I use high throughput sequencing technologies, which ones are applicable to me?
• How can I get involved to propose extensions or modifications?
• What are the criteria to evaluate their status and value?
• Which ones are mature enough for me to use or recommend?
• I work on plants, are these just for biomedical applications?
12. Often not much…
Several policy documents and guidelines are inconsistent and/or
unclear when recommending use of standards, e.g.:
“..recommend use of appropriate standards...where these exists…....mature,
stable efforts....MIAME format…..standards from accredited standards
organizations…..deposition to public repositories, supporting these
standards…...”
16. A coherent, curated and searchable catalogue of data sharing resources that
(collaboratively) works to:
1. Centralize community-developed bioscience standards and make them
discoverable; linking to:
• data sharing, preservation and management policies
• other portals e.g. MIBBI, NCBO’s BioPortal, NIF, BioSiteMaps, OBO foundry
• related open access, published material e.g. BioMedCentral, Nature Precedings, F1000
• tools and databases implementing the standards e.g. collaboration with NAR Database
2. Identify and maintain a set of (implicit) criteria for assessing usability and
popularity of the standards, including:
• implementations by tools and databases
• availability of standards-compliant, public datasets
• relations among standards
3. Foster communication among groups, in particular to:
• address overlaps and duplication of efforts and enhance interoperability of standards
• produce ‘best practice’ guidelines starting new, or contributing to existing efforts
Ø Will allow stakeholders (funders, journals, service providers and
researchers) to make informed decisions on standards
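As a rough illustration of the catalogue idea on this slide, the sketch below models one catalogue record and the three assessment criteria (implementations, compliant public datasets, relations among standards) as plain counts. Everything in it is an assumption made for illustration: the field names, the `search` and `usability_indicators` helpers, and the placeholder entries `STD-A` and `STD-B` are hypothetical and do not describe the actual BioSharing data model.

```python
from dataclasses import dataclass, field


@dataclass
class StandardEntry:
    """One catalogue record; every field name here is a hypothetical choice."""
    name: str
    domain: str
    policies: list = field(default_factory=list)         # policies that recommend it
    implementations: list = field(default_factory=list)  # tools/databases that use it
    public_datasets: int = 0                             # compliant public datasets
    related: list = field(default_factory=list)          # names of related standards


def search(catalogue, term):
    """Return entries whose name or domain contains the term (case-insensitive)."""
    needle = term.lower()
    return [e for e in catalogue
            if needle in e.name.lower() or needle in e.domain.lower()]


def usability_indicators(entry):
    """Express the slide's three criteria as simple counts for one entry."""
    return {
        "implementations": len(entry.implementations),
        "public_datasets": entry.public_datasets,
        "related_standards": len(entry.related),
    }


# Placeholder records with made-up values, only to exercise the helpers.
catalogue = [
    StandardEntry("STD-A", "genomics", implementations=["tool-x", "db-y"],
                  public_datasets=3, related=["STD-B"]),
    StandardEntry("STD-B", "metabolomics"),
]

hits = search(catalogue, "genom")
indicators = usability_indicators(catalogue[0])
```

Counts like these would only be a starting point; the slide calls the criteria "implicit", so how they are weighted is left open here.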
18. Over 400 entries
(public and in curation)
19. Smith et al, 2007
20. Smith et al, 2007
Taylor, Field, Sansone et al, 2008
21. List of databases linked to standards: a collaboration with Database Issue
24. Define groups and relations among standards
The relationship among popular standard formats for pathway information:
BioPAX and PSI-MI are designed for data exchange to and from databases and
pathway and network data integration. SBML and CellML are designed to
support mathematical simulations of biological systems and SBGN represents
pathway diagrams.
CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.
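The grouping described on this slide can be sketched as a small lookup. The purposes assigned to each format are taken from the slide text; the `PURPOSE` mapping and the `group_by_purpose` and `same_group` helpers are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict

# Purpose of each pathway format, as stated on the slide (after Demir et al., 2010).
PURPOSE = {
    "BioPAX": "data exchange and integration",
    "PSI-MI": "data exchange and integration",
    "SBML": "mathematical simulation",
    "CellML": "mathematical simulation",
    "SBGN": "pathway diagrams",
}


def group_by_purpose(purpose_by_format):
    """Invert the mapping: purpose -> list of formats designed for it."""
    groups = defaultdict(list)
    for fmt, purpose in purpose_by_format.items():
        groups[purpose].append(fmt)
    return dict(groups)


def same_group(a, b, purpose_by_format=PURPOSE):
    """True when two formats were designed for the same purpose."""
    return purpose_by_format[a] == purpose_by_format[b]


groups = group_by_purpose(PURPOSE)
```

Formats in the same group are candidates for overlap, while formats in different groups are more likely to be complementary.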
25. E.g. in the genomics context:
resources from GSC and other communities…
INSDC, GCDML, EnvO, EnvO-light, GOLD, SRAxml, MixS, MG-RAST, ISA-Tab, OBI, CAMERA,
BIOM (data matrices), SILVA, etc…
Disclaimer: draft for illustrative purpose; this is a dynamic environment, work in progress…
30. Acknowledgements:
Philippe Rocca-Serra (University of Oxford)
Eamonn Maguire (University of Oxford)
Annapaola Santarsiero (University of Oxford)
Susanna Sansone (University of Oxford)
Chris Taylor (EMBL-EBI)
Dawn Field (NERC-NEBC)
with contributions from members of our communities and
individuals.