This document discusses improving reproducibility and transparency in scholarly publishing by rewarding open sharing of data, methods, and results. It presents a case study using the ISA framework to describe experimental data and workflows, nanopublications to make assertions from results, and research objects to aggregate these scholarly artifacts. Combining these approaches could help address issues like increasing retractions by incentivizing open review, replication of analyses, and credit for sharing diverse scholarly contributions beyond publications. The document advocates extending this case study to provide community guidelines for integrated use of these models.
2. The problems with publishing
• Scholarly articles are merely advertisement of scholarship.
The actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab
and Reproducible Research, 1995)
• Core scientific statements or assertions are intertwined and
hidden in the conventional scholarly narratives
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication
4. Problem: growing replication gap
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
Out of 18 microarray papers, results
from 10 could not be reproduced
More retractions:
>15X increase in last decade
At the current rate of increase, by 2045 as many papers
would be retracted as published
5. Motivation
• Scholarly artefacts must be
– Treated as first-class objects, in scientific investigations
and in scholarly communications
– Made machine-readable for the convenience of reasoning
– Represented in an interoperable manner
• Truly “add value” to publishing
– Provide infrastructure to aid & reward replication
• Trial of ISA+Nanopublication+RO
– Three similar approaches that should complement each
other for the representation of scholarly artifacts
6. Motivation
“data* generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
“increase acceptance of research data* as legitimate, citable contributions to the scholarly record”.
Data* Citation (*and more)
8. GigaSolution: deconstructing the paper
Need to credit and reward:
•Data/software availability
•Metadata/curation
•Interoperability
•Availability of workflows
•Transparent analyses
Data
Metadata
Methods
Analyses
9. GigaSolution: deconstructing the paper
www.gigadb.org
www.gigasciencejournal.com
Combines and integrates:
• Open-access journal
• Data Publishing Platform
• Data Analysis Platform
Utilizes big-data infrastructure and expertise from:
20PB storage, 20.5K cores, 212TFlops, >1000 bioinformaticians
11. Different levels of granularity:
Experiment (e.g. Rice 10K project): e.g. doi:10.5524/100001
Datasets (e.g. species, variety): e.g. doi:10.5524/100001-2
Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
Smaller still?
Papers
Data/
Micropubs
Nanopubs
Facts/Assertions (~10^14 in literature)
Reward different shaped publishable objects
13. Submission Workflow
• Submitter logs in to GigaDB website and uploads Excel submission file
• Validation checks
– Fail: submitter is provided error report
– Pass: dataset is uploaded to GigaDB
• Curator review: curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues); DOI assigned
• Submitter provides files by ftp or Aspera
• DataCite XML file is generated and registered with DataCite
• Curator makes dataset public (can be set as future date if required)
Public GigaDB dataset, e.g. DOI 10.5524/100003: Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
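The submission workflow above can be read as a simple progression of states. The following is only a minimal sketch of that idea, not GigaDB's actual implementation: the required field names, the status labels, the function names and the placeholder DOI are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical required columns of the submission spreadsheet (assumption).
REQUIRED_FIELDS = ("title", "description", "submitter_email")


@dataclass
class Submission:
    metadata: dict
    status: str = "uploaded"
    errors: list = field(default_factory=list)
    doi: Optional[str] = None


def validate(sub):
    """Validation checks: build an error report, or let the submission pass."""
    sub.errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not sub.metadata.get(name)
    ]
    sub.status = "failed_validation" if sub.errors else "awaiting_curation"
    return sub


def curate(sub, doi):
    """Curator review: a DOI is assigned only to validated submissions."""
    if sub.status != "awaiting_curation":
        raise ValueError("only validated submissions can be curated")
    sub.doi = doi
    sub.status = "doi_assigned"
    return sub


def publish(sub):
    """Curator makes the dataset public once a DOI exists."""
    if sub.doi is None:
        raise ValueError("a DOI must be assigned before release")
    sub.status = "public"
    return sub


# Usage: one submission that passes, one that fails validation.
ok = Submission({
    "title": "Example dataset",
    "description": "Illustrative only",
    "submitter_email": "someone@example.org",
})
publish(curate(validate(ok), "10.5524/EXAMPLE"))  # placeholder DOI, not real
bad = validate(Submission({"title": "Incomplete dataset"}))
print(ok.status, bad.status, bad.errors)
```

Keeping the error report on the submission object mirrors the "Fail" branch of the slide, where the submitter receives the report and can resubmit.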
14. Reward open & transparent review
End reviewer 3 Download parody videos, now!
15. Real-time open-review = paper in arXiv + blogged reviews
Reward open & transparent review
http://tmblr.co/ZzXdssfOMJfy
www.gigasciencejournal.com/content/2/1/10
16. Cloud
solutions?
Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.
BMC Research Awards 2013
Winner of open data award
18. Implement workflows in a community-accepted format
http://galaxyproject.org
Over 36,000 main
Galaxy server users
Over 500 papers
citing Galaxy use
Over 55 Galaxy
servers deployed
Open source
20. Research Objects
An aggregation of scholarly artefacts:
• Data used or results produced in an
experimental study
• Methods employed to produce and
analyse that data
• Provenance and setting information
about the experiments
• People involved in the investigation
• Annotations about these resources, that
are essential to the understanding and
interpretation of the scientific outcomes
captured by a research object.
23. How are we supporting data
reproducibility?
Data sets
Analyses
Linked to
Linked to
DOI
DOI
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
>11000 accesses
Open-Code
8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in sourceforge under GPLv3:
http://soapdenovo2.sourceforge.net/
>5000 downloads
Enabled code to be picked apart by bloggers in wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
24. 8 referees downloaded & tested data, then signed reports
Reward open & transparent review
25. Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
Reward open & transparent review
27. SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Also will be available to download by >36K Galaxy users in
29. How much further can we take this?
ISA + RO + Nanopub case study
Understand how each of the three models can support representation of
the actual scholarly artefacts, which are essential first-class objects in
scholarly communication
Demonstrate added value to life science, publishing and scholarly
communication communities on how these models should be used
together to describe scholarly artefacts from life sciences domains
30. Data models
(instructions for authors for digital publishing)
• Research Object
– An encapsulation of essential information related to experiments
and investigations
• The ISA (Investigation + Study + Assay) framework
– includes a format and a set of software tools that enable its
international user community to provide rich description of
the experimental workflows in life science, environmental
and biomedical domains.
• Nanopublication
– Dissemination of individual data (assertions) with/without an
accompanying scholarly article
– Enables attribution to the scientists for sharing their data
32. The SOAPdenovo2 Case study
The Data
The method, as
a Galaxy
workflow
The findings, Table 2 in the paper
(doi:10.1186/2047-217X-1-18)
33. investigation
Investigation/Study/Assay infrastructure
The investigation file is a high-level aggregator for related studies. It contains all the
information needed to understand the overall goals of an experiment, including the investigators
involved, associated publications, the experimental design, experimental factors,
protocols, funding agencies and so on…
35. investigation
Investigation/Study/Assay infrastructure
Each study has one or more associated assays. The assay is the test performed
either on the subject or on material taken from the subject, which produces qualitative
and/or quantitative measurements.
study study
assay assay assay assay
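The investigation → study → assay nesting can be sketched as plain data classes. This is a hedged illustration of the hierarchy only, not the ISA tools API: the class and field names, the investigation title, the study identifier and the assay description are assumptions; only the paper DOI comes from the case study.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    measurement: str  # what the test measures (hypothetical label)
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    identifier: str
    subjects: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    investigators: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)
    studies: List[Study] = field(default_factory=list)

    def all_assays(self):
        """Flatten the hierarchy: every assay of every study."""
        return [assay for study in self.studies for assay in study.assays]


# Usage: one investigation, one study, one assay (values are illustrative).
inv = Investigation(
    title="SOAPdenovo2 case study",
    publications=["doi:10.1186/2047-217X-1-18"],
    studies=[
        Study(
            identifier="study_1",
            subjects=["source_1"],
            assays=[Assay(measurement="genome assembly")],
        )
    ],
)
print(len(inv.studies), len(inv.all_assays()))
```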
40. investigation
study
assay
Representation available in:
•Tabular format
• Spreadsheet-like format
• For biologists/experimentalists
•RDF/OWL format for Semantic Web/Linked Data users
• For bioinformaticians/software developers
• Facilitating data integration, querying, reasoning
•Support for submission to public repositories and data
publication platforms
•Tools support for curation, creation, storage, analysis…
•Large and diverse life science user/collaborator communities
ISA framework
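To give a feel for the tabular, spreadsheet-like representation, here is a small sketch that writes and re-reads a tab-delimited study table using only the Python standard library. The column headers follow ISA-TAB naming conventions as far as I am aware, but the row values (source, organism annotation, protocol and sample names) are invented for illustration and are not taken from the SOAPdenovo2 study file.

```python
import csv
import io

# Column headers in the style of an ISA-TAB study table (assumed layout).
HEADER = [
    "Source Name",
    "Characteristics[organism]",
    "Term Source REF",
    "Term Accession Number",
    "Protocol REF",
    "Sample Name",
]

# One illustrative row: a source annotated with an NCBI Taxonomy term.
ROWS = [
    ["source_1", "Homo sapiens", "NCBITaxon", "9606", "sample collection", "sample_1"],
]


def write_study_table(rows):
    """Serialise the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


def read_study_table(text):
    """Parse tab-delimited text back into one dict per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


table_text = write_study_table(ROWS)
records = read_study_table(table_text)
print(records[0]["Characteristics[organism]"])
```

The same table opens unchanged in a spreadsheet program, which is the point of the format for experimentalists, while the ontology columns keep it usable for software.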
41. RO + ISA
Scientific workflow-specific ROs: scientific, computational experiments; non-wet-lab protocols
ISA experiment and data description: focus on wet-lab or non-computational experimental protocols
42. An RO for the Case Study
A Research Object
The Research Object contains the
following artefacts:
• The input sequence data, which
are represented in ISA-TAB format
• The Galaxy workflow that reflects
the computational steps taken for
generating the results used to
produce Table 2
• Machine-readable descriptions
of the workflow
• The nanopublication statements
that represent claims based on
the content of Table 2
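The aggregation on this slide can be sketched as a small manifest that lists each artefact with its role. This is a minimal illustration under stated assumptions, not the Research Object model's own vocabulary or tooling: the key names, role labels and all `example.org` URIs are hypothetical.

```python
import json

# Hypothetical role labels for the four kinds of artefact named on the slide.
ROLES = {"input_data", "workflow", "workflow_description", "nanopublication"}


def build_research_object(identifier, artefacts):
    """Aggregate artefacts (dicts with 'role' and 'uri') under one identifier."""
    for artefact in artefacts:
        if artefact["role"] not in ROLES:
            raise ValueError(f"unknown role: {artefact['role']}")
    return {"id": identifier, "aggregates": list(artefacts)}


ro = build_research_object(
    "http://example.org/ro/soapdenovo2",
    [
        {"role": "input_data", "uri": "http://example.org/ro/soapdenovo2/isa-tab/"},
        {"role": "workflow", "uri": "http://example.org/ro/soapdenovo2/workflow.ga"},
        {"role": "workflow_description", "uri": "http://example.org/ro/soapdenovo2/workflow.rdf"},
        {"role": "nanopublication", "uri": "http://example.org/ro/soapdenovo2/nanopub1"},
    ],
)
print(json.dumps(ro, indent=2))
```

The Research Object itself stays a thin layer: it points at the ISA description, the workflow and the nanopublications rather than copying them.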
44. Assertion
Nanopublication URL: unique, persistent and resolvable identifier
Assertion: an individual association between concepts:
•statement or declaration
•measurement
•hypothetical inference
•quantitative or qualitative
Provenance: how this assertion came to be, methods, evidence, context, etc.
PublicationInfo:
• Detailed attribution for authors, institutions, lab technicians, curators
• License info
• Publication date
Integrity Key: guarantee immutability after publication
Diagram labels:
• Provenance: assertion, opm:wasDerivedFrom, opm:wasGeneratedBy
• PublicationInfo: this nanopub, dcterms:created, pav:authoredBy
• Assertion: association, a sio:statisticalAssociation, sio:has-measurementValue Association_1_p_value; Association_1_p_value, a sio:probability-value, sio:has-value 6.56e-5^^xsd:float, sio:refers-to, dcterms:DOI, …
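The parts of a nanopublication named on this slide can be sketched as four groups of triples tied together by one identifier. This is an illustration only: the `example.org` identifiers, the creation date and the author are made up, the prefixed names are copied from the slide labels without checking them against the vocabularies, and the printed text is TriG-like rather than guaranteed valid TriG.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate obj")

NP = "http://example.org/nanopub1"  # hypothetical nanopublication URL


def make_nanopub(uri, assertion, provenance, pubinfo):
    """Group triples into head, assertion, provenance and pubinfo graphs."""
    head = [
        Triple(uri, "np:hasAssertion", uri + "#assertion"),
        Triple(uri, "np:hasProvenance", uri + "#provenance"),
        Triple(uri, "np:hasPublicationInfo", uri + "#pubinfo"),
    ]
    return {
        uri + "#head": head,
        uri + "#assertion": list(assertion),
        uri + "#provenance": list(provenance),
        uri + "#pubinfo": list(pubinfo),
    }


def to_trig_like(nanopub):
    """Render each graph as a block of 'subject predicate object .' lines."""
    lines = []
    for graph, triples in nanopub.items():
        lines.append(f"<{graph}> {{")
        for t in triples:
            lines.append(f"  {t.subject} {t.predicate} {t.obj} .")
        lines.append("}")
    return "\n".join(lines)


nanopub = make_nanopub(
    NP,
    assertion=[
        Triple("ex:association_1", "a", "sio:statisticalAssociation"),
        Triple("ex:association_1", "sio:has-measurementValue", "ex:association_1_p_value"),
        Triple("ex:association_1_p_value", "a", "sio:probability-value"),
        Triple("ex:association_1_p_value", "sio:has-value", '"6.56e-5"^^xsd:float'),
    ],
    provenance=[
        # Hypothetical source; the slide does not say what the assertion derives from.
        Triple(NP + "#assertion", "opm:wasDerivedFrom", "ex:source_dataset"),
    ],
    pubinfo=[
        Triple(NP, "dcterms:created", '"2013-01-01"'),  # made-up date
        Triple(NP, "pav:authoredBy", "ex:some_author"),  # made-up author
    ],
)
print(to_trig_like(nanopub))
```

Separating the assertion from its provenance and publication info is what lets a single claim be cited and attributed independently of the paper that contains it.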
45. A Nanopublication-Centric View
• Improvements of SOAPdenovo2 have also been observed in assembling GAGE [8]
dataset (see Additional file 1: Supplementary Method 6 and Tables 2 and 3). As
shown in Tables 2 and 3, the correct assembly length of SOAPdenovo2 increased
by approximately 3 to 80-fold comparing with that of SOAPdenovo1.
47. How do we generate a nanopub from this?
…stay tuned for Tech Track talk #34 by Marco Roos
ICC Lounge 81, Tuesday 23rd, 3.40pm-4.05pm
48. Final step: visualization
NC_010079.pdf
gi_161510924_ref_NC_010063.1_.pdf
CONTIGuator 2 (thanks Marco Galardini)
https://github.com/combogenomics/CONTIGuator
49. Lessons learned:
• It is possible to push button(s) & recreate a result from a paper
• Reproducibility is COSTLY. How much are you
willing to spend?
• You learn a huge amount about the study, and the process provides lots of
information not present in the paper
• Much easier to do this before rather than after publication
50. Next steps
• Complete the case study on the release of ISA-OWL,
nanopubs & ROs
• Extend the case study by including more than one dataset or
ROs, in order to show how related or conflicting information
can be more easily interlinked
• Create community guidelines on how these three models
should be used together, e.g. recommended patterns or
vocabulary terms
51. www.gigasciencejournal.com
Want to go beyond
dead trees & the PDF?
Give us your data &
pipelines!*
Contact us:
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com
* APCs currently generously covered by
BGI in 2013
52. Thanks to:
Our collaborators:
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Team:
Peter Li
Huayan Gao
Chris Hunter
Jesse Si Zhe
Nicole Nogoy
Laurie Goodman
Case study:
Marco Roos (LUMC)
Mark Thompson (LUMC)
Jun Zhao (Oxford)
Susanna Sansone (Oxford)
Philippe Rocca-Serra (Oxford)
Alejandra Gonzalez-Beltran (Oxford)
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/
CBIIT
Funding from:
Over 20,000 users on the main server. Over 500 papers citing the use of Galaxy. Over 55 servers deployed on the Web.
The investigation file is a high-level aggregator for related studies. It contains all the information needed to understand the overall goals of an experiment, including the investigators involved, associated publications, the experimental design, experimental factors, protocols, funding agencies and so on. Here is an example of some of the elements in the investigation file for the SOAPdenovo2 investigation.
An investigation can have one or more studies. A study is the central unit of the experimental description and it contains information on the subject(s) under study, their characteristics, and any treatments applied. In the SOAPdenovo2 case, there is a single study file, which describes the sample collection workflow. The elements can be associated with ontology terms. In the table shown, the source names are associated with a term from the NCBI Taxonomy to indicate their organism.
Each study has one or more associated assays. The assay is the test performed either on the subject or on material taken from the subject, which produces qualitative and/or quantitative measurements. The assay file in the SOAPdenovo2 case describes the different protocols applied, the raw data and how it is processed. The assay file aggregates this information and points to the specific data, analysis methods and scripts, i.e. resources of different types. In the example, the assay file points to an FTP site with the data, to a table in the paper, and to the workflow available in the Galaxy-CBIIT instance.
The ISA representation is available as: 1) a tabular format (ISA-TAB), which is a spreadsheet-like format targeted at biologists/experimentalists; 2) an RDF representation (produced by the ISA2OWL project), following the semantic web/linked data approach. This second representation is targeted at bioinformaticians/software developers; it facilitates the integration of data, rich querying and reasoning over the data. The ISA framework also supports submission to public repositories, e.g. by direct submission to databases that support the format (such as MetaboLights and GigaDB).
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse, BGI for their support - specifically Shaoguang for IT and bioinformatics support – our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan, the Cogini web design team, Datacite for providing the DOI service and the isacommons team for their support and advocacy for best practice use of metadata reporting and sharing. Thank you for listening.