Presentation by Bernd Pulverer on EMBO's 'Source Data' and the next generation of open access given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
3. Scientific publishing
– Dominant channel for the dissemination of peer-reviewed data.
– Journals function as a proxy for quality in research assessment.
– The rate of publishing keeps increasing.
– Papers are human-readable but poorly machine-readable.
6. ‘Expert View’
• All the data required to support the conclusions included in the paper.
• ‘General reader’ vs. ‘expert’ view of the paper:
– Expandable/collapsible ‘inline’ sections,
– Copy edited.
• Restricted to select types of data and information:
– Replicates
– Controls, experimental optimization
– ‘Negative’ results
– Extended experimental protocols
– Computational algorithms
• Datasets presented as separate files.
• No further-reaching data.
10. SourceData
Tools to publish figures as structured digital objects that link the human-readable illustrations with machine-readable metadata and ‘source data’ in order to:
• improve data transparency (ethics)
• make published data (re)useable
• enable data-oriented search
11. SourceData
Metadata
• Focus on the biological content
• Use standard identifiers and existing controlled vocabularies
Search
• Data-oriented semantic search of the literature
• Overcome some of the limitations of keyword-based search
Data
• Figure source data files hosted by the journals
• Link to data repositories
17. Structured metadata: ‘perturbation–observation–assay’
1. ‘Object-oriented’ representation of experimental variables: list biological components.
2. Retain the causality of the experimental design: “Measurement of Y as a function of A, B, C, using assay P in biological system S.”
3. Machine-readable representation with standard identifiers.
[Diagram: perturbed components acting on an experimental system, in which measured components and an assayed property are observed.]
21. Data-oriented search
Resulting hypothesis: test drug Z in disease D.
[Diagram: related results joined across papers — Paper 1 links drug Z to kinase Y activity; Paper 2 links kinase Y to protein X; Paper 3 links gene x to disease D in tissue T.]
28. A data ‘ecosystem’
[Diagram: authors submit papers and data; SourceData links journals and data repositories; readers search the literature and access the data.]
37. Availability of published data and software
• Datasets obtained by experimentation, computation or data mining should be made freely available, without restriction.
• Software should be described in sufficient detail to allow reproduction. If a specific implementation is the focus of the study, free access for non-commercial users is strongly recommended.
• Deposition of data should preferably be in one of the public databases prior to submission.
38. Data deposition
Large-scale datasets, sequences, atomic coordinates and computational models should be deposited in one of the relevant public databases prior to submission (provided private access is available at the database), and authors should include accession codes in the Materials & Methods section.
42. SourceData
Data
• Figure source data files hosted by the journals
• Link to ‘unstructured data’ repositories
Metadata
• Focus on the biological content
• Use standard identifiers and existing controlled vocabularies
Search
• Data-oriented semantic search of the literature
• Overcome some of the limitations of keyword-based search
A transparent process provides a permissive environment for the publication of ethically robust papers by releasing some of the pressures in the race to publish in biology.
I would like to present some initiatives and ideas we have with regard to published data. They represent an extension of the concept of transparency to the data we publish in our journals, but also an extension of the concept of open access. Several of these ideas are currently being developed in a project called SourceData that I will briefly summarize.
Data are at the heart of a paper: the free text is the author's interpretation; the data are absolute. We think it is important to consider how data are presented in figures. But publishing faces many challenges, one of which is that the rate of publishing keeps increasing: about one million papers are now indexed every year, twice as many as ten years ago, and some journals such as PLOS ONE are in an exponential growth phase. It is thus becoming harder and harder to search the literature and find specific information. While fewer and fewer people manage to keep up with this mass of human-readable papers, we rely more and more on machines to access documents that are, however, poorly machine-readable.
Deconstructing a paper: it is a stacked, layered structure that allows access to the content at increasing depth, from the title and abstract down to the data. The title and abstract provide quick access for the browser; synopses and visual abstracts provide summaries of the key facts. The core is the main paper, which is optimized for the human reader. At a deeper level there are supplementary information, structured datasets and computer code.
What we would like to achieve in the near future is to eliminate the concept of supplementary information: the volume of these supplementary sections is continuously growing; they are not well reviewed, not copy edited and often not well presented; and they sometimes contain data only peripherally related to the main conclusions.
Instead of supplementary information, we propose an expert view of the paper. Some data can be repetitive and make papers difficult to read. We therefore propose two views of the paper: one for general readers, corresponding to the main paper as we know it now, and an expert view in which the additional in-depth information and data are included within the paper as expandable/collapsible sections.
Similarly, to encourage maximal use/reuse, we will make all the datasets and source data freely available under a CC0 license by default.
Data are the core of a research paper, yet figures are published as images, that is, collections of pixels, making it impossible to re-analyse or re-use the data, or to find them easily. This affects all journals, whether they are open access or not, and as a result most published scientific data remain locked inside the papers.
To start addressing this challenge, we have started the SourceData project. With SourceData we want to publish figures as structured digital objects by linking figures with source data and machine-readable metadata. The aim is to improve data transparency, to promote data sharing and to enable data-oriented search.
Three components… The first step of the project is to enable authors to provide the raw data behind the figures. This can be done in several ways: either the data are hosted by the journals, and I will show an example in a minute, or authors elect to host the data in one of several ‘unstructured data’ repositories such as Dryad and provide links to these resources.
This is not limited to numerical data. Here is an example from The EMBO Journal where the full gels are provided as uncropped images, allowing readers to examine the blots beyond the narrow slices usually displayed in published figures. The same applies to micrographs.
Datasets alone are, however, of limited utility: we need to associate these data with structured metadata that explain their biological content. To be useful for data mining, this must be done in a machine-readable way, which involves the use of standard identifiers, controlled vocabularies and existing ontologies.
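To make the idea of standard identifiers concrete, here is a minimal, purely illustrative sketch in Python. The identifier schemes shown (UniProt, NCBI Taxonomy, Gene Ontology) are real controlled vocabularies, but the lookup table and the `annotate` helper are hypothetical, not part of SourceData.

```python
# Illustrative only: a tiny mapping from free-text entity names to
# standard identifiers drawn from existing controlled vocabularies.
# The identifier schemes are real; this particular table is a sketch.
CONTROLLED_VOCABULARY = {
    "CDK1": "uniprot:P06493",          # human cyclin-dependent kinase 1
    "human": "taxonomy:9606",          # NCBI Taxonomy ID for Homo sapiens
    "phosphorylation": "GO:0016310",   # Gene Ontology process term
}

def annotate(term: str) -> str:
    """Return a machine-readable identifier for a free-text term,
    falling back to a marked raw string when no mapping exists."""
    return CONTROLLED_VOCABULARY.get(term, f"unmapped:{term}")

print(annotate("CDK1"))   # uniprot:P06493
print(annotate("p53"))    # unmapped:p53
```

The point is only that a machine can unambiguously join annotations across papers once free text is replaced by shared identifiers.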
The second level will encode a fundamental structure common to many biomedical experiments. This is most easily seen for data represented as a plot: such data result from an experiment in which a given biological component Y was measured or observed as a function of various experimental conditions or perturbations A, B, C, using an assay P in a defined experimental system S. This separation between the components that are observed, for example the phosphorylation level of a protein, and the components that are perturbed, for example a kinase inhibited by a drug, can be applied across an extremely broad range of published data, whether western blots, histological preparations or microscopy, because the model captures the causality underlying an experimental design. This representation of directional relationships between biological components is a backbone model on which much detail can be elaborated. It is thus scalable, in the sense that it can be extended and refined, and specialized models can be derived from this backbone.
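The backbone model described above can be sketched as a simple data structure. This is not the actual SourceData schema, only an assumed shape that mirrors the sentence "measurement of Y as a function of A, B, C, using assay P in biological system S"; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class PanelAnnotation:
    """Hypothetical sketch of a 'perturbation-observation-assay'
    record for one figure panel (not the real SourceData schema)."""
    measured: list    # components observed, i.e. Y
    perturbed: list   # components manipulated, i.e. A, B, C
    assay: str        # the assay P
    system: str       # the biological system S

# Example annotation for a kinase experiment (illustrative values):
panel = PanelAnnotation(
    measured=["histone H1 phosphorylation"],
    perturbed=["CDK1"],
    assay="in vitro kinase assay",
    system="Xenopus egg extract",
)

# The perturbed -> measured split preserves the causal direction
# of the experimental design in machine-readable form.
print(panel.perturbed, "->", panel.measured)
```

Because the record separates cause (perturbed) from effect (measured), the directionality of the experiment survives into the metadata, which is what later enables chaining results across papers.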
So we are currently developing tools that will enable curation of accepted manuscripts by data editors and embed this curation in the production process.
Finally, the third component is to use the machine-readable metadata to enable data-oriented searches of the papers based on the data they contain. The semantic information provided by the metadata will help overcome some of the limitations of text-based searches. SourceData will make figures more useable; the search will make them discoverable.
SourceData will allow papers to be searched through their data. If, for example, we are interested in finding data about CDK1 substrates, we can formulate a more or less complex query in PubMed. In this case, we would find a series of papers; to check their relevance we would have to open them, read the abstracts and check the articles and figures. If the figures had been annotated with SourceData metadata, we could instead search directly for published experiments in which measurements were conducted under conditions where CDK1 activity was perturbed. This would lead us to the relevant data inside the papers, from where we can link out to the associated papers.
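A data-oriented query of this kind amounts to filtering annotated panels by their perturbed components rather than keyword-matching the text. The following sketch is hypothetical (the panel records and paper names are invented for illustration):

```python
# Invented example records standing in for SourceData panel metadata.
panels = [
    {"paper": "Paper A", "perturbed": ["CDK1"],
     "measured": ["substrate phosphorylation"]},
    {"paper": "Paper B", "perturbed": ["drug Z"],
     "measured": ["kinase Y activity"]},
    {"paper": "Paper C", "perturbed": ["CDK1", "cyclin B"],
     "measured": ["APC/C activity"]},
]

def find_perturbed(panels, entity):
    """Return panels from experiments in which `entity` was perturbed."""
    return [p for p in panels if entity in p["perturbed"]]

for hit in find_perturbed(panels, "CDK1"):
    print(hit["paper"], "->", hit["measured"])
# Paper A -> ['substrate phosphorylation']
# Paper C -> ['APC/C activity']
```

A plain keyword search cannot distinguish a paper that merely mentions CDK1 from one that experimentally perturbed it; the structured `perturbed` field makes that distinction trivial.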
As a consequence, related experiments can be found across papers in the literature and joined in a directional way to help generate hypotheses. In this example, drug Z might be interesting to test for disease D. It would be extremely difficult to perform such tasks systematically with conventional search strategies. This application goes beyond mere search and is a step towards the integration of multiple datasets. It could be an extremely powerful way to generate new hypotheses and potentially new findings.
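Joining directional perturbed-to-measured relations across papers is, in effect, a path search over a small graph. The sketch below mirrors the drug Z example from the slides; the edge list is invented for illustration and would in practice be derived from panel metadata:

```python
# Hypothetical directional edges extracted from three papers' metadata,
# mirroring the slide example (edge list invented for illustration).
edges = {
    "drug Z": ["kinase Y activity"],      # Paper 1
    "kinase Y activity": ["protein X"],   # Paper 2
    "protein X": ["disease D"],           # Paper 3
}

def chains(start, path=None):
    """Depth-first enumeration of causal chains from a starting node."""
    path = (path or []) + [start]
    targets = edges.get(start, [])
    if not targets:
        return [path]
    result = []
    for target in targets:
        result.extend(chains(target, path))
    return result

for chain in chains("drug Z"):
    print(" -> ".join(chain))
# drug Z -> kinase Y activity -> protein X -> disease D
```

Because each edge preserves the causal direction of the original experiment, the concatenated chain reads as a testable hypothesis (here: drug Z may be relevant to disease D) rather than a bag of co-occurring terms.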
Another function is to find figures that are closely related to each other: from a starting figure, it would be possible to find related figures and the respective papers. This resembles the ‘related articles’ function in PubMed, but applied to individual figure panels.
To conclude, the last ten years have seen profound changes in scientific publishing: the transition to online publishing and open access content has opened the door to large-scale, systematic mining of the literature. This transition, however, needs to be completed to go beyond access to the text and offer deeper access to the research data. The current human-readable format of papers will remain, but the paper of the future will need to be associated with a machine-readable version. With SourceData, we will make published data useable by linking them to explanatory machine-readable metadata. These metadata will in turn enable data-oriented search functions that will increase the discoverability of the papers. This represents the next generation of open access, which will enable much deeper access to the literature and the systematic mining and integration of published data. Such a transformation will be needed to benefit from the potential of research data to generate new findings and accelerate scientific discovery.
It is very early days to predict how the data ecosystem will stabilize, both at the technical and the economic level. From the user's point of view, the basic tasks have to remain as simple and straightforward as possible: authors want to submit their papers and data, and readers need to access the data and search the literature. The role of SourceData in this ecosystem is to provide a series of tools and services that will create a win-win-win situation across the major stakeholders: authors will benefit from the increased discoverability of their research; journals and data repositories will increase the visibility of their content and add more value to it, which is a crucial issue in publishing at present; and readers will have greater and deeper access to data and to the literature.
Fifty panels of TGF-beta signaling data, annotated in a primitive form.
The data published in papers are mainly presented in figures. It is in the figures that the evidence supporting the conclusions is shown; figures are absolutely essential to a formal scientific proof.
The fact that these large datasets do not fit the classical format of published papers, especially when print was still relevant, has created a situation in which papers and data largely live parallel and separate lives: papers are published in journals, datasets are deposited in databases. This has perhaps conditioned us to think of scientific publishing and data dissemination in separate terms.
The importance of making research data available has been largely driven by the fields in biology that produce large-scale datasets—genomics and the other omics fields, but also structural biology.
This has serious consequences for search, which has become essential to find specific information in this ocean of papers.
The first step of the project is to enable authors to provide the raw data behind the figures; we call these source data, and this gave the name to the entire project. This can be done in several ways: either the data are hosted on the publisher's website, and I will show an example in a minute, or authors elect to host the data in one of several ‘unstructured data’ repositories such as Dryad and provide links to these resources. Datasets alone are, however, of limited utility. The second component, which is central to the present proposal, is to associate these data with structured metadata that explain the biological content of the data. To be useful for global data mining, this must be done in a machine-readable way, which involves the use of standard identifiers, controlled vocabularies and existing ontologies. Finally, the third component is to use the machine-readable metadata to enable data-oriented searches of the papers based on the data they contain. The semantic information provided by the metadata will help overcome some of the limitations of text-based searches: the metadata will make the data more useable, and the search will make them discoverable. So, let us briefly review these three components.