Presentation by Bernd Pulverer on EMBO's 'Source Data' and the next generation of open access given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
3. Scientific publishing
– Dominant channel for the dissemination of peer-reviewed data.
– Journals function as a proxy for quality in research assessment.
– The rate of publishing keeps increasing.
– Papers are human-readable but poorly machine-readable.
6. ‘Expert View’
• All the data required to support the conclusions included in the paper.
• ‘General reader’ vs. ‘expert’ view of the paper:
– Expandable/collapsible ‘inline’ sections,
– Copy edited.
• Restricted to select types of data and information:
– Replicates
– Controls, experimental optimization
– ‘Negative’ results
– Extended experimental protocols
– Computational algorithms
• Datasets presented as separate files.
• No further-reaching data.
10. SourceData
Tools to publish figures as structured digital objects that link the human-readable illustrations with machine-readable metadata and ‘source data’ in order to:
• improve data transparency (ethics)
• make published data (re)useable
• enable data-oriented search
11. SourceData
Metadata
• Focus on the biological content
• Use standard identifiers and existing controlled vocabularies
Search
• Data-oriented semantic search of the literature
• Overcome some of the limitations of keyword-based search
Data
• Figure source data files hosted by the journals
• Link to data repositories
17. Structured metadata: ‘perturbation–observation–assay’
1. ‘Object-oriented’ representation of experimental variables: list biological components.
2. Retain the causality of the experimental design: “Measurement of Y as a function of A, B, C, using assay P in biological system S.”
3. Machine-readable representation with standard identifiers.
[Diagram: perturbed components acting on an experimental system, in which measured components and an assayed property are observed.]
21. Data-oriented search
Resulting hypothesis: test drug Z in disease D.
[Diagram: related results joined across papers — Paper 1 links drug Z to kinase Y activity; Paper 2 links kinase Y to protein X; Paper 3 links gene x to disease D in tissue T.]
28. A data ‘ecosystem’
[Diagram: authors submit papers and data; SourceData links journals and data repositories; readers search the literature and access the data.]
37. Availability of published data and software
• Datasets obtained by experimentation, computation or data mining should be made freely available, without restriction.
• Software should be described in sufficient detail to allow reproduction. If a specific implementation is the focus of the study, free access for non-commercial users is strongly recommended.
• Deposition of data should preferably be in one of the public databases prior to submission.
38. Data deposition
Large-scale datasets, sequences, atomic coordinates and computational models should be deposited in one of the relevant public databases prior to submission (provided private access is available at the database), and authors should include accession codes in the Materials & Methods section.
42. SourceData
Data
• Figure source data files hosted by the journals
• Link to ‘unstructured data’ repositories
Metadata
• Focus on the biological content
• Use standard identifiers and existing controlled vocabularies
Search
• Data-oriented semantic search of the literature
• Overcome some of the limitations of keyword-based search
A transparent process provides a permissive environment for the publication of ethically robust papers by releasing some of the pressures in the race to publish in biology.
I would like to present some initiatives and ideas we have with regard to published data. They represent an extension of the concept of transparency to the data we publish in our journals, but also an extension of the concept of open access. Several of these ideas are currently being developed in a project called SourceData that I will briefly summarize.
Data are at the heart of a paper: the free text is the author's interpretation; the data are absolute. We think it is important to consider how data are presented in figures. But publishing faces many challenges, one of which is that the rate of publishing keeps increasing: about one million papers are now indexed every year, twice as many as ten years ago, and some journals such as PLOS ONE are in an exponential growth phase. It is thus becoming harder and harder to search the literature and find specific information. While fewer and fewer people manage to keep up with this mass of human-readable papers, we rely more and more on machines to access documents that are, however, poorly machine-readable.
Deconstructing a paper: it is a stacked, layered structure that allows access to the content at increasing depth, from the title and abstract down to the data. The title and abstract provide quick access for the browser; synopses and visual abstracts provide summaries of the key facts. The core is the main paper, which is optimized for the human reader. At a deeper level there are supplementary information, structured datasets and computer code.
What we would like to achieve in the near future is to eliminate the concept of supplementary information: the volume of these supplementary sections is continuously growing; they are not well reviewed, not copy edited and often not well presented; and they sometimes contain data only peripherally related to the main conclusions.
Instead of supplementary information, we propose an expert view of the paper. Some data can be repetitive and make papers difficult to read. We therefore propose two views of the paper: one for general readers, corresponding to the main paper as we know it now, and an expert view in which the additional in-depth information and data are included within the paper as expandable/collapsible sections.
Similarly, to encourage maximal use/reuse, we will make all the datasets and source data freely available under a CC0 license by default.
Data are the core of a research paper, yet figures are published as images, that is, collections of pixels, making it impossible to re-analyse or re-use the data, or to find them easily. This affects all journals, whether they are open access or not, and as a result most published scientific data remain locked inside the papers.
To start addressing this challenge, we have started the SourceData project. With SourceData we want to publish figures as structured digital objects by linking figures with source data and machine-readable metadata. The aim is to improve data transparency, to promote data sharing and to enable data-oriented search.
Three components… The first step of the project is to enable authors to provide the raw data behind the figures. This can be done in several ways: either the data are hosted by the journals, and I will show an example in a minute, or authors elect to host the data in one of several ‘unstructured data’ repositories such as Dryad and provide links to these resources.
This is not limited to numerical data. Here is an example from The EMBO Journal where the full gels are provided as uncropped images, allowing readers to examine the blots beyond the narrow slices usually displayed in published figures. The same applies to micrographs.
Datasets alone are, however, of limited utility: we need to associate these data with structured metadata that explain their biological content. To be useful for data mining, this must be done in a machine-readable way, which involves the use of standard identifiers, controlled vocabularies and existing ontologies.
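To make the idea of standard identifiers concrete, here is a minimal, purely illustrative sketch in Python. The identifier schemes shown (UniProt, NCBI Taxonomy, Gene Ontology) are real controlled vocabularies, but the lookup table and the `annotate` helper are hypothetical, not part of SourceData.

```python
# Illustrative only: a tiny mapping from free-text entity names to
# standard identifiers drawn from existing controlled vocabularies.
# The identifier schemes are real; this particular table is a sketch.
CONTROLLED_VOCABULARY = {
    "CDK1": "uniprot:P06493",          # human cyclin-dependent kinase 1
    "human": "taxonomy:9606",          # NCBI Taxonomy ID for Homo sapiens
    "phosphorylation": "GO:0016310",   # Gene Ontology process term
}

def annotate(term: str) -> str:
    """Return a machine-readable identifier for a free-text term,
    falling back to a marked raw string when no mapping exists."""
    return CONTROLLED_VOCABULARY.get(term, f"unmapped:{term}")

print(annotate("CDK1"))   # uniprot:P06493
print(annotate("p53"))    # unmapped:p53
```

The point is only that a machine can unambiguously join annotations across papers once free text is replaced by shared identifiers.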
The second level will encode a fundamental structure common to many biomedical experiments. This is most easily seen for data represented as a plot: such data result from an experiment in which a given biological component Y was measured or observed as a function of various experimental conditions or perturbations A, B, C, using an assay P in a defined experimental system S. This separation between the components that are observed, for example the phosphorylation level of a protein, and the components that are perturbed, for example a kinase inhibited by a drug, can be applied across an extremely broad range of published data, whether western blots, histological preparations or microscopy, because the model captures the causality underlying an experimental design. This representation of directional relationships between biological components is a backbone model on which much detail can be elaborated. It is thus scalable, in the sense that it can be extended and refined, and specialized models can be derived from this backbone.
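The backbone model described above can be sketched as a simple data structure. This is not the actual SourceData schema, only an assumed shape that mirrors the sentence "measurement of Y as a function of A, B, C, using assay P in biological system S"; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class PanelAnnotation:
    """Hypothetical sketch of a 'perturbation-observation-assay'
    record for one figure panel (not the real SourceData schema)."""
    measured: list    # components observed, i.e. Y
    perturbed: list   # components manipulated, i.e. A, B, C
    assay: str        # the assay P
    system: str       # the biological system S

# Example annotation for a kinase experiment (illustrative values):
panel = PanelAnnotation(
    measured=["histone H1 phosphorylation"],
    perturbed=["CDK1"],
    assay="in vitro kinase assay",
    system="Xenopus egg extract",
)

# The perturbed -> measured split preserves the causal direction
# of the experimental design in machine-readable form.
print(panel.perturbed, "->", panel.measured)
```

Because the record separates cause (perturbed) from effect (measured), the directionality of the experiment survives into the metadata, which is what later enables chaining results across papers.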
So we are currently developing tools that will enable curation of accepted manuscripts by data editors and embed this curation in the production process.
Finally, the third component is to use the machine-readable metadata to enable data-oriented searches of the papers based on the data they contain. The semantic information provided by the metadata will help overcome some of the limitations of text-based searches. SourceData will make figures more useable; the search will make them discoverable.
SourceData will allow papers to be searched through their data. If, for example, we are interested in finding data about CDK1 substrates, we can formulate a more or less complex query in PubMed. In this case, we would find a series of papers; to check their relevance we would have to open them, read the abstracts and check the articles and figures. If the figures had been annotated with SourceData metadata, we could instead search directly for published experiments in which measurements were conducted under conditions where CDK1 activity was perturbed. This would lead us to the relevant data inside the papers, from where we can link out to the associated papers.
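A data-oriented query of this kind amounts to filtering annotated panels by their perturbed components rather than keyword-matching the text. The following sketch is hypothetical (the panel records and paper names are invented for illustration):

```python
# Invented example records standing in for SourceData panel metadata.
panels = [
    {"paper": "Paper A", "perturbed": ["CDK1"],
     "measured": ["substrate phosphorylation"]},
    {"paper": "Paper B", "perturbed": ["drug Z"],
     "measured": ["kinase Y activity"]},
    {"paper": "Paper C", "perturbed": ["CDK1", "cyclin B"],
     "measured": ["APC/C activity"]},
]

def find_perturbed(panels, entity):
    """Return panels from experiments in which `entity` was perturbed."""
    return [p for p in panels if entity in p["perturbed"]]

for hit in find_perturbed(panels, "CDK1"):
    print(hit["paper"], "->", hit["measured"])
# Paper A -> ['substrate phosphorylation']
# Paper C -> ['APC/C activity']
```

A plain keyword search cannot distinguish a paper that merely mentions CDK1 from one that experimentally perturbed it; the structured `perturbed` field makes that distinction trivial.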
As a consequence, related experiments can be found across papers in the literature and joined in a directional way to help generate hypotheses. In this example, drug Z might be interesting to test for disease D. It would be extremely difficult to perform such tasks systematically with conventional search strategies. This application goes beyond mere search and is a step towards the integration of multiple datasets. It could be an extremely powerful way to generate new hypotheses and potentially new findings.
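Joining directional perturbed-to-measured relations across papers is, in effect, a path search over a small graph. The sketch below mirrors the drug Z example from the slides; the edge list is invented for illustration and would in practice be derived from panel metadata:

```python
# Hypothetical directional edges extracted from three papers' metadata,
# mirroring the slide example (edge list invented for illustration).
edges = {
    "drug Z": ["kinase Y activity"],      # Paper 1
    "kinase Y activity": ["protein X"],   # Paper 2
    "protein X": ["disease D"],           # Paper 3
}

def chains(start, path=None):
    """Depth-first enumeration of causal chains from a starting node."""
    path = (path or []) + [start]
    targets = edges.get(start, [])
    if not targets:
        return [path]
    result = []
    for target in targets:
        result.extend(chains(target, path))
    return result

for chain in chains("drug Z"):
    print(" -> ".join(chain))
# drug Z -> kinase Y activity -> protein X -> disease D
```

Because each edge preserves the causal direction of the original experiment, the concatenated chain reads as a testable hypothesis (here: drug Z may be relevant to disease D) rather than a bag of co-occurring terms.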
Another function is to find figures that are closely related to each other: from a starting figure, it would be possible to find related figures and the respective papers. This resembles the ‘related articles’ function in PubMed, but applied to individual figure panels.
To conclude, the last ten years have seen profound changes in scientific publishing: the transition to online publishing and open access content has opened the door to large-scale, systematic mining of the literature. This transition, however, needs to be completed to go beyond access to the text and offer deeper access to the research data. The current human-readable format of papers will remain, but the paper of the future will need to be associated with a machine-readable version. With SourceData, we will make published data useable by linking them to explanatory machine-readable metadata. These metadata will in turn enable data-oriented search functions that will increase the discoverability of the papers. This represents the next generation of open access, which will enable much deeper access to the literature and the systematic mining and integration of published data. Such a transformation will be needed to benefit from the potential of research data to generate new findings and accelerate scientific discovery.
It is very early days to predict how the data ecosystem will stabilize, both at the technical and the economic level. From the user's point of view, the basic tasks have to remain as simple and straightforward as possible: authors want to submit their papers and data, and readers need to access the data and search the literature. The role of SourceData in this ecosystem is to provide a series of tools and services that will create a win-win-win situation across the major stakeholders: authors will benefit from the increased discoverability of their research; journals and data repositories will increase the visibility of their content and add more value to it, which is a crucial issue in publishing at present; and readers will have greater and deeper access to data and to the literature.
Fifty panels of TGF-beta signaling data, annotated in a primitive form.
The data published in papers are mainly presented in figures. It is in the figures that the evidence supporting the conclusions is shown; figures are absolutely essential to a formal scientific proof.
The fact that these large datasets do not fit the classical format of published papers, especially when print was still relevant, has created a situation in which papers and data largely live parallel and separate lives: papers are published in journals, datasets are deposited in databases. This has perhaps conditioned us to think of scientific publishing and data dissemination in separate terms.
The importance of making research data available has been largely driven by the fields in biology that produce large-scale datasets—genomics and the other omics fields, but also structural biology.
This has serious consequences for search, which has become essential to find specific information in this ocean of papers.
The first step of the project is to enable authors to provide the raw data behind the figures; we call these source data, and this gave the name to the entire project. This can be done in several ways: either the data are hosted on the publisher's website, and I will show an example in a minute, or authors elect to host the data in one of several ‘unstructured data’ repositories such as Dryad and provide links to these resources. Datasets alone are, however, of limited utility. The second component, which is central to the present proposal, is to associate these data with structured metadata that explain the biological content of the data. To be useful for global data mining, this must be done in a machine-readable way, which involves the use of standard identifiers, controlled vocabularies and existing ontologies. Finally, the third component is to use the machine-readable metadata to enable data-oriented searches of the papers based on the data they contain. The semantic information provided by the metadata will help overcome some of the limitations of text-based searches: the metadata will make the data more useable, and the search will make them discoverable. So, let us briefly review these three components.