Prepared and presented by Jo McEntyre (EMBL-EBI) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany, September 14th-16th, 2015.
Citing data in research articles: principles, implementation, challenges - and the benefits of changing our ways.
1. Citing data in research articles:
principles, implementation, challenges
- and the benefits of changing our ways
Jo McEntyre
Europe PMC, EMBL-EBI
www.ebi.ac.uk
3. Familiar Complexity!
Article | ‘Package’ | External Resources
“Recognized” data repos: file | structured record; Accession | DOI | API + Accession
Institutional repos: file | structured record; URL | DOI | API + Accession
Author database | ‘website’: file | structured record; URL | DOI | API + Accession
Supp info tables/data: file; URL | DOI
Cross-reference
Dataset list
Ref to external resource
Ref to external resource
Reference list
Fig Source data: file; URL | DOI
Fig (caption + graphic)
Cross-reference
Ref to external resource
Adapted from Thomas Lemberger, EMBO
4. Europe PMC literature database
Europe PMC
• Abstracts: 30 million
• Full-text articles: 3 million
• Article citation counts
• Grants
• ORCIDs
• Semantic annotation
• Data citations
• Data integration
Europe PMC is a member of the PMC International Collaboration.
Funded by 28 European funders of life science research
5. About EMBL-EBI
• Part of the European Molecular Biology Laboratory
• International, non-profit research institute
• Europe’s hub for biological data services and research
6. Making data discoverable
Labs around the world deposit data and we…
Archive it
Classify it
Share it with other data providers
Analyse, add value and integrate it
…provide tools to help researchers use it
A collaborative enterprise
8. Data Citation in Europe PMC full text
Literature*
Added-Value
Submitted
*OMIM, Clinical trials, GO
Submission statements vs reuse?
260K
9. Data Citation Principles Engender Two Big Ideas
"sound, reproducible scholarship rests upon a
foundation of robust, accessible data"
"data should be considered legitimate, citable
products of research"
These slides are adapted from:
http://www.slideshare.net/joanstarr/data-citation-a-joint-declaration-
10. Joint Declaration on Data Citation Principles
1 Importance
2 Credit and Attribution
3 Evidence
4 Unique Identification
5 Access
6 Persistence
7 Specificity and Verifiability
8 Interoperability and flexibility
Full Principles: https://www.force11.org/datacitation
11. Joint Declaration
Data should be considered legitimate, citable
products of research. Data citations should be
accorded the same importance in the scholarly
record as citations of other research objects, such as
publications.
1. Importance
12. Data citations should facilitate giving scholarly credit
and normative and legal attribution to all contributors
to the data, recognizing that a single style or
mechanism of attribution may not be applicable to all
data.
2. Credit and Attribution
Joint Declaration
13. In scholarly literature, whenever and wherever
a claim relies upon data, the corresponding data
should be cited.
3. Evidence
Joint Declaration
14. A data citation should include a persistent method
for identification that is machine actionable, globally
unique, and widely used by a community.
4. Unique identification
etc.. !!!
Joint Declaration
15. Data citations should facilitate access to the data
themselves and to such associated metadata,
documentation, code, and other materials, as are
necessary for both humans and machines to make
informed use of the referenced data.
5. Access
Joint Declaration
16. Unique identifiers, and metadata describing the
data, and its disposition, should persist -- even
beyond the lifespan of the data they describe.
6. Persistence
Joint Declaration
17. Data citations should facilitate identification of,
access to, and verification of the specific data that
support a claim. Citations or citation metadata
should include information about provenance and
fixity sufficient to facilitate verifying that the specific
timeslice, version and/or granular portion of data
retrieved subsequently is the same as was
originally cited.
7. Specificity and Verifiability
Joint Declaration
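Principle 7 asks for fixity information so that a reader can confirm the data retrieved later is the data originally cited. One common way to do this is to record a checksum alongside the citation. The sketch below assumes SHA-256 and uses invented sample bytes; the Declaration itself does not prescribe a particular method, and the function names are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest to record next to a data citation."""
    return hashlib.sha256(data).hexdigest()


def matches_citation(retrieved: bytes, cited_checksum: str) -> bool:
    """True if the retrieved bytes are identical to the bytes that were cited."""
    return fingerprint(retrieved) == cited_checksum.lower()


# Invented sample data standing in for a cited file.
cited_bytes = b"sample,value\nA,1\nB,2\n"
cited_checksum = fingerprint(cited_bytes)

# An unchanged copy verifies; a copy with one altered value does not.
same_copy_ok = matches_citation(cited_bytes, cited_checksum)
altered_copy_ok = matches_citation(b"sample,value\nA,1\nB,3\n", cited_checksum)
print(same_copy_ok, altered_copy_ok)
```

A checksum only covers a whole file; citing a time slice or a granular portion of a dataset would need the checksum to be computed over exactly that portion.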
18. Data citation methods should be sufficiently flexible
to accommodate the variant practices among
communities, but should not differ so much that they
compromise interoperability of data citation practices
across communities.
8. Interoperability and flexibility
Joint Declaration
20. An implementation example
Principle 2: Credit and Attribution
Principles 4, 5, 6: Unique ID, Access, Persistence
Principle 7: Specificity and Verifiability
Principle 8: Interoperability and flexibility
Creators, Year, Dataset Title, DOI, Data Repository, version
(Resolves to landing page with access to metadata, docs, and data)
Slide from Mercè Crosas, Ph.D., Harvard University
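The element order on this slide (creators, year, dataset title, DOI, data repository, version) can be sketched as a small formatter. This is an illustration only: the class name, the separators, and every field value are assumptions, and the DOI is a placeholder rather than a real dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Holds the citation elements listed on the slide."""
    creators: List[str]
    year: int
    title: str
    doi: str
    repository: str
    version: Optional[str] = None

    def format(self) -> str:
        # Element order follows the slide; the punctuation is an assumption.
        parts = [
            "; ".join(self.creators),
            str(self.year),
            f'"{self.title}"',
            f"https://doi.org/{self.doi}",  # DOI written as a resolvable link
            self.repository,
        ]
        if self.version:
            parts.append(self.version)
        return ", ".join(parts)


# All values below are placeholders invented for illustration.
example = DatasetCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2015,
    title="Example dataset",
    doi="10.1234/example",
    repository="Example Repository",
    version="V1",
)
citation_text = example.format()
print(citation_text)
```

Writing the DOI as a link is what lets the citation resolve to a landing page, which is the point the slide makes for Principles 4 to 6.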
32. JATS support for data citation
<mixed-citation publication-type='data'>
<name><surname>Heinz</surname><given-names>D.W.</given-names></name>,
<name><surname>Baase</surname><given-names>W.A.</given-names></name>,
<etal>et. al.</etal>
<data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
<source>PDB Europe</source>,
accession <pub-id pub-id-type='accession' assigning-authority='pdb' xlink:href='http://www.ebi.ac.uk/pdbe/entry/search/index?text:102L'>102l</pub-id>.
<pub-id pub-id-type='doi' xlink:href='http://dx.doi.org/10.2210/pdb102l/pdb'>10.2210/pdb102l/pdb</pub-id>
</mixed-citation>
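Because the identifiers are tagged, software can pull them out of an article without text mining. The following is a minimal sketch using Python's standard library on a shortened copy of the citation above. The `xmlns:xlink` declaration is added so the fragment parses on its own (in a full JATS article it would normally be declared on an enclosing element), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Shortened copy of the slide's citation, with the xlink namespace declared
# so that the fragment is parseable in isolation.
JATS_FRAGMENT = """
<mixed-citation xmlns:xlink="http://www.w3.org/1999/xlink" publication-type="data">
<data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
<source>PDB Europe</source>,
accession <pub-id pub-id-type="accession" assigning-authority="pdb"
xlink:href="http://www.ebi.ac.uk/pdbe/entry/search/index?text:102L">102l</pub-id>.
<pub-id pub-id-type="doi"
xlink:href="http://dx.doi.org/10.2210/pdb102l/pdb">10.2210/pdb102l/pdb</pub-id>
</mixed-citation>
"""


def extract_identifiers(xml_text):
    """Return {pub-id-type: (identifier, link)} for one mixed-citation."""
    root = ET.fromstring(xml_text.strip())
    found = {}
    for pub_id in root.iter("pub-id"):
        id_type = pub_id.get("pub-id-type")
        # Namespaced attributes are addressed as {namespace-uri}local-name.
        link = pub_id.get(f"{{{XLINK}}}href")
        found[id_type] = (pub_id.text, link)
    return found


identifiers = extract_identifiers(JATS_FRAGMENT)
for id_type, (value, link) in identifiers.items():
    print(id_type, value, link)
```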
33. Minimal, maximal & extensible citation
Resource name | ID
Resource name | Resolution ‘template’ | ID
Author list | Resource name | Resolution ‘template’ | ID | Time | ?
Author list | Resource name | Resolution ‘template’ | ID | Time | ?
For example: new data vs pre-existing data
For example: version
Thomas Lemberger, EMBO
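The step from the minimal citation (resource name plus ID) to a resolvable one is the resolution ‘template’: a URL pattern into which the ID is substituted. A minimal sketch follows, assuming a simple lookup table. The two URL patterns are the ones that appear in the JATS example in this deck; the table and function names are hypothetical, and a production system would draw on a curated registry instead.

```python
# Resolution templates keyed by resource name. Both patterns are copied from
# the JATS example in this deck; the table itself is an illustrative assumption.
RESOLUTION_TEMPLATES = {
    "doi": "http://dx.doi.org/{id}",
    "pdb": "http://www.ebi.ac.uk/pdbe/entry/search/index?text:{id}",
}


def resolve(resource: str, identifier: str) -> str:
    """Turn a (resource name, ID) pair into a URL via its template."""
    template = RESOLUTION_TEMPLATES.get(resource.lower())
    if template is None:
        raise KeyError(f"no resolution template registered for {resource!r}")
    return template.format(id=identifier)


# A minimal citation is just a resource name and an ID.
minimal_citation = ("pdb", "102L")
resolved_url = resolve(*minimal_citation)
print(resolved_url)
print(resolve("DOI", "10.2210/pdb102l/pdb"))
```

Keeping the template separate from the ID means that if a resource changes its URL scheme, only the template needs updating and existing citations stay valid.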
34. Integrated Research
Reused from: seier+seier, Flickr
Reused from: Images Money, Flickr
Articles
Data
People
Institutions
Funders
35. A data citation should include a persistent method
for identification that is machine actionable, globally
unique, and widely used by a community.
4. Unique identification
etc..
Joint Declaration
36. 1. Discoverability through accessibility
• Deposit in a public/open database
• Where possible, structured archive (e.g. PDB,
ENA) >> unstructured archive (e.g. Zenodo,
Figshare)
• Uniquely identify it: PID, Accession number, DOI,
ROI
• Give it context: metadata (and more)
• All of the above = citable =
37. 2. Discoverability through structured data
Structured data is one of the true enablers of life science:
- Discovery of homology between genes across species
- Predicting function based on protein folds
• Structured data can be cross-analysed, compared by
algorithm, and encourages development of new products
and tools
38. Structured data is good value for money
Annual cost of generating new protein structure data in labs around the world
Annual cost of maintaining it in a central database
39. Degrees of Data
Unstructured/semi-structured
Structured
Added Value
Metadata
A picture of a graph
A spreadsheet of my results
A record in a DNA sequence database
A graphical display of a genome
A narrative with citations, pictures and attachments
Article
40. Metadata – critical to discoverability
Generic: title, submitters, date, file format, version.
citation
basic search
Wagner F.F., 23-APR-2002, TPA: Homo sapiens SMP1
gene, RHD gene and RHCE gene, INSDC, 14-NOV-2006
(Rel. 89, Last updated, Version 7). BN000065
Specific: organism, tissue, assay, page number …
deep search
analysis
computation
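Reading this slide as "generic metadata supports citation and basic search, specific metadata supports deep search, analysis and computation", the difference can be sketched with two tiny records. The first uses the values from the BN000065 example above, with the organism read off its title; the second record and all function names are invented for illustration.

```python
records = [
    {
        "id": "BN000065",
        "generic": {
            "title": "TPA: Homo sapiens SMP1 gene, RHD gene and RHCE gene",
            "submitters": "Wagner F.F.",
            "date": "23-APR-2002",
            "version": "7",
        },
        # Organism is inferred from the title; other specific fields are omitted.
        "specific": {"organism": "Homo sapiens"},
    },
    {
        "id": "EXAMPLE0001",  # invented placeholder record
        "generic": {
            "title": "Example mouse dataset",
            "submitters": "Doe J.",
            "date": "01-JAN-2015",
            "version": "1",
        },
        "specific": {"organism": "Mus musculus"},
    },
]


def basic_search(term):
    """Free-text match over generic metadata values."""
    term = term.lower()
    return [
        r["id"]
        for r in records
        if any(term in value.lower() for value in r["generic"].values())
    ]


def deep_search(field, value):
    """Exact match on a named, domain-specific metadata field."""
    return [r["id"] for r in records if r["specific"].get(field) == value]


basic_hits = basic_search("RHD gene")
deep_hits = deep_search("organism", "Mus musculus")
print(basic_hits, deep_hits)
```

The basic search can only find what happens to be written in a title or submitter field, whereas the deep search works on a controlled field that an algorithm can rely on.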