BRIF workshop Toulouse 2012 Digital IDs subgroup

BRIF Digital identifiers subgroup

Gudmundur A. Thorisson <gt50@leicester.ac.uk> GEN2PHEN / University of Leicester
Pierre-Antoine Gourraud <pierreantoine.gourraud@ucsf.edu> UCSF

-- Overview --
‣Brief backgrounder on identification & digital identifiers
‣Use cases for bio-resource identification in BRIF
‣Digital resources: datasets, databases (Mummi)
‣Non-digital resources: projects, studies, cohorts [...] (Pierre)

‣Conclusions and next steps

This work is published under the Creative Commons Attribution license
(CC BY: http://creativecommons.org/licenses/by/3.0/) which means that
it can be freely copied, redistributed and adapted, as long as proper
attribution is given.

Monday, 22 October 12

BRIF workshop, Toulouse Oct 22 2012

BRIF and bio-resource identification
• The identification requirement: need to identify resources in order to
– track use/reuse and impact
– credit those who contribute to them

• Example: biobanking projects frequently rely on...
– Project/study/cohort names
• Example: the GAZEL study in France (>20 years) http://www.gazel.inserm.fr
• Challenges:
- ad hoc agreements with research groups who reuse samples or data
- painstaking manual searching through literature for mentions of ‘GAZEL’
- project names are often ambiguous in a global context

– Citations to journal publications
• Which paper to cite? Tricky to keep track of which citations are relevant to impact
• Also troublesome if there is no paper to cite (e.g. for a new study)

Digital identifiers - some background
• Definition: a digital identifier is a character string used to uniquely
identify i) a digital object in a computer system, or ii) a record in a
computer system which describes a non-digital object
• Persistence - once assigned, identifier MUST NOT change
• Uniqueness - global scope vs local scope
– Most ID schemes require tacit knowledge of the type of identifier to interpret it
• Example: EC grant identifiers in acknowledgement statements

"This work has received funding from the European Community's
Seventh Framework Programme (FP7/2007-2013) under grant
agreement number 200754 - the GEN2PHEN project."

(the bare number 200754 is meaningful only to a reader who already knows
it is an FP7 grant agreement number)
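The grant-number example can be sketched in code. This is a minimal illustration, not an established scheme: the scheme label "ec-fp7-grant", the `QualifiedId` type and the regular expression are assumptions made up for this sketch. It shows that the bare number only becomes unambiguous once it is paired with the kind of identifier it is.

```python
import re
from typing import NamedTuple, Optional


class QualifiedId(NamedTuple):
    scheme: str    # hypothetical namespace label, not an official registry name
    local_id: str  # the identifier exactly as it appears in the text


# Assumption: the acknowledgement uses the phrase "grant agreement number <digits>".
GRANT_PATTERN = re.compile(r"grant\s+agreement\s+number\s+(\d+)", re.IGNORECASE)


def extract_fp7_grant(statement: str) -> Optional[QualifiedId]:
    """Return the grant number paired with its scheme, or None if not found."""
    match = GRANT_PATTERN.search(statement)
    if match is None:
        return None
    # The scheme is supplied by us: the text itself only carries the bare number.
    return QualifiedId(scheme="ec-fp7-grant", local_id=match.group(1))


statement = (
    "This work has received funding from the European Community's "
    "Seventh Framework Programme (FP7/2007-2013) under grant "
    "agreement number 200754 - the GEN2PHEN project."
)
grant = extract_fp7_grant(statement)
```

The extraction only works because the pattern encodes what a human reader knows implicitly, which is the "tacit knowledge" problem in miniature.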


• Some problem domains require globally unique IDs
– Example: ISBNs to identify books, e.g. for copyright purposes

• Some problem domains require resolvable IDs
– Resolve = retrieve information about the thing being identified, including where
to access it (for a digital object, its location on the Internet)
– Digital Object Identifiers (DOIs) are the best known, but several other systems exist
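As a rough illustration of resolvability, a DOI name can be turned into a resolvable link by prefixing it with the public resolver at https://doi.org/. The sketch below only builds the URL and makes no network request. The syntax check is a deliberately loose assumption, and the DOI used in the usage line is hypothetical.

```python
import re
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver

# Loose check (an assumption, not the full DOI syntax): a "10." prefix with a
# numeric registrant code, a slash, then a non-empty suffix without spaces.
DOI_SYNTAX = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi_name: str) -> str:
    """Build the resolver URL for a DOI name; raise ValueError if it is malformed."""
    doi_name = doi_name.strip()
    if not DOI_SYNTAX.match(doi_name):
        raise ValueError(f"not a DOI name: {doi_name!r}")
    # Percent-encode unusual characters in the suffix but keep the slash.
    return DOI_RESOLVER + quote(doi_name, safe="/")


# Hypothetical DOI name, used only to show the shape of the result.
url = doi_to_url("10.1234/example.dataset-1")
```

Following the resulting URL in a browser or HTTP client is the "resolve" step: the resolver redirects to wherever the identified object currently lives.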


Identifier use cases in BRIF
• Three broad categories of “stuff” to identify

i) Digital resources
Resources that actually “live” in computers (born-digital or digitized content):
datasets and databases

ii) Physical resources
Resources corresponding to actual physical things: samples, groups of samples,
experimental instruments, etc.

iii) Project-level and other “meta” resources
Higher-level aggregates of things, projects, organizations, consortia etc.

NB in many cases identifiers already exist for these things, but they are
not exposed to the outside world in a usable form (i.e. made resolvable,
citable, globally unique).

Datasets
• Definition: a data set (or dataset) is a collection of data, often presented in
tabular form but in the bio-sciences also frequently in a multitude of
domain-specific formats, such as FASTA for biological sequences
• Data publication and data citation are hot topics - lots of
research and infrastructure-building activity in recent years
• Emerging best practices for data citation & attribution
• Identifiers for datasets - persistent data DOIs issued via DataCite

• Little new for BRIF to add here, except to issue recommendations
– KEY POINT: infrastructure for data preservation and access is a prerequisite for any
sort of persistent bio-dataset identification scheme. Many projects don’t have this!


Data DOI scenario (simplified)
1. Research group registers a dataset and metadata in a suitable domain
repository (or their own repository)

2. Repository archives the dataset and assigns a DOI name to it

3. Unique DOI name is used by article authors (and others) to indicate resource
reuse (ideally via formal data citation)

4. Journal article reference listings & full-text and other sources are mined to
identify references to dataset and/or downloads

5. Dataset-level metrics calculated from collected data
e.g. - total no. citations in scholarly articles
- no. secondary citations (citations to papers which cited the original dataset)
- no. downloads in the last 2 years
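Steps 4 and 5 of the scenario can be sketched as follows. This is a simplified illustration under stated assumptions: the `Article` record, the field names and the example DOI are all hypothetical, the match is a plain case-insensitive substring search (so one DOI that is a prefix of another would be over-counted), and the two-year window is approximated as 730 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class Article:
    article_id: str            # hypothetical id of the citing article
    references: List[str]      # raw reference-list strings
    cited_articles: List[str]  # ids of other articles this one cites


def cites_dataset(article: Article, dataset_doi: str) -> bool:
    """Step 4: look for the dataset DOI in the reference list (case-insensitive)."""
    needle = dataset_doi.lower()
    return any(needle in ref.lower() for ref in article.references)


def dataset_metrics(dataset_doi: str, articles: List[Article],
                    download_dates: List[date], today: date,
                    window_years: int = 2) -> Dict[str, int]:
    """Step 5: compute simple dataset-level metrics from the mined data."""
    direct = {a.article_id for a in articles if cites_dataset(a, dataset_doi)}
    # Secondary citations: citation links that point at a directly citing article.
    secondary = sum(len(direct.intersection(a.cited_articles)) for a in articles)
    cutoff = today - timedelta(days=365 * window_years)  # approximate window
    recent = sum(1 for d in download_dates if d >= cutoff)
    return {"citations": len(direct),
            "secondary_citations": secondary,
            "downloads_recent": recent}


# Hypothetical example data
articles = [
    Article("A1", ["Hypothetical cohort dataset. doi:10.1234/EXAMPLE.1"], []),
    Article("A2", ["An unrelated reference"], ["A1"]),
    Article("A3", ["Another unrelated reference"], []),
]
downloads = [date(2012, 1, 15), date(2009, 6, 1), date(2011, 3, 2)]
metrics = dataset_metrics("10.1234/example.1", articles, downloads,
                          today=date(2012, 10, 22))
```

A production pipeline would need proper reference parsing and exact DOI matching; the point here is only that a unique identifier in the reference list is what makes the counting possible at all.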


ORCID and DataCite Interoperability Network

• Persistent identifiers for connecting people and
datasets
• Two-year EC-funded project, 7 partners in Europe and the USA
• Two main proof-of-concept pilots
– Social Science data - use and citation of British Birth Cohort
Studies
• historical data, decades old, steadily being curated by lots of
different people
• high rate of reuse, often cited in papers
– High-energy physics - attribution challenges
• dealing with large numbers of authors on HEP papers - ‘dilution’ of the term
authorship
• Linking HEP papers to supporting datasets

http://odin-project.eu/

Databases
• Definition: an online database can be regarded as a collection of
data, but made accessible in a way that facilitates using the data
to answer scientific questions, via structured querying and/or free-text
searching of the data over the Internet
• Broad range, from large-scale DNA and protein sequence
repositories to small locus-specific databases
– E.g. GenBank, UniProt, GWAS Central, Ehlers-Danlos Syndrome Variant Database

• Challenges in assessing impact & attributing curators
– Reliance on citations to the database paper, if there is one (sometimes many)
• Analyzing website traffic is another indicator - highly-accessed database =~ important
– Database URLs sometimes change
– Database name + URL often mentioned only in materials & methods, no citation
– Credit via authorship impossible if there is no database journal paper

BioDBCore - global catalogue of bio-db’s
• BioDBCore aims
– annotation - organize the bio-database
‘resourceome’
– discovery - e.g. which protein sequence
databases are available?

• Who’s behind it?
– International Society for Biocuration
– Resource catalogues: Bioinformatics Links,
BioSiteMaps, NAR db-issue etc
– Working group includes reps from NAR and
DATABASE journals, MIBBI, Model
organism db’s, others

• Catalogue will have persistent
identifiers for each db entry

http://www.biosharing.org/biodbcore



From Patients to BioBanks and back…
• Persistent IDs for datasets & other digital resources
– Absolute need
• From BioresourceResearchIF to BioresourceXIF
– More than an IP address?

• Increased need for identification of sources of information in general
– Not only research purpose…
– “Big data”
– Quantified self
• Blurring the border between: research data (non-CLIA), clinically approved data,
consumer-centered data

[Figure: architecture diagram; labels: Database, Gateway & Computations, User data,
Imaging, Reference, Front-end, Individual data, groups of patients, tablet Application]

Copyright © 2012 The Regents of the University of California, USA - All rights reserved.


Conclusions / next steps
• Complex landscape, lots of problems to tackle
• Key challenge will be to get authors to use the right identifiers
– education, awareness, best practices, journal guidelines etc.
– build support into tools that researchers use

• Potential outputs from BRIF subgroup, by end of GEN2PHEN
– Continue work on whitepaper on identifiers (partially drafted earlier in the year)
– Compile recommendations for authors & biobankers, for use cases where workable
solutions exist or are emerging (data DOIs, BioDBCore)

• Need some biobanker-expert help in ID subgroup!
– Esp. to look in-depth into study catalogues with established identifier schemes
• International Clinical Trials Registry Platform
• ClinicalTrials.gov
• P3G study catalogue

Acknowledgements
GEN2PHEN Consortium
http://www.gen2phen.org/about-gen2phen/partners
Prof Anthony J. Brookes Bioinformatics Group, Leicester

This work has received funding from the European Community's Seventh
Framework Programme (FP7/2007-2013) under grant agreement number 200754 -
the GEN2PHEN project.

Contact me!

<gt50@le.ac.uk> | <gthorisson@gmail.com>
http://www.linkedin.com/in/mummi
http://www.twitter.com/gthorisson
http://www.gthorisson.name

Published under the CC BY license (http://creativecommons.org/licenses/by/3.0/)

