2. Tom Demeranville
THOR Senior Project Officer and ORCID
Software Engineer
https://orcid.org/0000-0003-0902-4386
Martin Fenner
DataCite Technical Director
https://orcid.org/0000-0003-1419-2405
Laura Rueda
DataCite Communications Director
https://orcid.org/0000-0001-5952-7630
3. Linking Data and Data
Challenges
How to cite data with the right granularity?
How to link data and contributors with the right
granularity?
Datasets that are part of larger datasets or
heterogeneous collections
Multiple versions of the same dataset
Dynamic data
4. Linking Data and Data
Challenges – Granularity of Data in ORCID Record
http://search.datacite.org/contributors/0000-0002-8635-8390
5. Linking Data and Data
Challenges – Versioned Data in ORCID Record
http://orcid.org/0000-0003-1419-2405
6. Linking Data and Data
Research – Granularity
Attribution vs. Specificity
Persistent identifiers for datasets need to
support different levels of granularity
Ideally this is done by multiple persistent
identifiers linked via Has Part/Is Part Of
relationships
Collections will play an increasingly
important role
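A minimal sketch of how a collection and its parts could be linked in both directions. The relation names HasPart and IsPartOf are DataCite relation types; the record layout and the DOIs below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Inverse pairs of the two relation types used for granularity.
INVERSE = {"HasPart": "IsPartOf", "IsPartOf": "HasPart"}


def link_parts(collection_doi, part_dois):
    """Return relation records keyed by DOI.

    The collection gets one HasPart link per part, and each part
    gets the inverse IsPartOf link back to the collection.
    """
    relations = defaultdict(list)
    for part in part_dois:
        relations[collection_doi].append(
            {"relationType": "HasPart", "relatedIdentifier": part}
        )
        relations[part].append(
            {"relationType": INVERSE["HasPart"], "relatedIdentifier": collection_doi}
        )
    return dict(relations)


# Hypothetical DOIs, for illustration only.
links = link_parts("10.1234/collection", ["10.1234/part.1", "10.1234/part.2"])
for doi, rels in links.items():
    print(doi, rels)
```

Storing both directions means a citation of a single part can still be attributed to the collection, and the collection can list everything it contains.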
7. Linking Data and Data
Research – Data Versioning
Versioning of data is important for specificity and
verifiability
Practices and expectations for versioning of data vary
widely between communities and data centers
The data repository is ultimately responsible for
decisions about versioning
General recommendations can only include high-level
best practices and common vocabulary
9. Linking Data and Data
Implementation – Cross-Linking of Databases
Cross-linking between different databases is not conceptually different
from article-data linking; implementation should follow the same
principles (see next section)
10. Linking Data and Data
Implementation – Collections
http://search.datacite.org/works/10.1594/PANGAEA.611088
11. Linking Data and Data
Demo
Collection of climate data from ship logbooks
http://search.datacite.org/works/10.1594/PANGAEA.611088
Dryad Datasets associated with a specific publication
http://search.datacite.org/works/10.5061/DRYAD.9R161.1
12. Linking Data and Articles
Challenges
Data underlying the findings described in a manuscript
not always fully available
Data underlying the findings described in a manuscript
made available, but hidden in supplementary information
and not easily findable
Data underlying the findings described in a manuscript
made available, but not properly linked to/from article
13. Linking Data and Articles
Implementation – Follow FAIR Data Principles
From: http://slideshare.net/lshtm/preparing-data-for-sharing-the-fair-principles
14. Linking Data and Articles
Research - Conceptual Model
Linkage as Triples. In the form
subject-predicate-object, consistent with the Resource
Description Framework (RDF) data model.
Describing the relation. Additional information
such as relation type (e.g. A is new version of B)
and provenance.
Persistent Identifiers as HTTP URIs. This makes
them actionable, and compatible with the RDF
data model.
Centralized infrastructure for persistent identifier
linking. Provided for example by ORCID and
DataCite, facilitating discovery.
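The conceptual model above can be sketched as a small data structure. This is an illustration under assumptions: the field names, the "source" provenance field and the DOIs are hypothetical; IsSupplementTo is one of the DataCite relation types.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One article/data link as a subject-predicate-object triple."""
    subject: str    # persistent identifier as an HTTP URI
    predicate: str  # relation type, e.g. "IsSupplementTo"
    obj: str        # persistent identifier as an HTTP URI
    source: str     # provenance: who asserted the link (hypothetical field)


def doi_uri(doi: str) -> str:
    """Express a DOI as an actionable HTTP URI."""
    return "https://doi.org/" + doi


# Hypothetical DOIs, for illustration only.
link = Link(
    subject=doi_uri("10.1234/dataset"),
    predicate="IsSupplementTo",
    obj=doi_uri("10.1234/article"),
    source="example-repository",
)
print(link)
```

Because subject and object are plain HTTP URIs, the same record can be followed in a browser or loaded into an RDF store without further translation.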
15. Linking Data with Articles
Implementation – Discover Article/Data Links
DataCite Event Data (https://eventdata.datacite.org)
Collect, aggregate and make available article/data links from DataCite
metadata and other sources
Crossref Event Data (https://api.eventdata.crossref.org)
Collect and make available article/data links from Crossref metadata
and other sources
OpenAIRE Data/Literature Linking Service
(http://dliservice.research-infrastructures.eu)
Collect and make available article/data links from a variety of sources
16. Linking Data with Articles
Implementation – Exchange Article/Data Links
Standard metadata for exchanging Article/Data Links
Joint Collaboration within RDA/WDS Data Publishing Services WG
(http://www.scholix.org/guidelines)
Link Exchange between Crossref and DataCite
Using the same open source software
(https://github.com/lagotto/lagotto) for their respective Event Data
services
17. Linking Data with Articles
Demo
Supplementary Information hosted in Data Repository
http://search.datacite.org/works/10.6084/M9.FIGSHARE.3427304
Five datasets from Cambridge Crystallographic Data Centre linked to the same article
http://search.datacite.org/works/10.1021/acs.cgd.6b00527
Software library described in Journal of Open Source Software
http://search.datacite.org/works/10.21105/joss.00026
PLOS articles linked with at least one DataCite DOI
http://search.datacite.org/data-centers/340
DataCite DOI -> Crossref DOI links exported from DataCite to Crossref
http://api.eventdata.crossref.org/works?source_id=datacite_crossref
18. In practical terms...
Real interoperability is much more than a framework:
• Compatible data models
• Metadata quality
• Development effort
• Coordination
During this first year, THOR has:
• Assessed how artefacts, contributors, organisations and
others are modelled
• Explored different implementations (ADS, Dryad… )
• Proposed approaches to overcome mismatches
23. Metadata compatibility - ORCID/DataCite
• Lack of standards
• Low adoption
• Organisations:
• ISNI / Ringgold / Others
• Open standard?
• Funding, projects:
• Crossref’s Open Funder Registry
• Coverage and quality?
24. The results!
• ORCID Auto-Update:
Whenever a publication or a dataset
receives a DOI and its metadata
contains ORCID iDs, the ORCID
record of the author(s) can be
updated automatically!
• Authors receive a notification (inbox)
• They can configure:
• Accept updates automatically
• Level of privacy
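A rough sketch of the first step of such an auto-update: finding the ORCID iDs in a DOI metadata record. The record layout and the function names are hypothetical and do not reflect the actual Crossref or DataCite schema. The check character of an ORCID iD follows ISO 7064 MOD 11-2, which is what has_valid_checksum implements.

```python
import re

# A 16-character ORCID iD at the end of a string, with or without a URL prefix.
ORCID_PATTERN = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def has_valid_checksum(orcid_id: str) -> bool:
    """Verify the final character using ISO 7064 MOD 11-2."""
    chars = orcid_id.replace("-", "")
    total = 0
    for digit in chars[:-1]:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[-1] == expected


def orcid_ids_to_update(metadata: dict) -> list:
    """Collect well-formed ORCID iDs from a (hypothetical) metadata record."""
    found = []
    for creator in metadata.get("creators", []):
        match = ORCID_PATTERN.search(creator.get("orcid", ""))
        if match and has_valid_checksum(match.group(1)):
            found.append(match.group(1))
    return found


record = {
    "doi": "10.1234/example",  # hypothetical DOI
    "creators": [
        {"name": "Martin Fenner", "orcid": "https://orcid.org/0000-0003-1419-2405"},
        {"name": "No Identifier"},
        {"name": "Typo", "orcid": "0000-0003-1419-2406"},  # wrong check character
    ],
}
print(orcid_ids_to_update(record))
```

Only creators with a valid iD would trigger a notification; the author's own settings then decide whether the update is accepted automatically.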
26. The results!
• EThOS is the UK’s thesis service,
offering search and discovery of all
UK theses, and direct access to all
those that are digitally, openly available.
27. The results!
• PANGAEA archives, publishes and
distributes geo-referenced data about
climate variability, the marine
environment and geological research.
• PANGAEA attempts to resolve ORCID
iDs and annotate author names using a
heuristic algorithm
• Data citations from literature are rare!
• PANGAEA is keeping track of the link from
datasets back to articles (“reverse links”)
28. Linking Data and Contributors
Implementation – ORCID Search and Link
http://search.datacite.org/works?query=martin+fenner
29. Linking Data and Contributors
Implementation – ORCID Auto-Update
https://profiles.datacite.org/users/me
30. Linking Data and Contributors
Demo
Link Works via ORCID record
https://orcid.org/my-orcid
DataCite/ORCID Search and Link after authenticating with ORCID
https://profiles.datacite.org/users/me
32. Remember this?
It’s the crosslinks between EMBL-
EBI databases
Most of those databases use
different identifier types
There are 560 collections!
This can make things tricky
Linking identifier types
Case study - identifier types in the life sciences
33. ORCID currently supports 33 identifier types, such as DOIs.
These are part of a fixed vocabulary, with associated rules about
validation and how to resolve them.
Adding a new one can be difficult; adding 500 is really difficult.
We now know that this does not scale.
But to fully realise our mission, we need to be able to do it, and
so do others.
Linking identifier types
Case study - External identifiers at ORCID
35. Linking identifier types
Challenges - Resolution
Not all of them are resolvable
Ideally, they’d already be URIs, but that’s not the case.
Mandating URIs is problematic as it could exclude large parts of
the community with established practice
How do we turn the “foo” identifier with value “bar” into a URI so
that the identifier can be resolved?
Do we need a set of transformation rules?
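One possible shape for such transformation rules, as a sketch: each identifier type carries a validation pattern and a URI template. The rule table, its two patterns and the function name are assumptions made for illustration, not an existing registry.

```python
import re

# Hypothetical rule table: identifier type -> (validation pattern, URI template).
RULES = {
    "doi": (re.compile(r"^10\.\d{4,9}/\S+$"), "https://doi.org/{id}"),
    "pdb": (re.compile(r"^[0-9][A-Za-z0-9]{3}$"), "http://identifiers.org/pdb/{id}"),
}


def to_uri(id_type: str, value: str):
    """Turn an identifier of a known type into a resolvable URI.

    Returns None when no rule exists for the type, and raises
    ValueError when the value does not match the type's pattern.
    """
    rule = RULES.get(id_type.lower())
    if rule is None:
        return None
    pattern, template = rule
    if not pattern.match(value):
        raise ValueError(f"{value!r} is not a valid {id_type} identifier")
    return template.format(id=value)


print(to_uri("pdb", "3coj"))
print(to_uri("doi", "10.1594/PANGAEA.611088"))
print(to_uri("foo", "bar"))  # no rule for "foo", so nothing to resolve
```

The unresolved "foo"/"bar" case is exactly the gap the slide describes: without a shared rule, the identifier cannot be turned into anything actionable.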
36. Linking identifier types
Challenges - Equivalence
Identifiers as URIs can introduce another
problem - Some have more than one
representation, in more than one place
The Protein Data Bank identifier (PDB)
“3coj” can be resolved in lots of places:
• PDB Europe: http://www.ebi.ac.uk/pdbe/entry/pdb/3coj
• PDB Japan: http://pdbj.org/mine/summary/3coj
• RCSB Protein Data Bank:
http://www.rcsb.org/pdb/explore/explore.do?structureId=3coj
• Proteopedia: http://proteopedia.org/wiki/index.php/3coj
• PDBsum: https://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=3coj
37. Linking identifier types
Challenges - Equivalence (2)
These URLs all point at the same conceptual entity. But for
systems that group entities by identifiers, this can be a problem.
How do we check for equivalence?
How do we transform the URI into an identifier?
Can we separate the location of things from their identifier?
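A sketch of one way to check equivalence: map each known URL form back to a (type, identifier) pair and compare the pairs. The patterns are written from the PDB example URLs on the previous slide; treating them as a complete or authoritative list is an assumption.

```python
import re

# URL forms taken from the example URLs for PDB entry "3coj".
PDB_URL_PATTERNS = [
    re.compile(r"^https?://www\.ebi\.ac\.uk/pdbe/entry/pdb/(\w{4})$"),
    re.compile(r"^https?://pdbj\.org/mine/summary/(\w{4})$"),
    re.compile(r"^https?://www\.rcsb\.org/pdb/explore/explore\.do\?structureId=(\w{4})$"),
    re.compile(r"^https?://identifiers\.org/pdb/(\w{4})$"),
]


def canonical_pdb(url: str):
    """Return ("pdb", id) for a recognised PDB URL, otherwise None."""
    for pattern in PDB_URL_PATTERNS:
        match = pattern.match(url)
        if match:
            return ("pdb", match.group(1).lower())
    return None


def equivalent(url_a: str, url_b: str) -> bool:
    """Two URLs are equivalent when they reduce to the same identifier."""
    a, b = canonical_pdb(url_a), canonical_pdb(url_b)
    return a is not None and a == b


print(equivalent(
    "http://www.ebi.ac.uk/pdbe/entry/pdb/3coj",
    "http://pdbj.org/mine/summary/3coj",
))
```

This keeps the location (the URL) separate from the identity (the type and value), at the cost of maintaining a pattern list per identifier type.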
38. Linking identifier types
Challenges - Maintenance
People may define the same thing in different ways.
For example, the display name, validation rules or resolution URIs
Working with multiple identifiers from multiple sources quickly
becomes difficult. It’s a jumbled pile of bilateral agreements.
Who owns the definition, who updates it, where is it kept?
How do we handle overlaps and conflicts?
How do we make the process hassle free and timely?
39. Linking identifier types
Challenges - Usability
Presenting a list of a thousand identifier types
to a user is bad.
Where do definitions and display names come
from, what about internationalisation etc?
Are users expected to know the URI of their
identifiers or the identifier itself?
Should systems be able to recognise and
transform between representations?
40. Linking identifier types
What are we doing to address these issues?
1: ORCID are working with EBI to integrate with
systems such as MIRIAM and identifiers.org
2: Refactoring the ORCID registry to streamline the
addition of identifier types
3: Investigating how ORCID might enable member
defined identifier types
41. The life sciences community
realised the issues and did
something about it. They
developed the MIRIAM registry.
It provides the data required to
transform local identifiers into
URIs, enabling resolution of
metadata and the data itself.
Decouples the identification of an
entity from its location on the Web.
Linking identifier types
Integration - identifiers in the life sciences
42. Identifiers.org is a service built on
top of the MIRIAM registry
It turns the URNs used by
MIRIAM into URLs for the web
It provides persistent resolvable
identifiers. The PDB identifier
“3coj” can be resolved at
http://identifiers.org/pdb/3coj
Linking identifier types
Integration - identifiers in the life sciences
Image from: Identifiers.org and MIRIAM Registry: community resources to provide persistent identification,
http://doi.org/10.1093/nar/gkr1097
43. Linking identifier types
Integration - identifiers in the life sciences
ORCID will reference these services for life science identifiers,
but there are still unanswered questions, which may have
multiple correct answers.
Does ORCID work with “3coj”, an identifier of type PDB?
Or with “http://identifiers.org/pdb/3coj”, an identifier of type identifiers.org?
Or is it some hybrid system that works with both?
THOR provides the platform to help answer these types of
questions.
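As one illustration of the hybrid option, the sketch below accepts either representation and reduces both to the same record. It assumes the identifier type name matches an identifiers.org prefix, which will not hold for every type; the function and field names are hypothetical.

```python
from urllib.parse import urlparse


def normalise(id_type: str, value: str) -> dict:
    """Accept a bare identifier or an identifiers.org URL.

    Both inputs are reduced to the same type, value and URI, so a
    system can group records regardless of which form was supplied.
    """
    if id_type.lower() == "identifiers.org":
        parsed = urlparse(value)
        if parsed.netloc != "identifiers.org":
            raise ValueError(f"not an identifiers.org URL: {value!r}")
        prefix, _, local_id = parsed.path.lstrip("/").partition("/")
        if not prefix or not local_id:
            raise ValueError(f"cannot split prefix and identifier: {value!r}")
    else:
        prefix, local_id = id_type.lower(), value
    return {
        "type": prefix,
        "value": local_id,
        "uri": f"http://identifiers.org/{prefix}/{local_id}",
    }


bare = normalise("PDB", "3coj")
as_url = normalise("identifiers.org", "http://identifiers.org/pdb/3coj")
print(bare == as_url)
```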
44. Controlled vocabularies can, in fact, impede interoperability by
restricting links to specific systems. Yet we need to know what
is valid and what isn’t.
ORCID is moving to a system whereby the identifier vocabulary is
well understood and defined, yet not fixed, and is easily extensible
on demand.
Clients can query the current list of identifier types using the
public API. We will soon add the rules associated with them
https://pub.sandbox.orcid.org/v2.0_rc2/#!/Identifier_API/viewIdentifierTypes
Linking identifier types
Integration - ‘un’controlled vocabularies identifier types
45. The communities that use identifiers and the databases that
create them are the best places to define and maintain their
definitions
We’re investigating whether the ORCID registry could enable external
clients to define identifier types and the rules that go with them,
on-the-fly, for re-use by themselves and others.
We’re evaluating to see if this will meet the needs of scholarly
communication including EBI, CERN, DRYAD, PANGAEA and
the communities they serve.
Linking identifier types
Integration - ‘un’controlled vocabularies identifier types
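A sketch of what client-defined identifier types might look like, under stated assumptions: the class names, fields and the "example-client" owner are hypothetical and do not describe the ORCID registry or its API. The point is that types can be added on demand while conflicting redefinitions are still rejected.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierType:
    name: str          # short type name, e.g. "pdb"
    display_name: str  # label shown to users
    pattern: str       # validation rule as a regular expression
    uri_template: str  # resolution rule with an {id} placeholder
    owner: str         # who maintains the definition


class Registry:
    """An extensible vocabulary: open to additions, closed to conflicts."""

    def __init__(self):
        self._types = {}

    def register(self, id_type: IdentifierType) -> None:
        key = id_type.name.lower()
        existing = self._types.get(key)
        if existing is not None and existing != id_type:
            raise ValueError(f"conflicting definition for {key!r}")
        re.compile(id_type.pattern)  # reject unusable rules early
        self._types[key] = id_type

    def is_valid(self, name: str, value: str) -> bool:
        id_type = self._types.get(name.lower())
        return id_type is not None and re.fullmatch(id_type.pattern, value) is not None

    def names(self) -> list:
        return sorted(self._types)


registry = Registry()
registry.register(IdentifierType(
    name="pdb",
    display_name="Protein Data Bank",
    pattern=r"[0-9][A-Za-z0-9]{3}",
    uri_template="http://identifiers.org/pdb/{id}",
    owner="example-client",  # hypothetical owner
))
print(registry.names(), registry.is_valid("pdb", "3coj"))
```

Recording an owner with each definition gives one possible answer to the earlier maintenance question of who updates a definition and where it is kept.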
46. Some of the images in these slides were designed by
freepik.com
THOR is funded by the European Commission under call H2020-EINFRA-2014-2, project
number 654039
Editor's notes
For example: a user might wish to claim some sequencing data within their ORCID record (or the database might want to add it for them). How can this happen if they do not have a known identifier? Having other-id:218751258217 in a record doesn’t help anyone.
We could treat PDB Europe identifiers as being conceptually as different from PDB Japan as DOIs are from Handles. This benefits from simplicity but effectively ignores the problem, providing no way of associating the two identifiers.
Reverse lookup is required, the ability to query these resolving services in the reverse direction, such that, for example, a query for http://www.ebi.ac.uk/pdbe/entry/pdb/3coj points to a common ‘umbrella’/’collection’ identifier such as http://identifiers.org/pdb/3coj.
(but is that now the identifier? A meta identifier)
(yes, it does matter!)
(compounded by the fact that some identifiers in identifiers.org already exist within ORCID, e.g. PMC identifiers.)
ARGH!
This will enable new identifier types and their associated metadata (for example multi-language descriptions) to be added to the registry in response to community needs.
E.g. IGSN
Once we've done the evaluation then we will put it on the dev roadmap.