Introduction to the bridge between the DataverseNL data repository for ongoing research and the EASY trusted digital repository,
presented at the workshop PID Information Types for the Social Sciences.
1. dans.knaw.nl
DANS is an institute of KNAW and NWO
CESSDA Persistent Identifiers
Workshop PID Information Types for the Social Sciences
May 29, 2017, The Hague
Vyacheslav Tykhonov
Senior Information Scientist (DANS)
vyacheslav.tykhonov@dans.knaw.nl
2. dans.knaw.nl
DANS data repositories with Persistent Identifiers
Within the context of DANS’ mission, it is obligatory that every (digital) object
archived via DANS has a PID, so that it can be (re)located and cited. DANS
uses PIDs for both (digital) objects and people.
DataverseNL for ongoing research projects
• every dataset has its own handle (for Dutch universities)
• revisions of a dataset do not change the handle; each new version changes only
the citation
EASY for permanent data archiving (DOIs)
• every archived dataset has a DOI
• every version of a dataset archived from DataverseNL gets a new DOI
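The difference between the two identifier policies can be sketched in a few lines of Python. This is only an illustration of the rules on this slide, not code from DataverseNL or EASY: the `Dataset` class, its methods, and the handle and DOI values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Dataset:
    """Hypothetical model: one stable handle, one DOI per archived version."""
    handle: str                      # assigned once, never changes
    version: int = 1
    archived_dois: Dict[int, str] = field(default_factory=dict)  # version -> DOI

    def revise(self) -> None:
        # A new revision keeps the handle; only the version in the citation changes.
        self.version += 1

    def archive(self, doi: str) -> None:
        # Every version archived in the long-term archive gets its own DOI.
        self.archived_dois[self.version] = doi

    def citation(self) -> str:
        parts = [f"V{self.version}", f"hdl:{self.handle}"]
        doi = self.archived_dois.get(self.version)
        if doi:
            parts.append(f"doi:{doi}")
        return ", ".join(parts)


ds = Dataset(handle="12345/EXAMPLE")   # made-up handle
ds.archive("10.5072/example-v1")       # made-up DOI for version 1
ds.revise()
ds.archive("10.5072/example-v2")       # made-up DOI for version 2
print(ds.citation())
```

After two archived versions the handle is unchanged, while `archived_dois` holds one DOI per version.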
3. dans.knaw.nl
• DANS has developed a plugin to archive datasets deposited
in Dataverse temporary storage in Trusted Digital
Repositories (TDRs)
• Before putting datasets in the long-term archive, users
should create an account in the TDR and get the proper permissions
to archive their data
• The archival plugin is open source software and can easily be
extended to support any TDR:
https://github.com/DANS-KNAW/dataverse-bridge
4. dans.knaw.nl
The “Archive” button is available to local Dataverse administrators
to push datasets to the EASY archive for long-term preservation.
5. dans.knaw.nl
The administrator can choose where to archive the dataset:
Archivematica, Islandora, FEDORA or DANS EASY
(EASY is the default option).
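One way to picture a choice of archive target with a default is a small registry of handlers. The sketch below is an assumption about how such a dispatch could look, not the plugin's actual design: `ARCHIVERS`, `register`, `archive` and the handler functions are hypothetical names, and the handlers only return a status string instead of contacting a repository.

```python
from typing import Callable, Dict

# Hypothetical registry: target name -> function that archives a dataset.
ARCHIVERS: Dict[str, Callable[[str], str]] = {}
DEFAULT_TARGET = "EASY"  # the slide names EASY as the default option


def register(name: str):
    """Decorator that adds a handler to the registry under the given name."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        ARCHIVERS[name] = func
        return func
    return wrap


@register("EASY")
def to_easy(dataset_id: str) -> str:
    return f"queued {dataset_id} for EASY"              # placeholder action


@register("Archivematica")
def to_archivematica(dataset_id: str) -> str:
    return f"queued {dataset_id} for Archivematica"     # placeholder action


def archive(dataset_id: str, target: str = DEFAULT_TARGET) -> str:
    """Dispatch to the chosen target, falling back to the default."""
    try:
        handler = ARCHIVERS[target]
    except KeyError:
        raise ValueError(f"unsupported archive target: {target}") from None
    return handler(dataset_id)


print(archive("ds-1"))
print(archive("ds-1", "Archivematica"))
```

Supporting a further repository then means registering one more handler, without touching the dispatch function.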
6. dans.knaw.nl
The archiving process runs in the background: it extracts
the data and metadata from the dataset and creates an
archival (BagIt) package containing all files and their
checksums.
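To make the idea of a package with files and checksums concrete, here is a minimal sketch of a BagIt-style layout written with the Python standard library only. It is not the plugin's packaging code: `make_bag` and `verify_bag` are hypothetical helpers, the file names are invented, and the bag carries only the declaration file, a `data/` payload directory and a SHA-256 manifest.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def make_bag(files: Dict[str, bytes], bag_dir: Path) -> Path:
    """Write payload files under data/ and list their checksums in a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration file expected at the top level of a bag.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each checksum in the manifest and compare it with the file."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


bag = make_bag(
    {"survey.csv": b"id,answer\n1,yes\n", "metadata.json": b"{}"},  # invented files
    Path(tempfile.mkdtemp()) / "dataset-bag",
)
bag_ok = verify_bag(bag)
print(bag_ok)
```

The manifest is what lets the receiving archive confirm that every file arrived unchanged.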
7. dans.knaw.nl
Once archiving has finished, the “Archive” button
disappears from the page. The dataset citation
is extended with a DOI pointing to the archived version
of the dataset in EASY.
8. dans.knaw.nl
The archived version of the dataset is available on
its EASY landing page and can be cited in
research papers.
9. dans.knaw.nl
The archived dataset automatically gets a
DOI and a URN pointing to the archived revision
(version) of the dataset.
10. dans.knaw.nl
All files in the dataset get permission levels
corresponding to the versions of the files
stored in Dataverse.
11. Dataverse as Archival Service
• We are working on extending Dataverse with DOIs
generated for every version of a dataset, so that it can work as
permanent storage
• Citations can contain duplicate metadata, but the dataset content
(data files) should be different
• The archival part can be hosted by the same Dataverse,
depending on the plugin settings
12. CESSDA PID plugin
• Universal plugin to get DOIs and handles in the same
Dataverse instance
• The prefix of every organisation will be generated based on the
configuration and authentication settings of the plugin
• Switches Dataverse between support for ongoing research and
archiving (in separate sub-dataverses)
13. Challenges
• We need a PID “Proxy” Service that collects information about all
DOIs generated for different versions of datasets with handles
• depending on the location and status of the dataset, every
citation should contain a handle (Netherlands), URN:NBN
(Europe) and DOI (worldwide)
• statistics about all citations of datasets in research papers
should be aggregated and provided as part of the “Proxy” Service
to build our own “PageRank” index
• Big Data and Linked Open Data archiving with Persistent
Identifiers
• higher level of granularity for separate files, subsets,
fragments, time services to make citations more accurate
• maintenance of tombstone pages
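The core of such a “Proxy” Service, linking one handle to the DOIs of its archived versions and aggregating citation counts, could be sketched as follows. This is an in-memory illustration under stated assumptions, not an existing DANS service: `PidProxy` and its methods are hypothetical, the identifiers are made up, and the plain citation count stands in for whatever ranking index would eventually be built.

```python
from collections import defaultdict
from typing import Dict, List


class PidProxy:
    """Hypothetical in-memory index: handle -> version DOIs, DOI -> citations."""

    def __init__(self) -> None:
        self._dois: Dict[str, Dict[int, str]] = defaultdict(dict)
        self._cited: Dict[str, int] = defaultdict(int)

    def register(self, handle: str, version: int, doi: str) -> None:
        # Record the DOI minted for one archived version of the dataset.
        self._dois[handle][version] = doi

    def record_citation(self, doi: str) -> None:
        # Count one citation of a specific archived version.
        self._cited[doi] += 1

    def dois_for(self, handle: str) -> List[str]:
        # All DOIs known for a handle, ordered by version number.
        versions = self._dois.get(handle, {})
        return [versions[v] for v in sorted(versions)]

    def citation_count(self, handle: str) -> int:
        # Aggregate citations over every archived version of the dataset.
        return sum(self._cited.get(doi, 0) for doi in self.dois_for(handle))


proxy = PidProxy()
proxy.register("12345/EXAMPLE", 2, "10.5072/example-v2")  # made-up identifiers
proxy.register("12345/EXAMPLE", 1, "10.5072/example-v1")
proxy.record_citation("10.5072/example-v1")
proxy.record_citation("10.5072/example-v2")
proxy.record_citation("10.5072/example-v2")
print(proxy.dois_for("12345/EXAMPLE"), proxy.citation_count("12345/EXAMPLE"))
```

Aggregating per handle means a dataset keeps its citation record even when papers cite different archived versions.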
14. Big Data repository with Persistent Identifiers
The approach is suitable for product development companies (industry) and
organisations and institutions (CESSDA) looking for sustainable (Big) data
archiving services.
A Big Data object in Dataverse consists of:
• metadata with authorship and citation information
• data usage licence
• persistent DOI or handle
• information on how to obtain a key (API token) to start using the API endpoint(s)
• link to API endpoint delivering data
• representation of API (interactive documentation, Swagger)
• data provenance
• controlled vocabularies to meet domain specific community standards (optional)
A public demonstration is available on the Dataverse demo website.
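The components of a Big Data object listed on this slide can be read as a simple record with one optional field. The sketch below is only an illustration of that list, not a Dataverse schema: the `BigDataObject` class, its field names and all example values are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class BigDataObject:
    """Hypothetical record mirroring the components listed on the slide."""
    metadata: Dict[str, str]      # authorship and citation information
    licence: str                  # data usage licence
    pid: str                      # persistent DOI or handle
    token_instructions: str       # how to obtain an API token
    api_endpoint: str             # link to the endpoint delivering the data
    api_docs: str                 # interactive documentation of the API
    provenance: str               # data provenance
    vocabularies: List[str] = field(default_factory=list)  # optional

    def missing_parts(self) -> List[str]:
        # Report required components that were left empty.
        return [
            f.name
            for f in fields(self)
            if f.name != "vocabularies" and not getattr(self, f.name)
        ]


obj = BigDataObject(
    metadata={"author": "Example Author", "title": "Example dataset"},
    licence="CC0",
    pid="hdl:12345/EXAMPLE",                    # made-up identifier
    token_instructions="Request a token from the repository account page",
    api_endpoint="https://example.org/api/data",
    api_docs="",                                # deliberately left empty
    provenance="Derived from an example survey",
)
print(obj.missing_parts())
```

A check like `missing_parts` makes it easy to see which required component a deposit still lacks before it is published.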