Researcher data management shared service for the UK – John Kaye, Jisc
Hydra - Tom Cramer, Stanford University and Chris Awre, University of Hull
Addressing the preservation gap at the University of York - Jenny Mitcham, University of York
Emulation developments - David Rosenthal, Stanford University
Jisc and CNI conference, 6 July 2016
5. Contents
»Background and Policy Context
»Sector Requirements
»Shared Service
»Timescales
»Engagement
14/07/2016 Jisc-CNI Research Data Shared Service 5
7. Research Funder Policies
» Public good: Publicly funded research data are produced in the public interest and should be made
openly available with as few restrictions as possible
» Planning for preservation: Institutional and project specific data management policies and plans
needed to ensure valued data remains usable
» Discovery: Metadata should be available and discoverable; Published results should indicate how to
access supporting data
» Confidentiality: Research organisation policies and practices to ensure legal, ethical and commercial
constraints assessed; research process should not be damaged by inappropriate release
» First use: Provision for a period of exclusive use, to enable research teams to publish results
» Recognition: Data users should acknowledge data sources and terms & conditions of access
» Public funding: Use of public funds for RDM infrastructure is appropriate and must be efficient and
cost-effective
RCUK Common Principles on Data Policy
8. EPSRC Policy
» Retained EPSRC-funded research data is preserved for a minimum of ten years
» Effective data curation is provided throughout full data lifecycle
» Knowledge of publicly-funded research data holdings
» Discoverability; recording of third party access requests
» Notice and justification of access restrictions, for example ‘commercially confidential’
» Awareness and use of relevant law, for example FOI
» Awareness of and compliance with research data policies
» Adequate RDM resource allocation for example from quality-related research (QR)
funding or research grants
9. Strategic guidance from…
»UCISA research IT systems group:
› Procure a shared national RDM service
»UUK research policy network discussion:
› Concern over multiple solutions
10. What would you like Jisc to Provide?
2015 Research Systems Survey:
» “Currently the UK is running a very inefficient model requiring individual institutions to
establish their own repositories. Influencing future central/research council provision would be
useful”
» “A national data repository”
» “Increasing use of CRISes to fulfill traditional repository functions does not seem to be
prioritised as an issue by JISC……”
» “If not able to provide e.g. data repositories, influence funder or sector/community provision to
support the needs of that funder/community.”
» “Data access and user tracking tools and statistics on shared archive services”
» “Development of the national research data registry. This will have implications for institutional
research data registry development.”
11. A Key Requirement - Preservation
12. A Key Requirement - Interoperability
13. Vision
»Researchers shouldn’t need to think (too much!) about
Research Data Management
»“Visible data, invisible infrastructure”
› Provide researchers intuitive, easy functionality to
publish, archive and preserve their research outputs.
› Provide interoperable systems to allow researchers and
institutions to fulfil and go beyond policy requirements
and adhere to best practice throughout the RDM
lifecycle.
14. Why a Shared Service?
» There is no single, easily available “solution” that meets universities’ requirements for
enabling Research Data Management
» More effective Research Data Management must happen to comply with Funder
Mandates, ensure data is not lost, and to realise a whole range of positive
benefits
» A shared service (provided by Jisc) seems to offer a number of benefits:
› Cost savings and efficiencies
› Common approaches and practice
› Research system standardisation and interoperability
› Others…
15. Pilot Institutions
» Pilot institutions selected to create a balanced portfolio of types of institution,
specialisms and research systems already in place
Institution Name
Cardiff University
CREST - Consortium for Research Excellence, Support and Training (Buckinghamshire
New University, Harper Adams, St Mary’s - Twickenham, UCA & Winchester)
Imperial College of Science, Technology and Medicine
Middlesex University
Plymouth University
Royal College of Music
St George's Hospital Medical School
University of Cambridge
University of Lancaster
University of Lincoln
University of St Andrews
University of Surrey
University of York
16. Pilots’ MVPs
»“Easy to use and cost effective archiving, ingest, preservation,
repository, reporting and discovery supported that can handle
sensitive data”
»“Robust data storage that has growth ability for active and archive
data”
»“Standard metadata profile - international for interoperability”
»“Integration with all main CRIS systems”
»“Meets REF and funder deposit requirements (supports deposit of
REF data output types)”
»…..........
18. Where are we now?
19. Research at Risk Portfolio
20. Project Support
Consultancy | Description
RDM Costing (Cambridge Econometrics) | To investigate current costing practices, tools, models and potential future developments in the field of RDM costing; this work is being applied to developing the business model for the research data shared service pilot
Data Asset Framework (Research Consulting) | To provide the consultation phase for stakeholders in the project, not focused on the final technology solution, for example an audit of datasets, legal and compliance framework, financial and strategic commitment
Technical Architect (Digirati) | To provide expert technical advice to the project on the technical architecture of the service, assess institutional technical capability and assist in gathering detailed requirements from institutions and researchers
Metadata and Interoperability (CLAX) | To examine metadata specifications and provide advice on identifier systems and interoperability
Project Management (LM) | To provide project management support and coordinate contract negotiations, facilitate collaboration between suppliers and HEIs and monitor overall service development. This function will also gather evidence to feed into the business model for the next stage
Market Research (TBC) | To gather information on the demand for a service and to test proposed models for the business case to proceed to a production service
Preservation Audit (TBC) | To provide the requirements and priorities for RDM preservation tools development
21. Project Support
Milestones 2015-18
Apr 2015 - Dec 2015 | Jan 2016 - July 2016 | Aug 2016 - June 2017 | Jul 2017 - Sept 2017 | Oct 2017 - Apr 2018
- Requirements
- HEI Pilots selected
- Procurement commences
- Support consultancy work begins
- Supplier Framework selected
- Alpha Development
- Alpha service tested and reviewed
- Beta Development
- Feedback on Beta Service
- Detailed HEI requirements and technical architecture
- Contracting commences
- Development Phase
- Contact additional early adopter HEIs and promote Beta Service
- Business planning and begin Business Case
- Market Research and Consultation
- Promote service to institutions
- Start on next phases (service enhancement/modular)
- Institutional survey
- HEI and supplier workshops
- Pilot HEI selection process
- Business case decision
- If go then begin transition to production service
28. use a particular repository technology?
Wrong question
can we implement sustainable repository infrastructure
to serve our digital content management needs?
29. Answers to questions
• How do I manage my various collections of different digital content?
• How can I deal with the different file types I’m having to archive?
• How do I ensure I can cope with the increasing amount of digital content I need
to manage?
• How can I manage my digital content in a way that is meaningful?
• How can I ensure that I can sustain the technology choice I make?
37. A Word About…
• Flexible, Extensible, Durable, Object
Repository Architecture
• Open source digital repository
• middleware for relating your objects
and hooking them to services &
storage
• Particularly powerful for data & other
“non-simple” content types
• More than 300 adopters worldwide
• 4 major software releases since
2000
38. Used By...
Large Universities
Small Universities
Colleges
Public Broadcasting
Government Ministry
National Libraries
National Lab
Small Research Labs
National Digital Repository
Statewide Digital Libraries
Chemical Heritage Foundation
Museum of Performing Arts
A Shakespeare Festival
Used For...
Self-deposit System
Digital Collections System
Sheet Music
Architectural Resources
Electronic Theses & Dissertations
Digital Image System
Media Management
Media Preservation System
Research Data Management
Digitization Workflow System
Digital Preservation System
Digital Archives System
And more!
40. Trend 1: Move to Linked Data
PCDM (Portland Common Data Model),
for data and code interoperability
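As a rough illustration of what a shared data model buys, the sketch below expresses a collection, one object and its files as triples using PCDM's hasMember and hasFile relationships. The `ex:` identifiers, the `TRIPLES` list and the `files_in` helper are made up for this example and do not come from Hydra or Fedora code.

```python
from typing import List, Set, Tuple

# Namespace of the Portland Common Data Model vocabulary.
PCDM = "http://pcdm.org/models#"

Triple = Tuple[str, str, str]

# Hypothetical content: one collection, one member object, two files.
TRIPLES: List[Triple] = [
    ("ex:dataset-collection", PCDM + "hasMember", "ex:experiment-42"),
    ("ex:experiment-42", PCDM + "hasFile", "ex:readings.csv"),
    ("ex:experiment-42", PCDM + "hasFile", "ex:readme.txt"),
]


def files_in(collection: str, triples: List[Triple]) -> Set[str]:
    """Collect the files attached to direct member objects of a collection."""
    members = {
        o for s, p, o in triples
        if s == collection and p == PCDM + "hasMember"
    }
    return {
        o for s, p, o in triples
        if s in members and p == PCDM + "hasFile"
    }
```

Any tool that understands these two relationships can walk the same structure, which is the interoperability argument in miniature.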
41. Trend 2: Architecting Layers & Gems for Code Reuse
Hydra App Layers
Active Fedora
Hydra::PCDM
Hydra::Works
Curation Concerns
Sufia
Local customization
Hydra Gems (kinda like sprinkles)
browse-everything
hydra-editor
hydra-derivatives
hydra-role-management
hydra-shibboleth
Geomash
iiif_manifest
orcid
questioning_authority
etc.
42. Trend 3: Hydra-in-a-Box
● Directed project to produce a turnkey solution
○ ...and a hosted service
○ ...and metadata enrichment engine
● 2.5 years (May 2015 - November 2017)
● $2M grant from IMLS
● Core partners = DPLA, DuraSpace & Stanford
○ Plus significant & growing community
contributions
43. What is Hydra? Community
Hydra Connect
Mailing lists, Slack, Skype/Hangouts
Meetings – manager and technical focus
49. Hydra – getting localised
• Hydra New England (NE) regional group
• Hydra West Coast regional group
• Developer congresses
• Stanford and Michigan this year so far
• Fostering face-to-face exchange of ideas and putting them into practice
51. Hydra in (other parts of) Europe
Ireland
• Digital Repository of Ireland (based at Trinity College, Dublin)
• University College Dublin
• Maynooth University
Denmark
• Royal Library of Copenhagen
• Danish Technical University
Theatre Museum of Barcelona
Hydra Europe Symposia
• Dublin 2014
• London 2015
• ?
53. Partnership
Hydra would not work without partnership
Hydra would not work if we tried to do the same by ourselves
Partnership has brought together many different types of institution who would not
have worked together otherwise
Partnership has been stimulated by recognising a common need and finding a
way to address this together
Partnership has helped us find answers to our questions
54. Answers to questions
• How do I manage my various collections of different digital content?
• How can I deal with the different file types I’m having to archive?
• How do I ensure I can cope with the increasing amount of digital content I need
to manage?
• How can I manage my digital content in a way that is meaningful?
• How can I ensure that I can sustain the technology choice I make?
59. Why is this relevant for research data?
• Funder requirements around retention:
– NERC - data should be retained for a minimum of 10 years but
for projects of major importance this may need to be 20 years
or longer
– STFC - expect data to be retained for a minimum of 10 years
and data that cannot be re-measured should be retained
indefinitely
– Wellcome Trust – expect data to be kept for a minimum of 10
years but suggest longer periods for certain types of data
– EPSRC – expect research data to be securely preserved for a
minimum of 10 years from the date of last access
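To make the retention arithmetic concrete, here is a minimal sketch that turns the minimum periods listed above into an earliest review date. The function name, the dictionary and the choice of starting date are assumptions for illustration: only EPSRC's starting point (date of last access) is stated on the slide.

```python
from datetime import date

# Minimum retention periods in years, as summarised on the slide.
# Longer periods (NERC's 20 years for major projects, STFC's indefinite
# retention for data that cannot be re-measured) need a case-by-case decision.
MIN_RETENTION_YEARS = {
    "NERC": 10,
    "STFC": 10,
    "Wellcome Trust": 10,
    "EPSRC": 10,
}


def earliest_review_date(funder: str, reference_date: date) -> date:
    """Return the earliest date on which disposal could be considered.

    reference_date is assumed to be the deposit date or, for EPSRC,
    the date of last access.
    """
    target_year = reference_date.year + MIN_RETENTION_YEARS[funder]
    try:
        return reference_date.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: use 28 February instead.
        return reference_date.replace(year=target_year, day=28)
```

Because the EPSRC clock restarts at every access, a well-used dataset may never reach its review date, which is one reason preservation has to be planned for the long term.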
60. University of York RDM questionnaire 2013
• Which data management issues have you come
across in your research over the last five years?
– “Inability to read files in old software formats on old
media or because of expired software licences”
– 24% of 181 researchers who answered this question
admitted this had been a problem for them
…and researchers already encounter
barriers to reusing data
61. Most universities have a place to store data
[Diagram: The Open Archival Information System, annotated with these labels]
The researcher
The researcher gives data to the repository
The repository ingests the data
Access to the research data via the repository interface
Data reuse will happen here
But what about this bit?
64. Filling the digital preservation gap:
Project aim
“…to investigate Archivematica
and explore how it might be
used to provide digital
preservation functionality within
a wider infrastructure for
Research Data Management.”
65. The team
University of Hull:
• Chris Awre – Head of Information Services,
Library and Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist
University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist
Artefactual Systems
66. What have we been doing?
• Phase 1 – explore: test Archivematica, research,
do some thinking (3 months)
• Phase 2 – develop: make Archivematica better for
RDM, plan implementation (4 months)
• Phase 3 – implement: set up proofs of concept at
York and Hull, investigation of the file format
problem (6 months)
69. A quick look at file formats
Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new
This means they can be hard to preserve... because we
can’t identify them. If we can’t identify them how can we
carry out preservation activities?
71. The NDSA Levels
of Digital
Preservation:
Level 2 requires you to
know what you’ve got ...
and levels 3 and 4 build
on this
72. Can we identify our research data?
We ran DROID over the research data deposited with us
over the past year.
Out of 3752 individual files:
• for 1382 (37%) of the files a file format was identified
– 668 (48%) by signature
– 648 (47%) by extension
– 65 (5%) by container
• 34 different file formats were identified automatically
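The difference between identification by signature and by extension can be sketched as follows. This is a toy illustration, not how DROID is implemented: the two magic-number prefixes are well-known examples, the tiny lookup tables stand in for the much larger PRONOM registry, and container identification is left out.

```python
from pathlib import Path
from typing import Optional, Tuple

# Illustrative lookup tables only; real tools draw on the PRONOM registry.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
}
EXTENSIONS = {
    ".csv": "Comma-separated values",
    ".txt": "Plain text",
}


def identify(path: Path) -> Tuple[Optional[str], Optional[str]]:
    """Return (format, method); both are None if identification fails."""
    with path.open("rb") as handle:
        header = handle.read(16)
    # Prefer the byte signature: it describes the content itself.
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name, "signature"
    # Fall back to the file extension, which is only a hint.
    name = EXTENSIONS.get(path.suffix.lower())
    if name is not None:
        return name, "extension"
    return None, None
```

A file that falls through both lookups is one of the unidentified majority: nothing in the registry matches its bytes or its name.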
76. How do we improve this result?
• More file signature research required
– institutions can submit sample files to TNA
– or they can create their own file format
signatures
– digital preservation tools (e.g. Archivematica) can
help us with better reporting on unidentified files
We can improve the tools if we work together
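One simple form such reporting could take is a tally of unidentified files by extension, so that signature research starts with the most common unknowns. The sketch below assumes a plain list of file paths as input; the function name is hypothetical and this is not Archivematica's own reporting code.

```python
from collections import Counter
from pathlib import PurePath
from typing import Iterable, List, Tuple


def unidentified_by_extension(paths: Iterable[str]) -> List[Tuple[str, int]]:
    """Count unidentified files per extension, most common first."""
    counts = Counter(
        PurePath(p).suffix.lower() or "(no extension)" for p in paths
    )
    return counts.most_common()
```

The extensions at the top of the list are the obvious candidates for sample files to submit to TNA.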
78. Do talk to me (or Chris) if you are
interested in finding out more about our
preservation work
Useful links:
Project website: http://www.york.ac.uk/borthwick/archivematica
Digital archiving blog: http://digital-archiving.blogspot.co.uk/
Archivematica: https://www.archivematica.org/en/
PRONOM: http://www.nationalarchives.gov.uk/PRONOM/
Phase 1 report: http://dx.doi.org/10.6084/m9.figshare.1481170
Phase 2 report: https://dx.doi.org/10.6084/m9.figshare.2073220