From Box to Hydra via Archivematica
Turning proof of concept into reality
Background
• University of Hull and University of York working on a Research Data Spring
project
• Filling the digital preservation gap, 2015-16
• https://www.york.ac.uk/borthwick/projects/archivematica/
• Dual use cases for the University of Hull
• Digital preservation of archival materials
• Management and preservation of research data
Systems background
• Box
• Institutional subscription from 2015
• Supported and managed personal cloud storage service
• Archivematica
• No experience prior to the project, but had watched its development over a period
of years
• Particularly liked the combination of microservices that can be used flexibly
according to use case
Repository
• Hydra digital repository – http://hydra.hull.ac.uk
• Implemented 2012 based on previous Fedora repository
• Designed to hold any structured digital collection (within reason!) to meet
University’s needs
• NB ** Hydra is now Samvera **
• Community is refreshing and re-launching for the next decade
• Watch this space – http://samvera.org
• New website and logo coming shortly
Questions
• How can we enable a preservation workflow with the systems environment
available to us?
• How can we facilitate pathways to preserving archival materials and
research data alongside each other?
• What is required to bring these different components together to best
effect?
Ingest to the system, either direct
or via ingest folder (Box)
Archivematica captures content
and processes it through
microservices
Archivematica outputs AIP for
storage and DIP for repository
DIP processor unpacks DIPs and
creates repository objects
Repository manages access to
objects
Project focus
• User assembles files and simple descriptive file(s) in Box
folder. Shares the folder with Archivematica
• System checks folder contents and if OK creates a bag
(BagIt standard) for each object which is passed to
Archivematica
• Archivematica processes the bag to create an AIP which
goes to a preservation store…
• …and also a DIP which is passed to the DIP processor
• DIP processor creates Hydra objects from the DIP
contents and injects them into the repository QA
queue…
• …matched to the AIP by UUID
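The UUID matching in the last step can be sketched as below. This is a minimal illustration, not the project's code: it assumes each AIP and DIP package name ends with its UUID, and the function names `package_uuid` and `match_dips_to_aips` are hypothetical.

```python
import re
import uuid
from typing import Dict, Iterable, Optional

# Assumption: package names end with a UUID, e.g. "thesis-<uuid>".
UUID_SUFFIX = re.compile(
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def package_uuid(package_name: str) -> Optional[uuid.UUID]:
    """Return the UUID at the end of a package name, or None if absent."""
    match = UUID_SUFFIX.search(package_name)
    return uuid.UUID(match.group(1)) if match else None


def match_dips_to_aips(
    dip_names: Iterable[str], aip_names: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Map each DIP name to the AIP name sharing its UUID (None if unmatched)."""
    aips = {}
    for name in aip_names:
        key = package_uuid(name)
        if key is not None:
            aips[key] = name
    return {dip: aips.get(package_uuid(dip)) for dip in dip_names}
```

A DIP with no recognisable UUID, or with no corresponding AIP, maps to None so that it can be flagged for review rather than silently dropped.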
Joining up the dots
• The joins between the three components were:
• A ‘Box-watcher’ – users share their data with a nominated Box user account for the
Archivematica system. This account watches for folders shared with it, automatically
creates a bag (BagIt) of the files found and transfers it to Archivematica for processing
• A ‘DIP processor’ – this takes the bagged DIP from Archivematica, breaks it open and
uses the information within it to create repository objects
• These tools were wrapped into a single gem, hullsync
• https://github.com/uohull/hullsync
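The bagging step performed by the Box-watcher can be sketched as follows. This is not taken from hullsync (which is a Ruby gem); it is a minimal Python illustration using only the standard library, and the function name `make_bag` is hypothetical. It writes the three elements a minimal BagIt bag needs: the `bagit.txt` declaration, a `data/` payload directory and a payload manifest of checksums.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy the files in source_dir into a new minimal BagIt bag at bag_dir."""
    data_dir = bag_dir / "data"
    # copytree creates bag_dir and data/; it fails if data/ already exists.
    shutil.copytree(source_dir, data_dir)

    # Bag declaration (BagIt 1.0, RFC 8493).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path>" line per payload file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

The sketch reads each file fully into memory to hash it, which is fine for illustration; large research data files would need chunked hashing.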
Deposit options
• Depositors have several options:
• A folder containing multiple data files and one descriptive file → a single AIP and a single repository
object with (optionally) one or more surrogate files for download (so can be a “metadata-only”
record)
• A folder containing multiple files and a csv file (one row per file) → multiple AIPs with multiple
repository objects, each with (optionally) a surrogate for download
• A folder containing the top-level folder of a structure → a zipped structure in a single AIP and a single
repository object (optionally) containing the zipped file for download
In detail – option 1
• A folder containing multiple data files and one descriptive file → a single
AIP and a single repository object with (optionally) one or more surrogate
files for download (so can be a “metadata-only” record)
• Data files are associated with a .txt descriptive file providing associated metadata
• Descriptive file can be used to determine access permissions and content model
• Descriptive metadata can be provided using Dublin Core
• Can also submit README.txt for information to inform repository staff on
appropriate actions
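Reading a descriptive file of this kind might look like the sketch below. The slides do not specify the file layout, so this assumes one `field: value` pair per line with Dublin Core element names as fields; both that layout and the function name `parse_description` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def parse_description(text: str) -> Dict[str, List[str]]:
    """Parse 'field: value' lines into a dict of field -> list of values.

    Assumed layout, not the project's documented format. Repeated fields
    (e.g. several creators) are collected in order.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines with no field name
        fields[key.strip().lower()].append(value.strip())
    return dict(fields)
```

Only the first colon on a line separates field from value, so values such as URLs survive intact.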
In detail – option 2
• A folder containing multiple files and a csv file (one row per file) → multiple
AIPs with multiple repository objects, each with (optionally) a surrogate for
download
• Use a .csv file instead of a .txt file for the descriptive information
• Use column headings to cover the same fields as in option 1
• Can associate the same or different metadata with each object
• Can create simple or compound objects
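The one-row-per-file approach can be sketched as below. The column name `filename` and the function name `objects_from_csv` are hypothetical; the slides say only that the column headings cover the same fields as in option 1.

```python
import csv
import io
from typing import Any, Dict, List


def objects_from_csv(
    csv_text: str, filename_column: str = "filename"
) -> List[Dict[str, Any]]:
    """Turn each CSV row into one object description: a file plus its metadata."""
    objects = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        filename = (row.get(filename_column) or "").strip()
        if not filename:
            continue  # a row without a file cannot become an object
        # Keep only named, non-empty columns other than the filename itself.
        metadata = {
            key: value.strip()
            for key, value in row.items()
            if key not in (None, filename_column) and value
        }
        objects.append({"file": filename, "metadata": metadata})
    return objects
```

Because every row carries its own metadata, the same values can be repeated across rows or varied per file, as the slide describes.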
In detail – option 3
• A folder containing the top-level folder of a structure → a zipped structure
in a single AIP and a single repository object (optionally) containing the
zipped file for download
• Aim is to allow the submission of a folder or nested folders of data, replicating how
the files are organised
• Files are unpacked by Archivematica, analysed, and then re-zipped up for submission
to the repository
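The re-zipping step, which has to keep the original folder layout, can be sketched with the Python standard library as below. This is an illustration only, not how Archivematica or hullsync performs it; the function name `zip_structure` is hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def zip_structure(top_folder: Path, output_dir: Path) -> Path:
    """Zip top_folder, nested folders included, into output_dir/<name>.zip.

    output_dir should sit outside top_folder so the archive does not
    try to include itself.
    """
    top_folder = top_folder.resolve()
    output_dir = output_dir.resolve()
    output_dir.mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(
        str(output_dir / top_folder.name),  # archive path without ".zip"
        "zip",
        root_dir=str(top_folder.parent),    # paths in the zip are relative to this
        base_dir=top_folder.name,           # so entries start with the folder name
    )
    return Path(archive)
```

Listing the archive with `zipfile.ZipFile(path).namelist()` is a quick way to confirm that the nested paths were preserved.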
Lessons learned
• Error handling needs attention when turning the proof of concept into production
• But the testing highlighted a lot of the errors that would need handling
• A key element when joining systems together
• Normalisation of filetypes requires additional consideration
• E.g., how to deal with TIFF files converted to JPG
• The zipping and unzipping workflows require further attention to ensure
success for this option
Next steps
• Take learning and tools from the Research Data Spring project and use these
as the basis for development of services
• Two use cases
• Research data storage and management service development
• City of Culture digital archive
• Understanding Archivematica pipelines and options better – Perpetua test!
• Focus on improving proof-of-concept and developing additional
functionality
Research data storage and management
• Joint Library and ICTD project to discover and understand research data
storage and management needs amongst academic staff
• Open workshops
• Data interviews
• Capture and processing of research data as part of local provision, alongside
advice and guidance on options outside the institution
City of Culture digital archive
• Hull2017 – City of Culture
• Events throughout the year
• Four data elements
• Business archive
• Creative archive
• Participatory archive
• Research and evaluation archive
• Applying the same technology environment to manage ingest and delivery
Key issues going forward
• What are the differences in pipeline processing in Archivematica between
research data and archival materials?
• Dealing with unusual file formats – a key learning point from the RDS
project
• Scaling up to meet heavier data demands
• Being realistic about what we can’t use this environment for and where we need
alternative approaches, e.g., Big Data
To conclude
• Combining components has its issues, but it has been better to exploit
systems that do certain parts of the workflow well and turn them into more
than the sum of their parts
• Data is not simple
• We need flexibility in how we look to manage it
• We need engagement with researchers to understand it
• Turning an idea into production needs careful planning
• Scope for community exchange or training on how to do this?
Thank you
c.awre@hull.ac.uk
(And many thanks to the University of York and my colleagues Richard Green and
Simon Wilson, plus Cottage Labs LLC for their work on this)

Más contenido relacionado

La actualidad más candente

Grant Funding Programme
Grant Funding ProgrammeGrant Funding Programme
Grant Funding Programme
Jisc RDM
 
SMRUDAS
SMRUDAS SMRUDAS
SMRUDAS
Jisc RDM
 
Engaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh UniversityEngaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh University
Robin Rice
 
UK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schemaUK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schema
Jisc RDM
 
Scottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShareScottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShare
Robin Rice
 
Business cases and costs RDN
Business cases and costs RDNBusiness cases and costs RDN
Business cases and costs RDN
Jisc RDM
 
Lightning Talk - Angela Dappart
Lightning Talk - Angela DappartLightning Talk - Angela Dappart
Lightning Talk - Angela Dappart
Jisc RDM
 
Data sharing in the Netherlands
Data sharing in the NetherlandsData sharing in the Netherlands
Data sharing in the Netherlands
Jisc RDM
 
Who is doing what, and how do we know? [PEPRS]
Who is doing what, and how do we know? [PEPRS]Who is doing what, and how do we know? [PEPRS]
Who is doing what, and how do we know? [PEPRS]
EDINA, University of Edinburgh
 
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2 PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
EDINA, University of Edinburgh
 
Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc RDM
 
Going for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial MetadataGoing for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial Metadata
EDINA, University of Edinburgh
 
Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016
Jisc RDM
 
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
EDINA, University of Edinburgh
 
National data services lightening talk at the RDA
National data services lightening talk at the RDANational data services lightening talk at the RDA
National data services lightening talk at the RDA
Jisc RDM
 
RDM shared services at IDCC
RDM shared services at IDCCRDM shared services at IDCC
RDM shared services at IDCC
Jisc RDM
 
Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...
Jisc RDM
 
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShareResearch Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Historic Environment Scotland
 
RDA UK
RDA UKRDA UK
RDA UK
Jisc RDM
 
COBWEB technology platform and future development needs
COBWEB technology platform and future development needsCOBWEB technology platform and future development needs
COBWEB technology platform and future development needs
EDINA, University of Edinburgh
 

La actualidad más candente (20)

Grant Funding Programme
Grant Funding ProgrammeGrant Funding Programme
Grant Funding Programme
 
SMRUDAS
SMRUDAS SMRUDAS
SMRUDAS
 
Engaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh UniversityEngaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh University
 
UK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schemaUK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schema
 
Scottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShareScottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShare
 
Business cases and costs RDN
Business cases and costs RDNBusiness cases and costs RDN
Business cases and costs RDN
 
Lightning Talk - Angela Dappart
Lightning Talk - Angela DappartLightning Talk - Angela Dappart
Lightning Talk - Angela Dappart
 
Data sharing in the Netherlands
Data sharing in the NetherlandsData sharing in the Netherlands
Data sharing in the Netherlands
 
Who is doing what, and how do we know? [PEPRS]
Who is doing what, and how do we know? [PEPRS]Who is doing what, and how do we know? [PEPRS]
Who is doing what, and how do we know? [PEPRS]
 
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2 PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
 
Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...
 
Going for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial MetadataGoing for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial Metadata
 
Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016
 
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
 
National data services lightening talk at the RDA
National data services lightening talk at the RDANational data services lightening talk at the RDA
National data services lightening talk at the RDA
 
RDM shared services at IDCC
RDM shared services at IDCCRDM shared services at IDCC
RDM shared services at IDCC
 
Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...
 
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShareResearch Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
 
RDA UK
RDA UKRDA UK
RDA UK
 
COBWEB technology platform and future development needs
COBWEB technology platform and future development needsCOBWEB technology platform and future development needs
COBWEB technology platform and future development needs
 

Similar a From Box to Hydra via Archivematica

Using Archivemedia to preserve research data
Using Archivemedia to preserve research dataUsing Archivemedia to preserve research data
Using Archivemedia to preserve research data
ARDC
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
datascienceiqss
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...
Jenny Mitcham
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica
Jenny Mitcham
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
National Library of Australia
 
Presentation 16 may keynote karin bredenberg
Presentation 16 may keynote karin bredenbergPresentation 16 may keynote karin bredenberg
Presentation 16 may keynote karin bredenberg
Nederlands Instituut voor Beeld en Geluid
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...
Jenny Mitcham
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic Web
Pascal-Nicolas Becker
 
Data Storage
Data StorageData Storage
Data Storage
Moghees1
 
Steven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data LifecycleSteven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steve Androulakis
 
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ARDC
 
MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)
Nikos Palavitsinis, PhD
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...
Jenny Mitcham
 
A collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDMA collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDM
northerncollaboration
 
“Filling the digital preservation gap” an update from the Jisc Research Data ...
“Filling the digital preservation gap”an update from the Jisc Research Data ...“Filling the digital preservation gap”an update from the Jisc Research Data ...
“Filling the digital preservation gap” an update from the Jisc Research Data ...
Jenny Mitcham
 
Montemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revisedMontemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revised
Gabe Montemayor
 
OpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseOpenStack Swift In the Enterprise
OpenStack Swift In the Enterprise
Hostway|HOSTING
 
BatIg
BatIgBatIg
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
Roxanne Missingham
 
Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724
mikeum
 

Similar a From Box to Hydra via Archivematica (20)

Using Archivemedia to preserve research data
Using Archivemedia to preserve research dataUsing Archivemedia to preserve research data
Using Archivemedia to preserve research data
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Presentation 16 may keynote karin bredenberg
Presentation 16 may keynote karin bredenbergPresentation 16 may keynote karin bredenberg
Presentation 16 may keynote karin bredenberg
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic Web
 
Data Storage
Data StorageData Storage
Data Storage
 
Steven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data LifecycleSteven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
 
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
 
MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...
 
A collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDMA collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDM
 
“Filling the digital preservation gap” an update from the Jisc Research Data ...
“Filling the digital preservation gap”an update from the Jisc Research Data ...“Filling the digital preservation gap”an update from the Jisc Research Data ...
“Filling the digital preservation gap” an update from the Jisc Research Data ...
 
Montemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revisedMontemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revised
 
OpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseOpenStack Swift In the Enterprise
OpenStack Swift In the Enterprise
 
BatIg
BatIgBatIg
BatIg
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724
 

Más de Jisc RDM

2019-06_Eunis_Burland
2019-06_Eunis_Burland2019-06_Eunis_Burland
2019-06_Eunis_Burland
Jisc RDM
 
Jisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 PaperJisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 Paper
Jisc RDM
 
Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc RDM
 
Jisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case studyJisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case study
Jisc RDM
 
Building a national Data Repository Data Modelling
Building a national Data Repository Data ModellingBuilding a national Data Repository Data Modelling
Building a national Data Repository Data Modelling
Jisc RDM
 
Building a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture OverviewBuilding a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture Overview
Jisc RDM
 
Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018
Jisc RDM
 
Research Data Toolkit
Research Data ToolkitResearch Data Toolkit
Research Data Toolkit
Jisc RDM
 
Pre jisc datachampday_260318
Pre jisc datachampday_260318Pre jisc datachampday_260318
Pre jisc datachampday_260318
Jisc RDM
 
Stories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) okStories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) ok
Jisc RDM
 
Fair data - dinkum research - by Andy Turner
Fair data -  dinkum research - by Andy TurnerFair data -  dinkum research - by Andy Turner
Fair data - dinkum research - by Andy Turner
Jisc RDM
 
2018 03 codata - making the case
2018 03 codata - making the case2018 03 codata - making the case
2018 03 codata - making the case
Jisc RDM
 
Research Data Shared Service update at DPC
Research Data Shared Service update at DPCResearch Data Shared Service update at DPC
Research Data Shared Service update at DPC
Jisc RDM
 
Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1
Jisc RDM
 
Managing data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCMManaging data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCM
Jisc RDM
 
Managing data behind creative masterpieces
Managing data behind creative masterpiecesManaging data behind creative masterpieces
Managing data behind creative masterpieces
Jisc RDM
 
Lightning Talks - Intro
Lightning Talks - IntroLightning Talks - Intro
Lightning Talks - Intro
Jisc RDM
 
Lightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellanLightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellan
Jisc RDM
 
Lightning Talk - Nick Sheppard
Lightning Talk - Nick SheppardLightning Talk - Nick Sheppard
Lightning Talk - Nick Sheppard
Jisc RDM
 
Lightning talk - Adam Harwood
Lightning talk - Adam HarwoodLightning talk - Adam Harwood
Lightning talk - Adam Harwood
Jisc RDM
 

Más de Jisc RDM (20)

2019-06_Eunis_Burland
2019-06_Eunis_Burland2019-06_Eunis_Burland
2019-06_Eunis_Burland
 
Jisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 PaperJisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 Paper
 
Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7
 
Jisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case studyJisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case study
 
Building a national Data Repository Data Modelling
Building a national Data Repository Data ModellingBuilding a national Data Repository Data Modelling
Building a national Data Repository Data Modelling
 
Building a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture OverviewBuilding a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture Overview
 
Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018
 
Research Data Toolkit
Research Data ToolkitResearch Data Toolkit
Research Data Toolkit
 
Pre jisc datachampday_260318
Pre jisc datachampday_260318Pre jisc datachampday_260318
Pre jisc datachampday_260318
 
Stories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) okStories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) ok
 
Fair data - dinkum research - by Andy Turner
Fair data -  dinkum research - by Andy TurnerFair data -  dinkum research - by Andy Turner
Fair data - dinkum research - by Andy Turner
 
2018 03 codata - making the case
2018 03 codata - making the case2018 03 codata - making the case
2018 03 codata - making the case
 
Research Data Shared Service update at DPC
Research Data Shared Service update at DPCResearch Data Shared Service update at DPC
Research Data Shared Service update at DPC
 
Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1
 
Managing data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCMManaging data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCM
 
Managing data behind creative masterpieces
Managing data behind creative masterpiecesManaging data behind creative masterpieces
Managing data behind creative masterpieces
 
Lightning Talks - Intro
Lightning Talks - IntroLightning Talks - Intro
Lightning Talks - Intro
 
Lightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellanLightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellan
 
Lightning Talk - Nick Sheppard
Lightning Talk - Nick SheppardLightning Talk - Nick Sheppard
Lightning Talk - Nick Sheppard
 
Lightning talk - Adam Harwood
Lightning talk - Adam HarwoodLightning talk - Adam Harwood
Lightning talk - Adam Harwood
 


From Box to Hydra via Archivematica

  • 1. From Box to Hydra via Archivematica Turning proof of concept into reality
  • 2. Background • University of Hull and University of York working on a Research Data Spring project • Filling the digital preservation gap, 2015-16 • https://www.york.ac.uk/borthwick/projects/archivematica/ • Dual use cases for the University of Hull • Digital preservation of archival materials • Management and preservation of research data
  • 3. Systems background • Box • Institutional subscription from 2015 • Supported and managed personal cloud storage service • Archivematica • No experience prior to the project, but had watched its development over a period of years • Particularly liked the combination of microservices that can be used flexibly according to use case
  • 4. Repository • Hydra digital repository – http://hydra.hull.ac.uk • Implemented 2012 based on previous Fedora repository • Designed to hold any structured digital collection (within reason!) to meet University’s needs • NB ** Hydra is now Samvera ** • Community is refreshing and re-launching for the next decade • Watch this space – http://samvera.org • New website and logo coming shortly
  • 5. Questions • How can we enable a preservation workflow with the systems environment available to us? • How can we facilitate pathways to preserving archival materials and research data alongside each other? • What is required to bring these different components together to best effect?
  • 6. Ingest to the system, either direct or via ingest folder (Box) Archivematica captures content and processes it through microservices Archivematica outputs AIP for storage and DIP for repository DIP processor unpacks DIPs and creates repository objects Repository manages access to objects
  • 7. Project focus • User assembles files and simple descriptive file(s) in Box folder. Shares the folder with Archivematica • System checks folder contents and if OK creates a bag (BagIt standard) for each object which is passed to Archivematica • Archivematica processes the bag to create an AIP which goes to a preservation store… • …and also a DIP which is passed to the DIP processor • DIP processor creates Hydra objects from the DIP contents and injects them into the repository QA queue… • …matched to the AIP by UUID
  • 8. Joining up the dots • The joins between the three components were: • A ‘Box-watcher’ – users share their data with a nominated Box user account for the Archivematica system. This account watches for shares with it, automatically creates a BagIt bag of the files found, and transfers this to Archivematica for processing • A ‘DIP processor’ – this takes the BagIt DIP from Archivematica, breaks it open and uses the information within it to create repository objects • These tools were wrapped into a single gem, hullsync • https://github.com/uohull/hullsync
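The bag-creation step performed by the Box-watcher can be illustrated with a minimal sketch. This is not the hullsync implementation (hullsync is a Ruby gem); it is an assumed Python illustration of what a BagIt bag consists of: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. The function name `make_bag` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy the files under source_dir into a minimal BagIt bag at bag_dir.

    Hypothetical helper for illustration only. bag_dir must not already
    exist and must not sit inside source_dir.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dest = data_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        # Each manifest line pairs a checksum with the payload path.
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel.as_posix()}")

    # The bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir
```

A receiving system can recompute the checksums listed in the manifest to confirm that the files survived the transfer intact, which is the main reason to bag content before passing it between systems.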
  • 9. Deposit options • Depositors have several options: • A folder containing multiple data files and one descriptive file → a single AIP and a single repository object with (optionally) one or more surrogate files for download (so can be a “metadata-only” record) • A folder containing multiple files and a CSV file (one row per file) → multiple AIPs with multiple repository objects, each with (optionally) a surrogate for download • A folder containing the top-level folder of a structure → a zipped structure in a single AIP and a single repository object (optionally) containing the zipped file for download
  • 10. In detail – option 1 • A folder containing multiple data files and one descriptive file → a single AIP and a single repository object with (optionally) one or more surrogate files for download (so can be a “metadata-only” record) • Data files are associated with a .txt descriptive file providing associated metadata • Descriptive file can be used to determine access permissions and content model • Descriptive metadata can be provided using Dublin Core • Can also submit README.txt to inform repository staff of appropriate actions
  • 11. In detail – option 2 • A folder containing multiple files and a CSV file (one row per file) → multiple AIPs with multiple repository objects, each with (optionally) a surrogate for download • Use a .csv file instead of a .txt file for the descriptive information • Use column headings to cover the same fields as in option 1 • Can associate the same or different metadata with each object • Can create simple or compound objects
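The one-row-per-file idea in option 2 can be sketched as follows. The slides do not give the actual column headings, so `filename`, `title` and `creator` are hypothetical, as is the function name `objects_from_csv`. The sketch only shows how each CSV row could become one object description with its own metadata.

```python
import csv
import io


def objects_from_csv(csv_text: str) -> list[dict]:
    """Turn a descriptive CSV (one row per file) into object descriptions.

    Assumes a hypothetical 'filename' column; every other non-empty
    column is treated as descriptive metadata for that file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    objects = []
    for row in reader:
        filename = (row.get("filename") or "").strip()
        if not filename:
            raise ValueError(f"row {reader.line_num}: missing filename")
        metadata = {
            key: value.strip()
            for key, value in row.items()
            if key not in (None, "filename") and value and value.strip()
        }
        objects.append({"filename": filename, "metadata": metadata})
    return objects


sample = (
    "filename,title,creator\n"
    "survey.csv,Survey results,A. Researcher\n"
    "notes.txt,Field notes,\n"
)
for obj in objects_from_csv(sample):
    print(obj["filename"], obj["metadata"])
```

Because each row carries its own values, the same sheet can give every file identical metadata or vary it per object, as the slide describes.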
  • 12. In detail – option 3 • A folder containing the top-level folder of a structure → a zipped structure in a single AIP and a single repository object (optionally) containing the zipped file for download • Aim is to allow the submission of a folder or nested folders of data, replicating how the files are organised • Files are unpacked by Archivematica, analysed, and then re-zipped for submission to the repository
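Preserving a nested folder structure inside a single zipped file, as option 3 aims to do, can be sketched with the Python standard library. This is an assumed illustration, not the workflow Archivematica or hullsync actually runs; the function name `zip_structure` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_structure(top_folder: Path, zip_path: Path) -> list[str]:
    """Zip top_folder, keeping its name and nested layout in the archive.

    Hypothetical helper for illustration. Returns the archive member
    names in the order they were written.
    """
    # List the files first so the archive itself is never picked up.
    files = sorted(p for p in top_folder.rglob("*") if p.is_file())
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Paths are stored relative to the parent, so the top-level
            # folder name is preserved inside the archive.
            arcname = path.relative_to(top_folder.parent).as_posix()
            zf.write(path, arcname)
            names.append(arcname)
    return names
```

Storing paths relative to the parent of the top-level folder means that unzipping recreates the depositor's original organisation rather than a flat list of files.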
  • 13. Lessons learned • Error handling needs attention when turning the proof of concept into production • But the testing highlighted many of the errors that would need handling • A key element when joining systems together • Normalisation of file types requires additional consideration • E.g., how to deal with TIFF files converted to JPG • The zipping and unzipping workflows require further attention to ensure success for this option
  • 14. Next steps • Take learning and tools from the Research Data Spring project and use these as the basis for development of services • Two use cases • Research data storage and management service development • City of Culture digital archive • Understanding Archivematica pipelines and options better – Perpetua test! • Focus on improving proof-of-concept and developing additional functionality
  • 15.
  • 16. Research data storage and management • Joint Library and ICTD project to discover and understand research data storage and management needs amongst academic staff • Open workshops • Data interviews • Capture and processing of research data as a part of local provision, alongside advice and guidance on options outside the institution
  • 17. City of Culture digital archive • Hull2017 – City of Culture • Events throughout the year • Four data elements • Business archive • Creative archive • Participatory archive • Research and evaluation archive • Applying the same technology environment to manage ingest and delivery
  • 18. Key issues going forward • What are the differences in pipeline processing in Archivematica between research data and archival materials? • Dealing with unusual file formats – a key learning point from the RDS project • Scaling up to meet heavier data demands • Being realistic about what we can’t use this environment for and where we need alternative approaches, e.g., Big Data
  • 19. To conclude • Combining components has its issues, but it has been better to exploit systems that do certain parts of the workflow well and turn them into more than the sum of their parts • Data is not simple • We need flexibility in how we look to manage it • We need engagement with researchers to understand it • Turning an idea into production needs careful planning • Scope for community exchange or training on how to do this?
  • 20. Thank you c.awre@hull.ac.uk (And many thanks to the University of York and my colleagues Richard Green and Simon Wilson, plus Cottage Labs LLC for their work on this)