Data curation projects frequently deal with data that were not created for the purposes of long-term preservation and re-use. How can curation of such legacy data be improved by supplying necessary metadata? In this report, we address this and other questions by creating robust metadata for twenty legacy research datasets. We report on the metrics of creating domain-specific metadata and propose a four-pronged framework of metadata creation for legacy research data. Our findings indicate a steep learning curve in encoding metadata using the FGDC Content Standard for Digital Geospatial Metadata. Our project demonstrates that when data curators are handed research data “as is,” they may still succeed in incorporating those data into a data-sharing environment. We found that data curators can successfully create descriptive metadata and enhance discoverability via subject analysis. However, curators must be aware of the limitations in applying structural and administrative metadata to legacy data.
Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data
1. Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data
Stacy Konkiel
IU Libraries
Inna Kouper
Data to Insight Center, IU
Jennifer A. Liss
IU Libraries
Juliet L. Hardesty
IU Libraries
3. SEAD is funded by the National Science Foundation under Cooperative Agreement #OCI0940824
4. SEAD Virtual Archive (SVA)
[Diagram: the SVA provides a sustainability science window onto multiple institutional repositories (IU ScholarWorks, UIUC IDEALS, UMich Deep Blue), supporting ingest, publish, associate, and discover operations.]
5. Investigation
• How can the curation of legacy data be improved by supplying necessary metadata?
• How much time and effort is required to supply domain-specific metadata?
6. Goals
• Enable discovery of research data
• Communicate experiences with metadata creation for legacy datasets to the community
• Begin a conversation about metadata practices for legacy data
7. Methodology
• 20 NCED legacy datasets
• Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata
8. Methodology
• 4 encoders, each assigned 5 datasets
• Datasets ranged greatly in size and composition: 0.01–664 GB, 1–140,000 files
9. Methodology
• Phase I: standalone XML files using basic NCED-provided information and “Googleable” facts
• Phase II: extensive research into the processes by which datasets were created and used
14. Observations
Though the information we found may enhance opportunities for the discovery of legacy research data, the available information was unlikely to be sufficient to support the tasks of preservation, reproducibility, and re-use.
15. Observations
• FGDC is insufficient for dealing with legacy research data
• Data curators without domain expertise can be successful in creating some types of metadata
• Structural and administrative metadata are difficult to curate without the help of researchers
16. Proposal: The CAPP Framework
• Collaborate: subject specialists, librarians, researchers, tool developers
• Automate: file format identification, provenance, native environment, entity extraction
• Prepare: choice of metadata standards, instructions/manuals, workflows/software
• Prioritize: labor, datasets, types of metadata, user needs, licensing and contact information
22. Thank you!
Jennifer A. Liss
Metadata/Cataloging Librarian
jaliss@indiana.edu
http://sead-data.net
Editor's notes
Research data management has been recognized by many international governmental bodies and their agencies as a grand challenge: JISC, UK Data Archive, Bill & Melinda Gates Foundation, US Department of Energy’s Office of Science. All are struggling with data management, particularly with the increase in born-digital data. Many governmental funding agencies now require that researchers pay heed to metadata, but they don’t explain how researchers should go about it.
This research falls within the context of an NSF-funded DataNet project called SEAD, or “Sustainable Environment, Actionable Data.” My colleagues Inna Kouper and Stacy Konkiel are part of the SEAD research team, which comprises a larger group of scientists and research data specialists. SEAD is a federation of repositories for sustainability science, which is a highly interdisciplinary field. The SEAD project focuses on the development of tools that enable sustainability scientists to curate and share their data at earlier stages of research as well as “downstream,” after the data have been collected and stored.
The federation leverages existing IR platforms, where data are preserved, and also offers value-added services built on its interface so that scientists can easily work with and annotate data, harvest metadata, and use the VIVO social network to connect with other researchers in the field. More information about SEAD is available at its website, sead-data.net. The work presented in our project report focuses on supplying metadata for the ingest of legacy datasets into the SEAD Virtual Archive.
Data repositories and federations are becoming more prevalent, given the success of Dryad and DataONE, among other such projects. We wanted to address the following questions in order to better understand, in absolute, quantifiable terms, what effects metadata has on data management in these spaces. How can the curation of legacy datasets be improved by supplying necessary metadata? How much time and effort is required to create domain-specific metadata?
Our paper reports on quantitative and qualitative metrics of creating domain-specific metadata. In benchmarking the process of enhancing the metadata for legacy datasets, we pursue several goals. First, to make datasets available for effective search and re-use within newer data sharing environments. Second, to advance knowledge among researchers and data professionals about the needs, barriers, and requirements of curating legacy research data. Ultimately, we hope to advance a conversation about efficient metadata creation practices for research data within the broader context of data curation and management.
For this project, we used 20 datasets that are publicly available via the National Center for Earth-surface Dynamics (NCED) repository. Because these datasets originate from the interdisciplinary domain of earth sciences, the choice of a domain-specific metadata standard was not easy. But given that the datasets contained a significant amount of geospatial information, we decided to use the Federal Geographic Data Committee’s (FGDC) Content Standard for Digital Geospatial Metadata.
A team of four librarians and data professionals (or “encoders”) contributed to metadata creation. We all have different backgrounds: I come from a metadata perspective in a traditional library context, Julie from a digital library metadata and usability context, and Stacy and Inna from the scientific data context. Each encoder received 5 datasets of varying sizes, ranging from 0.01 to 664 gigabytes and from 1 to ~140,000 files per dataset. The datasets could comprise one or many different file types: text files, spreadsheets, images, applications, zip files.
Metadata encoding was done in two phases. During Phase I, encoders created standalone XML-based metadata files for each dataset using basic information provided by the NCED repository and information available via quick Internet searches. During Phase II, encoders undertook extensive research to find more information about datasets, particularly concerning the processes by which datasets were created and used. Encoders timed all of their encoding activities and logged their experiences in a journal.
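A standalone record of the kind produced in Phase I can also be generated programmatically. The sketch below uses Python's standard library to emit a skeletal record using a few FGDC CSDGM short element names (idinfo, citation, descript, themekey); it is deliberately simplified and omits many mandatory CSDGM elements, and the sample dataset values are invented for illustration.

```python
import xml.etree.ElementTree as ET

def build_record(title, originator, abstract, keywords):
    """Build a minimal, FGDC-flavored XML metadata record.

    Simplified sketch: a valid CSDGM record requires many more
    mandatory elements (contact information, spatial domain, etc.).
    """
    metadata = ET.Element("metadata")
    idinfo = ET.SubElement(metadata, "idinfo")
    citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
    ET.SubElement(citeinfo, "origin").text = originator
    ET.SubElement(citeinfo, "title").text = title
    descript = ET.SubElement(idinfo, "descript")
    ET.SubElement(descript, "abstract").text = abstract
    theme = ET.SubElement(ET.SubElement(idinfo, "keywords"), "theme")
    for kw in keywords:
        ET.SubElement(theme, "themekey").text = kw
    return ET.tostring(metadata, encoding="unicode")

# Hypothetical dataset values, for illustration only:
record = build_record("Flume experiment 7", "NCED",
                      "Sediment transport runs.",
                      ["sediment", "geomorphology"])
print(record)
```

In a two-phase workflow like ours, a generator of this sort covers the Phase I skeleton; Phase II enrichment then amends the resulting XML files in place.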
Encoding the basic metadata during Phase I required 9 minutes to 4 hours per dataset (average time: 54 minutes). Time dropped significantly after the first dataset was cataloged, reflecting a learning curve.
Successes: Metadata that we were able to encode during Phase I (i.e., the metadata that was easiest to obtain) largely corresponded to the mandatory elements required by the FGDC content standard. Librarians had great success assigning subject terms using controlled vocabularies. Challenges: While Phase I allowed us to collect descriptive metadata, which describes resources for the purposes of discovery and identification, it was very hard to encode spatial information (the information wasn’t included in datasets’ readme files, we didn’t have the software or software expertise to derive geospatial information, and geospatial details were not included in the NCED repository metadata).
During Phase II, we attempted to supply richer metadata, which included encoding the composition of the complex research objects, as well as encoding relevant technical and preservation information. Providing additional metadata during Phase II required 20 minutes to 1.5 hours per dataset.
Successes: 10 metadata fields were enhanced during Phase II, adding such information as references to grants and funding information, distribution conditions, digital access and transfer information, and citations to related datasets and published articles. Challenges: We were still unable to provide a few key mandatory elements, such as geospatial coordinates, resulting in XML files that did not validate against the FGDC schema.
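When a dataset happens to include shapefiles, the missing bounding coordinates are recoverable without GIS software: the ESRI shapefile specification stores the bounding box as four little-endian doubles at byte offset 36 of the 100-byte main-file header. A minimal sketch of reading it directly (the file path shown is hypothetical):

```python
import struct

def shapefile_bbox(path):
    """Read the bounding box (Xmin, Ymin, Xmax, Ymax) from a .shp file.

    Per the ESRI shapefile spec, the main-file header is 100 bytes;
    the file code 9994 (big-endian) sits at offset 0, and the bounding
    box occupies bytes 36-67 as four little-endian doubles.
    """
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100:
        raise ValueError("not a valid shapefile: header too short")
    (file_code,) = struct.unpack(">i", header[0:4])
    if file_code != 9994:
        raise ValueError("not a valid shapefile: bad file code")
    return struct.unpack("<4d", header[36:68])

# west, south, east, north = shapefile_bbox("dataset/rivers.shp")  # hypothetical path
```

Coordinates read this way could then be mapped onto the FGDC spatial domain elements that our records were missing.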
The FGDC Content Standard for Digital Geospatial Metadata is a powerful tool for representing descriptive, structural, and administrative metadata. In dealing with legacy research data, however, the capabilities of this tool become seriously limited. Unlike other information resources, such as books or images, which remain accessible and relatively transparent for preservation and sharing efforts, research data are complex compound objects. Formats, structure, relationships, and provenance become opaque once the data have been created. Our project demonstrates that data curators who are handed legacy research data “as is” can be very effective in creating descriptive metadata, particularly in conducting subject analysis and assigning keywords based on controlled vocabularies and thesauri. However, identifying structural and administrative metadata for legacy data is extremely difficult.
Our approach, CAPP (Collaborate, Automate, Prepare, Prioritize), is based on the premise that metadata creation or enhancement projects need to rely on a collaborative effort and on a combination of automated and manual labor.
Data managers need to collaborate with subject specialists, researchers, and tool developers to define the requirements of specific data curation projects. Researchers, as data producers and consumers, can contribute to metadata creation by indicating what elements are valuable and for what purposes. Researchers can also supply additional information that can be used in completing metadata records.
Tasks such as file format identification, provenance capture, and entity extraction need to be automated. Existing tools, such as the JSTOR/Harvard Object Validation Environment (JHOVE, http://jhove.sourceforge.net/), MIME Type Detection Utility (mime-util, http://sourceforge.net/projects/mime-util), or Internet Assigned Number Authority’s MIME Media Types (IANA, http://www.iana.org/assignments/media-types) can be used to automate identification of technical metadata, including file formats. Tools such as GeoServer (http://geoserver.org/display/GEOS/Welcome) can provide access to specific metadata within certain formats, such as shapefiles, and automate the extraction of bounding coordinates and other geospatial information. Researcher identification registries such as ORCID (http://orcid.org/) may help mitigate some of the challenges of finding up-to-date information about data set contributors that the encoders encountered in Phase I.
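Even without a dedicated validation tool, a first pass at file format identification can be automated with Python's standard library. The sketch below guesses MIME types by file extension across a dataset directory (the directory name is hypothetical); extension-based guessing is far weaker than JHOVE's signature-based validation, so unidentified files are flagged for review rather than silently dropped.

```python
import mimetypes
from collections import Counter
from pathlib import Path

def format_inventory(root):
    """Tally guessed MIME types for every file under a dataset directory.

    Extension-based guessing only; files whose type cannot be guessed
    are counted under 'unknown' for manual (or JHOVE-based) review.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            mime, _ = mimetypes.guess_type(path.name)
            counts[mime or "unknown"] += 1
    return counts

# inventory = format_inventory("nced_dataset_07/")  # hypothetical directory
# print(inventory.most_common())
```

An inventory like this also makes the prioritization step concrete: datasets dominated by well-understood formats can be queued ahead of those full of unidentified files.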
Librarians and data managers can contribute to automation by providing system and user requirements, identifying a minimal set of metadata elements, and encouraging other partners to become involved in data sharing initiatives.
At the beginning of a legacy data curation project, data managers may also want to make the decision-making explicit by prioritizing which datasets should be curated and what user needs should guide curation.
In the future, we plan to enhance the CAPP framework by benchmarking other processes of metadata creation, such as the usability and effectiveness of certain tools and workflows, the impact of collaborations on metadata creation, and the effects of domain orientation or interdisciplinarity on the effectiveness and completeness of metadata. At its current early stage, the CAPP framework is a proposition that needs to be developed into a rich research agenda. We hope that our framework will be considered by the Dublin Core community for further development, testing, improvement, and eventual incorporation into the set of best practices for metadata creation.