Data curation projects frequently deal with data that were not created for the purposes of long-term preservation and re-use. How can curation of such legacy data be improved by supplying necessary metadata? In this report, we address this and other questions by creating robust metadata for twenty legacy research datasets. We report on the metrics of creating domain-specific metadata and propose a four-pronged framework of metadata creation for legacy research data. Our findings indicate a steep learning curve in encoding metadata using the FGDC Content Standard for Digital Geospatial Metadata. Our project demonstrates that when data curators are handed research data “as is,” they may still succeed in incorporating those data into a data-sharing environment. We found that data curators can successfully create descriptive metadata and enhance discoverability via subject analysis. However, curators must be aware of the limitations in applying structural and administrative metadata to legacy data.
Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data
1. Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data
Stacy Konkiel
IU Libraries
Inna Kouper
Data to Insight Center, IU
Jennifer A. Liss
IU Libraries
Juliet L. Hardesty
IU Libraries
3. SEAD is funded by the National Science Foundation under Cooperative Agreement #OCI0940824
4. SEAD Virtual Archive (SVA)
[Diagram: the SVA provides a sustainability science window onto multiple institutional repositories (IU ScholarWorks, UIUC IDEALS, UMich Deep Blue), supporting ingest, publish, associate, and discover operations.]
5. Investigation
• How can the curation of legacy data be improved by supplying necessary metadata?
• How much time and effort is required to supply domain-specific metadata?
6. Goals
• Enable discovery of research data
• Communicate experiences with metadata creation for legacy datasets to the community
• Begin a conversation about metadata practices for legacy data
7. Methodology
• 20 NCED legacy datasets
• Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata
8. Methodology
• 4 encoders, each assigned 5 datasets
• Datasets ranged greatly in size and composition: 0.01–664 GB, 1–140,000 files
9. Methodology
• Phase I: standalone XML files using basic NCED-provided information and “Googleable” facts
• Phase II: extensive research into the processes by which datasets were created and used
14. Observations
Though the information we found may enhance opportunities for the discovery of legacy research data, the available information was unlikely to be sufficient to support the tasks of preservation, reproducibility, and re-use.
15. Observations
• FGDC is insufficient for dealing with legacy research data
• Data curators without domain expertise can be successful in creating some types of metadata
• Structural and administrative metadata are difficult to curate without the help of researchers
16. Proposal: The CAPP Framework
• Collaborate: subject specialists, librarians, researchers, tool developers
• Automate: file format identification, provenance, native environment, entity extraction
• Prepare: choice of metadata standards, instructions/manuals, workflows/software
• Prioritize: labor, datasets, types of metadata, user needs, licensing and contact information
22. Thank you!
Jennifer A. Liss
Metadata/Cataloging Librarian
jaliss@indiana.edu
http://sead-data.net
Editor's notes
Research data management has been recognized by many international governmental bodies and their agencies as a grand challenge: JISC, UK Data Archive, Bill & Melinda Gates Foundation, US Department of Energy’s Office of Science. All are struggling with data management, particularly with the increase in born-digital data. Many governmental funding agencies now require that researchers pay heed to metadata, but they don’t explain how researchers should go about it.
This research falls within the context of an NSF-funded DataNet project called SEAD, or “Sustainable Environment, Actionable Data.” My colleagues Inna Kouper and Stacy Konkiel are part of the SEAD research team, which comprises a larger group of scientists and research data specialists. SEAD is a federation of repositories for sustainability science, which is a highly interdisciplinary field. The SEAD project focuses on the development of tools that enable sustainability scientists to curate and share their data at earlier stages of research as well as “downstream,” after the data have been collected and stored.
The federation leverages existing IR platforms, where data are preserved, and also offers value-added services built on its interface so that scientists can easily work with and annotate data, harvest metadata, and use the VIVO social network to connect with other researchers in the field. More information about SEAD is available at its website, sead-data.net. The work presented in our project report focuses on supplying metadata for the ingest of legacy datasets into the SEAD Virtual Archive.
Data repositories and federations are becoming more prevalent, given the success of Dryad and DataONE, among other such projects. We wanted to address the following questions in order to better understand, in absolute, quantifiable terms, what effects metadata has on data management in these spaces. How can the curation of legacy datasets be improved by supplying necessary metadata? How much time and effort is required to create domain-specific metadata?
Our paper reports on quantitative and qualitative metrics of creating domain-specific metadata. In benchmarking the process of enhancing the metadata for legacy datasets, we pursue several goals. First, to make datasets available for effective search and re-use within newer data sharing environments. Second, to advance knowledge among researchers and data professionals about the needs, barriers, and requirements of curating legacy research data. Ultimately, we hope to advance a conversation about efficient metadata creation practices for research data within the broader context of data curation and management.
For this project, we used 20 datasets that are publicly available via the National Center for Earth-surface Dynamics (NCED) repository. Because these datasets originate from the interdisciplinary domain of earth sciences, the choice of a domain-specific metadata standard was not easy. But given that the datasets contained a significant amount of geospatial information, we decided to use the Federal Geographic Data Committee’s (FGDC) Content Standard for Digital Geospatial Metadata.
A team of four librarians and data professionals (or “encoders”) contributed to metadata creation. We all have different backgrounds: I come from a metadata perspective in a traditional library context, Julie from a digital library metadata and usability context, and Stacy and Inna from the scientific data context. Each encoder received 5 datasets of varying sizes, ranging from 0.01 to 664 gigabytes and from 1 to ~140,000 files per dataset. The datasets could comprise one or many different file types: text files, spreadsheets, images, applications, zip files.
Metadata encoding was done in two phases. During Phase I, encoders created standalone XML-based metadata files for each dataset using basic information provided by the NCED repository and information available via quick Internet searches. During Phase II, encoders undertook extensive research to find more information about datasets, particularly concerning the processes by which datasets were created and used. Encoders timed all of their encoding activities and logged their experiences in a journal.
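A standalone record of the kind produced in Phase I can also be generated programmatically. The sketch below uses Python's standard library to emit a skeletal record using a few FGDC CSDGM short element names (idinfo, citation, descript, themekey); it is deliberately simplified and omits many mandatory CSDGM elements, and the sample dataset values are invented for illustration.

```python
import xml.etree.ElementTree as ET

def build_record(title, originator, abstract, keywords):
    """Build a minimal, FGDC-flavored XML metadata record.

    Simplified sketch: a valid CSDGM record requires many more
    mandatory elements (contact information, spatial domain, etc.).
    """
    metadata = ET.Element("metadata")
    idinfo = ET.SubElement(metadata, "idinfo")
    citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
    ET.SubElement(citeinfo, "origin").text = originator
    ET.SubElement(citeinfo, "title").text = title
    descript = ET.SubElement(idinfo, "descript")
    ET.SubElement(descript, "abstract").text = abstract
    theme = ET.SubElement(ET.SubElement(idinfo, "keywords"), "theme")
    for kw in keywords:
        ET.SubElement(theme, "themekey").text = kw
    return ET.tostring(metadata, encoding="unicode")

# Hypothetical dataset values, for illustration only:
record = build_record("Flume experiment 7", "NCED",
                      "Sediment transport runs.",
                      ["sediment", "geomorphology"])
print(record)
```

In a two-phase workflow like ours, a generator of this sort covers the Phase I skeleton; Phase II enrichment then amends the resulting XML files in place.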
Encoding the basic metadata during Phase I required 9 minutes to 4 hours per dataset (average time: 54 minutes). Time dropped significantly after the first dataset was cataloged, reflecting a learning curve.
Successes: Metadata that we were able to encode during Phase I (i.e., the metadata that was easiest to obtain) largely corresponded to the mandatory elements required by the FGDC content standard. Librarians had great success assigning subject terms using controlled vocabularies. Challenges: While Phase I allowed us to collect descriptive metadata, which describes resources for the purposes of discovery and identification, it was very hard to encode spatial information (the information wasn’t included in datasets’ readme files, we didn’t have the software or software expertise to derive geospatial information, and geospatial details were not included in the NCED repository metadata).
During Phase II, we attempted to supply richer metadata, which included encoding the composition of the complex research objects, as well as encoding relevant technical and preservation information. Providing additional metadata during Phase II required 20 minutes to 1.5 hours per dataset.
Successes: 10 metadata fields were enhanced during Phase II, adding such information as references to grants and funding information, distribution conditions, digital access and transfer information, and citations to related datasets and published articles. Challenges: We were still unable to provide a few key mandatory elements, such as geospatial coordinates, resulting in XML files that did not validate against the FGDC schema.
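When a dataset happens to include shapefiles, the missing bounding coordinates are recoverable without GIS software: the ESRI shapefile specification stores the bounding box as four little-endian doubles at byte offset 36 of the 100-byte main-file header. A minimal sketch of reading it directly (the file path shown is hypothetical):

```python
import struct

def shapefile_bbox(path):
    """Read the bounding box (Xmin, Ymin, Xmax, Ymax) from a .shp file.

    Per the ESRI shapefile spec, the main-file header is 100 bytes;
    the file code 9994 (big-endian) sits at offset 0, and the bounding
    box occupies bytes 36-67 as four little-endian doubles.
    """
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100:
        raise ValueError("not a valid shapefile: header too short")
    (file_code,) = struct.unpack(">i", header[0:4])
    if file_code != 9994:
        raise ValueError("not a valid shapefile: bad file code")
    return struct.unpack("<4d", header[36:68])

# west, south, east, north = shapefile_bbox("dataset/rivers.shp")  # hypothetical path
```

Coordinates read this way could then be mapped onto the FGDC spatial domain elements that our records were missing.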
The FGDC Content Standard for Digital Geospatial Metadata is a powerful tool for representing descriptive, structural, and administrative metadata. In dealing with legacy research data, however, the capabilities of this tool become seriously limited. Unlike other information resources, such as books or images, which remain accessible and relatively transparent for preservation and sharing efforts, research data are complex compound objects. Formats, structure, relationships, and provenance become opaque once the data have been created. Our project demonstrates that data curators who are handed legacy research data “as is” can be very effective in creating descriptive metadata, particularly in conducting subject analysis and assigning keywords based on controlled vocabularies and thesauri. However, identifying structural and administrative metadata for legacy data is extremely difficult.
Our approach, CAPP (Collaborate, Automate, Prepare, Prioritize), is based on the premise that metadata creation or enhancement projects need to rely on a collaborative effort and on a combination of automated and manual labor.
Data managers need to collaborate with subject specialists, researchers, and tool developers to define the requirements of specific data curation projects. Researchers, as data producers and consumers, can contribute to metadata creation by indicating what elements are valuable and for what purposes. Researchers can also supply additional information that can be used in completing metadata records.
Tasks such as file format identification, provenance capture, and entity extraction need to be automated. Existing tools, such as the JSTOR/Harvard Object Validation Environment (JHOVE, http://jhove.sourceforge.net/), MIME Type Detection Utility (mime-util, http://sourceforge.net/projects/mime-util), or Internet Assigned Number Authority’s MIME Media Types (IANA, http://www.iana.org/assignments/media-types) can be used to automate identification of technical metadata, including file formats. Tools such as GeoServer (http://geoserver.org/display/GEOS/Welcome) can provide access to specific metadata within certain formats, such as shapefiles, and automate the extraction of bounding coordinates and other geospatial information. Researcher identification registries such as ORCID (http://orcid.org/) may help mitigate some of the challenges of finding up-to-date information about data set contributors that the encoders encountered in Phase I.
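Even without a dedicated validation tool, a first pass at file format identification can be automated with Python's standard library. The sketch below guesses MIME types by file extension across a dataset directory (the directory name is hypothetical); extension-based guessing is far weaker than JHOVE's signature-based validation, so unidentified files are flagged for review rather than silently dropped.

```python
import mimetypes
from collections import Counter
from pathlib import Path

def format_inventory(root):
    """Tally guessed MIME types for every file under a dataset directory.

    Extension-based guessing only; files whose type cannot be guessed
    are counted under 'unknown' for manual (or JHOVE-based) review.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            mime, _ = mimetypes.guess_type(path.name)
            counts[mime or "unknown"] += 1
    return counts

# inventory = format_inventory("nced_dataset_07/")  # hypothetical directory
# print(inventory.most_common())
```

An inventory like this also makes the prioritization step concrete: datasets dominated by well-understood formats can be queued ahead of those full of unidentified files.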
Librarians and data managers can contribute to automation by providing system and user requirements, identifying a minimal set of metadata elements, and encouraging other partners to become involved in data sharing initiatives.
At the beginning of a legacy data curation project, data managers may also want to make the decision-making explicit by prioritizing which datasets should be curated and what user needs should guide curation.
In the future, we plan to enhance the CAPP framework by benchmarking other processes of metadata creation, such as the usability and effectiveness of certain tools and workflows, the impact of collaborations on metadata creation, and the effects of domain orientation or interdisciplinarity on the effectiveness and completeness of metadata. At its current early stage, the CAPP framework is a proposition that needs to be developed into a rich research agenda. We hope that our framework will be considered by the Dublin Core community for further development, testing, improvement, and eventual incorporation into the set of best practices for metadata creation.