This document discusses the University of Alabama Libraries' transition from Dublin Core to MODS metadata standards for their digital collections. It describes how they mapped Dublin Core elements to MODS, developed workflows for processing metadata at scale using scripts, and created a public interface called Acumen to provide discovery of digital objects. The libraries learned that adopting metadata standards and using a flexible directory structure allowed them to implement new technologies and share records more broadly.
1. Metadata Magic
By Mary S. Alexander
Metadata Librarian, The University of Alabama Libraries
Given by Jeannine Keefer
Moderator
VRA Conference
Case Studies IV
March 21, 2012
7. Display labels for master spreadsheet
• Title
• Other title
• Cover title
• First Line of Text
• First Line of Chorus
• Masthead
• Series Title
• Special Issue
• Title from plate
• Subject(s)
• And more
8. Catalysts for change
• Institutional repositories
Born-digital resources
Electronic theses and dissertations
• More metadata schemas
Metadata Object Description Schema (MODS)
Metadata Encoding and Transmission Standard (METS)
Encoded Archival Description (EAD)
• Emphasis on discovery of hidden collections
• Mass digitization
9. Local changes
Local development of a web-based search and retrieval system
was favored. It is now known as Acumen.
Mass digitization and processing workflows would be
implemented.
MODS would be used for descriptive metadata schema.
10. DC to MODS crosswalk
displayLabel    DC element           MODS element
Title           Title                <title>
Other Title     Title-alternative    <title type="alternative">
Subject(s)      Subject              <subject authority="lcsh">
Description     Description          <description>
Creator(s)      Creator              <name type="personal">
Author(s)       Creator              <name type="personal">
Editor          Contributor          <name type="personal">
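A crosswalk like this one can also be held as a lookup table in a script. The following is a minimal Python sketch, not the Libraries' actual code: the names `CROSSWALK` and `mods_target` are hypothetical, and the element names are copied from the slide rows rather than checked against the MODS schema.

```python
# Hypothetical lookup table built from the crosswalk rows on the slide.
# Key: spreadsheet display label.
# Value: (DC element, MODS element name, MODS attributes).
CROSSWALK = {
    "Title": ("Title", "title", {}),
    "Other Title": ("Title-alternative", "title", {"type": "alternative"}),
    "Subject(s)": ("Subject", "subject", {"authority": "lcsh"}),
    "Description": ("Description", "description", {}),
    "Creator(s)": ("Creator", "name", {"type": "personal"}),
    "Author(s)": ("Creator", "name", {"type": "personal"}),
    "Editor": ("Contributor", "name", {"type": "personal"}),
}


def mods_target(display_label):
    """Return (element name, attributes) for a display label, or None if unmapped."""
    entry = CROSSWALK.get(display_label)
    if entry is None:
        return None
    _dc_element, element, attrs = entry
    # Copy the attributes so callers cannot alter the shared table.
    return element, dict(attrs)
```

Returning `None` for an unmapped label lets a caller report the column instead of silently dropping it.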
15. Name processing
Name Role Type Authority VIAF_ID LCCN Source History Comments
Coffman, J. Henry Sender(s) personal local no matches
King, Helen Sender(s) personal local check VIAF again
King, Margaret Recipient(s) personal local 1+ matches--check VIAF again
King, Robert S. Sender(s) Personal naf 76561235 nb2004005427
16. Subject processing
Subject master list:
Incorporation#x--West Virginia#z--Ohio County#z
Ingram, T.#p--Finance, Personal#x
Jefferson County (Ala.). Tax collector#c
Tagging key:
#x topics
#z geographic location
#p personal name
#c corporate name
18. More magic
<name type="personal" authority="naf">
<namePart>King, Robert S.</namePart>
</name>
<subject authority="lcsh">
<topic>Debtor and creditor</topic>
</subject>
<identifier type="local">u0003_0002865_0000001</identifier>
<identifier type="uri">http://purl.lib.ua.edu/35547</identifier>
19. Practical lessons learned
• By moving to the directory structure, we are able to implement
OAI, ETDs, and EADs.
• This structure allows flexibility to adopt other schemas.
• It is important to know the standards.
Knowing how to manipulate large amounts of data is invaluable!
Current records and digital objects available:
Items available: 74197
Scans available: 282091
Mary Alexander is one of three metadata librarians at the University of Alabama Libraries, where she has been employed since 1993. In 2003, she transitioned to a new position working with emerging metadata schemas and their related standards.
Digitization efforts were already in production across the nation when the University of Alabama Libraries and the two other large state institutions began to write their first grant to digitize archival collections. We had experienced personnel in archives, cataloging, and systems. We were rich in archival collections but limited in staffing and resources.
The grant was awarded in 2003. A metadata group representing the three institutions collaborated on a best practice document. The Dublin Core metadata schema was chosen based on its usage in large collaborative projects. While the group was discussing essential schema elements, the institutions proceeded individually with their own efforts. The UA Libraries already had a digital management system, Endeavor’s Encompass. We began.
Within a short time after the project started, Endeavor was acquired by a competing company. Encompass would no longer be supported. OCLC’s CONTENTdm was recommended as the next system for UA Libraries Digital Collections. A consultant was hired to get data out of Encompass. The exported data was delivered in a tab-delimited file. Those involved with the state-wide grant also chose CONTENTdm based on its connectivity with other sites. It provided a way for state collections to be discovered through one web site. The state-wide group discussed standards for digitization and metadata. The Dublin Core Metadata Best Practices by the CDP Metadata Working Group of the Collaborative Digitization Program was adopted.
In CONTENTdm, there was a digital collection for every physical collection. The Qualified Dublin Core schema had been implemented in all collections. Each collection had fields tailored to the collection’s needs. The number of fields ranged from 17 to 27. Production was driven by a workflow based on spreadsheets. Each collection had a corresponding spreadsheet. Many of the collections’ fields were repeated elements. Each creator element had a unique display label that reflected the role of the person or corporation. Author, composer, artist, and photographer are some of the roles/display labels. These were used as the column headers in the spreadsheet.
Over a five-year period, 28 collections were created in CONTENTdm, each with a slightly different set of data fields. There were approximately 6,972 digital objects and records in CONTENTdm. A license plateau was quickly approaching.
The Cataloging and Metadata Services Department and the Digital Services Unit decided that a master list of fields should be adopted across collections. This would enable CONTENTdm collections to be combined, resulting in fewer collections. Field usage was analyzed for frequency and definitions. Common fields were kept. The least used fields were discarded. A master list was developed. New fields were added for creating finding aids or EAD records for special collections. The display labels for field names were applied to a master spreadsheet. Existing CONTENTdm collections were modified to reflect the master data dictionary. When this process was completed, 28 collections had been reduced to 12 collections. CONTENTdm collections were called containers locally and renamed using time spans significant to the holdings of the William Stanley Hoole Special Collections Library. One additional collection was added for University of Alabama digital objects.
The UA digital planning group looked ahead to building a digital program. Institutional repositories and the born-digital resources that would populate them were hot topics! Emerging schemas (MODS, a descriptive metadata schema; METS, a structural metadata schema; and EAD, a descriptive metadata schema for archival collections) were receiving a lot of attention. The Association of Research Libraries continued to emphasize exposing hidden collections. Mass digitization was being practiced. With these factors and more, UA Libraries wanted to be in a position to implement these possibilities. A Digital Services Unit was created to digitize and manage these collections.
Combining collections in CONTENTdm delayed reaching the next license plateau. Still, the prospect of digital objects filling CONTENTdm was real. The Digital Services Unit and the Library Office of Technology decided to move away from CONTENTdm and Dublin Core. Plans were under way for digital objects and their metadata records to reside in a directory. A display and retrieval system for users would be built over a SQL database. The metadata workflow was changed. Now the master spreadsheet began with the archivists. After titles, names, dates, and other descriptive information were added, the spreadsheets would move to the Digital Services Unit for the file names to be added as a step in the digitization process. The spreadsheets would be transformed to MODS records, producing a preliminary record. This temporary record provided some access to resources while the Metadata Unit reviewed the spreadsheet for quality control, establishing names, adding abstracts, and assigning subject and genre terms.
The master spreadsheet fields were mapped to MODS using Excel for a visual crosswalk. The next step was to create a style sheet (XSL) for converting the spreadsheet to MODS records.
This is the template used for the conversion of files to MODS records.
Archivist Utility (AU) uses a style sheet template and tab-delimited files derived from the spreadsheets to create raw MODS records. AU was created by a programmer working with the Digital Services Unit leader. Data can be viewed as text or as a MODS record. These preliminary views provide an opportunity for quality control. The error log reports unused or missing columns from tab-delimited files. Errors are corrected in the spreadsheet, which is then loaded into the utility again. When the librarian is satisfied with the records, they are saved to a folder for the next step. [This is the icon for the Archivist Utility. The University’s mascot is Big Al, an elephant.]
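The conversion and error-log behavior described here can be sketched in a few lines of Python. This is an illustration only, not the Archivist Utility itself: the function name `rows_to_raw_mods`, the `TEMPLATE` mapping, and the `Filename` and `Box` columns are all assumptions, and the sketch omits the MODS namespace for brevity.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical template: spreadsheet column header -> top-level element name.
TEMPLATE = {"Title": "title", "Description": "description", "Filename": "identifier"}


def rows_to_raw_mods(tsv_text, template=TEMPLATE):
    """Build one raw record per row and log missing or unused columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = reader.fieldnames or []
    # Error log: columns the template expects but the file lacks,
    # then columns in the file that the template never uses.
    errors = [f"missing column: {c}" for c in template if c not in columns]
    errors += [f"unused column: {c}" for c in columns if c not in template]
    records = []
    for row in reader:
        mods = ET.Element("mods")
        for column, element in template.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty cells rather than emit empty elements
                ET.SubElement(mods, element).text = value
        records.append(mods)
    return records, errors
```

A reviewer could inspect `errors`, correct the spreadsheet, and run the conversion again, which mirrors the correct-and-reload loop described above.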
The raw MODS records contained data only in the top-level MODS elements. Very simple edits were performed before loading to the server.
A second metadata librarian was hired. She wrote scripts to pull data for names and subjects files, to replace data, and to transform raw MODS data into robust MODS records.
Names were collected from the “name” columns of the spreadsheet into a tab-delimited file using a Python script. The processing included searching for each name in viaf.org. If a name was found, the VIAF_ID and LCCN columns would be populated. Notes stating the search results for each name would be added. After reviewing this file, a metadata librarian would run a script to place the names and numbers into a MADS file.
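The name-collection step might look roughly like the Python sketch below. It is not the librarian's script: `collect_names`, `NAME_COLUMNS`, and the output column order are assumptions, and the VIAF search is represented by a caller-supplied `lookup` function so that no particular VIAF API is implied.

```python
import csv
import io

# Hypothetical set of "name" columns; the real spreadsheets used many role labels.
NAME_COLUMNS = ("Sender(s)", "Recipient(s)")


def collect_names(tsv_text, lookup, name_columns=NAME_COLUMNS):
    """Return one row per distinct (name, role) pair found in the name columns.

    `lookup` stands in for the VIAF search: it takes a name and returns
    (viaf_id, lccn) when a match is found, or None otherwise.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    rows = []
    for record in reader:
        for role in name_columns:
            name = (record.get(role) or "").strip()
            if not name or (name, role) in seen:
                continue
            seen.add((name, role))
            match = lookup(name)
            if match:
                viaf_id, lccn = match
                rows.append([name, role, "personal", "naf", viaf_id, lccn, ""])
            else:
                # Unmatched names stay "local" and carry a note for review.
                rows.append([name, role, "personal", "local", "", "", "no matches"])
    return rows
```

Keeping the lookup separate means the review file can be regenerated from a cached set of results without repeating every search.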
Subject processing uses a Windows PowerShell script. It pulls the subjects into a master list for tagging names, topics, geographic locations, events, genres, occupations, and other headings. After tagging, another script is run to replace the subject headings in the tab-delimited file derived from the spreadsheet with their tagged versions.
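A tagged heading carries enough information to rebuild MODS subject sub-elements later. The Python sketch below illustrates the idea using the four tags shown on the "Subject processing" slide. The production workflow used PowerShell for tagging and XSLT for the final transformation, so this is a stand-in; the function name `tagged_heading_to_subject` and the fixed `authority="lcsh"` attribute are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Tagging key from the slide: tag suffix -> kind of heading.
TAG_KEY = {"#x": "topic", "#z": "geographic", "#p": "personal", "#c": "corporate"}


def tagged_heading_to_subject(heading):
    """Turn 'Incorporation#x--West Virginia#z' into a <subject> element."""
    subject = ET.Element("subject", {"authority": "lcsh"})
    for part in heading.split("--"):
        # Each subdivision must end in a tag such as #x or #z.
        match = re.fullmatch(r"(.+)(#[a-z])", part.strip())
        if match is None:
            raise ValueError(f"untagged subdivision: {part!r}")
        text, tag = match.groups()
        kind = TAG_KEY.get(tag)
        if kind is None:
            raise ValueError(f"unknown tag {tag!r} in {part!r}")
        if kind in ("personal", "corporate"):
            # Names nest as <name type="..."><namePart>...</namePart></name>.
            name = ET.SubElement(subject, "name", {"type": kind})
            ET.SubElement(name, "namePart").text = text
        else:
            ET.SubElement(subject, kind).text = text
    return subject
```

Raising an error on an untagged or unknown subdivision surfaces headings that were missed during tagging instead of passing them through unlabeled.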
PURLs, Persistent Uniform Resource Locators, are needed for the digital object and the digital collection. A master list of collection-level PURLs was collected monthly and passed to the Metadata Services Unit. A Python script is used to pull identifiers from the spreadsheets in order to gather the PURLs for digital objects from the server. The results are provided in an XML list.
A transformation puts the pieces together. The raw MODS records are in a project folder used in XMLSpy. An XSLT transformation populates the raw MODS records from the files generated earlier. The authority attribute used with the name tag is populated from the MADS files. From the XML list, PURLs are added to their appropriate identifier tags. The subject tags (#c, #p, #x, etc.) are recognized so that sub-elements for name, topic, geographic, and other headings are assigned. The final step includes validation through Schematron using the item-level PURLs. When the Schematron validation is free of errors, the final MODS records are loaded to the server, replacing the preliminary MODS records.
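To make the merge step concrete, here is a small Python sketch of enriching a raw record with a name authority and a PURL. The actual workflow used an XSLT transformation in XMLSpy followed by Schematron validation, neither of which is reproduced here. The function name `enrich` and the two dictionaries (standing in for the MADS file and the XML list of PURLs) are assumptions, and the sketch ignores XML namespaces.

```python
import xml.etree.ElementTree as ET


def enrich(mods, name_authorities, purls):
    """Add authority attributes to names and a uri identifier for each known PURL.

    name_authorities: maps a namePart string to an authority code (e.g. "naf").
    purls: maps a local identifier to its PURL.
    """
    for name in mods.iter("name"):
        part = name.findtext("namePart", default="")
        authority = name_authorities.get(part)
        if authority:
            name.set("authority", authority)
    # findall returns a list, so appending to mods inside the loop is safe.
    for ident in mods.findall("identifier[@type='local']"):
        purl = purls.get(ident.text)
        if purl:
            uri = ET.SubElement(mods, "identifier", {"type": "uri"})
            uri.text = purl
    return mods
```

Names or identifiers with no entry in the lookup tables are left unchanged, so a later validation pass can flag records that are still incomplete.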
By moving to a directory structure, we are able to implement OAI, ETDs, and EADs. This structure allows flexibility to adopt other schemas. It is important to know the standards. Knowing how to manipulate large amounts of data is invaluable!