Presented by Shalimar Fojas White at the Annual Conference of the Visual Resources Association, March 12-15, 2014 in Milwaukee, Wisconsin.
Session #6: Back to Basics — Cataloging Workflows and Solutions ORGANIZER/MODERATOR: Jennifer Kniesch, Dickinson College
PRESENTERS:
• Marie Elia, Warhol Museum
• Vicki Sipe, University of Maryland, Baltimore County
• Kelly Smith, Lafayette College
• Shalimar Fojas White, Dumbarton Oaks Research Library and Collection
Sometimes the abyss of cataloging can leave some of us questioning, searching, harvesting, and questioning some more. In this session you'll hear from four professionals who will present their step-by-step procedures for cataloging: from how analog and digital assets are cataloged to how assets are uploaded into a digital asset management system. We will hear about the speakers' institutional projects, their pitfalls and triumphs in cataloging, and the cost of trying to make cataloging work. Each speaker will present on their topic, and time will be left at the end of the session for Q&A.
Marie Elia will provide details on how The Warhol Museum uses a combination of cataloging standards (analog and digital) to accommodate an art collection composed of archival materials, as well as the Warhol's content management system. Vicki Sipe will discuss her step-by-step collaborative cataloging process between the University of Maryland, Baltimore County's Special Collections and its Bibliographic and Metadata Services department, using historic image collections as examples. Shalimar Fojas White will explain how she and her team are currently migrating records out of legacy databases and into a new content management system that is being developed with a VRA Core template. Kelly Smith will describe how Shared Shelf fits into the workflow of day-to-day cataloging for Lafayette College's Visual Resources Collection and how it is being explored as an option for other digital projects and needs across campus. Is there a best workflow, and can we find a solution for our respective institutions?
1. METADATA MIGRATION AND INTEGRATION AT DUMBARTON OAKS
Shalimar Fojas White
Image Collections and Fieldwork Archives
Dumbarton Oaks Research Library and Collection
Visual Resources Association Annual Conference
March 13, 2014, 1:35 PM
10. CATALOGING INVENTORY:
• EmbARK – 61,000 image records
• OLIVIA – 5,000 image records
• MS Access – 5,000 image records
• Word/PDF – 15 archival finding aids
13. METHODOLOGY
• Objective evaluation
• Team approach (user-defined)
• Peer reviews and resources:
  – Council of Nova Scotia Archives: AMS Review (2008)
  – CLIR Report by Lisa Spiro (2009)
  – Museum Association of New York CMS list (2011)
  – Canadian Heritage Information Network (CHIN) guidance and template
14. SYSTEM REQUIREMENTS
• Handle archival and image descriptions
• Unlimited hierarchical relationships
• Afford complex geographic cataloging
• Item-level storage locations
• Global search
• Local authorities
• Data reuse
• Accessioning
• Import for legacy metadata migration
• Exports for HOLLIS, OASIS, and VIA
50. SPRING 2014
- Complete VRA Core template development
- Import accession records

SUMMER 2014
- Test imports of image records
- Test uploads of digital objects

FALL 2014
- Import backlog records and images
- Contribute collection-level records to HOLLIS, WorldCat, etc.
53. THE ICFA TEAM:
Anne-Marie Viola, Metadata and Cataloging Specialist
Rona Razon, Archivist
Fani Gargova, Byzantine Research Associate
Beth Bayley, Jessica Cebra, and Ameena Mohammad, Departmental and Archival Assistants
Shalimar Fojas White, Manager
SPECIAL THANKS TO:
Prathmesh Mengane, Database and CMS Specialist
Alison Miner, ICFA Intern (Fall 2012)
Artefactual Systems, Inc.
Speaker notes
Dumbarton Oaks is a research institute of Harvard University, dedicated to supporting scholarship in three very distinct fields of study: Byzantine, Pre-Columbian, and Garden and Landscape Studies.
My department, the Image Collections and Fieldwork Archives (ICFA), has a bifurcated name that reflects its history as two separate departments that were brought together in the 1990s.
However, for most of the fifty years since Dumbarton Oaks' founding in the 1940s, the Photograph Collection was distinct in organization, purpose, and physical location
from the Fieldwork Archives, which collected papers related to the archaeological and conservation projects sponsored by Dumbarton Oaks and the Byzantine Institute.
When I arrived at Dumbarton Oaks three years ago, this separate history was reflected in how the collections had been cataloged.
The archival collections had finding aids as Word documents or PDFs and were described hierarchically.
The image collections had item-level cataloging records, which could be found in several legacy systems. There was no way to search across all our holdings, which was particularly problematic given the nature of our collections – the fieldwork photography had been separated from the fieldwork archives, thereby losing valuable context.
Since the rest of our half a million images had never been fully inventoried, we needed to establish full intellectual and physical control over our holdings. We embarked upon a comprehensive inventory of all the collections, which also included an inventory of all existing cataloging.
In addition to rounding up all versions of the finding aids, we also turned up multiple legacy datasets for the image collections: EmbARK, OLIVIA/VIA, and several Access databases, for a total of 71,000 image records. How we are attempting to create an aggregated data repository for all these datasets will be the subject of my talk today. But first, a disclosure.
I am not a cataloger, nor have I played one on TV. The work I will describe has been spearheaded by Anne-Marie Viola, our Metadata and Cataloging Specialist, whom I set out to hire as soon as it became clear that, for our inventory to work, we needed an integrated system for collection management. Hired in January 2012, Anne-Marie dove into the system selection process with verve.
She developed a methodology based on objective evaluation of systems against a set of user-defined criteria developed by our interdisciplinary team.
We held several requirements-gathering sessions with our team to compile all of the necessary features of our ideal system. Our Archivist needed to be able to continue creating finding aids for our archival collections, with multiple levels of description, from the collection level through subgroups, series, and subseries, down to the folder level. Our Byzantine Research Associate insisted on the necessity of browsability by place, the primary method by which scholars had traditionally used our collections, which are arranged geographically by medium. As the manager, I needed a way to record and track accessions, donors, and rights/restrictions, since our predecessors had left us only paper accession logs and records. And of course, there were those 71,000 legacy image cataloging records that we needed to bring together. Eventually, the requirements were whittled down to the high-priority deal-breakers.
Anne-Marie reviewed approximately 20 systems, both proprietary and open source. She evaluated them against our requirements list, using the CHIN evaluation template. Eventually, we got it down to 4 candidate systems, two proprietary and two open source, and conducted demonstrations, peer reviews, and reference interviews. We settled on ICA-AtoM.
ICA-AtoM stands for International Council on Archives – Access to Memory; it is open-source archival collection management software.
The system is based on standards developed and promulgated by the International Council on Archives (ICA), with initial funding from UNESCO. We selected the system for two main reasons. The first was flexibility: AtoM allows you to catalog in a number of different record templates, including ISAD(G), MODS, RAD, and Dublin Core.
Since AtoM is open source, we could develop an additional cataloging template for VRA Core. And once data is entered into the system, we can get it back out as XML, such as EAD XML, so that we can contribute our finding aids and records to other distribution channels like Harvard's OASIS, WorldCat, and ArchiveGrid. The second reason was cost: because the system is open source, there is no ongoing licensing fee, and we could devote our limited funds to the development of additional functionality. Simply put, we were priced out of the proprietary systems.
Once we selected the system, we followed two parallel routes for feature development. We took on the development of the VRA Core cataloging template in house, since Prathmesh Mengane, our Database and CMS Specialist, was confident that he could work from the existing Dublin Core plugin to adapt it to VRA Core.
For more complex functionality related to geographic description, we sponsored development through Artefactual Systems, the lead developers of AtoM and of Archivematica, the digital preservation system. This team of archivists and programmers recognized the benefit of developing a browsable vocabulary of geographic terms for their larger user community and agreed to take on the project. So we put our small CMS budget toward this sponsored development in a forthcoming version of AtoM. In the meantime, we focused on the in-house development…
To guide Prathmesh's development of the VRA Core cataloging template, Anne-Marie conducted an analysis of all extant cataloging in our legacy datasets.
She evaluated all the fields used in EmbARK, OLIVIA, and the various Access databases, and mapped them to VRA Core's elements.
Then she developed a crosswalk between Qubit (the backend database for AtoM), the existing fields in the Dublin Core template, the VRA Core elements, and the new AtoM display labels we desired. Where elements did not map, Prathmesh would have to create new fields for our new VRA Core template.
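To illustrate the shape of such a crosswalk, here is a minimal sketch in Python. Every system, field, and label name below is a hypothetical stand-in, not ICFA's actual mapping; the real crosswalk was maintained as a reference document, not code.

```python
# Hypothetical illustration of a field-level crosswalk: each row maps a
# legacy field to a VRA Core element, a Qubit backend field, and a
# display label. None of these names are ICFA's real mappings.
CROSSWALK = [
    # (legacy system, legacy field, VRA Core element, Qubit field, display label)
    ("EmbARK", "Object Title",  "title",    "title",              "Title"),
    ("OLIVIA", "Photographer",  "agent",    "actor",              "Creator"),
    ("Access", "Site",          "location", "place_access_point", "Place"),
    ("EmbARK", "Date of Photo", "date",     "event_date",         "Date"),
]

def map_record(system, record):
    """Remap one legacy record's fields to VRA Core element names;
    fields with no mapping are collected for review, since they signal
    where new template fields are needed."""
    lookup = {(sys, fld): vra for sys, fld, vra, _, _ in CROSSWALK}
    out = {}
    for field, value in record.items():
        element = lookup.get((system, field))
        if element:
            out[element] = value
        else:
            out.setdefault("unmapped", []).append((field, value))
    return out
```

The unmapped leftovers surface exactly the cases where a new field had to be created in the VRA Core template.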
Finally, Anne-Marie created a Metadata Application Profile: a reference document for Prathmesh that specified the encoding scheme, elements, attributes, and labels to be used in the AtoM collection management system, along with the type of field required, whether free text, a controlled list, or a link to an existing taxonomy in AtoM (Names, Places, or Subjects).
While Prathmesh got started with development on a staging environment, Anne-Marie started to export datasets from the legacy systems. In some cases this was relatively straightforward, as with Access and EmbARK, which had CSV exports.
Other cases, such as OLIVIA, required the assistance of technicians at Harvard’s Digital Repository System, since exporting data from this complex hierarchical system could only be achieved piecemeal, table by table. In the end, Anne-Marie had to reconstruct the original dataset from several spreadsheets. This points to another benefit of AtoM, the ability to get our data back out as XML. For those of you currently in the process of selecting a system, remember that you will always need an exit strategy. Any system, no matter how perfect now, will be obsolete within 5-7 years, so plan accordingly and make sure that you can get your data out for reuse and migration into another system.
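Reassembling a dataset from piecemeal, table-by-table exports boils down to rejoining the tables on their key columns. A minimal sketch, assuming two hypothetical CSV exports (an images table and a places table, with invented column names), might look like:

```python
import csv

def load_table(path, key):
    """Load one table export into a dict keyed on its ID column."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def reconstruct(images_path, places_path):
    """Rejoin two piecemeal table exports into flat image records by
    resolving each image's place_id against the places table.
    (Column names here are assumptions for illustration.)"""
    places = load_table(places_path, "place_id")
    records = []
    with open(images_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            place = places.get(row.get("place_id", ""), {})
            row["place_name"] = place.get("name", "")
            records.append(row)
    return records
```

The same join, repeated for each exported table, is essentially what rebuilding a hierarchical dataset from several spreadsheets amounts to.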
Next came what I like to call data hygiene, otherwise known as data cleanup. Just because it was easy to export a dataset did not mean that cleaning it up was equally straightforward.
In the case of EmbARK, the system added additional characters that would cause column shifting and line breaking, which could be detected by scanning the dataset in Excel.
Once we discovered the nature of the extra characters and the places where EmbARK typically added them, Anne-Marie developed a method of globally replacing them in BBEdit, an HTML and text editor. Since most of the legacy cataloging had been done on a project basis, Anne-Marie could export discrete sets of data that were more manageable than the 71,000 total records.
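As a rough illustration of that kind of global replacement (the actual work was done interactively in BBEdit, and the specific stray characters below are assumptions), a short script could strip control characters that break rows before re-parsing the CSV:

```python
import csv
import io
import re

# Stray control characters (assumed here to be vertical tabs, form
# feeds, and lone carriage returns) embedded in field values make rows
# break and columns shift when the export is opened in Excel.
STRAY = re.compile(r"[\x0b\x0c\r](?!\n)")

def clean_csv(raw_text):
    """Globally replace embedded control characters with a space, then
    parse the repaired text as CSV rows."""
    cleaned = STRAY.sub(" ", raw_text)
    return list(csv.reader(io.StringIO(cleaned)))
```

Doing the replacement on the raw text, before parsing, is what reunites the fragments of a row that a stray character had split in two.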
At this point, Anne-Marie and Fani reviewed the datasets to develop the work records that would be used, since none existed for the legacy records. We started with our fieldwork photography, which concentrated on specific Byzantine sites and monuments. This work helped Anne-Marie plan out the next step:
Data Quality Assurance
Now the detailed quality control could really begin. We all know what messy and dirty data look like, but in some cases it amazed even us. Many of these datasets had been entered over the course of 20 years with no discernible common standard applied across them. In one dataset I worked on, describing 3,200 architectural drawings, the name of the department at the time, “Fieldwork Archive, Dumbarton Oaks,” was entered at least 48 different ways. Many fields were inconsistently used, with the Site field containing values of varying specificity: regional terms like “Asia Minor,” countries like “Turkey,” cities like “Istanbul,” and even generic terms like “church.”
Thankfully, one of our interns, Alison Miner, introduced us to the tool we now use for most of our data standardization: OpenRefine.
Previously known as Google Refine, the tool is a program you download and use within a browser to manipulate your datasets. You can facet any column and see all the values entered in that field, along with their frequency distribution, so you can easily spot outliers and clean up typos and variants in capitalization and spacing. OpenRefine also has a very powerful clustering tool that can suggest matches between values that are spelled similarly, sound alike, or may have been mistakenly keyed in.
There are also common transformations you can perform to remove leading and trailing spaces or globally change the case of your values. The best part? Everything is reversible, and your original dataset is untouched. Once you are done making your changes, you can simply export a fully cleaned version of your dataset as a CSV file, ready for additional review.
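For readers who prefer scripting, the same trim-and-cluster idea can be approximated in Python. This is a simplified imitation of OpenRefine's fingerprint keying, not its exact algorithm:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Simplified imitation of OpenRefine's fingerprint keying: trim,
    lowercase, drop punctuation, and sort the unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group variant strings that collide on the same fingerprint, the
    way OpenRefine's key-collision clustering suggests merges."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]
```

Most of those 48 renderings of the department name would collapse into a single cluster under a keying function like this, since word order, punctuation, case, and spacing all wash out of the fingerprint.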
The final step is another review with our Byzantine Subject Specialist to ensure that the values used conform not only to existing vocabularies, such as the Getty's AAT and TGN, but also to accepted norms within the field of Byzantine studies. Since we serve such a specialized audience, we have to be fairly precise about things like using naos instead of nave, or providing alternative titles for monuments like Hagia Sophia, variously known as Ayia Sofia, Santa Sofia, and Saint Sophia, which is located in Istanbul / Constantinople. In certain cases, our local preferred term may diverge from those used by the Library of Congress or other authorities, but we always provide the alternate terms as well. This essentially serves as a reality check for us librarians and archivists, and it points to the benefit of a multidisciplinary team like ours in ICFA, where we can incorporate all perspectives to improve our users' experience.
In fact, while Anne-Marie was busy with the data cleanup, the rest of us were simultaneously busy populating the CMS.
Fani was working on expanding and refining our taxonomy of geographic terms, adding URIs for matching concepts in vocabularies like TGN and Pleiades. The goal is to get this taxonomy imported into AtoM as SKOS XML. Wherever possible, we included URIs from Linked Data sets, since in the future we will be able to export our authorities from AtoM as SKOS and hopefully contribute to Linked Open Data repositories like those of the Getty, Pleiades, etc.
We also developed over 400 authority records, all of which conform to the EAC-CPF standard (Encoded Archival Context – Corporate Bodies, Persons, Families). As with the other taxonomies, we tried as much as we could to match them to existing records in vocabularies like VIAF, LCNAF, and ULAN. The AtoM authority records can be exported as EAC XML, which we also hope to eventually contribute to the National Archival Authorities Cooperative (NAAC), which built on the SNAC project (Social Networks and Archival Context) to use Linked Data and data visualizations to recreate historical social networks from archival records.
Meanwhile, our Archivist, part-time staff, and interns were converting our existing Word finding aids into AtoM records.
This was primarily accomplished by data entry, cutting and pasting text, but in the future we plan to accomplish much of our pre-processing of archival collections through the system itself. The number of eyes and hands doing the data entry necessitated the development of two key documents: a finding aid Style Guide and a Workflow for the CMS.
We planned for a soft launch of AtoM@DO in December 2013, making it available within the Dumbarton Oaks community only.
This allowed us to conduct 2 weeks of usability testing, where we identified problems and addressed them before our public launch in February 2014. We also developed help documentation tailored to the issues brought up by our testers – who were rewarded with Starbucks gift cards for their time and input.
Our February hard launch consisted of all of our existing archival finding aids, representing approximately 40 collections. Some of these are more robustly described than others, which have minimal collection-level records. However, we are still processing archival collections as we speak, so new content will be added going forward.
But, more importantly, we are in the final stages of Prathmesh’s development of the VRA Core cataloging template.
As of last week, we were down to three fields! The complications have mostly involved linking certain fields to the taxonomies, which we believe are crucial to the success of the venture. Our users' ability to browse by Place and locate records by their creators' Names will be vitally important to the success of AtoM@DO as an integrated data repository for our blended photograph and archival collections.
So, next steps. Once the VRA Core template is completed, we plan to start test imports of Anne-Marie’s cleaned up datasets, which have been waiting in the wings, ready to go, along with their associated digital images.
We hope to start importing VRA Core-compliant image records into AtoM this summer and to work steadily through the backlog of 71,000 records during the remainder of the year. I have also finally been able to OCR and transcribe our paper accession records, and we plan to start importing those into AtoM's accessions module in the coming months.
As the manager of this complex, but infinitely fascinating, collection of photographs and archival records, I found that one of the main selling points of AtoM was that it combined the collection management system we so desperately needed with a discovery tool enabling our users to access our data online, even before we had gained full intellectual and physical control over our holdings.
AtoM@DO is a work in progress, but it is flexible enough to allow us to provide access while we rationalize collection management behind the scenes. For now, our users will see only our pretty records in AtoM@DO, with no idea of the messy paper records, multitude of old finding aids and inventories, small menagerie of legacy systems, and behemoth backlog of dirty data behind them.
Thankfully, with a great team in place, ICFA can go back to basics and concentrate on reuniting our separated collections through metadata, using the legacy that our predecessors have left us as a foundation from which to move forward into the future.