2. Summary
Metadata inconsistency
Overview of vendor authority process
Further work with OpenRefine
Next steps
http://www.utahindians.org
3. Inconsistency
Gosiute Indians
Goshute Indians
Navajo Indians
Navaho Indians
Salt Lake
Salt Lake City
Salt Lake City (Utah)
Bishop, Dail Stapley
Bishop, Dale Stapely
Bishop, Dale Stapley
Beckwith, Frank A. (1876-1951)
Beckwith, Frank Asahel (1876-1951)
Beckwith, Frank A.
Beckwith, Frank A. (1876-1951)
Beckwith, Frank Asahel (1876-1951)
Beckwith, Frank Asahel, 1876-1951
Woven basket or jug;
http://content.lib.utah.edu/cdm/ref/collection/UU_Photo_Archives/id/13887
4. Project Timeline
June-Sept. 2012 – Define project
Oct. 2012 – May 2013 – Testing
June 2013 – Contracted with Backstage Library Works
June 2013-Feb. 2014 – Continued testing
Feb.-May 2014 – 17 collections processed
June-Aug. 2014 – Manual review (intern)
April 2015-today – Explore OpenRefine
5. Methodology
<title>A group of St. George (Sibwit) Paiutes and Wickiups (cedar)</title>
<subjec>Paiute Indians; Ute Indians--History; Wickiups; Indians of North America--Dwellings;</subjec>
<covspa>Utah;</covspa>
<descri>A group of people sitting and standing in front of a brush shelter;</descri>
<publis>Digitized by: J. Willard Marriott Library, University of Utah;</publis>
<type>Image;StillImage;</type>
<format>image/jpeg;</format>
http://content.lib.utah.edu/cdm/ref/collection/uaida/id/14697
7. Backstage: standardization
Capitalization, Punctuation, and Updated Authorized Access Points
Forests and Forestry – Utah
forests and forestry -- Utah
Forest lands - Utah
Forests and forestry--Utah
A group of Navajos at Navajo Mountain government school;
http://content.lib.utah.edu/cdm/ref/collection/uaida/id/43551
8. Backstage: problems encountered
Missing MARC tags
Names treated as topical headings and vice versa
Provo => Provisional IRA
Data in wrong fields
Date: Price Hiram, 1814-1901
Incorrect match
Local names matching wrong records
Johnson, Abe is not Johnson, F. T.
Walker War Map 1853-1854;
http://content.lib.utah.edu/cdm/ref/collection/uaida/id/15474
10. OpenRefine project
◦ Used UAIDA as a pilot, since it had the greatest number of unmatched names due to the size of the collection (over 8,000 items)
◦ 529 unmatched names after Backstage process
Navajo woman weaving,
http://content.lib.utah.edu/cdm/ref/collection/uaida/id/45379
11. OpenRefine: two approaches
Reconciliation process developed by Jenn Wright and Matt Carruthers, University of Michigan Library,
https://github.com/mcarruthers/LCNAF-Named-Entity-Reconciliation
Reconciliation process developed by Roderic Page,
http://iphylo.blogspot.com/2013/04/reconciling-author-names-using-open.html
A group of Navajo children and teenagers,
http://content.lib.utah.edu/cdm/ref/collection/uaida/id/43285
12. OpenRefine: differences in results
Both processes found name matches through searching VIAF.
◦ Wright and Carruthers’ process looked for a matching LC authority record in the VIAF cluster
81 records were matched, 132 were false matches, and 312 had no match.
◦ Page’s process matched names to authors in a more general fashion
70 records were matched, 37 were false matches, and 449 had no match.
14. OpenRefine: updating UAIDA
We updated an additional 455 records with corrected names.
405 matches came from both processes, 38 were unique to Wright and Carruthers, and 5 were matched by the Page process.
Eight Hopi Baskets,
http://content.lib.utah.edu/cdm/ref/collection/uaida/id/45009
15. OpenRefine: student work
Fall 2015 – student ran additional unmatched items from other collections through OpenRefine with the Wright & Carruthers process
Metadata librarian currently reviewing student work and updating collections
17. Next Steps: Reconcile across more collections
CONTENTdm metadata exported in SOLR
Easier to get list of personal names across all collections
Explore other reconciliation methods
18. Next Steps
URIs in Digital Collections Metadata, MWDL (Primo), and DPLA
http://content.lib.utah.edu/cdm/ref/collection/uaida/id/43183
19. Questions?
Anna Neatrour | anna.neatrour@utah.edu
Metadata Librarian
Jeremy Myntti | jeremy.myntti@utah.edu
Interim Head, Digital Library Services
Forthcoming article:
Use Existing Data First: Reconcile Metadata Before Creating New Controlled Vocabularies. Journal of Library Metadata.
http://dx.doi.org/10.1080/19386389.2015.1099989
Editor’s notes
I’m Anna Neatrour, Metadata librarian at the University of Utah Marriott Library. I’m also presenting on behalf of Jeremy Myntti who is Interim Head of Digital Library Services.
I’ll provide an overview of the metadata problems we had, explain what the vendor-supplied authority service did, go over further reconciliation work, and talk about our plans for the future. Throughout this presentation you’ll see examples of items from the Utah American Indian Digital Archive, which was one of the collections we processed.
We’ve been creating digital library collections for a long time. Our existing records had a great deal of inconsistency both for personal names and subjects. Having six different ways of expressing the name of one person is really bad, and leads to problems for users in faceting and discovery.
The project started in 2012, and we’re just moving on to a different stage of it today. Seventeen of our collections were processed by Backstage Library Works, and we did additional review and post-processing.
Processing digital collection XML files is similar to existing processes developed at Backstage for MARC records for names and subjects. We had previously done other projects at our library where we changed raw CONTENTdm metadata.
Backstage did their automated authority control on the CONTENTdm desc.all file, which is the way metadata is stored internally in the system. Different from the file you get if you export metadata as a collection manager in CONTENTdm.
If you have hosted CONTENTdm, you cannot do this.
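As a rough illustration of what working with raw CONTENTdm metadata involves, here is a minimal sketch that pulls the semicolon-delimited values out of a flat record like the one on the methodology slide. The regex approach and the sample record are illustrative only; the actual processing was done by Backstage against the desc.all file.

```python
# Minimal sketch: extract semicolon-delimited values from flat
# CONTENTdm-style XML. Tag names (title, subjec, ...) follow the
# truncated 6-character forms CONTENTdm uses internally.
import re

record = """<record>
<title>A group of St. George (Sibwit) Paiutes and Wickiups (cedar)</title>
<subjec>Paiute Indians; Ute Indians--History; Wickiups; Indians of North America--Dwellings;</subjec>
</record>"""

def field_values(xml: str, tag: str) -> list[str]:
    """Return the trimmed, non-empty values of a semicolon-delimited field."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", xml, re.DOTALL)
    if not m:
        return []
    return [v.strip() for v in m.group(1).split(";") if v.strip()]

print(field_values(record, "subjec"))
# -> ['Paiute Indians', 'Ute Indians--History', 'Wickiups',
#     'Indians of North America--Dwellings']
```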
Skip-----------
Extracting data from CONTENTdm
Stop updates to collection and make it read-only
Make copy of desc.all metadata file for backup.
Run desc.all file through AC processing from Backstage
Replace desc.all file on CONTENTdm server
Run the full collection index
Remove read-only status from collection
Get reports on matches, no matches, and changes report back from Backstage.
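The backup step in the list above can be sketched as follows; the paths are hypothetical stand-ins, since the real desc.all lives on the CONTENTdm server:

```python
# Sketch: copy a collection's desc.all to a dated backup before
# sending it off for authority-control processing. The directory
# layout here is a hypothetical local stand-in.
import shutil
from datetime import date
from pathlib import Path

def backup_desc_all(collection_dir: str) -> Path:
    """Back up desc.all as desc.all.YYYYMMDD.bak and return the backup path."""
    src = Path(collection_dir) / "desc.all"
    dst = src.with_name(f"desc.all.{date.today():%Y%m%d}.bak")
    shutil.copy2(src, dst)  # copy2 preserves timestamps along with content
    return dst
```

The same pattern works in reverse for the replacement step: verify the processed file parses, then copy it over the live desc.all before reindexing.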
The matched headings report also included URIs for id.loc.gov and VIAF in some cases, so we have URIs for items we can use if we want to express our information as linked data in the future.
Skip-----------------
For UAIDA:
Creator/Contributor names (7033)
10% changed (669)
48% matched (3342)
Subjects (98931)
21% changed (21072)
76% matched (75471)
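The percentages follow directly from the counts above; a quick check:

```python
# Reproduce the UAIDA percentages from the raw counts on this slide.
uaida = {
    "names changed":    (669,   7033),
    "names matched":    (3342,  7033),
    "subjects changed": (21072, 98931),
    "subjects matched": (75471, 98931),
}
for label, (part, total) in uaida.items():
    print(f"{label}: {part/total:.0%}")
# names changed: 10%, names matched: 48%,
# subjects changed: 21%, subjects matched: 76%
```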
In addition to matching authorities, Backstage helped with fixing punctuation problems. Here’s an example of three poorly punctuated subject heading variants that got fixed.
Skip-------
Space double dash space
Word em dash
Single dash
Convert all to double dash
Each collection may be different, so need to watch out for which ways to standardize (single dash may not need to be converted in some collections)
Capitalizing every word
Capitalizing nothing
Convert to the correct capitalization according to LCSH
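A minimal sketch of the dash standardization described above. Whether single dashes are converted is a per-collection decision, so it is a flag here; LCSH capitalization is not attempted, since that requires checking against the authority file itself.

```python
# Normalize subject-heading separators to LCSH's plain double dash:
#   " -- " (space double-dash space), "—" (em dash), " - " (single dash).
import re

def normalize_heading(h: str, convert_single_dash: bool = True) -> str:
    h = re.sub(r"\s*--\s*", "--", h)      # " -- " -> "--"
    h = re.sub(r"\s*\u2014\s*", "--", h)  # em dash -> "--"
    if convert_single_dash:               # optional per collection
        h = re.sub(r"\s+-\s+", "--", h)   # " - " -> "--"
    return h

print(normalize_heading("Forests and forestry -- Utah"))  # Forests and forestry--Utah
print(normalize_heading("Forest lands - Utah"))           # Forest lands--Utah
```

Note that date ranges like "1876-1951" are left alone, because the single-dash rule requires surrounding whitespace.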
Older forms of names used (pre-RDA)
Cross references used rather than authorized access point
This happened a lot because of training issues. For a few years, many students didn’t realize the correct form of an access point should come from the 1xx field in the authority record.
“access points” = heading
Backstage makes certain assumptions about the headings
Part of that is shunting data into a topical subject heading
as opposed to a geographic or personal or corporate name headings
See sample of matching issues on this slide:
-skip-
So “Provo” goes in as a 650 field; our system performs the match as a subject heading, so we find an authority where the 110 field is “Provisional IRA” and the 410 field is “Provo.”
With “Cars” we searched this as a generic name heading and lopped off the date of 2002, so we found the conference heading instead.
We found that we need to be careful with single-word headings in CDM. In fact, our recommendation might be to not search those at all or, if we do, to report them rather than update them.
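That recommendation is easy to automate: flag single-word headings ("Provo", "Cars") for report-only review instead of letting the batch process update them. A minimal sketch:

```python
# Split headings into safe-to-search and report-only lists, treating
# any single-word heading as too ambiguous to auto-update.
def triage_headings(headings: list[str]) -> tuple[list[str], list[str]]:
    search, report_only = [], []
    for h in headings:
        (report_only if len(h.split()) == 1 else search).append(h)
    return search, report_only

search, report_only = triage_headings(
    ["Provo", "Cars", "Forests and forestry--Utah", "Price, Hiram, 1814-1901"]
)
print(report_only)  # -> ['Provo', 'Cars']
```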
Backstage made 85,000 changes in our digital collections. Review turned up 200 problems in the changed data, which we then fixed.
Need for lots of manual review for some collections
Intern reviewed collections for 3 months – fixed nearly 2000 mistakes (mostly from the metadata rather than the authority control process).
The process pointed out the need for training to encourage more consistency: use correct access points; standardize punctuation, spacing, capitalization, date formats, and field usage; decide between NACO and local authorities.
Used Utah American Indian Digital Archive first since it was a large collection. Wanted to do further reconciliation for the names Backstage wasn’t able to match and see if we could enhance the collections even more.
We tested out two different approaches to matching personal names to see what would work best.
Found better results with matching just against LC name authorities, as would be expected since the collection we were working on was a regional Utah collection.
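Both approaches boil down to the same kind of lookup: turn a local name string into a query against a name authority service and inspect the candidate matches. Here is a sketch against the id.loc.gov suggest endpoint; the endpoint shape and the sample response are assumptions for illustration, and the actual processes work through VIAF as described above.

```python
# Sketch: build an id.loc.gov "suggest" query for a name and parse a
# response in the OpenSearch-suggestions shape: [query, labels, notes, uris].
from urllib.parse import urlencode

def lcnaf_suggest_url(name: str) -> str:
    return "https://id.loc.gov/authorities/names/suggest/?" + urlencode({"q": name})

def parse_suggest(response: list) -> list[tuple[str, str]]:
    """Pair each candidate label with its authority URI."""
    _, labels, _, uris = response
    return list(zip(labels, uris))

# Canned response for illustration; the URI is a hypothetical
# placeholder, not a real LCCN.
sample = ["Beckwith, Frank",
          ["Beckwith, Frank Asahel, 1876-1951"],
          [""],
          ["http://id.loc.gov/authorities/names/no00000000"]]

print(lcnaf_suggest_url("Beckwith, Frank"))
print(parse_suggest(sample))
```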
We have no Carlos Santana materials in this collection! Manually reviewed matches in OpenRefine.
Updated the desc.all file locally and reindexed the collection to get these changes live.
At the end of this we combined all the matched names, and did a further update of the Utah American Indian Digital Archive desc.all collection.
Skip-------
Wright and Carruthers process had 262 undo/redo actions in the OpenRefine project
Page’s reconciliation process resulted in 424 undo/redo actions
We update manually if there are only a few changes, but we can also script against the desc.all as we did with UAIDA if any of the other collections have extensive updates.
Create NACO records for notable people
Investigate local controlled vocabularies for more regional personal names
A few weeks ago, our metadata for our CONTENTdm collections was dumped into SOLR.
Expect it will be much easier to work with.
Building unified list of creators and contributors.
Want to explore additional means of reconciliation.
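One way to build that unified creator list from the Solr dump is a facet query over the creator field; the core name and field name here are assumptions about the local setup, not confirmed details.

```python
# Sketch: build a Solr facet query that returns every distinct value
# of a creator field, with no documents in the response.
from urllib.parse import urlencode

def creator_facet_url(solr_base: str, field: str = "creato") -> str:
    params = {
        "q": "*:*",
        "rows": 0,            # facet counts only, no document rows
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,    # return every distinct value
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"

print(creator_facet_url("http://localhost:8983/solr/digital"))
```

The facet response then gives a de-duplicated list of names with usage counts, which is exactly the input a reconciliation pass needs.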
Put URIs in digital collections metadata, test how they appear in our repository, in MWDL, and DPLA.
We have been putting geonames URIs in our metadata, but not personal name URIs yet.
Could easily add the URIs at a future date when our repository can do more with them.
Have several collections now where we have confidence in our metadata being cleaned up and matched to authorities, so this is a great base as we want to explore using Linked Data more.
We have an article coming out in the Journal of Library Metadata that goes into greater detail about what I presented here today. Happy to take your questions.