Libraries, archives, and museums now have massive digital
holdings. There is tremendous potential for library and
information science, computer science and computer engineering
researchers to partner with cultural heritage institutions and
make our digital cultural record more useful and usable. In
particular, there is a significant need to bridge basic research in
areas such as computer vision, crowdsourcing, natural language
processing, multilingual OCR, and machine learning to make this
work directly usable in the practices of cultural heritage
institutions. In this talk, I discuss a series of exemplar projects,
largely funded through the Institute of Museum and Library
Services National Digital Platform initiative, that illustrate some
key principles for building applied research partnerships with
cultural heritage institutions. Building on Ben Shneiderman’s
The New ABCs of Research: Achieving Breakthrough
Collaborations, I focus specifically on why the public purpose
and missions of cultural heritage institutions are particularly
valuable in establishing new kinds of collaborations that can
simultaneously advance basic research and the ability of people
around the world to engage with their cultural record.
2. TALK ROADMAP
- Context on where I’m coming from
- The New ABCs of Research as Framework
- Examples from IMLS National Digital Platform Projects
- Examples of initiatives from LC Labs
- Some Jumping-off Points and Applied Grand Challenges
18. Extending Intelligent Computational Image Analysis for
Archival Discovery (LG-71-16-0152-16), Board of Regents
of the University of Nebraska, $462,317. The Image
Analysis for Archival Discovery (Aida) research team at the
University of Nebraska-Lincoln will investigate the use of
image analysis as a methodology for content identification,
description, and information retrieval in digital libraries and
other digitized collections. The project will focus on identifying
poetic and advertising content in digitized historic
newspapers. Using a machine learning approach, the project
will result in an intelligent computational system that can
process digital images and identify these specific types of
content.
https://www.imls.gov/grants/awarded/LG-71-16-0152-16
19. Improving Access to Time-Based Media through
Crowdsourcing and Machine Learning (LG-71-15-0208-15),
WGBH Educational Foundation, $898,474. WGBH, in
partnership with Pop-Up Archive, will address the challenges
faced by many libraries and archives trying to provide better
online access to their media collections. This 30-month
research project will explore and test technological and social
approaches for metadata creation by leveraging scalable
computation and engaging the public to improve access
through crowdsourcing games for time-based media.
https://www.imls.gov/grants/awarded/LG-71-15-0208-15
20. Systems Interoperability and Collaborative Development
for Web Archiving (LG-71-15-0174-15), Internet Archive,
$353,221 and $98,460 in cost share. The Internet Archive,
with the University of North Texas, Rutgers University, and
Stanford University Library, will build
a foundation for collaborative technology development,
improved systems interoperability, and an Application
Programming Interface (API) based model for enhanced
access to, and research use of, web archives. In working with
the Archive-It platform, used by more than 350 partner
institutions, results of this research will be directly applicable
to libraries, archives, and museums around the country and
the world.
https://www.imls.gov/grants/awarded/LG-71-15-0174-15
21. Transforming Libraries and Archives through
Crowdsourcing (LG-71-16-0028-16), Adler Planetarium,
$1,214,780. This research partnership between Adler
Planetarium’s Library and researchers at Oxford University
will expand the capacity for libraries and archives across the
country to use crowdsourcing techniques to engage with
audiences and improve access to digital collections. Through
this effort, the team will develop a series of library/archive
Zooniverse projects that explore improvements to full text
and audio transcription and image annotation crowdsourcing
tools and research differences between transcribing in
isolation versus with knowledge of others’ transcription.
Lessons learned from these projects will be incorporated into
the Project Builder.
https://www.imls.gov/grants/awarded/LG-71-16-0028-16
22. Programmatic Extraction of “Documents” from Web
Archives (LG-71-17-0202-17), University of North Texas,
$318,988. The University of North Texas Libraries and the
Computer Science and Engineering Department will research
the efficacy of using machine-learning algorithms to identify
and extract publications contained in web archives. The
overarching goal of this project is to understand if machine-
learning models can successfully identify content-rich PDF
and Word documents from web archives that align with
library and archives collecting plans.
https://www.imls.gov/grants/awarded/LG-71-17-0202-17
36. LC LABS RESOURCES
Jer Thorp, Innovator in Residence
• Overview https://labs.loc.gov/experiments/innovator-in-residence-jer-thorp/
• Research materials https://osf.io/b7e6w/
• Code https://github.com/blprnt/loc
• Podcast https://artistinthearchive.podbean.com/
Laura Wrubel’s Library of Congress Colors
• Application https://loc-colors.glitch.me/
• Code https://github.com/lwrubel/loc-colors
• Blog post https://blogs.loc.gov/thesignal/2018/01/from-code-to-colors-working-with-the-loc-gov-json-api/
Tahir Hemphill, Papamarkou Chair in Education at the John W. Kluge Center
• About Hip Hop Word Count https://www.newyorker.com/magazine/2013/04/01/rap-sheet-2
• Studio https://www.tahirhemphill.com/
• Past chairs https://www.loc.gov/loc/kluge/fellowships/hpeducation.html
Reports
• Gallinger, M. & Chudnov, D. Recommendations for a Digital Scholarship Lab at the Library of Congress.
• Herron, S. Digital Scholarship Resource Guide.
• Access the reports https://labs.loc.gov/meta/reports/
38. SOME ESSENTIAL APPLIED RESEARCH AREAS
- How can various new technologies be implemented to
scale the ability to acquire, describe, organize and make
available digital collections?
- How can we best integrate various automated methods for
working with digital collections with the work of subject
catalogers/subject matter experts?
- In what ways can we best connect and build relationships
with various user communities through crowdsourcing
initiatives?
- What do all of these technologies look like in ongoing
production workflows?
39. SOME MORE SPECIFIC EXAMPLES
- Working models for content addressable storage in digital
repository storage architectures
- Reconciling data warehousing approaches with library
approaches to content and metadata management
- Weaving together structured cataloging workflows with
metadata generating mechanisms (crowdsourcing, NLP,
Computer Vision, etc.)
- Virtual machines as general-purpose, policy-based
restricted-access infrastructure
- Enabling data mining and computational scholarship on
arbitrary restricted access collections
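
To make the first example above more concrete, here is a minimal sketch of content-addressable storage. It is an illustration only, not a description of any institution's repository: the class name ContentAddressableStore, its put and get methods, the in-memory dictionary, and the choice of SHA-256 are all assumptions made for this example. The idea it shows is that each object is stored under the digest of its own bytes, so identical files deduplicate automatically and every read can double as a fixity check.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store (hypothetical): objects are keyed by the
    SHA-256 digest of their bytes rather than by a filename."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself.
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Recompute the digest on read as a simple fixity check.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("fixity check failed for " + key)
        return data

    def __len__(self) -> int:
        return len(self._objects)


store = ContentAddressableStore()
key_a = store.put(b"digitized newspaper page")
key_b = store.put(b"digitized newspaper page")  # same bytes, deposited twice
key_c = store.put(b"oral history transcript")
```

In this sketch the two deposits of the same bytes return the same key and occupy one slot, while the third object gets its own key. A production repository would persist objects to disk or object storage and keep descriptive metadata in a separate layer that points at these keys.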