Presentation by Karen Cariani, WGBH Media Library and Archives Senior Director and Project Director for the American Archive of Public Broadcasting at the 2017 Association of Moving Image Archivists Conference in New Orleans.
8. the situation
72,000 digitized television and radio programs
incomplete, inaccurate metadata records
limited staff resources
we need to know what we have in the collection
we have a responsibility to users to provide access to the collection
continued growth of the collection (content and sparse metadata)
9. the potential:
transforming content into data
• Computational Tools
• Speech-to-text
• Audio analysis
• Image Analysis
• Visualization of Data
How can we use them?
11. AV crowdsourcing precedents
• TiltFactor @ Dartmouth: “Metadata Games”
• New York Public Library’s Together We Listen project & Transcript Editing Tool
• Netherlands Institute for Sound and Vision
13. user population
• General public
• Public media fans
• K-12 students
• Senior citizens
• People seeking to develop editing skills
• People seeking volunteer opportunities
26. game improvement targets
• Change the algorithm and game pipeline to get transcripts through the game more quickly
• Update the Rules page to allow more leniency in corrections; communicate that we’re looking for acceptable corrections, not perfection
• Add the ability for AAPB staff to prioritize transcripts in the game
• Remove the preferences feature
• Update the API to help AAPB staff more easily determine which transcripts are ready to come out of the game
27. lessons learned
• Ensure that all team members understand the overall
goals of the project from the beginning
• Ensure that all relevant team members are involved in
developing the game flow concepts and API
• Stay involved in all decision-making – don’t trust that
the developers/contractors will make all the right
decisions
• Test, test, test!!
28. once corrected…
• JSON transcripts will be stored on AAPB’s Amazon S3 account
• Transcripts will be indexed for keyword searching on the AAPB website
• Transcripts will be made available alongside the media on the record page
• Transcripts can play as captions within the player
• Transcripts can be harvested via an API and used as a dataset for research such as a digital humanities project
29. usability & ux research questions
• Do users understand the workflow of the game?
• Do users understand the iconography?
• How do users feel about interacting with random transcripts rather than choosing a specific transcript to work on?
• How do users feel about interacting with small bits of transcripts rather than a full transcript at once?
• What is the overall user experience when playing the game?
• What is the overall satisfaction level in playing the game?
And we are talking about… using computational tools and crowdsourcing games to increase metadata and discoverability of digital collections.
We are WGBH, Pop-Up Archive, and the University of Texas at Austin School of Information. I am discussing a project generously funded by IMLS. I am Karen Cariani, Senior Director of the WGBH Media Library and Archives, and project director for the American Archive of Public Broadcasting.
… a collaboration between the Library of Congress and WGBH.
Home to many hours of prime time PBS programming
The American Archive goal is to preserve and make accessible significant public radio and television programs before they are lost to posterity. The American Archive is a digital archive with a website, americanarchive.org, the homepage of which you see here. Users anywhere in the U.S. can access a wide range of historical public television and radio programs from the late 1940s to the present. Our primary objective is to preserve public media and assure discoverability and access through a coordinated national effort. In doing this, we support content creators and current stewards of the materials, and facilitate the use of historical public broadcasting by researchers, educators, students, and others.
As an aggregator of content, the AAPB hopes to provide a centralized web portal of discovery for public media materials. The collection is growing with new additions. Access is for research, educational, and informational purposes only. Due to rights restrictions, a portion (about 20,000 items) is available through our Online Reading Room anywhere in the US. However, the entire collection of over 72,000 items is available for viewing on location at the Library of Congress and WGBH.
As part of the initial project funded by CPB, the AAPB has 72,000 digitized TV and radio programs from about 100 stations across the country. Along with these digital files we received incomplete metadata records with very little descriptive data about the content or the programs. We have limited staff resources to fully catalog the 72,000 items. We figured it would take a full-time person about 32 years, spending only 15 minutes per item, to watch everything and catalog the entire collection, all while we add up to 25,000 items annually. So you can do the math and figure out that even if we could afford a team of 10 people just to catalog full time, it would still take a long time, and we would barely keep up with cataloging the new acquisitions. However, we need to know what we have (it helps us determine rights and what we can make accessible), and we need to make it findable for users; to do that, currently, we need to be able to expose text for search engines and indexers.
So how do you transform large amounts of audio and video into something searchable by search engines and indexers? How can we transform it into a dataset?
We thought this was a great opportunity for collaboration with the computational tools and computer science fields, but we needed to understand the capabilities of what exists. Here are some of the tools available that can help us with our dilemma. With this IMLS-funded project we are working with Pop-Up Archive to create speech-to-text transcripts of the entire collection, and with the University of Texas at Austin to analyze the audio to help further identify speakers and sounds. And we will use a crowdsourcing game to help correct the computer-generated transcripts, which will hopefully help further train the tools to improve.
Experience has shown that most speech-to-text tools don’t output clean transcripts. Accuracy depends on audio quality, speaker accents, background noise, etc. Given that our collection comes from 100 different local TV and radio stations across the country, the audio quality varies widely. Some programs are in Spanish, some are musical performances, and nearly all video recordings begin with standard bars and tone. The speech-to-text tool tries to interpret these sounds as text, and it makes a number of other mistakes too. WGBH has created a web-based game to allow the public to help us fix and correct these transcripts.
Before our project there were already several efforts creating additional metadata through crowdsourcing. A group at Dartmouth called TiltFactor had created a number of metadata games. NYPL built a transcription tool for online crowdsourced creation and editing of oral history transcripts. And the Netherlands Institute for Sound and Vision developed a social tagging game called Waisda?, which asked the public to compete by tagging the same video simultaneously, awarding points for both speed and accuracy of tags.
We decided to join the fray.
The game has terms of use that we ask players to check off to make sure they understand that they cannot use the content for anything but helping us correct the transcripts. We’ve kept the clips very short in order to take advantage of fair use.
Our target audience for the game is just about everyone. Maybe not college students or early graduates.
There are three games you can play: identify errors, suggest fixes, and validate fixes. You gain points for each action taken.
As a player you can set preferences by topic or station of choice.
The more you play, the more points you get.
There is a progress board to tell you how far along we are overall.
And there are instructions for each game to let you know how to play.
Each iteration of a game lasts five minutes, but you can play multiple times for any length of time. Three lines of the transcript are active at once. You listen to the audio, see the line highlighted, and click on it if there is a mistake. There are instructions and guides on what is considered an error and how to mark it. It takes a little while to figure out, but after a few rounds you can pick it up pretty quickly.
For game 1, this is highlighting a mistake that needs correction in this line.
In game 2 you can choose to fix the error or claim it is not an error. We require at least 2 people to agree on whether or not it is an error.
In game 3 you validate a correction that someone else has made. We are requiring that at least 3 people agree to a correction.
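The two- and three-player agreement thresholds described above can be sketched roughly as follows. This is a minimal illustration of the rule, not the game's actual code; all function and variable names here are hypothetical.

```python
# Sketch of the agreement thresholds: 2 players must agree on whether a line
# contains an error, and 3 players must validate a proposed correction.
# Hypothetical names, not the AAPB game's actual implementation.

ERROR_AGREEMENT = 2       # players who must agree a line is (or is not) an error
CORRECTION_AGREEMENT = 3  # players who must validate the same proposed fix

def error_status(votes):
    """votes: list of booleans from players ('does this line contain an error?').

    Returns True (confirmed error), False (confirmed correct), or None
    (not enough agreement yet)."""
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    if yes >= ERROR_AGREEMENT:
        return True
    if no >= ERROR_AGREEMENT:
        return False
    return None

def fix_validated(validations):
    """validations: number of game-3 players who approved the same fix."""
    return validations >= CORRECTION_AGREEMENT
```

Once a line clears either threshold it can move forward in the pipeline rather than being re-served to players.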
The game board keeps track of points and players, and highlights top scorers. Studies have shown that people play these games for personal satisfaction, and competition doesn’t necessarily increase the desire to play. We hope people will be driven by the personal satisfaction of earning points and helping us out, as opposed to competing against anyone in particular.
There are 260 transcripts in the pipeline, with more added as new players set new preferences. 49 corrections have been made. But zero transcripts have been completed.
As you can see, with over 700 players, and 68,000 transcripts we have barely made a dent. There are over 15,000 errors identified across 260 transcripts. We needed to rethink our approach.
We worked with our game developers and decided we needed to change the algorithm that moves transcripts through the pipeline. We can limit which transcripts are in the pipeline, overriding the preferences, so not all 68,000-plus are in play at the same time but a more concentrated number, like 10. We changed the level of perfection we required before calling a transcript fixed, making it a little more lenient. We ensured that once a phrase has been validated as not having an error it moves forward, never to go backwards. And we updated the ability for our staff to go in and manually decide which transcripts were ready to be finished.
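The revised pipeline selection (a small active pool instead of all 68,000-plus transcripts, with staff able to override player preferences) might be sketched like this. The names and structure here are illustrative assumptions, not the game's actual code.

```python
# Sketch of the revised pipeline: cap the number of transcripts in play and
# let staff-prioritized transcripts jump the queue. Hypothetical names,
# not the AAPB game's actual implementation.

ACTIVE_POOL_SIZE = 10  # a concentrated number of transcripts, rather than 68,000+

def select_active_transcripts(queue, staff_priorities, pool_size=ACTIVE_POOL_SIZE):
    """queue: transcript IDs awaiting correction, in default order.
    staff_priorities: IDs that AAPB staff have flagged to go through first."""
    flagged = set(staff_priorities)
    prioritized = [t for t in queue if t in flagged]
    rest = [t for t in queue if t not in flagged]
    return (prioritized + rest)[:pool_size]
```

For example, `select_active_transcripts(["a", "b", "c"], ["c"], pool_size=2)` returns `["c", "a"]`: the staff-flagged transcript enters the pool first.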
So we learned some lessons, most of which are true for any project. As with so many other technical projects where we as archivists rely on a technical team outside our own department, it is important to have the archivist’s voice heard. We actually did know best: we know the content, and we know what the project’s goal is. With the first iteration of the game, it seemed that the developers lost sight of the basic goal, which was to correct transcripts and output them, not just play a game. We seem to be on the right track now and have relaunched the game, so please play it and give us feedback.
Once the transcripts have been verified, the JSON transcripts will be stored in the AAPB’s Amazon S3 account and indexed for keyword searching on the AAPB website. The transcripts will be made available alongside the media on the record page. They can also be played like captions within the video player. And they will be able to be harvested via an API to be used as a data set for research. We are hoping that researchers will begin to look at the collection as a data set and start trying to see trends from programming over the last 60 years. Particularly across news programs.
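As a rough illustration of how a harvested transcript could serve both keyword search and caption-style playback, here is a tiny sketch. The JSON shape below is a guess at a generic timed-text structure; the AAPB's actual transcript schema may differ, and the data is invented for illustration.

```python
import json

# A guessed timed-text transcript shape with invented example data;
# the AAPB's actual JSON schema may differ.
transcript_json = """
{
  "id": "example-transcript",
  "parts": [
    {"start": 0.0, "end": 4.2, "text": "Good evening, and welcome to the program."},
    {"start": 4.2, "end": 9.8, "text": "Tonight we look at public broadcasting history."}
  ]
}
"""

transcript = json.loads(transcript_json)

def keyword_hits(transcript, keyword):
    """Return start times of segments containing the keyword: the basis of
    keyword search on the website and of jumping playback to a caption."""
    kw = keyword.lower()
    return [p["start"] for p in transcript["parts"] if kw in p["text"].lower()]
```

A researcher harvesting many such files via the API could apply the same segment-level scan across the whole collection to chart topic trends over time.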
In the meantime we are continuing to improve the interface and gather feedback from users. We will be holding an editathon session this afternoon, so please join us. There will be treats!
But wait there is more!!!
We plan to utilize the NYPL transcript editor tool to see if it is a more efficient way to correct transcripts and get the public engaged.
And we are launching a Zooniverse project called “Roll the Credits” to help us gather data from the credit rolls, like authenticated titles, broadcast date, producer, writer, etc.