Presentation by Karen Cariani, WGBH Media Library and Archives Senior Director and Project Director for the American Archive of Public Broadcasting at the 2017 Association of Moving Image Archivists Conference in New Orleans.
8. the situation
72,000 digitized television and radio programs
incomplete, inaccurate metadata records
limited staff resources
we need to know what we have in the collection
we have a responsibility to users to provide access to the collection
continued growth of the collection (content and sparse metadata)
9. the potential:
transforming content into data
• Computational Tools
• Speech-to-text
• Audio analysis
• Image Analysis
• Visualization of Data
How can we use them?
11. AV crowdsourcing precedents
• TiltFactor @ Dartmouth: “Metadata Games”
• New York Public Library’s Together We Listen project & Transcript Editing Tool
• Netherlands Institute for Sound and Vision
13. user population
• General public
• Public media fans
• K-12 students
• Senior citizens
• People seeking to develop editing skills
• People seeking volunteer opportunities
26. game improvement targets
• Change the algorithm and game pipeline to get transcripts through the game more quickly
• Update the Rules page to allow more leniency in corrections; communicate that we’re looking for acceptable corrections, not perfection
• Add the ability for AAPB staff to prioritize transcripts in the game
• Remove the preferences feature
• Update the API to help AAPB staff more easily determine which transcripts are ready to come out of the game
27. lessons learned
• Ensure that all team members understand the overall
goals of the project from the beginning
• Ensure that all relevant team members are involved in
developing the game flow concepts and API
• Stay involved in all decision-making – don’t trust that
the developers/contractors will make all the right
decisions
• Test, test, test!!
28. once corrected…
• JSON transcripts will be stored on AAPB’s Amazon S3 account
• Transcripts will be indexed for keyword searching on the AAPB website
• Transcripts will be made available alongside the media on the record page
• Transcripts can play as captions within the player
• Transcripts can be harvested via an API and used as a dataset for research such as a digital humanities project
29. usability & ux research questions
• Do users understand the workflow of the game?
• Do users understand the iconography?
• How do users feel about interacting with random transcripts rather than choosing a specific transcript to work on?
• How do users feel about interacting with small bits of transcripts rather than a full transcript at once?
• What is the overall user experience when playing the game?
• What is the overall satisfaction level in playing the game?
And we are talking about… using computational tools and crowdsourcing games to increase metadata and discoverability of digital collections.
We are WGBH, Pop-Up Archive, and the University of Texas at Austin School of Information. I am discussing a project generously funded by IMLS. I am Karen Cariani, Senior Director of the WGBH Media Library and Archives, and project director for the American Archive of Public Broadcasting.
… a collaboration between the Library of Congress and WGBH.
Home to many hours of prime time PBS programming
The American Archive goal is to preserve and make accessible significant public radio and television programs before they are lost to posterity. The American Archive is a digital archive with a website, americanarchive.org, the homepage of which you see here. Users anywhere in the U.S. can access a wide range of historical public television and radio programs from the late 1940s to the present. Our primary objective is to preserve public media and assure discoverability and access through a coordinated national effort. In doing this, we support content creators and current stewards of the materials, and facilitate the use of historical public broadcasting by researchers, educators, students, and others.
As an aggregator of content, the AAPB hopes to provide a centralized web portal of discovery for public media materials. The collection is growing with new additions. Access is for research, educational, and informational purposes only. Due to rights restrictions, a portion (about 20,000 items) is available through our Online Reading Room anywhere in the US. However, the entire collection of over 72,000 items is available for viewing on location at the Library of Congress and WGBH.
As part of the initial project funded by CPB, the AAPB has 72,000 digitized TV and radio programs from about 100 stations across the country. Along with these digital files we received incomplete metadata records with very little descriptive data about the content or the programs. We have limited staff resources to fully catalog the 72,000 items. We figured it would take a full-time person about 32 years, spending only 15 minutes per item, to watch everything and catalog the entire collection, all while we add up to 25,000 items annually. So you can do the math and figure out that even if we could afford a team of 10 people just to catalog full time, it would still take a long time, and we would barely keep up with cataloging the new acquisitions. However, we need to know what we have (it helps us determine rights and what we can make accessible), and we need to make it findable for users; to do that, currently, we need to be able to expose text for search engines and indexers.
So how do you transform large amounts of audio and video into something searchable by search engines and indexers? How can we transform it into a dataset?
We thought this was a great opportunity for collaboration with the computational tools and computer science fields, but we needed to understand the capabilities of what exists. Here are some of the tools available that can help us with our dilemma. With this IMLS-funded project we are working with Pop-Up Archive to create speech-to-text transcripts of the entire collection, and with the University of Texas at Austin to analyze the audio to help further identify speakers and sounds. And we will use a crowdsourcing game to help correct the computer-generated transcripts, which will hopefully help further train the tools to improve.
Experience has shown that most speech-to-text tools don’t output clean transcripts. Accuracy depends on audio quality, speaker accents, background noise, etc. Given that our collection comes from 100 different local TV and radio stations across the country, the audio quality varies widely. Some programs are in Spanish, some are musical performances, and nearly all video recordings begin with standard bars and tone. The speech-to-text tool tries to interpret these sounds as text, and it makes a number of other mistakes too. WGBH has created a web-based game to allow the public to help us fix and correct these transcripts.
Before our project there were already several efforts creating additional metadata through crowdsourcing. A group at Dartmouth called TiltFactor had created a number of metadata games. NYPL built a transcription tool for online crowdsourced creation and editing of oral history transcripts. And the Netherlands Institute for Sound and Vision developed a social tagging game called Waisda?, which asked the public to compete by tagging the same video simultaneously, awarding points for both speed and accuracy of tags.
We decided to join the fray.
The game has terms of use that we ask players to check off to make sure they understand that they cannot use the content for anything but helping us correct the transcripts. We’ve kept the clips very short in order to take advantage of fair use.
Our target audience for the game is just about everyone. Maybe not college students or early graduates.
There are three games you can play: identify errors, suggest fixes, and validate fixes. You gain points for each action taken.
As a player you can set preferences by topic or station of choice.
The more you play, the more points you get.
There is a progress board to tell you how far along we are overall.
And there are instructions for each game to let you know how to play.
Each iteration of a game lasts five minutes, but you can play multiple times for any length of time. Three lines of the transcript are active at once. You listen to the audio, see the line highlighted, and click on it if there is a mistake. There are instructions and guides on what is considered an error and how to mark it. It takes a little while to figure out, but after a few rounds you can pick it up pretty quickly.
For game 1, this is highlighting a mistake that needs correction in this line.
In game 2 you can choose to fix the error or claim it is not an error. We require at least 2 people to agree on whether or not it is an error.
In game 3 you validate a correction that someone else has made. We are requiring that at least 3 people agree to a correction.
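The two- and three-player agreement thresholds described above can be sketched roughly as follows. This is a minimal illustration of the rule, not the game's actual code; all function and variable names here are hypothetical.

```python
# Sketch of the agreement thresholds: 2 players must agree on whether a line
# contains an error, and 3 players must validate a proposed correction.
# Hypothetical names, not the AAPB game's actual implementation.

ERROR_AGREEMENT = 2       # players who must agree a line is (or is not) an error
CORRECTION_AGREEMENT = 3  # players who must validate the same proposed fix

def error_status(votes):
    """votes: list of booleans from players ('does this line contain an error?').

    Returns True (confirmed error), False (confirmed correct), or None
    (not enough agreement yet)."""
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    if yes >= ERROR_AGREEMENT:
        return True
    if no >= ERROR_AGREEMENT:
        return False
    return None

def fix_validated(validations):
    """validations: number of game-3 players who approved the same fix."""
    return validations >= CORRECTION_AGREEMENT
```

Once a line clears either threshold it can move forward in the pipeline rather than being re-served to players.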
The game board keeps track of points and players, and highlights top scorers. Studies have shown that people play these games for personal satisfaction, and competition doesn’t necessarily increase the desire to play. We hope people will be driven by the personal satisfaction of earning points and helping us out, as opposed to competing against anyone in particular.
There are 260 transcripts in the pipeline, with more added as new players set new preferences. 49 corrections have been made. But zero transcripts have been completed.
As you can see, with over 700 players, and 68,000 transcripts we have barely made a dent. There are over 15,000 errors identified across 260 transcripts. We needed to rethink our approach.
We worked with our game developers and decided we needed to change the algorithm that moves transcripts through the pipeline. We can limit which transcripts are in the pipeline, overriding the preferences, so not all 68,000-plus are in play at the same time but a more concentrated number, like 10. We changed the level of perfection we required before calling a transcript fixed, making it a little more lenient. We ensured that once a phrase has been validated as not having an error it moves forward, never to go backwards. And we updated the ability for our staff to go in and manually decide which transcripts were ready to be finished.
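The revised pipeline selection (a small active pool instead of all 68,000-plus transcripts, with staff able to override player preferences) might be sketched like this. The names and structure here are illustrative assumptions, not the game's actual code.

```python
# Sketch of the revised pipeline: cap the number of transcripts in play and
# let staff-prioritized transcripts jump the queue. Hypothetical names,
# not the AAPB game's actual implementation.

ACTIVE_POOL_SIZE = 10  # a concentrated number of transcripts, rather than 68,000+

def select_active_transcripts(queue, staff_priorities, pool_size=ACTIVE_POOL_SIZE):
    """queue: transcript IDs awaiting correction, in default order.
    staff_priorities: IDs that AAPB staff have flagged to go through first."""
    flagged = set(staff_priorities)
    prioritized = [t for t in queue if t in flagged]
    rest = [t for t in queue if t not in flagged]
    return (prioritized + rest)[:pool_size]
```

For example, `select_active_transcripts(["a", "b", "c"], ["c"], pool_size=2)` returns `["c", "a"]`: the staff-flagged transcript enters the pool first.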
So we learned some lessons, most of which are true for any project. As with so many other technical projects where we as archivists rely on a technical team outside our own department, it is important to have the archivist’s voice heard. We actually did know best: we know the content, and we know what the project’s goal is. With the first iteration of the game, it seemed that the developers lost sight of the basic goal, which was to correct transcripts and output them, not just play a game. We seem to be on the right track now and have relaunched the game, so please play it and give us feedback.
Once the transcripts have been verified, the JSON transcripts will be stored in the AAPB’s Amazon S3 account and indexed for keyword searching on the AAPB website. The transcripts will be made available alongside the media on the record page. They can also be played like captions within the video player. And they will be able to be harvested via an API to be used as a data set for research. We are hoping that researchers will begin to look at the collection as a data set and start trying to see trends from programming over the last 60 years. Particularly across news programs.
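As a rough illustration of how a harvested transcript could serve both keyword search and caption-style playback, here is a tiny sketch. The JSON shape below is a guess at a generic timed-text structure; the AAPB's actual transcript schema may differ, and the data is invented for illustration.

```python
import json

# A guessed timed-text transcript shape with invented example data;
# the AAPB's actual JSON schema may differ.
transcript_json = """
{
  "id": "example-transcript",
  "parts": [
    {"start": 0.0, "end": 4.2, "text": "Good evening, and welcome to the program."},
    {"start": 4.2, "end": 9.8, "text": "Tonight we look at public broadcasting history."}
  ]
}
"""

transcript = json.loads(transcript_json)

def keyword_hits(transcript, keyword):
    """Return start times of segments containing the keyword: the basis of
    keyword search on the website and of jumping playback to a caption."""
    kw = keyword.lower()
    return [p["start"] for p in transcript["parts"] if kw in p["text"].lower()]
```

A researcher harvesting many such files via the API could apply the same segment-level scan across the whole collection to chart topic trends over time.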
In the meantime we are continuing to improve the interface and gather feedback from users. We will be holding an editathon session this afternoon, so please join us. There will be treats!
But wait there is more!!!
We plan to utilize the NYPL transcript editor tool to see if it is a more efficient way to correct transcripts and get the public engaged.
And we are launching a Zooniverse project called “Roll the Credits” to help us gather data from the credit rolls, like authenticated titles, broadcast date, producer, writer, etc.