An online game will be developed to crowd-source the correction of OCRed content in the Biodiversity Heritage Library (BHL). Several additional content types will be digitized and added to BHL, namely seed lists, seed & nursery catalogs, and handwritten field notebooks.
Purposeful Gaming, OCR Correction and Seed & Nursery Catalog Digitization
1. Purposeful Gaming, OCR Correction and Seed & Nursery Catalog Digitization
Marty Schlabach
Food & Agriculture Librarian
Cornell University
A Grant-Funded BHL Project
Council on Botanical & Horticultural Libraries
Annual Meeting
Richmond, VA
April 30, 2014
2. What is Purposeful Gaming and BHL?
•Full title: Purposeful Gaming and BHL: Engaging the Public in Improving and Enhancing Access to Digital Texts
•National Leadership Grant for Libraries awarded to Missouri Botanical Garden in St. Louis
•Partners include Harvard, Cornell, and New York Botanical Garden
•Funded by IMLS
•Runs Dec 2013-Nov 2015
3. BHL Problem Statement:
•Major challenge for digital libraries: full-text searching of scanned texts is significantly hampered by poor output from Optical Character Recognition (OCR) software.
•Historic literature has proven to be particularly problematic because of its tendency to have varying fonts, typesetting, and layouts.
5. 8 Primary Objectives of Purposeful Gaming and BHL
1) digitizing horticultural catalogs
2) transcribing field notebooks, horticultural catalogs & other digital content in BHL
3) building a technical framework for management of digital text outputs
4) comparing digital outputs for OCR accuracy
5) developing and deploying a game to crowd-source OCR error correction
6) evaluating accuracy scores from the game against ground truth pages
7) generating an error matrix for clean-up
8) producing a report and disseminating findings
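Objectives 4 and 6 can be illustrated with a minimal sketch. It assumes two OCR outputs of the same page are available as plain strings and that a hand-checked ground truth exists; the function names and sample strings are hypothetical and are not taken from the project. Word-level disagreements between the two outputs are the kind of item a game could present to players, and a similarity ratio gives a rough score against ground truth.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a: str, ocr_b: str):
    """Return (text_a, text_b) pairs where two OCR outputs of a page differ."""
    a, b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the differing word runs from each output side by side.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


def similarity(candidate: str, truth: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, candidate, truth, autojunk=False).ratio()


# Invented sample strings, for illustration only.
engine_a = "Choice feeds for the garden"
engine_b = "Choice seeds for tbe garden"
truth = "Choice seeds for the garden"

for left, right in disagreements(engine_a, engine_b):
    print(f"disputed: {left!r} vs {right!r}")
print(similarity(engine_a, truth), similarity(engine_b, truth))
```

The ratio is a general string-similarity measure from Python's standard library, not necessarily the accuracy metric the project itself uses.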
6. Like reCAPTCHA…
reCAPTCHA offers more than just spam protection. Every time CAPTCHAs are solved, that human effort helps digitize text, annotate images, and build machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.
7. …but a lot more fun!
http://www.digitalkoot.fi/
Mole Bridge http://www.youtube.com/watch?v=Q6CZId38Hvk
Mole Hunt http://www.youtube.com/watch?v=uHN5WW6yCc4
8. BHL Content
•Primarily Books & Journals
•Purposeful Gaming Project adding new content types
• Seed & Nursery Catalogs & Seed Lists
• Test OCR correction on this content type
• Crowd-source the transcription when needed
• Field Notebooks
• Handwritten, OCR virtually impossible
• Crowd-source the transcription
9. Seed & Nursery Catalog and Seed List Digitization
Seed & Nursery Catalogs
•New York Botanical Garden
•Cornell
Seed Lists
•Missouri Botanical Garden
10. Why Digitize Seed & Nursery Catalogs?
• Taxonomists
• Discover early introductions of new plants
• Gardeners
• Peruse old catalogs for historical availability and uses of traditional cultivars of heirloom annuals and perennials
• Museums and botanical gardens
• Recreate historical gardens
• Plant breeders
• Look for descriptions of plants with unique disease and pest resistance
• Historians of art and illustration
• Drawn to the striking representations of flowers, fruit & vegetables
• Historians of printing
• Catalogs documented changes in printing
• Text-only broadsides & pamphlets
• Multipage booklets with engraved illustrations
• Colorful lithographs added
• Photographic illustrations, b&w and later color
12. New York Botanical Garden (NYBG)
•50,000+ catalogs in collection
•Previous NEH grant
•Catalog whole seed & nursery catalog collection
•Digitize a portion
•In IMLS project, will continue systematic digitization
•Targeting 500 catalogs, 15,000 pages
•In-house scanning
•Currently available via Mertz Digital
•http://mertzdigital.nybg.org/
•Upload to Internet Archive (IA) & BHL
13. Cornell University Library
•130,000+ catalogs in Bailey Horticultural Catalog collection
•Digitization priorities
•Firms that supplied grapes
•Have identified 325+ firms
•1771-
•Minimizing duplication of firms & catalogs NYBG & NAL are digitizing
•Targeting 3,700 catalogs, 125,000 pages
•External scan vendor Trigonix
•Upload to Internet Archive (IA) & BHL
14. National Agricultural Library (NAL)
•Not a BHL member or IMLS recipient, but a collaborator
•Already digitized 6,400+ catalogs
•Selection priorities
•Peter Henderson
•Women-owned firms
•Mid-Atlantic firms
•Long collection runs
•In-house scanning
•Uploading to Internet Archive (IA)
•https://archive.org/details/usda-nurseryandseedcatalog
•BHL ingests non-member content from IA
19. Purposeful Gaming: Development Deliverables
• 140,000 digitized pages of horticultural catalogs in BHL, with a minimum of 2 OCR outputs for each page
• Implementation of a transcription tool for BHL materials
• 2,000 pages of transcribed field notebooks
• Technical framework within BHL architecture for classifying, comparing and managing multiple OCR outputs for a single page (i.e., the possibility of including corrected OCR)
• Production of an error matrix that will allow for automated text correction on full BHL corpus
• Proof of concept for whether gaming and crowd-sourcing can be used to improve access to digital texts
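One way to picture the error matrix deliverable is as a tally of which characters the OCR engine tends to confuse with which. The sketch below is an illustration under assumptions, not the project's implementation: it assumes pairs of (raw OCR text, corrected text) are available, counts only one-for-one character substitutions, and uses invented sample pairs and a hypothetical function name.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_matrix(pairs):
    """Count (ocr_char, correct_char) substitutions over (ocr, corrected) pairs."""
    counts = Counter()
    for ocr, corrected in pairs:
        matcher = SequenceMatcher(None, ocr, corrected, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only same-length replacements map cleanly to one-for-one confusions;
            # insertions, deletions and uneven replacements are skipped here.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(ocr[i1:i2], corrected[j1:j2]))
    return counts


# Invented examples: a long s misread as "f", and "h" misread as "b".
samples = [("feeds", "seeds"), ("tbe", "the"), ("fow", "sow")]
for (wrong, right), n in error_matrix(samples).most_common():
    print(f"{wrong!r} read instead of {right!r}: {n}")
```

Frequent confusions in such a tally could then suggest candidate corrections for pages that never pass through the game, which is the sense in which the slide speaks of automated correction across the corpus.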
20. Acknowledgements
• IMLS Purposeful Gaming grant colleagues at
– Missouri Botanical Garden
– New York Botanical Garden
– Cornell University Library
– Harvard Museum of Comparative Zoology
• Trish Rose-Sandler, Missouri Botanical Garden
– Plagiarized (with permission) selected slides
• IMLS for funding
Editor's notes
Sample of poor OCR output from an 18th-century publication.
This page is from Linnaeus's Species Plantarum, published in 1753.
An image of the original text is on the left. The OCR is on the right.
An example of a game to crowd-source the correction of OCRed text, by the National Library of Finland
In addition to their value to the Purposeful Gaming project, both are highly relevant and useful biodiversity content
Missouri Botanical Garden digitizing seed lists
NYBG and Cornell digitizing seed & nursery catalogs
This is not the audience that needs to be convinced of the value of a digital collection of historical horticultural catalogs, but…
There are collections within BHL
Note ‘Collections’ button
One of the 35 collections is the ‘Seed & Nursery Catalogs’ collection
Seed and Nursery Catalog collection in BHL currently holds 6,441 items.
Most are from NAL’s digitizing efforts.
NYBG & Cornell digital content to be added over the next several months.