The document discusses the Biodiversity Heritage Library (BHL), an open access digital library focused on biodiversity literature. It notes that taxonomic literature often relies on old texts and that there is a "taxonomic impediment" due to gaps in taxonomic knowledge. The BHL aims to digitize the core biodiversity literature to make it openly accessible online. It has over 70,000 volumes and 26 million pages digitized so far from member institutions. The BHL provides various services like name finding, citations, and APIs to support research.
Biodiversity Heritage Library: Liberating World's Literature
1. The Biodiversity Heritage Library: Liberating
the World’s Biodiversity Literature
Thomas Garnett EOL Fellows March 2010
2. BHL – Why?
The cited half-life of
publications in taxonomy
is longer than in any other
scientific discipline
-Macro-economic case for open access, Tom Moritz
-Current taxonomic
literature often relies on
texts and specimens > 100
years old.
Levinus Vincent
Elenchus tabularum, pinacothecarum, 1719
3. BHL – Why?
The Taxonomic
Impediment
“The taxonomic
impediment is a term
that describes the gaps
of knowledge in our
taxonomic system”
- Darwin Declaration, 1998
Georges Louis Leclerc, comte de Buffon
Histoire naturelle : générale et particulière (Oiseaux), 1799-1808
5. BHL Members: US/UK
• Academy of Natural Sciences (Philadelphia, PA)
• American Museum of Natural History (New York, NY)
• California Academy of Sciences (San Francisco, CA)
• The Field Museum (Chicago, IL)
• Harvard University Botany Libraries (Cambridge, MA)
• Harvard University, Ernst Mayr Library of the Museum of
Comparative Zoology (Cambridge, MA)
• Marine Biological Laboratory / Woods Hole Oceanographic
Institution (Woods Hole, MA)
• Missouri Botanical Garden (St. Louis, MO)
• Natural History Museum (London, UK)
• The New York Botanical Garden (New York, NY)
• Royal Botanic Gardens, Kew (Richmond, UK)
• Smithsonian Institution Libraries (Washington, DC)
6. BHL Members: BHL-Europe
• Museum für Naturkunde - Leibniz-Institut für Evolutions- und Biodiversitätsforschung an der Humboldt-Universität zu Berlin
• Natural History Museum, UK
• Narodni muzeum NMP CZ
• Angewandte Informationstechnik Forschungsgesellschaft mbH
• Freie Universität Berlin FUBBGBM
• Georg-August-Universität Göttingen Stiftung Öffentlichen Rechts
• Naturhistorisches Museum Wien
• Hungarian Natural History Museum
• Museum and Institute of Zoology, Polish Academy of Sciences
• University of Copenhagen
• Stichting Nationaal Natuurhistorisch Museum, Naturalis
• National Botanic Garden of Belgium
• Royal Museum for Central Africa
• Royal Belgian Institute of Natural Sciences
• Bibliothèque nationale de France
• Museum national d’histoire naturelle
• Consejo Superior de Investigaciones Cientificas
• Università degli Studi di Firenze
• Royal Botanic Garden, Edinburgh
• Species 2000
• John Wiley & Sons limited
• Helsingin yliopisto UH-Viikki
7. BHL Members: BHL-China
• Chinese Academy of Sciences – Institute of
Botany
• Chinese Academy of Sciences – Institute of
Zoology
• Chinese Academy of Sciences – Institute of
Microbiology
• Chinese Academy of Sciences – Institute of
Oceanography
8. BHL is a Focused Program
• Though BHL is composed of libraries, it
has been a domain-specific program, not just a
digital library project. It arose from and is
responsive to the biodiversity community
composed of the disciplines of taxonomy,
systematics, evolutionary biology, ecology,
conservation, and wildlife management. These
are the primary audience.
10. Core Literature
• Botany
• Plant conservation
• Phytogeography
• Plant anatomy
• Plant physiology
• Plant ecology
• Spermatophyta, Phanerogams
• Cryptogams
• Biological diversity
• Evolution
• Phylogenetic relationships
• Evolutionary genetics
• Scientific voyages and expeditions
• Pre-Linnaean works
• Linnaean works
• Biodiversity conservation
• Conservation biology
• Ecosystem management
• Endangered species & ecosystems
• Extinction
• Classification, Nomenclature
• Biogeography
• Zoology/Botany--Morphology
• Zoology/Botany--Anatomy
• Zoology/Botany--Embryology
• Zoology/Botany--Reproduction
• Zoology/Botany--Geographical distribution
• Classification, systematics and taxonomy
• Zoology
• Invertebrates
• Chordates
• Vertebrates
• Animal Behavior
11. Stats: Now Online
• 70,630 volumes
• 26.4 million pages
Oldest book: Schöffer’s Herbarius, 1484.
12. What is the plan?
Digitize the core literature of biodiversity. Full works, not
bits & pieces.
Open Access: all content can be repurposed, reused,
reformatted.
Congruent: must fit into a dynamic knowledge ecology.
Scan public domain biodiversity literature.
Negotiate rights to digitize copyrighted materials.
Ingest content digitized by others.
Provide interfaces & APIs for repository.
– GUIs
– Services for data mining & citation resolution
13. BHL Digital Preservation
• Committed to long-term storage, curation,
and preservation of digital text assets for
the world-wide biodiversity community
• BHL is a steward for this literature.
• To keep this content available and open for
the future requires careful organizational
planning.
• Preservation is both a technical and
political/social process.
14. BHL Relationship with Non-Profit Journal
Publishers
Opt-in Copyright Model: The BHL works with professional societies and
associations to integrate their publications into the BHL in a way that
serves the societies’ missions and goals
BHL indexes the articles using Taxonomic Intelligence, thereby vastly
increasing their usability.
Publishers’ content is embedded in the emerging knowledge ecology
that is sweeping biology in this century.
73 Permission Agreements to date. More under negotiation.
Integration with gray literature in later phases of project.
21. Name Finding in action with Taxonomic Intelligence…
• Image from scanner
• Converted to text via OCR
• Name finding via TaxonFinder
• Extract names, submit to NameBank
• SOAP response
22. OCR error rate for names only
Of the 3,003 names, 1,056 (35.16%) were incorrectly transcribed by OCR.
Top OCR errors
1 Insert Space
2 Omit Space
3 e->c
4 u->I
5 u->n
6 i->l
7 c->e
8 n->v
9 l->i
10 r->i
11 u->ii
12 h->l
13 h->ii
14 e->o
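As an illustration of how error categories like those on this slide could be tallied, here is a minimal Python sketch. It is not the method used for the BHL analysis, which the slides do not describe. The function name `classify_ocr_errors` and the sample name pairs are hypothetical. It compares a reference name with its OCR transcription using the standard library's `difflib` and labels each difference in the slide's notation.

```python
from __future__ import annotations

from collections import Counter
from difflib import SequenceMatcher


def classify_ocr_errors(expected: str, observed: str) -> list[str]:
    """Label each difference between a reference name and its OCR output."""
    labels = []
    matcher = SequenceMatcher(None, expected, observed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        src, dst = expected[i1:i2], observed[j1:j2]
        if tag == "insert" and dst == " ":
            labels.append("Insert Space")   # OCR added a space
        elif tag == "delete" and src == " ":
            labels.append("Omit Space")     # OCR dropped a space
        else:
            labels.append(f"{src}->{dst}")  # character substitution, e.g. e->c
    return labels


# Hypothetical (reference, OCR) pairs, for illustration only.
pairs = [
    ("Quercus", "Qucrcus"),
    ("Homo sapiens", "Homosapiens"),
    ("Panthera", "Pan thera"),
]
tally = Counter(label for ref, ocr in pairs for label in classify_ocr_errors(ref, ocr))

# Error rate from the figures on the slide: 1,056 of 3,003 names.
error_rate = 1056 / 3003 * 100
print(tally.most_common())
print(f"{error_rate:.2f}%")
```

Ranking the tally over a full sample of names would give a table in the same shape as the one on the slide.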
23. Considerations
• Improving OCR software is out of scope
– Google’s Tesseract is only viable open
source option
– Flurry of activity in 2006-2007, quiet since
• Rekeying is expensive given size of
corpus
– Will not scale
24. Name finding statistics
• 27.7 million pages scanned
• 70.4 million name strings found
• 56.2 million names verified with a
NameBankID
• 1.4 million unique names with a
NameBankID
• 3.3 million unique names *without* a
NameBankID
– This is where the interesting data live!!!
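To put these figures in proportion, here is a small Python calculation. It assumes that the 56.2 million verified names are a subset of the 70.4 million name strings found, and that the two unique-name counts do not overlap. The slide does not state either point explicitly.

```python
# Figures from the slide, in millions.
name_strings_found = 70.4
verified_strings = 56.2
unique_with_id = 1.4
unique_without_id = 3.3

# Share of all name strings found that were verified with a NameBankID.
verified_share = verified_strings / name_strings_found

# Share of unique names that have no NameBankID.
unmatched_unique_share = unique_without_id / (unique_with_id + unique_without_id)

print(f"verified name strings: {verified_share:.1%}")
print(f"unique names without a NameBankID: {unmatched_unique_share:.1%}")
```

Under those assumptions, roughly four in five name strings are verified, while about seven in ten unique names have no NameBankID.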
31. Mandate for new development
• display / manage articles
• meet community demands for
bibliography / citation management
• build from more open source tools
32. Development goals re: citations
• Create a repository for community-vetted
taxonomic bibliographies.
• Ability to ingest, display, download, and
index articles so that the BHL can operate
as an article repository.
• Build from existing community of work
around Drupal / Biblio.
– In use by collaborators
37. Services
• OpenURL
– Facilitate links to citations: protologues, articles, references
• Documentation:
http://www.biodiversitylibrary.org/openurlhelp.aspx
• Names Service
– Return all occurrences of a name throughout BHL digitized
corpus
• Documentation: http://bit.ly/2e6sg9
– Access to 51 million name strings using TaxonFinder
– 1.4 million unique names
– Working out a strategy for obscure species
– Algorithm improvements to detect nomenclatural & taxonomic
acts
• New API
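As a rough sketch of what a citation link through the OpenURL service might look like, the Python below assembles a query string from generic OpenURL 0.1 keys (`genre`, `title`, `volume`, `spage`, `date`). The resolver path in `BHL_OPENURL_BASE`, the helper name `build_openurl`, and the sample citation are assumptions for illustration. The parameters BHL actually accepts are listed in the OpenURL documentation cited above.

```python
from urllib.parse import urlencode

# Assumed resolver endpoint; confirm against the BHL OpenURL documentation.
BHL_OPENURL_BASE = "http://www.biodiversitylibrary.org/openurl"


def build_openurl(title, volume=None, start_page=None, year=None, genre="book"):
    """Build an OpenURL-style link from basic citation fields (hypothetical helper)."""
    params = {"genre": genre, "title": title}
    if volume is not None:
        params["volume"] = str(volume)
    if start_page is not None:
        params["spage"] = str(start_page)
    if year is not None:
        params["date"] = str(year)
    # urlencode escapes spaces and other reserved characters in the values.
    return f"{BHL_OPENURL_BASE}?{urlencode(params)}"


# Hypothetical citation: a title, volume, start page, and year.
link = build_openurl("Species plantarum", volume=1, start_page=5, year=1753)
print(link)
```

A link of this kind could be generated from a protologue or reference citation in another system and resolved to the matching page, if the resolver supports the fields supplied.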