This document discusses the Biodiversity Heritage Library (BHL) project. It provides statistics on the size and usage of the BHL digital library, describes the workflow and technologies used to digitize materials. It also summarizes services provided by BHL, including name finding, an application programming interface (API), and links to citation databases. Plans are outlined to improve display of articles and develop new tools for working with bibliographies and citations.
20. SOAP response Name finding via TaxonFinder Extract names Submit to NameBank Image from Scanner Converted to text via OCR Name Finding in action with Taxonomic Intelligence…
21. Name Finding Evaluation Structured and performed by Qin Wei Ph.D. student at UIUC, working with Bryan Heidorn Methodology Scholarly volunteers manually identified scientific names on random sample of 392 pages in BHL corpus Compared those against OCR,then two name finding algorithms (TaxonFinder & FAT) Goals Spark discussion, set baseline for future work
23. 35.16% OCR error rate for names only Of the 3,003 names, 1,056 were incorrectly transcribed by OCR. Top OCR errors
24. Considerations Improving OCR software is out of scope Google’s Tesseract is only viable open source option Flurry of activity in 2006-2007, quiet since Rekeying is expensive given size of corpus Will not scale
25. Name finding statistics 27.7 million pages scanned 70.4 million name strings found 56.2 million names verified with a NameBankID 1.4 million unique names with a NameBankID 3.3 million unique names *without* a NameBankID This is where the interesting data live!!!
33. Mandate for new development display / manage articles meet community demands for bibliography / citation management build from more open source tools
34. Development goals re: citations Create a repository for community-vetted taxonomic bibliographies. Ability to ingest, display, download, and index articles so that the BHL can operate as an article repository. Build from existing community of work around Drupal / Biblio. In use by collaborators
39. Services OpenURL Facilitate links to citations: protologues, articles, references Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx Useful to Nomenclators, Reference Systems IPNI Tropicos Names Service Return all occurrences of a name throughout BHL digitized corpus Documentation: http://bit.ly/2e6sg9 Access to 51million name strings using TaxonFinder 1.4million unique names Working out a strategy for obscure species Algorithm improvements to detect nomenclatural & taxonomic acts New API
43. How? Tropicos maintains internal authority list of publications: Each protologue/reference tied to authority: Matched Tropicos TitleIDs to BHL TitleIDs: Throw citations at resolver at regular intervals & cache data in Tropicos http://www.tropicos.org/Publication/775 http://www.tropicos.org/Publication/775 = http://www.biodiversitylibrary.org/title/3934 http://www.biodiversitylibrary.org/openurl? pid=title:3934&volume=14&issue=&spage=301&date=1879
45. Other consumers EarthCape Lab BioGuid BioSTOR Research projects BREC - NSF Conjecturator - NSF Darwin’s Library – NEH/JISC Hong Cui @ University of AZ - NSF
52. Global BHL Nodes BHL-Australia http://ec2-75-101-224-221.compute-1.amazonaws.com/ BHL-China http://bhl-china.org BHL-Europe http://biodiversitylibrary.eu
53. BHl: collaboration w/ BIG Zeamayshttp://www.eol.org/pages/1115259 Literature: http://biodiversitylibrary.org/name/Zea_mays
54. <insert discussion here> Existing issues in Jira Taxonomic name finding enhancements Nomenclatural acts in web services Other algorithms / verification WoRMS data Improvement Ranking results Visualization LifeDesks Bibliography sharing Resolve to articles