The document discusses the Biodiversity Heritage Library (BHL), an open access digital library focused on biodiversity literature. It provides details on the BHL's member institutions, organizational structure, content selection and digitization processes, metadata standards, and online platform. The BHL aims to make biodiversity literature from its member institutions openly available online by digitizing books and journals, generating metadata, and developing tools for access and discovery.
5. BHL “classic” or US/UK: 15 institutions …Formed in
2006, 13 members and 2 affiliates
15 Members
•Academy of Natural Sciences Library
and Archives
•American Museum of Natural History
Library
•California Academy of Sciences Library
•Cornell University Library
•The Field Museum Library
•Harvard University Botany Libraries
•Ernst Mayr Library of the Museum of
Comparative Zoology
•Library of Congress
•Marine Biological Laboratory and Woods
Hole Oceanographic Institution Library
•Missouri Botanical Garden Library
•Natural History Museum, London, Library
& Archives
•The New York Botanical Garden
•Royal Botanic Garden, Kew, Library &
Archives
•Smithsonian Institution Libraries
•United States Geological Survey
Libraries
6. BHL “classic” or US/UK: Key Organizational Points
• BHL is not a legal entity; fiduciary
and legal agreements are generally
delegated to individual members
• Membership is governed by a
Memorandum of Understanding
signed by all members
• Two levels of membership (as of
2013):
• Member (voting and
administrative input);
annual dues of $10,000 USD
• Affiliate (provide content or
other services to BHL; no
voting or input into overall
BHL direction)
• BHL Secretariat (Administrative
component), housed at the
Smithsonian Libraries
• BHL Technical Team (housed at
Missouri Botanical Garden)
• BHL Executive Committee (elected
by Members): Chair, Vice-Chair and
Secretary
11. Selection
High-yield taxonomic materials
Unique & rare materials
Permissions titles
User requested titles & gap-fills
Discipline specific subject matter
Non-BHL member materials ingested from the
Internet Archive
Priority scale: ACTIVE / HIGH PRIORITY to PASSIVE / LOW PRIORITY
12. Increase agreements with publishers of in-copyright materials
US Titles: 206
UK Titles: 67
TOTAL TITLES: 273
US Licensors: 85
UK Licensors: 40
TOTAL LICENSORS: 125
13. Deduplication
• We try to avoid duplication where possible
• Tools
• Serials = Scanlist
• Monographs = Monographic deduper
• Check the BHL before you send for scanning
• We do our best but duplication happens
• Post-digitization, we merge titles as necessary
14. BHL US/UK Principles: Digitization
Mass-digitization scanning operation
Scan books, cover-to-cover
As best we can, we seek to provide an exact digital
copy of the original physical object
We are experimenting with field note books
Infrastructure supports book-like objects
We do not (yet) have maps, art-works, photographs
Workflow designed around scanning physical books
We are working on solutions for incorporating
born-digital materials
15. BHL US/UK Principles: Digitization
Most BHL US/UK libraries scan directly through the
Internet Archive
We pay the Internet Archive to provide us with full
digitization services
Each BHL US/UK member library has its own
workflow:
Sending the books from our shelves
And the bibliographic metadata from our library catalogs
To the digitization station
And returning the books back to our shelves
16. BHL US/UK Principles: Metadata
Our baseline standard is MARC
We derive the metadata that displays on the BHL
website from the MARC records
We aggregate the bibliographic records from each
of our library catalogs into the BHL database AS
IS
We edit the metadata displayed on the BHL website
manually as necessary
BHL Digitization Specifications documentation currently
being updated
17. Digitization workflow
1. Titles vs. Items vs. Segments
2. Metadata we need:
• MARC for book and journal titles
• Volume information
• Page data
BHL Term     | Titles                 | Items         | Segments
Library Term | Book or Journal Titles | Volume, Piece | Articles, Book chapters
Meaning      | Conceptual unit        | Object        | Section of consecutive pages
20. Internet Archive Scanning
Northeast Regional
Scanning Facility
(Boston)
New Jersey Facility
Natural History Museum,
London
Fedscan (Library of
Congress)
Internet Archive (San
Francisco)
Smithsonian Libraries
Missouri Botanical Garden
(Non-Scribe operation)
23. … and now including segments
92,356 “segments”
24. New Content Types
Field Books and other archival materials
Stand-alone or linked illustrations
25. Relevant information in Spanish
about Mexican biodiversity
BHL contains much in Spanish or about Mexico,
but it is not clearly broken out
Ensayo ornitologico de los troquilideos ó colibries de Mexico.
Mexico: I. Escalante, 1875
The Orchidaceae of Mexico and Guatemala.
London: Ridgway, [1837-1843]
A selection of the birds of Brazil and Mexico: the drawings
London: H.G. Bohn, 1841
27. 2012 BHL Member In-kind Staff FTE & Costs (incomplete)
14.193 FTE from the 14 member institutions
$1,239,300 staff and other costs
(does not include Secretariat or Technical staff)
28. 2012 BHL Central Support
Total: 7.055 FTE (Technical: 4.43 FTE; Secretariat: 2.625 FTE)
Staff: $316,053; Other: $78,477; Total: $394,531
Staff: $472,529; Other: $22,615; Total: $495,144
31. User Statistics: 2007 - 2013
Visitors: 3,628,088
Page Views: 17,604,395
New vs. Returning: 50.06% vs. 49.04%
[Chart: visitors, 2007 to 2013. Labeled point: 146,798 visitors, November 2012]
33. I am thrilled with what I have been able to find
re: archaic mammary embryology some of
which I had been hoping to find at the National
Library of Medicine, and to get it through your
program was a huge advantage. Last night I
believe I requested and received 11 PDFs, all
of which are essential to a review paper* I am
completing.
Olav T. Oftedal PhD
Smithsonian Environmental Research Center
* “Evo-Devo of the Mammary Gland” by Oftedal, et al.
Journal of Mammary Gland Biology and Neoplasia (May 2013)
34. “Thank you much for your
What an absolutely wonderful site.
It is a treasure trove of information.
Thank you!
May I compliment you on this splendid service? The
Library's invaluable for my work on seasonal
variability of climate and vector-borne disease in
British India, 1875-1940.
I really appreciate your work. The Biodiversity Heritage
Library is an excellent resource that regularly helps my
assistant and I obtain original descriptions for plants .... I
feel so privileged to be working in a day in age when such
resources are so readily available and easy to obtain.
36. BHL Social Media (February 2013)
Facebook: Total Page Likes: 4,384
Twitter @BioDivLibrary: Total Followers: 2,369
Pinterest: 2,373 images & 16 collections
Blog: Total Visits: 9,096 (2Q13)
40. BHL Architecture: Window Seat Ed.
[Diagram. Labels: Firewall; Internet Archive; Images (JP2); PDF; Coordinate-based OCR; XML metadata; BHL DB; Storage; Logic; Access; APIs; UI; Data Exports; Data Transform Utilities; Geocoding; Name Finding]
42. Hardware & Software
Hardware
Scribe station
Off-the-shelf scanners or good-quality digital cameras
Software
Wonderfetch -> Partner Meta App (when using Scribe
machines)
Macaw
Uploading directly to Internet Archive (for example: MBG’s
Botanicus)
43. Standards and formats to consider
The simplest way to contribute a text item to IA is currently as a single PDF file. IA
creates a second PDF with a text layer if none exists.
Items can be submitted as a stack of image files, one image per page. The files
can be in JPEG2000, JPG, or TIFF format, but with strict requirements for
how the files in an image stack are to be named, and the stack needs to be
packed into a single .zip or .tar file before submission.
When IA (Archive.org) scans a book for a Contributing Library, they use the
custom-engineered "Scribe" workstation, but for many materials, adequate
images can be made with off-the-shelf scanners or good-quality digital
cameras.
For best results, it is recommended to use the highest resolution your device is
capable of. Most images IA processes were produced at a resolution of 300-
600 ppi.
44. Standards and formats to consider
BHL recommends following, in part, the DLF's "Benchmark for
Faithful Digital Reproductions of Monographs and Serials"
(available online at
http://www.diglib.org/standards/bmarkfin.htm).
Bitonal: 600 dpi, 1-bit or bitonal TIFF images
Grayscale: 300 dpi, 8-bit grayscale uncompressed TIFF, or lossless
compressed image (e.g. LZW, JPEG2000 [*.jp2]).
Color: 300 dpi, 24-bit color uncompressed TIFF, or lossless compressed
images (e.g. LZW, JPEG2000 [*.jp2]).
NOTE: the above specifications are the preferred ones. BHL
will, however, accept lossy files. In the case of JPEG2000,
files with a compression level of 85% are acceptable.
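The preferred specifications above can be expressed as a small lookup, sketched here in Python. This is an illustrative check only: the names `PREFERRED_SPECS` and `meets_preferred_spec` are hypothetical, and treating the listed dpi as a minimum (rather than an exact value) is an assumption, not something the slide states.

```python
# Preferred scanning specifications as listed on the slide.
# Assumption: the dpi value is treated as a minimum, the bit depth as exact.
PREFERRED_SPECS = {
    "bitonal": {"dpi": 600, "bit_depth": 1},
    "grayscale": {"dpi": 300, "bit_depth": 8},
    "color": {"dpi": 300, "bit_depth": 24},
}


def meets_preferred_spec(kind: str, dpi: int, bit_depth: int) -> bool:
    """Return True if an image of the given kind meets the preferred spec."""
    if kind not in PREFERRED_SPECS:
        raise ValueError(f"unknown image kind: {kind!r}")
    spec = PREFERRED_SPECS[kind]
    return dpi >= spec["dpi"] and bit_depth == spec["bit_depth"]


if __name__ == "__main__":
    print(meets_preferred_spec("color", 300, 24))
    print(meets_preferred_spec("bitonal", 300, 1))
```

Files that fail this check are not necessarily rejected, since the note above says lossy files are also accepted.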
45. Standards and formats to consider
Currently, BHL data can be downloaded as MODS,
EndNote and BibTeX. See our wiki page with more
information:
http://biodivlib.wikispaces.com/Data+Exports#x--MODS
Title metadata as well as pagination, descriptive, and page
order (structural) metadata are being copied into METS
files in the biodiversity collection at IA.
The purpose of these METS files is to accommodate our
pagination data.
These METS files are pagination-specific and do not include
the item/volume information.
If bibliographic metadata for BHL content is required, it
can be found in the MODS files on the Data Exports page.
46. Standards and formats to consider
For the future, we are looking at serving OLEF as
an envelope format to share information with
other BHL Nodes.
See http://www.bhle.eu/bhl-schema/v0.3/ and
http://www.slideshare.net/HeimoRainer/bhleuropemetadataharmonisationtdwg20111018kollerwhrainer/6
47. Metadata generation and
indexing strategy
Each item to be uploaded needs a unique
identifier within our central repository, currently
Internet Archive (archive.org), and a folder with
that name is created to hold the uploaded and
generated (derivative) files.
Within BHL we record metadata at 3 levels of
bibliographic granularity – Title, Item & Page –
as well as metadata for the Creator(s) of the
title.
48. Metadata generation and
indexing strategy
Scanned material (jp2.zip) and basic title-level metadata content
(marc.xml), item-level metadata (meta.xml) and page-level
metadata (scandata.xml) are uploaded to Internet Archive
(IA), in the ‘biodiversity’ collection.
JP2.zip: The compressed JP2 images (Compression Quality 15) that
IA will use for delivering pages to the Read Online feature
following a very specific naming convention for the filenames:
Master image files are named with the local library identifier + 4-digit
sequence number (with no gaps).
MARC.xml: The MARC record for the title from the library catalog in
MARCXML format
Title, *Abbreviation, *Creator, Description, Publisher, Start Date Published,
End Date Published, Local Library Identifier, *OCLC Number, *ISSN,
*ISBN, *Call Number, *Subject, *Language, Date Created, Date Last
Modified, *Foreign Keys
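The file naming rule described above (local library identifier + 4-digit sequence number, with no gaps) can be sketched as follows. This is a minimal illustration, not the official convention: the underscore separator, the starting number 1, and the function names are assumptions made for the example.

```python
import re
from typing import List


def stack_filenames(identifier: str, page_count: int, ext: str = "jp2") -> List[str]:
    """Build page image names: identifier + 4-digit sequence number.

    Assumptions: an underscore separates the two parts and numbering starts at 1.
    """
    return [f"{identifier}_{seq:04d}.{ext}" for seq in range(1, page_count + 1)]


def missing_sequence_numbers(filenames: List[str]) -> List[int]:
    """Return sequence numbers missing between the lowest and highest one seen."""
    pattern = re.compile(r"_(\d{4})\.[A-Za-z0-9]+$")
    seen = set()
    for name in filenames:
        match = pattern.search(name)
        if match is None:
            raise ValueError(f"file name does not end in a 4-digit sequence: {name!r}")
        seen.add(int(match.group(1)))
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


if __name__ == "__main__":
    names = stack_filenames("examplelib00123", 4)  # hypothetical identifier
    print(names)
    print(missing_sequence_numbers(names))
```

Running the gap check before packing the stack into the .zip file is one way to catch a skipped page early.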
49. Metadata generation and
indexing strategy
META.xml: The item-level information (even if redundant with
the title-level information) including the title, author, publisher,
copyright information, digitizing sponsor, date published, type
of item, and who originally uploaded it. IA may also update
this XML file with information as it processes the pages of the
item.
Barcode, Sequence, Local Library Identifier, +Start Volume, End
Volume, +Start Date, End Date, *Language, Scanning Institution,
*Scanning Contributor, *Scanning Sponsor, Date Created, Date
Last Modified
SCANDATA.xml: An XML file (scandata.xml) recording
information about each page image (handSide, cropBox,
original width & height, etc. )
FileName, Sequence, *Page number, *Page Type, Year, Volume,
IssuePrefix, Issue, Date Created, Date Last Modified
50. Metadata generation and
indexing strategy
CREATOR: A “Creator” is defined as a person or
company responsible for the creation of the Title.
Name, *Role, Date of Birth, Date of Death, Biography
A detailed description of the contents of each
of these files and of the whole process of
uploading content to IA is available at:
http://biodivlib.wikispaces.com/Upload
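The levels of granularity described in these slides (Title, Item, Page, plus Creator) can be pictured as nested records. The sketch below is only an illustration using a subset of the field names listed above; the class layout, the Python types, and the sample values are assumptions and do not reflect the actual BHL database schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Creator:
    """Person or company responsible for the creation of a Title."""
    name: str
    role: Optional[str] = None


@dataclass
class Page:
    """Page-level metadata (one record per page image)."""
    file_name: str
    sequence: int
    page_number: Optional[str] = None
    page_type: Optional[str] = None


@dataclass
class Item:
    """Item-level metadata (a physical volume or piece)."""
    barcode: str
    sequence: int
    local_library_identifier: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class Title:
    """Title-level metadata (the conceptual book or journal)."""
    title: str
    local_library_identifier: str
    creators: List[Creator] = field(default_factory=list)
    items: List[Item] = field(default_factory=list)

    def page_count(self) -> int:
        """Total number of pages across all items of this title."""
        return sum(len(item.pages) for item in self.items)


if __name__ == "__main__":
    # All values below are made up for illustration.
    page = Page(file_name="examplelib0001_0001.jp2", sequence=1, page_type="Text")
    item = Item(barcode="B0001", sequence=1,
                local_library_identifier="examplelib0001", pages=[page])
    title = Title(title="Example Journal", local_library_identifier="examplelib",
                  creators=[Creator(name="Example Author")], items=[item])
    print(title.page_count())
```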
51. Metadata generation and
indexing strategy
Internet Archive runs the OCR process and
generates “derivative files” that include:
The resulting files of the OCR process with ABBYY
FineReader (djvu, djvu.txt, djvu.xml, abbyy.gz)
A 100x152 pixel GIF with a looping, animated thumbnail of
the first 20 pages of a book.
The presentation version on BHL in PDF format.
The MARC record in binary and XML formats.
And others (for a more detailed description, see
http://biodivlib.wikispaces.com/Download+All+File+Types+and+Descriptions)
52. Metadata generation and
indexing strategy
The metadata from new items included in the BHL
collection is added to the database and indexed
for use in searches through the Portal and API
services.
Periodically, the OCR pages are run through
taxonomic name services to mine for new taxa
names, such as TaxonFinder (ubio.org) and, soon,
GNRDS (Global Names resolution tools and services:
resolver.globalnames.org).
Taxa names are added to the database and written
back into Internet Archive (names.xml)
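To give a feel for what mining OCR text for names involves, here is a deliberately naive sketch. It is not the TaxonFinder or Global Names algorithm: it only looks for capitalized-word plus lowercase-word pairs and keeps those whose first word appears in a caller-supplied genus list. The function name and the genus list are assumptions made for the example.

```python
import re
from typing import Iterable, List

# Naive pattern for a Latin binomial: capitalized genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def candidate_names(ocr_text: str, known_genera: Iterable[str]) -> List[str]:
    """Return unique 'Genus epithet' candidates, in order of first appearance."""
    genera = set(known_genera)
    found: List[str] = []
    for match in BINOMIAL.finditer(ocr_text):
        genus, epithet = match.groups()
        name = f"{genus} {epithet}"
        if genus in genera and name not in found:
            found.append(name)
    return found


if __name__ == "__main__":
    text = "Specimens of Quercus alba grew beside Acer rubrum. Quercus alba was common."
    print(candidate_names(text, {"Quercus", "Acer"}))
```

Real name-finding services also handle OCR errors, abbreviations, and names above or below species rank, which this sketch ignores.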
54. Online Platform
Publication
BHL API
(biodivlib.wikispaces.com/Developer+Tools+and+API)
The BHL Application Programming Interface (API) is a set of
REST-like web services that can be invoked via HTTP queries
(GET/POST requests) or SOAP.
Responses can be received in one of three formats: JSON, XML,
or XML wrapped in a SOAP envelope.
We are currently developing a new API v3, closer to a RESTful
design than previous versions, using resource-centric
URLs (where possible) and GET/PUT/POST/DELETE verbs.
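A query-style HTTP GET call of the kind described above might be assembled as in the sketch below. The base URL, the parameter names (`op`, `apikey`, `format`, `itemid`), and the operation name are assumptions made for illustration and should be checked against the API documentation on the wiki page; the code only builds a URL and sends no request.

```python
from urllib.parse import urlencode

# Assumed endpoint for the query-style API; verify against the official documentation.
BASE_URL = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def build_query_url(operation: str, api_key: str,
                    response_format: str = "json", **params) -> str:
    """Build a GET URL for one API operation (parameter names are assumptions)."""
    if response_format not in ("json", "xml"):
        raise ValueError("response_format must be 'json' or 'xml'")
    query = {"op": operation, "apikey": api_key, "format": response_format}
    query.update({key: str(value) for key, value in params.items()})
    return f"{BASE_URL}?{urlencode(query)}"


if __name__ == "__main__":
    # "GetItemMetadata", the item id, and the key are placeholder values.
    print(build_query_url("GetItemMetadata", "YOUR-API-KEY", itemid=12345))
```

A resource-centric design such as the one planned for v3 would instead put the resource in the URL path and rely on the HTTP verb to express the action.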
56. Online Platform
Management
BHL Admin Dashboard
Admin Functions
(Alert Message, Image Server, Collections, Institutions,
Languages, Page Types, PDF Requests, Segment Types)
Library Functions
(Titles/Items/Segments /Pagination/Authors)
Science Functions (Names (Taxa) on a Page)
Library Statistics
(Titles/Items/Pages/Names/Segments/Items with Segments,
Names, Pages with Names)
Growth Statistics
(Titles/Items/Pages/Names/Segments new this Month/Year)
57. Online Platform
Management
BHL Admin Dashboard
PDF Generation Statistics (Generated: 174,162)
Internet Archive Harvesting Statistics (Complete: 119,125 items)
BioStor Harvest Statistics (Published: 11,126 as of Aug. 29, 2013)
DOI Assignment Statistics (DOI Approved: 57,338 as of Aug 29,
2013)
Web Traffic Statistics (API v2, OpenURL)
Reports
(Item Pagination, Title Import History, Character Encoding Problems,
DOIs by Institution, Monographic Contributions,
Items by Contributor)
58. Online Platform
Management
Monographic Deduping Tool
The MBLWHOI Library has been working on a tool that
assists with de-duplicating the monographs that BHL
members are sending to IA for scanning.
The application is ready for use and it’s entirely web-based,
requiring no client or user configuration.
The monographic deduper acts as a master database that
contains records for all of the monographs that any BHL
partner institution has scanned.
59. Online Platform
Management
Monographic Deduping Tool
In addition, there is a process also in place that allows for
material ingested from the Internet Archive, but not
contributed by a BHL partner institution, to be added to the
deduper database.
Ultimately, the Monographic deduper database should be
seen as a living record of accountability that communicates
to staff collaborating in the BHL network, a partner’s
promise to digitize a particular monographic title.
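The idea of a shared record of who has promised to digitize which monograph can be sketched as a small in-memory registry. This is an illustration of the concept only, not the MBLWHOI tool: the matching rule (normalized title plus year), the class name, and the institution labels are all assumptions.

```python
import re
from typing import Dict, Tuple


def normalize_key(title: str, year: str) -> Tuple[str, str]:
    """Reduce a title to lowercase ASCII letters and digits so near-identical strings match."""
    words = re.sub(r"[^a-z0-9 ]", "", title.lower()).split()
    return (" ".join(words), str(year))


class DeduperRegistry:
    """Records which institution first claimed a monograph for scanning."""

    def __init__(self) -> None:
        self._claims: Dict[Tuple[str, str], str] = {}

    def claim(self, title: str, year: str, institution: str) -> str:
        """Register a claim and return the institution that holds it.

        If the returned name differs from `institution`, the title was
        already claimed and scanning it again would be a duplicate.
        """
        return self._claims.setdefault(normalize_key(title, year), institution)


if __name__ == "__main__":
    registry = DeduperRegistry()
    print(registry.claim("The Orchidaceae of Mexico and Guatemala.", "1837", "Library A"))
    print(registry.claim("the orchidaceae of Mexico and Guatemala", "1837", "Library B"))
```

A production tool would more likely match on catalog identifiers than on title strings; the string key here just keeps the example self-contained.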
60. Online Platform
Management
Serials Bid List
It is a catalogue that allows users to browse and search
Serials titles held by BHL member institutions using
advanced filtering.
62. Scanning Locally, Collaborating Globally
6 global nodes: By country, region, language
Each node is independent and self-organized, but all work under a
set of common principles
Share content as much as possible
Node leaders form a Global Coordinating Committee
Goal is to share a common portal where possible
Goal is to develop multi-lingual portal
68. Looking Forward
In any well-appointed
Natural History Library
there should be found
every book and every
edition of every book
dealing in the remotest
way with the subjects
concerned.
Charles Davies Sherborn
Epilogue to Index Animalium, March 1922