1) The Art of Life project aims to make natural history illustrations from biodiversity texts more accessible by automatically identifying them, developing metadata standards, and enabling community tagging on platforms like Flickr.
2) Algorithms were developed to identify illustrations in biodiversity texts and are being applied to the entire Biodiversity Heritage Library collection. An illustration metadata schema is being finalized.
3) The project benefits the scientific community by providing access to previously hidden illustrations, linking them to biodiversity databases, and making them freely available for reuse under public domain.
Finding a goldmine of natural history illustrations within BHL texts: the Art of Life project
1. Finding a goldmine of natural history illustrations within BHL texts: the Art of Life project
TDWG Oct 2013 Florence Italy
Trish Rose-Sandler, Missouri Botanical Garden
Art of Life project
2. BHL Problem statement
– Users want access to images, but access to images is limited
– How to broaden the audiences for BHL content?
3. What is Art of Life?
• Full title - The Art of Life: Data Mining and Crowdsourcing the
Identification and Description of Natural History Illustrations
from the Biodiversity Heritage Library (BHL)
• Grant given to Missouri Botanical Garden in St Louis
• Funded by National Endowment for the Humanities
• Runs May 2012-April 2014
4. (no text on slide)
5. 5 Primary Objectives of Art of Life
Objective 1: Define an appropriate metadata schema for natural history illustrations
Objective 2: Build software tools to automatically identify illustrations in the BHL corpus
Objective 3: Enhance existing tools to enable the initial sorting, viewing, and editing of these
identified visual resources.
Objective 4: Integrate tagging applications to enable a community of users to edit descriptive
metadata for the illustrations
Objective 5: Integrate the descriptive metadata generated by users back into BHL portal both for
access and preservation
6. (no text on slide)
7. Current status of Art of Life
• Development of the algorithms is complete. Running them
across the entire BHL corpus now.
• Draft schema for describing natural history illustrations was
posted for public review at http://tinyurl.com/9hm7nsb. In
process of converting it to an application profile
• Classifier tool complete
8. Algorithms
• Developed by folks at Indianapolis Museum of Art (IMA) Lab.
• Built 4 primary types:
– ABBYY (87% accurate)
– Contrast (88% accurate)
– Color (.09% accurate)
– Compression (9% accurate)
• Tested against a gold standard set of 100 books (40k pages)
• ABBYY and Contrast were chosen as most effective in finding
illustrations
9. Interface designed for BHL to assess performance of
algorithms
10. Interface developed to assign broad classes
11. Art of Life Schema
Needs to support three objectives:
1) to enable the discovery, description and use of the identified images by
artists, biologists, humanities scholars, librarians, and educators
2) to make BHL’s metadata and images available to other platforms
3) to import crowdsourced metadata generated in other platforms back
into BHL.
12. Schema landscape review
– VRA Core 4.0 (art image community)
– LIDO (museum community)
– Dublin Core (Web community)
– Darwin Core (biodiversity community)
– Audubon Core (biodiversity community)
13. ART OF LIFE SCHEMA ELEMENTS
* = required (shown in red on the slide)
Title *
Type *
Date *
Copyright *
Source *
Agent
Subjects
Description
Inscription
14. Example of illustration described using Art of Life schema
Title: Stictospiza formosa
Type: Paintings
Date: Publication: 1898
Agent: Author: Arthur G. Butler (1844-1925); Illustrator: F.W. Frohawk (1861-1946)
Description: A pair of finches with green and yellow bodies resting on reeds
Subjects: Birds, finches; Scientific name: Amandava formosa; Vernacular Name: Green Avadavat or Green Munia; Accepted Name: Amandava formosa (Latham, 1790)
Inscriptions: bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
Source: Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited,1889 (2nd edition). This image comes from the Biodiversity Heritage Library, and is available online at biodiversitylibrary.org/page/17195895
Rights: Public domain
15. How will this project benefit the scientific
community?
• Will provide access to content in BHL that has been largely hidden and difficult to
find. Functionality will be added to the BHL portal to allow searching for images by
species name, common names, subjects, and illustrators
• Once the images are available and described in places like Flickr and Wikimedia
Commons they will become easily linked to and available in other biodiversity-related
platforms such as Wikispecies and EOL
• Like the text content in BHL, most image content will fall under public domain and
be freely available for download and re-use so you can incorporate them into your
research and publications
16. Art of Life team
PI
Trish Rose-Sandler, Missouri Botanical Garden
Algorithm development
Ed Bachta, Charlie Moad, Kyle Jaebker, Indianapolis Museum of Art
Schema development
Gaurav Vaidya and Robert Guralnick, University of Colorado, Boulder
William Ulate, Missouri Botanical Garden
Programming
Mike Lichtenberg, Missouri Botanical Garden
Consultants
Doug Holland, Missouri Botanical Garden; Chris Freeland, Washington University
(former PI for Art of Life)
17. Interested? Here’s how you can help
• We welcome your feedback on the schema before it’s finalized!
http://tinyurl.com/9hm7nsb
• Would love to talk with other folks about their experiences with
crowdsourcing of metadata, particularly if you’ve used Flickr or Wikimedia
Commons
• Spread the word about this free, rich resource of images
http://www.flickr.com/photos/biodivlibrary and help us describe our
illustrations!
Art of Life evolved out of a need that BHL users expressed. We had a critical mass of textual content online, and BHL users knew there were amazing images within the BHL pages, but there was no easy way to find them other than opening a BHL book or volume and scrolling through it page by page. No descriptive metadata is attached to the illustrations to tell you the content of an image, the date it was created, or who was involved in its creation. We also wanted to expand BHL to new audiences and domains and felt the illustrations were the pathway for doing that. We knew these illustrations would be of interest not only to biologists but also to artists, historians of both art and science, educators, librarians, and curators, so we wrote a proposal to the National Endowment for the Humanities because we believed they would understand and want to support the cross-disciplinary nature of this content. Luckily they did, and they awarded Missouri Botanical Garden a grant for the Art of Life.
One way we’ve tried to address the need for image discovery is by pushing selected images to Flickr. We have created a BHL account on Flickr and pushed over 80,000 images so far, but this is a very manual process that takes considerable staff time. We estimate that there are millions of illustrations within BHL, so this manual process does not scale well. The address is flickr.com/photos/biodivlibrary
This is the Art of Life workflow diagram, which identifies the 4 stages the illustrations will move through: Extract, Classify, Describe, and Share. The Extract stage is where BHL pages will be run through the algorithms to identify which pages contain illustrations, whether they are full plates or only a section of the page. At the Classify stage, the pages with illustrations will be tagged by Art of Life staff as being one or several broad types such as drawing/painting, photograph, diagram, or map. For the Describe stage, the illustrations will be pushed into platforms such as Flickr and Wikimedia Commons, where both the general public and specialists can describe them in much greater detail, such as adding a title, creator, date (if different from the date of publication), and subjects. Wikimedia Commons is where the schema can play a role: because Wikimedia allows you to create templates, we can provide guidance to more expert taggers on what information to record and how to record it. In the Share stage, the metadata contributed in Flickr and Wikimedia Commons will be ingested into the BHL portal both for preservation and discovery. Because many of these new audiences don’t know about BHL and wouldn’t go to the BHL platform to discover the illustrations, we also want to push the illustrations out to environments those audiences are familiar with: Encyclopedia of Life, ARTstor, and even iTunesU, where we already have some themed collections at the book level.
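The four stages can be pictured as a simple ordered pipeline. The sketch below only illustrates that ordering; the `Stage` enum and the `next_stage` helper are hypothetical names made up for this example and are not part of any BHL or Art of Life software.

```python
from enum import Enum


class Stage(Enum):
    """The four workflow stages, in the order named in the notes."""
    EXTRACT = 1   # algorithms find pages that contain illustrations
    CLASSIFY = 2  # staff assign broad types (drawing/painting, map, ...)
    DESCRIBE = 3  # public and specialists add detailed metadata
    SHARE = 4     # metadata is brought back into the BHL portal


def next_stage(stage):
    """Return the stage that follows `stage`, or None after SHARE."""
    members = list(Stage)  # Enum members iterate in definition order
    index = members.index(stage)
    return members[index + 1] if index + 1 < len(members) else None


# Walk one illustration through every stage.
stage = Stage.EXTRACT
while stage is not None:
    print(stage.name.title())
    stage = next_stage(stage)
```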
The team developed a gold standard set to be used as a “control group” to compare results against. This was a set of 100 books and journals whose illustrations were manually tagged with “has illustration”. Accuracy rates are being computed based on how well each algorithm performs against the gold standard set.
ABBYY: relies on metadata output from the OCR process, which includes coordinate information that is not always accurate.
Contrast: pretty accurate, because the contrast qualities of pages with text and pages with images are easily distinguishable.
Color: pretty much useless, probably because many of the older texts exhibit yellowing or poor color quality.
Compression: not useful enough to be used.
We decided to go with ABBYY and Contrast.
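The notes do not say exactly how the accuracy rates were computed. As a minimal sketch, assuming accuracy means the fraction of pages where an algorithm's "has illustration" flag agrees with the manual tag, the comparison could look like this. The `accuracy` function and the five-page sample data are invented for illustration.

```python
def accuracy(predicted, gold):
    """Fraction of pages where the prediction matches the gold standard tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same pages")
    if not gold:
        raise ValueError("gold standard set is empty")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Hypothetical page-level flags for a five-page book.
gold = [True, False, False, True, False]      # manual "has illustration" tags
contrast = [True, False, True, True, False]   # made-up algorithm output

print(accuracy(contrast, gold))  # 0.8 (4 of 5 pages agree)
```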
This is the interface that IMA built for us to review the performance of the algorithms. The information on top shows the total pages in a book, the actual number of illustrations (based on the gold standard set), and the accuracy rating for ABBYY and Contrast. The controls on the upper right allow you to filter by true positives, true negatives, false positives, and false negatives. Each page image is then shown with its bounding coordinates and overall coverage. This allows us to play around with the coverage percentage so we can determine whether pages with 10% coverage are really illustrations or mostly anomalies such as an illustrated letter or artifacts on the page.
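To make the coverage experiment concrete, here is a rough sketch under stated assumptions: each page is reduced to a coverage fraction plus its gold standard tag, and a page counts as a predicted illustration when its coverage meets a threshold. The function names and sample values are hypothetical and do not come from the IMA tool.

```python
from collections import Counter


def outcome(predicted, actual):
    """Name the confusion-matrix cell for one page."""
    if predicted:
        return "true positive" if actual else "false positive"
    return "false negative" if actual else "true negative"


def tally(pages, threshold):
    """Count outcomes when pages at or above `threshold` coverage are flagged."""
    counts = Counter()
    for coverage, has_illustration in pages:
        predicted = coverage >= threshold
        counts[outcome(predicted, has_illustration)] += 1
    return counts


# (coverage fraction, gold standard tag); the 0.12 page stands in for
# something like an illustrated initial letter.
pages = [(0.62, True), (0.08, False), (0.12, False), (0.30, True), (0.00, False)]

for threshold in (0.10, 0.20):
    print(threshold, dict(tally(pages, threshold)))
```

With this made-up sample, a 10% threshold flags the 0.12 page as a false positive, while a 20% threshold does not, which is the kind of trade-off the interface lets us explore.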
The classification tool that staff will use to identify which broad type of illustration each page contains was developed by Joel Richard of the Smithsonian Libraries. He modified an existing tool called Macaw that BHL currently uses to add volume- and page-level metadata to its books. It’s sort of a light-table view of all the pages in a book that allows you to quickly highlight several images and globally assign map, drawing, etc.
A challenge for this project was to identify the schema, or perhaps schemas, that can serve the metadata needs of a mix of audiences. For example, an art historian reviewing an illustration may be interested in knowing the artist and the geographic location where the work was created in order to understand how the artist was influenced by his or her locality. A scientist considering the same illustration may be interested in knowing the species name and geographic distribution of the organism depicted in order to compare the development of the species with related species from that area. Both have a need for the geographic metadata contained within the text, but from different perspectives. Since we wanted to push these illustrations out into other platforms for crowdsourcing the descriptions and then bring that metadata back into the BHL platform, we needed a schema that would help guide users in what information to contribute and how to record it, and that would create some consistency in those descriptions so they are easier to bring back to BHL. Rather than inventing a new schema from scratch, we really wanted to adopt an existing schema or schemas so that when we shared the described illustrations beyond the BHL, the metadata could easily interoperate with data in other systems.
VRA Core was designed for artworks and the images that serve as surrogates for them. LIDO was designed for museum objects and has begun to supersede CDWA. Dublin Core, of course, is the default standard to consider for any online digital repository. Darwin Core and Audubon Core need no introduction in this community. I have to confess here that I have some personal bias towards VRA Core because I have been involved in the development and maintenance of version 4. But ultimately we determined that VRA Core really was the best fit for the natural history illustrations. Its elements and attributes were most closely aligned with the types of information we wanted users to record, and its relationship of works to one or more images fit nicely with the book structure, which often contains one or more illustrations on a single page. The only thing VRA Core lacked was a way to record an acceptedName and CommonName for a species. VRA Core has a subject attribute type of scientificName, but taxonomists need more specificity. Darwin Core was able to fulfill this need, and so we borrowed 2 elements from that schema.
We ended up with 9 elements total, 7 of which came from VRA Core 4.0 and 2 of which came from Darwin Core. The elements in red are required, but since Date, Copyright and Source are pulled directly from the bibliographic citation for the book, the tagger really only has to enter Title and Type. For Title, we recommend either pulling the value from a caption if one exists or writing a basic description of the objects in the image. For Type, BHL staff will apply at least one of 5 broad types: drawings/paintings; maps; photographs; diagrams; or prints. This gets added during the classification stage that I mentioned.
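One way to picture the required-versus-optional split is as a small record type with a completeness check. This is only a sketch: the `IllustrationRecord` class, its field names, and `missing_required` are hypothetical and are not the application profile itself. The sample values are taken from the example slide.

```python
from dataclasses import dataclass, field


@dataclass
class IllustrationRecord:
    # Required elements (hypothetical field names for the five required ones).
    title: str = ""
    type_: str = ""      # trailing underscore avoids shadowing the builtin
    date: str = ""
    copyright: str = ""
    source: str = ""
    # Optional elements.
    agent: list = field(default_factory=list)
    subjects: list = field(default_factory=list)
    description: str = ""
    inscription: str = ""


REQUIRED = ("title", "type_", "date", "copyright", "source")


def missing_required(record):
    """List the required elements that are still empty."""
    return [name for name in REQUIRED if not getattr(record, name).strip()]


# Date, Copyright and Source come from the book's citation...
record = IllustrationRecord(
    date="1898",
    copyright="Public domain",
    source="Butler, Arthur Gardiner. Foreign finches in captivity.",
)
print(missing_required(record))  # ['title', 'type_']

# ...so the tagger only needs to supply Title and Type.
record.title = "Stictospiza formosa"
record.type_ = "Paintings"
print(missing_required(record))  # []
```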
Here is an illustration described using the schema