The Smithsonian Libraries has digitized Taxonomic Literature II, an essential research tool for botanists. This presentation, with audio, starts with a description of Linked Data and a history of TL-2, then covers some of the methods and challenges we are encountering as we convert it to a digital version and Linked Open Data.
2. • What is Linked Open Data / The Semantic Web?
• Where can I see LOD in use?
• What is Taxonomic Literature II?
• How is it being converted to LOD?
• Did we encounter any challenges?
Agenda
3. Linked data
From Wikipedia, the free encyclopedia
A method of publishing structured data so that it can be
interlinked and become more useful. It builds upon
standard Web technologies … [and] extends them to
share information in a way that can be read
automatically by computers. This enables data from
different sources to be connected and queried.
What is Linked Open Data?
http://en.wikipedia.org/wiki/Linked_Open_Data
4. What is the Semantic Web?
Semantic Web
From Wikipedia, the free encyclopedia
A movement led by the World Wide Web Consortium… to
promote common data formats on the Web.
By encouraging the inclusion of semantic content in web
pages, the Semantic Web aims at converting the current
web dominated by unstructured and semi-structured
documents into a "web of data".
"The Semantic Web provides a common framework that
allows data to be shared and reused across
application, enterprise, and community boundaries."
http://en.wikipedia.org/wiki/Semantic_Web
5. Five Stars of Linked Open Data
★ Available on the web (in any format) but with an open license, to be Open Data.
★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table).
★★★ As (2), plus a non-proprietary format (e.g. CSV instead of Microsoft Excel).
★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.
★★★★★ All the above, plus: link your data to other people’s data to provide context.
What is Linked Open Data?
http://www.w3.org/DesignIssues/LinkedData.html
6. What is Linked Open Data?
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
7. What is Linked Open Data?
Identifier (subject) | Predicate (verb/relationship) | Identifier / Value (object)
Charles Darwin | BornOn | “Feb 12, 1809”
Charles Darwin | Born In | Shrewsbury
Charles Darwin | Type | Person
Charles Darwin | Author Of | On the Origin of Species
Shrewsbury | Type | City
Shrewsbury | Is In | England
England | Type | Country
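The graph on this slide can be written down as a plain list of triples. The following is a minimal sketch, not part of the original slides: plain strings stand in for identifiers, and the helper names `objects` and `birth_country` are hypothetical.

```python
# Each triple is (subject, predicate, object), mirroring the slide's graph.
# Plain strings stand in for identifiers; real Linked Data would use URIs.
TRIPLES = [
    ("Charles Darwin", "BornOn", "Feb 12, 1809"),
    ("Charles Darwin", "Born In", "Shrewsbury"),
    ("Charles Darwin", "Type", "Person"),
    ("Charles Darwin", "Author Of", "On the Origin of Species"),
    ("Shrewsbury", "Type", "City"),
    ("Shrewsbury", "Is In", "England"),
    ("England", "Type", "Country"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def birth_country(person, triples=TRIPLES):
    """Follow 'Born In', then 'Is In', until something typed Country is found."""
    for place in objects(person, "Born In", triples):
        for region in objects(place, "Is In", triples):
            if "Country" in objects(region, "Type", triples):
                return region
    return None
```

Chaining two lookups in `birth_country` is the small-scale version of the "web of networked data" idea: no single triple says where Darwin's country of birth is, but the connected triples do.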
8. Tim Berners-Lee outlined four principles
for linked open data:
1. Use URIs to denote things.
2. Use HTTP URIs so that these things can be
referred to and looked up ("dereferenced")
by people and user agents.
3. Provide useful information about the thing when its URI is
dereferenced, leveraging standards such as RDF, SPARQL.
4. Include links to other related things (using their URIs) when
publishing data on the Web.
What is Linked Open Data?
http://www.w3.org/DesignIssues/LinkedData.html
http://5stardata.info/
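Principles 2 and 3 describe "dereferencing": looking up an HTTP URI and getting useful data back. A rough sketch of what that lookup could look like with Python's standard library follows. It assumes the server supports content negotiation and can return Turtle for the `Accept: text/turtle` header, which is not stated in the slides; the function names are hypothetical.

```python
import urllib.request


def build_lookup_request(uri, media_type="text/turtle"):
    """Prepare an HTTP GET for `uri` that asks for an RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def dereference(uri, media_type="text/turtle", timeout=10):
    """Fetch the description of `uri`. Requires network access, and assumes
    the server honors the Accept header (not guaranteed for every URI)."""
    request = build_lookup_request(uri, media_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `dereference` on a resource URI would return whatever description the publisher chooses to serve; the same URI opened in a browser would typically lead to a human-readable page instead.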
9. What is Linked Open Data?
http://dbpedia.org/
resource/Charles_Darwin
“Feb 12, 1809”
http://dbpedia.org/
resource/Shrewsbury
BornOn
Born In
City
http://dbpedia.org/
resource/United_Kingdom
Type
Is In
Person
Type
Country
Type
Identifier Predicate Identifier /Value
http://dbpedia.org/resource/
On_the_Origin_of_Species
Author Of
Predicate Identifier /Value
10. What is Linked Open Data?
Predicate Vocabularies
• Dublin Core – General Metadata for Discovery
• SKOS – Simple Knowledge Organization System
• BIBO – Bibliographic Ontology
• BIO – Biographical
• FOAF – Friend of a Friend
• Events…
• Geographic…
• Many others!
• OWL – Web Ontology Language
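Predicates from these vocabularies are usually written in a short prefix:name form that expands to a full URI. The sketch below illustrates that expansion. The namespace URIs are the commonly published ones for these vocabularies (the FOAF and BIBO ones also appear later in the slides), but they should be verified before use; the `expand` helper is hypothetical.

```python
# Commonly published namespace URIs for some of the vocabularies listed above.
# Most are not given in the slides, so verify them before relying on them.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "bibo": "http://purl.org/ontology/bibo/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def expand(curie, namespaces=NAMESPACES):
    """Expand a prefixed name such as 'foaf:name' into a full predicate URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in namespaces:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return namespaces[prefix] + local
```

Because the expanded predicate is itself a URI, two data sets that both use `foaf:name` are stating the same relationship, which is what makes shared vocabularies worth the effort of choosing.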
11. What is Linked Open Data?
Mondeca Labs
Linked Open
Vocabularies (LOV)
Vocabulary of a Friend
(VOAF)
A vocabulary for
describing other
vocabularies
http://labs.mondeca.com/dataset/lov
13. What is Linked Open Data?
Benefits of Linked Open Data
• Disambiguation
• Connecting Relevant Content
• More visibility via Search
• Enrichment of your data
• Easier reuse of data
17. Library of Congress: Linked Data Services
http://id.loc.gov/
Schema.org
http://www.schema.org
Data.gov / Semantic
http://www.data.gov/semantic
LinkedData.org
http://linkeddata.org/
Stephen Dale: Linked Data in Action
http://www.slideshare.net/stephendale/linked-data-in-action-4487244
Other LOD Examples and Information
18. Taxonomic Literature: A selective guide to botanical
publications and collections with dates, commentaries
and types. (Stafleu et al.)
Essential Reference
Tool for Botanists
Authors and their
Publications from
1753 to 1940
It is a “database in book form.”
Taxonomic Literature II
24. Scanned the pages.
Uploaded to the Internet Archive.
Hired contractor for OCR and correction (99.97%
accuracy.)
Received XML dataset from Contractor.
Verified and Imported to SQL Server Database.
Built a website to search the data.
Taxonomic Literature II
27. 1. Select Identifiers for our data
http://library.si.edu/digital-library/tl-2/author/darwin
http://library.si.edu/digital-library/tl-2/title/origin_of_species
http://library.si.edu/digital-library/tl-2/title/1313
2. Choose vocabularies for predicates (harder than it
sounds)
OWL, FOAF, Dublin Core, OpenGraph, SIOC, SKOS, BIBO, etc.
3. Create Links to other data sources on the web.
Taxonomic Literature II
28. Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/author/darwin>
tl2:creator <http://library.si.edu/tl2/title/1313>
owl:sameAs <http://viaf.org/viaf/27063124>
<http://library.si.edu/tl2/title/1313>
dc:creator <http://library.si.edu/tl2/author/darwin>
owl:sameAs <http://www.archive.org/details/originofspecies00darwuoft>
owl:sameAs <http://www.worldcat.org/oclc/425919213>
Select Identifiers
29. Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/author/darwin>
rdf:type <http://xmlns.com/foaf/0.1/Person>
foaf:lastName “Darwin”
foaf:familyName “Darwin”
foaf:firstName “Charles”
foaf:givenName “Charles”
foaf:name “Darwin, Charles Robert”
skos:prefLabel “Darwin, Charles Robert”
bio:birth “1809”
bio:death “1882”
skos:definition “British evolutionary biologist”
tl2:personAbbreviation “Darwin”
Select Identifiers: Authors
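A record like the one on this slide is typically published in a serialization such as Turtle. The following is a minimal sketch of rendering one subject and its properties as a Turtle block. It is an illustration, not the project's actual export code; `to_turtle` is a hypothetical helper, and only predicates whose namespaces are well known are used, since the slides do not give the namespace behind the `tl2:` prefix.

```python
def to_turtle(subject, properties, prefixes):
    """Render one subject and its (predicate, value) pairs as a Turtle block.

    Values wrapped in <...> are written as identifiers; anything else is
    written as a quoted string literal.
    """
    if not properties:
        raise ValueError("at least one property is required")
    lines = [f"@prefix {p}: <{uri}> ." for p, uri in prefixes.items()]
    lines.append("")
    lines.append(f"<{subject}>")
    rendered = []
    for predicate, value in properties:
        if value.startswith("<") and value.endswith(">"):
            obj = value
        else:
            escaped = (
                value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            )
            obj = f'"{escaped}"'
        rendered.append(f"    {predicate} {obj}")
    # Predicate/object pairs are separated by ';' and the block ends with '.'
    lines.append(" ;\n".join(rendered) + " .")
    return "\n".join(lines)


PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

darwin = to_turtle(
    "http://library.si.edu/tl2/author/darwin",
    [
        ("rdf:type", "<http://xmlns.com/foaf/0.1/Person>"),
        ("foaf:name", "Darwin, Charles Robert"),
        ("skos:prefLabel", "Darwin, Charles Robert"),
    ],
    PREFIXES,
)
print(darwin)
```

In practice an RDF library would handle escaping, datatypes, and validation; the hand-rolled version here only shows how prefix definitions, the subject identifier, and the predicate/value pairs fit together.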
30. Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/book/1313>
rdf:type <http://purl.org/ontology/bibo/Book>
tl2:titleNumber “1313”
tl2:titleAbbreviation “Origin sp.”
tl2:shortTitle “On the origin of species”
dc:title “On the origin of species by means of natural
selection, or the preservation of favoured races in the...”
dc:publisher “John Murray”
event:place “London”
dc:created “1859”
Select Vocabularies: Publications
31. Taxonomic Literature II as Linked Data
Linking: Author Names
Used a combination of OpenRefine and LODRefine as well as
custom code.
Results: Mixed
• Matched 15 - 20% of the names in our sample set
• Some names weren’t high in the list and required a human touch
Conclusion: Computer code needs to be improved with the aim of
minimizing amount of staff or volunteer time spent matching
names.
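To illustrate the kind of automated name matching discussed here, the sketch below normalizes names and ranks candidate matches by string similarity. This is not the project's custom code or the OpenRefine/LODRefine reconciliation logic; it is an assumed, simplified approach using Python's standard `difflib`, and the threshold of 0.85 is an arbitrary starting point.

```python
import difflib
import unicodedata


def normalize(name):
    """Lowercase, strip accents, and turn 'Last, First' into 'first last'."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"
    return " ".join(text.lower().split())


def rank_candidates(name, candidates, threshold=0.85):
    """Return (score, candidate) pairs at or above `threshold`, best first."""
    target = normalize(name)
    scored = [
        (difflib.SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in candidates
    ]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

Names that score just below the threshold are the ones most likely to need the "human touch" mentioned above, so a workflow might queue those for staff or volunteer review rather than discarding them.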
33. Taxonomic Literature II as Linked Data
Linking: Herbaria
Used computer code to split the herbarium names and identify
them in data provided by the Biodiversity Collections Index.
Results: Good
• Matched 95+% of the herbarium names in all of TL-2
• Careful attention to “A”, which is an herbarium but also starts
some sentences in the HERBARIUM and TYPES blocks
Conclusion: These will be added to TL-2 when it launches as LOD.
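The “A” problem can be illustrated with a small heuristic. The sketch below is an assumption about how such splitting might work, not the code used on TL-2: the list of codes is a hypothetical subset (the real list would come from the Biodiversity Collections Index data), and the sample sentences in the usage note are invented.

```python
import re

# Hypothetical subset of herbarium codes. "A" and "MO" appear in the slides;
# the others are placeholders for whatever the index data provides.
KNOWN_CODES = {"A", "BM", "K", "MO", "P"}

# A run of 1-6 capital letters that is not part of a longer word.
TOKEN = re.compile(r"(?<![A-Za-z])[A-Z]{1,6}(?![A-Za-z])")


def find_herbaria(block, known=KNOWN_CODES):
    """Return known herbarium codes in `block`, in order of first appearance."""
    found = []
    for match in TOKEN.finditer(block):
        code = match.group()
        if code not in known:
            continue
        before = block[:match.start()].rstrip()
        rest = block[match.end():]
        starts_sentence = before == "" or before[-1] in ".;:!?"
        # "A" at the start of a sentence, followed by a lowercase word, reads
        # as the English article ("A few specimens..."), so skip it.
        if code == "A" and starts_sentence and re.match(r"\s+[a-z]", rest):
            continue
        if code not in found:
            found.append(code)
    return found
```

For example, `find_herbaria("Material at A, BM, K and MO. A few letters are at P.")` keeps the first “A” and skips the second. The heuristic is still imperfect: a sentence that genuinely begins with the herbarium, such as “A and BM hold specimens”, would be missed, which is why this case needs careful attention.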
34. Taxonomic Literature II
Missouri Botanical Garden Herbarium
(From the Biodiversity Collections Index)
Lsid urn:lsid:biocol.org:col:15859
Name Missouri Botanical Garden Herbarium
Code MO
Kind Herbarium
Taxon Scope Herbarium collection limited to vascular plants (5.6 million
specimens) and bryophytes (500,000 specimens), Jan. 2009.
Geo Scope Worldwide; phanerogams strong in Central America (especially
Costa Rica, Nicaragua, and Panama), tropical South America. . .
Size 6,150,000
FoundedYear 1859
Web Site http://www.mobot.org/
Location Street P.O. Box 299
Location City Saint Louis
Location State Missouri
Location Postcode 63166-0299
Location Country Iso US
http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859
35. Taxonomic Literature II as LOD
How are we going to store all this?
We’re using Drupal, which can automatically embed some
Linked Open Data elements in the webpage.
Probably not a good idea for very large datasets.
TL-2 = 10,000 authors + 37,000 titles
(about 400,000 triples, but growing)
36. TL-2 and LOD Challenges
Performance of Drupal Import:
Feeds Import: 7 Hours for 35,000 “Records” or Drupal Nodes
Other options? Still searching…
Our linked data set will grow to at least 600-700k Drupal
nodes.
Is Drupal the best way to do this?
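To put the import numbers in perspective, the measured rate can be projected onto the expected data set size. The sketch below assumes import time scales linearly with the number of nodes, which is an assumption and may be optimistic for a growing database; the function name is hypothetical.

```python
def projected_import_hours(nodes, sample_nodes=35_000, sample_hours=7.0):
    """Scale the measured import time linearly to a larger node count."""
    return nodes * sample_hours / sample_nodes


low = projected_import_hours(600_000)   # 120.0 hours
high = projected_import_hours(700_000)  # 140.0 hours
print(f"{low:.0f} to {high:.0f} hours, about {low / 24:.1f} to {high / 24:.1f} days")
```

At 5,000 nodes per hour, 600-700k nodes would take roughly five to six days of continuous importing, which is why the question of alternatives matters.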
37. Challenges
• Errors in the Corrected OCR
• Challenges in Parsing Citations
• The 80/20 rule: manually making connections
unable to be made by automated means
• Finding suitable sources of data to link to.
(DBPedia? VIAF? EOL? Others?)
38. Summary
• This data may already exist online.
• It may also not always be as accurate as
needed for science.
• We are in a position to be the authoritative
source for this information.
• Linked Data allows it to be easily reused and
shared.
41. Thank You!
Unlocking Taxonomic Literature II
using Linked Open Data
Joel Richard
richardjm@si.edu
library.si.edu/staff/joel-richard
Special thanks to
The International Association for Plant Taxonomy, for giving us
permission to scan and digitize TL-2 and place it online.
For his advice and support, Dr. Laurence Dorr, Botanist and
Curator, Department of Botany, Smithsonian National Museum of Natural
History.
This project was partially funded by the Atherton Seidell Endowment
Fund of the Smithsonian Institution.
Editor's notes
This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak’s searching. Between 2009 and 2010 the number of items doubled. As of Sept 2011 there are 295 data sets listed. There are more today, and more are being added every day. Not all data sets are represented here, so this is only a sample of what’s available. The actual graph could be four or five times larger by now. What’s the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
The basic unit of LOD is the “triple”, made up of three elements: an identifier, a predicate, and another identifier or a value of some kind. Think of it as a sentence: subject-verb-object. The underlined blue text indicates an identifier that can be linked to on the web. The first part of the triple is always an identifier. The third part is sometimes a plain value, but it should be an identifier whenever one exists. When we repeat these connections, we start to create a web of networked data.
Looking back, we can see that Tim Berners-Lee mapped out these four principles that make up the foundation of linked data, which also give it structure and make it easy to use.
Going back to our web of data, we can now represent the identifiers as real web identifiers (URIs). The next question is: where do we get the predicates from? Why are they important?
There are numerous vocabularies of predicates that we can use when developing our linked open data. (Describe them more in detail, leading into the next slide)
Wow, look at all of them! Mondeca Labs has collected and classified all the vocabularies they can find. There are 350 vocabularies listed here.
Here is an example of some linked data in a reasonably human-readable form. We have some prefix definitions of the predicate vocabularies we are using. Then we have the identifier in green, and the predicates in blue. Values are in black with identifiers enclosed in greater-than and less-than signs.
What are the benefits of LOD?
Example of LOD in action. Google’s Knowledge Graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available, your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as “Lafayette” (college, various U.S. cities, or Marquis de). http://google.com/
Here are some more examples of places you can go for linked data. The Library of Congress has a linked data service for its authorities and vocabularies. Schema.org is being used within webpages to improve their visibility and search results. The US Government is offering a lot of data, some of it as linked data. LinkedData.org is a place to go to learn about all things linked data. Finally, Stephen Dale, a knowledge management consultant, has a great presentation with examples of linked data in use, which is a good place to learn more.
Overall, TL-2 provides the most comprehensive biographical and bibliographical analysis for systematic botany literature published between 1753 and 1940 to date.
Here is a page from TL-2. It’s hard to read. Let’s zoom in a bit.
When we’ve zoomed in, we can see Darwin’s name, description, birth and death dates, and an abbreviation in parentheses. We also have herbaria (libraries of plant samples) that he contributed to, and a brief note about his significance and how his works are greater than that which can be contained by TL-2.
Continuing our zooming… This includes some additional information that we know about Charles Darwin, including places where we can find known samples of his handwriting, species that were named for him and even postage stamps that honor him.
Continuing our zooming… Here we see three publications by Darwin giving a number of the book, the title and publication information.
The things that make TL-2 important are the unique abbreviations of the author names, e.g. “Darwin”, outlined in green. Also significant are the abbreviations of the titles of the publications, also outlined in green (“Origin sp.”), but not all publications have titles. In red are the book numbers, also unique across all 37,000 publications. Finally, we have the “short title” of the volumes, which is outlined in blue.
Briefly, this was our process to create the data. In Jan 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent them the scans and the OCR text (created at the Internet Archive), and they ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we then imported to a SQL Server database. Finally, we created a searchable, browseable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, in a month we get about 500 people visiting and 6000 pageviews, with about 60% of those coming from outside of the U.S.
This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. This page got approximately 860 visitors and 1500 visits in the month of April 2013, which is twice the number of visitors we got in April 2012. We actually get more visits from Europe than from North America. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
Earlier we mentioned 99.97% accuracy. If we assume 38 million characters in all of TL-2, this means there could be roughly 11,400 errors in our text. (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, it will prove to be a challenge. Other data sets may not have this problem, but as we are scanning and converting to text, this is something that will always be present for us.
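The error estimate in this note is simple arithmetic on the stated assumptions of 38 million characters and 99.97% character accuracy. A small sketch of the calculation, with a hypothetical function name:

```python
def expected_errors(total_chars, accuracy):
    """Estimate the number of wrong characters from a character-level accuracy."""
    return round(total_chars * (1 - accuracy))


errors = expected_errors(38_000_000, 0.9997)
print(errors)  # 11400
```

That is about one error in every 3,300 characters. For running prose this is barely noticeable, but a single wrong character in a title number, date, or herbarium code is enough to break an automated match.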
This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with those items that are immediately visible that can be parsed and turned into Linked Data. There is other data in the page that could be turned into linked data, but at this time we have only parsed the data that is highlighted on this page. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
As an example, Wikipedia has 3000 botanists in its database. We have 10,000 of them. We have the more complete, richer set of data that can be used to