Joel Richard, Smithsonian Libraries
Unlocking Taxonomic Literature II
using Linked Open Data
• What is Linked Open Data / The Semantic Web?
• Where can I see LOD in use?
• What is Taxonomic Literature II?
• How is it being converted to LOD?
• Did we encounter any challenges?
Agenda
Linked data
From Wikipedia, the free encyclopedia
A method of publishing structured data so that it can be
interlinked and become more useful. It builds upon
standard Web technologies … [and] extends them to
share information in a way that can be read
automatically by computers. This enables data from
different sources to be connected and queried.
What is Linked Open Data?
http://en.wikipedia.org/wiki/Linked_Open_Data
What is the Semantic Web?
Semantic Web
From Wikipedia, the free encyclopedia
A movement led by the World Wide Web Consortium… to
promote common data formats on the Web.
By encouraging the inclusion of semantic content in web
pages, the Semantic Web aims at converting the current
web dominated by unstructured and semi-structured
documents into a "web of data".
"The Semantic Web provides a common framework that
allows data to be shared and reused across
application, enterprise, and community boundaries."
http://en.wikipedia.org/wiki/Semantic_Web
Five Stars of Linked Open Data
★ Available on the web (in any format) but with an open license, to be Open Data.
★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table).
★★★ As (2), plus a non-proprietary format (e.g. CSV instead of Microsoft Excel).
★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.
★★★★★ All the above, plus: link your data to other people’s data to provide context.
What is Linked Open Data?
http://www.w3.org/DesignIssues/LinkedData.html
What is Linked Open Data?
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
What is Linked Open Data?
Identifier (subject) / Predicate (verb/relationship) / Identifier or Value (object)
Charles Darwin / Born On / “Feb 12, 1809”
Charles Darwin / Born In / Shrewsbury
Charles Darwin / Type / Person
Charles Darwin / Author Of / On the Origin of Species
Shrewsbury / Type / City
Shrewsbury / Is In / England
England / Type / Country
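The subject / predicate / object pattern on this slide can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the predicate spellings, and the helper names `objects` and `birth_country` are assumptions made for the example, not part of the TL-2 project.

```python
# Each statement is a (subject, predicate, object) tuple, mirroring the slide.
TRIPLES = [
    ("Charles Darwin", "Type", "Person"),
    ("Charles Darwin", "BornOn", "Feb 12, 1809"),
    ("Charles Darwin", "BornIn", "Shrewsbury"),
    ("Charles Darwin", "AuthorOf", "On the Origin of Species"),
    ("Shrewsbury", "Type", "City"),
    ("Shrewsbury", "IsIn", "England"),
    ("England", "Type", "Country"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object stated for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def birth_country(person, triples=TRIPLES):
    """Follow BornIn, then IsIn, until something typed as a Country is reached."""
    for place in objects(person, "BornIn", triples):
        for region in objects(place, "IsIn", triples):
            if "Country" in objects(region, "Type", triples):
                return region
    return None
```

Because every fact has the same three-part shape, a question such as "in which country was Darwin born?" becomes a matter of following two links, which is what `birth_country("Charles Darwin")` does.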
Tim Berners-Lee outlined four principles
for linked open data:
1. Use URIs to denote things.
2. Use HTTP URIs so that these things can be
referred to and looked up ("dereferenced")
by people and user agents.
3. Provide useful information about the thing when its URI is
dereferenced, leveraging standards such as RDF, SPARQL.
4. Include links to other related things (using their URIs) when
publishing data on the Web.
What is Linked Open Data?
http://www.w3.org/DesignIssues/LinkedData.html
http://5stardata.info/
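Principles 2 and 3 can be illustrated with Python's standard library. The sketch below only prepares a request for an RDF serialization of a resource; the helper name `build_rdf_request` is hypothetical, and whether a given server honours the `Accept` header with Turtle is an assumption, not something the slides state.

```python
from urllib.request import Request


def build_rdf_request(uri, media_type="text/turtle"):
    """Prepare (but do not send) an HTTP GET asking for an RDF serialization."""
    if not uri.startswith(("http://", "https://")):
        raise ValueError("Principle 2 asks for HTTP URIs: " + uri)
    # The Accept header tells the server which representation is wanted.
    return Request(uri, headers={"Accept": media_type})


req = build_rdf_request("http://dbpedia.org/resource/Charles_Darwin")
# Passing req to urllib.request.urlopen would perform the actual lookup.
```

The same URI can serve a web page to a browser and structured data to a program; the request header is what distinguishes the two.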
What is Linked Open Data?
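Identifier / Predicate / Identifier or Value
http://dbpedia.org/resource/Charles_Darwin / Born On / “Feb 12, 1809”
http://dbpedia.org/resource/Charles_Darwin / Born In / http://dbpedia.org/resource/Shrewsbury
http://dbpedia.org/resource/Charles_Darwin / Type / Person
http://dbpedia.org/resource/Charles_Darwin / Author Of / http://dbpedia.org/resource/On_the_Origin_of_Species
http://dbpedia.org/resource/Shrewsbury / Type / City
http://dbpedia.org/resource/Shrewsbury / Is In / http://dbpedia.org/resource/United_Kingdom
http://dbpedia.org/resource/United_Kingdom / Type / Country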
What is Linked Open Data?
Predicate Vocabularies
• Dublin Core – General Metadata for Discovery
• SKOS – Simple Knowledge Organization System
• BIBO – Bibliographic Ontology
• BIO – Biographical
• FOAF – Friend of a Friend
• Events…
• Geographic…
• Many others!
• OWL – Web Ontology Language
What is Linked Open Data?
Mondeca Labs
Linked Open
Vocabularies (LOV)
Vocabulary of a Friend
(VOAF)
A vocabulary for
describing other
vocabularies
http://labs.mondeca.com/dataset/lov
What is Linked Open Data?
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
<http://dbpedia.org/resource/Charles_Darwin>
    rdf:type <http://xmlns.com/foaf/0.1/Person> ;
    rdf:type <http://dbpedia.org/ontology/Scientist> ;
    foaf:name "Charles Darwin" ;
    foaf:depiction "http://upload.wikimedia.org/…/Charles_Darwin_seated_crop.jpg" ;
    dbpedia-owl:field <http://dbpedia.org/resource/Natural_history> ;
    dbpprop:placeOfBirth "Mount House, Shrewsbury, Shropshire, England" ;
    dbpedia-owl:birthDate "1809-02-12" ;
    dbpedia-owl:birthPlace <http://dbpedia.org/resource/Shrewsbury> ;
    dbpedia-owl:deathDate "1882-04-19" ;
    dbpedia-owl:deathPlace <http://dbpedia.org/resource/Down_House> ;
    dbpprop:awards <http://dbpedia.org/resource/Royal_Medal> .
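For readers who want to see how statements like these are written out one per line, here is a minimal sketch that serializes a few of them as N-Triples, the line-based RDF syntax. The function name `to_ntriples` and the `(predicate, object, is_uri)` tuple layout are assumptions for the example; a real project would more likely rely on an RDF library.

```python
def to_ntriples(subject, properties):
    """Serialize (predicate, object, is_uri) tuples for one subject as N-Triples lines."""
    lines = []
    for predicate, obj, is_uri in properties:
        if is_uri:
            term = "<" + obj + ">"
        else:
            # Escape backslashes and double quotes inside plain literals.
            term = '"' + obj.replace("\\", "\\\\").replace('"', '\\"') + '"'
        lines.append("<%s> <%s> %s ." % (subject, predicate, term))
    return lines


DARWIN = "http://dbpedia.org/resource/Charles_Darwin"
lines = to_ntriples(DARWIN, [
    ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://xmlns.com/foaf/0.1/Person", True),
    ("http://xmlns.com/foaf/0.1/name", "Charles Darwin", False),
    ("http://dbpedia.org/ontology/birthPlace", "http://dbpedia.org/resource/Shrewsbury", True),
])
```

Each resulting line is a complete subject, predicate, object statement ending in a period, with prefixes expanded to full URIs.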
What is Linked Open Data?
Benefits of Linked Open Data
• Disambiguation
• Connecting Relevant Content
• More visibility via Search
• Enrichment of your data
• Easier reuse of data
Linked Open Data in Use
Google Knowledge Graph
Linked Open Data in Use
Library of Congress: Linked Data Services
http://id.loc.gov/
Schema.org
http://www.schema.org
Data.gov / Semantic
http://www.data.gov/semantic
LinkedData.org
http://linkeddata.org/
Stephen Dale: Linked Data in Action
http://www.slideshare.net/stephendale/linked-data-in-action-4487244
Other LOD Examples and Information
Taxonomic Literature: A selective guide to botanical
publications and collections with dates, commentaries
and types. (Stafleu et al.)
Essential Reference
Tool for Botanists
Authors and their
Publications from
1753 to 1940
It is a “database in book form.”
Taxonomic Literature II
Scanned the pages.
Uploaded to the Internet Archive.
Hired contractor for OCR and correction (99.97%
accuracy.)
Received XML dataset from Contractor.
Verified and Imported to SQL Server Database.
Built a website to search the data.
Taxonomic Literature II
Taxonomic Literature II
First...what does 99.97% accuracy mean?
Taxonomic Literature II
~12,000 Errors
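The relationship between the two figures can be checked with a little arithmetic. The sketch below assumes that accuracy is measured per character and that the remaining 0.03% accounts for all of the roughly 12,000 errors; the slides do not state the unit, and the function name is hypothetical.

```python
def implied_characters(errors, accuracy_basis_points):
    """Characters implied by an error count at a given accuracy (in 1/10,000ths)."""
    error_basis_points = 10_000 - accuracy_basis_points
    return errors * 10_000 // error_basis_points


# 99.97% accuracy is 9,997 basis points, leaving 3 errors per 10,000 characters.
total = implied_characters(12_000, 9_997)
```

Under those assumptions the corpus would be on the order of 40 million characters, which is why even a very high accuracy rate still leaves thousands of errors to find.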
1. Select Identifiers for our data
http://library.si.edu/digital-library/tl-2/author/darwin
http://library.si.edu/digital-library/tl-2/title/origin_of_species
http://library.si.edu/digital-library/tl-2/title/1313
2. Choose vocabularies for predicates (harder than it
sounds)
OWL, FOAF, Dublin Core, Open Graph, SIOC, SKOS, BIBO, etc.
3. Create Links to other data sources on the web.
Taxonomic Literature II
Taxonomic Literature II as Linked Data
http://library.si.edu/tl2/author/darwin
http://library.si.edu/tl2/title/1313
tl2:creator <http://library.si.edu/tl2/title/1313>
owl:sameAs <http://viaf.org/viaf/27063124>
dc:creator <http://library.si.edu/tl2/author/darwin>
owl:sameAs <http://www.archive.org/details/originofspecies00darwuoft>
owl:sameAs <http://www.worldcat.org/oclc/425919213>
Select Identifiers
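One way to produce identifiers of the shape shown here is to derive a URL-safe key from an author name or title number. The sketch below is hypothetical: the `slugify` rules and the `mint_uri` helper are assumptions for illustration, not the scheme the project actually implemented.

```python
import re
import unicodedata

BASE = "http://library.si.edu/tl2"  # base path as shown on the slide


def slugify(label):
    """Lowercase, strip accents, and replace runs of other characters with underscores."""
    ascii_text = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def mint_uri(kind, key):
    """Build an identifier such as .../author/darwin or .../title/1313."""
    return "%s/%s/%s" % (BASE, kind, slugify(str(key)))
```

Name-based keys are readable but can collide when two authors share a surname, while number-based keys such as the TL-2 title number are stable but opaque; the slides show both styles.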
Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/author/darwin>
rdf:type <http://xmlns.com/foaf/0.1/Person>
foaf:lastName “Darwin”
foaf:familyName “Darwin”
foaf:firstName “Charles”
foaf:givenName “Charles”
foaf:name “Darwin, Charles Robert”
skos:prefLabel “Darwin, Charles Robert”
bio:birth “1809”
bio:death “1882”
skos:definition “British evolutionary biologist”
tl2:personAbbreviation “Darwin”
Select Identifiers: Authors
Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/book/1313>
rdf:type <http://purl.org/ontology/bibo/Book>
tl2:titleNumber “1313”
tl2:titleAbbreviation “Origin sp.”
tl2:shortTitle “On the origin of species”
dc:title “On the origin of species by means of natural
selection, or the preservation of favoured races in the...”
dc:publisher “John Murray”
event:place “London”
dc:created “1859”
Select Vocabularies: Publications
Taxonomic Literature II as Linked Data
Linking: Author Names
Used a combination of OpenRefine and LODRefine as well as
custom code.
Results: Mixed
• Matched 15 - 20% of the names in our sample set
• Some names weren’t high in the list and required a human touch
Conclusion: Computer code needs to be improved with the aim of
minimizing amount of staff or volunteer time spent matching
names.
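The split between automatic matches and names needing "a human touch" can be sketched with a simple similarity score. This is not the OpenRefine or LODRefine reconciliation used in the project; it substitutes Python's `difflib` for illustration, and the thresholds and the function name `best_match` are assumptions.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, accept=0.9, review=0.6):
    """Return (candidate, score, decision) for the closest candidate label."""
    if not candidates:
        return None, 0.0, "none"
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c) for c in candidates]
    score, candidate = max(scored)
    if score >= accept:
        decision = "auto"      # confident enough to link without review
    elif score >= review:
        decision = "review"    # queue for staff or volunteer confirmation
    else:
        decision = "none"      # no plausible candidate
    return candidate, score, decision
```

Raising the `accept` threshold trades fewer wrong links for more manual review, which is the staff-time cost the conclusion on this slide aims to reduce.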
Taxonomic Literature II as Linked Data
Charles Darwin
(From dbpedia.org)
Taxonomic Literature II as Linked Data
Linking: Herbaria
Used computer code to split the herbarium names and identify
them in data provided by the Biodiversity Collections Index.
Results: Good
• Matched 95+% of the herbarium names in all of TL-2
• Careful attention to “A”, which is an herbarium but also starts
some sentences in the HERBARIUM and TYPES blocks
Conclusion: These will be added to TL-2 when it launches as LOD.
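The "A" problem can be illustrated with a small heuristic. The sketch below is an assumption about how such splitting might work, not the project's code: it treats "A" as the article when it is followed by a lowercase word, and `KNOWN_CODES` is an illustrative subset rather than the Biodiversity Collections Index.

```python
import re

KNOWN_CODES = {"A", "BM", "K", "MO"}  # illustrative subset, not the real index

# An uppercase token with no letters directly before or after it.
TOKEN = re.compile(r"(?<![A-Za-z])([A-Z]{1,6})(?![A-Za-z])")


def herbarium_codes(text, known=KNOWN_CODES):
    """Return known herbarium codes in order of first appearance."""
    found = []
    for m in TOKEN.finditer(text):
        code = m.group(1)
        if code not in known:
            continue
        # "A" followed by a space and a lowercase word is read as the article.
        if code == "A" and re.match(r"\s+[a-z]", text[m.end():]):
            continue
        if code not in found:
            found.append(code)
    return found
```

A rule like this will still misread some sentences, so a sample of its output would need to be checked by hand before trusting it across the whole text.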
Taxonomic Literature II
Missouri Botanical Garden Herbarium
(From the Biodiversity Collections Index)
LSID: urn:lsid:biocol.org:col:15859
Name: Missouri Botanical Garden Herbarium
Code: MO
Kind: Herbarium
Taxon Scope: Herbarium collection limited to vascular plants (5.6 million specimens) and bryophytes (500,000 specimens), Jan. 2009.
Geo Scope: Worldwide; phanerogams strong in Central America (especially Costa Rica, Nicaragua, and Panama), tropical South America. . .
Size: 6,150,000
Founded Year: 1859
Web Site: http://www.mobot.org/
Location Street: P.O. Box 299
Location City: Saint Louis
Location State: Missouri
Location Postcode: 63166-0299
Location Country ISO: US
http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859
Taxonomic Literature II as LOD
How are we going to store all this?
We’re using Drupal, which can automatically embed some
Linked Open Data elements in the webpage.
Probably not a good idea for very large datasets.
TL-2 = 10,000 authors + 37,000 titles
(about 400,000 triples, but growing)
TL-2 and LOD Challenges
Performance of Drupal Import:
Feeds Import: 7 Hours for 35,000 “Records” or Drupal Nodes
Other options? Still searching…
Our linked data set will grow to at least 600-700k Drupal
nodes.
Is Drupal the best way to do this?
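To see why the import time is a concern, the observed rate can be scaled to the expected data set size. The sketch below assumes the import scales linearly, which the slides do not claim; the function name is hypothetical.

```python
def projected_hours(nodes, sample_nodes=35_000, sample_hours=7):
    """Scale the observed Feeds import rate linearly to a larger node count."""
    nodes_per_hour = sample_nodes / sample_hours  # 5,000 nodes per hour observed
    return nodes / nodes_per_hour


low = projected_hours(600_000)
high = projected_hours(700_000)
```

At the observed rate, 600,000 to 700,000 nodes would take roughly 120 to 140 hours of continuous importing, or about five to six days.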
Challenges
• Errors in the Corrected OCR
• Challenges in Parsing Citations
• The 80/20 rule: manually making the connections
that automated means cannot make
• Finding suitable sources of data to link to.
(DBPedia? VIAF? EOL? Others?)
Summary
• This data may already exist online.
• It may also not always be as accurate as
needed for science.
• We are in a position to be the authoritative
source for this information.
• Linked Data allows it to be easily reused and
shared.
Closing: something fun
One example of reuse
Ryan Schenk http://synynyms.com/
Thank You!
Unlocking Taxonomic Literature II
using Linked Open Data
Joel Richard
richardjm@si.edu
library.si.edu/staff/joel-richard
Special thanks to
The International Association for Plant Taxonomy, for giving us
permission to scan and digitize TL-2 and place it online.
For his advice and support, Dr. Laurence Dorr, Botanist and
Curator, Department of Botany, Smithsonian National Museum of Natural
History.
This project was partially funded by the Atherton Seidell Endowment
Fund of the Smithsonian Institution.

Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Unlocking Taxonomic Literature II using Linked Open Data

  • 1. Joel Richard, Smithsonian Libraries Unlocking Taxonomic Literature II using Linked Open Data
  • 2. • What is Linked Open Data / The Semantic Web? • Where can I see LOD in use? • What is Taxonomic Literature II? • How is it being converted to LOD? • Did we encounter any challenges? Agenda
  • 3. Linked data From Wikipedia, the free encyclopedia A method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies … [and] extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried. What is Linked Open Data? http://en.wikipedia.org/wiki/Linked_Open_Data
  • 4. What is the Semantic Web? Semantic Web From Wikipedia, the free encyclopedia A movement led by the World Wide Web Consortium… to promote common data formats on the Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web dominated by unstructured and semi-structured documents into a "web of data". "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries." http://en.wikipedia.org/wiki/Semantic_Web
  • 5. Five Stars of Linked Open Data ★ Available on the web (in any format) but with an open license, to be Open Data. ★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table). ★★★ As (2) plus non-proprietary format (e.g. CSV instead of Microsoft Excel). ★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff. ★★★★★ All the above, plus: link your data to other people’s data to provide context. What is Linked Open Data? http://www.w3.org/DesignIssues/LinkedData.html
  • 6. What is Linked Open Data? Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 7. What is Linked Open Data? A triple: Identifier (subject) → Predicate (verb/relationship) → Identifier/Value (object), e.g. Charles Darwin → BornOn → “Feb 12, 1809”. Diagram: Charles Darwin → Type → Person; Charles Darwin → BornOn → “Feb 12, 1809”; Charles Darwin → Born In → Shrewsbury; Shrewsbury → Type → City; Shrewsbury → Is In → England; England → Type → Country; Charles Darwin → Author Of → On the Origin of Species.
  • 8. Tim Berners-Lee outlined four principles for linked open data: 1. Use URIs to denote things. 2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents. 3. Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF, SPARQL. 4. Include links to other related things (using their URIs) when publishing data on the Web. What is Linked Open Data? http://www.w3.org/DesignIssues/LinkedData.html http://5stardata.info/
  • 9. What is Linked Open Data? http://dbpedia.org/resource/Charles_Darwin “Feb 12, 1809” http://dbpedia.org/resource/Shrewsbury BornOn Born In City http://dbpedia.org/resource/United_Kingdom Type Is In Person Type Country Type Identifier Predicate Identifier /Value http://dbpedia.org/resource/On_the_Origin_of_Species Author Of Predicate Identifier /Value
  • 10. What is Linked Open Data? Predicate Vocabularies • Dublin Core – General Metadata for Discovery • SKOS – Simple Knowledge Organization System • BIBO – Bibliographic Ontology • BIO – Biographical • FOAF – Friend of a Friend • Events… • Geographic… • Many others! • OWL – Web Ontology Language
  • 11. What is Linked Open Data? Mondeca Labs Linked Open Vocabularies (LOV) Vocabulary of a Friend (VOAF) A vocabulary for describing other vocabularies http://labs.mondeca.com/dataset/lov
  • 12. What is Linked Open Data? @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix dbpedia-owl: <http://dbpedia.org/ontology/> . @prefix dbpprop: <http://dbpedia.org/property/> . <http://dbpedia.org/resource/Charles_Darwin> rdf:type <http://xmlns.com/foaf/0.1/Person>; rdf:type <http://dbpedia.org/ontology/Scientist>; foaf:name "Charles Darwin"; foaf:depiction "http://upload.wikimedia.org/…/Charles_Darwin_seated_crop.jpg"; dbpedia-owl:field <http://dbpedia.org/resource/Natural_history>; dbpprop:placeOfBirth "Mount House, Shrewsbury, Shropshire, England"; dbpedia-owl:birthDate "1809-02-12"; dbpedia-owl:birthPlace <http://dbpedia.org/resource/Shrewsbury>; dbpedia-owl:deathDate "1882-04-19"; dbpedia-owl:deathPlace <http://dbpedia.org/resource/Down_House>; dbpprop:awards <http://dbpedia.org/resource/Royal_Medal> .
  • 13. What is Linked Open Data? Benefits of Linked Open Data • Disambiguation • Connecting Relevant Content • More visibility via Search • Enrichment of your data • Easier reuse of data
  • 14. Linked Open Data in Use Google Knowledge Graph
  • 17. Library of Congress: Linked Data Services http://id.loc.gov/ Schema.org http://www.schema.org Data.gov / Semantic http://www.data.gov/semantic LinkedData.org http://linkeddata.org/ Stephen Dale: Linked Data in Action http://www.slideshare.net/stephendale/linked-data-in-action-4487244 Other LOD Examples and Information
  • 18. Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types. (Stafleu et al.) Essential Reference Tool for Botanists Authors and their Publications from 1753 to 1940 It is a “database in book form.” Taxonomic Literature II
  • 24. Scanned the pages. Uploaded to the Internet Archive. Hired contractor for OCR and correction (99.97% accuracy.) Received XML dataset from Contractor. Verified and Imported to SQL Server Database. Built a website to search the data. Taxonomic Literature II
  • 26. First...what does 99.97% accuracy mean? Taxonomic Literature II ~12,000 Errors
  • 27. 1. Select Identifiers for our data http://library.si.edu/digital-library/tl-2/author/darwin http://library.si.edu/digital-library/tl-2/title/origin_of_species http://library.si.edu/digital-library/tl-2/title/1313 2. Choose vocabularies for predicates (harder than it sounds) OWL, FOAF, Dublin Core, OpenGraph, SIOC, SKOS, BIBO, etc. 3. Create Links to other data sources on the web. Taxonomic Literature II
  • 28. Taxonomic Literature II as Linked Data http://library.si.edu/tl2/author/darwin http://library.si.edu/tl2/title/1313 tl2:creator <http://library.si.edu/tl2/title/1313> owl:sameAs <http://viaf.org/viaf/27063124> dc:creator <http://library.si.edu/tl2/author/darwin> owl:sameAs http://www.archive.org/details/originofspecies00darwuoft owl:sameAs <http://www.worldcat.org/oclc/425919213> Select Identifiers
  • 29. Taxonomic Literature II as Linked Data <http://library.si.edu/tl2/author/darwin> rdf:type <http://xmlns.com/foaf/0.1/Person>; foaf:lastName "Darwin"; foaf:familyName "Darwin"; foaf:firstName "Charles"; foaf:givenName "Charles"; foaf:name "Darwin, Charles Robert"; skos:prefLabel "Darwin, Charles Robert"; bio:birth "1809"; bio:death "1882"; skos:definition "British evolutionary biologist"; tl2:personAbbreviation "Darwin" . Select Identifiers: Authors
  • 30. Taxonomic Literature II as Linked Data <http://library.si.edu/tl2/book/1313> rdf:type <http://purl.org/ontology/bibo/Book>; tl2:titleNumber "1313"; tl2:titleAbbreviation "Origin sp."; tl2:shortTitle "On the origin of species"; dc:title "On the origin of species by means of natural selection, or the preservation of favoured races in the..."; dc:publisher "John Murray"; event:place "London"; dc:created "1859" . Select Vocabularies: Publications
  • 31. Taxonomic Literature II as Linked Data Linking: Author Names Used a combination of OpenRefine and LODRefine as well as custom code. Results: Mixed • Matched 15-20% of the names in our sample set • Some names weren’t high in the list of candidate matches and required a human touch Conclusion: The code needs to be improved to minimize the amount of staff or volunteer time spent matching names.
  • 32. Taxonomic Literature II as Linked Data Charles Darwin (From dbpedia.org)
  • 33. Taxonomic Literature II as Linked Data Linking: Herbaria Used computer code to split the herbarium names and identify them in data provided by the Biodiversity Collections Index. Results: Good • Matched 95+% of the herbarium names in all of TL-2 • Careful attention was needed for “A”, which is a herbarium code but also starts some sentences in the HERBARIUM and TYPES blocks Conclusion: These will be added to TL-2 when it launches as LOD.
  • 34. Taxonomic Literature II Missouri Botanical Garden Herbarium (From the Biodiversity Collections Index) Lsid: urn:lsid:biocol.org:col:15859 | Name: Missouri Botanical Garden Herbarium | Code: MO | Kind: Herbarium | Taxon Scope: Herbarium collection limited to vascular plants (5.6 million specimens) and bryophytes (500,000 specimens), Jan. 2009. | Geo Scope: Worldwide; phanerogams strong in Central America (especially Costa Rica, Nicaragua, and Panama), tropical South America. . . | Size: 6,150,000 | FoundedYear: 1859 | Web Site: http://www.mobot.org/ | Location Street: P.O. Box 299 | Location City: Saint Louis | Location State: Missouri | Location Postcode: 63166-0299 | Location Country Iso: US | http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859
  • 35. Taxonomic Literature II as LOD How are we going to store all this? We’re using Drupal – automatically embed some Linked Open Data elements in the webpage. Probably not a good idea for very large datasets. TL-2 = 10,000 authors + 37,000 titles (about 400,000 triples, but growing)
  • 36. TL-2 and LOD Challenges Performance of Drupal Import: Feeds Import: 7 Hours for 35,000 “Records” or Drupal Nodes Other options? Still searching… Our linked data set will grow to at least 600-700k Drupal nodes. Is Drupal the best way to do this?
  • 37. Challenges • Errors in the Corrected OCR • Challenges in Parsing Citations • The 80/20 rule: manually making the connections that automated means cannot make • Finding suitable sources of data to link to. (DBPedia? VIAF? EOL? Others?)
  • 38. Summary • This data may already exist online. • It may also not always be as accurate as needed for science. • We are in a position to be the authoritative source for this information. • Linked Data allows it to be easily reused and shared.
  • 39. Closing: something fun One example of reuse Ryan Schenk http://synynyms.com/
  • 41. Thank You! Unlocking Taxonomic Literature II using Linked Open Data Joel Richard richardjm@si.edu library.si.edu/staff/joel-richard Special thanks to The International Association for Plant Taxonomy, for giving us permission to scan and digitize TL-2 and place it online. For his advice and support, Dr. Laurence Dorr, Botanist and Curator, Department of Botany, Smithsonian National Museum of Natural History. This project was partially funded by the Atherton Seidell Endowment Fund of the Smithsonian Institution.
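
The triples on slides 28 and 29 can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the `Triple` type and the `to_ntriples` helper are hypothetical names, the TL-2 URIs are copied from the slides, and pairing the VIAF identifier with the author record is one reading of the flattened diagram on slide 28. The output format is N-Triples, a line-based RDF serialization, chosen here because it needs no prefix handling.

```python
from typing import NamedTuple

# Namespace URIs for the vocabularies used on the slides.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Author identifier as shown on slide 28.
DARWIN = "http://library.si.edu/tl2/author/darwin"


class Triple(NamedTuple):
    """Hypothetical container: subject and predicate are always URIs."""
    subject: str
    predicate: str
    obj: str
    obj_is_uri: bool = False  # False means the object is a plain literal


triples = [
    Triple(DARWIN, RDF_TYPE, FOAF + "Person", True),
    Triple(DARWIN, FOAF + "name", "Darwin, Charles Robert"),
    Triple(DARWIN, SKOS + "prefLabel", "Darwin, Charles Robert"),
    # Assumption: the VIAF link on slide 28 belongs to the author record.
    Triple(DARWIN, OWL_SAME_AS, "http://viaf.org/viaf/27063124", True),
]


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    if t.obj_is_uri:
        obj = f"<{t.obj}>"
    else:
        # Escape backslashes and double quotes inside the literal.
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{t.subject}> <{t.predicate}> {obj} ."


for t in triples:
    print(to_ntriples(t))
```

Keeping an explicit flag for "identifier or value" mirrors the point made in the talk: the third part of a triple should be an identifier whenever one exists, and a literal only otherwise.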

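Two figures in the slides are simple arithmetic and can be checked with a short sketch: the error count implied by 99.97% OCR accuracy (slide 26) and the Drupal import time if the set grows to 600-700k nodes (slide 36). The function names are hypothetical, the 38 million character count comes from the speaker notes, and linear scaling of import time is an assumption made here, not a claim from the talk.

```python
def expected_ocr_errors(total_chars: int, accuracy: float) -> int:
    """Characters expected to be wrong at a given per-character accuracy."""
    return round(total_chars * (1.0 - accuracy))


def estimated_import_hours(nodes: int,
                           sample_nodes: int = 35_000,
                           sample_hours: float = 7.0) -> float:
    """Extrapolate import time from the measured 7 hours per 35,000 nodes.

    Assumes import time grows linearly with node count.
    """
    return nodes * sample_hours / sample_nodes


# About 38 million characters in TL-2, per the speaker notes.
print(expected_ocr_errors(38_000_000, 0.9997))

for nodes in (600_000, 700_000):
    print(nodes, estimated_import_hours(nodes))
```

With these inputs the error estimate comes to roughly 11,400 characters, in line with the "~12,000" on the slide, and the extrapolated import takes 120 to 140 hours, which is why the talk asks whether Drupal is the right store for the full data set.
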
Editor's notes

  1. This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak’s searching. Between 2009 and 2010 the number of items doubled. As of Sept 2011 there are 295 data sets listed. There are more today and more being added every day. Not all data sets are represented here, so this is only a sample of what’s available. The actual graph could be four or five times larger by now. What’s the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
  2. The basic unit of LOD is the “triple”, made up of three elements: an identifier, a predicate, and another identifier or a value of some kind. Think of it as a sentence: subject-verb-object. The underlined blue text indicates an identifier that can be linked to on the web. The first part of the triple is always an identifier. The third part is sometimes a plain value, but it should be an identifier whenever one exists. When we repeat these connections, we start to create a web of networked data.
  3. Looking back, we can see that Tim Berners-Lee has mapped out these four principles that make up the foundation of linked data, which also give it structure and make it easy to use.
  4. Going back to our web of data, we can now represent the identifiers as identifiers. The next question is: where do we get the predicates from? Why are they important?
  5. There are numerous vocabularies of predicates that we can use when developing our linked open data. (Describe them more in detail, leading into the next slide)
  6. Wow, look at all of them! Mondeca Labs has collected and classified all the vocabularies they can find. There are 350 vocabularies listed here.
  7. Here is an example of some linked data in a reasonably human-readable form. We have some prefix definitions of the predicate vocabularies we are using. Then we have the identifier in green, and the predicates in blue. Values are in black with identifiers enclosed in greater-than and less-than signs.
  8. What are the benefits of LOD?
  9. Example of LOD in action. Google’s knowledge graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as “Lafayette” (college, various U.S. cities, or Marquis de). http://google.com/
  12. Here are some more examples of places you can go for linked data. The Library of Congress has a linked data service for their authorities and vocabularies. Schema.org is being used within webpages to improve their visibility and search results. The US Government is offering a lot of data, some of it as linked data. LinkedData.org is a place to go to learn about all things linked data. Finally, Stephen Dale, a knowledge management consultant, has a great presentation with examples of linked data in use.
  13. Overall, TL-2 provides the most comprehensive biographical and bibliographical analysis for systematic botany literature published between 1753 and 1940 to date.
  14. Here is a page from TL-2. It’s hard to read. Let’s zoom in a bit.
  15. When we’ve zoomed in, we can see Darwin’s name, description, birth and death dates, and an abbreviation in parentheses. We also have herbaria (libraries of plant samples) that he contributed to, and a brief note about his significance and how his works are greater than that which can be contained by TL-2.
  16. Continuing our zooming… This includes some additional information that we know about Charles Darwin, including places where we can find known samples of his handwriting, species that were named for him and even postage stamps that honor him.
  17. Continuing our zooming… Here we see three publications by Darwin giving a number of the book, the title and publication information.
  18. The things that make TL-2 important are the unique abbreviations of the author names, e.g. “Darwin”, outlined in green. Also significant are the abbreviations of the titles of the publications, also outlined in green (“Origin sp.”), but not all publications have titles. In red are the book numbers, also unique across all 37,000 publications. Finally, we have the “short title” of the volumes, which is outlined in blue.
  19. Briefly, this was our process to create the data. In Jan 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent the scans and the OCR text (created at the Internet Archive) to the contractor, who ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we then imported to a SQL Server database. Finally, we created a searchable, browseable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, we get about 500 visitors and 6,000 pageviews a month, with about 60% of those coming from outside of the U.S.
  20. This is the current website that we have that shows a sample of the search results for Charles Darwin. This is not Linked Data. This page got approximately 860 visitors and 1,500 visits in the month of April 2013, which is twice the number of visitors we got in April 2012. We actually get more visits from Europe than from North America. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
  21. Earlier we mentioned 99.97% accuracy. This means that if we assume 38 million characters in all of TL-2, there are upwards of 12,000 errors in our text. (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, this will prove to be a challenge. Other data sets may not have this problem, but as we are scanning and converting to text, this is something that will always be present for us.
  22. This is a page of TL-2 showing Charles Darwin and On the Origin of Species with those items that are immediately visible that can be parsed and turned into Linked Data. There is other data in the page that could be turned into linked data, but at this time, we have only parsed the data that is highlighted on this page. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
  28. As an example, Wikipedia has 3,000 botanists in their database. We have 10,000 of them. We have the more complete, richer set of data that can be used to