Building the New Open Linked Library
1. Building the New Open
Linked Library
Theory and Practice
…and results!
Keri Thompson, Joel Richard, Trish Rose-Sandler
LITA National Forum, September 30, 2011
2. Smithsonian Libraries
• Founded in 1846
• 1.5 m volumes in collection, plus
assorted archival collections
• 15,000 volumes scanned and online
• 20 libraries serving ~500
researchers/curators + hundreds of
fellows and interns
• 102 library staff
• 1.5 web staff
• Founding member of the Biodiversity
Heritage Library
3. Linked Data in our Library
WHY Linked Open Data?
• It’s cool
• “Increase and Diffusion of Knowledge”
• Share, contribute to a global database
• Create context around our data
• Allow data to be reused/repurposed by
ourselves and others
• Improve discoverability of our content
4. Linked Data
“The Semantic Web isn’t just about putting data on the web. It is about
making links, so that a person or machine can explore the web of data.
With linked data, when you have some of it, you can find other, related,
data.”
Tim Berners-Lee, Linked Data – Design Issues
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful
information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover
more things.
5. Linked Data / Open Data
Linked Data
• Publishing structured data on the web
• RDF (Resource Description Framework)
• Enables computer-to-computer queries
• Uses standard ontologies (vocabularies)
• Data in “triples” (“triplestore”)
Open Data
• Freely available to use, reuse, republish with no restrictions
• Made available through various mechanisms such as .csv files, APIs
Example triple:
URI http://library.si.edu/tl2/author/charles-darwin
Predicate owl:sameAs
Object http://viaf.org/viaf/27063124
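As a rough illustration of the triple above, here is a minimal Python sketch that writes it out as one N-Triples statement. The helper name to_ntriples is hypothetical; the only thing added beyond the slide is the full URI behind the owl:sameAs prefix, which is the standard OWL namespace.

```python
# Minimal sketch: serialize one (subject, predicate, object) triple as N-Triples.
# `to_ntriples` is a hypothetical helper, not part of any library.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # full URI for owl:sameAs


def to_ntriples(subject: str, predicate: str, obj: str, obj_is_uri: bool = True) -> str:
    """Return one N-Triples line; URIs go in <>, plain literals in quotes."""
    if obj_is_uri:
        obj_term = f"<{obj}>"
    else:
        # Escape backslashes and double quotes inside a literal value.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = f'"{escaped}"'
    return f"<{subject}> <{predicate}> {obj_term} ."


triple = to_ntriples(
    "http://library.si.edu/tl2/author/charles-darwin",
    OWL_SAME_AS,
    "http://viaf.org/viaf/27063124",
)
print(triple)
```

The object can also be plain data rather than a link, in which case it would be passed with obj_is_uri=False.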
6. Our Website
Organically grown since 1995
• 83,000 HTML pages
• 3,700 ColdFusion pages
• 253,000 JPEG files
• 27,000 PNG files
• 46,000 PDFs
No CMS.
7. Digital Library Planning
1. Analyze and categorize our current
& future online content
2. Create high-level data models for
common content types
Questions:
Where are we metadata-rich?
What do we have that others don’t?
What is feasible right now?
8. Content Analysis
• 400+ Online “books”
• Exhibitions
• Research Tools
• Image Collections (60,000+ images)
• “Brochure” content (About us, Locations, Hours)
• Bibliographies, Fact Sheets, Subject Guides
• Databases, inventories and database-like books
Collections not on our website:
• ~15,000 digitized volumes, with many more planned
• Other analog collections that will be digitized
9. Linked Data in our Library
Books (and book-like objects)
• expose bibliographic data for reuse
• consume links to other internal content
and external authoritative data
Databases
• expose data previously unavailable
• provide authoritative data
• consume our data and others’ to create
new aggregate websites
10. Linked Digital Library Planning
1. Decide which data elements
should be exposed as linked data
for each content type
2. Choose appropriate vocabularies
3. Create a rough timeline and plan
for migrating site content (=1
year*)
* Optimism included in this estimate
11. Linked Data in our Library
Implement all this linked open data
goodness (and a shiny new website) by
moving to Drupal 7
12. Drupal and Linked Data
• Native support for RDFa in Drupal 7.
• RDF Extensions (rdfx) – even more features.
• Vocabularies can be imported and cached for
reuse.
• Few or no modifications to HTML to support
RDFa.
What’s the difference between RDF,
RDF/XML and RDFa?
13. RDFa Sample
URI: http://library.si.edu/book/origin-of-species
<meta content="The Origin of Species"
about="/book/origin-species" property="dc:title" />
<h1>The Origin of Species</h1>
<img typeof="foaf:Image"
src="http://localhost:8087/images/origin-of-species.png"
alt="The origin of species cover image"
title="The origin of species cover image" />
<div rel="bibo:authorList">
<a href="/content/darwin-charles-1809-1882">
Darwin, Charles, 1809-1882
</a>
</div>
<div property="dc:created">November 24, 1859</div>
<div property="bibo:numPages">1000</div>
<div property="dc:language">english</div>
<div rel="owl:sameAs">
<a href="http://www.worldcat.org/oclc/1184647"
target="_blank">http://www.worldcat.org/oclc/1184647</a>
</div>
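To show how a machine might read the annotations in the sample above, here is a minimal Python sketch that pulls predicate/value pairs out of an excerpt of that markup using only the standard library. It is a simplification for illustration, not a conforming RDFa processor: the class name SimpleRdfaExtractor is hypothetical, and it only handles the attribute patterns that appear in this sample (property with content, property with text, and rel followed by a link).

```python
from html.parser import HTMLParser

# Excerpt of the RDFa sample (image and target attribute left out for brevity).
SAMPLE = """
<meta content="The Origin of Species" about="/book/origin-species" property="dc:title" />
<div rel="bibo:authorList">
<a href="/content/darwin-charles-1809-1882">Darwin, Charles, 1809-1882</a>
</div>
<div property="dc:created">November 24, 1859</div>
<div property="bibo:numPages">1000</div>
<div rel="owl:sameAs">
<a href="http://www.worldcat.org/oclc/1184647">http://www.worldcat.org/oclc/1184647</a>
</div>
"""


class SimpleRdfaExtractor(HTMLParser):
    """Collects (predicate, value) pairs. NOT a full RDFa parser."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending_property = None  # property waiting for its text content
        self._pending_rel = None       # rel waiting for the next <a href>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "property" in a:
            if "content" in a:
                # Value supplied directly in the content attribute.
                self.pairs.append((a["property"], a["content"]))
            else:
                self._pending_property = a["property"]
        if "rel" in a:
            self._pending_rel = a["rel"]
        elif tag == "a" and self._pending_rel and "href" in a:
            # The link target is the object of the enclosing rel.
            self.pairs.append((self._pending_rel, a["href"]))
            self._pending_rel = None

    def handle_data(self, data):
        text = data.strip()
        if self._pending_property and text:
            self.pairs.append((self._pending_property, text))
            self._pending_property = None


extractor = SimpleRdfaExtractor()
extractor.feed(SAMPLE)
for predicate, value in extractor.pairs:
    print(predicate, "->", value)
```

In practice a dedicated RDFa library would be the safer choice; the sketch only shows that the predicates sit in the HTML attributes and the values in the content or link targets.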
15. What other modules are we using?
• Fields, Views, Views UI
• Node Reference
• SPARQL Endpoint, SPARQL API
• RESTful Web Services
• SPARQL Views
• RDF External Vocabulary Importer
Caveat: Some modules not ready for Drupal 7
• e.g., Biblio module (no CCK, RDF capabilities)
LITA National Forum, September 30,
2011
16. What about Namespaces/Vocabularies?
• Drupal 7 comes with several namespaces. We
will use: DC Terms, FOAF, SKOS, OWL
• We're working with books, so we
need the Bibliographic Ontology:
• Website: http://bibliontology.com/
• Namespace: http://purl.org/ontology/bibo/
• Prefix: “bibo”
• We may also create our own vocabulary.
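A namespace prefix is just shorthand for a longer URI. The Python sketch below shows how a prefixed name such as bibo:numPages expands to a full predicate URI. The bibo namespace is the one given above; the other four URIs are the published namespaces for DC Terms, FOAF, SKOS and OWL. The function name expand is hypothetical.

```python
# Prefix -> namespace URI. The bibo entry comes from the slide; the others
# are the standard published namespaces for those vocabularies.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "bibo": "http://purl.org/ontology/bibo/",
}


def expand(curie: str, prefixes: dict = PREFIXES) -> str:
    """Expand a prefixed name like 'dc:title' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local


print(expand("bibo:numPages"))
print(expand("foaf:name"))
```

A locally created vocabulary would be added the same way, by registering one more prefix and the URI where that vocabulary is published.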
18. Setting up RDF Mappings in Drupal
19. Databases: TL-2
Taxonomic Literature 2 (1977-2009)
• The standard reference work for plant taxonomic
literature from Linnaeus to 1940.
• Contains botanists, authors, biographies, citations,
and species.
• Indexed and cross referenced.
• Should be digitized & on the web!
• SIL aims to be an authority for
botanist names on the Internet.
20. TL-2 Page Sample
Taxonomic Literature 2 (TL-2). v1., p. 600
22. TL-2 Page Sample
http://library.si.edu/tl2/author/darwin
RDF Type = foaf:Person
foaf:lastName, foaf:familyName
foaf:firstName, foaf:givenName
foaf:name, skos:prefLabel
tl2:birthYear
tl2:deathYear
tl2:description
tl2:personAbbrev
http://library.si.edu/tl2/book/1313
RDF Type = bibo:Book
tl2:bookNumber
dc:title
event:place
dc:publisher
tl2:bookAbbreviation
dc:created
23. TL-2 Page Sample Results
http://library.si.edu/tl2/author/darwin
tl2:creatorOf “http://library.si.edu/tl2/book/1313”
owl:sameAs “http://viaf.org/viaf/27063124”
foaf:lastName “Darwin”
foaf:familyName “Darwin”
foaf:firstName “Charles”
foaf:givenName “Charles”
foaf:name “Darwin, Charles Robert”
skos:prefLabel “Darwin, Charles Robert”
tl2:birthYear “1809”
tl2:deathYear “1882”
tl2:description “British evolutionary biologist”
tl2:personAbbrev “Darwin”
http://library.si.edu/tl2/book/1313
dc:creator “http://library.si.edu/tl2/author/darwin”
owl:sameAs “http://www.archive.org/details/originofspecies00darwuoft”
tl2:bookNumber “1313”
bibo:shortTitle “On the origin of species”
dc:title “On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life.”
event:place “London”
dc:publisher “John Murray”
dc:created “1859”
tl2:bookAbbreviation “Origin sp.”
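The two records above link to each other, which is what makes them browsable by machine. The Python sketch below holds a handful of those statements as triples and follows the tl2:creatorOf link from the author to the book title, using a wildcard pattern match similar in spirit to a SPARQL triple pattern. Predicates are kept as prefixed names because the tl2 namespace URI is not given here; the function names match and titles_by are hypothetical.

```python
TL2_BASE = "http://library.si.edu/tl2/"
DARWIN = TL2_BASE + "author/darwin"
BOOK_1313 = TL2_BASE + "book/1313"

# A subset of the sample results, as (subject, predicate, object) triples.
TRIPLES = [
    (DARWIN, "tl2:creatorOf", BOOK_1313),
    (DARWIN, "owl:sameAs", "http://viaf.org/viaf/27063124"),
    (DARWIN, "foaf:name", "Darwin, Charles Robert"),
    (BOOK_1313, "dc:creator", DARWIN),
    (BOOK_1313, "bibo:shortTitle", "On the origin of species"),
    (BOOK_1313, "dc:created", "1859"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def titles_by(author_uri):
    """Follow author -> book links, then collect each book's short title."""
    books = [o for _, _, o in match(TRIPLES, s=author_uri, p="tl2:creatorOf")]
    return [
        title
        for book in books
        for _, _, title in match(TRIPLES, s=book, p="bibo:shortTitle")
    ]


print(titles_by(DARWIN))
```

The same match helper can answer the reverse question, for example which subjects have dc:creator pointing at the Darwin URI.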
24. Setting up TL-2 in Drupal
• Two Content Types: Authors (Botanists) and Publications
• Node Reference between Authors and Publications
based on the TL-2 index.
• Other data is available when it's parsed:
• Herbaria
• Institutions
• Species names
• Bibliographies
• Handwriting Samples
• Postage Stamps
25. Getting Data into Drupal
Image Credits: Database: eponas-deeway (http://eponas-deeway.deviantart.com); Magnifying Glass: Flahorn (http://flahorn.deviantart.com/)
• Create Content Types (Digital Library books & TL-2)
• Create import process
• May be able to use the Feeds module for import
• Must create node references during the import.
• Must accommodate the blocks of unparsed
information in TL-2
• Create a search interface specifically for TL-2
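The requirement to create node references during import amounts to resolving each publication's author key against the set of imported authors. Here is a minimal Python sketch of that step on CSV input. Everything about the file layout is an assumption for illustration: the column names (author_id, book_number, short_title) are hypothetical, and the second book row is invented purely to show an unresolved reference. This is not the Feeds module's behaviour, only the underlying idea.

```python
import csv
import io

# Hypothetical CSV exports; the column names are assumptions.
AUTHORS_CSV = """author_id,name
darwin,"Darwin, Charles Robert"
"""

# The second row is invented to demonstrate a reference that cannot be resolved.
BOOKS_CSV = """book_number,short_title,author_id
1313,On the origin of species,darwin
9999,Invented unmatched row,nobody
"""


def load(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def link_books_to_authors(authors, books):
    """Resolve each book's author_id; return (linked, unresolved)."""
    by_id = {a["author_id"]: a for a in authors}
    linked, unresolved = [], []
    for book in books:
        author = by_id.get(book["author_id"])
        if author is None:
            # Keep the book number so the gap can be reviewed by hand.
            unresolved.append(book["book_number"])
        else:
            linked.append((book["book_number"], author["name"]))
    return linked, unresolved


linked, unresolved = link_books_to_authors(load(AUTHORS_CSV), load(BOOKS_CSV))
print("linked:", linked)
print("unresolved:", unresolved)
```

Reporting unresolved references separately, rather than dropping them, makes it easier to check the import against the TL-2 index.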
26. What else is there to do?
Resolve /node/22365.rdf
and /tl2/author/charles-darwin
Handling "See also" and "Same as" entries in the TL-2
indexes.
Can we search our own data using SPARQL?
• Should we? Does it make sense?
Discuss/Extend vocabulary for our special needs.
Set up linked data within our site
• image collections
• trade literature
• Exhibitions
27. Other Resources
LinkedData.org
http://linkeddata.org/guides-and-tutorials
http://linkeddatabook.com/editions/1.0/
Drupal Groups
http://groups.drupal.org/semantic-web
http://groups.drupal.org/libraries
Tim Berners-Lee, TED talks
Tim Berners-Lee on the next Web (2009)
The year open data went worldwide (2010)
28. BHL is….
• A consortium of 13 natural history and
botanical libraries and research institutions
• An open access digital library for legacy
biodiversity literature.
• An open data repository of taxonomic names
and bibliographic information
31. Benefits of open data
Allows data which was created for a
specific purpose and audience to interact
with other data to serve new, previously
unimagined roles.
32. What information have we
opened up?
Essentially, everything – our metadata
(descriptive, rights, structural), our image files,
scientific names, OCR’d files
33. Technical methods for opening data
• Data exports
• APIs
• OpenURL
• OAI-PMH
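Of the methods listed, OAI-PMH is the one with a fixed request format, so a harvester's first step can be sketched briefly. The Python below builds a ListRecords request URL. The base URL is a placeholder assumption, not a real endpoint; the verb, metadataPrefix and resumptionToken arguments and the oai_dc format come from the OAI-PMH protocol itself. The function name list_records_url is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); substitute the repository's real base URL.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when paging
    through results it is sent with the verb and nothing else.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL))
print(list_records_url(BASE_URL, resumption_token="abc"))
```

A harvester would fetch the first URL, read any resumptionToken from the response, and repeat with the second form until no token is returned.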
34. Who is reusing our data?
• Tropicos
• Rod Page – BioGUID, BioStor
• Encyclopedia of Life
• Ryan Schenk – Visualizing taxonomic
synonyms
35. Who is reusing our data?
Tropicos
36. Tropicos
37. Who is reusing our data?
Tropicos
38. Who is reusing our data?
Rod Page – BioGUID – http://bioguid.info/bhl/
39. Who is reusing our data?
Rod Page – BioStor – http://biostor.org/
40. Who is reusing our data?
Rod Page – BioStor – http://biostor.org/
41. Who is reusing our data?
Encyclopedia of Life – http://eol.org/
42. Who is reusing our data?
Encyclopedia of Life – http://eol.org/
43. Who is reusing our data?
Encyclopedia of Life – http://eol.org/
44. Who is reusing our data?
Ryan Schenk – http://ryanschenk.com/2011/02/visualizing-taxonomic-synoymns/
45. Making open data successful
• Promote it!
46. Do a code challenge
47. Publicly display your data’s copyright/licensing
and API terms of service
48. Thank You!
Building the New Open Linked Library
Keri Thompson, Head of Web Services
Smithsonian Institution Libraries
thompsonk@si.edu, @DigiKeri_SIL
Joel Richard, Lead Developer
Smithsonian Institution Libraries
richardjm@si.edu
Trish Rose-Sandler, Data Analyst
Biodiversity Heritage Library
trisha.rose-sandler@mobot.org
Editor's notes
Possibly omit or move this slide
Mo’ data mo’ better. Mission fulfilment. Sharing=caring. Efficient reuse of data.
Q to Audience: How many people have heard of linked data before today? How many feel they have a basic grasp of what it is? How many people want to watch me trip over my tongue trying to explain it in less than a minute? (If good grasp, note that of the 4 principles of linked data from T B-L, 1 & 2 are easy, 3 is where we’re working now, and 4 we’re still trying to figure out how to do.) (Otherwise, on to the definition.)
LD describes a way of publishing structured data to the web so it can be interlinked with other structured data. Shared data is usually (not always) in RDF (Resource Description Framework), often as RDF in XML (we understand XML!), a standard that allows data from different sources to be connected and queried.
Linking data enables you to enrich your data and give it additional context.
Data is expressed almost like sentences, in ‘triples’: URI = your data, Predicate = verb, Object = object. Example. The object can be a link to another system, or can just be more data, e.g. “1809”.
The predicate is chosen from a set vocabulary (or ontology); if you have to make one up, you publish that new vocabulary on the web so others can get to it. Common vocabularies include:
FOAF (Friend of a Friend): people, personal relationships
DC (Dublin Core): publications, etc.
SKOS (Simple Knowledge Organization System): links systems, concepts
OWL (Web Ontology Language): links ontologies, extension of RDF
How did SIL start thinking about implementing LD? Website rebuild. Goal is to make our data more useful, reusable, and accessible to people and machines, more than just putting our stuff up ‘online’. Started looking at CMS. Wow! D7 is not only a CMS, it’s open source, and it has RDFa baked in, along with common LD ontologies! Sold!
- lots of bibliographic data in ILS, but unfortunately no access to it (for now)
- re-doing online books, good candidates for providing linked biblio data
- existing ‘database’ stuff – inventories, as well as new project digitization/markup of reference book Tax Lit 2 (more from JMR)
Initial focus for us will be on “database” like content we already have, or are currently creating. JMR will discuss one example.
As we move through our website redesign, and rearrange more of our content online, we will gradually go through books, other database stuff, maybe even simple stuff like library locations and hours, and apply our planning principles.
Questions for the audience to get a feel for who you are.
* Computer
* Librarians
* Worked with Databases
* Worked with Drupal
Why Drupal? Why not!
Why 7 and not 6? Well, RDFa is built in. If it’s there, we’re more likely to use it.
RDFx extends RDFa to provide different formats (XML, JSON, NTriples, Turtle via REST). RDFx also provides a UI to set the RDF mappings (Drupal comes with some already set up, but we really want to customize ours).
Evoc is used for caching and also for autocomplete, which we’ll see later.
AUDIENCE QUESTION: How many know the difference between RDF and RDFa?
Since we sort of know what Linked Data is, let’s take a quick look and compare RDFa, which embeds RDF data into the webpage, with RDF in XML.
The identifier is the URI of the page; the predicates are embedded in the page and are displayed in orange; and the object or property value is displayed in the <div> or <span>.
There may be more that needs to be done here.
This RDF is formatted in XML. Note that only the predicates are shown here; there is no extraneous HTML to distract. Typically you need a special tool to use this information. The web browser doesn’t natively understand an RDF XML file.
Field / Node Ref / Views are built in.
SPARQL is an add-on module that allows others to come in and query the data on our site.
SPARQL Views allows us to use external data from other sites, presumably to create new content (we may use this).
RDF Ext Vocab (evoc) is used to cache vocabularies for use in the autocomplete feature when setting up RDF mappings (among other things).
Biblio is a nice module, but it needs a serious update before we can start using it.
Namespaces that Drupal comes with:
Dublin Core
FOAF (Friend of a Friend): links between people and the things they create and do
Open Graph: allows web pages to become objects in the social graph; mainly Facebook
SIOC: Semantically-Interlinked Online Communities
SKOS: knowledge organization; concepts, collections, ideas
OWL: Web Ontology Language
BIBO (Bibliographic Ontology): for books! How convenient! Covers nearly all of what we need for describing books on the web. We may need to extend it for publication year (rather than publication date).
Later we’ll discuss a few cases where we aren’t finding something perfectly appropriate for our needs, or our data is very specialized, so we may extend an existing namespace or create our own. We can do this as long as the namespace is published and documented for others to reuse.
Adding a namespace is a simple matter of giving it a prefix and the URI of the namespace. This page does not show all of the namespaces used by RDFx; there are actually 8 or 9 of them.
Drupal can also import and cache these namespaces using the External Vocabulary Importer, for reuse and also for the autocomplete feature, which is really nice. (Not shown, but it’s also a matter of supplying the prefix and name.)
Although some very basic RDF mappings are set up in Drupal for us, it’s easy to create our own. They can be viewed in multiple places, but on the content type, each field’s RDF mappings can be edited on a single page. Additionally, if we have imported the vocabulary into Drupal, we get the nice benefit of the autocomplete feature to help us choose the appropriate mapping.
TL2 is a database. In book form! Botanists and their books, cross-referenced in the index using unique identifiers across all volumes. It’s really a database!
It is used by botanists, so having this online and searchable could be huge. At the very least, having it online saves them the trouble of going to the physical volumes.
Since no one else has this online in linked data form (in fact it’s barely online as it is), we’re going to become the authority for botanist names. Also, SI has contributed to the supplemental volumes.
Here we have a page of TL2: our good man Charles Darwin and some information about him. At the bottom, we have an obscure (ha!) book that he wrote, which is number 1313 in the TL2 scheme of things.
Our goal is to identify the data elements that we are going to initially make public and how to map them to the vocabularies to make them more useful to others.
This goes hand in hand with the parsing that we’ve hired a contractor to do; they’re pre-parsing some of the information based on our specs.
1313. Nice address. 1313 Mockingbird Lane. Munsters reference. Bad joke.
And here’s what we’re going to start with. (Run through the different elements, starting with the URI, the RDF type, then the predicates and data types.) TODO: More info here
Other data elements may be linked later; there’s certainly stuff available here:
Herbaria
Other bibliographic entries (need to define their relationships)
Handwriting Samples
Postage Stamps (!!)
Mentioned earlier that we might create our own or extend an existing vocabulary. You’ll note here that we are creating the “tl2” namespace because the concepts in TL2 are specific to it, and yet it is so commonly used that a new namespace would be useful to others.
BUT! Something is missing! Where’s that “linked” part of linked data?
TODO: LinkSameAs to BHL, not OCLC
Here’s an example. The identifiers, /darwin and /1313, are linked together with “dc:creator” and in the reverse “dc:contributor” (I think). (Predicates are one-way.)
So these links, which come from the index of TL2, are cross-linked, and our site is nicely browseable and searchable and so on.
But we also link out to other places, VIAF for Darwin’s identifier and WorldCat for Origin of Species, which allows others to go out and do other things with this data. We link out, but how do we get people to link back in to us? That’s one of the questions we aim to answer, but solving it will take some time.
So to recap, this entire dataset is initially going to be represented in exactly two content types, Authors and Books. A node reference between them allows us to browse between them in Drupal, but also helps create the RDF links for LOD.
So how do we get this data into Drupal? We start with an XML file from our contractor. It’s already partially parsed; we’ll do some more parsing and convert that data into CSV, most likely.
Using the Feeds module’s import tool, we’ll bring in the data and (hopefully) create the proper node references between the two content types. We need to keep the blocks of information together (herbaria, handwriting samples, bibliography, postage stamps) until we can parse them out at a later date as needed.
Ultimately we’ll create a custom search just for TL2, even though its data will be included in the general site search on our Drupal site.
What things do we still need to do?
The RDFx (RDF Extensions) module uses one set of identifiers and Drupal uses another, i.e. /node/22365 and /node/22365.rdf for the XML version versus /tl2/author/charles-darwin.
Other useful information in TL2 includes “See also” entries, alternate names, etc. These are useful to researchers. We do plan to incorporate this data in a later phase of development, if only for the human-friendly site search.
We are investigating whether it makes sense to use SPARQL when users are querying our own data. Would this facilitate the search or make things more complicated?
As we mentioned before, we’ll need to design, document and publish any extended or new ontologies (vocabularies) that we create for TL2.
Our website’s been around for 15 years. Now we are laying the foundation for the next 15 years. Hopefully.