1. Linked Open Communism:
Better discovery through data dis- and re-aggregation
--- or ---
How I learned to shut up about linked data
AND BUILD SOMETHING!!
Presented at code4lib2013
by Corey A Harper
2013-02-13
2. Linked Data
• Metadata as a Graph
• Typed “things”, named by URIs
• The relationships between those things, also built on URIs
• Ease of integration *across* data sources – “merging graphs”
2013-02-13 ☭ code4lib2013 ☭ 2
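A minimal sketch of the "merging graphs" idea from the Linked Data slide, assuming triples are plain Python tuples and that every URI below is made up for illustration. It is not tied to any particular RDF library.

```python
# Triples are (subject, predicate, object); URIs are plain strings here.
# Every URI below is hypothetical and exists only for this illustration.
EX = "http://example.org/"

finding_aid = {
    (EX + "person/1", EX + "type", EX + "Person"),
    (EX + "person/1", EX + "label", "Doe, Jane"),
    (EX + "collection/42", EX + "subject", EX + "person/1"),
}

external = {
    (EX + "person/1", EX + "birthPlace", EX + "place/7"),
    (EX + "place/7", EX + "label", "Example City"),
}


def merge(*graphs):
    """Merging graphs is a set union: shared URIs line up on their own."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Return every (predicate, object) pair known for one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


merged = merge(finding_aid, external)
```

Because both sources name the person with the same URI, `describe(merged, EX + "person/1")` returns statements from both graphs with no join logic.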
8. Backing Away from Evangelism...
Image NOT used by permission.
Probably a violation of several copyrights & trademarks.
9. Aside on metaphors
Image by Jonestown Institute via Wikimedia Commons
http://en.wikipedia.org/wiki/File:Jonestown_entrance.jpg
10. Aside on metaphors
Image by Joe Mabel via Wikimedia Commons.
http://en.wikipedia.org/wiki/File:Furthur_05.jpg
15. The Death of Browse
• Next-Gen Discovery Systems don't make use of Authority Control
• “Browse” was/is broken as a UI Design
• Rich data in Authorities, disconnected from narrative, context, search
• Richer “Authority” type data outside libraries...
16. Linked Data Based UI Design
For Boutique Collections
17. A research leave
Public Domain image of Paulette Goddard via Wikimedia Commons.
http://en.wikipedia.org/wiki/File:Paulette_Goddard-publicity.JPG
18. Initial Scope
Public Domain image via Wikimedia Commons.
http://en.wikipedia.org/wiki/File:Symbol-hammer-and-sickle.svg
19. Linked Open Communism
• Dis-aggregate EAD records into Collections & Components
• Create a broad set of resource “types”
• Extract key “entities” from EAD
    ◦ People, Places, Topics, Corporate Bodies
    ◦ Incorporate additional data about entities
• Put this in Blacklight
• Load MARC & other data
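A rough sketch of the dis-aggregation step, assuming un-namespaced EAD 2002 markup and a made-up sample finding aid. The record shape, the `disaggregate` helper and the entity type names are illustrative, not what ead2rdf2solr actually does.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed finding aid (real EAD is often namespaced).
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did><unittitle>Example Papers</unittitle></did>
    <controlaccess>
      <persname>Doe, Jane</persname>
      <corpname>Example Party</corpname>
      <geogname>New York (N.Y.)</geogname>
    </controlaccess>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters from Jane Doe</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""

# Assumed mapping from EAD access-point tags to entity types.
ENTITY_TAGS = {"persname": "Person", "geogname": "Place",
               "subject": "Topic", "corpname": "CorporateBody"}
COMPONENT_TAG = re.compile(r"^c(\d\d)?$")  # <c>, <c01> ... <c12>


def title_of(elem):
    node = elem.find("did/unittitle")
    return "".join(node.itertext()).strip() if node is not None else ""


def disaggregate(ead_xml):
    """Split one EAD file into a collection record, component records, entities."""
    root = ET.fromstring(ead_xml)
    archdesc = root.find("archdesc")
    records = [{"id": "collection", "type": "Collection",
                "title": title_of(archdesc), "parent": None}]

    def walk(elem, parent_id):
        for child in elem:
            if COMPONENT_TAG.match(child.tag):
                rec_id = f"component-{len(records)}"
                records.append({"id": rec_id,
                                "type": child.get("level", "component"),
                                "title": title_of(child),
                                "parent": parent_id})
                walk(child, rec_id)  # nested components keep their parent link

    dsc = archdesc.find("dsc")
    if dsc is not None:
        walk(dsc, "collection")

    entities = sorted({(ENTITY_TAGS[e.tag], "".join(e.itertext()).strip())
                       for e in root.iter() if e.tag in ENTITY_TAGS})
    return records, entities
```

Each returned record could then become one Solr document, with `parent` preserving the hierarchy that the flat index would otherwise lose.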
23. Technology Stack - UI
• Vanilla Blacklight
    ◦ Minor SOLR Index Tweaks / Additions
    ◦ Minor View Hacks
• “pre-beta”
    ◦ Only on localhost right now
30. Performance Benchmarks
• EAD -> SOLR:
    ◦ ~26 hrs to parse 1600 EAD, push 385k “records” to SOLR
• DBPedia matching
    ◦ X-ref label variants for entities against 9.4 million DBPedia labels (labels-en.ttl).
    ◦ Should be using Hadoop
    ◦ Other ideas?
• Re-solr-izing entities: ~10 minutes
    ◦ Pulls local copy of dbpedia data from 4store
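One possible answer to "Other ideas?", sketched under assumptions: index the label file once by a normalized key, so each variant lookup is a dictionary hit rather than a scan. It assumes one triple per line in the labels file, uses made-up sample lines, and has not been tried against the real 9.4 million labels, so memory use is an open question. Sorting the words is a blunt normalization and will produce some false matches.

```python
import re

# Assumes one "<uri> <rdfs:label> "text"@en ." triple per line.
LABEL_LINE = re.compile(
    r'^<([^>]+)>\s+<http://www\.w3\.org/2000/01/rdf-schema#label>\s+"(.*)"@en\s*\.\s*$'
)

# Hypothetical sample lines standing in for labels-en.ttl.
SAMPLE_LINES = [
    '<http://example.org/resource/Jane_Doe> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Jane Doe"@en .',
    '<http://example.org/resource/Example_Party> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Example Party"@en .',
    '# a line that is not a label triple is skipped',
]


def normalize(label):
    """Lowercase, drop punctuation, sort words: 'Doe, Jane' -> 'doe jane'."""
    return " ".join(sorted(re.findall(r"\w+", label.lower())))


def build_index(lines):
    """Map each normalized label to the URIs that carry it (one pass over the file)."""
    index = {}
    for line in lines:
        match = LABEL_LINE.match(line)
        if match:
            uri, label = match.groups()
            index.setdefault(normalize(label), []).append(uri)
    return index


def match_entity(variants, index):
    """Look up every label variant of one entity; return distinct candidate URIs."""
    hits = []
    for variant in variants:
        for uri in index.get(normalize(variant), []):
            if uri not in hits:
                hits.append(uri)
    return hits


index = build_index(SAMPLE_LINES)
candidates = match_entity(["Doe, Jane", "Jane Doe"], index)
```

With the real file, `build_index(open("labels-en.ttl", encoding="utf-8"))` would be the equivalent call, with the path being an assumption.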
31. 4Store
• Provenance-ish
    ◦ Naming of sub-graphs
    ◦ Default context is everything
• First EAD cut produced ~4m triples
• Easy to delete whole graphs, or individual triples
• SPARQL-able – good for stats:
    ◦ 992 DBPedia links for 6331 “Entities”
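A toy model of the named sub-graph behaviour described on the 4Store slide: triples live in named graphs, the default view is the union of all of them, and a whole graph can be dropped in one step. This is an in-memory stand-in, not 4store's API. The graph names and the choice of `owl:sameAs` as the link predicate are assumptions.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # assumed link predicate
EX = "http://example.org/"  # hypothetical namespace


class TinyQuadStore:
    """In-memory illustration of named graphs; not a real triple store."""

    def __init__(self):
        self.graphs = defaultdict(set)  # graph name -> set of triples

    def add(self, graph, s, p, o):
        self.graphs[graph].add((s, p, o))

    def delete_graph(self, graph):
        """Drop everything loaded from one source in a single step."""
        self.graphs.pop(graph, None)

    def delete_triple(self, graph, s, p, o):
        self.graphs[graph].discard((s, p, o))

    def triples(self, graph=None):
        """With no graph named, the 'default context' is the union of all graphs."""
        if graph is not None:
            return set(self.graphs.get(graph, set()))
        merged = set()
        for triples in self.graphs.values():
            merged |= triples
        return merged


def link_stats(store, entity_class):
    """Return (entities with at least one link, total entities)."""
    everything = store.triples()
    entities = {s for s, p, o in everything if p == RDF_TYPE and o == entity_class}
    linked = {s for s, p, o in everything if p == OWL_SAME_AS and s in entities}
    return len(linked), len(entities)


store = TinyQuadStore()
store.add(EX + "graph/ead-1", EX + "e/1", RDF_TYPE, EX + "Entity")
store.add(EX + "graph/ead-1", EX + "e/2", RDF_TYPE, EX + "Entity")
store.add(EX + "graph/dbpedia-links", EX + "e/1", OWL_SAME_AS, EX + "dbpedia/Thing_1")
```

Keeping the DBPedia links in their own graph means a bad matching run can be removed with one `delete_graph` call and redone without touching the EAD-derived triples.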
32. Image by wallygrom via flickr
http://www.flickr.com/photos/33037982@N04/3669790240/
https://github.com/chrpr/ead2rdf2solr
33. Future Steps: Code to Incorporate
• Components: Inheritance of access points
    ◦ fuzzywuzzy string match to unittitle matched about 10%
    ◦ Extend to cross-EAD match via 4Store
• VIAF, id.loc, FAST reconciliation
• Override configs for DBPedia matching
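A sketch of the access-point inheritance idea: give a component a collection-level access point only when that access point seems to be mentioned in the component's unittitle. The standard library's `difflib` is swapped in here for fuzzywuzzy so the example has no dependencies. The scoring rule, the threshold of 80 and the sample strings are all assumptions.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    return re.findall(r"\w+", text.lower())


def score(access_point, unittitle):
    """Percent of access-point words with a close match among the title words."""
    ap_words, title_words = tokens(access_point), tokens(unittitle)
    if not ap_words or not title_words:
        return 0
    matched = sum(
        1 for a in ap_words
        if max(SequenceMatcher(None, a, t).ratio() for t in title_words) >= 0.8
    )
    return round(100 * matched / len(ap_words))


def inherit_access_points(collection_access_points, component_titles, threshold=80):
    """Map each component title to the access points it appears to mention."""
    return {
        title: [ap for ap in collection_access_points if score(ap, title) >= threshold]
        for title in component_titles
    }


# Hypothetical collection-level access points and component unittitles.
access_points = ["Doe, Jane", "Example Party"]
titles = ["Letters from Jane Doe", "Example Party pamphlets", "Photographs"]
inherited = inherit_access_points(access_points, titles)
```

Comparing word by word keeps inverted headings such as "Doe, Jane" matchable against natural-order titles, while a component like "Photographs" inherits nothing.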