How to integrate Linked Data into your application
1. SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE |
SAN FRANCISCO, JUNE 5, 2012
HOW TO
INTEGRATE LINKED DATA
INTO YOUR APPLICATION
LDIF Team:
Andreas Schultz, Freie Universität Berlin
Andrea Matteini, mes|semantics
Robert Isele, Freie Universität Berlin
Pablo N. Mendes, Freie Universität Berlin
Christian Becker, mes|semantics
Christian Bizer, Freie Universität Berlin
With contributions by:
Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.
2. |
WHAT IS LINKED DATA?
• Raw data (RDF)
• Accessible on the web
• Data can link to other data sources
[Figure: data sources A to E, each describing several things; data links connect the things across neighbouring sources]
• Benefits: Ease of access and re-use; enables discovery
• One API for all data sources?
3. |
LINKING OPEN DATA CLOUD
[Figure: Linking Open Data cloud diagram. Each node is a published data set (for example DBpedia, Freebase, DBLP, UniProt, PubMed, Eurostat, data.gov.uk); edges are links between data sets. Legend categories: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences]
http://lod-cloud.net As of September 2011
4. |
TYPES OF LINKED DATA
• Open, Public Data (LOD Cloud)
• Commercial Linked Data
• Linked Enterprise Data
VERY SOON?
... AND WHAT YOU CAN DO WITH THEM
• Provide interfaces on top of them
• Augment your website
• Integrate them into your application logic
• Create specialized data marts
5. |
AUGMENT YOUR WEBSITE: BBC
BBC online properties make intensive use of
data from Wikipedia and MusicBrainz
6. |
DATA MARTS: NEUROWIKI
• NeuroWiki creates views of gene, drug and disease data from four RDF data sources
• Provides navigation and composition tools for accessing and mining the data
7. |
APPLICATION LOGIC: IBM WATSON
http://www.flickr.com/photos/ibm_media/
• IBM Watson makes use of Linked Data sources such as DBpedia
9. |
STEP #1:
ACCESS LINKED DATA
• Linked Data is published via HTTP, SPARQL endpoints, RDF dumps
Architecture | Access methods | Recency | Speed / Scalability | Reliability | Complexity
On-the-fly dereferencing | HTTP dereferencing | High | Low | Low | High
Query federation | SPARQL | High | Low | Decreases exponentially as new sources are added | Moderate with SPARQL 1.1 SERVICE clause
Crawling and caching | HTTP dereferencing, SPARQL, dump import | Depends | High | High | High
Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)
• Live access allows quick prototyping and limited production use
• As data sets grow in size and more data sources are added, a
crawling/caching architecture often becomes necessary
10. |
STEP #1:
ACCESS LINKED DATA
Implementations:
• On-the-fly dereferencing
  • LDspider, SQUIN, Semantic Web Client library
• Query federation
  • SPARQL 1.1 SERVICE clause
• Crawling and caching
  • Triplestore import script
  • Public caches (e.g. Sindice, OpenLink LOD endpoint)
  • LDIF
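The simplest of these access methods, on-the-fly dereferencing, comes down to an HTTP GET with content negotiation. The sketch below only builds such a request with the Python standard library and does not send it. The helper name, the media types in the Accept header and the example URI are illustrative assumptions, not part of any tool listed above.

```python
from urllib.request import Request

# Assumption: the server honours content negotiation and can return
# Turtle or RDF/XML instead of an HTML page.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"

def build_dereference_request(uri: str) -> Request:
    """Build (but do not send) a GET request asking for an RDF description."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})

req = build_dereference_request("http://dbpedia.org/resource/Berlin")
# Sending it would look like:
#   urllib.request.urlopen(req, timeout=10).read()
```

The returned bytes would then go to an RDF parser. Crawling and caching repeats this step for every discovered URI and stores the results locally.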
11. |
STEP #2:
NORMALIZE VOCABULARIES
Data sources that overlap in content use a wide range of vocabularies.
• Over 60 % of all LOD sources use proprietary vocabularies
• It’s up to the data consumer to normalize the vocabularies
• Enterprise: Need to translate between internal and external vocabularies
[Figure: tag cloud of vocabulary prefixes, among them foaf, dc, skos, geo, sioc, vcard, mo, bibo, akt, void, dbpedia, geonames, frbr and others]
Most widely used vocabularies in the LOD cloud (08/10/2011)
Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
12. |
STEP #2:
NORMALIZE VOCABULARIES
Approaches to Schema Mapping:
• Hand-crafting queries against individual sources – no different than an API
OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
OPTIONAL { ?ow dbp-prop:location ?loc. ?loc rdf:type umbel-sc:City ; ot:preferredLabel ?city_db_loc }
OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
Source: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
• Ontology Representation Languages: OWL, RDFS
• Rules: SWRL, RIF
• Query Languages
• SPARQL CONSTRUCT clause
• TopQuadrant SPARQLMotion
• Mosto
• R2R (part of LDIF)
13. |
STEP #2:
NORMALIZE VOCABULARIES
Using SPARQL:
• Rename a class
CONSTRUCT {
?s a mo:MusicArtist
} WHERE {
?s a dbpedia-owl:MusicalArtist
}
• Value transformation (dbpedia-owl:runtime is given in seconds, so divide by 60 to get minutes)
CONSTRUCT {
?s movie:runtime ?runtimeInMinutes .
} WHERE {
?s dbpedia-owl:runtime ?runtime .
BIND(?runtime / 60 AS ?runtimeInMinutes)
}
• Create URI from literal
CONSTRUCT {
?s diseasome:omim ?omimuri .
?omimuri dc:identifier ?identifier .
} WHERE {
?s dbpedia-owl:omim ?omim .
BIND(IRI(CONCAT("http://bio2rdf.org/omim:", ?omim)) AS ?omimuri)
BIND(CONCAT("omim:", ?omim) AS ?identifier)
}
Slide credits: Andreas Schultz
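For readers without a SPARQL engine at hand, the same three mapping patterns can be sketched on in-memory triples. This is a minimal illustration, not how R2R or any SPARQL processor works internally. Triples are plain Python tuples with prefixed names as strings, the `ex:` subjects and sample values are made up, and the runtime conversion assumes that dbpedia-owl:runtime is given in seconds.

```python
RDF_TYPE = "rdf:type"

def translate(triples):
    """Yield triples rewritten into the target vocabulary."""
    for s, p, o in triples:
        if p == RDF_TYPE and o == "dbpedia-owl:MusicalArtist":
            # Rename a class
            yield (s, RDF_TYPE, "mo:MusicArtist")
        elif p == "dbpedia-owl:runtime":
            # Value transformation (assumed: seconds to minutes)
            yield (s, "movie:runtime", o / 60)
        elif p == "dbpedia-owl:omim":
            # Create a URI from a literal, plus an identifier string
            uri = f"http://bio2rdf.org/omim:{o}"
            yield (s, "diseasome:omim", uri)
            yield (uri, "dc:identifier", f"omim:{o}")

# Hypothetical source data
source = [
    ("ex:artist1", RDF_TYPE, "dbpedia-owl:MusicalArtist"),
    ("ex:film1", "dbpedia-owl:runtime", 7200.0),
    ("ex:disease1", "dbpedia-owl:omim", "104300"),
]
target = list(translate(source))
```

Triples that match no rule are dropped, which mirrors what a CONSTRUCT query does: only what the template produces ends up in the output.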
14. |
STEP #3:
RESOLVE IDENTIFIERS
Data sources that overlap in content use different identifiers for the
same real-world entity.
• Most LOD sources only provide owl:sameAs links to one other data source
• It’s up to the data consumer to generate additional links
• Enterprise: Need to link both internal and external resources
[Chart: Number of linked data sets per source (08/10/2011)]
Linked data sets | Sources
1 | 98
2 | 62
3 | 38
4 | 19
5 | 5
6 - 10 | 17
> 10 | 27
Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
15. |
STEP #3:
RESOLVE IDENTIFIERS
Approaches to Identity Resolution:
• Improvised or manual merging
• Rule-based approaches:
• SILK (part of LDIF)
• LIMES
[Figure: one source lists "Union Sq." entries for New York, Seattle and San Francisco, the last at 37°47′N 122°24′W; another source has "Union Square" at 37°47′N 122°24′W. Only the San Francisco entry is matched: Union Square = Union Sq., San Francisco]
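The Union Square case shows why a rule usually combines several comparators: the label alone matches all three candidates, and only the coordinates single out San Francisco. The sketch below is a hand-written rule in plain Python, not Silk's or LIMES's link specification language. The thresholds, the `ex:` identifiers and the New York and Seattle coordinates are illustrative assumptions.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def is_match(a, b, min_label_sim=0.7, max_km=1.0):
    """Both conditions must hold: similar label AND nearby location."""
    sim = SequenceMatcher(None, a["label"].lower(), b["label"].lower()).ratio()
    dist = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    return sim >= min_label_sim and dist <= max_km

# 37°47′N 122°24′W expressed in decimal degrees
target = {"id": "ex:UnionSquare", "label": "Union Square",
          "lat": 37.7833, "lon": -122.4}
candidates = [
    {"id": "ex:union-sq-nyc", "label": "Union Sq.", "lat": 40.7359, "lon": -73.9911},
    {"id": "ex:union-sq-seattle", "label": "Union Sq.", "lat": 47.6101, "lon": -122.3321},
    {"id": "ex:union-sq-sf", "label": "Union Sq.", "lat": 37.7833, "lon": -122.4},
]
links = [(target["id"], "owl:sameAs", c["id"])
         for c in candidates if is_match(target, c)]
```

Comparing every pair is quadratic in the number of entities. Dedicated link discovery tools exist largely to avoid that cost on large data sets.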
16. |
STEP #4:
FILTER DATA
Data sources that overlap in content provide data that is conflicting and of
varying quality.
• Data sources have...
• ... different knowledge levels, views or intents
• ... wrong, biased, inconsistent or outdated information
• Approaches:
• Import data into distinct Named Graphs; query them separately
using the SPARQL GRAPH clause
• Sieve (part of LDIF)
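The Named Graphs approach amounts to keeping a fourth element, the graph, with every statement so that conflicting values stay attributable to their source. A minimal sketch with quads as Python tuples follows. The graph names, the `ex:` terms and the population figures are illustrative assumptions, and the function plays the role that a SPARQL GRAPH pattern would play in a triple store.

```python
from collections import defaultdict

# Each statement is a quad: (graph, subject, predicate, object).
quads = [
    ("graph:sourceA", "ex:SanFrancisco", "ex:population", 700_000),
    ("graph:sourceB", "ex:SanFrancisco", "ex:population", 800_000),
    ("graph:sourceB", "ex:SanFrancisco", "rdfs:label", "San Francisco"),
]

def values_by_graph(quads, subject, predicate):
    """Group the values of one property by the graph they came from."""
    out = defaultdict(list)
    for g, s, p, o in quads:
        if s == subject and p == predicate:
            out[g].append(o)
    return dict(out)

populations = values_by_graph(quads, "ex:SanFrancisco", "ex:population")
```

With the values separated by graph, the application can decide per query which source to trust.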
17. |
LDIF – LINKED DATA INTEGRATION FRAMEWORK
Integrates Linked Data from multiple sources into a clean, local target
representation while keeping track of data provenance
1 Collect data: Managed download and update
2 Translate data into a single target vocabulary
3 Resolve identifier aliases into local target URIs
4 Cleanse data, resolving conflicting values (NEW)
5 Output
• Follows the Crawling and Caching Architecture Pattern
• Open source (Apache License, Version 2.0)
• Collaboration between Freie Universität Berlin and mes|semantics
18. |
LDIF PIPELINE
1 Collect data
2 Translate data
3 Resolve identities
4 Cleanse data
5 Output
Step 1 of 5: Collect data
Supported data sources:
• RDF dumps (all common formats)
• SPARQL Endpoints
• Crawling Linked Data via HTTP
19. |
LDIF PIPELINE
Step 2 of 5: Translate data
Sources use a wide range of different RDF vocabularies
[Figure: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City]
• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
• Complex mappings with SPARQL expressivity
• Built-in transformation function library (XPath)
20. |
LDIF PIPELINE
Step 3 of 5: Resolve identities
Sources use different identifiers for the same entity
[Figure: Silk compares "Union Sq." entries for New York, Seattle and San Francisco (37°47′N 122°24′W) with "Union Square" (37°47′N 122°24′W) and links Union Square = Union Sq., San Francisco]
• Automated link creation based on Link Specifications
• Supports various comparators and transformations (string similarity, basic arithmetic, time, geographical distance)
21. |
LDIF PIPELINE
Step 4 of 5: Cleanse data
Sources provide different values for the same property
[Figure: two sources report "San Francisco population is 0.7M" and "San Francisco population is 0.8M", each with a star rating; Sieve outputs "San Francisco population is 0.8M"]
1. Quality Assessment: assign quality scores to Named Graphs (by time, by source preference, thresholds)
2. Data Fusion: resolve conflicting property values (according to quality scores, frequency, averages)
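The two stages can be illustrated with a recency-based score and a "keep the best-scored value" fusion rule. This is a simplified sketch of the general idea, not Sieve's actual scoring functions or configuration format. The function names, the two-year window, the graph names and the update dates are all assumptions made for the example.

```python
from datetime import date

def recency_score(last_update, today, max_age_days=730):
    """Quality assessment: 1.0 for fresh data, falling linearly to 0.0."""
    age = (today - last_update).days
    return max(0.0, 1.0 - age / max_age_days)

def fuse_keep_best(value_by_graph, scores):
    """Data fusion: keep the value from the graph with the highest score."""
    best_graph = max(value_by_graph, key=lambda g: scores[g])
    return value_by_graph[best_graph]

today = date(2012, 6, 5)
last_updated = {                      # hypothetical provenance metadata
    "graph:sourceA": date(2010, 6, 5),
    "graph:sourceB": date(2012, 1, 5),
}
population = {"graph:sourceA": 700_000, "graph:sourceB": 800_000}

scores = {g: recency_score(d, today) for g, d in last_updated.items()}
fused = fuse_keep_best(population, scores)
```

Other fusion rules listed on the slide, such as the most frequent value or an average, would replace only `fuse_keep_best`. The scoring stage stays the same.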
22. |
LDIF PIPELINE
Step 5 of 5: Output
Output options:
• N-Quads
• N-Triples
• SPARQL Update Stream
• Provenance tracking using Named Graphs
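N-Quads is what makes provenance tracking possible in a flat file: each line carries subject, predicate, object and the named graph the statement belongs to. The sketch below formats a single line by hand and handles only IRIs and plain string literals (no datatypes, language tags or blank nodes). The function name and the example.org IRIs are illustrative assumptions.

```python
def nquad(subject, predicate, obj, graph, obj_is_literal=False):
    """Format one statement as an N-Quads line (IRIs and plain literals only)."""
    if obj_is_literal:
        # Escape backslash first, then quotes and newlines
        escaped = (obj.replace("\\", "\\\\")
                      .replace('"', '\\"')
                      .replace("\n", "\\n"))
        o = f'"{escaped}"'
    else:
        o = f"<{obj}>"
    return f"<{subject}> <{predicate}> {o} <{graph}> ."

line = nquad(
    "http://example.org/SanFrancisco",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "San Francisco",
    "http://example.org/graph/sourceB",
    obj_is_literal=True,
)
```

Dropping the final graph IRI from each line gives N-Triples, at the cost of losing the provenance information.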
24. |
VERSIONS
• In-memory
  • fast, but scalability is limited by local RAM
• RDF Store (TDB)
  • stores intermediate results in a Jena TDB RDF store
  • can process more data than the in-memory version, but does not scale beyond a single machine
• Cluster (Hadoop)
  • scales by parallelizing work across multiple machines using Hadoop
  • can process a virtually unlimited amount of data
  • ready for Amazon Elastic MapReduce
27. |
THANKS!
• Early adopters wanted!
• Website: http://bit.ly/ldifweb
• Google Group: http://bit.ly/ldifgroup
• http://mes-semantics.com
• Supported in part by
  • Vulcan Inc. as part of its Project Halo
  • EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943)
• Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz