What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
Data Science meets Linked Data
1. Data Science meets Linked Data
Alasdair J G Gray
http://www.alasdairjggray.co.uk
@gray_alasdair
A.J.G.Gray@hw.ac.uk
SICSA Data Science Theme Launch
3 July 2014
6. Linked Data Principles
1. Global ID – URI
2. Resolvable ID
3. Useful content
HTML for humans
RDF for machines
4. Link to other resources
Like the Web, but for data!
“RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!”
7. Challenge 1: Matching
Administrative Data Research Centre - Scotland | Alasdair J G Gray| 3 July 2014
John Grant
Fisherman
Fiona Sinclair
Ian Grant
Smithy
Born: 1861
Stuart Adam
Wheelwright
Morag Scott
Flora Adam
Seamstress
Born: 1866
Married: 1884
John Grant
Farmer
Fiona Grant
Iain Grant
Born: 1860
Messy data
Probabilistic matches
Schema matching
8. Gleevec® = Imatinib Mesylate
DrugBank ChemSpider PubChem
Imatinib Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
9. Challenge 2: Reusing mappings
Link: skos:closeMatch
Reason: non-salt form
Link: skos:exactMatch
Reason: drug name
Link: owl:sameAs
10. Challenge: Multiple Identities
Andy Law's Third Law
“The number of unique identifiers
assigned to an individual is never
less than the number of Institutions
involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
P12047
X31045
GB:29384
http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1642
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL1642
11. Challenge Open Data: Licenses
5★ of linked data
Licenses: who can reuse the data
Interoperability of licenses
Non-commercial: academic use, teaching, industry
13. Challenge: Query Performance
Response time
Data freshness
Reliability
Volume of requests
Hosting resources
14. In Data we Trust
How can we trust the data we’ve got back?
How can we ensure that it hasn’t been tampered with on the way?
Trusty URIs
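The tampering question on this slide can be made concrete with a small sketch. The idea illustrated here is that an identifier can carry a cryptographic hash of the content it names, so that whoever retrieves the data can recompute the hash and compare. This is a simplified illustration only: the URI layout, the function names and the example data are assumptions made for this sketch, and the actual Trusty URI approach defines its own encoding and its own handling of RDF content, which is not reproduced here.

```python
import base64
import hashlib


def content_hash(content: bytes) -> str:
    """URL-safe base64 of the SHA-256 digest, without '=' padding."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_hashed_uri(base_uri: str, content: bytes) -> str:
    """Append the content hash to the URI (simplified, hypothetical layout)."""
    return f"{base_uri}.{content_hash(content)}"


def verify(uri: str, content: bytes) -> bool:
    """Check that the retrieved content still matches the hash in the URI."""
    expected = uri.rsplit(".", 1)[-1]
    return expected == content_hash(content)


# Example data and base URI are placeholders, not taken from the talk.
data = b"<http://example.org/a> <http://example.org/p> <http://example.org/b> ."
uri = make_hashed_uri("http://example.org/dataset", data)

print(verify(uri, data))                 # content unchanged
print(verify(uri, data + b" tampered"))  # content modified after publication
```

Because the hash is part of the identifier, anyone who cites the URI also fixes the exact content being cited; any change to the data would require a new URI.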
http://www.intelsat.com/wp-content/uploads/2014/03/Red-padlock.jpg
15. Contact Details
www.alasdairjggray.co.uk
A.J.G.Gray@hw.ac.uk
@gray_alasdair
“There is lots of data we all use every day, and it’s not part of the web. I
can see my bank statements on the web, and my photographs, and I can
see my appointments in a calendar. But can I see my photos in a calendar
to see what I was doing when I took them? Can I see bank statement lines
in a calendar?
No. Why not? Because we don’t have a web of data. Because data is
controlled by applications and each application keeps it to itself.”
Tim Berners-Lee
Editor's Notes
Many of you will have visited this site recently
A lot of sport coverage: how do the BBC cope within their resources?
700+ pages on teams, groups and players
Minimal journalist involvement
Automatic aggregation and links to relevant stories
Article tagged with Frank Lampard, inference used to link team, group, etc.
Coverage of 10,000+ athletes, 200+ countries, 400-500 disciplines and 30 venues
Page for every athlete and country drawing on open data
Internally
DBpedia and GeoNames
Linked Data hugely successful since inception in 2009
About 300 datasets published
Wide range of topics
Familiar with birth, marriage and death records.
Aligning individuals is hard
Also applies to schema matching
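A minimal sketch of what a probabilistic match between two person records might look like, using the names and birth years from the matching slide. The record structure, the 0.7 / 0.3 weights and the two-year tolerance are illustrative assumptions and not values from the talk; real record linkage would use more fields (parents, occupation, place) and calibrated weights.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class PersonRecord:
    name: str
    born: int


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(a: PersonRecord, b: PersonRecord, max_year_gap: int = 2) -> float:
    """Combine name similarity with closeness of birth year.

    The weights and the year tolerance are assumptions for illustration.
    """
    gap = abs(a.born - b.born)
    year_score = max(0.0, 1.0 - gap / (max_year_gap + 1))
    return 0.7 * name_similarity(a.name, b.name) + 0.3 * year_score


ian = PersonRecord("Ian Grant", 1861)
iain = PersonRecord("Iain Grant", 1860)
flora = PersonRecord("Flora Adam", 1866)

print(f"Ian vs Iain:  {match_score(ian, iain):.2f}")
print(f"Ian vs Flora: {match_score(ian, flora):.2f}")
```

The score is a degree of belief rather than a yes/no answer; whether "Ian Grant, born 1861" and "Iain Grant, born 1860" are the same person still depends on a threshold someone has to choose.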
Data’s been aligned, now what?
Example drug: Gleevec, a cancer drug for leukemia
Lookup in three popular public chemical databases
Different results
Data is messy!
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets which have VoID header providing provenance and justification for the link.
Links need provenance to enable reuse – James’s talk
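The link structure described in these notes (source, target, predicate, reason, grouped into a linkset with provenance) could be modelled roughly as follows. This is a sketch under stated assumptions: the class names, the `creator` field and the `dbA:`/`dbB:`/`dbC:` identifiers are hypothetical placeholders, and only the predicates and reasons come from the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    source: str     # identifier in the source dataset
    target: str     # identifier in the target dataset
    predicate: str  # e.g. skos:exactMatch, skos:closeMatch, owl:sameAs
    reason: str     # why the link was asserted


@dataclass
class Linkset:
    """A group of links plus header-style provenance metadata."""
    creator: str
    links: list[Link] = field(default_factory=list)

    def select(self, accepted_predicates: set[str]) -> list[Link]:
        """Return only the links whose predicate the caller accepts."""
        return [link for link in self.links if link.predicate in accepted_predicates]


# Identifiers below are hypothetical placeholders, not real database URIs.
linkset = Linkset(
    creator="example-curator",
    links=[
        Link("dbA:gleevec", "dbB:imatinib-mesylate", "skos:exactMatch", "drug name"),
        Link("dbB:imatinib-mesylate", "dbC:imatinib", "skos:closeMatch", "non-salt form"),
    ],
)

strict = linkset.select({"skos:exactMatch", "owl:sameAs"})
relaxed = linkset.select({"skos:exactMatch", "skos:closeMatch", "owl:sameAs"})
print(len(strict), len(relaxed))
```

Keeping the predicate and reason on every link is what allows a mapping to be reused: a consumer who needs strict equivalence can ignore the close matches, while another can accept them.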
Each captures a subtly different view of the world
Are they the same? … depends on your point of view
Different URIs for different representations (content negotiation)
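Content negotiation can be sketched as a function that maps one identifier plus the client's Accept header to a format-specific document URI. The `example.org` URI layout and the function names are assumptions for illustration; the parsing is deliberately simplified (it ignores wildcards such as `*/*` and falls back to HTML).

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an HTTP Accept header into (media type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so types with equal q keep the client's order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


# Hypothetical URI layout: one identifier, one document URI per format.
REPRESENTATIONS = {
    "text/html": "http://example.org/page/{id}",
    "text/turtle": "http://example.org/data/{id}.ttl",
}


def negotiate(resource_id: str, accept: str, default: str = "text/html") -> str:
    """Pick the document URI to redirect to for a given Accept header."""
    for media_type, q in parse_accept(accept):
        if q > 0 and media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type].format(id=resource_id)
    return REPRESENTATIONS[default].format(id=resource_id)


print(negotiate("CHEMBL1642", "text/turtle"))
print(negotiate("CHEMBL1642", "text/html,application/xhtml+xml;q=0.9"))
```

A browser asking for HTML and a script asking for Turtle therefore start from the same identifier but end up at different document URIs, which is one source of the multiple identities problem.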
Not all data should be open
Consider your interaction with the health service: it’s unique to you
Need statistical aggregation to anonymise data
As much about educating the public – Public relations