
Data Science meets Linked Data

Lecturer in Computer Science at Heriot-Watt University
2 Jul 2014


  1. Data Science meets Linked Data Alasdair J G Gray http://www.alasdairjggray.co.uk @gray_alasdair A.J.G.Gray@hw.ac.uk SICSA Data Science Theme Launch 3 July 2014
  2. BBC World Cup
  3. BBC Linked Data Platform
  4. Olympics 2012
  5. Linking Data
  6. Linked Data Principles: (1) Global ID: URI; (2) Resolvable ID; (3) Useful content: HTML for humans, RDF for machines; (4) Link to other resources. Like the Web, but for data! “RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!”
  7. Challenge 1: Matching (slide footer: Administrative Data Research Centre - Scotland). Records shown: John Grant (Fisherman); Fiona Sinclair; Ian Grant (Smithy, born 1861); Stuart Adam (Wheelwright); Morag Scott; Flora Adam (Seamstress, born 1866); married 1884; John Grant (Farmer); Fiona Grant; Iain Grant (born 1860). Issues: messy data, probabilistic matches, schema matching.
  8. Gleevec® = Imatinib Mesylate. Databases: Drugbank, ChemSpider, PubChem. Lookup results shown: Imatinib Mesylate; Imatinib Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N.
  9. Challenge 2: Reusing mappings. Link: skos:closeMatch, reason: non-salt form. Link: skos:exactMatch, reason: drug name. Link: owl:sameAs.
  10. Challenge: Multiple Identities. Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” (http://bioinformatics.roslin.ac.uk/lawslaws/). Example identifiers: P12047; X31045; GB:29384; http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1642; https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL1642
  11. Challenge: Open Data Licenses. 5★ of linked data. Licenses: who can reuse the data; interoperability of licenses; non-commercial: academic use, teaching, industry.
  12. Challenges: Privacy
  13. Challenge: Query Performance. Response time; data freshness; reliability; volume of requests; hosting resources. (Diagram label: Queries)
  14. In Data we Trust. How can we trust the data we’ve got back? How can we ensure that it hasn’t been tampered with on the way? Trusty URIs. Image: http://www.intelsat.com/wp-content/uploads/2014/03/Red-padlock.jpg
  15. Contact Details: www.alasdairjggray.co.uk, A.J.G.Gray@hw.ac.uk, @gray_alasdair. “There is lots of data we all use every day, and it’s not part of the web. I can see my bank statements on the web, and my photographs, and I can see my appointments in a calendar. But can I see my photos in a calendar to see what I was doing when I took them? Can I see bank statement lines in a calendar? No. Why not? Because we don’t have a web of data. Because data is controlled by applications and each application keeps it to itself.” (Tim Berners-Lee)
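The matching challenge on slide 7 (messy records, probabilistic matches) can be illustrated with a small sketch. This is not the method used in the talk: the scoring function, the 0.7/0.3 weights, and the `max_year_gap` parameter are assumptions chosen for illustration, and only the names, occupations, and birth years come from the slide.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PersonRecord:
    name: str
    occupation: Optional[str]
    birth_year: Optional[int]


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(a: PersonRecord, b: PersonRecord, max_year_gap: int = 2) -> float:
    """Combine name similarity and birth-year agreement into a score in [0, 1].

    The weights (0.7 name, 0.3 year) are illustrative assumptions.
    """
    name_part = name_similarity(a.name, b.name)
    if a.birth_year is None or b.birth_year is None:
        year_part = 0.5  # unknown year neither supports nor contradicts a match
    else:
        gap = abs(a.birth_year - b.birth_year)
        year_part = max(0.0, 1.0 - gap / (max_year_gap + 1))
    return 0.7 * name_part + 0.3 * year_part


def rank_candidates(
    target: PersonRecord, candidates: List[PersonRecord]
) -> List[Tuple[float, PersonRecord]]:
    """Return (score, candidate) pairs, best match first."""
    scored = [(match_score(target, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Records taken from slide 7.
ian = PersonRecord("Ian Grant", "Smithy", 1861)
iain = PersonRecord("Iain Grant", None, 1860)
flora = PersonRecord("Flora Adam", "Seamstress", 1866)

for score, candidate in rank_candidates(ian, [flora, iain]):
    print(f"{candidate.name}: {score:.2f}")
```

A score like this only ranks candidates; deciding where to place the acceptance threshold is the hard part of the problem, and it depends on the cost of a false match in the application.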

Editor's notes

  1. Many of you will have visited this site recently. There is a lot of sport coverage: how do the BBC cope within their resources?
  2. 700+ pages on teams, groups and players. Minimal journalist involvement. Automatic aggregation and links to relevant stories. When an article is tagged with Frank Lampard, inference is used to link it to his team, group, etc.
  3. Coverage of 10,000+ athletes, 200+ countries, 400-500 disciplines and 30 venues. A page for every athlete and country, drawing on open data.
  4. Internally, DBpedia and GeoNames.
  5. Linked Data has been hugely successful since its inception in 2009. About 300 datasets published, covering a wide range of topics.
  6. We are all familiar with birth, marriage and death records. Aligning individuals is hard. The same applies to schema matching.
  7. The data has been aligned, now what? Example drug: Gleevec, a cancer drug for leukemia. Look it up in three popular public chemical databases and you get different results. Data is messy!
  8. sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, which have a VoID header providing provenance and justification for the link. Links need provenance to enable reuse (see James’s talk).
  9. Each captures a subtly different view of the world. Are they the same? … It depends on your point of view. Different URIs for different representations (content negotiation).
  10. Not all data should be open. Consider your interaction with the health service: it’s unique to you. Statistical aggregation is needed to anonymise data. This is as much about educating the public (public relations).
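Note 8 describes each link as a source, target, predicate and reason, with equivalence depending on the consumer's point of view. A minimal sketch of that idea follows, assuming a plain in-memory list of links. The `example.org` URIs and the assignment of predicates to particular pairs are hypothetical; only the predicate names and the two reasons ("non-salt form", "drug name") come from slide 9.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str  # justification recorded with the link


# Hypothetical URIs; real linksets would carry dataset identifiers.
LINKS: List[Link] = [
    Link("http://example.org/a/gleevec", "http://example.org/b/imatinib",
         "skos:closeMatch", "non-salt form"),
    Link("http://example.org/a/gleevec", "http://example.org/c/imatinib-mesylate",
         "skos:exactMatch", "drug name"),
]


def equivalents(uri: str, links: Iterable[Link], accepted_reasons: Set[str]) -> Set[str]:
    """URIs reachable from `uri` using only links whose reason is accepted.

    Links are followed in both directions and transitively.
    """
    usable = [link for link in links if link.reason in accepted_reasons]
    seen = {uri}
    frontier = [uri]
    while frontier:
        current = frontier.pop()
        for link in usable:
            if link.source == current:
                neighbour = link.target
            elif link.target == current:
                neighbour = link.source
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    seen.discard(uri)
    return seen


strict_view = equivalents("http://example.org/a/gleevec", LINKS, {"drug name"})
relaxed_view = equivalents("http://example.org/a/gleevec", LINKS,
                           {"drug name", "non-salt form"})
print(sorted(strict_view))
print(sorted(relaxed_view))
```

Filtering on the recorded reason rather than on the predicate alone is the design choice being illustrated: the same set of links yields different equivalence sets for consumers with different requirements.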