The document provides methodological guidelines for publishing linked data. It introduces linked data and its key principles of using URIs, HTTP URIs, providing useful information through standards like RDF and SPARQL, and including links between data. The rest of the document outlines guidelines for publishing linked data, including identifying data sources, modeling vocabularies by reusing existing ones, generating RDF data from sources, generating URIs, publishing and linking the RDF data, enabling discovery through mechanisms like CKAN and Sitemaps, and tools that can help with each step of the process.
Methodological Guidelines for Publishing Linked Data
1. Methodological Guidelines for Publishing Linked Data
Boris Villazón-Terrazas, Asunción Gómez-Pérez, and Óscar Corcho
Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid
http://www.oeg-upm.net
{bvillazon,asun,ocorcho}@fi.upm.es
Phone: 34.91.3366605, Fax: 34.91.3524819
CONSEGI 2011 – Brasília, Brazil
12th May, 2011
2. ToC
• Introduction to Linked Data
• Guidelines for Publishing Linked Data
• Demo
8. In a nutshell
• An extension of the current
Web…
• … where information and services
are given well-defined and explicitly
represented meaning, …
• … so that it can be shared and used
by humans and machines, ...
• ... better enabling them to work in
cooperation
• How?
• Promoting information exchange by
tagging web content with machine
processable descriptions of its
meaning.
• And technologies and infrastructure
to do this
• And clear principles on how to
publish data
9. The four principles (Tim Berners-Lee, 2006)
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4. Include links to other URIs, so that they can discover more things.
• http://www.w3.org/DesignIssues/LinkedData.html
• http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
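As a rough illustration of the four principles, the sketch below models a tiny Web of Data in memory. The `WEB` dictionary and the `look_up` and `discover` helpers are hypothetical names, and the data is illustrative only; a real client would dereference the HTTP URIs over the network and parse the returned RDF.

```python
# Toy in-memory "Web of Data": each HTTP URI (principles 1 and 2) maps to a
# list of (predicate, object) pairs describing it (principle 3).
# All data below is illustrative only.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

WEB = {
    "http://geo.linkeddata.es/resource/Provincia/Madrid": [
        (RDFS_LABEL, "Madrid"),
        (OWL_SAME_AS, "http://dbpedia.org/resource/Madrid"),
    ],
    "http://dbpedia.org/resource/Madrid": [
        (RDFS_LABEL, "Madrid"),
    ],
}


def look_up(uri):
    """Stand-in for an HTTP GET on the URI: return its description."""
    return WEB.get(uri, [])


def discover(start):
    """Follow links to other URIs (principle 4) and return every URI reached."""
    seen, todo = [], [start]
    while todo:
        uri = todo.pop()
        if uri in seen:
            continue
        seen.append(uri)
        for _, obj in look_up(uri):
            # Only objects that are themselves known URIs count as links.
            if obj.startswith("http://") and obj in WEB:
                todo.append(obj)
    return seen
```

Calling `discover` on the GeoLinkedData URI for Madrid reaches the DBpedia URI through the `owl:sameAs` link, which is the kind of discovery the fourth principle is after.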
11. And guess who is starting to publish Linked Data now?
• UK Government
• US Government
• BBC
• Open Calais
• Freebase
• NY Times
• CNET
• DBpedia
• ….
14. ToC
• Introduction to Linked Data
• Guidelines for Publishing Linked Data
• Demo
15. Linked Data in OEG
• GeoLinkedData is an open initiative whose aim is to
enrich the Web of Data with Spanish geospatial data.
http://geo.linkeddata.es
• El Viajero Linked Data is a project that focuses on the
integration of the contents produced by newspapers
and digital platforms belonging to the Prisa Group.
http://webenemasuno.linkeddata.es/
• A project with the Biblioteca Nacional to publish the
library information as Linked Data.
http://cultura.linkeddata.es/visualizer/
16. Linked Data in OEG
• Tools for generating and consuming Linked Data, e.g.,
• geometry2rdf http://www.oeg-upm.net/index.php/downloads/151-geometry2rdf
• map4rdf http://oegdev.dia.fi.upm.es/projects/map4rdf/
• Spanish Thematic Network of Linked Data
http://red.linkeddata.es
» Group leader: Ontology Engineering Group
» 19 Research Groups
» 4 companies
19. Identification of the data sources
• Guidelines based on the Open Data Manual 1
• Two possibilities
• To find the data sources already available in a public data
catalog, e.g., Aporta project 2
• To get an agreement with a particular government body to
publish its data sources, e.g., GeoLinkedData - IGN
1 http://opendatamanual.org/
2 http://aporta.es
20. Identification of the data sources
GeoLinkedData
• Agreement with the IGN (National Geographic Institute of Spain): Oracle & MySQL
• Data sources available in a public data catalog: INE (National Statistics Institute of Spain)
21. Identification of the data sources
IGN & INE
[Figure: labels Year, Province, Industry Production Index]
23. Vocabulary Modelling
Ontology
• An ontology is an engineering artifact, which provides:
• A set of terms
• A set of explicit assumptions regarding the intended meaning of the terms.
• Almost always including concepts and their classification
• Almost always including properties between concepts
• Shared understanding of a domain of interest
24. Vocabulary Modelling
Reuse available vocabularies
• Search for suitable vocabularies (Linked Open Vocabularies)
• Are there suitable vocabularies?
• Yes: build the vocabulary by reusing available vocabularies
• No: …
25. Vocabulary Modelling
Reuse available non-ontological resources
• Search for suitable non-ontological resources (highly reliable Web sites, domain-related sites, government catalogs)
• Are there suitable resources?
• Yes: build the vocabulary by transforming available resources
• No: build the vocabulary from scratch
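The decision flow of the two Vocabulary Modelling slides above can be sketched as a small function. This is only a reading of the flowcharts; the function name and the return strings are hypothetical, and the two searches themselves remain manual steps.

```python
def choose_modelling_strategy(suitable_vocabularies, suitable_resources):
    """Return the modelling step suggested by the flowcharts.

    Both arguments are lists of candidates found by the (manual) searches;
    an empty list means nothing suitable was found.
    """
    if suitable_vocabularies:
        return "reuse available vocabularies"
    if suitable_resources:
        return "transform available non-ontological resources"
    return "build the vocabulary from scratch"


# Example: no suitable vocabulary, but a government catalog is available.
print(choose_modelling_strategy([], ["government catalog"]))
```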
26. Vocabulary Modelling
GeoLinkedData
• WGS84 Geo Positioning: an RDF vocabulary
• scv:Dimension, scv:Item, scv:Dataset
• Hydrographical phenomena (rivers, lakes, etc.)
• Vocabulary for instants, intervals, durations, etc.
• Names and international code systems for territories and groups
• Ontology for OGC Geography Markup Language
Classes: 33
Object Properties: 44
Data Properties: 318
http://neon-toolkit.org/
28. Generation of the RDF Data
• INE: NOR2O
• IGN: ODEMapster
• IGN, geospatial column: geometry2rdf
29. Generation of the RDF Data
NOR2O
[Figure: labels Industry Production Index, Year, Province]
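The slide names NOR2O as the tool for this step, but its interface is not shown here, so the sketch below does not use it. It only illustrates, under stated assumptions, how one cell of a Province / Year / Industry Production Index table might be described as an `scv:Item`. The SCOVO namespace URI, the `example.org` base URI, and the row values are assumptions or placeholders, not real INE figures.

```python
SCV = "http://purl.org/NET/scovo#"  # assumed namespace for the scv: prefix
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BASE = "http://example.org/resource/"  # hypothetical base URI


def row_to_triples(row_id, province, year, value):
    """Describe one table cell as an scv:Item with two dimensions and a value."""
    item = f"<{BASE}item/{row_id}>"
    return [
        f"{item} <{RDF}type> <{SCV}Item> .",
        f"{item} <{SCV}dimension> <{BASE}Provincia/{province}> .",
        f"{item} <{SCV}dimension> <{BASE}year/{year}> .",
        f'{item} <{RDF}value> "{value}" .',
    ]


# Placeholder row, not a real INE figure.
for triple in row_to_triples(1, "Madrid", 2010, 100.0):
    print(triple)
```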
30. Generation of the RDF Data
R2O & ODEMapster
• R2O is an extensible, fully declarative language to describe
mappings between relational database schemas and ontologies.
• The ODEMapster processor generates RDF instances from
relational instances based on the mapping description
expressed in the R2O document
www.oeg-upm.net/index.php/en/downloads/9-r2o-odempaster
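The idea of a declarative mapping driving RDF generation can be sketched as follows. This is not R2O syntax and does not call ODEMapster; the `MAPPING` dictionary, the `rio` table, the `example.org` URIs, and the sample row are all hypothetical, and an in-memory SQLite database stands in for the relational source.

```python
import sqlite3

# Hypothetical declarative mapping: table -> class URI, column -> property URI.
MAPPING = {
    "table": "rio",
    "key": "id",
    "class": "http://example.org/ontology/Rio",
    "columns": {"nombre": "http://www.w3.org/2000/01/rdf-schema#label"},
    "uri_pattern": "http://example.org/resource/Rio/{key}",
}
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def generate_triples(conn, mapping):
    """Turn each relational row into RDF triples as dictated by the mapping."""
    cols = [mapping["key"], *mapping["columns"]]
    query = f'SELECT {", ".join(cols)} FROM {mapping["table"]}'
    triples = []
    for key, *values in conn.execute(query):
        subject = mapping["uri_pattern"].format(key=key)
        triples.append((subject, RDF_TYPE, mapping["class"]))
        for col, value in zip(mapping["columns"], values):
            triples.append((subject, mapping["columns"][col], value))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rio (id INTEGER PRIMARY KEY, nombre TEXT)")
conn.execute("INSERT INTO rio VALUES (1, 'Arroyo de ejemplo')")  # made-up row
triples = generate_triples(conn, MAPPING)
```

Keeping the mapping as data rather than code is the point of the declarative approach: the same processor can be reused for another table by swapping the mapping.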
31. Generation of the RDF Data
R2O & ODEMapster
• Creation of the R2O Mappings
32. Generation of the RDF Data
R2O & ODEMapster
Excerpt of the R2O document
33. Generation of the RDF Data
geometry2rdf
• Tool for generating RDF from geometrical information
• The geometry could be available in GML or WKT
• The RDF generated follows our Geometry Model
http://www.oeg-upm.net/index.php/en/downloads/151-geometry2rdf
34. Generation of the RDF Data
geometry2rdf
Oracle SDO_UTIL package
SELECT TO_CHAR(SDO_UTIL.TO_GML311GEOMETRY(geometry))
AS Gml311Geometry
FROM "BCN200"."BCN200_0301L_RIO" c
WHERE c.Etiqueta='Arroyo'
36. Generation of the RDF Data
Geometry Model
geoes: http://geo.linkeddata.es/
geo: http://www.w3.org/2003/01/geo/wgs84_pos#
• geo:Point rdfs:subClassOf geoes:ontology/Geometría, with geo:lat and geo:long
• geoes:ontology/Curva rdfs:subClassOf geoes:ontology/Geometría; formadoPor a collection of 2 or more geo:Points
• geoes:ontology/Polígono rdfs:subClassOf geoes:ontology/Geometría; formadoPor a collection of 3 or more geo:Points
37. Generation of the RDF Data
RDF generated according to our Geometry Model
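To make the Geometry Model more concrete, here is a minimal sketch that turns a WKT `LINESTRING` into triples for a Curva made of `geo:Point`s. It is not the geometry2rdf implementation. The namespace assumed for `formadoPor`, the URI pattern for the points, and the `rdf:type` shorthand are assumptions, and coordinate reference systems are ignored (WKT x is taken as longitude and y as latitude).

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
GEOES = "http://geo.linkeddata.es/ontology/"  # assumed home of Curva and formadoPor


def linestring_to_triples(curve_uri, wkt):
    """Convert 'LINESTRING (x1 y1, x2 y2, ...)' into (subject, predicate, object) triples."""
    body = wkt[wkt.index("(") + 1 : wkt.rindex(")")]
    coords = [tuple(float(v) for v in pair.split()) for pair in body.split(",")]
    if len(coords) < 2:
        raise ValueError("a Curva needs at least 2 points")
    triples = [(curve_uri, "rdf:type", GEOES + "Curva")]
    for i, (lon, lat) in enumerate(coords):
        point = f"{curve_uri}/point/{i}"  # hypothetical URI pattern for points
        triples += [
            (curve_uri, GEOES + "formadoPor", point),
            (point, GEO + "lat", lat),
            (point, GEO + "long", lon),
        ]
    return triples


# Made-up geometry, only to exercise the function.
triples = linestring_to_triples("http://example.org/resource/curva/1", "LINESTRING (0 0, 1 2)")
```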
38. Generation of the RDF Data
URI Generation
• URIs are extremely relevant in this process since
they are the key for the alignment of heterogeneous
resources that come from different data sources.
• Cool URIs 1
• UK Cabinet Office 2
• Examples:
http://geo.linkeddata.es/ontology/{class/property}
http://geo.linkeddata.es/ontology/Lago
http://geo.linkeddata.es/resource/dataset/type/{resourcename}
http://geo.linkeddata.es/resource/Provincia/Madrid
1 http://www.w3.org/TR/cooluris/
2 http://www.cabinetoffice.gov.uk/media/301253/puiblic_sector_uri.pdf
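The URI patterns above can be applied mechanically. The helper names below are hypothetical, and the sketch assumes that the type and name segments are percent-encoded as UTF-8, which is what `urllib.parse.quote` does by default.

```python
from urllib.parse import quote

BASE = "http://geo.linkeddata.es"


def ontology_uri(term):
    """URI for a class or property: {BASE}/ontology/{class/property}."""
    return f"{BASE}/ontology/{quote(term)}"


def resource_uri(resource_type, name):
    """URI for an instance: {BASE}/resource/{type}/{resourcename}."""
    return f"{BASE}/resource/{quote(resource_type)}/{quote(name)}"


print(ontology_uri("Lago"))
print(resource_uri("Provincia", "Madrid"))
```

Generating every URI through one pair of functions keeps the naming scheme consistent across data sources, which is what makes later alignment of heterogeneous resources feasible.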
39. Generation of the RDF Data
Provenance Information
• It is relevant
• to manage the provenance information of the resources
• to establish the license of the information
• Example
Pubby: http://www4.wiwiss.fu-berlin.de/pubby/
41. Publication of the RDF data
• map4rdf http://oegdev.dia.fi.upm.es/projects/map4rdf/
• HTML, Linked Data, SPARQL
• Pubby 0.3, including provenance support http://www4.wiwiss.fu-berlin.de/pubby/
• Virtuoso 6.1.0
43. Data Cleansing
• To find possible errors, identified by Hogan et al.
• HTTP-level issues such as accessibility and dereferenceability,
e.g., HTTP URIs return 40x/50x errors
• reasoning issues such as namespace without vocabulary,
e.g., rss:item term invented
• malformed/incompatible datatypes, e.g., “true” as xsd:int
• To fix the identified errors
• Example, encoding URIs
• Special characters: á, é, ñ
• http://geo.linkeddata.es/resource/Provincia/M%C3%A1laga
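Two of the error classes listed above lend themselves to simple automated checks. The sketch below is a simplified illustration with hypothetical function names; it covers only two XSD datatypes, ignores value ranges, and expects the HTTP status code to be supplied by whatever client fetched the URI.

```python
import re


def is_http_error(status):
    """HTTP-level issue: the URI answers with a 4xx or 5xx status code."""
    return 400 <= status <= 599


def is_valid_literal(lexical, datatype):
    """Datatype issue: check a lexical form against a few XSD datatypes."""
    patterns = {
        "xsd:int": r"[+-]?[0-9]+",  # range limits are ignored in this sketch
        "xsd:boolean": r"true|false|1|0",
    }
    if datatype not in patterns:
        raise ValueError(f"datatype not covered by this sketch: {datatype}")
    return re.fullmatch(patterns[datatype], lexical) is not None


# The slide's example: "true" typed as xsd:int is malformed.
print(is_valid_literal("true", "xsd:int"))
```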
45. Linking the RDF Data
• Identify suitable data sets as linking targets
http://ckan.net
• Discover relationships between data items
LIMES http://aksw.org/Projects/limes
Silk Framework http://www4.wiwiss.fu-berlin.de/bizer/silk/
• Validate the relationships discovered
sameAs Validator http://oegdev.dia.fi.upm.es:8080/sameAs/
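The discovery step can be pictured with a deliberately naive sketch that compares labels. It is not how LIMES or the Silk Framework work internally; the function name, the 0.9 threshold, and the sample labels are assumptions, and any candidate it produces would still need the validation step.

```python
from difflib import SequenceMatcher


def find_sameas_candidates(source, target, threshold=0.9):
    """Compare labels pairwise and return (source_uri, target_uri, score) candidates.

    source and target map resource URIs to human-readable labels.
    """
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
            if score >= threshold:
                links.append((s_uri, t_uri, score))
    return links


# Illustrative labels only.
source = {"http://geo.linkeddata.es/resource/Provincia/Madrid": "Madrid"}
target = {
    "http://dbpedia.org/resource/Madrid": "Madrid",
    "http://dbpedia.org/resource/Barcelona": "Barcelona",
}
links = find_sameas_candidates(source, target)
```

The pairwise comparison is quadratic in the number of resources, which is one reason dedicated link discovery frameworks exist for larger data sets.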
46. Linking the RDF Data
GeoLinkedData
• DBpedia: http://dbpedia.org/resource/Madrid
• GeoLinkedData: http://geo.linkeddata.es/.../Madrid
• GeoNames: http://sws.geonames.org/6355233/
47. Linking the RDF Data
sameAs Validator
http://oegdev.dia.fi.upm.es:8080/sameAs/
49. Enable Effective Discovery
Register the dataset into CKAN Registry
• Add the dataset to CKAN, the open registry of data
and content packages
• Minimum information
• Name, unique ID for your data set on CKAN
• Title, full name of your data set
• URL, link to the data set home page
http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
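Before registering a data set, the minimum information listed above can be checked with a few lines of code. The field names `name`, `title`, and `url` mirror the slide rather than any particular CKAN API, and the sample record values are illustrative.

```python
REQUIRED_FIELDS = ("name", "title", "url")


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Illustrative record; the "name" value is a hypothetical identifier.
record = {
    "name": "geolinkeddata",
    "title": "GeoLinkedData",
    "url": "http://geo.linkeddata.es",
}
print(missing_fields(record))
```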
50. Enable Effective Discovery
Sitemap protocol
• Used by web crawlers
• Efficiently find all your content & discover
what has been updated
http://sitemaps.org/
A sitemap file contains information regarding one or more URLs on
your Web site. The information that is stored there helps search
engines better spider your website.
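A sitemap file of the kind described above can be produced with the Python standard library. This is a minimal sketch: the function name is hypothetical, only the `loc` and optional `lastmod` elements are written, and the date in the example is a placeholder.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a sitemap document from (url, lastmod-or-None) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# The lastmod date is a placeholder value.
xml = build_sitemap([("http://geo.linkeddata.es/resource/Provincia/Madrid", "2011-05-12")])
print(xml)
```

The `lastmod` element is what lets a crawler discover what has been updated without fetching every URL again.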
52. Enable Effective Discovery
sitemap4rdf
• Simple command line tool
• Sends a SPARQL query to list all URIs
• Generates sitemap
sitemap4rdf http://yoursite/sparql http://yoursite/resource/
Example:
sitemap4rdf http://geo.linkeddata.es/sparql http://geo.linkeddata.es/
• Run sitemap4rdf specifying the SPARQL endpoint
and the prefix of the URLs to include in the Sitemap
http://lab.linkeddata.deri.ie/2010/sitemap4rdf/
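The slide does not show the query that sitemap4rdf sends, so the following is only a guess at what a "list all URIs under a prefix" query could look like. The function name is hypothetical, `STRSTARTS` requires a SPARQL 1.1 endpoint, and the prefix is inserted without escaping, so it should come from a trusted source.

```python
def list_uris_query(prefix):
    """Build a SPARQL query selecting every subject IRI that starts with the prefix."""
    return (
        "SELECT DISTINCT ?s WHERE { ?s ?p ?o . "
        f'FILTER(isIRI(?s) && STRSTARTS(STR(?s), "{prefix}")) }}'
    )


print(list_uris_query("http://geo.linkeddata.es/"))
```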