Linked Open Data is the most usable kind of Open Data. A well-integrated source of Linked Open Data on tourism and mobility is the Open Data Hub, operated by NOI. We will use SPARQL, a W3C-standard query language, to query the data and show how this differs from other access methods. The tour starts by querying the endpoint directly from the command line with tools like curl. Then, one by one, well-known data science software packages such as R and Pandas will be used to work directly with these datasets, to perform statistical calculations and to generate graphs from the data.
In the final part, these software packages will be used to query data from other well known data sources, like Wikidata and DBpedia.
SFScon 2020 - Peter Hopfgartner - Open Data de luxe
1. Open Data de luxe: Querying public SPARQL endpoints from the command line, R and Pandas
Making the most of Open Data Hub, Wikidata, DBpedia and other sources of high quality data
Bolzano - 13 NOV 2020
We make data actually usable
2. Evolutions of Open Data: 5 star Open Data
★ available on the web (whatever format) but with an open licence
★★ plus: available as machine-readable structured data (e.g. Excel instead of image scan of a table)
★★★ plus: non-proprietary format (e.g. CSV instead of Excel)
★★★★ plus: Use open standards from W3C (RDF and SPARQL) to identify things
★★★★★ plus: Link your data to other people’s data to provide context
https://5stardata.info/
3. Evolutions of Open Data: FAIR
Findable, Accessible, Interoperable and Reusable (FAIR)
FAIR data is not always open data (personal data, competitive data etc.)
❖ It facilitates data interchange on the web
❖ It facilitates data integration across sources even when schemas are
different
❖ It supports evolution of schemas over time with minimal disruption
to data consumers
https://www.go-fair.org
4. Technology of choice: 1 - RDF
RDF is “a standard model for data
interchange on the Web”
Large graphs are built on triples
@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ab:i0432
    ab:firstName "Richard" ;
    ab:lastName "Mutt" ;
    ab:spouse ab:i9771 .
ab:i8301
    ab:firstName "Craig" ;
    ab:lastName "Ellis" ;
    ab:patient ab:i9771 .
ab:i9771
    ab:firstName "Cindy" ;
    ab:lastName "Marshall" .
ab:spouse
    rdf:type owl:SymmetricProperty ;
    rdfs:comment "Identifies someone's spouse" .
ab:patient
    rdf:type rdf:Property ;
    rdfs:comment "Identifies a doctor's patient" .
subject predicate object
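To make the subject / predicate / object idea concrete, here is a minimal sketch in plain Python. It stores some of the address-book triples above as tuples and matches a pattern against them, with None acting as a wildcard. The `match` helper is a hypothetical illustration and not part of any RDF library; a real triple store does the same kind of matching at scale.

```python
# Some of the address-book triples above, as (subject, predicate, object) tuples.
TRIPLES = [
    ("ab:i0432", "ab:firstName", "Richard"),
    ("ab:i0432", "ab:lastName", "Mutt"),
    ("ab:i0432", "ab:spouse", "ab:i9771"),
    ("ab:i8301", "ab:firstName", "Craig"),
    ("ab:i8301", "ab:lastName", "Ellis"),
    ("ab:i8301", "ab:patient", "ab:i9771"),
    ("ab:i9771", "ab:firstName", "Cindy"),
    ("ab:i9771", "ab:lastName", "Marshall"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None matches anything (hypothetical helper)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Every triple whose object is Cindy (ab:i9771), whatever the predicate.
links_to_cindy = match(TRIPLES, o="ab:i9771")
for subject, predicate, obj in links_to_cindy:
    print(subject, predicate, obj)
```

A SPARQL triple pattern such as `?s ?p ab:i9771` expresses the same question declaratively.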
5. Technology of choice: 2 - SPARQL
SPARQL, the language to select, update, create and delete
triples
PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT * WHERE {
?t a schema:PerformingArtsTheater ;
geo:asWKT ?pos ;
schema:name ?posLabel .
}
6. Technology of choice: 2 - SPARQL
SPARQL is similar to SQL, but built for the web age:
★ HTTP/S as transport protocol
★ No drivers required
★ Standardized by the W3C
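"No drivers required" means that a query is just an HTTP request. The sketch below uses only the Python standard library to build the URL for a GET request, with the query text passed in the `query` parameter as the SPARQL protocol defines it. The endpoint URL is the Open Data Hub one used later in this talk; the `build_query_url` helper is a hypothetical name chosen for illustration, and no request is actually sent here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://sparql.opendatahub.bz.it/sparql"

QUERY = """
PREFIX schema: <http://schema.org/>
SELECT * WHERE { ?t a schema:PerformingArtsTheater . }
"""


def build_query_url(endpoint, query):
    """Build a GET URL that carries the query in the `query` parameter (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text, as the server would.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

Any HTTP client, from curl to a browser, can then fetch this URL; the desired result format is chosen with the `Accept` header.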
7. Your personal SPARQL database: Tracker
Tracker is the file system indexer used by the GNOME
desktop, e.g. for full-text search
$ tracker sparql -q "SELECT DISTINCT ?performerName WHERE {?s
<http://www.tracker-project.org/temp/nmm#performer> ?performerName
. }"
Results:
urn:artist:Yasmine%20Hamdan
urn:artist:Otfried%20Preu%C3%9Fler
urn:artist:Queens%20Of%20The%20Stone%20Age
urn:artist:Guns%20N'Roses
...
8. Big SPARQL endpoints: Wikidata
Wikidata handles the
fact data for Wikipedia
articles
Data from Wikidata
Link to Wikidata entry
12. Big SPARQL endpoints: datacommons.org
Operated by Google. Integrates
many data sources:
★ United States Census
★ World Bank
★ US Bureau of Labor Statistics
★ Wikipedia
★ National Oceanic and
Atmospheric Administration
★ Federal Bureau of
Investigation
★ ...
13. 0 KM endpoints: The Open Data Hub
Operated by NOI Techpark (https://sparql.opendatahub.bz.it/)
14. How can I use these endpoints for my
analyses?
15. Command line: cURL
$ curl -X POST https://query.wikidata.org/sparql \
    -H "Accept: text/csv" --data-urlencode query@countries.rq
# countries.rq
SELECT DISTINCT ?countryLabel ?population ?area
WHERE
{
?country wdt:P31 wd:Q6256 .
?country wdt:P1082 ?population .
?country wdt:P2046 ?area .
MINUS {?country wdt:P31 wd:Q3024240 .}
SERVICE wikibase:label { bd:serviceParam wikibase:language
"en,[AUTO_LANGUAGE]". }
}
ORDER BY DESC(?population)
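16. Command line: rsparql from Apache Jena
$ ${JENA_DIR}/bin/rsparql --service 'https://query.wikidata.org/sparql' \
    --query countries.rq --results=CSV > countries.csv
# countries.rq
# rsparql parses the query locally, so the prefixes that the Wikidata
# endpoint predefines have to be declared explicitly here.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>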
SELECT DISTINCT ?countryLabel ?population ?area
WHERE
{
?country wdt:P31 wd:Q6256 .
?country wdt:P1082 ?population .
?country wdt:P2046 ?area .
MINUS {?country wdt:P31 wd:Q3024240 .}
SERVICE wikibase:label { bd:serviceParam wikibase:language
"en,[AUTO_LANGUAGE]". }
}
ORDER BY DESC(?population)
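Once the result is available as CSV, a first calculation needs nothing beyond the standard library. The sketch below derives the population density per country. To stay self-contained it reads three sample rows taken from the R output shown on slide 17 instead of the real countries.csv, and it assumes that the area is given in km². The `population_density` function is a hypothetical helper.

```python
import csv
import io

# Sample rows copied from the R output on slide 17; the real file is much longer.
SAMPLE_CSV = """countryLabel,population,area
People's Republic of China,1409517397,9596961
India,1326093247,3287263
United States of America,328239523,9826675
"""


def population_density(csv_text):
    """Map each country label to population / area (assumed inhabitants per km²)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["countryLabel"]: float(row["population"]) / float(row["area"])
        for row in reader
    }


density = population_density(SAMPLE_CSV)
# Print the countries from most to least densely populated.
for country, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {value:.1f}")
```

To work on the downloaded file instead, the text can be read with `open("countries.csv", newline="", encoding="utf-8").read()` and passed to the same function.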
17. Directly from R
library(WikidataQueryServiceR)
r <- query_wikidata('
SELECT DISTINCT ?countryLabel ?population ?area
WHERE
{
?country wdt:P31 wd:Q6256 .
?country wdt:P1082 ?population .
?country wdt:P2046 ?area .
MINUS {?country wdt:P31 wd:Q3024240 .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
}
ORDER BY DESC(?population)
')
head(r)
# A tibble: 6 x 3
countryLabel population area
<chr> <dbl> <dbl>
1 People's Republic of China 1409517397 9596961
2 India 1326093247 3287263
3 United States of America 328239523 9826675
...
18. Python with the requests module
import requests
url = "https://sparql.opendatahub.bz.it/sparql"
q = """
PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT * WHERE {
?t a schema:PerformingArtsTheater ; geo:asWKT ?pos ; schema:name ?posLabel .
}
"""
# The Accept header (not Content-Type) asks the endpoint for JSON results.
r = requests.get(url, params={'query': q},
                 headers={'Accept': 'application/sparql-results+json'})
print(r.json())
It works, but the returned results are not directly usable as a
table.
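The JSON document follows the W3C SPARQL query results format: the variable names sit under `head.vars` and each result row is a dictionary under `results.bindings`. A minimal sketch of flattening this into table-like rows is shown below. The `SAMPLE` document is hand-built from the theater result printed on slide 19, so no network access is needed, and `bindings_to_rows` is a hypothetical helper.

```python
# Hand-built sample in the SPARQL JSON results layout, using the values from slide 19.
SAMPLE = {
    "head": {"vars": ["t", "pos", "posLabel"]},
    "results": {
        "bindings": [
            {
                "t": {
                    "type": "uri",
                    "value": "http://noi.example.org/data/poi/9621F83525089644A0D47464D27D634E",
                },
                "pos": {
                    "type": "literal",
                    "datatype": "http://www.opengis.net/ont/geosparql#wktLiteral",
                    "value": "POINT (11.3534199999999998 46.4990740000000002)",
                },
                "posLabel": {"type": "literal", "value": "Kleinkunsttheater Carambolage"},
            }
        ]
    },
}


def bindings_to_rows(doc):
    """Turn a SPARQL JSON result into a list of {variable: value} dictionaries."""
    variables = doc["head"]["vars"]
    return [
        # Unbound variables are simply missing from a binding, so default to None.
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in doc["results"]["bindings"]
    ]


rows = bindings_to_rows(SAMPLE)
for row in rows:
    print(row)
```

With the live endpoint, `bindings_to_rows(r.json())` should give the same structure, and a list of dictionaries like this can be passed to `pandas.DataFrame(rows)`.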
19. Python with sparql_client
import sparql
endpoint = "https://sparql.opendatahub.bz.it/sparql"
q = """
PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT * WHERE {
?t a schema:PerformingArtsTheater ; geo:asWKT ?pos ; schema:name ?posLabel .
}
"""
result = sparql.query(endpoint, q)
for row in result:
    print(row)
(<IRI <http://noi.example.org/data/poi/9621F83525089644A0D47464D27D634E>>,
 <Literal "POINT (11.3534199999999998 46.4990740000000002)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>>,
 <Literal "Kleinkunsttheater Carambolage">)
...
Good, but needs some rework for Pandas
20. Python with sparql-dataframe
import sparql_dataframe
endpoint = "https://sparql.opendatahub.bz.it/sparql"
q = """
PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT * WHERE {
?t a schema:PerformingArtsTheater ; geo:asWKT ?pos ; schema:name ?posLabel .
}
"""
df = sparql_dataframe.get(endpoint, q)
Most comfortable solution for Pandas
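The `?pos` values arrive as WKT strings such as "POINT (11.35... 46.49...)", which are not yet numeric. The sketch below parses such a point into longitude and latitude with a regular expression from the standard library. It assumes the usual WKT axis order of x (longitude) followed by y (latitude) and handles only plain two-dimensional points; `parse_wkt_point` is a hypothetical helper, and a geometry library would be the better choice for other geometry types.

```python
import re

# Matches e.g. "POINT (11.35 46.49)"; only plain 2D points are supported.
_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_wkt_point(wkt):
    """Return (longitude, latitude) from a WKT POINT string, assuming x y order."""
    m = _POINT.match(wkt)
    if m is None:
        raise ValueError(f"not a 2D WKT POINT: {wkt!r}")
    return float(m.group(1)), float(m.group(2))


# The theater position printed on slide 19.
lon, lat = parse_wkt_point("POINT (11.3534199999999998 46.4990740000000002)")
print(lon, lat)
```

Assuming the data frame column is named after the query variable `pos`, something like `df["pos"].map(parse_wkt_point)` would apply the parser to every row.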
21. What makes RDF / SPARQL great for data
exchange?
★ Data can really be queried, not only
downloaded
★ Well structured data with rich data models,
often standardized, and good metadata
★ Data is easy to integrate
★ Technology is easy to integrate
23. The Team
Diego Calvanese: Scientific advisor of the board, Full professor at unibz, ACM Fellow
Benjamin Cogrel: CTO, Chair of the board
Peter Hopfgartner: CEO
Marco Montali: Scientific consultant, Assoc. professor at unibz
Guohui Xiao: Chief scientist, Jun. professor at unibz
24. Big SPARQL endpoints: Typical queries
# Wikidata: bands that start with "Radio"
# try it on https://query.wikidata.org
SELECT DISTINCT ?band ?bandLabel
WHERE
{
?band wdt:P31 wd:Q215380 .
?band rdfs:label ?bandLabel .
FILTER(STRSTARTS(?bandLabel, 'Radio')) .
}
# DBpedia: facts about Joe Biden
SELECT ?property ?hasValue ?isValueOf
WHERE {
{ <http://dbpedia.org/resource/Joe_Biden> ?property ?hasValue }
UNION
{ ?isValueOf ?property <http://dbpedia.org/resource/Joe_Biden> }
}
25. Evolutions of Open Data: Linked Data
❏ Use URIs to name (identify) things.
❏ Use HTTP URIs so that these things can be looked up
(interpreted, “dereferenced”).
❏ Provide useful information about what a name identifies
when it’s looked up, using open standards such as RDF,
SPARQL, etc.
❏ Refer to other things using their HTTP URI-based names
when publishing data on the Web.
Tim Berners-Lee, 2006