SFScon 2020 - Peter Hopfgartner - Open Data de luxe

Open Data de luxe: Querying public SPARQL endpoints from the command line, R and Pandas
Making the most of Open Data Hub, Wikidata, DBpedia and other sources of high-quality data
Bolzano - 13 NOV 2020
We make data actually usable

Evolutions of Open Data: 5 star Open Data
★ available on the web (whatever format) but with an open licence
★★ plus: available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
★★★ plus: non-proprietary format (e.g. CSV instead of Excel)
★★★★ plus: use open standards from W3C (RDF and SPARQL) to identify things
★★★★★ plus: link your data to other people’s data to provide context
https://5stardata.info/

Evolutions of Open Data: FAIR
Findable, Accessible, Interoperable and Reusable (FAIR)
FAIR data is not always open data (personal data, competitive data etc.)
❖ It facilitates data interchange on the web
❖ It facilitates data integration across sources even when schemas are different
❖ It supports evolution of schemas over time with minimal disruption to data consumers
https://www.go-fair.org

Technology of choice: 1 - RDF
RDF is “a standard model for data interchange on the Web”
Large graphs are built from triples
@prefix ab:   <http://learningsparql.com/ns/addressbook#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

ab:i0432
    ab:firstName "Richard" ;
    ab:lastName  "Mutt" ;
    ab:spouse    ab:i9771 .

ab:i8301
    ab:firstName "Craig" ;
    ab:lastName  "Ellis" ;
    ab:patient   ab:i9771 .

ab:i9771
    ab:firstName "Cindy" ;
    ab:lastName  "Marshall" .

ab:spouse
    rdf:type     owl:SymmetricProperty ;
    rdfs:comment "Identifies someone's spouse" .

ab:patient
    rdf:type     rdf:Property ;
    rdfs:comment "Identifies a doctor's patient" .

subject predicate object
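The subject/predicate/object structure can be tried out without any RDF library. What follows is only a minimal sketch: it assumes the triples are held as plain Python tuples, and the `match` helper is hypothetical, not part of any RDF toolkit. A `None` in the pattern acts as a wildcard, loosely like a SPARQL variable.

```python
# Triples from the address book example, stored as (subject, predicate, object).
TRIPLES = [
    ("ab:i0432", "ab:firstName", "Richard"),
    ("ab:i0432", "ab:lastName", "Mutt"),
    ("ab:i0432", "ab:spouse", "ab:i9771"),
    ("ab:i8301", "ab:firstName", "Craig"),
    ("ab:i8301", "ab:lastName", "Ellis"),
    ("ab:i8301", "ab:patient", "ab:i9771"),
    ("ab:i9771", "ab:firstName", "Cindy"),
    ("ab:i9771", "ab:lastName", "Marshall"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that agree with the pattern; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Who points at ab:i9771, and through which predicate?
for subject, predicate, _ in match(TRIPLES, o="ab:i9771"):
    print(subject, predicate)
```

A real triple store answers the same kind of pattern question, only over much larger graphs and with full IRIs instead of prefixed strings.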

Technology of choice: 2 - SPARQL
SPARQL, the language to select, update, create and delete triples
PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT * WHERE {
  ?t a schema:PerformingArtsTheater ;
     geo:asWKT ?pos ;
     schema:name ?posLabel .
}

Technology of choice: 2 - SPARQL
SPARQL is similar to SQL, but built for the web age:
★ HTTP/S as transport protocol
★ No drivers required
★ Standardized by the W3C
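"No drivers required" means that a query is an ordinary HTTP request. The sketch below uses only the Python standard library and builds the request without sending it. The `build_request` helper is hypothetical; the endpoint URL is the Open Data Hub one used later in the talk.

```python
from urllib.parse import urlencode
from urllib.request import Request

QUERY = """
PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT * WHERE {
  ?t a schema:PerformingArtsTheater ;
     geo:asWKT ?pos ;
     schema:name ?posLabel .
}
"""


def build_request(endpoint, query, accept="application/sparql-results+json"):
    """Build (but do not send) a SPARQL protocol GET request."""
    # The query travels as a URL-encoded "query" parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # The Accept header selects the result format.
    return Request(url, headers={"Accept": accept})


request = build_request("https://sparql.opendatahub.bz.it/sparql", QUERY)
print(request.get_method(), request.get_header("Accept"))
# Sending it would be one more call: urllib.request.urlopen(request)
```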

Your personal SPARQL database: Tracker
Tracker is the file system indexer used by the GNOME desktop, e.g. for full-text search
$ tracker sparql -q "SELECT DISTINCT ?performerName WHERE {
    ?s <http://www.tracker-project.org/temp/nmm#performer> ?performerName .
  }"
Results:
urn:artist:Yasmine%20Hamdan
urn:artist:Otfried%20Preu%C3%9Fler
urn:artist:Queens%20Of%20The%20Stone%20Age
urn:artist:Guns%20N'Roses
...

Big SPARQL endpoints: Wikidata
Wikidata stores the factual data behind Wikipedia articles
[Screenshot: a Wikipedia article, annotated "Data from Wikidata" and "Link to Wikidata entry"]

Big SPARQL endpoints: DBpedia
DBpedia extracts structured data from Wikipedia and makes it available for querying and download

Big SPARQL endpoints: datacommons.org
Operated by Google. Integrates many data sources:
★ United States Census
★ World Bank
★ US Bureau of Labor Statistics
★ Wikipedia
★ National Oceanic and Atmospheric Administration
★ Federal Bureau of Investigation
★ ...

0 KM endpoints: The Open Data Hub
Operated by NOI Techpark (https://sparql.opendatahub.bz.it/)

How can I use these endpoints for my analyses?

Command line: cURL
# countries.rq
SELECT DISTINCT ?countryLabel ?population ?area
WHERE
{
  ?country wdt:P31 wd:Q6256 .
  ?country wdt:P1082 ?population .
  ?country wdt:P2046 ?area .
  MINUS { ?country wdt:P31 wd:Q3024240 . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
}
ORDER BY DESC(?population)

$ curl -X POST https://query.wikidata.org/sparql \
    -H "Accept: text/csv" \
    --data-urlencode query@countries.rq
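The CSV that comes back can be processed with nothing but the Python standard library. The following is a sketch under assumptions: the sample rows are inlined so that it runs offline (the values are the three rows shown in the R output later in the talk), the header is assumed to match the SELECT variables of countries.rq, and the `read_countries` and `density` helpers are hypothetical.

```python
import csv
import io

# Stand-in for the response body of the cURL call (three sample rows).
SAMPLE_CSV = """countryLabel,population,area
People's Republic of China,1409517397,9596961
India,1326093247,3287263
United States of America,328239523,9826675
"""


def read_countries(text):
    """Parse SPARQL CSV results into dicts with numeric population and area."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "country": record["countryLabel"],
            # float() also accepts values written in scientific notation.
            "population": float(record["population"]),
            "area": float(record["area"]),
        })
    return rows


def density(row):
    """Inhabitants per unit of area (hypothetical helper)."""
    return row["population"] / row["area"]


countries = read_countries(SAMPLE_CSV)
for row in sorted(countries, key=density, reverse=True):
    print(f"{row['country']}: {density(row):.1f}")
```

For real use, the response text of the HTTP call would replace `SAMPLE_CSV`.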

Command line: rsparql from Apache Jena
# countries.rq
WHERE
{
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
}

$ ${JENA_DIR}/bin/rsparql --service 'https://query.wikidata.org/sparql' \
    --query countries.rq --results=CSV > countries.csv

Directly from R
library(WikidataQueryServiceR)
r <- query_wikidata('
WHERE
{
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
}
')
head(r)
# A tibble: 6 x 3
  countryLabel               population    area
  <chr>                           <dbl>   <dbl>
1 People's Republic of China 1409517397 9596961
2 India                      1326093247 3287263
3 United States of America    328239523 9826675
...

Python with the requests module
import requests

url = "https://sparql.opendatahub.bz.it/sparql"
q = """
PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT * WHERE {
  ?t a schema:PerformingArtsTheater ; geo:asWKT ?pos ; schema:name ?posLabel .
}
"""
# The Accept header (not Content-Type) selects the result format
r = requests.get(url, params={'query': q},
                 headers={'Accept': 'application/sparql-results+json'})
print(r.json())
It works, but the returned results are not directly usable as a table.
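The JSON is nested because every value comes wrapped with its type information. A minimal sketch of flattening it into rows, assuming the standard SPARQL JSON results layout (`head.vars` and `results.bindings`): the `bindings_to_rows` helper is hypothetical, and the sample payload is hand-written from the result row shown in the sparql_client example.

```python
def bindings_to_rows(payload):
    """Flatten SPARQL JSON results into one dict per result row."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Unbound variables are absent from a binding; use None for them.
        rows.append({
            name: binding[name]["value"] if name in binding else None
            for name in variables
        })
    return rows


# Hand-written sample in the SPARQL JSON results layout.
sample = {
    "head": {"vars": ["t", "pos", "posLabel"]},
    "results": {"bindings": [{
        "t": {
            "type": "uri",
            "value": "http://noi.example.org/data/poi/9621F83525089644A0D47464D27D634E",
        },
        "pos": {
            "type": "literal",
            "datatype": "http://www.opengis.net/ont/geosparql#wktLiteral",
            "value": "POINT (11.3534199999999998 46.4990740000000002)",
        },
        "posLabel": {"type": "literal", "value": "Kleinkunsttheater Carambolage"},
    }]},
}

rows = bindings_to_rows(sample)
print(rows[0]["posLabel"])
```

With Pandas installed, `pandas.DataFrame(rows)` turns the list of dicts into a table.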

Python with sparql_client
import sparql

endpoint = "https://sparql.opendatahub.bz.it/sparql"
q = """
SELECT * WHERE {
}
"""
result = sparql.query(endpoint, q)
for row in result:
    print(row)

(<IRI <http://noi.example.org/data/poi/9621F83525089644A0D47464D27D634E>>, <Literal "POINT (11.3534199999999998 46.4990740000000002)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>>, <Literal "Kleinkunsttheater Carambolage">)
...
Good, but needs some rework for Pandas
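Part of that rework is turning the WKT literal into numbers. A minimal sketch using only the standard library: the `parse_point` helper is hypothetical, it handles plain `POINT (x y)` strings only, and it assumes the usual longitude-latitude axis order of GeoSPARQL WKT literals.

```python
import re

# Matches "POINT (x y)" with optional sign and decimals; other WKT types are ignored.
_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_point(wkt):
    """Return (longitude, latitude) from a WKT POINT string, or None."""
    found = _POINT.match(wkt)
    if found is None:
        return None
    return float(found.group(1)), float(found.group(2))


lon, lat = parse_point("POINT (11.3534199999999998 46.4990740000000002)")
print(round(lon, 6), round(lat, 6))
```

The two floats can then go into separate columns of a table.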

Python with sparql-dataframe
import sparql_dataframe
endpoint = "https://sparql.opendatahub.bz.it/sparql"
q = """
SELECT * WHERE {
}
"""
df = sparql_dataframe.get(endpoint, q)
The most convenient solution for Pandas

What makes RDF / SPARQL great for data
exchange?
★ Data can really be queried, not only downloaded
★ Well-structured data with rich data models, often standardized and with good metadata
★ Data is easy to integrate
★ Technology is easy to integrate

The Team
Diego Calvanese: Scientific advisor of the board, Full professor at unibz, ACM Fellow
Benjamin Cogrel: CTO, Chair of the board
Peter Hopfgartner: CEO
Marco Montali: Scientific consultant, Assoc. professor at unibz
Guohui Xiao: Chief scientist, Jun. professor at unibz

Big SPARQL endpoints: Typical queries
# Wikidata: bands that start with "Radio"
# try it on https://query.wikidata.org
SELECT DISTINCT ?band ?bandLabel
WHERE
{
  ?band wdt:P31 wd:Q215380 .
  ?band rdfs:label ?bandLabel .
  FILTER(STRSTARTS(?bandLabel, 'Radio')) .
}

# DBpedia: facts about Joe Biden
SELECT ?property ?hasValue ?isValueOf
WHERE {
  { <http://dbpedia.org/resource/Joe_Biden> ?property ?hasValue }
  UNION
  { ?isValueOf ?property <http://dbpedia.org/resource/Joe_Biden> }
}

Evolutions of Open Data: Linked Data
❏ Use URIs to name (identify) things.
❏ Use HTTP URIs so that these things can be looked up
(interpreted, “dereferenced”).
❏ Provide useful information about what a name identifies
when it’s looked up, using open standards such as RDF,
SPARQL, etc.
❏ Refer to other things using their HTTP URI-based names
when publishing data on the Web.
Tim Berners-Lee, 2006
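Looking up ("dereferencing") an HTTP URI is again a plain HTTP request, with the Accept header asking for an RDF format instead of an HTML page. A hedged sketch with the standard library: the `dereference_request` helper is hypothetical, the request is built but not sent, and whether a given server honours `text/turtle` is an assumption.

```python
from urllib.request import Request


def dereference_request(uri, accept="text/turtle"):
    """Build (but do not send) a GET request asking for an RDF description of uri."""
    return Request(uri, headers={"Accept": accept})


request = dereference_request("http://dbpedia.org/resource/Joe_Biden")
print(request.full_url, request.get_header("Accept"))
# urllib.request.urlopen(request) would perform the lookup and follow redirects.
```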
