Industrialized Linked Data

     Dave Reynolds, Epimorphics Ltd
                            @der42
Context: public sector Linked Data

Linked Data journey ...

    explore
   what is linked data?
   what use is it for us?

   self-describing: carries semantics with it; annotate and explain; data in context ...
   integration: comparable; slice and dice; web API ...

   what’s involved?
Linked Data journey ...

    explore → pilot

    [diagram: pipeline of data → model → convert → publish → apply]

Photo of The Thinker © dSeneste.dk@flickr CC BY

Linked Data journey ...

    explore → pilot → routine?
Great pilot but ...
 can we reduce the time and cost?
 how do we handle changes and updates?
 how can we make the published data easier to use?


How do we make Linked Data “business as usual”?
Example case study: Environment Agency
   monitoring of bathing
    water quality
   static pilot
   live pilot
       historic annual
        assessments
       weekly assessments
   operational system
       additional data feeds
       live update
       integrated API
       data explorer
From pilot to practice
   reduce modelling costs
        patterns (dive 1)
       reuse
   handling change and update
       patterns
       publication process
   automation
       conversion
       publication
   embed in the business process
       use internally as well as externally
       publish once, use many
       data platform
Reduce costs - modelling
1. Don’t do it
     map source data into isomorphic RDF, synthesize URIs
     loses some of the value proposition
2. Reuse existing ontologies intact or mix-and-match
     best solution when available
     W3C GLD work on vocabularies – people, organizations,
      datasets ...
3. Reusable vocabulary patterns
     example:
         Data cube plus reference URI sets
         adaptable to broad range of data – environmental, statistical,
          financial ...
Reusable patterns: Data cube
   Much public sector data has regularities
       set of measures
            observations, forecasts, budgets, assessments, statistics ...




[diagram: example measures, mixing numeric readings (>0.1, 34, 27, 125) with quality classifications (excellent, good, poor)]
Reusable patterns: Data cube
   Much public sector data has regularities
       sets of measures
           observations, forecasts, budgets, assessments, estimates ...
       organized along some dimensions
           region, agency, time, category, cost centre ...




[diagram: a spend measure organized along objective code, cost centre and time dimensions, with example values 12/15/25, 8/9/11 and 120/130/180]
Reusable patterns: Data cube
   Much public sector data has regularities
       sets of measures
           observations, forecasts, budgets, assessments, estimates ...
       organized along some dimensions
           region, agency, time, category, cost centre ...
       interpreted according to attributes
           units, multipliers, status

[diagram: the same spend cube with attributes: values carry units and multipliers ($12k, $120k, ...) and a status (provisional, final)]
Data cube vocabulary
Data cube pattern
   Pattern, not a fixed ontology
       customize by selecting measures, dimensions and attributes
       originated in publishing of statistics
       applied to environment measurements, weather forecasts, budgets
        and spend, quality assessments, regional demographics ...
   Supports reuse
       widely reusable URI sets – geography, time periods, agencies, units
       organization-wide sets
       modelling often only requires small increments on top of core
        pattern and reusable components
   opens door for reusable visualization tools
   standardization through W3C GLD
Application to case study
   Data Cubes for water quality measurement
       in-season weekly assessments
       end of season annual assessments
   dimensions:
       time intervals – UK reference time service
       location - reference URI set for bathing waters and sampling points
   cubes can reuse these dimensions
       just need to define specific measures
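As a Turtle sketch (the bwq: terms follow the deck's examples; the namespace URI and component names are illustrative assumptions), the customization really is a small increment: a Data Structure Definition that reuses the shared dimensions and adds one dataset-specific measure.

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix bwq: <http://environment.data.gov.uk/def/bathing-water-quality/> .

<#annual-compliance-dsd> a qb:DataStructureDefinition ;
    qb:component
        [ qb:dimension bwq:sampleYear ] ,     # reused time dimension
        [ qb:dimension bwq:bathingWater ] ,   # reused location dimension
        [ qb:measure   bwq:classification ] . # dataset-specific measure
```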
From pilot to practice
   reduce modelling costs
       patterns
       reuse
   handling change and update
        patterns (dive 2)
       publication process
   automation
       conversion
       publication
   embed in the business process
       use internally as well as externally
       publish once, use many
       data platform
Handling change
   critical challenge
       most initial pilots choose a snapshot dataset
           and go stale, fast
       understanding the nature of data updates and how to handle
        them is critical to successful scaling to business as usual
   types of change
       new data related to different time period
       corrections to data
       entities change
           properties
           identity
Modelling change
1. Individual data items relate to new time period
Pattern: n-ary relation
        observation resource relates value to time period and other context
        use Data Cube dimensions for this
[diagram: Clevedon Beach (http://environment.data.gov.uk/id/bathing-water/ukk1202-36000) linked via bwq:bathingWater to one observation per year; each observation carries a bwq:sampleYear (http://reference.data.gov.uk/id/year/2009, .../2010, .../2011) and a bwq:classification (Higher, Minimum, Higher)]

History or latest?
        latest is non-monotonic but helpful for many practical uses
             materialize (SPARQL Update), implement in query, implement in API
        choice whether to keep history as well
             water quality v. weather forecasts
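In Turtle the n-ary pattern might look like this (prefixes and observation URIs are illustrative assumptions based on the deck's examples): each year's classification lives on its own observation resource, so a new period simply adds a new observation.

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix bwq: <http://environment.data.gov.uk/def/bathing-water-quality/> .

# one observation resource per sample year; new years add new observations
<urn:obs-2010> a qb:Observation ;
    bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
    bwq:sampleYear   <http://reference.data.gov.uk/id/year/2010> ;
    bwq:classification bwq:Minimum .

<urn:obs-2011> a qb:Observation ;
    bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
    bwq:sampleYear   <http://reference.data.gov.uk/id/year/2011> ;
    bwq:classification bwq:Higher .
```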
Modelling change
2. Corrections
   patterns
        silent change (!)
        explicit replacement
             API level hides replaced values but SPARQL query can retrieve & trace
        explicit change event

[diagram: the 2011 assessment for Clevedon Beach; a new observation (classification: Higher) dct:replaces the withdrawn one (classification: Minimum; status: replaced; reason: reanalysis), which points forward via dct:isReplacedBy; an analysis event records ev:before, ev:after, ev:occurredOn and ev:agent]
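A minimal Turtle sketch of the explicit-replacement pattern (dct: is Dublin Core terms; the observation URIs and the status/reason bookkeeping properties are illustrative assumptions):

```turtle
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bwq: <http://environment.data.gov.uk/def/bathing-water-quality/> .

# the corrected value replaces the withdrawn one
<urn:obs-2011-v2> bwq:classification bwq:Higher ;
    dct:replaces <urn:obs-2011-v1> .

# the withdrawn value stays retrievable, marked as replaced
<urn:obs-2011-v1> bwq:classification bwq:Minimum ;
    dct:isReplacedBy <urn:obs-2011-v2> ;
    bwq:status "replaced" ;    # illustrative bookkeeping properties
    bwq:reason "reanalysis" .
```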
Modelling change
3. Mutation
   Infrequent change of properties, essential identity remains
     e.g. renaming a school, adding another building
     routine accesses see property value, not function of time
   patterns
     in place update
     named graphs
           current graph + graphs for each previous state + meta-graph
       explicit versioning with open periods
Modelling change
3. Mutation
explicit versioning with open periods
[diagram: an endurant bathing-water resource with two dct:hasVersion links; one version labelled "Clevedon Beach" with a dct:valid interval that time:intervalStarts in 2003 and time:intervalFinishes in 2011, and one labelled "Clevedon Sands" with an open dct:valid interval that time:intervalStarts in 2011]
     find right version by query on validity interval
     simplify use through
         non-monotonic “latest value” link
         API to implement query filters automatically
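Sketched in Turtle (following the slide's vocabulary usage; the resource URIs are illustrative assumptions), the endurant carries version links and each version carries a validity interval, with the latest interval left open:

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix time: <http://www.w3.org/2006/time#> .

<urn:clevedon> dct:hasVersion <urn:clevedon-v1> , <urn:clevedon-v2> .

<urn:clevedon-v1> rdfs:label "Clevedon Beach" ;
    dct:valid [ time:intervalStarts   <http://reference.data.gov.uk/id/year/2003> ;
                time:intervalFinishes <http://reference.data.gov.uk/id/year/2011> ] .

<urn:clevedon-v2> rdfs:label "Clevedon Sands" ;
    dct:valid [ time:intervalStarts <http://reference.data.gov.uk/id/year/2011> ] . # open period
```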
Application to case study
   weekly and annual samples
       use Data Cube pattern (n-ary relation)
   withdrawn samples
       replacement pattern (no explicit change event)
       Data Cube slice for “latest valid assessment”
           generated by a SPARQL Update query
       API gives easy access to the latest valid values
        following the linked data, or a raw SPARQL query, allows drilling into the changes
   changes to bathing water profile
       versioning pattern
       bathing water entity points to latest profile (SPARQL Update again)
From pilot to practice
   reduce modelling costs
       patterns
       reuse
   handling change and update
       patterns
       publication process
   automation
        conversion (dive 3)
       publication
   embed in the business process
       use internally as well as externally
       publish once, use many
       data platform
Automation
Transform and publish data feed increments
    transformation engine service
    reusable mappings, low cost to adapt to new feeds
    linking to reference data
    publication service that supports non-monotonic changes




[diagram: data increments (csv) → transform service (driven by reusable xform specs, with a reconciliation service and reference data) → publication service → replicated publication servers]
Transformation service
   declarative specification of transform
       a single service supports a range of transformations
       easy to adapt transformation to new feeds and modelling
        changes
   R2RML – RDB to RDF Mapping Language
       specify mapping from database tables to RDF triples
       W3C candidate recommendation
   D2RML
       R2RML extension to treat CSV feed as a database table
Small D2RML example
:dataSource a dr:CSVDataSource ;
  rdfs:label "dataSource" .

:bathingWaterTermMap a dr:SubjectMap;
  dr:template "http://environment.data.gov.uk/id/bathing-water/{EUBWID2}" ;
  dr:class def-bw:BathingWater .

:bathingWaterMap
  dr:logicalTable :dataSource ;
  dr:subjectMap   :bathingWaterTermMap ;

  dr:predicateObjectMap [
    dr:predicate rdfs:label ;
    dr:objectMap [ dr:column "description_english" ; dr:language "en" ] ] ;

  dr:predicateObjectMap [
    dr:predicate def-bw:eubwidNotation ;
    dr:objectMap [ dr:column "EUBWID2" ; dr:datatype def-bw:eubwid ] ] .
Using patterns
   verbosity is a problem; it increases the cost of reuse
   extend to support modelling patterns
   Data Cube
       specify mapping to observation with measures and dimensions
       engine generates Data Set and Data Structure Definition
        automatically
D2RML cube map example
:dataCubeMap a dr:DataCubeMap ;
    rr:logicalTable "dataSource" ;
    dr:datasetIRI "http://example.org/datacube1"^^xsd:anyURI ;
    dr:dsdIRI "http://example.org/myDsd"^^xsd:anyURI ;

    dr:observationMap [
      # instances will automatically link to the base Data Set
      rr:subjectMap [
        rr:termType rr:IRI ;
        rr:template "http://example.org/observation/{PLACE}/{DATE}" ] ;
      # implies an entry in the auto-generated Data Structure Definition
      rr:componentMap [
        dr:componentType qb:measure ;
        rr:predicate aq:concentration ;
        # defines how the measure value is to be represented
        rr:objectMap [ rr:column "NO2" ; rr:datatype xsd:decimal ] ]
      ] ;
    ...
But what about linking?
   connect observations to reference data
       a core value of linked data
   R2RML has Term Maps to create values
       constants and templates
   extend to allow maps based on other data sources
       Lookup map
           lookup resource in a store, fetch predicate
       Reconcile
           specify lookup in a remote service
           use Google Refine reconciliation API
Automation
Transform and publish data feed increments
    transformation engine service ✓
    reusable mappings, low cost to adapt to new feeds ✓
    linking to reference data ✓
    publication service that supports non-monotonic changes




Publication service
   goals
       cope with non-monotonic effects of change representation
       so replication is robust and cheap (=> make it idempotent)
   solution
       SPARQL Update
       publish transformed increment as a simple DATA INSERT
       then run SPARQL Update script for non-monotonic links
            dct:isReplacedBy links
            latest value slices
Sample update script
DELETE {
  ?bw bwq:latestComplianceAssessment ?o .
} WHERE {
  ?bw bwq:latestComplianceAssessment ?o .
} ;

INSERT {
  ?bw bwq:latestComplianceAssessment ?o .
} WHERE {
  {
    ?slice a bwq:ComplianceByYearSlice ;
           bwq:sampleYear [ interval:ordinalYear ?year ] .
    OPTIONAL {
      ?slice2 a bwq:ComplianceByYearSlice ;
              bwq:sampleYear [ interval:ordinalYear ?year2 ] .
      FILTER (?year2 > ?year)
    }
    FILTER ( !bound(?slice2) )
  }
  ?slice qb:observation ?o .
  ?o bwq:bathingWater ?bw .
}
Automation
Transform and publish data feed increments
    transformation engine service ✓
    reusable mappings, low cost to adapt to new feeds ✓
    linking to reference data ✓
    publication service that supports non-monotonic changes ✓




Application to case study
   Update server
       transforms based on scripts (earlier scripting utility)
       linking to reference data
       distributed publication via
        SPARQL Update
       extensible range of data sets
             annual assessments
             in-season assessments
             bathing water profile
             features (e.g. pollution sources)
             reference data
From pilot to practice
   reduce modelling costs
       patterns
       reuse
   handling change and update
       patterns
       publication process
   automation
       conversion
       publication
   embed in the business process (dive 4)
       use internally as well as externally
       publish once, use many
       data platform
Embed in business process
 embedding is critical to ensure data kept up to date
 in turn needs usage
=> lower barrier to use

[diagram: virtuous cycle in which rich, up-to-date data drives external and internal use, which justifies further investment; without use, the data is not used, goes stale, and investment is hard to justify]
Lowering barrier to use
   simple REST APIs
       use Linked Data API specification
       rich query without learning SPARQL
       easy consumption as JSON, XML
       gets developers used to data and data model
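For example, a hypothetical item request against such an API (the URL and response shape are illustrative sketches of the Linked Data API's JSON rendering, not the actual Environment Agency endpoint):

```
GET /doc/bathing-water/ukk1202-36000.json

{ "result": {
    "primaryTopic": {
      "name": "Clevedon Beach",
      "latestComplianceAssessment": { "classification": "Higher" }
    }
} }
```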
[diagram: transform service → publication service → LD API]
Application to case study
   embedded in process for weekly/daily updates
   infrastructure to automate conversion and publishing
   API plus extensive developer documentation
   third party and in-house applications built over API




   publish once, use many
   information products as applications over a data platform,
    usable externally as well as internally
The next stage
   grow range of data publications and uses
       a growing range of reference data and datasets brings new challenges
       discover reference terms and models to reuse
       discover datasets to use for application
       discover models and links between sets
   needs a coordination or registry service
   story for another day ...
Conclusions
   illustrated how public sector users of linked data are moving
    from static pilots to operational systems
   keys are:
       reduce modelling costs through patterns and reuse
       design for continuous update
       automation of publication using declarative mappings and
        SPARQL Update
       lower barrier to use through API design and documentation
       embed in organization’s process so the data is used and useful
Acknowledgements
Only possible thanks to many smart colleagues: Stuart
Williams, Andy Seaborne, Ian Dickinson, Brian McBride,
Chris Dollin
plus Alex Coley and team from the Environment Agency

 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Último (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Industrialized Linked Data

  • 1. Industrialized Linked Data Dave Reynolds, Epimorphics Ltd @der42
  • 3. Linked Data journey ...
       explore
       - what is linked data?
       - what use is it for us?
  • 4. Linked Data journey ...
       explore
       - what is linked data?
       - what use is it for us?
       self-describing: carries semantics with it; annotate and explain; data in context; ...
       integration: comparable; slice and dice; web API; ...
  • 5. Linked Data journey ...
       explore
       - what is linked data?
       - what use is it for us?
       self-describing: carries semantics with it; annotate and explain; data in context; ...
       integration: comparable; slice and dice; web API; ...
       - what’s involved?
  • 6. Linked Data journey ...
       explore → pilot: data → model → convert → publish → apply
       (Photo of The Thinker © dSeneste.dk@flickr, CC BY)
  • 7. Linked Data journey ...
       explore → pilot → routine?
       Great pilot, but ...
       - can we reduce the time and cost?
       - how do we handle changes and updates?
       - how can we make the published data easier to use?
       How do we make Linked Data “business as usual”?
  • 8. Example case study: Environment Agency
       - monitoring of bathing water quality
       - static pilot
       - live pilot
         - historic annual assessments
         - weekly assessments
       - operational system
         - additional data feeds
         - live update
         - integrated API
         - data explorer
  • 9. From pilot to practice
       - reduce modelling costs
         - patterns (dive 1)
         - reuse
       - handling change and update
         - patterns
       - publication process
         - automation
         - conversion
         - publication
       - embed in the business process
         - use internally as well as externally
         - publish once, use many
         - data platform
  • 10. Reduce costs - modelling
        1. Don’t do it
           - map source data into isomorphic RDF, synthesize URIs
           - loses some of the value proposition
        2. Reuse existing ontologies, intact or mix-and-match
           - best solution when available
           - W3C GLD work on vocabularies – people, organizations, datasets ...
        3. Reusable vocabulary patterns
           - example: Data Cube plus reference URI sets
           - adaptable to a broad range of data – environmental, statistical, financial ...
  • 11. Reusable patterns: Data cube
        Much public sector data has regularities
        - sets of measures
          - observations, forecasts, budgets, assessments, statistics ...
        (illustration: a grid of example measure values)
  • 12. Reusable patterns: Data cube
        Much public sector data has regularities
        - sets of measures
          - observations, forecasts, budgets, assessments, estimates ...
        - organized along some dimensions
          - region, agency, time, category, cost centre ...
        (illustration: spend figures laid out by objective code, cost centre and time)
  • 13. Reusable patterns: Data cube
        Much public sector data has regularities
        - sets of measures
          - observations, forecasts, budgets, assessments, estimates ...
        - organized along some dimensions
          - region, agency, time, category, cost centre ...
        - interpreted according to attributes
          - units, multipliers, status
        (illustration: the same spend figures, now marked provisional or final and expressed in $k)
  • 15. Data cube pattern
        - a pattern, not a fixed ontology
          - customize by selecting measures, dimensions and attributes
          - originated in the publishing of statistics
          - applied to environment measurements, weather forecasts, budgets and spend, quality assessments, regional demographics ...
        - supports reuse
          - widely reusable URI sets – geography, time periods, agencies, units
          - organization-wide sets
          - modelling often only requires small increments on top of the core pattern and reusable components
          - opens the door for reusable visualization tools
          - standardization through W3C GLD
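To make the pattern concrete, a single customized observation might look like the following Turtle sketch. This is an illustration, not the project's actual data: the ex: dataset, dimension and measure terms are invented, while the bathing-water and year URIs follow the reference URI sets used in the case study.

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/def/bwq#> .

# One observation in a customized cube: the dataset it belongs to, two
# reused dimensions (bathing water, sample year), and one local measure.
ex:obs-ukk1202-36000-2011 a qb:Observation ;
    qb:dataSet        ex:annual-compliance ;
    ex:bathingWater   <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
    ex:sampleYear     <http://reference.data.gov.uk/id/year/2011> ;
    ex:classification ex:Higher .
```

Only the ex:classification measure is dataset-specific; the two dimension URI sets are exactly the reusable components the slide describes.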
  • 16. Application to case study
        - Data Cubes for water quality measurement
          - in-season weekly assessments
          - end-of-season annual assessments
        - dimensions:
          - time intervals – UK reference time service
          - location – reference URI set for bathing waters and sample points
        - cubes can reuse these dimensions
          - just need to define the specific measures
  • 17. From pilot to practice
        - reduce modelling costs
          - patterns
          - reuse
        - handling change and update
          - patterns (dive 2)
        - publication process
          - automation
          - conversion
          - publication
        - embed in the business process
          - use internally as well as externally
          - publish once, use many
          - data platform
  • 18. Handling change
        - a critical challenge
          - most initial pilots choose a snapshot dataset – and go stale, fast
          - understanding the nature of data updates, and how to handle them, is critical to scaling successfully to business as usual
        - types of change
          - new data relating to a different time period
          - corrections to data
          - entities change
            - properties
            - identity
  • 19. Modelling change 1. Individual data items relate to a new time period
        Pattern: n-ary relation
        - an observation resource relates a value to a time period and other context
        - use Data Cube dimensions for this
        (diagram: http://environment.data.gov.uk/id/bathing-water/ukk1202-36000 “Clevedon Beach”, linked via bwq:sampleYear and bwq:classification to Higher for http://reference.data.gov.uk/id/year/2009, Minimum for .../year/2010 and Higher for .../year/2011)
        History or latest?
        - “latest” is non-monotonic but helpful for many practical uses
        - materialize (SPARQL Update), implement in the query, or implement in the API
        - choice whether to keep history as well
          - water quality vs. weather forecasts
  • 20. Modelling change 2. Corrections
        patterns:
        - silent change (!)
        - explicit replacement
          - the API level hides replaced values, but a SPARQL query can retrieve and trace them
        - explicit change event
        (diagram: for “Clevedon Beach” in 2011, a replaced classification (Minimum, status: replaced) is linked by dct:isReplacedBy / dct:replaces to the new classification (Higher); an analysis event with reason: reanalysis carries ev:before, ev:after, ev:occuredOn and ev:agent links)
  • 21. Modelling change 3. Mutation
        - infrequent change of properties; essential identity remains
          - e.g. renaming a school, adding another building
        - routine accesses see the property value, not a function of time
        - patterns:
          - in-place update
          - named graphs
            - current graph + graphs for each previous state + meta-graph
          - explicit versioning with open periods
  • 22. Modelling change 3. Mutation – explicit versioning with open periods
        (diagram: an endurant with dct:hasVersion links to two versions: “Clevedon Beach”, whose dct:valid interval starts in 2003 and finishes in 2011, and “Clevedon Sands”, whose dct:valid interval starts in 2011 and is left open)
        - find the right version by querying on the validity interval
        - simplify use through
          - a non-monotonic “latest value” link
          - an API to implement the query filters automatically
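A query over the validity intervals might look like the following sketch. The property names follow the slide's diagram, but the direction of the time: links and the bare year values are simplifications for illustration; the FILTER treats a missing time:intervalFinishes as an open (still current) period, which is the key trick of the pattern.

```sparql
# Sketch: find the version of an entity that was valid in 2010.
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX time: <http://www.w3.org/2006/time#>

SELECT ?version WHERE {
  ?endurant dct:hasVersion ?version .
  ?version  dct:valid ?interval .
  ?interval time:intervalStarts ?start .
  OPTIONAL { ?interval time:intervalFinishes ?end }
  # valid if it started by 2010 and either has not finished or finished later
  FILTER (?start <= 2010 && (!bound(?end) || ?end > 2010))
}
```

In the diagram's data this would select the “Clevedon Beach” version; an API layer can apply this filter automatically so most clients never write it.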
  • 23. Application to case study
        - weekly and annual samples
          - use the Data Cube pattern (n-ary relation)
        - withdrawn samples
          - replacement pattern (no explicit change event)
          - a Data Cube slice for the “latest valid assessment”, generated by a SPARQL Update query
          - the API gives easy access to the latest valid values
          - linked-data following, or a raw SPARQL query, allows drilling into the changes
        - changes to the bathing water profile
          - versioning pattern
          - the bathing water entity points to the latest profile (SPARQL Update again)
  • 24. From pilot to practice
        - reduce modelling costs
          - patterns
          - reuse
        - handling change and update
          - patterns
        - publication process
          - automation
          - conversion (dive 3)
          - publication
        - embed in the business process
          - use internally as well as externally
          - publish once, use many
          - data platform
  • 25. Automation
        Transform and publish data feed increments
        - transformation engine service
          - reusable mappings, low cost to adapt to new feeds
          - linking to reference data
        - publication service that supports non-monotonic changes
        (diagram: data increments (csv) → transform service, driven by xform specs and reconciliation against reference data → publication service → replicated publication servers)
  • 26. Transformation service
        - declarative specification of the transform
          - a single service supports a range of transformations
          - easy to adapt a transformation to new feeds and modelling changes
        - R2RML – RDB to RDF Mapping Language
          - specifies a mapping from database tables to RDF triples
          - W3C candidate recommendation
        - D2RML
          - an R2RML extension that treats a CSV feed as a database table
  • 27. Small D2RML example

        :dataSource a dr:CSVDataSource ;
            rdfs:label "dataSource" .

        :bathingWaterTermMap a dr:SubjectMap ;
            dr:template "http://environment.data.gov.uk/id/bathing-water/{EUBWID2}" ;
            dr:class def-bw:BathingWater .

        :bathingWaterMap
            dr:logicalTable :dataSource ;
            dr:subjectMap :bathingWaterTermMap ;
            dr:predicateObjectMap [
                dr:predicate rdfs:label ;
                dr:objectMap [ dr:column "description_english" ; dr:language "en" ]
            ] ;
            dr:predicateObjectMap [
                dr:predicate def-bw:eubwidNotation ;
                dr:objectMap [ dr:column "EUBWID2" ; dr:datatype def-bw:eubwid ]
            ] .
  • 28. Using patterns
        - verbosity is a problem and increases reuse costs
        - extend D2RML to support modelling patterns
        - Data Cube
          - specify a mapping to an observation with measures and dimensions
          - the engine generates the Data Set and Data Structure Definition automatically
  • 29. D2RML cube map example

        :dataCubeMap a dr:DataCubeMap ;
            rr:logicalTable "dataSource" ;
            dr:datasetIRI "http://example.org/datacube1"^^xsd:anyURI ;
            dr:dsdIRI "http://example.org/myDsd"^^xsd:anyURI ;
            # instances will automatically link to the base Data Set
            dr:observationMap [
                rr:subjectMap [
                    rr:termType rr:IRI ;
                    rr:template "http://example.org/observation/{PLACE}/{DATE}"
                ] ;
                # implies an entry in the auto-generated Data Structure Definition
                rr:componentMap [
                    dr:componentType qb:measure ;
                    rr:predicate aq:concentration ;
                    # defines how the measure value is to be represented
                    rr:objectMap [ rr:column "NO2" ; rr:datatype xsd:decimal ]
                ]
            ] ;
            ...
  • 30. But what about linking?
        - connect observations to reference data
          - a core value of linked data
        - R2RML has Term Maps to create values
          - constants and templates
        - extend to allow maps based on other data sources
          - Lookup map: look up a resource in a store, fetch a predicate
          - Reconcile: specify a lookup in a remote service, using the Google Refine reconciliation API
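As a purely hypothetical sketch of what such extensions could look like: the slide names the Lookup and Reconcile map types but does not show their syntax, so the dr:LookupMap / dr:ReconcileMap terms, all of their properties, and the service URLs below are invented for illustration only.

```turtle
# HYPOTHETICAL syntax, following the style of the D2RML examples above.

# Lookup map: find the resource whose def-bw:siteCodeNotation matches the
# value in the SITE_CODE column of the CSV feed, in a local reference store.
:siteTermMap a dr:LookupMap ;
    dr:lookupStore    <http://example.org/reference-data> ;
    dr:lookupColumn   "SITE_CODE" ;
    dr:lookupProperty def-bw:siteCodeNotation .

# Reconcile map: resolve the REGION_NAME column against a remote service
# that speaks the Google Refine reconciliation API.
:regionTermMap a dr:ReconcileMap ;
    dr:reconcileService <http://example.org/reconcile> ;
    dr:reconcileColumn  "REGION_NAME" .
```

Either map would then be used as the object map of a dr:predicateObjectMap, so that observations link to reference-data URIs instead of repeating literal codes.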
  • 31. Automation
        Transform and publish data feed increments
        - transformation engine service ✓
          - reusable mappings, low cost to adapt to new feeds ✓
          - linking to reference data ✓
        - publication service that supports non-monotonic changes
        (diagram as on slide 25)
  • 32. Publication service
        - goals
          - cope with the non-monotonic effects of the change representation
          - so that replication is robust and cheap (=> make it idempotent)
        - solution
          - SPARQL Update
          - publish each transformed increment as a simple DATA INSERT
          - then run a SPARQL Update script for the non-monotonic links
            - dct:isReplacedBy links
            - “latest value” slices
  • 33. Sample update script

        DELETE { ?bw bwq:latestComplianceAssessment ?o . }
        WHERE  { ?bw bwq:latestComplianceAssessment ?o . } ;

        INSERT { ?bw bwq:latestComplianceAssessment ?o . }
        WHERE {
          { ?slice a bwq:ComplianceByYearSlice ;
                   bwq:sampleYear [ interval:ordinalYear ?year ] .
            OPTIONAL {
              ?slice2 a bwq:ComplianceByYearSlice ;
                      bwq:sampleYear [ interval:ordinalYear ?year2 ] .
              FILTER (?year2 > ?year)
            }
            FILTER ( !bound(?slice2) )
          }
          ?slice qb:observation ?o .
          ?o bwq:bathingWater ?bw .
        }
  • 34. Automation
        Transform and publish data feed increments
        - transformation engine service ✓
          - reusable mappings, low cost to adapt to new feeds ✓
          - linking to reference data ✓
        - publication service that supports non-monotonic changes ✓
        (diagram as on slide 25)
  • 35. Application to case study
        - update server
          - transforms based on scripts (an earlier scripting utility)
          - linking to reference data
          - distributed publication via SPARQL Update
        - extensible range of data sets
          - annual assessments
          - in-season assessments
          - bathing water profile
          - features (e.g. pollution sources)
          - reference data
  • 36. From pilot to practice
        - reduce modelling costs
          - patterns
          - reuse
        - handling change and update
          - patterns
        - publication process
          - automation
          - conversion
          - publication
        - embed in the business process (dive 4)
          - use internally as well as externally
          - publish once, use many
          - data platform
  • 37. Embed in the business process
        - embedding is critical to ensure the data is kept up to date
        - that in turn needs usage => lower the barrier to use
        (diagram: two loops – if the data is not used it goes stale and investment is hard to justify; external and internal use keep the data rich and up to date, justifying investment)
  • 38. Lowering the barrier to use
        - simple REST APIs
          - use the Linked Data API specification
          - rich query without learning SPARQL
          - easy consumption as JSON or XML
          - gets developers used to the data and the data model
        (diagram: transform service → publication service → LD API)
  • 39. Application to case study
        - embedded in the process for weekly/daily updates
        - infrastructure to automate conversion and publishing
        - API plus extensive developer documentation
        - third-party and in-house applications built over the API
        - publish once, use many
          - information products as applications over a data platform, usable externally as well as internally
  • 40. The next stage
        - grow the range of data publications and uses
        - a range of reference data and sets brings new challenges
          - discover reference terms and models to reuse
          - discover datasets to use for an application
          - discover models and links between sets
        - needs a coordination or registry service
          - a story for another day ...
  • 41. Conclusions
        - illustrated how public sector users of linked data are moving from static pilots to operational systems
        - the keys are:
          - reduce modelling costs through patterns and reuse
          - design for continuous update
          - automate publication using declarative mappings and SPARQL Update
          - lower the barrier to use through API design and documentation
          - embed in the organization’s process so the data is used and useful
        Acknowledgements
        Only possible thanks to many smart colleagues: Stuart Williams, Andy Seaborne, Ian Dickinson, Brian McBride, Chris Dollin, plus Alex Coley and team from the Environment Agency