Presentation by Al Hamilton and Cody Johnson to the Canberra Semantic Web Meetup Group on why producers of official statistics are interested in the Semantic Web community (including Linked Open Data), outlining experimental work by Cody Johnson on transforming selected Population Census data released by the ABS from SDMX-ML to the RDF Data Cube Vocabulary format.
2. Outline
I. Context
– Transforming national & international statistical systems
– Semantic Web / Linked Data meets Official Statistics
– SemStats 2013
– Parameters for the R&D project
II. Investigation of existing tools
III. Summary of the transformation process
IV. Lessons learned
V. Discussion
3. 2009 (Australia)
• The case for an international statistical innovation program
Transforming national and international statistics systems
• Future capabilities
1. From static data products to “common information services”
2. From publications to communication
3. Support for transaction data flowing at a much higher volume
4. Ability to rapidly incorporate new issues and views of data into standards and classifications
5. ‘Rapid-response’ capability
6. Connecting processes and passing metadata and data easily between them
7. Analysing assemblies of data
4. The Challenges
• Increasing cost & difficulty of acquiring survey data
• New sources & changing expectations
• Rapid changes in the environment
• Competition for skilled resources
• Diminishing budgets
• Riding the big data wave
5. HLG
• High-Level Group for the Modernisation of Statistical Production and Services
• Comprises 10 heads of national and international statistical organisations
– Gosse van der Veen (Netherlands) - Chairman
– Brian Pink (Australia)
– Eduardo Sojo Garza-Aldape (Mexico)
– Enrico Giovannini (Italy)
– Woo, Ki-Jong (Republic of Korea)
– Irena Križman (Slovenia)
– Katherine Wallman (United States)
– Walter Radermacher (Eurostat)
– Martine Durand (OECD)
– Lidia Bratanova (UNECE)
The official statistics industry and its place in the wider information industry
From “Strategy to implement the vision of the HLG” (2012)
6. Grouping the challenges
1. Product Challenge – Modernising Statistical Services
• Designing and delivering new and better statistical outputs (products and services)
2. Process Challenge – Modernising Statistical Production
• Developing and implementing new and better production processes and methods which are capable of delivering statistical outputs with
i. reduced cost, and
ii. greater flexibility.
7. HLG Strategy
• Standards-based, collaborative modernisation of official statistics.
• Create an environment (eg “common architecture”) that facilitates collaborative development, sharing and reuse of
– statistical business processes
– statistical methods
– IT components
– data repositories
• Explicit role for
– common conceptual frameworks, eg
• GSIM (Generic Statistical Information Model)
– and common implementation standards, eg
• SDMX (Statistical Data and Metadata eXchange), working with
• DDI (Data Documentation Initiative)
8. ABS main data services support SDMX
• ABS.Stat Beta
– Dissemination from predefined aggregate data cubes
• eg Consumer Price Index
– Featured at GovHack 2013
– Based on OECD.Stat
• Now used by OECD, IMF, UNESCO, European Commission, ABS, Statistics New Zealand, Statistics Italy
• Further development through SIS Collaboration Community
• TableBuilder
– Dissemination of on-demand tabulations from microdata
• Includes Population Census
9. Harnessing the opportunities
• Global community around SDMX
– intersects with SIS Collaboration Community
• Working on
– SDMX to JSON (JavaScript Object Notation)
• Making life easier for third party developers
– No need to parse SDMX-ML
• Object model similar to Data Cube Vocabulary (DCV)
• Expected to be released for review in September
– SDMX to Data Cube Vocabulary (DCV)
• Much earlier stage within SIS Collaboration Community
10. Layering standards on standards
• RDF Data Cube Vocabulary (DCV) developed under W3C
– designed for publishing multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts
– based upon the approach used by the SDMX ISO standard for statistical data exchange
– very general and can be used for other data sets such as survey data, spreadsheets and OLAP data cubes
11. Use of DCV
• Usage within
– data.gov.uk
– Eurostat
– Other institutions within the European Union via the EU’s Open Data Portal
• eg European Environment Agency
– Experimental use within data.gov.au
12. Linked Data view on Official Statistics
• Official Statistics and the Practice of Data Fidelity
– Official statistics are the “crown jewels” of a nation’s public data
– Provide empirical evidence for policy making and economic research
– Statistical offices are among the most “data-savvy” organisations in government
– Handling of Statistical Data as Linked Data requires particular attention to maintain its integrity and fidelity
• Linked SDMX Data
– Challenges
• Automation of the transformation of data from high-profile statistical organizations
• Lossless transformations, with minimal third-party interpretation of the source data and metadata
13. (Unofficial) view from Official Statistics
• Semantic Statistics opportunities include:
– external application of statistical classifications, and other statistical concept schemes, as ontologies
– simpler, more flexible and more powerful use of statistical data alongside other data
– partnering more closely with other “data” communities
• Semantic Statistics issues and risks include
– ensuring production process is sustainable
– ensuring semantics are identified consistently across all statistical outputs from a single agency
– possible lack of rigour when defining and linking concepts to outputs from other sources
– the possibility of “fuzzy” semantics leading to incorrect data analyses
14. SemStats 2013
• Interest in “Semantic Statistics” is growing rapidly within the Statistical and Semantic Web communities
• There are existing semantic web developments building on both SDMX and DDI
• SemStats 2013 provides a rare opportunity to interact with world experts while they’re in Australia
• We are interested in what entrants might create and demonstrate for the SemStats 2013 Challenge
15. SemStats 2013 Challenge
• Provides Australian and French Census data in Data Cube Vocabulary (DCV) format
– Data is Geography x Sex x Age x “Activity” status
– Entrants are asked to demonstrate value from innovative application of semantic web technologies to the data.
16. Aim when preparing Australian content
• use as an opportunity for practical learning
• start with SDMX-ML (not, eg, CSV) (if possible)
– Plan A: SDMX-ML from TableBuilder
• use existing international tools for SDMX-ML to DCV transformations (if possible)
• do the work within the ABS (if possible)
• Plan B was to ask INSEE (Statistics France) to help us with the transformation
17. Investigation
• Datalift
– Supports multiple input types
– Generic transformation
– Supports dissemination to the web
• Mimas
– XSLT based
– Complicated
• Guillaume report
– From INSEE
– Highly tailored to the input data
18. Datalift
• Free to use – source code also available
• Java web application
• Supports multiple input types
– Semantic graphs
– Relational databases
– Files (CSV, XML, etc)
• Supports entire cycle
– INSEE plan to use in future
• SDMX -> DCV plug-in in development
19. Mimas
• Inflexible
– XML input only
– XML output only
• Cumbersome
– Requires multiple intermediate conversions
• Inefficient for large volumes of data
20. Guillaume Report
• INSEE short term solution
• Datalift was not mature enough
• Mimas identified as cumbersome and inefficient
• Opted to use Apache Jena for a small Java application
21. Technology Overview
• Census TableBuilder
– Data extracted in SDMX and CSV
• Java
– Apache Jena library
– SDMX 2.0 XML beans
• Ontologies used
– Simple Knowledge Organisation System
– Data Cube Vocabulary
• Turtle RDF syntax
– Easy to read for humans and machines
22. SDMX Extraction Tool Overview
• Reads in SDMX structure file
– Uses SDMX 2.0 beans to parse file
• Disassembles XML to main components
– Code lists
– Concepts
– Key Families
• Build semantic model with Apache Jena
• Write to file in Turtle syntax
29. Data Structure Definition
[Figure: annotated Data Structure Definition. Callouts:]
• Can only be values of this type
• List of codes to use
• Concept the dimension is measuring
• What the observation is measuring
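As a rough illustration of what a Data Cube Vocabulary data structure definition can look like, the sketch below builds one as Turtle text. Mapping the four callouts to `rdfs:range`, `qb:codeList`, `qb:concept` and `qb:measure` is an assumption about what the slide showed. The `http://example.org/` URIs, the dimension ids and the function names are hypothetical.

```python
BASE = "http://example.org/"  # hypothetical base URI

PREFIXES = "\n".join([
    "@prefix qb: <http://purl.org/linked-data/cube#> .",
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    "@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .",
])


def dimension_to_turtle(dim_id, base=BASE):
    """Describe one dimension property of the cube."""
    return "\n".join([
        f"<{base}dimension/{dim_id}> a qb:DimensionProperty ;",
        # Values of this dimension can only be of this type.
        f"    rdfs:range <{base}class/{dim_id}> ;",
        # The list of codes to use for this dimension.
        f"    qb:codeList <{base}code/{dim_id}> ;",
        # The concept this dimension represents.
        f"    qb:concept <{base}concept/{dim_id}> .",
    ])


def dsd_to_turtle(dsd_id, dim_ids, base=BASE):
    """Describe the structure: one component per dimension plus one measure."""
    lines = [f"<{base}dsd/{dsd_id}> a qb:DataStructureDefinition ;"]
    for dim_id in dim_ids:
        lines.append(f"    qb:component [ qb:dimension <{base}dimension/{dim_id}> ] ;")
    # What each observation is measuring.
    lines.append("    qb:component [ qb:measure sdmx-measure:obsValue ] .")
    return "\n".join(lines)


if __name__ == "__main__":
    dims = ["REGION", "SEX", "AGE"]
    parts = [PREFIXES, dsd_to_turtle("census", dims)]
    parts += [dimension_to_turtle(d) for d in dims]
    print("\n\n".join(parts))
```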
30. The Data - SDMX
• Series key – dimensions being measured
• Attributes – extra metadata about observation
• Obs – the value of the observation (i.e. people counted)
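To make the three parts concrete, here is a minimal sketch that reads series keys, attributes and observation values from an SDMX-ML 2.0 generic data message using Python's standard library. The sample message, its concept ids and its values are invented for illustration and are not real Census figures. The function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the SDMX-ML 2.0 generic data format, in ElementTree's {uri}tag form.
G = "{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/generic}"

# Hypothetical, hand-written data message (values are made up).
SAMPLE_DATA = """
<message:GenericData
    xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
    xmlns:generic="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/generic">
  <message:DataSet>
    <generic:Series>
      <generic:SeriesKey>
        <generic:Value concept="REGION" value="8"/>
        <generic:Value concept="SEX" value="1"/>
        <generic:Value concept="AGE" value="A15"/>
      </generic:SeriesKey>
      <generic:Obs>
        <generic:Time>2011</generic:Time>
        <generic:ObsValue value="1234"/>
        <generic:Attributes>
          <generic:Value concept="OBS_STATUS" value="A"/>
        </generic:Attributes>
      </generic:Obs>
    </generic:Series>
  </message:DataSet>
</message:GenericData>
"""


def _concept_values(parent):
    """Collect concept/value pairs from the Value children of an element."""
    if parent is None:
        return {}
    return {v.get("concept"): v.get("value") for v in parent.findall(G + "Value")}


def read_observations(xml_text):
    """Return one dict per observation: dimensions, time, attributes, value."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for series in root.iter(G + "Series"):
        key = _concept_values(series.find(G + "SeriesKey"))
        for obs in series.findall(G + "Obs"):
            rows.append({
                "dimensions": dict(key),  # the series key applies to every Obs
                "time": obs.findtext(G + "Time"),  # None if the Obs has no Time
                "attributes": _concept_values(obs.find(G + "Attributes")),
                "value": obs.find(G + "ObsValue").get("value"),
            })
    return rows


if __name__ == "__main__":
    for row in read_observations(SAMPLE_DATA):
        print(row)
```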
31. The Data - DCV
• More condensed – attributes attached to the dataset instead of the observation
[Figure: annotated DCV observation. Callouts:]
• Dimensions
• Coded values
• Observation value
• Dataset observation is from
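The shape of a DCV observation can be sketched as follows. This is an illustrative Python string builder, not the Jena-based code used in the project. The `http://example.org/` URIs, the identifiers, the count and the choice of `sdmx-attribute:unitMeasure` as the dataset-level attribute are all assumptions.

```python
BASE = "http://example.org/"  # hypothetical base URI

PREFIXES = "\n".join([
    "@prefix qb: <http://purl.org/linked-data/cube#> .",
    "@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .",
    "@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .",
])


def dataset_to_turtle(dataset_id, dsd_id, unit, base=BASE):
    """Describe the dataset once, carrying attributes shared by all observations."""
    return "\n".join([
        f"<{base}dataset/{dataset_id}> a qb:DataSet ;",
        f"    qb:structure <{base}dsd/{dsd_id}> ;",
        f"    sdmx-attribute:unitMeasure <{base}unit/{unit}> .",
    ])


def observation_to_turtle(obs_id, dataset_id, dimensions, value, base=BASE):
    """Describe one observation: its dataset, coded dimension values and count."""
    lines = [
        f"<{base}obs/{obs_id}> a qb:Observation ;",
        # The dataset the observation is from.
        f"    qb:dataSet <{base}dataset/{dataset_id}> ;",
    ]
    # One triple per dimension, pointing at a coded value from its code list.
    for dim_id, code in dimensions.items():
        lines.append(f"    <{base}dimension/{dim_id}> <{base}code/{dim_id}/{code}> ;")
    # The observation value, written as an integer literal.
    lines.append(f"    sdmx-measure:obsValue {int(value)} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(PREFIXES)
    print()
    print(dataset_to_turtle("census2011", "census", "persons"))
    print()
    print(observation_to_turtle(
        "o1", "census2011", {"REGION": "8", "SEX": "1", "AGE": "A15"}, "1234"))
```

Attaching the unit to the dataset rather than repeating it on every observation is what keeps the output more condensed than the SDMX-ML source.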
32. Lessons Learned (1)
• Subject Matter Experts needed
– What dimensions to use?
– What attributes to use?
– What concepts are we measuring?
• Current tools not yet mature
• Full validation of data is complex
• Heavy resource usage for large data
– Unable to process SA2-level data on a 32-bit platform
33. Lessons Learned (2)
• Conversion is straightforward
– Standards very similar
• Promotes reuse
– Power comes from linking data
• Linked nature makes you think about what you are doing
– E.g. How close is INSEE activity to ABS labour force status?
34. Semantic Considerations
• How much, how soon, do we aim to harness opportunities for carrying more usable semantics in Data Cube Vocabulary?
– Expected an external ontology for sex – but most are for Gender
• How close is “close enough” for semantic assertions in Linked Open Data?
• Aim for statistical harmonisation first (eg SDMX Cross Domain Concepts) then explore links to broader ontologies?
• Even data producers are not sure if Age is a common concept across ABS & INSEE (Statistics France).
• Risk of overselling the technical format before semantic payload is sorted?
35. Laying the foundations
• The project confirmed that, in order to deliver more usable semantics in our outputs, on a sustainable basis, we need statistical data and metadata to be defined and managed on a consistent, standards-aligned basis across the organisation, including
– across all statistical subject matter domains (social, economic, environmental)
– “end to end” (ie spanning design, collection, processing/integration, analysis and dissemination)
• We also need production processes to be automated & sustainable.
• This is one example of why ABS needs to “modernise statistical production” to reflect the changed world in which we operate and to offer new services that address new needs and expectations of users.
• In the 2013–14 Budget Papers funding of $2.1 million was provided to develop a second pass business case for a major statistical infrastructure and business process reengineering project.
National Statistical Institutions face shared constraints and challenges.
• External challenges
– Rapidly changing external environment: 24/7 access to information
– Increasing demand by sophisticated users for more timely, relevant statistical data to meet ‘current’ day issues
– Increasing demand for more accessible and ‘joined up’ data to solve complex policy questions
• Constraints
– Reduced funding and volatility in funding
– Our costs are increasing significantly: unable to contact many households, response rates are dropping, and it is becoming more and more difficult to recruit and retain interviewers
– Skills shortages: competing for statistical and ICT skills across government
– Complex work programs
– Siloed processes and aging infrastructure