Experimental transformation of ABS data into Data Cube Vocabulary (DCV) format
Why? How? What was learned?
Outline
I. Context
– Transforming national & international statistical
systems
– Semantic Web / Linked Data meets Official Statistics
– SemStats 2013
– Parameters for the R&D project
II. Investigation of existing tools
III. Summary of the transformation process
IV. Lessons learned
V. Discussion
2009 (Australia)
• The case for an international statistical innovation program
Transforming national and international statistics systems
• Future capabilities
1. From static data products to “common information services”
2. From publications to communication
3. Support for transaction data flowing at a much higher volume
4. Ability to rapidly incorporate new issues and views of data into
standards and classifications
5. ‘Rapid-response’ capability
6. Connecting processes and passing metadata and data easily
between them
7. Analysing assemblies of data
The Challenges
• Increasing cost & difficulty of acquiring survey data
• New sources & changing expectations
• Rapid changes in the environment
• Competition for skilled resources
• Diminishing budgets
• Riding the big data wave
HLG
• High-Level Group for the Modernisation of Statistical Production and Services
• Comprises 10 heads of national and international statistical organisations
– Gosse van der Veen (Netherlands) - Chairman
– Brian Pink (Australia)
– Eduardo Sojo Garza-Aldape (Mexico)
– Enrico Giovannini (Italy)
– Woo, Ki-Jong (Republic of Korea)
– Irena Križman (Slovenia)
– Katherine Wallman (United States)
– Walter Radermacher (Eurostat)
– Martine Durand (OECD)
– Lidia Bratanova (UNECE)
The official statistics industry and its place in the wider information industry
From “Strategy to implement the vision of the HLG” (2012)
Grouping the challenges
1. Product Challenge – Modernising Statistical Services
• Designing and delivering new and better statistical
outputs (products and services)
2. Process Challenge – Modernising Statistical
Production
• Developing and implementing new and better production
processes and methods which are capable of delivering
statistical outputs with
i. reduced cost, and
ii. greater flexibility.
HLG Strategy
• Standards-based, collaborative modernisation of official statistics.
• Create an environment (eg “common architecture”) that facilitates
collaborative development, sharing and reuse of
– statistical business processes
– statistical methods
– IT components
– data repositories
• Explicit role for
– common conceptual frameworks, eg
• GSIM (Generic Statistical Information Model)
– and common implementation standards, eg
• SDMX (Statistical Data and Metadata eXchange), working with
• DDI (Data Documentation Initiative)
ABS main data services support SDMX
• ABS.Stat Beta
– Dissemination from predefined aggregate data cubes
• eg Consumer Price Index
– Featured at GovHack 2013
– Based on OECD.Stat
• Now used by OECD, IMF, UNESCO, European Commission, ABS,
Statistics New Zealand, Statistics Italy
• Further development through SIS Collaboration Community
• TableBuilder
– Dissemination of on-demand tabulations from microdata
• Includes Population Census
Harnessing the opportunities
• Global community around SDMX
– intersects with SIS Collaboration Community
• Working on
– SDMX to JSON (JavaScript Object Notation)
• Making life easier for third-party developers
– No need to parse SDMX-ML
• Object model similar to Data Cube Vocabulary (DCV)
• Expected to be released for review in September
– SDMX to Data Cube Vocabulary (DCV)
• Much earlier stage within SIS Collaboration Community
Layering standards on standards
• RDF Data Cube Vocabulary (DCV) developed
under W3C
– designed for publishing multi-dimensional data, such
as statistics, on the web in such a way that it can be
linked to related data sets and concepts
– based upon the approach used by the SDMX ISO
standard for statistical data exchange
– very general and can be used for other data sets such
as survey data, spreadsheets and OLAP data cubes
Use of DCV
• Usage within
– data.gov.uk
– Eurostat
– Other institutions within the European Union via
the EU’s Open Data Portal
• eg European Environment Agency
– Experimental use within data.gov.au
Linked Data view on Official Statistics
• Official Statistics and the Practice of Data Fidelity
– Official statistics are the “crown jewels” of a nation’s public data
– Provide empirical evidence for policy making and economic research
– Statistical offices are among the most “data-savvy” organisations in
government
– Handling of Statistical Data as Linked Data requires particular attention
to maintain its integrity and fidelity
• Linked SDMX Data
– Challenges
• Automation of the transformation of data from high-profile statistical
organizations
• Minimization of third-party interpretation of the source data and metadata, and
lossless transformations
(Unofficial) view from Official Statistics
• Semantic Statistics opportunities include:
– external application of statistical classifications, and other statistical
concept schemes, as ontologies
– simpler, more flexible and more powerful use of statistical data alongside
other data
– partnering more closely with other “data” communities
• Semantic Statistics issues and risks include:
– ensuring production process is sustainable
– ensuring semantics are identified consistently across all statistical outputs
from a single agency
– possible lack of rigour when defining and linking concepts to outputs from
other sources
– the possibility of “fuzzy” semantics leading to incorrect data analyses
SemStats 2013
• Interest in “Semantic Statistics” is growing rapidly
within Statistical and Semantic Web communities
• There are existing semantic web developments
building on both SDMX and DDI
• SemStats 2013 provides a rare opportunity to interact
with world experts while they’re in Australia
• We are interested in what entrants might create and
demonstrate for the SemStats 2013 Challenge
SemStats 2013 Challenge
• Provides Australian and French Census data in
Data Cube Vocabulary (DCV) format
– Data is Geography x Sex x Age x “Activity” status
– Entrants are asked to demonstrate value from innovative
application of semantic web technologies to the data.
Aim when preparing Australian content
• use as an opportunity for practical learning
• start with SDMX-ML (not, eg, CSV) (if possible)
– Plan A: SDMX-ML from TableBuilder
• use existing international tools for SDMX-ML to
DCV transformations (if possible)
• do the work within the ABS (if possible)
• Plan B was to ask INSEE (Statistics France) to help us with the
transformation
Investigation
• Datalift
– Supports multiple input types
– Generic transformation
– Supports dissemination to the web
• Mimas
– XSLT based
– Complicated
• Guillaume Report
– From INSEE
– Highly tailored to the input data
Datalift
• Free to use – source code also available
• Java web application
• Supports multiple input types
– Semantic graphs
– Relational databases
– Files (CSV, XML, etc)
• Supports entire cycle
– INSEE plan to use in future
• SDMX -> DCV plug-in in development
Mimas
• Inflexible
– XML input only
– XML output only
• Cumbersome
– Requires multiple intermediate conversions
• Inefficient for large volumes of data
Guillaume Report
• INSEE short term solution
• Datalift was not mature enough
• Mimas identified as cumbersome and
inefficient
• Opted to use Apache Jena in a small Java
application
Technology Overview
• Census TableBuilder
– Data extracted in SDMX and CSV
• Java
– Apache Jena library
– SDMX 2.0 XML beans
• Ontologies used
– Simple Knowledge Organization System (SKOS)
– Data Cube Vocabulary
• Turtle RDF syntax
– Easy to read for humans and machines
SDMX Extraction Tool Overview
• Reads in SDMX structure file
– Uses SDMX 2.0 beans to parse file
• Disassembles XML into its main components
– Code lists
– Concepts
– Key Families
• Builds semantic model with Apache Jena
• Writes to file in Turtle syntax
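As a rough illustration of the "disassemble" step, the sketch below splits a small structure message into code lists, concepts and key families. It is not the project's tool, which used Java with SDMX 2.0 XML beans and Apache Jena. It is a standard-library Python sketch, and the sample XML is hand-written in the style of an SDMX-ML 2.0 structure message, with an invented agency and invented identifiers.

```python
import xml.etree.ElementTree as ET

# Namespaces as used by SDMX-ML 2.0 structure messages.
NS = {
    "m": "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message",
    "s": "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure",
}

# Hand-written sample; agency, ids and codes are illustrative, not real ABS metadata.
SAMPLE = """<Structure
    xmlns="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
    xmlns:structure="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure">
  <CodeLists>
    <structure:CodeList id="CL_SEX" agencyID="EXAMPLE">
      <structure:Name xml:lang="en">Sex</structure:Name>
      <structure:Code value="1">
        <structure:Description xml:lang="en">Male</structure:Description>
      </structure:Code>
      <structure:Code value="2">
        <structure:Description xml:lang="en">Female</structure:Description>
      </structure:Code>
    </structure:CodeList>
  </CodeLists>
  <Concepts>
    <structure:Concept id="SEX" agencyID="EXAMPLE">
      <structure:Name xml:lang="en">Sex</structure:Name>
    </structure:Concept>
  </Concepts>
  <KeyFamilies>
    <structure:KeyFamily id="CENSUS" agencyID="EXAMPLE">
      <structure:Name xml:lang="en">Census example</structure:Name>
      <structure:Components>
        <structure:Dimension conceptRef="SEX" codelist="CL_SEX"/>
        <structure:PrimaryMeasure conceptRef="OBS_VALUE"/>
      </structure:Components>
    </structure:KeyFamily>
  </KeyFamilies>
</Structure>"""


def split_structure(xml_text):
    """Return (code_lists, concepts, key_families) as plain dictionaries."""
    root = ET.fromstring(xml_text)

    # Code lists: {code list id: {code value: description}}
    code_lists = {}
    for code_list in root.findall("m:CodeLists/s:CodeList", NS):
        codes = {}
        for code in code_list.findall("s:Code", NS):
            codes[code.get("value")] = code.findtext(
                "s:Description", default="", namespaces=NS
            )
        code_lists[code_list.get("id")] = codes

    # Concepts: {concept id: name}
    concepts = {
        concept.get("id"): concept.findtext("s:Name", default="", namespaces=NS)
        for concept in root.findall("m:Concepts/s:Concept", NS)
    }

    # Key families: {key family id: [(concept ref, code list id), ...]}
    key_families = {}
    for family in root.findall("m:KeyFamilies/s:KeyFamily", NS):
        key_families[family.get("id")] = [
            (dim.get("conceptRef"), dim.get("codelist"))
            for dim in family.findall("s:Components/s:Dimension", NS)
        ]

    return code_lists, concepts, key_families


code_lists, concepts, key_families = split_structure(SAMPLE)
print(code_lists, concepts, key_families)
```

Keeping the three component types in separate dictionaries mirrors the slide: each can then be turned into RDF independently before being written out as Turtle.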
Code Lists
• Representation of a classification
– Can be hierarchical or flat
[Diagram: code schemes (code scheme information) and their codes (code information) are used to generate a SKOS concept scheme]
SKOS Concept Schemes
[Annotated example of a generated SKOS concept: unique identifier, type, parent category, label, classification/concept scheme, code]
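The elements listed for a SKOS concept (identifier, type, parent category, label, scheme and code) can be sketched as Turtle output. The Python below is an assumption-laden illustration rather than the project's Jena code. The `http://example.org/code/` base URI, the `code:` prefix, the URI pattern and the age codes are all hypothetical, and labels are assumed to contain no characters that need escaping in Turtle.

```python
def code_list_to_turtle(scheme_id, label, codes, base="http://example.org/code/"):
    """Render a code list as a SKOS concept scheme in Turtle.

    codes is a list of (notation, label, parent_notation_or_None) tuples.
    The base URI and the '<scheme>-<notation>' URI pattern are hypothetical.
    """
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix code: <{base}> .",
        "",
        f"code:{scheme_id} a skos:ConceptScheme ;",
        f'    skos:prefLabel "{label}"@en .',
    ]
    for notation, code_label, parent in codes:
        lines += [
            "",
            f"code:{scheme_id}-{notation} a skos:Concept ;",   # unique identifier, type
            f'    skos:notation "{notation}" ;',               # code
            f'    skos:prefLabel "{code_label}"@en ;',         # label
            f"    skos:inScheme code:{scheme_id} ;",           # classification/concept scheme
        ]
        if parent is None:
            # Flat lists and the top level of a hierarchy have no parent.
            lines.append(f"    skos:topConceptOf code:{scheme_id} .")
        else:
            # Parent category for hierarchical classifications.
            lines.append(f"    skos:broader code:{scheme_id}-{parent} .")
    return "\n".join(lines) + "\n"


# Illustrative two-level age classification (codes invented for the example).
print(code_list_to_turtle(
    "CL_AGE", "Age",
    [("A", "0-14 years", None), ("A1", "0-4 years", "A")],
))
```

A flat code list is the same call with every parent set to `None`.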
Concepts & Components
• Links observations to their:
– Classification
– Concept
[Diagram: concept schemes (concept information for each concept) and components (component information) from the key families are used to create the data structure definition]
Data Structure Definition
[Annotated example of a dimension and measure definition: the type its values must have, the list of codes to use, the concept the dimension is measuring, and what the observation is measuring]
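A data structure definition of this kind might look roughly like the Turtle produced below. This is a hedged Python sketch, not the project's output. The `ex:`, `code:` and `concept:` prefixes and every identifier are hypothetical, while the `qb:`, `skos:`, `rdf:` and `rdfs:` terms come from the published vocabularies.

```python
PREFIXES = """@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix ex:      <http://example.org/def/> .
@prefix code:    <http://example.org/code/> .
@prefix concept: <http://example.org/concept/> .
"""


def dimension_to_turtle(name, concept, code_list):
    """Describe one dimension property (all identifiers are hypothetical)."""
    return (
        f"ex:{name} a rdf:Property, qb:DimensionProperty ;\n"
        f"    rdfs:range skos:Concept ;\n"           # can only be values of this type
        f"    qb:codeList code:{code_list} ;\n"      # list of codes to use
        f"    qb:concept concept:{concept} .\n"      # concept the dimension is measuring
    )


def dsd_to_turtle(dsd_id, dimensions, measure):
    """Attach ordered dimensions and one measure to a data structure definition."""
    lines = [f"ex:{dsd_id} a qb:DataStructureDefinition ;"]
    for order, name in enumerate(dimensions, start=1):
        lines.append(
            f"    qb:component [ qb:dimension ex:{name} ; qb:order {order} ] ;"
        )
    # What the observation is measuring.
    lines.append(f"    qb:component [ qb:measure ex:{measure} ] .")
    return "\n".join(lines) + "\n"


document = "\n".join([
    PREFIXES,
    dimension_to_turtle("sex", "SEX", "CL_SEX"),
    dimension_to_turtle("age", "AGE", "CL_AGE"),
    "ex:population a rdf:Property, qb:MeasureProperty .\n",
    dsd_to_turtle("dsd-census", ["sex", "age"], "population"),
])
print(document)
```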
The Data - SDMX
• Series key – dimensions being measured
• Attributes – extra metadata about observation
• Obs – the value of the observation (i.e. people
counted)
The Data - DCV
• More condensed – attributes attached to the
dataset instead of the observation
[Annotated example of an observation: dimensions, coded values, observation value, and the dataset the observation is from]
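A single DCV observation with these parts could be written out as in the sketch below. It is illustrative Python only. The property names, code URIs, dataset name and the count are invented for the example and are not taken from the ABS data.

```python
def observation_to_turtle(obs_id, dataset, dimensions, measure, value):
    """Render one qb:Observation in Turtle.

    dimensions maps a dimension property name to a code identifier.
    The 'ex:' and 'code:' prefixes and all names are hypothetical and
    would need matching @prefix declarations in a full document.
    """
    lines = [
        f"ex:{obs_id} a qb:Observation ;",
        f"    qb:dataSet ex:{dataset} ;",            # dataset the observation is from
    ]
    for prop, code in dimensions.items():
        lines.append(f"    ex:{prop} code:{code} ;")  # dimension and its coded value
    lines.append(f"    ex:{measure} {int(value)} .")  # observation value
    return "\n".join(lines) + "\n"


# Invented example: a count for one geography x sex x age cell.
print(observation_to_turtle(
    "obs-1",
    "census-example",
    {"geography": "GEO-1", "sex": "CL_SEX-1", "age": "CL_AGE-A1"},
    "population",
    1234,
))
```

Attributes do not appear on the observation here, which matches the slide's point that DCV lets them be attached once at the dataset level.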
Lessons Learned (1)
• Subject Matter Experts needed
– What dimensions to use?
– What attributes to use?
– What concepts are we measuring?
• Current tools not yet mature
• Full validation of data is complex
• Heavy resource usage for large data
– Unable to process SA2-level data on 32-bit
Lessons Learned (2)
• Conversion is straightforward
– Standards very similar
• Promotes reuse
– Power comes from linking data
• Linked nature makes you think about what
you are doing
– E.g. How close is INSEE activity to ABS labour force
status?
Semantic Considerations
• How much, how soon, do we aim to harness opportunities
for carrying more usable semantics in Data Cube
Vocabulary?
– Expected an external ontology for sex, but most are for gender
• How close is “close enough” for semantic assertions in Linked Open
Data?
• Aim for statistical harmonisation first (eg SDMX Cross Domain
Concepts) then explore links to broader ontologies?
• Even data producers are not sure if Age is a common
concept across ABS & INSEE (Statistics France).
• Risk of overselling the technical format before semantic
payload is sorted?
Laying the foundations
• The project confirmed that, in order to deliver more usable semantics in
our outputs, on a sustainable basis, we need statistical data and metadata
to be defined and managed on a consistent, standards-aligned basis across
the organisation, including
– across all statistical subject matter domains (social, economic, environmental)
– “end to end” (ie spanning design, collection, processing/integration, analysis
and dissemination)
• We also need production processes to be automated & sustainable.
• This is one example of why ABS needs to “modernise statistical
production” to reflect the changed world in which we operate and to offer
new services that address new needs and expectations of users.
• In the 2013/14 Budget Papers, funding of $2.1 million was provided to
develop a second pass business case for a major statistical infrastructure
and business process reengineering project.
Discussion

Experimental transformation of ABS data into Data Cube Vocabulary (DCV) format : Why, How and What was learned

  • 1. Experimental transformation of ABS data into Data Cube Vocabulary (DCV) format Why? How? What was learned?
  • 2. Outline I. Context – Transforming national & international statistical systems – Semantic Web / Linked Data meets Official Statistics – SemStats 2013 – Parameters for the R&D project II. Investigation of existing tools III. Summary of the transformation process IV. Lessons learned V. Discussion
  • 3. 2009 (Australia) • The case for an international statistical innovation program Transforming national and international statistics systems • Future capabilities 1. From static data products to “common information services” 2. From publications to communication 3. Support for transaction data flowing at a much higher volume 4. Ability to rapidly incorporate new issues and views of data into standards and classifications 5. ‘Rapid-response’ capability 6. Connecting processes and passing metadata and data easily between them 7. Analysing assemblies of data
  • 4. The Challenges Increasing cost & difficulty of acquiring survey data New sources & changing expectations Rapid changes in the environment Competition for skilled resourcesDiminishing budgets Riding the big data wave
  • 5. HLG • High-Level Group for the Modernisation of Statistical Production and Services • Comprises 10 heads of national and international statistical organisations – Gosse van der Veen (Netherlands) - Chairman – Brian Pink (Australia) – Eduardo Sojo Garza-Aldape (Mexico) – Enrico Giovannini (Italy) – Woo, Ki-Jong (Republic of Korea) – Irena Križman (Slovenia) – Katherine Wallman (United States) – Walter Radermacher (Eurostat) – Martine Durand (OECD) – Lidia Bratanova (UNECE) The official statistics industry and its place in the wider information industry From Strategy to implement the vision of the HLG (2012)
  • 6. Grouping the challenges 1. Product Challenge - Modernising Statistical Services • Designing and delivering new and better statistical outputs (products and services) 2. Process Challenge – Modernising Statistical Production • Developing and implementing new and better production processes and methods which are capable of delivering statistical outputs with i. reduced cost, and ii. greater flexibility.
  • 7. HLG Strategy • Standards-based, collaborative modernisation of official statistics. • Create an environment (eg “common architecture”) that facilitates collaborative development, sharing and reuse of – statistical business processes – statistical methods – IT components – data repositories • Explicit role for – common conceptual frameworks, eg • GSIM (Generic Statistical Information Model) – and common implementation standards, eg • SDMX (Statistical Data and Metadata eXchange), working with • DDI (Data Documentation Initiative)
  • 8. ABS main data services support SDMX • ABS.Stat Beta – Dissemination from predefined aggregate data cubes • eg Consumer Price Index – Featured at GovHack 2013 – Based on OECD.Stat • Now used by OECD, IMF, UNESCO, European Commission, ABS, Statistics New Zealand, Statistics Italy • Further development through SIS Collaboration Community • TableBuilder – Dissemination of on-demand tabulations from microdata • Includes Population Census
  • 9. Harnessing the opportunities • Global community around SDMX – intersects with SIS Collaboration Community • Working on – SDMX to JSON (JavaScript Object Notation) • Making life easier for third party developers – No need to parse SDMX-ML • Object model similar to Data Cube Vocabulary (DCV) • Expected to be released for review in September – SDMX to Data Cube Vocabulary (DCV) • Much earlier stage within SIS Collaboration Community
  • 10. Layering standards on standards • RDF Data Cube Vocabulary (DCV) developed under W3C – designed for publishing multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts – based upon the approach used by the SDMX ISO standard for statistical data exchange – very general and can be used for other data sets such as survey data, spreadsheets and OLAP data cubes
  • 11. Use of DCV • Usage within – data.gov.uk – Eurostat – Other institutions within the European Union via the EU’s Open Data Portal • eg European Environment Agency – Experimental use within data.gov.au
  • 12. Linked Data view on Official Statistics • Official Statistics and the Practice of Data Fidelity – Official statistics are the “crown jewels” of a nation’s public data – Provide empirical evidence for policy making and economic research – Statistical offices are among the most “data-savvy” organisations in government – Handling of Statistical Data as Linked Data requires particular attention to maintain its integrity and fidelity • Linked SDMX Data – Challenges • Automation of the transformation of data from high-profile statistical organizations • Minimization of third-party interpretation of the source data and metadata, and lossless transformations
  • 13. (Unofficial) view from Official Statistics • Semantic Statistics opportunities include: – external application of statistical classifications, and other statistical concept schemes, as ontologies – simpler, more flexible and more powerful use of statistical data alongside other data – partnering more closely with other “data” communities • Semantic Statistics issues and risks include: – ensuring production process is sustainable – ensuring semantics are identified consistently across all statistical outputs from a single agency – possible lack of rigour when defining and linking concepts to outputs from other sources – the possibility of “fuzzy” semantics leading to incorrect data analyses
  • 14. SemStats 2013 • Interest in “Semantic Statistics” is growing rapidly within Statistical and Semantic Web communities • There are existing semantic web developments building on both SDMX and DDI • SemStats 2013 provides a rare opportunity to interact with world experts while they’re in Australia • We are interested in what entrants might create and demonstrate for the SemStats 2013 Challenge
  • 15. SemStats 2013 Challenge • Provides Australian and French Census data in Data Cube Vocabulary (DCV) format – Data is Geography x Sex x Age x “Activity” status – Entrants are asked to demonstrate value from innovative application of semantic web technologies to the data.
  • 16. Aim when preparing Australian content • use as an opportunity for practical learning • start with SDMX-ML (not, eg, CSV) (if possible) – Plan A: SDMX-ML from TableBuilder • use existing international tools for SDMX-ML to DCV transformations (if possible) • do the work within the ABS (if possible) • Plan B was to ask INSEE (Statistics France) to help us with the transformation
  • 17. Investigation • Datalift – Supports multiple input types – Generic transformation – Supports dissemination to the web • Mimas – XSLT based – Complicated • Guillaume report – From INSEE – Highly tailored to the input data
  • 18. Datalift • Free to use – source code also available • Java web application • Supports multiple input types – Semantic graphs – Relational databases – Files (CSV, XML, etc) • Supports entire cycle – INSEE plan to use in future • SDMX -> DCV plug-in in development
  • 19. Mimas • Inflexible – XML input only – XML output only • Cumbersome – Requires multiple intermediate conversions • Inefficient for large volumes of data
  • 20. Guillaume Report • INSEE short-term solution • Datalift was not mature enough • Mimas identified as cumbersome and inefficient • Opted to use Apache Jena in a small Java application
  • 21. Technology Overview • Census TableBuilder – Data extracted in SDMX and CSV • Java – Apache Jena library – SDMX 2.0 XML beans • Ontologies used – Simple Knowledge Organisation System – Data Cube Vocabulary • Turtle RDF syntax – Easy to read for humans and machines
  • 22. SDMX Extraction Tool Overview • Reads in SDMX structure file – Uses SDMX 2.0 beans to parse file • Disassembles XML to main components – Code lists – Concepts – Key Families • Build semantic model with Apache Jena • Write to file in Turtle syntax
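The first two steps on slide 22 (read the structure file, pull out its code lists) can be sketched in a few lines. This is only an illustration under stated assumptions: the project itself used Java with SDMX 2.0 XML beans and Apache Jena, whereas the sketch below uses Python's standard XML parser, and the embedded `SAMPLE` document, the `EXAMPLE` agency and the `read_code_lists` helper are hypothetical, heavily simplified stand-ins for a real SDMX-ML 2.0 structure message.

```python
import xml.etree.ElementTree as ET

# SDMX-ML 2.0 namespaces for the message envelope and structural metadata.
NS = {
    "message": "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message",
    "structure": "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure",
}

# Hypothetical, simplified structure message (not real ABS output).
SAMPLE = """<message:Structure
    xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
    xmlns:structure="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure">
  <message:CodeLists>
    <structure:CodeList id="CL_SEX" agencyID="EXAMPLE">
      <structure:Name xml:lang="en">Sex</structure:Name>
      <structure:Code value="1">
        <structure:Description xml:lang="en">Male</structure:Description>
      </structure:Code>
      <structure:Code value="2">
        <structure:Description xml:lang="en">Female</structure:Description>
      </structure:Code>
    </structure:CodeList>
  </message:CodeLists>
</message:Structure>
"""


def read_code_lists(xml_text):
    """Return {code list id: {"name": str, "codes": {value: description}}}."""
    root = ET.fromstring(xml_text)
    code_lists = {}
    for code_list in root.iterfind(".//structure:CodeList", NS):
        codes = {}
        for code in code_list.iterfind("structure:Code", NS):
            description = code.findtext(
                "structure:Description", default="", namespaces=NS
            )
            codes[code.get("value")] = description
        code_lists[code_list.get("id")] = {
            "name": code_list.findtext("structure:Name", default="", namespaces=NS),
            "codes": codes,
        }
    return code_lists


if __name__ == "__main__":
    for list_id, content in read_code_lists(SAMPLE).items():
        print(list_id, content["name"], content["codes"])
```

Concepts and key families could be extracted the same way; they are left out here to keep the sketch short.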
  • 23. Code Lists • Representation of a classification – Can be hierarchical or flat
  • 24. Code Schemes (figure labels: Code scheme information; Code information; Codes; Code schemes; Generate SKOS concept scheme)
  • 25. SKOS Concept Schemes (figure labels: Unique identifier; Type; Parent category; Label; Classification/concept scheme; Code)
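As a rough illustration of slides 23 to 25, the sketch below turns a small hierarchical code list into a SKOS concept scheme in Turtle. It is not the project's Jena code: the `http://example.org/code/` base URI, the `code_list_to_skos` helper and the sample age codes are assumptions made for the example, and codes are assumed to contain only URI-safe characters. The mapping of slide labels to SKOS terms (unique identifier to the concept URI, parent category to `skos:broader`, label to `skos:prefLabel`, classification to `skos:inScheme`, code to `skos:notation`) is one plausible reading of the figure.

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def code_list_to_skos(scheme_id, scheme_label, codes,
                      base="http://example.org/code/"):
    """Serialise a code list as a SKOS concept scheme in Turtle.

    codes is a list of (code, label, parent_code_or_None) tuples.
    The base URI is a hypothetical placeholder.
    """
    scheme_uri = f"<{base}{scheme_id}>"
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"{scheme_uri} a skos:ConceptScheme ;",
        f"    skos:prefLabel {turtle_literal(scheme_label)}@en .",
    ]
    for code, label, parent in codes:
        properties = [
            "a skos:Concept",                                 # type
            f"skos:inScheme {scheme_uri}",                    # classification
            f"skos:notation {turtle_literal(code)}",          # code
            f"skos:prefLabel {turtle_literal(label)}@en",     # label
        ]
        if parent is None:
            properties.append(f"skos:topConceptOf {scheme_uri}")
        else:
            # parent category in a hierarchical classification
            properties.append(f"skos:broader <{base}{scheme_id}/{parent}>")
        lines.append("")
        lines.append(
            f"<{base}{scheme_id}/{code}> " + " ;\n    ".join(properties) + " ."
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_codes = [
        ("A", "0-14 years", None),
        ("A1", "0-4 years", "A"),
    ]
    print(code_list_to_skos("CL_AGE", "Age", sample_codes))
```

A flat classification is simply the case where every code has no parent, so every concept becomes a top concept of the scheme.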
  • 26. Concepts & Components • Links observations to their: – Classification – Concept
  • 29. Data Structure Definition (figure labels: Can only be values of this type; List of codes to use; Concept dimension is measuring; What the observation is measuring)
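The four labels on slide 29 line up naturally with Data Cube Vocabulary terms: `rdfs:range` (values can only be of this type), `qb:codeList` (list of codes to use), `qb:concept` (the concept the dimension is measuring) and a measure property (what the observation is measuring). The sketch below writes such a data structure definition as Turtle. It is an assumption-laden illustration, not the ABS output: the `ex:` namespace, the `CL_…` and `concept-…` naming pattern and the `build_dsd` helper are all hypothetical, and labels are assumed to contain no double quotes.

```python
# Prefixes used by the generated Turtle; ex: is a hypothetical namespace.
PREFIXES = "\n".join([
    "@prefix qb: <http://purl.org/linked-data/cube#> .",
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
    "@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .",
    "@prefix ex: <http://example.org/census/> .",
    "",
])


def dimension_property(name, label):
    """Describe one coded dimension (for example sex or age)."""
    return (
        f"ex:{name} a rdf:Property, qb:DimensionProperty ;\n"
        f'    rdfs:label "{label}"@en ;\n'
        "    rdfs:range skos:Concept ;\n"             # can only be values of this type
        f"    qb:codeList ex:CL_{name.upper()} ;\n"   # list of codes to use
        f"    qb:concept ex:concept-{name} .\n"       # concept the dimension is measuring
    )


def build_dsd(dsd_name, dimensions):
    """Return Turtle for the dimension properties plus the DSD that uses them.

    dimensions is an ordered list of (name, label) pairs.
    """
    parts = [PREFIXES]
    components = []
    for order, (name, label) in enumerate(dimensions, start=1):
        parts.append(dimension_property(name, label))
        components.append(f"[ qb:dimension ex:{name} ; qb:order {order} ]")
    # What the observation is measuring: the generic SDMX observation value.
    components.append("[ qb:measure sdmx-measure:obsValue ]")
    parts.append(
        f"ex:{dsd_name} a qb:DataStructureDefinition ;\n"
        "    qb:component " + " ,\n        ".join(components) + " .\n"
    )
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_dsd("dsd-census", [("sex", "Sex"), ("age", "Age")]))
```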
  • 30. The Data - SDMX • Series key – dimensions being measured • Attributes – extra metadata about observation • Obs – the value of the observation (i.e. people counted)
  • 31. The Data - DCV • More condensed – attributes attached to the dataset instead of the observation (figure labels: Dimensions; Coded values; Observation value; Dataset observation is from)
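Slides 30 and 31 contrast the SDMX view of the data (series key, attributes, observation value) with the more condensed DCV view. A minimal sketch of that mapping follows, assuming a series key is available as a dictionary of dimension name to code. Every identifier here is hypothetical (the `ex:` namespace, the code URIs, the `ex:persons` unit, the helper names), and attaching the unit of measure to the dataset is just one example of an attribute moved off the observation. A complete cube would also need to declare that attribute in the data structure definition.

```python
# Prefixes used by the generated Turtle; ex: is a hypothetical namespace.
PREFIXES = "\n".join([
    "@prefix qb: <http://purl.org/linked-data/cube#> .",
    "@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .",
    "@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .",
    "@prefix ex: <http://example.org/census/> .",
    "",
])

CODE_BASE = "http://example.org/code/"  # hypothetical base URI for code lists


def dataset_to_turtle(dataset, dsd, unit):
    """Describe the dataset once, carrying attributes shared by all observations."""
    return (
        f"ex:{dataset} a qb:DataSet ;\n"
        f"    qb:structure ex:{dsd} ;\n"
        f"    sdmx-attribute:unitMeasure ex:{unit} .\n"
    )


def observation_to_turtle(obs_id, dataset, series_key, value):
    """Turn one SDMX series key plus observation value into a qb:Observation."""
    lines = [
        f"ex:{obs_id} a qb:Observation ;",
        f"    qb:dataSet ex:{dataset} ;",  # dataset the observation is from
    ]
    for dimension, code in series_key.items():
        # Each dimension points at a coded value in its concept scheme.
        code_uri = f"<{CODE_BASE}CL_{dimension.upper()}/{code}>"
        lines.append(f"    ex:{dimension} {code_uri} ;")
    # Observation value, for example the number of people counted.
    lines.append(f"    sdmx-measure:obsValue {int(value)} .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(PREFIXES)
    print(dataset_to_turtle("dataset-census", "dsd-census", "persons"))
    print(observation_to_turtle(
        "obs1", "dataset-census", {"sex": "1", "age": "A04"}, 1234
    ))
```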
  • 32. Lessons Learned (1) • Subject Matter Experts needed – What dimensions to use? – What attributes to use? – What concepts are we measuring? • Current tools not yet mature • Full validation of data is complex • Heavy resource usage for large data – Unable to process SA2 level data on 32-bit
  • 33. Lessons Learned (2) • Conversion straightforward – Standards very similar • Promotes reuse – Power comes from linking data • Linked nature makes you think about what you are doing – E.g. How close is INSEE activity to ABS labour force status?
  • 34. Semantic Considerations • How much, how soon, do we aim to harness opportunities for carrying more usable semantics in Data Cube Vocabulary? – Expected an external ontology for sex – but most are for Gender • How close is “close enough” for semantic assertions in Linked Open Data? • Aim for statistical harmonisation first (eg SDMX Cross Domain Concepts) then explore links to broader ontologies? • Even data producers are not sure if Age is a common concept across ABS & INSEE (Statistics France). • Risk of overselling the technical format before semantic payload is sorted?
  • 35. Laying the foundations • The project confirmed that, in order to deliver more usable semantics in our outputs, on a sustainable basis, we need statistical data and metadata to be defined and managed on a consistent, standards-aligned basis across the organisation, including – across all statistical subject matter domains (social, economic, environmental) – “end to end” (ie spanning design, collection, processing/integration, analysis and dissemination) • We also need production processes to be automated & sustainable. • This is one example of why ABS needs to “modernise statistical production” to reflect the changed world in which we operate and to offer new services that address new needs and expectations of users. • In the 2013-14 Budget Papers funding of $2.1 million was provided to develop a second pass business case for a major statistical infrastructure and business process reengineering project.

Editor's notes

  1. National Statistical Institutions face shared constraints and challenges. External challenges: rapidly changing external environment (24/7 access to information); increasing demand by sophisticated users for more timely, relevant statistical data to meet ‘current’ day issues; increasing demand for more accessible and ‘joined up’ data to solve complex policy questions. Constraints: reduced funding and volatility in funding; our costs are increasing significantly (unable to contact many households, response rates are dropping, it is becoming more and more difficult to recruit and retain interviewers); skills shortages (competing for statistical and ICT skills across government); complex work programs; siloed processes and aging infrastructure.