Provenance as an element of
FAIR data principles
Enabling data reuse
Margie Smith
Science Data Governance & Policy
Science Data Section
Data governance and policy
ANDS FAIR webinar series #4 – September 2017
Data Governance Committee
Data Strategy
Data Management Policy
Data Archive Policy
⁞
Product Management Plans
Data Management Plans
Source catalogue
Standardised vocabularies
Publishing schemas
⁞
Why GA cares about data re-use
Understanding the provenance of data that GA creates and
consumes enables the organisation to adhere to its Science
principles
and underpins the organisation’s vision to ‘maximise our data
potential’.
http://www.ga.gov.au/about/corporate-plan
What does provenance information look like?
As part of a metadata record
• Information can be brief free-text
• Structured free-text
Pilbara Block 1:100 000 Landsat-5-TM image maps. Image files in BIL format
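In an ISO 19115-based record, a brief statement like the one above would typically sit in the lineage element. A minimal sketch of that encoding (ISO 19139 XML, gmd/gco namespaces; illustrative only, not a complete GA record):

```python
# Sketch: embedding a brief free-text lineage statement in an
# ISO 19139 (ISO 19115 XML encoding) fragment. Only the lineage
# element is shown; a real record carries much more.
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)

lineage = ET.Element(f"{{{GMD}}}LI_Lineage")
stmt = ET.SubElement(lineage, f"{{{GMD}}}statement")
text = ET.SubElement(stmt, f"{{{GCO}}}CharacterString")
text.text = ("Pilbara Block 1:100 000 Landsat-5-TM image maps. "
             "Image files in BIL format")

print(ET.tostring(lineage, encoding="unicode"))
```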
What does provenance information look like?
It can be discursive text
The ANUGA hydrodynamic model (https://anuga.anu.edu.au/) was run based on a Digital
Elevation Model (DEM) and inputs from a regional storm surge model (GEMS GCOM2D).
The maximum inundation depth and momentum values were identified in ArcGIS post-processing. DEM
used within ANUGA: triangular mesh created by/within ANUGA from a regular grid (1 m horizontal
resolution). The input grid was based on elevation data with varying accuracy: onshore and
offshore LiDAR, Navy soundings and 1-second SRTM DEM. The derived triangular mesh consisted
of smaller triangles (max 5 m^2) around the man-made drainage channels and larger triangles around
the remainder of the study region (max 350 m^2).
Regional storm input: temporal inputs (i.e. storm characteristics through the simulation time) were
extracted from the regional storm modelling (GEMS GCOM2D model) results for point locations
along the Busselton-Dunsborough coastline.
ANUGA model variables: some key variables set within the Python code were
minimum_storable_height = 0.10 m, Manning's coefficient of friction = 0.03, 12-minute modelling time
steps, and 64 CPUs (variations were identified between the results depending on the number of
CPUs specified).
The 64-CPU results were in the middle of the field (range from 8 to 128 CPUs). Broader detail of the
methods applied within this project is provided in the technical methodology document.
Also see the GA Professional Opinion (Coastal inundation modelling for Busselton, Western
Australia, under current and future climate)
(http://pid.geoscience.gov.au/dataset/78873)
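The same information could be captured in a structured, machine-readable form rather than discursive text. A sketch of the key settings above as a JSON fragment (the field names are invented for illustration, not a GA or PROV schema):

```python
# Sketch: the key model settings from the discursive lineage text above,
# recorded as a structured, machine-readable provenance fragment.
# Field names are illustrative only.
import json

anuga_run = {
    "model": "ANUGA hydrodynamic model",
    "model_url": "https://anuga.anu.edu.au/",
    "inputs": {
        "dem_grid_resolution_m": 1.0,
        "elevation_sources": ["onshore/offshore LiDAR",
                              "Navy soundings", "1-second SRTM DEM"],
        "storm_model": "GEMS GCOM2D",
    },
    "parameters": {
        "minimum_storable_height_m": 0.10,
        "mannings_friction": 0.03,
        "timestep_minutes": 12,
        "cpus": 64,
    },
}

print(json.dumps(anuga_run, indent=2))
```

A fragment like this can be located, retrieved and interpreted by machine search, which the free-text version cannot.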
Why we need provenance
Scenario: advice to the public was generated based on a
collection of sensor data at a point in time.
[Diagram: in response to an advice request, an Agent applies models, algorithms and a particular software version to a temporal subset of Dataset A; the advice that is generated is recorded in HPRM and eCat.]
Nick Car gave a presentation previously
https://youtu.be/elPcKqWoOPg
Provenance for data re-use
[Diagram: the Advice and its Report are prov:Entity objects generated by (wasGeneratedBy) a Process, a prov:Activity that used a temporal subset of Dataset A (from a temporal DB), event code/query (a prov:Plan, held in GitHub) and acquisition information; records are held in HPRM and eCat.]
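The scenario above can be sketched as a set of PROV-DM-style statements. The identifiers below are invented for illustration; a real system would use persistent identifiers and PROV-O/RDF:

```python
# Sketch: the advice scenario as (subject, property, object) statements
# using PROV-DM relation names. Identifiers (ex:*) are invented.
triples = [
    ("ex:advice",          "rdf:type",            "prov:Entity"),
    ("ex:advice",          "prov:wasGeneratedBy", "ex:process"),
    ("ex:process",         "rdf:type",            "prov:Activity"),
    ("ex:process",         "prov:used",           "ex:datasetA_subset"),
    ("ex:process",         "prov:used",           "ex:eventQuery"),
    ("ex:eventQuery",      "rdf:type",            "prov:Plan"),
    ("ex:datasetA_subset", "prov:wasDerivedFrom", "ex:datasetA"),
]

def used_by(node,
            preds=("prov:wasGeneratedBy", "prov:used", "prov:wasDerivedFrom")):
    """Everything the given entity transitively depended on."""
    seen, stack = set(), [node]
    while stack:
        s = stack.pop()
        for subj, pred, obj in triples:
            if subj == s and pred in preds and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen

print(used_by("ex:advice"))
```

With statements like these in place, the advice can be traced back to the dataset, the query and the process that produced it, rather than only to a dataset name in free text.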
FAIR principles
TO BE RE-USABLE:
R1. meta(data) have a plurality of accurate and relevant
attributes.
• R1.1. (meta)data are released with a clear and accessible
data usage license.
• R1.2. (meta)data are associated with their provenance.
• R1.3. (meta)data meet domain-relevant community
standards.
https://www.force11.org/fairprinciples
What else we are doing at GA
• We have moved from an Oracle-based ‘GeoCat’ catalogue to
our current ‘eCat’, which was made public last month.
• It was released as a minimum viable product, and
improvements are now being backlogged and prioritised
alongside the BAU of product release.
• We are currently cataloguing our (300+) services and
linking the services to the data record in eCat where one
exists (i.e. some services are based on aggregated datasets
or non-GA datasets).
• Catalogue schema and codelists will be published next
month.
• The processes for releasing/publishing data products are well
described and generally well known in the organisation.
GA Data and Publications Catalogue - eCat
[Slides 11-14: screenshots of the eCat catalogue interface, including the record at http://pid.geoscience.gov.au/id/dataset/ga/72759]
How to support provenance and data reuse
• A ‘source catalogue’ for the data acquisition phase
• eCat for publishing the data products
• Software and Object catalogues in the future
Standards on provenance
“Machine readable” could be:
- An ISO19115 metadata statement per dataset contributing to
a PROV-DM provenance graph
[Diagram: a Dataset record in the Source Catalogue, linked to Service, Report and Data product records, with derivedFrom links to one or more product (or data-subset) records in eCat.]
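The per-record derivedFrom links described above can be assembled into a small graph and walked to find everything a product was derived from. A sketch with invented record identifiers:

```python
# Sketch: derivedFrom links (as might be carried in each ISO 19115
# metadata statement) assembled into a provenance graph.
# Record identifiers are invented for illustration.
derived_from = {
    "eCat:product-1": ["source:dataset-A"],
    "eCat:product-2": ["source:dataset-A"],
    "eCat:product-3": ["eCat:product-1"],  # a product derived from a product
}

def ancestors(record):
    """All records a given product is (transitively) derived from."""
    out = set()
    for parent in derived_from.get(record, []):
        out.add(parent)
        out |= ancestors(parent)
    return out

print(sorted(ancestors("eCat:product-3")))
# → ['eCat:product-1', 'source:dataset-A']
```

Each metadata statement only needs to record its immediate parents; the full PROV-DM graph falls out of following the links.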
Standards on provenance
[Diagram: a derived or aggregated dataset (e.g. Dataset D, or a WMS aggregation) will inherit a licence from its ancestor(s), such as Dataset A (CC-BY), Dataset B (Commercial) and Software C; available licences include CC-BY, Commercial, CiC, …]
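One way to implement licence inheritance is to propagate the most restrictive ancestor licence to the derived dataset. A sketch, assuming a simple restrictiveness ordering (the ordering is an assumption for illustration, not GA licensing policy):

```python
# Sketch: a derived/aggregated dataset inherits the most restrictive
# licence among its ancestors. The ordering below is an assumption.
RESTRICTIVENESS = {"CC-BY": 1, "CiC": 2, "Commercial": 3}

def inherited_licence(ancestor_licences):
    """Return the most restrictive licence among the ancestors."""
    return max(ancestor_licences, key=lambda lic: RESTRICTIVENESS[lic])

# Dataset D aggregates CC-BY Dataset A with Commercial Dataset B:
print(inherited_licence(["CC-BY", "Commercial"]))  # → Commercial
```

The point is that licence inheritance only works if the provenance graph records which ancestors fed each derived product.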
Data management prioritisation
[Diagram: prioritisation matrix plotting Useability against High Value.]
Thank you.
Margie.smith@ga.gov.au

Speaker notes

  1. Hi there! My name is Margie Smith and I have worked at Geoscience Australia since November 2016 in the Science Data Governance and Policy team… a team of two. I came across to help GA meet its obligations under the National Archives of Australia’s Digital Continuity 2020 Policy, to bring some external policy knowledge into the organisation and to provide governance guidance around science data management.
  2. In response to the National Archives Digital Continuity 2020 Policy and other Australian Government Open Data policies, government organisations have been tasked with making their data holdings visible and available. Making data open is not new to GA but there is most definitely now a whole of government push for access to all data domains. I have produced several documents to meet the DC2020 data governance milestones, but as you can see from this diagram, there has to be a balance of both oversight and execution across the data lifecycle – to have one without the other will either produce a pile of documents that nobody reads or a plethora of silos of excellence generating portals, datasets and services that only those in the know can find and use.
  3. Whilst there are a series of external drivers for data management, use and re-use, there are also strong drivers currently within the organisation. For example: the cost of collecting or acquiring the data; the cost of not finding data previously acquired, or finding data and not being the person who ‘knows’ all about it; succession planning; analogue collections – diaries or paper products that have yet to be digitised; general public servant obligations like the Archives Act; and, of course, GA’s Science Principles and vision. Provenance will support the organisation through enabling data re-use (as you can now find it) and allow for transparent science and advice through understanding the data supply chain.
  4. At the moment, our metadata records indicate provenance of the data through the lineage statement or in the abstract. As shown in these examples, the provenance of a dataset or product are usually free-text and can be semi-structured or unstructured. Very concise or…
  5. … not exactly concise. Here the abstract includes everything you need to know about the Coastal inundation modelling for Busselton, Western Australia, under current and future climate. Whilst this provenance information is very useful, it is not particularly useable; and by useable*, I mean its ability to be located, retrieved, presented and interpreted – by person or ideally, by machine search. *from the ISO 15489-1:2016 Information and documentation -- Records management -- Part 1: Concepts and principles
  6. As an example of why we need provenance for data reuse, I have made up a scenario. In this scenario, the advice was generated from the complete dataset at the time. A scientist generated a model using algorithms and provided advice based on the output of the model. The advice, assuming it was of a general nature, is then made available through the catalogue – generally as a PDF document. The metadata for the advice gives the name of the dataset used, the area that the advice covers, the organisation as author of the report, and perhaps some of the methodology used in the generation of the report. In most cases, you could link the advice to the name of the dataset that was used to generate the advice, but not easily to the scientist or team and the models used to generate the advice. So this provenance model of a data product could work well as a highly structured PROV system.
  7. My colleague Nick Car gave a presentation on GA’s PROV model to ANDS in March and I suggest you watch that for specific information about the model at Geoscience Australia.
  8. Adapting Nick’s model, I have tried to replicate my previous scenario – modelling what we are working towards at GA. This is currently happening through lineage and association with digital objects rather than a true PROV model of digital objects. Working from right to left, the Advice would have a metadata record in eCat, our electronic catalogue, that indicates the process used to generate the advice, which is made up of: the temporal subset of the dataset the advice is based on; the software or models that were applied to the data; information around that data’s acquisition; and the reason the advice was required. If the data is to be re-used in future advice, it might also be helpful to know what models were tried previously that didn’t work. For our catalogue-like things, we need to gradually add the ability to link Entities, Agents, Activities and so on, to be able to use graph-structured provenance (PROV-DM) across multiple types of objects and across multiple systems in the future. In my role I am particularly interested in the repeatability of advice given by any government entity. Per the Archives Act, advice of this type given by government must be stored for a period of years and include the models, algorithms, software and data used to generate the advice. It is a safety net for the entity and the public servants that generated the advice at that point in time. This is currently a manual process, heavily reliant on the individual generating the advice and storing it appropriately. It would be excellent if the work we are currently undertaking would make it a lot easier for scientists to generate and catalogue this advice in the future.
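The graph-structured provenance described above can be sketched in a few lines. This is a toy model of PROV-DM relations (wasGeneratedBy, used, wasAssociatedWith, wasDerivedFrom) for the advice scenario, using plain tuples as statements; all identifiers are hypothetical, not real eCat records:

```python
# A minimal sketch of PROV-DM-style provenance for the advice scenario.
# Each statement is (subject, relation, object); identifiers are hypothetical.

provenance = [
    ("advice:2017-001",           "wasGeneratedBy",     "activity:model-run-42"),
    ("activity:model-run-42",     "used",               "dataset:obs-subset-2017Q1"),
    ("activity:model-run-42",     "used",               "software:hazard-model-v2"),
    ("activity:model-run-42",     "wasAssociatedWith",  "agent:science-team-a"),
    ("dataset:obs-subset-2017Q1", "wasDerivedFrom",     "dataset:obs-full"),
]

def trace(entity, graph=provenance):
    """Walk the graph from an entity and return everything it depends on."""
    found, frontier = set(), [entity]
    while frontier:
        node = frontier.pop()
        for subj, _rel, obj in graph:
            if subj == node and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found
```

With links like these in place, `trace("advice:2017-001")` recovers the full supply chain of the advice – team, models and source dataset – which is the repeatability the Archives Act obligation calls for.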
  9. Prior to sorting out what I wanted to include in this presentation, I had another look at the FAIR principles for data reuse. Looking at these principles, I was feeling a lot better about what has been achieved at GA in the last 18 months. We have a public catalogue, it has a clear and accessible data usage licence, and the standards used for cataloguing are in the spatial domain. The lineage in a metadata record has been the de facto ‘data provenance’ to date. We are currently working on multi-domain metadata retrieval from our catalogue; for example, we will be able to export records in AGRIF for Records Management, ISO19115 for spatial and DCAT for the National Archives. The Google search is already enabled in the search panel on the ga.gov.au splash page – this enables a search of both the website and the catalogue for content. In June, I was fortunate to attend the technical meeting of the Open Geospatial Consortium, an international spatial standards organisation. It was evident in discussions there that many other countries were also working towards delivering their catalogues in formats other than spatial to enable searching by other domains.
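Multi-domain export of this kind is essentially one catalogue record dispatched to whichever serialiser the consuming domain needs. A hedged sketch, assuming a simple dict-based record (the schema names come from the talk; the serialisers are stand-ins, not real GA code):

```python
# Hypothetical sketch of multi-domain metadata export: one record, several
# output schemas. Serialiser bodies are illustrative stand-ins only.

def export_record(record: dict, schema: str) -> str:
    serialisers = {
        "ISO19115": lambda r: f"<MD_Metadata><title>{r['title']}</title></MD_Metadata>",
        "DCAT":     lambda r: f'{{"dct:title": "{r["title"]}"}}',
        "AGRIF":    lambda r: f"agrif:Record title={r['title']}",
    }
    if schema not in serialisers:
        raise ValueError(f"unsupported schema: {schema}")
    return serialisers[schema](record)
```

The design point is that the catalogue record stays schema-neutral internally, and each domain (spatial, records management, archives) gets its own view on demand.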
  10. We have a new catalogue, our eCat, where: metadata records will have a persistent identifier; the licence for data re-use is clear; you can get to the data or product directly from the metadata record; and records for data are linked to services and portals that use them, and vice versa. At the moment, we are working to publish the 19115-3 catalogue schema and codelists that are used by GA in the catalogue. In terms of oversight, we have data product plans, roles and responsibilities, and workflows for the release of products from GA through eCat, which is a longstanding and well understood process. For the past month, my area has been undertaking work to highlight the need for science areas to focus on a data-first rather than product-first view. This data-first process will echo the data product publishing workflows and have a dedicated internal catalogue we are calling SourceCat. SourceCat is a clone of the eCat software and is being trialled within two areas of GA before being released across the organisation. Once we have this in place, being able to show provenance from the product to the data will be made easier as we start the process at the beginning rather than try to remediate at the product publishing end of a project.
  11. This is a view of our new eCat – the electronic catalogue for products generated at GA. We have moved to the newer metadata standard for Australian Spatial Data, the ISO 19115-1:2014 which you can see indicated on the page. There are also Keyword lists which have been somewhat free-forming to date. We have now selected well defined vocabularies where they exist and are working with the custodians to publish them whilst at the same time wrapping a governance structure around their maintenance and future extension.
  12. There is a persistent id and data download is indicated.
  13. When you go into the actual metadata record from the search, the information and links are clearly itemised.
  14. Here is an example where the link to the portal and the associated services is shown but as stated in the record access to the data isn’t available. “Please note: As these data are stored on a Corporate system, we are only able to supply the web services (see download links).”
  15. In the scenario I gave before, I pictured how the provenance of a data product would work well as part of a highly structured PROV model. The structure required supports data provenance and re-use even if it doesn’t become a PROV system immediately. The Source Catalogue is currently being built as a proof of concept for two science areas in the organisation with the intention of making it an agency tool for all data that is acquired or created. In the future we intend to have a Software Catalogue and Objects Catalogue so that the software or models used in data curation or data products can be included as per PROV models. These are all clones of the eCat software. With this comes the need to support the organisation with tools and documented procedures that in the future will become automagic processes to bring data into the building. This support is more of the oversight and execution balance that I spoke of earlier.
  16. We are also using the catalogue standard to introduce elements that will align with a future PROV model. We will be including the element ‘derivedFrom’ in the metadata record. In the future, if a product does not have a ‘derivedFrom’ element, it will not be published. Further into the future we will include the element ‘haveProv’, which is different to lineage as it is forward facing – linking the data to all products that have used it. By having all these links embedded, Nick explained, a machine-readable PROV record will be able to link to a metadata record to indicate that provenance exists. He then started talking about PROV bundles and lost me, but hopefully all these steps will lead to the working PROV model of the future GA.
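The publishing gate described above is simple to sketch: a product record without a ‘derivedFrom’ element is refused. The element names follow the talk; the record structure itself is a hypothetical illustration:

```python
# Minimal sketch of the 'derivedFrom' publishing gate described in the talk.
# Element names follow the presentation; the record structure is hypothetical.

def can_publish(record: dict) -> bool:
    """Only publish products that name at least one source dataset."""
    derived = record.get("derivedFrom")
    return bool(derived)

product_ok = {"title": "Inundation map", "derivedFrom": ["ecat:12345"]}
product_missing_prov = {"title": "Inundation map"}
```

A forward-facing ‘haveProv’ element would be the inverse link, maintained on the dataset record and pointing at every product that passed this gate.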
  17. I was also thinking about the next talk on licensing frameworks. In this future machine-to-machine scenario, the licenses of aggregated products may be determined through an automated rule set depending on the way the data product is delivered. In this example a dataset and its associated web service have differing licences. For third-party aggregated data use this process is currently determined through extensive written agreements for each product.
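An automated rule set of that kind could be as simple as a lookup keyed on the delivery method, with a fallback to today's manual written-agreement process. The licence identifiers and rules below are illustrative assumptions, not GA policy:

```python
# Hedged sketch of licence determination by delivery method, for the scenario
# where a dataset and its web service carry different licences.
# Licence identifiers and rules are illustrative only, not GA policy.

LICENCE_RULES = {
    "file_download": "CC-BY-4.0",
    "web_service":   "CC-BY-NC-4.0",  # e.g. the service is more restrictive
}

def licence_for(delivery_method: str) -> str:
    try:
        return LICENCE_RULES[delivery_method]
    except KeyError:
        # No rule: fall back to manual review, mirroring today's
        # per-product written agreements.
        return "manual-review-required"
```

For aggregated products, a machine-to-machine consumer could query the rule set per component and compose the most restrictive licence, rather than negotiating a written agreement for each product.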
  18. Finally, it takes a lot of work to remediate legacy metadata records. Are we going to remediate every single one of our legacy data records? NO – or at least not straight away. Not all data is high value, nor does all data have to be highly useable, but all data acquired and all data products created should be FAIR. To re-use data, it is necessary to understand its provenance to assess whether it is fit for purpose, and in working towards a PROV model and implementing tools like SourceCat, we are also further along the path to achieving GA’s vision to fully maximise our data potential.