Margie Smith
Full Webinar: https://youtu.be/EDhJTCm9RN8
Transcript: https://www.slideshare.net/AustralianNationalDataService/transcript-4-fair-r-for-reusable
Other webinars in the series: http://www.ands.org.au/news-and-events/events/fair-webinar-series
#4 FAIR - Provenance as an element of FAIR data principles - 20-09-17
1. Provenance as an element of
FAIR data principles
Enabling data reuse
Margie Smith
Science Data Governance & Policy
Science Data Section
2. Data governance and policy
Data Governance Committee
Data Strategy
Data Management Policy
Data Archive Policy
⁞
Product Management Plans
Data Management Plans
Source catalogue
Standardised vocabularies
Publishing schemas
⁞
3. Why GA cares about data re-use
Understanding the provenance of data that GA creates and consumes enables the organisation to adhere to its Science principles and underpins the organisation’s vision to ‘maximise our data potential’.
http://www.ga.gov.au/about/corporate-plan
4. What does provenance information look like?
As part of a metadata record
Information can be brief free-text or structured free-text, for example:
Pilbara Block 1:100 000 Landsat-5-TM image maps. Image files in BIL format
5. What does provenance information look like?
It can be discursive text:
The ANUGA hydrodynamic model (https://anuga.anu.edu.au/) was run based on a Digital
Elevation Model (DEM) and inputs from a regional storm surge model (GEMS GCOM2D).
The maximum inundation depth and momentum values were identified in ArcGIS post processing. DEM
used within ANUGA: Triangular mesh created by/within ANUGA from a regular grid (1 m horizontal
resolution). The input grid was based on elevation data with varying accuracy: onshore and
offshore LiDAR, Navy soundings and 1 second SRTM DEM. The derived triangular mesh consisted
of smaller triangles (max 5m^2) around the man-made drainage channels and larger triangles around
the remainder of the study region (max 350m^2)
Regional storm input: temporal inputs (i.e. storm characteristics through the simulation time) were
extracted from the regional storm modelling (GEMS GCOM2D model) results for point locations
along the Busselton-Dunsborough coastline.
ANUGA model variables: some key variables set within the Python code were:
minimum_storable_height = 0.10 m, Manning's coefficient of friction = 0.03, 12 minute modelling time
steps, 64 CPUs were used (variations were identified between the results depending on the number of
CPUs specified).
The 64 CPU results were in the middle of the field (range from 8 to 128 CPUs). Broader detail of the
methods applied within this project is in the technical methodology document.
Also see the GA Professional Opinion (Coastal inundation modelling for Busselton, Western
Australia, under current and future climate)
(http://pid.geoscience.gov.au/dataset/78873)
6. Why we need provenance
Scenario: advice to the public was generated based on a collection of sensor data at a point in time.
[Diagram: in response to an advice request, an Agent applies Models and Algorithms (a specific software version) to a temporal subset of Dataset A; the generated Advice is stored in HPRM and catalogued in eCat.]
7. Nick Car gave a presentation previously
https://youtu.be/elPcKqWoOPg
8. Provenance for data re-use
[Diagram: a PROV view of the scenario. The Advice and Report outputs (prov:Entity, stored in HPRM and catalogued in eCat) wasGeneratedBy a Process (prov:Activity) that used Dataset A via a Temporal DB event code / query held in GitHub (the prov:Plan), together with information about the data's acquisition.]
9. FAIR principles
TO BE RE-USABLE:
R1. meta(data) have a plurality of accurate and relevant attributes.
• R1.1. (meta)data are released with a clear and accessible data usage license.
• R1.2. (meta)data are associated with their provenance.
• R1.3. (meta)data meet domain-relevant community standards.
https://www.force11.org/fairprinciples
10. What else we are doing at GA
• We have moved from an Oracle-based ‘GeoCat’ catalogue to our current ‘eCat’, which was made public last month.
• It was released as a minimum viable product; improvements are now being backlogged and prioritised alongside the business-as-usual (BAU) release of products.
• We are currently cataloguing our (300+) services and linking the services to the data record in eCat where they exist (i.e. some services are based on aggregated datasets or non-GA datasets).
• Catalogue schema and codelists will be published next month.
• The processes for releasing/publishing data products are well described and generally well known in the organisation.
11. GA Data and Publications Catalogue - eCat
12. GA Data and Publications Catalogue - eCat
13. GA Data and Publications Catalogue - eCat
http://pid.geoscience.gov.au/id/dataset/ga/72759
14. GA Data and Publications Catalogue - eCat
15. How to support provenance and data reuse
• A ‘source catalogue’ for the data acquisition phase
• eCat for publishing the data products
• Software and Object catalogues in the future
16. Standards on provenance
“Machine readable” could be:
- An ISO19115 metadata statement per dataset contributing to a PROV-DM provenance graph
[Diagram: each source Dataset has a record (1..n) in the Source Catalogue; products / subsets of data, services, reports and data products each have records (1..n) in eCat, linked back to their sources via derivedFrom.]
17. Standards on provenance
[Diagram: Dataset A (CC-By), Dataset B (Commercial) and Software C (CC-By) are the ancestors of Dataset D (Commercial); a derived / aggregated dataset will inherit a licence (CC-By, Commercial, CiC, …) from its ancestors, e.g. through a licence-aggregating WMS.]
19. Thank you.
Margie.smith@ga.gov.au
Editor's notes
Hi there!
My name is Margie Smith and I have worked at Geoscience Australia since November 2016 in the Science Data Governance and Policy team… a team of two.
I came across to help GA meet its obligations under the National Archives of Australia’s Digital Continuity 2020 Policy, to bring some external policy knowledge into the organisation and to provide governance guidance around science data management.
In response to the National Archives Digital Continuity 2020 Policy and other Australian Government Open Data policies, government organisations have been tasked with making their data holdings visible and available.
Making data open is not new to GA but there is most definitely now a whole of government push for access to all data domains.
I have produced several documents to meet the DC2020 data governance milestones, but as you can see from this diagram, there has to be a balance of both oversight and execution across the data lifecycle – having one without the other will either produce a pile of documents that nobody reads, or a plethora of silos of excellence generating portals, datasets and services that only those in the know can find and use.
Whilst there are a series of external drivers for data management, use and re-use, there are also strong drivers currently within the organisation.
For example:
the cost of collecting or acquiring the data
the cost of not finding data previously acquired or
finding data and not being the person who ‘knows’ all about it
succession planning
analogue collections – diaries or paper products that have yet to be digitised
general public servant obligations like the Archives Act
and, of course, GA’s Science Principles and vision.
Provenance will support the organisation through enabling data re-use (as you can now find it) and allow for transparent science and advice through understanding the data supply chain.
At the moment, our metadata records indicate provenance of the data through the lineage statement or in the abstract.
As shown in these examples, the provenance of a dataset or product is usually free-text and can be semi-structured or unstructured.
Very concise or…
… not exactly concise.
Here the abstract includes everything you need to know about the Coastal inundation modelling for Busselton, Western Australia, under current and future climate.
Whilst this provenance information is very useful, it is not particularly useable; and by useable*, I mean its ability to be located, retrieved, presented and interpreted – by person or ideally, by machine search.
*from the ISO 15489-1:2016 Information and documentation -- Records management -- Part 1: Concepts and principles
As an example of why we need provenance for data reuse, I have made up a scenario.
In this scenario, the advice was generated from the complete dataset at the time.
A scientist generated a model using algorithms and provided advice based on the output of the model.
The advice, assuming it was of a general nature, is then made available through the catalogue – generally as a PDF document.
The metadata for the advice gives the name of the dataset used, the area that the advice covers, the organisation as author of the report, and perhaps some of the methodology used in the generation of the report.
In most cases, you could link the advice to the name of the dataset that was used to generate the advice, but not easily to the scientist or team and the models used to generate the advice.
So this provenance model of a data product could work well as a highly structured PROV system.
My colleague Nick Car gave a presentation on GA’s PROV model to ANDS in March and I suggest you watch that for specific information about the model at Geoscience Australia.
Adapting Nick’s model, I have tried to replicate my previous scenario – modelling what we are working towards at GA.
This is currently happening through lineage and association with digital objects rather than a true PROV model of digital objects.
Working from right to left, the Advice would have a metadata record in eCat, our electronic catalogue, that indicates the process used to generate the advice: the temporal subset of the dataset the advice is based on, the software or models applied to the data, information around that data's acquisition, and the reason the advice was required.
If the data is to be re-used in future advice, it might also be helpful to know what models were tried previously that didn’t work.
For our catalogue-like things, we need to gradually add the ability to link Entities, Agents, Activities etc. to be able to use graph-structured provenance (PROV-DM) across multiple types of objects and across multiple systems in the future.
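As a minimal sketch of what that graph-structured provenance could look like, here is the advice scenario expressed with the open-source 'prov' Python package (pip install prov). This is an illustration only, not GA's actual tooling, and all of the ex: identifiers are invented:

from prov.model import ProvDocument

# Build a small PROV-DM graph for the advice scenario.
doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

subset = doc.entity('ex:dataset-a-temporal-subset')  # prov:Entity: the data used
advice = doc.entity('ex:advice-report')              # prov:Entity: the output
plan = doc.entity('ex:event-code-query')             # the code/query acting as a prov:Plan
scientist = doc.agent('ex:scientist')                # prov:Agent
process = doc.activity('ex:advice-generation')       # prov:Activity

doc.used(process, subset)                        # the activity used the temporal subset
doc.wasAssociatedWith(process, scientist, plan)  # who ran it, and to what plan
doc.wasGeneratedBy(advice, process)              # the advice wasGeneratedBy the activity
doc.wasDerivedFrom(advice, subset)               # entity-to-entity derivation link

print(doc.get_provn())  # human-readable PROV-N serialisation

A graph like this could then be linked from the eCat metadata record so that a person, or a machine, can follow the advice back to the data, models and people behind it.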
In my role I am particularly interested in the repeatability of advice given by any government entity. Per the Archives Act, advice of this type given by government must be stored for a period of years and include the models, algorithms, software and data used to generate the advice. It is a safety net for the entity and the public servants that generated the advice at that point in time.
This is currently a manual process, heavily reliant on the individual generating the advice and storing it appropriately.
It would be excellent if the work we are currently undertaking would make it a lot easier for scientists to generate and catalogue this advice in the future.
Prior to sorting out what I wanted to include in this presentation, I had another look at the FAIR principles for data reuse.
Looking at these principles, I was feeling a lot better about what has been achieved at GA in the last 18 months.
We have a public catalogue, it has a clear and accessible data usage license and the standards used for cataloguing are in the spatial domain.
The lineage in a metadata record has been the de facto ‘data provenance’ to date.
We are currently working on multi-domain metadata retrieval from our catalogue; for example, we will be able to export records in AGRIF for Records Management, ISO19115 for spatial and DCAT for the National Archives.
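As a rough illustration of what multi-domain retrieval involves, a crosswalk re-keys one internal record into each target schema. The mappings below are invented for the sketch and are not the real AGRIF, ISO 19115 or DCAT element names:

# Hypothetical crosswalk sketch: one internal record, several export schemas.
# Field names on both sides are illustrative only.
RECORD = {
    "title": "Coastal inundation modelling for Busselton, WA",
    "abstract": "ANUGA hydrodynamic model outputs ...",
    "lineage": "Derived from LiDAR, Navy soundings and 1 second SRTM DEM.",
    "licence": "CC-BY 4.0",
}

CROSSWALKS = {
    "iso19115": {"title": "citation.title", "abstract": "abstract",
                 "lineage": "resourceLineage.statement"},
    "dcat": {"title": "dct:title", "abstract": "dct:description",
             "licence": "dct:license"},
}

def export(record: dict, schema: str) -> dict:
    """Re-key an internal record for one target schema; unmapped fields are dropped."""
    mapping = CROSSWALKS[schema]
    return {target: record[source] for source, target in mapping.items() if source in record}

print(export(RECORD, "dcat"))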
The Google search is already enabled in the search panel on the ga.gov.au splash page – this enables a search of both the website and the catalogue for content.
In June, I was fortunate to attend a technical meeting of the Open Geospatial Consortium, an international spatial standards organisation. It was evident in discussions there that many other countries are also working towards delivering their catalogues in formats other than spatial, to enable searching by other domains.
We have a new catalogue, our eCat, where metadata records will have:
a persistent identifier
a clear license for data re-use
direct access to the data or product from the metadata record
and links from data records to the services and portals that use them, and vice versa.
At the moment, we are working to publish the 19115-3 catalogue schema and codelists that are used by GA in the catalogue.
In terms of oversight, we have data product plans, roles and responsibilities, and workflows for the release of products from GA through eCat which is a longstanding and well understood process.
For the past month, my area has been undertaking work to highlight the need for science areas to focus on a data-first rather than product-first view. This data-first process will echo the data product publishing workflows and have a dedicated internal catalogue we are calling SourceCat.
SourceCat is a clone of the eCat software and is being trialled within two areas of GA before being released across the organisation.
Once we have this in place, being able to show provenance from the product to the data will be made easier, as we start the process at the beginning rather than trying to remediate at the product-publishing end of a project.
This is a view of our new eCat – the electronic catalogue for products generated at GA.
We have moved to the newer metadata standard for Australian Spatial Data, the ISO 19115-1:2014 which you can see indicated on the page.
There are also Keyword lists, which have been somewhat free-form to date. We have now selected well-defined vocabularies where they exist and are working with the custodians to publish them, whilst at the same time wrapping a governance structure around their maintenance and future extension.
There is a persistent id and data download is indicated.
When you go into the actual metadata record from the search, the information and links are clearly itemised.
Here is an example where the link to the portal and the associated services is shown but, as stated in the record, access to the data isn't available:
“Please note: As these data are stored on a Corporate system, we are only able to supply the web services (see download links).”
In the scenario I gave before, I pictured how the provenance of a data product would work well as part of a highly structured PROV model.
The structure required supports data provenance and re-use even if it doesn’t become a PROV system immediately.
The Source Catalogue is currently being built as a proof of concept for two science areas in the organisation with the intention of making it an agency tool for all data that is acquired or created.
In the future we intend to have a Software Catalogue and Objects Catalogue so that the software or models used in data curation or data products can be included as per PROV models. These are all clones of the eCat software.
With this comes the need to support the organisation with tools and documented procedures that in the future will become automagic processes to bring data into the building. This support is more of the oversight and execution balance that I spoke of earlier.
We are also using the catalogue standard to introduce elements that will align with a future PROV model.
We will be including the element ‘derivedFrom’ in the metadata record.
In the future, if a product does not have a ‘derivedFrom’ element, it will not be published.
Further into the future we will include the element ‘haveProv’, which is different to lineage, as it is forward facing – linking the data to all products that have used it.
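Purely as an illustration (the catalogue is not, of course, driven by Python dictionaries), the publish gate on 'derivedFrom' and a forward-facing 'haveProv' index might behave like this sketch:

# Illustrative sketch: 'derivedFrom' and 'haveProv' are the element names
# from the talk; the record structure and helpers are hypothetical.
def can_publish(record: dict) -> bool:
    """A product with no 'derivedFrom' links cannot be published."""
    return len(record.get("derivedFrom", [])) > 0

assert can_publish({"title": "Inundation product", "derivedFrom": ["source-dataset-1"]})
assert not can_publish({"title": "Orphan product"})

def register_derivation(registry: dict, product_id: str, source_ids: list) -> None:
    """Maintain 'haveProv' as the forward-facing inverse of 'derivedFrom'."""
    for source_id in source_ids:
        registry.setdefault(source_id, {}).setdefault("haveProv", []).append(product_id)

registry = {}
register_derivation(registry, "inundation-product", ["source-dataset-1"])
print(registry)  # {'source-dataset-1': {'haveProv': ['inundation-product']}}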
By having all these links embedded, Nick explained that this will allow a machine-readable PROV record to link to a metadata record to indicate provenance exists. He then started talking about PROV bundles and lost me, but hopefully all these steps will lead to the working PROV model of the future GA.
I was also thinking about the next talk on licensing frameworks. In this future machine-to-machine scenario, the licenses of aggregated products may be determined through an automated rule set depending on the way the data product is delivered.
In this example a dataset and its associated web service have differing licences. For third-party aggregated data use this process is currently determined through extensive written agreements for each product.
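One plausible shape for such an automated rule set is "the most restrictive ancestor licence wins". A minimal sketch, with an invented restrictiveness ordering that is illustrative rather than a legal determination:

# Sketch of licence inheritance for a derived/aggregated product.
# The ordering below is made up for illustration.
RESTRICTIVENESS = {"CC0": 0, "CC-By": 1, "CiC": 2, "Commercial": 3}

def inherited_licence(ancestor_licences):
    """The derived product inherits the most restrictive ancestor licence."""
    return max(ancestor_licences, key=lambda licence: RESTRICTIVENESS[licence])

# Dataset A (CC-By) + Dataset B (Commercial) + Software C (CC-By), as on slide 17:
print(inherited_licence(["CC-By", "Commercial", "CC-By"]))  # -> Commercial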
Finally, it takes a lot of work to remediate legacy metadata records.
Are we going to remediate every single one of our legacy data records? NO – or at least not straight away. Not all data is high value nor does all data have to be highly useable, but all data acquired and data products created should be FAIR.
To re-use data, it is necessary to understand its provenance to assess if it is fit for purpose. In working towards a PROV model and implementing tools like the SourceCat, we are further along the path to achieving GA's vision to fully maximise our data potential.