"Visualising the Australian open data and research data landscape" at C3DIS May 2018 in Melbourne. In this talk, we presented work around the visualisation of an survey of open government and research data in Australia. This features a first attempt at formalising a quantitative based approach to measuring the data ecosystem in Australia.
1. Visualising the Australian open data and research data landscape
LAND AND WATER
Jonathan Yu, Simon Cox, Benjamin Leighton, Hendra Wijaya, Qifeng Bai
C3DIS Melbourne, May 2018
3. Research questions
Understand the current state of the data landscape in Australia and measure its complexity.
Survey Questions:
• How much data is available?
• How varied?
• Are they interoperable? Metadata and keywords used
Implications
7. Data ecosystem in Australia
Citizens, Research, Industry, Government
Information ‘prosumer’ community
Existing climate (and other) prediction information infrastructure e.g. ACCESS, CCIA
Academies and research centres (CRCs, NESP hubs, CoEs)
ECOcloud
Research Infrastructures
Government Data Infrastructures
… Need a more holistic picture than what is promoted in the mainstream
Slide courtesy of Paul Box (CSIRO)
8. CSIRO Knowledge Network
http://kn.csiro.au
URIs/HTTP Identifiers for everything…
Open Data Search across Gov + Research
Indexing:
137,441 datasets
174,809 geo-features
In-development:
Python and R libraries for in-situ analysis
Web portal
Dataset sources list
Search example
9. Open Data Survey Methodology
1. Query – based on Knowledge Network indexes for format types, and keywords/tags/subjects against known metadata formats – CKAN, GeoNetwork, Socrata, CSIRO DAP
2. Fetch file sizes – via web crawler over every resource URL
3. Query special cases – size and file resources – CSIRO DAP, known THREDDS servers (CSIRO, NCI, AODN/IMOS, TPAC, TERN OzFlux)
4. Aggregate and visualise – source, formats, format types*, keywords
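Step 2 of the methodology could be sketched in Python roughly as follows. This is a minimal illustration, not the crawler used for the survey: it assumes each resource URL answers an HTTP HEAD request with a Content-Length header, and the function names `parse_content_length` and `fetch_size` are hypothetical.

```python
import urllib.request
from typing import Mapping, Optional


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the size in bytes from a Content-Length header, or None."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    value = value.strip()
    # Ignore malformed values rather than guessing a size.
    return int(value) if value.isdigit() else None


def fetch_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return the reported size, or None on failure."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_content_length(response.headers)
    except (OSError, ValueError):
        # Unreachable host, HTTP error, timeout or malformed URL.
        return None
```

Resources that do not report a size come back as None, which is one reason a crawl like this only yields file size information for some resources.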
10. Repositories considered
Columns: Open Gov | Open Research | CKAN | THREDDS | Socrata | WS / Other | File size info
data-gov-au, data-nsw-gov-au, data-qld-gov-au, data-sa-gov-au, data-wa-gov-au, nsw-oeh, nsw-seed, www-data-vic-gov-au: Yes Some Yes Contained some Yes
AODN, CSIRO THREDDS, NCI, TERN OzFlux, TPAC: Yes Yes Contained some Yes
CSIRO DAP: Yes Contained some Yes
ALA, AURIN, TERN: Yes Yes
GA: Yes Yes
data.act, data.melbourne: Yes Yes
13. Frequency - # datasets per source published over time
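A count of datasets per source over time could be derived along these lines. This is a sketch under assumptions: each record is taken to be a (source, ISO 8601 publication date) pair, the sample records are made up, and grouping by year is an illustrative choice rather than the binning used in the talk.

```python
from collections import Counter
from datetime import datetime


def datasets_per_source_year(records):
    """Count datasets per (source, year) from (source, ISO date) pairs."""
    counts = Counter()
    for source, published in records:
        year = datetime.fromisoformat(published).year
        counts[(source, year)] += 1
    return counts


# Hypothetical records for illustration only.
sample = [
    ("data-gov-au", "2016-05-01"),
    ("data-gov-au", "2016-11-20"),
    ("nsw-seed", "2017-01-15T08:30:00"),
]
counts = datasets_per_source_year(sample)
```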
14. Variation by format and volume (counts of datasets)
API – Web services and APIs
code – Software code
z – Zipped or compressed files
db – Database dumps
doc – documents
gis – Geospatial files
media – Images, videos and other media
otr – Other
str – Structured data (XML, JSON, netCDF)
csv – Tabular e.g. CSV and Excel files
web – Web documents e.g. HTML
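Grouping raw format labels into the format types listed above could look something like this. The mapping itself is an assumption made for illustration (the slides do not give the actual lookup table), and `classify_format` is a hypothetical helper name.

```python
# Assumed mapping from raw format labels to the format types in the legend.
FORMAT_TYPES = {
    "api": {"api", "wms", "wfs", "wcs"},
    "code": {"py", "r", "js"},
    "z": {"zip", "gz", "tar", "7z"},
    "db": {"sql", "mdb", "sqlite"},
    "doc": {"pdf", "doc", "docx", "txt"},
    "gis": {"shp", "kml", "kmz", "geotiff"},
    "media": {"jpg", "jpeg", "png", "mp4"},
    "str": {"xml", "json", "netcdf", "nc"},
    "csv": {"csv", "xls", "xlsx"},
    "web": {"html", "htm"},
}


def classify_format(raw: str) -> str:
    """Map a raw format label to a format type, defaulting to 'otr' (Other)."""
    key = raw.strip().lower().lstrip(".")
    for format_type, members in FORMAT_TYPES.items():
        if key in members:
            return format_type
    return "otr"
```

Counting datasets per format type is then a matter of applying `classify_format` to each resource's format label and tallying the results.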
15. Variation by format and volume (counts of datasets)
Quite a bit of CSVs
16. Variation by format and volume (counts of datasets)
Look here for GIS data
17. Variation by format and volume (counts of datasets)
Structured data prolific
18. Variation by format and volume (counts of datasets)
Lots of APIs!
19. Variation by format and volume (counts of datasets)
Compressed files popular too
21. Volume - How much data? Open Gov Data
Aggregated size by repository source (gigabytes)
https://bl.ocks.org/jyucsiro/227dcf04d0b30af3f09435f4d9ef9739
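Aggregating size by repository source could be done roughly as below. This is a sketch, assuming the crawl yields (source, size in bytes) pairs with None where no size was reported, and assuming decimal gigabytes (10^9 bytes); the slides do not say which convention was used.

```python
from collections import defaultdict

BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def aggregate_size_gb(resources):
    """Sum resource sizes per source and convert the totals to gigabytes."""
    totals = defaultdict(int)
    for source, size_bytes in resources:
        if size_bytes is not None:  # skip resources with no reported size
            totals[source] += size_bytes
    return {source: total / BYTES_PER_GB for source, total in totals.items()}


# Hypothetical sizes for illustration only.
sample = [
    ("data-gov-au", 1_500_000_000),
    ("data-gov-au", 500_000_000),
    ("data-sa-gov-au", None),
    ("data-wa-gov-au", 250_000_000),
]
sizes_gb = aggregate_size_gb(sample)
```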
22. Volume - How much data? Open Research data + Open Gov Data
Open Gov Data – 1.7 TB
CSIRO DAP – 672.3 TB
Total THREDDS – 272.9 TB
Axis: Terabytes
23. Volume - How much data? Open Research data + Open Gov Data
Open Gov Data – 1.7 TB
Astronomy data
Axis: Terabytes
24. Volume - How much data? Open Research data + Open Gov Data
Open Gov Data – 1.7 TB
Total THREDDS – 272.9 TB
CSIRO DAP* – 74.1 TB
* Removed astronomy data
Axis: Terabytes
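The volume figures quoted on these slides can be combined with some simple arithmetic. The sketch below uses only the numbers shown; it assumes that the difference between the two CSIRO DAP figures is entirely the removed astronomy data, which the slides imply but do not state outright.

```python
from decimal import Decimal

# Figures in terabytes, copied from the slides.
OPEN_GOV = Decimal("1.7")
THREDDS_TOTAL = Decimal("272.9")
DAP_ALL = Decimal("672.3")
DAP_NO_ASTRONOMY = Decimal("74.1")

# Totals with and without the astronomy holdings in CSIRO DAP.
total_all = OPEN_GOV + THREDDS_TOTAL + DAP_ALL
total_no_astronomy = OPEN_GOV + THREDDS_TOTAL + DAP_NO_ASTRONOMY

# Assumption: the gap between the two DAP figures is the astronomy data.
astronomy = DAP_ALL - DAP_NO_ASTRONOMY

# Share of open government data once astronomy is excluded.
open_gov_share = OPEN_GOV / total_no_astronomy

print(f"Total (all): {total_all} TB")
print(f"Total (no astronomy): {total_no_astronomy} TB")
print(f"Astronomy (inferred): {astronomy} TB")
print(f"Open Gov share: {open_gov_share:.2%}")
```

Even with astronomy excluded, open government data is well under one percent of the surveyed volume by these figures.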
26. Variation by format and volume (bytes)
[Chart: data volume by format type, scale from 0 to >100 TB; format types as listed on slide 14]
27. Variation – Keywords…
Peeking into the content
Keywords found: 667,965, of which 25,000 unique
Top 5 CKAN tags (chart axis 0 to 20,000): Earth Sciences, Oceans, GA Publication, Ocean Temperature, Water Temperature
Top 5 GeoNetwork subjects (chart axis 0 to 400): environment, EARTH SCIENCES, National Computational Infrastructure (NCI), climatologyMeteorologyAtmosphere, ATMOSPHERIC SCIENCES
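Keyword totals and top-N lists of this kind could be produced with a simple frequency count. The sketch below is illustrative only: it assumes each dataset contributes a list of keyword strings, and it folds case and whitespace so that variants such as "Earth Sciences" and "EARTH SCIENCES" are counted together, which may differ from how the survey treated them.

```python
from collections import Counter


def keyword_counts(keyword_lists):
    """Count keyword occurrences after normalising case and whitespace."""
    counts = Counter()
    for keywords in keyword_lists:
        for keyword in keywords:
            normalised = " ".join(keyword.split()).casefold()
            if normalised:
                counts[normalised] += 1
    return counts


# Hypothetical per-dataset keyword lists for illustration only.
sample = [
    ["Earth Sciences", "Oceans"],
    ["EARTH SCIENCES", " oceans "],
    ["environment"],
]
counts = keyword_counts(sample)
total_found = sum(counts.values())  # all keyword occurrences
unique_found = len(counts)          # distinct keywords after normalising
top_two = counts.most_common(2)     # basis for a "Top N" chart
```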
43. Variation - Keyword co-occurrences and correlations
Entirely automated from metadata catalogues.
The intent was to survey, but this correlation data could have interesting applications…
• “Content signatures” – summarise, compare
• Keyword suggestions
• Trends over time
• (Semi-)automated metadata generation
Better if keywords were stronger semantic tags (e.g. controlled vocabularies)…
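Keyword co-occurrence can be counted directly from per-dataset keyword lists. The following is a minimal sketch, assuming two keywords co-occur when they appear on the same dataset record; the function name and sample data are hypothetical, and the talk's own correlation measure may differ.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrences(keyword_lists):
    """Count how often each unordered keyword pair shares a dataset."""
    pairs = Counter()
    for keywords in keyword_lists:
        # De-duplicate and sort so each pair has one canonical ordering.
        unique = sorted({k.strip().casefold() for k in keywords if k.strip()})
        pairs.update(combinations(unique, 2))
    return pairs


# Hypothetical per-dataset keyword lists for illustration only.
sample = [
    ["Oceans", "Ocean Temperature", "Water Temperature"],
    ["Oceans", "Ocean Temperature"],
    ["Earth Sciences"],
]
pairs = keyword_cooccurrences(sample)
```

Pairs with high counts are candidates for the keyword suggestions mentioned above: a dataset tagged with one keyword of a frequent pair could be offered the other.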
44. Reflections on visualisations
There is some great tooling in the data analytics space, particularly R, for producing visualisations and exploring data
Opportunities
• Regular updates
• UI elements (like sliders) to filter and select ranges, e.g. keyword analysis, subsets of organisations, date ranges
45. Reflections on survey and future work
First-cut quantitative survey across the open data ecosystem – government and research.
Methodology could be equally applied to internal data repositories.
Provider-centric perspective – we need usage/reuse perspectives!
Future work
• Allow better filtering (e.g. on source, clusters) and address gaps
• Unify qualitative and quantitative – provider and user perspectives
• Annual national “State-of-Open-Data” analysis valuable?
46. Summary & Take Home Messages
Australia ranks highly on the open data scale, but we lack a consistent approach to quantifying the data landscape
Presented a proof-of-concept, first-cut quantitative survey of open data in Australia
“Open Data” needs to include Open Research Data
• Shown volume and velocity beyond just the Aus Gov Data
Future work is needed to gain better insight into the variety of data being published, and to build tools for understanding how the data is actually used, which is not well understood.
47. Jonathan Yu
Research Data Scientist
Jonathan.Yu@csiro.au
@jonyu6
@jyucsiro
Thank you
Link to data visualisations:
https://gist.github.com/jyucsiro/3bea05a89770b9651e97f3f57d7eb332
53. Variation by format and volume (bytes) – exclude CSIRO DAP
[Chart: data volume by format type, scale from 0 to 1 TB; format types as listed on slide 14]
54. Variation by format and volume (bytes) – exclude CSIRO DAP
[Chart: data volume by format type, scale from 0 to 1 TB; format types as listed on slide 14]
Large volumes of structured data in THREDDS
55. Variation by format and volume (bytes) – exclude CSIRO DAP
[Chart: data volume by format type, scale from 0 to 1 TB; format types as listed on slide 14]
Large volumes of compressed files in these Open Gov Data
56. Variation by format and volume (bytes) – exclude CSIRO DAP
[Chart: data volume by format type, scale from 0 to 1 TB; format types as listed on slide 14]
Documents can amount to large volumes too!
“Open Data” tends to be equated with Open Gov Data, which ignores the research infrastructures in Australia and the data collected by NCRIS, universities, research institutions (e.g. CSIRO), etc., much of which is open research data.
Data ecosystems are complex and analogous to natural ecosystems
Individual > populations > communities within ecosystems
- diverse, dynamic and adaptive
- Each ecosystem has its own distinct ‘context’: a set of conditions arising from the combination of its constituent parts (culture, behaviours, interactions, systems, installed base)
2. Data flows through this ecosystem like food webs e.g. biodiversity data
- data is collected by many agencies, then consumed, aggregated and processed to support others in the ecosystem
Consumers bear a high cost of aggregation and integration because the data is diverse
Fragile supply chains
Need data governance – PC NDC
Agreements including standards around data are necessary to enable data to flow
Rnet – academic and research internet provider (telecoms networks layers) – role in connecting research sensors in Research IoT