Geo Analytics Canada Overview - May 2020
2. • Satellite EO data is now too big to analyze using traditional desktop analytic tools
• It is impossible to analyze satellite EO data over wide areas and deep time series using traditional tools
NASA EO archive (EOSDIS) growth: approaching 246 PB in 2025
3. • Bring your algorithm to the data, not the other way around
• Embrace big data tools and systems used in other areas
• Transition away from desktop analytics to cloud-native analytics
• This new era requires partnerships between IT and satellite EO experts
• Demonstration proof-of-concept platform: www.geoanalytics.ca
4. • We can help you with your Big Geospatial Data Analytic problems
• Work with us to build & host your own platform
• Hatfield can provide embedded geospatial analytic experts to support your project or initiative
For example: wetland classification; ecosystem disturbance monitoring and recovery assessment; wildfire mapping; forest extent and biomass; water dynamics, including river and lake ice; Leaf Area Index (LAI) for water balance studies; land use and land cover change
5. • We did not want to build a closed platform that requires all data and tools to be centralized in one place
• Instead, we want to develop an ecosystem of open-architected systems that assume data and processing resources will be distributed
• This platform demonstrates a starting point towards this open-architected, distributed ecosystem approach
6. • Cloud native
  • Our solution is built from the ground up to support the power of cloud computing, rather than simply migrating desktop apps to the cloud
• Developer friendly
  • Develop your own algorithms and systems in Python, and scale them dynamically and massively
• Desktop app friendly
  • Run your Linux desktop geospatial analytic apps in a browser
• Demonstrates the latest user- and machine-friendly OGC protocols
  • OGC API-Features and STAC
7. • Infrastructure vendor agnostic
  • All tools and systems can be installed on a wide variety of cloud computing providers. This allows us to pursue hybrid and multi-cloud architectures that exploit pre-existing distributed data stores
• Supports open science
  • All tools and systems support the key tenets of open science: “openness, transparency, scrutiny and traceability of results, access to large volume of complex data, and the availability of community open tools”
• Canadian focused
  • Uses only Canadian data storage and compute resources. This supports Canadian organizations that must comply with Canadian privacy laws requiring data to be kept in Canada.
9. • Based on Hatfield’s direct experience with ESA big data analytic platforms:
  • European Space Agency Thematic Exploitation Platforms (TEPs), Copernicus Data and Information Access Services (DIAS), etc.
• Informed by competitive analysis of other internationally known platforms:
  • OpenDataCube, Google Earth Engine, Hexagon’s M.appX, CS-SI’s GeoStorm, FAO’s SEPAL, EarthServer’s Rasdaman, Terradue’s Ellip, EOS’s Platform, DigitalGlobe’s GBDX, and Radiant.Earth’s platform
10. Platform architecture (diagram):
• Infrastructure as a Service
  • Object Storage: EO, ARD + project shared data storage; Docker image storage
  • NFS Storage: user secure data storage
  • Kubernetes Compute Cluster (on-demand compute): core system nodes; per-user private interactive compute nodes; on-demand scalable compute nodes
• System Functions
  • STAC indexing of EO assets
  • GT Data Store with OGC API-Features
  • OpenLDAP + DEX authentication
  • KubeFlow batch processing and machine learning
  • Web-map tile generation
  • EO data pre-processing functions
  • Cost accounting
• Software as a Service: Web Portal
  • GitLab: private code repository + Docker registry
  • Jupyter-Lab model development environment
  • System documentation + examples
  • Desktops + tools (QGIS, SNAP, etc.) in a browser
  • GT data upload and management functions
  • EO data query + discovery
  • User + cost management
  • File Browser
13. • Key requirements:
  • Managed Kubernetes clusters – dynamically scheduled and scaled containerized workloads
  • Availability of pre-emptible nodes – large-scale computations done in a cost-effective manner
  • A Canadian data center – to comply with Canadian data residency requirements
• Selected: Google Cloud
  • Meets all of the above requirements
  • Already hosts the Landsat 4-8 and Sentinel-2 collections, so there is no need to duplicate them
14. • Vendor neutrality:
  • GEO Analytics Canada uses technologies available on all major cloud hosting providers
  • APIs and layers of abstraction have been used to assure neutrality
• Vendor neutrality allows us to pursue multi-cloud integrations
  • For example: distributed machine learning, with compute done close to pre-existing data stores
15. • Entirely based on Kubernetes (K8s)
  • An open-source system for automating deployment, scaling, and management of containerized applications
• Analytics is done in parallel on many worker nodes to conduct big data analytics in a performant manner
• Pre-emptible nodes make on-demand compute very inexpensive
• Applications and users request compute resources (# of CPUs & GBs of RAM), which are provided on demand within seconds
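The resource-request model above maps onto the standard `resources` block of a Kubernetes container spec. The sketch below is illustrative only (the container name and image are hypothetical), using the 1 vCPU / 5 GB-per-pod figures quoted later in this deck:

```python
# Illustrative sketch of a Kubernetes container resource request.
# The name and image below are hypothetical, not actual platform values.
container_spec = {
    "name": "eo-worker",                          # hypothetical container name
    "image": "example-registry/eo-worker:1.0",    # hypothetical image
    "resources": {
        "requests": {"cpu": "1", "memory": "5Gi"},  # what the workload asks for
        "limits": {"cpu": "1", "memory": "5Gi"},    # hard cap enforced by K8s
    },
}

def total_cpus(specs):
    """Sum the CPU requests across a list of container specs."""
    return sum(int(s["resources"]["requests"]["cpu"]) for s in specs)

# e.g. 36 identical workers, as in the NDVI pipeline example later in this deck
print(total_cpus([container_spec] * 36))  # → 36
```

Kubernetes uses the `requests` values to schedule pods onto nodes with available capacity, which is what makes the "provided within seconds" on-demand model work.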
16. • Object storage
  • Highly durable, with built-in redundancy
  • Scales to exabytes of data
  • Lowest cost
• On the Demonstration Platform, the following are stored in object storage:
  • Raw satellite EO data, including all downloaded MODIS products
  • Analysis-ready satellite data (ARD)
  • User and project team shared files
  • Docker container images
17. • NFS storage service
  • Compatible with all Linux-based systems used on the demonstration platform
  • Used to store users’ personal home directories
  • Secure – only available to a specific user (cannot be shared)
  • Transfer to a project team storage area (on the object store) if sharing is required
  • Back-end storage is a standard SATA disk
20. • All applications and APIs require users to be authenticated
• User management and profiles through LDAP
• Single Sign-On
  • Uses the industry-standard OAuth 2 protocol
  • Users only need to log in once to gain access to all applications
  • APIs require a token to authenticate
22. • Web-browser-based browse and search interfaces
  • Browse and search all datasets
  • Query and view collections by time and location
• SpatioTemporal Asset Catalog (STAC) API of all EO datasets
• OGC API-Features (WFS3) compliant metadata server
• API documented at www.stacspec.org
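A STAC API search is an HTTP POST of a small JSON document to the catalog's `/search` endpoint (the path defined by the STAC API spec). The sketch below builds such a payload; the bounding box and date range are illustrative values, and only the `landsat-8-l1` collection id comes from this deck:

```python
import json

# Illustrative STAC API search payload: find Landsat-8 scenes over a small
# bounding box for the 2018 growing season. bbox and datetime are examples.
search = {
    "collections": ["landsat-8-l1"],        # collection id from this deck
    "bbox": [-73.0, 45.5, -72.5, 46.0],     # [west, south, east, north], WGS84
    "datetime": "2018-05-01T00:00:00Z/2018-09-30T23:59:59Z",
    "limit": 100,
}

# This JSON body would be POSTed to the catalog's /search endpoint;
# the response is a GeoJSON FeatureCollection of matching STAC items.
payload = json.dumps(search)
print(payload)
```

Each returned item carries links to the underlying assets (e.g. cloud-hosted GeoTIFFs), which is what lets clients analyze data in place rather than downloading whole archives.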
23. • Current EO data collections:

Collection Name          Description                                                      Time Period Available
landsat-8-l1             Landsat-8 images over eastern Canadian landmass (Manitoba east)  2013-2020
modis.MCD12Q1            MODIS Land Cover                                                 2000-2020
modis.MOD09GQ            Terra Surface Reflectance                                        2000-2020
modis.MOD09Q1            Terra Surface Reflectance                                        2000-2020
modis.MOD11A1            Terra Land Surface Temperature and Emissivity                    2000-2020
modis.MOD11A2            Terra Land Surface Temperature and Emissivity                    2000-2020
modis.MOD13Q1            Terra Vegetation Indices                                         2000-2020
modis.mod09gq.veg.ndvi   NDVI derived from Terra Surface Reflectance                      2000-2020
modis.mod09gq.veg.evi2   EVI2 derived from Terra Surface Reflectance                      2000-2020
25. • Fully uses the computing power and scalability of the IaaS tier
  • Multi-stage data processing pipelines
• Enables containerized applications to be put into a processing chain that can be scaled massively
• Implemented using KubeFlow
  • Primarily designed to enable machine learning (ML) workflows
  • The same ML workflow constructs are repurposed for EO data ingestion and pre-processing
26. • Proof-of-concept EO data pipelines created:
  • Creates Level-2 Sentinel-2 products using Sen2Cor
  • Runs any set of commands available through ESA’s Sentinel Application Platform (SNAP) software
  • Downloads MODIS products to the object store and adds each product to the EO metadata system
  • Adds Landsat-8 images over the eastern Canadian landmass (i.e. Manitoba east) to the EO metadata system
  • Creates NDVI and EVI2 products from Terra Surface Reflectance products
  • Creates a daily thermal average product from Terra Land Surface Temperature products
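The NDVI and EVI2 steps use standard band-ratio formulas. A minimal numpy sketch of the per-pixel math (the band arrays are synthetic stand-ins for the Terra red and NIR surface-reflectance bands; a real pipeline run operates on whole rasters):

```python
import numpy as np

def ndvi(red, nir):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

def evi2(red, nir):
    """Two-band Enhanced Vegetation Index: 2.5 * (NIR - Red) / (NIR + 2.4*Red + 1)."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)

# Synthetic reflectance values standing in for MOD09GQ red/NIR bands (0..1 scale)
red = np.array([0.10, 0.05])
nir = np.array([0.50, 0.40])
print(ndvi(red, nir))   # healthy vegetation gives values well above 0
print(evi2(red, nir))
```

Because both indices are element-wise array operations, they parallelize trivially across the per-month pods described on the next slide.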
27. • NDVI and EVI2 derived from Terra Surface Reflectance pipeline:
  • Processing completed for all products available between 2000-2020
  • Results stored in object storage and indexed in the EO data query system
  • Results available through all platform systems, including the EO data query and discovery system, File Browser, desktop in a browser, etc.
• Runtime example:
  • 3 years of data (3 TB) processed in 13 hours
  • 36 processing pods (1 per month); each pod is allocated 1 vCPU, 5 GB RAM
  • Total cluster resources: 36 vCPU, 180 GB RAM
[Figure: viewing the NDVI product using QGIS through the ‘desktop in a browser’ system]
28. • 10 Sentinel-2 L1C tiles to L2A conversion
  • Typically ~3-4 hours
  • GEOAnalytics: ~28 minutes
30. • Python-based scalable data analytics
• Interacts with Kubernetes to provide on-demand scalable compute
• Core software systems:
  • Jupyter-Lab – provides the web application framework for interactive analytics
  • Xarray – provides an N-dimensional array interface and toolset
  • Iris – provides methods for analysing and visualising meteorological and oceanographic data sets
  • Dask – provides flexible parallel computing for analytics
  • Zarr – the next-generation, cloud-native file format for gridded datasets
31. • To conduct scalable data analytics:
  • Use Zarr as your on-disk data storage format
  • Use Xarray as your in-memory data interface
  • Use Dask to execute your code in parallel, with Kubernetes providing on-demand scalable compute
  • Lazy loading/execution throughout (the default for Xarray and Dask)
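A minimal sketch of that recipe, with synthetic data standing in for a real Zarr store (a real workload would open one lazily, e.g. with `xr.open_zarr`), and Dask's default local scheduler standing in for a Kubernetes-backed cluster:

```python
import dask.array as da
import xarray as xr

# Synthetic stand-in for an EO raster time series. The chunk layout defines
# the units of parallel work that Dask schedules across workers.
data = da.random.random((12, 1000, 1000), chunks=(1, 500, 500))  # time, y, x
cube = xr.DataArray(data, dims=("time", "y", "x"), name="reflectance")

# Building the computation is lazy: nothing is read or computed yet.
spatial_mean = cube.mean(dim=("y", "x"))

# .compute() triggers parallel execution across the chunks; on the platform,
# Dask would schedule these tasks onto on-demand Kubernetes workers.
result = spatial_mean.compute()
print(result.shape)  # one mean per time step: (12,)
```

The same code scales from a laptop to a cluster, because only the Dask scheduler changes, not the analysis.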
32. • Xarray and Dask
  • Used in both Australia’s Open Data Cube and the Euro Data Cube’s xcube core library
33. [Stack diagram: Jupyter-Lab on top of Xarray (Python N-dimensional array library) and Dask (Python library for distributed computing), running on the Kubernetes compute cluster over EO & GT data storage]
35. • Implements a “Pangeo” environment
  • www.pangeo.io
  • Supports both HPC and cloud infrastructure
• Similar in nature to the European Joint Research Centre’s “Earth Observation Data and Processing Platform” (JEODPP)
  • https://jeodpp.jrc.ec.europa.eu/home/
36. • Hatfield has started a library of example notebooks on how to use the Jupyter-Lab environment:
  • Access Landsat data through the STAC API and process/analyze it to create an NDVI time series
  • Query EO data hosted on GEOAnalytics.ca using OWSLib
https://github.com/geoanalytics-ca/example-notebooks
37. • NDVI Landsat-8 example notebook:
  • 30 nodes, 210 GB RAM, 60 CPUs
  • Random location close to Saint-Hyacinthe, QC
[Figures: NDVI of 2018 acquisitions; mean NDVI]
39. • Collaboration and sharing of source code with Git
• Private and shared repositories available
40. • The container registry is backed by the object store system
  • Cost-effective storage of large container images
• Images in the registry can be used in scalable workflows in the platform’s EO data ingestion and pre-processing systems
42. • Provides users with their own personal Ubuntu desktop environment
• Accessible through a browser
• Enables data exploration directly on the platform, reducing the need to download data
• Users can select the amount of RAM + CPU on startup:
  • From 1 to 31 CPUs
  • From 1 to 116 GB RAM
43. • Pre-installed software (SNAP, QGIS, Firefox, etc.)
• Users can install their own software and customize the desktop environment to be their own
• EO data stores are mounted in the desktop environment for easy access:
  • All Sentinel-2 data
  • All Landsat 4-8 data
  • Pre-processed data products
[Figure: viewing a Sentinel-2 product using QGIS through the ‘desktop in a browser’ system]
45. • Enables browsing and downloading of all data stored on the platform for use in external systems
• Users can view and download data from:
  • All EO data stores
  • Data shared between users of the platform
  • Their own personal data
47. • Vector ground-truth data can be uploaded, viewed and deleted
  • Users upload a SHP (Shapefile) file, which is imported into the system
• Organized into collections that contain features
  • A SHP file is a “collection”
48. • Features can be browsed and searched interactively
  • A web map displays features
• API endpoints implement the OGC API-Features specification (previously referred to as WFS3)
  • Implemented using pygeoapi
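The OGC API-Features interface is plain HTTP + JSON, so clients only need to assemble standard request paths. The sketch below builds the paths defined by the spec; the base URL and collection id are hypothetical, not confirmed platform endpoints:

```python
from urllib.parse import urlencode

BASE = "https://example.geoanalytics.ca/features"  # hypothetical base URL

def items_url(collection_id, bbox=None, limit=10):
    """Build the standard OGC API-Features request for a collection's items."""
    params = {"limit": limit}
    if bbox:  # bbox is (west, south, east, north) in WGS84
        params["bbox"] = ",".join(str(v) for v in bbox)
    return f"{BASE}/collections/{collection_id}/items?{urlencode(params)}"

# /collections lists all collections; /collections/{id}/items returns features
# as GeoJSON. The collection id below is an illustrative example.
print(f"{BASE}/collections")
print(items_url("wetland-ground-truth", bbox=(-73.0, 45.5, -72.5, 46.0)))
```

Because responses are GeoJSON over plain GET requests, the same endpoints serve both the interactive web map and programmatic clients.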
50. • We want to help you with your Big Geospatial Data Analytic problems
• Not a closed platform. Instead, let’s create open-architected systems that assume data and processing are distributed
• What makes this platform different:
  • Cloud native
  • Developer friendly
  • Desktop app friendly
  • Latest OGC protocols
  • Infrastructure vendor agnostic
  • Supports open science
  • Canadian focused
51. • The proof-of-concept platform demonstrates how [1]:
  • Existing stores of satellite EO data can be analyzed in place using cloud-computing resources, rather than requiring download
  • New modular and user-friendly metadata protocols, particularly SpatioTemporal Asset Catalogs (STAC), can be used to provide a search interface for satellite EO dataset discovery
52. • The proof-of-concept platform demonstrates how [2]:
  • The new OGC API-Features (WFS 3) standard can be used to manage and make available ground truth and other in-situ datasets
  • Satellite EO analytic programs in Python can be created interactively, and then scaled to analyze large areas and deep time series using the Xarray and Dask libraries
  • Ingestion, machine learning, analytical and pre-processing applications (both binary and Python based) can be linked to form scalable satellite EO data processing chains
53. • Bring your algorithm to the data, not the other way around
Email contacts:
info@geoanalytics.ca
jsuwala@hatfieldgroup.com
Speaker notes
HPC Deployments:
NCAR Cheyenne Cluster
NASA Pleiades Cluster
Columbia Habanero Cluster
CNES HAL
USGS Yeti
UW Hyak
DOD HPC at AFRL
Princeton Tiger
Pawsey Supercomputer
University of Miami Pegasus
NFS image from https://medium.com/platformer-blog/nfs-persistent-volumes-with-kubernetes-a-case-study-ce1ed6e2c266
DAG from https://www.slideshare.net/VictorZabalza/lens-data-exploration-with-dask-and-jupyter-widgets?from_action=save