These slides were presented by Giri Prakash from Oak Ridge National Lab at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
Modern Scientific Data Management Practices: The Atmospheric Radiation Measurement (ARM) Facility Data Center Architecture
1. Modern Scientific Data Management Practices: The Atmospheric Radiation Measurement (ARM) Facility Data Center Architecture
December 20, 2018
GIRI PRAKASH, RANJEET DEVARAKONDA, ROB RECORDS, KYLE DUMAS
ARM Data Center, Oak Ridge National Laboratory
AGU 100, December 12, 2018
2. ARM’s Vision
To provide a detailed & accurate description of the earth atmosphere in diverse climate regimes to resolve the uncertainties in climate and earth system models toward the development of sustainable solutions for the Nation’s energy & environmental challenges.
6. Data Discovery for LASSO
§ Based on a big data analysis platform (NoSQL)
§ ARM HPC clusters for data processing
§ Provides an interactive web interface for users to find simulations of interest through examination of the LES performance relative to select ARM observations
§ Allows users to visualize LASSO data bundle diagnostics and skill scores on the fly using plots and tables
Technologies shown: Cassandra, Spark, D3 & NodeJS
7. Data Retrieval, Packaging, and Delivery
[Diagram: data flow from discovery through retrieval to delivery]
Discovery (UI & Web services): link to data access, data quality, access to plots, DOI based citation guidance, publication request
Data staging order
Retrieval: merging, DQR filtering, conversion
NetCDF data extractions
Data sources: datastreams, HPSS, online copy
Other labels: Future capability, HPC ML, Live Data WS
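The merging and DQR filtering steps named on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `(timestamp, value)` sample layout, the `DqrInterval` type, and the function names are hypothetical and do not describe the ARM Data Center's actual retrieval code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple

# Hypothetical sample layout: (timestamp, measured value).
Sample = Tuple[datetime, float]


@dataclass(frozen=True)
class DqrInterval:
    """Hypothetical data quality report: a time span flagged as suspect."""
    start: datetime
    end: datetime


def merge_files(files: Iterable[List[Sample]]) -> List[Sample]:
    """Merge samples from several files into one time-ordered series."""
    merged = [sample for f in files for sample in f]
    merged.sort(key=lambda s: s[0])
    return merged


def dqr_filter(samples: List[Sample], dqrs: List[DqrInterval]) -> List[Sample]:
    """Drop samples whose timestamp falls inside any flagged interval."""
    return [
        s for s in samples
        if not any(d.start <= s[0] <= d.end for d in dqrs)
    ]


# Two made-up daily files, supplied out of order on purpose.
day1 = [(datetime(2018, 12, 1, 0), 1.0), (datetime(2018, 12, 1, 12), 2.0)]
day2 = [(datetime(2018, 12, 2, 0), 3.0)]
flagged = [DqrInterval(datetime(2018, 12, 1, 6), datetime(2018, 12, 1, 18))]

clean = dqr_filter(merge_files([day2, day1]), flagged)
```

Filtering after merging keeps the quality check independent of how the underlying files were split, which is one plausible reason to order the steps this way.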
9. Next-Gen ARM Computing Facility
Cumulus cluster
§ LASSO model operations and large-scale data analysis/visualizations
– 112 nodes (4,032 cores)
– 2 PB GPFS storage
Stratus cluster
§ Routine radar processing
§ Large-scale reprocessing
§ Complex VAP development
§ NoSQL-based advanced visualizations
§ Big data extractions for science users
§ Long-term data quality analysis
– 30 nodes (1,080 cores)
– 256 GB memory/node
– Lustre and 2 TB SSD per node
10. Data Pipeline and Software Architecture
[Diagram: data pipeline and software architecture]
Data Pipeline: Data Processing, Storage & Data Model, Querying, Analytics, Scientific Users
Software Architecture: Frontend, Analytic Server, Backend
Functions: Interface, Visualization, Analytics, Output
Components: Spark, ARM HPC Computing Clusters, JupyterLab, Relational Database, NoSQL Database
Annotations:
• Supports fast analysis of voluminous data
• Hides architectural complexities
• Stage data in HPC
• Metadata
• Order History
• Data from multiple instruments
Dr. Bhargavi Krishna, Yuping Lu, and Dr. Jitu Kumar
11. Data Citation and DOI Capabilities
Benefits
§ Allows users to cite the exact ARM data used in their research/publication
§ Allows ARM to provide proper data citation credits to the PIs and collaborators
§ Allows future data users and the project to easily track the data used in various articles
Challenge
§ Millions of data files from over 10,000 data products
§ Typically continuous datastreams, but some of them are from field campaigns
Strategy
§ DOIs are assigned at the data collection level
§ Recommended citation structure
§ Citation generator and resolver to help users
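To make the idea of a citation generator with collection-level DOIs concrete, here is a minimal Python sketch. It is not ARM's recommended citation structure, which the slide does not spell out: the `Collection` fields, the field order, and all example values (including the DOI `10.0000/example`) are placeholders assumed for illustration. The only real-world detail relied on is that DOIs resolve through `https://doi.org/`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Collection:
    """Hypothetical metadata for one DOI-bearing data collection."""
    authors: List[str]
    title: str
    publisher: str
    doi: str


def resolver_url(doi: str) -> str:
    """Build the public resolver link for a DOI via doi.org."""
    return f"https://doi.org/{doi}"


def cite(c: Collection, start: str, end: str, accessed: str) -> str:
    """Format an illustrative citation; the field order is an assumption."""
    return (
        f"{', '.join(c.authors)}. {c.title}, {start} to {end}. "
        f"{c.publisher}. Accessed {accessed}. {resolver_url(c.doi)}"
    )


example = Collection(
    authors=["A. Author", "B. Author"],  # placeholder names
    title="Example datastream",          # placeholder title
    publisher="Example Data Center",     # placeholder publisher
    doi="10.0000/example",               # placeholder, not a real DOI
)
citation = cite(example, "2018-01-01", "2018-01-31", "2018-12-12")
```

Including the date range in the citation is one way a collection-level DOI could still identify the exact subset of data a user worked with.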
12. Data Sharing with External Portals
[Diagram: ARM Data Center sharing metadata and data with external portals]
ARM Data Center
Science Metadata: ISO 19115, CF, FGDC, Schema.org, OAI, JSON-LD
Data Access: THREDDS, OPeNDAP, Extraction, Visualization
External portals: Google, IASOA, Data.gov, DataCite, NGEE-Arctic, other data networks
Connections: metadata harvesting, data download service, DOI
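The slide lists Schema.org and JSON-LD among the metadata formats exposed for harvesting. As a rough illustration of what such a record can look like, the Python sketch below builds a schema.org `Dataset` description and serializes it as JSON-LD. The function name, the choice of properties, and every example value are assumptions for illustration; the actual records published by the ARM Data Center may differ.

```python
import json
from typing import Dict


def dataset_jsonld(name: str, description: str, doi: str, landing_url: str) -> Dict[str, str]:
    """Build a minimal schema.org Dataset record (illustrative subset of properties)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "url": landing_url,
    }


record = dataset_jsonld(
    name="Example datastream",                  # placeholder title
    description="Placeholder description.",     # placeholder text
    doi="10.0000/example",                      # placeholder, not a real DOI
    landing_url="https://example.org/dataset",  # placeholder landing page
)

# Harvesters typically read this JSON from a script tag of type
# application/ld+json embedded in the dataset landing page.
markup = json.dumps(record, indent=2)
```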