Slides 23 and 24 mention experience with HDF-EOS.
Source: http://hdfeos.org/workshops/ws04/presentations/Jones/000901%20DPEAS%20Overview%20-%20HDFEOS%20Workshop.ppt
Overview of the Data Processing and Error Analysis System (DPEAS)
1. Overview of the Data Processing and Error Analysis System (DPEAS)
Andrew S. Jones
Colorado State University (CSU)
Cooperative Institute for Research in the Atmosphere (CIRA)
DOD Center for Geosciences / Atmospheric Research (CG/AR)
Fort Collins, CO
DOD Center for Geosciences / Atmospheric Research
Colorado State University
2. What is it?
Data processing system for “large” data analysis
tasks using common PCs
Features:
2nd generation system (replaces an earlier system called
PORTAL (Jones et al., 1995))
Parallel implementation
Web-based documentation and monitoring
Incorporates a Fortran-interpreter for input tasks
Virtualized I/O subsystem (only memory-resident data
structures are needed; data algorithms now function like a model)
Able to failover to redundant hardware
Extensible User Module
Error Analysis code is still under development
Implemented on Windows NT/2000 OS
3. What Does it Do?
Global merge capabilities for numerous data sets
Current system in operational use for 2+ years at CIRA
Current average operational throughput rate using 15
processors on 8 PCs is 17 TB/yr (47 GB/day).
Measured max. throughput rate is 2.5 PB/yr (7.1 TB/day)
Simplifies
Powerful abstraction layers allow anyone to write parallel code
Virtual I/O subsystem reduces end-user code complexities
Users interact using a language most already know
Easily Scales
Limited process “cross-talk” improves scaling behavior
Tests have shown that a 2000-machine cluster is physically
feasible.
Basically… just add hardware.
4. 10 Data Types Are Currently Supported
Reads and writes HDF-EOS natively
GOES IMAGER (McIDAS)
NOAA AVHRR GAC and LAC (McIDAS)
NOAA AMSU-A and B (HDF-EOS)
DMSP SSM/I (Byte Stream)
DMSP SSM/T-2 (NGDC OIS)
DMSP OLS (NGDC OIS)
TRMM TMI and VIRS (HDF)
User extensible… (your format here)
5. The Hardware
[Diagram: cluster hardware, drawn as a storage view and a processor view]
Legend: Primary, Backup, Wn (worker), mirrored set
Storage view: Primary, Backup, W1, W2; storage labels shown: 66 GB, 240 GB, 240 GB, 240 GB
Operational cluster (24/7): Primary, Backup, W1, W2, W3
9 processors, 3.0 GFlops, 2.25 GB RAM
Cluster summary: all ingest processes; most higher-level remapped products
Experimental cluster (nights only/7): W4, W5, W6
6 processors, 2.5 GFlops, 2.5 GB RAM
Cluster summary: large global sectors
6. Failover Mode
[Diagram: the storage and processor views of the cluster, with failed components marked X]
Failover steps (automated):
1. Synchronize states
2. Promote the Backup
Restore steps (manually initiated):
1. Demote the Backup
2. Restore Mirror Set
3. Synchronize states
4. Reactivate Primary
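The failover and restore sequences on this slide can be sketched as a small state model. This is only an illustration under stated assumptions: the `Cluster` class, its fields, and its method names are hypothetical and are not taken from DPEAS, and the idea that losing the primary also breaks the mirror set is a guess. Only the order of the steps follows the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of a primary/backup pair (illustrative names, not DPEAS APIs)."""
    active: str = "primary"          # node currently coordinating the workers
    primary_healthy: bool = True
    mirror_intact: bool = True
    log: list = field(default_factory=list)

    def fail_primary(self) -> None:
        """Simulate loss of the primary; failover then runs automatically."""
        self.primary_healthy = False
        self.mirror_intact = False   # assumption: the mirror set degrades with the primary
        self.failover()

    def failover(self) -> None:
        # Automated steps, in the order listed on the slide.
        self.log.append("synchronize states")
        self.log.append("promote the backup")
        self.active = "backup"

    def restore(self) -> None:
        """Manually initiated restore, in the order listed on the slide."""
        if self.active != "backup":
            raise RuntimeError("restore only applies after a failover")
        self.log.append("demote the backup")
        self.log.append("restore mirror set")
        self.mirror_intact = True
        self.log.append("synchronize states")
        self.log.append("reactivate primary")
        self.primary_healthy = True
        self.active = "primary"


cluster = Cluster()
cluster.fail_primary()
state_after_failover = cluster.active
cluster.restore()
print(state_after_failover, cluster.active)
```

Keeping the automated path (two steps) separate from the manual path (four steps) mirrors the slide: promotion happens without an operator, while returning to the primary is a deliberate action.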
7. Module Context
[Diagram: DPEAS module context. Labeled components: GUIs, Batch Job Client, Explorer, Command Line, Web Browser, Command Line Script, Command Shell Interpreter, DPEAS Input Script, Other Applications, DPEAS Data Processing Engine, Spawn Subtask, DPEAS Subtask, DPEAS Fortran Interpreter, Batch Job Service, Analysis Modules, DPEAS System State, User Modules, DPEAS HDF-EOS Virtual I/O Subsystem, Translation Modules, Output Modules, Internet Information Services, Operating System (Windows 2000). A callout reads “This is DPEAS”.]
8. An example of a DPEAS input script file
9. How DPEAS Starts
Program Start
DPEAS Initialization
Interpreting DPEAS script declarations
Interpreting DPEAS script executable statements
10. How DPEAS Ends
Interpreting DPEAS script executable statements
DPEAS Summary
Program End
11. How Are Spawned Input Scripts and Jobs Created?
All spawned DPEAS jobs run machine-generated
DPEAS input scripts which are generated by the data
processing engine from the Master DPEAS input
script (The examples shown previously were
examples of DPEAS machine-generated code)
This is automated within DPEAS and the user code
goes along for the free ride since it is part of the
DPEAS executable (it’s like meeting a friendly virus
which helps to spread your code along with it)
12. What Does DPEAS Parallelism Look Like?
Do loop contents are sent to other resources in parallel
The new jobs run the same “DPEAS.exe”, but execute only the subtask operations
Completed jobs allow additional jobs to start
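The dispatch pattern described on this slide can be sketched as follows: each iteration of a loop becomes an independent subtask, a fixed number of subtasks run at once, and a finished subtask frees a slot for the next one. This is a minimal sketch using Python's standard `concurrent.futures`. DPEAS itself spawns copies of `DPEAS.exe` on other machines, so the thread pool, the `run_subtask` function, and the slot count below are stand-in assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_subtask(iteration: int) -> int:
    """Hypothetical stand-in for one spawned job executing a loop body."""
    return iteration * iteration


def run_do_loop(iterations, max_slots: int = 3) -> dict:
    """Send each loop iteration to a free slot; finished jobs free slots for queued ones."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_slots) as pool:
        # Every iteration is queued up front; only max_slots run at any moment.
        futures = {pool.submit(run_subtask, i): i for i in iterations}
        for done in as_completed(futures):
            results[futures[done]] = done.result()
    return results


results = run_do_loop(range(1, 7))
print(sorted(results.items()))
```

Because the iterations share no state, the order in which they finish does not matter, which is the property the deck refers to as limited process “cross-talk”.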
13. The 3 Programming Steps to Add a User Routine to DPEAS
1. Insert a program “hook”
The program hook makes the main DPEAS program aware of the existence of your wrapper routine.
2. Create a wrapper routine
The wrapper routine tells the DPEAS Fortran interpreter how to parse and interact with your application subroutine arguments.
3. Create an application routine
The application routine performs the “real” work. You can do anything you want within the application routine.
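One way to picture the three pieces is a registry (the hook), an argument-parsing adapter (the wrapper), and a plain function (the application routine). The sketch below is in Python rather than Fortran, and every name in it (`HOOKS`, `ratio_wrapper`, `compute_ratio`, `interpret_call`) is hypothetical. It does not reproduce the DPEAS interfaces, and the formula is a placeholder rather than the science algorithm from the deck.

```python
# 3. Application routine: does the "real" work on ordinary in-memory values.
def compute_ratio(observed: float, reference: float) -> float:
    """Placeholder science code (not the algorithm from the slides)."""
    return observed / reference


# 2. Wrapper routine: tells the interpreter how to parse the script arguments
#    and hand them to the application routine.
def ratio_wrapper(args: list) -> float:
    observed, reference = (float(a) for a in args)
    return compute_ratio(observed, reference)


# 1. Program hook: makes the main program aware that the wrapper exists.
HOOKS = {"RATIO": ratio_wrapper}


def interpret_call(statement: str) -> float:
    """Tiny stand-in for a script interpreter handling 'NAME(arg1, arg2)'."""
    name, _, rest = statement.partition("(")
    args = [a.strip() for a in rest.rstrip(")").split(",")]
    return HOOKS[name.strip().upper()](args)


result = interpret_call("ratio(3.0, 4.0)")
print(result)
```

The application routine knows nothing about script syntax, and the interpreter knows nothing about the science code; the wrapper is the only place where the two meet.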
14. How does the “User_Module.f90” relate to my DPEAS Input Scripts?
[Diagram: flow from user source code to parallel execution]
Compile: User_Module.f90 (program hook, wrapper routine, application routine) is built with an ordinary Fortran compiler into "DPEAS.exe"
Interpret: "DPEAS.exe" interprets the DPEAS input script
Automated parallelization using self-replication: a subtask copy of "DPEAS.exe" interprets its own DPEAS input script, then returns to the master
End
15. User Example:
The user’s application routine
Using the virtual I/O data via pointers
1. Find each MW channel
2. Allocate a new output array data structure
Your science code looks like this
16. User Example:
The results: Complete integration
The new user routine is now fully integrated into DPEAS
17. User Example:
The output HDF-EOS file
18. User Example:
The output image representation
150 GHz effective emissivity
Calculated from: GOES-08 IMAGER and NOAA-15 AMSU-B
19. User Example:
Summary
Creates 2 new routines:
Wrapper routine
Application routine
Requires 25 lines of executable code:
2 – Program hook
4 – Wrapper routine
19 – Application routine
2 – Variable assignments
3 – Science algorithm
14 – Virtual I/O library calls (using only 2 Virtual I/O library routines)
Small overhead for gaining massive parallelism capabilities!
20. User Example:
How complex would the user routine be,
if written without the Virtual I/O library?
Creates 2 new routines:
Wrapper routine
Application routine
Requires 59 lines of executable code:
2 – Program hook
4 – Wrapper routine
53 – Application routine
2 – Variable assignments
3 – Science algorithm
48 – HDF-EOS library calls (using 26 HDF-EOS library routines)
Answer: Without the DPEAS Virtual I/O library there would be:
24 additional I/O routines called by the user (+1200%)
34 additional lines of user code (+136%)
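The counts on the two comparison slides can be checked with a short calculation. The sketch below uses only the numbers stated on those slides; the dictionary layout and function names are illustrative choices, not anything from DPEAS.

```python
def percent_increase(before: int, after: int) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) * 100.0 / before


# Executable-line counts and I/O routine counts as listed on the slides.
with_vio = {"hook": 2, "wrapper": 4, "application": 19, "io_routines": 2}
without_vio = {"hook": 2, "wrapper": 4, "application": 53, "io_routines": 26}


def total_lines(counts: dict) -> int:
    """Total executable lines: hook + wrapper + application routine."""
    return counts["hook"] + counts["wrapper"] + counts["application"]


lines_with = total_lines(with_vio)
lines_without = total_lines(without_vio)

extra_routines = without_vio["io_routines"] - with_vio["io_routines"]
extra_lines = lines_without - lines_with

routine_increase = percent_increase(with_vio["io_routines"], without_vio["io_routines"])
line_increase = percent_increase(lines_with, lines_without)

print(extra_routines, routine_increase)
print(extra_lines, line_increase)
```

With these counts the number of I/O routines rises by 24, a 1200% increase, and the line count rises by 34, a 136% increase (59 lines is 236% of 25).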
21. User Example:
Conclusions
Implementation Insights
Minimal amount of end-user code is required
The effort and resources involved are small
(The DPEAS program recompiled in < 30 s on the user’s desktop)
Virtual I/O Insights
The DPEAS virtual I/O access method is less complex than
traditional HDF-EOS file access methods
End user’s perspective
End users are protected from technical data format issues
End users can develop higher quality code by leveraging
shared robust common modules
Scalability is greatly enhanced with little end user effort
22. Summary
DPEAS can process large data sets in an efficient
manner while maintaining centralized management
controls and error handling behaviors
Parallelism of the code is automatic and runs on
“cheap hardware”
Failover capabilities make the system more robust
User code is shielded from complexities of the
system using software abstraction layers
Little training is needed since user interfaces are in
a known scientific language
User modules directly access data from memory, which
makes traditional file access methods obsolete while
maintaining needed file compatibility
23. What did I learn about HDF-EOS in the process?
HDF-EOS is an excellent “universal” data format
It works for all satellite sensor types I have
encountered to date (10+)
HDF-EOS requires serious software design before
the implementation stage
It is my experience that “Time” information as a
geo/time field for sectorizing is overrated and is likely
to cause future software design headaches with the
more complex sensors if encouraged to be the
“norm”
24. My 2 cents: How HDF-EOS could be made even better
(Hopefully someone has already thought of these things,
and this short list will be a reaffirmation)
Given that GOES data, for example, and other
multi-detector sensors can have multiple times for
each channel for the same geolocation position,
and that in addition, they can and do interrupt their
sensor scans at any time…
Treat “Time” as a data attribute
Currently I associate “Time” and other associated
arrays with their principal data array by nomenclature
It would be better to use data array attribute
“groups”. Then “Time”, “Calibration”, and other
associated arrays could be grouped with the data
array through the data format.
25. Why Data Attributes?
Many data channels have “associated” information
For example, it might be very meaningful to associate the
min. and max. of a grid location with its mean value
It would be better if there was a standard way of
showing that group association, so we don’t have
to understand each other’s unique nomenclatures,
“intent”, or have to resort to the use of unusual
“mixed” HDF/HDF-EOS data files
Data attributes should not be arbitrarily limited in
scope, but have full data type ranges
Units could also be incorporated through data
attributes
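The difference between association by nomenclature and association through explicit groups can be illustrated with plain data structures. The sketch below is not HDF-EOS code and makes no HDF-EOS library calls; the `FieldGroup` class, the `_Time` suffix convention, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


# Association by nomenclature: related arrays are linked only by a naming rule,
# so every reader must know the rule (here, a hypothetical "<name>_Time" suffix).
flat_file = {
    "Ch1": [271.3, 272.0],
    "Ch1_Time": [0.0, 0.5],
}


def find_time_by_name(fields: dict, name: str):
    """Look up the time array for `name` using the assumed suffix convention."""
    return fields.get(f"{name}_Time")


# Association by group: the container itself records which arrays belong together,
# and attributes such as units may hold values of any type.
@dataclass
class FieldGroup:
    name: str
    data: list
    associated: dict = field(default_factory=dict)   # e.g. "Time", "Calibration"
    attributes: dict = field(default_factory=dict)   # e.g. {"units": "K"}


ch1 = FieldGroup(
    name="Ch1",
    data=[271.3, 272.0],
    associated={"Time": [0.0, 0.5], "Calibration": [1.0, 1.0]},
    attributes={"units": "K"},
)

print(find_time_by_name(flat_file, "Ch1"))
print(ch1.associated["Time"], ch1.attributes["units"])
```

In the first form a reader who does not know the suffix rule cannot discover the link; in the second the link travels with the data, which is the kind of standard group association the slides ask for.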
27. Appendix
The following series of slides show how a
user can easily modify DPEAS
1. The user’s program hook
2. … wrapper routine
3. … application routine (using the virtual I/O data via pointers)
4. Usage of the new user routine in a DPEAS input script file
5. The Results: Complete Integration
28. User Example:
The user’s program hook
2 lines of code
29. User Example:
The user’s wrapper routine
4 lines of executable code
30. User Example:
The user’s application routine
Using the virtual I/O data via pointers
1. Find each MW channel
2. Allocate a new output array data structure
Your science code looks like this
31. User Example:
Usage of the new user routine in a
DPEAS input script file
32. User Example:
The results: Complete integration
The new user routine is now fully integrated into DPEAS
33. Where Do I Find DPEAS?
DPEAS Home Page:
http://luna.cira.colostate.edu/DPEAS/DPEAS_frame.htm
Please direct questions to jones@cira.colostate.edu
Editor's notes
DPEAS is one executable that propagates copies of itself within a network cluster of machines in a controlled fashion.