1. HDF
HDF/HDF-EOS Workshop III
Sept. 14-16, 1999
Mike Folk, HDF Group
http://hdf.ncsa.uiuc.edu/
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
NCSA/Univ of Illinois at Urbana-Champaign
HDF
1
2. Topics
I. Overview
II. NCSA HDF Activities
III. HDF5
IV. HDF4 vs. HDF5
4. HDF Mission
To develop, promote, deploy, and support
open and free technologies that facilitate
scientific data storage, exchange, access,
analysis and discovery.
5. What is HDF?
• Scientific data file format & supporting
software
• For images, arrays, tables, other structures
• Features
– Portability across architectures
• I/O library
• Files
– Efficient I/O
– Efficient storage
6. Why use HDF?
• Manage data
• Share data
• Use software that understands HDF
• Improve I/O performance
• Improve storage efficiency
• Use an open standard
7. An HDF File: A Collection of Scientific Data Objects
[Figure: HDF file containing four 3-D arrays]
8. Mixing HDF Objects in One File
[Figure: one HDF file mixing several object types: a group, two 3-D arrays, two raster images, a palette, and a table.]
Table:
Lat   lon   temp
----  ----  ----
12    23    3.1
15    24    4.2
17    21    3.6
16    35    5.7
9. HDF Software
[Figure: software layers, top to bottom]
• General applications: utilities and applications for manipulating, viewing, and analyzing data.
• HDF I/O library
– Application programming interfaces: high-level, object-specific APIs.
– Low-level interface: low-level API for I/O to files, etc.
• HDF file: file or other data source.
10. HDF Applications Software
• Free software
– NCSA HDF library and utilities
– Other software
• Commercial/other software that “understands”
– all of HDF (Noesys, IDL, HDF Explorer)
– certain HDF objects (MATLAB, WebWinds)
– certain HDF applications (SHARP, WIM)
• http://hdf.ncsa.uiuc.edu/tools.html
11. What platforms does HDF run on?
• Sun: Solaris
• SGI: Indy, Power Challenge, Origin, Cray C90, YMP, T3E
• HP9000, HP-Convex Exemplar
• IBM: RS6000, SP2
• DEC: Alpha/Digital UNIX, OpenVMS; VAX: OpenVMS
• Intel: Solaris x86, Linux, FreeBSD, Windows NT/98
• PowerPC: Mac OS
12. A Sampling of HDF Users
• NCSA-affiliated science teams: visualization, data exchange, fast I/O, ...
• Mathworks, Fortner Software, Research Systems Inc., etc.: format supported by vendors of visualization and data analysis software
• Boeing: space-time change detection in images
• Distributed Oceanographic Data System (DODS): remote access to earth science data
• Army Research Lab: network distributed global memory
• Center for Analysis & Prediction of Storms: fast parallel I/O, portability, multi-resolution grids
• TRAPPIST (Euro consortium): exchange, analysis & visualization of non-destructive testing data
13. Major User #1: EOSDIS
• ESDIS Project
– open standard exchange format and I/O library for EOSDIS
– EOS applications
• HDF requirements
– Earth science data types (HDF-EOS, etc.)
– User support for scientists, data producers, etc.
– Library and file structure improvements
– HDF tools, utilities, access software
– Software maintenance and QA
14. Major User #2: ASCI
• ASCI Data Models and Formats (DMF) Group
– open standard exchange format and I/O library for ASCI
– DOE tri-lab ASCI applications
• HDF requirements
– Large datasets (> 1 terabyte)
– ASCI data types, especially meshes
– Good performance in massively parallel environments
– Primarily HDF5
15. II. NCSA HDF Activities
16. Java applications
• HDF APIs
– Basis for tools that access HDF
• HDF Viewers
– HDF browser/visualizer
• HDF4 Data Server Prototype
– Lessons learned about remote access to
17. Remote Data Access
• The SDB: Web-based Server-side Data
Browser
• Java for remote access
• WP-ESIP: DODS project
• Computational Grids (Globus/GASS)
18. HDF Standardization
• To share files, users must organize them similarly.
• HDF user groups create standard profiles
– Ways to organize data in HDF files.
– Metadata
– API
• Examples: HDF-EOS, ASCI DMF
19. HDF-EOS software layers
[Figure: software layers, top to bottom]
• HDF-EOS applications and general applications
• HDF-EOS API (HDF-EOS profiles)
• Application programming interfaces
• Low-level interface
• HDF file
20. “HDF Configuration Record” (HCR)
• To simplify the tasks of defining, comparing,
and producing HDF-EOS files
• Formal (ODL) descriptions of HDF-EOS
objects
21. HCR of Swath
/* Project XYZ */
/* First version defined on June 10th, 1998 */
OBJECT = SWATH
  NAME = SCAN1
  OBJECT = Dimension
    NAME = GeoTrack
    Size = 1200
  END_OBJECT = Dimension
  OBJECT = Dimension
    NAME = GeoCrossTrack
    Size = 205
  END_OBJECT = Dimension
  OBJECT = Dimension
    NAME = DataX
    Size = 2410
  END_OBJECT = Dimension
END_OBJECT = SWATH
END
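The nesting in the swath description above can be read mechanically. The following is a minimal sketch, not one of the HCR utilities: it assumes only the syntax visible on the slide (`OBJECT = X` / `END_OBJECT = X` pairs, `key = value` lines, whole-line `/* ... */` comments, and a final `END`). The function name `parse_hcr` and the dictionary layout are illustrative choices, not part of HCR or ODL.

```python
HCR_TEXT = """\
/* Project XYZ */
/* First version defined on June 10th, 1998 */
OBJECT = SWATH
NAME = SCAN1
OBJECT = Dimension
NAME = GeoTrack
Size = 1200
END_OBJECT = Dimension
OBJECT = Dimension
NAME = GeoCrossTrack
Size = 205
END_OBJECT = Dimension
OBJECT = Dimension
NAME = DataX
Size = 2410
END_OBJECT = Dimension
END_OBJECT = SWATH
END
"""


def parse_hcr(text):
    """Parse OBJECT/END_OBJECT blocks into nested dictionaries (hypothetical helper)."""
    root = {"type": "ROOT", "attrs": {}, "children": []}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and whole-line comments.
        if not line or line.startswith("/*"):
            continue
        if line == "END":
            break
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "OBJECT":
            # Open a new object nested inside the current one.
            node = {"type": value, "attrs": {}, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif key == "END_OBJECT":
            closed = stack.pop()
            if closed["type"] != value:
                raise ValueError(f"END_OBJECT = {value} closes {closed['type']}")
        else:
            # Plain key = value line: attach it to the innermost open object.
            stack[-1]["attrs"][key] = value
    if len(stack) != 1:
        raise ValueError("unterminated OBJECT block")
    return root


swath = parse_hcr(HCR_TEXT)["children"][0]
dims = {d["attrs"]["NAME"]: int(d["attrs"]["Size"]) for d in swath["children"]}
print(swath["attrs"]["NAME"], dims)
```

A comparison tool could use a structure like `dims` to check a description against the dimensions found in an actual file.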
22. HCR
• HCR Utilities:
– Converters: HCR ↔ HDF-EOS
– Edit HCR and HDF-EOS
– Compare HCR with HDF-EOS file
• Current projects:
– Extend HCR converters to all of HDF4
– Similar work with HDF5
– XML too
24. Why HDF5?
• HDF shortcomings exposed by EOSDIS, ASCI and others...
– Limits on object & file size (<2GB)
– Limited number of objects (<20K)
– Rigid data models
– I/O performance
– Aging software infrastructure (code entropy)
25. • …new Demands...
– Bigger, faster machines and storage systems
• massive parallelism, parallel file systems
• teraflop speeds, terabyte storage
– Greater complexity
• complex data structures
• complex subsetting
– More emphasis on remote & distributed access
26. • … and ASCI Requirements
– Compatibility with vector bundle model
– Compatibility with MPI-IO
– Ability to transform data between memory & storage
– Parallel file systems: PIOFS, HPSS, etc.
27. New HDF5 Features
• More scalable
– Larger arrays and files
– More objects
• Improved data model
– New datatypes
– Single comprehensive dataset object
• Improved software
– More flexible, robust library
– More flexible API
– More I/O options
28. HDF5 data model
• Two primary objects
• Dataset
– multidimensional array of elements
– rich variety of datatypes
• group
– directory-like structure
– contains datasets, groups, other objects
29. Dataset components
• multidimensional array
• header with metadata
– datatype
– dataspace
– attributes
– storage properties
30. Simple datatypes
• The usual scalars: integer & float
• User-defined scalars (e.g. 13-bit integers)
• Variable length (e.g. strings)
• Pointers to objects or regions of datasets
• Enumeration
• Opaque
32. Data Spaces
• How data are organized to form a dataset
– rank
– dimensions
• Subsetting during I/O operations
– What subset of data is to be moved
– In-memory organization of data
– In-file organization of data
33. HDF5 dataset: array of records
[Figure: a 5 x 3 array in which each element is a record]
• Dimensionality: 5 x 3
• Datatype: record with four fields (int8, int4, int16, float32)
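To make the "array of records" idea concrete, here is a small sketch using only Python's `struct` module rather than the HDF5 library. It assumes the slide's "int4" means a 4-byte integer and that the four fields are packed little-endian without padding; both are assumptions made for illustration, and the helper names are hypothetical.

```python
import struct

# One record: int8, 4-byte int (assumed meaning of "int4"), int16, float32.
RECORD = struct.Struct("<bihf")
ROWS, COLS = 5, 3  # dimensionality from the slide


def pack_dataset(records):
    """Pack a ROWS x COLS nested list of 4-tuples into one row-major byte buffer."""
    return b"".join(RECORD.pack(*rec) for row in records for rec in row)


def read_element(buf, i, j):
    """Unpack only the record at row i, column j."""
    offset = (i * COLS + j) * RECORD.size
    return RECORD.unpack_from(buf, offset)


records = [
    [(i, 1000 * i + j, -j, 0.5 * (i + j)) for j in range(COLS)]
    for i in range(ROWS)
]
buf = pack_dataset(records)
print(len(buf), read_element(buf, 4, 2))
```

Because every record has the same size, the position of any element follows directly from its indices, which is what makes partial reads of such a dataset cheap.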
34. Dataspaces
Reading Dataset into Memory from File
[Figure: a read operation moving data from a dataset in the file to an array in memory; the two arrays are labeled "2D array of integers" and "3D array of floats".]
35. Selection: Examples of mappings between file selections and memory selections.
(a) A hyperslab from a 2D array to the corner of a smaller 2D array
(b) A regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
(c) A sequence of points from a 2D array to a sequence of points in a 3D array
(d) Union of slabs in file to union of slabs in memory. No. of elements must be equal.
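Mapping (b) can be sketched in plain Python. This is an illustration of the idea only and does not call the HDF5 library; the parameter names `start`, `stride`, `count`, and `block` follow the usual hyperslab vocabulary, while the function names and the nested-list "file" are assumptions made for the example.

```python
from itertools import product


def hyperslab_coords(start, stride, count, block):
    """Return the coordinates of a regular hyperslab in row-major order."""
    per_dim = []
    for s, st, c, b in zip(start, stride, count, block):
        # count blocks of size b, one every st elements, beginning at s.
        per_dim.append([s + k * st + e for k in range(c) for e in range(b)])
    return list(product(*per_dim))


def read_blocks(file_array, memory, offset, start, stride, count, block):
    """Copy the selected file elements into memory[offset:] as one contiguous run."""
    coords = hyperslab_coords(start, stride, count, block)
    # The memory selection must hold exactly as many elements as the file selection.
    if offset + len(coords) > len(memory):
        raise ValueError("memory selection is smaller than file selection")
    for n, (i, j) in enumerate(coords):
        memory[offset + n] = file_array[i][j]
    return len(coords)


file_array = [[10 * r + c for c in range(8)] for r in range(6)]  # 6 x 8 "file" array
memory = [None] * 12                                             # 1D "memory" array
moved = read_blocks(file_array, memory, offset=2,
                    start=(0, 1), stride=(3, 4), count=(2, 2), block=(1, 2))
print(moved, memory)
```

Here two rows (0 and 3) and two 2-element column blocks (columns 1-2 and 5-6) are selected in the file, and the eight selected values land contiguously in memory starting at index 2.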
36. Attributes
• Named pieces of data
• Stored in a dataset or group header
• Operations are scaled-down versions of the
dataset operations
– Not extendible
– No compression
– No partial I/O
37. Property list
• Properties of objects or operations
• Describe how to create, store, access and
transfer data
38. Some Properties
• chunked: better subsetting access time; extendable
• compressed: improves storage efficiency, transmission speed
• extendable: datasets can be extended in any direction
• split file: metadata in one file, raw data in another
[Figure: for dataset “Fred”, File A holds the metadata for Fred and File B holds the data for Fred.]
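Why chunking helps subsetting can be seen from a little index arithmetic. The sketch below is illustrative only: the 100 x 100 chunk shape and the subset are made-up numbers, and `chunks_touched` is a hypothetical helper, not an HDF5 call.

```python
from itertools import product


def chunks_touched(start, shape, chunk_shape):
    """Return the chunk coordinates overlapped by a rectangular subset."""
    ranges = []
    for s, n, c in zip(start, shape, chunk_shape):
        first = s // c            # chunk holding the first selected index
        last = (s + n - 1) // c   # chunk holding the last selected index
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


chunk_shape = (100, 100)
# A tall, narrow subset: all 1000 rows, 10 columns starting at column 250.
column_block = chunks_touched(start=(0, 250), shape=(1000, 10),
                              chunk_shape=chunk_shape)
print(len(column_block), column_block[:3])
```

In this example the subset falls inside a single column of ten chunks, so only those ten chunks need to be read, whereas a row-by-row contiguous layout would scatter the same subset across 1000 separate rows.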
39. Dataset components
Dataset
• Metadata
– Dataspace: Rank=2; Dim_1=5, Dim_2=4, Dim_3=2
– Datatype: int16
– Attributes: time = 32.4, pressure = 987, temp = 56
– Storage properties: chunked; compressed
• Data
40. Groups
• Structures for organizing the file
• Like Vgroups in HDF4
• Like directories in hierarchical file system
• Every file starts with a root group
• Groups have attributes
41. Groups
• A mechanism for collections of related objects
• Every file starts with a root group
• Can have attributes
• Like directories in Unix, but a graph, rather than a tree
42. Groups
Groups and members of groups can be shared
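The difference between a graph and a tree shows up as soon as one object is linked from two groups. The classes below are a toy model written for this illustration, not the HDF5 API; `Group`, `Dataset`, `link`, and `resolve` are hypothetical names.

```python
class Dataset:
    """Toy stand-in for a dataset: some data plus attributes."""

    def __init__(self, data):
        self.data = data
        self.attrs = {}


class Group:
    """Toy stand-in for a group: named links to groups or datasets."""

    def __init__(self):
        self.links = {}
        self.attrs = {}

    def link(self, name, obj):
        self.links[name] = obj
        return obj


def resolve(root, path):
    """Follow a slash-separated path of link names starting at the root group."""
    obj = root
    for name in path.strip("/").split("/"):
        if name:
            obj = obj.links[name]
    return obj


root = Group()
run1 = root.link("run1", Group())
run2 = root.link("run2", Group())
grid = run1.link("grid", Dataset([0.0, 0.5, 1.0]))
run2.link("grid", grid)  # second link to the same object; nothing is copied

print(resolve(root, "/run1/grid") is resolve(root, "/run2/grid"))
```

Both paths reach the same object, so a change made through one path is visible through the other.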
44. Reading & writing with HDF5
• Set properties
• Describe the data
– datatypes
– rank and dimensions
– mapping between file and memory
• Read/write
45. Files needn’t be files - Virtual File Layer
• VFL: a public API for writing I/O drivers
[Figure: a “file” handle (hid_t) sits on top of the VFL (Virtual File I/O Layer), which passes requests to I/O drivers (stdio, mpio, memory, network); the drivers reach the “storage”: files, memory, or the network.]
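The layering can be sketched as one small driver interface with interchangeable back ends. This is a conceptual model in Python, not the actual VFL API: the `Driver` interface, its `read`/`write` methods, and the two driver classes are assumptions made for the example.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Driver(ABC):
    """Hypothetical minimal driver interface: byte-addressed read and write."""

    @abstractmethod
    def write(self, offset, data): ...

    @abstractmethod
    def read(self, offset, size): ...


class MemoryDriver(Driver):
    """Keeps the whole 'file' in a growable in-memory buffer."""

    def __init__(self):
        self.buf = bytearray()

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self.buf):
            self.buf.extend(bytes(end - len(self.buf)))  # zero-fill the gap
        self.buf[offset:end] = data

    def read(self, offset, size):
        return bytes(self.buf[offset:offset + size])


class FileDriver(Driver):
    """Stores the bytes in an ordinary file on disk."""

    def __init__(self, path):
        self.f = open(path, "w+b")

    def write(self, offset, data):
        self.f.seek(offset)
        self.f.write(data)

    def read(self, offset, size):
        self.f.seek(offset)
        return self.f.read(size)

    def close(self):
        self.f.close()


def store_and_fetch(driver):
    """Library-side code: identical no matter which driver is plugged in."""
    driver.write(8, b"temp=56")
    return driver.read(8, 7)


mem_result = store_and_fetch(MemoryDriver())
with tempfile.TemporaryDirectory() as tmp:
    file_driver = FileDriver(os.path.join(tmp, "demo.bin"))
    file_result = store_and_fetch(file_driver)
    file_driver.close()
print(mem_result == file_result)
```

The code above the driver boundary never learns where the bytes live, which is the sense in which "files needn't be files".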
46. HDF5 tools
• Current
– hdf5ls - lists contents of HDF5 file
– h5dumper - higher level view
– hdf5 hdf4 converter
• Future
– Convert HDF5 ↔ ASCII, binary, GIFF, etc
– Convert HDF4 HDF5
– Java tools - VisAD, etc.
– File/code generation from DDL description
– Talking to vendors
48. IV. HDF4 vs. HDF5
49. HDF4 vs. HDF5
• HDF4
– Original format and library
– Compatible with all earlier versions
– 6 primary objects
• multidim array of scalars
• raster image, palette
• table
• annotation
• group
– Biggest current user: Earth Observing System Data and Info System (EOSDIS)
• HDF5 - successor to HDF4
– New format and library
– Not compatible with earlier versions
– 2 primary objects
• multidim. array of records
• group
– Biggest current user: Accelerated Strategic Computing Initiative (ASCI)
50. HDF4 object types can be derived from HDF5 datasets and groups
[Figure: HDF4 objects drawn as HDF5 groups and datasets. Labels: HDF5 group, HDF5 dataset, HDF4 Vgroup, HDF4 Vdata (1-dim array of records), HDF4 SDS (n-dim array of scalars), HDF4 8-bit raster, HDF4 24-bit raster, "2-dim array of multi-component scalars", and an annotation reading "March 15, 1990. Simulation with k=10.0, beta=1.22e3. Calculate the magnitude ...".]
HDF4 Vdata example:
lat   lon   temp
12    23    3.1
15    24    4.2
17    21    3.6
23    35    7.2
25    31    6.3
51. Status of HDF4 vs. HDF5
• HDF4 is still an EOS standard
• HDF5 is likely to become one also
• HDF4 maintenance
– Maintained as long as EOS needs it
– Minimal new features
• New applications: use HDF5 if possible!
– New features, performance improvements, etc.
52. HDF Information
• HDF Information Center
– http://hdf.ncsa.uiuc.edu/
• HDF Help email address
– hdfhelp@ncsa.uiuc.edu
• HDF users mailing list
– hdfnews@ncsa.uiuc.edu
Editor's notes
ASCI’s DMF Group is currently supporting HDF work with the idea of possibly adopting HDF as a standard. They want to share data and software among the three labs (Livermore, Sandia, Los Alamos), and would prefer a “non-invented-here,” open standard with publicly available software.
HDF Requirements. ASCI’s needs overlap with those of EOSDIS, but with some important differences:
ASCI deals largely with simulations on massively parallel machines, and hence requires very high performance in doing I/O. Only a parallel version of the library will satisfy ASCI’s needs.
ASCI data deals with meshes, whereas EOS deals largely with remotely sensed data. Many types of meshes can be much more complex than remotely sensed data is, and typically require indexed access.
Because of these requirements, the current official version of HDF (HDF4) is not adequate for the ASCI project. Fortunately, with support from NASA, we have been developing a completely new version of HDF designed to address these kinds of requirements. This is HDF5. More about HDF5 later.
A mesh data repository is being developed by the group to standardize the data models and terminology used by the three labs. This will allow them to share resources much better than is currently the case. There is also the hope that the mesh standard adopted by ASCI will be adopted by others, further expanding the leverage of the standard.
The HDF group has several Java-based projects. Java’s platform independence supports the need to be able to work with HDF on many platforms. Java’s graphical interface features support the creation of platform-independent HDF browsing and visualization software. And Java’s network awareness facilitates the development of software for remote access to HDF data.
A Java HDF Interface (JHI). JHI provides an interface to essentially all the functions of the NCSA HDF 4.1r2 library. The JHI is analogous to the FORTRAN interface already provided as part of the HDF library release.
Basis for tools that access HDF. Any Java application can use the interface classes to read and write HDF. This package “wraps” the standard HDF 4.1r2 library, which is called from Java through “native” methods.
A Java HDF Viewer. This is a tool to provide basic viewing capabilities for HDF.
HDF browser/visualizer. With this tool you can open an HDF file, look at images, arrays, tables and attributes, and do some simple visualization.
Template for other Java viz apps. It isn’t meant as an all-encompassing visualization tool for HDF. That is left to others, including commercial vendors. Rather, it is meant as a template for people to use to build more sophisticated tools.
Java Scientific Data Server Prototype. We are experimenting with remote access to HDF. This project examines different ways Java can be used to provide remote access to HDF.
Lessons learned about scientific data servers. We are learning a great deal about Java’s remote access capabilities: servlets, RMI, etc.
Template for other Java server apps. Again, we hope this technology will help others who want to do similar things, or to build products out of our prototypes.
HDF5 is a new, experimental version of HDF that addresses limitations of the current version (HDF4) and the requirements of modern systems and applications. HDF5 is a completely new format and I/O library, not an incremental revision of HDF4. An HDF5 prototype was released in February 1998. Although incomplete, this library shows the basic features of HDF5. A full release is scheduled for Summer 1998.
Why HDF5? HDF5 is motivated by severe limitations in the HDF4 format and library. HDF5 retains most features of HDF4, but addresses these limitations, including:
Large array and file support. A single HDF4 file cannot store more than 20,000 complex objects, and cannot be larger than 2 GB. HDF5 will be able to store virtually any number of objects of virtually any size.
Simple, comprehensive data model. HDF4 has more object types than necessary, and datatypes are too restricted. HDF5 uses a simpler, more comprehensive data model that includes only two basic structures: a multidimensional array of record structures, and a grouping structure. All HDF4 structures can be derived from these.
New library, with emphasis on parallel I/O. The HDF4 library is old, overly complex, does not support parallel access well, and is not thread tolerant. HDF5 provides a better-engineered library and API, with improved support for parallel I/O, threads, and other requirements imposed by modern systems and applications.
Collaborations. HDF5 was motivated by the needs of many different users, but two projects in particular are driving HDF5 development:
ASCI: mesh data standard for ASCI physics. The DMF’s ASCI mesh standard initiative described in an earlier slide is providing most of the support for HDF5.
Digital Library Initiative (DLI): integrate with commercial object store. A DLI project at the U. of Illinois is using HDF5 for data access in combination with a commercial object store. This project requires very efficient parallel I/O.
Here is an example of a basic HDF5 object.
Notice that each element in the array is a record with four values in it.
Like HDF4, HDF5 has a grouping structure.
The main difference is that every HDF5 file starts with a root group, whereas HDF4 doesn’t need any groups at all.