This document introduces HDF5 as a new data format that addresses limitations of SEGY, such as only allowing sequential access. HDF5 allows random access, extended metadata, and scales better to large datasets. It describes how to create and read HDF5 files in Claritas using new modules like SEISWRITE and SEISREAD that replace DISCWRITE and DISCREAD. Selections, repeats, and reading multiple files work similarly to existing Claritas tools but with improved flexibility and performance.
2. Getting Started With HDF5
• Why have we brought in a new data format?
• What actually is HDF5?
• How do I create HDF5 files?
• How do I read in HDF5 files?
– Reading one file at a time
– Reading multiple files and selections
• Points to Note
• Future Developments
3. SEGY is great but…
• It is designed to be read sequentially from tape
– and our “index” file solution didn’t scale well to “big data”
– and our index file solution only allowed primary key access
• It only has 240 bytes of 32-bit integer headers defined
– and our extended trace headers didn’t scale well to “big data”
• Some processes require “n-key random access”
– “surface consistent” suite, PreSTM, 3DSRME etc.
• You need to read the whole file to access trace headers
– Some “database” systems offer more flexibility
• Parallel I/O doesn’t scale well on large clusters
4. So what is HDF5?
• Developed over the last 20 years
• Initially by the National Center for Supercomputing Applications http://www.ncsa.illinois.edu/
• Now developed by The HDF Group http://www.hdfgroup.org
• A suite of technologies, not just a file format
• General purpose library and file format for storing scientific data
• Fully supported set of command line tools, APIs and interfaces
• A pan-industry open standard
• Used for storage by both MATLAB and Scilab, can be read by Mathematica
• A self describing format
• No ambiguity about integer or floating point types or storage in trace bytes
• Names can be allocated to components, as you would in a database structure
• Built for “big data”
• Petabyte+ scale datasets running on tens of thousands of cores
5. Our Implementation of HDF5
We have “encapsulated” the GLOBE Claritas SEGY in HDF5
HDFView 2.9: free, third-party tool, showing how any HDF5 application can open the new format
Data, processing history, 400-byte reel header, 3200-byte text header, history and trace headers from Claritas extended SEGY all present
Seismic samples displayed graphically; could also be displayed as a table
All trace headers (SEGY 240-byte and extended) opened in a spreadsheet; full mathematical operations
The 400-byte binary reel header opened as a table, so that values can be edited or modified
6. Creating HDF5 Files : SEISWRITE
Specify a file name!
Optimisation controls; these have smart defaults set and can be modified for managing very large datasets where you know that non-sequential read access or partial reads of trace samples will be needed
Replaces current use of DISCWRITE, although this will continue to be available
New functionality development will focus on SEISWRITE and HDF5 format data
7. Reading HDF5 files : SEISREAD
With HDF5 format, you use SEISREAD in place of the DISCxxxxx Modules
You don’t need to worry about the order of data on disc, just how you want to read it
8. Simple Reading
File Name
Primary key order; default is all, ascending
Secondary key order; default is all, ascending
Tertiary key order; only when needed
You can read data in ANY order; original order doesn’t matter
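The idea of reading in any key order, independent of storage order, can be sketched in plain Python. This is only an illustration of the principle, not how SEISREAD is implemented: the header index `HEADERS`, the `row` field and the function `read_order` are all hypothetical names introduced for this example.

```python
from operator import itemgetter

# Hypothetical header index: one entry per trace, listed in storage order.
# "row" is the position of the trace in the file (an assumption for this sketch).
HEADERS = [
    {"SHOTID": 2, "CHANNEL": 1, "row": 0},
    {"SHOTID": 1, "CHANNEL": 2, "row": 1},
    {"SHOTID": 2, "CHANNEL": 2, "row": 2},
    {"SHOTID": 1, "CHANNEL": 1, "row": 3},
]


def read_order(headers, keys, descending=()):
    """Return storage rows in the requested primary/secondary/... key order.

    keys       -- header names, most significant first (e.g. primary, secondary)
    descending -- subset of keys to sort in descending order (default: none)
    """
    ordered = list(headers)
    # Python's sort is stable, so sorting from the least significant key
    # to the most significant key yields a correct multi-key ordering.
    for key in reversed(keys):
        ordered.sort(key=itemgetter(key), reverse=key in descending)
    return [h["row"] for h in ordered]


if __name__ == "__main__":
    print(read_order(HEADERS, ["SHOTID", "CHANNEL"]))
    print(read_order(HEADERS, ["CHANNEL", "SHOTID"]))
```

With random access to the traces, the returned rows can be fetched in that order directly, which is why the original order on disc no longer matters.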
9. Selection and Repeats
6 Repeat copies specified
Primary key SHOTID, with only SHOTID 900 selected; note tolerance
Secondary key CHANNEL, all selected, in ascending order (default)
Six copies of SHOTID 900 passed to the processing flow, with REPEAT set from 1 to 6
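As a rough sketch of what a selection with repeats produces, the following Python models the behaviour described above. It assumes that the tolerance means "match if the key is within this distance of the requested value" and that each repeat copy is passed in turn; both are assumptions, and `select_with_repeats` is a hypothetical helper, not a Claritas function.

```python
def select_with_repeats(headers, key, value, tolerance=0, repeats=1,
                        secondary="CHANNEL"):
    """Return copies of the matching traces, each tagged with a REPEAT header.

    A trace matches when abs(header[key] - value) <= tolerance (assumed
    meaning of the tolerance). Matches are ordered by the secondary key,
    ascending, and the whole selection is emitted once per repeat.
    """
    matches = [h for h in headers if abs(h[key] - value) <= tolerance]
    matches.sort(key=lambda h: h[secondary])
    return [
        {**h, "REPEAT": repeat}
        for repeat in range(1, repeats + 1)
        for h in matches
    ]


# Hypothetical headers for three shots with two channels each.
SHOTS = [
    {"SHOTID": shot, "CHANNEL": chan}
    for shot in (899, 900, 901)
    for chan in (2, 1)
]

if __name__ == "__main__":
    for trace in select_with_repeats(SHOTS, "SHOTID", 900, repeats=6):
        print(trace)
```

With `tolerance=0` only SHOTID 900 is returned; widening the tolerance to 1 would also pick up shots 899 and 901 in this made-up dataset.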
10. More Complex Selections
Two copies of SHOTIDs from 100 to 900 with an increment of 100, all channels in ascending order, with REPEAT set to 1 and 2
More complex SHOTID selection using the same syntax as DISCREAD; note tolerance is set to 0
11. Sorting to CDP (DISCGATH)
Identical to simple reading
Specify CDP as primary key
Specify CDPTRACE as secondary key
Default is to read all data in ascending primary/secondary key order
12. Reading Multiple Files
Seismic File List used in the same format as with DISCREAD, with selections
SETREPEAT parameter used as per DISCREAD to create panels; files are merged if this is “no”
Primary Key defined here is used in the Seismic File List definition
This last file has a “native” ordering of CDP, CDPTRACE, but will be ordered to SHOT, CHANNEL on read, automatically
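The difference between merging files and reading them as panels can be sketched as follows. This is a simplified model, not the SEISREAD implementation: `read_file_list` and its `set_repeat` flag are hypothetical, and numbering the panels via a REPEAT header is an assumption made for the example.

```python
import heapq


def read_file_list(files, keys, set_repeat=False):
    """Combine several files into one trace stream ordered by `keys`.

    files      -- list of files, each a list of trace-header dicts in any
                  native order
    set_repeat -- if True, return the files one after another as panels,
                  tagging each trace with REPEAT = panel number (assumed
                  behaviour); if False, merge all files into one ordered stream
    """
    def sort_key(header):
        return tuple(header[k] for k in keys)

    # Re-order every file to the requested keys, whatever its native order.
    streams = [sorted(traces, key=sort_key) for traces in files]
    if set_repeat:
        return [
            {**h, "REPEAT": panel}
            for panel, stream in enumerate(streams, start=1)
            for h in stream
        ]
    # heapq.merge interleaves already-sorted streams into one sorted stream.
    return list(heapq.merge(*streams, key=sort_key))


# Hypothetical data: FILE_A is stored in shot order, FILE_B in CDP order.
FILE_A = [{"SHOTID": 1, "CHANNEL": 1, "CDP": 5},
          {"SHOTID": 3, "CHANNEL": 1, "CDP": 7}]
FILE_B = [{"SHOTID": 2, "CHANNEL": 2, "CDP": 10},
          {"SHOTID": 2, "CHANNEL": 1, "CDP": 11}]

if __name__ == "__main__":
    for trace in read_file_list([FILE_A, FILE_B], ["SHOTID", "CHANNEL"]):
        print(trace)
```

In both modes each file is first re-ordered to the requested keys, which mirrors the point that a file stored in CDP, CDPTRACE order is still delivered in SHOT, CHANNEL order.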
13. Points to Note
• Can only specify a primary key in a Seismic File List
– Same as DISCWRITE, although the original data order no longer matters
• User needs to manage the merging of extended trace headers
– Use DELHDR prior to merging files; this requirement will be removed in future releases
• Files can be 10-15% larger than SEGY
• Compatible with Cluster File Systems (Gluster etc.)
• I/O above about 2 Gbytes should be improved
14. Future development
• Improved PKEY/SKEY/TKEY selection handling
• Direct update of trace headers from applications
– Geometry, SV (FB picks) etc.
• Add HDF5 support in KPRET2D
– Only module where this is not available
• Add full parallel I/O to iMage suite
– Increase parallel scalability even further
• Algorithmic optimisation
– Re-write to take full advantage of random access