Understanding HDF5 is more than a technical exercise.
HDF5 wasn’t designed by committee. It’s the result of an evolutionary process.
HDF5 is what it is because of how different communities use it to solve data management problems.
We will tell you a little bit about ourselves (The HDF Group), the stewards and maintainers of HDF5 software and technologies.
Last but not least, HDF5 couldn’t endure and thrive without an ecosystem.
A non-profit company on a mission:
To make the management of large & complex data as simple as possible, but no simpler.
Our values:
- Simple data model
- Free Open Source Software
- Technical excellence
- Diverse community
We were excited about big data 20 years ago and welcome the new generation of “dataphiles.” Today, we are no less excited about big data, but also about the new processing and storage capabilities and economies.
We are working hard to bring the associated benefits to our users and customers while protecting their current investment.
Our biggest challenge: HDF spreads through word-of-mouth and not marketing. Spread the word!
It’s hard to find a walk of life not touched by HDF5. (Maybe literary studies?)
- Public and private sector
- Academia, government, industry
- Organizations (NASA, DOE, CEA, ITER, DESY)
- Standards bodies (OGC, EXPRESS, Allotrope, CGNS, RESQML)
- Sectors (aerospace, oil & gas, pharma, power systems, finance)
- Research (astronomy, life science, materials science, ML)
- Data integrators, life cycle managers
- Developers
We like to speak of the HDF5 community as an iceberg, of which we see only about 10%. Despite our best efforts, we usually find out that someone is using our software “by accident.” That leaves plenty of room for interpretation:
- People don’t know of our existence (we have some evidence of that)
- Our product is perfect and there’s no need for any further communication (we know better)
- We are such terrible people that nobody wants to talk to us (perhaps)
- …
- “Raw data” is an oxymoron (It needs a domain-specific interpretive base.)
- Data comes in many representations: in registers, memory, storage…
- Some data is used as metadata for other data.
- Data can be “at rest” (files, objects) or in transit between environments (I/O)
Data management is the art of:
- Maintaining data integrity
- Leaving clues for the interpretive base
- Removing “data friction”
Many challenges around data management are not exactly new. Big Data’s three Vs (volume, velocity, variety) turn what used to be mere inconvenience into major pain points.
Historically & literally: HDF5 = Hierarchical Data Format, version 5
The meaning of a word is its use in the language. (Wittgenstein)
Depending on your use, HDF5 is:
- A generic (= not specific) data management interface with a bent toward multi-dimensional arrays
- HDF5 sits on the periphery
- You get some of the benefits, e.g., portability
- A data management interface toolkit.
- HDF5 is part of the fabric
- You create domain-specific interfaces to manage self-describing data representations
- You get a streamlined data life cycle
HDF5-based interfaces often surpass other solutions in power, sophistication, simplicity, performance, and portability.
Items are presented in a hierarchically structured name space
Complex data & Metadata
- (Complex = consisting of interconnected or interwoven parts, composite)
- Physical experiments & observational data
- Simulations
- Values of function evaluations
- Streams
Data is “caught” in a web of objects. Links and nodes can be pre-defined or user-defined items.
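As a concrete illustration, here is a minimal h5py sketch (file and object names invented for this purpose) of such a web: groups and datasets are the nodes, hard and soft links are the edges.

import h5py

# Hypothetical example: a tiny web of groups, datasets, and links
with h5py.File('web.hdf5', 'w') as f:
    f.create_group('/stations/15')                   # nodes in the name space
    f['/stations/15/temperature'] = [20.5, 20.7]     # a dataset hanging off a node
    f['/favorite'] = f['/stations/15']               # hard link: a second name for the same object
    f['/latest'] = h5py.SoftLink('/stations/15/temperature')  # soft link: a symbolic pointer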
An arrangement of data such that it can be processed or stored by a computer
Many different layouts exist for such HDF5 containers:
- Single file
- Multiple files
- Memory buffers
- Collection of objects in Intel DAOS or Amazon S3
- …
And you can create your own layouts!
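To give a flavor of what a few of these layouts look like from Python, here is a hedged h5py sketch; driver availability depends on how HDF5 and h5py were built, and all file names are placeholders.

import h5py

# Single file: the default driver
f1 = h5py.File('single.hdf5', 'w')

# Memory buffer: the 'core' driver keeps the container in RAM only
f2 = h5py.File('in-memory.hdf5', 'w', driver='core', backing_store=False)

# Multiple files: the 'family' driver splits the container into 1 MiB members
f3 = h5py.File('family_%d.hdf5', 'w', driver='family', memb_size=1024*1024)

# Object stores (e.g., S3 via the read-only 'ros3' driver, or DAOS via a VOL
# connector) are selected in a similar fashion, if the build supports them.
for f in (f1, f2, f3):
    f.close()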
We’ve tried (in versions 1, 2, 3, and 4) to make all the original mistakes for you!
The introduction to Andrew Collette’s book begins with an intuitive example.
Suppose we have multiple weather stations for which we would like to record temperature and wind speed time series. Let’s also assume that we acquire those samples at fixed sampling intervals.
This snippet of Python code shows how to arrange measurements from multiple stations in a single HDF5 container and how to capture important metadata such as platform characteristics and sampling rates.
import h5py, numpy as np, platform as pfm

# Weather stations record temperatures and wind speeds
with h5py.File('hello.hdf5', 'w') as f:
    f.attrs['system'] = pfm.system()
    f.attrs['release'] = pfm.release()
    f.attrs['processor'] = pfm.processor()

    # station ID 15
    temperature = np.random.random(1024)
    dt = 10.0  # Temperature sampled every 10 seconds
    wind = np.random.random(2048)
    dt_wind = 5.0  # Wind speed sampled every 5 seconds
    f['/15/temperature'] = temperature
    f['/15/temperature'].attrs['dt'] = dt
    f['/15/wind'] = wind
    f['/15/wind'].attrs['dt'] = dt_wind

    # station 20
    # f["/20/..."] = ...

from pathlib import Path
print('File size: {} bytes'.format(Path('hello.hdf5').stat().st_size))
File size: 32768 bytes
After running the example, we have an HDF5 file containing temperature and wind speed time series from one or more weather stations.
Judging from our Hello, HDF5! example, we are dealing with nested groupings of arrays. There’s one grouping for each weather station, and there are two array variables (temperature and wind) per grouping. This is almost accurate, except that all weather station groupings are part of the so-called root group, and that there are additional decorations (system characteristics, sampling rates).
For the purpose of this introduction, it’s OK to think of an HDF5 container as a file system in a file. (Of course, as long as there is a file…)
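Under that analogy, navigating a container feels like walking a directory tree. A minimal sketch, assuming the hello.hdf5 file from above:

import h5py

def show(name, obj):
    # Print each object's path and whether it is a group or a dataset
    kind = 'Group' if isinstance(obj, h5py.Group) else 'Dataset'
    print('/{} [{}]'.format(name, kind))

# Visit everything below the root group, much like a recursive directory listing
with h5py.File('hello.hdf5', 'r') as f:
    f.visititems(show)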
Speaking informally, the HDF5 data model includes two primitives and a set of combination rules. HDF5 is about describing array variables and their relationships.
(Datasets and attributes are roles in which array variables can be used in HDF5, and different rules apply to them.)
People often miss the simplicity of the HDF5 data model for irrelevant technical details. HDF5 represents data as (values of) variables and relationships among them. Isn’t that how mathematical models work? If there is any “secret sauce” to HDF5, this is it, and it’s hidden in plain sight.
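A small illustration of those two roles (file and names are hypothetical, not part of the example above): the same array variable stored once as a dataset and once as an attribute.

import h5py, numpy as np

samples = np.arange(4, dtype='f4')
with h5py.File('roles.hdf5', 'w') as f:
    f['as_dataset'] = samples                        # array variable in the dataset role
    f['as_dataset'].attrs['as_attribute'] = samples  # the same values in the attribute role
    # Different rules apply: datasets support partial I/O, chunking, and compression;
    # attributes are small decorations read and written in one piece.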
HDF5 couldn’t survive without users, without CONTRIBUTORS, or without an ecosystem. Unless you are writing C code all the time (condolences!), it’s very likely that you are benefitting from the work of someone who doesn’t work for The HDF Group. Support us to support them, or support them directly!
- It’s hard to find a language for which there are no HDF5 bindings or APIs
- The HDF Group develops and maintains a reference implementation in C and bindings for Fortran
- The community provides excellent bindings for Python, R, C++, Julia, .NET, Java…
- Third parties support HDF5 in their products, e.g., MathWorks, National Instruments, Wolfram, etc.
Download a fully featured Python 3 IDE to your mobile device from the Google Play store.
Homework: Run the "Hello, HDF5!" example on your phone and look at the file on your computer!
Below we illustrate how to transition from a single process writing to a dataset to multiple MPI-processes writing to different parts of a single dataset in a single shared HDF5 file. The code is written to emphasize similarities and to highlight the few places where they differ. The common portions are shown in the appendix.
The basic flow is as follows:
- Create an HDF5 file (line (seq-fcrt))
- Create an HDF5 dataset (line (seq-dcrt))
- Select the destination in the file (line (seq-sel))
- Write a data buffer (line (seq-wrt))
#include "literate-hdf5.h"
#define SIZE 1024*1024
int main(int argc, char** argv)
{
hid_t fapl, file, dset, file_space;
float* buffer;
hsize_t file_size;
fapl = H5Pcreate(H5P_FILE_ACCESS);
file = H5Fcreate("single-proc.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
fapl); // (ref:seq-fcrt)
dset = (*
<<make-dataset>>) (file, "1Mi-floats", SIZE); // (ref:seq-dcrt)
file_space = H5Dget_space(dset);
H5Sselect_all(file_space); // (ref:seq-sel)
<<create-buffer-and-write>> // (ref:seq-wrt)
<<clean-up>>
}
h5dump -pBH single-proc.h5
HDF5 "single-proc.h5" { SUPER_BLOCK { SUPERBLOCK_VERSION 0 FREELIST_VERSION 0 SYMBOLTABLE_VERSION 0 OBJECTHEADER_VERSION 0 OFFSET_SIZE 8 LENGTH_SIZE 8 BTREE_RANK 16 BTREE_LEAF 4 ISTORE_K 32 FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR FREE_SPACE_PERSIST FALSE FREE_SPACE_SECTION_THRESHOLD 1 FILE_SPACE_PAGE_SIZE 4096 USER_BLOCK { USERBLOCK_SIZE 0 } } GROUP "/" { DATASET "1Mi-floats" { DATATYPE H5T_IEEE_F32LE DATASPACE SIMPLE { ( 1048576 ) / ( 1048576 ) } STORAGE_LAYOUT { CONTIGUOUS SIZE 4194304 OFFSET 2048 } FILTERS { NONE } FILLVALUE { FILL_TIME H5D_FILL_TIME_IFSET VALUE H5D_FILL_VALUE_DEFAULT } ALLOCATION_TIME { H5D_ALLOC_TIME_LATE } } } }
The basic flow is exactly the same as in the sequential case:
- Create an HDF5 file (line (par-fcrt))
- Create an HDF5 dataset (line (par-dcrt))
- Select the destination in the file (lines (par-sel1) - (par-sel2))
- Write a data buffer (line (par-wrt))
There are only two differences between the sequential case and the MPI-parallel case:
- We have to instruct the HDF5 library to use MPI-IO layer (line (par-fapl))
- Since the data buffers from different MPI ranks are destined for different “offsets” in the dataset, the selection process is rank dependent (lines (par-sel1) - (par-sel2))
That’s it. Everything else is the same. Most importantly:
There is only _one_ HDF5 file format.
It is impossible to tell if a given HDF5 file was created by a sequential or parallel application.
Notice that the example is a case of weak scaling: each process writes the same amount of data, and the total amount of data written is proportional to the number of processes. (We speak of strong scaling when the total amount of data written is kept constant, independent of the number of writing MPI processes.)
#include "literate-hdf5.h"
#define SIZE 1024*1024
int main(int argc, char** argv)
{
int size, rank;
<<mpi-boilerplate>>
{
hid_t fapl, file, dset, file_space;
float* buffer;
hsize_t file_size;
fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL); // (ref:par-fapl)
file = H5Fcreate("multi-proc.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
fapl); // (ref:par-fcrt)
dset = (*
<<make-dataset>>) (file, "xMi-floats", size*SIZE); // (ref:par-dcrt)
file_space = H5Dget_space(dset);
{ // (ref:par-sel1)
hsize_t start = rank*SIZE, count = 1, block = SIZE;
H5Sselect_hyperslab(file_space, H5S_SELECT_SET,
&start, NULL, &count, &block);
} // (ref:par-sel2)
<<create-buffer-and-write>> // (ref:par-wrt)
<<clean-up>>
}
MPI_Finalize(); // (ref:par-mpi-shutdown)
}
The best known example is the Highly Scalable Data Service (HSDS). See John Readey’s presentation.
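To give a flavor: the h5pyd package mirrors the h5py API but talks to an HSDS endpoint instead of a local file. A minimal sketch, assuming a running HSDS instance and a placeholder domain name (endpoint and credentials typically come from a .hscfg file or environment variables):

import h5pyd  # h5py-compatible client for HSDS

# Domains play the role of file names on the service
with h5pyd.File('/home/myuser/hello.h5', 'w') as f:
    f['/15/temperature'] = [20.5, 20.7, 20.6]
    f['/15/temperature'].attrs['dt'] = 10.0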
CAUTION: Working with HDF5 in cloud-based environments means different things to different audiences. Without context, it means just this:
Reference: Reproducible Research with GNU Emacs and Org-mode by Thibault Lestang
The following stochastic differential equation describes a 1D Ornstein-Uhlenbeck process:
\begin{equation} \mathrm{d}x_t = -\mu x_t\,\mathrm{d}t + \sqrt{2D}\,\mathrm{d}W_t \end{equation}
A sample trajectory of the stochastic process can be approximated with a snippet of C++ code.
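The snippet advances the process with a simple Euler–Maruyama step. Writing the drift generically as a(x), one step of size Δt reads

\begin{equation} x_{n+1} = x_n + a(x_n)\,\Delta t + \sqrt{2D\,\Delta t}\,\xi_n, \qquad \xi_n \sim \mathcal{N}(0,1), \end{equation}

which is why the Gaussian increment in the code is scaled by the square root of the time step.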
std::default_random_engine generator;
std::normal_distribution<> distribution{0.0, 1.0};
double dt = 0.1, mu = 0.0, D = 0.5;
double x = 0.0;
for (unsigned i = 0; i < 100; ++i)
{
    auto t = i*dt;
    // Euler-Maruyama step: the Wiener increment scales with sqrt(dt)
    auto dw = sqrt(dt)*distribution(generator);
    x += (mu - x)*dt + sqrt(2.*D)*dw;
    std::cout << t << " " << x << std::endl;
}
import numpy as np, matplotlib.pyplot as plt

# `timeseries` is passed into this Org-babel block via a :var header argument
timeseries = np.array(timeseries)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(timeseries[:,0], timeseries[:,1])
ax.set_xlabel('t')
ax.set_ylabel('x')
plt.savefig('timeseries_vis.png')
return 'timeseries_vis.png'
The following function computes the sample mean.
from numpy import array, mean

# `x` is the (t, x) sample passed in via a :var header argument
values = array(x)[:,1]
return mean(values)
After a long and complicated statistical analysis, we conclude that the sample average is call_mean(initial_data) {{{results(-0.023857982999999992)}}}.
The following snippet stores our sample as a 100 x 2 2D array.
import h5py, numpy as np

with h5py.File('hello.hdf5', 'a') as f:
    f['t_x'] = np.array(x)
return 'SUCCESS'
SUCCESS
h5dump -H -d t_x hello.hdf5
HDF5 "hello.hdf5" { DATASET "t_x" { DATATYPE H5T_IEEE_F64LE DATASPACE SIMPLE { ( 100, 2 ) / ( 100, 2 ) } } }
Except for the name of the dataset t_x, it may not be obvious who’s who.
The following snippet stores our sample as a 100-element 1D array of a compound datatype.
import h5py, numpy as np

dt = np.dtype([("time", np.double), ("position", np.double)])
a = np.array(x)
with h5py.File('hello.hdf5', 'a') as f:
    f.create_dataset("compound", (100,), dtype=dt)
    f['compound'][:,'time'] = a[:,0]
    f['compound'][:,'position'] = a[:,1]
return 'SUCCESS'
SUCCESS
h5dump -H -d compound hello.hdf5
HDF5 "hello.hdf5" { DATASET "compound" { DATATYPE H5T_COMPOUND { H5T_IEEE_F64LE "time"; H5T_IEEE_F64LE "position"; } DATASPACE SIMPLE { ( 100 ) / ( 100 ) } } }
Let’s make this container more self-documenting by storing the simulation parameters:
import h5py, numpy as np

with h5py.File('hello.hdf5', 'a') as f:
    dset = f["compound"]
    dset.attrs['dt'] = 0.1
    dset.attrs['D'] = 0.5
    dset.attrs['μ'] = 0.0
h5dump -A -d compound hello.hdf5
HDF5 "hello.hdf5" { DATASET "compound" { DATATYPE H5T_COMPOUND { H5T_IEEE_F64LE "time"; H5T_IEEE_F64LE "position"; } DATASPACE SIMPLE { ( 100 ) / ( 100 ) } ATTRIBUTE "D" { DATATYPE H5T_IEEE_F64LE DATASPACE SCALAR DATA { (0): 0.5 } } ATTRIBUTE "dt" { DATATYPE H5T_IEEE_F64LE DATASPACE SCALAR DATA { (0): 0.1 } } ATTRIBUTE "μ" { DATATYPE H5T_IEEE_F64LE DATASPACE SCALAR DATA { (0): 0 } } } }
Other good candidates for attributes include physical units, calibrations, RNG seeds, etc.
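For instance, a hedged sketch (attribute names invented for illustration) of attaching units and an RNG seed to the compound dataset:

import h5py

with h5py.File('hello.hdf5', 'a') as f:
    dset = f['compound']
    dset.attrs['time_units'] = 'seconds'
    dset.attrs['rng_seed'] = 0   # seed of the pseudo-random number generator (illustrative)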
In this section, we focus on raster images. However, the two approaches presented here apply, mutatis mutandis, to vector images.
We can treat images as blobs or byte sequences (see the section Opaque datasets), or we can treat them as 2D arrays of pixels/color values plus certain metadata, e.g., a palette (see the section Annotated 2D datasets). Whichever approach we choose determines how they can then be accessed and manipulated.
#include "literate-hdf5.h"
int main(int argc, char** argv)
{
size_t size;
char* buf = (*
<<read-image-bytes>>) ("./img/timeseries_vis.png", &size);
printf("%ld\n", size);
(*
<<create-and-write-opaque-dset>>) ("hello.hdf5", "bytes", buf, size);
free(buf);
return 0;
}
lambda(char*, (const char* name, size_t* size),
       {
         char* result;
         FILE* fp = fopen(name, "rb");
         fseek(fp, 0L, SEEK_END);
         *size = ftell(fp);
         fseek(fp, 0, SEEK_SET);
         result = (char*) malloc(*size);
         fread(result, *size, 1, fp);
         fclose(fp);
         return result;
       })
lambda(void,
       (const char* fname, const char* dname,
        const char* buf, size_t size),
       {
         hid_t file = H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);
         hid_t dtype = H5Tcreate(H5T_OPAQUE, size);
         hid_t dspace = H5Screate(H5S_SCALAR);
         hid_t dset;
         H5Tset_tag(dtype, "image/png");
         dset = H5Dcreate(file, dname, dtype, dspace,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
         H5Dwrite(dset, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
         H5Dclose(dset);
         H5Sclose(dspace);
         H5Tclose(dtype);
         H5Fclose(file);
       })
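For readers working in Python: h5py can do roughly the same thing by wrapping the raw bytes in np.void, which stores them in a scalar dataset with an opaque datatype. A minimal sketch; the dataset name bytes_py is made up to avoid clashing with the C example above.

import h5py, numpy as np

with open('./img/timeseries_vis.png', 'rb') as src:
    png_bytes = src.read()

with h5py.File('hello.hdf5', 'a') as f:
    # np.void wraps the blob so it lands in a scalar dataset of opaque type
    f['bytes_py'] = np.void(png_bytes)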
We use a simple tool, gif2h5, to create a dataset representation conforming to the HDF5 image specification. As a sample image, we use the sample trajectory from the section Visualization. Unfortunately, gif2h5 accepts only GIF images, and we first need to convert the PNG file timeseries_vis.png to GIF.
ImageMagick to the rescue!
convert -version
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org Copyright: © 1999-2019 ImageMagick Studio LLC License: https://imagemagick.org/script/license.php Features: Cipher DPC Modules OpenMP Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
convert ./img/timeseries_vis.png timeseries_vis.gif
Now we are ready to call gif2h5.
gif2h5 timeseries_vis.gif timeseries_vis.h5
timeseries_vis.h5 contains a 2D dataset of pixels called Image0 and a 2D palette called global.
h5ls -v timeseries_vis.h5
Opened "timeseries_vis.h5" with sec2 driver. Image0 Dataset {480/480, 640/640} Attribute: CLASS scalar Type: 6-byte null-terminated ASCII string Attribute: IMAGE_SUBCLASS scalar Type: 14-byte null-terminated ASCII string Attribute: IMAGE_VERSION scalar Type: 4-byte null-terminated ASCII string Attribute: PALETTE scalar Type: object reference Location: 1:1400 Links: 1 Storage: 307200 logical bytes, 307200 allocated bytes, 100.00% utilization Type: native unsigned char global Dataset {128/128, 3/3} Attribute: CLASS scalar Type: 8-byte null-terminated ASCII string Attribute: PAL_VERSION scalar Type: 4-byte null-terminated ASCII string Location: 1:800 Links: 1 Storage: 384 logical bytes, 384 allocated bytes, 100.00% utilization Type: native unsigned char
Use HDFView to look at the image! (TODO Add a screenshot!)
Next, we copy the pixel and palette datasets to hello.hdf5.
h5copy -v -f ref -i timeseries_vis.h5 -s Image0 -o hello.hdf5 -d Image0
Finally, we jam this text file (the Org source of this document) into the user block of hello.hdf5:
h5jam -i hello.hdf5 -u $ublock --clobber
head -n 10 hello.hdf5
#+TITLE: HDF5 Introduction #+AUTHOR: Gerd Heber #+EMAIL: [email protected] #+CREATOR: <a href="http://www.gnu.org/software/emacs/">Emacs</a> 27.1.90 (<a href="http://orgmode.org">Org</a> mode 9.4.4) #+DATE: [2021-01-01 Fri] #+OPTIONS: author:t creator:t email:t toc:nil num:nil #+PROPERTY: header-args :eval never-export * Outline
It’s still an HDF5 file:
h5ls -vr hello.hdf5
Opened "hello.hdf5" with sec2 driver. / Group Attribute: processor scalar Type: variable-length null-terminated UTF-8 string Attribute: release scalar Type: variable-length null-terminated UTF-8 string Attribute: system scalar Type: variable-length null-terminated UTF-8 string Location: 1:96 Links: 1 /15 Group Location: 1:1344 Links: 1 /15/temperature Dataset {1024/1024} Attribute: dt scalar Type: native double Location: 1:1072 Links: 1 Storage: 8192 logical bytes, 8192 allocated bytes, 100.00% utilization Type: native double /15/wind Dataset {2048/2048} Attribute: dt scalar Type: native double Location: 1:14992 Links: 1 Storage: 16384 logical bytes, 16384 allocated bytes, 100.00% utilization Type: native double /Image0 Dataset {480/480, 640/640} Attribute: CLASS scalar Type: 6-byte null-terminated ASCII string Attribute: IMAGE_SUBCLASS scalar Type: 14-byte null-terminated ASCII string Attribute: IMAGE_VERSION scalar Type: 4-byte null-terminated ASCII string Attribute: PALETTE scalar Type: object reference Location: 1:347520 Links: 1 Storage: 307200 logical bytes, 307200 allocated bytes, 100.00% utilization Type: native unsigned char /compound Dataset {100/100} Attribute: D scalar Type: native double Attribute: dt scalar Type: native double Attribute: \316\274 scalar Type: native double Location: 1:36416 Links: 1 Storage: 1600 logical bytes, 1600 allocated bytes, 100.00% utilization Type: struct { "time" +0 native double "position" +8 native double } 16 bytes /t_x Dataset {100/100, 2/2} Location: 1:32768 Links: 1 Storage: 1600 logical bytes, 1600 allocated bytes, 100.00% utilization Type: native double /~obj_pointed_by_347888 Dataset {128/128, 3/3} Attribute: CLASS scalar Type: 8-byte null-terminated ASCII string Attribute: PAL_VERSION scalar Type: 4-byte null-terminated ASCII string Location: 1:347888 Links: 2 Storage: 384 logical bytes, 384 allocated bytes, 100.00% utilization Type: native unsigned char
h5dump -BH hello.hdf5
We can extract the so-called user block at the beginning of the file with h5unjam:
h5unjam -i hello.hdf5 -o no-user-block.h5 -u user-block.org
head -n 10 user-block.org
In this section, we provide the common code snippets for the sequential and parallel examples.
This is typical MPI boilerplate. Each MPI process determines the communicator size and its own rank.
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
To create a dataset (array variable), we need to specify its shape (line (dsp-crt)) and the datatype of its elements (H5T_IEEE_F32LE on line (dst-crt)).
lambda(hid_t, (hid_t file, const char* name, hsize_t elt_count),
       {
         hid_t result;
         hid_t fspace = H5Screate_simple(1, (hsize_t[]) { elt_count },
                                         NULL); // (ref:dsp-crt)
         result = H5Dcreate(file, name, H5T_IEEE_F32LE, fspace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT); // (ref:dst-crt)
         H5Sclose(fspace);
         return result;
       })
We create and initialize the data buffer to be written. Its shape is described by its in-memory dataspace mem_space (line (msp)). Since we are writing the entire buffer, we select all of its elements (line (msp-sel)).
buffer = (float*) malloc(SIZE*sizeof(float));
{ /* Do something interesting with buffer! */
  size_t i;
  for (i = 0; i < SIZE; ++i)
    buffer[i] = (float) i;
}
{
  hid_t mem_space = H5Screate_simple(1, (hsize_t[]) { SIZE }, NULL); // (ref:msp)
  H5Sselect_all(mem_space); // (ref:msp-sel)
  H5Dwrite(dset, H5T_NATIVE_FLOAT, mem_space, file_space, H5P_DEFAULT,
           buffer);
  H5Sclose(mem_space);
}
Adhering to the HDF5 library’s handle discipline is the alpha and omega of resource management: every handle we created gets closed, and the buffer we allocated gets freed.
H5Pclose(fapl);
free(buffer);
H5Sclose(file_space);
H5Dclose(dset);
H5Fclose(file);