DataSet::read over 2x slower than DataSet::read_raw for Eigen::Matrix #1051
-
Thank you for the suggestion. There will be several answers (which can be discussed separately). This answer explains the advantages of the current choice. When writing the array to disk, the information that it used to be an Eigen::Matrix is lost. For this to work, it needs to be implicitly understood which element of the dataset corresponds to a particular row and column of the array. The only convention I'm familiar with is that element (i, j) of the on-disk array is row i, column j of the matrix. The provided example demonstrates the issue quite nicely: when writing an Eigen::Matrix with 3 rows and 3e8 columns, the dataset should have 3 rows and 3e8 columns as well, not the reverse. What's nice about the current convention is that it's independent of the container used to store the array; it works the same for row-major and column-major containers. Same point once more: the format on disk is independent of how the matrix happens to be arranged in RAM. (Personally, I find this property very valuable.)
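To make that concrete, here is a minimal sketch (assuming a HighFive build with Eigen support enabled; the file name matrix.h5 and the dataset name mat are placeholders): reading the same dataset into a row-major and a column-major Eigen matrix yields the same logical matrix, because the on-disk layout does not encode the in-memory storage order.

```cpp
#include <Eigen/Dense>
#include <highfive/highfive.hpp>  // header names differ in older HighFive versions
#include <highfive/eigen.hpp>     // Eigen support header in recent versions

#include <iostream>

int main() {
    // Placeholder file/dataset names, for illustration only.
    HighFive::File file("matrix.h5", HighFive::File::ReadOnly);
    auto dset = file.getDataSet("mat");
    auto dims = dset.getSpace().getDimensions();  // dims[0] = rows, dims[1] = columns

    // Same on-disk array, two different in-memory layouts.
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> row_major(dims[0], dims[1]);
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> col_major(dims[0], dims[1]);
    dset.read(row_major);
    dset.read(col_major);

    // Element (i, j) means the same thing in both, independent of storage order.
    std::cout << std::boolalpha << row_major.isApprox(col_major) << "\n";
    return 0;
}
```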
-
This answer is related to API stability. The proposed change can easily go unnoticed, e.g. for square matrices, but also in cases where one doesn't know the shape via an independent source of truth. By simply looking at the file, it's impossible to know which convention was used when it was written. Therefore, this change is quite error prone. It's also extremely hard to recover from, because one can't know if a particular file was written with a HighFive version from before or after the change. Hence, given the state of HighFive, personally, I'm against making this type of breaking change.
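To illustrate why the file alone can't settle it (a minimal sketch; matrix.h5 and mat are placeholder names): all the file carries is the extents, so nothing records which convention the writer intended.

```cpp
#include <highfive/highfive.hpp>

#include <iostream>
#include <vector>

int main() {
    // Placeholder file/dataset names, for illustration only.
    HighFive::File file("matrix.h5", HighFive::File::ReadOnly);
    auto dset = file.getDataSet("mat");

    // The file only tells us the shape; it does not say whether dims[0]
    // was meant to be the rows or the columns of the original Eigen::Matrix.
    std::vector<size_t> dims = dset.getSpace().getDimensions();
    std::cout << "on-disk shape: " << dims[0] << " x " << dims[1] << "\n";

    // For a square matrix (dims[0] == dims[1]) a changed convention would go
    // completely unnoticed: the data would simply be read back transposed.
    return 0;
}
```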
-
Third answer, about the performance impact: it's (somewhat) surprising that the impact is this big. "Surprising" because RAM/NVMe speeds would suggest the impact should be smaller. "Somewhat" because we're very naive about how we transpose arrays. Hence, something we can do is look at optimizing the transpose (for special cases).
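As an example of the kind of special-casing that could help (a sketch of the general idea only, not HighFive's actual code): a cache-blocked transpose works on small tiles so that both the reads and the writes stay cache-friendly, instead of striding across the whole array on one side.

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked transpose: src is rows x cols in row-major order,
// dst receives the cols x rows transpose. Working in BLOCK x BLOCK tiles
// keeps both reads and writes within a few cache lines at a time, which is
// typically much faster than a naive double loop for large arrays.
void transpose_blocked(const double* src, double* dst,
                       std::size_t rows, std::size_t cols) {
    constexpr std::size_t BLOCK = 64;
    for (std::size_t i0 = 0; i0 < rows; i0 += BLOCK) {
        for (std::size_t j0 = 0; j0 < cols; j0 += BLOCK) {
            const std::size_t i1 = std::min(i0 + BLOCK, rows);
            const std::size_t j1 = std::min(j0 + BLOCK, cols);
            for (std::size_t i = i0; i < i1; ++i) {
                for (std::size_t j = j0; j < j1; ++j) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```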
-
I discovered that for Eigen::Matrix, DataSet::read is much slower than DataSet::read_raw. I believe this is because, when converting from row-major to column-major, HighFive is transposing the data instead of reversing the ordering of the dimension extents. I created a minimal example that reads a file with 3 rows and 3e8 columns, and another with 3e8 rows and 3 columns.
https://github.com/quantumsteve/highfive_eigen
create_input.py writes the input files. The read executable uses DataSet::read and the read_raw executable uses DataSet::read_raw. On my laptop I get the following load times.
Both MATLAB and HDF5.jl reverse the dimensions. Would this be a better default behavior for HighFive?