DataSet::read over 2x slower than DataSet::read_raw for Eigen::Matrix #1051
-
Thank you for the suggestion. There will be several answers (which can be discussed separately). This answer explains the advantages of the current choice. When writing the array to disk, the information that it used to be an Eigen::Matrix is lost. For this to work, it needs to be implicitly understood which element of the dataset corresponds to a particular row and column of the array. The only convention I'm familiar with is that element (i, j) of the on-disk array is row i, column j of the matrix. The provided example demonstrates the issue quite nicely: when writing an Eigen::Matrix with 3 rows and 3e8 columns, the dataset should have 3 rows and 3e8 columns as well, not the reverse. What's nice about the current convention is that it's independent of the container used to store the array; it works the same for row-major and column-major containers. Same point once more: the format on disk is independent of how the matrix happens to be arranged in RAM. (Personally, I find this property very valuable.)
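To make that concrete, here is a minimal sketch (assuming a HighFive build with Eigen support enabled; the file name matrix.h5 and the dataset name mat are placeholders): reading the same dataset into a row-major and a column-major Eigen matrix yields the same logical matrix, because the on-disk layout does not encode the in-memory storage order.

```cpp
#include <Eigen/Dense>
#include <highfive/highfive.hpp>  // header names differ in older HighFive versions
#include <highfive/eigen.hpp>     // Eigen support header in recent versions

#include <iostream>

int main() {
    // Placeholder file/dataset names, for illustration only.
    HighFive::File file("matrix.h5", HighFive::File::ReadOnly);
    auto dset = file.getDataSet("mat");
    auto dims = dset.getSpace().getDimensions();  // dims[0] = rows, dims[1] = columns

    // Same on-disk array, two different in-memory layouts.
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> row_major(dims[0], dims[1]);
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> col_major(dims[0], dims[1]);
    dset.read(row_major);
    dset.read(col_major);

    // Element (i, j) means the same thing in both, independent of storage order.
    std::cout << std::boolalpha << row_major.isApprox(col_major) << "\n";
    return 0;
}
```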
-
This answer is related to API stability. The proposed change can easily go unnoticed, e.g. for square matrices, but also in cases where one doesn't know the shape via an independent source of truth. By simply looking at the file, it's impossible to know which convention was used when it was written. Therefore, this change is quite error prone. It's also extremely hard to recover from, because one can't know if a particular file was written with a HighFive version from before or after the change. Hence, given the state of HighFive, personally, I'm against making this type of breaking change.
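To illustrate why the file alone can't settle it (a minimal sketch; matrix.h5 and mat are placeholder names): all the file carries is the extents, so nothing records which convention the writer intended.

```cpp
#include <highfive/highfive.hpp>

#include <iostream>
#include <vector>

int main() {
    // Placeholder file/dataset names, for illustration only.
    HighFive::File file("matrix.h5", HighFive::File::ReadOnly);
    auto dset = file.getDataSet("mat");

    // The file only tells us the shape; it does not say whether dims[0]
    // was meant to be the rows or the columns of the original Eigen::Matrix.
    std::vector<size_t> dims = dset.getSpace().getDimensions();
    std::cout << "on-disk shape: " << dims[0] << " x " << dims[1] << "\n";

    // For a square matrix (dims[0] == dims[1]) a changed convention would go
    // completely unnoticed: the data would simply be read back transposed.
    return 0;
}
```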
-
Third answer, about the performance impact: it's (somewhat) surprising that the impact is this big. "Surprising" because RAM/NVMe speeds would suggest the impact should be smaller. "Somewhat" because we're very naive about how we transpose arrays. Hence, something we can do is look at optimizing the transpose (for special cases).
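As an example of the kind of special-casing that could help (a sketch of the general idea only, not HighFive's actual code): a cache-blocked transpose works on small tiles so that both the reads and the writes stay cache-friendly, instead of striding across the whole array on one side.

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked transpose: src is rows x cols in row-major order,
// dst receives the cols x rows transpose. Working in BLOCK x BLOCK tiles
// keeps both reads and writes within a few cache lines at a time, which is
// typically much faster than a naive double loop for large arrays.
void transpose_blocked(const double* src, double* dst,
                       std::size_t rows, std::size_t cols) {
    constexpr std::size_t BLOCK = 64;
    for (std::size_t i0 = 0; i0 < rows; i0 += BLOCK) {
        for (std::size_t j0 = 0; j0 < cols; j0 += BLOCK) {
            const std::size_t i1 = std::min(i0 + BLOCK, rows);
            const std::size_t j1 = std::min(j0 + BLOCK, cols);
            for (std::size_t i = i0; i < i1; ++i) {
                for (std::size_t j = j0; j < j1; ++j) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```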
-
I discovered that for Eigen::Matrix, DataSet::read is much slower than DataSet::read_raw. I believe this is because, when converting from row-major to column-major, HighFive is transposing the data instead of reversing the ordering of the dimension extents. I created a minimal example that reads a file with 3 rows and 3e8 columns, and another with 3e8 rows and 3 columns.
https://github.com/quantumsteve/highfive_eigen
create_input.py writes the input files. The read executable uses DataSet::read and the read_raw executable uses DataSet::read_raw. On my laptop I get the following load times.
Both MATLAB and HDF5.jl reverse the dimensions. Would this be a better default behavior for HighFive?