handle compound dtypes #19

Closed
wants to merge 1 commit into from

Conversation

@magland (Collaborator) commented Mar 18, 2024

@rly this is a proposal.

The goal here is to support datasets with compound dtypes in both LindiH5Store and LindiClient such that the round-trip leaves compound dtypes intact. In particular, you can slice the dataset in the LindiClient in the same way you would slice the original dataset with h5py.

(btw, I didn't look too carefully at how this is handled in hdmf-zarr, but I think the proposal below is consistent with how things are represented in numpy and hdf5)

To illustrate, here is a test that was added:

import tempfile

import h5py
import numpy as np

# LindiH5Store, LindiClient, and LindiDataset come from the lindi package;
# _check_equal is a helper defined elsewhere in the test module.
from lindi import LindiClient, LindiDataset, LindiH5Store


def test_compound_dtype():
    print("Testing compound dtype")
    with tempfile.TemporaryDirectory() as tmpdir:
        filename = f"{tmpdir}/test.h5"
        # Write a dataset with a compound dtype using h5py
        with h5py.File(filename, "w") as f:
            dt = np.dtype([("x", "i4"), ("y", "f8")])
            f.create_dataset("X", data=[(1, 3.14), (2, 6.28)], dtype=dt)
        h5f = h5py.File(filename, "r")
        # Wrap the file in a LindiH5Store and round-trip through a
        # reference file system into a LindiClient
        store = LindiH5Store.from_file(filename, url=filename)
        rfs = store.to_reference_file_system()
        client = LindiClient.from_reference_file_system(rfs)
        X1 = h5f["X"]
        assert isinstance(X1, h5py.Dataset)
        X2 = client["X"]
        assert isinstance(X2, LindiDataset)
        assert X1.shape == X2.shape
        assert X1.dtype == X2.dtype
        assert X1.size == X2.size
        # assert X1.nbytes == X2.nbytes  # nbytes are not going to match because the internal representation is different
        assert len(X1) == len(X2)
        if not _check_equal(X1['x'][:], X2['x'][:]):
            print("WARNING. Arrays for x are not equal")
            print(X1['x'][:])
            print(X2['x'][:])
            raise ValueError("Arrays are not equal")
        if not _check_equal(X1['y'][:], X2['y'][:]):
            print("WARNING. Arrays for y are not equal")
            print(X1['y'][:])
            print(X2['y'][:])
            raise ValueError("Arrays are not equal")
        store.close()

To summarize this test:

  • We create an hdf5 file with an array X with a compound dtype and fields x and y of types i4 and f8, respectively.
  • We wrap that in a LindiH5Store -- so we get a zarr array. The compound dtype array is represented using a special attribute (described below).
  • We then generate a reference file system and load that into a LindiClient. This represents the compound dtype array in the same way as the h5py client does for the original file.

Internally this is represented using the _COMPOUND_DTYPE attribute on the zarr array. In the above case, this would be

_COMPOUND_DTYPE: [['x', 'int32'], ['y', 'float64']]

The data for the array is a JSON-encoded array of the form [[1, 3.14], [2, 6.28]].
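For concreteness, here is a minimal sketch of the encode/decode logic implied by this representation. This is only an illustration, not the actual implementation in this PR, and encode_compound / decode_compound are hypothetical names:

import json

import numpy as np

def encode_compound(arr):
    # Store the field names/dtypes as the attribute value and the rows as plain lists
    compound_dtype = [[name, str(arr.dtype[name])] for name in arr.dtype.names]
    json_data = json.dumps([list(row) for row in arr.tolist()])
    return compound_dtype, json_data

def decode_compound(compound_dtype, json_data):
    # Rebuild the structured dtype from the attribute and the rows from the JSON data
    dt = np.dtype([(name, d) for name, d in compound_dtype])
    return np.array([tuple(row) for row in json.loads(json_data)], dtype=dt)

X = np.array([(1, 3.14), (2, 6.28)], dtype=np.dtype([("x", "i4"), ("y", "f8")]))
attr, data = encode_compound(X)
assert attr == [["x", "int32"], ["y", "float64"]]
assert data == "[[1, 3.14], [2, 6.28]]"
assert np.array_equal(decode_compound(attr, data), X)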

See comments in the source code for more details.

A side note: ideally we wouldn't need JSON encoding at all; we could instead point to the chunks in the original hdf5 file. But that would be much trickier to get working with zarr, and we're going to want to support special encoding for references anyway. One caveat of the JSON-encoded inline data approach is that it can make the .zarr.json file very large in the case of large compound dtype datasets.
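As a rough sketch of what the chunk-pointing alternative could look like: h5py's low-level API can report the byte offset and size of each chunk inside the hdf5 file, which is the information a reference file system entry would need. This is only an illustration of the idea, not code from this PR:

import h5py
import numpy as np

with h5py.File("compound.h5", "w") as f:
    dt = np.dtype([("x", "i4"), ("y", "f8")])
    # chunks=(2,) forces chunked (rather than contiguous) storage
    f.create_dataset("X", data=[(1, 3.14), (2, 6.28)], dtype=dt, chunks=(2,))

with h5py.File("compound.h5", "r") as f:
    dsid = f["X"].id
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)
        # byte_offset and size locate the raw chunk bytes within the file
        print(info.chunk_offset, info.byte_offset, info.size)

(The trickiness is presumably in teaching zarr to decode those raw bytes as a compound dtype.)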

@rly (Contributor) commented Mar 21, 2024

Later this week / early next week, I'll do a deep-dive comparison between this approach and how it might differ, if at all, from the hdmf-zarr approach. I suspect that the JSON encoding here might result in discrepancies, but if we have the dtype stored in _COMPOUND_DTYPE in the array attrs, then we might be OK.

@magland (Collaborator, Author) commented Mar 21, 2024

> Later this week / early next week, I'll do a deep-dive comparison between this approach and how it might differ, if at all, from the hdmf-zarr approach. I suspect that the JSON encoding here might result in discrepancies, but if we have the dtype stored in _COMPOUND_DTYPE in the array attrs, then we might be OK.

Sounds good. Yeah, I'm not sure that JSON encoding is the right thing here. Worth thinking about.

@magland closed this Mar 21, 2024