
[Feature]: write xarray-compatible Zarr files #176

Closed
3 tasks done
alejoe91 opened this issue Mar 21, 2024 · 9 comments · Fixed by #207
Assignees
Labels
category: proposal proposed enhancements or new features priority: medium non-critical problem and/or affecting only a small set of users topic: storage issues related to storage schema and formats
Milestone

Comments

@alejoe91
Collaborator

What would you like to see added to HDMF-ZARR?

Xarray supports the Zarr backend, but requires the _ARRAY_DIMENSIONS attribute to be set to a list of names for the array dimensions (e.g. ["samples", "channels"]) - see https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding

It would be great to add these attributes by default for known data types (e.g. ElectricalSeries).
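For context, xarray's Zarr encoding spec only needs each array's attributes (its .zattrs) to carry an _ARRAY_DIMENSIONS key with one name per axis. A minimal sketch of what that attribute content could look like for an ElectricalSeries data array (the dimension names here are illustrative, following the NWB schema's num_times/num_channels convention):

```python
import json

# Illustrative .zattrs content for a 2-D ElectricalSeries "data" array.
# xarray.open_zarr looks for the special "_ARRAY_DIMENSIONS" key and uses
# it to name the dimensions of the resulting DataArray.
zattrs = {
    "unit": "volts",                                     # existing NWB attribute
    "_ARRAY_DIMENSIONS": ["num_times", "num_channels"],  # one label per axis
}

print(json.dumps(zattrs, indent=2))
```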

@jsiegle

Is your feature request related to a problem?

NWB-Zarr files cannot be opened by xarray.open_zarr

What solution would you like?

Adding the _ARRAY_DIMENSIONS attributes to all "known" neurodata_types

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

@bendichter
Contributor

bendichter commented Mar 21, 2024

from xarray import DataArray
import numpy as np
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from h5py import Dataset
from pynwb import get_type_map

dset_types = (np.ndarray, Dataset)  # array-like fields to exclude from attrs; etc.


def get_dimension_labels(cls, ndims, dset_name):
    """Look up the dimension labels for a named dataset in the type's spec."""
    spec = get_type_map().namespace_catalog.get_spec(cls.namespace, cls.neurodata_type)
    data = next(x for x in spec["datasets"] if x["name"] == dset_name)
    dims = data["dims"]

    if isinstance(dims[0], str):  # only one shape spec
        return dims

    # multiple allowed shapes: pick the one matching the actual rank
    for i_dims in dims:
        if len(i_dims) == ndims:
            return i_dims


def load_dset_as_xarray(obj, dset_name):
    """Wrap a dataset of an NWB container as an xarray.DataArray."""
    dset = obj.fields[dset_name]
    cls = obj.__class__
    ndims = len(dset.shape)

    dim_labels = get_dimension_labels(cls, ndims, dset_name)

    coords = dict(num_channels=obj.electrodes.data)  # ElectricalSeries-specific
    if obj.timestamps is not None:
        coords.update(num_times=obj.timestamps)

    # carry over non-array fields as DataArray attributes
    attrs = {k: v for k, v in obj.fields.items() if not isinstance(v, dset_types)}
    return DataArray(dset, dims=dim_labels, coords=coords, attrs=attrs)


electrical_series = mock_ElectricalSeries(timestamps=np.arange(10), rate=None)

load_dset_as_xarray(electrical_series, "data")

@bendichter
Contributor

^ Code that solves a related problem and might be helpful

@oruebel
Contributor

oruebel commented Mar 21, 2024

^ Code that solves a related problem and might be helpful

I think this could be useful to have in some form available in PyNWB. Maybe as a utility method and/or as a method on TimeSeries, since a common use-case for this is probably representing TimeSeries.data as xarray.

@oruebel
Contributor

oruebel commented Mar 21, 2024

It would be great to add these attributes as default for known data types (e.g. ElectricalSeries)

Adding the _ARRAY_DIMENSIONS attribute for cases where we know the dimensions seems like a good idea 👍

@oruebel
Contributor

oruebel commented Mar 21, 2024

In terms of implementation, I think this will require changes in HDMF as well. Here a rough plan of how this could be implemented:

  1. Add dimension_labels as an attribute on the DatasetBuilder (which may be None if the labels are unknown) https://github.com/hdmf-dev/hdmf/blob/5c8506216995f995b891da1e6b596ee42b7dd948/src/hdmf/build/builders.py#L321
  2. Enhance BuildManager.build to set the dimension_labels for DatasetBuilders https://github.com/hdmf-dev/hdmf/blob/5c8506216995f995b891da1e6b596ee42b7dd948/src/hdmf/build/manager.py#L148
  3. Update ZarrIO.write_dataset to add the _ARRAY_DIMENSIONS to the attributes of the dataset if builder.dimension_labels are present.

attributes = builder.attributes
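The step-3 change could be as small as a few lines around that attributes = builder.attributes assignment. A hedged sketch as a standalone helper (the real ZarrIO and builder APIs may differ; with_array_dimensions is a hypothetical name):

```python
def with_array_dimensions(attributes, dimension_labels):
    """Return a copy of the attribute dict with _ARRAY_DIMENSIONS added
    when the builder carries known dimension labels (hypothetical helper,
    mirroring step 3 of the plan above)."""
    if dimension_labels is None:
        return attributes
    out = dict(attributes)  # do not mutate the builder's own dict
    out["_ARRAY_DIMENSIONS"] = list(dimension_labels)
    return out


# e.g. inside ZarrIO.write_dataset:
#   attributes = with_array_dimensions(builder.attributes, builder.dimension_labels)
attrs = with_array_dimensions({"unit": "volts"}, ("num_times", "num_channels"))
print(attrs)
```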

@rly does that plan sound reasonable, or would this also require changes in the ObjectMapper to determine the dimensions there instead of in the BuildManager?

@rly
Contributor

rly commented Mar 23, 2024

@rly does that plan sound reasonable, or would this also require changes in the ObjectMapper to determine the dimensions there instead of in the BuildManager?

That sounds reasonable, except that all the building / creation of DatasetBuilder objects happens in the ObjectMapper. We'd probably do it in __add_attributes or the constructor.

@oruebel
Contributor

oruebel commented Mar 24, 2024

@mavaylon1 could you take a look at this?

@rly
Contributor

rly commented Mar 24, 2024

I'll work on the HDMF side

@mavaylon1
Contributor

Can do

@mavaylon1 mavaylon1 added this to the Future milestone Apr 6, 2024
@rly rly added category: bug errors in the code or code behavior category: proposal proposed enhancements or new features priority: medium non-critical problem and/or affecting only a small set of users topic: storage issues related to storage schema and formats and removed category: bug errors in the code or code behavior labels Apr 11, 2024
@rly rly modified the milestones: Future, Next Release Apr 11, 2024
@mavaylon1 mavaylon1 mentioned this issue Jul 24, 2024
6 tasks

5 participants