You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
While experimenting with kerchunking some Icesat2 ATL08 data I noticed an issue where using MultiZarrToZarr with non-dimension coordinates that had partial chunks resulted in empty values for those variables in the ouput kerchunk index.
A minimal example
importxarrayasxrimportnumpyasnpfromkerchunk.hdfimportSingleHdf5ToZarrfromkerchunk.combineimportMultiZarrToZarr# Set up fake netCDF files with latitude as a non-dimension coordinate.time_1_3=xr.Dataset(
{
"delta_time": [12, 13, 14]
},
coords={
'latitude': ('delta_time', [1, 2, 3]),
})
time_15_17=xr.Dataset(
{
"delta_time": [15, 16, 17]
},
coords={
'latitude': ('delta_time', [4, 5, 6]),
})
time_18_22=xr.Dataset(
{
"delta_time": [18, 19, 20, 21, 22]
},
coords={
'latitude': ('delta_time', [7, 8, 9, 10, 11]),
})
time_1_3
# Create netCDFs with chunksize aligned to data sizechunksize=3encoding= {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_1_3.to_netcdf('time_1_3.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
defcreate_reference(files: list[str]):
single_jsons= [SingleHdf5ToZarr(filepath, inline_threshold=0).translate() forfilepathinfiles]
mzz=MultiZarrToZarr(
single_jsons,
concat_dims=["delta_time"],
)
combined_test_json=mzz.translate()
combined_test=xr.open_dataset(
"reference://", engine="zarr",
backend_kwargs={
"storage_options": {
"fo": combined_test_json,
},
"consolidated": False,
}
)
returncombined_test# This works as expectedcombined_test_3=create_reference(["time_1_3.nc", "time_15_17.nc"])
combined_test_3.latitude.data
# Create netCDFs with a larger chunksize resulting in partially filled chunks not aligned to data sizechunksize=10encoding= {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_18_22.to_netcdf('time_18_22.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
# This results in empty values for the resulting non-dimension coordinate variable.combined_test_10=create_reference(["time_15_17.nc", "time_18_22.nc"])
combined_test_10.latitude.data
This seems potentially related to some of the discussion in #305 (as it is also describing the case of data not aligned with chunk size).
If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).
I may be misunderstanding the MultiZarrToZarr logic in this case where we have regularly sized, partially filled chunks. Is it possible to have a non-dimension variable concatenated in a linear fashion in this situation?
The text was updated successfully, but these errors were encountered:
If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).
I think in this case, the coordinates get inlined as a single chunk in the output references, so the chunking of the original doesn't matter. The same could be done with any variable. This is not chunk inlining but whole array inlining to a single chunk, e.g., kerchunk.utils.inline_array . However, MultiZarrToZarr doesn't even bother accumulating values of non-coordinates, since they are not used for figuring out the concatenation chunk placement.
More generally, non-equal chunks are just not possible because of the limitations of zarr at least until ZEP003 is accepted and implemented ( zarr-developers/zarr-python#1483 ). If this were done, no magic inlining would be needed. Feel free to ping on the issue or the ZEP repo saying you need this.
While experimenting with kerchunking some Icesat2 ATL08 data I noticed an issue where using
MultiZarrToZarr
with non-dimension coordinates that had partial chunks resulted in empty values for those variables in the ouput kerchunk index.A minimal example
This seems potentially related to some of the discussion in #305 (as it is also describing the case of data not aligned with chunk size).
If
latitude
is promoted to aconcat_dim
the output is correct (with all of thelatitude
values included).I may be misunderstanding the
MultiZarrToZarr
logic in this case where we have regularly sized, partially filled chunks. Is it possible to have a non-dimension variable concatenated in a linear fashion in this situation?The text was updated successfully, but these errors were encountered: