Control chunksize of the underlying zarr data #406
Comments
The chunksize outputted by Kerchunk in no way changes the actual data storage in the original files it references, so there is no way to change the chunking. Unfortunately, this means you cannot combine incompatible data, as you have found.
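To make the constraint concrete: a kerchunk reference set only maps each zarr chunk key to a byte range inside the original file, so the chunk grid is fixed by how that file was written. A minimal sketch (file name, offsets, and sizes are illustrative, not taken from this issue):

```python
import json

# Hypothetical reference set for one 2-D variable stored in "data.h5".
# Each chunk key ("temp/0.0", ...) points at (url, offset, length) of the
# already-stored chunk bytes inside the original file; kerchunk never
# copies or re-chunks those bytes, it only records where they live.
refs = {
    "temp/.zarray": json.dumps({
        "shape": [4, 4],
        "chunks": [2, 2],      # fixed by the HDF5 file's own chunking
        "dtype": "<f8",
        "compressor": None,
        "filters": None,
        "fill_value": None,
        "order": "C",
        "zarr_format": 2,
    }),
    "temp/0.0": ["data.h5", 2048, 128],
    "temp/0.1": ["data.h5", 2176, 128],
    "temp/1.0": ["data.h5", 2304, 128],
    "temp/1.1": ["data.h5", 2432, 128],
}

# Editing "chunks" in the metadata would not change the byte ranges, so the
# only way to get a different chunk grid is to rewrite the data itself.
meta = json.loads(refs["temp/.zarray"])
print(meta["chunks"])  # → [2, 2]
```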
You mean exactly one chunk for each input file? That is probably something we could do fairly easily, effectively making hdf (or kerchunk itself) the codec for loading each file.
Thanks for your detailed response @martindurant! I would be happy to contribute to the last proposal (chunksizes = datafile_shape), if you could give me some pointers on what to look at and how to get started?

@martindurant In xarray's .to_zarr(...), you can control the chunksizes. Would it not be a valuable contribution to have the same functionality in Kerchunk, or is it for some reason not feasible?
You would have a numcodecs implementation, something like:

```python
import io

import h5py
from numcodecs.abc import Codec


class HDF5file(Codec):
    codec_id = "hdf5file"

    def __init__(self, path: str):
        self.path = path

    def decode(self, buf, out=None):
        # read the named dataset out of a whole HDF5 file held in memory
        b = io.BytesIO(buf)
        h = h5py.File(b, "r")
        return h[self.path][...]

    def encode(self, buf):
        raise NotImplementedError("read-only codec")
```

and this would be the "codec" for the whole array. Here, "path" is the name of the array to read from each HDF5 file (but the whole of the file will be pulled into memory first).
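If kerchunk grew such an option, the metadata it generates for each variable could then declare exactly one chunk per input file, i.e. chunks equal to the full array shape, with the whole-file codec recorded in the .zarray. A sketch of what that might look like (sizes are illustrative, and the codec id "hdf5file" is an assumption matching the hypothetical codec above):

```python
import json

# One chunk per input file: "chunks" equals "shape", and the single chunk
# key would cover the whole HDF5 file's bytes, decoded by the hypothetical
# HDF5file codec (recorded here under its assumed codec_id).
zarray = {
    "shape": [1, 10, 20, 30],   # datafile_shape from the issue, illustrative sizes
    "chunks": [1, 10, 20, 30],  # chunksize == datafile_shape
    "dtype": "<f4",
    "compressor": {"id": "hdf5file", "path": "temp"},
    "filters": None,
    "fill_value": None,
    "order": "C",
    "zarr_format": 2,
}

print(json.dumps(zarray["chunks"]))  # → [1, 10, 20, 30]
```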
It is not possible, because kerchunk works using the chunks as they are stored in the original files. The whole point is that you don't need to rewrite/copy the data. If you do have the option to do so, you may as well use normal zarr output.
I would appreciate having more control over the way Kerchunk writes "refs", especially over the chunking.
Context:
I previously used fsspec and kerchunk to store my data while continuously expanding my dataset.
My data always has the same dimensionality, of course, and even the same coordinates, except for one dimension: "release".
When using class SingleHdf5ToZarr in Kerchunk, I have no control over the zarr group/store created. I think this is because this part of the init is hardcoded and not mutable through any method:
My data files all have the same coordinate sizes: datafile_shape = (1, n2, n3, n4)
When running SingleHdf5ToZarr(...).translate() on my old data, I get back data with some arbitrary chunksize (1, n_c2, n_c3, n_c4).
Now that I have updated some dependencies in my env, I get another arbitrary chunksize (1, n_c2', n_c3', n_c4').
Here I would ideally just have had chunksize = datafile_shape. But the fatal issue is that I can no longer combine new and old data with MultiZarrToZarr. When I try to combine my kerchunk metadata chunks, I get:
Problem in short: old and new reference sets record different chunk shapes for the same variables, so MultiZarrToZarr cannot combine them.
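Since the old and new reference sets record different chunk grids, the mismatch can be detected before calling MultiZarrToZarr. A small helper (hypothetical, not part of kerchunk) that compares the chunk shapes recorded in two reference mappings:

```python
import json

def chunk_mismatches(refs_a, refs_b):
    """Return variables whose chunk shapes differ between two kerchunk
    reference sets (assumed here to be the flat key->value mappings,
    i.e. the "refs" dict inside translate()'s output)."""
    out = []
    for key, val in refs_a.items():
        if key.endswith("/.zarray") and key in refs_b:
            ca = json.loads(val)["chunks"]
            cb = json.loads(refs_b[key])["chunks"]
            if ca != cb:
                out.append((key.rsplit("/", 1)[0], ca, cb))
    return out

# Illustrative old/new reference sets with differing chunking:
old = {"release/.zarray": json.dumps({"shape": [1, 8], "chunks": [1, 4]})}
new = {"release/.zarray": json.dumps({"shape": [1, 8], "chunks": [1, 8]})}
print(chunk_mismatches(old, new))  # → [('release', [1, 4], [1, 8])]
```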