
Use chunk_iter when adding chunk refs #68

Merged

rly merged 13 commits into main from chunk_iter on May 16, 2024

Conversation

@rly (Contributor) commented May 14, 2024

Fix #67.

I added lindi.LindiH5ZarrStore._util._get_all_chunk_info.

LindiH5ZarrStore._get_chunk_file_bytes_data was being called in two places:

  1. in to_reference_file_system.process_dataset, in a loop over all chunks, to get the chunk info for each chunk and add it to the refs dict
  2. by _get_chunk_file_bytes, which is called by __getitem__ for a single key

Because the new chunk_iter method returns the chunk info for all chunks at once and takes about 1-5 seconds, it is best to use it for use case 1 above. The old get_chunk_info method takes about 0.2 seconds per call, so it is best for use case 2 above, when a user wants to access a single key.
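For reference, a minimal sketch of the two access patterns through h5py's low-level DatasetID API (the file and dataset names here are hypothetical; chunk_iter requires h5py >= 3.8 and HDF5 >= 1.12.3):

    import h5py

    f = h5py.File("example.h5", "r")  # hypothetical file
    dsid = f["data"].id  # low-level DatasetID

    # Use case 1: one pass over all chunks. chunk_iter invokes the callback
    # once per allocated chunk with a StoreInfo object exposing
    # chunk_offset, filter_mask, byte_offset, and size.
    chunk_infos = []
    dsid.chunk_iter(chunk_infos.append)

    # Use case 2: look up a single chunk by its linear index.
    info = dsid.get_chunk_info(0)
    print(info.byte_offset, info.size)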

TODO: if the user wants to access more than ~10 keys for a given dataset (e.g., they are requesting a large slice), it would probably be best to use the chunk_iter method and cache the chunk info in LindiH5ZarrStore. Alternatively, we could always run chunk_iter and cache the chunk info.

I tried putting a bunch of if/else branches in LindiH5ZarrStore._get_chunk_file_bytes_data to handle the two use cases above, but it got messy and complex. So I created a new function, _add_chunk_info_to_refs, for use case 1; it shares some code with _get_chunk_file_bytes_data and needs to be refactored.

This also adds tqdm as a dependency for showing a progress bar when iterating through many chunks. I think that is useful overall, but we could make it an optional dependency if we decide it is not important enough.
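Because chunk_iter is callback-based rather than a Python iterator, the progress bar is just a manual update inside the callback. A minimal sketch, assuming some estimate num_chunks of the total is available:

    from tqdm import tqdm

    # dsid: the low-level DatasetID from the sketch above;
    # num_chunks: an estimate of the total chunk count
    pbar = tqdm(total=num_chunks, desc="Processing chunks")

    def collect(info):
        # ... add info.byte_offset and info.size to the refs dict ...
        pbar.update(1)

    dsid.chunk_iter(collect)
    pbar.close()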

@magland before I continue, can you take a look at this and let me know if I am understanding the code correctly and if the approach looks good?

TODO:

  • refactor redundant code
  • remove unnecessary chunk-coordinate calculation code
  • add tests, especially with HDF5 < 1.12.3 and >= 1.12.3

@rly rly requested a review from magland May 14, 2024 02:44
@codecov-commenter commented May 14, 2024

Codecov Report

Attention: Patch coverage is 82.75862%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 79.53%. Comparing base (1fc62ef) to head (2897504).

File                                         Patch %   Missing lines
lindi/LindiH5ZarrStore/_util.py              63.15%    7 ⚠️
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py   92.30%    3 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #68      +/-   ##
==========================================
- Coverage   81.35%   79.53%   -1.83%     
==========================================
  Files          30       30              
  Lines        2194     2238      +44     
==========================================
- Hits         1785     1780       -5     
- Misses        409      458      +49     


@magland (Collaborator) commented May 14, 2024

Pushed some small changes to keep the linter happy (pyright / flake8). Passes tests now.

... continuing my review...

@magland (Collaborator) commented May 14, 2024

@rly This all looks great

@rly rly marked this pull request as ready for review May 14, 2024 13:39
@rly rly marked this pull request as draft May 14, 2024 13:40
@rly (Contributor, Author) commented May 14, 2024

@magland great! thanks for the look over

In

    (shape[i] + chunks[i] - 1) // chunks[i] if chunks[i] != 0 else 0

when would chunks[i] ever be 0?

When I try to create an h5py dataset with a chunk dimension length of 0, I get an error:

    with h5py.File("test.h5", "w") as f:
        f.create_dataset("data", shape=(2, 2, 2), dtype="i4", chunks=(1, 1, 0))
    # ValueError: All chunk dimensions must be positive (all chunk dimensions must be positive)

@rly (Contributor, Author) commented May 14, 2024

Maybe related: the line above says chunks = zarray_dict.get("chunks", None), but the lines below assume that chunks is not None. Is chunks always set in the zarray dict? I think Zarr always sets chunks:

    >>> root = zarr.open("test.zarr", mode="w")
    >>> data = root.create_dataset("data", shape=(0, 0), chunks=None)
    >>> data.chunks
    (1, 1)

@rly (Contributor, Author) commented May 14, 2024

Though in Zarr, you can explicitly set the chunk dimension lengths to 0:

    >>> root = zarr.open("test.zarr", mode="w")
    >>> data = root.create_dataset("data", shape=(0, 0), chunks=(0, 0))
    >>> data.chunks
    (0, 0)

@magland (Collaborator) commented May 14, 2024

> when would chunks[i] ever be 0?

I have the following comment elsewhere in the code (it should probably be pasted above this line as well):

"the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]"

I'm not sure how the shape got to be that... but you can check that dandiset.

@rly (Contributor, Author) commented May 14, 2024

"the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]"
This dataset has shape = (0,0,0) and the chunks = None.

I just removed that chunk of code and wanted to make sure I wasn't missing an unusual edge case. I think the modification would work for this dataset, but I'll add a test.
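A hypothetical test file for that edge case (the file name is made up; the expected behavior is stated as a comment, not taken from the actual test suite):

    import h5py

    # Mirror the dandiset 000559 case: zero-length shape, contiguous layout.
    with h5py.File("test_empty.h5", "w") as f:
        f.create_dataset("data", shape=(0, 0, 0), dtype="i4")  # chunks=None

    # Converting this file with to_reference_file_system should succeed
    # and produce no chunk refs for "data".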

@rly (Contributor, Author) commented May 14, 2024

I just updated lindi/LindiH5ZarrStore/LindiH5ZarrStore.py so that when writing all chunks in to_reference_file_system, we no longer pre-compute the names or indices of the chunks. This means _add_chunk_info_to_refs no longer needs a list of key names; instead we generate them based on the metadata of each chunk. This removes the need for a lot of consistency checks in _add_chunk_info_to_refs. The order is the same as before.
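A sketch of how a chunk key can be derived on the fly from each chunk's metadata (assuming the StoreInfo-style objects that chunk_iter yields; the dot-separated index format matches Zarr's default chunk key convention):

    def chunk_key(chunk_offset, chunk_shape):
        # Map an element offset like (0, 1024, 0) with chunk shape
        # (1, 512, 512) to the Zarr chunk key "0.2.0".
        return ".".join(str(o // c) for o, c in zip(chunk_offset, chunk_shape))

    # For each StoreInfo object `info` from dsid.chunk_iter:
    #     key = f"{dataset_path}/{chunk_key(info.chunk_offset, dset.chunks)}"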

I think a progress bar is useful. Annoyingly, dsid.get_num_chunks() is quite slow, and this is very noticeable when the dataset is remote, so I added _get_max_num_chunks to replace it for the progress bar total. This count includes unallocated chunks. NWB files rarely have unallocated chunks, but in general, the count provides, at worst, an overestimate of the remaining progress.
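The maximum chunk count requires no HDF5 calls at all; it follows from the shape and chunk shape by the same ceiling division quoted earlier in this thread. A sketch of the idea (not necessarily the exact code of _get_max_num_chunks):

    import math

    def max_num_chunks(shape, chunks):
        # Upper bound on the chunk count, including unallocated chunks.
        return math.prod(
            (s + c - 1) // c if c != 0 else 0
            for s, c in zip(shape, chunks)
        )

    max_num_chunks((100, 100), (30, 30))  # 4 chunks per axis -> 16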

@magland (Collaborator) commented May 14, 2024

> I just updated lindi/LindiH5ZarrStore/LindiH5ZarrStore.py so that when writing all chunks in to_reference_file_system, we no longer pre-compute the names or indices of the chunks. This means _add_chunk_info_to_refs no longer needs a list of key names; instead we generate them based on the metadata of each chunk. This removes the need for a lot of consistency checks in _add_chunk_info_to_refs. The order is the same as before.
>
> I think a progress bar is useful. Annoyingly, dsid.get_num_chunks() is quite slow, and this is very noticeable when the dataset is remote, so I added _get_max_num_chunks to replace it for the progress bar total. This count includes unallocated chunks. NWB files rarely have unallocated chunks, but in general, the count provides, at worst, an overestimate of the remaining progress.

This all sounds good to me.

@magland (Collaborator) commented May 16, 2024

@rly is this ready to be merged yet? Is there something I can help with?

@rly rly marked this pull request as ready for review May 16, 2024 21:55
@rly rly changed the title [WIP] Use chunk_iter when adding chunk refs Use chunk_iter when adding chunk refs May 16, 2024
@rly (Contributor, Author) commented May 16, 2024

I wanted to add some tests, but let's go ahead and merge this so that we can use it in the benchmarks for the proposal. We can add tests in a separate PR.

@rly rly merged commit 929952c into main May 16, 2024
6 checks passed
@rly rly deleted the chunk_iter branch May 16, 2024 21:57
@magland (Collaborator) commented May 16, 2024

Okay, I just published 0.3.5 from the main branch.

Successfully merging this pull request may close these issues:

  • Use chunk_iter to avoid slow performance for files with many chunks (#67)