
Use chunk_iter when adding chunk refs #68

Merged

rly merged 13 commits into main from chunk_iter on May 16, 2024

Conversation

@rly (Contributor) commented May 14, 2024

Fix #67.

I added lindi.LindiH5ZarrStore._util._get_all_chunk_info.

LindiH5ZarrStore._get_chunk_file_bytes_data was being called in two places:

  1. in to_reference_file_system.process_dataset, in a loop over all chunks, to get the chunk info for each chunk and add it to the refs dict
  2. by _get_chunk_file_bytes, which is called by __getitem__ for a single key

Because the new chunk_iter method returns the chunk info for all chunks at once and takes about 1-5 seconds, it is best to use it for use case 1 above. The old get_chunk_info method takes about 0.2 seconds per call, so it is best for use case 2 above, when a user wants to access a single key.
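For reference, a minimal sketch of the two access patterns through h5py's low-level DatasetID API (the file and dataset names here are hypothetical; chunk_iter requires h5py >= 3.8 and HDF5 >= 1.12.3):

    import h5py

    f = h5py.File("example.h5", "r")  # hypothetical file
    dsid = f["data"].id  # low-level DatasetID

    # Use case 1: one pass over all chunks. chunk_iter invokes the callback
    # once per allocated chunk with a StoreInfo object exposing
    # chunk_offset, filter_mask, byte_offset, and size.
    chunk_infos = []
    dsid.chunk_iter(chunk_infos.append)

    # Use case 2: look up a single chunk by its linear index.
    info = dsid.get_chunk_info(0)
    print(info.byte_offset, info.size)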

TODO: if the user wants to access more than ~10 keys for a given dataset (e.g., they are requesting a large slice), it would probably be best to use the chunk_iter method and cache the chunk info in LindiH5ZarrStore. Alternatively, we could always run chunk_iter and cache the chunk info.

I tried putting a bunch of if/else branches in LindiH5ZarrStore._get_chunk_file_bytes_data to handle the two use cases above, but it got messy and complex. So I created a new function, _add_chunk_info_to_refs, for use case 1; it shares some code with _get_chunk_file_bytes_data and needs to be refactored.

This also adds tqdm as a dependency for showing a progress bar when iterating through many chunks. I think that is useful overall, but we could make it an optional dependency if we decide it is not important enough.
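Because chunk_iter is callback-based rather than a Python iterator, the progress bar is just a manual update inside the callback. A minimal sketch, assuming some estimate num_chunks of the total is available:

    from tqdm import tqdm

    # dsid: the low-level DatasetID from the sketch above;
    # num_chunks: an estimate of the total chunk count
    pbar = tqdm(total=num_chunks, desc="Processing chunks")

    def collect(info):
        # ... add info.byte_offset and info.size to the refs dict ...
        pbar.update(1)

    dsid.chunk_iter(collect)
    pbar.close()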

@magland before I continue, can you take a look at this and let me know if I am understanding the code correctly and if the approach looks good?

TODO:

  • refactor redundant code
  • remove unnecessary chunk-coordinate calculation code
  • add tests, especially with HDF5 < 1.12.3 and >= 1.12.3

@rly rly requested a review from magland May 14, 2024 02:44
@codecov-commenter commented May 14, 2024

Codecov Report

Attention: Patch coverage is 82.75862%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 79.53%. Comparing base (1fc62ef) to head (2897504).

File                                         Patch %   Missing lines
lindi/LindiH5ZarrStore/_util.py              63.15%    7 ⚠️
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py   92.30%    3 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #68      +/-   ##
==========================================
- Coverage   81.35%   79.53%   -1.83%     
==========================================
  Files          30       30              
  Lines        2194     2238      +44     
==========================================
- Hits         1785     1780       -5     
- Misses        409      458      +49     


@magland (Collaborator) commented May 14, 2024

Pushed some small changes to keep the linter happy (pyright / flake8). Passes tests now.

... continuing my review...

@magland (Collaborator) commented May 14, 2024

@rly This all looks great

@rly rly marked this pull request as ready for review May 14, 2024 13:39
@rly rly marked this pull request as draft May 14, 2024 13:40
@rly (Contributor, Author) commented May 14, 2024

@magland great! thanks for the look over

In

    (shape[i] + chunks[i] - 1) // chunks[i] if chunks[i] != 0 else 0

when would chunks[i] ever be 0?

When I try to create an h5py dataset with a chunk dimension length of 0, I get an error:

    with h5py.File("test.h5", "w") as f:
        f.create_dataset("data", shape=(2, 2, 2), dtype="i4", chunks=(1, 1, 0))
    # ValueError: All chunk dimensions must be positive (all chunk dimensions must be positive)

@rly (Contributor, Author) commented May 14, 2024

Maybe related: the line above says chunks = zarray_dict.get("chunks", None), but the lines below assume that chunks is not None. Is chunks always set in the zarray dict? I think Zarr always sets chunks:

    >>> root = zarr.open("test.zarr", mode="w")
    >>> data = root.create_dataset("data", shape=(0, 0), chunks=None)
    >>> data.chunks
    (1, 1)

@rly (Contributor, Author) commented May 14, 2024

Though in Zarr, you can explicitly set the chunk dimension lengths to 0:

    >>> root = zarr.open("test.zarr", mode="w")
    >>> data = root.create_dataset("data", shape=(0, 0), chunks=(0, 0))
    >>> data.chunks
    (0, 0)

@magland (Collaborator) commented May 14, 2024

> when would chunks[i] ever be 0?

I have the following comment elsewhere in the code (it should probably be pasted above this line as well):

"the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]"

I'm not sure how the shape got to be that... but you can check that dandiset.

@rly (Contributor, Author) commented May 14, 2024

"the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]"
This dataset has shape = (0,0,0) and the chunks = None.

I just removed that chunk of code and wanted to make sure I wasn't missing an unusual edge case. I think the modification would work for this dataset, but I'll add a test.
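A hypothetical test file for that edge case (the file name is made up; the expected behavior is stated as a comment, not taken from the actual test suite):

    import h5py

    # Mirror the dandiset 000559 case: zero-length shape, contiguous layout.
    with h5py.File("test_empty.h5", "w") as f:
        f.create_dataset("data", shape=(0, 0, 0), dtype="i4")  # chunks=None

    # Converting this file with to_reference_file_system should succeed
    # and produce no chunk refs for "data".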

@rly (Contributor, Author) commented May 14, 2024

I just updated lindi/LindiH5ZarrStore/LindiH5ZarrStore.py so that when writing all chunks in to_reference_file_system, we no longer pre-compute the names or indices of the chunks. This means _add_chunk_info_to_refs no longer needs a list of key names; instead we generate them based on the metadata of each chunk. This removes the need for a lot of consistency checks in _add_chunk_info_to_refs. The order is the same as before.
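A sketch of how a chunk key can be derived on the fly from each chunk's metadata (assuming the StoreInfo-style objects that chunk_iter yields; the dot-separated index format matches Zarr's default chunk key convention):

    def chunk_key(chunk_offset, chunk_shape):
        # Map an element offset like (0, 1024, 0) with chunk shape
        # (1, 512, 512) to the Zarr chunk key "0.2.0".
        return ".".join(str(o // c) for o, c in zip(chunk_offset, chunk_shape))

    # For each StoreInfo object `info` from dsid.chunk_iter:
    #     key = f"{dataset_path}/{chunk_key(info.chunk_offset, dset.chunks)}"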

I think a progress bar is useful. Annoyingly, dsid.get_num_chunks() is quite slow, and this is very noticeable when the dataset is remote, so I added _get_max_num_chunks to replace it for the progress bar total. This count includes unallocated chunks. NWB files rarely have unallocated chunks, but in general, the count provides, at worst, an overestimate of the remaining progress.
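The maximum chunk count requires no HDF5 calls at all; it follows from the shape and chunk shape by the same ceiling division quoted earlier in this thread. A sketch of the idea (not necessarily the exact code of _get_max_num_chunks):

    import math

    def max_num_chunks(shape, chunks):
        # Upper bound on the chunk count, including unallocated chunks.
        return math.prod(
            (s + c - 1) // c if c != 0 else 0
            for s, c in zip(shape, chunks)
        )

    max_num_chunks((100, 100), (30, 30))  # 4 chunks per axis -> 16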

@magland (Collaborator) commented May 14, 2024

> I just updated lindi/LindiH5ZarrStore/LindiH5ZarrStore.py so that when writing all chunks in to_reference_file_system, we no longer pre-compute the names or indices of the chunks. This means _add_chunk_info_to_refs no longer needs a list of key names; instead we generate them based on the metadata of each chunk. This removes the need for a lot of consistency checks in _add_chunk_info_to_refs. The order is the same as before.
>
> I think a progress bar is useful. Annoyingly, dsid.get_num_chunks() is quite slow, and this is very noticeable when the dataset is remote, so I added _get_max_num_chunks to replace it for the progress bar total. This count includes unallocated chunks. NWB files rarely have unallocated chunks, but in general, the count provides, at worst, an overestimate of the remaining progress.

This all sounds good to me.

@magland (Collaborator) commented May 16, 2024

@rly is this ready to be merged yet? Is there something I can help with?

@rly rly marked this pull request as ready for review May 16, 2024 21:55
@rly rly changed the title [WIP] Use chunk_iter when adding chunk refs Use chunk_iter when adding chunk refs May 16, 2024
@rly (Contributor, Author) commented May 16, 2024

I wanted to add some tests, but let's go ahead and merge this so that we can use it in the benchmarks for the proposal. We can add tests in a separate PR.

@rly rly merged commit 929952c into main May 16, 2024
6 checks passed
@rly rly deleted the chunk_iter branch May 16, 2024 21:57
@magland (Collaborator) commented May 16, 2024

Okay, I just published 0.3.5 from the main branch.

Successfully merging this pull request may close these issues:

  • Use chunk_iter to avoid slow performance for files with many chunks (#67)