
Parallel zarr with Queue instance attributes #118

Merged
merged 33 commits into from
Oct 1, 2023

Conversation

@CodyCBakerPhD (Contributor) commented Aug 31, 2023

fix #101

replace #111

Motivation

Zarr supports efficient parallelization, but enabling it seamlessly with only a single argument (number_of_jobs at io.write) took a bit of effort.

Currently seeing progressive speedups with the attached dummy script as the number of jobs increases; on the DANDI Hub, roughly 160 s with 1 CPU.

(Screenshots attached: benchmark timings for increasing numbers of jobs, 2023-07-30.)

Will make a full averaged plot over the number of jobs to use for reference.

Opening in draft while I assess what is still necessary and what can still be optimized in terms of worker/job initialization.

I will also have to think about how to add tests; I suppose adding some that use 2 jobs and making sure the write succeeds should be enough.

How to test the behavior?

import numpy as np
from pathlib import Path

from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries
from neuroconv.tools.hdmf import SliceableDataChunkIterator

number_of_jobs = 1  # increase according to screenshot

dat_file_path = "/home/jovyan/performance_tests/example_data.dat"
n_frames = 30000 * 60 * 2
n_channels = 384
data_shape = (n_frames, n_channels)
dtype = "int16"
memory_map = np.memmap(filename=dat_file_path, dtype=dtype, mode="r", shape=data_shape)  # ~2.75 GB of data; mode="r" requires the file to already exist

nwbfile = mock_NWBFile()
time_series = mock_TimeSeries(data=SliceableDataChunkIterator(data=memory_map, buffer_gb=0.1 / number_of_jobs))
nwbfile.add_acquisition(time_series)

zarr_top_level_path = f"/home/jovyan/Downloads/example_parallel_zarr_{number_of_jobs}.nwb"
with NWBZarrIO(path=zarr_top_level_path, mode="w") as io:
    io.write(nwbfile, number_of_jobs=number_of_jobs)
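Note that the script above memory-maps dat_file_path in read-only mode, so the file must exist beforehand. A scaled-down helper for generating such a dummy file might look like the following sketch (the function name and small default shape are illustrative; the benchmark itself uses a (30000 * 60 * 2, 384) shape, about 2.75 GB):

```python
import numpy as np

# Illustrative helper (not part of the PR): writes a dummy int16 .dat file that
# the benchmark script above can memory-map. Defaults are scaled down from the
# benchmark's (30000 * 60 * 2, 384) shape to keep the file small.
def make_dummy_dat(path, n_frames=1_000, n_channels=384, dtype="int16"):
    memmap = np.memmap(path, dtype=dtype, mode="w+", shape=(n_frames, n_channels))
    memmap[:] = 0  # contents are irrelevant for a write-throughput test
    memmap.flush()
    return path
```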

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@CodyCBakerPhD CodyCBakerPhD mentioned this pull request Aug 31, 2023
@oruebel (Contributor) commented Aug 31, 2023

ReadTheDocs shows the following error due to the added threadpoolctl requirement.

ERROR: Could not find a version that satisfies the requirement threadpoolctl==3.2.0 (from versions: 1.0.0, 1.1.0, 2.0.0, 2.1.0, 2.2.0, 3.0.0, 3.1.0)
ERROR: No matching distribution found for threadpoolctl==3.2.0

Maybe the readthedocs.yaml or the requirements.txt file needs some adjustment.

@CodyCBakerPhD (Contributor, Author):

Maybe the readthedocs.yaml or the requirements.txt file need some adjustment.

I do believe this is an issue with the version of Python being used to build the docs - do you know what that is?

You can also specify it explicitly in the config file like here: https://github.com/catalystneuro/neuroconv/blob/main/.readthedocs.yaml#L10-L11
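For reference, pinning the docs build Python in .readthedocs.yaml would look roughly like this (the Python version and requirements filename shown here are illustrative assumptions, not the repository's actual settings):

```yaml
# Illustrative .readthedocs.yaml fragment; version numbers and the
# requirements filename are assumptions for the sake of example.
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.11"

python:
  install:
    - requirements: requirements-doc.txt
```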

An alternative, I suppose, would be to lower the exact version pin for the CI, but I defer to however y'all prefer to have all that set up.

@rly (Contributor) commented Aug 31, 2023

Some tests define a new ZarrIO and call write_dataset directly. Because self.__dci_queue is initialized only in write and now export, these tests fail because self.__dci_queue is None. If these tests are meant to exercise write_dataset in a unit-test fashion, they need to be adjusted so that write is called first or the __dci_queue variable is otherwise set. If they are meant to be integration tests, they need to be adjusted to call write instead of write_dataset, which users would not call directly.

@CodyCBakerPhD (Contributor, Author):

Ahh good catch:

if exhaust_dci:
    self.__dci_queue.exhaust_queue()

Since the method is not marked as private, I'll just instantiate a standard non-parallel Queue at that point then.

@oruebel (Contributor) commented Aug 31, 2023

If these tests are meant to test write_dataset in a unit test fashion,

Those should be unit tests, since write_dataset is not a function that a user should ever call, but is used internally. The tests should be adjusted to manually set the __dci_queue variable before calling write_dataset. Alternatively, we could add an if __dci_queue is None check at the beginning of write_dataset to set it if it is not initialized (which may be safer).
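The lazy-initialization idea can be sketched as follows; class and method names mirror hdmf_zarr's ZarrIO and write_dataset, but the bodies are simplified stand-ins, not the real implementation:

```python
# Hypothetical sketch of the lazy-initialization approach; not the actual
# hdmf_zarr code.
class NonParallelQueue:
    """Stand-in for the standard (non-parallel) data chunk iterator queue."""
    def __init__(self):
        self.items = []

    def exhaust_queue(self):
        self.items.clear()


class ZarrIOSketch:
    def __init__(self):
        self._dci_queue = None  # normally initialized inside write()/export()

    def write_dataset(self, exhaust_dci=True):
        # If write() was never called (e.g. a unit test invokes write_dataset
        # directly), fall back to a non-parallel queue instead of failing on None.
        if self._dci_queue is None:
            self._dci_queue = NonParallelQueue()
        if exhaust_dci:
            self._dci_queue.exhaust_queue()
        return self._dci_queue
```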

@oruebel (Contributor) commented Aug 31, 2023

Since the method is not marked as private I'll just instantiate a standard non-parallel Queue at that point then

Sorry, I didn't see your comment until after I posted my other response. I agree, "instantiate a standard non-parallel Queue at that point then" is the way to go.

@CodyCBakerPhD CodyCBakerPhD self-assigned this Sep 1, 2023
@CodyCBakerPhD (Contributor, Author):

Surprised that 3.7 is still supported here - is there a timeline for when that will be dropped?

@CodyCBakerPhD (Contributor, Author):

Otherwise, the currently failing CI is, I believe, due to the version pin of hdmf==3.5.4; this feature requires changes that are only available in hdmf>=3.9.0.

@oruebel (Contributor) commented Sep 1, 2023

Surprised that 3.7 is still supported here - is there a timeline for when that will be dropped?

This is due to the pin on the HDMF version. Once we have the issue with references on export resolved, we'll update the HDMF version and then we can also update the tests. @mavaylon1 is working on the issue.

@oruebel (Contributor) commented Sep 1, 2023

Otherwise, the currently failing CI is, I believe, due to the version pin of hdmf==3.5.4; this feature requires changes that are only available in hdmf>=3.9.0.

To get the tests to run for now, you could just increase the hdmf version on this PR so we can see that the CI is running. There will be a couple of tests that fail for export, but at least we can then see that everything is working in the CI. Aside from the bug on export, I believe you can safely use the current version of HDMF without having to change anything else in the code.

@oruebel (Contributor) commented Sep 29, 2023

@CodyCBakerPhD with #120 now merged, the dev branch supports the latest HDMF. Could you sync this PR with the dev branch to see if that fixes the failing tests, so that we can move forward with merging this PR as well?

codecov-commenter commented Sep 29, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (c262481) 84.73% compared to head (9d84044) 85.66%.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #118      +/-   ##
==========================================
+ Coverage   84.73%   85.66%   +0.92%     
==========================================
  Files          12       13       +1     
  Lines        2903     3139     +236     
==========================================
+ Hits         2460     2689     +229     
- Misses        443      450       +7     
Files Coverage Δ
src/hdmf_zarr/backend.py 90.41% <100.00%> (+0.07%) ⬆️
tests/unit/test_parallel_write.py 98.57% <98.57%> (ø)
src/hdmf_zarr/utils.py 95.79% <95.32%> (-0.96%) ⬇️


@CodyCBakerPhD (Contributor, Author):

@oruebel Done; not sure what's up with the coverage workflows, though.

@oruebel (Contributor) left a review:

Looks good to me

@oruebel oruebel merged commit 9f6c386 into hdmf-dev:dev Oct 1, 2023
22 checks passed
Successfully merging this pull request may close these issues.

[Feature]: Parallel Write Support for HDMF-Zarr
4 participants