
Parallel zarr with Queue instance attributes #118

Merged
merged 33 commits into from
Oct 1, 2023

Conversation

@CodyCBakerPhD (Contributor) commented Aug 31, 2023

fix #101

replace #111

Motivation

Zarr supports efficient parallelization, but enabling it seamlessly with only a single argument (number_of_jobs at io.write) took a bit of effort.

Currently seeing progressive speedups with the attached dummy script as the number of jobs increases; on the DANDI Hub, roughly 160 s with 1 CPU.

(Screenshots attached: benchmark timings for increasing numbers of jobs, 2023-07-30.)

Will make a full averaged plot over the number of jobs to use for reference.

Opening in draft while I assess what is still necessary and what can still be optimized in terms of worker/job initialization.

I will also have to think about how to add tests; I suppose adding some that use 2 jobs and making sure the write succeeds should be enough.

How to test the behavior?

import numpy as np
from pathlib import Path

from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries
from neuroconv.tools.hdmf import SliceableDataChunkIterator

number_of_jobs = 1  # increase according to screenshot

dat_file_path = "/home/jovyan/performance_tests/example_data.dat"
n_frames = 30000 * 60 * 2
n_channels = 384
data_shape = (n_frames, n_channels)
dtype = "int16"
memory_map = np.memmap(filename=dat_file_path, dtype=dtype, mode="r", shape=data_shape)  # ~2.75 GB of data; mode="r" requires the file to already exist

nwbfile = mock_NWBFile()
time_series = mock_TimeSeries(data=SliceableDataChunkIterator(data=memory_map, buffer_gb=0.1 / number_of_jobs))
nwbfile.add_acquisition(time_series)

zarr_top_level_path = f"/home/jovyan/Downloads/example_parallel_zarr_{number_of_jobs}.nwb"
with NWBZarrIO(path=zarr_top_level_path, mode="w") as io:
    io.write(nwbfile, number_of_jobs=number_of_jobs)
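Note that the script above memory-maps dat_file_path in read-only mode, so the file must exist beforehand. A scaled-down helper for generating such a dummy file might look like the following sketch (the function name and small default shape are illustrative; the benchmark itself uses a (30000 * 60 * 2, 384) shape, about 2.75 GB):

```python
import numpy as np

# Illustrative helper (not part of the PR): writes a dummy int16 .dat file that
# the benchmark script above can memory-map. Defaults are scaled down from the
# benchmark's (30000 * 60 * 2, 384) shape to keep the file small.
def make_dummy_dat(path, n_frames=1_000, n_channels=384, dtype="int16"):
    memmap = np.memmap(path, dtype=dtype, mode="w+", shape=(n_frames, n_channels))
    memmap[:] = 0  # contents are irrelevant for a write-throughput test
    memmap.flush()
    return path
```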

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@CodyCBakerPhD CodyCBakerPhD mentioned this pull request Aug 31, 2023
@oruebel (Contributor) commented Aug 31, 2023

ReadTheDocs shows the following error due to the added threadpoolctl requirement.

ERROR: Could not find a version that satisfies the requirement threadpoolctl==3.2.0 (from versions: 1.0.0, 1.1.0, 2.0.0, 2.1.0, 2.2.0, 3.0.0, 3.1.0)
ERROR: No matching distribution found for threadpoolctl==3.2.0

Maybe the readthedocs.yaml or the requirements.txt file needs some adjustment.

@CodyCBakerPhD (Contributor, Author):

Maybe the readthedocs.yaml or the requirements.txt file need some adjustment.

I do believe this is an issue with the version of Python being used to build the docs - do you know what that is?

You can also specify it explicitly in the config file like here: https://github.com/catalystneuro/neuroconv/blob/main/.readthedocs.yaml#L10-L11
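For reference, pinning the docs build Python in .readthedocs.yaml would look roughly like this (the Python version and requirements filename shown here are illustrative assumptions, not the repository's actual settings):

```yaml
# Illustrative .readthedocs.yaml fragment; version numbers and the
# requirements filename are assumptions for the sake of example.
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.11"

python:
  install:
    - requirements: requirements-doc.txt
```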

An alternative, I suppose, would be to lower the exact version pin for the CI, but I defer to however y'all prefer to have all that set up.

@rly (Contributor) commented Aug 31, 2023

Some tests define a new ZarrIO and call write_dataset directly. Because self.__dci_queue is initialized only in write and now export, these tests fail because self.__dci_queue is None. If these tests are meant to exercise write_dataset in a unit-test fashion, they need to be adjusted so that write is called first or the __dci_queue variable is otherwise set. If they are meant to be integration tests, they need to be adjusted to call write instead of write_dataset, which users would not call directly.

@CodyCBakerPhD (Contributor, Author):

Ahh good catch:

if exhaust_dci:
    self.__dci_queue.exhaust_queue()

Since the method is not marked as private, I'll just instantiate a standard non-parallel Queue at that point then.

@oruebel (Contributor) commented Aug 31, 2023

If these tests are meant to test write_dataset in a unit test fashion,

Those should be unit tests, since write_dataset is not a function that a user should ever call, but is used internally. The tests should be adjusted to manually set the __dci_queue variable before calling write_dataset. Alternatively, we could add an if __dci_queue is None check at the beginning of write_dataset to set it if it is not initialized (which may be safer).
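The lazy-initialization idea can be sketched as follows; class and method names mirror hdmf_zarr's ZarrIO and write_dataset, but the bodies are simplified stand-ins, not the real implementation:

```python
# Hypothetical sketch of the lazy-initialization approach; not the actual
# hdmf_zarr code.
class NonParallelQueue:
    """Stand-in for the standard (non-parallel) data chunk iterator queue."""
    def __init__(self):
        self.items = []

    def exhaust_queue(self):
        self.items.clear()


class ZarrIOSketch:
    def __init__(self):
        self._dci_queue = None  # normally initialized inside write()/export()

    def write_dataset(self, exhaust_dci=True):
        # If write() was never called (e.g. a unit test invokes write_dataset
        # directly), fall back to a non-parallel queue instead of failing on None.
        if self._dci_queue is None:
            self._dci_queue = NonParallelQueue()
        if exhaust_dci:
            self._dci_queue.exhaust_queue()
        return self._dci_queue
```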

@oruebel (Contributor) commented Aug 31, 2023

Since the method is not marked as private I'll just instantiate a standard non-parallel Queue at that point then

Sorry, I didn't see your comment until after I posted my other response. I agree, "instantiate a standard non-parallel Queue at that point then" is the way to go.

@CodyCBakerPhD CodyCBakerPhD self-assigned this Sep 1, 2023
@CodyCBakerPhD (Contributor, Author):

Surprised that 3.7 is still supported here - is there a timeline for when that will be dropped?

@CodyCBakerPhD (Contributor, Author):

Otherwise, the currently failing CI is, I believe, due to the version pin of hdmf==3.5.4; this feature requires changes that are only available in hdmf>=3.9.0.

@oruebel (Contributor) commented Sep 1, 2023

Surprised that 3.7 is still supported here - is there a timeline for when that will be dropped?

This is due to the pin on the HDMF version. Once we have the issue with references on export resolved, we'll update the HDMF version and then we can also update the tests. @mavaylon1 is working on the issue.

@oruebel (Contributor) commented Sep 1, 2023

Otherwise, the currently failing CI is, I believe, due to the version pin of hdmf==3.5.4; this feature requires changes that are only available in hdmf>=3.9.0.

To get the tests to run for now, you could just increase the hdmf version on this PR so we can see that the CI is running. There will be a couple of tests that fail for export, but at least we can then see that everything is working in the CI. Aside from the bug on export, I believe you can safely use the current version of HDMF without having to change anything else in the code.

@oruebel (Contributor) commented Sep 29, 2023

@CodyCBakerPhD with #120 now merged, the dev branch supports the latest HDMF. Could you sync this PR with the dev branch to see if that fixes the failing tests, so that we can move forward with merging this PR as well?

codecov-commenter commented Sep 29, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (c262481) 84.73% compared to head (9d84044) 85.66%.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #118      +/-   ##
==========================================
+ Coverage   84.73%   85.66%   +0.92%     
==========================================
  Files          12       13       +1     
  Lines        2903     3139     +236     
==========================================
+ Hits         2460     2689     +229     
- Misses        443      450       +7     
Files Coverage Δ
src/hdmf_zarr/backend.py 90.41% <100.00%> (+0.07%) ⬆️
tests/unit/test_parallel_write.py 98.57% <98.57%> (ø)
src/hdmf_zarr/utils.py 95.79% <95.32%> (-0.96%) ⬇️


@CodyCBakerPhD (Contributor, Author):

@oruebel Done; not sure what's up with the coverage workflows, though.

@oruebel (Contributor) left a review:

Looks good to me

@oruebel oruebel merged commit 9f6c386 into hdmf-dev:dev Oct 1, 2023
22 checks passed
Successfully merging this pull request may close these issues.

[Feature]: Parallel Write Support for HDMF-Zarr
4 participants