Write references in compound datasets as an array #146

Merged (9 commits, Feb 2, 2024)

Conversation

@sneakers-the-rat (Contributor) commented Dec 8, 2023

Motivation

Fix #144

Writes references in compound datasets as a single same-sized array, rather than iteratively as an array of lists.
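
For illustration, a minimal sketch of the change in write pattern (not the hdmf-zarr code itself; field names, sizes, and chunking below are made up), assuming zarr v2's creation API:

```python
import numpy as np
import zarr

# Hypothetical compound dtype and sizes, for illustration only.
n = 100_000
dt = np.dtype([('idx_start', '<i4'), ('count', '<i4')])
rows = np.zeros(n, dtype=dt)

z = zarr.zeros(n, chunks=10_000, dtype=dt)

# Before: one tiny write per row; each assignment decodes and re-encodes a chunk.
# for i, row in enumerate(rows):
#     z[i] = tuple(row)

# After: build the same-sized structured array up front and write it in one call.
z[...] = rows
```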

How to test the behavior?

Convert a file with compound datasets and confirm that it doesn't take a really long time.

Making a more detailed example is blocked for me by #145; I'm not sure if I should actually work on this since it seems like the devs are doing their own version in #149 (???)

Checklist

very tired and so leaving this as a draft for now

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

Review thread on src/hdmf_zarr/backend.py (outdated, resolved)
@oruebel added the "category: enhancement" label (improvements of code or code behavior) Dec 8, 2023
@oruebel added this to the Next Release milestone Dec 8, 2023
@sneakers-the-rat (Contributor, Author) commented Dec 14, 2023

Implemented writing as a compound dset. I used the information from the options passed to the function rather than info from the builder class, to reduce coupling, but if y'all prefer it the other way it's a simple drop-in fix.

So just after the dset is written...

>>> (Pdb) dset[0]
(3798, 1, {'source': '.', 'path': '/processing/stimulus/timestamps', 'object_id': 'ff28f35d-9211-4bdb-b026-ad27d46367a3', 'source_object_id': 'a695c6dd-5c46-428d-a927-5c2027c4b653'})
>>> (Pdb) dset.dtype
dtype([('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])

This version also doesn't have the same problem as #149: since the dataset itself is given the dtype, the returned type is a numpy void type rather than a tuple:

>>> (Pdb) type(dset[0])
<class 'numpy.void'>
>>> (Pdb) dset[0][1] = 5
>>> (Pdb) # (it works)

whereas if the zarr dataset is created with dtype=object...

>>> (Pdb) type(dset[0])
<class 'tuple'>
>>> (Pdb) dset[0][1] = 5
*** TypeError: 'tuple' object does not support item assignment
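
A minimal numpy-only sketch of that difference (not the hdmf-zarr code; values are made up):

```python
import numpy as np

# Structured dtype: rows come back as numpy.void and support item assignment.
compound = np.zeros(3, dtype=[('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])
row = compound[0]
print(type(row))               # <class 'numpy.void'>
row['count'] = 5               # works, and writes through to the parent array

# Object dtype holding tuples: rows come back as plain tuples, which are immutable.
as_tuples = np.empty(3, dtype=object)
for i in range(3):
    as_tuples[i] = (3798, 1, None)
print(type(as_tuples[0]))      # <class 'tuple'>
try:
    as_tuples[0][1] = 5
except TypeError as err:
    print(err)                 # 'tuple' object does not support item assignment
```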

All tests pass. Not sure what tests I would need to add here for this specific behavior, but lmk.

codecov-commenter commented Dec 14, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (f5f09c8) 85.16% compared to head (b33301c) 85.59%.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/hdmf_zarr/backend.py | 87.50% | 1 missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #146      +/-   ##
==========================================
+ Coverage   85.16%   85.59%   +0.42%     
==========================================
  Files           5        5              
  Lines        1146     1159      +13     
  Branches      291      295       +4     
==========================================
+ Hits          976      992      +16     
+ Misses        114      110       -4     
- Partials       56       57       +1     


@mavaylon1 marked this pull request as ready for review December 20, 2023 18:26
@mavaylon1 (Contributor) commented:

@sneakers-the-rat I'll take a look at the code coverage tonight, but from a quick glance it is passing.

@mavaylon1 (Contributor) commented:

@sneakers-the-rat Looks good to me. I can update the changelog for you if you'd like. After that, I'll do one last review and approve it.

@sneakers-the-rat (Contributor, Author) commented:

@mavaylon1 why don't you handle the changelog, since there's probably a way y'all like to write it/organize it with whatever other PRs have been going on, etc. :)

lemme try to write a test, even if it's just as simple as testing that the write ends up in the right format (idk seems like a test for how long the operation took would be pretty flaky)

@mavaylon1 (Contributor) commented:

> @mavaylon1 why don't you handle the changelog, since there's probably a way y'all like to write it/organize it with whatever other PRs have been going on, etc. :)
>
> lemme try to write a test, even if it's just as simple as testing that the write ends up in the right format (idk seems like a test for how long the operation took would be pretty flaky)

Looking at the coverage, no further test is needed.

@sneakers-the-rat (Contributor, Author) commented Dec 21, 2023

Added a quick test that the array was written as a compound array; lmk if that's not the right way to test it. The values are hardcoded to match the test dataset, but that seemed like the best way to differentiate compound vs. tuple behavior given the way the rest of the test is written. It also tests both of the special cases for storage of objects and strings, with O and U25.

edit: omg, reading the rest of the tests I think I wrote this in the wrong place, one sec
edit2: (moved to comment below)

@mavaylon1 (Contributor) commented:

> Added a quick test that the array was written as a compound array [...]

So this should already be tested by existing code; that's why it was passing coverage before.

@sneakers-the-rat (Contributor, Author) commented:

Moving this out of the edit so it stays in sequence:
OK, I don't see a better place for this, but can revert if not wanted. Coverage should be the same since all the branches of the changed code are reached in the existing tests, but I don't see an explicit test for the dtype of the representation in the stored zarr array, so figured I would at least offer a test of the thing that I changed :)
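
Roughly the shape of such a check (a hypothetical sketch, not the test actually added in this PR; field names and the helper are assumed):

```python
import numpy as np

def check_compound_storage(dset):
    """Hypothetical check: the stored zarr array should carry a structured (compound)
    dtype, with 'O' for the reference field and a fixed-width unicode type for the
    string field, rather than an object dtype holding tuples."""
    # Note: the integer width below is platform-dependent if it was derived from
    # np.dtype('int') (see the Windows failure discussed later in this thread).
    expected = np.dtype([('id', '<i8'), ('name', '<U25'), ('reference', 'O')])
    assert dset.dtype == expected
    assert isinstance(dset[0], np.void)   # rows are numpy.void, not tuple
```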

@rly (Contributor) commented Dec 21, 2023

@sneakers-the-rat Thanks for the updates and test!

> Looking at the coverage, no further test is needed.

@mavaylon1 Code coverage measures whether every line of the codebase has been run by the test suite. It doesn't reflect whether the codebase does what it is supposed to do and whether the new code correctly fixes the issue. In general, when an issue is found, a test (i.e., an assert) should be added that reproduces the issue and fails in the current main branch. The new code in the PR should make that test go from failing to passing. So here, even though all of the new code was touched by the test suite, a test should still be added to make sure the issue has actually been fixed and also make sure that any future changes to the codebase will not cause the same issue to arise.

@sneakers-the-rat (Contributor, Author) commented:

In this case the test also revealed a failure on Windows, which is fun. Lemme debug.

@mavaylon1 (Contributor) commented Dec 25, 2023

> [quoted @rly's comment above in full]

@rly I'm aware. My point is that there should already be a test that checks compound data is written as compound data. If not, then I understand adding one. The changes here, in practice, shouldn't break any tests because the results should be the same. Checking whether compound data was written correctly should be covered by the tests that already check compound data (again, these should already exist). From my understanding, the issue isn't a bug but an optimization problem. The "fix" should already be covered by existing tests.

@sneakers-the-rat (Contributor, Author) commented Dec 25, 2023

From what I could tell, the checks that exist do pass, but they don't check explicitly for the storage format. E.g., the adjoining tests at https://github.com/sneakers-the-rat/hdmf-zarr/blob/63922aea2339f64b96803394217498b833afc709/tests/unit/base_tests_zarrio.py#L444 that do self.assertEqual(v[0], builder['data'][i][0]) would work for both compound dtypes and for object-type arrays that store a 1D array of tuples, since the indexing is the same in both cases.

obviously y'all have a better sense of the tests that exist than I do, so if there already is one then great! I don't really care about the inclusion of the one crappy lil test I added (or even really the storage format), my main goal here is to just help out making writing happen at a reasonable speed, so lmk what's good <3

edit: sorry, there is a failing test. I don't have a Windows machine to debug on, but I can just spam warnings/print statements in the GH actions until it works lol. If y'all have Windows machines and debug it before I can get to it, then u win the debugging race!
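
A small sketch of why those element-wise checks can't tell the two layouts apart (values are made up):

```python
import numpy as np

compound = np.array([(3798, 1, 'ref')],
                    dtype=[('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])
as_tuples = np.empty(1, dtype=object)
as_tuples[0] = (3798, 1, 'ref')

# Positional indexing returns the same values for both layouts, so an
# element-wise assertEqual passes either way; only the dtype differs.
assert compound[0][0] == as_tuples[0][0] == 3798
assert compound.dtype != as_tuples.dtype
```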

@mavaylon1 (Contributor) commented:

> [quoted the previous comment in full]

I'll look after the break. By all means, if there's a gap in the tests we should fill it. My point was based on the assumption that there are tests that already do what we want, because this isn't a format change, just an order of operations. But if this process exposed a gap, then let's fill it.

@mavaylon1 (Contributor) commented:

@sneakers-the-rat I updated the test to our usual use of TestCase "assertEqual". This gave more insight into the issue on Windows: AssertionError: dtype([('id', '<i4'), ('name', '<U25'), ('reference', 'O')]) != dtype([('id', '<i8'), ('name', '<U25'), ('reference', 'O')])

The id types are not matching.
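
For context, a minimal illustration of the platform difference behind that mismatch (assuming NumPy 1.x, where the default integer follows the platform's C long):

```python
import numpy as np

print(np.dtype('int'))     # int64 on Linux/macOS, int32 on Windows
print(np.dtype(np.int64))  # an explicit width is the same on every platform
```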

Commits:
  • Expand check for string types
  • Make 'else' a failure condition in `__resolve_dtype_helper__` (rather than implicitly assuming list)
@sneakers-the-rat (Contributor, Author) commented:

ope. so Windows treats np.dtype('int') as 32-bit by default i guess, good to know.

Looking to see if there was an explicit type map, I can see that there is and that I missed it. That also shows that my check for `is str` was too strict, since presumably all strings should be saved as 'U25' (I'm just basing that off what you specified here:

item['dtype'] = 'U25'

which seems like it could truncate strings to me, but that part ofc is up to y'all). Replaced it with a check for all types that are mapped to str in __dtypes.

It seems like there should be a single string dtype -> numpy dtype map method, but the __resolve_dtype_helper__ method has a problem in that zarr doesn't support variable-length unicode strings (https://zarr.readthedocs.io/en/stable/tutorial.html#string-arrays), and so the mapping is str -> '<U', which zarr silently drops; when read back it returns an empty string. I kept most of the existing type-creation code, but added a call to __resolve_dtype_helper__ in the else case, and added a failure mode for an unhandled else in that method, since the existing else implicitly assumed that the dtype was a list.

Feels like the dtype conversion logic is getting pretty dispersed at this point; it would be nice to consolidate it into one or a few maps (seems like it would be good to have separate ones for the intended Python type vs. the storage type, e.g. str vs. '<U25'), but I won't turn this PR into a big refactor <3
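
A small sketch of the '<U' symptom being described here (plain numpy, not the hdmf-zarr code):

```python
import numpy as np

# A bare '<U' is a zero-width unicode dtype, so anything stored through it
# is truncated to nothing -- hence the empty strings on read.
print(np.dtype('U').itemsize)    # 0
print(np.dtype('U25').itemsize)  # 100 (25 UCS-4 code points, 4 bytes each)

a = np.zeros(1, dtype='U')
a[0] = 'timestamps'
print(repr(a[0]))                # empty string: truncated to the zero-width itemsize
```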


This might reveal another issue, but I'm not sure. Just prior to the changes I made, a type_str object is made that makes use of the type map:

t = self.__resolve_dtype_helper__(i)

That gets saved as a dataset attribute:

dset.attrs['zarr_dtype'] = type_str

So if that is supposed to be used by the reader to re-hydrate the object and cast to correct types, it isn't being used. But i'm not sure that's what that is for, so idk.

@mavaylon1 (Contributor) commented:

> [quoted the previous comment in full]

The strings don't need to be length 25. I also found that using <U would return empty strings downstream. The U25 was me testing whether providing a cap would fix that, and it does. That's an open-ended discussion that I thought was settled by your implementation, but I think there's something lingering that has yet to be solidified. I'm going to dive into this Friday and see what I can come up with.

@sneakers-the-rat (Contributor, Author) commented:

Basically all that needs to happen is to decide on either a single length or calculating the max needed length from the data, ideally in a single mapping method that can be hooked into from the various places that need to get a zarr storage dtype from a source dtype. Hopefully not too complicated :)
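
A hypothetical sketch of the "calculate the max needed length" option (names are illustrative, not hdmf-zarr code):

```python
import numpy as np

def fixed_width_str_dtype(values):
    """Return a '<U{n}' dtype wide enough for every string in `values`."""
    width = max((len(str(v)) for v in values), default=1)
    return np.dtype(f'<U{width}')

print(fixed_width_str_dtype(['timestamps', 'electrodes']))  # <U10
```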

The following review thread is on this hunk:

if field['dtype'] is str or field['dtype'] in (
    'str', 'text', 'utf', 'utf8', 'utf-8', 'isodatetime'
):
    new_dtype.append((field['name'], 'U25'))
Review comment (Contributor):
@oruebel I feel that setting a cap on the length, though quick and functional, is not general. I think to get this ticket wrapped up we should decide on a temporary cap and investigate why, on read, the field turns into an empty string when 'U25' is just 'U'. Thoughts?

Review comment (Contributor):
I'm not sure if Zarr supports variable-length structured data types. That's something to look at. If not, then we may need to either: 1) introspect the data to determine the string length, 2) maybe use 'O' instead of 'U25' in the compound dtype, or 3) store variable-length compound types as objects. Did you try the second option, i.e., using 'O' or 'object' in the dtype instead of 'U'?

Review comment (Contributor, Author):
It doesn't support variable-length strings, no. Object would almost certainly work at the cost of needing to invoke a JSON serializer, but since we already have to do that for the reference object it would probably be fine. Introspecting the string would also probably be fine, but it gives me fragility vibes. Can try both; would the deciding factor be perf, or what?

Review comment (Contributor):
> Introspecting the string would also probably be fine but gives me fragility vibes

I agree, introspecting on write is not desirable from both a fragility and a performance perspective.
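
For reference, a sketch of the "store strings as objects" option being discussed, assuming zarr v2's object_codec mechanism from the linked tutorial:

```python
import numcodecs
import zarr

# With dtype=object plus an object codec, variable-length strings round-trip
# without having to pick a fixed '<U' width up front.
z = zarr.empty(3, dtype=object, object_codec=numcodecs.JSON())
z[0] = 'a string much longer than twenty-five characters'
print(z[0])  # comes back intact, no truncation
```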

@mavaylon1 (Contributor) commented:

Hey @sneakers-the-rat let me know if you want to finish this up or if you want me to.

@sneakers-the-rat (Contributor, Author) commented:

Ah I was not sure what approach we were taking, I can do some tests today to finish it out

@sneakers-the-rat (Contributor, Author) commented:

Alright, I didn't do seriously careful profiling here, but switching the 'U25' string out for 'O' is a perf wash, and when read back the value seems to be seamlessly recast to a string.

On a single run (not robust) of the relevant test cases (cProfile's cumulative time):

| Case | U25 (s) | Object (s) |
| --- | --- | --- |
| test_read_reference_compound_buf | 0.0375 | 0.0455 |
| test_read_reference_compound | 0.0356 | 0.0325 |
| test_write_reference_compound | 0.0381 | 0.0388 |
| total | 0.0911 | 0.1 |

But on a "naturalistic" case where I just timed the whole write of the dataset i had referenced in the issue, oddly the object encoding was faster. for the write_dataset method:

  • object: 38s
  • U25: 39.1s

So pretty much the same. I won't add a test case for very long strings, since if I did it would just be a hackjob attached to the presently modified test methods, and that seems like something better suited for a test fixture/parameterization.

@sneakers-the-rat (Contributor, Author) commented:

Sorry for the force push, just to amend an incorrect commit message

@sneakers-the-rat (Contributor, Author) commented:

I think this is good to go at this point :) lmk if changes still needed

@mavaylon1 merged commit 68218cf into hdmf-dev:dev Feb 2, 2024
24 checks passed
@oruebel mentioned this pull request Mar 5, 2024
Labels
category: enhancement (improvements of code or code behavior)

Linked issues
Successfully merging this pull request may close: [Bug]: Iterative write for compound reference datasets takes 5ever

5 participants