Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-26658: Re-implement the Formatter class #1018

Merged
merged 30 commits into from
Jul 25, 2024
Merged

DM-26658: Re-implement the Formatter class #1018

merged 30 commits into from
Jul 25, 2024

Conversation

timj
Copy link
Member

@timj timj commented May 17, 2024

  • Use entirely new class: FormatterV2
  • Provide FormatterV1inV2 shim class.
  • Since FormatterV2 needs access to the cache, moved all the cache code to FormatterV2 itself out of Datastore.
  • Three read methods can be provided, two write methods.
  • Can now read content from within zip archives.

For read:

  • read_from_uri gives you the full URI to do with whatever you want but might give you a locally file if it was in the cache.
  • read_from_stream gives you a file handle for the resource.
  • read_from_local_file gives you a guaranteed local file.

Flags declare which of those 3 are available for a given formatter to allow the FormatterV2.read method to pick a suitable option based on what it knows (or if a zip file is in play, in which case read_from_uri can't be called).

For write:

  • to_bytes is called first (can be Not Implemented)
  • Falls back to write_local_file

The default implementation of write_local_file calls to_bytes.

All daf_butler formatters ported to V2. There are some V1 test formatters since otherwise nothing will test FormatterV1inV2.

This mostly seems to work but there are some things I still need to do:

  • There is some code duplication within FormatterV2 that I should tidy up.
  • I agreed to add any notes from sub exceptions when I re-raise so that notes are more visible in the final stack trace.
  • Work out how to prevent storage class conversion before handing off to a formatter.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Copy link

codecov bot commented May 18, 2024

Codecov Report

Attention: Patch coverage is 85.87571% with 100 lines in your changes missing coverage. Please review.

Project coverage is 89.37%. Comparing base (cdbbd14) to head (c4dd9bc).

Files Patch % Lines
python/lsst/daf/butler/_formatter.py 81.58% 40 Missing and 25 partials ⚠️
python/lsst/daf/butler/tests/_datasetsHelper.py 73.68% 2 Missing and 3 partials ⚠️
python/lsst/daf/butler/formatters/typeless.py 90.24% 3 Missing and 1 partial ⚠️
python/lsst/daf/butler/datastores/fileDatastore.py 86.36% 1 Missing and 2 partials ⚠️
python/lsst/daf/butler/formatters/file.py 0.00% 3 Missing ⚠️
python/lsst/daf/butler/formatters/parquet.py 88.46% 0 Missing and 3 partials ⚠️
python/lsst/daf/butler/formatters/yaml.py 85.71% 2 Missing and 1 partial ⚠️
python/lsst/daf/butler/_file_descriptor.py 50.00% 1 Missing and 1 partial ⚠️
python/lsst/daf/butler/_storage_class.py 75.00% 1 Missing and 1 partial ⚠️
...n/lsst/daf/butler/datastores/file_datastore/get.py 91.66% 1 Missing and 1 partial ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
- Coverage   89.55%   89.37%   -0.18%     
==========================================
  Files         358      359       +1     
  Lines       45277    45622     +345     
  Branches     9276     9348      +72     
==========================================
+ Hits        40548    40775     +227     
- Misses       3433     3521      +88     
- Partials     1296     1326      +30     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@timj timj force-pushed the tickets/DM-26658 branch 2 times, most recently from 0f4c8c7 to d6e5ce6 Compare May 24, 2024 21:52
@timj
Copy link
Member Author

timj commented May 24, 2024

For the record, we have been asked to support reading a file from within a zip file (where a single zip file is registered as multiple datasets). This might affect some of our thinking here as to the best interface.

@timj timj force-pushed the tickets/DM-26658 branch 2 times, most recently from 8017aba to e271398 Compare June 8, 2024 00:09
@timj
Copy link
Member Author

timj commented Jun 12, 2024

The .name property used for stream handles requires lsst/resources#89.

@timj timj force-pushed the tickets/DM-26658 branch 8 times, most recently from 3bd3e3a to 005bc0a Compare July 3, 2024 22:51
@timj timj marked this pull request as ready for review July 3, 2024 23:08
python/lsst/daf/butler/_formatter.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_formatter.py Outdated Show resolved Hide resolved
Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I've left a lot of comments because there's a lot here, but overall I'm very happy with all of this.

I think my biggest concern is that I'd like to make it really clear what all the storage classes are in the various places they appear in a formatter, for the various scenarios in which a formatter is used, and I could imagine trying to document/test that more rigorously leading to some changes to make it more consistent. There's a line comment thread on this we can use for further discussion.

python/lsst/daf/butler/_formatter.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_formatter.py Outdated Show resolved Hide resolved
cache_manager = DatastoreDisabledCacheManager(None, None)
return cache_manager

def read(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to use @typing.final to enforce what the docs say.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea. Except TypelessFormatter only works because it overrides read...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that reinforces my thinking that we should move away from having a JSON formatter and a single YAML formatter that try to do everything, and specialize for Pydantic/dataclasses,etc. But not on this ticket.

python/lsst/daf/butler/_formatter.py Show resolved Hide resolved
python/lsst/daf/butler/_formatter.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/formatters/typeless.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/formatters/typeless.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/formatters/typeless.py Outdated Show resolved Hide resolved
This class provides a ``read()`` method that will run `FormatterV2.read`
and coerce the return type using a variety of techniques. Use the
standard `FormatterV2` methods for reading bytes/files and writing
bytes/files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets me thinking that we should have different JSON/YAML formatters for different categories of target types (one for things that are natively pydantic models, another for those that have to_simple and from_simple, etc.).

May be too late for PropertySet vs. TaskMetadata, of course, but if we'd had different formatters earlier that transition would have been easier to handle, and I'd like to do better in future similar transitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a PydanticFormatter implementation explicitly. This wouldn't have solved the PropertySet vs TaskMetadata problem though because those are entirely different types and the fix there would have been to have a specialist class from the beginning for task metadata that might have internally used a PropertySet rather than using a generic container class at the start.

doc/lsst.daf.butler/formatters.rst Outdated Show resolved Hide resolved
@timj timj force-pushed the tickets/DM-26658 branch 4 times, most recently from 205761f to 27c5986 Compare July 10, 2024 17:27
timj added 27 commits July 25, 2024 08:39
Entirely new FormatterV2 with a wrapper class to support
legacy Formatter implementations. No longer subclass the
read and write method but instead subclass

* read_from_uri
* read_from_stream
* read_from_local_file
* to_bytes
* write_local_file
The tests still use V1 though
This is more akin to how dunder methods work in general and
is a reasonable approach when we have three options and want to
try each.
This allows the type ignore to be removed.
ci_middleware has mocks that look like they are all compatible
with the requested python type when they aren't so by default
a formatter can't naively compare types and decide that since
they match no conversion is required. Always return False for
Formatter.can_accept and force type coercion (which works
with the mocks because of subclassing). Formatter can't know
about mocks (which are in pipe_base) and returning False
doesn't change anything in the default case because conversion
will be a no-op.

The mocks also track file conversions so if conversions are
bypassed the tests confirming conversions won't pass.
This is now all handled internally in the formatter.
@timj timj merged commit 0ea5e73 into main Jul 25, 2024
16 of 18 checks passed
@timj timj deleted the tickets/DM-26658 branch July 25, 2024 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants