
DM-41941: Handle all options for RemoteButler.get() inputs #923

Merged
merged 8 commits into main from tickets/DM-41941
Dec 14, 2023

Conversation

dhirving
Contributor

@dhirving dhirving commented Dec 8, 2023

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes


codecov bot commented Dec 8, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (40b9058) 87.60% compared to head (166fc17) 87.63%.

Files Patch % Lines
...t/daf/butler/remote_butler/server/_dependencies.py 33.33% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #923      +/-   ##
==========================================
+ Coverage   87.60%   87.63%   +0.03%     
==========================================
  Files         295      295              
  Lines       38254    38327      +73     
  Branches     8088     8097       +9     
==========================================
+ Hits        33511    33588      +77     
+ Misses       3539     3535       -4     
  Partials     1204     1204              


@dhirving dhirving force-pushed the tickets/DM-41941 branch 2 times, most recently from ee6a532 to 61b7dee Compare December 11, 2023 17:03
@dhirving dhirving force-pushed the tickets/DM-41941 branch 5 times, most recently from 96f5018 to c54ad78 Compare December 13, 2023 00:28
@dhirving dhirving marked this pull request as ready for review December 13, 2023 00:30
Member

@timj timj left a comment

Looks great. Nice cleanups.

@@ -69,17 +76,20 @@
DataCoordinate or other data ID.
"""

SerializedDataId = dict[str, DataIdValue]
Member

Oddly, DataId is defined below to use Any instead of DataIdValue.

Contributor Author

You mean in this bit?:

DataId = DataCoordinate | Mapping[str, Any]

Not really sure what that's about. DataCoordinate proper does seem to use the DataIdValue pretty consistently everywhere. That DataId union exists mainly as a parameter for Butler methods I think, so maybe when you guys added mypy there was too much random code that was passing in like stringifiable things that weren't actually strings or something. There is a bit of code in standardize doing things like coercing numpy.int64 to int, but not as much as I would have thought.

I kind of feel like there are values that aren't ints or strings, though. Isn't some stuff like floats or datetimes?

I sort of suspect that the existing QuantumGraph code or wherever this is getting used just happens to not use anything unusual, and we may end up with some stuff getting accidentally coerced to strings at some point. In any case I just pulled out an alias for the existing definition from SerializedDataCoordinate, so a problem for another day.

Member

Dimensions do not use floats because you can't do equality on them. Dimensions are strings or ints (and, as you found, integral types, so it really should be numbers.Integral or something instead of int).

def direct(
cls, *, dataId: dict[str, DataIdValue], records: dict[str, dict] | None
) -> SerializedDataCoordinate:
def direct(cls, *, dataId: SerializedDataId, records: dict[str, dict] | None) -> SerializedDataCoordinate:
Member

Maybe change the type of the dataId parameter in the docstring below (although last I looked Sphinx can get upset with type aliases).

def _simplify_dataId(
self, dataId: DataId | None, **kwargs: dict[str, int | str]
) -> SerializedDataCoordinate | None:
def _simplify_dataId(self, dataId: DataId | None, kwargs: dict[str, int | str]) -> SerializedDataId:
Member

Suggested change
def _simplify_dataId(self, dataId: DataId | None, kwargs: dict[str, int | str]) -> SerializedDataId:
def _simplify_dataId(self, dataId: DataId | None, kwargs: dict[str, DataIdValue]) -> SerializedDataId:

?


Returns
-------
data_id : `SerializedDataCoordinate` or `None`
data_id : `SerializedDataId` or `None`
Member

Suggested change
data_id : `SerializedDataId` or `None`
data_id : `SerializedDataId`

dataset_id = datasetRefOrType.id
response = self._get(f"get_file/{dataset_id}")
if response.status_code == 404:
raise LookupError(f"Dataset not found with ID {dataset_id}")
Member

I know this was in the old code but it might be better to print the datasetRefOrType here and not just the ID since that is more human readable (and will include the ID).

ref = DatasetRef.from_simple(
self._parse_model(response, SerializedDatasetRef), universe=self.dimensions
)
if storage_class is not None:
Member

Thanks for fixing this.

"""Convert DatasetType parameters in the format used by Butler methods
to a standardized string name for the REST API.
"""
if isinstance(datasetTypeOrName, DatasetType):
Member

Would

return getattr(datasetTypeOrName, "name", datasetTypeOrName)

be more efficient?

Contributor Author

Slightly, yeah, but less type safe and more ambiguous. I personally have no idea off the top of my head whether str or any str-like objects that may be passed in by a user have a name attribute or not.
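The trade-off in this exchange can be shown side by side, using a minimal stand-in for DatasetType (function names here are illustrative, not from the codebase):

```python
class DatasetType:
    """Minimal stand-in for the real DatasetType, which carries a name."""
    def __init__(self, name: str):
        self.name = name

def normalize_isinstance(datasetTypeOrName: "DatasetType | str") -> str:
    # Explicit and type-safe: a type checker can verify both branches return str.
    if isinstance(datasetTypeOrName, DatasetType):
        return datasetTypeOrName.name
    return datasetTypeOrName

def normalize_getattr(datasetTypeOrName: "DatasetType | str") -> str:
    # Shorter and avoids an isinstance check, but returns whatever a stray
    # `.name` attribute holds, so an unusual str-like input with a `name`
    # attribute would silently misbehave.
    return getattr(datasetTypeOrName, "name", datasetTypeOrName)
```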

)
return _get_file_by_ref(butler, ref)
except LookupError as e:
raise NotFoundException() from e
Member

No error message?

Contributor Author

It's not actually shipping the error message from those NotFoundExceptions to the client yet, and the from e will put the original error message in the server logs. I suppose I could have put in "DataId not found" or something.

In a couple weeks when I go in and set up the error propagation framework, that try block will go away entirely.


def check_sc_override(converted):
self.assertNotEqual(type(metric), type(converted))
self.assertIs(type(converted), new_sc.pytype)
Member

Suggested change
self.assertIs(type(converted), new_sc.pytype)
self.assertIsInstance(converted, new_sc.pytype)

?
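The difference between the two assertions, shown with plain asserts and stand-in class names: assertIs(type(x), T) demands the exact class, while assertIsInstance(x, T) also accepts subclasses, which can matter when a storage class converter returns a subclass of the expected pytype.

```python
class Base: ...
class Derived(Base): ...

converted = Derived()
assert isinstance(converted, Base)   # assertIsInstance semantics: subclass OK
assert type(converted) is not Base   # exact-type check fails for a subclass
assert type(converted) is Derived    # assertIs only passes for the exact class
```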

@@ -81,6 +84,8 @@ def get_dataset_as_python_object(
]

ref = DatasetRef.from_simple(payload.dataset_ref, universe=universe)
if component is not None:
Member

Maybe note that you have to apply the component override before the storage class override.

Adjusted the handling of dataId and kwargs in RemoteButler.find_datasets to mitigate a few issues:
1. There was a security vulnerability where the caller could smuggle an arbitrary unvalidated string, integer, or 'None' value into any keyword argument of the server-side 'find_dataset' function.
2. Augmentation of DataCoordinate values by kwargs was not working -- it only worked for the 'Mapping' case of DataId.
3. It was sending and validating unused DimensionRecords from DataCoordinate. Removed them to avoid forwards-compatibility issues.
4. Fixed an issue where 'None' for DataId was not correctly being passed to the server -- it would instead throw a Pydantic error on the client side. None is equivalent to an empty dict in find_datasets and get(), so just removed the nullability here.
Implement storage class overrides via DatasetRef and DatasetType.
The StorageClass requested by the client may not be known to the server. There's no need for the server to get involved in StorageClass conversion at all, so this also simplifies the REST API.
This doesn't mutate the state of the server and has only a single string parameter, so GET makes sense.
@dhirving dhirving merged commit cc65914 into main Dec 14, 2023
18 checks passed
@dhirving dhirving deleted the tickets/DM-41941 branch December 14, 2023 20:22