
DM-41878: Implement RemoteButler.get() backed by a single FileDatastore #912

Merged

merged 9 commits into main from tickets/DM-41878 on Dec 8, 2023

Conversation

@dhirving (Contributor) commented Nov 28, 2023

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes


codecov bot commented Nov 28, 2023

Codecov Report

Attention: 24 lines in your changes are missing coverage. Please review.

Comparison is base (4845263) 87.50% compared to head (d718a09) 87.55%.
Report is 1 commit behind head on main.

Files Patch % Lines
...n/lsst/daf/butler/datastores/file_datastore/get.py 85.06% 13 Missing and 10 partials ⚠️
python/lsst/daf/butler/datastore/generic_base.py 92.30% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #912      +/-   ##
==========================================
+ Coverage   87.50%   87.55%   +0.04%     
==========================================
  Files         292      295       +3     
  Lines       38124    38241     +117     
  Branches     8081     8088       +7     
==========================================
+ Hits        33362    33482     +120     
+ Misses       3554     3553       -1     
+ Partials     1208     1206       -2     

☔ View full report in Codecov by Sentry.

@dhirving dhirving force-pushed the tickets/DM-41878 branch 4 times, most recently from 9097407 to 9406699, on December 1, 2023 21:14
@dhirving dhirving marked this pull request as ready for review December 1, 2023 21:24
@timj (Member) left a comment

Looks great.

There are some numpydoc issues, and some single-backtick vs. double-backtick issues for rst fun.

@@ -808,7 +808,7 @@ def get_dataset_type(self, name: str) -> DatasetType:
     def get_dataset(
         self,
         id: DatasetId,
-        storage_class: str | StorageClass | None,
+        storage_class: str | StorageClass | None = None,
Member

Maybe we mark this method as requiring keyword args after the id?
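The suggestion can be sketched with Python's bare `*` marker, which makes every parameter after `id` keyword-only. The class and type names below are simplified stand-ins for the real Butler API, not the actual implementation:

```python
from __future__ import annotations


class Butler:
    """Simplified stand-in for the real Butler class."""

    def get_dataset(
        self,
        id: str,  # stand-in for DatasetId
        *,  # parameters after this marker must be passed by keyword
        storage_class: str | None = None,
    ) -> str:
        return f"{id}:{storage_class}"


butler = Butler()
butler.get_dataset("uuid-1234", storage_class="ArrowTable")  # OK
# butler.get_dataset("uuid-1234", "ArrowTable")  # would raise TypeError
```

This guards callers against silently passing `storage_class` positionally if more parameters are added later.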

@@ -320,3 +332,26 @@ def update(self, **kwargs: Any) -> StoredFileInfo:

     def __reduce__(self) -> str | tuple[Any, ...]:
         return (self.from_record, (self.to_record(),))


+class SerializedStoredFileInfo(_BaseModelCompat):
Member

I still think it's a bit odd that we now have a dataclass defining the datastore record content, a pydantic model defining the datastore record content, and a method in FileDatastore for creating the database table that must match the other two but uses neither to define it. If makeTableSpec built up its definition from StoredFileInfo (if we are saying that it is required that StoredFileInfo and makeTableSpec agree but the serialized form is allowed to be different) that would at least limit the duplication.

Contributor Author

StoredFileInfo isn't actually a dataclass, it's a subclass of the abstract base class StoredDatastoreItemInfo with a property that can't be instantiated without injecting configuration and several business logic methods. And the representation of some of the fields varies between the various cases (e.g. component is str | None on the public API but in the database it's str | Literal['__NULL_STRING__'].)

I had considered making StoredFileInfo have-a SerializedStoredFileInfo and forwarding the properties to it, but it didn't seem to buy much because you still have to have the duplicate properties to do the forwarding. I also think some of the representations might need to vary here (e.g. I kinda think SerializedStoredFileInfo shouldn't have a path at all, or if it does it should always be relative. Whereas the one in the StoredFileInfo is sometimes absolute.)

I also considered making StoredFileInfo inherit-from SerializedStoredFileInfo or an additional shared pydantic model for the duplicated fields, but inheritance seems harder to understand than duplicating 5 string properties. And if we add more properties, they will likely be non-nullable on StoredFileInfo but will have to be nullable on the serialized one to handle backwards compatibility.

I do think that the methods on StoredFileInfo could be moved to FileDatastore and StoredFileInfo could become an actual dataclass. Or like if we were using the SqlAlchemy ORM layer StoredFileInfo could just be the ORM object. But I know you guys had plans with exposing these more places which is why you added the abstract base class, so to my mind that kind of change is way outside the scope of this ticket.
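The "have-a" alternative described above can be sketched as follows. Plain dataclasses stand in for the real Pydantic model and butler classes, and the field names are illustrative only:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SerializedStoredFileInfo:
    """Stand-in for the Pydantic serialization model."""

    formatter: str
    path: str
    component: str | None = None


class StoredFileInfo:
    """Wraps a serialized model and forwards its fields."""

    def __init__(self, serialized: SerializedStoredFileInfo):
        self._info = serialized

    # Every public field still needs an explicit forwarding property,
    # which is why this buys little over duplicating five string attributes.
    @property
    def formatter(self) -> str:
        return self._info.formatter

    @property
    def component(self) -> str | None:
        return self._info.component
```

As the comment notes, the forwarding boilerplate is roughly the same size as the duplication it replaces, which matches the reasoning above for not adopting it.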

def prepare_get_for_external_client(self, ref: DatasetRef) -> FileDatastoreGetPayload:
    # Docstring inherited

    def to_file_info_payload(info: DatasetLocationInformation) -> FileDatastoreGetPayloadFileInfo:
Member

Why is the helper function needed here? Isn't it clearer to do this instead?

file_records = []
for location, file_info in self._get_dataset_locations_info(ref):
    file_records.append(FileDatastoreGetPayloadFileInfo(...))

Contributor Author

🤷 to me it's equally clear to have a small function explicitly declaring what its inputs and outputs are, but that might be the typescript programmer in me talking.

Member

A timing test in ipython hints that calling the helper function takes a third longer than doing the loop directly. That seems like a lot of overhead for something that is simpler without it (and there are still type annotations you can add to the list above to make it clear).
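The kind of measurement described above can be reproduced with `timeit`. This is only an illustrative micro-benchmark with toy data, not the actual datastore code; the exact ratio will vary by machine:

```python
import timeit

items = list(range(200))


def make_record(x):
    """Toy helper standing in for to_file_info_payload."""
    return (x, x * 2)


def with_helper():
    return [make_record(x) for x in items]


def inline():
    return [(x, x * 2) for x in items]


t_helper = timeit.timeit(with_helper, number=500)
t_inline = timeit.timeit(inline, number=500)
# The helper version is typically slower purely from per-item function-call
# overhead; the review above reports roughly a third longer in ipython.
```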

        dataId=ref.dataId,
    )
    return get_dataset_as_python_object_from_get_info(
        datastore_file_info, ref=ref, parameters=parameters, cache_manager=DatastoreDisabledCacheManager()
    )
Member

Why is the cache being disabled here? Isn't the cache the thing we want for remote data?

Contributor Author

We do eventually want it to be enabled, but I'll need to make a place for it to live and still need to audit it for thread-safety. I didn't see any locks in there so I assumed by default that it's probably not threadsafe.

Also it probably should be configurable on the client side, I'm assuming that the services backend use case probably does not want caching on a local disk. Probably should be on by default for the notebook use case.

I should have made a ticket for this though, I'll do that.

Member

It's not using globals and is meant to be safe for use with multiple processes sharing a cache. I would not expect threads to be a problem -- it mostly scans the file system for its source of truth and when a file is being retrieved it puts it in a separate location to make sure another process doesn't delete the file whilst it's being read.

Contributor Author

Yeah so it seems that having a separate instance of DatastoreCacheManager per-thread is likely threadsafe, but sharing an instance across threads is not because of self._cache_entries.
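A separate-instance-per-thread arrangement can be sketched with `threading.local`. `CacheManager` here is a hypothetical stand-in for `DatastoreCacheManager`, with a mutable dict playing the role of `self._cache_entries`:

```python
import threading


class CacheManager:
    """Stand-in with per-instance mutable state, like _cache_entries."""

    def __init__(self):
        self._cache_entries = {}


_local = threading.local()


def get_cache_manager() -> CacheManager:
    # Each thread sees its own attribute namespace on _local, so no two
    # threads ever share the mutable _cache_entries dict.
    if not hasattr(_local, "cache_manager"):
        _local.cache_manager = CacheManager()
    return _local.cache_manager
```

This sidesteps the shared-mutable-state problem without adding locks, at the cost of one cache-manager instance (and its scan of the cache directory) per thread.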

Member

Yeah, it's a mutable object so that's always going to be a problem.

Contributor Author

Opened DM-42066

python/lsst/daf/butler/datastores/file_datastore/get.py — 4 outdated review threads, resolved
dhirving and others added 9 commits December 7, 2023 12:34
  • In preparation for extracting portions of the FileDatastore get() method to a free function to be used for client/server, remove references to 'self' and split the function into a few pieces.
  • Convert portions of FileDatastore.get() to free functions, to allow the get operation to be called on the client side without instantiating a FileDatastore object.
  • Added re-usable functions for sending a POST using a Pydantic model, and for parsing a response using the model.
  • Fixed an issue where UnboundLocalError would be thrown when passing a dict as a DataId to a RemoteButler function.
  • Added a partial implementation of get() for RemoteButler, which so far only handles the DatasetRef form of get(). This modifies the DataStore interface to include a function that serializes the information necessary to do the get() on the client side.
  • Remove a small amount of duplicated code by simplifying the function parameters. (Co-authored-by: Tim Jenness <[email protected]>)
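The "send a POST using a model, parse the response using the model" pattern from the commits above can be sketched as follows. The real code uses Pydantic models and the RemoteButler HTTP client; here stdlib dataclasses, json, and urllib stand in so the sketch stays dependency-free, and all names are illustrative:

```python
from __future__ import annotations

import json
import urllib.request
from dataclasses import asdict, dataclass


@dataclass
class GetFileRequest:
    """Hypothetical request model."""

    dataset_id: str


@dataclass
class GetFileResponse:
    """Hypothetical response model."""

    url: str
    checksum: str


def post_model(endpoint: str, request: GetFileRequest) -> GetFileResponse:
    # Serialize the typed request model to a JSON body.
    body = json.dumps(asdict(request)).encode()
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # Parse the JSON payload back into a typed response model.
        return GetFileResponse(**json.loads(resp.read()))
```

Centralizing the serialize/POST/parse steps in one helper keeps each RemoteButler endpoint call down to a model in and a model out.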
@dhirving dhirving merged commit 1e7b397 into main Dec 8, 2023
17 checks passed
@dhirving dhirving deleted the tickets/DM-41878 branch December 8, 2023 16:25
2 participants