DM-41365: Add findDataset/getDataset/getDatasetType to Butler API #899
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main     #899      +/-   ##
==========================================
+ Coverage   87.66%   87.68%   +0.01%
==========================================
  Files         276      276
  Lines       36165    36326    +161
  Branches     7547     7581     +34
==========================================
+ Hits        31705    31851    +146
- Misses       3305     3313      +8
- Partials     1155     1162      +7

☔ View full report in Codecov by Sentry.
Force-pushed from 07f5f39 to 5768235
Force-pushed from f124fbc to 7b59d29
# Populate the test server.
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "base.yaml"))
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "datasets-uuid.yaml"))
Doesn't matter yet but you may want to move this above the call to _make_remote_butler, in case we decide to read some stuff from the server during RemoteButler initialization.
I'm hoping it doesn't make a difference here either way, because the clients won't be downloading dataset type and collection caches before this runs, will they?
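Purely as an illustration of the ordering being discussed (the names `server_butler`, `TESTDIR`, and `_make_remote_butler` come from the surrounding test module; the exact call signature of `_make_remote_butler` is assumed), the setup would become:

```python
import os

# Populate the server-side repository first ...
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "base.yaml"))
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "datasets-uuid.yaml"))

# ... and only construct the client afterwards, so that anything RemoteButler
# decides to read from the server during initialization already sees the data.
butler = _make_remote_butler()  # call signature assumed for illustration
```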
Force-pushed from 7b59d29 to 9f27ffd
Just took a look at the APIs, and while I left a lot of comments they're all minor. Did not look at the implementations (I gather that's not needed from me).
*,
collections: CollectionArgType | None = None,
timespan: Timespan | None = None,
datastore_records: bool = False,
If we provide this as an option, we should also provide a way to expand the data ID (i.e. add dimension records) at the same time, and we should think about whether we want just one option to do both or allow the two kinds of expansion to happen separately.
I had completely failed to notice that that flag was in there now. One option is for `find_dataset` to always expand and add everything. People shouldn't be calling it in bulk.
Always expanding could be inefficient; many clients do not care about datastore records at all, and we should not be pessimizing things for them.
Presumably `get_dataset` should have the `datastore_records` and `dimension_records` booleans as well for consistency?
👍
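For concreteness, a hypothetical sketch of the consistent signature being agreed on here; the flag names come from the discussion above, while the defaults and docstring wording are assumptions rather than the final API:

```python
from abc import ABC, abstractmethod

from lsst.daf.butler import DatasetId, DatasetRef


class Butler(ABC):  # illustrative fragment, not the real abstract base class
    @abstractmethod
    def get_dataset(
        self,
        id: DatasetId,
        *,
        dimension_records: bool = False,
        datastore_records: bool = False,
    ) -> DatasetRef | None:
        """Return the ref for the given dataset ID, or None if it does not
        exist, optionally expanding the data ID with dimension records and
        attaching datastore records.
        """
        raise NotImplementedError()
```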
I also need to add datastore records to `SerializedDatasetRef` because that was never updated, and so it's impossible to return datastore records from the server at the moment...
Jim asked me not to add them to `SerializedDatasetRef`, as it may increase graph size on disk (but I don't think we give records to refs that go into the graph, at least not yet).
We are sending `SerializedDatasetRef` over the wire, so we can only reconstruct at the far end things that were put in at the other end. We need to add them somehow (but obviously they should be optional). For now I've turned the feature off in client/server so you can't get the datastore records.
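One way to add them without bloating existing serialized refs is an optional field that defaults to `None`. A rough pydantic-style sketch; the field name and the nested record model here are placeholders, not the actual daf_butler schema:

```python
from uuid import UUID

from pydantic import BaseModel


class SerializedDatastoreRecords(BaseModel):
    """Placeholder for however datastore records end up being serialized."""

    records: dict[str, list[dict]] = {}


class SerializedDatasetRef(BaseModel):
    """Trimmed-down sketch; the real model carries more fields."""

    id: UUID
    run: str | None = None
    # Optional, so refs serialized without records (e.g. those written into
    # quantum graphs) stay exactly the same size as before.
    datastore_records: SerializedDatastoreRecords | None = None
```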
python/lsst/daf/butler/_butler.py
Outdated
@@ -799,6 +799,23 @@ def get_dataset_type(self, name: str) -> DatasetType:
"""
raise NotImplementedError()

@abstractmethod
def get_dataset(self, id: DatasetId) -> DatasetRef | None:
I think we'll want a vectorized version of this in the future, but it's a pain to implement now. I don't think this interface needs to change in order to live alongside a vectorized one, but I wanted to raise it in case you think otherwise.
There are two open questions:

- If we have a vectorized version, `get_datasets`, do we keep `get_dataset` around?
- What should `get_datasets` return? `list[DatasetRef]` or `DatasetRefContainer`?

I think we likely want a single request to return a single ref, so it seems like we always want `Butler.get_dataset` even if we later add `Butler.get_datasets`.
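To illustrate the point that the scalar method can live alongside a later bulk one (everything about `get_datasets` here, its name, argument, and return type, is hypothetical):

```python
from collections.abc import Iterable

from lsst.daf.butler import DatasetId, DatasetRef


class Butler:  # illustrative fragment only
    def get_dataset(self, id: DatasetId) -> DatasetRef | None:
        """Return the ref for a single dataset ID, or None if it is unknown."""
        raise NotImplementedError()

    def get_datasets(self, ids: Iterable[DatasetId]) -> list[DatasetRef | None]:
        """Hypothetical bulk companion; it could equally return a dedicated
        container type instead of a list.
        """
        # Naive default in terms of the scalar call; a real implementation
        # would batch the lookup into a single query or request.
        return [self.get_dataset(dataset_id) for dataset_id in ids]
```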
Hmm, do we ever expose bare `DatasetId` to butler clients (do we need this method at butler level)?
It certainly makes client/server testing a lot easier to begin with, since it's the simplest possible API to implement. We do have VO services that are given a UUID (via datalink from, e.g., ObsTAP) and need to convert that to a dataset ref, so it's not impossible that the server will need to do that (depending on how services connect to butlers).
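As an illustration of that use case (the repository path and UUID below are placeholders), a datalink-style service holding only a dataset UUID could resolve it with the new method:

```python
from uuid import UUID

from lsst.daf.butler import Butler

# Placeholder values: a datalink service would get the UUID from an
# ObsTAP/datalink request and be configured with the repository to use.
butler = Butler("/repo/main")
dataset_id = UUID("00000000-0000-0000-0000-000000000000")

ref = butler.get_dataset(dataset_id)
if ref is None:
    raise LookupError(f"No dataset found with ID {dataset_id}")
print(ref.datasetType.name, ref.run, ref.dataId)
```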
I think we'll want to start exposing dataset IDs in more places, too; I've been thinking about how we might get them into the `--where` expression language, for instance. They appear pretty prominently in QGs already.
Looks good, a few minor questions.
# Temporary hack. Assume strings for collections. In future
# want to construct CollectionWildcard and filter it through collection
# cache to generate list of collection names.
Working on the caching ticket now; I would probably prefer not to have a complete list of collections on every client in the long term.
Interesting. I thought we said last week that we would. I thought @TallJimbo wanted the client to know all possible collections when it was building the query object, before sending it to the server. The caching story between client and server is a bit tricky.
Yes, I know, but we may need to rethink that part, otherwise we may have a scaling issue. I looked at Butler initialization time today: the simplest `butler query-data-ids --collections=HSC/raw/all /repo/main exposure` takes 10 seconds, and a large fraction of that comes from pre-loading the collections cache. We have 300k collections today, and that will only grow with time. There are probably ways to reduce it, but in my ideal world it would be better if we cached almost nothing on the client side. 🙂
I think having the caches in the client has worked well for us so far (i.e. with the server just being the DB), and I like the idea of being able to optimize server calls away entirely in the client sometimes. And I do think caching on the server is a lot harder, since we can't just punt cache misses to the user there.
But I do hesitate a bit because I'm not sure whether "cache all the collections the user can see" continues to be in the "small enough to be irrelevant" category after DR1, both in terms of downloading the cache and holding it in memory. And I have no idea if my "do more on the client to keep the server load light" intuition is better than @dhirving's "do nothing on the client to keep version management easier" intuition.
Downloading a million collection records just to figure out that a pattern does not match any of them looks like a lot of overhead. It is of course about the right balance between what we do on the server vs. the client; we need to find something that works and scales.
Ok, I'm sold on not trying to aggressively fetch all collections the user can access. It's the dataset type definitions, chained collection members, and collection summaries that are needed for query construction, and we don't need all of any of those.

I think what may break if we drop the complete collection-name cache is our regex/glob collection query implementation: we'll need to start translating/limiting those patterns to something that can be run inside the DB. Yes, we could do a full cache in the `RemoteButler` server, but this scaling problem will hit `DirectButler`, too.

I'll do a brain dump on Slack later today about the places we use the caches in the query system so somebody other than me can do some conceptual design work on what it should look like. I think we need to have a better sense of where this is all going, even if we don't decide to switch fully from the current approach yet, but I want to stay off of at least this part of the critical path.
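Not the daf_butler implementation, just a sketch of what "translating those patterns to something that can be run inside the DB" could look like for simple globs; anything needing a full regex would still have to be rejected or handled differently:

```python
def glob_to_sql_like(pattern: str) -> str:
    """Translate a shell-style glob (only ``*`` and ``?``) into a SQL LIKE
    pattern so collection-name filtering can run in the database instead of
    against a client-side cache of every collection name.
    """
    out = []
    for ch in pattern:
        if ch == "*":
            out.append("%")
        elif ch == "?":
            out.append("_")
        elif ch in ("%", "_", "\\"):
            out.append("\\" + ch)  # escape literal LIKE wildcards
        else:
            out.append(ch)
    return "".join(out)


# e.g. SELECT name FROM collection WHERE name LIKE 'HSC/runs/%' ESCAPE '\'
print(glob_to_sql_like("HSC/runs/*"))  # -> HSC/runs/%
```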
# Temporary hack. Assume strings for collections. In future
# want to construct CollectionWildcard and filter it through collection
# cache to generate list of collection names.
wildcards = CollectionWildcard.from_expression(collections)
Maybe do a quick check here and raise if `patterns` specifies something?
Seems like I have to change the API to something a bit simpler for the collection list anyhow.
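A sketch of the quick check being suggested, under the assumption that `CollectionWildcard` exposes the resolved `strings` and `patterns` referred to in the comment (attribute names and import path assumed):

```python
from lsst.daf.butler.registry.wildcards import CollectionWildcard

# `collections` is the caller-supplied expression from the surrounding method.
wildcards = CollectionWildcard.from_expression(collections)
if wildcards.patterns:
    # Reject regex/glob patterns until server-side collection filtering exists.
    raise TypeError("Wildcard collection patterns are not yet supported here.")
collection_names = list(wildcards.strings)
```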
) -> SerializedDatasetType:
"""Return the dataset type."""
butler = factory.create_butler()
datasetType = butler.get_dataset_type(dataset_type_name)
Is the exception transported transparently back to the client?
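A later commit in this PR demonstrates that `MissingDatasetTypeError` can be caught in the server and re-raised in the client. As a rough, non-authoritative illustration of one way such a mapping can work in a FastAPI service (handler name, status code, payload shape, and the import path are all assumptions, not the actual butler server code):

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

from lsst.daf.butler.registry import MissingDatasetTypeError

app = FastAPI()


@app.exception_handler(MissingDatasetTypeError)
def missing_dataset_type_handler(request: Request, exc: MissingDatasetTypeError) -> JSONResponse:
    # Encode the exception type in the payload so the client can re-raise
    # the same class instead of a generic HTTP error.
    return JSONResponse(
        status_code=404,
        content={"exception": "MissingDatasetTypeError", "detail": str(exc)},
    )


def raise_from_server_payload(payload: dict) -> None:
    """Client-side counterpart: turn the payload back into the original exception."""
    if payload.get("exception") == "MissingDatasetTypeError":
        raise MissingDatasetTypeError(payload["detail"])
```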
Force-pushed from 9f27ffd to 13d3bff
The underscores are the preferred way to add new APIs now. The `collections` parameter should not support wildcards, so explicitly declare that it should be a sequence of `str`.
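Spelled out, that moves the signature towards something like the following sketch; the flag names come from the diff hunks quoted above, while the data ID argument's type and the defaults are assumptions:

```python
from collections.abc import Sequence

from lsst.daf.butler import DatasetRef, DatasetType, Timespan


class Butler:  # fragment for illustration only
    def find_dataset(
        self,
        dataset_type: DatasetType | str,
        data_id: dict | None = None,
        *,
        collections: Sequence[str] | None = None,  # plain names, no wildcards
        timespan: Timespan | None = None,
        datastore_records: bool = False,
    ) -> DatasetRef | None:
        """Snake_case name, with collections restricted to a sequence of str."""
        raise NotImplementedError()
```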
Force-pushed from 13d3bff to b22a5f3
`visit_system` is now required, but no test was triggering validation until recently.
Force-pushed from 07b0946 to 9afece7
…aset
Datastore records do not work in remote butler because they were never added to `SerializedDatasetRef`.
Demonstrate that `MissingDatasetTypeError` can be caught in the server and raised again in the client.
Force-pushed from 9afece7 to 17bc74e
Checklist
doc/changes