DM-47375: Run query_all_datasets in a single request for RemoteButler #1114

dhirving · 2024-11-04T23:14:58Z

Added a server-side endpoint to handle query_all_datasets in a single request. query_all_datasets can potentially involve hundreds or thousands of separate dataset queries, and we don't want clients slamming the server with that many HTTP requests.

The new endpoint streams results in the same manner as the existing query endpoints used by QueryDriver, but it is separate from the Query/QueryDriver framework.
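As a rough illustration of the idea, and not the actual endpoint, the sketch below serves many dataset types in one request and streams the results back page by page. The names `handle_query_all_datasets`, `read_stream`, and `FAKE_DATASETS`, along with the newline-delimited JSON framing, are assumptions made for this example; none of them are taken from the butler code.

```python
import json
from collections.abc import Iterator

# Hypothetical stand-in for the server's registry: dataset type -> dataset IDs.
FAKE_DATASETS = {
    "raw": ["r1", "r2", "r3"],
    "calexp": ["c1"],
    "src": [],
}


def handle_query_all_datasets(dataset_types: list[str], page_size: int = 2) -> Iterator[str]:
    """Serve one request covering many dataset types, yielding one JSON line per page."""
    for dataset_type in dataset_types:
        ids = FAKE_DATASETS.get(dataset_type, [])
        for start in range(0, len(ids), page_size):
            page = {"dataset_type": dataset_type, "ids": ids[start:start + page_size]}
            yield json.dumps(page) + "\n"


def read_stream(lines: Iterator[str]) -> dict[str, list[str]]:
    """Client side: consume streamed pages one at a time and group IDs by dataset type."""
    results: dict[str, list[str]] = {}
    for line in lines:
        page = json.loads(line)
        results.setdefault(page["dataset_type"], []).extend(page["ids"])
    return results


# One logical request replaces three separate per-type requests.
results = read_stream(handle_query_all_datasets(["raw", "calexp", "src"]))
```

In this sketch a dataset type with no matches produces no pages, so it never appears in the client's results.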

This is not yet used in the CLI tools and Butler._query_all_datasets is still private -- we need to deploy an updated server with this change before we can release the client side.

--order-by in the CLI tools is now restricted to queries for a single dataset type -- future implementations of query_all_datasets may not support it.
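A minimal sketch of how a CLI might enforce this restriction is shown below. `choose_query_path` is a hypothetical helper, the returned strings are only labels for the two code paths, and the sketch assumes the dataset type list has already been resolved to concrete names. The real CLI logic may differ.

```python
from collections.abc import Sequence


def choose_query_path(dataset_types: Sequence[str], order_by: Sequence[str]) -> str:
    """Pick which query method to use (hypothetical helper, not the real CLI code)."""
    if not order_by:
        # No ordering requested: the multi-type query can handle it.
        return "query_all_datasets"
    if len(dataset_types) != 1:
        raise ValueError("--order-by is only supported when querying a single dataset type")
    # Ordering requested for exactly one dataset type: use the single-type query.
    return "query_datasets"
```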

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes
(if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

codecov · 2024-11-04T23:28:55Z

Codecov Report

Attention: Patch coverage is 98.84170% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.44%. Comparing base (615b2ba) to head (3eb4836).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/lsst/daf/butler/_query_all_datasets.py | 92.85% | 1 Missing and 1 partial ⚠️ |
| ...on/lsst/daf/butler/remote_butler/_query_results.py | 95.45% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1114      +/-   ##
==========================================
+ Coverage   89.41%   89.44%   +0.02%     
==========================================
  Files         363      366       +3     
  Lines       48440    48598     +158     
  Branches     5879     5890      +11     
==========================================
+ Hits        43315    43469     +154     
- Misses       3716     3717       +1     
- Partials     1409     1412       +3


Switch from the query_datasets convenience method to the advanced query system in query_all_datasets. This lets us get the results one page at a time, which will be needed to prevent memory exhaustion when running these queries on the server.
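To show why paging bounds memory, here is a small sketch using only the standard library. `iter_pages` and `fake_query_rows` are hypothetical names invented for this example and say nothing about how the advanced query system is implemented.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def iter_pages(rows: Iterable[str], page_size: int) -> Iterator[list[str]]:
    """Yield rows in lists of at most page_size, holding only one page in memory."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    iterator = iter(rows)
    while page := list(islice(iterator, page_size)):
        yield page


def fake_query_rows(count: int) -> Iterator[str]:
    """Hypothetical stand-in for a lazily evaluated query result."""
    for i in range(count):
        yield f"dataset-{i}"


# Each page can be sent to the client and discarded before the next is built.
page_sizes = [len(page) for page in iter_pages(fake_query_rows(5), page_size=2)]
```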

query_all_datasets can potentially involve hundreds or thousands of separate dataset queries. We don't want clients slamming the server with that many HTTP requests, so add a server-side endpoint that can handle these queries in a single request.

dhirving changed the base branch from main to tickets/DM-45873 November 4, 2024 23:15

dhirving force-pushed the tickets/DM-47375 branch 3 times, most recently from 03351be to 79c520d Compare November 5, 2024 22:54

dhirving force-pushed the tickets/DM-45873 branch from 2cd8e3e to c077b29 Compare November 8, 2024 21:23

Base automatically changed from tickets/DM-45873 to main November 8, 2024 23:31

dhirving added 9 commits November 12, 2024 14:14

Remove with_dimension_records from query-datasets

89ac3c6

It turns out that the query-datasets CLI was not actually using dimension records, and it will simplify the implementation to not support this.

Restrict --order-by in query-datasets to single type

cfe5e58

The backend for querying multiple dataset types will not support "order by", so restrict the CLI to match the implementation.

Remove order_by from query_all_datasets

c183e9e

The upcoming implementation of query_all_datasets will not support order_by, so remove it. This requires modifying the query-datasets CLI to use the single dataset type query_datasets when order by needs to be supported.

Make streaming query logic reusable

b65ba2c

In preparation for implementing query_all_datasets on the server, make the streaming response and timeout logic from the existing query handler re-usable.
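The commit does not spell out how the timeout works, so the following is only a sketch of one way a reusable streaming wrapper with a time limit could look. `stream_with_deadline`, the message dictionaries, and the injectable `clock` parameter are all assumptions for illustration. Note that this version checks the deadline only between pages and cannot interrupt a single slow page.

```python
import time
from collections.abc import Callable, Iterable, Iterator


def stream_with_deadline(
    pages: Iterable[list[str]],
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[dict]:
    """Wrap any page iterator, ending the stream with an error once the deadline passes."""
    deadline = clock() + max_seconds
    for page in pages:
        if clock() > deadline:
            yield {"type": "error", "detail": "query timed out"}
            return
        yield {"type": "page", "rows": page}
    yield {"type": "complete"}


# A fake clock makes the behaviour deterministic: the third check is past the deadline.
fake_clock = iter([0.0, 1.0, 2.0, 100.0]).__next__
messages = list(stream_with_deadline([["a"], ["b"], ["c"]], max_seconds=10.0, clock=fake_clock))
```

Because the wrapper accepts any iterable of pages, the same function could back more than one endpoint, which is the kind of reuse the commit message describes.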

Move query streaming logic to its own file

96f5b94

After the refactor in the previous commit, this is somewhat independent of the query routes.

Move query streaming client code to its own file

e2e7ace

This will be shared by the RemoteButler query_all_datasets implementation in an upcoming commit.

Define a dataclass for query_all_datasets args

c8fd5f7

This will be used in an upcoming commit to prevent excessive duplication of function parameters between implementations of query_all_datasets.
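The general pattern can be sketched as follows. The class name `QueryAllDatasetsArgs`, its fields, and the two functions are hypothetical and chosen only to illustrate bundling parameters; they are not the definitions added in this commit.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class QueryAllDatasetsArgs:
    """Hypothetical bundle of query parameters shared by every implementation."""

    collections: tuple[str, ...]
    dataset_types: tuple[str, ...] = ("*",)
    where: str = ""
    limit: int | None = None


def query_locally(args: QueryAllDatasetsArgs) -> str:
    """Stand-in for a direct implementation: takes one argument instead of many."""
    return f"local: {len(args.dataset_types)} type(s) in {', '.join(args.collections)}"


def query_remotely(args: QueryAllDatasetsArgs) -> dict:
    """Stand-in for a remote implementation: converts the same args to a request body."""
    return {
        "collections": list(args.collections),
        "dataset_types": list(args.dataset_types),
        "where": args.where,
        "limit": args.limit,
    }


args = QueryAllDatasetsArgs(collections=("demo/collection",), where="visit = 1")
```

Adding a parameter then means changing the dataclass once rather than updating every function signature that forwards the arguments.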

dhirving force-pushed the tickets/DM-47375 branch from 07afc52 to 32c647e Compare November 12, 2024 21:15

Add back dimension records to QueryDatasets

3eb4836

It turns out the QueryDatasets class is shared by multiple CLI scripts, some of which need dimension records included. So add back `with_dimension_records` to the internal implementation of query_all_datasets.

dhirving marked this pull request as ready for review November 12, 2024 22:22
