Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-47375: Run query_all_datasets in a single request for RemoteButler #1114

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

dhirving
Copy link
Contributor

@dhirving dhirving commented Nov 4, 2024

Added a server-side endpoint to handle query_all_datasets in a single request. query_all_datasets can potentially involve hundreds or thousands of separate dataset queries, and we don't want clients slamming the server with that many HTTP requests.

The new endpoint streams results in the same manner as the existing query endpoints used by QueryDriver, but it is separate from the Query/QueryDriver framework.

This is not yet used in the CLI tools and Butler._query_all_datasets is still private -- we need to deploy an updated server with this change before we can release the client side.

--order-by in the CLI tools is now restricted to queries for a single dataset type -- future implementations of query_all_datasets may not support it.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

@dhirving dhirving changed the base branch from main to tickets/DM-45873 November 4, 2024 23:15
Copy link

codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 98.84170% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.44%. Comparing base (615b2ba) to head (3eb4836).

Files with missing lines Patch % Lines
python/lsst/daf/butler/_query_all_datasets.py 92.85% 1 Missing and 1 partial ⚠️
...on/lsst/daf/butler/remote_butler/_query_results.py 95.45% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1114      +/-   ##
==========================================
+ Coverage   89.41%   89.44%   +0.02%     
==========================================
  Files         363      366       +3     
  Lines       48440    48598     +158     
  Branches     5879     5890      +11     
==========================================
+ Hits        43315    43469     +154     
- Misses       3716     3717       +1     
- Partials     1409     1412       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dhirving dhirving force-pushed the tickets/DM-47375 branch 3 times, most recently from 03351be to 79c520d Compare November 5, 2024 22:54
Base automatically changed from tickets/DM-45873 to main November 8, 2024 23:31
Switch from the query_datasets convenience method to the advanced query system in query_all_datasets.

This lets us get the results one page at a time, which will be needed to prevent memory exhaustion when running these queries on the server.
It turns out that the query-datasets CLI was not actually using dimension records, and it will simplify the implementation to not support this.
The backend for querying multiple dataset types will not support "order by", so restrict the CLI to match the implementation.
The upcoming implementation of query_all_datasets will not support order_by, so remove it.  This requires modifying the query-datasets CLI to use the single dataset type query_datasets when order by needs to be supported.
In preparation for implementing query_all_datasets on the server, make the streaming response and timeout logic from the existing query handler re-usable.
After the refactor in the previous commit, this is somewhat independent of the query routes.
This will be shared by the RemoteButler query_all_datasets implementation in an upcoming commit.
This will be used in an upcoming commit to prevent excessive duplication of function parameters between implementations of query_all_datasets.
query_all_datasets can potentially involve hundreds or thousands of separate dataset queries.  We don't want clients slamming the server with that many HTTP requests, so add a server-side endpoint that can handle these queries in a single request.
It turns out the QueryDatasets class is shared by multiple CLI scripts, some of which need dimension records included.  So add back `with_dimension_records` to the internal implementation of query_all_datasets.
@dhirving dhirving marked this pull request as ready for review November 12, 2024 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant