DM-45993: Optimize DirectButlerCollections.query_info to avoid too many queries #1075

andy-slac · 2024-09-09T22:05:47Z

Direct butler now reimplements the query_info method to avoid multiple queries, which makes it significantly faster. This patch also adds two optional parameters to query_info to allow further optimizations. The fetch_summaries method is still inefficient when the number of potential collections is very large (for example, when the collection expression is *). Optimizing it further would probably need more work, and I think we will have to do that eventually because the number of collections grows every day.
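To illustrate the kind of saving described above, here is a minimal sketch that contrasts one query per collection with a single batched query. It is not the butler implementation: the `collection` table, its columns, and the helper names (`CountingConnection`, `docs_one_by_one`, `docs_batched`) are hypothetical and exist only for this example, which uses an in-memory SQLite database.

```python
import sqlite3


class CountingConnection:
    """Hypothetical wrapper that counts how many statements are executed."""

    def __init__(self, conn):
        self._conn = conn
        self.n_queries = 0

    def execute(self, sql, params=()):
        self.n_queries += 1
        return self._conn.execute(sql, params)


def make_db():
    """Build a toy database; this schema is an assumption, not the butler schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE collection (name TEXT PRIMARY KEY, doc TEXT)")
    conn.executemany(
        "INSERT INTO collection VALUES (?, ?)",
        [("run/a", "first run"), ("run/b", "second run"), ("run/c", None)],
    )
    return conn


def docs_one_by_one(conn, names):
    """Issue one SELECT per collection, so the query count grows with len(names)."""
    docs = {}
    for name in names:
        row = conn.execute(
            "SELECT doc FROM collection WHERE name = ?", (name,)
        ).fetchone()
        docs[name] = row[0] if row else None
    return docs


def docs_batched(conn, names):
    """Issue a single SELECT with an IN clause, regardless of len(names)."""
    names = list(names)
    if not names:
        return {}
    placeholders = ", ".join("?" for _ in names)
    rows = conn.execute(
        f"SELECT name, doc FROM collection WHERE name IN ({placeholders})", names
    ).fetchall()
    found = dict(rows)
    return {name: found.get(name) for name in names}
```

Both helpers return the same mapping. The difference is only in how many statements reach the database, which is what matters when the collection list is long.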

@dhirving, I added the same parameters to the remote butler interface, but they are not used for now. I know you are working on DM-46129; maybe you can add forwarding of those parameters to the remote server?

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes
(if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

dhirving · 2024-09-09T22:09:27Z

Sounds good. I'll rebase #1074 after this merges and add the new parameters.

codecov · 2024-09-09T22:19:25Z

Codecov Report

Attention: Patch coverage is 87.83784% with 9 lines in your changes missing coverage. Please review.

Project coverage is 89.66%. Comparing base (1d1bf7b) to head (76d34d6).
Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/lsst/daf/butler/script/queryDatasets.py | 60.00% | 2 Missing and 2 partials ⚠️ |
| python/lsst/daf/butler/_butler_collections.py | 81.81% | 1 Missing and 1 partial ⚠️ |
| ...butler/direct_butler/_direct_butler_collections.py | 91.66% | 1 Missing and 1 partial ⚠️ |
| ...on/lsst/daf/butler/script/queryDimensionRecords.py | 85.71% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1075   +/-   ##
=======================================
  Coverage   89.65%   89.66%           
=======================================
  Files         359      359           
  Lines       46885    46925   +40     
  Branches     9637     9650   +13     
=======================================
+ Hits        42036    42073   +37     
- Misses       3482     3485    +3     
  Partials     1367     1367


andy-slac · 2024-09-09T23:12:16Z

@timj, I have added a new private method to Butler.collections and changed query-data-ids, query-datasets, and query-dimension-records to use it.
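The thread does not show the new private method's name or signature, so the following is only a sketch of the general idea of grouping collections per dataset type using collection summaries. The function name `group_collections_by_dataset_type` and the plain dict-of-sets representation of summaries are assumptions made for illustration.

```python
from collections.abc import Iterable, Mapping


def group_collections_by_dataset_type(
    requested: Iterable[str],
    summaries: Mapping[str, Iterable[str]],
) -> dict[str, list[str]]:
    """Map each requested dataset type to the collections that may contain it.

    ``summaries`` is a hypothetical stand-in for collection summaries:
    it maps a collection name to the dataset type names recorded for it.
    """
    wanted = set(requested)
    grouped: dict[str, list[str]] = {}
    for collection, dataset_types in summaries.items():
        # Keep only the dataset types that were both requested and summarized.
        for dataset_type in wanted & set(dataset_types):
            grouped.setdefault(dataset_type, []).append(collection)
    return grouped
```

A caller could then query each dataset type against only its own short list of collections, and skip dataset types that do not appear in the result at all.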

timj

Looks great. Thanks for unifying the logic in query-dimension-records and query-data-ids as well.
It makes sense not to include the doc strings in queries by default.
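As a rough sketch of why an opt-in flag helps here: doc strings are fetched only when the caller asks for them, and then in one vectorized call. Only the `include_doc` parameter name comes from this PR. `InfoSketch`, `query_info_sketch`, and the `load_docs` callable are hypothetical stand-ins and do not reflect the real `query_info` signature.

```python
from collections.abc import Callable, Iterable, Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoSketch:
    """Hypothetical, simplified record describing one collection."""

    name: str
    doc: str = ""


def query_info_sketch(
    names: Iterable[str],
    load_docs: Callable[[list[str]], Mapping[str, str]],
    include_doc: bool = False,
) -> list[InfoSketch]:
    """Return one record per name, loading doc strings only on request.

    ``load_docs`` stands in for a single vectorized database call that
    returns doc strings for many collections at once.
    """
    names = list(names)
    # Skip the extra lookup entirely unless the caller opted in.
    docs = load_docs(names) if include_doc else {}
    return [InfoSketch(name=name, doc=docs.get(name, "")) for name in names]
```

With the default `include_doc=False` the loader is never called, so callers that only need names pay nothing for documentation they would discard.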


andy-slac force-pushed the tickets/DM-45993 branch from c88a4bb to bcba021 Compare September 9, 2024 23:06

timj approved these changes Sep 9, 2024


andy-slac and others added 5 commits September 9, 2024 21:54

Optimize DirectButlerCollections.query_info to avoid too many queries…

0b9f0e2

… (DM-45993) This drastically reduces the number of queries that query_info needs to run.

Update query-datasets script to limit number of collections in query.

4128540

Filter both dataset types and collections to query from collection summaries.

Enable vectorized loading of collection doc strings.

f1141a4

`query_info` now receives an optional `include_doc` parameter to allow explicit loading of doc strings.

Add new private filtering method to Butler.collections.

70b56d2

This allows more efficient filtering with per-dataset type list of collection names returned.

Apply review suggestion

76d34d6

Co-authored-by: Tim Jenness <[email protected]>

andy-slac force-pushed the tickets/DM-45993 branch from 4b25396 to 76d34d6 Compare September 10, 2024 04:55

andy-slac merged commit 5c4a71f into main Sep 10, 2024
17 of 18 checks passed

andy-slac deleted the tickets/DM-45993 branch September 10, 2024 05:24
