-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-41117: Improvements to caching and Butler initialization #904
Conversation
3e515c4
to
ebd9eb7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything here looks great; definitely a step in the right direction.
Pre-fetching of collection summaries was quite expensive and we did not use those summaries very often. Removing the cache completely, now we query summaries each time but only for the collections that are actually used.
`DefaultCollectionManager.refresh` now runs a single query to fetch full contents of collection_chain table. This removes update logic from collection record classes which became simple data classes now.
Static tables do not really need schema verification because we rely on version numbers in butler_attributes. Verification and reflection may still be usefule for dynamic tables (tags/calibs) but we now delay it until the tables are actually used.
ceede1c
to
b1d4cb8
Compare
python/lsst/daf/butler/registry/datasets/byDimensions/_manager.py
Outdated
Show resolved
Hide resolved
python/lsst/daf/butler/registry/datasets/byDimensions/summaries.py
Outdated
Show resolved
Hide resolved
f2e1c09
to
8227a57
Compare
0181a32
to
1e59418
Compare
@TallJimbo, sorry for another review request, I have added three more commits to try to improve things:
I'm not very happy with how it is structured now, cache for dataset types looks particularly ugly, but this is the best that I could do. As the result, number of queries is approximately the same in both QG building and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - I think it's time to wrap this one up and start one or more new tickets for further caching work, as I think this is a clear step in the right direction with some immediate benefits and no obvious downsides.
I do still think there's a ton of work ahead here, because the caching context interface implies a per-user cache and that'd be a massive pain to put on the server, and yet some of what we cache now can only go on the server (SQLAlchemy table objects) or needs to be on the server as a check even if it's also on the client (collection summaries). I think I'd advise some "step back and design it conceptually" work before further coding on this problem, though.
Typo: additiona
in commit message for 14721a2.
Adds special cache classes for collection and summary records and an additional structure that holds caches. New registry method is a context manager that enables caches temporarily for the duration of that context.
Unlike collection caches, dataset type cache is always on, this helps to reduce number of queries in `pipetask run` without the need to explicitly enable caching in multiple places.
e6bf834
to
eaa968d
Compare
Registry shiim now redirects to this method instead of Registry method. RemoteButler raises NotImplementedError but may do something non-trivial later when we know how caching is going to work with client/server.
A bunch of improvements for collection caching to reduce the need to cache all collections in memory. In some cases when we need to filter collections based on regexes we still need to fetch everything into memory, but in other cases we only read collections that are explicitly specified.
Similarly, collection summary cache is eliminated completely to avoid upfront loading which takes significant time. Now we are only fetching summaries for collections that are needed.
Table reflection is eliminated for static tables, and is postponed for dynamic (tags/calib) tables until they are used.
Checklist
doc/changes