DM-41158: (mostly) implement QueryDriver for DirectButler #915
Conversation
Looks good to me, just a bunch of nitpicks you can take or leave. (Like last time, GitHub is hiding some of the comments because there are too many.)
I really appreciated your extensive commenting in the hairier parts of query construction... I'm not going to claim that I fully understand all of the details, but I'm definitely closer than I was before.
Looks like it will be pretty easy to wrap the client/server version around this.
```diff
@@ -79,11 +80,14 @@ class _BaseColumnSpec(pydantic.BaseModel, ABC):
         description="Whether the column may be ``NULL``.",
     )

-    def to_sql_spec(self, **kwargs: Any) -> ddl.FieldSpec:
+    def to_sql_spec(self, name_shrinker: NameShrinker | None = None, **kwargs: Any) -> ddl.FieldSpec:
```
Should this `name_shrinker` argument be non-optional? I would think that if it is ever needed, it will always be needed... otherwise different parts of the code can think names are different, which creates the potential for certain columns to work with some types of queries/operations but not others.
I prefer to do that at a higher level, because the set of names that need shrinking is limited to only those constructed from dataset type names, and that keeps name-shrinker arguments from sprouting up in many more places.
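For context on what name shrinking does here: database backends cap identifier length (63 bytes for PostgreSQL), so columns whose names are built from long dataset type names must be shortened deterministically. The following is a minimal illustrative sketch of that idea only -- the `shrink` helper below is hypothetical and is not the real `NameShrinker` API from `daf_butler`:

```python
import hashlib


def shrink(name: str, max_length: int = 63) -> str:
    """Deterministically shorten a column name to fit a backend limit.

    Hypothetical sketch; the real NameShrinker in daf_butler may differ.
    Names already within the limit pass through unchanged, which is why
    callers that never produce long names need no shrinker at all.
    """
    if len(name) <= max_length:
        return name
    # Keep a recognizable prefix and append a short digest for uniqueness.
    digest = hashlib.sha1(name.encode()).hexdigest()[:8]
    return name[: max_length - 9] + "_" + digest
```

Because shrinking is deterministic, applying it only at the layer that builds dataset-type-derived names (as described above) keeps the rest of the code agnostic to it.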
```python
        # complicated to implement anyway.
        #
        # We start a transaction rather than just opening a connection both
        # to establish consistency guarantees throughout the query context and
```
What consistency guarantees do we get from the transaction? I thought Butler transactions were always READ COMMITTED, which is the same as what we'd get without the explicit transaction.
Yeah, there was some lazy thinking going on here; I had mostly been thinking about consistency of things in caches (which, as you reminded me in another thread, I hadn't actually done) and started conflating that with actual transaction consistency (which, as you note here, depends on isolation level). I'll drop the first clause of this sentence and just keep the second.
```python
TESTDIR = os.path.abspath(os.path.dirname(__file__))


class DirectButlerSQLiteTests(ButlerQueryTests, unittest.TestCase):
```
Should there be another instance of these tests for Postgres as well?
Yes, I meant to do that before I put it up for review, and apparently I forgot.
But having added it now, I see some failures in cursor/temp-table lifetimes. I wasn't able to figure it out quickly, so I've marked the tests that use temp tables as expected failures and will try to fix them on DM-43697.
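The stdlib mechanism for marking such known failures is `unittest.expectedFailure`; a generic sketch of the pattern (the class and test names here are hypothetical, not taken from the PR):

```python
import unittest


class TempTableTests(unittest.TestCase):
    """Hypothetical stand-in for the Postgres tests that use temp tables."""

    @unittest.expectedFailure
    def test_materialize(self) -> None:
        # Known cursor/temp-table lifetime issue; tracked on DM-43697.
        # The decorator records this as an expected failure instead of
        # failing the test run.
        raise RuntimeError("temp table dropped while cursor still open")
```

If the underlying bug is later fixed, the test starts passing and is reported as an unexpected success, which fails the run and flags that the decorator should be removed.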
```python
        self._universe = universe
        self._defaults = defaults
        self._materializations: dict[
            qt.MaterializationKey, tuple[sqlalchemy.Table, frozenset[str], Postprocessing]
```
Making this tuple a NamedTuple or frozen dataclass would make it self-documenting and less fiddly to interact with.
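A sketch of that suggestion. The field names are illustrative, not from the PR, and the real element types (`sqlalchemy.Table`, `Postprocessing`) are replaced with `Any` to keep the example self-contained:

```python
from typing import Any, NamedTuple


class MaterializationInfo(NamedTuple):
    """Self-documenting replacement for the anonymous
    (table, dataset_types, postprocessing) tuple.

    Field names are hypothetical; `Any` stands in for sqlalchemy.Table
    and Postprocessing from the real code.
    """

    table: Any
    dataset_types: frozenset[str]
    postprocessing: Any


info = MaterializationInfo(
    table=None, dataset_types=frozenset({"raw"}), postprocessing=None
)
# Fields are now addressable by name instead of positional index,
# while remaining usable anywhere a plain tuple is expected:
assert info.dataset_types == info[1]
```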
```python
        assert self._exit_stack is not None
        while self._cursors:
            _, cursor = self._cursors.popitem()
            cursor.close(exc_type, exc_value, traceback)
```
It might be a good idea to push this cursor closing onto the exit stack instead of directly calling close... that way the rest of cleanup will still occur even if closing a cursor raises an exception.
I tried something like that first, but got stuck being unable to pop a cursor out of the exit stack by name when it could be closed due to normal termination of iteration. But you're right that there's a resource management gap here. Adding a manual try/except here is nontrivial because we do want to re-raise at the end, so I'll punt this to DM-43697.
I was just thinking call `exit_stack.push(cursor)` right here instead of calling `cursor.close()`... `push()` calls `__exit__` without calling `__enter__`.
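This behavior of `contextlib.ExitStack.push` is easy to demonstrate with a dummy cursor class (illustrative only; the dummy records closes in a list instead of touching a database):

```python
from contextlib import ExitStack


class DummyCursor:
    """Stand-in for a database cursor; only __exit__ is defined,
    which is all push() requires."""

    def __init__(self, log: list[str], name: str) -> None:
        self.log = log
        self.name = name

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.log.append(f"closed {self.name}")


log: list[str] = []
with ExitStack() as stack:
    # push() registers each object's __exit__ without calling __enter__,
    # so cleanup runs even if an earlier close raises.
    stack.push(DummyCursor(log, "a"))
    stack.push(DummyCursor(log, "b"))

assert log == ["closed b", "closed a"]  # LIFO cleanup order
```

`ExitStack` also guarantees that every registered `__exit__` runs even when one of them raises, which is exactly the resource-management gap discussed above.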
```python
            else:
                result.append((record, self.managers.datasets.getCollectionSummary(record)))

        recurse(collections)
```
I think this same operation is already implemented with somewhat fewer queries at `self.managers.datasets.fetch_summaries`. The keys of the returned dict aren't `CollectionRecord`, but it could easily be changed so it is -- it's operating on `CollectionRecord` internally.
I'll look into this on DM-43698. We might want to replace both with that recursive CTE at the same time.
```python
    Keys are "logical tables" - dimension element names or dataset type names.
    """

    special: dict[str, sqlalchemy.ColumnElement[Any]] = dataclasses.field(default_factory=dict)
```
I'm slightly concerned about the way assignments are made to this throughout the code... seems like there is opportunity for one chunk of code to accidentally clobber something added by another chunk.
I didn't spot any cases where that is actually occurring, though.
Yeah, I really struggled with how much to encapsulate `QueryBuilder` and `QueryJoiner` state throughout this ticket, and after a couple of flip-flops I made myself stop in order to keep making progress overall. So there's room for improvement, and if you spot a way to improve it sometime while you're working nearby, feel free to give it a try; you may spot a possibility I didn't.
The only obvious idea I had was a dict that only allows one write to each key by default, so you at least have to explicitly specify clobbering. But yeah it's not an easy problem to solve.
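That idea could be sketched as a small `dict` subclass (illustrative only, not code from the PR):

```python
from typing import Any


class WriteOnceDict(dict[str, Any]):
    """Dict that rejects silent clobbering of existing keys.

    Overwriting must be explicit via force_set(). Illustrative sketch:
    a fuller version would also override update() and the constructor,
    which bypass __setitem__ in CPython.
    """

    def __setitem__(self, key: str, value: Any) -> None:
        if key in self:
            raise KeyError(f"{key!r} already set; use force_set to clobber")
        super().__setitem__(key, value)

    def force_set(self, key: str, value: Any) -> None:
        # Deliberate overwrite: visible at the call site, greppable later.
        super().__setitem__(key, value)
```

Using this for the `special` mapping would turn an accidental clobber into an immediate, easy-to-locate exception rather than a silently wrong query.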
```python
    def expect_timespan(self, expression: qt.ColumnExpression) -> TimespanDatabaseRepresentation:
        return cast(TimespanDatabaseRepresentation, expression.visit(self))
```
You might consider adding `isinstance` asserts to these `expect_` functions instead of casting... might save some debugging time if something of an unexpected type creeps in.
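A sketch of the suggested pattern, using a simplified stand-in for the real classes (the visitor machinery and `expression.visit(self)` call are elided; the function below just checks a value it is handed):

```python
class TimespanDatabaseRepresentation:
    """Simplified stand-in for the real daf_butler class."""


def expect_timespan(result: object) -> TimespanDatabaseRepresentation:
    # An isinstance assert fails loudly at the source of a type mix-up,
    # whereas typing.cast would silently pass the bad value along and
    # let it blow up somewhere far from the bug.
    assert isinstance(result, TimespanDatabaseRepresentation), (
        f"expected TimespanDatabaseRepresentation, got {type(result).__name__}"
    )
    return result
```

Unlike `cast`, which is a no-op at runtime, the assert costs one type check per call but converts a confusing downstream failure into an immediate one with a useful message.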
""" | ||
|
||
def __post_init__(self) -> None: | ||
assert not (self.distinct and self.group_by), "At most one of distinct and group_by can be set." |
This check should probably be in `select()` instead of `__post_init__`... most of the assignments to `distinct` and `group_by` occur after initialization.
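The distinction can be sketched with a simplified stand-in dataclass: validating in the consuming method sees the final state, whereas `__post_init__` only sees the constructor arguments and misses later assignments. Class and field semantics below are illustrative, not the PR's actual code:

```python
import dataclasses


@dataclasses.dataclass
class SelectParts:
    """Illustrative stand-in for the builder class under review."""

    distinct: bool = False
    group_by: tuple[str, ...] = ()

    def select(self) -> str:
        # Validate here, after all post-construction assignments are done;
        # a __post_init__ check would never see those later assignments.
        assert not (self.distinct and self.group_by), \
            "At most one of distinct and group_by can be set."
        return "SELECT DISTINCT ..." if self.distinct else "SELECT ..."


parts = SelectParts()
parts.distinct = True  # assignment after initialization is still checked
parts.select()
```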
```python
    included in raw SQL results.
    """

    name_shrinker: NameShrinker | None = None
```
I wonder if it would be easier to add a `name_shrinker` property to `Database` instead of worrying about creating them and transferring records between them here.
I'm not sure; I'd need to look for places where we might want the `NameShrinker` without having the `Database` it came from. If we don't have any such places (or they're easy to address some other way), this sounds like a good idea. But it's one for another ticket (I'll put it on DM-43697 for now).
I'd like to start working on the client/server version of this early next week, while it's still fresh in my mind. There's absolutely nothing in these review comments above the level of a petty nitpick, so since you've got a lot on your plate at the moment I think it would be fine to merge this down as-is.
It is very unlikely that we'll ever trip this guard.
This reduces the need to import more stuff from the queries subpackage in butlers.
Query already has 'constraint_dimensions' (the full dimensions that can be used for e.g. WHERE expressions), which is different from what 'dimensions' should have meant (the dimensions of the output objects) on the QueryResults classes.
Previous annotations rejected a lot of typical usage that would almost always be fine (because most DimensionElements are Dimensions and are hence usable directly as values).
SkyPix regions should always be computed on the fly in the client.
They're not public, and some of them clash with Registry symbols.
This test wanted the result objects from lsst.daf.butler.registry.queries for use in annotations, but it was getting them from lsst.daf.butler.queries and we didn't notice because we don't run MyPy on tests.
Co-authored-by: David H. Irving <[email protected]>
This has some failures that I've marked as expected until we try to fix them again on DM-43697.
I've inspected all of the missed-coverage lines and they fall into two categories:
- `DatasetRef` or `DataCoordinate` result support to be tested (which I'm deferring to another ticket);

Checklist
- added a release note for user-visible changes to doc/changes (N/A)