DM-45872: Make the new query system public #1068
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main    #1068      +/-   ##
==========================================
+ Coverage   89.54%   89.65%   +0.10%
==========================================
  Files         359      359
  Lines       46509    46671     +162
  Branches     9566     9597      +31
==========================================
+ Hits        41648    41844     +196
+ Misses       3511     3468      -43
- Partials     1350     1359       +9
parse_expression can in fact return None, as I've discovered from experience, contrary to its type annotations. By considering empty strings to evaluate to True, they'll get simplified away, which is consistent with the where string not being provided, and is a convenience at least for QG generation (as it's easier to implement than raising an exception).
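The behavior described above can be sketched roughly as follows (a hypothetical stand-in, not the real daf_butler parser; all names here are assumptions):

```python
def parse_where(where: str):
    """Hypothetical sketch: treat an empty 'where' string as trivially
    true, so it simplifies away just as if no 'where' had been given,
    rather than raising an exception."""
    if not where.strip():
        # Empty string behaves like an always-true predicate.
        return True
    # A real implementation would parse 'where' into an expression tree;
    # the tuple here is only a stand-in for that tree.
    return ("expr", where)
```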
No tests yet.
They are not needed because the default Butler implementation works fine with remote butler.
These are the advanced tests, using the simple interface where possible.
Also removed some duplicate code in retrieve-artifacts and transfer-from.
Do not include it in butler associate. This required a small rewrite of the table accumulator to use a dict rather than a set.
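A minimal sketch of why a dict can replace a set in an accumulator like this (hypothetical names, not the actual daf_butler code): a dict still deduplicates on the key in O(1), but it preserves first-seen order and lets each key carry the full row as its value.

```python
def accumulate_rows(rows):
    # Dict-based accumulator: dedup on a key tuple, keep the full row as
    # the value, and preserve first-seen order (guaranteed for dicts in
    # Python 3.7+) -- none of which a plain set of rows could do for
    # unhashable dict rows.
    seen: dict = {}
    for row in rows:
        key = (row["type"], row["run"], row["id"])
        seen.setdefault(key, row)
    return list(seen.values())
```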
  rows.append(row)
  dataset_table = AstropyTable(np.array(rows), names=columnNames, dtype=columnTypes)
- return sortAstropyTable(dataset_table, dimensions, ["type", "run"])
+ if sort:
+     return sortAstropyTable(dataset_table, dimensions, ["type", "run"])
I had no idea we had a separate sorting code path for `order_by` on the CLI that did it all in memory, and I'm not thrilled by that, in terms of the maintenance load and of keeping behavior consistent across the two code paths so we have the freedom to change the implementation behind the scenes (it already looks inconsistent in at least not supporting descending sorts with the `-identifier` syntax). I also have no idea why it's trying to sort by spatial columns (what does that even mean?), and the list of dimensions it takes is not the right way to work out spatial/temporal things (which may be dimension elements, not dimensions).
I do think it might make sense to do some in-memory sorting to make the output order deterministic when `order_by` is not given, but that's a lot simpler than reimplementing `order_by` in Python.
Fixing all that in the existing CLI commands is out of scope here, but I think I'd prefer that we not add new functionality that depends on it. For now, could we support `--order-by` in `query-datasets` only on single-dataset-type queries (which we can do in the DB), and revisit it for multiple-dataset-type queries when we have the new Python methods for that?
It's only ever sorting each table by dataset type, not trying to sort across dataset types. Nate P was trying to make the results come back in a consistent order each time, and that didn't seem to be a huge issue given that people shouldn't be querying for millions of rows on the command line. If the user specifies `--order-by`, no sorting happens here, so I think that's what you are asking for; and if someone asks to order by a dimension that is not part of the dataset type, the query breaks before any of this code is reached.
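The restriction proposed in this thread could look something like the following (a hedged sketch; the function and parameter names are assumptions, not the actual CLI code):

```python
def check_order_by(dataset_types, order_by):
    # Allow --order-by only when exactly one dataset type is queried, so
    # the ordering can be pushed down to the database instead of being
    # reimplemented in Python for multiple-dataset-type queries.
    if order_by and len(dataset_types) != 1:
        raise NotImplementedError(
            "--order-by is only supported for single-dataset-type queries"
        )
```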
Aggressive error handling + __getattr__ was masking this.
Now output begins appearing as soon as the first dataset type has been queried. The command no longer accumulates all the results in memory before writing them, so it feels more responsive and has a lower memory footprint. It does not paginate queries for a single dataset type.
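The streaming behavior described here can be sketched roughly like this (placeholder names, not the real implementation):

```python
def stream_results(dataset_types, run_query, write_table):
    # Query and write one dataset type at a time: output starts as soon
    # as the first query finishes, and only one table is held in memory
    # at any moment instead of all of them.
    for dataset_type in dataset_types:
        write_table(run_query(dataset_type))
```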
The ref will be the same so we need to store multiple URIs for that ref.
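A minimal sketch of the idea, with hypothetical names (not the actual datastore code): map each ref to a list of URIs instead of a single value.

```python
from collections import defaultdict

def collect_uris(ref_uri_pairs):
    # The same dataset ref can appear with several artifact URIs, so
    # accumulate a list per ref rather than overwriting a single URI.
    uris_by_ref: dict = defaultdict(list)
    for ref, uri in ref_uri_pairs:
        uris_by_ref[ref].append(uri)
    return dict(uris_by_ref)
```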
Without this change the chained datastore that is created and populated is not used in the actual chaining test, so the tests are not testing what they are stated to test.
Fixes the following mypy error, triggered by pydantic 2.9.0:
python/lsst/daf/butler/queries/expression_factory.py:468: error: Argument "field" to "DatasetFieldReference" has incompatible type "str"; expected "Literal['dataset_id', 'ingest_date', 'run', 'collection', 'timespan']" [arg-type]
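This class of mypy error can be reproduced in isolation; one common fix is to narrow the plain `str` to the `Literal` type with a `cast` after validating the value (a generic sketch of the pattern, not necessarily the fix used in this PR):

```python
from typing import Literal, cast

DatasetField = Literal["dataset_id", "ingest_date", "run", "collection", "timespan"]

def dataset_field_reference(field: DatasetField) -> str:
    # Stand-in for a function whose parameter is Literal-typed.
    return f"dataset.{field}"

name: str = "run"
# mypy rejects dataset_field_reference(name): a plain 'str' is not
# assignable to the Literal type. Narrowing with cast (after checking
# the value at runtime) satisfies the checker; cast is a no-op at runtime.
assert name in ("dataset_id", "ingest_date", "run", "collection", "timespan")
ref = dataset_field_reference(cast(DatasetField, name))
```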
This allows tests to run with and without composite disassembly.
This reverts commit a287d0b. After some thought it became clear that get_many_uris guarantees only a single URI per dataset ref regardless of whether a chained datastore is present, so this additional complexity is not needed. Tables are created for each component dataset type, so even with composite disassembly there will not be multiple URIs in a table for the same dataset ref.
Checklist
doc/changes
configs/old_dimensions