
DM-45872: Make the new query system public #1068

Merged: 30 commits into main from tickets/DM-45872, Sep 6, 2024
Conversation

@timj (Member) commented Sep 4, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

codecov bot commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 94.53552% with 20 lines in your changes missing coverage. Please review.

Project coverage is 89.65%. Comparing base (50c0a9e) to head (e588c74).
Report is 31 commits behind head on main.

| File with missing lines | Patch % | Lines missing |
| --- | --- | --- |
| python/lsst/daf/butler/script/queryDatasets.py | 89.04% | 3 missing, 5 partials |
| python/lsst/daf/butler/_butler.py | 93.87% | 0 missing, 3 partials |
| python/lsst/daf/butler/script/transferDatasets.py | 40.00% | 3 missing |
| python/lsst/daf/butler/script/exportCalibs.py | 0.00% | 2 missing |
| python/lsst/daf/butler/script/_associate.py | 50.00% | 1 missing |
| ...thon/lsst/daf/butler/script/certifyCalibrations.py | 0.00% | 1 missing |
| python/lsst/daf/butler/script/queryDatasetTypes.py | 0.00% | 1 missing |
| ...on/lsst/daf/butler/script/queryDimensionRecords.py | 50.00% | 0 missing, 1 partial |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1068      +/-   ##
==========================================
+ Coverage   89.54%   89.65%   +0.10%     
==========================================
  Files         359      359              
  Lines       46509    46671     +162     
  Branches     9566     9597      +31     
==========================================
+ Hits        41648    41844     +196     
+ Misses       3511     3468      -43     
- Partials     1350     1359       +9     


TallJimbo and others added 2 commits September 4, 2024 10:14
parse_expression can in fact return, as I've discovered from
experience, contrary to its type annotations.

By considering empty strings to evaluate to True, they'll get
simplified away, which is consistent with the where string not being
provided and is a convenience for QG generation at least (as it's easier
to implement than raising an exception).
@timj timj force-pushed the tickets/DM-45872 branch 3 times, most recently from 7a907f4 to 539e929 on September 4, 2024 17:50
- They are not needed because the default Butler implementation works fine with remote butler.
- These are the advanced tests, run with the simple interface where possible. Also removed some duplicate code in retrieve-artifacts and transfer-from.
- Do not include it in butler associate.
- This required a small rewrite of the table accumulator to use a dict rather than a set.
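The table-accumulator change mentioned above (a dict rather than a set) can be sketched as follows. This is an illustrative stand-in, not the actual butler code: a dict de-duplicates by key like a set does, but preserves insertion order and can carry an associated value per key.

```python
def accumulate_rows(refs_and_rows):
    """Accumulate table rows keyed by dataset ref, collapsing duplicates.

    ``refs_and_rows`` is an iterable of (ref_id, row) pairs; the first
    occurrence of each ref wins, and insertion order is preserved
    (dicts are insertion-ordered in Python 3.7+).
    """
    rows_by_ref = {}
    for ref_id, row in refs_and_rows:
        rows_by_ref.setdefault(ref_id, row)  # first occurrence wins
    return list(rows_by_ref.values())
```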
Review comments on python/lsst/daf/butler/_butler.py (outdated; resolved)
    rows.append(row)

    dataset_table = AstropyTable(np.array(rows), names=columnNames, dtype=columnTypes)
    if sort:
        return sortAstropyTable(dataset_table, dimensions, ["type", "run"])
Member:
I had no idea we had a separate sorting code path for order_by on the CLI that does it all in memory, and I'm not thrilled by that: it adds maintenance load, and we'd have to keep behavior consistent across the two code paths so that we retain the freedom to change the implementation behind the scenes (it already looks inconsistent, in at least not supporting descending sorts with the -identifier syntax). I also have no idea why it's trying to sort by spatial columns (what does that even mean?), and the list of dimensions it takes is not the right way to work out spatial/temporal things (which may be dimension elements, not dimensions).

I do think it might make sense to do some in-memory sorting to make output order deterministic when order_by is not given, but that's a lot simpler than reimplementing order_by in Python.

Fixing all that in the existing CLI commands is out of scope here, but I think I'd prefer we not add new functionality that depends on it here. For now, could we support --order-by in query-datasets only on single-dataset-type queries (which we can do in the DB), and revisit it for multiple-dataset-type queries when we have the new Python methods for that?

@timj (Member, Author):

It's only ever sorting each table by dataset type, not trying to sort across dataset types. Nate P was trying to make it so that the results were returned in a consistent order each time, and it didn't seem to be a huge issue given that people shouldn't be querying for millions of rows on the command line. If the user specifies --order-by, no sorting happens here, so I think that's what you are asking for; and if someone asks to order by a dimension that is not part of the dataset type, the query breaks before any of this code is hit.

Review comment on python/lsst/daf/butler/script/queryDatasets.py (outdated; resolved)
timj and others added 5 commits September 5, 2024 13:31
Aggressive error handling + __getattr__ was masking this.
Now output begins appearing as soon as the first dataset type
has been queried. Results are no longer accumulated in memory
before being written, so the command feels more responsive and
has a lower memory footprint.

Does not paginate queries for a single dataset type.
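The streaming behaviour described in this commit message can be sketched as a generator that yields results per dataset type instead of accumulating everything first. `stream_tables` and `run_query` are hypothetical stand-ins for the real per-dataset-type query calls, not actual butler APIs.

```python
from typing import Callable, Iterable, Iterator


def stream_tables(
    dataset_types: Iterable[str],
    run_query: Callable[[str], list],
) -> Iterator[tuple[str, list]]:
    """Yield (dataset_type, results) as each per-type query finishes.

    Output for the first dataset type can be written before the
    remaining queries run, keeping only one type's results in memory
    at a time.
    """
    for dataset_type in dataset_types:
        yield dataset_type, run_query(dataset_type)
```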
Without this change the chained datastore created and populated
is not used in the actual chaining test, and so the tests are
not testing what they are stated to test.
Fixes:

python/lsst/daf/butler/queries/expression_factory.py:468: error: Argument "field" to "DatasetFieldReference" has incompatible type "str";
expected "Literal['dataset_id', 'ingest_date', 'run', 'collection', 'timespan']"  [arg-type]

triggered by pydantic 2.9.0.
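The error quoted above is the usual mypy complaint when a plain `str` is passed where a `Literal[...]` is expected. A minimal reproduction and one common fix, with illustrative names rather than the actual butler classes: validate at runtime, then narrow the type with `typing.cast`.

```python
from typing import Literal, cast

# Illustrative field names matching the Literal quoted in the error.
DatasetFieldName = Literal["dataset_id", "ingest_date", "run", "collection", "timespan"]

_VALID_FIELDS = {"dataset_id", "ingest_date", "run", "collection", "timespan"}


def make_field_reference(field: str) -> DatasetFieldName:
    """Narrow a runtime string to the Literal type after validating it."""
    if field not in _VALID_FIELDS:
        raise ValueError(f"Unknown dataset field: {field!r}")
    # Safe after the runtime check above; satisfies mypy's arg-type check.
    return cast(DatasetFieldName, field)
```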
This allows tests to run with and without composite disassembly.
This reverts commit a287d0b.

After some thought it became clear that get_many_uris guarantees
only a single URI per dataset ref, regardless of whether a
chained datastore is present, so this additional complexity is
not needed. Tables are created for each component dataset type,
so even with composite disassembly there will not be multiple
URIs in a table for the same dataset ref.
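The invariant described in this commit message can be illustrated with a small sketch (hypothetical data shapes, not the actual butler code): one table per dataset type, keyed by dataset ref, so a table can never hold more than one URI for the same ref.

```python
def tables_by_dataset_type(uris):
    """Group (dataset_type, ref_id, uri) triples into per-type tables.

    Each table maps ref_id -> uri, so by construction a table holds at
    most one URI per dataset ref; disassembled components land in their
    own per-component-type tables.
    """
    tables: dict[str, dict[str, str]] = {}
    for dataset_type, ref_id, uri in uris:
        table = tables.setdefault(dataset_type, {})
        table[ref_id] = uri  # one URI per ref within each table
    return tables
```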
@timj timj merged commit 767a556 into main Sep 6, 2024
18 checks passed
@timj timj deleted the tickets/DM-45872 branch September 6, 2024 19:42