
DM-45872: Make the new query system public #1068

Merged: 30 commits into main from tickets/DM-45872, Sep 6, 2024
Conversation

@timj (Member) commented Sep 4, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

codecov bot commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 94.53552% with 20 lines in your changes missing coverage. Please review.

Project coverage is 89.65%. Comparing base (50c0a9e) to head (e588c74).
Report is 31 commits behind head on main.

| File with missing lines | Patch % | Lines missing |
| --- | --- | --- |
| python/lsst/daf/butler/script/queryDatasets.py | 89.04% | 3 missing, 5 partials |
| python/lsst/daf/butler/_butler.py | 93.87% | 0 missing, 3 partials |
| python/lsst/daf/butler/script/transferDatasets.py | 40.00% | 3 missing |
| python/lsst/daf/butler/script/exportCalibs.py | 0.00% | 2 missing |
| python/lsst/daf/butler/script/_associate.py | 50.00% | 1 missing |
| ...thon/lsst/daf/butler/script/certifyCalibrations.py | 0.00% | 1 missing |
| python/lsst/daf/butler/script/queryDatasetTypes.py | 0.00% | 1 missing |
| ...on/lsst/daf/butler/script/queryDimensionRecords.py | 50.00% | 0 missing, 1 partial |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1068      +/-   ##
==========================================
+ Coverage   89.54%   89.65%   +0.10%     
==========================================
  Files         359      359              
  Lines       46509    46671     +162     
  Branches     9566     9597      +31     
==========================================
+ Hits        41648    41844     +196     
+ Misses       3511     3468      -43     
- Partials     1350     1359       +9     


TallJimbo and others added 2 commits September 4, 2024 10:14
parse_expression can in fact return, as I've discovered from
experience, contrary to its type annotations.

By considering empty strings to evaluate to True, they'll get
simplified away, which is consistent with the where string not being
provided and is a convenience for QG generation at least (as it's easier
to implement than raising an exception).
@timj timj force-pushed the tickets/DM-45872 branch 3 times, most recently from 7a907f4 to 539e929 on September 4, 2024 17:50
- They are not needed because the default Butler implementation works fine with remote butler.
- These are the advanced tests, run with the simple interface where possible. Also removed some duplicate code in retrieve-artifacts and transfer-from.
- Do not include it in butler associate.
- This required a small rewrite of the table accumulator to use a dict rather than a set.
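The table-accumulator change mentioned above (a dict rather than a set) can be sketched as follows. This is an illustrative stand-in, not the actual butler code: a dict de-duplicates by key like a set does, but preserves insertion order and can carry an associated value per key.

```python
def accumulate_rows(refs_and_rows):
    """Accumulate table rows keyed by dataset ref, collapsing duplicates.

    ``refs_and_rows`` is an iterable of (ref_id, row) pairs; the first
    occurrence of each ref wins, and insertion order is preserved
    (dicts are insertion-ordered in Python 3.7+).
    """
    rows_by_ref = {}
    for ref_id, row in refs_and_rows:
        rows_by_ref.setdefault(ref_id, row)  # first occurrence wins
    return list(rows_by_ref.values())
```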
Review comments on python/lsst/daf/butler/_butler.py (outdated; resolved)
    rows.append(row)

    dataset_table = AstropyTable(np.array(rows), names=columnNames, dtype=columnTypes)
    if sort:
        return sortAstropyTable(dataset_table, dimensions, ["type", "run"])
Member:
I had no idea we had a separate sorting code path for order_by on the CLI that does it all in memory, and I'm not thrilled by that: it adds maintenance load, and we'd have to keep behavior consistent across the two code paths so that we retain the freedom to change the implementation behind the scenes (it already looks inconsistent, in at least not supporting descending sorts with the -identifier syntax). I also have no idea why it's trying to sort by spatial columns (what does that even mean?), and the list of dimensions it takes is not the right way to work out spatial/temporal things (which may be dimension elements, not dimensions).

I do think it might make sense to do some in-memory sorting to make output order deterministic when order_by is not given, but that's a lot simpler than reimplementing order_by in Python.

Fixing all that in the existing CLI commands is out of scope here, but I think I'd prefer we not add new functionality that depends on it here. For now, could we support --order-by in query-datasets only on single-dataset-type queries (which we can do in the DB), and revisit it for multiple-dataset-type queries when we have the new Python methods for that?

@timj (Member, Author):

It's only ever sorting each table by dataset type, not trying to sort across dataset types. Nate P was trying to make it so that the results were returned in a consistent order each time, and it didn't seem to be a huge issue given that people shouldn't be querying for millions of rows on the command line. If the user specifies --order-by, no sorting happens here, so I think that's what you are asking for; and if someone asks to order by a dimension that is not part of the dataset type, the query breaks before any of this code is hit.

Review comment on python/lsst/daf/butler/script/queryDatasets.py (outdated; resolved)
timj and others added 5 commits September 5, 2024 13:31
Aggressive error handling + __getattr__ was masking this.
Now output begins appearing as soon as the first dataset type
has been queried. Results are no longer accumulated in memory
before being written, so the command feels more responsive and
has a lower memory footprint.

Does not paginate queries for a single dataset type.
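The streaming behaviour described in this commit message can be sketched as a generator that yields results per dataset type instead of accumulating everything first. `stream_tables` and `run_query` are hypothetical stand-ins for the real per-dataset-type query calls, not actual butler APIs.

```python
from typing import Callable, Iterable, Iterator


def stream_tables(
    dataset_types: Iterable[str],
    run_query: Callable[[str], list],
) -> Iterator[tuple[str, list]]:
    """Yield (dataset_type, results) as each per-type query finishes.

    Output for the first dataset type can be written before the
    remaining queries run, keeping only one type's results in memory
    at a time.
    """
    for dataset_type in dataset_types:
        yield dataset_type, run_query(dataset_type)
```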
Without this change the chained datastore created and populated
is not used in the actual chaining test, and so the tests are
not testing what they are stated to test.
Fixes:

python/lsst/daf/butler/queries/expression_factory.py:468: error: Argument "field" to "DatasetFieldReference" has incompatible type "str";
expected "Literal['dataset_id', 'ingest_date', 'run', 'collection', 'timespan']"  [arg-type]

triggered by pydantic 2.9.0.
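The error quoted above is the usual mypy complaint when a plain `str` is passed where a `Literal[...]` is expected. A minimal reproduction and one common fix, with illustrative names rather than the actual butler classes: validate at runtime, then narrow the type with `typing.cast`.

```python
from typing import Literal, cast

# Illustrative field names matching the Literal quoted in the error.
DatasetFieldName = Literal["dataset_id", "ingest_date", "run", "collection", "timespan"]

_VALID_FIELDS = {"dataset_id", "ingest_date", "run", "collection", "timespan"}


def make_field_reference(field: str) -> DatasetFieldName:
    """Narrow a runtime string to the Literal type after validating it."""
    if field not in _VALID_FIELDS:
        raise ValueError(f"Unknown dataset field: {field!r}")
    # Safe after the runtime check above; satisfies mypy's arg-type check.
    return cast(DatasetFieldName, field)
```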
This allows tests to run with and without composite disassembly.
This reverts commit a287d0b.

After some thought it became clear that get_many_uris guarantees
only a single URI per dataset ref, regardless of whether a
chained datastore is present, so this additional complexity is
not needed. Tables are created for each component dataset type,
so even with composite disassembly there will not be multiple
URIs in a table for the same dataset ref.
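The invariant described in this commit message can be illustrated with a small sketch (hypothetical data shapes, not the actual butler code): one table per dataset type, keyed by dataset ref, so a table can never hold more than one URI for the same ref.

```python
def tables_by_dataset_type(uris):
    """Group (dataset_type, ref_id, uri) triples into per-type tables.

    Each table maps ref_id -> uri, so by construction a table holds at
    most one URI per dataset ref; disassembled components land in their
    own per-component-type tables.
    """
    tables: dict[str, dict[str, str]] = {}
    for dataset_type, ref_id, uri in uris:
        table = tables.setdefault(dataset_type, {})
        table[ref_id] = uri  # one URI per ref within each table
    return tables
```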
@timj timj merged commit 767a556 into main Sep 6, 2024
18 checks passed
@timj timj deleted the tickets/DM-45872 branch September 6, 2024 19:42