Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-44875: Handle ambiguous calibration lookups on older postgres #1029

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dhirving
Copy link
Contributor

For Postgres older than version 16, we now throw an error for find-first calibration searches instead of silently returning a potentially-incorrect result. This is similar to the behavior of the old query system.

The query requires the ANY_VALUE aggregate function to generate the validity match count column used by postprocessing to check for ambiguous results. This function is not available on older Postgres.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Fix an issue where using order_by in the new query system for a data ID query or dataset query would sometimes fail with the postgres error "SELECT DISTINCT ON expressions must match initial ORDER BY expressions".

This was occurring because the data ID de-duplication logic sometimes uses DISTINCT ON.
 Postgres requires that the leftmost ORDER BY expressions match the DISTINCT ON clause, and we were not enforcing that.
For Postgres older than version 16, we now throw an error for find-first calibration searches instead of silently returning a potentially-incorrect result.   This is similar to the behavior of the old query system.

The query requires the ANY_VALUE aggregate function to generate the validity match count column used by postprocessing to check for ambiguous results.  This function is not available on older Postgres.
Copy link

codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 82.60870% with 4 lines in your changes missing coverage. Please review.

Project coverage is 89.39%. Comparing base (aadb42e) to head (7a73223).
Report is 23 commits behind head on main.

Files Patch % Lines
...hon/lsst/daf/butler/direct_query_driver/_driver.py 66.66% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1029      +/-   ##
==========================================
+ Coverage   89.38%   89.39%   +0.01%     
==========================================
  Files         358      358              
  Lines       45473    45481       +8     
  Branches     9346     9347       +1     
==========================================
+ Hits        40648    40660      +12     
+ Misses       3523     3522       -1     
+ Partials     1302     1299       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -715,11 +717,11 @@ def apply_query_projection(
unique_keys.append(builder.joiner.fields[dataset_type]["collection_key"])
else:
derived_fields.append((dataset_type, "collection_key"))
elif dataset_field == "timespan" and plan.datasets[dataset_type].is_calibration_search:
elif dataset_field == "timespan" and is_calibration_search:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dataset_field == "timespan" case was already not covered before I got here. I wanted to add a test but can't figure out what circumstances would trigger it, since the validity range comparison is handled in-DB, and DatasetRefs don't include the calibration collection timespan field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this can't be tested until we add support for "general" results.

@dhirving dhirving marked this pull request as ready for review June 27, 2024 22:16
# If we're doing a non-find-first query against a
# CALIBRATION collection, the timespan is also a unique
# key...
if dataset_type == plan.find_first_dataset:
if is_find_first_search:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I realized last night that we need to add dataset_id to unique keys for non-find-first search, I'll do that today before you review this for real hopefully

@dhirving dhirving marked this pull request as draft June 28, 2024 18:12
Base automatically changed from tickets/DM-44868 to main July 2, 2024 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants