DM-45680: Allow boolean columns to be used in query 'where' #1051

dhirving · 2024-08-08T23:51:52Z

Fix an issue where boolean metadata columns (like exposure.can_see_sky and exposure.has_simulated) were not usable in "where" clauses for query functions.

These column names can now be used as a boolean expression, for example where="exposure.can_see_sky or where="NOT exposure.can_see_sky".

In the new query system's ExpressionFactory, you can convert a boolean column to a usable boolean Predicate by calling as_boolean().

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes
(if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

codecov · 2024-08-09T00:09:10Z

Codecov Report

Attention: Patch coverage is 93.04348% with 8 lines in your changes missing coverage. Please review.

Project coverage is 89.44%. Comparing base (8e57637) to head (7aa86be).

Files	Patch %	Lines
...hon/lsst/daf/butler/queries/_expression_strings.py	84.21%	0 Missing and 3 partials ⚠️
python/lsst/daf/butler/queries/tree/_predicate.py	81.25%	2 Missing and 1 partial ⚠️
...thon/lsst/daf/butler/queries/expression_factory.py	89.47%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1051      +/-   ##
==========================================
+ Coverage   89.40%   89.44%   +0.04%     
==========================================
  Files         360      360              
  Lines       45733    45844     +111     
  Branches     9382     9406      +24     
==========================================
+ Hits        40888    41006     +118     
+ Misses       3519     3506      -13     
- Partials     1326     1332       +6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

TallJimbo · 2024-08-09T14:51:37Z

python/lsst/daf/butler/queries/tree/_predicate.py

+    @classmethod
+    def from_bool_expression(cls, value: ColumnExpression) -> Predicate:
+        """Construct a predicate that wraps a boolean ColumnExpression, taking
+        on the value of the underlying ColumnExpression.


I had hoped to never have to allow a boolean ColumnExpression to be constructed (going to Predicate immediately instead). But I think I've figured out why you didn't do that, and I agree with it: we'd need Predicate-subclass versions of DimensionFieldReference and DatasetFieldReference, and we'd need to include them in all of the cases where column references of other types already appear, and that'd be a much bigger problem. Is that right?

If so, I think instead we need to be extra vigilant about putting those column expressions in their predicate wrappers as early as possible, and keeping boolean operators out of those column expressions, because that would subvert all of the boolean-expression normalization and reasoning we do. I suspect typing is already going to do prevent problems pretty effectively, since we don't have any boolean-operator expressions, but we could lock it down a bit further by declaring that value here (and in BooleanWrapper) must be ColumnReference, not merely ColumnExpression.

My thinking wasn't actually that involved, I just observed that:

Boolean ColumnExpressions already existed, and it didn't occur to me to go out of my way to create a separate representation for boolean SQL columns than the one used for int or string.

Most of the other predicates worked by wrapping one or more column expressions with a little bit of information specifying how they convert to booleans -- and "no-op" seemed like a reasonable addition to the set of conversions.

So I just went with the grain of what was already there and it slotted in easily with minimal code.

I do share your concern about this problem potentially coming back since any place that references a boolean column will need to do the same thing, but that's sort of inherent to treating booleans as a special case. Changing the type is a good suggestion, I'll do that.

I suspect handling boolean_column = NULL is going to get ugly though so I may need to figure something else out -- or just not support it, or special case it to only work for a single boolean column and not arbitrary boolean expressions.

dhirving · 2024-08-10T00:22:41Z

python/lsst/daf/butler/queries/expression_factory.py

+    def is_null(self) -> tree.Predicate:
+        return ResolvedScalarExpressionProxy(self._boolean_expression).is_null
+
+    def as_boolean(self) -> tree.Predicate:


I don't like forcing users to call as_boolean() to get a Predicate out of this, but I didn't really have a better idea.

We can't just return a Predicate directly from DimensionElementProxy.__getattr__ because it would completely destroy the static typing. The return type would become ScalarExpressionProxy | Predicate and those types have almost no methods in common.

This at least throws a semi-helpful runtime error if you fail to call as_boolean() or attempt to do a comparison like boolean_column == 0.

I really wish Python typing had something like Any[A | B], where it'd happily do an implicit cast to one of the options you give it without losing the typing entirely (the way Any does).

I'm not convinced the static typing is so important here that it's worth doing even slight harm to the runtime interface - I'm imagining users in Jupyter notebooks doing tab completion as the most important use case for ExpressionFactory. But I do expect boolean columns to be very rare, and this is a small change that's easy to revisit later, and the same cannot be said of trying to put the entire Predicate interface into ScalarExpressionProxy.

Fix an issue where boolean metadata columns (like exposure.can_see_sky and exposure.has_simulated) were not usable in "where" clauses for Registry query functions. These column names can now be used as a boolean expression, for example `where="exposure.can_see_sky"` or `where="NOT exposure.can_see_sky"`.

We need roughly the same test for both registry and the new query system, so move it out of the registry tests to a place where we have access to both query systems.

Add an as_boolean() method to the expression proxies used to reference dimension fields, so boolean-valued fields can be converted to Predicate instances that can be used to constrain a query.

Boolean columns in the DB are always nullable (and exposure.can_see_sky has many NULLs in the real databases.) So add a way to search for them in the query system.

Add support for `= NULL` and `!= NULL` syntax for boolean columns in query where strings.

TallJimbo · 2024-08-12T17:56:47Z

python/lsst/daf/butler/queries/expression_factory.py

+    def is_null(self) -> tree.Predicate:
+        return ResolvedScalarExpressionProxy(self._boolean_expression).is_null
+
+    def as_boolean(self) -> tree.Predicate:


I really wish Python typing had something like Any[A | B], where it'd happily do an implicit cast to one of the options you give it without losing the typing entirely (the way Any does).

I'm not convinced the static typing is so important here that it's worth doing even slight harm to the runtime interface - I'm imagining users in Jupyter notebooks doing tab completion as the most important use case for ExpressionFactory. But I do expect boolean columns to be very rare, and this is a small change that's easy to revisit later, and the same cannot be said of trying to put the entire Predicate interface into ScalarExpressionProxy.

dhirving force-pushed the tickets/DM-45680 branch from 56f222d to 5b4498c Compare August 8, 2024 23:52

TallJimbo reviewed Aug 9, 2024

View reviewed changes

dhirving force-pushed the tickets/DM-45680 branch from 8c4c852 to 70730b3 Compare August 10, 2024 00:17

dhirving changed the title ~~DM-45680: Allow boolean columns to be used in where strings~~ DM-45680: Allow boolean columns to be used in query 'where' Aug 10, 2024

dhirving commented Aug 10, 2024

View reviewed changes

dhirving force-pushed the tickets/DM-45680 branch from 70730b3 to 128cf76 Compare August 10, 2024 00:32

dhirving added 6 commits August 9, 2024 17:36

Move boolean column tests to queries test file

3373bbc

We need roughly the same test for both registry and the new query system, so move it out of the registry tests to a place where we have access to both query systems.

Allow boolean columns to be used in ExpressionFactory

89c4334

Add an as_boolean() method to the expression proxies used to reference dimension fields, so boolean-valued fields can be converted to Predicate instances that can be used to constrain a query.

Add ExpressionFactory support for null boolean columns

d740e47

Boolean columns in the DB are always nullable (and exposure.can_see_sky has many NULLs in the real databases.) So add a way to search for them in the query system.

Add expression string syntax for null boolean columns

ff10942

Add support for `= NULL` and `!= NULL` syntax for boolean columns in query where strings.

Add test coverage for query expression NULL syntax

7aa86be

dhirving force-pushed the tickets/DM-45680 branch from 128cf76 to 7aa86be Compare August 10, 2024 00:36

dhirving marked this pull request as ready for review August 10, 2024 00:50

TallJimbo approved these changes Aug 12, 2024

View reviewed changes

dhirving merged commit a0830e1 into main Aug 12, 2024
18 checks passed

dhirving deleted the tickets/DM-45680 branch August 12, 2024 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-45680: Allow boolean columns to be used in query 'where' #1051

DM-45680: Allow boolean columns to be used in query 'where' #1051

dhirving commented Aug 8, 2024 •

edited

Loading

codecov bot commented Aug 9, 2024 •

edited

Loading

TallJimbo Aug 9, 2024

dhirving Aug 9, 2024

dhirving Aug 10, 2024 •

edited

Loading

TallJimbo Aug 12, 2024

TallJimbo Aug 12, 2024

DM-45680: Allow boolean columns to be used in query 'where' #1051

DM-45680: Allow boolean columns to be used in query 'where' #1051

Conversation

dhirving commented Aug 8, 2024 • edited Loading

Checklist

codecov bot commented Aug 9, 2024 • edited Loading

Codecov Report

TallJimbo Aug 9, 2024

Choose a reason for hiding this comment

dhirving Aug 9, 2024

Choose a reason for hiding this comment

dhirving Aug 10, 2024 • edited Loading

Choose a reason for hiding this comment

TallJimbo Aug 12, 2024

Choose a reason for hiding this comment

TallJimbo Aug 12, 2024

Choose a reason for hiding this comment

dhirving commented Aug 8, 2024 •

edited

Loading

codecov bot commented Aug 9, 2024 •

edited

Loading

dhirving Aug 10, 2024 •

edited

Loading