[FEAT]: Support `.clip` function #3136

conradsoon · 2024-10-28T13:39:26Z

Closes #1907.

TODO:

Write tests
DaftSQL integration
Docs

TESTS:

conradsoon · 2024-10-28T13:50:12Z

Hey @colin-ho, I've made a rough draft of the PR. It's not complete yet (I still need to add tests), but the functionality seems correct.

Some things I'd like to ask for direction on:

I've actually added binary_min and binary_max as functions as well and expressed the clip in terms of these functions. The binary_min and binary_max use Rust's native min and max functions. Does this approach make sense, or should I make clip use Rust's native clamp function?
What kind of behaviour do we want when max < min in the case of .clip? I've followed numpy's implementation (and therefore its semantics), where the result is an array filled entirely with max, but it seems Rust's native clamp panics instead?
Should I keep the exposed names as binary_min and binary_max? Or should I follow numpy and keep it as min and max (even though this meaning is kind of overloaded)?
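
To make the max < min question concrete, here is a minimal sketch of the two candidate behaviours using only the Rust standard library. The function names `clip_min_then_max` and `clip_checked` are hypothetical and are not taken from the PR; `i64` is used only to keep the example short.

```rust
/// NumPy-style clip: apply the lower bound, then the upper bound.
/// When `hi < lo`, every input collapses to `hi`.
fn clip_min_then_max(v: i64, lo: i64, hi: i64) -> i64 {
    v.max(lo).min(hi)
}

/// Checked clip: reject inverted bounds instead of silently returning `hi`.
/// `Ord::clamp` panics when `lo > hi`, so the check runs before calling it.
fn clip_checked(v: i64, lo: i64, hi: i64) -> Result<i64, String> {
    if lo > hi {
        return Err(format!("min ({lo}) is greater than max ({hi})"));
    }
    Ok(v.clamp(lo, hi))
}
```

With inverted bounds such as `lo = 5, hi = 2`, the first function returns `2` for any input, while the second returns an `Err` that the caller can surface.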

codspeed-hq · 2024-10-28T13:55:11Z

CodSpeed Performance Report

Merging #3136 will not alter performance

Comparing conradsoon:feat-clip (e09720e) with main (3cef614)

Summary

✅ 17 untouched benchmarks

colin-ho · 2024-10-29T16:01:40Z

Let's use clamp for simplicity and performance. Performing the clamp in a single pass can skip the greater-than check when a value is already below the lower bound, whereas doing min then max always performs both the less-than and the greater-than check for each value.
Throw an error if max < min.
Ideally we should expose a single clip expression, but give the user the flexibility to clip with only an upper bound, only a lower bound, or both (i.e. if the upper bound is None, then we just do a clamp_min). Also, the num_traits crate has clamp, clamp_min, and clamp_max convenience functions that work for PartialOrd.
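
A minimal sketch of what a single clip entry point with optional bounds could look like is shown below. It uses only the standard library: the local `clamp_min` and `clamp_max` helpers are stand-ins written for this example (the comment above points to the num_traits equivalents), and the `clip` signature and error type are assumptions, not the PR's actual code.

```rust
/// Local stand-in for a lower-bound-only clamp (assumed helper).
fn clamp_min<T: PartialOrd>(v: T, lo: T) -> T {
    if v < lo { lo } else { v }
}

/// Local stand-in for an upper-bound-only clamp (assumed helper).
fn clamp_max<T: PartialOrd>(v: T, hi: T) -> T {
    if v > hi { hi } else { v }
}

/// Single entry point: `None` on either side means "unbounded on that side".
fn clip<T: PartialOrd + Copy>(v: T, lo: Option<T>, hi: Option<T>) -> Result<T, String> {
    match (lo, hi) {
        // Inverted bounds are reported as an error rather than a panic.
        (Some(l), Some(h)) if l > h => Err("min is greater than max".to_string()),
        // Single pass: the upper-bound check is skipped when `v < l`.
        (Some(l), Some(h)) => Ok(if v < l { l } else if v > h { h } else { v }),
        (Some(l), None) => Ok(clamp_min(v, l)),
        (None, Some(h)) => Ok(clamp_max(v, h)),
        (None, None) => Ok(v),
    }
}
```

For example, `clip(7, Some(0), Some(5))` yields `Ok(5)` and `clip(-3, Some(0), None)` yields `Ok(0)`.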

conradsoon · 2024-11-01T09:08:14Z

Hey @colin-ho, have made the requested changes:

Now we explicitly throw an error if we try to .clip with a max < min.
We perform clamp in a single-pass now, rather than calling max followed by min.
Passing None as one of the arguments (or having a null value in one of the rows) results in not bounding for the relevant side.
Cleaned up some of the typing logic with num_traits (thanks for the recommendation).

Could I ask for your thoughts on these questions:

Should we also support .clip for any datatype whose physical type is clampable (e.g. DateTime, Timestamp)? Or should this be on a case-by-case basis (if so, which types do you think make sense to support)?
Currently, the kernel that I use to .clip rows has some pattern-matching logic that checks whether the min or max bound is null. I did it this way to support calling .clip with entire columns instead of just single values (which requires selective bounding depending on the row values). Are there any performance concerns with doing it this way, and is there a better way?

colin-ho · 2024-11-03T22:41:09Z

Let's stick with just numeric types for this PR

colin-ho · 2024-11-03T23:16:21Z

src/daft-core/src/array/ops/clip.rs

+fn clamp_helper<T: PartialOrd + Copy>(
+    value: Option<&T>,
+    left_bound: Option<&T>,
+    right_bound: Option<&T>,
+) -> Option<T> {
+    match (value, left_bound, right_bound) {
+        (None, _, _) => None,
+        (Some(v), Some(l), Some(r)) => {
+            assert!(l <= r, "Left bound is greater than right bound");
+            Some(clamp(*v, *l, *r))
+        }
+        (Some(v), Some(l), None) => Some(clamp_min(*v, *l)),
+        (Some(v), None, Some(r)) => Some(clamp_max(*v, *r)),
+        (Some(v), None, None) => Some(*v),
+    }
+}


The key observation we can leverage here for better performance is that the result of the clamp is None whenever the original value is None. So instead of as_arrow.iter() you can use as_arrow.values_iter(), which returns an iterator over all the values, ignoring the validity. This is fine because we slap on the validity of the original array anyway. The very small benefit is that it reduces the number of match branches, I think only by 1 or so.

Unfortunately we can't do this for left and right though, because we need to account for their validity.

But in a case like (array_size, 1, rbound_size) and the single left_bound is not None, you only need 1 validity check per row! i.e. for the right_bound (because you are using values_iter for the array, and your left bound is a non-null scalar).

Lastly, and probably the most important, in the case of (_, 1, 1) you can probably do something like

let left = left_bound.get(0);
let right = right_bound.get(0);
if let Some(left) = left && let Some(right) = right {
    self.apply(|v| clamp(v, left, right))
} else if let Some(left) = left {
    self.apply(|v| clamp_min(v, left))
} else if let Some(right) = right {
    self.apply(|v| clamp_max(v, right))
} else {
    Ok(Self::full_null(self.name(), self.data_type(), self.len()))
}
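
The idea in this comment (iterate over raw values, reuse the original validity, and branch on the scalar bounds once outside the row loop) can be sketched in a self-contained way as follows. `NullableArray` is a hypothetical stand-in invented for this example and is not Daft's or Arrow's array type; the method names are likewise assumptions. One deliberate difference: when both bounds are null this sketch returns the values unchanged, following the "None means unbounded" behaviour described earlier in the thread, whereas the snippet above returns an all-null array.

```rust
/// Hypothetical nullable array: dense values plus one validity flag per row
/// (true = non-null). Slots behind a `false` flag hold arbitrary values.
struct NullableArray {
    values: Vec<i64>,
    validity: Vec<bool>,
}

impl NullableArray {
    /// Apply `f` to every stored value without looking at validity, then
    /// reuse the original validity for the output.
    fn apply(&self, f: impl Fn(i64) -> i64) -> NullableArray {
        NullableArray {
            values: self.values.iter().map(|&v| f(v)).collect(),
            validity: self.validity.clone(),
        }
    }

    /// Scalar fast path: the bounds are inspected once, not once per row.
    fn clip_scalar(&self, lo: Option<i64>, hi: Option<i64>) -> Result<NullableArray, String> {
        match (lo, hi) {
            (Some(l), Some(h)) if l > h => Err("min is greater than max".to_string()),
            (Some(l), Some(h)) => Ok(self.apply(|v| v.clamp(l, h))),
            (Some(l), None) => Ok(self.apply(|v| v.max(l))),
            (None, Some(h)) => Ok(self.apply(|v| v.min(h))),
            (None, None) => Ok(self.apply(|v| v)),
        }
    }

    /// Convenience view for inspecting results: null rows become `None`.
    fn to_options(&self) -> Vec<Option<i64>> {
        self.values
            .iter()
            .zip(&self.validity)
            .map(|(&v, &valid)| valid.then_some(v))
            .collect()
    }
}
```

Because null rows are masked by the validity flags afterwards, clamping the placeholder value stored in a null slot is harmless, which is what makes skipping the per-row validity check safe.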

colin-ho

Looking good so far!

colin-ho · 2024-11-03T23:32:51Z

src/daft-functions/src/numeric/clip.rs

+        if !array_field.dtype.is_numeric() {
+            return Err(DaftError::TypeError(format!(
+                "Expected array input to be numeric, got {}",
+                array_field.dtype
+            )));
+        }
+
+        // Check if min_field and max_field are numeric or null
+        if !(min_field.dtype.is_numeric() || min_field.dtype == DataType::Null) {
+            return Err(DaftError::TypeError(format!(
+                "Expected min input to be numeric or null, got {}",
+                min_field.dtype
+            )));
+        }
+
+        if !(max_field.dtype.is_numeric() || max_field.dtype == DataType::Null) {
+            return Err(DaftError::TypeError(format!(
+                "Expected max input to be numeric or null, got {}",
+                max_field.dtype
+            )));
+        }


These should be consolidated in InferDataType instead.

colin-ho · 2024-11-03T23:34:08Z

daft/expressions/expressions.py

@@ -623,6 +623,19 @@ def floor(self) -> Expression:
        expr = native.floor(self._expr)
        return Expression._from_pyexpr(expr)

+    def clip(self, min: Expression, max: Expression) -> Expression:


Allow Expression | None = None as the arguments instead

colin-ho · 2024-11-03T23:35:46Z

src/daft-core/src/array/ops/clip.rs

+    }
+}
+
+macro_rules! create_data_array {


I don't see much benefit in this macro, since the amount of lines covered is pretty minimal.

colin-ho · 2024-11-03T23:45:39Z

src/daft-core/src/series/ops/clip.rs

+            .then(|| create_null_series(max.name()))
+            .unwrap_or_else(|| max.clone());
+
+        match &output_type {


Try:

output_type if output_type.is_numeric() => {
    with_match_numeric_daft_types!(output_type, |$T| {
        let self_casted = self.cast(output_type)?;
        let min_casted = min.cast(output_type)?;
        let max_casted = max.cast(output_type)?;
        let self_downcasted = self_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        let min_downcasted = min_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        let max_downcasted = max_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        Ok(self_downcasted.clip(min_downcasted, max_downcasted)?.into_series())
    })
}

instead, which fits in a little better with our codebase.

github-actions bot added the enhancement New feature or request label Oct 28, 2024

conradsoon marked this pull request as draft October 28, 2024 13:42

conradsoon force-pushed the feat-clip branch from 54e79d8 to 1529be2 Compare October 28, 2024 13:42

conradsoon force-pushed the feat-clip branch from 0cd6b03 to cf37543 Compare October 28, 2024 13:55

feat(clip): add clip function

7fa3e1d

conradsoon force-pushed the feat-clip branch from cf37543 to 7fa3e1d Compare November 1, 2024 08:57

conradsoon changed the title ~~[FEAT]: binary_min, binary_max and clip Series functions~~ [FEAT]: Support .clip function Nov 1, 2024

conradsoon added 2 commits November 2, 2024 16:26

feat(clip): add daft sql support

c73f2fc

chore(clip): add tests and docs

13f2ac0

github-actions bot added the documentation Improvements or additions to documentation label Nov 3, 2024

conradsoon added 2 commits November 3, 2024 15:44

chore(clip): add more tests

503ae13

fix(clip): fix tests

e09720e

colin-ho reviewed Nov 3, 2024
