[FEAT]: Support .clip function #3136

Draft
wants to merge 5 commits into base: main

conradsoon
Contributor

@conradsoon conradsoon commented Oct 28, 2024

Closes #1907.

TODO:

  • Write tests
  • DaftSQL integration
  • Docs

TESTS:

  • Works-as-expected check
  • Integer promotions check
  • Negative/positive zero (float) check
  • Null check
  • Broadcast fail check
  • Unsupported type check

@github-actions github-actions bot added the enhancement New feature or request label Oct 28, 2024
@conradsoon conradsoon marked this pull request as draft October 28, 2024 13:42
@conradsoon
Contributor Author

conradsoon commented Oct 28, 2024

Hey @colin-ho, I've made a rough draft of the PR (not complete yet; I still need to add tests), but the functionality seems correct.

Some things I'd like to ask for direction on:

  • I've also added binary_min and binary_max as functions and expressed clip in terms of them. binary_min and binary_max use Rust's native min and max functions. Does this approach make sense, or should clip use Rust's native clamp function instead?
  • What behaviour do we want when max < min in .clip? I've followed numpy's semantics, where the result is an array filled entirely with max, but Rust's native clamp seems to panic instead.
  • Should I keep the exposed names as binary_min and binary_max, or follow numpy and call them min and max (even though those names are already overloaded)?
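
To make the difference in the second question concrete, here is a minimal sketch in plain Rust using only the standard library. The function names clip_min_then_max and clip_with_clamp are hypothetical and are not taken from this PR; the sketch only contrasts chaining max/min with a bounds-checked clamp on i64 values.

```rust
/// Clip by chaining `max` then `min`. With inverted bounds (upper < lower),
/// every value collapses to `upper`, which is the NumPy-style result
/// described in the comment above.
fn clip_min_then_max(value: i64, lower: i64, upper: i64) -> i64 {
    value.max(lower).min(upper)
}

/// Clip with the standard library's `Ord::clamp`, which panics when
/// `lower > upper`. The bounds are checked first so that the caller
/// receives an error value instead of a panic.
fn clip_with_clamp(value: i64, lower: i64, upper: i64) -> Result<i64, String> {
    if lower > upper {
        return Err(format!(
            "lower bound {lower} is greater than upper bound {upper}"
        ));
    }
    Ok(value.clamp(lower, upper))
}

fn main() {
    let values = [1, 5, 12];

    // With well-ordered bounds, both approaches agree.
    for v in values {
        assert_eq!(clip_min_then_max(v, 3, 10), clip_with_clamp(v, 3, 10).unwrap());
    }

    // With inverted bounds, min/max silently collapses to the upper bound...
    let collapsed: Vec<i64> = values.iter().map(|&v| clip_min_then_max(v, 10, 3)).collect();
    println!("{collapsed:?}"); // [3, 3, 3]

    // ...while the checked clamp reports the problem.
    println!("{:?}", clip_with_clamp(5, 10, 3));
}
```

Which behaviour is preferable is a design choice; the sketch only shows that the two approaches diverge exactly when the bounds are inverted.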

codspeed-hq bot commented Oct 28, 2024

CodSpeed Performance Report

Merging #3136 will not alter performance

Comparing conradsoon:feat-clip (e09720e) with main (3cef614)

Summary

✅ 17 untouched benchmarks

@colin-ho
Contributor

  • Let's use clamp for simplicity and performance. A single-pass clamp can skip the greater-than check once a value falls below the lower bound, whereas min followed by max always performs both comparisons for every value.
  • Throw an error if max < min.
  • Ideally we should expose a single clip expression, but let the user choose whether to clip with only an upper bound, only a lower bound, or both (e.g. if the upper bound is None, we just apply the lower bound). The num_traits crate also has clamp, clamp_min, and clamp_max convenience functions that work for any PartialOrd type.
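
The three suggestions above could be combined roughly as follows. This is a sketch under stated assumptions, not the code in this PR: the clip function and its Result<T, String> error type are hypothetical, and the comparisons are written by hand so the example runs without the num_traits crate.

```rust
use std::fmt::Debug;

/// Hypothetical single entry point: clip `value` to whichever bounds are given.
/// `None` means "no bound on that side". Inverted bounds are reported as an error.
fn clip<T: PartialOrd + Copy + Debug>(
    value: T,
    lower: Option<T>,
    upper: Option<T>,
) -> Result<T, String> {
    match (lower, upper) {
        (Some(lo), Some(hi)) => {
            if lo > hi {
                return Err(format!(
                    "lower bound {lo:?} is greater than upper bound {hi:?}"
                ));
            }
            // Single pass: the `> hi` comparison is skipped when `value < lo`.
            Ok(if value < lo {
                lo
            } else if value > hi {
                hi
            } else {
                value
            })
        }
        // Lower bound only (the role played by `clamp_min`).
        (Some(lo), None) => Ok(if value < lo { lo } else { value }),
        // Upper bound only (the role played by `clamp_max`).
        (None, Some(hi)) => Ok(if value > hi { hi } else { value }),
        // No bounds: the value passes through unchanged.
        (None, None) => Ok(value),
    }
}

fn main() {
    println!("{:?}", clip(5, Some(10), None)); // Ok(10)
    println!("{:?}", clip(5.5, None, Some(2.0))); // Ok(2.0)
    println!("{:?}", clip(7, None, None)); // Ok(7)
    println!("{:?}", clip(1, Some(3), Some(2)).is_err()); // true
}
```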

@conradsoon
Contributor Author

conradsoon commented Nov 1, 2024

Hey @colin-ho, have made the requested changes:

  • Now we explicitly throw an error if we try to .clip with a max < min.
  • We perform clamp in a single-pass now, rather than calling max followed by min.
  • Passing None as one of the arguments (or having a null value in one of the rows) results in not bounding for the relevant side.
  • Cleaned up some of the typing logic with num_traits (thanks for the recommendation).

Could I ask for your thoughts on these questions:

  • Should we also support .clip for any datatype whose physical type is clampable (e.g. DateTime, Timestamp)? Or should this be decided on a case-by-case basis (if so, which types do you think make sense to support)?
  • Currently, the kernel that I use to .clip rows has some pattern-matching logic where I check the nullity of the min or max bound. I did it this way to support cases where we might want to call .clip with entire columns instead of just single values (and hence need to support selective bounding depending on row values). Are there any performance concerns with doing it this way, and is there a better way?

@conradsoon conradsoon changed the title from "[FEAT]: binary_min, binary_max and clip Series functions" to "[FEAT]: Support .clip function" Nov 1, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 3, 2024
@colin-ho
Contributor

colin-ho commented Nov 3, 2024

Let's stick with just numeric types for this PR

Comment on lines +8 to +23
fn clamp_helper<T: PartialOrd + Copy>(
    value: Option<&T>,
    left_bound: Option<&T>,
    right_bound: Option<&T>,
) -> Option<T> {
    match (value, left_bound, right_bound) {
        (None, _, _) => None,
        (Some(v), Some(l), Some(r)) => {
            assert!(l <= r, "Left bound is greater than right bound");
            Some(clamp(*v, *l, *r))
        }
        (Some(v), Some(l), None) => Some(clamp_min(*v, *l)),
        (Some(v), None, Some(r)) => Some(clamp_max(*v, *r)),
        (Some(v), None, None) => Some(*v),
    }
}
Contributor

The key observation we can leverage here for better performance is that the result of the clamp is None whenever the original value is None. So instead of as_arrow.iter() you can use as_arrow.values_iter(), which returns an iterator over all the values and ignores the validity. This is fine because we attach the validity of the original array to the result anyway. The small benefit is fewer match branches, I think only one fewer.

Unfortunately we can't do this for left and right, because we need to account for their validity.

But in a case like (array_size, 1, rbound_size) where the single left_bound is not None, you only need one validity check per row, i.e. for the right_bound (because you are using values_iter for the array, and your left bound is a non-null scalar).

Lastly, and probably most importantly, in the case of (_, 1, 1) you can probably do something like:

let left = left_bound.get(0);
let right = right_bound.get(0);
if let Some(left) = left
    && let Some(right) = right
{
    self.apply(|v| clamp(v, left, right))
} else if let Some(left) = left {
    self.apply(|v| clamp_min(v, left))
} else if let Some(right) = right {
    self.apply(|v| clamp_max(v, right))
} else {
    Ok(Self::full_null(self.name(), self.data_type(), self.len()))
}
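
The two ideas in this comment (iterate over raw values and reuse the original validity, and resolve scalar bounds once outside the per-row loop) can be illustrated with a toy column type. Everything here is a hypothetical stand-in: Column, apply, get and clip_scalar are not Daft or Arrow APIs. The sketch also assumes that two missing bounds leave the values unchanged, following the (Some(v), None, None) arm of clamp_helper, whereas the snippet above returns an all-null array in that case.

```rust
/// Toy nullable column: raw values plus a parallel validity mask.
/// Hypothetical stand-in for an Arrow-style array, not Daft's DataArray.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    values: Vec<f64>,
    validity: Vec<bool>, // true = value present
}

impl Column {
    /// Read one slot, taking validity into account.
    fn get(&self, i: usize) -> Option<f64> {
        if self.validity[i] { Some(self.values[i]) } else { None }
    }

    /// Apply `f` to every raw value without looking at validity,
    /// then reuse the original mask for the result.
    fn apply(&self, f: impl Fn(f64) -> f64) -> Column {
        Column {
            values: self.values.iter().map(|&v| f(v)).collect(),
            validity: self.validity.clone(),
        }
    }

    /// Scalar-bound fast path: the `Option` bounds are matched once,
    /// so the per-row closure contains no null checks at all.
    fn clip_scalar(&self, lower: Option<f64>, upper: Option<f64>) -> Result<Column, String> {
        match (lower, upper) {
            (Some(lo), Some(hi)) => {
                // Also rejects NaN bounds, which would make `f64::clamp` panic.
                if !(lo <= hi) {
                    return Err(format!("invalid bounds: lower {lo}, upper {hi}"));
                }
                Ok(self.apply(|v| v.clamp(lo, hi)))
            }
            (Some(lo), None) => Ok(self.apply(|v| if v < lo { lo } else { v })),
            (None, Some(hi)) => Ok(self.apply(|v| if v > hi { hi } else { v })),
            // Assumption: no bounds means no change.
            (None, None) => Ok(self.clone()),
        }
    }
}

fn main() {
    let col = Column {
        values: vec![-1.0, 0.0, 5.0],
        validity: vec![true, false, true], // middle slot is null
    };
    let clipped = col.clip_scalar(Some(0.5), Some(2.0)).unwrap();
    for i in 0..3 {
        println!("{:?}", clipped.get(i));
    }
}
```

The raw value stored in a null slot may change during apply, but that is harmless because the slot stays masked out by the reused validity.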

Contributor

@colin-ho colin-ho left a comment

Looking good so far!

Comment on lines +38 to +58
if !array_field.dtype.is_numeric() {
    return Err(DaftError::TypeError(format!(
        "Expected array input to be numeric, got {}",
        array_field.dtype
    )));
}

// Check if min_field and max_field are numeric or null
if !(min_field.dtype.is_numeric() || min_field.dtype == DataType::Null) {
    return Err(DaftError::TypeError(format!(
        "Expected min input to be numeric or null, got {}",
        min_field.dtype
    )));
}

if !(max_field.dtype.is_numeric() || max_field.dtype == DataType::Null) {
    return Err(DaftError::TypeError(format!(
        "Expected max input to be numeric or null, got {}",
        max_field.dtype
    )));
}
Contributor

These should be consolidated in InferDataType instead.
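
As a rough illustration of what consolidating the three checks might look like, the sketch below routes them through one helper. It does not show how Daft's InferDataType is actually structured: the DataType enum here is a tiny hypothetical subset, and check_clip_input and check_clip_types are invented names that return a plain String error instead of DaftError.

```rust
/// Hypothetical, heavily reduced stand-in for a data type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
    Null,
}

impl DataType {
    fn is_numeric(self) -> bool {
        matches!(self, DataType::Int64 | DataType::Float64)
    }
}

/// One shared check: `role` names the argument for the error message,
/// and `allow_null` says whether a Null dtype is acceptable for it.
fn check_clip_input(role: &str, dtype: DataType, allow_null: bool) -> Result<(), String> {
    if dtype.is_numeric() || (allow_null && dtype == DataType::Null) {
        return Ok(());
    }
    let expected = if allow_null { "numeric or null" } else { "numeric" };
    Err(format!("Expected {role} input to be {expected}, got {dtype:?}"))
}

/// All three validations in one place, stopping at the first failure.
fn check_clip_types(array: DataType, min: DataType, max: DataType) -> Result<(), String> {
    check_clip_input("array", array, false)?;
    check_clip_input("min", min, true)?;
    check_clip_input("max", max, true)
}

fn main() {
    println!("{:?}", check_clip_types(DataType::Int64, DataType::Null, DataType::Float64));
    println!("{:?}", check_clip_types(DataType::Utf8, DataType::Null, DataType::Null));
}
```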

@@ -623,6 +623,19 @@ def floor(self) -> Expression:
        expr = native.floor(self._expr)
        return Expression._from_pyexpr(expr)

    def clip(self, min: Expression, max: Expression) -> Expression:
Contributor

Allow Expression | None = None as the type and default for both arguments instead.

}
}

macro_rules! create_data_array {
Contributor

I don't see much benefit in this macro, since the number of lines it saves is pretty minimal.

.then(|| create_null_series(max.name()))
.unwrap_or_else(|| max.clone());

match &output_type {
Contributor

Try:

output_type if output_type.is_numeric() => {
    with_match_numeric_daft_types!(output_type, |$T| {
        let self_casted = self.cast(output_type)?;
        let min_casted = min.cast(output_type)?;
        let max_casted = max.cast(output_type)?;

        let self_downcasted = self_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        let min_downcasted = min_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        let max_downcasted = max_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        Ok(self_downcasted.clip(min_downcasted, max_downcasted)?.into_series())
    })
}

instead, which fits in a little better with our codebase.

Labels: documentation, enhancement
Successfully merging this pull request may close these issues: [EXPRESSIONS] .clip
2 participants