-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT]: Support .clip
function
#3136
base: main
Are you sure you want to change the base?
Conversation
Hey @colin-ho, I've made a rough draft of the PR (not complete yet: still need to add tests), functionality seems correct though. Some things I'd like to ask for direction on:
|
CodSpeed Performance ReportMerging #3136 will not alter performanceComparing Summary
|
|
Hey @colin-ho, have made the requested changes:
Could I ask for your thoughts on these questions:
|
binary_min
, binary_max
and clip
Series functions.clip
function
Let's stick with just numeric types for this PR |
fn clamp_helper<T: PartialOrd + Copy>( | ||
value: Option<&T>, | ||
left_bound: Option<&T>, | ||
right_bound: Option<&T>, | ||
) -> Option<T> { | ||
match (value, left_bound, right_bound) { | ||
(None, _, _) => None, | ||
(Some(v), Some(l), Some(r)) => { | ||
assert!(l <= r, "Left bound is greater than right bound"); | ||
Some(clamp(*v, *l, *r)) | ||
} | ||
(Some(v), Some(l), None) => Some(clamp_min(*v, *l)), | ||
(Some(v), None, Some(r)) => Some(clamp_max(*v, *r)), | ||
(Some(v), None, None) => Some(*v), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They key observation we can leverage here for some better performance is that the result of the clamp is None if the original value is None. Therefore instead of doing as_arrow.iter()
you can use as_arrow.values_iter()
, which will return an iterator of all the values, ignoring the validity. This is fine because we slap on the validity of the original array anyway. The very small benefit of this is that it will reduce the number of match branches, i think only by 1 or something.
Unfortunately we can't do this for left and right though, because we need to account for their validity.
But in a case like (array_size, 1, rbound_size)
and the single left_bound is not None, you only need 1 validity check per row! i.e. for the right_bound (because you are using values_iter
for the array, and your left bound is a non-null scalar).
Lastly, and probably the most important, in the case of (_, 1, 1)
you can probably do something like
let left = left_bound.get(0);
let right = right_bound.get(0);
if let Some(left) = left
&& let Some(right) = right
{
self.apply(|v| clamp(v, left, right))
} else if let Some(left) = left {
self.apply(|v| clamp_min(v, left))
} else if let Some(right) = right {
self.apply(|v| clamp_max(v, right))
} else {
Ok(Self::full_null(self.name(), self.data_type(), self.len()))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far!
if !array_field.dtype.is_numeric() { | ||
return Err(DaftError::TypeError(format!( | ||
"Expected array input to be numeric, got {}", | ||
array_field.dtype | ||
))); | ||
} | ||
|
||
// Check if min_field and max_field are numeric or null | ||
if !(min_field.dtype.is_numeric() || min_field.dtype == DataType::Null) { | ||
return Err(DaftError::TypeError(format!( | ||
"Expected min input to be numeric or null, got {}", | ||
min_field.dtype | ||
))); | ||
} | ||
|
||
if !(max_field.dtype.is_numeric() || max_field.dtype == DataType::Null) { | ||
return Err(DaftError::TypeError(format!( | ||
"Expected max input to be numeric or null, got {}", | ||
max_field.dtype | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be consolidated in InferDataType
instead.
@@ -623,6 +623,19 @@ def floor(self) -> Expression: | |||
expr = native.floor(self._expr) | |||
return Expression._from_pyexpr(expr) | |||
|
|||
def clip(self, min: Expression, max: Expression) -> Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow Expression | None = None
as the arguments instead
} | ||
} | ||
|
||
macro_rules! create_data_array { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see much benefit in this macro, since the amount of lines covered is pretty minimal.
.then(|| create_null_series(max.name())) | ||
.unwrap_or_else(|| max.clone()); | ||
|
||
match &output_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try:
output_type if output_type.is_numeric() => {
with_match_numeric_daft_types!(output_type, |$T| {
let self_casted = self.cast(output_type)?;
let min_casted = min.cast(output_type)?;
let max_casted = max.cast(output_type)?;
let self_downcasted = self_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
let min_downcasted = min_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
let max_downcasted = max_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
Ok(self_downcasted.clip(min_downcasted, max_downcasted)?.into_series())
})
}
instead, which fits in a little better with our codebase.
Closes #1907.
TODO:
TESTS: