Lazy rectilinear interpolator #6084
base: main
Conversation
if self._interpolator is None:
    # Cache the interpolator instance.
    # NB. The constructor of the _RegularGridInterpolator class does
    # some unnecessary checks on the fill_value parameter,
    # so we set it afterwards instead. Sneaky. ;-)
    self._interpolator = _RegularGridInterpolator(
        self._src_points,
        data,
        method=self.method,
        bounds_error=mode.bounds_error,
        fill_value=None,
    )
else:
    self._interpolator.values = data
_interpolate is now mapped to all the chunks of the data we are interpolating, thus we cannot cache the _RegularGridInterpolator instance anymore. Is this acceptable?
One can still cache the RectilinearInterpolator though, and, as far as I can tell, this would be similar to the way in which the caching of the regridder works?
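To illustrate the constraint being discussed, here is a minimal sketch using plain dask.array (not the PR's map_complete_blocks): the mapped function runs once per chunk, possibly on different workers, so anything cached inside it (or on the interpolator instance) cannot be shared between chunks the way the old eager code could.

import dask.array as da

def _interpolate_block(block):
    # a fresh interpolator would have to be built here for every block
    return block * 2.0

lazy = da.arange(12, chunks=4)
result = da.map_blocks(_interpolate_block, lazy, dtype=float)
print(result.compute())  # the function ran once per 4-element chunk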
Having read through the code base: I believe the caching is because the same interpolation routine gets used on a series of coordinates - start here and follow the call stack down to _points():

iris/lib/iris/analysis/_interpolation.py
Lines 656 to 658 in 4c73724

for coord in cube.dim_coords + cube.aux_coords:
    new_coord, dims = construct_new_coord(coord)
    gen_new_cube()

... meaning multiple uses of _RegularGridInterpolator but with only .values changing. I assume that there is noticeable overhead when creating a new instance of _RegularGridInterpolator, but that there is no problem applying the same instance to a series of different arrays, hence the caching.
There is a risk that your refactoring - which makes the routine much faster when parallelising large datasets - will make small-to-medium cases slower, especially for those cases that need to interpolate a larger number of coordinates. So I am very keen that we try out my suggestion using args=[self] (#6084 (comment)), which I believe would allow us to retain the existing caching? If that does not work then I'll need to assess the performance impact with some benchmarking.
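For context, a minimal sketch of the reuse pattern being described - one instance, many arrays. The constructor arguments and the .values assignment mirror the quoted code; the import path is the one used later in this thread, and the small arrays are purely illustrative.

import numpy as np
from iris.analysis._scipy_interpolate import _RegularGridInterpolator

src_points = [np.linspace(1, 9, 5), np.linspace(10, 90, 9)]
data = np.arange(45.0).reshape(5, 9)
interp_points = np.array([[3.5, 15.0], [8.5, 75.0]])

interp = _RegularGridInterpolator(
    src_points, data, method="linear", bounds_error=False, fill_value=None
)
first = interp(interp_points)        # interpolate the cube data

other = data * 10.0                  # e.g. an aux-coord array of the same shape
interp.values = other                # only .values changes; no re-construction
second = interp(interp_points)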
I see your point. But I am slightly afraid the args=[self] approach could be problematic due to the reference to the full cube being sent to all workers (see answer to your other comment below).
So I had a closer look at the _RegularGridInterpolator: it looks to me that only a few checks are carried out when initializing the object, while all the expensive tasks (calculation of weights, interpolation) only take place when calling the interpolator:

iris/lib/iris/analysis/_scipy_interpolate.py
Lines 155 to 156 in e1f5a8b

weights = self.compute_interp_weights(xi, method)
return self.interp_using_pre_computed_weights(weights)
I have measured some timings based on the example you provided in the comment above, using it as a test case for a small dataset:
Code (run in Jupyter to benchmark timings)
import numpy as np
import iris.analysis
from iris.coords import DimCoord
from iris.cube import Cube
from iris.analysis._scipy_interpolate import _RegularGridInterpolator
# Create a series of orthogonal coordinates
longitude = np.linspace(1, 9, 5)
latitude = np.linspace(10, 90, 9)
coord_names_and_points = [
('longitude', longitude),
('latitude', latitude),
('altitude', np.linspace(100, 900, 18)),
('time', np.linspace(1000, 9000, 36)),
]
coord_list = [
DimCoord(points, standard_name=name)
for name, points in coord_names_and_points
]
# Create a Cube to house the above coordinates.
shape = [len(points) for _, points in coord_names_and_points]
dimension_mapping = [
(c, ix) for ix, c in enumerate(coord_list)
]  # [(longitude, 0), (latitude, 1), ...]
data = np.arange(np.prod(shape)).reshape(shape)
cube = Cube(data, dim_coords_and_dims=dimension_mapping)
# Perform interpolation over multiple dimensions, but NOT all the dimensions of the Cube
# So we're estimating the values that would appear at:
# (3.5, 15), (3.5, 25), (3.5, 75), (8.5, 15), (8.5, 25), (8.5, 75)
coords = ("longitude", "latitude")
points = [[3.5, 8.5], [15, 25, 75]]
# Create the interpolator instance to benchmark total time of interpolation
interpolator = iris.analysis.RectilinearInterpolator(
cube, coords, "linear", "mask"
)
%%timeit
# Measure total time of interpolation
result = interpolator(points, collapse_scalar=True)
# 1.28 ms ± 40.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
# Measure time required to instantiate the interpolator
_ = _RegularGridInterpolator(
[longitude, latitude],
data,
method="linear",
bounds_error=False,
fill_value=None,
)
# 60.6 μs ± 17 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
So the time to instantiate the interpolator is quite small (<5%) compared to the total time for interpolation, even for such a small dataset. So I am actually wondering whether caching the interpolator ultimately brings much benefit - but maybe I am not looking at a suitable test case? What do you think?
Nice to see this progressing @fnattino! Did you notice that CI is failing on this pull request?
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@           Coverage Diff           @@
##             main    #6084   +/-   ##
=======================================
  Coverage   89.82%   89.83%
=======================================
  Files          88       88
  Lines       23150    23180    +30
  Branches     5043     5043
=======================================
+ Hits        20794    20823    +29
+ Misses       1624     1622     -2
- Partials      732      735     +3

☔ View full report in Codecov by Sentry.
res = interpolator([[1.5]])
self.assertFalse(res.has_lazy_data())

class Test___call___lazy_data(ThreeDimCube):
    def test_src_cube_data_loaded(self):
This test used to check that setting up the interpolator would trigger data loading in the source cube, so that subsequent interpolations would not have to load the same data again and again (see #1222). I have modified the test in favour of a couple of tests that should make more sense for a lazy interpolator.
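Hypothetical sketches (not the exact tests added in this PR) of checks that suit a lazy interpolator: the source cube should stay lazy when the interpolator is built, and the interpolated result should also be lazy. They assume the ThreeDimCube fixture provides self.cube with a "latitude" coordinate and that RectilinearInterpolator is imported as in the surrounding test module.

def test_src_cube_stays_lazy(self):
    cube = self.cube.copy()
    cube.data = cube.lazy_data()  # make the source lazy
    interpolator = RectilinearInterpolator(cube, ["latitude"], "linear", "mask")
    # constructing the interpolator should no longer realise the source data
    self.assertTrue(cube.has_lazy_data())

def test_result_stays_lazy(self):
    cube = self.cube.copy()
    cube.data = cube.lazy_data()
    interpolator = RectilinearInterpolator(cube, ["latitude"], "linear", "mask")
    res = interpolator([[1.5]])
    # the interpolated cube should also carry lazy data
    self.assertTrue(res.has_lazy_data())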
@fnattino just want to reassure you that I have been looking at this, but since I have never worked with our interpolators before it is slow progress. Having another go this afternoon with help from some coffee ☕
No worries @trexfeathers, but thanks for the heads-up! :)
Hi @fnattino, thank you for your hard work on this.
Here is a partial review. I have left myself a couple of TODO comments. But the suggestions I have already might take some time, and may change the code significantly - mainly #6084 (comment) - so it seemed important to get these suggestions to you as soon as possible.
Also thank you to @HarriKeenan for helping me with this review last week 🤜🤛
src_points=self._src_points,
interp_points=interp_points,
interp_shape=interp_shape,
method=self._method,
extrapolation_mode=self._mode,
Have you considered using self=self or something like that instead? I.e. not needing to refactor _interpolate() to be passed all these values as arguments, but instead to just access them from the existing instance of RectilinearInterpolator. I'm not 100% sure that this works but it would be worth trying?
More specifically I think it would be like this, since self is a positional argument (not a keyword argument):

src_points=self._src_points,
interp_points=interp_points,
interp_shape=interp_shape,
method=self._method,
extrapolation_mode=self._mode,
args=[self],
interp_shape=interp_shape,
Uhm, I haven't thought about that possibility. The main reason why I have refactored _interpolate() is that the existing instance contains a copy of the data to be interpolated (self._src_cube), so I thought it was safer to make _interpolate() a static method and only pass the required input arguments.
In particular, I was afraid that serializing the RectilinearInterpolator instance and copying it to the Dask workers might cause the full data array to be loaded and copied everywhere (even though I am not 100% sure whether this is what would actually happen).
My refactored implementation gives _interpolate() access to only the data chunk that it should work on, so it looked "safer" to me? What do you think? But if you think we should really avoid this refactoring, I will give passing the instance to _interpolate() via map_complete_blocks a try.
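A toy illustration of the concern above, using plain dask.array and a hypothetical FakeInterpolator class (not the iris internals): if the mapped function is bound to an object holding a large realised array, that object travels with every task in the graph and may be serialised to every worker, whereas passing only per-chunk inputs avoids this.

from functools import partial

import dask.array as da
import numpy as np

class FakeInterpolator:
    def __init__(self, src_data):
        self.src_data = src_data  # stand-in for the full, realised source cube data

inst = FakeInterpolator(np.ones((2000, 2000)))  # ~32 MB held on the instance

lazy = da.ones((2000, 2000), chunks=(500, 2000))

def interp_block(block, interpolator=None):
    # the real work would use `interpolator`; here we only mark the dependency
    return block * 2.0

bound = partial(interp_block, interpolator=inst)  # `inst` is baked into every task
with_instance = da.map_blocks(bound, lazy, dtype=float)
per_chunk = da.map_blocks(lambda b: b * 2.0, lazy, dtype=float)  # chunk-only variant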
def _interpolated_dtype(dtype, method):
    """Determine the minimum base dtype required by the underlying interpolator."""
    if method == "nearest":
        result = dtype
    else:
        result = np.result_type(_DEFAULT_DTYPE, dtype)
    return result
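For reference, a quick self-contained usage example; the _DEFAULT_DTYPE value here is an assumption mirroring the module-level constant. Nearest-neighbour keeps the source dtype, while other methods promote integers to a float dtype.

import numpy as np

_DEFAULT_DTYPE = np.float64  # assumption, mirrors iris.analysis._interpolation

def _interpolated_dtype(dtype, method):
    """Determine the minimum base dtype required by the underlying interpolator."""
    if method == "nearest":
        return dtype
    return np.result_type(_DEFAULT_DTYPE, dtype)

print(_interpolated_dtype(np.dtype("int16"), "nearest"))  # int16
print(_interpolated_dtype(np.dtype("int16"), "linear"))   # float64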
If this needs to stay (see my other comment about args=[self] - #6084 (comment)), then I'd be interested in us unifying this function with RectilinearInterpolator._interpolated_dtype().
Do you mean to put this back as a (static) method of the RectilinearInterpolator? Or to merge the body of this function with RectilinearInterpolator._interpolate?
# Determine the shape of the interpolated result.
ndims_interp = len(interp_shape)
extra_shape = data.shape[ndims_interp:]
final_shape = [*interp_shape, *extra_shape]
TODO for @trexfeathers: understand the logic of reshaping, and the way it is now split between _interpolate() and _points().
@HarriKeenan after you shared my struggle last week, here is the script I used to help me eventually understand, in case you were interested:
Expand to view code
"""
Script to help understand how array shapes are handled in Iris interpolation.
Creates a scenario that is easy to follow when debugging iris.analysis._interpolation.py.
"""
import numpy as np
import iris.analysis
from iris.coords import DimCoord
from iris.cube import Cube
# Create a series of orthogonal coordinates that are obviously different - to
# help with debugging Iris internals.
coord_names_and_points = [
('longitude', np.linspace(1, 9, 5)),
('latitude', np.linspace(10, 90, 9)),
('altitude', np.linspace(100, 900, 18)),
('time', np.linspace(1000, 9000, 36)),
]
coord_list = [
DimCoord(points, standard_name=name)
for name, points in coord_names_and_points
]
# Create a Cube to house the above coordinates.
# The data is not the subject of this investigation so can be arbitrary.
shape = [len(points) for name, points in coord_names_and_points]
dimension_mapping = [
(c, ix) for ix, c in enumerate(coord_list)
]  # [(longitude, 0), (latitude, 1), ...]
cube = Cube(
np.arange(np.prod(shape)).reshape(shape),
dim_coords_and_dims=dimension_mapping,
)
print(cube)
unknown / (unknown) (longitude: 5; latitude: 9; altitude: 18; time: 36)
Dimension coordinates:
longitude x - - -
latitude - x - -
altitude - - x -
time - - - x
# Perform interpolation over multiple dimensions, but NOT all the dimensions
# of the Cube, so we can see how the most complex cases are handled when
# debugging.
# So we're estimating the values that would appear at:
# (3.5, 15), (3.5, 25), (3.5, 75), (8.5, 15), (8.5, 25), (8.5, 75)
sampling_points = [("longitude", [3.5, 8.5]), ("latitude", [15, 25, 75])]
interpolated_cube = cube.interpolate(sampling_points, iris.analysis.Linear())
print(interpolated_cube)
unknown / (unknown) (longitude: 2; latitude: 3; altitude: 18; time: 36)
Dimension coordinates:
longitude x - - -
latitude - x - -
altitude - - x -
time - - - x
# The interpolated result has now shape "points_shape + extra_shape"
# where "points_shape" is the leading dimension of "interp_points"
Has there been some renaming since this was written? points_shape -> interp_shape?
The renaming of the shape in the docstring comes from the fact that I have moved some of the reshaping of the interpolated results from outside to inside _interpolate() (see comment), so the shape of the data that _interpolate() returns is actually different.
Here, I am describing the reshaping that takes place here: bringing data from shape "points_shape + extra_shape" (as returned by interpolator()) to shape "interp_shape + extra_shape" (= "final_shape").
result = self._interpolator(interp_points)
# The interpolated result has now shape "points_shape + extra_shape"
# where "points_shape" is the leading dimension of "interp_points"
# (i.e. 'interp_points.shape[:-1]'). We reshape it to match the shape
This is not true if there is more than 1 dimension not being interpolated (i.e. len(extra_shape) > 1).
Uhm, I have looked at the example you have shared in the comment above (very useful, thanks!), where you have 4 dimensions, 2 of which are not interpolated.
At this point of the code, I get the following shapes:
interp_points.shape
# (6, 2) --> 6 points with 2 coordinates each
interp_points.shape[:-1]
# (6, ) --> this is "points_shape"
extra_shape
# (18, 36)
result.shape
# (6, 18, 36) --> thus equal to "points_shape + extra_shape"
Or am I missing something? Maybe I should rephrase the comment in the following way? I have just used the same description as previously provided in the docstring.
# The interpolated result has now shape "(num_points, ) + extra_shape"
# where "num_points" is the number of points for which we are carrying
# out the interpolation. We reshape it to match the shape of the
# interpolated dimensions.
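To make the reshape being discussed concrete, here is a tiny numeric check, with the shapes taken from the example above (2x3 sample points, altitude and time not interpolated): the interpolator returns "(num_points,) + extra_shape", which is then reshaped to "interp_shape + extra_shape".

import numpy as np

interp_shape = (2, 3)    # longitude x latitude sample points
extra_shape = (18, 36)   # altitude, time: dimensions not interpolated
num_points = int(np.prod(interp_shape))  # 6 rows of interp_points

result = np.zeros((num_points, *extra_shape))  # shape (6, 18, 36)
final_shape = [*interp_shape, *extra_shape]
reshaped = result.reshape(final_shape)         # shape (2, 3, 18, 36)
print(reshaped.shape)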
# Interpolate the data, merging the chunks in the interpolated
# dimensions.
dims_merge_chunks = [dmap[d] for d in di]
result = map_complete_blocks(
TODO for @trexfeathers: confirm that this use of map_complete_blocks() does not suffer the memory leaks described in #5767.
TODO for @trexfeathers: review the tests.
Co-authored-by: Martin Yeo <[email protected]>
Co-authored-by: Martin Yeo <[email protected]>
…iris into lazy-rectilinearinterpolator-2
@trexfeathers thanks a lot for the review and apologies for the scattered response.
As I have tried to explain in reply to your comments, I am a bit hesitant to implement the solution that would copy the current instance of the RectilinearInterpolator - but I am very curious to hear your thoughts on this!
🚀 Pull Request
Description
Different take on enabling the rectilinear interpolator to run lazily #6002.
Trying to address the same issue as #6006, but there I have made the underlying _RegularGridInterpolator from _scipy_interpolate run on lazy data, which required switching from scipy.sparse to sparse (not ideal since it would add numba as a dependency).
Here I have tried instead to implement a similar approach as used for regridding, which works on lazy data as well. The downside is that the chunks in the dimensions we are interpolating over need to be merged, but at least we could run interpolation in parallel over the chunks in the other dimensions (and we do not need to add extra dependencies to iris).