
Lazy rectilinear interpolator #6084

Open
wants to merge 10 commits into main

Conversation

fnattino
Contributor

🚀 Pull Request

Description

A different take on enabling the rectilinear interpolator to run lazily (#6002).

This tries to address the same issue as #6006, but there I made the underlying _RegularGridInterpolator from _scipy_interpolate run on lazy data, which required switching from scipy.sparse to sparse (not ideal, since it would add numba as a dependency).

Here I have instead implemented an approach similar to the one used for regridding, which also works on lazy data. The downside is that the chunks along the dimensions we are interpolating over need to be merged, but at least we can run the interpolation in parallel over the chunks in the other dimensions (and we do not need to add extra dependencies to Iris).
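A minimal sketch of the idea with plain dask (not the code in this PR): merge the chunks along the interpolated dimensions, keep the other dimensions chunked, and map a per-block interpolation function over what remains.

import dask.array as da

data = da.random.random((10, 20, 30), chunks=(5, 10, 15))

# Suppose we interpolate over dimensions 0 and 1: each must become a single chunk.
merged = data.rechunk({0: -1, 1: -1})

def interpolate_block(block):
    # Stand-in for the real per-chunk interpolation; here it just passes the block through.
    return block

result = merged.map_blocks(interpolate_block, dtype=data.dtype)
print(result.chunks)  # ((10,), (20,), (15, 15)) -> parallelism only over dimension 2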

Comment on lines -318 to -331
        if self._interpolator is None:
            # Cache the interpolator instance.
            # NB. The constructor of the _RegularGridInterpolator class does
            # some unnecessary checks on the fill_value parameter,
            # so we set it afterwards instead. Sneaky. ;-)
            self._interpolator = _RegularGridInterpolator(
                self._src_points,
                data,
                method=self.method,
                bounds_error=mode.bounds_error,
                fill_value=None,
            )
        else:
            self._interpolator.values = data
Contributor Author

_interpolate is now mapped over all the chunks of the data we are interpolating, so we can no longer cache the _RegularGridInterpolator instance. Is this acceptable?

One can still cache the RectilinearInterpolator though, and, as far as I can tell, this would be similar to the way the caching of the regridder works?

Contributor

@trexfeathers trexfeathers Sep 12, 2024

Having read through the code base: I believe the caching is because the same interpolation routine gets used on a series of coordinates - start here and follow the call stack down to _points():

for coord in cube.dim_coords + cube.aux_coords:
    new_coord, dims = construct_new_coord(coord)
    gen_new_cube()

... meaning multiple uses of _RegularGridInterpolator but with only .values changing. I assume that there is noticeable overhead when creating a new instance of _RegularGridInterpolator, but that there is no problem applying the same instance to a series of different arrays, hence the caching.

There is a risk that your refactoring - which makes the routine much faster when parallelising large datasets - will make small-to-medium cases slower, especially for those cases that need to interpolate a larger number of coordinates. So I am very keen that we try out my suggestion using args=[self] (#6084 (comment)), which I believe would allow us to retain the existing caching? If that does not work then I'll need to assess the performance impact with some benchmarking.
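For reference, a generic, self-contained sketch of the cached pattern, using SciPy's public RegularGridInterpolator as a stand-in for Iris's private class (that swapping .values on the instance is supported is an assumption of this sketch, mirroring what the removed code did):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

grid = (np.arange(5.0), np.arange(9.0))
interp = RegularGridInterpolator(grid, np.zeros((5, 9)), method="linear")

for fill in (1.0, 2.0, 3.0):
    interp.values = np.full((5, 9), fill)  # reuse the instance; only the data changes
    print(interp([[1.5, 2.5]]))            # -> [1.], then [2.], then [3.]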

Contributor Author

I see your point. But I am slightly afraid the args=[self] approach could be problematic, because a reference to the full cube would be sent to all workers (see my answer to your other comment below).

So I had a closer look at the _RegularGridInterpolator: it looks to me like only a few checks are carried out when initializing the object, while all the expensive work (calculation of weights, interpolation) only takes place when calling the interpolator:

weights = self.compute_interp_weights(xi, method)
return self.interp_using_pre_computed_weights(weights)

I have measured some timings based on the example you provided in the comment above, using it as a test case for a small dataset:

Code (run in Jupyter to benchmark timings)
import numpy as np
import iris.analysis

from iris.coords import DimCoord
from iris.cube import Cube
from iris.analysis._scipy_interpolate import _RegularGridInterpolator

# Create a series of orthogonal coordinates
longitude = np.linspace(1, 9, 5)
latitude = np.linspace(10, 90, 9)
coord_names_and_points = [
    ('longitude', longitude),
    ('latitude', latitude),
    ('altitude', np.linspace(100, 900, 18)),
    ('time', np.linspace(1000, 9000, 36)),
]
coord_list = [
    DimCoord(points, standard_name=name)
    for name, points in coord_names_and_points
]

# Create a Cube to house the above coordinates.
shape = [len(points) for _, points in coord_names_and_points]
dimension_mapping = [
    (c, ix) for ix, c in enumerate(coord_list)
]  # [(time, 0), (height, 1), ...]
data = np.arange(np.prod(shape)).reshape(shape)
cube = Cube(data, dim_coords_and_dims=dimension_mapping)

# Perform interpolation over multiple dimensions, but NOT all the dimensions of the Cube
# So we're estimating the values that would appear at:
# (3.5, 15), (3.5, 25), (3.5, 75), (8.5, 15), (8.5, 25), (8.5, 75)
coords = ("longitude", "latitude")
points = [[3.5, 8.5], [15, 25, 75]]

# Create the interpolator instance to benchmark total time of interpolation
interpolator = iris.analysis.RectilinearInterpolator(
    cube, coords, "linear", "mask"
)

%%timeit
# Measure total time of interpolation
result = interpolator(points, collapse_scalar=True)
# 1.28 ms ± 40.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
# Measure time required to instantiate the interpolator
_ = _RegularGridInterpolator(
    [longitude, latitude],
    data,
    method="linear",
    bounds_error=False,
    fill_value=None,
)
# 60.6 μs ± 17 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So the time to instantiate the interpolator is quite small (<5%) compared to the total interpolation time, even for such a small dataset. I am therefore wondering whether caching the interpolator ultimately brings much benefit - but maybe I am not looking at a suitable test case? What do you think?

@fnattino fnattino marked this pull request as ready for review July 25, 2024 09:44
@bouweandela
Member

Nice to see this progressing @fnattino! Did you notice that CI is failing on this pull request?


codecov bot commented Aug 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.83%. Comparing base (d76a4d8) to head (3481e46).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6084   +/-   ##
=======================================
  Coverage   89.82%   89.83%           
=======================================
  Files          88       88           
  Lines       23150    23180   +30     
  Branches     5043     5043           
=======================================
+ Hits        20794    20823   +29     
+ Misses       1624     1622    -2     
- Partials      732      735    +3     


        res = interpolator([[1.5]])
        self.assertFalse(res.has_lazy_data())


class Test___call___lazy_data(ThreeDimCube):
    def test_src_cube_data_loaded(self):
Contributor Author

This test used to check that setting up the interpolator would trigger data loading in the source cube, so that new interpolations would not have to load the same data again and again (see #1222). I have replaced it with a couple of tests that should make more sense for a lazy interpolator.
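As a self-contained illustration (hypothetical example, not the actual test code from this PR) of the kind of assertion a lazy-interpolator test can make through the public API:

import dask.array as da
import numpy as np

import iris.analysis
from iris.coords import DimCoord
from iris.cube import Cube

lat = DimCoord(np.linspace(-90.0, 90.0, 4), standard_name="latitude", units="degrees")
lon = DimCoord(np.linspace(0.0, 270.0, 4), standard_name="longitude", units="degrees")
cube = Cube(
    da.arange(16.0, chunks=8).reshape(4, 4),
    dim_coords_and_dims=[(lat, 0), (lon, 1)],
)

result = cube.interpolate([("latitude", [0.0, 30.0])], iris.analysis.Linear())

# With the lazy interpolator proposed here both assertions should hold; on main,
# the interpolation currently realises the source data.
assert cube.has_lazy_data()
assert result.has_lazy_data()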

@trexfeathers trexfeathers self-assigned this Sep 4, 2024
@trexfeathers
Contributor

@fnattino just want to reassure you that I have been looking at this, but since I have never worked with our interpolators before it is slow progress. Having another go this afternoon with help from some coffee ☕

@fnattino
Contributor Author

No worries @trexfeathers, but thanks for the heads-up! :)

Contributor

@trexfeathers trexfeathers left a comment

Hi @fnattino, thank you for your hard work on this.

Here is a partial review. I have left myself a couple of TODO comments. But the suggestions I have already might take some time, and may change the code significantly - mainly #6084 (comment) - so it seemed important to get these suggestions to you as soon as possible.

Also thank you to @HarriKeenan for helping me with this review last week 🤜🤛

lib/iris/analysis/_interpolation.py (outdated; resolved)
Comment on lines +582 to +586
src_points=self._src_points,
interp_points=interp_points,
interp_shape=interp_shape,
method=self._method,
extrapolation_mode=self._mode,
Contributor

Have you considered using self=self or something like that instead? I.e. not needing to refactor _interpolate() to be passed all these values as arguments, but instead to just access them from the existing instance of RectilinearInterpolator. I'm not 100% sure that this works but it would be worth trying?

Contributor

More specifically I think it would be like this, since self is a positional argument (not a keyword argument):

Suggested change:

-            src_points=self._src_points,
-            interp_points=interp_points,
-            interp_shape=interp_shape,
-            method=self._method,
-            extrapolation_mode=self._mode,
+            args=[self],
+            interp_shape=interp_shape,
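A generic illustration of that pattern with plain dask (not the Iris map_complete_blocks helper): an instance passed as an extra positional argument reaches the mapped function alongside each block, which is the effect args=[self] is aiming for.

import dask.array as da

class Interp:
    def __init__(self, factor):
        self.factor = factor

def _interpolate(block, interp):
    # "interp" arrives as the extra positional argument (the would-be "self").
    return block * interp.factor

data = da.ones((4, 4), chunks=(2, 2))
result = da.map_blocks(_interpolate, data, Interp(2.0), dtype=float)
print(result.sum().compute())  # 32.0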

Contributor Author

Uhm, I haven't thought about that possibility. The main reason why I have refactored _interpolate() is that the existing instance contains a copy of the data to be interpolated (self._src_cube), so I thought it was safer to make _interpolate() a static method and only pass the required input arguments.

In particular, I was afraid that serializing the RectilinearInterpolator instance and copying it to the Dask workers might cause the full data array to be loaded and copied everywhere (even though I am not 100% sure whether this is what would actually happen).

My refactored implementation gives _interpolate() access to only the data chunk that it should work on, so it looked "safer" to me? What do you think? If you think we should really try to avoid this refactoring, I will give passing the instance to _interpolate() via map_complete_blocks a try.

Comment on lines +168 to +174
def _interpolated_dtype(dtype, method):
    """Determine the minimum base dtype required by the underlying interpolator."""
    if method == "nearest":
        result = dtype
    else:
        result = np.result_type(_DEFAULT_DTYPE, dtype)
    return result
Contributor

@trexfeathers trexfeathers Sep 5, 2024

If this needs to stay (see my other comment about args=[self] - #6084 (comment)), then I'd be interested in us unifying this function with RectilinearInterpolator._interpolated_dtype().

Contributor Author

Do you mean to put this back as a (static)method of the RectilinearInterpolator? Or to merge the body of this function with RectilinearInterpolator._interpolate?
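For reference, a quick illustration of the promotion rule in the quoted helper; _DEFAULT_DTYPE is assumed to be np.float64 here purely for the example (Iris defines its own default):

import numpy as np

_DEFAULT_DTYPE = np.float64
print(np.result_type(_DEFAULT_DTYPE, np.dtype("int16")))  # float64 (non-"nearest" methods)
print(np.dtype("int16"))                                   # unchanged for method="nearest"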

Comment on lines +335 to +338
# Determine the shape of the interpolated result.
ndims_interp = len(interp_shape)
extra_shape = data.shape[ndims_interp:]
final_shape = [*interp_shape, *extra_shape]
Contributor

TODO for @trexfeathers: understand the logic of reshaping, and the way it is now split between _interpolate() and _points().

Contributor

@trexfeathers trexfeathers Sep 12, 2024

@HarriKeenan after you shared my struggle last week, here is the script I used to help me eventually understand, in case you were interested:

Code
"""
Script to help understand how array shapes are handled in Iris interpolation.

Creates a scenario that is easy to follow when debugging iris.analysis._interpolation.py.
"""

import numpy as np

import iris.analysis
from iris.coords import DimCoord
from iris.cube import Cube


# Create a series of orthogonal coordinates that are obviously different - to
#  help with debugging Iris internals.
coord_names_and_points = [
    ('longitude', np.linspace(1, 9, 5)),
    ('latitude', np.linspace(10, 90, 9)),
    ('altitude', np.linspace(100, 900, 18)),
    ('time', np.linspace(1000, 9000, 36)),
]
coord_list = [
    DimCoord(points, standard_name=name)
    for name, points in coord_names_and_points
]

# Create a Cube to house the above coordinates.
#  The data is not the subject of this investigation so can be arbitrary.
shape = [len(points) for name, points in coord_names_and_points]
dimension_mapping = [
    (c, ix) for ix, c in enumerate(coord_list)
]  # [(time, 0), (height, 1), ...]
cube = Cube(
    np.arange(np.prod(shape)).reshape(shape),
    dim_coords_and_dims=dimension_mapping,
)
print(cube)
# unknown / (unknown)                 (longitude: 5; latitude: 9; altitude: 18; time: 36)
#     Dimension coordinates:
#         longitude                             x            -            -         -
#         latitude                              -            x            -         -
#         altitude                              -            -            x         -
#         time                                  -            -            -         x
# Perform interpolation over multiple dimensions, but NOT all the dimensions
#  of the Cube, so we can see how the most complex cases are handled when
#  debugging.
# So we're estimating the values that would appear at:
#  (3.5, 15), (3.5, 25), (3.5, 75), (8.5, 15), (8.5, 25), (8.5, 75)
sampling_points = [("longitude", [3.5, 8.5]), ("latitude", [15, 25, 75])]
interpolated_cube = cube.interpolate(sampling_points, iris.analysis.Linear())
print(interpolated_cube)
# unknown / (unknown)                 (longitude: 2; latitude: 3; altitude: 18; time: 36)
#     Dimension coordinates:
#         longitude                             x            -            -         -
#         latitude                              -            x            -         -
#         altitude                              -            -            x         -
#         time                                  -            -            -         x

lib/iris/analysis/_interpolation.py (resolved)
Comment on lines +355 to +356
# The interpolated result has now shape "points_shape + extra_shape"
# where "points_shape" is the leading dimension of "interp_points"
Contributor

Has there been some renaming since this was written? points_shape -> interp_shape?

Contributor Author

The renaming of the shape in the docstring comes from the fact that I have moved some of the reshaping of the interpolated results from outside to inside _interpolate() (see comment), so the shape of the data that _interpolate() returns is actually different.

Here, I am describing the reshaping that takes place at this point: bringing the data from shape "points_shape + extra_shape" (as returned by interpolator()) to shape "interp_shape + extra_shape" (= "final_shape").

result = self._interpolator(interp_points)
# The interpolated result has now shape "points_shape + extra_shape"
# where "points_shape" is the leading dimension of "interp_points"
# (i.e. 'interp_points.shape[:-1]'). We reshape it to match the shape
Contributor

This is not true if there is more than 1 dimension not being interpolated (i.e. len(extra_shape) > 1).

Contributor Author

Uhm, I have looked at the example you have shared in the comment above (very useful, thanks!), where you have 4 dimensions, 2 of which are not interpolated.

At this point of the code, I get the following shapes:

interp_points.shape
# (6, 2) --> 6 points with 2 coordinates each

interp_points.shape[:-1]
# (6, ) --> this is "points_shape"

extra_shape
# (18, 36)

result.shape
# (6, 18, 36) --> thus equal to "points_shape + extra_shape"

Or am I missing something? Maybe I should rephrase the comment in the following way? I have just used the same description as previously provided in the docstring.

# The interpolated result has now shape "(num_points, ) + extra_shape"
# where "num_points" is the number of points for which we are carrying
# out the interpolation. We reshape it to match the shape of the
# interpolated dimensions.
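A tiny numpy sketch of the reshape being described, using the shapes from the example above:

import numpy as np

interp_shape = (2, 3)    # longitude x latitude sample points
extra_shape = (18, 36)   # altitude x time, not interpolated
result = np.zeros((6, *extra_shape))                   # "points_shape + extra_shape", 6 = 2 * 3
final = result.reshape((*interp_shape, *extra_shape))
print(final.shape)       # (2, 3, 18, 36) -> "interp_shape + extra_shape" = "final_shape"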


# Interpolate the data, merging the chunks in the interpolated
# dimensions.
dims_merge_chunks = [dmap[d] for d in di]
result = map_complete_blocks(
Contributor

TODO for @trexfeathers: confirm that this use of map_complete_blocks() does not suffer the memory leaks described in #5767.

Contributor

TODO for @trexfeathers: review the tests.

Contributor Author

@fnattino fnattino left a comment

@trexfeathers thanks a lot for the review and apologies for the scattered response.

As I have tried to explain in my replies to your comments, I am a bit hesitant to implement the solution that would copy the current instance of the RectilinearInterpolator - but I am very curious to hear your thoughts on this!
