Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lazy rectilinear interpolator #6084
base: main
Are you sure you want to change the base?
Lazy rectilinear interpolator #6084
Changes from all commits
b77af1d
0021cb0
18908ae
47a8599
555f3c7
7a08108
c453e01
3814383
0c5dc9a
3481e46
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this needs to stay (see my other comment about
args=[self]
- #6084 (comment)), then I'd be interested in us unifying this function withRectilinearInterpolator._interpolated_dtype()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to put this back as a (static)method of the
RectilinearInterpolator
? Or to merge the body of this function withRectilinearInterpolator._interpolate
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_interpolate
is now mapped to all the chunks of the data we are interpolating, thus we cannot cache the_RegularGridInterpolator
instance anymore. Is this acceptable?One can still cache the
RectilinearInterpolator
thought, and, as far as I can say, this would be similar to the way in which the caching of the regridder works?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having read through the code base: I believe the caching is because the same interpolation routine gets used on a series of coordinates - start here and follow the call stack down to
_points()
:iris/lib/iris/analysis/_interpolation.py
Lines 656 to 658 in 4c73724
... meaning multiple uses of
_RegularGridInterpolator
but with only.values
changing. I assume that there is noticeable overhead when creating a new instance of_RegularGridInterpolator
, but that there is no problem applying the same instance to a series of different arrays, hence the caching.There is a risk that your refactoring - which makes the routine much faster when parallelising large datasets - will make small-to-medium cases slower, especially for those cases that need to interpolate a larger number of coordinates. So I am very keen that we try out my suggestion using
args=[self]
(#6084 (comment)), which I believe would allow us to retain the existing caching? If that does not work then I'll need to assess the performance impact with some benchmarking.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. But I am slightly afraid the
args=[self]
approach could be problematic due to the reference to the full cube being sent to all workers (see answer to your other comment below)..So I had a closer look at the
_RegularGridInterpolator
: It looks to me that only few checks are carried out when initializing the object, while all the expensive tasks (calculation of weights, interpolation) only takes place when calling interpolator:iris/lib/iris/analysis/_scipy_interpolate.py
Lines 155 to 156 in e1f5a8b
I have measured some timings based on the example you provided in the comment above, using it as a test case for a small dataset:
Code (run into Jupyter to benchmark timings)
So the time to instantiate the interpolator is quite small (<5%) compared to the total time for interpolation, even for such a small dataset. So, I am actually wondering whether caching the interpolator ultimately brings much benefits - but maybe I am not looking at a suitable test case? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO for @trexfeathers: understand the logic of reshaping, and the way it is now split between
_interpolate()
and_points()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HarriKeenan after you shared my struggle last week, here is the script I used to help me eventually understand, in case you were interested:
Expand to view code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has there been some renaming since this was written?
points_shape
->interp_shape
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The renaming of the shape in the docstring comes from the fact that I have moved some of the reshaping of the interpolated results from outside to inside
_interpolate()
(see comment), so the shape of the data that_interpolate()
returns is actually different.Here, I am describing the reshaping that takes place here: bringing data from shape
"points_shape + extra_shape"
(as returned byinterpolator()
) to shape"interp_shape + extra_shape"
(="final_shape"
) .There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not true if there is more than 1 dimension not being interpolated (i.e.
len(extra_shape) > 1
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, I have looked at the example you have shared in the comment above (very useful, thanks!), where you have 4 dimensions, 2 of which are not interpolated.
At this point of the code, I get the following shapes:
Or am I missing something? Maybe I should rephrase the comment in the following way? I have just used the same description as previously provided in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO for @trexfeathers: confirm that this use of
map_complete_blocks()
does not suffer the memory leaks described in #5767.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using
self=self
or something like that instead? I.e. not needing to refactor_interpolate()
to be passed all these values as arguments, but instead to just access them from the existing instance ofRectilinearInterpolator
. I'm not 100% sure that this works but it would be worth trying?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More specifically I think it would be like this, since
self
is a positional argument (not a keyword argument):There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, I haven't thought about that possibility. The main reason why I have refactored
_interpolate()
is that the existing instance contains a copy of the data to be interpolated (self._src_cube
), so I thought it was safer to make_interpolate()
a static method and only pass the required input arguments.In particular, I was afraid that serializing the
RectilinearInterpolator
instance and copying it to the Dask workers might cause the full data array to be loaded and copied everywhere (even though I am not 100% sure whether this is what would actually happen).My refactored implementation gives
_interpolate()
access to the only data chunk that it should work on, so it looked "safer" to me? What do you think? But if you think we should really try to avoid this refactoring, I will give it a try to passing the instance to_interpolate()
viamap_complete_blocks
..There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO for @trexfeathers: review the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test used to check that setting up the interpolator would trigger data loading in the source cube, in order to avoid that new interpolations would have to load the same data again and again (see #1222). I have modified the test in favour of a couple of tests that should make more sense for a lazy interpolator.