
move computation of wet_fac to post_init #22

Merged
merged 1 commit into ocean-eddy-cpt:master on Feb 11, 2021

Conversation

NoraLoose
Member

This PR does the following things:

In CartesianLaplacianWithLandMask:

  • moves the computation of fac (now: wet_fac) to post_init. This factor is independent of the field the Laplacian is applied to, so if we apply the Laplacian many times (as we will for the filtering operation), computing the factor only once should speed things up.
  • deletes one unnecessary line: out = field.copy()

Following @rabernat's recommendation in #21 (comment), I also disabled the mypy check.
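
To make the change concrete, here is a minimal sketch of the idea (illustrative only; the real gcm_filters kernel also dispatches between NumPy and CuPy via get_array_module, and its stencil details may differ): wet_fac depends only on wet_mask, so it can be computed once in the dataclass's __post_init__ rather than on every application of the Laplacian.

from dataclasses import dataclass
import numpy as np

@dataclass
class CartesianLaplacianWithLandMask:
    # wet_mask: 1 over ocean points, 0 over land points
    wet_mask: np.ndarray

    def __post_init__(self):
        # number of wet neighbours of each cell; depends only on the grid,
        # not on the field being filtered, so compute it once here
        mask = self.wet_mask
        self.wet_fac = (
            np.roll(mask, -1, axis=-1)
            + np.roll(mask, 1, axis=-1)
            + np.roll(mask, -1, axis=-2)
            + np.roll(mask, 1, axis=-2)
        )

    def __call__(self, field):
        # masked 5-point Laplacian; reuses the precomputed wet_fac
        out = self.wet_mask * field
        out = (
            np.roll(out, -1, axis=-1)
            + np.roll(out, 1, axis=-1)
            + np.roll(out, -1, axis=-2)
            + np.roll(out, 1, axis=-2)
            - self.wet_fac * out
        )
        return self.wet_mask * out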

@rabernat
Contributor

This is great, thanks for the update! Two questions.

  • Do you know if this is actually faster? Have you tested that? Just curious...
  • Have you verified that it works on GPU? That's something our test suite currently can't do, since we don't have a GPU inside the GitHub workflow.

@NoraLoose
Member Author

Thanks for looking at this!

Have you verified that it works on GPU? That's something our test suite currently can't do, since we don't have a GPU inside the GitHub workflow.

I just tested this. Yes, it works on GPU.

Do you know if this is actually faster? Have you tested that? Just curious...

The answer is that I just assumed it would be faster. It turns out it takes pretty much the same time; see this notebook.

The good news is that gcm_filters is much faster on GPU than on CPU (as we saw before).
For the example in the notebook:

  • CPU: 3min 27.0s
  • GPU: 14.5s
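
For completeness, a toy harness along these lines (not the linked notebook; the stencil, array sizes, and field count below are made up, and CuPy with a working GPU is assumed) shows how the same code path can be timed on CPU and GPU simply by switching the array type:

import time
import numpy as np
import cupy as cp

def laplacian(field, xp):
    # simple 5-point stencil; xp is either numpy or cupy
    return (
        xp.roll(field, -1, axis=-1)
        + xp.roll(field, 1, axis=-1)
        + xp.roll(field, -1, axis=-2)
        + xp.roll(field, 1, axis=-2)
        - 4 * field
    )

def time_filter(xp, n_fields=1600, n_steps=8, shape=(512, 512)):
    field = xp.ones(shape)
    start = time.perf_counter()
    for _ in range(n_fields):
        out = field
        for _ in range(n_steps):
            out = laplacian(out, xp)
    if xp is cp:
        cp.cuda.Stream.null.synchronize()  # wait for queued GPU work to finish
    return time.perf_counter() - start

print("CPU:", time_filter(np))
print("GPU:", time_filter(cp))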

Since this new implementation is not faster, we can just disregard this PR. Sorry for the noise!

@NoraLoose
Member Author

Taking a closer look at the code in filter.py, I now understand why this PR doesn't lead to a large speedup.

In my example, I am filtering 1600 2D fields (100 time slices, 16 layers). For each of the 1600 fields, the Laplacian is applied 8 times (since I chose n_steps = 8). I initially thought that with my PR, wet_fac would be computed only once, instead of 1600 * 8 times. But this is not the case. With the following code structure, post_init() for the Laplacian (and the computation of wet_fac) is called 1600 times.

field_smooth = xr.apply_ufunc(
    filter_func,
    field,
    *grid_args,
    input_core_dims=n_args * [dims],
    output_core_dims=[dims],
    output_dtypes=[field.dtype],
    dask="parallelized",
)

In summary, while I expected to reduce 1600 * 8 computations to 1 computation, my PR led only to a reduction from 1600 * 8 to 1600 * 1 computations. Since you typically do not choose very high values for n_steps (you will actually run into numerical instabilities for large n_steps), I suggest that we just stick to the current implementation and close this PR.
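
A minimal, self-contained illustration of this point (hypothetical; not gcm_filters code) is to count how often the wrapped function runs when a chunked DataArray goes through apply_ufunc with dask="parallelized": anything constructed inside the function, like the Laplacian and its wet_fac, is rebuilt once per 2D chunk.

import dask
import numpy as np
import xarray as xr

calls = 0

def filter_func(field):
    global calls
    calls += 1    # stands in for the per-call kernel initialization
    return field  # no-op "filter"

field = xr.DataArray(
    np.zeros((100, 16, 32, 32)),
    dims=["time", "layer", "y", "x"],
).chunk({"time": 1, "layer": 1})  # one dask chunk per 2D slice

smoothed = xr.apply_ufunc(
    filter_func,
    field,
    input_core_dims=[["y", "x"]],
    output_core_dims=[["y", "x"]],
    output_dtypes=[field.dtype],
    dask="parallelized",
)

with dask.config.set(scheduler="synchronous"):  # single-threaded, so the counter is reliable
    smoothed.compute()

print(calls)  # roughly 1600: one call per 2D chunk (plus possibly a dtype/meta inference call)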

@rabernat
Contributor

rabernat commented Feb 11, 2021

With the following code structure, post_init() for the Laplacian (and the computation of wet_fac) is called 1600 times.

Thanks for digging into this. It is extremely useful information. Did you run profiling to get it? If so, can you post the profile report here?

I think we should keep this PR and think of a way to refactor the kernel so that the initialization only has to be done once. (That can be a separate PR and should probably be done by me.)

@rabernat rabernat merged commit 8b92035 into ocean-eddy-cpt:master Feb 11, 2021
@NoraLoose
Member Author

Thanks for digging into this. It is extremely useful information. Did you run profiling to get it? If so, can you post the profile report here?

I haven't done any rigorous profiling, but saw no significant difference in the execution time. That, together with looking at the code structure, led me to this conclusion. But I could be wrong.

I think I will prioritize the implementation of the Laplacian kernels for some of the gcm grids in the near future, but could get back to this performance question at a later time.
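
For the record, one way this could be confirmed without a full benchmark is to profile a single compute and look at how often the kernel constructor runs; field_smooth below stands in for the lazy filtered result from the notebook, and the single-threaded dask scheduler is used so the profiler sees work done inside the chunks.

import cProfile
import pstats
import dask

# field_smooth is assumed to be the dask-backed result of the filtering call
with dask.config.set(scheduler="synchronous"):
    cProfile.run("field_smooth.compute()", "filter.prof")

stats = pstats.Stats("filter.prof").sort_stats("cumulative")
# the ncalls column for the kernel's __init__/__post_init__ shows how often it is rebuilt
stats.print_stats("Laplacian")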

@rabernat
Contributor

rabernat commented Feb 11, 2021

That, together with looking at the code structure, led me to this conclusion

I think you are right. The kernel is initialized inside the function passed to apply_ufunc:

def filter_func(field, *args):
    # these next steps are a kind of hack we use to turn keyword arguments
    # into regular arguments; the reason for doing this is that Xarray's
    # apply_ufunc machinery works a lot better with regular arguments
    assert len(args) == len(Laplacian.required_grid_args())
    grid_vars = {k: v for k, v in zip(Laplacian.required_grid_args(), args)}
    laplacian = Laplacian(**grid_vars)
    np = get_array_module(field)

We should be able to move the initialization (Laplacian()) outside.
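
Something along these lines, sketched roughly here (not a working patch; the per-step filter coefficients are elided, and whether the kernel object can be shipped to dask workers as-is would need checking), would build the kernel once from the grid variables and let filter_func close over it:

grid_vars = {k: v for k, v in zip(Laplacian.required_grid_args(), grid_args)}
laplacian = Laplacian(**grid_vars)  # initialized once, outside filter_func

def filter_func(field):
    # laplacian (and its precomputed wet_fac) is reused for every field
    out = field
    for _ in range(n_steps):
        out = laplacian(out)  # plus the filter-specific scaling at each step
    return out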
