
Improve performance of conservative routine #42

Open · BSchilperoort opened this issue Sep 3, 2024 · 5 comments

Labels: conservative (Issues related to the conservative regridder), enhancement (New feature or request)

Comments

BSchilperoort (Contributor) commented Sep 3, 2024

We can lessen the performance penalty significantly by doing something like notnull.any([non_grid_dims]) here, in which case we treat a cell as valid if any batch slice has valid data there. Could be a reasonable tradeoff, or a configurable argument.

Originally posted by @slevang in #39 (comment)
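For concreteness, a minimal sketch of that idea (toy shapes and a hypothetical non_grid_dims list; not the library's actual implementation):

import numpy as np
import xarray as xr

# Toy data with a batch dimension (time) plus the two regridding dims
data = xr.DataArray(
    np.random.rand(10, 4, 8),
    dims=["time", "latitude", "longitude"],
)
data = data.where(data > 0.1)  # sprinkle in some NaNs

non_grid_dims = ["time"]  # hypothetical: the batch dims not being regridded

# Current behavior (conceptually): one validity field per batch slice
mask_per_slice = data.notnull()  # dims: (time, latitude, longitude)

# Proposed tradeoff: a cell counts as valid if *any* batch slice has data,
# collapsing the mask back down to the regridding dims only
mask_collapsed = data.notnull().any(non_grid_dims)  # dims: (latitude, longitude)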

BSchilperoort added the enhancement and conservative labels on Sep 3, 2024
BSchilperoort (Contributor, Author) commented:
This is currently the source of the big performance penalty with skipna=True in any case where we have dimensions beyond the regridding dims (e.g. batched regridding over time). The issue is that with this implementation we are tracking valid_frac over all the dims, so the normalized weight matrix includes all those extra dimensions and explodes the size of the einsum operations downstream.

Originally posted by @slevang in #39 (comment)
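A toy numpy illustration of that blow-up (the exact normalization in the library differs; only the broadcasting behavior matters here):

import numpy as np

n_time, n_src, n_tgt = 1000, 64, 16  # toy sizes

weights = np.random.rand(n_tgt, n_src)      # static 2D regridding weights
valid_frac = np.random.rand(n_time, n_tgt)  # valid fraction varies per time step

# Dividing by a per-time-step valid_frac broadcasts the weights over time,
# so the normalized weight matrix carries the batch dimension too:
normalized = weights[None, :, :] / np.maximum(valid_frac[:, :, None], 1e-12)

print(weights.size)     # 1024 elements
print(normalized.size)  # 1024000 elements: n_time times larger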

slevang (Collaborator) commented Sep 20, 2024

I've been trying out the conservative method on some more realistic workloads, and found the performance comparison with xESMF not especially compelling. Here's a basic example:

import dask.array as da
import xarray as xr
import xarray_regrid

bounds = dict(south=-90, north=90, west=-180, east=180)

# Global 0.25-degree source grid
source = xarray_regrid.Grid(
    resolution_lat=0.25,
    resolution_lon=0.25,
    **bounds,
).create_regridding_dataset()

# Global 1-degree target grid
target = xarray_regrid.Grid(
    resolution_lat=1,
    resolution_lon=1,
    **bounds,
).create_regridding_dataset()

n_times = 1000

# Random data, chunked as one chunk per time step
data = da.random.random(
    size=(n_times, source.latitude.size, source.longitude.size),
    chunks=(1, -1, -1),
).astype("float32")

source = xr.DataArray(
    data,
    dims=["time", "latitude", "longitude"],
    coords={
        "time": xr.date_range("2000-01-01", periods=n_times, freq="D"),
        "latitude": source.latitude,
        "longitude": source.longitude,
    },
)

xarray-regrid:

%time source.regrid.conservative(target, skipna=False).compute()
CPU times: user 8min 59s, sys: 9min 37s, total: 18min 37s
Wall time: 44.1 s

%time source.regrid.conservative(target, skipna=True).compute()
CPU times: user 1h 6min 47s, sys: 44min 26s, total: 1h 51min 13s
Wall time: 3min 51s

vs xESMF:

import xesmf as xe

regridder = xe.Regridder(source, target, "conservative")

%time regridder(source, skipna=False).compute()
CPU times: user 4min 9s, sys: 21.8 s, total: 4min 30s
Wall time: 34.5 s

%time regridder(source, skipna=True).compute()
CPU times: user 8min, sys: 35 s, total: 8min 35s
Wall time: 1min 3s

BSchilperoort (Contributor, Author) commented:

Hm, I get much better performance on a small XPS 13 laptop (19 seconds wall time for xarray-regrid with skipna=False, 164 seconds for xESMF). What is your Dask setup? Have you tried setting up dask.distributed?

import dask.distributed

# Start a local distributed cluster and attach a client to it
client = dask.distributed.Client()

I am using the latest (unreleased) xarray-regrid code and the latest xESMF. For both regridders, all CPU threads are fully occupied during most of the benchmark run.

Dask is complaining about large graph sizes with xESMF though.

slevang (Collaborator) commented Sep 20, 2024

Interesting! Do you have opt-einsum installed? xr.dot routes to completely different routines depending on that.
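A quick, generic way to check whether it is available in your environment:

import importlib.util

# True if the opt_einsum package is importable
print(importlib.util.find_spec("opt_einsum") is not None)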

I ran these on a 32-core GCP VM, and only with the default threaded scheduler, so it's definitely worth profiling across other setups. I'll try distributed, but I wouldn't expect much difference since this is a very straightforward task graph and there is no impact from the GIL.

With dense weights I definitely see all CPUs churning at full speed, but there are a massive number of zeros in those einsums. Sparse multiplication is algorithmically less efficient per element, but we have far fewer numbers to multiply.
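As a rough sketch of that tradeoff using the pydata sparse package (toy sizes and a made-up overlap pattern, not xarray-regrid's actual weight construction):

import numpy as np
import sparse

rng = np.random.default_rng(0)
n_tgt, n_src, n_time = 100, 400, 50

# Conservative weights: each target cell overlaps only a few source cells,
# so the dense weight matrix is overwhelmingly zeros.
dense_weights = np.zeros((n_tgt, n_src))
for i in range(n_tgt):
    cols = rng.choice(n_src, size=4, replace=False)
    dense_weights[i, cols] = 0.25

sparse_weights = sparse.COO.from_numpy(dense_weights)
data = rng.random((n_src, n_time))

out_dense = dense_weights @ data
out_sparse = sparse_weights @ data
# Depending on the sparse version, the mixed product may come back sparse
if isinstance(out_sparse, sparse.COO):
    out_sparse = out_sparse.todense()

print(np.allclose(out_dense, out_sparse))  # same result either way
# The sparse version only multiplies the stored nonzeros (nnz)
print(f"nnz: {sparse_weights.nnz} vs dense size: {dense_weights.size}")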

BSchilperoort (Contributor, Author) commented Sep 20, 2024

Do you have opt-einsum installed?

I did not. Installing it also did not seem to matter.

I ran these on a 32 core GCP VM

I would have expected much better performance, then. I use a 4-core/8-thread Intel i7, and my compute time was under 40% of yours for the same code.

the default threaded scheduler

I stopped using that one after finding it less reliable (sometimes it's a lot less performant) and more difficult to debug than the distributed scheduler: https://dask-local.readthedocs.io/en/latest/setup/single-distributed.html#single-machine-dask-distributed

Sparse multiplication is algorithmically less efficient but we have a lot less numbers to multiply.

Yeah, that makes sense. I did not notice an improvement, but also no drop in performance.
