
Improve performance of conservative routine #42

Open · BSchilperoort opened this issue Sep 3, 2024 · 5 comments

Labels: conservative (Issues related to the conservative regridder), enhancement (New feature or request)

Comments

BSchilperoort (Contributor) commented Sep 3, 2024

We can lessen the performance penalty significantly by doing something like notnull.any([non_grid_dims]) here, in which case we treat a cell as valid if any batch slice has valid data there. Could be a reasonable tradeoff, or a configurable argument.

Originally posted by @slevang in #39 (comment)
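For concreteness, a minimal sketch of that idea (toy shapes and a hypothetical non_grid_dims list; not the library's actual implementation):

import numpy as np
import xarray as xr

# Toy data with a batch dimension (time) plus the two regridding dims
data = xr.DataArray(
    np.random.rand(10, 4, 8),
    dims=["time", "latitude", "longitude"],
)
data = data.where(data > 0.1)  # sprinkle in some NaNs

non_grid_dims = ["time"]  # hypothetical: the batch dims not being regridded

# Current behavior (conceptually): one validity field per batch slice
mask_per_slice = data.notnull()  # dims: (time, latitude, longitude)

# Proposed tradeoff: a cell counts as valid if *any* batch slice has data,
# collapsing the mask back down to the regridding dims only
mask_collapsed = data.notnull().any(non_grid_dims)  # dims: (latitude, longitude)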

BSchilperoort added the enhancement and conservative labels on Sep 3, 2024
BSchilperoort (Contributor, Author) commented:
This is currently the source of the big performance penalty with skipna=True in any case where we have dimensions beyond the regridding dims (e.g. batched regridding over time). The issue is that with this implementation we are tracking valid_frac over all the dims, so the normalized weight matrix includes all those extra dimensions and explodes the size of the einsum operations downstream.

Originally posted by @slevang in #39 (comment)
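A toy numpy illustration of that blow-up (the exact normalization in the library differs; only the broadcasting behavior matters here):

import numpy as np

n_time, n_src, n_tgt = 1000, 64, 16  # toy sizes

weights = np.random.rand(n_tgt, n_src)      # static 2D regridding weights
valid_frac = np.random.rand(n_time, n_tgt)  # valid fraction varies per time step

# Dividing by a per-time-step valid_frac broadcasts the weights over time,
# so the normalized weight matrix carries the batch dimension too:
normalized = weights[None, :, :] / np.maximum(valid_frac[:, :, None], 1e-12)

print(weights.size)     # 1024 elements
print(normalized.size)  # 1024000 elements: n_time times larger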

slevang (Collaborator) commented Sep 20, 2024

I've been trying out the conservative method on some more realistic workloads, and found the performance comparison with xESMF not especially compelling. Here's a basic example:

import dask.array as da
import xarray as xr
import xarray_regrid

bounds = dict(south=-90, north=90, west=-180, east=180)

# Global 0.25-degree source grid
source = xarray_regrid.Grid(
    resolution_lat=0.25,
    resolution_lon=0.25,
    **bounds,
).create_regridding_dataset()

# Global 1-degree target grid
target = xarray_regrid.Grid(
    resolution_lat=1,
    resolution_lon=1,
    **bounds,
).create_regridding_dataset()

n_times = 1000

# Random data, chunked as one chunk per time step
data = da.random.random(
    size=(n_times, source.latitude.size, source.longitude.size),
    chunks=(1, -1, -1),
).astype("float32")

source = xr.DataArray(
    data,
    dims=["time", "latitude", "longitude"],
    coords={
        "time": xr.date_range("2000-01-01", periods=n_times, freq="D"),
        "latitude": source.latitude,
        "longitude": source.longitude,
    },
)

xarray-regrid:

%time source.regrid.conservative(target, skipna=False).compute()
CPU times: user 8min 59s, sys: 9min 37s, total: 18min 37s
Wall time: 44.1 s

%time source.regrid.conservative(target, skipna=True).compute()
CPU times: user 1h 6min 47s, sys: 44min 26s, total: 1h 51min 13s
Wall time: 3min 51s

vs xESMF:

import xesmf as xe

regridder = xe.Regridder(source, target, "conservative")

%time regridder(source, skipna=False).compute()
CPU times: user 4min 9s, sys: 21.8 s, total: 4min 30s
Wall time: 34.5 s

%time regridder(source, skipna=True).compute()
CPU times: user 8min, sys: 35 s, total: 8min 35s
Wall time: 1min 3s

BSchilperoort (Contributor, Author) commented:

Hm, I get much better performance on a small XPS 13 laptop (19 seconds wall time for xarray-regrid with skipna=False, 164 seconds for xESMF). What is your Dask setup? Have you tried setting up dask.distributed?

import dask.distributed

# Start a local distributed cluster and attach a client to it
client = dask.distributed.Client()

I am using the latest (unreleased) xarray-regrid code and the latest xESMF. For both regridders, all CPU threads are fully occupied during most of the benchmark run.

Dask is complaining about large graph sizes with xESMF though.

slevang (Collaborator) commented Sep 20, 2024

Interesting! Do you have opt-einsum installed? xr.dot routes to completely different routines depending on that.
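A quick, generic way to check whether it is available in your environment:

import importlib.util

# True if the opt_einsum package is importable
print(importlib.util.find_spec("opt_einsum") is not None)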

I ran these on a 32-core GCP VM, and only with the default threaded scheduler, so it's definitely worth profiling across other setups. I'll try distributed, but I wouldn't expect much difference since this is a very straightforward task graph and there is no impact from the GIL.

With dense weights I definitely see all CPUs churning at full speed, but there are a massive number of zeros in those einsums. Sparse multiplication is algorithmically less efficient per element, but we have far fewer numbers to multiply.
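As a rough sketch of that tradeoff using the pydata sparse package (toy sizes and a made-up overlap pattern, not xarray-regrid's actual weight construction):

import numpy as np
import sparse

rng = np.random.default_rng(0)
n_tgt, n_src, n_time = 100, 400, 50

# Conservative weights: each target cell overlaps only a few source cells,
# so the dense weight matrix is overwhelmingly zeros.
dense_weights = np.zeros((n_tgt, n_src))
for i in range(n_tgt):
    cols = rng.choice(n_src, size=4, replace=False)
    dense_weights[i, cols] = 0.25

sparse_weights = sparse.COO.from_numpy(dense_weights)
data = rng.random((n_src, n_time))

out_dense = dense_weights @ data
out_sparse = sparse_weights @ data
# Depending on the sparse version, the mixed product may come back sparse
if isinstance(out_sparse, sparse.COO):
    out_sparse = out_sparse.todense()

print(np.allclose(out_dense, out_sparse))  # same result either way
# The sparse version only multiplies the stored nonzeros (nnz)
print(f"nnz: {sparse_weights.nnz} vs dense size: {dense_weights.size}")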

BSchilperoort (Contributor, Author) commented Sep 20, 2024

Do you have opt-einsum installed?

I did not. Installing it also did not seem to matter.

I ran these on a 32 core GCP VM

I would have expected much better performance, then. I use a 4-core/8-thread Intel i7, and my compute time was under 40% of yours for the same code.

the default threaded scheduler

I stopped using that one after finding it less reliable (sometimes it's a lot less performant) and more difficult to debug than the distributed scheduler: https://dask-local.readthedocs.io/en/latest/setup/single-distributed.html#single-machine-dask-distributed

Sparse multiplication is algorithmically less efficient but we have a lot less numbers to multiply.

Yeah, that makes sense. I did not notice an improvement, but also no drop in performance.
