Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #134 merge partitions with inconsistent grids amongst partitions #216

Merged
merged 27 commits into from
Feb 14, 2024

Conversation

JoerivanEngelen
Copy link
Contributor

@JoerivanEngelen JoerivanEngelen commented Feb 13, 2024

Fix #134 and #86,

Add support for merging partitions with different grids in each partition, for example 1D grid in one partition and 2D grids in each. Quite some refactoring was required:

  • groups grids, data_objects, and other_vars by gridname
  • all partitions with a certain grid should have the same variables, this is validated
  • connectivity dims are padded to maximum size if not consistent
  • add property to AbstractUgrid to get connectivity dimension names

@JoerivanEngelen JoerivanEngelen added the enhancement New feature or request label Feb 13, 2024
@JoerivanEngelen JoerivanEngelen self-assigned this Feb 13, 2024
@veenstrajelmer
Copy link
Collaborator

veenstrajelmer commented Feb 14, 2024

Tested with both examples from #134 and #86 and they both work. However, I do notice odd behaviour. When using xu.open_mfdataset the merging takes 0.3 seconds. When using xu.open_dataset the merging takes >260 seconds:

import glob
import xugrid as xu
import datetime as dt

file_nc = r'p:\1230882-emodnet_hrsm\GTSMv5.0\SO_NHrivGTSM\computations\BD014_fix_mapformat4\output\gtsm_model_0*_map.nc'
file_nc_list = glob.glob(file_nc)
partitions = []
for iF, file_nc_one in enumerate(file_nc_list):
    uds_part = xu.open_mfdataset(file_nc_one)
    partitions.append(uds_part)
dtstart = dt.datetime.now()
uds = xu.merge_partitions(partitions)
print(f'merging took: {(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

Using xarray 2023.11.0

@Huite
Copy link
Collaborator

Huite commented Feb 14, 2024

I guess you can also try loading the datasets with just xr.open_dataset, then turning them into UgridDatasets, then merging them.
That will give the same result, I'm fairly sure. The xugrid.open_mfdataset just adds the data_vars kwarg then calls xr.open_mfdataset.
From the source, parallel=False by default in the mf_opendataset call and that triggers some dask usage or not -- so that shouldn't make a difference here. But maybe there's something more subtle going on as well.

Is there maybe a difference between the eagerness? If you call .compute() in the timing, are the timings the same?

@veenstrajelmer
Copy link
Collaborator

veenstrajelmer commented Feb 14, 2024

@Huite, thanks for your suggestions. Indeed, the combination of ds=xr.open_dataset() and xu.UgridDataset(ds) behaves the same. And your second suggestion, to call .compute() significantly increases the performance indeed (0.8 sec). This also triggered me to remember the importance of the chunks argument (chunks={"time":1} is the default in dfm_tools), this also worked as a charm (0.5 sec). So I guess we can conclude there was no additional problem introduced with this PR, and the PR also provides what it aims to do.

Copy link
Collaborator

@veenstrajelmer veenstrajelmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing this. There seemed to bee some issues, but input from @Huite I think we can conclude this PR works like a charm.

docs/changelog.rst Outdated Show resolved Hide resolved
docs/changelog.rst Show resolved Hide resolved
docs/changelog.rst Outdated Show resolved Hide resolved

Changed
~~~~~~~

- :meth:`xugrid.Ugrid2d.from_structured` now takes ``x`` and ``y`` arguments instead
of ``x_bounds`` and ``y_bounds`` arguments.
- :func:`xugrid.merge_partitions` allows merging partitions with different grids
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, this is added under fixed and changed? Maybe just mention it only for "fixed"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit in doubt:

I added it to changed, because it is a change of behavior: Prior to this changeset, we test explicitly if all grids occur in every partitions and throw a error if this is not the case. Therefore this changeset doesn't really "fix" a bug, it alters behavior.

On the other hand, the changeset brings xugrid closer to supporting UGRID conventions, so if the premise of xugrid is to support every UGRID file (at least: those without mistakes), you can argue this is a "fix".

For now, I just kept the text under "changed".

xugrid/ugrid/partitioning.py Outdated Show resolved Hide resolved
xugrid/ugrid/partitioning.py Outdated Show resolved Hide resolved
xugrid/ugrid/partitioning.py Outdated Show resolved Hide resolved
xugrid/ugrid/partitioning.py Outdated Show resolved Hide resolved
xugrid/ugrid/partitioning.py Outdated Show resolved Hide resolved
xugrid/ugrid/partitioning.py Show resolved Hide resolved
@Huite Huite merged commit 16b6c9a into main Feb 14, 2024
5 checks passed
@Huite Huite deleted the issue_134_merge_partitions_1D2D branch February 14, 2024 19:19
@Huite
Copy link
Collaborator

Huite commented Feb 14, 2024

Great job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

merge_partitions fails for datasets with multiple topologies
3 participants