Improve performance of _get_topology() by not accessing dataArray each time #285

Closed
veenstrajelmer opened this issue Aug 20, 2024 · 0 comments · Fixed by #286

veenstrajelmer commented Aug 20, 2024

_get_topology loops over all data_vars:

def _get_topology(ds: xr.Dataset) -> List[str]:
    return [k for k in ds.data_vars if ds[k].attrs.get("cf_role") == "mesh_topology"]

It seems that iterating over ds.variables.items() instead of indexing with ds[k] avoids constructing a DataArray for every variable, which saves a lot of time when the dataset has many variables. The original method profiles like this:
[image: profiler output for the original _get_topology()]

When replacing the _get_topology() body with [k for k, var in ds.variables.items() if var.attrs.get("cf_role") == "mesh_topology"] or [k for k in ds.data_vars if ds.variables[k].attrs.get("cf_role") == "mesh_topology"] (the second only adds .variables and keeps the data_vars selection, while the first also iterates over coordinate variables), the profiler looks like this:
[image: profiler output after switching to ds.variables]

So the timing drops from 16 seconds to under 1 second in an example with 5 partitions. The gain should be much larger when using all 256 partitions of the dataset. Note that this case covers a dataset with 2410 variables, so the change mainly benefits datasets with many variables. Some code to reproduce:

import os
import glob
import xugrid as xu
import xarray as xr
import datetime as dt

dir_model = r"p:\11210284-011-nose-c-cycling\runs_fine_grid\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\DFM_OUTPUT_DCSM-FM_0_5nm_waq"
file_nc_pat = os.path.join(dir_model, "DCSM-FM_0_5nm_waq_0*_map.nc")
file_nc_list_all = glob.glob(file_nc_pat)
file_nc_list = file_nc_list_all[:5]

# Open each partition lazily, wrap it as a UgridDataset and time the whole loop
print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ',end='')
dtstart = dt.datetime.now()
partitions = []
for iF, file_nc_one in enumerate(file_nc_list):
    print(iF+1,end=' ')
    ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
    uds_one = xu.core.wrap.UgridDataset(ds_one)
    partitions.append(uds_one)
print(': ',end='')
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
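
The two access paths can also be compared without the model files used above. The following is a minimal sketch, not the actual fix: it assumes a synthetic dataset with many small variables and a single scalar variable carrying cf_role="mesh_topology". The helper names (get_topology_via_getitem, get_topology_via_variables, make_synthetic_dataset) and the variable name mesh2d are made up for illustration. Absolute timings will differ from the profile in this issue, since the real files have many more coordinates.

```python
import time
from typing import List

import numpy as np
import xarray as xr


def get_topology_via_getitem(ds: xr.Dataset) -> List[str]:
    # Original approach: ds[k] constructs a DataArray for every data variable.
    return [k for k in ds.data_vars if ds[k].attrs.get("cf_role") == "mesh_topology"]


def get_topology_via_variables(ds: xr.Dataset) -> List[str]:
    # Proposed approach: ds.variables[k] returns the underlying Variable directly.
    return [
        k
        for k in ds.data_vars
        if ds.variables[k].attrs.get("cf_role") == "mesh_topology"
    ]


def make_synthetic_dataset(n_vars: int = 2000) -> xr.Dataset:
    # Hypothetical stand-in for a map file: many small variables plus one
    # scalar topology variable (the name "mesh2d" is an assumption).
    data_vars = {f"var_{i}": ("x", np.zeros(3)) for i in range(n_vars)}
    data_vars["mesh2d"] = xr.Variable((), 0, attrs={"cf_role": "mesh_topology"})
    return xr.Dataset(data_vars)


if __name__ == "__main__":
    ds = make_synthetic_dataset()
    for func in (get_topology_via_getitem, get_topology_via_variables):
        start = time.perf_counter()
        result = func(ds)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {result} in {elapsed:.3f} sec")
```

Both functions should return the same list, so the change only affects how the attributes are reached, not which variables are selected.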