Merge partition chunks #253
I did some very limited performance checking and will do more later. So far I see minor performance improvements. Please note that I use `chunks={'time': 1}`, which gives a significant performance improvement for opening/merging the dataset, but causes serious performance reductions when reducing/retrieving over the time dimension.
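To illustrate that tradeoff, a minimal sketch with a synthetic xarray/dask array (not the actual map files): `chunks={'time': 1}` produces one small chunk per timestep, so opening and merging stay cheap, but any reduction over `time` has to touch every chunk.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one map variable: 4 timesteps, 6 faces.
da = xr.DataArray(np.zeros((4, 6)), dims=("time", "face"))

# chunks={'time': 1}: one chunk per timestep -> cheap to open/merge,
# but a reduction over 'time' must combine 4 separate chunks.
per_step = da.chunk({"time": 1})
print(per_step.chunks)  # ((1, 1, 1, 1), (6,))

# A single chunk along 'time' makes time reductions one-shot instead.
whole_time = da.chunk({"time": -1})
print(whole_time.chunks)  # ((4,), (6,))
```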
Old:

```
merge_ugrid_chunks: False
>> xu.open_dataset() with 8 partition(s): 1 2 3 4 5 6 7 8 : 0.84 sec
>> xu.merge_partitions() with 8 partition(s): 2.87 sec
>> plotting: 2.08 sec
```

versus new:

```
merge_ugrid_chunks: True
>> xu.open_dataset() with 8 partition(s): 1 2 3 4 5 6 7 8 : 1.59 sec
>> xu.merge_partitions() with 8 partition(s): 3.45 sec
>> plotting: 1.93 sec
```
This is for the RMM 2D dataset processed with the code below:
```python
import glob
import datetime as dt
import xugrid as xu
import matplotlib.pyplot as plt

plt.close('all')

file_nc = 'p:\\1204257-dcsmzuno\\2006-2012\\3D-DCSM-FM\\A18b_ntsu1\\DFM_OUTPUT_DCSM-FM_0_5nm\\DCSM-FM_0_5nm_0*_map.nc' #3D DCSM
file_nc = 'p:\\archivedprojects\\11206813-006-kpp2021_rmm-2d\\C_Work\\31_RMM_FMmodel\\computations\\model_setup\\run_207\\results\\RMM_dflowfm_0*_map.nc' #RMM 2D
# file_nc = 'p:\\1230882-emodnet_hrsm\\GTSMv5.0\\runs\\reference_GTSMv4.1_wiCA_2.20.06_mapformat4\\output\\gtsm_model_0*_map.nc' #GTSM 2D
# file_nc = 'p:\\11208053-005-kpp2022-rmm3d\\C_Work\\01_saltiMarlein\\RMM_2019_computations_02\\computations\\theo_03\\DFM_OUTPUT_RMM_dflowfm_2019\\RMM_dflowfm_2019_0*_map.nc' #RMM 3D
# file_nc = 'p:\\archivedprojects\\11203379-005-mwra-updated-bem\\03_model\\02_final\\A72_ntsu0_kzlb2\\DFM_OUTPUT_MB_02\\MB_02_0*_map.nc'

# Old timings from 2023 or so (xu.open_dataset/xu.merge_partitions):
# - DCSM 3D 20 partitions 367 timesteps: 231.5/ 4.5 sec (decode_times=False: 229.0 sec)
# - RMM  2D  8 partitions 421 timesteps:  55.4/ 4.4 sec (decode_times=False:  56.6 sec)
# - GTSM 2D  8 partitions 746 timesteps:  71.8/30.0 sec (decode_times=False: 204.8 sec)
# - RMM  3D 40 partitions 146 timesteps: 168.8/ 6.3 sec (decode_times=False: 158.4 sec)
# - MWRA 3D 20 partitions 2551 timesteps: 74.4/ 3.4 sec (decode_times=False:  79.0 sec)

file_nc_list = glob.glob(file_nc)
# chunks = 'auto' # ValueError: Object has inconsistent chunks along dimension time. This can be fixed by calling unify_chunks().
chunks = {'time': 1}
merge_ugrid_chunks = True
print('merge_ugrid_chunks:', merge_ugrid_chunks)

print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ', end='')
dtstart = dt.datetime.now()
partitions = []
for iF, file_nc_one in enumerate(file_nc_list):
    print(iF + 1, end=' ')
    uds = xu.open_dataset(file_nc_one, chunks=chunks)
    partitions.append(uds)
print(': ', end='')
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

print(f'>> xu.merge_partitions() with {len(file_nc_list)} partition(s): ', end='')
dtstart = dt.datetime.now()
ds_merged_xu = xu.merge_partitions(partitions, merge_ugrid_chunks=merge_ugrid_chunks)
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

print('>> plotting: ', end='')
dtstart = dt.datetime.now()
ds_merged_xu.mesh2d_s1.isel(time=-1).ugrid.plot()
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
```
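The `chunks='auto'` ValueError noted in the commented line above can usually be resolved as the message suggests. A minimal sketch (synthetic data, not the actual map files) of reproducing and fixing inconsistent chunks with xarray's `Dataset.unify_chunks`:

```python
import numpy as np
import xarray as xr

# Two variables sharing 'time' but chunked differently.
a = xr.DataArray(np.zeros(6), dims="time").chunk({"time": 2})
b = xr.DataArray(np.zeros(6), dims="time").chunk({"time": 3})
ds = xr.Dataset({"a": a, "b": b})

try:
    ds.chunks  # raises: inconsistent chunks along dimension 'time'
except ValueError as err:
    print(err)

ds = ds.unify_chunks()  # rechunks both variables to a common chunk grid
print(ds.chunks["time"])
```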
Might be worthwhile to check something like a
After our discussion I tried to plot a transect through one partition of the merged model. This performance is indeed worse than before. With the RMM example from above:

```python
import dfm_tools as dfmt
import numpy as np

line_array = np.array([[ 42546.47095912, 483810.44968039],
                       [ 67475.42610872, 491926.42350426],
                       [ 86985.04318231, 478574.33753595],
                       [ 77230.23464551, 467054.8908182 ],
                       [101797.90059004, 469672.94689042]])

print('>> plotting transect: ', end='')
dtstart = dt.datetime.now()
uds_crs = dfmt.polyline_mapslice(ds_merged_xu.isel(time=3), line_array)
fig, ax = plt.subplots()
uds_crs['mesh2d_sa1'].ugrid.plot(cmap='jet')
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
```

Also this action seems to be faster with the new method, or at least as fast.
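The repeated `dtstart = dt.datetime.now()` / `print(... total_seconds ...)` pattern in the snippets above could be collapsed into a small helper. A sketch using only the standard library (the `timer` name is mine, not part of xugrid or dfm_tools):

```python
import datetime as dt
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print '<label>: <elapsed> sec' when the with-block finishes."""
    start = dt.datetime.now()
    try:
        yield
    finally:
        elapsed = (dt.datetime.now() - start).total_seconds()
        print(f'{label}: {elapsed:.2f} sec')

# Usage, e.g. around the plotting step:
with timer('>> plotting'):
    sum(range(1000))  # stand-in for the actual ugrid.plot() call
```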
This addresses #252
Also related to: Deltares/dfm_tools#679 >> JV: that is something different, not occurring with merging.