Merge partition chunks #253

Merged
merged 1 commit into main from merge_partition_chunks on Jul 9, 2024
Conversation

@Huite Huite (Collaborator) commented Jul 9, 2024

This addresses #252.

Also related to: Deltares/dfm_tools#679 >> JV: this is something different; it does not occur with merging.
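
A minimal usage sketch (the file pattern here is hypothetical; the merge_ugrid_chunks keyword is the one exercised in the review below):

import glob
import xugrid as xu

# open all partitions lazily, one dataset per partition file
partitions = [
    xu.open_dataset(fn, chunks={"time": 1})
    for fn in sorted(glob.glob("DFM_OUTPUT/model_0*_map.nc"))
]
# merge the partitions into a single dataset; merging the many small
# dask chunks that arise along the grid dimension is what this PR adds
merged = xu.merge_partitions(partitions, merge_ugrid_chunks=True)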

@Huite Huite requested review from veenstrajelmer and JoerivanEngelen and removed request for veenstrajelmer July 9, 2024 09:10
@veenstrajelmer veenstrajelmer linked an issue Jul 9, 2024 that may be closed by this pull request
@Huite Huite merged commit 292bbf8 into main Jul 9, 2024
11 checks passed
@Huite Huite deleted the merge_partition_chunks branch July 9, 2024 09:34
@veenstrajelmer veenstrajelmer (Collaborator) left a comment

I did some very limited performance checking and will do more later. So far I see minor performance improvements. Do note that I use chunks = {'time':1}; this gives a significant performance improvement for opening/merging the dataset, but it causes serious performance reductions when reducing or retrieving over the time dimension (see the sketch after the script below).

Old:

merge_ugrid_chunks: False
>> xu.open_dataset() with 8 partition(s): 1 2 3 4 5 6 7 8 : 0.84 sec
>> xu.merge_partitions() with 8 partition(s): 2.87 sec
>> plotting: 2.08 sec

versus new:

merge_ugrid_chunks: True
>> xu.open_dataset() with 8 partition(s): 1 2 3 4 5 6 7 8 : 1.59 sec
>> xu.merge_partitions() with 8 partition(s): 3.45 sec
>> plotting: 1.93 sec

This is for the RMM 2D dataset processed with the code below:

import glob
import datetime as dt
import xugrid as xu
import matplotlib.pyplot as plt
plt.close('all')

# file_nc = 'p:\\1204257-dcsmzuno\\2006-2012\\3D-DCSM-FM\\A18b_ntsu1\\DFM_OUTPUT_DCSM-FM_0_5nm\\DCSM-FM_0_5nm_0*_map.nc' #3D DCSM
file_nc = 'p:\\archivedprojects\\11206813-006-kpp2021_rmm-2d\\C_Work\\31_RMM_FMmodel\\computations\\model_setup\\run_207\\results\\RMM_dflowfm_0*_map.nc' #RMM 2D
# file_nc = 'p:\\1230882-emodnet_hrsm\\GTSMv5.0\\runs\\reference_GTSMv4.1_wiCA_2.20.06_mapformat4\\output\\gtsm_model_0*_map.nc' #GTSM 2D
# file_nc = 'p:\\11208053-005-kpp2022-rmm3d\\C_Work\\01_saltiMarlein\\RMM_2019_computations_02\\computations\\theo_03\\DFM_OUTPUT_RMM_dflowfm_2019\\RMM_dflowfm_2019_0*_map.nc' #RMM 3D
# file_nc = 'p:\\archivedprojects\\11203379-005-mwra-updated-bem\\03_model\\02_final\\A72_ntsu0_kzlb2\\DFM_OUTPUT_MB_02\\MB_02_0*_map.nc' #MWRA 3D

# Old timings from 2023 or so (xu.open_dataset/xu.merge_partitions):
# - DCSM 3D 20 partitions  367 timesteps: 231.5/ 4.5 sec (decode_times=False: 229.0 sec)
# - RMM  2D  8 partitions  421 timesteps:  55.4/ 4.4 sec (decode_times=False:  56.6 sec)
# - GTSM 2D  8 partitions  746 timesteps:  71.8/30.0 sec (decode_times=False: 204.8 sec)
# - RMM  3D 40 partitions  146 timesteps: 168.8/ 6.3 sec (decode_times=False: 158.4 sec)
# - MWRA 3D 20 partitions 2551 timesteps:  74.4/ 3.4 sec (decode_times=False:  79.0 sec)

file_nc_list = glob.glob(file_nc)
# chunks = 'auto' # ValueError: Object has inconsistent chunks along dimension time. This can be fixed by calling unify_chunks().
chunks = {'time':1}

merge_ugrid_chunks = True
print('merge_ugrid_chunks:', merge_ugrid_chunks)

# open each partition lazily, with the requested chunks
print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ',end='')
dtstart = dt.datetime.now()
partitions = []
for iF, file_nc_one in enumerate(file_nc_list):
    print(iF+1,end=' ')
    uds = xu.open_dataset(file_nc_one, chunks=chunks)
    partitions.append(uds)
print(': ',end='')
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

# merge the partitions into a single dataset
print(f'>> xu.merge_partitions() with {len(file_nc_list)} partition(s): ',end='')
dtstart = dt.datetime.now()
ds_merged_xu = xu.merge_partitions(partitions, merge_ugrid_chunks=merge_ugrid_chunks)
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

# plot the water level of the last timestep
print('>> plotting: ',end='')
dtstart = dt.datetime.now()
ds_merged_xu.mesh2d_s1.isel(time=-1).ugrid.plot()
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
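
To illustrate the caveat about chunks = {'time':1} above, a reduction over the time dimension could be timed as well (a sketch continuing from the script; the mean is an arbitrary choice of reduction):

print('>> mean over time: ',end='')
dtstart = dt.datetime.now()
# with time chunks of size 1 this touches one dask chunk per timestep,
# which is exactly where the chosen chunking is expected to be slow
ds_merged_xu['mesh2d_s1'].mean('time').compute()
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')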

@Huite Huite (Collaborator, Author) commented Jul 9, 2024

It might be worthwhile to also check something like a reindex for completeness, since that's definitely one of the cases where the previous implementation was worst.
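
A sketch of such a check, continuing from the script above (hedged: the .obj attribute for the wrapped xarray Dataset and the dimension name 'mesh2d_nFaces' follow common xugrid / D-Flow FM conventions and may differ per model):

import numpy as np

print('>> reindex on the face dimension: ',end='')
dtstart = dt.datetime.now()
ds = ds_merged_xu.obj  # the wrapped plain-xarray Dataset (assumed attribute)
# an arbitrary permutation of the faces; with many tiny chunks along the
# merged grid dimension this kind of reindex produced many small dask tasks
order = np.random.permutation(ds.sizes['mesh2d_nFaces'])
ds.isel(mesh2d_nFaces=order)['mesh2d_s1'].isel(time=-1).compute()
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')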

@veenstrajelmer veenstrajelmer (Collaborator) commented

After our discussion I tried to plot a transect through one partition of the merged model. This performance is indeed worse than before. With the RMM example from above:

import dfm_tools as dfmt
import numpy as np
# continues from the script above (uses ds_merged_xu, dt and plt)
line_array = np.array([[ 42546.47095912, 483810.44968039],
       [ 67475.42610872, 491926.42350426],
       [ 86985.04318231, 478574.33753595],
       [ 77230.23464551, 467054.8908182 ],
       [101797.90059004, 469672.94689042]])
print('>> plotting transect: ',end='')
dtstart = dt.datetime.now()
# slice the merged dataset along the polyline and plot the result
uds_crs = dfmt.polyline_mapslice(ds_merged_xu.isel(time=3), line_array)
fig, ax = plt.subplots()
uds_crs['mesh2d_sa1'].ugrid.plot(cmap='jet')
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

This action also seems to be faster with the new method, or at least as fast.

Successfully merging this pull request may close these issues: Automatically resize chunks after merging parallel results