
HDF error in a simple parallel run #690

Open
einola opened this issue Sep 5, 2024 · 8 comments
Labels: bug (Something isn't working)

@einola
Member

einola commented Sep 5, 2024

I'm trying to run the model in a very simple parallel configuration, but the run ends with a non-zero exit code.

The setup consists of the single-column config_column.cfg configuration run on a 30 x 30 grid. This should give a spatially uniform but time-varying solution.

I created the init file using make_init_column.py, changing nfirst and nsecond to 30 (lines 3 and 4 in the file). I also changed the name of the output file to init_column_30x30.nc for clarity.

Inside the development Docker image, I run cmake .. -DENABLE_MPI=ON from a build directory, and then from the run directory I run

root@5323bdc8d33f:/nextsim/nextsimdg/run# mpirun -n 2 --allow-run-as-root ../build/nextsim --config-file=config_column.cfg --model.partition_file=partition_metadata_2.nc --model.init_file=init_column_30x30.nc 

The run completes and produces reasonable-looking diagnostic and restart files, but gives the following error:

HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5A.c line 2397 in H5Aexists(): can't synchronously check if attribute exists
    major: Attribute
    minor: Can't get value
  #001: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5A.c line 2364 in H5A__exists_api_common(): can't set object access arguments
    major: Attribute
    minor: Can't set value
  #002: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5VLint.c line 2634 in H5VL_setup_self_args(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #003: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5VLint.c line 1733 in H5VL_vol_object(): invalid identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5Adeprec.c line 134 in H5Acreate1(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5VLint.c line 1733 in H5VL_vol_object(): invalid identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5A.c line 2397 in H5Aexists(): can't synchronously check if attribute exists
    major: Attribute
    minor: Can't get value
  #001: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5A.c line 2364 in H5A__exists_api_common(): can't set object access arguments
    major: Attribute
    minor: Can't set value
  #002: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5VLint.c line 2634 in H5VL_setup_self_args(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #003: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5VLint.c line 1733 in H5VL_vol_object(): invalid identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5Adeprec.c line 134 in H5Acreate1(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: /tmp/root/spack-stage/spack-stage-hdf5-1.14.3-y3ghlib6vghmysul3bm7ew5jm4qqk3fn/spack-src/src/H5VLint.c line 1733 in H5VL_vol_object(): invalid identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
terminate called after throwing an instance of 'netCDF::exceptions::NcFileMeta'
  what():  NetCDF: Can't add HDF5 file metadata
file: ncFile.cpp  line:33
terminate called after throwing an instance of 'netCDF::exceptions::NcFileMeta'
  what():  NetCDF: Can't add HDF5 file metadata
file: ncFile.cpp  line:33
[5323bdc8d33f:00102] *** Process received signal ***
[5323bdc8d33f:00101] *** Process received signal ***
[5323bdc8d33f:00101] Signal: Aborted (6)
[5323bdc8d33f:00101] Signal code:  (-6)
[5323bdc8d33f:00102] Signal: Aborted (6)
[5323bdc8d33f:00102] Signal code:  (-6)
[5323bdc8d33f:00101] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffab48f7a0]
[5323bdc8d33f:00101] [ 1] /lib/aarch64-linux-gnu/libc.so.6(+0x7f200)[0xffffaaacf200]
[5323bdc8d33f:00101] [ 2] [5323bdc8d33f:00102] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffa80087a0]
[5323bdc8d33f:00102] [ 1] /lib/aarch64-linux-gnu/libc.so.6(+0x7f200)[0xffffa76cf200]
[5323bdc8d33f:00102] [ 2] /lib/aarch64-linux-gnu/libc.so.6(raise+0x1c)[0xffffa768a67c]
[5323bdc8d33f:00102] [ 3] /lib/aarch64-linux-gnu/libc.so.6(raise+0x1c)[0xffffaaa8a67c]
[5323bdc8d33f:00101] [ 3] /lib/aarch64-linux-gnu/libc.so.6(abort+0xe4)[0xffffa7677130]
[5323bdc8d33f:00102] [ 4] /lib/aarch64-linux-gnu/libc.so.6(abort+0xe4)[0xffffaaa77130]
[5323bdc8d33f:00101] [ 4] /lib/aarch64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x18c)[0xffffa74a62dc]
[5323bdc8d33f:00102] [ 5] /lib/aarch64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x18c)[0xffffaa8a62dc]
[5323bdc8d33f:00101] [ 5] /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa2abc)[0xffffaa8a2abc]
[5323bdc8d33f:00101] [ 6] /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa2b20)[0xffffaa8a2b20]
[5323bdc8d33f:00101] [ 7] /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa2e04)[0xffffaa8a2e04]
[5323bdc8d33f:00101] [ 8] /opt/views/view/lib/libnetcdf_c++4.so.1(_ZN6netCDF7ncCheckEiPKci+0xbf0)[0xffffab355fc0]
[5323bdc8d33f:00101] [ 9] /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa2abc)[0xffffa74a2abc]
[5323bdc8d33f:00102] [ 6] /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa2b20)[0xffffa74a2b20]
[5323bdc8d33f:00102] [ 7] /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa2e04)[0xffffa74a2e04]
/opt/views/view/lib/libnetcdf_c++4.so.1(_ZN6netCDF6NcFile5closeEv+0x44)[0xffffab35a6e4]
[5323bdc8d33f:00101] [10] ../cmake-build-release-docker-mpi/libnextsimlib.so(_ZN7Nextsim10ParaGridIO5closeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x17c)[0xffffab14662c]
[5323bdc8d33f:00101] [11] [5323bdc8d33f:00102] [ 8] ../cmake-build-release-docker-mpi/libnextsimlib.so(_ZN7Nextsim10ParaGridIO13closeAllFilesEv+0x60)[0xffffab146844]
[5323bdc8d33f:00101] [12] /opt/views/view/lib/libnetcdf_c++4.so.1(_ZN6netCDF7ncCheckEiPKci+0xbf0)[0xffffa7bb5fc0]
[5323bdc8d33f:00102] [ 9] /opt/views/view/lib/libnetcdf_c++4.so.1(_ZN6netCDF6NcFile5closeEv+0x44)[0xffffa7bba6e4]
[5323bdc8d33f:00102] [10] ../cmake-build-release-docker-mpi/libnextsimlib.so(_ZN7Nextsim10ParaGridIO5closeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x17c)[0xffffa7d4662c]
[5323bdc8d33f:00102] [11] ../cmake-build-release-docker-mpi/libnextsimlib.so(_ZN7Nextsim10ParaGridIO13closeAllFilesEv+0x60)[0xffffa7d46844]
[5323bdc8d33f:00102] [12] /lib/aarch64-linux-gnu/libc.so.6(+0x3cde8)[0xffffa768cde8]
[5323bdc8d33f:00102] [13] /lib/aarch64-linux-gnu/libc.so.6(+0x3cf0c)[0xffffa768cf0c]
[5323bdc8d33f:00102] [14] /lib/aarch64-linux-gnu/libc.so.6(+0x27400)[0xffffa7677400]
[5323bdc8d33f:00102] [15] /lib/aarch64-linux-gnu/libc.so.6(+0x3cde8)[0xffffaaa8cde8]
[5323bdc8d33f:00101] [13] /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffa76774cc]
[5323bdc8d33f:00102] [16] ../cmake-build-release-docker-mpi/nextsim(+0x2d30)[0xaaaadef12d30]
[5323bdc8d33f:00102] *** End of error message ***
/lib/aarch64-linux-gnu/libc.so.6(+0x3cf0c)[0xffffaaa8cf0c]
[5323bdc8d33f:00101] [14] /lib/aarch64-linux-gnu/libc.so.6(+0x27400)[0xffffaaa77400]
[5323bdc8d33f:00101] [15] /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffaaa774cc]
[5323bdc8d33f:00101] [16] ../cmake-build-release-docker-mpi/nextsim(+0x2d30)[0xaaaab5cb2d30]
[5323bdc8d33f:00101] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 5323bdc8d33f exited on signal 6 (Aborted).
--------------------------------------------------------------------------
@einola
Member Author

einola commented Sep 9, 2024

@andreapiacentini places the error at line 496 of ParaGridIO.cpp (ParaGridIO::close).

@andreapiacentini

A "dummy" question: in this test there is only one output file and all the timesteps are output to the same diagnostic.nc file.
The error message output from the underlying HDF5 call (Invalid argument) is mentioned in google threads and often due to a second closure of an already closed file. Is it possible that diagnostic.nc get closed more than once?
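
To make the hypothesis concrete, here is a minimal stand-alone sketch using the plain netCDF C API (not the model's I/O code), in which the second nc_close on an already-closed id is expected to fail rather than succeed silently:

// Hedged illustration only: close the same netCDF id twice and print the
// status of each call. The second close should report an error.
#include <netcdf.h>
#include <cstdio>

int main()
{
    int ncid = 0;
    if (nc_create("double_close_demo.nc", NC_CLOBBER | NC_NETCDF4, &ncid) != NC_NOERR)
        return 1;

    int first = nc_close(ncid);  // normal close, expected to succeed
    int second = nc_close(ncid); // close the same id again, expected to fail

    std::printf("first close:  %s\n", nc_strerror(first));
    std::printf("second close: %s\n", nc_strerror(second));
    return 0;
}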

@timspainNERSC
Collaborator

@andreapiacentini Not a dumb question at all!

Would it be possible to re-run using the branch issue690_doubleclose? This is develop plus a diagnostic message printed to stdout whenever an attempt is made to close a diagnostic file. If a double close is attempted, the same file name should show up twice.
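
The instrumentation is along these lines (a hedged sketch with hypothetical names; the real change is whatever is on the issue690_doubleclose branch):

// Hypothetical sketch: log every close attempt, so a double close shows up
// as the same file name printed twice in stdout.
#include <iostream>
#include <map>
#include <string>

// Stand-in for the map of open diagnostic files held by the I/O class.
static std::map<std::string, int> openFiles;

static void closeFile(const std::string& name)
{
    std::cout << "Closing " << name << std::endl; // diagnostic added for this issue
    // The real code would also close the stored netCDF handle here.
    openFiles.erase(name);
}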

@andreapiacentini

My hint was wrong: here is the console output, where "Closing" appears only once.

(PyO) (davinci)~/SASIP_DEV/nextsimdg/run:1025>mpirun -np 1 ../build_MPI_NCIntel/nextsim --config-file config_column.cfg --model.partition_file=partition.nc --model.init_file=init_column_30x30.nc
Closing diagnostic.nc
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 0:
  #000: H5F.c line 549 in H5Fflush(): invalid file identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
terminate called after throwing an instance of 'netCDF::exceptions::NcHdfErr'
  what():  NetCDF: HDF error
file: /root/Downloads/netcdf-cxx4-4.3.1/cxx4/ncFile.cpp  line:33

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 19738 RUNNING AT davinci.cerfacs.fr
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

The job log itself is clean:
nextsim.10:41:15.log

@andreapiacentini

I am still trying to track down what's going on inside NetCDF.
I still have no hints on the error at closing, but I have found an explanation for the missing restart file after a parallel run:

Open MPI I/O plugins may have restrictions on characters that can be used in filenames. For example, the ROMIO plugin may disallow the colon (":") character from appearing in a filename.

Therefore the ISO date format, with colons separating hours, minutes and seconds, makes the file creation fail. Unfortunately this happens inside a try block that silently swallows the exception, so no error is issued. When I removed the date part from the filename, restart.nc was output correctly.
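
One possible guard, sketched here on the assumption that the file name is built from an ISO 8601 time stamp (this is not the model's actual naming code), is to replace the colons before the file is created:

// Hedged sketch: make a file name safe for MPI-IO back ends such as ROMIO
// by replacing the ':' characters coming from an ISO 8601 time.
#include <algorithm>
#include <string>

std::string mpiSafeFileName(std::string name)
{
    std::replace(name.begin(), name.end(), ':', '-');
    return name;
}

// e.g. mpiSafeFileName("restart.2010-01-01T00:00:00Z.nc")
//      -> "restart.2010-01-01T00-00-00Z.nc"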

Yet closing the diagnostic file still causes a crash. My hypothesis is that the std::atexit(closeAllFiles) registered in ParaGridIO::makeDimCompMap really is executed at exit, hence after MPI_Finalize.
With a dirty trick (commenting out MPI_Finalize) I get a clean shutdown. Please find a way of registering MPI_Finalize with atexit as well (whatever that turns out to mean)!

@andreapiacentini

For the moment I can run the parallel test cleanly to the end with this somewhat less dirty patch in main.cpp:

[...]
#include "include/Model.hpp"
#include "include/NetcdfMetadataConfiguration.hpp"

#ifdef USE_MPI
// Void wrapper around MPI_Finalize so it can be registered with std::atexit.
static void exitMPI()
{
    MPI_Finalize();
}
#endif // USE_MPI

int main(int argc, char* argv[])
{
#ifdef USE_MPI
    MPI_Init(&argc, &argv);
    // Registered before any file-closing handlers, so it runs after them.
    std::atexit(exitMPI);
    MPI_Comm modelCommunicator = MPI_COMM_WORLD;
#endif // USE_MPI
[...]

The standard says that atexit functions are called in reverse order of registration, and indeed I have checked that the diagnostic.nc file is closed before MPI_Finalize is triggered.
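
The ordering is easy to check in isolation; a minimal stand-alone demonstration (independent of the model) prints the handlers in reverse order of registration:

#include <cstdio>
#include <cstdlib>

static void registeredFirst() { std::puts("registered first, runs last (think MPI_Finalize)"); }
static void registeredSecond() { std::puts("registered second, runs first (think closeAllFiles)"); }

int main()
{
    std::atexit(registeredFirst);
    std::atexit(registeredSecond);
    return 0; // handlers run after main returns: "registered second..." before "registered first..."
}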

@einola
Member Author

einola commented Sep 20, 2024

That looks very nice! Would it work with just

std::atexit(MPI_Finalize());

so that you don't have to define the additional function exitMPI()?

Otherwise, can you create an issue branch and make a pull request?

$ git checkout develop
$ git pull
$ git checkout -b issue690_HDF_error
(edit the code, git commit, git push, and pull request)

@andreapiacentini

Thanks.
No, it wouldn't, because MPI_Finalize() returns an error code and std::atexit only accepts functions returning void (that's what Google told me).
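
As a side note, a non-capturing lambda does convert to the void (*)() that std::atexit expects, so the int return value could be discarded without a named wrapper. A hedged sketch, not tested against this code base:

#include <cstdlib>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    // Registered early in main, so it runs after handlers registered later.
    std::atexit([]() { MPI_Finalize(); });
    // ... model run ...
    return 0;
}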
I can post a pull request, but @timspainNERSC wrote to me that he's changing the way ParaGridIO finalizes, with a method that is called explicitly (cf. PRs #683 and #685). Should we wait for that, or should I post the PR and revert it (if needed) when Tim is done?
Anyway, all of this should probably live in an MPI-related class. Don't you (and @TomMelt) think so?
