Investigate possible floating point exception in HDF5 library #153

Closed
anton-seaice opened this issue May 1, 2024 · 5 comments

@anton-seaice
Contributor

Using the 0.3.0 Debug build of OM3 (spack 0.21.2), @micaeljtoliveira hit this error when running the MOM6-CICE6 1deg_jra55do_iaf config:

forrtl: error (65): floating invalid
Image             PC               Routine           Line       Source
libpthread-2.28.s 000014DFE4FC4CF0 Unknown              Unknown Unknown
libhdf5.so.310.3. 000014DFE2BA1B6B H5T__init_native_    Unknown Unknown
libhdf5.so.310.3. 000014DFE2B06FA8 H5T_init             Unknown Unknown
libhdf5.so.310.3. 000014DFE2BC5130 H5VL_init_phase2     Unknown Unknown
libhdf5.so.310.3. 000014DFE2856673 H5_init_library      Unknown Unknown
libhdf5.so.310.3. 000014DFE2925945 H5Eset_auto2         Unknown Unknown
libnetcdf.so.19   000014DFE7CB5ABC nc4_hdf5_initiali    Unknown Unknown
libnetcdf.so.19   000014DFE7CBDFEC NC_HDF5_initializ    Unknown Unknown
libnetcdf.so.19   000014DFE7C1ACB8 nc_initialize        Unknown Unknown
libnetcdf.so.19   000014DFE7C1E58A NC_open              Unknown Unknown
libnetcdf.so.19   000014DFE7C1E297 nc_open              Unknown Unknown
libnetcdff.so.7.2 000014DFE8011BC2 nf_open_             Unknown Unknown
libnetcdff.so.7.2 000014DFE8055329 netcdf_mp_nf90_op    Unknown Unknown
access-om3-MOM6-C 00000000072D2BB5 ice_read_write_mp       1072 ice_read_write.F90
access-om3-MOM6-C 000000000713E142 ice_grid_mp_init_        342 ice_grid.F90
access-om3-MOM6-C 00000000075B984B cice_initmod_mp_c         57 CICE_InitMod.F90
access-om3-MOM6-C 0000000006A87EA2 ice_comp_nuopc_mp        589 ice_comp_nuopc.F90
access-om3-MOM6-C 0000000000E79684 _ZN5ESMCI6FTable1       2167 ESMCI_FTable.C
access-om3-MOM6-C 0000000000E7D7BA ESMCI_FTableCallE        824 ESMCI_FTable.C
access-om3-MOM6-C 0000000001BA3BBF _ZN5ESMCI3VMK5ent       2321 ESMCI_VMKernel.C
access-om3-MOM6-C 0000000002274FC2 _ZN5ESMCI2VM5ente       1216 ESMCI_VM.C
access-om3-MOM6-C 0000000000E7AAC7 c_esmc_ftablecall        981 ESMCI_FTable.C
access-om3-MOM6-C 0000000000C3AD91 esmf_compmod_mp_e       1223 ESMF_Comp.F90
access-om3-MOM6-C 000000000132F5A9 esmf_gridcompmod_       1412 ESMF_GridComp.F90
access-om3-MOM6-C 0000000000B4CD64 nuopc_driver_mp_l       2886 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000B1490F nuopc_driver_mp_i       1318 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000E79684 _ZN5ESMCI6FTable1       2167 ESMCI_FTable.C
access-om3-MOM6-C 0000000000E7D7BA ESMCI_FTableCallE        824 ESMCI_FTable.C
access-om3-MOM6-C 0000000001BA3BBF _ZN5ESMCI3VMK5ent       2321 ESMCI_VMKernel.C
access-om3-MOM6-C 0000000002274FC2 _ZN5ESMCI2VM5ente       1216 ESMCI_VM.C
access-om3-MOM6-C 0000000000E7AAC7 c_esmc_ftablecall        981 ESMCI_FTable.C
access-om3-MOM6-C 0000000000C3AD91 esmf_compmod_mp_e       1223 ESMF_Comp.F90
access-om3-MOM6-C 000000000132F5A9 esmf_gridcompmod_       1412 ESMF_GridComp.F90
access-om3-MOM6-C 0000000000B4CD64 nuopc_driver_mp_l       2886 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000B14B62 nuopc_driver_mp_i       1323 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000AF9D7A nuopc_driver_mp_i        481 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000E79684 _ZN5ESMCI6FTable1       2167 ESMCI_FTable.C
access-om3-MOM6-C 0000000000E7D7BA ESMCI_FTableCallE        824 ESMCI_FTable.C
access-om3-MOM6-C 0000000001BA3BBF _ZN5ESMCI3VMK5ent       2321 ESMCI_VMKernel.C
access-om3-MOM6-C 0000000002274FC2 _ZN5ESMCI2VM5ente       1216 ESMCI_VM.C
access-om3-MOM6-C 0000000000E7AAC7 c_esmc_ftablecall        981 ESMCI_FTable.C
access-om3-MOM6-C 0000000000C3AD91 esmf_compmod_mp_e       1223 ESMF_Comp.F90
access-om3-MOM6-C 000000000132F5A9 esmf_gridcompmod_       1412 ESMF_GridComp.F90
access-om3-MOM6-C 0000000000431FE8 MAIN__                   128 esmApp.F90
access-om3-MOM6-C 000000000043124D Unknown              Unknown Unknown
libc-2.28.so      000014DFE4C27D85 __libc_start_main    Unknown Unknown
access-om3-MOM6-C 000000000043116E Unknown              Unknown Unknown

The run did not fail with the Release build.

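The Debug build sets -fpe0, which traps floating-point exceptions (invalid, divide-by-zero, overflow) instead of letting them pass silently, so we believe there is a bug within the HDF5 library (in its datatype initialisation, per the traceback) that raises the exception.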
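To illustrate the mechanism, here is a minimal sketch (not something proposed in this thread) that uses the standard Fortran IEEE modules. Trapping on "invalid" can be masked around a third-party call and then restored. The division `zero / zero` is only a hypothetical stand-in for a library routine that raises a harmless invalid operation; no netCDF or HDF5 call is made. How reliably the halting mode and flags are honoured can depend on compiler options, so treat this as an assumption to check on your toolchain.

```fortran
program fpe_guard_sketch
  use, intrinsic :: ieee_arithmetic
  implicit none

  logical :: was_halting, raised
  real, volatile :: zero   ! volatile so the compiler does not fold 0.0/0.0 at compile time
  real :: y

  zero = 0.0

  if (.not. ieee_support_halting(ieee_invalid)) then
    write(6,*) 'halting control for ieee_invalid is not supported here'
    stop
  end if

  ! Remember whether "invalid" currently traps (it would under -fpe0)
  call ieee_get_halting_mode(ieee_invalid, was_halting)

  ! Mask the trap for the duration of the guarded region
  call ieee_set_halting_mode(ieee_invalid, .false.)

  ! Hypothetical stand-in for the third-party call: raises "invalid"
  y = zero / zero

  ! Record and clear the flag before restoring the original halting mode
  call ieee_get_flag(ieee_invalid, raised)
  call ieee_set_flag(ieee_invalid, .false.)
  call ieee_set_halting_mode(ieee_invalid, was_halting)

  write(6,*) 'invalid raised inside guarded region: ', raised
  write(6,*) 'result is NaN: ', ieee_is_nan(y)
end program fpe_guard_sketch
```

Without the two `ieee_set_halting_mode` calls, a build that traps on invalid would abort at the division, which is the same kind of failure as the `floating invalid` error above.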

A small (not quite minimal) example to reproduce the problem (compile with mpifort -fpe0):

program nc_open_example
  use netcdf

  implicit none

  integer :: status, fid
  character(len=500) :: filename_nc3, filename_nc4

  filename_nc4 = '/g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/grid_2024.04.04.nc'
  filename_nc3 = '/g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/grid.nc'

  ! Open the netCDF-3 file (per the variable name) and print the status message
  status = nf90_open(trim(filename_nc3), NF90_NOWRITE, fid)
  write(6,*) trim(nf90_strerror(status))

  ! Open the netCDF-4 file (per the variable name) and print the status message
  status = nf90_open(trim(filename_nc4), NF90_NOWRITE, fid)
  write(6,*) trim(nf90_strerror(status))

  ! Open a file given by a relative path in the working directory
  status = nf90_open('link_to_grid.nc', NF90_NOWRITE, fid)
  write(6,*) trim(nf90_strerror(status))

end program nc_open_example
@dougiesquire
Collaborator

This has been fixed in hdf5 1.14.4 - see HDFGroup/hdf5#3837

@dougiesquire added and removed the labels blocked (for issues waiting resolution of issues outside this repository) and priority:low on May 4, 2024
@dougiesquire
Collaborator

The executables in

/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/access-om3-d6813d6b9e1df560ac3f6ba6a605daab9cfd9569_main-q4wfaqb

are built against an hdf5 version that includes the above fix.

@anton-seaice
Contributor Author

There are no Debug builds in that folder. I guess I would need to do a separate Debug build using the modules from that path?

@anton-seaice
Contributor Author

> There are no Debug builds in that folder. I guess I would need to do a separate Debug build using the modules from that path?

I may have misunderstood: when building through spack, the executable name doesn't include Release/Debug as it does when building through build.sh. So possibly your executable would have fixed the problem.

This bug may no longer be relevant. With the ACCESS-NRI build, specifying build_type=Debug doesn't flow down to the dependencies, i.e. the Debug flags are only on when compiling the access-om3 code (the code in this repo and its submodules), not when compiling hdf5.

@anton-seaice
Contributor Author

This has been addressed in hdf5:

HDFGroup/hdf5#3837

and patched in spack for the affected version:

https://github.com/spack/spack/blob/5c59bb87a4d3ed5a190f9ef37b98cd91c91f020f/var/spack/repos/builtin/packages/hdf5/package.py#L191
