
Test sample aborts early when built with LTO #159

Open
eli-schwartz opened this issue Aug 5, 2024 · 1 comment

@eli-schwartz

It is somewhat popular to build software with `-flto` (link-time optimization) in CFLAGS / FCFLAGS / LDFLAGS for additional optimizations. Various Linux distros either try to build all software that way by default, or let users opt in on a per-package or global level.

Enabling LTO by default on Gentoo and then compiling various packages, including tinker, turned up the following bug report: https://bugs.gentoo.org/878059

Basically, Gentoo's testsuite for tinker looks like this:

```bash
cd test/

for t in *.run; do
    bash "${t}" || exit 1
done
```

to run the samples. This is not very robust, because these run scripts apparently don't fail when a sample fails. :( Instead, each script's exit status is simply that of its last command.
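Because a `*.run` script's exit status only reflects its last command, a stricter loop also has to inspect the output. A minimal sketch in bash, assuming that the termination message `TINKER is Unable to Continue` (as printed in the `ice.run` output) reliably marks an aborted calculation; `run_sample` and `run_all_samples` are hypothetical helper names, not part of Tinker or Gentoo:

```bash
#!/usr/bin/env bash

# Hypothetical helper: run one sample script, echo its output, and fail if
# the script exits non-zero OR its output contains Tinker's fatal message.
run_sample() {
    local script=$1 out status
    out=$(bash "$script" 2>&1)   # capture stdout and stderr together
    status=$?                    # exit status of the script (its last command)
    printf '%s\n' "$out"
    if [ "$status" -ne 0 ]; then
        echo "FAIL: $script exited with status $status" >&2
        return 1
    fi
    # Assumption: this message marks an aborted calculation even when the
    # script's final command (e.g. cleanup) happens to succeed.
    if grep -q 'TINKER is Unable to Continue' <<<"$out"; then
        echo "FAIL: $script reported a fatal Tinker error" >&2
        return 1
    fi
}

# Hypothetical driver: run every *.run script in a directory, stopping at
# the first failure. The subshell keeps the caller's working directory.
run_all_samples() {
    local dir=$1
    (
        cd "$dir" || exit 1
        for t in *.run; do
            run_sample "$t" || exit 1
        done
    )
}
```

`run_all_samples test/` would take the place of the loop above. Because each sample's output is captured for the check, it is printed only after that sample finishes.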

When running ice.run:

```text
 * Testing ice.run ...

     ######################################################################
   ##########################################################################
  ###                                                                      ###
 ###            TINKER  ---  Software Tools for Molecular Design            ###
 ##                                                                          ##
 ##                          Version 8.2  June 2017                          ##
 ##                                                                          ##
 ##               Copyright (c)  Jay William Ponder  1990-2017               ##
 ###                           All Rights Reserved                          ###
  ###                                                                      ###
   ##########################################################################
     ######################################################################


 Smooth Particle Mesh Ewald Parameters :

    Ewald Coefficient       PME Grid Dimensions       B-Spline Order

          0.5446               36    30    54                5

 Random Number Generator Initialized with SEED :      123456789

 Molecular Dynamics Trajectory via Modified Beeman Algorithm

    MD Step       E Total    E Potential      E Kinetic       Temp       Pres

         1     -8310.0499    -10641.4570      2331.4070     258.73    4990.48
         2     -8307.2488    -10630.1223      2322.8735     257.78  123059.73

 VLIST  --  Pairwise Neighbor List cannot be used with Replicas

 TINKER is Unable to Continue; Terminating the Current Calculation

rm: cannot remove 'ice.arc*': No such file or directory
rm: cannot remove 'ice.dyn*': No such file or directory
 * ERROR: sci-chemistry/tinker-8.2.1-r1::gentoo failed (test phase):
```

The `rm` command runs last, and it is what fails, even though `../bin/dynamic ice 200 1.0 0.1 4 253.0 2960.0` apparently doesn't fail at all? When built without LTO, however, ice.run runs 2 groups of 100 steps, and that "123059.73" result becomes a far less unusual-looking 4383.34.
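The `rm` errors are only a side effect: the `ice.arc*` and `ice.dyn*` files were never written because the run aborted. A cleanup step that removes only files that actually exist could be sketched as follows; `remove_if_present` is a hypothetical helper, and the patterns are taken from the `rm` errors above:

```bash
#!/usr/bin/env bash

# Hypothetical cleanup helper: for each glob pattern, delete the files it
# matches; if nothing matches, do nothing instead of failing like plain rm.
remove_if_present() {
    local pattern f
    for pattern in "$@"; do
        # Unquoted on purpose so the glob expands. An unmatched glob stays as
        # the literal pattern, which the -e test below then skips.
        for f in $pattern; do
            if [ -e "$f" ]; then
                rm -- "$f"
            fi
        done
    done
}
```

It would be called as `remove_if_present 'ice.arc*' 'ice.dyn*'`; plain `rm -f ice.arc* ice.dyn*` has much the same effect. Either way, this only keeps cleanup from failing. It does not detect the aborted run, which still has to be caught from the program's output.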

@jayponder
Member

jayponder commented Aug 5, 2024

There are several comments to be made here:

(1) For the example you cite above, where you get 123059.73 at Step 2, something is very, very badly broken. That is a "crazy" number, while the 4383.34 value is correct, as you can see in the ice.log file provided with the Tinker distribution. The "rm" failed because the calculation went so drastically wrong that the files it was meant to remove at the end were never even generated. We could obviously avoid that failure by checking whether the files exist before trying to remove them. But at least the "failure" above shows that there is a problem.

(2) The Tinker "test" cases are not really intended to be run as automated tests, as the simple Gentoo script you cite above tries to do. Each test can be run individually (and manually), and the result can then be compared to the `*.log` file we provide for each test. We are aware this is so old-fashioned as to be "nonstandard", but we have our reasons. That said, the Gentoo script will probably at least run when you are not using the -flto flag. However, getting the Gentoo script to merely run is not sufficient: you need to compare the output to our `*.log` files to verify correctness.

(3) I've not tried these kinds of "global" optimizers, such as -flto, for some time. However, I did play with them at various points in the past, with both the GCC and Intel compiler suites. In my hands, they do not help performance. The longer jobs run with Tinker, such as molecular dynamics, are not limited by interprocess traffic, memory, data, or communication. They are pretty much pure number crunching in fairly tight, self-contained, parallelized loops. This is why essentially nobody runs such calculations on CPUs these days; GPUs are used almost exclusively. For that purpose, we suggest you look into the Tinker9 GPU code instead of the CPU-based Fortran Tinker8. Tinker8 and Tinker9 are file compatible, so switching between the two is fairly easy. And for the Fortran Tinker8, I recommend that you not use the GCC -flto switch!
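Point (2) above says each result should be compared with the `*.log` file shipped with the test. One way to partly automate that is to compare only the MD step table numerically. The sketch below assumes that table has the layout seen in the `ice.run` output (an integer step number followed by five values) and that no other line in either file has that shape; `compare_md_steps` and its default relative tolerance of 1e-4 are hypothetical choices, not anything Tinker provides:

```bash
#!/usr/bin/env bash

# Hypothetical checker: compare the MD step rows of a new run against a
# reference log, column by column, within a relative tolerance.
# Returns 0 if all rows match, 1 otherwise (printing each mismatch).
compare_md_steps() {
    local out=$1 ref=$2 tol=${3:-1e-4}
    awk -v tol="$tol" '
        # Keep only MD step rows: an integer step number plus five values.
        !($1 ~ /^[0-9]+$/ && NF == 6) { next }
        # First file (new output): remember each row by its index.
        FILENAME == ARGV[1] { n_out++; for (i = 1; i <= 6; i++) got[n_out, i] = $i; next }
        # Second file (reference log): compare with the row at the same index.
        {
            n_ref++
            if (got[n_ref, 1] + 0 != $1 + 0) {
                printf "row %d: step %s missing or misaligned\n", n_ref, $1
                bad = 1
                next
            }
            for (i = 2; i <= 6; i++) {
                diff = got[n_ref, i] - $i; if (diff < 0) diff = -diff
                scale = ($i < 0) ? -$i : $i; if (scale < 1) scale = 1
                if (diff > tol * scale) {
                    printf "step %s column %d: got %s, expected %s\n", $1, i, got[n_ref, i], $i
                    bad = 1
                }
            }
        }
        END {
            if (n_out != n_ref) { printf "row count: got %d, expected %d\n", n_out, n_ref; bad = 1 }
            exit (bad ? 1 : 0)
        }' "$out" "$ref"
}
```

Usage might look like `bash ice.run > ice.out 2>&1; compare_md_steps ice.out ice.log`. Builds with different compilers or flags can legitimately differ in the last digits, and MD trajectories diverge as such differences accumulate, so a tight tolerance is most meaningful for the early steps. A pressure of 123059.73 where 4383.34 is expected would be flagged at any reasonable tolerance.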
