Checkpointed analyses hang on restart #290

DMaddison · 2023-10-19T16:32:29Z

What is the current observed behaviour?

Kip Will and I are independently trying to restart MrBayes (3.2.7) runs from checkpoints, and for both of us the restart is failing. It seems to go fine, but then just hangs (at the same point) with no further output. Although our data matrices are independent, and rather different (Kip’s has over 600 taxa and about 5000 nucleotides; mine has 46 taxa and 1 million nucleotides), we are doing similar fossilized-birth-death analyses.

In my case, I ran a bit over 21 million generations on a 20-core Apple M1 Ultra Mac Studio computer with 128GB RAM, asking for the 16 high-performance cores to be used under MPI, using “mpirun -np 16 mb”. I used the ARM version of MrBayes (3.2.7a). Kip is doing his analysis on a Linux box with Intel Xeon chips with a total of 16 hyperthreaded cores (thus 32 apparent cores), and started his run using "nohup mpirun -np 32 mb [filename.nex] &".

After having done an initial run, we then wanted to continue the MCMC analysis from that point but using different swapfreq and temp settings. I added “append=yes” and the new swapfreq and temp settings to the mcmc command in the MrBayes block of the NEXUS file (Kip added them to the mcmcp command), and asked the file to be executed again.

All seemed good after invoking the mcmc command. MrBayes chugs through the NEXUS file, gets into the MrBayes block, eventually copies the .p and .t files, making .p~ and .t~ copies, hums along, and then it just stops doing anything apparent. Here is the last bit of the log:

   Exiting mrbayes block
   Reached end of file
   Returning execution to calling file ...
      Using samples up to generation 21982000 from previous analysis.

      Initial log likelihoods and log prior probs for run 1:
         Chain 1 -- -7027326.220655 -- nan

      There are 15 more chains on other processor(s)
      Using a relative burnin of 25.0 % for diagnostics
      Chain results (continued from previous run; 1000000000 generations requested):

Outputting the "Chain results" line is thus the last apparent thing it does; after that, nothing more happens. I've left my machine going for 48 hours, and no files are written, nothing more comes to the log, etc. In my case, the memory usage per core goes up to about 4.5GB, but the total memory used by all 16 mb processes is far less than the available 128GB, and there is a lot of unused memory. All 16 cores are showing as active, but with mpirun that is how it appears even if the mb executable isn’t actually doing an analysis at all.

There is one hint in my runs of something amiss. As the MrBayes block is being read in, it of course sends to the log information about what is going on. At one point, it says this:

      Setting number of generations to 1000000000
      Using relative burnin (a fraction of samples discarded).
      Setting burnin fraction to 0.25
      Setting print frequency to 1000
      Setting sample frequency to 1000
      WARNING: Reallocation of zero size attempted. This is probably a bug. Problems may follow.
      WARNING: Reallocation of zero size attempted. This is probably a bug. Problems may follow.
      WARNING: Reallocation of zero size attempted. This is probably a bug. Problems may follow.
      Setting number of runs to 4
      Setting number of chains to 4

Not sure if that warning is important, but that is the only hint of problems.
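For anyone comparing logs from the two machines, here is a minimal Python sketch that scans a saved MrBayes log for the two oddities visible in the excerpts above: the "Reallocation of zero size attempted" warnings, and any "Chain N -- lnL -- lnPrior" line whose values are not finite (my log shows nan as the log prior for Chain 1). The function name `scan_log` and the regular expression are my own assumptions based only on the lines quoted here, not on any MrBayes tooling, and I don't know whether either oddity is related to the hang.

```python
import math
import re

REALLOC_WARNING = "Reallocation of zero size attempted"

# Matches lines such as "Chain 1 -- -7027326.220655 -- nan" from the log above.
# The pattern is inferred from that one quoted line (an assumption).
_CHAIN_LINE = re.compile(r"Chain\s+(\d+)\s+--\s+(\S+)\s+--\s+(\S+)")


def _is_finite(token):
    """True if token parses as a finite float; False for nan, inf, or junk."""
    try:
        return math.isfinite(float(token))
    except ValueError:
        return False


def scan_log(lines):
    """Return (warning_count, bad_chains) for an iterable of log lines.

    bad_chains lists the chain numbers whose log likelihood or log prior
    is not a finite number.
    """
    warning_count = 0
    bad_chains = []
    for line in lines:
        if REALLOC_WARNING in line:
            warning_count += 1
        match = _CHAIN_LINE.search(line)
        if match:
            chain, lnl, lnprior = match.groups()
            if not (_is_finite(lnl) and _is_finite(lnprior)):
                bad_chains.append(int(chain))
    return warning_count, bad_chains
```

To use it on a saved log, pass an open file, for example `scan_log(open("log.txt"))`, where `log.txt` is a hypothetical file name. Under MPI only the chains on the first process are printed, so this only sees what that process reports.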

In case it is relevant, here are some of the commands used to set up the restarted run (Kip's are likely fairly similar):

	prset brlenspr=clock:fossilization; 
	prset samplestrat=diversity;

	prset clockvarpr=mixed;  
	prset clockratepr=normal(0.001,0.02);  

	prset topologypr=constraints([constraints listed here]);
	prset nodeagepr=calibrated;

	mcmcp ngen= 1000000000 relburnin=yes burninfrac=0.25 printfreq=1000  samplefreq=1000 nruns=4 nchains=4 savebrlens=yes;
	mcmc Swapfreq=7 Temp=0.03 append=yes ;
	sumt;

How may we reproduce this bug?

I can supply my files as needed, and I suspect Kip can too.

Would you be able to compile and run MrBayes to test fixes to this bug?

Yes

What is the environment that you run MrBayes in?

Two different environments.

My environment:

Operating system (including variant and release): macOS Sonoma 14.0
If possible, include the output of the Version command in MrBayes below:

 ---------------------------------------------------------------------------
  Version

  MrBayes 3.2.7a

  Features:  MPI
  Host type: arm-apple-darwin22.1.0 (CPU: arm)
  Compiler:  clang 14.0.0
  ---------------------------------------------------------------------------

Kip's environment:

Operating system (including variant and release): Red Hat Enterprise Linux 7.9
Version of MrBayes: MrBayes "v3.2.7-svn(r1079) x64"


DMaddison · 2023-10-20T23:47:44Z

One thing I forgot to mention: we had to modify the .ckp files in order for MrBayes to accept them. In particular, in the trees block the start of each tree command looks like this:

	tree mcmc.tree_1 [&B MixedBrlens 1] = [&R] 
	tree mcmc.tree_2 [&B MixedBrlens 1] = [&R] 
	tree mcmc.tree_3 [&B MixedBrlens 1] = [&R] 
	tree mcmc.tree_4 [&B MixedBrlens 1] = [&R] 
	tree mcmc.tree_5 [&B MixedBrlens 0] = [&R] 
	tree mcmc.tree_6 [&B MixedBrlens 1] = [&R] 
	tree mcmc.tree_7 [&B MixedBrlens 0] = [&R] 
	tree mcmc.tree_8 [&B MixedBrlens 1] = [&R] 
	tree mcmc.tree_9 [&B MixedBrlens 0] = [&R] 
	tree mcmc.tree_10 [&B MixedBrlens 1] = [&R]
	tree mcmc.tree_11 [&B MixedBrlens 1] = [&R]
	tree mcmc.tree_12 [&B MixedBrlens 1] = [&R]
	tree mcmc.tree_13 [&B MixedBrlens 1] = [&R]
	tree mcmc.tree_14 [&B MixedBrlens 1] = [&R]
	tree mcmc.tree_15 [&B MixedBrlens 1] = [&R]
	tree mcmc.tree_16 [&B MixedBrlens 1] = [&R]

MrBayes chokes on the number, 0 or 1, after "MixedBrlens". If the number is present, you get an error message and MrBayes stops processing the file. If you remove the 0 or 1, then it appears to accept those lines.
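In case it helps anyone reproduce the workaround, the manual edit described above can be sketched in Python as follows. This is only an illustration of what we did by hand: the function names and the regular expression are my own, and the sketch also drops the space in front of the digit, which is an assumption (I have not checked whether MrBayes cares about that whitespace). It writes a cleaned copy rather than editing the .ckp file in place.

```python
import re

# Matches the "[&B MixedBrlens 0]" / "[&B MixedBrlens 1]" comment seen in the
# tree lines quoted above; group 1 and group 2 keep everything but the digit.
_MIXED_BRLENS = re.compile(r"(\[&B\s+MixedBrlens)\s+[01](\s*\])")


def strip_mixedbrlens_flag(text):
    """Return text with the 0/1 after 'MixedBrlens' removed."""
    return _MIXED_BRLENS.sub(r"\1\2", text)


def fix_ckp_file(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path.

    Returns the number of lines that were changed.
    """
    changed = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            new_line = strip_mixedbrlens_flag(line)
            if new_line != line:
                changed += 1
            dst.write(new_line)
    return changed
```

A call might look like `fix_ckp_file("run.ckp", "run.fixed.ckp")`, where both file names are hypothetical. Keep the original checkpoint file until the restart has been confirmed to read the cleaned copy.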

nylander added the Checkpoint-append-related label (Checkpoint and/or append related issue) on May 27, 2024
