Jobs that fail on firebot every few days #13433

Open
mcgratta opened this issue Sep 15, 2024 · 4 comments

@mcgratta (Contributor)

When I run 64 instances of this case in the firebot queue with the evenly sized meshes, all of them succeed. However, if I switch to the uneven meshes, some fail. These cases do not need to be run together with other cases, like the verification suite; they can be run alone.

&HEAD CHID='simple_caseNNN' /

&TIME T_END=60. /

&MESH IJK=50,50,50, XB=0.0,1.0,0.0,1.0,0.0,1.0 /
&MESH IJK=50,50,50, XB=1.0,2.0,0.0,1.0,0.0,1.0 /
&MESH IJK=50,50,50, XB=0.0,1.0,0.0,1.0,1.0,2.0 /
&MESH IJK=50,50,50, XB=1.0,2.0,0.0,1.0,1.0,2.0 /

 MESH IJK=50,50,50, XB=0.0,1.0,0.0,1.0,0.0,1.0 /
 MESH IJK=25,25,25, XB=1.0,2.0,0.0,1.0,0.0,1.0 /
 MESH IJK=25,25,25, XB=0.0,1.0,0.0,1.0,1.0,2.0 /
 MESH IJK=25,25,25, XB=1.0,2.0,0.0,1.0,1.0,2.0 /

&TAIL /

Run 64 cases on firebot, designating one mesh per node. (The four MESH lines without a leading & are the uneven-mesh alternative; move the & to those lines to switch configurations.)

#!/bin/bash

qfds.sh -p 4 -n 1 -q firebot simple_case_01.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_02.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_03.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_04.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_05.fds
...
@mcgratta (Contributor, Author)

I compiled the code in dv mode and ran 64 unbalanced cases on nodes 31-34. A few jobs hung. I killed one of them and got line numbers 1833 and 1834 in main, called from line 689. These are ALLREDUCE calls, one right after the other. The fact that one process was hung on the first call and another on the second is suspicious. I added MPI_BARRIERs to this routine to see if the hang still occurs.
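
One way to pull those line numbers out of a hung rank without killing the job is to attach gdb to the process on its compute node. The sketch below assumes a dv (debug) build so the backtrace resolves to FDS source lines; the executable name in the pgrep pattern is a placeholder, not necessarily what the dv binary is actually called.

#!/bin/bash
# Sketch: attach gdb to each hung FDS rank owned by the current user and
# dump a backtrace for every thread. The binary name below is an assumption;
# substitute the actual dv executable.
for pid in $(pgrep -f -u "$USER" fds_impi_intel_linux_dv); do
    echo "=== backtrace for PID $pid ==="
    gdb -p "$pid" -batch -ex "thread apply all bt" 2>/dev/null
done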

@mcgratta (Contributor, Author)

I added some MPI_BARRIERs and I see that process 1 was stuck at the barrier while processes 2 and 3 were stuck at the first ALLREDUCE.

IF (N_MPI_PROCESSES>1) THEN
   CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
   CALL MPI_ALLREDUCE(MPI_IN_PLACE,DSUM_ALL(1),N_ZONE,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,IERR)

Check whether your cases are hanging in this vicinity.

@marcosvanella (Contributor)

OK, I'm testing the cases with impi first in firebot. The balanced cases went through fine; I'm testing the uneven cases now.
The MPI_ALLREDUCE call looks fine. I'm working from your template input file. I build the input files and submit them with this script:

#!/bin/bash

myqfds=/home/mnv/FireModels_fork_home/fds/Utilities/Scripts/qfds.sh

for n in $(seq 1 64);
do
    if [ $n -lt 10 ]; then
        echo 0$n
        cp simple_caseNNN.fds simple_case_0$n.fds
        sed -i "s/NNN/_0$n/g" simple_case_0$n.fds
        $myqfds -p 4 -n 1 -T dv -q firebot simple_case_0$n.fds
    else
        echo $n
        cp simple_caseNNN.fds simple_case_$n.fds
        sed -i "s/NNN/_$n/g" simple_case_$n.fds
        $myqfds -p 4 -n 1 -T dv -q firebot simple_case_$n.fds
    fi
done
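
A side note on the script: the zero-padding branch can be collapsed with printf, which removes the duplicated copy/sed/submit lines. This is just a sketch of the same loop, reusing the qfds.sh path and options from above.

#!/bin/bash
# Same loop as above; printf handles the zero padding, so the if/else goes away.
myqfds=/home/mnv/FireModels_fork_home/fds/Utilities/Scripts/qfds.sh

for n in $(seq 1 64); do
    nn=$(printf "%02d" "$n")
    echo $nn
    cp simple_caseNNN.fds simple_case_${nn}.fds
    sed -i "s/NNN/_${nn}/g" simple_case_${nn}.fds
    $myqfds -p 4 -n 1 -T dv -q firebot simple_case_${nn}.fds
done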

@marcosvanella (Contributor)

I had three hangs for the uneven cases with impi and the -T dv option, but could not retrieve backtrace information that resolves to the FDS source. The ompi_gnu_linux cases did not hang.
We can try changing the compilation flags for the impi dv target to see if we get more information.
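
For reference, a sketch of the kind of flag change meant here. The build directory, make invocation, and FFLAGS override below are assumptions about the FDS makefile, not the actual dv build setup; the flags themselves are standard Intel compiler and Intel MPI debug options.

#!/bin/bash
# Sketch only: rebuild the impi dv target with full debug information.
# The directory and make variables are assumptions; adjust to the actual
# FDS makefile. -g -O0 keep the code debuggable, -traceback gives
# source-line tracebacks on abort, -check all enables runtime checks.
cd fds/Build/impi_intel_linux_dv || exit 1
make clean
make FFLAGS="-g -O0 -traceback -check all"

# Intel MPI runtime diagnostics for a re-run of a hanging case.
export I_MPI_DEBUG=5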
