GARD fails due to MPI setup (?) #30

Open
Shellfishgene opened this issue Dec 21, 2021 · 7 comments

@Shellfishgene

Hi!

I just tried to run the pipeline with the local and singularity profiles, using the test data bats_mx1_small.fasta. However, GARD fails, apparently due to some MPI setup issue. I'm not sure whether that should all happen inside the container or whether I need to configure something on the server/cluster myself?
This is gard.log:

libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: host
Requested max number of outstanding WRs in the SQ:                1
Requested max number of outstanding WRs in the RQ:                2
Requested max number of SGEs in a WR in the SQ:                   1023
Requested max number of SGEs in a WR in the RQ:                   1023
Requested max number of data that can be posted inline to the SQ: 0
Error:    File exists

Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: host
--------------------------------------------------------------------------
libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: host
Requested max number of outstanding WRs in the SQ:                1
Requested max number of outstanding WRs in the RQ:                2
Requested max number of SGEs in a WR in the SQ:                   1023
Requested max number of SGEs in a WR in the RQ:                   1023
Requested max number of data that can be posted inline to the SQ: 0
Error:    File exists

Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: host
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            host
  Device name:           i40iw0
  Device vendor ID:      0x8086
  Device vendor part ID: 14290

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           host
  Local device:         i40iw0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[ERROR] This analysis requires an MPI environment to run


[host:1017209] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[host:1017209] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[host:1017209] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
@hoelzer (Collaborator) commented Dec 23, 2021

Hi @Shellfishgene , thanks for your interest in the pipeline!

Everything should happen inside the container, but it seems there is some issue with the Singularity container version for GARD+MPI. I will try to look into it ASAP.

I guess there is no way to run the Docker profile on your cluster?

@Shellfishgene (Author)

No Docker on the cluster, but I can run it on a workstation. It's not urgent anyway... thanks for having a look!

@mchaisso

I'm getting a similar problem with Singularity, though with a different log:

Failed to create a completion queue (CQ):

Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.


Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: endeavour2


Failed to create a completion queue (CQ):

Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.


Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: endeavour2


No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

Local host: endeavour2
Local device: mlx4_0
Local port: 1
CPCs attempted: udcm

[ERROR] This analysis requires an MPI environment to run

[endeavour2.hpc.usc.edu:161337] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

@fischer-hub (Collaborator)

Hey @Shellfishgene!
Am I understanding it right that this issue occurred when you were running poseidon on your local machine with the singularity profile? If so, I can't seem to reproduce it; it runs fine for me with bats_mx1_small.fasta.
Did you try to run the pipeline again, perhaps with the -resume flag? Also, are you running the latest release of poseidon?

> Hi!
>
> I just tried to run the pipeline with the local and singularity profiles, using the test data bats_mx1_small.fasta. However, GARD fails, apparently due to some MPI setup issue. I'm not sure whether that should all happen inside the container or whether I need to configure something on the server/cluster myself? This is gard.log:

@Shellfishgene (Author)

> Hey @Shellfishgene! Am I understanding it right that this issue occurred when you were running poseidon on your local machine with the singularity profile? If so, I can't seem to reproduce it; it runs fine for me with bats_mx1_small.fasta.

I figured out what the problem was: I forgot to set the local profile in Nextflow and ran it with -profile singularity --cores 4. However, that seems to set ${task.cpus} to 1 for the gard task, and mpirun -np 1 causes the error; it needs to be >1. The error message from mpirun is not exactly clear... With -profile local,singularity it works.
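For context, a rough sketch of what the GARD step presumably boils down to. This is not the actual poseidon code; the binary name HYPHYMPI, the option names, and the process shape are assumptions:

```nextflow
// Hypothetical sketch only, not the actual poseidon GARD process.
process gard {
    // With '-profile local,singularity' this picks up the configured core count;
    // without an execution profile it can fall back to 1, which breaks mpirun below.
    cpus params.cores

    input:
    path aln   // codon alignment, e.g. derived from bats_mx1_small.fasta

    script:
    """
    # GARD appears to need at least 2 MPI ranks (dispatcher + worker);
    # 'mpirun -np 1' is what produces
    # '[ERROR] This analysis requires an MPI environment to run'
    mpirun -np ${task.cpus} HYPHYMPI gard --alignment ${aln}
    """
}
```

With -profile local,singularity --cores 4 the same invocation becomes mpirun -np 4, which is why adding the local profile makes it work.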

@hoelzer (Collaborator) commented Jan 5, 2022

@Shellfishgene ah great, thanks for letting us know!

So it seems that when no "execution" profile is defined, the default core number defined here:
https://github.com/hoelzer/poseidon/blob/master/nextflow.config#L15
is not propagated to the processes.

With -profile local,singularity the default value is passed to the GARD process:
https://github.com/hoelzer/poseidon/blob/master/configs/local.config#L14
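Roughly the shape of that wiring (a sketch only; the actual selectors in configs/local.config may differ):

```nextflow
// Sketch of the assumed local profile wiring, not verbatim from the repo:
// the local executor hands params.cores to every process, so GARD's
// task.cpus follows whatever --cores is set to.
process {
    executor = 'local'
    cpus = params.cores
}
```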

@fischer-hub maybe we can just add a check to poseidon.nf that task.cpus must be >1? Something like the minimal guard sketched below.
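A minimal sketch of such a guard, assuming the check sits near the top of poseidon.nf and that params.cores is what ends up as task.cpus for GARD:

```nextflow
// Sketch of the proposed sanity check; exact placement and wiring are assumptions.
if ( (params.cores as int) < 2 ) {
    exit 1, "GARD runs via MPI and needs at least 2 CPUs (mpirun -np 1 fails). " +
            "Please set --cores to 2 or more, e.g. with '-profile local,singularity --cores 4'."
}
```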

@fischer-hub (Collaborator)

@hoelzer Yes, probably a good idea. I also ran into some other issues with the gard process when running with -profile slurm,singularity, so we might as well fix all of that together!
