GARD fails due to MPI setup (?) #30

Open
Shellfishgene opened this issue Dec 21, 2021 · 7 comments

@Shellfishgene

Hi!

I just tried to run the pipeline with the local and singularity profiles, using the test data bats_mx1_small.fasta. However, GARD fails, apparently due to some MPI setup issue. I'm not sure whether that should all happen inside the container or whether I need to configure something on the server/cluster myself?
This is gard.log:

libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: host
Requested max number of outstanding WRs in the SQ:                1
Requested max number of outstanding WRs in the RQ:                2
Requested max number of SGEs in a WR in the SQ:                   1023
Requested max number of SGEs in a WR in the RQ:                   1023
Requested max number of data that can be posted inline to the SQ: 0
Error:    File exists

Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: host
--------------------------------------------------------------------------
libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: host
Requested max number of outstanding WRs in the SQ:                1
Requested max number of outstanding WRs in the RQ:                2
Requested max number of SGEs in a WR in the SQ:                   1023
Requested max number of SGEs in a WR in the RQ:                   1023
Requested max number of data that can be posted inline to the SQ: 0
Error:    File exists

Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: host
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            host
  Device name:           i40iw0
  Device vendor ID:      0x8086
  Device vendor part ID: 14290

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           host
  Local device:         i40iw0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[ERROR] This analysis requires an MPI environment to run


[host:1017209] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[host:1017209] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[host:1017209] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
@hoelzer (Collaborator) commented Dec 23, 2021

Hi @Shellfishgene , thanks for your interest in the pipeline!

Everything should happen inside the container, but it seems there is some issue with the Singularity container version for GARD+MPI. I will try to look into it ASAP.

I guess there is no way to run the Docker profile on your cluster?

@Shellfishgene (Author)

No Docker on the cluster, but I can run it on a workstation. It's not urgent anyway... thanks for having a look!

@mchaisso

I'm getting a similar problem with Singularity, though with a different log:

Failed to create a completion queue (CQ):

Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.


Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: endeavour2


Failed to create a completion queue (CQ):

Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.


Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: endeavour2


No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

Local host: endeavour2
Local device: mlx4_0
Local port: 1
CPCs attempted: udcm

[ERROR] This analysis requires an MPI environment to run

[endeavour2.hpc.usc.edu:161337] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

@fischer-hub (Collaborator)

Hey @Shellfishgene!
Am I understanding it right that this issue occurred when you were running poseidon on your local machine with the singularity profile? If so, I can't seem to reproduce it; it runs fine for me with bats_mx1_small.fasta.
Did you try to run the pipeline again, perhaps with the -resume flag? Also, are you running the latest release of poseidon?

> Hi!
>
> I just tried to run the pipeline with the local and singularity profiles, using the test data bats_mx1_small.fasta. However, GARD fails, apparently due to some MPI setup issue. I'm not sure whether that should all happen inside the container or whether I need to configure something on the server/cluster myself? This is gard.log:

@Shellfishgene (Author)

> Hey @Shellfishgene! Am I understanding it right that this issue occurred when you were running poseidon on your local machine with the singularity profile? If so, I can't seem to reproduce it; it runs fine for me with bats_mx1_small.fasta.

I figured out what the problem was: I forgot to set the local profile in Nextflow and ran it with -profile singularity --cores 4. However, that seems to set ${task.cpus} to 1 for the gard task, and mpirun -np 1 causes the error; it needs to be >1. The error message from mpirun is not exactly clear... With -profile local,singularity it works.
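For context, a rough sketch of what the GARD step presumably boils down to. This is not the actual poseidon code; the binary name HYPHYMPI, the option names, and the process shape are assumptions:

```nextflow
// Hypothetical sketch only, not the actual poseidon GARD process.
process gard {
    // With '-profile local,singularity' this picks up the configured core count;
    // without an execution profile it can fall back to 1, which breaks mpirun below.
    cpus params.cores

    input:
    path aln   // codon alignment, e.g. derived from bats_mx1_small.fasta

    script:
    """
    # GARD appears to need at least 2 MPI ranks (dispatcher + worker);
    # 'mpirun -np 1' is what produces
    # '[ERROR] This analysis requires an MPI environment to run'
    mpirun -np ${task.cpus} HYPHYMPI gard --alignment ${aln}
    """
}
```

With -profile local,singularity --cores 4 the same invocation becomes mpirun -np 4, which is why adding the local profile makes it work.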

@hoelzer (Collaborator) commented Jan 5, 2022

@Shellfishgene ah great, thanks for letting us know!

So it seems that when no "execution" profile is defined, the default core number defined here:
https://github.com/hoelzer/poseidon/blob/master/nextflow.config#L15
is not propagated to the processes.

With -profile local,singularity the default value is passed to the GARD process:
https://github.com/hoelzer/poseidon/blob/master/configs/local.config#L14
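Roughly the shape of that wiring (a sketch only; the actual selectors in configs/local.config may differ):

```nextflow
// Sketch of the assumed local profile wiring, not verbatim from the repo:
// the local executor hands params.cores to every process, so GARD's
// task.cpus follows whatever --cores is set to.
process {
    executor = 'local'
    cpus = params.cores
}
```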

@fischer-hub maybe we can just add a check to poseidon.nf that task.cpus must be >1? Something like the minimal guard sketched below.
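A minimal sketch of such a guard, assuming the check sits near the top of poseidon.nf and that params.cores is what ends up as task.cpus for GARD:

```nextflow
// Sketch of the proposed sanity check; exact placement and wiring are assumptions.
if ( (params.cores as int) < 2 ) {
    exit 1, "GARD runs via MPI and needs at least 2 CPUs (mpirun -np 1 fails). " +
            "Please set --cores to 2 or more, e.g. with '-profile local,singularity --cores 4'."
}
```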

@fischer-hub (Collaborator)

@hoelzer Yes, probably a good idea. I also ran into some other issues with the gard process when running with -profile slurm,singularity, so we might as well fix all of that together!
