receptor: Error locating unit #758

Open
4 of 9 tasks
anxstj opened this issue Oct 13, 2022 · 8 comments
Labels
component:receptor type:bug Something isn't working

Comments

@anxstj

anxstj commented Oct 13, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

My receptor services on my execution nodes show the following errors:

ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ
ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ

The errors seem to show up whenever a job finishes. The jobs themselves succeed, though, and AWX doesn't show any additional error messages.

What could cause this? And how can I debug it?
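
One way to check whether the errors really coincide with job completion is to pull the affected work unit IDs and their timestamps out of the receptor log and compare them with the job finish times shown in AWX. The following is a minimal sketch, not part of receptor or AWX: the function name `unknown_unit_events` is made up for illustration, and the pattern assumes the two log line formats quoted above.

```python
import re
from datetime import datetime

# Assumed line formats (taken from the log excerpt in this report):
#   ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ
#   ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ
# re.search is used so that a prefix added by a log collector is tolerated.
LINE_RE = re.compile(
    r"ERROR (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:Error locating unit: |unknown work unit )(\S+)\s*$"
)


def unknown_unit_events(lines):
    """Map each work unit ID to the sorted, de-duplicated timestamps
    at which receptor reported it as unknown."""
    events = {}
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        events.setdefault(match.group(2), set()).add(stamp)
    return {unit: sorted(stamps) for unit, stamps in events.items()}


# Demonstration on the two lines from this report.
sample = [
    "ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ",
    "ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ",
]
for unit, stamps in unknown_unit_events(sample).items():
    print(unit, [stamp.isoformat() for stamp in stamps])
```

Any iterable of lines works as input, for example an open log file. If every reported timestamp matches a job's finish time, that supports the "shows up when a job finishes" observation.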

I'm running AWX 21.5.0 and receptor 1.2.0+g72a97e5

Receptor is installed with the AWX image:

Dockerfile:

COPY --from={{ receptor_image }} /usr/bin/receptor /usr/bin/receptor

Makefile:

RECEPTOR_IMAGE ?= quay.io/ansible/receptor:devel

AWX version

21.5.0

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

docker development environment

Modifications

no

Ansible version

2.12.2

Operating system

Debian 11

Web browser

Firefox

Steps to reproduce

Create a setup with two controller nodes and two execution nodes. Then execute a job on one of the execution nodes. The job should succeed, but receptor will log an error message similar to the one above when the job ends.

Expected results

No error message.

Actual results

ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ
ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ

Additional information

No response

@github-actions github-actions bot added needs_triage type:bug Something isn't working labels Oct 13, 2022
@fosterseth
Member

fosterseth commented Oct 14, 2022

@anxstj thanks for opening the ticket!
My hunch is that AWX is trying to release or cancel old receptor work units somewhere (i.e. in the reaper code). This needs some investigation.

@anxstj
Author

anxstj commented Oct 17, 2022

I just found out that old podman instances are not cleaned up successfully. They stay as zombies on the system:

ps faux
...
1000        7586  0.9  0.3 807216 57784 ?        Ssl  Sep27 270:41  \_ receptor --config /etc/receptor/receptor.conf
1000        8004  0.0  0.0      0     0 ?        Z    Sep27   0:00      \_ [podman] <defunct>
1000        8009  0.0  0.0   1088     0 ?        S    Sep27   0:00      \_ catatonit -P
1000        8669  0.0  0.0      0     0 ?        Z    Sep27   0:00      \_ [slirp4netns] <defunct>
1000        8691  0.0  0.0      0     0 ?        Zs   Sep27   0:09      \_ [fuse-overlayfs] <defunct>
1000        8699  0.0  0.0      0     0 ?        Zs   Sep27   0:00      \_ [conmon] <defunct>

In the long run, this will cause trouble, e.g. the systemd TasksMax limit will be reached:

cgroup: fork rejected by pids controller in /system.slice/...
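
To watch how many defunct processes pile up under receptor before the task limit is hit, the process table can be scanned directly. The following is a rough sketch that assumes a Linux host with `/proc` mounted; the helper names `parse_stat` and `zombie_children` are made up for illustration, and it only looks at direct children of the given parent PID.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    fields = text[right + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def zombie_children(parent_pid, proc_root="/proc"):
    """Return sorted (pid, comm) pairs for direct children of parent_pid
    that are in state Z (zombie/defunct)."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state, ppid = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished during the scan, or unreadable
        if ppid == parent_pid and state == "Z":
            zombies.append((pid, comm))
    return zombies and sorted(zombies)
```

With the process tree shown above, `zombie_children(7586)` would be called with the receptor PID; a count that grows after every job suggests the container processes are not being reaped.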

Could this be related to #439? (Just an uneducated guess)

@mabashian mabashian transferred this issue from ansible/awx Mar 29, 2023
@anxstj
Author

anxstj commented Apr 26, 2023

> Could this be related to #439? (Just an uneducated guess)

FTR: the receptor container had a wrong entrypoint, which prevented the container from being cleaned up.

@golakiyaalice

I am running awx-operator:2.6.0 and am facing the same issue while setting up executors on a VM.
Is there any workaround for it?
@ALL, please help.

@vvarga007

Any update on this?

@golakiyaalice

I was able to fix this.
My executor was running behind a firewall, and podman was not able to fetch the image from the quay.io registry.
Either launch your container using an image that is available in your environment, or make sure your executor is able to reach the quay.io repositories.
This issue can be closed.
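
A quick way to test the firewall case from the execution node is to try opening a TCP connection to the registry. The following is a minimal sketch; the function name `can_reach` is made up for illustration, and port 443 is assumed because the registry is served over HTTPS. A successful connection does not prove that an image pull will work (proxies, TLS inspection, or additional download hosts may still get in the way), but a failure points at the network.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

On the execution node, `can_reach("quay.io")` returning `False` would match the situation described above.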

@golakiyaalice

golakiyaalice commented Apr 11, 2024

> Any update on this?

Yes, this is a feature of AWX.
This issue can be closed.

@golakiyaalice

golakiyaalice commented Apr 12, 2024 via email


5 participants