
assisted installer service getting error - chronyc: error while loading shared libraries: libnettle.so.8 #385

Open
pdfruth opened this issue Jun 26, 2022 · 8 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@pdfruth

pdfruth commented Jun 26, 2022

I'm using the self-hosted assisted installer service to install Single Node OKD.
The assisted installer service is running in podman containers, as documented here

This method of doing a single-node install of OKD used to work, but it has started to fail recently (within the last 30 days or so).

The host registers with the installer service, but gets stuck on an NTP synchronization failure, as seen in the attached screenshot.

[Screenshot "Screen Shot 2022-06-25 at 5 41 06 PM": host stuck on the NTP synchronization validation]

Looking into the pod logs of the assisted installer service, I see this message:

level=error msg="Received step reply <ntp-synchronizer-392f0f02> from infra-env <ff4ce4b9-a3cd-4c50-b258-24cfbba8d1e3> host <68b15b04-5cb1-429f-9778-3c8727d0235d> exit-code <-1> stderr <chronyc exited with non-zero exit code 127: \nchronyc: error while loading shared libraries: libnettle.so.8: cannot open shared object file: No such file or directory\n> stdout <>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2992" go-id=9762 host_id=68b15b04-5cb1-429f-9778-3c8727d0235d infra_env_id=ff4ce4b9-a3cd-4c50-b258-24cfbba8d1e3 pkg=Inventory request_id=6a4edac8-f290-4cb2-813e-f6a67ef9c50b

The relevant part of the message is: chronyc: error while loading shared libraries: libnettle.so.8: cannot open shared object file: No such file or directory

I believe the root cause is the change introduced by this commit.

The code change introduced by that commit mounts the chronyc binary of the underlying OS (the one the assisted-installer-agent container runs on) into the /usr/bin directory inside the container. In my particular instance, that host OS is Fedora CoreOS 35.20220327.3.0. The problem, in this case, is that the chronyc command is a dynamically linked ELF that depends on the libnettle.so.8 shared library, which isn't present in the container. The container does contain libnettle.so.6, though.
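A quick way to confirm the mismatch is to run the dynamic-linker check against the bind-mounted binary from inside the agent container. This is only a diagnostic sketch; the image tag and the exact mount arguments are assumptions for illustration, not the agent's actual invocation:

# Diagnostic sketch only; image tag and mount are illustrative.
podman run --rm \
  -v /usr/bin/chronyc:/usr/bin/chronyc:ro \
  quay.io/edge-infrastructure/assisted-installer-agent:latest \
  ldd /usr/bin/chronyc | grep 'not found'
# On an affected FCOS 35 host this prints something like:
#   libnettle.so.8 => not found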

Anyway, IMO this [bind-mounting the chronyc command from the underlying OS] is a container anti-pattern.

Wouldn't it be a better approach to use the chronyc installed by dnf install chrony in the Dockerfile here, which is used to build the assisted installer agent container image?

@tsorya, could you have a look at the change introduced in that commit? It adds a significant prerequisite: the shared libraries that the host's chronyc binary is dynamically linked against must also be present in the assisted installer agent container image. Is there a different approach?

@pdfruth

pdfruth commented Jun 26, 2022

In the meantime, I've been able to work around the error by explicitly setting AGENT_DOCKER_IMAGE: quay.io/edge-infrastructure/assisted-installer-agent:v2.4.1 when customizing the sample okd-config.yml file here.
Note: v2.4.1 is the version of the image just prior to the commit that introduced the problem mentioned above.

For example, here is an okd-configmap.yml that works for me today:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config
data:
  ASSISTED_SERVICE_HOST: 192.168.10.2:8090
  ASSISTED_SERVICE_SCHEME: http
  AUTH_TYPE: none
  DB_HOST: 127.0.0.1
  DB_NAME: installer
  DB_PASS: admin
  DB_PORT: "5432"
  DB_USER: admin
  DEPLOY_TARGET: onprem
  DISK_ENCRYPTION_SUPPORT: "false"
  DUMMY_IGNITION: "false"
  ENABLE_SINGLE_NODE_DNSMASQ: "false"
  HW_VALIDATOR_REQUIREMENTS: '[{"version":"default","master":{"cpu_cores":4,"ram_mib":16384,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":100,"packet_loss_percentage":0},"worker":{"cpu_cores":2,"ram_mib":8192,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":1000,"packet_loss_percentage":10},"sno":{"cpu_cores":8,"ram_mib":16384,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10}}]'
  IMAGE_SERVICE_BASE_URL: http://192.168.10.2:8888
  IPV6_SUPPORT: "true"
  LISTEN_PORT: "8888"
  NTP_DEFAULT_SERVER: ""
  POSTGRESQL_DATABASE: installer
  POSTGRESQL_PASSWORD: admin
  POSTGRESQL_USER: admin
  PUBLIC_CONTAINER_REGISTRIES: 'quay.io'
  SERVICE_BASE_URL: http://192.168.10.2:8090
  STORAGE: filesystem
  OS_IMAGES: '[{"openshift_version":"4.10","cpu_architecture":"x86_64","url":"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220327.3.0/x86_64/fedora-coreos-35.20220327.3.0-live.x86_64.iso","rootfs_url":"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220327.3.0/x86_64/fedora-coreos-35.20220327.3.0-live-rootfs.x86_64.img","version":"35.20220327.3.0"}]'
  RELEASE_IMAGES: '[{"openshift_version":"4.10","cpu_architecture":"x86_64","url":"quay.io/openshift/okd:4.10.0-0.okd-2022-06-10-131327","version":"4.10.0-0.okd-2022-06-10-131327","default":true}]'
  OKD_RPMS_IMAGE: quay.io/vrutkovs/okd-rpms:4.10
  AGENT_DOCKER_IMAGE: quay.io/edge-infrastructure/assisted-installer-agent:v2.4.1
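For completeness, a hedged example of how such a config map is consumed when deploying the service with podman, per the documentation linked earlier; the file names are assumptions based on a typical checkout, not taken verbatim from those docs:

# File names are illustrative; use the ones from the linked podman deployment docs.
podman play kube --configmap okd-configmap.yml pod.yml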

@tsorya

tsorya commented Jun 26, 2022 via email

@omertuc

omertuc commented Jun 26, 2022

Anyway, IMO this [bind-mounting the chronyc command from the underlying OS] is a container anti-pattern.
Wouldn't it be a better approach to use the chronyc installed by dnf install chrony in the Dockerfile here, which is used to build the assisted installer agent container image?

It's not that simple: chronyc inside the agent container communicates, through a UDS socket mount, with the host operating system's non-containerized chronyd daemon. So we'd just be moving the problem from "host<->container shared-library incompatibilities" to "chronyc<->chronyd socket API incompatibilities across versions". Sadly, the former affects OKD users, while the latter affects (or at least used to affect; maybe it has been solved with recent RHCOS versions) upstream OCP Assisted Installer agent users. I think there is no "right" answer between those two options; they're both bound to break (and have in the past). We chose to solve the latter due to a user complaint a while ago, but we did so in a problematic manner (the mount), creating this issue for OKD users.
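To make that wiring concrete, here is a rough sketch of the kind of invocation involved, assuming chrony's default command-socket directory (/run/chrony); this is illustrative, not the agent's actual code:

# Rough sketch; /run/chrony is chrony's default socket directory and an assumption here.
podman run --rm \
  -v /run/chrony:/run/chrony \
  quay.io/edge-infrastructure/assisted-installer-agent:latest \
  chronyc sources
# The containerized chronyc talks to the host's chronyd over that Unix socket,
# which is where cross-version command/API mismatches can surface.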

But we can do something else: ideally, the solution here would be to disable the host's chronyd systemd service and run an equivalent, containerized chronyd service. That's a big change, but we should probably consider it.
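Purely as an illustration of that idea (it is not an existing agent feature), the shape would be roughly the following; the image name is hypothetical and the capability/volume details would need real design work:

# Hypothetical sketch of a containerized chronyd replacing the host service.
# Stop the host's chronyd, then run a containerized one that owns the clock
# (CAP_SYS_TIME) and exposes the usual command-socket directory.
sudo systemctl stop chronyd
podman run -d --name chronyd \
  --cap-add SYS_TIME \
  --network host \
  -v /run/chrony:/run/chrony \
  quay.io/example/chrony:latest   # hypothetical image name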

@omertuc

omertuc commented Jun 26, 2022

As a temporary workaround, we can solve it by not doing the bind mount when running on top of FCOS.
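A minimal sketch of what "skip the bind on FCOS" could look like at the invocation level; the os-release check and variable name are assumptions for illustration, not the agent's actual code:

# Sketch only: detect Fedora CoreOS via os-release and drop the chronyc bind mount.
if grep -q '^ID=fedora' /etc/os-release; then
  CHRONYC_MOUNT=""
else
  CHRONYC_MOUNT="-v /usr/bin/chronyc:/usr/bin/chronyc:ro"
fi
podman run --rm $CHRONYC_MOUNT \
  -v /run/chrony:/run/chrony \
  quay.io/edge-infrastructure/assisted-installer-agent:latest \
  chronyc sources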

@omertuc

omertuc commented Jun 26, 2022

Created https://issues.redhat.com/browse/MGMT-10937 to track the workaround / solution

@omertuc

omertuc commented Jun 26, 2022

cc @vrutkovs

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Sep 25, 2022
@omertuc

omertuc commented Sep 26, 2022

/lifecycle frozen

openshift-ci bot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/stale label on Sep 26, 2022