Shoot worker node hostname changes after machine reboot #569

timebertt · 2023-02-08T15:34:35Z

How to categorize this issue?

/area robustness
/kind bug
/platform openstack

What happened:

When rebooting a shoot worker node, its hostname changes.

# cat /etc/hostname # before
shoot--1ad1ca31bc--migrate0-pool-xuyn9ea3hs-z1-54964-td9lx
# cat /etc/hostname # after
shoot--1ad1ca31bc--migrate0-pool-xuyn9ea3hs-z1-54964-td9lx.noval

This causes kubelet to fail to start after the machine reboot because it can't get the Node object with the new name:

kubelet.go:2424] "Error getting node" err="node \"shoot--1ad1ca31bc--migrate0-pool-xuyn9ea3hs-z1-54964-td9lx.noval\" not found"

Note: the default dns_domain for neutron networks is novalocal in our installation, and it is appended to the server name. Because the resulting FQDN is too long for a hostname, it is truncated to a .noval suffix in the example above.
provider-openstack doesn't set the dns_domain in the created neutron networks explicitly.
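The truncation can be illustrated with a small sketch. It assumes the limit is 64 bytes (HOST_NAME_MAX on Linux); the issue does not establish which component actually performs the truncation, and truncatedFQDN is a hypothetical helper, not part of provider-openstack.

```go
package main

import "fmt"

// maxHostnameLen is an assumption: Linux limits hostnames to 64 bytes (HOST_NAME_MAX).
const maxHostnameLen = 64

// truncatedFQDN is a hypothetical helper. It joins the server name and the
// DNS domain with a dot and cuts the result to maxHostnameLen bytes.
func truncatedFQDN(serverName, dnsDomain string) string {
	fqdn := serverName + "." + dnsDomain
	if len(fqdn) > maxHostnameLen {
		return fqdn[:maxHostnameLen]
	}
	return fqdn
}

func main() {
	// Server name taken from the "before" output in this issue.
	name := "shoot--1ad1ca31bc--migrate0-pool-xuyn9ea3hs-z1-54964-td9lx"
	fmt.Println(truncatedFQDN(name, "novalocal"))
}
```

Under that assumption, the server name plus ".novalocal" exceeds the limit, and cutting it at 64 bytes yields the ".noval" suffix seen in the "after" output.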

What you expected to happen:

The hostname should be stable and kubelet should be able to start again after a node reboot.

How to reproduce it (as minimally and precisely as possible):

1. SSH into a node.
2. Reboot the machine.
3. Observe that kubelet fails to start and the Node does not recover from the NotReady state.

Anything else we need to know?:

This extension adds an ExecStartPre directive to the kubelet unit which changes the hostname:

gardener-extension-provider-openstack/pkg/webhook/controlplane/ensurer.go, lines 265 to 267 in a9035cb:

    Section: "Service",
    Name:    "ExecStartPre",
    Value:   `/bin/sh -c 'hostnamectl set-hostname $(cat /etc/hostname | cut -d '.' -f 1)'`,

On the initial boot of the machine, this always works because the kubelet unit and its hostnamectl command are only invoked after all cloud-init mechanisms have run (the unit is only present after the first successful run of the cloud-config downloader/executor).
However, after a reboot, the kubelet unit and its hostnamectl command race with other cloud-init mechanisms, which can lead to a changed hostname.
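For reference, the effect of the `cut -d '.' -f 1` part of the ExecStartPre command can be sketched in Go. This is only an illustration of the string handling, not code from the extension; shortHostname is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// shortHostname is a hypothetical helper that mirrors `cut -d '.' -f 1`:
// it returns everything before the first dot, or the whole string if
// there is no dot. Surrounding whitespace (such as the trailing newline
// of /etc/hostname) is trimmed first.
func shortHostname(hostname string) string {
	short, _, _ := strings.Cut(strings.TrimSpace(hostname), ".")
	return short
}

func main() {
	// Hostname taken from the "after" output in this issue.
	after := "shoot--1ad1ca31bc--migrate0-pool-xuyn9ea3hs-z1-54964-td9lx.noval\n"
	fmt.Println(shortHostname(after))
}
```

Applied to the "after" value, this gives back the original node name. The kubelet failure therefore suggests that, after a reboot, something rewrites the hostname once the ExecStartPre command has already run, which matches the race described above.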

Environment:

Gardener version (if relevant): v1.62.1
Extension version: v1.31.0
Kubernetes version (use kubectl version): v1.24.8
Cloud provider or hardware configuration: STACKIT / OpenStack (Queens/Yoga)

gardener-robot added the area/robustness, kind/bug, and platform/openstack labels on Feb 8, 2023

gardener-robot added the lifecycle/stale label (nobody worked on this for 6 months) on Oct 18, 2023

gardener-robot added the lifecycle/rotten label (nobody worked on this for 12 months, final aging stage) and removed the lifecycle/stale label on Jun 26, 2024