Configure HealthCheck with `podman update` #24442

Honny1 · 2024-11-01T07:28:22Z

New flags for `podman update` can change a container's HealthCheck configuration while the container is running, without having to restart or recreate it.

This can help determine why a given container suddenly started failing its HealthCheck, without interfering with the services it provides. For example, HealthCheck can be reconfigured to keep more log entries than the default of the last 5 results, or to store logs in another destination.

These flags are added to the podman update command:

--health-cmd string: Set a healthcheck command for the container ('none' disables the existing healthcheck)
--health-interval string: Set an interval for the healthcheck; a value of 'disable' results in no automatic timer setup. Changing this setting resets the timer. (default "30s")
--health-log-destination string: Set the destination of the HealthCheck log: a directory path, 'local', or 'events_logger' ('local' uses the container state file). Warning: changing this setting may cause the loss of previous logs. (default "local")
--health-max-log-count uint: Set the maximum number of attempts kept in the HealthCheck log file ('0' means an unlimited number of attempts) (default 5)
--health-max-log-size uint: Set the maximum length in characters of the stored HealthCheck log ('0' means an unlimited log length) (default 500)
--health-on-failure string: Action to take once the container turns unhealthy (default "none")
--health-retries uint: The number of retries allowed before a healthcheck is considered unhealthy (default 3)
--health-start-period string: The initialization time needed for a container to bootstrap (default "0s")
--health-startup-cmd string: Set a startup healthcheck command for the container
--health-startup-interval string: Set an interval for the startup healthcheck. Changing this setting resets the timer, depending on the state of the container. (default "30s")
--health-startup-retries uint: Set the maximum number of retries before the startup healthcheck will restart the container
--health-startup-success uint: Set the number of consecutive successes before the startup healthcheck is marked as successful and the normal healthcheck begins (0 indicates any success will start the regular healthcheck)
--health-startup-timeout string: Set the maximum amount of time that the startup healthcheck may take before it is considered failed (default "30s")
--health-timeout string: The maximum time allowed to complete the healthcheck before an interval is considered failed (default "30s")
--no-healthcheck: Disable healthchecks on the container

Fixes: https://issues.redhat.com/browse/RHEL-60561

Does this PR introduce a user-facing change?

Configure HealthCheck with podman update

openshift-ci · 2024-11-01T07:28:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign mheon for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

packit-as-a-service · 2024-11-01T07:29:59Z

Ephemeral COPR build failed. @containers/packit-build please check.

Luap99

Not really a full review, just a quick look:

Can you remove all the flag descriptions from the commit message? I do not see how they add any value there. If I want to know what a flag does, I should look at the docs, and I can already see that in the diff here anyway.

Luap99 · 2024-11-01T17:16:26Z

libpod/container_internal.go

+	noHealthCheck := false
+	for k, v := range *changedHealthCheckConfig {
+		switch k {
+		case "NoHealthCheck":


All these strings seem to be used in many places, so can you define them as constants somewhere, e.g. in libpod/define, and then use the constants instead of duplicating the strings?

Thanks, I have an idea of how to simplify it overall.
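
To illustrate the constants suggestion in this thread, here is a minimal, self-contained sketch. Only the "NoHealthCheck" key appears in the diff; the constant names, the "HealthInterval" key, and the `applyChanges` helper are hypothetical and do not come from the PR.

```go
package main

import "fmt"

// Hypothetical key constants. Only the "NoHealthCheck" string is taken
// from the diff in this thread; the rest is illustrative.
const (
	KeyNoHealthCheck  = "NoHealthCheck"
	KeyHealthInterval = "HealthInterval"
)

// applyChanges is an illustrative stand-in for the switch in the diff.
// It reports whether healthchecks were disabled and lists unknown keys.
func applyChanges(changed map[string]string) (noHealthCheck bool, unknown []string) {
	for k, v := range changed {
		switch k {
		case KeyNoHealthCheck:
			noHealthCheck = v == "true"
		case KeyHealthInterval:
			// A real implementation would parse and store the interval here.
		default:
			unknown = append(unknown, k)
		}
	}
	return noHealthCheck, unknown
}

func main() {
	off, unknown := applyChanges(map[string]string{
		KeyNoHealthCheck: "true",
		"Typo":           "1",
	})
	fmt.Println(off, unknown)
}
```

With the keys defined once, a misspelled constant name fails at compile time, whereas a misspelled string literal would silently fall through to the default case.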

mheon · 2024-11-04T18:48:50Z

cmd/podman/common/create.go

@@ -665,6 +548,141 @@ func DefineCreateFlags(cmd *cobra.Command, cf *entities.ContainerCreateOptions,
 			`If a container with the same name exists, replace it`,
 		)
 	}
+	if mode == entities.CreateMode || mode == entities.UpdateMode {
+		// TODO: Focus on disable


What does this mean?

My note, I'll delete it.

mheon · 2024-11-04T18:49:41Z

cmd/podman/containers/update.go

@@ -17,7 +17,7 @@ import (
 )

 var (
-	updateDescription = `Updates the cgroup configuration of a given container`
+	updateDescription = `Updates the configuration of an already existing container, allowing different resource limits to be set, and HealthCheck configuration. The currently supported options are a subset of the podman create/run.`


"Updates the configuration of an existing container, allowing changes to resource limits and healthchecks"

mheon · 2024-11-04T19:14:47Z

libpod/container_internal.go

@@ -2738,3 +2741,327 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string

 	return nil
 }
+
+func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) {
+	var healthCheck manifest.Schema2HealthConfig


This and startupHealthCheck should be pointers - will make returning nil a lot easier

mheon · 2024-11-04T19:15:20Z

libpod/container_internal.go

+func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) {
+	var healthCheck manifest.Schema2HealthConfig
+	if c.config.HealthCheckConfig != nil {
+		healthCheck = manifest.Schema2HealthConfig{


I think you can JSONDeepCopy this and save a lot of code

mheon · 2024-11-04T19:15:41Z

libpod/container_internal.go

+
+	var startupHealthCheck define.StartupHealthCheck
+	if c.config.StartupHealthCheckConfig != nil {
+		startupHealthCheck = define.StartupHealthCheck{


Same here, investigate JSONDeepCopy, I don't think this is performance critical at all.
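
For readers unfamiliar with the technique, a JSON-based deep copy simply marshals the source value and unmarshals it into a fresh destination. The sketch below uses only `encoding/json`; the `deepCopy` helper and the `healthConfig` type are hypothetical stand-ins, and the actual signature of the JSONDeepCopy helper is not shown in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthConfig is a hypothetical, reduced stand-in for the real config types.
type healthConfig struct {
	Test    []string `json:"test,omitempty"`
	Retries int      `json:"retries,omitempty"`
}

// deepCopy copies from into to by round-tripping through JSON.
// to must be a pointer. Only exported, JSON-serializable fields are copied.
func deepCopy(from, to interface{}) error {
	data, err := json.Marshal(from)
	if err != nil {
		return err
	}
	return json.Unmarshal(data, to)
}

func main() {
	orig := &healthConfig{Test: []string{"CMD", "true"}, Retries: 3}

	var cp healthConfig
	if err := deepCopy(orig, &cp); err != nil {
		panic(err)
	}

	// Mutating the copy's slice must not affect the original.
	cp.Test[1] = "false"
	fmt.Println(orig.Test, cp.Test)
}
```

The trade-off is extra allocation and encoding work in exchange for not writing field-by-field copy code, which matches the remark that this path is not performance critical.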

mheon · 2024-11-04T19:16:36Z

libpod/container_internal.go

+	return &healthCheck, &startupHealthCheck
+}
+
+func (c *Container) changeHealthCheckConfiguration(changedHealthCheckConfig *entities.UpdateHealthCheckConfig) (bool, error) {


I don't like leaking entities into Libpod. Maybe move UpdateHealthCheckConfig into libpod and make the entities version just contain it?

Actually, no. We should probably move this parsing code out of Libpod. Instead we should accept a manifest.Schema2HealthConfig and define.StartupHealthCheck in Update in libpod, and do all the validation in the frontend.

mheon · 2024-11-04T19:19:56Z

libpod/container_internal.go

+	}
+
+	logrus.Debugf("HealthCheck updated for container %s", c.ID())
+	c.newContainerEvent(events.Update)


If you're calling this from Update() this is unnecessary, we only want one and it should be in Update itself

New flags in a `podman update` can change the configuration of HealthCheck when the container is started, without having to restart or recreate the container.

This can help determine why a given container suddenly started failing HealthCheck without interfering with the services it provides. For example, reconfigure HealthCheck to keep logs longer than the usual last X results, store logs to other destinations, etc.

Fixes: https://issues.redhat.com/browse/RHEL-60561

Signed-off-by: Jan Rodák <[email protected]>

Honny1 · 2024-11-06T18:41:51Z

@mheon and @Luap99 I have edited the PR according to your comments.

mheon · 2024-11-06T20:35:49Z

libpod/container_api.go

@@ -136,6 +137,61 @@ func (c *Container) Update(resources *spec.LinuxResources, restartPolicy *string
 	return c.update(resources, restartPolicy, restartRetries)
 }

+// UpdateHealthCheckConfig updates HealthCheck configuration the given container.
+func (c *Container) UpdateHealthCheckConfig(healthCheckConfig *manifest.Schema2HealthConfig, changedTimer bool, noHealthCheck bool) error {


noHealthCheck can probably be healthCheckConfig == nil

And I think we might want to compute changedTimer in here, instead of requiring the user pass it in - moving the rest of the parsing out of libpod was good but this was maybe a step too far.

mheon · 2024-11-06T20:38:16Z

libpod/container_api.go

+
+// UpdateGlobalHealthCheckConfig updates global HealthCheck configuration the given container.
+// If value is nil then value will be not changed.
+func (c *Container) UpdateGlobalHealthCheckConfig(healthLogDestination *string, healthMaxLogCount *uint, healthMaxLogSize *uint, healthCheckOnFailureAction *define.HealthCheckOnFailureAction) error {


Maybe make a GlobalHealthCheckOptions struct that contains all of these? Arguments list is a little long
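
One possible shape for such a struct is sketched below. The struct name comes from the comment above and the field names mirror the parameters in the diff; the `settings` type, the `apply` method, and the use of a plain string for the on-failure action are assumptions made only to keep the example self-contained.

```go
package main

import "fmt"

// GlobalHealthCheckOptions groups the optional settings from the signature
// above. A nil field means "leave the current value unchanged".
// The on-failure action is a plain string here for simplicity.
type GlobalHealthCheckOptions struct {
	HealthLogDestination *string
	HealthMaxLogCount    *uint
	HealthMaxLogSize     *uint
	HealthOnFailure      *string
}

// settings is a hypothetical stand-in for the container's stored config.
type settings struct {
	LogDestination string
	MaxLogCount    uint
	MaxLogSize     uint
	OnFailure      string
}

// apply overwrites only the fields that were explicitly set.
func (s *settings) apply(o GlobalHealthCheckOptions) {
	if o.HealthLogDestination != nil {
		s.LogDestination = *o.HealthLogDestination
	}
	if o.HealthMaxLogCount != nil {
		s.MaxLogCount = *o.HealthMaxLogCount
	}
	if o.HealthMaxLogSize != nil {
		s.MaxLogSize = *o.HealthMaxLogSize
	}
	if o.HealthOnFailure != nil {
		s.OnFailure = *o.HealthOnFailure
	}
}

func main() {
	s := settings{LogDestination: "local", MaxLogCount: 5, MaxLogSize: 500, OnFailure: "none"}

	// Only the log count is changed; 0 is a meaningful value here,
	// which is why pointers are used instead of zero-value checks.
	count := uint(0)
	s.apply(GlobalHealthCheckOptions{HealthMaxLogCount: &count})
	fmt.Printf("%+v\n", s)
}
```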

mheon · 2024-11-06T20:39:50Z

libpod/container_internal.go

@@ -2738,3 +2739,122 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string

 	return nil
 }
+
+func (c *Container) resetHealthCheckTimers(noHealthCheck bool, changedTimer bool, isStartup bool) error {
+	if !c.ensureState(define.ContainerStateCreated, define.ContainerStateRunning) {


Maybe Paused as well? Do we stop the timers when we pause a container?

mheon · 2024-11-06T20:41:39Z

libpod/container_internal.go

+			c.state.HCUnitName); err != nil {
+			return err
+		}
+	case isStartup && changedTimer && c.config.StartupHealthCheckConfig != nil && !c.state.StartupHCPassed:


Hm. I think we hit this if we add a startup healthcheck to a running container that did not previously have a startup healthcheck. We probably don't want to do that, so we should set StartupHCPassed to true when adding a startup healthcheck to a running container that doesn't have one already

mheon · 2024-11-06T20:42:46Z

libpod/container_internal.go

+		return err
+	}
+
+	err := c.resetHealthCheckTimers(noHealthCheck, changedTimer, false)


This can be combined with the following line

mheon · 2024-11-06T20:45:45Z

libpod/container_internal.go

+	return nil
+}
+
+func (c *Container) updateStartupHealthCheckConfiguration(startupHealthCheckConfig *define.StartupHealthCheck, changedStartupTimer bool, noHealthCheck bool) error {


This is basically identical to updateHealthCheckConfiguration minus a few variables; can the two be refactored to share code? It seems like everything from line 2805 on is basically identical.

mheon · 2024-11-06T20:46:31Z

libpod/events/config.go

@@ -217,6 +217,8 @@ const (
 	Untag Status = "untag"
 	// Update indicates that a container's configuration has been modified.
 	Update Status = "update"
+	// Update indicates that a container's HealthCheck configuration has been modified.
+	UpdateHealthCheckConfig Status = "update-HealthCheck-config"


Probably unnecessary; we can just use Update

TomSweeneyRedHat · 2024-11-07T00:58:38Z

docs/source/markdown/podman-update.1.md.in

+
+@@option health-interval
+
+Changing this setting resets timer.


Suggested change

-Changing this setting resets timer.
+Changing this setting resets the timer.

TomSweeneyRedHat · 2024-11-07T01:01:34Z

libpod/define/healthchecks.go

+	// HealthMaxLogCount set maximum number of attempts in the HealthCheck log file.
+	// ('0' value means an infinite number of attempts in the log file)
+	HealthMaxLogCount *uint `json:"health_max_log_count,omitempty"`
+	// HealthOnFailure set action to take once the container turns unhealthy.


Suggested change

-	// HealthOnFailure set action to take once the container turns unhealthy.
+	// HealthOnFailure set the action to take once the container turns unhealthy.

TomSweeneyRedHat · 2024-11-07T01:05:34Z

A couple of small nits. Once @mheon's concerns are addressed, LGTM.

Luap99

@edsantiago Mind looking at the sys tests?

Luap99 · 2024-11-07T10:35:42Z

test/system/280-update.bats

+    if [[ $is_startup = "yes" ]]; then
+        run_podman run -d --name $ctrname    \
+            --health-cmd "echo $msg"         \
+            --health-startup-cmd "echo $msg" \
+            $IMAGE /home/podman/pause
+        cid="$output"
+    else
+        if [[ $no_hc = "yes" ]]; then
+            run_podman run -d --name $ctrname \
+                    $IMAGE /home/podman/pause
+            cid="$output"
+        else
+            run_podman run -d --name $ctrname \
+                --health-cmd "echo $msg"  \
+                $IMAGE /home/podman/pause
+            cid="$output"
+        fi
+    fi


There is a lot of duplication here. First, you can use elif to avoid one layer of nesting.

In general, it is hard for the reader to figure out the differences. You should split out the run_podman call and keep only the parts that differ inside the conditionals, i.e. define an hc_flags array to which you add the flags, and then pass it to a single run_podman call below, which would make this a lot simpler.

Luap99 · 2024-11-07T10:41:28Z

test/system/280-update.bats

+    sleep 2s
+
+    run_podman update $ctrname --health-interval 1s
+
+    sleep 5s


Adding random sleeps is very bad. Your tests add over 20s of CI time doing nothing but sleeping, which is very wasteful.

Is there a way we can combine the test cases to avoid adding so many sleeps? And if we know one thing, it is that sleeps mean flakes, so this is not very good. If there is no way to avoid it, then at the very least all these tests should be run in parallel to not waste so much time.

Luap99 · 2024-11-07T10:47:33Z

pkg/domain/entities/types/containers.go

-	Specgen  *specgen.SpecGenerator
+	NameOrID                        string
+	Specgen                         *specgen.SpecGenerator
+	ChangedHealthCheckConfiguration *define.UpdateHealthCheckConfig


I have no idea why we decided to send a full specgen when the server does not understand most of it, but doesn't the specgen already contain all the healthcheck settings? It seems wasteful and confusing to send yet another type. Am I missing something?

edsantiago

DIE and DRY are critical concepts in software development. I encourage you to develop a mindset where you're looking for duplication and, each time you see it, or each time you find yourself copypasting, taking a step back to ask yourself how you can eliminate it.

Once you clean up all my suggestions below, I bet you might even find more, and I bet you can then find a way to make the test table-driven which is even better.

And then, of course, please add # bats test_tags=ci:parallel

edsantiago · 2024-11-07T14:30:38Z

test/system/280-update.bats

+    local ctrname="$1"
+    local msg="$2"


There is so much duplication here that it will become a maintenance nightmare. First and most obvious, ctrname and msg are never used by the caller. There is no reason to pass them as parameters. They should be defined in here.

edsantiago · 2024-11-07T14:31:36Z

test/system/280-update.bats

+    if [[ $is_startup = "yes" ]]; then
+        run_podman run -d --name $ctrname    \
+            --health-cmd "echo $msg"         \
+            --health-startup-cmd "echo $msg" \
+            $IMAGE /home/podman/pause
+        cid="$output"
+    else
+        if [[ $no_hc = "yes" ]]; then
+            run_podman run -d --name $ctrname \
+                    $IMAGE /home/podman/pause
+            cid="$output"
+        else
+            run_podman run -d --name $ctrname \
+                --health-cmd "echo $msg"  \
+                $IMAGE /home/podman/pause
+            cid="$output"
+        fi
+    fi


Second, what Paul said

edsantiago · 2024-11-07T14:33:43Z

test/system/280-update.bats

+    local flag="$4"
+    local value="$5"
+    local expect="$6"
+    local expect_msg="$7"


Third, this is a completely unnecessary argument. It is identical to $format except in the health-on-failure test where I have no idea if that's a dup-typo or intentional. Please eliminate this.

edsantiago · 2024-11-07T14:35:50Z

test/system/280-update.bats

+    local format="$3"
+    local flag="$4"
+    local value="$5"
+    local expect="$6"


There's almost no need for this one, either. It is almost always identical to $value. Therefore, the logic should be:

local expect="${6:-$5}"

so the caller can then use "" to reduce duplication.

edsantiago · 2024-11-07T14:38:21Z

test/system/280-update.bats

+    local is_startup="$9"
+    local no_hc="${10}"


These can be collapsed into a type option (choose a better name please) with values "" (third condition), "nohc" (second), "startup" (first).

openshift-ci bot added the do-not-merge/work-in-progress and do-not-merge/release-note-label-needed labels Nov 1, 2024

openshift-merge-robot added the needs-rebase label Nov 1, 2024

github-actions bot added the kind/api-change label Nov 1, 2024

Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from 2f18a35 to 475e16f Compare November 1, 2024 07:35

openshift-merge-robot removed the needs-rebase label Nov 1, 2024

Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from 475e16f to e052162 Compare November 1, 2024 15:20

openshift-ci bot added the release-note label and removed the do-not-merge/release-note-label-needed label Nov 1, 2024

Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from e052162 to 432b6d3 Compare November 1, 2024 15:52

Luap99 reviewed Nov 1, 2024

Honny1 force-pushed the change-healthcheck-config-via-podman-update branch 2 times, most recently from 621c46c to 96def55 Compare November 1, 2024 19:25

mheon reviewed Nov 4, 2024

Honny1 force-pushed the change-healthcheck-config-via-podman-update branch 2 times, most recently from e5d1666 to 74eae00 Compare November 6, 2024 16:40

Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from 74eae00 to 4471330 Compare November 6, 2024 16:46

Honny1 marked this pull request as ready for review November 6, 2024 18:39

openshift-ci bot removed the do-not-merge/work-in-progress label Nov 6, 2024

mheon reviewed Nov 6, 2024

TomSweeneyRedHat reviewed Nov 7, 2024

Luap99 reviewed Nov 7, 2024

edsantiago requested changes Nov 7, 2024
