-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Configure HealthCheck with podman update
#24442
base: main
Are you sure you want to change the base?
Configure HealthCheck with podman update
#24442
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Honny1 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ephemeral COPR build failed. @containers/packit-build please check. |
2f18a35
to
475e16f
Compare
475e16f
to
e052162
Compare
e052162
to
432b6d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really a full review just a quick look:
Can you remove all the flag description from the commit? I do not see how they add any value there. If I want to know what a flag does one should look at the docs or I can see that already in the diff here anyway.
libpod/container_internal.go
Outdated
noHealthCheck := false | ||
for k, v := range *changedHealthCheckConfig { | ||
switch k { | ||
case "NoHealthCheck": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these strings seem to be used in many places so can you define them as constants somewhere, ie. libpod/define and then use the constants over the string duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have an idea of how to simplify it overall.
621c46c
to
96def55
Compare
cmd/podman/common/create.go
Outdated
@@ -665,6 +548,141 @@ func DefineCreateFlags(cmd *cobra.Command, cf *entities.ContainerCreateOptions, | |||
`If a container with the same name exists, replace it`, | |||
) | |||
} | |||
if mode == entities.CreateMode || mode == entities.UpdateMode { | |||
// TODO: Focus on disable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My note, I'll delete it.
cmd/podman/containers/update.go
Outdated
@@ -17,7 +17,7 @@ import ( | |||
) | |||
|
|||
var ( | |||
updateDescription = `Updates the cgroup configuration of a given container` | |||
updateDescription = `Updates the configuration of an already existing container, allowing different resource limits to be set, and HealthCheck configuration. The currently supported options are a subset of the podman create/run.` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Updates the configuration of an existing container, allowing changes to resource limits and healthchecks"
libpod/container_internal.go
Outdated
@@ -2738,3 +2741,327 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string | |||
|
|||
return nil | |||
} | |||
|
|||
func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) { | |||
var healthCheck manifest.Schema2HealthConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and startupHealthCheck should be pointers - will make returning nil a lot easier
libpod/container_internal.go
Outdated
func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) { | ||
var healthCheck manifest.Schema2HealthConfig | ||
if c.config.HealthCheckConfig != nil { | ||
healthCheck = manifest.Schema2HealthConfig{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can JSONDeepCopy this and save a lot of code
libpod/container_internal.go
Outdated
|
||
var startupHealthCheck define.StartupHealthCheck | ||
if c.config.StartupHealthCheckConfig != nil { | ||
startupHealthCheck = define.StartupHealthCheck{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, investigate JSONDeepCopy, I don't think this is performance critical at all.
libpod/container_internal.go
Outdated
return &healthCheck, &startupHealthCheck | ||
} | ||
|
||
func (c *Container) changeHealthCheckConfiguration(changedHealthCheckConfig *entities.UpdateHealthCheckConfig) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like leaking entities into Libpod. Maybe move UpdateHealthCheckConfig into libpod and make the entities version just contain it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, no. We should probably move this parsing code out of Libpod. Instead we should accept a manifest.Schema2HealthConfig
and define.StartupHealthCheck
in Update in libpod, and do all the validation in the frontend.
libpod/container_internal.go
Outdated
} | ||
|
||
logrus.Debugf("HealthCheck updated for container %s", c.ID()) | ||
c.newContainerEvent(events.Update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're calling this from Update() this is unnecessary, we only want one and it should be in Update itself
e5d1666
to
74eae00
Compare
New flags in a `podman update` can change the configuration of HealthCheck when the container is started, without having to restart or recreate the container. This can help determine why a given container suddenly started failing HealthCheck without interfering with the services it provides. For example, reconfigure HealthCheck to keep logs longer than the usual last X results, store logs to other destinations, etc. Fixes: https://issues.redhat.com/browse/RHEL-60561 Signed-off-by: Jan Rodák <[email protected]>
74eae00
to
4471330
Compare
@@ -136,6 +137,61 @@ func (c *Container) Update(resources *spec.LinuxResources, restartPolicy *string | |||
return c.update(resources, restartPolicy, restartRetries) | |||
} | |||
|
|||
// UpdateHealthCheckConfig updates HealthCheck configuration the given container. | |||
func (c *Container) UpdateHealthCheckConfig(healthCheckConfig *manifest.Schema2HealthConfig, changedTimer bool, noHealthCheck bool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noHealthCheck can probably be healthCheckConfig == nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think we might want to compute changedTimer
in here, instead of requiring the user pass it in - moving the rest of the parsing out of libpod was good but this was maybe a step too far.
|
||
// UpdateGlobalHealthCheckConfig updates global HealthCheck configuration the given container. | ||
// If value is nil then value will be not changed. | ||
func (c *Container) UpdateGlobalHealthCheckConfig(healthLogDestination *string, healthMaxLogCount *uint, healthMaxLogSize *uint, healthCheckOnFailureAction *define.HealthCheckOnFailureAction) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make a GlobalHealthCheckOptions
struct that contains all of these? Arguments list is a little long
@@ -2738,3 +2739,122 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string | |||
|
|||
return nil | |||
} | |||
|
|||
func (c *Container) resetHealthCheckTimers(noHealthCheck bool, changedTimer bool, isStartup bool) error { | |||
if !c.ensureState(define.ContainerStateCreated, define.ContainerStateRunning) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Paused as well? Do we stop the timers when we pause a container?
c.state.HCUnitName); err != nil { | ||
return err | ||
} | ||
case isStartup && changedTimer && c.config.StartupHealthCheckConfig != nil && !c.state.StartupHCPassed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I think we hit this if we add a startup healthcheck to a running container that did not previously have a startup healthcheck. We probably don't want to do that, so we should set StartupHCPassed
to true when adding a startup healthcheck to a running container that doesn't have one already
return err | ||
} | ||
|
||
err := c.resetHealthCheckTimers(noHealthCheck, changedTimer, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be combined with the following line
return nil | ||
} | ||
|
||
func (c *Container) updateStartupHealthCheckConfiguration(startupHealthCheckConfig *define.StartupHealthCheck, changedStartupTimer bool, noHealthCheck bool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically identical to updateHealthCheckConfiguration minus a few variables; can the two be refactored to share code? It seems like everything from line 2805 on is basically identical.
@@ -217,6 +217,8 @@ const ( | |||
Untag Status = "untag" | |||
// Update indicates that a container's configuration has been modified. | |||
Update Status = "update" | |||
// Update indicates that a container's HealthCheck configuration has been modified. | |||
UpdateHealthCheckConfig Status = "update-HealthCheck-config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably unnecessary; we can just use Update
|
||
@@option health-interval | ||
|
||
Changing this setting resets timer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this setting resets timer. | |
Changing this setting resets the timer. |
// HealthMaxLogCount set maximum number of attempts in the HealthCheck log file. | ||
// ('0' value means an infinite number of attempts in the log file) | ||
HealthMaxLogCount *uint `json:"health_max_log_count,omitempty"` | ||
// HealthOnFailure set action to take once the container turns unhealthy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// HealthOnFailure set action to take once the container turns unhealthy. | |
// HealthOnFailure set the action to take once the container turns unhealthy. |
A couple of small nits. Once @mheon 's concerns are addressed, LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edsantiago Mind looking at the sys tests?
if [[ $is_startup = "yes" ]]; then | ||
run_podman run -d --name $ctrname \ | ||
--health-cmd "echo $msg" \ | ||
--health-startup-cmd "echo $msg" \ | ||
$IMAGE /home/podman/pause | ||
cid="$output" | ||
else | ||
if [[ $no_hc = "yes" ]]; then | ||
run_podman run -d --name $ctrname \ | ||
$IMAGE /home/podman/pause | ||
cid="$output" | ||
else | ||
run_podman run -d --name $ctrname \ | ||
--health-cmd "echo $msg" \ | ||
$IMAGE /home/podman/pause | ||
cid="$output" | ||
fi | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a lot of duplication here, first you can use elif to avoid one layer of nesting
In general it is hard for the reader to figure out the differences, you should split out the run_podman call and only do the stuff that is different in there, i.e. define a hc_flags
arrray where you add the flags too and then to the run_podman call below which would make this a lot simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second, what Paul said
sleep 2s | ||
|
||
run_podman update $ctrname --health-interval 1s | ||
|
||
sleep 5s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding random sleeps is very bad you tests are adding over 20s of CI time doing nothing but sleeping, this is very wasteful.
Is there a way we can combine the test cases to avoid adding so much sleeps. And if we know one things sleeps means flakes so this is not very good. If there is no way to avoid it then at the very least all these testsshoul dbe run in parallel to not waste so much time
Specgen *specgen.SpecGenerator | ||
NameOrID string | ||
Specgen *specgen.SpecGenerator | ||
ChangedHealthCheckConfiguration *define.UpdateHealthCheckConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea why we decided to send a full specgen when the server does not understand most of it but doesn't the specgen already contain all the healtchcheck settings? So it seems wasteful and confusing to send yet another type? Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DIE and DRY are critical concepts in software development. I encourage you to develop a mindset where you're looking for duplication and, each time you see it, or each time you find yourself copypasting, taking a step back to ask yourself how you can eliminate it.
Once you clean up all my suggestions below, I bet you might even find more, and I bet you can then find a way to make the test table-driven which is even better.
And then, of course, please add # bats test_tags=ci:parallel
local ctrname="$1" | ||
local msg="$2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is so much duplication here that it will become a maintenance nightmare. First and most obvious, ctrname
and msg
are never used by the caller. There is no reason to pass them as parameters. They should be defined in here.
if [[ $is_startup = "yes" ]]; then | ||
run_podman run -d --name $ctrname \ | ||
--health-cmd "echo $msg" \ | ||
--health-startup-cmd "echo $msg" \ | ||
$IMAGE /home/podman/pause | ||
cid="$output" | ||
else | ||
if [[ $no_hc = "yes" ]]; then | ||
run_podman run -d --name $ctrname \ | ||
$IMAGE /home/podman/pause | ||
cid="$output" | ||
else | ||
run_podman run -d --name $ctrname \ | ||
--health-cmd "echo $msg" \ | ||
$IMAGE /home/podman/pause | ||
cid="$output" | ||
fi | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second, what Paul said
local flag="$4" | ||
local value="$5" | ||
local expect="$6" | ||
local expect_msg="$7" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Third, this is a completely unnecessary argument. It is identical to $format
except in the health-on-failure
test where I have no idea if that's a dup-typo or intentional. Please eliminate this.
local format="$3" | ||
local flag="$4" | ||
local value="$5" | ||
local expect="$6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's almost no need for this one, either. It is almost always identical to $value
. Therefore, the logic should be:
local expect="${6:-$5}"
so the caller can then use ""
to reduce duplication.
local is_startup="$9" | ||
local no_hc="${10}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be collapsed into a type
option (choose a better name please) with values ""
(third condition), "nohc"
(second), "startup"
(first).
New flags in a
podman update
can change the configuration of HealthCheck when the container is started, without having to restart or recreate the container.This can help determine why a given container suddenly started failing HealthCheck without interfering with the services it provides. For example, reconfigure HealthCheck to keep logs longer than the usual last X results, store logs to other destinations, etc.
These flags are added to the
podman update
command:--health-cmd
string
: set a healthcheck command for the container ('none' disables the existing healthcheck)--health-interval
string
: set an interval for the healthcheck (a value of disable results in no automatic timer setup)(Changing this setting resets timer.) (default "30s")--health-log-destination
string
: set the destination of the HealthCheck log. Directory path, local or events_logger (local use container state file)(Warning: Changing this setting may cause the loss of previous logs.) (default "local")--health-max-log-count
uint
: set maximum number of attempts in the HealthCheck log file. ('0' value means an infinite number of attempts in the log file) (default 5)--health-max-log-size
uint
: set maximum length in characters of stored HealthCheck log. ('0' value means an infinite log length) (default 500)--health-on-failure
string
: action to take once the container turns unhealthy (default "none")--health-retries
uint
: the number of retries allowed before a healthcheck is considered to be unhealthy (default 3)--health-start-period
string
: the initialization time needed for a container to bootstrap (default "0s")--health-startup-cmd
string
: Set a startup healthcheck command for the container--health-startup-interval
string
: Set an interval for the startup healthcheck. Changing this setting resets the timer, depending on the state of the container. (default "30s")--health-startup-retries
uint
: Set the maximum number of retries before the startup healthcheck will restart the container--health-startup-success
uint
: Set the number of consecutive successes before the startup healthcheck is marked as successful and the normal healthcheck begins (0 indicates any success will start the regular healthcheck)--health-startup-timeout
string
: Set the maximum amount of time that the startup healthcheck may take before it is considered failed (default "30s")--health-timeout
string
: the maximum time allowed to complete the healthcheck before an interval is considered failed (default "30s")--no-healthcheck
: Disable healthchecks on containerFixes: https://issues.redhat.com/browse/RHEL-60561
Does this PR introduce a user-facing change?