
Fix context issue during cleanup of kind clusters #6771

Open · wants to merge 1 commit into main
Conversation

jainpulkit22 (Contributor)

Fix context issue during cleanup of kind clusters.

Fixes #6768.

creationTimestamp=$(kubectl get nodes --context kind-$kind_cluster_name -o json -l node-role.kubernetes.io/control-plane | \
for context in $(kubectl config get-contexts -o name | grep 'kind-'); do
cluster_name=$(echo $context | sed 's/^kind-//')
if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
rajnkamr (Contributor) · Oct 25, 2024

Does docker ps --format '{{.Names}}' | grep -q "$cluster_name" list all containers for the given cluster name? Or should it print the matches, e.g. docker ps --format '{{.Names}}' | grep "$cluster_name"?

Suggested change
if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
if docker ps --format '{{.Names}}' | grep "$cluster_name"; then

jainpulkit22 (Author)

We don't need to list the cluster names; we just want to check whether they are present, so -q is required here.
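For illustration only, a minimal sketch of how grep -q serves as an existence check in this kind of condition (the cluster name below is hypothetical):

```bash
#!/usr/bin/env bash
# grep -q suppresses output and only sets the exit status,
# which is all the if-condition needs.
cluster_name="kind-example"   # hypothetical name, for illustration only
if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
    echo "containers for $cluster_name are running"
else
    echo "no containers found for $cluster_name"
fi
```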

creationTimestamp=$(kubectl get nodes --context kind-$kind_cluster_name -o json -l node-role.kubernetes.io/control-plane | \
for context in $(kubectl config get-contexts -o name | grep 'kind-'); do
cluster_name=$(echo $context | sed 's/^kind-//')
if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
Contributor

why are we relying on this docker command instead of kind get clusters?

Contributor

And kind get nodes can also be used to list all node names.
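For reference, a minimal sketch of the two kind commands mentioned in this thread (the cluster name "foo" is hypothetical):

```bash
# List the names of all kind clusters known to this host.
kind get clusters

# List the node (Docker container) names for a given cluster.
# "foo" is a hypothetical cluster name used only for illustration.
kind get nodes --name foo
```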

jainpulkit22 (Author)

We are not relying on kind get clusters because it also lists clusters whose creation is still in progress or has not completed successfully. That creates a problem: when we try to get the contexts of such half-created clusters, the command returns an error. We can run into this error with aborted jobs and with multiple jobs running at once.

In the aborted-job case, suppose we abort the job just as cluster creation starts. kind get clusters will then show the cluster from the aborted job in its list, but since its context is not available, the cleanup will panic and the job will fail.

In the multiple-jobs case, suppose two jobs are running in parallel: one job has just triggered the cluster-creation step when the other job triggers the cleanup function. The cleanup will list the cluster that is not yet created, try to fetch the context for that cluster, and the job will fail because of the panic.
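For illustration only, a minimal sketch of the context-based cleanup loop being discussed, assembled from the diff context shown above; the deletion step is a simplified assumption, and the age check based on creationTimestamp that the script performs is omitted:

```bash
#!/usr/bin/env bash
# Iterate only over kube contexts that already exist, so half-created
# clusters (whose contexts are not registered yet) are never touched.
for context in $(kubectl config get-contexts -o name | grep 'kind-'); do
    cluster_name=$(echo "$context" | sed 's/^kind-//')
    # Only act on clusters whose containers are actually running.
    if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
        # Simplified: make sure the control-plane node is reachable
        # through this context before cleaning up.
        kubectl get nodes --context "$context" -o json \
            -l node-role.kubernetes.io/control-plane > /dev/null || continue
        kind delete cluster --name "$cluster_name"
    fi
done
```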

jainpulkit22 (Author)

Sometimes, because of these stale clusters left in the environment, the testbed also becomes unhealthy.

Contributor

@jainpulkit22 Thanks for the explanation. My concern is whether docker ps --format '{{.Names}}' | grep -q "$cluster_name" is a sufficient basis to determine whether a cluster is ready. Is there something like a status.conditions field in the context that would allow us to determine accurately whether the cluster is ready?
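As one possible illustration (not necessarily what the reviewer had in mind), node readiness can be read from status.conditions; a minimal sketch with a hypothetical context name:

```bash
# Check the Ready condition in status.conditions of the control-plane node.
# "kind-example" is a hypothetical context name used only for illustration.
kubectl get nodes --context kind-example \
    -l node-role.kubernetes.io/control-plane \
    -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
# Prints "True" when the node is Ready.
```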

Contributor

I was trying to point out that at this stage we are already iterating over known contexts (for context in $(kubectl config get-contexts -o name | grep 'kind-')), so we know that the context exists. It is not the same as the previous solution, which was first calling kind get clusters, and then calling kubectl config get-contexts.

In your comment and in the original issue, you talk about a "panic", but it's not clear to me what failure you are referring to exactly.

Note that even with your proposed solution, races seem possible:

  1. job 1 calls kubectl config get-contexts and gets context1
  2. job 2 calls kubectl config get-contexts and gets context1
  3. job 1 calls docker ps and finds the container it is looking for
  4. job 2 calls docker ps and finds the container it is looking for
  5. job 1 calls kubectl get nodes --context context1 ... successfully
  6. job 1 deletes the kind cluster successfully
  7. job 2 calls kubectl get nodes --context context1 ... but context1 does not exist anymore!

Either the code should be best effort and tolerate failures, or maybe you should use a lock (see flock) for correctness, and then you don't need to worry about concurrent executions and you can use the most appropriate commands for the task.
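A minimal sketch of the flock-based approach suggested above, assuming jobs on the same host can share a lock file (the path /tmp/kind-cleanup.lock is hypothetical):

```bash
#!/usr/bin/env bash
# Serialize the whole cleanup section across concurrent jobs with flock(1),
# so no other job can delete a cluster between the context lookup and the
# commands that use it.
(
    # Block until the lock on file descriptor 9 is acquired.
    flock 9
    for context in $(kubectl config get-contexts -o name | grep 'kind-'); do
        cluster_name=$(echo "$context" | sed 's/^kind-//')
        # With the lock held, the context cannot disappear underneath us.
        kubectl get nodes --context "$context" > /dev/null 2>&1 || continue
        kind delete cluster --name "$cluster_name"
    done
) 9>/tmp/kind-cleanup.lock
```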
