-
@sadjy Hey! To be clear, ARC doesn't know about the nodes and node pools of the cluster. All it does is eventually create a vanilla K8s pod or StatefulSet so that your cluster's control plane (the kube-scheduler) places the pods onto available nodes. If you have a cluster-autoscaler installed in your cluster, it might find a pod stuck as Unschedulable and add node(s) so that the unschedulable pods can be scheduled onto the newly created nodes. All of this happens outside of ARC; ARC knows nothing about the nodes. Do you see any Unschedulable or Pending runner pods? If so, there might be a cluster issue preventing pods from starting. This is probably unrelated to your issue, but anyway: I see you have configured the scale-up trigger duration to only …
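(For illustration, this is roughly the kind of pod ARC ends up creating; a minimal sketch with hypothetical names, not the exact manifest ARC generates:)

```yaml
# Sketch of an ARC-created runner pod, as the scheduler sees it.
# Pod name and namespace are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: my-runners-abcde          # ARC generates a name like this
  namespace: actions-runner-system
spec:
  containers:
    - name: runner
      image: summerwind/actions-runner:latest
  # Note: no node awareness here. Unless you add nodeSelector/affinity
  # yourself, kube-scheduler (and cluster-autoscaler, for Unschedulable
  # pods) decides which node pool this lands on.
```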
-
Hi @mumoshu! Thank you for your response.
That's a good point. I set it to a more reasonable value of 30 minutes now.
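(For reference, this is the field I changed; a sketch of the relevant HRA fragment, assuming the `workflowJob` trigger from my setup rather than my exact config:)

```yaml
# HorizontalRunnerAutoscaler fragment (sketch): duration controls how long
# a webhook-triggered capacity increase is kept before it expires.
spec:
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"   # was much shorter before
```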
Okay that answers my question and clarifies the way ARC works. Thanks.
Yes, it turns out my cluster was not in a stable state. The thing is, I made a bunch of changes to multiple things at the same time, multiple times, using Terraform, and I think that messed up some of the critical GKE systems. I did have a few … After destroying everything and spinning it all back up again, I got the following: … However, if I set …
Additionally, and oddly enough, after those … Again, I know this isn't an issue so to speak, and I'm certain I'm missing something here. I'm just not quite sure what yet and can't figure out where to look.
-
Hey!
ARC emits this error when two or more HRA+RunnerDeployment pairs match the information in a `workflowJob` webhook event.
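To make that concrete, here's a hedged sketch of a setup that would trigger it; both HRAs (all names invented) point at RunnerDeployments that could match the same event:

```yaml
# Two HRAs whose RunnerDeployments serve the same repository. A single
# workflowJob webhook event then matches both pairs, so ARC can't pick
# one scale target and errors out.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: autoscaler-a              # hypothetical
spec:
  scaleTargetRef:
    name: runners-a               # serves my-org/my-repo
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: autoscaler-b              # hypothetical
spec:
  scaleTargetRef:
    name: runners-b               # also serves my-org/my-repo
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"
```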
I might be missing something, but this sentence describes how ARC is intended to work.
Assuming you're talking about the …
-
Hey @mumoshu!
Oh okay! I assumed ARC would notice that this runner is the "last" one and not let it terminate, but the following explanation (where you say that it's the runner that terminates itself after a job) makes me understand the dynamic:
Yes, got it, that clarifies that last bit of the behavior. I think I didn't really understand that dynamic, but now it's 100% clear. Thank you so much :). Okay so now the main trouble:
I have one of each of those. The same configuration that I pasted in the first post (with the adjusted scale-up trigger duration):
Autoscaler:
RunnerDeployment:
I tested those on a brand new cluster, same behavior. |
-
My apologies for a poor title and if this discussion has already been raised (I didn't find a similar topic).
I've set up the whole thing on Google Cloud Platform using Terraform, and everything works great in the standard setup (using `workflowJob` events to scale up runners). The runners scale up perfectly.
Now, I'm trying to run those runners in a separate node pool (let's call it `NP-2`) from the "system" node pool (`NP-1`) that hosts the autoscaler, webhook server, etc. The idea is that this "runners" node pool would run on Spot VMs, allowing me to save cost. At the same time, it makes sense not to run the "system" pods on Spot VMs, because Spot VMs are not reliable.
After doing this setup and testing it, it seems like the autoscaler (which is in `NP-1`) cannot scale up those runners (in that different node pool, `NP-2`).
I thought it would be fine because they belong to the same namespace, but that assumption doesn't seem to be correct.
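(For context, here's roughly how I'm pinning the runners to `NP-2`; a sketch with hypothetical names, assuming the pool carries GKE's `cloud.google.com/gke-spot` label and a taint I added when creating it:)

```yaml
# RunnerDeployment fragment (sketch): scheduling constraints that steer
# runner pods onto the Spot VM node pool.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: spot-runners              # hypothetical
spec:
  template:
    spec:
      repository: my-org/my-repo  # hypothetical
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # GKE label on Spot VM nodes
      tolerations:                          # only needed if NP-2 is tainted
        - key: dedicated
          value: runners
          effect: NoSchedule
```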
I'm not asking for a "solution" per se; I'd just like to understand what's going on and why the autoscaler is not able to "reach and control" runners from that separate node pool.
Here's the conf for the autoscaler:
Here's the conf for the runners:
Thanks in advance for the help and let me know if further details are needed.