-
@sadjy Hey! To be clear, ARC doesn't know about the nodes and node pools of the cluster. All it does is eventually create a vanilla K8s pod or StatefulSet so that your cluster's control plane (the kube-scheduler) places the pods onto available nodes. If you have a cluster-autoscaler installed in your cluster, it might find a pod stuck as Unschedulable and add node(s) so that the unschedulable pods can be scheduled onto the newly created nodes. All of this happens outside of ARC; ARC knows nothing about the nodes. Do you see any Unschedulable or Pending runner pods? If so, there might be a cluster issue preventing pods from starting. This is probably unrelated to your issue, but anyway: I see you have configured the scale-up trigger duration to only …
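(For illustration, this is roughly the kind of pod ARC ends up creating; a minimal sketch with hypothetical names, not the exact manifest ARC generates:)

```yaml
# Sketch of an ARC-created runner pod, as the scheduler sees it.
# Pod name and namespace are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: my-runners-abcde          # ARC generates a name like this
  namespace: actions-runner-system
spec:
  containers:
    - name: runner
      image: summerwind/actions-runner:latest
  # Note: no node awareness here. Unless you add nodeSelector/affinity
  # yourself, kube-scheduler (and cluster-autoscaler, for Unschedulable
  # pods) decides which node pool this lands on.
```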
-
Hi @mumoshu! Thank you for your response.
That's a good point. I set it to a more reasonable value of 30 minutes now.
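(For reference, this is the field I changed; a sketch of the relevant HRA fragment, assuming the `workflowJob` trigger from my setup rather than my exact config:)

```yaml
# HorizontalRunnerAutoscaler fragment (sketch): duration controls how long
# a webhook-triggered capacity increase is kept before it expires.
spec:
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"   # was much shorter before
```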
Okay that answers my question and clarifies the way ARC works. Thanks.
Yes, it turns out my cluster was not in a stable state. The thing is, I made a bunch of changes to multiple things at the same time, multiple times, using Terraform, and I think that messed up some of the critical GKE systems. I did have a few … After destroying everything and spinning it all back up again, I got the following: … However, if I set …
Additionally, and oddly enough, after those … Again, I know this isn't an issue so to speak, and I'm certain I'm missing something here. I'm just not quite sure what yet and can't figure out where to look.
-
Hey!
ARC emits this error when two or more HRA+RunnerDeployment pairs match the information in a `workflowJob` webhook event.
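To make that concrete, here's a hedged sketch of a setup that would trigger it; both HRAs (all names invented) point at RunnerDeployments that could match the same event:

```yaml
# Two HRAs whose RunnerDeployments serve the same repository. A single
# workflowJob webhook event then matches both pairs, so ARC can't pick
# one scale target and errors out.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: autoscaler-a              # hypothetical
spec:
  scaleTargetRef:
    name: runners-a               # serves my-org/my-repo
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: autoscaler-b              # hypothetical
spec:
  scaleTargetRef:
    name: runners-b               # also serves my-org/my-repo
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"
```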
I might be missing something, but this sentence describes how ARC is intended to work.
Assuming you're talking about the …
-
Hey @mumoshu!
Oh okay! I assumed ARC would notice that this runner is the "last" one and not let it terminate, but the following explanation (where you say that it's the runner that terminates itself after a job) makes me understand the dynamic:
Yes, got it, that clarifies that last bit of the behavior. I think I didn't really understand that dynamic, but now it's 100% clear. Thank you so much :). Okay so now the main trouble:
I have one of each of those. The same configuration that I pasted in the first post (with the adjusted scale-up trigger duration):
Autoscaler:
RunnerDeployment:
I tested those on a brand new cluster, same behavior. |
-
My apologies for a poor title and if this discussion has already been raised (I didn't find a similar topic).
I've set up the whole thing on Google Cloud Platform using Terraform, and everything works great in the standard setup (using `workflowJob` events to scale up runners). The runners scale up perfectly.
Now, I'm trying to run those runners in a separate node pool (let's call it `NP-2`) from the "system" node pool (`NP-1`) that hosts the autoscaler, webhook server, etc. The idea is that this "runners" node pool would run on Spot VMs, allowing me to save cost. At the same time, it makes sense not to run the "system" pods on Spot VMs, because Spot VMs are not reliable.
After doing this setup and testing it, it seems like the autoscaler (which is in `NP-1`) cannot scale up those runners (in that different node pool, `NP-2`).
I thought it would be fine because they belong to the same namespace, but that assumption doesn't seem to be correct.
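(For context, here's roughly how I'm pinning the runners to `NP-2`; a sketch with hypothetical names, assuming the pool carries GKE's `cloud.google.com/gke-spot` label and a taint I added when creating it:)

```yaml
# RunnerDeployment fragment (sketch): scheduling constraints that steer
# runner pods onto the Spot VM node pool.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: spot-runners              # hypothetical
spec:
  template:
    spec:
      repository: my-org/my-repo  # hypothetical
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # GKE label on Spot VM nodes
      tolerations:                          # only needed if NP-2 is tainted
        - key: dedicated
          value: runners
          effect: NoSchedule
```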
I'm not asking for a "solution" per se; I'd just like to understand what's going on and why the autoscaler is not able to "reach and control" runners from that separate node pool.
Here's the conf for the autoscaler:
Here's the conf for the runners:
Thanks in advance for the help and let me know if further details are needed.