-
Notifications
You must be signed in to change notification settings - Fork 54
SPIKE: Operator to evict tainted pods - issue 18 #47
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a great start, thanks for spiking on this. We may find other uses for the operator down the road as well. I left a few suggestions.
cmd/evicter/controller.go
Outdated
|
||
const ( | ||
annotationPreventEviction = "k-rails/tainted-prevent-eviction" | ||
annotationTimestamp = "k-rails/tainted-timestamp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might want to add an eviction reason annotation too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What workflows other than #9 do you have in mind for the tainted
flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing particular in mind right now, but there may be some use cases in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A usecase came to mind: An image's package or dependency scan results contains CVE-2019-XXXXX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a good scenario to discuss. If we flag it as tainted, we could have an 👀 on it in monitoring. Redeploying the pod by the operator would not fix the vulnerability, though just kills transient state which is not a bad thing.
This reminds me that this spike does not remove the label + annotations before calling evict.... will add to the not impl list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as the label + annotation is added only to the Pod, I think it should be cleared after the Pod is evicted and a new Pod replaces it.
Should the taint reason should be a label as well, so it's easy to filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "reason" as label can create a lot of noise in upstream systems like monitoring. Although reasons are currently very limited, it may be extended by somebody to any kind of text in the future. If you are think about scaling the operator we can run multiple and have some sharding.
How do you want to proceed with this task? Go this way with an operator or try something else? In terms of priorities I assume Issue #9 should be implemented first. |
I think you've shown that the operator approach is a fine way to go and worth continuing on. From the user perspective, it's an implementation detail and deployment will be just as easy as it was before. I don't order is too important for #9, but it should be a quick and easy policy. There is already a PodExec resource extractor used by the No Exec policy. |
I had a standalone operator in mind to separate concerns. Which means different binary, docker file and deployment. Also a separate service account to not mix permissions. This adds some complexity to the system and build process, though. |
I think standalone is the right choice, I just meant that the included helm chart could roll it out as a separate Deployment in the namespace. If we do that, the installation won't be any more difficult. |
c57f770
to
e7c0cac
Compare
e7c0cac
to
87f2bf8
Compare
Closing this in favour of #54 |
🚧 SPIKE - do not merge
See #18 for discussions.
This is a demo of a standalone k8s operator that runs within the cluster to find and evict tainted pods.
The code is based on the workqueue example.
The workqueue design comes with some learning curve but once you get used to it it is very nice. I kept most comments and functions from the original example.
A good start to read the code would be
controller.evictPod
and thenmain.go
.Disclaimer: It compiles but I have not tested or run it locally.Update:
I have started to implement this, now:
Todo:
default
(all)Happens automatically on restart