Adjust controller to process one ingress at a time. Use per-item rate limiting workqueue. #329
Conversation
@munnerz: GitHub didn't allow me to request PR reviews from the following users: jsha. Note that only jetstack members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
}
kl.workQueue.AddRateLimited(key)
}
return nil
This function is only used by the renewal ticker in order to trigger a resync of ingress resources.
Previously, it resynced all resources at once, with no rate limit applied.
This changes that to add each item via the rate-limited queue. For items that are already failing, this will bump their backoff slightly, but that is probably not an issue.
For the items that are currently not 'known' by the rate limiter (i.e. succeeding ingresses), they will be checked one by one after a delay of 10 minutes, which is also acceptable imo.
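For context, a minimal sketch of what that resync looks like under this scheme (not the literal kube-lego code; the key slice and its namespace/name format are assumptions):

```go
package kubelego

import (
	"k8s.io/client-go/util/workqueue"
)

// requeueAll sketches the renewal-ticker resync described above: each ingress
// key is re-added through the rate-limited queue, so ingresses that are
// already failing keep their existing backoff instead of everything being
// resynced at once.
func requeueAll(queue workqueue.RateLimitingInterface, keys []string) {
	for _, key := range keys {
		// AddRateLimited only hands the key back to a worker once the
		// per-item rate limiter allows it.
		queue.AddRateLimited(key)
	}
}
```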
e2e tests passed after a few more commits 😃
This looks good as far as I understand it. :-) Thanks for implementing!
kl.Log().Debugf("worker: done processing %v", item)
kl.workQueue.Done(item)
func(item interface{}) {
defer kl.workQueue.Done(item)
I assume it's okay to call `workQueue.Forget(item)` and then `workQueue.Done(item)`? Similarly, if you call `workQueue.AddRateLimited(key)`, then `workQueue.Done(item)`, the workQueue will not actually forget the item and its ongoing backoff status?
Yep - they serve two different purposes:

- `Done` will inform the queue that the particular work item has finished processing, as the workqueue will stop the same item being processed by two workers at once (although we actually only process one certificate at a time anyway).
- `Forget` will forget the item altogether and clear its backoff status.

So 'Done' should be called after the function processing the result of Get() is done (regardless of err/success).
'Forget' should be called once the processing has 'succeeded' and we want to clear the rate limit state for that particular item.
Calling AddRateLimited along with Done will therefore not clear the rate limit. It will mark the queue as finished processing that particular item, and also queue the item to be re-added to the queue after the rate limit is up.
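For reference, here is a minimal sketch of that contract, assuming the stock client-go workqueue package; `process` stands in for whatever handles a single ingress key (it is not kube-lego's actual worker):

```go
package kubelego

import (
	"k8s.io/client-go/util/workqueue"
)

// runWorker sketches the Done/Forget/AddRateLimited contract described above.
func runWorker(queue workqueue.RateLimitingInterface, process func(item interface{}) error) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		func(item interface{}) {
			// Done is always called: it tells the queue this worker has
			// finished with the item, but leaves the rate limiter untouched.
			defer queue.Done(item)

			if err := process(item); err != nil {
				// Keep the per-item backoff and schedule a retry once the
				// rate limiter allows it.
				queue.AddRateLimited(item)
				return
			}
			// Success: clear this item's backoff/rate-limit state.
			queue.Forget(item)
		}(item)
	}
}
```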
Looks like you have to call both:
// Forget indicates that an item is finished being retried. Doesn't matter whether its for perm failing
// or for success, we'll stop the rate limiter from tracking it. This only clears the `rateLimiter`, you
// still have to call `Done` on the queue.
Forget(item interface{})
@wallrj 👍 we do call both - Forget gets called after the item has been successfully processed (or when it can no longer be processed, e.g. due to deletion)
Thanks @munnerz
It looks like you've tried to reduce the API request rate from 4 angles:
- Only perform certificate requests / validation when the ingress spec changes.
- Only perform validation for the ingress object that has changed.
- Backoff when retrying failed validations.
- Reduce the resync period of the informer.
All of those sound good, but in the absence of tests and in the interest of not accidentally breaking things, I wonder if any one of those changes would have been sufficient.
I spotted some things and left some comments.
Please address those.
}
} else {
if o.exists {
if o.Exists {
err = o.client().Delete(o.IngressApi.Namespace, &k8sMeta.DeleteOptions{})
This seems to be deleting a namespace rather than the ingress. Is that right?
Ignore me if this is unimportant / unrelated.
Unrelated - I am not looking to change how validation happens 😄
pkg/kubelego/configure.go
Outdated
if providerName == ing.IngressProvider() {
err = provider.Process(ing)
if err != nil {
provider.Log().Error(err)
Should we return this `err` before overwriting it below? Or must `Finalize` always be run?
Again, not looking to change behaviour. These lines changed due to indentation changes.
// normify tls config
tlsSlice = kl.TlsIgnoreDuplicatedSecrets(tlsSlice)
// NOTE: this no longer performs a global deduplication
tlsSlice := kl.TlsIgnoreDuplicatedSecrets(ing.Tls())
What are the consequences of not doing global de-duplication?
Discussed with @simonswine and we think this will be okay - see #298 for more info
for _, ing := range ingressesAll {
if ing.Ignore() {
continue
}
I don't see this `ing.Ignore()` check in the new code below. Is it important?
👍
pkg/kubelego/configure.go
Outdated
@@ -107,23 +93,11 @@ func (kl *KubeLego) reconfigure(ingressesAll []kubelego.Ingress) error {
errsStr = append(errsStr, fmt.Sprintf("%s", err))
}
kl.Log().Error("Error while processing certificate requests: ", strings.Join(errsStr, ", "))

// request a rerun of reconfigure
kl.workQueue.Add(true)
So we no longer re-queue when there are errors?
Don't we need to return an aggregate error here so that the certificate requests can be retried?
Perhaps this is done somewhere below... /me reads on.
👍
@@ -70,15 +116,32 @@ func (kl *KubeLego) WatchEvents() {
return
}
kl.Log().Debugf("CREATE ingress/%s/%s", addIng.Namespace, addIng.Name)
kl.workQueue.Add(true)
Ok I get it. So previously we added a bool to the queue!? which meant that all certificates were re-validated every time?
Yes spot on 😬
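In other words, where the old handlers did `kl.workQueue.Add(true)`, the new ones enqueue only the affected ingress. A rough sketch of that pattern (the use of `cache.MetaNamespaceKeyFunc` to derive the namespace/name key is an assumption, not necessarily exactly how this PR does it):

```go
package kubelego

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// enqueueIngress sketches the new per-ingress behaviour: derive a
// namespace/name key for the changed object and rate-limit-add only that key,
// instead of the old Add(true) which forced a full reconfigure of everything.
func enqueueIngress(queue workqueue.RateLimitingInterface, obj interface{}) error {
	key, err := cache.MetaNamespaceKeyFunc(obj)
	if err != nil {
		return err
	}
	queue.AddRateLimited(key)
	return nil
}
```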
kl.Log().Infof("Detected deleted ingress %q - skipping", key)
// skip processing deleted items, as there is no reason to due to
// the way kube-lego serialises authorization attempts
// kl.workQueue.AddRateLimited(key)
Should we call `kl.workQueue.Forget(key)` here?
👍
pkg/kubelego/watch.go
Outdated
@@ -88,13 +151,24 @@ func (kl *KubeLego) WatchEvents() {
oldIng.ResourceVersion = ""
upIng.ResourceVersion = ""
These can be removed now.
pkg/kubelego/watch.go
Outdated
@@ -88,13 +151,24 @@ func (kl *KubeLego) WatchEvents() {
oldIng.ResourceVersion = ""
upIng.ResourceVersion = ""

if !reflect.DeepEqual(oldIng, upIng) {
This would have played a large part in the flood of API requests, I guess, if the status and metadata were changing often.
Yes exactly! 😬
pkg/kubelego/watch.go
Outdated
// we requeue ingresses only when their spec has changed, as this indicates
// a user has updated the specification of their ingress and as such we should
// re-trigger a validation if required.
if !reflect.DeepEqual(oldIng.Spec, upIng.Spec) {
Are annotations / labels on the ingress object examined by Kube-lego? Should we compare those also?
Good point - if a user switches the `tls-acme` annotation from `false` to `true`, we won't notice it right now.
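Purely as an illustration of what comparing annotations as well could look like (the helper name and import path are assumptions, not part of this PR):

```go
package kubelego

import (
	"reflect"

	extensions "k8s.io/api/extensions/v1beta1"
)

// ingressNeedsRequeue illustrates the comparison discussed above: requeue when
// either the spec or the annotations change, so that flipping an annotation
// such as kubernetes.io/tls-acme from "false" to "true" is picked up too.
func ingressNeedsRequeue(oldIng, upIng *extensions.Ingress) bool {
	return !reflect.DeepEqual(oldIng.Spec, upIng.Spec) ||
		!reflect.DeepEqual(oldIng.Annotations, upIng.Annotations)
}
```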
Hi! Just checking in: Any ETA on responding to the above review and potentially merging & releasing? Thanks!
Excellent, thanks for the update!
Looks good to me @munnerz
Merge at will!
for providerName, provider := range kl.legoIngressProvider {
err := provider.Reset()
if err != nil {
provider.Log().Error(err)
continue
errs = append(errs, err)
❓ Should we keep the `continue`? We previously skipped the logic below when reset fails.
Hi! Friendly post-KubeCon ping? This is still causing a lot of issues for us.
Ping? :-)
Thanks very much!
Among other things, the changes since 0.1.3 include rate limiting to avoid hitting Let's Encrypt servers too often. See:
- jetstack/kube-lego#329
- jetstack/kube-lego@0.1.3...0.1.6
This is in relation to cert-manager/cert-manager#407
From that issue:
This PR changes the usage of the workqueue to only process one ingress at a time. It also switches the workqueue to use the rate limiting interface instead of a plain workqueue. This allows us to exponentially backoff validation attempts on a per ingress basis.
NOTE: I have not run e2e tests against this patch yet - I will update this PR with results once I have. /cc @jsha @simonswine
ref #328