This repository has been archived by the owner on Aug 26, 2021. It is now read-only.

Adjust controller to process one ingress at a time. Use per-item rate limiting workqueue. #329

Merged
merged 5 commits on May 8, 2018

Conversation

munnerz
Contributor

@munnerz munnerz commented Apr 24, 2018

This is in relation to cert-manager/cert-manager#407

From that issue:

Based on the code in WatchEvents (https://github.com/jetstack/kube-lego/blob/master/pkg/kubelego/watch.go#L73) - it appears that whenever any Kubernetes Ingress resource is created, updated, or deleted, all Ingress resources are immediately scheduled for re-processing.

Ingresses that already have a valid certificate will be skipped, but any user with a number of failing/invalid ingresses will make requests to LE APIs in an attempt to validate those ingresses.

As part of those syncs, more updates will likely be made to ingresses, thus re-queuing these ingresses to be immediately reprocessed after the 'round' of processing ingresses fails.

The good news is we do only process one Ingress resource at a time, which should reduce the hits to the API somewhat (this could be a lot worse otherwise).

This PR changes the usage of the workqueue to only process one ingress at a time. It also switches the workqueue to use the rate limiting interface instead of a plain workqueue. This allows us to exponentially back off validation attempts on a per-ingress basis.
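For illustration, a minimal sketch of this set-up using the client-go workqueue package (not the actual kube-lego code; the key string is a made-up example):

	package main

	import (
		"fmt"

		"k8s.io/client-go/util/workqueue"
	)

	func main() {
		// Rate-limited queue: AddRateLimited applies per-item exponential backoff,
		// and Forget clears that backoff state once an item succeeds.
		queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
		defer queue.ShutDown()

		// Enqueue individual ingress keys ("namespace/name") rather than a single
		// global "resync everything" signal.
		queue.Add("default/my-ingress")

		// A single worker drains the queue, so only one ingress is processed at a time.
		key, _ := queue.Get()
		fmt.Println("processing", key)
		queue.Forget(key) // success: clear any backoff for this key
		queue.Done(key)   // mark the item as finished
	}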

NOTE: I have not run e2e tests against this patch yet - I will update this PR with results once I have

/cc @jsha @simonswine

ref #328

@jetstack-bot
Collaborator

@munnerz: GitHub didn't allow me to request PR reviews from the following users: jsha.

Note that only jetstack members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}
kl.workQueue.AddRateLimited(key)
}
return nil
Contributor Author

This function is only used by the renewal ticker in order to trigger a resync of ingress resources.

Previously, it resynced all resources at once, with no kind of rate limit applied.

This changes that to add each ingress via the rate-limited queue instead. For items that are already failing, this will increment their rate limit slightly, however this is probably not an issue.

Items that are currently not 'known' by the rate limiter (i.e. succeeding ingresses) will be checked one by one after a delay of 10 minutes, which is also acceptable imo.
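Roughly, the behaviour described above amounts to the following sketch (requeueAll and the keys argument are illustrative names, not kube-lego identifiers):

	package sketch

	import "k8s.io/client-go/util/workqueue"

	// requeueAll re-adds every known ingress key through the rate limiter instead
	// of forcing an immediate global resync. Keys that are already failing keep
	// (and slightly extend) their backoff; keys the rate limiter has not seen yet
	// are retried after the base delay.
	func requeueAll(queue workqueue.RateLimitingInterface, keys []string) {
		for _, key := range keys {
			queue.AddRateLimited(key)
		}
	}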

@munnerz
Contributor Author

munnerz commented Apr 24, 2018

e2e tests passed after a few more commits 😃

@jsha jsha left a comment

This looks good as far as I understand it. :-) Thanks for implementing!

kl.Log().Debugf("worker: done processing %v", item)
kl.workQueue.Done(item)
func(item interface{}) {
defer kl.workQueue.Done(item)

I assume it's okay to call workQueue.Forget(item) and then workQueue.Done(item)? Similarly, if you call workQueue.AddRateLimited(key), then workQueue.Done(item), the workQueue will not actually forget the item and its ongoing backoff status?

Contributor Author

Yep - they serve two different purposes:

  • Done will inform the queue that the particular work item has finished processing, since the workqueue prevents the same item from being processed by two workers at once (although we actually only process one certificate at a time anyway).

  • Forget will forget the item altogether and clear its backoff status.

So 'Done' should be called after the function processing the result of Get() has finished (regardless of error/success).

'Forget' should be called once the processing has 'succeeded' and we want to clear the rate limit state for that particular item.

Calling AddRateLimited along with Done will therefore not clear the rate limit. It marks that particular item as finished in the queue, and also schedules the item to be re-added to the queue once its rate limit delay is up.
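A hedged sketch of that ordering, with syncIngress standing in for the real processing function:

	package sketch

	import "k8s.io/client-go/util/workqueue"

	// syncIngress is a hypothetical stand-in for kube-lego's per-ingress processing.
	func syncIngress(key string) error { return nil }

	// worker drains the queue one item at a time, applying the Done/Forget rules
	// discussed above.
	func worker(queue workqueue.RateLimitingInterface) {
		for {
			item, shutdown := queue.Get()
			if shutdown {
				return
			}
			key := item.(string)
			if err := syncIngress(key); err != nil {
				// Failure: re-add with backoff; the rate limit state is kept.
				queue.AddRateLimited(key)
			} else {
				// Success: clear the per-item backoff state.
				queue.Forget(key)
			}
			// Always mark the item as finished, regardless of the outcome above.
			queue.Done(key)
		}
	}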

Member

Looks like you have to call both:

	// Forget indicates that an item is finished being retried.  Doesn't matter whether its for perm failing
	// or for success, we'll stop the rate limiter from tracking it.  This only clears the `rateLimiter`, you
	// still have to call `Done` on the queue.
	Forget(item interface{})

Contributor Author

@wallrj 👍 we do call both - Forget gets called after the item has been successfully processed (or when it can no longer be processed, e.g. due to deletion)

Member

@wallrj wallrj left a comment

Thanks @munnerz

It looks like you've tried to reduce the API request rate from 4 angles:

  1. Only perform certificate requests / validation when the ingress spec changes.
  2. Only perform validation for the ingress object that has changed.
  3. Backoff when retrying failed validations.
  4. Reduce the resync period of the informer.

All of those sound good, but in the absence of tests and in the interest of not accidentally breaking things, I wonder if any one of those changes would have been sufficient.

I spotted some things and left some comments.
Please address those.

}
} else {
if o.exists {
if o.Exists {
err = o.client().Delete(o.IngressApi.Namespace, &k8sMeta.DeleteOptions{})
Member

This seems to be deleting a namespace rather than the ingress. Is that right?
Ignore me if this is unimportant / unrelated.

Contributor Author

Unrelated - I am not looking to change how validation happens 😄

if providerName == ing.IngressProvider() {
err = provider.Process(ing)
if err != nil {
provider.Log().Error(err)
Member

Should we return this err before overwriting it below? Or must Finalize always be run?

Contributor Author

Again, not looking to change behaviour. These lines changed due to indentation changes.


// normify tls config
tlsSlice = kl.TlsIgnoreDuplicatedSecrets(tlsSlice)
// NOTE: this no longer performs a global deduplication
tlsSlice := kl.TlsIgnoreDuplicatedSecrets(ing.Tls())
Member

What are the consequences of not doing global de-duplication?

Contributor Author

Discussed with @simonswine and we think this will be okay - see #298 for more info

for _, ing := range ingressesAll {
if ing.Ignore() {
continue
}
Member

I don't see this ing.Ignore() check in the new code below. Is it important?

Contributor Author

👍

@@ -107,23 +93,11 @@ func (kl *KubeLego) reconfigure(ingressesAll []kubelego.Ingress) error {
errsStr = append(errsStr, fmt.Sprintf("%s", err))
}
kl.Log().Error("Error while processing certificate requests: ", strings.Join(errsStr, ", "))

// request a rerun of reconfigure
kl.workQueue.Add(true)
Member

So we no longer re-queue when there are errors?
Don't we need to return an aggregate error here so that the certificate requests can be retried?
Perhaps this is done somewhere below..../me reads on....

Contributor Author

👍
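For illustration, the pattern being discussed might look like the sketch below; the Provider interface and finalizeProviders are stand-ins, not kube-lego's actual types:

	package sketch

	import utilerrors "k8s.io/apimachinery/pkg/util/errors"

	// Provider is a stand-in for kube-lego's ingress provider interface.
	type Provider interface {
		Finalize() error
	}

	// finalizeProviders collects the per-provider errors and returns them as one
	// aggregate; a non-nil return lets the caller requeue the ingress key with
	// AddRateLimited instead of re-queuing a global rerun.
	func finalizeProviders(providers []Provider) error {
		var errs []error
		for _, p := range providers {
			if err := p.Finalize(); err != nil {
				errs = append(errs, err)
			}
		}
		return utilerrors.NewAggregate(errs) // nil when errs is empty
	}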

@@ -70,15 +116,32 @@ func (kl *KubeLego) WatchEvents() {
return
}
kl.Log().Debugf("CREATE ingress/%s/%s", addIng.Namespace, addIng.Name)
kl.workQueue.Add(true)
Member

Ok I get it. So previously we added a bool to the queue!? which meant that all certificates were re-validated every time?

Contributor Author

Yes spot on 😬

kl.Log().Infof("Detected deleted ingress %q - skipping", key)
// skip processing deleted items, as there is no reason to due to
// the way kube-lego serialises authorization attempts
// kl.workQueue.AddRateLimited(key)
Member

Should we call kl.workQueue.Forget(key) here?

Contributor Author

👍
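A small sketch of the suggestion, with an illustrative function name rather than the actual handler:

	package sketch

	import (
		"k8s.io/client-go/tools/cache"
		"k8s.io/client-go/util/workqueue"
	)

	// onIngressDelete shows the idea behind the suggestion above: a deleted
	// ingress will never be processed again, so clear any backoff state the rate
	// limiter still holds for its key.
	func onIngressDelete(queue workqueue.RateLimitingInterface, obj interface{}) {
		key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
		if err != nil {
			return
		}
		queue.Forget(key)
	}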

@@ -88,13 +151,24 @@ func (kl *KubeLego) WatchEvents() {
oldIng.ResourceVersion = ""
upIng.ResourceVersion = ""
Member

These can be removed now.

@@ -88,13 +151,24 @@ func (kl *KubeLego) WatchEvents() {
oldIng.ResourceVersion = ""
upIng.ResourceVersion = ""

if !reflect.DeepEqual(oldIng, upIng) {
Member

This would have played a large part in the flood of API requests, I guess, if the status and meta were changing often.

Contributor Author

Yes exactly! 😬

// we requeue ingresses only when their spec has changed, as this indicates
// a user has updated the specification of their ingress and as such we should
// re-trigger a validation if required.
if !reflect.DeepEqual(oldIng.Spec, upIng.Spec) {
Member

Are annotations / labels on the ingress object examined by Kube-lego? Should we compare those also?

Contributor Author

Good point - if a user switches the tls-acme annotation from false to true, we won't notice it right now.
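A hedged sketch of what comparing annotations as well would look like; the handler name and the extensions/v1beta1 import are assumptions, not the merged code:

	package sketch

	import (
		"reflect"

		extv1beta1 "k8s.io/api/extensions/v1beta1"
		"k8s.io/client-go/tools/cache"
		"k8s.io/client-go/util/workqueue"
	)

	// onIngressUpdate requeues an ingress only when its spec or annotations change,
	// so status-only updates written by kube-lego itself do not feed back into the
	// queue.
	func onIngressUpdate(queue workqueue.RateLimitingInterface, oldObj, newObj interface{}) {
		oldIng, ok1 := oldObj.(*extv1beta1.Ingress)
		newIng, ok2 := newObj.(*extv1beta1.Ingress)
		if !ok1 || !ok2 {
			return
		}
		if reflect.DeepEqual(oldIng.Spec, newIng.Spec) &&
			reflect.DeepEqual(oldIng.Annotations, newIng.Annotations) {
			return // nothing the controller cares about has changed
		}
		key, err := cache.MetaNamespaceKeyFunc(newObj)
		if err != nil {
			return
		}
		queue.AddRateLimited(key)
	}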

@jsha

jsha commented Apr 27, 2018

Hi! Just checking in: Any ETA on responding to the above review and potentially merging & releasing? Thanks!

@munnerz
Contributor Author

munnerz commented May 1, 2018

Hey @jsha - I've addressed the review comments. We're all at KubeCon this week so quite busy, but I will talk to @wallrj and try and get the latest commit re-reviewed.

As soon as this is merged, I'll then cut a new release of kube-lego.

@jsha

jsha commented May 1, 2018

Excellent, thanks for the update!

Member

@wallrj wallrj left a comment

Looks good to me @munnerz

Merge at will!

for providerName, provider := range kl.legoIngressProvider {
err := provider.Reset()
if err != nil {
provider.Log().Error(err)
continue
errs = append(errs, err)
Member

❓ Should we keep the continue? We previously skipped the logic below when Reset() fails.

@jsha

jsha commented May 7, 2018

Hi! Friendly post-KubeCon ping? This is still causing a lot of issues for us.

@jsha

jsha commented May 8, 2018

Ping? :-)

@jsha

jsha commented May 9, 2018

Thanks very much!

gobengo pushed a commit to gobengo/helm-gitlab-omnibus that referenced this pull request Jun 21, 2018
Among other things, the changes since 0.1.3 include rate limiting to
avoid hitting Let's Encrypt servers too often.

See:
- jetstack/kube-lego#329
- jetstack/kube-lego@0.1.3...0.1.6
5 participants