This repository has been archived by the owner on Aug 26, 2021. It is now read-only.

Adjust controller to process one ingress at a time. Use per-item rate limiting workqueue. #329

Merged
merged 5 commits on May 8, 2018

Conversation

munnerz
Contributor

@munnerz munnerz commented Apr 24, 2018

This is in relation to cert-manager/cert-manager#407

From that issue:

Based on the code in WatchEvents (https://github.com/jetstack/kube-lego/blob/master/pkg/kubelego/watch.go#L73) - it appears that whenever any Kubernetes Ingress resource is created, updated, or deleted, all Ingress resources are immediately scheduled for re-processing.

Ingresses that already have a valid certificate will be skipped, but any user with a number of failing/invalid ingresses will make requests to LE APIs in an attempt to validate those ingresses.

As part of those syncs, more updates will likely be made to ingresses, thus re-queuing these ingresses to be immediately reprocessed after the 'round' of processing ingresses fails.

The good news is we do only process one Ingress resource at a time, which should reduce the hits to the API somewhat (this could be a lot worse otherwise).

This PR changes the usage of the workqueue to only process one ingress at a time. It also switches the workqueue to use the rate limiting interface instead of a plain workqueue. This allows us to exponentially back off validation attempts on a per-ingress basis.
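For illustration, a minimal sketch of this set-up using the client-go workqueue package (not the actual kube-lego code; the key string is a made-up example):

	package main

	import (
		"fmt"

		"k8s.io/client-go/util/workqueue"
	)

	func main() {
		// Rate-limited queue: AddRateLimited applies per-item exponential backoff,
		// and Forget clears that backoff state once an item succeeds.
		queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
		defer queue.ShutDown()

		// Enqueue individual ingress keys ("namespace/name") rather than a single
		// global "resync everything" signal.
		queue.Add("default/my-ingress")

		// A single worker drains the queue, so only one ingress is processed at a time.
		key, _ := queue.Get()
		fmt.Println("processing", key)
		queue.Forget(key) // success: clear any backoff for this key
		queue.Done(key)   // mark the item as finished
	}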

NOTE: I have not run e2e tests against this patch yet - I will update this PR with results once I have

/cc @jsha @simonswine

ref #328

@jetstack-bot
Collaborator

@munnerz: GitHub didn't allow me to request PR reviews from the following users: jsha.

Note that only jetstack members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}
kl.workQueue.AddRateLimited(key)
}
return nil
Contributor Author

This function is only used by the renewal ticker in order to trigger a resync of ingress resources.

Previously, it resynced all resources at once, with no kind of rate limit applied.

This changes that to add each ingress via the rate-limited queue instead. For items that are already failing, this will increment their rate limit slightly, however this is probably not an issue.

Items that are currently not 'known' by the rate limiter (i.e. succeeding ingresses) will be checked one by one after a delay of 10 minutes, which is also acceptable imo.
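Roughly, the behaviour described above amounts to the following sketch (requeueAll and the keys argument are illustrative names, not kube-lego identifiers):

	package sketch

	import "k8s.io/client-go/util/workqueue"

	// requeueAll re-adds every known ingress key through the rate limiter instead
	// of forcing an immediate global resync. Keys that are already failing keep
	// (and slightly extend) their backoff; keys the rate limiter has not seen yet
	// are retried after the base delay.
	func requeueAll(queue workqueue.RateLimitingInterface, keys []string) {
		for _, key := range keys {
			queue.AddRateLimited(key)
		}
	}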

@munnerz
Contributor Author

munnerz commented Apr 24, 2018

e2e tests passed after a few more commits 😃

@jsha jsha left a comment

This looks good as far as I understand it. :-) Thanks for implementing!

kl.Log().Debugf("worker: done processing %v", item)
kl.workQueue.Done(item)
func(item interface{}) {
defer kl.workQueue.Done(item)

I assume it's okay to call workQueue.Forget(item) and then workQueue.Done(item)? Similarly, if you call workQueue.AddRateLimited(key), then workQueue.Done(item), the workQueue will not actually forget the item and its ongoing backoff status?

Contributor Author

Yep - they serve two different purposes:

  • Done will inform the queue that the particular work item has finished processing, since the workqueue prevents the same item from being processed by two workers at once (although we actually only process one certificate at a time anyway).

  • Forget will forget the item altogether and clear its backoff status.

So 'Done' should be called after the function processing the result of Get() has finished (regardless of error/success).

'Forget' should be called once the processing has 'succeeded' and we want to clear the rate limit state for that particular item.

Calling AddRateLimited along with Done will therefore not clear the rate limit. It marks that particular item as finished in the queue, and also schedules the item to be re-added to the queue once its rate limit delay is up.
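A hedged sketch of that ordering, with syncIngress standing in for the real processing function:

	package sketch

	import "k8s.io/client-go/util/workqueue"

	// syncIngress is a hypothetical stand-in for kube-lego's per-ingress processing.
	func syncIngress(key string) error { return nil }

	// worker drains the queue one item at a time, applying the Done/Forget rules
	// discussed above.
	func worker(queue workqueue.RateLimitingInterface) {
		for {
			item, shutdown := queue.Get()
			if shutdown {
				return
			}
			key := item.(string)
			if err := syncIngress(key); err != nil {
				// Failure: re-add with backoff; the rate limit state is kept.
				queue.AddRateLimited(key)
			} else {
				// Success: clear the per-item backoff state.
				queue.Forget(key)
			}
			// Always mark the item as finished, regardless of the outcome above.
			queue.Done(key)
		}
	}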

Member

Looks like you have to call both:

	// Forget indicates that an item is finished being retried.  Doesn't matter whether its for perm failing
	// or for success, we'll stop the rate limiter from tracking it.  This only clears the `rateLimiter`, you
	// still have to call `Done` on the queue.
	Forget(item interface{})

Contributor Author

@wallrj 👍 we do call both - Forget gets called after the item has been successfully processed (or when it can no longer be processed, e.g. due to deletion)

Member

@wallrj wallrj left a comment

Thanks @munnerz

It looks like you've tried to reduce the API request rate from 4 angles:

  1. Only perform certificate requests / validation when the ingress spec changes.
  2. Only perform validation for the ingress object that has changed.
  3. Backoff when retrying failed validations.
  4. Reduce the resync period of the informer.

All of those sound good, but in the absence of tests and in the interest of not accidentally breaking things, I wonder if any one of those changes would have been sufficient.

I spotted some things and left some comments.
Please address those.

}
} else {
if o.exists {
if o.Exists {
err = o.client().Delete(o.IngressApi.Namespace, &k8sMeta.DeleteOptions{})
Member

This seems to be deleting a namespace rather than the ingress. Is that right?
Ignore me if this is unimportant / unrelated.

Contributor Author

Unrelated - I am not looking to change how validation happens 😄

if providerName == ing.IngressProvider() {
err = provider.Process(ing)
if err != nil {
provider.Log().Error(err)
Member

Should we return this err before overwriting it below? Or must Finalize always be run?

Contributor Author

Again, not looking to change behaviour. These lines changed due to indentation changes.


// normify tls config
tlsSlice = kl.TlsIgnoreDuplicatedSecrets(tlsSlice)
// NOTE: this no longer performs a global deduplication
tlsSlice := kl.TlsIgnoreDuplicatedSecrets(ing.Tls())
Member

What are the consequences of not doing global de-duplication?

Contributor Author

Discussed with @simonswine and we think this will be okay - see #298 for more info

for _, ing := range ingressesAll {
if ing.Ignore() {
continue
}
Member

I don't see this ing.Ignore() check in the new code below. Is it important?

Contributor Author

👍

@@ -107,23 +93,11 @@ func (kl *KubeLego) reconfigure(ingressesAll []kubelego.Ingress) error {
errsStr = append(errsStr, fmt.Sprintf("%s", err))
}
kl.Log().Error("Error while processing certificate requests: ", strings.Join(errsStr, ", "))

// request a rerun of reconfigure
kl.workQueue.Add(true)
Member

So we no longer re-queue when there are errors?
Don't we need to return an aggregate error here so that the certificate requests can be retried?
Perhaps this is done somewhere below..../me reads on....

Contributor Author

👍
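For illustration, the pattern being discussed might look like the sketch below; the Provider interface and finalizeProviders are stand-ins, not kube-lego's actual types:

	package sketch

	import utilerrors "k8s.io/apimachinery/pkg/util/errors"

	// Provider is a stand-in for kube-lego's ingress provider interface.
	type Provider interface {
		Finalize() error
	}

	// finalizeProviders collects the per-provider errors and returns them as one
	// aggregate; a non-nil return lets the caller requeue the ingress key with
	// AddRateLimited instead of re-queuing a global rerun.
	func finalizeProviders(providers []Provider) error {
		var errs []error
		for _, p := range providers {
			if err := p.Finalize(); err != nil {
				errs = append(errs, err)
			}
		}
		return utilerrors.NewAggregate(errs) // nil when errs is empty
	}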

@@ -70,15 +116,32 @@ func (kl *KubeLego) WatchEvents() {
return
}
kl.Log().Debugf("CREATE ingress/%s/%s", addIng.Namespace, addIng.Name)
kl.workQueue.Add(true)
Member

Ok I get it. So previously we added a bool to the queue!? which meant that all certificates were re-validated every time?

Contributor Author

Yes spot on 😬

kl.Log().Infof("Detected deleted ingress %q - skipping", key)
// skip processing deleted items, as there is no reason to due to
// the way kube-lego serialises authorization attempts
// kl.workQueue.AddRateLimited(key)
Member

Should we call kl.workQueue.Forget(key) here?

Contributor Author

👍
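A small sketch of the suggestion, with an illustrative function name rather than the actual handler:

	package sketch

	import (
		"k8s.io/client-go/tools/cache"
		"k8s.io/client-go/util/workqueue"
	)

	// onIngressDelete shows the idea behind the suggestion above: a deleted
	// ingress will never be processed again, so clear any backoff state the rate
	// limiter still holds for its key.
	func onIngressDelete(queue workqueue.RateLimitingInterface, obj interface{}) {
		key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
		if err != nil {
			return
		}
		queue.Forget(key)
	}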

@@ -88,13 +151,24 @@ func (kl *KubeLego) WatchEvents() {
oldIng.ResourceVersion = ""
upIng.ResourceVersion = ""
Member

These can be removed now.

@@ -88,13 +151,24 @@ func (kl *KubeLego) WatchEvents() {
oldIng.ResourceVersion = ""
upIng.ResourceVersion = ""

if !reflect.DeepEqual(oldIng, upIng) {
Member

This would have played a large part in the flood of API requests, I guess, if the status and meta were changing often.

Contributor Author

Yes exactly! 😬

// we requeue ingresses only when their spec has changed, as this indicates
// a user has updated the specification of their ingress and as such we should
// re-trigger a validation if required.
if !reflect.DeepEqual(oldIng.Spec, upIng.Spec) {
Member

Are annotations / labels on the ingress object examined by Kube-lego? Should we compare those also?

Contributor Author

Good point - if a user switches the tls-acme annotation from false to true, we won't notice it right now.
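A hedged sketch of what comparing annotations as well would look like; the handler name and the extensions/v1beta1 import are assumptions, not the merged code:

	package sketch

	import (
		"reflect"

		extv1beta1 "k8s.io/api/extensions/v1beta1"
		"k8s.io/client-go/tools/cache"
		"k8s.io/client-go/util/workqueue"
	)

	// onIngressUpdate requeues an ingress only when its spec or annotations change,
	// so status-only updates written by kube-lego itself do not feed back into the
	// queue.
	func onIngressUpdate(queue workqueue.RateLimitingInterface, oldObj, newObj interface{}) {
		oldIng, ok1 := oldObj.(*extv1beta1.Ingress)
		newIng, ok2 := newObj.(*extv1beta1.Ingress)
		if !ok1 || !ok2 {
			return
		}
		if reflect.DeepEqual(oldIng.Spec, newIng.Spec) &&
			reflect.DeepEqual(oldIng.Annotations, newIng.Annotations) {
			return // nothing the controller cares about has changed
		}
		key, err := cache.MetaNamespaceKeyFunc(newObj)
		if err != nil {
			return
		}
		queue.AddRateLimited(key)
	}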

@jsha

jsha commented Apr 27, 2018

Hi! Just checking in: Any ETA on responding to the above review and potentially merging & releasing? Thanks!

@munnerz
Contributor Author

munnerz commented May 1, 2018

Hey @jsha - I've addressed the review comments. We're all at KubeCon this week so quite busy, but I will talk to @wallrj and try and get the latest commit re-reviewed.

As soon as this is merged, I'll then cut a new release of kube-lego.

@jsha

jsha commented May 1, 2018

Excellent, thanks for the update!

Member

@wallrj wallrj left a comment

Looks good to me @munnerz

Merge at will!

for providerName, provider := range kl.legoIngressProvider {
err := provider.Reset()
if err != nil {
provider.Log().Error(err)
continue
errs = append(errs, err)
Member

❓ Should we keep the continue? We previously skipped the logic below when Reset() fails.

@jsha

jsha commented May 7, 2018

Hi! Friendly post-KubeCon ping? This is still causing a lot of issues for us.

@jsha

jsha commented May 8, 2018

Ping? :-)

@jsha

jsha commented May 9, 2018

Thanks very much!

gobengo pushed a commit to gobengo/helm-gitlab-omnibus that referenced this pull request Jun 21, 2018
Among other things, the changes since 0.1.3 include rate limiting to
avoid hitting Let's Encrypt servers too often.

See:
- jetstack/kube-lego#329
- jetstack/kube-lego@0.1.3...0.1.6
5 participants