Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] A prototype implementation to improve startup performance #3821

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

M00nF1sh
Copy link
Collaborator

@M00nF1sh M00nF1sh commented Aug 24, 2024

Issue

#3326

Description

DON'T MERGE THIS
Built image for test: m00nf1sh/aws-load-balancer-controller:v2.8.3-checkpoint

Intro

This is a prototype implementation to improve performance during controller startup. The root cause is during restart(e.g. leadership change or hardware issues on node), the controller will reconcile all existing Ingress and Services in cluster. In some large clusters with a lot Ingress/Services, this could take a long time due to AWS API throttles, thus impacting the ability to handle other events(e.g. pod deployments).

Design

The idea is to save the "last reconciled state" as annotations into Ingress and Service objects, (potentially TargetGroupBindings) as well. So during controller restart, it can compare the current state vs "last reconciled state"(from annotation), and skip reconcile on already reconciled resources.
With this implementation, the "last reconciled state" is computed as sha256 of the ELBv2 JSON model built for Ingress and Services. For TargetGroupBindings, it can be the list of current backend targets(TODO, need some refactor).

Alternative design considered:

  1. alternatively, the "last reconciled state" can be the Ingress & Service's annotation & spec itself. I originally tried this approach but later realized that Ingress's configuration(ALB) can be impacted by Service and even Secrets as well(e.g. OIDC). So if we only record Ingress itself's annotation & spec as "last reconciled state", then we won't be able to trigger updates when Service/Secrets were changed(e.g. during the gap when controller restart)

Next Steps

  1. refactor the Ingress&Service controller's code to make the code more cleaner.
  2. The Marshaled JSON don't contain OIDC clientID/secret, so to trigger Ingress updates for OIDC secret changes, we need to include them in the digest somehow.
  3. add checkpoint support for targetGroupBinding
  4. [best to have] inject a custom work queue implementation into controllers where we can still enqueue those unchanged Ingress & Service during restart, but reconcile them at a lower priority and limited enqueue rate limit.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 24, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: M00nF1sh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants