[Threat Modelling] Ensure retry-logic applied for exceptional situations #349

Open
3 tasks
tobiscr opened this issue Aug 20, 2024 · 2 comments

Comments

@tobiscr
Contributor

tobiscr commented Aug 20, 2024

Description

As a result of the Threat Modelling workshop in Aug 2024, we identified a reliably applied retry and alerting mechanism as crucial to prevent KIM from ending up in inconsistent states.

KIM's logic has to ensure:

  • In case of exceptions (e.g. illegal replies from Gardener, failed K8s API calls, or any other recoverable error), the failed step in the FSM is retried with exponential backoff between attempts.
  • If the maximum number of retries is reached, monitoring must detect this unexpected situation and fire an alert to the on-call and the development team.

AC:

  • Define a concept which ensures that any state change in the FSM that causes an error triggers a retry with exponential backoff
    • Share the concept with the team and find a common agreement (ADR?)
  • Any non-recoverable situation (e.g. when the maximum number of retries is reached) has to trigger an alert which notifies the on-call and development team about an illegal/defective state in the FSM

Depends on
#113

@m00g3n
Contributor

m00g3n commented Sep 2, 2024

Define a concept which ensures that any state change in the FSM which causes an error will trigger a retry with exponential backoff

The FSM is just a convenient way to organise the code. The kyma-infrastructure-manager controller uses the Kubebuilder SDK, and we handle retries:

  • either by requeueing with rate limiting (this happens when we return an error or an instant requeue response from the reconciliation loop)
  • or by a timed manual requeue of the failed request.

Any non-recoverable situation (e.g. when max amount of retries is reached) has to trigger an alert which notifies the on-call and development team about an illegal/defective state in FSM

The Kubebuilder SDK exposes basic metrics (provided by controller-runtime) that can be used to trigger the alerts. If we find the basic metrics insufficient, we can extend them.

@tobiscr tobiscr assigned tobiscr and unassigned m00g3n Sep 4, 2024
@tobiscr
Contributor Author

tobiscr commented Sep 4, 2024

Depends on #113 for establishing KPIs and alerting

@tobiscr tobiscr removed their assignment Sep 4, 2024