[Threat Modelling] Ensure retry-logic applied for exceptional situations #349

tobiscr · 2024-08-20T14:47:32Z

Description

As a result of the Threat Modelling workshop in Aug 2024, we identified a reliably applied retry and alerting mechanism as crucial to prevent KIM from ending up in inconsistent states.

KIM's logic has to ensure:

In case of exceptions (e.g. caused by illegal replies from Gardener, failed K8s API calls, or any other recoverable error), the failed step in the FSM is retried with exponential backoff between tries.
If the maximum number of retries is reached, monitoring has to detect this unexpected situation and fire an alert to the on-call and the development team.

AC:

Define a concept which ensures that any state change in the FSM that causes an error triggers a retry with exponential backoff
- Share the concept in the team and find a common agreement (ADR?)
Any non-recoverable situation (e.g. when the maximum number of retries is reached) has to trigger an alert which notifies the on-call and development team about an illegal/defective state in the FSM

Depends on
#113

m00g3n · 2024-09-02T15:05:31Z

Define a concept which ensures that any state change in the FSM that causes an error triggers a retry with exponential backoff

The FSM is just a convenient way to organise the code. The kyma-infrastructure-manager controller uses the Kubebuilder SDK, and we handle retries:

either by requeue with rate limiting (this happens when we return an error or an instant response from the reconciliation loop),
or by a timed manual requeue of a failed request.

Any non-recoverable situation (e.g. when the maximum number of retries is reached) has to trigger an alert which notifies the on-call and development team about an illegal/defective state in the FSM

The Kubebuilder SDK exposes basic metrics (provided by controller-runtime) that can be used to trigger the alerts. If we find the basic metrics insufficient, we can extend them.

tobiscr · 2024-09-04T13:24:16Z

Depends on #113 for establishing KPIs and alerting

tobiscr mentioned this issue Aug 29, 2024

Making the processing more resilient #356

Closed

tobiscr assigned m00g3n Sep 2, 2024

tobiscr assigned tobiscr and unassigned m00g3n Sep 4, 2024

tobiscr removed their assignment Sep 4, 2024
