As a result of the Threat Modelling workshop in Aug 2024, we identified a reliably applied retry and alerting mechanism as crucial to prevent KIM from ending up in inconsistent states.
KIM's logic has to ensure:
In case of exceptions (e.g. caused by illegal replies from Gardener, failed K8s API calls, or any other recoverable error), the failed step in the FSM is retried with exponential backoff between tries.
If the maximum number of retries is reached, monitoring must detect this unexpected situation and fire an alert to the on-call and the development team.
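To make the two requirements above concrete, here is a minimal, standard-library-only sketch of "retry with exponential backoff, then escalate". It is not KIM's implementation: `retryStep`, `backoffDelay`, `noSleep`, `errTransient` and the `onExhausted` hook are hypothetical names, and the hook only stands in for whatever ends up raising the alert.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient marks a recoverable failure in this sketch (hypothetical).
var errTransient = errors.New("transient failure")

// noSleep is a no-op sleeper, handy when the waiting itself is not of interest.
func noSleep(time.Duration) {}

// backoffDelay returns base * 2^attempt, capped at max.
func backoffDelay(base, max time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// retryStep runs step once and retries it up to maxRetries times (maxRetries >= 0),
// waiting backoffDelay between tries. When all tries fail it calls onExhausted,
// the place where an alert would be raised, and returns the wrapped last error.
func retryStep(step func() error, maxRetries int, base, max time.Duration,
	sleep func(time.Duration), onExhausted func(error)) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = step(); err == nil {
			return nil
		}
		if attempt < maxRetries {
			sleep(backoffDelay(base, max, attempt))
		}
	}
	onExhausted(err)
	return fmt.Errorf("step failed after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	step := func() error { // fails twice, then succeeds
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	err := retryStep(step, 5, 100*time.Millisecond, 2*time.Second, time.Sleep,
		func(e error) { fmt.Println("ALERT:", e) })
	fmt.Println("calls:", calls, "err:", err)
}
```

Passing the sleeper and the exhaustion hook as parameters keeps the retry policy testable without real waiting and leaves open how the alert is actually delivered.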
AC:
Define a concept which ensures that any state change in the FSM that causes an error triggers a retry with exponential backoff
Share the concept in the team and find a common agreement (ADR?)
Any non-recoverable situation (e.g. when the maximum number of retries is reached) has to trigger an alert which notifies the on-call and the development team about an illegal/defective state in the FSM
Define a concept which ensures that any state change in the FSM that causes an error triggers a retry with exponential backoff
The FSM is just a convenient way to organise the code. The kyma-infrastructure-manager controller uses the Kubebuilder SDK, and we handle retries:
either by a requeue with rate limiting (this happens when we return an error or request an immediate requeue from the reconciliation loop)
or by a timed manual requeue of a failed request.
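The two requeue paths above can be modelled as follows. This is only a standard-library sketch of the behaviour, not the controller-runtime API: `outcome`, `requeuePolicy`, `newRequeuePolicy` and `errStep` are hypothetical stand-ins, the immediate-requeue flag is left out for brevity, and the base and maximum delays are illustrative values. The assumption it encodes is that an error leads to a per-object delay that doubles with each consecutive failure, while a timed requeue uses the requested delay and resets the failure count.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStep is a sample step failure used in this sketch (hypothetical).
var errStep = errors.New("step failed")

// outcome is a stand-in for what one reconcile pass reports back.
type outcome struct {
	RequeueAfter time.Duration // > 0 asks for a timed requeue
	Err          error         // non-nil asks for a rate-limited requeue
}

// requeuePolicy tracks consecutive failures per object key.
type requeuePolicy struct {
	base, max time.Duration
	failures  map[string]int
}

func newRequeuePolicy(base, max time.Duration) *requeuePolicy {
	return &requeuePolicy{base: base, max: max, failures: map[string]int{}}
}

// next returns the delay before the next reconcile and whether to requeue at all.
func (p *requeuePolicy) next(key string, o outcome) (time.Duration, bool) {
	switch {
	case o.Err != nil:
		// Rate-limited path: base * 2^(previous failures), capped at max.
		n := p.failures[key]
		p.failures[key] = n + 1
		d := p.base
		for i := 0; i < n && d < p.max; i++ {
			d *= 2
		}
		if d > p.max {
			d = p.max
		}
		return d, true
	case o.RequeueAfter > 0:
		// Timed path: forget earlier failures and wait exactly as requested.
		delete(p.failures, key)
		return o.RequeueAfter, true
	default:
		// Success: forget earlier failures, nothing to requeue.
		delete(p.failures, key)
		return 0, false
	}
}

func main() {
	p := newRequeuePolicy(5*time.Millisecond, time.Minute)
	for i := 0; i < 4; i++ {
		d, _ := p.next("runtime-a", outcome{Err: errStep})
		fmt.Println("error requeue after", d)
	}
	d, _ := p.next("runtime-a", outcome{RequeueAfter: 30 * time.Second})
	fmt.Println("timed requeue after", d)
}
```

One consequence worth noting in the concept: a timed requeue resets the failure history in this model, so a step that alternates between both paths would never reach a maximum retry count based on consecutive failures alone.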
Any non-recoverable situation (e.g. when the maximum number of retries is reached) has to trigger an alert which notifies the on-call and the development team about an illegal/defective state in the FSM
The Kubebuilder SDK exposes basic metrics (provided by controller-runtime, e.g. `controller_runtime_reconcile_errors_total`) that can be used to trigger the alerts. If we find the basic metrics insufficient, we can extend them.
Depends on
#113