Add HA mode for service-mirror #11047
Conversation
In certain scenarios, the service-mirror may act as a single point of failure. Linkerd's multicluster extension supports an `--ha` mode to increase reliability by adding more replicas; however, it is currently supported only in the gateway. To avoid the service-mirror becoming a single point of failure, this change introduces an `--ha` flag for `linkerd multicluster link`. The HA flag uses a set of value overrides that will:

* Configure the service-mirror with affinity and PDB policies to ensure replicas are spread across hosts, protecting against voluntary and involuntary disruptions;
* Configure the service-mirror to run with 3 replicas;
* Configure the service-mirror deployment's rolling strategy to ensure at least one replica is always available.

Additionally, with the introduction of leader election, `linkerd mc gateways` displays redundant information, since metrics are collected from each pod. This change adds a small lookup table of current lease claimants; metrics are extracted only for claimants.

Signed-off-by: Matei David <[email protected]>
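For illustration, linking with the new flag could look like the following (a sketch only: the cluster name and kubectl contexts are placeholders, and only `--ha` is new in this change; piping the manifest into `kubectl apply` is the usual `link` workflow):

```
:; linkerd --context=source multicluster link --cluster-name target --ha \
     | kubectl --context=source apply -f -
```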
Diff:
# < -- missing from unha.yaml
# > -- present in unha.yaml
:; diff ha.yaml unha.yaml
121c121
<   replicas: 3
---
>   replicas: 1
126,128d125
<   strategy:
<     rollingUpdate:
<       maxUnavailable: 1
167,182d163
< ---
< kind: PodDisruptionBudget
< apiVersion: policy/v1
< metadata:
<   name: linkerd-service-mirror-target
<   namespace: linkerd-multicluster
<   labels:
<     component: linkerd-service-mirror
<   annotations:
<     linkerd.io/created-by: linkerd/cli dev-352e404a-matei
< spec:
<   maxUnavailable: 1
<   selector:
<     matchLabels:
<       component: linkerd-service-mirror
<       mirror.linkerd.io/cluster-name: target

# Before
:; linkerd mc gateways
CLUSTER ALIVE NUM_SVC LATENCY
target False 2 -
target True 2 3ms
target False 2 -
# After
:; bin/linkerd mc gateways
CLUSTER ALIVE NUM_SVC LATENCY
target True 2 3ms
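The deduplication shown above can be sketched as a claimant lookup table: only rows scraped from a pod that currently holds a lease are kept. This is an illustrative sketch, not the PR's actual code; the `gatewayRow` type, pod names, and field names are assumptions.

```go
package main

import "fmt"

// gatewayRow models one row of `linkerd mc gateways` output (illustrative).
type gatewayRow struct {
	cluster string
	pod     string // pod the metrics were scraped from
	alive   bool
	latency string
}

// filterByClaimants keeps only rows whose source pod currently claims a lease.
// claimants is the lookup table of current lease holder identities.
func filterByClaimants(rows []gatewayRow, claimants map[string]struct{}) []gatewayRow {
	var out []gatewayRow
	for _, r := range rows {
		if _, ok := claimants[r.pod]; ok {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Three replicas report on the same target cluster; only the leader's
	// row carries live metrics.
	rows := []gatewayRow{
		{cluster: "target", pod: "service-mirror-a", alive: false, latency: "-"},
		{cluster: "target", pod: "service-mirror-b", alive: true, latency: "3ms"},
		{cluster: "target", pod: "service-mirror-c", alive: false, latency: "-"},
	}
	claimants := map[string]struct{}{"service-mirror-b": {}}
	for _, r := range filterByClaimants(rows, claimants) {
		fmt.Println(r.cluster, r.alive, r.latency) // prints: target true 3ms
	}
}
```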
multicluster/cmd/check.go
Outdated
@@ -579,9 +579,21 @@ func (hc *healthChecker) checkIfGatewayMirrorsHaveEndpoints(ctx context.Context,
			continue
		}

		leases, err := hc.KubeAPIClient().CoordinationV1().Leases(multiclusterNs.Name).List(ctx, selector)
In non-HA mode, with only one instance running, will there still be a lease resource?
Yup. This check should be safe to run regardless of the deployment model. The same applies for multicluster gateways
where we pull metrics from the "leader". We can guarantee there will always be a leader since a lease is created and claimed under all circumstances.
As an aside, there might be a smarter way to pull metrics (e.g. aggregate metrics from all pods under a deployment). I thought I'd keep it simple though.
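The lookup table of claimants mentioned in the description can be built from the listed Lease resources by collecting each lease's holder identity. A minimal sketch follows, with a stand-in `lease` type (the real check would use `coordinationv1.Lease` from `k8s.io/api/coordination/v1`; the lease and holder names here are invented):

```go
package main

import "fmt"

// lease is a stand-in for coordinationv1.Lease; only the fields used here.
type lease struct {
	name           string
	holderIdentity *string // nil when the lease is currently unclaimed
}

// buildClaimants returns the set of identities currently holding a lease.
func buildClaimants(leases []lease) map[string]struct{} {
	claimants := make(map[string]struct{})
	for _, l := range leases {
		if l.holderIdentity != nil && *l.holderIdentity != "" {
			claimants[*l.holderIdentity] = struct{}{}
		}
	}
	return claimants
}

func main() {
	holder := "linkerd-service-mirror-target-c7c6555df-abcde"
	leases := []lease{
		{name: "service-mirror-target", holderIdentity: &holder},
		{name: "stale-lease", holderIdentity: nil},
	}
	claimants := buildClaimants(leases)
	fmt.Println(len(claimants)) // prints: 1
}
```

As noted above, a lease is created and claimed under all circumstances, so this set is non-empty in both HA and non-HA deployments.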
Signed-off-by: Matei David <[email protected]>
Signed-off-by: Matei David <[email protected]>
Signed-off-by: Matei David <[email protected]>
… into matei/service-mirror-ha
Nice!
Can you also include a podAntiAffinity field like we do for the linkerd-control-plane components? I believe the PDB doesn't cover that aspect.
Signed-off-by: Matei David <[email protected]>
Added the missing pod anti-affinity partial in the deployment. For context, I have chosen to use a different label selector. The full list of labels each service-mirror pod receives is below:

{
  "component": "linkerd-service-mirror",
  "linkerd.io/control-plane-ns": "linkerd",
  "linkerd.io/extension": "multicluster",
  "linkerd.io/proxy-deployment": "linkerd-service-mirror-target",
  "linkerd.io/workload-ns": "linkerd-multicluster",
  "mirror.linkerd.io/cluster-name": "target",
  "pod-template-hash": "c7c6555df"
}

The only variable value is the link cluster name. As a result, anti-affinity selects pods using the stable `component` label.
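A sketch of what such an anti-affinity stanza could look like, selecting on the stable `component` label. The topology key and the preferred-vs-required choice are assumptions patterned after common Kubernetes practice, not necessarily what the PR ships:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            component: linkerd-service-mirror
```

Selecting on `component` rather than `mirror.linkerd.io/cluster-name` means one partial works for every link, since the cluster name is the only per-link value.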
Thanks @mateiidavid, I think the pod anti-affinity scheme is fine enough; people could use kustomize if they require something more sophisticated 👍
Depends on #11046