Add HA mode for service-mirror #11047

Merged: 8 commits merged into main from matei/service-mirror-ha on Jul 17, 2023
Conversation

@mateiidavid (Member) commented Jun 21, 2023:

In certain scenarios, the service-mirror may act as a single point of failure. Linkerd's multicluster extension supports an `--ha` mode to increase reliability by adding more replicas; however, it is currently supported only for the gateway.

To avoid making the service-mirror a single point of failure, this change introduces an `--ha` flag for `linkerd multicluster link`. The HA flag applies a set of value overrides that:

  • Configure the service-mirror with affinity and PDB policies to ensure replicas are spread across hosts, protecting against voluntary and involuntary disruptions;
  • Configure the service-mirror to run with 3 replicas;
  • Configure the service-mirror deployment's rolling-update strategy to ensure at least one replica is always available.
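
Enabling it could look like the following (the cluster name is illustrative; the pipe to `kubectl apply` follows the usual `link` workflow):

```
:; linkerd multicluster link --cluster-name target --ha | kubectl apply -f -
```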

Additionally, with the introduction of leader election, `linkerd mc gateways` displays redundant information, since metrics are collected from each pod. This change adds a small lookup table of current lease claimants; metrics are extracted only for claimants.
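
For illustration, a minimal client-go sketch of such a claimant lookup (the function name and surrounding structure are assumptions, not the PR's exact code; the `Leases(...).List(...)` call mirrors the check code quoted later in this thread):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// leaseClaimants builds the lookup table: the set of holder identities of
// all Leases in the given namespace. Assuming the holder identity records
// the pod name, callers can skip metrics for any pod not in the set.
func leaseClaimants(ctx context.Context, client kubernetes.Interface, ns string) (map[string]struct{}, error) {
	leases, err := client.CoordinationV1().Leases(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	claimants := make(map[string]struct{})
	for _, l := range leases.Items {
		if l.Spec.HolderIdentity != nil {
			claimants[*l.Spec.HolderIdentity] = struct{}{}
		}
	}
	return claimants, nil
}
```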

Depends on #11046

@mateiidavid mateiidavid requested a review from a team as a code owner June 21, 2023 15:45
@mateiidavid (Member Author) commented:

Diff:

```
# < -- only in ha.yaml (missing from unha.yaml)
# > -- only in unha.yaml
:; diff ha.yaml unha.yaml
121c121
<   replicas: 3
---
>   replicas: 1
126,128d125
<   strategy:
<     rollingUpdate:
<       maxUnavailable: 1
167,182d163
< ---
< kind: PodDisruptionBudget
< apiVersion: policy/v1
< metadata:
<   name: linkerd-service-mirror-target
<   namespace: linkerd-multicluster
<   labels:
<     component: linkerd-service-mirror
<   annotations:
<     linkerd.io/created-by: linkerd/cli dev-352e404a-matei
< spec:
<   maxUnavailable: 1
<   selector:
<     matchLabels:
<       component: linkerd-service-mirror
<       mirror.linkerd.io/cluster-name: target
# Before
:; linkerd mc gateways
CLUSTER  ALIVE    NUM_SVC      LATENCY
target   False          2            -
target   True           2          3ms
target   False          2            -

# After
:; bin/linkerd mc gateways
CLUSTER  ALIVE    NUM_SVC      LATENCY
target   True           2          3ms
```

From the review thread on `multicluster/cmd/check.go`:

```
@@ -579,9 +579,21 @@ func (hc *healthChecker) checkIfGatewayMirrorsHaveEndpoints(ctx context.Context,
			continue
		}

		leases, err := hc.KubeAPIClient().CoordinationV1().Leases(multiclusterNs.Name).List(ctx, selector)
```
A contributor asked:

For non-HA mode, with only one instance running, will there still be a lease resource?

@mateiidavid (Member Author) replied Jun 22, 2023:

Yup. This check should be safe to run regardless of the deployment model. The same applies for `multicluster gateways`, where we pull metrics from the "leader". We can guarantee there will always be a leader, since a lease is created and claimed under all circumstances.
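
For example, the current claimant can be inspected directly (output omitted; the namespace matches the check code above):

```
:; kubectl get lease -n linkerd-multicluster
```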

As an aside, there might be a smarter way to pull metrics (e.g. aggregate metrics from all pods under a deployment). I thought I'd keep it simple though.

@alpeb (Member) left a comment:

Nice!
Can you also include a podAntiAffinity field like we do for the linkerd-control-plane components? I believe the PDB doesn't cover that aspect.

@mateiidavid (Member Author) commented Jul 12, 2023:

Added the missing pod anti-affinity partial to the deployment. For context:

  • Anti-affinity works by selecting over a group of pods (using a label selector) and a topology key (e.g. kubernetes.io/hostname). Depending on the strategy (preferred vs required), pods will not be scheduled onto a node that already runs a matching pod within the same topology key. This allows us to have separate failure domains for HA.
  • Typically, a component or app label key is used to differentiate between pods (i.e. as the affinity label selector). All service-mirror components, irrespective of the cluster they're linked against, have the same value for component: component=linkerd-service-mirror.
  • Users should be able to turn HA on selectively, on a link-by-link basis. Enforcing affinity using component means all service-mirror pods would be affected by the scheduling policy, not just the service-mirror pods specific to that link.

As a result, I have chosen to use a different label selector. The full list of labels each service-mirror receives is below:

```json
{
  "component": "linkerd-service-mirror",
  "linkerd.io/control-plane-ns": "linkerd",
  "linkerd.io/extension": "multicluster",
  "linkerd.io/proxy-deployment": "linkerd-service-mirror-target",
  "linkerd.io/workload-ns": "linkerd-multicluster",
  "mirror.linkerd.io/cluster-name": "target",
  "pod-template-hash": "c7c6555df"
}
```

The only variable value is the link cluster name. As a result, anti-affinity selects pods using `mirror.linkerd.io/cluster-name=<name>` as a label selector. Another alternative would be to use both `mirror.linkerd.io/cluster-name` and `component`, or better yet, use `component` and append the cluster name to the label value.
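
For reference, a sketch of roughly what the rendered stanza could look like for a link named target (illustrative only; whether scheduling is preferred or required depends on the partial):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: mirror.linkerd.io/cluster-name
          operator: In
          values:
          - target
      topologyKey: kubernetes.io/hostname
```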

@mateiidavid mateiidavid requested a review from alpeb July 12, 2023 14:15
@alpeb (Member) left a comment:

Thanks @mateiidavid, I think the pod anti-affinity scheme is fine enough; people could use Kustomize if they require something more sophisticated 👍

@mateiidavid mateiidavid merged commit d0e837d into main Jul 17, 2023
37 checks passed
@mateiidavid mateiidavid deleted the matei/service-mirror-ha branch July 17, 2023 08:25