Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stable 2.14.10 #12111

Merged
merged 12 commits into from
Feb 20, 2024
Merged

stable 2.14.10 #12111

merged 12 commits into from
Feb 20, 2024

Conversation

adleong
Copy link
Member

@adleong adleong commented Feb 19, 2024

This stable release back-ports bugfixes and improvements from recent edge
releases.

  • Introduced support for arbitrary labels in the podMonitors field in the
    control plane Helm chart (thanks @jseiser!) (#11222; fixes #11175)
  • Added a prometheusUrl field for the heartbeat job in the control plane Helm
    chart (thanks @david972!) (#11343; fixes #11342)
  • Updated the Destination controller to return INVALID_ARGUMENT status codes
    properly when a ServiceProfile is requested for a service that does not
    exist. (#11980)
  • Reduced the load on the Destination controller by only processing Server
    updates on workloads affected by the Server (#12017)
  • Changed how updates to a Server selector are handled in the destination
    service. When a Server that marks a port as opaque no longer selects a
    resource, the resource's opaqueness will reverted to default settings
    (#12031; fixes #11995)
  • Fixed a race condition in the destination service that could cause panics
    under very specific conditions (#12022; fixes #12010)
  • Fixed an issue where inbound policy could be incorrect after certain policy
    resources are deleted (#12088)

adleong and others added 8 commits February 17, 2024 00:43
…olicy (#12088)

In the inbound policy index, we maintain a policy index per namespace which holds various policy resources for that namespace.  When a per namespace index becomes empty, we remove it.  However, we were not considering authorization policy resources when determining if the index is empty.  This could result in the index being removed even while it contained authorization policy resources, as long as all other resource types did not exist.

This can lead to incorrect inbound policy responses when the per namespace index is recreated, since it will not longer contain the authorization policy.

We update the `is_empty()` function to properly consider authorization policies as well.  We also add some generally useful logging at debug and trace level.

Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Oliver Gould <[email protected]>
Fixes #12010

## Problem

We're observing crashes in the destination controller in some scenarios, due to data race as described in #12010.

## Cause

The problem is the same instance of the `AddressSet.Addresses` map is getting mutated in the endpoints watcher Server [informer handler](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/watcher/endpoints_watcher.go#L1309), and iterated over in the endpoint translator [queue loop](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/endpoint_translator.go#L197-L211), which run in different goroutines and the map is not guarded. I believe this doesn't result in Destination returning stale data; it's more of a correctness issue.

## Solution

Make a shallow copy of `pp.addresses` in the endpoints watcher and only pass that to the listeners. It's a shallow copy because we avoid making copies of the pod reference in there, knowing it won't get mutated.

## Repro

Install linkerd core and injected emojivoto and patch the endpoint translator to include a sleep call that will help surfacing the race (don't install the patch in the cluster; we'll only use it locally below):

<details>
  <summary>endpoint_translator.go diff</summary>

```diff
diff --git a/controller/api/destination/endpoint_translator.go b/controller/api/destination/endpoint_translator.go
index d1018d5f9..7d5abd638 100644
--- a/controller/api/destination/endpoint_translator.go
+++ b/controller/api/destination/endpoint_translator.go
@@ -5,6 +5,7 @@ import (
        "reflect"
        "strconv"
        "strings"
+       "time"

        pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
        "github.com/linkerd/linkerd2-proxy-api/go/net"
@@ -195,7 +196,9 @@ func (et *endpointTranslator) processUpdate(update interface{}) {
 }

 func (et *endpointTranslator) add(set watcher.AddressSet) {
        for id, address := range set.Addresses {
+               time.Sleep(1 * time.Second)
                et.availableEndpoints.Addresses[id] = address
        }
```
</details>

Then create these two Server manifests:

<details>
  <summary>emoji-web-server.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: HTTP/1
```
</details>

<details>
  <summary>emoji-web-server-opaque.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: opaque
```
</details>

In separate consoles run the patched destination service and a destination client:

```bash
HOSTNAME=foobar go run -race ./controller/cmd/main.go destination -enable-h2-upgrade=true -enable-endpoint-slices=true -cluster-domain=cluster.local -identity-trust-domain=cluster.local -default-opaque-ports=25,587,3306,4444,5432,6379,9300,11211
```

```bash
go run ./controller/script/destination-client -path web-svc.emojivoto.svc.cluster.local:80
```

And run this to continuously switch the `proxyProtocol` field:

```bash
while true; do kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server.yml; kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server-opaque.yml ; done
```

You'll see the following data race report in the Destination controller logs:

<details>
  <summary>destination logs</summary>

```console
==================
WARNING: DATA RACE
Write at 0x00c0006d30e0 by goroutine 178:
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1310 +0x772
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:711 +0x150
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).addServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:514 +0x173
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:528 +0x26f
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer-fm()
      <autogenerated>:1 +0x64
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:246 +0x81
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x1f
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:970 +0x1f4
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xbe
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161 +0x9b
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:966 +0x38
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x86

Previous read at 0x00c0006d30e0 by goroutine 360:
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:200 +0x1ab
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:190 +0x166
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:174 +0x45
```
</details>

## Extras

This also removes the unused method `func (as *AddressSet) WithPort(port Port) AddressSet` in endpoints_watcher.go
Fixes #11995

If a Server is marking a Pod's port as opaque and then the Server's podSelector is updated to no longer select that Pod, then the Pod's port should no longer be marked as opaque. However, this update does not result in any updates from the destination API's Get stream and the port remains marked as opaque.

We fix this by updating the endpoint watcher's handling of Server updates to consider both the old and the new Server.

Signed-off-by: Alex Leong <[email protected]>
When the destination controller receives a GetProfile request for an ExternalName service, it should return gRPC status code INVALID_ARGUMENT to signal to the proxy that ExternalName services do not have endpoints and service discovery should not be used.  However, the destination controller is returning gRPC status code UNKNOWN instead.  This causes the proxy to retry these requests, resulting in a flurry of warning logs in the destination controller.

We fix the destination controller to properly return INVALID_ARGUMENT instead of UNKNOWN for ExternalName services.

Signed-off-by: Alex Leong <[email protected]>
Can not use external prometheus with hearbeat. This change adds a new variable `.prometheusUrl` in value and use it into heartbeat from linkerd-control-plane chart.

Fixes #11342

Signed-off-by: David ALEXANDRE <[email protected]>
Need to be able to set labels on Pod Monitors

add a labels section to podMonitor

Helm Lint/Helm Template

Fixes #[11175]

Signed-off-by: Justin S <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
We the destination controller's workload watcher receives an update for any Server resource, it recomputes opaqueness for every workload.  This is because the Server update may have changed opaqueness for that workload.  However, this is very CPU intensive for the destination controller, especially during resyncs when we get Server updates for every Server resource in the cluster.

Instead, we only need to recompute opaqueness for workloads that are selected by the old version of the Server or by the new version of the Server.  If a workload is not selected by either the new or old version of the Server, then the Server update cannot have changed the workload's opaqueness.

Signed-off-by: Alex Leong <[email protected]>
@adleong adleong requested a review from a team as a code owner February 19, 2024 19:11
Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cherry picks looked good to me 👍

controller/api/destination/watcher/pod_watcher.go Outdated Show resolved Hide resolved
Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
@alpeb
Copy link
Member

alpeb commented Feb 19, 2024

Oh I think we should also include the Helm security update #12085

Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
@adleong adleong merged commit 1ea6b27 into release/stable-2.14 Feb 20, 2024
41 checks passed
@adleong adleong deleted the alex/stable-2.14.10 branch February 20, 2024 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants