stable 2.14.10 #12111
Merged
Conversation
…olicy (#12088)

In the inbound policy index, we maintain a policy index per namespace which holds various policy resources for that namespace. When a per-namespace index becomes empty, we remove it. However, we were not considering authorization policy resources when determining if the index is empty. This could result in the index being removed even while it contained authorization policy resources, as long as no other resource types existed. This can lead to incorrect inbound policy responses when the per-namespace index is recreated, since it will no longer contain the authorization policy.

We update the `is_empty()` function to properly consider authorization policies as well. We also add some generally useful logging at debug and trace level.

Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Oliver Gould <[email protected]>
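The emptiness bug can be illustrated with a minimal sketch. Go is used here for illustration only (the policy controller itself is written in Rust), and all type and field names below are hypothetical stand-ins for the per-namespace index:

```go
package main

import "fmt"

// nsIndex is a hypothetical per-namespace policy index holding
// several kinds of policy resources, keyed by name.
type nsIndex struct {
	servers               map[string]struct{}
	serverAuthorizations  map[string]struct{}
	authorizationPolicies map[string]struct{}
}

// isEmpty must consider every resource type. The bug described above
// was the equivalent of omitting the authorizationPolicies check, so
// an index holding only AuthorizationPolicy resources could be dropped.
func (ix *nsIndex) isEmpty() bool {
	return len(ix.servers) == 0 &&
		len(ix.serverAuthorizations) == 0 &&
		len(ix.authorizationPolicies) == 0
}

func main() {
	ix := &nsIndex{
		servers:               map[string]struct{}{},
		serverAuthorizations:  map[string]struct{}{},
		authorizationPolicies: map[string]struct{}{"authz-1": {}},
	}
	// With the fix, an index that still holds an AuthorizationPolicy
	// is not considered empty and therefore is not removed.
	fmt.Println(ix.isEmpty()) // prints "false"
}
```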
Fixes #12010

## Problem

We're observing crashes in the destination controller in some scenarios, due to a data race as described in #12010.

## Cause

The problem is that the same instance of the `AddressSet.Addresses` map is getting mutated in the endpoints watcher Server [informer handler](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/watcher/endpoints_watcher.go#L1309), and iterated over in the endpoint translator [queue loop](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/endpoint_translator.go#L197-L211), which run in different goroutines, and the map is not guarded. I believe this doesn't result in Destination returning stale data; it's more of a correctness issue.

## Solution

Make a shallow copy of `pp.addresses` in the endpoints watcher and only pass that to the listeners. It's a shallow copy because we avoid making copies of the pod reference in there, knowing it won't get mutated.

## Repro

Install linkerd core and injected emojivoto, and patch the endpoint translator to include a sleep call that will help surface the race (don't install the patch in the cluster; we'll only use it locally below):

<details>
<summary>endpoint_translator.go diff</summary>

```diff
diff --git a/controller/api/destination/endpoint_translator.go b/controller/api/destination/endpoint_translator.go
index d1018d5f9..7d5abd638 100644
--- a/controller/api/destination/endpoint_translator.go
+++ b/controller/api/destination/endpoint_translator.go
@@ -5,6 +5,7 @@ import (
 	"reflect"
 	"strconv"
 	"strings"
+	"time"

 	pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
 	"github.com/linkerd/linkerd2-proxy-api/go/net"
@@ -195,7 +196,9 @@ func (et *endpointTranslator) processUpdate(update interface{}) {
 }

 func (et *endpointTranslator) add(set watcher.AddressSet) {
 	for id, address := range set.Addresses {
+		time.Sleep(1 * time.Second)
 		et.availableEndpoints.Addresses[id] = address
 	}
```
</details>

Then create these two Server manifests:
<details>
<summary>emoji-web-server.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: HTTP/1
```
</details>

<details>
<summary>emoji-web-server-opaque.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: opaque
```
</details>

In separate consoles run the patched destination service and a destination client:

```bash
HOSTNAME=foobar go run -race ./controller/cmd/main.go destination -enable-h2-upgrade=true -enable-endpoint-slices=true -cluster-domain=cluster.local -identity-trust-domain=cluster.local -default-opaque-ports=25,587,3306,4444,5432,6379,9300,11211
```

```bash
go run ./controller/script/destination-client -path web-svc.emojivoto.svc.cluster.local:80
```

And run this to continuously switch the `proxyProtocol` field:

```bash
while true; do kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server.yml; kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server-opaque.yml; done
```

You'll see the following data race report in the destination controller logs:

<details>
<summary>destination logs</summary>

```console
==================
WARNING: DATA RACE
Write at 0x00c0006d30e0 by goroutine 178:
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1310 +0x772
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:711 +0x150
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).addServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:514 +0x173
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:528 +0x26f
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer-fm()
      <autogenerated>:1 +0x64
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:246 +0x81
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x1f
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:970 +0x1f4
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xbe
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161 +0x9b
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:966 +0x38
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x86

Previous read at 0x00c0006d30e0 by goroutine 360:
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:200 +0x1ab
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:190 +0x166
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:174 +0x45
```
</details>

## Extras

This also removes the unused method `func (as *AddressSet) WithPort(port Port) AddressSet` in endpoints_watcher.go.
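The essence of the fix — handing listeners their own shallow copy of the addresses map instead of the shared instance — can be sketched as follows. The types here are simplified stand-ins for the watcher types; the real change lives in `endpoints_watcher.go`:

```go
package main

import "fmt"

// Address and AddressSet are simplified stand-ins for the watcher types.
type Address struct{ IP string }

type AddressSet struct {
	Addresses map[string]Address
}

// withCopiedAddresses returns an AddressSet whose Addresses map is a
// shallow copy: the map itself is new, but the values are shared. This
// suffices to prevent the race, because listeners iterate their own map
// while the informer goroutine keeps mutating the publisher's copy.
func withCopiedAddresses(set AddressSet) AddressSet {
	copied := make(map[string]Address, len(set.Addresses))
	for id, addr := range set.Addresses {
		copied[id] = addr
	}
	return AddressSet{Addresses: copied}
}

func main() {
	orig := AddressSet{Addresses: map[string]Address{"pod-1": {IP: "10.0.0.1"}}}
	snap := withCopiedAddresses(orig)
	// Mutating the original no longer affects the copy handed to listeners.
	delete(orig.Addresses, "pod-1")
	fmt.Println(len(snap.Addresses)) // prints "1"
}
```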
Fixes #11995

If a Server is marking a Pod's port as opaque and then the Server's podSelector is updated to no longer select that Pod, then the Pod's port should no longer be marked as opaque. However, this update does not result in any updates from the destination API's Get stream and the port remains marked as opaque.

We fix this by updating the endpoint watcher's handling of Server updates to consider both the old and the new Server.

Signed-off-by: Alex Leong <[email protected]>
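The old/new comparison can be sketched like this. All names here are hypothetical simplifications (the real handler receives both objects from the informer's update callback), and the selector logic is reduced to exact label matching:

```go
package main

import "fmt"

// Minimal stand-ins for the real Kubernetes types.
type Server struct{ PodSelector map[string]string }
type Pod struct{ Labels map[string]string }

// selects reports whether every key/value in the selector matches the labels.
func selects(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// onServerUpdate mirrors the fix: consider both the old and the new
// Server. A pod selected by the old Server but not the new one must have
// its opaqueness reset; a pod selected by the new Server is (re)marked.
func onServerUpdate(oldSrv, newSrv Server, pod Pod) string {
	wasSelected := selects(oldSrv.PodSelector, pod.Labels)
	isSelected := selects(newSrv.PodSelector, pod.Labels)
	switch {
	case wasSelected && !isSelected:
		return "reset-opaqueness"
	case isSelected:
		return "mark-opaque"
	default:
		return "no-op"
	}
}

func main() {
	oldSrv := Server{PodSelector: map[string]string{"app": "web-svc"}}
	newSrv := Server{PodSelector: map[string]string{"app": "other"}}
	pod := Pod{Labels: map[string]string{"app": "web-svc"}}
	// The pod was deselected, so its port reverts to the default protocol.
	fmt.Println(onServerUpdate(oldSrv, newSrv, pod)) // prints "reset-opaqueness"
}
```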
When the destination controller receives a GetProfile request for an ExternalName service, it should return gRPC status code INVALID_ARGUMENT to signal to the proxy that ExternalName services do not have endpoints and service discovery should not be used. However, the destination controller is returning gRPC status code UNKNOWN instead. This causes the proxy to retry these requests, resulting in a flurry of warning logs in the destination controller.

We fix the destination controller to properly return INVALID_ARGUMENT instead of UNKNOWN for ExternalName services.

Signed-off-by: Alex Leong <[email protected]>
External Prometheus instances cannot currently be used with the heartbeat. This change adds a new `.prometheusUrl` value and uses it in the heartbeat job of the linkerd-control-plane chart.

Fixes #11342

Signed-off-by: David ALEXANDRE <[email protected]>
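A hedged sketch of how the new value might be set when pointing the heartbeat at an external Prometheus; the URL is a placeholder and the exact key path should be verified against the chart's values.yaml:

```yaml
# linkerd-control-plane values override (key name taken from the PR
# description; verify against the chart before use)
prometheusUrl: "http://prometheus.monitoring.svc.cluster.local:9090"
```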
We need to be able to set labels on PodMonitors. This change adds a `labels` section to the podMonitor configuration, validated with Helm lint and Helm template.

Fixes #11175

Signed-off-by: Justin S <[email protected]>
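A possible values override using the new section, e.g. so a Prometheus Operator instance discovers the generated PodMonitors; both the key path and the label shown are hypothetical and should be checked against the chart:

```yaml
# Hypothetical linkerd-control-plane values override; verify the key
# path against the chart's values.yaml.
podMonitor:
  labels:
    release: kube-prometheus-stack
```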
Signed-off-by: Alex Leong <[email protected]>
When the destination controller's workload watcher receives an update for any Server resource, it recomputes opaqueness for every workload, because the Server update may have changed opaqueness for any of them. This is very CPU intensive for the destination controller, especially during resyncs, when we get Server updates for every Server resource in the cluster.

Instead, we only need to recompute opaqueness for workloads that are selected by the old version of the Server or by the new version of the Server. If a workload is not selected by either version, then the Server update cannot have changed the workload's opaqueness.

Signed-off-by: Alex Leong <[email protected]>
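The guard described above can be sketched as a single predicate. Types and names are hypothetical simplifications, with selector matching reduced to exact label equality:

```go
package main

import "fmt"

// Minimal stand-ins for the real Kubernetes types.
type Server struct{ PodSelector map[string]string }
type Workload struct{ Labels map[string]string }

// selects reports whether every key/value in the selector matches the labels.
func selects(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// needsRecompute implements the optimization: a Server update can only
// change a workload's opaqueness if the old or the new version of the
// Server selects that workload. Everything else is skipped, which is
// what keeps resyncs cheap.
func needsRecompute(oldSrv, newSrv Server, w Workload) bool {
	return selects(oldSrv.PodSelector, w.Labels) ||
		selects(newSrv.PodSelector, w.Labels)
}

func main() {
	oldSrv := Server{PodSelector: map[string]string{"app": "emoji"}}
	newSrv := Server{PodSelector: map[string]string{"app": "voting"}}
	unrelated := Workload{Labels: map[string]string{"app": "web"}}
	// Workloads selected by neither version are not recomputed.
	fmt.Println(needsRecompute(oldSrv, newSrv, unrelated)) // prints "false"
}
```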
alpeb
approved these changes
Feb 19, 2024
Cherry picks looked good to me 👍
Oh I think we should also include the Helm security update #12085
mateiidavid
approved these changes
Feb 20, 2024
This stable release back-ports bugfixes and improvements from recent edge releases.

* Added a `podMonitors` field in the control plane Helm chart (thanks @jseiser!) (#11222; fixes #11175)
* Added a `prometheusUrl` field for the heartbeat job in the control plane Helm chart (thanks @david972!) (#11343; fixes #11342)
* Fixed the destination service to return `INVALID_ARGUMENT` status codes properly when a `ServiceProfile` is requested for a service that does not exist (#11980)
* Reduced the destination controller's CPU usage by only processing Server updates on workloads affected by the Server (#12017)
* Improved how `Server` selector updates are handled in the destination service. When a `Server` that marks a port as opaque no longer selects a resource, the resource's opaqueness will be reverted to default settings (#12031; fixes #11995)
* Fixed a data race in the destination service that could crash the controller under very specific conditions (#12022; fixes #12010)
* Fixed an issue where the inbound policy index could drop authorization policies after other policy resources are deleted (#12088)