by Shyam JVS, Google Inc
February 2018
This document is a compilation of some interesting scalability/performance regression stories from the past. These were identified, studied, and fixed largely by sig-scalability. We begin by listing them, along with succinct explanations, the features/components that were involved, and the relevant SIGs (besides sig-scalability). We also accompany them with data on the smallest scale, for both real and simulated (i.e. kubemark) clusters, that managed to catch each regression. At the end of the document we draw some useful insights based on the case studies.
Issue | Brief Description | Details | Relevant feature(s)/component(s) | Relevant SIG(s) | Smallest real cluster affected | Smallest kubemark cluster affected |
---|---|---|---|---|---|---|
#60035 | Kubemark-scale fails with a couple of hollow-nodes getting preempted due to higher memory usage | A few hollow-nodes started getting preempted by kubelets due to memory shortage for running critical pods. The increase in memory usage of the hollow-nodes (more specifically hollow kube-proxy) came from resolving a recent bug with endpoints in kubemark (#59823). | | - | - | 5000 |
#59823 | Endpoints objects in kubemark are empty, leading to misleading performance results | Endpoints objects weren't getting populated with more than a single entry, due to conflicting node names for the same pod IP. The pod IPs were the same because of a bug in our mock docker client, which assigned a constant IP to all fake pods. This is probably a regression that didn't exist about a year back. It had significant performance implications (see the bug). | | sig-network | - | 100 |
#56061 | Apiserver memory usage increased by 10-20% after addition of admission metrics | A bunch of admission metrics were added to the apiserver for monitoring admission plugins, webhooks, etc. Soon after that change we started seeing a 100-200MB increase in apiserver memory usage on a 100-node cluster. Thanks to the resource-usage checks in our performance tests, we were able to spot the regression. It was later fixed by making those metrics lighter (i.e. removing some SummaryVec metrics and reducing histogram buckets; see the metrics sketch below the table). | | sig-api-machinery sig-instrumentation | 100 | - |
#55695 | Metadata-proxy not able to handle too many pods per node | Metadata-proxy, a newly enabled node agent for proxying metadata requests coming from pods on the node, was unable to handle the load from >70 pods due to memory starvation. This violated our official k8s support for 110 pods/node. | | sig-auth sig-node | - | 500 |
#55060 | Increase in pod startup latency due to Duplicate Address Detection in CNI plugin | An update to the Container Network Interface (CNI) library introduced a new step for DAD, which caused a delay for the CNI plugins waiting on it to finish. Since this was on the code path for container creation, it increased pod startup latency on the kubelet side by more than a second. As a result, we saw violations of our 5s pod-startup latency SLO on reasonably large clusters (where we were already close to the SLO earlier). | | sig-node sig-network | 2000 (though some effect was also seen at 100) | - |
#54164 | Kube-dns pods coming up super slowly in large clusters due to inter-pod anti-affinity | Kube-dns, a default deployment for k8s clusters, introduced node-level soft inter-pod anti-affinity in order to spread those pods across different nodes (see the anti-affinity sketch below the table). However, the O(pods^2) implementation of anti-affinity in the scheduler made scheduling of those pods extremely slow. As a result, cluster creation was failing with a timeout. | | sig-scheduling sig-network | 2000 | - |
#53327 (part) | Performance tests seeing a huge drop in scheduler throughput due to one predicate slowing down | One of the scheduler predicates was changed to compute a random 32-character string. That made the predicate super slow, as it started starving for randomness (especially with the predicate running for each of 1000s of pods), and hugely reduced scheduler throughput (by ~10x). After a few optimizations to the random pkg (eventually getting rid of the rand() call), this was fixed. | | sig-scheduling | 100 (mild signal) | 500 (strong signal) |
#53327 (part) | Kubemark performance tests fail with timeout during pod deletion due to bug in kubelet mock | The kubelet mock (hollow-kubelet) started showing behavioral differences from the real kubelet due to some changes in the latter. As a result, the hollow-kubelet was failing to ever delete pods under a corner condition: a "DELETE pod" event received for a pod while the kubelet is in the middle of its container creation. A tricky regression needing quite some hunting before we could set the mock right. | | sig-node | - | 5000 (also 500, but flakily) |
#52284 | CIDR allocation super slow with IP aliases | This was a pre-existing performance issue, but it got exposed as a regression when we turned on IP aliasing for large clusters. The CIDR-allocator (part of controller-manager) performed poorly due to bad design, mainly a lack of concurrency and synchronous processing of events from shared informers (see the worker-pool sketch below the table). A bunch of optimizations (#52292) later fixed its performance. | | sig-network | 2000 | - |
#51903 | Few nodes failing to start in kubemark due to reduced PIDs limit for docker in newer COS image | When the COS m60 image was introduced, we started seeing that some of the kubemark hollow-node pods were failing to start due to docker on the host node crossing the PID limit. This is a risky regression in terms of the damage it could've caused if rolled out to production, and our scalability tests caught it. Besides the low PID threshold issue, it also helped catch another issue with containerd-shim starting too many threads. | | sig-node | - | 500 |
#51899 (part) | "PATCH node-status" calls seeing high latency due to blocking on audit-logging | Those calls are made by kubelets once every X seconds - which adds up to be quite some qps for large clusters. Part of handling those calls is audit-logging them. When a change moving the default audit-log format to JSON was made, a performance issue with the design was exposed. The update handler for those calls was doing the audit-writing synchronously (instead of buffering + asynchronous writing), which slowed down those calls by an order of magnitude. |
|
sig-auth sig-instrumentation sig-api-machinery |
2000 | - |
#51899 (part) | "DELETE pods" API call latencies shot up on large cluster tests due to kubelet thundering herd | A change to kubelet pod deletion resulted in delete pod api calls from kubelets being concentrated immediately after container garbage collection. When performing deletion of large numbers (O(10k)) of pods across large numbers (O(1k)) of nodes, the resulting concentrated delete calls from the kubelets cause increased latency of "DELETE pods" API calls (above our target SLO of 1s). |
|
sig-node | 2000 | - |
#51099 | gRPC update causing failure of API calls with large responses | When the gRPC vendor library was updated to v1.5.1, the default maximum response size between apiserver and etcd changed to 4MB (see the gRPC sketch below the table). This could only be caught by scalability tests, as our regular tests run at a much smaller scale and so don't actually encounter such large response sizes. | | sig-api-machinery | 100 | 100 |
#50854 | Route-controller timing out while listing routes from cloud-provider | Route-controller was failing to list routes from the cloud-provider API and in turn failed to create routes for the nodes. The reason was that the project in which the cluster was being created now had another huge cluster running in it (with O(5k) routes), which interfered with the list-routes call for this cluster due to cloud-provider-side issues. | | sig-network sig-gcp | - | 5000 (running beside a real 5000 cluster) |
#50366 | Failing to fit some pods on cluster due to accidentally increased fluentd resource request | Some change around setting fluentd resource requests accidentally doubled its CPU request. This was caught by our kubemark scalability test, where we tightly fit our hollow-node pods onto a small set of nodes. With the fluentd increase, some of those pods couldn't be scheduled due to CPU shortage, and we caught it. This bug was risky for production, as it could've preempted some of the users' pods in favor of fluentd (a critical pod). | | sig-instrumentation | - | 500 |
#48700 | Apiserver panic while logging a request in TooManyRequests handler | A change in the ordering of apiserver request handlers (one of which is the TooManyRequests handler) caused a panic while instrumenting the request. Though this is not a scalability regression per se, it is a scenario which was exposed only by our scale tests, where we actually see 429s (TooManyRequests) due to the scale at which we run the clusters (unlike tests at normal scale). | | sig-api-machinery | 100 | 500 |
#47419 | Performance tests failing due to newly exposed high LIST api latencies | After fixing a notorious bug in the instrumentation code for the 'API request latency' metric, we started seeing performance test failures due to high LIST call latencies. Though it seemed like a regression at first, it was actually a hidden performance issue that was brought to light by the fix. We then realized that LIST calls were not actually satisfying our 1s API latency SLO and tuned the SLO appropriately for them. | | sig-api-machinery sig-instrumentation | 2000 | 5000 |
#45216 | Upgrade to Go 1.8 resulted in significant performance regression | When k8s was upgraded to go-1.8, we were seeing timeouts in our kubemark-scale tests due to a ~2x increase in the time taken to create services. After some experimenting/profiling, it seemed to originate from changes to the net/http.(*http2serverConn).serve library function, which had some extra cases added to a select statement. One of them added some logic for gracefulShutdown which slowed down the function a lot. It was eventually fixed in a patch release by the golang team. | | - | - | 5000 |
#42000 | Kube-proxy backlog processing causing CPU starvation for kubelet to start new pods | Kube-proxies were slow in processing endpoints updates. As a result, they were building up a backlog of work while the load test (which creates many services) was running. Later, when the density test ran (where we create 1000s of pods), the kube-proxies were still busy processing the backlog from the load test and hence consuming high memory. This starved the kubelets of memory for creating the density pods after cgroups were enabled. Before cgroups, this issue was hidden. | | sig-network sig-node | - | 500 |
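A few of the fixes above lend themselves to short illustrations. For #56061, here is a minimal sketch (not the actual apiserver code) of the kind of change that makes such metrics lighter: dropping a per-label SummaryVec in favor of a HistogramVec with only a handful of buckets. The metric names, labels, and bucket boundaries below are made up for illustration.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Heavier variant: a SummaryVec keeps streaming-quantile state per label
// combination, which adds up quickly with many plugins/operations.
var admissionLatencySummary = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name:       "admission_step_latency_summary_seconds", // hypothetical name
		Help:       "Admission step latency (summary; expensive).",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	},
	[]string{"plugin", "operation"},
)

// Lighter variant: a HistogramVec with a small, explicit bucket list keeps
// just one counter per bucket per label combination.
var admissionLatencyHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "admission_step_latency_seconds", // hypothetical name
		Help:    "Admission step latency (histogram; cheaper).",
		Buckets: []float64{0.005, 0.025, 0.1, 0.5, 2.5}, // few buckets instead of many
	},
	[]string{"plugin", "operation"},
)

func observe(plugin, op string, d time.Duration) {
	// Record into the lighter metric; the summary above is shown only for contrast.
	admissionLatencyHistogram.WithLabelValues(plugin, op).Observe(d.Seconds())
}

func main() {
	prometheus.MustRegister(admissionLatencyHistogram)
	observe("NamespaceLifecycle", "CREATE", 3*time.Millisecond)
}
```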
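For #54164, a sketch of what node-level soft inter-pod anti-affinity looks like when attached to a pod spec, using the k8s.io/api/core/v1 types. The label selector and weight are illustrative rather than the exact kube-dns manifest.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// softAntiAffinity returns a "preferred" (soft) pod anti-affinity rule asking the
// scheduler to avoid co-locating pods with the given label on the same node.
func softAntiAffinity(labelKey, labelValue string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100,
					PodAffinityTerm: corev1.PodAffinityTerm{
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{labelKey: labelValue},
						},
						// "Node-level" spreading: one topology domain per node.
						TopologyKey: "kubernetes.io/hostname",
					},
				},
			},
		},
	}
}

func main() {
	pod := corev1.Pod{}
	pod.Spec.Affinity = softAntiAffinity("k8s-app", "kube-dns")
	_ = pod
}
```

Because it is a soft rule, the scheduler has to score candidate nodes against the existing pods matching the selector, which is where the quadratic cost showed up.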
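For #52284, a generic sketch of the two missing pieces called out above, concurrency and non-blocking handling of informer events: the event handler only enqueues, and a pool of workers drains the queue. This is plain Go rather than the actual CIDR-allocator code; nodeUpdate and allocateCIDR are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeUpdate is a placeholder for an event delivered by a shared informer.
type nodeUpdate struct{ nodeName string }

// allocateCIDR is a placeholder for the slow per-node work (e.g. picking a range,
// talking to the cloud provider).
func allocateCIDR(u nodeUpdate) { fmt.Println("allocated CIDR for", u.nodeName) }

func main() {
	const workers = 30
	queue := make(chan nodeUpdate, 1000)

	// Informer-callback side: just enqueue and return immediately, so the shared
	// informer is never blocked on slow processing.
	enqueue := func(u nodeUpdate) { queue <- u }

	// Worker side: several goroutines drain the queue concurrently.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range queue {
				allocateCIDR(u)
			}
		}()
	}

	for i := 0; i < 5; i++ {
		enqueue(nodeUpdate{nodeName: fmt.Sprintf("node-%d", i)})
	}
	close(queue)
	wg.Wait()
}
```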
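For the audit-logging half of #51899, a simplified sketch of the buffering + asynchronous writing shape mentioned above: the request handler only appends the event to an in-memory buffer, and a background goroutine does the slow write. This is an illustration, not the apiserver's actual audit backend.

```go
package main

import (
	"fmt"
	"os"
)

// asyncAuditWriter buffers audit events and writes them from a background
// goroutine, so request handlers never block on I/O.
type asyncAuditWriter struct {
	events chan string
	done   chan struct{}
}

func newAsyncAuditWriter(bufferSize int) *asyncAuditWriter {
	w := &asyncAuditWriter{
		events: make(chan string, bufferSize),
		done:   make(chan struct{}),
	}
	go w.run()
	return w
}

// Log is what a request handler calls; it only enqueues (or drops when the
// buffer is full, a common trade-off for audit pipelines).
func (w *asyncAuditWriter) Log(event string) {
	select {
	case w.events <- event:
	default:
		// Buffer full: dropping is preferable to slowing down API calls.
	}
}

func (w *asyncAuditWriter) run() {
	defer close(w.done)
	for e := range w.events {
		fmt.Fprintln(os.Stderr, e) // the slow, synchronous part lives here
	}
}

func (w *asyncAuditWriter) Close() {
	close(w.events)
	<-w.done
}

func main() {
	audit := newAsyncAuditWriter(1000)
	// In the regression, this write happened inline in the "PATCH node-status"
	// handler; here it just enqueues.
	audit.Log(`{"verb":"patch","resource":"nodes/status"}`)
	audit.Close()
}
```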
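For #51099, a sketch of where the 4MB default lives on a grpc-go client and how a caller can raise it via call options. The endpoint and limit below are placeholders, not the actual apiserver/etcd client wiring.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// grpc-go's default maximum receive message size is 4MB; responses larger
	// than that fail the call. Large LIST responses can exceed it, which is
	// what surfaced in #51099.
	const maxRecvBytes = 32 * 1024 * 1024 // illustrative limit

	conn, err := grpc.Dial(
		"etcd.example.internal:2379", // placeholder endpoint
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(maxRecvBytes)),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```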
- On many occasions our scalability tests caught critical/risky bugs that were missed by most other tests. If not caught, those could've seriously jeopardized the production-readiness of k8s.
- SIG-Scalability has caught/fixed several important issues that span various components, features and SIGs.
- Around 60% of the time (possibly even more), we catch scalability regressions with just our medium-scale (and fast) tests, i.e. gce-100 and kubemark-500. Making them run as presubmits should act as a strong shield against regressions.
- The majority of the remaining ones are caught by our large-scale (and slow) tests, i.e. kubemark-5k and gce-2k. Making them post-submit blockers (given they're "usually" quite healthy) should act as a second layer of protection against regressions.