
replace shared-informers' methods with cached client #461

Conversation

MatousJobanek (Contributor)

KUBESAW-132

First PR to replace the custom implementation of shared informers in reg-service with the cached controller-runtime client.
This PR changes the informer_service.go methods so they use the client instead.
I'm keeping the service file with the same methods to minimize the scope of the changes:

  • I don't need to change the places where it is used
  • I can also update the unit tests in a separate PR
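For illustration, a minimal sketch (type, field, and method names are assumptions, not the actual PR code) of what an informer-service method looks like when it reads from the cached controller-runtime client instead of a shared-informer lister:

package service

import (
    "context"

    toolchainv1alpha1 "github.com/codeready-toolchain/api/api/v1alpha1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// InformerService keeps its original method set so callers don't need to change;
// reads now go through a cache-backed controller-runtime client instead of listers.
type InformerService struct {
    client    client.Client // assumed to be backed by the controller-runtime cache
    namespace string        // host-operator namespace
}

// GetSpace returns a Space by name, served from the client's cache.
func (s InformerService) GetSpace(ctx context.Context, name string) (*toolchainv1alpha1.Space, error) {
    space := &toolchainv1alpha1.Space{}
    if err := s.client.Get(ctx, types.NamespacedName{Namespace: s.namespace, Name: name}, space); err != nil {
        return nil, err
    }
    return space, nil
}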

@@ -273,8 +273,8 @@ func (s *TestSignupServiceSuite) TestGetSignupFailsWithNotFoundThenOtherError()
s.Application.MockInformerService(inf)
svc := service.NewSignupService(
fake.MemberClusterServiceContext{
Client: s,
Svcs: s.Application,
CrtClient: s,
Contributor

why do we need to rename the field to CrtClient?

Contributor Author

Because there is a new method called Client, and you cannot have an exported field with the same name as a method. At the end of the day (after this whole refactoring madness), this will all go away.
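For context, a minimal sketch of the Go restriction being referred to; the struct definition below is illustrative, not the actual fake.MemberClusterServiceContext:

package fake

import "sigs.k8s.io/controller-runtime/pkg/client"

// Illustrative only: a Go type cannot declare a field and a method with the same name.
type MemberClusterServiceContext struct {
    // A field named Client would not compile, because the Client() method below
    // already uses that name (the compiler reports a "field and method with the
    // same name" error), hence the rename to CrtClient.
    CrtClient client.Client
}

// Client exposes the underlying controller-runtime client
// (assumed to be what the service-context interface requires).
func (c MemberClusterServiceContext) Client() client.Client {
    return c.CrtClient
}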

Contributor

ah right, I missed that


codecov bot commented Sep 16, 2024

Codecov Report

Attention: Patch coverage is 52.83019% with 25 lines in your changes missing coverage. Please review.

Project coverage is 79.45%. Comparing base (92998a6) to head (723bc4c).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
pkg/informers/service/informer_service.go 68.75% 9 Missing and 1 partial ⚠️
pkg/kubeclient/signup.go 0.00% 9 Missing ⚠️
pkg/kubeclient/client.go 66.66% 3 Missing ⚠️
pkg/server/in_cluster_application.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #461      +/-   ##
==========================================
+ Coverage   78.69%   79.45%   +0.76%     
==========================================
  Files          42       42              
  Lines        2774     2726      -48     
==========================================
- Hits         2183     2166      -17     
+ Misses        494      471      -23     
+ Partials       97       89       -8     
Flag Coverage Δ
unittests 79.45% <52.83%> (+0.76%) ⬆️

Flags with carried forward coverage won't be shown.



sonarcloud bot commented Sep 17, 2024

@mfrancisc (Contributor) left a comment

Looks good! 👍

Thanks for the simplification 🙏

I have only one question: do we have any data or idea about the performance impact of this switch? Will it be faster, the same, or slower, or are we unsure?

@MatousJobanek (Contributor Author)

Great question @mfrancisc.
We don't have any real data or documentation for this. But I've tested it locally: I created a few thousand UserSignups, MURs, Spaces, and SpaceBindings and tested the proxy, and there was no delay.
I haven't compared the numbers side by side, though. Let me try that.

@mfrancisc (Contributor)

I was mostly curious whether there are any docs or blog posts comparing the two approaches and their pros and cons.

@MatousJobanek (Contributor Author)

MatousJobanek commented Sep 17, 2024

As I see it, the biggest benefits/pros are:

  • it simplifies our code
  • we don't need to maintain our own implementation of the shared informers
  • the same client is used in all our controllers
  • we saw that the memory consumption is lower compared to our own custom implementation

@MatousJobanek (Contributor Author)

MatousJobanek commented Sep 17, 2024

So I have some numbers. I created 16k UserSignups, MURs, Spaces, and SpaceBindings (I mean 16k of each resource kind).
I made a request to get ConfigMaps from user's namespace via the proxy:

  • with the old custom implementation it took around 60-90 milliseconds
  • with the new one the numbers were the same
  • without any cache it took 2.3-2.5 seconds

As for the memory consumption of the reg-service pods:

  • old custom implementation: 700-800 MiB
  • new one: 300-450 MiB
  • without cache: around 35 MiB

@alexeykazakov (Contributor) left a comment

We used controller-runtime client(s) in our first proxy version, as far as I remember, assuming that the built-in client cache would be enough. But the performance was not consistent: every few minutes of constant proxy usage, the performance degraded. It looked like the cache was "reset" every few minutes or so.
This was the main reason why we switched to informers in the first place.
But maybe we did something differently back then? If so, what exactly did we do differently, and why should it work now?
I might be missing or forgetting something, though.
We also might need to do additional performance testing (or collaborate with the Konflux folks to closely monitor any potential performance issues after deploying it and be ready to revert it if needed).

cc: @rajivnathan

@MatousJobanek (Contributor Author)

I can test that; I still have the CRC cluster with all the resources.

@MatousJobanek (Contributor Author)

MatousJobanek commented Sep 18, 2024

TBH, I briefly discussed this plan with @rajivnathan, and he didn't mention anything about such issues.
Are you sure that the original proxy implementation used the same (or a similar) way of creating a cached client? Was it really a client with a cache, or a standard controller-runtime client (without a cache)?
Do you think we could still find it in the commit history?

@alexeykazakov (Contributor)

No, I'm not sure! Let me try to find the original code so we can compare.

@MatousJobanek (Contributor Author)

MatousJobanek commented Sep 18, 2024

I don't see any cache reset or performance difference after 15, 40, or 60 minutes:

$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 08:33:54.361713 2595794 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 08:33:54.434089 2595794 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 71 milliseconds
I0918 08:33:54.520956 2595794 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 79 milliseconds
I0918 08:33:54.656424 2595794 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 97 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      15h
openshift-service-ca.crt   1      15h

# waited for 40 minutes

$ rm -rf /tmp/dummy-dir/
$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 09:12:53.866156 2597468 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 09:12:53.908376 2597468 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 41 milliseconds
I0918 09:12:53.929227 2597468 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 19 milliseconds
I0918 09:12:53.968135 2597468 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 23 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      16h
openshift-service-ca.crt   1      16h

# waited for 15 minutes

$ rm -rf /tmp/dummy-dir/
$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 09:18:20.803630 2598547 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 09:18:20.833579 2598547 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 29 milliseconds
I0918 09:18:20.853703 2598547 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 19 milliseconds
I0918 09:18:20.895189 2598547 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 26 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      16h
openshift-service-ca.crt   1      16h

# waited for more than 1 hour

$ rm -rf /tmp/dummy-dir/
$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 10:21:25.334231 2604585 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 10:21:25.382830 2604585 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 47 milliseconds
I0918 10:21:25.407047 2604585 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 23 milliseconds
I0918 10:21:25.452966 2604585 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 30 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      17h
openshift-service-ca.crt   1      17h

There is also no difference in the memory consumption of the registration-service pods.

@MatousJobanek (Contributor Author)

MatousJobanek commented Sep 18, 2024

I realized one thing: since I deleted the oc cache, oc made discovery calls first, so the very first call wasn't the "get configmaps" one but other calls not shown at this verbosity level. Let me do that one more time without resetting the cache.
But from what I saw when running the commands, there wasn't any difference, really :-)

@MatousJobanek (Contributor Author)

OK, here are the numbers. To be sure, I also measured the execution time of the whole oc command:

$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 10:37:57.186574 2606282 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 10:37:57.214022 2606282 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 27 milliseconds
I0918 10:37:57.235949 2606282 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 20 milliseconds
I0918 10:37:57.269956 2606282 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 21 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      17h
openshift-service-ca.crt   1      17h

real	0m0.107s
user	0m0.118s
sys	0m0.024s

# waited for 22 minutes

$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 10:59:18.052817 2607023 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 10:59:18.094227 2607023 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 34 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      18h
openshift-service-ca.crt   1      18h

real	0m0.146s
user	0m0.148s
sys	0m0.033s

# waited for almost 90 minutes and made multiple calls:

$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 12:26:00.902698 2610103 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 12:26:00.960992 2610103 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 52 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      19h
openshift-service-ca.crt   1      19h

real	0m0.180s
user	0m0.109s
sys	0m0.045s

$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 12:26:09.873033 2610208 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 12:26:09.925783 2610208 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 40 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      19h
openshift-service-ca.crt   1      19h

real	0m0.191s
user	0m0.187s
sys	0m0.039s

$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 12:26:18.126960 2610483 loader.go:373] Config loaded from file:  /tmp/dummy-kubeconfig
I0918 12:26:18.156485 2610483 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 23 milliseconds
NAME                       DATA   AGE
kube-root-ca.crt           1      19h
openshift-service-ca.crt   1      19h

real	0m0.107s
user	0m0.118s
sys	0m0.024s

So there is no difference, really.

@MatousJobanek (Contributor Author)

There is actually a metric that should show the processing time of requests in the proxy, but I have no clue how to visualize it easily in a dev cluster.
Maybe @xcoulon knows. I guess I would need to configure Prometheus and the ServiceMonitor properly, but I'm not sure the process is worth going through.

@alexeykazakov (Contributor) left a comment

OK. In the original code we didn't use the controller-runtime cluster's client; we used a regular client instead. So I guess this is why you couldn't reproduce the problem with the cache being "reset" every few minutes.
Thank you for the extensive performance testing! I think it's good enough. If we get complaints about degraded proxy performance, we can revisit it. But for now I think we are good! 👍
I have just a couple of minor comments.
Great job!
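For reference, a minimal sketch (assumed wiring, not the historical proxy code) of the difference being discussed: a plain, uncached controller-runtime client versus the cache-backed client obtained from a controller-runtime cluster:

package main

import (
    "context"
    "fmt"

    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/client/config"
    "sigs.k8s.io/controller-runtime/pkg/cluster"
)

// newClients returns a plain (uncached) client and a cache-backed client from a cluster.
func newClients(ctx context.Context) (client.Client, client.Client, error) {
    cfg, err := config.GetConfig()
    if err != nil {
        return nil, nil, err
    }

    // Plain client: every Get/List goes straight to the API server, nothing is cached.
    plain, err := client.New(cfg, client.Options{})
    if err != nil {
        return nil, nil, err
    }

    // Cluster-backed client: reads are served from shared informer caches
    // once the cluster has been started and the caches have synced.
    cl, err := cluster.New(cfg)
    if err != nil {
        return nil, nil, err
    }
    go func() {
        _ = cl.Start(ctx) // starts and maintains the shared caches
    }()
    if !cl.GetCache().WaitForCacheSync(ctx) {
        return nil, nil, fmt.Errorf("caches did not sync")
    }
    return plain, cl.GetClient(), nil
}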

objectsToList := []client.ObjectList{&toolchainv1alpha1.ToolchainConfigList{}, &corev1.SecretList{}}
for i := range objectsToList {
if err := hostCluster.GetClient().List(ctx, objectsToList[i], client.InNamespace(configuration.Namespace())); err != nil {
objectsToList := map[string]client.ObjectList{
Contributor

I wonder how we could make it easier NOT to forget to update this list when we need to access more resource kinds via the client... Maybe add a comment to https://github.com/codeready-toolchain/host-operator/blob/master/deploy/registration-service/registration-service.yaml#L20 ?

Contributor Author

We list only the objects we want to cache upfront. As mentioned in the comment above this code, the cache works in a lazy way: it starts caching a GVK only after that GVK is fetched for the first time.
In other words, the cache for a GVK is populated only after the first call for that resource.

If we didn't do the List calls here, the very first call to the reg-service/proxy would take a long time, because only then would the client start populating the cache.
By listing the resources before the reg-service pod reports ready, we ensure that the cache is fully synced with all the resources we need cached.
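A minimal sketch of that warm-up step; the exact set of kinds and the function name are assumptions for illustration, following the List pattern shown in the diff above:

package main

import (
    "context"

    toolchainv1alpha1 "github.com/codeready-toolchain/api/api/v1alpha1"
    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// warmUpCache lists every kind we want pre-cached in the host namespace.
// The cached client starts an informer for a GVK on its first List/Get,
// so these calls block until the corresponding caches are synced.
func warmUpCache(ctx context.Context, cl client.Client, namespace string) error {
    objectsToList := []client.ObjectList{
        &toolchainv1alpha1.UserSignupList{},
        &toolchainv1alpha1.MasterUserRecordList{},
        &toolchainv1alpha1.SpaceList{},
        &toolchainv1alpha1.SpaceBindingList{},
        &toolchainv1alpha1.ToolchainConfigList{},
        &corev1.SecretList{},
    }
    for i := range objectsToList {
        if err := cl.List(ctx, objectsToList[i], client.InNamespace(namespace)); err != nil {
            return err
        }
    }
    return nil
}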

@alexeykazakov (Contributor) commented Sep 19, 2024

I got that. My point is that it's easy to forget to add a new resource to this list. Let's say we introduce some change to the proxy so it now needs to access a new resource of kind X. If we don't add it to this list, the proxy performance may degrade for the first request. It will degrade again after the proxy is restarted, for every replica. So, for example, if there are 10 replicas of reg-service/proxy, then up to 10 requests per deployment/version would be impacted.
You can't access a new kind without adding this X resource to the operator role in https://github.com/codeready-toolchain/host-operator/blob/master/deploy/registration-service/registration-service.yaml#L20
So I was wondering if that yaml is a good place to add a reminder to update the cache initialization if desired.

Contributor Author

As discussed in the call:

  1. I'll add a comment to the Role template
  2. I'll drop the ConfigMap from the Role
  3. I'll try to make the list for-loop a bit more generic so it iterates over all toolchain kinds except for specific ones, like TierTemplate, which we don't want to cache (there can be many of them and they are huge resources)

Contributor Author

As for items 1 and 2, see the PRs:
codeready-toolchain/host-operator#1089
#463

As for item 3, it's easily doable to iterate over all kinds of the toolchain API, but there is one problem: the scheme contains the API of both host and member, and at the scheme level we cannot tell which resource belongs to the host and which to the member. We could list the available API groups from the cluster and exclude the kinds that are not available, but this wouldn't work in single-cluster environments. In addition, the host-operator SA doesn't have permission to read/list member-operator CRDs, so it would fail on authorization as well.
In other words, there are too many complications in doing it generically, so it's not worth it.

"NSTemplateTier": &toolchainv1alpha1.NSTemplateTierList{},
"ToolchainConfig": &toolchainv1alpha1.ToolchainConfigList{},
"BannedUser": &toolchainv1alpha1.BannedUserList{},
"Secret": &corev1.SecretList{}}
Contributor

Suggested change
"Secret": &corev1.SecretList{}}
"Secret": &corev1.SecretList{},
"Config": &corev1.ConfigMapList{}}

While I don't see that we actually access ConfigMaps in reg-service, they are listed here:
https://github.com/codeready-toolchain/host-operator/blob/master/deploy/registration-service/registration-service.yaml#L43
So I wonder whether we should either add them to the cache initialization or remove them from the role.

Contributor

I believe Matous added these resources based on the informers we were creating in informers.go. If my memory serves me correctly, the resources listed there are the ones needed in the proxy flow, and we pre-populate the cache for them so that the proxy is as fast as possible even on first-time use by a user. I don't remember whether ConfigMaps are part of that flow, but I don't think it hurts to add them.

Contributor Author

@rajivnathan is correct, we want to cache only the resources we access in reg-service. Caching anything else doesn't make much sense.

We could theoretically drop the List calls for some resources, like Secrets, NSTemplateTiers, ProxyPlugins, ToolchainStatuses, and ToolchainConfigs, because we don't expect many of them to be present in the namespace, so the very first call wouldn't take much more time than when they are already cached. But I thought it's also completely fine to pre-populate the cache with all the resources the reg-service and proxy touch.

Contributor Author

As for the ConfigMap: if I'm not mistaken, the reg-service doesn't touch any ConfigMap in the toolchain-host-operator namespace. The permission in the Role is most likely a leftover from the time before we introduced ToolchainConfig, when we used a ConfigMap to configure the service and operators.
That being said, while it theoretically wouldn't hurt (as we cache resources only in the host-operator namespace), I would rather keep the cache minimal and focused only on the resources that are really accessed by the reg-service.
I would rather avoid a situation where we keep adding more and more resources to the cache without knowing why we are doing it.

Contributor

The cache can be crucial for performance, especially for the proxy, and it's easy to miss this (i.e. not add specific resources to the initialization step) during development. We don't have any performance tests in our CI.
If neither the reg-service nor the proxy touches ConfigMaps, then IMO we should remove them from the role.
And I would argue that it's safer to keep everything the reg-service & proxy access in the cache initialization.

I checked, and it looks like ConfigMaps are not used by the reg-service or the proxy. So let's create a separate PR to remove them from the role.

@rajivnathan (Contributor) left a comment

Great work @MatousJobanek! 🙌 It's very cool to see the memory footprint improvement as well! 🤩

> We used controller runtime client(s) in our first proxy version as far as I remember. Assuming that the built-in client cache would be enough. But the performance was not consistent. Every few minutes of constant proxy usage the performance degraded. It looked like the cache was "reset" every few minutes or so.
> This was the main reason why we switched to informers in the first place.
> But maybe we did something different back then? If so then what exactly we did differently and why it should work now?
> I might be missing or forgetting something though.

@alexeykazakov TBH, I don't remember the reason why we used informers. Looking back at the JIRA and the code, it sounds like we were using a simple cache that mapped users to cluster access (this part I recall), but that breaks when a user is retargeted to another cluster, and that was an issue we were trying to solve. I'm not sure whether there were other issues that led us to use informers over the runtime client. Maybe we just didn't realize at the time that the runtime client is backed by informers?

@@ -98,26 +99,29 @@ func (c *userSignupClient) listActiveSignupsByLabelForHashedValue(labelKey, valu
// label matching the specified label
func (c *userSignupClient) listActiveSignupsByLabel(labelKey, labelValue string) ([]*crtapi.UserSignup, error) {

stateRequirement, err := labels.NewRequirement(crtapi.UserSignupStateLabelKey, selection.NotEquals, []string{crtapi.UserSignupStateLabelValueDeactivated})
// TODO add unit tests checking that the label selection works as expected. Right now, it's not possible to do that thanks to the great abstraction and multiple level of layers, mocks, and services.
Contributor

😆

Contributor Author

Yes, this is one big 🤦‍♂️
I was looking at options for covering that in the unit tests of the verification function, but unfortunately it's not easily doable with the current setup.
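For what it's worth, a minimal sketch, written as a hypothetical standalone test rather than something wired into the existing mocks, of how the NotEquals label selection could be exercised against a controller-runtime fake client (which filters List results by label selector); the "approved" state value is used only as an arbitrary non-deactivated label:

package service_test

import (
    "context"
    "testing"

    toolchainv1alpha1 "github.com/codeready-toolchain/api/api/v1alpha1"
    "github.com/stretchr/testify/require"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/selection"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/client/fake"
)

func TestListActiveSignupsSkipsDeactivated(t *testing.T) {
    scheme := runtime.NewScheme()
    require.NoError(t, toolchainv1alpha1.AddToScheme(scheme))

    active := &toolchainv1alpha1.UserSignup{ObjectMeta: metav1.ObjectMeta{
        Name: "active", Namespace: "toolchain-host-operator",
        Labels: map[string]string{toolchainv1alpha1.UserSignupStateLabelKey: "approved"},
    }}
    deactivated := &toolchainv1alpha1.UserSignup{ObjectMeta: metav1.ObjectMeta{
        Name: "deactivated", Namespace: "toolchain-host-operator",
        Labels: map[string]string{toolchainv1alpha1.UserSignupStateLabelKey: toolchainv1alpha1.UserSignupStateLabelValueDeactivated},
    }}
    cl := fake.NewClientBuilder().WithScheme(scheme).WithObjects(active, deactivated).Build()

    // state != deactivated, the same requirement built in listActiveSignupsByLabel
    req, err := labels.NewRequirement(toolchainv1alpha1.UserSignupStateLabelKey,
        selection.NotEquals, []string{toolchainv1alpha1.UserSignupStateLabelValueDeactivated})
    require.NoError(t, err)

    list := &toolchainv1alpha1.UserSignupList{}
    require.NoError(t, cl.List(context.TODO(), list,
        client.InNamespace("toolchain-host-operator"),
        client.MatchingLabelsSelector{Selector: labels.NewSelector().Add(*req)}))

    require.Len(t, list.Items, 1)
    require.Equal(t, "active", list.Items[0].Name)
}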


openshift-ci bot commented Sep 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov, MatousJobanek, mfrancisc, rajivnathan, xcoulon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [MatousJobanek,alexeykazakov,mfrancisc,rajivnathan,xcoulon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MatousJobanek merged commit 36ce810 into codeready-toolchain:master Sep 20, 2024
9 of 12 checks passed