-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: Wait for the certs to be mounted inside the container #2198
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
5e9e0f3
to
cddd82d
Compare
@@ -67,11 +70,34 @@ func (c *CertGenerator) Start(ctx context.Context) error { | |||
if err := c.generate(ctx); err != nil { | |||
return err | |||
} | |||
if err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a requirement for a timeout? How does the user get to know that it is being waited for certs to be ready?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a requirement for a timeout?
Yes, timeout is 5 minutes.
How does the user get to know that it is being waited for certs to be ready?
The controller pod doesn't ready until the cert is ready.
NAME READY STATUS RESTARTS AGE
katib-controller-687798c75b-mtlz7 0/1 Running 0 4s
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-08-13T11:48:12Z"
message: 'containers with unready status: [katib-controller]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
controller@sha256:86697de94ac8dfbe53082beb96f1ea4ffea2bac982a2332719a99224e17335c7
lastState: {}
name: katib-controller
ready: false
restartCount: 0
started: true
state:
running:
startedAt: "2023-08-13T11:48:14Z"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant, how will the user know that Katib controller is waiting for certs to get ready?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The controller doesn't say, Waiting for certs to get ready
. So, there are two ways that users can indirectly know whether the controller waits for certs to get ready:
- As shown above, confirmation pod condition.
- Creating the Experiments. If the controller waits for certs, users can get the following error:
Error from server (InternalError): error when creating "../../testdata/valid-experiment.yaml": Internal error occurred: failed calling webhook "defaulter.experiment.katib.kubeflow.org": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Adding a log, Waiting for certs to get ready
, to L72 in this file might be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a log is better because user can understand the reason behind the issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
cddd82d
to
23dcbb5
Compare
} | ||
// We need to wait for certs to get ready even though using cert-manager. | ||
klog.Info("Waiting for certs to get ready.") | ||
if err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though we use cert-manager, the controller must wait for certs to be injected into the Secret resource.
namespace: consts.DefaultKatibNamespace, | ||
webhookServiceName: config.WebhookServiceName, | ||
webhookSecretName: config.WebhookSecretName, | ||
kubeClient: mgr.GetClient(), | ||
certsReady: certsReady, | ||
enableGenerateCerts: config.Enable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we set fullServiceDomain
here when we configure CertGenerator struct ?
Currently, we set it here, and I think we can specify the CertGenerator parameter in a single place.
namespace: consts.DefaultKatibNamespace, | |
webhookServiceName: config.WebhookServiceName, | |
webhookSecretName: config.WebhookSecretName, | |
kubeClient: mgr.GetClient(), | |
certsReady: certsReady, | |
enableGenerateCerts: config.Enable, | |
namespace: consts.DefaultKatibNamespace, | |
webhookServiceName: config.WebhookServiceName, | |
fullServiceDomain: strings.Join([]string{config.WebhookServiceName, consts.DefaultKatibNamespace, "svc"}, "."), | |
webhookSecretName: config.WebhookSecretName, | |
kubeClient: mgr.GetClient(), | |
certsReady: certsReady, | |
enableGenerateCerts: config.Enable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense. Thanks!
} | ||
// We need to wait for certs to get ready even though using cert-manager. | ||
klog.Info("Waiting for certs to get ready.") | ||
if err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnugeorge @tenzen-y Does it look confusing that we have this timeout in our generator package even if Katib Cert Generator doesn't used ?
Maybe we can add this timeout to the setupControllers
section before we adding the Webhook to the manager ?
We can rename certsReady
channel to certsGenerated
to be more precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think locating this timeout here would be better since this timeout is a function for the webhook certs.
However, I agree with your confusion. So renaming the certgenerator
with certmanager
or certcontroller
might be better to clarify the purpose of this component.
@andreyvelich WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think certcontroller
would be better since certmanager
conflicts context with cert-manager
:-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we rename it to certcontroller
how we can deal with Katib Config Settings ?
If user wants the enable cert generator they set the following setting:
certGenerator:
enable: true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry. I will revert 69e9dfe since we can avoid the timeout error in the kubeflow installation by removing katib-webhook-cert Secret from kubeflow installation.
So, we wouldn't have to change the KatibConfig. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the certcontroller
will start only in the standalone installation (certGenerator.enable=true).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, if we are going to remove this timeout from Kubeflow installation, we need to name it as certgenerator
since it is only internal logic that we have when Katib certGenerator
is enabled.
Let's keep it as certgenerator
for now and use this timeout only when our internal Cert Generator is used.
Does it sound good @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich That sounds good to me. Thanks for the great suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
certFile := filepath.Join(consts.CertDir, serverCertName) | ||
if _, err := os.Stat(certFile); err != nil { | ||
return false, nil | ||
} | ||
keyFile := filepath.Join(consts.CertDir, serverKeyName) | ||
if _, err := os.Stat(keyFile); err != nil { | ||
return false, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the log line here as well ? E.g.
klog.Info("Public key file %v doesn't exist in the container yet", certFile)
klog.Info("Private key file %v doesn't exist in the container yet", keyFile)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hesitate to add the log since the logs will be output in large quantities and the controller's logs will be dirty.
So output logs every 30 sec or 15 sec might be better.
@andreyvelich WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that also works, if we can print them every 15 sec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
69e9dfe
to
fa81a60
Compare
@andreyvelich I addressed your suggestions. Please take another look. |
Thank you @tenzen-y! |
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
…ainer Signed-off-by: Yuki Iwai <[email protected]>
fa81a60
to
9d85d60
Compare
New changes are detected. LGTM label has been removed. |
/lgtm |
…2198) * Wait for the certs to be mounted inside the container Signed-off-by: Yuki Iwai <[email protected]> * Initialize fullServiceDomain when adding certgenerator to the manager Signed-off-by: Yuki Iwai <[email protected]> * Output logs every 15 seconds if the certs don't yet exist in the container Signed-off-by: Yuki Iwai <[email protected]> --------- Signed-off-by: Yuki Iwai <[email protected]>
… inside the container (#2213) * Wait for the certs to be mounted inside the container * Initialize fullServiceDomain when adding certgenerator to the manager * Output logs every 15 seconds if the certs don't yet exist in the container --------- Signed-off-by: Yuki Iwai <[email protected]>
What this PR does / why we need it:
To avoid the below error the cert-generator should wait for the certs to be mounted inside the container:
Currently, the cert-generator immediately sends data to the
certsReady
channel once the certs are generated.However, since completed to generate certs doesn't mean the mounted secret is updated with certs, the cert-generator should wait for the mounted secret to be updated by kubelet.
ref: https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: