Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Wait for the certs to be mounted inside the container #2198

Merged
merged 3 commits into from
Aug 15, 2023

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Aug 5, 2023

What this PR does / why we need it:
To avoid the below error the cert-generator should wait for the certs to be mounted inside the container:

{"level":"error","ts":"2023-08-05T18:14:44Z","logger":"entrypoint","msg":"Unable to run the manager","error":"open /tmp/cert/tls.crt: no such file or directory","stacktrace":"main.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1beta1/main.go:164\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

Currently, the cert-generator immediately sends data to the certsReady channel once the certs are generated.
However, since completed to generate certs doesn't mean the mounted secret is updated with certs, the cert-generator should wait for the mounted secret to be updated by kubelet.

ref: https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -67,11 +70,34 @@ func (c *CertGenerator) Start(ctx context.Context) error {
if err := c.generate(ctx); err != nil {
return err
}
if err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a requirement for a timeout? How does the user get to know that it is being waited for certs to be ready?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a requirement for a timeout?

Yes, timeout is 5 minutes.

How does the user get to know that it is being waited for certs to be ready?

The controller pod doesn't ready until the cert is ready.

NAME                                READY   STATUS              RESTARTS   AGE
katib-controller-687798c75b-mtlz7   0/1     Running             0          4s

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-08-13T11:48:12Z"
    message: 'containers with unready status: [katib-controller]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
controller@sha256:86697de94ac8dfbe53082beb96f1ea4ffea2bac982a2332719a99224e17335c7
    lastState: {}
    name: katib-controller
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-08-13T11:48:14Z"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, how will the user know that Katib controller is waiting for certs to get ready?

Copy link
Member Author

@tenzen-y tenzen-y Aug 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The controller doesn't say, Waiting for certs to get ready. So, there are two ways that users can indirectly know whether the controller waits for certs to get ready:

  1. As shown above, confirmation pod condition.
  2. Creating the Experiments. If the controller waits for certs, users can get the following error:
Error from server (InternalError): error when creating "../../testdata/valid-experiment.yaml": Internal error occurred: failed calling webhook "defaulter.experiment.katib.kubeflow.org": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block

Adding a log, Waiting for certs to get ready, to L72 in this file might be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a log is better because user can understand the reason behind the issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

}
// We need to wait for certs to get ready even though using cert-manager.
klog.Info("Waiting for certs to get ready.")
if err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we use cert-manager, the controller must wait for certs to be injected into the Secret resource.

Comment on lines 113 to 118
namespace: consts.DefaultKatibNamespace,
webhookServiceName: config.WebhookServiceName,
webhookSecretName: config.WebhookSecretName,
kubeClient: mgr.GetClient(),
certsReady: certsReady,
enableGenerateCerts: config.Enable,
Copy link
Member

@andreyvelich andreyvelich Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set fullServiceDomain here when we configure CertGenerator struct ?
Currently, we set it here, and I think we can specify the CertGenerator parameter in a single place.

Suggested change
namespace: consts.DefaultKatibNamespace,
webhookServiceName: config.WebhookServiceName,
webhookSecretName: config.WebhookSecretName,
kubeClient: mgr.GetClient(),
certsReady: certsReady,
enableGenerateCerts: config.Enable,
namespace: consts.DefaultKatibNamespace,
webhookServiceName: config.WebhookServiceName,
fullServiceDomain: strings.Join([]string{config.WebhookServiceName, consts.DefaultKatibNamespace, "svc"}, "."),
webhookSecretName: config.WebhookSecretName,
kubeClient: mgr.GetClient(),
certsReady: certsReady,
enableGenerateCerts: config.Enable,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense. Thanks!

}
// We need to wait for certs to get ready even though using cert-manager.
klog.Info("Waiting for certs to get ready.")
if err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnugeorge @tenzen-y Does it look confusing that we have this timeout in our generator package even if Katib Cert Generator doesn't used ?
Maybe we can add this timeout to the setupControllers section before we adding the Webhook to the manager ?

We can rename certsReady channel to certsGenerated to be more precise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think locating this timeout here would be better since this timeout is a function for the webhook certs.
However, I agree with your confusion. So renaming the certgenerator with certmanager or certcontroller might be better to clarify the purpose of this component.

@andreyvelich WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think certcontroller would be better since certmanager conflicts context with cert-manager :-/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rename it to certcontroller how we can deal with Katib Config Settings ?
If user wants the enable cert generator they set the following setting:

  certGenerator:
    enable: true

Copy link
Member Author

@tenzen-y tenzen-y Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry. I will revert 69e9dfe since we can avoid the timeout error in the kubeflow installation by removing katib-webhook-cert Secret from kubeflow installation.

So, we wouldn't have to change the KatibConfig. WDYT?

Copy link
Member Author

@tenzen-y tenzen-y Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the certcontroller will start only in the standalone installation (certGenerator.enable=true).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, if we are going to remove this timeout from Kubeflow installation, we need to name it as certgenerator since it is only internal logic that we have when Katib certGenerator is enabled.
Let's keep it as certgenerator for now and use this timeout only when our internal Cert Generator is used.
Does it sound good @tenzen-y ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich That sounds good to me. Thanks for the great suggestion!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 94 to 114
certFile := filepath.Join(consts.CertDir, serverCertName)
if _, err := os.Stat(certFile); err != nil {
return false, nil
}
keyFile := filepath.Join(consts.CertDir, serverKeyName)
if _, err := os.Stat(keyFile); err != nil {
return false, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the log line here as well ? E.g.

klog.Info("Public key file %v doesn't exist in the container yet", certFile)
klog.Info("Private key file %v doesn't exist in the container yet", keyFile)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hesitate to add the log since the logs will be output in large quantities and the controller's logs will be dirty.
So output logs every 30 sec or 15 sec might be better.
@andreyvelich WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that also works, if we can print them every 15 sec.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@tenzen-y
Copy link
Member Author

@andreyvelich I addressed your suggestions. Please take another look.

@andreyvelich
Copy link
Member

Thank you @tenzen-y!
/lgtm
/hold for review
/assign @johnugeorge

@google-oss-prow
Copy link

New changes are detected. LGTM label has been removed.

@johnugeorge
Copy link
Member

/lgtm

@google-oss-prow google-oss-prow bot merged commit 1b68744 into kubeflow:master Aug 15, 2023
58 checks passed
tenzen-y added a commit to tenzen-y/katib that referenced this pull request Aug 16, 2023
…2198)

* Wait for the certs to be mounted inside the container

Signed-off-by: Yuki Iwai <[email protected]>

* Initialize fullServiceDomain when adding certgenerator to the manager

Signed-off-by: Yuki Iwai <[email protected]>

* Output logs every 15 seconds if the certs don't yet exist in the container

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
@tenzen-y tenzen-y deleted the wait-certs-mounted branch August 16, 2023 08:15
google-oss-prow bot pushed a commit that referenced this pull request Aug 16, 2023
… inside the container (#2213)

* Wait for the certs to be mounted inside the container



* Initialize fullServiceDomain when adding certgenerator to the manager



* Output logs every 15 seconds if the certs don't yet exist in the container



---------

Signed-off-by: Yuki Iwai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants