Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operator deletes tigera-system namespace on ApiServer deployment #2912

Closed
difrost opened this issue Oct 3, 2023 · 7 comments · Fixed by #3228
Closed

Operator deletes tigera-system namespace on ApiServer deployment #2912

difrost opened this issue Oct 3, 2023 · 7 comments · Fixed by #3228

Comments

@difrost
Copy link

difrost commented Oct 3, 2023

Operator running in tigera-system will perform self-destruction (delete own Namespace) if ApiServer is enabled.

Expected Behavior

Operator should never attempt to delete a namespace in which it is deployed.

Current Behavior

Operator running in tigera-system will perform self-destruction (delete own Namespace) if ApiServer is enabled.

Steps to Reproduce (for bugs)

Deploy Tigera Oprator in tigera-system and enable ApiServer.

Your Environment

We are running Tiger Operator 1.29.4 with Calico 3.25.1 and ApiServer disabled. The reason for such a setup is that we've seen major issues with internal controllers (registering as owners of certain namespaces) being broken when ApiServer was deployed. Now that I've got some free cycles I wanted to return to this issue and try to investigate it further (in the meantime we've done a number of upgrades of our Tigera/Calico stack hence the issue might be gone already). To my surprise when I dropped the ApiServer resource the tigera-system namespace was deleted (that's where we install the Operator). To be precise it was still hanging in the Terminating state as there were many references to the Operator that were preventing clean deletion.

@difrost difrost changed the title Operators self-destruction on failed ApiServer deployment Operators deletes tigera-system namespace on ApiServer deployment Oct 18, 2023
@tmjd
Copy link
Member

tmjd commented Oct 18, 2023

Running the operator in one of the namespaces it uses/manages is likely to cause other problems. It is recommended to run the operator in the tigera-operator namespace and if it is desired to use a different namespace then it should be a dedicated namespace that only the operator will run in.

The reason the operator is deleting that namespace is because we use that namespace (tigera-system) in the Enterprise variant of APIServer and to ensure Enterprise resources are cleaned up when running the Calico variant it will delete the Enterprise variant of resources. This is expected behavior.

@difrost
Copy link
Author

difrost commented Oct 18, 2023

Investigation

Api Server is deployed to either calico-apiserver or tigera-system namespace: see https://github.com/tigera/operator/blob/master/pkg/render/common/meta/meta.go#L108

When Operator is installed in tigera-system namespace Reconciliation will include this namespace under globalEnterpriseObjects: https://github.com/tigera/operator/blob/master/pkg/render/apiserver.go#L236 that is later appended (see https://github.com/tigera/operator/blob/master/pkg/render/apiserver.go#L282) to a list of objects that are marked for deletion with a Calico installation variant.

Same situation would happen for an Enterprise variant if, somehow, the Operator would be installed in calico-apiserver namespace.

We can easily avoid the deletion of a namespace (the namespace where Operator is running is available via common.OperatorNamespace()) but in such case, we can't rely on the GC cleaning the namespaced objects.

I would suggest maintaining two lists for namespaced objects (both for Calico and Enterprise) and on a clash append corresponding lists to deleted objects or just the namespace if there's no clash. Any thoughts/comments? I can take care of this.

CC: @caseydavenport

@difrost
Copy link
Author

difrost commented Oct 18, 2023

@tmjd we've been running Operator in tigera-system before the #5649 PR allowed us to install it in any namespace and we rely on the '*-system' naming convention for platform components. A self-destruction of the Operator is sth that I would not expect. The #2932 will allow us to identify potential clash and (see my comment above) we can also easily avoid deleting tigera-system with ApiServer allowing some freedom on where the Operator is running.

@difrost difrost changed the title Operators deletes tigera-system namespace on ApiServer deployment Operator deletes tigera-system namespace on ApiServer deployment Oct 18, 2023
@tmjd
Copy link
Member

tmjd commented Oct 18, 2023

To mitigate this issue, it should be very easy to migrate to a different namespace, like tigera-operator-system if you need to keep it ending in -system. To migrate, stop the current operator, copy all the secrets and configmaps to the new namespace, deploy the operator in the new namespace.

Like I mentioned running the operator in a namespace it manages could cause other problems. I think if we add any changes to support a non-recommended setup, I'd suggest we add something to startup to try to check that the operator isn't being deployed to a namespace it might attempt to manage. What I mean is something that we add at startup to have the operator compare its namespace to a list of ones it might use and if it finds itself running in one of those it would exit/crash. That would prevent the operator from ever getting in a situation where the namespace it is running in would be deleted because of its management.

I'm not sure if this is a good idea since the reuse of the names for other operator resource (like its ServiceAccount, ClusterRole, and ClusterRoleBinding) could still be done and cause problems.

@difrost
Copy link
Author

difrost commented Oct 18, 2023

Like I mentioned running the operator in a namespace it manages could cause other problems.

Self-destruction is a pretty serious "problem". In such case indeed Operator should refuse to start if deployed to a namespace it is about to manage. That should also be documented in detail.

I'll check if we can move the installation easily.

I guess we can close this issue and the associated PR in favor of a new one that will block Operator from starting in one of the "opinionated" namespaces.

Cheers

@caseydavenport
Copy link
Member

caseydavenport commented Oct 18, 2023

I agree about failing to start if running in one of these namespaces - @tmjd what do you think about having a change to our main() function that simply exits if we're running in a known namespace managed by the operator? e.g., calico-system, tigera-system, tigera-apiserver, etc.

I think we want the operator to simply die if anyone tried to run in those namespaces, because otherwise unexpected behavior will occur somewhere further along in the operator code (probably more can go wrong than simply deleting its own namespace)

@tmjd
Copy link
Member

tmjd commented Oct 19, 2023

Yep, that is exactly what I was thinking would be a reasonable course of action.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants