[WIP] Make /proc/sys read-only with carve-outs for some sysctls #3518
This mounts a read-write copy of /proc and /sys under /kind/private, which allows bind mounting and also makes possible use cases that need an unmasked proc or sys. /proc/sys is bind mounted read-only per the systemd container interface [1]. Selected sysctls are then made writable again by bind mounting them across from the private /proc.

This may cause issues for privileged daemonsets that set sysctls which aren't namespaced (this may work anyway, as they often set them to the same value on multiple nodes). That can be worked around by adding additional bind mounts via docker exec, which makes it clear that kind can't support such interfaces and that they might leak from the container.

[1]: https://systemd.io/CONTAINER_INTERFACE/
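The sequence described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the function name `plan_proc_sys_mounts` is hypothetical, the allow-list entries passed to it are made-up examples, and the function prints the mount commands instead of running them so that it can be inspected without root.

```shell
#!/bin/sh
# Sketch only: prints the mount steps rather than executing them.
# plan_proc_sys_mounts is a hypothetical name; its arguments are an
# illustrative allow list of paths relative to /proc/sys.
plan_proc_sys_mounts() {
  # 1. Private read-write copies of proc and sysfs.
  echo "mkdir -p /kind/private/proc /kind/private/sys"
  echo "mount -t proc proc /kind/private/proc"
  echo "mount -t sysfs sysfs /kind/private/sys"
  # 2. Make /proc/sys read-only via a bind mount plus a read-only remount.
  echo "mount --bind /proc/sys /proc/sys"
  echo "mount -o remount,bind,ro /proc/sys"
  # 3. Carve out each allowed sysctl from the private, still-writable /proc.
  for mount_point in "$@"; do
    echo "mount --bind -o rw /kind/private/proc/sys/${mount_point} /proc/sys/${mount_point}"
  done
}

# Example allow list (assumed values, not the PR's list).
plan_proc_sys_mounts vm/overcommit_memory kernel/panic
```

Printing the plan keeps the sketch side-effect free; the real entrypoint would run the commands and, as the diff below shows, first check that the private path exists.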
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dgl.
Hi @dgl. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test.
if [[ -f /kind/private/proc/sys/"${mount_point}" ]]; then
  mount --bind -o rw /kind/private/proc/sys/"${mount_point}" /proc/sys/"${mount_point}"
fi
done
I'm not sure about the robustness of this.
I think this should be opt-in.
log_info 'remounting /sys read-only'
# systemd-in-a-container should have read only /sys
# https://systemd.io/CONTAINER_INTERFACE/
# however, we need other things from `docker run --privileged` ...
# and this flag also happens to make /sys rw, amongst other things
#
# This step is ignored when running inside UserNS, because it fails with EACCES.
# This step is ignored when running inside UserNS, because it can fail with
# EACCES.
Unnecessary change
Rootless mode (https://kind.sigs.k8s.io/docs/user/rootless/) almost solves this issue.
I'm really hesitant to ship a change like this because it's hard to say how we'll break users that have come to rely on this over the years. Disabling something like udev/binfmt_misc, on the other hand, is cheap and reasonable, at the risk of missing some future systemd behavior.
Is this still a WIP? As Ben mentioned, it does seem like a risky change to make, so if we're not going to do it, it might be good to close.
As mentioned on #3511 this could be a more complete way to ensure systemd or other components don't change sysctls unexpectedly. This also makes sysfs mountable per #3436 (but that is just the mount of sysfs on /kind/private/sys, so can easily be split, aside from any naming preferences).

WIP as I'm not sure it's the best option, but possibly better than fragile breakage due to unexpected sysctl changes.
The downside is it needs an allow list of sysctls which is probably going to need additions for other use cases, but it does mean kind can be explicit about what is supported.
The workaround to add a sysctl as writable would be:
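A minimal sketch of what that could look like, assuming a node container named `kind-control-plane` and the sysctl `vm.overcommit_memory` (both are illustrative choices, not taken from the PR). The helper name `sysctl_writable_cmd` is hypothetical; it only prints the `docker exec` command so the command can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper: build the docker exec command that bind mounts one
# sysctl read-write from the private /proc over the read-only /proc/sys.
# $1 = node container name, $2 = sysctl in dotted form.
sysctl_writable_cmd() {
  node="$1"
  # Convert dotted form to a path: vm.overcommit_memory -> vm/overcommit_memory.
  # Note: this naive conversion breaks for names whose components contain dots.
  path=$(printf '%s' "$2" | tr . /)
  printf 'docker exec %s mount --bind -o rw /kind/private/proc/sys/%s /proc/sys/%s\n' \
    "$node" "$path" "$path"
}

# Print the command for an assumed node name and sysctl.
sysctl_writable_cmd kind-control-plane vm.overcommit_memory
```

The mount arguments mirror the `mount --bind -o rw` line from the entrypoint diff, so a manually added sysctl behaves the same as one on the allow list.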
(This won't yet support running in some userns configurations, but fixing that should just be a matter of ignoring the error from mount if it fails; the mount can work, depending on the exact userns environment. In a user namespace the host's sysctls can't be modified anyway. I can test userns cases if this option is worth taking further.)
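Ignoring the mount error could be done with a small wrapper along these lines. This is a sketch under assumptions: `try_mount` is a hypothetical name, and the warning text is made up rather than taken from kind's entrypoint.

```shell
#!/bin/sh
# Hypothetical wrapper: run the given command and downgrade a failure to a
# warning, so a mount refused inside a user namespace does not abort setup.
try_mount() {
  if ! "$@"; then
    echo "WARN: '$*' failed (possibly a user namespace restriction), continuing" >&2
  fi
  return 0
}

# Intended use (not executed here):
#   try_mount mount --bind -o rw /kind/private/proc/sys/vm/overcommit_memory \
#     /proc/sys/vm/overcommit_memory
```

The trade-off is that a genuinely broken mount is also only logged, so the warning should be easy to spot in the node's startup output.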