[WIP] Make /proc/sys read-only with carve-outs for some sysctls #3518

dgl · 2024-02-13T03:05:04Z

As mentioned on #3511 this could be a more complete way to ensure systemd or other components don't change sysctls unexpectedly. This also makes sysfs mountable per #3436 (but that is just the mount of sysfs on /kind/private/sys, so can easily be split, aside from any naming preferences).

WIP as I'm not sure it's the best option, but possibly better than fragile breakage due to unexpected sysctl changes.

The downside is that it needs an allow list of sysctls, which will probably need additions for other use cases, but it does mean kind can be explicit about what is supported.

The workaround to add a sysctl as writable would be:

docker exec a-node mount --rbind /kind/private/proc/sys/some-sysctl /proc/sys/some-sysctl
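
For anyone scripting the workaround, a small helper could build that command from a dotted sysctl name. This is only a sketch: `sysctl_carve_out_cmd` is a hypothetical name, it prints the command instead of running it, and it assumes no component of the sysctl name contains a dot (interface names such as `eth0.100` would break the translation).

```shell
# Hypothetical helper, not part of the PR: print the docker exec command
# that would make one sysctl writable again inside a kind node.
# Nothing is executed; pipe the output to a shell to apply it.
sysctl_carve_out_cmd() {
  local node="$1" name="$2" path
  # Translate a dotted sysctl name (net.ipv4.ip_forward) into its
  # /proc/sys relative path (net/ipv4/ip_forward).
  # Assumption: no path component itself contains a dot.
  path="$(printf '%s' "$name" | tr . /)"
  printf 'docker exec %s mount --rbind /kind/private/proc/sys/%s /proc/sys/%s\n' \
    "$node" "$path" "$path"
}

# Example: show the command for net.ipv4.ip_forward on the node "a-node".
sysctl_carve_out_cmd a-node net.ipv4.ip_forward
```

Printing rather than executing keeps the helper safe to try outside a kind node and makes it easy to review what would be mounted.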

(This doesn't yet support running in some userns configurations. The fix should be a matter of ignoring the error from mount when it fails; whether the mount works depends on the exact userns environment. In a user namespace the host's sysctls can't be modified anyway. I can test userns cases if this option is worth taking further.)
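
A minimal sketch of what "ignoring the error from mount" could look like in the entrypoint. `mount_best_effort` and the `MOUNT_CMD` override are hypothetical names introduced here for illustration (the override only exists so the function can be exercised without privileges); the PR itself does not contain this helper.

```shell
# Hypothetical wrapper, not part of the PR: run mount with the given
# arguments, but only warn if it fails, since some user namespace
# environments reject these mounts.
# MOUNT_CMD is an assumed override for testing; it defaults to mount.
mount_best_effort() {
  if ! "${MOUNT_CMD:-mount}" "$@"; then
    echo "WARN: mount $* failed (possibly a user namespace restriction), continuing" >&2
  fi
  # Always report success so a caller running under `set -e` keeps going.
  return 0
}
```

The trade-off is that a genuinely broken mount is only visible in the logs, so the warning text matters.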

This mounts a read-write version of /proc and /sys under /kind/private, which allows bind mounting and also makes use cases that need an unmasked proc or sys possible. /proc/sys is bind mounted read-only per the systemd container interface [1]. Some sysctls are then made writable again by bind mounting them across from the private /proc.

This may cause issues for privileged daemonsets that set sysctls which aren't namespaced (though it may work anyway, as they often set them to the same value on multiple nodes). That can be worked around by adding additional bind mounts via docker exec, which makes it clear that kind can't support such interfaces and that they might leak from the container.

[1]: https://systemd.io/CONTAINER_INTERFACE/
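
The sequence described above can be sketched as a printed plan. This is an illustration under assumptions, not the PR's code: `plan_proc_sys_mounts` and the `ALLOWED_SYSCTLS` entries are hypothetical, the mount points under /kind/private are assumed to exist already, and making /proc/sys read-only via a self bind mount plus `remount,bind,ro` is one common approach that may differ from what the entrypoint does. Only the final carve-out command mirrors the form used in the diff.

```shell
# Hypothetical allow list (space separated, paths relative to /proc/sys).
# These entries are examples only, not the list proposed in the PR.
ALLOWED_SYSCTLS="net/ipv4/ip_forward net/netfilter/nf_conntrack_max"

# Print the mount commands in order instead of executing them.
plan_proc_sys_mounts() {
  local private_root="${1:-/kind/private}" s
  # 1. Read-write proc and sysfs under the private root.
  echo "mount -t proc proc ${private_root}/proc"
  echo "mount -t sysfs sysfs ${private_root}/sys"
  # 2. Make /proc/sys read-only: bind it onto itself, then remount ro.
  echo "mount --bind /proc/sys /proc/sys"
  echo "mount -o remount,bind,ro /proc/sys"
  # 3. Carve out the allowed sysctls from the private, writable /proc.
  for s in $ALLOWED_SYSCTLS; do
    echo "mount --bind -o rw ${private_root}/proc/sys/${s} /proc/sys/${s}"
  done
}

plan_proc_sys_mounts /kind/private
```

Order matters here: the private /proc has to be mounted before /proc/sys goes read-only, otherwise there is no writable source to bind the carve-outs from.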

k8s-ci-robot · 2024-02-13T03:05:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dgl
Once this PR has been reviewed and has the lgtm label, please assign aojea for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-02-13T03:05:13Z

Hi @dgl. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

AkihiroSuda · 2024-02-13T13:32:47Z

images/base/files/usr/local/bin/entrypoint

+    if [[ -f /kind/private/proc/sys/"${mount_point}" ]]; then
+      mount --bind -o rw /kind/private/proc/sys/"${mount_point}" /proc/sys/"${mount_point}"
+    fi
+  done


I'm not sure about the robustness of this.
I think this should be opt-in.
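
One way an opt-in could be gated, as a sketch only: `KIND_READONLY_PROC_SYS` is a hypothetical variable name, not an existing kind option, and how it would reach the node container is left open.

```shell
# Hypothetical opt-in check, not part of the PR.
# KIND_READONLY_PROC_SYS is an assumed variable name; the default is off.
readonly_proc_sys_enabled() {
  case "${KIND_READONLY_PROC_SYS:-false}" in
    true|1|yes) return 0 ;;
    *) return 1 ;;
  esac
}

if readonly_proc_sys_enabled; then
  echo "read-only /proc/sys requested, applying carve-outs"
else
  echo "leaving /proc/sys writable (default)"
fi
```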

AkihiroSuda · 2024-02-13T13:32:55Z

images/base/files/usr/local/bin/entrypoint

  log_info 'remounting /sys read-only'
  # systemd-in-a-container should have read only /sys
  # https://systemd.io/CONTAINER_INTERFACE/
  # however, we need other things from `docker run --privileged` ...
  # and this flag also happens to make /sys rw, amongst other things
  #
-  # This step is ignored when running inside UserNS, because it fails with EACCES.
+  # This step is ignored when running inside UserNS, because it can fail with
+  # EACCES.


Unnecessary change

AkihiroSuda · 2024-02-13T13:36:35Z

ensure systemd or other components don't change sysctls unexpectedly

Rootless mode ( https://kind.sigs.k8s.io/docs/user/rootless/ ) almost solves this issue.

BenTheElder · 2024-02-13T18:34:47Z

As mentioned on #3511 this could be a more complete way to ensure systemd or other components don't change sysctls unexpectedly. This also makes sysfs mountable per #3436 (but that is just the mount of sysfs on /kind/private/sys, so can easily be split, aside from any naming preferences).

I'm really hesitant to ship a change like this, because it's hard to say how we'll break users who have come to rely on this over the years. Disabling something like udev/binfmt_misc, on the other hand, is cheap and reasonable, at the risk of missing some future systemd behavior.

stmcginnis · 2024-07-19T16:29:02Z

Is this still a WIP? As Ben mentioned, it does seem like a risky change to make, so if we're not going to do it, it might be good to close.

k8s-ci-robot added the do-not-merge/work-in-progress and cncf-cla: yes labels · Feb 13, 2024

k8s-ci-robot requested review from aojea and neolit123 February 13, 2024 03:05

k8s-ci-robot added the needs-ok-to-test and size/M labels · Feb 13, 2024
