PVF worker: Add seccomp restrictions (restrict networking) #2009

mrcnski · 2023-10-24T14:23:12Z

Overview

We're already working on sandboxing by blocking all unneeded syscalls (#882). However, because of its wide scope, that work will take a while longer. This PR starts with a much smaller scope: it only blocks network-related syscalls until the above is ready.

For security we block the following with seccomp:

- creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus.
- io_uring - as discussed in paritytech/polkadot#7334 (comment), io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this.
- connecting to sockets - blocking the above two is enough to restrict networking, and is what birdcage does (or used to do, see phylum-dev/birdcage#47). However, it is possible to connect to abstract unix sockets to do some kinds of sandbox escapes, so we also block the connect syscall.
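As a rough illustration of the list above (not the filter code from this PR), the deny list can be modeled as a table of syscall names and numbers. The numbers are the x86_64 ones and differ on other architectures. Treating "io_uring" as the three `io_uring_*` syscalls is an assumption, and `socketpair` is included because the "Remove tokio from workers" commit notes that it triggered the filter. `DENIED_SYSCALLS` and `denied_syscall_name` are hypothetical names.

```rust
/// Syscalls denied in this sketch, with their x86_64 numbers.
/// Assumption: other architectures need their own table.
const DENIED_SYSCALLS: &[(&str, i64)] = &[
    ("socket", 41),
    ("connect", 42),
    ("socketpair", 53),
    ("io_uring_setup", 425),
    ("io_uring_enter", 426),
    ("io_uring_register", 427),
];

/// Returns the name of a denied syscall, or `None` if the number
/// is not on the deny list.
fn denied_syscall_name(nr: i64) -> Option<&'static str> {
    DENIED_SYSCALLS
        .iter()
        .find(|(_, n)| *n == nr)
        .map(|(name, _)| *name)
}
```

With this table, `denied_syscall_name(41)` resolves to `socket`, which would correspond to the `syscall=41` in the test logs in this PR if they were captured on x86_64.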

Safety of blocking io_uring

(Intentionally left out of implementer's guide because it felt like too much detail.)

io_uring is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use io_uring in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned io_uring for these reasons.

Considering io_uring's status, and that any use of it would very likely be detected either by our recently-added static analysis (#1663) or by testing, I think it is fairly safe to block it.

Consensus analysis

If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us.

Actually, a worst-worst-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen:

1. An update to wasmtime is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications)
2. The new syscall is not detected by our static analysis
3. It is never triggered in any of our tests
4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely)
5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?)

Considering how many things would have to go wrong here, we believe it's safe to block io_uring.

Related

Closes #619

Original PR in Polkadot repo: paritytech/polkadot#7334

eskimor

Nice!

polkadot/node/core/pvf/common/src/worker/mod.rs

mrcnski · 2023-10-24T19:32:52Z

polkadot/node/core/pvf/common/src/worker/security/seccomp.rs

+//! # Action on caught syscall
+//!
+//! TODO


Instead of killing the process on violations, we can have seccomp notify the parent, having it kill the process and log what happened. We anyway want logging to ensure that the syscall detection script is really sound, before enabling a seccomp whitelist based on it in production. I've done some experiments and performance does not suffer too much.

how do you plan to have the parent notified? using ptrace or SECCOMP_RET_USER_NOTIF?

If we only want logging, we can just rely on the kernel's Audit logging of seccomp actions

ptrace. With SECCOMP_RET_USER_NOTIF we'd need to check if it's supported, as you previously pointed out. Maybe not a big deal as it was introduced in 5.0.

I considered audit logging, but AFAICT there are some issues with it. We'd need to manually parse the log which seems hacky (there is a Linux utility for it but not installed by default). Also, it's possible for operators to have disabled seccomp logging. ptrace seemed like the right way to do it, albeit more complex; but I'd really appreciate your input.

> Also, it's possible for operators to have disabled seccomp logging.

well, users can disable ptrace too. But is this a concern? we only have access to the logs of nodes if they're run by us, right? we can configure the system whichever way we see fit.

I don't have a very strong opinion, honestly: whichever option works, doesn't impact performance too much and isn't too complicated to implement.
I am guessing the most performant one is audit logging, because it's all happening in the kernel. But it's indeed more complicated to parse the logs and build the right plumbing to feed them into Loki.

I went with parsing the audit logs. Audit events are not guaranteed to be observable, but we don't necessarily need them: the logs we emit are merely informative, so that operators know what happened. Note that if we were to handle seccomp violations differently from regular worker death, then an attacker could perhaps abuse that non-determinism somehow.

Using ptrace instead would be more reliable and (I think) the performance would be fine, but it seemed more convoluted to implement that strategy.
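For readers unfamiliar with the approach, here is a minimal sketch of scanning audit records for a worker's seccomp violation. It is not the code from this PR. It assumes each record is a single line of space-separated `key=value` fields, that seccomp records carry `type=1326` (the numeric code for `AUDIT_SECCOMP`) or `type=SECCOMP`, and that they include `pid=` and `syscall=` fields. `Violation`, `field`, `parse_violation` and `find_violations` are made-up names.

```rust
/// A seccomp violation found in an audit record (hypothetical type).
#[derive(Debug, PartialEq)]
struct Violation {
    pid: u32,
    syscall: i64,
}

/// Returns the value of `key=value` in a whitespace-separated record.
fn field<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    line.split_whitespace()
        .find_map(|tok| tok.strip_prefix(key)?.strip_prefix('='))
}

/// Parses one log line, keeping it only if it is a seccomp record
/// for the given worker PID.
fn parse_violation(line: &str, worker_pid: u32) -> Option<Violation> {
    // Assumption: seccomp events are tagged with one of these markers.
    let is_seccomp = line.contains("type=1326") || line.contains("type=SECCOMP");
    if !is_seccomp {
        return None;
    }
    let pid: u32 = field(line, "pid")?.parse().ok()?;
    if pid != worker_pid {
        return None;
    }
    let syscall: i64 = field(line, "syscall")?.parse().ok()?;
    Some(Violation { pid, syscall })
}

/// Collects all violations for `worker_pid` from the log contents.
fn find_violations(log: &str, worker_pid: u32) -> Vec<Violation> {
    log.lines()
        .filter_map(|line| parse_violation(line, worker_pid))
        .collect()
}
```

The result would only feed a log message. As noted in this thread, the job outcome should not depend on whether a record was found, since audit events are not guaranteed to be observable.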

alindima

Looking good

polkadot/node/core/pvf/common/Cargo.toml

polkadot/node/core/pvf/common/src/worker/security/mod.rs

polkadot/node/core/pvf/common/src/worker/security/seccomp.rs

alindima · 2023-10-25T07:17:20Z

polkadot/node/core/pvf/common/src/worker/security/seccomp.rs

+//! # Action on caught syscall
+//!
+//! TODO


how do you plan to have the parent notified? using ptrace or SECCOMP_RET_USER_NOTIF?

If we only want logging, we can just rely on the kernel's Audit logging of seccomp actions

polkadot/node/core/pvf/common/src/worker/security/seccomp.rs

polkadot/node/core/pvf/src/host.rs

NOTE: Log, but don't change the outcome. Not all validators may have auditing enabled, so we don't want attackers to abuse a non-deterministic outcome.

TESTING: Some manual testing where seccomp events were triggered confirmed that the logs are parsed correctly:

Prepare worker:

```
Oct 27 09:15:42.725 WARN parachain::pvf: failed to recv a prepare response: Custom { kind: UnexpectedEof, error: "early eof" } worker_pid=2691819
Oct 27 09:15:42.726 DEBUG parachain::pvf: checking audit log for seccomp violations worker_pid=2691819 audit_log_path="/var/log/syslog"
Oct 27 09:15:42.727 ERROR parachain::pvf: A forbidden syscall was attempted! This is a violation of our seccomp security policy. Report an issue ASAP! worker_pid=2691819 syscall=41 pvf=Pvf { code, code_hash: 0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35, executor_params: ExecutorParams([]), prep_timeout: 3s }
```

Execute worker:

```
Oct 27 09:17:28.006 WARN parachain::pvf: failed to recv an execute response worker_pid=2692712 validation_code_hash=0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35 error=Custom { kind: UnexpectedEof, error: "early eof" }
Oct 27 09:17:28.006 DEBUG parachain::pvf: checking audit log for seccomp violations worker_pid=2692712 audit_log_path="/var/log/syslog"
Oct 27 09:17:28.007 ERROR parachain::pvf: A forbidden syscall was attempted! This is a violation of our seccomp security policy. Report an issue ASAP! worker_pid=2692712 syscall=41 validation_code_hash=0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35 artifact_path="/tmp/.tmp4p6bN0/wasmtime_polkadot_v1.1.0_0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35_0x03170a2e7597b7b7e3d84c05391d139a62b157e78786d8c082f29dcf4c111314"
```

alindima

Looks reasonable. Another way of getting a log would be to use the Trap return action and install a signal handler for SIGSYS that logs and then kills the process.

But that gets tricky because you can only use async-signal-safe functions in signal handlers (and we'd have to audit all of that code constantly).

polkadot/node/core/pvf/tests/it/main.rs

mrcnski · 2023-10-28T16:13:19Z

Perhaps we should only log for now, instead of killing the process. Just for one release. We haven't tested with all parachains, after all, and the socket call we're blocking is in the syscall lists for the processes.

I say we log, just to be safe. The jobs really shouldn't actually do any networking. (And creating a Unix socket would have already failed with our FS restrictions.)
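To make the log-versus-kill choice concrete, here is a small sketch of how such a rollout switch could look. The `SECCOMP_RET_*` values are the ones defined in `linux/seccomp.h`; the `Action` enum, the `enforce` flag and `action_for_denied_syscall` are hypothetical and not taken from this PR. Note that with the log action the kernel records the syscall and then lets it proceed, so this mode only observes.

```rust
/// seccomp return actions that come up in this discussion.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    /// Kill the whole process (`SECCOMP_RET_KILL_PROCESS`).
    KillProcess,
    /// Deliver SIGSYS to the triggering thread (`SECCOMP_RET_TRAP`).
    Trap,
    /// Log the syscall, then allow it (`SECCOMP_RET_LOG`).
    Log,
}

impl Action {
    /// Raw value as defined in `linux/seccomp.h`.
    fn raw(self) -> u32 {
        match self {
            Action::KillProcess => 0x8000_0000,
            Action::Trap => 0x0003_0000,
            Action::Log => 0x7ffc_0000,
        }
    }
}

/// Hypothetical switch: log only for the first release,
/// kill the process once enforcement is turned on.
fn action_for_denied_syscall(enforce: bool) -> Action {
    if enforce {
        Action::KillProcess
    } else {
        Action::Log
    }
}
```

Under these assumptions, the first release would call `action_for_denied_syscall(false)` and a later one would flip the flag.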

alindima · 2023-10-30T07:44:10Z

> Perhaps we should only log for now, instead of killing the process. Just for one release. We haven't tested with all parachains, after all, and the socket call we're blocking is in the syscall lists for the processes.
> I say we log, just to be safe. The jobs really shouldn't actually do any networking. (And creating a Unix socket would have already failed with our FS restrictions.)

Sounds good to me. We'll need some script that parses the audit logs from the kernel and notifies us if there's a logged violation

mrcnski · 2023-10-30T09:14:30Z

> Another way of getting a log would be to use the Trap return action and install a signal handler for SIGSYS that logs and then kills the process.
>
> But that gets tricky because you can only use async-signal-safe functions in signal handlers (and we'd have to audit all of that code constantly).

I agree that signal handlers are better avoided where possible. :P And, according to man:

> This value results in the kernel sending a thread-directed SIGSYS signal to the triggering thread.

IIUC, with this scheme, an attacker could overwrite our signal handlers with his own.

> We'll need some script that parses the audit logs from the kernel and notifies us if there's a logged violation

I believe we can just parse the logs after every job, similar to how we do it now. We don't need a script running continuously (if I understood you correctly).

I've confirmed that we have time to log for one release before enabling the protections.

alindima · 2023-10-30T09:42:35Z

> IIUC, with this scheme, an attacker could overwrite our signal handlers with his own.

You could add sigaction to the seccomp blacklist. Anyway, it's best to avoid signal handlers if possible, as we both agree

> I believe we can just parse the logs after every job, similar to how we do it now. We don't need a script running continuously (if I understood you correctly).

Sure. We still need some kind of rough notification mechanism if a fault occurred. Or just remember to scan the logs regularly :D

s0me0ne-unkn0wn · 2023-10-30T10:08:10Z

> We'll need some script that parses the audit logs from the kernel

I'm not following the whole discussion so a possibly dumb question... Why do we need to parse audit logs instead of connecting to audit via netlink and observing audit events directly?

mrcnski · 2023-10-30T10:46:00Z

@s0me0ne-unkn0wn I don't know what netlink is. :P Parsing logs is never ideal, but this is the best way I could find since encountering the problem back in April. If your solution is quick to implement I could do it, otherwise I've already implemented and tested the parsing way.

s0me0ne-unkn0wn · 2023-10-30T11:10:44Z

@mrcnski see man 3 audit_open and further links. It's the interface to the kernel's audit subsys. I think I sent you a usage example to Matrix PM once.

mrcnski · 2023-10-30T11:18:06Z

Thanks @s0me0ne-unkn0wn, it's a good idea if I can figure out how to do it. Raised a follow-up at #2080.

polkadot/roadmap/implementers-guide/src/node/utility/pvf-host-and-workers.md

We've merged #2009 which logs seccomp violations for networking, but does not switch on full seccomp restrictions (voting against blocks on violations). We should monitor releases and once #2009 has been released, we can switch this on for the next release. Closes #2163

mrcnski added 2 commits October 24, 2023 16:11

Remove tokio from workers

2bae7a6

Starting the tokio runtime was calling `socketpair` and triggering the new seccomp filter. Removed tokio since we wanted to do it soon anyway as part of #649.

mrcnski added the T0-node label Oct 24, 2023

mrcnski requested review from koute, eskimor, alindima and s0me0ne-unkn0wn October 24, 2023 14:23

mrcnski self-assigned this Oct 24, 2023

eskimor approved these changes Oct 24, 2023

polkadot/node/core/pvf/common/src/worker/mod.rs (outdated)

mrcnski commented Oct 24, 2023

alindima reviewed Oct 25, 2023

mrcnski added 3 commits October 25, 2023 10:51

Address review

a7ce198

Merge branch 'master' into mrcnski/pvf-worker-restrict-networking

4754a5a

Update syscall lists

cd10afa

mrcnski requested a review from alindima October 25, 2023 08:56

mrcnski added 2 commits October 25, 2023 11:24

Fix benchmark

f0b41bd

Address some comments

122eeb4

alindima approved these changes Oct 25, 2023

mrcnski added 2 commits October 26, 2023 13:27

Move security checks to a separate module

f792d88

mrcnski requested a review from alindima October 27, 2023 09:50

Update doc

38012ec

mrcnski requested a review from eskimor October 27, 2023 10:00

alindima reviewed Oct 27, 2023

polkadot/node/core/pvf/tests/it/main.rs (outdated)

Log violations, don't kill process yet

c9b0aaa

mrcnski mentioned this pull request Oct 30, 2023

PVF: consider fetching audit logs with netlink #2080

Closed

Un-ignore tests

5417818

Use a combination of rusty-fork (separate processes in rust tests) and new sessions to safely kill child workers in tests.

mrcnski requested a review from alindima October 30, 2023 11:57

alindima reviewed Oct 31, 2023

polkadot/roadmap/implementers-guide/src/node/utility/pvf-host-and-workers.md (outdated)

Comment out section in impl guide

e89873c

alindima approved these changes Oct 31, 2023

mrcnski merged commit 9faea38 into master Oct 31, 2023
111 of 112 checks passed

mrcnski deleted the mrcnski/pvf-worker-restrict-networking branch October 31, 2023 10:08

This was referenced Oct 31, 2023

PVF worker: separate worker binaries and build with musl #650

Closed

PVF worker: switch on seccomp networking restrictions #2163

Closed

mrcnski mentioned this pull request Nov 8, 2023

PVF worker: switch on seccomp networking restrictions #2221

Merged

mrcnski mentioned this pull request Nov 24, 2023

PVF: Add Secure Validator Mode #2486

Merged

2 tasks

github-actions bot mentioned this pull request Feb 19, 2024

Update substrate/polkadot/cumulus from v1.3.0 to v1.6.0 moondance-labs/tanssi#419

Closed

github-actions bot mentioned this pull request Mar 13, 2024

Update polkadot-sdk from v1.3.0 to v1.7.2 moonbeam-foundation/moonbeam#2703

Closed
