PVF: Move landlock out of thread into process; add landlock exceptions #7580
base: master
Conversation
- [x] Add check for whether `--socket-path` or `--cache-path` are `None` (both are required)
- [x] Fix some compile errors
- [x] Update some docs
- [x] Update landlock tests
```rust
if !artifact_path.starts_with(cache_path) {
	return Err(io::Error::new(io::ErrorKind::Other, format!("received an artifact path {artifact_path:?} that does not belong to expected artifact dir {cache_path:?}")))
}
```
TODO: I wonder if an `io::Error` is the most appropriate error here. To change it we would have to change the error type of this function (it would need a new error kind), which would be a bit annoying. But since this would be a logic bug on our side, it can be an `assert!` instead.
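A minimal sketch of that alternative (hypothetical helper, not the PR's actual code): treat an artifact path outside the cache directory as an internal logic bug and panic via `assert!` instead of returning an `io::Error`:

```rust
use std::path::Path;

// Hypothetical sketch: an artifact path outside the cache dir would be a
// logic bug on our side, so fail loudly with assert! rather than
// returning an io::Error to the caller.
fn check_artifact_path(artifact_path: &Path, cache_path: &Path) {
    assert!(
        artifact_path.starts_with(cache_path),
        "received an artifact path {:?} that does not belong to expected artifact dir {:?}",
        artifact_path,
        cache_path,
    );
}
```

The trade-off is the one noted above: no new error kind is needed, but a violation aborts instead of propagating an error.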
This PR adds a landlock exception for the PVF artifacts cache directory. We have to do it for the whole directory because, at the time of process startup, we of course can't know yet which artifacts are going to come in. Everything outside this directory is still totally restricted for the process.

This has one issue: execution jobs, which need read permissions on the cache directory, can now read other artifacts in the directory. As different validators may have different dir contents, this can be a source of randomness for attackers. It makes it possible to attack the chain by crafting a PVF+candidate which results in arbitrary code execution which:

- reads the contents of the cache directory,
- uses it to seed some randomness, and
- causes the execution job to vote for or against the candidate with 50% chance,

thus stalling the chain, which assumes that at least 66% of validators are trustworthy and not compromised.

We could spawn a brand new process for each job, but that seems like a lot of overhead and I'm not sure it's feasible. cc @s0me0ne-unkn0wn

But there are so many sources of randomness that I have been thinking to abandon this goal. In the end we would need virtualization, and that is not a free win but comes with challenges. Perhaps it is better to rely on governance to deal with attacks on the chain. cc @eskimor, I forget if we discussed this already or not.

At any rate, securing validators themselves is a priority right now, so we need this landlock fix.
Spawning a new process is not that much overhead. Spawning a new process for each and every HTTP request, as some webservers do, is quite bad, but if the useful work is in the hundreds of milliseconds to seconds, it should be negligible.
@eskimor according to our metrics, execution jobs are on the order of 10ms, so spawning a process may not be so negligible. I also remember that when @s0me0ne-unkn0wn was redesigning the execution job queue, one of the goals was to reuse the same worker process as much as possible for jobs. I don't remember the details at this point, but @s0me0ne-unkn0wn can surely provide more insight into how feasible a new process for each job is.
Spawning a new process on Unix is a fork of the calling process. But there are two variables here, one new and one old.

The old one is the parachain logic. If it's purely computational, it's likely that the overhead of spawning a new process will take more time than the execution itself. But if the parachain does some I/O ops, calling host functions and reading and writing storage, it may turn the other way.

The new variable is the separate binary workers. Their memory footprint is small, and that may reduce the overhead significantly; we might see that it's negligible now. But again, those are empirical estimations. It's easy to check: just build a version with the execution worker gracefully exiting after every execution, burn it in on Versi, and compare the results.
Well, I've probably screwed up the explanation somewhat. We still need to fork the main process (not the whole process indeed, only a calling thread) to spawn a new process from an external binary. But prior to worker separation, we needed that forked process to execute the whole node binary.
Cool, I've raised an issue here: paritytech/polkadot-sdk#584. It's a follow-up because it doesn't block this PR, and it would likely come with changes to the execution queue logic, so it should be a separate PR. Thanks @s0me0ne-unkn0wn!
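Until such a burn-in happens, the overhead question can at least be roughed out locally. A sketch (not part of the PR; assumes a Unix system with `true` on the PATH) that times trivial process spawns for comparison against the ~10ms job figure mentioned above:

```rust
use std::process::Command;
use std::time::Instant;

// Rough local version of the suggested empirical check: measure the
// average wall-clock cost of spawning a trivial child process and
// waiting for it to exit.
fn spawn_overhead_ms(iterations: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iterations {
        Command::new("true").status().expect("failed to spawn `true`");
    }
    start.elapsed().as_secs_f64() * 1000.0 / iterations as f64
}
```

This measures only process creation and teardown, not the worker's startup logic, so it gives a lower bound on the per-job overhead.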
Allow an exception for reading from the artifact cache, but disallow listing the directory contents. Since we prepend artifact names with a random hash, this means attackers can't discover artifacts apart from the current job.
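As a standard-library-only illustration of the unguessable-name idea (the PR's actual naming scheme may differ; `RandomState` here stands in for its random hash), names seeded from OS entropy cannot be guessed by an attacker who is unable to list the directory:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

// Sketch: prefix the artifact file name with a value derived from
// OS entropy, so names can't be discovered without directory listing.
// (Illustrative only, not the PR's implementation.)
fn random_artifact_name(code_hash: &str) -> String {
    // RandomState seeds itself from OS entropy at construction.
    let mut hasher = RandomState::new().build_hasher();
    hasher.write(code_hash.as_bytes());
    format!("artifact-{:016x}-{}", hasher.finish(), code_hash)
}
```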
We already checked whether landlock is enabled in the host. We can therefore only throw an error here if landlock is enabled and expected to work. Otherwise we shouldn't even log here, as errors are already logged in the host, and it would just be noise here.
This is an attempt at an improved chroot jail that doesn't require root, but still allows us to use sockets and artifacts from the host.
The CI pipeline was cancelled due to the failure of one of the required jobs.
```rust
if libc::unshare(libc::CLONE_NEWUSER) < 0 {
	return Err("unshare user namespace")
}
if libc::unshare(libc::CLONE_NEWNS) < 0 {
	return Err("unshare mount namespace")
}
```
- These are flags, so this can be done in a single `unshare` call.
- Since this is also switching to a new user namespace, this function should be renamed, as it's not strictly only changing the root.
- On some distros this could actually fail (due to user namespaces being disabled). In this case we'd probably want to abort execution unless a flag is passed on the command-line?
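To illustrate the first point: the `CLONE_*` constants are bit flags, so both namespaces can be unshared with one `unshare` call. The constant values are reproduced literally here (from `<linux/sched.h>`) so the sketch doesn't depend on the `libc` crate; in the real code they come from `libc::CLONE_NEWUSER` and `libc::CLONE_NEWNS`:

```rust
// Flag values as defined in <linux/sched.h>.
const CLONE_NEWUSER: i32 = 0x1000_0000;
const CLONE_NEWNS: i32 = 0x0002_0000;

// Because these are bit flags, they can be OR-ed together and passed
// in a single call: `libc::unshare(CLONE_NEWUSER | CLONE_NEWNS)`.
fn combined_unshare_flags() -> i32 {
    CLONE_NEWUSER | CLONE_NEWNS
}
```

A single call also avoids ending up in a state where only the user namespace was unshared because the second call failed.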
```rust
let cache_path_c = CString::new(cache_path.as_os_str().as_bytes()).unwrap();
let root_absolute_c = CString::new("/").unwrap();
// Append a random string to prevent races and to avoid dealing with the dir already existing.
let oldroot_relative_c = CString::new(format!("{}/oldroot-{}", cache_path_str, s)).unwrap();
```
Use `join` or `push` instead of manually merging these with `/`.
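The suggested form, as a small sketch (hypothetical helper name; `s` is the random suffix from the original code):

```rust
use std::path::{Path, PathBuf};

// Sketch of the reviewer's suggestion: let Path::join handle the
// separator instead of formatting with '/' by hand.
fn oldroot_path(cache_path: &Path, s: &str) -> PathBuf {
    cache_path.join(format!("oldroot-{}", s))
}
```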
```rust
let mut buf = Vec::with_capacity(RANDOM_LEN);
buf.extend(rand::thread_rng().sample_iter(&Alphanumeric).take(RANDOM_LEN));
```
You should be able to do a `.collect` here directly into the buffer.
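That pattern in isolation, with a plain standard-library iterator standing in for the `rand` sampler: `take(n)` plus `.collect()` builds the `Vec` directly, making the `with_capacity`/`extend` pair unnecessary:

```rust
// Sketch: collect() sizes and fills the Vec in one step; Vec's
// FromIterator uses the iterator's size hint to pre-allocate.
fn take_n<I: Iterator<Item = u8>>(iter: I, n: usize) -> Vec<u8> {
    iter.take(n).collect()
}
```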
```rust
if libc::mkdir(oldroot_relative_c.as_ptr(), 0755) < 0 {
	return Err("mkdir oldroot")
}
if libc::syscall(
	libc::SYS_pivot_root,
	cache_path_c.as_ptr(),
	oldroot_relative_c.as_ptr(),
) < 0
{
	return Err("pivot_root")
}
```
If you call the `pivot_root` syscall with both paths being `"."` (after changing the working directory into the new root), then this `mkdir` should be unnecessary.
Currently we apply sandboxing per-thread, when it should be per-process. This shouldn't be a big change, we just need sandboxing exceptions for the artifacts/cache directories.
Without this, the sandboxing we have with landlock is not really secure.
TODO

- We can just pass around a cloned config instead of all the values separately. Link
- Use `assert!` instead of `io::Error`. Link
- `--artifact-dir` should be passed through the handshake instead of through a CLI param. Link

Related
Closes paritytech/polkadot-sdk#600