Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pipe: Fall back to write() on vmsplice() EPERM #2294

Open
wants to merge 177 commits into
base: criu-dev
Choose a base branch
from

Conversation

ymanton
Copy link
Contributor

@ymanton ymanton commented Oct 30, 2023

vmsplice can be blocked via seccomp and currently is in Podman containers. CRIU uses vmsplice a lot on the checkpoint side, so it's easier to just run with seccomp=unconfined in that case. On the restore side I've only seen one use in restore_pipe_data being reached, so rather than forcing users to run with a custom profile or disable seccomp altogether it might be better to just fall back to calling write, which this patch does.

prakritigoyal19 and others added 30 commits June 11, 2023 23:30
Change made through this commit:
- Include copy of flog as a seperate tree.
- Modify the makefile to add and compile flog code.

Signed-off-by: prakritigoyal19 <[email protected]>
CID 302713 (#1 of 1): Missing varargs init or cleanup (VARARGS)
 va_end was not called for argptr.

Signed-off-by: Adrian Reber <[email protected]>
Separate commit for easier criu-dev <-> master transfer.

Acked-by: Mike Rapoport <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
It is mapped, not maped. Same applies for mmap I guess.

Found by codespell, except it wants to change it to mapped,
which will make it less specific.

Signed-off-by: Kir Kolyshkin <[email protected]>
Brought to you by

    codespell -w

(using codespell v2.1.0).

[v2: use "make indent" on the result]

Signed-off-by: Kir Kolyshkin <[email protected]>
The TOS(type of service) field in the ip header allows you specify the
priority of the socket data.

Signed-off-by: Suraj Shirvankar <[email protected]>
The pipe_size type is unsigned int, when the fcntl call fails and
return -1, it will cause a negative rollover problem.

Signed-off-by: zhoujie <[email protected]>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than
before. Looking at older CPUs I see 2440 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00000988 (2440)

On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00002b00 (11008)

This increase the xsave area from one page to four pages.

Without this patch the fpu03 test fails, with this patch it works again.

Signed-off-by: Adrian Reber <[email protected]>
Using the fact that we know criu_pid and criu is a parent of restored
process we can create pidfile with pid on caller pidns level.

We need to move mount namespace creation to child so that criu-ns can
see caller pidns proc.

Signed-off-by: Pavel Tikhomirov <[email protected]>
By default, the file name 'amdgpu_plugin.txt' is used also as the name
for the corresponding man page (`man amdgpu_plugin`). However, when
this man page is installed system-wide it would be more appropriate
to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`).

Signed-off-by: Radostin Stoyanov <[email protected]>
crun wants to set empty_ns and this interface is missing from the
library. This adds it to libcriu.

Signed-off-by: Adrian Reber <[email protected]>
--criu-binary argument provides a way to supply the CRIU binary
location to run_criu().

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes remove and update the changes introduced in
7177938 in favor of the
Python version in CI.

os.waitstatus_to_exitcode() function appeared in Python 3.9

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes add test implementations for criu-ns script.

Fixes: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes fix the `ImportError: No module named pathlib`
error when executing criu-ns tests located at criu/test/others/criu-ns

Signed-off-by: Dhanuka Warusadura <[email protected]>
CentOS 7 CI environment uses Python 2. To execute criu-ns
script in CentOS 7 changing the current shebang line to
python is required.

This reverse the changes made in a15a63f

Signed-off-by: Dhanuka Warusadura <[email protected]>
This is a patch proposed by Thomas here:
https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/

It removes (created id > desired id) "sanity" check and adds proper
checking that ids start at zero and increment by one each time when we
create/delete a posix timer.

First purpose of it is to fix infinite looping in create_posix_timers on
old pre 3.11 kernels.

Second purpose is to allow kernel interface of creating posix timers
with desired id change from iterating with predictable next id to just
setting next id directly. And at the same time removing predictable next
id so that criu with this patch would not get to infinite loop in
create_posix_timers if this happens.

Thanks a lot to Thomas!

Signed-off-by: Pavel Tikhomirov <[email protected]>
This hook allows to start image streamer process from an action script.

Signed-off-by: Radostin Stoyanov <[email protected]>
…tions

This does cgroup namespace creation separately from joining task
cgroups. This makes the code more logical, because creating cgroup
namespace also involves joining cgroups but these cgroups can be
different to task's cgroups as they are cgroup namespace roots
(cgns_prefix), and mixing all of them together may lead to
misunderstanding.

Another positive thing is that we consolidate !item->parent checks in
one place in restore_task_with_children.

Signed-off-by: Valeriy Vdovin <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB.
Since seals are not possible in this case, fake F_GETSEALS result as if it
was queried for a non-sealing-enabled memfd.

Signed-off-by: Michał Mirosław <[email protected]>
Linux 4.15 doesn't like empty string for cgroup2 mount options.
Pass NULL then to satisfy the kernel check. Log the options for
easier debugging.

Signed-off-by: Michał Mirosław <[email protected]>
The original commit added saving THP_DISABLED flag value, but missed
restoring it. There is restoring code, but used only when --lazy_pages
mode is enabled. Restore the prctl flag always. While at it, rename the
`has_thp_enabled` -> `!thp_disabled` for consistency.

Fixes: bbbd597 (2017-06-28 "mem: add dump state of THP_DISABLED prctl")
Signed-off-by: Michał Mirosław <[email protected]>
If prctl(SET_THP_DISABLE) is not used due to bad semantics, log it
for easier debugging.

Signed-off-by: Michał Mirosław <[email protected]>
While at it, don't carry over stale errno to the fail() message.

Signed-off-by: Michał Mirosław <[email protected]>
Add a sanity check for THP_DISABLE. This discovered a broken commit
in Google's kernel tree.

Signed-off-by: Michał Mirosław <[email protected]>
rst0git and others added 19 commits October 4, 2023 23:40
amdgpu_plugin.c:930:6: error: variable 'buffer' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
        if (ret) {
            ^~~
amdgpu_plugin.c:988:8: note: uninitialized use occurs here
        xfree(buffer);

Signed-off-by: Radostin Stoyanov <[email protected]>
One memfd can be shared by a few restored files. Only of these files is
restored with a file created with memfd_open. Others are restored by reopening
memfd files via /proc/self/fd/.

It seems unnecessary for restoring memfd memory mappings. We can always use the
origin file.

Signed-off-by: Andrei Vagin <[email protected]>
The "ColumnLimit: 120" is not only allowing lines to be longer than 80
characters but it also forces line wrapping at 120 characters. If total
expression length is more than 120 characters, clang-format will try to
wrap it as close to 120 as it can, it would not even allow to wrap at 80
characters if we really want it. But as we all know 80 characters is
Linux kernel coding style default and as far as our coding style is
based on it it is really strange to prohibit wrapping lines at 80
characters...

Signed-off-by: Pavel Tikhomirov <[email protected]>
GCC's lto source:
> To avoid this problem the compiler must assume that it sees the
> whole program when doing link-time optimization.  Strictly
> speaking, the whole program is rarely visible even at link-time.
> Standard system libraries are usually linked dynamically or not
> provided with the link-time information.  In GCC, the whole
> program option (@option{-fwhole-program}) asserts that every
> function and variable defined in the current compilation
> unit is static, except for function @code{main} (note: at
> link time, the current unit is the union of all objects compiled
> with LTO).  Since some functions and variables need to
> be referenced externally, for example by another DSO or from an
> assembler file, GCC also provides the function and variable
> attribute @code{externally_visible} which can be used to disable
> the effect of @option{-fwhole-program} on a specific symbol.

As far as I read gcc's source, ipa_comdats() will avoid placing symbols
that are either already in a user-defined section or have
externally_visible attribute into new optimized gcc sections.

Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
fork_and_ptrace_attach has to fork a child with CLONE_UNTRACED,
so that strace doesn't trace it.

Signed-off-by: Andrei Vagin <[email protected]>
read_ns_sys_file() can return an error, but we are trying to parse a
buffer before checking a return code.

CID 417395 (checkpoint-restore#3 of 3): String not null terminated (STRING_NULL)
2. string_null: Passing unterminated string buf to strtol, which expects
   a null-terminated string.

Signed-off-by: Andrei Vagin <[email protected]>
This check is redundant as line 201 checks for this condition.

Signed-off-by: Taemin Ha <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
The is_native field is a boolean. Therefore, else if() should can be
changed to a simple else{}.

Signed-off-by: Taemin Ha <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
The condition meant to check fd2 instead of fd1, which is checked in
line 24.

Signed-off-by: Taemin Ha <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
line 131 checks if (ret >= 0). line 133 could be replaced by a simple else statement

Signed-off-by: Taemin Ha <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Eventpollentry's fields are set only when ret == 3 or ret == 6. The
remaining cases can be grouped together to an error

Signed-off-by: Taemin Ha <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
At this point the correct position is already restored, so reading from
the fd results in the position being moved forward by 5 bytes.

Fixes: 9191f87 ("criu/files-reg.c: add build-id validation functionality")
Signed-off-by: Michal Clapinski <[email protected]>
cgroup_ifpriomap test needs net_prio cgroup, which might not be
available. Make the .checkskip script check it.

Signed-off-by: Michał Mirosław <[email protected]>
Newer versions of pip use an isolated virtual environment when building
Python projects. However, when the source code of CRIT is copied into
the isolated environment, the symlink for `../lib/py` (pycriu) becomes
invalid. As a workaround, we used the `--no-build-isolation` option for
`pip install`. However, this functionality has issues in some versions
of PIP [1, 2]. To fix this problem, this patch adds separate packages
for pycriu and crit, and each package is installed independently.

[1] pypa/pip#8221
[2] pypa/pip#8165 (comment)

Signed-off-by: Radostin Stoyanov <[email protected]>
Do not use $(USERCFLAGS) for anything other than what the user provide.

Signed-off-by: Marcus Folkesson <[email protected]>
vmsplice can be blocked via seccomp and currently is in Podman
containers.

Signed-off-by: Younes Manton <[email protected]>
@0x7f454c46
Copy link
Member

Hm, this seems like a workaround for a specific environment setup, rather than for a kernel code.
Unsure if this should be hardcoded like this. Maybe an option in criu.config?
This will be a policy/config-based decision and in result CRIU won't try hard on vmsplice() potentially causing notifications and issuing syscalls that are expected to fail, decreasing the performance as well.

@avagin
Copy link
Member

avagin commented Oct 31, 2023

@ymanton I look at the list [1] and can't firgure out why they decide to block vmsplice. It looks like a mistake.
[1] https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json#L84

@avagin
Copy link
Member

avagin commented Oct 31, 2023

containers/common@7ced5da

@ymanton
Copy link
Contributor Author

ymanton commented Nov 3, 2023

containers/common@7ced5da

Thanks.

The discussion on lkml is very long and deep in kernel internals that I don't understand, but from what I can gather there is some worry that vmsplice can be abused, and I guess the Podman folks have blocked it to be on the safe side.

https://lore.kernel.org/linux-mm/[email protected]/:

The problem is we have an unprivileged long term GUP like vmsplice
that facilitates elevating the page count indefinitely, until the
parent finally writes a secret to it. Theoretically a short term pin
would do it too so it's not just vmpslice, but the short term pin
would be incredibly more challenging to become a concern since it'd
kill a phone battery and flash before it can read any data.

So what happens with your page_mapcount == 1 check is that it doesn't
mean non-COW (we thought it did until it didn't for the long term gup
pin in vmsplice).

Jann's testcases does fork() and set page_mapcount 2 and page_count to
2, vmsplice, take unprivileged infinitely long GUP pin to set
page_count to 3, queue the page in the pipe with page_count elevated,
munmap to drop page_count to 2 and page_mapcount to 1.

page_mapcount is 1, so you'd think the page is non-COW and owned by
the parent, but the child can still read it so it's very much still
wp_page_copy material if the parent tries to modify it. Otherwise the
child can read the content.

https://lore.kernel.org/linux-mm/[email protected]/:

I already told Jens, is io_uring could use mmu notifier already (that
would make any io_uring GUP pin not count anymore with regard to
page_mapcount vs page_count difference) and vmsplice also needs some
handling or maybe become privileged.

So unless you guys think otherwise I doubt we will convince the Podman folks to unblock it any time soon, and some others might follow in their footsteps.

I don't mind making it a config option. Do we want an option that avoids all uses of vmsplice on both checkpoint and restore? Or should we consider just getting rid of it altogether?

@Snorch
Copy link
Member

Snorch commented Nov 6, 2023

Can we in CRIU temporary disable this seccomp rule while restoring (e.g. only for restored container) or even restore seccomp rules later after vmsplicing everything we need to?

@avagin
Copy link
Member

avagin commented Nov 6, 2023

@Snorch I think it is about unprivileged C/R. In this case, we can't suspend seccomp.

Copy link

github-actions bot commented Dec 7, 2023

A friendly reminder that this PR had no activity for 30 days.

@avagin avagin added no-auto-close Don't auto-close as a stale issue and removed stale-pr labels Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.