Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sk-inet: Support IP_TRANSPARENT and IPV6_TRANSPARENT socket options #2357

Open
wants to merge 235 commits into
base: criu-dev
Choose a base branch
from

Conversation

juntongdeng
Copy link
Contributor

Support for IP_TRANSPARENT and IPV6_TRANSPARENT socket options is
useful for restoring transparent proxies.

prakritigoyal19 and others added 30 commits June 11, 2023 23:30
Change made through this commit:
- Include copy of flog as a seperate tree.
- Modify the makefile to add and compile flog code.

Signed-off-by: prakritigoyal19 <[email protected]>
CID 302713 (checkpoint-restore#1 of 1): Missing varargs init or cleanup (VARARGS)
 va_end was not called for argptr.

Signed-off-by: Adrian Reber <[email protected]>
Separate commit for easier criu-dev <-> master transfer.

Acked-by: Mike Rapoport <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
It is mapped, not maped. Same applies for mmap I guess.

Found by codespell, except it wants to change it to mapped,
which will make it less specific.

Signed-off-by: Kir Kolyshkin <[email protected]>
Brought to you by

    codespell -w

(using codespell v2.1.0).

[v2: use "make indent" on the result]

Signed-off-by: Kir Kolyshkin <[email protected]>
The TOS(type of service) field in the ip header allows you specify the
priority of the socket data.

Signed-off-by: Suraj Shirvankar <[email protected]>
The pipe_size type is unsigned int, when the fcntl call fails and
return -1, it will cause a negative rollover problem.

Signed-off-by: zhoujie <[email protected]>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than
before. Looking at older CPUs I see 2440 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00000988 (2440)

On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00002b00 (11008)

This increase the xsave area from one page to four pages.

Without this patch the fpu03 test fails, with this patch it works again.

Signed-off-by: Adrian Reber <[email protected]>
Using the fact that we know criu_pid and criu is a parent of restored
process we can create pidfile with pid on caller pidns level.

We need to move mount namespace creation to child so that criu-ns can
see caller pidns proc.

Signed-off-by: Pavel Tikhomirov <[email protected]>
By default, the file name 'amdgpu_plugin.txt' is used also as the name
for the corresponding man page (`man amdgpu_plugin`). However, when
this man page is installed system-wide it would be more appropriate
to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`).

Signed-off-by: Radostin Stoyanov <[email protected]>
crun wants to set empty_ns and this interface is missing from the
library. This adds it to libcriu.

Signed-off-by: Adrian Reber <[email protected]>
--criu-binary argument provides a way to supply the CRIU binary
location to run_criu().

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes remove and update the changes introduced in
7177938 in favor of the
Python version in CI.

os.waitstatus_to_exitcode() function appeared in Python 3.9

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes add test implementations for criu-ns script.

Fixes: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes fix the `ImportError: No module named pathlib`
error when executing criu-ns tests located at criu/test/others/criu-ns

Signed-off-by: Dhanuka Warusadura <[email protected]>
CentOS 7 CI environment uses Python 2. To execute criu-ns
script in CentOS 7 changing the current shebang line to
python is required.

This reverse the changes made in a15a63f

Signed-off-by: Dhanuka Warusadura <[email protected]>
This is a patch proposed by Thomas here:
https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/

It removes (created id > desired id) "sanity" check and adds proper
checking that ids start at zero and increment by one each time when we
create/delete a posix timer.

First purpose of it is to fix infinite looping in create_posix_timers on
old pre 3.11 kernels.

Second purpose is to allow kernel interface of creating posix timers
with desired id change from iterating with predictable next id to just
setting next id directly. And at the same time removing predictable next
id so that criu with this patch would not get to infinite loop in
create_posix_timers if this happens.

Thanks a lot to Thomas!

Signed-off-by: Pavel Tikhomirov <[email protected]>
This hook allows to start image streamer process from an action script.

Signed-off-by: Radostin Stoyanov <[email protected]>
…tions

This does cgroup namespace creation separately from joining task
cgroups. This makes the code more logical, because creating cgroup
namespace also involves joining cgroups but these cgroups can be
different to task's cgroups as they are cgroup namespace roots
(cgns_prefix), and mixing all of them together may lead to
misunderstanding.

Another positive thing is that we consolidate !item->parent checks in
one place in restore_task_with_children.

Signed-off-by: Valeriy Vdovin <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB.
Since seals are not possible in this case, fake F_GETSEALS result as if it
was queried for a non-sealing-enabled memfd.

Signed-off-by: Michał Mirosław <[email protected]>
Linux 4.15 doesn't like empty string for cgroup2 mount options.
Pass NULL then to satisfy the kernel check. Log the options for
easier debugging.

Signed-off-by: Michał Mirosław <[email protected]>
The original commit added saving THP_DISABLED flag value, but missed
restoring it. There is restoring code, but used only when --lazy_pages
mode is enabled. Restore the prctl flag always. While at it, rename the
`has_thp_enabled` -> `!thp_disabled` for consistency.

Fixes: bbbd597 (2017-06-28 "mem: add dump state of THP_DISABLED prctl")
Signed-off-by: Michał Mirosław <[email protected]>
If prctl(SET_THP_DISABLE) is not used due to bad semantics, log it
for easier debugging.

Signed-off-by: Michał Mirosław <[email protected]>
While at it, don't carry over stale errno to the fail() message.

Signed-off-by: Michał Mirosław <[email protected]>
Add a sanity check for THP_DISABLE. This discovered a broken commit
in Google's kernel tree.

Signed-off-by: Michał Mirosław <[email protected]>
rppt and others added 11 commits February 19, 2024 18:32
To support sigreturn with CET enabled parasite must rewind its stack
before calling sigreturn so that shadow stack will be compatible with
actual calling sequence.

In addition, calling sigreturn from top level routine
(__export_parasite_head_start) will significantly simplify the shadow
stack manipulations required to execute sigreturn.

For x86 make fini_sigreturn() return the stack pointer for the signal
frame that will be used by sigreturn and propagate that return value up
to __export_parasite_head_start.

In non-daemon mode parasite_trap_cmd() returns non-positive value
which allows to distinguish daemon and non-daemon mode and properly stop
at int3 in non-daemon mode.

Architectures other than x86 remain unchanged and will still call
sigreturn from fini_sigreturn().

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they pushed to the shadow stack when signal processing is
started.

Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.

Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
The shadow stack VMAs require special care because they can only be
created and populated using special system calls.

Add VMA_AREA_SHSTK flag and set it for VMAs that are marked as "ss" in
/proc/pid/smaps

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Shadow stack VMAs cannot be mmap()ed, they must be created using
map_shadow_stack() system call and populated using special wrss
instruction available only when shadow stack is enabled.

Premap them to reserve virtual address space and populate it to have
there contents available for later copying after enabling shadow stack.

Along with the space required by shadow stack VMAs also reserve an extra
page that will be later used as a temporary shadow stack.

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Shadow stacks must be populated using special WRSS instruction. This
instruction is only available when shadow stack is enabled, calling it
with disabled shadow stack causes #UD.

Moreover, shadow stack VMAs cannot be mremap()ed and they must be
created using map_shadow_stack() system call. This requires delaying the
restore of shadow stacks to restorer blob after the CRIU mappings are
cleared.

Introduce rst_shstk_info structure to hold shadow stack parameters
required in the restorer blob and populate this structure in
arch_prepare_shstk() method.

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Detect if CRIU runs with shadow stack enabled and store the result in
kerndat.

Unlike most kerndat knobs, kdat_has_shstk() does not check for
availability of the shadow stack in the kernel, but rather checks if
criu runs with shadow stack enabled.

This depends on hardware availabilty, kernel and glibc support, compiler
options and glibc tunables, so kdat_has_shstk() must be called every
time CRIU starts and its result cannot be cached.

The result will be used by the code that controls shadow stack
enablement in the next commit.

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
There are several gotachs when restoring a task with shadow stack:
* depending on the compiler options, glibc version and glibc tunables
  CRIU can run with or without shadow stack.
* shadow stack VMAs are special, they must be created using a dedicated
  map_shadow_stack() system call and can be modified only by a special
  instruction (wrss) that is only available when shadow stack is
  enabled.
* once shadow stack is enabled, it is not writable even with wrss;
  writes to shadow stack can be only enabled with ptrace() and only when
  shadow stack is enabled in the tracee.
* if the shadow stack is enabled during restore rather than by glibc,
  calling retq after arch_prctl() that enables the shadow stack causes
  #CP, so the function that enables shadow stack can never return.

Add the infrastructure required to cope with all of those:

* modify the restore code to allow trampoline (arch_shstk_trampoline)
  that will enable shadow stack and call restore_task_with_children().
* add call to arch_shstk_unlock() right after the tasks are clone()ed;
  this will allow unlocking shadow stack features and making shadow
  stack writable.
* add stubs for architectures that do not support shadow stacks
* add implementation of arch_shstk_trampoline() and arch_shstk_unlock()
  for x86, but keep it disabled; it will be enabled along with addtion
  of the code that will restore shadow stack in the restorer blob

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
The restore of a task with shadow stack enabled adds these steps:

* switch from the default shadow stack to a temporary shadow stack
  allocated in the premmaped area
* unmap CRIU mappings; nothing changed here, but it's important that
  CRIU mappings can be removed only after switching to a temporary
  shadow stack
* create shadow stack VMA with map_shadow_stack()
* restore shadow stack contents with wrss
* switch to "real" shadow stack
* lock shadow stack features

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
IP_TRANSPARENT and IPV6_TRANSPARENT are used in transparent proxies.
Support for these two socket options is useful for restoring transparent
proxies.

Signed-off-by: Juntong Deng <[email protected]>
@adrianreber adrianreber reopened this Mar 10, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.72%. Comparing base (cb39c62) to head (85842b2).

❗ Current head 85842b2 differs from pull request most recent head 7c2c8a8. Consider uploading reports for the commit 7c2c8a8 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2357      +/-   ##
============================================
+ Coverage     70.21%   70.72%   +0.50%     
============================================
  Files           132      135       +3     
  Lines         34372    32907    -1465     
============================================
- Hits          24134    23272     -862     
+ Misses        10238     9635     -603     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

adrianreber and others added 5 commits March 25, 2024 15:59
Upgrade to 22.04 base image and use the existing version of docker.

Signed-off-by: Adrian Reber <[email protected]>
A memory interval is a half-open interval, so the condition
when pr->pe->vaddr == vma->e->end should not be interpreted
as an intersection and should cause vma to be marked with VMA_NO_PROT_WRITE.

Fixes: checkpoint-restore#2364

Signed-off-by: Artem Trushkin <[email protected]>
This patch extends CRIU with support for SCHED_RESET_ON_FORK.
When the SCHED_RESET_ON_FORK flag is set, the following rules
apply for subsequently created children:

- If the calling thread has a scheduling policy of SCHED_FIFO or
SCHED_RR, the policy is reset to SCHED_OTHER in child processes.

- If the calling process has a negative nice value, the nice value
is reset to zero in child processes.

(See 'man 7 sched')

Fixes: checkpoint-restore#2359

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends the sched_policy00 test case to verify that
the SCHED_RESET_ON_FORK flag is restored correctly.

Signed-off-by: Radostin Stoyanov <[email protected]>
@Snorch
Copy link
Member

Snorch commented Mar 27, 2024

Please reorder patches, c/r support must be first patch and a test for it must be second patch. This way you would not break tests on bisect.

adrianreber and others added 2 commits March 27, 2024 00:03
CircleCI currently prints out the following warning:

   This job is using a deprecated image 'ubuntu-2004:202010-01', please update to a newer image

According to https://discuss.circleci.com/t/linux-image-deprecations-and-eol-for-2024/
the recommended image name is: "image: default"

Signed-off-by: Adrian Reber <[email protected]>
Currently we have tabs + spaces on the wrapped line but the wrapped part
is not alligned to the opening bracket.

Fixes: bbe26d1b7 ("timer: fix allignment in function definition")
Signed-off-by: Pavel Tikhomirov <[email protected]>
@@ -21,6 +21,7 @@ message ip_opts_entry {
optional bool pktinfo = 5;
optional uint32 tos = 6;
optional uint32 ttl = 7;
optional bool transparent = 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excess tab -> bad allignment.

@Snorch
Copy link
Member

Snorch commented Mar 27, 2024

This PR tries to do a good thing. But actually there it misses a bigger problem with IP_TRANSPARENT option, this option allows to bind to non-local address, so if we modify the test a bit to actually do the bind to non-local address:

--- a/test/zdtm/static/sock_ip_opts00.c
+++ b/test/zdtm/static/sock_ip_opts00.c
@@ -5,6 +5,7 @@
 #include <linux/in.h>
 #include <linux/ip.h>
 #include <linux/in6.h>
+#include <arpa/inet.h>
 
 #include "zdtmtst.h"
 
@@ -83,6 +84,21 @@ int main(int argc, char **argv)
                                goto close;
                        }
                }
+
+               if (sk_confs[i].domain == AF_INET) {
+                       socklen_t len = sizeof(struct sockaddr_in);
+                       struct sockaddr_in addr;
+
+                       memset(&addr, 0, sizeof(addr));
+                       addr.sin_family = AF_INET;
+                       addr.sin_addr.s_addr = inet_addr("5.5.5.5");
+                       addr.sin_port = htons(5555);
+
+                       if (bind(sk_confs[i].sk, (struct sockaddr *)&addr, len) < 0) {
+                               pr_perror("bind on 5.5.5.5:5555 failed");
+                               goto close;
+                       }
+               }
        }
 
        test_daemon();

We see that nothing works:

[root@turmoil criu]# ./test/zdtm.py run -t zdtm/static/sock_ip_opts00
userns is supported
Checking feature ipv6_freebind
ipv6_freebind is supported
=== Run 1/1 ================ zdtm/static/sock_ip_opts00
===================== Run zdtm/static/sock_ip_opts00 in ns =====================
 DEP       sock_ip_opts00.d
 DEP       sock_ip_opts01.d
 CC        sock_ip_opts00.o
 LINK      sock_ip_opts00
Start test
Test is SUID
./sock_ip_opts00 --pidfile=sock_ip_opts00.pid --outfile=sock_ip_opts00.out
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/sock_ip_opts00/104/1/restore.log
------------------------ grep Error ------------------------
b'(00.002868)      1: No ipcns-sem-11.img image'
b'(00.003590)      1: net: Try to restore a link 10:1:lo'
b'(00.003707)      1: net: Restoring link lo type 1'
b'(00.004127)      1: net: \tRunning ip addr restore'
b'Error: ipv4: Address already assigned.'
b'Error: ipv6: address already assigned.'
b'(00.030163)      4: \t\tCreate fd for 3'
b'(00.030170)      1: `- render 1 iovs (0x7f09c5a79000:8192...)'
b'(00.030177)      4: inet: \tRestore: family AF_INET    type SOCK_DGRAM     proto IPPROTO_UDP      port 5555 state TCP_CLOSE        src_addr 5.5.5.5'
b'(00.030182)      1: Restore via sigreturn'
b"(00.030217)      4: Error (criu/sk-inet.c:1033): inet: Can't bind inet socket (id 12): Cannot assign requested address"
b'(00.030229)      4: Error (criu/files.c:1213): Unable to open fd=4 id=0xc'
b'(00.030838)      1: Error (criu/cr-restore.c:1518): 4 exited, status=1'
b'(00.031589) mnt: Switching to new ns to clean ghosts'
b'(00.031695) Error (criu/cr-restore.c:2572): Restoring FAILED.'
------------------------ ERROR OVER ------------------------
############# Test zdtm/static/sock_ip_opts00 FAIL at CRIU restore #############
Test output: ================================

 <<< ================================
##################################### FAIL #####################################

Just dumping and restoring IP_TRANSPARENT is not enough, it should be properly integrated with the bind stage (I doubt that any app which uses it uses it without bind to non-local adress). Note that we have exactly the same problem with IP_FREEBIND (on ipv4) and to actually check IP_TRANSPARENT you need to:

--- a/test/zdtm/static/sock_ip_opts00.c
+++ b/test/zdtm/static/sock_ip_opts00.c
@@ -25,7 +25,6 @@ struct sk_opt {
 };
 
 struct sk_opt sk_opts_v4[] = {
-       { SOL_IP, IP_FREEBIND, IP_OPT_VAL },
        { SOL_IP, IP_PKTINFO, IP_OPT_VAL },
        { SOL_IP, IP_TTL, 32 },
        { SOL_IP, IP_TOS, IPTOS_TOS(IPTOS_THROUGHPUT) },
@@ -37,7 +36,6 @@ struct sk_opt sk_opts_v4[] = {
 #endif
 
 struct sk_opt sk_opts_v6[] = {
-       { SOL_IPV6, IPV6_FREEBIND, IP_OPT_VAL },
        { SOL_IPV6, IPV6_RECVPKTINFO, IP_OPT_VAL },
        { SOL_IPV6, IPV6_TRANSPARENT, IP_OPT_VAL },
 };

@juntongdeng
Copy link
Contributor Author

This PR tries to do a good thing. But actually there it misses a bigger problem with IP_TRANSPARENT option, this option allows to bind to non-local address, so if we modify the test a bit to actually do the bind to non-local address:

Just dumping and restoring IP_TRANSPARENT is not enough, it should be properly integrated with the bind stage (I doubt that any app which uses it uses it without bind to non-local adress). Note that we have exactly the same problem with IP_FREEBIND (on ipv4) and to actually check IP_TRANSPARENT you need to:

I think the cause of this problem is that in the current open_inet_sk() of sk-inet.c, the socket options are restored at the end, and listen(), connect(), bind() are all called before setsockopt().

I think we should change this order and set all socket options after creating the socket, before listen(), connect(), bind().

What do you think? If you agree, I can create a new pull request.

@Snorch
Copy link
Member

Snorch commented Apr 16, 2024

set all socket options after creating the socket, before listen(), connect(), bind().
What do you think? If you agree, I can create a new pull request.

Looks like a good idea.

Also one more possible complication: 1) set IP_TRANSPARENT 2) bind to non-local ip 3) unset IP_TRANSPARENT, this way we somehow need to guess to set this before bind and then unset it to restore properly.

So maybe we just need to extend 73a739b ("inet: set IP_FREEBIND before binding socket to an ipv6 address (v2)") to ipv4 to fix that.

Copy link

A friendly reminder that this PR had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.