
v6.10-rc2-scx1 #28

Merged
merged 748 commits into scx-6.10rc.y from scx-6.10rc2-scx1 on Jun 6, 2024

Conversation

@Byte-Lab (Contributor) commented Jun 6, 2024

No description provided.

htejun and others added 30 commits March 8, 2024 07:53
scx: Make exit debug dump buffer resizable
scx: Trivial updates from patch splitting
reset_idle_masks() is called while loading a BPF scheduler to mark all CPUs
as idle for scx_bpf_select_cpu_dfl(). It used cpumask_setall(), which could
make scx_bpf_select_cpu_dfl() pick a CPU that is possible but not online.

Note that such a spurious pick can only happen once, and it's generally safe
to pick an ineligible CPU, so nothing should break, but the behavior isn't
ideal. In general, the initial values of the idle masks aren't that
important; they quickly get synchronized to the actual state as CPUs enter
and leave the idle state.

However, let's still use cpu_online_mask instead so that the idle masks are
initialized with online CPUs.
scx: Use cpu_online_mask when resetting idle masks
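
For illustration, a minimal sketch of what such a reset could look like (the
idle_masks layout is assumed here, not quoted from the patch):

  static void reset_idle_masks(void)
  {
          /* Consider only online CPUs idle at load time. */
          cpumask_copy(idle_masks.cpu, cpu_online_mask);
          cpumask_copy(idle_masks.smt, cpu_online_mask);
  }
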
The UEI macros were updated in a prior commit. Apply the changes to the
sched_ext selftests dir.

Signed-off-by: David Vernet <[email protected]>
Right now we're just printing what the user passes to SCX_ERROR(). This
can cause the output from that error message to appear on the same line
as the results output from the test runner. Let's append a newline.

Signed-off-by: David Vernet <[email protected]>
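
For illustration only, the kind of one-line change this describes; the macro
name comes from the commit message and the printing primitive is illustrative:

  /* terminate the message inside the helper so the test runner's
   * results start on their own line */
  #define SCX_ERROR(fmt, args...) fprintf(stderr, fmt "\n", ##args)
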
scx: Update selftests to use new UEI macros
…dly NULL

Make the sanity check a bit more concise and ensure that ops.cgroup_move()
is never called with a NULL source cgroup.
…ugh ops.cgroup_prep_move()

sched_move_task() takes an early exit if the source and destination are
identical. This triggers the warning in scx_cgroup_can_attach() as it leaves
p->scx.cgrp_moving_from uncleared.

Update the cgroup migration path so that ops.cgroup_prep_move() is skipped
for identity migrations so that its invocations always match
ops.cgroup_move() one-to-one.
scx: cgroup: Fix mismatch between `ops.cgroup_prep_move()` and `ops.cgroup_move()` invocations
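
A rough sketch of the idea in scx_cgroup_can_attach(); helper names such as
tg_cgrp()/css_tg() are assumptions for illustration, not quotes from the patch:

  cgroup_taskset_for_each(p, css, tset) {
          struct cgroup *from = tg_cgrp(task_group(p));   /* helper names assumed */
          struct cgroup *to = tg_cgrp(css_tg(css));

          /* sched_move_task() skips identity migrations; match that here so
           * every ops.cgroup_prep_move() has a matching ops.cgroup_move() */
          if (from == to)
                  continue;

          ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_prep_move, p, from, to);
          if (ret)
                  goto err;

          p->scx.cgrp_moving_from = from;
  }
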
In scx_select_cpu_dfl(), we currently migrate the waking task
to the CPU of the waker in the following scenario:

1. WAKE_SYNC is specified in wake_flags
2. There is at least one idle core in the system
3. The wakee can run on the waker CPU

The assumption implicit in (2) is that the system is undersaturated, and that
therefore the wakee's runqueue delay would not be made worse by migrating to
the waker's CPU rather than to an idle core. That doesn't always hold in
practice, though. Consider the following scenario:

1. The system is overloaded, and at least one core becomes idle
2. Some groups of pairs of tasks that communicate over IPC are spawned.
3. Sender tasks are running on cores that still have enqueued tasks from when
   the system was overloaded, and they repeatedly wake wakee tasks with
   WAKE_SYNC.
4. The waker tasks observe that the system is underloaded, and so think that
   it's optimal for the wakee to be migrated to their CPU despite that CPU
   having a deep runqueue.

This can cause serious performance regressions for such workloads. For example,
hackbench regresses by nearly 10x relative to EEVDF:

[1]+ ./scx_simple > /dev/null 2> /dev/null &
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 2.944
[root@virtme-ng bin]# fg
./scx_simple > /dev/null 2> /dev/null
^C
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.345

What we really want is to only migrate to the waker CPU if nobody else
is already enqueued there. This will cause tasks to fan out over any
idle CPUs when they're available if the waker's rq is overloaded, and
then eventually to start enjoying wakeups on the waker's CPU once load
has been distributed and tasks are no longer piling up on a subset of
cores.
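
A sketch of what the updated check in scx_select_cpu_dfl() could look like;
the scx.local_dsq.nr field and the surrounding conditions are assumptions for
illustration, not the patch's exact code:

  if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
      !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING)) {
          cpu = smp_processor_id();

          /* only pull the wakee to the waker's CPU if nothing is queued there */
          if (cpumask_test_cpu(cpu, p->cpus_ptr) &&
              cpu_rq(cpu)->scx.local_dsq.nr == 0)
                  goto cpu_found;
  }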

With this patch, the regression is addressed:

[root@virtme-ng bin]# ./scx_simple > /dev/null &
[1] 336
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.348
[root@virtme-ng bin]# fg
./scx_simple > /dev/null
^CEXIT: BPF scheduler unregistered
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.352

Signed-off-by: David Vernet <[email protected]>
scx: Update conditions for WAKE_SYNC migration
The flatcg scheduler uses a rb_node type - struct cgv_node - to keep
track of vtime. On cgroup init, a cgv_node is created and stashed in a
hashmap - cgv_node_stash - for later use. In cgroup_enqueued and
try_pick_next_cgroup, the node is inserted into the rbtree, which
required removing it from the stash before this patch's changes.

This patch makes cgv_node refcounted, which allows keeping it in the
stash for the entirety of the cgroup's lifetime. Unnecessary
bpf_kptr_xchg's and other boilerplate can be removed as a result.

Note that in addition to bpf_refcount patches, which have been upstream
for quite some time, this change depends on a more recent series [0].

  [0]: https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Dave Marchevsky <[email protected]>
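
As a rough sketch of the refcounted pattern (type layout and helper usage are
assumed, not copied from scx_flatcg.bpf.c), the stash can keep its node while
an acquired reference goes into the rbtree:

  struct cgv_node {
          struct bpf_rb_node      rb_node;
          struct bpf_refcount     refcount;       /* makes the node shareable */
          __u64                   cvtime;
          __u64                   cgid;
  };

  /* acquire a second reference from the stashed node and insert it into the
   * rbtree; the stash keeps its copy for the cgroup's whole lifetime */
  cgv_node = bpf_refcount_acquire(stash->node);
  if (cgv_node) {
          bpf_spin_lock(&cgv_tree_lock);
          bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
          bpf_spin_unlock(&cgv_tree_lock);
  }
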
There may be cases where a scheduler wants to cause itself to exit under
normal circumstances. For example, if a scheduler does not support hotplug,
it may want to exit to user space when a CPU is hotplugged or unplugged so
that the user-space component knows it should restart the scheduler.

To enable this, we implement a new kfunc called scx_bpf_exit_bstr().
This kfunc shares a good amount of logic with scx_bpf_error_bstr(), as
both of them can plumb an exit message from BPF to user space. In
addition, we also augment struct scx_exit_info to include an exit code
field.

Signed-off-by: David Vernet <[email protected]>
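
The BPF-side declaration presumably mirrors scx_bpf_error_bstr(); an assumed
sketch:

  void scx_bpf_exit_bstr(s64 exit_code, char *fmt,
                         unsigned long long *data, u32 data__sz) __ksym;
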
As with scx_bpf_error(), we want schedulers to have an ergonomic API for
exiting execution. Let's add an scx_bpf_exit() macro wrapper that does
nearly the same thing as scx_bpf_error(), but invokes
scx_bpf_exit_bstr() instead of scx_bpf_error_bstr().

Signed-off-by: David Vernet <[email protected]>
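
A simplified sketch of such a wrapper, relying on the ___bpf_narg()/___bpf_fill()
helpers from bpf_helpers.h; the in-tree macro also does compile-time format
checking:

  #define scx_bpf_exit(code, fmt, args...)                                \
  ({                                                                      \
          static char ___fmt[] = fmt;                                     \
          unsigned long long ___param[___bpf_narg(args) ?: 1] = {};       \
                                                                          \
          ___bpf_fill(___param, args);                                    \
          scx_bpf_exit_bstr((code), ___fmt, ___param, sizeof(___param));  \
  })
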
We no longer have scx_bpf_switch_all(). Let's update the test to use
__COMPAT_SCX_OPS_SWITCH_PARTIAL. Along the way, make it less flaky.

Signed-off-by: David Vernet <[email protected]>
Let's add a selftest that verifies that we can use scx_bpf_exit() to exit
from various callbacks in the program.

Signed-off-by: David Vernet <[email protected]>
Allow clean exits from the scheduler with scx_bpf_exit()
scx_flatcg: Keep cgroup rb nodes stashed
Where 0 fields are ignored if not present.
scx: Use the new struct_ops compatibility mechanism
SCX_OPS_SWITCH_PARTIAL was added recently but was assigned the first bit,
shifting all the other values. This breaks backward compatibility, as flag
values are hard-coded into e.g. sched_ext_ops definitions. Let's reorder the
enum so that the existing values aren't shifted around.
instead of btf__find_by_name_kind() with BTF_KIND_ENUM, as it also needs to
read BTF_KIND_ENUM64 entries.
We'll want to be able to issue BPF_PROG_RUN calls in order to
synchronously call into the kernel from user space scheduling
components. The most natural prog type for this seems to be
BPF_PROG_TYPE_SYSCALL, which is a very safe program type. Let's allow
invoking the core kfuncs from this program type as well.

Signed-off-by: David Vernet <[email protected]>
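
A sketch of the registration side, assuming a kfunc-set name rather than the
exact in-tree identifier:

  static int __init scx_init(void)
  {
          int ret;

          ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
                                          &scx_kfunc_set_any);
          if (ret)
                  return ret;

          /* also allow the same kfuncs from BPF_PROG_TYPE_SYSCALL programs */
          return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
                                           &scx_kfunc_set_any);
  }
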
Now that we can invoke scx kfuncs inside of a BPF_PROG_TYPE_SYSCALL
program, let's add some selftests that validate the behavior.

Signed-off-by: David Vernet <[email protected]>
Dr. David Alan Gilbert and others added 26 commits June 3, 2024 16:52
'scale_test_def' is unused since commit 3762a39 ("selftests/bpf: Split out
bpf_verif_scale selftests into multiple tests"). Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
'key_t' is unused in a couple of files since the original commit 60dd49e
("selftests/bpf: Add test for bpf array map iterators"). Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
'libcap' is unused since commit b1c2768 ("bpf: selftests: Remove libcap
usage from test_verifier"). Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Correct typo in bpftool profiler and change all instances of 'MATRICS' to
'METRICS' in the profiler.bpf.c file.

Signed-off-by: Swan Beaujard <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_bpf.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
The value of tx_prog_fd in run_options() should not be 0, so set it to -1 in
the else branch, and test it with the condition "if (tx_prog_fd > 0)", not
"if (tx_prog_fd)" or "if (tx_prog_fd >= 0)".

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/08b20ffc544324d40939efeae93800772a91a58e.1716446893.git.tanggeliang@kylinos.cn
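
A hypothetical sketch of the intent; the branch condition and names here are
illustrative, not the selftest's exact code:

  int tx_prog_fd;

  if (tx_prog_wanted)                     /* illustrative condition */
          tx_prog_fd = prog_fd[idx];      /* a real, positive fd */
  else
          tx_prog_fd = -1;                /* was 0, but 0 is a valid fd number */

  if (tx_prog_fd > 0) {                   /* the condition the patch settles on */
          /* ... attach and exercise the tx program ... */
  }
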
There's already a definition of i at the beginning of run_options(), so
there's no need to define a new one in the "if (tx_prog_fd > 0)" block.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/8d690682330a59361562bca75d6903253d16f312.1716446893.git.tanggeliang@kylinos.cn
Switch attachments to bpf_link using bpf_program__attach_sockmap() instead
of bpf_prog_attach().

This patch adds a new array progs[] to replace the prog_fd[] array, set in
populate_progs() for each program in the BPF object.

It also adds another new array links[] to save the attached bpf_links. They
are initialized to NULL in populate_progs(), set to the return values of
bpf_program__attach_sockmap(), and detached via bpf_link__detach().

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/32cf8376a810e2e9c719f8e4cfb97132ed2d1f9c.1716446893.git.tanggeliang@kylinos.cn
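
A sketch of the bpf_link-based flow described above, using the array names
from the commit message and trimming error handling:

  struct bpf_link *links[ARRAY_SIZE(progs)] = {};

  for (i = 0; i < ARRAY_SIZE(progs); i++) {
          links[i] = bpf_program__attach_sockmap(progs[i], map_fd[0]);
          if (!links[i])
                  fprintf(stderr, "attach_sockmap failed: %s\n", strerror(errno));
  }

  /* ... run the test ... */

  for (i = 0; i < ARRAY_SIZE(progs); i++)
          if (links[i])
                  bpf_link__detach(links[i]);
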
bpf_program__attach_sockmap() needs to take a parameter of type bpf_program
instead of an fd, so tx_prog_fd becomes useless. This patch uses a pointer
tx_prog to point to an item in the progs[] array.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/23b37f932c547dd1ebfe154bbc0b0e957be21ee6.1716446893.git.tanggeliang@kylinos.cn
The program fds can be obtained with bpf_program__fd(progs[]), so prog_fd
becomes useless. This patch drops it.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/9a6335e4d8dbab23c0d8906074457ceddd61e74b.1716446893.git.tanggeliang@kylinos.cn
The array size of map_fd[] is 9, not 8. This patch changes the code to use
the more general form ARRAY_SIZE(map_fd).

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/0972529ee01ebf8a8fd2b310bdec90831c94be77.1716446893.git.tanggeliang@kylinos.cn
The value of recv in msg_loop may be negative, e.g. on EWOULDBLOCK, so it's
necessary to check that it is positive before accumulating it into
bytes_recvd.

Fixes: 16962b2 ("bpf: sockmap, add selftests")
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/5172563f7c7b2a2e953cef02e89fc34664a7b190.1716446893.git.tanggeliang@kylinos.cn
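
A minimal sketch of the guarded accumulation, with variable names following
the commit message:

  recv = recvmsg(fd, &msg, flags);
  if (recv > 0)                   /* skip error returns such as EWOULDBLOCK */
          bytes_recvd += recv;
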
bpf_map_lookup_elem() is already invoked in bpf_prog3(), so there's no need
to invoke it again. This patch drops the redundant call.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/ea8458462b876ee445173e3effb535fd126137ed.1716446893.git.tanggeliang@kylinos.cn
@Gabant noticed in sched-ext/sched_ext#218 that our comparison is backwards in
dispatch_to_local_dsq() when determining whether the destination rq of a local
dispatch is idle. This could cause us to fail to send resched IPIs to remote
CPUs when a scheduler dispatches with SCX_DSQ_LOCAL_ON. Thankfully, the issue
doesn't happen when dispatching to a local DSQ from the ops.select_cpu() path,
as a resched IPI is sent by the core scheduler if a task is migrated to an idle
core on the wakeup path.

Let's fix it.

Reported-by: @Gabant
Signed-off-by: David Vernet <[email protected]>
We should have _some_ test that exercises that codepath.

Signed-off-by: David Vernet <[email protected]>
scx: Fix local DSQ comparison in dispatch_to_local_dsq()
The existing hotplug code is a bit brittle in that the sched_class->{online,
offline} callbacks can be invoked either from hotplug context or from domain
rebuild context. A callback is only invoked in one of the two contexts, with
the context that runs the callback setting rq->online accordingly to avoid
running the callback more times than necessary. Unfortunately, this causes
commit 2125c00 ("cgroup/cpuset: Make cpuset hotplug processing synchronous")
to break hotplug for us, because it makes the topology rq->online event happen
before the cpu hotplug rq->online event, thus preventing us from invoking the
cpu online callback in the scheduler.

This integration is fragile and hacky, so we'll instead call directly into
ext.c on the hotplug path. This will be added in a subsequent commit.

Signed-off-by: David Vernet <[email protected]>
The rq_online() and rq_offline() sched_class callbacks may be invoked on either
the domain creation path, or on the hotplug path. ext.c would previously only
invoke its hotplug logic when those callbacks are invoked on the hotplug path.
This turned out to be a bit brittle, as hotplug could break for sched_ext if we
first invoked rq_online() from the domain path, as it would cause us to skip
invoking it on the hotplug path due to hotplug implementation details.

To avoid this frailty (and fix a hotplug regression for us after merging
v6.10), let's instead just call directly into ext.c on the hotplug path. In
doing so, we can also move the ops.cpu_online() and ops.cpu_offline() callbacks
outside of the rq lock region, thus making them sleepable. A test for this will
be added in a subsequent patch.

Signed-off-by: David Vernet <[email protected]>
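
A very rough sketch of the shape of the change; the scx_rq_activate() name and
its placement are assumptions for illustration, not the exact in-tree code:

  int sched_cpu_activate(unsigned int cpu)
  {
          struct rq *rq = cpu_rq(cpu);

          /* ... existing activation work ... */

          /* notify ext.c directly, outside the rq lock, so that
           * ops.cpu_online() may sleep */
          scx_rq_activate(rq);

          return 0;
  }
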
Now that ops.cpu_online() and ops.cpu_offline() can be sleepable,
let's update their test BPF progs to be sleepable to validate.

Signed-off-by: David Vernet <[email protected]>
Tejun pointed out that if we set or unset the hotplug online flag before we
invoke the relevant hotplug callbacks, schedulers could get confused because
they could begin receiving callbacks on a CPU before (or after) they're
notified that a CPU is online (or offline).

Let's reverse the order to fix that.

Signed-off-by: David Vernet <[email protected]>
@Byte-Lab requested a review from htejun on June 6, 2024 21:48
@htejun merged commit 49374b6 into scx-6.10rc.y on Jun 6, 2024
1 check passed
@htejun deleted the scx-6.10rc2-scx1 branch on June 6, 2024 21:51
htejun added a commit that referenced this pull request Jun 6, 2024
scx_pair uses the default stride value of nr_cpu_ids / 2, which matches most
x86 SMT configurations. It does, however, allow specifying a custom stride
value with -S so that e.g. neighboring CPUs can be paired up. Not all stride
values work, though, and errors were not reported very well.

This patch improves error handling so that scx_pair fails with a clear error
message if CPUs can't be paired up with the specified stride value. scx_pair
now also prints out how CPUs are paired on startup.

This should address issues #28 and #29.
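
For illustration, a pairing/validation loop of the kind described (not
scx_pair's exact code): CPU i is paired with (i + stride) % nr_cpus, and the
program bails out with a clear error if that would leave a CPU unpaired or
pair it with itself.

  for (int i = 0; i < nr_cpus; i++) {
          int j = (i + stride) % nr_cpus;

          if (pair_id[i] >= 0)
                  continue;               /* already taken as someone's partner */
          if (j == i || pair_id[j] >= 0)
                  errx(1, "CPU %d cannot be paired with stride %d", i, stride);

          pair_id[i] = pair_id[j] = nr_pairs++;
          printf("pair %d: CPUs %d and %d\n", pair_id[i], i, j);
  }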