
v6.10-rc2-scx1 #28

Merged
merged 748 commits into scx-6.10rc.y from scx-6.10rc2-scx1 on Jun 6, 2024

Conversation

@Byte-Lab (Contributor) commented Jun 6, 2024

No description provided.

htejun and others added 30 commits March 8, 2024 07:53
scx: Make exit debug dump buffer resizable
scx: Trivial updates from patch splitting
reset_idle_masks() is called while loading a BPF scheduler to mark all CPUs
as idle for scx_bpf_select_cpu_dfl(). It used cpumask_setall(), which could
make scx_bpf_select_cpu_dfl() pick a CPU that is possible but not online.

Note that such a spurious pick can only happen once, and it's generally safe
to pick an ineligible CPU, so nothing should break, but the behavior isn't
ideal. In general, the initial values of the idle masks aren't that
important; they quickly get synchronized to the actual state as CPUs enter
and leave the idle state.

However, let's still use cpu_online_mask instead so that the idle masks are
initialized with online CPUs.
scx: Use cpu_online_mask when resetting idle masks
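
For illustration, a minimal sketch of what such a reset could look like (the
idle_masks layout is assumed here, not quoted from the patch):

  static void reset_idle_masks(void)
  {
          /* Consider only online CPUs idle at load time. */
          cpumask_copy(idle_masks.cpu, cpu_online_mask);
          cpumask_copy(idle_masks.smt, cpu_online_mask);
  }
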
The UEI macros were updated in a prior commit. Apply the changes to the
sched_ext selftests dir.

Signed-off-by: David Vernet <[email protected]>
Right now we're just printing what the user passes to SCX_ERROR(). This
can cause the output from that error message to appear on the same line
as the results output from the test runner. Let's append a newline.

Signed-off-by: David Vernet <[email protected]>
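
For illustration only, the kind of one-line change this describes; the macro
name comes from the commit message and the printing primitive is illustrative:

  /* terminate the message inside the helper so the test runner's
   * results start on their own line */
  #define SCX_ERROR(fmt, args...) fprintf(stderr, fmt "\n", ##args)
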
scx: Update selftests to use new UEI macros
…dly NULL

Make the sanity check a bit more concise and ensure that ops.cgroup_move()
is never called with a NULL source cgroup.
…ugh ops.cgroup_prep_move()

sched_move_task() takes an early exit if the source and destination are
identical. This triggers the warning in scx_cgroup_can_attach() as it leaves
p->scx.cgrp_moving_from uncleared.

Update the cgroup migration path so that ops.cgroup_prep_move() is skipped
for identity migrations so that its invocations always match
ops.cgroup_move() one-to-one.
scx: cgroup: Fix mismatch between `ops.cgroup_prep_move()` and `ops.cgroup_move()` invocations
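
A rough sketch of the idea in scx_cgroup_can_attach(); helper names such as
tg_cgrp()/css_tg() are assumptions for illustration, not quotes from the patch:

  cgroup_taskset_for_each(p, css, tset) {
          struct cgroup *from = tg_cgrp(task_group(p));   /* helper names assumed */
          struct cgroup *to = tg_cgrp(css_tg(css));

          /* sched_move_task() skips identity migrations; match that here so
           * every ops.cgroup_prep_move() has a matching ops.cgroup_move() */
          if (from == to)
                  continue;

          ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_prep_move, p, from, to);
          if (ret)
                  goto err;

          p->scx.cgrp_moving_from = from;
  }
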
In scx_select_cpu_dfl(), we currently migrate the waking task
to the CPU of the waker in the following scenario:

1. WAKE_SYNC is specified in wake_flags
2. There is at least one idle core in the system
3. The wakee can run on the waker CPU

The assumption implicit in (2) is that the system is undersaturated, and that
therefore the wakee's runqueue delay would not be made worse by migrating to
the waker's CPU rather than to an idle core. That doesn't always hold in
practice, though. Consider the following scenario:

1. The system is overloaded, and at least one core becomes idle
2. Some groups of pairs of tasks that communicate over IPC are spawned.
3. Sender tasks are running on cores that still have enqueued tasks from when
   the system was overloaded, and they repeatedly wake wakee tasks with
   WAKE_SYNC.
4. The waker tasks observe that the system is underloaded, and so think that
   it's optimal for the wakee to be migrated to their CPU despite that CPU
   having a deep runqueue.

This can cause serious performance regressions for such workloads. For example,
hackbench regresses by nearly 10x relative to EEVDF:

[1]+ ./scx_simple > /dev/null 2> /dev/null &
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 2.944
[root@virtme-ng bin]# fg
./scx_simple > /dev/null 2> /dev/null
^C
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.345

What we really want is to only migrate to the waker CPU if nobody else
is already enqueued there. This will cause tasks to fan out over any
idle CPUs when they're available if the waker's rq is overloaded, and
then eventually to start enjoying wakeups on the waker's CPU once load
has been distributed and tasks are no longer piling up on a subset of
cores.
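
A sketch of what the updated check in scx_select_cpu_dfl() could look like;
the scx.local_dsq.nr field and the surrounding conditions are assumptions for
illustration, not the patch's exact code:

  if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
      !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING)) {
          cpu = smp_processor_id();

          /* only pull the wakee to the waker's CPU if nothing is queued there */
          if (cpumask_test_cpu(cpu, p->cpus_ptr) &&
              cpu_rq(cpu)->scx.local_dsq.nr == 0)
                  goto cpu_found;
  }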

With this patch, the regression is addressed:

[root@virtme-ng bin]# ./scx_simple > /dev/null &
[1] 336
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.348
[root@virtme-ng bin]# fg
./scx_simple > /dev/null
^CEXIT: BPF scheduler unregistered
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.352

Signed-off-by: David Vernet <[email protected]>
scx: Update conditions for WAKE_SYNC migration
The flatcg scheduler uses a rb_node type - struct cgv_node - to keep
track of vtime. On cgroup init, a cgv_node is created and stashed in a
hashmap - cgv_node_stash - for later use. In cgroup_enqueued and
try_pick_next_cgroup, the node is inserted into the rbtree, which
required removing it from the stash before this patch's changes.

This patch makes cgv_node refcounted, which allows keeping it in the
stash for the entirety of the cgroup's lifetime. Unnecessary
bpf_kptr_xchg's and other boilerplate can be removed as a result.

Note that in addition to bpf_refcount patches, which have been upstream
for quite some time, this change depends on a more recent series [0].

  [0]: https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Dave Marchevsky <[email protected]>
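
As a rough sketch of the refcounted pattern (type layout and helper usage are
assumed, not copied from scx_flatcg.bpf.c), the stash can keep its node while
an acquired reference goes into the rbtree:

  struct cgv_node {
          struct bpf_rb_node      rb_node;
          struct bpf_refcount     refcount;       /* makes the node shareable */
          __u64                   cvtime;
          __u64                   cgid;
  };

  /* acquire a second reference from the stashed node and insert it into the
   * rbtree; the stash keeps its copy for the cgroup's whole lifetime */
  cgv_node = bpf_refcount_acquire(stash->node);
  if (cgv_node) {
          bpf_spin_lock(&cgv_tree_lock);
          bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
          bpf_spin_unlock(&cgv_tree_lock);
  }
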
There may be cases where a scheduler wants to cause itself to exit under
normal circumstances. For example, if a scheduler does not support hotplug,
it may want to exit to user space when a CPU is hotplugged or unplugged so
that the user-space component knows it should restart the scheduler.

To enable this, we implement a new kfunc called scx_bpf_exit_bstr().
This kfunc shares a good amount of logic with scx_bpf_error_bstr(), as
both of them can plumb an exit message from BPF to user space. In
addition, we also augment struct scx_exit_info to include an exit code
field.

Signed-off-by: David Vernet <[email protected]>
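
The BPF-side declaration presumably mirrors scx_bpf_error_bstr(); an assumed
sketch:

  void scx_bpf_exit_bstr(s64 exit_code, char *fmt,
                         unsigned long long *data, u32 data__sz) __ksym;
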
As with scx_bpf_error(), we want schedulers to have an ergonomic API for
exiting execution. Let's add an scx_bpf_exit() macro wrapper that does
nearly the same thing as scx_bpf_error(), but invokes
scx_bpf_exit_bstr() instead of scx_bpf_error_bstr().

Signed-off-by: David Vernet <[email protected]>
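
A simplified sketch of such a wrapper, relying on the ___bpf_narg()/___bpf_fill()
helpers from bpf_helpers.h; the in-tree macro also does compile-time format
checking:

  #define scx_bpf_exit(code, fmt, args...)                                \
  ({                                                                      \
          static char ___fmt[] = fmt;                                     \
          unsigned long long ___param[___bpf_narg(args) ?: 1] = {};       \
                                                                          \
          ___bpf_fill(___param, args);                                    \
          scx_bpf_exit_bstr((code), ___fmt, ___param, sizeof(___param));  \
  })
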
We no longer have scx_bpf_switch_all(). Let's update the test to use
__COMPAT_SCX_OPS_SWITCH_PARTIAL. Along the way, make it less flaky.

Signed-off-by: David Vernet <[email protected]>
Let's add a selftest that verifies that we can use scx_bpf_exit() to exit
from various callbacks in the program.

Signed-off-by: David Vernet <[email protected]>
Allow clean exits from the scheduler with scx_bpf_exit()
scx_flatcg: Keep cgroup rb nodes stashed
Where 0 fields are ignored if not present.
scx: Use the new struct_ops compatibility mechanism
SCX_OPS_SWITCH_PARTIAL was added recently but was assigned the first bit,
shifting all the other values. This breaks backward compatibility, as flag
values are hard-coded into e.g. sched_ext_ops definitions. Let's reorder the
enum so that the existing values aren't shifted around.
instead of btf__find_by_name_kind() with BTF_KIND_ENUM, as it also needs to
read BTF_KIND_ENUM64 entries.
We'll want to be able to issue BPF_PROG_RUN calls in order to
synchronously call into the kernel from user space scheduling
components. The most natural prog type for this seems to be
BPF_PROG_TYPE_SYSCALL, which is a very safe program type. Let's allow
invoking the core kfuncs from this program type as well.

Signed-off-by: David Vernet <[email protected]>
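
A sketch of the registration side, assuming a kfunc-set name rather than the
exact in-tree identifier:

  static int __init scx_init(void)
  {
          int ret;

          ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
                                          &scx_kfunc_set_any);
          if (ret)
                  return ret;

          /* also allow the same kfuncs from BPF_PROG_TYPE_SYSCALL programs */
          return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
                                           &scx_kfunc_set_any);
  }
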
Now that we can invoke scx kfuncs inside of a BPF_PROG_TYPE_SYSCALL
program, let's add some selftests that validate the behavior.

Signed-off-by: David Vernet <[email protected]>
Dr. David Alan Gilbert and others added 26 commits June 3, 2024 16:52
'scale_test_def' is unused since commit 3762a39 ("selftests/bpf: Split out
bpf_verif_scale selftests into multiple tests"). Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
'key_t' is unused in a couple of files since the original commit 60dd49e
("selftests/bpf: Add test for bpf array map iterators"). Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
'libcap' is unused since commit b1c2768 ("bpf: selftests: Remove libcap
usage from test_verifier"). Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Correct typo in bpftool profiler and change all instances of 'MATRICS' to
'METRICS' in the profiler.bpf.c file.

Signed-off-by: Swan Beaujard <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_bpf.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
The value of tx_prog_fd in run_options() should not be 0, so set it to -1 in
the else branch, and test it with the condition "if (tx_prog_fd > 0)", not
"if (tx_prog_fd)" or "if (tx_prog_fd >= 0)".

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/08b20ffc544324d40939efeae93800772a91a58e.1716446893.git.tanggeliang@kylinos.cn
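
A hypothetical sketch of the intent; the branch condition and names here are
illustrative, not the selftest's exact code:

  int tx_prog_fd;

  if (tx_prog_wanted)                     /* illustrative condition */
          tx_prog_fd = prog_fd[idx];      /* a real, positive fd */
  else
          tx_prog_fd = -1;                /* was 0, but 0 is a valid fd number */

  if (tx_prog_fd > 0) {                   /* the condition the patch settles on */
          /* ... attach and exercise the tx program ... */
  }
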
There's already a definition of i at the beginning of run_options(), so
there's no need to define a new one in the "if (tx_prog_fd > 0)" block.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/8d690682330a59361562bca75d6903253d16f312.1716446893.git.tanggeliang@kylinos.cn
Switch attachments to bpf_link using bpf_program__attach_sockmap() instead
of bpf_prog_attach().

This patch adds a new array progs[] to replace the prog_fd[] array, set in
populate_progs() for each program in the BPF object.

It also adds another new array links[] to save the attached bpf_links. They
are initialized to NULL in populate_progs(), set to the return values of
bpf_program__attach_sockmap(), and detached via bpf_link__detach().

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/32cf8376a810e2e9c719f8e4cfb97132ed2d1f9c.1716446893.git.tanggeliang@kylinos.cn
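
A sketch of the bpf_link-based flow described above, using the array names
from the commit message and trimming error handling:

  struct bpf_link *links[ARRAY_SIZE(progs)] = {};

  for (i = 0; i < ARRAY_SIZE(progs); i++) {
          links[i] = bpf_program__attach_sockmap(progs[i], map_fd[0]);
          if (!links[i])
                  fprintf(stderr, "attach_sockmap failed: %s\n", strerror(errno));
  }

  /* ... run the test ... */

  for (i = 0; i < ARRAY_SIZE(progs); i++)
          if (links[i])
                  bpf_link__detach(links[i]);
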
bpf_program__attach_sockmap() needs to take a parameter of type bpf_program
instead of an fd, so tx_prog_fd becomes useless. This patch uses a pointer
tx_prog to point to an item in the progs[] array.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/23b37f932c547dd1ebfe154bbc0b0e957be21ee6.1716446893.git.tanggeliang@kylinos.cn
The program fds can be obtained with bpf_program__fd(progs[]), so prog_fd
becomes useless. This patch drops it.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/9a6335e4d8dbab23c0d8906074457ceddd61e74b.1716446893.git.tanggeliang@kylinos.cn
The array size of map_fd[] is 9, not 8. This patch changes the code to use
the more general form ARRAY_SIZE(map_fd).

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/0972529ee01ebf8a8fd2b310bdec90831c94be77.1716446893.git.tanggeliang@kylinos.cn
The value of recv in msg_loop may be negative, e.g. on EWOULDBLOCK, so it's
necessary to check that it is positive before accumulating it into
bytes_recvd.

Fixes: 16962b2 ("bpf: sockmap, add selftests")
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/5172563f7c7b2a2e953cef02e89fc34664a7b190.1716446893.git.tanggeliang@kylinos.cn
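
A minimal sketch of the guarded accumulation, with variable names following
the commit message:

  recv = recvmsg(fd, &msg, flags);
  if (recv > 0)                   /* skip error returns such as EWOULDBLOCK */
          bytes_recvd += recv;
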
bpf_map_lookup_elem() is already invoked in bpf_prog3(), so there's no need
to invoke it again. This patch drops the redundant call.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Jakub Sitnicki <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/ea8458462b876ee445173e3effb535fd126137ed.1716446893.git.tanggeliang@kylinos.cn
@Gabant noticed in sched-ext/sched_ext#218 that our comparison is backwards in
dispatch_to_local_dsq() when determining whether the destination rq of a local
dispatch is idle. This could cause us to fail to send resched IPIs to remote
CPUs when a scheduler dispatches with SCX_DSQ_LOCAL_ON. Thankfully, the issue
doesn't happen when dispatching to a local DSQ from the ops.select_cpu() path,
as a resched IPI is sent by the core scheduler if a task is migrated to an idle
core on the wakeup path.

Let's fix it.

Reported-by: @Gabant
Signed-off-by: David Vernet <[email protected]>
We should have _some_ test that exercises that codepath.

Signed-off-by: David Vernet <[email protected]>
scx: Fix local DSQ comparison in dispatch_to_local_dsq()
The existing hotplug code is a bit brittle in that the sched_class->{online,
offline} callbacks can be invoked either from hotplug context or from domain
rebuild context. A callback is only invoked in one of the two contexts, with
the context that runs the callback setting rq->online accordingly to avoid
running the callback more times than necessary. Unfortunately, this causes
commit 2125c00 ("cgroup/cpuset: Make cpuset hotplug processing synchronous")
to break hotplug for us, because it makes the topology rq->online event happen
before the cpu hotplug rq->online event, thus preventing us from invoking the
cpu online callback in the scheduler.

This integration is fragile and hacky, so we'll instead call directly into
ext.c on the hotplug path. This will be added in a subsequent commit.

Signed-off-by: David Vernet <[email protected]>
The rq_online() and rq_offline() sched_class callbacks may be invoked on either
the domain creation path, or on the hotplug path. ext.c would previously only
invoke its hotplug logic when those callbacks are invoked on the hotplug path.
This turned out to be a bit brittle, as hotplug could break for sched_ext if we
first invoked rq_online() from the domain path, as it would cause us to skip
invoking it on the hotplug path due to hotplug implementation details.

To avoid this frailty (and fix a hotplug regression for us after merging
v6.10), let's instead just call directly into ext.c on the hotplug path. In
doing so, we can also move the ops.cpu_online() and ops.cpu_offline() callbacks
outside of the rq lock region, thus making them sleepable. A test for this will
be added in a subsequent patch.

Signed-off-by: David Vernet <[email protected]>
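
A very rough sketch of the shape of the change; the scx_rq_activate() name and
its placement are assumptions for illustration, not the exact in-tree code:

  int sched_cpu_activate(unsigned int cpu)
  {
          struct rq *rq = cpu_rq(cpu);

          /* ... existing activation work ... */

          /* notify ext.c directly, outside the rq lock, so that
           * ops.cpu_online() may sleep */
          scx_rq_activate(rq);

          return 0;
  }
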
Now that ops.cpu_online() and ops.cpu_offline() can be sleepable,
let's update their test BPF progs to be sleepable to validate.

Signed-off-by: David Vernet <[email protected]>
Tejun pointed out that if we set or unset the hotplug online flag before we
invoke the relevant hotplug callbacks, schedulers could get confused because
they could begin receiving callbacks on a CPU before (or after) they're
notified that a CPU is online (or offline).

Let's reverse the order to fix that.

Signed-off-by: David Vernet <[email protected]>
@Byte-Lab requested a review from htejun on June 6, 2024 21:48
@htejun merged commit 49374b6 into scx-6.10rc.y on Jun 6, 2024
1 check passed
@htejun deleted the scx-6.10rc2-scx1 branch on June 6, 2024 21:51
htejun added a commit that referenced this pull request Jun 6, 2024
scx_pair uses the default stride value of nr_cpu_ids / 2, which matches most
x86 SMT configurations. It does, however, allow specifying a custom stride
value with -S so that e.g. neighboring CPUs can be paired up. Not all stride
values work, though, and errors were not reported very well.

This patch improves error handling so that scx_pair fails with a clear error
message if CPUs can't be paired up with the specified stride value. scx_pair
now also prints out how CPUs are paired on startup.

This should address issues #28 and #29.
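
For illustration, a pairing/validation loop of the kind described (not
scx_pair's exact code): CPU i is paired with (i + stride) % nr_cpus, and the
program bails out with a clear error if that would leave a CPU unpaired or
pair it with itself.

  for (int i = 0; i < nr_cpus; i++) {
          int j = (i + stride) % nr_cpus;

          if (pair_id[i] >= 0)
                  continue;               /* already taken as someone's partner */
          if (j == i || pair_id[j] >= 0)
                  errx(1, "CPU %d cannot be paired with stride %d", i, stride);

          pair_id[i] = pair_id[j] = nr_pairs++;
          printf("pair %d: CPUs %d and %d\n", pair_id[i], i, j);
  }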