v6.10-rc2-scx1 #28
Merged
Conversation
scx: Make exit debug dump buffer resizable
scx: Trivial updates from patch splitting
reset_idle_masks() is called while loading a BPF scheduler to mark all CPUs as idle for scx_bpf_select_cpu_dfl(). It was using cpumask_setall(), which could make scx_bpf_select_cpu_dfl() pick a CPU that is possible but not online. Note that such a spurious pick can only happen once, and it's generally safe to pick an ineligible CPU, so nothing should be broken, but the behavior isn't ideal. In general, the initial values of the idle masks aren't that important; they quickly get synchronized to the actual state through CPUs entering and leaving the idle state. However, let's still use cpu_online_mask so that the idle masks are initialized with online CPUs.
scx: Use cpu_online_mask when resetting idle masks
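A minimal sketch of what this reset could look like after the change (assuming an idle_masks structure with plain cpumask fields; this is not a quote of the actual patch):

```c
/*
 * Sketch only, not the actual patch: initialize the idle masks from
 * cpu_online_mask instead of marking every possible CPU idle.
 * "idle_masks" abbreviates the real per-mask bookkeeping in ext.c.
 */
static void reset_idle_masks(void)
{
	/* Before: cpumask_setall(idle_masks.cpu) could mark CPUs that are
	 * possible but not online as idle. */
	cpumask_copy(idle_masks.cpu, cpu_online_mask);
	cpumask_copy(idle_masks.smt, cpu_online_mask);
}
```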
The UEI macros were updated in a prior commit. Apply the changes to the sched_ext selftests dir. Signed-off-by: David Vernet <[email protected]>
Right now we're just printing what the user passes to SCX_ERROR(). This can cause the output from that error message to appear on the same line as the results output from the test runner. Let's append a newline. Signed-off-by: David Vernet <[email protected]>
scx: Update selftests to use new UEI macros
…dly NULL
Make the sanity check a bit more concise and ensure that ops.cgroup_move() is never called with a NULL source cgroup.
…ugh ops.cgroup_prep_move()
sched_move_task() takes an early exit if the source and destination are identical. This triggers the warning in scx_cgroup_can_attach() as it leaves p->scx.cgrp_moving_from uncleared. Update the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations and its invocations always match ops.cgroup_move() one-to-one.
scx: cgroup: Fix mismatch between `ops.cgroup_prep_move()` and `ops.cgroup_move()` invocations
In scx_select_cpu_dfl(), we currently migrate the waking task to the CPU of the waker in the following scenario:

1. WAKE_SYNC is specified in wake_flags
2. There is at least one idle core in the system
3. The wakee can run on the waker CPU

The assumption implicit in (2) is that the system is undersaturated, and that therefore the wakee's runqueue delay would not be impacted by migrating to the waker's CPU rather than to an idle core. This assumption doesn't always hold in practice, though. Consider the following scenario:

1. The system is overloaded, and at least one core becomes idle
2. Some groups of pairs of tasks that communicate over IPC are spawned
3. Sender tasks are running on cores that still have enqueued tasks from when the system was overloaded, and they repeatedly wake waker tasks with WAKE_SYNC
4. The waker tasks observe that the system is underloaded, and so think that it's optimal for the wakee to be migrated to their CPU despite having a deep runqueue

This can cause serious performance regressions for such workloads. For example, hackbench regresses by nearly 10x relative to EEVDF:

[1]+ ./scx_simple > /dev/null 2> /dev/null &
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 2.944
[root@virtme-ng bin]# fg
./scx_simple > /dev/null 2> /dev/null
^C
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.345

What we really want is to only migrate to the waker CPU if nobody else is already enqueued there. This will cause tasks to fan out over any idle CPUs when they're available if the waker's rq is overloaded, and then eventually to start enjoying wakeups on the waker's CPU once load has been distributed and tasks are no longer piling up on a subset of cores. With this patch, the regression is addressed:

[root@virtme-ng bin]# ./scx_simple > /dev/null &
[1] 336
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.348
[root@virtme-ng bin]# fg
./scx_simple > /dev/null
^CEXIT: BPF scheduler unregistered
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.352

Signed-off-by: David Vernet <[email protected]>
scx: Update conditions for WAKE_SYNC migration
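A rough sketch of the kind of condition described above (the idle_masks variable and the scx.local_dsq.nr field name are assumptions for illustration, not quoted from the patch):

```c
/*
 * Sketch of the updated WAKE_SYNC check: only treat a sync wakeup as a
 * reason to pick the waker's CPU if nothing is already queued there.
 */
if ((wake_flags & SCX_WAKE_SYNC) &&
    cpumask_test_cpu(cpu, p->cpus_ptr) &&
    !cpumask_empty(idle_masks.cpu) &&
    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
	/*
	 * The waker's CPU is allowed, idle capacity exists elsewhere, and
	 * nothing is piled up on the waker's CPU, so a sync wakeup there
	 * won't add runqueue delay.
	 */
	goto cpu_found;
}
```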
The flatcg scheduler uses a rb_node type - struct cgv_node - to keep track of vtime. On cgroup init, a cgv_node is created and stashed in a hashmap - cgv_node_stash - for later use. In cgroup_enqueued and try_pick_next_cgroup, the node is inserted into the rbtree, which required removing it from the stash before this patch's changes. This patch makes cgv_node refcounted, which allows keeping it in the stash for the entirety of the cgroup's lifetime. Unnecessary bpf_kptr_xchg's and other boilerplate can be removed as a result. Note that in addition to bpf_refcount patches, which have been upstream for quite some time, this change depends on a more recent series [0]. [0]: https://lore.kernel.org/bpf/[email protected]/ Signed-off-by: Dave Marchevsky <[email protected]>
There may be cases where a scheduler wants to cause itself to exit under normal circumstances. For example, if a scheduler does not support hotplug, it may want to exit to user space if a CPU is hotplugged or unplugged to let it know that it should restart itself. To enable this, we implement a new kfunc called scx_bpf_exit_bstr(). This kfunc shares a good amount of logic with scx_bpf_error_bstr(), as both of them can plumb an exit message from BPF to user space. In addition, we also augment struct scx_exit_info to include an exit code field. Signed-off-by: David Vernet <[email protected]>
As with scx_bpf_error(), we want schedulers to have an ergonomic API for exiting execution. Let's add an scx_bpf_exit() macro wrapper that does nearly the same thing as scx_bpf_error(), but invokes scx_bpf_exit_bstr() instead of scx_bpf_error_bstr(). Signed-off-by: David Vernet <[email protected]>
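As an illustration of how a scheduler might use the macro (hypothetical, not taken from the patch; the exit code is scheduler-defined):

```c
/*
 * Hypothetical usage: a scheduler that doesn't support hotplug exits back
 * to user space when a CPU comes online so its loader can restart it.
 * ECODE_RESTART is a scheduler-defined exit code, not a kernel constant.
 */
#define ECODE_RESTART	1

void BPF_STRUCT_OPS(simple_cpu_online, s32 cpu)
{
	scx_bpf_exit(ECODE_RESTART, "CPU %d came online, restarting", cpu);
}
```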
We no longer have scx_bpf_switch_all(). Let's update the test to use __COMPAT_SCX_OPS_SWITCH_PARTIAL. Along the way, make it less flaky. Signed-off-by: David Vernet <[email protected]>
Let's add a selftest that verifies us using scx_bpf_exit() to exit from various callbacks in the program. Signed-off-by: David Vernet <[email protected]>
Allow clean exits from the scheduler with scx_bpf_exit()
scx_flatcg: Keep cgroup rb nodes stashed
Where 0 fields are ignored if not present.
Pull in bpf/for-next
scx: Use the new struct_ops compatibility mechanism
SCX_OPS_SWITCH_PARTIAL was added recently but assigned the first bit, shifting all the other values. This breaks backward compatibility, as flag values are hard-coded into e.g. sched_ext_ops definitions. Let's reorder the enums so that the existing values aren't shifted around.
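Illustratively, preserving backward compatibility means appending the new flag rather than inserting it at bit 0 (the flag list is abridged here):

```c
/*
 * Illustrative ordering: the new flag takes the next free bit so the bits
 * of previously published flags stay where schedulers already hard-coded
 * them.
 */
enum scx_ops_flags {
	SCX_OPS_KEEP_BUILTIN_IDLE	= 1LLU << 0,
	SCX_OPS_ENQ_LAST		= 1LLU << 1,
	SCX_OPS_ENQ_EXITING		= 1LLU << 2,
	/* appended instead of taking bit 0 */
	SCX_OPS_SWITCH_PARTIAL		= 1LLU << 3,
};
```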
instead of btf__find_by_name_kind() with BTF_KIND_ENUM, as it also needs to read BTF_KIND_ENUM64 entries.
sched_ext: API update fallouts
We'll want to be able to issue BPF_PROG_RUN calls in order to synchronously call into the kernel from user space scheduling components. The most natural prog type for this seems to be BPF_PROG_TYPE_SYSCALL, which is a very safe program type. Let's allow invoking the core kfuncs from this program type as well. Signed-off-by: David Vernet <[email protected]>
Now that we can invoke scx kfuncs inside of a BPF_PROG_TYPE_SYSCALL program, let's add some selftests that validate the behavior. Signed-off-by: David Vernet <[email protected]>
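For example, a user-space component could synchronously call into an scx kfunc along these lines (the kfunc choice, program name, and skeleton name are illustrative; this is not the selftest code):

```c
/* BPF side: a syscall program that calls into an scx kfunc.
 * scx_bpf_create_dsq() is assumed to be among the kfuncs enabled for
 * BPF_PROG_TYPE_SYSCALL. */
SEC("syscall")
int setup_dsq(void *ctx)
{
	return scx_bpf_create_dsq(0 /* dsq_id */, -1 /* any NUMA node */);
}

/* User-space side: invoke it synchronously via BPF_PROG_RUN, assuming a
 * generated skeleton named "skel". */
LIBBPF_OPTS(bpf_test_run_opts, opts);
err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.setup_dsq), &opts);
```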
'scale_test_def' is unused since commit 3762a39 ("selftests/bpf: Split out bpf_verif_scale selftests into multiple tests"). Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
'key_t' is unused in a couple of files since the original commit 60dd49e ("selftests/bpf: Add test for bpf array map iterators"). Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
'libcap' is unused since commit b1c2768 ("bpf: selftests: Remove libcap usage from test_verifier"). Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
Correct typo in bpftool profiler and change all instances of 'MATRICS' to 'METRICS' in the profiler.bpf.c file. Signed-off-by: Swan Beaujard <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Quentin Monnet <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_bpf.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
The value of tx_prog_fd in run_options() should not be 0, so set it to -1 in the else branch and test it using the "if (tx_prog_fd > 0)" condition, not "if (tx_prog_fd)" or "if (tx_prog_fd >= 0)". Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/08b20ffc544324d40939efeae93800772a91a58e.1716446893.git.tanggeliang@kylinos.cn
There's already a definition of i in run_options() at the beginning, no need to define a new one in "if (tx_prog_fd > 0)" block. Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/8d690682330a59361562bca75d6903253d16f312.1716446893.git.tanggeliang@kylinos.cn
Switch attachments to bpf_link using bpf_program__attach_sockmap() instead of bpf_prog_attach(). This patch adds a new array progs[] to replace the prog_fd[] array, set in populate_progs() for each program in the BPF object, and another new array links[] to save the attached bpf_links. They are initialized to NULL in populate_progs(), set to the return values of bpf_program__attach_sockmap(), and detached by bpf_link__detach(). Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/32cf8376a810e2e9c719f8e4cfb97132ed2d1f9c.1716446893.git.tanggeliang@kylinos.cn
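The attach/detach flow described above is roughly the following (simplified sketch; which program goes with which sockmap is elided and error handling is trimmed):

```c
/* attach each program via a bpf_link instead of bpf_prog_attach() */
for (i = 0; i < ARRAY_SIZE(progs); i++) {
	links[i] = bpf_program__attach_sockmap(progs[i], map_fd[0]);
	if (!links[i])
		return -1;
}

/* ... later, tear down through the links ... */
for (i = 0; i < ARRAY_SIZE(links); i++)
	if (links[i])
		bpf_link__detach(links[i]);
```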
bpf_program__attach_sockmap() needs to take a parameter of type bpf_program instead of an fd, so tx_prog_fd becomes useless. This patch uses a pointer tx_prog to point to an item in the progs[] array. Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/23b37f932c547dd1ebfe154bbc0b0e957be21ee6.1716446893.git.tanggeliang@kylinos.cn
The program fds can be obtained with bpf_program__fd(progs[]), so prog_fd becomes useless. This patch drops it. Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/9a6335e4d8dbab23c0d8906074457ceddd61e74b.1716446893.git.tanggeliang@kylinos.cn
The array size of map_fd[] is 9, not 8. This patch changes it to the more general form ARRAY_SIZE(map_fd). Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/0972529ee01ebf8a8fd2b310bdec90831c94be77.1716446893.git.tanggeliang@kylinos.cn
The value of recv in msg_loop may be negative, like EWOULDBLOCK, so it's necessary to check if it is positive before accumulating it to bytes_recvd. Fixes: 16962b2 ("bpf: sockmap, add selftests") Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/5172563f7c7b2a2e953cef02e89fc34664a7b190.1716446893.git.tanggeliang@kylinos.cn
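Concretely, the accumulation now only happens for a successful receive, roughly (variable names assumed for illustration):

```c
/* only count bytes when recvmsg() actually returned data; on failure it
 * returns a negative value (e.g. when the socket would block) */
recv = recvmsg(fd, &msg, flags);
if (recv > 0)
	s->bytes_recvd += recv;
```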
bpf_map_lookup_elem is invoked in bpf_prog3() already, no need to invoke it again. This patch drops it. Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/ea8458462b876ee445173e3effb535fd126137ed.1716446893.git.tanggeliang@kylinos.cn
Pull bpf/for-next
@Gabant noticed in sched-ext/sched_ext#218 that our comparison is backwards in dispatch_to_local_dsq() when determining if the destination rq in a local dispatch is idle. This could cause us to fail to send resched IPIs to those remote CPUs when a scheduler uses SCX_DSQ_LOCAL_ON when dispatching. Thankfully, the issue doesn't happen if dispatching to a local DSQ from the ops.select_cpu() path, as a resched IPI is sent by the core scheduler if a task is migrated to an idle core on the wakeup path. Let's fix it. Reported-by: @Gabant Signed-off-by: David Vernet <[email protected]>
We should have _some_ test that exercises that codepath. Signed-off-by: David Vernet <[email protected]>
scx: Fix local DSQ comparison in dispatch_to_local_dsq()
The existing hotplug code is a bit brittle in that the sched_class->{online, offline} callbacks can be invoked both from hotplug context and from domain rebuild context. A callback is only invoked in one of the two contexts, with the context that runs the callback setting rq->online accordingly to avoid doing the callback more times than necessary. Unfortunately, this causes commit 2125c00 ("cgroup/cpuset: Make cpuset hotplug processing synchronous") to break hotplug for us because it makes the topology rq->online event happen before the cpu hotplug rq->online event, thus preventing us from invoking the cpu online callback in the scheduler. This integration is fragile and hacky, so we'll instead call directly into ext.c on the hotplug path. This will be added in a subsequent commit. Signed-off-by: David Vernet <[email protected]>
The rq_online() and rq_offline() sched_class callbacks may be invoked on either the domain creation path, or on the hotplug path. ext.c would previously only invoke its hotplug logic when those callbacks are invoked on the hotplug path. This turned out to be a bit brittle, as hotplug could break for sched_ext if we first invoked rq_online() from the domain path, as it would cause us to skip invoking it on the hotplug path due to hotplug implementation details. To avoid this frailty (and fix a hotplug regression for us after merging v6.10), let's instead just call directly into ext.c on the hotplug path. In doing so, we can also move the ops.cpu_online() and ops.cpu_offline() callbacks outside of the rq lock region, thus making them sleepable. A test for this will be added in a subsequent patch. Signed-off-by: David Vernet <[email protected]>
Now that ops.cpu_online() and ops.cpu_offline() can be sleepable, let's update their test BPF progs to be sleepable to validate. Signed-off-by: David Vernet <[email protected]>
scx: Fix hotplug
Tejun pointed out that if we set or unset the hotplug online flag before we invoke the relevant hotplug callbacks, schedulers could get confused because they could begin receiving callbacks on a CPU before (or after) they're notified that a CPU is online (or offline). Let's reverse the order to fix that. Signed-off-by: David Vernet <[email protected]>
Signed-off-by: David Vernet <[email protected]>
htejun added a commit that referenced this pull request on Jun 6, 2024
scx_pair uses the default stride value of nr_cpu_ids / 2, which matches most x86 SMT configurations. However, it does allow specifying a custom stride value with -S so that e.g. neighboring CPUs can be paired up. Not all stride values work, though, and errors were not reported very well. This patch improves error handling so that scx_pair fails with a clear error message if CPUs can't be paired up with the specified stride value. scx_pair now also prints out how CPUs are paired on startup. This should address issues #28 and #29.
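As an illustration of stride-based pairing and the failure mode being reported, here is a hypothetical user-space sketch (this is not the actual scx_pair code; names and the CPU limit are made up):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_CPUS	512

/*
 * Pair cpu i with cpu (i + stride) % nr_cpus. A stride is rejected if any
 * CPU would be paired with itself or would land in two pairs.
 */
static int build_pairs(int nr_cpus, int stride)
{
	bool paired[MAX_CPUS] = {};
	int i, j, npairs = 0;

	if (nr_cpus > MAX_CPUS)
		return -E2BIG;

	for (i = 0; i < nr_cpus; i++) {
		if (paired[i])
			continue;
		j = (i + stride) % nr_cpus;
		if (j == i || paired[j]) {
			fprintf(stderr, "stride %d cannot pair up %d CPUs\n",
				stride, nr_cpus);
			return -EINVAL;
		}
		paired[i] = paired[j] = true;
		printf("pair %d: cpu %d <-> cpu %d\n", npairs++, i, j);
	}
	return 0;
}
```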