scx_rustland_core: performance regression due to kernel change #788

arighi · 2024-10-11T23:48:23Z

This commit in the kernel introduces a pretty bad performance regression in all the scx_rustland_core schedulers:

7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")

The system becomes completely unresponsive when it's saturated, and it's very easy to reproduce (e.g., by starting a parallel kernel build with scx_rustland active).

I think the reason is one (or both) of these behavior changes:

    This causes two behavior changes observable from the BPF scheduler:

    - When a task keep running, it no longer goes through enqueue/dequeue cycle
      and thus ops.stopping/running() transitions. The new behavior is better
      and all the existing schedulers should be able to handle the new behavior.

    - The BPF scheduler cannot keep executing the current task by enqueueing
      SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
      BPF scheduler is responsible for resuming execution after each
      SCX_ENQ_LAST.

But I haven't figured out exactly why. I've been playing a bit with SCX_ENQ_LAST, unsuccessfully, so I'm just opening the issue for now. Any pointers on how to attack this?

Never let the CPU go idle.

This is a stress test to prove issue #788.

The expected behavior is that CPUs should not go idle due to the immediate re-kick in ops.update_idle(). However, in version 6.12, the CPUs are still entering idle states, indicating that in certain cases, ops.update_idle() is not being correctly invoked by the sched_ext core. This is likely due to the pick_next_task()/put_prev_task() rework in sched core.

WARNING: do not run this for too long or it may burn your CPUs.

Signed-off-by: Andrea Righi <[email protected]>

arighi · 2024-10-13T08:34:56Z

I think I found a much easier reproducer, see 7f9b009.

It seems that in 6.12, ops.update_idle() is occasionally not being called. scx_rustland_core depends on ops.update_idle() to wake up the user-space scheduler so it can handle pending tasks, so skipping it leads to poor performance. This issue is likely related to the changes to pick_next_task() / put_prev_task() in the kernel.

I don't have a fix yet. I'm just sharing the reproducer for now, and I'll investigate more on the kernel side.

With the consolidation of put_prev_task/set_next_task(), we are now skipping the sched_ext ops.stopping/running() transitions when the previous and next tasks are the same, see commit 436f3ee ("sched: Combine the last put_prev_task() and the first set_next_task()").

While this optimization makes sense in general, it can negatively impact performance in some user-space schedulers, that expect to handle such transitions when tasks exhaust their timeslice (see SCX_OPS_ENQ_LAST). For example, scx_rustland suffers a significant performance regression (e.g., gaming benchmarks drop from ~60fps to ~10fps).

To fix this, ensure that put_prev_task()/set_next_task() are never skipped when the scx scheduling class is enabled, allowing the scx class to handle such transitions.

This change restores the previous behavior, fixing the performance regression in scx_rustland.

Link: sched-ext/scx#788
Fixes: 7c65ae8 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <[email protected]>
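To make the skipped transitions concrete, here is a rough user-space toy model (not kernel code) of what the commit message describes. Everything named `toy_*` is hypothetical and only for illustration; the `skip_same_task` and `scx_enabled` parameters are assumptions that stand in for the combined put_prev/set_next optimization and for the proposed "never skip when scx is enabled" check.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these are kernel symbols. */
struct toy_task {
    int pid;
};

struct toy_stats {
    int stopping_calls; /* models ops.stopping() invocations */
    int running_calls;  /* models ops.running() invocations */
};

/*
 * Models the hand-off from the previous task to the next one.
 * skip_same_task: the optimization that skips the hand-off when prev == next.
 * scx_enabled:    the proposed fix, which disables that skip for sched_ext.
 */
void toy_switch(struct toy_stats *st, const struct toy_task *prev,
                const struct toy_task *next, bool skip_same_task,
                bool scx_enabled)
{
    if (prev == next && skip_same_task && !scx_enabled)
        return; /* same task keeps running: the scheduler sees nothing */

    st->stopping_calls++; /* put_prev_task() side of the transition */
    st->running_calls++;  /* set_next_task() side of the transition */
}

/*
 * Picks the same task `rounds` times in a row and returns how many
 * stopping transitions the (toy) BPF scheduler got to observe.
 */
int toy_count_transitions(int rounds, bool skip_same_task, bool scx_enabled)
{
    struct toy_task t = { .pid = 1 };
    struct toy_stats st = { 0, 0 };

    for (int i = 0; i < rounds; i++)
        toy_switch(&st, &t, &t, skip_same_task, scx_enabled);

    return st.stopping_calls;
}
```

In this model, a task that keeps running produces one transition per round without the optimization, none with it, and one per round again once the skip is disabled for scx. The real kernel paths are considerably more involved, so treat this only as a picture of the behavior change.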

arighi · 2024-10-13T17:41:17Z

FYI, https://lore.kernel.org/lkml/[email protected]/T/#u seems to fix this regression.

Prevent CPUs from going idle when the user-space scheduler has some pending activities to complete.

Keeping the CPU alive allows tasks to be consumed from the user-space scheduler more efficiently, preventing bubbles in the scheduling pipeline.

To achieve this, trigger a CPU kick from ops.update_idle() and set a flag in the CPU context to prevent it from going idle. Then keep kicking the CPU from ops.dispatch() until the flag is cleared, which occurs when no more tasks are pending or when the CPU exits idle as a task starts running on it.

This fixes the performance regression introduced by the put_prev_task_scx() behavior change in Linux 6.12 (see #788).

Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Andrea Righi <[email protected]>
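The kick-and-flag logic described in this commit message can be sketched as a small user-space state machine. This is only a toy model under assumptions, not the scx_rustland_core BPF code: the struct, its fields, and the `toy_*` functions are hypothetical names, a counter stands in for the actual CPU kick, and `toy_task_started()` is an assumed hook for "a task starts running on the CPU".

```c
#include <stdbool.h>

/* Hypothetical per-CPU context; names are illustrative only. */
struct toy_cpu_ctx {
    bool keep_awake; /* flag that prevents the CPU from going idle */
    int kicks;       /* counts kicks; stands in for a real CPU kick */
};

/*
 * Models ops.update_idle(): if the CPU is about to go idle while the
 * user-space scheduler still has pending work, set the flag and kick.
 */
void toy_update_idle(struct toy_cpu_ctx *c, bool going_idle, int pending)
{
    if (going_idle && pending > 0) {
        c->keep_awake = true;
        c->kicks++;
    }
}

/* Models a task starting to run: the CPU left idle, so clear the flag. */
void toy_task_started(struct toy_cpu_ctx *c)
{
    c->keep_awake = false;
}

/*
 * Models ops.dispatch(): while the flag is set, keep kicking the CPU;
 * clear the flag once no more tasks are pending.
 */
void toy_dispatch(struct toy_cpu_ctx *c, int pending)
{
    if (!c->keep_awake)
        return;

    if (pending == 0) {
        c->keep_awake = false;
        return;
    }

    c->kicks++;
}
```

The flag has two exits in this sketch, matching the two conditions in the commit message: the pending work drains, or a task starts running. Without one of them the CPU would keep kicking itself indefinitely, which is essentially what the stress-test reproducer does on purpose.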

arighi added the scx_rustland and help wanted labels Oct 12, 2024

arighi added the kernel label Oct 13, 2024

arighi mentioned this issue Oct 15, 2024

scx_rustland fixes and improvements #804

Merged
