scx_rustland_core: performance regression due to kernel change #788

Open
arighi opened this issue Oct 11, 2024 · 2 comments
Labels
help wanted Extra attention is needed kernel Expose a kernel issue scx_rustland

Comments

@arighi

arighi commented Oct 11, 2024

This commit in the kernel introduces a pretty bad performance regression in all the scx_rustland_core schedulers:

7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")

The system becomes completely unresponsive when it is saturated, and the problem is very easy to reproduce (for example, by starting a parallel kernel build with scx_rustland active).

I think the reason is one (or both) of these behavior changes:

    This causes two behavior changes observable from the BPF scheduler:

    - When a task keep running, it no longer goes through enqueue/dequeue cycle
      and thus ops.stopping/running() transitions. The new behavior is better
      and all the existing schedulers should be able to handle the new behavior.

    - The BPF scheduler cannot keep executing the current task by enqueueing
      SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
      BPF scheduler is responsible for resuming execution after each
      SCX_ENQ_LAST.

I haven't figured out exactly why yet. I've been experimenting with SCX_ENQ_LAST, without success, so I'm just opening the issue for now. Any pointers on how to attack this?
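To make the first behavior change concrete, here is a minimal user-space sketch in plain C. It is not kernel or BPF code: `cb_stats` and `simulate_slices()` are hypothetical names, and the model assumes a single CPU where the same task is picked again at every slice expiry. It only counts how many times ops.running()/ops.stopping() would be observed under the old and new behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters standing in for a BPF scheduler's callbacks. */
struct cb_stats {
    int running;   /* times ops.running() would fire */
    int stopping;  /* times ops.stopping() would fire */
};

/*
 * Toy model: one task occupies a CPU for nr_slices consecutive time
 * slices and is re-picked at every slice expiry.
 *
 * skip_same_task == false: old behavior, every expiry goes through a
 *                          stopping/running pair.
 * skip_same_task == true:  new behavior, the pair is skipped while the
 *                          same task keeps running.
 */
static struct cb_stats simulate_slices(int nr_slices, bool skip_same_task)
{
    struct cb_stats st = { 0, 0 };

    assert(nr_slices > 0);

    st.running = 1;                 /* the task first gets the CPU */
    for (int i = 1; i < nr_slices; i++) {
        if (skip_same_task)
            continue;               /* no transition is visible */
        st.stopping++;
        st.running++;
    }
    st.stopping++;                  /* the task finally leaves the CPU */
    return st;
}
```

In this model, `simulate_slices(4, false)` reports four running/stopping pairs while `simulate_slices(4, true)` reports one, so a scheduler that does bookkeeping in those callbacks gets far fewer chances to run it.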

@arighi arighi added scx_rustland help wanted Extra attention is needed labels Oct 12, 2024
arighi added a commit that referenced this issue Oct 13, 2024
Never let the CPU go idle. This is a stress test to prove issue #788.

The expected behavior is that CPUs should not go idle due to the
immediate re-kick in ops.update_idle().

However, in version 6.12, the CPUs are still entering idle states,
indicating that in certain cases, ops.update_idle() is not being
correctly invoked by the sched_ext core.

This is likely due to the pick_next_task()/put_prev_task() rework in
sched core.

WARNING: do not run this for too long or it may burn your CPUs.

Signed-off-by: Andrea Righi <[email protected]>
@arighi

arighi commented Oct 13, 2024

I think I found a much easier reproducer, see 7f9b009.

It seems that in 6.12, ops.update_idle() is occasionally not being called. scx_rustland_core depends on ops.update_idle() to trigger the wakeup of the user-space scheduler to handle pending tasks, so skipping it leads to poor performance. This issue is likely related to the changes to pick_next_task() / put_prev_task() in the kernel.

I don't have a fix yet; I'm just sharing the reproducer for now, and I'll investigate more on the kernel side.
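The dependency described above can be illustrated with a small user-space sketch in plain C. This is a toy model, not the scx_rustland_core implementation: `struct sim`, `on_update_idle()` and `cpu_enters_idle()` are hypothetical names, and the model assumes the user-space scheduler drains its whole queue once it is woken up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state of the user-space scheduler. */
struct sim {
    int pending;     /* tasks waiting for the user-space scheduler */
    int dispatched;  /* tasks the user-space scheduler has handled */
};

/*
 * Stand-in for ops.update_idle(): when a CPU enters idle and work is
 * pending, request a wakeup of the user-space scheduler.
 */
static bool on_update_idle(const struct sim *s, bool idle)
{
    return idle && s->pending > 0;
}

/*
 * One idle transition of a CPU. callback_invoked models whether the
 * core actually calls the update_idle callback on this transition.
 */
static void cpu_enters_idle(struct sim *s, bool callback_invoked)
{
    assert(s->pending >= 0);

    if (callback_invoked && on_update_idle(s, true)) {
        /* The user-space scheduler runs and drains its queue. */
        s->dispatched += s->pending;
        s->pending = 0;
    }
    /* Otherwise the CPU idles with work still queued. */
}
```

When the callback is skipped, the pending count is left untouched while the CPU sits idle, which is the kind of stall the reproducer is meant to expose.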

@arighi arighi added the kernel Expose a kernel issue label Oct 13, 2024
arighi added a commit to arighi/sched_ext that referenced this issue Oct 13, 2024
With the consolidation of put_prev_task/set_next_task(), we are now
skipping the sched_ext ops.stopping/running() transitions when the
previous and next tasks are the same, see commit 436f3ee ("sched:
Combine the last put_prev_task() and the first set_next_task()").

While this optimization makes sense in general, it can negatively impact
performance in some user-space schedulers that expect to handle such
transitions when tasks exhaust their timeslice (see SCX_OPS_ENQ_LAST).

For example, scx_rustland suffers a significant performance regression
(e.g., gaming benchmarks drop from ~60fps to ~10fps).

To fix this, ensure that put_prev_task()/set_next_task() are never
skipped when the scx scheduling class is enabled, allowing the scx class
to handle such transitions.

This change restores the previous behavior, fixing the performance
regression in scx_rustland.

Link: sched-ext/scx#788
Fixes: 7c65ae8 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <[email protected]>
@arighi

arighi commented Oct 13, 2024

FYI, https://lore.kernel.org/lkml/[email protected]/T/#u seems to fix this regression.

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Oct 13, 2024
arighi added a commit that referenced this issue Oct 15, 2024
Prevent CPUs from going idle when the user-space scheduler has some
pending activities to complete.

Keeping the CPU alive allows tasks to be consumed from the user-space
scheduler more efficiently, preventing bubbles in the scheduling
pipeline.

To achieve this, trigger a CPU kick from ops.update_idle() and set a
flag in the CPU context to prevent it from going idle. Then keep kicking
the CPU from ops.dispatch() until the flag is cleared, which occurs when
no more tasks are pending or when the CPU exits idle as a task starts
running on it.

This fixes the performance regression introduced by the
put_prev_task_scx() behavior change in Linux 6.12 (see #788).

Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Andrea Righi <[email protected]>
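The kick-and-flag scheme in this commit message can be sketched as a small state machine. The following is a user-space model in plain C, not the actual BPF code: `struct cpu_ctx`, `kick_cpu()`, `on_update_idle()` and `on_dispatch()` are hypothetical names, and the `pending` argument stands in for however the scheduler counts tasks still waiting in user space. In a real sched_ext BPF scheduler the kick itself would presumably go through `scx_bpf_kick_cpu()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU context. */
struct cpu_ctx {
    bool prevent_idle;  /* set while the CPU should be kept alive */
    int kicks;          /* number of kicks issued (for illustration) */
};

/* Stand-in for kicking a CPU; here it only counts. */
static void kick_cpu(struct cpu_ctx *ctx)
{
    ctx->kicks++;
}

/* Model of ops.update_idle(): kick and set the flag on idle entry. */
static void on_update_idle(struct cpu_ctx *ctx, bool idle, int pending)
{
    assert(pending >= 0);

    if (!idle) {
        /* The CPU exits idle because a task starts running on it. */
        ctx->prevent_idle = false;
        return;
    }
    if (pending > 0) {
        ctx->prevent_idle = true;
        kick_cpu(ctx);
    }
}

/* Model of ops.dispatch(): keep kicking until the flag is cleared. */
static void on_dispatch(struct cpu_ctx *ctx, int pending)
{
    assert(pending >= 0);

    if (!ctx->prevent_idle)
        return;
    if (pending == 0) {
        /* Nothing left for the user-space scheduler to handle. */
        ctx->prevent_idle = false;
        return;
    }
    kick_cpu(ctx);
}
```

The flag is what bounds the re-kicking: without it, dispatch would have no way to tell a CPU that is being deliberately kept awake from one that simply has nothing to run.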
arighi added a commit that referenced this issue Oct 16, 2024
minosfuture pushed a commit to minosfuture/scx that referenced this issue Oct 19, 2024
arighi added a commit to arighi/sched_ext that referenced this issue Oct 20, 2024