Skip to content
Johannes Bechberger edited this page Oct 25, 2024 · 6 revisions

sched_ext is a Linux kernel feature that implements kernel thread schedulers in BPF and dynamically loads them. This wiki contains helpful information on this technology and schedulers implemented with it. Feel free to come to the #docs channel in slack to share your own.

Be aware that sched-ext is a rapidly evolving ecosystem, so the resources listed here might be slightly outdated.

The ground truth is in the code and the kernel.org documentation.

In the following a short tutorial to get you started:

Minimal Scheduler

In the following a short tutorial for creating a minimal scheduler written with sched_ext in C. This scheduler uses a global scheduling queue from which every CPU gets its tasks to run for a time slice. The scheduler order is First-In-First-Out. So it essentially implements a round-robin scheduler:

Round Robin Diagram

This short tutorial covers the basics; to learn more, visit the resources from the scx wiki.

Requirements

We need a 6.12 kernel or a patched 6.11 kernel to build a custom scheduler. You can get a kernel patched with the scheduler extensions on Ubuntu 24.10 from here, or you can use CachyOS and install a patched kernel from there.

Furthermore, you also need

  • a recent clang for compilation
  • bpftool for attaching the scheduler

On Ubuntu, for example, you can run: apt install clang linux-tools-common linux-tools-$(uname -r).

Nothing more is needed to run it, and you can find the code of this tutorial in the minimal-scheduler repository. I would advise to just cloning it:

git clone https://github.com/parttimenerd/minimal-scheduler
cd minimal-scheduler

The scheduler lives in the sched_ext.bpf.c file, but before we take a look at it, I want to show you how you can use this scheduler:

Usage

In short, we only need two steps:

# build the scheduler binary
./build.sh

# start the scheduler
sudo ./start.sh

# do something ...

# stop the scheduler
sudo ./stop.sh

I'll show you later what's in these scripts, but first, let's get to the scheduler code:

The Scheduler

We assume that you have some experience writing eBPF programs. If not, Liz Rice's book Learning eBPF is a good starting point.

The scheduler code only depends on the Linux bpf kernel headers and sched-ext. So here is the code from sched_ext.bpf.c:

// This header is autogenerated, as explained later
#include <vmlinux.h>
// The following two headers come from the Linux headers library
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Define a shared Dispatch Queue (DSQ) ID
// We use this as our global scheduling queue
#define SHARED_DSQ_ID 0

// Two macros that make the later code more readable
// and place the functions in the correct sections
// of the binary file
#define BPF_STRUCT_OPS(name, args...)	\
    SEC("struct_ops/"#name)	BPF_PROG(name, ##args)

#define BPF_STRUCT_OPS_SLEEPABLE(name, args...)	\
    SEC("struct_ops.s/"#name)							      \
    BPF_PROG(name, ##args)

// Initialize the scheduler by creating a shared dispatch queue (DSQ)
s32 BPF_STRUCT_OPS_SLEEPABLE(sched_init) {
    // All scx_ functions come from vmlinux.h
    return scx_bpf_create_dsq(SHARED_DSQ_ID, -1);
}

// Enqueue a task to the shared DSQ that wants to run, 
// dispatching it with a time slice
int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
    // Calculate the time slice for the task based on the number of tasks in the queue
    // This makes the system slightly more responsive than a basic round-robin
    // scheduler, which assigns every task the same time slice all the time
    // The base time slice is 5_000_000ns or 5ms
    u64 slice = 5000000u / scx_bpf_dsq_nr_queued(SHARED_DSQ_ID);
    scx_bpf_dispatch(p, SHARED_DSQ_ID, slice, enq_flags);
    return 0;
}

// Dispatch a task from the shared DSQ to a CPU,
// whenever a CPU needs something to run, usually after it is finished
// running the previous task for the allotted time slice
int BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev) {
    scx_bpf_consume(SHARED_DSQ_ID);
    return 0;
}

// Define the main scheduler operations structure (sched_ops)
SEC(".struct_ops.link")
struct sched_ext_ops sched_ops = {
    .enqueue   = (void *)sched_enqueue,
    .dispatch  = (void *)sched_dispatch,
    .init      = (void *)sched_init,
    // There are more functions available, but we'll focus
    // on the important ones for a minimal scheduler
    .flags     = SCX_OPS_ENQ_LAST | SCX_OPS_KEEP_BUILTIN_IDLE,
    // A name that will appear in
    // /sys/kernel/sched_ext/root/ops
    // after we attached the scheduler
    // The name has to be a valid C identifier
    .name      = "minimal_scheduler"
};

// All schedulers have to be GPLv2 licensed
char _license[] SEC("license") = "GPL";

We can visualize the interaction of all functions in the scheduler with the following diagram:

Scheduler Diagram

Now, after you've seen the code, run the build.sh script to generate the vmlinux.h BPF header and then compile the scheduler code to BPF bytecode:

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
clang -target bpf -g -O2 -c sched_ext.bpf.c -o sched_ext.bpf.o -I. 

Then run the start.sh script as a root user to attach the scheduler using the bpftool:

bpftool struct_ops register sched_ext.bpf.o /sys/fs/bpf/sched_ext

The custom scheduler is now the scheduler of this system. You can check this by accessing the /sys/kernel/sched_ext/root/ops file:

> cat /sys/kernel/sched_ext/root/ops
minimal_scheduler

And by checking dmesg | tail:

> sudo dmesg | tail
# ...
[32490.366637] sched_ext: BPF scheduler "minimal_scheduler" enabled

Play around with your system and see how it behaves.

If you're done, you can detach the scheduler by running the stop.sh script using root privileges. This removes the /sys/fs/bpf/sched_ext/sched_ops file.

Tasks for the Reader

Now that you know what a basic scheduler looks like, you can start modifying it. Here are a few suggestions:

Vary the Time Slice

How does your system behave when you increase or decrease the time slice?

For example, try a time slice of 1s. Do you see any difference in how your cursor moves? Or try a small time slice of 100us and run a program like that that does some computation, do you see a difference in its performance?

Use a fixed Time Slice

How does your system behave when you change the scheduler to use the same time slice, ignoring the number of enqueued processes?

Try, for example, to create load on your system and see how it behaves.

Limit the Used CPUs

How does your system behave if the scheduler only schedules to specific CPUs?

Try, for example, to make your system effectively single-core by only consuming tasks on CPU 0 in sched_dispatch (Hint: the cpu parameter is the CPU id).

Create multiple Scheduling Queues

How does your system behave with multiple scheduling queues for different CPUs and processes?

Try, for example, to create two scheduling queues, with one scheduling queue only for a process with a specific id (Hint: task_struct#tgid gives you the process id) which is scheduled on half of your CPUs.

Look into the linux/sched.h header to learn more about task_struct.

Use more BPF features

If you already know how to write basic eBPF programs, use bpf_trace_printk and the running and stopping hooks.

The running hook is called whenever a task starts running on a CPU; get the current CPU id via smp_processor_id():

int BPF_STRUCT_OPS(sched_running, struct task_struct *p) {
    // ...
    return 0; // there are no void functions in eBPF 
}

The stopping hook is called whenever a task stops running:

int BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable) {
    // ...
    return 0;
}

You can use this to create visualizations of the scheduling order.

Going Further

To do even more, you can look at the collected resources further down, especially the well-documented sched-ext code in the kernel.

If you're interested in how to use it in Rust, take a look at scx and scx_rust_scheduler, and for Java at hello-ebpf.

Resources

Blogs

Talk Recordings

There are a few talk recordings (and one podcast episode) on sched-ext. You can find them all in the Sched-Ext YouTube playlist. The following is a selection of the most notable videos:

David Vernet

David Vernet gives a lot of talks on sched-ext, this is a collection:

The recent Linux Storage, Filesystem, MM & BPF Summit also had two talks on sched-ext, giving an update to the previous talk by David Vernet:

The Linux Plumbers Conference 2024 had a micro-conference related to sched-ext, Sched-Ext: The BPF extensible scheduler class MC which covered major topics of the sched-ext community:

Additionally, there was Andrea Righi's talk Crafting a Linux kernel scheduler that runs in user-space using Rust in the LPC Refereed Track.

The whole sched-ext micro-conference has been summed up quite nicely by Jonathan Corbet in his LWN article.