
DTrace based on BPF and other Linux kernel facilities



Overview

DTrace is a very powerful dynamic tracing tool that is currently available in Oracle Linux as part of the Unbreakable Enterprise Kernel (UEK). The initial port to Linux started in 2010 and has matured quite a bit over the past 9 years. It has a rich feature set, supports most of the common probe providers, and is available for the x86_64 and aarch64 architectures. The current kernel implementation comprises patches that touch 275 different files (many of them new), adding or changing almost 45000 lines of code and comments while deleting a mere 300 lines. The userspace component covers about 55000 lines of code.

Since the beginning of the DTrace on Linux project, a lot has changed in the Linux kernel with respect to tracing. Many features that did not exist throughout most of the history of DTrace on Linux have recently been added to the kernel, and some that were added in the past have matured sufficiently to be usable in an enterprise-level tool like DTrace. The emergence of these features makes it feasible to envision a less invasive implementation of DTrace for Linux. As other tracing tools are moving in a similar direction, leveraging common generic facilities in the kernel, it stands to reason that DTrace should join this direction of development.

Rationale

DTrace has existed in isolation for a long time for a variety of reasons. Now that those barriers have been lifted, it would benefit the greater community to make DTrace available beyond just Oracle Linux and the Unbreakable Enterprise Kernel. This would not only make it available to a wider audience, but would also encourage a larger developer base and better integration and interoperability with other tracing projects.

There are various obstacles to getting the current DTrace implementation accepted into the Linux kernel, including but not limited to:

  • The userspace tool (consumer) is useless without the kernel component.
  • The kernel component is useless without the userspace tool (consumer). (A typical chicken-and-egg problem.)
  • The kernel component is large (about 45000 lines of code and comments).
  • The kernel component is invasive. Some implementation details touch on very low level parts of the kernel, such as the scheduler, trap handling, system call invocation, etc.
  • The kernel component provides DTrace-specific implementations for features that exist (in some fashion) in the kernel. Providing features that more or less duplicate existing functionality is frowned upon, especially when it is done in support of a single specialized tool.

Objective

The project comprises the following sub-goals:

  • Avoid duplication of kernel features, and instead leverage existing functionality.
  • Improve the maintainability of the DTrace code base.
  • Provide (limited) DTrace tracing on kernels even when they do not contain DTrace-specific patches.
  • Facilitate cooperation between various tracing tools and facilities, towards establishing a true tracing framework for Linux whose components can genuinely work together.
  • Integrate kernel-level changes with the upstream kernel, i.e. make all DTrace-related kernel patches part of the upstream kernel.

The methodology behind accomplishing these goals involves an incremental approach to making contributions to the Linux kernel and the various tracing facilities therein. This will also benefit DTrace as a whole because it ensures that the design remains clean and Linux-centric (as opposed to being DTrace-centric).

It is important to note that the existing implementation of DTrace is tied to the original design. In that design, the tracing implementation at the kernel level played a central role. The new design based on BPF and other existing Linux kernel features aims to reduce the impact on the kernel, and favors implementing features at the userspace level. As a result, the dtrace userspace component takes on a more important role as the main controller of tracing sessions.

There is nothing that requires DTrace to be part of the kernel tree under the new design. We can certainly develop DTrace as a userspace application outside of the kernel tree. It will have a reasonable (but limited) level of functionality against newer Linux kernels. Kernel changes that are necessary to provide the full level of functionality of DTrace will be submitted for upstream inclusion in the Linux kernel. The goal is to ensure that the userspace component can determine what features are available in the running kernel, and adapt to that context.

Design

Functional overview

A detailed description of DTrace and its inner workings is beyond the scope of this document, but some overview of the functional design behind DTrace is important to understanding the design presented here. At its highest level, DTrace can be split into two major components:

  • The producer: this component resides in the kernel and produces tracing data.
  • One or more consumers: userspace applications and/or libraries that consume the tracing data generated by the producer and usually generate some output for the user.

Obviously, this design requires coordination between the producer and the consumer(s) in terms of data sources, what data the consumer cares about, and how that data is to be presented so that the consumer can use it.

Data sources

Probes are the source of all tracing data. A probe is an abstraction for any type of event that can trigger the producer to generate tracing data. It can be a timer event, an event generated upon entry into or return from a function, an event that is associated with specific kernel functionality, etc. The commonality between all these events is that they cause the current execution flow to be interrupted by invoking a predefined entry point into the producer. The producer will perform any number of actions associated with the event (we'll get into that later in this overview), and then it will cause the original execution flow to be resumed. Unless so-called destructive actions are used, care is taken to ensure that the interruption is virtually unnoticeable.

Probes are grouped by a common characteristic such as event type, functionality, association with a specific kernel sub-system, etc. Each grouping is managed by a provider, and the probes in a group implement a common API; the producer can therefore be seen as the manager of a set of providers. As such, there are two levels of abstraction in the way data sources are managed.

Although the interruption mechanism (also called 'firing mechanism') is designed to have as little impact on performance as possible, that impact is not zero. Therefore, the probe providers must implement a mechanism to enable and disable probes. As can be expected, disabled probes never fire, i.e. they never interrupt the execution flow of a task.

All probes operate within a specific context, comprising two main components:

  • Common context: content of the CPU register set, state of the executing task, tracing state (more on that later)
  • Probe-specific context: probe arguments, probe definition

The context is used to retrieve data from the task (or system) at the point of interruption, commonly referred to as the point when the 'probe is hit' or the 'probe fires'. Users with limited privileges may have restrictions imposed on the data they can retrieve. Note: this DTrace feature is not currently implemented on Linux.

Coordination between producer and consumer in terms of data sources involves communicating what data sources are available, i.e. making it possible for the consumers to know what providers are available, and what probes are made available by each provider. It also involves defining what data can be obtained from the context of each probe.

Data selection

Data selection is one of the most complex aspects of the coordination between producer and consumer. The consumer must at least define which probes it is interested in. In other words, the consumer communicates to the producer that it wants to know when any of a given list of probes gets triggered. Technically, it is even possible to communicate to the producer that a certain set of probes should be enabled without generating any data at all, but the practical use of doing so is pretty much non-existent.

Of course, knowing that a probe got triggered is only of limited use in the context of a tracing tool. More often than not, we are interested in information about a task or the operating system at the time of the probe getting triggered. To accomplish that goal, DTrace provides the D language. It is a high-level language that allows the user to describe what should happen when a probe is triggered. A user can specify how to retrieve the data we are interested in, and how to record it. The D language also allows for non-data-producing specifications that form the basis for more complex operations. Together with the ability to work with global variables, thread-local variables, and clause-local variables (storing, loading, and comparing values), it is possible to develop small programs that are associated with a probe and executed when the probe is triggered.

Conditional selection of probe firings is also supported. For example, the consumer could be interested in the use of a specific system call (probing the entry of the system call function) for one specific task. Enabling the probe on the entry point of the system call causes it to be triggered every time the system call is invoked, regardless of which task is executing. The consumer would receive data on the system call use for every single task on the system. To avoid generating data that will be ignored, the producer can be told that data should only be generated for probe firings that satisfy a given condition. This condition is called a predicate.
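
To make this concrete, here is a minimal sketch of what a compiled predicate could look like, written as BPF C code rather than as the instructions the D compiler would actually emit. The attachment point, the TRACED_PID constant, and all names are hypothetical:

/* Hypothetical illustration: a D predicate such as /pid == 1234/
 * compiled into the prologue of a BPF program attached to a syscall
 * entry probe. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define TRACED_PID 1234    /* hypothetical constant baked in by the compiler */

SEC("kprobe/__x64_sys_openat")
int syscall_entry_probe(struct pt_regs *ctx)
{
    /* The upper 32 bits hold the tgid, which is what D calls pid. */
    __u32 pid = bpf_get_current_pid_tgid() >> 32;

    if (pid != TRACED_PID)
        return 0;          /* predicate not satisfied: no data generated */

    /* ... actions compiled from the D clause would run here ... */
    return 0;
}

char _license[] SEC("license") = "GPL";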

Data selection is therefore much more than stating what data is to be collected when a specific probe fires. It supports writing D scripts: a collection of probe specifications (telling the producer what probe(s) should be enabled), each with an optional predicate (enforcing a condition on the probe firing event) and a D clause (a D code specification that states what should happen when the probe fires and the optional predicate is satisfied).

The main complexity in terms of coordination between producer and consumer for data selection is found in the implementation of the producer component that evaluates the predicate and executes the clause when a probe fires. Since the D language is a high level programming language, the clauses must be compiled into instructions that can be executed by the producer. This involves the implementation of some kind of virtual architecture that serves as compilation target architecture for the D code.

Data representation

Assuming that a D script has been developed based on existing data sources (probes) specifying what happens when each probe fires, and assuming that the consumer expects some data to be generated, the remaining problem is simple: the producer must represent the generated data in a form that the consumer understands. This truly is a simple problem with a complex and invasive solution in the existing DTrace implementation.

A choice was made in the early stages of the DTrace design (well before we did the port for Linux - and yes, that is a way to say that it isn't my fault) to hardcode the way tracing data is stored in the output buffers at the producer side based on what the consumer expects. This has enormous consequences for the interaction between producer and consumers because making changes to the way data is stored in the buffers requires coordinating changes to the producer and all consumers. Fortunately, DTrace provides a userspace library that provides an implementation of all interaction with the producer so that consumers need not be aware of the implementation details. Yet, this still imposes a hardcoded dependency between the kernel component (producer) and a userspace component (library used by consumers).

This also affects any possible integration with other tools. So much of the consumer/producer interaction is delegated to the library that any tool built based on the library is invariably locked into many of the semantics that are part of the DTrace design. Any tool that wants to make use of the functionality of the producer needs to be very DTrace-like.

If that is not enough of a complication to behold, we should consider the fact that the existing implementation has spread the data selection component between producer and consumer for some of the functionality. A good way to illustrate this is the printf() statement in the D language. It does what one would expect from the printf() function in C: it takes a format string as its first argument, and uses the remainder of its arguments to populate the conversion specifications in the format string. Based on the description of DTrace thus far, one might assume that the implementation of the virtual architecture in the producer contains a printf-like function that builds an output string based on the format string and the supplied arguments. That output string would then be written to the output buffer, and all is well.

That is not how it is done in the existing DTrace implementation. Instead, the building of the output string is done within the userspace library that the consumer uses. When a printf() statement is encountered in a D clause, each argument is taken as an expression that is compiled into its own set of virtual architecture instructions, generating a value that is written to the output buffer. When a probe associated with this D clause is triggered, data is written to the buffer to identify the probe that fired. Then a chain of instruction blocks is executed, one block at a time, and each block generates some piece of data that is written to the buffer.

When the buffer is processed by the consumer (actually, the DTrace consumer library), the probe identification is recognized as one that is associated with a clause containing a printf() statement. The library then invokes a function that implements the printf functionality, taking the values for the conversion specifications in the format string from the data items stored in the buffer. In other words, the implementation of the printf() statement is partly done in the producer and partly in the consumer library. This is only one example of the tight coupling between the producer and the consumer library.
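
The following sketch illustrates the consumer-library side of this split. The record layout and all names here are invented for illustration; the real buffer format is considerably more involved:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical record layout: the producer stored only raw values; the
 * consumer library applies the format string that the compiler recorded
 * for this probe/clause at compile time. */
struct rec_hdr {
    uint32_t probe_id;    /* identifies the enabling that fired */
    uint32_t nvals;       /* number of raw values that follow */
};

static void consume_record(const struct rec_hdr *hdr,
                           const uint64_t *vals,
                           const char *fmt /* looked up via probe_id */)
{
    /* The real library implements a full printf interpreter that walks
     * the conversion specifications; we show only the two-value case,
     * assuming fmt uses %llu-style conversions. */
    if (hdr->nvals == 2)
        printf(fmt, (unsigned long long)vals[0],
               (unsigned long long)vals[1]);
}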

High level design

The design of DTrace based on BPF and other kernel features is driven by a re-interpretation of the original DTrace design. Rather than looking at the current implementation from a source code level, we have gone through a very thorough analysis of the DTrace design with the goal of identifying how we can leverage existing features in the Linux kernel. One of the first conclusions is instrumental to the overall approach: the producer should provide a clean API for the consumer to use. This covers the mechanism to pass information from the consumer to the producer (settings, compiled D scripts, etc.), the virtual architecture specification, and the mechanism to pass information from the producer to the consumer (state, tracing data, etc.). It must be clean in the sense that hard-coded assumptions should be avoided whenever possible.

Why BPF?

The virtual architecture used in the existing DTrace implementation is very specific to DTrace and contains many hardcoded assumptions. Its implementation covers about 7500 lines of code, which is roughly 15% of the entire DTrace producer. It is a very powerful piece of code and crucial to DTrace. When DTrace was ported to Linux, there was no existing facility in the kernel that could provide the same functionality.

As BPF emerged as an extended version of the Berkeley Packet Filter, the Linux kernel gained a high performance generic execution engine, implementing a virtual architecture that lends itself quite well to providing much of the functionality we require for DTrace. While it poses some limitations (that can be worked around or alleviated by means of contributions we can make to BPF), it also offers features that make it more powerful than the existing execution engine in the DTrace producer. In addition, it provides access to a form of higher level data structures (BPF maps) that are a good fit for DTrace variable storage.
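
As an aside, using BPF maps for variable storage requires no DTrace-specific kernel code at all. Here is a minimal sketch of creating a hash map with the raw bpf(2) system call; the key/value sizes are merely illustrative:

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create a hash map of the kind that could back DTrace variable
 * storage; returns a file descriptor referring to the map. */
static int create_var_map(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(__u64);   /* e.g. a thread id */
    attr.value_size  = sizeof(__u64);   /* a single variable slot */
    attr.max_entries = 1024;

    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}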

Finally, there are further benefits to using an existing kernel facility. Various other projects use BPF and contribute to it. This means that it is under very active development, and there is a significant developer base involved. Using BPF as the DTrace execution engine allows us to make use of a component that others actively maintain. Rather than spending time maintaining our own engine, we can benefit from the work others do on BPF, contribute to the maintenance effort, and be part of improving BPF over time.

What do we gain?

Aside from the aforementioned maintenance benefit, we are also able to re-work some of the less fortunate design choices that were made with DTrace. Some examples have already been discussed, but at a higher level we can see a more fundamental flaw that we are now able to avoid. The existing DTrace implementation limits the use of the execution engine to merely evaluating expressions. The real work is done in native code, i.e. all functionality that uses the values obtained from evaluating expressions is hard-coded into the producer. In DTrace terminology we refer to this functionality as 'actions'. A D script comprises multiple probe specifications (identifying the sources we want to use), and each specification is associated with an optional predicate (a D expression) and a clause (zero or more D statements). The compiler in the consumer library transforms each clause into a sequence of actions. Each action usually contains a D expression that evaluates to a value that is used in the execution of the action.

So, while the D script source code associates data sources (probes) each with their own D clause, the implementation in both the consumer library and the producer associates probes each with a chain of actions. The actions are hard-coded in the producer, and their associated compiled D expression is evaluated by means of the execution engine. A very simple example could be this:

BEGIN { trace(curthread); trace(probename); }

The probe specification is BEGIN which is a probe that is triggered when tracing starts for the current consumer. There is no predicate, so the associated clause will always be executed. The clause is compiled into two actions that get associated with the BEGIN probe. Each action has type DIFEXPR, and its associated instructions load the value of the named global variable (curthread or probename) into a register and then return the value of the register. (There are a few technical details left out, but that is the gist of it.)

When the probe is triggered, the two actions will be executed one after the other. The first action has compiled code associated with it, so the instructions are fed to the execution engine, and they are executed. The value returned from the engine is then processed based on the action type. Since the action is a simple DIFEXPR (D expression), the value that was returned by the execution engine is written to the output buffer. Processing of the probe triggering then continues with the second action. Again, associated compiled code is fed to the execution engine, the resulting value is processed based on the action type, which results in the value being written to the output buffer.

It is important to note that a simple D clause with two statements resulted in two separate invocations of the execution engine. The execution context of the probe transitioned from native code (the function that gets called when the probe fires) to interpreted code (the execution engine processing instructions generated for the first action), back to native code (to process the value obtained from the execution engine), transitioning back to interpreted code (the execution engine processing instructions generated for the second action), and back to native code (to process the value obtained from the execution engine).

With the help of some contributions that we can make to BPF (with benefits to many other projects that use BPF) we can simplify this process tremendously. It is possible to compile the entire D clause into a single BPF program (or possibly a chain of BPF programs that are linked through the tail-call mechanism, which means we never leave the context of the execution engine as we pass from one to the next). This has the great benefit that most actions are no longer hardcoded, and those that require native code can be called from the BPF program. The direct mapping of the D source code to compiled BPF code simplifies the implementation for both the producer and the consumer library, and opens up the possibility to allow other tools to associate their own custom BPF programs with DTrace probes without sacrificing safety or stability.
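
A sketch of what this could look like for the BEGIN example above, written as BPF C code with an invented record layout and a hypothetical attachment point:

/* Hypothetical sketch: the two-statement clause compiled as one BPF
 * program that emits both values in a single output record, instead of
 * two engine invocations with native processing in between. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
} buffers SEC(".maps");

struct out_rec {
    __u64 probe_id;
    __u64 task;           /* stands in for curthread */
    char  probename[16];  /* stands in for probename */
};

SEC("kprobe/dt_begin")    /* hypothetical attachment point for BEGIN */
int begin_clause(struct pt_regs *ctx)
{
    struct out_rec rec = { .probe_id = 1 };  /* illustrative id */

    rec.task = bpf_get_current_task();
    __builtin_memcpy(rec.probename, "BEGIN", 6);

    /* One record for the whole clause, written to this CPU's buffer. */
    bpf_perf_event_output(ctx, &buffers, BPF_F_CURRENT_CPU,
                          &rec, sizeof(rec));
    return 0;
}

char _license[] SEC("license") = "GPL";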

It is important to note that in this new design, DTrace will still work as documented. The userspace compiler will ensure that the generated BPF code is functionally equivalent to the existing DTrace implementation. Various capabilities that BPF programs offer will not be available when writing D scripts because they go beyond the D language specification, but this offers the flexibility to extend the D language with features that are not possible in the current implementation.

Another important benefit of the new design is that more functionality is implemented at the userspace level. Often this turns out to be a better choice than what was done before. For example, listing probes used to involve userspace issuing a sequence of ioctl() requests to the DTrace core module in the kernel, requesting the list of probes that have been registered with DTrace. When a tracing script was being compiled, any probe specifications (possibly with wildcards) were also resolved through ioctl() requests to the kernel. In the new design, the userspace utility obtains a list of probes from sysfs files without any further interaction with the kernel.
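
As an illustration, kernel tracepoints can already be enumerated by reading a single file. The sketch below assumes tracefs is mounted in its usual location under debugfs:

#include <stdio.h>

/* Enumerate kernel tracepoints from tracefs: one way a userspace
 * consumer can build its probe list without any ioctl() requests. */
int main(void)
{
    FILE *fp = fopen("/sys/kernel/debug/tracing/available_events", "r");
    char line[256];

    if (!fp)
        return 1;
    while (fgets(line, sizeof(line), fp))
        fputs(line, stdout);   /* lines look like "sched:sched_switch" */
    fclose(fp);
    return 0;
}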

Finally, even though it is anticipated that kernel patches will be necessary for some of the more involved pieces of DTrace functionality, it will be possible to provide a reasonable level of functionality on kernels that do not have those kernel patches (yet). This means that we are able to provide somewhat limited DTrace functionality to users while upstream kernel patches are pending, and the review and approval process for those patches will not be a roadblock to DTrace overall.

Workflow

The workflow remains very much the same (which is important because we want to retain semantics that are part of the DTrace design), but as mentioned above there is a significant shift of responsibility involved. More of the established behavior is defined by the userspace consumer library which enables the kernel portion (the producer) to become less complex and more generic.

The following workflow is envisioned when a consumer starts a tracing session:

  1. The consumer creates output buffers for each online CPU using the perf_event memory-mapped ring-buffer mechanism, and makes them available to BPF programs by storing the buffer information in a map (buffers). The consumer also creates a map (ecbs) that will be populated with information about enabled probes. It is used by BPF programs to obtain probe-specific information (because BPF programs do not know what event triggered their execution). A sketch of this buffer setup follows this list.
  2. The consumer performs a code analysis of the D script to determine the needs for variables (static vs dynamic, global vs thread-local vs clause-local).
  3. The consumer creates BPF maps for variable storage for static variables (global and clause-local variables) and dynamic variables (which includes thread-local variables).
  4. The consumer compiles the D script into BPF programs using the defined BPF maps. The programs consist of enabling-specific trampolines that call BPF functions.
  5. The consumer loads the BPF map definitions and the BPF programs into the kernel using the bpf() system call.
  6. The consumer enables the probes that are needed for the D script by associating the enabling-specific BPF programs with their respective probes. The trampoline BPF programs are triggered when a probe fires and they operate as an entry point into the DTrace BPF program execution. They perform the following four operations:
    1. Test whether tracing has been enabled, and if not, terminate.
    2. Retrieve the ECB for this specific BPF program from the ecbs map.
    3. Populate a dt_bpf_context structure with event-specific information (ECB id, probe id, probe arguments, ...).
    4. Call the BPF function that implements the actual D action, providing the dt_bpf_context structure as argument.
  7. The consumer starts the tracing session by setting the session state in the global variable map.
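
The following is a minimal sketch of the buffer setup in step 1, using the raw perf_event_open(2) system call and libbpf's bpf_map_update_elem() wrapper. Error handling and the mmap() of the buffer for reading are omitted, and map_fd is assumed to refer to a BPF_MAP_TYPE_PERF_EVENT_ARRAY map:

#include <linux/bpf.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <bpf/bpf.h>           /* libbpf's bpf_map_update_elem() */

/* Open a perf ring buffer for one CPU and register its fd in the
 * buffers map so that BPF programs can write records to it. */
static int open_output_buffer(int map_fd, int cpu)
{
    struct perf_event_attr attr;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type        = PERF_TYPE_SOFTWARE;
    attr.config      = PERF_COUNT_SW_BPF_OUTPUT;
    attr.size        = sizeof(attr);
    attr.sample_type = PERF_SAMPLE_RAW;

    fd = syscall(__NR_perf_event_open, &attr, -1 /* any task */,
                 cpu, -1 /* no group */, 0);
    if (fd < 0)
        return -1;

    /* Register the buffer under its CPU id. */
    if (bpf_map_update_elem(map_fd, &cpu, &fd, BPF_ANY) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}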

The following workflow is envisioned when a probe is triggered:

  1. The probe is triggered which causes the execution of trampoline BPF programs, one after the other.
  2. The trampoline BPF program verifies that probing has been enabled (if not, it terminates), obtains its ECB, populates a dt_bpf_context, and calls the BPF function that implements the probe action. (A sketch of such a trampoline follows this list.)
  3. The DTrace BPF program has access to the DTrace execution state and probe-specific data through the dt_bpf_context structure that is passed to the function. Output is generated by populating a structure that is written to the perf event output buffer using the bpf_perf_event_output() helper.
  4. If an error occurs, all data written to the tracing output buffer is thrown away. If execution was successful, the data is committed to the tracing output buffer.
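
The sketch below shows what such a trampoline could look like as BPF C code. The map layouts, the dt_bpf_context fields, and the baked-in ids are all illustrative; the real implementation generates this per enabling:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct dt_bpf_context {
    __u32 ecb_id;
    __u32 probe_id;
    __u64 argv[10];
};

struct ecb { __u32 probe_id; /* ... action metadata ... */ };

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u32);
} state SEC(".maps");            /* session on/off switch */

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 128);
    __type(key, __u32);
    __type(value, struct ecb);
} ecbs SEC(".maps");

/* Stand-in for the compiled D clause; it would write records with
 * bpf_perf_event_output(). */
static int dt_action(void *ctx, struct dt_bpf_context *dctx)
{
    return 0;
}

SEC("kprobe/some_probe")         /* hypothetical attachment */
int dt_trampoline(struct pt_regs *ctx)
{
    __u32 zero = 0, ecb_id = 7;  /* id baked in per enabling */
    __u32 *on = bpf_map_lookup_elem(&state, &zero);
    struct ecb *ecb;
    struct dt_bpf_context dctx = { };

    if (!on || !*on)
        return 0;                /* 1. tracing not enabled */

    ecb = bpf_map_lookup_elem(&ecbs, &ecb_id);
    if (!ecb)
        return 0;                /* 2. no ECB for this program */

    dctx.ecb_id   = ecb_id;      /* 3. populate the context */
    dctx.probe_id = ecb->probe_id;

    return dt_action(ctx, &dctx); /* 4. run the compiled actions */
}

char _license[] SEC("license") = "GPL";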

Is that it?

There are many other possibilities for improving the DTrace implementation by leveraging existing Linux facilities and sub-systems. One pain point has always been the fact that various probes (event sources) exist in the kernel but not in a way that DTrace can make use of them. One area of ongoing analysis is the possibility of making use of these existing probes, given that they have already been updated to support attaching BPF programs. There is still work to be done in terms of ensuring that the proper BPF context is available for all probe types.

Various complexities in the management of dynamic variables in D scripts make for a complicated design. The use of BPF maps can help, if only because the variable management code can be re-implemented based on existing code in the Linux kernel. It reduces the code footprint of DTrace and shifts the burden of code maintenance from a very DTrace specific component to a generic one. Again, this allows our team to contribute (and help support) generic code in the Linux kernel rather than maintaining code that duplicates existing functionality.
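
As an illustration of the fit, the sketch below shows how a thread-local D variable (self->x) could be backed by a BPF hash map keyed on the thread id. The map name and layout are invented:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);      /* thread id */
    __type(value, __u64);    /* the variable's storage slot */
} tvars SEC(".maps");

static __u64 tvar_load(void)
{
    __u32 tid = (__u32)bpf_get_current_pid_tgid();  /* lower 32 bits */
    __u64 *val = bpf_map_lookup_elem(&tvars, &tid);

    return val ? *val : 0;   /* unset D variables read as 0 */
}

static void tvar_store(__u64 v)
{
    __u32 tid = (__u32)bpf_get_current_pid_tgid();

    bpf_map_update_elem(&tvars, &tid, &v, BPF_ANY);
}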

The design presented here does not discuss changes/improvements that can be applied to the handling of aggregations and speculative tracing. Ongoing analysis is looking at ways to re-design those parts of DTrace as well in terms of using existing code.

And there is more to come as we work through this incremental process of reworking DTrace based on BPF!

Implementation

This section provides a high level view of the implementation plan for DTrace on Linux using BPF and other kernel tracing features. For more details, refer to the DTrace based on BPF Implementation Plan.

Implementation plan

The implementation of BPF-based DTrace is not entirely cast in stone as some aspects of the design mature and change to accommodate the ever-changing BPF implementation. We also continue to refine our own knowledge of the capabilities of BPF and how to interact with various Linux kernel subsystems that were not quite designed to work in a manner that DTrace expects.

The current implementation plan uses the existing (legacy) DTrace userspace consumer implementation and modifies it to make use of BPF and other Linux kernel tracing facilities. The two greatest benefits of the new implementation plan are that we can leverage the existing D compiler and tracing data processor, and that we can roll out language functionality as we move forward. This means that development is primarily structured around D language features (rather than around probe providers, as we used to do with legacy DTrace). This avoids behind-the-curtain style development, and also allows us to engage users much sooner. By putting a reasonably strong focus on remaining fully compatible with the existing DTrace implementation in terms of user experience, we can also leverage the 'familiarity' angle for users who have used DTrace before.

  1. Import the DTrace header files (previously found in the kernel source tree) into the userspace consumer source tree. DONE
  2. Improve the D compiler debugging facility to aid in the conversion from generating DIF code to generating BPF code. DONE
  3. Remove any reference to loading or using DTrace kernel modules, the DTrace device node, and the drti.o (USDT helper device support code). DONE
  4. Remove support for anonymous tracing. This is a sad side-effect of moving DTrace towards an all-userspace implementation. There are some thoughts on how this may be implemented in the future but that is still very sketchy. DONE
  5. Introduce a generic hashtable implementation to support probe management in userspace. DONE
  6. Introduce probe and provider management code in the userspace consumer. In the legacy implementation, the userspace consumer merely cached information about probes that were referenced from D scripts based on information from the kernel. In the new implementation, the userspace consumer maintains the full collection of probes organized under providers. DONE
  7. Convert the D compiler (parser, code generator, and assembler) from generating DIF code to generating BPF code. This step in the implementation plan covers the basic D language implementation and does not include support for complex instructions (e.g. string comparison) or subroutines. DONE
  8. Remove the integer constant table. This is no longer needed because BPF allows direct loading of 64-bit values. DONE
  9. Convert the compiler to compile D clauses as a single BPF function rather than a sequence of DIF actions. This has a significant impact on the compiler because it requires execution state to be managed at the clause level rather than at the action level. The string constant table and other structures must now be at the clause level. DONE
  10. Implement a BPF disassembler (to replace the DIF disassembler) to support the -S option to output a disassembler dump of the compiled program for a given D script. DONE
  11. Add support for using /proc/kallsyms when /proc/kallmodsyms is not available. DONE
  12. Implement D actions as function calls, and implement each D action as a BPF function. Many of these functions can be provided in a pre-compiled ELF object (dcore.o file) that is generated from C code by compiling it using Jose's bpf-unknown-none-gcc compiler implementation. ONGOING
  13. Implement D language features that are not native to BPF (such as string comparison, string copy, ...) ONGOING
  14. Implement data generation by clause rather than action. This involves modifying the trace data record handling in rather extensive ways. DONE
  15. Implement the ERROR probe, because faults may occur at any time and we must ensure that partial trace data records are not processed as if they were complete. ONGOING
  16. Implement D language subroutines as BPF functions and/or by means of a BPF helper. Since BPF helpers reside in kernel code and require a kernel patch that needs to be sent for upstream review and inclusion, I anticipate that we may have to use a UEK-specific patch for this while an upstream patch is negotiated. ONGOING
  17. Implement a turn-key (on/off) switch for toggling tracing as DTrace is accustomed to. This replaces the DTrace GO/STOP mechanism. DONE
  18. Submit a kernel patch upstream for adding /proc/kallmodsyms. While it is certainly DTrace-specific in its design, the benefits to the overall tracing community are significant as well. We should highlight the benefits in terms of tracing scripts that can be written for e.g. ext4 fs tracing without needing to know whether ext4 is compiled into the kernel or loaded as a module. The kallmodsyms information also reduces the amount of function name collisions, ensuring better kprobes coverage for kernel functions. In fact, this work will make it possible to probe functions by symbolic identifier that previously could only be traced by their address because a name collision made them invisible to symbol lookup mechanisms.
  19. Submit a kernel patch upstream for generating CTF data for kernel compiles. This ties in with the binutils and gcc CTF support work.
  20. Add DTrace-specific SDT probes to the kernel as new tracepoints. DTrace users are accustomed to documented SDT probes that are available on multiple operating systems. Many are not available as Linux tracepoints or do not provide the arguments that would allow them to be used as equivalent probes. While these probes are very specific to DTrace, making them available as tracepoints means that other tracing tools can make use of them.
  21. Perform performance comparisons to ensure that the Linux implementations of various probe types can sustain the more severe use cases that DTrace presents (especially in the testsuite). Use the results of this work to help support potential improvements to probes in the Linux kernel, or to substantiate claims that some functionality is not at the level necessary for DTrace users.
  22. Implement kernel changes as necessary based on needed functionality that cannot be implemented at the userspace level, or for which we can demonstrate that a userspace implementation does not satisfy the requirements of users. We will always first try to do things without requiring kernel changes.

Contributions

Current contributions

Add CTF support to the toolchain (binutils, gcc, gdb)

Support for CTF data generation and manipulation has been added to the toolchain (binutils, gcc, gdb) to provide type information that can be used by a variety of tools. DTrace makes use of this data to have full access to type information for the core kernel, kernel modules, and userspace components (executable, libraries, etc).

Future contributions

Generate CTF data at kernel build time (pending...)

With CTF support available in the toolchain it is now possible to generate kernel type information at kernel build time. This ensures that accurate information can be packaged with each kernel, either as a separate package (versioned to match the kernel package) or included with the kernel.

Add kallmodsyms support to the upstream kernel (pending...)

This is an extension to the kallsyms interface to associate symbols with module names, regardless of whether those modules exist as actual loadable modules or were compiled into the kernel image. In other words, any symbols that belong to code that could be built as a loadable module will be listed with that module name even if the kernel configuration causes that code to be compiled directly into the kernel. The augmented version of kallsyms is provided as a new /proc/kallmodsyms file to ensure that the existing /proc/kallsyms interface remains untouched.

DTrace uses this to provide a consistent probe naming scheme (provider:module:function:name) regardless of whether a kernel subsystem is compiled in or built as loadable module(s).

Currently, DTrace is the primary user of this feature, but it is anticipated that other projects (especially tracers) may find it useful as well, as a way to organize kernel symbols independently of the kernel configuration settings that determine what is compiled in versus built as loadable modules.
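
A consumer could derive the module component of a probe name with a simple parser along these lines. The exact file format is defined by the pending patch; the kallsyms-like "address type symbol [module]" layout assumed here is illustrative:

#include <stdio.h>

/* Map each kernel symbol to its module, treating symbols without a
 * module annotation as belonging to the core kernel image. */
int main(void)
{
    FILE *fp = fopen("/proc/kallmodsyms", "r");
    unsigned long long addr;
    char type, sym[128], mod[64];
    char line[256];

    if (!fp)
        return 1;
    while (fgets(line, sizeof(line), fp)) {
        mod[0] = '\0';
        if (sscanf(line, "%llx %c %127s [%63[^]]]",
                   &addr, &type, sym, mod) >= 3)
            printf("module=%s function=%s\n",
                   mod[0] ? mod : "vmlinux", sym);
    }
    fclose(fp);
    return 0;
}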

waitfd() system call (or equivalent functionality) (pending...)

This is a new system call that implements waitpid() functionality over file descriptors.

This syscall, originally due to Casey Dahlin but significantly modified since, is called quite like waitid():

fd = waitfd(P_PID, some_pid, WEXITED | WSTOPPED, 0);

This returns a file descriptor which becomes ready whenever waitpid() would return, and when read() returns the return value waitpid() would have returned. (Alternatively, you can use it as a pure indication that waitpid() is callable without hanging, and then call waitpid()).
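
A sketch of that second usage pattern, assuming the syscall exists. Since there would be no glibc wrapper, the waitfd() declaration below stands in for an invocation via syscall(2):

#include <poll.h>
#include <sys/wait.h>
#include <unistd.h>

/* waitfd() has no glibc wrapper; its syscall number would come from the
 * (pending) kernel patch. */
extern int waitfd(int which, pid_t upid, int options, int flags);

/* Wait for the fd to become readable, then reap the child with an
 * ordinary waitpid() call that is now guaranteed not to hang. */
int wait_without_blocking_forever(pid_t child)
{
    struct pollfd pfd;
    int status;

    pfd.fd = waitfd(P_PID, child, WEXITED | WSTOPPED, 0);
    pfd.events = POLLIN;

    poll(&pfd, 1, -1);              /* readable => waitpid() is ready */
    waitpid(child, &status, WUNTRACED);
    close(pfd.fd);
    return status;
}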
