SystemTap

SystemTap 3.2 includes an early prototype of SystemTap's new BPF backend (stapbpf). It represents a first step towards leveraging the powerful new tracing and performance analysis capabilities recently added to the Linux kernel. In this post, I will compare stapbpf's translation process with that of the default backend (stap) and highlight some differences in functionality between the two backends.

Stap and stapbpf share common parsing and semantic analysis stages. As input for translation, both backends receive data structures representing a parse tree complete with type information and references to the definitions of all variables and functions being used. A summary of this information can be displayed with the stap command's '-p2' option, which stops stap after pass 2 (elaboration).

$ cat sample.stp
probe kernel.function("sys_read") { printf("hi from sys_read!\n"); exit() }

$ stap -p2 sample.stp
# functions
exit:unknown ()
# probes
kernel.function("SyS_read@fs/read_write.c:542") /* pc=_stext+0x273da0 */ /* <- kernel.function("SyS_read@fs/read_write.c:542") */

$ stap -p2 --runtime=bpf sample.stp
# functions
_set_exit_status:long ()
exit:unknown ()
# probes
kernel.function("SyS_read@fs/read_write.c:542") /* pc=_stext+0x273da0 */ /* <- kernel.function("SyS_read@fs/read_write.c:542") */

You can see that stapbpf's exit function relies on an additional function, _set_exit_status, but otherwise the two backends probe the same location.

From this point, the translation processes diverge. Stap's goal is to convert the script into a kernel module. To accomplish this, stap translates the parse tree into the C source code of the desired kernel module (pass 3) and then uses GCC to compile that source code into the actual kernel module (pass 4). The '-p4' option tells stap to stop after pass 4, leaving the compiled kernel object file behind so that staprun can load it.

# stap -p4 sample.stp
[...]_1316.ko
# staprun [...]_1316.ko
hi from sys_read!

Instead of C, stapbpf translates the script directly into BPF bytecode to be executed by an in-kernel virtual machine. The bytecode is then stored in a BPF-ELF file intended for use by the stapbpf runtime.

# stap -p4 --runtime=bpf sample.stp
stap_1348.bo
# stapbpf stap_1348.bo
hi from sys_read!
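To give a feel for what "BPF bytecode" means here, the Python sketch below hand-encodes a minimal two-instruction program ("r0 = 0; exit") using the fixed 8-byte layout of the kernel's struct bpf_insn (opcode byte, destination and source register nibbles, 16-bit offset, 32-bit immediate). It assumes a little-endian host; the helper name encode_insn is hypothetical, and the program is not what stapbpf emits for sample.stp. Bytecode like this is what the kernel receives through the BPF_PROG_LOAD command of the bpf(2) system call, where the verifier described below runs.

```python
import struct

# Opcode building blocks, with the values defined in the kernel's BPF UAPI headers.
BPF_ALU64 = 0x07  # instruction class: 64-bit ALU
BPF_JMP = 0x05    # instruction class: jumps
BPF_MOV = 0xB0    # ALU operation: dst = src
BPF_K = 0x00      # source operand is the 32-bit immediate
BPF_EXIT = 0x90   # jump-class operation: return r0 to the caller

BPF_REG_0 = 0     # r0 holds the program's return value


def encode_insn(code, dst=0, src=0, off=0, imm=0):
    """Pack one instruction into the 8-byte struct bpf_insn layout.

    Assumes a little-endian host, where dst_reg occupies the low nibble
    of the second byte and src_reg the high nibble.
    """
    return struct.pack("<BBhi", code, (src << 4) | dst, off, imm)


# Hypothetical minimal program "r0 = 0; exit" (not stapbpf output).
program = (
    encode_insn(BPF_ALU64 | BPF_MOV | BPF_K, dst=BPF_REG_0, imm=0)
    + encode_insn(BPF_JMP | BPF_EXIT)
)

if __name__ == "__main__":
    print(program.hex())
```

Because every instruction occupies the same 8 bytes, this two-instruction program is 16 bytes long, and BPF's size limits are naturally expressed as instruction counts.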

Unlike stap's kernel modules, producing the BPF bytecode requires no external compiler. This helps keep stapbpf's compile times and installation footprint low. With the '-v' option, we can see the duration of each stage of translation.

# stap -v -p4 sample.stp
[...]
Pass 3: translated to C [...] in 0usr/0sys/4real ms.
Pass 4: compiled C [...] in 1330usr/310sys/1559real ms.

# stap -v -p4 --runtime=bpf sample.stp
[...]
Pass 4: compiled BPF into "stap_3792.bo" in 0usr/0sys/0real ms.

Notice that passes 3 and 4 together take 1563 ms for stap (4 ms + 1559 ms) but under 1 ms for stapbpf, which combines them into a single pass.

When BPF programs are loaded into the kernel, they are first checked for safety by a BPF verifier built into the kernel. It checks for undesirable behaviors such as out-of-bounds jumps, out-of-bounds stack loads and stores, and reads from uninitialized addresses. It also checks for the presence of unreachable instructions. Any BPF program that fails verification is not loaded into the BPF virtual machine. Although the default stap backend is held to similar standards and is known to be very safe to use, stapbpf has the advantage of inheriting BPF's simpler security model.
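To make two of those checks concrete, the Python sketch below walks the control-flow graph of a program and rejects out-of-range jumps (including falling off the end of the program) and unreachable instructions. It is a toy under stated assumptions: Insn, toy_verify and the opcode names "alu", "ja", "jcond" and "exit" are invented for illustration, and the real kernel verifier also tracks register and stack state and rejects loops, none of which this sketch attempts.

```python
from collections import namedtuple

# Hypothetical, simplified instruction model; NOT the kernel's instruction format.
# op is one of "alu", "ja" (unconditional jump), "jcond" (conditional jump), "exit".
# As in BPF, a jump at index pc with offset off targets pc + 1 + off.
Insn = namedtuple("Insn", ["op", "off"], defaults=[0])


def toy_verify(prog):
    """Return "ok", or a description of the first problem found."""
    if not prog:
        return "empty program"
    seen = {0}
    work = [0]
    while work:
        pc = work.pop()
        insn = prog[pc]
        if insn.op == "exit":
            succs = []
        elif insn.op == "alu":
            succs = [pc + 1]  # falls through to the next instruction
        elif insn.op == "ja":
            succs = [pc + 1 + insn.off]
        elif insn.op == "jcond":
            succs = [pc + 1, pc + 1 + insn.off]  # both branches must be valid
        else:
            return f"unknown opcode at insn {pc}"
        for nxt in succs:
            if not 0 <= nxt < len(prog):
                return f"jump out of range from insn {pc} to {nxt}"
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    # Anything never visited from the entry point is dead code.
    unreachable = [pc for pc in range(len(prog)) if pc not in seen]
    if unreachable:
        return f"unreachable insn {unreachable[0]}"
    return "ok"


if __name__ == "__main__":
    print(toy_verify([Insn("alu"), Insn("exit")]))
    print(toy_verify([Insn("jcond", 5), Insn("exit")]))
    print(toy_verify([Insn("exit"), Insn("alu")]))
```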

However, this advantage comes with some trade-offs. For example, BPF does not support writing to kernel memory. Although stap disables this ability by default, it provides a "guru mode" (stap -g) that acts as an escape hatch for users who want this level of control over their operating system. This means that stapbpf does not share stap's ability to, for example, apply security band-aids to a live system. Even more restrictive is that, to ensure that BPF programs terminate quickly, the verifier rejects any program containing loops. Although stapbpf could in principle unroll loops, BPF also limits each program to 4096 instructions, so unrolled loops quickly exhaust that budget.

# stap --runtime=bpf contains_loops.stp
Error loading /tmp/stapxSM7Kg/stap_8316.bo: bpf program load failed: Invalid argument
[...]
Pass 5: run failed.

# stap --runtime=bpf too_many_insns.stp
Error loading /tmp/stapqxRXi4/stap_8432.bo: bpf program load failed: Argument list too long
[...]
Pass 5: run failed.
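The arithmetic behind the instruction limit is easy to sketch. Unrolling replaces a loop with one copy of its body per iteration, which requires the trip count to be known at translation time and makes the program grow linearly with it. The Python sketch below is a hypothetical model, not stapbpf code: unroll, max_unrolled_iterations and the overhead parameter (instructions outside the loop) are names chosen for illustration, and 4096 is the per-program limit cited above.

```python
BPF_MAX_INSNS = 4096  # per-program instruction limit cited in the text


def unroll(body, iterations):
    """Replace a counted loop with `iterations` back-to-back copies of its body."""
    return [insn for _ in range(iterations) for insn in body]


def max_unrolled_iterations(body_len, overhead=0, limit=BPF_MAX_INSNS):
    """Largest trip count whose unrolled form, plus `overhead`, still fits in `limit`."""
    return (limit - overhead) // body_len


# Hypothetical 40-instruction loop body (placeholder instruction names).
body = [f"insn{i}" for i in range(40)]

if __name__ == "__main__":
    print(len(unroll(body, 100)))               # 4000: fits
    print(len(unroll(body, 103)))               # 4120: over the limit
    print(max_unrolled_iterations(len(body)))   # 102
```

Even a modest 40-instruction body exhausts the budget after about a hundred iterations, before counting any code outside the loop.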

The following table summarizes the differences between stap and stapbpf. Features that BPF permits but that stapbpf does not yet implement are marked 'possible'.

Feature                                         stap                stapbpf
non-blocking probe handlers                     yes                 yes
protected probe execution environment           yes                 yes
lock-protected global variables                 per-probe locking   per-operation locking
kprobes (DWARF)                                 yes                 yes
kprobes (DWARF-less)                            yes                 possible
uprobes                                         yes                 possible
tracepoint probes                               yes                 possible
probes on dynamically loaded kernel objects     yes                 possible
timer-based probing                             yes                 yes
able to change state in probed program          yes                 possible (userspace only)
means for advanced users to bypass protections  yes                 no
loop support (for, while, foreach)              yes                 limited*
string support (variables, literals)            yes                 limited**
probe handler length limit                      1000 statements     4096 instructions
means to increase handler length limit          yes                 no
kernel verifies the safety of the program       no                  yes

* for and while loops are supported only in begin and end probes. These probes execute in user space and therefore do not require verification.
** Strings are supported only as the literal format string passed to printf.
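The "lock-protected global variables" row can be made concrete with a small model. Under per-probe locking, a handler holds the lock for its whole body, so a read-modify-write such as incrementing a global cannot be interrupted by another handler. Under per-operation locking, each individual read or write is protected, but two handlers can still interleave between them. The Python sketch below is hypothetical: it assumes an increment is carried out as a separate read and write, and it replays one such interleaving deterministically rather than modeling either backend's actual implementation.

```python
import threading


class PerOperationGlobal:
    """A global whose individual reads and writes are each locked (per-operation)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value


def per_operation_race():
    """Replay two handlers incrementing one global, interleaved between operations."""
    g = PerOperationGlobal()
    a = g.read()    # handler A reads 0
    b = g.read()    # handler B reads 0 before A has written
    g.write(a + 1)  # A writes 1
    g.write(b + 1)  # B also writes 1, so A's update is lost
    return g.read()


def per_probe_increments(handlers=2):
    """Each handler holds one lock across its whole body, so increments cannot interleave."""
    lock = threading.Lock()
    state = {"count": 0}

    def handler():
        with lock:  # held across both the read and the write
            state["count"] = state["count"] + 1

    threads = [threading.Thread(target=handler) for _ in range(handlers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


if __name__ == "__main__":
    print(per_operation_race())    # 1
    print(per_probe_increments())  # 2
```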

As the table shows, stapbpf provides only a subset of stap's functionality. However, for systems whose security policies either prohibit the kernel module backend or require software with a simpler security model than stap's, stapbpf aims to provide a convenient way to use this subset.


Last updated: December 12, 2017