Site link: https://www.ebpf.top/post/bpf2pbpf_tail_call

Author: Richard Li (Original author’s permission obtained)

Original article link: https://blog.csdn.net/weixin_43705457/article/details/123474244

1. Introduction

This article first introduces the general restrictions on and usage of tail calls, compares them with BPF to BPF calls, and finally presents my modified version of the tail call sample from the kernel source tree (using CO-RE). When I was learning about tail calls, I struggled to find a simple, understandable example that actually ran, so I ended up writing one myself; I believe this version is the most beginner-friendly and logically clear of the examples currently available.

2. Tail Call

BPF provides the ability to safely inject code when kernel and user program events occur, allowing non-kernel developers to exert control over the kernel. However, because of inherent limitations, such as 11 64-bit registers with 32-bit subregisters, a program counter, a 512-byte BPF stack, an instruction limit of 1 million (since 5.1), and a tail-call chain limited to about 33 programs, the logic that can be expressed is limited (eBPF is not Turing complete).

Kernel stack is precious, and in general a BPF to BPF call consumes an additional stack frame. The biggest advantage of tail calls is that they reuse the current stack frame and jump into another eBPF program, as described in [5]:

The important detail that it’s not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding extra call frame.

eBPF programs are verified independently (a callee cannot access values in the caller's stack or registers), so state is typically transferred through per-CPU maps. Tail calls can also use special fields such as skb->cb for data transfer [8]. Furthermore, only BPF programs of the same type can tail-call one another, and they must be consistent with respect to JIT compilation: a JIT-compiled program can only tail-call other JIT-compiled programs, and an interpreted program can only invoke interpreted ones.
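As a minimal sketch of the per-CPU map approach (the map and program names here are illustrative, not part of the sample later in this article), the caller writes its state into a per-CPU slot right before jumping, and the callee reads the same slot:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* One slot per CPU to carry state across the tail call */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} scratch SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} jmp SEC(".maps");

SEC("xdp")
int caller(struct xdp_md *ctx)
{
	__u32 key = 0;
	__u64 *slot = bpf_map_lookup_elem(&scratch, &key);
	if (slot)
		*slot = 42;           /* state for the callee to pick up */
	bpf_tail_call(ctx, &jmp, 0);  /* the callee looks up the same slot */
	return XDP_PASS;              /* fall-through if the jump fails */
}
```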

The steps for tail calls require coordination between user space and kernel space and primarily consist of two parts:

  • User space: a special map of type BPF_MAP_TYPE_PROG_ARRAY, storing a mapping from custom indexes to BPF program file descriptors (bpf_program_fd).

  • Kernel space: the bpf_tail_call helper function, responsible for jumping to another eBPF program. Its definition is: static long (*bpf_tail_call)(void *ctx, void *prog_array_map, __u32 index). Here ctx is the context, prog_array_map is the aforementioned map of type BPF_MAP_TYPE_PROG_ARRAY, through which user space establishes the mapping from custom indexes to jump targets, and index is the user-defined index to jump to.

If bpf_tail_call succeeds, the kernel immediately runs the first instruction of the new eBPF program (it never returns to the previous program). The call can fail if the target program does not exist (i.e., the index is absent from prog_array_map) or if the chain has already reached the maximum tail call limit; in that case the caller simply continues executing its subsequent instructions.
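To make the user-space half concrete, here is a hedged sketch of populating a BPF_MAP_TYPE_PROG_ARRAY; the function name is made up, and the two file descriptors are assumed to come from an already loaded object (e.g., via bpf_object__find_map_fd_by_name and bpf_program__fd):

```c
#include <bpf/bpf.h>

/* Associate a custom index in the prog array with a target program fd,
 * so that bpf_tail_call(ctx, &prog_array, index) can jump to it. */
int register_tail_call(int prog_array_fd, __u32 index, int target_prog_fd)
{
	return bpf_map_update_elem(prog_array_fd, &index, &target_prog_fd, BPF_ANY);
}
```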

In [5], we can see the following text:

The chain of tail calls can form unpredictable dynamic loops, therefore, tail_call_cnt is used to limit the number of calls and currently is set to 32.

This limit is defined in the kernel by the macro MAX_TAIL_CALL_CNT (not accessible from user space) and is currently set to 32 (I am not sure what these so-called unpredictable dynamic loops refer to).
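For reference, in the kernel versions this article targets the limit lives in include/linux/bpf.h (newer kernels have since raised it to 33):

```c
/* include/linux/bpf.h */
#define MAX_TAIL_CALL_CNT 32
```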

In addition to saving kernel stack space, I believe the biggest advantages of tail calls are:

  • Increasing the maximum number of executable eBPF program instructions
  • eBPF program orchestration

These two advantages are not baseless; let me illustrate each with an example:

  1. In [10], the BMC eBPF program contains a large loop. Although the C source is only 142 lines, the compiled bytecode already exceeds 700,000 instructions; without splitting the logic, it would be rejected at the verification stage.

  2. [9] proposes an eBPF orchestration strategy that allows arbitrary combinations of eBPF programs through a configuration file. It comes across as building a third-party program store: the required eBPF programs are fetched automatically according to the configuration and then orchestrated, loaded, and executed. The orchestration itself is implemented with tail calls; the basic flow is illustrated in a figure in [9].

3. BPF to BPF Calls

Originally, an eBPF program had to be written as one big function, because calls between functions within a program were not supported and shared code had to be inlined (e.g., marked __always_inline). Starting from Linux 4.16 and LLVM 6.0, this restriction was lifted, and support for function calls was added to the loader, verifier, interpreter, and JITs.

The significant advantage is a reduction in the size of the generated BPF code, which makes it friendlier to the CPU instruction cache. The BPF helper calling convention also applies to calls between BPF functions: r1-r5 pass the arguments and the result comes back in r0. r1-r5 are scratch registers and r6-r9 are callee-saved, as usual. The maximum nesting depth for such calls is 8, and a caller may pass pointers (e.g., into its own stack frame) to a callee.

To compare the two mechanisms: the downside of tail calls is a larger overall program image, but they save stack memory by reusing the current frame; BPF to BPF calls yield a smaller image at the cost of additional stack frames. Before kernel 5.10, the two kinds of calls could not be combined; from 5.10 onward, on the x86 architecture, both can be used in the same program.
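A minimal sketch of a BPF to BPF call (illustrative names; __noinline, provided by bpf_helpers.h, keeps the compiler from inlining the callee so a real call instruction is emitted):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* A real BPF-to-BPF callee: arguments arrive in r1-r5,
 * and the return value travels back in r0 */
static __noinline int sum(int a, int b)
{
	return a + b;
}

SEC("xdp")
int entry(struct xdp_md *ctx)
{
	return sum(1, 2) == 3 ? XDP_PASS : XDP_DROP;
}
```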

It is mentioned in [7] that there are certain limitations when the two are used together; without them, the kernel stack could overflow:

(A call-flow figure appears here in the original post.) Taking that call chain as an example, the limitation is that the stack space of each subprogram may not exceed 256 bytes (if the verifier detects a BPF to BPF call, the main program is also treated as a subprogram). With this constraint, a BPF program call chain can use at most about 8 KB of stack: 256 bytes per frame multiplied by the tail-call limit of 33. Without it, each frame could use the full 512 bytes, for a total of up to 16 KB, which could overflow the kernel stack on some architectures.

One thing worth noting is how memory is saved when the two are combined. For example, when subfunc1 tail-calls func2, func2 reuses subfunc1's stack frame; when func2 then makes a BPF to BPF call to subfunc2, a third stack frame is created; and when subfunc2 tail-calls func3, func3 reuses that frame again. In the end, five logical functions run on three stack frames, which is where the saving comes from.

Because subfunc1 was entered through an ordinary call from func1, execution will eventually return to func1.
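A hedged sketch of what the mixed case looks like in code (requires kernel 5.10+ on x86-64; all names are illustrative):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);
} jmp_table SEC(".maps");

/* Entered via a normal BPF to BPF call; the tail call it makes
 * reuses this subprogram's (at most 256-byte) stack frame */
static __noinline int subfunc1(struct xdp_md *ctx)
{
	bpf_tail_call(ctx, &jmp_table, 0);
	return XDP_PASS; /* fall-through if slot 0 is empty */
}

SEC("xdp")
int func1(struct xdp_md *ctx)
{
	/* Ordinary call: allocates a new frame and returns here afterwards */
	return subfunc1(ctx);
}
```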

4. CO-RE Sample

The official tail call examples in the kernel can be seen in [11][12]. I modified [11] using libbpf's CO-RE features to make it easier to understand how a user program drives tail calls through libbpf. Although eBPF programs can also be attached with the help of other commands such as tc or prctl, I think calling the interfaces directly in the example program is easier to follow. We use [11] as the example.

The example hooks the seccomp filter, which is used to reduce the kernel's system-call surface exposed to applications by limiting the system calls each process may use (a restriction applied at the task_struct level). It supports two filtering modes: SECCOMP_SET_MODE_STRICT, which restricts the process to a small subset of syscalls (read, write, _exit, sigreturn), and SECCOMP_SET_MODE_FILTER, which applies BPF-style filters. [3] mentions that, because open is restricted, the risk of TOCTTOU (time-of-check to time-of-use) errors is naturally eliminated [6].

Regarding the seccomp filter, [3] explains it as follows:

System call filtering isn’t a sandbox. It provides a clearly defined mechanism for minimizing the exposed kernel surface. It is meant to be a tool for sandbox developers to use.

[2] states:

Designed to sandbox compute-bound programs that deal with untrusted byte code.
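To make the strict mode tangible, here is a minimal user-space sketch (separate from the sample below) that locks the calling process down to the read/write/_exit/sigreturn subset:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>

int main(void)
{
	/* After this call, only read, write, _exit and sigreturn are allowed;
	 * any other syscall terminates the process with SIGKILL */
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
		perror("prctl");
		return 1;
	}
	const char msg[] = "write() is still permitted\n";
	write(STDOUT_FILENO, msg, sizeof(msg) - 1);
	_exit(0);
}
```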

tracex5_kern.bpf.c (saved locally as trace.bpf.c, matching the build commands below)

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "bpf_helpers.h"

char LICENSE[] SEC("license") = "Dual BSD/GPL";

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1024);
	__type(key, u32);
	__type(value, u32);
} progs SEC(".maps");


SEC("kprobe/__seccomp_filter")
int BPF_KPROBE(__seccomp_filter, int this_syscall, const struct seccomp_data *sd, const bool recheck_after_trace)
{	
	// Note: the eBPF stack is only 512 bytes, so keep this buffer small;
	// if it is too large the program fails verification (try adjusting it)
	char comm_name[30];
	bpf_get_current_comm(comm_name, sizeof(comm_name));
	// If the tail call fails, execution falls through to the code below
	bpf_tail_call(ctx, &progs, this_syscall);

	char fmt[] = "syscall=%d common=%s\n";
	bpf_trace_printk(fmt, sizeof(fmt), this_syscall, comm_name);
	return 0;
}

/* We jump here when syscall number == __NR_write */
SEC("kprobe/SYS__NR_write")
int bpf_func_SYS__NR_write(struct pt_regs *ctx)
{
	struct seccomp_data sd;
	bpf_probe_read(&sd, sizeof(sd), (void *)PT_REGS_PARM2(ctx));
	if (sd.args[2] > 0) {
		char fmt[] = "write(fd=%d, buf=%p, size=%d)\n";
		bpf_trace_printk(fmt, sizeof(fmt),
				 sd.args[0], sd.args[1], sd.args[2]);
	}
	return 0;
}
SEC("kprobe/SYS__NR_read")
int bpf_func_SYS__NR_read(struct pt_regs *ctx)
{	
	struct seccomp_data sd;
	bpf_probe_read(&sd, sizeof(sd), (void *)PT_REGS_PARM2(ctx));
	if (sd.args[2] > 0 && sd.args[2] <= 1024) {
		char fmt[] = "read(fd=%d, buf=%p, size=%d)\n";
		bpf_trace_printk(fmt, sizeof(fmt),
				 sd.args[0], sd.args[1], sd.args[2]);
	}
	return 0;
}

SEC("kprobe/SYS__NR_open")
int bpf_func_SYS__NR_open(struct pt_regs *ctx)
{	
	struct seccomp_data sd;
	bpf_probe_read(&sd, sizeof(sd), (void *)PT_REGS_PARM2(ctx));
	char fmt[] = "open(fd=%d, path=%p)\n";
	bpf_trace_printk(fmt, sizeof(fmt), sd.args[0], sd.args[1]);
	return 0;
}
```

tracex5_user.c (saved locally as trace.c)

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdbool.h>
#include <signal.h>
#include <time.h>
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <linux/if_link.h>
#include <linux/limits.h>

#include <bpf/libbpf.h>
#include "trace.skel.h"

#define BPF_SYSFS_ROOT "/sys/fs/bpf"

// Indexes into the prog array, one per tail-call target.
// Note: these values match the 32-bit x86 syscall numbers (read=3, write=4, open=5);
// adjust them to the syscall numbers of your target architecture.
enum {
	SYS__NR_read = 3,
	SYS__NR_write = 4,
	SYS__NR_open = 5,
};

struct bpf_progs_desc {
	char name[256];
	enum bpf_prog_type type;
	int map_prog_idx;
	struct bpf_program *prog;
};

static struct bpf_progs_desc progs[] = {
	{"kprobe/__seccomp_filter", BPF_PROG_TYPE_KPROBE, -1, NULL},
	{"kprobe/SYS__NR_read", BPF_PROG_TYPE_KPROBE, SYS__NR_read, NULL},
	{"kprobe/SYS__NR_write", BPF_PROG_TYPE_KPROBE, SYS__NR_write, NULL},
	{"kprobe/SYS__NR_open", BPF_PROG_TYPE_KPROBE, SYS__NR_open, NULL},
};

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
	return vfprintf(stderr, format, args);
}

static volatile bool exiting = false;

static void sig_handler(int sig)
{
	exiting = true;
}

int main(int argc, char **argv)
{
	struct trace_bpf *skel;
	int map_progs_fd, prog_count;
	int err;

	// Set up some callbacks for debug info
	libbpf_set_print(libbpf_print_fn);

	signal(SIGINT, sig_handler);
	signal(SIGTERM, sig_handler);

	// Open the BPF skeleton
	skel = trace_bpf__open();
	if (!skel) {
		fprintf(stderr, "Failed to open and load the BPF skeleton\n");
		return 1;
	}

	// Load and verify the BPF programs 
	err = trace_bpf__load(skel);
	if (err) {
		fprintf(stderr, "Failed to load and verify the BPF skeleton\n");
		goto cleanup;
	}

	map_progs_fd = bpf_object__find_map_fd_by_name(skel->obj, "progs");
	if (map_progs_fd < 0) {
		fprintf(stderr, "Error: bpf_object__find_map_fd_by_name failed\n");
		return 1;
	}
	prog_count = sizeof(progs) / sizeof(progs[0]);
	for (int i = 0; i < prog_count; i++) {
		progs[i].prog = bpf_object__find_program_by_title(skel->obj, progs[i].name);
		if (!progs[i].prog) {
			fprintf(stderr, "Error: bpf_object__find_program_by_title failed\n");
			return 1;
		}
		bpf_program__set_type(progs[i].prog, progs[i].type);
	}
	for (int i = 0; i < prog_count; i++) {
		int prog_fd = bpf_program__fd(progs[i].prog);
		if (prog_fd < 0) {
			fprintf(stderr, "Error: Couldn't get file descriptor for program %s\n", progs[i].name);
			return 1;
		}

		// map_prog_idx == -1 marks the main program, which does not go into the prog array
		if (progs[i].map_prog_idx != -1) {
			unsigned int map_prog_idx = progs[i].map_prog_idx;
			// Insert prog_fd into the progs map at index map_prog_idx
			err = bpf_map_update_elem(map_progs_fd, &map_prog_idx, &prog_fd, 0);
			if (err) {
				fprintf(stderr, "Error: bpf_map_update_elem failed for prog array map\n");
				return 1;
			}
		}
	}

	// Attach only the main program; the tail-call targets must not be attached
	// themselves, so trace_bpf__attach cannot be used here
	struct bpf_link *link = bpf_program__attach(skel->progs.__seccomp_filter);
	if (link == NULL) {
		fprintf(stderr, "Error: bpf_program__attach failed\n");
		return 1;
	}

	while (!exiting) {
		// Sleep so that the wait loop does not burn CPU
		sleep(1);
	}

cleanup:
	// Cleanup
	trace_bpf__destroy(skel);

	return err < 0 ? -err : 0;
}
```

Run the following commands:

  1. clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/src/kernels/5.4.119-19-0009.1/include/ -idirafter /usr/local/include -idirafter /usr/include -c trace.bpf.c -o trace.bpf.o (replace the kernel headers path with your own)

  2. bpftool gen skeleton trace.bpf.o > trace.skel.h

  3. clang -g -O2 -Wall -I . -c trace.c -o trace.o

  4. clang -Wall -O2 -g trace.o -lbpf -lelf -lz -o trace

  5. cat /sys/kernel/debug/tracing/trace_pipe


You can see the expected output.

5. Tail Call Costs in eBPF

[13] evaluates how much overhead some of the mitigations introduced against the Spectre vulnerability add to eBPF tail-call performance (Spectre is a family of vulnerabilities that exploit hardware flaws present in most CPUs from Intel, AMD, and ARM).

I haven't read this paper closely yet; I've only skimmed it, and I'll give it a careful read when I have the energy.

6. Summary

I'm still struggling with my eBPF environment. I started by running virtual machines on VMware Fusion, but VMware Tools does not support the M1, which makes it very inconvenient to use, and images need special modifications to run a newer kernel on the M1. Moreover, the cloud server I purchased only goes up to kernel 5.4.119, so some eBPF features are unavailable. As for dual-booting an M1 Mac, I don't have the energy to be the first to try it: there is little information available, and besides, mixing tail calls with BPF to BPF calls is currently only supported on x86, so there is not much to gain.

Well, I admit, ultimately, I just don’t want to spend money on buying Parallels Desktop.

7. References

  1. Linux Security: Seccomp
  2. Using Seccomp to Restrict Kernel Attack Surface
  3. SECure COMPuting with Filters
  4. Seccomp(2) — Linux Manual Page
  5. BPF: Introducing bpf_tail_call() Helper
  6. Time-of-Check to Time-of-Use Wiki
  7. Cilium Document
  8. eBPF: Traffic Control Subsystem
  9. Introducing Walmart’s L3AF Project: Control Plane, Chaining eBPF Programs, and Open-Source Plans
  10. BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing
  11. Kernel: tracex5_kern.c
  12. Kernel: sockex3_kern.c
  13. Evaluation of Tail Call Costs in eBPF