Link to this article: https://www.ebpf.top/post/top_and_tricks_for_bpf_libbpf

Original article: https://www.pingcap.com/blog/tips-and-tricks-for-writing-linux-bpf-applications-with-libbpf/

In early 2020, while I was using BCC tools to analyze our database's performance bottlenecks, I pulled the latest code from GitHub and unexpectedly found a new libbpf-tools directory in the BCC project. I studied the articles on BPF Portability and BCC to libbpf Conversion and, based on what I learned, converted the bcc-tools I had previously submitted into libbpf-tools. In the end, I completed the conversion of nearly 20 tools (see Why We Switched from BCC-Tools to libbpf-Tools for BPF Performance Analysis).

During this process, I was fortunate to receive a lot of help from Andrii Nakryiko (the maintainer of libbpf and the BPF CO-RE project). It was an interesting experience, and I learned a lot. In this article, I share the experience I gained writing BPF programs with libbpf, in the hope that it helps those interested in libbpf develop and improve their own BPF applications.

However, before reading on, I recommend reading the articles mentioned above (BPF Portability and CO-RE, and BCC to libbpf Conversion) for important background information.

This article assumes that you have already read them, so there will be no systematic description here; instead, I will offer tips on specific details of these programs.

Program Framework (Skeleton)

Combine the Open and Load Stages

If your BPF code does not require any runtime adjustments, such as resizing maps or setting extra configuration, you can call <name>__open_and_load() to combine the two stages and keep the code concise. For example:

obj = readahead_bpf__open_and_load();
if (!obj) {
        fprintf(stderr, "failed to open and/or load BPF object\n");
        return 1;
}
err = readahead_bpf__attach(obj);
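
For the teardown, the skeleton generates a matching <name>__destroy() that detaches the programs and frees all resources in one call. A minimal sketch of the corresponding cleanup path (the label is illustrative):

cleanup:
        readahead_bpf__destroy(obj);
        return err != 0;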

You can view the complete code sample in readahead.c. (Note: this pull request adjusted the code in subsequent versions.)

Selective Attachment (Attach)

By default, <name>__attach() attaches all auto-attachable BPF programs. However, sometimes you may want to attach programs selectively, based on command-line parameters. In that case, call bpf_program__attach() yourself. For example:

err = biolatency_bpf__load(obj);
[...]
if (env.queued) {
        obj->links.block_rq_insert =
                bpf_program__attach(obj->progs.block_rq_insert);
        err = libbpf_get_error(obj->links.block_rq_insert);
        [...]
}
obj->links.block_rq_issue =
        bpf_program__attach(obj->progs.block_rq_issue);
err = libbpf_get_error(obj->links.block_rq_issue);
[...]

You can see the complete code example in biolatency.c.

Custom Load and Attach

The skeleton is suitable for almost all scenarios, but there is one special case: performance events (perf events). Here you don't use the links in struct <name>__bpf; instead, you define a separate array, struct bpf_link *links[], because a perf event must be opened on each CPU individually.
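
For instance, here is a minimal sketch of that setup, assuming the CPU count comes from libbpf_num_possible_cpus() and links is heap-allocated (the real runqlen.c may organize this differently):

        struct bpf_link **links;
        int nr_cpus;

        nr_cpus = libbpf_num_possible_cpus();      /* CPUs the links array must cover */
        if (nr_cpus < 0)
                return 1;
        links = calloc(nr_cpus, sizeof(*links));   /* one bpf_link slot per CPU */
        if (!links)
                return 1;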

Then, you need to open and attach perf_event manually:

static int open_and_attach_perf_event(int freq, struct bpf_program *prog,
                                struct bpf_link *links[])
{
        struct perf_event_attr attr = {
                .type = PERF_TYPE_SOFTWARE,
                .freq = 1,
                .sample_period = freq,
                .config = PERF_COUNT_SW_CPU_CLOCK,
        };
        int i, fd;
        for (i = 0; i < nr_cpus; i++) {
                fd = syscall(__NR_perf_event_open, &attr, -1, i, -1, 0);
                if (fd < 0) {
                        fprintf(stderr, "failed to init perf sampling: %s\n",
                                strerror(errno));
                        return -1;
                }
                links[i] = bpf_program__attach_perf_event(prog, fd);
                if (libbpf_get_error(links[i])) {
                        fprintf(stderr, "failed to attach perf event on cpu: "
                                "%d\n", i);
                        links[i] = NULL;
                        close(fd);
                        return -1;
                }
        }

        return 0;
}

Finally, in the cleanup phase, remember to destroy each link in links, and then release links itself.
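
A minimal cleanup sketch, assuming links was allocated as in the setup above (bpf_link__destroy() tolerates the NULL entries left behind by failed attaches):

        for (i = 0; i < nr_cpus; i++)
                bpf_link__destroy(links[i]);       /* detach and free each per-CPU link */
        free(links);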

You can see the complete code in runqlen.c.

Multiple BPF Handlers for the Same Event

Starting with v0.2, libbpf supports multiple entry BPF programs in the same ELF (Executable and Linkable Format) section, so you can attach multiple BPF programs to the same event (e.g., a tracepoint or a kprobe) without worrying about ELF section name conflicts. For more information, see Add libbpf full support for BPF-to-BPF calls. You can now naturally define multiple handlers for one event, like this:

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry1, int irq, struct irqaction *action)
{
            [...]
}

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry2)
{
            [...]
}

You can see the complete code in hardirqs.bpf.c (the code is built on libbpf-bootstrap; note that this file no longer exists).

If you are using a libbpf version earlier than v0.2 and want to define multiple handlers for one event, you have to use different program types, for example:

SEC("tracepoint/irq/irq_handler_entry")
int handle__irq_handler(struct trace_event_raw_irq_handler_entry *ctx)
{
        [...]
}

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry)
{
        [...]
}

You can see the complete code in hardirqs.bpf.c.

Maps

Reduce Pre-Allocation Overhead

Note: this tip is now outdated. Per https://github.com/iovisor/bcc/pull/4044, using hash maps with the BPF_F_NO_PREALLOC flag triggers a kernel warning and, according to kernel commit 94dacdbd5d2d, may cause deadlocks, so the flag has been removed from libbpf-tools.

Starting from Linux 4.6, BPF hash maps preallocate memory by default, and the BPF_F_NO_PREALLOC flag was introduced. The motivation was to avoid kprobe + bpf deadlocks. The community tried other solutions, but in the end, preallocating all map elements was the simplest solution and did not affect user-space behavior.

When fully preallocating a map is too expensive, you can define the map with the BPF_F_NO_PREALLOC flag to keep the old behavior. For more details, see bpf: map pre-alloc. When the map is small (e.g., MAX_ENTRIES = 256), the flag is unnecessary, because BPF_F_NO_PREALLOC makes the map slower.

Here is an example of usage:

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, MAX_ENTRIES);
        __type(key, u32);
        __type(value, u64);
        __uint(map_flags, BPF_F_NO_PREALLOC);
} start SEC(".maps");

You can see more examples in libbpf-tools.

Determining Map Size at Runtime

One advantage of libbpf-tools is portability, so the maximum space a map needs may vary from machine to machine. In this case, you can define the map without specifying its size, and adjust the size at runtime before loading. For example:

In <name>.bpf.c, define the map:

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __type(key, u32);
        __type(value, u64);
} start SEC(".maps");

After the open stage, call bpf_map__resize() to dynamically adjust it. For example:

struct cpudist_bpf *obj;

[...]
obj = cpudist_bpf__open();
bpf_map__resize(obj->maps.start, pid_max);

You can see the complete code in cpudist.c. (Note: in newer libbpf versions, bpf_map__resize() is deprecated in favor of bpf_map__set_max_entries().)
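
With the newer API, the same adjustment would look like this (a sketch against the same cpudist skeleton):

        obj = cpudist_bpf__open();
        [...]
        err = bpf_map__set_max_entries(obj->maps.start, pid_max);
        if (err) {
                fprintf(stderr, "failed to resize map: %d\n", err);
                return 1;
        }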

Per-CPU

When choosing a map type, note that if two related events (such as an entry and the corresponding exit) always occur on the same CPU, you can use a per-CPU array to track the timestamp, which is simpler and more efficient than a hash map. However, you must be sure the kernel does not migrate the process from one CPU to another between the two BPF program invocations, so you cannot always use this trick. The following example, which analyzes soft interrupts, meets both conditions:

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, u32);
        __type(value, u64);
} start SEC(".maps");

SEC("tp_btf/softirq_entry")
int BPF_PROG(softirq_entry, unsigned int vec_nr)
{
        u64 ts = bpf_ktime_get_ns();
        u32 key = 0;

        bpf_map_update_elem(&start, &key, &ts, 0);
        return 0;
}

SEC("tp_btf/softirq_exit")
int BPF_PROG(softirq_exit, unsigned int vec_nr)
{
        u32 key = 0;
        u64 *tsp;

        [...]
        tsp = bpf_map_lookup_elem(&start, &key);
        [...]
}

You can see the complete code in softirqs.bpf.c.

Global Variables

Not only can you use global variables to customize the logic of your BPF programs, but you can also use them in place of maps, which makes the program simpler and more efficient. Global variables can be any fixed-size type, from a single scalar to a complex array of structs.

For example, because the number of SOFTIRQ types is fixed, you can define a global array in softirqs.bpf.c to store the counts and histograms:

__u64 counts[NR_SOFTIRQS] = {};
struct hist hists[NR_SOFTIRQS] = {};

Then, you can directly iterate over this array in user space:

static int print_count(struct softirqs_bpf__bss *bss)
{
        const char *units = env.nanoseconds ? "nsecs" : "usecs";
        __u64 count;
        __u32 vec;

        printf("%-16s %6s%5s\n", "SOFTIRQ", "TOTAL_", units);

        for (vec = 0; vec < NR_SOFTIRQS; vec++) {
                count = __atomic_exchange_n(&bss->counts[vec], 0,
                                            __ATOMIC_RELAXED);
                if (count > 0)
                        printf("%-16s %11llu\n", vec_names[vec], count);
        }

        return 0;
}

You can find the complete code in softirqs.c.

Note on Accessing Fields Directly Through Pointers

As you may have learned from the article BPF Portability and CO-RE, with the combination of libbpf + BPF_PROG_TYPE_TRACING, the BPF verifier understands and tracks BTF natively, which allows you to safely follow pointers and read kernel memory directly. For example:

u64 inode = task->mm->exe_file->f_inode->i_ino;

This is really convenient. However, when such expressions appear in conditional statements, an incorrect branch elimination in the verifier can introduce bugs on certain kernel versions. Until the fix bpf: fix an incorrect branch elimination by verifier is widely available, use BPF_CORE_READ to ensure compatibility across kernels. You can find an example in biolatency.bpf.c:

SEC("tp_btf/block_rq_issue")
int BPF_PROG(block_rq_issue, struct request_queue *q, struct request *rq)
{
    if (targ_queued && BPF_CORE_READ(q, elevator))
        return 0;
    return trace_rq_start(rq);
}

As you can see, even though this is a tp_btf program and the direct access q->elevator would be faster, I still used BPF_CORE_READ(q, elevator).
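
For contrast, this is what the direct-access version would look like once the verifier fix is everywhere (a sketch only; the shipped tool keeps BPF_CORE_READ):

SEC("tp_btf/block_rq_issue")
int BPF_PROG(block_rq_issue, struct request_queue *q, struct request *rq)
{
    if (targ_queued && q->elevator)     /* direct, BTF-tracked dereference */
        return 0;
    return trace_rq_start(rq);
}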

Conclusion

This article introduced some tricks for writing BPF programs using libbpf. You can find many practical examples in libbpf-tools and bpf. If you have any questions, feel free to join the TiDB community on Slack and send us your feedback.