This article link: https://www.ebpf.top/en/post/ebpf_and_kernel_feature

In 2022, the Linux kernel released versions 5.16 through 5.19, 6.0, and 6.1, each introducing numerous new eBPF features. This article briefly introduces these features; for more in-depth information, please refer to the provided links. Overall, eBPF remains one of the most active subsystems in the kernel, and its functionality continues to evolve rapidly. In a sense, eBPF is moving towards a comprehensive programmable interface for kernel space.

BPF kfuncs

The BPF subsystem exposes many aspects of kernel internal algorithms and data structures, leading to concerns about maintaining interface stability when the kernel changes. Historically, BPF’s lack of interface stability guarantees for user space has been problematic; kernel developers have found themselves maintaining interfaces that were not intended to be stable. The BPF community is now considering what it might mean to provide explicit stability commitments for some of its interfaces.

BPF allows programs loaded from user space to be attached to numerous hooks and run in the kernel, once the subsystem's verifier concludes that these programs will not harm the system. A program gains access to the kernel data structures provided by the hooks to which it is attached. In some cases, a program can modify these data structures directly and thereby influence the kernel's operation; in other cases, the kernel acts on the value returned by the BPF program, for example, allowing or disallowing a particular operation.

There are two mechanisms through which the kernel can provide additional functionality to BPF programs. Helpers are special functions written specifically to be provided to BPF programs, and they have been around since the beginning of the extended BPF era. The mechanism known as kfuncs is relatively new; it allows any kernel function to be made available to BPF, with some limitations. Kfuncs are simpler to add and more flexible; had they been implemented first, it seems unlikely that helpers would have been added later. However, kfuncs have a significant limitation: they can only be called from JIT-compiled BPF code, making them unavailable on architectures lacking JIT support (currently including 32-bit Arm and RISC-V, although patches to add support for these are in development). Each kfunc provides some useful functionality for BPF programs, but virtually every kfunc also exposes some aspect of how the kernel works internally.

Bloom Filter Map: 5.16

A Bloom filter is a space-efficient probabilistic data structure used to quickly test if an element exists in a set. In a Bloom filter, false positives are possible, while false negatives are not.

This patchset includes benchmarks with configurable numbers of hash functions and entries in the Bloom filter. They roughly indicate that, on average, using 3 hash functions is one of the best choices. Comparing hashmap lookups fronted by a Bloom filter with 3 hash functions against hashmap lookups without a Bloom filter, the Bloom filter variant is about 15% faster for 50,000 entries, 25% faster for 100,000 entries, 180% faster for 500,000 entries, and 200% faster for 1 million entries.
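To make the "false positives are possible, false negatives are not" property concrete, here is a minimal user-space sketch of a Bloom filter with 3 hash functions. It is not the kernel implementation: the names (toy_bloom, toy_hash and so on), the 1024-bit size and the seeded FNV-1a hash are all assumptions chosen for illustration. Inside a BPF program the corresponding map type is BPF_MAP_TYPE_BLOOM_FILTER, where elements are added with bpf_map_push_elem() and queried with bpf_map_peek_elem().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOOM_BITS   1024u /* illustrative size, not a kernel default */
#define TOY_BLOOM_HASHES 3u    /* the count the benchmarks found to work well */

struct toy_bloom {
	uint8_t bits[TOY_BLOOM_BITS / 8];
};

/* Seeded FNV-1a; a stand-in for whatever hash the kernel really uses. */
static uint32_t toy_hash(const void *key, size_t len, uint32_t seed)
{
	const uint8_t *p = key;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h % TOY_BLOOM_BITS;
}

static void toy_bloom_init(struct toy_bloom *bf)
{
	assert(bf != NULL);
	memset(bf->bits, 0, sizeof(bf->bits));
}

/* Insert: set one bit per hash function. */
static void toy_bloom_add(struct toy_bloom *bf, const void *key, size_t len)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(key, len, i);

		bf->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* Query: 0 means "definitely absent", 1 means "possibly present". */
static int toy_bloom_maybe_contains(const struct toy_bloom *bf,
				    const void *key, size_t len)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(key, len, i);

		if (!(bf->bits[bit / 8] & (1u << (bit % 8))))
			return 0; /* a clear bit proves the key was never added */
	}
	return 1; /* all bits set: present, or a false positive */
}
```

A caller would typically consult the filter first and only perform the more expensive hashmap lookup when the filter answers "possibly present".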

Compile Once – Run Everywhere: Linux 5.17 [Kernel Space]

The “Compile Once – Run Everywhere” (CO-RE) mechanism, previously implemented in user space, now runs in the kernel. This is a step towards eventually achieving signed BPF programs and makes it easier for languages like Go to use BPF features.

Together with the bpf_loop() helper described below, CO-RE in Linux 5.17 greatly reduces the complexity of keeping eBPF programs compatible across multiple kernel versions and of handling loop logic.

The CO-RE project for eBPF leverages debugging information provided by BPF Type Format (BTF) and completes the following four steps to make eBPF programs adaptable to different kernel versions:

  • First, bpftool can generate a header file (vmlinux.h) from BTF, eliminating the dependency on kernel headers.
  • Second, field access offsets in the BPF code are rewritten at load time, which solves the problem of data structure offsets differing across kernel versions.
  • Third, libbpf predefines the known variations of data structures across kernel versions, which resolves the issue of incompatible data structures between kernels.
  • Fourth, libbpf provides a set of kernel feature detection library functions to resolve the issue of eBPF programs needing to execute different behaviors in various kernel versions. For example, you can use bpf_core_type_exists() and bpf_core_field_exists() to check the existence of kernel data types and member variables and use methods similar to extern int LINUX_KERNEL_VERSION __kconfig to inquire about kernel configuration options.
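The second and fourth steps above can be illustrated with a small user-space model. This is only a sketch of the idea, not how libbpf or the kernel implement relocations: the two struct layouts, struct toy_reloc and toy_read_pid() are hypothetical names invented here. The point is that the program reads a field through an offset resolved for the target kernel rather than one fixed at compile time, and falls back gracefully when the field does not exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two hypothetical layouts of the "same" kernel structure. */
struct task_v1 {
	int32_t pid;
	int32_t prio;
};

struct task_v2 {
	int64_t state;
	int32_t prio;
	int32_t pid; /* same field, different offset */
};

/* What a loader would resolve from the target kernel's BTF. */
struct toy_reloc {
	int    exists; /* plays the role of bpf_core_field_exists() */
	size_t offset; /* relocated offset of the field */
};

/* Read "pid" through the relocated offset instead of a compiled-in one. */
static int32_t toy_read_pid(const void *task, const struct toy_reloc *r,
			    int32_t fallback)
{
	int32_t pid;

	assert(task != NULL && r != NULL);
	if (!r->exists)
		return fallback; /* field missing on this "kernel" */
	memcpy(&pid, (const uint8_t *)task + r->offset, sizeof(pid));
	return pid;
}
```

In a real CO-RE program the relocation record is emitted by the compiler and patched by the loader, so the source code simply accesses the field (for example through the BPF_CORE_READ() macro) and never handles offsets by hand.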

By adopting these methods, CO-RE allows eBPF programs to be compiled in a development environment and distributed to machines running different kernel versions without requiring the installation of various development tools and kernel headers on the target machines. Therefore, the Linux kernel community strongly recommends that all developers use CO-RE and libbpf to build eBPF programs. In fact, if you examine the source code of BCC, you will find that BCC has already migrated many tools to CO-RE.

bpf_loop() Helper Function: 5.17

One of the main features of the eBPF virtual machine is the built-in verifier in the kernel, ensuring that all BPF programs can run safely. However, many BPF developers find the verifier somewhat of a mixed blessing; while it can catch many issues before they occur, it can also be quite challenging to satisfy. Comparing it to a well-meaning but rule-bound and picky bureaucratic organization is not entirely inaccurate. Joanne Koong’s proposed bpf_loop() is aimed at making a certain type of loop structure more user-friendly for BPF users.

In essence, this is the purpose of Koong’s patch. It introduces a new helper function that can be invoked from BPF code.

long bpf_loop(u32 iterations, long (*loop_fn)(u32 index, void *ctx),
          void *ctx, u64 flags);

Invoking bpf_loop() results in loop_fn() being called iterations times, each time with the current iteration index and the passed ctx as parameters. The flags value is currently unused and must be zero. loop_fn() normally returns 0; a return value of 1 ends the iteration immediately. No other return values are permitted.

Unlike bpf_for_each_map_elem(), which is limited by the size of the BPF map, bpf_loop() can loop up to 1 << 23 = 8388608 (over 8 million) times, greatly expanding where it can be applied. In addition, bpf_loop() is not constrained by the BPF instruction limit (1 million), because the loop runs inside the bpf_loop() helper function itself.
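The calling convention described above can be modelled in ordinary user-space C. The sketch below is not the kernel helper: toy_bpf_loop() and toy_sum_until_4() are hypothetical names, and the error codes and the "number of iterations performed" return value are assumptions modelled on the helper's documented behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_LOOPS (1u << 23) /* the 8388608 limit mentioned above */

typedef long (*toy_loop_fn)(uint32_t index, void *ctx);

/* User-space model of bpf_loop(): returns the number of iterations
 * performed, or a negative errno value (assumed semantics). */
static long toy_bpf_loop(uint32_t iterations, toy_loop_fn loop_fn,
			 void *ctx, uint64_t flags)
{
	uint32_t i;

	assert(loop_fn != NULL);
	if (flags != 0)
		return -EINVAL; /* flags are reserved and must be zero */
	if (iterations > TOY_MAX_LOOPS)
		return -E2BIG;

	for (i = 0; i < iterations; i++) {
		if (loop_fn(i, ctx) != 0)
			return (long)i + 1; /* callback asked to stop early */
	}
	return (long)i;
}

/* Example callback: add up the indices and stop once index 4 is seen. */
static long toy_sum_until_4(uint32_t index, void *ctx)
{
	long *sum = ctx;

	*sum += index;
	return index == 4 ? 1 : 0;
}
```

Because the loop body lives in a separate callback, the verifier only has to check that callback once, which is why the iteration count does not add to the program's instruction budget.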

Kprobe Multi Link: 5.18

This patchset introduces the new link type BPF_TRACE_KPROBE_MULTI, which attaches kprobe programs through the fprobe API written by Masami Hiramatsu. The fprobe API allows a probe to be attached to multiple functions at once; it works on top of ftrace, which makes tracing fast but limits probe points to function entry or return.

Dynamic Pointers and Type Pointers: 5.19

All memory accesses in BPF programs are statically checked for safety by the verifier, which analyzes the program thoroughly before allowing it to run. While this enables BPF programs to run safely in kernel space, it restricts how a program can use pointers. Until recently, one such limitation was that the size of any memory region referenced by a pointer had to be statically known when the BPF program was loaded. A patch set from Joanne Koong enhances BPF to support loading programs whose pointers refer to dynamically sized memory regions.

Koong's patch set adds support for accessing dynamically sized memory regions in BPF programs through a new feature called dynptrs. The idea behind dynptrs is to associate a pointer to a dynamically sized data region with metadata that the verifier and some BPF helper functions use to ensure that accesses to the region are valid. The patch set creates this association in a newly defined type called struct bpf_dynptr. This structure is opaque to BPF programs.
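The pairing of a pointer with size metadata can be sketched in plain C. The model below is an illustration only: struct toy_dynptr and its functions are hypothetical, the real struct bpf_dynptr is opaque and its layout is not shown here, and the error codes are assumptions. What it demonstrates is that every access goes through a function that checks the requested range against the recorded size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the opaque struct bpf_dynptr. */
struct toy_dynptr {
	void    *data; /* start of the dynamically sized region */
	uint32_t size; /* metadata consulted on every access */
};

static void toy_dynptr_init(struct toy_dynptr *p, void *data, uint32_t size)
{
	assert(p != NULL);
	p->data = data;
	p->size = size;
}

/* Copy len bytes starting at offset into dst, refusing out-of-bounds reads. */
static int toy_dynptr_read(void *dst, uint32_t len,
			   const struct toy_dynptr *src, uint32_t offset)
{
	if (src->data == NULL)
		return -EINVAL;
	if ((uint64_t)offset + len > src->size)
		return -E2BIG;
	memcpy(dst, (const uint8_t *)src->data + offset, len);
	return 0;
}

/* Return a direct pointer to a slice, or NULL if it would not fit. */
static void *toy_dynptr_data(const struct toy_dynptr *p, uint32_t offset,
			     uint32_t len)
{
	if (p->data == NULL || (uint64_t)offset + len > p->size)
		return NULL;
	return (uint8_t *)p->data + offset;
}
```

In BPF the same division of labour applies: the program never sees the raw size field, and a NULL return from the data accessor forces it to handle the out-of-bounds case before the verifier lets it dereference anything.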

USDT: 5.19

Static tracepoints in user space, also known as USDT (User Statically-Defined Tracing), mark specific locations of interest in an application. A tracer can attach at these points to inspect code execution and data. They are explicitly defined by developers in the source code and are typically enabled at compile time with flags like "--enable-trace". The advantage of static tracepoints is their stability: developers usually maintain a stable static tracing ABI, so tracing tools keep working across different application versions. This is useful, for instance, when a PostgreSQL installation is upgraded and performance degrades afterwards.

BPF panic: 6.1

One of the key selling points of the BPF subsystem is that loading BPF programs is safe: the BPF verifier ensures that a program will not harm the kernel before allowing it to be loaded. As more features are provided to BPF programs, this assurance may lose some strength, but even then, seeing Artem Savkov's proposal to add a BPF helper explicitly designed to crash the system may come as a bit of a surprise. If this patchset is merged in a form similar to its current state, it could herald a new era where, at least in certain cases, BPF programs are allowed to deliberately cause disruption.

As Savkov pointed out, one of the primary use cases of BPF is kernel debugging, a task that is often helped by having a timely crash dump available. By making the kernel's panic() function available to BPF programs, Savkov aims to combine the two: a BPF program can crash the system as soon as it detects the condition a developer is looking for, producing a crash dump at exactly that moment. Savkov does not seem to be the only one interested in this capability; Jiri Olsa noted receiving requests for such a feature as well.

BPF Memory Allocator, Linked Lists: 6.1

This series introduced BTF types for user-defined BPF objects in programs, enabling BPF programs to allocate their own objects, establish their object hierarchies, and flexibly build their data structures using basic components provided by the BPF runtime.

Next, support for single-ownership BPF linked lists was introduced. A list can be placed in a BPF map or in an allocated object, and it uses allocated objects as its elements. It operates as an intrusive collection. The eventual aim is to allow an allocated object to be part of multiple data structures.

The ultimate goal of this patch and future patches is to allow people to do limited kernel-like programming in BPF C and permit programmers to flexibly create their complex data structures from basic building blocks.

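The key difference from ordinary kernel code is that such programs are verified to be safe, preserve the runtime integrity of the system, and are proven to be bug-free as far as the invariants of the BPF-specific APIs are concerned. The main features include:

  • Allocated objects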
  • bpf_obj_new, bpf_obj_drop for object allocation and release
  • Single ownership BPF linked lists
  • Support for them in BPF maps
  • Support for them in allocated objects
  • Global spinlocks
  • Spinlocks in allocated objects
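The terms "intrusive" and "single ownership" used above can be illustrated with a short user-space sketch. The names here (toy_list_node, toy_item and so on) are hypothetical and this is not the BPF runtime's implementation; in BPF the corresponding operations are kfuncs such as bpf_list_push_front() and bpf_list_pop_front(), and ownership is enforced by the verifier rather than by a runtime flag.

```c
#include <assert.h>
#include <stddef.h>

/* The link lives inside the element itself: that is what "intrusive" means. */
struct toy_list_node {
	struct toy_list_node *next;
	int on_list; /* models single ownership: one list at a time */
};

struct toy_list_head {
	struct toy_list_node *first;
};

/* A user-defined object that embeds a list node. */
struct toy_item {
	int data;
	struct toy_list_node node;
};

/* Recover the enclosing object from a pointer to its embedded node. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Push a node; fails if the node is already owned by a list. */
static int toy_list_push_front(struct toy_list_head *h, struct toy_list_node *n)
{
	assert(h != NULL && n != NULL);
	if (n->on_list)
		return -1;
	n->next = h->first;
	n->on_list = 1;
	h->first = n;
	return 0;
}

/* Pop the first node and hand ownership back to the caller. */
static struct toy_list_node *toy_list_pop_front(struct toy_list_head *h)
{
	struct toy_list_node *n = h->first;

	if (n != NULL) {
		h->first = n->next;
		n->next = NULL;
		n->on_list = 0;
	}
	return n;
}
```

Because the node is embedded, pushing and popping never allocates memory; allocation and release of the element itself is a separate step, which in BPF is what bpf_obj_new() and bpf_obj_drop() are for.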

User Ring Buffer: 6.1

This patch set defines a new map type named BPF_MAP_TYPE_USER_RINGBUF, which provides single user-space producer / single kernel consumer semantics over a ring buffer. In addition to the new map type, it introduces a helper function named bpf_user_ringbuf_drain(), which lets a BPF program specify a callback to which samples are delivered. The callback has the following signature:

void (struct bpf_dynptr *dynptr, void *context)

Subsequently, programs can securely read samples from dynptr using the helper functions bpf_dynptr_read() or bpf_dynptr_data(). Currently, there are no available helper functions to determine the size of samples, but adding one if necessary would be straightforward.

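libbpf has also added corresponding user-space APIs. The listing below uses the names as they appear in this patch set; in the merged libbpf the functions carry the user_ring_buffer__ prefix (for example user_ring_buffer__reserve() and user_ring_buffer__submit()):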

struct ring_buffer_user *
ring_buffer_user__new(int map_fd,
                      const struct ring_buffer_user_opts *opts);
void ring_buffer_user__free(struct ring_buffer_user *rb);
void *ring_buffer_user__reserve(struct ring_buffer_user *rb,
        uint32_t size);
void *ring_buffer_user__poll(struct ring_buffer_user *rb, uint32_t size,
           int timeout_ms);
void ring_buffer_user__discard(struct ring_buffer_user *rb, void *sample);
void ring_buffer_user__submit(struct ring_buffer_user *rb, void *sample);
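The producer/consumer split described in this section can be sketched as a single-threaded user-space model. Everything here is hypothetical (toy_user_ringbuf, fixed-size slots, the callback type), and the model deliberately leaves out the memory barriers, variable-length samples and dynptr plumbing that a real implementation needs. It only shows the division of roles: the user-space side reserves and submits, the kernel side drains through a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RB_SLOTS     4u  /* illustrative capacity */
#define TOY_RB_SLOT_SIZE 16u /* fixed-size samples keep the model small */

struct toy_user_ringbuf {
	uint8_t  slots[TOY_RB_SLOTS][TOY_RB_SLOT_SIZE];
	uint32_t producer_pos; /* advanced only by the user-space producer */
	uint32_t consumer_pos; /* advanced only by the kernel-side consumer */
};

/* Callback invoked once per sample; a non-zero return stops the drain. */
typedef long (*toy_sample_fn)(const void *sample, uint32_t size, void *ctx);

/* Producer: get a slot to fill, or NULL if the ring is full. */
static void *toy_rb_reserve(struct toy_user_ringbuf *rb)
{
	if (rb->producer_pos - rb->consumer_pos == TOY_RB_SLOTS)
		return NULL;
	return rb->slots[rb->producer_pos % TOY_RB_SLOTS];
}

/* Producer: publish the most recently reserved slot. */
static void toy_rb_submit(struct toy_user_ringbuf *rb)
{
	rb->producer_pos++;
}

/* Consumer: hand every published sample to fn; returns samples consumed. */
static long toy_rb_drain(struct toy_user_ringbuf *rb, toy_sample_fn fn,
			 void *ctx)
{
	long drained = 0;

	assert(fn != NULL);
	while (rb->consumer_pos != rb->producer_pos) {
		const void *s = rb->slots[rb->consumer_pos % TOY_RB_SLOTS];

		rb->consumer_pos++;
		drained++;
		if (fn(s, TOY_RB_SLOT_SIZE, ctx) != 0)
			break;
	}
	return drained;
}

/* Example callback: accumulate the first byte of each sample. */
static long toy_sum_first_byte(const void *sample, uint32_t size, void *ctx)
{
	(void)size;
	*(long *)ctx += *(const uint8_t *)sample;
	return 0;
}
```

Keeping each position counter writable by only one side is what makes a single-producer/single-consumer ring workable without a lock shared between user space and the kernel.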

User-defined linked list support: 6.2

BTF types for user-defined objects let BPF programs allocate their own objects, build their own object hierarchies, and flexibly create data structures from the basic building blocks provided by the BPF runtime. Here is a snippet of example code:

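/* Element type stored in the list; the embedded node links it in. */
struct foo {
	struct bpf_list_node node;
	int data;
};

/* Map value holding the list head and the spin lock that protects it.
 * __contains(foo, node) tells the verifier that the list holds struct foo
 * objects linked through their "node" member. */
struct map_value {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(foo, node);
};

struct array_map {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__type(key, int);
	__type(value, struct map_value);
	__uint(max_entries, 1);
} array_map SEC(".maps");

/* list_push_pop() is a test routine that is not shown in this snippet. */
static __always_inline
int test_list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head)
{
	int ret;

	ret = list_push_pop(lock, head, false);
	if (ret)
		return ret;
	return list_push_pop(lock, head, true);
}

SEC("tc")
int map_list_push_pop(void *ctx)
{
	struct map_value *v;

	v = bpf_map_lookup_elem(&array_map, &(int){0});
	if (!v) /* the verifier requires a NULL check before dereferencing */
		return 1;
	return test_list_push_pop(&v->lock, &v->head);
}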

User-defined rbtree support: 6.3

Similar to linked lists, support for red-black trees (rbtree) has been added.

BPF Generic Iterator: 6.4

BPF (Berkeley Packet Filter) programs loaded into the kernel are typically written in C, but as BPF evolves, the environment those programs run in becomes increasingly different from an ordinary C environment. The BPF virtual machine and its verifier perform many checks to ensure that BPF code executes safely. To make iteration easier to handle, BPF developer Andrii Nakryiko submitted a patch set introducing an iterator mechanism built from functions such as "start iteration", "next item", and "end iteration". These functions must be written in real C inside the kernel and must be specially marked. There are also rules to follow when programming against them, such as limits on the number of iterations a loop driven by the iterator may run.
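The "start iteration", "next item", "end iteration" pattern can be modelled in user space as follows. This is a sketch under assumptions: the toy_iter_num_* names, the range limit and the error codes are hypothetical, loosely modelled on the numeric iterator kfuncs (bpf_iter_num_new(), bpf_iter_num_next(), bpf_iter_num_destroy()) rather than copied from them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_ITER_MAX_RANGE (1 << 23) /* assumed cap on the iteration count */

struct toy_iter_num {
	int next;  /* next value to hand out */
	int end;   /* exclusive upper bound */
	int value; /* storage for the value returned by _next() */
};

/* "Start iteration": validate the range [start, end) and set up state. */
static int toy_iter_num_new(struct toy_iter_num *it, int start, int end)
{
	assert(it != NULL);
	if (start > end)
		return -EINVAL;
	if ((long long)end - start > TOY_ITER_MAX_RANGE)
		return -E2BIG;
	it->next = start;
	it->end = end;
	it->value = 0;
	return 0;
}

/* "Next item": pointer to the next value, or NULL when exhausted. */
static int *toy_iter_num_next(struct toy_iter_num *it)
{
	if (it->next >= it->end)
		return NULL;
	it->value = it->next++;
	return &it->value;
}

/* "End iteration": release the iterator; further _next() calls yield NULL. */
static void toy_iter_num_destroy(struct toy_iter_num *it)
{
	it->next = it->end;
}
```

A loop then takes the shape "new, then call next until it returns NULL, then destroy". The NULL return is what gives the verifier a well-defined termination condition to reason about.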
