Read the article at: https://www.ebpf.top/post/en/bpf_rawtracepoint

1. Common Hook Types in eBPF Trace

In the tracing domain, eBPF can attach to the following categories of events:

  • Kernel static trace points tracepoint/rawtracepoint/btf-tracepoint
    • Refer to /sys/kernel/tracing/available_events
  • Kernel dynamic trace points k[ret]probe, fentry/fexit (based on BTF)
    • Kprobe /sys/kernel/tracing/available_filter_functions
  • User-space static trace points USDT
    • List them with readelf -n <binary> or with bpftrace: bpftrace -l 'usdt:/home/dave/ebpf/linux-tracing/usdt/main:*'
  • User-space dynamic trace points u[ret]probe; candidate symbols can be found with, for example, nm hello | grep main
  • Performance monitoring counters PMC
  • perf_event

This article will focus on rawtracepoint within kernel static tracing, concluding with practical code examples using the libbpf development library and bpftrace.

2. BPF Rawtracepoint

Raw tracepoints were introduced in Linux kernel 4.17 by eBPF author Alexei Starovoitov. In contrast to tracepoint, rawtracepoint exposes the tracepoint's original arguments directly, which avoids the overhead of assembling the stable tracepoint record and therefore gives better performance than tracepoint. The cost is stability: because the raw arguments are handed to the program as-is, rawtracepoint behaves like dynamic tracing in this respect and is considered an unstable interface. It is still more stable than kprobe, since the names and arguments of tracepoints change relatively rarely. The implementation can be found in the commit "bpf: introduce BPF_RAW_TRACEPOINT". The benchmarks submitted with that commit show improvements over both kprobe and tracepoint, which makes rawtracepoint suitable for long-term monitoring of frequently called functions such as system calls. The Tracee security product, for example, monitors system calls using the rawtracepoint approach.

2.1 Trace Performance Enhanced by 20%

The table below shows original performance data at the time of the author’s submission:

tracepoint    base  kprobe+bpf tracepoint+bpf raw_tracepoint+bpf
task_rename   1.1M   769K        947K            1.0M
urandom_read  789K   697K        750K            755K
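
To put the table in relative terms, the short helper below computes the percentage gain of one column over another. It is only a sketch of the arithmetic; gain_percent is a name made up for this illustration, and because the table values are rounded, the results are approximate.

```c
#include <assert.h>

/* Relative gain of `candidate` over `baseline`, in percent.
 * Both inputs are throughput figures taken from the same table row.
 * (gain_percent is a hypothetical helper used only for this illustration.) */
double gain_percent(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate / baseline - 1.0) * 100.0;
}
```

For the task_rename row, gain_percent(769e3, 1.0e6) gives roughly 30% for raw_tracepoint+bpf over kprobe+bpf, and gain_percent(947e3, 1.0e6) gives roughly 5.6% over tracepoint+bpf. For urandom_read the gaps are smaller, so the benefit clearly depends on the workload.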

The following data is based on running the official bench tool from the kernel code and plotting the results (requires kernel code compilation beforehand), with the y-axis representing instructions per second:

Figure: Performance comparison of Linux tracing methods

To run the performance benchmark:

$ cd tools/testing/selftests/bpf
$ ./benchs/run_bench_trigger.sh

2.2 Listing and Counting Rawtracepoint Events

bpftrace has supported rawtracepoint since version 0.19. The probe type can be abbreviated as rt, and the arguments are accessed as arg0, arg1, and so on. To view the complete list of available rawtracepoints, use bpftrace -l:

$ sudo bpftrace -l "rawtracepoint:*"

On an Ubuntu 22.04 system (kernel version 6.2), this lists about 1480 rawtracepoints:

$ sudo bpftrace -l "rawtracepoint:*" | wc -l
1480

$ sudo bpftrace -l "tracepoint:*" | wc -l
2124

Note that the same system reports 2124 tracepoint events. What causes this discrepancy?

The answer lies in how bpftrace builds the rawtracepoint list. According to its source code, bpftrace reads all tracepoints from the /sys/kernel/debug/tracing/available_events file and filters out those starting with syscalls:sys_enter_ or syscalls:sys_exit_. The filtering is needed because the per-syscall events are covered by two special cases:

  • sys_enter represents all sys_enter_xxx events under the syscalls category: SEC("raw_tracepoint/sys_enter")
  • sys_exit represents all sys_exit_xxx events under the syscalls category: SEC("raw_tracepoint/sys_exit")

In this way, you can monitor all system call events using sys_enter and sys_exit events.

You can find the events that rawtracepoint can monitor by examining the contents of the /sys/kernel/debug/tracing/available_events file. The format of each line in the file is:

# <category>:<name>
skb:kfree_skb

However, in rawtracepoint, only the value of <name> is used, not the entire <category>:<name>. For example:

$ sudo bpftrace -e 'rawtracepoint:kfree_skb { printf("%s\n", comm); }'
Attaching 1 probe...
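
The naming rules described above (drop the category, and collapse the per-syscall events into sys_enter and sys_exit) can be sketched in plain C as follows. This is only an illustration of the mapping, not bpftrace's actual implementation; raw_tp_name is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>

/* Map one "<category>:<name>" line from available_events to the name a raw
 * tracepoint program would attach to. Per-syscall events collapse into the
 * two generic events sys_enter and sys_exit. Returns 0 on success, or -1 if
 * the line has no ':' separator or the result does not fit in `out`.
 * (raw_tp_name is a hypothetical helper, not a bpftrace or libbpf API.) */
int raw_tp_name(const char *line, char *out, size_t out_len)
{
    static const char enter_prefix[] = "syscalls:sys_enter_";
    static const char exit_prefix[]  = "syscalls:sys_exit_";
    const char *name;
    size_t len;

    assert(line != NULL && out != NULL);

    if (strncmp(line, enter_prefix, sizeof(enter_prefix) - 1) == 0) {
        name = "sys_enter";
    } else if (strncmp(line, exit_prefix, sizeof(exit_prefix) - 1) == 0) {
        name = "sys_exit";
    } else {
        const char *colon = strchr(line, ':');
        if (colon == NULL)
            return -1;
        name = colon + 1;   /* keep only the <name> part */
    }

    /* Copy up to (not including) a trailing newline, if there is one. */
    len = strcspn(name, "\n");
    if (len + 1 > out_len)
        return -1;
    memcpy(out, name, len);
    out[len] = '\0';
    return 0;
}
```

With this mapping, "skb:kfree_skb" becomes kfree_skb, while "syscalls:sys_enter_openat" and every other per-syscall entry event become sys_enter.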

2.3 Changes in Parameter Passing

From the perspective of a BPF program, the parameter definition and access for the rawtracepoint method are as follows. We will provide a complete sample program later.

struct bpf_raw_tracepoint_args {
       __u64 args[0];
};

int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
  // The program can read args[N], where N depends on the tracepoint
  // and is statically verified at program load+attach time
  return 0;
}

All arguments are passed in through a single array. Taking the task_rename tracepoint, which is fired inside the __set_task_comm function, as an example, we will compare the arguments seen by tracepoint and rawtracepoint. The function is declared in the kernel as follows:

void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec);
// Inside, it calls trace_task_rename(tsk, buf); in rawtracepoint mode these
// original arguments are exposed directly as args[0] and args[1]
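
To make the "array of 64-bit slots" idea concrete, here is a small userspace mock of the convention. It is not kernel code: mock_raw_tp_args, mock_task, and the pack/unpack helpers are names invented for this sketch, and it assumes the tracepoint passes a task pointer and a string pointer in that order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct bpf_raw_tracepoint_args: a bare array of
 * 64-bit slots, one per tracepoint argument. (Hypothetical, demo only.) */
struct mock_raw_tp_args {
    uint64_t args[2];
};

/* Hypothetical stand-in for the kernel's task_struct. */
struct mock_task {
    int pid;
    char comm[16];
};

/* Producer side: every argument is widened to 64 bits and stored in
 * declaration order, with no type information attached. */
struct mock_raw_tp_args pack_task_rename(struct mock_task *tsk, const char *buf)
{
    struct mock_raw_tp_args ctx = {
        .args = { (uint64_t)(uintptr_t)tsk, (uint64_t)(uintptr_t)buf },
    };

    assert(tsk != NULL && buf != NULL);
    return ctx;
}

/* Consumer side: the program must know what each slot holds and cast it
 * back itself, which is what a raw tracepoint program does with ctx->args[N]. */
int unpack_pid(const struct mock_raw_tp_args *ctx)
{
    const struct mock_task *tsk =
        (const struct mock_task *)(uintptr_t)ctx->args[0];
    return tsk->pid;
}

const char *unpack_new_comm(const struct mock_raw_tp_args *ctx)
{
    return (const char *)(uintptr_t)ctx->args[1];
}
```

The point of the mock is that nothing in the array records the argument types. If the tracepoint's argument list changes between kernel versions, the casts silently become wrong, which is why rawtracepoint is considered less stable than tracepoint.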

If no task_rename events are occurring on the system, we can compile the following program to trigger one manually and verify the probe:

// gcc -o rename test_rename.c
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define MAX_CNT 1

static void test_task_rename(int cpu)
{
	char buf[] = "test\n";
	int i, fd;

	fd = open("/proc/self/comm", O_WRONLY|O_TRUNC);
	if (fd < 0) {
		printf("couldn't open /proc\n");
		exit(1);
	}
	for (i = 0; i < MAX_CNT; i++) {
		if (write(fd, buf, sizeof(buf)) < 0) {
			printf("task rename failed: %s\n", strerror(errno));
			close(fd);
			return;
		}
	}
	close(fd);
}

int main()
{
	test_task_rename(0);
	return 0;
}
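
As an alternative trigger, a thread can also rename itself with prctl(PR_SET_NAME), which should reach the same comm-setting path in the kernel as writing to /proc/self/comm. The sketch below is an assumption-labeled variant, not part of the original test program; rename_self is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

/* Rename the calling thread and read the name back.
 * `readback` must hold at least 16 bytes (the kernel's comm length).
 * Returns 0 on success, -1 if either prctl call fails.
 * (rename_self is a hypothetical helper used only for this illustration.) */
int rename_self(const char *new_name, char *readback, size_t len)
{
    assert(new_name != NULL && readback != NULL);
    assert(len >= 16);
    assert(strlen(new_name) < 16);   /* longer names are silently truncated */

    if (prctl(PR_SET_NAME, (unsigned long)new_name, 0UL, 0UL, 0UL) != 0)
        return -1;
    if (prctl(PR_GET_NAME, (unsigned long)readback, 0UL, 0UL, 0UL) != 0)
        return -1;
    return 0;
}
```

Calling rename_self("rt_demo", buf, sizeof(buf)) from a small main() while a task_rename probe is attached should produce one event, with "rt_demo" as the new comm.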

3. Example of Using rawtracepoint in BPF Programs

3.1 libbpf Library (Based on CO-RE)

The corresponding tracepoint for task_rename in the system is tracepoint:task:task_rename, and the format definition for the tracepoint is as follows:

$ cat /sys/kernel/debug/tracing/events/task/task_rename/format
name: task_rename
ID: 131
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

  # Parameters start
	field:pid_t pid;	offset:8;	size:4;	signed:1;
	field:char oldcomm[16];	offset:12;	size:16;	signed:0;
	field:char newcomm[16];	offset:28;	size:16;	signed:0;
	field:short oom_score_adj;	offset:44;	size:2;	signed:1;

print fmt: "pid=%d oldcomm=%s newcomm=%s oom_score_adj=%hd", REC->pid, REC->oldcomm, REC->newcomm, REC->oom_score_adj

In a libbpf program, you can use the matching structures from vmlinux.h to access these fields, as shown below:

/* from: vmlinux.h
struct trace_entry {
        short unsigned int type;
        unsigned char flags;
        unsigned char preempt_count;
        int pid;
};

struct trace_event_raw_task_rename {
        struct trace_entry ent;
        pid_t pid;
        char oldcomm[16];
        char newcomm[16];
        short int oom_score_adj;
        char __data[0];
};
*/

SEC("tracepoint/task/task_rename")
int prog(struct trace_event_raw_task_rename *ctx)
{
    bpf_printk("task_rename -> pid %d, oldcomm %s, newcomm %s, oom %d",
               ctx->pid,
               ctx->oldcomm,
               ctx->newcomm,
               ctx->oom_score_adj);
    return 0;
}
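
The structure from vmlinux.h is meant to mirror the offsets listed in the format file. The userspace sketch below re-declares both structures locally and checks the field offsets against the values shown earlier (8, 12, 28, 44). It assumes a typical Linux ABI such as x86-64, and check_task_rename_layout is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Local copies of the two vmlinux.h structures, for a userspace check only. */
struct trace_entry {
    unsigned short type;
    unsigned char flags;
    unsigned char preempt_count;
    int pid;
};

struct trace_event_raw_task_rename {
    struct trace_entry ent;
    pid_t pid;
    char oldcomm[16];
    char newcomm[16];
    short oom_score_adj;
    char __data[];
};

/* Compare the compiler's layout with the offsets from the format file.
 * Aborts via assert() on the first mismatch.
 * (check_task_rename_layout is a hypothetical helper for this illustration.) */
void check_task_rename_layout(void)
{
    assert(sizeof(struct trace_entry) == 8);   /* the common_* fields */
    assert(offsetof(struct trace_event_raw_task_rename, pid) == 8);
    assert(offsetof(struct trace_event_raw_task_rename, oldcomm) == 12);
    assert(offsetof(struct trace_event_raw_task_rename, newcomm) == 28);
    assert(offsetof(struct trace_event_raw_task_rename, oom_score_adj) == 44);
}
```

This layout is what makes the tracepoint context stable: the program reads named fields at fixed offsets rather than casting raw arguments.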

With the rawtracepoint method, the arguments that __set_task_comm passes to the tracepoint are placed into the bpf_raw_tracepoint_args structure in order: args[0] is struct task_struct *tsk, and args[1] is const char *buf, the new comm name.

The parameter structure of bpf_raw_tracepoint_args is as follows:

struct bpf_raw_tracepoint_args {
	__u64 args[0];
};

The raw_tracepoint version of the program is shown below:

SEC("raw_tracepoint/task_rename")
int rt_prog(struct bpf_raw_tracepoint_args *ctx)
{
    // void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec);
    struct task_struct *tsk = (struct task_struct *) ctx->args[0];
    u32 pid;
    short oom_score_adj; // signed, matching the tracepoint's "short oom_score_adj"
    char old_name[TASK_COMM_LEN] = {};
    char new_name[TASK_COMM_LEN] = {};

    pid = BPF_CORE_READ(tsk, pid);
    // BPF_CORE_READ_INTO(&old_name, tsk, comm);
    bpf_core_read(&old_name, sizeof(old_name), &tsk->comm);
    // args[1] is a NUL-terminated kernel string, so use the string variant
    bpf_probe_read_kernel_str(&new_name, sizeof(new_name), (void *)ctx->args[1]);
    oom_score_adj = BPF_CORE_READ(tsk, signal, oom_score_adj);

    bpf_printk("task_rename:rt -> pid %d, oldcomm %s, newcomm %s, oom %d",
                pid,
                old_name,
                new_name,
                oom_score_adj);
    return 0;
}

3.2 Bpftrace Sample Code

bpftrace has supported rawtracepoints since version 0.19. The probe type can be abbreviated as rt, the arguments are accessed as arg0, arg1, and so on, and bpftrace -l lists the available probes.

Tracing with tracepoint:task:task_rename in bpftrace:

$ sudo bpftrace -e 'tracepoint:task:task_rename
{
    printf("enter t:task:task_rename %s, pid %d, oldcomm %s, newcomm %s, oom 0x%x\n",
        comm,
        args->pid,
        args->oldcomm,
        args->newcomm,
        args->oom_score_adj);
}'

Tracing with rawtracepoint:task_rename in bpftrace:

# rename.bt
rawtracepoint:task_rename
{
    $task = (struct task_struct *)arg0;
    $pid = $task->pid;
    $oom_score_adj = $task->signal->oom_score_adj;

    printf("enter rt:task:task_rename %s, pid %d, oldcomm %s, newcomm %s, oom 0x%x\n",
        comm,
        $pid,
        $task->comm,
        str(arg1),
        $oom_score_adj);
}
