This article can be found at: https://www.ebpf.top/post/ftrace_kernel_dynamic

In the Geek Time course “Mastering Container Battles” by Li Chengyuan, bonus session 04, “Understanding ftrace (2): How to Understand the Technology Behind ftrace Tracepoint and kprobe?”, ends with a question for the reader:

Think about it: after we register a kprobe on a kernel function, how can we observe that the first instruction of that function has been replaced?

Kprobe is a mechanism for dynamically tracing kernel functions. With it, almost every kernel function can be traced (the exceptions are functions annotated with __kprobes or nokprobe_inline and those marked with NOKPROBE_SYMBOL). There are currently two main ways in which kprobe is implemented:

  • In the general case, when a kprobe is registered, the first byte of the instruction at the target address in kernel code is replaced with 0xcc, the int3 instruction. When execution reaches this instruction, a breakpoint exception is raised and the kernel enters the Linux int3 exception handler, do_int3(). The handler checks whether a kprobe is registered for that address and, if so, runs the pre_handler(), the original (replaced) instruction, and the post_handler() in that order.

    kprobe_arch.png

  • When the kernel is built with ftrace function tracing, 5 bytes at the beginning of each function are reserved for callq <__fentry__> (and are replaced with a nop during startup). If the probe is placed at such a function entry, kprobe reuses the 5 bytes reserved by ftrace and replaces them with a call to ftrace_caller, which avoids the int3 software-interrupt replacement altogether.

Regardless of the method used above, the basic principle of kprobe implementation is the same: replace the target instruction with another one that makes the program jump to a specific handler and then execute the registered probe functions.

In this article, I will verify the entire dynamic replacement mechanism based on the ftrace mechanism. If you are unfamiliar with ftrace, I recommend reading Essential Knowledge of Linux Native Tracing Tool Ftrace in advance.

1. Basic Knowledge

1.1 Default Compilation

Let’s implement a very simple program in C for a basic verification:

#include <stdio.h>
#include <stdlib.h>

int a() {
    return 0;
}

int main(int argc, char ** argv){
    return 0;
}

After compiling with default options, the disassembly looks as follows; nothing special is inserted at the function entry.

$ gcc -o hello hello.c 
$ objdump -S hello
...
0000000000001129 <a>:
    1129:	f3 0f 1e fa          	endbr64
    112d:	55                   	push   %rbp
    112e:	48 89 e5             	mov    %rsp,%rbp
    1131:	b8 00 00 00 00       	mov    $0x0,%eax
    1136:	5d                   	pop    %rbp
    1137:	c3                   	ret
...

1.2 Using the -pg Option

When the program is compiled with the -pg option, a call to the mcount function is added at the function entry, a mechanism commonly used for profiling running programs:

$ gcc -pg -o hello.pg hello.c 
$ objdump -S hello.pg
...
00000000000011e9 <a>:
    11e9:	f3 0f 1e fa          	endbr64
    11ed:	55                   	push   %rbp
    11ee:	48 89 e5             	mov    %rsp,%rbp
    11f1:	ff 15 f1 2d 00 00    	call   *0x2df1(%rip)        # 3fe8 <mcount@GLIBC_2.2.5>
    11f7:	b8 00 00 00 00       	mov    $0x0,%eax
    11fc:	5d                   	pop    %rbp
    11fd:	c3                   	ret
...

With the -pg option, gcc inserts a call to mcount (or to __fentry__ when -mfentry is also given) at the entry of every function, except for functions that carry the no_instrument_function attribute. The kernel wraps this attribute in its notrace macro: #define notrace __attribute__((no_instrument_function))
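As a small illustration of the attribute mentioned above, the following sketch defines one instrumented and one non-instrumented function. The function names traced_add and untraced_add are hypothetical and chosen only for this example; the macro definition is the one quoted from the kernel.

```c
/* Same definition as the kernel's notrace macro quoted above */
#define notrace __attribute__((no_instrument_function))

/* Built with -pg, this function gets a call to mcount/__fentry__ at entry */
int traced_add(int a, int b)
{
    return a + b;
}

/* The attribute asks gcc not to insert the profiling call here */
notrace int untraced_add(int a, int b)
{
    return a + b;
}
```

Compiling this file with gcc -pg -c and disassembling it with objdump -d should show the profiling call in traced_add only.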

1.3 Using the -pg and -mfentry Options

Starting with version 4.6, gcc provides the -mfentry option, which replaces mcount with a more concise and efficient call to __fentry__ placed at the very beginning of the function. This is the default approach on x86 as of Linux kernel 4.19.

# echo 'void foo(){}' | gcc -x c -S -o - - -pg -mfentry
$ gcc -pg -mfentry -o hello.pg.entry hello.c
$ objdump -S hello.pg.entry
00000000000011e9 <a>:
    11e9:	f3 0f 1e fa          	endbr64
    11ed:	ff 15 05 2e 00 00    	call   *0x2e05(%rip)        # 3ff8 <__fentry__@GLIBC_2.13>
    11f3:	55                   	push   %rbp
    11f4:	48 89 e5             	mov    %rsp,%rbp
    11f7:	b8 00 00 00 00       	mov    $0x0,%eax
    11fc:	5d                   	pop    %rbp
    11fd:	c3                   	ret

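Taking __fentry__ as an example: in the kernel image the call is a direct call (opcode e8 followed by a 32-bit offset) and occupies 5 bytes, whereas the userspace listing above uses a 6-byte indirect call through the GOT (ff 15 ...). In the Linux kernel, __fentry__ is defined as a bare retq that returns immediately.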

SYM_FUNC_START(__fentry__)
        retq
SYM_FUNC_END(__fentry__)

Even though __fentry__ returns immediately through retq, calling it from every function still incurs roughly 13% performance overhead. In practice, therefore, ftrace replaces the 5-byte call __fentry__ at each function entry (e8 followed by a 32-bit offset) with a 5-byte nop when the kernel starts. On x86_64, this nop is encoded as 0f 1f 44 00 00.

When dynamic ftrace is enabled (CONFIG_DYNAMIC_FTRACE) and a tracing function is set, the kernel dynamically replaces the nop with a call to ftrace_caller, which is how dynamic tracing is achieved. To keep other CPUs from executing a half-written instruction during the replacement, the first byte is first replaced with the 0xcc breakpoint (int3) instruction, then the remaining bytes are written, and finally the first byte itself is replaced. For the implementation, refer to void ftrace_replace_code(int enable).
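The replacement order described above (breakpoint first, remaining bytes afterwards) can be illustrated in user space on an ordinary byte buffer. The sketch below is only a simulation under stated assumptions: patch_with_int3 and the steps array are hypothetical names invented for this example, the final step of writing the first byte last reflects my understanding of the kernel's procedure, and the real kernel additionally synchronizes all CPUs between steps and handles a breakpoint that fires while patching is in progress.

```c
#include <stdint.h>
#include <string.h>

#define INSN_LEN    5
#define INT3_OPCODE 0xcc

/*
 * Simulated patching of a 5-byte instruction held in an ordinary buffer.
 * steps[i] receives a copy of the buffer after step i+1, so the
 * intermediate states can be inspected afterwards.
 */
void patch_with_int3(uint8_t code[INSN_LEN],
                     const uint8_t new_insn[INSN_LEN],
                     uint8_t steps[3][INSN_LEN])
{
    /* Step 1: breakpoint on the first byte; the old tail is still intact. */
    code[0] = INT3_OPCODE;
    memcpy(steps[0], code, INSN_LEN);

    /* Step 2: write the tail of the new instruction behind the breakpoint. */
    memcpy(code + 1, new_insn + 1, INSN_LEN - 1);
    memcpy(steps[1], code, INSN_LEN);

    /* Step 3: replace the breakpoint with the new first byte. */
    code[0] = new_insn[0];
    memcpy(steps[2], code, INSN_LEN);
}
```

Patching the nop 0f 1f 44 00 00 into the call e8 7b 5a d9 20 this way passes through the states cc 1f 44 00 00 and cc 7b 5a d9 20, so at no point does the buffer start with a valid opcode followed by a mixture of old and new bytes.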

1.4 Kernel Verification

Taking the kernel function schedule as an example: if we load a vmlinux file that contains symbol information into gdb, we can view the compiled assembly of the function directly:

schedule_fentry

The __fentry__ function is defined as the retq instruction:

fentry

Explanation of the call assembly instruction:

0xffffffff81c33580 <+0>: e8 1b 41 44 ff call 0xffffffff810776a0 <__fentry__>

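e8 is the opcode of a near call, and 1b 41 44 ff is the little-endian 32-bit offset relative to the next instruction (0xffffffff81c33580 + 5 = 0xffffffff81c33585). Read as a 32-bit value, the offset is 0xff44411b, which is negative; its magnitude in two's complement is 0xbbbee5, and indeed 0xffffffff810776a0 - 0xffffffff81c33585 = -0xbbbee5.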
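The same arithmetic can be written as a small user-space helper. This is a sketch for illustration only; call_target is a hypothetical name, not a kernel or library function.

```c
#include <stdint.h>

/*
 * Decode the target of an x86-64 near call (opcode e8 + rel32).
 * addr is the address of the call instruction, insn holds its 5 bytes.
 * Returns 0 if the bytes do not start with e8.
 */
uint64_t call_target(uint64_t addr, const uint8_t insn[5])
{
    uint32_t raw;
    int32_t rel;

    if (insn[0] != 0xe8)
        return 0;

    /* The 32-bit offset is stored little-endian after the opcode */
    raw = (uint32_t)insn[1] |
          ((uint32_t)insn[2] << 8) |
          ((uint32_t)insn[3] << 16) |
          ((uint32_t)insn[4] << 24);

    /* Reinterpret as signed so that backward calls work */
    rel = (int32_t)raw;

    /* The offset is relative to the next instruction (addr + 5) */
    return addr + 5 + (uint64_t)(int64_t)rel;
}
```

For the instruction shown above, call_target(0xffffffff81c33580, {e8 1b 41 44 ff}) evaluates to 0xffffffff810776a0, the address of __fentry__.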

2. Validation of kprobe tracing mechanism in ftrace

Here, we plan to verify 3 things:

  1. After the kernel starts, the call instruction at the beginning of a function will be replaced with a nop instruction.
  2. When setting up kprobe function tracing under ftrace, the nop instruction will be replaced with the corresponding call.
  3. After kprobe tracing stops, the first 5 bytes of the function are restored to nop instructions. (Once 1 and 2 are verified, this is easy to validate.)

To verify the dynamic replacement of kernel function entries, my first idea was to print the first 5 bytes at the function's address from a kernel module.

3. Validation using a Kernel Module

3.1 Using kallsyms_lookup_name

The most common or popular approach is to use the kernel function kallsyms_lookup_name() in a kernel module to obtain the address of the traced function and then print it.

Initially, I also wanted to try this approach. Other methods to obtain kernel symbol addresses can be found in Obtaining Kernel Symbol Addresses. The sample code for the kernel module can be found in hello_kernel_module, and the code is very simple:

static int __init hello_init(void)
{
    char *func_addr = (char *)kallsyms_lookup_name("schedule");

    /* Check that the address is valid, then print it */
    if (func_addr)
        pr_info("Function address: 0x%lx\n", (unsigned long)func_addr);

    return 0;
}

However, I encountered an error during the compilation phase (local environment running 5.11.22-generic):

ERROR: modpost: "kallsyms_lookup_name" [hello_kernel_module/hello.ko] undefined!

For security reasons, recent kernels (>= 5.7) no longer export kallsyms_lookup_name, so it cannot be called directly from kernel modules. For more information, refer to the article Unexporting kallsyms_lookup_name and the related patch. Some alternative methods are discussed here. To keep things simple, I used the kprobe registration mechanism to obtain the function's address (verified only on Linux 5.11), and the complete code is as follows:

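#include <linux/init.h>
#include <linux/module.h>
#include <linux/kprobes.h>

typedef unsigned long (*kallsyms_lookup_name_t)(const char *name);

static struct kprobe kp = {
    .symbol_name = "kallsyms_lookup_name"
};

static int __init hello_init(void)
{
    kallsyms_lookup_name_t lookup_name;
    const u8 *func_addr;
    int ret;

    /* Register a kprobe only to learn the symbol's address, then drop it */
    ret = register_kprobe(&kp);
    if (ret < 0) {
        pr_err("register_kprobe failed: %d\n", ret);
        return ret;
    }
    lookup_name = (kallsyms_lookup_name_t)kp.addr;
    unregister_kprobe(&kp);

    func_addr = (const u8 *)lookup_name("schedule");
    if (!func_addr)
        return -ENOENT;

    pr_info("Function address: 0x%lx\n", (unsigned long)func_addr);
    /* Print the first 5 bytes of the function on a single line */
    pr_info("0x%02x 0x%02x 0x%02x 0x%02x 0x%02x\n",
            func_addr[0], func_addr[1], func_addr[2],
            func_addr[3], func_addr[4]);

    return 0;
}

/* An exit function is required so that the module can be unloaded */
static void __exit hello_exit(void)
{
}

module_init(hello_init);
module_exit(hello_exit);
/* register_kprobe() is only available to GPL-compatible modules */
MODULE_LICENSE("GPL");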

For the complete code, please check get_inst.c. After compiling and installing, you can check the output by using the dmesg command:

$ sudo insmod ./hello.ko
$ dmesg -T
[Sat Apr  9 12:11:25 2022] Function address: 0xffffffff9eea3eb0
[Sat Apr  9 12:11:25 2022] 0x0f 0x1f 0x44 0x00 0x00

Here we can see that the first 5 bytes of the function are already NOP instructions (0f 1f 44 00 00); this replacement is carried out by ftrace_init() during kernel startup. Likewise, the function entries of any kernel module loaded later are automatically replaced with NOP instructions.

The ftrace pdf illustrates the same state for the schedule function:

ftrace_ex1.png

Figure: Before enabling kprobe tracing, the first 5 bytes of the function are NOP instructions. <Image from ftrace pdf P36>

Next, enable tracing of the kernel function schedule and verify again:

$ T=/sys/kernel/debug/tracing
$ sudo sh -c "echo 'p:schedule schedule' >> $T/kprobe_events"
$ sudo cat $T/kprobe_events
p:kprobes/schedule schedule

$ sudo sh -c "echo 1 > $T/events/kprobes/schedule/enable"
$ sudo insmod ./hello.ko
$ dmesg -T
[Sun Apr 10 20:07:12 2022] 0xe8 0x7b 0x5a 0xd9 0x20
[Sun Apr 10 20:07:12 2022] Function address: 0xffffffff9fa33580

$ sudo sh -c "echo 0 > $T/events/kprobes/schedule/enable"

After enabling tracing for the kernel function schedule, we can see that the first 5 bytes (previously NOP instructions) have been replaced with a call instruction (opcode e8 followed by a 32-bit offset). The overall effect is shown below:

ftrace_ex2.png

Figure: Replacement of the NOP instructions after registering a kprobe on the function. <Image from ftrace pdf P37>
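The byte patterns seen in the dmesg output can be told apart mechanically. The following user-space sketch classifies the first 5 bytes of a function entry; the enum and the function name classify_entry are hypothetical, and the second nop pattern (66 66 66 66 90) is the one that shows up in the qemu session in section 4.

```c
#include <stdint.h>
#include <string.h>

enum entry_kind { ENTRY_NOP, ENTRY_CALL, ENTRY_INT3, ENTRY_OTHER };

/* Classify the first 5 bytes found at a traceable function's entry */
enum entry_kind classify_entry(const uint8_t b[5])
{
    /* The two 5-byte nop encodings that appear in this article */
    static const uint8_t nop5_a[5] = {0x0f, 0x1f, 0x44, 0x00, 0x00};
    static const uint8_t nop5_b[5] = {0x66, 0x66, 0x66, 0x66, 0x90};

    if (memcmp(b, nop5_a, 5) == 0 || memcmp(b, nop5_b, 5) == 0)
        return ENTRY_NOP;   /* tracing is off for this function */
    if (b[0] == 0xe8)
        return ENTRY_CALL;  /* near call: tracing is on */
    if (b[0] == 0xcc)
        return ENTRY_INT3;  /* breakpoint, e.g. a patch in progress */
    return ENTRY_OTHER;
}
```

The bytes 0f 1f 44 00 00 from the first run are classified as ENTRY_NOP and e8 7b 5a d9 20 from the second run as ENTRY_CALL. Decoding the offset of the latter in the same way as in section 1.4 gives 0xffffffff9fa33585 + 0x20d95a7b = 0xffffffffc07c9000, a page-aligned address, which appears consistent with the dynamically built trampoline discussed in section 4.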

3.2 Using Kernel Function Addresses Directly (Pitfalls, Optional)

Is it possible to use the addresses from /boot/System.map directly, without going through the kallsyms_lookup_name function? The answer is yes, but caution is needed because of the Kernel Address Space Layout Randomization (KASLR) mechanism.

KASLR randomizes the addresses of kernel code and data each time the system boots in order to protect the kernel address space from attacks. As a result, the static addresses listed in /boot/System.map are shifted by a random offset at run time. If the address of a symbol in /proc/kallsyms differs from its address in /boot/System.map, KASLR is enabled in the running kernel. Checking this requires root privileges.

$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

$ sudo grep ' schedule$' /boot/System.map-$(uname -r)
ffffffff81c33580 T schedule

$ sudo grep ' schedule$' /proc/kallsyms
ffffffff9fa33580 T schedule

# If KASLR is not enabled in the system, the addresses will be the same; otherwise, they will differ.

If KASLR is enabled, it is essential to pay attention to the address in /proc/kallsyms each time the machine restarts (the address changes on each reboot). To use the address from /boot/System.map, ensure that KASLR is disabled. To disable KASLR, the nokaslr option can be added to the boot command line, and a reboot is required for this change to take effect.
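The relationship between the two address sources can be made concrete with a short user-space sketch. It parses lines in the "address type name" format shown above and computes the difference between the run-time and the static address. The names parse_symbol_line and kaslr_slide are hypothetical helpers written for this example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "address type name" line as found in System.map or
 * /proc/kallsyms. Returns 1 and stores the address in *addr if the
 * line describes exactly the symbol `name`, otherwise returns 0.
 */
int parse_symbol_line(const char *line, const char *name, uint64_t *addr)
{
    unsigned long long value;
    char type;
    char sym[128];

    if (sscanf(line, "%llx %c %127s", &value, &type, sym) != 3)
        return 0;
    if (strcmp(sym, name) != 0)
        return 0;

    *addr = (uint64_t)value;
    return 1;
}

/* Offset applied by KASLR: run-time address minus link-time address */
uint64_t kaslr_slide(uint64_t runtime_addr, uint64_t static_addr)
{
    return runtime_addr - static_addr;
}
```

With the two schedule lines shown above, the slide is 0xffffffff9fa33580 - 0xffffffff81c33580 = 0x1de00000. A slide of 0 means the two sources agree, which is what we expect once nokaslr is in effect.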

$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
$ sudo perl -i -pe 'm/quiet/ and s//quiet nokaslr/' /etc/default/grub
$ grep quiet /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nokaslr splash"
$ sudo update-grub

We can add a module parameter, sym, through which the address of the function to inspect is passed in. The sample code is as follows:

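#include <linux/init.h>
#include <linux/module.h>
#include <linux/kallsyms.h>

/* Address of the function to inspect, passed as sym=0x... at insmod time */
static unsigned long sym = 0;
module_param(sym, ulong, 0644);

static int __init hello_init(void)
{
    const u8 *func_addr = NULL;

    if (sym != 0) {
        func_addr = (const u8 *)sym;
        /* Print the first 5 bytes of the function on a single line */
        pr_info("0x%02x 0x%02x 0x%02x 0x%02x 0x%02x\n",
                func_addr[0], func_addr[1], func_addr[2],
                func_addr[3], func_addr[4]);
    }

    /* %p prints a hashed pointer value; use %px to see the raw address */
    pr_info("fun addr 0x%p\n", func_addr);
    return 0;
}

/* An exit function is required so that rmmod can unload the module */
static void __exit hello_exit(void)
{
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");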

With KASLR disabled, compiling and loading the module above gives results consistent with the previous method:

$ addr=$(sudo grep ' schedule$' /proc/kallsyms | cut -d " " -f 1)
$ sudo insmod ./hello.ko sym=0x$addr

$ dmesg -T
[Sun Apr 10 20:50:51 2022] 0xe8 0x7b 0x5a 0xd9 0x20
[Sun Apr 10 20:50:51 2022] fun addr 0x000000005aad203e

$ sudo rmmod hello

If KASLR is still enabled and a fixed (static) address is used, loading the module triggers a page fault:

$ sudo dmesg -T
[Fri Apr  8 17:39:47 2022] BUG: unable to handle page fault for address: ffffffff810a3eb2
[Fri Apr  8 17:39:47 2022] #PF: supervisor read access in kernel mode
[Fri Apr  8 17:39:47 2022] #PF: error_code(0x0000) - not-present page

4. Verification using gdb + qemu

I have compiled a kernel with debug options enabled; it and the related files are saved on Baidu Pan (extraction code: av28). For detailed instructions on kernel compilation and debugging, refer to Debugging Linux Kernel using GDB + Qemu.

Here is how to set up qemu for kernel debugging on a Mac:

$ brew install qemu
$ brew link qemu

Download the files from the online link to a local directory before running qemu for testing:

$ cat run.sh
#!/bin/bash

qemu-system-x86_64 -machine type=q35,accel=hvf -kernel ./bzImage -initrd  ./rootfs_root.img -append "nokaslr console=ttyS0" -s c

$ ./run.sh

Note the -machine type=q35,accel=hvf flag, which enables hvf acceleration on a Mac. If no accelerator is specified, QEMU falls back to its TCG software emulation with a different default CPU model.

gdb_schedule_before

If hvf acceleration is not enabled on the qemu-system-x86_64 command line, the first 5 bytes of the function differ slightly: they default to 66 66 66 66 90 (data16 data16 data16 xchg %ax,%ax), because the preferred nop encoding differs between CPU types.

# cd /sys/kernel/debug/tracing
# echo 'p:schedule schedule' >> kprobe_events
# echo 1 > events/kprobes/schedule/enable

gdb_schedule_after

We can go one step further and disassemble the target of the call that now sits at the function entry:

(gdb) x/100i 0xffffffffc0002000

Paging through the output, you can see the call to kprobe_ftrace_handler, the handler that kprobe registered with ftrace.

kprobe_ftrace_handler

Note that the function at address 0xffffffffc0002000 is neither of the ftrace entry functions ftrace_caller and ftrace_regs_caller. It is a trampoline constructed dynamically in memory: the code of ftrace_caller or ftrace_regs_caller is copied into the trampoline and modified to call the registered handler (in this debug session, the ftrace function is ftrace_regs_caller and the registered handler is kprobe_ftrace_handler).
