This article can be found at: https://www.ebpf.top/post/no_space_left_on_devices

Recently, container creation in the production environment began failing with the error "no space left on device". During the investigation, however, both disk space and inode usage turned out to be perfectly normal. When conventional troubleshooting comes up empty like this, is there a quick and broadly applicable way to pinpoint the root cause?

This article records the analysis and troubleshooting process, carried out with eBPF + Ftrace in a standalone reproduction environment. Since the approach is broadly applicable, I have written it up in the hope that it can serve as a stepping stone for further exploration.

The author’s expertise is limited, and the ideas presented are for reference only. There may be some shortcomings in judgment or assumptions, so feedback and corrections from experts are welcome.

1. Understanding the "no space left on device" Error

The local reproduction method may not correspond exactly to the root cause in the production environment; it is intended for learning purposes.

When running docker run on a machine, the system displays “no space left on device,” resulting in container creation failure:

$ sudo docker run --rm -ti busybox ls -hl | wc -l
docker: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/40de1c588e43dea75cf80a971d1be474886d873dddee0f00369fc7c8f12b7149-init/merged: no space left on device.
See 'docker run --help'.

The error message indicates insufficient disk space during overlay mount. Checking the disk space situation using the df -Th command shows:

$ sudo df -Th
Filesystem     Type   Size    Used   Avail   Use%   Mounted on
tmpfs        tmpfs   392M    1.2M    391M    1%    /run
/dev/sda1    ext4    40G     19G     22G   46%    /
tmpfs        tmpfs   2.0G    0      2.0G    0%    /dev/shm
tmpfs        tmpfs   5.0M    0      5.0M    0%    /run/lock
/dev/sda15   vfat    98M     5.1M    93M    6%    /boot/efi
tmpfs        tmpfs   392M    4.0K   392M    1%    /run/user/1000
overlay      overlay 40G     19G     22G   46%    /root/overlay2/merged

However, the output shows that the overlay mount is only 46% utilized. From past troubleshooting experience, I recalled that running out of inodes can also trigger this error, so I checked inode usage with df -i:

$ sudo df -i
Filesystem      Inodes   IUsed   IFree   IUse%   Mounted on
tmpfs           500871     718   500153    1%     /run
/dev/sda1       5186560  338508  4848052   7%     /
tmpfs           500871       1   500870    1%     /dev/shm
tmpfs           500871       3   500868    1%     /run/lock
/dev/sda15          0       0        0    -      /boot/efi
tmpfs           100174      29   100145    1%     /run/user/1000
overlay         5186560  338508  4848052   7%     /root/overlay2/merged

Inode usage on the overlay filesystem is only 7%. Could files have been deleted while their handles remained open, so that space was never released due to leaked handles? Running out of ideas, I used lsof | grep deleted to look for clues, but found nothing:

$ sudo lsof | grep deleted
empty

After ruling out the common failure scenarios, the investigation seemed to have reached an impasse. When conventional troubleshooting methods fail like this, is there a way for the troubleshooter to identify the issue without relying heavily on kernel experts?

Indeed, there is. Today, the spotlight belongs to ftrace and eBPF (tools developed based on eBPF technology, such as BCC).

2. Problem Analysis and Localization

2.1 Preliminary Identification of the Problematic Function

The conventional approach would be to step through the client-side source code, but in a container setup the call chain runs Docker -> Dockerd -> Containerd -> Runc, which makes that analysis somewhat cumbersome and requires familiarity with the container architecture.

Instead, we can often zero in on the problem quickly by looking at the error codes returned by system calls. This method takes some experience and a bit of luck; if time permits, stepping through the source code is still recommended, since it both helps troubleshoot the problem and deepens understanding.

The error “no space left on device” is defined in the kernel include/uapi/asm-generic/errno-base.h file as follows:

Generally, you can search directly in the kernel using the error message.

#define	ENOSPC		28	/* No space left on device */
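
As the note above suggests, if you have a local copy of the kernel source tree, the definition can be located by searching for the error text (the path below assumes the mainline tree layout):

# Search the kernel UAPI headers for the error message text
$ grep -rn "No space left on device" include/uapi/asm-generic/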

BCC provides a syscall tracing tool, syscount-bpfcc, that can filter by error code. It also offers an -x option to show only failed system calls, which is useful in many scenarios.

Please note that the bpfcc suffix in syscount-bpfcc comes from Ubuntu packaging; in the BCC source tree the tool is simply called syscount.

First, let’s briefly understand how to use the syscount-bpfcc tool:

$ sudo syscount-bpfcc -h
usage: syscount-bpfcc [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T TOP] [-x] [-e ERRNO] [-L] [-m] [-P] [-l]

Summarize syscall counts and latencies.

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace only this pid
  -i INTERVAL, --interval INTERVAL
                        print summary at this interval (seconds)
  -d DURATION, --duration DURATION
                        total duration of trace, in seconds
  -T TOP, --top TOP     print only the top syscalls by count or latency
  -x, --failures        trace only failed syscalls (return < 0)
  -e ERRNO, --errno ERRNO
                        trace only syscalls that return this error (numeric or EPERM, etc.)
  -L, --latency         collect syscall latency
  -m, --milliseconds    display latency in milliseconds (default: microseconds)
  -P, --process         count by process and not by syscall
  -l, --list            print list of recognized syscalls and exit
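
As a quick illustration of the -x option mentioned earlier, a general sweep for failing system calls might look like this (the exact output depends on the workload):

# Summarize only failed syscalls (return value < 0), printing a summary every 5 seconds
$ sudo syscount-bpfcc -x -i 5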

Among the syscount-bpfcc parameters, we can use the -e option to filter on system calls that return the ENOSPC error:

$ sudo syscount-bpfcc -e ENOSPC
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[08:34:38]
SYSCALL                   COUNT
mount                         1

The trace result shows that the mount system call returned the ENOSPC error.

To determine which program issued the failing mount system call, we can use the -P parameter to aggregate the counts by process:

$ sudo syscount-bpfcc -e ENOSPC -P
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[08:35:32]
PID    COMM               COUNT
3010   dockerd                1

The trace result indicates that the mount system call invoked by the dockerd daemon returned the ENOSPC error.

To track a specific process, you can use the -p parameter to trace by process ID, which is useful when focusing the investigation on one process. If you are curious about the generated BPF program, you can append the --ebpf parameter to print it, e.g. syscount-bpfcc -e ENOSPC -p 3010 --ebpf.

With the help of the syscount-bpfcc tool, we have established a first lead: the mount system call issued by dockerd returns the ENOSPC error.

The mount system call corresponds to sys_mount, but on newer kernels sys_mount is not the symbol to trace directly: since kernel 4.17, system call entry points have architecture-specific wrappers. For more details, refer to new BPF APIs to get kernel syscall entry func name/prefix.

In Ubuntu 21.10 with the 5.13.0 kernel on the ARM64 architecture, the actual kernel entry function for sys_mount is __arm64_sys_mount.

For x86_64 architecture, the corresponding function for sys_mount is __x64_sys_mount, while other architectures can be confirmed by searching in /proc/kallsyms.
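
If in doubt, the entry symbol on your own machine can be confirmed from the kernel symbol table (the exact prefix depends on the architecture and kernel version):

# Look for the mount entry point; expect something like __arm64_sys_mount or __x64_sys_mount
$ sudo grep 'sys_mount$' /proc/kallsyms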

Up to this point we have identified the kernel entry function __arm64_sys_mount, but how do we find where in its sub-calls the error actually occurs? After all, the call path inside the kernel is quite long and may involve various jumps or implementation-specific branches. The first step is to obtain the sub-functions called by __arm64_sys_mount, and the function_graph tracer in ftrace is very helpful here. In this article I use funcgraph, a front-end tool from the perf-tools project, which spares us from setting the various tracing options by hand.

If you are not familiar with ftrace, it is recommended to learn more about Ftrace Essentials later on.

2.2 Locating the Root Cause of the Issue

The funcgraph tool in the perf-tools toolkit traces the sub-functions called by a given kernel function. Its usage is as follows:

$ sudo ./funcgraph -h
USAGE: funcgraph [-aCDhHPtT] [-m maxdepth] [-p PID] [-L TID] [-d secs] funcstring
                 -a              # all info (same as -HPt)
                 -C              # measure on-CPU time only
                 -d seconds      # trace duration, and use buffers
                 -D              # do not show function duration
                 -h              # this usage message
                 -H              # include column headers
                 -m maxdepth     # max stack depth to show
                 -p PID          # trace when this pid is on-CPU
                 -L TID          # trace when this thread is on-CPU
                 -P              # show process names & PIDs
                 -t              # show timestamps
                 -T              # comment function tails
  eg,
       funcgraph do_nanosleep    # trace do_nanosleep() and children
       funcgraph -m 3 do_sys_open # trace do_sys_open() to 3 levels only
       funcgraph -a do_sys_open    # include timestamps and process name
       funcgraph -p 198 do_sys_open # trace vfs_read() for PID 198 only
       funcgraph -d 1 do_sys_open >out # trace 1 sec, then write to file

For the first pass, I limited the trace depth to 2 with the -m 2 parameter to avoid drowning in deeply nested calls.

$ sudo ./funcgraph -m 2 __arm64_sys_mount

[Figure: funcgraph trace output of __arm64_sys_mount (sys_mon_trace.png)]

To fold the trace results in vim, you can refer to the corresponding section in Ftrace Essentials.

The function gic_handle_irq() seems to be related to interrupt handling, which can be ignored.

By analyzing the funcgraph output, we can see the key sub-functions called within __arm64_sys_mount.

When an error occurs during these kernel calls, execution usually jumps straight to the error-handling cleanup logic without invoking the remaining sub-functions. Based on the trace, we can therefore shift our focus from __arm64_sys_mount to path_mount, the key function invoked toward the end.

For a deeper analysis of the path_mount function calls:

$ sudo ./funcgraph -m 5 path_mount > path_mount.log

[Figure: funcgraph trace output of path_mount (path_mount_trace.png)]

By following, level by level, the last sub-function called before the error return, we can deduce the call chain:

__arm64_sys_mount()
	-> path_mount()
		-> do_new_mount()
			-> do_add_mount()
				-> graft_tree()
					-> attach_recursive_mnt()
						-> count_mounts()

Based on this call hierarchy, it is natural to speculate that the count_mounts function returned an error, which then propagated back up through __arm64_sys_mount to user space.

Since this is only speculation, it needs verification: we must obtain the return values along the entire call chain. The BCC tool trace-bpfcc can trace the return values of the relevant functions. Its help documentation is extensive and can be found in trace_example.txt, so it is omitted here.

Before using the trace-bpfcc tool for tracing, it is necessary to check the prototype declarations of the relevant functions in the kernel.
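
For example, assuming a checked-out kernel source tree matching the running kernel, the prototype of count_mounts() can be confirmed like this (file locations are from the mainline tree and may differ slightly between versions):

# count_mounts() is declared in fs/pnode.h and defined in fs/namespace.c
$ grep -n "count_mounts" fs/pnode.h fs/namespace.c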

To validate the speculation, we trace the return values of the core functions along the call chain. trace-bpfcc can trace multiple functions at once; each probe specification is passed as a separate quoted argument.

$ sudo trace-bpfcc 'r::__arm64_sys_mount() "%llx", retval' \
                   'r::path_mount "%llx", retval' \
                   'r::do_new_mount "%llx", retval' \
                   'r::do_add_mount "%llx", retval' \
                   'r::graft_tree "%llx", retval' \
                   'r::attach_recursive_mnt "%llx", retval' \
                   'r::count_mounts "%llx", retval'
PID     TID     COMM            FUNC             -
3010    3017    dockerd         graft_tree       ffffffe4
3010    3017    dockerd         attach_recursive_mnt ffffffe4
3010    3017    dockerd         count_mounts     ffffffe4
3010    3017    dockerd         __arm64_sys_mount ffffffffffffffe4
3010    3017    dockerd         path_mount       ffffffe4
3010    3017    dockerd         do_new_mount     ffffffe4
3010    3017    dockerd         do_add_mount     ffffffe4

The command r::__arm64_sys_mount() "%llx", retval can be interpreted as follows:

  • In r::__arm64_sys_mount, r indicates tracking the return value of the function;
  • "%llx", retval where retval is the variable for the return value, and "%llx" is the format for printing the return value;

The traced return value 0xffffffe4, interpreted as a signed 32-bit integer, is exactly -28, i.e. -ENOSPC (ENOSPC is defined as 28).
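
A quick way to sanity-check that conversion from the shell (bash uses 64-bit arithmetic, so we subtract 2^32 to reinterpret the value as a signed 32-bit integer):

# Interpret 0xffffffe4 as a signed 32-bit value
$ echo $(( 0xffffffe4 - (1 << 32) ))
-28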

Under the hood, trace-bpfcc delivers events through perf_event buffers; with multiple CPUs, the ordering of events is not fully guaranteed, which is why the output above is not in call order. On newer kernels, switching event delivery to the BPF ring buffer preserves ordering.

2.3 Confirming the Root Cause in the Source Code

Through the investigation above, we have narrowed the problem down to the count_mounts function. The next step is to examine its main logic in the source code. Fortunately, the function is concise and easy to follow:

int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
{
  // The maximum value that can be mounted is read from the variable sysctl_mount_max
	unsigned int max = READ_ONCE(sysctl_mount_max); 
	unsigned int mounts = 0, old, pending, sum;
	struct mount *p;

	for (p = mnt; p; p = next_mnt(p, mnt))
		mounts++;

	old = ns->mounts;  // Current number of mounts in the namespace
	pending = ns->pending_mounts;  // Pending number of mounts
	sum = old + pending;   // The total number of mounts is the current mounts + pending mounts
	if ((old > sum) ||
	    (pending > sum) ||
	    (max < sum) ||
	    (mounts > (max - sum)))  // These conditions are quite straightforward to understand
		return -ENOSPC;

	ns->pending_mounts = pending + mounts;
	return 0;
}

From a straightforward reading of the code, we can infer that the number of mounts in the current mount namespace has exceeded the maximum allowed by the system (sysctl_mount_max, tunable via /proc/sys/fs/mount-max).
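
To check whether this is plausible on a given machine, compare the number of entries in the mount namespace with the configured limit (reading /proc/self/mountinfo assumes you are in the same mount namespace as the failing process; otherwise use /proc/<pid>/mountinfo):

# Number of mounts visible in the current mount namespace
$ wc -l < /proc/self/mountinfo
# System-wide per-namespace mount limit
$ cat /proc/sys/fs/mount-max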

To replicate the issue locally, I set /proc/sys/fs/mount-max to 10 (the default is 100000), which reproduces the same error seen in production.

$ sudo cat /proc/sys/fs/mount-max
10

After identifying the root cause, we can set this value back to the default of 100000, and the docker run command succeeds again.
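
One way to restore the limit (100000 is the stock default mentioned above; fs.mount-max is the sysctl name corresponding to /proc/sys/fs/mount-max):

# Restore the default mount limit
$ echo 100000 | sudo tee /proc/sys/fs/mount-max
# Equivalent sysctl form
$ sudo sysctl -w fs.mount-max=100000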

In a real production environment, issues may be more complex, such as abnormal mounts or leaks. Nonetheless, the troubleshooting approach can be guided by the insights provided in this article.

With this, the problem identification is complete. However, a few questions raised during tracing still deserve clarification. These details, together with the pitfalls encountered, are important knowledge for using such tools effectively to diagnose issues.

Given that we have analyzed the source code based on the flow, can we expect a perfect match between what we observe during tracing and the actual code execution? The answer is not necessarily; sys_mount tracing is one such scenario that doesn’t align precisely with the code logic.

So, let’s compare and analyze the code flow of sys_mount with the actual tracing process to uncover the discrepancies.

3. Analysis of Discrepancies Between Code Flow and Tracing Process

The sys_mount function is defined in the fs/namespace.c file:

SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
		char __user *, type, unsigned long, flags, void __user *, data)
{
	int ret;
	char *kernel_type;
	char *kernel_dev;
	void *options;

	kernel_type = copy_mount_string(type);    // Child function 1
	ret = PTR_ERR(kernel_type);
	if (IS_ERR(kernel_type))
		goto out_type;

	kernel_dev = copy_mount_string(dev_name); // Child function 2
	ret = PTR_ERR(kernel_dev);
	if (IS_ERR(kernel_dev))
		goto out_dev;

	options = copy_mount_options(data);       // Child function 3
	ret = PTR_ERR(options);
	if (IS_ERR(options))
		goto out_data;

	ret = do_mount(kernel_dev, dir_name, kernel_type, flags, options); // Child function 4

	kfree(options);
out_data:
	kfree(kernel_dev);
out_dev:
	kfree(kernel_type);
out_type:
	return ret;
}

Incidentally, one pitfall worth noting: on this system, running syscount-bpfcc initially failed with the following exception because the ausyscall command was missing:

Traceback (most recent call last):
  File "/usr/sbin/syscount-bpfcc", line 20, in <module>
    from bcc.syscall import syscall_name, syscalls
  File "/usr/lib/python3/dist-packages/bcc/syscall.py", line 387, in <module>
    raise Exception("ausyscall: command not found")
Exception: ausyscall: command not found

The system lacks the ausyscall command; on Ubuntu it is provided by the auditd package (for other systems, refer to the ausyscall documentation):

$ sudo apt-get install auditd

References