Read more at: https://www.ebpf.top/post/cpu_io_wait

1. Definition of I/O Wait

I/O Wait is a per-CPU performance metric: it counts the time a CPU spends idle while tasks that went to sleep on that CPU are blocked on disk I/O. In other words, a CPU's idle time is split into truly idle time and idle time spent waiting on disk I/O. A high I/O Wait suggests that the disk may be a bottleneck, leaving the CPU idle with nothing to run. If this definition still feels a bit confusing, please keep reading; after working through the tests and verification below, your understanding of it will likely change.

2. Test and Verification

The local test environment: Ubuntu 22.04, 12 CPU cores, kernel 6.2.0-34-generic.

$ cat /proc/cpuinfo | grep processor | wc -l
12

We use the sysbench tool to generate the load and observe how I/O Wait behaves. To make the effect easier to see, the I/O test uses only 50% of the CPU cores (here, --threads=6).

$ sudo apt-get update
$ sudo apt install sysbench

# Prepare test data in advance
$ sysbench --threads=6 --time=0 --max-requests=0 fileio --file-num=1 \
    --file-total-size=10G --file-io-mode=sync --file-extra-flags=direct \
    --file-test-mode=rndrd prepare

# Run the stress test
$ sysbench --threads=6 --time=0 --max-requests=0 fileio --file-num=1 \
    --file-total-size=10G --file-io-mode=sync --file-extra-flags=direct \
    --file-test-mode=rndrd run
...
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of the test, Enabled.
Using synchronous I/O mode
Doing random read test
Initializing worker threads...

Threads started!

While the I/O stress test is running, we check CPU usage with the top command. The wa (I/O wait) value is clearly visible; in the screenshot below it is 38.3%:

[Figure: top output during the I/O stress test, wa at 38.3%]
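
Note that top aggregates all CPUs into a single wa figure. For a per-CPU view that is also easier to script, mpstat from the sysstat package reports %iowait for each core (an optional alternative, assuming sysstat is installed):

# Per-CPU utilization once per second, including the %iowait column
$ mpstat -P ALL 1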

At this point, one might conclude that the CPU I/O Wait value faithfully reflects the CPU time spent waiting for I/O. Now, while keeping the sysbench I/O test running, we start a CPU-intensive test at the same time and watch the wa value in top again. This CPU test uses all of the CPU cores:

$ sysbench --threads=`grep processor /proc/cpuinfo | wc -l` --time=0 cpu run

[Figure: top output with the CPU load test also running, wa at 0.0]

Something magical seems to have happened: wa has dropped to 0.0. Does that mean the I/O bottleneck disappeared as soon as the CPU load started, and the I/O pressure suddenly vanished from the system? Note that two sysbench processes are still running, one for the I/O test and one for the CPU test, so in reality the I/O bottleneck is still there.

Therefore, although a high I/O Wait does indicate that many processes in the system are waiting for disk I/O, a low or even zero I/O Wait does not mean the opposite: disk I/O may still be a bottleneck for some processes on the system.

With the results of our tests in mind, it seems that I/O Wait may not be reliable. So, what should we use to provide better visibility into I/O? We can use the vmstat tool to investigate.
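
vmstat only needs a sampling interval; a typical invocation (assuming the version shipped with Ubuntu 22.04) looks like this:

# Sample once per second; the columns of interest here are r, b and wa
$ vmstat 1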

While running the I/O and CPU loads simultaneously, we observe them with vmstat. After letting both run for a while, we stop the CPU-load sysbench process. The image below clearly shows the trend before and after stopping it.

[Figure: vmstat output while both loads are running and after the CPU load test is stopped]

First, we notice a significant change in the wa column before and after the CPU load test is stopped, and the r column changes accordingly (r is the number of runnable tasks; it sits at 12 while the CPU load test is running).

[Figure: vmstat output, wa and r columns]

The b column in the vmstat output shows the number of processes blocked on disk I/O (uninterruptible sleep). We can see that this value stays around 6 both before and after the sysbench CPU load test, matching the --threads=6 used for the I/O test. In other words, even while wa reads 0.0, there are still 6 processes waiting for I/O in the system.
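
To see exactly which tasks make up that b column, one option (a quick sketch, not part of the original test) is to list processes in the uninterruptible sleep state (state D), which in practice usually means they are blocked on I/O:

# Keep the header plus any process whose state starts with D (uninterruptible sleep)
$ ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'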

[Figure: vmstat output, b column]

3. Locating Disk Throughput and High-Frequency I/O Processes

After identifying blocked processes via the vmstat b column, we can go a step further and pinpoint disk throughput and the processes doing the I/O with iostat and iotop.

$ iostat 1 nvme0n1

nvme0n1 is a local disk.

[Figure: iostat output for nvme0n1]

Where:

  • tps: transfers (I/O requests) issued to the device per second, i.e. IOPS
  • kB_read/s, kB_wrtn/s: kilobytes read from and written to the device per second
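
If latency and device utilization matter more than raw throughput, the extended statistics mode is usually more informative (exact column names vary slightly between sysstat versions):

# Extended per-device statistics: r/s, w/s, r_await/w_await, %util, ...
$ iostat -x 1 nvme0n1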

The iotop tool can quickly locate the processes doing frequent reads and writes on the current system.
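
For example (assuming the stock iotop package: -o shows only tasks that are actually doing I/O, and -P collapses threads into processes):

# Show only processes currently performing I/O, refreshed every second
$ sudo iotop -oP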

[Figure: iotop output showing per-process I/O]

4. Analysis of Kernel CPU Statistics Implementation

After the above analysis, we attempt a simple analysis from the perspective of kernel code implementation (kernel code version 6.2.0):

cputime.c#L483

void account_process_tick(struct task_struct *p, int user_tick)
{
    u64 cputime, steal;

    if (vtime_accounting_enabled_this_cpu())
        return;

    if (sched_clock_irqtime) {
        irqtime_account_process_tick(p, user_tick, 1);
        return;
    }

    cputime = TICK_NSEC;
    steal = steal_account_process_time(ULONG_MAX);

    if (steal >= cputime)
        return;

    cputime -= steal;

    // 1. If the current process is in user mode, then increase user mode CPU time
    if (user_tick) 
        account_user_time(p, cputime);
    // 2. If the process is in kernel mode and not the idle process, increase system mode CPU time
    else if ((p != this_rq()->idle) || (irq_count() != HARDIRQ_OFFSET))
        account_system_time(p, HARDIRQ_OFFSET, cputime);
    // 3. If the current process is the idle process, call the account_idle_time() function for processing
    else
        account_idle_time(cputime);
}

The account_idle_time() function accounts the idle time of the current CPU. If the nr_iowait counter on the CPU's runqueue is not 0, the idle time is charged to iowait; otherwise it is charged to idle.

cputime.c#L218

void account_idle_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    // 1. If there are processes waiting for I/O requests, increase iowait time
    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += cputime;
    // 2. Otherwise, increase idle time
    else
        cpustat[CPUTIME_IDLE] += cputime;
}

From the code, we can see that to be in an I/O wait state, two conditions must be met:

  1. There must be processes waiting for I/O requests to be completed on the current CPU.
  2. The CPU must be idle, meaning there are no runnable processes.
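
Both counters ultimately land in /proc/stat, which is where top and vmstat read them from. As a quick sanity check, the fields of the aggregate cpu line are, per proc(5): user, nice, system, idle, iowait, irq, softirq, steal, ...:

# The 5th numeric field is the accumulated iowait time, in USER_HZ ticks
$ grep '^cpu ' /proc/stat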

5. Conclusion

Through the above tests, we can see that I/O Wait can be a very confusing metric. If CPU-intensive processes are running, the I/O Wait value may drop, yet disk I/O still blocks process execution just as much as before. Therefore, we cannot rely on the I/O Wait value alone to decide whether the system has an I/O bottleneck.

We believe that after reading this document, you will no longer fall into the trap of the I/O Wait metric, and that is where the value of this article lies.
