We recently upgraded our Fedora servers from 36 and 37 to 38. Since the upgrade, we have observed servers with very high load averages (8+ on a 4-CPU server) even though the servers didn’t seem unreasonably slow. On the Unix systems I first used, Irix and Solaris, the load average counted threads in a Runnable state. Linux, however, includes both Runnable and Uninterruptible states in the load average. This means processes that are waiting, whether on a network call such as a mkdir against a mounted remote server or on local disk I/O, are included in the load average. As such, a high load average on Linux may indicate CPU resource contention, but it may also indicate I/O contention elsewhere.
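To see which of the two states is driving the number on a given box, the tasks can be counted by state straight from /proc. This is a minimal sketch, assuming the usual Linux /proc layout where each thread has a /proc/PID/task/TID/stat file whose first field after the parenthesised command name is the state letter; the function names are my own, not from any tool.

```python
import os
from collections import Counter


def task_state(stat_line: str) -> str:
    """Return the state letter (R, D, S, ...) from one /proc/.../stat line.

    The command name sits in parentheses and may itself contain spaces or
    ')' characters, so split on the LAST ')' rather than on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(stat_lines) -> Counter:
    """Tally how many tasks are in each state."""
    return Counter(task_state(line) for line in stat_lines)


def read_task_stats(proc: str = "/proc"):
    """Yield the stat line of every thread currently visible under /proc."""
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        try:
            tids = os.listdir(os.path.join(proc, pid, "task"))
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(proc, pid, "task", tid, "stat")) as f:
                    yield f.read()
            except OSError:
                continue  # thread exited while we were scanning


if __name__ == "__main__":
    counts = count_states(read_task_stats())
    print(f"Runnable (R):        {counts['R']}")
    print(f"Uninterruptible (D): {counts['D']}")
    with open("/proc/loadavg") as f:
        print(f"/proc/loadavg:       {f.read().strip()}")
```

If the D count is steadily nonzero while the R count stays near zero, the load average is probably reflecting waiting tasks rather than CPU contention.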
But there’s a third possibility: code that opts for the simplicity of an uninterruptible sleep even though the thread doesn’t actually need to be uninterruptible. Since our upgrade, CIFS mounts have a laundromat that I assume cleans up the cache. I see four cifsd-cfid-laundromat threads in an uninterruptible sleep state, which means my load average would be 4 even when the system is doing absolutely nothing.
2023-10-03 11:11:12 [lisa@server01 ~/]# ps aux | grep " [RD]"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1150  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root        1151  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root        1152  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root        1153  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root      556598  0.0  0.0 224668  3072 pts/11   R+   11:11   0:00 ps aux
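The arithmetic behind that baseline of 4 can be sketched with a small simulation. Linux keeps the load average as an exponentially damped moving average of the number of runnable plus uninterruptible tasks, sampled roughly every 5 seconds. The kernel does this in fixed-point math; the floating-point version below is only an approximation, and the function name and parameters are mine.

```python
import math

SAMPLE_SECONDS = 5  # approximate interval between load average updates


def simulate_load(active_tasks: int, window_seconds: int,
                  duration_seconds: int, start: float = 0.0) -> float:
    """Approximate the load average after holding `active_tasks` tasks in
    the R or D state for `duration_seconds`.

    `window_seconds` is 60, 300 or 900 for the 1, 5 and 15 minute figures.
    """
    decay = math.exp(-SAMPLE_SECONDS / window_seconds)
    load = start
    for _ in range(duration_seconds // SAMPLE_SECONDS):
        # Each sample blends the old average with the current task count.
        load = load * decay + active_tasks * (1.0 - decay)
    return load


if __name__ == "__main__":
    # Four threads parked in D state on an otherwise idle system.
    for label, window in (("1 min", 60), ("5 min", 300), ("15 min", 900)):
        print(f"{label:>6}: {simulate_load(4, window, 3600):.2f}")
```

With four tasks permanently in the D state, all three figures converge toward 4, so any real work on the server is added on top of that floor.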
Looking around the Internet, I see quite a few bug reports regarding this situation, so it seems like an “ignore it and wait” problem: although the load average is inflated by these sleeping threads, the effect is cosmetic. That explains why the server didn’t seem to be running slowly even though the load average was so high.
https://lkml.org/lkml/2023/9/26/1144
Date: Tue, 26 Sep 2023 17:54:10 -0700
From: Paul Aurich
Subject: Re: Possible bug report: kernel 6.5.0/6.5.1 high load when CIFS share is mounted (cifsd-cfid-laundromat in"D" state)

On 2023-09-19 13:23:44 -0500, Steve French wrote:
>On Tue, Sep 19, 2023 at 1:07 PM Tom Talpey <tom@talpey.com> wrote:
>> These changes are good, but I'm skeptical they will reduce the load
>> when the laundromat thread is actually running. All these do is avoid
>> creating it when not necessary, right?
>
>It does create half as many laundromat threads (we don't need
>laundromat on connection to IPC$) even for the Windows server target
>example, but helps more for cases where server doesn't support
>directory leases.

Perhaps the laundromat thread should be using msleep_interruptible()?

Using an interruptible sleep appears to prevent the thread from contributing to the load average, and has the happy side-effect of removing the up-to-1s delay when tearing down the tcon (since a7c01fa93ae, kthread_stop() will return early triggered by kthread_stop).

~Paul