To remove records older than a week (7 days) from the systemd journal:
journalctl --vacuum-time=7d
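The journal can also be capped by size rather than age; --vacuum-size trims the oldest entries until the journal fits under the given size (500M here is just an example value):

journalctl --vacuum-size=500M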
If you film the boot sequence and look frame by frame, you’ll see that it very briefly flashes a TPM error:
error: ../../grub-core/commands/efi/tpm.c:150:unknown TPM error.
From what I’ve been able to glean, this secure boot stuff works off of signatures. Microsoft’s signatures ship in the BIOS; everyone else kind of inserts their keys on the fly … so you can run out of space to save these keys and be unable to boot. To work around this, every time an update puts us over the limit, we go into the secure boot DBX management menu and reset the “Forbidden Signatures” database to the factory default. That leaves 13 keys instead of 373, and the OS is able to do its “thing” and boot.
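On the Linux side, you can inspect the forbidden-signature database that this menu resets; mokutil can dump it, assuming the mokutil package is installed (an assumption about the system, not something we actually did during recovery):

mokutil --dbx   # dump the forbidden-signature (DBX) database entries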
And I’m actually writing this down this time because I spent a lot of time researching this the last time Scott’s laptop failed to boot and dumped out to a grub menu. This time, I kind of know what we did and why, but I had lost a lot of the details.
The rsync utility was meant to be used to sync files across the network — to or from an rsync server. For some time, I had a group of friends who shared documents off of my rsync server. Anyone with access could run an rsync command and sync their computer up with the group’s documents. With the advent of online file storage and collaborative editing, this was no longer needed. But I still use rsync to make sure my laptop has a local copy of a folder on the server. Mount the SMB or NFS share at /path/to/folder/contents/to/copy, and the following rsync command ensures the laptop’s /path/to/where/contents/should/be/placed is an exact mirror of the contents of the server folder (a dry-run example follows the option list below):
rsync --archive --verbose --update --delete "/path/to/folder/contents/to/copy/" "/path/to/where/contents/should/be/placed/"
--archive is a grouping of:
-r recursive
-l copy symlinks
-p preserve permissions
-t preserve modification timestamps
-g preserve group
-o preserve owner
--devices preserve device files (su only)
--specials preserve special files
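Because --update --delete will remove anything at the destination that no longer exists at the source, it can be worth previewing a sync first; adding --dry-run makes rsync report what it would change without touching anything:

# preview the sync without modifying anything at the destination
rsync --archive --verbose --update --delete --dry-run "/path/to/folder/contents/to/copy/" "/path/to/where/contents/should/be/placed/"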
We recently updated our Fedora servers from 36 and 37 to 38. Since the upgrade, we have observed servers with very high load averages – 8+ on a 4-CPU server – but the servers didn’t seem unreasonably slow. On the Unix servers I first used, IRIX and Solaris, the load average counts threads in a runnable state. Linux, however, includes both runnable and uninterruptible states in the load average. This means processes that are waiting – on network calls such as a mkdir against a remote server’s mount, or on local disk I/O – are included in the load average. As such, a high load average on Linux may indicate CPU resource contention, but it may also indicate I/O contention elsewhere.
But there’s a third possibility – code that opts for the simplicity of an uninterruptible sleep without the process actually needing to be uninterruptible. After our upgrade, CIFS mounts have a laundromat that I assume cleans up cache – I see four cifsd-cfid-laundromat threads in an uninterruptible sleep state – which means my load average, when the system is doing absolutely nothing, would be 4.
2023-10-03 11:11:12 [lisa@server01 ~/]# ps aux | grep " [RD]"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1150  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root        1151  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root        1152  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root        1153  0.0  0.0      0     0 ?        D    Sep28   0:01 [cifsd-cfid-laundromat]
root      556598  0.0  0.0 224668  3072 pts/11   R+   11:11   0:00 ps aux
Looking around the Internet, I see quite a few bug reports regarding this situation … so it seems like an “ignore it and wait” problem – although the load average value is increased by these sleeping threads, the increase is cosmetic. Which explains why the server didn’t seem to be running slowly even though the load average was so high.
https://lkml.org/lkml/2023/9/26/1144
Date: Tue, 26 Sep 2023 17:54:10 -0700
From: Paul Aurich
Subject: Re: Possible bug report: kernel 6.5.0/6.5.1 high load when CIFS share is mounted (cifsd-cfid-laundromat in "D" state)

On 2023-09-19 13:23:44 -0500, Steve French wrote:
>On Tue, Sep 19, 2023 at 1:07 PM Tom Talpey <tom@talpey.com> wrote:
>> These changes are good, but I'm skeptical they will reduce the load
>> when the laundromat thread is actually running. All these do is avoid
>> creating it when not necessary, right?
>
>It does create half as many laundromat threads (we don't need
>laundromat on connection to IPC$) even for the Windows server target
>example, but helps more for cases where server doesn't support
>directory leases.

Perhaps the laundromat thread should be using msleep_interruptible()?

Using an interruptible sleep appears to prevent the thread from contributing to the load average, and has the happy side-effect of removing the up-to-1s delay when tearing down the tcon (since a7c01fa93ae, the sleep will return early when triggered by kthread_stop()).

~Paul
I encountered an oddity at work — there’s a server on an internally located public IP space. Because it’s public space, it is not allowed to communicate with the internal interface of some of our security group’s servers. It has to use their public interface (not a technical limitation, just a policy on which they will not budge). I cannot just use a DNS server that resolves the public copy of our zone, because then we’d lose access to everything else, so we are stuck making an /etc/hosts entry. Except this thing changes IPs fairly regularly (hey, we’re moving from AWS to Azure; hey, let’s try CloudFlare; nope, that is expensive so change it back), and the service it provides is application authentication — not something you want randomly falling over every couple of months.
So I’ve come up with a quick script to maintain the /etc/hosts record for the endpoint.
# requires: dnspython
import dns.resolver
import subprocess

strHostToCheck = 'hostname.example.com'  # PingID endpoint for authentication
strDNSServer = "8.8.8.8"                 # Google's public DNS server
listStrIPs = []

# Get current assignment from hosts file
listCurrentAssignment = [line for line in open('/etc/hosts') if strHostToCheck in line]
if len(listCurrentAssignment) >= 1:
    strCurrentAssignment = listCurrentAssignment[0].split("\t")[0]

    # Get actual assignment from DNS
    objResolver = dns.resolver.Resolver()
    objResolver.nameservers = [strDNSServer]
    objHostResolution = objResolver.resolve(strHostToCheck, 'A')  # dnspython 2.x; older 1.x used .query()
    for objARecord in objHostResolution:
        listStrIPs.append(objARecord.to_text())

    if len(listStrIPs) >= 1:
        # Fix /etc/hosts if the assignment there doesn't match DNS
        if strCurrentAssignment in listStrIPs:
            print(f"Nothing to do -- hosts file record {strCurrentAssignment} is in {listStrIPs}")
        else:
            print(f"I do not find {strCurrentAssignment} here, so now fix it!")
            subprocess.call([f"sed -i -e 's/{strCurrentAssignment}\t{strHostToCheck}/{listStrIPs[0]}\t{strHostToCheck}/g' /etc/hosts"], shell=True)
    else:
        print("No resolution from DNS ... that's not great")
else:
    print("No assignment found in /etc/hosts ... that's not great either")
And a final note from my disaster recovery adventure — I had to use ddrescue to copy as much data from a corrupted drive as possible (ddrescue /dev/sdb /mnt/vms/rescue/backup.raw --try-again --force --verbose) — but once I had the image, what do you do with it? Fortunately, you can mount a dd image and copy data from it: kpartx maps the image’s partitions (note that kpartx -l only lists them; kpartx -av backup.raw actually creates the /dev/mapper/loop0pN devices), and those partitions can then be mounted read-only.
# Mounting DD image
2023-04-17 23:54:01 [root@fedora /]# kpartx -l backup.raw
loop0p1 : 0 716800 /dev/loop0 2048
loop0p2 : 0 438835200 /dev/loop0 718848
2023-04-17 23:55:08 [root@fedora /]# mount /dev/mapper/loop0p2 /mnt/recovery/ -o loop,ro
mount: /mnt/recovery: cannot mount /dev/loop1 read-only.
dmesg(1) may have more information after failed mount system call.
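# norecovery mounts without replaying the journal -- the replay is what the plain read-only mount above could not complete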
2023-04-17 23:55:10 [root@fedora /]# mount /dev/mapper/loop0p2 /mnt/recovery/ -o loop,ro,norecovery
2023-04-18 00:01:03 [root@fedora /]# ll /mnt/recovery/
total 205G
drwxr-xr-x 2 root root 213 Jul 14 2021 .
drwxr-xr-x. 8 root root 123 Apr 17 22:38 ..
-rw-r--r--. 1 root root 127G Apr 17 20:35 ExchangeServer.qcow2
-rw-r--r--. 1 qemu qemu 10G Apr 17 21:42 Fedora.qcow2
-rw-r--r--. 1 qemu qemu 15G Apr 17 14:05 FedoraVarMountPoint.qcow2
After the power came back on, I had to recover ext and XFS partitions:
# FSCK to clean up bad superblock
lsblk                       # identify the device and partitions
fsck /dev/sdb
e2fsck -b 32768 /dev/sdb1   # repair using a backup superblock
mount /dev/sdb1 /var

# xfs_repair for XFS FS on LVM
lvscan                           # get dev for LVM
mount /dev/fedora/root /mnt/oh   # initial mount attempt -- failed, prompting the repair below
xfs_repair -L /dev/fedora/root   # -L zeroes the corrupt log (last resort)
mount /dev/fedora/root /mnt/oh
We had a power outage on Monday that took out the drive that holds our VMs. There are backups, but the backup drive copies had superblock errors and all sorts of issues. To recover our data, I learned all sorts of new things — firstly, that you can mount a QCOW file and copy data out of it. First, you have to connect a network block device to the file. Once it is connected, you can use fdisk to list the partitions on the drive and mount those partitions. In this example, I had a partition called nbd0p1 that I mounted to /mnt/data_recovery:
modprobe nbd max_part=2
qemu-nbd --connect=/dev/nbd0 /path/to/server_file.qcow2
fdisk /dev/nbd0 -l
mount /dev/nbd0p1 /mnt/data_recovery
Once you are done, unmount it and disconnect from the network block device.
umount /mnt/data_recovery
qemu-nbd --disconnect /dev/nbd0
rmmod nbd
I noticed, today, that our Kafka Manager interface only shows details from one server — the one where we run Kafka Manager. We’ve done everything that we need to do in order to get this working — the port shows as open with nmap, and the command to run Kafka includes all of the necessary settings. I’ve even tried setting the JMX hostname, but still there is just one server reporting data.
Then I happened across an article online that detailed how JMX actually uses three ports — the configured port 9999 and two other randomly selected and non-configurable ports. I used netstat to list all of the ports in use by the Java PID running my Kafka server and, voila, there were two odd-ball high ports (30000’s and 40000’s). I added those additional ports to the firewall rules and … I’ve got data for all of the Kafka servers!
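To find those listeners, listing the sockets owned by the broker's Java process is enough (the PID below is a placeholder; substitute whatever ps reports for your Kafka JVM):

netstat -tlnp | grep 12345   # 12345 is a placeholder for the Kafka JVM's PID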
This is obviously a short-term solution, as the two randomly selected ports will be different when I restart the service next time. I’d prefer to leave the firewall in place (i.e. not just open all ports >1024 between the Kafka Manager host and all of the Kafka servers), so I might put together a script to identify the “oddball” ports associated with the Java PID and add them as transient firewalld rules — see the sketch below. But the last server restart was back in 2021 … so I might just manually add them after the upgrade next week and worry about something ‘better’ next year!
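A rough sketch of what such a script might look like; this is an assumption-laden outline (single broker per host, main class kafka.Kafka, broker listener on 9092, JMX registry on 9999, firewalld in use), not a tested implementation:

#!/bin/bash
# Sketch only -- enumerate the Kafka JVM's listening TCP ports and open the
# unexpected high ports as transient firewalld rules (no --permanent, so the
# rules disappear on reload/reboot, matching the ports' ephemeral nature).
KAFKA_PID=$(pgrep -f kafka.Kafka | head -n 1)
for PORT in $(ss -tlnp | grep "pid=$KAFKA_PID" | awk '{print $4}' | awk -F: '{print $NF}' | sort -un); do
    if [ "$PORT" -gt 1024 ] && [ "$PORT" -ne 9092 ] && [ "$PORT" -ne 9999 ]; then
        firewall-cmd --add-port="${PORT}/tcp"   # transient rule
    fi
done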