The Incident
The home AI server I built earlier (Ryzen 9 9950X + RTX 5090, NixOS) started dropping off the network every few days after it went into service. No SSH, no ping reply, no display attached, nothing I could do. Each time I had to hold the power button to force a reboot. After the third time, I snapped and started a serious investigation. This is the record of it.
It turned into a long story, so I’m making it a series. This part is about how far you can chase a culprit with nothing but logs.
The Suspect List
When a headless Linux server freezes solid, the usual suspects are:
- GPU driver (the RTX 5090 is Blackwell-generation, and the nvidia driver’s GSP timeout issue is well known)
- Kernel / driver bugs
- Storage failure
- Power / thermals
- Memory
My initial favorite was the GSP timeout. New GPU, new driver, obviously suspicious. I even had a watchdog built specifically for GSP timeouts already in place, and it never fired — though that’s no contradiction: when the whole OS freezes, a userspace watchdog freezes with it. Either way, that guess turns out to be wrong.
Finding the Hangs from Boot Boundaries
I keep journald persistent (/var/log/journal), so every past boot is retained. First, list the boots with journalctl --list-boots and look at the gap between one boot’s last log line and the next boot’s first:
-15 0c8eb04e... Thu 2026-06-18 23:45 — Fri 2026-06-19 00:51   ← next boot 37s later (clean reboot)
-14 96582cae... Fri 2026-06-19 00:52 — Fri 2026-06-19 16:28   ← next boot 8min later (hang → long-press)
-13 a40ea466... Fri 2026-06-19 16:36 — Fri 2026-06-19 17:22

A clean reboot leaves a 30–60 second gap. A hang followed by a long-press power cycle leaves minutes to tens of minutes of silence between the last log line and the next boot (that’s the time it takes to notice the freeze and walk over to the power button). This method pinpointed all three hangs.
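The gap check is easy to automate. Here is a minimal sketch, assuming the boot boundaries have already been extracted into timestamps. The two-minute threshold (`HANG_GAP`) and the helper names `classify_gaps` and `ts` are my own illustrative choices, not anything journald provides. The listing only has minute resolution, so the computed gaps are coarser than the 37 seconds noted above.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: a gap longer than this is treated as a forced power cycle.
HANG_GAP = timedelta(minutes=2)

def classify_gaps(boots, threshold=HANG_GAP):
    """boots: list of (boot_id, first_entry, last_entry), oldest first.

    Returns (boot_id, gap, verdict) for every boot that has a successor.
    """
    results = []
    for (boot_id, _first, last), (_next_id, next_first, _next_last) in zip(boots, boots[1:]):
        gap = next_first - last
        verdict = "suspected hang" if gap > threshold else "clean reboot"
        results.append((boot_id, gap, verdict))
    return results

def ts(text):
    """Parse a minute-resolution timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")

# Timestamps copied from the boot listing (minute resolution only).
boots = [
    ("0c8eb04e", ts("2026-06-18 23:45"), ts("2026-06-19 00:51")),
    ("96582cae", ts("2026-06-19 00:52"), ts("2026-06-19 16:28")),
    ("a40ea466", ts("2026-06-19 16:36"), ts("2026-06-19 17:22")),
]

for boot_id, gap, verdict in classify_gaps(boots):
    print(boot_id, gap, verdict)
```

The threshold needs tuning per machine: it has to sit above the slowest clean reboot and below the fastest walk to the power button.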
Reading the Three Crash Signatures
At the tail of each identified boot, a crash with a different face was waiting.
Crash 1: CPU 0 Gets Stuck Waiting on a Lock and Never Comes Back
kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
kernel: rcu: (detected by 3, t=21002 jiffies, g=5283881, q=116 ncpus=32)
kernel: Sending NMI from CPU 3 to CPUs 0:
kernel: CPU: 0 UID: 0 PID: 1 Comm: systemd Tainted: G D O 6.18.34 #1-NixOS
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x2c0
kernel: _raw_spin_lock_irqsave+0x3d/0x50
(one minute later)
kernel: rcu: (detected by 17, t=84007 jiffies, g=5283881, q=224 ncpus=32)

On CPU 0, PID 1 (systemd itself) is wedged waiting to acquire a spinlock, and RCU is firing NMIs at it because it stopped responding. The wait time just keeps growing (t=21002 → 84007 jiffies) and never recovers. The log goes silent after this; the next entry is the boot banner after my long-press.
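To put the jiffies into wall-clock terms, here is a small sketch that pulls the `t=` values out of the two stall reports. It assumes the kernel ticks at 1000 Hz (`HZ = 1000`), which I have not confirmed from the kernel config. It is at least consistent with the "one minute later" between the two reports. The function name `stall_reports` is illustrative.

```python
import re

HZ = 1000  # assumption: kernel built with CONFIG_HZ=1000

STALL_RE = re.compile(r"rcu: \(detected by (\d+), t=(\d+) jiffies")

def stall_reports(lines, hz=HZ):
    """Return (detecting_cpu, seconds_stalled) for each RCU stall report line."""
    reports = []
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            reports.append((int(m.group(1)), int(m.group(2)) / hz))
    return reports

# The two report lines from the crash log.
log = [
    "kernel: rcu: (detected by 3, t=21002 jiffies, g=5283881, q=116 ncpus=32)",
    "kernel: rcu: (detected by 17, t=84007 jiffies, g=5283881, q=224 ncpus=32)",
]

reports = stall_reports(log)
for cpu, secs in reports:
    print(f"detected by CPU {cpu}: stalled for {secs:.1f} s")
print(f"growth between reports: {reports[-1][1] - reports[0][1]:.1f} s")
```

Under that assumption the stall grows from about 21 seconds to about 84 seconds. Both reports also carry the same `g=5283881`, which fits a single wait that never completes rather than two separate stalls.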
Crash 2: An Access to an Impossible Address
kernel: Oops: general protection fault, probably for non-canonical address 0xfdff8e2e2cffe130: 0000 [#1] SMP NOPTI
kernel: CPU: 2 UID: 62392 PID: 4230 Comm: .wyoming-faster Tainted: G O 6.18.34 #1-NixOS
kernel: RIP: 0010:__lruvec_stat_mod_folio+0x55/0xd0
kernel: RAX: fdff8e2e2cffdb00 RBX: ffff8e3cfdd7a780 ...
kernel: Call Trace:
kernel: folio_remove_rmap_ptes+0x42/0x220
kernel: unmap_page_range+0xdeb/0x14e0
kernel: unmap_vmas+0xa1/0x180
kernel: exit_mmap+0xe5/0x3c0
kernel: __mmput+0x41/0x150
kernel: do_exit+0x283/0xac0

A speech-recognition service (wyoming-faster-whisper) was merely exiting, and died while giving its memory back (exit_mmap). Look at the addresses. Kernel pointers should start with 0xffff..., but both the fault address and the RAX register hold 0xfdff.... The difference between ffff and fdff is exactly one dropped bit: bit 57. Compare with the healthy ffff8e3c... sitting right next to it in RBX, and you can literally see “one high bit of a pointer got flipped.”
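The bit arithmetic can be checked mechanically. This is a minimal sketch in which `guess` is a hypothetical reconstruction of the intact pointer, made by forcing the top 16 bits of RAX back to ones. The real original value is not in the log. The canonical check assumes 4-level paging (48-bit virtual addresses), and both helper names are mine.

```python
MASK64 = (1 << 64) - 1

def flipped_bits(observed, expected):
    """Bit positions (0 = least significant) where two 64-bit values differ."""
    diff = (observed ^ expected) & MASK64
    return [bit for bit in range(64) if (diff >> bit) & 1]

def is_canonical_48(addr):
    """True if bits 63..47 are all equal (canonical form under 4-level paging)."""
    top = addr >> 47
    return top == 0 or top == (1 << 17) - 1

rax = 0xfdff8e2e2cffdb00          # RAX as printed in the oops
guess = rax | (0xffff << 48)      # hypothetical intact pointer: top 16 bits forced to 1

print(flipped_bits(rax, guess))                      # positions that differ
print(is_canonical_48(rax), is_canonical_48(guess))  # corrupted vs. reconstructed
```

A single differing position at bit 57 is what "ffff became fdff" means in binary: the second hex digit went from 1111 to 1101.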
Crash 3: Page Bookkeeping Corruption Detected
kernel: BUG: Bad page state in process bash pfn:31e8c3
kernel: page: refcount:0 mapcount:0 mapping:000000007ebbe801 index:0x748932ae8 pfn:0x31e8c3
kernel: raw: 017fffc000000000 dead000000000100 dead000000000122 0400000000000000
kernel: page dumped because: non-NULL mapping

A freed page’s mapping field should be NULL, yet the fourth word of the raw dump holds 0400000000000000. 0x0400000000000000 is a value with only bit 58 set. Everything else is zero. Far too clean for anything to have written it; this is the shape of “a bit turned itself on.” The bystander that stepped on it and triggered the detection, by the way, was a plain bash process (Bad page state only dumps diagnostics; it doesn’t kill the process).
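The same kind of check works on the raw dump. The sketch below splits the `raw:` line into 64-bit words and lists the set bits of the fourth one. The field labels in `FIELDS` are an assumption about how the first four words of `struct page` are laid out on this kernel. Only the position of `mapping` as the fourth word comes from the analysis above.

```python
RAW = "017fffc000000000 dead000000000100 dead000000000122 0400000000000000"

# Assumed labels for the first four words of struct page in this dump.
FIELDS = ["flags", "lru.next", "lru.prev", "mapping"]

def set_bits(value):
    """Positions (0 = least significant) of the bits set in a 64-bit value."""
    return [bit for bit in range(64) if (value >> bit) & 1]

words = dict(zip(FIELDS, (int(w, 16) for w in RAW.split())))
mapping_bits = set_bits(words["mapping"])

print(f"mapping = {words['mapping']:#018x}, set bits: {mapping_bits}")
```

A value with exactly one set bit is a power of two, so `len(mapping_bits) == 1` is a quick filter when scanning other dumps for the same signature.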
Deduction: Finding the Common Thread
Line the three up and the pattern emerges:
- All in unrelated processes (systemd / speech recognition / bash). An app bug would reproduce in the same process
- All inside generic kernel memory-management code (spinlock / exit_mmap / page free). The same kernel bug would die in the same function
- Two of the three show a single flipped bit in a 64-bit value, directly visible (bit 57 dropped, bit 58 set). The remaining one (the deadlock) fits the same picture if you posit a corrupted lock variable
- And the prime suspect, the nvidia module, never appears in a single call trace
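The last point can be checked by script rather than by eye. Kernel stack traces print frames that belong to a loadable module with the module name in square brackets after the offset, so a sketch like the following collects every module that owns a frame. The helper name `modules_in_trace` is mine, and the sample lines are only the abbreviated excerpts quoted in this post, so a real check should run over the full journal output for each crash.

```python
import re

# Frames from loadable modules are printed as "func+0x.../0x... [module]".
MODULE_RE = re.compile(r"\+0x[0-9a-f]+/0x[0-9a-f]+\s+\[(\w+)\]")

def modules_in_trace(lines):
    """Return the set of module names that own at least one frame."""
    return {m.group(1) for line in lines for m in MODULE_RE.finditer(line)}

# Frames quoted in the crash excerpts (abbreviated, not the full traces).
trace = [
    "kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x2c0",
    "kernel: _raw_spin_lock_irqsave+0x3d/0x50",
    "kernel: RIP: 0010:__lruvec_stat_mod_folio+0x55/0xd0",
    "kernel: folio_remove_rmap_ptes+0x42/0x220",
    "kernel: unmap_page_range+0xdeb/0x14e0",
    "kernel: exit_mmap+0xe5/0x3c0",
]

found = modules_in_trace(trace)
print("nvidia" in found, found)
```

An empty set means every frame lives in the core kernel image. It says nothing about what a module may have done to memory earlier, only that none was on the stack when the crash was reported.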
Software bugs break the same place under the same conditions. This breakage is “a random bit, at a random place, at a random time.” That is not the face of software. It’s the face of DRAM bit flips.
It’s a consumer build without ECC despite being used as a server (a choice I made knowingly), so nothing detects or corrects a flipped bit. If the flipped bit lands on a kernel pointer you get a GPF; on the page bookkeeping, Bad page state; on a lock variable, a deadlock — one defect wearing a different face depending on where it lands would explain all three crashes.
But There’s No Hard Evidence
The circumstantial evidence is damning. But at this point, all I can say is “the memory is suspicious.” No amount of re-reading logs will tell me which DIMM, which bit.
Hard evidence means hammering the RAM directly with memtest86+. Except there’s a rather silly obstacle: a headless server has no physical console. And on top of that, the premise of my hypothesis is about to be overturned by an actual measurement.
Next time: correcting a misdiagnosis, and preparing the experiment.