diff --git a/runbooks/phase5-incident-log.md b/runbooks/phase5-incident-log.md
index 6df38f9..fe9f79c 100644
--- a/runbooks/phase5-incident-log.md
+++ b/runbooks/phase5-incident-log.md
@@ -220,15 +220,12 @@ After P5-01 completes and PBS is confirmed stable for 48h.
 ### P5-10 — Pi4 node-exporter ARM64 Deploy
 See P5-08.
-### P5-11 — Compute5 SK Hynix PC711 PCIe Power Management
-nvme1n1 on compute5 (SK Hynix PC711 1TB, `0000:03:00.0`) has 2,362 power cycles and 84 unsafe shutdowns — indicative of PCIe runtime PM aggressively power-cycling the drive. Kernel applied `platform quirk: setting simple suspend` in the current boot. Verify this persists:
-```bash
-# Check if quirk is active post-reboot:
-dmesg | grep -i 'nvme.*simple suspend\|03:00.*quirk'
-# If not applied, add to kernel cmdline or create modprobe conf:
-# nvme_core.default_ps_max_latency_us=0
-```
-WD PC SN740 (nvme0) on same node has 56 unsafe shutdowns in 1,407h — likely from pre-journal setup period and PCIe PS behavior. No action unless counts accumulate.
+### P5-11 — Compute5 SK Hynix PC711 PCIe Power Management (monitor only)
+nvme1n1 on compute5 (SK Hynix PC711 1TB, `0000:03:00.0`) has 2,362 power cycles and 84 unsafe shutdowns — indicative of PCIe runtime PM aggressively power-cycling the drive.
+
+**Quirk status verified 2026-05-06:** The kernel applies `platform quirk: setting simple suspend` automatically to both nvme0 and nvme1 — it is a built-in driver quirk for this CPU/chipset (i7-13700T platform), not a cmdline parameter. `/etc/kernel/cmdline` does not exist; `/proc/cmdline` has no nvme_core flags. The quirk persists across kernel updates by default. No user action required.
+
+Monitor: check `Unsafe Shutdowns` and `Power Cycles` in SMART at each health check. If counts continue accumulating after the quirk is active, escalate to drive replacement or PCIe slot investigation. WD PC SN740 (nvme0) on same node: 56 unsafe shutdowns in 1,407h — attributed to pre-journal setup period and PCIe PS interaction; no action unless accumulating.
 ---