Commit Graph

3 Commits

Author SHA1 Message Date
tommy
640cd908df docs: P5-11 — compute5 nvme1 PCIe quirk verified, no action needed
The NVMe platform quirk 'simple suspend' is applied automatically by the
PVE kernel on the i7-13700T platform (both nvme0 and nvme1). It is not a
cmdline parameter; /etc/kernel/cmdline is absent. The quirk persists
across kernel updates by default.
Verified: dmesg confirms the quirk is active on both drives at current boot.
P5-11 status: monitor only, no user action required.
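
The dmesg check described above could be repeated after a kernel update
with a small script. This is a minimal sketch, not the procedure that was
actually used. It assumes the kernel logs the quirk with the text
"platform quirk: setting simple suspend" prefixed by the device's PCI
address; the function name and regular expression are illustrative.

```python
import re

# Message text assumed to be logged by the NVMe PCI driver when the quirk is set.
QUIRK_TEXT = "platform quirk: setting simple suspend"

# Matches lines such as:
#   nvme 0000:01:00.0: platform quirk: setting simple suspend
QUIRK_RE = re.compile(r"nvme (\S+): " + re.escape(QUIRK_TEXT))


def quirk_devices(dmesg_text: str) -> list[str]:
    """Return the sorted PCI addresses of NVMe devices that logged the quirk."""
    return sorted({match.group(1) for match in QUIRK_RE.finditer(dmesg_text)})
```

Feeding it the output of `dmesg` on compute5 should list two PCI addresses
if the quirk is active on both drives; an empty list would mean the
message was not found in the current boot log.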

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 21:29:16 -05:00
tommy
909fe3dc12 docs: Phase 5 — incident log, maintenance runbook, NUT config archive
beast/nut/upssched.conf: fixed earlyshutdown timer (30s→300s) and added
  cancel rule. Archived post-fix from Beast.
beast/nut/upssched-cmd: archived for reference.
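
To illustrate why the timer length and the cancel rule matter together,
here is a simplified model of a start-timer/cancel-timer pair. It is a
sketch under assumptions: the event names ONBATT and ONLINE follow NUT
conventions, but the function, its parameters, and the 45-second outage
are hypothetical and do not reproduce the archived upssched.conf.

```python
def timer_fires(events, timer_seconds, cancel_on_online):
    """Return the time (in seconds) at which the shutdown timer fires, or None.

    events: list of (time_in_seconds, name) tuples sorted by time, where
    name is "ONBATT" (power lost) or "ONLINE" (power restored).
    """
    deadline = None
    for t, name in events:
        # A pending timer that expired before this event has already fired.
        if deadline is not None and t >= deadline:
            return deadline
        if name == "ONBATT" and deadline is None:
            deadline = t + timer_seconds
        elif name == "ONLINE" and cancel_on_online:
            deadline = None
    # A timer still pending after the last event runs to completion.
    return deadline


# Hypothetical 45-second outage.
outage = [(0, "ONBATT"), (45, "ONLINE")]
print(timer_fires(outage, 30, cancel_on_online=False))   # 30
print(timer_fires(outage, 300, cancel_on_online=False))  # 300
print(timer_fires(outage, 300, cancel_on_online=True))   # None
```

In this model a longer timer alone only delays the shutdown; the cancel
rule is what lets a short outage end without one.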

runbooks/phase5-incident-log.md:
  INC-005: NUT upssched earlyshutdown bug — cause, fix, break-test findings.
  INC-001 through INC-004: PBS USB hub/data/re-diagnosis/sdj end-of-life.
  P5-01 through P5-11: pending work queue including Pi4 node-exporter,
  compute5 PCIe PM, PBS recovery gate, qnetd migration, Authelia metrics,
  cert renewal verification, ansible-control disk, Phase 4A resume.

runbooks/maintenance-runbook.md: PBS recovery steps, UPS test procedure,
  cert renewal schedule (May 11/18), Phase 4A per-node procedure, qnetd
  migration, Authelia metrics setup, recurring checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 21:25:49 -05:00
tommy
545d3563b8 docs: Phase 4C — root cause for W11 (xtables) and W12 (qdevice)
W11 (compute6 pve-firewall xtables conflicts): caused by a race between
pve-firewall's iptables_restore (no -w flag) and Docker's iptables
backend. Each conflict self-resolves within 10-120 s; there is no
firewall gap, and no action was taken. Conflicts align to the :58 minute
mark, which matches the Docker daemon start time at the last compute6
boot (Apr 19 12:58).

W12 (compute2 corosync-qdevice boot failures): qnetd runs on PBS (.153).
PBS has crash-rebooted 34 times since Apr 21, each time correlated with a
USB ZFS import failure (same as C1 in the May 5 health check). Each PBS
reboot drops qnetd for ~60s; cluster quorum is unaffected (6 node votes,
threshold 3). No action this phase. Phase 5 items: fix PBS USB ZFS
instability; optionally migrate qnetd from PBS to Pi4 (.227).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 19:11:47 -05:00