diff --git a/runbooks/W11-W12-root-cause-2026-05.md b/runbooks/W11-W12-root-cause-2026-05.md
new file mode 100644
index 0000000..d6a46d7
--- /dev/null
+++ b/runbooks/W11-W12-root-cause-2026-05.md
@@ -0,0 +1,125 @@
+# Root Cause: W11 (pve-firewall xtables conflicts, compute6) + W12 (corosync-qdevice boot failures, compute2)
+
+Investigated: 2026-05-06
+Status: **Both benign. No code change made.** Follow-up items flagged for Phase 5.
+
+---
+
+## W11 — pve-firewall xtables lock conflicts on compute6
+
+### Symptom
+Recurring log entries on compute6 from the pve-firewall daemon (PID 2924):
+```
+pve-firewall[2924]: status update error: iptables_restore_cmdlist:
+  Another app is currently holding the xtables lock.
+  Perhaps you want to use the -w option?
+```
+Observed: Apr 20 (two ~2-minute bursts at 00:58 and 12:58), Apr 21, Apr 29, May 6.
+All occurrences fall on the same wall-clock minute (:58) as compute6's last boot (Apr 19 12:58).
+
+### Root Cause
+Race condition between two independent iptables consumers sharing the xtables lock:
+
+1. **pve-firewall** — runs a 10-second status update loop using `iptables-restore` (legacy backend).
+   It does **not** pass the `-w` wait flag, so it fails immediately if the lock is held.
+
+2. **Docker daemon** — configured with `Firewall Backend: iptables` (the default). Holds the xtables
+   lock during container network operations and periodic internal network consistency checks.
+   The Docker daemon started at Apr 19 12:58:14 (the same boot as compute6), and its periodic
+   iptables housekeeping is timed from daemon start, which explains why conflicts align with the
+   :58 minute mark.
+
+When Docker briefly holds the xtables lock, pve-firewall's next 10-second status cycle fails.
+pve-firewall retries on the following cycle (10 seconds later) and succeeds once Docker releases the lock.
+Maximum observed hold time: ~2 minutes (11 consecutive pve-firewall failures).
+
+### Impact
+**None.** pve-firewall successfully applies its rules on the next retry within 10–120 seconds.
+No firewall rules are left in an unapplied state. Cluster traffic and container networking are
+unaffected during the lock contention window.
+
+### Why No Code Change
+Both mitigation paths have worse trade-offs than the current behavior:
+
+- **`"iptables": false` in Docker's daemon.json (or `--iptables=false` on the daemon command line)** —
+  disables Docker NAT and masquerade rules entirely; containers lose internet access unless a parallel
+  nftables setup is built. High risk for minimal gain on a node running 9 services.
+
+- **Patching pve-firewall to pass `-w`** — not a user-configurable option; would require a PVE
+  package patch or override. PVE may address this in a future release.
+
+The current behavior (retry within 10 s, no sustained outage) is acceptable.
+
+### Monitoring
+No alert added. The conflict is logged to syslog and visible in `journalctl -u pve-firewall`.
+If conflicts become sustained (>5 minutes without resolution), investigate Docker daemon state.
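+
+A quick way to check whether contention is sustained rather than transient (a sketch, not an
+installed alert; it assumes the `pve-firewall` unit name and the log phrasing shown above):
+```
+# Count xtables-lock conflicts logged by pve-firewall in the last 10 minutes.
+# More than ~30 hits (one failure per 10-second cycle, sustained for 5+ minutes) suggests
+# sustained contention rather than the usual one-or-two-cycle blip; check Docker daemon state.
+journalctl -u pve-firewall --since "10 minutes ago" | grep -c "holding the xtables lock"
+```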
+
+### Future Watch
+If compute6 is rebooted, conflicts will initially appear more frequently (Docker and pve-firewall
+both re-synchronizing), then stabilize. No action needed.
+
+---
+
+## W12 — corosync-qdevice boot connection failures on compute2
+
+### Symptom
+At each PBS reboot, corosync-qdevice on the cluster nodes (observed on compute2) logs:
+```
+corosync-qdevice[1240]: Can't connect to qnetd host. (-5986): Network address not available
+corosync-qdevice[1240]: Connect timeout   (repeating every ~8 s for 1–5 minutes)
+```
+Observed on May 5 at 01:09, 05:33, and 19:17; and on May 6 at 11:04 (compute2's own reboot).
+
+### Root Cause (Two Layers)
+
+**Proximate:** `corosync-qnetd` is hosted on PBS (192.168.99.153). Every time PBS reboots,
+qnetd goes offline for 30–60 seconds during PBS's boot sequence. corosync-qdevice on all
+cluster nodes loses its qdevice connection during this window, then auto-reconnects once
+qnetd is listening again.
+
+**Underlying:** PBS has crash-rebooted 34 times since April 21, 2026 — roughly every
+1–3 days. Each crash is correlated with USB ZFS pool import failures (`usb1-zfs`, `usb2-zfs`)
+on boot:
+```
+systemd: Failed to start zfs-import@usb1-zfs.service — Import ZFS pool usb1-zfs.
+systemd: Failed to start zfs-import@usb2-zfs.service — Import ZFS pool usb2-zfs.
+systemd: Failed to start zfs-import-cache.service — Import ZFS pools by cache file.
+```
+This pattern appeared on Apr 21, 22, 23, 28, 29 (twice), May 1, and May 5 (twice), and is ongoing.
+The USB ZFS failure prevents PBS from completing initialization cleanly, causing repeated
+reboots. This is the **same issue as C1 in the 2026-05-05 health check** (PBS ZFS pools
+offline on boot). It is the primary driver of both the PBS instability and the downstream
+qdevice outages.
+
+### Impact on Quorum
+**None.** The 6-node Proxmox cluster (beast, compute2–6) uses the `ffsplit` algorithm with the
+qdevice as a supplementary vote. During each qdevice outage, the cluster operates on its 6
+node votes (each node has `quorum_votes: 1`), still above the 4-vote quorum threshold
+(with the qdevice counted, expected votes are 7 and quorum is floor(7/2) + 1 = 4).
+No VMs were migrated, fenced, or disrupted during any observed qdevice loss window.
+
+### Why No Code Change (This Phase)
+Fixing the symptom (qdevice failures) without fixing the cause (PBS USB ZFS instability)
+would mask the underlying problem. The correct fix is PBS ZFS reliability, not qdevice
+configuration. Once PBS is stable, the qdevice outages will stop.
+
+### Recommendations for Phase 5
+
+1. **PBS USB ZFS issue (C1):** Resolve the recurrent USB ZFS import failures on PBS. Likely
+   causes: USB cable quality, USB hub power delivery, or stale ZFS import cache entries.
+   Check: run `zpool import` manually after connecting the drives, inspect `dmesg` for USB
+   errors, and verify cable integrity (a command-level sketch follows these recommendations).
+   Once fixed, PBS will stop crash-rebooting.
+
+2. **Move qnetd from PBS to the Pi4 (.227):** Even after the USB ZFS issue is fixed, hosting
+   qnetd on a backup server that must be rebooted for updates and maintenance means every PBS
+   maintenance window causes a transient qdevice outage. The Pi4 is stable, low-load,
+   and never needs cluster-coordinated reboots. Migration steps:
+   ```
+   # On PBS: stop and disable qnetd
+   systemctl stop corosync-qnetd && systemctl disable corosync-qnetd
+   # On Pi4: install and start corosync-qnetd
+   apt install corosync-qnetd && systemctl enable --now corosync-qnetd
+   # On one cluster node: point the cluster at the new qnetd host.
+   # Either remove and re-add the qdevice (pvecm handles certs; needs root SSH to the Pi4):
+   pvecm qdevice remove
+   pvecm qdevice setup 192.168.99.227
+   # ...or edit quorum.device.net.host in the cluster-wide corosync config
+   # (/etc/pve/corosync.conf, bump config_version) and let corosync reload it.
+   ```
+   This is a 30-minute operation requiring a cluster config reload (no downtime, no VM migration).
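+
+For Recommendation 1, a minimal diagnostic sketch (read-only commands, run on PBS after a failed
+boot; the pool and unit names are taken from the logs above, so verify them before acting):
+```
+# Current state of any pools that did import.
+zpool status
+# Scan for pools that are visible but not imported (makes no changes without a pool name).
+zpool import
+# Look for USB disconnects/resets around the failed import.
+dmesg | grep -iE 'usb|reset|offline'
+# Inspect the failed import units from the current boot.
+systemctl status zfs-import@usb1-zfs.service zfs-import@usb2-zfs.service
+journalctl -b -u zfs-import-cache.service
+```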