docs: Phase 4C — root cause for W11 (xtables) and W12 (qdevice)
W11 (compute6 pve-firewall xtables conflicts): race condition between pve-firewall's
iptables_restore (no -w flag) and Docker's iptables backend. Self-resolves within
10-120s; no firewall gap; no action taken. Conflicts align to the :58 minute mark,
the Docker daemon start time at the last compute6 boot (Apr 19 12:58).

W12 (compute2 corosync-qdevice boot failures): qnetd runs on PBS (.153); PBS has
crash-rebooted 34 times since Apr 21, each correlated with USB ZFS import failure
(same as C1, May 5 health check). Each PBS reboot drops qnetd for ~60s; cluster
quorum unaffected (6 node votes, threshold 3). No action this phase.

Phase 5 items: fix PBS USB ZFS instability; optionally migrate qnetd from PBS to
Pi4 (.227).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Root Cause: W11 (pve-firewall xtables conflicts, compute6) + W12 (corosync-qdevice boot failures, compute2)
Investigated: 2026-05-06
Status: **Both benign. No code change made.** Follow-up items flagged for Phase 5.

---

## W11 — pve-firewall xtables lock conflicts on compute6
### Symptom

Recurring log entries on compute6 from the pve-firewall daemon (PID 2924):

```
pve-firewall[2924]: status update error: iptables_restore_cmdlist:
Another app is currently holding the xtables lock.
Perhaps you want to use the -w option?
```

Observed: Apr 20 (two ~2-minute bursts at 00:58 and 12:58), Apr 21, Apr 29, May 6.
All occurrences fall on the same wall-clock minute (:58) at which compute6 last booted (Apr 19 12:58).
### Root Cause

A race condition between two independent iptables consumers sharing the xtables lock:

1. **pve-firewall** — runs a 10-second status update loop using `iptables_restore` (legacy backend).
   It does **not** pass the `-w` wait flag, so it fails immediately if the lock is held.

2. **Docker daemon** — configured with `Firewall Backend: iptables` (default). It holds the xtables lock
   during container network operations and periodic internal network-consistency checks.
   The Docker daemon started at Apr 19 12:58:14 (the same boot as compute6), and its periodic
   iptables housekeeping is timed from daemon start — explaining why conflicts align with the
   :58 minute mark.

When Docker briefly holds the xtables lock, pve-firewall's next 10-second status cycle fails.
pve-firewall retries on the following cycle (10 seconds later) and succeeds once Docker releases
the lock. Maximum observed hold time: ~2 minutes (11 consecutive pve-firewall failures).
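A quick way to re-verify this correlation on compute6 (a sketch; assumes the standard journalctl short output format, and that this Docker version reports its firewall backend in `docker info`):

```
# Bucket the xtables-lock failures by wall-clock HH:MM (expect :58 to dominate).
journalctl -u pve-firewall --since "2026-04-19" | grep "xtables lock" \
  | awk '{print substr($3, 1, 5)}' | sort | uniq -c

# Docker daemon start time (should line up with the :58 pattern).
systemctl show docker --property=ActiveEnterTimestamp

# Confirm Docker is on the iptables backend.
docker info 2>/dev/null | grep -i firewall
```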
### Impact

**None.** pve-firewall successfully applies its rules on the next retry, within 10–120 seconds.
No firewall rules are left in an unapplied state. Cluster traffic and container networking are
unaffected during the lock-contention window.
### Why No Code Change

Both mitigation paths have worse trade-offs than the current behavior:

- **`--iptables=false` in Docker's daemon.json** — disables Docker's NAT and masquerade rules entirely;
  containers lose internet access unless a parallel nftables setup is built. High risk for minimal
  gain on a node running 9 services.

- **Patching pve-firewall to add `-w`** — not a user-configurable option; it would require a PVE
  package patch or override. PVE may address this in a future release.

Current behavior (retry within 10s, no sustained outage) is acceptable.
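For reference only, the rejected Docker-side change would look like the following (`"iptables": false` is the daemon.json equivalent of `--iptables=false`; shown to document the trade-off, not to apply):

```
# /etc/docker/daemon.json -- NOT applied; would break container NAT as described above.
# {
#   "iptables": false
# }
```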
### Monitoring

No alert added. The conflict is logged to syslog and visible in `journalctl -u pve-firewall`.
If conflicts become sustained (>5 minutes without resolution), investigate Docker daemon state.
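A one-off check for a sustained window (a sketch; the 5-minute threshold mirrors the guidance above):

```
# Count xtables-lock failures in the last 5 minutes. A non-zero count that keeps
# growing across repeated runs suggests Docker is wedged, not transiently busy.
journalctl -u pve-firewall --since "5 min ago" | grep -c "xtables lock"
```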
### Future Watch

If compute6 is rebooted, conflicts will initially appear more frequently (Docker and pve-firewall
both re-synchronizing), then stabilize. No action needed.

---

## W12 — corosync-qdevice boot connection failures on compute2
### Symptom

At each PBS reboot, corosync-qdevice on the cluster nodes (observed on compute2) logs:

```
corosync-qdevice[1240]: Can't connect to qnetd host. (-5986): Network address not available
corosync-qdevice[1240]: Connect timeout (repeating every ~8s for 1–5 minutes)
```

Observed on May 5 at 01:09, 05:33, and 19:17, and on May 6 at 11:04 (compute2's own reboot).
### Root Cause (Two Layers)

**Proximate:** `corosync-qnetd` is hosted on PBS (192.168.99.153). Every time PBS reboots,
qnetd goes offline for 30–60 seconds during PBS's boot sequence. corosync-qdevice on all
cluster nodes loses its qdevice connection during this window, then auto-reconnects once
qnetd is listening again.
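The connection state can be confirmed from both ends (a sketch; assumes the stock corosync qdevice/qnetd tooling):

```
# On any cluster node: qdevice registration and whether its vote is active.
corosync-qdevice-tool -s

# On the qnetd host (currently PBS): list connected cluster nodes.
corosync-qnetd-tool -l
```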
**Underlying:** PBS has crash-rebooted 34 times since April 21, 2026 — roughly every
1–3 days. Each crash is correlated with USB ZFS pool import failure (`usb1-zfs`, `usb2-zfs`)
on boot:

```
systemd: Failed to start zfs-import@usb1-zfs.service — Import ZFS pool usb1-zfs.
systemd: Failed to start zfs-import@usb2-zfs.service — Import ZFS pool usb2-zfs.
systemd: Failed to start zfs-import-cache.service — Import ZFS pools by cache file.
```

This pattern appeared on Apr 21, 22, 23, and 28, twice on Apr 29, on May 1, and twice on May 5; it is ongoing.
The USB ZFS failure prevents PBS from completing initialization cleanly, causing repeated
reboots. This is the **same issue as C1 in the 2026-05-05 health check** (PBS ZFS pools
offline on boot), and it is the primary driver of both the PBS instability and the downstream
qdevice outages.
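To confirm the pattern on PBS after its next boot (a sketch; assumes the pool names above and standard systemd/ZFS tooling):

```
# Which ZFS import units failed on this boot?
systemctl --failed | grep zfs-import

# USB-level errors around the failure (resets, disconnects, I/O errors).
dmesg | grep -iE "usb|i/o error" | tail -n 50

# Once the drives settle, are the pools importable at all?
zpool import
```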
### Impact on Quorum

**None.** The 6-node Proxmox cluster (beast, compute2–6) uses the `ffsplit` algorithm with
qdevice as a supplementary vote. During each qdevice outage, the cluster operates on its 6
node votes (each with `quorum_votes: 1`) — well above the 3-vote quorum threshold.
No VMs were migrated, fenced, or disrupted during any observed qdevice loss window.
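To spot-check quorum during an outage window (a sketch; `pvecm status` is the standard Proxmox wrapper over the same corosync data):

```
# Votes, quorum threshold, and whether the qdevice vote is currently counted.
pvecm status

# Lower-level view straight from corosync.
corosync-quorumtool -s
```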
### Why No Code Change (This Phase)

Fixing the symptom (qdevice failures) without fixing the cause (PBS USB ZFS instability)
would mask the underlying problem. The correct fix is PBS ZFS reliability, not qdevice
configuration. Once PBS is stable, the qdevice outages will stop.
### Recommendations for Phase 5

1. **PBS USB ZFS issue (C1):** Resolve the recurrent USB ZFS import failures on PBS. Likely
   causes: USB cable quality, USB hub power delivery, or stale ZFS import-cache entries.
   Check: run `zpool import` manually after connecting the drives, inspect `dmesg` for USB
   errors, and verify cable integrity. Once fixed, PBS will stop crash-rebooting.

2. **Move qnetd from PBS to the Pi4 (.227):** Even after the USB ZFS issue is fixed, hosting
   qnetd on a backup server that must be rebooted for updates/maintenance means every PBS
   maintenance window causes a transient qdevice outage. The Pi4 is stable, low-load,
   and never needs cluster-coordinated reboots. Migration steps:

   ```
   # On PBS: stop and disable qnetd
   systemctl stop corosync-qnetd && systemctl disable corosync-qnetd

   # On Pi4: install and configure corosync-qnetd
   apt install corosync-qnetd && systemctl enable --now corosync-qnetd

   # On all cluster nodes: update /etc/corosync/corosync.conf
   # (change quorum.device.net.host from 192.168.99.153 to 192.168.99.227).
   # Then: pvecm qdevice setup 192.168.99.227 (or a manual corosync config reload)
   ```

   This is a 30-minute operation requiring a cluster config reload (no downtime, no VM migration).