docs: Phase 4C — root cause for W11 (xtables) and W12 (qdevice)

W11 (compute6 pve-firewall xtables conflicts): race condition between
pve-firewall's iptables_restore (no -w flag) and Docker's iptables
backend. Self-resolves within 10-120s; no firewall gap; no action taken.
Conflicts align to :58 minute mark = Docker daemon start time at last
compute6 boot (Apr 19 12:58).

W12 (compute2 corosync-qdevice boot failures): qnetd on PBS (.153);
PBS has crash-rebooted 34 times since Apr 21, each correlated with USB
ZFS import failure (same as C1, May 5 health check). Each PBS reboot
drops qnetd for ~60s; cluster quorum unaffected (6 node votes, threshold
4). No action this phase. Phase 5 items: fix PBS USB ZFS instability;
optionally migrate qnetd from PBS to Pi4 (.227).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tommy
2026-05-06 19:11:37 -05:00
parent 38fa22d444
commit 545d3563b8

# Root Cause: W11 (pve-firewall xtables conflicts, compute6) + W12 (corosync-qdevice boot failures, compute2)
Investigated: 2026-05-06
Status: **Both benign. No code change made.** Follow-up items flagged for Phase 5.

---
## W11 — pve-firewall xtables lock conflicts on compute6
### Symptom
Recurring log entries on compute6 from pve-firewall daemon (PID 2924):
```
pve-firewall[2924]: status update error: iptables_restore_cmdlist:
Another app is currently holding the xtables lock.
Perhaps you want to use the -w option?
```
Observed: Apr 20 (two ~2-minute bursts, at 00:58 and 12:58), Apr 21, Apr 29, May 6.
Every occurrence falls on the same wall-clock minute (:58) as compute6's last boot (Apr 19 12:58).
### Root Cause
Race condition between two independent iptables consumers sharing the xtables lock:
1. **pve-firewall** — runs a 10-second status update loop using `iptables_restore` (legacy backend).
Does **not** pass the `-w` wait flag, so it fails immediately if the lock is held.
2. **Docker daemon** — configured with `Firewall Backend: iptables` (default). Holds the xtables lock
during container network operations and periodic internal network consistency verification.
The Docker daemon started at Apr 19 12:58:14 (same boot as compute6), and its periodic
iptables housekeeping is timed from daemon start — explaining why conflicts align with the
:58 minute mark.
When Docker briefly holds the xtables lock, pve-firewall's next 10-second status cycle fails.
pve-firewall retries on the following cycle (10 seconds later) and succeeds once Docker releases.
Maximum observed hold time: ~2 minutes (11 consecutive pve-firewall failures).
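The :58 alignment claimed above can be verified by bucketing the error timestamps by wall-clock minute. A minimal sketch, run here against inlined sample lines (on compute6, substitute `journalctl -u pve-firewall -o short-iso | grep 'xtables lock'` for the sample data):
```
# Sample short-iso journal lines standing in for live journalctl output
logs='2026-04-20T00:58:01 compute6 pve-firewall[2924]: xtables lock error
2026-04-20T00:58:11 compute6 pve-firewall[2924]: xtables lock error
2026-04-20T12:58:05 compute6 pve-firewall[2924]: xtables lock error'

# Bucket by the minute field of the ISO timestamp; a spike at a single minute
# value (:58 here) is what ties the conflicts to the Docker daemon start time
printf '%s\n' "$logs" |
  awk '{ split($1, t, ":"); count[t[2]]++ } END { for (m in count) print ":" m, count[m], "errors" }'
```
Prints `:58 3 errors` for the sample above.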
### Impact
**None.** pve-firewall successfully applies its rules on the next retry within 10–120 seconds.
No firewall rules are left in an unapplied state. Cluster traffic and container networking are
unaffected during the lock contention window.
### Why No Code Change
The two mitigation paths both have worse trade-offs than the current behavior:
- **`--iptables=false` in Docker daemon.json** — disables Docker NAT and masquerade rules entirely;
containers lose internet access unless a parallel nftables setup is built. High risk for minimal
gain on a node running 9 services.
- **PVE pve-firewall patch to add `-w`** — not a user-configurable option; would require a PVE
package patch or override. PVE may address this in a future release.
Current behavior (retry within 10s, no sustained outage) is acceptable.
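For reference only, the rejected Docker-side change would be this `/etc/docker/daemon.json` fragment (shown to document the trade-off; deliberately **not** applied):
```
{
  "iptables": false
}
```
Dropping this in place removes Docker's NAT/masquerade rule management on the next daemon restart, which is exactly the container-connectivity breakage described above.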
### Monitoring
No alert added. The conflict is logged to syslog and visible in `journalctl -u pve-firewall`.
If conflicts become sustained (>5 minutes without resolution), investigate Docker daemon state.
### Future Watch
If compute6 is rebooted, conflicts will initially appear more frequently (Docker and pve-firewall
both re-synchronizing), then stabilize. No action needed.

---
## W12 — corosync-qdevice boot connection failures on compute2
### Symptom
At each PBS reboot, corosync-qdevice on cluster nodes (observed on compute2) logs:
```
corosync-qdevice[1240]: Can't connect to qnetd host. (-5986): Network address not available
corosync-qdevice[1240]: Connect timeout (repeating every ~8s for 1–5 minutes)
```
Observed on May 5 at 01:09, 05:33, and 19:17; on May 6 at 11:04 (compute2 own reboot).
### Root Cause (Two Layers)
**Proximate:** `corosync-qnetd` is hosted on PBS (192.168.99.153). Every time PBS reboots,
qnetd goes offline for 30–60 seconds during PBS's boot sequence. corosync-qdevice on all
cluster nodes loses its qdevice connection during this window, then auto-reconnects once
qnetd is listening again.
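During such a window the disconnect is visible from both ends with the stock corosync tooling (diagnostic sketch; run manually, output omitted):
```
# On a cluster node (e.g. compute2): qdevice state, including qnetd connectivity
corosync-qdevice-tool -sv
# On the qnetd host (PBS, .153): list cluster nodes currently connected
corosync-qnetd-tool -l
```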
**Underlying:** PBS has crash-rebooted 34 times since April 21, 2026 — roughly every
1–3 days. Each crash is correlated with USB ZFS pool import failure (`usb1-zfs`, `usb2-zfs`)
on boot:
```
systemd: Failed to start zfs-import@usb1-zfs.service — Import ZFS pool usb1-zfs.
systemd: Failed to start zfs-import@usb2-zfs.service — Import ZFS pool usb2-zfs.
systemd: Failed to start zfs-import-cache.service — Import ZFS pools by cache file.
```
This pattern appeared on Apr 21, 22, 23, 28, 29 (twice), May 1, and May 5 (twice), and is ongoing.
The USB ZFS failure prevents PBS from completing initialization cleanly, causing repeated
reboots. This is the **same issue as C1 in the 2026-05-05 health check** (PBS ZFS pools
offline on boot). It is the primary driver of both the PBS instability and the downstream
qdevice outages.
### Impact on Quorum
**None.** The 6-node Proxmox cluster (beast, compute2–6) uses the `ffsplit` algorithm with
qdevice as a supplementary vote. During each qdevice outage, the cluster operates on 6
node votes (each with `quorum_votes: 1`) — still above the 4-vote majority threshold (half of 7 expected votes, plus one).
No VMs were migrated, fenced, or disrupted during any observed qdevice loss window.
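The vote arithmetic behind that claim can be sanity-checked with the standard votequorum majority rule (vote counts taken from this section):
```
# 6 node votes (quorum_votes: 1 each) plus 1 qdevice vote when qnetd is up
node_votes=6
qdevice_votes=1
expected=$(( node_votes + qdevice_votes ))   # 7 expected votes
quorum=$(( expected / 2 + 1 ))               # majority threshold: 4
# During a qdevice outage the cluster retains 6 votes, still >= 4: quorate
echo "expected=$expected quorum=$quorum votes_during_outage=$node_votes"
```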
### Why No Code Change (This Phase)
Fixing the symptom (qdevice failures) without fixing the cause (PBS USB ZFS instability)
would mask the underlying problem. The correct fix is PBS ZFS reliability, not qdevice
configuration. Once PBS is stable, qdevice outages will stop.
### Recommendations for Phase 5
1. **PBS USB ZFS issue (C1):** Resolve the recurrent USB ZFS import failures on PBS. Likely
causes: USB cable quality, USB hub power delivery, or stale entries in the ZFS import cache.
Check: `zpool import` manually after connecting drives, inspect `dmesg` for USB errors,
verify cable integrity. Once fixed, PBS will stop crash-rebooting.
2. **Move qnetd from PBS to Pi4 (.227):** Even after fixing the USB ZFS issue, hosting
qnetd on a backup server that must be rebooted for updates/maintenance means every PBS
maintenance window causes a transient qdevice outage. The Pi4 is stable, low-load,
and never needs cluster-coordinated reboots. Migration steps:
```
# On Pi4: install the replacement qnetd daemon first, so it is ready before cutover
apt install corosync-qnetd && systemctl enable --now corosync-qnetd
# On one cluster node: deregister the PBS qdevice, then register the Pi4
# (pvecm updates quorum.device.net.host from .153 to .227 in corosync.conf
# cluster-wide and reloads corosync; requires root SSH to the Pi4)
pvecm qdevice remove
pvecm qdevice setup 192.168.99.227
# On PBS: stop and disable the now-unused qnetd
systemctl stop corosync-qnetd && systemctl disable corosync-qnetd
```
This is a 30-minute operation requiring a cluster config reload (no downtime, no VM migration).
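The C1 checks in item 1 can be made concrete as a first-pass diagnostic for the Phase 5 session on PBS (a sketch; pool and unit names from this doc, run as root after a failed boot):
```
# USB-level instability around the failed imports
dmesg | grep -iE 'usb.*(reset|disconnect|error)'
# Pools visible but not imported, then a manual import attempt
zpool import
zpool import usb1-zfs && zpool import usb2-zfs
# State of the failed units and the cache file they depend on
systemctl status zfs-import@usb1-zfs.service zfs-import-cache.service
# If the cachefile is stale, regenerate it after a successful manual import
zpool set cachefile=/etc/zfs/zpool.cache usb1-zfs
```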