docs: Phase 4C — root cause for W11 (xtables) and W12 (qdevice)
W11 (compute6 pve-firewall xtables conflicts): race condition between pve-firewall's
iptables_restore (no -w flag) and Docker's iptables backend. Self-resolves within
10-120s; no firewall gap; no action taken. Conflicts align to the :58 minute mark,
the Docker daemon start time at the last compute6 boot (Apr 19 12:58).

W12 (compute2 corosync-qdevice boot failures): qnetd runs on PBS (.153); PBS has
crash-rebooted 34 times since Apr 21, each correlated with USB ZFS import failure
(same as C1, May 5 health check). Each PBS reboot drops qnetd for ~60s; cluster
quorum unaffected (6 node votes, threshold 3). No action this phase.

Phase 5 items: fix PBS USB ZFS instability; optionally migrate qnetd from PBS to
Pi4 (.227).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Root Cause: W11 (pve-firewall xtables conflicts, compute6) + W12 (corosync-qdevice boot failures, compute2)
Investigated: 2026-05-06
Status: **Both benign. No code change made.** Follow-up items flagged for Phase 5.

---

## W11 — pve-firewall xtables lock conflicts on compute6
### Symptom

Recurring log entries on compute6 from the pve-firewall daemon (PID 2924):

```
pve-firewall[2924]: status update error: iptables_restore_cmdlist:
Another app is currently holding the xtables lock.
Perhaps you want to use the -w option?
```

Observed: Apr 20 (two ~2-minute bursts at 00:58 and 12:58), Apr 21, Apr 29, May 6.
All occurrences fall on the same wall-clock minute (:58) at which compute6 last booted (Apr 19 12:58).
### Root Cause

A race condition between two independent iptables consumers sharing the xtables lock:

1. **pve-firewall** — runs a 10-second status update loop using `iptables_restore` (legacy backend).
   It does **not** pass the `-w` wait flag, so it fails immediately if the lock is held.

2. **Docker daemon** — configured with `Firewall Backend: iptables` (default). It holds the xtables lock
   during container network operations and periodic internal network-consistency checks.
   The Docker daemon started at Apr 19 12:58:14 (the same boot as compute6), and its periodic
   iptables housekeeping is timed from daemon start — explaining why conflicts align with the
   :58 minute mark.

When Docker briefly holds the xtables lock, pve-firewall's next 10-second status cycle fails.
pve-firewall retries on the following cycle (10 seconds later) and succeeds once Docker releases
the lock. Maximum observed hold time: ~2 minutes (11 consecutive pve-firewall failures).
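A quick way to re-verify this correlation on compute6 (a sketch; assumes the standard journalctl short output format, and that this Docker version reports its firewall backend in `docker info`):

```
# Bucket the xtables-lock failures by wall-clock HH:MM (expect :58 to dominate).
journalctl -u pve-firewall --since "2026-04-19" | grep "xtables lock" \
  | awk '{print substr($3, 1, 5)}' | sort | uniq -c

# Docker daemon start time (should line up with the :58 pattern).
systemctl show docker --property=ActiveEnterTimestamp

# Confirm Docker is on the iptables backend.
docker info 2>/dev/null | grep -i firewall
```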
### Impact

**None.** pve-firewall successfully applies its rules on the next retry, within 10–120 seconds.
No firewall rules are left in an unapplied state. Cluster traffic and container networking are
unaffected during the lock-contention window.
### Why No Code Change

Both mitigation paths have worse trade-offs than the current behavior:

- **`--iptables=false` in Docker's daemon.json** — disables Docker's NAT and masquerade rules entirely;
  containers lose internet access unless a parallel nftables setup is built. High risk for minimal
  gain on a node running 9 services.

- **Patching pve-firewall to add `-w`** — not a user-configurable option; it would require a PVE
  package patch or override. PVE may address this in a future release.

Current behavior (retry within 10s, no sustained outage) is acceptable.
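For reference only, the rejected Docker-side change would look like the following (`"iptables": false` is the daemon.json equivalent of `--iptables=false`; shown to document the trade-off, not to apply):

```
# /etc/docker/daemon.json -- NOT applied; would break container NAT as described above.
# {
#   "iptables": false
# }
```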
### Monitoring

No alert added. The conflict is logged to syslog and visible in `journalctl -u pve-firewall`.
If conflicts become sustained (>5 minutes without resolution), investigate Docker daemon state.
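A one-off check for a sustained window (a sketch; the 5-minute threshold mirrors the guidance above):

```
# Count xtables-lock failures in the last 5 minutes. A non-zero count that keeps
# growing across repeated runs suggests Docker is wedged, not transiently busy.
journalctl -u pve-firewall --since "5 min ago" | grep -c "xtables lock"
```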
### Future Watch

If compute6 is rebooted, conflicts will initially appear more frequently (Docker and pve-firewall
both re-synchronizing), then stabilize. No action needed.

---

## W12 — corosync-qdevice boot connection failures on compute2
### Symptom

At each PBS reboot, corosync-qdevice on the cluster nodes (observed on compute2) logs:

```
corosync-qdevice[1240]: Can't connect to qnetd host. (-5986): Network address not available
corosync-qdevice[1240]: Connect timeout (repeating every ~8s for 1–5 minutes)
```

Observed on May 5 at 01:09, 05:33, and 19:17, and on May 6 at 11:04 (compute2's own reboot).
### Root Cause (Two Layers)

**Proximate:** `corosync-qnetd` is hosted on PBS (192.168.99.153). Every time PBS reboots,
qnetd goes offline for 30–60 seconds during PBS's boot sequence. corosync-qdevice on all
cluster nodes loses its qdevice connection during this window, then auto-reconnects once
qnetd is listening again.
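The connection state can be confirmed from both ends (a sketch; assumes the stock corosync qdevice/qnetd tooling):

```
# On any cluster node: qdevice registration and whether its vote is active.
corosync-qdevice-tool -s

# On the qnetd host (currently PBS): list connected cluster nodes.
corosync-qnetd-tool -l
```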
**Underlying:** PBS has crash-rebooted 34 times since April 21, 2026 — roughly every
1–3 days. Each crash is correlated with USB ZFS pool import failure (`usb1-zfs`, `usb2-zfs`)
on boot:

```
systemd: Failed to start zfs-import@usb1-zfs.service — Import ZFS pool usb1-zfs.
systemd: Failed to start zfs-import@usb2-zfs.service — Import ZFS pool usb2-zfs.
systemd: Failed to start zfs-import-cache.service — Import ZFS pools by cache file.
```

This pattern appeared on Apr 21, 22, 23, and 28, twice on Apr 29, on May 1, and twice on May 5; it is ongoing.
The USB ZFS failure prevents PBS from completing initialization cleanly, causing repeated
reboots. This is the **same issue as C1 in the 2026-05-05 health check** (PBS ZFS pools
offline on boot), and it is the primary driver of both the PBS instability and the downstream
qdevice outages.
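To confirm the pattern on PBS after its next boot (a sketch; assumes the pool names above and standard systemd/ZFS tooling):

```
# Which ZFS import units failed on this boot?
systemctl --failed | grep zfs-import

# USB-level errors around the failure (resets, disconnects, I/O errors).
dmesg | grep -iE "usb|i/o error" | tail -n 50

# Once the drives settle, are the pools importable at all?
zpool import
```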
### Impact on Quorum

**None.** The 6-node Proxmox cluster (beast, compute2–6) uses the `ffsplit` algorithm with
qdevice as a supplementary vote. During each qdevice outage, the cluster operates on its 6
node votes (each with `quorum_votes: 1`) — well above the 3-vote quorum threshold.
No VMs were migrated, fenced, or disrupted during any observed qdevice loss window.
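To spot-check quorum during an outage window (a sketch; `pvecm status` is the standard Proxmox wrapper over the same corosync data):

```
# Votes, quorum threshold, and whether the qdevice vote is currently counted.
pvecm status

# Lower-level view straight from corosync.
corosync-quorumtool -s
```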
### Why No Code Change (This Phase)

Fixing the symptom (qdevice failures) without fixing the cause (PBS USB ZFS instability)
would mask the underlying problem. The correct fix is PBS ZFS reliability, not qdevice
configuration. Once PBS is stable, the qdevice outages will stop.
### Recommendations for Phase 5

1. **PBS USB ZFS issue (C1):** Resolve the recurrent USB ZFS import failures on PBS. Likely
   causes: USB cable quality, USB hub power delivery, or stale ZFS import-cache entries.
   Check: run `zpool import` manually after connecting the drives, inspect `dmesg` for USB
   errors, and verify cable integrity. Once fixed, PBS will stop crash-rebooting.

2. **Move qnetd from PBS to the Pi4 (.227):** Even after the USB ZFS issue is fixed, hosting
   qnetd on a backup server that must be rebooted for updates/maintenance means every PBS
   maintenance window causes a transient qdevice outage. The Pi4 is stable, low-load,
   and never needs cluster-coordinated reboots. Migration steps:

   ```
   # On PBS: stop and disable qnetd
   systemctl stop corosync-qnetd && systemctl disable corosync-qnetd

   # On Pi4: install and configure corosync-qnetd
   apt install corosync-qnetd && systemctl enable --now corosync-qnetd

   # On all cluster nodes: update /etc/corosync/corosync.conf
   # (change quorum.device.net.host from 192.168.99.153 to 192.168.99.227).
   # Then: pvecm qdevice setup 192.168.99.227 (or a manual corosync config reload)
   ```

   This is a 30-minute operation requiring a cluster config reload (no downtime, no VM migration).