Root Cause: W11 (pve-firewall xtables conflicts, compute6) + W12 (corosync-qdevice boot failures, compute2)

Investigated: 2026-05-06
Status: Both benign. No code change made. Follow-up items flagged for Phase 5.


W11 — pve-firewall xtables lock conflicts on compute6

Symptom

Recurring log entries on compute6 from pve-firewall daemon (PID 2924):

pve-firewall[2924]: status update error: iptables_restore_cmdlist:
  Another app is currently holding the xtables lock.
  Perhaps you want to use the -w option?

Observed: Apr 20 (two ~2-minute bursts at 00:58 and 12:58), Apr 21, Apr 29, May 6. All occurrences fall on the same wall-clock minute (:58) as compute6's last boot (Apr 19 12:58).
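
One way to verify the :58 alignment from the node's journal (illustrative; assumes the default journalctl short output format, and the --since date should be adjusted as needed):

    # Bucket the xtables-lock errors by wall-clock minute
    journalctl -u pve-firewall --since "2026-04-19" \
      | grep "holding the xtables lock" \
      | awk '{print $3}' | cut -d: -f2 | sort | uniq -c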

Root Cause

Race condition between two independent iptables consumers sharing the xtables lock:

  1. pve-firewall — runs a 10-second status update loop using iptables_restore (legacy backend). Does not pass the -w wait flag, so it fails immediately if the lock is held.

  2. Docker daemon — configured with Firewall Backend: iptables (default). Holds the xtables lock during container network operations and periodic internal network consistency verification. The Docker daemon started at Apr 19 12:58:14 (same boot as compute6), and its periodic iptables housekeeping is timed from daemon start — explaining why conflicts align with the :58 minute mark.

When Docker briefly holds the xtables lock, pve-firewall's next 10-second status cycle fails. pve-firewall retries on the following cycle (10 seconds later) and succeeds once Docker releases. Maximum observed hold time: ~2 minutes (11 consecutive pve-firewall failures).
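
A few read-only checks that support this diagnosis (illustrative; the Firewall Backend field only appears on Docker releases that report it):

    # dockerd start time; lock contention should cluster around this minute
    systemctl show docker --property=ActiveEnterTimestamp
    # Docker's firewall backend, where reported
    docker info 2>/dev/null | grep -i 'firewall backend'
    # Recent pve-firewall lock failures, for correlation
    journalctl -u pve-firewall --since "-2h" | grep "xtables lock"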

Impact

None. pve-firewall successfully applies its rules on the next retry within 10-120 seconds. No firewall rules are left in an unapplied state. Cluster traffic and container networking are unaffected during the lock contention window.
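
To spot-check that nothing was left unapplied after a contention burst (standard PVE tooling; output format may vary by PVE version):

    # Daemon status (should report enabled/running)
    pve-firewall status
    # Print the compiled ruleset for manual comparison with the loaded rules
    pve-firewall compile | head -20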

Why No Code Change

The two mitigation paths both have worse trade-offs than the current behavior:

  • Set "iptables": false in Docker's daemon.json (equivalent to --iptables=false on dockerd) — this disables Docker's NAT and masquerade rules entirely; containers lose internet access unless a parallel nftables setup is built. High risk for minimal gain on a node running 9 services.

  • Patching pve-firewall to pass -w — the wait flag is not exposed as a user-configurable option, so this would require carrying a PVE package patch or local override. PVE may address this in a future release.

Current behavior (retry within 10s, no sustained outage) is acceptable.

Monitoring

No alert added. The conflict is logged to syslog and visible in journalctl -u pve-firewall. If conflicts become sustained (>5 minutes without resolution), investigate Docker daemon state.
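
A rough check for the "sustained" case (a sketch: with a 10-second status cycle, much more than ~30 failures in a 10-minute window means the lock was held for over 5 minutes):

    journalctl -u pve-firewall --since "-10min" | grep -c "holding the xtables lock"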

Future Watch

If compute6 is rebooted, conflicts will initially appear more frequently (Docker and pve-firewall both re-synchronizing), then stabilize. No action needed.


W12 — corosync-qdevice boot connection failures on compute2

Symptom

At each PBS reboot, corosync-qdevice on cluster nodes (observed on compute2) logs:

corosync-qdevice[1240]: Can't connect to qnetd host. (-5986): Network address not available
corosync-qdevice[1240]: Connect timeout (repeating every ~8s for 15 minutes)

Observed on May 5 at 01:09, 05:33, and 19:17; on May 6 at 11:04 (compute2's own reboot).
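
To inspect the qdevice state from the cluster side during one of these windows (standard corosync/PVE tooling):

    # Is the local qdevice daemon connected to qnetd?
    corosync-qdevice-tool -s
    # Cluster-wide quorum view, including the Qdevice entry
    pvecm status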

Root Cause (Two Layers)

Proximate: corosync-qnetd is hosted on PBS (192.168.99.153). Every time PBS reboots, qnetd goes offline for 30-60 seconds during PBS's boot sequence. corosync-qdevice on all cluster nodes loses its qdevice connection during this window, then auto-reconnects once qnetd is listening again.

Underlying: PBS has crash-rebooted 34 times since April 21, 2026, with crash days recurring roughly every 1-3 days. Each crash is correlated with USB ZFS pool import failure (usb1-zfs, usb2-zfs) on boot:

systemd: Failed to start zfs-import@usb1-zfs.service — Import ZFS pool usb1-zfs.
systemd: Failed to start zfs-import@usb2-zfs.service — Import ZFS pool usb2-zfs.
systemd: Failed to start zfs-import-cache.service — Import ZFS pools by cache file.

This pattern appeared on Apr 21, 22, 23, 28, 29 (×2), May 1, and May 5 (×2), and is ongoing. The USB ZFS failure prevents PBS from completing initialization cleanly, causing repeated reboots. This is the same issue as C1 in the 2026-05-05 health check (PBS ZFS pools offline on boot). It is the primary driver of both the PBS instability and the downstream qdevice outages.
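
To quantify the reboot loop and the failed imports on PBS itself (assumes a persistent journal; pool names are the ones logged above):

    # Boots recorded by the journal, with timestamps
    journalctl --list-boots | tail -15
    # zfs-import units that failed on the current boot
    systemctl --failed --no-legend | grep zfs-import
    # Whether the USB pools are currently imported
    zpool list usb1-zfs usb2-zfs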

Impact on Quorum

None. The 6-node Proxmox cluster (beast, compute2-6) uses the ffsplit algorithm with the qdevice as a supplementary vote. During each qdevice outage, the cluster operates on 6 node votes (each with quorum_votes: 1) — well above the 3-vote quorum threshold. No VMs were migrated, fenced, or disrupted during any observed qdevice loss window.
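
To confirm this from any node while qnetd is down (read-only checks):

    # PVE-level view; should still report Quorate: Yes on node votes alone
    pvecm status
    # Votequorum view directly from corosync
    corosync-quorumtool -s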

Why No Code Change (This Phase)

Fixing the symptom (qdevice failures) without fixing the cause (PBS USB ZFS instability) would mask the underlying problem. The correct fix is PBS ZFS reliability, not qdevice configuration. Once PBS is stable, qdevice outages will stop.

Recommendations for Phase 5

  1. PBS USB ZFS issue (C1): Resolve the recurrent USB ZFS import failures on PBS. Likely causes: USB cable quality, USB hub power delivery, or stale ZFS import cache entries. Check: run zpool import manually after connecting the drives, inspect dmesg for USB errors, verify cable integrity (see the diagnostic sketch after this list). Once fixed, PBS will stop crash-rebooting.

  2. Move qnetd from PBS to Pi4 (.227): Even after fixing the USB ZFS issue, hosting qnetd on a backup server that must be rebooted for updates/maintenance means every PBS maintenance window causes a transient qdevice outage. The Pi4 is stable, low-load, and never needs cluster-coordinated reboots. Migration steps:

    # On PBS: stop and disable qnetd
    systemctl stop corosync-qnetd && systemctl disable corosync-qnetd
    # On Pi4: install and configure corosync-qnetd
    apt install corosync-qnetd && systemctl enable --now corosync-qnetd
    # On all cluster nodes: update /etc/corosync/corosync.conf
    # Change quorum.device.net.host from 192.168.99.153 to 192.168.99.227
    # Then: pvecm qdevice setup 192.168.99.227 (or manual corosync config reload)
    

    This is a 30-minute operation requiring a cluster config reload (no downtime, no VM migration).
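
For item 1, a diagnostic starting point on PBS (a sketch only; pool names come from the logs above, and the physical cause still needs to be confirmed by hand):

    # Scan for importable pools without importing them
    zpool import
    # Look for USB disconnects/resets around the failed import
    dmesg | grep -iE 'usb|reset' | tail -40
    # Manual import once the drives have settled
    zpool import usb1-zfs && zpool import usb2-zfs
    # If manual import is reliable, refresh the cache file used by the boot-time import units
    zpool set cachefile=/etc/zfs/zpool.cache usb1-zfs
    zpool set cachefile=/etc/zfs/zpool.cache usb2-zfs

After the migration in item 2, the corosync-qdevice-tool -s and pvecm status checks from the W12 Symptom section can confirm that the qdevice has re-registered against the Pi4.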