Phase 5 — Incident Log / State Drift
Last updated: 2026-05-06
Purpose: Consolidates real findings from the Phase 4 work session. Input to Phase 5 planning and repair.
Active Incidents
INC-005 — NUT upssched earlyshutdown Bug + May 6 Cluster Shutdown (FIXED)
Discovered: 2026-05-06 during Phase 5 health-check investigation
Fixed: 2026-05-06
Class: Same as HAOS auto-update (Phase 1.1) — autonomous mechanism causing user-impacting outage without warning
What happened:
At 10:53:42 CDT on May 6, a brief real power disturbance caused both UPS units to switch to battery (OB) for 14–16 seconds. Line power was restored within ~25 seconds. Despite power returning cleanly, Beast's upssched fired a forced shutdown (FSD) 30 seconds after the initial OB event, initiating a clean cluster-wide shutdown. All 5 nodes (beast, compute3, compute5, media-server, docker-node02) went offline and required manual power-on ~7.5 hours later.
Root cause — upssched.conf bug (pre-existing since NUT install):
# OLD — broken:
AT ONBATT * START-TIMER earlyshutdown 30 # starts 30s FSD timer on any OB event
AT ONLINE * CANCEL-TIMER onbatt online # cancels "onbatt" timer — NOT earlyshutdown
# No AT ONLINE cancel for "earlyshutdown" existed → timer runs to completion regardless of power restoration
After 30 seconds, upssched-cmd called upsmon -c fsd, propagating FSD to all NUT slave nodes. The earlyshutdown timer was never cancelled because the cancel rule referenced the wrong timer name ("onbatt" vs "earlyshutdown"). Any OB event lasting more than 0 seconds would fire FSD 30 seconds later unconditionally.
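For reference, a minimal sketch of the handler pattern involved, assuming the usual case-on-event-name structure; the archived beast/nut/upssched-cmd is the authoritative copy:
#!/bin/sh
# upssched-cmd sketch: upssched invokes CMDSCRIPT with the timer/event name as $1
case "$1" in
    earlyshutdown)
        # Timer ran to completion: tell upsmon to begin forced shutdown (FSD)
        logger -t upssched-cmd "earlyshutdown timer expired, initiating FSD"
        /sbin/upsmon -c fsd
        ;;
    *)
        logger -t upssched-cmd "unhandled event: $1"
        ;;
esac
With the old config, nothing ever cancelled the earlyshutdown timer, so the FSD branch always ran 30 seconds after any OB event.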
Important: battery quick tests do NOT trigger this bug. CyberPower CP1500 quick battery tests produce OL DISCHRG (on-line, discharging) status, not OB (on battery). The upssched AT ONBATT rule does not fire on OL DISCHRG. The May 6 10:53 event was a real power disturbance, not a test artifact.
Why latent: The bug required an actual OB event. Power has been clean for at least the prior 17-day boot window on compute5 (no historical FSD events found in journal). The first observed trigger was the May 6 power disturbance.
Fix applied (Beast, 2026-05-06):
# NEW — fixed:
AT ONBATT * START-TIMER earlyshutdown 300 # extended: 5 min before FSD (batteries hold 22–66 min)
AT ONLINE * CANCEL-TIMER earlyshutdown online # NEW: cancels timer when power returns
Config backed up as /etc/nut/upssched.conf.bak. Files committed to homelab-configs/beast/nut/.
Also fixed (same session): ups-quarterly-test.sh rewritten to run the two UPS tests sequentially (not simultaneously), poll for OL restoration with a 120s timeout, hold a 35s OL buffer before the second test, send an ntfy alert on timeout, and abort rather than test the second UPS if the first doesn't return to OL. This was correct practice even though battery tests don't produce OB status — the sequential design is safer against any firmware variation and future UPS model changes.
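A hedged sketch of the poll-for-OL pattern described above (UPS names from this log; the credential variable and ntfy topic are placeholders; the committed ups-quarterly-test.sh is authoritative):
# Wait up to 120s for a clean OL status (quick tests report OL DISCHRG while running)
wait_for_ol() {
    ups="$1"; waited=0
    while [ "$waited" -lt 120 ]; do
        status=$(upsc "$ups" ups.status 2>/dev/null)
        case "$status" in
            *DISCHRG*|*OB*) ;;   # still discharging or on battery: keep polling
            OL*) return 0 ;;     # clean on-line status restored
        esac
        sleep 5; waited=$((waited + 5))
    done
    return 1
}
upscmd -u admin -p "$UPS_PASS" cyberpower1 test.battery.start.quick
if ! wait_for_ol cyberpower1; then
    curl -s -d "cyberpower1 did not return to OL within 120s; aborting second UPS test" ntfy.sh/homelab-alerts >/dev/null
    exit 1
fi
sleep 35    # OL buffer before starting the cyberpower2 test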
Break-test result:
Battery quick test on cyberpower2 produced OL DISCHRG — no ONBATT event, no earlyshutdown timer, no FSD. Cluster remained up. An OB-producing break-test cannot be safely replicated without a real power event. The cancel logic is verified by code review: AT ONLINE * CANCEL-TIMER earlyshutdown online is unambiguous NUT upssched syntax.
Compute5 unsafe shutdown cross-reference:
56 unsafe shutdowns on nvme0 (WD PC SN740) over 1,407h do NOT primarily reflect NUT FSD history. Only 1 FSD event found in journal history (today's). The elevated counts reflect: (a) pre-journal install period reboots, (b) PCIe power management on an i7-13700T platform. The SK Hynix PC711 (nvme1) has 2,362 power cycles — a separate PCIe runtime PM compatibility issue; platform quirk: simple suspend now applied by kernel.
P5-11 added: Compute5 SK Hynix PC711 PCIe power management — verify simple suspend quirk persists across reboots. See below.
INC-001 — PBS USB Hub Failure (ACTIVE — hardware fix in progress)
First event: March 12, 2026 (earliest confirmed crash in journal)
Crash count: 34+ unclean reboots since April 21; ~50+ total since March 12
Current state: usb1-zfs SUSPENDED with 4 data errors / 24 write errors; usb2-zfs ONLINE. PBS up ~24h on current boot; the hung agents task has been cleared.
Root cause (confirmed):
A 4-port USB hub at xHCI Bus 2 Port 6 intermittently loses power or signal, dropping all 4 ASM1153E (174c:55aa) USB-to-SATA bridges simultaneously. ZFS detects uncorrectable I/O on usb1-zfs and suspends the pool. The PBS agents process enters D-state waiting for ZFS TXG sync on the suspended pool. After 737+ seconds, the kernel logs a hung task warning. Eventually the system hangs and watchdog-reboots.
The hub reconnects ~21 seconds after dropping, but ZFS remains suspended until manually cleared.
USB topology at failure:
xHCI Bus 2, Port 6 — [FAILING HUB]
├─ Port 1: ASM1153E ACAAEBBB2E5D → sdj → usb1-zfs mirror leg 2
├─ Port 2: ASM1153E ACAAEBBB2E5F → sdg → usb1-zfs mirror leg 1
├─ Port 3: ASM1153E ACAAEBBB2E60 → sdh → usb2-zfs mirror leg 1
└─ Port 4: ASM1153E ACAAEBBB2E5E → sdi → usb2-zfs mirror leg 2
Both pools are behind the same hub. One hub drop affects both pools simultaneously.
UAS quirk already applied: usb-storage.quirks=174c:55aa:u in kernel cmdline disables UAS on the ASM1153E bridges. This was a pre-existing workaround for protocol-level issues but does not address hub power/connection failures.
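To confirm the quirk is in effect on the running kernel (a quick check, not part of the original workaround):
# Quirk should appear on the running kernel command line
grep -o 'usb-storage.quirks=[^ ]*' /proc/cmdline
# The ASM1153E bridges should show Driver=usb-storage (not uas) in the device tree
lsusb -t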
Recovery procedure (manual steps, after hardware fix):
# 1. Confirm drives reconnected after hardware fix
lsblk | grep sd
# 2. Clear the suspended pool
zpool clear usb1-zfs
# 3. Run scrubs on both pools immediately
zpool scrub usb1-zfs
zpool scrub usb2-zfs
# 4. Monitor scrub progress (run ~every 10 min until complete)
zpool status -v
# 5. After scrub completes with 0 errors, run PBS backup verify
# In PBS web UI → Datastore → usb1-zfs → Verify All
# Or via CLI: proxmox-backup-client verify --all
# 6. If scrub finds unrepaired errors, see INC-002
Hardware fix options (ordered by preference):
- Best: Connect each ASMT bridge directly to a motherboard USB 3.0 port (bypass hub entirely). xHCI has 10 downstream ports; only 2 are currently used (Port 2: Toshiba, Port 6: hub). Count available physical USB 3.0 jacks; plug each bridge directly.
- Acceptable: Replace hub with an externally powered USB hub (dedicated power brick, not bus-powered). 4× spinning HDDs exceed typical bus-power budget (900mA), likely causing brownouts.
- Do not: Return drives to any bus-powered hub.
Kernel command line note: The usb-storage.delay_use=10 param is a 10-second delay before using USB storage. This was also a pre-existing workaround and should be left in place post-fix.
INC-002 — PBS Backup Data Integrity Unknown (ACTIVE — verify post-recovery)
Context: usb1-zfs has been suspended on every one of the 34+ crash reboots since April 21. Each suspension may have left partially-written ZFS TXGs. The pool has 4 confirmed data errors in the current boot's suspension event.
Risk: Backup chunks written to usb1-zfs during or after a suspension event may be incomplete or corrupt. The PBS backup catalog may reference chunks that don't exist or are truncated.
Action required post-hardware-fix:
- Run zpool scrub usb1-zfs and verify 0 unrepaired errors.
- Run PBS verify on the usb1-zfs datastore (proxmox-backup-client verify --all, or via web UI).
- Any backup with a verify failure should be noted. If critical VMs have no verified backup, a fresh backup job should be run against a known-good target (Synology-Remote or rust-usb) immediately.
Status: Blocked on hardware fix (INC-001).
INC-003 — PBS Phase 1 C1 Re-Diagnosis
Original diagnosis (Phase 1): usb1-zfs and usb2-zfs fail to import at boot due to stale cachefile. Fix applied: cleared cachefile, added zpool import -f to boot sequence.
Revised diagnosis: The import failures were a downstream symptom. The actual root cause is INC-001 (USB hub drop). The pool is suspended at the time of crash-reboot; importing a suspended pool fails regardless of cachefile state. The cachefile fix helped on some boots by switching to scan-based import, but the underlying hardware problem continued causing the failures.
Lesson: When troubleshooting recurring import failures, always check dmesg for USB disconnect events before the failure. A suspended pool at boot is usually caused by the prior boot's crash, not a cachefile issue.
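A sketch of that check against the previous boot's kernel log, assuming journald persistent storage is enabled on PBS:
# Kernel messages from the boot before the crash, filtered for USB drops and xHCI resets
journalctl -k -b -1 | grep -iE 'usb [0-9.-]+: USB disconnect|xhci.*(reset|died)'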
INC-004 — sdj Drive End-of-Life (PENDING)
Drive: Hitachi HUA723030ALA640, Serial MK0371YVGRZ2AA, /dev/sdj
Pool: usb1-zfs mirror leg 2
Power-on hours: 93,267 (≈10.6 years)
SMART: Passes all checks. No reallocated or pending sectors. 1 UDMA CRC error (USB interface, consistent with hub issue, not drive failure).
Action: Order a replacement 3TB+ drive (HGST Ultrastar or WD Gold recommended for NAS duty). Plan resilver after INC-001 hardware fix and INC-002 scrub pass are both confirmed clean. Do not start a resilver on a hub that may still be marginal.
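When the replacement arrives, the resilver itself is a single zpool replace; a sketch with placeholder device identifiers (use /dev/disk/by-id paths rather than sdX letters, since USB device ordering shifts across reboots):
# Locate the old Hitachi by serial; note its by-id path
ls -l /dev/disk/by-id/ | grep -i 'MK0371YVGRZ2AA'
# Attach the new drive, then replace the old leg; resilver starts automatically
zpool replace usb1-zfs /dev/disk/by-id/<old-hitachi-id> /dev/disk/by-id/<new-drive-id>
zpool status usb1-zfs    # monitor resilver progress until complete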
Phase 4 — Completed Work
Phase 4B — UPS Self-Tests + Alert (completed 2026-05-06)
- Manual tests run: cyberpower1 and cyberpower2 both passed.
- Script /usr/local/bin/ups-quarterly-test.sh updated to write Prometheus textfile metrics (ups_battery_test_result, ups_battery_test_timestamp_seconds); output format sketched after this list.
- Cron installed: /etc/cron.d/ups-quarterly-test — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast.
- Alert group ups_battery_test added: UPSBatteryTestFailed (result == 0, for: 5m), UPSBatteryTestStale (>100 days, for: 6h).
- Break-tested and confirmed.
- Committed: docker-configs and ansible repos.
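For reference, the textfile output the script writes should look roughly like this (file path and labels are assumptions; the committed script defines the actual format):
# e.g. /var/lib/prometheus/node-exporter/ups_battery_test.prom
ups_battery_test_result{ups="cyberpower1"} 1
ups_battery_test_result{ups="cyberpower2"} 1
# value is epoch seconds of the last test run (example value)
ups_battery_test_timestamp_seconds{ups="cyberpower1"} 1778054400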
Phase 4C — Root-Cause Documentation (completed 2026-05-06)
- W11 (compute6 xtables): Race between pve-firewall iptables_restore (no -w flag) and Docker's iptables backend. Self-resolves in 10-120s. No firewall gap. No action taken.
- W12 (compute2 qdevice): qnetd on PBS; PBS crash-rebooting 34+ times since April 21 (same as INC-001). Each reboot drops qnetd for ~60s. Cluster quorum unaffected.
- Documented in runbooks/W11-W12-root-cause-2026-05.md.
Phase 4A — PVE Version Drift (PAUSED — pending PBS stability)
- Scope: Roll beast (9.1.6) and compute5 (9.1.6) to match cluster majority (9.1.7+).
- Procedure: One node per session, 24h soak. Migrate guests off → upgrade → reboot → verify Keepalived/quorum/storage → migrate back.
- Hold until: PBS hardware fix confirmed + usb1-zfs scrub clean + backup verify complete.
Phase 5 — Pending Work Queue
P5-01 — PBS Recovery (blocked on hardware)
Sequence: hardware fix → zpool clear → scrub → PBS verify → confirm stable 48h → resume Phase 4A
P5-02 — sdj Replacement (after P5-01 completes)
Order drive → resilver → verify → remove old sdj
P5-03 — qnetd Migration: PBS → Pi4 (after P5-01 + PBS stability confirmed)
Move corosync-qnetd from PBS (.153) to Pi4 (.227). Steps documented in runbooks/W11-W12-root-cause-2026-05.md. 30-minute operation, no VM downtime, no migration.
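A sketch of the likely command sequence, assuming the cluster uses pvecm's qdevice integration; runbooks/W11-W12-root-cause-2026-05.md remains the authoritative procedure:
# On the Pi4 (.227): install the qnetd daemon
sudo apt install corosync-qnetd
# On any PVE cluster node: detach the old qdevice (PBS) and register the Pi4
# note: setup connects to the qnetd host as root over SSH to exchange certificates
pvecm qdevice remove
pvecm qdevice setup 192.168.99.227
# Confirm the qdevice vote is back
pvecm status | grep -iA3 qdevice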
P5-04 — Authelia Prometheus Metrics + Request-Rate Alert
The C3 issue (May 5 health check: i/o timeouts every 2 min under scanner load) is not caught by the existing ContainerRestartLoop alert. Requires:
- Enable Authelia's Prometheus metrics endpoint (telemetry.metrics in configuration.yml); see the config sketch after this list
- Add Prometheus scrape target for Authelia metrics
- Add alert: AutheliaHighErrorRate — authelia_request_duration_seconds_count{status="408"} rate elevated for >5m
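A hedged sketch of the pieces above. Port 9959 is Authelia's documented default metrics port, the scrape target hostname and alert threshold are placeholders, and the exact metric label names should be confirmed against the live endpoint:
# configuration.yml (Authelia)
telemetry:
  metrics:
    enabled: true
    address: 'tcp://0.0.0.0:9959'

# prometheus.yml scrape target
  - job_name: 'authelia'
    static_configs:
      - targets: ['authelia:9959']

# Alert rule sketch
  - alert: AutheliaHighErrorRate
    expr: rate(authelia_request_duration_seconds_count{status="408"}[5m]) > 0.1
    for: 5m
    labels:
      severity: warning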
P5-05 — TLS Cert Renewal Verification (time-sensitive)
4 certs were in the 30–35d expiry window during the cert break-test (~May 5).
Traefik should auto-renew via ACME when ≤30d remain.
Action schedule:
- May 11: SSH to active Traefik node (docker-node01), inspect acme.json, confirm the 4 certs have new notAfter dates (a jq/openssl decode sketch follows this list).
ssh tommy@192.168.99.186 "cat ~/traefik/acme.json | python3 -c \"
import json, sys
d = json.load(sys.stdin)
for r in d.get('myresolver', {}).get('Certificates', []):
    domain = r['domain']['main']
    # notAfter is base64 DER; use openssl or traefik logs to verify
    print(domain)
\" 2>/dev/null || docker logs traefik --since 168h 2>&1 | grep -i 'renewed\|certificate\|acme'"
- May 18: Repeat check. If any cert still has the original expiry date, force-renew manually.
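If the expiry dates themselves are needed (the sketch above only lists domains), a hedged alternative run on docker-node01, assuming jq is installed and the resolver is named myresolver as above:
# Each Certificates[] entry stores a base64-encoded PEM chain; decode it and read the leaf cert
jq -r '.myresolver.Certificates[].certificate' ~/traefik/acme.json | while read -r cert; do
    echo "$cert" | base64 -d | openssl x509 -noout -subject -enddate
done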
P5-06 — Traefik node02 CrowdSec LAPI (intentional asymmetry — no action)
node02 has no local CrowdSec; it correctly points to node01's LAPI at 192.168.99.186:8081 per Phase 2A design. Not drift. Known limitation: if node01 is down during failover, node02 Traefik has no CrowdSec coverage.
P5-07 — ansible-control Disk Cleanup / Expansion
Disk at 80%; DiskFillPredicted24h firing. Clean up /tmp artifacts and old report files, or expand the LVM.
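A sketch of the triage, assuming an LVM root as described (the LV path is a placeholder; confirm with lvs before extending):
# Find what is actually consuming space on the root filesystem
du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -n 20
# If cleanup is insufficient and the VG has free extents, grow the root LV and filesystem together
sudo lvextend -r -L +5G /dev/mapper/ubuntu--vg-ubuntu--lv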
P5-08 — Pi4 node-exporter Wrong Architecture Binary
SSH works as tommy. DNS (Technitium) and fail2ban both active. node-exporter fails with Exec format error — binary at /usr/local/bin/node_exporter is wrong architecture (x86_64 installed on ARM64). Fix: deploy correct arm64 binary.
VERSION=1.8.2
curl -sL https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-arm64.tar.gz \
| tar -xz --strip-components=1 -C /tmp node_exporter-${VERSION}.linux-arm64/node_exporter
scp /tmp/node_exporter tommy@192.168.99.227:/tmp/node_exporter
ssh tommy@192.168.99.227 "sudo mv /tmp/node_exporter /usr/local/bin/ && sudo chmod 755 /usr/local/bin/node_exporter && sudo systemctl restart node_exporter"
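After the restart, a quick verification (assumes node_exporter's default port 9100 and a systemd unit named node_exporter):
ssh tommy@192.168.99.227 "systemctl is-active node_exporter && /usr/local/bin/node_exporter --version"
curl -s http://192.168.99.227:9100/metrics | head -n 3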
Pi4 hostname is also raspberrypi (clone drift — set to pi4 or similar).
P5-09 — Phase 4A Resume (PVE Version Drift)
After P5-01 completes and PBS is confirmed stable for 48h.
P5-10 — Pi4 node-exporter ARM64 Deploy
See P5-08.
P5-11 — Compute5 SK Hynix PC711 PCIe Power Management
nvme1n1 on compute5 (SK Hynix PC711 1TB, 0000:03:00.0) has 2,362 power cycles and 84 unsafe shutdowns — indicative of PCIe runtime PM aggressively power-cycling the drive. The kernel applied the "platform quirk: setting simple suspend" workaround in the current boot. Verify this persists:
# Check if quirk is active post-reboot:
dmesg | grep -i 'nvme.*simple suspend\|03:00.*quirk'
# If not applied, add to kernel cmdline or create modprobe conf:
# nvme_core.default_ps_max_latency_us=0
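If the quirk does not persist, a hedged way to pin the setting via modprobe config (disables NVMe APST power states, trading power savings for stability; takes effect at next boot):
echo 'options nvme_core default_ps_max_latency_us=0' | sudo tee /etc/modprobe.d/nvme-pm.conf
sudo update-initramfs -u    # rebuild initramfs so the option applies at early boot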
WD PC SN740 (nvme0) on same node has 56 unsafe shutdowns in 1,407h — likely from pre-journal setup period and PCIe PS behavior. No action unless counts accumulate.
Known Config Drift (not incidents — track for cleanup)
| Item | Location | Drift |
|---|---|---|
| Hostname mismatch | docker-node01 (.186) | /etc/hostname = ubuntu-template; should = docker-node01 |
| crowdsecLapiHost | node02 dynamic_conf.yml | Points to 192.168.99.186:8081 (node01 IP); should point to local LAPI |
| Technitium DNS | Pi4 | No internal overrides for goattw.net (relies on Pi-hole as primary) |
| UPS self-test never run | cyberpower1 | Resolved by Phase 4B; now automated |
| PBS ZFS pool feature upgrade | usb1-zfs, usb2-zfs | zpool upgrade deferred until pools are stable post-hardware fix |
| Alertmanager permission errors | media-server | Resolved (C2 May-05) — no action needed |