docs: Phase 5 — incident log, maintenance runbook, NUT config archive

beast/nut/upssched.conf: fixed earlyshutdown timer (30s→300s) and added
  cancel rule. Archived post-fix from Beast.
beast/nut/upssched-cmd: archived for reference.

runbooks/phase5-incident-log.md:
  INC-005: NUT upssched earlyshutdown bug — cause, fix, break-test findings.
  INC-001 through INC-004: PBS USB hub/data/re-diagnosis/sdj end-of-life.
  P5-01 through P5-11: pending work queue including Pi4 node-exporter,
  compute5 PCIe PM, PBS recovery gate, qnetd migration, Authelia metrics,
  cert renewal verification, ansible-control disk, Phase 4A resume.

runbooks/maintenance-runbook.md: PBS recovery steps, UPS test procedure,
  cert renewal schedule (May 11/18), Phase 4A per-node procedure, qnetd
  migration, Authelia metrics setup, recurring checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tommy
2026-05-06 21:25:49 -05:00
parent 545d3563b8
commit 909fe3dc12
4 changed files with 551 additions and 0 deletions

beast/nut/upssched-cmd Normal file

@@ -0,0 +1,20 @@
#!/bin/sh
case $1 in
onbatt)
logger -t upssched-cmd "UPS running on battery"
;;
earlyshutdown)
logger -t upssched-cmd "UPS on battery too long, early shutdown"
/usr/sbin/upsmon -c fsd
;;
shutdowncritical)
logger -t upssched-cmd "UPS on battery critical, forced shutdown"
/usr/sbin/upsmon -c fsd
;;
upsgone)
logger -t upssched-cmd "UPS has been gone too long, can't reach"
;;
*)
logger -t upssched-cmd "Unrecognized command: $1"
;;
esac

beast/nut/upssched.conf Normal file

@@ -0,0 +1,14 @@
CMDSCRIPT /etc/nut/upssched-cmd
PIPEFN /etc/nut/upssched.pipe
LOCKFN /etc/nut/upssched.lock
AT ONBATT * START-TIMER onbatt 30
AT ONLINE * CANCEL-TIMER onbatt online
AT ONBATT * START-TIMER earlyshutdown 300
AT ONLINE * CANCEL-TIMER earlyshutdown online
AT LOWBATT * EXECUTE onbatt
AT COMMBAD * START-TIMER commbad 30
AT COMMOK * CANCEL-TIMER commbad commok
AT NOCOMM * EXECUTE commbad
AT SHUTDOWN * EXECUTE powerdown
AT SHUTDOWN * EXECUTE powerdown

runbooks/maintenance-runbook.md Normal file

@@ -0,0 +1,273 @@
# Homelab Maintenance Runbook
Last updated: 2026-05-06
Status: Living document — update when procedures change.
---
## Regular Operations
### Health Check (run before any maintenance session)
```bash
# From ansible-control (.190)
# Ad hoc check:
ansible -i ~/ansible/inventory/hosts.yml all -m ping 2>/dev/null | grep -E 'SUCCESS|FAILED|UNREACHABLE'
# Full report script (if present):
~/ansible/scripts/health-check.sh
# Or run manually — see reports/health-check-YYYYMMDD.md for format
```
### Update Schedule
- **Proxmox nodes:** Rolling, one at a time. Migrate guests off → `apt update && apt dist-upgrade -y` → reboot → verify Keepalived/quorum/storage → next node.
- **Order:** Compute nodes first (least critical), then beast last (runs TrueNAS, qdevice).
- **Phase 4A procedure** (current session): beast (9.1.6→9.1.7+), then compute5 (9.1.6→9.1.7+). 24h soak between nodes. One node per session.
- **PBS:** Schedule separately. Always verify backup datastores are accessible post-reboot.
### Prometheus Alerting
- Prometheus: `http://192.168.99.183:9090`
- Alert rules: `/home/tommy/observability/prometheus/alert_rules.yml` on media-server
- After editing alert rules: restart Prometheus (`docker compose -f ~/observability/docker-compose.yml restart prometheus`)
- Alertmanager: `http://192.168.99.183:9093`
- Notifications: ntfy bridge → ntfy topic (configured in alertmanager)
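A rules file with a syntax error will keep Prometheus from starting after the restart mentioned above, so a syntax check first is cheap insurance. A minimal sketch, assuming the compose service name matches the restart command (`prometheus`) and the rules file is mounted at `/etc/prometheus/alert_rules.yml` inside the container:
```bash
# Validate the edited rules before restarting (container mount path is an assumption)
docker compose -f ~/observability/docker-compose.yml exec prometheus \
  promtool check rules /etc/prometheus/alert_rules.yml
# Only restart once promtool reports the rules are valid
docker compose -f ~/observability/docker-compose.yml restart prometheus
```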
---
## PBS Recovery Procedure (post USB hub fix)
**Precondition:** Hardware fix complete (drives on direct USB or powered hub). PBS running for at least 15 minutes with no new USB disconnect events in `dmesg`.
```bash
# SSH to PBS
ssh root@192.168.99.153
# 1. Verify drives are present
lsblk | grep -E '^sd'
# Expected: sdg, sdh, sdi, sdj all present
# 2. Check for USB errors in current boot (must be clean)
dmesg | grep -iE 'disconnect|usb.*error|EIO|i/o error'
# Expected: NO disconnect events for the new drives
# 3. Clear the suspended pool
zpool clear usb1-zfs
# 4. Verify pool state
zpool status
# Expected state: ONLINE (or DEGRADED if errors exist but pool is accessible)
# 5. Run scrubs on both pools
zpool scrub usb1-zfs
zpool scrub usb2-zfs
# 6. Monitor scrub — check every ~15 minutes
zpool status -v
# Wait for "scrub repaired 0B in HH:MM:SS with 0 errors"
# usb1-zfs scrub time: ~2 hours (estimated from April 12 scrub = 1h59m53s for 614 GiB)
# 7. If scrub shows errors: note which blocks, check if mirror leg has clean copy
# ZFS will auto-repair from mirror if possible.
# If unrepaired errors remain after scrub, proceed to PBS verify anyway —
# those specific chunks will fail verify and you'll know which backups are affected.
# 8. Run PBS backup verification
proxmox-backup-manager datastore list
# For each affected datastore, via web UI: Datastore → usb1-zfs → Verify All
# Or via CLI on the PBS host (starts a verification task for the datastore):
# proxmox-backup-manager verify usb1-zfs
# 9. Confirm stable operation for 48h before resuming Phase 4A
# Watch dmesg every few hours: dmesg | tail -20 | grep -i 'usb\|zfs'
```
**If pool won't import after clear:**
```bash
# Force import by device scan (no cachefile)
zpool import -f usb1-zfs
# If this also fails, check: zpool import -d /dev/disk/by-id/
# and verify drive serial numbers match expected
```
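Re-cabling tends to shuffle the sdX letters, so it is worth mapping the ASM1153E bridge serials (listed in the INC-001 topology in the incident log) back to current device names before clearing or scrubbing anything. A sketch; note that the reported serial may be the bridge's rather than the drive's, depending on what the ASM1153E passes through:
```bash
# Map current sdX names to serials and models
lsblk -o NAME,SERIAL,MODEL,SIZE
# Cross-check against the known bridge serials from INC-001
ls -l /dev/disk/by-id/usb-* | grep ACAAEBBB2E
```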
---
## UPS Quarterly Battery Test
**Schedule:** Automated — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast
**Script:** `/usr/local/bin/ups-quarterly-test.sh`
**Results:** Written to `/var/lib/node_exporter/textfile_collector/ups_test.prom`
**Alert:** `UPSBatteryTestFailed` fires if result != 1 (passed); `UPSBatteryTestStale` fires if >100 days since last run.
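To confirm the latest run actually landed in monitoring, both the textfile and the scraped series can be spot-checked. A sketch, using the metric names recorded under Phase 4B and Prometheus's standard query API:
```bash
# On Beast: raw textfile written by the quarterly script
cat /var/lib/node_exporter/textfile_collector/ups_test.prom
# Expect ups_battery_test_result 1 and a recent ups_battery_test_timestamp_seconds
# From anywhere on the LAN: confirm Prometheus has scraped it
curl -s 'http://192.168.99.183:9090/api/v1/query?query=ups_battery_test_result'
```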
**Manual trigger (if needed):**
```bash
ssh tommy@192.168.99.200
# Direct sudo of the script is not available here (sudo covers only qm/pct/pvesh), so run it via ansible --become:
ansible -i ~/ansible/inventory/hosts.yml beast --become --vault-password-file ~/.vault_pass \
-m shell -a "/usr/local/bin/ups-quarterly-test.sh"
```
**Next scheduled run:** First Sunday of July 2026 (2026-07-05 at 03:00 CDT)
---
## TLS Cert Renewal Verification (one-time, May 2026)
4 certificates were in the 30-35d expiry window around May 5. Traefik auto-renews when ≤30d remain.
**May 11 check:**
```bash
ssh tommy@192.168.99.186
# Check Traefik logs for ACME activity in the last 7 days
docker logs traefik --since 168h 2>&1 | grep -iE 'renewed|obtained|acme|certificate' | head -20
# Check acme.json directly for notAfter dates (requires openssl to decode DER)
# Simpler: check cert expiry live
for domain in gitea.goattw.net portainer.goattw.net home.goattw.net traefik.goattw.net; do
expires=$(echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null \
| openssl x509 -noout -dates 2>/dev/null | grep notAfter)
echo "${domain}: ${expires}"
done
```
**May 18 check:** Repeat above. If any domain still shows the original expiry (≤30d from May 5), force renew:
```bash
# Force ACME renewal for a specific domain (Traefik)
# Traefik doesn't have a direct force-renew command; the practical path is to
# remove the domain's certificate entry from acme.json and restart Traefik so it re-requests it:
docker stop traefik
# Edit ~/traefik/acme.json, remove the certificate block for the domain
docker start traefik
# Traefik will request a new cert on startup
```
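Before hand-editing acme.json, listing the stored certificate domains makes it clear which block to remove. A sketch, assuming `jq` is installed on the node and the resolver is named `myresolver` as in the P5-05 check in the incident log:
```bash
# List the domains Traefik currently holds certificates for (resolver name is an assumption)
sudo jq -r '.myresolver.Certificates[].domain.main' ~/traefik/acme.json
```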
---
## Phase 4A — PVE Version Drift (PAUSED, resuming after PBS stable)
**Status:** On hold until PBS hardware fix + 48h stability confirmed.
**Scope:**
- beast: 9.1.6 → 9.1.7 (or latest)
- compute5: 9.1.6 → 9.1.7 (or latest)
- Both should reach same version as compute2, compute3, compute6 (9.1.7)
- compute4 is at 9.1.9 — if majority reaches 9.1.7, decide whether to move all to 9.1.9 or hold
**Per-node procedure (one node per session, 24h soak between):**
```bash
# --- PRE-UPGRADE: migrate guests off the target node ---
# Example: target = beast
# Move TrueNAS-Scale (VM 100) → compute2 or compute3
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/beast/qemu/100/migrate --target compute2 --online 1"
# Wait for migration, verify VM running on compute2
# Move Media-Server VM (101), docker-node02 VM (108) if on beast — check first
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/beast/qemu --output-format json" | \
python3 -c "import json,sys; [print(v['vmid'],v['name'],v['status']) for v in json.load(sys.stdin)]"
# --- UPGRADE ---
ssh root@192.168.99.200
apt update && apt dist-upgrade -y
reboot
# --- POST-UPGRADE: verify ---
# From cluster peer:
pvecm status # quorum intact, all nodes seen
zpool status # ZFS pools healthy on beast
# From beast after reboot:
pveversion # new version
systemctl status pve-cluster pve-manager
# Verify the Keepalived VIP is still reachable (docker-node01 does not run on beast, but check the VIP anyway)
# --- MIGRATE BACK ---
# Move TrueNAS-Scale (VM 100) back to beast
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/compute2/qemu/100/migrate --target beast --online 1"
```
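After a node comes back and its guests are migrated home, a quick sweep of every node confirms the version drift is actually closing. A sketch, reusing the node IPs listed in the qnetd section below:
```bash
# Report the PVE version on every cluster node
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  echo -n "$ip: "; ssh root@$ip pveversion
done
```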
---
## qnetd Migration: PBS → Pi4 (future, after PBS stable + Pi4 SSH resolved)
**Trigger:** PBS confirmed stable for ≥7 days post-hardware fix AND Pi4 SSH restored.
**Steps:**
```bash
# 1. On PBS: stop and disable qnetd
ssh root@192.168.99.153
systemctl stop corosync-qnetd
systemctl disable corosync-qnetd
# 2. On Pi4: install and start qnetd
ssh pi@192.168.99.227
sudo apt install -y corosync-qnetd
sudo systemctl enable --now corosync-qnetd
# 3. On all cluster nodes: update corosync.conf
# Change: quorum.device.net.host = 192.168.99.153
# To: quorum.device.net.host = 192.168.99.227
# Run on each node (beast, compute2-6):
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
ssh root@$ip "sed -i 's/host: 192.168.99.153/host: 192.168.99.227/' /etc/corosync/corosync.conf"
done
# 4. Reload corosync on all nodes (no restart needed)
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
ssh root@$ip "corosync-cfgtool -R"
done
# 5. Verify qdevice reconnected to new host
pvecm status | grep -i qdevice
corosync-quorumtool -s
# 6. After confirming, update homelab-configs corosync.conf copy
```
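Proxmox also ships a managed path for moving a qdevice, which regenerates the qnetd TLS certificates instead of relying on the old ones surviving the host swap. A sketch of that alternative, assuming the standard `pvecm qdevice` subcommands are available and a cluster node has root SSH access to the Pi4:
```bash
# Alternative: let pvecm tear down and re-register the qdevice.
# Run on any one cluster node, after corosync-qnetd is installed on the Pi4 (step 2 above).
pvecm qdevice remove
pvecm qdevice setup 192.168.99.227
pvecm status | grep -i qdevice
```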
---
## Authelia Metrics + Alert (P5-04)
**When:** After confirming Authelia is healthy post any Traefik maintenance.
**Steps:**
1. Edit Authelia config on both nodes (`~/authelia/configuration.yml` on node01 and node02):
```yaml
telemetry:
metrics:
enabled: true
address: "tcp://0.0.0.0:9959"
```
2. Expose port 9959 in Authelia docker-compose on both nodes.
3. Add Prometheus scrape target in `/home/tommy/observability/prometheus/prometheus.yml`:
```yaml
- job_name: authelia
static_configs:
- targets:
- "192.168.99.186:9959"
- "192.168.99.187:9959"
```
4. Add alert to alert_rules.yml:
```yaml
- alert: AutheliaHighErrorRate
expr: >
rate(authelia_request_duration_seconds_count{code="408"}[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Authelia elevated 408 error rate on {{ $labels.instance }}"
description: "Request timeout rate >0.1/s for 5 minutes. May indicate scanner load or upstream slowness."
```
5. Restart Prometheus.
6. Break-test by temporarily enabling a scanner against the Authelia endpoint.
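Once the endpoint is enabled and Prometheus restarted (steps 1-5 above), a direct scrape check confirms metrics are flowing before running the break-test in step 6. A sketch, assuming Authelia serves its telemetry at `/metrics` on the configured port:
```bash
# Count authelia_* series exposed by each instance; 0 means the endpoint isn't up yet
for host in 192.168.99.186 192.168.99.187; do
  echo -n "${host}: "
  curl -s "http://${host}:9959/metrics" | grep -c '^authelia_'
done
```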
---
## Recurring Checklist
| Frequency | Task | Notes |
|---|---|---|
| Before any maintenance | Health check | SSH to all nodes, check Prometheus alerts |
| Quarterly | UPS self-test | Automated; verify Prometheus metric updated |
| Monthly | ZFS scrub (if not auto) | Beast das-mirror, Compute3 pool; PBS pools post-recovery |
| Monthly | PBS backup verify | Spot-check recent backups via web UI |
| After each PBS maintenance | `zpool status -v` + `dmesg \| grep usb` | Confirm no disconnect events |
| After each update cycle | `pvecm status` | Confirm quorum intact after any node reboot |
| After cert renewal window | Check cert expiry via openssl | See May 11/18 procedure above |

runbooks/phase5-incident-log.md Normal file

@@ -0,0 +1,244 @@
# Phase 5 — Incident Log / State Drift
Last updated: 2026-05-06
Purpose: Consolidates real findings from the Phase 4 work session. Inputs to Phase 5 planning and repair.
---
## Active Incidents
### INC-005 — NUT upssched earlyshutdown Bug + May 6 Cluster Shutdown (FIXED)
**Discovered:** 2026-05-06 during Phase 5 health-check investigation
**Fixed:** 2026-05-06
**Class:** Same as HAOS auto-update (Phase 1.1) — autonomous mechanism causing user-impacting outage without warning
**What happened:**
At 10:53:42 CDT on May 6, a brief real power disturbance caused both UPS units to switch to battery (OB) for 14-16 seconds. Line power was restored within ~25 seconds. Despite power returning cleanly, Beast's upssched fired a forced shutdown (FSD) 30 seconds after the initial OB event, initiating a clean cluster-wide shutdown. All 5 nodes (beast, compute3, compute5, media-server, docker-node02) went offline and required manual power-on ~7.5 hours later.
**Root cause — upssched.conf bug (pre-existing since NUT install):**
```
# OLD — broken:
AT ONBATT * START-TIMER earlyshutdown 30 # starts 30s FSD timer on any OB event
AT ONLINE * CANCEL-TIMER onbatt online # cancels "onbatt" timer — NOT earlyshutdown
# No AT ONLINE cancel for "earlyshutdown" existed → timer runs to completion regardless of power restoration
```
After 30 seconds, upssched-cmd called `upsmon -c fsd`, propagating FSD to all NUT slave nodes. The earlyshutdown timer was never cancelled because the cancel rule referenced the wrong timer name ("onbatt" vs "earlyshutdown"). Any OB event, however brief, would fire FSD 30 seconds later unconditionally.
**Important: battery quick tests do NOT trigger this bug.** CyberPower CP1500 quick battery tests produce `OL DISCHRG` (on-line, discharging) status, not `OB` (on battery). The upssched `AT ONBATT` rule does not fire on `OL DISCHRG`. The May 6 10:53 event was a real power disturbance, not a test artifact.
**Why latent:** The bug required an actual OB event. Power has been clean for at least the prior 17-day boot window on compute5 (no historical FSD events found in journal). The first observed trigger was the May 6 power disturbance.
**Fix applied (Beast, 2026-05-06):**
```
# NEW — fixed:
AT ONBATT * START-TIMER earlyshutdown 300 # extended: 5 min before FSD (batteries hold 22-66 min)
AT ONLINE * CANCEL-TIMER earlyshutdown online # NEW: cancels timer when power returns
```
Config backed up as `/etc/nut/upssched.conf.bak`. Files committed to `homelab-configs/beast/nut/`.
**Also fixed (same session):** `ups-quarterly-test.sh` rewritten to use sequential tests (not simultaneous), poll for OL restoration with 120s timeout, 35s OL buffer before second test, ntfy alert on timeout, abort instead of testing second UPS if first doesn't return OL. This was correct practice even though battery tests don't produce OB status — the sequential design is safer against any firmware variation and future UPS model changes.
**Break-test result:**
Battery quick test on cyberpower2 produced `OL DISCHRG` — no ONBATT event, no earlyshutdown timer, no FSD. Cluster remained up. An OB-producing break-test cannot be safely replicated without a real power event. The cancel logic is verified by code review: `AT ONLINE * CANCEL-TIMER earlyshutdown online` is unambiguous NUT upssched syntax.
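If a real OB event does occur after the fix, the timer behavior can be confirmed live from the syslog tags NUT already uses. A sketch; the `upsmon` tag is an assumption, while `upssched-cmd` matches the logger tag in the archived script, and the UPS names come from Phase 4B:
```bash
# Follow NUT activity during a power event: expect "UPS running on battery",
# then ONLINE cancelling the earlyshutdown timer, and no FSD
journalctl -f -t upssched-cmd -t upsmon
# Current UPS status (OL / OB / OL DISCHRG) for both units
upsc cyberpower1 ups.status; upsc cyberpower2 ups.status
```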
**Compute5 unsafe shutdown cross-reference:**
56 unsafe shutdowns on nvme0 (WD PC SN740) over 1,407h do NOT primarily reflect NUT FSD history. Only 1 FSD event found in journal history (today's). The elevated counts reflect: (a) pre-journal install period reboots, (b) PCIe power management on an i7-13700T platform. The SK Hynix PC711 (nvme1) has 2,362 power cycles — a separate PCIe runtime PM compatibility issue; `platform quirk: simple suspend` now applied by kernel.
**P5-11 added:** Compute5 SK Hynix PC711 PCIe power management — verify `simple suspend` quirk persists across reboots. See below.
---
### INC-001 — PBS USB Hub Failure (ACTIVE — hardware fix in progress)
**First event:** March 12, 2026 (earliest confirmed crash in journal)
**Crash count:** 34+ unclean reboots since April 21; ~50+ total since March 12
**Current state:** usb1-zfs SUSPENDED with 4 data errors / 24 write errors. PBS up ~24h on current boot. agents hung task cleared. usb2-zfs ONLINE.
**Root cause (confirmed):**
A 4-port USB hub at xHCI Bus 2 Port 6 intermittently loses power or signal, dropping all 4 ASM1153E (174c:55aa) USB-to-SATA bridges simultaneously. ZFS detects uncorrectable I/O on usb1-zfs and suspends the pool. The PBS `agents` process enters D-state waiting for ZFS TXG sync on the suspended pool. After 737+ seconds, the kernel logs a hung task warning. Eventually the system hangs and watchdog-reboots.
The hub reconnects ~21 seconds after dropping, but ZFS remains suspended until manually cleared.
**USB topology at failure:**
```
xHCI Bus 2, Port 6 — [FAILING HUB]
├─ Port 1: ASM1153E ACAAEBBB2E5D → sdj → usb1-zfs mirror leg 2
├─ Port 2: ASM1153E ACAAEBBB2E5F → sdg → usb1-zfs mirror leg 1
├─ Port 3: ASM1153E ACAAEBBB2E60 → sdh → usb2-zfs mirror leg 1
└─ Port 4: ASM1153E ACAAEBBB2E5E → sdi → usb2-zfs mirror leg 2
```
Both pools are behind the same hub. One hub drop affects both pools simultaneously.
**UAS quirk already applied:** `usb-storage.quirks=174c:55aa:u` in kernel cmdline disables UAS on the ASM1153E bridges. This was a pre-existing workaround for protocol-level issues but does not address hub power/connection failures.
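A quick way to confirm the quirk (and the delay_use workaround noted below) is still active on the running kernel, and survives future kernel or bootloader updates, is to read it back from the boot cmdline. A sketch:
```bash
# Show any usb-storage parameters on the current boot's kernel command line
grep -o 'usb-storage\.[^ ]*' /proc/cmdline
```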
**Recovery procedure (manual steps, after hardware fix):**
```bash
# 1. Confirm drives reconnected after hardware fix
lsblk | grep sd
# 2. Clear the suspended pool
zpool clear usb1-zfs
# 3. Run scrubs on both pools immediately
zpool scrub usb1-zfs
zpool scrub usb2-zfs
# 4. Monitor scrub progress (run ~every 10 min until complete)
zpool status -v
# 5. After scrub completes with 0 errors, run PBS backup verify
# In PBS web UI → Datastore → usb1-zfs → Verify All
# Or via CLI on the PBS host: proxmox-backup-manager verify usb1-zfs
# 6. If scrub finds unrepaired errors, see INC-002
```
**Hardware fix options (ordered by preference):**
1. **Best:** Connect each ASMT bridge directly to a motherboard USB 3.0 port (bypass hub entirely). xHCI has 10 downstream ports; only 2 are currently used (Port 2: Toshiba, Port 6: hub). Count available physical USB 3.0 jacks; plug each bridge directly.
2. **Acceptable:** Replace hub with an **externally powered** USB hub (dedicated power brick, not bus-powered). 4× spinning HDDs exceed typical bus-power budget (900mA), likely causing brownouts.
3. **Do not:** Return drives to any bus-powered hub.
**Kernel command line note:** The `usb-storage.delay_use=10` param is a 10-second delay before using USB storage. This was also a pre-existing workaround and should be left in place post-fix.
---
### INC-002 — PBS Backup Data Integrity Unknown (ACTIVE — verify post-recovery)
**Context:** usb1-zfs has been suspended on every one of the 34+ crash reboots since April 21. Each suspension may have left partially-written ZFS TXGs. The pool has 4 confirmed data errors in the current boot's suspension event.
**Risk:** Backup chunks written to usb1-zfs during or after a suspension event may be incomplete or corrupt. The PBS backup catalog may reference chunks that don't exist or are truncated.
**Action required post-hardware-fix:**
- Run `zpool scrub usb1-zfs` and verify 0 unrepaired errors.
- Run PBS verify on the usb1-zfs datastore (proxmox-backup-client verify --all, or via web UI).
- Any backup with a verify failure should be noted. If critical VMs have no verified backup, a fresh backup job should be run against a known-good target (Synology-Remote or rust-usb) immediately.
**Status:** Blocked on hardware fix (INC-001).
---
### INC-003 — PBS Phase 1 C1 Re-Diagnosis
**Original diagnosis (Phase 1):** usb1-zfs and usb2-zfs fail to import at boot due to stale cachefile. Fix applied: cleared cachefile, added `zpool import -f` to boot sequence.
**Revised diagnosis:** The import failures were a downstream symptom. The actual root cause is INC-001 (USB hub drop). The pool is suspended at the time of crash-reboot; importing a suspended pool fails regardless of cachefile state. The cachefile fix helped on some boots by switching to scan-based import, but the underlying hardware problem continued causing the failures.
**Lesson:** When troubleshooting recurring import failures, always check `dmesg` for USB disconnect events before the failure. A suspended pool at boot is usually caused by the prior boot's crash, not a cachefile issue.
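Since dmesg only covers the current boot, the previous boot's kernel messages are the place to look when a pool shows up suspended right after a crash. A minimal sketch, assuming journald is configured for persistent storage on PBS:
```bash
# Kernel messages from the previous boot: look for hub/bridge disconnects before the crash
journalctl -k -b -1 | grep -iE 'usb.*disconnect|zfs|i/o error'
```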
---
### INC-004 — sdj Drive End-of-Life (PENDING)
**Drive:** Hitachi HUA723030ALA640, Serial MK0371YVGRZ2AA, /dev/sdj
**Pool:** usb1-zfs mirror leg 2
**Power-on hours:** 93,267 (≈10.6 years)
**SMART:** Passes all checks. No reallocated or pending sectors. 1 UDMA CRC error (USB interface, consistent with hub issue, not drive failure).
**Action:** Order a replacement 3TB+ drive (HGST Ultrastar or WD Gold recommended for NAS duty). Plan resilver after INC-001 hardware fix and INC-002 scrub pass are both confirmed clean. Do not start a resilver on a hub that may still be marginal.
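When the replacement arrives and both INC-001 and INC-002 are confirmed clean, the swap itself is a single `zpool replace` followed by a resilver watch. A sketch with placeholder device paths (substitute the real by-id paths for the old sdj and the new drive):
```bash
# Placeholders: OLD-SDJ-ID and NEW-DRIVE-ID are not real device names
zpool replace usb1-zfs /dev/disk/by-id/OLD-SDJ-ID /dev/disk/by-id/NEW-DRIVE-ID
# Watch the resilver; do not detach anything until it completes with 0 errors
zpool status -v usb1-zfs
```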
---
## Phase 4 — Completed Work
### Phase 4B — UPS Self-Tests + Alert (completed 2026-05-06)
- Manual tests run: cyberpower1 and cyberpower2 both passed.
- Script `/usr/local/bin/ups-quarterly-test.sh` updated to write Prometheus textfile metrics (`ups_battery_test_result`, `ups_battery_test_timestamp_seconds`).
- Cron installed: `/etc/cron.d/ups-quarterly-test` — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast.
- Alert group `ups_battery_test` added: `UPSBatteryTestFailed` (result==0, for:5m), `UPSBatteryTestStale` (>100 days, for:6h).
- Break-tested and confirmed.
- Committed: docker-configs and ansible repos.
### Phase 4C — Root-Cause Documentation (completed 2026-05-06)
- **W11 (compute6 xtables):** Race between pve-firewall iptables_restore (no -w flag) and Docker's iptables backend. Self-resolves in 10-120s. No firewall gap. No action taken.
- **W12 (compute2 qdevice):** qnetd on PBS; PBS crash-rebooting 34+ times since April 21 (same as INC-001). Each reboot drops qnetd for ~60s. Cluster quorum unaffected.
- Documented in `runbooks/W11-W12-root-cause-2026-05.md`.
### Phase 4A — PVE Version Drift (PAUSED — pending PBS stability)
- Scope: Roll beast (9.1.6) and compute5 (9.1.6) to match cluster majority (9.1.7+).
- Procedure: One node per session, 24h soak. Migrate guests off → upgrade → reboot → verify Keepalived/quorum/storage → migrate back.
- **Hold until:** PBS hardware fix confirmed + usb1-zfs scrub clean + backup verify complete.
---
## Phase 5 — Pending Work Queue
### P5-01 — PBS Recovery (blocked on hardware)
Sequence: hardware fix → zpool clear → scrub → PBS verify → confirm stable 48h → resume Phase 4A
### P5-02 — sdj Replacement (after P5-01 completes)
Order drive → resilver → verify → remove old sdj
### P5-03 — qnetd Migration: PBS → Pi4 (after P5-01 + PBS stability confirmed)
Move `corosync-qnetd` from PBS (.153) to Pi4 (.227). Steps documented in `runbooks/W11-W12-root-cause-2026-05.md`. 30-minute operation, no VM downtime, no migration.
### P5-04 — Authelia Prometheus Metrics + Request-Rate Alert
The C3 issue (May 5 health check: i/o timeouts every 2 min under scanner load) is not caught by the existing `ContainerRestartLoop` alert. Requires:
- Enable Authelia's Prometheus metrics endpoint (`telemetry.metrics` in configuration.yml)
- Add Prometheus scrape target for Authelia metrics
- Add alert: `AutheliaHighErrorRate`, firing when the `authelia_request_duration_seconds_count{status="408"}` rate stays elevated for >5m
### P5-05 — TLS Cert Renewal Verification (time-sensitive)
4 certs were in the 30-35d expiry window during the cert break-test (~May 5).
Traefik should auto-renew via ACME when ≤30d remain.
**Action schedule:**
- **May 11:** SSH to active Traefik node (docker-node01), inspect acme.json, confirm the 4 certs have new `notAfter` dates.
```bash
ssh tommy@192.168.99.186 "cat ~/traefik/acme.json | python3 -c \"
import json, sys
d = json.load(sys.stdin)
for r in d.get('myresolver', {}).get('Certificates', []):
    # notAfter is inside the base64-encoded cert; use openssl or the Traefik logs to verify dates
    print(r['domain']['main'])
\" 2>/dev/null || docker logs traefik --since 168h 2>&1 | grep -i 'renewed\|certificate\|acme'"
```
- **May 18:** Repeat check. If any cert still has the original expiry date, force-renew manually.
### P5-06 — Traefik node02 CrowdSec LAPI (intentional asymmetry — no action)
node02 has no local CrowdSec; it correctly points to node01's LAPI at 192.168.99.186:8081 per Phase 2A design. Not drift. Known limitation: if node01 is down during failover, node02 Traefik has no CrowdSec coverage.
### P5-07 — ansible-control Disk Cleanup / Expansion
At 80%, DiskFillPredicted24h firing. Clean `/tmp` artifacts and old report files, or expand the LVM.
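A sketch for both options, with the VG/LV names as placeholders since they are not recorded here:
```bash
# Find what is actually eating the space (single filesystem, human-readable, biggest last)
sudo du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -20
# Or expand the LVM if the VG has free extents (VG/LV names below are placeholders):
# sudo lvextend -r -L +10G /dev/ubuntu-vg/ubuntu-lv
```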
### P5-08 — Pi4 node-exporter Wrong Architecture Binary
SSH works as `tommy`. DNS (Technitium) and fail2ban both active. node-exporter fails with `Exec format error` — binary at `/usr/local/bin/node_exporter` is wrong architecture (x86_64 installed on ARM64). Fix: deploy correct arm64 binary.
```bash
VERSION=1.8.2
curl -sL https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-arm64.tar.gz \
| tar -xz --strip-components=1 -C /tmp node_exporter-${VERSION}.linux-arm64/node_exporter
scp /tmp/node_exporter tommy@192.168.99.227:/tmp/node_exporter
ssh tommy@192.168.99.227 "sudo mv /tmp/node_exporter /usr/local/bin/ && sudo chmod 755 /usr/local/bin/node_exporter && sudo systemctl restart node_exporter"
```
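After the restart, a scrape check from anywhere on the LAN confirms the correct binary is running; any successful response means the exec-format failure is gone. A sketch, assuming node_exporter listens on its default port 9100:
```bash
# Expect a node_exporter_build_info line from the arm64 binary
curl -s http://192.168.99.227:9100/metrics | grep node_exporter_build_info
```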
Pi4 hostname is also `raspberrypi` (clone drift — set to pi4 or similar).
### P5-09 — Phase 4A Resume (PVE Version Drift)
After P5-01 completes and PBS is confirmed stable for 48h.
### P5-10 — Pi4 node-exporter ARM64 Deploy
See P5-08.
### P5-11 — Compute5 SK Hynix PC711 PCIe Power Management
nvme1n1 on compute5 (SK Hynix PC711 1TB, `0000:03:00.0`) has 2,362 power cycles and 84 unsafe shutdowns — indicative of PCIe runtime PM aggressively power-cycling the drive. Kernel applied `platform quirk: setting simple suspend` in the current boot. Verify this persists:
```bash
# Check if quirk is active post-reboot:
dmesg | grep -i 'nvme.*simple suspend\|03:00.*quirk'
# If not applied, add to kernel cmdline or create modprobe conf:
# nvme_core.default_ps_max_latency_us=0
```
WD PC SN740 (nvme0) on same node has 56 unsafe shutdowns in 1,407h — likely from pre-journal setup period and PCIe PS behavior. No action unless counts accumulate.
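To confirm the counters stop climbing once the quirk persists, the NVMe health log can be sampled periodically. A sketch, assuming smartmontools is installed on compute5:
```bash
# Sample the counters; compare against the 2,362 power cycles / 84 unsafe shutdowns noted above
smartctl -a /dev/nvme1n1 | grep -iE 'power cycles|unsafe shutdowns'
```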
---
## Known Config Drift (not incidents — track for cleanup)
| Item | Location | Drift |
|---|---|---|
| Hostname mismatch | docker-node01 (.186) | `/etc/hostname` = ubuntu-template; should = docker-node01 |
| crowdsecLapiHost | node02 dynamic_conf.yml | Points to 192.168.99.186:8081 (node01 IP); should point to local LAPI |
| Technitium DNS | Pi4 | No internal overrides for goattw.net (relies on Pi-hole as primary) |
| UPS self-test never run | cyberpower1 | Resolved by Phase 4B; now automated |
| PBS ZFS pool feature upgrade | usb1-zfs, usb2-zfs | `zpool upgrade` deferred until pools are stable post-hardware fix |
| Alertmanager permission errors | media-server | Resolved (C2 May-05) — no action needed |