Homelab Maintenance Runbook
Last updated: 2026-05-06
Status: Living document — update when procedures change.
Regular Operations
Health Check (run before any maintenance session)
# From ansible-control (.190)
# Ad hoc check:
ansible -i ~/ansible/inventory/hosts.yml all -m ping 2>/dev/null | grep -E 'SUCCESS|FAILED|UNREACHABLE'
# Full report script (if present):
~/ansible/scripts/health-check.sh
# Or run manually — see reports/health-check-YYYYMMDD.md for format
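A minimal manual pass, if the script is absent (hosts and endpoints taken from elsewhere in this runbook):
# Quorum and pool health on beast:
ssh root@192.168.99.200 "pvecm status | grep -i quorate; zpool status -x"
# Any alerts currently firing in Prometheus:
curl -s http://192.168.99.183:9090/api/v1/alerts | python3 -m json.tool | head -40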
Update Schedule
- Proxmox nodes: Rolling, one at a time. Migrate guests off → apt update && apt dist-upgrade -y → reboot → verify Keepalived/quorum/storage → next node.
- Order: Compute nodes first (least critical), then beast last (runs TrueNAS, qdevice).
- Phase 4A procedure (current session): beast (9.1.6→9.1.7+), then compute5 (9.1.6→9.1.7+). 24h soak between nodes. One node per session.
- PBS: Schedule separately. Always verify backup datastores are accessible post-reboot.
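A quick post-reboot datastore check on PBS (same commands used in the recovery procedure below):
ssh root@192.168.99.153
proxmox-backup-manager datastore list   # all datastores listed without errors
zpool status -x                         # expect "all pools are healthy"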
Prometheus Alerting
- Prometheus: http://192.168.99.183:9090
- Alert rules: /home/tommy/observability/prometheus/alert_rules.yml on media-server
- After editing alert rules: restart Prometheus (docker compose -f ~/observability/docker-compose.yml restart prometheus); see the promtool sketch below for validating first
- Alertmanager: http://192.168.99.183:9093
- Notifications: ntfy bridge → ntfy topic (configured in alertmanager)
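A rule edit can be validated before the restart with promtool, which ships in the Prometheus image (the in-container rule path is an assumption; check the compose volume mapping):
docker compose -f ~/observability/docker-compose.yml exec prometheus \
  promtool check rules /etc/prometheus/alert_rules.yml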
PBS Recovery Procedure (post USB hub fix)
Precondition: Hardware fix complete (drives on direct USB or powered hub). PBS running for at least 15 minutes with no new USB disconnect events in dmesg.
# SSH to PBS
ssh root@192.168.99.153
# 1. Verify drives are present
lsblk | grep -E '^sd'
# Expected: sdg, sdh, sdi, sdj all present
# 2. Check for USB errors in current boot (must be clean)
dmesg | grep -iE 'disconnect|usb.*error|EIO|i/o error'
# (no grep -v 'connect' filter here: it would also swallow the 'USB disconnect' lines this check exists to find)
# Expected: NO disconnect events for the new drives
# 3. Clear the suspended pool
zpool clear usb1-zfs
# 4. Verify pool state
zpool status
# Expected state: ONLINE (or DEGRADED if errors exist but pool is accessible)
# 5. Run scrubs on both pools
zpool scrub usb1-zfs
zpool scrub usb2-zfs
# 6. Monitor scrub — check every ~15 minutes
zpool status -v
# Wait for "scrub repaired 0B in HH:MM:SS with 0 errors"
# usb1-zfs scrub time: ~2 hours (estimated from April 12 scrub = 1h59m53s for 614 GiB)
# 7. If scrub shows errors: note which blocks, check if mirror leg has clean copy
# ZFS will auto-repair from mirror if possible.
# If unrepaired errors remain after scrub, proceed to PBS verify anyway —
# those specific chunks will fail verify and you'll know which backups are affected.
# 8. Run PBS backup verification
proxmox-backup-manager datastore list
# For each affected datastore, via web UI: Datastore → usb1-zfs → Verify All
# Or via CLI on the PBS host:
# proxmox-backup-manager verify usb1-zfs
# 9. Confirm stable operation for 48h before resuming Phase 4A
# Watch dmesg every few hours: dmesg | tail -20 | grep -i 'usb\|zfs'
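Optionally, leave a low-effort watch running in tmux for the 48h window (a convenience sketch, not part of the original procedure):
while true; do date; dmesg | tail -20 | grep -iE 'usb|zfs'; sleep 3600; done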
If pool won't import after clear:
# Force import by device scan (no cachefile)
zpool import -f usb1-zfs
# If this also fails, check: zpool import -d /dev/disk/by-id/
# and verify drive serial numbers match expected
UPS Quarterly Battery Test
Schedule: Automated — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast
Script: /usr/local/bin/ups-quarterly-test.sh
Results: Written to /var/lib/node_exporter/textfile_collector/ups_test.prom
Alert: UPSBatteryTestFailed fires if result != 1 (passed); UPSBatteryTestStale fires if >100 days since last run.
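To spot-check the last result directly on beast (the metric names below are illustrative; use whatever ups-quarterly-test.sh actually writes):
cat /var/lib/node_exporter/textfile_collector/ups_test.prom
# e.g. expect something like:
#   ups_selftest_result 1
#   ups_selftest_last_run_timestamp_seconds 1.77e+09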
Manual trigger (if needed):
# tommy's sudo on beast is limited to qm/pct/pvesh, so run the script
# via ansible (from ansible-control, .190) with --become instead:
ansible -i ~/ansible/inventory/hosts.yml beast --become --vault-password-file ~/.vault_pass \
-m shell -a "/usr/local/bin/ups-quarterly-test.sh"
Next scheduled run: First Sunday of July 2026 (2026-07-05 at 03:00 CDT)
TLS Cert Renewal Verification (one-time, May 2026)
4 certificates were in the 30–35d window around May 5. Traefik auto-renews when ≤30d remain.
May 11 check:
ssh tommy@192.168.99.186
# Check Traefik logs for ACME activity in the last 7 days
docker logs traefik --since 168h 2>&1 | grep -iE 'renewed|obtained|acme|certificate' | head -20
# acme.json can be inspected directly (cert entries are base64-encoded PEM),
# but it's fiddly. Simpler: check cert expiry live
for domain in gitea.goattw.net portainer.goattw.net home.goattw.net traefik.goattw.net; do
expires=$(echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null \
| openssl x509 -noout -dates 2>/dev/null | grep notAfter)
echo "${domain}: ${expires}"
done
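openssl can also give a pass/fail directly: -checkend exits nonzero if the cert expires within the given number of seconds (30 days = 2592000), which is handy for scripting the re-check:
for domain in gitea.goattw.net portainer.goattw.net home.goattw.net traefik.goattw.net; do
echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null \
| openssl x509 -noout -checkend 2592000 >/dev/null && echo "${domain}: OK" || echo "${domain}: renewal needed"
done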
May 18 check: Repeat above. If any domain still shows the original expiry (≤30d from May 5), force renew:
# Force ACME renewal for a specific domain (Traefik)
# Traefik has no direct force-renew command. The reliable approach is to
# delete the domain's certificate entry from acme.json and restart Traefik:
docker stop traefik
# Edit ~/traefik/acme.json, remove the certificate block for the domain
docker start traefik
# Traefik will request a new cert on startup
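For the acme.json edit itself, a jq sketch (assumes the resolver is named letsencrypt and the stock acme.json layout; verify both against the actual file before running):
jq '(.letsencrypt.Certificates) |= map(select(.domain.main != "gitea.goattw.net"))' \
  ~/traefik/acme.json > /tmp/acme.json && mv /tmp/acme.json ~/traefik/acme.json
chmod 600 ~/traefik/acme.json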
Phase 4A — PVE Version Drift (PAUSED, resuming after PBS stable)
Status: On hold until PBS hardware fix + 48h stability confirmed.
Scope:
- beast: 9.1.6 → 9.1.7 (or latest)
- compute5: 9.1.6 → 9.1.7 (or latest)
- Both should reach same version as compute2, compute3, compute6 (9.1.7)
- compute4 is at 9.1.9 — if majority reaches 9.1.7, decide whether to move all to 9.1.9 or hold
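A quick drift check across the cluster from ansible-control (assumes the inventory groups cover only PVE hosts; narrow the host pattern if 'all' includes PBS or the Pi):
ansible -i ~/ansible/inventory/hosts.yml all -m shell -a "pveversion" 2>/dev/null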
Per-node procedure (one node per session, 24h soak between):
# --- PRE-UPGRADE: migrate guests off the target node ---
# Example: target = beast
# Move TrueNAS-Scale (VM 100) → compute2 or compute3
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/beast/qemu/100/migrate --target compute2 --online 1"
# Wait for migration, verify VM running on compute2
# Move Media-Server VM (101), docker-node02 VM (108) if on beast — check first
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/beast/qemu --output-format json" | \
python3 -c "import json,sys; [print(v['vmid'],v['name'],v['status']) for v in json.load(sys.stdin)]"
# --- UPGRADE ---
ssh root@192.168.99.200
apt update && apt dist-upgrade -y
reboot
# --- POST-UPGRADE: verify ---
# From cluster peer:
pvecm status # quorum intact, all nodes seen
zpool status # ZFS pools healthy on beast
# From beast after reboot:
pveversion # new version
systemctl status pve-cluster pve-manager
# Verify the Keepalived VIP is still active (beast doesn't host docker-node01, but check the VIP anyway)
# --- MIGRATE BACK ---
# Move TrueNAS-Scale (VM 100) back to beast
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/compute2/qemu/100/migrate --target beast --online 1"
qnetd Migration: PBS → Pi4 (future, after PBS stable + Pi4 SSH resolved)
Trigger: PBS confirmed stable for ≥7 days post-hardware fix AND Pi4 SSH restored.
Steps:
# 1. On PBS: stop and disable qnetd
ssh root@192.168.99.153
systemctl stop corosync-qnetd
systemctl disable corosync-qnetd
# 2. On Pi4: install and start qnetd
ssh pi@192.168.99.227
sudo apt install -y corosync-qnetd
sudo systemctl enable --now corosync-qnetd
# 3. Re-register the qdevice against the new host. The qdevice authenticates
#    to qnetd over TLS with certificates issued during setup, so editing
#    quorum.device.net.host in corosync.conf by hand is not enough.
#    From any one cluster node, remove the old registration:
pvecm qdevice remove
# 4. Set it up again pointing at the Pi4. This rewrites corosync.conf on all
#    nodes (beast, compute2-6) and handles the certificate exchange, so no
#    manual sed or corosync reload is needed:
pvecm qdevice setup 192.168.99.227
# 5. Verify qdevice reconnected to new host
pvecm status | grep -i qdevice
corosync-quorumtool -s
# 6. After confirming, update homelab-configs corosync.conf copy
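On the Pi4, corosync-qnetd-tool can confirm the cluster registered (output format varies by version):
ssh pi@192.168.99.227 "sudo corosync-qnetd-tool -l"
# Expect one cluster listed, with entries for all six nodes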
Authelia Metrics + Alert (P5-04)
When: After confirming Authelia is healthy post any Traefik maintenance.
Steps:
- Edit Authelia config on both nodes (~/authelia/configuration.yml on node01 and node02):
telemetry:
  metrics:
    enabled: true
    address: "tcp://0.0.0.0:9959"
- Expose port 9959 in the Authelia docker-compose on both nodes (see the sketch after this list).
- Add a Prometheus scrape target in /home/tommy/observability/prometheus/prometheus.yml:
- job_name: authelia
  static_configs:
    - targets:
        - "192.168.99.186:9959"
        - "192.168.99.187:9959"
- Add alert to alert_rules.yml:
- alert: AutheliaHighErrorRate
  expr: >
    rate(authelia_request_duration_seconds_count{code="408"}[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Authelia elevated 408 error rate on {{ $labels.instance }}"
    description: "Request timeout rate >0.1/s for 5 minutes. May indicate scanner load or upstream slowness."
- Restart Prometheus.
- Break-test by temporarily running a scanner against the Authelia endpoint and confirming the alert fires.
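A sketch of the compose change plus an endpoint check (service name and compose path are assumptions; match the actual files):
# In ~/authelia/docker-compose.yml, under the authelia service:
#   ports:
#     - "9959:9959"
# Then verify the endpoint answers:
curl -s http://192.168.99.186:9959/metrics | head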
Recurring Checklist
| Frequency | Task | Notes |
|---|---|---|
| Before any maintenance | Health check | SSH to all nodes, check Prometheus alerts |
| Quarterly | UPS self-test | Automated; verify Prometheus metric updated |
| Monthly | ZFS scrub (if not auto) | Beast das-mirror, Compute3 pool; PBS pools post-recovery |
| Monthly | PBS backup verify | Spot-check recent backups via web UI |
| After each PBS maintenance | zpool status -v + dmesg \| grep usb | Confirm no disconnect events |
| After each update cycle | pvecm status | Confirm quorum intact after any node reboot |
| After cert renewal window | Check cert expiry via openssl | See May 11/18 procedure above |