Homelab Maintenance Runbook

Last updated: 2026-05-06
Status: Living document — update when procedures change.


Regular Operations

Health Check (run before any maintenance session)

# From ansible-control (.190)
# Ad hoc check:
ansible -i ~/ansible/inventory/hosts.yml all -m ping 2>/dev/null | grep -E 'SUCCESS|FAILED|UNREACHABLE'
# Full report script (if present):
~/ansible/scripts/health-check.sh
# Or run manually — see reports/health-check-YYYYMMDD.md for format
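
The recurring checklist below also calls for checking Prometheus alerts before maintenance. A small sketch using the stock Prometheus HTTP API; nothing local is assumed beyond the Prometheus address listed under Prometheus Alerting:

# List currently pending/firing alerts
curl -s http://192.168.99.183:9090/api/v1/alerts | \
  python3 -c "import json,sys; [print(a['state'], a['labels'].get('alertname','')) for a in json.load(sys.stdin)['data']['alerts']]"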

Update Schedule

  • Proxmox nodes: Rolling, one at a time. Migrate guests off → apt update && apt dist-upgrade -y → reboot → verify Keepalived/quorum/storage → next node.
  • Order: Compute nodes first (least critical), then beast last (runs TrueNAS, qdevice).
  • Phase 4A procedure (current session): beast (9.1.6→9.1.7+), then compute5 (9.1.6→9.1.7+). 24h soak between nodes. One node per session.
  • PBS: Schedule separately. Always verify backup datastores are accessible post-reboot.

Prometheus Alerting

  • Prometheus: http://192.168.99.183:9090
  • Alert rules: /home/tommy/observability/prometheus/alert_rules.yml on media-server
  • After editing alert rules: restart Prometheus (docker compose -f ~/observability/docker-compose.yml restart prometheus)
  • Alertmanager: http://192.168.99.183:9093
  • Notifications: ntfy bridge → ntfy topic (configured in alertmanager)
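
Before the restart, the edited rules file can be syntax-checked with promtool. A minimal sketch, assuming promtool is available inside the Prometheus container and the rules file is mounted at /etc/prometheus/alert_rules.yml (adjust to the actual mount path):

docker compose -f ~/observability/docker-compose.yml exec prometheus \
  promtool check rules /etc/prometheus/alert_rules.yml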

PBS Recovery Procedure (post USB hub fix)

Precondition: Hardware fix complete (drives on direct USB or powered hub). PBS running for at least 15 minutes with no new USB disconnect events in dmesg.

# SSH to PBS
ssh root@192.168.99.153

# 1. Verify drives are present
lsblk | grep -E '^sd'
# Expected: sdg, sdh, sdi, sdj all present

# 2. Check for USB errors in current boot (must be clean)
dmesg | grep -iE 'disconnect|usb.*error|EIO|i/o error'
# Expected: NO disconnect events for the new drives

# 3. Clear the suspended pool
zpool clear usb1-zfs

# 4. Verify pool state
zpool status
# Expected state: ONLINE (or DEGRADED if errors exist but pool is accessible)

# 5. Run scrubs on both pools
zpool scrub usb1-zfs
zpool scrub usb2-zfs

# 6. Monitor scrub — check every ~15 minutes
zpool status -v
# Wait for "scrub repaired 0B in HH:MM:SS with 0 errors"
# usb1-zfs scrub time: ~2 hours (estimated from April 12 scrub = 1h59m53s for 614 GiB)

# 7. If scrub shows errors: note which blocks, check if mirror leg has clean copy
#    ZFS will auto-repair from mirror if possible.
#    If unrepaired errors remain after scrub, proceed to PBS verify anyway — 
#    those specific chunks will fail verify and you'll know which backups are affected.

# 8. Run PBS backup verification
proxmox-backup-manager datastore list
# For each affected datastore, via web UI: Datastore → usb1-zfs → Verify All
# Or via CLI on PBS itself:
#   proxmox-backup-manager verify usb1-zfs

# 9. Confirm stable operation for 48h before resuming Phase 4A
#    Watch dmesg every few hours: dmesg | tail -20 | grep -i 'usb\|zfs'

If pool won't import after clear:

# Force import by device scan (no cachefile)
zpool import -f usb1-zfs
# If this also fails, check: zpool import -d /dev/disk/by-id/
# and verify drive serial numbers match expected
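
For the 48h observation window in step 9, the manual dmesg checks can be backed by a small watcher run from root's crontab. A minimal sketch, assuming PBS also runs node_exporter with a textfile collector at the same path used on Beast; the script path and metric name are illustrative:

#!/usr/bin/env bash
# /usr/local/bin/pbs-usb-watch.sh (hypothetical path): export USB disconnect count since boot
set -euo pipefail
count=$(dmesg | grep -ci 'usb.*disconnect' || true)
# Write a node_exporter textfile metric so the existing Prometheus stack can alert on increases
cat > /var/lib/node_exporter/textfile_collector/pbs_usb_disconnects.prom <<EOF
# HELP pbs_usb_disconnects USB disconnect events seen in dmesg since boot
# TYPE pbs_usb_disconnects gauge
pbs_usb_disconnects ${count}
EOF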

UPS Quarterly Battery Test

Schedule: Automated — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast
Script: /usr/local/bin/ups-quarterly-test.sh
Results: Written to /var/lib/node_exporter/textfile_collector/ups_test.prom
Alert: UPSBatteryTestFailed fires if result != 1 (passed); UPSBatteryTestStale fires if >100 days since last run.

Manual trigger (if needed):

# From ansible-control (.190)
ssh tommy@192.168.99.190
# Run the test script on Beast as root via ansible:
ansible -i ~/ansible/inventory/hosts.yml beast --become --vault-password-file ~/.vault_pass \
  -m shell -a "/usr/local/bin/ups-quarterly-test.sh"

Next scheduled run: First Sunday of July 2026 (2026-07-05 at 03:00 CDT)
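
To confirm the last run (manual or scheduled) actually landed, the textfile can be read directly on Beast. The exact metric names inside ups_test.prom depend on the script, so treat this as a spot check:

ssh tommy@192.168.99.200 "cat /var/lib/node_exporter/textfile_collector/ups_test.prom"
# Expect a pass/fail value of 1 and a recent run timestamp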


TLS Cert Renewal Verification (one-time, May 2026)

4 certificates were in the 30-35d window around May 5. Traefik auto-renews when ≤30d remain.

May 11 check:

ssh tommy@192.168.99.186
# Check Traefik logs for ACME activity in the last 7 days
docker logs traefik --since 168h 2>&1 | grep -iE 'renewed|obtained|acme|certificate' | head -20

# Check acme.json directly for notAfter dates (certs are stored base64-encoded; a jq sketch is below)
# Simpler: check cert expiry live
for domain in gitea.goattw.net portainer.goattw.net home.goattw.net traefik.goattw.net; do
  expires=$(echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null \
    | openssl x509 -noout -dates 2>/dev/null | grep notAfter)
  echo "${domain}: ${expires}"
done
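
The acme.json route mentioned above can also be scripted without hitting the live endpoints. A sketch, assuming the certresolver is named letsencrypt (adjust the top-level key if not) and that jq is installed on the Traefik host:

sudo jq -r '.letsencrypt.Certificates[] | .domain.main + " " + .certificate' ~/traefik/acme.json | \
while read -r domain cert; do
  printf '%s: ' "$domain"
  echo "$cert" | base64 -d | openssl x509 -noout -enddate
done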

May 18 check: Repeat above. If any domain still shows the original expiry (≤30d from May 5), force renew:

# Force ACME renewal for a specific domain (Traefik)
# Traefik has no direct force-renew command. The usual approach is to remove the
# domain's certificate block from acme.json and restart Traefik:
docker stop traefik
# Edit ~/traefik/acme.json, remove the certificate block for the domain
docker start traefik
# Traefik will request a new cert on startup

Phase 4A — PVE Version Drift (PAUSED, resuming after PBS stable)

Status: On hold until PBS hardware fix + 48h stability confirmed.

Scope:

  • beast: 9.1.6 → 9.1.7 (or latest)
  • compute5: 9.1.6 → 9.1.7 (or latest)
  • Both should reach the same version as compute2, compute3, and compute6 (9.1.7)
  • compute4 is already at 9.1.9; once the majority reaches 9.1.7, decide whether to move all nodes to 9.1.9 or hold (a quick version sweep is sketched after this list)
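
To see the current drift at a glance before starting, per-node versions can be pulled in one pass from ansible-control. A sketch reusing the inventory from the health check above; non-PVE hosts in the inventory will simply report "command not found" and can be ignored:

ansible -i ~/ansible/inventory/hosts.yml all -m shell -a "pveversion" 2>/dev/null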

Per-node procedure (one node per session, 24h soak between):

# --- PRE-UPGRADE: migrate guests off the target node ---
# Example: target = beast
# Move TrueNAS-Scale (VM 100) → compute2 or compute3
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/beast/qemu/100/migrate --target compute2 --online 1"
# Wait for migration, verify VM running on compute2
# Move Media-Server VM (101), docker-node02 VM (108) if on beast — check first
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/beast/qemu --output-format json" | \
  python3 -c "import json,sys; [print(v['vmid'],v['name'],v['status']) for v in json.load(sys.stdin)]"
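
# (Optional, illustrative) Confirm the migrated VM reports "running" on the target before upgrading;
# adjust node/VMID to match what was actually moved:
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/compute2/qemu/100/status/current --output-format json" | \
  python3 -c "import json,sys; print(json.load(sys.stdin)['status'])"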

# --- UPGRADE ---
ssh root@192.168.99.200
apt update && apt dist-upgrade -y
reboot

# --- POST-UPGRADE: verify ---
# From cluster peer:
pvecm status    # quorum intact, all nodes seen
zpool status    # ZFS pools healthy on beast
# From beast after reboot:
pveversion      # new version
systemctl status pve-cluster pve-manager
# Verify the Keepalived VIP still answers (docker-node01 does not run on beast, but check the VIP anyway)

# --- MIGRATE BACK ---
# Move TrueNAS-Scale (VM 100) back to beast
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/compute2/qemu/100/migrate --target beast --online 1"
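
After migrating back, a quick cluster-wide view confirms every VM landed on its expected node. A sketch using the cluster resources API; the fields shown are those returned by pvesh:

ssh tommy@192.168.99.200 "sudo pvesh get /cluster/resources --type vm --output-format json" | \
  python3 -c "import json,sys; [print(r['node'], r['vmid'], r.get('name',''), r['status']) for r in json.load(sys.stdin)]"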

qnetd Migration: PBS → Pi4 (future, after PBS stable + Pi4 SSH resolved)

Trigger: PBS confirmed stable for ≥7 days post-hardware fix AND Pi4 SSH restored.

Steps:

# 1. On PBS: stop and disable qnetd
ssh root@192.168.99.153
systemctl stop corosync-qnetd
systemctl disable corosync-qnetd

# 2. On Pi4: install and start qnetd
ssh pi@192.168.99.227
sudo apt install -y corosync-qnetd
sudo systemctl enable --now corosync-qnetd

# 3. On all cluster nodes: update corosync.conf
#    Change: quorum.device.net.host = 192.168.99.153
#    To:     quorum.device.net.host = 192.168.99.227
#    Run on each node (beast, compute2-6):
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "sed -i 's/host: 192.168.99.153/host: 192.168.99.227/' /etc/corosync/corosync.conf"
done

# 4. Reload corosync on all nodes (no restart needed)
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "corosync-cfgtool -R"
done

# 5. Verify qdevice reconnected to new host
pvecm status | grep -i qdevice
corosync-quorumtool -s

# 6. After confirming, update homelab-configs corosync.conf copy
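
An alternative to hand-editing corosync.conf is the built-in qdevice tooling, which tears down and re-runs the qdevice setup against the new host. A sketch, assuming the standard pvecm subcommands and root SSH from the cluster to the Pi4 (run from any cluster node, after step 2):

# Remove the existing qdevice registration, then re-add it pointing at the Pi4
pvecm qdevice remove
pvecm qdevice setup 192.168.99.227
# Verify as in step 5
pvecm status | grep -i qdevice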

Authelia Metrics + Alert (P5-04)

When: After any Traefik maintenance, once Authelia is confirmed healthy.

Steps:

  1. Edit Authelia config on both nodes (~/authelia/configuration.yml on node01 and node02):
telemetry:
  metrics:
    enabled: true
    address: "tcp://0.0.0.0:9959"
  2. Expose port 9959 in Authelia docker-compose on both nodes.
  3. Add Prometheus scrape target in /home/tommy/observability/prometheus/prometheus.yml:
- job_name: authelia
  static_configs:
    - targets:
        - "192.168.99.186:9959"
        - "192.168.99.187:9959"
  4. Add alert to alert_rules.yml:
- alert: AutheliaHighErrorRate
  expr: >
    rate(authelia_request_duration_seconds_count{code="408"}[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Authelia elevated 408 error rate on {{ $labels.instance }}"
    description: "Request timeout rate >0.1/s for 5 minutes. May indicate scanner load or upstream slowness."
  5. Restart Prometheus.
  6. Break-test by temporarily pointing a scanner at the Authelia endpoint to confirm the alert fires.
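
Once both nodes expose metrics, a quick check from anywhere on the LAN confirms the scrape targets will answer. /metrics is the standard Prometheus exposition path; the IPs and port come from the scrape config above:

for t in 192.168.99.186:9959 192.168.99.187:9959; do
  echo "== $t =="
  curl -s "http://$t/metrics" | grep -m 5 '^authelia_'
done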

Recurring Checklist

  • Before any maintenance: Health check. SSH to all nodes, check Prometheus alerts.
  • Quarterly: UPS self-test. Automated; verify the Prometheus metric updated.
  • Monthly: ZFS scrub (if not automated). Beast das-mirror, Compute3 pool; PBS pools post-recovery.
  • Monthly: PBS backup verify. Spot-check recent backups via web UI.
  • After each PBS maintenance: zpool status -v + dmesg | grep usb. Confirm no disconnect events.
  • After each update cycle: pvecm status. Confirm quorum intact after any node reboot.
  • After cert renewal window: Check cert expiry via openssl. See the May 11/18 procedure above.