Homelab Maintenance Runbook

Last updated: 2026-05-06
Status: Living document — update when procedures change.


Regular Operations

Health Check (run before any maintenance session)

# From ansible-control (.190)
# Ad hoc check:
ansible -i ~/ansible/inventory/hosts.yml all -m ping 2>/dev/null | grep -E 'SUCCESS|FAILED|UNREACHABLE'
# Full report script (if present):
~/ansible/scripts/health-check.sh
# Or run manually — see reports/health-check-YYYYMMDD.md for format
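
The recurring checklist below also calls for checking Prometheus alerts before maintenance. A small sketch using the stock Prometheus HTTP API; nothing local is assumed beyond the Prometheus address listed under Prometheus Alerting:

# List currently pending/firing alerts
curl -s http://192.168.99.183:9090/api/v1/alerts | \
  python3 -c "import json,sys; [print(a['state'], a['labels'].get('alertname','')) for a in json.load(sys.stdin)['data']['alerts']]"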

Update Schedule

  • Proxmox nodes: Rolling, one at a time. Migrate guests off → apt update && apt dist-upgrade -y → reboot → verify Keepalived/quorum/storage → next node.
  • Order: Compute nodes first (least critical), then beast last (runs TrueNAS, qdevice).
  • Phase 4A procedure (current session): beast (9.1.6→9.1.7+), then compute5 (9.1.6→9.1.7+). 24h soak between nodes. One node per session.
  • PBS: Schedule separately. Always verify backup datastores are accessible post-reboot.

Prometheus Alerting

  • Prometheus: http://192.168.99.183:9090
  • Alert rules: /home/tommy/observability/prometheus/alert_rules.yml on media-server
  • After editing alert rules: restart Prometheus (docker compose -f ~/observability/docker-compose.yml restart prometheus)
  • Alertmanager: http://192.168.99.183:9093
  • Notifications: ntfy bridge → ntfy topic (configured in alertmanager)
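
Before the restart, the edited rules file can be syntax-checked with promtool. A minimal sketch, assuming promtool is available inside the Prometheus container and the rules file is mounted at /etc/prometheus/alert_rules.yml (adjust to the actual mount path):

docker compose -f ~/observability/docker-compose.yml exec prometheus \
  promtool check rules /etc/prometheus/alert_rules.yml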

PBS Recovery Procedure (post USB hub fix)

Precondition: Hardware fix complete (drives on direct USB or powered hub). PBS running for at least 15 minutes with no new USB disconnect events in dmesg.

# SSH to PBS
ssh root@192.168.99.153

# 1. Verify drives are present
lsblk | grep -E '^sd'
# Expected: sdg, sdh, sdi, sdj all present

# 2. Check for USB errors in current boot (must be clean)
dmesg | grep -iE 'disconnect|usb.*error|EIO|i/o error'
# Expected: NO disconnect events for the new drives

# 3. Clear the suspended pool
zpool clear usb1-zfs

# 4. Verify pool state
zpool status
# Expected state: ONLINE (or DEGRADED if errors exist but pool is accessible)

# 5. Run scrubs on both pools
zpool scrub usb1-zfs
zpool scrub usb2-zfs

# 6. Monitor scrub — check every ~15 minutes
zpool status -v
# Wait for "scrub repaired 0B in HH:MM:SS with 0 errors"
# usb1-zfs scrub time: ~2 hours (estimated from April 12 scrub = 1h59m53s for 614 GiB)

# 7. If scrub shows errors: note which blocks, check if mirror leg has clean copy
#    ZFS will auto-repair from mirror if possible.
#    If unrepaired errors remain after scrub, proceed to PBS verify anyway — 
#    those specific chunks will fail verify and you'll know which backups are affected.

# 8. Run PBS backup verification
proxmox-backup-manager datastore list
# For each affected datastore, via web UI: Datastore → usb1-zfs → Verify All
# Or via CLI on PBS itself:
#   proxmox-backup-manager verify usb1-zfs

# 9. Confirm stable operation for 48h before resuming Phase 4A
#    Watch dmesg every few hours: dmesg | tail -20 | grep -i 'usb\|zfs'

If pool won't import after clear:

# Force import by device scan (no cachefile)
zpool import -f usb1-zfs
# If this also fails, check: zpool import -d /dev/disk/by-id/
# and verify drive serial numbers match expected
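
For the 48h observation window in step 9, the manual dmesg checks can be backed by a small watcher run from root's crontab. A minimal sketch, assuming PBS also runs node_exporter with a textfile collector at the same path used on Beast; the script path and metric name are illustrative:

#!/usr/bin/env bash
# /usr/local/bin/pbs-usb-watch.sh (hypothetical path): export USB disconnect count since boot
set -euo pipefail
count=$(dmesg | grep -ci 'usb.*disconnect' || true)
# Write a node_exporter textfile metric so the existing Prometheus stack can alert on increases
cat > /var/lib/node_exporter/textfile_collector/pbs_usb_disconnects.prom <<EOF
# HELP pbs_usb_disconnects USB disconnect events seen in dmesg since boot
# TYPE pbs_usb_disconnects gauge
pbs_usb_disconnects ${count}
EOF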

UPS Quarterly Battery Test

Schedule: Automated — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast
Script: /usr/local/bin/ups-quarterly-test.sh
Results: Written to /var/lib/node_exporter/textfile_collector/ups_test.prom
Alert: UPSBatteryTestFailed fires if result != 1 (passed); UPSBatteryTestStale fires if >100 days since last run.

Manual trigger (if needed):

# From ansible-control (.190)
ssh tommy@192.168.99.190
# Run the test script on Beast as root via ansible:
ansible -i ~/ansible/inventory/hosts.yml beast --become --vault-password-file ~/.vault_pass \
  -m shell -a "/usr/local/bin/ups-quarterly-test.sh"

Next scheduled run: First Sunday of July 2026 (2026-07-05 at 03:00 CDT)
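
To confirm the last run (manual or scheduled) actually landed, the textfile can be read directly on Beast. The exact metric names inside ups_test.prom depend on the script, so treat this as a spot check:

ssh tommy@192.168.99.200 "cat /var/lib/node_exporter/textfile_collector/ups_test.prom"
# Expect a pass/fail value of 1 and a recent run timestamp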


TLS Cert Renewal Verification (one-time, May 2026)

4 certificates were in the 30-35d window around May 5. Traefik auto-renews when ≤30d remain.

May 11 check:

ssh tommy@192.168.99.186
# Check Traefik logs for ACME activity in the last 7 days
docker logs traefik --since 168h 2>&1 | grep -iE 'renewed|obtained|acme|certificate' | head -20

# Check acme.json directly for notAfter dates (certs are stored base64-encoded; a jq sketch is below)
# Simpler: check cert expiry live
for domain in gitea.goattw.net portainer.goattw.net home.goattw.net traefik.goattw.net; do
  expires=$(echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null \
    | openssl x509 -noout -dates 2>/dev/null | grep notAfter)
  echo "${domain}: ${expires}"
done
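
The acme.json route mentioned above can also be scripted without hitting the live endpoints. A sketch, assuming the certresolver is named letsencrypt (adjust the top-level key if not) and that jq is installed on the Traefik host:

sudo jq -r '.letsencrypt.Certificates[] | .domain.main + " " + .certificate' ~/traefik/acme.json | \
while read -r domain cert; do
  printf '%s: ' "$domain"
  echo "$cert" | base64 -d | openssl x509 -noout -enddate
done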

May 18 check: Repeat above. If any domain still shows the original expiry (≤30d from May 5), force renew:

# Force ACME renewal for a specific domain (Traefik)
# Traefik has no direct force-renew command. The usual approach is to remove the
# domain's certificate block from acme.json and restart Traefik:
docker stop traefik
# Edit ~/traefik/acme.json, remove the certificate block for the domain
docker start traefik
# Traefik will request a new cert on startup

Phase 4A — PVE Version Drift (PAUSED, resuming after PBS stable)

Status: On hold until PBS hardware fix + 48h stability confirmed.

Scope:

  • beast: 9.1.6 → 9.1.7 (or latest)
  • compute5: 9.1.6 → 9.1.7 (or latest)
  • Both should reach the same version as compute2, compute3, and compute6 (9.1.7)
  • compute4 is already at 9.1.9; once the majority reaches 9.1.7, decide whether to move all nodes to 9.1.9 or hold (a quick version sweep is sketched after this list)
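
To see the current drift at a glance before starting, per-node versions can be pulled in one pass from ansible-control. A sketch reusing the inventory from the health check above; non-PVE hosts in the inventory will simply report "command not found" and can be ignored:

ansible -i ~/ansible/inventory/hosts.yml all -m shell -a "pveversion" 2>/dev/null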

Per-node procedure (one node per session, 24h soak between):

# --- PRE-UPGRADE: migrate guests off the target node ---
# Example: target = beast
# Move TrueNAS-Scale (VM 100) → compute2 or compute3
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/beast/qemu/100/migrate --target compute2 --online 1"
# Wait for migration, verify VM running on compute2
# Move Media-Server VM (101), docker-node02 VM (108) if on beast — check first
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/beast/qemu --output-format json" | \
  python3 -c "import json,sys; [print(v['vmid'],v['name'],v['status']) for v in json.load(sys.stdin)]"
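
# (Optional, illustrative) Confirm the migrated VM reports "running" on the target before upgrading;
# adjust node/VMID to match what was actually moved:
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/compute2/qemu/100/status/current --output-format json" | \
  python3 -c "import json,sys; print(json.load(sys.stdin)['status'])"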

# --- UPGRADE ---
ssh root@192.168.99.200
apt update && apt dist-upgrade -y
reboot

# --- POST-UPGRADE: verify ---
# From cluster peer:
pvecm status    # quorum intact, all nodes seen
zpool status    # ZFS pools healthy on beast
# From beast after reboot:
pveversion      # new version
systemctl status pve-cluster pve-manager
# Verify the Keepalived VIP still answers (docker-node01 does not run on beast, but check the VIP anyway)

# --- MIGRATE BACK ---
# Move TrueNAS-Scale (VM 100) back to beast
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/compute2/qemu/100/migrate --target beast --online 1"
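
After migrating back, a quick cluster-wide view confirms every VM landed on its expected node. A sketch using the cluster resources API; the fields shown are those returned by pvesh:

ssh tommy@192.168.99.200 "sudo pvesh get /cluster/resources --type vm --output-format json" | \
  python3 -c "import json,sys; [print(r['node'], r['vmid'], r.get('name',''), r['status']) for r in json.load(sys.stdin)]"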

qnetd Migration: PBS → Pi4 (future, after PBS stable + Pi4 SSH resolved)

Trigger: PBS confirmed stable for ≥7 days post-hardware fix AND Pi4 SSH restored.

Steps:

# 1. On PBS: stop and disable qnetd
ssh root@192.168.99.153
systemctl stop corosync-qnetd
systemctl disable corosync-qnetd

# 2. On Pi4: install and start qnetd
ssh pi@192.168.99.227
sudo apt install -y corosync-qnetd
sudo systemctl enable --now corosync-qnetd

# 3. On all cluster nodes: update corosync.conf
#    Change: quorum.device.net.host = 192.168.99.153
#    To:     quorum.device.net.host = 192.168.99.227
#    Run on each node (beast, compute2-6):
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "sed -i 's/host: 192.168.99.153/host: 192.168.99.227/' /etc/corosync/corosync.conf"
done

# 4. Reload corosync on all nodes (no restart needed)
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "corosync-cfgtool -R"
done

# 5. Verify qdevice reconnected to new host
pvecm status | grep -i qdevice
corosync-quorumtool -s

# 6. After confirming, update homelab-configs corosync.conf copy
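
An alternative to hand-editing corosync.conf is the built-in qdevice tooling, which tears down and re-runs the qdevice setup against the new host. A sketch, assuming the standard pvecm subcommands and root SSH from the cluster to the Pi4 (run from any cluster node, after step 2):

# Remove the existing qdevice registration, then re-add it pointing at the Pi4
pvecm qdevice remove
pvecm qdevice setup 192.168.99.227
# Verify as in step 5
pvecm status | grep -i qdevice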

Authelia Metrics + Alert (P5-04)

When: After any Traefik maintenance, once Authelia is confirmed healthy.

Steps:

  1. Edit Authelia config on both nodes (~/authelia/configuration.yml on node01 and node02):
telemetry:
  metrics:
    enabled: true
    address: "tcp://0.0.0.0:9959"
  2. Expose port 9959 in Authelia docker-compose on both nodes.
  3. Add Prometheus scrape target in /home/tommy/observability/prometheus/prometheus.yml:
- job_name: authelia
  static_configs:
    - targets:
        - "192.168.99.186:9959"
        - "192.168.99.187:9959"
  4. Add alert to alert_rules.yml:
- alert: AutheliaHighErrorRate
  expr: >
    rate(authelia_request_duration_seconds_count{code="408"}[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Authelia elevated 408 error rate on {{ $labels.instance }}"
    description: "Request timeout rate >0.1/s for 5 minutes. May indicate scanner load or upstream slowness."
  5. Restart Prometheus.
  6. Break-test by temporarily pointing a scanner at the Authelia endpoint to confirm the alert fires.
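
Once both nodes expose metrics, a quick check from anywhere on the LAN confirms the scrape targets will answer. /metrics is the standard Prometheus exposition path; the IPs and port come from the scrape config above:

for t in 192.168.99.186:9959 192.168.99.187:9959; do
  echo "== $t =="
  curl -s "http://$t/metrics" | grep -m 5 '^authelia_'
done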

Recurring Checklist

  • Before any maintenance: Health check. SSH to all nodes, check Prometheus alerts.
  • Quarterly: UPS self-test. Automated; verify the Prometheus metric updated.
  • Monthly: ZFS scrub (if not automated). Beast das-mirror, Compute3 pool; PBS pools post-recovery.
  • Monthly: PBS backup verify. Spot-check recent backups via web UI.
  • After each PBS maintenance: zpool status -v + dmesg | grep usb. Confirm no disconnect events.
  • After each update cycle: pvecm status. Confirm quorum intact after any node reboot.
  • After cert renewal window: Check cert expiry via openssl. See the May 11/18 procedure above.