# Homelab Maintenance Runbook

Last updated: 2026-05-06
Status: Living document — update when procedures change.

---

## Regular Operations

### Health Check (run before any maintenance session)

```bash
# From ansible-control (.190)
# Ad hoc check:
ansible -i ~/ansible/inventory/hosts.yml all -m ping 2>/dev/null | grep -E 'SUCCESS|FAILED|UNREACHABLE'

# Full report script (if present):
~/ansible/scripts/health-check.sh
# Or run manually — see reports/health-check-YYYYMMDD.md for format
```

### Update Schedule

- **Proxmox nodes:** Rolling, one at a time. Migrate guests off → `apt update && apt dist-upgrade -y` → reboot → verify Keepalived/quorum/storage → next node.
- **Order:** Compute nodes first (least critical), then beast last (runs TrueNAS, qdevice).
- **Phase 4A procedure** (current session): beast (9.1.6→9.1.7+), then compute5 (9.1.6→9.1.7+). 24h soak between nodes. One node per session.
- **PBS:** Schedule separately. Always verify backup datastores are accessible post-reboot.

### Prometheus Alerting

- Prometheus: `http://192.168.99.183:9090`
- Alert rules: `/home/tommy/observability/prometheus/alert_rules.yml` on media-server
- After editing alert rules: restart Prometheus (`docker compose -f ~/observability/docker-compose.yml restart prometheus`)
- Alertmanager: `http://192.168.99.183:9093`
- Notifications: ntfy bridge → ntfy topic (configured in alertmanager)

---

## PBS Recovery Procedure (post USB hub fix)

**Precondition:** Hardware fix complete (drives on direct USB or powered hub). PBS running for at least 15 minutes with no new USB disconnect events in `dmesg`.

```bash
# SSH to PBS
ssh root@192.168.99.153

# 1. Verify drives are present
lsblk | grep -E '^sd'
# Expected: sdg, sdh, sdi, sdj all present

# 2. Check for USB errors in current boot (must be clean)
dmesg | grep -iE 'disconnect|usb.*error|EIO|i/o error'
# Expected: NO disconnect events for the new drives

# 3. Clear the suspended pool
zpool clear usb1-zfs

# 4. Verify pool state
zpool status
# Expected state: ONLINE (or DEGRADED if errors exist but pool is accessible)

# 5. Run scrubs on both pools
zpool scrub usb1-zfs
zpool scrub usb2-zfs

# 6. Monitor scrub — check every ~15 minutes
zpool status -v
# Wait for "scrub repaired 0B in HH:MM:SS with 0 errors"
# usb1-zfs scrub time: ~2 hours (estimated from April 12 scrub = 1h59m53s for 614 GiB)

# 7. If scrub shows errors: note which blocks, check if the mirror leg has a clean copy.
#    ZFS will auto-repair from the mirror if possible.
#    If unrepaired errors remain after scrub, proceed to PBS verify anyway —
#    those specific chunks will fail verify and you'll know which backups are affected.

# 8. Run PBS backup verification
proxmox-backup-manager datastore list
# For each affected datastore, via web UI: Datastore → usb1-zfs → Verify All
# Or via CLI (if proxmox-backup-client is configured):
# proxmox-backup-client verify all --repository root@pam@localhost:usb1-zfs
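
# (Optional sketch: verification runs as a PBS task, so it can also be watched
#  from the CLI instead of the web UI. These are standard `proxmox-backup-manager task`
#  subcommands; replace <UPID> with the task ID the verify job reports.)
proxmox-backup-manager task list            # recent/running tasks, incl. verify jobs
# proxmox-backup-manager task log <UPID>    # follow the log of a specific verify task
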
# 9. Confirm stable operation for 48h before resuming Phase 4A
#    Watch dmesg every few hours:
dmesg | tail -20 | grep -i 'usb\|zfs'
```

**If pool won't import after clear:**

```bash
# Force import by device scan (no cachefile)
zpool import -f usb1-zfs

# If this also fails, check:
zpool import -d /dev/disk/by-id/
# and verify drive serial numbers match expected
```

---

## UPS Quarterly Battery Test

**Schedule:** Automated — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast
**Script:** `/usr/local/bin/ups-quarterly-test.sh`
**Results:** Written to `/var/lib/node_exporter/textfile_collector/ups_test.prom`
**Alert:** `UPSBatteryTestFailed` fires if result != 1 (passed); `UPSBatteryTestStale` fires if >100 days since last run.

**Manual trigger (if needed):**

```bash
ssh tommy@192.168.99.200
# tommy's sudo rules only cover qm/pct/pvesh, so run the script as root via ansible --become:
ansible -i ~/ansible/inventory/hosts.yml beast --become --vault-password-file ~/.vault_pass \
  -m shell -a "/usr/local/bin/ups-quarterly-test.sh"
```

**Next scheduled run:** First Sunday of July 2026 (2026-07-05 at 03:00 CDT)

---

## TLS Cert Renewal Verification (one-time, May 2026)

Four certificates were in the 30–35d window around May 5. Traefik auto-renews when ≤30d remain.

**May 11 check:**

```bash
ssh tommy@192.168.99.186

# Check Traefik logs for ACME activity in the last 7 days
docker logs traefik --since 168h 2>&1 | grep -iE 'renewed|obtained|acme|certificate' | head -20

# Check acme.json directly for notAfter dates (requires openssl to decode DER)
# Simpler: check cert expiry live
for domain in gitea.goattw.net portainer.goattw.net home.goattw.net traefik.goattw.net; do
  expires=$(echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null \
    | openssl x509 -noout -dates 2>/dev/null | grep notAfter)
  echo "${domain}: ${expires}"
done
```

**May 18 check:** Repeat above. If any domain still shows the original expiry (≤30d from May 5), force renew:

```bash
# Force ACME renewal for a specific domain (Traefik)
# Traefik doesn't have a direct force-renew command. Options:
#   1. Remove domain from acme.json and restart Traefik (it will re-request)
#   2. Touch the cert file to make Traefik re-evaluate
#   3. Delete the entry from acme.json:
docker stop traefik
# Edit ~/traefik/acme.json, remove the certificate block for the domain
docker start traefik
# Traefik will request a new cert on startup
```

---

## Phase 4A — PVE Version Drift (PAUSED, resuming after PBS stable)

**Status:** On hold until PBS hardware fix + 48h stability confirmed.
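
Before the first session, it can help to confirm the current drift from ansible-control; a minimal sketch (the `proxmox` inventory group name is an assumption — substitute whatever group covers the PVE nodes):

```bash
# Confirm current PVE versions across the cluster before starting
ansible -i ~/ansible/inventory/hosts.yml proxmox --become --vault-password-file ~/.vault_pass \
  -m command -a "pveversion"
```
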
**Scope:**

- beast: 9.1.6 → 9.1.7 (or latest)
- compute5: 9.1.6 → 9.1.7 (or latest)
- Both should reach the same version as compute2, compute3, compute6 (9.1.7)
- compute4 is at 9.1.9 — if the majority reaches 9.1.7, decide whether to move all to 9.1.9 or hold

**Per-node procedure (one node per session, 24h soak between):**

```bash
# --- PRE-UPGRADE: migrate guests off the target node ---
# Example: target = beast

# Move TrueNAS-Scale (VM 100) → compute2 or compute3
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/beast/qemu/100/migrate --target compute2 --online 1"
# Wait for migration, verify VM running on compute2

# Move Media-Server VM (101), docker-node02 VM (108) if on beast — check first
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/beast/qemu --output-format json" | \
  python3 -c "import json,sys; [print(v['vmid'],v['name'],v['status']) for v in json.load(sys.stdin)]"

# --- UPGRADE ---
ssh root@192.168.99.200
apt update && apt dist-upgrade -y
reboot

# --- POST-UPGRADE: verify ---
# From a cluster peer:
pvecm status     # quorum intact, all nodes seen
zpool status     # ZFS pools healthy on beast

# From beast after reboot:
pveversion       # new version
systemctl status pve-cluster pve-manager
# Verify the Keepalived VIP is still answering (beast doesn't host docker-node01, but check the VIP anyway)

# --- MIGRATE BACK ---
# Move TrueNAS-Scale (VM 100) back to beast
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/compute2/qemu/100/migrate --target beast --online 1"
```

---

## qnetd Migration: PBS → Pi4 (future, after PBS stable + Pi4 SSH resolved)

**Trigger:** PBS confirmed stable for ≥7 days post-hardware fix AND Pi4 SSH restored.

**Steps:**

```bash
# 1. On PBS: stop and disable qnetd
ssh root@192.168.99.153
systemctl stop corosync-qnetd
systemctl disable corosync-qnetd

# 2. On Pi4: install and start qnetd
ssh pi@192.168.99.227
sudo apt install -y corosync-qnetd
sudo systemctl enable --now corosync-qnetd

# 3. On all cluster nodes: update corosync.conf
#    Change: quorum.device.net.host = 192.168.99.153
#    To:     quorum.device.net.host = 192.168.99.227
# Run on each node (beast, compute2-6):
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "sed -i 's/host: 192.168.99.153/host: 192.168.99.227/' /etc/corosync/corosync.conf"
done

# 4. Reload corosync on all nodes (no restart needed)
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "corosync-cfgtool -R"
done

# 5. Verify qdevice reconnected to new host
pvecm status | grep -i qdevice
corosync-quorumtool -s

# 6. After confirming, update homelab-configs corosync.conf copy
```

---

## Authelia Metrics + Alert (P5-04)

**When:** After confirming Authelia is healthy post any Traefik maintenance.

**Steps:**

1. Edit Authelia config on both nodes (`~/authelia/configuration.yml` on node01 and node02):

   ```yaml
   telemetry:
     metrics:
       enabled: true
       address: "tcp://0.0.0.0:9959"
   ```

2. Expose port 9959 in the Authelia docker-compose on both nodes.

3. Add a Prometheus scrape target in `/home/tommy/observability/prometheus/prometheus.yml`:

   ```yaml
   - job_name: authelia
     static_configs:
       - targets:
           - "192.168.99.186:9959"
           - "192.168.99.187:9959"
   ```

4. Add an alert to alert_rules.yml:

   ```yaml
   - alert: AutheliaHighErrorRate
     expr: >
       rate(authelia_request_duration_seconds_count{code="408"}[5m]) > 0.1
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Authelia elevated 408 error rate on {{ $labels.instance }}"
       description: "Request timeout rate >0.1/s for 5 minutes. May indicate scanner load or upstream slowness."
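
   # (Optional sketch, beyond the original step: also alert if the Authelia
   #  metrics endpoint stops answering scrapes. The job name matches the
   #  scrape config added in step 3.)
   - alert: AutheliaMetricsDown
     expr: up{job="authelia"} == 0
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Authelia metrics endpoint down on {{ $labels.instance }}"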
   ```

5. Restart Prometheus.

6. Break-test by temporarily enabling a scanner against the Authelia endpoint.

---

## Recurring Checklist

| Frequency | Task | Notes |
|---|---|---|
| Before any maintenance | Health check | SSH to all nodes, check Prometheus alerts |
| Quarterly | UPS self-test | Automated; verify Prometheus metric updated |
| Monthly | ZFS scrub (if not auto) | Beast das-mirror, Compute3 pool; PBS pools post-recovery |
| Monthly | PBS backup verify | Spot-check recent backups via web UI |
| After each PBS maintenance | `zpool status -v` + `dmesg \| grep usb` | Confirm no disconnect events |
| After each update cycle | `pvecm status` | Confirm quorum intact after any node reboot |
| After cert renewal window | Check cert expiry via openssl | See May 11/18 procedure above |
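
For the "check Prometheus alerts" part of the pre-maintenance health check, currently firing alerts can be pulled from the Prometheus HTTP API; a minimal sketch (uses the standard `/api/v1/alerts` endpoint against the Prometheus instance listed above):

```bash
# List active alerts (state + alertname) before starting a maintenance session
curl -s http://192.168.99.183:9090/api/v1/alerts \
  | python3 -c "import json,sys; [print(a['state'], a['labels'].get('alertname','')) for a in json.load(sys.stdin)['data']['alerts']]"
```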