# Homelab Maintenance Runbook

Last updated: 2026-05-06
Status: Living document — update when procedures change.

---

## Regular Operations

### Health Check (run before any maintenance session)

```bash
# From ansible-control (.190)
# Ad hoc check:
ansible -i ~/ansible/inventory/hosts.yml all -m ping 2>/dev/null | grep -E 'SUCCESS|FAILED|UNREACHABLE'

# Full report script (if present):
~/ansible/scripts/health-check.sh
# Or run manually — see reports/health-check-YYYYMMDD.md for format
```

### Update Schedule

- **Proxmox nodes:** Rolling, one at a time. Migrate guests off → `apt update && apt dist-upgrade -y` → reboot → verify Keepalived/quorum/storage → next node.
- **Order:** Compute nodes first (least critical), then beast last (runs TrueNAS, qdevice).
- **Phase 4A procedure** (current session): beast (9.1.6→9.1.7+), then compute5 (9.1.6→9.1.7+). 24h soak between nodes. One node per session.
- **PBS:** Schedule separately. Always verify backup datastores are accessible post-reboot.

### Prometheus Alerting

- Prometheus: `http://192.168.99.183:9090`
- Alert rules: `/home/tommy/observability/prometheus/alert_rules.yml` on media-server
- After editing alert rules: restart Prometheus (`docker compose -f ~/observability/docker-compose.yml restart prometheus`)
- Alertmanager: `http://192.168.99.183:9093`
- Notifications: ntfy bridge → ntfy topic (configured in alertmanager)

---

## PBS Recovery Procedure (post USB hub fix)

**Precondition:** Hardware fix complete (drives on direct USB or powered hub). PBS running for at least 15 minutes with no new USB disconnect events in `dmesg`.

```bash
# SSH to PBS
ssh root@192.168.99.153

# 1. Verify drives are present
lsblk | grep -E '^sd'
# Expected: sdg, sdh, sdi, sdj all present

# 2. Check for USB errors in current boot (must be clean)
dmesg | grep -iE 'disconnect|usb.*error|EIO|i/o error'
# Expected: NO disconnect events for the new drives

# 3. Clear the suspended pool
zpool clear usb1-zfs

# 4. Verify pool state
zpool status
# Expected state: ONLINE (or DEGRADED if errors exist but pool is accessible)

# 5. Run scrubs on both pools
zpool scrub usb1-zfs
zpool scrub usb2-zfs

# 6. Monitor scrub — check every ~15 minutes
zpool status -v
# Wait for "scrub repaired 0B in HH:MM:SS with 0 errors"
# usb1-zfs scrub time: ~2 hours (estimated from April 12 scrub = 1h59m53s for 614 GiB)

# 7. If scrub shows errors: note which blocks, check if the mirror leg has a clean copy.
#    ZFS will auto-repair from the mirror if possible.
#    If unrepaired errors remain after scrub, proceed to PBS verify anyway —
#    those specific chunks will fail verify and you'll know which backups are affected.

# 8. Run PBS backup verification
proxmox-backup-manager datastore list
# For each affected datastore, via web UI: Datastore → usb1-zfs → Verify All
# Or via CLI (if proxmox-backup-client is configured):
# proxmox-backup-client verify all --repository root@pam@localhost:usb1-zfs
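
# (Optional sketch: verification runs as a PBS task, so it can also be watched
#  from the CLI instead of the web UI. These are standard `proxmox-backup-manager task`
#  subcommands; replace <UPID> with the task ID the verify job reports.)
proxmox-backup-manager task list            # recent/running tasks, incl. verify jobs
# proxmox-backup-manager task log <UPID>    # follow the log of a specific verify task
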
# 9. Confirm stable operation for 48h before resuming Phase 4A
#    Watch dmesg every few hours:
dmesg | tail -20 | grep -i 'usb\|zfs'
```

**If pool won't import after clear:**

```bash
# Force import by device scan (no cachefile)
zpool import -f usb1-zfs

# If this also fails, check:
zpool import -d /dev/disk/by-id/
# and verify drive serial numbers match expected
```

---

## UPS Quarterly Battery Test

**Schedule:** Automated — first Sunday of Jan/Apr/Jul/Oct at 03:00 on Beast
**Script:** `/usr/local/bin/ups-quarterly-test.sh`
**Results:** Written to `/var/lib/node_exporter/textfile_collector/ups_test.prom`
**Alert:** `UPSBatteryTestFailed` fires if result != 1 (passed); `UPSBatteryTestStale` fires if >100 days since last run.

**Manual trigger (if needed):**

```bash
ssh tommy@192.168.99.200
# tommy's sudo rules only cover qm/pct/pvesh, so run the script as root via ansible --become:
ansible -i ~/ansible/inventory/hosts.yml beast --become --vault-password-file ~/.vault_pass \
  -m shell -a "/usr/local/bin/ups-quarterly-test.sh"
```

**Next scheduled run:** First Sunday of July 2026 (2026-07-05 at 03:00 CDT)

---

## TLS Cert Renewal Verification (one-time, May 2026)

Four certificates were in the 30–35d window around May 5. Traefik auto-renews when ≤30d remain.

**May 11 check:**

```bash
ssh tommy@192.168.99.186

# Check Traefik logs for ACME activity in the last 7 days
docker logs traefik --since 168h 2>&1 | grep -iE 'renewed|obtained|acme|certificate' | head -20

# Check acme.json directly for notAfter dates (requires openssl to decode DER)
# Simpler: check cert expiry live
for domain in gitea.goattw.net portainer.goattw.net home.goattw.net traefik.goattw.net; do
  expires=$(echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null \
    | openssl x509 -noout -dates 2>/dev/null | grep notAfter)
  echo "${domain}: ${expires}"
done
```

**May 18 check:** Repeat above. If any domain still shows the original expiry (≤30d from May 5), force renew:

```bash
# Force ACME renewal for a specific domain (Traefik)
# Traefik doesn't have a direct force-renew command. Options:
#   1. Remove domain from acme.json and restart Traefik (it will re-request)
#   2. Touch the cert file to make Traefik re-evaluate
#   3. Delete the entry from acme.json:
docker stop traefik
# Edit ~/traefik/acme.json, remove the certificate block for the domain
docker start traefik
# Traefik will request a new cert on startup
```

---

## Phase 4A — PVE Version Drift (PAUSED, resuming after PBS stable)

**Status:** On hold until PBS hardware fix + 48h stability confirmed.
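
Before the first session, it can help to confirm the current drift from ansible-control; a minimal sketch (the `proxmox` inventory group name is an assumption — substitute whatever group covers the PVE nodes):

```bash
# Confirm current PVE versions across the cluster before starting
ansible -i ~/ansible/inventory/hosts.yml proxmox --become --vault-password-file ~/.vault_pass \
  -m command -a "pveversion"
```
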
**Scope:**

- beast: 9.1.6 → 9.1.7 (or latest)
- compute5: 9.1.6 → 9.1.7 (or latest)
- Both should reach the same version as compute2, compute3, compute6 (9.1.7)
- compute4 is at 9.1.9 — if the majority reaches 9.1.7, decide whether to move all to 9.1.9 or hold

**Per-node procedure (one node per session, 24h soak between):**

```bash
# --- PRE-UPGRADE: migrate guests off the target node ---
# Example: target = beast

# Move TrueNAS-Scale (VM 100) → compute2 or compute3
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/beast/qemu/100/migrate --target compute2 --online 1"
# Wait for migration, verify VM running on compute2

# Move Media-Server VM (101), docker-node02 VM (108) if on beast — check first
ssh tommy@192.168.99.200 "sudo pvesh get /nodes/beast/qemu --output-format json" | \
  python3 -c "import json,sys; [print(v['vmid'],v['name'],v['status']) for v in json.load(sys.stdin)]"

# --- UPGRADE ---
ssh root@192.168.99.200
apt update && apt dist-upgrade -y
reboot

# --- POST-UPGRADE: verify ---
# From a cluster peer:
pvecm status     # quorum intact, all nodes seen
zpool status     # ZFS pools healthy on beast

# From beast after reboot:
pveversion       # new version
systemctl status pve-cluster pve-manager
# Verify the Keepalived VIP is still answering (beast doesn't host docker-node01, but check the VIP anyway)

# --- MIGRATE BACK ---
# Move TrueNAS-Scale (VM 100) back to beast
ssh tommy@192.168.99.200 "sudo pvesh create /nodes/compute2/qemu/100/migrate --target beast --online 1"
```

---

## qnetd Migration: PBS → Pi4 (future, after PBS stable + Pi4 SSH resolved)

**Trigger:** PBS confirmed stable for ≥7 days post-hardware fix AND Pi4 SSH restored.

**Steps:**

```bash
# 1. On PBS: stop and disable qnetd
ssh root@192.168.99.153
systemctl stop corosync-qnetd
systemctl disable corosync-qnetd

# 2. On Pi4: install and start qnetd
ssh pi@192.168.99.227
sudo apt install -y corosync-qnetd
sudo systemctl enable --now corosync-qnetd

# 3. On all cluster nodes: update corosync.conf
#    Change: quorum.device.net.host = 192.168.99.153
#    To:     quorum.device.net.host = 192.168.99.227
# Run on each node (beast, compute2-6):
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "sed -i 's/host: 192.168.99.153/host: 192.168.99.227/' /etc/corosync/corosync.conf"
done

# 4. Reload corosync on all nodes (no restart needed)
for ip in 192.168.99.200 192.168.99.192 192.168.99.193 192.168.99.194 192.168.99.196 192.168.99.198; do
  ssh root@$ip "corosync-cfgtool -R"
done

# 5. Verify qdevice reconnected to new host
pvecm status | grep -i qdevice
corosync-quorumtool -s

# 6. After confirming, update homelab-configs corosync.conf copy
```

---

## Authelia Metrics + Alert (P5-04)

**When:** After confirming Authelia is healthy post any Traefik maintenance.

**Steps:**

1. Edit Authelia config on both nodes (`~/authelia/configuration.yml` on node01 and node02):

   ```yaml
   telemetry:
     metrics:
       enabled: true
       address: "tcp://0.0.0.0:9959"
   ```

2. Expose port 9959 in the Authelia docker-compose on both nodes.

3. Add a Prometheus scrape target in `/home/tommy/observability/prometheus/prometheus.yml`:

   ```yaml
   - job_name: authelia
     static_configs:
       - targets:
           - "192.168.99.186:9959"
           - "192.168.99.187:9959"
   ```

4. Add an alert to alert_rules.yml:

   ```yaml
   - alert: AutheliaHighErrorRate
     expr: >
       rate(authelia_request_duration_seconds_count{code="408"}[5m]) > 0.1
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Authelia elevated 408 error rate on {{ $labels.instance }}"
       description: "Request timeout rate >0.1/s for 5 minutes. May indicate scanner load or upstream slowness."
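
   # (Optional sketch, beyond the original step: also alert if the Authelia
   #  metrics endpoint stops answering scrapes. The job name matches the
   #  scrape config added in step 3.)
   - alert: AutheliaMetricsDown
     expr: up{job="authelia"} == 0
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Authelia metrics endpoint down on {{ $labels.instance }}"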
   ```

5. Restart Prometheus.

6. Break-test by temporarily enabling a scanner against the Authelia endpoint.

---

## Recurring Checklist

| Frequency | Task | Notes |
|---|---|---|
| Before any maintenance | Health check | SSH to all nodes, check Prometheus alerts |
| Quarterly | UPS self-test | Automated; verify Prometheus metric updated |
| Monthly | ZFS scrub (if not auto) | Beast das-mirror, Compute3 pool; PBS pools post-recovery |
| Monthly | PBS backup verify | Spot-check recent backups via web UI |
| After each PBS maintenance | `zpool status -v` + `dmesg \| grep usb` | Confirm no disconnect events |
| After each update cycle | `pvecm status` | Confirm quorum intact after any node reboot |
| After cert renewal window | Check cert expiry via openssl | See May 11/18 procedure above |
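
For the "check Prometheus alerts" part of the pre-maintenance health check, currently firing alerts can be pulled from the Prometheus HTTP API; a minimal sketch (uses the standard `/api/v1/alerts` endpoint against the Prometheus instance listed above):

```bash
# List active alerts (state + alertname) before starting a maintenance session
curl -s http://192.168.99.183:9090/api/v1/alerts \
  | python3 -c "import json,sys; [print(a['state'], a['labels'].get('alertname','')) for a in json.load(sys.stdin)['data']['alerts']]"
```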