node02 was missing two blocks present on node01 (canonical):
- strip-trailing-dot-speedtest middleware (regex redirect for speedtest.goattw.net. URLs)
- speedtest-trailing-dot router (catches trailing-dot Host header variant)
crowdsecLapiHost intentionally differs: node01 uses the Docker service name
(crowdsec:8080, container on same host); node02 points to node01's IP
(192.168.99.186:8081, since node02 has no local CrowdSec instance).
Added traefik-drift-check.sh — runs daily at 06:00 on ansible-control,
diffs both configs (excluding known crowdsecLapiHost difference),
posts to ntfy homelab-alerts on unexpected divergence.
Traefik hot-reloaded on node02 via SIGHUP — no restart required.
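
The drift check described above can be sketched roughly as follows. This is
a minimal illustration, not the actual contents of traefik-drift-check.sh:
the function names, the NTFY_URL variable, and the way the two config files
reach ansible-control are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch of a config drift check; the real script may differ.

# Print a config with the known, intentional difference filtered out.
strip_known_diff() {
    # grep exits 1 when nothing is printed; that is not an error here.
    grep -v 'crowdsecLapiHost' "$1" || true
}

# Compare two configs after filtering.
# Returns 0 if they match, 1 if they diverge (unified diff on stdout).
configs_diff() {
    a=$(mktemp) && b=$(mktemp) || return 2
    strip_known_diff "$1" > "$a"
    strip_known_diff "$2" > "$b"
    rc=0
    diff -u "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Post a message to the homelab-alerts topic.
# NTFY_URL is an assumed variable holding the ntfy server base URL.
notify_drift() {
    curl -fsS -H "Title: Traefik config drift" -H "Priority: high" \
        -d "$1" "$NTFY_URL/homelab-alerts"
}
```

A caller would run something like
`out=$(configs_diff node01.yml node02.yml) || notify_drift "$out"`,
with the file names here being placeholders for wherever the two configs
are fetched to.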
- Fix silent failure: script now posts to dedicated zfs-health ntfy topic
instead of grafana-alerts catch-all (pools were offline 12+ hours
undetected because alerts were buried in Grafana noise)
- Three explicit states: ONLINE (silent), DEGRADED (high priority),
MISSING (urgent priority) — empty zpool list output is now a MISSING
alert, not silently ignored
- Textfile metrics are written atomically, and only after the loop completes:
zfs_pool_present{pool=X} 0|1 and zfs_health_last_run_seconds
- Add trap cleanup so a mid-script crash leaves the previous .prom file intact
- Logs each pool state to syslog via logger -t zfs-health-check
- Remove duplicate cron entry running as tommy (was firing twice per tick)
- Enable node_exporter textfile collector for Prometheus scraping
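
The three-state logic and the atomic textfile write listed above could look
something like the sketch below. It is illustrative only: the function names
are hypothetical, and treating every non-empty, non-ONLINE health value as
DEGRADED is an assumption, since the commit only names three states.

```shell
#!/bin/sh
# Hypothetical sketch; not the actual health-check script.

# Map the health string for one pool to a state. The string would come
# from something like: zpool list -H -o health "$pool" 2>/dev/null
classify_pool() {
    case "$1" in
        ONLINE) echo ONLINE ;;
        "")     echo MISSING ;;   # empty output: pool is not imported
        *)      echo DEGRADED ;;  # assumption: any other value alerts
    esac
}

# Map a state to an ntfy priority; ONLINE prints nothing (no alert).
priority_for() {
    case "$1" in
        DEGRADED) echo high ;;
        MISSING)  echo urgent ;;
    esac
}

# Write metrics atomically: build a temp file beside the target, then
# rename it over the target. A crash before the mv leaves the previous
# .prom file untouched, and the trap removes the leftover temp file.
# $1 = target .prom path, remaining args = name=0|1 pairs.
write_metrics() {
    out=$1; shift
    tmp=$(mktemp "${out}.XXXXXX") || return 1
    trap 'rm -f "$tmp"' EXIT
    for pair in "$@"; do
        printf 'zfs_pool_present{pool="%s"} %s\n' "${pair%%=*}" "${pair#*=}"
    done > "$tmp"
    printf 'zfs_health_last_run_seconds %s\n' "$(date +%s)" >> "$tmp"
    chmod 644 "$tmp"   # mktemp creates 0600; the exporter must read it
    mv "$tmp" "$out"
}
```

The temp file is created in the same directory as the target so that the mv
is a rename on one filesystem, which is what makes the replacement atomic.
The main loop would collect one pair per pool, log each state with
logger -t zfs-health-check, and call write_metrics once at the end.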
Incident: usb1-zfs and usb2-zfs offline since PBS boot (missing cachefile).
Pools imported and cachefile regenerated in this session. No data errors.
Refs: 2026-05-05 health check CRITICAL C1/C4