homelab-configs/pbs/zfs-health-check.sh at main


tommy 23194ed22a pbs: rewrite zfs-health-check.sh, enable textfile collector

- Fix silent failure: script now posts to dedicated zfs-health ntfy topic
  instead of grafana-alerts catch-all (pools were offline 12+ hours
  undetected because alerts were buried in Grafana noise)
- Three explicit states: ONLINE (silent), DEGRADED (high priority),
  MISSING (urgent priority); empty zpool list output now raises a
  MISSING alert instead of being silently ignored
- Textfile metrics written atomically, and only after the loop completes:
  zfs_pool_present{pool=X} 0|1 and zfs_health_last_run_seconds
- Added trap cleanup so mid-script crash leaves previous .prom intact
- Logs each pool state to syslog via logger -t zfs-health-check
- Remove duplicate cron entry running as tommy (was firing twice per tick)
- Enable node_exporter textfile collector for Prometheus scraping
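
The state handling and atomic metrics write described in the list above can be sketched roughly as follows. This is not the committed script, only a minimal illustration under stated assumptions: the function names (classify_pool, priority_for, write_metrics, run_check), the EXPECTED_POOLS list, the ntfy URL, and the .prom path are all hypothetical, and treating every non-ONLINE, non-empty health value as DEGRADED is a simplification. The label value is quoted because the Prometheus text format requires it.

```shell
#!/bin/sh
# Sketch only: every name, path, and URL here is a placeholder, not taken from the repo.
EXPECTED_POOLS="usb1-zfs usb2-zfs"
NTFY_URL="https://ntfy.example.com/zfs-health"
PROM_FILE="/var/lib/node_exporter/textfile/zfs_health.prom"

# Map a zpool health string to one of the three states.
# An empty string stands for "zpool list printed nothing for this pool".
classify_pool() {
    case "$1" in
        ONLINE) echo "ONLINE" ;;
        "")     echo "MISSING" ;;
        *)      echo "DEGRADED" ;;   # assumption: any other health value alerts as DEGRADED
    esac
}

# ntfy priority per state; empty output means "send nothing".
priority_for() {
    case "$1" in
        DEGRADED) echo "high" ;;
        MISSING)  echo "urgent" ;;
        *)        echo "" ;;
    esac
}

# write_metrics OUT_FILE "pool=0|1"...
# The body runs in a subshell so the EXIT trap fires on any early exit and
# removes the temp file, leaving the previous OUT_FILE untouched.
write_metrics() (
    out="$1"; shift
    # Temp file sits next to the target so the final mv is a same-filesystem rename.
    tmp="$(mktemp "${out}.XXXXXX")" || exit 1
    trap 'rm -f "$tmp"' EXIT
    for entry in "$@"; do
        printf 'zfs_pool_present{pool="%s"} %s\n' "${entry%%=*}" "${entry##*=}"
    done > "$tmp" || exit 1
    printf 'zfs_health_last_run_seconds %s\n' "$(date +%s)" >> "$tmp" || exit 1
    chmod 644 "$tmp" || exit 1       # mktemp creates 0600; the exporter may run as another user
    mv -f "$tmp" "$out"
)

# One pass over the expected pools: log, alert if needed, then write metrics once.
run_check() {
    set --                           # collect "pool=value" pairs as positional parameters
    for pool in $EXPECTED_POOLS; do
        health="$(zpool list -H -o health "$pool" 2>/dev/null)"
        state="$(classify_pool "$health")"
        logger -t zfs-health-check "pool=$pool state=$state"
        prio="$(priority_for "$state")"
        if [ -n "$prio" ]; then
            curl -fsS -H "Priority: $prio" -d "ZFS pool $pool is $state" "$NTFY_URL" || true
        fi
        if [ "$state" = "MISSING" ]; then
            set -- "$@" "$pool=0"
        else
            set -- "$@" "$pool=1"
        fi
    done
    write_metrics "$PROM_FILE" "$@"
}
```

Calling run_check from cron would perform one full pass. Because the temp file name does not end in .prom, the textfile collector should skip it until the rename puts the finished file in place.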

Incident: usb1-zfs and usb2-zfs offline since PBS boot (missing cachefile).
Imported and cachefile regenerated in this session. No data errors.

Refs: 2026-05-05 health check CRITICAL C1/C4

2026-05-05 19:44:36 -05:00

2.6 KiB

Executable File
