2 Commits

Author SHA1 Message Date
tommy
38fa22d444 monitoring: add quarterly UPS self-test script (Phase 4B)
ups-quarterly-test.sh:
  - Runs test.battery.start.quick on cyberpower1 then cyberpower2
  - 120s wait between tests (allow recharge)
  - Logs pass/fail to syslog via logger -t ups-quarterly-test
  - Password stored in single-quoted variable to prevent shell expansion
  - Deployed on beast (/usr/local/bin/), cron: first Sunday of Jan/Apr/Jul/Oct 02:00
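
A minimal sketch of how the script might be laid out, not the deployed file. The user name, the password value, the helper names (is_first_sunday, run_quick_test, main) and the in-script date guard are all assumptions for illustration. The guard is there because in common cron implementations a restricted day-of-month field and a restricted day-of-week field are ORed, so "first Sunday" cannot be expressed with the five time fields alone. It relies on GNU date (-d).

```shell
#!/usr/bin/env bash
# Sketch only. Assumed crontab entry (days 1-7, guard picks the Sunday):
#   0 2 1-7 1,4,7,10 * /usr/local/bin/ups-quarterly-test.sh

UPS_USER='upsadmin'            # hypothetical NUT user
UPS_PASS='p@$$w0rd!example'    # placeholder; single quotes keep $$ and ! literal
UPS_LIST=(cyberpower1 cyberpower2)
RECHARGE_WAIT=120              # seconds between tests, to allow recharge

# Succeeds when the given date (default: today) is the first Sunday of its month.
is_first_sunday() {
    local when="${1:-today}" dow dom
    dow="$(date -d "$when" +%u)"   # 1 = Monday .. 7 = Sunday
    dom="$(date -d "$when" +%d)"   # zero-padded day of month
    dom=$((10#$dom))               # force base 10 so 08 and 09 parse
    [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}

# Starts a quick battery test on one UPS.
run_quick_test() {
    upscmd -u "$UPS_USER" -p "$UPS_PASS" "$1" test.battery.start.quick
}

main() {
    is_first_sunday || exit 0      # cron fires on days 1-7; act only on Sunday
    local i ups
    for i in "${!UPS_LIST[@]}"; do
        ups="${UPS_LIST[$i]}"
        if (( i > 0 )); then sleep "$RECHARGE_WAIT"; fi
        if run_quick_test "$ups"; then
            logger -t ups-quarterly-test "$ups: quick test started"
        else
            logger -t ups-quarterly-test "$ups: FAIL (could not start test)"
        fi
    done
}
```

Nothing runs until main is called; a deployed version would end with a line such as main "$@".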

Manual run 2026-05-06:
  cyberpower1: Done and passed (charge 97% post-test, recharged normally)
  cyberpower2: Done and passed (charge 100%)
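
The strings above look like values of the NUT variable ups.test.result. A hedged sketch of how the script could turn that value into the syslog pass/fail line follows. The function names are hypothetical, and treating every string other than "Done and passed" or an in-progress report as a failure is an assumption, since result strings vary by driver.

```shell
#!/usr/bin/env bash
# Sketch only: maps a ups.test.result string to pass / pending / fail.

classify_test_result() {
    case "$1" in
        "Done and passed") echo pass ;;
        *"In progress"*)   echo pending ;;   # assumed wording for a running test
        *)                 echo fail ;;      # anything else, including empty
    esac
}

# Reads the last self-test result from the local NUT server.
read_test_result() {
    upsc "$1" ups.test.result 2>/dev/null
}

# Writes one line per UPS to syslog under the ups-quarterly-test tag.
log_test_result() {
    local ups="$1" result verdict
    result="$(read_test_result "$ups")"
    verdict="$(classify_test_result "$result")"
    logger -t ups-quarterly-test "$ups: $verdict (${result:-no result})"
}
```

Called as log_test_result cyberpower1, this would emit a single tagged line that can be filtered from syslog by the ups-quarterly-test tag.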

Grafana alerts (in grafana.db):
  cfl8lrs1mxnnka  UPS Battery Charge Low (<80%) - break-tested, reached Pending ✓
  afl8lrs4mbaioa  UPS On Battery (power outage) - break-tested, fired ✓

Note: nut-exporter v1.2.1 does not expose nut_battery_test_result.
Pass/fail is tracked via syslog only for now. Will add a metric-based
alert in Phase 5 if the exporter gains test-result metric support.
2026-05-06 06:05:50 -05:00
tommy
e3ee020d53 monitoring: add postfix queue + cert expiry scripts, Phase 4D alerts
postfix-queue-check.sh:
  - Reads mailq queue depth, writes postfix_queue_size{host=X} textfile metric
  - Deployed on compute3 (systemd node_exporter) and compute5 (Docker)
  - Cron: */5 * * * * as root on each host
  - Grafana alert on the Prometheus metric: postfix_queue_size > 10 (uid: efl8kjns461a8f)
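
A minimal sketch of the queue-depth step described above, assuming Postfix's usual mailq summary line ("-- N Kbytes in M Requests." or "Mail queue is empty"). The function names and the postfix_queue.prom file name are hypothetical. The metric is written to a temporary file and renamed so the textfile collector never reads a half-written file.

```shell
#!/usr/bin/env bash
# Sketch only: parse mailq output and publish a node_exporter textfile metric.

# Reads mailq output on stdin, prints the number of queued messages.
parse_queue_depth() {
    awk '/^-- .* in [0-9]+ Request/ { n = $(NF-1) } END { print n + 0 }'
}

# write_metric DEPTH HOST DIR
write_metric() {
    local depth="$1" host="$2" dir="$3" tmp
    # Temp name does not end in .prom, so the collector ignores it.
    tmp="$(mktemp "$dir/postfix_queue.prom.XXXXXX")" || return 1
    {
        printf '# HELP postfix_queue_size Messages in the Postfix queue.\n'
        printf '# TYPE postfix_queue_size gauge\n'
        printf 'postfix_queue_size{host="%s"} %d\n' "$host" "$depth"
    } > "$tmp"
    chmod 644 "$tmp"                       # mktemp creates the file 0600
    mv "$tmp" "$dir/postfix_queue.prom"    # atomic within one filesystem
}
```

A cron-driven wrapper could then run something like write_metric "$(mailq | parse_queue_depth)" "$(hostname -s)" /var/lib/node_exporter/textfile on each host.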

node-exporter-compute5-compose.yml:
  - Added textfile volume mount /var/lib/node_exporter/textfile:/textfile:ro
  - Added --collector.textfile.directory=/textfile flag

cert-expiry-check.sh:
  - Copy also kept here so it is grouped with the other monitoring/ scripts

Phase 4D Grafana alert rules (all in Infrastructure Alerts folder):
  cfl8jqdlhu680d  TLS Cert Expiry Warning (30d)        — break-tested ✓
  afl8jqdoepwqod  TLS Cert ACME Renewal Failure (14d)  — no real certs in window
  ffl8k2ry0nu2od  Alertmanager Down                     — break-tested, fired ✓
  efl8kjns461a8f  Postfix Queue Backing Up              — metric confirmed, 5m window
  dfl8k2s0xjklcf  Authelia Restart Loop                 — cadvisor-based proxy metric

Rules stored in grafana.db only — not yaml-provisioned (Phase 5 candidate)
2026-05-06 05:46:07 -05:00