tommy
|
38fa22d444
|
monitoring: add quarterly UPS self-test script (Phase 4B)
ups-quarterly-test.sh:
- Runs test.battery.start.quick on cyberpower1 then cyberpower2
- 120s wait between tests (allow recharge)
- Logs pass/fail to syslog via logger -t ups-quarterly-test
- Password stored in single-quoted variable to prevent shell expansion
- Deployed on beast (/usr/local/bin/), cron: first Sunday of Jan/Apr/Jul/Oct 02:00
Manual run 2026-05-06:
cyberpower1: Done and passed (charge 97% post-test, recharged normally)
cyberpower2: Done and passed (charge 100%)
Grafana alerts (in grafana.db):
cfl8lrs1mxnnka UPS Battery Charge Low (<80%) — break-tested pending ✓
afl8lrs4mbaioa UPS On Battery (power outage) — break-tested fired ✓
Note: nut_battery_test_result not exposed by nut-exporter v1.2.1.
Pass/fail tracked via syslog only for now. Adding to Phase 5 if exporter
gains test-result metric support.
|
2026-05-06 06:05:50 -05:00 |
|
tommy
|
e3ee020d53
|
monitoring: add postfix queue + cert expiry scripts, Phase 4D alerts
postfix-queue-check.sh:
- Reads mailq queue depth, writes postfix_queue_size{host=X} textfile metric
- Deployed on compute3 (systemd node_exporter) and compute5 (Docker)
- Cron: */5 * * * * as root on each host
- Prometheus alert: postfix_queue_size > 10 (uid: efl8kjns461a8f)
node-exporter-compute5-compose.yml:
- Added textfile volume mount /var/lib/node_exporter/textfile:/textfile:ro
- Added --collector.textfile.directory=/textfile flag
cert-expiry-check.sh:
- Also stored here for monitoring/ grouping
Phase 4D Grafana alert rules (all in Infrastructure Alerts folder):
cfl8jqdlhu680d TLS Cert Expiry Warning (30d) — break-tested ✓
afl8jqdoepwqod TLS Cert ACME Renewal Failure (14d) — no real certs in window
ffl8k2ry0nu2od Alertmanager Down — break-tested, fired ✓
efl8kjns461a8f Postfix Queue Backing Up — metric confirmed, 5m window
dfl8k2s0xjklcf Authelia Restart Loop — cadvisor-based proxy metric
Rules stored in grafana.db only — not yaml-provisioned (Phase 5 candidate)
|
2026-05-06 05:46:07 -05:00 |
|