11 Commits

Author SHA1 Message Date
tommy
e3ee020d53 monitoring: add postfix queue + cert expiry scripts, Phase 4D alerts
postfix-queue-check.sh:
  - Reads mailq queue depth, writes postfix_queue_size{host=X} textfile metric
  - Deployed on compute3 (systemd node_exporter) and compute5 (Docker)
  - Cron: */5 * * * * as root on each host
  - Grafana alert rule: postfix_queue_size > 10 (uid: efl8kjns461a8f)
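
A minimal sketch of the queue-depth step described above, not the committed postfix-queue-check.sh. It assumes the usual Postfix `mailq` summary line ("-- N Kbytes in M Requests." or "Mail queue is empty"). The function names and the output file name postfix_queue.prom are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; the committed postfix-queue-check.sh is not reproduced in this log.

# Read `mailq` output on stdin and print the number of queued messages.
# Postfix ends the listing with "-- N Kbytes in M Request(s)." and prints
# "Mail queue is empty" when nothing is queued, which yields 0 here.
queue_depth() {
  awk '/^-- .* in [0-9]+ Requests?\.$/ { n = $(NF-1) } END { print n + 0 }'
}

# Write the gauge to DIR/postfix_queue.prom (hypothetical file name).
# The temp file does not end in .prom, so node_exporter ignores it until the
# rename, and a scrape never sees a half-written file.
write_queue_metric() {
  local dir host depth tmp
  dir=$1; host=$2; depth=$3
  tmp=$(mktemp "$dir/.postfix_queue.prom.XXXXXX") || return 1
  printf 'postfix_queue_size{host="%s"} %s\n' "$host" "$depth" > "$tmp"
  chmod 644 "$tmp"
  mv "$tmp" "$dir/postfix_queue.prom"
}
```

A cron entry could then run something like `write_queue_metric /var/lib/node_exporter/textfile "$(hostname -s)" "$(mailq | queue_depth)"`.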

node-exporter-compute5-compose.yml:
  - Added textfile volume mount /var/lib/node_exporter/textfile:/textfile:ro
  - Added --collector.textfile.directory=/textfile flag

cert-expiry-check.sh:
  - Copy also stored here so it sits with the other monitoring/ scripts

Phase 4D Grafana alert rules (all in Infrastructure Alerts folder):
  cfl8jqdlhu680d  TLS Cert Expiry Warning (30d)        — break-tested ✓
  afl8jqdoepwqod  TLS Cert ACME Renewal Failure (14d)  — no real certs in window
  ffl8k2ry0nu2od  Alertmanager Down                     — break-tested, fired ✓
  efl8kjns461a8f  Postfix Queue Backing Up              — metric confirmed, 5m window
  dfl8k2s0xjklcf  Authelia Restart Loop                 — cadvisor-based proxy metric

Rules stored in grafana.db only — not yaml-provisioned (Phase 5 candidate)
2026-05-06 05:46:07 -05:00
tommy
10b60761ff traefik: add cert-expiry-check.sh with Prometheus textfile output
Reads acme.json hourly on docker-node01, writes:
  traefik_cert_expiry_days{domain=X} N
  traefik_cert_check_last_run_seconds EPOCH

Two Grafana alert thresholds:
  Warning  < 30d: auto-renewal window opened, ntfy high priority
  Critical < 14d: ACME renewal failed, ntfy urgent
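
A rough sketch of how the two metrics and the two thresholds fit together, not the committed cert-expiry-check.sh. Extracting each certificate's notAfter time from acme.json is left out. The input format ("domain epoch" lines) and all function names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only; the committed cert-expiry-check.sh is not reproduced in this log.

# Whole days from now ($2) until notAfter ($1), both in epoch seconds.
# Shell arithmetic truncates toward zero, so a partial day is dropped.
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# Read "domain not_after_epoch" lines on stdin and print the two metrics
# named in the commit message. $1 is the current time in epoch seconds.
emit_cert_metrics() {
  local now domain not_after
  now=$1
  while read -r domain not_after; do
    printf 'traefik_cert_expiry_days{domain="%s"} %s\n' \
      "$domain" "$(days_left "$not_after" "$now")"
  done
  printf 'traefik_cert_check_last_run_seconds %s\n' "$now"
}

# Mirror of the two Grafana thresholds, for illustration only; in the commit
# the thresholds live in the alert rules, not in the script.
severity_for_days() {
  if   [ "$1" -lt 14 ]; then echo critical
  elif [ "$1" -lt 30 ]; then echo warning
  else                       echo ok
  fi
}
```

With this shape, a certificate with 29 days left maps to the warning rule and one with 13 days left maps to the critical rule.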

Textfile at /var/lib/node_exporter/textfile/cert_expiry.prom
Scraped by existing node-exporter job on 192.168.99.186:9100
Grafana rules: cfl8jqdlhu680d (warning), afl8jqdoepwqod (critical)
Break-tested: a 35d threshold fired correctly for vault/pdf/scrutiny/gitea.

Cron: 0 * * * * sudo /usr/local/bin/cert-expiry-check.sh
2026-05-06 05:34:31 -05:00
tommy
7fac4fc9c7 traefik: sync dynamic_conf.yml and add drift-check cron
node02 was missing two blocks from node01 (canonical):
- strip-trailing-dot-speedtest middleware (regex redirect for speedtest.goattw.net. URLs)
- speedtest-trailing-dot router (catches trailing-dot Host header variant)

crowdsecLapiHost intentionally differs: node01 uses Docker service name
(crowdsec:8080, container on same host); node02 points to node01 IP
(192.168.99.186:8081, node02 has no local CrowdSec instance).

Added traefik-drift-check.sh — runs daily at 06:00 on ansible-control,
diffs both configs (excluding known crowdsecLapiHost difference),
posts to ntfy homelab-alerts on unexpected divergence.
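
The drift check described above could look roughly like the following; this is a sketch, not the committed traefik-drift-check.sh. It assumes both configs are already available as local files and that the known difference can be filtered by dropping lines containing crowdsecLapiHost. The function names and the NTFY_URL variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; the committed traefik-drift-check.sh is not reproduced in this log.

# Print a unified diff of two config files, ignoring any line that mentions
# crowdsecLapiHost. Exit status follows diff: 0 identical, 1 drift, 2 error.
config_drift() {
  local a b rc
  a=$(mktemp) && b=$(mktemp) || return 2
  grep -v 'crowdsecLapiHost' "$1" > "$a"
  grep -v 'crowdsecLapiHost' "$2" > "$b"
  diff -u "$a" "$b" && rc=0 || rc=$?
  rm -f "$a" "$b"
  return "$rc"
}

# Post the diff to the homelab-alerts topic. NTFY_URL (the ntfy server base
# URL) is an assumption; the commit does not say where ntfy is hosted.
notify_drift() {
  curl -fsS -H "Title: Traefik config drift" -H "Priority: high" \
       -d "$1" "${NTFY_URL:?set NTFY_URL}/homelab-alerts"
}
```

A daily cron job might call it as `out=$(config_drift node01.yml node02.yml) || notify_drift "$out"`, where the two file names stand in for wherever the configs are fetched to.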

Traefik hot-reloaded on node02 via SIGHUP — no restart required.
2026-05-05 20:17:17 -05:00
tommy
23194ed22a pbs: rewrite zfs-health-check.sh, enable textfile collector
- Fix silent failure: script now posts to dedicated zfs-health ntfy topic
  instead of grafana-alerts catch-all (pools were offline 12+ hours
  undetected because alerts were buried in Grafana noise)
- Three explicit states: ONLINE (silent), DEGRADED (high priority),
  MISSING (urgent priority) — empty zpool list output is now a MISSING
  alert, not silently ignored
- Textfile metrics are written atomically, and only after the loop completes:
  zfs_pool_present{pool=X} 0|1 and zfs_health_last_run_seconds
- Added trap cleanup so mid-script crash leaves previous .prom intact
- Logs each pool state to syslog via logger -t zfs-health-check
- Remove duplicate cron entry running as tommy (was firing twice per tick)
- Enable node_exporter textfile collector for Prometheus scraping
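
The state handling and atomic write listed above can be sketched as follows. This is an illustration under stated assumptions, not the committed zfs-health-check.sh: it assumes pool health comes from `zpool list -H -o name,health`, that any state other than ONLINE or MISSING is handled like DEGRADED, and that a pool counts as present whenever it is listed. The function names and the file name zfs_health.prom are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; the committed zfs-health-check.sh is not reproduced in this log.

# Read `zpool list -H -o name,health` output on stdin and print the state of
# pool $1. A pool that is not listed (including empty output) is MISSING.
pool_state() {
  awk -v p="$1" '$1 == p { s = $2 } END { print (s == "" ? "MISSING" : s) }'
}

# ntfy priority for a state; prints nothing for ONLINE (no alert).
# Assumption: anything other than ONLINE or MISSING is treated like DEGRADED.
priority_for() {
  case "$1" in
    ONLINE)  ;;
    MISSING) echo urgent ;;
    *)       echo high ;;
  esac
}

# Write metrics for the pools named in $3.. to $1/zfs_health.prom, using the
# zpool listing passed in $2. Lines go to a temp file first and the rename
# happens only after the loop; the trap removes the temp file if the script
# dies early, so the previous .prom file is left untouched.
write_zfs_metrics() {
  local dir listing tmp pool state present
  dir=$1; listing=$2; shift 2
  tmp=$(mktemp "$dir/.zfs_health.XXXXXX") || return 1
  trap "rm -f '$tmp'" EXIT   # expanded now, since tmp is local to this function
  for pool in "$@"; do
    state=$(printf '%s\n' "$listing" | pool_state "$pool")
    if [ "$state" = MISSING ]; then present=0; else present=1; fi
    printf 'zfs_pool_present{pool="%s"} %s\n' "$pool" "$present" >> "$tmp"
  done
  printf 'zfs_health_last_run_seconds %s\n' "$(date +%s)" >> "$tmp"
  mv "$tmp" "$dir/zfs_health.prom"
  trap - EXIT
}
```

Passing the expected pool names explicitly is what lets an empty `zpool list` result show up as MISSING rather than as nothing to check.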

Incident: usb1-zfs and usb2-zfs offline since PBS boot (missing cachefile).
Imported and cachefile regenerated in this session. No data errors.

Refs: 2026-05-05 health check CRITICAL C1/C4
2026-05-05 19:44:36 -05:00
tommy
331820b8de Fix PBS icon and datastore to Synology-Remote
2026-03-12 21:23:15 -05:00
tommy
5d5d123e7a Add Frigate to Monitoring group
2026-03-12 21:14:29 -05:00
tommy
2c9df030b9 Add Frigate config files
2026-03-12 21:07:53 -05:00
tommy
4cc0a4c8f2 Add Authelia config
2026-03-12 21:01:04 -05:00
tommy
ea4db49b46 Add Homepage config files
2026-03-12 20:57:47 -05:00
root
22be212642 Add Compute3 PVE storage config
2026-03-11 22:25:08 -05:00
root
bc7f1933ac Initial commit - PVE storage and backup job configs
2026-03-11 22:10:04 -05:00