homelab-configs

Author	SHA1	Message	Date
tommy	e3ee020d53	monitoring: add postfix queue + cert expiry scripts, Phase 4D alerts postfix-queue-check.sh: - Reads mailq queue depth, writes postfix_queue_size{host=X} textfile metric - Deployed on compute3 (systemd node_exporter) and compute5 (Docker) - Cron: */5 * * * * as root on each host - Prometheus alert: postfix_queue_size > 10 (uid: efl8kjns461a8f) node-exporter-compute5-compose.yml: - Added textfile volume mount /var/lib/node_exporter/textfile:/textfile:ro - Added --collector.textfile.directory=/textfile flag cert-expiry-check.sh: - Also stored here for monitoring/ grouping Phase 4D Grafana alert rules (all in Infrastructure Alerts folder): cfl8jqdlhu680d TLS Cert Expiry Warning (30d) — break-tested ✓ afl8jqdoepwqod TLS Cert ACME Renewal Failure (14d) — no real certs in window ffl8k2ry0nu2od Alertmanager Down — break-tested, fired ✓ efl8kjns461a8f Postfix Queue Backing Up — metric confirmed, 5m window dfl8k2s0xjklcf Authelia Restart Loop — cadvisor-based proxy metric Rules stored in grafana.db only — not yaml-provisioned (Phase 5 candidate)	2026-05-06 05:46:07 -05:00
tommy	10b60761ff	traefik: add cert-expiry-check.sh with Prometheus textfile output Reads acme.json hourly on docker-node01, writes: traefik_cert_expiry_days{domain=X} N traefik_cert_check_last_run_seconds EPOCH Two Grafana alert thresholds: Warning < 30d: auto-renewal window opened, ntfy high priority Critical < 14d: ACME renewal failed, ntfy urgent Textfile at /var/lib/node_exporter/textfile/cert_expiry.prom Scraped by existing node-exporter job on 192.168.99.186:9100 Grafana rules: cfl8jqdlhu680d (warning), afl8jqdoepwqod (critical) Break-tested: 35d threshold fired for vault/pdf/scrutiny/gitea correctly. Cron: 0 * * * * sudo /usr/local/bin/cert-expiry-check.sh	2026-05-06 05:34:31 -05:00
tommy	7fac4fc9c7	traefik: sync dynamic_conf.yml and add drift-check cron node02 was missing two blocks from node01 (canonical): - strip-trailing-dot-speedtest middleware (regex redirect for speedtest.goattw.net. URLs) - speedtest-trailing-dot router (catches trailing-dot Host header variant) crowdsecLapiHost intentionally differs: node01 uses Docker service name (crowdsec:8080, container on same host); node02 points to node01 IP (192.168.99.186:8081, node02 has no local CrowdSec instance). Added traefik-drift-check.sh — runs daily at 06:00 on ansible-control, diffs both configs (excluding known crowdsecLapiHost difference), posts to ntfy homelab-alerts on unexpected divergence. Traefik hot-reloaded on node02 via SIGHUP — no restart required.	2026-05-05 20:17:17 -05:00
tommy	23194ed22a	pbs: rewrite zfs-health-check.sh, enable textfile collector - Fix silent failure: script now posts to dedicated zfs-health ntfy topic instead of grafana-alerts catch-all (pools were offline 12+ hours undetected because alerts were buried in Grafana noise) - Three explicit states: ONLINE (silent), DEGRADED (high priority), MISSING (urgent priority) — empty zpool list output is now a MISSING alert, not silently ignored - Textfile metrics written atomically after loop completes only: zfs_pool_present{pool=X} 0|1 and zfs_health_last_run_seconds - Added trap cleanup so mid-script crash leaves previous .prom intact - Logs each pool state to syslog via logger -t zfs-health-check - Remove duplicate cron entry running as tommy (was firing twice per tick) - Enable node_exporter textfile collector for Prometheus scraping Incident: usb1-zfs and usb2-zfs offline since PBS boot (missing cachefile). Imported and cachefile regenerated in this session. No data errors. Refs: 2026-05-05 health check CRITICAL C1/C4	2026-05-05 19:44:36 -05:00
tommy	331820b8de	Fix PBS icon and datastore to Synology-Remote	2026-03-12 21:23:15 -05:00
tommy	5d5d123e7a	Add Frigate to Monitoring group	2026-03-12 21:14:29 -05:00
tommy	2c9df030b9	Add Frigate config files	2026-03-12 21:07:53 -05:00
tommy	4cc0a4c8f2	Add Authelia config	2026-03-12 21:01:04 -05:00
tommy	ea4db49b46	Add Homepage config files	2026-03-12 20:57:47 -05:00
root	22be212642	Add Compute3 PVE storage config	2026-03-11 22:25:08 -05:00
root	bc7f1933ac	Initial commit - PVE storage and backup job configs	2026-03-11 22:10:04 -05:00

11 Commits