fix(certbot): harden renewal hook and fix permission errors

The renewal deploy-hook ran as the certbot user but lacked permissions to
write the combined PEM to /etc/haproxy/certs and to reload HAProxy,
causing silent failures that left a stale certificate in production until
expiry.

- Add certbot user to the haproxy group so it can write the combined PEM
- Grant certbot NOPASSWD sudo for `systemctl reload haproxy` only
- Make the Prometheus textfile directory group-owned by certbot (0775)
  so cert-metrics.sh can atomically update ssl_cert.prom
- Refactor renewal-hook.sh to always refresh cert metrics on exit via a
  trap, ensuring expiry alerts fire when the hook itself is broken
- Replace `set -e` with explicit error handling and structured logging
This commit is contained in:
2026-06-17 09:58:46 -04:00
parent 2f5a15eef5
commit 343b0e13d6
10 changed files with 665 additions and 46 deletions

View File

@@ -484,17 +484,35 @@ vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
```
#### Certificate fetch fails
#### TLS cert expired / not renewing on `*.ouranos.helu.ca`
**Cause**: Titania not running or certbot hasn't provisioned the cert yet.
TLS for all PPLG subdomains is terminated by **Titania's native HAProxy** using
the Let's Encrypt wildcard cert managed by certbot on Titania (see
[certbot DNS-01 with Namecheap](cerbot.md)). PPLG itself holds no cert.
**Fix**: Ensure Titania is up and certbot has run:
**Most likely cause**: certbot renewed the lineage but the deploy hook failed to
install the new cert into HAProxy's served PEM (`/etc/haproxy/certs/ouranos.pem`),
so HAProxy keeps serving the old file until it expires. Certbot reports such hook
failures only as a WARNING, so the renewal looks successful.
**Diagnose** (on Titania):
```bash
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
# Does the served file match the certbot lineage?
sudo openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.pem
sudo openssl x509 -enddate -noout \
-in /srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem
# Look for a failing hook
sudo grep -iE 'hook|Permission denied|reload failed|STALE' /srv/certbot/logs/letsencrypt.log*
```
The playbook falls back to a self-signed certificate if Titania is unavailable.
**Fix**: re-run the playbooks (in this order) and force a renewal to reinstall:
```bash
ansible-playbook haproxy/deploy.yml --limit titania.incus
ansible-playbook certbot/deploy.yml --limit titania.incus
```
See the certbot doc's [permission model](cerbot.md#permission-model-why-renewals-can-silently-fail)
for the `certbot`-user permissions the hook depends on.
#### OAuth2 redirect loops