fix(certbot): harden renewal hook and fix permission errors

The renewal deploy-hook ran as the certbot user but lacked permissions to
write the combined PEM to /etc/haproxy/certs and to reload HAProxy, causing
silent failures that left a stale certificate in production until expiry.

- Add certbot user to the haproxy group so it can write the combined PEM
- Grant certbot NOPASSWD sudo for `systemctl reload haproxy` only
- Make the Prometheus textfile directory group-owned by certbot (0775) so
  cert-metrics.sh can atomically update ssl_cert.prom
- Refactor renewal-hook.sh to always refresh cert metrics on exit via a
  trap, ensuring expiry alerts fire when the hook itself is broken
- Replace `set -e` with explicit error handling and structured logging
@@ -9,8 +9,9 @@ This playbook deploys certbot with the Namecheap DNS plugin for DNS-01 validatio
 | Installation | Python virtualenv in `/srv/certbot/.venv` |
 | DNS Plugin | `certbot-dns-namecheap` |
 | Validation | DNS-01 (supports wildcards) |
-| Renewal | Systemd timer (twice daily) |
-| Certificate Output | `/etc/haproxy/certs/{domain}.pem` |
+| Renewal | Systemd timer (twice daily), runs as the `certbot` user |
+| Certificate Output | Combined PEM at `haproxy_cert_path` (Titania: `/etc/haproxy/certs/ouranos.pem`) |
+| HAProxy Reload | `systemctl reload haproxy` (native systemd, not Docker) |
 | Metrics | Prometheus textfile collector |
 ## Deployments
@@ -69,12 +70,23 @@ services:
 # ...

 certbot_email: webmaster@helu.ca
-certbot_cert_name: ouranos.helu.ca
-certbot_domains:
-  - "*.ouranos.helu.ca"
-  - "ouranos.helu.ca"
+certbot_certificates:
+  - cert_name: wildcard.ouranos.helu.ca
+    domains: ["*.ouranos.helu.ca", "ouranos.helu.ca"]
+
+# Where the renewal hook writes the combined fullchain+privkey PEM for HAProxy
+haproxy_cert_path: /etc/haproxy/certs/ouranos.pem
 ```

+> The certbot lineage name is **`wildcard.ouranos.helu.ca`**, so the certbot
+> config lives under `/srv/certbot/config/live/wildcard.ouranos.helu.ca/`. The
+> combined PEM that HAProxy actually serves is a separate file at
+> `haproxy_cert_path` (`ouranos.pem`) written by the renewal hook — do not
+> confuse the two.
+>
+> The playbook also supports the single-cert form (`certbot_cert_name` +
+> `certbot_domains`) for hosts with one certificate.
+
 ### 3. Deploy

 ```bash
@@ -91,9 +103,9 @@ ansible-playbook certbot/deploy.yml --limit titania.incus
 | `/srv/certbot/credentials/namecheap.ini` | Namecheap API credentials (600 perms) |
 | `/srv/certbot/hooks/renewal-hook.sh` | Post-renewal script |
 | `/srv/certbot/hooks/cert-metrics.sh` | Prometheus metrics script |
-| `/etc/haproxy/certs/ouranos.helu.ca.pem` | Combined cert for HAProxy (Titania) |
-| `/etc/systemd/system/certbot-renew.service` | Renewal service unit |
-| `/etc/systemd/system/certbot-renew.timer` | Twice-daily renewal timer |
+| `/etc/haproxy/certs/ouranos.pem` | Combined cert for HAProxy (Titania), written by the renewal hook |
+| `/etc/sudoers.d/certbot-haproxy-reload` | Scoped sudo rule letting certbot run `systemctl reload haproxy` |
+| `/etc/systemd/system/certbot-renew.service` | Renewal service unit (runs as the `certbot` user) |
 | `/etc/systemd/system/certbot-renew.timer` | Twice-daily renewal timer |

 ## Renewal Process
@@ -105,10 +117,36 @@ ansible-playbook certbot/deploy.yml --limit titania.incus
    - Waits 120 seconds for propagation
    - Validates and downloads new certificate
    - Runs `renewal-hook.sh`
-4. Renewal hook:
-   - Combines fullchain + privkey into HAProxy format
-   - Reloads HAProxy via `docker compose kill -s HUP haproxy`
-   - Updates Prometheus metrics
+4. Renewal hook (`renewal-hook.sh`, run via certbot's `--deploy-hook`):
+   - Combines fullchain + privkey into the HAProxy PEM at `haproxy_cert_path`
+   - Reloads native HAProxy via `sudo -n systemctl reload haproxy`
+   - Always refreshes Prometheus metrics (even on failure — see below)
+
+> **HAProxy on Titania runs natively under systemd, not in Docker.** The hook
+> reloads it with `systemctl reload haproxy`. (Only Casdoor runs in Docker on
+> Titania.)
+
+### Permission model (why renewals can silently fail)
+
+The renewal timer runs the hook as the unprivileged **`certbot`** user, so three
+permissions must line up or the renewed cert never reaches HAProxy:
+
+| Resource | Required state | Provided by |
+|----------|----------------|-------------|
+| `/etc/haproxy/certs` | `0770`, group `haproxy`; `certbot` is a member of `haproxy` | `haproxy/deploy.yml` (mode) + `certbot/deploy.yml` (group membership) |
+| `systemctl reload haproxy` | allowed for `certbot` via sudo | `/etc/sudoers.d/certbot-haproxy-reload` |
+| Prometheus textfile dir | group-writable by `certbot` | `certbot/deploy.yml` |
+
+If any of these is wrong, the hook fails. **Certbot treats a deploy-hook failure
+as a non-fatal WARNING and still reports "renewals succeeded"** — so a broken hook
+will let the live cert renew while HAProxy keeps serving the *old* file until it
+expires. To make this visible, the hook now:
+
+- checks each step and exits non-zero with an explicit
+  `serving a STALE certificate` error (surfaced in the certbot/journal output), and
+- refreshes the Prometheus cert metrics on *every* exit, so the
+  `SSLCertificateExpiringSoon` / `SSLCertificateExpired` alerts keep reflecting
+  reality even when installation fails.

 ## Prometheus Metrics
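Reviewer sketch (not part of the commit): the hunk above describes a hook that checks each step explicitly and refreshes metrics on every exit. The refactored `renewal-hook.sh` itself is not shown in this diff, so the following is only a minimal illustration of that pattern. The function names, the `0640` mode, and the log format are assumptions. The paths come from the docs, and `RENEWED_LINEAGE` is the variable certbot exports to deploy hooks.

```bash
#!/usr/bin/env bash
# Sketch only: function names, log format, and the 0640 mode are illustrative
# assumptions, not the contents of the real renewal-hook.sh.
set -u  # deliberately no `set -e`: every step is checked explicitly below

LINEAGE="${RENEWED_LINEAGE:-/srv/certbot/config/live/wildcard.ouranos.helu.ca}"
HAPROXY_CERT="${HAPROXY_CERT:-/etc/haproxy/certs/ouranos.pem}"

log() { printf '%s level=%s msg="%s"\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >&2; }

reload_haproxy()  { sudo -n systemctl reload haproxy; }
refresh_metrics() { /srv/certbot/hooks/cert-metrics.sh || log warn "metrics refresh failed"; }

# Build the combined PEM in a temp file next to the target, then rename it into
# place so HAProxy never sees a half-written file.
install_cert() {
  local tmp
  tmp="$(mktemp "${HAPROXY_CERT}.XXXXXX")" || return 1
  if cat "$LINEAGE/fullchain.pem" "$LINEAGE/privkey.pem" > "$tmp" \
      && chmod 0640 "$tmp" && mv -f "$tmp" "$HAPROXY_CERT"; then
    return 0
  fi
  rm -f "$tmp"
  return 1
}

# The body runs in a subshell, so the EXIT trap fires on every path out of it,
# including the early `exit 1` calls.
run_hook() (
  trap refresh_metrics EXIT
  install_cert   || { log error "cannot write $HAPROXY_CERT: serving a STALE certificate"; exit 1; }
  reload_haproxy || { log error "reload failed: serving a STALE certificate"; exit 1; }
  log info "installed renewed certificate and reloaded HAProxy"
)
```

A real hook would end with something like `run_hook; exit $?`. Keeping the steps in small functions is a design choice for this sketch: it lets the reload and metrics steps be stubbed out when exercising the failure paths.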
@@ -137,14 +175,29 @@ Example alert rule:
 ### View Certificate Status

 ```bash
-# Check certificate expiry (Titania example)
-openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.helu.ca.pem
+# Check expiry of the cert HAProxy actually serves (Titania)
+sudo openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.pem
+
+# Confirm HAProxy is serving it on the wire
+echo | openssl s_client -connect titania.incus:8443 \
+  -servername grafana.ouranos.helu.ca 2>/dev/null \
+  | openssl x509 -noout -enddate -issuer
+
+# Check the underlying certbot lineage (may be newer than the served file
+# if the deploy hook failed to install it)
+sudo openssl x509 -enddate -noout \
+  -in /srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem

 # Check certbot certificates
 sudo -u certbot /srv/certbot/.venv/bin/certbot certificates \
   --config-dir /srv/certbot/config
 ```

+> If the served file is older than the certbot lineage, the deploy hook is
+> failing to install renewals. Check the hook output:
+> `sudo grep -i hook /srv/certbot/logs/letsencrypt.log*` — look for
+> `Permission denied`, `reload failed`, or `serving a STALE certificate`.
+
 ### Manual Renewal Test

 ```bash
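Reviewer sketch (not part of the commit): the note in the hunk above compares the served file with the certbot lineage by eye. One way to script that comparison is to compare SHA-256 fingerprints, since `openssl x509` reads the first certificate in a PEM file, which is the leaf certificate in both files. The function names here are hypothetical, and the sketch assumes the served PEM starts with the leaf certificate, as it would if the hook concatenates fullchain and then privkey.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not taken from the repository.
set -u

# Print the SHA-256 fingerprint of the first certificate in a PEM file.
cert_fingerprint() {
  openssl x509 -noout -fingerprint -sha256 -in "$1"
}

# served_cert_is_current SERVED_PEM LINEAGE_FULLCHAIN
# Returns 0 if both files carry the same leaf certificate, 1 if they differ,
# and 2 if either file cannot be read or parsed.
served_cert_is_current() {
  local served lineage
  served="$(cert_fingerprint "$1")" || return 2
  lineage="$(cert_fingerprint "$2")" || return 2
  [ "$served" = "$lineage" ]
}
```

On Titania this would be called (as root, because both files are restricted) with `/etc/haproxy/certs/ouranos.pem` and `/srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem`. A return code of 1 would point at a deploy hook that is not installing renewals.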
@@ -374,10 +374,10 @@ MinIO specifically expects certs at `~/.minio/certs/public.crt` and `~/.minio/ce
 | Certbot location | On the host itself | OCI free host |
 | Namecheap credentials | On the host | Only on OCI host |
 | Cert delivery | Direct to HAProxy | Via OCI Vault → Ansible |
-| Renewal hook | Docker HAProxy reload | OCI Vault upload |
+| Renewal hook | Combine PEM + reload HAProxy | OCI Vault upload |
 | Distribution | N/A (local only) | Ansible cron on controller |
 | Environments served | Ouranos sandbox only | All environments |
-| Service reload | `docker compose kill -s HUP` | `systemctl reload` per host_vars |
+| Service reload | `systemctl reload haproxy` (native, via scoped sudo) | `systemctl reload` per host_vars |

 Titania can remain self-contained (it's working) or migrate to this centralized model later.
30
docs/pplg.md
@@ -484,17 +484,35 @@ vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
 vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
 ```

-#### Certificate fetch fails
+#### TLS cert expired / not renewing on `*.ouranos.helu.ca`

-**Cause**: Titania not running or certbot hasn't provisioned the cert yet.
+TLS for all PPLG subdomains is terminated by **Titania's native HAProxy** using
+the Let's Encrypt wildcard cert managed by certbot on Titania (see
+[certbot DNS-01 with Namecheap](cerbot.md)). PPLG itself holds no cert.

-**Fix**: Ensure Titania is up and certbot has run:
+**Most likely cause**: certbot renewed the lineage but the deploy hook failed to
+install the new cert into HAProxy's served PEM (`/etc/haproxy/certs/ouranos.pem`),
+so HAProxy keeps serving the old file until it expires. Certbot reports such hook
+failures only as a WARNING, so the renewal looks successful.
+
+**Diagnose** (on Titania):
 ```bash
-ansible-playbook sandbox_up.yml
-ansible-playbook certbot/deploy.yml
+# Does the served file match the certbot lineage?
+sudo openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.pem
+sudo openssl x509 -enddate -noout \
+  -in /srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem
+
+# Look for a failing hook
+sudo grep -iE 'hook|Permission denied|reload failed|STALE' /srv/certbot/logs/letsencrypt.log*
 ```

-The playbook falls back to a self-signed certificate if Titania is unavailable.
+**Fix**: re-run the playbooks (in this order) and force a renewal to reinstall:
+```bash
+ansible-playbook haproxy/deploy.yml --limit titania.incus
+ansible-playbook certbot/deploy.yml --limit titania.incus
+```
+See the certbot doc's [permission model](cerbot.md#permission-model-why-renewals-can-silently-fail)
+for the `certbot`-user permissions the hook depends on.

 #### OAuth2 redirect loops
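Reviewer sketch (not part of the commit): the troubleshooting text above points at the `certbot`-user permissions the hook depends on. A quick way to check all three on Titania might look like the following. Every function name is hypothetical, the script assumes GNU `stat` and root privileges, and the Prometheus textfile directory is passed as an argument because its path is not given in these docs. The sudo check relies on `sudo -l -U USER COMMAND` exiting non-zero when the command is not permitted.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative; run as root on the target host.
set -u

# True if DIR has exactly the octal MODE and the group owner GROUP.
dir_has_mode_group() {
  [ "$(stat -c '%a %G' "$1" 2>/dev/null)" = "$2 $3" ]
}

# True if USER is a member of GROUP.
user_in_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx "$2"
}

# report LABEL COMMAND...  prints ok/FAIL for one check.
report() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok    $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# check_certbot_permissions TEXTFILE_DIR
check_certbot_permissions() {
  local rc=0
  report "/etc/haproxy/certs is 0770, group haproxy" \
    dir_has_mode_group /etc/haproxy/certs 770 haproxy || rc=1
  report "certbot is a member of haproxy" user_in_group certbot haproxy || rc=1
  report "certbot may run 'systemctl reload haproxy' via sudo" \
    sudo -l -U certbot systemctl reload haproxy || rc=1
  report "certbot can write to $1" sudo -u certbot test -w "$1" || rc=1
  return "$rc"
}
```

Calling `check_certbot_permissions` with the textfile collector directory prints one line per check and returns non-zero if any of them fails, which narrows down which row of the permission table is broken before re-running the playbooks.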