ouranos/docs/cerbot.md
Robert Helewka 343b0e13d6 fix(certbot): harden renewal hook and fix permission errors
The renewal deploy-hook ran as the certbot user but lacked permissions to
write the combined PEM to /etc/haproxy/certs and to reload HAProxy,
causing silent failures that left a stale certificate in production until
expiry.

- Add certbot user to the haproxy group so it can write the combined PEM
- Grant certbot NOPASSWD sudo for `systemctl reload haproxy` only
- Make the Prometheus textfile directory group-owned by certbot (0775)
  so cert-metrics.sh can atomically update ssl_cert.prom
- Refactor renewal-hook.sh to always refresh cert metrics on exit via a
  trap, ensuring expiry alerts fire when the hook itself is broken
- Replace `set -e` with explicit error handling and structured logging
2026-06-17 09:58:46 -04:00

Certbot DNS-01 with Namecheap

This playbook deploys certbot with the Namecheap DNS plugin for DNS-01 validation, enabling wildcard SSL certificates.

Overview

| Component | Value |
|---|---|
| Installation | Python virtualenv in /srv/certbot/.venv |
| DNS Plugin | certbot-dns-namecheap |
| Validation | DNS-01 (supports wildcards) |
| Renewal | Systemd timer (twice daily), runs as the certbot user |
| Certificate Output | Combined PEM at haproxy_cert_path (Titania: /etc/haproxy/certs/ouranos.pem) |
| HAProxy Reload | systemctl reload haproxy (native systemd, not Docker) |
| Metrics | Prometheus textfile collector |

Deployments

Titania (ouranos.helu.ca)

Production deployment providing Let's Encrypt certificates for the Ouranos sandbox HAProxy reverse proxy.

| Setting | Value |
|---|---|
| Host | titania.incus |
| Domain | ouranos.helu.ca |
| Wildcard | *.ouranos.helu.ca |
| Email | webmaster@helu.ca |
| HAProxy | Port 443 (HTTPS), Port 80 (HTTP redirect) |
| Renewal | Twice daily, automatic HAProxy reload |

Other Deployments

The playbook can be deployed to any host with HAProxy. See the example configuration for hippocamp.helu.ca (d.helu.ca domain) below.

Prerequisites

  1. Namecheap API Access enabled on your account
  2. Namecheap API key generated
  3. IP whitelisted in Namecheap API settings
  4. Ansible Vault configured with Namecheap credentials

Setup

1. Add Secrets to Ansible Vault

Add Namecheap credentials to ansible/inventory/group_vars/all/vault.yml:

ansible-vault edit inventory/group_vars/all/vault.yml

Add the following variables:

vault_namecheap_username: "your_namecheap_username"
vault_namecheap_api_key: "your_namecheap_api_key"

Map these in inventory/group_vars/all/vars.yml:

namecheap_username: "{{ vault_namecheap_username }}"
namecheap_api_key: "{{ vault_namecheap_api_key }}"

2. Configure Host Variables

For Titania, the configuration is in inventory/host_vars/titania.incus.yml:

services:
  - certbot
  - haproxy
  # ...

certbot_email: webmaster@helu.ca
certbot_certificates:
  - cert_name: wildcard.ouranos.helu.ca
    domains: ["*.ouranos.helu.ca", "ouranos.helu.ca"]

# Where the renewal hook writes the combined fullchain+privkey PEM for HAProxy
haproxy_cert_path: /etc/haproxy/certs/ouranos.pem

The certbot lineage name is wildcard.ouranos.helu.ca, so the certbot config lives under /srv/certbot/config/live/wildcard.ouranos.helu.ca/. The combined PEM that HAProxy actually serves is a separate file at haproxy_cert_path (ouranos.pem) written by the renewal hook — do not confuse the two.

The playbook also supports the single-cert form (certbot_cert_name + certbot_domains) for hosts with one certificate.

3. Deploy

cd ansible
ansible-playbook certbot/deploy.yml --limit titania.incus

Files Created

| Path | Purpose |
|---|---|
| /srv/certbot/.venv/ | Python virtualenv with certbot |
| /srv/certbot/config/ | Certbot configuration and certificates |
| /srv/certbot/credentials/namecheap.ini | Namecheap API credentials (mode 0600) |
| /srv/certbot/hooks/renewal-hook.sh | Post-renewal script |
| /srv/certbot/hooks/cert-metrics.sh | Prometheus metrics script |
| /etc/haproxy/certs/ouranos.pem | Combined cert for HAProxy (Titania), written by the renewal hook |
| /etc/sudoers.d/certbot-haproxy-reload | Scoped sudo rule letting certbot run systemctl reload haproxy |
| /etc/systemd/system/certbot-renew.service | Renewal service unit (runs as the certbot user) |
| /etc/systemd/system/certbot-renew.timer | Twice-daily renewal timer |

Renewal Process

  1. Systemd timer triggers at 00:00 and 12:00 (with random delay up to 1 hour)
  2. Certbot checks if certificate needs renewal (within 30 days of expiry)
  3. If renewal needed:
    • Creates DNS TXT record via Namecheap API
    • Waits 120 seconds for propagation
    • Validates and downloads new certificate
    • Runs renewal-hook.sh
  4. Renewal hook (renewal-hook.sh, run via certbot's --deploy-hook):
    • Combines fullchain + privkey into the HAProxy PEM at haproxy_cert_path
    • Reloads native HAProxy via sudo -n systemctl reload haproxy
    • Always refreshes Prometheus metrics (even on failure — see below)

HAProxy on Titania runs natively under systemd, not in Docker. The hook reloads it with systemctl reload haproxy. (Only Casdoor runs in Docker on Titania.)
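
The hook steps above can be sketched roughly as follows. This is an illustrative outline, not the shipped renewal-hook.sh: the function names, the log format, the 0640 file mode, and the three overridable defaults are assumptions. The only certbot behaviour it relies on is that deploy hooks receive the renewed lineage directory in the RENEWED_LINEAGE environment variable.

```shell
#!/bin/sh
# Hypothetical sketch of a deploy hook; the defaults mirror the Titania paths.
HAPROXY_CERT_PATH="${HAPROXY_CERT_PATH:-/etc/haproxy/certs/ouranos.pem}"
METRICS_CMD="${METRICS_CMD:-/srv/certbot/hooks/cert-metrics.sh}"
RELOAD_CMD="${RELOAD_CMD:-sudo -n systemctl reload haproxy}"

# Structured-ish log line on stderr (format is an assumption).
log() { printf '%s renewal-hook: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

# Write fullchain + privkey to a temp file next to the target, then rename it
# into place so HAProxy never reads a half-written PEM.
install_combined_pem() {
    lineage="$1"; dest="$2"
    tmp="$(mktemp "${dest}.XXXXXX")" || return 1
    # 0640 is an assumed mode (owner certbot, group-readable).
    if cat "$lineage/fullchain.pem" "$lineage/privkey.pem" > "$tmp" \
        && chmod 640 "$tmp" && mv "$tmp" "$dest"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}

run_hook() {
    lineage="$1"
    # Refresh metrics on every exit path, including the failures below.
    trap 'sh -c "$METRICS_CMD" || log "metrics refresh failed"' EXIT
    if ! install_combined_pem "$lineage" "$HAPROXY_CERT_PATH"; then
        log "ERROR: cannot write $HAPROXY_CERT_PATH; serving a STALE certificate"
        exit 1
    fi
    if ! sh -c "$RELOAD_CMD"; then
        log "ERROR: reload failed; serving a STALE certificate"
        exit 1
    fi
    log "installed $HAPROXY_CERT_PATH and reloaded HAProxy"
}

# certbot exports RENEWED_LINEAGE to deploy hooks; do nothing without it.
if [ -n "${RENEWED_LINEAGE:-}" ]; then
    run_hook "$RENEWED_LINEAGE"
fi
```

Registering the metrics refresh as an EXIT trap, rather than as a final step, is what keeps the expiry metrics current when an earlier step aborts the hook.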

Permission model (why renewals can silently fail)

The renewal timer runs the hook as the unprivileged certbot user, so three permissions must line up or the renewed cert never reaches HAProxy:

| Resource | Required state | Provided by |
|---|---|---|
| /etc/haproxy/certs | 0770, group haproxy; certbot is a member of haproxy | haproxy/deploy.yml (mode) + certbot/deploy.yml (group membership) |
| systemctl reload haproxy | allowed for certbot via sudo | /etc/sudoers.d/certbot-haproxy-reload |
| Prometheus textfile dir | group-writable by certbot | certbot/deploy.yml |

If any of these is wrong, the hook fails. Certbot treats a deploy-hook failure as a non-fatal WARNING and still reports "renewals succeeded", so a broken hook lets the lineage under /srv/certbot/config/live/ renew while HAProxy keeps serving the old combined PEM until it expires. To make this visible, the hook now:

  • checks each step and exits non-zero with an explicit "serving a STALE certificate" error (surfaced in the certbot/journal output), and
  • refreshes the Prometheus cert metrics on every exit, so the SSLCertificateExpiringSoon / SSLCertificateExpired alerts keep reflecting reality even when installation fails.
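
The three permissions can also be checked before a renewal is due. The sketch below is a hypothetical pre-flight helper, not part of the playbook: check_hook_permissions and its messages are made-up names. It assumes sudo's list mode (sudo -n -l followed by a command) exits non-zero when the command is not permitted, and it should be run as the certbot user.

```shell
# Hypothetical pre-flight check for the permission model above.
# $1: HAProxy cert directory, $2: Prometheus textfile directory,
# $3: command that succeeds only if the reload is permitted (assumed default).
check_hook_permissions() {
    cert_dir="$1"
    metrics_dir="$2"
    reload_check="${3:-sudo -n -l systemctl reload haproxy}"
    rc=0
    if ! { [ -d "$cert_dir" ] && [ -w "$cert_dir" ]; }; then
        echo "FAIL: cannot write to $cert_dir"
        rc=1
    fi
    if ! { [ -d "$metrics_dir" ] && [ -w "$metrics_dir" ]; }; then
        echo "FAIL: cannot write to $metrics_dir"
        rc=1
    fi
    if ! sh -c "$reload_check" >/dev/null 2>&1; then
        echo "FAIL: not allowed to reload haproxy via sudo"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "OK: certbot can install the PEM, reload HAProxy, and write metrics"
    fi
    return "$rc"
}
```

On Titania this might be invoked as sudo -u certbot sh -c '. ./check.sh; check_hook_permissions /etc/haproxy/certs /var/lib/prometheus/node-exporter', where check.sh is a placeholder for wherever the function is saved.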

Prometheus Metrics

Metrics written to /var/lib/prometheus/node-exporter/ssl_cert.prom:

| Metric | Description |
|---|---|
| ssl_certificate_expiry_timestamp | Unix timestamp when cert expires |
| ssl_certificate_expiry_seconds | Seconds until cert expires |
| ssl_certificate_valid | 1 if valid, 0 if expired/missing |
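
As a rough illustration of how these three metrics could be produced and written atomically, consider the sketch below. It is not the shipped cert-metrics.sh: the function names are hypothetical, the domain label is inferred from the alert rule's use of $labels.domain, and cert_expiry_epoch assumes GNU date for parsing the openssl notAfter string.

```shell
# Expiry of the first certificate in a PEM file as a Unix timestamp.
# Assumes GNU date (-d) to parse openssl's "notAfter=" value.
cert_expiry_epoch() {
    end="$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)" || return 1
    date -d "$end" +%s
}

# Render the three metrics. $1: domain label, $2: expiry epoch, $3: current epoch.
render_cert_metrics() {
    domain="$1"; expiry="$2"; now="$3"
    remaining=$((expiry - now))
    if [ "$remaining" -gt 0 ]; then valid=1; else valid=0; fi
    printf 'ssl_certificate_expiry_timestamp{domain="%s"} %s\n' "$domain" "$expiry"
    printf 'ssl_certificate_expiry_seconds{domain="%s"} %s\n' "$domain" "$remaining"
    printf 'ssl_certificate_valid{domain="%s"} %s\n' "$domain" "$valid"
}

# Read metrics on stdin and publish them with temp file + rename, so the
# textfile collector never scrapes a partially written file.
write_metrics_atomically() {
    out="$1"
    tmp="$(mktemp "${out}.XXXXXX")" || return 1
    if cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$out"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

A typical call would pipe one into the other, for example render_cert_metrics ouranos.helu.ca "$(cert_expiry_epoch /etc/haproxy/certs/ouranos.pem)" "$(date +%s)" | write_metrics_atomically /var/lib/prometheus/node-exporter/ssl_cert.prom. The temp file does not end in .prom, so the textfile collector ignores it until the rename.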

Example alert rule:

- alert: SSLCertificateExpiringSoon
  expr: ssl_certificate_expiry_seconds < 604800  # 7 days
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expiring soon"
    description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"

Troubleshooting

View Certificate Status

# Check expiry of the cert HAProxy actually serves (Titania)
sudo openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.pem

# Confirm HAProxy is serving it on the wire
echo | openssl s_client -connect titania.incus:8443 \
  -servername grafana.ouranos.helu.ca 2>/dev/null \
  | openssl x509 -noout -enddate -issuer

# Check the underlying certbot lineage (may be newer than the served file
# if the deploy hook failed to install it)
sudo openssl x509 -enddate -noout \
  -in /srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem

# Check certbot certificates
sudo -u certbot /srv/certbot/.venv/bin/certbot certificates \
  --config-dir /srv/certbot/config

If the served file expires earlier than the certbot lineage, the deploy hook is failing to install renewals. Check the hook output with sudo grep -i hook /srv/certbot/logs/letsencrypt.log* and look for "Permission denied", "reload failed", or "serving a STALE certificate".
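
The comparison between the served file and the lineage can be scripted instead of eyeballing two dates. The sketch below compares SHA-256 fingerprints of the leaf certificate, which is the first certificate in both the combined PEM and fullchain.pem. The function names are hypothetical and it assumes openssl is on the PATH.

```shell
# SHA-256 fingerprint of the first certificate in a PEM file
# (empty output if the file is missing or unreadable).
leaf_fingerprint() {
    openssl x509 -in "$1" -noout -fingerprint -sha256 2>/dev/null | cut -d= -f2
}

# Succeeds only when the served PEM starts with the same leaf certificate
# as the certbot lineage. $1: served PEM, $2: lineage fullchain.pem.
served_matches_lineage() {
    served="$(leaf_fingerprint "$1")"
    current="$(leaf_fingerprint "$2")"
    [ -n "$served" ] && [ "$served" = "$current" ]
}
```

On Titania the two arguments would be /etc/haproxy/certs/ouranos.pem and /srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem. Both need elevated read access, and a non-zero exit status suggests the deploy hook did not install the latest renewal.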

Manual Renewal Test

# Dry run renewal
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
  --config-dir /srv/certbot/config \
  --work-dir /srv/certbot/work \
  --logs-dir /srv/certbot/logs \
  --dry-run

# Force renewal (if needed)
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
  --config-dir /srv/certbot/config \
  --work-dir /srv/certbot/work \
  --logs-dir /srv/certbot/logs \
  --force-renewal

Check Systemd Timer

# Timer status
systemctl status certbot-renew.timer

# Last run
journalctl -u certbot-renew.service --since "1 day ago"

# List timers
systemctl list-timers certbot-renew.timer

DNS Propagation Issues

If certificate requests fail due to DNS propagation:

  1. Check Namecheap API is accessible
  2. Verify IP is whitelisted
  3. Increase propagation wait time (default 120s)
  4. Check certbot logs: /srv/certbot/logs/letsencrypt.log
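
To judge whether the 120 s wait is long enough, it can help to time how long the challenge record takes to become visible during a manual renewal attempt. The sketch below is a hypothetical helper: it assumes dig is installed and polls the standard DNS-01 record name (_acme-challenge followed by the domain). The function names and default intervals are made up for illustration.

```shell
# The only network-facing piece; override this function to test offline.
lookup_txt() { dig +short TXT "$1"; }

# Poll until a TXT record is visible or the timeout elapses.
# $1: record name, $2: timeout in seconds, $3: poll interval in seconds (>= 1).
wait_for_txt() {
    name="$1"; timeout="${2:-120}"; interval="${3:-10}"
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        if [ -n "$(lookup_txt "$name")" ]; then
            echo "visible after ${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "not visible after ${timeout}s"
    return 1
}
```

For example, wait_for_txt _acme-challenge.ouranos.helu.ca 300 10, run in a second terminal while certbot is waiting, gives a rough propagation time. The result depends on which resolver dig queries, so treat it as a lower bound and not a guarantee.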

Related Playbooks

  • haproxy/deploy.yml - Depends on the certificate from certbot
  • prometheus/node_deploy.yml - Deploys node_exporter for metrics collection