The renewal deploy-hook ran as the certbot user but lacked permissions to write the combined PEM to /etc/haproxy/certs and to reload HAProxy, causing silent failures that left a stale certificate in production until expiry.

- Add certbot user to the haproxy group so it can write the combined PEM
- Grant certbot NOPASSWD sudo for `systemctl reload haproxy` only
- Make the Prometheus textfile directory group-owned by certbot (0775) so cert-metrics.sh can atomically update ssl_cert.prom
- Refactor renewal-hook.sh to always refresh cert metrics on exit via a trap, ensuring expiry alerts fire when the hook itself is broken
- Replace `set -e` with explicit error handling and structured logging

# Certbot DNS-01 with Namecheap

This playbook deploys certbot with the Namecheap DNS plugin for DNS-01 validation, enabling wildcard SSL certificates.

## Overview

| Component | Value |
|-----------|-------|
| Installation | Python virtualenv in `/srv/certbot/.venv` |
| DNS Plugin | `certbot-dns-namecheap` |
| Validation | DNS-01 (supports wildcards) |
| Renewal | Systemd timer (twice daily), runs as the `certbot` user |
| Certificate Output | Combined PEM at `haproxy_cert_path` (Titania: `/etc/haproxy/certs/ouranos.pem`) |
| HAProxy Reload | `systemctl reload haproxy` (native systemd, not Docker) |
| Metrics | Prometheus textfile collector |

## Deployments

### Titania (ouranos.helu.ca)

Production deployment providing Let's Encrypt certificates for the Ouranos sandbox HAProxy reverse proxy.

| Setting | Value |
|---------|-------|
| **Host** | titania.incus |
| **Domain** | ouranos.helu.ca |
| **Wildcard** | `*.ouranos.helu.ca` |
| **Email** | webmaster@helu.ca |
| **HAProxy** | Port 443 (HTTPS), Port 80 (HTTP redirect) |
| **Renewal** | Twice daily, automatic HAProxy reload |

### Other Deployments

The playbook can be deployed to any host with HAProxy, for example hippocamp.helu.ca (d.helu.ca domain).

## Prerequisites

1. **Namecheap API Access** enabled on your account
2. **Namecheap API key** generated
3. **IP whitelisted** in Namecheap API settings
4. **Ansible Vault** configured with Namecheap credentials

## Setup

### 1. Add Secrets to Ansible Vault

Add Namecheap credentials to `ansible/inventory/group_vars/all/vault.yml` (the commands in this guide are run from the `ansible/` directory):

```bash
ansible-vault edit inventory/group_vars/all/vault.yml
```

Add the following variables:
```yaml
vault_namecheap_username: "your_namecheap_username"
vault_namecheap_api_key: "your_namecheap_api_key"
```

Map these in `inventory/group_vars/all/vars.yml`:
```yaml
namecheap_username: "{{ vault_namecheap_username }}"
namecheap_api_key: "{{ vault_namecheap_api_key }}"
```

### 2. Configure Host Variables

For Titania, the configuration is in `inventory/host_vars/titania.incus.yml`:
```yaml
services:
  - certbot
  - haproxy
  # ...

certbot_email: webmaster@helu.ca
certbot_certificates:
  - cert_name: wildcard.ouranos.helu.ca
    domains: ["*.ouranos.helu.ca", "ouranos.helu.ca"]

# Where the renewal hook writes the combined fullchain+privkey PEM for HAProxy
haproxy_cert_path: /etc/haproxy/certs/ouranos.pem
```

> The certbot lineage name is **`wildcard.ouranos.helu.ca`**, so the certbot
> config lives under `/srv/certbot/config/live/wildcard.ouranos.helu.ca/`. The
> combined PEM that HAProxy actually serves is a separate file at
> `haproxy_cert_path` (`ouranos.pem`) written by the renewal hook; do not
> confuse the two.
>
> The playbook also supports the single-cert form (`certbot_cert_name` +
> `certbot_domains`) for hosts with one certificate.

### 3. Deploy

```bash
cd ansible
ansible-playbook certbot/deploy.yml --limit titania.incus
```

## Files Created

| Path | Purpose |
|------|---------|
| `/srv/certbot/.venv/` | Python virtualenv with certbot |
| `/srv/certbot/config/` | Certbot configuration and certificates |
| `/srv/certbot/credentials/namecheap.ini` | Namecheap API credentials (600 perms) |
| `/srv/certbot/hooks/renewal-hook.sh` | Post-renewal script |
| `/srv/certbot/hooks/cert-metrics.sh` | Prometheus metrics script |
| `/etc/haproxy/certs/ouranos.pem` | Combined cert for HAProxy (Titania), written by the renewal hook |
| `/etc/sudoers.d/certbot-haproxy-reload` | Scoped sudo rule letting certbot run `systemctl reload haproxy` |
| `/etc/systemd/system/certbot-renew.service` | Renewal service unit (runs as the `certbot` user) |
| `/etc/systemd/system/certbot-renew.timer` | Twice-daily renewal timer |

## Renewal Process

1. Systemd timer triggers at 00:00 and 12:00 (with random delay up to 1 hour)
2. Certbot checks if certificate needs renewal (within 30 days of expiry)
3. If renewal needed:
   - Creates DNS TXT record via Namecheap API
   - Waits 120 seconds for propagation
   - Validates and downloads new certificate
   - Runs `renewal-hook.sh`
4. Renewal hook (`renewal-hook.sh`, run via certbot's `--deploy-hook`):
   - Combines fullchain + privkey into the HAProxy PEM at `haproxy_cert_path`
   - Reloads native HAProxy via `sudo -n systemctl reload haproxy`
   - Always refreshes Prometheus metrics (even on failure; see the permission model below)

> **HAProxy on Titania runs natively under systemd, not in Docker.** The hook
> reloads it with `systemctl reload haproxy`. (Only Casdoor runs in Docker on
> Titania.)

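The hook's install and reload steps (step 4 above) can be sketched roughly as follows. This is not the playbook's actual `renewal-hook.sh`: the function names, the `0640` file mode, and the temp-file naming are illustrative assumptions. `RENEWED_LINEAGE` is the environment variable certbot exports to deploy hooks, pointing at the renewed lineage's live directory.

```bash
#!/usr/bin/env bash
# Sketch of the install/reload steps of a deploy hook.
# Function names and the 0640 mode are assumptions, not the real script.

# Concatenate fullchain + privkey into one PEM, then rename it into place
# so HAProxy never sees a half-written file.
install_combined_pem() {
    local live_dir="$1" dest="$2" tmp
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    if cat "${live_dir}/fullchain.pem" "${live_dir}/privkey.pem" > "$tmp" \
        && chmod 0640 "$tmp" \
        && mv -f "$tmp" "$dest"; then
        return 0
    fi
    rm -f "$tmp"   # clean up the partial file on any failure
    return 1
}

# Reload HAProxy without prompting (-n fails fast if the sudo rule is missing).
reload_haproxy() {
    sudo -n systemctl reload haproxy
}

# Full flow: check each step explicitly instead of relying on `set -e`.
run_hook() {
    local dest="$1"
    install_combined_pem "$RENEWED_LINEAGE" "$dest" \
        || { echo "ERROR: could not write $dest" >&2; return 1; }
    reload_haproxy \
        || { echo "ERROR: reload failed" >&2; return 1; }
}
```

Creating the temporary file in the destination directory keeps the final `mv` a same-filesystem rename, which is what makes the swap atomic. On Titania the call would be something like `run_hook /etc/haproxy/certs/ouranos.pem`.
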
### Permission model (why renewals can silently fail)

The renewal timer runs the hook as the unprivileged **`certbot`** user, so three
permissions must line up or the renewed cert never reaches HAProxy:

| Resource | Required state | Provided by |
|----------|----------------|-------------|
| `/etc/haproxy/certs` | `0770`, group `haproxy`; `certbot` is a member of `haproxy` | `haproxy/deploy.yml` (mode) + `certbot/deploy.yml` (group membership) |
| `systemctl reload haproxy` | allowed for `certbot` via sudo | `/etc/sudoers.d/certbot-haproxy-reload` |
| Prometheus textfile dir | group-writable by `certbot` | `certbot/deploy.yml` |

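The three rows above can be verified ahead of a renewal with a small preflight check, run as the `certbot` user. The sketch below is an illustration only; the helper names are hypothetical and not part of the playbook. It relies on `sudo -n -l <command>`, which exits non-zero when the command is not permitted for the calling user.

```bash
#!/usr/bin/env bash
# Preflight sketch: helper names are hypothetical.

# Report every directory the current user cannot write to.
check_writable_dirs() {
    local dir rc=0
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
            echo "not writable: $dir" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Succeeds only if sudo would allow the reload without a password prompt.
check_reload_sudo() {
    sudo -n -l systemctl reload haproxy >/dev/null 2>&1
}
```

On Titania this would be called as `check_writable_dirs /etc/haproxy/certs /var/lib/prometheus/node-exporter && check_reload_sudo` from a shell started with `sudo -u certbot`. Group membership changes only take effect in new sessions, so run it from a fresh one.
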
If any of these is wrong, the hook fails. **Certbot treats a deploy-hook failure
as a non-fatal WARNING and still reports "renewals succeeded"**, so a broken hook
will let the live cert renew while HAProxy keeps serving the *old* file until it
expires. To make this visible, the hook now:

- checks each step and exits non-zero with an explicit
  `serving a STALE certificate` error (surfaced in the certbot/journal output), and
- refreshes the Prometheus cert metrics on *every* exit, so the
  `SSLCertificateExpiringSoon` / `SSLCertificateExpired` alerts keep reflecting
  reality even when installation fails.

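The "refresh metrics on every exit" behaviour can be illustrated with an `EXIT` trap. This is a minimal sketch of the pattern, not the real hook: `run_with_metrics_refresh` and the `METRICS_CMD` override are hypothetical names, and the default path is taken from the Files Created table.

```bash
#!/usr/bin/env bash
# Sketch of the trap pattern; names are assumptions, not the real script.

# Run a command in a subshell whose EXIT trap always refreshes cert metrics,
# whether the command succeeds or fails.
run_with_metrics_refresh() {
    (
        # Fires on both exit paths below; a metrics failure must not mask
        # the original exit status, hence `|| true`.
        trap '"${METRICS_CMD:-/srv/certbot/hooks/cert-metrics.sh}" || true' EXIT
        "$@" || {
            echo "ERROR: '$*' failed; HAProxy is serving a STALE certificate" >&2
            exit 1
        }
        exit 0
    )
}
```

Because the trap runs after the failing step, the expiry metrics keep tracking the certificate HAProxy is really serving, which is what lets the alerts fire when the hook itself is broken.
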
## Prometheus Metrics

Metrics written to `/var/lib/prometheus/node-exporter/ssl_cert.prom`:

| Metric | Description |
|--------|-------------|
| `ssl_certificate_expiry_timestamp` | Unix timestamp when cert expires |
| `ssl_certificate_expiry_seconds` | Seconds until cert expires |
| `ssl_certificate_valid` | 1 if valid, 0 if expired/missing |

Example alert rule:
```yaml
- alert: SSLCertificateExpiringSoon
  expr: ssl_certificate_expiry_seconds < 604800  # 7 days
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expiring soon"
    description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"
```

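For orientation, the textfile output could be produced along these lines. This is a sketch, not the contents of `cert-metrics.sh`: the function names are hypothetical, and the `domain` label is an assumption inferred from `{{ $labels.domain }}` in the alert rule above.

```bash
#!/usr/bin/env bash
# Sketch of the textfile output; names and the label set are assumptions.

# Print the three metrics for one certificate, given its expiry time and
# the current time (both in epoch seconds).
render_cert_metrics() {
    local domain="$1" expiry="$2" now="$3" remaining valid=0
    remaining=$(( expiry - now ))
    if [ "$remaining" -gt 0 ]; then
        valid=1
    fi
    printf 'ssl_certificate_expiry_timestamp{domain="%s"} %d\n' "$domain" "$expiry"
    printf 'ssl_certificate_expiry_seconds{domain="%s"} %d\n' "$domain" "$remaining"
    printf 'ssl_certificate_valid{domain="%s"} %d\n' "$domain" "$valid"
}

# Read metrics on stdin, write them to a temp file next to the target,
# then rename so node_exporter never scrapes a partial file.
write_metrics_atomically() {
    local out="$1" tmp
    tmp=$(mktemp "${out}.XXXXXX") || return 1
    if cat > "$tmp" && chmod 0644 "$tmp" && mv -f "$tmp" "$out"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

A call such as `render_cert_metrics ouranos.helu.ca "$expiry" "$(date +%s)" | write_metrics_atomically /var/lib/prometheus/node-exporter/ssl_cert.prom` ties the two together. The textfile collector only reads files ending in `.prom`, so the temporary file is ignored until the rename.
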
## Troubleshooting

### View Certificate Status

```bash
# Check expiry of the cert HAProxy actually serves (Titania)
sudo openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.pem

# Confirm HAProxy is serving it on the wire
echo | openssl s_client -connect titania.incus:8443 \
    -servername grafana.ouranos.helu.ca 2>/dev/null \
    | openssl x509 -noout -enddate -issuer

# Check the underlying certbot lineage (may be newer than the served file
# if the deploy hook failed to install it)
sudo openssl x509 -enddate -noout \
    -in /srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem

# Check certbot certificates
sudo -u certbot /srv/certbot/.venv/bin/certbot certificates \
    --config-dir /srv/certbot/config
```

> If the served file is older than the certbot lineage, the deploy hook is
> failing to install renewals. Check the hook output with
> `sudo grep -i hook /srv/certbot/logs/letsencrypt.log*` and look for
> `Permission denied`, `reload failed`, or `serving a STALE certificate`.

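The served-versus-lineage comparison can also be scripted. The helpers below are a hypothetical sketch, not part of the playbook, and `date -d` assumes GNU date. The check compares the two `notAfter` timestamps and flags the served file as stale when it expires first.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; `date -d` assumes GNU date.

# Print a certificate's notAfter date as epoch seconds.
cert_expiry_epoch() {
    local end
    end=$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)
    [ -n "$end" ] || return 1
    date -d "$end" +%s
}

# Succeed when the served cert expires before the lineage cert.
served_is_stale() {
    [ "$1" -lt "$2" ]
}

# Compare the served PEM with the certbot lineage and report the result.
# Exit status: 0 = up to date, 1 = stale, 2 = a certificate could not be read.
check_served_cert() {
    local served lineage
    served=$(cert_expiry_epoch "$1") || return 2
    lineage=$(cert_expiry_epoch "$2") || return 2
    if served_is_stale "$served" "$lineage"; then
        echo "STALE: served cert expires before the renewed lineage"
        return 1
    fi
    echo "OK: served cert is at least as new as the lineage"
}
```

On Titania the arguments would be `/etc/haproxy/certs/ouranos.pem` and `/srv/certbot/config/live/wildcard.ouranos.helu.ca/fullchain.pem`, run with enough privilege to read both files.
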
### Manual Renewal Test

```bash
# Dry run renewal
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
    --config-dir /srv/certbot/config \
    --work-dir /srv/certbot/work \
    --logs-dir /srv/certbot/logs \
    --dry-run

# Force renewal (if needed)
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
    --config-dir /srv/certbot/config \
    --work-dir /srv/certbot/work \
    --logs-dir /srv/certbot/logs \
    --force-renewal
```

### Check Systemd Timer

```bash
# Timer status
systemctl status certbot-renew.timer

# Last run
journalctl -u certbot-renew.service --since "1 day ago"

# List timers
systemctl list-timers certbot-renew.timer
```

### DNS Propagation Issues

If certificate requests fail due to DNS propagation:

1. Check Namecheap API is accessible
2. Verify IP is whitelisted
3. Increase propagation wait time (default 120s)
4. Check certbot logs: `/srv/certbot/logs/letsencrypt.log`

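While a renewal is waiting on propagation, it can help to query the challenge record directly. The sketch below uses hypothetical helper names and assumes `dig` is installed. For DNS-01, a wildcard name and its base domain are both validated at the same `_acme-challenge.<domain>` TXT record.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; requires `dig` for the query.

# Map a certificate domain (wildcard or not) to its DNS-01 challenge name.
acme_challenge_name() {
    local domain="${1#\*.}"   # strip a leading "*." if present
    printf '_acme-challenge.%s\n' "$domain"
}

# Look up the challenge TXT record, optionally against a specific resolver.
query_challenge_txt() {
    dig +short TXT "$(acme_challenge_name "$1")" ${2:+@"$2"}
}
```

For example, `query_challenge_txt '*.ouranos.helu.ca'` shows what the default resolver sees. Passing one of the domain's authoritative nameservers as the second argument bypasses resolver caching. If the record is missing there after the wait has elapsed, the Namecheap API call is the more likely culprit.
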
## Related Playbooks

- `haproxy/deploy.yml` - Depends on certificate from certbot
- `prometheus/node_deploy.yml` - Deploys node_exporter for metrics collection