feat: add documentation for centralized certificate management using Let's Encrypt
448
docs/certbot_internal_hosts.md
Normal file
@@ -0,0 +1,448 @@
# Let's Encrypt for Internal Production Hosts

Centralized certificate management using an OCI free-tier host running certbot with DNS-01 validation, OCI Vault for secure storage, and Ansible-driven distribution to internal production hosts.

## Overview

| Component | Value |
|-----------|-------|
| Certificate Authority | Let's Encrypt |
| Validation | DNS-01 via `certbot-dns-namecheap` |
| Certificate Generator | OCI free-tier host (certbot) |
| Certificate Store | OCI Vault (software-protected keys, free) |
| Distribution | Ansible playbook on controller (cron) |
| Target Hosts | pan.helu.ca, nyx.helu.ca (extensible) |
| Monitoring | Prometheus metrics + Grafana dashboard |

## Problem

Self-signed and private certificates on internal production hosts (pan.helu.ca, nyx.helu.ca) cause persistent issues:

- Clients must disable TLS verification or trust custom CAs
- Service-to-service HTTPS (e.g., LobeChat → MinIO S3 at `https://pan.helu.ca:8555`) requires trust workarounds
- Certificate management is manual and inconsistent across environments
- No automated renewal or expiry monitoring

## Architecture

```
┌─────────────────────────────────┐
│  OCI Free Host                  │
│  - certbot + dns-namecheap      │
│  - systemd timer (twice daily)  │
│  - deploy hook: upload certs to │
│    OCI Vault via oci-cli        │
└──────────────┬──────────────────┘
               │ oci vault secret update-base64
               ▼
┌─────────────────────────────────┐
│  OCI Vault                      │
│  ouranos-certificates           │
│  ├── pan-helu-ca-fullchain      │
│  ├── pan-helu-ca-privkey        │
│  ├── nyx-helu-ca-fullchain      │
│  ├── nyx-helu-ca-privkey        │
│  └── (future domains)           │
└──────────────┬──────────────────┘
               │ community.oci.oci_secret lookup
               ▼
┌─────────────────────────────────┐
│  Ansible Controller             │
│  (restricted access)            │
│  - cron: cert-distribute.yml    │
│  - pulls certs from OCI Vault   │
│  - deploys to target hosts      │
│  - updates Prometheus metrics   │
└──────────┬──────────┬───────────┘
           │          │
      ┌────▼───┐  ┌───▼────┐
      │ pan    │  │ nyx    │
      │ .helu  │  │ .helu  │
      │ .ca    │  │ .ca    │
      └────┬───┘  └───┬────┘
           │          │
           ▼          ▼
    Prometheus cert metrics
    → Grafana dashboard
```

### Design Decisions

| Decision | Rationale |
|----------|-----------|
| Individual certs (not wildcard `*.helu.ca`) | Limits blast radius; each host has only its own cert |
| Namecheap API keys only on OCI host | Production hosts never hold DNS API credentials |
| Pull model (controller pulls from vault) | OCI host doesn't need SSH access to production |
| OCI Vault as distribution bus | Aligns with existing OCI Vault pattern in `docs/ansible.md` |
| Software-protected keys | Free tier, no per-secret or per-version charges |

## Why DNS-01

DNS-01 validation fits internal hosts because the challenge is answered with DNS TXT records created through the Namecheap API, so the target hosts (pan, nyx) never need to be reachable from the internet on port 80 or 443.

This is the same approach already used on Titania for `*.ouranos.helu.ca` (see `docs/cerbot.md`).

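
As a small illustration of where the challenge lives, the sketch below derives the TXT record name that an ACME DNS-01 validation uses for a given certificate name. The function name `acme_challenge_name` is hypothetical and exists only for this example; the `_acme-challenge.` prefix is the part defined by the ACME protocol.

```bash
#!/usr/bin/env bash
# Sketch: derive the DNS-01 challenge record name for a certificate name.
set -euo pipefail

# ACME DNS-01 places the validation token in a TXT record at
# _acme-challenge.<domain>; a wildcard name is validated at its base domain.
acme_challenge_name() {
  local domain="${1#\*.}"   # strip a leading "*." if present
  printf '_acme-challenge.%s\n' "${domain}"
}

acme_challenge_name pan.helu.ca
acme_challenge_name '*.ouranos.helu.ca'
```

While an issuance is in progress, querying that name (for example with `dig +short TXT _acme-challenge.pan.helu.ca`) is one way to see whether the record has propagated.
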
## Implementation

### Phase 1: OCI Free Host — Certbot

Deploy certbot on the OCI free host using the same pattern as `ansible/certbot/deploy.yml`, stripped down for minimal footprint (no HAProxy, no Docker).

**Resource requirements**: Trivial. Certbot runs briefly twice per day, so the OCI free host's constrained resources are more than sufficient.

#### Certbot Setup

1. Python virtualenv with `certbot` and `certbot-dns-namecheap`
2. Namecheap credentials in `/srv/certbot/credentials/namecheap.ini` (template: `ansible/certbot/namecheap.ini.j2`)
3. Individual certificate requests per domain:

   ```bash
   certbot certonly \
     --non-interactive \
     --agree-tos \
     --email webmaster@helu.ca \
     --authenticator dns-namecheap \
     --dns-namecheap-credentials /srv/certbot/credentials/namecheap.ini \
     --dns-namecheap-propagation-seconds 120 \
     --config-dir /srv/certbot/config \
     --work-dir /srv/certbot/work \
     --logs-dir /srv/certbot/logs \
     --cert-name pan.helu.ca \
     -d pan.helu.ca
   ```

4. Systemd timer for renewal (same pattern as Titania: twice daily with `RandomizedDelaySec=3600`)

#### Vault Upload Deploy Hook

Replaces the HAProxy reload hook used on Titania. Certbot runs it as a deploy hook (`--deploy-hook`) after each successful renewal, with `RENEWED_LINEAGE` set to the renewed certificate's live directory. The hook base64-encodes the certificate files and uploads them to OCI Vault as new secret versions, so the secrets must already exist in the vault:

```bash
#!/bin/bash
# Deploy hook: upload renewed certificates to OCI Vault.
# certbot sets RENEWED_LINEAGE for deploy hooks only,
# e.g. /srv/certbot/config/live/pan.helu.ca
set -euo pipefail

CERT_DIR="${RENEWED_LINEAGE}"
CERT_NAME="${RENEWED_LINEAGE##*/}"
VAULT_ID="ocid1.vault.oc1..."  # ouranos-certificates vault
COMPARTMENT_ID="ocid1.compartment..."

# Derive OCI secret name prefix from cert name (pan.helu.ca → pan-helu-ca)
SECRET_PREFIX="${CERT_NAME//./-}"

# Look up a secret by name and upload a PEM file as its new version
upload_secret() {
  local secret_name="$1" pem_file="$2" secret_id
  secret_id="$(oci vault secret list \
    --compartment-id "${COMPARTMENT_ID}" \
    --vault-id "${VAULT_ID}" \
    --name "${secret_name}" \
    --query 'data[0].id' --raw-output)"
  oci vault secret update-base64 \
    --secret-id "${secret_id}" \
    --secret-content-content "$(base64 -w0 "${pem_file}")"
}

upload_secret "${SECRET_PREFIX}-fullchain" "${CERT_DIR}/fullchain.pem"
upload_secret "${SECRET_PREFIX}-privkey" "${CERT_DIR}/privkey.pem"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Uploaded ${CERT_NAME} to OCI Vault"
```

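
The name derivation in the hook relies on two bash parameter expansions, which can be tried without certbot or the OCI CLI. The following is a minimal sketch; the helper name `secret_name_for` is hypothetical, and the example paths assume certbot's usual `live/<cert-name>` layout under the `--config-dir` used above.

```bash
#!/usr/bin/env bash
# Sketch: map a certbot lineage path to an OCI Vault secret name.
set -euo pipefail

secret_name_for() {
  local lineage="$1" kind="$2"
  local cert_name="${lineage##*/}"            # drop everything up to the last "/"
  printf '%s-%s\n' "${cert_name//./-}" "${kind}"   # dots become hyphens
}

secret_name_for /srv/certbot/config/live/pan.helu.ca fullchain
secret_name_for /srv/certbot/config/live/nyx.helu.ca privkey
```

Keeping this mapping in one place helps the hook and the Ansible lookup agree on secret names.
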
### Phase 2: OCI Vault — Certificate Storage

#### Vault Organization

Extends the existing OCI Vault structure documented in `docs/ansible.md`:

```
OCI Compartment: production
├── Vault: ouranos-databases (existing)
├── Vault: ouranos-services (existing)
├── Vault: ouranos-integrations (existing)
└── Vault: ouranos-certificates (new)
    ├── Secret: pan-helu-ca-fullchain
    ├── Secret: pan-helu-ca-privkey
    ├── Secret: nyx-helu-ca-fullchain
    ├── Secret: nyx-helu-ca-privkey
    └── (future domains follow same pattern)
```

**Naming convention**: Domain dots replaced with hyphens, suffixed with `-fullchain` or `-privkey`.

**Secret format**: Base64-encoded PEM content. OCI Vault versions secrets natively: every renewal creates a new version, which leaves earlier versions available for rollback and provides an audit trail.

**Cost**: OCI Vault with software-protected keys is free. No per-secret or per-version charges. Store certs for as many domains as needed at zero cost.

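
Because the PEM files are stored base64-encoded and decoded again on the controller, it can be worth confirming that the encoding round-trips byte for byte. The sketch below does that with placeholder content rather than a real certificate. It assumes GNU coreutils `base64` (for `-w0`), and the helper names `encode_secret` and `decode_secret` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: check that single-line base64 encoding round-trips a PEM file.
set -euo pipefail

# Encode a file as one unwrapped base64 line (GNU coreutils -w0).
encode_secret() { base64 -w0 "$1"; }

# Decode a stored value from stdin back to its original bytes.
decode_secret() { base64 -d; }

# Placeholder content, not a real certificate.
tmp="$(mktemp)"
trap 'rm -f "${tmp}"' EXIT
printf -- '-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n' > "${tmp}"

if encode_secret "${tmp}" | decode_secret | cmp -s - "${tmp}"; then
  echo "round trip OK"
fi
```
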
#### IAM Policies

```
# OCI free host: can write certs to the certificates vault
Allow dynamic-group certbot-host to manage secrets in compartment production
  where target.vault.id = '<ouranos-certificates vault OCID>'

# Ansible controller: can read certs from the certificates vault
# (secret-family includes secret-bundles, which carry the secret contents)
Allow dynamic-group ansible-controller to read secret-family in compartment production
  where target.vault.id = '<ouranos-certificates vault OCID>'
```

### Phase 3: Ansible Controller — Distribution

#### Distribution Playbook

A `cert-distribute.yml` playbook runs on the Ansible controller via cron. It uses `community.oci.oci_secret` lookups (same pattern as existing OCI Vault usage). The private-key tasks suppress or relabel their output so the key material never reaches the cron log:

```yaml
---
- name: Distribute Let's Encrypt Certificates
  hosts: certbot_targets
  vars:
    oci_compartment_id: "{{ vault_oci_compartment_id }}"
    oci_certificates_vault_id: "{{ vault_oci_certificates_vault_id }}"

  tasks:
    - name: Retrieve fullchain from OCI Vault
      ansible.builtin.set_fact:
        cert_fullchain: >-
          {{ lookup('community.oci.oci_secret',
                    certbot_cert_name | replace('.', '-') ~ '-fullchain',
                    compartment_id=oci_compartment_id,
                    vault_id=oci_certificates_vault_id) | b64decode }}

    - name: Retrieve private key from OCI Vault
      ansible.builtin.set_fact:
        cert_privkey: >-
          {{ lookup('community.oci.oci_secret',
                    certbot_cert_name | replace('.', '-') ~ '-privkey',
                    compartment_id=oci_compartment_id,
                    vault_id=oci_certificates_vault_id) | b64decode }}
      no_log: true

    - name: Deploy certificate (HAProxy combined format)
      become: true
      ansible.builtin.copy:
        content: "{{ cert_fullchain }}{{ cert_privkey }}"
        dest: "{{ haproxy_cert_path }}"
        owner: "{{ certbot_user }}"
        group: "{{ haproxy_group }}"
        mode: '0640'
      when: certbot_cert_format | default('haproxy_combined') == 'haproxy_combined'
      notify: reload services

    - name: Deploy certificate (separate files)
      become: true
      ansible.builtin.copy:
        content: "{{ item.content }}"
        dest: "{{ item.dest }}"
        owner: "{{ certbot_user }}"
        group: "{{ certbot_group }}"
        mode: '0640'
      loop:
        - { content: "{{ cert_fullchain }}", dest: "{{ certbot_cert_dir }}/fullchain.pem" }
        - { content: "{{ cert_privkey }}", dest: "{{ certbot_cert_dir }}/privkey.pem" }
      loop_control:
        label: "{{ item.dest }}"  # keep PEM content out of task output
      when: certbot_cert_format | default('haproxy_combined') == 'separate'
      notify: reload services

    - name: Update certificate metrics
      become: true
      ansible.builtin.command: "{{ certbot_metrics_script }}"
      changed_when: false

  handlers:
    - name: reload services
      become: true
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: reloaded
      loop: "{{ certbot_reload_services | default([]) }}"
```

#### Host Variables (Example)

```yaml
# host_vars/pan.helu.ca.yml
certbot_cert_name: pan.helu.ca
certbot_cert_format: separate  # MinIO expects separate files
certbot_cert_dir: /etc/minio/certs
certbot_user: minio
certbot_group: minio
certbot_reload_services:
  - minio

certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
```

```yaml
# host_vars/nyx.helu.ca.yml (if running HAProxy)
certbot_cert_name: nyx.helu.ca
certbot_cert_format: haproxy_combined
haproxy_cert_path: /etc/haproxy/certs/nyx.helu.ca.pem
certbot_user: certbot
haproxy_group: haproxy
certbot_reload_services:
  - haproxy

certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
```

#### Cron Schedule

```cron
# Run cert distribution every 6 hours on the Ansible controller
0 */6 * * * cd /path/to/ansible && ansible-playbook cert-distribute.yml --limit certbot_targets 2>&1 | logger -t cert-distribute
```

Let's Encrypt certs are valid for 90 days, and certbot renews them once fewer than 30 days remain. A 6-hour distribution cadence means a renewed cert reaches the target hosts within hours of renewal, long before the old one expires.

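
The timing can be made concrete with a little arithmetic. The sketch below uses only figures stated in this document (90-day validity, 30-day renewal window, twice-daily timer with up to one hour of random delay, 6-hour distribution) and gives a rough upper bound that ignores run time and failed attempts. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: when renewal becomes due and how long deployment can lag behind it.
set -euo pipefail

CERT_LIFETIME_DAYS=90        # Let's Encrypt validity
RENEW_BEFORE_DAYS=30         # certbot renewal window
TIMER_INTERVAL_HOURS=12      # "twice daily" systemd timer
TIMER_JITTER_HOURS=1         # RandomizedDelaySec=3600
DISTRIBUTE_INTERVAL_HOURS=6  # cron cadence on the controller

# Day of the certificate's life on which renewal becomes due.
renewal_day() { echo $(( CERT_LIFETIME_DAYS - RENEW_BEFORE_DAYS )); }

# Rough upper bound on hours between "renewal due" and "deployed on targets".
worst_case_lag_hours() {
  echo $(( TIMER_INTERVAL_HOURS + TIMER_JITTER_HOURS + DISTRIBUTE_INTERVAL_HOURS ))
}

echo "renewal due on day $(renewal_day)"
echo "worst-case lag: $(worst_case_lag_hours) hours"
```

Under these assumptions a renewed cert is deployed within a day of becoming due, which leaves roughly 29 days of margin for retries before the old cert expires.
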
### Phase 4: Monitoring

#### Prometheus Metrics

Deploy the `cert-metrics.sh` script (existing template: `ansible/certbot/cert-metrics.sh.j2`) on each target host. After each distribution, it writes metrics to the node-exporter textfile directory:

| Metric | Description |
|--------|-------------|
| `ssl_certificate_expiry_timestamp` | Unix timestamp when cert expires |
| `ssl_certificate_expiry_seconds` | Seconds until cert expires |
| `ssl_certificate_valid` | 1 if valid, 0 if expired/missing |

Labels: `domain`, `issuer`

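
To show what those metrics look like on disk, here is a minimal sketch of a formatter that prints the three series in Prometheus text exposition format. It is not the existing `cert-metrics.sh` template; the function name `format_cert_metrics` and the timestamps are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: print the three certificate metrics in Prometheus text format.
set -euo pipefail

# Arguments: domain, issuer, expiry (Unix seconds), now (Unix seconds).
format_cert_metrics() {
  local domain="$1" issuer="$2" expiry="$3" now="$4"
  local labels="domain=\"${domain}\",issuer=\"${issuer}\""
  local remaining=$(( expiry - now ))
  local valid=0
  if (( remaining > 0 )); then valid=1; fi
  printf 'ssl_certificate_expiry_timestamp{%s} %d\n' "${labels}" "${expiry}"
  printf 'ssl_certificate_expiry_seconds{%s} %d\n' "${labels}" "${remaining}"
  printf 'ssl_certificate_valid{%s} %d\n' "${labels}" "${valid}"
}

# Example with fixed, made-up timestamps: expiry is 10 days after "now".
format_cert_metrics pan.helu.ca "Let's Encrypt" 1700864000 1700000000
```

A real script would read the expiry from the deployed certificate (for example via `openssl x509 -noout -enddate`) and write the output to a `.prom` file in the textfile directory, ideally through a temporary file and a rename so node-exporter never reads a partial file.
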
#### Alert Rules

Add to `ansible/prometheus/alert_rules.yml.j2`. Because that file is a Jinja2 template, the Prometheus `{{ ... }}` expressions below must be protected from Jinja2 (for example inside a `{% raw %}` ... `{% endraw %}` block) unless the template already handles this:

```yaml
- name: ssl_alerts
  rules:
    - alert: SSLCertificateExpiringSoon
      expr: ssl_certificate_expiry_seconds < 604800  # 7 days
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "SSL certificate expiring soon"
        description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"

    - alert: SSLCertificateExpired
      expr: ssl_certificate_valid == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "SSL certificate expired or missing"
        description: "Certificate for {{ $labels.domain }} is expired or not found"
```

#### Grafana Dashboard

Add a certificate status dashboard to `ansible/grafana/dashboards/` showing:

- Certificate validity status (green/red) per domain
- Days until expiry (gauge per domain)
- Renewal history (annotation markers from cert version timestamps)
- Issuer label (confirms Let's Encrypt vs self-signed)

## Certificate Format Reference

Different services expect certificates in different formats. The distribution playbook handles this via the `certbot_cert_format` host variable:

| Format | Services | Files Produced |
|--------|----------|----------------|
| `haproxy_combined` | HAProxy | Single PEM: fullchain + privkey concatenated |
| `separate` | MinIO, nginx, most services | `fullchain.pem` + `privkey.pem` as separate files |

MinIO specifically expects certs at `~/.minio/certs/public.crt` and `~/.minio/certs/private.key`, or in a custom directory set via `--certs-dir`. It looks for those two file names in either location, so on a MinIO host the `fullchain.pem` and `privkey.pem` written by the `separate` format must be mapped to `public.crt` and `private.key`.

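
For reference, the `haproxy_combined` format is just the two PEM files concatenated in order. The sketch below shows one way to build it by hand, for instance when testing outside Ansible. The function name `combine_pem` is hypothetical and the demo uses placeholder files rather than real key material.

```bash
#!/usr/bin/env bash
# Sketch: build a single HAProxy-style PEM from fullchain + private key.
set -euo pipefail

# Write fullchain followed by private key into one file with mode 0640,
# replacing the destination via a rename so readers never see a partial file.
combine_pem() {
  local fullchain="$1" privkey="$2" dest="$3"
  local tmp
  tmp="$(mktemp "${dest}.XXXXXX")"
  chmod 0640 "${tmp}"
  cat "${fullchain}" "${privkey}" > "${tmp}"
  mv "${tmp}" "${dest}"
}

# Demo with placeholder files in a scratch directory.
work="$(mktemp -d)"
printf 'CERT\n' > "${work}/fullchain.pem"
printf 'KEY\n' > "${work}/privkey.pem"
combine_pem "${work}/fullchain.pem" "${work}/privkey.pem" "${work}/combined.pem"
cat "${work}/combined.pem"
rm -rf "${work}"
```
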
## Let's Encrypt Rate Limits

| Limit | Value | Impact |
|-------|-------|--------|
| Certificates per registered domain | 50 per week | Individual certs for pan + nyx are well within limits |
| Duplicate certificates | 5 per week | Avoid unnecessary re-issuance |
| Failed validations | 5 per hour | DNS propagation failures count against this |

## Comparison with Titania Model

| Aspect | Titania (Current) | Centralized (This Document) |
|--------|-------------------|----------------------------|
| Certbot location | On the host itself | OCI free host |
| Namecheap credentials | On the host | Only on OCI host |
| Cert delivery | Direct to HAProxy | Via OCI Vault → Ansible |
| Renewal hook | Docker HAProxy reload | OCI Vault upload |
| Distribution | N/A (local only) | Ansible cron on controller |
| Environments served | Ouranos sandbox only | All environments |
| Service reload | `docker compose kill -s HUP` | `systemctl reload` per host_vars |

Titania can remain self-contained (it's working) or migrate to this centralized model later.

## Verification

### OCI Free Host

```bash
# Check certbot managed certificates
certbot certificates --config-dir /srv/certbot/config

# Dry-run renewal
certbot renew --config-dir /srv/certbot/config \
  --work-dir /srv/certbot/work \
  --logs-dir /srv/certbot/logs \
  --dry-run

# Check systemd timer
systemctl status certbot-renew.timer
```

### OCI Vault

```bash
# List certificate secrets with the expiry time of their current version
oci vault secret list \
  --compartment-id "$COMPARTMENT_ID" \
  --vault-id "$VAULT_ID" \
  --query 'data[*].{"name":"secret-name","version-expiry":"time-of-current-version-expiry"}' \
  --output table
```

### Target Hosts

```bash
# Verify issuer is Let's Encrypt
openssl x509 -noout -issuer -in /path/to/cert.pem

# Check expiry
openssl x509 -noout -enddate -in /path/to/cert.pem

# Test TLS connection
openssl s_client -connect pan.helu.ca:8555 </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```

### Prometheus

```promql
# All certs valid
ssl_certificate_valid == 1

# Days until expiry
ssl_certificate_expiry_seconds / 86400

# Certs expiring within 30 days
ssl_certificate_expiry_seconds < 2592000
```

## Related Documentation

- `docs/cerbot.md` — Titania certbot deployment (DNS-01 with Namecheap)
- `docs/ansible.md` — OCI Vault secret lookup patterns and vault organization
- `ansible/certbot/deploy.yml` — Certbot deployment playbook (base pattern)
- `ansible/certbot/renewal-hook.sh.j2` — Renewal hook template (Titania/HAProxy variant)
- `ansible/certbot/cert-metrics.sh.j2` — Prometheus metrics script template
- `ansible/certbot/namecheap.ini.j2` — Namecheap credentials template
- `ansible/prometheus/alert_rules.yml.j2` — Prometheus alert rules