# Let's Encrypt for Internal Production Hosts
Centralized certificate management using an OCI free-tier host running certbot with DNS-01 validation, OCI Vault for secure storage, and Ansible-driven distribution to internal production hosts.
## Overview
| Component | Value |
|-----------|-------|
| Certificate Authority | Let's Encrypt |
| Validation | DNS-01 via `certbot-dns-namecheap` |
| Certificate Generator | OCI free-tier host (certbot) |
| Certificate Store | OCI Vault (software-protected keys, free) |
| Distribution | Ansible playbook on controller (cron) |
| Target Hosts | pan.helu.ca, nyx.helu.ca (extensible) |
| Monitoring | Prometheus metrics + Grafana dashboard |
## Problem
Self-signed and private certificates on internal production hosts (pan.helu.ca, nyx.helu.ca) cause persistent issues:
- Clients must disable TLS verification or trust custom CAs
- Service-to-service HTTPS (e.g., LobeChat → MinIO S3 at `https://pan.helu.ca:8555`) requires trust workarounds
- Certificate management is manual and inconsistent across environments
- No automated renewal or expiry monitoring
## Architecture
```
┌─────────────────────────────────┐
│  OCI Free Host                  │
│  - certbot + dns-namecheap      │
│  - systemd timer (twice daily)  │
│  - post-hook: upload certs to   │
│    OCI Vault via oci-cli        │
└──────────────┬──────────────────┘
               │ oci vault secret update-base64
┌─────────────────────────────────┐
│  OCI Vault                      │
│  ouranos-certificates           │
│  ├── pan-helu-ca-fullchain      │
│  ├── pan-helu-ca-privkey        │
│  ├── nyx-helu-ca-fullchain      │
│  ├── nyx-helu-ca-privkey        │
│  └── (future domains)           │
└──────────────┬──────────────────┘
               │ community.oci.oci_secret lookup
┌─────────────────────────────────┐
│  Ansible Controller             │
│  (restricted access)            │
│  - cron: cert-distribute.yml    │
│  - pulls certs from OCI Vault   │
│  - deploys to target hosts      │
│  - updates Prometheus metrics   │
└──────────┬──────────┬───────────┘
           │          │
      ┌────▼───┐  ┌───▼────┐
      │  pan   │  │  nyx   │
      │ .helu  │  │ .helu  │
      │  .ca   │  │  .ca   │
      └────┬───┘  └───┬────┘
           │          │
           ▼          ▼
      Prometheus cert metrics
        → Grafana dashboard
```
### Design Decisions
| Decision | Rationale |
|----------|-----------|
| Individual certs (not wildcard `*.helu.ca`) | Limits blast radius; each host has only its own cert |
| Namecheap API keys only on OCI host | Production hosts never hold DNS API credentials |
| Pull model (controller pulls from vault) | OCI host doesn't need SSH access to production |
| OCI Vault as distribution bus | Aligns with existing OCI Vault pattern in `docs/ansible.md` |
| Software-protected keys | Free tier, no per-secret or per-version charges |
## Why DNS-01
DNS-01 validation is the correct choice for internal hosts. The challenge is validated via DNS TXT records managed through the Namecheap API — the target hosts (pan, nyx) do not need to be reachable from the internet on port 80 or 443.
This is the same proven approach used on Titania for `*.ouranos.helu.ca` (see `docs/cerbot.md`).
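To make the mechanism concrete: for DNS-01 the ACME server looks up a TXT record at `_acme-challenge.<domain>`. The sketch below derives that record name for each host. The `acme_challenge_name` helper is a hypothetical name used only for illustration, and the `dig` check is left commented out because it needs network access.

```bash
# Hypothetical helper: derive the DNS-01 challenge record name for a domain.
acme_challenge_name() {
    printf '_acme-challenge.%s\n' "$1"
}

for domain in pan.helu.ca nyx.helu.ca; do
    record="$(acme_challenge_name "$domain")"
    echo "TXT record to watch during issuance: ${record}"
    # Manual propagation check (requires network access and dig):
    # dig +short TXT "${record}"
done
```

Watching this record during a first issuance is a quick way to tell a slow DNS propagation apart from a credentials problem.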
## Implementation
### Phase 1: OCI Free Host — Certbot
Deploy certbot on the OCI free host using the same pattern as `ansible/certbot/deploy.yml`, stripped down for minimal footprint (no HAProxy, no Docker).
**Resource requirements**: Trivial. Certbot runs briefly twice per day. The OCI free host's constrained resources are more than sufficient.
#### Certbot Setup
1. Python virtualenv with `certbot` and `certbot-dns-namecheap`
2. Namecheap credentials in `/srv/certbot/credentials/namecheap.ini` (template: `ansible/certbot/namecheap.ini.j2`)
3. Individual certificate requests per domain:
```bash
certbot certonly \
    --non-interactive \
    --agree-tos \
    --email webmaster@helu.ca \
    --authenticator dns-namecheap \
    --dns-namecheap-credentials /srv/certbot/credentials/namecheap.ini \
    --dns-namecheap-propagation-seconds 120 \
    --config-dir /srv/certbot/config \
    --work-dir /srv/certbot/work \
    --logs-dir /srv/certbot/logs \
    --cert-name pan.helu.ca \
    -d pan.helu.ca
```
4. Systemd timer for renewal (same pattern as Titania — twice daily with `RandomizedDelaySec=3600`)
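Because each host gets its own certificate, the request in step 3 is repeated once per domain. A minimal sketch of that loop follows; the `DOMAINS` list and the `certbot_cmd` helper are illustrative names, and the command is only printed so it can be reviewed before being run for real.

```bash
# Illustrative domain list; extend it as hosts are added.
DOMAINS="pan.helu.ca nyx.helu.ca"

# Hypothetical helper: print the certbot invocation for one domain.
certbot_cmd() {
    echo "certbot certonly --non-interactive --agree-tos" \
         "--email webmaster@helu.ca" \
         "--authenticator dns-namecheap" \
         "--dns-namecheap-credentials /srv/certbot/credentials/namecheap.ini" \
         "--dns-namecheap-propagation-seconds 120" \
         "--config-dir /srv/certbot/config" \
         "--work-dir /srv/certbot/work" \
         "--logs-dir /srv/certbot/logs" \
         "--cert-name $1 -d $1"
}

# Print one command per domain (dry review, nothing is executed).
for domain in $DOMAINS; do
    certbot_cmd "$domain"
done
```

Keeping `--cert-name` equal to the domain keeps the lineage directory name predictable, which the vault secret naming relies on.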
#### Vault Upload Post-Hook
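Replaces the HAProxy reload hook used on Titania. After each successful renewal, the script base64-encodes the certificate files and uploads them to OCI Vault. Register it as a certbot deploy hook (`--deploy-hook`, or a script under `renewal-hooks/deploy/` in the config directory): certbot sets `RENEWED_LINEAGE` only for deploy hooks, so under `--post-hook` the script would abort on the unset variable.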
```bash
#!/bin/bash
# Post-renewal (deploy) hook: upload renewed certificates to OCI Vault.
# certbot sets RENEWED_LINEAGE to the live directory of the renewed certificate.
set -euo pipefail

CERT_NAME="${RENEWED_LINEAGE##*/}"
CERT_DIR="${RENEWED_LINEAGE}"
VAULT_ID="ocid1.vault.oc1..." # ouranos-certificates vault
COMPARTMENT_ID="ocid1.compartment..."

# Derive OCI secret name from cert name (pan.helu.ca → pan-helu-ca)
SECRET_PREFIX=$(echo "${CERT_NAME}" | tr '.' '-')

# Resolve a secret's OCID by name, then upload the file as a new secret version.
upload_secret() {
    local secret_name="$1" pem_file="$2" secret_id
    secret_id="$(oci vault secret list \
        --compartment-id "${COMPARTMENT_ID}" \
        --vault-id "${VAULT_ID}" \
        --name "${secret_name}" \
        --query 'data[0].id' --raw-output)"
    oci vault secret update-base64 \
        --secret-id "${secret_id}" \
        --secret-content-content "$(base64 -w0 "${pem_file}")"
}

upload_secret "${SECRET_PREFIX}-fullchain" "${CERT_DIR}/fullchain.pem"
upload_secret "${SECRET_PREFIX}-privkey" "${CERT_DIR}/privkey.pem"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Uploaded ${CERT_NAME} to OCI Vault"
```
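The hook above assumes each secret already exists in the vault. If the `secret list` lookup finds nothing, an empty or placeholder ID is handed to the update call. A minimal sketch of a guard that could sit between lookup and upload is shown below; `require_secret_id` is a hypothetical helper, and treating `null` and `None` as "not found" is an assumption about how the CLI renders an empty query result.

```bash
# Hypothetical guard: fail loudly when a secret lookup returned nothing usable.
# Usage: require_secret_id "<resolved id>" "<secret name>"
require_secret_id() {
    case "$1" in
        ""|null|None)
            echo "secret '$2' not found in vault; create it before renewal" >&2
            return 1
            ;;
        ocid1.*)
            return 0
            ;;
        *)
            echo "unexpected secret id for '$2': $1" >&2
            return 1
            ;;
    esac
}

# Placeholder values for illustration only.
require_secret_id "ocid1.example" "pan-helu-ca-fullchain" && echo "id looks usable"
require_secret_id "null" "pan-helu-ca-privkey" || echo "lookup failed, skipping upload"
```

Failing before the upload keeps a missing secret from being mistaken for a successful renewal in the hook's log line.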
### Phase 2: OCI Vault — Certificate Storage
#### Vault Organization
Extends the existing OCI Vault structure documented in `docs/ansible.md`:
```
OCI Compartment: production
├── Vault: ouranos-databases (existing)
├── Vault: ouranos-services (existing)
├── Vault: ouranos-integrations (existing)
└── Vault: ouranos-certificates (new)
├── Secret: pan-helu-ca-fullchain
├── Secret: pan-helu-ca-privkey
├── Secret: nyx-helu-ca-fullchain
├── Secret: nyx-helu-ca-privkey
└── (future domains follow same pattern)
```
**Naming convention**: Domain dots replaced with hyphens, suffixed with `-fullchain` or `-privkey`.
**Secret format**: Base64-encoded PEM content. OCI Vault secrets support versioning natively — every renewal creates a new version, providing automatic rollback capability and audit trail.
**Cost**: OCI Vault with software-protected keys is free. No per-secret or per-version charges. Store certs for as many domains as needed at zero cost.
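The naming convention and secret format above can be checked locally without touching the vault. The sketch below uses a stand-in PEM file in a temporary directory; `secret_name` is a hypothetical helper, and `base64 -w0` assumes GNU coreutils, as in the upload hook.

```bash
# Hypothetical helper: map a cert name and kind to its vault secret name.
secret_name() {
    printf '%s-%s\n' "$(printf '%s' "$1" | tr '.' '-')" "$2"
}

# Create a stand-in PEM file (not a real certificate).
workdir="$(mktemp -d)"
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'placeholder' '-----END CERTIFICATE-----' \
    > "${workdir}/fullchain.pem"

# Encode as the hook would, then decode as the playbook's b64decode would.
encoded="$(base64 -w0 "${workdir}/fullchain.pem")"
printf '%s' "${encoded}" | base64 -d > "${workdir}/decoded.pem"

echo "secret name: $(secret_name pan.helu.ca fullchain)"
cmp -s "${workdir}/fullchain.pem" "${workdir}/decoded.pem" && echo "round trip is lossless"
```

Encoding with `-w0` keeps the payload on a single line, so it can be passed as one command-line argument.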
#### IAM Policies
```
# OCI free host: can write certs to the certificates vault
Allow dynamic-group certbot-host to manage secrets in compartment production
    where target.vault.id = '<ouranos-certificates vault OCID>'
# Ansible controller: can read cert contents from the certificates vault
# (secret-family includes secret-bundles, which a content fetch requires)
Allow dynamic-group ansible-controller to read secret-family in compartment production
    where target.vault.id = '<ouranos-certificates vault OCID>'
```
### Phase 3: Ansible Controller — Distribution
#### Distribution Playbook
A `cert-distribute.yml` playbook runs on the Ansible controller via cron. It uses `community.oci.oci_secret` lookups (same pattern as existing OCI Vault usage):
```yaml
---
- name: Distribute Let's Encrypt Certificates
  hosts: certbot_targets
  vars:
    oci_compartment_id: "{{ vault_oci_compartment_id }}"
    oci_certificates_vault_id: "{{ vault_oci_certificates_vault_id }}"

  tasks:
    - name: Retrieve fullchain from OCI Vault
      ansible.builtin.set_fact:
        cert_fullchain: >-
          {{ lookup('community.oci.oci_secret',
                    certbot_cert_name | replace('.', '-') ~ '-fullchain',
                    compartment_id=oci_compartment_id,
                    vault_id=oci_certificates_vault_id) | b64decode }}

    - name: Retrieve private key from OCI Vault
      ansible.builtin.set_fact:
        cert_privkey: >-
          {{ lookup('community.oci.oci_secret',
                    certbot_cert_name | replace('.', '-') ~ '-privkey',
                    compartment_id=oci_compartment_id,
                    vault_id=oci_certificates_vault_id) | b64decode }}
      no_log: true  # keep the private key out of task output

    - name: Deploy certificate (HAProxy combined format)
      become: true
      ansible.builtin.copy:
        content: "{{ cert_fullchain }}{{ cert_privkey }}"
        dest: "{{ haproxy_cert_path }}"
        owner: "{{ certbot_user }}"
        group: "{{ haproxy_group }}"
        mode: '0640'
      when: certbot_cert_format | default('haproxy_combined') == 'haproxy_combined'
      notify: reload services

    - name: Deploy certificate (separate files)
      become: true
      ansible.builtin.copy:
        content: "{{ item.content }}"
        dest: "{{ item.dest }}"
        owner: "{{ certbot_user }}"
        group: "{{ certbot_group }}"
        mode: '0640'
      loop:
        - { content: "{{ cert_fullchain }}", dest: "{{ certbot_cert_dir }}/fullchain.pem" }
        - { content: "{{ cert_privkey }}", dest: "{{ certbot_cert_dir }}/privkey.pem" }
      loop_control:
        label: "{{ item.dest }}"  # log the path only, not the PEM content
      when: certbot_cert_format | default('haproxy_combined') == 'separate'
      notify: reload services

    - name: Update certificate metrics
      become: true
      ansible.builtin.command: "{{ certbot_metrics_script }}"
      changed_when: false

  handlers:
    - name: reload services
      become: true
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: reloaded
      loop: "{{ certbot_reload_services | default([]) }}"
```
#### Host Variables (Example)
```yaml
# host_vars/pan.helu.ca.yml
certbot_cert_name: pan.helu.ca
certbot_cert_format: separate # MinIO expects separate files
certbot_cert_dir: /etc/minio/certs
certbot_user: minio
certbot_group: minio
certbot_reload_services:
- minio
certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
```
```yaml
# host_vars/nyx.helu.ca.yml (if running HAProxy)
certbot_cert_name: nyx.helu.ca
certbot_cert_format: haproxy_combined
haproxy_cert_path: /etc/haproxy/certs/nyx.helu.ca.pem
certbot_user: certbot
haproxy_group: haproxy
certbot_reload_services:
- haproxy
certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
```
#### Cron Schedule
```cron
# Run cert distribution every 6 hours on the Ansible controller
0 */6 * * * cd /path/to/ansible && ansible-playbook cert-distribute.yml --limit certbot_targets 2>&1 | logger -t cert-distribute
```
Let's Encrypt certificates are valid for 90 days, and certbot renews them once fewer than 30 days remain. With a 6-hour distribution cadence, a renewed certificate reaches its target host within about 6 hours of being uploaded to the vault.
### Phase 4: Monitoring
#### Prometheus Metrics
Deploy the `cert-metrics.sh` script (existing template: `ansible/certbot/cert-metrics.sh.j2`) on each target host. After each distribution, it writes metrics to the node-exporter textfile directory:
| Metric | Description |
|--------|-------------|
| `ssl_certificate_expiry_timestamp` | Unix timestamp when cert expires |
| `ssl_certificate_expiry_seconds` | Seconds until cert expires |
| `ssl_certificate_valid` | 1 if valid, 0 if expired/missing |
Labels: `domain`, `issuer`
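For orientation, here is a minimal sketch of the textfile output such a script could produce. It is not the contents of `cert-metrics.sh.j2`; the `emit_cert_metrics` function and its arguments are hypothetical, and the expiry timestamp is passed in rather than parsed from a certificate so the example stays self-contained.

```bash
# Hypothetical sketch: print node-exporter textfile metrics for one certificate.
# Arguments: domain, issuer, expiry (Unix seconds), now (Unix seconds).
# The body runs in a subshell so its variables do not leak to the caller.
emit_cert_metrics() (
    domain="$1"; issuer="$2"; expiry="$3"; now="$4"
    remaining=$(( expiry - now ))
    if [ "$remaining" -gt 0 ]; then valid=1; else valid=0; fi
    labels="domain=\"${domain}\",issuer=\"${issuer}\""
    printf 'ssl_certificate_expiry_timestamp{%s} %s\n' "$labels" "$expiry"
    printf 'ssl_certificate_expiry_seconds{%s} %s\n' "$labels" "$remaining"
    printf 'ssl_certificate_valid{%s} %s\n' "$labels" "$valid"
)

# Example: a certificate with exactly 30 days left at a fixed reference time.
now=1700000000
emit_cert_metrics pan.helu.ca "Let's Encrypt" "$(( now + 30 * 86400 ))" "$now"
```

In a real deployment the output would be written to a `.prom` file in the node-exporter textfile directory, ideally via a temporary file and `mv` so the exporter never reads a half-written file.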
#### Alert Rules
Add to `ansible/prometheus/alert_rules.yml.j2`:
```yaml
- name: ssl_alerts
rules:
- alert: SSLCertificateExpiringSoon
expr: ssl_certificate_expiry_seconds < 604800
for: 1h
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon"
description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"
- alert: SSLCertificateExpired
expr: ssl_certificate_valid == 0
for: 5m
labels:
severity: critical
annotations:
summary: "SSL certificate expired or missing"
description: "Certificate for {{ $labels.domain }} is expired or not found"
```
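The warning threshold can be sanity-checked against the renewal schedule with a little arithmetic. The sketch below takes the 30-day renewal window and 6-hour cadence from the cron section and the 604800-second threshold from the rule above; the variable names are illustrative.

```bash
renew_window_days=30          # renewal starts at 30 days remaining
warning_threshold_s=604800    # SSLCertificateExpiringSoon threshold
distribution_interval_h=6     # cron cadence on the Ansible controller

# Convert the alert threshold to days and compare it with the renewal window.
warning_days=$(( warning_threshold_s / 86400 ))
slack_days=$(( renew_window_days - warning_days ))

echo "warning fires at ${warning_days} days remaining"
echo "renewal or distribution can fail for about ${slack_days} days before the warning fires"
echo "a healthy renewal reaches hosts within ${distribution_interval_h} hours"
```

In other words, the warning only fires after several weeks of failed renewal or distribution, so it signals a persistent fault rather than a transient one.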
#### Grafana Dashboard
Add a certificate status dashboard to `ansible/grafana/dashboards/` showing:
- Certificate validity status (green/red) per domain
- Days until expiry (gauge per domain)
- Renewal history (annotation markers from cert version timestamps)
- Issuer label (confirms Let's Encrypt vs self-signed)
## Certificate Format Reference
Different services expect certificates in different formats. The distribution playbook handles this via the `certbot_cert_format` host variable:
| Format | Services | Files Produced |
|--------|----------|----------------|
| `haproxy_combined` | HAProxy | Single PEM: fullchain + privkey concatenated |
| `separate` | MinIO, nginx, most services | `fullchain.pem` + `privkey.pem` as separate files |
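MinIO specifically expects `public.crt` and `private.key` in `~/.minio/certs`, or in a custom directory set via `--certs-dir`. The `separate` format writes `fullchain.pem` and `privkey.pem`, so on MinIO hosts those files need to be renamed or symlinked to the names MinIO looks for.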
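A minimal sketch of what the two formats amount to on disk, using stand-in PEM content in a temporary directory. The `write_cert_files` function is hypothetical and only mirrors the playbook's behaviour; the `combined.pem` file name is likewise illustrative.

```bash
# Hypothetical helper mirroring the playbook's two output formats.
# Arguments: format, fullchain file, privkey file, destination directory.
write_cert_files() (
    format="$1"; fullchain="$2"; privkey="$3"; dest="$4"
    umask 027   # no access for "other" on newly written files
    case "$format" in
        haproxy_combined)
            # HAProxy reads chain and key from a single PEM file.
            cat "$fullchain" "$privkey" > "${dest}/combined.pem"
            ;;
        separate)
            cp "$fullchain" "${dest}/fullchain.pem"
            cp "$privkey" "${dest}/privkey.pem"
            ;;
        *)
            echo "unknown format: ${format}" >&2
            exit 1
            ;;
    esac
)

# Stand-in inputs (not real key material).
work="$(mktemp -d)"
printf '%s\n' 'FULLCHAIN-PLACEHOLDER' > "${work}/fullchain.src"
printf '%s\n' 'PRIVKEY-PLACEHOLDER' > "${work}/privkey.src"
mkdir "${work}/haproxy" "${work}/separate"

write_cert_files haproxy_combined "${work}/fullchain.src" "${work}/privkey.src" "${work}/haproxy"
write_cert_files separate "${work}/fullchain.src" "${work}/privkey.src" "${work}/separate"
ls "${work}/haproxy" "${work}/separate"
```

The combined file places the chain first and the key last, matching the concatenation order in the playbook.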
## Let's Encrypt Rate Limits
| Limit | Value | Impact |
|-------|-------|--------|
| Certificates per registered domain | 50 per week | Individual certs for pan + nyx are well within limits |
| Duplicate certificates | 5 per week | Avoid unnecessary re-issuance |
| Failed validations | 5 per hour | DNS propagation failures count against this |
## Comparison with Titania Model
| Aspect | Titania (Current) | Centralized (This Document) |
|--------|-------------------|----------------------------|
| Certbot location | On the host itself | OCI free host |
| Namecheap credentials | On the host | Only on OCI host |
| Cert delivery | Direct to HAProxy | Via OCI Vault → Ansible |
| Renewal hook | Docker HAProxy reload | OCI Vault upload |
| Distribution | N/A (local only) | Ansible cron on controller |
| Environments served | Ouranos sandbox only | All environments |
| Service reload | `docker compose kill -s HUP` | `systemctl reload` per host_vars |
Titania can remain self-contained (it's working) or migrate to this centralized model later.
## Verification
### OCI Free Host
```bash
# Check certbot managed certificates
certbot certificates --config-dir /srv/certbot/config
# Dry-run renewal
certbot renew --config-dir /srv/certbot/config \
--work-dir /srv/certbot/work \
--logs-dir /srv/certbot/logs \
--dry-run
# Check systemd timer
systemctl status certbot-renew.timer
```
### OCI Vault
```bash
# List certificate secrets with the expiry time of each current version
oci vault secret list \
    --compartment-id "$COMPARTMENT_ID" \
    --vault-id "$VAULT_ID" \
    --query 'data[*].{"name":"secret-name","version_expiry":"time-of-current-version-expiry"}' \
    --output table
```
### Target Hosts
```bash
# Verify issuer is Let's Encrypt
openssl x509 -noout -issuer -in /path/to/cert.pem
# Check expiry
openssl x509 -noout -enddate -in /path/to/cert.pem
# Test TLS connection
openssl s_client -connect pan.helu.ca:8555 </dev/null 2>/dev/null \
| openssl x509 -noout -issuer -dates
```
### Prometheus
```promql
# All certs valid
ssl_certificate_valid == 1
# Days until expiry
ssl_certificate_expiry_seconds / 86400
# Certs expiring within 30 days
ssl_certificate_expiry_seconds < 2592000
```
## Related Documentation
- `docs/cerbot.md` — Titania certbot deployment (DNS-01 with Namecheap)
- `docs/ansible.md` — OCI Vault secret lookup patterns and vault organization
- `ansible/certbot/deploy.yml` — Certbot deployment playbook (base pattern)
- `ansible/certbot/renewal-hook.sh.j2` — Renewal hook template (Titania/HAProxy variant)
- `ansible/certbot/cert-metrics.sh.j2` — Prometheus metrics script template
- `ansible/certbot/namecheap.ini.j2` — Namecheap credentials template
- `ansible/prometheus/alert_rules.yml.j2` — Prometheus alert rules