From cdd61bd9166a625eb0abc2e7868f408110723dab Mon Sep 17 00:00:00 2001
From: Robert Helewka
Date: Sat, 14 Mar 2026 15:59:19 -0400
Subject: [PATCH] feat: add documentation for centralized certificate management using Let's Encrypt

---
 docs/certbot_internal_hosts.md | 448 +++++++++++++++++++++++++++++++++
 1 file changed, 448 insertions(+)
 create mode 100644 docs/certbot_internal_hosts.md

diff --git a/docs/certbot_internal_hosts.md b/docs/certbot_internal_hosts.md
new file mode 100644
index 0000000..b19417b
--- /dev/null
+++ b/docs/certbot_internal_hosts.md
@@ -0,0 +1,448 @@
+# Let's Encrypt for Internal Production Hosts
+
+Centralized certificate management using an OCI free-tier host running certbot with DNS-01 validation, OCI Vault for secure storage, and Ansible-driven distribution to internal production hosts.
+
+## Overview
+
+| Component | Value |
+|-----------|-------|
+| Certificate Authority | Let's Encrypt |
+| Validation | DNS-01 via `certbot-dns-namecheap` |
+| Certificate Generator | OCI free-tier host (certbot) |
+| Certificate Store | OCI Vault (software-protected keys, free) |
+| Distribution | Ansible playbook on controller (cron) |
+| Target Hosts | pan.helu.ca, nyx.helu.ca (extensible) |
+| Monitoring | Prometheus metrics + Grafana dashboard |
+
+## Problem
+
+Self-signed and private certificates on internal production hosts (pan.helu.ca, nyx.helu.ca) cause persistent issues:
+
+- Clients must disable TLS verification or trust custom CAs
+- Service-to-service HTTPS (e.g., LobeChat → MinIO S3 at `https://pan.helu.ca:8555`) requires trust workarounds
+- Certificate management is manual and inconsistent across environments
+- No automated renewal or expiry monitoring
+
+## Architecture
+
+```
+┌─────────────────────────────────┐
+│         OCI Free Host           │
+│  - certbot + dns-namecheap      │
+│  - systemd timer (twice daily)  │
+│  - post-hook: upload certs to   │
+│    OCI Vault via oci-cli        │
+└──────────────┬──────────────────┘
+               │ oci vault secret update-base64
+               ▼
+┌─────────────────────────────────┐
+│         OCI Vault               │
+│  ouranos-certificates           │
+│  ├── pan-helu-ca-fullchain      │
+│  ├── pan-helu-ca-privkey        │
+│  ├── nyx-helu-ca-fullchain      │
+│  ├── nyx-helu-ca-privkey        │
+│  └── (future domains)           │
+└──────────────┬──────────────────┘
+               │ community.oci.oci_secret lookup
+               ▼
+┌─────────────────────────────────┐
+│     Ansible Controller          │
+│     (restricted access)         │
+│  - cron: cert-distribute.yml    │
+│  - pulls certs from OCI Vault   │
+│  - deploys to target hosts      │
+│  - updates Prometheus metrics   │
+└──────────┬──────────┬───────────┘
+           │          │
+      ┌────▼───┐  ┌───▼────┐
+      │  pan   │  │  nyx   │
+      │ .helu  │  │ .helu  │
+      │  .ca   │  │  .ca   │
+      └────┬───┘  └───┬────┘
+           │          │
+           ▼          ▼
+    Prometheus cert metrics
+    → Grafana dashboard
+```
+
+### Design Decisions
+
+| Decision | Rationale |
+|----------|-----------|
+| Individual certs (not wildcard `*.helu.ca`) | Limits blast radius; each host has only its own cert |
+| Namecheap API keys only on OCI host | Production hosts never hold DNS API credentials |
+| Pull model (controller pulls from vault) | OCI host doesn't need SSH access to production |
+| OCI Vault as distribution bus | Aligns with existing OCI Vault pattern in `docs/ansible.md` |
+| Software-protected keys | Free tier, no per-secret or per-version charges |
+
+## Why DNS-01
+
+DNS-01 validation is the correct choice for internal hosts. The challenge is validated via DNS TXT records managed through the Namecheap API, so the target hosts (pan, nyx) do not need to be reachable from the internet on port 80 or 443.
+
+This is the same proven approach used on Titania for `*.ouranos.helu.ca` (see `docs/cerbot.md`).
+
+## Implementation
+
+### Phase 1: OCI Free Host — Certbot
+
+Deploy certbot on the OCI free host using the same pattern as `ansible/certbot/deploy.yml`, stripped down for minimal footprint (no HAProxy, no Docker).
+
+**Resource requirements**: Trivial. Certbot runs briefly twice per day. The OCI free host's constrained resources are more than sufficient.
+
+#### Certbot Setup
+
+1. Python virtualenv with `certbot` and `certbot-dns-namecheap`
+2. Namecheap credentials in `/srv/certbot/credentials/namecheap.ini` (template: `ansible/certbot/namecheap.ini.j2`)
+3. Individual certificate requests per domain:
+
+```bash
+certbot certonly \
+    --non-interactive \
+    --agree-tos \
+    --email webmaster@helu.ca \
+    --authenticator dns-namecheap \
+    --dns-namecheap-credentials /srv/certbot/credentials/namecheap.ini \
+    --dns-namecheap-propagation-seconds 120 \
+    --config-dir /srv/certbot/config \
+    --work-dir /srv/certbot/work \
+    --logs-dir /srv/certbot/logs \
+    --cert-name pan.helu.ca \
+    -d pan.helu.ca
+```
+
+4. Systemd timer for renewal (same pattern as Titania: twice daily with `RandomizedDelaySec=3600`)
+
+#### Vault Upload Post-Hook
+
+Replaces the HAProxy reload hook used on Titania. After each renewal, base64-encodes and uploads certificate files to OCI Vault. Install it as a certbot deploy hook (`--deploy-hook` or `renewal-hooks/deploy/`): certbot sets `RENEWED_LINEAGE` only for deploy hooks, not for `--post-hook`.
+
+```bash
+#!/bin/bash
+# Post-renewal (deploy) hook: upload certificates to OCI Vault
+set -euo pipefail
+
+# RENEWED_LINEAGE is the live directory of the renewed cert, e.g. /srv/certbot/config/live/pan.helu.ca
+CERT_NAME="${RENEWED_LINEAGE##*/}"
+CERT_DIR="${RENEWED_LINEAGE}"
+VAULT_ID="ocid1.vault.oc1..."          # ouranos-certificates vault
+COMPARTMENT_ID="ocid1.compartment..."
+
+# Derive OCI secret name from cert name (pan.helu.ca → pan-helu-ca)
+SECRET_PREFIX=$(echo "${CERT_NAME}" | tr '.' '-')
+
+# update-base64 stores the content as a new version of an existing secret
+# Upload fullchain
+oci vault secret update-base64 \
+    --secret-id "$(oci vault secret list \
+        --compartment-id "${COMPARTMENT_ID}" \
+        --vault-id "${VAULT_ID}" \
+        --name "${SECRET_PREFIX}-fullchain" \
+        --query 'data[0].id' --raw-output)" \
+    --secret-content-content "$(base64 -w0 "${CERT_DIR}/fullchain.pem")"
+
+# Upload private key
+oci vault secret update-base64 \
+    --secret-id "$(oci vault secret list \
+        --compartment-id "${COMPARTMENT_ID}" \
+        --vault-id "${VAULT_ID}" \
+        --name "${SECRET_PREFIX}-privkey" \
+        --query 'data[0].id' --raw-output)" \
+    --secret-content-content "$(base64 -w0 "${CERT_DIR}/privkey.pem")"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] Uploaded ${CERT_NAME} to OCI Vault"
+```
+
+### Phase 2: OCI Vault — Certificate Storage
+
+#### Vault Organization
+
+Extends the existing OCI Vault structure documented in `docs/ansible.md`:
+
+```
+OCI Compartment: production
+├── Vault: ouranos-databases      (existing)
+├── Vault: ouranos-services       (existing)
+├── Vault: ouranos-integrations   (existing)
+└── Vault: ouranos-certificates   (new)
+    ├── Secret: pan-helu-ca-fullchain
+    ├── Secret: pan-helu-ca-privkey
+    ├── Secret: nyx-helu-ca-fullchain
+    ├── Secret: nyx-helu-ca-privkey
+    └── (future domains follow same pattern)
+```
+
+**Naming convention**: Domain dots replaced with hyphens, suffixed with `-fullchain` or `-privkey`.
+
+**Secret format**: Base64-encoded PEM content. OCI Vault secrets support versioning natively: every renewal creates a new version, providing automatic rollback capability and an audit trail.
+
+**Cost**: OCI Vault with software-protected keys is free. No per-secret or per-version charges. Store certs for as many domains as needed at zero cost.
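
The naming convention and secret format described above can be sketched as a few shell helpers. This is only an illustrative sketch: `secret_name`, `encode_pem`, and `decode_pem` are hypothetical names that do not exist in the repository, and `base64 -d` is assumed to behave as in GNU coreutils.

```bash
# Hypothetical helpers illustrating the vault naming convention and secret format.
set -eu

# secret_name DOMAIN KIND: pan.helu.ca + fullchain -> pan-helu-ca-fullchain
secret_name() {
    printf '%s-%s\n' "$(printf '%s' "$1" | tr '.' '-')" "$2"
}

# encode_pem FILE: print the file as one unwrapped base64 line
# (same result as GNU `base64 -w0 FILE`).
encode_pem() {
    base64 < "$1" | tr -d '\n'
}

# decode_pem VALUE: turn a stored secret value back into PEM text on stdout.
decode_pem() {
    printf '%s' "$1" | base64 -d
}
```

For example, `secret_name nyx.helu.ca privkey` prints `nyx-helu-ca-privkey`, and piping `encode_pem` output through `decode_pem` should reproduce the original PEM file, which is a cheap sanity check before uploading.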
+
+#### IAM Policies
+
+```
+# OCI free host: can write certs to the certificates vault (use the ouranos-certificates vault OCID)
+Allow dynamic-group certbot-host to manage secrets in compartment production
+  where target.vault.id = 'ocid1.vault.oc1...'
+
+# Ansible controller: can read certs from the certificates vault
+Allow dynamic-group ansible-controller to read secrets in compartment production
+  where target.vault.id = 'ocid1.vault.oc1...'
+```
+
+### Phase 3: Ansible Controller — Distribution
+
+#### Distribution Playbook
+
+A `cert-distribute.yml` playbook runs on the Ansible controller via cron. It uses `community.oci.oci_secret` lookups (same pattern as existing OCI Vault usage):
+
+```yaml
+---
+- name: Distribute Let's Encrypt Certificates
+  hosts: certbot_targets
+  vars:
+    oci_compartment_id: "{{ vault_oci_compartment_id }}"
+    oci_certificates_vault_id: "{{ vault_oci_certificates_vault_id }}"
+
+  tasks:
+    - name: Retrieve fullchain from OCI Vault
+      ansible.builtin.set_fact:
+        cert_fullchain: >-
+          {{ lookup('community.oci.oci_secret',
+                    certbot_cert_name | replace('.', '-') ~ '-fullchain',
+                    compartment_id=oci_compartment_id,
+                    vault_id=oci_certificates_vault_id) | b64decode }}
+
+    - name: Retrieve private key from OCI Vault
+      ansible.builtin.set_fact:
+        cert_privkey: >-
+          {{ lookup('community.oci.oci_secret',
+                    certbot_cert_name | replace('.', '-') ~ '-privkey',
+                    compartment_id=oci_compartment_id,
+                    vault_id=oci_certificates_vault_id) | b64decode }}
+
+    - name: Deploy certificate (HAProxy combined format)
+      become: true
+      ansible.builtin.copy:
+        content: "{{ cert_fullchain }}{{ cert_privkey }}"
+        dest: "{{ haproxy_cert_path }}"
+        owner: "{{ certbot_user }}"
+        group: "{{ haproxy_group }}"
+        mode: '0640'
+      when: certbot_cert_format | default('haproxy_combined') == 'haproxy_combined'
+      notify: reload services
+
+    - name: Deploy certificate (separate files)
+      become: true
+      ansible.builtin.copy:
+        content: "{{ item.content }}"
+        dest: "{{ item.dest }}"
+        owner: "{{ certbot_user }}"
+        group: "{{ certbot_group }}"
+        mode: '0640'
+      loop:
+        - { content: "{{ cert_fullchain }}", dest: "{{ certbot_cert_dir }}/fullchain.pem" }
+        - { content: "{{ cert_privkey }}", dest: "{{ certbot_cert_dir }}/privkey.pem" }
+      when: certbot_cert_format | default('haproxy_combined') == 'separate'
+      notify: reload services
+
+    - name: Update certificate metrics
+      become: true
+      ansible.builtin.command: "{{ certbot_metrics_script }}"
+      changed_when: false
+
+  handlers:
+    - name: reload services
+      become: true
+      ansible.builtin.systemd:
+        name: "{{ item }}"
+        state: reloaded
+      loop: "{{ certbot_reload_services | default([]) }}"
+```
+
+#### Host Variables (Example)
+
+```yaml
+# host_vars/pan.helu.ca.yml
+certbot_cert_name: pan.helu.ca
+certbot_cert_format: separate        # MinIO expects separate files
+certbot_cert_dir: /etc/minio/certs
+certbot_user: minio
+certbot_group: minio
+certbot_reload_services:
+  - minio
+
+certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
+prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
+```
+
+```yaml
+# host_vars/nyx.helu.ca.yml (if running HAProxy)
+certbot_cert_name: nyx.helu.ca
+certbot_cert_format: haproxy_combined
+haproxy_cert_path: /etc/haproxy/certs/nyx.helu.ca.pem
+certbot_user: certbot
+haproxy_group: haproxy
+certbot_reload_services:
+  - haproxy
+
+certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
+prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
+```
+
+#### Cron Schedule
+
+```cron
+# Run cert distribution every 6 hours on the Ansible controller
+0 */6 * * * cd /path/to/ansible && ansible-playbook cert-distribute.yml --limit certbot_targets 2>&1 | logger -t cert-distribute
+```
+
+Let's Encrypt certs are valid for 90 days, and certbot renews them once fewer than 30 days remain. With a 6-hour distribution cadence, a renewed cert reaches its target host within about 6 hours of being uploaded to the vault.
+
+### Phase 4: Monitoring
+
+#### Prometheus Metrics
+
+Deploy the `cert-metrics.sh` script (existing template: `ansible/certbot/cert-metrics.sh.j2`) on each target host. After each distribution, it writes metrics to the node-exporter textfile directory:
+
+| Metric | Description |
+|--------|-------------|
+| `ssl_certificate_expiry_timestamp` | Unix timestamp when cert expires |
+| `ssl_certificate_expiry_seconds` | Seconds until cert expires |
+| `ssl_certificate_valid` | 1 if valid, 0 if expired/missing |
+
+Labels: `domain`, `issuer`
+
+#### Alert Rules
+
+Add to `ansible/prometheus/alert_rules.yml.j2` (inside a `{% raw %}` block, so Jinja2 leaves the Prometheus `{{ ... }}` templates untouched):
+
+```yaml
+- name: ssl_alerts
+  rules:
+    - alert: SSLCertificateExpiringSoon
+      expr: ssl_certificate_expiry_seconds < 604800
+      for: 1h
+      labels:
+        severity: warning
+      annotations:
+        summary: "SSL certificate expiring soon"
+        description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"
+
+    - alert: SSLCertificateExpired
+      expr: ssl_certificate_valid == 0
+      for: 5m
+      labels:
+        severity: critical
+      annotations:
+        summary: "SSL certificate expired or missing"
+        description: "Certificate for {{ $labels.domain }} is expired or not found"
+```
+
+#### Grafana Dashboard
+
+Add a certificate status dashboard to `ansible/grafana/dashboards/` showing:
+
+- Certificate validity status (green/red) per domain
+- Days until expiry (gauge per domain)
+- Renewal history (annotation markers from cert version timestamps)
+- Issuer label (confirms Let's Encrypt vs self-signed)
+
+## Certificate Format Reference
+
+Different services expect certificates in different formats. The distribution playbook handles this via the `certbot_cert_format` host variable:
+
+| Format | Services | Files Produced |
+|--------|----------|----------------|
+| `haproxy_combined` | HAProxy | Single PEM: fullchain + privkey concatenated |
+| `separate` | MinIO, nginx, most services | `fullchain.pem` + `privkey.pem` as separate files |
+
+MinIO specifically expects certs at `~/.minio/certs/public.crt` and `~/.minio/certs/private.key`, or a custom path via `--certs-dir`. Because the `separate` format writes `fullchain.pem` and `privkey.pem`, those file names still have to be mapped to `public.crt` and `private.key` for MinIO to load them.
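
The two formats in the table above can be illustrated with a small shell sketch. It is not the playbook's implementation: `write_combined_pem` and `write_minio_pair` are hypothetical helper names, and only the file names `public.crt` and `private.key` come from the MinIO note above.

```bash
# Hypothetical helpers showing what each certbot_cert_format produces on disk.
set -eu

# haproxy_combined: one PEM file holding the full chain followed by the private key.
# Usage: write_combined_pem FULLCHAIN PRIVKEY OUTPUT
write_combined_pem() {
    # Subshell keeps the umask change local; a newly created file gets mode 0640.
    ( umask 027; cat "$1" "$2" > "$3" )
}

# separate, with MinIO naming: write the pair under the names MinIO looks for.
# Usage: write_minio_pair FULLCHAIN PRIVKEY CERTS_DIR
write_minio_pair() {
    ( umask 027
      cat "$1" > "$3/public.crt"
      cat "$2" > "$3/private.key" )
}
```

The chain comes first in the combined file, matching the order of the `content` concatenation in the distribution playbook.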
+
+## Let's Encrypt Rate Limits
+
+| Limit | Value | Impact |
+|-------|-------|--------|
+| Certificates per registered domain | 50 per week | Individual certs for pan + nyx are well within limits |
+| Duplicate certificates | 5 per week | Avoid unnecessary re-issuance |
+| Failed validations | 5 per hour | DNS propagation failures count against this |
+
+## Comparison with Titania Model
+
+| Aspect | Titania (Current) | Centralized (This Document) |
+|--------|-------------------|----------------------------|
+| Certbot location | On the host itself | OCI free host |
+| Namecheap credentials | On the host | Only on OCI host |
+| Cert delivery | Direct to HAProxy | Via OCI Vault → Ansible |
+| Renewal hook | Docker HAProxy reload | OCI Vault upload |
+| Distribution | N/A (local only) | Ansible cron on controller |
+| Environments served | Ouranos sandbox only | All environments |
+| Service reload | `docker compose kill -s HUP` | `systemctl reload` per host_vars |
+
+Titania can remain self-contained (it's working) or migrate to this centralized model later.
+
+## Verification
+
+### OCI Free Host
+
+```bash
+# Check certbot managed certificates
+certbot certificates --config-dir /srv/certbot/config
+
+# Dry-run renewal
+certbot renew --config-dir /srv/certbot/config \
+    --work-dir /srv/certbot/work \
+    --logs-dir /srv/certbot/logs \
+    --dry-run
+
+# Check systemd timer
+systemctl status certbot-renew.timer
+```
+
+### OCI Vault
+
+```bash
+# List certificate secrets (the list output reports the current version's expiry, not a last-updated time)
+oci vault secret list \
+    --compartment-id "$COMPARTMENT_ID" \
+    --vault-id "$VAULT_ID" \
+    --query 'data[*].{"name":"secret-name","current-version-expiry":"time-of-current-version-expiry"}' \
+    --output table
+```
+
+### Target Hosts
+
+```bash
+# Verify issuer is Let's Encrypt
+openssl x509 -noout -issuer -in /path/to/cert.pem
+
+# Check expiry
+openssl x509 -noout -enddate -in /path/to/cert.pem
+
+# Test TLS connection (stdin from /dev/null so s_client exits after the handshake)
+openssl s_client -connect pan.helu.ca:8555 < /dev/null 2> /dev/null \
+    | openssl x509 -noout -issuer -dates
+```
+
+### Prometheus
+
+```promql
+# All certs valid
+ssl_certificate_valid == 1
+
+# Days until expiry
+ssl_certificate_expiry_seconds / 86400
+
+# Certs expiring within 30 days
+ssl_certificate_expiry_seconds < 2592000
+```
+
+## Related Documentation
+
+- `docs/cerbot.md`: Titania certbot deployment (DNS-01 with Namecheap)
+- `docs/ansible.md`: OCI Vault secret lookup patterns and vault organization
+- `ansible/certbot/deploy.yml`: Certbot deployment playbook (base pattern)
+- `ansible/certbot/renewal-hook.sh.j2`: Renewal hook template (Titania/HAProxy variant)
+- `ansible/certbot/cert-metrics.sh.j2`: Prometheus metrics script template
+- `ansible/certbot/namecheap.ini.j2`: Namecheap credentials template
+- `ansible/prometheus/alert_rules.yml.j2`: Prometheus alert rules