# Let's Encrypt for Internal Production Hosts

Centralized certificate management using an OCI free-tier host running certbot with DNS-01 validation, OCI Vault for secure storage, and Ansible-driven distribution to internal production hosts.

## Overview

| Component | Value |
|-----------|-------|
| Certificate Authority | Let's Encrypt |
| Validation | DNS-01 via `certbot-dns-namecheap` |
| Certificate Generator | OCI free-tier host (certbot) |
| Certificate Store | OCI Vault (software-protected keys, free) |
| Distribution | Ansible playbook on controller (cron) |
| Target Hosts | pan.helu.ca, nyx.helu.ca (extensible) |
| Monitoring | Prometheus metrics + Grafana dashboard |

## Problem

Self-signed and private certificates on internal production hosts (pan.helu.ca, nyx.helu.ca) cause persistent issues:

- Clients must disable TLS verification or trust custom CAs
- Service-to-service HTTPS (e.g., LobeChat → MinIO S3 at `https://pan.helu.ca:8555`) requires trust workarounds
- Certificate management is manual and inconsistent across environments
- No automated renewal or expiry monitoring

## Architecture

```
┌─────────────────────────────────┐
│          OCI Free Host          │
│  - certbot + dns-namecheap      │
│  - systemd timer (twice daily)  │
│  - post-hook: upload certs to   │
│    OCI Vault via oci-cli        │
└──────────────┬──────────────────┘
               │ oci vault secret update-base64
               ▼
┌─────────────────────────────────┐
│            OCI Vault            │
│  ouranos-certificates           │
│  ├── pan-helu-ca-fullchain      │
│  ├── pan-helu-ca-privkey        │
│  ├── nyx-helu-ca-fullchain      │
│  ├── nyx-helu-ca-privkey        │
│  └── (future domains)           │
└──────────────┬──────────────────┘
               │ community.oci.oci_secret lookup
               ▼
┌─────────────────────────────────┐
│       Ansible Controller        │
│  (restricted access)            │
│  - cron: cert-distribute.yml    │
│  - pulls certs from OCI Vault   │
│  - deploys to target hosts      │
│  - updates Prometheus metrics   │
└──────────┬──────────┬───────────┘
           │          │
      ┌────▼───┐  ┌───▼────┐
      │  pan   │  │  nyx   │
      │ .helu  │  │ .helu  │
      │  .ca   │  │  .ca   │
      └────┬───┘  └───┬────┘
           │          │
           ▼          ▼
    Prometheus cert metrics → Grafana dashboard
```

### Design Decisions

| Decision | Rationale |
|----------|-----------|
| Individual certs (not wildcard `*.helu.ca`) | Limits blast radius; each host has only its own cert |
| Namecheap API keys only on OCI host | Production hosts never hold DNS API credentials |
| Pull model (controller pulls from vault) | OCI host doesn't need SSH access to production |
| OCI Vault as distribution bus | Aligns with existing OCI Vault pattern in `docs/ansible.md` |
| Software-protected keys | Free tier, no per-secret or per-version charges |

## Why DNS-01

DNS-01 validation is the right fit for internal hosts: the challenge is answered with DNS TXT records created through the Namecheap API, so the target hosts (pan, nyx) never need to be reachable from the internet on port 80 or 443. This is the same approach already used on Titania for `*.ouranos.helu.ca` (see `docs/cerbot.md`).

## Implementation

### Phase 1: OCI Free Host — Certbot

Deploy certbot on the OCI free host using the same pattern as `ansible/certbot/deploy.yml`, stripped down for minimal footprint (no HAProxy, no Docker).

**Resource requirements**: Trivial. Certbot runs briefly twice per day, which the OCI free host's constrained resources handle easily.

#### Certbot Setup

1. Python virtualenv with `certbot` and `certbot-dns-namecheap`
2. Namecheap credentials in `/srv/certbot/credentials/namecheap.ini` (template: `ansible/certbot/namecheap.ini.j2`)
3. Individual certificate requests per domain:

   ```bash
   certbot certonly \
       --non-interactive \
       --agree-tos \
       --email webmaster@helu.ca \
       --authenticator dns-namecheap \
       --dns-namecheap-credentials /srv/certbot/credentials/namecheap.ini \
       --dns-namecheap-propagation-seconds 120 \
       --config-dir /srv/certbot/config \
       --work-dir /srv/certbot/work \
       --logs-dir /srv/certbot/logs \
       --cert-name pan.helu.ca \
       -d pan.helu.ca
   ```

4. Systemd timer for renewal (same pattern as Titania: twice daily with `RandomizedDelaySec=3600`)

#### Vault Upload Post-Hook

This hook replaces the HAProxy reload hook used on Titania. After each successful renewal, it Base64-encodes the certificate files and uploads them to OCI Vault. It relies on `RENEWED_LINEAGE`, which certbot sets only for deploy hooks, so register it with `--deploy-hook` or place it in `renewal-hooks/deploy/` under the config directory. The hook updates existing secrets, so each `-fullchain` and `-privkey` secret must be created once in the vault before the first renewal.

```bash
#!/bin/bash
# Post-renewal hook: upload certificates to OCI Vault.
# Run as a certbot deploy hook so that RENEWED_LINEAGE is set.
set -euo pipefail

CERT_NAME="${RENEWED_LINEAGE##*/}"
CERT_DIR="${RENEWED_LINEAGE}"
VAULT_ID="ocid1.vault.oc1..."          # ouranos-certificates vault
COMPARTMENT_ID="ocid1.compartment..."

# Derive OCI secret name from cert name (pan.helu.ca → pan-helu-ca)
SECRET_PREFIX=$(echo "${CERT_NAME}" | tr '.' '-')

# Upload fullchain as a new secret version
oci vault secret update-base64 \
    --secret-id "$(oci vault secret list \
        --compartment-id "${COMPARTMENT_ID}" \
        --vault-id "${VAULT_ID}" \
        --name "${SECRET_PREFIX}-fullchain" \
        --query 'data[0].id' --raw-output)" \
    --secret-content-content "$(base64 -w0 "${CERT_DIR}/fullchain.pem")"

# Upload private key as a new secret version
oci vault secret update-base64 \
    --secret-id "$(oci vault secret list \
        --compartment-id "${COMPARTMENT_ID}" \
        --vault-id "${VAULT_ID}" \
        --name "${SECRET_PREFIX}-privkey" \
        --query 'data[0].id' --raw-output)" \
    --secret-content-content "$(base64 -w0 "${CERT_DIR}/privkey.pem")"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Uploaded ${CERT_NAME} to OCI Vault"
```

### Phase 2: OCI Vault — Certificate Storage

#### Vault Organization

Extends the existing OCI Vault structure documented in `docs/ansible.md`:

```
OCI Compartment: production
├── Vault: ouranos-databases      (existing)
├── Vault: ouranos-services       (existing)
├── Vault: ouranos-integrations   (existing)
└── Vault: ouranos-certificates   (new)
    ├── Secret: pan-helu-ca-fullchain
    ├── Secret: pan-helu-ca-privkey
    ├── Secret: nyx-helu-ca-fullchain
    ├── Secret: nyx-helu-ca-privkey
    └── (future domains follow same pattern)
```

**Naming convention**: Domain dots replaced with hyphens, suffixed with `-fullchain` or `-privkey`.

**Secret format**: Base64-encoded PEM content.
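
As a minimal sketch of the naming convention and secret format above, the helpers below derive a secret name from a certificate name and round-trip a PEM file through Base64. The function names (`secret_name`, `encode_pem`, `decode_pem`) and the placeholder PEM content are illustrative assumptions, not part of the existing hook. `base64 -w0` assumes GNU coreutils, as in the upload hook.

```shell
#!/usr/bin/env bash
set -euo pipefail

# secret_name <cert-name> <fullchain|privkey>
# Hypothetical helper: replaces dots with hyphens and appends the suffix.
secret_name() {
    local cert_name="$1" kind="$2"
    printf '%s-%s\n' "${cert_name//./-}" "${kind}"
}

# encode_pem <file>: emit the file as single-line Base64 (GNU base64).
encode_pem() {
    base64 -w0 "$1"
}

# decode_pem: read Base64 on stdin and write the original bytes to stdout.
decode_pem() {
    base64 -d
}

# Demo with a placeholder file; this is not a real certificate.
demo_pem="$(mktemp)"
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'placeholder' \
    '-----END CERTIFICATE-----' > "${demo_pem}"

secret_name pan.helu.ca fullchain
secret_name nyx.helu.ca privkey

# Encoding then decoding must reproduce the file byte for byte.
if encode_pem "${demo_pem}" | decode_pem | cmp -s - "${demo_pem}"; then
    echo "round trip OK"
fi
rm -f "${demo_pem}"
```

The single-line encoding matters because the value is passed as one command-line argument to the OCI CLI. The consumer only needs the inverse step, which in the playbook is the `b64decode` filter.
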
OCI Vault secrets are versioned natively: every renewal creates a new secret version, which provides a rollback path and an audit trail.

**Cost**: OCI Vault with software-protected keys is free, with no per-secret or per-version charges. Certificates for additional domains can be stored at zero cost, within the tenancy's service limits for secrets and secret versions.

#### IAM Policies

The vault OCID in both statements is a placeholder. The statements use `secret-family` rather than `secrets` because reading secret contents requires access to `secret-bundles`, which the aggregate type includes.

```
# OCI free host: can write certs to the certificates vault
Allow dynamic-group certbot-host to manage secret-family in compartment production where target.vault.id = '<ouranos-certificates-vault-ocid>'

# Ansible controller: can read certs from the certificates vault
Allow dynamic-group ansible-controller to read secret-family in compartment production where target.vault.id = '<ouranos-certificates-vault-ocid>'
```

### Phase 3: Ansible Controller — Distribution

#### Distribution Playbook

A `cert-distribute.yml` playbook runs on the Ansible controller via cron. It uses `community.oci.oci_secret` lookups (same pattern as existing OCI Vault usage):

```yaml
---
- name: Distribute Let's Encrypt Certificates
  hosts: certbot_targets
  vars:
    oci_compartment_id: "{{ vault_oci_compartment_id }}"
    oci_certificates_vault_id: "{{ vault_oci_certificates_vault_id }}"

  tasks:
    - name: Retrieve fullchain from OCI Vault
      ansible.builtin.set_fact:
        cert_fullchain: >-
          {{ lookup('community.oci.oci_secret',
                    certbot_cert_name | replace('.', '-') ~ '-fullchain',
                    compartment_id=oci_compartment_id,
                    vault_id=oci_certificates_vault_id) | b64decode }}

    - name: Retrieve private key from OCI Vault
      ansible.builtin.set_fact:
        cert_privkey: >-
          {{ lookup('community.oci.oci_secret',
                    certbot_cert_name | replace('.', '-') ~ '-privkey',
                    compartment_id=oci_compartment_id,
                    vault_id=oci_certificates_vault_id) | b64decode }}
      no_log: true  # keep the private key out of task output and syslog

    - name: Deploy certificate (HAProxy combined format)
      become: true
      ansible.builtin.copy:
        content: "{{ cert_fullchain }}{{ cert_privkey }}"
        dest: "{{ haproxy_cert_path }}"
        owner: "{{ certbot_user }}"
        group: "{{ haproxy_group }}"
        mode: '0640'
      no_log: true
      when: certbot_cert_format | default('haproxy_combined') == 'haproxy_combined'
      notify: reload services

    - name: Deploy certificate (separate files)
      become: true
      ansible.builtin.copy:
        content: "{{ item.content }}"
        dest: "{{ certbot_cert_dir }}/{{ item.filename }}"
        owner: "{{ certbot_user }}"
        group: "{{ certbot_group }}"
        mode: '0640'
      loop:
        # File names default to certbot's; override per host where required
        - content: "{{ cert_fullchain }}"
          filename: "{{ certbot_fullchain_filename | default('fullchain.pem') }}"
        - content: "{{ cert_privkey }}"
          filename: "{{ certbot_privkey_filename | default('privkey.pem') }}"
      no_log: true
      when: certbot_cert_format | default('haproxy_combined') == 'separate'
      notify: reload services

    - name: Update certificate metrics
      become: true
      ansible.builtin.command: "{{ certbot_metrics_script }}"
      changed_when: false

  handlers:
    - name: reload services
      become: true
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: reloaded
      loop: "{{ certbot_reload_services | default([]) }}"
```

The `separate` format writes `fullchain.pem` and `privkey.pem` by default. The optional `certbot_fullchain_filename` and `certbot_privkey_filename` variables override those names for services such as MinIO, which only loads `public.crt` and `private.key` from its certs directory.

#### Host Variables (Example)

```yaml
# host_vars/pan.helu.ca.yml
certbot_cert_name: pan.helu.ca
certbot_cert_format: separate              # MinIO expects separate files
certbot_cert_dir: /etc/minio/certs
certbot_fullchain_filename: public.crt     # file names MinIO requires
certbot_privkey_filename: private.key
certbot_user: minio
certbot_group: minio
certbot_reload_services:
  - minio
certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
```

```yaml
# host_vars/nyx.helu.ca.yml (if running HAProxy)
certbot_cert_name: nyx.helu.ca
certbot_cert_format: haproxy_combined
haproxy_cert_path: /etc/haproxy/certs/nyx.helu.ca.pem
certbot_user: certbot
haproxy_group: haproxy
certbot_reload_services:
  - haproxy
certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter
```

#### Cron Schedule

```cron
# Run cert distribution every 6 hours on the Ansible controller
0 */6 * * * cd /path/to/ansible && ansible-playbook cert-distribute.yml --limit certbot_targets 2>&1 | logger -t cert-distribute
```

Let's Encrypt certs are valid for 90 days, and certbot renews them once fewer than 30 days remain. A 6-hour distribution cadence means a renewed cert reaches its target host within hours of renewal.
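
To make the timing concrete, the following is a rough sketch of the worst-case arithmetic, using only figures from this document: the 30-day renewal threshold, the twice-daily timer with up to one hour of jitter, the 6-hour cron cadence, and the 604800-second warning threshold from the alert rules in Phase 4. It assumes timers fire on schedule, the first renewal attempt succeeds, and the playbook run time is negligible. The function names are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

RENEW_THRESHOLD_HOURS=$((30 * 24))         # certbot renews below 30 days remaining
TIMER_INTERVAL_HOURS=12                    # systemd timer runs twice daily
TIMER_JITTER_HOURS=1                       # RandomizedDelaySec=3600
DISTRIBUTE_INTERVAL_HOURS=6                # cron: 0 */6 * * *
ALERT_THRESHOLD_HOURS=$((604800 / 3600))   # SSLCertificateExpiringSoon threshold

# Hours of validity left on the old cert when its replacement is deployed,
# assuming every delay in the chain takes its maximum value.
worst_case_hours_left() {
    echo $((RENEW_THRESHOLD_HOURS - TIMER_INTERVAL_HOURS \
        - TIMER_JITTER_HOURS - DISTRIBUTE_INTERVAL_HOURS))
}

# Margin between that worst case and the warning alert threshold.
alert_headroom_hours() {
    echo $(($(worst_case_hours_left) - ALERT_THRESHOLD_HOURS))
}

echo "worst case: $(worst_case_hours_left) hours left on the old cert"
echo "alert headroom: $(alert_headroom_hours) hours"
```

Under these assumptions the old certificate still has 701 hours (about 29 days) of validity when it is replaced, leaving 533 hours of headroom over the 7-day warning threshold. In normal operation the warning alert should therefore fire only when renewal or distribution is actually failing.
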
### Phase 4: Monitoring

#### Prometheus Metrics

Deploy the `cert-metrics.sh` script (existing template: `ansible/certbot/cert-metrics.sh.j2`) on each target host. After each distribution, it writes metrics to the node-exporter textfile directory:

| Metric | Description |
|--------|-------------|
| `ssl_certificate_expiry_timestamp` | Unix timestamp when cert expires |
| `ssl_certificate_expiry_seconds` | Seconds until cert expires |
| `ssl_certificate_valid` | 1 if valid, 0 if expired/missing |

Labels: `domain`, `issuer`

#### Alert Rules

Add to `ansible/prometheus/alert_rules.yml.j2`. Because that file is a Jinja2 template, the Prometheus `{{ ... }}` expressions need to be wrapped in `{% raw %}` and `{% endraw %}`, unless the template already escapes them or uses different delimiters.

```yaml
- name: ssl_alerts
  rules:
    - alert: SSLCertificateExpiringSoon
      expr: ssl_certificate_expiry_seconds < 604800  # 7 days
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "SSL certificate expiring soon"
        description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"

    - alert: SSLCertificateExpired
      expr: ssl_certificate_valid == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "SSL certificate expired or missing"
        description: "Certificate for {{ $labels.domain }} is expired or not found"
```

#### Grafana Dashboard

Add a certificate status dashboard to `ansible/grafana/dashboards/` showing:

- Certificate validity status (green/red) per domain
- Days until expiry (gauge per domain)
- Renewal history (annotation markers from cert version timestamps)
- Issuer label (confirms Let's Encrypt vs self-signed)

## Certificate Format Reference

Different services expect certificates in different formats.
The distribution playbook handles this via the `certbot_cert_format` host variable:

| Format | Services | Files Produced |
|--------|----------|----------------|
| `haproxy_combined` | HAProxy | Single PEM: fullchain + privkey concatenated |
| `separate` | MinIO, nginx, most services | `fullchain.pem` + `privkey.pem` as separate files |

MinIO specifically expects certs at `~/.minio/certs/public.crt` and `~/.minio/certs/private.key`, or a custom path via `--certs-dir`.

## Let's Encrypt Rate Limits

| Limit | Value | Impact |
|-------|-------|--------|
| Certificates per registered domain | 50 per week | Individual certs for pan + nyx are well within limits |
| Duplicate certificates | 5 per week | Avoid unnecessary re-issuance |
| Failed validations | 5 per hour | DNS propagation failures count against this |

## Comparison with Titania Model

| Aspect | Titania (Current) | Centralized (This Document) |
|--------|-------------------|-----------------------------|
| Certbot location | On the host itself | OCI free host |
| Namecheap credentials | On the host | Only on OCI host |
| Cert delivery | Direct to HAProxy | Via OCI Vault → Ansible |
| Renewal hook | Docker HAProxy reload | OCI Vault upload |
| Distribution | N/A (local only) | Ansible cron on controller |
| Environments served | Ouranos sandbox only | All environments |
| Service reload | `docker compose kill -s HUP` | `systemctl reload` per host_vars |

Titania can remain self-contained (it's working) or migrate to this centralized model later.
## Verification

### OCI Free Host

```bash
# Check certbot managed certificates
certbot certificates --config-dir /srv/certbot/config

# Dry-run renewal
certbot renew --config-dir /srv/certbot/config \
    --work-dir /srv/certbot/work \
    --logs-dir /srv/certbot/logs \
    --dry-run

# Check systemd timer
systemctl status certbot-renew.timer
```

### OCI Vault

```bash
# List certificate secrets
oci vault secret list \
    --compartment-id "$COMPARTMENT_ID" \
    --vault-id "$VAULT_ID" \
    --query 'data[*].{"name":"secret-name","version-expiry":"time-of-current-version-expiry"}' \
    --output table
```

### Target Hosts

```bash
# Verify issuer is Let's Encrypt
openssl x509 -noout -issuer -in /path/to/cert.pem

# Check expiry
openssl x509 -noout -enddate -in /path/to/cert.pem

# Test TLS connection (stdin from /dev/null so s_client exits after the handshake)
openssl s_client -connect pan.helu.ca:8555 </dev/null 2>/dev/null \
    | openssl x509 -noout -issuer -dates
```

### Prometheus

```promql
# All certs valid
ssl_certificate_valid == 1

# Days until expiry
ssl_certificate_expiry_seconds / 86400

# Certs expiring within 30 days
ssl_certificate_expiry_seconds < 2592000
```

## Related Documentation

- `docs/cerbot.md`: Titania certbot deployment (DNS-01 with Namecheap)
- `docs/ansible.md`: OCI Vault secret lookup patterns and vault organization
- `ansible/certbot/deploy.yml`: Certbot deployment playbook (base pattern)
- `ansible/certbot/renewal-hook.sh.j2`: Renewal hook template (Titania/HAProxy variant)
- `ansible/certbot/cert-metrics.sh.j2`: Prometheus metrics script template
- `ansible/certbot/namecheap.ini.j2`: Namecheap credentials template
- `ansible/prometheus/alert_rules.yml.j2`: Prometheus alert rules