Let's Encrypt for Internal Production Hosts

Centralized certificate management using an OCI free-tier host running certbot with DNS-01 validation, OCI Vault for secure storage, and Ansible-driven distribution to internal production hosts.

Overview

| Component | Value |
| --- | --- |
| Certificate Authority | Let's Encrypt |
| Validation | DNS-01 via certbot-dns-namecheap |
| Certificate Generator | OCI free-tier host (certbot) |
| Certificate Store | OCI Vault (software-protected keys, free) |
| Distribution | Ansible playbook on controller (cron) |
| Target Hosts | pan.helu.ca, nyx.helu.ca (extensible) |
| Monitoring | Prometheus metrics + Grafana dashboard |

Problem

Self-signed and private certificates on internal production hosts (pan.helu.ca, nyx.helu.ca) cause persistent issues:

  • Clients must disable TLS verification or trust custom CAs
  • Service-to-service HTTPS (e.g., LobeChat → MinIO S3 at https://pan.helu.ca:8555) requires trust workarounds
  • Certificate management is manual and inconsistent across environments
  • No automated renewal or expiry monitoring

Architecture

┌─────────────────────────────────┐
│  OCI Free Host                  │
│  - certbot + dns-namecheap      │
│  - systemd timer (twice daily)  │
│  - deploy hook: upload certs to │
│    OCI Vault via oci-cli        │
└──────────────┬──────────────────┘
               │ oci vault secret update-base64
               ▼
┌─────────────────────────────────┐
│  OCI Vault                      │
│  ouranos-certificates           │
│  ├── pan-helu-ca-fullchain      │
│  ├── pan-helu-ca-privkey        │
│  ├── nyx-helu-ca-fullchain      │
│  ├── nyx-helu-ca-privkey        │
│  └── (future domains)           │
└──────────────┬──────────────────┘
               │ community.oci.oci_secret lookup
               ▼
┌─────────────────────────────────┐
│  Ansible Controller             │
│  (restricted access)            │
│  - cron: cert-distribute.yml    │
│  - pulls certs from OCI Vault   │
│  - deploys to target hosts      │
│  - updates Prometheus metrics   │
└──────────┬──────────┬───────────┘
           │          │
      ┌────▼───┐  ┌───▼────┐
      │  pan   │  │  nyx   │
      │ .helu  │  │ .helu  │
      │ .ca    │  │ .ca    │
      └────┬───┘  └───┬────┘
           │          │
           ▼          ▼
      Prometheus cert metrics
           → Grafana dashboard

Design Decisions

| Decision | Rationale |
| --- | --- |
| Individual certs (not wildcard *.helu.ca) | Limits blast radius; each host has only its own cert |
| Namecheap API keys only on OCI host | Production hosts never hold DNS API credentials |
| Pull model (controller pulls from vault) | OCI host doesn't need SSH access to production |
| OCI Vault as distribution bus | Aligns with existing OCI Vault pattern in docs/ansible.md |
| Software-protected keys | Free tier, no per-secret or per-version charges |

Why DNS-01

DNS-01 validation is the correct choice for internal hosts. The challenge is validated via DNS TXT records managed through the Namecheap API — the target hosts (pan, nyx) do not need to be reachable from the internet on port 80 or 443.
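To make the TXT record concrete: in DNS-01 the CA looks up a record named `_acme-challenge.<domain>`, and a wildcard name is validated at its base name. The sketch below only derives that record name; the function name is hypothetical and the `dig` line is an illustrative way to watch the record during a certbot run.

```shell
#!/bin/bash
# Print the DNS name that holds the ACME DNS-01 challenge TXT record.
acme_challenge_record() {
  # A wildcard such as *.example.com is validated at the base name.
  local domain="${1#\*.}"
  printf '_acme-challenge.%s\n' "${domain}"
}

# Hypothetical usage while certbot is waiting for propagation:
# dig +short TXT "$(acme_challenge_record pan.helu.ca)"
```

Because only this TXT record has to be publicly resolvable, nothing on pan or nyx is exposed during validation.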

This is the same proven approach used on Titania for *.ouranos.helu.ca (see docs/cerbot.md).

Implementation

Phase 1: OCI Free Host — Certbot

Deploy certbot on the OCI free host using the same pattern as ansible/certbot/deploy.yml, stripped down for minimal footprint (no HAProxy, no Docker).

Resource requirements: Trivial. Certbot runs briefly twice per day. The OCI free host's constrained resources are more than sufficient.

Certbot Setup

  1. Python virtualenv with certbot and certbot-dns-namecheap
  2. Namecheap credentials in /srv/certbot/credentials/namecheap.ini (template: ansible/certbot/namecheap.ini.j2)
  3. Individual certificate requests per domain:
certbot certonly \
  --non-interactive \
  --agree-tos \
  --email webmaster@helu.ca \
  --authenticator dns-namecheap \
  --dns-namecheap-credentials /srv/certbot/credentials/namecheap.ini \
  --dns-namecheap-propagation-seconds 120 \
  --config-dir /srv/certbot/config \
  --work-dir /srv/certbot/work \
  --logs-dir /srv/certbot/logs \
  --cert-name pan.helu.ca \
  -d pan.helu.ca
  4. Systemd timer for renewal (same pattern as Titania: twice daily with RandomizedDelaySec=3600)
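As a rough illustration of step 4, the function below prints a timer unit that fires twice daily with up to an hour of random delay. Only the twice-daily cadence, RandomizedDelaySec=3600, and the unit name certbot-renew.timer come from this document; the OnCalendar times and Persistent=true are assumptions, and a matching certbot-renew.service that runs `certbot renew` is assumed to exist.

```shell
#!/bin/bash
# Sketch: print a systemd timer unit for twice-daily renewal with jitter.
render_certbot_timer() {
  cat <<'EOF'
[Unit]
Description=Twice-daily certbot renewal

[Timer]
# Assumed schedule: midnight and noon, each shifted by a random delay.
OnCalendar=*-*-* 00,12:00:00
RandomizedDelaySec=3600
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Hypothetical install step (as root on the OCI host):
# render_certbot_timer > /etc/systemd/system/certbot-renew.timer
# systemctl daemon-reload && systemctl enable --now certbot-renew.timer
```

By default a timer activates the service unit with the same base name, which is why no explicit Unit= line appears in the sketch.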

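Vault Upload Deploy Hook

Replaces the HAProxy reload hook used on Titania. The script must run as a certbot deploy hook (--deploy-hook, or a file in renewal-hooks/deploy/ under the config directory), because certbot sets RENEWED_LINEAGE only for deploy hooks, not for --post-hook. After each successful renewal, it base64-encodes the certificate files and uploads them to OCI Vault as new secret versions: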

#!/bin/bash
# Deploy hook: upload renewed certificates to OCI Vault.
# certbot exports RENEWED_LINEAGE (path of the renewed lineage) to deploy hooks.
set -euo pipefail

CERT_DIR="${RENEWED_LINEAGE}"
CERT_NAME="${RENEWED_LINEAGE##*/}"
VAULT_ID="ocid1.vault.oc1..."       # ouranos-certificates vault
COMPARTMENT_ID="ocid1.compartment..."

# Derive OCI secret name from cert name (pan.helu.ca → pan-helu-ca)
SECRET_PREFIX=$(echo "${CERT_NAME}" | tr '.' '-')

# Upload one PEM file as a new version of an existing secret.
# The secret must already exist in the vault; this only adds a version.
upload_secret() {
  local secret_name="$1" pem_file="$2" secret_id
  secret_id="$(oci vault secret list \
    --compartment-id "${COMPARTMENT_ID}" \
    --vault-id "${VAULT_ID}" \
    --name "${secret_name}" \
    --query 'data[0].id' --raw-output)"
  oci vault secret update-base64 \
    --secret-id "${secret_id}" \
    --secret-content-content "$(base64 -w0 "${pem_file}")"
}

upload_secret "${SECRET_PREFIX}-fullchain" "${CERT_DIR}/fullchain.pem"
upload_secret "${SECRET_PREFIX}-privkey" "${CERT_DIR}/privkey.pem"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Uploaded ${CERT_NAME} to OCI Vault"
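One failure mode worth guarding against is the secret lookup returning nothing (for example when the secret has not been created yet), which would otherwise pass an empty ID to the update call. A minimal guard, assuming vault secret OCIDs begin with `ocid1.vaultsecret.` (an assumption about the OCID format; the function name is hypothetical), might look like this:

```shell
#!/bin/bash
# Return success only if the argument looks like an OCI Vault secret OCID.
# Assumption: secret OCIDs start with "ocid1.vaultsecret.".
is_secret_ocid() {
  case "$1" in
    ocid1.vaultsecret.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical use inside the hook, right after the lookup:
# is_secret_ocid "${secret_id}" || { echo "secret not found" >&2; exit 1; }
```

Failing early keeps the hook's exit status meaningful, so a missing secret shows up in the certbot logs rather than as an opaque CLI error.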

Phase 2: OCI Vault — Certificate Storage

Vault Organization

Extends the existing OCI Vault structure documented in docs/ansible.md:

OCI Compartment: production
├── Vault: ouranos-databases          (existing)
├── Vault: ouranos-services           (existing)
├── Vault: ouranos-integrations       (existing)
└── Vault: ouranos-certificates       (new)
    ├── Secret: pan-helu-ca-fullchain
    ├── Secret: pan-helu-ca-privkey
    ├── Secret: nyx-helu-ca-fullchain
    ├── Secret: nyx-helu-ca-privkey
    └── (future domains follow same pattern)

Naming convention: Domain dots replaced with hyphens, suffixed with -fullchain or -privkey.
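The naming convention can be expressed as a small helper. This is a sketch with a hypothetical function name; it mirrors the `tr '.' '-'` step in the upload hook and the `replace('.', '-')` filter in the distribution playbook.

```shell
#!/bin/bash
# Build the vault secret name for a certificate: dots become hyphens,
# then the part suffix (fullchain or privkey) is appended.
vault_secret_name() {
  local domain="$1" part="$2"
  case "${part}" in
    fullchain|privkey) ;;
    *) echo "unknown part: ${part}" >&2; return 1 ;;
  esac
  printf '%s-%s\n' "$(printf '%s' "${domain}" | tr '.' '-')" "${part}"
}

vault_secret_name pan.helu.ca fullchain   # prints pan-helu-ca-fullchain
```

Note that the mapping is one-way: a hostname that already contains a hyphen (say my-host.helu.ca) and one that uses a dot in the same position (my.host.helu.ca) would produce the same secret name.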

Secret format: Base64-encoded PEM content. OCI Vault secrets support versioning natively — every renewal creates a new version, providing automatic rollback capability and audit trail.

Cost: OCI Vault with software-protected keys is free. No per-secret or per-version charges. Store certs for as many domains as needed at zero cost.

IAM Policies

# OCI free host: can write certs to the certificates vault
Allow dynamic-group certbot-host to manage secret-family in compartment production
  where target.vault.id = '<ouranos-certificates vault OCID>'

# Ansible controller: can read certs from the certificates vault
# (secret contents are secret-bundles, which "read secrets" alone does not cover;
# secret-family includes secrets, secret-versions, and secret-bundles)
Allow dynamic-group ansible-controller to read secret-family in compartment production
  where target.vault.id = '<ouranos-certificates vault OCID>'

Phase 3: Ansible Controller — Distribution

Distribution Playbook

A cert-distribute.yml playbook runs on the Ansible controller via cron. It uses community.oci.oci_secret lookups (same pattern as existing OCI Vault usage):

---
- name: Distribute Let's Encrypt Certificates
  hosts: certbot_targets
  vars:
    oci_compartment_id: "{{ vault_oci_compartment_id }}"
    oci_certificates_vault_id: "{{ vault_oci_certificates_vault_id }}"

  tasks:
    - name: Retrieve fullchain from OCI Vault
      ansible.builtin.set_fact:
        cert_fullchain: >-
          {{ lookup('community.oci.oci_secret',
             certbot_cert_name | replace('.', '-') ~ '-fullchain',
             compartment_id=oci_compartment_id,
             vault_id=oci_certificates_vault_id) | b64decode }}

    - name: Retrieve private key from OCI Vault
      ansible.builtin.set_fact:
        cert_privkey: >-
          {{ lookup('community.oci.oci_secret',
             certbot_cert_name | replace('.', '-') ~ '-privkey',
             compartment_id=oci_compartment_id,
             vault_id=oci_certificates_vault_id) | b64decode }}
      no_log: true  # keep key material out of task output and syslog

    - name: Deploy certificate (HAProxy combined format)
      become: true
      ansible.builtin.copy:
        content: "{{ cert_fullchain }}{{ cert_privkey }}"
        dest: "{{ haproxy_cert_path }}"
        owner: "{{ certbot_user }}"
        group: "{{ haproxy_group }}"
        mode: '0640'
      when: certbot_cert_format | default('haproxy_combined') == 'haproxy_combined'
      notify: reload services

    - name: Deploy certificate (separate files)
      become: true
      ansible.builtin.copy:
        content: "{{ item.content }}"
        dest: "{{ item.dest }}"
        owner: "{{ certbot_user }}"
        group: "{{ certbot_group }}"
        mode: '0640'
      loop:
        - { content: "{{ cert_fullchain }}", dest: "{{ certbot_cert_dir }}/fullchain.pem" }
        - { content: "{{ cert_privkey }}", dest: "{{ certbot_cert_dir }}/privkey.pem" }
      loop_control:
        label: "{{ item.dest }}"  # show only the path, not the PEM content, per item
      when: certbot_cert_format | default('haproxy_combined') == 'separate'
      notify: reload services

    - name: Update certificate metrics
      become: true
      ansible.builtin.command: "{{ certbot_metrics_script }}"
      changed_when: false

  handlers:
    - name: reload services
      become: true
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: reloaded
      loop: "{{ certbot_reload_services | default([]) }}"
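The haproxy_combined format in the playbook is a plain concatenation of the full chain and the private key. One detail that is easy to miss is that each part has to end with a newline, otherwise the END and BEGIN markers fuse into one line and the PEM no longer parses. The shell sketch below shows that guard outside Ansible; the function name is hypothetical, and whether the decoded values in the playbook keep their trailing newline depends on how they were stored.

```shell
#!/bin/bash
# Concatenate PEM files to stdout, making sure each one ends with a newline
# so the "-----END ...-----" and "-----BEGIN ...-----" markers never fuse.
combine_pem() {
  local f
  for f in "$@"; do
    cat "${f}" || return 1
    # Command substitution strips a trailing newline, so a non-empty result
    # here means the file's last byte is something else.
    [ -z "$(tail -c 1 "${f}")" ] || printf '\n'
  done
}

# Hypothetical manual use on a host:
# combine_pem fullchain.pem privkey.pem > /etc/haproxy/certs/nyx.helu.ca.pem
```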

Host Variables (Example)

# host_vars/pan.helu.ca.yml
certbot_cert_name: pan.helu.ca
certbot_cert_format: separate           # MinIO expects separate files
certbot_cert_dir: /etc/minio/certs
certbot_user: minio
certbot_group: minio
certbot_reload_services:
  - minio

certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter

# host_vars/nyx.helu.ca.yml (if running HAProxy)
certbot_cert_name: nyx.helu.ca
certbot_cert_format: haproxy_combined
haproxy_cert_path: /etc/haproxy/certs/nyx.helu.ca.pem
certbot_user: certbot
haproxy_group: haproxy
certbot_reload_services:
  - haproxy

certbot_metrics_script: /srv/certbot/hooks/cert-metrics.sh
prometheus_node_exporter_text_directory: /var/lib/prometheus/node-exporter

Cron Schedule

# Run cert distribution every 6 hours on the Ansible controller
0 */6 * * * cd /path/to/ansible && ansible-playbook cert-distribute.yml --limit certbot_targets 2>&1 | logger -t cert-distribute

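Let's Encrypt certificates are valid for 90 days, and certbot renews them once fewer than 30 days remain. With a 6-hour distribution cadence, a renewed certificate reaches its target host within about six hours of being uploaded to the vault, which leaves weeks of margin before the old certificate expires.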
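The timing argument can be checked with simple arithmetic. The sketch below assumes the figures given in this document (renewal attempts every 12 hours with up to 1 hour of random delay, distribution every 6 hours, renewal when 30 days remain) and ignores run time and failed runs; the function names are hypothetical.

```shell
#!/bin/bash
# Worst-case hours between a certificate becoming due for renewal and the
# renewed copy reaching a target host.
worst_case_lag_hours() {
  local renew_interval_h="$1" jitter_h="$2" distribute_interval_h="$3"
  echo $(( renew_interval_h + jitter_h + distribute_interval_h ))
}

# Hours of validity left on the old certificate once the new one is deployed.
remaining_margin_hours() {
  local renew_window_days="$1" lag_h="$2"
  echo $(( renew_window_days * 24 - lag_h ))
}

lag="$(worst_case_lag_hours 12 1 6)"
echo "worst-case lag: ${lag} h"                               # 19 h
echo "margin left: $(remaining_margin_hours 30 "${lag}") h"   # 701 h
```

Under these assumptions, even several consecutive failed distribution runs would leave the old certificate valid for well over three weeks.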

Phase 4: Monitoring

Prometheus Metrics

Deploy the cert-metrics.sh script (existing template: ansible/certbot/cert-metrics.sh.j2) on each target host. After each distribution, it writes metrics to the node-exporter textfile directory:

| Metric | Description |
| --- | --- |
| ssl_certificate_expiry_timestamp | Unix timestamp when cert expires |
| ssl_certificate_expiry_seconds | Seconds until cert expires |
| ssl_certificate_valid | 1 if valid, 0 if expired/missing |

Labels: domain, issuer
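For orientation, the output written to the textfile directory could look like what the following sketch prints. It is not the content of cert-metrics.sh.j2, which is not reproduced here; the function name and argument order are hypothetical, and the case of a missing certificate is not handled.

```shell
#!/bin/bash
# Print the three certificate metrics in Prometheus text exposition format.
# Arguments: domain, issuer, expiry (Unix seconds), now (Unix seconds).
render_cert_metrics() {
  local domain="$1" issuer="$2" expiry="$3" now="$4"
  local labels="domain=\"${domain}\",issuer=\"${issuer}\""
  local remaining=$(( expiry - now ))
  local valid=0
  if [ "${remaining}" -gt 0 ]; then
    valid=1
  fi
  printf 'ssl_certificate_expiry_timestamp{%s} %s\n' "${labels}" "${expiry}"
  printf 'ssl_certificate_expiry_seconds{%s} %s\n' "${labels}" "${remaining}"
  printf 'ssl_certificate_valid{%s} %s\n' "${labels}" "${valid}"
}

# Hypothetical usage: write to a temp file, then rename, so node-exporter
# never reads a half-written file (the .prom file name is an assumption).
# dir=/var/lib/prometheus/node-exporter
# render_cert_metrics pan.helu.ca "Let's Encrypt" "${expiry}" "$(date +%s)" \
#   > "${dir}/ssl_cert.prom.tmp" && mv "${dir}/ssl_cert.prom.tmp" "${dir}/ssl_cert.prom"
```

Since ssl_certificate_expiry_seconds is a snapshot taken when the script runs, it is only as fresh as the last distribution run; the timestamp metric does not have that limitation.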

Alert Rules

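Add to ansible/prometheus/alert_rules.yml.j2. Because that file is a Jinja2 template, the Prometheus template expressions in the annotations ({{ $labels.domain }}, {{ $value | humanizeDuration }}) need to be wrapped in {% raw %} ... {% endraw %}, unless the template already handles this: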

- name: ssl_alerts
  rules:
    - alert: SSLCertificateExpiringSoon
      expr: ssl_certificate_expiry_seconds < 604800
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "SSL certificate expiring soon"
        description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"

    - alert: SSLCertificateExpired
      expr: ssl_certificate_valid == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "SSL certificate expired or missing"
        description: "Certificate for {{ $labels.domain }} is expired or not found"

Grafana Dashboard

Add a certificate status dashboard to ansible/grafana/dashboards/ showing:

  • Certificate validity status (green/red) per domain
  • Days until expiry (gauge per domain)
  • Renewal history (annotation markers from cert version timestamps)
  • Issuer label (confirms Let's Encrypt vs self-signed)

Certificate Format Reference

Different services expect certificates in different formats. The distribution playbook handles this via the certbot_cert_format host variable:

| Format | Services | Files Produced |
| --- | --- | --- |
| haproxy_combined | HAProxy | Single PEM: fullchain + privkey concatenated |
| separate | MinIO, nginx, most services | fullchain.pem + privkey.pem as separate files |

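MinIO expects its certificate and key to be named public.crt and private.key, in ~/.minio/certs by default or in a custom directory passed via --certs-dir. The separate format writes fullchain.pem and privkey.pem, so on MinIO hosts those files must be renamed or linked to the names MinIO looks for.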
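One way to bridge the gap between the file names the separate format writes (fullchain.pem, privkey.pem) and the names MinIO looks for (public.crt, private.key) is a pair of relative symlinks in the certs directory. This is a sketch with a hypothetical function name; writing the MinIO names directly from the playbook would work equally well.

```shell
#!/bin/bash
# Point MinIO's expected file names at the files written by the playbook.
link_minio_certs() {
  local cert_dir="$1"
  # Relative targets keep the links valid if the directory is moved.
  ln -sfn fullchain.pem "${cert_dir}/public.crt"
  ln -sfn privkey.pem "${cert_dir}/private.key"
}

# Hypothetical usage with the directory from host_vars/pan.helu.ca.yml:
# link_minio_certs /etc/minio/certs
```

The links only need to be created once; later distributions overwrite the PEM files in place and the links keep pointing at the current content.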

Let's Encrypt Rate Limits

| Limit | Value | Impact |
| --- | --- | --- |
| Certificates per registered domain | 50 per week | Individual certs for pan + nyx are well within limits |
| Duplicate certificates | 5 per week | Avoid unnecessary re-issuance |
| Failed validations | 5 per hour | DNS propagation failures count against this |

Comparison with Titania Model

| Aspect | Titania (Current) | Centralized (This Document) |
| --- | --- | --- |
| Certbot location | On the host itself | OCI free host |
| Namecheap credentials | On the host | Only on OCI host |
| Cert delivery | Direct to HAProxy | Via OCI Vault → Ansible |
| Renewal hook | Docker HAProxy reload | OCI Vault upload |
| Distribution | N/A (local only) | Ansible cron on controller |
| Environments served | Ouranos sandbox only | All environments |
| Service reload | docker compose kill -s HUP | systemctl reload per host_vars |

Titania can remain self-contained (it's working) or migrate to this centralized model later.

Verification

OCI Free Host

# Check certbot managed certificates
certbot certificates --config-dir /srv/certbot/config

# Dry-run renewal
certbot renew --config-dir /srv/certbot/config \
  --work-dir /srv/certbot/work \
  --logs-dir /srv/certbot/logs \
  --dry-run

# Check systemd timer
systemctl status certbot-renew.timer

OCI Vault

# List certificate secrets
oci vault secret list \
  --compartment-id "$COMPARTMENT_ID" \
  --vault-id "$VAULT_ID" \
  --query 'data[*].{"name":"secret-name","version-expiry":"time-of-current-version-expiry"}' \
  --output table

Target Hosts

# Verify issuer is Let's Encrypt
openssl x509 -noout -issuer -in /path/to/cert.pem

# Check expiry
openssl x509 -noout -enddate -in /path/to/cert.pem

# Test TLS connection
openssl s_client -connect pan.helu.ca:8555 </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -dates

Prometheus

# All certs valid
ssl_certificate_valid == 1

# Days until expiry
ssl_certificate_expiry_seconds / 86400

# Certs expiring within 30 days
ssl_certificate_expiry_seconds < 2592000

References

  • docs/cerbot.md — Titania certbot deployment (DNS-01 with Namecheap)
  • docs/ansible.md — OCI Vault secret lookup patterns and vault organization
  • ansible/certbot/deploy.yml — Certbot deployment playbook (base pattern)
  • ansible/certbot/renewal-hook.sh.j2 — Renewal hook template (Titania/HAProxy variant)
  • ansible/certbot/cert-metrics.sh.j2 — Prometheus metrics script template
  • ansible/certbot/namecheap.ini.j2 — Namecheap credentials template
  • ansible/prometheus/alert_rules.yml.j2 — Prometheus alert rules