Terraform Practices & Patterns

This document describes the Terraform design philosophy, patterns, and practices used across our infrastructure. The audience includes LLMs assisting with development, new team members, and existing team members seeking a reference.

Design Philosophy

Incus-First Infrastructure

Incus containers form the foundational layer of all environments. Management and monitoring infrastructure (Prospero, Titania) must exist before application hosts. This is a critical dependency that must be explicitly codified.

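Why? Terraform isn't magic. It orders resources only by the references it can see between them; when an application host never references a monitoring host's attributes, Terraform sees no relationship and may create both in parallel, which can cause race conditions or failed deployments. Always use explicit depends_on for critical infrastructure chains.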

# Example: Application host depends on monitoring infrastructure
resource "incus_instance" "app_host" {
  # ...
  depends_on = [incus_instance.uranian_hosts["prospero"]]
}

Explicit Dependencies

Never rely solely on implicit resource ordering for critical infrastructure. Codify dependencies explicitly to:

  • ✔ Prevent race conditions during parallel applies
  • ✔ Document architectural relationships in code
  • ✔ Ensure consistent deployment ordering across environments
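
As an illustration, the sketch below contrasts a dependency Terraform can infer on its own with one that has to be declared. It is a fragment that assumes the uranian_hosts resource exists elsewhere in the configuration. The incus_project.sandbox resource, the host name, and the image string are assumptions made for the example, not names taken from the repository.

```hcl
# Hypothetical project resource, used only to show an implicit dependency.
resource "incus_project" "sandbox" {
  name = "sandbox"
}

resource "incus_instance" "app_host" {
  name  = "app-host"            # hypothetical host name
  image = "images:ubuntu/24.04" # assumed image reference

  # Implicit: referencing an attribute adds a dependency edge automatically,
  # so the project is always created before this instance.
  project = incus_project.sandbox.name

  # Explicit: nothing in this block references Prospero, so the ordering
  # must be declared or Terraform may create both hosts in parallel.
  depends_on = [incus_instance.uranian_hosts["prospero"]]
}
```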

Repository Strategy

Agathos (Sandbox)

Agathos is the Sandbox repository: isolated, safe for external demos, and backed by local state.

| Aspect | Decision |
| --- | --- |
| Purpose | Evaluation, demos, pattern experimentation, new software testing |
| State | Local (no remote backend) |
| Secrets | No production credentials or references |
| Security | Safe to use on external infrastructure for demos |

Production Repository (Separate)

A separate repository manages Dev, UAT, and Prod environments:

terraform/
├── modules/incus_host/     # Reusable container module
└── environments/
    ├── dev/                # Local Incus only
    └── prod/               # OCI + Incus (parameterized via tfvars)

| Aspect | Decision |
| --- | --- |
| State | PostgreSQL backend on eris.helu.ca:6432 with SSL |
| Schemas | Separate per environment: dev, uat, prod |
| UAT/Prod | Parameterized twins via -var-file |

Module Design

When to Extract a Module

A pattern is a good module candidate when it meets these criteria:

| Criterion | Description |
| --- | --- |
| Reuse | Pattern used across multiple environments (Sandbox, Dev, UAT, Prod) |
| Stable Interface | Inputs/outputs won't change frequently |
| Testable | Can validate module independently before promotion |
| Encapsulates Complexity | Hides dynamic blocks, for_each, cloud-init generation |

When NOT to Extract

  • Single-use patterns
  • Tightly coupled to specific environment
  • Adds indirection without measurable benefit

The incus_host Module

The standard container provisioning pattern extracted from Agathos:

Inputs:

  • hosts — Map of host definitions (name, role, image, devices, config)
  • project — Incus project name
  • profile — Incus profile name
  • cloud_init_template — Cloud-init configuration template
  • ssh_key_path — Path to SSH authorized keys
  • depends_on_resources — Explicit dependencies for infrastructure ordering

Outputs:

  • host_details — Name, IPv4, role, description for each host
  • inventory — Documentation reference for DHCP/DNS provisioning
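
To make the interface concrete, here is a minimal sketch of how the module's variables might be declared. Only the variable names come from the list above; every type, default, and attribute inside the hosts object is an assumption, and the optional() defaults require Terraform 1.3 or later.

```hcl
# modules/incus_host/variables.tf (sketch; types and defaults are assumptions)

variable "hosts" {
  description = "Map of host definitions, keyed by host name"
  type = map(object({
    role        = string
    description = optional(string, "")
    image       = string
    config      = optional(map(string), {})
    devices = optional(list(object({
      name       = string
      type       = string
      properties = map(string)
    })), [])
  }))
}

variable "project" {
  description = "Incus project name"
  type        = string
}

variable "profile" {
  description = "Incus profile name"
  type        = string
}

variable "cloud_init_template" {
  description = "Cloud-init configuration template"
  type        = string
}

variable "ssh_key_path" {
  description = "Path to SSH authorized keys"
  type        = string
}

variable "depends_on_resources" {
  description = "Resources that must exist before any host is created"
  type        = any # a tuple of unrelated resource objects has no single element type
  default     = []
}
```

Typing hosts as a map of objects lets terraform validate catch a misspelled attribute before the module is promoted to another environment.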

Environment Strategy

Environment Purposes

| Environment | Purpose | Infrastructure |
| --- | --- | --- |
| Sandbox | Evaluation, demos, pattern experimentation | Local Incus only |
| Dev | Integration testing, container builds, security testing | Local Incus only |
| UAT | User acceptance testing, bug resolution | OCI + Incus (hybrid) |
| Prod | Production workloads | OCI + Incus (hybrid) |

Parameterized Twins (UAT/Prod)

UAT and Prod are architecturally identical. Use a single environment directory with variable files:

# UAT deployment
terraform apply -var-file=uat.tfvars

# Prod deployment  
terraform apply -var-file=prod.tfvars

Key differences in tfvars:

  • Hostnames and DNS domains
  • Resource sizing (CPU, memory limits)
  • OCI compartment IDs
  • Credential references
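
A variable file covering those four categories could look like the sketch below. Every variable name and value here is a placeholder invented for illustration; the real names depend on what variables.tf declares.

```hcl
# uat.tfvars (sketch; all names and values are hypothetical placeholders)

# Hostnames and DNS domains
dns_domain = "uat.example.internal"

# Resource sizing
host_limits = {
  cpu    = "2"
  memory = "4GiB"
}

# OCI compartment
oci_compartment_id = "ocid1.compartment.oc1..exampleuat"

# A reference to where the credential lives, never the secret itself
credential_ref = "uat/terraform"
```

A prod.tfvars twin would set the same keys to production values, so a diff between the two files shows exactly how the environments differ.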

State Management

Sandbox (Agathos)

Local state is acceptable because:

  • Environment is ephemeral
  • Single-user workflow
  • No production secrets to protect
  • Safe for external demos

Production Environments

PostgreSQL backend on eris.helu.ca:

terraform {
  backend "pg" {
    conn_str    = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"
    schema_name = "dev" # or "uat", "prod"
  }
}

Connection requirements:

  • Port 6432 (pgBouncer)
  • SSL with sslmode=verify-full
  • Credentials via environment variables (PGUSER, PGPASSWORD)
  • Separate schema per environment for isolation
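
In Terraform, a backend block cannot reference input variables, so the schema cannot be driven by a tfvars file. One possible arrangement, sketched below, leaves schema_name out of the block and supplies it at init time through partial backend configuration. The file name uat.pg.tfbackend is hypothetical.

```hcl
# backend.tf (sketch)
terraform {
  backend "pg" {
    conn_str = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"
    # schema_name is intentionally omitted here and supplied at init time.
  }
}

# Contents of a hypothetical uat.pg.tfbackend file, kept next to uat.tfvars:
#
#   schema_name = "uat"
#
# Selecting the schema when switching environments:
#
#   terraform init -reconfigure -backend-config=uat.pg.tfbackend
#
# PGUSER and PGPASSWORD are still read from the environment, as listed above.
```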

Integration Points

Terraform → DHCP/DNS

The agathos_inventory output provides host information for DHCP/DNS provisioning:

  1. Terraform creates containers with cloud-init
  2. agathos_inventory output includes hostnames and IPs
  3. MAC addresses registered in DHCP server
  4. DHCP server creates DNS entries (hostname.incus domain)
  5. Ansible uses DNS names for host connectivity
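
Step 3 needs MAC addresses, which the agathos_inventory output documented later in this file does not include. A companion output along the following lines could supply them. The output name dhcp_reservations is hypothetical, and the sketch assumes the Incus provider version in use exports mac_address on incus_instance.

```hcl
# Hypothetical companion output; not part of the documented agathos_inventory.
output "dhcp_reservations" {
  description = "Hostname, MAC, and IPv4 per container for DHCP registration"
  value = {
    for name, instance in incus_instance.uranian_hosts : name => {
      hostname = "${name}.incus"
      mac      = instance.mac_address # assumed to be exported by the provider
      ipv4     = instance.ipv4_address
    }
  }
}
```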

Terraform → Ansible

Ansible does not consume Terraform outputs directly. Instead:

  1. Terraform provisions containers
  2. Incus DNS resolution makes each container reachable as hostname.incus
  3. Ansible inventory uses static DNS names
  4. sandbox_up.yml configures DNS resolution on the hypervisor

# Ansible inventory uses DNS names, not Terraform outputs
ubuntu:
  hosts:
    oberon.incus:
    ariel.incus:
    prospero.incus:

Terraform → Bash Scripts

The ssh_key_update.sh script shows the intended pattern: shell scripts read agathos_inventory as JSON and extract fields with jq:

# Emit "hostname ip" pairs, then record host keys for both the IP and the DNS name
terraform output -json agathos_inventory | jq -r \
  '.uranian_hosts.hosts | to_entries[] | "\(.key) \(.value.ipv4)"' | \
  while read -r hostname ip; do
    ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts
    ssh-keyscan -H "$hostname.incus" >> ~/.ssh/known_hosts
  done

Promotion Workflow

All infrastructure changes flow through this pipeline:

Agathos (Sandbox)
    ↓ Validate pattern works
    ↓ Extract to module if reusable
Dev
    ↓ Integration testing
    ↓ Container builds
    ↓ Security testing
UAT
    ↓ User acceptance testing
    ↓ Bug fixes return to Dev
    ↓ Delete environment, test restore
Prod
    ↓ Deploy from tested artifacts

Critical: Nothing starts in Prod. Every change originates in Agathos, is validated through the pipeline, and only then deployed to production.

Promotion Includes

When promoting Terraform changes, always update the corresponding:

  • Ansible playbooks and templates
  • Service documentation in /docs/services/
  • Host variables, if new services were added

Output Conventions

agathos_inventory

The primary output for documentation and DNS integration:

output "agathos_inventory" {
  description = "Host inventory for documentation and DHCP/DNS provisioning"
  value = {
    uranian_hosts = {
      hosts = {
        for name, instance in incus_instance.uranian_hosts : name => {
          name             = instance.name
          ipv4             = instance.ipv4_address
          role             = local.uranian_hosts[name].role
          description      = local.uranian_hosts[name].description
          security_nesting = lookup(local.uranian_hosts[name].config, "security.nesting", false)
        }
      }
    }
  }
}

Purpose:

  • Update sandbox.html documentation
  • Reference for DHCP server MAC/IP registration
  • DNS entry creation via DHCP

Layered Configuration

Single Config with Conditional Resources

Avoid multiple separate Terraform configurations. Use one config with conditional resources:

environments/prod/
├── main.tf              # Incus project, profile, images (always)
├── incus_hosts.tf       # Module call for Incus containers (always)
├── oci_resources.tf     # OCI compute (conditional)
├── variables.tf
├── dev.tfvars           # Dev: enable_oci = false
├── uat.tfvars           # UAT: enable_oci = true
└── prod.tfvars          # Prod: enable_oci = true

variable "enable_oci" {
  description = "Enable OCI resources (false for Dev, true for UAT/Prod)"
  type        = bool
  default     = false
}

resource "oci_core_instance" "hosts" {
  for_each = var.enable_oci ? var.oci_hosts : {}
  # ...
}
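
The conditional above relies on var.oci_hosts being a map. A sketch of how that variable and a dependent output might be declared follows. The attribute names inside the object and the output name oci_host_ids are assumptions for illustration.

```hcl
# variables.tf (sketch; object attributes are assumed, not from the repository)
variable "oci_hosts" {
  description = "OCI compute instances, keyed by host name"
  type = map(object({
    shape               = string
    availability_domain = string
  }))
  default = {} # Dev can leave this unset
}

# Iterating over the resource keeps outputs valid when enable_oci is false:
# the for expression simply yields an empty map.
output "oci_host_ids" {
  description = "OCI instance IDs by host name (empty when OCI is disabled)"
  value       = { for name, host in oci_core_instance.hosts : name => host.id }
}
```

Because the resource uses for_each rather than count, downstream references never need an index guard such as [0]; an empty map propagates cleanly.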

Best Practices Summary

| Practice | Rationale |
| --- | --- |
| ✔ Explicit depends_on for critical chains | Terraform isn't magic |
| ✔ Local map for host definitions | Single source of truth, easy iteration |
| ✔ for_each over count | Stable resource addresses |
| ✔ dynamic blocks for optional devices | Clean, declarative device configuration |
| ✔ Merge base config with overrides | DRY principle for common settings |
| ✔ Separate tfvars for environment twins | Minimal duplication, clear parameterization |
| ✔ Document module interfaces | Enable promotion across environments |
| ✔ Never start in Prod | Always validate through pipeline |
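
Several of these practices (local map, for_each, dynamic blocks, merged config) combine naturally in one resource. The following is a minimal sketch under stated assumptions: the example_app host, all config keys and values, the image reference, and the device definition are invented for illustration, and only the uranian_hosts and prospero names come from this document.

```hcl
locals {
  # Settings shared by every container (illustrative value).
  base_config = {
    "boot.autostart" = "true"
  }

  # Single source of truth for host definitions.
  uranian_hosts = {
    prospero = {
      role        = "monitoring"
      description = "Monitoring infrastructure"
      image       = "images:ubuntu/24.04"
      config      = {}
      devices     = []
    }
    example_app = { # hypothetical host
      role        = "application"
      description = "Example application host"
      image       = "images:ubuntu/24.04"
      config      = { "security.nesting" = "true" }
      devices = [
        {
          name = "http"
          type = "proxy"
          properties = {
            listen  = "tcp:0.0.0.0:8080"
            connect = "tcp:127.0.0.1:80"
          }
        }
      ]
    }
  }
}

resource "incus_instance" "uranian_hosts" {
  # for_each keys each instance by host name, so removing one host
  # does not shift the addresses of the others as count would.
  for_each = local.uranian_hosts

  name  = each.key
  image = each.value.image

  # Per-host settings override the shared base.
  config = merge(local.base_config, each.value.config)

  # One device block per entry; hosts with an empty list get none.
  dynamic "device" {
    for_each = each.value.devices
    content {
      name       = device.value.name
      type       = device.value.type
      properties = device.value.properties
    }
  }
}
```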