# Terraform Practices & Patterns
This document describes the Terraform design philosophy, patterns, and practices used across our infrastructure. The audience includes LLMs assisting with development, new team members, and existing team members seeking a reference.
## Design Philosophy
### Incus-First Infrastructure
Incus containers form the foundational layer of all environments. Management and monitoring infrastructure (Prospero, Titania) must exist before application hosts. This is a **critical dependency** that must be explicitly codified.
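**Why?** Terraform isn't magic. It infers ordering only from references between resources, so resources with no reference between them may be created in parallel or in an unexpected order, which can lead to race conditions or failed deployments. Always use explicit `depends_on` for critical infrastructure chains.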
```hcl
# Example: Application host depends on monitoring infrastructure
resource "incus_instance" "app_host" {
  # ...
  depends_on = [incus_instance.uranian_hosts["prospero"]]
}
```
### Explicit Dependencies
Never rely solely on implicit resource ordering for critical infrastructure. Codify dependencies explicitly to:
- ✔ Prevent race conditions during parallel applies
- ✔ Document architectural relationships in code
- ✔ Ensure consistent deployment ordering across environments
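The difference between implicit and explicit ordering can be sketched as follows. This is a minimal illustration, assuming the Incus provider is already configured; the project name, image alias, and the standalone `prospero` and `oberon` resources are hypothetical and not taken from the Ouranos configuration.

```hcl
# Hypothetical project used only for this illustration.
resource "incus_project" "sandbox" {
  name = "ouranos"
}

# Implicit dependency: referencing incus_project.sandbox.name tells
# Terraform that the project must exist before this instance.
resource "incus_instance" "prospero" {
  name    = "prospero"
  image   = "images:ubuntu/24.04/cloud" # assumed image alias
  project = incus_project.sandbox.name
}

# Explicit dependency: nothing in this block references prospero,
# so the ordering has to be declared with depends_on.
resource "incus_instance" "oberon" {
  name    = "oberon"
  image   = "images:ubuntu/24.04/cloud" # assumed image alias
  project = incus_project.sandbox.name

  depends_on = [incus_instance.prospero]
}
```

Without the `depends_on` line, Terraform would be free to create `prospero` and `oberon` at the same time.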
## Repository Strategy
### Ouranos (Sandbox)
Ouranos is the **Sandbox repository**: isolated, safe for external demos, and backed by local state.
| Aspect | Decision |
|--------|----------|
| Purpose | Evaluation, demos, pattern experimentation, new software testing |
| State | Local (no remote backend) |
| Secrets | No production credentials or references |
| Security | Safe to use on external infrastructure for demos |
### Production Repository (Separate)
A separate repository manages Dev, UAT, and Prod environments:
```
terraform/
├── modules/incus_host/     # Reusable container module
├── environments/
│   ├── dev/                # Local Incus only
│   └── prod/               # OCI + Incus (parameterized via tfvars)
```
| Aspect | Decision |
|--------|----------|
| State | PostgreSQL backend on `eris.helu.ca:6432` with SSL |
| Schemas | Separate per environment: `dev`, `uat`, `prod` |
| UAT/Prod | Parameterized twins via `-var-file` |
## Module Design
### When to Extract a Module
A pattern is a good module candidate when it meets these criteria:
| Criterion | Description |
|-----------|-------------|
| **Reuse** | Pattern used across multiple environments (Sandbox, Dev, UAT, Prod) |
| **Stable Interface** | Inputs/outputs won't change frequently |
| **Testable** | Can validate module independently before promotion |
| **Encapsulates Complexity** | Hides `dynamic` blocks, `for_each`, cloud-init generation |
### When NOT to Extract
- Single-use patterns
- Tightly coupled to a specific environment
- Adds indirection without measurable benefit
### The `incus_host` Module
The standard container provisioning pattern extracted from Ouranos:
**Inputs:**
- `hosts` — Map of host definitions (name, role, image, devices, config)
- `project` — Incus project name
- `profile` — Incus profile name
- `cloud_init_template` — Cloud-init configuration template
- `ssh_key_path` — Path to SSH authorized keys
- `depends_on_resources` — Explicit dependencies for infrastructure ordering
**Outputs:**
- `host_details` — Name, IPv4, role, description for each host
- `inventory` — Documentation reference for DHCP/DNS provisioning
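A call to the module might look like the sketch below. The input names come from the list above, but every value is hypothetical, and the shapes assumed here (a path for `cloud_init_template`, a list for `depends_on_resources`, the keys inside each host definition) are guesses rather than the module's documented interface.

```hcl
# Hypothetical project resource, included so the example has
# something to depend on.
resource "incus_project" "sandbox" {
  name = "ouranos"
}

module "uranian_hosts" {
  # Relative path assumes the call lives in environments/<env>/.
  source = "../../modules/incus_host"

  project = incus_project.sandbox.name
  profile = "default" # assumed profile name

  # Assumed shape of a host definition; the real module may differ.
  hosts = {
    prospero = {
      role    = "monitoring"
      image   = "images:ubuntu/24.04/cloud"
      devices = {}
      config  = {}
    }
  }

  # Assumed to be a path to a template file.
  cloud_init_template = "${path.module}/templates/cloud-init.yaml.tftpl"
  ssh_key_path        = "~/.ssh/authorized_keys"

  # Assumed to accept a list of resources to wait for.
  depends_on_resources = [incus_project.sandbox]
}
```

Terraform 0.13 and later also accept a plain `depends_on` on `module` blocks, which may be enough when no ordering is needed inside the module itself.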
## Environment Strategy
### Environment Purposes
| Environment | Purpose | Infrastructure |
|-------------|---------|----------------|
| **Sandbox** | Evaluation, demos, pattern experimentation | Local Incus only |
| **Dev** | Integration testing, container builds, security testing | Local Incus only |
| **UAT** | User acceptance testing, bug resolution | OCI + Incus (hybrid) |
| **Prod** | Production workloads | OCI + Incus (hybrid) |
### Parameterized Twins (UAT/Prod)
UAT and Prod are architecturally identical. Use a single environment directory with variable files:
```bash
# UAT deployment
terraform apply -var-file=uat.tfvars
# Prod deployment
terraform apply -var-file=prod.tfvars
```
Key differences in tfvars:
- Hostnames and DNS domains
- Resource sizing (CPU, memory limits)
- OCI compartment IDs
- Credential references
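As an illustration of the differences listed above, a UAT variable file could look like this sketch. All variable names and values here are hypothetical except `enable_oci`, which appears later in this document; the real files will use whatever variables the environment declares.

```hcl
# uat.tfvars (illustrative only; names and values are assumptions)

# Hostnames and DNS domains
dns_domain = "uat.example.internal"

# Resource sizing
cpu_limit    = 2
memory_limit = "4GiB"

# OCI compartment (placeholder, not a real OCID)
oci_compartment_id = "ocid1.compartment.oc1..example"
enable_oci         = true

# Credential reference, resolved outside Terraform state
credentials_ref = "uat/terraform"
```

A matching `prod.tfvars` would set the same variables with production values, so a diff between the two files shows exactly where the twins differ.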
## State Management
### Sandbox (Ouranos)
Local state is acceptable because:
- Environment is ephemeral
- Single-user workflow
- No production secrets to protect
- Safe for external demos
### Production Environments
PostgreSQL backend on `eris.helu.ca`:
```hcl
terraform {
  backend "pg" {
    conn_str    = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"
    schema_name = "dev" # or "uat", "prod"
  }
}
```
**Connection requirements:**
- Port 6432 (pgBouncer)
- SSL with `sslmode=verify-full`
- Credentials via environment variables (`PGUSER`, `PGPASSWORD`)
- Separate schema per environment for isolation
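Terraform backend blocks cannot interpolate input variables, so the schema cannot be switched with a `-var-file`. One possible way to avoid editing `schema_name` by hand is partial backend configuration, sketched below. The file name `uat.pg.tfbackend` is hypothetical, and this is an alternative arrangement rather than a description of the current setup.

```hcl
# backend.tf: settings shared by every environment
terraform {
  backend "pg" {
    conn_str = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"

    # schema_name is intentionally omitted here. It is supplied at init
    # time, for example from a hypothetical file uat.pg.tfbackend that
    # contains the single line:
    #   schema_name = "uat"
  }
}
```

With that file in place, `terraform init -backend-config=uat.pg.tfbackend` selects the UAT schema; adding `-reconfigure` lets the same working directory be pointed at a different schema later.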
## Integration Points
### Terraform → DHCP/DNS
The `ouranos_inventory` output provides host information for DHCP/DNS provisioning:
1. Terraform creates containers with cloud-init
2. `ouranos_inventory` output includes hostnames and IPs
3. MAC addresses registered in DHCP server
4. DHCP server creates DNS entries (`hostname.incus` domain)
5. Ansible uses DNS names for host connectivity
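Step 3 needs MAC addresses, which the `ouranos_inventory` output shown later in this document does not include. A hypothetical companion output could expose them, as in the sketch below. It assumes the `incus_instance` resource in the provider version in use exports a `mac_address` attribute alongside `ipv4_address`; the output name `dhcp_registrations` is invented for this example.

```hcl
# Hypothetical output; not part of the existing configuration.
output "dhcp_registrations" {
  description = "Per-host data for DHCP reservations and DNS names"
  value = {
    for name, instance in incus_instance.uranian_hosts : name => {
      fqdn = "${name}.incus"
      mac  = instance.mac_address # assumed provider attribute
      ipv4 = instance.ipv4_address
    }
  }
}
```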
### Terraform → Ansible
Ansible does **not** consume Terraform outputs directly. Instead:
1. Terraform provisions containers
2. Incus DNS resolution provides `hostname.incus` domain
3. Ansible inventory uses static DNS names
4. `sandbox_up.yml` configures DNS resolution on the hypervisor
```yaml
# Ansible inventory uses DNS names, not Terraform outputs
ubuntu:
  hosts:
    oberon.incus:
    ariel.incus:
    prospero.incus:
```
### Terraform → Bash Scripts
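The `ssh_key_update.sh` script shows how a shell script should consume Terraform data: it reads the `ouranos_inventory` output as JSON and registers SSH host keys for each container's IP address and DNS name: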
```bash
# Emit "name ipv4" pairs from the inventory, then scan host keys for both forms
terraform output -json ouranos_inventory | jq -r \
  '.uranian_hosts.hosts | to_entries[] | "\(.key) \(.value.ipv4)"' | \
  while read -r hostname ip; do
    ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts
    ssh-keyscan -H "${hostname}.incus" >> ~/.ssh/known_hosts
  done
```
## Promotion Workflow
All infrastructure changes flow through this pipeline:
```
Ouranos (Sandbox)
    ↓ Validate pattern works
    ↓ Extract to module if reusable
Dev
    ↓ Integration testing
    ↓ Container builds
    ↓ Security testing
UAT
    ↓ User acceptance testing
    ↓ Bug fixes return to Dev
    ↓ Delete environment, test restore
Prod
    ↓ Deploy from tested artifacts
```
**Critical:** Nothing starts in Prod. Every change originates in Ouranos, is validated through the pipeline, and only then deployed to production.
### Promotion Includes
When promoting Terraform changes, always update corresponding:
- Ansible playbooks and templates
- Service documentation in `/docs/services/`
- Host variables, if new services are added
## Output Conventions
### `ouranos_inventory`
The primary output for documentation and DNS integration:
```hcl
output "ouranos_inventory" {
  description = "Host inventory for documentation and DHCP/DNS provisioning"
  value = {
    uranian_hosts = {
      hosts = {
        for name, instance in incus_instance.uranian_hosts : name => {
          name             = instance.name
          ipv4             = instance.ipv4_address
          role             = local.uranian_hosts[name].role
          description      = local.uranian_hosts[name].description
          security_nesting = lookup(local.uranian_hosts[name].config, "security.nesting", false)
        }
      }
    }
  }
}
```
**Purpose:**
- Update [sandbox.html](sandbox.html) documentation
- Reference for DHCP server MAC/IP registration
- DNS entry creation via DHCP
## Layered Configuration
### Single Config with Conditional Resources
Avoid multiple separate Terraform configurations. Use one config with conditional resources:
```
environments/prod/
├── main.tf # Incus project, profile, images (always)
├── incus_hosts.tf # Module call for Incus containers (always)
├── oci_resources.tf # OCI compute (conditional)
├── variables.tf
├── dev.tfvars # Dev: enable_oci = false
├── uat.tfvars # UAT: enable_oci = true
└── prod.tfvars # Prod: enable_oci = true
```
```hcl
variable "enable_oci" {
  description = "Enable OCI resources (false for Dev, true for UAT/Prod)"
  type        = bool
  default     = false
}

resource "oci_core_instance" "hosts" {
  for_each = var.enable_oci ? var.oci_hosts : {}
  # ...
}
```
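The conditional above relies on an `oci_hosts` variable that is not shown. A minimal sketch of how it might be declared follows, assuming a map keyed by host name; the `shape` attribute and the `oci_host_names` output are hypothetical and only illustrate the pattern.

```hcl
# Assumed declaration; the real variable may carry more attributes.
variable "oci_hosts" {
  description = "OCI hosts keyed by name; ignored unless enable_oci is true"
  type = map(object({
    shape = string # hypothetical attribute
  }))
  default = {}
}

# Hypothetical output that applies the same switch, so environments
# with enable_oci = false report an empty list instead of failing.
output "oci_host_names" {
  description = "Names of the OCI hosts this environment will create"
  value       = var.enable_oci ? keys(var.oci_hosts) : []
}
```

Defaulting `oci_hosts` to an empty map means `dev.tfvars` only has to set `enable_oci = false` and can leave the host map out entirely.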
## Best Practices Summary
| Practice | Rationale |
|----------|-----------|
| ✔ Explicit `depends_on` for critical chains | Terraform isn't magic: ordering is only guaranteed where a dependency exists |
| ✔ Local map for host definitions | Single source of truth, easy iteration |
| ✔ `for_each` over `count` | Stable resource addresses |
| ✔ `dynamic` blocks for optional devices | Clean, declarative device configuration |
| ✔ Merge base config with overrides | DRY principle for common settings |
| ✔ Separate tfvars for environment twins | Minimal duplication, clear parameterization |
| ✔ Document module interfaces | Enable promotion across environments |
| ✔ Never start in Prod | Always validate through pipeline |
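Several of these practices combine naturally in one resource. The sketch below shows a local map of hosts, `for_each`, a merged base config, and a `dynamic` device block together. It assumes the Incus provider is already configured; the config keys, image alias, limits, and disk paths are hypothetical and not taken from the Ouranos configuration.

```hcl
locals {
  # Settings shared by every host (hypothetical values).
  base_config = {
    "security.nesting" = "false"
    "limits.memory"    = "2GiB"
  }

  # Single source of truth for host definitions (hypothetical entries).
  hosts = {
    prospero = {
      role    = "monitoring"
      config  = {}
      devices = {}
    }
    oberon = {
      role   = "docker"
      config = { "security.nesting" = "true" }
      devices = {
        data = {
          type       = "disk"
          properties = { source = "/srv/oberon", path = "/data" }
        }
      }
    }
  }
}

resource "incus_instance" "hosts" {
  # for_each keys each instance by host name, so removing one host
  # does not shift the addresses of the others as count would.
  for_each = local.hosts

  name  = each.key
  image = "images:ubuntu/24.04/cloud" # assumed image alias

  # Per-host overrides win over the shared base settings.
  config = merge(local.base_config, each.value.config)

  # One device block per entry; hosts with no devices get none.
  dynamic "device" {
    for_each = each.value.devices
    content {
      name       = device.key
      type       = device.value.type
      properties = device.value.properties
    }
  }
}
```

Each instance is addressed by name, for example `incus_instance.hosts["oberon"]`, which is what keeps resource addresses stable when the map changes.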