docs: rewrite README with structured overview and quick start guide
Replaces the minimal project description with a comprehensive README including a component overview table, quick start instructions, common Ansible operations, and links to detailed documentation. Aligns with Red Panda Approval™ standards.
296
docs/terraform.md
Normal file
@@ -0,0 +1,296 @@
# Terraform Practices & Patterns

This document describes the Terraform design philosophy, patterns, and practices used across our infrastructure. The audience includes LLMs assisting with development, new team members, and existing team members seeking a reference.

## Design Philosophy

### Incus-First Infrastructure

Incus containers form the foundational layer of all environments. Management and monitoring infrastructure (Prospero, Titania) must exist before application hosts. This is a **critical dependency** that must be explicitly codified.

**Why?** Terraform isn't magic. It orders resources only by the references it can see between them, and it creates resources that have no references between them in parallel. Relying on that implicit ordering can lead to race conditions or failed deployments, so always use explicit `depends_on` for critical infrastructure chains.

```hcl
# Example: Application host depends on monitoring infrastructure
resource "incus_instance" "app_host" {
  # ...
  depends_on = [incus_instance.uranian_hosts["prospero"]]
}
```

### Explicit Dependencies

Never rely solely on implicit resource ordering for critical infrastructure. Codify dependencies explicitly to:

- ✔ Prevent race conditions during parallel applies
- ✔ Document architectural relationships in code
- ✔ Ensure consistent deployment ordering across environments

## Repository Strategy

### Agathos (Sandbox)

Agathos is the **Sandbox repository**: isolated, safe for external demos, and backed by local state.

| Aspect | Decision |
|--------|----------|
| Purpose | Evaluation, demos, pattern experimentation, new software testing |
| State | Local (no remote backend) |
| Secrets | No production credentials or references |
| Security | Safe to use on external infrastructure for demos |

### Production Repository (Separate)

A separate repository manages the Dev, UAT, and Prod environments:

```
terraform/
├── modules/incus_host/    # Reusable container module
├── environments/
│   ├── dev/               # Local Incus only
│   └── prod/              # UAT and Prod: OCI + Incus (parameterized via tfvars)
```

| Aspect | Decision |
|--------|----------|
| State | PostgreSQL backend on `eris.helu.ca:6432` with SSL |
| Schemas | Separate per environment: `dev`, `uat`, `prod` |
| UAT/Prod | Parameterized twins via `-var-file` |

## Module Design

### When to Extract a Module

A pattern is a good module candidate when it meets these criteria:

| Criterion | Description |
|-----------|-------------|
| **Reuse** | The pattern is used across multiple environments (Sandbox, Dev, UAT, Prod) |
| **Stable Interface** | Inputs and outputs won't change frequently |
| **Testable** | The module can be validated independently before promotion |
| **Encapsulates Complexity** | Hides `dynamic` blocks, `for_each`, cloud-init generation |

### When NOT to Extract

- The pattern is used only once
- The pattern is tightly coupled to a specific environment
- The module would add indirection without measurable benefit

### The `incus_host` Module

The `incus_host` module is the standard container provisioning pattern, extracted from Agathos.

**Inputs:**
- `hosts`: Map of host definitions (name, role, image, devices, config)
- `project`: Incus project name
- `profile`: Incus profile name
- `cloud_init_template`: Cloud-init configuration template
- `ssh_key_path`: Path to SSH authorized keys
- `depends_on_resources`: Explicit dependencies for infrastructure ordering

**Outputs:**
- `host_details`: Name, IPv4, role, description for each host
- `inventory`: Documentation reference for DHCP/DNS provisioning
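
A call to this module might look like the sketch below. The input and output names come from the lists above, and the source path follows the repository layout described earlier. Everything else is an assumption made for illustration: the host entry, the image alias, the project and profile names, the template and key paths, and the idea that `depends_on_resources` accepts a list.

```hcl
# Hypothetical call to the incus_host module from an environment directory.
# Input names match the documented interface; all values are placeholders.
module "uranian_hosts" {
  source = "../../modules/incus_host"

  hosts = {
    prospero = {
      role    = "monitoring"          # assumed role label
      image   = "images:ubuntu/24.04" # assumed image alias
      devices = []
      config  = {}
    }
  }

  project             = "agathos"                                # assumed project name
  profile             = "default"                                # assumed profile name
  cloud_init_template = "${path.module}/cloud-init.yaml.tftpl"   # assumed to be a template path
  ssh_key_path        = pathexpand("~/.ssh/authorized_keys")     # assumed key location

  # Assumed to take a list of values the module waits on before creating hosts.
  depends_on_resources = []
}

# Re-export the module's documented output for downstream tooling.
output "host_details" {
  value = module.uranian_hosts.host_details
}
```

Keeping the environment-specific values in the call and the resource logic in the module is what lets the same module be promoted from Sandbox through Prod without edits.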

## Environment Strategy

### Environment Purposes

| Environment | Purpose | Infrastructure |
|-------------|---------|----------------|
| **Sandbox** | Evaluation, demos, pattern experimentation | Local Incus only |
| **Dev** | Integration testing, container builds, security testing | Local Incus only |
| **UAT** | User acceptance testing, bug resolution | OCI + Incus (hybrid) |
| **Prod** | Production workloads | OCI + Incus (hybrid) |

### Parameterized Twins (UAT/Prod)

UAT and Prod are architecturally identical. Use a single environment directory with variable files:

```bash
# UAT deployment
terraform apply -var-file=uat.tfvars

# Prod deployment
terraform apply -var-file=prod.tfvars
```

Key differences in tfvars:
- Hostnames and DNS domains
- Resource sizing (CPU, memory limits)
- OCI compartment IDs
- Credential references
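
As a rough illustration of those differences, a `uat.tfvars` file could look like the sketch below. Only `enable_oci` appears elsewhere in this document (under Layered Configuration). Every other variable name and every value here is hypothetical and would need a matching declaration in `variables.tf`.

```hcl
# uat.tfvars (hypothetical variable names and placeholder values)

# Hostnames and DNS domains
dns_domain = "uat.example.internal"

# Resource sizing
cpu_limit    = "2"
memory_limit = "4GiB"

# OCI compartment
oci_compartment_id = "ocid1.compartment.oc1..exampleuniqueID"

# Credential reference (a pointer to a secret, never the secret itself)
credential_ref = "uat/terraform"

# Defined under Layered Configuration
enable_oci = true
```

A `prod.tfvars` twin would set the same variables with production values, so a diff of the two files shows exactly how the environments differ.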

## State Management

### Sandbox (Agathos)

Local state is acceptable because:
- Environment is ephemeral
- Single-user workflow
- No production secrets to protect
- Safe for external demos

### Production Environments

State is stored in a PostgreSQL backend on `eris.helu.ca`:

```hcl
terraform {
  backend "pg" {
    conn_str    = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"
    schema_name = "dev" # or "uat", "prod"
  }
}
```

**Connection requirements:**
- Port 6432 (PgBouncer)
- SSL with `sslmode=verify-full`
- Credentials via environment variables (`PGUSER`, `PGPASSWORD`)
- Separate schema per environment for isolation
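
Backend blocks cannot reference input variables, so `schema_name` cannot be set from a tfvars file. One possible way to keep a single directory for UAT and Prod, assuming partial backend configuration suits the workflow, is to leave `schema_name` out of the backend block and supply it at init time. The file names below are hypothetical.

```hcl
# backend.tf: shared settings only; schema_name is supplied at init time.
terraform {
  backend "pg" {
    conn_str = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"
  }
}
```

```hcl
# uat.backend.hcl (hypothetical file name), passed with:
#   terraform init -backend-config=uat.backend.hcl
schema_name = "uat"
```

Switching between environments in the same working directory then requires re-running `terraform init -reconfigure` with the other backend file before any plan or apply.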

## Integration Points

### Terraform → DHCP/DNS

The `agathos_inventory` output provides host information for DHCP/DNS provisioning:

1. Terraform creates containers with cloud-init
2. The `agathos_inventory` output includes hostnames and IPs
3. MAC addresses are registered in the DHCP server
4. The DHCP server creates DNS entries (`hostname.incus` domain)
5. Ansible uses DNS names for host connectivity

### Terraform → Ansible

Ansible does **not** consume Terraform outputs directly. Instead:

1. Terraform provisions containers
2. Incus DNS resolution serves `hostname.incus` names
3. The Ansible inventory uses static DNS names
4. `sandbox_up.yml` configures DNS resolution on the hypervisor

```yaml
# Ansible inventory uses DNS names, not Terraform outputs
ubuntu:
  hosts:
    oberon.incus:
    ariel.incus:
    prospero.incus:
```
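
### Terraform → Bash Scripts

The `ssh_key_update.sh` script shows how a shell script should consume Terraform data: it reads the `agathos_inventory` output as JSON and extracts each host's name and IPv4 address with `jq`, rather than parsing state files or hard-coding hosts.

```bash
# Read each host's name and IPv4 address from the Terraform output
terraform output -json agathos_inventory | jq -r \
  '.uranian_hosts.hosts | to_entries[] | "\(.key) \(.value.ipv4)"' | \
while read -r hostname ip; do
  # Record host keys under both the IP address and the DNS name
  ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts
  ssh-keyscan -H "${hostname}.incus" >> ~/.ssh/known_hosts
done
```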

## Promotion Workflow

All infrastructure changes flow through this pipeline:

```
Agathos (Sandbox)
    ↓ Validate pattern works
    ↓ Extract to module if reusable
Dev
    ↓ Integration testing
    ↓ Container builds
    ↓ Security testing
UAT
    ↓ User acceptance testing
    ↓ Bug fixes return to Dev
    ↓ Delete environment, test restore
Prod
    ↓ Deploy from tested artifacts
```

**Critical:** Nothing starts in Prod. Every change originates in Agathos, is validated through the pipeline, and only then deployed to production.

### Promotion Includes

When promoting Terraform changes, always update the corresponding:
- Ansible playbooks and templates
- Service documentation in `/docs/services/`
- Host variables, if new services are added

## Output Conventions

### `agathos_inventory`

The primary output for documentation and DNS integration:

```hcl
output "agathos_inventory" {
  description = "Host inventory for documentation and DHCP/DNS provisioning"
  value = {
    uranian_hosts = {
      hosts = {
        for name, instance in incus_instance.uranian_hosts : name => {
          name             = instance.name
          ipv4             = instance.ipv4_address
          role             = local.uranian_hosts[name].role
          description      = local.uranian_hosts[name].description
          security_nesting = lookup(local.uranian_hosts[name].config, "security.nesting", false)
        }
      }
    }
  }
}
```

**Purpose:**
- Update [sandbox.html](sandbox.html) documentation
- Reference for DHCP server MAC/IP registration
- DNS entry creation via DHCP

## Layered Configuration

### Single Config with Conditional Resources

Avoid multiple separate Terraform configurations. Use one config with conditional resources:

```
environments/prod/
├── main.tf              # Incus project, profile, images (always)
├── incus_hosts.tf       # Module call for Incus containers (always)
├── oci_resources.tf     # OCI compute (conditional)
├── variables.tf
├── dev.tfvars           # Dev: enable_oci = false
├── uat.tfvars           # UAT: enable_oci = true
└── prod.tfvars          # Prod: enable_oci = true
```

```hcl
variable "enable_oci" {
  description = "Enable OCI resources (false for Dev, true for UAT/Prod)"
  type        = bool
  default     = false
}

resource "oci_core_instance" "hosts" {
  for_each = var.enable_oci ? var.oci_hosts : {}
  # ...
}
```
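
The resource above reads `var.oci_hosts`, which is not declared anywhere in this document. A declaration along the following lines would fit. The object fields are hypothetical and stand in for whatever the OCI instances actually need.

```hcl
# Hypothetical declaration for the map consumed by for_each above.
variable "oci_hosts" {
  description = "OCI compute instances keyed by host name (unused when enable_oci is false)"
  type = map(object({
    shape               = string # assumed field: OCI compute shape
    availability_domain = string # assumed field: placement of the instance
  }))
  default = {}
}
```

With `enable_oci = false`, `for_each` receives an empty map and Terraform plans no OCI instances, so the same configuration applies cleanly to an Incus-only environment.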

## Best Practices Summary

| Practice | Rationale |
|----------|-----------|
| ✔ Explicit `depends_on` for critical chains | Terraform isn't magic; it orders only what it can infer from references |
| ✔ Local map for host definitions | Single source of truth, easy iteration |
| ✔ `for_each` over `count` | Stable resource addresses |
| ✔ `dynamic` blocks for optional devices | Clean, declarative device configuration |
| ✔ Merge base config with overrides | DRY principle for common settings |
| ✔ Separate tfvars for environment twins | Minimal duplication, clear parameterization |
| ✔ Document module interfaces | Enable promotion across environments |
| ✔ Never start in Prod | Always validate through pipeline |
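
Four of these practices (local map, `for_each`, `dynamic` blocks, merged config) have no example elsewhere in this document. The sketch below shows how they could fit together. It is not the actual Agathos configuration: the host entries, roles, image alias, base settings, and device definition are all hypothetical, and the `device` block layout is assumed from the Incus provider's `incus_instance` resource.

```hcl
locals {
  # Settings shared by every host (assumed values).
  base_config = {
    "boot.autostart" = "true"
  }

  # Single source of truth for host definitions (hypothetical entries).
  uranian_hosts = {
    prospero = {
      role        = "monitoring"
      description = "Monitoring stack"
      image       = "images:ubuntu/24.04"
      config      = {}
      devices     = []
    }
    oberon = {
      role        = "application"
      description = "Container host"
      image       = "images:ubuntu/24.04"
      config      = { "security.nesting" = "true" }
      devices = [
        { name = "data", type = "disk", properties = { source = "/srv/data", path = "/data" } }
      ]
    }
  }
}

resource "incus_instance" "uranian_hosts" {
  # for_each keys each instance by host name, so adding or removing a host
  # leaves the other resource addresses unchanged.
  for_each = local.uranian_hosts

  name  = each.key
  image = each.value.image

  # Per-host settings override the shared base.
  config = merge(local.base_config, each.value.config)

  # One device block per list entry; hosts with an empty list get none.
  dynamic "device" {
    for_each = each.value.devices
    content {
      name       = device.value.name
      type       = device.value.type
      properties = device.value.properties
    }
  }
}
```

With `count`, removing a host from the middle of a list would shift every later index and force those instances to be recreated. Keying by name avoids that, which is the "stable resource addresses" rationale in the table.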