ouranos/docs/ouranos.md
Robert Helewka 4ae6379613 chore(ansible): centralize third-party Docker image versions
Add centralized image version variables in group_vars/all/vars.yml for
vulnerability tracking and controlled upgrades of third-party Docker
images (casdoor, flower, grafana-mcp, gitea-mcp, neo4j, memcached,
nginx, oauth2-proxy, rabbitmq, searxng).

Update vault.yml accordingly.
2026-05-03 18:57:58 -04:00

Ouranos Lab

Infrastructure-as-Code project managing the Ouranos Lab — a development sandbox at ouranos.helu.ca. Uses Terraform for container provisioning and Ansible for configuration management, themed around the moons of Uranus.


Project Overview

| Component | Purpose |
|---|---|
| Terraform | Provisions 11 specialised Incus containers (LXC) with DNS-resolved networking, security policies, and resource dependencies |
| Ansible | Deploys Docker, databases (PostgreSQL, Neo4j), observability stack (Prometheus, Grafana, Loki), and application runtimes across all hosts |

DNS Domain: Incus resolves containers via the .incus domain suffix (e.g., oberon.incus, portia.incus). IPv4 addresses are dynamically assigned — always use DNS names, never hardcode IPs.


Uranian Host Architecture

All containers are named after moons of Uranus and resolved via the .incus DNS suffix.

| Name | Role | Description | Nesting |
|---|---|---|---|
| ariel | graph_database | Neo4j — Ethereal graph connections | |
| caliban | agent_automation | Agent S MCP Server with MATE Desktop | |
| miranda | mcp_docker_host | Dedicated Docker Host for MCP Servers | |
| oberon | container_orchestration | Docker Host — MCP Switchboard, RabbitMQ, Open WebUI | |
| portia | database | PostgreSQL — Relational database host | |
| prospero | observability | PPLG stack — Prometheus, Grafana, Loki, PgAdmin | |
| puck | application_runtime | Python App Host — JupyterLab, Django apps, Gitea Runner | |
| rosalind | collaboration | Gitea, LobeChat, Nextcloud, AnythingLLM | |
| sycorax | language_models | Arke LLM Proxy | |
| titania | proxy_sso | HAProxy TLS termination + Casdoor SSO | |
| umbriel | graph_database | Neo4j (Mnemosyne) — dedicated memory graph | |

puck — Project Application Runtime

Shape-shifting trickster embodying Python's versatility. This is the host that runs Python projects in the Ouranos sandbox. It has an RDP server and is generally where application development happens. Each project has a number that is used to determine port numbers.

  • Docker engine
  • JupyterLab (port 22071 via OAuth2-Proxy)
  • Gitea Runner (CI/CD agent)
  • Django Projects: Zelus (221), Angelia (222), Athena (224), Kairos (225), Icarlos (226), MCP Switchboard (227), Spelunker (228), Peitho (229), Mnemosyne (230)
  • FastAgent Projects: Pallas (240)
  • FastAPI Projects: Daedalus (200), Arke (201), Kernos (202), Rommie (203), Orpheus (204), Periplus (205), Nike (206), Stentor (207)

caliban — Agent Automation

Autonomous computer agent learning through environmental interaction.

  • Docker engine
  • Agent S MCP Server (MATE desktop, AT-SPI automation)
  • Kernos MCP Shell Server (port 22062)
  • Rommie MCP Server (port 22061) — agent-to-agent GUI automation via Agent S
  • FreeCAD Robust MCP Server (port 22063) — CAD automation via FreeCAD XML-RPC
  • GPU passthrough
  • RDP access (port 25521)

oberon — Container Orchestration & Dockerized Shared Services

King of the Fairies orchestrating containers and managing MCP infrastructure.

  • Docker engine
  • MCP Switchboard (port 22781) — Django app routing MCP tool calls
  • RabbitMQ message queue
  • smtp4dev SMTP test server (port 22025)

portia — Relational Database

Intelligent and resourceful — the reliability of relational databases.

  • PostgreSQL 17 (port 5432)
  • Databases: arke, anythingllm, gitea, hass, lobechat, mcp_switchboard, mnemosyne, nextcloud, openwebui, periplus, spelunker

ariel — Graph Database

Air spirit — ethereal, interconnected nature mirroring graph relationships.

  • Neo4j 5.26.0 (Docker)
  • HTTP API: port 25554
  • Bolt: port 7687 (reached as ariel.incus:7687 on the internal network)

umbriel — Graph Database (Mnemosyne)

Dusky melancholy sprite from Pope's Rape of the Lock — keeper of the Cave of Spleen, naturally paired with Mnemosyne the Titan of memory. Dedicated Neo4j instance so Mnemosyne's Library/Collection/Item/Chunk/Concept labels, vector indexes, and schema migrations can't collide with another tenant's graph on Ariel.

  • Neo4j 5.26.0 (Docker)
  • HTTP Browser: port 25555
  • Bolt: port 7687 (reached as umbriel.incus:7687 on the internal network)

miranda — MCP Docker Host

Curious bridge between worlds — hosting MCP server containers.

  • Docker engine (API exposed on port 2375 for MCP Switchboard)
  • MCPO OpenAI-compatible MCP proxy (port 22071)
  • Argos MCP Server — web search via SearXNG (port 22062)
  • Grafana MCP Server (port 22063)
  • Neo4j MCP Server (port 22064)
  • Gitea MCP Server (port 22065)

prospero — Observability Stack

Master magician observing all events.

  • PPLG stack via Docker Compose: Prometheus, Loki, Grafana, PgAdmin
  • Internal HAProxy with OAuth2-Proxy for all dashboards
  • AlertManager with Pushover notifications
  • Prometheus metrics collection (node-exporter, HAProxy, Loki)
  • Loki log aggregation via Alloy (all hosts)
  • Grafana dashboard suite with Casdoor SSO integration

rosalind — Third Party Applications for testing and evaluation

Witty and resourceful moon for PHP, Go, and Node.js runtimes.

  • SearXNG privacy search (port 22073, behind OAuth2-Proxy)
  • Gitea self-hosted Git (port 22082, SSH on 22022)
  • LobeChat AI chat interface (port 22081)
  • Nextcloud file sharing and collaboration (port 22083)
  • AnythingLLM document AI workspace (port 22084)
  • Nextcloud data on dedicated Incus storage volume
  • Open WebUI LLM interface (port 22088, PostgreSQL backend on Portia)
  • Home Assistant (port 8123)

sycorax — Language Models

Original magical power wielding language magic.

  • Arke LLM API Proxy (port 25540)
  • Multi-provider support (OpenAI, Anthropic, etc.)
  • Session management with Memcached
  • Database backend on Portia

titania — Proxy & SSO Services

Queen of the Fairies managing access control and authentication.

  • HAProxy 3.x with TLS termination (port 443)
  • Let's Encrypt wildcard certificate via certbot DNS-01 (Namecheap)
  • HTTP to HTTPS redirect (port 80)
  • Gitea SSH proxy (port 22022)
  • Casdoor SSO (port 22081, local PostgreSQL)
  • Prometheus metrics at :8404/metrics

Port Numbering

Services that run directly on a host may use their well-known ports: PostgreSQL 5432, Prometheus metrics (Node Exporter) 9100.

Inside a Docker project, however, the number plan must be followed to avoid port conflicts and confusion. Ports take the form XXXYZ:

  • XXX: Project number, or 220 for an external project
  • Y: Service: 0 reserved, 1-4 flexible, 5 database, 6 MCP, 7 API, 8 Web App, 9 Prometheus metrics
  • Z: Instance: the running instance of this app on the same host, starting at 1. May also be used to handle exceptions.

255 Incus port forwarding: Ports in this range are forwarded from the Incus host to Incus containers (defined in Terraform).

514ZZ is the syslog port range. Docker containers send their syslog to an Alloy syslog collector port, where ZZ is the application instance. Instance numbers only need to be unique on a given host and increment from 01.
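
The numbering plan above can be sketched as a small helper. This is only an illustration: the function names, the `SERVICE_DIGITS` keys, and the validation bounds are assumptions, not part of the Ouranos codebase.

```python
from typing import Union

# Service digit (Y) values taken from the port plan; the key names are assumed.
SERVICE_DIGITS = {"database": 5, "mcp": 6, "api": 7, "web": 8, "metrics": 9}

# Project number used for external (third-party) projects.
EXTERNAL_PROJECT = 220


def plan_port(project: int, service: Union[str, int], instance: int = 1) -> int:
    """Compose an XXXYZ port from project number, service digit, and instance."""
    digit = SERVICE_DIGITS[service] if isinstance(service, str) else service
    # 654 is the largest three-digit project that still yields a valid TCP port.
    if not 100 <= project <= 654:
        raise ValueError(f"project number out of range: {project}")
    if not 1 <= digit <= 9:
        raise ValueError(f"service digit must be 1-9 (0 is reserved): {digit}")
    if not 0 <= instance <= 9:
        raise ValueError(f"instance must be a single digit: {instance}")
    return project * 100 + digit * 10 + instance


def syslog_port(instance: int) -> int:
    """Compose a 514ZZ syslog collector port for an application instance."""
    if not 1 <= instance <= 99:
        raise ValueError(f"syslog instance must be 01-99: {instance}")
    return 51400 + instance
```

For example, `plan_port(222, "web")` yields 22281, which matches the Angelia backend port in the route table, and `syslog_port(1)` yields 51401.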


Application Conventions

Standards that all services deployed in Ouranos MUST follow. For full logging standards and anti-patterns, see red_panda_standards.md.

Health Check Endpoints

All services MUST expose Kubernetes-style health endpoints:

| Endpoint | Purpose | Auth |
|---|---|---|
| GET /live | Liveness — process is running and accepting connections | None |
| GET /ready | Readiness — process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
| GET /metrics | Prometheus metrics (see below) | IP-restricted |

  • HAProxy checks health_path (typically /ready/) for backend health — return HTTP 200 when healthy
  • Health endpoints MUST NOT require authentication (no JWT, no session)
  • Third-party services use their native health paths (e.g., /api/health, /api/healthz, /-/healthy)
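
A framework-neutral sketch of the liveness and readiness logic follows. The function name, the response bodies, and the use of HTTP 503 for an unhealthy dependency are assumptions made for illustration; the convention above only fixes the paths and the 200 response when healthy.

```python
from typing import Callable, Dict, Tuple


def health_response(path: str, checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Return (status_code, body) for a health probe.

    `checks` maps a dependency name to a callable returning True when healthy.
    """
    normalized = path.rstrip("/") or "/"  # accept both /ready and /ready/
    if normalized == "/live":
        # Liveness: answering at all means the process is up.
        return 200, {"status": "ok"}
    if normalized == "/ready":
        results = {}
        for name, check in checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                # A check that raises is treated as an unhealthy dependency.
                results[name] = False
        healthy = all(results.values())
        # 503 for "not ready" is an assumed choice, not mandated above.
        status = 200 if healthy else 503
        return status, {"status": "ok" if healthy else "unavailable", "checks": results}
    return 404, {"status": "not found"}
```

A Django or FastAPI view could call this with real dependency probes (a database ping, a cache ping) and serialise the body as JSON.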

Health Checks in Docker Compose

Use curl -f for Docker Compose healthchecks. Install curl in images if needed.

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

Logging Conventions

Log output flows through: App → syslog (RFC3164) → Alloy → Loki → Grafana

| Level | Usage |
|---|---|
| ERROR | Broken state requiring human action — always include exc_info=True, error type, and context |
| WARNING | Degraded but recovering — client disconnects, performance outliers, client-side exceptions, leaked markup |
| INFO | Lifecycle events — service start/stop, connections, requests completed, jobs finished |
| DEBUG | Diagnostic detail — SSE events, keepalive pings, health check 200 responses, negotiation steps |

Health check responses MUST be logged at DEBUG only. HAProxy and Prometheus probe endpoints every 15-30 seconds. Logging these at INFO floods syslog with thousands of identical 200 OK lines per hour, burying real events.
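
One way to apply these levels with Python's standard `logging` module is sketched below. The helper names are hypothetical, and treating `/metrics` scrapes as probes is an assumption; adjust `PROBE_PATHS` to the service's actual endpoints.

```python
import logging

# Paths hit by HAProxy and Prometheus probes (including /metrics is assumed).
PROBE_PATHS = {"/live", "/ready", "/metrics"}


def request_log_level(path: str, status: int) -> int:
    """Pick the log level for a completed request."""
    normalized = path.rstrip("/") or "/"
    if normalized in PROBE_PATHS and status == 200:
        return logging.DEBUG  # successful probes must not flood INFO
    return logging.INFO  # ordinary completed requests are lifecycle events


def log_request(logger: logging.Logger, method: str, path: str, status: int) -> None:
    """Log a completed request at the level chosen above."""
    logger.log(request_log_level(path, status), "%s %s -> %d", method, path, status)


def run_logged(logger: logging.Logger, what: str, func):
    """Run func(); on failure log at ERROR with traceback, error type, and context."""
    try:
        return func()
    except Exception as exc:
        logger.error("%s failed (%s): %s", what, type(exc).__name__, exc, exc_info=True)
        raise
```

Because `exc_info=True` is passed inside the `except` block, the traceback of the active exception is attached to the ERROR record.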

Protected vs Unprotected Endpoints

| Protected (require valid JWT) | Unprotected |
|---|---|
| All /api/v1/* routes | GET /live |
| | GET /ready |
| | GET /metrics (IP-restricted to internal networks) |
| | GET /api/auth/login-url |
| | POST /api/auth/token |
| | POST /api/v1/telemetry (sendBeacon cannot set headers) |
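
The split above can be expressed as a simple lookup, for example in authentication middleware. This is a sketch: the function name is hypothetical, and routes that appear in neither column are treated as out of scope (not requiring a JWT), which is an assumption.

```python
# (method, path) pairs that must stay reachable without a JWT.
UNPROTECTED = {
    ("GET", "/live"),
    ("GET", "/ready"),
    ("GET", "/metrics"),
    ("GET", "/api/auth/login-url"),
    ("POST", "/api/auth/token"),
    ("POST", "/api/v1/telemetry"),
}


def requires_jwt(method: str, path: str) -> bool:
    """Return True when the request must carry a valid JWT."""
    key = (method.upper(), path.rstrip("/") or "/")
    if key in UNPROTECTED:
        return False
    # Everything else under /api/v1/ is protected.
    return key[1].startswith("/api/v1/")
```

Note that only `POST /api/v1/telemetry` is exempt; a `GET` on the same path still falls under the protected `/api/v1/*` rule.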

Prometheus Metrics

All services SHOULD expose GET /metrics in Prometheus exposition format, scraped by Prospero's Prometheus (default 15s interval).

  • IP-restricted to internal networks only (10.10.0.0/24, 172.16.0.0/12, 127.0.0.0/8)
  • Consider exposing: request counts/durations, error rates, active connections, queue depths, dependency health
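
The IP restriction on `/metrics` could be checked with the standard `ipaddress` module, as in this minimal sketch. The function name is hypothetical; the network list is the one given above.

```python
import ipaddress

# Internal networks allowed to scrape /metrics.
INTERNAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.10.0.0/24", "172.16.0.0/12", "127.0.0.0/8")
]


def is_internal(addr: str) -> bool:
    """Return True when addr is a valid IP inside one of the internal networks."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # malformed addresses are never allowed
    return any(ip in net for net in INTERNAL_NETWORKS)
```

When the service sits behind a proxy, the address to check is the real client address as reported by that proxy, not necessarily the socket peer.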

Browser Telemetry

Frontend/browser code MUST send telemetry data and errors back to the application's telemetry API:

  • POST /api/v1/telemetry — unprotected (browser sendBeacon cannot set Authorization headers)
  • Capture and report: JavaScript exceptions, performance metrics, user-facing errors
  • Client-side exceptions should log as WARNING on the server (they indicate a problem but not a server-side failure)
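
On the server side, the telemetry endpoint might map incoming events to log levels roughly as follows. The payload fields (`kind`, `message`, `url`), the handler name, and the returned status codes are all assumptions made for this sketch; the convention above only fixes the path and the WARNING level for client-side exceptions.

```python
import json
import logging


def handle_telemetry(logger: logging.Logger, raw_body: bytes) -> int:
    """Process one telemetry POST body and return an HTTP status code."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        return 400  # not valid JSON (or not valid UTF-8)
    if not isinstance(event, dict):
        return 400
    if event.get("kind") == "exception":
        # Client-side exceptions indicate a problem, but not a server failure.
        logger.warning(
            "client exception: %s (url=%s)", event.get("message"), event.get("url")
        )
    else:
        logger.info("client telemetry: kind=%s", event.get("kind"))
    return 204  # nothing to return to a sendBeacon caller
```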

Docker Networking

  • Use the default Docker bridge network for simple deployments
  • Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
  • Do not create custom network definitions for single-service Docker Compose stacks

External Access via HAProxy

Titania provides TLS termination and reverse proxy for all services.

  • Base domain: ouranos.helu.ca
  • HTTPS: port 443 (standard)
  • HTTP: port 80 (redirects to HTTPS)
  • Certificate: Let's Encrypt wildcard via certbot DNS-01

Route Table

| Subdomain | Backend | Service |
|---|---|---|
| ouranos.helu.ca (root) | puck.incus:22281 | Angelia (Django) |
| alertmanager.ouranos.helu.ca | prospero.incus:443 (SSL) | AlertManager |
| angelia.ouranos.helu.ca | puck.incus:22281 | Angelia (Django) |
| anythingllm.ouranos.helu.ca | rosalind.incus:22084 | AnythingLLM |
| arke.ouranos.helu.ca | sycorax.incus:25540 | Arke LLM Proxy |
| athena.ouranos.helu.ca | puck.incus:22481 | Athena (Django) |
| gitea.ouranos.helu.ca | rosalind.incus:22082 | Gitea |
| grafana.ouranos.helu.ca | prospero.incus:443 (SSL) | Grafana |
| hass.ouranos.helu.ca | oberon.incus:8123 | Home Assistant |
| id.ouranos.helu.ca | titania.incus:22081 | Casdoor SSO |
| icarlos.ouranos.helu.ca | puck.incus:22681 | Icarlos (Django) |
| jupyterlab.ouranos.helu.ca | puck.incus:22071 | JupyterLab (OAuth2-Proxy) |
| kairos.ouranos.helu.ca | puck.incus:22581 | Kairos (Django) |
| lobechat.ouranos.helu.ca | rosalind.incus:22081 | LobeChat |
| loki.ouranos.helu.ca | prospero.incus:443 (SSL) | Loki |
| mcp-switchboard.ouranos.helu.ca | oberon.incus:22781 | MCP Switchboard |
| nextcloud.ouranos.helu.ca | rosalind.incus:22083 | Nextcloud |
| openwebui.ouranos.helu.ca | oberon.incus:22088 | Open WebUI |
| peitho.ouranos.helu.ca | puck.incus:22981 | Peitho (Django) |
| periplus.ouranos.helu.ca | puck.incus:20681 | Periplus (FastAPI + MCP via nginx) |
| pgadmin.ouranos.helu.ca | prospero.incus:443 (SSL) | PgAdmin 4 |
| prometheus.ouranos.helu.ca | prospero.incus:443 (SSL) | Prometheus |
| searxng.ouranos.helu.ca | oberon.incus:22073 | SearXNG (OAuth2-Proxy) |
| smtp4dev.ouranos.helu.ca | oberon.incus:22085 | smtp4dev |
| spelunker.ouranos.helu.ca | puck.incus:22881 | Spelunker (Django) |

Infrastructure Management

Quick Start

# Provision containers
cd terraform
terraform init
terraform plan
terraform apply

# Start all containers
cd ../ansible
source ~/env/ouranos/bin/activate
ansible-playbook sandbox_up.yml

# Deploy all services
ansible-playbook site.yml

# Stop all containers
ansible-playbook sandbox_down.yml

Python Virtual Environment Setup

The Ansible automation requires a Python virtual environment with the ansible package installed. Create and activate the environment from the ~ directory:

# Create virtual environment
cd ~
python3 -m venv env/ouranos

# Activate environment
source ~/env/ouranos/bin/activate

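# Install Ansible (the ansible package pulls in ansible-core)
pip install ansible

# Install the PostgreSQL collection (a Galaxy collection, not a pip package)
ansible-galaxy collection install community.postgresql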

Ansible Playbook Syntax Check

Before running playbooks, use the apsc.sh utility (in PATH) to quickly validate YAML syntax:

# From the ansible directory
apsc.sh

# This will check all YAML files in the current directory for syntax errors

Terraform Workflow

  1. Define — Containers, networks, and resources in *.tf files
  2. Plan — Review changes with terraform plan
  3. Apply — Provision with terraform apply
  4. Verify — Check outputs and container status

Terraform Import

When containers or other resources are created manually (outside Terraform) or need to be re-imported after recreation, use terraform import to sync the Terraform state with existing infrastructure.

Import Syntax

The correct import format for Incus resources requires quoting resource addresses with for_each keys and using the full ID including image fingerprints:

# Import a container with correct syntax
terraform import 'incus_instance.uranian_hosts["<name>"]' ouranos/<name>,image=<fingerprint>

Getting Image Fingerprints

First, get the fingerprint of the image resource from Terraform state:

cd terraform
terraform state show incus_image.noble | grep fingerprint
# Output: fingerprint = "75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644"

terraform state show incus_image.questing | grep fingerprint
# Output: fingerprint = "e78dd4a406b7fa3592ed0a6048862260b3d2e50c76e32a6169930245c0a13fdf"

Importing All Uranian Hosts

Import containers that are missing from state (or re-import after manual recreation):

# Containers using noble image
terraform import 'incus_instance.uranian_hosts["ariel"]' ouranos/ariel,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["miranda"]' ouranos/miranda,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["oberon"]' ouranos/oberon,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["portia"]' ouranos/portia,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["prospero"]' ouranos/prospero,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["rosalind"]' ouranos/rosalind,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["sycorax"]' ouranos/sycorax,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["titania"]' ouranos/titania,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644
terraform import 'incus_instance.uranian_hosts["umbriel"]' ouranos/umbriel,image=75cde3e755b0e657c05f67e03a42683217b233b0339448be747845747df58644

# Containers using questing image
terraform import 'incus_instance.uranian_hosts["caliban"]' ouranos/caliban,image=e78dd4a406b7fa3592ed0a6048862260b3d2e50c76e32a6169930245c0a13fdf
terraform import 'incus_instance.uranian_hosts["puck"]' ouranos/puck,image=e78dd4a406b7fa3592ed0a6048862260b3d2e50c76e32a6169930245c0a13fdf
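
Rather than typing each command by hand, the import lines can be generated from a host-to-fingerprint mapping. The sketch below assumes the resource address and `ouranos` project shown above; the function name is hypothetical, and the fingerprints should come from `terraform state show` as described earlier.

```python
import shlex
from typing import Dict, List


def import_command(host: str, fingerprint: str, project: str = "ouranos") -> str:
    """Build one terraform import command for a Uranian host."""
    address = f'incus_instance.uranian_hosts["{host}"]'
    # shlex.quote wraps the address in single quotes so the shell keeps
    # the brackets and double quotes of the for_each key intact.
    return f"terraform import {shlex.quote(address)} {project}/{host},image={fingerprint}"


def import_commands(images: Dict[str, str]) -> List[str]:
    """Build commands for every host; `images` maps host name to fingerprint."""
    return [import_command(host, fp) for host, fp in sorted(images.items())]
```

Printing the result of `import_commands({...})` gives a list that can be reviewed before it is pasted into a shell.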

Storage Bucket Import

For storage buckets, use the <project>/<pool>/<name> format:

terraform import incus_storage_bucket.<name> ouranos/default/<bucket-name>

Common Issues

  1. Import ID format errors: Use quotes around resource addresses with for_each keys: 'incus_instance.uranian_hosts["name"]'

  2. Image replacement on import: Importing without specifying the image fingerprint will cause Terraform to replace the container on next apply. Always include image=<fingerprint> in the import ID.

  3. Tainted state: If a resource shows "will be created" but already exists, it may be tainted. Remove from state and re-import:

    terraform state rm 'incus_instance.uranian_hosts["name"]'
    terraform import 'incus_instance.uranian_hosts["name"]' ouranos/name,image=<fingerprint>
    

Verify Import

After importing, verify with terraform plan:

terraform plan
# Should report nothing to add or destroy (ideally "No changes.")
# (Minor "update in-place" changes are normal for state sync of computed attributes)

Ansible Workflow

  1. Bootstrap — Update packages, install essentials (apt_update.yml)
  2. Agents — Deploy Alloy (log/metrics) and Node Exporter on all hosts
  3. Services — Configure databases, Docker, applications, observability
  4. Verify — Check service health and connectivity

Vault Management

# Edit secrets
ansible-vault edit inventory/group_vars/all/vault.yml

# View secrets
ansible-vault view inventory/group_vars/all/vault.yml

# Encrypt a new file
ansible-vault encrypt new_secrets.yml

S3 Storage Provisioning

Terraform provisions Incus S3 buckets for services requiring object storage:

| Service | Host | Purpose |
|---|---|---|
| Casdoor | Titania | User avatars and SSO resource storage |
| LobeChat | Rosalind | File uploads and attachments |

S3 credentials (access key, secret key, endpoint) are stored as sensitive Terraform outputs and managed in Ansible Vault under variables matching the vault_*_s3_* naming pattern.


Ansible Automation

Full Deployment (site.yml)

Playbooks run in dependency order:

| Playbook | Hosts | Purpose |
|---|---|---|
| apt_update.yml | All | Update packages and install essentials |
| alloy/deploy.yml | All | Grafana Alloy log/metrics collection |
| prometheus/node_deploy.yml | All | Node Exporter metrics |
| docker/deploy.yml | Oberon, Ariel, Miranda, Puck, Rosalind, Sycorax, Caliban, Titania | Docker engine |
| smtp4dev/deploy.yml | Oberon | SMTP test server |
| pplg/deploy.yml | Prospero | Full observability stack + HAProxy + OAuth2-Proxy |
| postgresql/deploy.yml | Portia | PostgreSQL with all databases |
| postgresql_ssl/deploy.yml | Titania | Dedicated PostgreSQL for Casdoor |
| neo4j/deploy.yml | Ariel, Umbriel | Neo4j graph database (Umbriel is the dedicated Mnemosyne instance) |
| searxng/deploy.yml | Oberon | SearXNG privacy search |
| haproxy/deploy.yml | Titania | HAProxy TLS termination and routing |
| casdoor/deploy.yml | Titania | Casdoor SSO |
| mcpo/deploy.yml | Miranda | MCPO MCP proxy |
| openwebui/deploy.yml | Oberon | Open WebUI LLM interface |
| hass/deploy.yml | Oberon | Home Assistant |
| gitea/deploy.yml | Rosalind | Gitea self-hosted Git |
| nextcloud/deploy.yml | Rosalind | Nextcloud collaboration |

Individual Service Deployments

Services with standalone deploy playbooks (not in site.yml):

| Playbook | Host | Service |
|---|---|---|
| anythingllm/deploy.yml | Rosalind | AnythingLLM document AI |
| arke/deploy.yml | Sycorax | Arke LLM proxy |
| argos/deploy.yml | Miranda | Argos MCP web search server |
| caliban/deploy.yml | Caliban | Agent S MCP Server |
| certbot/deploy.yml | Titania | Let's Encrypt certificate renewal |
| gitea_mcp/deploy.yml | Miranda | Gitea MCP Server |
| gitea_runner/deploy.yml | Puck | Gitea CI/CD runner |
| grafana_mcp/deploy.yml | Miranda | Grafana MCP Server |
| jupyterlab/deploy.yml | Puck | JupyterLab + OAuth2-Proxy |
| kernos/deploy.yml | Caliban | Kernos MCP shell server |
| lobechat/deploy.yml | Rosalind | LobeChat AI chat |
| rommie/deploy.yml | Caliban | Rommie MCP server (Agent S GUI automation) |
| neo4j_mcp/deploy.yml | Miranda | Neo4j MCP Server |
| freecad_mcp/deploy.yml | Caliban | FreeCAD Robust MCP Server |
| rabbitmq/deploy.yml | Oberon | RabbitMQ message queue |
Lifecycle Playbooks

| Playbook | Purpose |
|---|---|
| sandbox_up.yml | Start all Uranian host containers |
| sandbox_down.yml | Gracefully stop all containers |
| apt_update.yml | Update packages on all hosts |
| site.yml | Full deployment orchestration |

Data Flow Architecture

Observability Pipeline

All Hosts                      Prospero                         Alerts
Alloy + Node Exporter     →   Prometheus + Loki + Grafana   →  AlertManager + Pushover
collect metrics & logs         storage & visualisation           notifications

Integration Points

| Consumer | Provider | Connection |
|---|---|---|
| All LLM apps | Arke (Sycorax) | http://sycorax.incus:25540 |
| Open WebUI, Arke, Gitea, Nextcloud, LobeChat | PostgreSQL (Portia) | portia.incus:5432 |
| Neo4j MCP | Neo4j (Ariel) | ariel.incus:7687 (Bolt) |
| Mnemosyne | Neo4j (Umbriel) | umbriel.incus:7687 (Bolt) — dedicated tenant |
| MCP Switchboard | Docker API (Miranda) | tcp://miranda.incus:2375 |
| MCP Switchboard | RabbitMQ (Oberon) | oberon.incus:5672 |
| Kairos, Spelunker | RabbitMQ (Oberon) | oberon.incus:5672 |
| SMTP (all apps) | smtp4dev (Oberon) | oberon.incus:22025 |
| All hosts | Loki (Prospero) | http://prospero.incus:3100 |
| All hosts | Prometheus (Prospero) | http://prospero.incus:9090 |

Important Notes

⚠️ Alloy Host Variables Required — Every host with alloy in its services list must define alloy_log_level in inventory/host_vars/<host>.incus.yml. The playbook will fail with an undefined variable error if this is missing.

⚠️ Alloy Syslog Listeners Required for Docker Services — Any Docker Compose service using the syslog logging driver must have a corresponding loki.source.syslog listener in the host's Alloy config template (ansible/alloy/<hostname>/config.alloy.j2). Missing listeners cause Docker containers to fail on start.

⚠️ Local Terraform State — This project uses local Terraform state (no remote backend). Do not run terraform apply from multiple machines simultaneously.

⚠️ Nested Docker — Docker runs inside Incus containers (nested), which requires security.nesting = true and the lxc.apparmor.profile=unconfined AppArmor override on all Docker-enabled hosts.

⚠️ Deployment Order — Prospero (observability) must be fully deployed before other hosts, as Alloy on every host pushes logs and metrics to prospero.incus. Run pplg/deploy.yml before site.yml on a fresh environment.