# Ouranos Lab
Infrastructure-as-Code project managing the **Ouranos Lab** — a development sandbox at [ouranos.helu.ca](https://ouranos.helu.ca). Uses **Terraform** for container provisioning and **Ansible** for configuration management, themed around the moons of Uranus.
---
## Project Overview
| Component | Purpose |
|-----------|---------|
| **Terraform** | Provisions 10 specialised Incus containers (LXC) with DNS-resolved networking, security policies, and resource dependencies |
| **Ansible** | Deploys Docker, databases (PostgreSQL, Neo4j), observability stack (Prometheus, Grafana, Loki), and application runtimes across all hosts |
> **DNS Domain**: Incus resolves containers via the `.incus` domain suffix (e.g., `oberon.incus`, `portia.incus`). IPv4 addresses are dynamically assigned — always use DNS names, never hardcode IPs.
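The rule above in practice: compose connection strings from the stable `.incus` names so they survive address churn. A minimal sketch — the hostnames are the lab's own, but the database user and name are placeholders:

```shell
# Endpoints are built from DNS names, never from dynamic IPv4 addresses.
pg_host="portia.incus"
neo4j_bolt="bolt://ariel.incus:7687"
dsn="postgresql://app@${pg_host}:5432/app_db"   # "app" and "app_db" are placeholders
echo "$dsn"
```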
---
## Uranian Host Architecture
All containers are named after moons of Uranus and resolved via the `.incus` DNS suffix.
| Name | Role | Description | Nesting |
|------|------|-------------|---------|
| **ariel** | graph_database | Neo4j — Ethereal graph connections | ✔ |
| **caliban** | agent_automation | Agent S MCP Server with MATE Desktop | ✔ |
| **miranda** | mcp_docker_host | Dedicated Docker Host for MCP Servers | ✔ |
| **oberon** | container_orchestration | Docker Host — MCP Switchboard, RabbitMQ, Open WebUI | ✔ |
| **portia** | database | PostgreSQL — Relational database host | ❌ |
| **prospero** | observability | PPLG stack — Prometheus, Grafana, Loki, PgAdmin | ❌ |
| **puck** | application_runtime | Python App Host — JupyterLab, Django apps, Gitea Runner | ✔ |
| **rosalind** | collaboration | Gitea, LobeChat, Nextcloud, AnythingLLM | ✔ |
| **sycorax** | language_models | Arke LLM Proxy | ✔ |
| **titania** | proxy_sso | HAProxy TLS termination + Casdoor SSO | ✔ |
### oberon — Container Orchestration
King of the Fairies orchestrating containers and managing MCP infrastructure.
- Docker engine
- MCP Switchboard (port 22785) — Django app routing MCP tool calls
- RabbitMQ message queue
- Open WebUI LLM interface (port 22088, PostgreSQL backend on Portia)
- SearXNG privacy search (port 22073, behind OAuth2-Proxy)
- smtp4dev SMTP test server (SMTP on port 22025, web UI on port 22085)
- Home Assistant (port 8123)
### portia — Relational Database
Intelligent and resourceful — the reliability of relational databases.
- PostgreSQL 17 (port 5432)
- Databases: `arke`, `anythingllm`, `gitea`, `hass`, `lobechat`, `mcp_switchboard`, `nextcloud`, `openwebui`, `periplus`, `spelunker`
### ariel — Graph Database
Air spirit — ethereal, interconnected nature mirroring graph relationships.
- Neo4j 5.26.0 (Docker)
- HTTP API: port 25584
- Bolt: port 25554
### puck — Application Runtime
Shape-shifting trickster embodying Python's versatility.
- Docker engine
- JupyterLab (port 22071 via OAuth2-Proxy)
- Gitea Runner (CI/CD agent)
- Django applications: Angelia (22281), Athena (22481), Kairos (22581), Icarlos (22681), Spelunker (22881), Peitho (22981)
### prospero — Observability Stack
Master magician observing all events.
- PPLG stack via Docker Compose: Prometheus, Loki, Grafana, PgAdmin
- Internal HAProxy with OAuth2-Proxy for all dashboards
- AlertManager with Pushover notifications
- Prometheus metrics collection (`node-exporter`, HAProxy, Loki)
- Loki log aggregation via Alloy (all hosts)
- Grafana dashboard suite with Casdoor SSO integration
### miranda — MCP Docker Host
Curious bridge between worlds — hosting MCP server containers.
- Docker engine (API exposed on port 2375 for MCP Switchboard)
- MCPO OpenAI-compatible MCP proxy
- Grafana MCP Server (port 25533)
- Gitea MCP Server (port 25535)
- Neo4j MCP Server
- Argos MCP Server — web search via SearXNG (port 25534)
### sycorax — Language Models
Original magical power wielding language magic.
- Arke LLM API Proxy (port 25540)
- Multi-provider support (OpenAI, Anthropic, etc.)
- Session management with Memcached
- Database backend on Portia
### caliban — Agent Automation
Autonomous computer agent learning through environmental interaction.
- Docker engine
- Agent S MCP Server (MATE desktop, AT-SPI automation)
- Kernos MCP Shell Server (port 22021)
- GPU passthrough for vision tasks
- RDP access (port 25521)
### rosalind — Collaboration Services
Witty and resourceful — hosting the PHP, Go, and Node.js runtimes.
- Gitea self-hosted Git (port 22082, SSH on 22022)
- LobeChat AI chat interface (port 22081)
- Nextcloud file sharing and collaboration (port 22083)
- AnythingLLM document AI workspace (port 22084)
- Nextcloud data on dedicated Incus storage volume
### titania — Proxy & SSO Services
Queen of the Fairies managing access control and authentication.
- HAProxy 3.x with TLS termination (port 443)
- Let's Encrypt wildcard certificate via certbot DNS-01 (Namecheap)
- HTTP to HTTPS redirect (port 80)
- Gitea SSH proxy (port 22022)
- Casdoor SSO (port 22081, local PostgreSQL)
- Prometheus metrics at `:8404/metrics`
---
## External Access via HAProxy
Titania provides TLS termination and reverse proxy for all services.
- **Base domain**: `ouranos.helu.ca`
- **HTTPS**: port 443 (standard)
- **HTTP**: port 80 (redirects to HTTPS)
- **Certificate**: Let's Encrypt wildcard via certbot DNS-01
### Route Table
| Subdomain | Backend | Service |
|-----------|---------|---------|
| `ouranos.helu.ca` (root) | puck.incus:22281 | Angelia (Django) |
| `alertmanager.ouranos.helu.ca` | prospero.incus:443 (SSL) | AlertManager |
| `angelia.ouranos.helu.ca` | puck.incus:22281 | Angelia (Django) |
| `anythingllm.ouranos.helu.ca` | rosalind.incus:22084 | AnythingLLM |
| `arke.ouranos.helu.ca` | sycorax.incus:25540 | Arke LLM Proxy |
| `athena.ouranos.helu.ca` | puck.incus:22481 | Athena (Django) |
| `gitea.ouranos.helu.ca` | rosalind.incus:22082 | Gitea |
| `grafana.ouranos.helu.ca` | prospero.incus:443 (SSL) | Grafana |
| `hass.ouranos.helu.ca` | oberon.incus:8123 | Home Assistant |
| `id.ouranos.helu.ca` | titania.incus:22081 | Casdoor SSO |
| `icarlos.ouranos.helu.ca` | puck.incus:22681 | Icarlos (Django) |
| `jupyterlab.ouranos.helu.ca` | puck.incus:22071 | JupyterLab (OAuth2-Proxy) |
| `kairos.ouranos.helu.ca` | puck.incus:22581 | Kairos (Django) |
| `lobechat.ouranos.helu.ca` | rosalind.incus:22081 | LobeChat |
| `loki.ouranos.helu.ca` | prospero.incus:443 (SSL) | Loki |
| `mcp-switchboard.ouranos.helu.ca` | oberon.incus:22785 | MCP Switchboard |
| `nextcloud.ouranos.helu.ca` | rosalind.incus:22083 | Nextcloud |
| `openwebui.ouranos.helu.ca` | oberon.incus:22088 | Open WebUI |
| `peitho.ouranos.helu.ca` | puck.incus:22981 | Peitho (Django) |
| `pgadmin.ouranos.helu.ca` | prospero.incus:443 (SSL) | PgAdmin 4 |
| `prometheus.ouranos.helu.ca` | prospero.incus:443 (SSL) | Prometheus |
| `searxng.ouranos.helu.ca` | oberon.incus:22073 | SearXNG (OAuth2-Proxy) |
| `smtp4dev.ouranos.helu.ca` | oberon.incus:22085 | smtp4dev |
| `spelunker.ouranos.helu.ca` | puck.incus:22881 | Spelunker (Django) |
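A hedged sketch of how one such route maps onto HAProxy configuration — the section and backend names are illustrative, not the project's actual config:

```haproxy
frontend https-in
    bind :443 ssl crt /etc/haproxy/certs/ouranos.helu.ca.pem
    # Route by Host header to the matching Uranian backend
    acl host_gitea hdr(host) -i gitea.ouranos.helu.ca
    use_backend be_gitea if host_gitea

backend be_gitea
    server rosalind rosalind.incus:22082 check
```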
---
## Infrastructure Management
### Quick Start
```bash
# Provision containers
cd terraform
terraform init
terraform plan
terraform apply
# Start all containers
cd ../ansible
source ~/env/agathos/bin/activate
ansible-playbook sandbox_up.yml
# Deploy all services
ansible-playbook site.yml
# Stop all containers
ansible-playbook sandbox_down.yml
```
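When iterating on a single host, both tools can be scoped more narrowly. A hedged sketch — the Terraform resource name is an example, not the project's confirmed naming scheme:

```bash
# Re-plan a single container (assumes a resource named incus_instance.portia)
terraform plan -target=incus_instance.portia
# Re-run configuration for one host only
ansible-playbook site.yml --limit portia.incus
# Or run a single service playbook directly
ansible-playbook postgresql/deploy.yml
```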
### Terraform Workflow
1. **Define** — Containers, networks, and resources in `*.tf` files
2. **Plan** — Review changes with `terraform plan`
3. **Apply** — Provision with `terraform apply`
4. **Verify** — Check outputs and container status
### Ansible Workflow
1. **Bootstrap** — Update packages, install essentials (`apt_update.yml`)
2. **Agents** — Deploy Alloy (log/metrics) and Node Exporter on all hosts
3. **Services** — Configure databases, Docker, applications, observability
4. **Verify** — Check service health and connectivity
### Vault Management
```bash
# Edit secrets
ansible-vault edit inventory/group_vars/all/vault.yml
# View secrets
ansible-vault view inventory/group_vars/all/vault.yml
# Encrypt a new file
ansible-vault encrypt new_secrets.yml
```
---
## S3 Storage Provisioning
Terraform provisions Incus S3 buckets for services requiring object storage:
| Service | Host | Purpose |
|---------|------|---------|
| **Casdoor** | Titania | User avatars and SSO resource storage |
| **LobeChat** | Rosalind | File uploads and attachments |
> S3 credentials (access key, secret key, endpoint) are stored as sensitive Terraform outputs and managed in Ansible Vault with the `vault_*_s3_*` prefix.
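A hedged Terraform sketch of one such bucket, assuming the Incus provider's `incus_storage_bucket` and `incus_storage_bucket_key` resources and a `default` storage pool (all names illustrative):

```hcl
resource "incus_storage_bucket" "lobechat" {
  name = "lobechat"   # file uploads and attachments
  pool = "default"    # assumption: default storage pool
}

resource "incus_storage_bucket_key" "lobechat" {
  name           = "lobechat-app"
  pool           = "default"
  storage_bucket = incus_storage_bucket.lobechat.name
  role           = "admin"
}

# Credentials surface only as sensitive outputs, then move into Ansible Vault.
output "lobechat_s3_access_key" {
  value     = incus_storage_bucket_key.lobechat.access_key
  sensitive = true
}
output "lobechat_s3_secret_key" {
  value     = incus_storage_bucket_key.lobechat.secret_key
  sensitive = true
}
```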
---
## Ansible Automation
### Full Deployment (`site.yml`)
Playbooks run in dependency order:
| Playbook | Hosts | Purpose |
|----------|-------|---------|
| `apt_update.yml` | All | Update packages and install essentials |
| `alloy/deploy.yml` | All | Grafana Alloy log/metrics collection |
| `prometheus/node_deploy.yml` | All | Node Exporter metrics |
| `docker/deploy.yml` | Oberon, Ariel, Miranda, Puck, Rosalind, Sycorax, Caliban, Titania | Docker engine |
| `smtp4dev/deploy.yml` | Oberon | SMTP test server |
| `pplg/deploy.yml` | Prospero | Full observability stack + HAProxy + OAuth2-Proxy |
| `postgresql/deploy.yml` | Portia | PostgreSQL with all databases |
| `postgresql_ssl/deploy.yml` | Titania | Dedicated PostgreSQL for Casdoor |
| `neo4j/deploy.yml` | Ariel | Neo4j graph database |
| `searxng/deploy.yml` | Oberon | SearXNG privacy search |
| `haproxy/deploy.yml` | Titania | HAProxy TLS termination and routing |
| `casdoor/deploy.yml` | Titania | Casdoor SSO |
| `mcpo/deploy.yml` | Miranda | MCPO MCP proxy |
| `openwebui/deploy.yml` | Oberon | Open WebUI LLM interface |
| `hass/deploy.yml` | Oberon | Home Assistant |
| `gitea/deploy.yml` | Rosalind | Gitea self-hosted Git |
| `nextcloud/deploy.yml` | Rosalind | Nextcloud collaboration |
### Individual Service Deployments
Services with standalone deploy playbooks (not in `site.yml`):
| Playbook | Host | Service |
|----------|------|---------|
| `anythingllm/deploy.yml` | Rosalind | AnythingLLM document AI |
| `arke/deploy.yml` | Sycorax | Arke LLM proxy |
| `argos/deploy.yml` | Miranda | Argos MCP web search server |
| `caliban/deploy.yml` | Caliban | Agent S MCP Server |
| `certbot/deploy.yml` | Titania | Let's Encrypt certificate renewal |
| `gitea_mcp/deploy.yml` | Miranda | Gitea MCP Server |
| `gitea_runner/deploy.yml` | Puck | Gitea CI/CD runner |
| `grafana_mcp/deploy.yml` | Miranda | Grafana MCP Server |
| `jupyterlab/deploy.yml` | Puck | JupyterLab + OAuth2-Proxy |
| `kernos/deploy.yml` | Caliban | Kernos MCP shell server |
| `lobechat/deploy.yml` | Rosalind | LobeChat AI chat |
| `neo4j_mcp/deploy.yml` | Miranda | Neo4j MCP Server |
| `rabbitmq/deploy.yml` | Oberon | RabbitMQ message queue |
### Lifecycle Playbooks
| Playbook | Purpose |
|----------|---------|
| `sandbox_up.yml` | Start all Uranian host containers |
| `sandbox_down.yml` | Gracefully stop all containers |
| `apt_update.yml` | Update packages on all hosts |
| `site.yml` | Full deployment orchestration |
---
## Data Flow Architecture
### Observability Pipeline
```
All Hosts                     Prospero                          Alerts
Alloy + Node Exporter    →    Prometheus + Loki + Grafana   →   AlertManager + Pushover
(collect metrics & logs)      (storage & visualisation)         (notifications)
```
### Integration Points
| Consumer | Provider | Connection |
|----------|----------|-----------|
| All LLM apps | Arke (Sycorax) | `http://sycorax.incus:25540` |
| Open WebUI, Arke, Gitea, Nextcloud, LobeChat | PostgreSQL (Portia) | `portia.incus:5432` |
| Neo4j MCP | Neo4j (Ariel) | `ariel.incus:7687` (Bolt) |
| MCP Switchboard | Docker API (Miranda) | `tcp://miranda.incus:2375` |
| MCP Switchboard | RabbitMQ (Oberon) | `oberon.incus:5672` |
| Kairos, Spelunker | RabbitMQ (Oberon) | `oberon.incus:5672` |
| SMTP (all apps) | smtp4dev (Oberon) | `oberon.incus:22025` |
| All hosts | Loki (Prospero) | `http://prospero.incus:3100` |
| All hosts | Prometheus (Prospero) | `http://prospero.incus:9090` |
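These endpoints typically reach applications as environment variables. An illustrative Docker Compose fragment — the variable names, credentials, and the `/v1` path are assumptions, not the project's actual config:

```yaml
services:
  example-app:
    environment:
      DATABASE_URL: postgresql://example@portia.incus:5432/example  # PostgreSQL on Portia
      OPENAI_API_BASE_URL: http://sycorax.incus:25540/v1            # Arke proxy; /v1 path assumed
      AMQP_URL: amqp://guest@oberon.incus:5672/                     # RabbitMQ on Oberon
      SMTP_HOST: oberon.incus                                       # smtp4dev
      SMTP_PORT: "22025"
```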
---
## Important Notes
⚠️ **Alloy Host Variables Required** — Every host with `alloy` in its `services` list must define `alloy_log_level` in `inventory/host_vars/<host>.incus.yml`. The playbook will fail with an undefined variable error if this is missing.
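A minimal host_vars file satisfying this requirement — the chosen level is illustrative:

```yaml
# inventory/host_vars/puck.incus.yml
alloy_log_level: warn
```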
⚠️ **Alloy Syslog Listeners Required for Docker Services** — Any Docker Compose service using the syslog logging driver must have a corresponding `loki.source.syslog` listener in the host's Alloy config template (`ansible/alloy/<hostname>/config.alloy.j2`). Missing listeners cause Docker containers to fail on start.
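A hedged sketch of such a listener in Alloy's configuration syntax — the port, labels, and the existence of a `loki.write "default"` component elsewhere in the config are assumptions:

```alloy
loki.source.syslog "docker_services" {
  listener {
    address  = "0.0.0.0:51893"        // port is illustrative
    protocol = "tcp"
    labels   = { job = "docker-syslog" }
  }
  // Assumes a loki.write "default" component is defined in the same config
  forward_to = [loki.write.default.receiver]
}
```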
⚠️ **Local Terraform State** — This project uses local Terraform state (no remote backend). Do not run `terraform apply` from multiple machines simultaneously.
⚠️ **Nested Docker** — Docker runs inside Incus containers (nested), requiring `security.nesting = true` and `lxc.apparmor.profile=unconfined` AppArmor override on all Docker-enabled hosts.
⚠️ **Deployment Order** — Prospero (observability) must be fully deployed before other hosts, as Alloy on every host pushes logs and metrics to `prospero.incus`. Run `pplg/deploy.yml` before `site.yml` on a fresh environment.
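On a fresh environment, the note above translates into this order, using the playbooks named earlier in this document:

```bash
ansible-playbook sandbox_up.yml    # start all containers
ansible-playbook pplg/deploy.yml   # observability first — Alloy push targets must exist
ansible-playbook site.yml          # everything else in dependency order
```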