docs: rewrite README with structured overview and quick start guide

Replaces the minimal project description with a comprehensive README
including a component overview table, quick start instructions, common
Ansible operations, and links to detailed documentation. Aligns with
Red Panda Approval™ standards.
Commit b4d60f2f38 (parent c7be03a743)
2026-03-03 12:49:06 +00:00
219 changed files with 34586 additions and 2 deletions


@@ -0,0 +1,38 @@
# Scalable Twelve-Factor App
https://12factor.net/
The twelve-factor app is a methodology for building software-as-a-service apps that:

- Use declarative formats for setup automation, to minimize time and cost for new developers joining the project;
- Have a clean contract with the underlying operating system, offering maximum portability between execution environments;
- Are suitable for deployment on modern cloud platforms, obviating the need for servers and systems administration;
- Minimize divergence between development and production, enabling continuous deployment for maximum agility;
- And can scale up without significant changes to tooling, architecture, or development practices.

The twelve factors:

- **I. Codebase**: One codebase tracked in revision control, many deploys
- **II. Dependencies**: Explicitly declare and isolate dependencies
- **III. Config**: Store config in the environment
- **IV. Backing services**: Treat backing services as attached resources
- **V. Build, release, run**: Strictly separate build and run stages
- **VI. Processes**: Execute the app as one or more stateless processes
- **VII. Port binding**: Export services via port binding
- **VIII. Concurrency**: Scale out via the process model
- **IX. Disposability**: Maximize robustness with fast startup and graceful shutdown
- **X. Dev/prod parity**: Keep development, staging, and production as similar as possible
- **XI. Logs**: Treat logs as event streams
- **XII. Admin processes**: Run admin/management tasks as one-off processes
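Several of these factors become concrete in a minimal container definition. A hedged sketch, where the service name, image, and ports are all hypothetical:

```yaml
# docker-compose.yml sketch; every name and value here is illustrative
services:
  web:
    image: myapp:1.4.2           # V. Build, release, run: deploy an immutable build artifact
    environment:                 # III. Config: configuration comes from the environment
      DATABASE_URL: "${DATABASE_URL}"   # IV. Backing services: attached via a URL, swappable
      LOG_LEVEL: "${LOG_LEVEL:-info}"   # XI. Logs: the app writes to stdout at this level
    ports:
      - "8000:8000"              # VII. Port binding: the app exports its own service
    deploy:
      replicas: 3                # VIII. Concurrency: scale out via the process model
```

Because nothing environment-specific is baked into the image, the same artifact runs unchanged in development and production (X. Dev/prod parity).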
# Django Logging
https://lincolnloop.com/blog/django-logging-right-way/


@@ -0,0 +1,13 @@
# Semantic Versioning 2.0.0
https://semver.org/
Given a version number MAJOR.MINOR.PATCH, increment the:

- MAJOR version when you make incompatible API changes,
- MINOR version when you add functionality in a backwards-compatible manner, and
- PATCH version when you make backwards-compatible bug fixes.
Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
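For example, starting from a hypothetical release `v2.4.1`, the rules produce:

```yaml
# Hypothetical bump sequence under the rules above
current_release: v2.4.1
# backwards-compatible bug fix  -> v2.4.2  (PATCH increments)
# backwards-compatible feature  -> v2.5.0  (MINOR increments, PATCH resets to 0)
# incompatible API change       -> v3.0.0  (MAJOR increments, MINOR and PATCH reset to 0)
# pre-release / build metadata  -> v3.0.0-rc.1, v3.0.0+build.5  (extensions)
```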
GitHub Actions: Gitbump
https://betterprogramming.pub/how-to-version-your-code-in-2020-60bdd221278b

docs/_template.md (new file, 184 lines)

@@ -0,0 +1,184 @@
# Service Documentation Template
This is a template for documenting services deployed in the Agathos sandbox. Copy this file and replace placeholders with service-specific information.
---
# {Service Name}
## Overview
Brief description of the service, its purpose, and role in the infrastructure.
**Host:** {hostname} (e.g., oberon, miranda, prospero)
**Role:** {role from Terraform} (e.g., container_orchestration, observability)
**Port Range:** {exposed ports} (e.g., 25580-25599)
## Architecture
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ Service │────▶│ Database │
└─────────────┘ └─────────────┘ └─────────────┘
```
Describe the service architecture, data flow, and integration points.
## Terraform Resources
### Host Definition
The service runs on `{hostname}`, defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Image | {noble/plucky/questing} |
| Role | {terraform role} |
| Security Nesting | {true/false} |
| Proxy Devices | {port mappings} |
### Dependencies
| Resource | Relationship |
|----------|--------------|
| {other host} | {description of dependency} |
## Ansible Deployment
### Playbook
```bash
cd ansible
ansible-playbook {service}/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `{service}/deploy.yml` | Main deployment playbook |
| `{service}/*.j2` | Jinja2 templates |
### Variables
#### Group Variables (`group_vars/all/main.yml`)
| Variable | Description | Default |
|----------|-------------|---------|
| `{service}_version` | Version to deploy | `latest` |
#### Host Variables (`host_vars/{hostname}.yml`)
| Variable | Description |
|----------|-------------|
| `{service}_port` | Service port |
| `{service}_data_dir` | Data directory |
#### Vault Variables (`group_vars/all/vault.yml`)
| Variable | Description |
|----------|-------------|
| `vault_{service}_password` | Service password |
| `vault_{service}_api_key` | API key (if applicable) |
## Configuration
### Environment Variables
| Variable | Description | Source |
|----------|-------------|--------|
| `{VAR_NAME}` | Description | `{{ vault_{service}_var }}` |
### Configuration Files
| File | Location | Template |
|------|----------|----------|
| `config.yml` | `/etc/{service}/` | `{service}/config.yml.j2` |
## Monitoring
### Prometheus Metrics
| Metric | Description |
|--------|-------------|
| `{service}_requests_total` | Total requests |
| `{service}_errors_total` | Total errors |
**Scrape Target:** Configured in `ansible/prometheus/` or via Alloy.
### Loki Logs
| Log Source | Labels |
|------------|--------|
| Application log | `{job="{service}", host="{hostname}"}` |
| Access log | `{job="{service}_access", host="{hostname}"}` |
**Collection:** Alloy agent on host ships logs to Loki on Prospero.
### Grafana Dashboard
Dashboard provisioned at: `ansible/grafana/dashboards/{service}.json`
## Operations
### Start/Stop
```bash
# Via systemd (if applicable)
sudo systemctl start {service}
sudo systemctl stop {service}
# Via Docker (if applicable)
docker compose -f /opt/{service}/docker-compose.yml up -d
docker compose -f /opt/{service}/docker-compose.yml down
```
### Health Check
```bash
curl http://{hostname}.incus:{port}/health
```
### Logs
```bash
# Systemd
journalctl -u {service} -f
# Docker
docker logs -f {container_name}
# Loki (via Grafana Explore)
{job="{service}"}
```
### Backup
Describe backup procedures, scripts, and schedules.
### Restore
Describe restore procedures and verification steps.
## Troubleshooting
### Common Issues
| Symptom | Cause | Resolution |
|---------|-------|------------|
| Service won't start | Missing config | Check `{config_file}` exists |
| Connection refused | Firewall/proxy | Verify Incus proxy device |
### Debug Mode
```bash
# Enable debug logging
{service} --debug
```
## References
- Official Documentation: {url}
- [Terraform Practices](../terraform.md)
- [Ansible Practices](../ansible.md)
- [Sandbox Overview](../sandbox.html)

docs/ansible.md (new file, 705 lines)

@@ -0,0 +1,705 @@
# Ansible Project Structure - Best Practices
This document describes the clean, maintainable Ansible structure implemented in the Agathos project. Use this as a reference template for other Ansible projects.
## Overview
This structure emphasizes:
- **Simplicity**: Minimal files at root level
- **Organization**: Services contain all related files (playbooks + templates)
- **Separation**: Variables live in dedicated files, not inline in inventory
- **Discoverability**: Clear naming and logical grouping
## Directory Structure
```
ansible/
├── ansible.cfg # Ansible configuration
├── .vault_pass # Vault password file
├── site.yml # Master orchestration playbook
├── apt_update.yml # Utility: Update all hosts
├── sandbox_up.yml # Utility: Start infrastructure
├── sandbox_down.yml # Utility: Stop infrastructure
├── inventory/ # Inventory organization
│ ├── hosts # Simple host/group membership
│ │
│ ├── group_vars/ # Variables for groups
│ │ └── all/
│ │ ├── vars.yml # Common variables
│ │ └── vault.yml # Encrypted secrets
│ │
│ └── host_vars/ # Variables per host
│ ├── hostname1.yml # All vars for hostname1
│ ├── hostname2.yml # All vars for hostname2
│ └── ...
└── service_name/ # Per-service directories
├── deploy.yml # Main deployment playbook
├── stage.yml # Staging playbook (if needed)
├── template1.j2 # Jinja2 templates
├── template2.j2
└── files/ # Static files (if needed)
```
## Key Components
### 1. Simplified Inventory (`inventory/hosts`)
**Purpose**: Define ONLY host/group membership, no variables
**Example**:
```yaml
---
# Ansible Inventory - Simplified
# Main infrastructure group
ubuntu:
hosts:
server1.example.com:
server2.example.com:
server3.example.com:
# Service-specific groups
web_servers:
hosts:
server1.example.com:
database_servers:
hosts:
server2.example.com:
```
**Before**: 361 lines with variables inline
**After**: 34 lines of pure structure
### 2. Host Variables (`inventory/host_vars/`)
**Purpose**: All configuration specific to a single host
**File naming**: `{hostname}.yml` (matches inventory hostname exactly)
**Example** (`inventory/host_vars/server1.example.com.yml`):
```yaml
---
# Server1 Configuration - Web Server
# Services: nginx, php-fpm, redis
services:
- nginx
- php
- redis
# Nginx Configuration
nginx_user: www-data
nginx_worker_processes: auto
nginx_port: 80
nginx_ssl_port: 443
# PHP-FPM Configuration
php_version: 8.2
php_max_children: 50
# Redis Configuration
redis_port: 6379
redis_password: "{{vault_redis_password}}"
```
### 3. Group Variables (`inventory/group_vars/`)
**Purpose**: Variables shared across multiple hosts
**Structure**:
```
group_vars/
├── all/ # Variables for ALL hosts
│ ├── vars.yml # Common non-sensitive config
│ └── vault.yml # Encrypted secrets (ansible-vault)
└── web_servers/ # Variables for web_servers group
└── vars.yml
```
**Example** (`inventory/group_vars/all/vars.yml`):
```yaml
---
# Common Variables for All Hosts
remote_user: ansible
deployment_environment: production
ansible_python_interpreter: /usr/bin/python3
# Release versions
app_release: v1.2.3
api_release: v2.0.1
# Monitoring endpoints
prometheus_url: http://monitoring.example.com:9090
loki_url: http://monitoring.example.com:3100
```
### 4. Service Directories
**Purpose**: Group all files related to a service deployment
**Pattern**: `{service_name}/`
**Contents**:
- `deploy.yml` - Main deployment playbook
- `stage.yml` - Staging/update playbook (optional)
- `*.j2` - Jinja2 templates
- `files/` - Static files (if needed)
- `tasks/` - Task files (if splitting large playbooks)
**Example Structure**:
```
nginx/
├── deploy.yml # Deployment playbook
├── nginx.conf.j2 # Main config template
├── site.conf.j2 # Virtual host template
├── nginx.service.j2 # Systemd service file
└── files/
└── ssl_params.conf # Static SSL configuration
```
### 5. Master Playbook (`site.yml`)
**Purpose**: Orchestrate full-stack deployment
**Pattern**: Import service playbooks in dependency order
**Example**:
```yaml
---
- name: Update All Hosts
import_playbook: apt_update.yml
- name: Deploy Docker
import_playbook: docker/deploy.yml
- name: Deploy PostgreSQL
import_playbook: postgresql/deploy.yml
- name: Deploy Application
import_playbook: myapp/deploy.yml
- name: Deploy Monitoring
import_playbook: prometheus/deploy.yml
```
### 6. Service Playbook Pattern
**Location**: `{service}/deploy.yml`
**Standard Structure**:
```yaml
---
- name: Deploy Service Name
hosts: target_group
tasks:
# Service detection (if using services list)
- name: Check if host has service_name service
ansible.builtin.set_fact:
has_service: "{{ 'service_name' in services | default([]) }}"
- name: Skip hosts without service
ansible.builtin.meta: end_host
when: not has_service
# Actual deployment tasks
- name: Create service user
become: true
ansible.builtin.user:
name: "{{service_user}}"
group: "{{service_group}}"
system: true
- name: Template configuration
become: true
ansible.builtin.template:
src: config.j2
dest: "{{service_directory}}/config.yml"
notify: restart service
# Handlers
handlers:
- name: restart service
become: true
ansible.builtin.systemd:
name: service_name
state: restarted
daemon_reload: true
```
**IMPORTANT: Template Path Convention**
- When playbooks are inside service directories, template `src:` paths are relative to that directory
- Use `src: config.j2` NOT `src: service_name/config.j2`
- The service directory prefix was correct when playbooks were at the ansible root, but is wrong now
**Host-Specific Templates**
Some services need different configuration per host. Store these in subdirectories named by hostname:
```
service_name/
├── deploy.yml
├── config.j2 # Default template
├── hostname1/ # Host-specific overrides
│ └── config.j2
├── hostname2/
│ └── config.j2
└── hostname3/
└── config.j2
```
Use conditional logic to select the correct template:
```yaml
- name: Check for host-specific configuration
ansible.builtin.stat:
path: "{{playbook_dir}}/{{inventory_hostname_short}}/config.j2"
delegate_to: localhost
register: host_specific_config
become: false
- name: Template host-specific configuration
become: true
ansible.builtin.template:
src: "{{playbook_dir}}/{{inventory_hostname_short}}/config.j2"
dest: "{{service_directory}}/config"
when: host_specific_config.stat.exists
- name: Template default configuration
become: true
ansible.builtin.template:
src: config.j2
dest: "{{service_directory}}/config"
when: not host_specific_config.stat.exists
```
**Real Example: Alloy Service**
```
alloy/
├── deploy.yml
├── config.alloy.j2 # Default configuration
├── ariel/ # Neo4j monitoring
│ └── config.alloy.j2
├── miranda/ # Docker monitoring
│ └── config.alloy.j2
├── oberon/ # Web services monitoring
│ └── config.alloy.j2
└── puck/ # Application monitoring
└── config.alloy.j2
```
## Service Detection Pattern
**Purpose**: Allow hosts to selectively run service playbooks
**How it works**:
1. Each host defines a `services:` list in `host_vars/`
2. Each playbook checks if its service is in the list
3. Playbook skips host if service not needed
**Example**:
`inventory/host_vars/server1.yml`:
```yaml
services:
- docker
- nginx
- redis
```
`nginx/deploy.yml`:
```yaml
- name: Deploy Nginx
hosts: ubuntu
tasks:
- name: Check if host has nginx service
ansible.builtin.set_fact:
has_nginx: "{{ 'nginx' in services | default([]) }}"
- name: Skip hosts without nginx
ansible.builtin.meta: end_host
when: not has_nginx
# Rest of tasks only run if nginx in services list
```
## Ansible Vault Integration
**Setup**:
```bash
# Create vault password file (one-time)
echo "your_vault_password" > .vault_pass
chmod 600 .vault_pass
# Configure ansible.cfg
echo "vault_password_file = .vault_pass" >> ansible.cfg
```
**Usage**:
```bash
# Edit vault file
ansible-vault edit inventory/group_vars/all/vault.yml
# View vault file
ansible-vault view inventory/group_vars/all/vault.yml
# Encrypt new file
ansible-vault encrypt secrets.yml
```
**Variable naming convention**:
- Prefix vault variables with `vault_`
- Reference in regular vars: `db_password: "{{vault_db_password}}"`
## Running Playbooks
**Full deployment**:
```bash
ansible-playbook site.yml
```
**Single service**:
```bash
ansible-playbook nginx/deploy.yml
```
**Specific hosts**:
```bash
ansible-playbook nginx/deploy.yml --limit server1.example.com
```
**Check mode (dry-run)**:
```bash
ansible-playbook site.yml --check
```
**With extra verbosity**:
```bash
ansible-playbook nginx/deploy.yml -vv
```
## Benefits of This Structure
### 1. Cleaner Root Directory
- **Before**: 29+ playbook files cluttering root
- **After**: 3-4 utility playbooks + site.yml
### 2. Simplified Inventory
- **Before**: 361 lines with inline variables
- **After**: 34 lines of pure structure
- Variables organized logically by host/group
### 3. Service Cohesion
- Everything related to a service in one place
- Easy to find templates when editing playbooks
- Natural grouping for git operations
### 4. Scalability
- Easy to add new services (create directory, add playbook)
- Easy to add new hosts (create host_vars file)
- No risk of playbook name conflicts
### 5. Reusability
- Service directories can be copied to other projects
- host_vars pattern works for any inventory size
- Clear separation of concerns
### 6. Maintainability
- Changes isolated to service directories
- Inventory file rarely needs editing
- Clear audit trail in git (changes per service)
## Migration Checklist
Moving an existing Ansible project to this structure:
- [ ] Create service directories for each playbook
- [ ] Move `{service}_deploy.yml` → `{service}/deploy.yml`
- [ ] Move templates into service directories
- [ ] Extract host variables from inventory to `host_vars/`
- [ ] Extract group variables to `group_vars/all/vars.yml`
- [ ] Move secrets to `group_vars/all/vault.yml` (encrypted)
- [ ] Update `site.yml` import_playbook paths
- [ ] Backup original inventory: `cp hosts hosts.backup`
- [ ] Create simplified inventory with only group/host structure
- [ ] Test with `ansible-playbook site.yml --check`
- [ ] Verify with limited deployment: `--limit test_host`
## Example: Adding a New Service
**1. Create service directory**:
```bash
mkdir ansible/myapp
```
**2. Create deployment playbook** (`ansible/myapp/deploy.yml`):
```yaml
---
- name: Deploy MyApp
hosts: ubuntu
tasks:
- name: Check if host has myapp service
ansible.builtin.set_fact:
has_myapp: "{{ 'myapp' in services | default([]) }}"
- name: Skip hosts without myapp
ansible.builtin.meta: end_host
when: not has_myapp
- name: Deploy myapp
# ... deployment tasks
```
**3. Create template** (`ansible/myapp/config.yml.j2`):
```yaml
app_name: MyApp
port: {{myapp_port}}
database: {{myapp_db_host}}
```
**4. Add variables to host** (`inventory/host_vars/server1.yml`):
```yaml
services:
- myapp # Add to services list
# MyApp configuration
myapp_port: 8080
myapp_db_host: db.example.com
```
**5. Add to site.yml**:
```yaml
- name: Deploy MyApp
import_playbook: myapp/deploy.yml
```
**6. Deploy**:
```bash
ansible-playbook myapp/deploy.yml
```
## Best Practices
### Naming Conventions
- Service directories: lowercase, underscores (e.g., `mcp_switchboard/`)
- Playbooks: `deploy.yml`, `stage.yml`, `remove.yml`
- Templates: descriptive name + `.j2` extension
- Variables: service prefix (e.g., `nginx_port`, `redis_password`)
- Vault variables: `vault_` prefix
### File Organization
- Keep playbooks under 100 lines (split into task files if larger)
- Group related templates in service directory
- Use comments to document non-obvious variables
- Add README.md to complex service directories
### Variable Organization
- Host-specific: `host_vars/{hostname}.yml`
- Service-specific across hosts: `group_vars/{service_group}/vars.yml`
- Global configuration: `group_vars/all/vars.yml`
- Secrets: `group_vars/all/vault.yml` (encrypted)
### Idempotency
- Use `creates:` parameter for one-time operations
- Use `state:` explicitly (present/absent/restarted)
- Check conditions before destructive operations
- Test with `--check` mode before applying
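A hedged sketch of these rules as tasks (the module names are standard Ansible builtins; the service, paths, and command are hypothetical):

```yaml
# Illustrative tasks applying the idempotency rules above
- name: Initialize data directory only once
  become: true
  ansible.builtin.command: /opt/myapp/bin/init-db /var/lib/myapp
  args:
    creates: /var/lib/myapp/schema.lock   # task is skipped when this file already exists

- name: Ensure service is enabled and running (explicit state)
  become: true
  ansible.builtin.systemd:
    name: myapp
    state: started
    enabled: true

- name: Remove legacy config only if present
  become: true
  ansible.builtin.file:
    path: /etc/myapp/old.conf
    state: absent                          # explicit and safe to re-run
```

Running with `--check` first confirms that a second apply reports no changes.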
### Documentation
- Comment complex task logic
- Document required variables in playbook header
- Add README.md for service directories with many files
- Keep docs/ separate from ansible/ directory
## Related Documentation
- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/tips_tricks/ansible_tips_tricks.html)
- [Ansible Vault Guide](https://docs.ansible.com/ansible/latest/vault_guide/index.html)
- [Inventory Organization](https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html)
## Secret Management Patterns
### Ansible Vault (Sandbox Environment)
**Purpose**: Store sensitive values encrypted at rest in version control
**File Location**: `inventory/group_vars/all/vault.yml`
**Variable Naming Convention**: Prefix all vault variables with `vault_`
**Example vault.yml**:
Note that the entire vault file is encrypted at rest; the listing below shows its decrypted structure, with values omitted.
```yaml
---
# Database passwords
vault_postgres_admin_password: # Avoid special characters & non-ASCII
vault_casdoor_db_password:
# S3 credentials
vault_casdoor_s3_access_key:
vault_casdoor_s3_secret_key:
vault_casdoor_s3_bucket:
```
**Host Variables Reference Vault**:
```yaml
# In host_vars/oberon.incus.yml
casdoor_db_password: "{{ vault_casdoor_db_password }}"
casdoor_s3_access_key: "{{ vault_casdoor_s3_access_key }}"
casdoor_s3_secret_key: "{{ vault_casdoor_s3_secret_key }}"
casdoor_s3_bucket: "{{ vault_casdoor_s3_bucket }}"
# Non-sensitive values stay as plain variables
casdoor_s3_endpoint: "https://ariel.incus:9000"
casdoor_s3_region: "us-east-1"
```
**Prerequisites**:
- Set `ANSIBLE_VAULT_PASSWORD_FILE` environment variable
- Create `.vault_pass` file with vault password
- Add `.vault_pass` to `.gitignore`
**Encrypting New Values**:
```bash
# Encrypt a string and add to vault.yml
echo -n "secret_value" | ansible-vault encrypt_string --stdin-name 'vault_variable_name'
# Edit vault file directly
ansible-vault edit inventory/group_vars/all/vault.yml
```
### OCI Vault (Production Environment)
**Purpose**: Use Oracle Cloud Infrastructure Vault for centralized secret management
**Variable Pattern**: Use Ansible lookups to fetch secrets at runtime
**Example host_vars for OCI**:
```yaml
# In host_vars/production-server.yml
# Database passwords from OCI Vault
casdoor_db_password: "{{ lookup('community.oci.oci_secret', 'casdoor-db-password', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"
# S3 credentials from OCI Vault
casdoor_s3_access_key: "{{ lookup('community.oci.oci_secret', 'casdoor-s3-access-key', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"
casdoor_s3_secret_key: "{{ lookup('community.oci.oci_secret', 'casdoor-s3-secret-key', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"
casdoor_s3_bucket: "{{ lookup('community.oci.oci_secret', 'casdoor-s3-bucket', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"
# Non-sensitive values remain as plain variables
casdoor_s3_endpoint: "https://objectstorage.us-phoenix-1.oraclecloud.com"
casdoor_s3_region: "us-phoenix-1"
```
**OCI Vault Organization**:
```
OCI Compartment: production
├── Vault: agathos-databases
│ ├── Secret: postgres-admin-password
│ └── Secret: casdoor-db-password
├── Vault: agathos-services
│ ├── Secret: casdoor-s3-access-key
│ ├── Secret: casdoor-s3-secret-key
│ ├── Secret: casdoor-s3-bucket
│ └── Secret: openwebui-db-password
└── Vault: agathos-integrations
├── Secret: apikey-openai
└── Secret: apikey-anthropic
```
**Secret Naming Convention**:
- Ansible Vault: `vault_service_secret` (underscores)
- OCI Vault: `service-secret` (hyphens)
**Benefits of Two-Tier Pattern**:
1. **Portability**: Service playbooks remain unchanged across environments
2. **Flexibility**: Switch secret backends by changing only host_vars
3. **Clarity**: Variable names clearly indicate their purpose
4. **Security**: Secrets never appear in playbooks or templates
### S3 Bucket Provisioning with Ansible
**Purpose**: Provision Incus S3 buckets and manage credentials in Ansible Vault
**Playbooks**:
- `provision_s3.yml` - Create bucket and store credentials
- `regenerate_s3_key.yml` - Rotate credentials
- `remove_s3.yml` - Delete bucket and clean vault
**Usage**:
```bash
# Provision new S3 bucket for a service
ansible-playbook provision_s3.yml -e bucket_name=casdoor -e service_name=casdoor
# Regenerate access credentials (invalidates old keys)
ansible-playbook regenerate_s3_key.yml -e bucket_name=casdoor -e service_name=casdoor
# Remove bucket and credentials
ansible-playbook remove_s3.yml -e bucket_name=casdoor -e service_name=casdoor
```
**Requirements**:
- User must be member of `incus` group
- `ANSIBLE_VAULT_PASSWORD_FILE` must be set
- Incus CLI must be configured and accessible
**What Gets Created**:
1. Incus storage bucket in project `agathos`, pool `default`
2. Admin access key for the bucket
3. Encrypted vault entries: `vault_<service>_s3_access_key`, `vault_<service>_s3_secret_key`, `vault_<service>_s3_bucket`
**Behind the Scenes**:
- Role: `incus_storage_bucket`
- Idempotent: Checks if bucket/key exists before creating
- Atomic: Credentials captured and encrypted in single operation
- Variables sourced from: `inventory/group_vars/all/vars.yml`
## Troubleshooting
### Template Not Found Errors
**Symptom**: `Could not find or access 'service_name/template.j2'`
**Cause**: When playbooks were moved from ansible root into service directories, template paths weren't updated.
**Solution**: Remove the service directory prefix from template paths:
```yaml
# WRONG (old path from when playbook was at root)
src: service_name/config.j2
# CORRECT (playbook is now in service_name/ directory)
src: config.j2
```
### Host-Specific Template Path Issues
**Symptom**: Playbook fails to find host-specific templates
**Cause**: Host-specific directories are at the wrong level
**Expected Structure**:
```
service_name/
├── deploy.yml
├── config.j2 # Default
└── hostname/ # Host-specific (inside service dir)
└── config.j2
```
**Use `{{playbook_dir}}` for relative paths**:
```yaml
# This finds templates relative to the playbook location
src: "{{playbook_dir}}/{{inventory_hostname_short}}/config.j2"
```
---
**Last Updated**: December 2025
**Project**: Agathos Infrastructure
**Approval**: Red Panda Approved™

docs/anythingllm.md (new file, 334 lines)

@@ -0,0 +1,334 @@
# AnythingLLM
## Overview
AnythingLLM is a full-stack application that provides a unified interface for interacting with Large Language Models (LLMs). It supports multi-provider LLM access, document intelligence (RAG with pgvector), AI agents with tools, and Model Context Protocol (MCP) extensions.
**Host:** Rosalind
**Role:** go_nodejs_php_apps
**Port:** 22084 (internal), accessible via `anythingllm.ouranos.helu.ca` (HAProxy)
## Architecture
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Client │────▶│ HAProxy │────▶│ AnythingLLM │
│ (Browser/API) │ │ (Titania) │ │ (Rosalind) │
└─────────────────┘ └─────────────────┘ └────────┬────────┘
┌────────────────────────────────┼────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ PostgreSQL │ │ LLM Backend │ │ TTS Service │
│ + pgvector │ │ (pan.helu.ca) │ │ (FastKokoro) │
│ (Portia) │ │ llama-cpp │ │ pan.helu.ca │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
### Directory Structure
AnythingLLM uses a native Node.js deployment with the following directory layout:
```
/srv/anythingllm/
├── app/ # Cloned git repository
│ ├── server/ # Backend API server
│ │ ├── .env # Environment configuration
│ │ └── node_modules/
│ ├── collector/ # Document processing service
│ │ ├── hotdir -> ../../hotdir # SYMLINK (critical!)
│ │ └── node_modules/
│ └── frontend/ # React frontend (built into server)
├── storage/ # Persistent data
│ ├── documents/ # Processed documents
│ ├── vector-cache/ # Embedding cache
│ └── plugins/ # MCP server configs
└── hotdir/ # Upload staging directory (actual location)
/srv/collector/
└── hotdir -> /srv/anythingllm/hotdir # SYMLINK (critical!)
```
### Hotdir Path Resolution (Critical)
The server and collector use **different path resolution** for the upload directory:
| Component | Code Location | Resolves To |
|-----------|--------------|-------------|
| **Server** (multer) | `STORAGE_DIR/../../collector/hotdir` | `/srv/collector/hotdir` |
| **Collector** | `__dirname/../hotdir` | `/srv/anythingllm/app/collector/hotdir` |
Both paths must point to the same physical directory. This is achieved with **two symlinks**:
1. `/srv/collector/hotdir` → `/srv/anythingllm/hotdir`
2. `/srv/anythingllm/app/collector/hotdir` → `/srv/anythingllm/hotdir`
⚠️ **Important**: The collector ships with an empty `hotdir/` directory. The Ansible deploy must **remove** this directory before creating the symlink, or file uploads will fail with "File does not exist in upload directory."
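The deploy-time fix can be sketched as the following tasks, assuming the paths in the table above (task names are illustrative, and `/srv/collector/` is assumed to already exist):

```yaml
# Sketch: replace the shipped empty hotdir with the two required symlinks.
# The shipped directory must be removed first, or the symlink cannot be created.
- name: Remove the empty hotdir shipped with the collector
  become: true
  ansible.builtin.file:
    path: /srv/anythingllm/app/collector/hotdir
    state: absent

- name: Link collector hotdir to the shared upload directory
  become: true
  ansible.builtin.file:
    src: /srv/anythingllm/hotdir
    dest: /srv/anythingllm/app/collector/hotdir
    state: link

- name: Link server-side hotdir path to the shared upload directory
  become: true
  ansible.builtin.file:
    src: /srv/anythingllm/hotdir
    dest: /srv/collector/hotdir
    state: link
```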
### Key Integrations
| Component | Host | Purpose |
|-----------|------|---------|
| PostgreSQL + pgvector | Portia | Vector database for RAG embeddings |
| LLM Provider | pan.helu.ca:22071 | Generic OpenAI-compatible llama-cpp |
| TTS Service | pan.helu.ca:22070 | FastKokoro text-to-speech |
| HAProxy | Titania | TLS termination and routing |
| Loki | Prospero | Log aggregation |
## Terraform Resources
### Host Definition
AnythingLLM runs on **Rosalind**, which is already defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | go_nodejs_php_apps |
| Security Nesting | true |
| AppArmor | unconfined |
| Port Range | 22080-22099 |
No Terraform changes required—AnythingLLM uses port 22084 within Rosalind's existing range.
## Ansible Deployment
### Playbook
```bash
cd ansible
source ~/env/agathos/bin/activate
# Deploy PostgreSQL database first (if not already done)
ansible-playbook postgresql/deploy.yml
# Deploy AnythingLLM
ansible-playbook anythingllm/deploy.yml
# Redeploy HAProxy to pick up new backend
ansible-playbook haproxy/deploy.yml
# Redeploy Alloy to pick up new log source
ansible-playbook alloy/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `anythingllm/deploy.yml` | Main deployment playbook |
| `anythingllm/anythingllm-server.service.j2` | Systemd service for server |
| `anythingllm/anythingllm-collector.service.j2` | Systemd service for collector |
| `anythingllm/env.j2` | Environment variables template |
### Variables
#### Host Variables (`host_vars/rosalind.incus.yml`)
| Variable | Description | Default |
|----------|-------------|---------|
| `anythingllm_user` | Service account user | `anythingllm` |
| `anythingllm_group` | Service account group | `anythingllm` |
| `anythingllm_directory` | Installation directory | `/srv/anythingllm` |
| `anythingllm_port` | Service port | `22084` |
| `anythingllm_db_host` | PostgreSQL host | `portia.incus` |
| `anythingllm_db_port` | PostgreSQL port | `5432` |
| `anythingllm_db_name` | Database name | `anythingllm` |
| `anythingllm_db_user` | Database user | `anythingllm` |
| `anythingllm_llm_base_url` | LLM API endpoint | `http://pan.helu.ca:22071/v1` |
| `anythingllm_llm_model` | Default LLM model | `llama-3-8b` |
| `anythingllm_embedding_engine` | Embedding engine | `native` |
| `anythingllm_tts_provider` | TTS provider | `openai` |
| `anythingllm_tts_endpoint` | TTS API endpoint | `http://pan.helu.ca:22070/v1` |
#### Vault Variables (`group_vars/all/vault.yml`)
| Variable | Description |
|----------|-------------|
| `vault_anythingllm_db_password` | PostgreSQL password |
| `vault_anythingllm_jwt_secret` | JWT signing secret (32+ chars) |
| `vault_anythingllm_sig_key` | Signature key (32+ chars) |
| `vault_anythingllm_sig_salt` | Signature salt (32+ chars) |
Generate secrets with:
```bash
openssl rand -hex 32
```
## Configuration
### Environment Variables
| Variable | Description | Source |
|----------|-------------|--------|
| `JWT_SECRET` | JWT signing secret | `vault_anythingllm_jwt_secret` |
| `SIG_KEY` | Signature key | `vault_anythingllm_sig_key` |
| `SIG_SALT` | Signature salt | `vault_anythingllm_sig_salt` |
| `VECTOR_DB` | Vector database type | `pgvector` |
| `PGVECTOR_CONNECTION_STRING` | PostgreSQL connection | Composed from host_vars |
| `LLM_PROVIDER` | LLM provider type | `generic-openai` |
| `EMBEDDING_ENGINE` | Embedding engine | `native` |
| `TTS_PROVIDER` | TTS provider | `openai` |
### External Access
AnythingLLM is accessible via HAProxy on Titania:
| URL | Backend |
|-----|---------|
| `https://anythingllm.ouranos.helu.ca` | `rosalind.incus:22084` |
The HAProxy backend is configured in `host_vars/titania.incus.yml`.
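A hedged sketch of what that backend entry might look like in `host_vars/titania.incus.yml`; the key names depend on this repo's HAProxy templates and are assumptions, while the domain, host, and port come from the table above:

```yaml
# Hypothetical shape of the AnythingLLM backend entry for HAProxy
haproxy_backends:
  - name: anythingllm
    domain: anythingllm.ouranos.helu.ca   # TLS-terminated frontend match
    server: rosalind.incus                # backend host
    port: 22084                           # AnythingLLM service port
```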
## Monitoring
### Loki Logs
| Log Source | Labels |
|------------|--------|
| Server logs | `{unit="anythingllm-server.service"}` |
| Collector logs | `{unit="anythingllm-collector.service"}` |
Logs are collected via systemd journal → Alloy on Rosalind → Loki on Prospero.
**Grafana Query:**
```logql
{unit=~"anythingllm.*"}
```
### Health Check
```bash
# From any sandbox host
curl http://rosalind.incus:22084/api/ping
# Via HAProxy (external)
curl -k https://anythingllm.ouranos.helu.ca/api/ping
```
## Operations
### Start/Stop
```bash
# SSH to Rosalind
ssh rosalind.incus
# Manage via systemd
sudo systemctl start anythingllm-server # Start server
sudo systemctl start anythingllm-collector # Start collector
sudo systemctl stop anythingllm-server # Stop server
sudo systemctl stop anythingllm-collector # Stop collector
sudo systemctl restart anythingllm-server # Restart server
sudo systemctl restart anythingllm-collector # Restart collector
```
### Logs
```bash
# Real-time server logs
journalctl -u anythingllm-server -f
# Real-time collector logs
journalctl -u anythingllm-collector -f
# Grafana (historical)
# Query: {unit=~"anythingllm.*"}
```
### Upgrade
Pull latest code and redeploy:
```bash
ansible-playbook anythingllm/deploy.yml
```
## Vault Setup
Add the following secrets to `ansible/inventory/group_vars/all/vault.yml`:
```bash
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
```
```yaml
# AnythingLLM Secrets
vault_anythingllm_db_password: "your-secure-password"
vault_anythingllm_jwt_secret: "your-32-char-jwt-secret"
vault_anythingllm_sig_key: "your-32-char-signature-key"
vault_anythingllm_sig_salt: "your-32-char-signature-salt"
```
## Follow-On Tasks
### MCP Server Integration
AnythingLLM supports the Model Context Protocol (MCP) for extending AI agent capabilities. Planned integrations with existing MCP servers:
| MCP Server | Host | Tools |
|------------|------|-------|
| MCPO | Miranda | Docker management |
| Neo4j MCP | Miranda | Graph database queries |
| GitHub MCP | (external) | Repository operations |
Configure MCP connections via AnythingLLM Admin UI after initial deployment.
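
Once wired up, such a connection would take roughly this shape in the `anythingllm_mcp_servers.json` file documented in `docs/anythingllm_mcp.md`. The port and path below are hypothetical placeholders, not the actual Miranda endpoints:

```json
{
  "mcpServers": {
    "neo4j": {
      "url": "http://miranda.incus:8000/mcp",
      "type": "streamable"
    }
  }
}
```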
### Casdoor SSO
For single sign-on integration, configure AnythingLLM to authenticate via Casdoor OAuth2. This requires:
1. Creating an application in Casdoor admin
2. Configuring OAuth2 environment variables in AnythingLLM
3. Optionally using OAuth2-Proxy for transparent authentication
## Troubleshooting
### File Upload Fails with "File does not exist in upload directory"
**Symptom:** Uploading files via the UI returns 500 Internal Server Error with message "File does not exist in upload directory."
**Cause:** The server uploads files to `/srv/collector/hotdir`, but the collector looks for them in `/srv/anythingllm/app/collector/hotdir`. If these aren't the same physical directory, uploads fail.
**Solution:** Verify symlinks are correctly configured:
```bash
# Check symlinks
ls -la /srv/collector/hotdir
# Should show: /srv/collector/hotdir -> /srv/anythingllm/hotdir
ls -la /srv/anythingllm/app/collector/hotdir
# Should show: /srv/anythingllm/app/collector/hotdir -> /srv/anythingllm/hotdir
# If collector/hotdir is a directory (not symlink), fix it:
sudo rm -rf /srv/anythingllm/app/collector/hotdir
sudo ln -s /srv/anythingllm/hotdir /srv/anythingllm/app/collector/hotdir
sudo chown -h anythingllm:anythingllm /srv/anythingllm/app/collector/hotdir
sudo systemctl restart anythingllm-collector
```
### Container Won't Start
Check Docker logs:
```bash
sudo docker logs anythingllm
```
Verify PostgreSQL connectivity:
```bash
psql -h portia.incus -U anythingllm -d anythingllm
```
### Database Connection Issues
Ensure pgvector extension is enabled:
```bash
psql -h portia.incus -U postgres -d anythingllm -c "SELECT * FROM pg_extension WHERE extname = 'vector';"
```
### LLM Provider Issues
Test LLM endpoint directly:
```bash
curl http://pan.helu.ca:22071/v1/models
```

---

`docs/anythingllm_mcp.md`
# AnythingLLM MCP Server Configuration
## Overview
AnythingLLM supports [Model Context Protocol (MCP)](https://modelcontextprotocol.io) servers, allowing AI agents to call tools provided by local processes or remote services. MCP servers are managed by the internal `MCPHypervisor` singleton and configured via a single JSON file.
## Configuration File Location
| Environment | Path |
|-------------|------|
| Development | `server/storage/plugins/anythingllm_mcp_servers.json` |
| Production / Docker | `$STORAGE_DIR/plugins/anythingllm_mcp_servers.json` |
The file and its parent directory are created automatically with an empty `{ "mcpServers": {} }` object if they do not already exist.
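
If you prefer to pre-create the file by hand, a minimal sketch (`./storage` is an example path standing in for the real `$STORAGE_DIR`):

```bash
# Pre-create the MCP config by hand (AnythingLLM normally does this
# automatically on startup). ./storage is an example path standing in
# for the real $STORAGE_DIR.
STORAGE_DIR="./storage"
mkdir -p "$STORAGE_DIR/plugins"
cat > "$STORAGE_DIR/plugins/anythingllm_mcp_servers.json" <<'EOF'
{ "mcpServers": {} }
EOF
```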
## File Format
```json
{
"mcpServers": {
"<server-name>": { ... },
"<server-name-2>": { ... }
}
}
```
Each key inside `mcpServers` is the unique name used to identify the server within AnythingLLM. The value is the server definition, whose required fields depend on the transport type (see below).
---
## Transport Types
### `stdio` — Local Process
Spawns a local process and communicates over stdin/stdout. The transport type is inferred automatically when a `command` field is present.
```json
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/docs"],
"env": {
"SOME_VAR": "value"
}
}
}
}
```
| Field | Required | Description |
|-------|----------|-------------|
| `command` | ✅ | Executable to run (e.g. `npx`, `node`, `python3`) |
| `args` | ❌ | Array of arguments passed to the command |
| `env` | ❌ | Extra environment variables merged into the process environment |
> **Note:** The process inherits PATH and NODE_PATH from the shell environment that started AnythingLLM. If a command such as `npx` is not found, ensure it is available in that shell's PATH.
---
### `sse` — Server-Sent Events (legacy)
Connects to a remote MCP server using the legacy SSE transport. The type is inferred automatically when only a `url` field is present (no `command`), or when `"type": "sse"` is set explicitly.
```json
{
"mcpServers": {
"my-sse-server": {
"url": "https://example.com/mcp",
"type": "sse",
"headers": {
"Authorization": "Bearer <token>"
}
}
}
}
```
---
### `streamable` / `http` — Streamable HTTP
Connects to a remote MCP server using the newer Streamable HTTP transport.
```json
{
"mcpServers": {
"my-http-server": {
"url": "https://example.com/mcp",
"type": "streamable",
"headers": {
"Authorization": "Bearer <token>"
}
}
}
}
```
Both `"type": "streamable"` and `"type": "http"` select this transport.
| Field | Required | Description |
|-------|----------|-------------|
| `url` | ✅ | Full URL of the MCP endpoint |
| `type` | ✅ | `"sse"`, `"streamable"`, or `"http"` |
| `headers` | ❌ | HTTP headers sent with every request (useful for auth) |
---
## AnythingLLM-Specific Options
An optional `anythingllm` block inside any server definition can control AnythingLLM-specific behaviour:
```json
{
"mcpServers": {
"my-server": {
"command": "npx",
"args": ["-y", "some-mcp-package"],
"anythingllm": {
"autoStart": false
}
}
}
}
```
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `autoStart` | boolean | `true` | When `false`, the server is skipped at startup and must be started manually from the Admin UI |
---
## Full Example
```json
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/documents"]
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": {
"GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_xxxxxxxxxxxx"
}
},
"remote-tools": {
"url": "https://mcp.example.com/mcp",
"type": "streamable",
"headers": {
"Authorization": "Bearer my-secret-token"
}
},
"optional-server": {
"command": "node",
"args": ["/opt/mcp/server.js"],
"anythingllm": {
"autoStart": false
}
}
}
}
```
---
## Managing Servers via the Admin UI
MCP servers can be managed without editing the JSON file directly:
1. Log in as an Admin.
2. Go to **Admin → Agents → MCP Servers**.
3. From this page you can:
- View all configured servers and the tools each one exposes.
- Start or stop individual servers.
- Delete a server (removes it from the JSON file).
- Force-reload all servers (stops all, re-reads the file, restarts them).
Any changes made through the UI are persisted back to `anythingllm_mcp_servers.json`.
---
## How Servers Are Started
- At startup, `MCPHypervisor` reads the config file and starts all servers whose `anythingllm.autoStart` is not `false`.
- Each server has a **30-second connection timeout**. If a server fails to connect within that window it is marked as failed and its process is cleaned up.
- Servers are exposed to agents via the `@agent` directive using the naming convention `@@mcp_<server-name>`.
---
## Troubleshooting
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| `ENOENT` / command not found | The executable is not in PATH | Use the full absolute path for `command`, or ensure the binary is accessible in the shell that starts AnythingLLM |
| Connection timeout after 30 s | Server process started but did not respond | Check the server's own logs; verify arguments are correct |
| Tools not visible to agent | Server failed to start | Check the status badge in **Admin → Agents → MCP Servers** for the error message |
| Auth / 401 errors on remote servers | Missing or incorrect credentials | Verify `headers` or `env` values in the config |
---
## Further Reading
- [AnythingLLM MCP Compatibility Docs](https://docs.anythingllm.com/mcp-compatibility/overview)
- [Model Context Protocol Specification](https://modelcontextprotocol.io)

---
# AnythingLLM: Your AI-Powered Knowledge Hub
## 🎯 What is AnythingLLM?
AnythingLLM is a **full-stack application** that transforms how you interact with Large Language Models (LLMs). Think of it as your personal AI assistant platform that can:
- 💬 Chat with multiple LLM providers
- 📚 Query your own documents and data (RAG - Retrieval Augmented Generation)
- 🤖 Run autonomous AI agents with tools
- 🔌 Extend capabilities via Model Context Protocol (MCP)
- 👥 Support multiple users and workspaces
- 🎨 Provide a beautiful, intuitive web interface
**In simple terms:** It's like ChatGPT, but you control everything - the data, the models, the privacy, and the capabilities.
---
## 🌟 Key Capabilities
### 1. **Multi-Provider LLM Support**
AnythingLLM isn't locked to a single AI provider. It supports **30+ LLM providers**:
#### Your Environment:
```
┌─────────────────────────────────────────┐
│ Your LLM Infrastructure │
├─────────────────────────────────────────┤
│ ✅ Llama CPP Router (pan.helu.ca) │
│ - Load-balanced inference │
│ - High availability │
│ │
│ ✅ Direct Llama CPP (nyx.helu.ca) │
│ - Direct connection option │
│ - Lower latency │
│ │
│ ✅ LLM Proxy - Arke (circe.helu.ca) │
│ - Unified API gateway │
│ - Request routing │
│ │
│ ✅ AWS Bedrock (optional) │
│ - Claude, Titan models │
│ - Enterprise-grade │
└─────────────────────────────────────────┘
```
**What this means:**
- Switch between providers without changing your application
- Use different models for different workspaces
- Fallback to alternative providers if one fails
- Compare model performance side-by-side
### 2. **Document Intelligence (RAG)**
AnythingLLM can ingest and understand your documents:
**Supported Formats:**
- 📄 PDF, DOCX, TXT, MD
- 🌐 Websites (scraping)
- 📊 CSV, JSON
- 🎥 YouTube transcripts
- 🔗 GitHub repositories
- 📝 Confluence, Notion exports
**How it works:**
```
Your Document → Text Extraction → Chunking → Embeddings → Vector DB (PostgreSQL)
User Question → Embedding → Similarity Search → Relevant Chunks → LLM → Answer
```
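
The retrieval step above can be sketched in a few lines. This is a toy illustration, not AnythingLLM's actual code: the bag-of-words "embedding" is a stand-in for a real embedding model, and the chunks are made-up examples.

```python
# Toy sketch of the RAG retrieval step: embed the question, find the
# most similar stored chunk, and hand it to the LLM as context.
import math
from collections import Counter

def embed(text):
    # Stand-in "embedding": bag-of-words term counts. Real deployments
    # use a model such as AnythingLLM's native embedder.
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "The refund policy allows returns within 30 days of purchase.",
    "Support hours are weekdays from 9am to 5pm.",
]
question = "What is the refund policy?"

# Similarity search over the stored chunk vectors.
best = max(chunks, key=lambda c: cosine(embed(question), embed(c)))
print(best)  # → The refund policy allows returns within 30 days of purchase.
```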
**Example Use Case:**
```
You: "What's our refund policy?"
AnythingLLM: [Searches your policy documents]
"According to your Terms of Service (page 12),
refunds are available within 30 days..."
```
### 3. **AI Agents with Tools** 🤖
This is where AnythingLLM becomes **truly powerful**. Agents can:
#### Built-in Agent Tools:
- 🌐 **Web Browsing** - Navigate websites, fill forms, take screenshots
- 🔍 **Web Scraping** - Extract data from web pages
- 📊 **SQL Agent** - Query databases (PostgreSQL, MySQL, MSSQL)
- 📈 **Chart Generation** - Create visualizations
- 💾 **File Operations** - Save and manage files
- 📝 **Document Summarization** - Condense long documents
- 🧠 **Memory** - Remember context across conversations
#### Agent Workflow Example:
```
User: "Check our database for users who signed up last week
and send them a welcome email"
Agent:
1. Uses SQL Agent to query PostgreSQL
2. Retrieves user list
3. Generates personalized email content
4. (With email MCP) Sends emails
5. Reports back with results
```
### 4. **Model Context Protocol (MCP)** 🔌
MCP is AnythingLLM's **superpower** - it allows you to extend the AI with custom tools and data sources.
#### What is MCP?
MCP is a **standardized protocol** for connecting AI systems to external tools and data. Think of it as "plugins for AI."
#### Your MCP Possibilities:
**Example 1: Docker Management**
```javascript
// MCP Server: docker-mcp
Tools Available:
- list_containers()
- start_container(name)
- stop_container(name)
- view_logs(container)
- exec_command(container, command)
User: "Show me all running containers and restart the one using most memory"
Agent: [Uses docker-mcp tools to check, analyze, and restart]
```
**Example 2: GitHub Integration**
```javascript
// MCP Server: github-mcp
Tools Available:
- create_issue(repo, title, body)
- search_code(query)
- create_pr(repo, branch, title)
- list_repos()
User: "Create a GitHub issue for the bug I just described"
Agent: [Uses github-mcp to create issue with details]
```
**Example 3: Custom Business Tools**
```javascript
// Your Custom MCP Server
Tools Available:
- query_crm(customer_id)
- check_inventory(product_sku)
- create_order(customer, items)
- send_notification(user, message)
User: "Check if we have product XYZ in stock and notify me if it's low"
Agent: [Uses your custom MCP tools]
```
#### MCP Architecture in AnythingLLM:
```
┌─────────────────────────────────────────────────────────┐
│ AnythingLLM │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Agent System │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Built-in │ │ MCP │ │ Custom │ │ │
│ │ │ Tools │ │ Tools │ │ Flows │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ MCP Hypervisor │ │
│ │ - Manages MCP server lifecycle │ │
│ │ - Handles stdio/http/sse transports │ │
│ │ - Auto-discovers tools │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ MCP Servers (Running Locally or Remote) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Docker │ │ GitHub │ │ Custom │ │
│ │ MCP │ │ MCP │ │ MCP │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
```
**Key Features:**
- ✅ **Hot-reload** - Add/remove MCP servers without restarting
- ✅ **Multiple transports** - stdio, HTTP, Server-Sent Events
- ✅ **Auto-discovery** - Tools automatically appear in the agent
- ✅ **Process management** - Automatic start/stop/restart
- ✅ **Error handling** - Graceful failures with logging
### 5. **Agent Flows** 🔄
Create **no-code agent workflows** for complex tasks:
```
┌─────────────────────────────────────────┐
│ Example Flow: "Daily Report Generator" │
├─────────────────────────────────────────┤
│ 1. Query database for yesterday's data │
│ 2. Generate summary statistics │
│ 3. Create visualization charts │
│ 4. Write report to document │
│ 5. Send via email (MCP) │
└─────────────────────────────────────────┘
```
Flows can be:
- Triggered manually
- Scheduled (via external cron)
- Called from other agents
- Shared across workspaces
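
For the scheduled case, an external crontab entry can invoke a flow through the REST API. A sketch only: the domain and API key are placeholders, and the endpoint mirrors the agent-invoke example shown later in this document:

```bash
# crontab entry (single line): run the daily report flow at 07:00.
0 7 * * * curl -s -X POST -H "Authorization: Bearer YOUR_API_KEY" -d '{"prompt": "Generate the daily report"}' https://your-domain.com/api/v1/agent/invoke
```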
---
## 🏗️ How AnythingLLM Fits Your Environment
### Your Complete Stack:
```
┌─────────────────────────────────────────────────────────────────┐
│ Internet │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ HAProxy (SSL Termination & Load Balancing) │
│ - HTTPS/WSS support │
│ - Security headers │
│ - Health checks │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ AnythingLLM Application │
│ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │
│ │ Web UI │ │ API Server │ │ Agent Engine │ │
│ │ - React │ │ - Express.js │ │ - AIbitat │ │
│ │ - WebSocket │ │ - REST API │ │ - MCP Support │ │
│ └─────────────────┘ └─────────────────┘ └────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Data Layer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ PostgreSQL 17 + pgvector │ │
│ │ - User data & workspaces │ │
│ │ - Chat history │ │
│ │ - Vector embeddings (for RAG) │ │
│ │ - Agent invocations │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ External LLM Services │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Llama Router │ │ Direct Llama │ │ LLM Proxy │ │
│ │ pan.helu.ca │ │ nyx.helu.ca │ │ circe.helu.ca│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TTS Service │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ FastKokoro (OpenAI-compatible TTS) │ │
│ │ pan.helu.ca:22070 │ │
│ │ - Text-to-speech generation │ │
│ │ - Multiple voices │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
### Observability Stack:
```
┌─────────────────────────────────────────────────────────────────┐
│ Monitoring & Logging │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Grafana (Unified Dashboard) │ │
│ │ - Metrics visualization │ │
│ │ - Log exploration │ │
│ │ - Alerting │ │
│ └────────────┬─────────────────────────────┬────────────────┘ │
│ ↓ ↓ │
│ ┌────────────────────────┐ ┌────────────────────────┐ │
│ │ Prometheus │ │ Loki │ │
│ │ - Metrics storage │ │ - Log aggregation │ │
│ │ - Alert rules │ │ - 31-day retention │ │
│ │ - 30-day retention │ │ - Query language │ │
│ └────────────────────────┘ └────────────────────────┘ │
│ ↑ ↑ │
│ ┌────────────┴─────────────────────────────┴────────────────┐ │
│ │ Data Collection │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ cAdvisor │ │ Postgres │ │ Alloy │ │ │
│ │ │ (Container) │ │ Exporter │ │ (Logs) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
---
## 🎨 Real-World Use Cases
### Use Case 1: **Internal Knowledge Base**
**Scenario:** Your team needs quick access to company documentation
**Setup:**
1. Upload all company docs to AnythingLLM workspace
2. Documents are embedded and stored in PostgreSQL
3. Team members ask questions naturally
**Example:**
```
Employee: "What's the process for requesting time off?"
AnythingLLM: [Searches HR documents]
"According to the Employee Handbook, you need to:
1. Submit request via HR portal
2. Get manager approval
3. Minimum 2 weeks notice for vacations..."
```
**Benefits:**
- ✅ No more searching through SharePoint
- ✅ Instant answers with source citations
- ✅ Always up-to-date (re-sync documents)
- ✅ Multi-user access with permissions
### Use Case 2: **DevOps Assistant**
**Scenario:** Manage infrastructure with natural language
**Setup:**
1. Install Docker MCP server
2. Install GitHub MCP server
3. Connect to your monitoring stack
**Example Conversation:**
```
You: "Show me all containers and their resource usage"
Agent: [Uses docker-mcp + Prometheus data]
"Here are your containers:
- anythingllm: 2.1GB RAM, 45% CPU
- postgres: 1.8GB RAM, 12% CPU
- prometheus: 1.2GB RAM, 8% CPU
anythingllm is using high CPU. Would you like me to investigate?"
You: "Yes, check the logs for errors"
Agent: [Uses docker-mcp to fetch logs]
"Found 15 errors in the last hour related to LLM timeouts.
Should I create a GitHub issue?"
You: "Yes, and restart the container"
Agent: [Creates GitHub issue, restarts container]
"Done! Issue #123 created and container restarted.
CPU usage now at 15%."
```
### Use Case 3: **Customer Support Automation**
**Scenario:** AI-powered support that can take action
**Setup:**
1. Upload product documentation
2. Connect CRM via custom MCP
3. Enable SQL agent for database queries
**Example:**
```
Support Agent: "Customer John Doe says his order #12345 hasn't arrived"
AnythingLLM: [Queries database via SQL agent]
"Order #12345 shipped on Jan 5th via FedEx.
Tracking shows it's delayed due to weather.
Would you like me to:
1. Send customer an update email
2. Offer expedited shipping on next order
3. Issue a partial refund"
Support Agent: "Send update email"
AnythingLLM: [Uses email MCP]
"Email sent to john@example.com with tracking info
and apology for delay."
```
### Use Case 4: **Data Analysis Assistant**
**Scenario:** Query your database with natural language
**Setup:**
1. Enable SQL Agent
2. Connect to PostgreSQL
3. Grant read-only access
**Example:**
```
You: "Show me user signups by month for the last 6 months"
Agent: [Generates and executes SQL]
SELECT
DATE_TRUNC('month', created_at) as month,
COUNT(*) as signups
FROM users
WHERE created_at >= NOW() - INTERVAL '6 months'
GROUP BY month
ORDER BY month;
Results:
- July 2025: 145 signups
- August 2025: 203 signups
- September 2025: 187 signups
...
You: "Create a chart of this"
Agent: [Uses chart generation tool]
[Displays bar chart visualization]
```
---
## 🔐 Security & Privacy
### Why Self-Hosted Matters:
**Your Data Stays Yours:**
- ✅ Documents never leave your infrastructure
- ✅ Chat history stored in your PostgreSQL
- ✅ No data sent to third parties (except chosen LLM provider)
- ✅ Full audit trail in logs (via Loki)
**Access Control:**
- ✅ Multi-user authentication
- ✅ Role-based permissions (Admin, User)
- ✅ Workspace-level isolation
- ✅ API key management
**Network Security:**
- ✅ HAProxy SSL termination
- ✅ Security headers (HSTS, CSP, etc.)
- ✅ Internal network isolation
- ✅ Firewall-friendly (only ports 80/443 exposed)
**Monitoring:**
- ✅ All access logged to Loki
- ✅ Failed login attempts tracked
- ✅ Resource usage monitored
- ✅ Alerts for suspicious activity
---
## 📊 Monitoring Integration
Your observability stack provides **complete visibility**:
### What You Can Monitor:
**Application Health:**
```
Grafana Dashboard: "AnythingLLM Overview"
├─ Request Rate: 1,234 req/min
├─ Response Time: 245ms avg
├─ Error Rate: 0.3%
├─ Active Users: 23
└─ Agent Invocations: 45/hour
```
**Resource Usage:**
```
Container Metrics (via cAdvisor):
├─ CPU: 45% (2 cores)
├─ Memory: 2.1GB / 4GB
├─ Network: 15MB/s in, 8MB/s out
└─ Disk I/O: 120 IOPS
```
**Database Performance:**
```
PostgreSQL Metrics (via postgres-exporter):
├─ Connections: 45 / 100
├─ Query Time: 12ms avg
├─ Cache Hit Ratio: 98.5%
├─ Database Size: 2.3GB
└─ Vector Index Size: 450MB
```
**LLM Provider Performance:**
```
Custom Metrics (via HAProxy):
├─ Llama Router: 234ms avg latency
├─ Direct Llama: 189ms avg latency
├─ Arke Proxy: 267ms avg latency
└─ Success Rate: 99.2%
```
**Log Analysis (Loki):**
```logql
# Find slow LLM responses
{service="anythingllm"}
| json
| duration > 5000
# Track agent tool usage
{service="anythingllm"}
|= "agent"
|= "tool_call"
# Count errors by type over 5-minute windows
sum by (error_type) (
  count_over_time({service="anythingllm"} |= "ERROR" | json [5m])
)
```
### Alerting Examples:
**Critical Alerts:**
- 🚨 AnythingLLM container down
- 🚨 PostgreSQL connection failures
- 🚨 Disk space > 95%
- 🚨 Memory usage > 90%
**Warning Alerts:**
- ⚠️ High LLM response times (> 5s)
- ⚠️ Database connections > 80%
- ⚠️ Error rate > 1%
- ⚠️ Agent failures
---
## 🚀 Getting Started
### Quick Start:
```bash
cd deployment
# 1. Configure environment
cp .env.example .env
nano .env # Set your LLM endpoints, passwords, etc.
# 2. Setup SSL certificates
# (See README.md for Let's Encrypt instructions)
# 3. Deploy
docker-compose up -d
# 4. Access services
# - AnythingLLM: https://your-domain.com
# - Grafana: http://localhost:3000
# - Prometheus: http://localhost:9090
```
### First Steps in AnythingLLM:
1. **Create Account** - First user becomes admin
2. **Create Workspace** - Organize by project/team
3. **Upload Documents** - Add your knowledge base
4. **Configure LLM** - Choose your provider (already set via .env)
5. **Enable Agents** - Turn on agent mode for tools
6. **Add MCP Servers** - Extend with custom tools
7. **Start Chatting!** - Ask questions, run agents
---
## 🎯 Why AnythingLLM is Powerful
### Compared to ChatGPT:
| Feature | ChatGPT | AnythingLLM |
|---------|---------|-------------|
| **Data Privacy** | ❌ Data sent to OpenAI | ✅ Self-hosted, private |
| **Custom Documents** | ⚠️ Limited (ChatGPT Plus) | ✅ Unlimited RAG |
| **LLM Choice** | ❌ OpenAI only | ✅ 30+ providers |
| **Agents** | ⚠️ Limited tools | ✅ Unlimited via MCP |
| **Multi-User** | ❌ Individual accounts | ✅ Team workspaces |
| **API Access** | ⚠️ Paid tier | ✅ Full REST API |
| **Monitoring** | ❌ No visibility | ✅ Complete observability |
| **Cost** | 💰 $20/user/month | ✅ Self-hosted (compute only) |
### Compared to LangChain/LlamaIndex:
| Feature | LangChain | AnythingLLM |
|---------|-----------|-------------|
| **Setup** | 🔧 Code required | ✅ Web UI, no code |
| **User Interface** | ❌ Build your own | ✅ Beautiful UI included |
| **Multi-User** | ❌ Build your own | ✅ Built-in |
| **Agents** | ✅ Powerful | ✅ Equally powerful + UI |
| **MCP Support** | ❌ No | ✅ Native support |
| **Monitoring** | ❌ DIY | ✅ Integrated |
| **Learning Curve** | 📚 Steep | ✅ Gentle |
---
## 🎓 Advanced Capabilities
### 1. **Workspace Isolation**
Create separate workspaces for different use cases:
```
├─ Engineering Workspace
│ ├─ Documents: Code docs, API specs
│ ├─ LLM: Direct Llama (fast)
│ └─ Agents: GitHub MCP, Docker MCP
├─ Customer Support Workspace
│ ├─ Documents: Product docs, FAQs
│ ├─ LLM: Llama Router (reliable)
│ └─ Agents: CRM MCP, Email MCP
└─ Executive Workspace
├─ Documents: Reports, analytics
├─ LLM: AWS Bedrock Claude (best quality)
└─ Agents: SQL Agent, Chart generation
```
### 2. **Embedding Strategies**
AnythingLLM supports multiple embedding models:
- **Native** (Xenova) - Fast, runs locally
- **OpenAI** - High quality, requires API
- **Azure OpenAI** - Enterprise option
- **Local AI** - Self-hosted alternative
**Your Setup:** Using native embeddings for privacy and speed
### 3. **Agent Chaining**
Agents can call other agents:
```
Main Agent
├─> Research Agent (web scraping)
├─> Analysis Agent (SQL queries)
└─> Report Agent (document generation)
```
### 4. **API Integration**
Full REST API for programmatic access:
```bash
# Send chat message
curl -X POST https://your-domain.com/api/v1/workspace/chat \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"message": "What is our refund policy?"}'
# Upload document
curl -X POST https://your-domain.com/api/v1/document/upload \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@policy.pdf"
# Invoke agent
curl -X POST https://your-domain.com/api/v1/agent/invoke \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"prompt": "Check server status"}'
```
---
## 🔮 Future Possibilities
With your infrastructure, you could:
### 1. **Voice Interface**
- Use FastKokoro TTS for responses
- Add speech-to-text (Whisper)
- Create voice-controlled assistant
### 2. **Slack/Discord Bot**
- Create MCP server for messaging
- Deploy bot that uses AnythingLLM
- Team can chat with AI in Slack
### 3. **Automated Workflows**
- Scheduled agent runs (cron)
- Webhook triggers
- Event-driven automation
### 4. **Custom Dashboards**
- Embed AnythingLLM in your apps
- White-label the interface
- Custom branding
### 5. **Multi-Modal AI**
- Image analysis (with vision models)
- Document OCR
- Video transcription
---
## 📚 Summary
**AnythingLLM is your AI platform that:**
✅ **Respects Privacy** - Self-hosted, your data stays yours
✅ **Flexible** - 30+ LLM providers, switch anytime
✅ **Intelligent** - RAG for document understanding
✅ **Powerful** - AI agents with unlimited tools via MCP
✅ **Observable** - Full monitoring with Prometheus/Loki
✅ **Scalable** - PostgreSQL + HAProxy for production
✅ **Extensible** - MCP protocol for custom integrations
✅ **User-Friendly** - Beautiful web UI, no coding required
**In your environment, it provides:**
🎯 **Unified AI Interface** - One place for all AI interactions
🔧 **DevOps Automation** - Manage infrastructure with natural language
📊 **Data Intelligence** - Query databases, analyze trends
🤖 **Autonomous Agents** - Tasks that run themselves
📈 **Complete Visibility** - Every metric, every log, every alert
🔒 **Enterprise Security** - SSL, auth, audit trails, monitoring
**Think of it as:** Your personal AI assistant platform that can see your data, use your tools, and help your team - all while you maintain complete control.
---
## 🆘 Learn More
- **Deployment Guide**: [README.md](README.md)
- **Monitoring Explained**: [PROMETHEUS_EXPLAINED.md](PROMETHEUS_EXPLAINED.md)
- **Official Docs**: https://docs.anythingllm.com
- **GitHub**: https://github.com/Mintplex-Labs/anything-llm
- **Discord Community**: https://discord.gg/6UyHPeGZAC

---

`docs/arke.md`
# Arke Vault Variables Documentation
This document lists the vault variables that need to be added to `ansible/inventory/group_vars/all/vault.yml` for the Arke deployment.
## Required Vault Variables
### Existing Variables
These should already be present in your vault:
```yaml
vault_arke_db_password: "your_secure_password"
vault_arke_ntth_tokens: '[{"app_id":"your_app_id","app_secret":"your_secret","name":"Production"}]'
```
### New Variables to Add
```yaml
# OpenAI-Compatible Embedding API Key (optional - can be empty string if not using OpenAI provider)
vault_arke_openai_embedding_api_key: ""
```
## Usage Notes
### vault_arke_openai_embedding_api_key
- **Required when**: `arke_embedding_provider` is set to `openai` in the inventory
- **Can be empty**: If using llama-cpp, LocalAI, or other services that don't require authentication
- **Must be set**: If using actual OpenAI API or services requiring authentication
- **Default in inventory**: Empty string (`""`)
### vault_arke_ntth_tokens
- **Format**: JSON array of objects
- **Required fields per object**:
- `app_id`: The application ID
- `app_secret`: The application secret
- `name`: (optional) A descriptive name for the token
**Example with multiple tokens**:
```yaml
vault_arke_ntth_tokens: '[{"app_id":"id1","app_secret":"secret1","name":"Production-Primary"},{"app_id":"id2","app_secret":"secret2","name":"Production-Backup"}]'
```
## Editing the Vault
To edit the vault file:
```bash
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
```
Make sure you have the vault password available (stored in `ansible/.vault_pass` by default).
## Configuration Examples
### Using Ollama (Current Default)
No additional vault variables needed beyond the existing ones. The following inventory settings are used:
```yaml
arke_embedding_provider: ollama
arke_ollama_host: "pan.helu.ca"
```
### Using OpenAI API
Add to vault:
```yaml
vault_arke_openai_embedding_api_key: "sk-your-openai-api-key"
```
Update inventory to:
```yaml
arke_embedding_provider: openai
arke_openai_embedding_base_url: "https://api.openai.com"
arke_openai_embedding_model: "text-embedding-3-small"
```
### Using llama-cpp or LocalAI (No Auth Required)
Vault variable can remain empty:
```yaml
vault_arke_openai_embedding_api_key: ""
```
Update inventory to:
```yaml
arke_embedding_provider: openai
arke_openai_embedding_base_url: "http://your-server:8080"
arke_openai_embedding_model: "text-embedding-ada-002"
```
## Security Best Practices
1. Always use `ansible-vault` to encrypt sensitive data
2. Never commit unencrypted secrets to version control
3. Keep the vault password secure and separate from the repository
4. Rotate API keys and secrets regularly
5. Use unique tokens for different environments (dev/staging/production)

---

`docs/auditd.md`
## Auditd + Laurel: Host-Based Detection Done Right
### What They Are
**Auditd** is the Linux Audit Framework—a kernel-level system that logs security-relevant events: file access, system calls, process execution, user authentication, privilege changes. It's been in the kernel since 2.6 and is rock solid.
**Laurel** is a plugin that transforms auditd's notoriously awkward multi-line log format into clean, structured JSON—perfect for shipping to Loki.
### Why This Combination Works
Auditd alone has two problems:
1. The log format is painful (events split across multiple lines, encoded arguments)
2. High-volume logging can impact performance if not tuned
Laurel solves the first problem elegantly. Proper rule tuning solves the second.
### Installation
```bash
# Auditd (likely already installed)
sudo apt install auditd audispd-plugins
# Laurel - grab the latest release
wget https://github.com/threathunters-io/laurel/releases/latest/download/laurel-x86_64-musl
sudo mv laurel-x86_64-musl /usr/local/sbin/laurel
sudo chmod 755 /usr/local/sbin/laurel
# Create laurel user and directories
sudo useradd -r -s /usr/sbin/nologin laurel
sudo mkdir -p /var/log/laurel /etc/laurel
sudo chown laurel:laurel /var/log/laurel
```
### Configuration
**/etc/laurel/config.toml:**
```toml
[auditlog]
# Output JSON logs here - point Promtail/Loki agent at this
file = "/var/log/laurel/audit.json"
size = 100000000 # 100MB rotation
generations = 5
[transform]
# Enrich with useful context
execve-argv = "array"
execve-env = "delete" # Don't log environment (secrets risk)
[filter]
# Drop noisy low-value events
filter-keys = ["exclude-noise"]
```
**/etc/audit/plugins.d/laurel.conf:**
```ini
active = yes
direction = out
path = /usr/local/sbin/laurel
type = always
args = --config /etc/laurel/config.toml
format = string
```
### High-Value Audit Rules
Here's a starter set focused on actual intrusion indicators—not compliance checkbox noise:
**/etc/audit/rules.d/intrusion-detection.rules:**
```bash
# Clear existing rules
-D
# Buffer size (tune based on your load)
-b 8192
# Failed file access (credential hunting)
-a always,exit -F arch=b64 -S open,openat -F exit=-EACCES -F key=access-denied
-a always,exit -F arch=b64 -S open,openat -F exit=-EPERM -F key=access-denied
# Credential file access
-w /etc/passwd -p wa -k credential-files
-w /etc/shadow -p wa -k credential-files
-w /etc/gshadow -p wa -k credential-files
-w /etc/sudoers -p wa -k credential-files
-w /etc/sudoers.d -p wa -k credential-files
# SSH key access
-w /root/.ssh -p wa -k ssh-keys
-w /home -p wa -k ssh-keys
# Privilege escalation
-a always,exit -F arch=b64 -S setuid,setgid,setreuid,setregid -F key=priv-escalation
-w /usr/bin/sudo -p x -k priv-escalation
-w /usr/bin/su -p x -k priv-escalation
# Process injection / debugging
-a always,exit -F arch=b64 -S ptrace -F key=process-injection
# Suspicious process execution
-a always,exit -F arch=b64 -S execve -F euid=0 -F key=root-exec
-w /tmp -p x -k exec-from-tmp
-w /var/tmp -p x -k exec-from-tmp
-w /dev/shm -p x -k exec-from-shm
# Network connections from unexpected processes
-a always,exit -F arch=b64 -S connect -F key=network-connect
# Kernel module loading
-a always,exit -F arch=b64 -S init_module,finit_module -F key=kernel-modules
# Audit log tampering (high priority)
-w /var/log/audit -p wa -k audit-tampering
-w /etc/audit -p wa -k audit-tampering
# Cron/scheduled task modification
-w /etc/crontab -p wa -k persistence
-w /etc/cron.d -p wa -k persistence
-w /var/spool/cron -p wa -k persistence
# Systemd service creation (persistence mechanism)
-w /etc/systemd/system -p wa -k persistence
-w /usr/lib/systemd/system -p wa -k persistence
# Make config immutable (remove -e 2 while tuning)
# -e 2
```
Load the rules:
```bash
sudo augenrules --load
sudo systemctl restart auditd
```
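The Grafana alert queries later in this document match on these key names, so it is worth confirming which keys a rules file actually defines. An illustrative pass over a local stand-in file (not the real `/etc/audit/rules.d` path):

```shell
# Extract the -k/key= names from a rules file and de-duplicate them; each
# name should map to at least one detection query downstream.
rules=$(mktemp)
cat > "$rules" <<'EOF'
-w /etc/passwd -p wa -k credential-files
-a always,exit -F arch=b64 -S ptrace -F key=process-injection
-w /tmp -p x -k exec-from-tmp
EOF
keys=$(grep -oE '(-k |key=)[A-Za-z-]+' "$rules" | sed -E 's/^(-k |key=)//' | sort -u)
echo "$keys"
```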
### Shipping to Loki
**Promtail config snippet:**
```yaml
scrape_configs:
- job_name: laurel
static_configs:
- targets:
- localhost
labels:
job: auditd
host: your-hostname
__path__: /var/log/laurel/audit.json
pipeline_stages:
- json:
expressions:
event_type: SYSCALL.SYSCALL
key: SYSCALL.key
exe: SYSCALL.exe
uid: SYSCALL.UID
success: SYSCALL.success
- labels:
event_type:
key:
```
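To see what that pipeline pulls out, here is a simplified stand-in for a laurel event (real events carry many more fields, and the exact field names depend on your laurel version, so treat this shape as an assumption):

```shell
# Extract the same fields the Promtail json stage above targets from a
# sample event line.
event='{"SYSCALL":{"SYSCALL":"execve","key":"exec-from-tmp","exe":"/tmp/x","UID":"root","success":"yes"}}'
fields=$(printf '%s\n' "$event" | python3 -c 'import json, sys; e = json.load(sys.stdin)["SYSCALL"]; print(e["key"], e["exe"], e["success"])')
echo "$fields"
```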
### Grafana Alerting Examples
Once in Loki, create alerts for the high-value events:
```logql
# Credential file tampering
{job="auditd"} |= `credential-files` | json | success = "yes"
# Execution from /tmp (classic attack pattern)
{job="auditd"} |= `exec-from-tmp` | json
# Root execution by non-root user (priv esc)
{job="auditd"} |= `priv-escalation` | json
# Kernel module loading (rootkit indicator)
{job="auditd"} |= `kernel-modules` | json
# Audit log tampering (covering tracks)
{job="auditd"} |= `audit-tampering` | json
```
### Performance Tuning
If you see a performance impact:
1. **Add exclusions** for known-noisy processes:
```bash
-a never,exit -F exe=/usr/bin/prometheus -F key=exclude-noise
```
2. **Reduce network logging** — the `connect` syscall is high-volume; consider removing or filtering
3. **Increase buffer** if you see `audit: backlog limit exceeded`
### What You'll Catch
With this setup, you'll detect:
- Credential harvesting attempts
- Privilege escalation (successful and attempted)
- Persistence mechanisms (cron, systemd services)
- Execution from world-writable directories
- Process injection/debugging
- Rootkit installation attempts
- Evidence tampering
All with structured JSON flowing into your existing Loki/Grafana stack. No Suricata noise, just host-level events that actually matter.

docs/casdoor.md
# Casdoor SSO Identity Provider
Casdoor provides Single Sign-On (SSO) authentication for Agathos services. This document covers the design decisions, architecture, and deployment procedures.
## Design Philosophy
### Security Isolation
Casdoor handles identity and authentication data - the most security-sensitive information in any system. For this reason, Casdoor uses a **dedicated PostgreSQL instance** on Titania rather than sharing the PostgreSQL server on Portia with other applications.
This isolation provides:
- **Data separation**: Authentication data is physically separated from application data
- **Access control**: The `casdoor` database user only has access to the `casdoor` database
- **Blast radius reduction**: A compromise of the shared database on Portia doesn't expose identity data
- **Production alignment**: Dev/UAT/Prod environments use the same architecture
### Native PostgreSQL with Docker Casdoor
The architecture splits cleanly:
```
┌──────────────────────────────────────────────────────────────┐
│ titania.incus │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Native PostgreSQL 17 (systemd) │ │
│ │ - SSL enabled for external connections │ │
│ │ - Local connections without SSL │ │
│ │ - Managed like any standard PostgreSQL install │ │
│ │ - Port 5432 │ │
│ └────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ localhost:5432 │
│ │ sslmode=disable │
│ │ │
│ ┌────────┴───────────────────────────────────────────────┐ │
│ │ Casdoor Docker Container (network_mode: host) │ │
│ │ - Runs as casdoor:casdoor user │ │
│ │ - Only has access to its database │ │
│ │ - Cannot touch PostgreSQL server config │ │
│ │ - Port 22081 (via HAProxy) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
│ External: SSL required
│ sslmode=verify-ca
┌─────────────┐
│ PGadmin │
│ on Portia │
└─────────────┘
```
### Why Not Docker for PostgreSQL?
Docker makes PostgreSQL permission management unnecessarily complex:
- UID/GID mapping between host and container
- Volume permission issues
- SSL certificate ownership problems
- More difficult backups and maintenance
Native PostgreSQL is:
- Easier to manage (standard Linux administration)
- Better integrated with systemd
- Simpler backup procedures
- Well-documented and understood
### SSL Strategy
PostgreSQL connections follow a **split SSL policy**:
| Connection Source | SSL Requirement | Rationale |
|-------------------|-----------------|-----------|
| Casdoor (localhost) | `sslmode=disable` | Same host, trusted |
| PGadmin (Portia) | `sslmode=verify-ca` | External network, requires encryption |
| Other external | `hostssl` required | Enforced by pg_hba.conf |
This is controlled by `pg_hba.conf`:
```
# Local connections (Unix socket)
local all all peer
# Localhost connections (no SSL required)
host all all 127.0.0.1/32 md5
# External connections (SSL required)
hostssl all all 0.0.0.0/0 md5
```
### System User Pattern
The Casdoor service user is created without hardcoded UID/GID:
```yaml
- name: Create casdoor user
ansible.builtin.user:
name: "{{ casdoor_user }}"
system: true # System account, UID assigned by OS
```
The playbook queries the assigned UID/GID at runtime for Docker container user mapping.
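A sketch of that lookup outside Ansible (using the current shell user as a stand-in, since the real `casdoor` UID only exists after deployment):

```shell
# Build the "UID:GID" string that gets passed to the container's `user:` key.
uid=$(id -u)
gid=$(id -g)
container_user="${uid}:${gid}"
echo "$container_user"
```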
## Architecture
### Components
| Component | Location | Purpose |
|-----------|----------|---------|
| PostgreSQL 17 | Native on Titania | Dedicated identity database |
| Casdoor | Docker on Titania | SSO identity provider |
| HAProxy | Titania | TLS termination, routing |
| Alloy | Titania | Syslog collection |
### Deployment Order
```
1. postgresql_ssl/deploy.yml → Install PostgreSQL, SSL, create casdoor DB
2. casdoor/deploy.yml → Deploy Casdoor container
3. pgadmin/deploy.yml → Distribute SSL cert to PGadmin (optional)
```
### Network Ports
| Port | Service | Access |
|------|---------|--------|
| 22081 | Casdoor HTTP | Via HAProxy (network_mode: host) |
| 5432 | PostgreSQL | SSL for external, plain for localhost |
| 51401 | Syslog | Local only (Alloy) |
### Data Persistence
PostgreSQL data (native install):
```
/var/lib/postgresql/17/main/ # Database files
/etc/postgresql/17/main/ # Configuration
/etc/postgresql/17/main/ssl/ # SSL certificates
```
Casdoor configuration:
```
/srv/casdoor/
├── conf/
│ └── app.conf # Casdoor configuration
└── docker-compose.yml # Service definition
```
## Prerequisites
### 1. Terraform (S3 Buckets)
Casdoor can use S3-compatible storage for avatars and attachments:
```bash
cd terraform
terraform apply
```
### 2. Ansible Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
```yaml
# PostgreSQL SSL postgres user password (for Titania's dedicated PostgreSQL)
vault_postgresql_ssl_postgres_password: "secure-postgres-password"
# Casdoor database password
vault_casdoor_db_password: "secure-db-password"
# Casdoor application secrets
vault_casdoor_auth_state: "random-32-char-string"
vault_casdoor_app_client_secret: "generated-client-secret"
# Casdoor initial user passwords (changed after first login)
vault_casdoor_admin_password: "initial-admin-password"
vault_casdoor_hostmaster_password: "initial-hostmaster-password"
# Optional (for RADIUS protocol)
vault_casdoor_radius_secret: "radius-secret"
```
Generate secrets:
```bash
# Database password
openssl rand -base64 24
# Auth state
openssl rand -hex 16
```
### 3. Alloy Log Collection
Ensure Alloy is deployed to receive syslog:
```bash
ansible-playbook alloy/deploy.yml --limit titania.incus
```
## Deployment
### Fresh Installation
```bash
cd ansible
# 1. Deploy PostgreSQL with SSL
ansible-playbook postgresql_ssl/deploy.yml
# 2. Deploy Casdoor
ansible-playbook casdoor/deploy.yml
# 3. Update PGadmin with SSL certificate (optional)
ansible-playbook pgadmin/deploy.yml
```
### Verify Deployment
```bash
# Check PostgreSQL status
ssh titania.incus "sudo systemctl status postgresql"
# Check Casdoor container
ssh titania.incus "cd /srv/casdoor && docker compose ps"
# Check logs
ssh titania.incus "cd /srv/casdoor && docker compose logs --tail=50"
# Test health endpoint
curl -s http://titania.incus:22081/api/health
```
### Redeployment
To redeploy Casdoor only (database preserved):
```bash
ansible-playbook casdoor/remove.yml
ansible-playbook casdoor/deploy.yml
```
To completely reset (including database):
```bash
ansible-playbook casdoor/remove.yml
ssh titania.incus "sudo -u postgres dropdb casdoor"
ssh titania.incus "sudo -u postgres dropuser casdoor"
ansible-playbook postgresql_ssl/deploy.yml
ansible-playbook casdoor/deploy.yml
```
## Configuration Reference
### Host Variables
Located in `ansible/inventory/host_vars/titania.incus.yml`:
```yaml
# PostgreSQL SSL (dedicated identity database)
postgresql_ssl_postgres_password: "{{ vault_postgresql_ssl_postgres_password }}"
postgresql_ssl_port: 5432
postgresql_ssl_cert_path: /etc/postgresql/17/main/ssl/server.crt
# Casdoor service account (system-assigned UID/GID)
casdoor_user: casdoor
casdoor_group: casdoor
casdoor_directory: /srv/casdoor
# Web
casdoor_port: 22081
casdoor_runmode: dev # or 'prod'
# Database (connects to localhost PostgreSQL)
casdoor_db_port: 5432
casdoor_db_name: casdoor
casdoor_db_user: casdoor
casdoor_db_password: "{{ vault_casdoor_db_password }}"
casdoor_db_sslmode: disable # Localhost, no SSL needed
# Logging
casdoor_syslog_port: 51401
```
### SSL Certificate
The self-signed certificate is generated automatically with:
- **Common Name**: `titania.incus`
- **Subject Alt Names**: `titania.incus`, `localhost`, `127.0.0.1`
- **Validity**: 10 years (`+3650d`)
- **Key Size**: 4096 bits
- **Location**: `/etc/postgresql/17/main/ssl/`
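A local sketch of an equivalent `openssl` invocation (the playbook's exact flags may differ, and the output paths here are a temp directory, not the real `/etc/postgresql` location):

```shell
# Generate a 4096-bit self-signed certificate with the same CN, SANs, and
# 10-year validity described above (-addext needs OpenSSL 1.1.1+).
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
  -keyout "$tmp/server.key" -out "$tmp/server.crt" \
  -subj "/CN=titania.incus" \
  -addext "subjectAltName=DNS:titania.incus,DNS:localhost,IP:127.0.0.1" 2>/dev/null
subject=$(openssl x509 -noout -subject -in "$tmp/server.crt")
echo "$subject"
```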
To regenerate certificates:
```bash
ssh titania.incus "sudo rm -rf /etc/postgresql/17/main/ssl/*"
ansible-playbook postgresql_ssl/deploy.yml
ansible-playbook pgadmin/deploy.yml # Update cert on Portia
```
## PGadmin Connection
To connect from PGadmin on Portia:
1. Navigate to https://pgadmin.ouranos.helu.ca
2. Add Server:
- **General tab**
- Name: `Titania PostgreSQL (Casdoor)`
- **Connection tab**
- Host: `titania.incus`
- Port: `5432`
- Database: `casdoor`
- Username: `casdoor`
- Password: *(from vault)*
- **SSL tab**
- SSL Mode: `Verify-CA`
- Root certificate: `/var/lib/pgadmin/certs/titania-postgres-ca.crt`
The certificate is automatically distributed by `ansible-playbook pgadmin/deploy.yml`.
## Application Branding & CSS Customization
Casdoor allows extensive customization of login/signup pages through CSS and HTML fields in the **Application** settings.
### Available CSS/HTML Fields
| Field | Purpose | Where Applied |
|-------|---------|---------------|
| `formCss` | Custom CSS for desktop login forms | Login, signup, consent pages |
| `formCssMobile` | Mobile-specific CSS overrides | Mobile views |
| `headerHtml` | Custom HTML in page header | All auth pages (can inject `<style>` tags) |
| `footerHtml` | Custom footer HTML | Replaces "Powered by Casdoor" |
| `formSideHtml` | HTML beside the form | Side panel content |
| `formBackgroundUrl` | Background image URL | Full-page background |
| `formBackgroundUrlMobile` | Mobile background image | Mobile background |
| `signupHtml` | Custom HTML for signup page | Signup page only |
| `signinHtml` | Custom HTML for signin page | Signin page only |
### Configuration via init_data.json
Application branding is configured in `ansible/casdoor/init_data.json.j2`:
```json
{
"applications": [
{
"name": "app-heluca",
"formCss": "<style>/* Your CSS here */</style>",
"footerHtml": "<div style=\"text-align:center;\">Powered by Helu.ca</div>",
"headerHtml": "<style>/* Additional CSS via style tag */</style>",
"formBackgroundUrl": "https://example.com/bg.jpg"
}
]
}
```
### Example: Custom Theme CSS
The `formCss` field contains CSS to customize the Ant Design components:
```css
<style>
/* Login panel styling */
.login-panel {
background-color: #ffffff;
border-radius: 10px;
box-shadow: 0 0 30px 20px rgba(255,164,21,0.12);
}
/* Primary button colors */
.ant-btn-primary {
background-color: #4b96ff !important;
border-color: #4b96ff !important;
}
.ant-btn-primary:hover {
background-color: #58c0ff !important;
border-color: #58c0ff !important;
}
/* Link colors */
a { color: #ffa415; }
a:hover { color: #ffc219; }
/* Input focus states */
.ant-input:focus, .ant-input-focused {
border-color: #4b96ff !important;
box-shadow: 0 0 0 2px rgba(75,150,255,0.2) !important;
}
/* Checkbox styling */
.ant-checkbox-checked .ant-checkbox-inner {
background-color: #4b96ff !important;
border-color: #4b96ff !important;
}
</style>
```
### Example: Custom Footer
Replace the default "Powered by Casdoor" footer:
```html
<div style="text-align:center;padding:10px;color:#666;">
<a href="https://helu.ca" style="color:#4b96ff;text-decoration:none;">
Powered by Helu.ca
</a>
</div>
```
### Organization-Level Theme
Organization settings also affect theming. Configure in the **Organization** settings:
| Setting | Purpose |
|---------|---------|
| `themeData.colorPrimary` | Primary color (Ant Design) |
| `themeData.borderRadius` | Border radius for components |
| `themeData.isCompact` | Compact mode toggle |
| `logo` | Organization logo |
| `favicon` | Browser favicon |
| `websiteUrl` | Organization website |
### Updating Existing Applications
Changes to `init_data.json` only apply during **initial Casdoor setup**. For existing deployments:
1. **Via Admin UI**: Applications → Edit → Update CSS/HTML fields
2. **Via API**: Use Casdoor's REST API to update application settings
3. **Database reset**: Redeploy with `initDataNewOnly = false` (overwrites existing data)
### CSS Class Reference
Common CSS classes for targeting Casdoor UI elements:
| Class | Element |
|-------|---------|
| `.login-panel` | Main login form container |
| `.login-logo-box` | Logo container |
| `.login-username` | Username input wrapper |
| `.login-password` | Password input wrapper |
| `.login-button-box` | Submit button container |
| `.login-forget-password` | Forgot password link |
| `.login-signup-link` | Signup link |
| `.login-languages` | Language selector |
| `.back-button` | Back button |
| `.provider-img` | OAuth provider icons |
| `.signin-methods` | Sign-in method tabs |
| `.verification-code` | Verification code input |
| `.login-agreement` | Terms agreement checkbox |
## Initial Setup
After deployment, access Casdoor at https://id.ouranos.helu.ca:
1. **Login** with default credentials: `admin` / `123`
2. **Change admin password immediately**
3. **Create organization** for your domain
4. **Create applications** for services that need SSO:
- SearXNG (via OAuth2-Proxy)
- Grafana
- Other internal services
### OAuth2 Application Setup
For each service:
1. Applications → Add
2. Configure OAuth2 settings:
- Redirect URI: `https://service.ouranos.helu.ca/oauth2/callback`
- Grant types: Authorization Code
3. Note the Client ID and Client Secret for service configuration
## Troubleshooting
### PostgreSQL Issues
```bash
# Check PostgreSQL status
ssh titania.incus "sudo systemctl status postgresql"
# View PostgreSQL logs
ssh titania.incus "sudo journalctl -u postgresql -f"
# Check SSL configuration
ssh titania.incus "sudo -u postgres psql -c 'SHOW ssl;'"
ssh titania.incus "sudo -u postgres psql -c 'SHOW ssl_cert_file;'"
# Test SSL connection externally
openssl s_client -connect titania.incus:5432 -starttls postgres
```
### Casdoor Container Issues
```bash
# View container status
ssh titania.incus "cd /srv/casdoor && docker compose ps"
# View logs
ssh titania.incus "cd /srv/casdoor && docker compose logs casdoor"
# Restart
ssh titania.incus "cd /srv/casdoor && docker compose restart"
```
### Database Connection
```bash
# Connect as postgres admin
ssh titania.incus "sudo -u postgres psql"
# Connect as casdoor user
ssh titania.incus "psql -h localhost -U casdoor -d casdoor"
# List databases
ssh titania.incus "sudo -u postgres psql -c '\l'"
# List users
ssh titania.incus "sudo -u postgres psql -c '\du'"
```
### Health Check
```bash
# Casdoor health
curl -s http://titania.incus:22081/api/health | jq
# PostgreSQL accepting connections
ssh titania.incus "pg_isready -h localhost"
```
## Security Considerations
1. **Change default admin password** immediately after deployment
2. **Rotate database passwords** periodically (update vault, redeploy)
3. **Monitor authentication logs** in Grafana (via Alloy/Loki)
4. **SSL certificates** have 10-year validity, regenerate if compromised
5. **Backup PostgreSQL data** regularly - contains all identity data:
```bash
ssh titania.incus "sudo -u postgres pg_dump casdoor > casdoor_backup.sql"
```
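For scheduled backups, a dated and compressed filename helps with retention. A sketch (the naming pattern is a suggestion, and the dump itself of course needs the real server):

```shell
# Build a timestamped backup name; the real dump would be piped through gzip:
#   ssh titania.incus "sudo -u postgres pg_dump casdoor" | gzip > "$backup"
backup="casdoor_$(date +%Y%m%d).sql.gz"
echo "$backup"
```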
## Related Documentation
- [Ansible Practices](ansible.md) - Playbook and variable patterns
- [Terraform Practices](terraform.md) - S3 bucket provisioning
- [OAuth2-Proxy](services/oauth2_proxy.md) - Protecting services with Casdoor SSO

docs/cerbot.md
# Certbot DNS-01 with Namecheap
This playbook deploys certbot with the Namecheap DNS plugin for DNS-01 validation, enabling wildcard SSL certificates.
## Overview
| Component | Value |
|-----------|-------|
| Installation | Python virtualenv in `/srv/certbot/.venv` |
| DNS Plugin | `certbot-dns-namecheap` |
| Validation | DNS-01 (supports wildcards) |
| Renewal | Systemd timer (twice daily) |
| Certificate Output | `/etc/haproxy/certs/{domain}.pem` |
| Metrics | Prometheus textfile collector |
## Deployments
### Titania (ouranos.helu.ca)
Production deployment providing Let's Encrypt certificates for the Agathos sandbox HAProxy reverse proxy.
| Setting | Value |
|---------|-------|
| **Host** | titania.incus |
| **Domain** | ouranos.helu.ca |
| **Wildcard** | *.ouranos.helu.ca |
| **Email** | webmaster@helu.ca |
| **HAProxy** | Port 443 (HTTPS), Port 80 (HTTP redirect) |
| **Renewal** | Twice daily, automatic HAProxy reload |
### Other Deployments
The playbook can be deployed to any host with HAProxy, for example hippocamp.helu.ca (d.helu.ca domain).
## Prerequisites
1. **Namecheap API Access** enabled on your account
2. **Namecheap API key** generated
3. **IP whitelisted** in Namecheap API settings
4. **Ansible Vault** configured with Namecheap credentials
## Setup
### 1. Add Secrets to Ansible Vault
Add Namecheap credentials to `ansible/inventory/group_vars/all/vault.yml`:
```bash
ansible-vault edit inventory/group_vars/all/vault.yml
```
Add the following variables:
```yaml
vault_namecheap_username: "your_namecheap_username"
vault_namecheap_api_key: "your_namecheap_api_key"
```
Map these in `inventory/group_vars/all/vars.yml`:
```yaml
namecheap_username: "{{ vault_namecheap_username }}"
namecheap_api_key: "{{ vault_namecheap_api_key }}"
```
### 2. Configure Host Variables
For Titania, the configuration is in `inventory/host_vars/titania.incus.yml`:
```yaml
services:
- certbot
- haproxy
# ...
certbot_email: webmaster@helu.ca
certbot_cert_name: ouranos.helu.ca
certbot_domains:
- "*.ouranos.helu.ca"
- "ouranos.helu.ca"
```
### 3. Deploy
```bash
cd ansible
ansible-playbook certbot/deploy.yml --limit titania.incus
```
## Files Created
| Path | Purpose |
|------|---------|
| `/srv/certbot/.venv/` | Python virtualenv with certbot |
| `/srv/certbot/config/` | Certbot configuration and certificates |
| `/srv/certbot/credentials/namecheap.ini` | Namecheap API credentials (600 perms) |
| `/srv/certbot/hooks/renewal-hook.sh` | Post-renewal script |
| `/srv/certbot/hooks/cert-metrics.sh` | Prometheus metrics script |
| `/etc/haproxy/certs/ouranos.helu.ca.pem` | Combined cert for HAProxy (Titania) |
| `/etc/systemd/system/certbot-renew.service` | Renewal service unit |
| `/etc/systemd/system/certbot-renew.timer` | Twice-daily renewal timer |
## Renewal Process
1. Systemd timer triggers at 00:00 and 12:00 (with random delay up to 1 hour)
2. Certbot checks if certificate needs renewal (within 30 days of expiry)
3. If renewal needed:
- Creates DNS TXT record via Namecheap API
- Waits 120 seconds for propagation
- Validates and downloads new certificate
- Runs `renewal-hook.sh`
4. Renewal hook:
- Combines fullchain + privkey into HAProxy format
- Reloads HAProxy via `docker compose kill -s HUP haproxy`
- Updates Prometheus metrics
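The combine step can be reproduced locally with a throwaway certificate to see the PEM layout HAProxy expects (paths and CN are stand-ins):

```shell
# HAProxy wants the certificate chain and private key concatenated into one
# PEM file, chain first.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout "$tmp/privkey.pem" -out "$tmp/fullchain.pem" -subj "/CN=example" 2>/dev/null
cat "$tmp/fullchain.pem" "$tmp/privkey.pem" > "$tmp/combined.pem"
begins=$(grep -c 'BEGIN' "$tmp/combined.pem")
echo "PEM blocks: $begins"
```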
## Prometheus Metrics
Metrics written to `/var/lib/prometheus/node-exporter/ssl_cert.prom`:
| Metric | Description |
|--------|-------------|
| `ssl_certificate_expiry_timestamp` | Unix timestamp when cert expires |
| `ssl_certificate_expiry_seconds` | Seconds until cert expires |
| `ssl_certificate_valid` | 1 if valid, 0 if expired/missing |
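A sketch of how `cert-metrics.sh` plausibly derives these values, shown against a throwaway 30-day certificate (`date -d` assumes GNU coreutils):

```shell
# Compute expiry timestamp and seconds-until-expiry from a certificate's
# notAfter field, in the textfile-collector format.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" -subj "/CN=demo" 2>/dev/null
end=$(openssl x509 -enddate -noout -in "$tmp/cert.pem" | cut -d= -f2)
expiry=$(date -d "$end" +%s)
remaining=$(( expiry - $(date +%s) ))
printf 'ssl_certificate_expiry_timestamp %d\nssl_certificate_expiry_seconds %d\n' "$expiry" "$remaining"
```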
Example alert rule:
```yaml
- alert: SSLCertificateExpiringSoon
expr: ssl_certificate_expiry_seconds < 604800 # 7 days
for: 1h
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon"
description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"
```
## Troubleshooting
### View Certificate Status
```bash
# Check certificate expiry (Titania example)
openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.helu.ca.pem
# Check certbot certificates
sudo -u certbot /srv/certbot/.venv/bin/certbot certificates \
--config-dir /srv/certbot/config
```
### Manual Renewal Test
```bash
# Dry run renewal
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
--config-dir /srv/certbot/config \
--work-dir /srv/certbot/work \
--logs-dir /srv/certbot/logs \
--dry-run
# Force renewal (if needed)
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
--config-dir /srv/certbot/config \
--work-dir /srv/certbot/work \
--logs-dir /srv/certbot/logs \
--force-renewal
```
### Check Systemd Timer
```bash
# Timer status
systemctl status certbot-renew.timer
# Last run
journalctl -u certbot-renew.service --since "1 day ago"
# List timers
systemctl list-timers certbot-renew.timer
```
### DNS Propagation Issues
If certificate requests fail due to DNS propagation:
1. Check Namecheap API is accessible
2. Verify IP is whitelisted
3. Increase propagation wait time (default 120s)
4. Check certbot logs: `/srv/certbot/logs/letsencrypt.log`
## Related Playbooks
- `haproxy/deploy.yml` - Depends on certificate from certbot
- `prometheus/node_deploy.yml` - Deploys node_exporter for metrics collection

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Documentation Style Guide</title>
<!-- Bootstrap CSS -->
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet">
<!-- Bootstrap Icons -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css">
<style>
/* Smooth scrolling */
html {
scroll-behavior: smooth;
}
/* Icon styling */
.section-icon {
margin-right: 0.5rem;
color: var(--bs-primary);
}
.alert-icon {
margin-right: 0.5rem;
font-size: 1.2rem;
vertical-align: middle;
}
/* Scroll to top button */
#scrollTopBtn {
position: fixed;
bottom: 20px;
right: 20px;
z-index: 1000;
display: none;
border-radius: 50%;
width: 50px;
height: 50px;
box-shadow: 0 2px 10px rgba(0,0,0,0.3);
}
/* Icon legend */
.icon-legend {
display: inline-flex;
align-items: center;
margin-right: 1.5rem;
margin-bottom: 0.5rem;
}
.icon-legend i {
margin-right: 0.5rem;
font-size: 1.2rem;
}
</style>
</head>
<body>
<div class="container-fluid">
<nav class="navbar navbar-dark bg-dark rounded mb-4">
<div class="container-fluid">
<a class="navbar-brand" href="agathos.html">
<i class="bi bi-arrow-left"></i> Back to Main Documentation
</a>
<div class="navbar-nav d-flex flex-row">
<a class="nav-link me-3" href="#philosophy"><i class="bi bi-book"></i> Philosophy</a>
<a class="nav-link me-3" href="#structure"><i class="bi bi-diagram-3"></i> Structure</a>
<a class="nav-link me-3" href="#visual-design"><i class="bi bi-palette"></i> Design</a>
<a class="nav-link me-3" href="#bootstrap-icons"><i class="bi bi-bootstrap"></i> Icons</a>
<a class="nav-link" href="#implementation"><i class="bi bi-gear"></i> Implementation</a>
</div>
</div>
</nav>
<nav aria-label="breadcrumb">
<ol class="breadcrumb">
<li class="breadcrumb-item"><a href="agathos.html"><i class="bi bi-house-door"></i> Main Documentation</a></li>
<li class="breadcrumb-item active" aria-current="page">Style Guide</li>
</ol>
</nav>
<div class="row">
<div class="col-12">
<h1 class="display-4 mb-4">
<i class="bi bi-journal-code section-icon"></i>Documentation Style Guide
<span class="badge bg-success"><i class="bi bi-check-circle-fill"></i> Complete</span>
</h1>
<p class="lead">This guide explains the approach and principles used to create comprehensive HTML documentation for infrastructure and software projects.</p>
</div>
</div>
<!-- Icon Legend -->
<div class="alert alert-light border">
<h5><i class="bi bi-info-circle"></i> Icon Legend</h5>
<div class="d-flex flex-wrap">
<span class="icon-legend"><i class="bi bi-exclamation-triangle-fill text-danger"></i>Critical/Danger</span>
<span class="icon-legend"><i class="bi bi-exclamation-circle-fill text-warning"></i>Warning/Important</span>
<span class="icon-legend"><i class="bi bi-check-circle-fill text-success"></i>Success/Complete</span>
<span class="icon-legend"><i class="bi bi-info-circle-fill text-info"></i>Information</span>
<span class="icon-legend"><i class="bi bi-lightning-fill text-primary"></i>Active/Key</span>
<span class="icon-legend"><i class="bi bi-link-45deg text-secondary"></i>Integration</span>
</div>
</div>
<section id="philosophy" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-book section-icon"></i>Philosophy</h2>
<div class="row g-4">
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-diagram-3"></i> Documentation as Architecture
</h3>
<p>Documentation should mirror and reinforce the software architecture. Each component gets its own focused document that clearly explains its purpose, boundaries, and relationships.</p>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-people"></i> User-Centric Design
</h3>
<p>Documentation serves multiple audiences:</p>
<ul>
<li><i class="bi bi-code-slash"></i> <strong>Developers</strong> need technical details and implementation guidance</li>
<li><i class="bi bi-briefcase"></i> <strong>Stakeholders</strong> need high-level overviews and business context</li>
<li><i class="bi bi-award"></i> <strong>Red Panda</strong> needs approval checkpoints and critical decisions highlighted</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-arrow-repeat"></i> Living Documentation
</h3>
<p>Documentation evolves with the codebase and captures both current state and architectural decisions.</p>
</div>
</div>
</div>
</div>
</section>
<section id="structure" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-diagram-3 section-icon"></i>Structure Principles</h2>
<div class="alert alert-info border-start border-4 border-info">
<h3><i class="bi bi-info-circle-fill alert-icon"></i>1. Hierarchical Information Architecture</h3>
<pre class="mb-0">Main Documentation (project.html)
├── Component Docs (component1.html, component2.html, etc.)
├── Standards References (docs/standards/)
└── Supporting Materials (README.md, style guides)</pre>
</div>
<div class="alert alert-warning border-start border-4 border-warning">
<h3><i class="bi bi-exclamation-circle-fill alert-icon"></i>2. Consistent Navigation</h3>
<p>Every document includes:</p>
<ul class="mb-0">
<li><i class="bi bi-compass"></i> <strong>Navigation bar</strong> with key sections</li>
<li><i class="bi bi-link-45deg"></i> <strong>Cross-references</strong> to related components</li>
<li><i class="bi bi-arrow-return-left"></i> <strong>Return links</strong> to main documentation</li>
</ul>
</div>
<div class="alert alert-info border-start border-4 border-info">
<h3><i class="bi bi-info-circle-fill alert-icon"></i>3. Progressive Disclosure</h3>
<p>Information flows from general to specific:</p>
<p class="mb-0"><strong><i class="bi bi-arrow-right-short"></i> Overview → Architecture → Implementation → Details</strong></p>
</div>
</section>
<section id="visual-design" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-palette section-icon"></i>Visual Design Principles</h2>
<div class="alert alert-info border-start border-4 border-info">
<h3><i class="bi bi-info-circle-fill alert-icon"></i>1. Clean Typography</h3>
<ul class="mb-0">
<li><i class="bi bi-fonts"></i> System fonts for readability</li>
<li><i class="bi bi-text-paragraph"></i> Generous line spacing (1.6)</li>
<li><i class="bi bi-list-nested"></i> Clear hierarchy with consistent heading sizes</li>
</ul>
</div>
<div class="alert alert-danger border-start border-4 border-danger">
<h3><i class="bi bi-exclamation-triangle-fill alert-icon"></i>2. Color-Coded Information Types</h3>
<p><strong><i class="bi bi-bootstrap"></i> Bootstrap Alert Classes (Preferred):</strong></p>
<ul class="mb-3">
<li><i class="bi bi-exclamation-triangle-fill text-danger"></i> <code>alert alert-danger</code> - Critical decisions requiring immediate attention</li>
<li><i class="bi bi-exclamation-circle-fill text-warning"></i> <code>alert alert-warning</code> - Important context and warnings</li>
<li><i class="bi bi-check-circle-fill text-success"></i> <code>alert alert-success</code> - Completed features and positive outcomes</li>
<li><i class="bi bi-info-circle-fill text-info"></i> <code>alert alert-info</code> - Technical architecture information</li>
<li><i class="bi bi-lightning-fill text-primary"></i> <code>alert alert-primary</code> - Key workflows and processes</li>
<li><i class="bi bi-link-45deg text-secondary"></i> <code>alert alert-secondary</code> - Cross-component integration details</li>
</ul>
<p><strong>Legacy Custom Classes (Backward Compatible):</strong></p>
<ul class="mb-0">
<li><i class="bi bi-circle-fill text-info"></i> <strong>.tech-stack</strong> - Technical architecture information</li>
<li><i class="bi bi-circle-fill text-warning"></i> <strong>.critical</strong> - Important decisions requiring attention</li>
<li><i class="bi bi-circle-fill text-primary"></i> <strong>.workflow</strong> - Process and workflow information</li>
<li><i class="bi bi-circle-fill text-secondary"></i> <strong>.integration</strong> - Cross-component integration details</li>
</ul>
</div>
<div class="alert alert-info border-start border-4 border-info">
<h3><i class="bi bi-info-circle-fill alert-icon"></i>3. Responsive Layout</h3>
<ul class="mb-0">
<li><i class="bi bi-phone"></i> Bootstrap grid system for all screen sizes</li>
<li><i class="bi bi-grid-3x3"></i> Consistent spacing with utility classes</li>
<li><i class="bi bi-card-list"></i> Card-based information grouping</li>
</ul>
</div>
</section>
<section id="bootstrap-icons" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-bootstrap section-icon"></i>Bootstrap Icons Integration</h2>
<div class="alert alert-success border-start border-4 border-success">
<h3><i class="bi bi-check-circle-fill alert-icon"></i>Setup</h3>
<p>Add Bootstrap Icons CDN to your HTML documents:</p>
<div class="bg-light p-3 rounded my-2">
<code>&lt;link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css"&gt;</code>
</div>
<p class="mt-3"><strong>Benefits:</strong></p>
<ul class="mb-0">
<li><i class="bi bi-lightning-charge"></i> Minimal overhead (~75KB)</li>
<li><i class="bi bi-palette"></i> 2000+ icons matching Bootstrap design</li>
<li><i class="bi bi-cloud-download"></i> CDN caching for fast loading</li>
</ul>
</div>
<div class="alert alert-info border-start border-4 border-info">
<h3><i class="bi bi-info-circle-fill alert-icon"></i>Common Icon Patterns</h3>
<div class="row g-3">
<div class="col-md-6">
<h5><i class="bi bi-check-square"></i> Status & Progress</h5>
<ul>
<li><i class="bi bi-check-square"></i> <code>bi-check-square</code> - Completed</li>
<li><i class="bi bi-square"></i> <code>bi-square</code> - Pending</li>
<li><i class="bi bi-hourglass-split"></i> <code>bi-hourglass-split</code> - In Progress</li>
<li><i class="bi bi-x-circle"></i> <code>bi-x-circle</code> - Failed/Error</li>
</ul>
</div>
<div class="col-md-6">
<h5><i class="bi bi-compass"></i> Navigation</h5>
<ul>
<li><i class="bi bi-house-door"></i> <code>bi-house-door</code> - Home</li>
<li><i class="bi bi-arrow-left"></i> <code>bi-arrow-left</code> - Back</li>
<li><i class="bi bi-box-arrow-up-right"></i> <code>bi-box-arrow-up-right</code> - External</li>
<li><i class="bi bi-link-45deg"></i> <code>bi-link-45deg</code> - Link</li>
</ul>
</div>
<div class="col-md-6">
<h5><i class="bi bi-exclamation-triangle"></i> Alerts</h5>
<ul>
<li><i class="bi bi-exclamation-triangle-fill text-danger"></i> <code>bi-exclamation-triangle-fill</code> - Danger</li>
<li><i class="bi bi-exclamation-circle-fill text-warning"></i> <code>bi-exclamation-circle-fill</code> - Warning</li>
<li><i class="bi bi-info-circle-fill text-info"></i> <code>bi-info-circle-fill</code> - Info</li>
<li><i class="bi bi-check-circle-fill text-success"></i> <code>bi-check-circle-fill</code> - Success</li>
</ul>
</div>
<div class="col-md-6">
<h5><i class="bi bi-code-slash"></i> Technical</h5>
<ul>
<li><i class="bi bi-code-slash"></i> <code>bi-code-slash</code> - Code</li>
<li><i class="bi bi-database"></i> <code>bi-database</code> - Database</li>
<li><i class="bi bi-cpu"></i> <code>bi-cpu</code> - System</li>
<li><i class="bi bi-plug"></i> <code>bi-plug</code> - API/Integration</li>
</ul>
</div>
</div>
</div>
<div class="alert alert-primary border-start border-4 border-primary">
<h3><i class="bi bi-lightning-fill alert-icon"></i>Usage Examples</h3>
<h5 class="mt-3">Section Headers with Icons</h5>
<div class="bg-light p-3 rounded my-2">
<code>&lt;h2&gt;&lt;i class="bi bi-book section-icon"&gt;&lt;/i&gt;Section Title&lt;/h2&gt;</code>
</div>
<h5 class="mt-3">Alert Boxes with Icons</h5>
<div class="bg-light p-3 rounded my-2">
<code>&lt;div class="alert alert-info border-start border-4 border-info"&gt;<br>
&nbsp;&nbsp;&lt;h3&gt;&lt;i class="bi bi-info-circle-fill alert-icon"&gt;&lt;/i&gt;Information&lt;/h3&gt;<br>
&lt;/div&gt;</code>
</div>
<h5 class="mt-3">Badges with Icons</h5>
<div class="bg-light p-3 rounded my-2">
<code>&lt;span class="badge bg-success"&gt;&lt;i class="bi bi-check-circle-fill"&gt;&lt;/i&gt; Complete&lt;/span&gt;</code>
</div>
<h5 class="mt-3">List Items with Icons</h5>
<div class="bg-light p-3 rounded my-2">
<code>&lt;li&gt;&lt;i class="bi bi-check-circle"&gt;&lt;/i&gt; Completed task&lt;/li&gt;<br>
&lt;li&gt;&lt;i class="bi bi-arrow-right-short"&gt;&lt;/i&gt; Action item&lt;/li&gt;</code>
</div>
</div>
<div class="alert alert-warning border-start border-4 border-warning">
<h3><i class="bi bi-exclamation-circle-fill alert-icon"></i>Best Practices</h3>
<ul class="mb-0">
<li><i class="bi bi-check-circle"></i> Use semantic icons that match content meaning</li>
<li><i class="bi bi-palette"></i> Maintain consistent icon usage across documents</li>
<li><i class="bi bi-eye"></i> Don't overuse icons - they should enhance, not clutter</li>
<li><i class="bi bi-phone"></i> Ensure icons are visible and meaningful at all screen sizes</li>
<li><i class="bi bi-universal-access"></i> Icons should supplement text, not replace it (accessibility)</li>
</ul>
</div>
</section>
<section id="implementation" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-gear section-icon"></i>Implementation Guidelines</h2>
<div class="alert alert-success border-start border-4 border-success">
<h3><i class="bi bi-check-circle-fill alert-icon"></i>HTML Document Template</h3>
<pre class="mb-0">&lt;!DOCTYPE html&gt;
&lt;html lang="en"&gt;
&lt;head&gt;
&lt;meta charset="UTF-8"&gt;
&lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&gt;
&lt;title&gt;Document Title&lt;/title&gt;
&lt;!-- Bootstrap CSS --&gt;
&lt;link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet"&gt;
&lt;!-- Bootstrap Icons --&gt;
&lt;link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css"&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;div class="container-fluid"&gt;
&lt;!-- Navigation --&gt;
&lt;nav class="navbar navbar-dark bg-dark rounded mb-4"&gt;
&lt;a class="navbar-brand" href="main.html"&gt;
&lt;i class="bi bi-arrow-left"&gt;&lt;/i&gt; Back
&lt;/a&gt;
&lt;/nav&gt;
&lt;!-- Breadcrumb --&gt;
&lt;nav aria-label="breadcrumb"&gt;
&lt;ol class="breadcrumb"&gt;
&lt;li class="breadcrumb-item"&gt;&lt;a href="main.html"&gt;&lt;i class="bi bi-house-door"&gt;&lt;/i&gt; Main&lt;/a&gt;&lt;/li&gt;
&lt;li class="breadcrumb-item active"&gt;Current Page&lt;/li&gt;
&lt;/ol&gt;
&lt;/nav&gt;
&lt;!-- Content --&gt;
&lt;h1&gt;&lt;i class="bi bi-journal-code"&gt;&lt;/i&gt; Page Title&lt;/h1&gt;
&lt;!-- Sections --&gt;
&lt;/div&gt;
&lt;!-- Bootstrap JS --&gt;
&lt;script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"&gt;&lt;/script&gt;
&lt;!-- Dark mode support --&gt;
&lt;script&gt;
if (window.matchMedia('(prefers-color-scheme: dark)').matches) {
document.documentElement.setAttribute('data-bs-theme', 'dark');
}
&lt;/script&gt;
&lt;/body&gt;
&lt;/html&gt;</pre>
</div>
<div class="alert alert-info border-start border-4 border-info">
<h3><i class="bi bi-info-circle-fill alert-icon"></i>Dark Mode Support</h3>
<p>Bootstrap 5.3+ includes built-in dark mode support. Add this script to automatically detect system preferences:</p>
<div class="bg-light p-3 rounded my-2">
<code>&lt;script&gt;<br>
&nbsp;&nbsp;if (window.matchMedia('(prefers-color-scheme: dark)').matches) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;document.documentElement.setAttribute('data-bs-theme', 'dark');<br>
&nbsp;&nbsp;}<br>
&lt;/script&gt;</code>
</div>
</div>
<div class="alert alert-primary border-start border-4 border-primary">
<h3><i class="bi bi-lightning-fill alert-icon"></i>Scroll to Top Button</h3>
<p>Add a floating button for easy navigation in long documents:</p>
<div class="bg-light p-3 rounded my-2">
<code>&lt;button id="scrollTopBtn" class="btn btn-primary"&gt;<br>
&nbsp;&nbsp;&lt;i class="bi bi-arrow-up-circle"&gt;&lt;/i&gt;<br>
&lt;/button&gt;<br><br>
&lt;script&gt;<br>
&nbsp;&nbsp;window.onscroll = function() {<br>
&nbsp;&nbsp;&nbsp;&nbsp;if (document.documentElement.scrollTop &gt; 300) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;document.getElementById('scrollTopBtn').style.display = 'block';<br>
&nbsp;&nbsp;&nbsp;&nbsp;} else {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;document.getElementById('scrollTopBtn').style.display = 'none';<br>
&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;};<br>
&nbsp;&nbsp;document.getElementById('scrollTopBtn').onclick = function() {<br>
&nbsp;&nbsp;&nbsp;&nbsp;window.scrollTo({top: 0, behavior: 'smooth'});<br>
&nbsp;&nbsp;};<br>
&lt;/script&gt;</code>
</div>
</div>
</section>
<section id="quality-standards" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-award section-icon"></i>Quality Standards</h2>
<div class="progress mb-4">
<div class="progress-bar bg-success" style="width: 100%">
<i class="bi bi-check-circle-fill"></i> Style Guide Implementation: 100% Complete
</div>
</div>
<div class="row g-4">
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-check-circle"></i> Technical Accuracy
</h3>
<ul class="mb-0">
<li>All code examples must work</li>
<li>All URLs must be valid</li>
<li>All relationships must be correct</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-eye"></i> Clarity and Completeness
</h3>
<ul class="mb-0">
<li>Each section serves a specific purpose</li>
<li>Information is neither duplicated nor missing</li>
<li>Cross-references are accurate</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-stars"></i> Professional Presentation
</h3>
<ul class="mb-0">
<li>Consistent formatting throughout</li>
<li>Clean visual hierarchy</li>
<li>Responsive design for all devices</li>
</ul>
</div>
</div>
</div>
</div>
</section>
<div class="alert alert-secondary border-start border-4 border-secondary">
<p class="mb-0">
<i class="bi bi-award"></i> <strong>This style guide ensures consistent, professional, and maintainable documentation that serves both technical and business needs while supporting the long-term success of your projects.</strong>
</p>
</div>
<!-- Scroll to top button -->
<button id="scrollTopBtn" class="btn btn-primary" title="Scroll to top">
<i class="bi bi-arrow-up-circle"></i>
</button>
</div>
<!-- Bootstrap JS -->
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
<!-- Dark mode support -->
<script>
// Detect system preference and apply dark mode
if (window.matchMedia('(prefers-color-scheme: dark)').matches) {
document.documentElement.setAttribute('data-bs-theme', 'dark');
}
// Listen for changes in system preference
window.matchMedia('(prefers-color-scheme: dark)').addEventListener('change', function(e) {
if (e.matches) {
document.documentElement.setAttribute('data-bs-theme', 'dark');
} else {
document.documentElement.setAttribute('data-bs-theme', 'light');
}
});
// Scroll to top button functionality
window.onscroll = function() {
const scrollBtn = document.getElementById('scrollTopBtn');
if (document.body.scrollTop > 300 || document.documentElement.scrollTop > 300) {
scrollBtn.style.display = 'block';
} else {
scrollBtn.style.display = 'none';
}
};
document.getElementById('scrollTopBtn').onclick = function() {
window.scrollTo({top: 0, behavior: 'smooth'});
};
</script>
</body>
</html>

docs/gitea.md
# Gitea - Git with a Cup of Tea
## Overview
Gitea is a lightweight, self-hosted Git service providing a GitHub-like web interface with repository management, issue tracking, pull requests, and code review capabilities. Deployed on **Rosalind** with PostgreSQL backend on Portia and Memcached caching.
**Host:** rosalind.incus
**Role:** Collaboration (PHP, Go, Node.js runtimes)
**Container Ports:** 22083 (HTTP), 22022 (SSH), 22094 (Metrics)
**External Access:** https://gitea.ouranos.helu.ca/ (via HAProxy on Titania)
**SSH Access:** `ssh -p 22022 git@gitea.ouranos.helu.ca` (TCP passthrough via HAProxy)
## Architecture
```
┌──────────┐ ┌────────────┐ ┌──────────┐ ┌───────────┐
│ Client │─────▶│ HAProxy │─────▶│ Gitea │─────▶│PostgreSQL │
│ │ │ (Titania) │ │(Rosalind)│ │ (Portia) │
└──────────┘ └────────────┘ └──────────┘ └───────────┘
┌───────────┐
│ Memcached │
│ (Local) │
└───────────┘
```
## Deployment
### Playbook
```bash
cd ansible
ansible-playbook gitea/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `gitea/deploy.yml` | Main deployment playbook |
| `gitea/app.ini.j2` | Gitea configuration template |
### Deployment Steps
1. **Install Dependencies**: git, git-lfs, curl, memcached
2. **Create System User**: `git:git` with home directory
3. **Create Directories**: Work dir, data, LFS storage, repository root, logs
4. **Download Gitea Binary**: Latest release from GitHub (architecture-specific)
5. **Template Configuration**: Apply `app.ini.j2` with variables
6. **Create Systemd Service**: Custom service unit for Gitea
7. **Start Service**: Enable and start gitea.service
8. **Configure OAuth2**: Register Casdoor as OpenID Connect provider
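The service unit created in step 6 can be sketched roughly as follows (paths and values mirror the variables documented below; the playbook templates the real unit, so treat this as illustrative):

```ini
; /etc/systemd/system/gitea.service — illustrative sketch, not the templated unit
[Unit]
Description=Gitea (Git with a Cup of Tea)
After=network.target

[Service]
Type=simple
User=git
Group=git
WorkingDirectory=/var/lib/gitea
ExecStart=/usr/local/bin/gitea web --config /etc/gitea/app.ini
Restart=always
Environment=USER=git GITEA_WORK_DIR=/var/lib/gitea

[Install]
WantedBy=multi-user.target
```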
## Configuration
### Key Features
- **Git LFS Support**: Large file storage enabled
- **SSH Server**: Built-in SSH server on port 22022
- **Prometheus Metrics**: Metrics endpoint on port 22094
- **Memcached Caching**: Session and cache storage via local Memcached
- **Repository Settings**: Push-to-create, all units enabled
- **Security**: Argon2 password hashing, reverse proxy trusted
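These features map onto `app.ini` roughly as follows. This is a sketch of the relevant sections using the ports and hosts documented on this page, not the contents of `app.ini.j2`:

```ini
; Sketch of the app.ini sections behind the features above (values illustrative)
[server]
HTTP_PORT = 22083
START_SSH_SERVER = true
SSH_PORT = 22022
LFS_START_SERVER = true

[cache]
ADAPTER = memcache
HOST = 127.0.0.1:11211

[metrics]
ENABLED = true
TOKEN = ; vault_gitea_metrics_token

[security]
PASSWORD_HASH_ALGO = argon2
```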
### Storage Locations
| Path | Purpose | Owner |
|------|---------|-------|
| `/var/lib/gitea` | Working directory | git:git |
| `/var/lib/gitea/data` | Application data | git:git |
| `/var/lib/gitea/data/lfs` | Git LFS objects | git:git |
| `/mnt/dv` | Git repositories | git:git |
| `/var/log/gitea` | Application logs | git:git |
| `/etc/gitea` | Configuration files | root:git |
### Logging
- **Console Output**: Info level to systemd journal
- **File Logs**: `/var/log/gitea/gitea.log`
- **Rotation**: Daily rotation, 7-day retention
- **SSH Logs**: Enabled for debugging
## Access After Deployment
1. **Web Interface**: https://gitea.ouranos.helu.ca/
2. **First-Time Setup**: Create admin account on first visit
3. **Git Clone**:
```bash
git clone https://gitea.ouranos.helu.ca/username/repo.git
```
4. **SSH Clone**:
```bash
git clone git@gitea.ouranos.helu.ca:username/repo.git
```
Note: SSH requires port 22022 configured in `~/.ssh/config`
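A minimal `~/.ssh/config` entry for the non-standard port could look like this, after which the plain `git@gitea.ouranos.helu.ca:username/repo.git` clone form works unchanged:

```
Host gitea.ouranos.helu.ca
    User git
    Port 22022
```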
## Monitoring
### Alloy Configuration
**File:** `ansible/alloy/rosalind/config.alloy.j2`
- **Log Collection**: `/var/log/gitea/gitea.log` → Loki
- **Metrics**: Port 22094 → Prometheus (token-protected)
- **System Metrics**: Process exporter tracks Gitea process
### Metrics Endpoint
- **URL**: `http://rosalind.incus:22094/metrics`
- **Authentication**: Bearer token required (`vault_gitea_metrics_token`)
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
### 1. Database Password
```yaml
vault_gitea_db_password: "YourSecurePassword123!"
```
**Requirements:**
- Minimum 12 characters recommended
- Used by PostgreSQL authentication
### 2. Secret Key (Session Encryption)
```yaml
vault_gitea_secret_key: "RandomString64CharactersLongForSessionCookieEncryptionSecurity123"
```
**Requirements:**
- **Length**: Recommended 64+ characters
- **Format**: Base64 or hex string
- **Generation**:
```bash
openssl rand -base64 48
```
### 3. LFS JWT Secret
```yaml
vault_gitea_lfs_jwt_secret: "AnotherRandomString64CharsForLFSJWTTokenSigning1234567890ABC"
```
**Requirements:**
- **Length**: Recommended 64+ characters
- **Purpose**: Signs JWT tokens for Git LFS authentication
- **Generation**:
```bash
openssl rand -base64 48
```
### 4. Metrics Token
```yaml
vault_gitea_metrics_token: "RandomTokenForPrometheusMetricsAccess123"
```
**Requirements:**
- **Length**: 32+ characters recommended
- **Purpose**: Bearer token for Prometheus scraping
- **Generation**:
```bash
openssl rand -hex 32
```
### 5. OAuth Client ID
```yaml
vault_gitea_oauth_client_id: "gitea-oauth-client"
```
**Requirements:**
- **Purpose**: Client ID for Casdoor OAuth2 application
- **Source**: Must match `clientId` in Casdoor application configuration
### 6. OAuth Client Secret
```yaml
vault_gitea_oauth_client_secret: "YourRandomOAuthSecret123!"
```
**Requirements:**
- **Length**: 32+ characters recommended
- **Purpose**: Client secret for Casdoor OAuth2 authentication
- **Generation**:
```bash
openssl rand -base64 32
```
- **Source**: Must match `clientSecret` in Casdoor application configuration
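The per-secret generation commands above can be combined into a single pass that emits vault-ready YAML. This is a convenience sketch (the helper name `gen_b64` is ours); run it locally and paste the output into the vault file. The OAuth client ID and secret still have to match the Casdoor application configuration.

```bash
#!/bin/sh
# Generate all randomly-derived Gitea secrets and print them as vault YAML.
# gen_b64 N: N random bytes, base64-encoded, newlines stripped.
gen_b64() { openssl rand -base64 "$1" | tr -d '\n'; }

printf 'vault_gitea_db_password: "%s"\n'         "$(gen_b64 24)"
printf 'vault_gitea_secret_key: "%s"\n'          "$(gen_b64 48)"
printf 'vault_gitea_lfs_jwt_secret: "%s"\n'      "$(gen_b64 48)"
printf 'vault_gitea_metrics_token: "%s"\n'       "$(openssl rand -hex 32)"
printf 'vault_gitea_oauth_client_secret: "%s"\n' "$(gen_b64 32)"
```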
## Host Variables
**File:** `ansible/inventory/host_vars/rosalind.incus.yml`
```yaml
# Gitea User and Directories
gitea_user: git
gitea_group: git
gitea_work_dir: /var/lib/gitea
gitea_data_dir: /var/lib/gitea/data
gitea_lfs_dir: /var/lib/gitea/data/lfs
gitea_repo_root: /mnt/dv
gitea_config_file: /etc/gitea/app.ini
# Ports
gitea_web_port: 22083
gitea_ssh_port: 22022
gitea_metrics_port: 22094
# Network
gitea_domain: ouranos.helu.ca
gitea_root_url: https://gitea.ouranos.helu.ca/
# Database Configuration
gitea_db_type: postgres
gitea_db_host: portia.incus
gitea_db_port: 5432
gitea_db_name: gitea
gitea_db_user: gitea
gitea_db_password: "{{vault_gitea_db_password}}"
gitea_db_ssl_mode: disable
# Features
gitea_lfs_enabled: true
gitea_metrics_enabled: true
# Service Settings
gitea_disable_registration: true # Use Casdoor SSO
gitea_require_signin_view: false
# Security (vault secrets)
gitea_secret_key: "{{vault_gitea_secret_key}}"
gitea_lfs_jwt_secret: "{{vault_gitea_lfs_jwt_secret}}"
gitea_metrics_token: "{{vault_gitea_metrics_token}}"
# OAuth2 (Casdoor SSO)
gitea_oauth_enabled: true
gitea_oauth_name: "casdoor"
gitea_oauth_display_name: "Sign in with Casdoor"
gitea_oauth_client_id: "{{vault_gitea_oauth_client_id}}"
gitea_oauth_client_secret: "{{vault_gitea_oauth_client_secret}}"
gitea_oauth_auth_url: "https://id.ouranos.helu.ca/login/oauth/authorize"
gitea_oauth_token_url: "http://titania.incus:22081/api/login/oauth/access_token"
gitea_oauth_userinfo_url: "http://titania.incus:22081/api/userinfo"
gitea_oauth_scopes: "openid profile email"
```
## OAuth2 / Casdoor SSO
Gitea integrates with Casdoor for Single Sign-On using OpenID Connect.
### Architecture
```
┌──────────┐ ┌────────────┐ ┌──────────┐ ┌──────────┐
│ Browser │─────▶│ HAProxy │─────▶│ Gitea │─────▶│ Casdoor │
│ │ │ (Titania) │ │(Rosalind)│ │(Titania) │
└──────────┘ └────────────┘ └──────────┘ └──────────┘
│ │ │
│ 1. Click "Sign in with Casdoor" │ │
│◀─────────────────────────────────────│ │
│ 2. Redirect to Casdoor login │ │
│─────────────────────────────────────────────────────▶│
│ 3. User authenticates │ │
│◀─────────────────────────────────────────────────────│
│ 4. Redirect back with auth code │ │
│─────────────────────────────────────▶│ │
│ │ 5. Exchange code for token
│ │────────────────▶│
│ │◀────────────────│
│ 6. User logged into Gitea │ │
│◀─────────────────────────────────────│ │
```
### Casdoor Application Configuration
A Gitea application is defined in `ansible/casdoor/init_data.json.j2`:
| Setting | Value |
|---------|-------|
| **Name** | `app-gitea` |
| **Client ID** | `vault_gitea_oauth_client_id` |
| **Redirect URI** | `https://gitea.ouranos.helu.ca/user/oauth2/casdoor/callback` |
| **Grant Types** | `authorization_code`, `refresh_token` |
### URL Strategy
| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/...` | User's browser (external) |
| **Token URL** | `http://titania.incus:22081/...` | Gitea server (internal) |
| **Userinfo URL** | `http://titania.incus:22081/...` | Gitea server (internal) |
| **Discovery URL** | `http://titania.incus:22081/.well-known/openid-configuration` | Gitea server (internal) |
The auth URL uses the external HAProxy address because it runs in the user's browser. Token/userinfo URLs use internal addresses for server-to-server communication.
### User Auto-Registration
With `ENABLE_AUTO_REGISTRATION = true` in `[oauth2_client]`, users who authenticate via Casdoor are automatically created in Gitea. Account linking uses `auto` mode to match by email address.
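The auto-registration behaviour described above corresponds to an `[oauth2_client]` section along these lines (a sketch; the deployed values come from `app.ini.j2`):

```ini
[oauth2_client]
ENABLE_AUTO_REGISTRATION = true
ACCOUNT_LINKING = auto
USERNAME = email   ; username source for auto-created accounts (illustrative choice)
```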
### Deployment Order
1. **Deploy Casdoor first** (if not already running):
```bash
ansible-playbook casdoor/deploy.yml
```
2. **Deploy Gitea** (registers OAuth provider):
```bash
ansible-playbook gitea/deploy.yml
```
### Verify OAuth Configuration
```bash
# List authentication sources
ssh rosalind.incus "sudo -u git /usr/local/bin/gitea admin auth list --config /etc/gitea/app.ini"
# Should show: casdoor (OpenID Connect)
```
## Database Setup
Gitea requires a PostgreSQL database on Portia. This is automatically created by the `postgresql/deploy.yml` playbook.
**Database Details:**
- **Name**: gitea
- **User**: gitea
- **Owner**: gitea
- **Extensions**: None required
## Integration with Other Services
### HAProxy Routing
**Backend Configuration** (`titania.incus.yml`):
```yaml
- subdomain: "gitea"
backend_host: "rosalind.incus"
backend_port: 22083
health_path: "/api/healthz"
timeout_server: 120s
```
### Memcached Integration
- **Host**: localhost:11211
- **Session Prefix**: N/A (Memcache adapter doesn't require prefix)
- **Cache Prefix**: N/A
### Prometheus Monitoring
- **Scrape Target**: `rosalind.incus:22094`
- **Job Name**: gitea
- **Authentication**: Bearer token
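In Prometheus terms the scrape job above looks roughly like this (a sketch; the actual job is templated in the monitoring stack's configuration):

```yaml
scrape_configs:
  - job_name: gitea
    scheme: http
    static_configs:
      - targets: ["rosalind.incus:22094"]
    authorization:
      type: Bearer
      credentials: "<vault_gitea_metrics_token>"   # placeholder, not the real token
```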
## Troubleshooting
### Service Status
```bash
ssh rosalind.incus
sudo systemctl status gitea
```
### View Logs
```bash
# Application logs
sudo tail -f /var/log/gitea/gitea.log
# Systemd journal
sudo journalctl -u gitea -f
```
### Test Database Connection
```bash
psql -h portia.incus -U gitea -d gitea
```
### Check Memcached
```bash
echo "stats" | nc localhost 11211
```
### Verify Metrics Endpoint
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:22094/metrics
```
## Version Information
- **Installation Method**: Binary download from GitHub releases
- **Version Selection**: Latest stable release (dynamic)
- **Update Process**: Re-run deployment playbook to fetch latest binary
- **Architecture**: linux-amd64
## References
- **Official Documentation**: https://docs.gitea.com/
- **GitHub Repository**: https://github.com/go-gitea/gitea
- **Configuration Reference**: https://docs.gitea.com/administration/config-cheat-sheet

docs/gitea_mcp.md
# Gitea MCP Server - Red Panda Approved™
Model Context Protocol (MCP) server providing programmatic access to Gitea repositories, issues, and pull requests. Deployed as a Docker container on Miranda (MCP Docker Host) in the Agathos sandbox.
---
## Overview
The Gitea MCP Server exposes Gitea's functionality through the MCP protocol, enabling AI assistants and automation tools to interact with Git repositories, issues, pull requests, and other Gitea features.
| Property | Value |
|----------|-------|
| **Host** | Miranda (10.10.0.156) |
| **Service Port** | 25535 |
| **Container Port** | 8000 |
| **Transport** | HTTP |
| **Image** | `docker.gitea.com/gitea-mcp-server:latest` |
| **Gitea Instance** | https://gitea.ouranos.helu.ca |
| **Logging** | Syslog to port 51435 → Alloy → Loki |
### Purpose
- **Repository Operations**: Clone, read, and analyze repository contents
- **Issue Management**: Create, read, update, and search issues
- **Pull Request Workflow**: Manage PRs, reviews, and merges
- **Code Search**: Search across repositories and file contents
- **User/Organization Info**: Query user profiles and organization details
### Integration Points
```
AI Assistant (Cline/Claude Desktop)
↓ (MCP Protocol)
MCP Switchboard (Oberon)
↓ (HTTP)
Gitea MCP Server (Miranda:25535)
↓ (Gitea API)
Gitea Instance (Rosalind:22083)
```
---
## Architecture
### Deployment Model
**Container-Based**: Single Docker container managed via Docker Compose
**Directory Structure**:
```
/srv/gitea_mcp/
└── docker-compose.yml # Container orchestration
```
**System Integration**:
- **User/Group**: `gitea_mcp:gitea_mcp` (system user)
- **Ansible User Access**: Remote user added to gitea_mcp group
- **Permissions**: Directory mode 750, compose file mode 550
### Network Configuration
| Component | Port | Protocol | Purpose |
|-----------|------|----------|---------|
| External Access | 25535 | HTTP | MCP protocol endpoint |
| Container Internal | 8000 | HTTP | Service listening port |
| Syslog | 51435 | TCP | Log forwarding to Alloy |
### Logging Pipeline
```
Gitea MCP Container
↓ (Docker syslog driver)
Local Syslog (127.0.0.1:51435)
↓ (Alloy collection)
Loki (Prospero)
↓ (Grafana queries)
Grafana Dashboards
```
**Log Format**: RFC5424 (syslog_format variable)
**Log Tag**: `gitea-mcp`
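Putting the network and logging settings together, the templated `docker-compose.yml` looks roughly like this. It is a sketch assembled from the values on this page, not the actual template; the environment variable names are assumptions about the image's configuration interface:

```yaml
services:
  gitea-mcp:
    image: docker.gitea.com/gitea-mcp-server:latest
    container_name: gitea-mcp
    restart: unless-stopped
    ports:
      - "25535:8000"                     # gitea_mcp_port -> container port
    environment:
      GITEA_HOST: "https://gitea.ouranos.helu.ca"
      GITEA_ACCESS_TOKEN: "<vault_gitea_mcp_access_token>"   # placeholder
    logging:
      driver: syslog
      options:
        syslog-address: "tcp://127.0.0.1:51435"
        syslog-format: rfc5424
        tag: gitea-mcp
```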
---
## Prerequisites
### Infrastructure Requirements
1. **Miranda Host**: Docker engine installed and running
2. **Gitea Instance**: Accessible Gitea server (gitea.ouranos.helu.ca)
3. **Access Token**: Gitea personal access token with required permissions
4. **Monitoring Stack**: Alloy configured for syslog collection (port 51435)
### Required Permissions
**Gitea Access Token Scopes**:
- `repo`: Full repository access (read/write)
- `user`: Read user information
- `org`: Read organization information
- `issue`: Manage issues
- `pull_request`: Manage pull requests
**Token Creation**:
1. Log into Gitea → User Settings → Applications
2. Generate New Token → Select scopes
3. Copy token (shown only once)
4. Store in Ansible Vault as `vault_gitea_mcp_access_token`
### Ansible Dependencies
- `community.docker.docker_compose_v2` collection
- Docker Python SDK on Miranda
- Ansible Vault configured with password file
---
## Configuration
### Host Variables
All configuration is defined in `ansible/inventory/host_vars/miranda.incus.yml`:
```yaml
services:
- gitea_mcp # Enable service on this host
# Gitea MCP Configuration
gitea_mcp_user: gitea_mcp
gitea_mcp_group: gitea_mcp
gitea_mcp_directory: /srv/gitea_mcp
gitea_mcp_port: 25535
gitea_mcp_host: https://gitea.ouranos.helu.ca
gitea_mcp_access_token: "{{ vault_gitea_mcp_access_token }}"
gitea_mcp_syslog_port: 51435
```
### Variable Reference
| Variable | Purpose | Example |
|----------|---------|---------|
| `gitea_mcp_user` | Service system user | `gitea_mcp` |
| `gitea_mcp_group` | Service system group | `gitea_mcp` |
| `gitea_mcp_directory` | Service root directory | `/srv/gitea_mcp` |
| `gitea_mcp_port` | External port binding | `25535` |
| `gitea_mcp_host` | Gitea instance URL | `https://gitea.ouranos.helu.ca` |
| `gitea_mcp_access_token` | Gitea API token (vault) | `{{ vault_gitea_mcp_access_token }}` |
| `gitea_mcp_syslog_port` | Local syslog port | `51435` |
### Vault Configuration
Store the Gitea access token securely in `ansible/inventory/group_vars/all/vault.yml`:
```yaml
---
# Gitea MCP Server Access Token
vault_gitea_mcp_access_token: "your_gitea_access_token_here"
```
**Encrypt vault file**:
```bash
ansible-vault encrypt ansible/inventory/group_vars/all/vault.yml
```
**Edit vault file**:
```bash
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
```
---
## Deployment
### Initial Deployment
**Prerequisites Check**:
```bash
# Verify Miranda has Docker
ansible miranda.incus -m command -a "docker --version"
# Verify Miranda is in inventory
ansible miranda.incus -m ping
# Check Gitea accessibility
curl -I https://gitea.ouranos.helu.ca
```
**Deploy Service**:
```bash
cd ansible/
# Deploy only Gitea MCP service
ansible-playbook gitea_mcp/deploy.yml
# Or deploy as part of full stack
ansible-playbook site.yml
```
**Deployment Process**:
1. ✓ Check service is enabled in host's `services` list
2. ✓ Create gitea_mcp system user and group
3. ✓ Add Ansible remote user to gitea_mcp group
4. ✓ Create /srv/gitea_mcp directory (mode 750)
5. ✓ Template docker-compose.yml (mode 550)
6. ✓ Reset SSH connection (apply group changes)
7. ✓ Start Docker container via docker-compose
### Deployment Output
**Expected Success**:
```
PLAY [Deploy Gitea MCP Server with Docker Compose] ****************************
TASK [Check if host has gitea_mcp service] ************************************
ok: [miranda.incus]
TASK [Create gitea_mcp group] *************************************************
changed: [miranda.incus]
TASK [Create gitea_mcp user] **************************************************
changed: [miranda.incus]
TASK [Add group gitea_mcp to Ansible remote_user] *****************************
changed: [miranda.incus]
TASK [Create gitea_mcp directory] *********************************************
changed: [miranda.incus]
TASK [Template docker-compose file] *******************************************
changed: [miranda.incus]
TASK [Reset SSH connection to apply group changes] ****************************
changed: [miranda.incus]
TASK [Start Gitea MCP service] ************************************************
changed: [miranda.incus]
PLAY RECAP ********************************************************************
miranda.incus : ok=8 changed=7 unreachable=0 failed=0
```
---
## Verification
### Container Status
**Check container is running**:
```bash
# Via Ansible
ansible miranda.incus -m command -a "docker ps | grep gitea-mcp"
# Direct SSH
ssh miranda.incus
docker ps | grep gitea-mcp
```
**Expected Output**:
```
CONTAINER ID IMAGE STATUS PORTS
abc123def456 docker.gitea.com/gitea-mcp-server:latest Up 2 minutes 0.0.0.0:25535->8000/tcp
```
### Service Connectivity
**Test MCP endpoint**:
```bash
# From Miranda
curl -v http://localhost:25535
# From other hosts
curl -v http://miranda.incus:25535
```
**Expected Response**: HTTP response indicating MCP server is listening
### Log Inspection
**Docker logs**:
```bash
ssh miranda.incus
docker logs gitea-mcp
```
**Centralized logs via Loki**:
```bash
# Via logcli (if installed)
logcli query '{job="syslog", container_name="gitea-mcp"}' --limit=50
# Via Grafana Explore
# Navigate to: https://grafana.ouranos.helu.ca
# Select Loki datasource
# Query: {job="syslog", container_name="gitea-mcp"}
```
### Functional Testing
**Test Gitea API access**:
```bash
# Enter container
ssh miranda.incus
docker exec -it gitea-mcp sh
# Test Gitea API connectivity (if curl available in container)
# Note: Container may not have shell utilities
```
**MCP Protocol Test** (from client):
```bash
# Using MCP inspector or client tool
mcp connect http://miranda.incus:25535
# Or test via MCP Switchboard
curl -X POST http://oberon.incus:22781/mcp/invoke \
-H "Content-Type: application/json" \
-d '{"server":"gitea","method":"list_repositories"}'
```
---
## Management
### Updating the Service
**Update container image**:
```bash
cd ansible/
# Re-run deployment (pulls latest image)
ansible-playbook gitea_mcp/deploy.yml
```
**Docker Compose will**:
1. Pull latest `docker.gitea.com/gitea-mcp-server:latest` image
2. Recreate container if image changed
3. Preserve configuration from docker-compose.yml
### Restarting the Service
**Via Docker Compose**:
```bash
ssh miranda.incus
cd /srv/gitea_mcp
docker compose restart
```
**Via Docker**:
```bash
ssh miranda.incus
docker restart gitea-mcp
```
**Via Ansible** (re-run deployment):
```bash
ansible-playbook gitea_mcp/deploy.yml
```
### Removing the Service
**Complete removal**:
```bash
cd ansible/
ansible-playbook gitea_mcp/remove.yml
```
**The remove playbook will**:
1. Stop and remove Docker containers
2. Remove Docker volumes
3. Remove Docker images
4. Prune unused Docker images
5. Remove /srv/gitea_mcp directory
**Manual cleanup** (if needed):
```bash
ssh miranda.incus
# Stop and remove container
cd /srv/gitea_mcp
docker compose down -v --rmi all
# Remove directory
sudo rm -rf /srv/gitea_mcp
# Remove user/group (optional)
sudo userdel gitea_mcp
sudo groupdel gitea_mcp
```
### Configuration Changes
**Update Gitea host or port**:
1. Edit `ansible/inventory/host_vars/miranda.incus.yml`
2. Modify `gitea_mcp_host` or `gitea_mcp_port`
3. Re-run deployment: `ansible-playbook gitea_mcp/deploy.yml`
**Rotate access token**:
1. Generate new token in Gitea
2. Update vault: `ansible-vault edit ansible/inventory/group_vars/all/vault.yml`
3. Update `vault_gitea_mcp_access_token` value
4. Re-run deployment to update environment variable
---
## Troubleshooting
### Container Won't Start
**Symptom**: Container exits immediately or won't start
**Diagnosis**:
```bash
ssh miranda.incus
# Check container logs
docker logs gitea-mcp
# Check container status
docker ps -a | grep gitea-mcp
# Inspect container
docker inspect gitea-mcp
```
**Common Causes**:
- **Invalid Access Token**: Check `GITEA_ACCESS_TOKEN` in docker-compose.yml
- **Gitea Host Unreachable**: Verify `GITEA_HOST` is accessible from Miranda
- **Port Conflict**: Check if port 25535 is already in use
- **Image Pull Failure**: Check Docker registry connectivity
**Solutions**:
```bash
# Test Gitea connectivity
curl -I https://gitea.ouranos.helu.ca
# Check port availability
ss -tlnp | grep 25535
# Pull image manually
docker pull docker.gitea.com/gitea-mcp-server:latest
# Re-run deployment with verbose logging
ansible-playbook gitea_mcp/deploy.yml -vv
```
### Authentication Errors
**Symptom**: "401 Unauthorized" or "403 Forbidden" in logs
**Diagnosis**:
```bash
# Check token is correctly passed
ssh miranda.incus
docker exec gitea-mcp env | grep GITEA_ACCESS_TOKEN
# Test token manually
TOKEN="your_token_here"
curl -H "Authorization: token $TOKEN" https://gitea.ouranos.helu.ca/api/v1/user
```
**Solutions**:
1. Verify token scopes in Gitea (repo, user, org, issue, pull_request)
2. Regenerate token if expired or revoked
3. Update vault with new token
4. Re-run deployment
### Network Connectivity Issues
**Symptom**: Cannot connect to Gitea or MCP endpoint unreachable
**Diagnosis**:
```bash
# Test Gitea from Miranda
ssh miranda.incus
curl -v https://gitea.ouranos.helu.ca
# Test MCP endpoint from other hosts
curl -v http://miranda.incus:25535
# Check Docker network
docker network inspect bridge
```
**Solutions**:
- Verify Miranda can resolve and reach `gitea.ouranos.helu.ca`
- Check firewall rules on Miranda
- Verify port 25535 is not blocked
- Check Docker network configuration
### Logs Not Appearing in Loki
**Symptom**: No logs in Grafana from gitea-mcp container
**Diagnosis**:
```bash
# Check Alloy is listening on syslog port
ssh miranda.incus
ss -tlnp | grep 51435
# Check Alloy configuration
sudo systemctl status alloy
# Verify syslog driver is configured
docker inspect gitea-mcp | grep -A 10 LogConfig
```
**Solutions**:
1. Verify Alloy is running: `sudo systemctl status alloy`
2. Check Alloy syslog source configuration
3. Verify `gitea_mcp_syslog_port` matches Alloy config
4. Restart Alloy: `sudo systemctl restart alloy`
5. Restart container to reconnect syslog
### Permission Denied Errors
**Symptom**: Cannot access /srv/gitea_mcp or docker-compose.yml
**Diagnosis**:
```bash
ssh miranda.incus
# Check directory permissions
ls -la /srv/gitea_mcp
# Check user group membership
groups # Should show gitea_mcp group
# Check file ownership
ls -la /srv/gitea_mcp/docker-compose.yml
```
**Solutions**:
```bash
# Re-run deployment to fix permissions
ansible-playbook gitea_mcp/deploy.yml
# Manually fix if needed
sudo chown -R gitea_mcp:gitea_mcp /srv/gitea_mcp
sudo chmod 750 /srv/gitea_mcp
sudo chmod 550 /srv/gitea_mcp/docker-compose.yml
# Re-login to apply group changes
exit
ssh miranda.incus
```
### MCP Switchboard Integration Issues
**Symptom**: Switchboard cannot connect to Gitea MCP server
**Diagnosis**:
```bash
# Check switchboard configuration
ssh oberon.incus
cat /srv/mcp-switchboard/config.json | jq '.servers.gitea'
# Test connectivity from Oberon
curl -v http://miranda.incus:25535
```
**Solutions**:
1. Verify Gitea MCP server URL in switchboard config
2. Check network connectivity: Oberon → Miranda
3. Verify port 25535 is accessible
4. Restart MCP Switchboard after config changes
---
## MCP Protocol Integration
### Server Capabilities
The Gitea MCP Server exposes these resources and tools via the MCP protocol:
**Resources**:
- Repository information
- File contents
- Issue details
- Pull request data
- User profiles
- Organization information
**Tools**:
- `list_repositories`: List accessible repositories
- `get_repository`: Get repository details
- `list_issues`: Search and list issues
- `create_issue`: Create new issue
- `update_issue`: Modify existing issue
- `list_pull_requests`: List PRs in repository
- `create_pull_request`: Open new PR
- `search_code`: Search code across repositories
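In practice, invoking any of these tools through the Switchboard is a single HTTP POST with a small JSON body. A minimal Python sketch (the URL and payload shape mirror the curl example earlier in this document; `build_invoke` and `invoke` are illustrative helpers, not part of any SDK):

```python
# Sketch: invoke a Gitea MCP tool via the MCP Switchboard on Oberon.
# The endpoint and payload shape mirror the curl example in this doc;
# these helpers are illustrative, not part of any SDK.
import json
import urllib.request

SWITCHBOARD_URL = "http://oberon.incus:22781/mcp/invoke"

def build_invoke(server, method, params=None):
    # Payload shape accepted by the Switchboard's /mcp/invoke endpoint
    payload = {"server": server, "method": method}
    if params:
        payload["params"] = params
    return payload

def invoke(server, method, params=None):
    # POST the request and decode the JSON response
    body = json.dumps(build_invoke(server, method, params)).encode()
    req = urllib.request.Request(
        SWITCHBOARD_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Example: repos = invoke("gitea", "list_repositories")
```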
### Switchboard Configuration
**MCP Switchboard** on Oberon routes MCP requests to Gitea MCP Server.
**Configuration** (`/srv/mcp-switchboard/config.json`):
```json
{
"servers": {
"gitea": {
"command": null,
"args": [],
"url": "http://miranda.incus:25535",
"transport": "http"
}
}
}
```
### Client Usage
**From AI Assistant** (Claude Desktop, Cline, etc.):
The assistant can interact with Gitea repositories through natural language:
- "List all repositories in the organization"
- "Show me open issues in the agathos repository"
- "Create an issue about improving documentation"
- "Search for 'ansible' in repository code"
**Direct MCP Client**:
```http
POST http://oberon.incus:22781/mcp/invoke
Content-Type: application/json

{
  "server": "gitea",
  "method": "list_repositories",
  "params": {}
}
```
---
## Security Considerations
### Access Token Management
**Best Practices**:
- Store token in Ansible Vault (never in plain text)
- Use minimum required scopes for token
- Rotate tokens periodically
- Revoke tokens when no longer needed
- Use separate tokens for different services
**Token Rotation**:
```bash
# 1. Generate new token in Gitea
# 2. Update vault
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
# 3. Re-deploy to update environment variable
ansible-playbook gitea_mcp/deploy.yml
# 4. Revoke old token in Gitea
```
### Network Security
**Isolation**:
- Service only accessible within Incus network (10.10.0.0/24)
- No direct external exposure (proxied through Switchboard)
- TLS handled by HAProxy (upstream) for external access
**Access Control**:
- Gitea enforces user/repository permissions
- MCP protocol authenticated by Switchboard
- Container runs as non-root user
### Audit and Monitoring
**Logging**:
- All requests logged to Loki via syslog
- Grafana dashboards for monitoring access patterns
- Alert on authentication failures
**Monitoring Queries**:
```logql
# All Gitea MCP logs
{job="syslog", container_name="gitea-mcp"}
# Authentication errors
{job="syslog", container_name="gitea-mcp"} |~ "401|403"
# Error rate
rate({job="syslog", container_name="gitea-mcp"} |= "error" [5m])
```
---
## Performance Considerations
### Resource Usage
**Container Resources**:
- **Memory**: ~50-100 MB baseline
- **CPU**: Minimal (< 1% idle, spikes during API calls)
- **Disk**: ~100 MB for image, minimal runtime storage
**Scaling Considerations**:
- Single container sufficient for development/sandbox
- For production: Consider multiple replicas behind load balancer
- Gitea API rate limits apply to token (typically 5000 requests/hour)
### Optimization
**Caching**:
- Gitea MCP Server may cache repository metadata
- Restart container to clear cache if needed
**Connection Pooling**:
- Server maintains connection pool to Gitea API
- Reuses connections for better performance
---
## Related Documentation
### Agathos Infrastructure
- [Agathos Overview](agathos.md) - Complete infrastructure documentation
- [Ansible Best Practices](ansible.md) - Deployment patterns and structure
- [Miranda Host](agathos.md#miranda---mcp-docker-host) - MCP Docker host details
### Related Services
- [Gitea Service](gitea.md) - Gitea server deployment and configuration
- [MCP Switchboard](../ansible/mcp_switchboard/README.md) - MCP request routing
- [Grafana MCP](grafana_mcp.md) - Similar MCP server deployment
### External References
- [Gitea API Documentation](https://docs.gitea.com/api/1.21/) - Gitea REST API reference
- [Model Context Protocol Specification](https://spec.modelcontextprotocol.io/) - MCP protocol details
- [Gitea MCP Server Repository](https://gitea.com/gitea/mcp-server) - Upstream project
- [Docker Compose Documentation](https://docs.docker.com/compose/) - Container orchestration
---
## Maintenance Schedule
**Regular Tasks**:
- **Weekly**: Review logs for errors or anomalies
- **Monthly**: Update container image to latest version
- **Quarterly**: Rotate Gitea access token
- **As Needed**: Review and adjust token permissions
**Update Procedure**:
```bash
# Pull latest image and restart
ansible-playbook gitea_mcp/deploy.yml
# Verify new version
ssh miranda.incus
docker inspect gitea-mcp | jq '.[0].Config.Image'
```
---
**Last Updated**: February 2026
**Project**: Agathos Infrastructure
**Host**: Miranda (MCP Docker Host)
**Status**: Red Panda Approved™ ✓

docs/gitea_runner.md

@@ -0,0 +1,200 @@
# Gitea Act Runner
## Overview
Gitea Actions is Gitea's built-in CI/CD system, compatible with GitHub Actions workflows. The **Act Runner** is the agent that executes these workflows. It picks up jobs from a Gitea instance, spins up Docker containers for each workflow step, runs the commands, and reports results back.
The name "act" comes from [nektos/act](https://github.com/nektos/act), an open-source tool originally built to run GitHub Actions locally. Gitea forked and adapted it into their runner, so `act_runner` is a lineage artifact — the binary keeps the upstream name, but everything else in our infrastructure uses `gitea-runner`.
### How it works
1. The runner daemon polls the Gitea instance for queued workflow jobs
2. When a job is picked up, the runner pulls the Docker image specified by the workflow label (e.g., `ubuntu-24.04` maps to `docker.gitea.com/runner-images:ubuntu-24.04`)
3. Each workflow step executes inside an ephemeral container
4. Logs and status are streamed back to Gitea in real time
5. The container is destroyed after the job completes
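This lifecycle maps directly onto a workflow file. An illustrative `.gitea/workflows/ci.yml` (the `runs-on` value must match a label configured on the runner; see the labels table in this document):

```yaml
# Illustrative workflow; runs-on must match a label configured on the runner
name: ci
on: [push]

jobs:
  build:
    runs-on: ubuntu-24.04  # resolved to docker.gitea.com/runner-images:ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - run: echo "each step runs inside the ephemeral job container"
```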
### Architecture in Agathos
```
Gitea (Rosalind)                    Act Runner (Puck)
┌────────────────┐                  ┌───────────────────┐
│ gitea.ouranos  │   poll/report    │ act_runner daemon │
│ .helu.ca       │◄─────────────────│ (gitea-runner)    │
└────────────────┘                  └─────────┬─────────┘
                                              │ spawns
                                    ┌─────────▼─────────┐
                                    │ Docker containers │
                                    │ (workflow steps)  │
                                    └───────────────────┘
```
### Naming conventions
The **binary** is `act_runner` — that's the upstream package name and renaming it would break updates. Everything else uses `gitea-runner`:
| Component | Name |
|-----------|------|
| Binary | `/usr/local/bin/act_runner` (upstream, don't rename) |
| Service account | `gitea-runner` |
| Home directory | `/srv/gitea-runner/` |
| Config file | `/srv/gitea-runner/config.yaml` |
| Registration state | `/srv/gitea-runner/.runner` (created by registration) |
| Systemd service | `gitea-runner.service` |
| Runner name | `puck-runner` (shown in Gitea UI) |
---
## Ansible Deployment
The runner is deployed via the `gitea_runner` Ansible service to **Puck** (application runtime host with Docker already available).
### Prerequisites
- Docker must be installed on the target host (`docker` in services list)
- Gitea must be running and accessible at `https://gitea.ouranos.helu.ca`
### Deploy
```bash
# Deploy to all hosts with gitea_runner in their services list
ansible-playbook gitea_runner/deploy.yml
# Dry run (skip registration prompt)
ansible-playbook gitea_runner/deploy.yml --check
# Limit to a specific host
ansible-playbook gitea_runner/deploy.yml --limit puck.incus
# Non-interactive mode (for CI/CD)
ansible-playbook gitea_runner/deploy.yml -e registration_token=YOUR_TOKEN
```
The playbook is also included in the full-stack deployment via `site.yml`, running after the Gitea playbook.
**Registration Prompt**: On first deployment, the playbook will pause and prompt for a registration token. Get the token from `https://gitea.ouranos.helu.ca/-/admin/runners` before running the playbook.
### What the playbook does
1. Filters hosts — only runs on hosts with `gitea_runner` in their `services` list
2. Creates `gitea-runner` system group and user (added to `docker` group)
3. Downloads `act_runner` binary from Gitea releases (version pinned as `act_runner_version` in `group_vars/all/vars.yml`)
4. Skips download if the installed version already matches (idempotent)
5. Copies the managed `config.yaml` from the Ansible controller (edit `ansible/gitea_runner/config.yaml` to change runner settings)
6. Templates `gitea-runner.service` systemd unit
7. **Registers the runner** — prompts for registration token on first deployment
8. Enables and starts the service
### Systemd unit
```ini
# /etc/systemd/system/gitea-runner.service
[Unit]
Description=Gitea Runner
After=network.target docker.service
Requires=docker.service
[Service]
Type=simple
User=gitea-runner
Group=gitea-runner
WorkingDirectory=/srv/gitea-runner
ExecStart=/usr/local/bin/act_runner daemon --config /srv/gitea-runner/config.yaml
Restart=on-failure
RestartSec=10
Environment=HOME=/srv/gitea-runner
[Install]
WantedBy=multi-user.target
```
### Registration Flow
On first deployment, the playbook will automatically prompt for a registration token:
```
TASK [Prompt for registration token]
Gitea runner registration required.
Get token from: https://gitea.ouranos.helu.ca/-/admin/runners
Enter registration token:
[Enter token here]
```
**Steps**:
1. Before running the playbook, obtain a registration token:
- Navigate to `https://gitea.ouranos.helu.ca/-/admin/runners`
- Click "Create new Runner"
- Copy the displayed token
2. Run the deployment playbook
3. Paste the token when prompted
The registration is **idempotent** — if the runner is already registered (`.runner` file exists), the prompt is skipped.
**Non-interactive mode**: Pass the token as an extra variable:
```bash
ansible-playbook gitea_runner/deploy.yml -e registration_token=YOUR_TOKEN
```
**Manual registration** (if needed): The traditional method still works if you prefer manual control. Labels are picked up from `config.yaml` at daemon start, so `--labels` is not needed at registration:
```bash
ssh puck.incus
sudo -iu gitea-runner
act_runner register \
--instance https://gitea.ouranos.helu.ca \
--token <token> \
--name puck-runner \
--no-interactive
```
### Verify
```bash
# Check service status
sudo systemctl status gitea-runner
# Check runner version
act_runner --version
# View runner logs
sudo journalctl -u gitea-runner -f
```
`puck-runner` should show as **online** at `https://gitea.ouranos.helu.ca/-/admin/runners`.
### Runner labels
Labels map workflow `runs-on` values to Docker images. They are configured in `ansible/gitea_runner/config.yaml` under `runner.labels`:
| Label | Docker Image | Use case |
|-------|-------------|----------|
| `ubuntu-latest` | `docker.gitea.com/runner-images:ubuntu-latest` | General CI (Gitea official image) |
| `ubuntu-24.04` | `docker.gitea.com/runner-images:ubuntu-24.04` | Ubuntu 24.04 builds |
| `ubuntu-22.04` | `docker.gitea.com/runner-images:ubuntu-22.04` | Ubuntu 22.04 builds |
| `ubuntu-20.04` | `docker.gitea.com/runner-images:ubuntu-20.04` | Ubuntu 20.04 builds |
| `node-24` | `node:24-bookworm` | Node.js CI |
To add or change labels, edit `ansible/gitea_runner/config.yaml` and re-run the playbook.
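For reference, act_runner's `config.yaml` expresses labels as `<name>:docker://<image>` strings. An illustrative excerpt (assumed shape; consult `ansible/gitea_runner/config.yaml` for the authoritative list):

```yaml
# ansible/gitea_runner/config.yaml (illustrative excerpt)
runner:
  labels:
    - "ubuntu-latest:docker://docker.gitea.com/runner-images:ubuntu-latest"
    - "ubuntu-24.04:docker://docker.gitea.com/runner-images:ubuntu-24.04"
    - "node-24:docker://node:24-bookworm"
```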
### Configuration reference
| Variable | Location | Value |
|----------|----------|-------|
| `act_runner_version` | `group_vars/all/vars.yml` | `0.2.13` |
| `gitea_runner_instance_url` | `group_vars/all/vars.yml` | `https://gitea.ouranos.helu.ca` |
| `gitea_runner_name` | `host_vars/puck.incus.yml` | `puck-runner` |
| Runner labels | `ansible/gitea_runner/config.yaml` | See `runner.labels` section |
### Upgrading
To upgrade the runner binary, update `act_runner_version` in `group_vars/all/vars.yml` and re-run the playbook:
```bash
# Edit the version
vim inventory/group_vars/all/vars.yml
# act_runner_version: "0.2.14"
# Re-deploy — only the binary download and service restart will trigger
ansible-playbook gitea_runner/deploy.yml
```

docs/github_mcp.md

@@ -0,0 +1,344 @@
# GitHub MCP Server
## Overview
The GitHub MCP server provides read-only access to GitHub repositories through the Model Context Protocol (MCP). It enables AI assistants and other MCP clients to explore repository contents, search code, read issues, and analyze pull requests without requiring local clones.
**Deployment Host:** miranda.incus (10.10.0.156)
**Port:** 25533 (HTTP MCP endpoint)
**MCPO Proxy:** http://miranda.incus:25530/github
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                         MCP CLIENTS                         │
│   VS Code/Cline   │   OpenWebUI   │   Custom Applications   │
└─────────────────────────┬───────────────────────────────────┘
              ┌───────────┴───────────────┐
              │                           │
              ▼                           ▼
   Direct MCP (port 25533)     MCPO Proxy (port 25530)
       streamable-http          OpenAI-compatible API
              │                           │
              └───────────┬───────────────┘
                          ▼
               ┌────────────────────┐
               │ GitHub MCP Server  │
               │ Docker Container   │
               │ miranda.incus      │
               └──────────┬─────────┘
                          ▼
               ┌────────────────────┐
               │ GitHub API         │
               │ (Read-Only PAT)    │
               └────────────────────┘
```
---
## GitHub Personal Access Token
### Required Scopes
The GitHub MCP server requires a **read-only Personal Access Token (PAT)** with the following scopes:
| Scope | Purpose |
|-------|---------|
| `public_repo` | Read access to public repositories |
| `repo` | Read access to private repositories (if needed) |
| `read:org` | Read organization membership and teams |
| `read:user` | Read user profile information |
### Creating a PAT
1. Navigate to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
2. Click "Generate new token (classic)"
3. Set name: `Agathos GitHub MCP - Read Only`
4. Set expiration: Custom or 90 days (recommended)
5. Select scopes: `public_repo`, `read:org`, `read:user`
6. Click "Generate token"
7. Copy the token immediately (it won't be shown again)
8. Store in Ansible vault: `ansible-vault edit ansible/inventory/group_vars/all/vault.yml`
- Add: `vault_github_personal_access_token: "ghp_xxxxxxxxxxxxx"`
---
## Available Tools
The GitHub MCP server provides the following tools:
### Repository Operations
- `get_file_contents` - Read file contents from repository
- `search_repositories` - Search for repositories on GitHub
- `list_commits` - List commits in a repository
- `create_branch` - Create a new branch (requires write access)
- `push_files` - Push files to repository (requires write access)
### Issue Management
- `create_issue` - Create a new issue (requires write access)
- `list_issues` - List issues in a repository
- `get_issue` - Get details of a specific issue
- `update_issue` - Update an issue (requires write access)
### Pull Request Management
- `create_pull_request` - Create a new PR (requires write access)
- `list_pull_requests` - List pull requests in a repository
- `get_pull_request` - Get details of a specific PR
### Search Operations
- `search_code` - Search code across repositories
- `search_users` - Search for GitHub users
**Note:** With a read-only PAT, write operations (`create_*`, `update_*`, `push_*`) will fail. The primary use case is repository exploration and code reading.
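Each tool invocation is a JSON-RPC 2.0 message over the HTTP endpoint. A minimal Python sketch of the envelope (`jsonrpc_request` and `call_tool` are illustrative helpers; a real streamable-http client may additionally need an initialize handshake and specific Accept headers):

```python
# Sketch: JSON-RPC 2.0 envelope for MCP tool calls. Helpers are
# illustrative; real streamable-http servers may also require an
# initialize handshake before tools/call is accepted.
import itertools
import json
import urllib.request

MCP_URL = "http://miranda.incus:25533/mcp"
_ids = itertools.count(1)  # JSON-RPC request ids should be unique per session

def jsonrpc_request(method, params=None):
    # Standard JSON-RPC 2.0 request envelope used by MCP
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg

def call_tool(name, arguments):
    # MCP tool invocation uses method "tools/call" with name + arguments
    body = json.dumps(jsonrpc_request(
        "tools/call", {"name": name, "arguments": arguments})).encode()
    req = urllib.request.Request(MCP_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Example (read-only):
# call_tool("get_file_contents",
#           {"owner": "github", "repo": "docs", "path": "README.md"})
```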
---
## Client Configuration
### MCP Native Clients (Cline, Claude Desktop)
Add the following to your MCP settings (e.g., `~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json`):
```json
{
"mcpServers": {
"github": {
"type": "streamable-http",
"url": "http://miranda.incus:25533/mcp"
}
}
}
```
### OpenWebUI Configuration
1. Navigate to **Settings → Tools → OpenAPI Servers**
2. Click **Add OpenAPI Server**
3. Configure:
- **Name:** GitHub MCP
- **URL:** `http://miranda.incus:25530/github`
- **Authentication:** None (MCPO handles upstream auth)
4. Save and enable desired GitHub tools
### Custom Applications
**Direct MCP Connection:**
```python
import asyncio
import mcp  # illustrative: the exact client API depends on the MCP SDK version

async def main():
    client = mcp.Client("http://miranda.incus:25533/mcp")
    tools = await client.list_tools()

asyncio.run(main())
```
**Via MCPO (OpenAI-compatible):**
```python
import openai
client = openai.OpenAI(
base_url="http://miranda.incus:25530/github",
api_key="not-required" # MCPO doesn't require auth for GitHub MCP
)
```
---
## Deployment
### Prerequisites
- Miranda container running with Docker installed
- Ansible vault containing `vault_github_personal_access_token`
- Network connectivity from clients to miranda.incus
### Deploy GitHub MCP Server
```bash
cd /home/robert/dv/agathos/ansible
ansible-playbook github_mcp/deploy.yml
```
This playbook:
1. Creates `github_mcp` user and group
2. Creates `/srv/github_mcp` directory
3. Templates docker-compose.yml with PAT from vault
4. Starts github-mcp-server container on port 25533
### Update MCPO Configuration
```bash
ansible-playbook mcpo/deploy.yml
```
This restarts MCPO with the updated config including GitHub MCP server.
### Update Alloy Logging
```bash
ansible-playbook alloy/deploy.yml --limit miranda.incus
```
This reconfigures Alloy to collect GitHub MCP server logs.
---
## Verification
### Test Direct MCP Endpoint
```bash
# Check container is running
ssh miranda.incus docker ps | grep github-mcp-server
# Test MCP endpoint responds
curl http://miranda.incus:25533/mcp
# List available tools (expect JSON response)
curl -X POST http://miranda.incus:25533/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'
```
### Test MCPO Proxy
```bash
# List GitHub tools via MCPO
curl http://miranda.incus:25530/github/tools
# Test repository file reading
curl -X POST http://miranda.incus:25530/github/tools/get_file_contents \
-H "Content-Type: application/json" \
-d '{
"owner": "github",
"repo": "docs",
"path": "README.md"
}'
```
### View Logs
```bash
# Container logs
ssh miranda.incus docker logs github-mcp-server
# Loki logs (via Grafana on prospero.incus)
# Navigate to Explore → Loki
# Query: {job="github-mcp-server"}
```
---
## Troubleshooting
### Container Won't Start
**Check Docker Compose:**
```bash
ssh miranda.incus
sudo -u github_mcp docker compose -f /srv/github_mcp/docker-compose.yml logs
```
**Common Issues:**
- Missing or invalid GitHub PAT in vault
- Port 25533 already in use
- Docker image pull failure
### MCP Endpoint Returns Errors
**Check GitHub PAT validity:**
```bash
curl -H "Authorization: token YOUR_PAT" https://api.github.com/user
```
**Verify PAT scopes:**
```bash
curl -i -H "Authorization: token YOUR_PAT" https://api.github.com/user \
| grep X-OAuth-Scopes
```
### MCPO Not Exposing GitHub Tools
**Verify MCPO config:**
```bash
ssh miranda.incus cat /srv/mcpo/config.json | jq '.mcpServers.github'
```
**Restart MCPO:**
```bash
ssh miranda.incus sudo systemctl restart mcpo
ssh miranda.incus sudo systemctl status mcpo
```
---
## Monitoring
### Prometheus Metrics
GitHub MCP server exposes Prometheus metrics (if supported by the container). Add to Prometheus scrape config:
```yaml
scrape_configs:
- job_name: 'github-mcp'
static_configs:
- targets: ['miranda.incus:25533']
```
### Grafana Dashboard
Import or create a dashboard on prospero.incus to visualize:
- Request rate and latency
- GitHub API rate limits
- Tool invocation counts
- Error rates
### Log Queries
Useful Loki queries in Grafana:
```logql
# All GitHub MCP logs
{job="github-mcp-server"}
# Errors only
{job="github-mcp-server"} |~ "(?i)error"
# GitHub API rate limit warnings
{job="github-mcp-server"} |= "rate limit"
# Tool invocations
{job="github-mcp-server"} |= "tool"
```
---
## Security Considerations
**Read-Only PAT** - Server uses minimal scopes, cannot modify repositories
**Network Isolation** - Only accessible within Agathos network (miranda.incus)
**Vault Storage** - PAT stored encrypted in Ansible Vault
**No Public Exposure** - MCP endpoint not exposed to internet
⚠️ **PAT Rotation** - Consider rotating PAT every 90 days
⚠️ **Access Control** - MCPO currently doesn't require authentication
### Recommended Enhancements
1. Add authentication to MCPO endpoints
2. Implement request rate limiting
3. Monitor GitHub API quota usage
4. Set up PAT expiration alerts
5. Restrict network access to miranda via firewall rules
---
## References
- [GitHub MCP Server Repository](https://github.com/github/github-mcp-server)
- [Model Context Protocol Specification](https://modelcontextprotocol.io/)
- [MCPO Documentation](https://github.com/open-webui/mcpo)
- [Agathos README](../../README.md)
- [Agathos Sandbox Documentation](../sandbox.html)

docs/grafana_mcp.md

@@ -0,0 +1,422 @@
# Grafana MCP Server
## Overview
The Grafana MCP server provides AI/LLM access to Grafana dashboards, datasources, and APIs through the Model Context Protocol (MCP). It runs as a Docker container on **Miranda** and connects to the Grafana instance inside the [PPLG stack](pplg.md) on **Prospero** via the internal Incus network.
**Deployment Host:** miranda.incus
**Port:** 25533 (HTTP MCP endpoint)
**MCPO Proxy:** http://miranda.incus:25530/grafana
**Grafana Backend:** http://prospero.incus:3000 (PPLG stack)
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│                             MCP CLIENTS                             │
│   VS Code/Cline  │  OpenWebUI  │  LobeChat  │  Custom Applications  │
└─────────────────────────┬───────────────────────────────────────────┘
              ┌───────────┴───────────────┐
              │                           │
              ▼                           ▼
   Direct MCP (port 25533)     MCPO Proxy (port 25530)
       streamable-http          OpenAI-compatible API
              │                           │
              └───────────┬───────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────────────┐
│                       Miranda (miranda.incus)                       │
│        ┌────────────────────────────────────────────┐               │
│        │  Grafana MCP Server (Docker)               │               │
│        │  mcp/grafana:latest                        │               │
│        │  Container: grafana-mcp                    │               │
│        │  :25533 → :8000                            │               │
│        └─────────────────────┬──────────────────────┘               │
│                              │ HTTP (internal network)              │
└──────────────────────────────┼──────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                Prospero (prospero.incus) — PPLG Stack               │
│        ┌────────────────────────────────────────────┐               │
│        │  Grafana :3000                             │               │
│        │  Authenticated via Service Account Token   │               │
│        └────────────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────────────────┘
```
### Cross-Host Dependency
The Grafana MCP server on Miranda communicates with Grafana on Prospero over the Incus internal network (`prospero.incus:3000`). This means:
- **PPLG must be deployed first** — Grafana must be running before deploying the MCP server
- The connection uses Grafana's **internal HTTP port** (3000), not the external HTTPS endpoint
- Authentication is handled by a **Grafana service account token**, not Casdoor OAuth
## Terraform Resources
### Host Definition
Grafana MCP runs on Miranda, defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | mcp_docker_host |
| Security Nesting | true |
| AppArmor | unconfined |
| Proxy: mcp_containers | `0.0.0.0:25530-25539` → `127.0.0.1:25530-25539` |
### Dependencies
| Resource | Relationship |
|----------|--------------|
| prospero (PPLG) | Grafana backend — service account token auth on `:3000` |
| miranda (MCPO) | MCPO proxies Grafana MCP at `localhost:25533/mcp` |
## Ansible Deployment
### Prerequisites
1. **PPLG stack**: Grafana must be running on Prospero (`ansible-playbook pplg/deploy.yml`)
2. **Docker**: Docker must be installed on the target host (`ansible-playbook docker/deploy.yml`)
3. **Vault Secret**: `vault_grafana_service_account_token` must be set (see [Required Vault Secrets](#required-vault-secrets))
### Playbook
```bash
cd ansible
ansible-playbook grafana_mcp/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `grafana_mcp/deploy.yml` | Main deployment playbook |
| `grafana_mcp/docker-compose.yml.j2` | Docker Compose template for the MCP server |
### Deployment Steps
1. **Pre-flight Check**: Verify Grafana is reachable on Prospero (`/api/health`)
2. **Create System User**: `grafana_mcp:grafana_mcp` system account
3. **Create Directory**: `/srv/grafana_mcp` with restricted permissions (750)
4. **Template Docker Compose**: Renders `docker-compose.yml.j2` with Grafana URL and service account token
5. **Start Container**: `docker compose up` via `community.docker.docker_compose_v2`
6. **Health Check**: Verifies the MCP endpoint is responding on `localhost:25533/mcp`
### Deployment Order
Grafana MCP must be deployed **after** PPLG and **before** MCPO:
```
pplg → docker → grafana_mcp → mcpo
```
This ensures Grafana is available before the MCP server starts, and MCPO can proxy to it.
## Docker Compose Configuration
The container is defined in `grafana_mcp/docker-compose.yml.j2`:
```yaml
services:
grafana-mcp:
image: mcp/grafana:latest
container_name: grafana-mcp
restart: unless-stopped
ports:
- "25533:8000"
environment:
- GRAFANA_URL=http://prospero.incus:3000
- GRAFANA_SERVICE_ACCOUNT_TOKEN=<from vault>
command: ["--transport", "streamable-http", "--address", "0.0.0.0:8000", "--tls-skip-verify"]
logging:
driver: syslog
options:
syslog-address: "tcp://127.0.0.1:51433"
syslog-format: rfc5424
tag: "grafana-mcp"
```
Key configuration:
- **Transport**: `streamable-http` — standard MCP HTTP transport
- **TLS Skip Verify**: A no-op for the plain-HTTP backend URL used here; kept so the same command line keeps working if the backend is ever switched to HTTPS with an internal certificate
- **Syslog**: Logs shipped to Alloy on localhost for forwarding to Loki
## Available Tools
The Grafana MCP server exposes tools for interacting with Grafana's API:
### Dashboard Operations
- Search and list dashboards
- Get dashboard details and panels
- Query panel data
### Datasource Operations
- List configured datasources
- Query datasources directly
### Alerting
- List alert rules
- Get alert rule details and status
### General
- Get Grafana health status
- Search across Grafana resources
> **Note:** The specific tools available depend on the `mcp/grafana` Docker image version. Use the MCPO Swagger docs at `http://miranda.incus:25530/docs` to see the current tool inventory.
## Client Configuration
### MCP Native Clients (Cline, Claude Desktop)
```json
{
"mcpServers": {
"grafana": {
"type": "streamable-http",
"url": "http://miranda.incus:25533/mcp"
}
}
}
```
### Via MCPO (OpenAI-Compatible)
Grafana MCP is automatically available through MCPO at:
```
http://miranda.incus:25530/grafana
```
This endpoint is OpenAI-compatible and can be used by OpenWebUI, LobeChat, or any OpenAI SDK client:
```python
import openai
client = openai.OpenAI(
base_url="http://miranda.incus:25530/grafana",
api_key="not-required"
)
```
### OpenWebUI / LobeChat
1. Navigate to **Settings → Tools → OpenAPI Servers**
2. Click **Add OpenAPI Server**
3. Configure:
- **Name:** Grafana MCP
- **URL:** `http://miranda.incus:25530/grafana`
- **Authentication:** None (MCPO handles upstream auth)
4. Save and enable the Grafana tools
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
| Variable | Purpose |
|----------|---------|
| `vault_grafana_service_account_token` | Grafana service account token for MCP API access |
### Creating a Grafana Service Account Token
1. Log in to Grafana at `https://grafana.ouranos.helu.ca` (Casdoor SSO or local admin)
2. Navigate to **Administration → Service Accounts**
3. Click **Add service account**
- **Name:** `mcp-server`
- **Role:** `Viewer` (or `Editor` if write tools are needed)
4. Click **Add service account token**
- **Name:** `mcp-token`
- **Expiration:** No expiration (or set a rotation schedule)
5. Copy the generated token
6. Store in vault:
```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```
```yaml
vault_grafana_service_account_token: "glsa_xxxxxxxxxxxxxxxxxxxx"
```
## Host Variables
**File:** `ansible/inventory/host_vars/miranda.incus.yml`
```yaml
# Grafana MCP Config
grafana_mcp_user: grafana_mcp
grafana_mcp_group: grafana_mcp
grafana_mcp_directory: /srv/grafana_mcp
grafana_mcp_port: 25533
grafana_mcp_grafana_host: prospero.incus
grafana_mcp_grafana_port: 3000
grafana_service_account_token: "{{ vault_grafana_service_account_token }}"
```
Miranda's services list includes `grafana_mcp`:
```yaml
services:
- alloy
- argos
- docker
- gitea_mcp
- grafana_mcp
- mcpo
- neo4j_mcp
```
## Monitoring
### Syslog to Loki
The Grafana MCP container ships logs via Docker's syslog driver to Alloy on Miranda:
| Server | Syslog Port | Loki Tag |
|--------|-------------|----------|
| grafana-mcp | 51433 | `grafana-mcp` |
### Grafana Log Queries
Useful Loki queries in Grafana Explore:
```logql
# All Grafana MCP logs
{hostname="miranda.incus", job="grafana_mcp"}
# Errors only
{hostname="miranda.incus", job="grafana_mcp"} |~ "(?i)error"
# Tool invocations
{hostname="miranda.incus", job="grafana_mcp"} |= "tool"
```
### MCPO Aggregation
Grafana MCP is registered in MCPO's `config.json` as:
```json
{
  "mcpServers": {
    "grafana": {
      "type": "streamable-http",
      "url": "http://localhost:25533/mcp"
    }
  }
}
```
MCPO exposes it at `http://miranda.incus:25530/grafana` with OpenAI-compatible API and Swagger documentation.
## Operations
### Start / Stop
```bash
ssh miranda.incus
# Docker container
sudo -u grafana_mcp docker compose -f /srv/grafana_mcp/docker-compose.yml up -d
sudo -u grafana_mcp docker compose -f /srv/grafana_mcp/docker-compose.yml down
# Or redeploy via Ansible
cd ansible
ansible-playbook grafana_mcp/deploy.yml
```
### Health Check
```bash
# Container status
ssh miranda.incus docker ps --filter name=grafana-mcp
# MCP endpoint
curl http://miranda.incus:25533/mcp
# Via MCPO
curl http://miranda.incus:25530/grafana/tools
# Grafana backend (from Miranda)
curl http://prospero.incus:3000/api/health
```
### Logs
```bash
# Docker container logs
ssh miranda.incus docker logs -f grafana-mcp
# Loki logs (via Grafana on Prospero)
# Query: {hostname="miranda.incus", job="grafana_mcp"}
```
## Troubleshooting
### Container Won't Start
```bash
ssh miranda.incus
sudo -u grafana_mcp docker compose -f /srv/grafana_mcp/docker-compose.yml logs
```
**Common causes:**
- Grafana on Prospero not running → check `ssh prospero.incus sudo systemctl status grafana-server`
- Invalid or expired service account token → regenerate in Grafana UI
- Port 25533 already in use → `ss -tlnp | grep 25533`
- Docker image pull failure → check Docker Hub access
### MCP Endpoint Returns Errors
**Verify service account token:**
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" http://prospero.incus:3000/api/org
```
**Check container environment:**
```bash
ssh miranda.incus docker inspect grafana-mcp | jq '.[0].Config.Env'
```
### MCPO Not Exposing Grafana Tools
**Verify MCPO config:**
```bash
ssh miranda.incus cat /srv/mcpo/config.json | jq '.mcpServers.grafana'
```
**Restart MCPO:**
```bash
ssh miranda.incus sudo systemctl restart mcpo
```
### Grafana Unreachable from Miranda
**Test network connectivity:**
```bash
ssh miranda.incus curl -s http://prospero.incus:3000/api/health
```
If this fails, check:
- Prospero container is running: `incus list prospero`
- Grafana service is up: `ssh prospero.incus sudo systemctl status grafana-server`
- No firewall rules blocking inter-container traffic
## Security Considerations
**Service Account Token** — Scoped to Viewer role, cannot modify Grafana configuration
**Internal Network** — MCP server only accessible within the Incus network
**Vault Storage** — Token stored encrypted in Ansible Vault
**No Public Exposure** — Neither the MCP endpoint nor the MCPO proxy are internet-facing
⚠️ **Token Rotation** — Consider rotating the service account token periodically
⚠️ **Access Control** — MCPO currently doesn't require authentication for tool access
## References
- [PPLG Stack Documentation](pplg.md) — Grafana deployment on Prospero
- [MCPO Documentation](mcpo.md) — MCP gateway that proxies Grafana MCP
- [Grafana MCP Server](https://github.com/grafana/mcp-grafana) — Upstream project
- [Model Context Protocol Specification](https://modelcontextprotocol.io/)
- [Ansible Practices](ansible.md)
- [Agathos Overview](agathos.md)

# Home Assistant
## Overview
[Home Assistant](https://github.com/home-assistant/core) is an open-source home automation platform. In the Agathos sandbox it runs as a native Python application inside a virtual environment, backed by PostgreSQL for state recording and fronted by HAProxy for TLS termination.
**Host:** Oberon
**Role:** container_orchestration
**Port:** 8123
**URL:** https://hass.ouranos.helu.ca
## Architecture
```
┌──────────┐ HTTPS ┌──────────────┐ HTTP ┌──────────────┐
│ Client │────────▶│ HAProxy │────────▶│ Home │
│ │ │ (Titania) │ │ Assistant │
└──────────┘ │ :443 TLS │ │ (Oberon) │
└──────────────┘ │ :8123 │
└──────┬───────┘
┌─────────────────┼─────────────────┐
│ │ │
┌────▼─────┐ ┌──────▼──────┐ ┌─────▼─────┐
│PostgreSQL│ │ Alloy │ │ Prometheus│
│(Portia) │ │ (Oberon) │ │(Prospero) │
│ :5432 │ │ scrape │ │ remote │
│ recorder │ │ /api/prom │ │ write │
└──────────┘ └─────────────┘ └───────────┘
```
## Ansible Deployment
### Playbook
```bash
cd ansible
ansible-playbook hass/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `hass/deploy.yml` | Main deployment playbook |
| `hass/configuration.yaml.j2` | Home Assistant configuration |
| `hass/requirements.txt.j2` | Python package pinning |
| `hass/hass.service.j2` | Systemd service unit |
### Variables
#### Host Variables (`host_vars/oberon.incus.yml`)
| Variable | Description | Value |
|----------|-------------|-------|
| `hass_user` | System user | `hass` |
| `hass_group` | System group | `hass` |
| `hass_directory` | Install directory | `/srv/hass` |
| `hass_media_directory` | Media storage | `/srv/hass/media` |
| `hass_port` | HTTP listen port | `8123` |
| `hass_version` | Pinned HA release | `2026.2.0` |
| `hass_db_host` | PostgreSQL host | `portia.incus` |
| `hass_db_port` | PostgreSQL port | `5432` |
| `hass_db_name` | Database name | `hass` |
| `hass_db_user` | Database user | `hass` |
| `hass_db_password` | Database password | `{{ vault_hass_db_password }}` |
| `hass_metrics_token` | Prometheus bearer token | `{{ vault_hass_metrics_token }}` |
#### Host Variables (`host_vars/portia.incus.yml`)
| Variable | Description |
|----------|-------------|
| `hass_db_name` | Database name on Portia |
| `hass_db_user` | Database user on Portia |
| `hass_db_password` | `{{ vault_hass_db_password }}` |
#### Vault Variables (`group_vars/all/vault.yml`)
| Variable | Description |
|----------|-------------|
| `vault_hass_db_password` | PostgreSQL password for hass database |
| `vault_hass_metrics_token` | Long-Lived Access Token for Prometheus scraping |
## Configuration
### PostgreSQL Recorder
Home Assistant uses the `recorder` integration to persist entity states and events to PostgreSQL on Portia instead of the default SQLite. Configured in `configuration.yaml.j2`:
```yaml
recorder:
db_url: "postgresql://hass:<password>@portia.incus:5432/hass"
purge_keep_days: 30
commit_interval: 1
```
The database and user are provisioned by `postgresql/deploy.yml` alongside other service databases.
### HTTP / Reverse Proxy
HAProxy on Titania terminates TLS and forwards to Oberon:8123. The `http` block in `configuration.yaml.j2` configures trusted proxies so HA correctly reads `X-Forwarded-For` headers:
```yaml
http:
server_port: 8123
use_x_forwarded_for: true
trusted_proxies:
- 10.0.0.0/8
```
### HAProxy Backend
Defined in `host_vars/titania.incus.yml` under `haproxy_backends`:
| Setting | Value |
|---------|-------|
| Subdomain | `hass` |
| Backend | `oberon.incus:8123` |
| Health path | `/api/` |
| Timeout | 300s (WebSocket support) |
The wildcard TLS certificate (`*.ouranos.helu.ca`) covers `hass.ouranos.helu.ca` automatically — no certificate changes required.
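Rendered into `haproxy.cfg`, that backend looks roughly like the following. The backend and server names here are illustrative; the template generates its own.

```
backend hass_backend
    option httpchk GET /api/
    timeout server 300s
    server oberon oberon.incus:8123 check
```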
## Authentication
Home Assistant uses its **native `homeassistant` auth provider** (built-in username/password). HA does not support OIDC/OAuth2 natively, so Casdoor SSO integration is not available.
On first deployment, HA will present an onboarding wizard to create the initial admin user.
## Monitoring
### Prometheus Metrics
Home Assistant exposes Prometheus metrics at `/api/prometheus`. The Alloy agent on Oberon scrapes this endpoint with bearer token authentication and remote-writes to Prometheus on Prospero.
| Setting | Value |
|---------|-------|
| Metrics path | `/api/prometheus` |
| Scrape interval | 60s |
| Auth | Bearer token (Long-Lived Access Token) |
**⚠️ Two-Phase Metrics Bootstrapping:**
The `vault_hass_metrics_token` must be a Home Assistant **Long-Lived Access Token**, which can only be generated from the HA web UI after the initial deployment:
1. Deploy Home Assistant: `ansible-playbook hass/deploy.yml`
2. Complete the onboarding wizard at `https://hass.ouranos.helu.ca`
3. Navigate to **Profile → Security → Long-Lived Access Tokens → Create Token**
4. Store the token in vault: `vault_hass_metrics_token: "<token>"`
5. Redeploy Alloy to pick up the token: `ansible-playbook alloy/deploy.yml`
Until the token is created, the Alloy hass scrape will fail silently.
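Once the token exists, it can be sanity-checked from Oberon before redeploying Alloy (commands against the live host, shown for illustration):

```bash
TOKEN="<long-lived access token>"
curl -fsS -H "Authorization: Bearer $TOKEN" \
  http://localhost:8123/api/prometheus | head -n 5
```

A valid token should return Prometheus text-format metrics; a 401 means the token is wrong or expired.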
### Loki Logs
Systemd journal logs are collected by Alloy's `loki.source.journal` and shipped to Loki on Prospero.
```bash
# Query in Grafana Explore
{job="systemd", hostname="oberon"} |= "hass"
```
## Operations
### Start / Stop
```bash
sudo systemctl start hass
sudo systemctl stop hass
sudo systemctl restart hass
```
### Health Check
```bash
curl http://localhost:8123/api/
```
### Logs
```bash
journalctl -u hass -f
```
### Version Upgrade
1. Update `hass_version` in `host_vars/oberon.incus.yml`
2. Run: `ansible-playbook hass/deploy.yml`
The playbook will reinstall the pinned version via pip and restart the service.
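The pin in `requirements.txt.j2` presumably reduces to a single templated line (illustrative; check the actual template):

```
homeassistant=={{ hass_version }}
```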
## Troubleshooting
### Common Issues
| Symptom | Cause | Resolution |
|---------|-------|------------|
| Service won't start | Missing Python deps | Check `pip install` output in deploy log |
| Database connection error | Portia unreachable | Verify PostgreSQL is running: `ansible-playbook postgresql/deploy.yml` |
| 502 via HAProxy | HA not listening | Check `systemctl status hass` on Oberon |
| Metrics scrape failing | Missing/invalid token | Generate Long-Lived Access Token from HA UI (see Monitoring section) |
### Debug Mode
```bash
# Check service status
sudo systemctl status hass
# View recent logs
journalctl -u hass --since "5 minutes ago"
# Test database connectivity from Oberon
psql -h portia.incus -U hass -d hass -c "SELECT 1"
```
## References
- [Home Assistant Documentation](https://www.home-assistant.io/docs/)
- [Home Assistant GitHub](https://github.com/home-assistant/core)
- [Recorder Integration](https://www.home-assistant.io/integrations/recorder/)
- [Prometheus Integration](https://www.home-assistant.io/integrations/prometheus/)
- [HTTP Integration](https://www.home-assistant.io/integrations/http/)

# JupyterLab - Interactive Computing Environment
## Overview
JupyterLab is a web-based interactive development environment for notebooks, code, and data. It is deployed on **Puck** as a systemd service running in a Python virtual environment, with an OAuth2-Proxy sidecar providing Casdoor SSO authentication.
**Host:** puck.incus
**Role:** Application Runtime (Python App Host)
**Container Port:** 22181 (JupyterLab), 22182 (OAuth2-Proxy)
**External Access:** https://jupyter.ouranos.helu.ca/ (via HAProxy on Titania)
## Architecture
```
┌──────────┐ ┌────────────┐ ┌─────────────┐ ┌────────────┐
│ Client │─────▶│ HAProxy │─────▶│ OAuth2-Proxy│─────▶│ JupyterLab │
│ │ │ (Titania) │ │ (Puck) │ │ (Puck) │
└──────────┘ └────────────┘ └─────────────┘ └────────────┘
┌───────────┐
│ Casdoor │
│ (Titania) │
└───────────┘
```
### Authentication Flow
```
┌──────────┐ ┌────────────┐ ┌─────────────┐ ┌──────────┐
│ Browser │─────▶│ HAProxy │─────▶│ OAuth2-Proxy│─────▶│ Casdoor │
│ │ │ (Titania) │ │ (Puck) │ │(Titania) │
└──────────┘ └────────────┘ └─────────────┘ └──────────┘
│ │ │
│ 1. Access jupyter.ouranos.helu.ca │ │
│─────────────────────────────────────▶│ │
│ 2. No session - redirect to Casdoor │ │
│◀─────────────────────────────────────│ │
│ 3. User authenticates │ │
│─────────────────────────────────────────────────────────▶│
│ 4. Redirect with auth code │ │
│◀─────────────────────────────────────────────────────────│
│ 5. Exchange code, set session cookie│ │
│◀─────────────────────────────────────│ │
│ 6. Proxy to JupyterLab │ │
│◀─────────────────────────────────────│ │
```
## Deployment
### Playbook
```bash
cd ansible
ansible-playbook jupyterlab/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `jupyterlab/deploy.yml` | Main deployment playbook |
| `jupyterlab/jupyterlab.service.j2` | Systemd unit for JupyterLab |
| `jupyterlab/oauth2-proxy-jupyter.service.j2` | Systemd unit for OAuth2-Proxy sidecar |
| `jupyterlab/oauth2-proxy-jupyter.cfg.j2` | OAuth2-Proxy configuration |
| `jupyterlab/jupyter_lab_config.py.j2` | JupyterLab server configuration |
### Deployment Steps
1. **Install Dependencies**: python3-venv, nodejs, npm, graphviz
2. **Ensure User Exists**: `robert:robert` with home directory
3. **Create Directories**: Notebooks dir, config dir, log dir
4. **Create Virtual Environment**: `/home/robert/env/jupyter`
5. **Install Python Packages**: jupyterlab, jupyter-ai, langchain-ollama, matplotlib, plotly
6. **Install Jupyter Extensions**: contrib nbextensions
7. **Template Configuration**: Apply JupyterLab config
8. **Download OAuth2-Proxy**: Binary from GitHub releases
9. **Template OAuth2-Proxy Config**: With Casdoor OIDC settings
10. **Start Services**: Enable and start both systemd units
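Steps 4-5 amount to the usual venv bootstrap. The commands below are a manual-equivalent sketch, not the playbook's exact tasks:

```bash
python3 -m venv /home/robert/env/jupyter
/home/robert/env/jupyter/bin/pip install --upgrade pip
/home/robert/env/jupyter/bin/pip install \
  jupyterlab 'jupyter-ai[all]' langchain-ollama matplotlib plotly
```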
## Configuration
### Key Features
- **Jupyter AI**: AI assistance via jupyter-ai[all] with LangChain Ollama integration
- **Visualization**: matplotlib, plotly for data visualization
- **Diagrams**: Mermaid support via jupyterlab-mermaid
- **Extensions**: Jupyter contrib nbextensions
- **SSO**: Casdoor authentication via OAuth2-Proxy sidecar
- **WebSocket**: Full WebSocket support through reverse proxy
### Storage Locations
| Path | Purpose | Owner |
|------|---------|-------|
| `/home/robert/Notebooks` | Notebook files | robert:robert |
| `/home/robert/env/jupyter` | Python virtual environment | robert:robert |
| `/etc/jupyterlab` | Configuration files | root:robert |
| `/var/log/jupyterlab` | Application logs | robert:robert |
| `/etc/oauth2-proxy-jupyter` | OAuth2-Proxy config | root:root |
### Installed Python Packages
| Package | Purpose |
|---------|---------|
| `jupyterlab` | Core JupyterLab server |
| `jupyter-ai[all]` | AI assistant integration |
| `langchain-ollama` | Ollama LLM integration |
| `matplotlib` | Data visualization |
| `plotly` | Interactive charts |
| `jupyter_contrib_nbextensions` | Community extensions |
| `jupyterlab-mermaid` | Mermaid diagram support |
| `ipywidgets` | Interactive widgets |
### Logging
- **JupyterLab**: systemd journal via `SyslogIdentifier=jupyterlab`
- **OAuth2-Proxy**: systemd journal via `SyslogIdentifier=oauth2-proxy-jupyter`
- **Alloy Forwarding**: Syslog port 51491 → Loki
## Access After Deployment
1. **Web Interface**: https://jupyter.ouranos.helu.ca/
2. **Authentication**: Redirects to Casdoor SSO login
3. **After Login**: Full JupyterLab interface with notebook access
## Monitoring
### Alloy Configuration
**File:** `ansible/alloy/puck/config.alloy.j2`
- **Log Collection**: Syslog port 51491 → Loki
- **Job Label**: `jupyterlab`
- **System Metrics**: Process exporter tracks JupyterLab process
### Health Check
- **URL**: `http://puck.incus:22182/ping` (OAuth2-Proxy)
- **JupyterLab API**: `http://127.0.0.1:22181/api/status` (localhost only)
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
### 1. OAuth Client ID
```yaml
vault_jupyter_oauth_client_id: "jupyter-oauth-client"
```
**Requirements:**
- **Purpose**: Client ID for Casdoor OAuth2 application
- **Source**: Must match `clientId` in Casdoor application configuration
### 2. OAuth Client Secret
```yaml
vault_jupyter_oauth_client_secret: "YourRandomOAuthSecret123!"
```
**Requirements:**
- **Length**: 32+ characters recommended
- **Purpose**: Client secret for Casdoor OAuth2 authentication
- **Generation**:
```bash
openssl rand -base64 32
```
### 3. Cookie Secret
```yaml
vault_jupyter_oauth2_cookie_secret: "32CharacterRandomStringHere1234"
```
**Requirements:**
- **Length**: Exactly 32 characters (or 16/24 for AES)
- **Purpose**: Encrypts OAuth2-Proxy session cookies
- **Generation**:
```bash
openssl rand -base64 32 | head -c 32
```
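If `openssl` is unavailable, an equivalent secret can be produced in Python. Base64 of 24 random bytes is exactly 32 characters, one of the AES-friendly lengths OAuth2-Proxy accepts (a convenience sketch, not part of the playbook):

```python
import base64
import secrets

def gen_cookie_secret(nbytes: int = 24) -> str:
    # 24 raw bytes -> 32 base64 characters, no padding
    return base64.urlsafe_b64encode(secrets.token_bytes(nbytes)).decode()

print(len(gen_cookie_secret()))  # 32
```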
## Host Variables
**File:** `ansible/inventory/host_vars/puck.incus.yml`
```yaml
# JupyterLab Configuration
jupyterlab_user: robert
jupyterlab_group: robert
jupyterlab_notebook_dir: /home/robert/Notebooks
jupyterlab_venv_dir: /home/robert/env/jupyter
# Ports
jupyterlab_port: 22181 # JupyterLab (localhost only)
jupyterlab_proxy_port: 22182 # OAuth2-Proxy (exposed to HAProxy)
# OAuth2-Proxy Configuration
jupyterlab_oauth2_proxy_dir: /etc/oauth2-proxy-jupyter
jupyterlab_oauth2_proxy_version: "7.6.0"
jupyterlab_domain: "ouranos.helu.ca"
jupyterlab_oauth2_oidc_issuer_url: "https://id.ouranos.helu.ca"
jupyterlab_oauth2_redirect_url: "https://jupyter.ouranos.helu.ca/oauth2/callback"
# OAuth2 Credentials (from vault)
jupyterlab_oauth_client_id: "{{ vault_jupyter_oauth_client_id }}"
jupyterlab_oauth_client_secret: "{{ vault_jupyter_oauth_client_secret }}"
jupyterlab_oauth2_cookie_secret: "{{ vault_jupyter_oauth2_cookie_secret }}"
# Alloy Logging
jupyterlab_syslog_port: 51491
```
## OAuth2 / Casdoor SSO
JupyterLab uses OAuth2-Proxy as a sidecar to handle Casdoor authentication. This pattern is simpler than native OAuth for single-user setups.
### Why OAuth2-Proxy Sidecar?
| Approach | Pros | Cons |
|----------|------|------|
| **OAuth2-Proxy (chosen)** | Simple setup, no JupyterLab modification | Extra service to manage |
| **Native JupyterHub OAuth** | Integrated solution | More complex, overkill for single user |
| **Token-only auth** | Simplest | Less secure, no SSO integration |
### Casdoor Application Configuration
A JupyterLab application is defined in `ansible/casdoor/init_data.json.j2`:
| Setting | Value |
|---------|-------|
| **Name** | `app-jupyter` |
| **Client ID** | `vault_jupyter_oauth_client_id` |
| **Redirect URI** | `https://jupyter.ouranos.helu.ca/oauth2/callback` |
| **Grant Types** | `authorization_code`, `refresh_token` |
### URL Strategy
| URL Type | Address | Used By |
|----------|---------|---------|
| **OIDC Issuer** | `https://id.ouranos.helu.ca` | OAuth2-Proxy (external) |
| **Redirect URL** | `https://jupyter.ouranos.helu.ca/oauth2/callback` | Browser callback |
| **Upstream** | `http://127.0.0.1:22181` | OAuth2-Proxy → JupyterLab |
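Putting those URLs together, the rendered `oauth2-proxy-jupyter.cfg` looks roughly like this. This is a sketch with an assumed selection of keys; the real template may set additional options such as cookie domains or email filtering.

```ini
provider = "oidc"
oidc_issuer_url = "https://id.ouranos.helu.ca"
client_id = "jupyter-oauth-client"
client_secret = "<from vault>"
cookie_secret = "<from vault>"
redirect_url = "https://jupyter.ouranos.helu.ca/oauth2/callback"
upstreams = [ "http://127.0.0.1:22181" ]
http_address = "0.0.0.0:22182"
```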
### Deployment Order
1. **Deploy Casdoor first** (if not already running):
```bash
ansible-playbook casdoor/deploy.yml
```
2. **Update HAProxy** (add jupyter backend):
```bash
ansible-playbook haproxy/deploy.yml
```
3. **Deploy JupyterLab**:
```bash
ansible-playbook jupyterlab/deploy.yml
```
4. **Update Alloy** (for log forwarding):
```bash
ansible-playbook alloy/deploy.yml
```
## Integration with Other Services
### HAProxy Routing
**Backend Configuration** (`titania.incus.yml`):
```yaml
- subdomain: "jupyter"
backend_host: "puck.incus"
backend_port: 22182 # OAuth2-Proxy port
health_path: "/ping"
timeout_server: 300s # WebSocket support
```
### Alloy Log Forwarding
**Syslog Configuration** (`puck/config.alloy.j2`):
```hcl
loki.source.syslog "jupyterlab_logs" {
listener {
address = "127.0.0.1:51491"
protocol = "tcp"
labels = {
job = "jupyterlab",
}
}
forward_to = [loki.write.default.receiver]
}
```
## Troubleshooting
### Service Status
```bash
ssh puck.incus
sudo systemctl status jupyterlab
sudo systemctl status oauth2-proxy-jupyter
```
### View Logs
```bash
# JupyterLab logs
sudo journalctl -u jupyterlab -f
# OAuth2-Proxy logs
sudo journalctl -u oauth2-proxy-jupyter -f
```
### Test JupyterLab Directly (bypass OAuth)
```bash
# From puck container
curl http://127.0.0.1:22181/api/status
```
### Test OAuth2-Proxy Health
```bash
curl http://puck.incus:22182/ping
```
### Verify Virtual Environment
```bash
ssh puck.incus
sudo -u robert /home/robert/env/jupyter/bin/jupyter --version
```
### Common Issues
| Issue | Solution |
|-------|----------|
| WebSocket disconnects | Verify `timeout_server: 300s` in HAProxy backend |
| OAuth redirect loop | Check `redirect_url` matches Casdoor app config |
| 502 Bad Gateway | Ensure JupyterLab service is running on port 22181 |
| Cookie errors | Verify `cookie_secret` is exactly 32 characters |
## Version Information
- **Installation Method**: Python pip in virtual environment
- **JupyterLab Version**: Latest stable (pip managed)
- **OAuth2-Proxy Version**: 7.6.0 (binary from GitHub)
- **Update Process**: Re-run deployment playbook
## References
- **JupyterLab Documentation**: https://jupyterlab.readthedocs.io/
- **OAuth2-Proxy Documentation**: https://oauth2-proxy.github.io/oauth2-proxy/
- **Jupyter AI**: https://jupyter-ai.readthedocs.io/
- **Casdoor OIDC**: https://casdoor.org/docs/integration/oidc

Docker Compose doesn't pull newer images for existing tags
-----------------------------------------------------------
# Issue
Running `docker compose up` on a service tagged `:latest` does not check the registry for a newer image. The container keeps running the old image even though a newer one has been pushed upstream.
## Symptoms
- `docker compose up` starts the container immediately using the locally cached image
- `docker compose pull` or `docker pull <image>:latest` successfully downloads a newer image
- After pulling manually, `docker compose up` recreates the container with the new image
- The `community.docker.docker_compose_v2` Ansible module with `state: present` behaves identically — no pull check
# Explanation
Docker's default behaviour is: **if an image with the requested tag exists locally, use it without checking the registry.** The `:latest` tag is not special — it's just a regular mutable tag. Docker does not treat it as "always fetch the newest." It is simply the default tag applied when no tag is specified.
When you run `docker compose up`:
1. Docker checks if `image:latest` exists in the local image store
2. If yes → use it, no registry check
3. If no → pull from registry
This means a stale `:latest` can sit on your host indefinitely while the upstream registry has a completely different image behind the same tag. The only way Docker knows to pull is if:
- The image doesn't exist locally at all
- You explicitly tell it to pull
The same applies to the Ansible `community.docker.docker_compose_v2` module — `state: present` maps to `docker compose up` behaviour, so no pull check occurs unless you tell it to.
# Solution
Two complementary fixes ensure images are always checked against the registry.
## 1. Docker Compose — `pull_policy: always`
Add `pull_policy: always` to the service definition in `docker-compose.yml`:
```yaml
services:
my-service:
image: registry.example.com/my-image:latest
pull_policy: always # Check registry on every `up`
container_name: my-service
...
```
With this set, `docker compose up` will always contact the registry and compare the local image digest with the remote one. If they match, no download occurs — it's a lightweight check. If they differ, the new image layers are pulled.
Valid values for `pull_policy`:
| Value | Behaviour |
|-------|-----------|
| `always` | Always check the registry before starting |
| `missing` | Only pull if the image doesn't exist locally (default) |
| `never` | Never pull, fail if image doesn't exist locally |
| `build` | Always build the image (for services with `build:`) |
## 2. Ansible — `pull: always` on `docker_compose_v2`
Add `pull: always` to the `community.docker.docker_compose_v2` task:
```yaml
- name: Start service
community.docker.docker_compose_v2:
project_src: "{{ service_directory }}"
state: present
pull: always # Check registry during deploy
```
Valid values for `pull`:
| Value | Behaviour |
|-------|-----------|
| `always` | Always pull before starting (like `docker compose pull && up`) |
| `missing` | Only pull if image doesn't exist locally |
| `never` | Never pull |
| `policy` | Defer to `pull_policy` defined in the compose file |
## Why use both?
- **`pull_policy` in compose file** — Protects against manual `docker compose up` on the host
- **`pull: always` in Ansible** — Ensures automated deployments always get the freshest image
They are independent mechanisms. The Ansible `pull` parameter runs a pull step before compose up, regardless of what the compose file says. Belt and suspenders.
# Agathos Fix
Applied to `ansible/gitea_mcp/` as the first instance. The same pattern should be applied to any service using mutable tags (`:latest`, `:stable`, etc.).
**docker-compose.yml.j2:**
```yaml
services:
gitea-mcp:
image: docker.gitea.com/gitea-mcp-server:latest
pull_policy: always
...
```
**deploy.yml:**
```yaml
- name: Start Gitea MCP service
community.docker.docker_compose_v2:
project_src: "{{ gitea_mcp_directory }}"
state: present
pull: always
```
# When you DON'T need this
- **Pinned image tags** (e.g., `postgres:16.2`, `grafana/grafana:11.1.0`) — The tag is immutable, so there's nothing newer to pull. Using `pull: always` here just adds a redundant registry check on every deploy.
- **Locally built images** — If the image is built by `docker compose build`, use `pull_policy: build` instead.
- **Air-gapped / offline hosts** — `pull: always` will fail if the registry is unreachable. Use `missing` or `never`.
# Verification
```bash
# Check what image a running container is using
docker inspect --format='{{.Image}}' gitea-mcp
# Compare local digest with remote
docker images --digests docker.gitea.com/gitea-mcp-server
# Force pull and check if image ID changes
docker compose pull
docker compose up -d
```

Docker won't start inside Incus container
------------------------------------------
# Issue
Running Docker inside Incus has worked for years, but a recent Ubuntu package update caused it to fail.
## Symptoms
Docker containers won't start with the following error:
```
docker compose up
Attaching to neo4j
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: open sysctl net.ipv4.ip_unprivileged_port_start file: reopen fd 8: permission denied
```
The root cause is AppArmor: the host runs AppArmor, and the profile Incus applies to containers with `security.nesting=true` blocks Docker from writing to `/proc/sys/net/ipv4/ip_unprivileged_port_start`.
# Solution (Automated)
The fix requires **both** host-side and container-side changes. These are now automated in our infrastructure:
## 1. Terraform - Host-side fix
In `terraform/containers.tf`, all containers with `security.nesting=true` now include:
```terraform
config = {
"security.nesting" = true
"raw.lxc" = "lxc.apparmor.profile=unconfined"
}
```
This tells Incus not to load any AppArmor profile for the container.
## 2. Ansible - Container-side fix
In `ansible/docker/deploy.yml`, Docker deployment now creates a systemd override:
```yaml
- name: Create AppArmor workaround for Incus nested Docker
ansible.builtin.copy:
content: |
[Service]
Environment=container="setmeandforgetme"
dest: /etc/systemd/system/docker.service.d/apparmor-workaround.conf
```
This tells Docker to skip loading its own AppArmor profile.
# Manual Workaround
If you need to fix this manually (e.g., before running Terraform/Ansible):
## Step 1: Force unconfined mode from the Incus host
```bash
# On the HOST (pan.helu.ca), not in the container
incus config set <container-name> raw.lxc "lxc.apparmor.profile=unconfined" --project agathos
incus restart <container-name> --project agathos
```
## Step 2: Disable AppArmor for Docker inside the container
```bash
# Inside the container
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/apparmor-workaround.conf <<EOF
[Service]
Environment=container="setmeandforgetme"
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
```
Reference: [ktz.blog](https://blog.ktz.me/proxmox-9-broke-my-docker-containers/)
# Verification
Tested on Miranda (2025-12-28):
```bash
# Before fix - fails with permission denied
$ ssh miranda.incus "docker run hello-world"
docker: Error response from daemon: failed to create task for container: ... permission denied
# After applying both fixes
$ ssh miranda.incus "docker run hello-world"
Hello from Docker!
# Port binding also works
$ ssh miranda.incus "docker run -d -p 8080:80 nginx"
# Container starts successfully
```
# Security Considerations
Setting `lxc.apparmor.profile=unconfined` only disables the AppArmor profile that Incus applies **to** the container. The host's AppArmor daemon continues running and protecting the host itself.
Security layers with this fix:
- Host AppArmor ✅ (still active)
- Incus container isolation ✅ (namespaces, cgroups)
- Container AppArmor ❌ (disabled with unconfined)
- Docker container isolation ✅ (namespaces, cgroups)
For sandbox/dev environments, this tradeoff is acceptable since:
- The Incus container is already isolated from the host
- We're not running untrusted workloads
- Production uses VMs + Docker without Incus nesting
# Explanation
What happened is that a recent update on the host (probably the incus and/or apparmor packages that landed in Ubuntu 24.04) started feeding the container a new AppArmor profile that contains this rule (or one very much like it):
```
deny @{PROC}/sys/net/ipv4/ip_unprivileged_port_start rw,
```
That rule is not present in the profile that ships with plain Docker, but it is present in the profile that Incus now attaches to every container that has `security.nesting=true` (the flag you need to run Docker inside Incus).
Because the rule is a `deny`, it overrides any later `allow`, so Docker's own profile (which allows the write) is ignored and the kernel returns `permission denied` the first time Docker/runc tries to write the value that tells the kernel which ports an unprivileged user may bind to.
So the container itself starts fine, but as soon as Docker tries to start any of its own containers, the AppArmor policy that Incus attached to the nested container blocks the write and the whole Docker container creation aborts.
The two workarounds remove the enforcing profile:
1. **`raw.lxc = lxc.apparmor.profile=unconfined`** — tells Incus not to load any AppArmor profile for this container at all, so the offending rule is never applied.
2. **`Environment=container="setmeandforgetme"`** — sets the `container` environment variable that Docker's systemd unit looks for. When Docker sees that variable it skips loading its default AppArmor profile. The value literally does not matter; the variable only has to exist.
Either way you end up with no AppArmor policy on the nested Docker container, so the write to `ip_unprivileged_port_start` succeeds and your containers start again.
**In short:** Recent Incus added a deny rule that clashes with Docker's need to tweak that sysctl; disabling the profile (host-side or container-side) is the quickest fix until the profiles are updated to allow the operation.
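For reference, the host-side workaround (option 1) can be applied with Incus's config commands; the container name here is illustrative:

```bash
# Host-side: drop the Incus-applied AppArmor profile for this container,
# then restart so the change takes effect.
incus config set miranda raw.lxc "lxc.apparmor.profile=unconfined"
incus restart miranda
```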
# Kernos Service Documentation
HTTP-enabled MCP shell server using FastMCP. Wraps the existing `mcp-shell-server` execution logic with FastMCP's HTTP transport for remote AI agent access.
## Overview
| Property | Value |
|----------|-------|
| **Host** | caliban.incus |
| **Port** | 22021 |
| **Service Type** | Systemd service (non-Docker) |
| **Repository** | `ssh://robert@clio.helu.ca:18677/mnt/dev/kernos` |
## Features
- **HTTP Transport**: Accessible via URL instead of stdio
- **Health Endpoints**: `/live`, `/ready`, `/health` for Kubernetes-style probes
- **Prometheus Metrics**: `/metrics` endpoint for monitoring
- **JSON Structured Logging**: Production-ready log format with correlation IDs
- **Full Security**: Command whitelisting inherited from `mcp-shell-server`
## Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/mcp/` | POST | MCP protocol endpoint (FastMCP handles this) |
| `/live` | GET | Liveness probe - always returns 200 |
| `/ready` | GET | Readiness probe - checks executor and config |
| `/health` | GET | Combined health check |
| `/metrics` | GET | Prometheus metrics (text/plain) or JSON |
## Ansible Playbooks
### Stage Playbook
```bash
ansible-playbook kernos/stage.yml
```
Fetches the Kernos repository from clio and creates a release tarball at `~/rel/kernos_{{kernos_rel}}.tar`.
### Deploy Playbook
```bash
ansible-playbook kernos/deploy.yml
```
Deploys Kernos to caliban.incus:
1. Creates kernos user/group
2. Creates `/srv/kernos` directory
3. Transfers and extracts the staged tarball
4. Creates Python virtual environment
5. Installs package dependencies
6. Templates `.env` configuration
7. Templates systemd service file
8. Enables and starts the service
9. Validates health endpoints
## Configuration Variables
### Host Variables (`ansible/inventory/host_vars/caliban.incus.yml`)
| Variable | Default | Description |
|----------|---------|-------------|
| `kernos_user` | `kernos` | System user for the service |
| `kernos_group` | `kernos` | System group for the service |
| `kernos_directory` | `/srv/kernos` | Installation directory |
| `kernos_port` | `22021` | HTTP server port |
| `kernos_host` | `0.0.0.0` | Server bind address |
| `kernos_log_level` | `INFO` | Python log level |
| `kernos_log_format` | `json` | Log format (`json` or `text`) |
| `kernos_environment` | `production` | Environment name for logging |
| `kernos_allow_commands` | (see below) | Comma-separated command whitelist |
### Global Variables (`ansible/inventory/group_vars/all/vars.yml`)
| Variable | Default | Description |
|----------|---------|-------------|
| `kernos_rel` | `master` | Git branch/tag for staging |
## Allowed Commands
The following commands are whitelisted for execution:
```
ls, cat, head, tail, grep, find, wc, file, stat, mkdir, touch, cp, mv, rm,
chmod, pwd, tree, du, df, sed, awk, sort, uniq, cut, tr, tee, curl, wget,
ping, nc, dig, host, ps, pgrep, kill, pkill, nohup, timeout, python3, pip,
node, npm, npx, pnpm, git, make, tar, gzip, gunzip, zip, unzip, whoami, id,
uname, hostname, date, uptime, free, which, env, printenv, run-captured, jq
```
## Security
All security features are inherited from `mcp-shell-server`:
- **Command Whitelisting**: Only commands in `ALLOW_COMMANDS` can be executed
- **Shell Operator Validation**: Commands after `;`, `&&`, `||`, `|` are validated
- **Directory Validation**: Working directory must be absolute and accessible
- **No Shell Injection**: Commands executed directly without shell interpretation
The systemd service includes additional hardening:
- `NoNewPrivileges=true`
- `PrivateTmp=true`
- `ProtectSystem=strict`
- `ProtectHome=true`
- `ReadWritePaths=/tmp`
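The whitelist check can be pictured with a small sketch (toy command set, not the production `ALLOW_COMMANDS` list; the real validation lives in `mcp-shell-server` and also inspects commands after shell operators):

```python
# Toy whitelist for illustration only.
ALLOW_COMMANDS = {"ls", "cat", "grep"}

def validate(command: list[str]) -> None:
    """Reject any argv whose executable is not whitelisted."""
    if not command or command[0] not in ALLOW_COMMANDS:
        raise ValueError(f"command not allowed: {command[:1]}")

validate(["ls", "-la"])  # accepted
try:
    validate(["bash", "-c", "id"])
except ValueError as exc:
    print(exc)  # rejected: "bash" is not whitelisted
```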
## Usage
### Testing Health Endpoints
```bash
curl http://caliban.incus:22021/health
curl http://caliban.incus:22021/ready
curl http://caliban.incus:22021/live
curl -H "Accept: text/plain" http://caliban.incus:22021/metrics
```
### MCP Client Connection
Connect using any MCP client that supports HTTP transport:
```python
import asyncio

from fastmcp import Client

async def main():
    client = Client("http://caliban.incus:22021/mcp")
    async with client:
        result = await client.call_tool("shell_execute", {
            "command": ["ls", "-la"],
            "directory": "/tmp"
        })
        print(result)

asyncio.run(main())
```
## Tool: shell_execute
Execute a shell command in a specified directory.
### Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `command` | `list[str]` | Yes | - | Command and arguments as array |
| `directory` | `str` | No | `/tmp` | Absolute path to working directory |
| `stdin` | `str` | No | `None` | Input to pass to command |
| `timeout` | `int` | No | `None` | Timeout in seconds |
### Response
```json
{
"stdout": "command output",
"stderr": "",
"status": 0,
"execution_time": 0.123
}
```
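A caller can branch on `status` from that shape; a minimal sketch using a hard-coded payload of the documented form:

```python
import json

# Hard-coded example payload matching the documented response shape.
raw = '{"stdout": "command output", "stderr": "", "status": 0, "execution_time": 0.123}'
result = json.loads(raw)

# "status" carries the command's exit code; non-zero means failure.
if result["status"] == 0:
    print(result["stdout"])
else:
    print(f"failed ({result['status']}): {result['stderr']}")
```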
## Monitoring
### Prometheus Metrics
The `/metrics` endpoint exposes Prometheus-compatible metrics. Add to your Prometheus configuration:
```yaml
- job_name: 'kernos'
static_configs:
- targets: ['caliban.incus:22021']
```
### Service Status
```bash
# Check service status
ssh caliban.incus sudo systemctl status kernos
# View logs
ssh caliban.incus sudo journalctl -u kernos -f
```
## Troubleshooting
### Service Won't Start
1. Check logs: `journalctl -u kernos -n 50`
2. Verify `.env` file exists and has correct permissions
3. Ensure Python venv was created successfully
4. Check that `ALLOW_COMMANDS` is set
### Health Check Failures
1. Verify the service is running: `systemctl status kernos`
2. Check if port 22021 is accessible
3. Review logs for startup errors
### Command Execution Denied
1. Verify the command is in `ALLOW_COMMANDS` whitelist
2. Check that the working directory is absolute and accessible
3. Review logs for security validation errors
# LobeChat
Modern AI chat interface with multi-LLM support, deployed on **Rosalind** with PostgreSQL backend and S3 storage.
**Host:** rosalind.incus
**Port:** 22081
**External URL:** https://lobechat.ouranos.helu.ca/
## Quick Deployment
```bash
cd ansible
ansible-playbook lobechat/deploy.yml
```
## Architecture
```
┌──────────┐ ┌────────────┐ ┌──────────┐ ┌───────────┐
│ Client │─────▶│ HAProxy │─────▶│ LobeChat │─────▶│PostgreSQL │
│ │ │ (Titania) │ │(Rosalind)│ │ (Portia) │
└──────────┘ └────────────┘ └──────────┘ └───────────┘
├─────────▶ Casdoor (SSO)
├─────────▶ S3 (File Storage)
├─────────▶ SearXNG (Search)
└─────────▶ AI APIs
```
## Required Vault Secrets
Add secrets to `ansible/inventory/group_vars/all/vault.yml`:
### 1. Key Vaults Secret (Encryption Key)
```yaml
vault_lobechat_key_vaults_secret: "your-generated-secret"
```
**Purpose:** Encrypts sensitive data (API keys, credentials) stored in the database.
**Generate with:**
```bash
openssl rand -base64 32
```
This secret must be at least 32 bytes (base64 encoded). If changed after deployment, previously stored encrypted data will become unreadable.
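The same secret can be generated and length-checked in Python, equivalent to the `openssl` command above:

```python
import base64
import secrets

# 32 random bytes, base64-encoded: same shape as `openssl rand -base64 32`.
key = base64.b64encode(secrets.token_bytes(32)).decode()
raw_len = len(base64.b64decode(key))
print(raw_len)  # 32, meeting the minimum
```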
### 2. NextAuth Secret
```yaml
vault_lobechat_next_auth_secret: "your-generated-secret"
```
**Purpose:** Signs NextAuth.js JWT tokens for session management.
**Generate with:**
```bash
openssl rand -base64 32
```
### 3. Database Password
```yaml
vault_lobechat_db_password: "your-secure-password"
```
**Purpose:** PostgreSQL authentication for the `lobechat` database user.
### 4. S3 Secret Key
```yaml
vault_lobechat_s3_secret_key: "your-s3-secret-key"
```
**Purpose:** Authentication for S3 file storage bucket.
**Get from Terraform:**
```bash
cd terraform
terraform output -json lobechat_s3_credentials
```
### 5. AI Provider API Keys (Optional)
```yaml
vault_lobechat_openai_api_key: "sk-proj-..."
vault_lobechat_anthropic_api_key: "sk-ant-api03-..."
vault_lobechat_google_api_key: "AIza..."
```
**Purpose:** Server-side AI provider access. Users can also provide their own keys via the UI.
| Provider | Get Key From |
|----------|-------------|
| OpenAI | https://platform.openai.com/api-keys |
| Anthropic | https://console.anthropic.com/ |
| Google | https://aistudio.google.com/apikey |
### 6. AWS Bedrock Credentials (Optional)
```yaml
vault_lobechat_aws_access_key_id: "AKIA..."
vault_lobechat_aws_secret_access_key: "wJalr..."
vault_lobechat_aws_region: "us-east-1"
```
**Purpose:** Access AWS Bedrock models (Claude, Titan, Llama, etc.)
**Requirements:**
- IAM user/role with `bedrock:InvokeModel` permission
- Model access enabled in AWS Bedrock console for the region
## Host Variables
Defined in `ansible/inventory/host_vars/rosalind.incus.yml`:
| Variable | Description |
|----------|-------------|
| `lobechat_user` | Service user (lobechat) |
| `lobechat_directory` | Service directory (/srv/lobechat) |
| `lobechat_port` | Container port (22081) |
| `lobechat_db_*` | PostgreSQL connection settings |
| `lobechat_auth_casdoor_*` | Casdoor SSO configuration |
| `lobechat_s3_*` | S3 storage settings |
| `lobechat_syslog_port` | Alloy log collection port (51461) |
## Dependencies
| Service | Host | Purpose |
|---------|------|---------|
| PostgreSQL | Portia | Database backend |
| Casdoor | Titania | SSO authentication |
| HAProxy | Titania | HTTPS termination |
| SearXNG | Oberon | Web search |
| S3 Bucket | Incus | File storage |
## Ansible Files
| File | Purpose |
|------|---------|
| `lobechat/deploy.yml` | Main deployment playbook |
| `lobechat/docker-compose.yml.j2` | Docker Compose template |
## Operations
### Check Status
```bash
ssh rosalind.incus
cd /srv/lobechat
docker compose ps
docker compose logs -f
```
### Update Container
```bash
ssh rosalind.incus
cd /srv/lobechat
docker compose pull
docker compose up -d
```
### Database Access
```bash
psql -h portia.incus -U lobechat -d lobechat
```
## Troubleshooting
| Issue | Resolution |
|-------|------------|
| Container won't start | Check vault secrets are defined |
| Database connection failed | Verify PostgreSQL on Portia is running |
| SSO redirect fails | Check Casdoor application config |
| File uploads fail | Verify S3 credentials from Terraform |
## References
- [Detailed Service Documentation](services/lobechat.md)
- [LobeChat Official Docs](https://lobehub.com/docs)
- [GitHub Repository](https://github.com/lobehub/lobe-chat)
# MCPO - Model Context Protocol OpenAI-Compatible Proxy
## Overview
MCPO is an OpenAI-compatible proxy that aggregates multiple Model Context Protocol (MCP) servers behind a single HTTP endpoint. It acts as the central MCP gateway for the Agathos sandbox, exposing tools from 13 MCP servers through a unified REST API with interactive Swagger documentation.
**Host:** miranda.incus
**Role:** MCP Docker Host
**Service Port:** 25530
**API Docs:** http://miranda.incus:25530/docs
## Architecture
```
┌───────────────┐ ┌──────────────────────────────────────────────────────────┐
│ LLM Client │ │ Miranda (miranda.incus) │
│ (LobeChat, │────▶│ ┌────────────────────────────────────────────────────┐ │
│ Open WebUI, │ │ │ MCPO :25530 │ │
│ VS Code) │ │ │ OpenAI-compatible proxy │ │
└───────────────┘ │ └─────┬────────────┬────────────┬───────────────────┘ │
│ │ │ │ │
│ ┌─────▼─────┐ ┌────▼────┐ ┌────▼─────┐ │
│ │ stdio │ │ Local │ │ Remote │ │
│ │ servers │ │ Docker │ │ servers │ │
│ │ │ │ MCP │ │ │ │
│ │ • time │ │ │ │ • athena │ │
│ │ • ctx7 │ │ • neo4j │ │ • github │ │
│ │ │ │ • graf │ │ • hface │ │
│ │ │ │ • gitea │ │ • argos │ │
│ │ │ │ │ │ • rommie │ │
│ │ │ │ │ │ • caliban│ │
│ │ │ │ │ │ • korax │ │
│ └───────────┘ └─────────┘ └──────────┘ │
└──────────────────────────────────────────────────────────┘
```
MCPO manages two categories of MCP servers:
- **stdio servers**: MCPO spawns and manages the process (time, context7)
- **streamable-http servers**: MCPO proxies to Docker containers on localhost or remote services across the Incus network
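A minimal `config.json` showing one entry of each category (hypothetical entries, and the exact key names may vary by mcpo version; the deployed file is rendered from `mcpo/config.json.j2`, which is authoritative):

```json
{
  "mcpServers": {
    "time": {
      "command": "/srv/mcpo/.venv/bin/mcp-server-time",
      "args": ["--local-timezone", "UTC"]
    },
    "neo4j-cypher": {
      "type": "streamable-http",
      "url": "http://localhost:25531/mcp"
    }
  }
}
```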
## Terraform Resources
### Host Definition
MCPO runs on Miranda, defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | mcp_docker_host |
| Security Nesting | true |
| AppArmor | unconfined |
| Proxy: mcp_containers | `0.0.0.0:25530-25539` → `127.0.0.1:25530-25539` |
| Proxy: mcpo_ports | `0.0.0.0:25560-25569` → `127.0.0.1:25560-25569` |
### Dependencies
| Resource | Relationship |
|----------|--------------|
| prospero | Monitoring (Alloy → Loki, Prometheus) |
| ariel | Neo4j database for neo4j-cypher and neo4j-memory MCP servers |
| puck | Athena MCP server |
| caliban | Caliban and Rommie MCP servers |
## Ansible Deployment
### Playbook
```bash
cd ansible
ansible-playbook mcpo/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `mcpo/deploy.yml` | Main deployment playbook |
| `mcpo/config.json.j2` | MCP server configuration template |
| `mcpo/mcpo.service.j2` | Systemd service unit template |
| `mcpo/restart.yml` | Restart playbook with health check |
| `mcpo/requirements.txt` | Python package requirements |
### Deployment Steps
1. **Create System User**: `mcpo:mcpo` system account
2. **Create Directory**: `/srv/mcpo` with restricted permissions
3. **Backup Config**: Saves existing `config.json` before overwriting
4. **Template Config**: Renders `config.json.j2` with MCP server definitions
5. **Install Node.js 22.x**: NodeSource repository for npx-based MCP servers
6. **Install Python 3.12**: System packages for virtual environment
7. **Create Virtual Environment**: Python 3.12 venv at `/srv/mcpo/.venv`
8. **Install pip Packages**: `wheel`, `mcpo`, `mcp-server-time`
9. **Pre-install Context7**: Downloads `@upstash/context7-mcp` via npx
10. **Deploy Systemd Service**: Enables and starts `mcpo.service`
11. **Health Check**: Verifies `http://localhost:25530/docs` returns HTTP 200
## MCP Servers
MCPO aggregates the following MCP servers in `config.json`:
### stdio Servers (managed by MCPO)
| Server | Command | Purpose |
|--------|---------|---------|
| `time` | `mcp-server-time` (Python venv) | Current time with timezone support |
| `upstash-context7` | `npx @upstash/context7-mcp` | Library documentation lookup |
### streamable-http Servers (local Docker containers)
| Server | URL | Purpose |
|--------|-----|---------|
| `neo4j-cypher` | `localhost:25531/mcp` | Neo4j Cypher query execution |
| `neo4j-memory` | `localhost:25532/mcp` | Neo4j knowledge graph memory |
| `grafana` | `localhost:25533/mcp` | Grafana dashboard and API integration |
| `gitea` | `localhost:25535/mcp` | Gitea repository management |
### streamable-http Servers (remote services)
| Server | URL | Purpose |
|--------|-----|---------|
| `argos-searxng` | `miranda.incus:25534/mcp` | SearXNG search integration |
| `athena` | `puck.incus:22461/mcp` | Athena knowledge service (auth required) |
| `github` | `api.githubcopilot.com/mcp/` | GitHub API integration |
| `rommie` | `caliban.incus:8080/mcp` | Rommie agent interface |
| `caliban` | `caliban.incus:22021/mcp` | Caliban computer use agent |
| `korax` | `korax.helu.ca:22021/mcp` | Korax external agent |
| `huggingface` | `huggingface.co/mcp` | Hugging Face model hub |
## Configuration
### Systemd Service
MCPO runs as a systemd service:
```
ExecStart=/srv/mcpo/.venv/bin/mcpo --port 25530 --config /srv/mcpo/config.json
```
- **User:** mcpo
- **Restart:** always (3s delay)
- **WorkingDirectory:** /srv/mcpo
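Assembled from the settings above, the rendered unit looks roughly like this (a sketch only; `mcpo/mcpo.service.j2` is authoritative):

```ini
[Unit]
Description=MCPO MCP proxy
After=network-online.target

[Service]
User=mcpo
WorkingDirectory=/srv/mcpo
ExecStart=/srv/mcpo/.venv/bin/mcpo --port 25530 --config /srv/mcpo/config.json
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```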
### Storage Locations
| Path | Purpose | Owner |
|------|---------|-------|
| `/srv/mcpo` | Service directory | mcpo:mcpo |
| `/srv/mcpo/.venv` | Python virtual environment | mcpo:mcpo |
| `/srv/mcpo/config.json` | MCP server configuration | mcpo:mcpo |
| `/srv/mcpo/config.json.bak` | Config backup (pre-deploy) | mcpo:mcpo |
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
| Variable | Purpose |
|----------|---------|
| `vault_athena_mcp_auth` | Bearer token for Athena MCP server |
| `vault_github_personal_access_token` | GitHub personal access token |
| `vault_huggingface_mcp_token` | Hugging Face API token |
| `vault_gitea_mcp_access_token` | Gitea personal access token for MCP |
```bash
ansible-vault edit inventory/group_vars/all/vault.yml
```
## Host Variables
**File:** `ansible/inventory/host_vars/miranda.incus.yml`
```yaml
# MCPO Config
mcpo_user: mcpo
mcpo_group: mcpo
mcpo_directory: /srv/mcpo
mcpo_port: 25530
argos_mcp_url: http://miranda.incus:25534/mcp
athena_mcp_auth: "{{ vault_athena_mcp_auth }}"
athena_mcp_url: http://puck.incus:22461/mcp
github_personal_access_token: "{{ vault_github_personal_access_token }}"
neo4j_cypher_mcp_port: 25531
neo4j_memory_mcp_port: 25532
caliban_mcp_url: http://caliban.incus:22021/mcp
korax_mcp_url: http://korax.helu.ca:22021/mcp
huggingface_mcp_token: "{{ vault_huggingface_mcp_token }}"
gitea_mcp_port: 25535
```
## Monitoring
### Loki Logs
MCPO logs are collected via systemd journal by Alloy on Miranda. A relabel rule in Alloy's config tags `mcpo.service` journal entries with `job="mcpo"` so they appear as a dedicated app in Grafana dashboards.
| Log Source | Labels |
|------------|--------|
| Systemd journal | `{job="mcpo", hostname="miranda.incus"}` |
The Docker-based MCP servers (neo4j, grafana, gitea) each have dedicated syslog ports forwarded to Loki:
| Server | Syslog Port | Loki Job |
|--------|-------------|----------|
| neo4j-cypher | 51431 | `neo4j-cypher` |
| neo4j-memory | 51432 | `neo4j-memory` |
| grafana-mcp | 51433 | `grafana_mcp` |
| argos | 51434 | `argos` |
| gitea-mcp | 51435 | `gitea-mcp` |
### Grafana
Query MCPO-related logs in Grafana Explore:
```
{hostname="miranda.incus", job="mcpo"}
{hostname="miranda.incus", job="gitea-mcp"}
{hostname="miranda.incus", job="grafana_mcp"}
```
## Operations
### Start/Stop
```bash
ssh miranda.incus
# MCPO service
sudo systemctl start mcpo
sudo systemctl stop mcpo
sudo systemctl restart mcpo
# Or use the restart playbook with health check
cd ansible
ansible-playbook mcpo/restart.yml
```
### Health Check
```bash
# API docs endpoint
curl http://miranda.incus:25530/docs
# From Miranda itself
curl http://localhost:25530/docs
```
### Logs
```bash
# MCPO systemd journal
ssh miranda.incus "sudo journalctl -u mcpo -f"
# Docker MCP server logs
ssh miranda.incus "docker logs -f gitea-mcp"
ssh miranda.incus "docker logs -f grafana-mcp"
```
### Adding a New MCP Server
1. Add the server definition to `ansible/mcpo/config.json.j2`
2. Add any required variables to `ansible/inventory/host_vars/miranda.incus.yml`
3. Add vault secrets (if needed) to `inventory/group_vars/all/vault.yml`
4. If Docker-based: create a new `ansible/{service}/deploy.yml` and `docker-compose.yml.j2`
5. If Docker-based: add a syslog port to Miranda's host vars and Alloy config
6. Redeploy: `ansible-playbook mcpo/deploy.yml`
## Troubleshooting
### Common Issues
| Symptom | Cause | Resolution |
|---------|-------|------------|
| MCPO won't start | Config JSON syntax error | Check `config.json` with `python -m json.tool` |
| Server shows "unavailable" | Backend MCP server not running | Check Docker containers or remote service status |
| Context7 timeout on first use | npx downloading package | Wait for download to complete, or re-run pre-install |
| Health check fails | Port not ready | Increase retry delay, check `journalctl -u mcpo` |
| stdio server crash loops | Missing runtime dependency | Verify Python venv and Node.js installation |
### Debug Commands
```bash
# Check MCPO service status
ssh miranda.incus "sudo systemctl status mcpo"
# Validate config.json syntax
ssh miranda.incus "python3 -m json.tool /srv/mcpo/config.json"
# List Docker MCP containers
ssh miranda.incus "docker ps --filter name=mcp"
# Test a specific MCP server endpoint
ssh miranda.incus "curl -s http://localhost:25531/mcp | head"
# Check MCPO port is listening
ssh miranda.incus "ss -tlnp | grep 25530"
```
## References
- **MCPO Repository**: https://github.com/nicobailey/mcpo
- **MCP Specification**: https://modelcontextprotocol.io/
- [Ansible Practices](ansible.md)
- [Agathos Overview](agathos.md)
# Neo4j - Graph Database Platform
## Overview
Neo4j is a high-performance graph database providing native graph storage and processing. It enables efficient traversal of complex relationships and is used for knowledge graphs, recommendation engines, and connected data analysis. Deployed with the **APOC plugin** enabled for extended stored procedures and functions.
**Host:** ariel.incus
**Role:** graph_database
**Container Port:** 25554 (HTTP Browser), 7687 (Bolt)
**External Access:** Direct Bolt connection via `ariel.incus:7687`
## Architecture
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Client │─────▶│ Neo4j │◀─────│ Neo4j MCP │
│ (Browser) │ │ (Ariel) │ │ (Miranda) │
└──────────────┘ └──────────────┘ └──────────────┘
│ │
│ ▼
│ ┌──────────────┐
└────────────▶│ Neo4j Browser│
│ HTTP :25554 │
└──────────────┘
```
- **Neo4j Browser**: Web-based query interface on port 25554
- **Bolt Protocol**: Binary protocol on port 7687 for high-performance connections
- **APOC Plugin**: Extended procedures for import/export, graph algorithms, and utilities
- **Neo4j MCP Servers**: Connect via Bolt from Miranda for AI agent access
## Terraform Resources
### Host Definition
The service runs on `ariel`, defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | graph_database |
| Security Nesting | true |
| AppArmor | unconfined |
| Description | Neo4j Host - Ethereal graph connections |
### Proxy Devices
| Device Name | Listen | Connect |
|-------------|--------|---------|
| neo4j_ports | tcp:0.0.0.0:25554 | tcp:127.0.0.1:25554 |
### Dependencies
| Resource | Relationship |
|----------|--------------|
| Prospero | Monitoring stack must exist for Alloy log shipping |
| Miranda | Neo4j MCP servers connect to Neo4j via Bolt |
## Ansible Deployment
### Playbook
```bash
cd ansible
ansible-playbook neo4j/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `neo4j/deploy.yml` | Main deployment playbook |
| `neo4j/docker-compose.yml.j2` | Docker Compose template |
| `alloy/ariel/config.alloy.j2` | Alloy log collection config |
### Deployment Steps
1. **Create System User**: `neo4j:neo4j` system group and user
2. **Configure ponos Access**: Add ponos user to neo4j group
3. **Create Directory**: `/srv/neo4j` with proper ownership
4. **Template Compose File**: Apply `docker-compose.yml.j2`
5. **Start Service**: Launch via `docker_compose_v2` module
## Configuration
### Host Variables (`host_vars/ariel.incus.yml`)
| Variable | Description | Default |
|----------|-------------|---------|
| `neo4j_version` | Neo4j Docker image version | `5.26.0` |
| `neo4j_user` | System user | `neo4j` |
| `neo4j_group` | System group | `neo4j` |
| `neo4j_directory` | Installation directory | `/srv/neo4j` |
| `neo4j_auth_user` | Database admin username | `neo4j` |
| `neo4j_auth_password` | Database admin password | `{{ vault_neo4j_auth_password }}` |
| `neo4j_http_port` | HTTP browser port | `25554` |
| `neo4j_bolt_port` | Bolt protocol port | `7687` |
| `neo4j_syslog_port` | Local syslog port for Alloy | `22011` |
| `neo4j_apoc_unrestricted` | APOC procedures allowed | `apoc.*` |
### Vault Variables (`group_vars/all/vault.yml`)
| Variable | Description |
|----------|-------------|
| `vault_neo4j_auth_password` | Neo4j admin password |
### APOC Plugin Configuration
The APOC (Awesome Procedures on Cypher) plugin is enabled with the following settings:
| Environment Variable | Value | Purpose |
|---------------------|-------|---------|
| `NEO4J_PLUGINS` | `["apoc"]` | Install APOC plugin |
| `NEO4J_apoc_export_file_enabled` | `true` | Allow file exports |
| `NEO4J_apoc_import_file_enabled` | `true` | Allow file imports |
| `NEO4J_apoc_import_file_use__neo4j__config` | `true` | Use Neo4j config for imports |
| `NEO4J_dbms_security_procedures_unrestricted` | `apoc.*` | Allow all APOC procedures |
### Docker Volumes
| Volume | Mount Point | Purpose |
|--------|-------------|---------|
| `neo4j_data` | `/data` | Database files |
| `neo4j_logs` | `/logs` | Application logs |
| `neo4j_plugins` | `/plugins` | APOC and other plugins |
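Putting the image version, ports, APOC settings, and volumes together, the rendered compose file looks roughly like this (a sketch; `neo4j/docker-compose.yml.j2` is authoritative, and the host-side mapping of the HTTP port to Neo4j's default 7474 is an assumption):

```yaml
services:
  neo4j:
    image: neo4j:5.26.0
    container_name: neo4j
    ports:
      - "127.0.0.1:25554:7474"   # HTTP browser (assumed mapping)
      - "7687:7687"              # Bolt
    environment:
      NEO4J_AUTH: "neo4j/{{ neo4j_auth_password }}"
      NEO4J_PLUGINS: '["apoc"]'
      NEO4J_apoc_export_file_enabled: "true"
      NEO4J_apoc_import_file_enabled: "true"
      NEO4J_dbms_security_procedures_unrestricted: "apoc.*"
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs
      - neo4j_plugins:/plugins
volumes:
  neo4j_data:
  neo4j_logs:
  neo4j_plugins:
```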
## Monitoring
### Alloy Configuration
**File:** `ansible/alloy/ariel/config.alloy.j2`
Alloy on Ariel collects:
- System logs (`/var/log/syslog`, `/var/log/auth.log`)
- Systemd journal
- Neo4j Docker container logs via syslog
### Loki Logs
| Log Source | Labels |
|------------|--------|
| Neo4j container | `{job="neo4j", hostname="ariel.incus"}` |
| System logs | `{job="syslog", hostname="ariel.incus"}` |
### Prometheus Metrics
Host-level metrics collected via Alloy's Unix exporter:
| Metric | Description |
|--------|-------------|
| `node_*` | Standard node exporter metrics |
### Log Collection Flow
```
Neo4j Container → Syslog (tcp:127.0.0.1:22011) → Alloy → Loki (Prospero)
```
## Operations
### Start/Stop
```bash
# Via Docker Compose
cd /srv/neo4j
docker compose up -d
docker compose down
# Via Ansible
ansible-playbook neo4j/deploy.yml
```
### Health Check
```bash
# HTTP Browser
curl http://ariel.incus:25554
# Bolt connection test
cypher-shell -a bolt://ariel.incus:7687 -u neo4j -p <password> "RETURN 1"
```
### Logs
```bash
# Docker container logs
docker logs -f neo4j
# Via Loki (Grafana Explore)
{job="neo4j", hostname="ariel.incus"}
```
### Cypher Shell Access
```bash
# SSH to Ariel and exec into container
ssh ariel.incus
docker exec -it neo4j cypher-shell -u neo4j -p <password>
```
### Backup
Neo4j data persists in Docker volumes. Backup procedures:
```bash
# Stop container for consistent backup
docker compose -f /srv/neo4j/docker-compose.yml stop
# Backup volumes
docker run --rm -v neo4j_data:/data -v /backup:/backup alpine \
tar czf /backup/neo4j_data_$(date +%Y%m%d).tar.gz -C /data .
# Start container
docker compose -f /srv/neo4j/docker-compose.yml up -d
```
### Restore
```bash
# Stop container
docker compose -f /srv/neo4j/docker-compose.yml down
# Remove existing volume
docker volume rm neo4j_data
# Create new volume and restore
docker volume create neo4j_data
docker run --rm -v neo4j_data:/data -v /backup:/backup alpine \
tar xzf /backup/neo4j_data_YYYYMMDD.tar.gz -C /data
# Start container
docker compose -f /srv/neo4j/docker-compose.yml up -d
```
## Troubleshooting
### Common Issues
| Symptom | Cause | Resolution |
|---------|-------|------------|
| Container won't start | Auth format issue | Check `NEO4J_AUTH` format is `user/password` |
| APOC procedures fail | Security restrictions | Verify `neo4j_apoc_unrestricted` includes procedure |
| Connection refused | Port not exposed | Check Incus proxy device configuration |
| Bolt connection fails | Wrong port | Use port 7687, not 25554 |
### Debug Mode
```bash
# View container startup logs
docker logs neo4j
# Check Neo4j internal logs
docker exec neo4j cat /logs/debug.log
```
### Verify APOC Installation
```cypher
CALL apoc.help("apoc")
YIELD name, text
RETURN name, text LIMIT 10;
```
## Related Services
### Neo4j MCP Servers (Miranda)
Two MCP servers run on Miranda to provide AI agent access to Neo4j:
| Server | Port | Purpose |
|--------|------|---------|
| neo4j-cypher | 25531 | Direct Cypher query execution |
| neo4j-memory | 25532 | Knowledge graph memory operations |
See [Neo4j MCP documentation](#neo4j-mcp-servers) for deployment details.
## References
- [Neo4j Documentation](https://neo4j.com/docs/)
- [APOC Library Documentation](https://neo4j.com/labs/apoc/)
- [Terraform Practices](../terraform.md)
- [Ansible Practices](../ansible.md)
- [Sandbox Overview](../agathos.md)
# Nextcloud - Self-Hosted Cloud Collaboration
## Overview
Nextcloud is a self-hosted cloud collaboration platform providing file storage, sharing, calendar, contacts, and productivity tools. Deployed as a **native LAPP stack** (Linux, Apache, PostgreSQL, PHP) on **Rosalind** with Memcached caching and Incus storage volume for data.
**Host:** rosalind.incus
**Role:** Collaboration (PHP, Go, Node.js runtimes)
**Container Port:** 22083
**External Access:** https://nextcloud.ouranos.helu.ca/ (via HAProxy on Titania)
**Installation Method:** Native (tar.bz2 extraction to /var/www/nextcloud)
## Architecture
```
┌──────────┐ ┌────────────┐ ┌───────────┐ ┌───────────┐
│ Client │─────▶│ HAProxy │─────▶│ Apache2 │─────▶│PostgreSQL │
│ │ │ (Titania) │ │ Nextcloud │ │ (Portia) │
└──────────┘ └────────────┘ │(Rosalind) │ └───────────┘
└───────────┘
├─────────▶ Memcached (Local)
└─────────▶ /mnt/nextcloud (Volume)
```
## Deployment
### Playbook
```bash
cd ansible
ansible-playbook nextcloud/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `nextcloud/deploy.yml` | Main deployment playbook |
| `nextcloud/nextcloud.conf.j2` | Apache VirtualHost template |
### Deployment Steps
2. **Create Data Directory**: `/mnt/nextcloud` on Incus storage volume
3. **Download Nextcloud**: Latest tarball from official site (if not already present)
4. **Extract to Web Root**: `/var/www/nextcloud` (if new installation)
5. **Set Permissions**: `www-data:www-data` ownership
6. **Configure Apache**: Template vhost with port 22083, enable mods, disable default site
7. **Run Installation**: OCC command-line installer (generates config.php with secrets)
8. **Configure via OCC**: Set trusted domains, Memcached, background job mode
9. **Setup Cron**: Background jobs every 5 minutes as www-data
**⚠️ Important**: The playbook does NOT template over config.php after installation. All configuration changes are made via OCC commands to preserve auto-generated secrets (instanceid, passwordsalt, secret).
## Configuration
### Key Features
- **PostgreSQL Backend**: Database on Portia
- **Memcached Caching**: Local distributed cache with `nc_` prefix
- **Incus Storage Volume**: Dedicated 100GB volume at /mnt/nextcloud
- **Apache Web Server**: mod_php with rewrite/headers modules
- **Cron Background Jobs**: System cron (not Docker/AJAX)
- **Native Installation**: No Docker overhead, matches production pattern
### Storage Configuration
| Path | Purpose | Owner | Mount |
|------|---------|-------|-------|
| `/var/www/nextcloud` | Application files | www-data | Local |
| `/mnt/nextcloud` | User data directory | www-data | Incus volume |
| `/var/log/apache2` | Web server logs | root | Local |
### Apache Modules
Required modules enabled by playbook:
- `rewrite` - URL rewriting
- `headers` - HTTP header manipulation
- `env` - Environment variable passing
- `dir` - Directory index handling
- `mime` - MIME type configuration
### PHP Configuration
Installed PHP extensions:
- `php-gd` - Image manipulation
- `php-pgsql` - PostgreSQL database
- `php-curl` - HTTP client
- `php-mbstring` - Multibyte string handling
- `php-intl` - Internationalization
- `php-gmp` - GNU Multiple Precision
- `php-bcmath` - Binary calculator
- `php-xml` - XML processing
- `php-imagick` - ImageMagick integration
- `php-zip` - ZIP archive handling
- `php-memcached` - Memcached caching
### Memcached Configuration
- **Host**: localhost:11211
- **Prefix**: `nc_` (Nextcloud-specific keys)
- **Local**: `\OC\Memcache\Memcached`
- **Distributed**: `\OC\Memcache\Memcached`
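In `config.php`, these cache settings map to the standard Nextcloud options, set by the OCC installer rather than templated. A sketch of how they typically appear (values assume the defaults above):

```php
'memcache.local'       => '\OC\Memcache\Memcached',
'memcache.distributed' => '\OC\Memcache\Memcached',
'memcached_servers'    => [
    ['localhost', 11211],
],
```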
### Cron Jobs
Background jobs configured via system cron:
```cron
*/5 * * * * php /var/www/nextcloud/cron.php
```
Runs as `www-data` user every 5 minutes.
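In Ansible, a job like this is typically managed with the cron module. A sketch of the task (the playbook's actual task may differ):

```yaml
- name: Nextcloud background jobs
  ansible.builtin.cron:
    name: nextcloud-cron
    user: www-data
    minute: "*/5"
    job: "php /var/www/nextcloud/cron.php"
```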
## Access After Deployment
1. **Web Interface**: https://nextcloud.ouranos.helu.ca/
2. **First Login**: Use admin credentials from vault
3. **Initial Setup**: Configure apps and settings via web UI
4. **Client Apps**: Download desktop/mobile clients from Nextcloud website
### Desktop/Mobile Sync
- **Server URL**: https://nextcloud.ouranos.helu.ca
- **Username**: admin (or created user)
- **Password**: From vault
- **Desktop Client**: https://nextcloud.com/install/#install-clients
- **Mobile Apps**: iOS App Store / Google Play Store
### WebDAV Access
- **WebDAV URL**: `https://nextcloud.ouranos.helu.ca/remote.php/dav/files/USERNAME/`
- **Use Cases**: File sync, calendar (CalDAV), contacts (CardDAV)
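When scripting against the endpoint, a `PROPFIND` request with basic auth and a `Depth` header lists a folder. A minimal sketch using only the Python standard library (credentials are placeholders; the request is built but not sent here):

```python
# Build an authenticated WebDAV PROPFIND request for a user's files.
import base64
import urllib.request


def propfind_request(username: str, password: str) -> urllib.request.Request:
    url = f"https://nextcloud.ouranos.helu.ca/remote.php/dav/files/{username}/"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = urllib.request.Request(url, method="PROPFIND")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Depth", "1")  # list immediate children only
    return req


req = propfind_request("admin", "change-me")
print(req.get_method(), req.full_url)
```

Send it with `urllib.request.urlopen(req)` (or any HTTP client) to get a multistatus XML listing back.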
## Monitoring
### Alloy Configuration
**File:** `ansible/alloy/rosalind/config.alloy.j2`
- **Apache Access Logs**: `/var/log/apache2/access.log` → Loki
- **Apache Error Logs**: `/var/log/apache2/error.log` → Loki
- **System Metrics**: Process exporter tracks Apache/PHP processes
- **Labels**: job=apache_access, job=apache_error
### Health Checks
**HAProxy Health Endpoint**: `/status.php`
**Manual Health Check**:
```bash
curl http://rosalind.incus:22083/status.php
```
Expected response: JSON with status information
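The exact fields vary by Nextcloud version, but a healthy instance returns JSON along these lines (illustrative values):

```json
{
  "installed": true,
  "maintenance": false,
  "needsDbUpgrade": false,
  "version": "29.0.0.0",
  "versionstring": "29.0.0",
  "edition": "",
  "productname": "Nextcloud"
}
```

`"maintenance": true` or a non-200 response will cause HAProxy to mark the backend down.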
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
### 1. Database Password
```yaml
vault_nextcloud_db_password: "PostgresSecurePassword123!"
```
**Requirements:**
- Minimum 12 characters
- Used by PostgreSQL authentication
### 2. Admin Password
```yaml
vault_nextcloud_admin_password: "AdminSecurePassword123!"
```
**Requirements:**
- Minimum 8 characters (Nextcloud requirement)
- Used for admin user login
- **Important**: Store securely, used for web interface access
### 3. Instance Secrets (Auto-Generated)
These are automatically generated during installation by the OCC installer and stored in `/var/www/nextcloud/config/config.php`. The host_vars should leave these empty:
```yaml
nextcloud_instance_id: "" # Auto-generated, leave empty
nextcloud_password_salt: "" # Auto-generated, leave empty
nextcloud_secret: "" # Auto-generated, leave empty
```
**Note:** These secrets persist in `config.php` and do not need to be stored in vault or host_vars. They are only referenced in these variables for consistency with the original template design.
## Host Variables
**File:** `ansible/inventory/host_vars/rosalind.incus.yml`
```yaml
# Nextcloud Configuration
nextcloud_web_port: 22083
nextcloud_data_dir: /mnt/nextcloud
# Database Configuration
nextcloud_db_type: pgsql
nextcloud_db_host: portia.incus
nextcloud_db_port: 5432
nextcloud_db_name: nextcloud
nextcloud_db_user: nextcloud
nextcloud_db_password: "{{ vault_nextcloud_db_password }}"
# Admin Configuration
nextcloud_admin_user: admin
nextcloud_admin_password: "{{vault_nextcloud_admin_password}}"
# Domain Configuration
nextcloud_domain: nextcloud.ouranos.helu.ca
# Instance secrets (generated during install)
nextcloud_instance_id: ""
nextcloud_password_salt: ""
nextcloud_secret: ""
```
## Database Setup
Nextcloud requires a PostgreSQL database on Portia. This is automatically created by the `postgresql/deploy.yml` playbook.
**Database Details:**
- **Name**: nextcloud
- **User**: nextcloud
- **Owner**: nextcloud
- **Extensions**: None required
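For reference, if the database ever needs to be created by hand, the equivalent statements would be roughly as follows (run as the `postgres` superuser on Portia; the password shown is a placeholder for `vault_nextcloud_db_password`):

```sql
CREATE ROLE nextcloud WITH LOGIN PASSWORD 'change-me';
CREATE DATABASE nextcloud OWNER nextcloud ENCODING 'UTF8';
```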
## Storage Setup
### Incus Storage Volume
**Terraform Resource:** `terraform/storage.tf`
```hcl
resource "incus_storage_volume" "nextcloud_data" {
name = "nextcloud-data"
pool = "default"
project = "agathos"
config = { size = "100GB" }
}
```
Mounted at `/mnt/nextcloud` on Rosalind container. This volume stores all Nextcloud user data, including uploaded files, app data, and user-specific configurations.
## Integration with Other Services
### HAProxy Routing
**Backend Configuration** (`titania.incus.yml`):
```yaml
- subdomain: "nextcloud"
backend_host: "rosalind.incus"
backend_port: 22083
health_path: "/status.php"
```
### Memcached Integration
- **Host**: localhost:11211
- **Prefix**: `nc_`
- **Shared Instance**: Rosalind hosts Memcached for all services
## Troubleshooting
### Service Status
```bash
ssh rosalind.incus
sudo systemctl status apache2
```
### View Logs
```bash
# Apache access logs
sudo tail -f /var/log/apache2/access.log
# Apache error logs
sudo tail -f /var/log/apache2/error.log
# Nextcloud logs (via web UI)
# Settings → Logging
```
### OCC Command-Line Tool
```bash
# As www-data user
sudo -u www-data php /var/www/nextcloud/occ
# Examples:
sudo -u www-data php /var/www/nextcloud/occ status
sudo -u www-data php /var/www/nextcloud/occ config:list
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
```
### Database Connection
```bash
psql -h portia.incus -U nextcloud -d nextcloud
```
### Check Memcached
```bash
echo "stats" | nc localhost 11211
```
### Verify Storage Volume
```bash
# Confirm the Incus volume is mounted
df -h /mnt/nextcloud
# Reset ownership
sudo chown -R www-data:www-data /var/www/nextcloud
sudo chown -R www-data:www-data /mnt/nextcloud
# Reset permissions
sudo chmod -R 0750 /var/www/nextcloud
```
### Maintenance Mode
```bash
# Enable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
# Disable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
```
## Updates and Maintenance
### Updating Nextcloud
**⚠️ Important**: Always backup before updating!
```bash
# 1. Enable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
# 2. Backup config and database
sudo cp -r /var/www/nextcloud/config /backup/nextcloud-config-$(date +%Y%m%d)
pg_dump -h portia.incus -U nextcloud nextcloud > /backup/nextcloud-db-$(date +%Y%m%d).sql
# 3. Download new version
wget https://download.nextcloud.com/server/releases/latest.tar.bz2
# 4. Extract and replace (preserve config/)
tar -xjf latest.tar.bz2
sudo rsync -av --delete --exclude config/ nextcloud/ /var/www/nextcloud/
# 5. Run upgrade
sudo -u www-data php /var/www/nextcloud/occ upgrade
# 6. Disable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
```
### Database Maintenance
```bash
# Add missing indices
sudo -u www-data php /var/www/nextcloud/occ db:add-missing-indices
# Convert to bigint
sudo -u www-data php /var/www/nextcloud/occ db:convert-filecache-bigint
```
## Version Information
- **Installation Method**: Tarball extraction (official releases)
- **Current Version**: Check web UI → Settings → Overview
- **Update Channel**: Stable (latest.tar.bz2)
- **PHP Version**: Installed by apt (Ubuntu repository version)
## Docker vs Native Comparison
**Why Native Installation?**
| Aspect | Native (Chosen) | Docker |
|--------|-----------------|--------|
| **Performance** | Better (no container overhead) | Good |
| **Updates** | Manual tarball extraction | Container image pull |
| **Cron Jobs** | System cron (reliable) | Requires sidecar/exec |
| **App Updates** | Direct via web UI | Limited/complex |
| **Customization** | Full PHP/Apache control | Constrained by image |
| **Production Match** | Yes (same pattern) | No |
| **Complexity** | Lower for LAMP stack | Higher for orchestration |
**Recommendation**: Native installation matches production deployment pattern and avoids Docker-specific limitations with Nextcloud's app ecosystem and cron requirements.
## References
- **Official Documentation**: https://docs.nextcloud.com/
- **Admin Manual**: https://docs.nextcloud.com/server/latest/admin_manual/
- **Installation Guide**: https://docs.nextcloud.com/server/latest/admin_manual/installation/
- **OCC Commands**: https://docs.nextcloud.com/server/latest/admin_manual/configuration_server/occ_command.html


---

*docs/oauth2_proxy.md*
# OAuth2-Proxy Authentication Gateway
# Red Panda Approved
## Overview
OAuth2-Proxy provides authentication for services that don't natively support SSO/OIDC.
It acts as a reverse proxy that requires users to authenticate via Casdoor before
accessing the upstream service.
This document describes the generic approach for adding OAuth2-Proxy authentication
to any service in the Agathos infrastructure.
## Architecture
```
┌──────────────┐ ┌───────────────┐ ┌────────────────┐ ┌───────────────┐
│ Browser │────▶│ HAProxy │────▶│ OAuth2-Proxy │────▶│ Your Service │
│ │ │ (titania) │ │ (titania) │ │ (any host) │
└──────────────┘ └───────┬───────┘ └───────┬────────┘ └───────────────┘
│ │
│ ┌───────────────▼───────────────┐
└────▶│ Casdoor │
│ (OIDC Provider - titania) │
└───────────────────────────────┘
```
## How It Works
1. User requests `https://service.ouranos.helu.ca/`
2. HAProxy routes to OAuth2-Proxy (titania:22082)
3. OAuth2-Proxy checks for valid session cookie
4. **No session?** → Redirect to Casdoor login → After login, redirect back with cookie
5. **Valid session?** → Forward request to upstream service
## File Structure
```
ansible/oauth2_proxy/
├── deploy.yml # Main deployment playbook
├── docker-compose.yml.j2 # Docker Compose template
├── oauth2-proxy.cfg.j2 # OAuth2-Proxy configuration
└── stage.yml # Validation/staging playbook
```
Monitoring configuration is integrated into the host-specific Alloy config:
- `ansible/alloy/titania/config.alloy.j2` - Contains OAuth2-Proxy log collection and metrics scraping
## Variable Architecture
The OAuth2-Proxy template uses **generic variables** (`oauth2_proxy_*`) that are
mapped from **service-specific variables** in host_vars:
```
Vault (service-specific) Host Vars (mapping) Template (generic)
──────────────────────── ─────────────────── ──────────────────
vault_<service>_oauth2_* ──► <service>_oauth2_* ──► oauth2_proxy_*
```
This allows:
- Multiple services to use the same OAuth2-Proxy template
- Service-specific credentials in vault
- Clear naming conventions
## Configuration Steps
### Step 1: Create Casdoor Application
1. Log in to Casdoor at `https://id.ouranos.helu.ca/`
2. Navigate to **Applications** → **Add**
3. Configure:
- **Name**: `<your-service>` (e.g., `searxng`, `jupyter`)
- **Organization**: `heluca` (or your organization)
- **Redirect URLs**: `https://<service>.ouranos.helu.ca/oauth2/callback`
- **Grant Types**: `authorization_code`, `refresh_token`
4. Save and note the **Client ID** and **Client Secret**
### Step 2: Add Vault Secrets
```bash
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
```
Add service-specific credentials:
```yaml
# SearXNG OAuth2 credentials
vault_searxng_oauth2_client_id: "abc123..."
vault_searxng_oauth2_client_secret: "secret..."
vault_searxng_oauth2_cookie_secret: "<generate-with-command-below>"
```
Generate cookie secret:
```bash
openssl rand -base64 32
```
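If `openssl` is not available, the same 32-byte, base64-encoded secret can be generated with a short Python snippet (any source of 32 random bytes works):

```python
# Generate a cookie secret for oauth2-proxy, equivalent to
# `openssl rand -base64 32`.
import base64
import secrets


def cookie_secret(nbytes: int = 32) -> str:
    return base64.b64encode(secrets.token_bytes(nbytes)).decode()


print(cookie_secret())
```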
### Step 3: Configure Host Variables
Add to the host that will run OAuth2-Proxy (typically `titania.incus.yml`):
```yaml
# =============================================================================
# <Service> OAuth2 Configuration (Service-Specific)
# =============================================================================
<service>_oauth2_client_id: "{{ vault_<service>_oauth2_client_id }}"
<service>_oauth2_client_secret: "{{ vault_<service>_oauth2_client_secret }}"
<service>_oauth2_cookie_secret: "{{ vault_<service>_oauth2_cookie_secret }}"
# =============================================================================
# OAuth2-Proxy Configuration (Generic Template Variables)
# =============================================================================
oauth2_proxy_user: oauth2proxy
oauth2_proxy_group: oauth2proxy
oauth2_proxy_uid: 802
oauth2_proxy_gid: 802
oauth2_proxy_directory: /srv/oauth2-proxy
oauth2_proxy_port: 22082
# OIDC Configuration
oauth2_proxy_oidc_issuer_url: "http://titania.incus:{{ casdoor_port }}"
# Map service-specific credentials to generic template variables
oauth2_proxy_client_id: "{{ <service>_oauth2_client_id }}"
oauth2_proxy_client_secret: "{{ <service>_oauth2_client_secret }}"
oauth2_proxy_cookie_secret: "{{ <service>_oauth2_cookie_secret }}"
# Service-specific URLs
oauth2_proxy_redirect_url: "https://<service>.{{ haproxy_domain }}/oauth2/callback"
oauth2_proxy_upstream_url: "http://<service-host>:<service-port>"
oauth2_proxy_cookie_domain: "{{ haproxy_domain }}"
# Access Control
oauth2_proxy_email_domains:
- "*" # Or restrict to specific domains
# Session Configuration
oauth2_proxy_cookie_expire: "168h"
oauth2_proxy_cookie_refresh: "1h"
# SSL Verification
oauth2_proxy_skip_ssl_verify: true # Set false for production
```
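For orientation, the variables above render into `oauth2-proxy.cfg` roughly as follows. This is a sketch using oauth2-proxy's standard config option names; the actual `oauth2-proxy.cfg.j2` template may set additional options:

```toml
provider                = "oidc"
oidc_issuer_url         = "http://titania.incus:22081"
client_id               = "<from vault>"
client_secret           = "<from vault>"
cookie_secret           = "<from vault>"
redirect_url            = "https://<service>.ouranos.helu.ca/oauth2/callback"
upstreams               = ["http://<service-host>:<service-port>"]
email_domains           = ["*"]
cookie_domains          = ["ouranos.helu.ca"]
cookie_expire           = "168h"
cookie_refresh          = "1h"
http_address            = "0.0.0.0:22082"
ssl_insecure_skip_verify = true
```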
### Step 4: Update HAProxy Backend
Change the service backend to route through OAuth2-Proxy:
```yaml
haproxy_backends:
- subdomain: "<service>"
backend_host: "titania.incus" # OAuth2-Proxy host
backend_port: 22082 # OAuth2-Proxy port
health_path: "/ping" # OAuth2-Proxy health endpoint
```
### Step 5: Deploy
```bash
cd ansible
# Validate configuration
ansible-playbook oauth2_proxy/stage.yml
# Deploy OAuth2-Proxy
ansible-playbook oauth2_proxy/deploy.yml
# Update HAProxy routing
ansible-playbook haproxy/deploy.yml
```
## Complete Example: SearXNG
### Vault Variables
```yaml
vault_searxng_oauth2_client_id: "searxng-client-id-from-casdoor"
vault_searxng_oauth2_client_secret: "searxng-client-secret-from-casdoor"
vault_searxng_oauth2_cookie_secret: "ABCdef123..."
```
### Host Variables (titania.incus.yml)
```yaml
# SearXNG OAuth2 (service-specific)
searxng_oauth2_client_id: "{{ vault_searxng_oauth2_client_id }}"
searxng_oauth2_client_secret: "{{ vault_searxng_oauth2_client_secret }}"
searxng_oauth2_cookie_secret: "{{ vault_searxng_oauth2_cookie_secret }}"
# OAuth2-Proxy (generic mapping)
oauth2_proxy_client_id: "{{ searxng_oauth2_client_id }}"
oauth2_proxy_client_secret: "{{ searxng_oauth2_client_secret }}"
oauth2_proxy_cookie_secret: "{{ searxng_oauth2_cookie_secret }}"
oauth2_proxy_redirect_url: "https://searxng.{{ haproxy_domain }}/oauth2/callback"
oauth2_proxy_upstream_url: "http://oberon.incus:25599"
```
### HAProxy Backend
```yaml
- subdomain: "searxng"
backend_host: "titania.incus"
backend_port: 22082
health_path: "/ping"
```
## Adding a Second Service (e.g., Jupyter)
When adding authentication to another service, you would:
1. Create a new Casdoor application for Jupyter
2. Add vault variables:
```yaml
vault_jupyter_oauth2_client_id: "..."
vault_jupyter_oauth2_client_secret: "..."
vault_jupyter_oauth2_cookie_secret: "..."
```
3. Either:
- **Option A**: Deploy a second OAuth2-Proxy instance on a different port
- **Option B**: Configure the same OAuth2-Proxy with multiple upstreams (more complex)
For multiple services, **Option A** is recommended for isolation and simplicity.
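Under Option A, the second instance is simply another container on its own port. A hypothetical Compose service for Jupyter might look like this (image tag, port, and file names are placeholders):

```yaml
services:
  oauth2-proxy-jupyter:
    image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0   # pin a tested version
    restart: unless-stopped
    ports:
      - "22084:22084"                                 # distinct from 22082
    volumes:
      - ./oauth2-proxy-jupyter.cfg:/etc/oauth2-proxy.cfg:ro
    command: ["--config=/etc/oauth2-proxy.cfg"]
```

The corresponding HAProxy backend for Jupyter would then point at port 22084.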
## Monitoring
OAuth2-Proxy monitoring is handled by Grafana Alloy, which runs on each host.
### Architecture
```
OAuth2-Proxy ─────► Grafana Alloy ─────► Prometheus (prospero)
(titania) (local agent) (remote_write)
└─────────────► Loki (prospero)
(log forwarding)
```
### Metrics (via Prometheus)
Alloy scrapes OAuth2-Proxy metrics at `/metrics` and forwards them to Prometheus:
- `oauth2_proxy_requests_total` - Total requests processed
- `oauth2_proxy_errors_total` - Total errors
- `oauth2_proxy_upstream_latency_seconds` - Latency to upstream service
Configuration in `ansible/alloy/titania/config.alloy.j2`:
```alloy
prometheus.scrape "oauth2_proxy" {
targets = [{"__address__" = "127.0.0.1:{{oauth2_proxy_port}}"}]
scrape_interval = "30s"
forward_to = [prometheus.remote_write.default.receiver]
job_name = "oauth2-proxy"
}
```
### Logs (via Loki)
OAuth2-Proxy logs are collected via syslog and forwarded to Loki:
```alloy
loki.source.syslog "oauth2_proxy_logs" {
listener {
address = "127.0.0.1:{{oauth2_proxy_syslog_port}}"
protocol = "tcp"
labels = { job = "oauth2-proxy", hostname = "{{inventory_hostname}}" }
}
forward_to = [loki.write.default.receiver]
}
```
### Deploy Alloy After Changes
If you update the Alloy configuration:
```bash
ansible-playbook alloy/deploy.yml --limit titania.incus
```
## Security Considerations
1. **Cookie Security**:
- `cookie_secure = true` - HTTPS only
- `cookie_httponly = true` - No JavaScript access
- `cookie_samesite = "lax"` - CSRF protection
2. **Access Control**:
- Use `oauth2_proxy_email_domains` to restrict by email domain
- Use `oauth2_proxy_allowed_groups` to restrict by Casdoor groups
3. **SSL Verification**:
- Set `oauth2_proxy_skip_ssl_verify: false` in production
- Ensure Casdoor has valid SSL certificates
## Troubleshooting
### Check OAuth2-Proxy Logs
```bash
ssh titania.incus
docker logs oauth2-proxy
```
### Test OIDC Discovery
```bash
curl http://titania.incus:22081/.well-known/openid-configuration
```
### Verify Cookie Domain
Ensure `oauth2_proxy_cookie_domain` matches your HAProxy domain.
### Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Redirect loop | Cookie domain mismatch | Check `oauth2_proxy_cookie_domain` |
| 403 Forbidden | Email domain not allowed | Update `oauth2_proxy_email_domains` |
| OIDC discovery failed | Casdoor not accessible | Check network/firewall |
| Invalid redirect URI | Mismatch in Casdoor app | Verify redirect URL in Casdoor |
## Related Documentation
- [SearXNG Authentication](services/searxng-auth.md) - Specific implementation details
- [Casdoor Documentation](casdoor.md) - Identity provider configuration


---

*docs/openwebui.md*
# Open WebUI
Open WebUI is an extensible, self-hosted AI interface that provides a web-based chat experience for interacting with LLMs. This document covers deployment, Casdoor SSO integration, and configuration.
## Architecture
### Components
| Component | Location | Purpose |
|-----------|----------|---------|
| Open WebUI | Native on Oberon | AI chat interface |
| PostgreSQL | Portia | Database with pgvector extension |
| Casdoor | Titania | SSO identity provider |
| HAProxy | Ariel | TLS termination, routing |
### Network Diagram
```
┌────────────────────────────────────────────────────────────────────┐
│ External Access │
│ https://openwebui.ouranos.helu.ca │
└───────────────────────────────┬────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ ariel.incus (HAProxy) │
│ TLS termination → proxy to oberon.incus:25588 │
└───────────────────────────────┬────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ oberon.incus │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Open WebUI (systemd) │ │
│ │ - Python 3.12 virtual environment │ │
│ │ - Port 25588 │ │
│ │ - OAuth/OIDC via Casdoor │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │ │
│ │ PostgreSQL │ OIDC │
│ ▼ ▼ │
│ portia.incus:5432 titania.incus:22081 │
│ (openwebui database) (Casdoor SSO) │
└────────────────────────────────────────────────────────────────────┘
```
### Network Ports
| Port | Service | Access |
|------|---------|--------|
| 25588 | Open WebUI HTTP | Via HAProxy |
| 5432 | PostgreSQL | Internal (Portia) |
| 22081 | Casdoor | Internal (Titania) |
## Casdoor SSO Integration
Open WebUI uses native OAuth/OIDC to authenticate against Casdoor. Local signup is disabled—all users must authenticate through Casdoor.
### How It Works
1. User visits `https://openwebui.ouranos.helu.ca`
2. Open WebUI redirects to Casdoor login page
3. User authenticates with Casdoor credentials
4. Casdoor redirects back with authorization code
5. Open WebUI exchanges code for tokens and creates/updates user session
6. User email from Casdoor becomes their Open WebUI identity
### Configuration
OAuth settings are defined in host variables and rendered into the environment file:
**Host Variables** (`inventory/host_vars/oberon.incus.yml`):
```yaml
# OAuth/OIDC Configuration (Casdoor SSO)
openwebui_oauth_client_id: "{{ vault_openwebui_oauth_client_id }}"
openwebui_oauth_client_secret: "{{ vault_openwebui_oauth_client_secret }}"
openwebui_oauth_provider_name: "Casdoor"
openwebui_oauth_provider_url: "https://id.ouranos.helu.ca/.well-known/openid-configuration"
# Disable local authentication
openwebui_enable_signup: false
openwebui_enable_email_login: false
```
**Environment Variables** (rendered from `openwebui.env.j2`):
```bash
ENABLE_SIGNUP=false
ENABLE_EMAIL_LOGIN=false
ENABLE_OAUTH_SIGNUP=true
OAUTH_CLIENT_ID=<client-id>
OAUTH_CLIENT_SECRET=<client-secret>
OAUTH_PROVIDER_NAME=Casdoor
OPENID_PROVIDER_URL=https://id.ouranos.helu.ca/.well-known/openid-configuration
```
### Casdoor Application
The `app-openwebui` application is defined in `ansible/casdoor/init_data.json.j2`:
| Setting | Value |
|---------|-------|
| Name | `app-openwebui` |
| Display Name | Open WebUI |
| Redirect URI | `https://openwebui.ouranos.helu.ca/oauth/oidc/callback` |
| Grant Types | `authorization_code`, `refresh_token` |
| Token Format | JWT |
| Token Expiry | 168 hours (7 days) |
## Prerequisites
### 1. PostgreSQL Database
The `openwebui` database must exist on Portia with the `pgvector` extension:
```bash
ansible-playbook postgresql/deploy.yml
```
### 2. Casdoor SSO
Casdoor must be deployed and the `app-openwebui` application configured:
```bash
ansible-playbook casdoor/deploy.yml
```
### 3. Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
```yaml
# OpenWebUI
vault_openwebui_secret_key: "<random-secret>"
vault_openwebui_db_password: "<database-password>"
vault_openwebui_oauth_client_id: "<from-casdoor>"
vault_openwebui_oauth_client_secret: "<from-casdoor>"
# API Keys (optional)
vault_openwebui_openai_api_key: "<openai-key>"
vault_openwebui_anthropic_api_key: "<anthropic-key>"
vault_openwebui_groq_api_key: "<groq-key>"
vault_openwebui_mistral_api_key: "<mistral-key>"
```
Generate secrets:
```bash
# Secret key
openssl rand -hex 32
# Database password
openssl rand -base64 24
```
## Deployment
### Fresh Installation
```bash
cd ansible
# 1. Ensure PostgreSQL is deployed
ansible-playbook postgresql/deploy.yml
# 2. Deploy Casdoor (if not already deployed)
ansible-playbook casdoor/deploy.yml
# 3. Get OAuth credentials from Casdoor admin UI
# - Navigate to https://id.ouranos.helu.ca
# - Go to Applications → app-openwebui
# - Copy Client ID and Client Secret
# - Update vault.yml with these values
# 4. Deploy Open WebUI
ansible-playbook openwebui/deploy.yml
```
### Verify Deployment
```bash
# Check service status
ssh oberon.incus "sudo systemctl status openwebui"
# View logs
ssh oberon.incus "sudo journalctl -u openwebui -f"
# Test health endpoint
curl -s http://oberon.incus:25588/health
# Test via HAProxy
curl -s https://openwebui.ouranos.helu.ca/health
```
### Redeployment
To redeploy Open WebUI (preserves database):
```bash
ansible-playbook openwebui/deploy.yml
```
## Configuration Reference
### Host Variables
Located in `ansible/inventory/host_vars/oberon.incus.yml`:
```yaml
# Service account
openwebui_user: openwebui
openwebui_group: openwebui
openwebui_directory: /srv/openwebui
openwebui_port: 25588
openwebui_host: oberon.incus
# Database
openwebui_db_host: portia.incus
openwebui_db_port: 5432
openwebui_db_name: openwebui
openwebui_db_user: openwebui
openwebui_db_password: "{{ vault_openwebui_db_password }}"
# Authentication (SSO only)
openwebui_enable_signup: false
openwebui_enable_email_login: false
# OAuth/OIDC (Casdoor)
openwebui_oauth_client_id: "{{ vault_openwebui_oauth_client_id }}"
openwebui_oauth_client_secret: "{{ vault_openwebui_oauth_client_secret }}"
openwebui_oauth_provider_name: "Casdoor"
openwebui_oauth_provider_url: "https://id.ouranos.helu.ca/.well-known/openid-configuration"
# API Keys
openwebui_openai_api_key: "{{ vault_openwebui_openai_api_key }}"
openwebui_anthropic_api_key: "{{ vault_openwebui_anthropic_api_key }}"
openwebui_groq_api_key: "{{ vault_openwebui_groq_api_key }}"
openwebui_mistral_api_key: "{{ vault_openwebui_mistral_api_key }}"
```
### Data Persistence
Open WebUI data locations:
```
/srv/openwebui/
├── .venv/ # Python virtual environment
├── .env # Environment configuration
└── data/ # User uploads, cache
```
Database (on Portia):
```
PostgreSQL: openwebui database with pgvector extension
```
## User Management
### First-Time Setup
After deployment, the first user to authenticate via Casdoor becomes an admin. Subsequent users get standard user roles.
### Promoting Users to Admin
1. Log in as an existing admin
2. Navigate to Admin Panel → Users
3. Select the user and change their role to Admin
### Existing Users Migration
If users were created before SSO was enabled:
- Users with matching email addresses will be linked automatically
- Users without matching emails must be recreated through Casdoor
## Troubleshooting
### Service Issues
```bash
# Check service status
ssh oberon.incus "sudo systemctl status openwebui"
# View logs
ssh oberon.incus "sudo journalctl -u openwebui -n 100"
# Restart service
ssh oberon.incus "sudo systemctl restart openwebui"
```
### OAuth/OIDC Issues
```bash
# Verify Casdoor is accessible
curl -s https://id.ouranos.helu.ca/.well-known/openid-configuration | jq
# Check redirect URI matches
# Must be: https://openwebui.ouranos.helu.ca/oauth/oidc/callback
# Verify client credentials in environment
ssh oberon.incus "sudo grep OAUTH /srv/openwebui/.env"
```
### Database Issues
```bash
# Test database connection
ssh oberon.incus "PGPASSWORD=<password> psql -h portia.incus -U openwebui -d openwebui -c '\dt'"
# Check pgvector extension
ssh portia.incus "sudo -u postgres psql -d openwebui -c '\dx'"
```
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| "Invalid redirect_uri" | Mismatch between Casdoor config and Open WebUI | Verify redirect URI in Casdoor matches exactly |
| "Invalid client credentials" | Wrong client ID/secret | Update vault with correct values from Casdoor |
| "OIDC discovery failed" | Casdoor unreachable | Check Casdoor is running on Titania |
| "Database connection failed" | PostgreSQL unreachable | Verify PostgreSQL on Portia, check network |
## Security Considerations
1. **SSO-only authentication** - Local signup disabled, all users authenticate through Casdoor
2. **API keys in vault** - All API keys stored encrypted in Ansible vault
3. **Database credentials** - Stored in vault, rendered to environment file with restrictive permissions (0600)
4. **Session security** - JWT tokens with 7-day expiry, managed by Casdoor
## Related Documentation
- [Casdoor SSO](services/casdoor.md) - Identity provider configuration
- [PostgreSQL](../ansible.md) - Database deployment
- [HAProxy](../terraform.md) - TLS termination and routing


---

*docs/ouranos.html*
<!DOCTYPE html>
<html lang="en" data-bs-theme="light">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Ouranos Lab - Red Panda Approved Infrastructure</title>
<!-- Bootswatch Flatly -->
<link href="https://cdn.jsdelivr.net/npm/bootswatch@5.3.2/dist/flatly/bootstrap.min.css" rel="stylesheet">
<!-- Bootstrap Icons -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css">
<style>
html { scroll-behavior: smooth; }
#scrollTopBtn {
position: fixed;
bottom: 20px;
right: 20px;
z-index: 1000;
display: none;
border-radius: 50%;
width: 50px;
height: 50px;
box-shadow: 0 2px 10px rgba(0,0,0,0.3);
}
</style>
</head>
<body>
<div class="container-fluid px-4">
<!-- Navbar -->
<nav class="navbar navbar-expand-lg navbar-dark bg-primary rounded mb-4 mt-3">
<div class="container-fluid">
<a class="navbar-brand fw-bold" href="#">
<i class="bi bi-diagram-3-fill"></i> Ouranos Lab
</a>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarNav">
<ul class="navbar-nav me-auto">
<li class="nav-item"><a class="nav-link" href="#overview"><i class="bi bi-info-circle"></i> Overview</a></li>
<li class="nav-item"><a class="nav-link" href="#hosts"><i class="bi bi-hdd-network"></i> Hosts</a></li>
<li class="nav-item"><a class="nav-link" href="#routing"><i class="bi bi-signpost-split"></i> Routing</a></li>
<li class="nav-item"><a class="nav-link" href="#infrastructure"><i class="bi bi-gear"></i> Infrastructure</a></li>
<li class="nav-item"><a class="nav-link" href="#automation"><i class="bi bi-play-circle"></i> Automation</a></li>
<li class="nav-item"><a class="nav-link" href="#dataflow"><i class="bi bi-diagram-2"></i> Data Flow</a></li>
</ul>
<button id="darkModeToggle" class="btn btn-outline-light btn-sm" title="Toggle dark mode">
<i class="bi bi-moon-fill"></i>
</button>
</div>
</div>
</nav>
<!-- Hero -->
<header class="bg-primary text-white py-5 rounded mb-4">
<div class="container">
<div class="row align-items-center">
<div class="col-lg-8">
<h1 class="display-4 fw-bold"><i class="bi bi-diagram-3-fill"></i> Ouranos Lab</h1>
<p class="lead">Red Panda Approved™ Infrastructure as Code</p>
<p class="mb-0">10 Incus containers named after moons of Uranus, provisioned with Terraform and configured with Ansible. Accessible at <a href="https://ouranos.helu.ca" class="text-white fw-bold">ouranos.helu.ca</a></p>
</div>
<div class="col-lg-4 text-center mt-3 mt-lg-0">
<div class="badge bg-success fs-6 p-3">
<i class="bi bi-check-circle-fill"></i> Red Panda Approved™
</div>
</div>
</div>
</div>
</header>
<!-- Overview -->
<section id="overview" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-info-circle text-primary me-2"></i>Project Overview</h2>
<div class="alert alert-info border-start border-4 border-info">
<p class="mb-1">Ouranos is a comprehensive infrastructure-as-code project that provisions and manages a complete development sandbox environment. All infrastructure and configuration is tracked in Git for reproducible deployments.</p>
<p class="mb-0"><i class="bi bi-exclamation-triangle-fill text-warning me-1"></i><strong>DNS Domain:</strong> Incus resolves containers via the <code>.incus</code> suffix (e.g., <code>oberon.incus</code>). IPv4 addresses are dynamically assigned — always use DNS names, never hardcode IPs.</p>
</div>
<div class="row g-4">
<div class="col-md-6">
<div class="card h-100 border-primary">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-diagram-3 me-2"></i>Terraform</h5>
</div>
<div class="card-body">
<p class="card-text">Provisions the Uranian host containers with:</p>
<ul class="mb-0">
<li>10 specialised Incus containers (LXC)</li>
<li>DNS-resolved networking (<code>.incus</code> domain)</li>
<li>Security policies and nested Docker support</li>
<li>Port proxy devices and resource dependencies</li>
<li>Incus S3 buckets for object storage (Casdoor, LobeChat)</li>
</ul>
</div>
<div class="card-footer text-muted small"><i class="bi bi-check-circle me-1"></i>Idempotent, elegant, observable</div>
</div>
</div>
<div class="col-md-6">
<div class="card h-100 border-success">
<div class="card-header bg-success text-white">
<h5 class="mb-0"><i class="bi bi-gear-fill me-2"></i>Ansible</h5>
</div>
<div class="card-body">
<p class="card-text">Deploys and configures all services:</p>
<ul class="mb-0">
<li>Docker engine on nested-capable hosts</li>
<li>Databases: PostgreSQL (Portia), Neo4j (Ariel)</li>
<li>Observability: Prometheus, Loki, Grafana (Prospero)</li>
<li>Application runtimes and LLM proxies</li>
<li>HAProxy TLS termination and Casdoor SSO (Titania)</li>
</ul>
</div>
<div class="card-footer text-muted small"><i class="bi bi-check-circle me-1"></i>Idempotent, auditable, integrated</div>
</div>
</div>
</div>
</section>
<!-- Hosts -->
<section id="hosts" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-hdd-network text-primary me-2"></i>Uranian Host Architecture</h2>
<div class="card mb-4">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-table me-2"></i>Hosts Summary</h5>
</div>
<div class="card-body p-0">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr>
<th><i class="bi bi-tag me-1"></i>Name</th>
<th><i class="bi bi-briefcase me-1"></i>Role</th>
<th><i class="bi bi-list-ul me-1"></i>Key Services</th>
<th class="text-center"><i class="bi bi-shield me-1"></i>Nesting</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ariel</strong></td>
<td><span class="badge bg-warning text-dark">graph_database</span></td>
<td>Neo4j 5.26.0</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
<tr>
<td><strong>caliban</strong></td>
<td><span class="badge bg-secondary">agent_automation</span></td>
<td>Agent S MCP Server, Kernos, MATE Desktop, GPU</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
<tr>
<td><strong>miranda</strong></td>
<td><span class="badge bg-info">mcp_docker_host</span></td>
<td>MCPO, Grafana MCP, Gitea MCP, Neo4j MCP, Argos MCP</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
<tr>
<td><strong>oberon</strong></td>
<td><span class="badge bg-primary">container_orchestration</span></td>
<td>MCP Switchboard, RabbitMQ, Open WebUI, SearXNG, Home Assistant, smtp4dev</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
<tr>
<td><strong>portia</strong></td>
<td><span class="badge bg-success">database</span></td>
<td>PostgreSQL 16</td>
<td class="text-center"><i class="bi bi-x-circle-fill text-danger"></i></td>
</tr>
<tr>
<td><strong>prospero</strong></td>
<td><span class="badge bg-dark">observability</span></td>
<td>Prometheus, Loki, Grafana, PgAdmin, AlertManager</td>
<td class="text-center"><i class="bi bi-x-circle-fill text-danger"></i></td>
</tr>
<tr>
<td><strong>puck</strong></td>
<td><span class="badge bg-danger">application_runtime</span></td>
<td>JupyterLab, Gitea Runner, Django apps (6×)</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
<tr>
<td><strong>rosalind</strong></td>
<td><span class="badge bg-success">collaboration</span></td>
<td>Gitea, LobeChat, Nextcloud, AnythingLLM</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
<tr>
<td><strong>sycorax</strong></td>
<td><span class="badge bg-secondary">language_models</span></td>
<td>Arke LLM Proxy</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
<tr>
<td><strong>titania</strong></td>
<td><span class="badge bg-primary">proxy_sso</span></td>
<td>HAProxy, Casdoor SSO, certbot</td>
<td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<!-- Host Detail Cards -->
<div class="row g-4">
<div class="col-lg-6">
<div class="card h-100 border-primary">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-box me-2"></i>oberon — Container Orchestration</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">King of the Fairies orchestrating containers and managing MCP infrastructure.</p>
<ul class="mb-0">
<li>Docker engine</li>
<li><strong>MCP Switchboard</strong> (port 22785) — Django app routing MCP tool calls</li>
<li><strong>RabbitMQ</strong> message queue</li>
<li><strong>Open WebUI</strong> LLM interface (port 22088, PostgreSQL backend on Portia)</li>
<li><strong>SearXNG</strong> privacy search (port 22073, behind OAuth2-Proxy)</li>
<li><strong>Home Assistant</strong> (port 8123)</li>
<li><strong>smtp4dev</strong> SMTP test server (port 22025)</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100 border-success">
<div class="card-header bg-success text-white">
<h5 class="mb-0"><i class="bi bi-database me-2"></i>portia — Relational Database</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Intelligent and resourceful — the reliability of relational databases.</p>
<ul class="mb-0">
<li>PostgreSQL 16 (port 5432)</li>
<li>Databases: <code>arke</code>, <code>anythingllm</code>, <code>gitea</code>, <code>hass</code>, <code>lobechat</code>, <code>mcp_switchboard</code>, <code>nextcloud</code>, <code>openwebui</code>, <code>spelunker</code></li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100 border-warning">
<div class="card-header bg-warning text-dark">
<h5 class="mb-0"><i class="bi bi-diagram-2 me-2"></i>ariel — Graph Database</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Air spirit — ethereal, interconnected nature mirroring graph relationships.</p>
<ul class="mb-0">
<li>Neo4j 5.26.0 (Docker)</li>
<li>HTTP API: port 25554</li>
<li>Bolt: port 7687</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100 border-danger">
<div class="card-header bg-danger text-white">
<h5 class="mb-0"><i class="bi bi-code-slash me-2"></i>puck — Application Runtime</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Shape-shifting trickster embodying Python's versatility.</p>
<ul class="mb-0">
<li>Docker engine</li>
<li><strong>JupyterLab</strong> (port 22071 via OAuth2-Proxy)</li>
<li><strong>Gitea Runner</strong> CI/CD agent</li>
<li>Django apps: <strong>Angelia</strong> (22281), <strong>Athena</strong> (22481), <strong>Kairos</strong> (22581), <strong>Icarlos</strong> (22681), <strong>Spelunker</strong> (22881), <strong>Peitho</strong> (22981)</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100 border-dark">
<div class="card-header bg-dark text-white">
<h5 class="mb-0"><i class="bi bi-graph-up me-2"></i>prospero — Observability Stack</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Master magician observing all events.</p>
<ul class="mb-0">
<li>PPLG stack via Docker Compose: Prometheus, Loki, Grafana, PgAdmin</li>
<li>Internal HAProxy with OAuth2-Proxy for all dashboards</li>
<li>AlertManager with Pushover notifications</li>
<li>Prometheus node-exporter metrics from all hosts</li>
<li>Loki log aggregation via Alloy (all hosts)</li>
<li>Grafana with Casdoor SSO integration</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100 border-info">
<div class="card-header bg-info text-white">
<h5 class="mb-0"><i class="bi bi-chat-dots me-2"></i>miranda — MCP Docker Host</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Curious bridge between worlds — hosting MCP server containers.</p>
<ul class="mb-0">
<li>Docker engine (API on port 2375 for MCP Switchboard)</li>
<li><strong>MCPO</strong> OpenAI-compatible MCP proxy</li>
<li><strong>Grafana MCP Server</strong> — Grafana API integration (port 25533)</li>
<li><strong>Gitea MCP Server</strong> (port 25535)</li>
<li><strong>Neo4j MCP Server</strong></li>
<li><strong>Argos MCP Server</strong> — web search via SearXNG (port 25534)</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100">
<div class="card-header bg-secondary text-white">
<h5 class="mb-0"><i class="bi bi-magic me-2"></i>sycorax — Language Models</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Original magical power wielding language magic.</p>
<ul class="mb-0">
<li><strong>Arke</strong> LLM API Proxy (port 25540)</li>
<li>Multi-provider support (OpenAI, Anthropic, etc.)</li>
<li>Session management with Memcached</li>
<li>Database backend on Portia</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100">
<div class="card-header bg-secondary text-white">
<h5 class="mb-0"><i class="bi bi-robot me-2"></i>caliban — Agent Automation</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Autonomous computer agent learning through environmental interaction.</p>
<ul class="mb-0">
<li>Docker engine</li>
<li><strong>Agent S MCP Server</strong> (MATE desktop, AT-SPI automation)</li>
<li><strong>Kernos</strong> MCP Shell Server (port 22021)</li>
<li>GPU passthrough for vision tasks</li>
<li>RDP access (port 25521)</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100 border-success">
<div class="card-header bg-success text-white">
<h5 class="mb-0"><i class="bi bi-people me-2"></i>rosalind — Collaboration Services</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Witty and resourceful moon for PHP, Go, and Node.js runtimes.</p>
<ul class="mb-0">
<li><strong>Gitea</strong> self-hosted Git (port 22082, SSH on 22022)</li>
<li><strong>LobeChat</strong> AI chat interface (port 22081)</li>
<li><strong>Nextcloud</strong> file sharing and collaboration (port 22083)</li>
<li><strong>AnythingLLM</strong> document AI workspace (port 22084)</li>
<li>Nextcloud data on dedicated Incus storage volume</li>
</ul>
</div>
</div>
</div>
<div class="col-lg-6">
<div class="card h-100 border-primary">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-shield-check me-2"></i>titania — Proxy &amp; SSO Services</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Queen of the Fairies managing access control and authentication.</p>
<ul class="mb-0">
<li><strong>HAProxy 3.x</strong> with TLS termination (port 443)</li>
<li>Let's Encrypt wildcard certificate via certbot DNS-01 (Namecheap)</li>
<li>HTTP to HTTPS redirect (port 80)</li>
<li>Gitea SSH proxy (port 22022)</li>
<li><strong>Casdoor SSO</strong> (port 22081, local PostgreSQL)</li>
<li>Prometheus metrics at <code>:8404/metrics</code></li>
</ul>
</div>
</div>
</div>
</div>
</section>
<!-- Routing -->
<section id="routing" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-signpost-split text-primary me-2"></i>External Access via HAProxy</h2>
<div class="alert alert-primary border-start border-4 border-primary">
<p class="mb-0">Titania provides TLS termination and reverse proxy for all services. <strong>Base domain:</strong> <a href="https://ouranos.helu.ca" class="alert-link">ouranos.helu.ca</a> — HTTPS port 443, HTTP port 80 (redirects to HTTPS). Certificate: Let's Encrypt wildcard via certbot DNS-01 (Namecheap).</p>
</div>
<div class="card">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-table me-2"></i>Route Table</h5>
</div>
<div class="card-body p-0">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr>
<th><i class="bi bi-link-45deg me-1"></i>Subdomain</th>
<th><i class="bi bi-hdd-network me-1"></i>Backend</th>
<th><i class="bi bi-app me-1"></i>Service</th>
</tr>
</thead>
<tbody>
<tr><td><code>ouranos.helu.ca</code> <span class="badge bg-secondary">root</span></td><td><code>puck.incus:22281</code></td><td>Angelia (Django)</td></tr>
<tr><td><code>alertmanager.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>AlertManager</td></tr>
<tr><td><code>angelia.ouranos.helu.ca</code></td><td><code>puck.incus:22281</code></td><td>Angelia (Django)</td></tr>
<tr><td><code>anythingllm.ouranos.helu.ca</code></td><td><code>rosalind.incus:22084</code></td><td>AnythingLLM</td></tr>
<tr><td><code>arke.ouranos.helu.ca</code></td><td><code>sycorax.incus:25540</code></td><td>Arke LLM Proxy</td></tr>
<tr><td><code>athena.ouranos.helu.ca</code></td><td><code>puck.incus:22481</code></td><td>Athena (Django)</td></tr>
<tr><td><code>gitea.ouranos.helu.ca</code></td><td><code>rosalind.incus:22082</code></td><td>Gitea</td></tr>
<tr><td><code>grafana.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>Grafana</td></tr>
<tr><td><code>hass.ouranos.helu.ca</code></td><td><code>oberon.incus:8123</code></td><td>Home Assistant</td></tr>
<tr><td><code>id.ouranos.helu.ca</code></td><td><code>titania.incus:22081</code></td><td>Casdoor SSO</td></tr>
<tr><td><code>icarlos.ouranos.helu.ca</code></td><td><code>puck.incus:22681</code></td><td>Icarlos (Django)</td></tr>
<tr><td><code>jupyterlab.ouranos.helu.ca</code></td><td><code>puck.incus:22071</code></td><td>JupyterLab <span class="badge bg-secondary">OAuth2-Proxy</span></td></tr>
<tr><td><code>kairos.ouranos.helu.ca</code></td><td><code>puck.incus:22581</code></td><td>Kairos (Django)</td></tr>
<tr><td><code>lobechat.ouranos.helu.ca</code></td><td><code>rosalind.incus:22081</code></td><td>LobeChat</td></tr>
<tr><td><code>loki.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>Loki</td></tr>
<tr><td><code>mcp-switchboard.ouranos.helu.ca</code></td><td><code>oberon.incus:22785</code></td><td>MCP Switchboard</td></tr>
<tr><td><code>nextcloud.ouranos.helu.ca</code></td><td><code>rosalind.incus:22083</code></td><td>Nextcloud</td></tr>
<tr><td><code>openwebui.ouranos.helu.ca</code></td><td><code>oberon.incus:22088</code></td><td>Open WebUI</td></tr>
<tr><td><code>peitho.ouranos.helu.ca</code></td><td><code>puck.incus:22981</code></td><td>Peitho (Django)</td></tr>
<tr><td><code>pgadmin.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>PgAdmin 4</td></tr>
<tr><td><code>prometheus.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>Prometheus</td></tr>
<tr><td><code>searxng.ouranos.helu.ca</code></td><td><code>oberon.incus:22073</code></td><td>SearXNG <span class="badge bg-secondary">OAuth2-Proxy</span></td></tr>
<tr><td><code>smtp4dev.ouranos.helu.ca</code></td><td><code>oberon.incus:22085</code></td><td>smtp4dev</td></tr>
<tr><td><code>spelunker.ouranos.helu.ca</code></td><td><code>puck.incus:22881</code></td><td>Spelunker (Django)</td></tr>
</tbody>
</table>
</div>
</div>
</div>
</section>
<!-- Infrastructure Management -->
<section id="infrastructure" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-gear text-primary me-2"></i>Infrastructure Management</h2>
<div class="row g-4 mb-4">
<div class="col-md-6">
<div class="card h-100 border-primary">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-play-circle me-2"></i>Quick Start</h5>
</div>
<div class="card-body">
<pre class="mb-0"><code># Provision containers
cd terraform
terraform init
terraform plan
terraform apply
# Start all containers
cd ../ansible
source ~/env/agathos/bin/activate
ansible-playbook sandbox_up.yml
# Deploy all services
ansible-playbook site.yml
# Stop all containers
ansible-playbook sandbox_down.yml</code></pre>
</div>
</div>
</div>
<div class="col-md-6">
<div class="card h-100 border-warning">
<div class="card-header bg-warning text-dark">
<h5 class="mb-0"><i class="bi bi-shield-lock me-2"></i>Vault Management</h5>
</div>
<div class="card-body">
<pre class="mb-0"><code># Edit secrets
ansible-vault edit \
inventory/group_vars/all/vault.yml
# View secrets
ansible-vault view \
inventory/group_vars/all/vault.yml
# Encrypt a new file
ansible-vault encrypt new_secrets.yml</code></pre>
</div>
</div>
</div>
</div>
<div class="row g-4">
<div class="col-md-6">
<div class="alert alert-primary border-start border-4 border-primary h-100 mb-0">
<h5><i class="bi bi-lightning-fill me-2"></i>Terraform Workflow</h5>
<ol class="mb-0">
<li><strong>Define</strong> — Containers, networks, and resources in <code>*.tf</code> files</li>
<li><strong>Plan</strong> — Review changes with <code>terraform plan</code></li>
<li><strong>Apply</strong> — Provision with <code>terraform apply</code></li>
<li><strong>Verify</strong> — Check outputs and container status</li>
</ol>
</div>
</div>
<div class="col-md-6">
<div class="alert alert-success border-start border-4 border-success h-100 mb-0">
<h5><i class="bi bi-check-circle-fill me-2"></i>Ansible Workflow</h5>
<ol class="mb-0">
<li><strong>Bootstrap</strong> — Update packages, install essentials (<code>apt_update.yml</code>)</li>
<li><strong>Agents</strong> — Deploy Alloy and Node Exporter on all hosts</li>
<li><strong>Services</strong> — Configure databases, Docker, applications, observability</li>
<li><strong>Verify</strong> — Check service health and connectivity</li>
</ol>
</div>
</div>
</div>
<div class="alert alert-info border-start border-4 border-info mt-4">
<h5><i class="bi bi-bucket me-2"></i>S3 Storage Provisioning</h5>
<p>Terraform provisions Incus S3 buckets for services requiring object storage:</p>
<div class="table-responsive">
<table class="table table-sm table-bordered mb-1">
<thead class="table-light">
<tr><th>Service</th><th>Host</th><th>Purpose</th></tr>
</thead>
<tbody>
<tr><td><strong>Casdoor</strong></td><td>Titania</td><td>User avatars and SSO resource storage</td></tr>
<tr><td><strong>LobeChat</strong></td><td>Rosalind</td><td>File uploads and attachments</td></tr>
</tbody>
</table>
</div>
<p class="mb-0 small"><i class="bi bi-shield-lock me-1"></i>S3 credentials are stored as sensitive Terraform outputs and in Ansible Vault with the <code>vault_*_s3_*</code> prefix.</p>
</div>
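<div class="alert alert-light border mt-2">
<p class="mb-1 small"><i class="bi bi-terminal me-1"></i>Once provisioned, buckets can be inspected with the Incus CLI — the pool name <code>default</code> and bucket name <code>casdoor</code> below are assumptions, not values taken from this deployment:</p>
<pre class="mb-0"><code># List buckets on a storage pool, then the access keys of one bucket
incus storage bucket list default
incus storage bucket key list default casdoor</code></pre>
</div>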
</section>
<!-- Automation -->
<section id="automation" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-play-circle text-primary me-2"></i>Ansible Automation</h2>
<div class="accordion" id="playbookAccordion">
<!-- site.yml -->
<div class="accordion-item">
<h2 class="accordion-header">
<button class="accordion-button" type="button" data-bs-toggle="collapse" data-bs-target="#colSiteYml">
<i class="bi bi-list-check me-2"></i>Full Deployment — <code>site.yml</code> (in order)
</button>
</h2>
<div id="colSiteYml" class="accordion-collapse collapse show" data-bs-parent="#playbookAccordion">
<div class="accordion-body">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr><th>Playbook</th><th>Host(s)</th><th>Purpose</th></tr>
</thead>
<tbody>
<tr><td><code>apt_update.yml</code></td><td>All</td><td>Update packages and install essentials</td></tr>
<tr><td><code>alloy/deploy.yml</code></td><td>All</td><td>Grafana Alloy log/metrics collection</td></tr>
<tr><td><code>prometheus/node_deploy.yml</code></td><td>All</td><td>Node Exporter metrics</td></tr>
<tr><td><code>docker/deploy.yml</code></td><td>Oberon, Ariel, Miranda, Puck, Rosalind, Sycorax, Caliban, Titania</td><td>Docker engine</td></tr>
<tr><td><code>smtp4dev/deploy.yml</code></td><td>Oberon</td><td>SMTP test server</td></tr>
<tr><td><code>pplg/deploy.yml</code></td><td>Prospero</td><td>Full observability stack + internal HAProxy + OAuth2-Proxy</td></tr>
<tr><td><code>postgresql/deploy.yml</code></td><td>Portia</td><td>PostgreSQL with all databases</td></tr>
<tr><td><code>postgresql_ssl/deploy.yml</code></td><td>Titania</td><td>Dedicated PostgreSQL for Casdoor</td></tr>
<tr><td><code>neo4j/deploy.yml</code></td><td>Ariel</td><td>Neo4j graph database</td></tr>
<tr><td><code>searxng/deploy.yml</code></td><td>Oberon</td><td>SearXNG privacy search</td></tr>
<tr><td><code>haproxy/deploy.yml</code></td><td>Titania</td><td>HAProxy TLS termination and routing</td></tr>
<tr><td><code>casdoor/deploy.yml</code></td><td>Titania</td><td>Casdoor SSO</td></tr>
<tr><td><code>mcpo/deploy.yml</code></td><td>Miranda</td><td>MCPO MCP proxy</td></tr>
<tr><td><code>openwebui/deploy.yml</code></td><td>Oberon</td><td>Open WebUI LLM interface</td></tr>
<tr><td><code>hass/deploy.yml</code></td><td>Oberon</td><td>Home Assistant</td></tr>
<tr><td><code>gitea/deploy.yml</code></td><td>Rosalind</td><td>Gitea self-hosted Git</td></tr>
<tr><td><code>nextcloud/deploy.yml</code></td><td>Rosalind</td><td>Nextcloud collaboration</td></tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<!-- Individual services -->
<div class="accordion-item">
<h2 class="accordion-header">
<button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#colIndividual">
<i class="bi bi-puzzle me-2"></i>Individual Service Deployments
</button>
</h2>
<div id="colIndividual" class="accordion-collapse collapse" data-bs-parent="#playbookAccordion">
<div class="accordion-body">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr><th>Playbook</th><th>Host</th><th>Service</th></tr>
</thead>
<tbody>
<tr><td><code>anythingllm/deploy.yml</code></td><td>Rosalind</td><td>AnythingLLM document AI</td></tr>
<tr><td><code>arke/deploy.yml</code></td><td>Sycorax</td><td>Arke LLM proxy</td></tr>
<tr><td><code>argos/deploy.yml</code></td><td>Miranda</td><td>Argos MCP web search server</td></tr>
<tr><td><code>caliban/deploy.yml</code></td><td>Caliban</td><td>Agent S MCP Server</td></tr>
<tr><td><code>certbot/deploy.yml</code></td><td>Titania</td><td>Let's Encrypt certificate renewal</td></tr>
<tr><td><code>gitea_mcp/deploy.yml</code></td><td>Miranda</td><td>Gitea MCP Server</td></tr>
<tr><td><code>gitea_runner/deploy.yml</code></td><td>Puck</td><td>Gitea CI/CD runner</td></tr>
<tr><td><code>grafana_mcp/deploy.yml</code></td><td>Miranda</td><td>Grafana MCP Server</td></tr>
<tr><td><code>jupyterlab/deploy.yml</code></td><td>Puck</td><td>JupyterLab + OAuth2-Proxy</td></tr>
<tr><td><code>kernos/deploy.yml</code></td><td>Caliban</td><td>Kernos MCP shell server</td></tr>
<tr><td><code>lobechat/deploy.yml</code></td><td>Rosalind</td><td>LobeChat AI chat</td></tr>
<tr><td><code>neo4j_mcp/deploy.yml</code></td><td>Miranda</td><td>Neo4j MCP Server</td></tr>
<tr><td><code>rabbitmq/deploy.yml</code></td><td>Oberon</td><td>RabbitMQ message queue</td></tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<!-- Lifecycle -->
<div class="accordion-item">
<h2 class="accordion-header">
<button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#colLifecycle">
<i class="bi bi-arrow-repeat me-2"></i>Lifecycle Playbooks
</button>
</h2>
<div id="colLifecycle" class="accordion-collapse collapse" data-bs-parent="#playbookAccordion">
<div class="accordion-body">
<div class="row g-3">
<div class="col-md-3">
<div class="card border-success text-center h-100">
<div class="card-body">
<i class="bi bi-play-fill text-success" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>sandbox_up.yml</code></h6>
<p class="small mb-0">Start all Uranian host containers</p>
</div>
</div>
</div>
<div class="col-md-3">
<div class="card border-primary text-center h-100">
<div class="card-body">
<i class="bi bi-list-check text-primary" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>site.yml</code></h6>
<p class="small mb-0">Full deployment orchestration</p>
</div>
</div>
</div>
<div class="col-md-3">
<div class="card border-warning text-center h-100">
<div class="card-body">
<i class="bi bi-arrow-up-circle text-warning" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>apt_update.yml</code></h6>
<p class="small mb-0">Update packages on all hosts</p>
</div>
</div>
</div>
<div class="col-md-3">
<div class="card border-danger text-center h-100">
<div class="card-body">
<i class="bi bi-stop-fill text-danger" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>sandbox_down.yml</code></h6>
<p class="small mb-0">Gracefully stop all containers</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- Data Flow -->
<section id="dataflow" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-diagram-2 text-primary me-2"></i>Data Flow Architecture</h2>
<div class="card mb-4">
<div class="card-header bg-dark text-white">
<h5 class="mb-0"><i class="bi bi-diagram-3 me-2"></i>Observability Pipeline</h5>
</div>
<div class="card-body">
<div class="mermaid">
flowchart LR
subgraph hosts["All Hosts"]
alloy["Alloy\n(syslog + journal)"]
node_exp["Node Exporter\n(metrics)"]
end
subgraph prospero["Prospero"]
loki["Loki\n(logs)"]
prom["Prometheus\n(metrics)"]
grafana["Grafana\n(dashboards)"]
alert["AlertManager"]
end
pushover["Pushover\n(notifications)"]
alloy -->|"HTTP push"| loki
node_exp -->|"scrape 15s"| prom
loki --> grafana
prom --> grafana
grafana --> alert
alert -->|"webhook"| pushover
</div>
</div>
</div>
<div class="card">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-link-45deg me-2"></i>Service Integration Points</h5>
</div>
<div class="card-body p-0">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr><th>Consumer</th><th>Provider</th><th>Connection</th></tr>
</thead>
<tbody>
<tr><td>All LLM apps</td><td>Arke (Sycorax)</td><td><code>http://sycorax.incus:25540</code></td></tr>
<tr><td>Open WebUI, Arke, Gitea, Nextcloud, LobeChat</td><td>PostgreSQL (Portia)</td><td><code>portia.incus:5432</code></td></tr>
<tr><td>Neo4j MCP</td><td>Neo4j (Ariel)</td><td><code>ariel.incus:7687</code> (Bolt)</td></tr>
<tr><td>MCP Switchboard</td><td>Docker API (Miranda)</td><td><code>tcp://miranda.incus:2375</code></td></tr>
<tr><td>MCP Switchboard, Kairos, Spelunker</td><td>RabbitMQ (Oberon)</td><td><code>oberon.incus:5672</code></td></tr>
<tr><td>All apps (SMTP)</td><td>smtp4dev (Oberon)</td><td><code>oberon.incus:22025</code></td></tr>
<tr><td>All hosts (logs)</td><td>Loki (Prospero)</td><td><code>http://prospero.incus:3100</code></td></tr>
<tr><td>All hosts (metrics)</td><td>Prometheus (Prospero)</td><td><code>http://prospero.incus:9090</code></td></tr>
</tbody>
</table>
</div>
</div>
</div>
</section>
<!-- Important Notes -->
<section id="notes" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-exclamation-triangle text-warning me-2"></i>Important Notes</h2>
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Alloy Host Variables Required</h5>
<p class="mb-0">Every host with <code>alloy</code> in its <code>services</code> list must define <code>alloy_log_level</code> in <code>inventory/host_vars/&lt;host&gt;.incus.yml</code>. The playbook will fail with an undefined variable error if this is missing.</p>
</div>
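<div class="alert alert-light border">
<p class="mb-1 small">For example, <code>inventory/host_vars/oberon.incus.yml</code> would include a line like this (the <code>info</code> value is illustrative):</p>
<pre class="mb-0"><code>alloy_log_level: "info"</code></pre>
</div>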
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Alloy Syslog Listeners Required for Docker Services</h5>
<p class="mb-0">Any Docker Compose service using the <code>syslog</code> logging driver must have a corresponding <code>loki.source.syslog</code> listener in the host's Alloy config template (<code>ansible/alloy/&lt;hostname&gt;/config.alloy.j2</code>). Missing listeners cause Docker containers to fail on start because the syslog driver cannot connect to its configured port.</p>
</div>
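<div class="alert alert-light border">
<p class="mb-1 small">A minimal sketch of such a listener in Alloy configuration syntax — the component label, port, and write target are illustrative, not copied from the real templates:</p>
<pre class="mb-0"><code>loki.source.syslog "docker_myservice" {
  listener {
    address  = "0.0.0.0:51514"
    protocol = "tcp"
  }
  forward_to = [loki.write.default.receiver]
}</code></pre>
</div>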
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Local Terraform State</h5>
<p class="mb-0">This project uses local Terraform state (no remote backend). Do not run <code>terraform apply</code> from multiple machines simultaneously.</p>
</div>
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Nested Docker</h5>
<p class="mb-0">Docker runs inside Incus containers (nested), which requires <code>security.nesting = true</code> and an <code>lxc.apparmor.profile=unconfined</code> AppArmor override on all Docker-enabled hosts.</p>
</div>
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Deployment Order</h5>
<p class="mb-0">Prospero (observability) must be fully deployed before other hosts, as Alloy on every host pushes logs and metrics to <code>prospero.incus</code>. Run <code>pplg/deploy.yml</code> before <code>site.yml</code> on a fresh environment.</p>
</div>
</section>
<!-- Footer -->
<footer class="bg-dark text-white py-4 rounded mt-2 mb-4">
<div class="container text-center">
<p class="mb-1"><i class="bi bi-heart-fill text-danger"></i> Built with love and approved by red pandas</p>
<small class="text-muted">Ouranos Lab — <a href="https://ouranos.helu.ca" class="text-muted">ouranos.helu.ca</a> — Infrastructure as Code for Development Excellence</small>
</div>
</footer>
<!-- Scroll to top button -->
<button id="scrollTopBtn" class="btn btn-primary" title="Scroll to top">
<i class="bi bi-arrow-up-circle"></i>
</button>
</div><!-- /container-fluid -->
<!-- Bootstrap JS -->
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.2/dist/js/bootstrap.bundle.min.js"></script>
<!-- Mermaid JS -->
<script type="module">
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
const isDark = () => document.documentElement.getAttribute('data-bs-theme') === 'dark';
mermaid.initialize({ startOnLoad: true, theme: isDark() ? 'dark' : 'default' });
document.getElementById('darkModeToggle').addEventListener('click', () => {
setTimeout(() => mermaid.initialize({ startOnLoad: false, theme: isDark() ? 'dark' : 'default' }), 50);
});
</script>
<script>
// Dark mode toggle
const toggleBtn = document.getElementById('darkModeToggle');
function applyTheme(dark) {
document.documentElement.setAttribute('data-bs-theme', dark ? 'dark' : 'light');
toggleBtn.innerHTML = dark ? '<i class="bi bi-sun-fill"></i>' : '<i class="bi bi-moon-fill"></i>';
toggleBtn.title = dark ? 'Switch to light mode' : 'Switch to dark mode';
}
toggleBtn.addEventListener('click', () => {
applyTheme(document.documentElement.getAttribute('data-bs-theme') !== 'dark');
});
// Scroll to top
window.addEventListener('scroll', () => {
document.getElementById('scrollTopBtn').style.display =
(document.body.scrollTop > 300 || document.documentElement.scrollTop > 300) ? 'block' : 'none';
});
document.getElementById('scrollTopBtn').addEventListener('click', () => {
window.scrollTo({ top: 0, behavior: 'smooth' });
});
</script>
</body>
</html>

docs/ouranos.md
# Ouranos Lab
Infrastructure-as-Code project managing the **Ouranos Lab** — a development sandbox at [ouranos.helu.ca](https://ouranos.helu.ca). Uses **Terraform** for container provisioning and **Ansible** for configuration management, themed around the moons of Uranus.
---
## Project Overview
| Component | Purpose |
|-----------|---------|
| **Terraform** | Provisions 10 specialised Incus containers (LXC) with DNS-resolved networking, security policies, and resource dependencies |
| **Ansible** | Deploys Docker, databases (PostgreSQL, Neo4j), observability stack (Prometheus, Grafana, Loki), and application runtimes across all hosts |
> **DNS Domain**: Incus resolves containers via the `.incus` domain suffix (e.g., `oberon.incus`, `portia.incus`). IPv4 addresses are dynamically assigned — always use DNS names, never hardcode IPs.
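As a quick illustration, endpoints can be derived from container names instead of addresses — a minimal Python sketch (the helper is illustrative, not part of the project; the example ports come from the host sections below):

```python
# Illustrative helper: build service URLs from Incus DNS names so that
# dynamically assigned IPv4 addresses never end up in configuration.
INCUS_DOMAIN = "incus"

def endpoint(host: str, port: int, scheme: str = "http") -> str:
    """Return a stable URL for a container service, keyed by DNS name."""
    return f"{scheme}://{host}.{INCUS_DOMAIN}:{port}"

print(endpoint("oberon", 22785))   # MCP Switchboard
print(endpoint("sycorax", 25540))  # Arke LLM proxy
```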
---
## Uranian Host Architecture
All containers are named after moons of Uranus and resolved via the `.incus` DNS suffix.
| Name | Role | Description | Nesting |
|------|------|-------------|---------|
| **ariel** | graph_database | Neo4j — Ethereal graph connections | ✔ |
| **caliban** | agent_automation | Agent S MCP Server with MATE Desktop | ✔ |
| **miranda** | mcp_docker_host | Dedicated Docker Host for MCP Servers | ✔ |
| **oberon** | container_orchestration | Docker Host — MCP Switchboard, RabbitMQ, Open WebUI | ✔ |
| **portia** | database | PostgreSQL — Relational database host | ❌ |
| **prospero** | observability | PPLG stack — Prometheus, Grafana, Loki, PgAdmin | ❌ |
| **puck** | application_runtime | Python App Host — JupyterLab, Django apps, Gitea Runner | ✔ |
| **rosalind** | collaboration | Gitea, LobeChat, Nextcloud, AnythingLLM | ✔ |
| **sycorax** | language_models | Arke LLM Proxy | ✔ |
| **titania** | proxy_sso | HAProxy TLS termination + Casdoor SSO | ✔ |
### oberon — Container Orchestration
King of the Fairies orchestrating containers and managing MCP infrastructure.
- Docker engine
- MCP Switchboard (port 22785) — Django app routing MCP tool calls
- RabbitMQ message queue
- Open WebUI LLM interface (port 22088, PostgreSQL backend on Portia)
- SearXNG privacy search (port 22083, behind OAuth2-Proxy)
- smtp4dev SMTP test server (port 22025)
### portia — Relational Database
Intelligent and resourceful — the reliability of relational databases.
- PostgreSQL 17 (port 5432)
- Databases: `arke`, `anythingllm`, `gitea`, `hass`, `lobechat`, `mcp_switchboard`, `nextcloud`, `openwebui`, `spelunker`
### ariel — Graph Database
Air spirit — ethereal, interconnected nature mirroring graph relationships.
- Neo4j 5.26.0 (Docker)
- HTTP API: port 25584
- Bolt: port 25554
### puck — Application Runtime
Shape-shifting trickster embodying Python's versatility.
- Docker engine
- JupyterLab (port 22071 via OAuth2-Proxy)
- Gitea Runner (CI/CD agent)
- Django applications: Angelia (22281), Athena (22481), Kairos (22581), Icarlos (22681), Spelunker (22881), Peitho (22981)
### prospero — Observability Stack
Master magician observing all events.
- PPLG stack via Docker Compose: Prometheus, Loki, Grafana, PgAdmin
- Internal HAProxy with OAuth2-Proxy for all dashboards
- AlertManager with Pushover notifications
- Prometheus metrics collection (`node-exporter`, HAProxy, Loki)
- Loki log aggregation via Alloy (all hosts)
- Grafana dashboard suite with Casdoor SSO integration
### miranda — MCP Docker Host
Curious bridge between worlds — hosting MCP server containers.
- Docker engine (API exposed on port 2375 for MCP Switchboard)
- MCPO OpenAI-compatible MCP proxy
- Grafana MCP Server (port 25533)
- Gitea MCP Server (port 25535)
- Neo4j MCP Server
- Argos MCP Server — web search via SearXNG (port 25534)
### sycorax — Language Models
Original magical power wielding language magic.
- Arke LLM API Proxy (port 25540)
- Multi-provider support (OpenAI, Anthropic, etc.)
- Session management with Memcached
- Database backend on Portia
### caliban — Agent Automation
Autonomous computer agent learning through environmental interaction.
- Docker engine
- Agent S MCP Server (MATE desktop, AT-SPI automation)
- Kernos MCP Shell Server (port 22021)
- GPU passthrough for vision tasks
- RDP access (port 25521)
### rosalind — Collaboration Services
Witty and resourceful moon for PHP, Go, and Node.js runtimes.
- Gitea self-hosted Git (port 22082, SSH on 22022)
- LobeChat AI chat interface (port 22081)
- Nextcloud file sharing and collaboration (port 22083)
- AnythingLLM document AI workspace (port 22084)
- Nextcloud data on dedicated Incus storage volume
### titania — Proxy & SSO Services
Queen of the Fairies managing access control and authentication.
- HAProxy 3.x with TLS termination (port 443)
- Let's Encrypt wildcard certificate via certbot DNS-01 (Namecheap)
- HTTP to HTTPS redirect (port 80)
- Gitea SSH proxy (port 22022)
- Casdoor SSO (port 22081, local PostgreSQL)
- Prometheus metrics at `:8404/metrics`
---
## External Access via HAProxy
Titania provides TLS termination and reverse proxy for all services.
- **Base domain**: `ouranos.helu.ca`
- **HTTPS**: port 443 (standard)
- **HTTP**: port 80 (redirects to HTTPS)
- **Certificate**: Let's Encrypt wildcard via certbot DNS-01
### Route Table
| Subdomain | Backend | Service |
|-----------|---------|---------|
| `ouranos.helu.ca` (root) | puck.incus:22281 | Angelia (Django) |
| `alertmanager.ouranos.helu.ca` | prospero.incus:443 (SSL) | AlertManager |
| `angelia.ouranos.helu.ca` | puck.incus:22281 | Angelia (Django) |
| `anythingllm.ouranos.helu.ca` | rosalind.incus:22084 | AnythingLLM |
| `arke.ouranos.helu.ca` | sycorax.incus:25540 | Arke LLM Proxy |
| `athena.ouranos.helu.ca` | puck.incus:22481 | Athena (Django) |
| `gitea.ouranos.helu.ca` | rosalind.incus:22082 | Gitea |
| `grafana.ouranos.helu.ca` | prospero.incus:443 (SSL) | Grafana |
| `hass.ouranos.helu.ca` | oberon.incus:8123 | Home Assistant |
| `id.ouranos.helu.ca` | titania.incus:22081 | Casdoor SSO |
| `icarlos.ouranos.helu.ca` | puck.incus:22681 | Icarlos (Django) |
| `jupyterlab.ouranos.helu.ca` | puck.incus:22071 | JupyterLab (OAuth2-Proxy) |
| `kairos.ouranos.helu.ca` | puck.incus:22581 | Kairos (Django) |
| `lobechat.ouranos.helu.ca` | rosalind.incus:22081 | LobeChat |
| `loki.ouranos.helu.ca` | prospero.incus:443 (SSL) | Loki |
| `mcp-switchboard.ouranos.helu.ca` | oberon.incus:22785 | MCP Switchboard |
| `nextcloud.ouranos.helu.ca` | rosalind.incus:22083 | Nextcloud |
| `openwebui.ouranos.helu.ca` | oberon.incus:22088 | Open WebUI |
| `peitho.ouranos.helu.ca` | puck.incus:22981 | Peitho (Django) |
| `pgadmin.ouranos.helu.ca` | prospero.incus:443 (SSL) | PgAdmin 4 |
| `prometheus.ouranos.helu.ca` | prospero.incus:443 (SSL) | Prometheus |
| `searxng.ouranos.helu.ca` | oberon.incus:22073 | SearXNG (OAuth2-Proxy) |
| `smtp4dev.ouranos.helu.ca` | oberon.incus:22085 | smtp4dev |
| `spelunker.ouranos.helu.ca` | puck.incus:22881 | Spelunker (Django) |
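To illustrate the routing model, one entry from the table above might look roughly like this in HAProxy terms. This is a minimal sketch, not the deployed configuration: the authoritative config is templated by `haproxy/deploy.yml`, and the certificate path, backend name, and bind port shown here are assumptions.

```
# Sketch only: cert path, backend name, and bind port are illustrative.
frontend https_in
    bind :443 ssl crt /etc/haproxy/certs/ouranos.helu.ca.pem
    acl host_gitea hdr(host) -i gitea.ouranos.helu.ca
    use_backend be_gitea if host_gitea

backend be_gitea
    server rosalind rosalind.incus:22082 check
```

Each subdomain in the route table maps to one such ACL/backend pair, always addressing the backend by its `.incus` DNS name.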
---
## Infrastructure Management
### Quick Start
```bash
# Provision containers
cd terraform
terraform init
terraform plan
terraform apply
# Start all containers
cd ../ansible
source ~/env/agathos/bin/activate
ansible-playbook sandbox_up.yml
# Deploy all services
ansible-playbook site.yml
# Stop all containers
ansible-playbook sandbox_down.yml
```
### Terraform Workflow
1. **Define** — Containers, networks, and resources in `*.tf` files
2. **Plan** — Review changes with `terraform plan`
3. **Apply** — Provision with `terraform apply`
4. **Verify** — Check outputs and container status
### Ansible Workflow
1. **Bootstrap** — Update packages, install essentials (`apt_update.yml`)
2. **Agents** — Deploy Alloy (log/metrics) and Node Exporter on all hosts
3. **Services** — Configure databases, Docker, applications, observability
4. **Verify** — Check service health and connectivity
### Vault Management
```bash
# Edit secrets
ansible-vault edit inventory/group_vars/all/vault.yml
# View secrets
ansible-vault view inventory/group_vars/all/vault.yml
# Encrypt a new file
ansible-vault encrypt new_secrets.yml
```
---
## S3 Storage Provisioning
Terraform provisions Incus S3 buckets for services requiring object storage:
| Service | Host | Purpose |
|---------|------|---------|
| **Casdoor** | Titania | User avatars and SSO resource storage |
| **LobeChat** | Rosalind | File uploads and attachments |
> **S3 credentials** (access key, secret key, endpoint) are stored as sensitive Terraform outputs and managed in Ansible Vault under variables following the `vault_*_s3_*` naming pattern.
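As a hedged sketch of what the Terraform side might look like (the resource and attribute names below are assumptions based on the Incus provider, not copied from this repo's `*.tf` files):

```hcl
# Illustrative only: verify resource/attribute names against the
# Incus provider version pinned in this repo.
resource "incus_storage_bucket" "lobechat" {
  name = "lobechat"
  pool = "default"
}

resource "incus_storage_bucket_key" "lobechat" {
  name           = "lobechat"
  pool           = "default"
  storage_bucket = incus_storage_bucket.lobechat.name
}

output "lobechat_s3_access_key" {
  value     = incus_storage_bucket_key.lobechat.access_key
  sensitive = true
}
```

The sensitive outputs are then copied by hand into the corresponding `vault_*_s3_*` variables in Ansible Vault.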
---
## Ansible Automation
### Full Deployment (`site.yml`)
Playbooks run in dependency order:
| Playbook | Hosts | Purpose |
|----------|-------|---------|
| `apt_update.yml` | All | Update packages and install essentials |
| `alloy/deploy.yml` | All | Grafana Alloy log/metrics collection |
| `prometheus/node_deploy.yml` | All | Node Exporter metrics |
| `docker/deploy.yml` | Oberon, Ariel, Miranda, Puck, Rosalind, Sycorax, Caliban, Titania | Docker engine |
| `smtp4dev/deploy.yml` | Oberon | SMTP test server |
| `pplg/deploy.yml` | Prospero | Full observability stack + HAProxy + OAuth2-Proxy |
| `postgresql/deploy.yml` | Portia | PostgreSQL with all databases |
| `postgresql_ssl/deploy.yml` | Titania | Dedicated PostgreSQL for Casdoor |
| `neo4j/deploy.yml` | Ariel | Neo4j graph database |
| `searxng/deploy.yml` | Oberon | SearXNG privacy search |
| `haproxy/deploy.yml` | Titania | HAProxy TLS termination and routing |
| `casdoor/deploy.yml` | Titania | Casdoor SSO |
| `mcpo/deploy.yml` | Miranda | MCPO MCP proxy |
| `openwebui/deploy.yml` | Oberon | Open WebUI LLM interface |
| `hass/deploy.yml` | Oberon | Home Assistant |
| `gitea/deploy.yml` | Rosalind | Gitea self-hosted Git |
| `nextcloud/deploy.yml` | Rosalind | Nextcloud collaboration |
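Structurally, `site.yml` can be thought of as a chain of imported playbooks in the order above. An illustrative excerpt, not the verbatim file:

```yaml
# Illustrative excerpt; the real site.yml is authoritative.
- import_playbook: apt_update.yml
- import_playbook: alloy/deploy.yml
- import_playbook: prometheus/node_deploy.yml
- import_playbook: docker/deploy.yml
- import_playbook: smtp4dev/deploy.yml
- import_playbook: pplg/deploy.yml
- import_playbook: postgresql/deploy.yml
# ...remaining service playbooks, in the order of the table above
```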
### Individual Service Deployments
Services with standalone deploy playbooks (not in `site.yml`):
| Playbook | Host | Service |
|----------|------|---------|
| `anythingllm/deploy.yml` | Rosalind | AnythingLLM document AI |
| `arke/deploy.yml` | Sycorax | Arke LLM proxy |
| `argos/deploy.yml` | Miranda | Argos MCP web search server |
| `caliban/deploy.yml` | Caliban | Agent S MCP Server |
| `certbot/deploy.yml` | Titania | Let's Encrypt certificate renewal |
| `gitea_mcp/deploy.yml` | Miranda | Gitea MCP Server |
| `gitea_runner/deploy.yml` | Puck | Gitea CI/CD runner |
| `grafana_mcp/deploy.yml` | Miranda | Grafana MCP Server |
| `jupyterlab/deploy.yml` | Puck | JupyterLab + OAuth2-Proxy |
| `kernos/deploy.yml` | Caliban | Kernos MCP shell server |
| `lobechat/deploy.yml` | Rosalind | LobeChat AI chat |
| `neo4j_mcp/deploy.yml` | Miranda | Neo4j MCP Server |
| `rabbitmq/deploy.yml` | Oberon | RabbitMQ message queue |
### Lifecycle Playbooks
| Playbook | Purpose |
|----------|---------|
| `sandbox_up.yml` | Start all Uranian host containers |
| `sandbox_down.yml` | Gracefully stop all containers |
| `apt_update.yml` | Update packages on all hosts |
| `site.yml` | Full deployment orchestration |
---
## Data Flow Architecture
### Observability Pipeline
```
All Hosts                 Prospero                        Alerts
Alloy + Node Exporter  →  Prometheus + Loki + Grafana  →  AlertManager + Pushover
collect metrics & logs    storage & visualisation         notifications
```
### Integration Points
| Consumer | Provider | Connection |
|----------|----------|-----------|
| All LLM apps | Arke (Sycorax) | `http://sycorax.incus:25540` |
| Open WebUI, Arke, Gitea, Nextcloud, LobeChat | PostgreSQL (Portia) | `portia.incus:5432` |
| Neo4j MCP | Neo4j (Ariel) | `ariel.incus:7687` (Bolt) |
| MCP Switchboard | Docker API (Miranda) | `tcp://miranda.incus:2375` |
| MCP Switchboard | RabbitMQ (Oberon) | `oberon.incus:5672` |
| Kairos, Spelunker | RabbitMQ (Oberon) | `oberon.incus:5672` |
| SMTP (all apps) | smtp4dev (Oberon) | `oberon.incus:22025` |
| All hosts | Loki (Prospero) | `http://prospero.incus:3100` |
| All hosts | Prometheus (Prospero) | `http://prospero.incus:9090` |
---
## Important Notes
⚠️ **Alloy Host Variables Required** — Every host with `alloy` in its `services` list must define `alloy_log_level` in `inventory/host_vars/<host>.incus.yml`. The playbook will fail with an undefined variable error if this is missing.
⚠️ **Alloy Syslog Listeners Required for Docker Services** — Any Docker Compose service using the syslog logging driver must have a corresponding `loki.source.syslog` listener in the host's Alloy config template (`ansible/alloy/<hostname>/config.alloy.j2`). Missing listeners cause Docker containers to fail on start.
⚠️ **Local Terraform State** — This project uses local Terraform state (no remote backend). Do not run `terraform apply` from multiple machines simultaneously.
⚠️ **Nested Docker** — Docker runs inside Incus containers (nested), requiring `security.nesting = true` and `lxc.apparmor.profile=unconfined` AppArmor override on all Docker-enabled hosts.
⚠️ **Deployment Order** — Prospero (observability) must be fully deployed before other hosts, as Alloy on every host pushes logs and metrics to `prospero.incus`. Run `pplg/deploy.yml` before `site.yml` on a fresh environment.
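For the Alloy requirements above, a host config template might pair a syslog listener with the Loki push endpoint like this. The listener port and component labels are illustrative, not taken from an actual template:

```
// Illustrative fragment for ansible/alloy/<hostname>/config.alloy.j2;
// the listener port and labels are assumptions.
loki.source.syslog "docker_services" {
  listener {
    address  = "0.0.0.0:51893"
    protocol = "tcp"
  }
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://prospero.incus:3100/loki/api/v1/push"
  }
}
```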

---

*Source file: `docs/pgadmin.md`*
# PgAdmin - PostgreSQL Web Administration
## Overview
PgAdmin 4 is a web-based administration and management tool for PostgreSQL. It is deployed on **Portia** alongside the shared PostgreSQL instance, providing a graphical interface for database management, query execution, and server monitoring across both PostgreSQL deployments (Portia and Titania).
**Host:** portia.incus
**Role:** database
**Container Port:** 80 (Apache / pgAdmin4 web app)
**External Access:** https://pgadmin.ouranos.helu.ca/ (via HAProxy on Titania, proxied through host port 25555)
## Architecture
```
┌──────────┐      ┌────────────┐      ┌──────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│              Portia              │
│          │      │ (Titania)  │      │                                  │
│          │      │    :443    │      │  :25555 ──▶ :80 (Apache)         │
└──────────┘      └────────────┘      │               │                  │
                                      │          ┌────▼─────┐            │
                                      │          │ PgAdmin4 │            │
                                      │          │  (web)   │            │
                                      │          └────┬─────┘            │
                                      │               │                  │
                                      │      ┌────────▼────────┐         │
                                      │      │  PostgreSQL 17  │         │
                                      │      │   (localhost)   │         │
                                      │      └─────────────────┘         │
                                      └───────────────┬──────────────────┘
                                                      │ SSL
                                           ┌─────────────────────┐
                                           │ PostgreSQL 17 (SSL) │
                                           │      (Titania)      │
                                           └─────────────────────┘
```
PgAdmin connects to:
- **Portia's PostgreSQL** — locally via `localhost:5432` (no SSL)
- **Titania's PostgreSQL** — over the Incus network via SSL, using the fetched certificate stored at `/var/lib/pgadmin/certs/titania-postgres-ca.crt`
## Terraform Resources
### Host Definition
PgAdmin runs on Portia, defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | database |
| Security Nesting | false |
| Proxy Devices | `25555 → 80` (Apache/PgAdmin web UI) |
The Incus proxy device maps host port 25555 to Apache on port 80 inside the container, where PgAdmin4 is served as a WSGI application.
## Ansible Deployment
### Playbook
```bash
cd ansible
ansible-playbook pgadmin/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `pgadmin/deploy.yml` | PgAdmin installation and SSL cert distribution |
### Deployment Steps
1. **Add PgAdmin repository** — Official pgAdmin4 APT repository with GPG key
2. **Install PgAdmin**`pgadmin4-web` package (includes Apache configuration)
3. **Create certs directory**`/var/lib/pgadmin/certs/` owned by `www-data`
4. **Fetch Titania SSL certificate** — Retrieves the self-signed PostgreSQL SSL cert from Titania
5. **Distribute certificate** — Copies to `/var/lib/pgadmin/certs/titania-postgres-ca.crt` for SSL connections
### ⚠️ Manual Post-Deployment Step Required
After running the playbook, you **must** SSH into Portia and run the PgAdmin web setup script manually:
```bash
# SSH into Portia
ssh portia.incus
# Run the setup script
sudo /usr/pgadmin4/bin/setup-web.sh
```
This interactive script:
- Prompts for the **admin email address** and **password** (use the values from `pgadmin_email` and `pgadmin_password` vault variables)
- Configures Apache virtual host for PgAdmin4
- Sets file permissions and ownership
- Restarts Apache to activate the configuration
This step cannot be automated via Ansible because the script requires interactive input and performs Apache configuration that depends on the local environment.
### Variables
#### Host Variables (`host_vars/portia.incus.yml`)
| Variable | Description |
|----------|-------------|
| `pgadmin_user` | System user (`pgadmin`) |
| `pgadmin_group` | System group (`pgadmin`) |
| `pgadmin_directory` | Data directory (`/srv/pgadmin`) |
| `pgadmin_port` | External port (`25555`) |
| `pgadmin_email` | Admin login email (`{{ vault_pgadmin_email }}`) |
| `pgadmin_password` | Admin login password (`{{ vault_pgadmin_password }}`) |
#### Vault Variables (`group_vars/all/vault.yml`)
| Variable | Description |
|----------|-------------|
| `vault_pgadmin_email` | PgAdmin admin email address |
| `vault_pgadmin_password` | PgAdmin admin password |
## Configuration
### SSL Certificate for Titania Connection
The playbook fetches the self-signed PostgreSQL SSL certificate from Titania and places it at `/var/lib/pgadmin/certs/titania-postgres-ca.crt`. When adding Titania's PostgreSQL as a server in PgAdmin:
1. Navigate to **Servers → Register → Server**
2. On the **Connection** tab:
- Host: `titania.incus`
- Port: `5432`
- Username: `postgres`
3. On the **SSL** tab:
- SSL mode: `verify-ca` or `require`
- Root certificate: `/var/lib/pgadmin/certs/titania-postgres-ca.crt`
### Registered Servers
After setup, register both PostgreSQL instances:
| Server Name | Host | Port | SSL |
|-------------|------|------|-----|
| Portia (local) | `localhost` | `5432` | Off |
| Titania (Casdoor) | `titania.incus` | `5432` | verify-ca |
## Operations
### Start/Stop
```bash
# PgAdmin runs under Apache
sudo systemctl start apache2
sudo systemctl stop apache2
sudo systemctl restart apache2
```
### Health Check
```bash
# Check Apache is serving PgAdmin
curl -s -o /dev/null -w "%{http_code}" http://localhost/pgadmin4/login
# Check from external host
curl -s -o /dev/null -w "%{http_code}" http://portia.incus/pgadmin4/login
```
### Logs
```bash
# Apache error log
tail -f /var/log/apache2/error.log
# PgAdmin application log
tail -f /var/log/pgadmin/pgadmin4.log
```
## Troubleshooting
### Common Issues
| Symptom | Cause | Resolution |
|---------|-------|------------|
| 502/503 on pgadmin.ouranos.helu.ca | Apache not running on Portia | `sudo systemctl restart apache2` on Portia |
| Login page loads but can't authenticate | Setup script not run | SSH to Portia and run `sudo /usr/pgadmin4/bin/setup-web.sh` |
| Can't connect to Titania PostgreSQL | Missing SSL certificate | Re-run `ansible-playbook pgadmin/deploy.yml` to fetch cert |
| SSL certificate error for Titania | Certificate expired or regenerated | Re-fetch cert by re-running the playbook |
| Port 25555 unreachable | Incus proxy device missing | Verify proxy device in `terraform/containers.tf` for Portia |
## References
- [PgAdmin 4 Documentation](https://www.pgadmin.org/docs/pgadmin4/latest/)
- [PostgreSQL Deployment](postgresql.md)
- [Terraform Practices](terraform.md)
- [Ansible Practices](ansible.md)

---

*Source file: `docs/postgresql.md`*
# PostgreSQL - Dual-Deployment Database Layer
## Overview
PostgreSQL 17 serves as the primary relational database engine for the Agathos sandbox. There are **two separate deployment playbooks**, each targeting a different host with a distinct purpose:
| Playbook | Host | Purpose |
|----------|------|---------|
| `postgresql/deploy.yml` | **Portia** | Shared multi-tenant database with **pgvector** for AI/vector workloads |
| `postgresql_ssl/deploy.yml` | **Titania** | Dedicated SSL-enabled database for the **Casdoor** identity provider |
**Portia** acts as the central database server for most applications, while **Titania** runs an isolated PostgreSQL instance exclusively for Casdoor, hardened with self-signed SSL certificates for secure external connections.
## Architecture
```
                        ┌────────────────────────────────────────────────────┐
                        │                Portia (postgresql)                 │
┌──────────┐            │  ┌──────────────────────────────────────────────┐  │
│   Arke   │───────────▶│  │  PostgreSQL 17 + pgvector v0.8.0             │  │
│(Sycorax) │            │  │                                              │  │
├──────────┤            │  │  Databases:                                  │  │
│  Gitea   │───────────▶│  │    arke ─── openwebui ─── spelunker          │  │
│(Rosalind)│            │  │    gitea ── lobechat ──── nextcloud          │  │
├──────────┤            │  │    anythingllm ────────── hass               │  │
│  Open    │───────────▶│  │                                              │  │
│  WebUI   │            │  │  pgvector enabled in:                        │  │
├──────────┤            │  │    arke, lobechat, openwebui,                │  │
│ LobeChat │───────────▶│  │    spelunker, anythingllm                    │  │
├──────────┤            │  └──────────────────────────────────────────────┘  │
│  HASS    │───────────▶│                                                    │
│ + others │            │  PgAdmin available on :25555                       │
└──────────┘            └────────────────────────────────────────────────────┘
                        ┌────────────────────────────────────────────────────┐
                        │              Titania (postgresql_ssl)              │
┌──────────┐            │  ┌──────────────────────────────────────────────┐  │
│ Casdoor  │──SSL──────▶│  │  PostgreSQL 17 + SSL (self-signed)           │  │
│(Titania) │  (local)   │  │                                              │  │
└──────────┘            │  │  Database: casdoor (single-purpose)          │  │
                        │  └──────────────────────────────────────────────┘  │
                        └────────────────────────────────────────────────────┘
```
## Terraform Resources
### Portia Shared Database Host
Defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | database |
| Security Nesting | false |
| Proxy Devices | `25555 → 80` (PgAdmin web UI) |
PostgreSQL port 5432 is **not** exposed externally—applications connect over the private Incus network (`10.10.0.0/16`).
### Titania Proxy & SSO Host
| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | proxy_sso |
| Security Nesting | true |
| Proxy Devices | `443 → 8443`, `80 → 8080` (HAProxy) |
Titania runs PostgreSQL alongside Casdoor on the same host. Casdoor connects via localhost, so SSL is not required for the local connection despite being available for external clients.
## Ansible Deployment
### Playbook 1: Shared PostgreSQL with pgvector (Portia)
```bash
cd ansible
ansible-playbook postgresql/deploy.yml
```
#### Files
| File | Purpose |
|------|---------|
| `postgresql/deploy.yml` | Multi-tenant PostgreSQL with pgvector |
#### Deployment Steps
1. **Install build dependencies**`curl`, `git`, `build-essential`, `vim`, `python3-psycopg2`
2. **Add PGDG repository** — Official PostgreSQL APT repository
3. **Install PostgreSQL 17** — Client, server, docs, `libpq-dev`, `server-dev`
4. **Clone & build pgvector v0.8.0** — Compiled from source against the installed PG version
5. **Start PostgreSQL** and restart after pgvector installation
6. **Set data directory permissions**`700` owned by `postgres:postgres`
7. **Configure networking**`listen_addresses = '*'`
8. **Configure authentication**`host all all 0.0.0.0/0 md5` in `pg_hba.conf`
9. **Set admin password**`postgres` superuser password from vault
10. **Create application users** — 9 database users (see table below)
11. **Create application databases** — 9 databases with matching owners
12. **Enable pgvector**`CREATE EXTENSION vector` in 5 databases
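The pgvector step (12) can be sketched as an Ansible task. The module usage and task shape here are assumptions about how the playbook is written, not its actual contents; the five databases match the list above:

```yaml
# Illustrative sketch of step 12; postgresql/deploy.yml is authoritative.
- name: Enable pgvector extension in databases
  community.postgresql.postgresql_ext:
    name: vector
    db: "{{ item }}"
  become: true
  become_user: postgres
  loop:
    - "{{ arke_db_name }}"
    - "{{ lobechat_db_name }}"
    - "{{ openwebui_db_name }}"
    - "{{ spelunker_db_name }}"
    - "{{ anythingllm_db_name }}"
```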
### Playbook 2: SSL-Enabled PostgreSQL (Titania)
```bash
cd ansible
ansible-playbook postgresql_ssl/deploy.yml
```
#### Files
| File | Purpose |
|------|---------|
| `postgresql_ssl/deploy.yml` | Single-purpose SSL PostgreSQL for Casdoor |
#### Deployment Steps
1. **Install dependencies**`curl`, `python3-psycopg2`, `python3-cryptography`
2. **Add PGDG repository** — Official PostgreSQL APT repository
3. **Install PostgreSQL 17** — Client and server only (no dev packages needed)
4. **Generate SSL certificates** — 4096-bit RSA key, self-signed, 10-year validity
5. **Configure networking**`listen_addresses = '*'`
6. **Enable SSL**`ssl = on` with cert/key file paths
7. **Configure tiered authentication** in `pg_hba.conf`:
- `local``peer` (Unix socket, no password)
- `host 127.0.0.1/32``md5` (localhost, no SSL)
- `host 10.10.0.0/16``md5` (Incus network, no SSL)
- `hostssl 0.0.0.0/0``md5` (external, SSL required)
8. **Set admin password**`postgres` superuser password from vault
9. **Create Casdoor user and database** — Single-purpose
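The tiered rules in step 7 correspond to `pg_hba.conf` entries roughly like the following (column layout illustrative):

```
# TYPE    DATABASE  USER  ADDRESS        METHOD
local     all       all                  peer
host      all       all   127.0.0.1/32   md5
host      all       all   10.10.0.0/16   md5
hostssl   all       all   0.0.0.0/0      md5
```

Because PostgreSQL matches HBA rules top-down, local and Incus-network clients authenticate with `md5` over plain TCP, while any other address only matches the final `hostssl` rule and must negotiate SSL.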
## User & Database Creation via Host Variables
Both playbooks derive all database names, usernames, and passwords from **host variables** defined in the Ansible inventory. Plaintext credentials never sit in the inventory directly: each `*_db_password` host variable references an encrypted `vault_*` secret in `inventory/group_vars/all/vault.yml`, and everything else is scoped to the host that runs PostgreSQL.
### Portia Host Variables (`inventory/host_vars/portia.incus.yml`)
The `postgresql/deploy.yml` playbook loops over variable pairs to create users and databases. Each application gets three variables defined in Portia's host_vars:
| Variable Pattern | Example | Description |
|-----------------|---------|-------------|
| `{app}_db_name` | `arke_db_name: arke` | Database name |
| `{app}_db_user` | `arke_db_user: arke` | Database owner/user |
| `{app}_db_password` | `arke_db_password: "{{ vault_arke_db_password }}"` | Password (from vault) |
#### Application Database Matrix (Portia)
| Application | DB Name Variable | DB User Variable | pgvector |
|-------------|-----------------|-----------------|----------|
| Arke | `arke_db_name` | `arke_db_user` | ✔ |
| Open WebUI | `openwebui_db_name` | `openwebui_db_user` | ✔ |
| Spelunker | `spelunker_db_name` | `spelunker_db_user` | ✔ |
| Gitea | `gitea_db_name` | `gitea_db_user` | |
| LobeChat | `lobechat_db_name` | `lobechat_db_user` | ✔ |
| Nextcloud | `nextcloud_db_name` | `nextcloud_db_user` | |
| AnythingLLM | `anythingllm_db_name` | `anythingllm_db_user` | ✔ |
| HASS | `hass_db_name` | `hass_db_user` | |
| Nike | `nike_db_name` | `nike_db_user` | |
#### Additional Portia Variables
| Variable | Description |
|----------|-------------|
| `postgres_user` | System user (`postgres`) |
| `postgres_group` | System group (`postgres`) |
| `postgresql_port` | Port (`5432`) |
| `postgresql_data_dir` | Data directory (`/var/lib/postgresql`) |
| `postgres_password` | Admin password (`{{ vault_postgres_password }}`) |
### Titania Host Variables (`inventory/host_vars/titania.incus.yml`)
The `postgresql_ssl/deploy.yml` playbook creates a single database for Casdoor:
| Variable | Value | Description |
|----------|-------|-------------|
| `postgresql_ssl_postgres_password` | `{{ vault_postgresql_ssl_postgres_password }}` | Admin password |
| `postgresql_ssl_port` | `5432` | PostgreSQL port |
| `postgresql_ssl_cert_path` | `/etc/postgresql/17/main/ssl/server.crt` | SSL certificate |
| `casdoor_db_name` | `casdoor` | Database name |
| `casdoor_db_user` | `casdoor` | Database user |
| `casdoor_db_password` | `{{ vault_casdoor_db_password }}` | Password (from vault) |
| `casdoor_db_sslmode` | `disable` | Local connection skips SSL |
### Adding a New Application Database
To add a new application database on Portia:
1. **Add variables** to `inventory/host_vars/portia.incus.yml`:
```yaml
myapp_db_name: myapp
myapp_db_user: myapp
myapp_db_password: "{{ vault_myapp_db_password }}"
```
2. **Add the vault secret** to `inventory/group_vars/all/vault.yml`:
```yaml
vault_myapp_db_password: "s3cure-passw0rd"
```
3. **Add the user** to the `Create application database users` loop in `postgresql/deploy.yml`:
```yaml
- { user: "{{ myapp_db_user }}", password: "{{ myapp_db_password }}" }
```
4. **Add the database** to the `Create application databases with owners` loop:
```yaml
- { name: "{{ myapp_db_name }}", owner: "{{ myapp_db_user }}" }
```
5. **(Optional)** If the application uses vector embeddings, add the database to the `Enable pgvector extension in databases` loop:
```yaml
- "{{ myapp_db_name }}"
```
## Operations
### Start/Stop
```bash
# On either host
sudo systemctl start postgresql
sudo systemctl stop postgresql
sudo systemctl restart postgresql
```
### Health Check
```bash
# From any Incus host → Portia
psql -h portia.incus -U postgres -c "SELECT 1;"
# From Titania localhost
sudo -u postgres psql -c "SELECT 1;"
# Check pgvector availability
sudo -u postgres psql -c "SELECT * FROM pg_available_extensions WHERE name = 'vector';"
```
### Logs
```bash
# Systemd journal
journalctl -u postgresql -f
# PostgreSQL log files
tail -f /var/log/postgresql/postgresql-17-main.log
# Loki (via Grafana Explore)
{job="postgresql"}
```
### Backup
```bash
# Dump a single database
sudo -u postgres pg_dump myapp > myapp_backup.sql
# Dump all databases
sudo -u postgres pg_dumpall > full_backup.sql
```
### Restore
```bash
# Restore a single database
sudo -u postgres psql myapp < myapp_backup.sql
# Restore all databases
sudo -u postgres psql < full_backup.sql
```
## Troubleshooting
### Common Issues
| Symptom | Cause | Resolution |
|---------|-------|------------|
| Connection refused from app host | `pg_hba.conf` missing entry | Verify client IP is covered by HBA rules |
| pgvector extension not found | Built against wrong PG version | Re-run the `Build pgvector with correct pg_config` task |
| SSL handshake failure (Titania) | Expired or missing certificate | Check `/etc/postgresql/17/main/ssl/server.crt` validity |
| `FATAL: password authentication failed` | Wrong password in host_vars | Verify vault variable matches and re-run playbook |
| PgAdmin unreachable on :25555 | Incus proxy device missing | Check `terraform/containers.tf` proxy for Portia |
## References
- [PostgreSQL 17 Documentation](https://www.postgresql.org/docs/17/)
- [pgvector GitHub](https://github.com/pgvector/pgvector)
- [Terraform Practices](terraform.md)
- [Ansible Practices](ansible.md)

---

*Source file: `docs/pplg.md`*
# PPLG - Consolidated Observability & Admin Stack
## Overview
PPLG is the consolidated observability and administration stack running on **Prospero**. It bundles PgAdmin, Prometheus, Loki, and Grafana behind an internal HAProxy for TLS termination, with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication.
**Host:** prospero.incus
**Role:** Observability
**Incus Ports:** 25510 → 443 (HTTPS), 25511 → 80 (HTTP redirect)
**External Access:** Via Titania HAProxy → `prospero.incus:443`
| Subdomain | Service | Auth Method |
|-----------|---------|-------------|
| `grafana.ouranos.helu.ca` | Grafana | Native Casdoor OAuth |
| `pgadmin.ouranos.helu.ca` | PgAdmin | Native Casdoor OAuth |
| `prometheus.ouranos.helu.ca` | Prometheus | OAuth2-Proxy sidecar |
| `loki.ouranos.helu.ca` | Loki | None (machine-to-machine) |
| `alertmanager.ouranos.helu.ca` | Alertmanager | None (internal) |
## Architecture
```
┌──────────┐        ┌────────────┐      ┌────────────────────────────────────────────────┐
│  Client  │───────▶│  HAProxy   │─────▶│                Prospero (PPLG)                 │
│          │        │ (Titania)  │      │                                                │
└──────────┘        │ :443 → :443│      │  ┌──────────────────────────────────────────┐  │
                    └────────────┘      │  │     HAProxy (systemd, :443/:80)          │  │
                                        │  │   TLS termination + subdomain routing    │  │
┌──────────┐                            │  └───┬──────┬──────┬──────┬──────┬──────────┘  │
│  Alloy   │──push─────────────────────▶│      │      │      │      │      │              │
│ (agents) │  loki.ouranos.helu.ca      │      │      │      │      │      │              │
│          │ prometheus.ouranos.helu.ca │      │      │      │      │      │              │
└──────────┘                            │      ▼      ▼      ▼      ▼      ▼              │
                                        │   Grafana PgAdmin OAuth2 Loki Alertmanager    │
                                        │    :3000   :5050  Proxy :3100    :9093         │
                                        │                   :9091                        │
                                        │                     │                          │
                                        │                     ▼                          │
                                        │                Prometheus                      │
                                        │                   :9090                        │
                                        └────────────────────────────────────────────────┘
```
### Traffic Flow
| Source | Destination | Path | Auth |
|--------|-------------|------|------|
| Browser → Grafana | Titania :443 → Prospero :443 → HAProxy → :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :443 → HAProxy → :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :443 → HAProxy → OAuth2-Proxy :9091 → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | `https://loki.ouranos.helu.ca` → HAProxy :443 → :3100 | Subdomain ACL | None |
| Alloy → Prometheus | `https://prometheus.ouranos.helu.ca/api/v1/write` → HAProxy :443 → :9090 | `skip_auth_route` | None |
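The Prometheus sidecar row above implies an OAuth2-Proxy configuration along these lines. This is a sketch with assumed values; the deployed `pplg/oauth2-proxy-prometheus.cfg.j2` template is authoritative:

```
# Illustrative values only.
http_address     = "127.0.0.1:9091"
upstreams        = [ "http://127.0.0.1:9090" ]
provider         = "oidc"
oidc_issuer_url  = "https://id.ouranos.helu.ca"
skip_auth_routes = [ "POST=^/api/v1/write$" ]
```

The `skip_auth_routes` entry is what lets Alloy's unauthenticated remote-write requests pass through while browser access to the Prometheus UI still goes through Casdoor.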
## Deployment
### Prerequisites
1. **Terraform**: Prospero container must have updated port mappings (`terraform apply`)
2. **Certbot**: Wildcard cert must exist on Titania (`ansible-playbook certbot/deploy.yml`)
3. **Vault Secrets**: All vault variables must be set (see [Required Vault Secrets](#required-vault-secrets))
4. **Casdoor Applications**: Register PgAdmin and Prometheus apps in Casdoor (see [Casdoor SSO](#casdoor-sso))
### Playbook
```bash
cd ansible
ansible-playbook pplg/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `pplg/deploy.yml` | Main consolidated deployment playbook |
| `pplg/pplg-haproxy.cfg.j2` | HAProxy TLS termination config (5 backends) |
| `pplg/prometheus.yml.j2` | Prometheus scrape configuration |
| `pplg/alert_rules.yml.j2` | Prometheus alerting rules |
| `pplg/alertmanager.yml.j2` | Alertmanager routing and Pushover notifications |
| `pplg/config.yml.j2` | Loki server configuration |
| `pplg/grafana.ini.j2` | Grafana main config with Casdoor OAuth |
| `pplg/datasource.yml.j2` | Grafana provisioned datasources |
| `pplg/users.yml.j2` | Grafana provisioned users |
| `pplg/config_local.py.j2` | PgAdmin config with Casdoor OAuth |
| `pplg/pgadmin.service.j2` | PgAdmin gunicorn systemd unit |
| `pplg/oauth2-proxy-prometheus.cfg.j2` | OAuth2-Proxy config for Prometheus UI |
| `pplg/oauth2-proxy-prometheus.service.j2` | OAuth2-Proxy systemd unit |
### Deployment Steps
1. **APT Repositories**: Add Grafana and PgAdmin repos
2. **Install Packages**: haproxy, prometheus, loki, grafana, pgadmin4-web, gunicorn
3. **Prometheus**: Config, alert rules, systemd override for remote write receiver
4. **Alertmanager**: Install, config with Pushover integration
5. **Loki**: Create user/dirs, template config
6. **Grafana**: Provisioning (datasources, users, dashboards), OAuth config
7. **PgAdmin**: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
8. **OAuth2-Proxy**: Download binary (v7.6.0), config for Prometheus sidecar
9. **SSL Certificate**: Fetch Let's Encrypt wildcard cert from Titania (self-signed fallback)
10. **HAProxy**: Template config, enable and start systemd service
### Deployment Order
PPLG must be deployed **before** services that push metrics/logs:
```
apt_update → alloy → node_exporter → pplg → postgresql → ...
```
This order is enforced in `site.yml`.
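Sketched out, the enforced ordering in `site.yml` presumably looks something like the following (playbook paths are assumptions based on the directory names used elsewhere in this repo):

```yaml
# site.yml (sketch; actual playbook paths may differ)
- import_playbook: apt_update/deploy.yml
- import_playbook: alloy/deploy.yml
- import_playbook: node_exporter/deploy.yml
- import_playbook: pplg/deploy.yml       # observability stack up before metric/log producers
- import_playbook: postgresql/deploy.yml
```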
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
⚠️ **All vault variables below must be set before running the playbook.** Missing variables will cause template failures like:
```
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```
### Prometheus Scrape Credentials
These are used in `prometheus.yml.j2` to scrape metrics from Casdoor and Gitea.
#### 1. Casdoor Prometheus Access Key
```yaml
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
```
#### 2. Casdoor Prometheus Access Secret
```yaml
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
```
**Requirements (both):**
- **Source**: API key pair from the `built-in/admin` Casdoor user
- **Used by**: `prometheus.yml.j2` Casdoor scrape job (`accessKey` / `accessSecret` query params)
- **How to obtain**: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
```bash
# 1. Login to get session cookie
curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
-H "Content-Type: application/json" \
-d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
# 2. Generate API keys for built-in/admin
curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
-H "Content-Type: application/json" \
-d '{"owner":"built-in","name":"admin"}'
# 3. Retrieve the generated keys
curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"
# 4. Cleanup
rm /tmp/casdoor-cookie.txt
```
⚠️ The `built-in/admin` user is used (not a `heluca` user) because Casdoor's `/api/metrics` endpoint requires an admin user and serves global platform metrics.
#### 3. Gitea Metrics Token
```yaml
vault_gitea_metrics_token: "YourGiteaMetricsToken"
```
**Requirements:**
- **Length**: 32+ characters
- **Source**: Must match the token configured in Gitea's `app.ini`
- **Generation**: `openssl rand -hex 32`
- **Used by**: `prometheus.yml.j2` Gitea scrape job (Bearer token auth)
### Grafana Credentials
#### 4. Grafana Admin User
```yaml
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
```
#### 5. Grafana Viewer User
```yaml
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
```
#### 6. Grafana OAuth (Casdoor SSO)
```yaml
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-grafana`
- **Redirect URI**: `https://grafana.ouranos.helu.ca/login/generic_oauth`
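For orientation, `grafana.ini.j2` presumably wires these credentials into Grafana's native `[auth.generic_oauth]` block using the Casdoor endpoints from the URL Strategy table. This is a sketch, not the actual template; the `scopes` line in particular is an assumption:

```ini
[auth.generic_oauth]
enabled = true
name = Casdoor
client_id = {{ vault_grafana_oauth_client_id }}
client_secret = {{ vault_grafana_oauth_client_secret }}
scopes = openid profile email
auth_url = https://id.ouranos.helu.ca/login/oauth/authorize
token_url = https://id.ouranos.helu.ca/api/login/oauth/access_token
api_url = https://id.ouranos.helu.ca/api/userinfo
```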
### PgAdmin
#### 7. PgAdmin Setup
Initialize PgAdmin's internal user database manually on Prospero (one-time step):
```bash
/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db
```
**Requirements:**
- **Purpose**: Initial local admin account (fallback when OAuth is unavailable)
#### 8. PgAdmin OAuth (Casdoor SSO)
```yaml
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-pgadmin`
- **Redirect URI**: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
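PgAdmin's native OAuth support is configured through an `OAUTH2_CONFIG` list in `config_local.py`. A minimal self-contained sketch follows; the key names are PgAdmin's documented OAuth2 settings, but the values here are placeholders, and the real template substitutes vault variables:

```python
# Sketch of the OAuth section of config_local.py (placeholder values)
AUTHENTICATION_SOURCES = ["oauth2", "internal"]  # keep "internal" as local-admin fallback
OAUTH2_AUTO_CREATE_USER = True

OAUTH2_CONFIG = [{
    "OAUTH2_NAME": "casdoor",
    "OAUTH2_DISPLAY_NAME": "Casdoor",
    "OAUTH2_CLIENT_ID": "pgadmin-oauth-client",
    "OAUTH2_CLIENT_SECRET": "<vault_pgadmin_oauth_client_secret>",
    "OAUTH2_AUTHORIZATION_URL": "https://id.ouranos.helu.ca/login/oauth/authorize",
    "OAUTH2_TOKEN_URL": "https://id.ouranos.helu.ca/api/login/oauth/access_token",
    "OAUTH2_API_BASE_URL": "https://id.ouranos.helu.ca/",
    "OAUTH2_USERINFO_ENDPOINT": "api/userinfo",
    "OAUTH2_SCOPE": "openid profile email",
}]
```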
### Prometheus OAuth2-Proxy
#### 9. Prometheus OAuth2-Proxy (Casdoor SSO)
```yaml
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
```
**Requirements:**
- Client ID/Secret must match the Casdoor application `app-prometheus`
- **Redirect URI**: `https://prometheus.ouranos.helu.ca/oauth2/callback`
- **Cookie secret generation**:
```bash
python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
```
### Alertmanager (Pushover)
#### 10. Pushover Notification Credentials
```yaml
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
```
**Requirements:**
- **Source**: [pushover.net](https://pushover.net/) account
- **User Key**: Found on Pushover dashboard
- **API Token**: Create an application in Pushover
### Quick Reference
| Vault Variable | Used By | Source |
|---------------|---------|--------|
| `vault_casdoor_prometheus_access_key` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_casdoor_prometheus_access_secret` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_gitea_metrics_token` | prometheus.yml.j2 | Gitea app.ini |
| `vault_grafana_admin_name` | users.yml.j2 | Choose any |
| `vault_grafana_admin_login` | users.yml.j2 | Choose any |
| `vault_grafana_admin_password` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_name` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_login` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_password` | users.yml.j2 | Choose any |
| `vault_grafana_oauth_client_id` | grafana.ini.j2 | Casdoor app |
| `vault_grafana_oauth_client_secret` | grafana.ini.j2 | Casdoor app |
| `vault_pgadmin_email` | config_local.py.j2 | Choose any |
| `vault_pgadmin_password` | config_local.py.j2 | Choose any |
| `vault_pgadmin_oauth_client_id` | config_local.py.j2 | Casdoor app |
| `vault_pgadmin_oauth_client_secret` | config_local.py.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_id` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_secret` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_cookie_secret` | oauth2-proxy-prometheus.cfg.j2 | Generate |
| `vault_pushover_user_key` | alertmanager.yml.j2 | Pushover account |
| `vault_pushover_api_token` | alertmanager.yml.j2 | Pushover account |
## Casdoor SSO
Three Casdoor applications are required. The Grafana application should already exist; the PgAdmin and Prometheus applications need to be created.
### Applications to Register
Register in Casdoor Admin UI (`https://id.ouranos.helu.ca`) or add to `ansible/casdoor/init_data.json.j2`:
| Application | Client ID | Redirect URI | Grant Types |
|-------------|-----------|-------------|-------------|
| `app-grafana` | `vault_grafana_oauth_client_id` | `https://grafana.ouranos.helu.ca/login/generic_oauth` | `authorization_code`, `refresh_token` |
| `app-pgadmin` | `vault_pgadmin_oauth_client_id` | `https://pgadmin.ouranos.helu.ca/oauth2/redirect` | `authorization_code`, `refresh_token` |
| `app-prometheus` | `vault_prometheus_oauth2_client_id` | `https://prometheus.ouranos.helu.ca/oauth2/callback` | `authorization_code`, `refresh_token` |
### URL Strategy
| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/login/oauth/authorize` | User's browser (external) |
| **Token URL** | `https://id.ouranos.helu.ca/api/login/oauth/access_token` | Server-to-server |
| **Userinfo URL** | `https://id.ouranos.helu.ca/api/userinfo` | Server-to-server |
| **OIDC Discovery** | `https://id.ouranos.helu.ca/.well-known/openid-configuration` | OAuth2-Proxy |
### Auth Methods per Service
| Service | Auth Method | Details |
|---------|-------------|---------|
| **Grafana** | Native `[auth.generic_oauth]` | Built-in OAuth support in `grafana.ini` |
| **PgAdmin** | Native `OAUTH2_CONFIG` | Built-in OAuth support in `config_local.py` |
| **Prometheus** | OAuth2-Proxy sidecar | Binary on `:9091` proxying to `:9090` |
| **Loki** | None | Machine-to-machine (Alloy agents push logs) |
| **Alertmanager** | None | Internal only |
## HAProxy Configuration
### Backends
| Backend | Upstream | Health Check | Auth |
|---------|----------|-------------|------|
| `backend_grafana` | `127.0.0.1:3000` | `GET /api/health` | Grafana OAuth |
| `backend_pgadmin` | `127.0.0.1:5050` | `GET /misc/ping` | PgAdmin OAuth |
| `backend_prometheus` | `127.0.0.1:9091` (OAuth2-Proxy) | `GET /ping` | OAuth2-Proxy |
| `backend_prometheus_direct` | `127.0.0.1:9090` | — | None (write API) |
| `backend_loki` | `127.0.0.1:3100` | `GET /ready` | None |
| `backend_alertmanager` | `127.0.0.1:9093` | `GET /-/healthy` | None |
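A representative backend definition, as it might appear in `pplg-haproxy.cfg.j2` (a sketch assembled from the table above; the exact template contents may differ):

```
backend backend_grafana
    option httpchk GET /api/health
    server grafana 127.0.0.1:3000 check

backend backend_prometheus
    option httpchk GET /ping
    server oauth2_proxy 127.0.0.1:9091 check
```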
### skip_auth_route Pattern
The Prometheus write API (`/api/v1/write`) is accessed by Alloy agents for machine-to-machine metric pushes. HAProxy uses an ACL to bypass OAuth2-Proxy:
```
acl is_prometheus_write path_beg /api/v1/write
use_backend backend_prometheus_direct if host_prometheus is_prometheus_write
```
This routes `https://prometheus.ouranos.helu.ca/api/v1/write` directly to Prometheus on `:9090`, while all other Prometheus traffic goes through OAuth2-Proxy on `:9091`.
### SSL Certificate
- **Primary**: Let's Encrypt wildcard cert (`*.ouranos.helu.ca`) fetched from Titania
- **Fallback**: Self-signed cert generated on Prospero (if Titania unavailable)
- **Path**: `/etc/haproxy/certs/ouranos.pem`
## Host Variables
**File:** `ansible/inventory/host_vars/prospero.incus.yml`
Services list:
```yaml
services:
- alloy
- pplg
```
Key variable groups defined in `prospero.incus.yml`:
- PPLG HAProxy (user, group, uid/gid 800, syslog port)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)
## Terraform
### Prospero Port Mapping
```hcl
devices = [
{
name = "https_internal"
type = "proxy"
properties = {
listen = "tcp:0.0.0.0:25510"
connect = "tcp:127.0.0.1:443"
}
},
{
name = "http_redirect"
type = "proxy"
properties = {
listen = "tcp:0.0.0.0:25511"
connect = "tcp:127.0.0.1:80"
}
}
]
```
Run `terraform apply` before deploying if port mappings changed.
### Titania Backend Routing
Titania's HAProxy routes external subdomains to Prospero's HTTPS port:
```yaml
# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
backend_host: "prospero.incus"
backend_port: 443
health_path: "/api/health"
ssl_backend: true
- subdomain: "pgadmin"
backend_host: "prospero.incus"
backend_port: 443
health_path: "/misc/ping"
ssl_backend: true
- subdomain: "prometheus"
backend_host: "prospero.incus"
backend_port: 443
health_path: "/ping"
ssl_backend: true
```
## Monitoring
### Alloy Configuration
**File:** `ansible/alloy/prospero/config.alloy.j2`
- **HAProxy Syslog**: `loki.source.syslog` on `127.0.0.1:51405` (TCP) receives Docker syslog from HAProxy container
- **Journal Labels**: Dedicated job labels for `grafana-server`, `prometheus`, `loki`, `alertmanager`, `pgadmin`, `oauth2-proxy-prometheus`
- **System Logs**: `/var/log/syslog`, `/var/log/auth.log` → Loki
- **Metrics**: Node exporter + process exporter → Prometheus remote write
### Prometheus Scrape Targets
| Job | Target | Auth |
|-----|--------|------|
| `prometheus` | `localhost:9090` | None |
| `node-exporter` | All Uranian hosts `:9100` | None |
| `alertmanager` | `prospero.incus:9093` | None |
| `haproxy` | `titania.incus:8404` | None |
| `gitea` | `oberon.incus:22084` | Bearer token |
| `casdoor` | `titania.incus:22081` | Access key/secret params |
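The two authenticated jobs translate into scrape configs roughly like the following. This is a sketch consistent with the table and the vault-secret descriptions above; the real `prometheus.yml.j2` may differ in detail:

```yaml
scrape_configs:
  - job_name: gitea
    static_configs:
      - targets: ["oberon.incus:22084"]
    authorization:            # Bearer token, must match Gitea's app.ini
      type: Bearer
      credentials: "{{ vault_gitea_metrics_token }}"
  - job_name: casdoor
    metrics_path: /api/metrics
    params:                   # Casdoor expects the key pair as query parameters
      accessKey: ["{{ vault_casdoor_prometheus_access_key }}"]
      accessSecret: ["{{ vault_casdoor_prometheus_access_secret }}"]
    static_configs:
      - targets: ["titania.incus:22081"]
```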
### Alert Rules
Groups defined in `alert_rules.yml.j2`:
| Group | Alerts | Scope |
|-------|--------|-------|
| `node_alerts` | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| `puck_process_alerts` | HighCPU/Memory per process, CrashLoop | puck.incus |
| `puck_container_alerts` | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| `service_alerts` | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| `loki_alerts` | HighLogVolume | Loki |
### Alertmanager Routing
Alerts are routed to Pushover with severity-based priority:
| Severity | Pushover Priority | Emoji |
|----------|-------------------|-------|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | — |
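In `alertmanager.yml.j2`, that severity-to-priority mapping presumably takes a shape like this (sketch only; receiver names are illustrative):

```yaml
route:
  receiver: pushover-info
  routes:
    - matchers: ['severity="critical"']
      receiver: pushover-critical
    - matchers: ['severity="warning"']
      receiver: pushover-warning
receivers:
  - name: pushover-critical
    pushover_configs:
      - user_key: "{{ vault_pushover_user_key }}"
        token: "{{ vault_pushover_api_token }}"
        priority: "2"   # Pushover Emergency: repeats until acknowledged
        retry: 1m
        expire: 1h
```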
## Grafana MCP Server
Grafana has an associated **MCP (Model Context Protocol) server** that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on **Miranda** and connects back to Grafana on Prospero via the internal network (`prospero.incus:3000`) using a service account token.
| Property | Value |
|----------|-------|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | `http://miranda.incus:25530/grafana` |
| Auth | Grafana service account token (`vault_grafana_service_account_token`) |
The Grafana MCP server is deployed separately from PPLG but depends on Grafana being running first. Deploy order: `pplg → grafana_mcp → mcpo`.
For full details — deployment, configuration, available tools, troubleshooting — see **[Grafana MCP Server](grafana_mcp.md)**.
## Access After Deployment
| Service | URL | Login |
|---------|-----|-------|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |
## Troubleshooting
### Service Status
```bash
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
```
### HAProxy Service
```bash
ssh prospero.incus
sudo systemctl status haproxy
sudo journalctl -u haproxy -f
```
### View Logs
```bash
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f
# HAProxy logs (shipped via syslog to Alloy → Loki)
# Query in Grafana: {job="pplg-haproxy"}
```
### Test Endpoints (from Prospero)
```bash
# Grafana
curl -s http://127.0.0.1:3000/api/health
# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping
# Prometheus
curl -s http://127.0.0.1:9090/-/healthy
# Loki
curl -s http://127.0.0.1:3100/ready
# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy
# HAProxy stats
curl -s http://127.0.0.1:8404/metrics | head
```
### Test TLS (from any host)
```bash
# Direct to Prospero container
curl -sk https://prospero.incus/api/health
# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
```
### Common Errors
#### `vault_casdoor_prometheus_access_key` is undefined
```
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```
**Cause**: The Casdoor metrics scrape job in `prometheus.yml.j2` requires access credentials.
**Fix**: Generate API keys for the `built-in/admin` Casdoor user (see [Casdoor Prometheus Access Key](#1-casdoor-prometheus-access-key) for the full procedure), then add to vault:
```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```
```yaml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
```
#### Certificate fetch fails
**Cause**: Titania not running or certbot hasn't provisioned the cert yet.
**Fix**: Ensure Titania is up and certbot has run:
```bash
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
```
The playbook falls back to a self-signed certificate if Titania is unavailable.
#### OAuth2 redirect loops
**Cause**: Casdoor application redirect URI doesn't match the service URL.
**Fix**: Verify redirect URIs match exactly:
- Grafana: `https://grafana.ouranos.helu.ca/login/generic_oauth`
- PgAdmin: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
- Prometheus: `https://prometheus.ouranos.helu.ca/oauth2/callback`
## Migration Notes
PPLG replaces the following standalone playbooks (kept as reference):
| Original Playbook | Replaced By |
|-------------------|-------------|
| `prometheus/deploy.yml` | `pplg/deploy.yml` |
| `prometheus/alertmanager_deploy.yml` | `pplg/deploy.yml` |
| `loki/deploy.yml` | `pplg/deploy.yml` |
| `grafana/deploy.yml` | `pplg/deploy.yml` |
| `pgadmin/deploy.yml` | `pplg/deploy.yml` |
PgAdmin was previously hosted on **Portia** (port 25555). It now runs on **Prospero** via gunicorn (no Apache).
---
**File:** `docs/rabbitmq.md`
# RabbitMQ - Message Broker Infrastructure
## Overview
RabbitMQ 3 (management-alpine) serves as the central message broker for the Agathos sandbox, providing AMQP-compliant message queuing for asynchronous communication between services. The deployment includes the management web interface for monitoring and administration.
**Host:** Oberon (container_orchestration)
**Role:** Message broker for event-driven architectures
**AMQP Port:** 5672
**Management Port:** 25582
**Syslog Port:** 51402 (Alloy)
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Oberon Host │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ RabbitMQ Container (Docker) │ │
│ │ │ │
│ │ ┌──────────────┬──────────────┐ │ │
│ │ │ VHost │ VHost │ │ │
│ │ │ "kairos" │ "spelunker" │ │ │
│ │ │ │ │ │ │
│ │ │ User: │ User: │ │ │
│ │ │ kairos │ spelunker │ │ │
│ │ │ (full perm) │ (full perm) │ │ │
│ │ └──────────────┴──────────────┘ │ │
│ │ │ │
│ │ Default Admin: rabbitmq │ │
│ │ (all vhosts, admin privileges) │ │
│ │ │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Ports: 5672 (AMQP), 25582 (Management) │
│ Logs: syslog → Alloy:51402 → Loki │
└─────────────────────────────────────────────────────────┘
┌──────────────┐ ┌──────────────┐
│ Kairos │───AMQP────▶│ kairos/ │
│ (future) │ │ (vhost) │
└──────────────┘ └──────────────┘
┌──────────────┐ ┌──────────────┐
│ Spelunker │───AMQP────▶│ spelunker/ │
│ (future) │ │ (vhost) │
└──────────────┘ └──────────────┘
```
**Note**: Kairos and Spelunker are future services. The RabbitMQ infrastructure is pre-provisioned with dedicated virtual hosts and users ready for when these services are deployed.
## Terraform Resources
### Oberon Host Definition
RabbitMQ runs on Oberon, defined in `terraform/containers.tf`:
| Attribute | Value |
|-----------|-------|
| Description | Docker Host + MCP Switchboard - King of Fairies orchestrating containers |
| Image | noble |
| Role | container_orchestration |
| Security Nesting | `true` (required for Docker) |
| AppArmor Profile | unconfined |
| Proxy Devices | `25580-25599 → 25580-25599` (application port range) |
### Container Dependencies
| Resource | Relationship |
|----------|--------------|
| Docker | RabbitMQ runs as a Docker container on Oberon |
| Alloy | Collects syslog logs from RabbitMQ on port 51402 |
| Prospero | Receives logs via Loki for observability |
## Ansible Deployment
### Playbook
```bash
cd ansible
ansible-playbook rabbitmq/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `rabbitmq/deploy.yml` | Main deployment playbook |
| `rabbitmq/docker-compose.yml.j2` | Docker Compose template |
### Deployment Steps
The playbook performs the following operations:
1. **User and Group Management**
- Creates `rabbitmq` system user and group
- Adds `ponos` user to `rabbitmq` group for operational access
2. **Directory Setup**
- Creates service directory at `/srv/rabbitmq`
- Sets ownership to `rabbitmq:rabbitmq`
- Configures permissions (mode 750)
3. **Docker Compose Deployment**
- Templates `docker-compose.yml` from Jinja2 template
- Deploys RabbitMQ container with `docker compose up`
4. **rabbitmqadmin CLI Setup**
- Extracts `rabbitmqadmin` from container to `/usr/local/bin/`
- Makes it executable for host-level management
5. **Automatic Provisioning** (idempotent)
- Creates virtual hosts: `kairos`, `spelunker`
- Creates users with passwords from vault
- Sets user tags (currently none, expandable for admin/monitoring roles)
- Configures full permissions for each user on their respective vhost
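The idempotent provisioning step can be pictured as a loop over the inventory variables. The sketch below uses the `community.rabbitmq` Ansible modules for illustration; the actual playbook may instead drive the extracted `rabbitmqadmin` CLI:

```yaml
- name: Ensure vhosts exist (no-op if already present)
  community.rabbitmq.rabbitmq_vhost:
    name: "{{ item.name }}"
    state: present
  loop: "{{ rabbitmq_vhosts }}"

- name: Ensure users exist with full permissions on their vhost
  community.rabbitmq.rabbitmq_user:
    user: "{{ item.name }}"
    password: "{{ item.password }}"
    vhost: "{{ item.name }}"
    configure_priv: ".*"
    read_priv: ".*"
    write_priv: ".*"
    state: present
  loop: "{{ rabbitmq_users }}"
```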
### Variables
#### Host Variables (`host_vars/oberon.incus.yml`)
| Variable | Description | Default |
|----------|-------------|---------|
| `rabbitmq_user` | Service user | `rabbitmq` |
| `rabbitmq_group` | Service group | `rabbitmq` |
| `rabbitmq_directory` | Installation directory | `/srv/rabbitmq` |
| `rabbitmq_amqp_port` | AMQP protocol port | `5672` |
| `rabbitmq_management_port` | Management web interface | `25582` |
| `rabbitmq_password` | Default admin password | `{{ vault_rabbitmq_password }}` |
#### Group Variables (`group_vars/all/vars.yml`)
Defines the provisioning configuration for vhosts, users, and permissions:
```yaml
rabbitmq_vhosts:
- name: kairos
- name: spelunker
rabbitmq_users:
- name: kairos
password: "{{ kairos_rabbitmq_password }}"
tags: []
- name: spelunker
password: "{{ spelunker_rabbitmq_password }}"
tags: []
rabbitmq_permissions:
- vhost: kairos
user: kairos
configure_priv: .*
read_priv: .*
write_priv: .*
- vhost: spelunker
user: spelunker
configure_priv: .*
read_priv: .*
write_priv: .*
```
**Vault Variable Mappings**:
```yaml
kairos_rabbitmq_password: "{{ vault_kairos_rabbitmq_password }}"
spelunker_rabbitmq_password: "{{ vault_spelunker_rabbitmq_password }}"
```
#### Vault Variables (`group_vars/all/vault.yml`)
All sensitive credentials are encrypted in the vault:
| Variable | Description |
|----------|-------------|
| `vault_rabbitmq_password` | Default admin account password |
| `vault_kairos_rabbitmq_password` | Kairos service user password |
| `vault_spelunker_rabbitmq_password` | Spelunker service user password |
## Configuration
### Docker Compose Template
The deployment uses a minimal Docker Compose configuration:
```yaml
services:
rabbitmq:
image: rabbitmq:3-management-alpine
container_name: rabbitmq
restart: unless-stopped
ports:
- "{{rabbitmq_amqp_port}}:5672" # AMQP protocol
- "{{rabbitmq_management_port}}:15672" # Management UI
volumes:
- rabbitmq_data:/var/lib/rabbitmq # Persistent data
environment:
RABBITMQ_DEFAULT_USER: "{{rabbitmq_user}}"
RABBITMQ_DEFAULT_PASS: "{{rabbitmq_password}}"
logging:
driver: syslog
options:
syslog-address: "tcp://127.0.0.1:{{rabbitmq_syslog_port}}"
syslog-format: "{{syslog_format}}"
tag: "rabbitmq"
```
### Data Persistence
- **Volume**: `rabbitmq_data` (Docker-managed volume)
- **Location**: `/var/lib/rabbitmq` inside container
- **Contents**:
- Message queues and persistent messages
- Virtual host metadata
- User credentials and permissions
- Configuration overrides
## Virtual Hosts and Users
### Default Admin Account
**Username**: `rabbitmq`
**Password**: `{{ vault_rabbitmq_password }}` (from vault)
**Privileges**: Full administrative access to all virtual hosts
The default admin account is created automatically when the container starts and can access:
- All virtual hosts (including `/`, `kairos`, `spelunker`)
- Management web interface
- All RabbitMQ management commands
### Kairos Virtual Host
**VHost**: `kairos`
**User**: `kairos`
**Password**: `{{ vault_kairos_rabbitmq_password }}`
**Permissions**: Full (configure, read, write) on all resources matching `.*`
Intended for the **Kairos** service (event-driven time-series processing system, planned future deployment).
### Spelunker Virtual Host
**VHost**: `spelunker`
**User**: `spelunker`
**Password**: `{{ vault_spelunker_rabbitmq_password }}`
**Permissions**: Full (configure, read, write) on all resources matching `.*`
Intended for the **Spelunker** service (log exploration and analytics platform, planned future deployment).
### Permission Model
Both service users have full access within their respective virtual hosts:
| Permission | Pattern | Description |
|------------|---------|-------------|
| Configure | `.*` | Create/delete queues, exchanges, bindings |
| Write | `.*` | Publish messages to exchanges |
| Read | `.*` | Consume messages from queues |
This isolation ensures:
- ✔ Each service operates in its own namespace
- ✔ Messages cannot cross between services
- ✔ Resource limits can be applied per-vhost
- ✔ Service credentials can be rotated independently
## Access and Administration
### Management Web Interface
**URL**: `http://oberon.incus:25582`
**External**: `http://{oberon-ip}:25582`
**Login**: `rabbitmq` / `{{ vault_rabbitmq_password }}`
Features:
- Queue inspection and message browsing
- Exchange and binding management
- Connection and channel monitoring
- User and permission administration
- Virtual host management
- Performance metrics and charts
### CLI Administration
#### On Host Machine (using rabbitmqadmin)
```bash
# List vhosts
rabbitmqadmin -H oberon.incus -P 25582 -u rabbitmq -p PASSWORD list vhosts
# List queues in a vhost
rabbitmqadmin -H oberon.incus -P 25582 -u rabbitmq -p PASSWORD -V kairos list queues
# Publish a test message
rabbitmqadmin -H oberon.incus -P 25582 -u rabbitmq -p PASSWORD -V kairos publish \
exchange=amq.default routing_key=test payload="test message"
```
#### Inside Container
```bash
# Enter the container
docker exec -it rabbitmq /bin/sh
# List vhosts
rabbitmqctl list_vhosts
# List users
rabbitmqctl list_users
# List permissions for a user
rabbitmqctl list_user_permissions kairos
# List queues in a vhost
rabbitmqctl list_queues -p kairos
# Check node status
rabbitmqctl status
```
### Connection Strings
#### AMQP Connection (from other containers on Oberon)
```
amqp://kairos:PASSWORD@localhost:5672/kairos
amqp://spelunker:PASSWORD@localhost:5672/spelunker
```
#### AMQP Connection (from other hosts)
```
amqp://kairos:PASSWORD@oberon.incus:5672/kairos
amqp://spelunker:PASSWORD@oberon.incus:5672/spelunker
```
#### Management API
```
http://rabbitmq:PASSWORD@oberon.incus:25582/api/
```
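Credentials containing special characters (`@`, `/`, `:`) must be percent-encoded before being embedded in these URLs. A small self-contained helper illustrating this (not part of the deployment; most AMQP clients, such as pika, accept URLs in this form):

```python
from urllib.parse import quote

def amqp_url(user: str, password: str, host: str, vhost: str, port: int = 5672) -> str:
    """Build an AMQP connection URL, percent-encoding the credentials and vhost."""
    return (
        f"amqp://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(vhost, safe='')}"
    )

print(amqp_url("kairos", "p@ss/word", "oberon.incus", "kairos"))
# → amqp://kairos:p%40ss%2Fword@oberon.incus:5672/kairos
```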
## Monitoring and Observability
### Logging
- **Driver**: syslog (Docker logging driver)
- **Destination**: `tcp://127.0.0.1:51402` (Alloy on Oberon)
- **Tag**: `rabbitmq`
- **Format**: `{{ syslog_format }}` (from Alloy configuration)
Logs are collected by Alloy and forwarded to Loki on Prospero for centralized log aggregation.
### Key Metrics (via Management UI)
| Metric | Description |
|--------|-------------|
| Connections | Active AMQP client connections |
| Channels | Active channels within connections |
| Queues | Total queues across all vhosts |
| Messages | Ready, unacknowledged, and total message counts |
| Message Rate | Publish/deliver rates (msg/s) |
| Memory Usage | Container memory consumption |
| Disk Usage | Persistent storage utilization |
### Health Check
```bash
# Check if RabbitMQ is running
docker ps | grep rabbitmq
# Check container logs
docker logs rabbitmq
# Check RabbitMQ node status
docker exec rabbitmq rabbitmqctl status
# Check cluster health (single-node, should show 1 node)
docker exec rabbitmq rabbitmqctl cluster_status
```
## Operational Tasks
### Restart RabbitMQ
```bash
# Via Docker Compose
cd /srv/rabbitmq
sudo -u rabbitmq docker compose restart
# Via Docker directly
docker restart rabbitmq
```
### Recreate Container (preserves data)
```bash
cd /srv/rabbitmq
sudo -u rabbitmq docker compose down
sudo -u rabbitmq docker compose up -d
```
### Add New Virtual Host and User
1. Update `group_vars/all/vars.yml`:
```yaml
rabbitmq_vhosts:
- name: newservice
rabbitmq_users:
- name: newservice
password: "{{ newservice_rabbitmq_password }}"
tags: []
rabbitmq_permissions:
- vhost: newservice
user: newservice
configure_priv: .*
read_priv: .*
write_priv: .*
# Add mapping
newservice_rabbitmq_password: "{{ vault_newservice_rabbitmq_password }}"
```
2. Add password to `group_vars/all/vault.yml`:
```bash
ansible-vault edit inventory/group_vars/all/vault.yml
# Add: vault_newservice_rabbitmq_password: "secure_password"
```
3. Run the playbook:
```bash
ansible-playbook rabbitmq/deploy.yml
```
The provisioning tasks are idempotent—existing vhosts and users are skipped, only new ones are created.
### Rotate User Password
```bash
# Inside container
docker exec rabbitmq rabbitmqctl change_password kairos "new_password"
# Update vault
ansible-vault edit inventory/group_vars/all/vault.yml
# Update vault_kairos_rabbitmq_password
```
### Clear All Messages in a Queue
```bash
docker exec rabbitmq rabbitmqctl purge_queue queue_name -p kairos
```
## Troubleshooting
### Container Won't Start
Check Docker logs for errors:
```bash
docker logs rabbitmq
```
Common issues:
- Port conflict on 5672 or 25582
- Permission issues on `/srv/rabbitmq` directory
- Corrupted data volume
### Cannot Connect to Management UI
1. Verify port mapping: `docker port rabbitmq`
2. Check firewall rules on Oberon
3. Verify container is running: `docker ps | grep rabbitmq`
4. Check if management plugin is enabled (should be in `-management-alpine` image)
### User Authentication Failing
```bash
# List users and verify they exist
docker exec rabbitmq rabbitmqctl list_users
# Check user permissions
docker exec rabbitmq rabbitmqctl list_user_permissions kairos
# Verify vhost exists
docker exec rabbitmq rabbitmqctl list_vhosts
```
### High Memory Usage
RabbitMQ may consume significant memory with many messages. Check:
```bash
# Memory usage
docker exec rabbitmq rabbitmqctl status | grep memory
# Queue depths
docker exec rabbitmq rabbitmqctl list_queues -p kairos messages
# Consider setting memory limits in docker-compose.yml
```
## Security Considerations
### Network Isolation
- RabbitMQ AMQP port (5672) is **only** exposed on the Incus network (`10.10.0.0/16`)
- Management UI (25582) is exposed externally for administration
- For production: Place HAProxy in front of management UI with authentication
- Consider enabling SSL/TLS for AMQP connections in production
### Credential Management
- ✔ All passwords stored in Ansible Vault
- ✔ Service accounts have isolated virtual hosts
- ✔ Default admin account uses strong password from vault
- ⚠️ Credentials passed as environment variables (visible in `docker inspect`)
- Consider using Docker secrets or Vault integration for enhanced security
### Virtual Host Isolation
Each service operates in its own virtual host:
- Messages cannot cross between vhosts
- Resource quotas can be applied per-vhost
- Credentials can be rotated without affecting other services
## Future Enhancements
- [ ] **SSL/TLS Support**: Enable encrypted AMQP connections
- [ ] **Cluster Mode**: Add additional RabbitMQ nodes for high availability
- [ ] **Federation**: Connect to external RabbitMQ clusters
- [ ] **Prometheus Exporter**: Add metrics export for Grafana monitoring
- [ ] **Shovel Plugin**: Configure message forwarding between brokers
- [ ] **HAProxy Integration**: Reverse proxy for management UI with authentication
- [ ] **Docker Secrets**: Replace environment variables with Docker secrets
## References
- [RabbitMQ Official Documentation](https://www.rabbitmq.com/documentation.html)
- [RabbitMQ Management Plugin](https://www.rabbitmq.com/management.html)
- [AMQP 0-9-1 Protocol Reference](https://www.rabbitmq.com/amqp-0-9-1-reference.html)
- [Virtual Hosts](https://www.rabbitmq.com/vhosts.html)
- [Access Control (Authentication, Authorisation)](https://www.rabbitmq.com/access-control.html)
- [Monitoring RabbitMQ](https://www.rabbitmq.com/monitoring.html)
---
**Last Updated**: February 12, 2026
**Project**: Agathos Infrastructure
**Approval**: Red Panda Approved™
---
**File:** `docs/red_panda_standards.md`
# Red Panda Approval™ Standards
Quality and observability standards for the Ouranos Lab. All infrastructure code, application code, and LLM-generated code deployed into this environment must meet these standards.
---
## 🐾 Red Panda Approval™
All implementations must meet the 5 Sacred Criteria:
1. **Fresh Environment Test** — Clean runs on new systems without drift. No leftover state, no manual steps.
2. **Elegant Simplicity** — Modular, reusable, no copy-paste sprawl. One playbook per concern.
3. **Observable & Auditable** — Clear task names, proper logging, check mode compatible. You can see what happened.
4. **Idempotent Patterns** — Run multiple times with consistent results. No side effects on re-runs.
5. **Actually Provisions & Configures** — Resources work, dependencies resolve, services integrate. It does the thing.
---
## Vault Security
All sensitive information is encrypted using Ansible Vault with AES256 encryption.
**Encrypted secrets:**
- Database passwords (PostgreSQL, Neo4j)
- API keys (OpenAI, Anthropic, Mistral, Groq)
- Application secrets (Grafana, SearXNG, Arke)
- Monitoring alerts (Pushover integration)
**Security rules:**
- AES256 encryption with `ansible-vault`
- Password file for automation — never pass `--vault-password-file` inline in scripts
- Vault variables use the `vault_` prefix; map to friendly names in `group_vars/all/vars.yml`
- No secrets in plain text files, ever
---
## Log Level Standards
All services in the Ouranos Lab MUST follow these log level conventions. These rules apply to application code, infrastructure services, and any LLM-generated code deployed into this environment. Log output flows through Alloy → Loki → Grafana, so disciplined leveling is not cosmetic — it directly determines alert quality, dashboard usefulness, and on-call signal-to-noise ratio.
### Level Definitions
| Level | When to Use | What MUST Be Included | Loki / Grafana Role |
|-------|-------------|----------------------|---------------------|
| **ERROR** | Something is broken and requires human intervention. The service cannot fulfil the current request or operation. | Exception class, message, stack trace, and relevant context (request ID, user, resource identifier). Never a bare `"something failed"`. | AlertManager rules fire on `level=~"error\|fatal\|critical"`. These trigger Pushover notifications. |
| **WARNING** | Degraded but self-recovering: retries succeeding, fallback paths taken, thresholds approaching, deprecated features invoked. | What degraded, what recovery action was taken, current metric value vs. threshold. | Grafana dashboard panels. Rate-based alerting (e.g., >N warnings/min). |
| **INFO** | Significant lifecycle and business events: service start/stop, configuration loaded, deployment markers, user authentication, job completion, schema migrations. | The event and its outcome. This level tells the *story* of what the system did. | Default production visibility. The go-to level for post-incident timelines. |
| **DEBUG** | Diagnostic detail for active troubleshooting: request/response payloads, SQL queries, internal state, variable values. | **Actionable context is mandatory.** A DEBUG line with no detail is worse than no line at all. Include variable values, object states, or decision paths. | Never enabled in production by default. Used on-demand via per-service level override. |
### Anti-Patterns
These are explicit violations of Ouranos logging standards:
| ❌ Anti-Pattern | Why It's Wrong | ✅ Correct Approach |
|----------------|---------------|-------------------|
| Health checks logged at INFO (`GET /health → 200 OK`) | Routine HAProxy/Prometheus probes flood syslog with thousands of identical lines per hour, burying real events. | Suppress health endpoints from access logs entirely, or demote to DEBUG. |
| DEBUG with no context (`logger.debug("error occurred")`) | Provides zero diagnostic value. If DEBUG is noisy *and* useless, nobody will ever enable it. | `logger.debug("PaymentService.process failed: order_id=%s, provider=%s, response=%r", oid, provider, resp)` |
| ERROR without exception details (`logger.error("task failed")`) | Cannot be triaged without reproduction steps. Wastes on-call time. | `logger.error("Celery task invoice_gen failed: order_id=%s", oid, exc_info=True)` |
| Logging sensitive data at any level | Passwords, tokens, API keys, and PII in Loki are a security incident. | Mask or redact: `api_key=sk-...a3f2`, `password=*****`. |
| Inconsistent level casing | Breaks LogQL filters and Grafana label selectors. | **Python / Django**: UPPERCASE (`INFO`, `WARNING`, `ERROR`, `DEBUG`). **Go / infrastructure** (HAProxy, Alloy, Gitea): lowercase (`info`, `warn`, `error`, `debug`). |
| Logging expected conditions as ERROR | A user entering a wrong password is not an error — it is normal business logic. | Use WARNING or INFO for expected-but-notable conditions. Reserve ERROR for things that are actually broken. |
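The masking rule in the table can be sketched as a small helper applied to a message before the log call. This is an illustrative sketch, not shared Ouranos code; the function name and the two patterns are assumptions.

```python
import re

# Hypothetical redaction helper -- patterns and name are illustrative,
# not part of the Ouranos codebase.
SECRET_PATTERNS = [
    # api_key=sk-1234...a3f2 -> api_key=sk-...a3f2 (keep short prefix/suffix)
    (re.compile(r"(api_key=)(\S{8,})"),
     lambda m: m.group(1) + m.group(2)[:3] + "..." + m.group(2)[-4:]),
    # password=<anything> -> password=*****
    (re.compile(r"(password=)\S+"),
     lambda m: m.group(1) + "*****"),
]

def redact(message: str) -> str:
    """Return the message with known secret patterns masked."""
    for pattern, repl in SECRET_PATTERNS:
        message = pattern.sub(repl, message)
    return message
```

Call `redact()` on any string that might carry credentials before it reaches a logger; in practice a logging filter can apply this centrally.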
### Health Check Rule
> All services exposed through HAProxy MUST suppress or demote health check endpoints (`/health`, `/healthz`, `/api/health`, `/metrics`, `/ping`) to DEBUG or below. Health check success is the *absence* of errors, not the presence of 200s. If your syslog shows a successful health probe, your log level is wrong.
**Implementation guidance:**
- **Django / Gunicorn**: Filter health paths in the access log handler or use middleware that skips logging for probe user-agents.
- **Docker services**: Configure the application's internal logging to exclude health routes — the syslog driver forwards everything it receives.
- **HAProxy**: HAProxy's own health check logs (`option httpchk`) should remain at the HAProxy level for connection debugging, but backend application responses to those probes must not surface at INFO.
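For Django or Gunicorn, one way to implement the suppression is a standard `logging.Filter` attached to the access-log handler. The class name and attachment point below are illustrative assumptions, not the deployed Ouranos configuration.

```python
import logging

# Illustrative sketch -- class name and attachment point are assumptions.
HEALTH_PATHS = ("/health", "/healthz", "/api/health", "/metrics", "/ping")

class SkipHealthChecks(logging.Filter):
    """Drop access-log records for routine health-probe endpoints."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Keep the record only if no health path appears in the request line.
        # Substring matching is deliberately loose for a sketch.
        return not any(path in record.getMessage() for path in HEALTH_PATHS)

# Attachment, e.g. in a Gunicorn config file or a Django LOGGING dict:
# logging.getLogger("gunicorn.access").addFilter(SkipHealthChecks())
```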
### Background Worker & Queue Monitoring
> **The most dangerous failure is the one that produces no logs.**
When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runner, cron job) fails to start or crashes on startup, it generates no ongoing log output. Error-rate dashboards stay green because there is no process running to produce errors. Meanwhile, queues grow unbounded and work silently stops being processed.
**Required practices:**
1. **Heartbeat logging** — Every long-running background worker MUST emit a periodic INFO-level heartbeat (e.g., `"worker alive, processed N jobs in last 5m, queue depth: M"`). The *absence* of this heartbeat is the alertable condition.
2. **Startup and shutdown at INFO** — Worker start, ready, graceful shutdown, and crash-exit are significant lifecycle events. These MUST log at INFO.
3. **Queue depth as a metric** — RabbitMQ queue depths and any application-level task queues MUST be exposed as Prometheus metrics. A growing queue with zero consumer activity is an **ERROR**-level alert, not a warning.
4. **Grafana "last seen" alerts** — For every background worker, configure a Grafana alert using `absent_over_time()` or equivalent staleness detection: *"Worker X has not logged a heartbeat in >10 minutes"* → ERROR severity → Pushover notification.
5. **Crash-on-start is ERROR** — If a worker exits within seconds of starting (missing config, failed DB connection, import error), the exit MUST be captured at ERROR level by the service manager (`systemd OnFailure=`, Docker restart policy logs). Do not rely on the crashing application to log its own death — it may never get the chance.
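A heartbeat line satisfying practice 1 might be produced as below. The message shape mirrors the example above; the function names are assumptions for illustration.

```python
import logging

logger = logging.getLogger("worker")

def format_heartbeat(processed: int, queue_depth: int, window: str = "5m") -> str:
    """Build the periodic INFO heartbeat line described in practice 1."""
    return (
        f"worker alive, processed {processed} jobs in last {window}, "
        f"queue depth: {queue_depth}"
    )

def emit_heartbeat(processed: int, queue_depth: int) -> None:
    # INFO on purpose: the *absence* of this line for >10 minutes is what
    # the Grafana staleness alert fires on.
    logger.info(format_heartbeat(processed, queue_depth))
```

Schedule `emit_heartbeat` on a timer inside the worker loop; do not emit it from a separate process, since that would defeat the staleness check.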
### Production Defaults
| Service Category | Default Level | Rationale |
|-----------------|---------------|-----------|
| Django apps (Angelia, Athena, Kairos, Icarlos, Spelunker, Peitho, MCP Switchboard) | `WARNING` | Business logic — only degraded or broken conditions surface. Lifecycle events (start/stop/deploy) still log at INFO via Gunicorn and systemd. |
| Gunicorn access logs | Suppress 2xx/3xx health probes | Routine request logging deferred to HAProxy access logs in Loki. |
| Infrastructure agents (Alloy, Prometheus, Node Exporter) | `warn` | Stable — do not change without cause. |
| HAProxy (Titania) | `warning` | Connection-level logging handled by HAProxy's own log format → Alloy → Loki. |
| Databases (PostgreSQL, Neo4j) | `warning` | Query-level logging only enabled for active troubleshooting. |
| Docker services (Gitea, LobeChat, Nextcloud, AnythingLLM, SearXNG) | `warn` / `warning` | Per-service default. Tune individually if needed. |
| LLM Proxy (Arke) | `info` | Token usage tracking and provider routing decisions justify INFO. Review periodically for noise. |
| Observability stack (Grafana, Loki, AlertManager) | `warn` | Should be quiet unless something is wrong with observability itself. |
### Loki & Grafana Alignment
**Label normalization**: Alloy pipelines (syslog listeners and journal relabeling) MUST extract and forward a `level` label on every log line. Without a `level` label, the log entry is invisible to level-based dashboard filters and alert rules.
**LogQL conventions for dashboards:**
```logql
# Production error monitoring (default dashboard view)
{job="syslog", hostname="puck"} | json | level=~"error|fatal|critical"
# Warning-and-above for a specific service
{service_name="haproxy"} | logfmt | level=~"warn|error|fatal"
# Debug-level troubleshooting (temporary, never permanent dashboards)
{container="angelia"} | json | level="debug"
```
**Alerting rules** — Grafana alert rules MUST key off the normalized `level` label:
- `level=~"error|fatal|critical"` → Immediate Pushover notification via AlertManager
- `absent_over_time({service_name="celery_worker"}[10m])` → Worker heartbeat staleness → ERROR severity
- Rate-based: `rate({service_name="arke"} | json | level="error" [5m]) > 0.1` → Sustained error rate
**Retention alignment**: Loki retention policies should preserve ERROR and WARNING logs longer than DEBUG. DEBUG-level logs generated during troubleshooting sessions should have a short TTL or be explicitly cleaned up.
---
## Documentation Standards
Place documentation in the `/docs/` directory of the repository.
### HTML Documents
HTML documents must follow [docs/documentation_style_guide.html](documentation_style_guide.html).
- Use Bootstrap CDN with Bootswatch theme **Flatly**
- Include a dark mode toggle button in the navbar
- Use Bootstrap Icons for icons
- Use Bootstrap CSS for styles — avoid custom CSS
- Use **Mermaid** for diagrams
### Markdown Documents
Only these status symbols are approved:
- ✔ Success/Complete
- ❌ Error/Failed
- ⚠️ Warning/Caution
- ℹ️ Information/Note

---
`docs/searxng-auth.md` (new file)
# SearXNG Authentication Design Document
# Red Panda Approved
## Overview
This document describes the design for adding Casdoor-based authentication to SearXNG,
which doesn't natively support SSO/OIDC authentication.
## Architecture
```
┌──────────────┐ ┌───────────────┐ ┌─────────────────────────────────────┐
│ Browser │────▶│ HAProxy │────▶│ Oberon │
│ │ │ (titania) │ │ ┌────────────────┐ ┌───────────┐ │
└──────────────┘ └───────┬───────┘ │ │ OAuth2-Proxy │─▶│ SearXNG │ │
│ │ │ (port 22073) │ │ (22083) │ │
│ │ └───────┬────────┘ └───────────┘ │
│ └──────────┼─────────────────────────┘
│ │ OIDC
│ ┌──────────────────▼────────────────┐
└────▶│ Casdoor │
│ (OIDC Provider - titania) │
└───────────────────────────────────┘
```
The OAuth2-Proxy runs as a **native binary sidecar** on Oberon alongside SearXNG,
following the same pattern used for JupyterLab on Puck. The upstream connection is
`localhost` — eliminating the cross-host hop from the previous Docker-based deployment
on Titania.
> Each host supports at most one OAuth2-Proxy sidecar instance. The binary is
> shared at `/usr/local/bin/oauth2-proxy`; each service gets a unique config directory
> and systemd unit name.
## Components
### 1. OAuth2-Proxy (Sidecar on Oberon)
- **Purpose**: Acts as authentication gateway for SearXNG
- **Port**: 22073 (exposed to HAProxy)
- **Binary**: Native `oauth2-proxy` v7.6.0 (systemd service `oauth2-proxy-searxng`)
- **Config**: `/etc/oauth2-proxy-searxng/oauth2-proxy.cfg`
- **Upstream**: `http://127.0.0.1:22083` (localhost sidecar to SearXNG)
- **Logging**: systemd journal (`SyslogIdentifier=oauth2-proxy-searxng`)
### 2. Casdoor (Existing on Titania)
- **Purpose**: OIDC Identity Provider
- **Port**: 22081
- **URL**: https://id.ouranos.helu.ca/ (via HAProxy)
- **Required Setup**:
- Create Application for SearXNG
- Configure redirect URI
- Generate client credentials
### 3. HAProxy Updates (Titania)
- Route `searxng.ouranos.helu.ca` to OAuth2-Proxy on Oberon (`oberon.incus:22073`)
- OAuth2-Proxy handles authentication before proxying to SearXNG on localhost
### 4. SearXNG (Existing on Oberon)
- **No changes required** - remains unaware of authentication
- Receives pre-authenticated requests from OAuth2-Proxy
## Authentication Flow
1. User navigates to `https://searxng.ouranos.helu.ca/`
2. HAProxy routes to OAuth2-Proxy on oberon:22073
3. OAuth2-Proxy checks for valid session cookie (`_oauth2_proxy_searxng`)
4. **If no valid session**:
- Redirect to Casdoor login: `https://id.ouranos.helu.ca/login/oauth/authorize`
- User authenticates with Casdoor (username/password, social login, etc.)
- Casdoor redirects back with authorization code
- OAuth2-Proxy exchanges code for tokens
- OAuth2-Proxy sets session cookie
5. **If valid session**:
- OAuth2-Proxy adds `X-Forwarded-User` header
- Request proxied to SearXNG at `127.0.0.1:22083` (localhost sidecar)
## Casdoor Configuration
### Application Setup (Manual via Casdoor UI)
1. Login to Casdoor at https://id.ouranos.helu.ca/
2. Navigate to Applications → Add
3. Configure:
- **Name**: `searxng`
- **Display Name**: `SearXNG Search`
- **Organization**: `built-in` (or your organization)
- **Redirect URLs**:
- `https://searxng.ouranos.helu.ca/oauth2/callback`
- **Grant Types**: `authorization_code`, `refresh_token`
- **Response Types**: `code`
4. Save and note the `Client ID` and `Client Secret`
### Cookie Secret Generation
Generate a 32-byte random secret for OAuth2-Proxy cookies:
```bash
openssl rand -base64 32
```
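If `openssl` is not at hand, the same 32-byte, base64-encoded value can be produced with Python's standard library:

```python
import base64
import secrets

# Equivalent to `openssl rand -base64 32`: 32 random bytes, base64-encoded.
cookie_secret = base64.b64encode(secrets.token_bytes(32)).decode("ascii")
print(cookie_secret)
```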
## Environment Variables
### Development (Sandbox)
```yaml
# In inventory/host_vars/oberon.incus.yml
searxng_oauth2_proxy_dir: /etc/oauth2-proxy-searxng
searxng_oauth2_proxy_version: "7.6.0"
searxng_proxy_port: 22073
searxng_domain: "ouranos.helu.ca"
searxng_oauth2_oidc_issuer_url: "https://id.ouranos.helu.ca"
searxng_oauth2_redirect_url: "https://searxng.ouranos.helu.ca/oauth2/callback"
# OAuth2 Credentials (from vault)
searxng_oauth2_client_id: "{{ vault_searxng_oauth2_client_id }}"
searxng_oauth2_client_secret: "{{ vault_searxng_oauth2_client_secret }}"
searxng_oauth2_cookie_secret: "{{ vault_searxng_oauth2_cookie_secret }}"
```
> Variables use the `searxng_` prefix, following the same naming pattern as
> `jupyterlab_oauth2_*` variables on Puck. The upstream URL (`http://127.0.0.1:22083`)
> is derived from `searxng_port` in the config template — no cross-host URL needed.
## Deployment Steps
### 1. Add Vault Secrets
```bash
ansible-vault edit inventory/group_vars/all/vault.yml
```
Add:
```yaml
vault_searxng_oauth2_client_id: "<from-casdoor>"
vault_searxng_oauth2_client_secret: "<from-casdoor>"
vault_searxng_oauth2_cookie_secret: "<generated-32-byte-secret>"
```
Note: The `searxng_` prefix allows service-specific credentials. The Oberon host_vars
maps these directly to `searxng_oauth2_*` variables used by the sidecar config template.
### 2. Update Host Variables
OAuth2-Proxy variables are defined in `inventory/host_vars/oberon.incus.yml` alongside
the existing SearXNG configuration. No separate service entry is needed — the OAuth2-Proxy
sidecar is deployed as part of the `searxng` service.
```yaml
# SearXNG OAuth2-Proxy Sidecar (in oberon.incus.yml)
searxng_oauth2_proxy_dir: /etc/oauth2-proxy-searxng
searxng_oauth2_proxy_version: "7.6.0"
searxng_proxy_port: 22073
searxng_domain: "ouranos.helu.ca"
searxng_oauth2_oidc_issuer_url: "https://id.ouranos.helu.ca"
searxng_oauth2_redirect_url: "https://searxng.ouranos.helu.ca/oauth2/callback"
```
### 3. Update HAProxy Backend
Route SearXNG traffic through OAuth2-Proxy on Oberon:
```yaml
# In inventory/host_vars/titania.incus.yml
haproxy_backends:
- subdomain: "searxng"
backend_host: "oberon.incus" # Same host as SearXNG
backend_port: 22073 # OAuth2-Proxy port
health_path: "/ping" # OAuth2-Proxy health endpoint
```
### 4. Deploy
```bash
cd ansible
# Deploy SearXNG + OAuth2-Proxy sidecar
ansible-playbook searxng/deploy.yml
# Update HAProxy configuration
ansible-playbook haproxy/deploy.yml
```
## Monitoring
### Logs
OAuth2-Proxy logs to systemd journal on Oberon. Alloy's default `systemd_logs`
source captures these logs automatically, filterable by `SyslogIdentifier=oauth2-proxy-searxng`.
```bash
# View logs on Oberon
ssh oberon.incus
journalctl -u oauth2-proxy-searxng -f
```
### Metrics
OAuth2-Proxy exposes Prometheus metrics at `/metrics` on port 22073:
- `oauth2_proxy_requests_total` - Total requests
- `oauth2_proxy_errors_total` - Error count
- `oauth2_proxy_upstream_latency_seconds` - Upstream latency
## Security Considerations
1. **Cookie Security**:
- `cookie_secure = true` enforces HTTPS-only cookies
- `cookie_httponly = true` prevents JavaScript access
- `cookie_samesite = "lax"` provides CSRF protection
2. **Email Domain Restriction**:
- Configure `oauth2_proxy_email_domains` to limit who can access
- Example: `["yourdomain.com"]` or `["*"]` for any
3. **Group-Based Access**:
- Optional: Configure `oauth2_proxy_allowed_groups` in Casdoor
- Only users in specified groups can access SearXNG
## Troubleshooting
### Check OAuth2-Proxy Status
```bash
ssh oberon.incus
systemctl status oauth2-proxy-searxng
journalctl -u oauth2-proxy-searxng --no-pager -n 50
```
### Test OIDC Discovery
```bash
curl https://id.ouranos.helu.ca/.well-known/openid-configuration
```
### Test Health Endpoint
```bash
curl http://oberon.incus:22073/ping
```
### Verify Cookie Domain
Ensure the cookie domain (`.ouranos.helu.ca`) matches your HAProxy domain.
Cookies won't work across different domains.
## Files
| File | Purpose |
|------|---------|
| `ansible/searxng/deploy.yml` | SearXNG + OAuth2-Proxy sidecar deployment |
| `ansible/searxng/oauth2-proxy-searxng.cfg.j2` | OAuth2-Proxy OIDC configuration |
| `ansible/searxng/oauth2-proxy-searxng.service.j2` | Systemd unit for OAuth2-Proxy |
| `ansible/inventory/host_vars/oberon.incus.yml` | Host variables (`searxng_oauth2_*`) |
| `docs/searxng-auth.md` | This design document |
### Generic OAuth2-Proxy Module (Retained)
The standalone `ansible/oauth2_proxy/` directory is retained as a generic, reusable
Docker-based OAuth2-Proxy module for future services:
| File | Purpose |
|------|---------|
| `ansible/oauth2_proxy/deploy.yml` | Generic Docker Compose deployment |
| `ansible/oauth2_proxy/docker-compose.yml.j2` | Docker Compose template |
| `ansible/oauth2_proxy/oauth2-proxy.cfg.j2` | Generic OIDC configuration template |
| `ansible/oauth2_proxy/stage.yml` | Validation / dry-run playbook |

---
`docs/smtp4dev.md` (new file)
# smtp4dev - Development SMTP Server
## Overview
smtp4dev is a fake SMTP server for development and testing. It accepts all incoming email without delivering it, capturing messages for inspection via a web UI and IMAP client. All services in the Agathos sandbox that send email (Casdoor, Gitea, etc.) are wired to smtp4dev so email flows can be tested without a real mail server.
**Host:** Oberon (container_orchestration)
**Web UI Port:** 22085 → `https://smtp4dev.ouranos.helu.ca`
**SMTP Port:** 22025 (used by all services as `smtp_host:smtp_port`)
**IMAP Port:** 22045
**Syslog Port:** 51405 (Alloy)
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Oberon Host │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ smtp4dev Container (Docker) │ │
│ │ │ │
│ │ Port 80 → host 22085 (Web UI) │ │
│ │ Port 25 → host 22025 (SMTP) │ │
│ │ Port 143 → host 22045 (IMAP) │ │
│ │ │ │
│ │ Volume: smtp4dev_data → /smtp4dev │ │
│ │ Logs: syslog → Alloy:51405 → Loki │ │
│ └──────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
▲ ▲
│ SMTP :22025 │ SMTP :22025
┌──────┴──────┐ ┌──────┴──────┐
│ Casdoor │ │ Gitea │
│ (Titania) │ │ (Rosalind) │
└─────────────┘ └─────────────┘
External access:
https://smtp4dev.ouranos.helu.ca → HAProxy (Titania) → oberon.incus:22085
```
## Shared SMTP Variables
smtp4dev connection details are defined once in `ansible/inventory/group_vars/all/vars.yml` and consumed by all service templates:
| Variable | Value | Purpose |
|----------|-------|---------|
| `smtp_host` | `oberon.incus` | SMTP server hostname |
| `smtp_port` | `22025` | SMTP server port |
| `smtp_from` | `noreply@ouranos.helu.ca` | Default sender address |
| `smtp_from_name` | `Agathos` | Default sender display name |
Any service that needs to send email references these shared variables rather than defining its own SMTP config. This means switching to a real SMTP server only requires changing `group_vars/all/vars.yml`.
## Ansible Deployment
### Playbook
```bash
# Deploy smtp4dev on Oberon
ansible-playbook smtp4dev/deploy.yml
# Redeploy HAProxy to activate the smtp4dev.ouranos.helu.ca backend
ansible-playbook haproxy/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `ansible/smtp4dev/deploy.yml` | Main deployment playbook |
| `ansible/smtp4dev/docker-compose.yml.j2` | Docker Compose template |
### Deployment Steps
The `deploy.yml` playbook:
1. Filters hosts — only runs on hosts with `smtp4dev` in their `services` list (Oberon)
2. Creates `smtp4dev` system group and user
3. Adds `ponos` user to the `smtp4dev` group (for `docker compose` access)
4. Creates `/srv/smtp4dev` directory owned by `smtp4dev:smtp4dev`
5. Templates `docker-compose.yml` into `/srv/smtp4dev/`
6. Resets SSH connection to apply group membership
7. Starts the service with `community.docker.docker_compose_v2: state: present`
### Host Variables
Defined in `ansible/inventory/host_vars/oberon.incus.yml`:
```yaml
# smtp4dev Configuration
smtp4dev_user: smtp4dev
smtp4dev_group: smtp4dev
smtp4dev_directory: /srv/smtp4dev
smtp4dev_port: 22085 # Web UI (container port 80)
smtp4dev_smtp_port: 22025 # SMTP (container port 25)
smtp4dev_imap_port: 22045 # IMAP (container port 143)
smtp4dev_syslog_port: 51405 # Alloy syslog collector
```
## Service Integrations
### Casdoor
The Casdoor email provider is declared in `ansible/casdoor/init_data.json.j2` and seeded automatically on a **fresh** Casdoor deployment:
```json
{
"owner": "admin",
"name": "provider-email-smtp4dev",
"displayName": "smtp4dev Email",
"category": "Email",
"type": "SMTP",
"host": "oberon.incus",
"port": 22025,
"disableSsl": true,
"fromAddress": "noreply@ouranos.helu.ca",
"fromName": "Agathos"
}
```
> ⚠️ For **existing** Casdoor installs, create the provider manually:
> 1. Log in to `https://id.ouranos.helu.ca` as admin
> 2. Navigate to **Identity → Providers → Add**
> 3. Set **Category**: `Email`, **Type**: `SMTP`
> 4. Fill host `oberon.incus`, port `22025`, disable SSL, from `noreply@ouranos.helu.ca`
> 5. Save and assign the provider to the `heluca` organization under **Organizations → heluca → Edit → Default email provider**
### Gitea
Configured directly in `ansible/gitea/app.ini.j2`:
```ini
[mailer]
ENABLED = true
SMTP_ADDR = {{ smtp_host }}
SMTP_PORT = {{ smtp_port }}
FROM = {{ smtp_from }}
```
Redeploy Gitea to apply:
```bash
ansible-playbook gitea/deploy.yml
```
## External Access
smtp4dev's web UI is exposed via HAProxy on Titania at `https://smtp4dev.ouranos.helu.ca`.
Backend entry in `ansible/inventory/host_vars/titania.incus.yml`:
```yaml
- subdomain: "smtp4dev"
backend_host: "oberon.incus"
backend_port: 22085
health_path: "/"
```
## Verification
```bash
# Check container is running
ssh oberon.incus "cd /srv/smtp4dev && docker compose ps"
# Check logs
ssh oberon.incus "cd /srv/smtp4dev && docker compose logs --tail=50"
# Test SMTP delivery (sends a test message)
ssh oberon.incus "echo 'Subject: test' | sendmail -S oberon.incus:22025 test@example.com"
# Check web UI is reachable internally
curl -s -o /dev/null -w "%{http_code}" http://oberon.incus:22085
# Check external HTTPS route
curl -sk -o /dev/null -w "%{http_code}" https://smtp4dev.ouranos.helu.ca
```
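A scripted alternative to the sendmail one-liner, using Python's `smtplib`. Host, port, and addresses follow the shared SMTP variables above; the function names are illustrative, not existing tooling.

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "oberon.incus"  # smtp_host from group_vars/all/vars.yml
SMTP_PORT = 22025           # smtp_port from group_vars/all/vars.yml

def build_test_message() -> EmailMessage:
    """Construct a minimal message matching the shared sender variables."""
    msg = EmailMessage()
    msg["Subject"] = "smtp4dev delivery test"
    msg["From"] = "noreply@ouranos.helu.ca"
    msg["To"] = "test@example.com"
    msg.set_content("Test message; it should appear in the smtp4dev web UI.")
    return msg

def send_test_message() -> None:
    # smtp4dev accepts any recipient and never delivers externally.
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as smtp:
        smtp.send_message(build_test_message())

# send_test_message()  # run from a host that can reach oberon.incus:22025
```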
## site.yml Order
smtp4dev is deployed after Docker (it requires the Docker engine) and before Casdoor (so the SMTP endpoint exists when Casdoor initialises):
```yaml
- name: Deploy Docker
import_playbook: docker/deploy.yml
- name: Deploy smtp4dev
import_playbook: smtp4dev/deploy.yml
- name: Deploy PPLG Stack # ...continues
```

---
`docs/sunwait.txt` (new file)
Calculate sunrise and sunset times for the current or targeted day.
The times can be adjusted either for twilight or fixed durations.
The program can either: wait for sunrise or sunset (function: wait),
or return the time (GMT or local) the event occurs (function: list),
or report the day length and twilight timings (function: report),
or simply report if it is DAY or NIGHT (function: poll).
You should specify the latitude and longitude of your target location.
Usage: sunwait [major options] [minor options] [twilight type] [rise|set] [offset] [latitude] [longitude]
Major options, either:
poll Returns immediately indicating DAY or NIGHT. See 'program exit codes'. Default.
wait Sleep until specified event occurs. Else exit immediate.
list [X] Report twilight times for next 'X' days (inclusive). Default: 1.
report [date] Generate a report about the days sunrise and sunset timings. Default: the current day
Minor options, any of:
[no]debug Print extra info and returns in one minute. Default: nodebug.
[no]version Print the version number. Default: noversion.
[no]help Print this help. Default: nohelp.
[no]gmt Print times in GMT or local-time. Default: nogmt.
Twilight types, either:
daylight Top of sun just below the horizon. Default.
civil Civil Twilight. -6 degrees below horizon.
nautical Nautical twilight. -12 degrees below horizon.
astronomical Astronomical twilight. -18 degrees below horizon.
angle [X.XX] User-specified twilight-angle (degrees). Default: 0.
Sunrise/sunset. Only useful with major-options: 'wait' and 'list'. Any of: (default: both)
rise Wait for the sun to rise past specified twilight & offset.
set Wait for the sun to set past specified twilight & offset.
Offset:
offset [MM|HH:MM] Time interval (+ve towards noon) to adjust twilight calculation.
Target date. Only useful with major-options: 'report' or 'list'. Default: today
d [DD] Set the target Day-of-Month to calculate for. 1 to 31.
m [MM] Set the target Month to calculate for. 1 to 12.
y [YYYY] Set the target Year to calculate for. 2000 to 2099.
latitude/longitude coordinates: floating-point degrees, with [NESW] appended. Default: Bingham, England.
Exit (return) codes:
0 OK: exit from 'wait' or 'list' only.
1 Error.
2 Exit from 'poll': it is DAY or twilight.
3 Exit from 'poll': it is NIGHT (after twilight).
Example 1: sunwait wait rise offset -1:15:10 51.477932N 0.000000E
Wait until 1 hour 15 minutes 10 secs before the sun rises in Greenwich, London.
Example 2: sunwait list 7 civil 55.752163N 37.617524E
List civil sunrise and sunset times for today and next 6 days. Moscow.
Example 3: sunwait poll exit angle 10 54.897786N -1.517536E
Indicate by program exit-code if it is Day or Night using a custom twilight angle of 10 degrees above horizon. Washington, UK.
Example 4: sunwait list 7 gmt sunrise angle 3
List next 7 days sunrise times, custom +3 degree twilight angle, default location.
Uses GMT, as any change in daylight saving over the specified period is not considered.
Example 5: sunwait report y 20 m 3 d 15 10.49S 105.55E
Produce a report of the different sunrises and sunsets on an arbitrary day (2022/03/15) for an arbitrary location (Christmas Island)
Note that the program uses C library functions to determine time and localtime.
Errors in timings are estimated at: +/- 4 minutes.

---
`docs/terraform.md` (new file)
# Terraform Practices & Patterns
This document describes the Terraform design philosophy, patterns, and practices used across our infrastructure. The audience includes LLMs assisting with development, new team members, and existing team members seeking a reference.
## Design Philosophy
### Incus-First Infrastructure
Incus containers form the foundational layer of all environments. Management and monitoring infrastructure (Prospero, Titania) must exist before application hosts. This is a **critical dependency** that must be explicitly codified.
**Why?** Terraform isn't magic. Implicit ordering can lead to race conditions or failed deployments. Always use explicit `depends_on` for critical infrastructure chains.
```hcl
# Example: Application host depends on monitoring infrastructure
resource "incus_instance" "app_host" {
# ...
depends_on = [incus_instance.uranian_hosts["prospero"]]
}
```
### Explicit Dependencies
Never rely solely on implicit resource ordering for critical infrastructure. Codify dependencies explicitly to:
- ✔ Prevent race conditions during parallel applies
- ✔ Document architectural relationships in code
- ✔ Ensure consistent deployment ordering across environments
## Repository Strategy
### Agathos (Sandbox)
Agathos is the **Sandbox repository** — isolated, safe for external demos, and uses local state.
| Aspect | Decision |
|--------|----------|
| Purpose | Evaluation, demos, pattern experimentation, new software testing |
| State | Local (no remote backend) |
| Secrets | No production credentials or references |
| Security | Safe to use on external infrastructure for demos |
### Production Repository (Separate)
A separate repository manages Dev, UAT, and Prod environments:
```
terraform/
├── modules/incus_host/ # Reusable container module
├── environments/
│ ├── dev/ # Local Incus only
│ └── prod/ # OCI + Incus (parameterized via tfvars)
```
| Aspect | Decision |
|--------|----------|
| State | PostgreSQL backend on `eris.helu.ca:6432` with SSL |
| Schemas | Separate per environment: `dev`, `uat`, `prod` |
| UAT/Prod | Parameterized twins via `-var-file` |
## Module Design
### When to Extract a Module
A pattern is a good module candidate when it meets these criteria:
| Criterion | Description |
|-----------|-------------|
| **Reuse** | Pattern used across multiple environments (Sandbox, Dev, UAT, Prod) |
| **Stable Interface** | Inputs/outputs won't change frequently |
| **Testable** | Can validate module independently before promotion |
| **Encapsulates Complexity** | Hides `dynamic` blocks, `for_each`, cloud-init generation |
### When NOT to Extract
- Single-use patterns
- Tightly coupled to specific environment
- Adds indirection without measurable benefit
### The `incus_host` Module
The standard container provisioning pattern extracted from Agathos:
**Inputs:**
- `hosts` — Map of host definitions (name, role, image, devices, config)
- `project` — Incus project name
- `profile` — Incus profile name
- `cloud_init_template` — Cloud-init configuration template
- `ssh_key_path` — Path to SSH authorized keys
- `depends_on_resources` — Explicit dependencies for infrastructure ordering
**Outputs:**
- `host_details` — Name, IPv4, role, description for each host
- `inventory` — Documentation reference for DHCP/DNS provisioning
## Environment Strategy
### Environment Purposes
| Environment | Purpose | Infrastructure |
|-------------|---------|----------------|
| **Sandbox** | Evaluation, demos, pattern experimentation | Local Incus only |
| **Dev** | Integration testing, container builds, security testing | Local Incus only |
| **UAT** | User acceptance testing, bug resolution | OCI + Incus (hybrid) |
| **Prod** | Production workloads | OCI + Incus (hybrid) |
### Parameterized Twins (UAT/Prod)
UAT and Prod are architecturally identical. Use a single environment directory with variable files:
```bash
# UAT deployment
terraform apply -var-file=uat.tfvars
# Prod deployment
terraform apply -var-file=prod.tfvars
```
Key differences in tfvars:
- Hostnames and DNS domains
- Resource sizing (CPU, memory limits)
- OCI compartment IDs
- Credential references
## State Management
### Sandbox (Agathos)
Local state is acceptable because:
- Environment is ephemeral
- Single-user workflow
- No production secrets to protect
- Safe for external demos
### Production Environments
PostgreSQL backend on `eris.helu.ca`:
```hcl
terraform {
backend "pg" {
conn_str = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"
schema_name = "dev" # or "uat", "prod"
}
}
```
**Connection requirements:**
- Port 6432 (pgBouncer)
- SSL with `sslmode=verify-full`
- Credentials via environment variables (`PGUSER`, `PGPASSWORD`)
- Separate schema per environment for isolation
## Integration Points
### Terraform → DHCP/DNS
The `agathos_inventory` output provides host information for DHCP/DNS provisioning:
1. Terraform creates containers with cloud-init
2. `agathos_inventory` output includes hostnames and IPs
3. MAC addresses registered in DHCP server
4. DHCP server creates DNS entries (`hostname.incus` domain)
5. Ansible uses DNS names for host connectivity
### Terraform → Ansible
Ansible does **not** consume Terraform outputs directly. Instead:
1. Terraform provisions containers
2. Incus DNS resolution provides `hostname.incus` domain
3. Ansible inventory uses static DNS names
4. `sandbox_up.yml` configures DNS resolution on the hypervisor
```yaml
# Ansible inventory uses DNS names, not Terraform outputs
ubuntu:
hosts:
oberon.incus:
ariel.incus:
prospero.incus:
```
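The hypervisor-side DNS step in `sandbox_up.yml` can be sketched as Ansible tasks along these lines. The bridge name `incusbr0`, the variable name, and the exact task wording are assumptions; the real playbook may differ:

```yaml
# Hypothetical sketch: route *.incus lookups to the Incus managed bridge
- name: Point systemd-resolved at the Incus bridge DNS
  ansible.builtin.command:
    cmd: resolvectl dns incusbr0 {{ incus_bridge_ipv4 }}
  changed_when: true

- name: Scope the .incus domain to the bridge interface
  ansible.builtin.command:
    cmd: resolvectl domain incusbr0 '~incus'
  changed_when: true
```

With this in place, `ansible` can reach `oberon.incus` and friends by name without consuming Terraform outputs.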
### Terraform → Bash Scripts
The `ssh_key_update.sh` script demonstrates proper integration:
```bash
terraform output -json agathos_inventory | jq -r \
'.uranian_hosts.hosts | to_entries[] | "\(.key) \(.value.ipv4)"' | \
while read hostname ip; do
ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts
ssh-keyscan -H "$hostname.incus" >> ~/.ssh/known_hosts
done
```
## Promotion Workflow
All infrastructure changes flow through this pipeline:
```
Agathos (Sandbox)
↓ Validate pattern works
↓ Extract to module if reusable
Dev
↓ Integration testing
↓ Container builds
↓ Security testing
UAT
↓ User acceptance testing
↓ Bug fixes return to Dev
↓ Delete environment, test restore
Prod
↓ Deploy from tested artifacts
```
**Critical:** Nothing starts in Prod. Every change originates in Agathos, is validated through the pipeline, and only then deployed to production.
### Promotion Includes
When promoting Terraform changes, always update corresponding:
- Ansible playbooks and templates
- Service documentation in `/docs/services/`
- Host variables if new services added
## Output Conventions
### `agathos_inventory`
The primary output for documentation and DNS integration:
```hcl
output "agathos_inventory" {
description = "Host inventory for documentation and DHCP/DNS provisioning"
value = {
uranian_hosts = {
hosts = {
for name, instance in incus_instance.uranian_hosts : name => {
name = instance.name
ipv4 = instance.ipv4_address
role = local.uranian_hosts[name].role
description = local.uranian_hosts[name].description
security_nesting = lookup(local.uranian_hosts[name].config, "security.nesting", false)
}
}
}
}
}
```
**Purpose:**
- Update [sandbox.html](sandbox.html) documentation
- Reference for DHCP server MAC/IP registration
- DNS entry creation via DHCP
## Layered Configuration
### Single Config with Conditional Resources
Avoid multiple separate Terraform configurations. Use one config with conditional resources:
```
environments/prod/
├── main.tf # Incus project, profile, images (always)
├── incus_hosts.tf # Module call for Incus containers (always)
├── oci_resources.tf # OCI compute (conditional)
├── variables.tf
├── dev.tfvars # Dev: enable_oci = false
├── uat.tfvars # UAT: enable_oci = true
└── prod.tfvars # Prod: enable_oci = true
```
```hcl
variable "enable_oci" {
description = "Enable OCI resources (false for Dev, true for UAT/Prod)"
type = bool
default = false
}
resource "oci_core_instance" "hosts" {
for_each = var.enable_oci ? var.oci_hosts : {}
# ...
}
```
## Best Practices Summary
| Practice | Rationale |
|----------|-----------|
| ✔ Explicit `depends_on` for critical chains | Terraform's implicit graph misses runtime ordering (e.g. cloud-init, DNS) |
| ✔ Local map for host definitions | Single source of truth, easy iteration |
| ✔ `for_each` over `count` | Stable resource addresses |
| ✔ `dynamic` blocks for optional devices | Clean, declarative device configuration |
| ✔ Merge base config with overrides | DRY principle for common settings |
| ✔ Separate tfvars for environment twins | Minimal duplication, clear parameterization |
| ✔ Document module interfaces | Enable promotion across environments |
| ✔ Never start in Prod | Always validate through pipeline |
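Several of these practices combine in a typical instance definition. The sketch below shows `for_each` over a local host map, merged base config, and a `dynamic` device block; the network resource and base-config keys are illustrative assumptions:

```hcl
locals {
  base_config = {
    "boot.autostart" = true   # shared default; illustrative
  }
}

resource "incus_instance" "uranian_hosts" {
  for_each = local.uranian_hosts          # stable addresses, unlike count

  name    = each.key
  image   = each.value.image
  project = incus_project.agathos.name

  # Merge shared defaults with per-host overrides (DRY)
  config = merge(local.base_config, each.value.config)

  # Declare optional devices only when the host defines them
  dynamic "device" {
    for_each = lookup(each.value, "devices", {})
    content {
      name       = device.key
      type       = device.value.type
      properties = device.value.properties
    }
  }

  depends_on = [incus_network.agathos]    # explicit ordering; resource name assumed
}
```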

---

`docs/xrdp.md`
## Purpose

This script automates the installation and configuration of xrdp (an open-source Remote Desktop Protocol server) on Ubuntu-based systems, providing a complete remote desktop solution with an enhanced user experience.
## Key Features

**Multi-Distribution Support:**
- Ubuntu 22.04, 24.04, 24.10, 25.04
- Linux Mint, Pop!_OS, Zorin OS, elementary OS
- Debian support (best effort)
- LMDE (Linux Mint Debian Edition)
**Installation Modes:**
- Standard installation (from repositories)
- Custom installation (compile from source)
- Removal/cleanup option
**Advanced Capabilities:**
- **Sound redirection** - Compiles audio modules for remote audio playback
- **H.264 encoding/decoding** - Supported when installing the latest version
- **Desktop environment detection** - Handles GNOME, KDE, Budgie, etc.
- **Sound server detection** - Works with both PulseAudio and PipeWire
- **Custom login screen** - Branded xRDP login with custom colors/backgrounds
**Smart Features:**
- **SSH session detection** - Warns when installing over SSH
- **Version compatibility checks** - Prevents incompatible installations
- **Conflict resolution** - Disables conflicting GNOME remote desktop services
- **Permission fixes** - Handles SSL certificates and user groups
- **Polkit rules** - Enables proper shutdown/reboot from remote sessions
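As a hedged sketch of how the SSH-session warning could work: sshd sets `SSH_CONNECTION` and `SSH_TTY` for its child processes, so a check like the one below catches most remote shells. The function name and message are illustrative, not the script's actual code:

```shell
#!/bin/sh
# Illustrative sketch only: detect an SSH session via the variables
# that sshd exports to its children.
warn_if_ssh() {
    if [ -n "${SSH_CONNECTION:-}" ] || [ -n "${SSH_TTY:-}" ]; then
        echo "WARNING: SSH session detected; an xrdp install can restart the display stack."
        return 1
    fi
    return 0
}
```

A real installer would likely prompt for confirmation at this point rather than merely printing a warning.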
## What Makes It Special

- Extensive OS/version support with graceful handling of EOL versions
- Intelligent detection of desktop environments and sound systems
- Post-installation optimization for a better remote desktop experience
- Comprehensive error handling and user feedback
- Modular design with separate functions for different tasks
- Active maintenance - regularly updated for new Ubuntu releases

The script essentially transforms a basic Ubuntu system into a fully functional remote desktop server with professional-grade features, handling all the complex configuration that would normally require manual intervention.