docs: rewrite README with structured overview and quick start guide
Replaces the minimal project description with a comprehensive README including a component overview table, quick start instructions, common Ansible operations, and links to detailed documentation. Aligns with Red Panda Approval™ standards.
38 docs/Scalable Twelve Factor App.md Normal file
@@ -0,0 +1,38 @@
# Scalable Twelve-Factor App

https://12factor.net/

The twelve-factor app is a methodology for building software-as-a-service apps that:

- Use declarative formats for setup automation, to minimize time and cost for new developers joining the project;
- Have a clean contract with the underlying operating system, offering maximum portability between execution environments;
- Are suitable for deployment on modern cloud platforms, obviating the need for servers and systems administration;
- Minimize divergence between development and production, enabling continuous deployment for maximum agility;
- And can scale up without significant changes to tooling, architecture, or development practices.

- **I. Codebase**: One codebase tracked in revision control, many deploys
- **II. Dependencies**: Explicitly declare and isolate dependencies
- **III. Config**: Store config in the environment
- **IV. Backing services**: Treat backing services as attached resources
- **V. Build, release, run**: Strictly separate build and run stages
- **VI. Processes**: Execute the app as one or more stateless processes
- **VII. Port binding**: Export services via port binding
- **VIII. Concurrency**: Scale out via the process model
- **IX. Disposability**: Maximize robustness with fast startup and graceful shutdown
- **X. Dev/prod parity**: Keep development, staging, and production as similar as possible
- **XI. Logs**: Treat logs as event streams
- **XII. Admin processes**: Run admin/management tasks as one-off processes
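As a concrete illustration of factor III (store config in the environment), a minimal shell sketch; `DATABASE_URL` and `LOG_LEVEL` are hypothetical variable names, not part of this project:

```shell
# Factor III sketch: config comes from the environment, not a checked-in file.
export DATABASE_URL="postgres://db.example.com:5432/app"
export LOG_LEVEL="${LOG_LEVEL:-info}"   # fall back to a default when unset

# The process inherits its config from the environment:
echo "connecting to ${DATABASE_URL} (log level: ${LOG_LEVEL})"
```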
# Django Logging

https://lincolnloop.com/blog/django-logging-right-way/
13 docs/Semantic Versioning.md Normal file
@@ -0,0 +1,13 @@
# Semantic Versioning 2.0.0

https://semver.org/

Given a version number MAJOR.MINOR.PATCH, increment the:

- MAJOR version when you make incompatible API changes
- MINOR version when you add functionality in a backwards compatible manner
- PATCH version when you make backwards compatible bug fixes

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
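The increment rules can be sketched as a small shell helper (a hypothetical illustration, not part of any tooling mentioned here):

```shell
# Bump one part of a MAJOR.MINOR.PATCH version string.
# Usage: bump <major|minor|patch> <version>
bump() {
  local part=$1 version=$2
  IFS=. read -r major minor patch <<< "$version"
  case "$part" in
    major) echo "$((major + 1)).0.0" ;;          # breaking change
    minor) echo "$major.$((minor + 1)).0" ;;      # backwards compatible feature
    patch) echo "$major.$minor.$((patch + 1))" ;; # backwards compatible bug fix
  esac
}

bump major 1.2.3   # -> 2.0.0
bump minor 1.2.3   # -> 1.3.0
bump patch 1.2.3   # -> 1.2.4
```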
# GitHub Actions: Gitbump

https://betterprogramming.pub/how-to-version-your-code-in-2020-60bdd221278b
184 docs/_template.md Normal file
@@ -0,0 +1,184 @@
# Service Documentation Template

This is a template for documenting services deployed in the Agathos sandbox. Copy this file and replace placeholders with service-specific information.

---

# {Service Name}

## Overview

Brief description of the service, its purpose, and role in the infrastructure.

**Host:** {hostname} (e.g., oberon, miranda, prospero)
**Role:** {role from Terraform} (e.g., container_orchestration, observability)
**Port Range:** {exposed ports} (e.g., 25580-25599)

## Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│   Service   │────▶│  Database   │
└─────────────┘     └─────────────┘     └─────────────┘
```

Describe the service architecture, data flow, and integration points.

## Terraform Resources

### Host Definition

The service runs on `{hostname}`, defined in `terraform/containers.tf`:

| Attribute | Value |
|-----------|-------|
| Image | {noble/plucky/questing} |
| Role | {terraform role} |
| Security Nesting | {true/false} |
| Proxy Devices | {port mappings} |

### Dependencies

| Resource | Relationship |
|----------|--------------|
| {other host} | {description of dependency} |

## Ansible Deployment

### Playbook

```bash
cd ansible
ansible-playbook {service}/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `{service}/deploy.yml` | Main deployment playbook |
| `{service}/*.j2` | Jinja2 templates |

### Variables

#### Group Variables (`group_vars/all/main.yml`)

| Variable | Description | Default |
|----------|-------------|---------|
| `{service}_version` | Version to deploy | `latest` |

#### Host Variables (`host_vars/{hostname}.yml`)

| Variable | Description |
|----------|-------------|
| `{service}_port` | Service port |
| `{service}_data_dir` | Data directory |

#### Vault Variables (`group_vars/all/vault.yml`)

| Variable | Description |
|----------|-------------|
| `vault_{service}_password` | Service password |
| `vault_{service}_api_key` | API key (if applicable) |

## Configuration

### Environment Variables

| Variable | Description | Source |
|----------|-------------|--------|
| `{VAR_NAME}` | Description | `{{ vault_{service}_var }}` |

### Configuration Files

| File | Location | Template |
|------|----------|----------|
| `config.yml` | `/etc/{service}/` | `{service}/config.yml.j2` |

## Monitoring

### Prometheus Metrics

| Metric | Description |
|--------|-------------|
| `{service}_requests_total` | Total requests |
| `{service}_errors_total` | Total errors |

**Scrape Target:** Configured in `ansible/prometheus/` or via Alloy.

### Loki Logs

| Log Source | Labels |
|------------|--------|
| Application log | `{job="{service}", host="{hostname}"}` |
| Access log | `{job="{service}_access", host="{hostname}"}` |

**Collection:** Alloy agent on host ships logs to Loki on Prospero.

### Grafana Dashboard

Dashboard provisioned at: `ansible/grafana/dashboards/{service}.json`

## Operations

### Start/Stop

```bash
# Via systemd (if applicable)
sudo systemctl start {service}
sudo systemctl stop {service}

# Via Docker (if applicable)
docker compose -f /opt/{service}/docker-compose.yml up -d
docker compose -f /opt/{service}/docker-compose.yml down
```

### Health Check

```bash
curl http://{hostname}.incus:{port}/health
```

### Logs

```bash
# Systemd
journalctl -u {service} -f

# Docker
docker logs -f {container_name}

# Loki (via Grafana Explore)
{job="{service}"}
```

### Backup

Describe backup procedures, scripts, and schedules.

### Restore

Describe restore procedures and verification steps.

## Troubleshooting

### Common Issues

| Symptom | Cause | Resolution |
|---------|-------|------------|
| Service won't start | Missing config | Check `{config_file}` exists |
| Connection refused | Firewall/proxy | Verify Incus proxy device |

### Debug Mode

```bash
# Enable debug logging
{service} --debug
```

## References

- Official Documentation: {url}
- [Terraform Practices](../terraform.md)
- [Ansible Practices](../ansible.md)
- [Sandbox Overview](../sandbox.html)
705 docs/ansible.md Normal file
@@ -0,0 +1,705 @@
# Ansible Project Structure - Best Practices

This document describes the clean, maintainable Ansible structure implemented in the Agathos project. Use this as a reference template for other Ansible projects.

## Overview

This structure emphasizes:
- **Simplicity**: Minimal files at root level
- **Organization**: Services contain all related files (playbooks + templates)
- **Separation**: Variables live in dedicated files, not inline in inventory
- **Discoverability**: Clear naming and logical grouping

## Directory Structure

```
ansible/
├── ansible.cfg              # Ansible configuration
├── .vault_pass              # Vault password file
│
├── site.yml                 # Master orchestration playbook
├── apt_update.yml           # Utility: Update all hosts
├── sandbox_up.yml           # Utility: Start infrastructure
├── sandbox_down.yml         # Utility: Stop infrastructure
│
├── inventory/               # Inventory organization
│   ├── hosts                # Simple host/group membership
│   │
│   ├── group_vars/          # Variables for groups
│   │   └── all/
│   │       ├── vars.yml     # Common variables
│   │       └── vault.yml    # Encrypted secrets
│   │
│   └── host_vars/           # Variables per host
│       ├── hostname1.yml    # All vars for hostname1
│       ├── hostname2.yml    # All vars for hostname2
│       └── ...
│
└── service_name/            # Per-service directories
    ├── deploy.yml           # Main deployment playbook
    ├── stage.yml            # Staging playbook (if needed)
    ├── template1.j2         # Jinja2 templates
    ├── template2.j2
    └── files/               # Static files (if needed)
```
## Key Components

### 1. Simplified Inventory (`inventory/hosts`)

**Purpose**: Define ONLY host/group membership, no variables

**Example**:
```yaml
---
# Ansible Inventory - Simplified

# Main infrastructure group
ubuntu:
  hosts:
    server1.example.com:
    server2.example.com:
    server3.example.com:

# Service-specific groups
web_servers:
  hosts:
    server1.example.com:

database_servers:
  hosts:
    server2.example.com:
```

**Before**: 361 lines with variables inline
**After**: 34 lines of pure structure
### 2. Host Variables (`inventory/host_vars/`)

**Purpose**: All configuration specific to a single host

**File naming**: `{hostname}.yml` (matches inventory hostname exactly)

**Example** (`inventory/host_vars/server1.example.com.yml`):
```yaml
---
# Server1 Configuration - Web Server
# Services: nginx, php-fpm, redis

services:
  - nginx
  - php
  - redis

# Nginx Configuration
nginx_user: www-data
nginx_worker_processes: auto
nginx_port: 80
nginx_ssl_port: 443

# PHP-FPM Configuration
php_version: 8.2
php_max_children: 50

# Redis Configuration
redis_port: 6379
redis_password: "{{ vault_redis_password }}"
```
### 3. Group Variables (`inventory/group_vars/`)

**Purpose**: Variables shared across multiple hosts

**Structure**:
```
group_vars/
├── all/                 # Variables for ALL hosts
│   ├── vars.yml         # Common non-sensitive config
│   └── vault.yml        # Encrypted secrets (ansible-vault)
│
└── web_servers/         # Variables for web_servers group
    └── vars.yml
```

**Example** (`inventory/group_vars/all/vars.yml`):
```yaml
---
# Common Variables for All Hosts

remote_user: ansible
deployment_environment: production
ansible_python_interpreter: /usr/bin/python3

# Release versions
app_release: v1.2.3
api_release: v2.0.1

# Monitoring endpoints
prometheus_url: http://monitoring.example.com:9090
loki_url: http://monitoring.example.com:3100
```
### 4. Service Directories

**Purpose**: Group all files related to a service deployment

**Pattern**: `{service_name}/`

**Contents**:
- `deploy.yml` - Main deployment playbook
- `stage.yml` - Staging/update playbook (optional)
- `*.j2` - Jinja2 templates
- `files/` - Static files (if needed)
- `tasks/` - Task files (if splitting large playbooks)

**Example Structure**:
```
nginx/
├── deploy.yml           # Deployment playbook
├── nginx.conf.j2        # Main config template
├── site.conf.j2         # Virtual host template
├── nginx.service.j2     # Systemd service file
└── files/
    └── ssl_params.conf  # Static SSL configuration
```
### 5. Master Playbook (`site.yml`)

**Purpose**: Orchestrate full-stack deployment

**Pattern**: Import service playbooks in dependency order

**Example**:
```yaml
---
- name: Update All Hosts
  import_playbook: apt_update.yml

- name: Deploy Docker
  import_playbook: docker/deploy.yml

- name: Deploy PostgreSQL
  import_playbook: postgresql/deploy.yml

- name: Deploy Application
  import_playbook: myapp/deploy.yml

- name: Deploy Monitoring
  import_playbook: prometheus/deploy.yml
```
### 6. Service Playbook Pattern

**Location**: `{service}/deploy.yml`

**Standard Structure**:
```yaml
---
- name: Deploy Service Name
  hosts: target_group
  tasks:

    # Service detection (if using services list)
    - name: Check if host has service_name service
      ansible.builtin.set_fact:
        has_service: "{{ 'service_name' in services | default([]) }}"

    - name: Skip hosts without service
      ansible.builtin.meta: end_host
      when: not has_service

    # Actual deployment tasks
    - name: Create service user
      become: true
      ansible.builtin.user:
        name: "{{ service_user }}"
        group: "{{ service_group }}"
        system: true

    - name: Template configuration
      become: true
      ansible.builtin.template:
        src: config.j2
        dest: "{{ service_directory }}/config.yml"
      notify: restart service

  # Handlers
  handlers:
    - name: restart service
      become: true
      ansible.builtin.systemd:
        name: service_name
        state: restarted
        daemon_reload: true
```
**IMPORTANT: Template Path Convention**
- When playbooks are inside service directories, template `src:` paths are relative to that directory
- Use `src: config.j2` NOT `src: service_name/config.j2`
- The service directory prefix was correct when playbooks lived at the ansible root, but is wrong now that they live inside service directories

**Host-Specific Templates**

Some services need different configuration per host. Store these in subdirectories named by hostname:

```
service_name/
├── deploy.yml
├── config.j2            # Default template
├── hostname1/           # Host-specific overrides
│   └── config.j2
├── hostname2/
│   └── config.j2
└── hostname3/
    └── config.j2
```

Use conditional logic to select the correct template:

```yaml
- name: Check for host-specific configuration
  ansible.builtin.stat:
    path: "{{ playbook_dir }}/{{ inventory_hostname_short }}/config.j2"
  delegate_to: localhost
  register: host_specific_config
  become: false

- name: Template host-specific configuration
  become: true
  ansible.builtin.template:
    src: "{{ playbook_dir }}/{{ inventory_hostname_short }}/config.j2"
    dest: "{{ service_directory }}/config"
  when: host_specific_config.stat.exists

- name: Template default configuration
  become: true
  ansible.builtin.template:
    src: config.j2
    dest: "{{ service_directory }}/config"
  when: not host_specific_config.stat.exists
```

**Real Example: Alloy Service**
```
alloy/
├── deploy.yml
├── config.alloy.j2      # Default configuration
├── ariel/               # Neo4j monitoring
│   └── config.alloy.j2
├── miranda/             # Docker monitoring
│   └── config.alloy.j2
├── oberon/              # Web services monitoring
│   └── config.alloy.j2
└── puck/                # Application monitoring
    └── config.alloy.j2
```
## Service Detection Pattern

**Purpose**: Allow hosts to selectively run service playbooks

**How it works**:
1. Each host defines a `services:` list in `host_vars/`
2. Each playbook checks if its service is in the list
3. The playbook skips the host if the service is not needed

**Example**:

`inventory/host_vars/server1.yml`:
```yaml
services:
  - docker
  - nginx
  - redis
```

`nginx/deploy.yml`:
```yaml
- name: Deploy Nginx
  hosts: ubuntu
  tasks:
    - name: Check if host has nginx service
      ansible.builtin.set_fact:
        has_nginx: "{{ 'nginx' in services | default([]) }}"

    - name: Skip hosts without nginx
      ansible.builtin.meta: end_host
      when: not has_nginx

    # Rest of tasks only run if nginx is in the services list
```
## Ansible Vault Integration

**Setup**:
```bash
# Create vault password file (one-time)
echo "your_vault_password" > .vault_pass
chmod 600 .vault_pass

# Configure ansible.cfg
echo "vault_password_file = .vault_pass" >> ansible.cfg
```

**Usage**:
```bash
# Edit vault file
ansible-vault edit inventory/group_vars/all/vault.yml

# View vault file
ansible-vault view inventory/group_vars/all/vault.yml

# Encrypt new file
ansible-vault encrypt secrets.yml
```

**Variable naming convention**:
- Prefix vault variables with `vault_`
- Reference in regular vars: `db_password: "{{ vault_db_password }}"`
## Running Playbooks

**Full deployment**:
```bash
ansible-playbook site.yml
```

**Single service**:
```bash
ansible-playbook nginx/deploy.yml
```

**Specific hosts**:
```bash
ansible-playbook nginx/deploy.yml --limit server1.example.com
```

**Check mode (dry run)**:
```bash
ansible-playbook site.yml --check
```

**With extra verbosity**:
```bash
ansible-playbook nginx/deploy.yml -vv
```
## Benefits of This Structure

### 1. Cleaner Root Directory
- **Before**: 29+ playbook files cluttering the root
- **After**: 3-4 utility playbooks + site.yml

### 2. Simplified Inventory
- **Before**: 361 lines with inline variables
- **After**: 34 lines of pure structure
- Variables organized logically by host/group

### 3. Service Cohesion
- Everything related to a service in one place
- Easy to find templates when editing playbooks
- Natural grouping for git operations

### 4. Scalability
- Easy to add new services (create directory, add playbook)
- Easy to add new hosts (create host_vars file)
- No risk of playbook name conflicts

### 5. Reusability
- Service directories can be copied to other projects
- The host_vars pattern works for any inventory size
- Clear separation of concerns

### 6. Maintainability
- Changes isolated to service directories
- Inventory file rarely needs editing
- Clear audit trail in git (changes per service)
## Migration Checklist

Moving an existing Ansible project to this structure:

- [ ] Create service directories for each playbook
- [ ] Move `{service}_deploy.yml` → `{service}/deploy.yml`
- [ ] Move templates into service directories
- [ ] Extract host variables from inventory to `host_vars/`
- [ ] Extract group variables to `group_vars/all/vars.yml`
- [ ] Move secrets to `group_vars/all/vault.yml` (encrypted)
- [ ] Update `site.yml` import_playbook paths
- [ ] Back up the original inventory: `cp hosts hosts.backup`
- [ ] Create a simplified inventory with only group/host structure
- [ ] Test with `ansible-playbook site.yml --check`
- [ ] Verify with a limited deployment: `--limit test_host`
## Example: Adding a New Service

**1. Create service directory**:
```bash
mkdir ansible/myapp
```

**2. Create deployment playbook** (`ansible/myapp/deploy.yml`):
```yaml
---
- name: Deploy MyApp
  hosts: ubuntu
  tasks:
    - name: Check if host has myapp service
      ansible.builtin.set_fact:
        has_myapp: "{{ 'myapp' in services | default([]) }}"

    - name: Skip hosts without myapp
      ansible.builtin.meta: end_host
      when: not has_myapp

    - name: Deploy myapp
      # ... deployment tasks
```

**3. Create template** (`ansible/myapp/config.yml.j2`):
```yaml
app_name: MyApp
port: {{ myapp_port }}
database: {{ myapp_db_host }}
```

**4. Add variables to host** (`inventory/host_vars/server1.yml`):
```yaml
services:
  - myapp          # Add to services list

# MyApp configuration
myapp_port: 8080
myapp_db_host: db.example.com
```

**5. Add to site.yml**:
```yaml
- name: Deploy MyApp
  import_playbook: myapp/deploy.yml
```

**6. Deploy**:
```bash
ansible-playbook myapp/deploy.yml
```
## Best Practices

### Naming Conventions
- Service directories: lowercase, underscores (e.g., `mcp_switchboard/`)
- Playbooks: `deploy.yml`, `stage.yml`, `remove.yml`
- Templates: descriptive name + `.j2` extension
- Variables: service prefix (e.g., `nginx_port`, `redis_password`)
- Vault variables: `vault_` prefix

### File Organization
- Keep playbooks under 100 lines (split into task files if larger)
- Group related templates in the service directory
- Use comments to document non-obvious variables
- Add a README.md to complex service directories
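Splitting a playbook that has grown past the 100-line guideline might look like this; `tasks/install.yml` and `tasks/configure.yml` are hypothetical file names:

```yaml
# deploy.yml kept short by delegating groups of tasks to task files
- name: Deploy Service
  hosts: target_group
  tasks:
    - name: Install packages
      ansible.builtin.include_tasks: tasks/install.yml

    - name: Configure service
      ansible.builtin.include_tasks: tasks/configure.yml
```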
### Variable Organization
- Host-specific: `host_vars/{hostname}.yml`
- Service-specific across hosts: `group_vars/{service_group}/vars.yml`
- Global configuration: `group_vars/all/vars.yml`
- Secrets: `group_vars/all/vault.yml` (encrypted)

### Idempotency
- Use the `creates:` parameter for one-time operations
- Use `state:` explicitly (present/absent/restarted)
- Check conditions before destructive operations
- Test with `--check` mode before applying
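A minimal sketch of the first two idempotency bullets, with hypothetical service and path names:

```yaml
# Runs the init script only once: skipped when the marker file already exists.
- name: Initialize database
  become: true
  ansible.builtin.command: /usr/local/bin/init_db.sh
  args:
    creates: /var/lib/myapp/.initialized

# An explicit state makes the desired outcome unambiguous and repeatable.
- name: Ensure service is running
  become: true
  ansible.builtin.systemd:
    name: myapp
    state: started
    enabled: true
```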
### Documentation
- Comment complex task logic
- Document required variables in the playbook header
- Add a README.md for service directories with many files
- Keep docs/ separate from the ansible/ directory

## Related Documentation

- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/tips_tricks/ansible_tips_tricks.html)
- [Ansible Vault Guide](https://docs.ansible.com/ansible/latest/vault_guide/index.html)
- [Inventory Organization](https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html)
## Secret Management Patterns

### Ansible Vault (Sandbox Environment)

**Purpose**: Store sensitive values encrypted at rest in version control

**File Location**: `inventory/group_vars/all/vault.yml`

**Variable Naming Convention**: Prefix all vault variables with `vault_`

**Example vault.yml** (note that the entire vault file is encrypted on disk; this shows its decrypted contents):
```yaml
---
# Database passwords
vault_postgres_admin_password:   # Avoid special characters & non-ASCII
vault_casdoor_db_password:

# S3 credentials
vault_casdoor_s3_access_key:
vault_casdoor_s3_secret_key:
vault_casdoor_s3_bucket:
```

**Host Variables Reference Vault**:
```yaml
# In host_vars/oberon.incus.yml
casdoor_db_password: "{{ vault_casdoor_db_password }}"
casdoor_s3_access_key: "{{ vault_casdoor_s3_access_key }}"
casdoor_s3_secret_key: "{{ vault_casdoor_s3_secret_key }}"
casdoor_s3_bucket: "{{ vault_casdoor_s3_bucket }}"

# Non-sensitive values stay as plain variables
casdoor_s3_endpoint: "https://ariel.incus:9000"
casdoor_s3_region: "us-east-1"
```

**Prerequisites**:
- Set the `ANSIBLE_VAULT_PASSWORD_FILE` environment variable
- Create a `.vault_pass` file with the vault password
- Add `.vault_pass` to `.gitignore`

**Encrypting New Values**:
```bash
# Encrypt a string and add it to vault.yml
echo -n "secret_value" | ansible-vault encrypt_string --stdin-name 'vault_variable_name'

# Edit the vault file directly
ansible-vault edit inventory/group_vars/all/vault.yml
```
### OCI Vault (Production Environment)

**Purpose**: Use Oracle Cloud Infrastructure Vault for centralized secret management

**Variable Pattern**: Use Ansible lookups to fetch secrets at runtime

**Example host_vars for OCI**:
```yaml
# In host_vars/production-server.yml

# Database passwords from OCI Vault
casdoor_db_password: "{{ lookup('community.oci.oci_secret', 'casdoor-db-password', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"

# S3 credentials from OCI Vault
casdoor_s3_access_key: "{{ lookup('community.oci.oci_secret', 'casdoor-s3-access-key', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"
casdoor_s3_secret_key: "{{ lookup('community.oci.oci_secret', 'casdoor-s3-secret-key', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"
casdoor_s3_bucket: "{{ lookup('community.oci.oci_secret', 'casdoor-s3-bucket', compartment_id=oci_compartment_id, vault_id=oci_services_vault_id) }}"

# Non-sensitive values remain as plain variables
casdoor_s3_endpoint: "https://objectstorage.us-phoenix-1.oraclecloud.com"
casdoor_s3_region: "us-phoenix-1"
```

**OCI Vault Organization**:
```
OCI Compartment: production
├── Vault: agathos-databases
│   ├── Secret: postgres-admin-password
│   └── Secret: casdoor-db-password
│
├── Vault: agathos-services
│   ├── Secret: casdoor-s3-access-key
│   ├── Secret: casdoor-s3-secret-key
│   ├── Secret: casdoor-s3-bucket
│   └── Secret: openwebui-db-password
│
└── Vault: agathos-integrations
    ├── Secret: apikey-openai
    └── Secret: apikey-anthropic
```

**Secret Naming Convention**:
- Ansible Vault: `vault_service_secret` (underscores)
- OCI Vault: `service-secret` (hyphens)
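The naming convention above can be expressed as a small shell helper (hypothetical; not part of the project tooling):

```shell
# Map an Ansible Vault variable name to its OCI Vault secret name:
# the vault_ prefix is dropped and underscores become hyphens.
to_oci_name() {
  local v=${1#vault_}   # strip the vault_ prefix
  echo "${v//_/-}"      # replace every underscore with a hyphen
}

to_oci_name vault_casdoor_db_password   # -> casdoor-db-password
```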
**Benefits of Two-Tier Pattern**:
1. **Portability**: Service playbooks remain unchanged across environments
2. **Flexibility**: Switch secret backends by changing only host_vars
3. **Clarity**: Variable names clearly indicate their purpose
4. **Security**: Secrets never appear in playbooks or templates

### S3 Bucket Provisioning with Ansible

**Purpose**: Provision Incus S3 buckets and manage credentials in Ansible Vault

**Playbooks**:
- `provision_s3.yml` - Create bucket and store credentials
- `regenerate_s3_key.yml` - Rotate credentials
- `remove_s3.yml` - Delete bucket and clean vault

**Usage**:
```bash
# Provision new S3 bucket for a service
ansible-playbook provision_s3.yml -e bucket_name=casdoor -e service_name=casdoor

# Regenerate access credentials (invalidates old keys)
ansible-playbook regenerate_s3_key.yml -e bucket_name=casdoor -e service_name=casdoor

# Remove bucket and credentials
ansible-playbook remove_s3.yml -e bucket_name=casdoor -e service_name=casdoor
```

**Requirements**:
- User must be a member of the `incus` group
- `ANSIBLE_VAULT_PASSWORD_FILE` must be set
- The Incus CLI must be configured and accessible

**What Gets Created**:
1. Incus storage bucket in project `agathos`, pool `default`
2. Admin access key for the bucket
3. Encrypted vault entries: `vault_<service>_s3_access_key`, `vault_<service>_s3_secret_key`, `vault_<service>_s3_bucket`

**Behind the Scenes**:
- Role: `incus_storage_bucket`
- Idempotent: Checks if bucket/key exists before creating
- Atomic: Credentials captured and encrypted in a single operation
- Variables sourced from: `inventory/group_vars/all/vars.yml`
## Troubleshooting

### Template Not Found Errors

**Symptom**: `Could not find or access 'service_name/template.j2'`

**Cause**: When playbooks were moved from the ansible root into service directories, template paths weren't updated.

**Solution**: Remove the service directory prefix from template paths:
```yaml
# WRONG (old path from when the playbook was at the root)
src: service_name/config.j2

# CORRECT (the playbook is now in the service_name/ directory)
src: config.j2
```

### Host-Specific Template Path Issues

**Symptom**: Playbook fails to find host-specific templates

**Cause**: Host-specific directories are at the wrong level

**Expected Structure**:
```
service_name/
├── deploy.yml
├── config.j2            # Default
└── hostname/            # Host-specific (inside the service dir)
    └── config.j2
```

**Use `{{ playbook_dir }}` for relative paths**:
```yaml
# This finds templates relative to the playbook location
src: "{{ playbook_dir }}/{{ inventory_hostname_short }}/config.j2"
```

---

**Last Updated**: December 2025
**Project**: Agathos Infrastructure
**Approval**: Red Panda Approved™
334 docs/anythingllm.md Normal file
@@ -0,0 +1,334 @@
# AnythingLLM

## Overview

AnythingLLM is a full-stack application that provides a unified interface for interacting with Large Language Models (LLMs). It supports multi-provider LLM access, document intelligence (RAG with pgvector), AI agents with tools, and Model Context Protocol (MCP) extensions.

**Host:** Rosalind
**Role:** go_nodejs_php_apps
**Port:** 22084 (internal), accessible via `anythingllm.ouranos.helu.ca` (HAProxy)

## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Client      │────▶│     HAProxy     │────▶│   AnythingLLM   │
│  (Browser/API)  │     │    (Titania)    │     │   (Rosalind)    │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                 ┌───────────────────────────────────────┼───────────────────────────────────────┐
                 │                                       │                                       │
                 ▼                                       ▼                                       ▼
        ┌─────────────────┐                     ┌─────────────────┐                     ┌─────────────────┐
        │   PostgreSQL    │                     │   LLM Backend   │                     │   TTS Service   │
        │   + pgvector    │                     │  (pan.helu.ca)  │                     │   (FastKokoro)  │
        │    (Portia)     │                     │    llama-cpp    │                     │   pan.helu.ca   │
        └─────────────────┘                     └─────────────────┘                     └─────────────────┘
```

### Directory Structure

AnythingLLM uses a native Node.js deployment with the following directory layout:

```
/srv/anythingllm/
├── app/                         # Cloned git repository
│   ├── server/                  # Backend API server
│   │   ├── .env                 # Environment configuration
│   │   └── node_modules/
│   ├── collector/               # Document processing service
│   │   ├── hotdir -> ../hotdir  # SYMLINK (critical!)
│   │   └── node_modules/
│   └── frontend/                # React frontend (built into server)
├── storage/                     # Persistent data
│   ├── documents/               # Processed documents
│   ├── vector-cache/            # Embedding cache
│   └── plugins/                 # MCP server configs
└── hotdir/                      # Upload staging directory (actual location)

/srv/collector/
└── hotdir -> /srv/anythingllm/hotdir  # SYMLINK (critical!)
```
### Hotdir Path Resolution (Critical)
|
||||
|
||||
The server and collector use **different path resolution** for the upload directory:
|
||||
|
||||
| Component | Code Location | Resolves To |
|
||||
|-----------|--------------|-------------|
|
||||
| **Server** (multer) | `STORAGE_DIR/../../collector/hotdir` | `/srv/collector/hotdir` |
|
||||
| **Collector** | `__dirname/../hotdir` | `/srv/anythingllm/app/collector/hotdir` |
|
||||
|
||||
Both paths must point to the same physical directory. This is achieved with **two symlinks**:
|
||||
|
||||
1. `/srv/collector/hotdir` → `/srv/anythingllm/hotdir`
|
||||
2. `/srv/anythingllm/app/collector/hotdir` → `/srv/anythingllm/hotdir`
|
||||
|
||||
⚠️ **Important**: The collector ships with an empty `hotdir/` directory. The Ansible deploy must **remove** this directory before creating the symlink, or file uploads will fail with "File does not exist in upload directory."
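The deploy playbook handles this remove-then-link step; a sketch of the equivalent Ansible tasks (task names illustrative, paths as documented above):

```yaml
- name: Remove the stock hotdir shipped with the collector
  ansible.builtin.file:
    path: /srv/anythingllm/app/collector/hotdir
    state: absent

- name: Symlink the collector hotdir to the shared upload directory
  ansible.builtin.file:
    src: /srv/anythingllm/hotdir
    dest: /srv/anythingllm/app/collector/hotdir
    state: link
    owner: anythingllm
    group: anythingllm
```

`state: absent` removes the shipped directory recursively, which is required because `state: link` will not replace a non-empty directory.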
|
||||
|
||||
### Key Integrations
|
||||
|
||||
| Component | Host | Purpose |
|
||||
|-----------|------|---------|
|
||||
| PostgreSQL + pgvector | Portia | Vector database for RAG embeddings |
|
||||
| LLM Provider | pan.helu.ca:22071 | Generic OpenAI-compatible llama-cpp |
|
||||
| TTS Service | pan.helu.ca:22070 | FastKokoro text-to-speech |
|
||||
| HAProxy | Titania | TLS termination and routing |
|
||||
| Loki | Prospero | Log aggregation |
|
||||
|
||||
## Terraform Resources
|
||||
|
||||
### Host Definition
|
||||
|
||||
AnythingLLM runs on **Rosalind**, which is already defined in `terraform/containers.tf`:
|
||||
|
||||
| Attribute | Value |
|
||||
|-----------|-------|
|
||||
| Image | noble |
|
||||
| Role | go_nodejs_php_apps |
|
||||
| Security Nesting | true |
|
||||
| AppArmor | unconfined |
|
||||
| Port Range | 22080-22099 |
|
||||
|
||||
No Terraform changes required—AnythingLLM uses port 22084 within Rosalind's existing range.
|
||||
|
||||
## Ansible Deployment
|
||||
|
||||
### Playbook
|
||||
|
||||
```bash
|
||||
cd ansible
|
||||
source ~/env/agathos/bin/activate
|
||||
|
||||
# Deploy PostgreSQL database first (if not already done)
|
||||
ansible-playbook postgresql/deploy.yml
|
||||
|
||||
# Deploy AnythingLLM
|
||||
ansible-playbook anythingllm/deploy.yml
|
||||
|
||||
# Redeploy HAProxy to pick up new backend
|
||||
ansible-playbook haproxy/deploy.yml
|
||||
|
||||
# Redeploy Alloy to pick up new log source
|
||||
ansible-playbook alloy/deploy.yml
|
||||
```
|
||||
|
||||
### Files
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `anythingllm/deploy.yml` | Main deployment playbook |
|
||||
| `anythingllm/anythingllm-server.service.j2` | Systemd service for server |
|
||||
| `anythingllm/anythingllm-collector.service.j2` | Systemd service for collector |
|
||||
| `anythingllm/env.j2` | Environment variables template |
|
||||
|
||||
### Variables
|
||||
|
||||
#### Host Variables (`host_vars/rosalind.incus.yml`)
|
||||
|
||||
| Variable | Description | Default |
|
||||
|----------|-------------|---------|
|
||||
| `anythingllm_user` | Service account user | `anythingllm` |
|
||||
| `anythingllm_group` | Service account group | `anythingllm` |
|
||||
| `anythingllm_directory` | Installation directory | `/srv/anythingllm` |
|
||||
| `anythingllm_port` | Service port | `22084` |
|
||||
| `anythingllm_db_host` | PostgreSQL host | `portia.incus` |
|
||||
| `anythingllm_db_port` | PostgreSQL port | `5432` |
|
||||
| `anythingllm_db_name` | Database name | `anythingllm` |
|
||||
| `anythingllm_db_user` | Database user | `anythingllm` |
|
||||
| `anythingllm_llm_base_url` | LLM API endpoint | `http://pan.helu.ca:22071/v1` |
|
||||
| `anythingllm_llm_model` | Default LLM model | `llama-3-8b` |
|
||||
| `anythingllm_embedding_engine` | Embedding engine | `native` |
|
||||
| `anythingllm_tts_provider` | TTS provider | `openai` |
|
||||
| `anythingllm_tts_endpoint` | TTS API endpoint | `http://pan.helu.ca:22070/v1` |
|
||||
|
||||
#### Vault Variables (`group_vars/all/vault.yml`)
|
||||
|
||||
| Variable | Description |
|
||||
|----------|-------------|
|
||||
| `vault_anythingllm_db_password` | PostgreSQL password |
|
||||
| `vault_anythingllm_jwt_secret` | JWT signing secret (32+ chars) |
|
||||
| `vault_anythingllm_sig_key` | Signature key (32+ chars) |
|
||||
| `vault_anythingllm_sig_salt` | Signature salt (32+ chars) |
|
||||
|
||||
Generate secrets with:
|
||||
```bash
|
||||
openssl rand -hex 32
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
### Environment Variables
|
||||
|
||||
| Variable | Description | Source |
|
||||
|----------|-------------|--------|
|
||||
| `JWT_SECRET` | JWT signing secret | `vault_anythingllm_jwt_secret` |
|
||||
| `SIG_KEY` | Signature key | `vault_anythingllm_sig_key` |
|
||||
| `SIG_SALT` | Signature salt | `vault_anythingllm_sig_salt` |
|
||||
| `VECTOR_DB` | Vector database type | `pgvector` |
|
||||
| `PGVECTOR_CONNECTION_STRING` | PostgreSQL connection | Composed from host_vars |
|
||||
| `LLM_PROVIDER` | LLM provider type | `generic-openai` |
|
||||
| `EMBEDDING_ENGINE` | Embedding engine | `native` |
|
||||
| `TTS_PROVIDER` | TTS provider | `openai` |
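The connection string might be composed in `env.j2` roughly as follows (a sketch; the variable names come from the host-vars table above, but the exact template contents are an assumption):

```jinja2
PGVECTOR_CONNECTION_STRING=postgresql://{{ anythingllm_db_user }}:{{ vault_anythingllm_db_password }}@{{ anythingllm_db_host }}:{{ anythingllm_db_port }}/{{ anythingllm_db_name }}
```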
|
||||
|
||||
### External Access
|
||||
|
||||
AnythingLLM is accessible via HAProxy on Titania:
|
||||
|
||||
| URL | Backend |
|
||||
|-----|---------|
|
||||
| `https://anythingllm.ouranos.helu.ca` | `rosalind.incus:22084` |
|
||||
|
||||
The HAProxy backend is configured in `host_vars/titania.incus.yml`.
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Loki Logs
|
||||
|
||||
| Log Source | Labels |
|
||||
|------------|--------|
|
||||
| Server logs | `{unit="anythingllm-server.service"}` |
|
||||
| Collector logs | `{unit="anythingllm-collector.service"}` |
|
||||
|
||||
Logs are collected via systemd journal → Alloy on Rosalind → Loki on Prospero.
|
||||
|
||||
**Grafana Query:**
|
||||
```logql
|
||||
{unit=~"anythingllm.*"}
|
||||
```
|
||||
|
||||
### Health Check
|
||||
|
||||
```bash
|
||||
# From any sandbox host
|
||||
curl http://rosalind.incus:22084/api/ping
|
||||
|
||||
# Via HAProxy (external)
|
||||
curl -k https://anythingllm.ouranos.helu.ca/api/ping
|
||||
```
|
||||
|
||||
## Operations
|
||||
|
||||
### Start/Stop
|
||||
|
||||
```bash
|
||||
# SSH to Rosalind
|
||||
ssh rosalind.incus
|
||||
|
||||
# Manage via systemd
|
||||
sudo systemctl start anythingllm-server # Start server
|
||||
sudo systemctl start anythingllm-collector # Start collector
|
||||
sudo systemctl stop anythingllm-server # Stop server
|
||||
sudo systemctl stop anythingllm-collector # Stop collector
|
||||
sudo systemctl restart anythingllm-server # Restart server
|
||||
sudo systemctl restart anythingllm-collector # Restart collector
|
||||
```
|
||||
|
||||
### Logs
|
||||
|
||||
```bash
|
||||
# Real-time server logs
|
||||
journalctl -u anythingllm-server -f
|
||||
|
||||
# Real-time collector logs
|
||||
journalctl -u anythingllm-collector -f
|
||||
|
||||
# Grafana (historical)
|
||||
# Query: {unit=~"anythingllm.*"}
|
||||
```
|
||||
|
||||
### Upgrade
|
||||
|
||||
Pull latest code and redeploy:
|
||||
|
||||
```bash
|
||||
ansible-playbook anythingllm/deploy.yml
|
||||
```
|
||||
|
||||
## Vault Setup
|
||||
|
||||
Add the following secrets to `ansible/inventory/group_vars/all/vault.yml`:
|
||||
|
||||
```bash
|
||||
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
|
||||
```
|
||||
|
||||
```yaml
|
||||
# AnythingLLM Secrets
|
||||
vault_anythingllm_db_password: "your-secure-password"
|
||||
vault_anythingllm_jwt_secret: "your-32-char-jwt-secret"
|
||||
vault_anythingllm_sig_key: "your-32-char-signature-key"
|
||||
vault_anythingllm_sig_salt: "your-32-char-signature-salt"
|
||||
```
|
||||
|
||||
## Follow-On Tasks
|
||||
|
||||
### MCP Server Integration
|
||||
|
||||
AnythingLLM supports Model Context Protocol (MCP) for extending AI agent capabilities. Future integration with existing MCP servers:
|
||||
|
||||
| MCP Server | Host | Tools |
|
||||
|------------|------|-------|
|
||||
| MCPO | Miranda | Docker management |
|
||||
| Neo4j MCP | Miranda | Graph database queries |
|
||||
| GitHub MCP | (external) | Repository operations |
|
||||
|
||||
Configure MCP connections via AnythingLLM Admin UI after initial deployment.
|
||||
|
||||
### Casdoor SSO
|
||||
|
||||
For single sign-on integration, configure AnythingLLM to authenticate via Casdoor OAuth2. This requires:
|
||||
1. Creating an application in Casdoor admin
|
||||
2. Configuring OAuth2 environment variables in AnythingLLM
|
||||
3. Optionally using OAuth2-Proxy for transparent authentication
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### File Upload Fails with "File does not exist in upload directory"
|
||||
|
||||
**Symptom:** Uploading files via the UI returns 500 Internal Server Error with message "File does not exist in upload directory."
|
||||
|
||||
**Cause:** The server uploads files to `/srv/collector/hotdir`, but the collector looks for them in `/srv/anythingllm/app/collector/hotdir`. If these aren't the same physical directory, uploads fail.
|
||||
|
||||
**Solution:** Verify symlinks are correctly configured:
|
||||
|
||||
```bash
|
||||
# Check symlinks
|
||||
ls -la /srv/collector/hotdir
|
||||
# Should show: /srv/collector/hotdir -> /srv/anythingllm/hotdir
|
||||
|
||||
ls -la /srv/anythingllm/app/collector/hotdir
|
||||
# Should show: /srv/anythingllm/app/collector/hotdir -> /srv/anythingllm/hotdir
|
||||
|
||||
# If collector/hotdir is a directory (not symlink), fix it:
|
||||
sudo rm -rf /srv/anythingllm/app/collector/hotdir
|
||||
sudo ln -s /srv/anythingllm/hotdir /srv/anythingllm/app/collector/hotdir
|
||||
sudo chown -h anythingllm:anythingllm /srv/anythingllm/app/collector/hotdir
|
||||
sudo systemctl restart anythingllm-collector
|
||||
```
|
||||
|
||||
### Service Won't Start

AnythingLLM runs as native systemd services (not Docker). Check service status and journal output:
```bash
sudo systemctl status anythingllm-server anythingllm-collector
journalctl -u anythingllm-server -n 100 --no-pager
```
|
||||
|
||||
Verify PostgreSQL connectivity:
|
||||
```bash
|
||||
psql -h portia.incus -U anythingllm -d anythingllm
|
||||
```
|
||||
|
||||
### Database Connection Issues
|
||||
|
||||
Ensure pgvector extension is enabled:
|
||||
```bash
|
||||
psql -h portia.incus -U postgres -d anythingllm -c "SELECT * FROM pg_extension WHERE extname = 'vector';"
|
||||
```
|
||||
|
||||
### LLM Provider Issues
|
||||
|
||||
Test LLM endpoint directly:
|
||||
```bash
|
||||
curl http://pan.helu.ca:22071/v1/models
|
||||
```
|
||||
207
docs/anythingllm_mcp.md
Normal file
@@ -0,0 +1,207 @@
|
||||
# AnythingLLM MCP Server Configuration
|
||||
|
||||
## Overview
|
||||
|
||||
AnythingLLM supports [Model Context Protocol (MCP)](https://modelcontextprotocol.io) servers, allowing AI agents to call tools provided by local processes or remote services. MCP servers are managed by the internal `MCPHypervisor` singleton and configured via a single JSON file.
|
||||
|
||||
## Configuration File Location
|
||||
|
||||
| Environment | Path |
|
||||
|-------------|------|
|
||||
| Development | `server/storage/plugins/anythingllm_mcp_servers.json` |
|
||||
| Production / Docker | `$STORAGE_DIR/plugins/anythingllm_mcp_servers.json` |
|
||||
|
||||
The file and its parent directory are created automatically with an empty `{ "mcpServers": {} }` object if they do not already exist.
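The bootstrap behaviour can be pictured with a small sketch (illustrative only; the real logic lives inside `MCPHypervisor` and this helper name is hypothetical):

```python
import json
import os


def ensure_mcp_config(storage_dir: str) -> dict:
    """Create plugins/anythingllm_mcp_servers.json with an empty
    mcpServers object if it is missing, then return the parsed config."""
    path = os.path.join(storage_dir, "plugins", "anythingllm_mcp_servers.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        with open(path, "w") as fh:
            json.dump({"mcpServers": {}}, fh, indent=2)
    with open(path) as fh:
        return json.load(fh)
```

On first run this yields `{"mcpServers": {}}`; on later runs any servers you have added are returned untouched.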
|
||||
|
||||
## File Format
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"<server-name>": { ... },
|
||||
"<server-name-2>": { ... }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Each key inside `mcpServers` is the unique name used to identify the server within AnythingLLM. The value is the server definition, whose required fields depend on the transport type (see below).
|
||||
|
||||
---
|
||||
|
||||
## Transport Types
|
||||
|
||||
### `stdio` — Local Process
|
||||
|
||||
Spawns a local process and communicates over stdin/stdout. The transport type is inferred automatically when a `command` field is present.
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"filesystem": {
|
||||
"command": "npx",
|
||||
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/docs"],
|
||||
"env": {
|
||||
"SOME_VAR": "value"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
| Field | Required | Description |
|
||||
|-------|----------|-------------|
|
||||
| `command` | ✅ | Executable to run (e.g. `npx`, `node`, `python3`) |
|
||||
| `args` | ❌ | Array of arguments passed to the command |
|
||||
| `env` | ❌ | Extra environment variables merged into the process environment |
|
||||
|
||||
> **Note:** The process inherits PATH and NODE_PATH from the shell environment that started AnythingLLM. If a command such as `npx` is not found, ensure it is available in that shell's PATH.
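When AnythingLLM runs under systemd rather than an interactive shell, PATH can be set explicitly in the service unit. A sketch of the relevant excerpt (the path list is illustrative; adjust to wherever `node`/`npx` are installed):

```ini
# anythingllm-server.service (excerpt)
[Service]
Environment=PATH=/usr/local/bin:/usr/bin:/bin
```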
|
||||
|
||||
---
|
||||
|
||||
### `sse` — Server-Sent Events (legacy)
|
||||
|
||||
Connects to a remote MCP server using the legacy SSE transport. The type is inferred automatically when only a `url` field is present (no `command`), or when `"type": "sse"` is set explicitly.
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"my-sse-server": {
|
||||
"url": "https://example.com/mcp",
|
||||
"type": "sse",
|
||||
"headers": {
|
||||
"Authorization": "Bearer <token>"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### `streamable` / `http` — Streamable HTTP
|
||||
|
||||
Connects to a remote MCP server using the newer Streamable HTTP transport.
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"my-http-server": {
|
||||
"url": "https://example.com/mcp",
|
||||
"type": "streamable",
|
||||
"headers": {
|
||||
"Authorization": "Bearer <token>"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Both `"type": "streamable"` and `"type": "http"` select this transport.
|
||||
|
||||
| Field | Required | Description |
|
||||
|-------|----------|-------------|
|
||||
| `url` | ✅ | Full URL of the MCP endpoint |
|
||||
| `type` | ✅ | `"sse"`, `"streamable"`, or `"http"` |
|
||||
| `headers` | ❌ | HTTP headers sent with every request (useful for auth) |
|
||||
|
||||
---
|
||||
|
||||
## AnythingLLM-Specific Options
|
||||
|
||||
An optional `anythingllm` block inside any server definition can control AnythingLLM-specific behaviour:
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"my-server": {
|
||||
"command": "npx",
|
||||
"args": ["-y", "some-mcp-package"],
|
||||
"anythingllm": {
|
||||
"autoStart": false
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
| Option | Type | Default | Description |
|
||||
|--------|------|---------|-------------|
|
||||
| `autoStart` | boolean | `true` | When `false`, the server is skipped at startup and must be started manually from the Admin UI |
|
||||
|
||||
---
|
||||
|
||||
## Full Example
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"filesystem": {
|
||||
"command": "npx",
|
||||
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/documents"]
|
||||
},
|
||||
"github": {
|
||||
"command": "npx",
|
||||
"args": ["-y", "@modelcontextprotocol/server-github"],
|
||||
"env": {
|
||||
"GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_xxxxxxxxxxxx"
|
||||
}
|
||||
},
|
||||
"remote-tools": {
|
||||
"url": "https://mcp.example.com/mcp",
|
||||
"type": "streamable",
|
||||
"headers": {
|
||||
"Authorization": "Bearer my-secret-token"
|
||||
}
|
||||
},
|
||||
"optional-server": {
|
||||
"command": "node",
|
||||
"args": ["/opt/mcp/server.js"],
|
||||
"anythingllm": {
|
||||
"autoStart": false
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Managing Servers via the Admin UI
|
||||
|
||||
MCP servers can be managed without editing the JSON file directly:
|
||||
|
||||
1. Log in as an Admin.
|
||||
2. Go to **Admin → Agents → MCP Servers**.
|
||||
3. From this page you can:
|
||||
- View all configured servers and the tools each one exposes.
|
||||
- Start or stop individual servers.
|
||||
- Delete a server (removes it from the JSON file).
|
||||
- Force-reload all servers (stops all, re-reads the file, restarts them).
|
||||
|
||||
Any changes made through the UI are persisted back to `anythingllm_mcp_servers.json`.
|
||||
|
||||
---
|
||||
|
||||
## How Servers Are Started
|
||||
|
||||
- At startup, `MCPHypervisor` reads the config file and starts all servers whose `anythingllm.autoStart` is not `false`.
|
||||
- Each server has a **30-second connection timeout**. If a server fails to connect within that window it is marked as failed and its process is cleaned up.
|
||||
- Servers are exposed to agents via the `@agent` directive using the naming convention `@@mcp_<server-name>`.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Likely Cause | Fix |
|
||||
|---------|-------------|-----|
|
||||
| `ENOENT` / command not found | The executable is not in PATH | Use the full absolute path for `command`, or ensure the binary is accessible in the shell that starts AnythingLLM |
|
||||
| Connection timeout after 30 s | Server process started but did not respond | Check the server's own logs; verify arguments are correct |
|
||||
| Tools not visible to agent | Server failed to start | Check the status badge in **Admin → Agents → MCP Servers** for the error message |
|
||||
| Auth / 401 errors on remote servers | Missing or incorrect credentials | Verify `headers` or `env` values in the config |
|
||||
|
||||
---
|
||||
|
||||
## Further Reading
|
||||
|
||||
- [AnythingLLM MCP Compatibility Docs](https://docs.anythingllm.com/mcp-compatibility/overview)
|
||||
- [Model Context Protocol Specification](https://modelcontextprotocol.io)
|
||||
726
docs/anythingllm_overview.md
Normal file
@@ -0,0 +1,726 @@
|
||||
# AnythingLLM: Your AI-Powered Knowledge Hub
|
||||
|
||||
## 🎯 What is AnythingLLM?
|
||||
|
||||
AnythingLLM is a **full-stack application** that transforms how you interact with Large Language Models (LLMs). Think of it as your personal AI assistant platform that can:
|
||||
|
||||
- 💬 Chat with multiple LLM providers
|
||||
- 📚 Query your own documents and data (RAG - Retrieval Augmented Generation)
|
||||
- 🤖 Run autonomous AI agents with tools
|
||||
- 🔌 Extend capabilities via Model Context Protocol (MCP)
|
||||
- 👥 Support multiple users and workspaces
|
||||
- 🎨 Provide a beautiful, intuitive web interface
|
||||
|
||||
**In simple terms:** It's like ChatGPT, but you control everything - the data, the models, the privacy, and the capabilities.
|
||||
|
||||
---
|
||||
|
||||
## 🌟 Key Capabilities
|
||||
|
||||
### 1. **Multi-Provider LLM Support**
|
||||
|
||||
AnythingLLM isn't locked to a single AI provider. It supports **30+ LLM providers**:
|
||||
|
||||
#### Your Environment:
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Your LLM Infrastructure │
|
||||
├─────────────────────────────────────────┤
|
||||
│ ✅ Llama CPP Router (pan.helu.ca) │
|
||||
│ - Load-balanced inference │
|
||||
│ - High availability │
|
||||
│ │
|
||||
│ ✅ Direct Llama CPP (nyx.helu.ca) │
|
||||
│ - Direct connection option │
|
||||
│ - Lower latency │
|
||||
│ │
|
||||
│ ✅ LLM Proxy - Arke (circe.helu.ca) │
|
||||
│ - Unified API gateway │
|
||||
│ - Request routing │
|
||||
│ │
|
||||
│ ✅ AWS Bedrock (optional) │
|
||||
│ - Claude, Titan models │
|
||||
│ - Enterprise-grade │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**What this means:**
|
||||
- Switch between providers without changing your application
|
||||
- Use different models for different workspaces
|
||||
- Fallback to alternative providers if one fails
|
||||
- Compare model performance side-by-side
|
||||
|
||||
### 2. **Document Intelligence (RAG)**
|
||||
|
||||
AnythingLLM can ingest and understand your documents:
|
||||
|
||||
**Supported Formats:**
|
||||
- 📄 PDF, DOCX, TXT, MD
|
||||
- 🌐 Websites (scraping)
|
||||
- 📊 CSV, JSON
|
||||
- 🎥 YouTube transcripts
|
||||
- 🔗 GitHub repositories
|
||||
- 📝 Confluence, Notion exports
|
||||
|
||||
**How it works:**
|
||||
```
|
||||
Your Document → Text Extraction → Chunking → Embeddings → Vector DB (PostgreSQL)
|
||||
↓
|
||||
User Question → Embedding → Similarity Search → Relevant Chunks → LLM → Answer
|
||||
```
|
||||
|
||||
**Example Use Case:**
|
||||
```
|
||||
You: "What's our refund policy?"
|
||||
AnythingLLM: [Searches your policy documents]
|
||||
"According to your Terms of Service (page 12),
|
||||
refunds are available within 30 days..."
|
||||
```
|
||||
|
||||
### 3. **AI Agents with Tools** 🤖
|
||||
|
||||
This is where AnythingLLM becomes **truly powerful**. Agents can:
|
||||
|
||||
#### Built-in Agent Tools:
|
||||
- 🌐 **Web Browsing** - Navigate websites, fill forms, take screenshots
|
||||
- 🔍 **Web Scraping** - Extract data from web pages
|
||||
- 📊 **SQL Agent** - Query databases (PostgreSQL, MySQL, MSSQL)
|
||||
- 📈 **Chart Generation** - Create visualizations
|
||||
- 💾 **File Operations** - Save and manage files
|
||||
- 📝 **Document Summarization** - Condense long documents
|
||||
- 🧠 **Memory** - Remember context across conversations
|
||||
|
||||
#### Agent Workflow Example:
|
||||
```
|
||||
User: "Check our database for users who signed up last week
|
||||
and send them a welcome email"
|
||||
|
||||
Agent:
|
||||
1. Uses SQL Agent to query PostgreSQL
|
||||
2. Retrieves user list
|
||||
3. Generates personalized email content
|
||||
4. (With email MCP) Sends emails
|
||||
5. Reports back with results
|
||||
```
|
||||
|
||||
### 4. **Model Context Protocol (MCP)** 🔌
|
||||
|
||||
MCP is AnythingLLM's **superpower** - it allows you to extend the AI with custom tools and data sources.
|
||||
|
||||
#### What is MCP?
|
||||
|
||||
MCP is a **standardized protocol** for connecting AI systems to external tools and data. Think of it as "plugins for AI."
|
||||
|
||||
#### Your MCP Possibilities:
|
||||
|
||||
**Example 1: Docker Management**
|
||||
```javascript
|
||||
// MCP Server: docker-mcp
|
||||
Tools Available:
|
||||
- list_containers()
|
||||
- start_container(name)
|
||||
- stop_container(name)
|
||||
- view_logs(container)
|
||||
- exec_command(container, command)
|
||||
|
||||
User: "Show me all running containers and restart the one using most memory"
|
||||
Agent: [Uses docker-mcp tools to check, analyze, and restart]
|
||||
```
|
||||
|
||||
**Example 2: GitHub Integration**
|
||||
```javascript
|
||||
// MCP Server: github-mcp
|
||||
Tools Available:
|
||||
- create_issue(repo, title, body)
|
||||
- search_code(query)
|
||||
- create_pr(repo, branch, title)
|
||||
- list_repos()
|
||||
|
||||
User: "Create a GitHub issue for the bug I just described"
|
||||
Agent: [Uses github-mcp to create issue with details]
|
||||
```
|
||||
|
||||
**Example 3: Custom Business Tools**
|
||||
```javascript
|
||||
// Your Custom MCP Server
|
||||
Tools Available:
|
||||
- query_crm(customer_id)
|
||||
- check_inventory(product_sku)
|
||||
- create_order(customer, items)
|
||||
- send_notification(user, message)
|
||||
|
||||
User: "Check if we have product XYZ in stock and notify me if it's low"
|
||||
Agent: [Uses your custom MCP tools]
|
||||
```
|
||||
|
||||
#### MCP Architecture in AnythingLLM:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ AnythingLLM │
|
||||
│ ┌─────────────────────────────────────────────────┐ │
|
||||
│ │ Agent System │ │
|
||||
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
|
||||
│ │ │ Built-in │ │ MCP │ │ Custom │ │ │
|
||||
│ │ │ Tools │ │ Tools │ │ Flows │ │ │
|
||||
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
|
||||
│ └─────────────────────────────────────────────────┘ │
|
||||
│ ↓ │
|
||||
│ ┌─────────────────────────────────────────────────┐ │
|
||||
│ │ MCP Hypervisor │ │
|
||||
│ │ - Manages MCP server lifecycle │ │
|
||||
│ │ - Handles stdio/http/sse transports │ │
|
||||
│ │ - Auto-discovers tools │ │
|
||||
│ └─────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ MCP Servers (Running Locally or Remote) │
|
||||
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
||||
│ │ Docker │ │ GitHub │ │ Custom │ │
|
||||
│ │ MCP │ │ MCP │ │ MCP │ │
|
||||
│ └──────────┘ └──────────┘ └──────────┘ │
|
||||
└─────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Key Features:**
|
||||
- ✅ **Hot-reload** - Add/remove MCP servers without restarting
|
||||
- ✅ **Multiple transports** - stdio, HTTP, Server-Sent Events
|
||||
- ✅ **Auto-discovery** - Tools automatically appear in agent
|
||||
- ✅ **Process management** - Automatic start/stop/restart
|
||||
- ✅ **Error handling** - Graceful failures with logging
|
||||
|
||||
### 5. **Agent Flows** 🔄
|
||||
|
||||
Create **no-code agent workflows** for complex tasks:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Example Flow: "Daily Report Generator" │
|
||||
├─────────────────────────────────────────┤
|
||||
│ 1. Query database for yesterday's data │
|
||||
│ 2. Generate summary statistics │
|
||||
│ 3. Create visualization charts │
|
||||
│ 4. Write report to document │
|
||||
│ 5. Send via email (MCP) │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
Flows can be:
|
||||
- Triggered manually
|
||||
- Scheduled (via external cron)
|
||||
- Called from other agents
|
||||
- Shared across workspaces
|
||||
|
||||
---
|
||||
|
||||
## 🏗️ How AnythingLLM Fits Your Environment
|
||||
|
||||
### Your Complete Stack:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Internet │
|
||||
└────────────────────────────┬────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ HAProxy (SSL Termination & Load Balancing) │
|
||||
│ - HTTPS/WSS support │
|
||||
│ - Security headers │
|
||||
│ - Health checks │
|
||||
└────────────────────────────┬────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ AnythingLLM Application │
|
||||
│ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │
|
||||
│ │ Web UI │ │ API Server │ │ Agent Engine │ │
|
||||
│ │ - React │ │ - Express.js │ │ - AIbitat │ │
|
||||
│ │ - WebSocket │ │ - REST API │ │ - MCP Support │ │
|
||||
│ └─────────────────┘ └─────────────────┘ └────────────────┘ │
|
||||
└────────────────────────────┬────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Data Layer │
|
||||
│ ┌──────────────────────────────────────────────────────────┐ │
|
||||
│ │ PostgreSQL 17 + pgvector │ │
|
||||
│ │ - User data & workspaces │ │
|
||||
│ │ - Chat history │ │
|
||||
│ │ - Vector embeddings (for RAG) │ │
|
||||
│ │ - Agent invocations │ │
|
||||
│ └──────────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ External LLM Services │
|
||||
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
|
||||
│ │ Llama Router │ │ Direct Llama │ │ LLM Proxy │ │
|
||||
│ │ pan.helu.ca │ │ nyx.helu.ca │ │ circe.helu.ca│ │
|
||||
│ └──────────────┘ └──────────────┘ └──────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ TTS Service │
|
||||
│ ┌──────────────────────────────────────────────────────────┐ │
|
||||
│ │ FastKokoro (OpenAI-compatible TTS) │ │
|
||||
│ │ pan.helu.ca:22070 │ │
|
||||
│ │ - Text-to-speech generation │ │
|
||||
│ │ - Multiple voices │ │
|
||||
│ └──────────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Observability Stack:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Monitoring & Logging │
|
||||
│ ┌──────────────────────────────────────────────────────────┐ │
|
||||
│ │ Grafana (Unified Dashboard) │ │
|
||||
│ │ - Metrics visualization │ │
|
||||
│ │ - Log exploration │ │
|
||||
│ │ - Alerting │ │
|
||||
│ └────────────┬─────────────────────────────┬────────────────┘ │
|
||||
│ ↓ ↓ │
|
||||
│ ┌────────────────────────┐ ┌────────────────────────┐ │
|
||||
│ │ Prometheus │ │ Loki │ │
|
||||
│ │ - Metrics storage │ │ - Log aggregation │ │
|
||||
│ │ - Alert rules │ │ - 31-day retention │ │
|
||||
│ │ - 30-day retention │ │ - Query language │ │
|
||||
│ └────────────────────────┘ └────────────────────────┘ │
|
||||
│ ↑ ↑ │
|
||||
│ ┌────────────┴─────────────────────────────┴────────────────┐ │
|
||||
│ │ Data Collection │ │
|
||||
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
|
||||
│ │ │ cAdvisor │ │ Postgres │ │ Alloy │ │ │
|
||||
│ │ │ (Container) │ │ Exporter │ │ (Logs) │ │ │
|
||||
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
|
||||
│ └────────────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🎨 Real-World Use Cases

### Use Case 1: **Internal Knowledge Base**

**Scenario:** Your team needs quick access to company documentation

**Setup:**
1. Upload all company docs to an AnythingLLM workspace
2. Documents are embedded and stored in PostgreSQL
3. Team members ask questions naturally

**Example:**
```
Employee: "What's the process for requesting time off?"
AnythingLLM: [Searches HR documents]
"According to the Employee Handbook, you need to:
1. Submit request via HR portal
2. Get manager approval
3. Minimum 2 weeks notice for vacations..."
```

**Benefits:**
- ✅ No more searching through SharePoint
- ✅ Instant answers with source citations
- ✅ Always up to date (re-sync documents)
- ✅ Multi-user access with permissions
### Use Case 2: **DevOps Assistant**

**Scenario:** Manage infrastructure with natural language

**Setup:**
1. Install the Docker MCP server
2. Install the GitHub MCP server
3. Connect to your monitoring stack

**Example Conversation:**
```
You: "Show me all containers and their resource usage"
Agent: [Uses docker-mcp + Prometheus data]
"Here are your containers:
- anythingllm: 2.1GB RAM, 45% CPU
- postgres: 1.8GB RAM, 12% CPU
- prometheus: 1.2GB RAM, 8% CPU

anythingllm is using high CPU. Would you like me to investigate?"

You: "Yes, check the logs for errors"
Agent: [Uses docker-mcp to fetch logs]
"Found 15 errors in the last hour related to LLM timeouts.
Should I create a GitHub issue?"

You: "Yes, and restart the container"
Agent: [Creates GitHub issue, restarts container]
"Done! Issue #123 created and container restarted.
CPU usage now at 15%."
```
### Use Case 3: **Customer Support Automation**

**Scenario:** AI-powered support that can take action

**Setup:**
1. Upload product documentation
2. Connect your CRM via a custom MCP server
3. Enable the SQL agent for database queries

**Example:**
```
Support Agent: "Customer John Doe says his order #12345 hasn't arrived"
AnythingLLM: [Queries database via SQL agent]
"Order #12345 shipped on Jan 5th via FedEx.
Tracking shows it's delayed due to weather.

Would you like me to:
1. Send customer an update email
2. Offer expedited shipping on next order
3. Issue a partial refund"

Support Agent: "Send update email"
AnythingLLM: [Uses email MCP]
"Email sent to john@example.com with tracking info
and apology for delay."
```
### Use Case 4: **Data Analysis Assistant**

**Scenario:** Query your database with natural language

**Setup:**
1. Enable the SQL agent
2. Connect to PostgreSQL
3. Grant read-only access

**Example:**
```
You: "Show me user signups by month for the last 6 months"
Agent: [Generates and executes SQL]
SELECT
  DATE_TRUNC('month', created_at) AS month,
  COUNT(*) AS signups
FROM users
WHERE created_at >= NOW() - INTERVAL '6 months'
GROUP BY month
ORDER BY month;

Results:
- July 2025: 145 signups
- August 2025: 203 signups
- September 2025: 187 signups
...

You: "Create a chart of this"
Agent: [Uses chart generation tool]
[Displays bar chart visualization]
```

---
## 🔐 Security & Privacy

### Why Self-Hosted Matters:

**Your Data Stays Yours:**
- ✅ Documents never leave your infrastructure
- ✅ Chat history stored in your PostgreSQL
- ✅ No data sent to third parties (except your chosen LLM provider)
- ✅ Full audit trail in logs (via Loki)

**Access Control:**
- ✅ Multi-user authentication
- ✅ Role-based permissions (Admin, User)
- ✅ Workspace-level isolation
- ✅ API key management

**Network Security:**
- ✅ HAProxy SSL termination
- ✅ Security headers (HSTS, CSP, etc.)
- ✅ Internal network isolation
- ✅ Firewall-friendly (only ports 80/443 exposed)
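The SSL termination and security headers above are applied in the HAProxy frontend; a minimal sketch (frontend name, certificate path, and CSP value are illustrative, not taken from this deployment):

```
frontend https_in
    bind :443 ssl crt /etc/haproxy/certs/site.pem
    # Security headers attached to every response
    http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains"
    http-response set-header X-Frame-Options "SAMEORIGIN"
    http-response set-header X-Content-Type-Options "nosniff"
    http-response set-header Content-Security-Policy "default-src 'self'"
    default_backend anythingllm
```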
**Monitoring:**
- ✅ All access logged to Loki
- ✅ Failed login attempts tracked
- ✅ Resource usage monitored
- ✅ Alerts for suspicious activity

---
## 📊 Monitoring Integration

Your observability stack provides **complete visibility**:

### What You Can Monitor:

**Application Health:**
```
Grafana Dashboard: "AnythingLLM Overview"
├─ Request Rate: 1,234 req/min
├─ Response Time: 245ms avg
├─ Error Rate: 0.3%
├─ Active Users: 23
└─ Agent Invocations: 45/hour
```

**Resource Usage:**
```
Container Metrics (via cAdvisor):
├─ CPU: 45% (2 cores)
├─ Memory: 2.1GB / 4GB
├─ Network: 15MB/s in, 8MB/s out
└─ Disk I/O: 120 IOPS
```

**Database Performance:**
```
PostgreSQL Metrics (via postgres-exporter):
├─ Connections: 45 / 100
├─ Query Time: 12ms avg
├─ Cache Hit Ratio: 98.5%
├─ Database Size: 2.3GB
└─ Vector Index Size: 450MB
```

**LLM Provider Performance:**
```
Custom Metrics (via HAProxy):
├─ Llama Router: 234ms avg latency
├─ Direct Llama: 189ms avg latency
├─ Arke Proxy: 267ms avg latency
└─ Success Rate: 99.2%
```
**Log Analysis (Loki):**
```logql
# Find slow LLM responses
{service="anythingllm"}
  | json
  | duration > 5000

# Track agent tool usage
{service="anythingllm"}
  |= "agent"
  |= "tool_call"

# Count errors by type over the last 5 minutes
sum by (error_type) (
  count_over_time({service="anythingllm"} |= "ERROR" | json [5m])
)
```
### Alerting Examples:

**Critical Alerts:**
- 🚨 AnythingLLM container down
- 🚨 PostgreSQL connection failures
- 🚨 Disk space > 95%
- 🚨 Memory usage > 90%

**Warning Alerts:**
- ⚠️ High LLM response times (> 5s)
- ⚠️ Database connections > 80%
- ⚠️ Error rate > 1%
- ⚠️ Agent failures
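As a sketch, the container-down and memory alerts above could be written as Prometheus alerting rules; the metric names follow cAdvisor conventions, while the `anythingllm` container name and thresholds are assumptions:

```yaml
groups:
  - name: anythingllm
    rules:
      - alert: AnythingLLMDown
        # cAdvisor stops updating last-seen when the container dies
        expr: time() - container_last_seen{name="anythingllm"} > 60
        for: 2m
        labels:
          severity: critical
      - alert: AnythingLLMHighMemory
        # Fraction of the container's configured memory limit
        expr: >
          container_memory_usage_bytes{name="anythingllm"}
          / container_spec_memory_limit_bytes{name="anythingllm"} > 0.9
        for: 5m
        labels:
          severity: critical
```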
---
## 🚀 Getting Started

### Quick Start:

```bash
cd deployment

# 1. Configure environment
cp .env.example .env
nano .env  # Set your LLM endpoints, passwords, etc.

# 2. Set up SSL certificates
# (See README.md for Let's Encrypt instructions)

# 3. Deploy
docker-compose up -d

# 4. Access services
# - AnythingLLM: https://your-domain.com
# - Grafana: http://localhost:3000
# - Prometheus: http://localhost:9090
```

### First Steps in AnythingLLM:

1. **Create Account** - The first user becomes admin
2. **Create Workspace** - Organize by project or team
3. **Upload Documents** - Add your knowledge base
4. **Configure LLM** - Choose your provider (already set via `.env`)
5. **Enable Agents** - Turn on agent mode for tools
6. **Add MCP Servers** - Extend with custom tools
7. **Start Chatting!** - Ask questions, run agents

---
## 🎯 Why AnythingLLM is Powerful

### Compared to ChatGPT:

| Feature | ChatGPT | AnythingLLM |
|---------|---------|-------------|
| **Data Privacy** | ❌ Data sent to OpenAI | ✅ Self-hosted, private |
| **Custom Documents** | ⚠️ Limited (ChatGPT Plus) | ✅ Unlimited RAG |
| **LLM Choice** | ❌ OpenAI only | ✅ 30+ providers |
| **Agents** | ⚠️ Limited tools | ✅ Unlimited via MCP |
| **Multi-User** | ❌ Individual accounts | ✅ Team workspaces |
| **API Access** | ⚠️ Paid tier | ✅ Full REST API |
| **Monitoring** | ❌ No visibility | ✅ Complete observability |
| **Cost** | 💰 $20/user/month | ✅ Self-hosted (compute only) |

### Compared to LangChain/LlamaIndex:

| Feature | LangChain | AnythingLLM |
|---------|-----------|-------------|
| **Setup** | 🔧 Code required | ✅ Web UI, no code |
| **User Interface** | ❌ Build your own | ✅ Beautiful UI included |
| **Multi-User** | ❌ Build your own | ✅ Built-in |
| **Agents** | ✅ Powerful | ✅ Equally powerful + UI |
| **MCP Support** | ❌ No | ✅ Native support |
| **Monitoring** | ❌ DIY | ✅ Integrated |
| **Learning Curve** | 📚 Steep | ✅ Gentle |

---
## 🎓 Advanced Capabilities

### 1. **Workspace Isolation**

Create separate workspaces for different use cases:

```
├─ Engineering Workspace
│   ├─ Documents: Code docs, API specs
│   ├─ LLM: Direct Llama (fast)
│   └─ Agents: GitHub MCP, Docker MCP
│
├─ Customer Support Workspace
│   ├─ Documents: Product docs, FAQs
│   ├─ LLM: Llama Router (reliable)
│   └─ Agents: CRM MCP, Email MCP
│
└─ Executive Workspace
    ├─ Documents: Reports, analytics
    ├─ LLM: AWS Bedrock Claude (best quality)
    └─ Agents: SQL Agent, Chart generation
```
### 2. **Embedding Strategies**

AnythingLLM supports multiple embedding models:

- **Native** (Xenova) - Fast, runs locally
- **OpenAI** - High quality, requires API
- **Azure OpenAI** - Enterprise option
- **LocalAI** - Self-hosted alternative

**Your Setup:** Using native embeddings for privacy and speed
### 3. **Agent Chaining**

Agents can call other agents:

```
Main Agent
├─> Research Agent (web scraping)
├─> Analysis Agent (SQL queries)
└─> Report Agent (document generation)
```
### 4. **API Integration**

Full REST API for programmatic access:

```bash
# Send chat message
curl -X POST https://your-domain.com/api/v1/workspace/chat \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"message": "What is our refund policy?"}'

# Upload document
curl -X POST https://your-domain.com/api/v1/document/upload \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@policy.pdf"

# Invoke agent
curl -X POST https://your-domain.com/api/v1/agent/invoke \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"prompt": "Check server status"}'
```

---
## 🔮 Future Possibilities

With your infrastructure, you could:

### 1. **Voice Interface**
- Use FastKokoro TTS for responses
- Add speech-to-text (Whisper)
- Create a voice-controlled assistant

### 2. **Slack/Discord Bot**
- Create an MCP server for messaging
- Deploy a bot that uses AnythingLLM
- Team can chat with AI in Slack

### 3. **Automated Workflows**
- Scheduled agent runs (cron)
- Webhook triggers
- Event-driven automation

### 4. **Custom Dashboards**
- Embed AnythingLLM in your apps
- White-label the interface
- Custom branding

### 5. **Multi-Modal AI**
- Image analysis (with vision models)
- Document OCR
- Video transcription

---
## 📚 Summary

**AnythingLLM is your AI platform that:**

✅ **Respects Privacy** - Self-hosted, your data stays yours
✅ **Flexible** - 30+ LLM providers, switch anytime
✅ **Intelligent** - RAG for document understanding
✅ **Powerful** - AI agents with unlimited tools via MCP
✅ **Observable** - Full monitoring with Prometheus/Loki
✅ **Scalable** - PostgreSQL + HAProxy for production
✅ **Extensible** - MCP protocol for custom integrations
✅ **User-Friendly** - Beautiful web UI, no coding required

**In your environment, it provides:**

🎯 **Unified AI Interface** - One place for all AI interactions
🔧 **DevOps Automation** - Manage infrastructure with natural language
📊 **Data Intelligence** - Query databases, analyze trends
🤖 **Autonomous Agents** - Tasks that run themselves
📈 **Complete Visibility** - Every metric, every log, every alert
🔒 **Enterprise Security** - SSL, auth, audit trails, monitoring

**Think of it as:** Your personal AI assistant platform that can see your data, use your tools, and help your team - all while you maintain complete control.

---

## 🆘 Learn More

- **Deployment Guide**: [README.md](README.md)
- **Monitoring Explained**: [PROMETHEUS_EXPLAINED.md](PROMETHEUS_EXPLAINED.md)
- **Official Docs**: https://docs.anythingllm.com
- **GitHub**: https://github.com/Mintplex-Labs/anything-llm
- **Discord Community**: https://discord.gg/6UyHPeGZAC
---

`docs/arke.md`
# Arke Vault Variables Documentation

This document lists the vault variables that need to be added to `ansible/inventory/group_vars/all/vault.yml` for the Arke deployment.

## Required Vault Variables

### Existing Variables
These should already be present in your vault:

```yaml
vault_arke_db_password: "your_secure_password"
vault_arke_ntth_tokens: '[{"app_id":"your_app_id","app_secret":"your_secret","name":"Production"}]'
```

### New Variables to Add

```yaml
# OpenAI-compatible embedding API key (optional - can be an empty string if not using an OpenAI provider)
vault_arke_openai_embedding_api_key: ""
```

## Usage Notes

### vault_arke_openai_embedding_api_key
- **Required when**: `arke_embedding_provider` is set to `openai` in the inventory
- **Can be empty**: If using llama-cpp, LocalAI, or other services that don't require authentication
- **Must be set**: If using the actual OpenAI API or services requiring authentication
- **Default in inventory**: Empty string (`""`)

### vault_arke_ntth_tokens
- **Format**: JSON array of objects
- **Required fields per object**:
  - `app_id`: The application ID
  - `app_secret`: The application secret
  - `name`: (optional) A descriptive name for the token

**Example with multiple tokens**:
```yaml
vault_arke_ntth_tokens: '[{"app_id":"id1","app_secret":"secret1","name":"Production-Primary"},{"app_id":"id2","app_secret":"secret2","name":"Production-Backup"}]'
```
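Because this value is a single-line JSON array embedded in a YAML string, it is easy to malform by hand. A small helper like the following (hypothetical, not part of this repo) can produce a correctly escaped value to paste into the vault:

```python
import json


def build_ntth_tokens(*tokens: dict) -> str:
    """Serialize token dicts into the single-line JSON string the vault expects."""
    for t in tokens:
        # app_id and app_secret are required; name is optional
        missing = {"app_id", "app_secret"} - t.keys()
        if missing:
            raise ValueError(f"token missing fields: {missing}")
    # Compact separators keep the value on one line with no extra whitespace
    return json.dumps(list(tokens), separators=(",", ":"))


value = build_ntth_tokens(
    {"app_id": "id1", "app_secret": "secret1", "name": "Production-Primary"},
)
print(f"vault_arke_ntth_tokens: '{value}'")
```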
## Editing the Vault

To edit the vault file:

```bash
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
```

Make sure you have the vault password available (stored in `ansible/.vault_pass` by default).

## Configuration Examples

### Using Ollama (Current Default)
No additional vault variables needed beyond the existing ones. The following inventory settings are used:

```yaml
arke_embedding_provider: ollama
arke_ollama_host: "pan.helu.ca"
```

### Using OpenAI API
Add to vault:
```yaml
vault_arke_openai_embedding_api_key: "sk-your-openai-api-key"
```

Update inventory to:
```yaml
arke_embedding_provider: openai
arke_openai_embedding_base_url: "https://api.openai.com"
arke_openai_embedding_model: "text-embedding-3-small"
```

### Using llama-cpp or LocalAI (No Auth Required)
Vault variable can remain empty:
```yaml
vault_arke_openai_embedding_api_key: ""
```

Update inventory to:
```yaml
arke_embedding_provider: openai
arke_openai_embedding_base_url: "http://your-server:8080"
arke_openai_embedding_model: "text-embedding-ada-002"
```

## Security Best Practices

1. Always use `ansible-vault` to encrypt sensitive data
2. Never commit unencrypted secrets to version control
3. Keep the vault password secure and separate from the repository
4. Rotate API keys and secrets regularly
5. Use unique tokens for different environments (dev/staging/production)
---

`docs/auditd.md`
## Auditd + Laurel: Host-Based Detection Done Right

### What They Are

**Auditd** is the Linux Audit Framework: a kernel-level system that logs security-relevant events such as file access, system calls, process execution, user authentication, and privilege changes. It's been in the kernel since 2.6 and is rock solid.

**Laurel** is a plugin that transforms auditd's notoriously awkward multi-line log format into clean, structured JSON, which is perfect for shipping to Loki.

### Why This Combination Works

Auditd alone has two problems:
1. The log format is painful (events split across multiple lines, encoded arguments)
2. High-volume logging can impact performance if not tuned

Laurel solves the first problem elegantly. Proper rule tuning solves the second.

### Installation

```bash
# Auditd (likely already installed)
sudo apt install auditd audispd-plugins

# Laurel - grab the latest release
wget https://github.com/threathunters-io/laurel/releases/latest/download/laurel-x86_64-musl
sudo mv laurel-x86_64-musl /usr/local/sbin/laurel
sudo chmod 755 /usr/local/sbin/laurel

# Create laurel user and directories
sudo useradd -r -s /usr/sbin/nologin laurel
sudo mkdir -p /var/log/laurel /etc/laurel
sudo chown laurel:laurel /var/log/laurel
```
### Configuration

**/etc/laurel/config.toml:**
```toml
[auditlog]
# Output JSON logs here - point Promtail/Loki agent at this
file = "/var/log/laurel/audit.json"
size = 100000000  # 100MB rotation
generations = 5

[transform]
# Enrich with useful context
execve-argv = "array"
execve-env = "delete"  # Don't log environment (secrets risk)

[filter]
# Drop noisy low-value events
filter-keys = ["exclude-noise"]
```

**/etc/audit/plugins.d/laurel.conf:**
```ini
active = yes
direction = out
path = /usr/local/sbin/laurel
type = always
args = --config /etc/laurel/config.toml
format = string
```
### High-Value Audit Rules

Here's a starter set focused on actual intrusion indicators, not compliance checkbox noise:

**/etc/audit/rules.d/intrusion-detection.rules:**
```bash
# Clear existing rules
-D

# Buffer size (tune based on your load)
-b 8192

# Failed file access (credential hunting)
-a always,exit -F arch=b64 -S open,openat -F exit=-EACCES -F key=access-denied
-a always,exit -F arch=b64 -S open,openat -F exit=-EPERM -F key=access-denied

# Credential file access
-w /etc/passwd -p wa -k credential-files
-w /etc/shadow -p wa -k credential-files
-w /etc/gshadow -p wa -k credential-files
-w /etc/sudoers -p wa -k credential-files
-w /etc/sudoers.d -p wa -k credential-files

# SSH key access
-w /root/.ssh -p wa -k ssh-keys
-w /home -p wa -k ssh-keys

# Privilege escalation
-a always,exit -F arch=b64 -S setuid,setgid,setreuid,setregid -F key=priv-escalation
-w /usr/bin/sudo -p x -k priv-escalation
-w /usr/bin/su -p x -k priv-escalation

# Process injection / debugging
-a always,exit -F arch=b64 -S ptrace -F key=process-injection

# Suspicious process execution
-a always,exit -F arch=b64 -S execve -F euid=0 -F key=root-exec
-w /tmp -p x -k exec-from-tmp
-w /var/tmp -p x -k exec-from-tmp
-w /dev/shm -p x -k exec-from-shm

# Network connections from unexpected processes
-a always,exit -F arch=b64 -S connect -F key=network-connect

# Kernel module loading
-a always,exit -F arch=b64 -S init_module,finit_module -F key=kernel-modules

# Audit log tampering (high priority)
-w /var/log/audit -p wa -k audit-tampering
-w /etc/audit -p wa -k audit-tampering

# Cron/scheduled task modification
-w /etc/crontab -p wa -k persistence
-w /etc/cron.d -p wa -k persistence
-w /var/spool/cron -p wa -k persistence

# Systemd service creation (persistence mechanism)
-w /etc/systemd/system -p wa -k persistence
-w /usr/lib/systemd/system -p wa -k persistence

# Make config immutable (uncomment -e 2 only after tuning)
# -e 2
```

Load the rules:
```bash
sudo augenrules --load
sudo systemctl restart auditd
```
### Shipping to Loki

**Promtail config snippet:**
```yaml
scrape_configs:
  - job_name: laurel
    static_configs:
      - targets:
          - localhost
        labels:
          job: auditd
          host: your-hostname
          __path__: /var/log/laurel/audit.json
    pipeline_stages:
      - json:
          expressions:
            event_type: SYSCALL.SYSCALL
            key: SYSCALL.key
            exe: SYSCALL.exe
            uid: SYSCALL.UID
            success: SYSCALL.success
      - labels:
          event_type:
          key:
```
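Since the rest of this stack ships logs with Grafana Alloy rather than Promtail, an equivalent Alloy sketch would be the following (the Loki push URL is a placeholder, and the labels mirror the Promtail snippet above):

```alloy
local.file_match "laurel" {
  path_targets = [{
    "__path__" = "/var/log/laurel/audit.json",
    "job"      = "auditd",
  }]
}

loki.source.file "laurel" {
  targets    = local.file_match.laurel.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```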
### Grafana Alerting Examples

Once in Loki, create alerts for the high-value events:

```logql
# Credential file tampering
{job="auditd"} |= `credential-files` | json | success = "yes"

# Execution from /tmp (classic attack pattern)
{job="auditd"} |= `exec-from-tmp` | json

# Root execution by non-root user (priv esc)
{job="auditd"} |= `priv-escalation` | json

# Kernel module loading (rootkit indicator)
{job="auditd"} |= `kernel-modules` | json

# Audit log tampering (covering tracks)
{job="auditd"} |= `audit-tampering` | json
```
### Performance Tuning

If you see a performance impact:
1. **Add exclusions** for known-noisy processes:
   ```bash
   -a never,exit -F exe=/usr/bin/prometheus -F key=exclude-noise
   ```
2. **Reduce network logging**: the `connect` syscall is high-volume; consider removing or filtering it
3. **Increase the buffer** if you see `audit: backlog limit exceeded`

### What You'll Catch

With this setup, you'll detect:
- Credential harvesting attempts
- Privilege escalation (successful and attempted)
- Persistence mechanisms (cron, systemd services)
- Execution from world-writable directories
- Process injection/debugging
- Rootkit installation attempts
- Evidence tampering

All with structured JSON flowing into your existing Loki/Grafana stack. No Suricata noise, just host-level events that actually matter.
---

`docs/casdoor.md`
# Casdoor SSO Identity Provider

Casdoor provides Single Sign-On (SSO) authentication for Agathos services. This document covers the design decisions, architecture, and deployment procedures.

## Design Philosophy

### Security Isolation

Casdoor handles identity and authentication - the most security-sensitive data in any system. For this reason, Casdoor uses a **dedicated PostgreSQL instance** on Titania rather than sharing the PostgreSQL server on Portia with other applications.

This isolation provides:
- **Data separation**: Authentication data is physically separated from application data
- **Access control**: The `casdoor` database user only has access to the `casdoor` database
- **Blast radius reduction**: A compromise of the shared database on Portia doesn't expose identity data
- **Production alignment**: Dev/UAT/Prod environments use the same architecture

### Native PostgreSQL with Docker Casdoor

The architecture splits cleanly:

```
┌──────────────────────────────────────────────────────────────┐
│                       titania.incus                          │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Native PostgreSQL 17 (systemd)                        │  │
│  │  - SSL enabled for external connections                │  │
│  │  - Local connections without SSL                       │  │
│  │  - Managed like any standard PostgreSQL install        │  │
│  │  - Port 5432                                           │  │
│  └────────────────────────────────────────────────────────┘  │
│           ▲                                                  │
│           │ localhost:5432                                   │
│           │ sslmode=disable                                  │
│           │                                                  │
│  ┌────────┴───────────────────────────────────────────────┐  │
│  │  Casdoor Docker Container (network_mode: host)         │  │
│  │  - Runs as casdoor:casdoor user                        │  │
│  │  - Only has access to its database                     │  │
│  │  - Cannot touch PostgreSQL server config               │  │
│  │  - Port 22081 (via HAProxy)                            │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
           │
           │ External: SSL required
           │ sslmode=verify-ca
           ▼
    ┌─────────────┐
    │   PGadmin   │
    │  on Portia  │
    └─────────────┘
```
### Why Not Docker for PostgreSQL?

Docker makes PostgreSQL permission management unnecessarily complex:
- UID/GID mapping between host and container
- Volume permission issues
- SSL certificate ownership problems
- More difficult backups and maintenance

Native PostgreSQL is:
- Easier to manage (standard Linux administration)
- Better integrated with systemd
- Simpler backup procedures
- Well-documented and understood
### SSL Strategy

PostgreSQL connections follow a **split SSL policy**:

| Connection Source | SSL Requirement | Rationale |
|-------------------|-----------------|-----------|
| Casdoor (localhost) | `sslmode=disable` | Same host, trusted |
| PGadmin (Portia) | `sslmode=verify-ca` | External network, requires encryption |
| Other external | `hostssl` required | Enforced by pg_hba.conf |

This is controlled by `pg_hba.conf`:
```
# Local connections (Unix socket)
local    all    all                    peer

# Localhost connections (no SSL required)
host     all    all    127.0.0.1/32    md5

# External connections (SSL required)
hostssl  all    all    0.0.0.0/0       md5
```
### System User Pattern

The Casdoor service user is created without hardcoded UID/GID:

```yaml
- name: Create casdoor user
  ansible.builtin.user:
    name: "{{ casdoor_user }}"
    system: true  # System account, UID assigned by OS
```

The playbook queries the assigned UID/GID at runtime for Docker container user mapping.
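A sketch of that runtime lookup using the standard `getent` module; the task names and the `casdoor_uid`/`casdoor_gid` fact names are illustrative, not taken from the actual playbook:

```yaml
- name: Look up the UID/GID the OS assigned
  ansible.builtin.getent:
    database: passwd
    key: "{{ casdoor_user }}"

- name: Expose them for the docker-compose template
  ansible.builtin.set_fact:
    # getent_passwd values are [password, uid, gid, gecos, home, shell]
    casdoor_uid: "{{ getent_passwd[casdoor_user][1] }}"
    casdoor_gid: "{{ getent_passwd[casdoor_user][2] }}"
```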
## Architecture

### Components

| Component | Location | Purpose |
|-----------|----------|---------|
| PostgreSQL 17 | Native on Titania | Dedicated identity database |
| Casdoor | Docker on Titania | SSO identity provider |
| HAProxy | Titania | TLS termination, routing |
| Alloy | Titania | Syslog collection |

### Deployment Order

```
1. postgresql_ssl/deploy.yml  → Install PostgreSQL, SSL, create casdoor DB
2. casdoor/deploy.yml         → Deploy Casdoor container
3. pgadmin/deploy.yml         → Distribute SSL cert to PGadmin (optional)
```

### Network Ports

| Port | Service | Access |
|------|---------|--------|
| 22081 | Casdoor HTTP | Via HAProxy (network_mode: host) |
| 5432 | PostgreSQL | SSL for external, plain for localhost |
| 51401 | Syslog | Local only (Alloy) |

### Data Persistence

PostgreSQL data (native install):
```
/var/lib/postgresql/17/main/   # Database files
/etc/postgresql/17/main/       # Configuration
/etc/postgresql/17/main/ssl/   # SSL certificates
```

Casdoor configuration:
```
/srv/casdoor/
├── conf/
│   └── app.conf           # Casdoor configuration
└── docker-compose.yml     # Service definition
```
## Prerequisites

### 1. Terraform (S3 Buckets)

Casdoor can use S3-compatible storage for avatars and attachments:

```bash
cd terraform
terraform apply
```

### 2. Ansible Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

```yaml
# PostgreSQL SSL postgres user password (for Titania's dedicated PostgreSQL)
vault_postgresql_ssl_postgres_password: "secure-postgres-password"

# Casdoor database password
vault_casdoor_db_password: "secure-db-password"

# Casdoor application secrets
vault_casdoor_auth_state: "random-32-char-string"
vault_casdoor_app_client_secret: "generated-client-secret"

# Casdoor initial user passwords (changed after first login)
vault_casdoor_admin_password: "initial-admin-password"
vault_casdoor_hostmaster_password: "initial-hostmaster-password"

# Optional (for RADIUS protocol)
vault_casdoor_radius_secret: "radius-secret"
```

Generate secrets:
```bash
# Database password
openssl rand -base64 24

# Auth state
openssl rand -hex 16
```

### 3. Alloy Log Collection

Ensure Alloy is deployed to receive syslog:

```bash
ansible-playbook alloy/deploy.yml --limit titania.incus
```
## Deployment

### Fresh Installation

```bash
cd ansible

# 1. Deploy PostgreSQL with SSL
ansible-playbook postgresql_ssl/deploy.yml

# 2. Deploy Casdoor
ansible-playbook casdoor/deploy.yml

# 3. Update PGadmin with SSL certificate (optional)
ansible-playbook pgadmin/deploy.yml
```

### Verify Deployment

```bash
# Check PostgreSQL status
ssh titania.incus "sudo systemctl status postgresql"

# Check Casdoor container
ssh titania.incus "cd /srv/casdoor && docker compose ps"

# Check logs
ssh titania.incus "cd /srv/casdoor && docker compose logs --tail=50"

# Test health endpoint
curl -s http://titania.incus:22081/api/health
```

### Redeployment

To redeploy Casdoor only (database preserved):

```bash
ansible-playbook casdoor/remove.yml
ansible-playbook casdoor/deploy.yml
```

To completely reset (including database):
```bash
ansible-playbook casdoor/remove.yml
ssh titania.incus "sudo -u postgres dropdb casdoor"
ssh titania.incus "sudo -u postgres dropuser casdoor"
ansible-playbook postgresql_ssl/deploy.yml
ansible-playbook casdoor/deploy.yml
```
## Configuration Reference

### Host Variables

Located in `ansible/inventory/host_vars/titania.incus.yml`:

```yaml
# PostgreSQL SSL (dedicated identity database)
postgresql_ssl_postgres_password: "{{ vault_postgresql_ssl_postgres_password }}"
postgresql_ssl_port: 5432
postgresql_ssl_cert_path: /etc/postgresql/17/main/ssl/server.crt

# Casdoor service account (system-assigned UID/GID)
casdoor_user: casdoor
casdoor_group: casdoor
casdoor_directory: /srv/casdoor

# Web
casdoor_port: 22081
casdoor_runmode: dev  # or 'prod'

# Database (connects to localhost PostgreSQL)
casdoor_db_port: 5432
casdoor_db_name: casdoor
casdoor_db_user: casdoor
casdoor_db_password: "{{ vault_casdoor_db_password }}"
casdoor_db_sslmode: disable  # Localhost, no SSL needed

# Logging
casdoor_syslog_port: 51401
```

### SSL Certificate

The self-signed certificate is generated automatically with:
- **Common Name**: `titania.incus`
- **Subject Alt Names**: `titania.incus`, `localhost`, `127.0.0.1`
- **Validity**: 10 years (`+3650d`)
- **Key Size**: 4096 bits
- **Location**: `/etc/postgresql/17/main/ssl/`

To regenerate certificates:
```bash
ssh titania.incus "sudo rm -rf /etc/postgresql/17/main/ssl/*"
ansible-playbook postgresql_ssl/deploy.yml
ansible-playbook pgadmin/deploy.yml  # Update cert on Portia
```
## PGadmin Connection

To connect from PGadmin on Portia:

1. Navigate to https://pgadmin.ouranos.helu.ca
2. Add Server:
   - **General tab**
     - Name: `Titania PostgreSQL (Casdoor)`
   - **Connection tab**
     - Host: `titania.incus`
     - Port: `5432`
     - Database: `casdoor`
     - Username: `casdoor`
     - Password: *(from vault)*
   - **SSL tab**
     - SSL Mode: `Verify-CA`
     - Root certificate: `/var/lib/pgadmin/certs/titania-postgres-ca.crt`

The certificate is automatically distributed by `ansible-playbook pgadmin/deploy.yml`.

## Application Branding & CSS Customization

Casdoor allows extensive customization of login/signup pages through CSS and HTML fields in the **Application** settings.

### Available CSS/HTML Fields

| Field | Purpose | Where Applied |
|-------|---------|---------------|
| `formCss` | Custom CSS for desktop login forms | Login, signup, consent pages |
| `formCssMobile` | Mobile-specific CSS overrides | Mobile views |
| `headerHtml` | Custom HTML in page header | All auth pages (can inject `<style>` tags) |
| `footerHtml` | Custom footer HTML | Replaces "Powered by Casdoor" |
| `formSideHtml` | HTML beside the form | Side panel content |
| `formBackgroundUrl` | Background image URL | Full-page background |
| `formBackgroundUrlMobile` | Mobile background image | Mobile background |
| `signupHtml` | Custom HTML for signup page | Signup page only |
| `signinHtml` | Custom HTML for signin page | Signin page only |

### Configuration via init_data.json

Application branding is configured in `ansible/casdoor/init_data.json.j2`:

```json
{
  "applications": [
    {
      "name": "app-heluca",
      "formCss": "<style>/* Your CSS here */</style>",
      "footerHtml": "<div style=\"text-align:center;\">Powered by Helu.ca</div>",
      "headerHtml": "<style>/* Additional CSS via style tag */</style>",
      "formBackgroundUrl": "https://example.com/bg.jpg"
    }
  ]
}
```

### Example: Custom Theme CSS

The `formCss` field contains CSS to customize the Ant Design components:

```css
<style>
/* Login panel styling */
.login-panel {
  background-color: #ffffff;
  border-radius: 10px;
  box-shadow: 0 0 30px 20px rgba(255,164,21,0.12);
}

/* Primary button colors */
.ant-btn-primary {
  background-color: #4b96ff !important;
  border-color: #4b96ff !important;
}
.ant-btn-primary:hover {
  background-color: #58c0ff !important;
  border-color: #58c0ff !important;
}

/* Link colors */
a { color: #ffa415; }
a:hover { color: #ffc219; }

/* Input focus states */
.ant-input:focus, .ant-input-focused {
  border-color: #4b96ff !important;
  box-shadow: 0 0 0 2px rgba(75,150,255,0.2) !important;
}

/* Checkbox styling */
.ant-checkbox-checked .ant-checkbox-inner {
  background-color: #4b96ff !important;
  border-color: #4b96ff !important;
}
</style>
```

### Example: Custom Footer

Replace the default "Powered by Casdoor" footer:

```html
<div style="text-align:center;padding:10px;color:#666;">
  <a href="https://helu.ca" style="color:#4b96ff;text-decoration:none;">
    Powered by Helu.ca
  </a>
</div>
```

### Organization-Level Theme

Organization settings also affect theming. Configure in the **Organization** settings:

| Setting | Purpose |
|---------|---------|
| `themeData.colorPrimary` | Primary color (Ant Design) |
| `themeData.borderRadius` | Border radius for components |
| `themeData.isCompact` | Compact mode toggle |
| `logo` | Organization logo |
| `favicon` | Browser favicon |
| `websiteUrl` | Organization website |

### Updating Existing Applications

Changes to `init_data.json` only apply during **initial Casdoor setup**. For existing deployments:

1. **Via Admin UI**: Applications → Edit → Update CSS/HTML fields
2. **Via API**: Use Casdoor's REST API to update application settings
3. **Database reset**: Redeploy with `initDataNewOnly = false` (overwrites existing data)

### CSS Class Reference

Common CSS classes for targeting Casdoor UI elements:

| Class | Element |
|-------|---------|
| `.login-panel` | Main login form container |
| `.login-logo-box` | Logo container |
| `.login-username` | Username input wrapper |
| `.login-password` | Password input wrapper |
| `.login-button-box` | Submit button container |
| `.login-forget-password` | Forgot password link |
| `.login-signup-link` | Signup link |
| `.login-languages` | Language selector |
| `.back-button` | Back button |
| `.provider-img` | OAuth provider icons |
| `.signin-methods` | Sign-in method tabs |
| `.verification-code` | Verification code input |
| `.login-agreement` | Terms agreement checkbox |

## Initial Setup

After deployment, access Casdoor at https://id.ouranos.helu.ca:

1. **Login** with default credentials: `admin` / `123`
2. **Change admin password immediately**
3. **Create organization** for your domain
4. **Create applications** for services that need SSO:
   - SearXNG (via OAuth2-Proxy)
   - Grafana
   - Other internal services

### OAuth2 Application Setup

For each service:

1. Applications → Add
2. Configure OAuth2 settings:
   - Redirect URI: `https://service.ouranos.helu.ca/oauth2/callback`
   - Grant types: Authorization Code
3. Note the Client ID and Client Secret for service configuration

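The Client ID and Client Secret from step 3 typically feed an OAuth2-Proxy instance sitting in front of the protected service. A hypothetical compose-style sketch (service names, upstream address, and all `<...>` placeholders are assumptions, not values from this deployment):

```yaml
# Hypothetical OAuth2-Proxy service consuming a Casdoor application's
# credentials. Assumes Casdoor exposes OIDC discovery at the issuer URL.
services:
  oauth2-proxy:
    image: quay.io/oauth2-proxy/oauth2-proxy:latest
    command:
      - --provider=oidc
      - --oidc-issuer-url=https://id.ouranos.helu.ca
      - --client-id=<client-id-from-casdoor>
      - --client-secret=<client-secret-from-casdoor>
      - --redirect-url=https://service.ouranos.helu.ca/oauth2/callback
      - --email-domain=*
      - --cookie-secret=<random-32-byte-value>
      - --upstream=http://service:8080
    ports:
      - "4180:4180"
```

See the OAuth2-Proxy documentation linked under Related Documentation for the full flag reference.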
## Troubleshooting

### PostgreSQL Issues

```bash
# Check PostgreSQL status
ssh titania.incus "sudo systemctl status postgresql"

# View PostgreSQL logs
ssh titania.incus "sudo journalctl -u postgresql -f"

# Check SSL configuration
ssh titania.incus "sudo -u postgres psql -c 'SHOW ssl;'"
ssh titania.incus "sudo -u postgres psql -c 'SHOW ssl_cert_file;'"

# Test SSL connection externally
openssl s_client -connect titania.incus:5432 -starttls postgres
```

### Casdoor Container Issues

```bash
# View container status
ssh titania.incus "cd /srv/casdoor && docker compose ps"

# View logs
ssh titania.incus "cd /srv/casdoor && docker compose logs casdoor"

# Restart
ssh titania.incus "cd /srv/casdoor && docker compose restart"
```

### Database Connection

```bash
# Connect as postgres admin
ssh titania.incus "sudo -u postgres psql"

# Connect as the casdoor user (password auth; prompts for the vault password)
ssh -t titania.incus "psql -h localhost -U casdoor -d casdoor"

# List databases
ssh titania.incus "sudo -u postgres psql -c '\l'"

# List users
ssh titania.incus "sudo -u postgres psql -c '\du'"
```

### Health Check

```bash
# Casdoor health
curl -s http://titania.incus:22081/api/health | jq

# PostgreSQL accepting connections
ssh titania.incus "pg_isready -h localhost"
```

## Security Considerations

1. **Change the default admin password** immediately after deployment
2. **Rotate database passwords** periodically (update vault, redeploy)
3. **Monitor authentication logs** in Grafana (via Alloy/Loki)
4. **SSL certificates** have a 10-year validity; regenerate if compromised
5. **Backup PostgreSQL data** regularly - it contains all identity data:

```bash
ssh titania.incus "sudo -u postgres pg_dump casdoor > casdoor_backup.sql"
```

## Related Documentation

- [Ansible Practices](ansible.md) - Playbook and variable patterns
- [Terraform Practices](terraform.md) - S3 bucket provisioning
- [OAuth2-Proxy](services/oauth2_proxy.md) - Protecting services with Casdoor SSO

191
docs/cerbot.md
Normal file
@@ -0,0 +1,191 @@

# Certbot DNS-01 with Namecheap

This playbook deploys certbot with the Namecheap DNS plugin for DNS-01 validation, enabling wildcard SSL certificates.

## Overview

| Component | Value |
|-----------|-------|
| Installation | Python virtualenv in `/srv/certbot/.venv` |
| DNS Plugin | `certbot-dns-namecheap` |
| Validation | DNS-01 (supports wildcards) |
| Renewal | Systemd timer (twice daily) |
| Certificate Output | `/etc/haproxy/certs/{domain}.pem` |
| Metrics | Prometheus textfile collector |

## Deployments

### Titania (ouranos.helu.ca)

Production deployment providing Let's Encrypt certificates for the Agathos sandbox HAProxy reverse proxy.

| Setting | Value |
|---------|-------|
| **Host** | titania.incus |
| **Domain** | ouranos.helu.ca |
| **Wildcard** | *.ouranos.helu.ca |
| **Email** | webmaster@helu.ca |
| **HAProxy** | Port 443 (HTTPS), Port 80 (HTTP redirect) |
| **Renewal** | Twice daily, automatic HAProxy reload |

### Other Deployments

The playbook can be deployed to any host with HAProxy. See the example configuration for hippocamp.helu.ca (d.helu.ca domain) below.

## Prerequisites

1. **Namecheap API Access** enabled on your account
2. **Namecheap API key** generated
3. **IP whitelisted** in Namecheap API settings
4. **Ansible Vault** configured with Namecheap credentials

## Setup

### 1. Add Secrets to Ansible Vault

Add Namecheap credentials to `ansible/inventory/group_vars/all/vault.yml`:

```bash
ansible-vault edit inventory/group_vars/all/vault.yml
```

Add the following variables:

```yaml
vault_namecheap_username: "your_namecheap_username"
vault_namecheap_api_key: "your_namecheap_api_key"
```

Map these in `inventory/group_vars/all/vars.yml`:

```yaml
namecheap_username: "{{ vault_namecheap_username }}"
namecheap_api_key: "{{ vault_namecheap_api_key }}"
```

### 2. Configure Host Variables

For Titania, the configuration is in `inventory/host_vars/titania.incus.yml`:

```yaml
services:
  - certbot
  - haproxy
  # ...

certbot_email: webmaster@helu.ca
certbot_cert_name: ouranos.helu.ca
certbot_domains:
  - "*.ouranos.helu.ca"
  - "ouranos.helu.ca"
```

### 3. Deploy

```bash
cd ansible
ansible-playbook certbot/deploy.yml --limit titania.incus
```

## Files Created

| Path | Purpose |
|------|---------|
| `/srv/certbot/.venv/` | Python virtualenv with certbot |
| `/srv/certbot/config/` | Certbot configuration and certificates |
| `/srv/certbot/credentials/namecheap.ini` | Namecheap API credentials (600 perms) |
| `/srv/certbot/hooks/renewal-hook.sh` | Post-renewal script |
| `/srv/certbot/hooks/cert-metrics.sh` | Prometheus metrics script |
| `/etc/haproxy/certs/ouranos.helu.ca.pem` | Combined cert for HAProxy (Titania) |
| `/etc/systemd/system/certbot-renew.service` | Renewal service unit |
| `/etc/systemd/system/certbot-renew.timer` | Twice-daily renewal timer |

## Renewal Process

1. Systemd timer triggers at 00:00 and 12:00 (with random delay up to 1 hour)
2. Certbot checks if certificate needs renewal (within 30 days of expiry)
3. If renewal needed:
   - Creates DNS TXT record via Namecheap API
   - Waits 120 seconds for propagation
   - Validates and downloads new certificate
   - Runs `renewal-hook.sh`
4. Renewal hook:
   - Combines fullchain + privkey into HAProxy format
   - Reloads HAProxy via `docker compose kill -s HUP haproxy`
   - Updates Prometheus metrics

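The combine step of the renewal hook can be sketched as below. This is a hypothetical illustration, not the deployed `renewal-hook.sh`; the function is parameterized and run against throwaway files so the demo is runnable anywhere, while the real hook operates on the `/srv/certbot/...` and `/etc/haproxy/certs/...` paths from the table above.

```shell
#!/usr/bin/env bash
# Sketch of the fullchain+privkey combine step (hypothetical, not the real hook).
set -euo pipefail

combine_for_haproxy() {
  # HAProxy expects a single PEM: full chain first, then the private key.
  local live_dir="$1" out_pem="$2"
  cat "${live_dir}/fullchain.pem" "${live_dir}/privkey.pem" > "${out_pem}.tmp"
  chmod 600 "${out_pem}.tmp"
  # Atomic rename so HAProxy never reads a half-written file
  mv "${out_pem}.tmp" "${out_pem}"
}

# Demo with throwaway files standing in for the Let's Encrypt output
demo=$(mktemp -d)
printf -- '-----FULLCHAIN-----\n' > "${demo}/fullchain.pem"
printf -- '-----PRIVKEY-----\n'  > "${demo}/privkey.pem"
combine_for_haproxy "${demo}" "${demo}/ouranos.helu.ca.pem"
cat "${demo}/ouranos.helu.ca.pem"
# The real hook would then reload HAProxy:
#   (cd /srv/haproxy && docker compose kill -s HUP haproxy)
```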
## Prometheus Metrics

Metrics written to `/var/lib/prometheus/node-exporter/ssl_cert.prom`:

| Metric | Description |
|--------|-------------|
| `ssl_certificate_expiry_timestamp` | Unix timestamp when cert expires |
| `ssl_certificate_expiry_seconds` | Seconds until cert expires |
| `ssl_certificate_valid` | 1 if valid, 0 if expired/missing |

Example alert rule:

```yaml
- alert: SSLCertificateExpiringSoon
  expr: ssl_certificate_expiry_seconds < 604800  # 7 days
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expiring soon"
    description: "Certificate for {{ $labels.domain }} expires in {{ $value | humanizeDuration }}"
```

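A textfile like the one above can be produced from the PEM with a few lines of shell. The following is a hypothetical sketch of what `cert-metrics.sh` might look like (the metric names come from the table above; the function body and demo are assumptions, and the demo runs against a throwaway self-signed certificate rather than the deployed one):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a textfile-collector metrics script (GNU date assumed).
set -euo pipefail

write_cert_metrics() {
  local pem="$1" domain="$2" out="$3"
  local expiry expiry_ts=0 now remaining=0 valid=0
  if expiry=$(openssl x509 -enddate -noout -in "${pem}" 2>/dev/null); then
    # openssl prints e.g. "notAfter=May  9 12:00:00 2026 GMT"
    expiry_ts=$(date -d "${expiry#notAfter=}" +%s)
    now=$(date +%s)
    remaining=$(( expiry_ts - now ))
    [ "${remaining}" -gt 0 ] && valid=1
  fi
  {
    echo "ssl_certificate_expiry_timestamp{domain=\"${domain}\"} ${expiry_ts}"
    echo "ssl_certificate_expiry_seconds{domain=\"${domain}\"} ${remaining}"
    echo "ssl_certificate_valid{domain=\"${domain}\"} ${valid}"
  } > "${out}.tmp"
  # Atomic rename so node_exporter never scrapes a partial file
  mv "${out}.tmp" "${out}"
}

# Demo against a throwaway self-signed certificate (30-day validity)
demo=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 30 -subj "/CN=demo.example" \
  -keyout "${demo}/key.pem" -out "${demo}/cert.pem" 2>/dev/null
write_cert_metrics "${demo}/cert.pem" "demo.example" "${demo}/ssl_cert.prom"
cat "${demo}/ssl_cert.prom"
```

The deployed script would point `out` at `/var/lib/prometheus/node-exporter/ssl_cert.prom` and the PEM at `/etc/haproxy/certs/ouranos.helu.ca.pem`.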
## Troubleshooting

### View Certificate Status

```bash
# Check certificate expiry (Titania example)
openssl x509 -enddate -noout -in /etc/haproxy/certs/ouranos.helu.ca.pem

# Check certbot certificates
sudo -u certbot /srv/certbot/.venv/bin/certbot certificates \
  --config-dir /srv/certbot/config
```

### Manual Renewal Test

```bash
# Dry run renewal
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
  --config-dir /srv/certbot/config \
  --work-dir /srv/certbot/work \
  --logs-dir /srv/certbot/logs \
  --dry-run

# Force renewal (if needed)
sudo -u certbot /srv/certbot/.venv/bin/certbot renew \
  --config-dir /srv/certbot/config \
  --work-dir /srv/certbot/work \
  --logs-dir /srv/certbot/logs \
  --force-renewal
```

### Check Systemd Timer

```bash
# Timer status
systemctl status certbot-renew.timer

# Last run
journalctl -u certbot-renew.service --since "1 day ago"

# List timers
systemctl list-timers certbot-renew.timer
```

### DNS Propagation Issues

If certificate requests fail due to DNS propagation:

1. Check the Namecheap API is accessible
2. Verify the IP is whitelisted
3. Increase the propagation wait time (default 120s)
4. Check certbot logs: `/srv/certbot/logs/letsencrypt.log`

## Related Playbooks

- `haproxy/deploy.yml` - Depends on certificate from certbot
- `prometheus/node_deploy.yml` - Deploys node_exporter for metrics collection

1275
docs/django_mcp_standards.html
Normal file
File diff suppressed because it is too large
Load Diff
505
docs/documentation_style_guide.html
Normal file
@@ -0,0 +1,505 @@

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Documentation Style Guide</title>
    <!-- Bootstrap CSS -->
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet">
    <!-- Bootstrap Icons -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css">
    <style>
        /* Smooth scrolling */
        html {
            scroll-behavior: smooth;
        }

        /* Icon styling */
        .section-icon {
            margin-right: 0.5rem;
            color: var(--bs-primary);
        }

        .alert-icon {
            margin-right: 0.5rem;
            font-size: 1.2rem;
            vertical-align: middle;
        }

        /* Scroll to top button */
        #scrollTopBtn {
            position: fixed;
            bottom: 20px;
            right: 20px;
            z-index: 1000;
            display: none;
            border-radius: 50%;
            width: 50px;
            height: 50px;
            box-shadow: 0 2px 10px rgba(0,0,0,0.3);
        }

        /* Icon legend */
        .icon-legend {
            display: inline-flex;
            align-items: center;
            margin-right: 1.5rem;
            margin-bottom: 0.5rem;
        }

        .icon-legend i {
            margin-right: 0.5rem;
            font-size: 1.2rem;
        }
    </style>
</head>
<body>
<div class="container-fluid">
    <nav class="navbar navbar-dark bg-dark rounded mb-4">
        <div class="container-fluid">
            <a class="navbar-brand" href="agathos.html">
                <i class="bi bi-arrow-left"></i> Back to Main Documentation
            </a>
            <div class="navbar-nav d-flex flex-row">
                <a class="nav-link me-3" href="#philosophy"><i class="bi bi-book"></i> Philosophy</a>
                <a class="nav-link me-3" href="#structure"><i class="bi bi-diagram-3"></i> Structure</a>
                <a class="nav-link me-3" href="#visual-design"><i class="bi bi-palette"></i> Design</a>
                <a class="nav-link me-3" href="#bootstrap-icons"><i class="bi bi-bootstrap"></i> Icons</a>
                <a class="nav-link" href="#implementation"><i class="bi bi-gear"></i> Implementation</a>
            </div>
        </div>
    </nav>

    <nav aria-label="breadcrumb">
        <ol class="breadcrumb">
            <li class="breadcrumb-item"><a href="agathos.html"><i class="bi bi-house-door"></i> Main Documentation</a></li>
            <li class="breadcrumb-item active" aria-current="page">Style Guide</li>
        </ol>
    </nav>

    <div class="row">
        <div class="col-12">
            <h1 class="display-4 mb-4">
                <i class="bi bi-journal-code section-icon"></i>Documentation Style Guide
                <span class="badge bg-success"><i class="bi bi-check-circle-fill"></i> Complete</span>
            </h1>
            <p class="lead">This guide explains the approach and principles used to create comprehensive HTML documentation for infrastructure and software projects.</p>
        </div>
    </div>

    <!-- Icon Legend -->
    <div class="alert alert-light border">
        <h5><i class="bi bi-info-circle"></i> Icon Legend</h5>
        <div class="d-flex flex-wrap">
            <span class="icon-legend"><i class="bi bi-exclamation-triangle-fill text-danger"></i>Critical/Danger</span>
            <span class="icon-legend"><i class="bi bi-exclamation-circle-fill text-warning"></i>Warning/Important</span>
            <span class="icon-legend"><i class="bi bi-check-circle-fill text-success"></i>Success/Complete</span>
            <span class="icon-legend"><i class="bi bi-info-circle-fill text-info"></i>Information</span>
            <span class="icon-legend"><i class="bi bi-lightning-fill text-primary"></i>Active/Key</span>
            <span class="icon-legend"><i class="bi bi-link-45deg text-secondary"></i>Integration</span>
        </div>
    </div>

    <section id="philosophy" class="mb-5">
        <h2 class="h2 mb-4"><i class="bi bi-book section-icon"></i>Philosophy</h2>

        <div class="row g-4">
            <div class="col-lg-4">
                <div class="card h-100">
                    <div class="card-body">
                        <h3 class="card-title text-primary">
                            <i class="bi bi-diagram-3"></i> Documentation as Architecture
                        </h3>
                        <p>Documentation should mirror and reinforce the software architecture. Each component gets its own focused document that clearly explains its purpose, boundaries, and relationships.</p>
                    </div>
                </div>
            </div>

            <div class="col-lg-4">
                <div class="card h-100">
                    <div class="card-body">
                        <h3 class="card-title text-primary">
                            <i class="bi bi-people"></i> User-Centric Design
                        </h3>
                        <p>Documentation serves multiple audiences:</p>
                        <ul>
                            <li><i class="bi bi-code-slash"></i> <strong>Developers</strong> need technical details and implementation guidance</li>
                            <li><i class="bi bi-briefcase"></i> <strong>Stakeholders</strong> need high-level overviews and business context</li>
                            <li><i class="bi bi-award"></i> <strong>Red Panda</strong> needs approval checkpoints and critical decisions highlighted</li>
                        </ul>
                    </div>
                </div>
            </div>

            <div class="col-lg-4">
                <div class="card h-100">
                    <div class="card-body">
                        <h3 class="card-title text-primary">
                            <i class="bi bi-arrow-repeat"></i> Living Documentation
                        </h3>
                        <p>Documentation evolves with the codebase and captures both current state and architectural decisions.</p>
                    </div>
                </div>
            </div>
        </div>
    </section>

    <section id="structure" class="mb-5">
        <h2 class="h2 mb-4"><i class="bi bi-diagram-3 section-icon"></i>Structure Principles</h2>

        <div class="alert alert-info border-start border-4 border-info">
            <h3><i class="bi bi-info-circle-fill alert-icon"></i>1. Hierarchical Information Architecture</h3>
            <pre class="mb-0">Main Documentation (project.html)
├── Component Docs (component1.html, component2.html, etc.)
├── Standards References (docs/standards/)
└── Supporting Materials (README.md, style guides)</pre>
        </div>

        <div class="alert alert-warning border-start border-4 border-warning">
            <h3><i class="bi bi-exclamation-circle-fill alert-icon"></i>2. Consistent Navigation</h3>
            <p>Every document includes:</p>
            <ul class="mb-0">
                <li><i class="bi bi-compass"></i> <strong>Navigation bar</strong> with key sections</li>
                <li><i class="bi bi-link-45deg"></i> <strong>Cross-references</strong> to related components</li>
                <li><i class="bi bi-arrow-return-left"></i> <strong>Return links</strong> to main documentation</li>
            </ul>
        </div>

        <div class="alert alert-info border-start border-4 border-info">
            <h3><i class="bi bi-info-circle-fill alert-icon"></i>3. Progressive Disclosure</h3>
            <p>Information flows from general to specific:</p>
            <p class="mb-0"><strong><i class="bi bi-arrow-right-short"></i> Overview → Architecture → Implementation → Details</strong></p>
        </div>
    </section>

    <section id="visual-design" class="mb-5">
        <h2 class="h2 mb-4"><i class="bi bi-palette section-icon"></i>Visual Design Principles</h2>

        <div class="alert alert-info border-start border-4 border-info">
            <h3><i class="bi bi-info-circle-fill alert-icon"></i>1. Clean Typography</h3>
            <ul class="mb-0">
                <li><i class="bi bi-fonts"></i> System fonts for readability</li>
                <li><i class="bi bi-text-paragraph"></i> Generous line spacing (1.6)</li>
                <li><i class="bi bi-list-nested"></i> Clear hierarchy with consistent heading sizes</li>
            </ul>
        </div>

        <div class="alert alert-danger border-start border-4 border-danger">
            <h3><i class="bi bi-exclamation-triangle-fill alert-icon"></i>2. Color-Coded Information Types</h3>
            <p><strong><i class="bi bi-bootstrap"></i> Bootstrap Alert Classes (Preferred):</strong></p>
            <ul class="mb-3">
                <li><i class="bi bi-exclamation-triangle-fill text-danger"></i> <code>alert alert-danger</code> - Critical decisions requiring immediate attention</li>
                <li><i class="bi bi-exclamation-circle-fill text-warning"></i> <code>alert alert-warning</code> - Important context and warnings</li>
                <li><i class="bi bi-check-circle-fill text-success"></i> <code>alert alert-success</code> - Completed features and positive outcomes</li>
                <li><i class="bi bi-info-circle-fill text-info"></i> <code>alert alert-info</code> - Technical architecture information</li>
                <li><i class="bi bi-lightning-fill text-primary"></i> <code>alert alert-primary</code> - Key workflows and processes</li>
                <li><i class="bi bi-link-45deg text-secondary"></i> <code>alert alert-secondary</code> - Cross-component integration details</li>
            </ul>
            <p><strong>Legacy Custom Classes (Backward Compatible):</strong></p>
            <ul class="mb-0">
                <li><i class="bi bi-circle-fill text-info"></i> <strong>.tech-stack</strong> - Technical architecture information</li>
                <li><i class="bi bi-circle-fill text-warning"></i> <strong>.critical</strong> - Important decisions requiring attention</li>
                <li><i class="bi bi-circle-fill text-primary"></i> <strong>.workflow</strong> - Process and workflow information</li>
                <li><i class="bi bi-circle-fill text-secondary"></i> <strong>.integration</strong> - Cross-component integration details</li>
            </ul>
        </div>

        <div class="alert alert-info border-start border-4 border-info">
            <h3><i class="bi bi-info-circle-fill alert-icon"></i>3. Responsive Layout</h3>
            <ul class="mb-0">
                <li><i class="bi bi-phone"></i> Bootstrap grid system for all screen sizes</li>
                <li><i class="bi bi-grid-3x3"></i> Consistent spacing with utility classes</li>
                <li><i class="bi bi-card-list"></i> Card-based information grouping</li>
            </ul>
        </div>
    </section>

    <section id="bootstrap-icons" class="mb-5">
        <h2 class="h2 mb-4"><i class="bi bi-bootstrap section-icon"></i>Bootstrap Icons Integration</h2>

        <div class="alert alert-success border-start border-4 border-success">
            <h3><i class="bi bi-check-circle-fill alert-icon"></i>Setup</h3>
            <p>Add Bootstrap Icons CDN to your HTML documents:</p>
            <div class="bg-light p-3 rounded my-2">
                <code>&lt;link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css"&gt;</code>
            </div>
            <p class="mt-3"><strong>Benefits:</strong></p>
            <ul class="mb-0">
                <li><i class="bi bi-lightning-charge"></i> Minimal overhead (~75KB)</li>
                <li><i class="bi bi-palette"></i> 2000+ icons matching Bootstrap design</li>
                <li><i class="bi bi-cloud-download"></i> CDN caching for fast loading</li>
            </ul>
        </div>

        <div class="alert alert-info border-start border-4 border-info">
            <h3><i class="bi bi-info-circle-fill alert-icon"></i>Common Icon Patterns</h3>
            <div class="row g-3">
                <div class="col-md-6">
                    <h5><i class="bi bi-check-square"></i> Status &amp; Progress</h5>
                    <ul>
                        <li><i class="bi bi-check-square"></i> <code>bi-check-square</code> - Completed</li>
                        <li><i class="bi bi-square"></i> <code>bi-square</code> - Pending</li>
                        <li><i class="bi bi-hourglass-split"></i> <code>bi-hourglass-split</code> - In Progress</li>
                        <li><i class="bi bi-x-circle"></i> <code>bi-x-circle</code> - Failed/Error</li>
                    </ul>
                </div>
                <div class="col-md-6">
                    <h5><i class="bi bi-compass"></i> Navigation</h5>
                    <ul>
                        <li><i class="bi bi-house-door"></i> <code>bi-house-door</code> - Home</li>
                        <li><i class="bi bi-arrow-left"></i> <code>bi-arrow-left</code> - Back</li>
                        <li><i class="bi bi-box-arrow-up-right"></i> <code>bi-box-arrow-up-right</code> - External</li>
                        <li><i class="bi bi-link-45deg"></i> <code>bi-link-45deg</code> - Link</li>
                    </ul>
                </div>
                <div class="col-md-6">
                    <h5><i class="bi bi-exclamation-triangle"></i> Alerts</h5>
                    <ul>
                        <li><i class="bi bi-exclamation-triangle-fill text-danger"></i> <code>bi-exclamation-triangle-fill</code> - Danger</li>
                        <li><i class="bi bi-exclamation-circle-fill text-warning"></i> <code>bi-exclamation-circle-fill</code> - Warning</li>
                        <li><i class="bi bi-info-circle-fill text-info"></i> <code>bi-info-circle-fill</code> - Info</li>
                        <li><i class="bi bi-check-circle-fill text-success"></i> <code>bi-check-circle-fill</code> - Success</li>
                    </ul>
                </div>
                <div class="col-md-6">
                    <h5><i class="bi bi-code-slash"></i> Technical</h5>
                    <ul>
                        <li><i class="bi bi-code-slash"></i> <code>bi-code-slash</code> - Code</li>
                        <li><i class="bi bi-database"></i> <code>bi-database</code> - Database</li>
                        <li><i class="bi bi-cpu"></i> <code>bi-cpu</code> - System</li>
                        <li><i class="bi bi-plug"></i> <code>bi-plug</code> - API/Integration</li>
                    </ul>
                </div>
            </div>
        </div>

        <div class="alert alert-primary border-start border-4 border-primary">
            <h3><i class="bi bi-lightning-fill alert-icon"></i>Usage Examples</h3>

            <h5 class="mt-3">Section Headers with Icons</h5>
            <div class="bg-light p-3 rounded my-2">
                <code>&lt;h2&gt;&lt;i class="bi bi-book section-icon"&gt;&lt;/i&gt;Section Title&lt;/h2&gt;</code>
            </div>

            <h5 class="mt-3">Alert Boxes with Icons</h5>
            <div class="bg-light p-3 rounded my-2">
                <code>&lt;div class="alert alert-info border-start border-4 border-info"&gt;<br>
                &lt;h3&gt;&lt;i class="bi bi-info-circle-fill alert-icon"&gt;&lt;/i&gt;Information&lt;/h3&gt;<br>
                &lt;/div&gt;</code>
            </div>

            <h5 class="mt-3">Badges with Icons</h5>
            <div class="bg-light p-3 rounded my-2">
                <code>&lt;span class="badge bg-success"&gt;&lt;i class="bi bi-check-circle-fill"&gt;&lt;/i&gt; Complete&lt;/span&gt;</code>
            </div>

            <h5 class="mt-3">List Items with Icons</h5>
            <div class="bg-light p-3 rounded my-2">
                <code>&lt;li&gt;&lt;i class="bi bi-check-circle"&gt;&lt;/i&gt; Completed task&lt;/li&gt;<br>
                &lt;li&gt;&lt;i class="bi bi-arrow-right-short"&gt;&lt;/i&gt; Action item&lt;/li&gt;</code>
            </div>
        </div>

        <div class="alert alert-warning border-start border-4 border-warning">
            <h3><i class="bi bi-exclamation-circle-fill alert-icon"></i>Best Practices</h3>
            <ul class="mb-0">
                <li><i class="bi bi-check-circle"></i> Use semantic icons that match content meaning</li>
                <li><i class="bi bi-palette"></i> Maintain consistent icon usage across documents</li>
                <li><i class="bi bi-eye"></i> Don't overuse icons - they should enhance, not clutter</li>
                <li><i class="bi bi-phone"></i> Ensure icons are visible and meaningful at all screen sizes</li>
                <li><i class="bi bi-universal-access"></i> Icons should supplement text, not replace it (accessibility)</li>
            </ul>
        </div>
    </section>

<section id="implementation" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-gear section-icon"></i>Implementation Guidelines</h2>

<div class="alert alert-success border-start border-4 border-success">
<h3><i class="bi bi-check-circle-fill alert-icon"></i>HTML Document Template</h3>
<pre class="mb-0"><!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document Title</title>
    <!-- Bootstrap CSS -->
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet">
    <!-- Bootstrap Icons -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css">
</head>
<body>
    <div class="container-fluid">
        <!-- Navigation -->
        <nav class="navbar navbar-dark bg-dark rounded mb-4">
            <a class="navbar-brand" href="main.html">
                <i class="bi bi-arrow-left"></i> Back
            </a>
        </nav>

        <!-- Breadcrumb -->
        <nav aria-label="breadcrumb">
            <ol class="breadcrumb">
                <li class="breadcrumb-item"><a href="main.html"><i class="bi bi-house-door"></i> Main</a></li>
                <li class="breadcrumb-item active">Current Page</li>
            </ol>
        </nav>

        <!-- Content -->
        <h1><i class="bi bi-journal-code"></i> Page Title</h1>

        <!-- Sections -->
    </div>

    <!-- Bootstrap JS -->
    <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>

    <!-- Dark mode support -->
    <script>
        if (window.matchMedia('(prefers-color-scheme: dark)').matches) {
            document.documentElement.setAttribute('data-bs-theme', 'dark');
        }
    </script>
</body>
</html></pre>
</div>

<div class="alert alert-info border-start border-4 border-info">
<h3><i class="bi bi-info-circle-fill alert-icon"></i>Dark Mode Support</h3>
<p>Bootstrap 5.3+ includes built-in dark mode support. Add this script to automatically detect system preferences:</p>
<div class="bg-light p-3 rounded my-2">
<code><script><br>
if (window.matchMedia('(prefers-color-scheme: dark)').matches) {<br>
document.documentElement.setAttribute('data-bs-theme', 'dark');<br>
}<br>
</script></code>
</div>
</div>

<div class="alert alert-primary border-start border-4 border-primary">
<h3><i class="bi bi-lightning-fill alert-icon"></i>Scroll to Top Button</h3>
<p>Add a floating button for easy navigation in long documents:</p>
<div class="bg-light p-3 rounded my-2">
<code><button id="scrollTopBtn" class="btn btn-primary"><br>
<i class="bi bi-arrow-up-circle"></i><br>
</button><br><br>
<script><br>
window.onscroll = function() {<br>
if (document.documentElement.scrollTop > 300) {<br>
document.getElementById('scrollTopBtn').style.display = 'block';<br>
} else {<br>
document.getElementById('scrollTopBtn').style.display = 'none';<br>
}<br>
};<br>
document.getElementById('scrollTopBtn').onclick = function() {<br>
window.scrollTo({top: 0, behavior: 'smooth'});<br>
};<br>
</script></code>
</div>
</div>
</section>

<section id="quality-standards" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-award section-icon"></i>Quality Standards</h2>

<div class="progress mb-4">
<div class="progress-bar bg-success" style="width: 100%">
<i class="bi bi-check-circle-fill"></i> Style Guide Implementation: 100% Complete
</div>
</div>

<div class="row g-4">
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-check-circle"></i> Technical Accuracy
</h3>
<ul class="mb-0">
<li>All code examples must work</li>
<li>All URLs must be valid</li>
<li>All relationships must be correct</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-eye"></i> Clarity and Completeness
</h3>
<ul class="mb-0">
<li>Each section serves a specific purpose</li>
<li>Information is neither duplicated nor missing</li>
<li>Cross-references are accurate</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">
<i class="bi bi-stars"></i> Professional Presentation
</h3>
<ul class="mb-0">
<li>Consistent formatting throughout</li>
<li>Clean visual hierarchy</li>
<li>Responsive design for all devices</li>
</ul>
</div>
</div>
</div>
</div>
</section>

<div class="alert alert-secondary border-start border-4 border-secondary">
<p class="mb-0">
<i class="bi bi-award"></i> <strong>This style guide ensures consistent, professional, and maintainable documentation that serves both technical and business needs while supporting the long-term success of your projects.</strong>
</p>
</div>

<!-- Scroll to top button -->
<button id="scrollTopBtn" class="btn btn-primary" title="Scroll to top">
<i class="bi bi-arrow-up-circle"></i>
</button>
</div>

<!-- Bootstrap JS -->
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>

<!-- Dark mode support -->
<script>
// Detect system preference and apply dark mode
if (window.matchMedia('(prefers-color-scheme: dark)').matches) {
    document.documentElement.setAttribute('data-bs-theme', 'dark');
}

// Listen for changes in system preference
window.matchMedia('(prefers-color-scheme: dark)').addEventListener('change', function(e) {
    if (e.matches) {
        document.documentElement.setAttribute('data-bs-theme', 'dark');
    } else {
        document.documentElement.setAttribute('data-bs-theme', 'light');
    }
});

// Scroll to top button functionality
window.onscroll = function() {
    const scrollBtn = document.getElementById('scrollTopBtn');
    if (document.body.scrollTop > 300 || document.documentElement.scrollTop > 300) {
        scrollBtn.style.display = 'block';
    } else {
        scrollBtn.style.display = 'none';
    }
};

document.getElementById('scrollTopBtn').onclick = function() {
    window.scrollTo({top: 0, behavior: 'smooth'});
};
</script>
</body>
</html>
386
docs/gitea.md
Normal file
# Gitea - Git with a Cup of Tea

## Overview
Gitea is a lightweight, self-hosted Git service providing a GitHub-like web interface with repository management, issue tracking, pull requests, and code review capabilities. Deployed on **Rosalind** with a PostgreSQL backend on Portia and Memcached caching.

**Host:** rosalind.incus
**Role:** Collaboration (PHP, Go, Node.js runtimes)
**Container Ports:** 22083 (HTTP), 22022 (SSH), 22094 (Metrics)
**External Access:** https://gitea.ouranos.helu.ca/ (via HAProxy on Titania)
**SSH Access:** `ssh -p 22022 git@gitea.ouranos.helu.ca` (TCP passthrough via HAProxy)

## Architecture

```
┌──────────┐      ┌────────────┐      ┌──────────┐      ┌───────────┐
│  Client  │─────▶│  HAProxy   │─────▶│  Gitea   │─────▶│PostgreSQL │
│          │      │ (Titania)  │      │(Rosalind)│      │ (Portia)  │
└──────────┘      └────────────┘      └──────────┘      └───────────┘
                                           │
                                           ▼
                                     ┌───────────┐
                                     │ Memcached │
                                     │  (Local)  │
                                     └───────────┘
```

## Deployment

### Playbook

```bash
cd ansible
ansible-playbook gitea/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `gitea/deploy.yml` | Main deployment playbook |
| `gitea/app.ini.j2` | Gitea configuration template |

### Deployment Steps

1. **Install Dependencies**: git, git-lfs, curl, memcached
2. **Create System User**: `git:git` with home directory
3. **Create Directories**: Work dir, data, LFS storage, repository root, logs
4. **Download Gitea Binary**: Latest release from GitHub (architecture-specific)
5. **Template Configuration**: Apply `app.ini.j2` with variables
6. **Create Systemd Service**: Custom service unit for Gitea
7. **Start Service**: Enable and start gitea.service
8. **Configure OAuth2**: Register Casdoor as OpenID Connect provider
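
The unit created in step 6 is not reproduced in this repo view; a minimal sketch, assuming the binary lands at `/usr/local/bin/gitea` and using the paths documented under Host Variables (the actual templated unit may differ):

```ini
[Unit]
Description=Gitea (Git with a Cup of Tea)
After=network.target

[Service]
Type=simple
User=git
Group=git
WorkingDirectory=/var/lib/gitea
Environment=GITEA_WORK_DIR=/var/lib/gitea
ExecStart=/usr/local/bin/gitea web --config /etc/gitea/app.ini
Restart=always

[Install]
WantedBy=multi-user.target
```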
## Configuration

### Key Features

- **Git LFS Support**: Large file storage enabled
- **SSH Server**: Built-in SSH server on port 22022
- **Prometheus Metrics**: Metrics endpoint on port 22094
- **Memcached Caching**: Session and cache storage via local Memcached
- **Repository Settings**: Push-to-create, all units enabled
- **Security**: Argon2 password hashing, reverse proxy trusted

### Storage Locations

| Path | Purpose | Owner |
|------|---------|-------|
| `/var/lib/gitea` | Working directory | git:git |
| `/var/lib/gitea/data` | Application data | git:git |
| `/var/lib/gitea/data/lfs` | Git LFS objects | git:git |
| `/mnt/dv` | Git repositories | git:git |
| `/var/log/gitea` | Application logs | git:git |
| `/etc/gitea` | Configuration files | root:git |

### Logging

- **Console Output**: Info level to systemd journal
- **File Logs**: `/var/log/gitea/gitea.log`
- **Rotation**: Daily rotation, 7-day retention
- **SSH Logs**: Enabled for debugging

## Access After Deployment

1. **Web Interface**: https://gitea.ouranos.helu.ca/
2. **First-Time Setup**: Create admin account on first visit
3. **Git Clone**:
   ```bash
   git clone https://gitea.ouranos.helu.ca/username/repo.git
   ```
4. **SSH Clone**:
   ```bash
   git clone git@gitea.ouranos.helu.ca:username/repo.git
   ```
   Note: SSH requires port 22022 configured in `~/.ssh/config`
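
A minimal `~/.ssh/config` entry for the non-standard port (hostname per this deployment; adjust to taste):

```
Host gitea.ouranos.helu.ca
    Port 22022
    User git
```

With this in place, the plain `git clone git@gitea.ouranos.helu.ca:username/repo.git` form works without specifying the port.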

## Monitoring

### Alloy Configuration
**File:** `ansible/alloy/rosalind/config.alloy.j2`

- **Log Collection**: `/var/log/gitea/gitea.log` → Loki
- **Metrics**: Port 22094 → Prometheus (token-protected)
- **System Metrics**: Process exporter tracks Gitea process

### Metrics Endpoint
- **URL**: `http://rosalind.incus:22083/metrics`
- **Authentication**: Bearer token required (`vault_gitea_metrics_token`)
- **Note**: Metrics are exposed on the main web port, not a separate metrics port

## Required Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

### 1. Database Password
```yaml
vault_gitea_db_password: "YourSecurePassword123!"
```
**Requirements:**
- Minimum 12 characters recommended
- Used by PostgreSQL authentication

### 2. Secret Key (Session Encryption)
```yaml
vault_gitea_secret_key: "RandomString64CharactersLongForSessionCookieEncryptionSecurity123"
```
**Requirements:**
- **Length**: Recommended 64+ characters
- **Format**: Base64 or hex string
- **Generation**:
  ```bash
  openssl rand -base64 48
  ```

### 3. LFS JWT Secret
```yaml
vault_gitea_lfs_jwt_secret: "AnotherRandomString64CharsForLFSJWTTokenSigning1234567890ABC"
```
**Requirements:**
- **Length**: Recommended 64+ characters
- **Purpose**: Signs JWT tokens for Git LFS authentication
- **Generation**:
  ```bash
  openssl rand -base64 48
  ```

### 4. Metrics Token
```yaml
vault_gitea_metrics_token: "RandomTokenForPrometheusMetricsAccess123"
```
**Requirements:**
- **Length**: 32+ characters recommended
- **Purpose**: Bearer token for Prometheus scraping
- **Generation**:
  ```bash
  openssl rand -hex 32
  ```

### 5. OAuth Client ID
```yaml
vault_gitea_oauth_client_id: "gitea-oauth-client"
```
**Requirements:**
- **Purpose**: Client ID for Casdoor OAuth2 application
- **Source**: Must match `clientId` in Casdoor application configuration

### 6. OAuth Client Secret
```yaml
vault_gitea_oauth_client_secret: "YourRandomOAuthSecret123!"
```
**Requirements:**
- **Length**: 32+ characters recommended
- **Purpose**: Client secret for Casdoor OAuth2 authentication
- **Generation**:
  ```bash
  openssl rand -base64 32
  ```
- **Source**: Must match `clientSecret` in Casdoor application configuration
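
The individual `openssl` commands above can be combined into one pass; a sketch that prints vault-ready YAML (variable names match the sections above):

```shell
# Generate all locally-created Gitea secrets and print vault-ready YAML.
# OAuth client ID/secret ultimately must match the Casdoor application config.
DB_PASSWORD=$(openssl rand -base64 24)          # 32 chars, for PostgreSQL auth
SECRET_KEY=$(openssl rand -base64 48)           # 64 chars, session encryption
LFS_JWT_SECRET=$(openssl rand -base64 48)       # 64 chars, LFS JWT signing
METRICS_TOKEN=$(openssl rand -hex 32)           # 64 hex chars, metrics bearer token
OAUTH_CLIENT_SECRET=$(openssl rand -base64 32)  # 44 chars, OAuth2 client secret

printf 'vault_gitea_db_password: "%s"\n' "$DB_PASSWORD"
printf 'vault_gitea_secret_key: "%s"\n' "$SECRET_KEY"
printf 'vault_gitea_lfs_jwt_secret: "%s"\n' "$LFS_JWT_SECRET"
printf 'vault_gitea_metrics_token: "%s"\n' "$METRICS_TOKEN"
printf 'vault_gitea_oauth_client_secret: "%s"\n' "$OAUTH_CLIENT_SECRET"
```

Paste the output into the vault file via `ansible-vault edit`.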

## Host Variables

**File:** `ansible/inventory/host_vars/rosalind.incus.yml`

```yaml
# Gitea User and Directories
gitea_user: git
gitea_group: git
gitea_work_dir: /var/lib/gitea
gitea_data_dir: /var/lib/gitea/data
gitea_lfs_dir: /var/lib/gitea/data/lfs
gitea_repo_root: /mnt/dv
gitea_config_file: /etc/gitea/app.ini

# Ports
gitea_web_port: 22083
gitea_ssh_port: 22022
gitea_metrics_port: 22094

# Network
gitea_domain: ouranos.helu.ca
gitea_root_url: https://gitea.ouranos.helu.ca/

# Database Configuration
gitea_db_type: postgres
gitea_db_host: portia.incus
gitea_db_port: 5432
gitea_db_name: gitea
gitea_db_user: gitea
gitea_db_password: "{{ vault_gitea_db_password }}"
gitea_db_ssl_mode: disable

# Features
gitea_lfs_enabled: true
gitea_metrics_enabled: true

# Service Settings
gitea_disable_registration: true  # Use Casdoor SSO
gitea_require_signin_view: false

# Security (vault secrets)
gitea_secret_key: "{{ vault_gitea_secret_key }}"
gitea_lfs_jwt_secret: "{{ vault_gitea_lfs_jwt_secret }}"
gitea_metrics_token: "{{ vault_gitea_metrics_token }}"

# OAuth2 (Casdoor SSO)
gitea_oauth_enabled: true
gitea_oauth_name: "casdoor"
gitea_oauth_display_name: "Sign in with Casdoor"
gitea_oauth_client_id: "{{ vault_gitea_oauth_client_id }}"
gitea_oauth_client_secret: "{{ vault_gitea_oauth_client_secret }}"
gitea_oauth_auth_url: "https://id.ouranos.helu.ca/login/oauth/authorize"
gitea_oauth_token_url: "http://titania.incus:22081/api/login/oauth/access_token"
gitea_oauth_userinfo_url: "http://titania.incus:22081/api/userinfo"
gitea_oauth_scopes: "openid profile email"
```

## OAuth2 / Casdoor SSO

Gitea integrates with Casdoor for Single Sign-On using OpenID Connect.

### Architecture

```
┌──────────┐      ┌────────────┐      ┌──────────┐      ┌──────────┐
│ Browser  │─────▶│  HAProxy   │─────▶│  Gitea   │─────▶│ Casdoor  │
│          │      │ (Titania)  │      │(Rosalind)│      │(Titania) │
└──────────┘      └────────────┘      └──────────┘      └──────────┘
     │                                      │                 │
     │ 1. Click "Sign in with Casdoor"      │                 │
     │◀─────────────────────────────────────│                 │
     │ 2. Redirect to Casdoor login         │                 │
     │───────────────────────────────────────────────────────▶│
     │ 3. User authenticates                │                 │
     │◀───────────────────────────────────────────────────────│
     │ 4. Redirect back with auth code      │                 │
     │─────────────────────────────────────▶│                 │
     │                                      │ 5. Exchange code for token
     │                                      │────────────────▶│
     │                                      │◀────────────────│
     │ 6. User logged into Gitea            │                 │
     │◀─────────────────────────────────────│                 │
```

### Casdoor Application Configuration

A Gitea application is defined in `ansible/casdoor/init_data.json.j2`:

| Setting | Value |
|---------|-------|
| **Name** | `app-gitea` |
| **Client ID** | `vault_gitea_oauth_client_id` |
| **Redirect URI** | `https://gitea.ouranos.helu.ca/user/oauth2/casdoor/callback` |
| **Grant Types** | `authorization_code`, `refresh_token` |

### URL Strategy

| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/...` | User's browser (external) |
| **Token URL** | `http://titania.incus:22081/...` | Gitea server (internal) |
| **Userinfo URL** | `http://titania.incus:22081/...` | Gitea server (internal) |
| **Discovery URL** | `http://titania.incus:22081/.well-known/openid-configuration` | Gitea server (internal) |

The auth URL uses the external HAProxy address because the authorization redirect runs in the user's browser; the token and userinfo URLs use internal addresses for server-to-server communication.

### User Auto-Registration

With `ENABLE_AUTO_REGISTRATION = true` in `[oauth2_client]`, users who authenticate via Casdoor are automatically created in Gitea. Account linking uses `auto` mode to match by email address.
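
In `app.ini` terms this corresponds to a fragment along these lines (a sketch; see the Gitea config cheat sheet for the full set of `[oauth2_client]` keys):

```ini
[oauth2_client]
ENABLE_AUTO_REGISTRATION = true
ACCOUNT_LINKING = auto
```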
### Deployment Order

1. **Deploy Casdoor first** (if not already running):
   ```bash
   ansible-playbook casdoor/deploy.yml
   ```

2. **Deploy Gitea** (registers OAuth provider):
   ```bash
   ansible-playbook gitea/deploy.yml
   ```

### Verify OAuth Configuration

```bash
# List authentication sources
ssh rosalind.incus "sudo -u git /usr/local/bin/gitea admin auth list --config /etc/gitea/app.ini"

# Should show: casdoor (OpenID Connect)
```

## Database Setup

Gitea requires a PostgreSQL database on Portia. This is automatically created by the `postgresql/deploy.yml` playbook.

**Database Details:**
- **Name**: gitea
- **User**: gitea
- **Owner**: gitea
- **Extensions**: None required

## Integration with Other Services

### HAProxy Routing
**Backend Configuration** (`titania.incus.yml`):
```yaml
- subdomain: "gitea"
  backend_host: "rosalind.incus"
  backend_port: 22083
  health_path: "/api/healthz"
  timeout_server: 120s
```

### Memcached Integration
- **Host**: localhost:11211
- **Session Prefix**: N/A (Memcache adapter doesn't require prefix)
- **Cache Prefix**: N/A

### Prometheus Monitoring
- **Scrape Target**: `rosalind.incus:22094`
- **Job Name**: gitea
- **Authentication**: Bearer token

## Troubleshooting

### Service Status
```bash
ssh rosalind.incus
sudo systemctl status gitea
```

### View Logs
```bash
# Application logs
sudo tail -f /var/log/gitea/gitea.log

# Systemd journal
sudo journalctl -u gitea -f
```

### Test Database Connection
```bash
psql -h portia.incus -U gitea -d gitea
```

### Check Memcached
```bash
echo "stats" | nc localhost 11211
```

### Verify Metrics Endpoint
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:22094/metrics
```

## Version Information

- **Installation Method**: Binary download from GitHub releases
- **Version Selection**: Latest stable release (dynamic)
- **Update Process**: Re-run deployment playbook to fetch latest binary
- **Architecture**: linux-amd64

## References

- **Official Documentation**: https://docs.gitea.com/
- **GitHub Repository**: https://github.com/go-gitea/gitea
- **Configuration Reference**: https://docs.gitea.com/administration/config-cheat-sheet

759
docs/gitea_mcp.md
Normal file
# Gitea MCP Server - Red Panda Approved™

Model Context Protocol (MCP) server providing programmatic access to Gitea repositories, issues, and pull requests. Deployed as a Docker container on Miranda (MCP Docker Host) in the Agathos sandbox.

---

## Overview

The Gitea MCP Server exposes Gitea's functionality through the MCP protocol, enabling AI assistants and automation tools to interact with Git repositories, issues, pull requests, and other Gitea features.

| Property | Value |
|----------|-------|
| **Host** | Miranda (10.10.0.156) |
| **Service Port** | 25535 |
| **Container Port** | 8000 |
| **Transport** | HTTP |
| **Image** | `docker.gitea.com/gitea-mcp-server:latest` |
| **Gitea Instance** | https://gitea.ouranos.helu.ca |
| **Logging** | Syslog to port 51435 → Alloy → Loki |

### Purpose

- **Repository Operations**: Clone, read, and analyze repository contents
- **Issue Management**: Create, read, update, and search issues
- **Pull Request Workflow**: Manage PRs, reviews, and merges
- **Code Search**: Search across repositories and file contents
- **User/Organization Info**: Query user profiles and organization details

### Integration Points

```
AI Assistant (Cline/Claude Desktop)
    ↓ (MCP Protocol)
MCP Switchboard (Oberon)
    ↓ (HTTP)
Gitea MCP Server (Miranda:25535)
    ↓ (Gitea API)
Gitea Instance (Rosalind:22083)
```

---

## Architecture

### Deployment Model

**Container-Based**: Single Docker container managed via Docker Compose

**Directory Structure**:
```
/srv/gitea_mcp/
└── docker-compose.yml    # Container orchestration
```
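
The templated compose file is not reproduced here; a hypothetical sketch consistent with the image, ports, and syslog settings documented on this page (the environment variable names are assumptions — check the deployed `/srv/gitea_mcp/docker-compose.yml` for the real keys):

```yaml
services:
  gitea-mcp:
    image: docker.gitea.com/gitea-mcp-server:latest
    container_name: gitea-mcp
    restart: unless-stopped
    ports:
      - "25535:8000"                                  # gitea_mcp_port -> container port
    environment:
      GITEA_HOST: "https://gitea.ouranos.helu.ca"     # gitea_mcp_host (assumed var name)
      GITEA_ACCESS_TOKEN: "<vault_gitea_mcp_access_token>"
    logging:
      driver: syslog
      options:
        syslog-address: "tcp://127.0.0.1:51435"       # gitea_mcp_syslog_port
        syslog-format: "rfc5424"
        tag: "gitea-mcp"
```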

**System Integration**:
- **User/Group**: `gitea_mcp:gitea_mcp` (system user)
- **Ansible User Access**: Remote user added to gitea_mcp group
- **Permissions**: Directory mode 750, compose file mode 550

### Network Configuration

| Component | Port | Protocol | Purpose |
|-----------|------|----------|---------|
| External Access | 25535 | HTTP | MCP protocol endpoint |
| Container Internal | 8000 | HTTP | Service listening port |
| Syslog | 51435 | TCP | Log forwarding to Alloy |

### Logging Pipeline

```
Gitea MCP Container
    ↓ (Docker syslog driver)
Local Syslog (127.0.0.1:51435)
    ↓ (Alloy collection)
Loki (Prospero)
    ↓ (Grafana queries)
Grafana Dashboards
```

**Log Format**: RFC5424 (syslog_format variable)
**Log Tag**: `gitea-mcp`

---

## Prerequisites

### Infrastructure Requirements

1. **Miranda Host**: Docker engine installed and running
2. **Gitea Instance**: Accessible Gitea server (gitea.ouranos.helu.ca)
3. **Access Token**: Gitea personal access token with required permissions
4. **Monitoring Stack**: Alloy configured for syslog collection (port 51435)

### Required Permissions

**Gitea Access Token Scopes**:
- `repo`: Full repository access (read/write)
- `user`: Read user information
- `org`: Read organization information
- `issue`: Manage issues
- `pull_request`: Manage pull requests

**Token Creation**:
1. Log into Gitea → User Settings → Applications
2. Generate New Token → Select scopes
3. Copy token (shown only once)
4. Store in Ansible Vault as `vault_gitea_mcp_access_token`

### Ansible Dependencies

- `community.docker.docker_compose_v2` collection
- Docker Python SDK on Miranda
- Ansible Vault configured with password file

---

## Configuration

### Host Variables

All configuration is defined in `ansible/inventory/host_vars/miranda.incus.yml`:

```yaml
services:
  - gitea_mcp    # Enable service on this host

# Gitea MCP Configuration
gitea_mcp_user: gitea_mcp
gitea_mcp_group: gitea_mcp
gitea_mcp_directory: /srv/gitea_mcp
gitea_mcp_port: 25535
gitea_mcp_host: https://gitea.ouranos.helu.ca
gitea_mcp_access_token: "{{ vault_gitea_mcp_access_token }}"
gitea_mcp_syslog_port: 51435
```

### Variable Reference

| Variable | Purpose | Example |
|----------|---------|---------|
| `gitea_mcp_user` | Service system user | `gitea_mcp` |
| `gitea_mcp_group` | Service system group | `gitea_mcp` |
| `gitea_mcp_directory` | Service root directory | `/srv/gitea_mcp` |
| `gitea_mcp_port` | External port binding | `25535` |
| `gitea_mcp_host` | Gitea instance URL | `https://gitea.ouranos.helu.ca` |
| `gitea_mcp_access_token` | Gitea API token (vault) | `{{ vault_gitea_mcp_access_token }}` |
| `gitea_mcp_syslog_port` | Local syslog port | `51435` |

### Vault Configuration

Store the Gitea access token securely in `ansible/inventory/group_vars/all/vault.yml`:

```yaml
---
# Gitea MCP Server Access Token
vault_gitea_mcp_access_token: "your_gitea_access_token_here"
```

**Encrypt vault file**:
```bash
ansible-vault encrypt ansible/inventory/group_vars/all/vault.yml
```

**Edit vault file**:
```bash
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
```

---

## Deployment

### Initial Deployment

**Prerequisites Check**:
```bash
# Verify Miranda has Docker
ansible miranda.incus -m command -a "docker --version"

# Verify Miranda is in inventory
ansible miranda.incus -m ping

# Check Gitea accessibility
curl -I https://gitea.ouranos.helu.ca
```

**Deploy Service**:
```bash
cd ansible/

# Deploy only Gitea MCP service
ansible-playbook gitea_mcp/deploy.yml

# Or deploy as part of full stack
ansible-playbook site.yml
```

**Deployment Process**:
1. ✓ Check service is enabled in host's `services` list
2. ✓ Create gitea_mcp system user and group
3. ✓ Add Ansible remote user to gitea_mcp group
4. ✓ Create /srv/gitea_mcp directory (mode 750)
5. ✓ Template docker-compose.yml (mode 550)
6. ✓ Reset SSH connection (apply group changes)
7. ✓ Start Docker container via docker-compose

### Deployment Output

**Expected Success**:
```
PLAY [Deploy Gitea MCP Server with Docker Compose] ****************************

TASK [Check if host has gitea_mcp service] ************************************
ok: [miranda.incus]

TASK [Create gitea_mcp group] *************************************************
changed: [miranda.incus]

TASK [Create gitea_mcp user] **************************************************
changed: [miranda.incus]

TASK [Add group gitea_mcp to Ansible remote_user] *****************************
changed: [miranda.incus]

TASK [Create gitea_mcp directory] *********************************************
changed: [miranda.incus]

TASK [Template docker-compose file] *******************************************
changed: [miranda.incus]

TASK [Reset SSH connection to apply group changes] ****************************
changed: [miranda.incus]

TASK [Start Gitea MCP service] ************************************************
changed: [miranda.incus]

PLAY RECAP ********************************************************************
miranda.incus              : ok=8    changed=7    unreachable=0    failed=0
```

---

## Verification

### Container Status

**Check container is running**:
```bash
# Via Ansible
ansible miranda.incus -m command -a "docker ps | grep gitea-mcp"

# Direct SSH
ssh miranda.incus
docker ps | grep gitea-mcp
```

**Expected Output**:
```
CONTAINER ID   IMAGE                                      STATUS         PORTS
abc123def456   docker.gitea.com/gitea-mcp-server:latest   Up 2 minutes   0.0.0.0:25535->8000/tcp
```

### Service Connectivity

**Test MCP endpoint**:
```bash
# From Miranda
curl -v http://localhost:25535

# From other hosts
curl -v http://miranda.incus:25535
```

**Expected Response**: HTTP response indicating MCP server is listening

### Log Inspection

**Docker logs**:
```bash
ssh miranda.incus
docker logs gitea-mcp
```

**Centralized logs via Loki**:
```bash
# Via logcli (if installed)
logcli query '{job="syslog", container_name="gitea-mcp"}' --limit=50

# Via Grafana Explore
# Navigate to: https://grafana.ouranos.helu.ca
# Select Loki datasource
# Query: {job="syslog", container_name="gitea-mcp"}
```

### Functional Testing

**Test Gitea API access**:
```bash
# Enter container
ssh miranda.incus
docker exec -it gitea-mcp sh

# Test Gitea API connectivity (if curl available in container)
# Note: Container may not have shell utilities
```

**MCP Protocol Test** (from client):
```bash
# Using MCP inspector or client tool
mcp connect http://miranda.incus:25535

# Or test via MCP Switchboard
curl -X POST http://oberon.incus:22781/mcp/invoke \
  -H "Content-Type: application/json" \
  -d '{"server":"gitea","method":"list_repositories"}'
```

---

## Management
|
||||
|
||||
### Updating the Service

**Update container image**:
```bash
cd ansible/

# Re-run deployment (pulls latest image)
ansible-playbook gitea_mcp/deploy.yml
```

**Docker Compose will**:
1. Pull the latest `docker.gitea.com/gitea-mcp-server:latest` image
2. Recreate the container if the image changed
3. Preserve the configuration from docker-compose.yml
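
The deployed compose file is templated by Ansible, so the following is only a plausible sketch of what `/srv/gitea_mcp/docker-compose.yml` contains. The image name, port mapping, and variable names come from this page; the logging section and the exact key layout are assumptions.

```yaml
# Hypothetical sketch of /srv/gitea_mcp/docker-compose.yml — the real file is
# templated by Ansible; the logging address here is an assumption and must
# match Alloy's syslog listener.
services:
  gitea-mcp:
    image: docker.gitea.com/gitea-mcp-server:latest
    container_name: gitea-mcp
    restart: unless-stopped
    ports:
      - "25535:8000"   # host 25535 -> container 8000 (matches the docker ps output above)
    environment:
      GITEA_HOST: "https://gitea.ouranos.helu.ca"
      GITEA_ACCESS_TOKEN: "{{ vault_gitea_mcp_access_token }}"
    logging:
      driver: syslog   # ships container logs to Alloy -> Loki
      options:
        syslog-address: "tcp://localhost:51435"   # assumed; see gitea_mcp_syslog_port
        tag: "gitea-mcp"
```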
### Restarting the Service

**Via Docker Compose**:
```bash
ssh miranda.incus
cd /srv/gitea_mcp
docker compose restart
```

**Via Docker**:
```bash
ssh miranda.incus
docker restart gitea-mcp
```

**Via Ansible** (re-run deployment):
```bash
ansible-playbook gitea_mcp/deploy.yml
```

### Removing the Service

**Complete removal**:
```bash
cd ansible/
ansible-playbook gitea_mcp/remove.yml
```

**The remove playbook**:
1. Stops and removes Docker containers
2. Removes Docker volumes
3. Removes Docker images
4. Prunes unused Docker images
5. Removes the /srv/gitea_mcp directory

**Manual cleanup** (if needed):
```bash
ssh miranda.incus

# Stop and remove the container
cd /srv/gitea_mcp
docker compose down -v --rmi all

# Remove the directory
sudo rm -rf /srv/gitea_mcp

# Remove user/group (optional)
sudo userdel gitea_mcp
sudo groupdel gitea_mcp
```

### Configuration Changes

**Update Gitea host or port**:
1. Edit `ansible/inventory/host_vars/miranda.incus.yml`
2. Modify `gitea_mcp_host` or `gitea_mcp_port`
3. Re-run the deployment: `ansible-playbook gitea_mcp/deploy.yml`

**Rotate access token**:
1. Generate a new token in Gitea
2. Update the vault: `ansible-vault edit ansible/inventory/group_vars/all/vault.yml`
3. Update the `vault_gitea_mcp_access_token` value
4. Re-run the deployment to update the environment variable

---

## Troubleshooting

### Container Won't Start

**Symptom**: Container exits immediately or won't start

**Diagnosis**:
```bash
ssh miranda.incus

# Check container logs
docker logs gitea-mcp

# Check container status
docker ps -a | grep gitea-mcp

# Inspect the container
docker inspect gitea-mcp
```

**Common Causes**:
- **Invalid Access Token**: Check `GITEA_ACCESS_TOKEN` in docker-compose.yml
- **Gitea Host Unreachable**: Verify `GITEA_HOST` is accessible from Miranda
- **Port Conflict**: Check whether port 25535 is already in use
- **Image Pull Failure**: Check Docker registry connectivity

**Solutions**:
```bash
# Test Gitea connectivity
curl -I https://gitea.ouranos.helu.ca

# Check port availability
ss -tlnp | grep 25535

# Pull the image manually
docker pull docker.gitea.com/gitea-mcp-server:latest

# Re-run deployment with verbose logging
ansible-playbook gitea_mcp/deploy.yml -vv
```

### Authentication Errors

**Symptom**: "401 Unauthorized" or "403 Forbidden" in logs

**Diagnosis**:
```bash
# Check the token is correctly passed
ssh miranda.incus
docker exec gitea-mcp env | grep GITEA_ACCESS_TOKEN

# Test the token manually
TOKEN="your_token_here"
curl -H "Authorization: token $TOKEN" https://gitea.ouranos.helu.ca/api/v1/user
```

**Solutions**:
1. Verify the token scopes in Gitea (repo, user, org, issue, pull_request)
2. Regenerate the token if it has expired or been revoked
3. Update the vault with the new token
4. Re-run the deployment

### Network Connectivity Issues

**Symptom**: Cannot connect to Gitea, or the MCP endpoint is unreachable

**Diagnosis**:
```bash
# Test Gitea from Miranda
ssh miranda.incus
curl -v https://gitea.ouranos.helu.ca

# Test the MCP endpoint from other hosts
curl -v http://miranda.incus:25535

# Check the Docker network
docker network inspect bridge
```

**Solutions**:
- Verify Miranda can resolve and reach `gitea.ouranos.helu.ca`
- Check firewall rules on Miranda
- Verify port 25535 is not blocked
- Check the Docker network configuration

### Logs Not Appearing in Loki

**Symptom**: No logs in Grafana from the gitea-mcp container

**Diagnosis**:
```bash
# Check Alloy is listening on the syslog port
ssh miranda.incus
ss -tlnp | grep 51435

# Check the Alloy service status
sudo systemctl status alloy

# Verify the syslog driver is configured
docker inspect gitea-mcp | grep -A 10 LogConfig
```

**Solutions**:
1. Verify Alloy is running: `sudo systemctl status alloy`
2. Check the Alloy syslog source configuration
3. Verify `gitea_mcp_syslog_port` matches the Alloy config
4. Restart Alloy: `sudo systemctl restart alloy`
5. Restart the container to reconnect syslog

### Permission Denied Errors

**Symptom**: Cannot access /srv/gitea_mcp or docker-compose.yml

**Diagnosis**:
```bash
ssh miranda.incus

# Check directory permissions
ls -la /srv/gitea_mcp

# Check user group membership
groups  # Should show the gitea_mcp group

# Check file ownership
ls -la /srv/gitea_mcp/docker-compose.yml
```

**Solutions**:
```bash
# Re-run deployment to fix permissions
ansible-playbook gitea_mcp/deploy.yml

# Manually fix if needed
sudo chown -R gitea_mcp:gitea_mcp /srv/gitea_mcp
sudo chmod 750 /srv/gitea_mcp
sudo chmod 550 /srv/gitea_mcp/docker-compose.yml

# Log out and back in to apply group changes
exit
ssh miranda.incus
```

### MCP Switchboard Integration Issues

**Symptom**: Switchboard cannot connect to the Gitea MCP server

**Diagnosis**:
```bash
# Check the switchboard configuration
ssh oberon.incus
cat /srv/mcp-switchboard/config.json | jq '.servers.gitea'

# Test connectivity from Oberon
curl -v http://miranda.incus:25535
```

**Solutions**:
1. Verify the Gitea MCP server URL in the switchboard config
2. Check network connectivity: Oberon → Miranda
3. Verify port 25535 is accessible
4. Restart the MCP Switchboard after config changes

---

## MCP Protocol Integration

### Server Capabilities

The Gitea MCP Server exposes these resources and tools via the MCP protocol:

**Resources**:
- Repository information
- File contents
- Issue details
- Pull request data
- User profiles
- Organization information

**Tools**:
- `list_repositories`: List accessible repositories
- `get_repository`: Get repository details
- `list_issues`: Search and list issues
- `create_issue`: Create a new issue
- `update_issue`: Modify an existing issue
- `list_pull_requests`: List PRs in a repository
- `create_pull_request`: Open a new PR
- `search_code`: Search code across repositories

### Switchboard Configuration

**MCP Switchboard** on Oberon routes MCP requests to the Gitea MCP Server.

**Configuration** (`/srv/mcp-switchboard/config.json`):
```json
{
  "servers": {
    "gitea": {
      "command": null,
      "args": [],
      "url": "http://miranda.incus:25535",
      "transport": "http"
    }
  }
}
```

### Client Usage

**From an AI Assistant** (Claude Desktop, Cline, etc.):

The assistant can interact with Gitea repositories through natural language:
- "List all repositories in the organization"
- "Show me open issues in the agathos repository"
- "Create an issue about improving documentation"
- "Search for 'ansible' in repository code"

**Direct MCP Client**:
```http
POST http://oberon.incus:22781/mcp/invoke
Content-Type: application/json

{
  "server": "gitea",
  "method": "list_repositories",
  "params": {}
}
```
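
The same invoke call can be scripted. A minimal Python sketch using only the standard library; the endpoint URL and request envelope are taken from the example above, and the available method names depend on the deployed server:

```python
import json
import urllib.request

# Switchboard invoke endpoint, as used in the curl example above
SWITCHBOARD = "http://oberon.incus:22781/mcp/invoke"

def build_invoke(server, method, params=None):
    """Build the JSON envelope the Switchboard expects."""
    return json.dumps({
        "server": server,
        "method": method,
        "params": params or {},
    }).encode("utf-8")

def invoke(server, method, params=None):
    """POST an invoke request and return the decoded JSON response."""
    req = urllib.request.Request(
        SWITCHBOARD,
        data=build_invoke(server, method, params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Example: repos = invoke("gitea", "list_repositories")
```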
---

## Security Considerations

### Access Token Management

**Best Practices**:
- Store the token in Ansible Vault (never in plain text)
- Use the minimum required scopes for the token
- Rotate tokens periodically
- Revoke tokens when no longer needed
- Use separate tokens for different services

**Token Rotation**:
```bash
# 1. Generate a new token in Gitea

# 2. Update the vault
ansible-vault edit ansible/inventory/group_vars/all/vault.yml

# 3. Re-deploy to update the environment variable
ansible-playbook gitea_mcp/deploy.yml

# 4. Revoke the old token in Gitea
```

### Network Security

**Isolation**:
- Service only accessible within the Incus network (10.10.0.0/24)
- No direct external exposure (proxied through the Switchboard)
- TLS handled by HAProxy (upstream) for external access

**Access Control**:
- Gitea enforces user/repository permissions
- MCP protocol authenticated by the Switchboard
- Container runs as a non-root user

### Audit and Monitoring

**Logging**:
- All requests logged to Loki via syslog
- Grafana dashboards for monitoring access patterns
- Alerts on authentication failures

**Monitoring Queries**:
```logql
# All Gitea MCP logs
{job="syslog", container_name="gitea-mcp"}

# Authentication errors
{job="syslog", container_name="gitea-mcp"} |~ "401|403"

# Error rate
rate({job="syslog", container_name="gitea-mcp"} |= "error" [5m])
```

---

## Performance Considerations

### Resource Usage

**Container Resources**:
- **Memory**: ~50-100 MB baseline
- **CPU**: Minimal (<1% idle, spikes during API calls)
- **Disk**: ~100 MB for the image, minimal runtime storage

**Scaling Considerations**:
- A single container is sufficient for development/sandbox use
- For production, consider multiple replicas behind a load balancer
- Gitea API rate limits apply to the token (typically 5000 requests/hour)

### Optimization

**Caching**:
- The Gitea MCP Server may cache repository metadata
- Restart the container to clear the cache if needed

**Connection Pooling**:
- The server maintains a connection pool to the Gitea API
- Connections are reused for better performance

---

## Related Documentation

### Agathos Infrastructure
- [Agathos Overview](agathos.md) - Complete infrastructure documentation
- [Ansible Best Practices](ansible.md) - Deployment patterns and structure
- [Miranda Host](agathos.md#miranda---mcp-docker-host) - MCP Docker host details

### Related Services
- [Gitea Service](gitea.md) - Gitea server deployment and configuration
- [MCP Switchboard](../ansible/mcp_switchboard/README.md) - MCP request routing
- [Grafana MCP](grafana_mcp.md) - Similar MCP server deployment

### External References
- [Gitea API Documentation](https://docs.gitea.com/api/1.21/) - Gitea REST API reference
- [Model Context Protocol Specification](https://spec.modelcontextprotocol.io/) - MCP protocol details
- [Gitea MCP Server Repository](https://gitea.com/gitea/mcp-server) - Upstream project
- [Docker Compose Documentation](https://docs.docker.com/compose/) - Container orchestration

---

## Maintenance Schedule

**Regular Tasks**:
- **Weekly**: Review logs for errors or anomalies
- **Monthly**: Update the container image to the latest version
- **Quarterly**: Rotate the Gitea access token
- **As Needed**: Review and adjust token permissions

**Update Procedure**:
```bash
# Pull the latest image and restart
ansible-playbook gitea_mcp/deploy.yml

# Verify the new version
ssh miranda.incus
docker inspect gitea-mcp | jq '.[0].Config.Image'
```

---

**Last Updated**: February 2026
**Project**: Agathos Infrastructure
**Host**: Miranda (MCP Docker Host)
**Status**: Red Panda Approved™ ✓

200
docs/gitea_runner.md
Normal file
@@ -0,0 +1,200 @@

# Gitea Act Runner

## Overview

Gitea Actions is Gitea's built-in CI/CD system, compatible with GitHub Actions workflows. The **Act Runner** is the agent that executes these workflows. It picks up jobs from a Gitea instance, spins up Docker containers for each workflow step, runs the commands, and reports results back.

The name "act" comes from [nektos/act](https://github.com/nektos/act), an open-source tool originally built to run GitHub Actions locally. Gitea forked and adapted it into their runner, so `act_runner` is a lineage artifact — the binary keeps the upstream name, but everything else in our infrastructure uses `gitea-runner`.

### How it works

1. The runner daemon polls the Gitea instance for queued workflow jobs
2. When a job is picked up, the runner pulls the Docker image specified by the workflow label (e.g., `ubuntu-24.04` maps to `docker.gitea.com/runner-images:ubuntu-24.04`)
3. Each workflow step executes inside an ephemeral container
4. Logs and status are streamed back to Gitea in real time
5. The container is destroyed after the job completes
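
As a concrete example, a repository workflow that this runner would pick up might look like the following. This is a hypothetical file; the `runs-on` value must match one of the labels configured for the runner:

```yaml
# .gitea/workflows/ci.yml — hypothetical example workflow
name: CI
on: [push]

jobs:
  build:
    runs-on: ubuntu-24.04   # resolved to docker.gitea.com/runner-images:ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          echo "running on $(uname -a)"
```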
### Architecture in Agathos

```
Gitea (Rosalind)                      Act Runner (Puck)
┌────────────────┐   poll/report    ┌───────────────────┐
│ gitea.ouranos  │◄─────────────────│ act_runner daemon │
│ .helu.ca       │                  │ (gitea-runner)    │
└────────────────┘                  └─────────┬─────────┘
                                              │ spawns
                                    ┌─────────▼─────────┐
                                    │ Docker containers │
                                    │ (workflow steps)  │
                                    └───────────────────┘
```

### Naming conventions

The **binary** is `act_runner` — that's the upstream package name, and renaming it would break updates. Everything else uses `gitea-runner`:

| Component | Name |
|-----------|------|
| Binary | `/usr/local/bin/act_runner` (upstream, don't rename) |
| Service account | `gitea-runner` |
| Home directory | `/srv/gitea-runner/` |
| Config file | `/srv/gitea-runner/config.yaml` |
| Registration state | `/srv/gitea-runner/.runner` (created by registration) |
| Systemd service | `gitea-runner.service` |
| Runner name | `puck-runner` (shown in Gitea UI) |

---

## Ansible Deployment

The runner is deployed via the `gitea_runner` Ansible service to **Puck** (application runtime host with Docker already available).

### Prerequisites

- Docker must be installed on the target host (`docker` in the services list)
- Gitea must be running and accessible at `https://gitea.ouranos.helu.ca`

### Deploy

```bash
# Deploy to all hosts with gitea_runner in their services list
ansible-playbook gitea_runner/deploy.yml

# Dry run (skip registration prompt)
ansible-playbook gitea_runner/deploy.yml --check

# Limit to a specific host
ansible-playbook gitea_runner/deploy.yml --limit puck.incus

# Non-interactive mode (for CI/CD)
ansible-playbook gitea_runner/deploy.yml -e registration_token=YOUR_TOKEN
```

The playbook is also included in the full-stack deployment via `site.yml`, running after the Gitea playbook.

**Registration Prompt**: On first deployment, the playbook pauses and prompts for a registration token. Get the token from `https://gitea.ouranos.helu.ca/-/admin/runners` before running the playbook.

### What the playbook does

1. Filters hosts — only runs on hosts with `gitea_runner` in their `services` list
2. Creates the `gitea-runner` system group and user (added to the `docker` group)
3. Downloads the `act_runner` binary from Gitea releases (version pinned as `act_runner_version` in `group_vars/all/vars.yml`)
4. Skips the download if the installed version already matches (idempotent)
5. Copies the managed `config.yaml` from the Ansible controller (edit `ansible/gitea_runner/config.yaml` to change runner settings)
6. Templates the `gitea-runner.service` systemd unit
7. **Registers the runner** — prompts for a registration token on first deployment
8. Enables and starts the service

### Systemd unit

```ini
# /etc/systemd/system/gitea-runner.service
[Unit]
Description=Gitea Runner
After=network.target docker.service
Requires=docker.service

[Service]
Type=simple
User=gitea-runner
Group=gitea-runner
WorkingDirectory=/srv/gitea-runner
ExecStart=/usr/local/bin/act_runner daemon --config /srv/gitea-runner/config.yaml
Restart=on-failure
RestartSec=10
Environment=HOME=/srv/gitea-runner

[Install]
WantedBy=multi-user.target
```

### Registration Flow

On first deployment, the playbook automatically prompts for a registration token:

```
TASK [Prompt for registration token]

Gitea runner registration required.
Get token from: https://gitea.ouranos.helu.ca/-/admin/runners

Enter registration token:
[Enter token here]
```

**Steps**:
1. Before running the playbook, obtain a registration token:
   - Navigate to `https://gitea.ouranos.helu.ca/-/admin/runners`
   - Click "Create new Runner"
   - Copy the displayed token
2. Run the deployment playbook
3. Paste the token when prompted

The registration is **idempotent** — if the runner is already registered (the `.runner` file exists), the prompt is skipped.

**Non-interactive mode**: Pass the token as an extra variable:
```bash
ansible-playbook gitea_runner/deploy.yml -e registration_token=YOUR_TOKEN
```

**Manual registration** (if needed): The traditional method still works if you prefer manual control. Labels are picked up from `config.yaml` at daemon start, so `--labels` is not needed at registration:
```bash
ssh puck.incus
sudo -iu gitea-runner
act_runner register \
  --instance https://gitea.ouranos.helu.ca \
  --token <token> \
  --name puck-runner \
  --no-interactive
```

### Verify

```bash
# Check service status
sudo systemctl status gitea-runner

# Check runner version
act_runner --version

# View runner logs
sudo journalctl -u gitea-runner -f
```

`puck-runner` should show as **online** at `https://gitea.ouranos.helu.ca/-/admin/runners`.

### Runner labels

Labels map workflow `runs-on` values to Docker images. They are configured in `ansible/gitea_runner/config.yaml` under `runner.labels`:

| Label | Docker Image | Use case |
|-------|--------------|----------|
| `ubuntu-latest` | `docker.gitea.com/runner-images:ubuntu-latest` | General CI (Gitea official image) |
| `ubuntu-24.04` | `docker.gitea.com/runner-images:ubuntu-24.04` | Ubuntu 24.04 builds |
| `ubuntu-22.04` | `docker.gitea.com/runner-images:ubuntu-22.04` | Ubuntu 22.04 builds |
| `ubuntu-20.04` | `docker.gitea.com/runner-images:ubuntu-20.04` | Ubuntu 20.04 builds |
| `node-24` | `node:24-bookworm` | Node.js CI |

To add or change labels, edit `ansible/gitea_runner/config.yaml` and re-run the playbook.
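
The table above would correspond to a `runner.labels` section roughly like the following. This is a sketch of the relevant excerpt, using act_runner's `label:docker://image` label format; the rest of the file is omitted:

```yaml
# Excerpt from ansible/gitea_runner/config.yaml (sketch)
runner:
  labels:
    - "ubuntu-latest:docker://docker.gitea.com/runner-images:ubuntu-latest"
    - "ubuntu-24.04:docker://docker.gitea.com/runner-images:ubuntu-24.04"
    - "ubuntu-22.04:docker://docker.gitea.com/runner-images:ubuntu-22.04"
    - "ubuntu-20.04:docker://docker.gitea.com/runner-images:ubuntu-20.04"
    - "node-24:docker://node:24-bookworm"
```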
### Configuration reference

| Variable | Location | Value |
|----------|----------|-------|
| `act_runner_version` | `group_vars/all/vars.yml` | `0.2.13` |
| `gitea_runner_instance_url` | `group_vars/all/vars.yml` | `https://gitea.ouranos.helu.ca` |
| `gitea_runner_name` | `host_vars/puck.incus.yml` | `puck-runner` |
| Runner labels | `ansible/gitea_runner/config.yaml` | See the `runner.labels` section |

### Upgrading

To upgrade the runner binary, update `act_runner_version` in `group_vars/all/vars.yml` and re-run the playbook:

```bash
# Edit the version
vim inventory/group_vars/all/vars.yml
# act_runner_version: "0.2.14"

# Re-deploy — only the binary download and service restart will trigger
ansible-playbook gitea_runner/deploy.yml
```

344
docs/github_mcp.md
Normal file
@@ -0,0 +1,344 @@

# GitHub MCP Server

## Overview

The GitHub MCP server provides read-only access to GitHub repositories through the Model Context Protocol (MCP). It enables AI assistants and other MCP clients to explore repository contents, search code, read issues, and analyze pull requests without requiring local clones.

**Deployment Host:** miranda.incus (10.10.0.156)
**Port:** 25533 (HTTP MCP endpoint)
**MCPO Proxy:** http://miranda.incus:25530/github

---

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         MCP CLIENTS                         │
│   VS Code/Cline   │   OpenWebUI   │   Custom Applications   │
└─────────────────────────┬───────────────────────────────────┘
                          │
              ┌───────────┴──────────────┐
              │                          │
              ▼                          ▼
   Direct MCP (port 25533)    MCPO Proxy (port 25530)
      streamable-http         OpenAI-compatible API
              │                          │
              └──────────┬───────────────┘
                         ▼
              ┌──────────────────────┐
              │  GitHub MCP Server   │
              │  Docker Container    │
              │  miranda.incus       │
              └──────────┬───────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │      GitHub API      │
              │   (Read-Only PAT)    │
              └──────────────────────┘
```

---

## GitHub Personal Access Token

### Required Scopes

The GitHub MCP server requires a **read-only Personal Access Token (PAT)** with the following scopes:

| Scope | Purpose |
|-------|---------|
| `public_repo` | Read access to public repositories |
| `repo` | Read access to private repositories (if needed) |
| `read:org` | Read organization membership and teams |
| `read:user` | Read user profile information |

### Creating a PAT

1. Navigate to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
2. Click "Generate new token (classic)"
3. Set the name: `Agathos GitHub MCP - Read Only`
4. Set the expiration: Custom or 90 days (recommended)
5. Select scopes: `public_repo`, `read:org`, `read:user`
6. Click "Generate token"
7. Copy the token immediately (it won't be shown again)
8. Store it in the Ansible vault: `ansible-vault edit ansible/inventory/group_vars/all/vault.yml`
   - Add: `vault_github_personal_access_token: "ghp_xxxxxxxxxxxxx"`

---

## Available Tools

The GitHub MCP server provides the following tools:

### Repository Operations
- `get_file_contents` - Read file contents from a repository
- `search_repositories` - Search for repositories on GitHub
- `list_commits` - List commits in a repository
- `create_branch` - Create a new branch (requires write access)
- `push_files` - Push files to a repository (requires write access)

### Issue Management
- `create_issue` - Create a new issue (requires write access)
- `list_issues` - List issues in a repository
- `get_issue` - Get details of a specific issue
- `update_issue` - Update an issue (requires write access)

### Pull Request Management
- `create_pull_request` - Create a new PR (requires write access)
- `list_pull_requests` - List pull requests in a repository
- `get_pull_request` - Get details of a specific PR

### Search Operations
- `search_code` - Search code across repositories
- `search_users` - Search for GitHub users

**Note:** With a read-only PAT, write operations (`create_*`, `update_*`, `push_*`) will fail. The primary use case is repository exploration and code reading.

---

## Client Configuration

### MCP Native Clients (Cline, Claude Desktop)

Add the following to your MCP settings (e.g., `~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json`):

```json
{
  "mcpServers": {
    "github": {
      "type": "streamable-http",
      "url": "http://miranda.incus:25533/mcp"
    }
  }
}
```

### OpenWebUI Configuration

1. Navigate to **Settings → Tools → OpenAPI Servers**
2. Click **Add OpenAPI Server**
3. Configure:
   - **Name:** GitHub MCP
   - **URL:** `http://miranda.incus:25530/github`
   - **Authentication:** None (MCPO handles upstream auth)
4. Save and enable the desired GitHub tools

### Custom Applications

**Direct MCP Connection:**
```python
import mcp

client = mcp.Client("http://miranda.incus:25533/mcp")
tools = await client.list_tools()
```

**Via MCPO (OpenAI-compatible):**
```python
import openai

client = openai.OpenAI(
    base_url="http://miranda.incus:25530/github",
    api_key="not-required",  # MCPO doesn't require auth for GitHub MCP
)
```

---

## Deployment

### Prerequisites

- Miranda container running with Docker installed
- Ansible vault containing `vault_github_personal_access_token`
- Network connectivity from clients to miranda.incus

### Deploy GitHub MCP Server

```bash
cd /home/robert/dv/agathos/ansible
ansible-playbook github_mcp/deploy.yml
```

This playbook:
1. Creates the `github_mcp` user and group
2. Creates the `/srv/github_mcp` directory
3. Templates docker-compose.yml with the PAT from the vault
4. Starts the github-mcp-server container on port 25533

### Update MCPO Configuration

```bash
ansible-playbook mcpo/deploy.yml
```

This restarts MCPO with the updated config, including the GitHub MCP server.

### Update Alloy Logging

```bash
ansible-playbook alloy/deploy.yml --limit miranda.incus
```

This reconfigures Alloy to collect GitHub MCP server logs.

---

## Verification

### Test Direct MCP Endpoint

```bash
# Check the container is running
ssh miranda.incus docker ps | grep github-mcp-server

# Test the MCP endpoint responds
curl http://miranda.incus:25533/mcp

# List available tools (expect a JSON response)
curl -X POST http://miranda.incus:25533/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'
```
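
The `tools/list` call can also be scripted. A short Python sketch using only the standard library, mirroring the curl example's endpoint and JSON-RPC envelope; the response shape depends on the deployed server:

```python
import json
import urllib.request

# Direct MCP endpoint, as in the curl example above
MCP_URL = "http://miranda.incus:25533/mcp"

def jsonrpc_request(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request body, matching the curl example."""
    body = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        body["params"] = params
    return json.dumps(body).encode("utf-8")

def list_tools():
    """POST tools/list and return the decoded JSON-RPC response."""
    req = urllib.request.Request(
        MCP_URL,
        data=jsonrpc_request("tools/list"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Example: names = [t["name"] for t in list_tools()["result"]["tools"]]
```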
### Test MCPO Proxy

```bash
# List GitHub tools via MCPO
curl http://miranda.incus:25530/github/tools

# Test repository file reading
curl -X POST http://miranda.incus:25530/github/tools/get_file_contents \
  -H "Content-Type: application/json" \
  -d '{
    "owner": "github",
    "repo": "docs",
    "path": "README.md"
  }'
```

### View Logs

```bash
# Container logs
ssh miranda.incus docker logs github-mcp-server

# Loki logs (via Grafana on prospero.incus)
# Navigate to Explore → Loki
# Query: {job="github-mcp-server"}
```

---

## Troubleshooting

### Container Won't Start

**Check Docker Compose:**
```bash
ssh miranda.incus
sudo -u github_mcp docker compose -f /srv/github_mcp/docker-compose.yml logs
```

**Common Issues:**
- Missing or invalid GitHub PAT in the vault
- Port 25533 already in use
- Docker image pull failure

### MCP Endpoint Returns Errors

**Check GitHub PAT validity:**
```bash
curl -H "Authorization: token YOUR_PAT" https://api.github.com/user
```

**Verify PAT scopes:**
```bash
curl -i -H "Authorization: token YOUR_PAT" https://api.github.com/user \
  | grep X-OAuth-Scopes
```

### MCPO Not Exposing GitHub Tools

**Verify MCPO config:**
```bash
ssh miranda.incus cat /srv/mcpo/config.json | jq '.mcpServers.github'
```

**Restart MCPO:**
```bash
ssh miranda.incus sudo systemctl restart mcpo
ssh miranda.incus sudo systemctl status mcpo
```

---

## Monitoring

### Prometheus Metrics

The GitHub MCP server exposes Prometheus metrics (if supported by the container). Add to the Prometheus scrape config:

```yaml
scrape_configs:
  - job_name: 'github-mcp'
    static_configs:
      - targets: ['miranda.incus:25533']
```

### Grafana Dashboard

Import or create a dashboard on prospero.incus to visualize:
- Request rate and latency
- GitHub API rate limits
- Tool invocation counts
- Error rates

### Log Queries

Useful Loki queries in Grafana:

```logql
# All GitHub MCP logs
{job="github-mcp-server"}

# Errors only
{job="github-mcp-server"} |~ "error|ERROR"

# GitHub API rate limit warnings
{job="github-mcp-server"} |= "rate limit"

# Tool invocations
{job="github-mcp-server"} |= "tool"
```

---

## Security Considerations

✔ **Read-Only PAT** - The server uses minimal scopes and cannot modify repositories
✔ **Network Isolation** - Only accessible within the Agathos network (miranda.incus)
✔ **Vault Storage** - PAT stored encrypted in Ansible Vault
✔ **No Public Exposure** - The MCP endpoint is not exposed to the internet
⚠️ **PAT Rotation** - Consider rotating the PAT every 90 days
⚠️ **Access Control** - MCPO currently doesn't require authentication

### Recommended Enhancements

1. Add authentication to the MCPO endpoints
2. Implement request rate limiting
3. Monitor GitHub API quota usage
4. Set up PAT expiration alerts
5. Restrict network access to Miranda via firewall rules

---

## References

- [GitHub MCP Server Repository](https://github.com/github/github-mcp-server)
- [Model Context Protocol Specification](https://modelcontextprotocol.io/)
- [MCPO Documentation](https://github.com/open-webui/mcpo)
- [Agathos README](../../README.md)
- [Agathos Sandbox Documentation](../sandbox.html)
422
docs/grafana_mcp.md
Normal file
@@ -0,0 +1,422 @@
# Grafana MCP Server

## Overview

The Grafana MCP server provides AI/LLM access to Grafana dashboards, datasources, and APIs through the Model Context Protocol (MCP). It runs as a Docker container on **Miranda** and connects to the Grafana instance inside the [PPLG stack](pplg.md) on **Prospero** via the internal Incus network.

**Deployment Host:** miranda.incus
**Port:** 25533 (HTTP MCP endpoint)
**MCPO Proxy:** http://miranda.incus:25530/grafana
**Grafana Backend:** http://prospero.incus:3000 (PPLG stack)

## Architecture

```
┌───────────────────────────────────────────────────────────┐
│                        MCP CLIENTS                        │
│ VS Code/Cline │ OpenWebUI │ LobeChat │ Custom Applications│
└─────────────────────────────┬─────────────────────────────┘
                              │
                 ┌────────────┴────────────┐
                 │                         │
                 ▼                         ▼
      Direct MCP (port 25533)   MCPO Proxy (port 25530)
          streamable-http        OpenAI-compatible API
                 │                         │
                 └────────────┬────────────┘
                              ▼
┌───────────────────────────────────────────────────────────┐
│                  Miranda (miranda.incus)                  │
│   ┌────────────────────────────────────────────┐          │
│   │ Grafana MCP Server (Docker)                │          │
│   │ mcp/grafana:latest                         │          │
│   │ Container: grafana-mcp                     │          │
│   │ :25533 → :8000                             │          │
│   └─────────────────────┬──────────────────────┘          │
│                         │ HTTP (internal network)         │
└─────────────────────────┼─────────────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────────────┐
│           Prospero (prospero.incus) — PPLG Stack          │
│   ┌────────────────────────────────────────────┐          │
│   │ Grafana :3000                              │          │
│   │ Authenticated via Service Account Token    │          │
│   └────────────────────────────────────────────┘          │
└───────────────────────────────────────────────────────────┘
```

### Cross-Host Dependency

The Grafana MCP server on Miranda communicates with Grafana on Prospero over the Incus internal network (`prospero.incus:3000`). This means:

- **PPLG must be deployed first** — Grafana must be running before deploying the MCP server
- The connection uses Grafana's **internal HTTP port** (3000), not the external HTTPS endpoint
- Authentication is handled by a **Grafana service account token**, not Casdoor OAuth

## Terraform Resources

### Host Definition

Grafana MCP runs on Miranda, defined in `terraform/containers.tf`:

| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | mcp_docker_host |
| Security Nesting | true |
| AppArmor | unconfined |
| Proxy: mcp_containers | `0.0.0.0:25530-25539` → `127.0.0.1:25530-25539` |

### Dependencies

| Resource | Relationship |
|----------|--------------|
| prospero (PPLG) | Grafana backend — service account token auth on `:3000` |
| miranda (MCPO) | MCPO proxies Grafana MCP at `localhost:25533/mcp` |

## Ansible Deployment

### Prerequisites

1. **PPLG stack**: Grafana must be running on Prospero (`ansible-playbook pplg/deploy.yml`)
2. **Docker**: Docker must be installed on the target host (`ansible-playbook docker/deploy.yml`)
3. **Vault Secret**: `vault_grafana_service_account_token` must be set (see [Required Vault Secrets](#required-vault-secrets))

### Playbook

```bash
cd ansible
ansible-playbook grafana_mcp/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `grafana_mcp/deploy.yml` | Main deployment playbook |
| `grafana_mcp/docker-compose.yml.j2` | Docker Compose template for the MCP server |

### Deployment Steps

1. **Pre-flight Check**: Verify Grafana is reachable on Prospero (`/api/health`)
2. **Create System User**: `grafana_mcp:grafana_mcp` system account
3. **Create Directory**: `/srv/grafana_mcp` with restricted permissions (750)
4. **Template Docker Compose**: Renders `docker-compose.yml.j2` with Grafana URL and service account token
5. **Start Container**: `docker compose up` via `community.docker.docker_compose_v2`
6. **Health Check**: Verifies the MCP endpoint is responding on `localhost:25533/mcp`

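The step-1 pre-flight check can be expressed as a single `ansible.builtin.uri` task. This is a hypothetical sketch — the actual task lives in `grafana_mcp/deploy.yml` and may differ; the `grafana_mcp_grafana_host` / `grafana_mcp_grafana_port` variables are the ones defined in Miranda's host vars:

```yaml
# Hypothetical sketch of the step-1 pre-flight check; the real task in
# grafana_mcp/deploy.yml may differ.
- name: Verify Grafana is reachable before deploying the MCP server
  ansible.builtin.uri:
    url: "http://{{ grafana_mcp_grafana_host }}:{{ grafana_mcp_grafana_port }}/api/health"
    status_code: 200
  register: grafana_health
```

Failing fast here avoids starting a container whose only job is to talk to a backend that is down.
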
### Deployment Order

Grafana MCP must be deployed **after** PPLG and **before** MCPO:

```
pplg → docker → grafana_mcp → mcpo
```

This ensures Grafana is available before the MCP server starts, and MCPO can proxy to it.

## Docker Compose Configuration

The container is defined in `grafana_mcp/docker-compose.yml.j2`:

```yaml
services:
  grafana-mcp:
    image: mcp/grafana:latest
    container_name: grafana-mcp
    restart: unless-stopped
    ports:
      - "25533:8000"
    environment:
      - GRAFANA_URL=http://prospero.incus:3000
      - GRAFANA_SERVICE_ACCOUNT_TOKEN=<from vault>
    command: ["--transport", "streamable-http", "--address", "0.0.0.0:8000", "--tls-skip-verify"]
    logging:
      driver: syslog
      options:
        syslog-address: "tcp://127.0.0.1:51433"
        syslog-format: rfc5424
        tag: "grafana-mcp"
```

Key configuration:
- **Transport**: `streamable-http` — standard MCP HTTP transport
- **TLS Skip Verify**: Enabled because Grafana is accessed over internal HTTP (not HTTPS)
- **Syslog**: Logs shipped to Alloy on localhost for forwarding to Loki

## Available Tools

The Grafana MCP server exposes tools for interacting with Grafana's API:

### Dashboard Operations
- Search and list dashboards
- Get dashboard details and panels
- Query panel data

### Datasource Operations
- List configured datasources
- Query datasources directly

### Alerting
- List alert rules
- Get alert rule details and status

### General
- Get Grafana health status
- Search across Grafana resources

> **Note:** The specific tools available depend on the `mcp/grafana` Docker image version. Use the MCPO Swagger docs at `http://miranda.incus:25530/docs` to see the current tool inventory.

## Client Configuration

### MCP Native Clients (Cline, Claude Desktop)

```json
{
  "mcpServers": {
    "grafana": {
      "type": "streamable-http",
      "url": "http://miranda.incus:25533/mcp"
    }
  }
}
```

### Via MCPO (OpenAI-Compatible)

Grafana MCP is automatically available through MCPO at:

```
http://miranda.incus:25530/grafana
```

This endpoint is OpenAI-compatible and can be used by OpenWebUI, LobeChat, or any OpenAI SDK client:

```python
import openai

client = openai.OpenAI(
    base_url="http://miranda.incus:25530/grafana",
    api_key="not-required"
)
```

### OpenWebUI / LobeChat

1. Navigate to **Settings → Tools → OpenAPI Servers**
2. Click **Add OpenAPI Server**
3. Configure:
   - **Name:** Grafana MCP
   - **URL:** `http://miranda.incus:25530/grafana`
   - **Authentication:** None (MCPO handles upstream auth)
4. Save and enable the Grafana tools

## Required Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

| Variable | Purpose |
|----------|---------|
| `vault_grafana_service_account_token` | Grafana service account token for MCP API access |

### Creating a Grafana Service Account Token

1. Log in to Grafana at `https://grafana.ouranos.helu.ca` (Casdoor SSO or local admin)
2. Navigate to **Administration → Service Accounts**
3. Click **Add service account**
   - **Name:** `mcp-server`
   - **Role:** `Viewer` (or `Editor` if write tools are needed)
4. Click **Add service account token**
   - **Name:** `mcp-token`
   - **Expiration:** No expiration (or set a rotation schedule)
5. Copy the generated token
6. Store in vault:

```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```

```yaml
vault_grafana_service_account_token: "glsa_xxxxxxxxxxxxxxxxxxxx"
```

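The UI steps above can also be scripted against Grafana's service account HTTP API. This is a hedged sketch only — verify the endpoint paths against your Grafana version, and note that admin credentials are required:

```
# Hypothetical sketch using Grafana's service account HTTP API.
POST /api/serviceaccounts              {"name": "mcp-server", "role": "Viewer"}
POST /api/serviceaccounts/<id>/tokens  {"name": "mcp-token"}
```

The second call returns the `glsa_...` token exactly once; store it in vault immediately.
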
## Host Variables

**File:** `ansible/inventory/host_vars/miranda.incus.yml`

```yaml
# Grafana MCP Config
grafana_mcp_user: grafana_mcp
grafana_mcp_group: grafana_mcp
grafana_mcp_directory: /srv/grafana_mcp
grafana_mcp_port: 25533
grafana_mcp_grafana_host: prospero.incus
grafana_mcp_grafana_port: 3000
grafana_service_account_token: "{{ vault_grafana_service_account_token }}"
```

Miranda's services list includes `grafana_mcp`:

```yaml
services:
  - alloy
  - argos
  - docker
  - gitea_mcp
  - grafana_mcp
  - mcpo
  - neo4j_mcp
```

## Monitoring

### Syslog to Loki

The Grafana MCP container ships logs via Docker's syslog driver to Alloy on Miranda:

| Server | Syslog Port | Loki Tag |
|--------|-------------|----------|
| grafana-mcp | 51433 | `grafana-mcp` |

### Grafana Log Queries

Useful Loki queries in Grafana Explore:

```logql
# All Grafana MCP logs
{hostname="miranda.incus", job="grafana_mcp"}

# Errors only
{hostname="miranda.incus", job="grafana_mcp"} |~ "error|ERROR"

# Tool invocations
{hostname="miranda.incus", job="grafana_mcp"} |= "tool"
```

### MCPO Aggregation

Grafana MCP is registered in MCPO's `config.json` as:

```json
{
  "grafana": {
    "type": "streamable-http",
    "url": "http://localhost:25533/mcp"
  }
}
```

MCPO exposes it at `http://miranda.incus:25530/grafana` with OpenAI-compatible API and Swagger documentation.

## Operations

### Start / Stop

```bash
ssh miranda.incus

# Docker container
sudo -u grafana_mcp docker compose -f /srv/grafana_mcp/docker-compose.yml up -d
sudo -u grafana_mcp docker compose -f /srv/grafana_mcp/docker-compose.yml down

# Or redeploy via Ansible
cd ansible
ansible-playbook grafana_mcp/deploy.yml
```

### Health Check

```bash
# Container status
ssh miranda.incus docker ps --filter name=grafana-mcp

# MCP endpoint
curl http://miranda.incus:25533/mcp

# Via MCPO
curl http://miranda.incus:25530/grafana/tools

# Grafana backend (from Miranda)
ssh miranda.incus curl http://prospero.incus:3000/api/health
```

### Logs

```bash
# Docker container logs
ssh miranda.incus docker logs -f grafana-mcp

# Loki logs (via Grafana on Prospero)
# Query: {hostname="miranda.incus", job="grafana_mcp"}
```

## Troubleshooting

### Container Won't Start

```bash
ssh miranda.incus
sudo -u grafana_mcp docker compose -f /srv/grafana_mcp/docker-compose.yml logs
```

**Common causes:**
- Grafana on Prospero not running → check `ssh prospero.incus sudo systemctl status grafana-server`
- Invalid or expired service account token → regenerate in Grafana UI
- Port 25533 already in use → `ss -tlnp | grep 25533`
- Docker image pull failure → check Docker Hub access

### MCP Endpoint Returns Errors

**Verify service account token:**
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" http://prospero.incus:3000/api/org
```

**Check container environment:**
```bash
ssh miranda.incus docker inspect grafana-mcp | jq '.[0].Config.Env'
```

### MCPO Not Exposing Grafana Tools

**Verify MCPO config:**
```bash
ssh miranda.incus cat /srv/mcpo/config.json | jq '.mcpServers.grafana'
```

**Restart MCPO:**
```bash
ssh miranda.incus sudo systemctl restart mcpo
```

### Grafana Unreachable from Miranda

**Test network connectivity:**
```bash
ssh miranda.incus curl -s http://prospero.incus:3000/api/health
```

If this fails, check:
- Prospero container is running: `incus list prospero`
- Grafana service is up: `ssh prospero.incus sudo systemctl status grafana-server`
- No firewall rules blocking inter-container traffic

## Security Considerations

✔ **Service Account Token** — Scoped to Viewer role, cannot modify Grafana configuration
✔ **Internal Network** — MCP server only accessible within the Incus network
✔ **Vault Storage** — Token stored encrypted in Ansible Vault
✔ **No Public Exposure** — Neither the MCP endpoint nor the MCPO proxy are internet-facing
⚠️ **Token Rotation** — Consider rotating the service account token periodically
⚠️ **Access Control** — MCPO currently doesn't require authentication for tool access

## References

- [PPLG Stack Documentation](pplg.md) — Grafana deployment on Prospero
- [MCPO Documentation](mcpo.md) — MCP gateway that proxies Grafana MCP
- [Grafana MCP Server](https://github.com/grafana/mcp-grafana) — Upstream project
- [Model Context Protocol Specification](https://modelcontextprotocol.io/)
- [Ansible Practices](ansible.md)
- [Agathos Overview](agathos.md)

222
docs/hass.md
Normal file
@@ -0,0 +1,222 @@
# Home Assistant

## Overview

[Home Assistant](https://github.com/home-assistant/core) is an open-source home automation platform. In the Agathos sandbox it runs as a native Python application inside a virtual environment, backed by PostgreSQL for state recording and fronted by HAProxy for TLS termination.

**Host:** Oberon
**Role:** container_orchestration
**Port:** 8123
**URL:** https://hass.ouranos.helu.ca

## Architecture

```
┌──────────┐  HTTPS  ┌──────────────┐  HTTP  ┌──────────────┐
│  Client  │────────▶│   HAProxy    │───────▶│     Home     │
│          │         │  (Titania)   │        │  Assistant   │
└──────────┘         │  :443 TLS    │        │   (Oberon)   │
                     └──────────────┘        │    :8123     │
                                             └──────┬───────┘
                                                    │
                               ┌────────────────────┼────────────────────┐
                               │                    │                    │
                         ┌─────▼─────┐      ┌──────▼──────┐      ┌─────▼──────┐
                         │PostgreSQL │      │    Alloy    │      │ Prometheus │
                         │ (Portia)  │      │  (Oberon)   │      │ (Prospero) │
                         │   :5432   │      │   scrape    │      │   remote   │
                         │ recorder  │      │  /api/prom  │      │   write    │
                         └───────────┘      └─────────────┘      └────────────┘
```

## Ansible Deployment

### Playbook

```bash
cd ansible
ansible-playbook hass/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `hass/deploy.yml` | Main deployment playbook |
| `hass/configuration.yaml.j2` | Home Assistant configuration |
| `hass/requirements.txt.j2` | Python package pinning |
| `hass/hass.service.j2` | Systemd service unit |

### Variables

#### Host Variables (`host_vars/oberon.incus.yml`)

| Variable | Description | Value |
|----------|-------------|-------|
| `hass_user` | System user | `hass` |
| `hass_group` | System group | `hass` |
| `hass_directory` | Install directory | `/srv/hass` |
| `hass_media_directory` | Media storage | `/srv/hass/media` |
| `hass_port` | HTTP listen port | `8123` |
| `hass_version` | Pinned HA release | `2026.2.0` |
| `hass_db_host` | PostgreSQL host | `portia.incus` |
| `hass_db_port` | PostgreSQL port | `5432` |
| `hass_db_name` | Database name | `hass` |
| `hass_db_user` | Database user | `hass` |
| `hass_db_password` | Database password | `{{ vault_hass_db_password }}` |
| `hass_metrics_token` | Prometheus bearer token | `{{ vault_hass_metrics_token }}` |

#### Host Variables (`host_vars/portia.incus.yml`)

| Variable | Description |
|----------|-------------|
| `hass_db_name` | Database name on Portia |
| `hass_db_user` | Database user on Portia |
| `hass_db_password` | `{{ vault_hass_db_password }}` |

#### Vault Variables (`group_vars/all/vault.yml`)

| Variable | Description |
|----------|-------------|
| `vault_hass_db_password` | PostgreSQL password for hass database |
| `vault_hass_metrics_token` | Long-Lived Access Token for Prometheus scraping |

## Configuration

### PostgreSQL Recorder

Home Assistant uses the `recorder` integration to persist entity states and events to PostgreSQL on Portia instead of the default SQLite. Configured in `configuration.yaml.j2`:

```yaml
recorder:
  db_url: "postgresql://hass:<password>@portia.incus:5432/hass"
  purge_keep_days: 30
  commit_interval: 1
```

The database and user are provisioned by `postgresql/deploy.yml` alongside other service databases.

### HTTP / Reverse Proxy

HAProxy on Titania terminates TLS and forwards to Oberon:8123. The `http` block in `configuration.yaml.j2` configures trusted proxies so HA correctly reads `X-Forwarded-For` headers:

```yaml
http:
  server_port: 8123
  use_x_forwarded_for: true
  trusted_proxies:
    - 10.0.0.0/8
```

### HAProxy Backend

Defined in `host_vars/titania.incus.yml` under `haproxy_backends`:

| Setting | Value |
|---------|-------|
| Subdomain | `hass` |
| Backend | `oberon.incus:8123` |
| Health path | `/api/` |
| Timeout | 300s (WebSocket support) |

The wildcard TLS certificate (`*.ouranos.helu.ca`) covers `hass.ouranos.helu.ca` automatically — no certificate changes required.

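The table above corresponds to a `haproxy_backends` entry roughly like the following. This is a hypothetical sketch modeled on the jupyter backend entry in `titania.incus.yml`; the actual entry may differ:

```yaml
# Hypothetical sketch of the hass entry under haproxy_backends
- subdomain: "hass"
  backend_host: "oberon.incus"
  backend_port: 8123
  health_path: "/api/"
  timeout_server: 300s  # WebSocket support
```
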
## Authentication

Home Assistant uses its **native `homeassistant` auth provider** (built-in username/password). HA does not support OIDC/OAuth2 natively, so Casdoor SSO integration is not available.

On first deployment, HA will present an onboarding wizard to create the initial admin user.

## Monitoring

### Prometheus Metrics

Home Assistant exposes Prometheus metrics at `/api/prometheus`. The Alloy agent on Oberon scrapes this endpoint with bearer token authentication and remote-writes to Prometheus on Prospero.

| Setting | Value |
|---------|-------|
| Metrics path | `/api/prometheus` |
| Scrape interval | 60s |
| Auth | Bearer token (Long-Lived Access Token) |

**⚠️ Two-Phase Metrics Bootstrapping:**

The `vault_hass_metrics_token` must be a Home Assistant **Long-Lived Access Token**, which can only be generated from the HA web UI after the initial deployment:

1. Deploy Home Assistant: `ansible-playbook hass/deploy.yml`
2. Complete the onboarding wizard at `https://hass.ouranos.helu.ca`
3. Navigate to **Profile → Security → Long-Lived Access Tokens → Create Token**
4. Store the token in vault: `vault_hass_metrics_token: "<token>"`
5. Redeploy Alloy to pick up the token: `ansible-playbook alloy/deploy.yml`

Until the token is created, the Alloy hass scrape will fail silently.

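On the Alloy side, the scrape settings above translate into roughly the following `prometheus.scrape` block. This is a hypothetical sketch — the deployed config lives under `ansible/alloy/`, and the component name `prometheus.remote_write.prospero` is illustrative:

```hcl
// Hypothetical sketch; the deployed Alloy config may differ.
prometheus.scrape "hass" {
  targets         = [{"__address__" = "127.0.0.1:8123"}]
  metrics_path    = "/api/prometheus"
  scrape_interval = "60s"

  authorization {
    type        = "Bearer"
    credentials = "{{ hass_metrics_token }}"
  }

  forward_to = [prometheus.remote_write.prospero.receiver]
}
```

Because the token is injected from vault via `hass_metrics_token`, redeploying Alloy (step 5 above) is what actually activates the scrape.
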
### Loki Logs

Systemd journal logs are collected by Alloy's `loki.source.journal` and shipped to Loki on Prospero.

```logql
# Query in Grafana Explore
{job="systemd", hostname="oberon"} |= "hass"
```

## Operations

### Start / Stop

```bash
sudo systemctl start hass
sudo systemctl stop hass
sudo systemctl restart hass
```

### Health Check

```bash
curl http://localhost:8123/api/
```

### Logs

```bash
journalctl -u hass -f
```

### Version Upgrade

1. Update `hass_version` in `host_vars/oberon.incus.yml`
2. Run: `ansible-playbook hass/deploy.yml`

The playbook will reinstall the pinned version via pip and restart the service.

## Troubleshooting

### Common Issues

| Symptom | Cause | Resolution |
|---------|-------|------------|
| Service won't start | Missing Python deps | Check `pip install` output in deploy log |
| Database connection error | Portia unreachable | Verify PostgreSQL is running: `ansible-playbook postgresql/deploy.yml` |
| 502 via HAProxy | HA not listening | Check `systemctl status hass` on Oberon |
| Metrics scrape failing | Missing/invalid token | Generate Long-Lived Access Token from HA UI (see Monitoring section) |

### Debug Mode

```bash
# Check service status
sudo systemctl status hass

# View recent logs
journalctl -u hass --since "5 minutes ago"

# Test database connectivity from Oberon
psql -h portia.incus -U hass -d hass -c "SELECT 1"
```

## References

- [Home Assistant Documentation](https://www.home-assistant.io/docs/)
- [Home Assistant GitHub](https://github.com/home-assistant/core)
- [Recorder Integration](https://www.home-assistant.io/integrations/recorder/)
- [Prometheus Integration](https://www.home-assistant.io/integrations/prometheus/)
- [HTTP Integration](https://www.home-assistant.io/integrations/http/)

342
docs/jupyterlab.md
Normal file
@@ -0,0 +1,342 @@
# JupyterLab - Interactive Computing Environment

## Overview

JupyterLab is a web-based interactive development environment for notebooks, code, and data. Deployed on **Puck** as a systemd service running in a Python virtual environment, with an OAuth2-Proxy sidecar providing Casdoor SSO authentication.

**Host:** puck.incus
**Role:** Application Runtime (Python App Host)
**Ports:** 22181 (JupyterLab), 22182 (OAuth2-Proxy)
**External Access:** https://jupyter.ouranos.helu.ca/ (via HAProxy on Titania)

## Architecture

```
┌──────────┐      ┌────────────┐      ┌─────────────┐      ┌────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│ OAuth2-Proxy│─────▶│ JupyterLab │
│          │      │ (Titania)  │      │   (Puck)    │      │   (Puck)   │
└──────────┘      └────────────┘      └──────┬──────┘      └────────────┘
                                             │
                                             ▼
                                       ┌───────────┐
                                       │  Casdoor  │
                                       │ (Titania) │
                                       └───────────┘
```

### Authentication Flow

```
┌──────────┐      ┌────────────┐      ┌─────────────┐      ┌──────────┐
│ Browser  │─────▶│  HAProxy   │─────▶│ OAuth2-Proxy│─────▶│ Casdoor  │
│          │      │ (Titania)  │      │   (Puck)    │      │(Titania) │
└────┬─────┘      └────────────┘      └──────┬──────┘      └─────┬────┘
     │                                       │                   │
     │ 1. Access jupyter.ouranos.helu.ca     │                   │
     │──────────────────────────────────────▶│                   │
     │ 2. No session - redirect to Casdoor   │                   │
     │◀──────────────────────────────────────│                   │
     │ 3. User authenticates                 │                   │
     │──────────────────────────────────────────────────────────▶│
     │ 4. Redirect with auth code            │                   │
     │◀──────────────────────────────────────────────────────────│
     │ 5. Exchange code, set session cookie  │                   │
     │◀──────────────────────────────────────│                   │
     │ 6. Proxy to JupyterLab                │                   │
     │◀──────────────────────────────────────│                   │
```

## Deployment

### Playbook

```bash
cd ansible
ansible-playbook jupyterlab/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `jupyterlab/deploy.yml` | Main deployment playbook |
| `jupyterlab/jupyterlab.service.j2` | Systemd unit for JupyterLab |
| `jupyterlab/oauth2-proxy-jupyter.service.j2` | Systemd unit for OAuth2-Proxy sidecar |
| `jupyterlab/oauth2-proxy-jupyter.cfg.j2` | OAuth2-Proxy configuration |
| `jupyterlab/jupyter_lab_config.py.j2` | JupyterLab server configuration |

### Deployment Steps

1. **Install Dependencies**: python3-venv, nodejs, npm, graphviz
2. **Ensure User Exists**: `robert:robert` with home directory
3. **Create Directories**: Notebooks dir, config dir, log dir
4. **Create Virtual Environment**: `/home/robert/env/jupyter`
5. **Install Python Packages**: jupyterlab, jupyter-ai, langchain-ollama, matplotlib, plotly
6. **Install Jupyter Extensions**: contrib nbextensions
7. **Template Configuration**: Apply JupyterLab config
8. **Download OAuth2-Proxy**: Binary from GitHub releases
9. **Template OAuth2-Proxy Config**: With Casdoor OIDC settings
10. **Start Services**: Enable and start both systemd units

## Configuration

### Key Features

- **Jupyter AI**: AI assistance via jupyter-ai[all] with LangChain Ollama integration
- **Visualization**: matplotlib, plotly for data visualization
- **Diagrams**: Mermaid support via jupyterlab-mermaid
- **Extensions**: Jupyter contrib nbextensions
- **SSO**: Casdoor authentication via OAuth2-Proxy sidecar
- **WebSocket**: Full WebSocket support through reverse proxy

### Storage Locations

| Path | Purpose | Owner |
|------|---------|-------|
| `/home/robert/Notebooks` | Notebook files | robert:robert |
| `/home/robert/env/jupyter` | Python virtual environment | robert:robert |
| `/etc/jupyterlab` | Configuration files | root:robert |
| `/var/log/jupyterlab` | Application logs | robert:robert |
| `/etc/oauth2-proxy-jupyter` | OAuth2-Proxy config | root:root |

||||
### Installed Python Packages
|
||||
|
||||
| Package | Purpose |
|
||||
|---------|---------|
|
||||
| `jupyterlab` | Core JupyterLab server |
|
||||
| `jupyter-ai[all]` | AI assistant integration |
|
||||
| `langchain-ollama` | Ollama LLM integration |
|
||||
| `matplotlib` | Data visualization |
|
||||
| `plotly` | Interactive charts |
|
||||
| `jupyter_contrib_nbextensions` | Community extensions |
|
||||
| `jupyterlab-mermaid` | Mermaid diagram support |
|
||||
| `ipywidgets` | Interactive widgets |
|
||||
|
||||
### Logging

- **JupyterLab**: systemd journal via `SyslogIdentifier=jupyterlab`
- **OAuth2-Proxy**: systemd journal via `SyslogIdentifier=oauth2-proxy-jupyter`
- **Alloy Forwarding**: Syslog port 51491 → Loki

## Access After Deployment

1. **Web Interface**: https://jupyter.ouranos.helu.ca/
2. **Authentication**: Redirects to Casdoor SSO login
3. **After Login**: Full JupyterLab interface with notebook access

## Monitoring

### Alloy Configuration
**File:** `ansible/alloy/puck/config.alloy.j2`

- **Log Collection**: Syslog port 51491 → Loki
- **Job Label**: `jupyterlab`
- **System Metrics**: Process exporter tracks JupyterLab process

### Health Check
- **URL**: `http://puck.incus:22182/ping` (OAuth2-Proxy)
- **JupyterLab API**: `http://127.0.0.1:22181/api/status` (localhost only)

## Required Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

### 1. OAuth Client ID
```yaml
vault_jupyter_oauth_client_id: "jupyter-oauth-client"
```
**Requirements:**
- **Purpose**: Client ID for Casdoor OAuth2 application
- **Source**: Must match `clientId` in Casdoor application configuration

### 2. OAuth Client Secret
```yaml
vault_jupyter_oauth_client_secret: "YourRandomOAuthSecret123!"
```
**Requirements:**
- **Length**: 32+ characters recommended
- **Purpose**: Client secret for Casdoor OAuth2 authentication
- **Generation**:
  ```bash
  openssl rand -base64 32
  ```

### 3. Cookie Secret
```yaml
vault_jupyter_oauth2_cookie_secret: "32CharacterRandomStringHere12345"
```
**Requirements:**
- **Length**: Exactly 32 characters (or 16/24 for AES)
- **Purpose**: Encrypts OAuth2-Proxy session cookies
- **Generation**:
  ```bash
  openssl rand -base64 32 | head -c 32
  ```

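If `openssl` is unavailable, the same 32-character secret can be produced in Python. A small sketch using only the standard library — 24 random bytes base64-encode to exactly 32 characters, one of the lengths oauth2-proxy accepts:

```python
import base64
import secrets

# 24 random bytes -> 32 URL-safe base64 characters (no '=' padding),
# a valid cookie-secret length for oauth2-proxy.
secret = base64.urlsafe_b64encode(secrets.token_bytes(24)).decode()
print(secret)
assert len(secret) == 32
```

Paste the printed value into `vault_jupyter_oauth2_cookie_secret`.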
## Host Variables

**File:** `ansible/inventory/host_vars/puck.incus.yml`

```yaml
# JupyterLab Configuration
jupyterlab_user: robert
jupyterlab_group: robert
jupyterlab_notebook_dir: /home/robert/Notebooks
jupyterlab_venv_dir: /home/robert/env/jupyter

# Ports
jupyterlab_port: 22181        # JupyterLab (localhost only)
jupyterlab_proxy_port: 22182  # OAuth2-Proxy (exposed to HAProxy)

# OAuth2-Proxy Configuration
jupyterlab_oauth2_proxy_dir: /etc/oauth2-proxy-jupyter
jupyterlab_oauth2_proxy_version: "7.6.0"
jupyterlab_domain: "ouranos.helu.ca"
jupyterlab_oauth2_oidc_issuer_url: "https://id.ouranos.helu.ca"
jupyterlab_oauth2_redirect_url: "https://jupyter.ouranos.helu.ca/oauth2/callback"

# OAuth2 Credentials (from vault)
jupyterlab_oauth_client_id: "{{ vault_jupyter_oauth_client_id }}"
jupyterlab_oauth_client_secret: "{{ vault_jupyter_oauth_client_secret }}"
jupyterlab_oauth2_cookie_secret: "{{ vault_jupyter_oauth2_cookie_secret }}"

# Alloy Logging
jupyterlab_syslog_port: 51491
```

## OAuth2 / Casdoor SSO

JupyterLab uses OAuth2-Proxy as a sidecar to handle Casdoor authentication. This pattern is simpler than native OAuth for single-user setups.

### Why OAuth2-Proxy Sidecar?

| Approach | Pros | Cons |
|----------|------|------|
| **OAuth2-Proxy (chosen)** | Simple setup, no JupyterLab modification | Extra service to manage |
| **Native JupyterHub OAuth** | Integrated solution | More complex, overkill for single user |
| **Token-only auth** | Simplest | Less secure, no SSO integration |

### Casdoor Application Configuration

A JupyterLab application is defined in `ansible/casdoor/init_data.json.j2`:

| Setting | Value |
|---------|-------|
| **Name** | `app-jupyter` |
| **Client ID** | `vault_jupyter_oauth_client_id` |
| **Redirect URI** | `https://jupyter.ouranos.helu.ca/oauth2/callback` |
| **Grant Types** | `authorization_code`, `refresh_token` |

### URL Strategy

| URL Type | Address | Used By |
|----------|---------|---------|
| **OIDC Issuer** | `https://id.ouranos.helu.ca` | OAuth2-Proxy (external) |
| **Redirect URL** | `https://jupyter.ouranos.helu.ca/oauth2/callback` | Browser callback |
| **Upstream** | `http://127.0.0.1:22181` | OAuth2-Proxy → JupyterLab |
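The three URLs above map directly onto the proxy's configuration. A minimal sketch of the oauth2-proxy config file using the values from this page; the exact key set and file layout are an assumption and should be checked against the deployed template in `/etc/oauth2-proxy-jupyter`:

```toml
provider        = "oidc"
oidc_issuer_url = "https://id.ouranos.helu.ca"
redirect_url    = "https://jupyter.ouranos.helu.ca/oauth2/callback"
upstreams       = ["http://127.0.0.1:22181"]
http_address    = "0.0.0.0:22182"
client_id       = "..."   # vault_jupyter_oauth_client_id
client_secret   = "..."   # vault_jupyter_oauth_client_secret
cookie_secret   = "..."   # vault_jupyter_oauth2_cookie_secret
email_domains   = ["*"]
```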
### Deployment Order

1. **Deploy Casdoor first** (if not already running):

   ```bash
   ansible-playbook casdoor/deploy.yml
   ```

2. **Update HAProxy** (add jupyter backend):

   ```bash
   ansible-playbook haproxy/deploy.yml
   ```

3. **Deploy JupyterLab**:

   ```bash
   ansible-playbook jupyterlab/deploy.yml
   ```

4. **Update Alloy** (for log forwarding):

   ```bash
   ansible-playbook alloy/deploy.yml
   ```
## Integration with Other Services

### HAProxy Routing

**Backend Configuration** (`titania.incus.yml`):

```yaml
- subdomain: "jupyter"
  backend_host: "puck.incus"
  backend_port: 22182      # OAuth2-Proxy port
  health_path: "/ping"
  timeout_server: 300s     # WebSocket support
```
### Alloy Log Forwarding

**Syslog Configuration** (`puck/config.alloy.j2`):

```hcl
loki.source.syslog "jupyterlab_logs" {
  listener {
    address  = "127.0.0.1:51491"
    protocol = "tcp"
    labels   = {
      job = "jupyterlab",
    }
  }
  forward_to = [loki.write.default.receiver]
}
```
## Troubleshooting

### Service Status

```bash
ssh puck.incus
sudo systemctl status jupyterlab
sudo systemctl status oauth2-proxy-jupyter
```

### View Logs

```bash
# JupyterLab logs
sudo journalctl -u jupyterlab -f

# OAuth2-Proxy logs
sudo journalctl -u oauth2-proxy-jupyter -f
```

### Test JupyterLab Directly (bypass OAuth)

```bash
# From puck container
curl http://127.0.0.1:22181/api/status
```

### Test OAuth2-Proxy Health

```bash
curl http://puck.incus:22182/ping
```

### Verify Virtual Environment

```bash
ssh puck.incus
sudo -u robert /home/robert/env/jupyter/bin/jupyter --version
```

### Common Issues

| Issue | Solution |
|-------|----------|
| WebSocket disconnects | Verify `timeout_server: 300s` in HAProxy backend |
| OAuth redirect loop | Check `redirect_url` matches Casdoor app config |
| 502 Bad Gateway | Ensure JupyterLab service is running on port 22181 |
| Cookie errors | Verify `cookie_secret` is exactly 32 characters |
## Version Information

- **Installation Method**: Python pip in virtual environment
- **JupyterLab Version**: Latest stable (pip managed)
- **OAuth2-Proxy Version**: 7.6.0 (binary from GitHub)
- **Update Process**: Re-run deployment playbook

## References

- **JupyterLab Documentation**: https://jupyterlab.readthedocs.io/
- **OAuth2-Proxy Documentation**: https://oauth2-proxy.github.io/oauth2-proxy/
- **Jupyter AI**: https://jupyter-ai.readthedocs.io/
- **Casdoor OIDC**: https://casdoor.org/docs/integration/oidc
Docker Compose doesn't pull newer images for existing tags
----------------------------------------------------------
# Issue

Running `docker compose up` on a service tagged `:latest` does not check the registry for a newer image. The container keeps running the old image even though a newer one has been pushed upstream.

## Symptoms

- `docker compose up` starts the container immediately using the locally cached image
- `docker compose pull` or `docker pull <image>:latest` successfully downloads a newer image
- After pulling manually, `docker compose up` recreates the container with the new image
- The `community.docker.docker_compose_v2` Ansible module with `state: present` behaves identically — no pull check
# Explanation

Docker's default behaviour is: **if an image with the requested tag exists locally, use it without checking the registry.** The `:latest` tag is not special — it's just a regular mutable tag. Docker does not treat it as "always fetch the newest." It is simply the default tag applied when no tag is specified.

When you run `docker compose up`:

1. Docker checks if `image:latest` exists in the local image store
2. If yes → use it, no registry check
3. If no → pull from registry

This means a stale `:latest` can sit on your host indefinitely while the upstream registry has a completely different image behind the same tag. The only way Docker knows to pull is if:

- The image doesn't exist locally at all
- You explicitly tell it to pull

The same applies to the Ansible `community.docker.docker_compose_v2` module — `state: present` maps to `docker compose up` behaviour, so no pull check occurs unless you tell it to.
# Solution

Two complementary fixes ensure images are always checked against the registry.

## 1. Docker Compose — `pull_policy: always`

Add `pull_policy: always` to the service definition in `docker-compose.yml`:

```yaml
services:
  my-service:
    image: registry.example.com/my-image:latest
    pull_policy: always   # Check registry on every `up`
    container_name: my-service
    ...
```

With this set, `docker compose up` will always contact the registry and compare the local image digest with the remote one. If they match, no download occurs — it's a lightweight check. If they differ, the new image layers are pulled.

Valid values for `pull_policy`:

| Value | Behaviour |
|-------|-----------|
| `always` | Always check the registry before starting |
| `missing` | Only pull if the image doesn't exist locally (default) |
| `never` | Never pull, fail if image doesn't exist locally |
| `build` | Always build the image (for services with `build:`) |
## 2. Ansible — `pull: always` on `docker_compose_v2`

Add `pull: always` to the `community.docker.docker_compose_v2` task:

```yaml
- name: Start service
  community.docker.docker_compose_v2:
    project_src: "{{ service_directory }}"
    state: present
    pull: always   # Check registry during deploy
```

Valid values for `pull`:

| Value | Behaviour |
|-------|-----------|
| `always` | Always pull before starting (like `docker compose pull && up`) |
| `missing` | Only pull if image doesn't exist locally |
| `never` | Never pull |
| `policy` | Defer to `pull_policy` defined in the compose file |

## Why use both?

- **`pull_policy` in compose file** — Protects against manual `docker compose up` on the host
- **`pull: always` in Ansible** — Ensures automated deployments always get the freshest image

They are independent mechanisms. The Ansible `pull` parameter runs a pull step before compose up, regardless of what the compose file says. Belt and suspenders.
# Agathos Fix

Applied to `ansible/gitea_mcp/` as the first instance. The same pattern should be applied to any service using mutable tags (`:latest`, `:stable`, etc.).

**docker-compose.yml.j2:**

```yaml
services:
  gitea-mcp:
    image: docker.gitea.com/gitea-mcp-server:latest
    pull_policy: always
    ...
```

**deploy.yml:**

```yaml
- name: Start Gitea MCP service
  community.docker.docker_compose_v2:
    project_src: "{{ gitea_mcp_directory }}"
    state: present
    pull: always
```
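To find remaining services that reference mutable tags, a grep sketch (the sample compose file is inlined for illustration; point the grep at a real `docker-compose.yml`):

```shell
# Write a sample compose file, then flag image references using mutable tags.
cat > /tmp/compose-sample.yml <<'EOF'
services:
  web:
    image: registry.example.com/web:latest
  db:
    image: postgres:16.2
EOF
grep -En 'image:.*:(latest|stable)$' /tmp/compose-sample.yml
```

Only the `web` service is flagged; the pinned `postgres:16.2` passes.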
# When you DON'T need this

- **Pinned image tags** (e.g., `postgres:16.2`, `grafana/grafana:11.1.0`) — The tag is immutable, so there's nothing newer to pull. Using `pull: always` here just adds a redundant registry check on every deploy.
- **Locally built images** — If the image is built by `docker compose build`, use `pull_policy: build` instead.
- **Air-gapped / offline hosts** — `pull: always` will fail if the registry is unreachable. Use `missing` or `never`.
# Verification

```bash
# Check what image a running container is using
docker inspect --format='{{.Image}}' gitea-mcp

# Compare local digest with remote
docker images --digests docker.gitea.com/gitea-mcp-server

# Force pull and check if image ID changes
docker compose pull
docker compose up -d
```
docs/kb/Docker won't start inside Incus container.md
Docker won't start inside Incus container
-----------------------------------------

# Issue

Running Docker inside Incus has worked for years, but a recent Ubuntu package update caused it to fail.
## Symptoms

Docker containers won't start with the following error:

```
docker compose up
Attaching to neo4j
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: open sysctl net.ipv4.ip_unprivileged_port_start file: reopen fd 8: permission denied
```

The issue is AppArmor on Incus containers. The host has AppArmor, and Incus applies an AppArmor profile to containers with `security.nesting=true` that blocks Docker from writing to `/proc/sys/net/ipv4/ip_unprivileged_port_start`.
# Solution (Automated)

The fix requires **both** host-side and container-side changes. These are now automated in our infrastructure:

## 1. Terraform - Host-side fix

In `terraform/containers.tf`, all containers with `security.nesting=true` now include:

```terraform
config = {
  "security.nesting" = true
  "raw.lxc"          = "lxc.apparmor.profile=unconfined"
}
```

This tells Incus not to load any AppArmor profile for the container.
## 2. Ansible - Container-side fix

In `ansible/docker/deploy.yml`, Docker deployment now creates a systemd override:

```yaml
- name: Create AppArmor workaround for Incus nested Docker
  ansible.builtin.copy:
    content: |
      [Service]
      Environment=container="setmeandforgetme"
    dest: /etc/systemd/system/docker.service.d/apparmor-workaround.conf
```

This tells Docker to skip loading its own AppArmor profile.
# Manual Workaround

If you need to fix this manually (e.g., before running Terraform/Ansible):

## Step 1: Force unconfined mode from the Incus host

```bash
# On the HOST (pan.helu.ca), not in the container
incus config set <container-name> raw.lxc "lxc.apparmor.profile=unconfined" --project agathos
incus restart <container-name> --project agathos
```

## Step 2: Disable AppArmor for Docker inside the container

```bash
# Inside the container
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/apparmor-workaround.conf <<EOF
[Service]
Environment=container="setmeandforgetme"
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
```
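To confirm from inside the container that the sysctl runc needs to touch is actually visible (it is world-readable on any Linux system, so no sudo is needed for the check):

```shell
# Read the sysctl that runc writes during container init.
# The stock Linux default is 1024 (ports below this need privileges).
cat /proc/sys/net/ipv4/ip_unprivileged_port_start
```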
Reference: [ktz.blog](https://blog.ktz.me/proxmox-9-broke-my-docker-containers/)
# Verification

Tested on Miranda (2025-12-28):

```bash
# Before fix - fails with permission denied
$ ssh miranda.incus "docker run hello-world"
docker: Error response from daemon: failed to create task for container: ... permission denied

# After applying both fixes
$ ssh miranda.incus "docker run hello-world"
Hello from Docker!

# Port binding also works
$ ssh miranda.incus "docker run -d -p 8080:80 nginx"
# Container starts successfully
```
# Security Considerations

Setting `lxc.apparmor.profile=unconfined` only disables the AppArmor profile that Incus applies **to** the container. The host's AppArmor daemon continues running and protecting the host itself.

Security layers with this fix:

- Host AppArmor ✅ (still active)
- Incus container isolation ✅ (namespaces, cgroups)
- Container AppArmor ❌ (disabled with unconfined)
- Docker container isolation ✅ (namespaces, cgroups)

For sandbox/dev environments, this tradeoff is acceptable since:

- The Incus container is already isolated from the host
- We're not running untrusted workloads
- Production uses VMs + Docker without Incus nesting
# Explanation

What happened is that a recent update on the host (probably the incus and/or apparmor packages that landed in Ubuntu 24.04) started feeding the container a new AppArmor profile that contains this rule (or one very much like it):

```
deny @{PROC}/sys/net/ipv4/ip_unprivileged_port_start rw,
```

That rule is not present in the profile that ships with plain Docker, but it is present in the profile that Incus now attaches to every container that has `security.nesting=true` (the flag you need to run Docker inside Incus).

Because the rule is a `deny`, it overrides any later `allow`, so Docker's own profile (which allows the write) is ignored and the kernel returns `permission denied` the first time Docker/runc tries to write the value that tells the kernel which ports an unprivileged user may bind to.

So the container itself starts fine, but as soon as Docker tries to start any of its own containers, the AppArmor policy that Incus attached to the nested container blocks the write and the whole Docker container creation aborts.

The two workarounds remove the enforcing profile:

1. **`raw.lxc = lxc.apparmor.profile=unconfined`** — tells Incus "don't load any AppArmor profile for this container at all", so the offending rule is never applied.

2. **`Environment=container="setmeandforgetme"`** — is the magic string Docker's systemd unit looks for. When it sees that variable it skips loading the Docker-default AppArmor profile. The value literally does not matter; the variable only has to exist.

Either way you end up with no AppArmor policy on the nested Docker container, so the write to `ip_unprivileged_port_start` succeeds and your containers start again.

**In short:** Recent Incus added a deny rule that clashes with Docker's need to tweak that sysctl; disabling the profile (host-side or container-side) is the quickest fix until the profiles are updated to allow the operation.
docs/kernos.md
# Kernos Service Documentation

HTTP-enabled MCP shell server using FastMCP. Wraps the existing `mcp-shell-server` execution logic with FastMCP's HTTP transport for remote AI agent access.
## Overview

| Property | Value |
|----------|-------|
| **Host** | caliban.incus |
| **Port** | 22021 |
| **Service Type** | Systemd service (non-Docker) |
| **Repository** | `ssh://robert@clio.helu.ca:18677/mnt/dev/kernos` |

## Features

- **HTTP Transport**: Accessible via URL instead of stdio
- **Health Endpoints**: `/live`, `/ready`, `/health` for Kubernetes-style probes
- **Prometheus Metrics**: `/metrics` endpoint for monitoring
- **JSON Structured Logging**: Production-ready log format with correlation IDs
- **Full Security**: Command whitelisting inherited from `mcp-shell-server`
## Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/mcp/` | POST | MCP protocol endpoint (FastMCP handles this) |
| `/live` | GET | Liveness probe - always returns 200 |
| `/ready` | GET | Readiness probe - checks executor and config |
| `/health` | GET | Combined health check |
| `/metrics` | GET | Prometheus metrics (text/plain) or JSON |
## Ansible Playbooks

### Stage Playbook

```bash
ansible-playbook kernos/stage.yml
```

Fetches the Kernos repository from clio and creates a release tarball at `~/rel/kernos_{{kernos_rel}}.tar`.

### Deploy Playbook

```bash
ansible-playbook kernos/deploy.yml
```

Deploys Kernos to caliban.incus:

1. Creates kernos user/group
2. Creates `/srv/kernos` directory
3. Transfers and extracts the staged tarball
4. Creates Python virtual environment
5. Installs package dependencies
6. Templates `.env` configuration
7. Templates systemd service file
8. Enables and starts the service
9. Validates health endpoints
## Configuration Variables

### Host Variables (`ansible/inventory/host_vars/caliban.incus.yml`)

| Variable | Default | Description |
|----------|---------|-------------|
| `kernos_user` | `kernos` | System user for the service |
| `kernos_group` | `kernos` | System group for the service |
| `kernos_directory` | `/srv/kernos` | Installation directory |
| `kernos_port` | `22021` | HTTP server port |
| `kernos_host` | `0.0.0.0` | Server bind address |
| `kernos_log_level` | `INFO` | Python log level |
| `kernos_log_format` | `json` | Log format (`json` or `text`) |
| `kernos_environment` | `production` | Environment name for logging |
| `kernos_allow_commands` | (see below) | Comma-separated command whitelist |

### Global Variables (`ansible/inventory/group_vars/all/vars.yml`)

| Variable | Default | Description |
|----------|---------|-------------|
| `kernos_rel` | `master` | Git branch/tag for staging |
## Allowed Commands

The following commands are whitelisted for execution:

```
ls, cat, head, tail, grep, find, wc, file, stat, mkdir, touch, cp, mv, rm,
chmod, pwd, tree, du, df, sed, awk, sort, uniq, cut, tr, tee, curl, wget,
ping, nc, dig, host, ps, pgrep, kill, pkill, nohup, timeout, python3, pip,
node, npm, npx, pnpm, git, make, tar, gzip, gunzip, zip, unzip, whoami, id,
uname, hostname, date, uptime, free, which, env, printenv, run-captured, jq
```
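The whitelist idea can be illustrated with a small sketch: the first token of a command is checked against the comma-separated list. This mirrors the concept only, not the server's actual implementation:

```shell
# Allow a command only if its first token appears in the comma-separated list.
ALLOW_COMMANDS="ls,cat,grep,pwd"
check() {
  case ",$ALLOW_COMMANDS," in
    *",$1,"*) echo "allowed: $1" ;;
    *)        echo "denied: $1" ;;
  esac
}
check pwd      # allowed: pwd
check reboot   # denied: reboot
```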
## Security

All security features are inherited from `mcp-shell-server`:

- **Command Whitelisting**: Only commands in `ALLOW_COMMANDS` can be executed
- **Shell Operator Validation**: Commands after `;`, `&&`, `||`, `|` are validated
- **Directory Validation**: Working directory must be absolute and accessible
- **No Shell Injection**: Commands executed directly without shell interpretation

The systemd service includes additional hardening:

- `NoNewPrivileges=true`
- `PrivateTmp=true`
- `ProtectSystem=strict`
- `ProtectHome=true`
- `ReadWritePaths=/tmp`
## Usage

### Testing Health Endpoints

```bash
curl http://caliban.incus:22021/health
curl http://caliban.incus:22021/ready
curl http://caliban.incus:22021/live
curl -H "Accept: text/plain" http://caliban.incus:22021/metrics
```

### MCP Client Connection

Connect using any MCP client that supports HTTP transport:

```python
import asyncio

from fastmcp import Client

client = Client("http://caliban.incus:22021/mcp")

async def main():
    async with client:
        result = await client.call_tool("shell_execute", {
            "command": ["ls", "-la"],
            "directory": "/tmp"
        })
        print(result)

asyncio.run(main())
```
## Tool: shell_execute

Execute a shell command in a specified directory.

### Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `command` | `list[str]` | Yes | - | Command and arguments as array |
| `directory` | `str` | No | `/tmp` | Absolute path to working directory |
| `stdin` | `str` | No | `None` | Input to pass to command |
| `timeout` | `int` | No | `None` | Timeout in seconds |

### Response

```json
{
  "stdout": "command output",
  "stderr": "",
  "status": 0,
  "execution_time": 0.123
}
```
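A response like the one above can be checked from shell with `jq`; the sample payload is inlined here for illustration (a real response would come from an MCP client call, not this literal):

```shell
# Fail fast when a shell_execute response reports a non-zero exit status.
response='{"stdout":"ok\n","stderr":"","status":0,"execution_time":0.123}'
if echo "$response" | jq -e '.status == 0' >/dev/null; then
  echo "command succeeded"
else
  echo "command failed" >&2
fi
```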
## Monitoring

### Prometheus Metrics

The `/metrics` endpoint exposes Prometheus-compatible metrics. Add to your Prometheus configuration:

```yaml
- job_name: 'kernos'
  static_configs:
    - targets: ['caliban.incus:22021']
```

### Service Status

```bash
# Check service status
ssh caliban.incus sudo systemctl status kernos

# View logs
ssh caliban.incus sudo journalctl -u kernos -f
```
## Troubleshooting

### Service Won't Start

1. Check logs: `journalctl -u kernos -n 50`
2. Verify `.env` file exists and has correct permissions
3. Ensure Python venv was created successfully
4. Check that `ALLOW_COMMANDS` is set

### Health Check Failures

1. Verify the service is running: `systemctl status kernos`
2. Check if port 22021 is accessible
3. Review logs for startup errors

### Command Execution Denied

1. Verify the command is in `ALLOW_COMMANDS` whitelist
2. Check that the working directory is absolute and accessible
3. Review logs for security validation errors
docs/lobechat.md
# LobeChat

Modern AI chat interface with multi-LLM support, deployed on **Rosalind** with PostgreSQL backend and S3 storage.

**Host:** rosalind.incus
**Port:** 22081
**External URL:** https://lobechat.ouranos.helu.ca/
## Quick Deployment

```bash
cd ansible
ansible-playbook lobechat/deploy.yml
```

## Architecture

```
┌──────────┐      ┌────────────┐      ┌──────────┐      ┌───────────┐
│  Client  │─────▶│  HAProxy   │─────▶│ LobeChat │─────▶│PostgreSQL │
│          │      │ (Titania)  │      │(Rosalind)│      │ (Portia)  │
└──────────┘      └────────────┘      └──────────┘      └───────────┘
                                           │
                                           ├─────────▶ Casdoor (SSO)
                                           ├─────────▶ S3 (File Storage)
                                           ├─────────▶ SearXNG (Search)
                                           └─────────▶ AI APIs
```
## Required Vault Secrets

Add secrets to `ansible/inventory/group_vars/all/vault.yml`:

### 1. Key Vaults Secret (Encryption Key)

```yaml
vault_lobechat_key_vaults_secret: "your-generated-secret"
```

**Purpose:** Encrypts sensitive data (API keys, credentials) stored in the database.

**Generate with:**

```bash
openssl rand -base64 32
```

ℹ️ This secret must be at least 32 bytes (base64 encoded). If changed after deployment, previously stored encrypted data will become unreadable.
### 2. NextAuth Secret

```yaml
vault_lobechat_next_auth_secret: "your-generated-secret"
```

**Purpose:** Signs NextAuth.js JWT tokens for session management.

**Generate with:**

```bash
openssl rand -base64 32
```

### 3. Database Password

```yaml
vault_lobechat_db_password: "your-secure-password"
```

**Purpose:** PostgreSQL authentication for the `lobechat` database user.
### 4. S3 Secret Key

```yaml
vault_lobechat_s3_secret_key: "your-s3-secret-key"
```

**Purpose:** Authentication for S3 file storage bucket.

**Get from Terraform:**

```bash
cd terraform
terraform output -json lobechat_s3_credentials
```

### 5. AI Provider API Keys (Optional)

```yaml
vault_lobechat_openai_api_key: "sk-proj-..."
vault_lobechat_anthropic_api_key: "sk-ant-api03-..."
vault_lobechat_google_api_key: "AIza..."
```

**Purpose:** Server-side AI provider access. Users can also provide their own keys via the UI.

| Provider | Get Key From |
|----------|-------------|
| OpenAI | https://platform.openai.com/api-keys |
| Anthropic | https://console.anthropic.com/ |
| Google | https://aistudio.google.com/apikey |
### 6. AWS Bedrock Credentials (Optional)

```yaml
vault_lobechat_aws_access_key_id: "AKIA..."
vault_lobechat_aws_secret_access_key: "wJalr..."
vault_lobechat_aws_region: "us-east-1"
```

**Purpose:** Access AWS Bedrock models (Claude, Titan, Llama, etc.)

**Requirements:**

- IAM user/role with `bedrock:InvokeModel` permission
- Model access enabled in AWS Bedrock console for the region
## Host Variables

Defined in `ansible/inventory/host_vars/rosalind.incus.yml`:

| Variable | Description |
|----------|-------------|
| `lobechat_user` | Service user (lobechat) |
| `lobechat_directory` | Service directory (/srv/lobechat) |
| `lobechat_port` | Container port (22081) |
| `lobechat_db_*` | PostgreSQL connection settings |
| `lobechat_auth_casdoor_*` | Casdoor SSO configuration |
| `lobechat_s3_*` | S3 storage settings |
| `lobechat_syslog_port` | Alloy log collection port (51461) |

## Dependencies

| Service | Host | Purpose |
|---------|------|---------|
| PostgreSQL | Portia | Database backend |
| Casdoor | Titania | SSO authentication |
| HAProxy | Titania | HTTPS termination |
| SearXNG | Oberon | Web search |
| S3 Bucket | Incus | File storage |
## Ansible Files

| File | Purpose |
|------|---------|
| `lobechat/deploy.yml` | Main deployment playbook |
| `lobechat/docker-compose.yml.j2` | Docker Compose template |

## Operations

### Check Status

```bash
ssh rosalind.incus
cd /srv/lobechat
docker compose ps
docker compose logs -f
```

### Update Container

```bash
ssh rosalind.incus
cd /srv/lobechat
docker compose pull
docker compose up -d
```

### Database Access

```bash
psql -h portia.incus -U lobechat -d lobechat
```
## Troubleshooting

| Issue | Resolution |
|-------|------------|
| Container won't start | Check vault secrets are defined |
| Database connection failed | Verify PostgreSQL on Portia is running |
| SSO redirect fails | Check Casdoor application config |
| File uploads fail | Verify S3 credentials from Terraform |

## References

- [Detailed Service Documentation](services/lobechat.md)
- [LobeChat Official Docs](https://lobehub.com/docs)
- [GitHub Repository](https://github.com/lobehub/lobe-chat)
docs/mcpo.md
# MCPO - Model Context Protocol OpenAI-Compatible Proxy

## Overview

MCPO is an OpenAI-compatible proxy that aggregates multiple Model Context Protocol (MCP) servers behind a single HTTP endpoint. It acts as the central MCP gateway for the Agathos sandbox, exposing tools from 13 MCP servers through a unified REST API with interactive Swagger documentation.

**Host:** miranda.incus
**Role:** MCP Docker Host
**Service Port:** 25530
**API Docs:** http://miranda.incus:25530/docs
## Architecture

```
┌───────────────┐     ┌──────────────────────────────────────────────┐
│  LLM Client   │     │  Miranda (miranda.incus)                     │
│  (LobeChat,   │────▶│  ┌────────────────────────────────────────┐  │
│  Open WebUI,  │     │  │  MCPO :25530                           │  │
│   VS Code)    │     │  │  OpenAI-compatible proxy               │  │
└───────────────┘     │  └─────┬───────────┬───────────┬─────────┘  │
                      │        │           │           │            │
                      │  ┌─────▼─────┐ ┌───▼─────┐ ┌───▼──────┐     │
                      │  │  stdio    │ │  Local  │ │  Remote  │     │
                      │  │  servers  │ │  Docker │ │  servers │     │
                      │  │           │ │  MCP    │ │          │     │
                      │  │  • time   │ │         │ │ • athena │     │
                      │  │  • ctx7   │ │ • neo4j │ │ • github │     │
                      │  │           │ │ • graf  │ │ • hface  │     │
                      │  │           │ │ • gitea │ │ • argos  │     │
                      │  │           │ │         │ │ • rommie │     │
                      │  │           │ │         │ │ • caliban│     │
                      │  │           │ │         │ │ • korax  │     │
                      │  └───────────┘ └─────────┘ └──────────┘     │
                      └──────────────────────────────────────────────┘
```

MCPO manages two categories of MCP servers:

- **stdio servers**: MCPO spawns and manages the process (time, context7)
- **streamable-http servers**: MCPO proxies to Docker containers on localhost or remote services across the Incus network
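The two categories correspond to different entry shapes in `config.json`. A sketch of the assumed format (mcpo accepts Claude-Desktop-style server definitions; the field names, the `type` string, and the placeholder URL are assumptions to check against `mcpo/config.json.j2`):

```json
{
  "mcpServers": {
    "time": {
      "command": "/srv/mcpo/.venv/bin/mcp-server-time",
      "args": ["--local-timezone", "UTC"]
    },
    "athena": {
      "type": "streamable-http",
      "url": "http://puck.incus:8000/mcp"
    }
  }
}
```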
## Terraform Resources

### Host Definition

MCPO runs on Miranda, defined in `terraform/containers.tf`:

| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | mcp_docker_host |
| Security Nesting | true |
| AppArmor | unconfined |
| Proxy: mcp_containers | `0.0.0.0:25530-25539` → `127.0.0.1:25530-25539` |
| Proxy: mcpo_ports | `0.0.0.0:25560-25569` → `127.0.0.1:25560-25569` |

### Dependencies

| Resource | Relationship |
|----------|--------------|
| prospero | Monitoring (Alloy → Loki, Prometheus) |
| ariel | Neo4j database for neo4j-cypher and neo4j-memory MCP servers |
| puck | Athena MCP server |
| caliban | Caliban and Rommie MCP servers |
## Ansible Deployment
|
||||
|
||||
### Playbook
|
||||
|
||||
```bash
|
||||
cd ansible
|
||||
ansible-playbook mcpo/deploy.yml
|
||||
```
|
||||
|
||||
### Files
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `mcpo/deploy.yml` | Main deployment playbook |
|
||||
| `mcpo/config.json.j2` | MCP server configuration template |
|
||||
| `mcpo/mcpo.service.j2` | Systemd service unit template |
|
||||
| `mcpo/restart.yml` | Restart playbook with health check |
|
||||
| `mcpo/requirements.txt` | Python package requirements |
|
||||
|
||||
### Deployment Steps
|
||||
|
||||
1. **Create System User**: `mcpo:mcpo` system account
|
||||
2. **Create Directory**: `/srv/mcpo` with restricted permissions
|
||||
3. **Backup Config**: Saves existing `config.json` before overwriting
|
||||
4. **Template Config**: Renders `config.json.j2` with MCP server definitions
|
||||
5. **Install Node.js 22.x**: NodeSource repository for npx-based MCP servers
|
||||
6. **Install Python 3.12**: System packages for virtual environment
|
||||
7. **Create Virtual Environment**: Python 3.12 venv at `/srv/mcpo/.venv`
|
||||
8. **Install pip Packages**: `wheel`, `mcpo`, `mcp-server-time`
|
||||
9. **Pre-install Context7**: Downloads `@upstash/context7-mcp` via npx
|
||||
10. **Deploy Systemd Service**: Enables and starts `mcpo.service`
|
||||
11. **Health Check**: Verifies `http://localhost:25530/docs` returns HTTP 200
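
The final health check can be expressed as an `ansible.builtin.uri` task along these lines (a sketch; the actual task in `mcpo/deploy.yml` may use different names and retry counts):

```yaml
- name: Wait for MCPO to answer on /docs
  ansible.builtin.uri:
    url: "http://localhost:{{ mcpo_port }}/docs"
    status_code: 200
  register: mcpo_docs
  retries: 10
  delay: 3
  until: mcpo_docs.status == 200
```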

## MCP Servers

MCPO aggregates the following MCP servers in `config.json`:

### stdio Servers (managed by MCPO)

| Server | Command | Purpose |
|--------|---------|---------|
| `time` | `mcp-server-time` (Python venv) | Current time with timezone support |
| `upstash-context7` | `npx @upstash/context7-mcp` | Library documentation lookup |

### streamable-http Servers (local Docker containers)

| Server | URL | Purpose |
|--------|-----|---------|
| `neo4j-cypher` | `localhost:25531/mcp` | Neo4j Cypher query execution |
| `neo4j-memory` | `localhost:25532/mcp` | Neo4j knowledge graph memory |
| `grafana` | `localhost:25533/mcp` | Grafana dashboard and API integration |
| `gitea` | `localhost:25535/mcp` | Gitea repository management |

### streamable-http Servers (remote services)

| Server | URL | Purpose |
|--------|-----|---------|
| `argos-searxng` | `miranda.incus:25534/mcp` | SearXNG search integration |
| `athena` | `puck.incus:22461/mcp` | Athena knowledge service (auth required) |
| `github` | `api.githubcopilot.com/mcp/` | GitHub API integration |
| `rommie` | `caliban.incus:8080/mcp` | Rommie agent interface |
| `caliban` | `caliban.incus:22021/mcp` | Caliban computer use agent |
| `korax` | `korax.helu.ca:22021/mcp` | Korax external agent |
| `huggingface` | `huggingface.co/mcp` | Hugging Face model hub |

## Configuration

### Systemd Service

MCPO runs as a systemd service:

```
ExecStart=/srv/mcpo/.venv/bin/mcpo --port 25530 --config /srv/mcpo/config.json
```

- **User:** mcpo
- **Restart:** always (3s delay)
- **WorkingDirectory:** /srv/mcpo
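
Putting those settings together, the rendered unit plausibly looks like this (a sketch assuming a standard unit layout; `mcpo.service.j2` is authoritative):

```ini
[Unit]
Description=MCPO - MCP OpenAPI proxy
After=network-online.target

[Service]
User=mcpo
Group=mcpo
WorkingDirectory=/srv/mcpo
ExecStart=/srv/mcpo/.venv/bin/mcpo --port 25530 --config /srv/mcpo/config.json
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```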

### Storage Locations

| Path | Purpose | Owner |
|------|---------|-------|
| `/srv/mcpo` | Service directory | mcpo:mcpo |
| `/srv/mcpo/.venv` | Python virtual environment | mcpo:mcpo |
| `/srv/mcpo/config.json` | MCP server configuration | mcpo:mcpo |
| `/srv/mcpo/config.json.bak` | Config backup (pre-deploy) | mcpo:mcpo |

## Required Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

| Variable | Purpose |
|----------|---------|
| `vault_athena_mcp_auth` | Bearer token for Athena MCP server |
| `vault_github_personal_access_token` | GitHub personal access token |
| `vault_huggingface_mcp_token` | Hugging Face API token |
| `vault_gitea_mcp_access_token` | Gitea personal access token for MCP |

```bash
ansible-vault edit inventory/group_vars/all/vault.yml
```

## Host Variables

**File:** `ansible/inventory/host_vars/miranda.incus.yml`

```yaml
# MCPO Config
mcpo_user: mcpo
mcpo_group: mcpo
mcpo_directory: /srv/mcpo
mcpo_port: 25530
argos_mcp_url: http://miranda.incus:25534/mcp
athena_mcp_auth: "{{ vault_athena_mcp_auth }}"
athena_mcp_url: http://puck.incus:22461/mcp
github_personal_access_token: "{{ vault_github_personal_access_token }}"
neo4j_cypher_mcp_port: 25531
neo4j_memory_mcp_port: 25532
caliban_mcp_url: http://caliban.incus:22021/mcp
korax_mcp_url: http://korax.helu.ca:22021/mcp
huggingface_mcp_token: "{{ vault_huggingface_mcp_token }}"
gitea_mcp_port: 25535
```

## Monitoring

### Loki Logs

MCPO logs are collected via systemd journal by Alloy on Miranda. A relabel rule in Alloy's config tags `mcpo.service` journal entries with `job="mcpo"` so they appear as a dedicated app in Grafana dashboards.
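
A rule of that shape looks roughly like this in Alloy's River syntax (a sketch; the component label and write target are illustrative, not the actual `config.alloy` contents):

```alloy
loki.relabel "journal" {
  forward_to = [loki.write.default.receiver]

  rule {
    source_labels = ["__journal__systemd_unit"]
    regex         = "mcpo\\.service"
    target_label  = "job"
    replacement   = "mcpo"
  }
}
```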

| Log Source | Labels |
|------------|--------|
| Systemd journal | `{job="mcpo", hostname="miranda.incus"}` |

The Docker-based MCP servers (neo4j, grafana, gitea) each have dedicated syslog ports forwarded to Loki:

| Server | Syslog Port | Loki Job |
|--------|-------------|----------|
| neo4j-cypher | 51431 | `neo4j-cypher` |
| neo4j-memory | 51432 | `neo4j-memory` |
| grafana-mcp | 51433 | `grafana_mcp` |
| argos | 51434 | `argos` |
| gitea-mcp | 51435 | `gitea-mcp` |

### Grafana

Query MCPO-related logs in Grafana Explore:

```
{hostname="miranda.incus", job="mcpo"}
{hostname="miranda.incus", job="gitea-mcp"}
{hostname="miranda.incus", job="grafana_mcp"}
```

## Operations

### Start/Stop

```bash
ssh miranda.incus

# MCPO service
sudo systemctl start mcpo
sudo systemctl stop mcpo
sudo systemctl restart mcpo

# Or use the restart playbook with health check
cd ansible
ansible-playbook mcpo/restart.yml
```

### Health Check

```bash
# API docs endpoint
curl http://miranda.incus:25530/docs

# From Miranda itself
curl http://localhost:25530/docs
```

### Logs

```bash
# MCPO systemd journal
ssh miranda.incus "sudo journalctl -u mcpo -f"

# Docker MCP server logs
ssh miranda.incus "docker logs -f gitea-mcp"
ssh miranda.incus "docker logs -f grafana-mcp"
```

### Adding a New MCP Server

1. Add the server definition to `ansible/mcpo/config.json.j2`
2. Add any required variables to `ansible/inventory/host_vars/miranda.incus.yml`
3. Add vault secrets (if needed) to `inventory/group_vars/all/vault.yml`
4. If Docker-based: create a new `ansible/{service}/deploy.yml` and `docker-compose.yml.j2`
5. If Docker-based: add a syslog port to Miranda's host vars and Alloy config
6. Redeploy: `ansible-playbook mcpo/deploy.yml`
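
For step 1, a streamable-http entry in `config.json.j2` would look roughly like this (the server name and variable are illustrative placeholders, not actual template contents):

```jinja
"my-service": {
  "type": "streamable-http",
  "url": "http://localhost:{{ my_service_mcp_port }}/mcp"
}
```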

## Troubleshooting

### Common Issues

| Symptom | Cause | Resolution |
|---------|-------|------------|
| MCPO won't start | Config JSON syntax error | Check `config.json` with `python3 -m json.tool` |
| Server shows "unavailable" | Backend MCP server not running | Check Docker containers or remote service status |
| Context7 timeout on first use | npx downloading package | Wait for download to complete, or re-run pre-install |
| Health check fails | Port not ready | Increase retry delay, check `journalctl -u mcpo` |
| stdio server crash loops | Missing runtime dependency | Verify Python venv and Node.js installation |

### Debug Commands

```bash
# Check MCPO service status
ssh miranda.incus "sudo systemctl status mcpo"

# Validate config.json syntax
ssh miranda.incus "python3 -m json.tool /srv/mcpo/config.json"

# List Docker MCP containers
ssh miranda.incus "docker ps --filter name=mcp"

# Test a specific MCP server endpoint
ssh miranda.incus "curl -s http://localhost:25531/mcp | head"

# Check MCPO port is listening
ssh miranda.incus "ss -tlnp | grep 25530"
```

## References

- **MCPO Repository**: https://github.com/nicobailey/mcpo
- **MCP Specification**: https://modelcontextprotocol.io/
- [Ansible Practices](ansible.md)
- [Agathos Overview](agathos.md)

`docs/neo4j.md`

# Neo4j - Graph Database Platform

## Overview

Neo4j is a high-performance graph database providing native graph storage and processing. It enables efficient traversal of complex relationships and is used for knowledge graphs, recommendation engines, and connected data analysis. Deployed with the **APOC plugin** enabled for extended stored procedures and functions.

**Host:** ariel.incus
**Role:** graph_database
**Container Port:** 25554 (HTTP Browser), 7687 (Bolt)
**External Access:** Direct Bolt connection via `ariel.incus:7687`

## Architecture

```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Client │─────▶│ Neo4j │◀─────│ Neo4j MCP │
│ (Browser) │ │ (Ariel) │ │ (Miranda) │
└──────────────┘ └──────────────┘ └──────────────┘
│ │
│ ▼
│ ┌──────────────┐
└────────────▶│ Neo4j Browser│
│ HTTP :25554 │
└──────────────┘
```

- **Neo4j Browser**: Web-based query interface on port 25554
- **Bolt Protocol**: Binary protocol on port 7687 for high-performance connections
- **APOC Plugin**: Extended procedures for import/export, graph algorithms, and utilities
- **Neo4j MCP Servers**: Connect via Bolt from Miranda for AI agent access

## Terraform Resources

### Host Definition

The service runs on `ariel`, defined in `terraform/containers.tf`:

| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | graph_database |
| Security Nesting | true |
| AppArmor | unconfined |
| Description | Neo4j Host - Ethereal graph connections |

### Proxy Devices

| Device Name | Listen | Connect |
|-------------|--------|---------|
| neo4j_ports | tcp:0.0.0.0:25554 | tcp:127.0.0.1:25554 |

### Dependencies

| Resource | Relationship |
|----------|--------------|
| Prospero | Monitoring stack must exist for Alloy log shipping |
| Miranda | Neo4j MCP servers connect to Neo4j via Bolt |

## Ansible Deployment

### Playbook

```bash
cd ansible
ansible-playbook neo4j/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `neo4j/deploy.yml` | Main deployment playbook |
| `neo4j/docker-compose.yml.j2` | Docker Compose template |
| `alloy/ariel/config.alloy.j2` | Alloy log collection config |

### Deployment Steps

1. **Create System User**: `neo4j:neo4j` system group and user
2. **Configure ponos Access**: Add ponos user to neo4j group
3. **Create Directory**: `/srv/neo4j` with proper ownership
4. **Template Compose File**: Apply `docker-compose.yml.j2`
5. **Start Service**: Launch via `docker_compose_v2` module

## Configuration

### Host Variables (`host_vars/ariel.incus.yml`)

| Variable | Description | Default |
|----------|-------------|---------|
| `neo4j_version` | Neo4j Docker image version | `5.26.0` |
| `neo4j_user` | System user | `neo4j` |
| `neo4j_group` | System group | `neo4j` |
| `neo4j_directory` | Installation directory | `/srv/neo4j` |
| `neo4j_auth_user` | Database admin username | `neo4j` |
| `neo4j_auth_password` | Database admin password | `{{ vault_neo4j_auth_password }}` |
| `neo4j_http_port` | HTTP browser port | `25554` |
| `neo4j_bolt_port` | Bolt protocol port | `7687` |
| `neo4j_syslog_port` | Local syslog port for Alloy | `22011` |
| `neo4j_apoc_unrestricted` | APOC procedures allowed | `apoc.*` |

### Vault Variables (`group_vars/all/vault.yml`)

| Variable | Description |
|----------|-------------|
| `vault_neo4j_auth_password` | Neo4j admin password |

### APOC Plugin Configuration

The APOC (Awesome Procedures on Cypher) plugin is enabled with the following settings:

| Environment Variable | Value | Purpose |
|---------------------|-------|---------|
| `NEO4J_PLUGINS` | `["apoc"]` | Install APOC plugin |
| `NEO4J_apoc_export_file_enabled` | `true` | Allow file exports |
| `NEO4J_apoc_import_file_enabled` | `true` | Allow file imports |
| `NEO4J_apoc_import_file_use__neo4j__config` | `true` | Use Neo4j config for imports |
| `NEO4J_dbms_security_procedures_unrestricted` | `apoc.*` | Allow all APOC procedures |
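
With `apoc.*` unrestricted, extended procedures can be called directly from Cypher, for example:

```cypher
// Summarize the graph via APOC's meta procedures
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels
RETURN nodeCount, relCount, labels;
```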

### Docker Volumes

| Volume | Mount Point | Purpose |
|--------|-------------|---------|
| `neo4j_data` | `/data` | Database files |
| `neo4j_logs` | `/logs` | Application logs |
| `neo4j_plugins` | `/plugins` | APOC and other plugins |
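
A compose file consistent with the environment variables and volumes above would look roughly like this (a sketch using the documented host variables; `neo4j/docker-compose.yml.j2` is authoritative, including the syslog logging wiring described under Monitoring):

```yaml
services:
  neo4j:
    image: "neo4j:{{ neo4j_version }}"
    container_name: neo4j
    ports:
      - "127.0.0.1:{{ neo4j_http_port }}:7474"
      - "{{ neo4j_bolt_port }}:7687"
    environment:
      NEO4J_AUTH: "{{ neo4j_auth_user }}/{{ neo4j_auth_password }}"
      NEO4J_PLUGINS: '["apoc"]'
      NEO4J_apoc_export_file_enabled: "true"
      NEO4J_apoc_import_file_enabled: "true"
      NEO4J_apoc_import_file_use__neo4j__config: "true"
      NEO4J_dbms_security_procedures_unrestricted: "{{ neo4j_apoc_unrestricted }}"
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs
      - neo4j_plugins:/plugins

volumes:
  neo4j_data:
  neo4j_logs:
  neo4j_plugins:
```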

## Monitoring

### Alloy Configuration

**File:** `ansible/alloy/ariel/config.alloy.j2`

Alloy on Ariel collects:
- System logs (`/var/log/syslog`, `/var/log/auth.log`)
- Systemd journal
- Neo4j Docker container logs via syslog

### Loki Logs

| Log Source | Labels |
|------------|--------|
| Neo4j container | `{job="neo4j", hostname="ariel.incus"}` |
| System logs | `{job="syslog", hostname="ariel.incus"}` |

### Prometheus Metrics

Host-level metrics collected via Alloy's Unix exporter:

| Metric | Description |
|--------|-------------|
| `node_*` | Standard node exporter metrics |

### Log Collection Flow

```
Neo4j Container → Syslog (tcp:127.0.0.1:22011) → Alloy → Loki (Prospero)
```

## Operations

### Start/Stop

```bash
# Via Docker Compose
cd /srv/neo4j
docker compose up -d
docker compose down

# Via Ansible
ansible-playbook neo4j/deploy.yml
```

### Health Check

```bash
# HTTP Browser
curl http://ariel.incus:25554

# Bolt connection test
cypher-shell -a bolt://ariel.incus:7687 -u neo4j -p <password> "RETURN 1"
```

### Logs

```bash
# Docker container logs
docker logs -f neo4j

# Via Loki (Grafana Explore)
{job="neo4j", hostname="ariel.incus"}
```

### Cypher Shell Access

```bash
# SSH to Ariel and exec into container
ssh ariel.incus
docker exec -it neo4j cypher-shell -u neo4j -p <password>
```

### Backup

Neo4j data persists in Docker volumes. Backup procedures:

```bash
# Stop container for consistent backup
docker compose -f /srv/neo4j/docker-compose.yml stop

# Backup volumes
docker run --rm -v neo4j_data:/data -v /backup:/backup alpine \
  tar czf /backup/neo4j_data_$(date +%Y%m%d).tar.gz -C /data .

# Start container
docker compose -f /srv/neo4j/docker-compose.yml up -d
```

### Restore

```bash
# Stop container
docker compose -f /srv/neo4j/docker-compose.yml down

# Remove existing volume
docker volume rm neo4j_data

# Create new volume and restore
docker volume create neo4j_data
docker run --rm -v neo4j_data:/data -v /backup:/backup alpine \
  tar xzf /backup/neo4j_data_YYYYMMDD.tar.gz -C /data

# Start container
docker compose -f /srv/neo4j/docker-compose.yml up -d
```

## Troubleshooting

### Common Issues

| Symptom | Cause | Resolution |
|---------|-------|------------|
| Container won't start | Auth format issue | Check `NEO4J_AUTH` format is `user/password` |
| APOC procedures fail | Security restrictions | Verify `neo4j_apoc_unrestricted` includes procedure |
| Connection refused | Port not exposed | Check Incus proxy device configuration |
| Bolt connection fails | Wrong port | Use port 7687, not 25554 |

### Debug Mode

```bash
# View container startup logs
docker logs neo4j

# Check Neo4j internal logs
docker exec neo4j cat /logs/debug.log
```

### Verify APOC Installation

```cypher
CALL apoc.help("apoc")
YIELD name, text
RETURN name, text LIMIT 10;
```

## Related Services

### Neo4j MCP Servers (Miranda)

Two MCP servers run on Miranda to provide AI agent access to Neo4j:

| Server | Port | Purpose |
|--------|------|---------|
| neo4j-cypher | 25531 | Direct Cypher query execution |
| neo4j-memory | 25532 | Knowledge graph memory operations |

See [Neo4j MCP documentation](#neo4j-mcp-servers) for deployment details.

## References

- [Neo4j Documentation](https://neo4j.com/docs/)
- [APOC Library Documentation](https://neo4j.com/labs/apoc/)
- [Terraform Practices](../terraform.md)
- [Ansible Practices](../ansible.md)
- [Sandbox Overview](../agathos.md)

`docs/nextcloud.md`

# Nextcloud - Self-Hosted Cloud Collaboration

## Overview

Nextcloud is a self-hosted cloud collaboration platform providing file storage, sharing, calendar, contacts, and productivity tools. Deployed as a **native LAPP stack** (Linux, Apache, PostgreSQL, PHP) on **Rosalind** with Memcached caching and an Incus storage volume for data.

**Host:** rosalind.incus
**Role:** Collaboration (PHP, Go, Node.js runtimes)
**Container Port:** 22083
**External Access:** https://nextcloud.ouranos.helu.ca/ (via HAProxy on Titania)
**Installation Method:** Native (tar.bz2 extraction to /var/www/nextcloud)

## Architecture

```
┌──────────┐ ┌────────────┐ ┌───────────┐ ┌───────────┐
│ Client │─────▶│ HAProxy │─────▶│ Apache2 │─────▶│PostgreSQL │
│ │ │ (Titania) │ │ Nextcloud │ │ (Portia) │
└──────────┘ └────────────┘ │(Rosalind) │ └───────────┘
└───────────┘
│
├─────────▶ Memcached (Local)
│
└─────────▶ /mnt/nextcloud (Volume)
```

## Deployment

### Playbook

```bash
cd ansible
ansible-playbook nextcloud/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `nextcloud/deploy.yml` | Main deployment playbook |
| `nextcloud/nextcloud.conf.j2` | Apache VirtualHost template |

### Deployment Steps

1. **Install Packages**: Apache2, PHP extensions, and Memcached (see Configuration below)
2. **Create Data Directory**: `/mnt/nextcloud` on Incus storage volume
3. **Download Nextcloud**: Latest tarball from official site (if not already present)
4. **Extract to Web Root**: `/var/www/nextcloud` (if new installation)
5. **Set Permissions**: `www-data:www-data` ownership
6. **Configure Apache**: Template vhost with port 22083, enable mods, disable default site
7. **Run Installation**: OCC command-line installer (generates config.php with secrets)
8. **Configure via OCC**: Set trusted domains, Memcached, background job mode
9. **Setup Cron**: Background jobs every 5 minutes as www-data

**⚠️ Important**: The playbook does NOT template over config.php after installation. All configuration changes are made via OCC commands to preserve auto-generated secrets (instanceid, passwordsalt, secret).
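
The templated VirtualHost plausibly follows the standard Nextcloud Apache layout on the non-default port (a sketch; `nextcloud.conf.j2` is authoritative):

```apache
Listen {{ nextcloud_web_port }}
<VirtualHost *:{{ nextcloud_web_port }}>
    ServerName {{ nextcloud_domain }}
    DocumentRoot /var/www/nextcloud

    <Directory /var/www/nextcloud/>
        Require all granted
        AllowOverride All
        Options FollowSymLinks MultiViews
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
```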

## Configuration

### Key Features

- **PostgreSQL Backend**: Database on Portia
- **Memcached Caching**: Local distributed cache with `nc_` prefix
- **Incus Storage Volume**: Dedicated 100GB volume at /mnt/nextcloud
- **Apache Web Server**: mod_php with rewrite/headers modules
- **Cron Background Jobs**: System cron (not Docker/AJAX)
- **Native Installation**: No Docker overhead, matches production pattern

### Storage Configuration

| Path | Purpose | Owner | Mount |
|------|---------|-------|-------|
| `/var/www/nextcloud` | Application files | www-data | Local |
| `/mnt/nextcloud` | User data directory | www-data | Incus volume |
| `/var/log/apache2` | Web server logs | root | Local |

### Apache Modules

Required modules enabled by playbook:
- `rewrite` - URL rewriting
- `headers` - HTTP header manipulation
- `env` - Environment variable passing
- `dir` - Directory index handling
- `mime` - MIME type configuration

### PHP Configuration

Installed PHP extensions:
- `php-gd` - Image manipulation
- `php-pgsql` - PostgreSQL database
- `php-curl` - HTTP client
- `php-mbstring` - Multibyte string handling
- `php-intl` - Internationalization
- `php-gmp` - GNU Multiple Precision
- `php-bcmath` - Binary calculator
- `php-xml` - XML processing
- `php-imagick` - ImageMagick integration
- `php-zip` - ZIP archive handling
- `php-memcached` - Memcached caching

### Memcached Configuration

- **Host**: localhost:11211
- **Prefix**: `nc_` (Nextcloud-specific keys)
- **Local**: `\OC\Memcache\Memcached`
- **Distributed**: `\OC\Memcache\Memcached`
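
In `config.php`, these settings correspond to the standard Nextcloud memcache keys, roughly (set via OCC during deployment, not templated by hand):

```php
'memcache.local' => '\OC\Memcache\Memcached',
'memcache.distributed' => '\OC\Memcache\Memcached',
'memcached_servers' => [['localhost', 11211]],
```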

### Cron Jobs

Background jobs configured via system cron:
```cron
*/5 * * * * php /var/www/nextcloud/cron.php
```

Runs as `www-data` user every 5 minutes.

## Access After Deployment

1. **Web Interface**: https://nextcloud.ouranos.helu.ca/
2. **First Login**: Use admin credentials from vault
3. **Initial Setup**: Configure apps and settings via web UI
4. **Client Apps**: Download desktop/mobile clients from Nextcloud website

### Desktop/Mobile Sync

- **Server URL**: https://nextcloud.ouranos.helu.ca
- **Username**: admin (or created user)
- **Password**: From vault
- **Desktop Client**: https://nextcloud.com/install/#install-clients
- **Mobile Apps**: iOS App Store / Google Play Store

### WebDAV Access

- **WebDAV URL**: `https://nextcloud.ouranos.helu.ca/remote.php/dav/files/USERNAME/`
- **Use Cases**: File sync, calendar (CalDAV), contacts (CardDAV)

## Monitoring

### Alloy Configuration

**File:** `ansible/alloy/rosalind/config.alloy.j2`

- **Apache Access Logs**: `/var/log/apache2/access.log` → Loki
- **Apache Error Logs**: `/var/log/apache2/error.log` → Loki
- **System Metrics**: Process exporter tracks Apache/PHP processes
- **Labels**: job=apache_access, job=apache_error

### Health Checks

**HAProxy Health Endpoint**: `/status.php`

**Manual Health Check**:
```bash
curl http://rosalind.incus:22083/status.php
```

Expected response: JSON with status information

## Required Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

### 1. Database Password

```yaml
vault_nextcloud_db_password: "PostgresSecurePassword123!"
```

**Requirements:**
- Minimum 12 characters
- Used by PostgreSQL authentication

### 2. Admin Password

```yaml
vault_nextcloud_admin_password: "AdminSecurePassword123!"
```

**Requirements:**
- Minimum 8 characters (Nextcloud requirement)
- Used for admin user login
- **Important**: Store securely, used for web interface access

### 3. Instance Secrets (Auto-Generated)

These are automatically generated during installation by the OCC installer and stored in `/var/www/nextcloud/config/config.php`. The host_vars should leave these empty:

```yaml
nextcloud_instance_id: ""   # Auto-generated, leave empty
nextcloud_password_salt: "" # Auto-generated, leave empty
nextcloud_secret: ""        # Auto-generated, leave empty
```

**ℹ️ These secrets persist in config.php and do not need to be stored in vault or host_vars.** They are only referenced in these variables for consistency with the original template design.

## Host Variables

**File:** `ansible/inventory/host_vars/rosalind.incus.yml`

```yaml
# Nextcloud Configuration
nextcloud_web_port: 22083
nextcloud_data_dir: /mnt/nextcloud

# Database Configuration
nextcloud_db_type: pgsql
nextcloud_db_host: portia.incus
nextcloud_db_port: 5432
nextcloud_db_name: nextcloud
nextcloud_db_user: nextcloud
nextcloud_db_password: "{{ vault_nextcloud_db_password }}"

# Admin Configuration
nextcloud_admin_user: admin
nextcloud_admin_password: "{{ vault_nextcloud_admin_password }}"

# Domain Configuration
nextcloud_domain: nextcloud.ouranos.helu.ca

# Instance secrets (generated during install)
nextcloud_instance_id: ""
nextcloud_password_salt: ""
nextcloud_secret: ""
```

## Database Setup

Nextcloud requires a PostgreSQL database on Portia. This is automatically created by the `postgresql/deploy.yml` playbook.

**Database Details:**
- **Name**: nextcloud
- **User**: nextcloud
- **Owner**: nextcloud
- **Extensions**: None required

## Storage Setup

### Incus Storage Volume

**Terraform Resource:** `terraform/storage.tf`

```hcl
resource "incus_storage_volume" "nextcloud_data" {
  name    = "nextcloud-data"
  pool    = "default"
  project = "agathos"
  config  = { size = "100GB" }
}
```

Mounted at `/mnt/nextcloud` on Rosalind container. This volume stores all Nextcloud user data, including uploaded files, app data, and user-specific configurations.

## Integration with Other Services

### HAProxy Routing

**Backend Configuration** (`titania.incus.yml`):

```yaml
- subdomain: "nextcloud"
  backend_host: "rosalind.incus"
  backend_port: 22083
  health_path: "/status.php"
```

### Memcached Integration

- **Host**: localhost:11211
- **Prefix**: `nc_`
- **Shared Instance**: Rosalind hosts Memcached for all services

## Troubleshooting

### Service Status

```bash
ssh rosalind.incus
sudo systemctl status apache2
```

### View Logs

```bash
# Apache access logs
sudo tail -f /var/log/apache2/access.log

# Apache error logs
sudo tail -f /var/log/apache2/error.log

# Nextcloud logs (via web UI)
# Settings → Logging
```

### OCC Command-Line Tool

```bash
# As www-data user
sudo -u www-data php /var/www/nextcloud/occ

# Examples:
sudo -u www-data php /var/www/nextcloud/occ status
sudo -u www-data php /var/www/nextcloud/occ config:list
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
```

### Database Connection

```bash
psql -h portia.incus -U nextcloud -d nextcloud
```

### Check Memcached

```bash
echo "stats" | nc localhost 11211
```

### Reset Storage Permissions

```bash
# Reset ownership
sudo chown -R www-data:www-data /var/www/nextcloud
sudo chown -R www-data:www-data /mnt/nextcloud

# Reset permissions
sudo chmod -R 0750 /var/www/nextcloud
```

### Maintenance Mode

```bash
# Enable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on

# Disable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
```

## Updates and Maintenance

### Updating Nextcloud

**⚠️ Important**: Always backup before updating!

```bash
# 1. Enable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on

# 2. Backup config and database
sudo cp -r /var/www/nextcloud/config /backup/nextcloud-config-$(date +%Y%m%d)
pg_dump -h portia.incus -U nextcloud nextcloud > /backup/nextcloud-db-$(date +%Y%m%d).sql

# 3. Download new version
wget https://download.nextcloud.com/server/releases/latest.tar.bz2

# 4. Extract and replace (preserve config/)
tar -xjf latest.tar.bz2
sudo rsync -av --delete --exclude config/ nextcloud/ /var/www/nextcloud/

# 5. Run upgrade
sudo -u www-data php /var/www/nextcloud/occ upgrade

# 6. Disable maintenance mode
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
```

### Database Maintenance

```bash
# Add missing indices
sudo -u www-data php /var/www/nextcloud/occ db:add-missing-indices

# Convert to bigint
sudo -u www-data php /var/www/nextcloud/occ db:convert-filecache-bigint
```

## Version Information

- **Installation Method**: Tarball extraction (official releases)
- **Current Version**: Check web UI → Settings → Overview
- **Update Channel**: Stable (latest.tar.bz2)
- **PHP Version**: Installed by apt (Ubuntu repository version)

## Docker vs Native Comparison

**Why Native Installation?**

| Aspect | Native (Chosen) | Docker |
|--------|-----------------|--------|
| **Performance** | Better (no container overhead) | Good |
| **Updates** | Manual tarball extraction | Container image pull |
| **Cron Jobs** | System cron (reliable) | Requires sidecar/exec |
| **App Updates** | Direct via web UI | Limited/complex |
| **Customization** | Full PHP/Apache control | Constrained by image |
| **Production Match** | Yes (same pattern) | No |
| **Complexity** | Lower for LAMP stack | Higher for orchestration |

**Recommendation**: Native installation matches the production deployment pattern and avoids Docker-specific limitations with Nextcloud's app ecosystem and cron requirements.

## References

- **Official Documentation**: https://docs.nextcloud.com/
- **Admin Manual**: https://docs.nextcloud.com/server/latest/admin_manual/
- **Installation Guide**: https://docs.nextcloud.com/server/latest/admin_manual/installation/
- **OCC Commands**: https://docs.nextcloud.com/server/latest/admin_manual/configuration_server/occ_command.html
|
||||
314
docs/oauth2_proxy.md
Normal file
@@ -0,0 +1,314 @@

# OAuth2-Proxy Authentication Gateway
# Red Panda Approved

## Overview

OAuth2-Proxy provides authentication for services that don't natively support SSO/OIDC.
It acts as a reverse proxy that requires users to authenticate via Casdoor before
accessing the upstream service.

This document describes the generic approach for adding OAuth2-Proxy authentication
to any service in the Agathos infrastructure.

## Architecture

```
┌──────────────┐     ┌───────────────┐     ┌────────────────┐     ┌───────────────┐
│   Browser    │────▶│    HAProxy    │────▶│  OAuth2-Proxy  │────▶│  Your Service │
│              │     │   (titania)   │     │   (titania)    │     │   (any host)  │
└──────────────┘     └───────┬───────┘     └───────┬────────┘     └───────────────┘
                             │                     │
                             │     ┌───────────────▼───────────────┐
                             └────▶│            Casdoor            │
                                   │  (OIDC Provider - titania)    │
                                   └───────────────────────────────┘
```

## How It Works

1. User requests `https://service.ouranos.helu.ca/`
2. HAProxy routes to OAuth2-Proxy (titania:22082)
3. OAuth2-Proxy checks for valid session cookie
4. **No session?** → Redirect to Casdoor login → After login, redirect back with cookie
5. **Valid session?** → Forward request to upstream service
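
The redirect in step 4 is a standard OAuth2 authorization-code request. A sketch of the URL that gets constructed (the endpoint path follows Casdoor's OAuth API; all values here are hypothetical placeholders, not the deployed credentials):

```bash
# Hypothetical values; real ones come from Casdoor and host_vars
client_id="searxng-client-id"
redirect_uri="https://searxng.ouranos.helu.ca/oauth2/callback"
issuer="https://id.ouranos.helu.ca"

# Authorization-code request: the user's browser is sent here when no session exists
auth_url="${issuer}/login/oauth/authorize?client_id=${client_id}&response_type=code&redirect_uri=${redirect_uri}&scope=openid"
echo "$auth_url"
```

After login, Casdoor redirects to `redirect_uri` with a `code` parameter, which OAuth2-Proxy exchanges for tokens before setting the session cookie.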

## File Structure

```
ansible/oauth2_proxy/
├── deploy.yml              # Main deployment playbook
├── docker-compose.yml.j2   # Docker Compose template
├── oauth2-proxy.cfg.j2     # OAuth2-Proxy configuration
└── stage.yml               # Validation/staging playbook
```

Monitoring configuration is integrated into the host-specific Alloy config:
- `ansible/alloy/titania/config.alloy.j2` - Contains OAuth2-Proxy log collection and metrics scraping

## Variable Architecture

The OAuth2-Proxy template uses **generic variables** (`oauth2_proxy_*`) that are
mapped from **service-specific variables** in host_vars:

```
Vault (service-specific)      Host Vars (mapping)      Template (generic)
────────────────────────      ───────────────────      ──────────────────
vault_<service>_oauth2_*  ──► <service>_oauth2_*   ──► oauth2_proxy_*
```

This allows:
- Multiple services to use the same OAuth2-Proxy template
- Service-specific credentials in vault
- Clear naming conventions

## Configuration Steps

### Step 1: Create Casdoor Application

1. Log in to Casdoor at `https://id.ouranos.helu.ca/` (Casdoor SSO)
2. Navigate to **Applications** → **Add**
3. Configure:
   - **Name**: `<your-service>` (e.g., `searxng`, `jupyter`)
   - **Organization**: `heluca` (or your organization)
   - **Redirect URLs**: `https://<service>.ouranos.helu.ca/oauth2/callback`
   - **Grant Types**: `authorization_code`, `refresh_token`
4. Save and note the **Client ID** and **Client Secret**

### Step 2: Add Vault Secrets

```bash
ansible-vault edit ansible/inventory/group_vars/all/vault.yml
```

Add service-specific credentials:
```yaml
# SearXNG OAuth2 credentials
vault_searxng_oauth2_client_id: "abc123..."
vault_searxng_oauth2_client_secret: "secret..."
vault_searxng_oauth2_cookie_secret: "<generate-with-command-below>"
```

Generate cookie secret:
```bash
openssl rand -base64 32
```
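
OAuth2-Proxy requires the cookie secret to decode to exactly 16, 24, or 32 raw bytes; `openssl rand -base64 32` satisfies this. A quick sanity check:

```bash
# Generate a secret and verify it decodes to 32 raw bytes
secret=$(openssl rand -base64 32)
bytes=$(printf '%s' "$secret" | base64 -d | wc -c)
echo "$bytes"
```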

### Step 3: Configure Host Variables

Add to the host that will run OAuth2-Proxy (typically `titania.incus.yml`):

```yaml
# =============================================================================
# <Service> OAuth2 Configuration (Service-Specific)
# =============================================================================
<service>_oauth2_client_id: "{{ vault_<service>_oauth2_client_id }}"
<service>_oauth2_client_secret: "{{ vault_<service>_oauth2_client_secret }}"
<service>_oauth2_cookie_secret: "{{ vault_<service>_oauth2_cookie_secret }}"

# =============================================================================
# OAuth2-Proxy Configuration (Generic Template Variables)
# =============================================================================
oauth2_proxy_user: oauth2proxy
oauth2_proxy_group: oauth2proxy
oauth2_proxy_uid: 802
oauth2_proxy_gid: 802
oauth2_proxy_directory: /srv/oauth2-proxy
oauth2_proxy_port: 22082

# OIDC Configuration
oauth2_proxy_oidc_issuer_url: "http://titania.incus:{{ casdoor_port }}"

# Map service-specific credentials to generic template variables
oauth2_proxy_client_id: "{{ <service>_oauth2_client_id }}"
oauth2_proxy_client_secret: "{{ <service>_oauth2_client_secret }}"
oauth2_proxy_cookie_secret: "{{ <service>_oauth2_cookie_secret }}"

# Service-specific URLs
oauth2_proxy_redirect_url: "https://<service>.{{ haproxy_domain }}/oauth2/callback"
oauth2_proxy_upstream_url: "http://<service-host>:<service-port>"
oauth2_proxy_cookie_domain: "{{ haproxy_domain }}"

# Access Control
oauth2_proxy_email_domains:
  - "*"  # Or restrict to specific domains

# Session Configuration
oauth2_proxy_cookie_expire: "168h"
oauth2_proxy_cookie_refresh: "1h"

# SSL Verification
oauth2_proxy_skip_ssl_verify: true  # Set false for production
```

### Step 4: Update HAProxy Backend

Change the service backend to route through OAuth2-Proxy:

```yaml
haproxy_backends:
  - subdomain: "<service>"
    backend_host: "titania.incus"  # OAuth2-Proxy host
    backend_port: 22082            # OAuth2-Proxy port
    health_path: "/ping"           # OAuth2-Proxy health endpoint
```

### Step 5: Deploy

```bash
cd ansible

# Validate configuration
ansible-playbook oauth2_proxy/stage.yml

# Deploy OAuth2-Proxy
ansible-playbook oauth2_proxy/deploy.yml

# Update HAProxy routing
ansible-playbook haproxy/deploy.yml
```

## Complete Example: SearXNG

### Vault Variables
```yaml
vault_searxng_oauth2_client_id: "searxng-client-id-from-casdoor"
vault_searxng_oauth2_client_secret: "searxng-client-secret-from-casdoor"
vault_searxng_oauth2_cookie_secret: "ABCdef123..."
```

### Host Variables (titania.incus.yml)
```yaml
# SearXNG OAuth2 (service-specific)
searxng_oauth2_client_id: "{{ vault_searxng_oauth2_client_id }}"
searxng_oauth2_client_secret: "{{ vault_searxng_oauth2_client_secret }}"
searxng_oauth2_cookie_secret: "{{ vault_searxng_oauth2_cookie_secret }}"

# OAuth2-Proxy (generic mapping)
oauth2_proxy_client_id: "{{ searxng_oauth2_client_id }}"
oauth2_proxy_client_secret: "{{ searxng_oauth2_client_secret }}"
oauth2_proxy_cookie_secret: "{{ searxng_oauth2_cookie_secret }}"
oauth2_proxy_redirect_url: "https://searxng.{{ haproxy_domain }}/oauth2/callback"
oauth2_proxy_upstream_url: "http://oberon.incus:25599"
```

### HAProxy Backend
```yaml
- subdomain: "searxng"
  backend_host: "titania.incus"
  backend_port: 22082
  health_path: "/ping"
```

## Adding a Second Service (e.g., Jupyter)

When adding authentication to another service, you would:

1. Create a new Casdoor application for Jupyter
2. Add vault variables:
   ```yaml
   vault_jupyter_oauth2_client_id: "..."
   vault_jupyter_oauth2_client_secret: "..."
   vault_jupyter_oauth2_cookie_secret: "..."
   ```
3. Either:
   - **Option A**: Deploy a second OAuth2-Proxy instance on a different port
   - **Option B**: Configure the same OAuth2-Proxy with multiple upstreams (more complex)

For multiple services, **Option A** is recommended for isolation and simplicity.

## Monitoring

OAuth2-Proxy monitoring is handled by Grafana Alloy, which runs on each host.

### Architecture

```
OAuth2-Proxy ─────► Grafana Alloy ─────► Prometheus (prospero)
  (titania)          (local agent)         (remote_write)
                           │
                           └─────────────► Loki (prospero)
                                             (log forwarding)
```

### Metrics (via Prometheus)

Alloy scrapes OAuth2-Proxy metrics at `/metrics` and forwards them to Prometheus:
- `oauth2_proxy_requests_total` - Total requests processed
- `oauth2_proxy_errors_total` - Total errors
- `oauth2_proxy_upstream_latency_seconds` - Latency to upstream service

Configuration in `ansible/alloy/titania/config.alloy.j2`:
```alloy
prometheus.scrape "oauth2_proxy" {
  targets         = [{"__address__" = "127.0.0.1:{{oauth2_proxy_port}}"}]
  scrape_interval = "30s"
  forward_to      = [prometheus.remote_write.default.receiver]
  job_name        = "oauth2-proxy"
}
```

### Logs (via Loki)

OAuth2-Proxy logs are collected via syslog and forwarded to Loki:
```alloy
loki.source.syslog "oauth2_proxy_logs" {
  listener {
    address  = "127.0.0.1:{{oauth2_proxy_syslog_port}}"
    protocol = "tcp"
    labels   = { job = "oauth2-proxy", hostname = "{{inventory_hostname}}" }
  }
  forward_to = [loki.write.default.receiver]
}
```

### Deploy Alloy After Changes

If you update the Alloy configuration:
```bash
ansible-playbook alloy/deploy.yml --limit titania.incus
```

## Security Considerations

1. **Cookie Security**:
   - `cookie_secure = true` - HTTPS only
   - `cookie_httponly = true` - No JavaScript access
   - `cookie_samesite = "lax"` - CSRF protection

2. **Access Control**:
   - Use `oauth2_proxy_email_domains` to restrict by email domain
   - Use `oauth2_proxy_allowed_groups` to restrict by Casdoor groups

3. **SSL Verification**:
   - Set `oauth2_proxy_skip_ssl_verify: false` in production
   - Ensure Casdoor has valid SSL certificates

## Troubleshooting

### Check OAuth2-Proxy Logs
```bash
ssh titania.incus
docker logs oauth2-proxy
```

### Test OIDC Discovery
```bash
curl http://titania.incus:22081/.well-known/openid-configuration
```

### Verify Cookie Domain
Ensure `oauth2_proxy_cookie_domain` matches your HAProxy domain.
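
A session cookie scoped to the parent domain is only sent back when the request host falls under that domain; a mismatch is the usual cause of redirect loops. A self-contained illustration (hostnames here are examples):

```bash
# If the request host is not under the cookie domain, the browser never
# returns the session cookie and OAuth2-Proxy loops back to the login page
cookie_domain="ouranos.helu.ca"
request_host="searxng.ouranos.helu.ca"

case "$request_host" in
  *"$cookie_domain") verdict="cookie sent" ;;
  *)                 verdict="domain mismatch: redirect loop likely" ;;
esac
echo "$verdict"
```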

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Redirect loop | Cookie domain mismatch | Check `oauth2_proxy_cookie_domain` |
| 403 Forbidden | Email domain not allowed | Update `oauth2_proxy_email_domains` |
| OIDC discovery failed | Casdoor not accessible | Check network/firewall |
| Invalid redirect URI | Mismatch in Casdoor app | Verify redirect URL in Casdoor |

## Related Documentation

- [SearXNG Authentication](services/searxng-auth.md) - Specific implementation details
- [Casdoor Documentation](casdoor.md) - Identity provider configuration
331
docs/openwebui.md
Normal file
@@ -0,0 +1,331 @@

# Open WebUI

Open WebUI is an extensible, self-hosted AI interface that provides a web-based chat experience for interacting with LLMs. This document covers deployment, Casdoor SSO integration, and configuration.

## Architecture

### Components

| Component | Location | Purpose |
|-----------|----------|---------|
| Open WebUI | Native on Oberon | AI chat interface |
| PostgreSQL | Portia | Database with pgvector extension |
| Casdoor | Titania | SSO identity provider |
| HAProxy | Ariel | TLS termination, routing |

### Network Diagram

```
┌────────────────────────────────────────────────────────────────────┐
│                          External Access                           │
│                  https://openwebui.ouranos.helu.ca                 │
└───────────────────────────────┬────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────────┐
│                        ariel.incus (HAProxy)                       │
│            TLS termination → proxy to oberon.incus:25588           │
└───────────────────────────────┬────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────────┐
│                            oberon.incus                            │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ Open WebUI (systemd)                                         │  │
│  │ - Python 3.12 virtual environment                            │  │
│  │ - Port 25588                                                 │  │
│  │ - OAuth/OIDC via Casdoor                                     │  │
│  └──────────────────────────────────────────────────────────────┘  │
│         │                              │                           │
│         │ PostgreSQL                   │ OIDC                      │
│         ▼                              ▼                           │
│  portia.incus:5432              titania.incus:22081                │
│  (openwebui database)           (Casdoor SSO)                      │
└────────────────────────────────────────────────────────────────────┘
```

### Network Ports

| Port | Service | Access |
|------|---------|--------|
| 25588 | Open WebUI HTTP | Via HAProxy |
| 5432 | PostgreSQL | Internal (Portia) |
| 22081 | Casdoor | Internal (Titania) |

## Casdoor SSO Integration

Open WebUI uses native OAuth/OIDC to authenticate against Casdoor. Local signup is disabled—all users must authenticate through Casdoor.

### How It Works

1. User visits `https://openwebui.ouranos.helu.ca`
2. Open WebUI redirects to Casdoor login page
3. User authenticates with Casdoor credentials
4. Casdoor redirects back with authorization code
5. Open WebUI exchanges code for tokens and creates/updates user session
6. User email from Casdoor becomes their Open WebUI identity

### Configuration

OAuth settings are defined in host variables and rendered into the environment file:

**Host Variables** (`inventory/host_vars/oberon.incus.yml`):
```yaml
# OAuth/OIDC Configuration (Casdoor SSO)
openwebui_oauth_client_id: "{{ vault_openwebui_oauth_client_id }}"
openwebui_oauth_client_secret: "{{ vault_openwebui_oauth_client_secret }}"
openwebui_oauth_provider_name: "Casdoor"
openwebui_oauth_provider_url: "https://id.ouranos.helu.ca/.well-known/openid-configuration"

# Disable local authentication
openwebui_enable_signup: false
openwebui_enable_email_login: false
```

**Environment Variables** (rendered from `openwebui.env.j2`):
```bash
ENABLE_SIGNUP=false
ENABLE_EMAIL_LOGIN=false
ENABLE_OAUTH_SIGNUP=true
OAUTH_CLIENT_ID=<client-id>
OAUTH_CLIENT_SECRET=<client-secret>
OAUTH_PROVIDER_NAME=Casdoor
OPENID_PROVIDER_URL=https://id.ouranos.helu.ca/.well-known/openid-configuration
```
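
The rendered `.env` holds the client secret, so the deployment writes it with owner-only permissions (see Security Considerations). A local sketch of that render-and-restrict step, using a temp file in place of `/srv/openwebui/.env`:

```bash
# Write a minimal env file and restrict it to the owner only
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
ENABLE_SIGNUP=false
ENABLE_EMAIL_LOGIN=false
ENABLE_OAUTH_SIGNUP=true
EOF
chmod 600 "$env_file"

# Verify the octal mode (GNU stat)
stat -c '%a' "$env_file"
```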

### Casdoor Application

The `app-openwebui` application is defined in `ansible/casdoor/init_data.json.j2`:

| Setting | Value |
|---------|-------|
| Name | `app-openwebui` |
| Display Name | Open WebUI |
| Redirect URI | `https://openwebui.ouranos.helu.ca/oauth/oidc/callback` |
| Grant Types | `authorization_code`, `refresh_token` |
| Token Format | JWT |
| Token Expiry | 168 hours (7 days) |

## Prerequisites

### 1. PostgreSQL Database

The `openwebui` database must exist on Portia with the `pgvector` extension:

```bash
ansible-playbook postgresql/deploy.yml
```

### 2. Casdoor SSO

Casdoor must be deployed and the `app-openwebui` application configured:

```bash
ansible-playbook casdoor/deploy.yml
```

### 3. Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

```yaml
# OpenWebUI
vault_openwebui_secret_key: "<random-secret>"
vault_openwebui_db_password: "<database-password>"
vault_openwebui_oauth_client_id: "<from-casdoor>"
vault_openwebui_oauth_client_secret: "<from-casdoor>"

# API Keys (optional)
vault_openwebui_openai_api_key: "<openai-key>"
vault_openwebui_anthropic_api_key: "<anthropic-key>"
vault_openwebui_groq_api_key: "<groq-key>"
vault_openwebui_mistral_api_key: "<mistral-key>"
```

Generate secrets:
```bash
# Secret key
openssl rand -hex 32

# Database password
openssl rand -base64 24
```
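
As a sanity check, the two commands above yield a 64-character hex string (32 bytes) and a base64 string that decodes to 24 bytes:

```bash
secret_key=$(openssl rand -hex 32)
db_password=$(openssl rand -base64 24)

echo "${#secret_key}"                           # hex encodes 2 chars per byte
printf '%s' "$db_password" | base64 -d | wc -c  # raw byte count
```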

## Deployment

### Fresh Installation

```bash
cd ansible

# 1. Ensure PostgreSQL is deployed
ansible-playbook postgresql/deploy.yml

# 2. Deploy Casdoor (if not already deployed)
ansible-playbook casdoor/deploy.yml

# 3. Get OAuth credentials from Casdoor admin UI
#    - Navigate to https://id.ouranos.helu.ca
#    - Go to Applications → app-openwebui
#    - Copy Client ID and Client Secret
#    - Update vault.yml with these values

# 4. Deploy Open WebUI
ansible-playbook openwebui/deploy.yml
```

### Verify Deployment

```bash
# Check service status
ssh oberon.incus "sudo systemctl status openwebui"

# View logs
ssh oberon.incus "sudo journalctl -u openwebui -f"

# Test health endpoint
curl -s http://oberon.incus:25588/health

# Test via HAProxy
curl -s https://openwebui.ouranos.helu.ca/health
```

### Redeployment

To redeploy Open WebUI (preserves database):

```bash
ansible-playbook openwebui/deploy.yml
```

## Configuration Reference

### Host Variables

Located in `ansible/inventory/host_vars/oberon.incus.yml`:

```yaml
# Service account
openwebui_user: openwebui
openwebui_group: openwebui
openwebui_directory: /srv/openwebui
openwebui_port: 25588
openwebui_host: puck.incus

# Database
openwebui_db_host: portia.incus
openwebui_db_port: 5432
openwebui_db_name: openwebui
openwebui_db_user: openwebui
openwebui_db_password: "{{ vault_openwebui_db_password }}"

# Authentication (SSO only)
openwebui_enable_signup: false
openwebui_enable_email_login: false

# OAuth/OIDC (Casdoor)
openwebui_oauth_client_id: "{{ vault_openwebui_oauth_client_id }}"
openwebui_oauth_client_secret: "{{ vault_openwebui_oauth_client_secret }}"
openwebui_oauth_provider_name: "Casdoor"
openwebui_oauth_provider_url: "https://id.ouranos.helu.ca/.well-known/openid-configuration"

# API Keys
openwebui_openai_api_key: "{{ vault_openwebui_openai_api_key }}"
openwebui_anthropic_api_key: "{{ vault_openwebui_anthropic_api_key }}"
openwebui_groq_api_key: "{{ vault_openwebui_groq_api_key }}"
openwebui_mistral_api_key: "{{ vault_openwebui_mistral_api_key }}"
```

### Data Persistence

Open WebUI data locations:
```
/srv/openwebui/
├── .venv/    # Python virtual environment
├── .env      # Environment configuration
└── data/     # User uploads, cache
```

Database (on Portia):
```
PostgreSQL: openwebui database with pgvector extension
```

## User Management

### First-Time Setup

After deployment, the first user to authenticate via Casdoor becomes an admin. Subsequent users get standard user roles.

### Promoting Users to Admin

1. Log in as an existing admin
2. Navigate to Admin Panel → Users
3. Select the user and change their role to Admin

### Existing Users Migration

If users were created before SSO was enabled:
- Users with matching email addresses will be linked automatically
- Users without matching emails must be recreated through Casdoor

## Troubleshooting

### Service Issues

```bash
# Check service status
ssh oberon.incus "sudo systemctl status openwebui"

# View logs
ssh oberon.incus "sudo journalctl -u openwebui -n 100"

# Restart service
ssh oberon.incus "sudo systemctl restart openwebui"
```

### OAuth/OIDC Issues

```bash
# Verify Casdoor is accessible
curl -s https://id.ouranos.helu.ca/.well-known/openid-configuration | jq

# Check redirect URI matches
# Must be: https://openwebui.ouranos.helu.ca/oauth/oidc/callback

# Verify client credentials in environment
ssh oberon.incus "sudo grep OAUTH /srv/openwebui/.env"
```

### Database Issues

```bash
# Test database connection
ssh oberon.incus "PGPASSWORD=<password> psql -h portia.incus -U openwebui -d openwebui -c '\dt'"

# Check pgvector extension
ssh portia.incus "sudo -u postgres psql -d openwebui -c '\dx'"
```

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| "Invalid redirect_uri" | Mismatch between Casdoor config and Open WebUI | Verify redirect URI in Casdoor matches exactly |
| "Invalid client credentials" | Wrong client ID/secret | Update vault with correct values from Casdoor |
| "OIDC discovery failed" | Casdoor unreachable | Check Casdoor is running on Titania |
| "Database connection failed" | PostgreSQL unreachable | Verify PostgreSQL on Portia, check network |

## Security Considerations

1. **SSO-only authentication** - Local signup disabled, all users authenticate through Casdoor
2. **API keys in vault** - All API keys stored encrypted in Ansible vault
3. **Database credentials** - Stored in vault, rendered to environment file with restrictive permissions (0600)
4. **Session security** - JWT tokens with 7-day expiry, managed by Casdoor

## Related Documentation

- [Casdoor SSO](services/casdoor.md) - Identity provider configuration
- [PostgreSQL](../ansible.md) - Database deployment
- [HAProxy](../terraform.md) - TLS termination and routing
808
docs/ouranos.html
Normal file
@@ -0,0 +1,808 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en" data-bs-theme="light">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Ouranos Lab - Red Panda Approved Infrastructure</title>
|
||||
<!-- Bootswatch Flatly -->
|
||||
<link href="https://cdn.jsdelivr.net/npm/bootswatch@5.3.2/dist/flatly/bootstrap.min.css" rel="stylesheet">
|
||||
<!-- Bootstrap Icons -->
|
||||
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.0/font/bootstrap-icons.css">
|
||||
<style>
|
||||
html { scroll-behavior: smooth; }
|
||||
#scrollTopBtn {
|
||||
position: fixed;
|
||||
bottom: 20px;
|
||||
right: 20px;
|
||||
z-index: 1000;
|
||||
display: none;
|
||||
border-radius: 50%;
|
||||
width: 50px;
|
||||
height: 50px;
|
||||
box-shadow: 0 2px 10px rgba(0,0,0,0.3);
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="container-fluid px-4">
|
||||
|
||||
<!-- Navbar -->
|
||||
<nav class="navbar navbar-expand-lg navbar-dark bg-primary rounded mb-4 mt-3">
|
||||
<div class="container-fluid">
|
||||
<a class="navbar-brand fw-bold" href="#">
|
||||
<i class="bi bi-diagram-3-fill"></i> Ouranos Lab
|
||||
</a>
|
||||
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav">
|
||||
<span class="navbar-toggler-icon"></span>
|
||||
</button>
|
||||
<div class="collapse navbar-collapse" id="navbarNav">
|
||||
<ul class="navbar-nav me-auto">
|
||||
<li class="nav-item"><a class="nav-link" href="#overview"><i class="bi bi-info-circle"></i> Overview</a></li>
|
||||
<li class="nav-item"><a class="nav-link" href="#hosts"><i class="bi bi-hdd-network"></i> Hosts</a></li>
|
||||
<li class="nav-item"><a class="nav-link" href="#routing"><i class="bi bi-signpost-split"></i> Routing</a></li>
|
||||
<li class="nav-item"><a class="nav-link" href="#infrastructure"><i class="bi bi-gear"></i> Infrastructure</a></li>
|
||||
<li class="nav-item"><a class="nav-link" href="#automation"><i class="bi bi-play-circle"></i> Automation</a></li>
|
||||
<li class="nav-item"><a class="nav-link" href="#dataflow"><i class="bi bi-diagram-2"></i> Data Flow</a></li>
|
||||
</ul>
|
||||
<button id="darkModeToggle" class="btn btn-outline-light btn-sm" title="Toggle dark mode">
|
||||
<i class="bi bi-moon-fill"></i>
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
</nav>
|
||||
|
||||
<!-- Hero -->
|
||||
<header class="bg-primary text-white py-5 rounded mb-4">
|
||||
<div class="container">
|
||||
<div class="row align-items-center">
|
||||
<div class="col-lg-8">
|
||||
<h1 class="display-4 fw-bold"><i class="bi bi-diagram-3-fill"></i> Ouranos Lab</h1>
|
||||
<p class="lead">Red Panda Approved™ Infrastructure as Code</p>
|
||||
<p class="mb-0">10 Incus containers named after moons of Uranus, provisioned with Terraform and configured with Ansible. Accessible at <a href="https://ouranos.helu.ca" class="text-white fw-bold">ouranos.helu.ca</a></p>
|
||||
</div>
|
||||
<div class="col-lg-4 text-center mt-3 mt-lg-0">
|
||||
<div class="badge bg-success fs-6 p-3">
|
||||
<i class="bi bi-check-circle-fill"></i> Red Panda Approved™
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<!-- Overview -->
|
||||
<section id="overview" class="mb-5">
|
||||
<h2 class="h2 mb-4"><i class="bi bi-info-circle text-primary me-2"></i>Project Overview</h2>
|
||||
|
||||
<div class="alert alert-info border-start border-4 border-info">
|
||||
<p class="mb-1">Ouranos is a comprehensive infrastructure-as-code project that provisions and manages a complete development sandbox environment. All infrastructure and configuration is tracked in Git for reproducible deployments.</p>
|
||||
<p class="mb-0"><i class="bi bi-exclamation-triangle-fill text-warning me-1"></i><strong>DNS Domain:</strong> Incus resolves containers via the <code>.incus</code> suffix (e.g., <code>oberon.incus</code>). IPv4 addresses are dynamically assigned — always use DNS names, never hardcode IPs.</p>
|
||||
</div>
|
||||
|
||||
<div class="row g-4">
|
||||
<div class="col-md-6">
|
||||
<div class="card h-100 border-primary">
|
||||
<div class="card-header bg-primary text-white">
|
||||
<h5 class="mb-0"><i class="bi bi-diagram-3 me-2"></i>Terraform</h5>
|
||||
</div>
|
||||
<div class="card-body">
|
||||
<p class="card-text">Provisions the Uranian host containers with:</p>
|
||||
<ul class="mb-0">
|
||||
<li>10 specialised Incus containers (LXC)</li>
|
||||
<li>DNS-resolved networking (<code>.incus</code> domain)</li>
|
||||
<li>Security policies and nested Docker support</li>
|
||||
<li>Port proxy devices and resource dependencies</li>
|
||||
<li>Incus S3 buckets for object storage (Casdoor, LobeChat)</li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="card-footer text-muted small"><i class="bi bi-check-circle me-1"></i>Idempotent, elegant, observable</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="col-md-6">
|
||||
<div class="card h-100 border-success">
|
||||
<div class="card-header bg-success text-white">
|
||||
<h5 class="mb-0"><i class="bi bi-gear-fill me-2"></i>Ansible</h5>
|
||||
</div>
|
||||
<div class="card-body">
|
||||
<p class="card-text">Deploys and configures all services:</p>
|
||||
<ul class="mb-0">
|
||||
<li>Docker engine on nested-capable hosts</li>
|
||||
<li>Databases: PostgreSQL (Portia), Neo4j (Ariel)</li>
|
||||
<li>Observability: Prometheus, Loki, Grafana (Prospero)</li>
|
||||
<li>Application runtimes and LLM proxies</li>
|
||||
<li>HAProxy TLS termination and Casdoor SSO (Titania)</li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="card-footer text-muted small"><i class="bi bi-check-circle me-1"></i>Idempotent, auditable, integrated</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- Hosts -->
<section id="hosts" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-hdd-network text-primary me-2"></i>Uranian Host Architecture</h2>

<div class="card mb-4">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-table me-2"></i>Hosts Summary</h5>
</div>
<div class="card-body p-0">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr>
<th><i class="bi bi-tag me-1"></i>Name</th>
<th><i class="bi bi-briefcase me-1"></i>Role</th>
<th><i class="bi bi-list-ul me-1"></i>Key Services</th>
<th class="text-center"><i class="bi bi-shield me-1"></i>Nesting</th>
</tr>
</thead>
<tbody>
<tr><td><strong>ariel</strong></td><td><span class="badge bg-warning text-dark">graph_database</span></td><td>Neo4j 5.26.0</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
<tr><td><strong>caliban</strong></td><td><span class="badge bg-secondary">agent_automation</span></td><td>Agent S MCP Server, Kernos, MATE Desktop, GPU</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
<tr><td><strong>miranda</strong></td><td><span class="badge bg-info">mcp_docker_host</span></td><td>MCPO, Grafana MCP, Gitea MCP, Neo4j MCP, Argos MCP</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
<tr><td><strong>oberon</strong></td><td><span class="badge bg-primary">container_orchestration</span></td><td>MCP Switchboard, RabbitMQ, Open WebUI, SearXNG, Home Assistant, smtp4dev</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
<tr><td><strong>portia</strong></td><td><span class="badge bg-success">database</span></td><td>PostgreSQL 16</td><td class="text-center"><i class="bi bi-x-circle-fill text-danger"></i></td></tr>
<tr><td><strong>prospero</strong></td><td><span class="badge bg-dark">observability</span></td><td>Prometheus, Loki, Grafana, PgAdmin, AlertManager</td><td class="text-center"><i class="bi bi-x-circle-fill text-danger"></i></td></tr>
<tr><td><strong>puck</strong></td><td><span class="badge bg-danger">application_runtime</span></td><td>JupyterLab, Gitea Runner, Django apps (6×)</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
<tr><td><strong>rosalind</strong></td><td><span class="badge bg-success">collaboration</span></td><td>Gitea, LobeChat, Nextcloud, AnythingLLM</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
<tr><td><strong>sycorax</strong></td><td><span class="badge bg-secondary">language_models</span></td><td>Arke LLM Proxy</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
<tr><td><strong>titania</strong></td><td><span class="badge bg-primary">proxy_sso</span></td><td>HAProxy, Casdoor SSO, certbot</td><td class="text-center"><i class="bi bi-check-circle-fill text-success"></i></td></tr>
</tbody>
</table>
</div>
</div>
</div>

<!-- Host Detail Cards -->
<div class="row g-4">
<div class="col-lg-6">
<div class="card h-100 border-primary">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-box me-2"></i>oberon — Container Orchestration</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">King of the Fairies orchestrating containers and managing MCP infrastructure.</p>
<ul class="mb-0">
<li>Docker engine</li>
<li><strong>MCP Switchboard</strong> (port 22785) — Django app routing MCP tool calls</li>
<li><strong>RabbitMQ</strong> message queue</li>
<li><strong>Open WebUI</strong> LLM interface (port 22088, PostgreSQL backend on Portia)</li>
<li><strong>SearXNG</strong> privacy search (port 22073, behind OAuth2-Proxy)</li>
<li><strong>Home Assistant</strong> (port 8123)</li>
<li><strong>smtp4dev</strong> SMTP test server (port 22025)</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100 border-success">
<div class="card-header bg-success text-white">
<h5 class="mb-0"><i class="bi bi-database me-2"></i>portia — Relational Database</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Intelligent and resourceful — the reliability of relational databases.</p>
<ul class="mb-0">
<li>PostgreSQL 16 (port 5432)</li>
<li>Databases: <code>arke</code>, <code>anythingllm</code>, <code>gitea</code>, <code>hass</code>, <code>lobechat</code>, <code>mcp_switchboard</code>, <code>nextcloud</code>, <code>openwebui</code>, <code>spelunker</code></li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100 border-warning">
<div class="card-header bg-warning text-dark">
<h5 class="mb-0"><i class="bi bi-diagram-2 me-2"></i>ariel — Graph Database</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Air spirit — ethereal, interconnected nature mirroring graph relationships.</p>
<ul class="mb-0">
<li>Neo4j 5.26.0 (Docker)</li>
<li>HTTP API: port 25554</li>
<li>Bolt: port 7687</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100 border-danger">
<div class="card-header bg-danger text-white">
<h5 class="mb-0"><i class="bi bi-code-slash me-2"></i>puck — Application Runtime</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Shape-shifting trickster embodying Python's versatility.</p>
<ul class="mb-0">
<li>Docker engine</li>
<li><strong>JupyterLab</strong> (port 22071 via OAuth2-Proxy)</li>
<li><strong>Gitea Runner</strong> CI/CD agent</li>
<li>Django apps: <strong>Angelia</strong> (22281), <strong>Athena</strong> (22481), <strong>Kairos</strong> (22581), <strong>Icarlos</strong> (22681), <strong>Spelunker</strong> (22881), <strong>Peitho</strong> (22981)</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100 border-dark">
<div class="card-header bg-dark text-white">
<h5 class="mb-0"><i class="bi bi-graph-up me-2"></i>prospero — Observability Stack</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Master magician observing all events.</p>
<ul class="mb-0">
<li>PPLG stack via Docker Compose: Prometheus, Loki, Grafana, PgAdmin</li>
<li>Internal HAProxy with OAuth2-Proxy for all dashboards</li>
<li>AlertManager with Pushover notifications</li>
<li>Prometheus node-exporter metrics from all hosts</li>
<li>Loki log aggregation via Alloy (all hosts)</li>
<li>Grafana with Casdoor SSO integration</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100 border-info">
<div class="card-header bg-info text-white">
<h5 class="mb-0"><i class="bi bi-chat-dots me-2"></i>miranda — MCP Docker Host</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Curious bridge between worlds — hosting MCP server containers.</p>
<ul class="mb-0">
<li>Docker engine (API on port 2375 for MCP Switchboard)</li>
<li><strong>MCPO</strong> OpenAI-compatible MCP proxy</li>
<li><strong>Grafana MCP Server</strong> — Grafana API integration (port 25533)</li>
<li><strong>Gitea MCP Server</strong> (port 25535)</li>
<li><strong>Neo4j MCP Server</strong></li>
<li><strong>Argos MCP Server</strong> — web search via SearXNG (port 25534)</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100">
<div class="card-header bg-secondary text-white">
<h5 class="mb-0"><i class="bi bi-magic me-2"></i>sycorax — Language Models</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Original magical power wielding language magic.</p>
<ul class="mb-0">
<li><strong>Arke</strong> LLM API Proxy (port 25540)</li>
<li>Multi-provider support (OpenAI, Anthropic, etc.)</li>
<li>Session management with Memcached</li>
<li>Database backend on Portia</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100">
<div class="card-header bg-secondary text-white">
<h5 class="mb-0"><i class="bi bi-robot me-2"></i>caliban — Agent Automation</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Autonomous computer agent learning through environmental interaction.</p>
<ul class="mb-0">
<li>Docker engine</li>
<li><strong>Agent S MCP Server</strong> (MATE desktop, AT-SPI automation)</li>
<li><strong>Kernos</strong> MCP Shell Server (port 22021)</li>
<li>GPU passthrough for vision tasks</li>
<li>RDP access (port 25521)</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100 border-success">
<div class="card-header bg-success text-white">
<h5 class="mb-0"><i class="bi bi-people me-2"></i>rosalind — Collaboration Services</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Witty and resourceful — hosting the PHP, Go, and Node.js runtimes.</p>
<ul class="mb-0">
<li><strong>Gitea</strong> self-hosted Git (port 22082, SSH on 22022)</li>
<li><strong>LobeChat</strong> AI chat interface (port 22081)</li>
<li><strong>Nextcloud</strong> file sharing and collaboration (port 22083)</li>
<li><strong>AnythingLLM</strong> document AI workspace (port 22084)</li>
<li>Nextcloud data on dedicated Incus storage volume</li>
</ul>
</div>
</div>
</div>

<div class="col-lg-6">
<div class="card h-100 border-primary">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-shield-check me-2"></i>titania — Proxy & SSO Services</h5>
</div>
<div class="card-body">
<p class="text-muted fst-italic small">Queen of the Fairies managing access control and authentication.</p>
<ul class="mb-0">
<li><strong>HAProxy 3.x</strong> with TLS termination (port 443)</li>
<li>Let's Encrypt wildcard certificate via certbot DNS-01 (Namecheap)</li>
<li>HTTP to HTTPS redirect (port 80)</li>
<li>Gitea SSH proxy (port 22022)</li>
<li><strong>Casdoor SSO</strong> (port 22081, local PostgreSQL)</li>
<li>Prometheus metrics at <code>:8404/metrics</code></li>
</ul>
</div>
</div>
</div>
</div>
</section>

<!-- Routing -->
<section id="routing" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-signpost-split text-primary me-2"></i>External Access via HAProxy</h2>

<div class="alert alert-primary border-start border-4 border-primary">
<p class="mb-0">Titania provides TLS termination and reverse proxy for all services. <strong>Base domain:</strong> <a href="https://ouranos.helu.ca" class="alert-link">ouranos.helu.ca</a> — HTTPS port 443, HTTP port 80 (redirects to HTTPS). Certificate: Let's Encrypt wildcard via certbot DNS-01 (Namecheap).</p>
</div>

<div class="card">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-table me-2"></i>Route Table</h5>
</div>
<div class="card-body p-0">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr>
<th><i class="bi bi-link-45deg me-1"></i>Subdomain</th>
<th><i class="bi bi-hdd-network me-1"></i>Backend</th>
<th><i class="bi bi-app me-1"></i>Service</th>
</tr>
</thead>
<tbody>
<tr><td><code>ouranos.helu.ca</code> <span class="badge bg-secondary">root</span></td><td><code>puck.incus:22281</code></td><td>Angelia (Django)</td></tr>
<tr><td><code>alertmanager.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>AlertManager</td></tr>
<tr><td><code>angelia.ouranos.helu.ca</code></td><td><code>puck.incus:22281</code></td><td>Angelia (Django)</td></tr>
<tr><td><code>anythingllm.ouranos.helu.ca</code></td><td><code>rosalind.incus:22084</code></td><td>AnythingLLM</td></tr>
<tr><td><code>arke.ouranos.helu.ca</code></td><td><code>sycorax.incus:25540</code></td><td>Arke LLM Proxy</td></tr>
<tr><td><code>athena.ouranos.helu.ca</code></td><td><code>puck.incus:22481</code></td><td>Athena (Django)</td></tr>
<tr><td><code>gitea.ouranos.helu.ca</code></td><td><code>rosalind.incus:22082</code></td><td>Gitea</td></tr>
<tr><td><code>grafana.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>Grafana</td></tr>
<tr><td><code>hass.ouranos.helu.ca</code></td><td><code>oberon.incus:8123</code></td><td>Home Assistant</td></tr>
<tr><td><code>icarlos.ouranos.helu.ca</code></td><td><code>puck.incus:22681</code></td><td>Icarlos (Django)</td></tr>
<tr><td><code>id.ouranos.helu.ca</code></td><td><code>titania.incus:22081</code></td><td>Casdoor SSO</td></tr>
<tr><td><code>jupyterlab.ouranos.helu.ca</code></td><td><code>puck.incus:22071</code></td><td>JupyterLab <span class="badge bg-secondary">OAuth2-Proxy</span></td></tr>
<tr><td><code>kairos.ouranos.helu.ca</code></td><td><code>puck.incus:22581</code></td><td>Kairos (Django)</td></tr>
<tr><td><code>lobechat.ouranos.helu.ca</code></td><td><code>rosalind.incus:22081</code></td><td>LobeChat</td></tr>
<tr><td><code>loki.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>Loki</td></tr>
<tr><td><code>mcp-switchboard.ouranos.helu.ca</code></td><td><code>oberon.incus:22785</code></td><td>MCP Switchboard</td></tr>
<tr><td><code>nextcloud.ouranos.helu.ca</code></td><td><code>rosalind.incus:22083</code></td><td>Nextcloud</td></tr>
<tr><td><code>openwebui.ouranos.helu.ca</code></td><td><code>oberon.incus:22088</code></td><td>Open WebUI</td></tr>
<tr><td><code>peitho.ouranos.helu.ca</code></td><td><code>puck.incus:22981</code></td><td>Peitho (Django)</td></tr>
<tr><td><code>pgadmin.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>PgAdmin 4</td></tr>
<tr><td><code>prometheus.ouranos.helu.ca</code></td><td><code>prospero.incus:443</code> <span class="badge bg-info text-dark">SSL</span></td><td>Prometheus</td></tr>
<tr><td><code>searxng.ouranos.helu.ca</code></td><td><code>oberon.incus:22073</code></td><td>SearXNG <span class="badge bg-secondary">OAuth2-Proxy</span></td></tr>
<tr><td><code>smtp4dev.ouranos.helu.ca</code></td><td><code>oberon.incus:22085</code></td><td>smtp4dev</td></tr>
<tr><td><code>spelunker.ouranos.helu.ca</code></td><td><code>puck.incus:22881</code></td><td>Spelunker (Django)</td></tr>
</tbody>
</table>
</div>
</div>
</div>
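A minimal sketch of how one of these routes could be expressed in HAProxy configuration. The host names and backend targets follow the route table above, but the ACL names, certificate path, and file layout are assumptions rather than the lab's actual config:

```
# Hypothetical excerpt from haproxy.cfg on titania -- illustrative only
frontend https_in
    bind :443 ssl crt /etc/haproxy/certs/ouranos.helu.ca.pem
    acl host_gitea hdr(host) -i gitea.ouranos.helu.ca
    use_backend be_gitea if host_gitea
    default_backend be_angelia

backend be_gitea
    server rosalind rosalind.incus:22082 check

backend be_angelia
    server puck puck.incus:22281 check
```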
</section>

<!-- Infrastructure Management -->
<section id="infrastructure" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-gear text-primary me-2"></i>Infrastructure Management</h2>

<div class="row g-4 mb-4">
<div class="col-md-6">
<div class="card h-100 border-primary">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-play-circle me-2"></i>Quick Start</h5>
</div>
<div class="card-body">
<pre class="mb-0"><code># Provision containers
cd terraform
terraform init
terraform plan
terraform apply

# Start all containers
cd ../ansible
source ~/env/agathos/bin/activate
ansible-playbook sandbox_up.yml

# Deploy all services
ansible-playbook site.yml

# Stop all containers
ansible-playbook sandbox_down.yml</code></pre>
</div>
</div>
</div>
<div class="col-md-6">
<div class="card h-100 border-warning">
<div class="card-header bg-warning text-dark">
<h5 class="mb-0"><i class="bi bi-shield-lock me-2"></i>Vault Management</h5>
</div>
<div class="card-body">
<pre class="mb-0"><code># Edit secrets
ansible-vault edit \
inventory/group_vars/all/vault.yml

# View secrets
ansible-vault view \
inventory/group_vars/all/vault.yml

# Encrypt a new file
ansible-vault encrypt new_secrets.yml</code></pre>
</div>
</div>
</div>
</div>

<div class="row g-4">
<div class="col-md-6">
<div class="alert alert-primary border-start border-4 border-primary h-100 mb-0">
<h5><i class="bi bi-lightning-fill me-2"></i>Terraform Workflow</h5>
<ol class="mb-0">
<li><strong>Define</strong> — Containers, networks, and resources in <code>*.tf</code> files</li>
<li><strong>Plan</strong> — Review changes with <code>terraform plan</code></li>
<li><strong>Apply</strong> — Provision with <code>terraform apply</code></li>
<li><strong>Verify</strong> — Check outputs and container status</li>
</ol>
</div>
</div>
<div class="col-md-6">
<div class="alert alert-success border-start border-4 border-success h-100 mb-0">
<h5><i class="bi bi-check-circle-fill me-2"></i>Ansible Workflow</h5>
<ol class="mb-0">
<li><strong>Bootstrap</strong> — Update packages, install essentials (<code>apt_update.yml</code>)</li>
<li><strong>Agents</strong> — Deploy Alloy and Node Exporter on all hosts</li>
<li><strong>Services</strong> — Configure databases, Docker, applications, observability</li>
<li><strong>Verify</strong> — Check service health and connectivity</li>
</ol>
</div>
</div>
</div>

<div class="alert alert-info border-start border-4 border-info mt-4">
<h5><i class="bi bi-bucket me-2"></i>S3 Storage Provisioning</h5>
<p>Terraform provisions Incus S3 buckets for services requiring object storage:</p>
<div class="table-responsive">
<table class="table table-sm table-bordered mb-1">
<thead class="table-light">
<tr><th>Service</th><th>Host</th><th>Purpose</th></tr>
</thead>
<tbody>
<tr><td><strong>Casdoor</strong></td><td>Titania</td><td>User avatars and SSO resource storage</td></tr>
<tr><td><strong>LobeChat</strong></td><td>Rosalind</td><td>File uploads and attachments</td></tr>
</tbody>
</table>
</div>
<p class="mb-0 small"><i class="bi bi-shield-lock me-1"></i>S3 credentials are stored as sensitive Terraform outputs and in Ansible Vault with the <code>vault_*_s3_*</code> prefix.</p>
</div>
</section>

<!-- Automation -->
<section id="automation" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-play-circle text-primary me-2"></i>Ansible Automation</h2>

<div class="accordion" id="playbookAccordion">

<!-- site.yml -->
<div class="accordion-item">
<h2 class="accordion-header">
<button class="accordion-button" type="button" data-bs-toggle="collapse" data-bs-target="#colSiteYml">
<i class="bi bi-list-check me-2"></i>Full Deployment — <code>site.yml</code> (in order)
</button>
</h2>
<div id="colSiteYml" class="accordion-collapse collapse show" data-bs-parent="#playbookAccordion">
<div class="accordion-body">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr><th>Playbook</th><th>Host(s)</th><th>Purpose</th></tr>
</thead>
<tbody>
<tr><td><code>apt_update.yml</code></td><td>All</td><td>Update packages and install essentials</td></tr>
<tr><td><code>alloy/deploy.yml</code></td><td>All</td><td>Grafana Alloy log/metrics collection</td></tr>
<tr><td><code>prometheus/node_deploy.yml</code></td><td>All</td><td>Node Exporter metrics</td></tr>
<tr><td><code>docker/deploy.yml</code></td><td>Oberon, Ariel, Miranda, Puck, Rosalind, Sycorax, Caliban, Titania</td><td>Docker engine</td></tr>
<tr><td><code>smtp4dev/deploy.yml</code></td><td>Oberon</td><td>SMTP test server</td></tr>
<tr><td><code>pplg/deploy.yml</code></td><td>Prospero</td><td>Full observability stack + internal HAProxy + OAuth2-Proxy</td></tr>
<tr><td><code>postgresql/deploy.yml</code></td><td>Portia</td><td>PostgreSQL with all databases</td></tr>
<tr><td><code>postgresql_ssl/deploy.yml</code></td><td>Titania</td><td>Dedicated PostgreSQL for Casdoor</td></tr>
<tr><td><code>neo4j/deploy.yml</code></td><td>Ariel</td><td>Neo4j graph database</td></tr>
<tr><td><code>searxng/deploy.yml</code></td><td>Oberon</td><td>SearXNG privacy search</td></tr>
<tr><td><code>haproxy/deploy.yml</code></td><td>Titania</td><td>HAProxy TLS termination and routing</td></tr>
<tr><td><code>casdoor/deploy.yml</code></td><td>Titania</td><td>Casdoor SSO</td></tr>
<tr><td><code>mcpo/deploy.yml</code></td><td>Miranda</td><td>MCPO MCP proxy</td></tr>
<tr><td><code>openwebui/deploy.yml</code></td><td>Oberon</td><td>Open WebUI LLM interface</td></tr>
<tr><td><code>hass/deploy.yml</code></td><td>Oberon</td><td>Home Assistant</td></tr>
<tr><td><code>gitea/deploy.yml</code></td><td>Rosalind</td><td>Gitea self-hosted Git</td></tr>
<tr><td><code>nextcloud/deploy.yml</code></td><td>Rosalind</td><td>Nextcloud collaboration</td></tr>
</tbody>
</table>
</div>
</div>
</div>
</div>

<!-- Individual services -->
<div class="accordion-item">
<h2 class="accordion-header">
<button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#colIndividual">
<i class="bi bi-puzzle me-2"></i>Individual Service Deployments
</button>
</h2>
<div id="colIndividual" class="accordion-collapse collapse" data-bs-parent="#playbookAccordion">
<div class="accordion-body">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr><th>Playbook</th><th>Host</th><th>Service</th></tr>
</thead>
<tbody>
<tr><td><code>anythingllm/deploy.yml</code></td><td>Rosalind</td><td>AnythingLLM document AI</td></tr>
<tr><td><code>arke/deploy.yml</code></td><td>Sycorax</td><td>Arke LLM proxy</td></tr>
<tr><td><code>argos/deploy.yml</code></td><td>Miranda</td><td>Argos MCP web search server</td></tr>
<tr><td><code>caliban/deploy.yml</code></td><td>Caliban</td><td>Agent S MCP Server</td></tr>
<tr><td><code>certbot/deploy.yml</code></td><td>Titania</td><td>Let's Encrypt certificate renewal</td></tr>
<tr><td><code>gitea_mcp/deploy.yml</code></td><td>Miranda</td><td>Gitea MCP Server</td></tr>
<tr><td><code>gitea_runner/deploy.yml</code></td><td>Puck</td><td>Gitea CI/CD runner</td></tr>
<tr><td><code>grafana_mcp/deploy.yml</code></td><td>Miranda</td><td>Grafana MCP Server</td></tr>
<tr><td><code>jupyterlab/deploy.yml</code></td><td>Puck</td><td>JupyterLab + OAuth2-Proxy</td></tr>
<tr><td><code>kernos/deploy.yml</code></td><td>Caliban</td><td>Kernos MCP shell server</td></tr>
<tr><td><code>lobechat/deploy.yml</code></td><td>Rosalind</td><td>LobeChat AI chat</td></tr>
<tr><td><code>neo4j_mcp/deploy.yml</code></td><td>Miranda</td><td>Neo4j MCP Server</td></tr>
<tr><td><code>rabbitmq/deploy.yml</code></td><td>Oberon</td><td>RabbitMQ message queue</td></tr>
</tbody>
</table>
</div>
</div>
</div>
</div>

<!-- Lifecycle -->
<div class="accordion-item">
<h2 class="accordion-header">
<button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#colLifecycle">
<i class="bi bi-arrow-repeat me-2"></i>Lifecycle Playbooks
</button>
</h2>
<div id="colLifecycle" class="accordion-collapse collapse" data-bs-parent="#playbookAccordion">
<div class="accordion-body">
<div class="row g-3">
<div class="col-md-3">
<div class="card border-success text-center h-100">
<div class="card-body">
<i class="bi bi-play-fill text-success" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>sandbox_up.yml</code></h6>
<p class="small mb-0">Start all Uranian host containers</p>
</div>
</div>
</div>
<div class="col-md-3">
<div class="card border-primary text-center h-100">
<div class="card-body">
<i class="bi bi-list-check text-primary" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>site.yml</code></h6>
<p class="small mb-0">Full deployment orchestration</p>
</div>
</div>
</div>
<div class="col-md-3">
<div class="card border-warning text-center h-100">
<div class="card-body">
<i class="bi bi-arrow-up-circle text-warning" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>apt_update.yml</code></h6>
<p class="small mb-0">Update packages on all hosts</p>
</div>
</div>
</div>
<div class="col-md-3">
<div class="card border-danger text-center h-100">
<div class="card-body">
<i class="bi bi-stop-fill text-danger" style="font-size:2rem;"></i>
<h6 class="mt-2"><code>sandbox_down.yml</code></h6>
<p class="small mb-0">Gracefully stop all containers</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</section>

<!-- Data Flow -->
<section id="dataflow" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-diagram-2 text-primary me-2"></i>Data Flow Architecture</h2>

<div class="card mb-4">
<div class="card-header bg-dark text-white">
<h5 class="mb-0"><i class="bi bi-diagram-3 me-2"></i>Observability Pipeline</h5>
</div>
<div class="card-body">
<div class="mermaid">
flowchart LR
subgraph hosts["All Hosts"]
alloy["Alloy\n(syslog + journal)"]
node_exp["Node Exporter\n(metrics)"]
end
subgraph prospero["Prospero"]
loki["Loki\n(logs)"]
prom["Prometheus\n(metrics)"]
grafana["Grafana\n(dashboards)"]
alert["AlertManager"]
end
pushover["Pushover\n(notifications)"]
alloy -->|"HTTP push"| loki
node_exp -->|"scrape 15s"| prom
loki --> grafana
prom --> grafana
grafana --> alert
alert -->|"webhook"| pushover
</div>
</div>
</div>

<div class="card">
<div class="card-header bg-primary text-white">
<h5 class="mb-0"><i class="bi bi-link-45deg me-2"></i>Service Integration Points</h5>
</div>
<div class="card-body p-0">
<div class="table-responsive">
<table class="table table-hover table-bordered mb-0 align-middle">
<thead class="table-light">
<tr><th>Consumer</th><th>Provider</th><th>Connection</th></tr>
</thead>
<tbody>
<tr><td>All LLM apps</td><td>Arke (Sycorax)</td><td><code>http://sycorax.incus:25540</code></td></tr>
<tr><td>Open WebUI, Arke, Gitea, Nextcloud, LobeChat</td><td>PostgreSQL (Portia)</td><td><code>portia.incus:5432</code></td></tr>
<tr><td>Neo4j MCP</td><td>Neo4j (Ariel)</td><td><code>ariel.incus:7687</code> (Bolt)</td></tr>
<tr><td>MCP Switchboard</td><td>Docker API (Miranda)</td><td><code>tcp://miranda.incus:2375</code></td></tr>
<tr><td>MCP Switchboard, Kairos, Spelunker</td><td>RabbitMQ (Oberon)</td><td><code>oberon.incus:5672</code></td></tr>
<tr><td>All apps (SMTP)</td><td>smtp4dev (Oberon)</td><td><code>oberon.incus:22025</code></td></tr>
<tr><td>All hosts (logs)</td><td>Loki (Prospero)</td><td><code>http://prospero.incus:3100</code></td></tr>
<tr><td>All hosts (metrics)</td><td>Prometheus (Prospero)</td><td><code>http://prospero.incus:9090</code></td></tr>
</tbody>
</table>
</div>
</div>
</div>
</section>

<!-- Important Notes -->
<section id="notes" class="mb-5">
<h2 class="h2 mb-4"><i class="bi bi-exclamation-triangle text-warning me-2"></i>Important Notes</h2>

<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Alloy Host Variables Required</h5>
<p class="mb-0">Every host with <code>alloy</code> in its <code>services</code> list must define <code>alloy_log_level</code> in <code>inventory/host_vars/&lt;host&gt;.incus.yml</code>. The playbook will fail with an undefined variable error if this is missing.</p>
</div>
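A minimal host_vars sketch satisfying this requirement. The host name and chosen level are illustrative assumptions, not values from the repository:

```yaml
# inventory/host_vars/oberon.incus.yml -- illustrative sketch
alloy_log_level: info   # required whenever "alloy" appears in the host's services list
```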
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Alloy Syslog Listeners Required for Docker Services</h5>
<p class="mb-0">Any Docker Compose service using the <code>syslog</code> logging driver must have a corresponding <code>loki.source.syslog</code> listener in the host's Alloy config template (<code>ansible/alloy/&lt;hostname&gt;/config.alloy.j2</code>). Missing listeners cause Docker containers to fail on start because the syslog driver cannot connect to its configured port.</p>
</div>
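A sketch of what such a listener looks like in an Alloy config template. The port, labels, and component names are assumptions for illustration; only the <code>loki.source.syslog</code> block shape follows Alloy's documented syntax:

```
// Hypothetical excerpt from ansible/alloy/<hostname>/config.alloy.j2
loki.source.syslog "docker_services" {
  listener {
    address  = "0.0.0.0:51893"           // must match the syslog-address in the
    protocol = "tcp"                      // service's Docker logging options
    labels   = { job = "docker-syslog" }
  }
  forward_to = [loki.write.default.receiver]
}
```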
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Local Terraform State</h5>
<p class="mb-0">This project uses local Terraform state (no remote backend). Do not run <code>terraform apply</code> from multiple machines simultaneously.</p>
</div>
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Nested Docker</h5>
<p class="mb-0">Docker runs inside Incus containers (nested), requiring <code>security.nesting = true</code> and <code>lxc.apparmor.profile=unconfined</code> AppArmor override on all Docker-enabled hosts.</p>
</div>
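In Terraform terms, these settings can be sketched roughly as follows. The resource name, image, and exact config keys are assumptions based on the Incus provider's conventions, not the project's actual <code>*.tf</code> files:

```
# Hypothetical Terraform sketch for a Docker-capable host
resource "incus_instance" "oberon" {
  name  = "oberon"
  image = "ubuntu/24.04"

  config = {
    "security.nesting" = "true"
    "raw.lxc"          = "lxc.apparmor.profile=unconfined"
  }
}
```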
<div class="alert alert-warning border-start border-4 border-warning">
<h5><i class="bi bi-exclamation-triangle-fill me-2"></i>Deployment Order</h5>
<p class="mb-0">Prospero (observability) must be fully deployed before other hosts, as Alloy on every host pushes logs and metrics to <code>prospero.incus</code>. Run <code>pplg/deploy.yml</code> before <code>site.yml</code> on a fresh environment.</p>
</div>
</section>

<!-- Footer -->
|
||||
<footer class="bg-dark text-white py-4 rounded mt-2 mb-4">
|
||||
<div class="container text-center">
|
||||
<p class="mb-1"><i class="bi bi-heart-fill text-danger"></i> Built with love and approved by red pandas</p>
|
||||
<small class="text-muted">Ouranos Lab — <a href="https://ouranos.helu.ca" class="text-muted">ouranos.helu.ca</a> — Infrastructure as Code for Development Excellence</small>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
<!-- Scroll to top button -->
|
||||
<button id="scrollTopBtn" class="btn btn-primary" title="Scroll to top">
|
||||
<i class="bi bi-arrow-up-circle"></i>
|
||||
</button>
|
||||
|
||||
</div><!-- /container-fluid -->
|
||||
|
||||
<!-- Bootstrap JS -->
|
||||
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.2/dist/js/bootstrap.bundle.min.js"></script>
|
||||
|
||||
<!-- Mermaid JS -->
|
||||
<script type="module">
|
||||
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
|
||||
const isDark = () => document.documentElement.getAttribute('data-bs-theme') === 'dark';
|
||||
mermaid.initialize({ startOnLoad: true, theme: isDark() ? 'dark' : 'default' });
|
||||
document.getElementById('darkModeToggle').addEventListener('click', () => {
|
||||
setTimeout(() => mermaid.initialize({ startOnLoad: false, theme: isDark() ? 'dark' : 'default' }), 50);
|
||||
});
|
||||
</script>
|
||||
|
||||
<script>
|
||||
// Dark mode toggle
|
||||
const toggleBtn = document.getElementById('darkModeToggle');
|
||||
function applyTheme(dark) {
|
||||
document.documentElement.setAttribute('data-bs-theme', dark ? 'dark' : 'light');
|
||||
toggleBtn.innerHTML = dark ? '<i class="bi bi-sun-fill"></i>' : '<i class="bi bi-moon-fill"></i>';
|
||||
toggleBtn.title = dark ? 'Switch to light mode' : 'Switch to dark mode';
|
||||
}
|
||||
toggleBtn.addEventListener('click', () => {
|
||||
applyTheme(document.documentElement.getAttribute('data-bs-theme') !== 'dark');
|
||||
});
|
||||
|
||||
// Scroll to top
|
||||
window.addEventListener('scroll', () => {
|
||||
document.getElementById('scrollTopBtn').style.display =
|
||||
(document.body.scrollTop > 300 || document.documentElement.scrollTop > 300) ? 'block' : 'none';
|
||||
});
|
||||
document.getElementById('scrollTopBtn').addEventListener('click', () => {
|
||||
window.scrollTo({ top: 0, behavior: 'smooth' });
|
||||
});
|
||||
</script>
|
||||
</body>
|
||||
</html>
333
docs/ouranos.md
Normal file
@@ -0,0 +1,333 @@
# Ouranos Lab

Infrastructure-as-Code project managing the **Ouranos Lab** — a development sandbox at [ouranos.helu.ca](https://ouranos.helu.ca). Uses **Terraform** for container provisioning and **Ansible** for configuration management, themed around the moons of Uranus.

---

## Project Overview

| Component | Purpose |
|-----------|---------|
| **Terraform** | Provisions 10 specialised Incus containers (LXC) with DNS-resolved networking, security policies, and resource dependencies |
| **Ansible** | Deploys Docker, databases (PostgreSQL, Neo4j), observability stack (Prometheus, Grafana, Loki), and application runtimes across all hosts |

> **DNS Domain**: Incus resolves containers via the `.incus` domain suffix (e.g., `oberon.incus`, `portia.incus`). IPv4 addresses are dynamically assigned — always use DNS names, never hardcode IPs.

---

## Uranian Host Architecture

All containers are named after moons of Uranus and resolved via the `.incus` DNS suffix.

| Name | Role | Description | Nesting |
|------|------|-------------|---------|
| **ariel** | graph_database | Neo4j — Ethereal graph connections | ✔ |
| **caliban** | agent_automation | Agent S MCP Server with MATE Desktop | ✔ |
| **miranda** | mcp_docker_host | Dedicated Docker Host for MCP Servers | ✔ |
| **oberon** | container_orchestration | Docker Host — MCP Switchboard, RabbitMQ, Open WebUI | ✔ |
| **portia** | database | PostgreSQL — Relational database host | ❌ |
| **prospero** | observability | PPLG stack — Prometheus, Grafana, Loki, PgAdmin | ❌ |
| **puck** | application_runtime | Python App Host — JupyterLab, Django apps, Gitea Runner | ✔ |
| **rosalind** | collaboration | Gitea, LobeChat, Nextcloud, AnythingLLM | ✔ |
| **sycorax** | language_models | Arke LLM Proxy | ✔ |
| **titania** | proxy_sso | HAProxy TLS termination + Casdoor SSO | ✔ |

### oberon — Container Orchestration

King of the Fairies orchestrating containers and managing MCP infrastructure.

- Docker engine
- MCP Switchboard (port 22785) — Django app routing MCP tool calls
- RabbitMQ message queue
- Open WebUI LLM interface (port 22088, PostgreSQL backend on Portia)
- SearXNG privacy search (port 22083, behind OAuth2-Proxy)
- smtp4dev SMTP test server (port 22025)

### portia — Relational Database

Intelligent and resourceful — the reliability of relational databases.

- PostgreSQL 17 (port 5432)
- Databases: `arke`, `anythingllm`, `gitea`, `hass`, `lobechat`, `mcp_switchboard`, `nextcloud`, `openwebui`, `spelunker`

### ariel — Graph Database

Air spirit — ethereal, interconnected nature mirroring graph relationships.

- Neo4j 5.26.0 (Docker)
- HTTP API: port 25584
- Bolt: port 25554

### puck — Application Runtime

Shape-shifting trickster embodying Python's versatility.

- Docker engine
- JupyterLab (port 22071 via OAuth2-Proxy)
- Gitea Runner (CI/CD agent)
- Home Assistant (port 8123)
- Django applications: Angelia (22281), Athena (22481), Kairos (22581), Icarlos (22681), Spelunker (22881), Peitho (22981)

### prospero — Observability Stack

Master magician observing all events.

- PPLG stack via Docker Compose: Prometheus, Loki, Grafana, PgAdmin
- Internal HAProxy with OAuth2-Proxy for all dashboards
- AlertManager with Pushover notifications
- Prometheus metrics collection (`node-exporter`, HAProxy, Loki)
- Loki log aggregation via Alloy (all hosts)
- Grafana dashboard suite with Casdoor SSO integration

### miranda — MCP Docker Host

Curious bridge between worlds — hosting MCP server containers.

- Docker engine (API exposed on port 2375 for MCP Switchboard)
- MCPO OpenAI-compatible MCP proxy
- Grafana MCP Server (port 25533)
- Gitea MCP Server (port 25535)
- Neo4j MCP Server
- Argos MCP Server — web search via SearXNG (port 25534)

### sycorax — Language Models

Original magical power wielding language magic.

- Arke LLM API Proxy (port 25540)
- Multi-provider support (OpenAI, Anthropic, etc.)
- Session management with Memcached
- Database backend on Portia

### caliban — Agent Automation

Autonomous computer agent learning through environmental interaction.

- Docker engine
- Agent S MCP Server (MATE desktop, AT-SPI automation)
- Kernos MCP Shell Server (port 22021)
- GPU passthrough for vision tasks
- RDP access (port 25521)

### rosalind — Collaboration Services

Witty and resourceful moon for PHP, Go, and Node.js runtimes.

- Gitea self-hosted Git (port 22082, SSH on 22022)
- LobeChat AI chat interface (port 22081)
- Nextcloud file sharing and collaboration (port 22083)
- AnythingLLM document AI workspace (port 22084)
- Nextcloud data on dedicated Incus storage volume

### titania — Proxy & SSO Services

Queen of the Fairies managing access control and authentication.

- HAProxy 3.x with TLS termination (port 443)
- Let's Encrypt wildcard certificate via certbot DNS-01 (Namecheap)
- HTTP to HTTPS redirect (port 80)
- Gitea SSH proxy (port 22022)
- Casdoor SSO (port 22081, local PostgreSQL)
- Prometheus metrics at `:8404/metrics`

---

## External Access via HAProxy

Titania provides TLS termination and reverse proxy for all services.

- **Base domain**: `ouranos.helu.ca`
- **HTTPS**: port 443 (standard)
- **HTTP**: port 80 (redirects to HTTPS)
- **Certificate**: Let's Encrypt wildcard via certbot DNS-01

### Route Table

| Subdomain | Backend | Service |
|-----------|---------|---------|
| `ouranos.helu.ca` (root) | puck.incus:22281 | Angelia (Django) |
| `alertmanager.ouranos.helu.ca` | prospero.incus:443 (SSL) | AlertManager |
| `angelia.ouranos.helu.ca` | puck.incus:22281 | Angelia (Django) |
| `anythingllm.ouranos.helu.ca` | rosalind.incus:22084 | AnythingLLM |
| `arke.ouranos.helu.ca` | sycorax.incus:25540 | Arke LLM Proxy |
| `athena.ouranos.helu.ca` | puck.incus:22481 | Athena (Django) |
| `gitea.ouranos.helu.ca` | rosalind.incus:22082 | Gitea |
| `grafana.ouranos.helu.ca` | prospero.incus:443 (SSL) | Grafana |
| `hass.ouranos.helu.ca` | oberon.incus:8123 | Home Assistant |
| `id.ouranos.helu.ca` | titania.incus:22081 | Casdoor SSO |
| `icarlos.ouranos.helu.ca` | puck.incus:22681 | Icarlos (Django) |
| `jupyterlab.ouranos.helu.ca` | puck.incus:22071 | JupyterLab (OAuth2-Proxy) |
| `kairos.ouranos.helu.ca` | puck.incus:22581 | Kairos (Django) |
| `lobechat.ouranos.helu.ca` | rosalind.incus:22081 | LobeChat |
| `loki.ouranos.helu.ca` | prospero.incus:443 (SSL) | Loki |
| `mcp-switchboard.ouranos.helu.ca` | oberon.incus:22785 | MCP Switchboard |
| `nextcloud.ouranos.helu.ca` | rosalind.incus:22083 | Nextcloud |
| `openwebui.ouranos.helu.ca` | oberon.incus:22088 | Open WebUI |
| `peitho.ouranos.helu.ca` | puck.incus:22981 | Peitho (Django) |
| `pgadmin.ouranos.helu.ca` | prospero.incus:443 (SSL) | PgAdmin 4 |
| `prometheus.ouranos.helu.ca` | prospero.incus:443 (SSL) | Prometheus |
| `searxng.ouranos.helu.ca` | oberon.incus:22073 | SearXNG (OAuth2-Proxy) |
| `smtp4dev.ouranos.helu.ca` | oberon.incus:22085 | smtp4dev |
| `spelunker.ouranos.helu.ca` | puck.incus:22881 | Spelunker (Django) |

---

## Infrastructure Management

### Quick Start

```bash
# Provision containers
cd terraform
terraform init
terraform plan
terraform apply

# Start all containers
cd ../ansible
source ~/env/agathos/bin/activate
ansible-playbook sandbox_up.yml

# Deploy all services
ansible-playbook site.yml

# Stop all containers
ansible-playbook sandbox_down.yml
```

### Terraform Workflow

1. **Define** — Containers, networks, and resources in `*.tf` files
2. **Plan** — Review changes with `terraform plan`
3. **Apply** — Provision with `terraform apply`
4. **Verify** — Check outputs and container status

### Ansible Workflow

1. **Bootstrap** — Update packages, install essentials (`apt_update.yml`)
2. **Agents** — Deploy Alloy (log/metrics) and Node Exporter on all hosts
3. **Services** — Configure databases, Docker, applications, observability
4. **Verify** — Check service health and connectivity

### Vault Management

```bash
# Edit secrets
ansible-vault edit inventory/group_vars/all/vault.yml

# View secrets
ansible-vault view inventory/group_vars/all/vault.yml

# Encrypt a new file
ansible-vault encrypt new_secrets.yml
```

---

## S3 Storage Provisioning

Terraform provisions Incus S3 buckets for services requiring object storage:

| Service | Host | Purpose |
|---------|------|---------|
| **Casdoor** | Titania | User avatars and SSO resource storage |
| **LobeChat** | Rosalind | File uploads and attachments |

> S3 credentials (access key, secret key, endpoint) are stored as sensitive Terraform outputs and managed in Ansible Vault with the `vault_*_s3_*` prefix.

---

## Ansible Automation

### Full Deployment (`site.yml`)

Playbooks run in dependency order:

| Playbook | Hosts | Purpose |
|----------|-------|---------|
| `apt_update.yml` | All | Update packages and install essentials |
| `alloy/deploy.yml` | All | Grafana Alloy log/metrics collection |
| `prometheus/node_deploy.yml` | All | Node Exporter metrics |
| `docker/deploy.yml` | Oberon, Ariel, Miranda, Puck, Rosalind, Sycorax, Caliban, Titania | Docker engine |
| `smtp4dev/deploy.yml` | Oberon | SMTP test server |
| `pplg/deploy.yml` | Prospero | Full observability stack + HAProxy + OAuth2-Proxy |
| `postgresql/deploy.yml` | Portia | PostgreSQL with all databases |
| `postgresql_ssl/deploy.yml` | Titania | Dedicated PostgreSQL for Casdoor |
| `neo4j/deploy.yml` | Ariel | Neo4j graph database |
| `searxng/deploy.yml` | Oberon | SearXNG privacy search |
| `haproxy/deploy.yml` | Titania | HAProxy TLS termination and routing |
| `casdoor/deploy.yml` | Titania | Casdoor SSO |
| `mcpo/deploy.yml` | Miranda | MCPO MCP proxy |
| `openwebui/deploy.yml` | Oberon | Open WebUI LLM interface |
| `hass/deploy.yml` | Oberon | Home Assistant |
| `gitea/deploy.yml` | Rosalind | Gitea self-hosted Git |
| `nextcloud/deploy.yml` | Rosalind | Nextcloud collaboration |

### Individual Service Deployments

Services with standalone deploy playbooks (not in `site.yml`):

| Playbook | Host | Service |
|----------|------|---------|
| `anythingllm/deploy.yml` | Rosalind | AnythingLLM document AI |
| `arke/deploy.yml` | Sycorax | Arke LLM proxy |
| `argos/deploy.yml` | Miranda | Argos MCP web search server |
| `caliban/deploy.yml` | Caliban | Agent S MCP Server |
| `certbot/deploy.yml` | Titania | Let's Encrypt certificate renewal |
| `gitea_mcp/deploy.yml` | Miranda | Gitea MCP Server |
| `gitea_runner/deploy.yml` | Puck | Gitea CI/CD runner |
| `grafana_mcp/deploy.yml` | Miranda | Grafana MCP Server |
| `jupyterlab/deploy.yml` | Puck | JupyterLab + OAuth2-Proxy |
| `kernos/deploy.yml` | Caliban | Kernos MCP shell server |
| `lobechat/deploy.yml` | Rosalind | LobeChat AI chat |
| `neo4j_mcp/deploy.yml` | Miranda | Neo4j MCP Server |
| `rabbitmq/deploy.yml` | Oberon | RabbitMQ message queue |

### Lifecycle Playbooks

| Playbook | Purpose |
|----------|---------|
| `sandbox_up.yml` | Start all Uranian host containers |
| `sandbox_down.yml` | Gracefully stop all containers |
| `apt_update.yml` | Update packages on all hosts |
| `site.yml` | Full deployment orchestration |

---

## Data Flow Architecture

### Observability Pipeline

```
       All Hosts                      Prospero                       Alerts
Alloy + Node Exporter  →  Prometheus + Loki + Grafana  →  AlertManager + Pushover
collect metrics & logs      storage & visualisation            notifications
```

### Integration Points

| Consumer | Provider | Connection |
|----------|----------|-----------|
| All LLM apps | Arke (Sycorax) | `http://sycorax.incus:25540` |
| Open WebUI, Arke, Gitea, Nextcloud, LobeChat | PostgreSQL (Portia) | `portia.incus:5432` |
| Neo4j MCP | Neo4j (Ariel) | `ariel.incus:7687` (Bolt) |
| MCP Switchboard | Docker API (Miranda) | `tcp://miranda.incus:2375` |
| MCP Switchboard | RabbitMQ (Oberon) | `oberon.incus:5672` |
| Kairos, Spelunker | RabbitMQ (Oberon) | `oberon.incus:5672` |
| SMTP (all apps) | smtp4dev (Oberon) | `oberon.incus:22025` |
| All hosts | Loki (Prospero) | `http://prospero.incus:3100` |
| All hosts | Prometheus (Prospero) | `http://prospero.incus:9090` |

---

## Important Notes

⚠️ **Alloy Host Variables Required** — Every host with `alloy` in its `services` list must define `alloy_log_level` in `inventory/host_vars/<host>.incus.yml`. The playbook will fail with an undefined variable error if this is missing.

⚠️ **Alloy Syslog Listeners Required for Docker Services** — Any Docker Compose service using the syslog logging driver must have a corresponding `loki.source.syslog` listener in the host's Alloy config template (`ansible/alloy/<hostname>/config.alloy.j2`). Missing listeners cause Docker containers to fail on start.

⚠️ **Local Terraform State** — This project uses local Terraform state (no remote backend). Do not run `terraform apply` from multiple machines simultaneously.

⚠️ **Nested Docker** — Docker runs inside Incus containers (nested), requiring `security.nesting = true` and `lxc.apparmor.profile=unconfined` AppArmor override on all Docker-enabled hosts.

⚠️ **Deployment Order** — Prospero (observability) must be fully deployed before other hosts, as Alloy on every host pushes logs and metrics to `prospero.incus`. Run `pplg/deploy.yml` before `site.yml` on a fresh environment.
192
docs/pgadmin.md
Normal file
@@ -0,0 +1,192 @@
# PgAdmin - PostgreSQL Web Administration

## Overview

PgAdmin 4 is a web-based administration and management tool for PostgreSQL. It is deployed on **Portia** alongside the shared PostgreSQL instance, providing a graphical interface for database management, query execution, and server monitoring across both PostgreSQL deployments (Portia and Titania).

**Host:** portia.incus
**Role:** database
**Container Port:** 80 (Apache / pgAdmin4 web app)
**External Access:** https://pgadmin.ouranos.helu.ca/ (via HAProxy on Titania, proxied through host port 25555)

## Architecture

```
┌──────────┐      ┌────────────┐      ┌──────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│  Portia                          │
│          │      │ (Titania)  │      │                                  │
│          │      │   :443     │      │  :25555 ──▶ :80 (Apache)         │
└──────────┘      └────────────┘      │               │                  │
                                      │          ┌────▼─────┐            │
                                      │          │ PgAdmin4 │            │
                                      │          │  (web)   │            │
                                      │          └────┬─────┘            │
                                      │               │                  │
                                      │      ┌────────▼────────┐         │
                                      │      │  PostgreSQL 17  │         │
                                      │      │   (localhost)   │         │
                                      │      └─────────────────┘         │
                                      └──────────┬───────────────────────┘
                                                 │ SSL
                                                 ▼
                                      ┌─────────────────────┐
                                      │ PostgreSQL 17 (SSL) │
                                      │      (Titania)      │
                                      └─────────────────────┘
```

PgAdmin connects to:
- **Portia's PostgreSQL** — locally via `localhost:5432` (no SSL)
- **Titania's PostgreSQL** — over the Incus network via SSL, using the fetched certificate stored at `/var/lib/pgadmin/certs/titania-postgres-ca.crt`

## Terraform Resources

### Host Definition

PgAdmin runs on Portia, defined in `terraform/containers.tf`:

| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | database |
| Security Nesting | false |
| Proxy Devices | `25555 → 80` (Apache/PgAdmin web UI) |

The Incus proxy device maps host port 25555 to Apache on port 80 inside the container, where PgAdmin4 is served as a WSGI application.

## Ansible Deployment

### Playbook

```bash
cd ansible
ansible-playbook pgadmin/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `pgadmin/deploy.yml` | PgAdmin installation and SSL cert distribution |

### Deployment Steps

1. **Add PgAdmin repository** — Official pgAdmin4 APT repository with GPG key
2. **Install PgAdmin** — `pgadmin4-web` package (includes Apache configuration)
3. **Create certs directory** — `/var/lib/pgadmin/certs/` owned by `www-data`
4. **Fetch Titania SSL certificate** — Retrieves the self-signed PostgreSQL SSL cert from Titania
5. **Distribute certificate** — Copies to `/var/lib/pgadmin/certs/titania-postgres-ca.crt` for SSL connections

### ⚠️ Manual Post-Deployment Step Required

After running the playbook, you **must** SSH into Portia and run the PgAdmin web setup script manually:

```bash
# SSH into Portia
ssh portia.incus

# Run the setup script
sudo /usr/pgadmin4/bin/setup-web.sh
```

This interactive script:
- Prompts for the **admin email address** and **password** (use the values from `pgadmin_email` and `pgadmin_password` vault variables)
- Configures Apache virtual host for PgAdmin4
- Sets file permissions and ownership
- Restarts Apache to activate the configuration

This step cannot be automated via Ansible because the script requires interactive input and performs Apache configuration that depends on the local environment.

### Variables

#### Host Variables (`host_vars/portia.incus.yml`)

| Variable | Description |
|----------|-------------|
| `pgadmin_user` | System user (`pgadmin`) |
| `pgadmin_group` | System group (`pgadmin`) |
| `pgadmin_directory` | Data directory (`/srv/pgadmin`) |
| `pgadmin_port` | External port (`25555`) |
| `pgadmin_email` | Admin login email (`{{ vault_pgadmin_email }}`) |
| `pgadmin_password` | Admin login password (`{{ vault_pgadmin_password }}`) |

#### Vault Variables (`group_vars/all/vault.yml`)

| Variable | Description |
|----------|-------------|
| `vault_pgadmin_email` | PgAdmin admin email address |
| `vault_pgadmin_password` | PgAdmin admin password |

## Configuration

### SSL Certificate for Titania Connection

The playbook fetches the self-signed PostgreSQL SSL certificate from Titania and places it at `/var/lib/pgadmin/certs/titania-postgres-ca.crt`. When adding Titania's PostgreSQL as a server in PgAdmin:

1. Navigate to **Servers → Register → Server**
2. On the **Connection** tab:
   - Host: `titania.incus`
   - Port: `5432`
   - Username: `postgres`
3. On the **SSL** tab:
   - SSL mode: `verify-ca` or `require`
   - Root certificate: `/var/lib/pgadmin/certs/titania-postgres-ca.crt`

### Registered Servers

After setup, register both PostgreSQL instances:

| Server Name | Host | Port | SSL |
|-------------|------|------|-----|
| Portia (local) | `localhost` | `5432` | Off |
| Titania (Casdoor) | `titania.incus` | `5432` | verify-ca |

## Operations

### Start/Stop

```bash
# PgAdmin runs under Apache
sudo systemctl start apache2
sudo systemctl stop apache2
sudo systemctl restart apache2
```

### Health Check

```bash
# Check Apache is serving PgAdmin
curl -s -o /dev/null -w "%{http_code}" http://localhost/pgadmin4/login

# Check from external host
curl -s -o /dev/null -w "%{http_code}" http://portia.incus/pgadmin4/login
```

### Logs

```bash
# Apache error log
tail -f /var/log/apache2/error.log

# PgAdmin application log
tail -f /var/log/pgadmin/pgadmin4.log
```

## Troubleshooting

### Common Issues

| Symptom | Cause | Resolution |
|---------|-------|------------|
| 502/503 on pgadmin.ouranos.helu.ca | Apache not running on Portia | `sudo systemctl restart apache2` on Portia |
| Login page loads but can't authenticate | Setup script not run | SSH to Portia and run `sudo /usr/pgadmin4/bin/setup-web.sh` |
| Can't connect to Titania PostgreSQL | Missing SSL certificate | Re-run `ansible-playbook pgadmin/deploy.yml` to fetch cert |
| SSL certificate error for Titania | Certificate expired or regenerated | Re-fetch cert by re-running the playbook |
| Port 25555 unreachable | Incus proxy device missing | Verify proxy device in `terraform/containers.tf` for Portia |

## References

- [PgAdmin 4 Documentation](https://www.pgadmin.org/docs/pgadmin4/latest/)
- [PostgreSQL Deployment](postgresql.md)
- [Terraform Practices](terraform.md)
- [Ansible Practices](ansible.md)
287
docs/postgresql.md
Normal file
@@ -0,0 +1,287 @@
# PostgreSQL - Dual-Deployment Database Layer

## Overview

PostgreSQL 17 serves as the primary relational database engine for the Agathos sandbox. There are **two separate deployment playbooks**, each targeting a different host with a distinct purpose:

| Playbook | Host | Purpose |
|----------|------|---------|
| `postgresql/deploy.yml` | **Portia** | Shared multi-tenant database with **pgvector** for AI/vector workloads |
| `postgresql_ssl/deploy.yml` | **Titania** | Dedicated SSL-enabled database for the **Casdoor** identity provider |

**Portia** acts as the central database server for most applications, while **Titania** runs an isolated PostgreSQL instance exclusively for Casdoor, hardened with self-signed SSL certificates for secure external connections.

## Architecture

```
                        ┌────────────────────────────────────────────────────┐
                        │                Portia (postgresql)                 │
┌──────────┐            │ ┌──────────────────────────────────────────────┐   │
│   Arke   │───────────▶│ │  PostgreSQL 17 + pgvector v0.8.0             │   │
│(Sycorax) │            │ │                                              │   │
├──────────┤            │ │  Databases:                                  │   │
│  Gitea   │───────────▶│ │   arke ─── openwebui ─── spelunker           │   │
│(Rosalind)│            │ │   gitea ── lobechat ──── nextcloud           │   │
├──────────┤            │ │   anythingllm ────────── hass                │   │
│  Open    │───────────▶│ │                                              │   │
│  WebUI   │            │ │  pgvector enabled in:                        │   │
├──────────┤            │ │   arke, lobechat, openwebui,                 │   │
│ LobeChat │───────────▶│ │   spelunker, anythingllm                     │   │
├──────────┤            │ └──────────────────────────────────────────────┘   │
│   HASS   │───────────▶│                                                    │
│ + others │            │  PgAdmin available on :25555                       │
└──────────┘            └────────────────────────────────────────────────────┘

                        ┌────────────────────────────────────────────────────┐
                        │              Titania (postgresql_ssl)              │
┌──────────┐            │ ┌──────────────────────────────────────────────┐   │
│ Casdoor  │──SSL──────▶│ │  PostgreSQL 17 + SSL (self-signed)           │   │
│(Titania) │  (local)   │ │                                              │   │
└──────────┘            │ │  Database: casdoor (single-purpose)          │   │
                        │ └──────────────────────────────────────────────┘   │
                        └────────────────────────────────────────────────────┘
```

## Terraform Resources

### Portia – Shared Database Host

Defined in `terraform/containers.tf`:

| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | database |
| Security Nesting | false |
| Proxy Devices | `25555 → 80` (PgAdmin web UI) |

PostgreSQL port 5432 is **not** exposed externally—applications connect over the private Incus network (`10.10.0.0/16`).

### Titania – Proxy & SSO Host

| Attribute | Value |
|-----------|-------|
| Image | noble |
| Role | proxy_sso |
| Security Nesting | true |
| Proxy Devices | `443 → 8443`, `80 → 8080` (HAProxy) |

Titania runs PostgreSQL alongside Casdoor on the same host. Casdoor connects via localhost, so SSL is not required for the local connection despite being available for external clients.

## Ansible Deployment

### Playbook 1: Shared PostgreSQL with pgvector (Portia)

```bash
cd ansible
ansible-playbook postgresql/deploy.yml
```

#### Files

| File | Purpose |
|------|---------|
| `postgresql/deploy.yml` | Multi-tenant PostgreSQL with pgvector |

#### Deployment Steps

1. **Install build dependencies** — `curl`, `git`, `build-essential`, `vim`, `python3-psycopg2`
2. **Add PGDG repository** — Official PostgreSQL APT repository
3. **Install PostgreSQL 17** — Client, server, docs, `libpq-dev`, `server-dev`
4. **Clone & build pgvector v0.8.0** — Compiled from source against the installed PG version
5. **Start PostgreSQL** and restart after pgvector installation
6. **Set data directory permissions** — `700` owned by `postgres:postgres`
7. **Configure networking** — `listen_addresses = '*'`
8. **Configure authentication** — `host all all 0.0.0.0/0 md5` in `pg_hba.conf`
9. **Set admin password** — `postgres` superuser password from vault
10. **Create application users** — 9 database users (see table below)
11. **Create application databases** — 9 databases with matching owners
12. **Enable pgvector** — `CREATE EXTENSION vector` in 5 databases

### Playbook 2: SSL-Enabled PostgreSQL (Titania)

```bash
cd ansible
ansible-playbook postgresql_ssl/deploy.yml
```

#### Files

| File | Purpose |
|------|---------|
| `postgresql_ssl/deploy.yml` | Single-purpose SSL PostgreSQL for Casdoor |

#### Deployment Steps

1. **Install dependencies** — `curl`, `python3-psycopg2`, `python3-cryptography`
2. **Add PGDG repository** — Official PostgreSQL APT repository
3. **Install PostgreSQL 17** — Client and server only (no dev packages needed)
4. **Generate SSL certificates** — 4096-bit RSA key, self-signed, 10-year validity
5. **Configure networking** — `listen_addresses = '*'`
6. **Enable SSL** — `ssl = on` with cert/key file paths
7. **Configure tiered authentication** in `pg_hba.conf`:
   - `local` → `peer` (Unix socket, no password)
   - `host 127.0.0.1/32` → `md5` (localhost, no SSL)
   - `host 10.10.0.0/16` → `md5` (Incus network, no SSL)
   - `hostssl 0.0.0.0/0` → `md5` (external, SSL required)
8. **Set admin password** — `postgres` superuser password from vault
9. **Create Casdoor user and database** — Single-purpose
|
||||
|
||||
## User & Database Creation via Host Variables
|
||||
|
||||
Both playbooks derive all database names, usernames, and passwords from **host variables** defined in the Ansible inventory. No database credentials appear in `group_vars`—everything is scoped to the host that runs PostgreSQL.
|
||||
|
||||
### Portia Host Variables (`inventory/host_vars/portia.incus.yml`)
|
||||
|
||||
The `postgresql/deploy.yml` playbook loops over variable pairs to create users and databases. Each application gets three variables defined in Portia's host_vars:
|
||||
|
||||
| Variable Pattern | Example | Description |
|
||||
|-----------------|---------|-------------|
|
||||
| `{app}_db_name` | `arke_db_name: arke` | Database name |
|
||||
| `{app}_db_user` | `arke_db_user: arke` | Database owner/user |
|
||||
| `{app}_db_password` | `arke_db_password: "{{ vault_arke_db_password }}"` | Password (from vault) |
|
||||
|
||||
#### Application Database Matrix (Portia)
|
||||
|
||||
| Application | DB Name Variable | DB User Variable | pgvector |
|
||||
|-------------|-----------------|-----------------|----------|
|
||||
| Arke | `arke_db_name` | `arke_db_user` | ✔ |
|
||||
| Open WebUI | `openwebui_db_name` | `openwebui_db_user` | ✔ |
|
||||
| Spelunker | `spelunker_db_name` | `spelunker_db_user` | ✔ |
|
||||
| Gitea | `gitea_db_name` | `gitea_db_user` | |
|
||||
| LobeChat | `lobechat_db_name` | `lobechat_db_user` | ✔ |
|
||||
| Nextcloud | `nextcloud_db_name` | `nextcloud_db_user` | |
|
||||
| AnythingLLM | `anythingllm_db_name` | `anythingllm_db_user` | ✔ |
|
||||
| HASS | `hass_db_name` | `hass_db_user` | |
|
||||
| Nike | `nike_db_name` | `nike_db_user` | |
|
||||
|
||||
#### Additional Portia Variables
|
||||
|
||||
| Variable | Description |
|
||||
|----------|-------------|
|
||||
| `postgres_user` | System user (`postgres`) |
|
||||
| `postgres_group` | System group (`postgres`) |
|
||||
| `postgresql_port` | Port (`5432`) |
|
||||
| `postgresql_data_dir` | Data directory (`/var/lib/postgresql`) |
|
||||
| `postgres_password` | Admin password (`{{ vault_postgres_password }}`) |

### Titania Host Variables (`inventory/host_vars/titania.incus.yml`)

The `postgresql_ssl/deploy.yml` playbook creates a single database for Casdoor:

| Variable | Value | Description |
|----------|-------|-------------|
| `postgresql_ssl_postgres_password` | `{{ vault_postgresql_ssl_postgres_password }}` | Admin password |
| `postgresql_ssl_port` | `5432` | PostgreSQL port |
| `postgresql_ssl_cert_path` | `/etc/postgresql/17/main/ssl/server.crt` | SSL certificate |
| `casdoor_db_name` | `casdoor` | Database name |
| `casdoor_db_user` | `casdoor` | Database user |
| `casdoor_db_password` | `{{ vault_casdoor_db_password }}` | Password (from vault) |
| `casdoor_db_sslmode` | `disable` | Local connection skips SSL |

### Adding a New Application Database

To add a new application database on Portia:

1. **Add variables** to `inventory/host_vars/portia.incus.yml`:
   ```yaml
   myapp_db_name: myapp
   myapp_db_user: myapp
   myapp_db_password: "{{ vault_myapp_db_password }}"
   ```

2. **Add the vault secret** to `inventory/group_vars/all/vault.yml`:
   ```yaml
   vault_myapp_db_password: "s3cure-passw0rd"
   ```

3. **Add the user** to the `Create application database users` loop in `postgresql/deploy.yml`:
   ```yaml
   - { user: "{{ myapp_db_user }}", password: "{{ myapp_db_password }}" }
   ```

4. **Add the database** to the `Create application databases with owners` loop:
   ```yaml
   - { name: "{{ myapp_db_name }}", owner: "{{ myapp_db_user }}" }
   ```

5. **(Optional)** If the application uses vector embeddings, add the database to the `Enable pgvector extension in databases` loop:
   ```yaml
   - "{{ myapp_db_name }}"
   ```
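
After re-running the playbook, the result can be checked directly on Portia. A quick verification sketch, using the hypothetical `myapp` names from the steps above:

```bash
# Re-apply the PostgreSQL playbook
ansible-playbook postgresql/deploy.yml

# Confirm the role and database exist (myapp is the illustrative example name)
sudo -u postgres psql -c "SELECT rolname FROM pg_roles WHERE rolname = 'myapp';"
sudo -u postgres psql -c "SELECT datname FROM pg_database WHERE datname = 'myapp';"
```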

## Operations

### Start/Stop

```bash
# On either host
sudo systemctl start postgresql
sudo systemctl stop postgresql
sudo systemctl restart postgresql
```

### Health Check

```bash
# From any Incus host → Portia
psql -h portia.incus -U postgres -c "SELECT 1;"

# From Titania localhost
sudo -u postgres psql -c "SELECT 1;"

# Check pgvector availability
sudo -u postgres psql -c "SELECT * FROM pg_available_extensions WHERE name = 'vector';"
```

### Logs

```bash
# Systemd journal
journalctl -u postgresql -f

# PostgreSQL log files
tail -f /var/log/postgresql/postgresql-17-main.log

# Loki (via Grafana Explore)
{job="postgresql"}
```

### Backup

```bash
# Dump a single database
sudo -u postgres pg_dump myapp > myapp_backup.sql

# Dump all databases
sudo -u postgres pg_dumpall > full_backup.sql
```
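
For larger databases, PostgreSQL's custom format compresses the dump and allows selective restore via `pg_restore`. A sketch (the `myapp` name is illustrative):

```bash
# Custom-format (compressed) dump
sudo -u postgres pg_dump -Fc myapp > myapp_backup.dump

# Restore, dropping and recreating objects first
sudo -u postgres pg_restore --clean --if-exists -d myapp myapp_backup.dump
```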

### Restore

```bash
# Restore a single database
sudo -u postgres psql myapp < myapp_backup.sql

# Restore all databases
sudo -u postgres psql < full_backup.sql
```

## Troubleshooting

### Common Issues

| Symptom | Cause | Resolution |
|---------|-------|------------|
| Connection refused from app host | `pg_hba.conf` missing entry | Verify client IP is covered by HBA rules |
| pgvector extension not found | Built against wrong PG version | Re-run the `Build pgvector with correct pg_config` task |
| SSL handshake failure (Titania) | Expired or missing certificate | Check `/etc/postgresql/17/main/ssl/server.crt` validity |
| `FATAL: password authentication failed` | Wrong password in host_vars | Verify vault variable matches and re-run playbook |
| PgAdmin unreachable on :25555 | Incus proxy device missing | Check `terraform/containers.tf` proxy for Portia |
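
For the first symptom, PostgreSQL exposes the parsed HBA rules as a catalog view (available since PostgreSQL 10), which is quicker than reading `pg_hba.conf` by hand:

```bash
# Show the effective HBA rules as PostgreSQL parsed them
sudo -u postgres psql -c "SELECT type, database, user_name, address, auth_method FROM pg_hba_file_rules;"
```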

## References

- [PostgreSQL 17 Documentation](https://www.postgresql.org/docs/17/)
- [pgvector GitHub](https://github.com/pgvector/pgvector)
- [Terraform Practices](terraform.md)
- [Ansible Practices](ansible.md)

---

`docs/pplg.md` (new file, 583 lines)

# PPLG - Consolidated Observability & Admin Stack

## Overview

PPLG is the consolidated observability and administration stack running on **Prospero**. It bundles PgAdmin, Prometheus, Loki, and Grafana behind an internal HAProxy for TLS termination, with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication.

**Host:** prospero.incus
**Role:** Observability
**Incus Ports:** 25510 → 443 (HTTPS), 25511 → 80 (HTTP redirect)
**External Access:** Via Titania HAProxy → `prospero.incus:443`

| Subdomain | Service | Auth Method |
|-----------|---------|-------------|
| `grafana.ouranos.helu.ca` | Grafana | Native Casdoor OAuth |
| `pgadmin.ouranos.helu.ca` | PgAdmin | Native Casdoor OAuth |
| `prometheus.ouranos.helu.ca` | Prometheus | OAuth2-Proxy sidecar |
| `loki.ouranos.helu.ca` | Loki | None (machine-to-machine) |
| `alertmanager.ouranos.helu.ca` | Alertmanager | None (internal) |

## Architecture

```
┌──────────┐      ┌─────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│   HAProxy   │─────▶│                 Prospero (PPLG)                 │
│          │      │  (Titania)  │      │                                                 │
└──────────┘      │ :443 → :443 │      │  ┌──────────────────────────────────────────┐   │
                  └─────────────┘      │  │ HAProxy (systemd, :443/:80)              │   │
                                       │  │ TLS termination + subdomain routing      │   │
┌──────────┐                           │  └───┬──────┬──────┬──────┬──────┬──────────┘   │
│  Alloy   │──push────────────────────▶│      │      │      │      │      │              │
│ (agents) │  loki.ouranos.helu.ca     │      │      │      │      │      │              │
│          │  prometheus.ouranos.helu.ca      │      │      │      │      │              │
└──────────┘                           │      ▼      ▼      ▼      ▼      ▼              │
                                       │  Grafana PgAdmin OAuth2  Loki  Alertmanager     │
                                       │   :3000   :5050  Proxy  :3100    :9093          │
                                       │                  :9091                          │
                                       │                    │                            │
                                       │                    ▼                            │
                                       │                Prometheus                       │
                                       │                  :9090                          │
                                       └─────────────────────────────────────────────────┘
```

### Traffic Flow

| Source | Destination | Path | Auth |
|--------|-------------|------|------|
| Browser → Grafana | Titania :443 → Prospero :443 → HAProxy → :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :443 → HAProxy → :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :443 → HAProxy → OAuth2-Proxy :9091 → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | `https://loki.ouranos.helu.ca` → HAProxy :443 → :3100 | Subdomain ACL | None |
| Alloy → Prometheus | `https://prometheus.ouranos.helu.ca/api/v1/write` → HAProxy :443 → :9090 | `skip_auth_route` | None |

## Deployment

### Prerequisites

1. **Terraform**: Prospero container must have updated port mappings (`terraform apply`)
2. **Certbot**: Wildcard cert must exist on Titania (`ansible-playbook certbot/deploy.yml`)
3. **Vault Secrets**: All vault variables must be set (see [Required Vault Secrets](#required-vault-secrets))
4. **Casdoor Applications**: Register PgAdmin and Prometheus apps in Casdoor (see [Casdoor SSO](#casdoor-sso))

### Playbook

```bash
cd ansible
ansible-playbook pplg/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `pplg/deploy.yml` | Main consolidated deployment playbook |
| `pplg/pplg-haproxy.cfg.j2` | HAProxy TLS termination config (5 backends) |
| `pplg/prometheus.yml.j2` | Prometheus scrape configuration |
| `pplg/alert_rules.yml.j2` | Prometheus alerting rules |
| `pplg/alertmanager.yml.j2` | Alertmanager routing and Pushover notifications |
| `pplg/config.yml.j2` | Loki server configuration |
| `pplg/grafana.ini.j2` | Grafana main config with Casdoor OAuth |
| `pplg/datasource.yml.j2` | Grafana provisioned datasources |
| `pplg/users.yml.j2` | Grafana provisioned users |
| `pplg/config_local.py.j2` | PgAdmin config with Casdoor OAuth |
| `pplg/pgadmin.service.j2` | PgAdmin gunicorn systemd unit |
| `pplg/oauth2-proxy-prometheus.cfg.j2` | OAuth2-Proxy config for Prometheus UI |
| `pplg/oauth2-proxy-prometheus.service.j2` | OAuth2-Proxy systemd unit |

### Deployment Steps

1. **APT Repositories**: Add Grafana and PgAdmin repos
2. **Install Packages**: haproxy, prometheus, loki, grafana, pgadmin4-web, gunicorn
3. **Prometheus**: Config, alert rules, systemd override for remote write receiver
4. **Alertmanager**: Install, config with Pushover integration
5. **Loki**: Create user/dirs, template config
6. **Grafana**: Provisioning (datasources, users, dashboards), OAuth config
7. **PgAdmin**: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
8. **OAuth2-Proxy**: Download binary (v7.6.0), config for Prometheus sidecar
9. **SSL Certificate**: Fetch Let's Encrypt wildcard cert from Titania (self-signed fallback)
10. **HAProxy**: Template config, enable and start systemd service

### Deployment Order

PPLG must be deployed **before** services that push metrics/logs:

```
apt_update → alloy → node_exporter → pplg → postgresql → ...
```

This order is enforced in `site.yml`.

## Required Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

⚠️ **All vault variables below must be set before running the playbook.** Missing variables cause template failures like:

```
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```

### Prometheus Scrape Credentials

These are used in `prometheus.yml.j2` to scrape metrics from Casdoor and Gitea.

#### 1. Casdoor Prometheus Access Key
```yaml
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
```

#### 2. Casdoor Prometheus Access Secret
```yaml
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
```

**Requirements (both):**
- **Source**: API key pair from the `built-in/admin` Casdoor user
- **Used by**: `prometheus.yml.j2` Casdoor scrape job (`accessKey` / `accessSecret` query params)
- **How to obtain**: Generate via the Casdoor API (the "API key" account item is not exposed in the UI by default):
  ```bash
  # 1. Log in to get a session cookie
  curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
    -H "Content-Type: application/json" \
    -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'

  # 2. Generate API keys for built-in/admin
  curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
    -H "Content-Type: application/json" \
    -d '{"owner":"built-in","name":"admin"}'

  # 3. Retrieve the generated keys
  curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
    python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"

  # 4. Clean up
  rm /tmp/casdoor-cookie.txt
  ```

⚠️ The `built-in/admin` user is used (not a `heluca` user) because Casdoor's `/api/metrics` endpoint requires an admin user and serves global platform metrics.

#### 3. Gitea Metrics Token
```yaml
vault_gitea_metrics_token: "YourGiteaMetricsToken"
```
**Requirements:**
- **Length**: 32+ characters
- **Source**: Must match the token configured in Gitea's `app.ini`
- **Generation**: `openssl rand -hex 32`
- **Used by**: `prometheus.yml.j2` Gitea scrape job (Bearer token auth)
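
The generation step can be sanity-checked locally: `openssl rand -hex 32` produces 32 random bytes encoded as 64 hex characters, which satisfies the 32+ character requirement:

```shell
# Generate a token and confirm its length
token=$(openssl rand -hex 32)
echo "${#token}"   # 64
```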

### Grafana Credentials

#### 4. Grafana Admin User
```yaml
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
```

#### 5. Grafana Viewer User
```yaml
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
```

#### 6. Grafana OAuth (Casdoor SSO)
```yaml
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-grafana`
- **Redirect URI**: `https://grafana.ouranos.helu.ca/login/generic_oauth`

### PgAdmin

#### 7. PgAdmin Setup

Run the initial database setup manually after installation:

```bash
/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db
```

**Requirements:**
- **Purpose**: Initial local admin account (fallback when OAuth is unavailable)

#### 8. PgAdmin OAuth (Casdoor SSO)
```yaml
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-pgadmin`
- **Redirect URI**: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`

### Prometheus OAuth2-Proxy

#### 9. Prometheus OAuth2-Proxy (Casdoor SSO)
```yaml
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
```
**Requirements:**
- Client ID/Secret must match the Casdoor application `app-prometheus`
- **Redirect URI**: `https://prometheus.ouranos.helu.ca/oauth2/callback`
- **Cookie secret generation**:
  ```bash
  python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
  ```

### Alertmanager (Pushover)

#### 10. Pushover Notification Credentials
```yaml
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
```
**Requirements:**
- **Source**: [pushover.net](https://pushover.net/) account
- **User Key**: Found on Pushover dashboard
- **API Token**: Create an application in Pushover

### Quick Reference

| Vault Variable | Used By | Source |
|----------------|---------|--------|
| `vault_casdoor_prometheus_access_key` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_casdoor_prometheus_access_secret` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_gitea_metrics_token` | prometheus.yml.j2 | Gitea app.ini |
| `vault_grafana_admin_name` | users.yml.j2 | Choose any |
| `vault_grafana_admin_login` | users.yml.j2 | Choose any |
| `vault_grafana_admin_password` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_name` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_login` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_password` | users.yml.j2 | Choose any |
| `vault_grafana_oauth_client_id` | grafana.ini.j2 | Casdoor app |
| `vault_grafana_oauth_client_secret` | grafana.ini.j2 | Casdoor app |
| `vault_pgadmin_email` | config_local.py.j2 | Choose any |
| `vault_pgadmin_password` | config_local.py.j2 | Choose any |
| `vault_pgadmin_oauth_client_id` | config_local.py.j2 | Casdoor app |
| `vault_pgadmin_oauth_client_secret` | config_local.py.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_id` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_secret` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_cookie_secret` | oauth2-proxy-prometheus.cfg.j2 | Generate |
| `vault_pushover_user_key` | alertmanager.yml.j2 | Pushover account |
| `vault_pushover_api_token` | alertmanager.yml.j2 | Pushover account |

## Casdoor SSO

Three Casdoor applications are required. The Grafana application should already exist; the PgAdmin and Prometheus applications must be created.

### Applications to Register

Register in the Casdoor Admin UI (`https://id.ouranos.helu.ca`) or add to `ansible/casdoor/init_data.json.j2`:

| Application | Client ID | Redirect URI | Grant Types |
|-------------|-----------|--------------|-------------|
| `app-grafana` | `vault_grafana_oauth_client_id` | `https://grafana.ouranos.helu.ca/login/generic_oauth` | `authorization_code`, `refresh_token` |
| `app-pgadmin` | `vault_pgadmin_oauth_client_id` | `https://pgadmin.ouranos.helu.ca/oauth2/redirect` | `authorization_code`, `refresh_token` |
| `app-prometheus` | `vault_prometheus_oauth2_client_id` | `https://prometheus.ouranos.helu.ca/oauth2/callback` | `authorization_code`, `refresh_token` |

### URL Strategy

| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/login/oauth/authorize` | User's browser (external) |
| **Token URL** | `https://id.ouranos.helu.ca/api/login/oauth/access_token` | Server-to-server |
| **Userinfo URL** | `https://id.ouranos.helu.ca/api/userinfo` | Server-to-server |
| **OIDC Discovery** | `https://id.ouranos.helu.ca/.well-known/openid-configuration` | OAuth2-Proxy |

### Auth Methods per Service

| Service | Auth Method | Details |
|---------|-------------|---------|
| **Grafana** | Native `[auth.generic_oauth]` | Built-in OAuth support in `grafana.ini` |
| **PgAdmin** | Native `OAUTH2_CONFIG` | Built-in OAuth support in `config_local.py` |
| **Prometheus** | OAuth2-Proxy sidecar | Binary on `:9091` proxying to `:9090` |
| **Loki** | None | Machine-to-machine (Alloy agents push logs) |
| **Alertmanager** | None | Internal only |

## HAProxy Configuration

### Backends

| Backend | Upstream | Health Check | Auth |
|---------|----------|--------------|------|
| `backend_grafana` | `127.0.0.1:3000` | `GET /api/health` | Grafana OAuth |
| `backend_pgadmin` | `127.0.0.1:5050` | `GET /misc/ping` | PgAdmin OAuth |
| `backend_prometheus` | `127.0.0.1:9091` (OAuth2-Proxy) | `GET /ping` | OAuth2-Proxy |
| `backend_prometheus_direct` | `127.0.0.1:9090` | — | None (write API) |
| `backend_loki` | `127.0.0.1:3100` | `GET /ready` | None |
| `backend_alertmanager` | `127.0.0.1:9093` | `GET /-/healthy` | None |

### skip_auth_route Pattern

The Prometheus write API (`/api/v1/write`) is accessed by Alloy agents for machine-to-machine metric pushes. HAProxy uses an ACL to bypass OAuth2-Proxy:

```
acl is_prometheus_write path_beg /api/v1/write
use_backend backend_prometheus_direct if host_prometheus is_prometheus_write
```

This routes `https://prometheus.ouranos.helu.ca/api/v1/write` directly to Prometheus on `:9090`, while all other Prometheus traffic goes through OAuth2-Proxy on `:9091`.
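
The split can be exercised with curl. An unauthenticated request to a UI path should be redirected into the OAuth2 flow, while a request to the write path reaches Prometheus directly (the exact status codes are assumptions about the configured behavior, not guaranteed values):

```bash
# UI path: expect a redirect into the OAuth2 login flow
curl -sk -o /dev/null -w "%{http_code}\n" https://prometheus.ouranos.helu.ca/graph

# Write path: bypasses OAuth2-Proxy; Prometheus rejects a non-protobuf body,
# but the error comes from :9090 itself, proving the ACL matched
curl -sk -o /dev/null -w "%{http_code}\n" -X POST https://prometheus.ouranos.helu.ca/api/v1/write
```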

### SSL Certificate

- **Primary**: Let's Encrypt wildcard cert (`*.ouranos.helu.ca`) fetched from Titania
- **Fallback**: Self-signed cert generated on Prospero (if Titania unavailable)
- **Path**: `/etc/haproxy/certs/ouranos.pem`

## Host Variables

**File:** `ansible/inventory/host_vars/prospero.incus.yml`

Services list:
```yaml
services:
  - alloy
  - pplg
```

Key variable groups defined in `prospero.incus.yml`:
- PPLG HAProxy (user, group, uid/gid 800, syslog port)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)

## Terraform

### Prospero Port Mapping

```hcl
devices = [
  {
    name = "https_internal"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25510"
      connect = "tcp:127.0.0.1:443"
    }
  },
  {
    name = "http_redirect"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25511"
      connect = "tcp:127.0.0.1:80"
    }
  }
]
```

Run `terraform apply` before deploying if port mappings changed.

### Titania Backend Routing

Titania's HAProxy routes external subdomains to Prospero's HTTPS port:

```yaml
# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/api/health"
  ssl_backend: true

- subdomain: "pgadmin"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/misc/ping"
  ssl_backend: true

- subdomain: "prometheus"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/ping"
  ssl_backend: true
```

## Monitoring

### Alloy Configuration

**File:** `ansible/alloy/prospero/config.alloy.j2`

- **HAProxy Syslog**: `loki.source.syslog` on `127.0.0.1:51405` (TCP) receives Docker syslog from HAProxy container
- **Journal Labels**: Dedicated job labels for `grafana-server`, `prometheus`, `loki`, `alertmanager`, `pgadmin`, `oauth2-proxy-prometheus`
- **System Logs**: `/var/log/syslog`, `/var/log/auth.log` → Loki
- **Metrics**: Node exporter + process exporter → Prometheus remote write

### Prometheus Scrape Targets

| Job | Target | Auth |
|-----|--------|------|
| `prometheus` | `localhost:9090` | None |
| `node-exporter` | All Uranian hosts `:9100` | None |
| `alertmanager` | `prospero.incus:9093` | None |
| `haproxy` | `titania.incus:8404` | None |
| `gitea` | `oberon.incus:22084` | Bearer token |
| `casdoor` | `titania.incus:22081` | Access key/secret params |

### Alert Rules

Groups defined in `alert_rules.yml.j2`:

| Group | Alerts | Scope |
|-------|--------|-------|
| `node_alerts` | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| `puck_process_alerts` | HighCPU/Memory per process, CrashLoop | puck.incus |
| `puck_container_alerts` | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| `service_alerts` | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| `loki_alerts` | HighLogVolume | Loki |
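
As a concrete illustration, an `InstanceDown` rule in `node_alerts` typically has this shape (a sketch: the threshold, duration, and labels here are assumptions, not the values deployed in `alert_rules.yml.j2`):

```yaml
groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```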

### Alertmanager Routing

Alerts are routed to Pushover with severity-based priority:

| Severity | Pushover Priority | Emoji |
|----------|-------------------|-------|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | — |
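
A minimal Alertmanager route implementing this severity mapping could look like the following (a sketch, not the deployed `alertmanager.yml.j2`; receiver names are illustrative):

```yaml
route:
  receiver: pushover-normal
  routes:
    - matchers: ['severity="critical"']
      receiver: pushover-emergency
    - matchers: ['severity="warning"']
      receiver: pushover-high

receivers:
  - name: pushover-emergency
    pushover_configs:
      - user_key: "{{ vault_pushover_user_key }}"
        token: "{{ vault_pushover_api_token }}"
        priority: "2"
        retry: 30s    # emergency priority requires retry/expire
        expire: 1h
  - name: pushover-high
    pushover_configs:
      - user_key: "{{ vault_pushover_user_key }}"
        token: "{{ vault_pushover_api_token }}"
        priority: "1"
  - name: pushover-normal
    pushover_configs:
      - user_key: "{{ vault_pushover_user_key }}"
        token: "{{ vault_pushover_api_token }}"
        priority: "0"
```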

## Grafana MCP Server

Grafana has an associated **MCP (Model Context Protocol) server** that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on **Miranda** and connects back to Grafana on Prospero via the internal network (`prospero.incus:3000`) using a service account token.

| Property | Value |
|----------|-------|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | `http://miranda.incus:25530/grafana` |
| Auth | Grafana service account token (`vault_grafana_service_account_token`) |

The Grafana MCP server is deployed separately from PPLG but depends on Grafana being running first. Deploy order: `pplg → grafana_mcp → mcpo`.

For full details (deployment, configuration, available tools, troubleshooting), see **[Grafana MCP Server](grafana_mcp.md)**.

## Access After Deployment

| Service | URL | Login |
|---------|-----|-------|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |

## Troubleshooting

### Service Status

```bash
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
```

### HAProxy Service

```bash
ssh prospero.incus
sudo systemctl status haproxy
sudo journalctl -u haproxy -f
```

### View Logs

```bash
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f

# HAProxy logs (shipped via syslog to Alloy → Loki)
# Query in Grafana: {job="pplg-haproxy"}
```

### Test Endpoints (from Prospero)

```bash
# Grafana
curl -s http://127.0.0.1:3000/api/health

# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping

# Prometheus
curl -s http://127.0.0.1:9090/-/healthy

# Loki
curl -s http://127.0.0.1:3100/ready

# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy

# HAProxy stats
curl -s http://127.0.0.1:8404/metrics | head
```

### Test TLS (from any host)

```bash
# Direct to Prospero container
curl -sk https://prospero.incus/api/health

# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
```

### Common Errors

#### `vault_casdoor_prometheus_access_key` is undefined

```
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```

**Cause**: The Casdoor metrics scrape job in `prometheus.yml.j2` requires access credentials.

**Fix**: Generate API keys for the `built-in/admin` Casdoor user (see [Casdoor Prometheus Access Key](#1-casdoor-prometheus-access-key) for the full procedure), then add to vault:
```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```
```yaml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
```

#### Certificate fetch fails

**Cause**: Titania not running or certbot hasn't provisioned the cert yet.

**Fix**: Ensure Titania is up and certbot has run:
```bash
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
```

The playbook falls back to a self-signed certificate if Titania is unavailable.

#### OAuth2 redirect loops

**Cause**: Casdoor application redirect URI doesn't match the service URL.

**Fix**: Verify redirect URIs match exactly:
- Grafana: `https://grafana.ouranos.helu.ca/login/generic_oauth`
- PgAdmin: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
- Prometheus: `https://prometheus.ouranos.helu.ca/oauth2/callback`

## Migration Notes

PPLG replaces the following standalone playbooks (kept as reference):

| Original Playbook | Replaced By |
|-------------------|-------------|
| `prometheus/deploy.yml` | `pplg/deploy.yml` |
| `prometheus/alertmanager_deploy.yml` | `pplg/deploy.yml` |
| `loki/deploy.yml` | `pplg/deploy.yml` |
| `grafana/deploy.yml` | `pplg/deploy.yml` |
| `pgadmin/deploy.yml` | `pplg/deploy.yml` |

PgAdmin was previously hosted on **Portia** (port 25555). It now runs on **Prospero** via gunicorn (no Apache).

---

`docs/rabbitmq.md` (new file, 546 lines)

# RabbitMQ - Message Broker Infrastructure

## Overview

RabbitMQ 3 (management-alpine) serves as the central message broker for the Agathos sandbox, providing AMQP-compliant message queuing for asynchronous communication between services. The deployment includes the management web interface for monitoring and administration.

**Host:** Oberon (container_orchestration)
**Role:** Message broker for event-driven architectures
**AMQP Port:** 5672
**Management Port:** 25582
**Syslog Port:** 51402 (Alloy)

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                      Oberon Host                        │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │          RabbitMQ Container (Docker)             │   │
│  │                                                  │   │
│  │   ┌──────────────┬──────────────┐                │   │
│  │   │    VHost     │    VHost     │                │   │
│  │   │   "kairos"   │ "spelunker"  │                │   │
│  │   │              │              │                │   │
│  │   │  User:       │  User:       │                │   │
│  │   │  kairos      │  spelunker   │                │   │
│  │   │  (full perm) │ (full perm)  │                │   │
│  │   └──────────────┴──────────────┘                │   │
│  │                                                  │   │
│  │   Default Admin: rabbitmq                        │   │
│  │   (all vhosts, admin privileges)                 │   │
│  │                                                  │   │
│  └──────────────────────────────────────────────────┘   │
│                                                         │
│  Ports: 5672 (AMQP), 25582 (Management)                 │
│  Logs: syslog → Alloy:51402 → Loki                      │
└─────────────────────────────────────────────────────────┘

┌──────────────┐            ┌──────────────┐
│    Kairos    │───AMQP────▶│   kairos/    │
│   (future)   │            │   (vhost)    │
└──────────────┘            └──────────────┘

┌──────────────┐            ┌──────────────┐
│  Spelunker   │───AMQP────▶│  spelunker/  │
│   (future)   │            │   (vhost)    │
└──────────────┘            └──────────────┘
```

**Note**: Kairos and Spelunker are future services. The RabbitMQ infrastructure is pre-provisioned with dedicated virtual hosts and users ready for when these services are deployed.

## Terraform Resources

### Oberon Host Definition

RabbitMQ runs on Oberon, defined in `terraform/containers.tf`:

| Attribute | Value |
|-----------|-------|
| Description | Docker Host + MCP Switchboard - King of Fairies orchestrating containers |
| Image | noble |
| Role | container_orchestration |
| Security Nesting | `true` (required for Docker) |
| AppArmor Profile | unconfined |
| Proxy Devices | `25580-25599 → 25580-25599` (application port range) |

### Container Dependencies

| Resource | Relationship |
|----------|--------------|
| Docker | RabbitMQ runs as a Docker container on Oberon |
| Alloy | Collects syslog logs from RabbitMQ on port 51402 |
| Prospero | Receives logs via Loki for observability |
|
||||
|
||||
## Ansible Deployment
|
||||
|
||||
### Playbook
|
||||
|
||||
```bash
|
||||
cd ansible
|
||||
ansible-playbook rabbitmq/deploy.yml
|
||||
```
|
||||
|
||||
### Files
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `rabbitmq/deploy.yml` | Main deployment playbook |
|
||||
| `rabbitmq/docker-compose.yml.j2` | Docker Compose template |
|
||||
|
||||
### Deployment Steps
|
||||
|
||||
The playbook performs the following operations:
|
||||
|
||||
1. **User and Group Management**
|
||||
- Creates `rabbitmq` system user and group
|
||||
- Adds `ponos` user to `rabbitmq` group for operational access
|
||||
|
||||
2. **Directory Setup**
|
||||
- Creates service directory at `/srv/rabbitmq`
|
||||
- Sets ownership to `rabbitmq:rabbitmq`
|
||||
- Configures permissions (mode 750)
|
||||
|
||||
3. **Docker Compose Deployment**
|
||||
- Templates `docker-compose.yml` from Jinja2 template
|
||||
- Deploys RabbitMQ container with `docker compose up`
|
||||
|
||||
4. **rabbitmqadmin CLI Setup**
|
||||
- Extracts `rabbitmqadmin` from container to `/usr/local/bin/`
|
||||
- Makes it executable for host-level management
|
||||
|
||||
5. **Automatic Provisioning** (idempotent)
|
||||
- Creates virtual hosts: `kairos`, `spelunker`
|
||||
- Creates users with passwords from vault
|
||||
- Sets user tags (currently none, expandable for admin/monitoring roles)
|
||||
- Configures full permissions for each user on their respective vhost
|
||||
|
||||
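The provisioning in step 5 leans on the fact that the RabbitMQ management API is idempotent: `PUT`s to `/api/vhosts/...`, `/api/users/...`, and `/api/permissions/...` create-or-update, so re-running them is safe. A rough Python sketch of the request plan implied by the group variables (illustrative only; the playbook itself may drive `rabbitmqadmin` or Ansible modules instead):

```python
import json
from urllib.parse import quote

def provisioning_plan(vhosts, users, permissions):
    """Build the sequence of idempotent HTTP PUTs that provision
    vhosts, users, and permissions via the management API."""
    plan = []
    for vh in vhosts:
        plan.append(("PUT", f"/api/vhosts/{quote(vh['name'], safe='')}", None))
    for u in users:
        body = {"password": u["password"], "tags": ",".join(u.get("tags", []))}
        plan.append(("PUT", f"/api/users/{quote(u['name'], safe='')}", json.dumps(body)))
    for p in permissions:
        path = (f"/api/permissions/{quote(p['vhost'], safe='')}"
                f"/{quote(p['user'], safe='')}")
        body = {"configure": p["configure_priv"],
                "write": p["write_priv"],
                "read": p["read_priv"]}
        plan.append(("PUT", path, json.dumps(body)))
    return plan
```

Each `PUT` can be replayed on every deploy without drift, which is what makes the playbook safe to re-run.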
### Variables

#### Host Variables (`host_vars/oberon.incus.yml`)

| Variable | Description | Default |
|----------|-------------|---------|
| `rabbitmq_user` | Service user | `rabbitmq` |
| `rabbitmq_group` | Service group | `rabbitmq` |
| `rabbitmq_directory` | Installation directory | `/srv/rabbitmq` |
| `rabbitmq_amqp_port` | AMQP protocol port | `5672` |
| `rabbitmq_management_port` | Management web interface | `25582` |
| `rabbitmq_password` | Default admin password | `{{ vault_rabbitmq_password }}` |

#### Group Variables (`group_vars/all/vars.yml`)

Defines the provisioning configuration for vhosts, users, and permissions:

```yaml
rabbitmq_vhosts:
  - name: kairos
  - name: spelunker

rabbitmq_users:
  - name: kairos
    password: "{{ kairos_rabbitmq_password }}"
    tags: []
  - name: spelunker
    password: "{{ spelunker_rabbitmq_password }}"
    tags: []

rabbitmq_permissions:
  - vhost: kairos
    user: kairos
    configure_priv: .*
    read_priv: .*
    write_priv: .*
  - vhost: spelunker
    user: spelunker
    configure_priv: .*
    read_priv: .*
    write_priv: .*
```

**Vault Variable Mappings**:
```yaml
kairos_rabbitmq_password: "{{ vault_kairos_rabbitmq_password }}"
spelunker_rabbitmq_password: "{{ vault_spelunker_rabbitmq_password }}"
```

#### Vault Variables (`group_vars/all/vault.yml`)

All sensitive credentials are encrypted in the vault:

| Variable | Description |
|----------|-------------|
| `vault_rabbitmq_password` | Default admin account password |
| `vault_kairos_rabbitmq_password` | Kairos service user password |
| `vault_spelunker_rabbitmq_password` | Spelunker service user password |

## Configuration

### Docker Compose Template

The deployment uses a minimal Docker Compose configuration:

```yaml
services:
  rabbitmq:
    image: rabbitmq:3-management-alpine
    container_name: rabbitmq
    restart: unless-stopped
    ports:
      - "{{ rabbitmq_amqp_port }}:5672"          # AMQP protocol
      - "{{ rabbitmq_management_port }}:15672"   # Management UI
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq          # Persistent data
    environment:
      RABBITMQ_DEFAULT_USER: "{{ rabbitmq_user }}"
      RABBITMQ_DEFAULT_PASS: "{{ rabbitmq_password }}"
    logging:
      driver: syslog
      options:
        syslog-address: "tcp://127.0.0.1:{{ rabbitmq_syslog_port }}"
        syslog-format: "{{ syslog_format }}"
        tag: "rabbitmq"
```

### Data Persistence

- **Volume**: `rabbitmq_data` (Docker-managed volume)
- **Location**: `/var/lib/rabbitmq` inside container
- **Contents**:
  - Message queues and persistent messages
  - Virtual host metadata
  - User credentials and permissions
  - Configuration overrides

## Virtual Hosts and Users

### Default Admin Account

**Username**: `rabbitmq`
**Password**: `{{ vault_rabbitmq_password }}` (from vault)
**Privileges**: Full administrative access to all virtual hosts

The default admin account is created automatically when the container starts and can access:
- All virtual hosts (including `/`, `kairos`, `spelunker`)
- Management web interface
- All RabbitMQ management commands

### Kairos Virtual Host

**VHost**: `kairos`
**User**: `kairos`
**Password**: `{{ vault_kairos_rabbitmq_password }}`
**Permissions**: Full (configure, read, write) on all resources matching `.*`

Intended for the **Kairos** service (event-driven time-series processing system, planned future deployment).

### Spelunker Virtual Host

**VHost**: `spelunker`
**User**: `spelunker`
**Password**: `{{ vault_spelunker_rabbitmq_password }}`
**Permissions**: Full (configure, read, write) on all resources matching `.*`

Intended for the **Spelunker** service (log exploration and analytics platform, planned future deployment).

### Permission Model

Both service users have full access within their respective virtual hosts:

| Permission | Pattern | Description |
|------------|---------|-------------|
| Configure | `.*` | Create/delete queues, exchanges, bindings |
| Write | `.*` | Publish messages to exchanges |
| Read | `.*` | Consume messages from queues |

This isolation ensures:
- ✔ Each service operates in its own namespace
- ✔ Messages cannot cross between services
- ✔ Resource limits can be applied per-vhost
- ✔ Service credentials can be rotated independently

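Permission patterns are anchored regular expressions matched against resource (queue and exchange) names, so `.*` grants access to everything in the vhost while tighter patterns remain possible later. A quick illustration (the `kairos-jobs.*` pattern below is hypothetical, not part of the current config):

```python
import re

def allowed(pattern: str, resource: str) -> bool:
    # RabbitMQ evaluates permission patterns as anchored regular
    # expressions against the queue/exchange name.
    return re.fullmatch(pattern, resource) is not None
```

With the current config, `allowed(".*", name)` is true for every resource name; a narrower pattern such as `kairos-jobs.*` would admit `kairos-jobs-high` but reject unrelated queues.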
## Access and Administration

### Management Web Interface

**URL**: `http://oberon.incus:25582`
**External**: `http://{oberon-ip}:25582`
**Login**: `rabbitmq` / `{{ vault_rabbitmq_password }}`

Features:
- Queue inspection and message browsing
- Exchange and binding management
- Connection and channel monitoring
- User and permission administration
- Virtual host management
- Performance metrics and charts

### CLI Administration

#### On Host Machine (using rabbitmqadmin)

```bash
# List vhosts
rabbitmqadmin -H oberon.incus -P 25582 -u rabbitmq -p PASSWORD list vhosts

# List queues in a vhost
rabbitmqadmin -H oberon.incus -P 25582 -u rabbitmq -p PASSWORD -V kairos list queues

# Publish a test message
rabbitmqadmin -H oberon.incus -P 25582 -u rabbitmq -p PASSWORD -V kairos publish \
  exchange=amq.default routing_key=test payload="test message"
```

#### Inside Container

```bash
# Enter the container
docker exec -it rabbitmq /bin/sh

# List vhosts
rabbitmqctl list_vhosts

# List users
rabbitmqctl list_users

# List permissions for a user
rabbitmqctl list_user_permissions kairos

# List queues in a vhost
rabbitmqctl list_queues -p kairos

# Check node status
rabbitmqctl status
```

### Connection Strings

#### AMQP Connection (from other containers on Oberon)

```
amqp://kairos:PASSWORD@localhost:5672/kairos
amqp://spelunker:PASSWORD@localhost:5672/spelunker
```

#### AMQP Connection (from other hosts)

```
amqp://kairos:PASSWORD@oberon.incus:5672/kairos
amqp://spelunker:PASSWORD@oberon.incus:5672/spelunker
```

#### Management API

```
http://rabbitmq:PASSWORD@oberon.incus:25582/api/
```

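One practical caveat with these URLs: vault-generated passwords can contain characters that are reserved in URLs. A small helper that percent-encodes the credentials and the vhost segment before assembling the connection string (a sketch, not part of the deployment):

```python
from urllib.parse import quote

def amqp_url(user: str, password: str, host: str, vhost: str,
             port: int = 5672) -> str:
    # Percent-encode credentials and the vhost so passwords containing
    # ':', '@', or '/' do not break URL parsing in the client library.
    return (
        f"amqp://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(vhost, safe='')}"
    )

print(amqp_url("kairos", "p@ss/word", "oberon.incus", "kairos"))
# amqp://kairos:p%40ss%2Fword@oberon.incus:5672/kairos
```

Most AMQP clients accept URLs in this form; passing the raw password straight into an f-string is the common failure mode when the vault rotates in a password with special characters.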
## Monitoring and Observability

### Logging

- **Driver**: syslog (Docker logging driver)
- **Destination**: `tcp://127.0.0.1:51402` (Alloy on Oberon)
- **Tag**: `rabbitmq`
- **Format**: `{{ syslog_format }}` (from Alloy configuration)

Logs are collected by Alloy and forwarded to Loki on Prospero for centralized log aggregation.

### Key Metrics (via Management UI)

| Metric | Description |
|--------|-------------|
| Connections | Active AMQP client connections |
| Channels | Active channels within connections |
| Queues | Total queues across all vhosts |
| Messages | Ready, unacknowledged, and total message counts |
| Message Rate | Publish/deliver rates (msg/s) |
| Memory Usage | Container memory consumption |
| Disk Usage | Persistent storage utilization |

### Health Check

```bash
# Check if RabbitMQ is running
docker ps | grep rabbitmq

# Check container logs
docker logs rabbitmq

# Check RabbitMQ node status
docker exec rabbitmq rabbitmqctl status

# Check cluster health (single-node, should show 1 node)
docker exec rabbitmq rabbitmqctl cluster_status
```

## Operational Tasks

### Restart RabbitMQ

```bash
# Via Docker Compose
cd /srv/rabbitmq
sudo -u rabbitmq docker compose restart

# Via Docker directly
docker restart rabbitmq
```

### Recreate Container (preserves data)

```bash
cd /srv/rabbitmq
sudo -u rabbitmq docker compose down
sudo -u rabbitmq docker compose up -d
```

### Add New Virtual Host and User

1. Update `group_vars/all/vars.yml`:
   ```yaml
   rabbitmq_vhosts:
     - name: newservice

   rabbitmq_users:
     - name: newservice
       password: "{{ newservice_rabbitmq_password }}"
       tags: []

   rabbitmq_permissions:
     - vhost: newservice
       user: newservice
       configure_priv: .*
       read_priv: .*
       write_priv: .*

   # Add mapping
   newservice_rabbitmq_password: "{{ vault_newservice_rabbitmq_password }}"
   ```

2. Add the password to `group_vars/all/vault.yml`:
   ```bash
   ansible-vault edit inventory/group_vars/all/vault.yml
   # Add: vault_newservice_rabbitmq_password: "secure_password"
   ```

3. Run the playbook:
   ```bash
   ansible-playbook rabbitmq/deploy.yml
   ```

The provisioning tasks are idempotent: existing vhosts and users are skipped, and only new ones are created.

### Rotate User Password

```bash
# Inside container
docker exec rabbitmq rabbitmqctl change_password kairos "new_password"

# Update vault
ansible-vault edit inventory/group_vars/all/vault.yml
# Update vault_kairos_rabbitmq_password
```

### Clear All Messages in a Queue

```bash
docker exec rabbitmq rabbitmqctl purge_queue -p kairos queue_name
```

## Troubleshooting

### Container Won't Start

Check Docker logs for errors:
```bash
docker logs rabbitmq
```

Common issues:
- Port conflict on 5672 or 25582
- Permission issues on the `/srv/rabbitmq` directory
- Corrupted data volume

### Cannot Connect to Management UI

1. Verify port mapping: `docker port rabbitmq`
2. Check firewall rules on Oberon
3. Verify the container is running: `docker ps | grep rabbitmq`
4. Check that the management plugin is enabled (it ships enabled in the `-management-alpine` image)

### User Authentication Failing

```bash
# List users and verify they exist
docker exec rabbitmq rabbitmqctl list_users

# Check user permissions
docker exec rabbitmq rabbitmqctl list_user_permissions kairos

# Verify vhost exists
docker exec rabbitmq rabbitmqctl list_vhosts
```

### High Memory Usage

RabbitMQ can consume significant memory when queues are deep. Check:
```bash
# Memory usage
docker exec rabbitmq rabbitmqctl status | grep memory

# Queue depths
docker exec rabbitmq rabbitmqctl list_queues -p kairos messages

# Consider setting memory limits in docker-compose.yml
```

## Security Considerations

### Network Isolation

- The RabbitMQ AMQP port (5672) is **only** exposed on the Incus network (`10.10.0.0/16`)
- The management UI (25582) is exposed externally for administration
- For production: place HAProxy in front of the management UI with authentication
- Consider enabling SSL/TLS for AMQP connections in production

### Credential Management

- ✔ All passwords stored in Ansible Vault
- ✔ Service accounts have isolated virtual hosts
- ✔ Default admin account uses a strong password from the vault
- ⚠️ Credentials are passed as environment variables (visible in `docker inspect`)
- Consider using Docker secrets or Vault integration for enhanced security

### Virtual Host Isolation

Each service operates in its own virtual host:
- Messages cannot cross between vhosts
- Resource quotas can be applied per-vhost
- Credentials can be rotated without affecting other services

## Future Enhancements

- [ ] **SSL/TLS Support**: Enable encrypted AMQP connections
- [ ] **Cluster Mode**: Add additional RabbitMQ nodes for high availability
- [ ] **Federation**: Connect to external RabbitMQ clusters
- [ ] **Prometheus Exporter**: Add metrics export for Grafana monitoring
- [ ] **Shovel Plugin**: Configure message forwarding between brokers
- [ ] **HAProxy Integration**: Reverse proxy for the management UI with authentication
- [ ] **Docker Secrets**: Replace environment variables with Docker secrets

## References

- [RabbitMQ Official Documentation](https://www.rabbitmq.com/documentation.html)
- [RabbitMQ Management Plugin](https://www.rabbitmq.com/management.html)
- [AMQP 0-9-1 Protocol Reference](https://www.rabbitmq.com/amqp-0-9-1-reference.html)
- [Virtual Hosts](https://www.rabbitmq.com/vhosts.html)
- [Access Control (Authentication, Authorisation)](https://www.rabbitmq.com/access-control.html)
- [Monitoring RabbitMQ](https://www.rabbitmq.com/monitoring.html)

---

**Last Updated**: February 12, 2026
**Project**: Agathos Infrastructure
**Approval**: Red Panda Approved™

148
docs/red_panda_standards.md
Normal file
@@ -0,0 +1,148 @@

# Red Panda Approval™ Standards

Quality and observability standards for the Ouranos Lab. All infrastructure code, application code, and LLM-generated code deployed into this environment must meet these standards.

---

## 🐾 Red Panda Approval™

All implementations must meet the 5 Sacred Criteria:

1. **Fresh Environment Test** — Clean runs on new systems without drift. No leftover state, no manual steps.
2. **Elegant Simplicity** — Modular, reusable, no copy-paste sprawl. One playbook per concern.
3. **Observable & Auditable** — Clear task names, proper logging, check mode compatible. You can see what happened.
4. **Idempotent Patterns** — Run multiple times with consistent results. No side effects on re-runs.
5. **Actually Provisions & Configures** — Resources work, dependencies resolve, services integrate. It does the thing.

---

## Vault Security

All sensitive information is encrypted using Ansible Vault with AES256 encryption.

**Encrypted secrets:**
- Database passwords (PostgreSQL, Neo4j)
- API keys (OpenAI, Anthropic, Mistral, Groq)
- Application secrets (Grafana, SearXNG, Arke)
- Monitoring alerts (Pushover integration)

**Security rules:**
- AES256 encryption with `ansible-vault`
- Password file for automation — never pass `--vault-password-file` inline in scripts
- Vault variables use the `vault_` prefix; map to friendly names in `group_vars/all/vars.yml`
- No secrets in plain text files, ever

---

## Log Level Standards

All services in the Ouranos Lab MUST follow these log level conventions. These rules apply to application code, infrastructure services, and any LLM-generated code deployed into this environment. Log output flows through Alloy → Loki → Grafana, so disciplined leveling is not cosmetic — it directly determines alert quality, dashboard usefulness, and on-call signal-to-noise ratio.

### Level Definitions

| Level | When to Use | What MUST Be Included | Loki / Grafana Role |
|-------|-------------|----------------------|---------------------|
| **ERROR** | Something is broken and requires human intervention. The service cannot fulfil the current request or operation. | Exception class, message, stack trace, and relevant context (request ID, user, resource identifier). Never a bare `"something failed"`. | AlertManager rules fire on `level=~"error\|fatal\|critical"`. These trigger Pushover notifications. |
| **WARNING** | Degraded but self-recovering: retries succeeding, fallback paths taken, thresholds approaching, deprecated features invoked. | What degraded, what recovery action was taken, current metric value vs. threshold. | Grafana dashboard panels. Rate-based alerting (e.g., >N warnings/min). |
| **INFO** | Significant lifecycle and business events: service start/stop, configuration loaded, deployment markers, user authentication, job completion, schema migrations. | The event and its outcome. This level tells the *story* of what the system did. | Default production visibility. The go-to level for post-incident timelines. |
| **DEBUG** | Diagnostic detail for active troubleshooting: request/response payloads, SQL queries, internal state, variable values. | **Actionable context is mandatory.** A DEBUG line with no detail is worse than no line at all. Include variable values, object states, or decision paths. | Never enabled in production by default. Used on-demand via per-service level override. |

### Anti-Patterns

These are explicit violations of Ouranos logging standards:

| ❌ Anti-Pattern | Why It's Wrong | ✔ Correct Approach |
|----------------|---------------|-------------------|
| Health checks logged at INFO (`GET /health → 200 OK`) | Routine HAProxy/Prometheus probes flood syslog with thousands of identical lines per hour, burying real events. | Suppress health endpoints from access logs entirely, or demote to DEBUG. |
| DEBUG with no context (`logger.debug("error occurred")`) | Provides zero diagnostic value. If DEBUG is noisy *and* useless, nobody will ever enable it. | `logger.debug("PaymentService.process failed: order_id=%s, provider=%s, response=%r", oid, provider, resp)` |
| ERROR without exception details (`logger.error("task failed")`) | Cannot be triaged without reproduction steps. Wastes on-call time. | `logger.error("Celery task invoice_gen failed: order_id=%s", oid, exc_info=True)` |
| Logging sensitive data at any level | Passwords, tokens, API keys, and PII in Loki are a security incident. | Mask or redact: `api_key=sk-...a3f2`, `password=*****`. |
| Inconsistent level casing | Breaks LogQL filters and Grafana label selectors. | **Python / Django**: UPPERCASE (`INFO`, `WARNING`, `ERROR`, `DEBUG`). **Go / infrastructure** (HAProxy, Alloy, Gitea): lowercase (`info`, `warn`, `error`, `debug`). |
| Logging expected conditions as ERROR | A user entering a wrong password is not an error — it is normal business logic. | Use WARNING or INFO for expected-but-notable conditions. Reserve ERROR for things that are actually broken. |
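As a concrete illustration of the ERROR row and the redaction rule combined, a minimal Python sketch (the service, field names, and failure are hypothetical):

```python
import logging

logger = logging.getLogger("payments")

def redact(value: str, keep: int = 4) -> str:
    """Mask a secret, keeping only a short tail for correlation."""
    if len(value) <= keep:
        return "*" * len(value)
    return "..." + value[-keep:]

def charge(order_id: str, api_key: str) -> None:
    try:
        raise ConnectionError("provider timeout")  # stand-in for a real failure
    except ConnectionError:
        # ERROR carries the exception (stack trace via exc_info) plus
        # context, and the API key is redacted before it can reach Loki.
        logger.error(
            "charge failed: order_id=%s, api_key=%s",
            order_id, redact(api_key), exc_info=True,
        )
```

The log line is triageable on its own (what failed, for which order, with which provider credential tail) without ever exposing the secret.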
### Health Check Rule

> All services exposed through HAProxy MUST suppress or demote health check endpoints (`/health`, `/healthz`, `/api/health`, `/metrics`, `/ping`) to DEBUG or below. Health check success is the *absence* of errors, not the presence of 200s. If your syslog shows a successful health probe, your log level is wrong.

**Implementation guidance:**
- **Django / Gunicorn**: Filter health paths in the access log handler or use middleware that skips logging for probe user-agents.
- **Docker services**: Configure the application's internal logging to exclude health routes — the syslog driver forwards everything it receives.
- **HAProxy**: HAProxy's own health check logs (`option httpchk`) should remain at the HAProxy level for connection debugging, but backend application responses to those probes must not surface at INFO.

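For the Django/Gunicorn case, one way to implement the rule is a `logging.Filter` attached to the access-log handler. A sketch (the `gunicorn.access` logger name and the substring match are assumptions to adapt per service):

```python
import logging

HEALTH_PATHS = ("/health", "/healthz", "/api/health", "/metrics", "/ping")

class HealthCheckFilter(logging.Filter):
    """Drop access-log records for health probe endpoints."""
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        # Returning False suppresses the record entirely.
        return not any(path in msg for path in HEALTH_PATHS)

logging.getLogger("gunicorn.access").addFilter(HealthCheckFilter())
```

Real request lines still flow to syslog; probe lines never leave the process, which is cheaper than filtering them downstream in Alloy.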
### Background Worker & Queue Monitoring

> **The most dangerous failure is the one that produces no logs.**

When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runner, cron job) fails to start or crashes on startup, it generates no ongoing log output. Error-rate dashboards stay green because there is no process running to produce errors. Meanwhile, queues grow unbounded and work silently stops being processed.

**Required practices:**

1. **Heartbeat logging** — Every long-running background worker MUST emit a periodic INFO-level heartbeat (e.g., `"worker alive, processed N jobs in last 5m, queue depth: M"`). The *absence* of this heartbeat is the alertable condition.

2. **Startup and shutdown at INFO** — Worker start, ready, graceful shutdown, and crash-exit are significant lifecycle events. These MUST log at INFO.

3. **Queue depth as a metric** — RabbitMQ queue depths and any application-level task queues MUST be exposed as Prometheus metrics. A growing queue with zero consumer activity is an **ERROR**-level alert, not a warning.

4. **Grafana "last seen" alerts** — For every background worker, configure a Grafana alert using `absent_over_time()` or equivalent staleness detection: *"Worker X has not logged a heartbeat in >10 minutes"* → ERROR severity → Pushover notification.

5. **Crash-on-start is ERROR** — If a worker exits within seconds of starting (missing config, failed DB connection, import error), the exit MUST be captured at ERROR level by the service manager (`systemd OnFailure=`, Docker restart policy logs). Do not rely on the crashing application to log its own death — it may never get the chance.

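A minimal sketch of practice 1, formatting the heartbeat exactly as the example above so staleness alerts can pattern-match it (the logger name is arbitrary):

```python
import logging

logger = logging.getLogger("worker")

def heartbeat(processed: int, queue_depth: int, window: str = "5m") -> str:
    # Emit the periodic liveness line at INFO; Grafana alerts key off
    # the *absence* of this message, not its content.
    msg = (f"worker alive, processed {processed} jobs in last {window}, "
           f"queue depth: {queue_depth}")
    logger.info(msg)
    return msg
```

The worker's main loop would call this on a timer (for example every 300 s), even when no jobs arrive, so a silent loop is distinguishable from a dead process.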
### Production Defaults

| Service Category | Default Level | Rationale |
|-----------------|---------------|-----------|
| Django apps (Angelia, Athena, Kairos, Icarlos, Spelunker, Peitho, MCP Switchboard) | `WARNING` | Business logic — only degraded or broken conditions surface. Lifecycle events (start/stop/deploy) still log at INFO via Gunicorn and systemd. |
| Gunicorn access logs | Suppress 2xx/3xx health probes | Routine request logging deferred to HAProxy access logs in Loki. |
| Infrastructure agents (Alloy, Prometheus, Node Exporter) | `warn` | Stable — do not change without cause. |
| HAProxy (Titania) | `warning` | Connection-level logging handled by HAProxy's own log format → Alloy → Loki. |
| Databases (PostgreSQL, Neo4j) | `warning` | Query-level logging only enabled for active troubleshooting. |
| Docker services (Gitea, LobeChat, Nextcloud, AnythingLLM, SearXNG) | `warn` / `warning` | Per-service default. Tune individually if needed. |
| LLM Proxy (Arke) | `info` | Token usage tracking and provider routing decisions justify INFO. Review periodically for noise. |
| Observability stack (Grafana, Loki, AlertManager) | `warn` | Should be quiet unless something is wrong with observability itself. |

### Loki & Grafana Alignment

**Label normalization**: Alloy pipelines (syslog listeners and journal relabeling) MUST extract and forward a `level` label on every log line. Without a `level` label, the log entry is invisible to level-based dashboard filters and alert rules.

**LogQL conventions for dashboards:**
```logql
# Production error monitoring (default dashboard view)
{job="syslog", hostname="puck"} | json | level=~"error|fatal|critical"

# Warning-and-above for a specific service
{service_name="haproxy"} | logfmt | level=~"warn|error|fatal"

# Debug-level troubleshooting (temporary, never permanent dashboards)
{container="angelia"} | json | level="debug"
```

**Alerting rules** — Grafana alert rules MUST key off the normalized `level` label:
- `level=~"error|fatal|critical"` → Immediate Pushover notification via AlertManager
- `absent_over_time({service_name="celery_worker"}[10m])` → Worker heartbeat staleness → ERROR severity
- Rate-based: `rate({service_name="arke"} | json | level="error" [5m]) > 0.1` → Sustained error rate

**Retention alignment**: Loki retention policies should preserve ERROR and WARNING logs longer than DEBUG. DEBUG-level logs generated during troubleshooting sessions should have a short TTL or be explicitly cleaned up.

---

## Documentation Standards

Place documentation in the `/docs/` directory of the repository.

### HTML Documents

HTML documents must follow [docs/documentation_style_guide.html](documentation_style_guide.html).

- Use Bootstrap CDN with Bootswatch theme **Flatly**
- Include a dark mode toggle button in the navbar
- Use Bootstrap Icons for icons
- Use Bootstrap CSS for styles — avoid custom CSS
- Use **Mermaid** for diagrams

### Markdown Documents

Only these status symbols are approved:
- ✔ Success/Complete
- ❌ Error/Failed
- ⚠️ Warning/Caution
- ℹ️ Information/Note
253
docs/searxng-auth.md
Normal file
@@ -0,0 +1,253 @@

# SearXNG Authentication Design Document
# Red Panda Approved

## Overview

This document describes the design for adding Casdoor-based authentication to SearXNG, which does not natively support SSO/OIDC authentication.

## Architecture

```
┌──────────────┐     ┌───────────────┐     ┌─────────────────────────────────────┐
│   Browser    │────▶│    HAProxy    │────▶│ Oberon                              │
│              │     │   (titania)   │     │  ┌────────────────┐  ┌───────────┐  │
└──────────────┘     └───────┬───────┘     │  │  OAuth2-Proxy  │─▶│  SearXNG  │  │
                             │             │  │  (port 22073)  │  │  (22083)  │  │
                             │             │  └───────┬────────┘  └───────────┘  │
                             │             └──────────┼──────────────────────────┘
                             │                        │ OIDC
                             │      ┌─────────────────▼─────────────────┐
                             └─────▶│              Casdoor              │
                                    │     (OIDC Provider - titania)     │
                                    └───────────────────────────────────┘
```

The OAuth2-Proxy runs as a **native binary sidecar** on Oberon alongside SearXNG, following the same pattern used for JupyterLab on Puck. The upstream connection is `localhost` — eliminating the cross-host hop from the previous Docker-based deployment on Titania.

> ℹ️ Each host supports at most one OAuth2-Proxy sidecar instance. The binary is
> shared at `/usr/local/bin/oauth2-proxy`; each service gets a unique config directory
> and systemd unit name.

## Components

### 1. OAuth2-Proxy (Sidecar on Oberon)
- **Purpose**: Acts as the authentication gateway for SearXNG
- **Port**: 22073 (exposed to HAProxy)
- **Binary**: Native `oauth2-proxy` v7.6.0 (systemd service `oauth2-proxy-searxng`)
- **Config**: `/etc/oauth2-proxy-searxng/oauth2-proxy.cfg`
- **Upstream**: `http://127.0.0.1:22083` (localhost sidecar to SearXNG)
- **Logging**: systemd journal (`SyslogIdentifier=oauth2-proxy-searxng`)

### 2. Casdoor (Existing on Titania)
- **Purpose**: OIDC Identity Provider
- **Port**: 22081
- **URL**: https://id.ouranos.helu.ca/ (via HAProxy)
- **Required Setup**:
  - Create an Application for SearXNG
  - Configure the redirect URI
  - Generate client credentials

### 3. HAProxy Updates (Titania)
- Route `searxng.ouranos.helu.ca` to OAuth2-Proxy on Oberon (`oberon.incus:22073`)
- OAuth2-Proxy handles authentication before proxying to SearXNG on localhost

### 4. SearXNG (Existing on Oberon)
- **No changes required** - remains unaware of authentication
- Receives pre-authenticated requests from OAuth2-Proxy

## Authentication Flow

1. User navigates to `https://searxng.ouranos.helu.ca/`
2. HAProxy routes to OAuth2-Proxy on oberon:22073
3. OAuth2-Proxy checks for a valid session cookie (`_oauth2_proxy_searxng`)
4. **If no valid session**:
   - Redirect to Casdoor login: `https://id.ouranos.helu.ca/login/oauth/authorize`
   - User authenticates with Casdoor (username/password, social login, etc.)
   - Casdoor redirects back with an authorization code
   - OAuth2-Proxy exchanges the code for tokens
   - OAuth2-Proxy sets the session cookie
5. **If valid session**:
   - OAuth2-Proxy adds the `X-Forwarded-User` header
   - Request proxied to SearXNG at `127.0.0.1:22083` (localhost sidecar)

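Step 4's redirect is a standard OAuth2 authorization-code request, which OAuth2-Proxy constructs internally. A sketch of the URL it builds against the Casdoor endpoint above (the `scope` value and the client ID below are assumptions, not values from this deployment):

```python
from urllib.parse import urlencode

def authorize_url(issuer: str, client_id: str, redirect_uri: str,
                  state: str) -> str:
    # Standard OAuth2 authorization-code parameters; the scope set is
    # an assumption about what Casdoor is configured to accept.
    params = {
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "scope": "openid profile email",
        "state": state,
    }
    return f"{issuer}/login/oauth/authorize?{urlencode(params)}"

print(authorize_url(
    "https://id.ouranos.helu.ca",
    "searxng-client-id",  # placeholder for the real Casdoor client ID
    "https://searxng.ouranos.helu.ca/oauth2/callback",
    "opaque-state",
))
```

The `state` value is generated per-request by OAuth2-Proxy to bind the callback to the originating browser session.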
## Casdoor Configuration
|
||||
|
||||
### Application Setup (Manual via Casdoor UI)
|
||||
|
||||
1. Login to Casdoor at https://id.ouranos.helu.ca/
|
||||
2. Navigate to Applications → Add
|
||||
3. Configure:
|
||||
- **Name**: `searxng`
|
||||
- **Display Name**: `SearXNG Search`
|
||||
- **Organization**: `built-in` (or your organization)
|
||||
- **Redirect URLs**:
|
||||
- `https://searxng.ouranos.helu.ca/oauth2/callback`
|
||||
- **Grant Types**: `authorization_code`, `refresh_token`
|
||||
- **Response Types**: `code`
|
||||
4. Save and note the `Client ID` and `Client Secret`
|
||||
|
||||
### Cookie Secret Generation
|
||||
|
||||
Generate a 32-byte random secret for OAuth2-Proxy cookies:
|
||||
|
||||
```bash
|
||||
openssl rand -base64 32
|
||||
```
|
||||
|
||||
## Environment Variables

### Development (Sandbox)

```yaml
# In inventory/host_vars/oberon.incus.yml
searxng_oauth2_proxy_dir: /etc/oauth2-proxy-searxng
searxng_oauth2_proxy_version: "7.6.0"
searxng_proxy_port: 22073
searxng_domain: "ouranos.helu.ca"
searxng_oauth2_oidc_issuer_url: "https://id.ouranos.helu.ca"
searxng_oauth2_redirect_url: "https://searxng.ouranos.helu.ca/oauth2/callback"

# OAuth2 Credentials (from vault)
searxng_oauth2_client_id: "{{ vault_searxng_oauth2_client_id }}"
searxng_oauth2_client_secret: "{{ vault_searxng_oauth2_client_secret }}"
searxng_oauth2_cookie_secret: "{{ vault_searxng_oauth2_cookie_secret }}"
```

> ℹ️ Variables use the `searxng_` prefix, following the same naming pattern as the
> `jupyterlab_oauth2_*` variables on Puck. The upstream URL (`http://127.0.0.1:22083`)
> is derived from `searxng_port` in the config template — no cross-host URL is needed.

## Deployment Steps

### 1. Add Vault Secrets

```bash
ansible-vault edit inventory/group_vars/all/vault.yml
```

Add:

```yaml
vault_searxng_oauth2_client_id: "<from-casdoor>"
vault_searxng_oauth2_client_secret: "<from-casdoor>"
vault_searxng_oauth2_cookie_secret: "<generated-32-byte-secret>"
```

Note: The `searxng_` prefix allows service-specific credentials. The Oberon host_vars
maps these directly to the `searxng_oauth2_*` variables used by the sidecar config template.

### 2. Update Host Variables

OAuth2-Proxy variables are defined in `inventory/host_vars/oberon.incus.yml` alongside
the existing SearXNG configuration. No separate service entry is needed — the OAuth2-Proxy
sidecar is deployed as part of the `searxng` service.

```yaml
# SearXNG OAuth2-Proxy Sidecar (in oberon.incus.yml)
searxng_oauth2_proxy_dir: /etc/oauth2-proxy-searxng
searxng_oauth2_proxy_version: "7.6.0"
searxng_proxy_port: 22073
searxng_domain: "ouranos.helu.ca"
searxng_oauth2_oidc_issuer_url: "https://id.ouranos.helu.ca"
searxng_oauth2_redirect_url: "https://searxng.ouranos.helu.ca/oauth2/callback"
```

### 3. Update HAProxy Backend

Route SearXNG traffic through OAuth2-Proxy on Oberon:

```yaml
# In inventory/host_vars/titania.incus.yml
haproxy_backends:
  - subdomain: "searxng"
    backend_host: "oberon.incus"   # Same host as SearXNG
    backend_port: 22073            # OAuth2-Proxy port
    health_path: "/ping"           # OAuth2-Proxy health endpoint
```

### 4. Deploy

```bash
cd ansible

# Deploy SearXNG + OAuth2-Proxy sidecar
ansible-playbook searxng/deploy.yml

# Update HAProxy configuration
ansible-playbook haproxy/deploy.yml
```

## Monitoring

### Logs

OAuth2-Proxy logs to the systemd journal on Oberon. Alloy's default `systemd_logs`
source captures these logs automatically, filterable by `SyslogIdentifier=oauth2-proxy-searxng`.

```bash
# View logs on Oberon
ssh oberon.incus
journalctl -u oauth2-proxy-searxng -f
```

### Metrics

OAuth2-Proxy exposes Prometheus metrics at `/metrics` on port 22073:

- `oauth2_proxy_requests_total` - Total requests
- `oauth2_proxy_errors_total` - Error count
- `oauth2_proxy_upstream_latency_seconds` - Upstream latency

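To collect these, a Prometheus scrape job along the following lines would work; a sketch (the job name and layout are assumptions, not taken from the existing monitoring config):

```yaml
scrape_configs:
  - job_name: "oauth2-proxy-searxng"   # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets: ["oberon.incus:22073"]
```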
## Security Considerations

1. **Cookie Security**:
   - `cookie_secure = true` enforces HTTPS-only cookies
   - `cookie_httponly = true` prevents JavaScript access
   - `cookie_samesite = "lax"` provides CSRF protection

2. **Email Domain Restriction**:
   - Configure `oauth2_proxy_email_domains` to limit who can access
   - Example: `["yourdomain.com"]`, or `["*"]` to allow any authenticated user

3. **Group-Based Access**:
   - Optional: configure `oauth2_proxy_allowed_groups` against groups defined in Casdoor
   - Only users in the specified groups can access SearXNG

## Troubleshooting

### Check OAuth2-Proxy Status

```bash
ssh oberon.incus
systemctl status oauth2-proxy-searxng
journalctl -u oauth2-proxy-searxng --no-pager -n 50
```

### Test OIDC Discovery

```bash
curl https://id.ouranos.helu.ca/.well-known/openid-configuration
```

### Test Health Endpoint

```bash
curl http://oberon.incus:22073/ping
```

### Verify Cookie Domain

Ensure the cookie domain (`.ouranos.helu.ca`) matches your HAProxy domain.
Cookies won't work across different domains.

## Files

| File | Purpose |
|------|---------|
| `ansible/searxng/deploy.yml` | SearXNG + OAuth2-Proxy sidecar deployment |
| `ansible/searxng/oauth2-proxy-searxng.cfg.j2` | OAuth2-Proxy OIDC configuration |
| `ansible/searxng/oauth2-proxy-searxng.service.j2` | Systemd unit for OAuth2-Proxy |
| `ansible/inventory/host_vars/oberon.incus.yml` | Host variables (`searxng_oauth2_*`) |
| `docs/searxng-auth.md` | This design document |

### Generic OAuth2-Proxy Module (Retained)

The standalone `ansible/oauth2_proxy/` directory is retained as a generic, reusable
Docker-based OAuth2-Proxy module for future services:

| File | Purpose |
|------|---------|
| `ansible/oauth2_proxy/deploy.yml` | Generic Docker Compose deployment |
| `ansible/oauth2_proxy/docker-compose.yml.j2` | Docker Compose template |
| `ansible/oauth2_proxy/oauth2-proxy.cfg.j2` | Generic OIDC configuration template |
| `ansible/oauth2_proxy/stage.yml` | Validation / dry-run playbook |
191
docs/smtp4dev.md
Normal file
@@ -0,0 +1,191 @@
|
||||
# smtp4dev - Development SMTP Server
|
||||
|
||||
## Overview
|
||||
|
||||
smtp4dev is a fake SMTP server for development and testing. It accepts all incoming email without delivering it, capturing messages for inspection via a web UI and IMAP client. All services in the Agathos sandbox that send email (Casdoor, Gitea, etc.) are wired to smtp4dev so email flows can be tested without a real mail server.
|
||||
|
||||
**Host:** Oberon (container_orchestration)
|
||||
**Web UI Port:** 22085 → `https://smtp4dev.ouranos.helu.ca`
|
||||
**SMTP Port:** 22025 (used by all services as `smtp_host:smtp_port`)
|
||||
**IMAP Port:** 22045
|
||||
**Syslog Port:** 51405 (Alloy)
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ Oberon Host │
|
||||
│ │
|
||||
│ ┌──────────────────────────────────────────────────┐ │
|
||||
│ │ smtp4dev Container (Docker) │ │
|
||||
│ │ │ │
|
||||
│ │ Port 80 → host 22085 (Web UI) │ │
|
||||
│ │ Port 25 → host 22025 (SMTP) │ │
|
||||
│ │ Port 143 → host 22045 (IMAP) │ │
|
||||
│ │ │ │
|
||||
│ │ Volume: smtp4dev_data → /smtp4dev │ │
|
||||
│ │ Logs: syslog → Alloy:51405 → Loki │ │
|
||||
│ └──────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────┘
|
||||
▲ ▲
|
||||
│ SMTP :22025 │ SMTP :22025
|
||||
┌──────┴──────┐ ┌──────┴──────┐
|
||||
│ Casdoor │ │ Gitea │
|
||||
│ (Titania) │ │ (Rosalind) │
|
||||
└─────────────┘ └─────────────┘
|
||||
|
||||
External access:
|
||||
https://smtp4dev.ouranos.helu.ca → HAProxy (Titania) → oberon.incus:22085
|
||||
```
|
||||
|
||||
## Shared SMTP Variables
|
||||
|
||||
smtp4dev connection details are defined once in `ansible/inventory/group_vars/all/vars.yml` and consumed by all service templates:
|
||||
|
||||
| Variable | Value | Purpose |
|
||||
|----------|-------|---------|
|
||||
| `smtp_host` | `oberon.incus` | SMTP server hostname |
|
||||
| `smtp_port` | `22025` | SMTP server port |
|
||||
| `smtp_from` | `noreply@ouranos.helu.ca` | Default sender address |
|
||||
| `smtp_from_name` | `Agathos` | Default sender display name |
|
||||
|
||||
Any service that needs to send email references these shared variables rather than defining its own SMTP config. This means switching to a real SMTP server only requires changing `group_vars/all/vars.yml`.
|
||||
|
||||
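Because smtp4dev accepts plain, unauthenticated SMTP, any service's mail flow can be emulated with a few lines of standard-library Python; a sketch with the shared variable values hard-coded (the actual send is commented out so the snippet is safe to run anywhere):

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "oberon.incus"  # smtp_host
SMTP_PORT = 22025           # smtp_port

msg = EmailMessage()
msg["From"] = "Agathos <noreply@ouranos.helu.ca>"  # smtp_from_name / smtp_from
msg["To"] = "test@example.com"
msg["Subject"] = "smtp4dev connectivity test"
msg.set_content("If this appears in the smtp4dev web UI, mail capture works.")

# Uncomment to actually send; no TLS or auth is needed for smtp4dev.
# with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
#     server.send_message(msg)
print(msg["Subject"])
```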
## Ansible Deployment

### Playbook

```bash
# Deploy smtp4dev on Oberon
ansible-playbook smtp4dev/deploy.yml

# Redeploy HAProxy to activate the smtp4dev.ouranos.helu.ca backend
ansible-playbook haproxy/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `ansible/smtp4dev/deploy.yml` | Main deployment playbook |
| `ansible/smtp4dev/docker-compose.yml.j2` | Docker Compose template |

### Deployment Steps

The `deploy.yml` playbook:

1. Filters hosts — only runs on hosts with `smtp4dev` in their `services` list (Oberon)
2. Creates the `smtp4dev` system group and user
3. Adds the `ponos` user to the `smtp4dev` group (for `docker compose` access)
4. Creates the `/srv/smtp4dev` directory owned by `smtp4dev:smtp4dev`
5. Templates `docker-compose.yml` into `/srv/smtp4dev/`
6. Resets the SSH connection to apply group membership
7. Starts the service with `community.docker.docker_compose_v2` (`state: present`)

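The final step can be sketched as an Ansible task (the task name is assumed; the real wording lives in `deploy.yml`):

```yaml
- name: Start smtp4dev via Docker Compose
  community.docker.docker_compose_v2:
    project_src: /srv/smtp4dev
    state: present
```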
### Host Variables

Defined in `ansible/inventory/host_vars/oberon.incus.yml`:

```yaml
# smtp4dev Configuration
smtp4dev_user: smtp4dev
smtp4dev_group: smtp4dev
smtp4dev_directory: /srv/smtp4dev
smtp4dev_port: 22085        # Web UI (container port 80)
smtp4dev_smtp_port: 22025   # SMTP (container port 25)
smtp4dev_imap_port: 22045   # IMAP (container port 143)
smtp4dev_syslog_port: 51405 # Alloy syslog collector
```

## Service Integrations

### Casdoor

The Casdoor email provider is declared in `ansible/casdoor/init_data.json.j2` and seeded automatically on a **fresh** Casdoor deployment:

```json
{
  "owner": "admin",
  "name": "provider-email-smtp4dev",
  "displayName": "smtp4dev Email",
  "category": "Email",
  "type": "SMTP",
  "host": "oberon.incus",
  "port": 22025,
  "disableSsl": true,
  "fromAddress": "noreply@ouranos.helu.ca",
  "fromName": "Agathos"
}
```

> ⚠️ For **existing** Casdoor installs, create the provider manually:
> 1. Log in to `https://id.ouranos.helu.ca` as admin
> 2. Navigate to **Identity → Providers → Add**
> 3. Set **Category**: `Email`, **Type**: `SMTP`
> 4. Fill in host `oberon.incus`, port `22025`, disable SSL, from `noreply@ouranos.helu.ca`
> 5. Save and assign the provider to the `heluca` organization under **Organizations → heluca → Edit → Default email provider**

### Gitea

Configured directly in `ansible/gitea/app.ini.j2`:

```ini
[mailer]
ENABLED   = true
SMTP_ADDR = {{ smtp_host }}
SMTP_PORT = {{ smtp_port }}
FROM      = {{ smtp_from }}
```

Redeploy Gitea to apply:

```bash
ansible-playbook gitea/deploy.yml
```

## External Access

smtp4dev's web UI is exposed via HAProxy on Titania at `https://smtp4dev.ouranos.helu.ca`.

Backend entry in `ansible/inventory/host_vars/titania.incus.yml`:

```yaml
- subdomain: "smtp4dev"
  backend_host: "oberon.incus"
  backend_port: 22085
  health_path: "/"
```

## Verification

```bash
# Check the container is running
ssh oberon.incus "cd /srv/smtp4dev && docker compose ps"

# Check logs
ssh oberon.incus "cd /srv/smtp4dev && docker compose logs --tail=50"

# Test SMTP delivery (plain Python, avoids sendmail flag differences between MTAs)
ssh oberon.incus 'python3 -c "import smtplib; smtplib.SMTP(\"localhost\", 22025).sendmail(\"noreply@ouranos.helu.ca\", [\"test@example.com\"], \"Subject: test\\n\\ntest\\n\")"'

# Check the web UI is reachable internally
curl -s -o /dev/null -w "%{http_code}" http://oberon.incus:22085

# Check the external HTTPS route
curl -sk -o /dev/null -w "%{http_code}" https://smtp4dev.ouranos.helu.ca
```

## site.yml Order

smtp4dev is deployed after Docker (it requires the Docker engine) and before Casdoor (so the SMTP endpoint exists when Casdoor initialises):

```yaml
- name: Deploy Docker
  import_playbook: docker/deploy.yml

- name: Deploy smtp4dev
  import_playbook: smtp4dev/deploy.yml

- name: Deploy PPLG Stack  # ...continues
```
70
docs/sunwait.txt
Normal file
@@ -0,0 +1,70 @@
|
||||
Calculate sunrise and sunset times for the current or targetted day.
|
||||
The times can be adjusted either for twilight or fixed durations.
|
||||
|
||||
The program can either: wait for sunrise or sunset (function: wait),
|
||||
or return the time (GMT or local) the event occurs (function: list),
|
||||
or report the day length and twilight timings (function: report),
|
||||
or simply report if it is DAY or NIGHT (function: poll).
|
||||
|
||||
You should specify the latitude and longitude of your target location.
|
||||
|
||||
|
||||
Usage: sunwait [major options] [minor options] [twilight type] [rise|set] [offset] [latitude] [longitude]
|
||||
|
||||
Major options, either:
|
||||
poll Returns immediately indicating DAY or NIGHT. See 'program exit codes'. Default.
|
||||
wait Sleep until specified event occurs. Else exit immediate.
|
||||
list [X] Report twilight times for next 'X' days (inclusive). Default: 1.
|
||||
report [date] Generate a report about the days sunrise and sunset timings. Default: the current day
|
||||
|
||||
Minor options, any of:
|
||||
[no]debug Print extra info and returns in one minute. Default: nodebug.
|
||||
[no]version Print the version number. Default: noversion.
|
||||
[no]help Print this help. Default: nohelp.
|
||||
[no]gmt Print times in GMT or local-time. Default: nogmt.
|
||||
|
||||
Twilight types, either:
|
||||
daylight Top of sun just below the horizon. Default.
|
||||
civil Civil Twilight. -6 degrees below horizon.
|
||||
nautical Nautical twilight. -12 degrees below horizon.
|
||||
astronomical Astronomical twilight. -18 degrees below horizon.
|
||||
angle [X.XX] User-specified twilight-angle (degrees). Default: 0.
|
||||
|
||||
Sunrise/sunset. Only useful with major-options: 'wait' and 'list'. Any of: (default: both)
|
||||
rise Wait for the sun to rise past specified twilight & offset.
|
||||
set Wait for the sun to set past specified twilight & offset.
|
||||
|
||||
Offset:
|
||||
offset [MM|HH:MM] Time interval (+ve towards noon) to adjust twilight calculation.
|
||||
|
||||
Target date. Only useful with major-options: 'report' or 'list'. Default: today
|
||||
d [DD] Set the target Day-of-Month to calculate for. 1 to 31.
|
||||
m [MM] Set the target Month to calculate for. 1 to 12.
|
||||
y [YYYY] Set the target Year to calculate for. 2000 to 2099.
|
||||
|
||||
latitude/longitude coordinates: floating-point degrees, with [NESW] appended. Default: Bingham, England.
|
||||
|
||||
Exit (return) codes:
|
||||
0 OK: exit from 'wait' or 'list' only.
|
||||
1 Error.
|
||||
2 Exit from 'poll': it is DAY or twilight.
|
||||
3 Exit from 'poll': it is NIGHT (after twilight).
|
||||
|
||||
Example 1: sunwait wait rise offset -1:15:10 51.477932N 0.000000E
|
||||
Wait until 1 hour 15 minutes 10 secs before the sun rises in Greenwich, London.
|
||||
|
||||
Example 2: sunwait list 7 civil 55.752163N 37.617524E
|
||||
List civil sunrise and sunset times for today and next 6 days. Moscow.
|
||||
|
||||
Example 3: sunwait poll exit angle 10 54.897786N -1.517536E
|
||||
Indicate by program exit-code if is Day or Night using a custom twilight angle of 10 degrees above horizon. Washington, UK.
|
||||
|
||||
Example 4: sunwait list 7 gmt sunrise angle 3
|
||||
List next 7 days sunrise times, custom +3 degree twilight angle, default location.
|
||||
Uses GMT; as any change in daylight saving over the specified period is not considered.
|
||||
|
||||
Example 5: sunwait report y 20 m 3 d 15 10.49S 105.55E
|
||||
Produce a report of the different sunrises and sunsets on an arbitrary day (2022/03/15) for an arbitrary location (Christmas Island)
|
||||
|
||||
Note that program uses C library functions to determine time and localtime.
|
||||
Error for timings are estimated at: +/- 4 minutes.
|
||||
296
docs/terraform.md
Normal file
@@ -0,0 +1,296 @@
|
||||
# Terraform Practices & Patterns
|
||||
|
||||
This document describes the Terraform design philosophy, patterns, and practices used across our infrastructure. The audience includes LLMs assisting with development, new team members, and existing team members seeking a reference.
|
||||
|
||||
## Design Philosophy
|
||||
|
||||
### Incus-First Infrastructure
|
||||
|
||||
Incus containers form the foundational layer of all environments. Management and monitoring infrastructure (Prospero, Titania) must exist before application hosts. This is a **critical dependency** that must be explicitly codified.
|
||||
|
||||
**Why?** Terraform isn't magic. Implicit ordering can lead to race conditions or failed deployments. Always use explicit `depends_on` for critical infrastructure chains.
|
||||
|
||||
```hcl
|
||||
# Example: Application host depends on monitoring infrastructure
|
||||
resource "incus_instance" "app_host" {
|
||||
# ...
|
||||
depends_on = [incus_instance.uranian_hosts["prospero"]]
|
||||
}
|
||||
```
|
||||
|
||||
### Explicit Dependencies
|
||||
|
||||
Never rely solely on implicit resource ordering for critical infrastructure. Codify dependencies explicitly to:
|
||||
|
||||
- ✔ Prevent race conditions during parallel applies
|
||||
- ✔ Document architectural relationships in code
|
||||
- ✔ Ensure consistent deployment ordering across environments
|
||||
|
||||
## Repository Strategy
|
||||
|
||||
### Agathos (Sandbox)
|
||||
|
||||
Agathos is the **Sandbox repository** — isolated, safe for external demos, and uses local state.
|
||||
|
||||
| Aspect | Decision |
|
||||
|--------|----------|
|
||||
| Purpose | Evaluation, demos, pattern experimentation, new software testing |
|
||||
| State | Local (no remote backend) |
|
||||
| Secrets | No production credentials or references |
|
||||
| Security | Safe to use on external infrastructure for demos |
|
||||
|
||||
### Production Repository (Separate)
|
||||
|
||||
A separate repository manages Dev, UAT, and Prod environments:
|
||||
|
||||
```
|
||||
terraform/
|
||||
├── modules/incus_host/ # Reusable container module
|
||||
├── environments/
|
||||
│ ├── dev/ # Local Incus only
|
||||
│ └── prod/ # OCI + Incus (parameterized via tfvars)
|
||||
```
|
||||
|
||||
| Aspect | Decision |
|
||||
|--------|----------|
|
||||
| State | PostgreSQL backend on `eris.helu.ca:6432` with SSL |
|
||||
| Schemas | Separate per environment: `dev`, `uat`, `prod` |
|
||||
| UAT/Prod | Parameterized twins via `-var-file` |
|
||||
|
||||
## Module Design
|
||||
|
||||
### When to Extract a Module
|
||||
|
||||
A pattern is a good module candidate when it meets these criteria:
|
||||
|
||||
| Criterion | Description |
|
||||
|-----------|-------------|
|
||||
| **Reuse** | Pattern used across multiple environments (Sandbox, Dev, UAT, Prod) |
|
||||
| **Stable Interface** | Inputs/outputs won't change frequently |
|
||||
| **Testable** | Can validate module independently before promotion |
|
||||
| **Encapsulates Complexity** | Hides `dynamic` blocks, `for_each`, cloud-init generation |
|
||||
|
||||
### When NOT to Extract
|
||||
|
||||
- Single-use patterns
|
||||
- Tightly coupled to specific environment
|
||||
- Adds indirection without measurable benefit
|
||||
|
||||
### The `incus_host` Module
|
||||
|
||||
The standard container provisioning pattern extracted from Agathos:
|
||||
|
||||
**Inputs:**
|
||||
- `hosts` — Map of host definitions (name, role, image, devices, config)
|
||||
- `project` — Incus project name
|
||||
- `profile` — Incus profile name
|
||||
- `cloud_init_template` — Cloud-init configuration template
|
||||
- `ssh_key_path` — Path to SSH authorized keys
|
||||
- `depends_on_resources` — Explicit dependencies for infrastructure ordering
|
||||
|
||||
**Outputs:**
|
||||
- `host_details` — Name, IPv4, role, description for each host
|
||||
- `inventory` — Documentation reference for DHCP/DNS provisioning
|
||||
|
||||
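A call site for this module might look as follows; this is a hypothetical sketch (the module path, host names, and field values are illustrative, not copied from the repository):

```hcl
module "uranian_hosts" {
  source = "../../modules/incus_host"

  project             = "agathos"
  profile             = "default"
  cloud_init_template = "${path.module}/templates/cloud-init.yml.tpl"
  ssh_key_path        = "~/.ssh/id_ed25519.pub"

  hosts = {
    oberon = {
      role        = "container_orchestration"
      image       = "ubuntu/24.04/cloud"
      description = "Docker services host"
      devices     = {}
      config      = { "security.nesting" = true }
    }
  }
}
```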
## Environment Strategy

### Environment Purposes

| Environment | Purpose | Infrastructure |
|-------------|---------|----------------|
| **Sandbox** | Evaluation, demos, pattern experimentation | Local Incus only |
| **Dev** | Integration testing, container builds, security testing | Local Incus only |
| **UAT** | User acceptance testing, bug resolution | OCI + Incus (hybrid) |
| **Prod** | Production workloads | OCI + Incus (hybrid) |

### Parameterized Twins (UAT/Prod)

UAT and Prod are architecturally identical. Use a single environment directory with variable files:

```bash
# UAT deployment
terraform apply -var-file=uat.tfvars

# Prod deployment
terraform apply -var-file=prod.tfvars
```

Key differences in tfvars:
- Hostnames and DNS domains
- Resource sizing (CPU, memory limits)
- OCI compartment IDs
- Credential references

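Concretely, the twin var-files differ only in values; a hypothetical pair of entries (all names and IDs are illustrative):

```hcl
# uat.tfvars (illustrative)
dns_domain         = "uat.helu.ca"
instance_memory    = "4GiB"
oci_compartment_id = "ocid1.compartment.oc1..uat-example"

# prod.tfvars (illustrative)
dns_domain         = "helu.ca"
instance_memory    = "16GiB"
oci_compartment_id = "ocid1.compartment.oc1..prod-example"
```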
## State Management

### Sandbox (Agathos)

Local state is acceptable because:
- The environment is ephemeral
- The workflow is single-user
- There are no production secrets to protect
- It is safe for external demos

### Production Environments

PostgreSQL backend on `eris.helu.ca`:

```hcl
terraform {
  backend "pg" {
    conn_str    = "postgres://eris.helu.ca:6432/terraform_state?sslmode=verify-full"
    schema_name = "dev"  # or "uat", "prod"
  }
}
```

**Connection requirements:**
- Port 6432 (pgBouncer)
- SSL with `sslmode=verify-full`
- Credentials via environment variables (`PGUSER`, `PGPASSWORD`)
- A separate schema per environment for isolation

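Since credentials are kept out of `conn_str`, a deploy shell session supplies them via the standard PostgreSQL environment variables; a sketch (the values are placeholders):

```shell
# Backend credentials come from the environment, never from conn_str.
export PGUSER="terraform"
export PGPASSWORD="change-me"   # placeholder; use a secret store in practice

# terraform init would now authenticate to the pg backend with these.
echo "pg backend credentials set for user: $PGUSER"
```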
## Integration Points

### Terraform → DHCP/DNS

The `agathos_inventory` output provides host information for DHCP/DNS provisioning:

1. Terraform creates containers with cloud-init
2. The `agathos_inventory` output includes hostnames and IPs
3. MAC addresses are registered in the DHCP server
4. The DHCP server creates DNS entries (`hostname.incus` domain)
5. Ansible uses DNS names for host connectivity

### Terraform → Ansible

Ansible does **not** consume Terraform outputs directly. Instead:

1. Terraform provisions containers
2. Incus DNS resolution provides the `hostname.incus` domain
3. The Ansible inventory uses static DNS names
4. `sandbox_up.yml` configures DNS resolution on the hypervisor

```yaml
# Ansible inventory uses DNS names, not Terraform outputs
ubuntu:
  hosts:
    oberon.incus:
    ariel.incus:
    prospero.incus:
```

### Terraform → Bash Scripts

The `ssh_key_update.sh` script demonstrates proper integration:

```bash
terraform output -json agathos_inventory | jq -r \
  '.uranian_hosts.hosts | to_entries[] | "\(.key) \(.value.ipv4)"' | \
while read -r hostname ip; do
  ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts
  ssh-keyscan -H "$hostname.incus" >> ~/.ssh/known_hosts
done
```

## Promotion Workflow

All infrastructure changes flow through this pipeline:

```
Agathos (Sandbox)
  ↓ Validate pattern works
  ↓ Extract to module if reusable
Dev
  ↓ Integration testing
  ↓ Container builds
  ↓ Security testing
UAT
  ↓ User acceptance testing
  ↓ Bug fixes return to Dev
  ↓ Delete environment, test restore
Prod
  ↓ Deploy from tested artifacts
```

**Critical:** Nothing starts in Prod. Every change originates in Agathos, is validated through the pipeline, and only then deployed to production.

### Promotion Includes

When promoting Terraform changes, always update the corresponding:
- Ansible playbooks and templates
- Service documentation in `/docs/services/`
- Host variables if new services are added

## Output Conventions

### `agathos_inventory`

The primary output for documentation and DNS integration:

```hcl
output "agathos_inventory" {
  description = "Host inventory for documentation and DHCP/DNS provisioning"
  value = {
    uranian_hosts = {
      hosts = {
        for name, instance in incus_instance.uranian_hosts : name => {
          name             = instance.name
          ipv4             = instance.ipv4_address
          role             = local.uranian_hosts[name].role
          description      = local.uranian_hosts[name].description
          security_nesting = lookup(local.uranian_hosts[name].config, "security.nesting", false)
        }
      }
    }
  }
}
```

**Purpose:**
- Update the [sandbox.html](sandbox.html) documentation
- Reference for DHCP server MAC/IP registration
- DNS entry creation via DHCP

## Layered Configuration

### Single Config with Conditional Resources

Avoid multiple separate Terraform configurations. Use one config with conditional resources:

```
environments/prod/
├── main.tf             # Incus project, profile, images (always)
├── incus_hosts.tf      # Module call for Incus containers (always)
├── oci_resources.tf    # OCI compute (conditional)
├── variables.tf
├── dev.tfvars          # Dev:  enable_oci = false
├── uat.tfvars          # UAT:  enable_oci = true
└── prod.tfvars         # Prod: enable_oci = true
```

```hcl
variable "enable_oci" {
  description = "Enable OCI resources (false for Dev, true for UAT/Prod)"
  type        = bool
  default     = false
}

resource "oci_core_instance" "hosts" {
  for_each = var.enable_oci ? var.oci_hosts : {}
  # ...
}
```

## Best Practices Summary

| Practice | Rationale |
|----------|-----------|
| ✔ Explicit `depends_on` for critical chains | Terraform isn't magic |
| ✔ Local map for host definitions | Single source of truth, easy iteration |
| ✔ `for_each` over `count` | Stable resource addresses |
| ✔ `dynamic` blocks for optional devices | Clean, declarative device configuration |
| ✔ Merge base config with overrides | DRY principle for common settings |
| ✔ Separate tfvars for environment twins | Minimal duplication, clear parameterization |
| ✔ Document module interfaces | Enable promotion across environments |
| ✔ Never start in Prod | Always validate through the pipeline |
38
docs/xrdp.md
Normal file
@@ -0,0 +1,38 @@
|
||||
Purpose
|
||||
This script automates the installation and configuration of xRDP (X Remote Desktop Protocol) on Ubuntu-based systems, providing a complete remote desktop solution with enhanced user experience.
|
||||
|
||||
Key Features
|
||||
Multi-Distribution Support:
|
||||
Ubuntu 22.04, 24.04, 24.10, 25.04
|
||||
Linux Mint, Pop!OS, Zorin OS, Elementary OS
|
||||
Debian support (best effort)
|
||||
LMDE (Linux Mint Debian Edition)
|
||||
|
||||
Installation Modes:
|
||||
Standard installation (from repositories)
|
||||
Custom installation (compile from source)
|
||||
Removal/cleanup option
|
||||
|
||||
Advanced Capabilities:
|
||||
Sound redirection - Compiles audio modules for remote audio playback
|
||||
H.264 encoding/decoding support (latest version)
|
||||
Desktop environment detection - Handles GNOME, KDE, Budgie, etc.
|
||||
Sound server detection - Works with both PulseAudio and PipeWire
|
||||
Custom login screen - Branded xRDP login with custom colors/backgrounds
|
||||
|
||||
Smart Features:
|
||||
SSH session detection - Warns when installing over SSH
|
||||
Version compatibility checks - Prevents incompatible installations
|
||||
Conflict resolution - Disables conflicting GNOME remote desktop services
|
||||
Permission fixes - Handles SSL certificates and user groups
|
||||
Polkit rules - Enables proper shutdown/reboot from remote sessions
|
||||
|
||||
What Makes It Special
|
||||
Extensive OS/version support with graceful handling of EOL versions
|
||||
Intelligent detection of desktop environments and sound systems
|
||||
Post-installation optimization for better remote desktop experience
|
||||
Comprehensive error handling and user feedback
|
||||
Modular design with separate functions for different tasks
|
||||
Active maintenance - regularly updated with new Ubuntu releases
|
||||
|
||||
The script essentially transforms a basic Ubuntu system into a fully-functional remote desktop server with professional-grade features, handling all the complex configuration that would normally require manual intervention.
|
||||
Block a user