mnemosyne/docs/deploy.md
Robert Helewka 93639188d3
feat: rework auth model with UserToken and Daedalus/Pallas integration
- Rename MCPToken to UserToken across models, views, and tests
- Update URL names from mcp-token-* to token-*
- Add Daedalus/Pallas integration design doc (v2)
- Switch docker-compose to build local mnemosyne:local image via shared
  build config instead of pulling from git.helu.ca
2026-05-23 19:50:29 -04:00


# Mnemosyne — Ansible Deployment Reference
This document gives the Ansible author everything needed to write and maintain the
Mnemosyne deployment role. All implementation decisions are already locked in
`docker-compose.yaml` and `nginx/mnemosyne.conf`; this document explains the
*why* behind each decision and provides the authoritative list of variables,
one-time steps, and verification checks.

---
## 1. Host & Stack Overview
| Item | Value |
|------|-------|
| Deploy target | `puck.incus` (Incus container, 10.10.0.0/24) |
| Compose project directory | `/srv/mnemosyne` |
| Image registry | `git.helu.ca/r/mnemosyne:latest` |
| Public host port | **23181** (nginx → HAProxy on Titania → `https://mnemosyne.ouranos.helu.ca`) |
| Internal app port | `app:8000` (Django/gunicorn) |
| Internal MCP port | `mcp:8001` (FastMCP/uvicorn) |

The four compose services (`app`, `mcp`, `worker`, `web`) all run from the same
image. A one-shot `static-init` service seeds the nginx static-file volume on
every `up`, so static-file changes propagate on each deploy without manual
intervention.

---
## 2. External Dependencies (NOT managed by this role)
These services must exist before Mnemosyne can start. The role only consumes
credentials; it does not provision these hosts.
| Service | Host | Notes |
|---------|------|-------|
| PostgreSQL | `portia.incus:5432` | Database `mnemosyne`, user `mnemosyne` |
| Neo4j | `umbriel.incus:7687` | Bolt protocol. **Must be dedicated to Mnemosyne** — do not share with Spelunker or any other graph workload (see README §Note on Neo4j). HTTP browser on `umbriel.incus:25555`. |
| RabbitMQ | `oberon.incus:5672` | vhost `mnemosyne`, user `mnemosyne` |
| MinIO (Mnemosyne bucket) | `nyx.helu.ca:8555` | Bucket `mnemosyne-content`. Credentials scoped read+write. |
| MinIO (Daedalus bucket) | `nyx.helu.ca:8555` | Bucket `daedalus`. **Read-only** cross-bucket credentials for the ingest worker. |
| Memcached | `oberon.incus:11211` | Shared; prefix `mnemosyne` avoids collisions. |
| Embedder (Qwen3-VL-Embedding) | Configured via `EMBEDDING_*` vars in settings | GPU host on Nyx; not managed here. |
| Reranker (Synesis) | Configured via `RERANKER_*` vars in settings | GPU host on Nyx; not managed here. |
---
## 3. Role Tasks
### 3.1 Directory & file layout
```
/srv/mnemosyne/
├── docker-compose.yaml   ← copied from repo (or symlinked via git pull)
├── nginx/
│   └── mnemosyne.conf    ← copied from repo nginx/mnemosyne.conf
└── .env                  ← rendered from Jinja2 template + vault secrets
```
The role should:
1. Create `/srv/mnemosyne/` and `nginx/` (owner: `root`, mode `0750`).
2. Render `.env` from the vault-sourced Jinja2 template (mode `0600`, owner `root`).
3. Copy (or `git pull`) `docker-compose.yaml` and `nginx/mnemosyne.conf` from the repo.
### 3.2 Pull & start
```yaml
- name: Pull latest image
  community.docker.docker_compose_v2:
    project_src: /srv/mnemosyne
    pull: always

- name: Bring stack up
  community.docker.docker_compose_v2:
    project_src: /srv/mnemosyne
    state: present
```
This triggers `static-init` automatically on every `up` — no separate handler needed.
### 3.3 One-time setup (run once on first deploy, idempotent thereafter)
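Most of these management commands are safe to re-run; they do nothing if the
target state already exists. `seed_signing_key --retire-other` appears to be the
exception: the rotation procedure at the end of this section re-runs it to
replace the active key, so it should only run on a fresh deployment. Run the
commands as a post-start task gated on a `creates:` sentinel or an explicit
`when: mnemosyne_first_deploy` flag.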
```bash
# Apply Django ORM migrations (PostgreSQL schema)
docker compose -f /srv/mnemosyne/docker-compose.yaml run --rm app migrate

# Create Neo4j vector + full-text indexes and load library-type defaults
docker compose -f /srv/mnemosyne/docker-compose.yaml \
  run --rm app setup

# Seed the MCPSigningKey used to sign long-lived Pallas team JWTs.
# --retire-other deactivates any previously-active key. The hex
# emitted to stdout is persisted in Mnemosyne's database and is
# not re-injected from the vault; no operator action is required
# beyond running this command once per fresh deployment.
docker compose -f /srv/mnemosyne/docker-compose.yaml \
  run --rm app \
  python manage.py seed_signing_key --kid daedalus-1 --retire-other

# Create Django groups for SSO role mapping (View Only / Staff / SME / Admin).
# Safe to re-run (idempotent).
docker compose -f /srv/mnemosyne/docker-compose.yaml \
  run --rm app \
  python manage.py create_sso_groups
```
The `seed_signing_key` command prints the generated secret to stdout once; it
is safe to discard that output after the command succeeds. Mnemosyne persists
the active key inside `MCPSigningKey` and reads it directly when minting each
team JWT; Daedalus never sees this value. To rotate, re-run the command with
`--retire-other` and then rotate every Pallas team JWT via the Daedalus admin
UI so consumers pick up bearers signed with the new key.

---
## 4. Environment Variables (`.env` template)
All variables are consumed by `docker-compose.yaml` for interpolation into the
relevant service `environment:` blocks. The per-service scoping is defined in
`docker-compose.yaml`; the `.env` file just provides values.
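
Because every value reaches the containers through `.env`, a missing or unrendered entry tends to surface only at service startup. The sketch below is not part of the repository; it shows one way a post-render check could flag variables that are absent, empty, or still contain Jinja2 delimiters. The `REQUIRED` set is an illustrative subset of the variables in this section, and the parser is deliberately simplified (no quoting or `export` handling), so both are assumptions.

```python
# Hypothetical post-render check for /srv/mnemosyne/.env (not shipped with Mnemosyne).

# Illustrative subset of the variables documented in this section.
REQUIRED = {
    "SECRET_KEY",
    "APP_DB_PASSWORD",
    "NEOMODEL_NEO4J_BOLT_URL",
    "CELERY_BROKER_URL",
    "LLM_API_SECRETS_ENCRYPTION_KEY",
}


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(text, required=REQUIRED):
    """Return the required names that are absent or empty, sorted."""
    env = parse_env(text)
    return sorted(name for name in required if not env.get(name))


def unrendered_vars(text):
    """Return names whose value still contains a Jinja2 '{{' delimiter."""
    return sorted(k for k, v in parse_env(text).items() if "{{" in v)
```

A role could run such a check against the rendered file and fail the play when either list is non-empty.
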
### Django core — `app`, `mcp`, `worker`
| Variable | Example / default | Notes |
|----------|-------------------|-------|
| `SECRET_KEY` | `{{ vault_mnemosyne_secret_key }}` | Fernet-safe; never rotate without re-encrypting stored API keys first |
| `DEBUG` | `False` | |
| `TIME_ZONE` | `UTC` | |
| `LANGUAGE_CODE` | `en-us` | |
### HTTP surface — `app` (CSRF), `app` + `mcp` (ALLOWED_HOSTS)
| Variable | Example |
|----------|---------|
| `ALLOWED_HOSTS` | `localhost,127.0.0.1,mnemosyne.ouranos.helu.ca` |
| `CSRF_TRUSTED_ORIGINS` | `https://mnemosyne.ouranos.helu.ca` |
### PostgreSQL — `app`, `mcp`, `worker`
| Variable | Example |
|----------|---------|
| `APP_DB_NAME` | `mnemosyne` |
| `APP_DB_USER` | `mnemosyne` |
| `APP_DB_PASSWORD` | `{{ vault_mnemosyne_db_password }}` |
| `DB_HOST` | `portia.incus` |
| `DB_PORT` | `5432` |
### Neo4j — `app`, `mcp`, `worker`
| Variable | Example |
|----------|---------|
| `NEOMODEL_NEO4J_BOLT_URL` | `bolt://neo4j:{{ vault_neo4j_password }}@umbriel.incus:7687` |
> **URL-encode the password** if it contains `@ : / # % + ? & =` or a space.
> The Bolt URL parser is strict.
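
As an illustration of the encoding rule above, the following sketch builds a Bolt URL with the credentials percent-encoded using only the Python standard library. The helper name `bolt_url` and the sample password are hypothetical; the host and port come from the table above.

```python
from urllib.parse import quote, unquote, urlsplit


def bolt_url(user, password, host, port=7687):
    """Build a Bolt URL with user and password percent-encoded."""
    # safe="" makes quote() escape every reserved character, including "/".
    return f"bolt://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Sample password containing several of the characters listed above.
url = bolt_url("neo4j", "p@ss:w/rd #1", "umbriel.incus")

# Round trip: the parsed password decodes back to the original value.
parts = urlsplit(url)
decoded = unquote(parts.password)
```

The encoded value is what belongs in `NEOMODEL_NEO4J_BOLT_URL`; the vault can keep the raw password if the template applies the encoding.
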
### Memcached — `app`, `mcp`, `worker`
| Variable | Example |
|----------|---------|
| `KVDB_LOCATION` | `oberon.incus:11211` |
| `KVDB_PREFIX` | `mnemosyne` |
### S3 / MinIO (Mnemosyne bucket) — `app`, `mcp`, `worker`
| Variable | Example |
|----------|---------|
| `USE_LOCAL_STORAGE` | `False` |
| `AWS_ACCESS_KEY_ID` | `{{ vault_mnemosyne_s3_key }}` |
| `AWS_SECRET_ACCESS_KEY` | `{{ vault_mnemosyne_s3_secret }}` |
| `AWS_STORAGE_BUCKET_NAME` | `mnemosyne-content` |
| `AWS_S3_ENDPOINT_URL` | `https://nyx.helu.ca:8555` |
| `AWS_S3_USE_SSL` | `True` |
| `AWS_S3_VERIFY` | `False` (self-signed cert on Nyx) |
| `AWS_S3_REGION_NAME` | `us-east-1` |
### Daedalus S3 (cross-bucket reads) — `worker` only
| Variable | Example |
|----------|---------|
| `DAEDALUS_S3_ENDPOINT_URL` | `https://nyx.helu.ca:8555` |
| `DAEDALUS_S3_ACCESS_KEY_ID` | `{{ vault_daedalus_s3_read_key }}` |
| `DAEDALUS_S3_SECRET_ACCESS_KEY` | `{{ vault_daedalus_s3_read_secret }}` |
| `DAEDALUS_S3_BUCKET_NAME` | `daedalus` |
| `DAEDALUS_S3_REGION_NAME` | `us-east-1` |
| `DAEDALUS_S3_USE_SSL` | `True` |
| `DAEDALUS_S3_VERIFY` | `True` |
### Celery / RabbitMQ — `app` (producer), `worker` (consumer)
| Variable | Example |
|----------|---------|
| `CELERY_BROKER_URL` | `amqp://mnemosyne:{{ vault_rabbitmq_password \| urlencode }}@oberon.incus:5672/mnemosyne` |
| `CELERY_RESULT_BACKEND` | `rpc://` |
| `CELERY_TASK_ALWAYS_EAGER` | `False` |
> **Percent-encode** the RabbitMQ password in the broker URL if it contains any
> URL-special characters. Use Ansible's `urlencode` filter or pre-encode in the
> vault variable. An unencoded password is the most common cause of
> `PLAIN 403 ACCESS_REFUSED` at worker startup.
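
One detail worth checking when relying on the `urlencode` filter: Jinja2 documents that, for a string input, the filter does not quote `/`. Whether the filter in the Ansible version in use behaves the same way should be confirmed. The sketch below uses hypothetical helper names and a made-up password; it shows how a raw `/` in the password shifts the host portion of the URL, and how encoding with `safe=""` avoids that.

```python
from urllib.parse import quote, urlsplit


def broker_url(user, password, host, port, vhost):
    """Build an AMQP URL with every credential component fully encoded."""
    enc = lambda s: quote(s, safe="")  # also escapes "/"
    return f"amqp://{enc(user)}:{enc(password)}@{host}:{port}/{enc(vhost)}"


def parses_as_expected(url, host, port):
    """True if the URL's authority resolves to the expected host and port."""
    try:
        parts = urlsplit(url)
        return parts.hostname == host and parts.port == port
    except ValueError:  # raised when the port field is not numeric
        return False


# A raw "/" ends the authority early, so the host is no longer oberon.incus.
raw = "amqp://mnemosyne:p/ss@oberon.incus:5672/mnemosyne"
encoded = broker_url("mnemosyne", "p/ss", "oberon.incus", 5672, "mnemosyne")
```

If the filter does leave `/` alone, Jinja2's documentation suggests chaining `replace("/", "%2F")`, or the value can be pre-encoded in the vault as noted above.
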
### Worker tuning — `worker` only
| Variable | Default | Notes |
|----------|---------|-------|
| `CELERY_QUEUES` | `celery,embedding,batch` | Override per host for dedicated queue workers |
| `CELERY_CONCURRENCY` | `2` | Number of worker processes |
### MCP server — `mcp` only
| Variable | Production value |
|----------|-----------------|
| `MCP_REQUIRE_AUTH` | `True` |
### SSO / Casdoor — `app` only
| Variable | Example / default | Notes |
|----------|-------------------|-------|
| `CASDOOR_ENABLED` | `True` | Set `False` to disable SSO and show only local login |
| `CASDOOR_ORIGIN` | `https://casdoor.ouranos.helu.ca` | Backend URL used for OIDC discovery (`/.well-known/openid-configuration`) |
| `CASDOOR_ORIGIN_FRONTEND` | `https://casdoor.ouranos.helu.ca` | Frontend URL shown to the browser (may differ behind a reverse proxy) |
| `CASDOOR_CLIENT_ID` | `{{ vault_mnemosyne_casdoor_client_id }}` | OAuth client ID from the Casdoor application |
| `CASDOOR_CLIENT_SECRET` | `{{ vault_mnemosyne_casdoor_client_secret }}` | OAuth client secret from the Casdoor application |
| `CASDOOR_ORG_NAME` | `ouranos` | Default organisation slug in Casdoor |
| `CASDOOR_SSL_VERIFY` | `true` | `true` in production; `false` only in sandboxes with self-signed certs |
| `ALLOW_LOCAL_LOGIN` | `False` | Show the local username/password form to non-superusers. Superusers always see it regardless of this flag. |

Register the OIDC callback URL in the Casdoor application before enabling SSO:
```
https://mnemosyne.ouranos.helu.ca/accounts/oidc/casdoor/login/callback/
```
### LLM API encryption — `app`, `worker`
| Variable | Notes |
|----------|-------|
| `LLM_API_SECRETS_ENCRYPTION_KEY` | Fernet key. Generate once: `python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"`. Never rotate without re-encrypting all stored provider keys first. |
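
A Fernet key is 32 random bytes encoded as URL-safe base64, which gives 44 characters including padding. The sketch below relies on that format to validate a vault value before deploy, using only the standard library; the function names are hypothetical, and the `cryptography` one-liner in the table remains the documented way to generate the key.

```python
import base64
import binascii
import os


def generate_fernet_key():
    """Return 32 random bytes as URL-safe base64 (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode()


def is_valid_fernet_key(key):
    """Check that the key decodes to exactly 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(key.encode())
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32
```

A check like this catches truncated or double-encoded vault values before they reach the containers.
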
### Email — `app` only
| Variable | Example |
|----------|---------|
| `EMAIL_HOST` | `oberon.incus` |
| `EMAIL_PORT` | `22025` |
| `EMAIL_USE_TLS` | `False` |
### Embedding pipeline — `worker` only
| Variable | Default |
|----------|---------|
| `EMBEDDING_BATCH_SIZE` | `8` |
| `EMBEDDING_TIMEOUT` | `120` |
### Search & re-ranker — `app`, `mcp`
| Variable | Default |
|----------|---------|
| `SEARCH_VECTOR_TOP_K` | `50` |
| `SEARCH_FULLTEXT_TOP_K` | `30` |
| `SEARCH_GRAPH_MAX_DEPTH` | `2` |
| `SEARCH_RRF_K` | `60` |
| `SEARCH_DEFAULT_LIMIT` | `20` |
| `RERANKER_MAX_CANDIDATES` | `32` |
| `RERANKER_TIMEOUT` | `30` |
### Logging — `app`, `mcp`, `worker`
| Variable | Default |
|----------|---------|
| `LOGGING_LEVEL` | `INFO` |
| `DJANGO_LOGGING_LEVEL` | `WARNING` |
| `CELERY_LOGGING_LEVEL` | `INFO` |
---
## 5. Health Probes & Verification
After `docker compose up -d`, wait for all services to report healthy:
```bash
docker compose -f /srv/mnemosyne/docker-compose.yaml ps
```
Expected: `app`, `mcp`, `worker`, `web` all `healthy`; `static-init` `exited (0)`.
### Per-service probes
| Service | Healthcheck command | Expected |
|---------|---------------------|----------|
| `app` | `curl -f http://localhost:8000/live/` | 200 |
| `mcp` | `curl -f http://localhost:8001/mcp/health` | 200 JSON |
| `web` | `curl -f http://localhost/live/` | 200 (proxied to app) |
| `worker` | `celery -A mnemosyne inspect ping -d celery@$HOSTNAME` | `pong` |
### External checks (from inside the 10.10.0.0/24 network)
```bash
# Django liveness (via nginx)
curl -f http://puck.incus:23181/live/
# Django readiness (Postgres + Memcached)
curl -f http://puck.incus:23181/ready/
# MCP health (proxied from /healthz → mcp:8001/mcp/health)
curl -f http://puck.incus:23181/healthz
# Prometheus metrics (internal only)
curl http://puck.incus:23181/metrics | head -5
```
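
The same external checks can be scripted as a wait loop for a post-deploy task. This is a minimal sketch, not part of the repository: the function names, attempt count, and delay are assumptions, while the base URL and paths are the ones used above. The probe is injectable so the loop can be exercised without network access.

```python
import time
import urllib.error
import urllib.request

BASE = "http://puck.incus:23181"
PATHS = ["/live/", "/ready/", "/healthz"]


def http_ok(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, and timeouts.
        return False


def wait_until_healthy(urls, probe=http_ok, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll until every URL passes the probe; return the URLs still failing."""
    pending = list(urls)
    for attempt in range(attempts):
        pending = [u for u in pending if not probe(u)]
        if not pending:
            return []
        if attempt < attempts - 1:
            sleep(delay)
    return pending
```

Calling `wait_until_healthy(BASE + p for p in PATHS)` returns an empty list once all three endpoints answer 200, or the endpoints that never did.
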
### Verify Daedalus auth (per-user API token)
Daedalus now authenticates as a Mnemosyne user via a `UserToken` minted
at `/profile/tokens/`. To smoke-test from a deploy host:
```bash
curl -H "Authorization: Bearer <user-token-plaintext>" \
https://mnemosyne.ouranos.helu.ca/library/api/workspaces/ws_smoke/ \
-o /dev/null -w "%{http_code}"
# Expect: 200 if the workspace exists for that user, 404 otherwise.
# An anonymous request gets 401 with `WWW-Authenticate: Bearer`.
```
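
For an automated smoke test, the status codes above can be mapped to a verdict. The sketch below assumes the semantics stated in the comments of the `curl` example (200 or 404 means the token was accepted, 401 means it was not); the function names are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_status(url, token=None, timeout=10.0):
    """Return the HTTP status of a GET, sending a bearer token if given."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is still useful.
        return err.code


def interpret(status):
    """Translate a status code into a smoke-test verdict."""
    if status in (200, 404):
        return "token accepted"
    if status == 401:
        return "token rejected or missing"
    return f"unexpected status {status}"
```

Here 404 counts as a pass because it shows the request was authenticated and only the workspace lookup failed.
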
### Verify MCP connectivity (from a client with a valid UserToken)
```bash
curl -H "Authorization: Bearer <token>" \
https://mnemosyne.ouranos.helu.ca/mcp/health
# Expect: {"status": "ok", ...}
```
---
## 6. Upgrade Procedure
A standard upgrade (new image pushed to `git.helu.ca/r/mnemosyne:latest`):
```bash
cd /srv/mnemosyne
docker compose pull
docker compose up -d # static-init re-seeds; running containers replaced
docker compose run --rm app migrate # no-op if no new migrations
```
The `static-init` service runs to completion on every `up`, propagating static
file changes without a manual volume reset.

---
## 7. Rollback
```bash
# Pull the known-good image by digest (docker compose pull expects service names)
docker pull git.helu.ca/r/mnemosyne@sha256:<digest>
# Edit the image: line in docker-compose.yaml to reference that digest, then:
docker compose up -d
```
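
Rewriting the `image:` reference by hand is easy to get wrong when the registry host carries a port. The sketch below uses a hypothetical helper, not one shipped with Mnemosyne, to derive the digest-pinned form of a tagged reference.

```python
import re

# A sha256 digest is the literal prefix followed by 64 lowercase hex digits.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def pin_to_digest(image_ref, digest):
    """Replace any tag or digest on image_ref with the given digest."""
    if not DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    name = image_ref.split("@", 1)[0]        # drop any existing digest
    head, sep, last = name.rpartition("/")   # a tag can only follow the last "/"
    last = last.split(":", 1)[0]             # drop ":tag" if present
    return f"{head}{sep}{last}@{digest}"
```

Only the final path component is searched for a tag, so a registry port such as `:5000` in the host part is left untouched.
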
Alternatively, tag known-good images in the registry before each deploy and
reference that tag in `docker-compose.yaml` when rolling back.

---
## 8. HAProxy / Titania Configuration Notes
Titania terminates TLS and forwards to `puck.incus:23181`. The nginx config
preserves `X-Forwarded-Proto: https` so Django's `request.is_secure()`, secure
cookies, and `build_absolute_uri()` work correctly.
The HAProxy `health_path` for this backend should be `/healthz` (not `/live/` or
`/ready/`) — `/healthz` short-circuits directly to the FastMCP health endpoint
without touching Django, so it can confirm the MCP server is up even if Django
is momentarily unhealthy.
If a health checker that does not follow redirects is pointed at the Django
endpoints instead, use `/live/` and `/ready/` **with** the trailing slash. The
un-slashed forms (`/live`, `/ready`) trigger Django's `APPEND_SLASH` 301
redirect, which such a checker reports as a failure.

---
## 9. Vault Variables Summary
| Vault variable | Used in `.env` as |
|----------------|-------------------|
| `vault_mnemosyne_secret_key` | `SECRET_KEY` |
| `vault_mnemosyne_db_password` | `APP_DB_PASSWORD` |
| `vault_neo4j_password` | embedded in `NEOMODEL_NEO4J_BOLT_URL` |
| `vault_mnemosyne_s3_key` | `AWS_ACCESS_KEY_ID` |
| `vault_mnemosyne_s3_secret` | `AWS_SECRET_ACCESS_KEY` |
| `vault_daedalus_s3_read_key` | `DAEDALUS_S3_ACCESS_KEY_ID` |
| `vault_daedalus_s3_read_secret` | `DAEDALUS_S3_SECRET_ACCESS_KEY` |
| `vault_rabbitmq_password` | embedded in `CELERY_BROKER_URL` |
| `vault_mnemosyne_llm_encryption_key` | `LLM_API_SECRETS_ENCRYPTION_KEY` |
| `vault_mnemosyne_casdoor_client_id` | `CASDOOR_CLIENT_ID` |
| `vault_mnemosyne_casdoor_client_secret` | `CASDOOR_CLIENT_SECRET` |