docs(env): expand .env.example into full compose interpolation template

Replace the minimal placeholder .env.example with a comprehensive template documenting every variable consumed by docker-compose.yaml, organized by service (Django core, HTTP, Postgres, Neo4j, Memcached, S3/MinIO, Daedalus, Celery/RabbitMQ, etc.). Clarifies that this file is rendered from an Ansible Jinja2 template with vaulted secrets in production, and distinguishes it from the in-tree mnemosyne/.env used for bare-Python development.
2026-05-04 07:04:28 -04:00
parent d84f0e548b
commit 003f958f7b
4 changed files with 623 additions and 68 deletions
--- a/README.md
+++ b/README.md
@@ -168,11 +168,34 @@ Production runs as four containers from a single image (built and pushed by [`.g

 Plus a one-shot `static-init` service that copies `/app/staticfiles` (baked into the image at build time via `collectstatic`) into the shared volume nginx reads from. It runs to completion on every `up`, so static-file changes propagate on each deploy without manual intervention.

-External services (NOT spun up by compose): Postgres on Portia, Neo4j on Umbriel (dedicated Mnemosyne instance), RabbitMQ on Oberon, S3/MinIO on Nyx, Memcached, embedder + reranker. All reached over the internal 10.10.0.0/24 network. URLs and credentials live in `mnemosyne/.env`.
+External services (NOT spun up by compose): Postgres on Portia, Neo4j on Umbriel (dedicated Mnemosyne instance), RabbitMQ on Oberon, S3/MinIO on Nyx, Memcached, embedder + reranker. All reached over the internal 10.10.0.0/24 network.
+
+### Environment scoping
+
+Each compose service declares *only* the environment variables it actually needs — there is no shared `env_file:`. The rationale:
+
+- The MCP server (the most exposed surface, because it talks to outside LLMs) should never see the Celery broker URL or the LLM API encryption key. It only needs Postgres, Neo4j, Memcached, S3, and the MCP-specific auth toggle.
+- The Celery worker has no business knowing `ALLOWED_HOSTS`, `CSRF_TRUSTED_ORIGINS`, `MCP_REQUIRE_AUTH`, or the email backend — it doesn't serve HTTP.
+- The Django app doesn't need the Daedalus S3 credentials — only the ingest Celery task reads that bucket.
+- When a shared secret (like the broker password) is misconfigured, only the services that consume that secret fail, so the rest of the stack stays up and observable while you debug.
+
+Values are interpolated from a `.env` file at the **repo root** (not `mnemosyne/.env`, which is the dev config for bare-Python runs). Copy `.env.example` to `.env` and fill in the blanks, or — in production — have your Ansible role render `.env` from a Jinja2 template with secrets from the vault.
+
+```bash
+cp .env.example .env
+$EDITOR .env       # fill in SECRET_KEY, DB/RabbitMQ/S3 creds, LLM_API_SECRETS_ENCRYPTION_KEY
+```
+
+The per-service surface is defined by the `environment:` blocks in `docker-compose.yaml`; `.env.example` documents every variable with which service(s) consume it.
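
One way to confirm that the copied `.env` has no blanks left is to compare its keys against `.env.example`. The following is a minimal sketch, not part of the repository: `parse_env` and `unfilled_keys` are hypothetical helper names, and the parser assumes plain `KEY=VALUE` lines with `#` comments (no multi-line values, no `export` prefixes). Some variables may legitimately be empty, so treat the result as a checklist rather than a hard failure.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def unfilled_keys(example_text: str, env_text: str) -> list[str]:
    """Return keys listed in the template that are missing or empty in .env."""
    example = parse_env(example_text)
    env = parse_env(env_text)
    return sorted(key for key in example if not env.get(key))


if __name__ == "__main__":
    # Assumes the script is run from the repo root, next to both files.
    missing = unfilled_keys(
        Path(".env.example").read_text(), Path(".env").read_text()
    )
    for key in missing:
        print(f"not set in .env: {key}")
```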
+
+> **Broker URL gotcha.** If the RabbitMQ password contains any of `@ : / # % + ? & =` or a space, it must be percent-encoded in `CELERY_BROKER_URL`. Kombu's URL parser is strict, and this is the most common cause of a `PLAIN 403 ACCESS_REFUSED` at worker startup when the same credentials work fine under bare-Python `celery` invocations (because you were probably passing them as kwargs, not a URL).
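
The percent-encoding described above can be sketched with the standard library. This is an illustration under assumptions, not the project's tooling: `build_broker_url` is a hypothetical helper, and the user, host, and vhost values in the demo are placeholders. Passing `safe=""` to `urllib.parse.quote` matters, because the default leaves `/` unencoded.

```python
from urllib.parse import quote


def build_broker_url(user: str, password: str, host: str,
                     port: int = 5672, vhost: str = "") -> str:
    """Assemble an amqp:// URL with user, password, and vhost percent-encoded."""
    # safe="" forces every reserved character (including "/") to be encoded.
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"amqp://{userinfo}@{host}:{port}/{quote(vhost, safe='')}"


if __name__ == "__main__":
    # Placeholder credentials and hostname, for illustration only.
    print(build_broker_url("mnemosyne", "p@ss:w/rd#1", "oberon.example",
                           vhost="mnemosyne"))
```

The printed value is what would go into `CELERY_BROKER_URL` in the root `.env`.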

 ### First-time bring-up

 ```bash
+# Generate the root .env from the template (or let Ansible do it)
+cp .env.example .env && $EDITOR .env
+
 # Pull the image (or build locally with `docker compose build`)
 docker compose pull

@@ -199,17 +222,35 @@ docker compose restart mcp         # restart just the MCP server
 docker compose pull && docker compose up -d
 ```

-### Things to verify in `mnemosyne/.env` before bringing up
+### Things to verify in `.env` before bringing up

-The development `.env` has a few values that need adjusting for production:
+The root `.env` (the one compose interpolates from — not `mnemosyne/.env`) needs the following set for a working production deploy:

 - `DEBUG=False`
- `USE_LOCAL_STORAGE=False` (already set; just confirm)
+- `USE_LOCAL_STORAGE=False`
 - `KVDB_LOCATION=<external-memcached-host>:11211` (inside a container, `127.0.0.1` is the container's own loopback, so it never reaches a Memcached running on another host)
- `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` filled in
- `DAEDALUS_S3_*` filled in for cross-bucket reads from the Daedalus bucket
+- `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` filled in (Mnemosyne's own MinIO bucket)
+- `DAEDALUS_S3_ACCESS_KEY_ID` / `DAEDALUS_S3_SECRET_ACCESS_KEY` filled in for cross-bucket ingest reads
+- `CELERY_BROKER_URL` with the RabbitMQ password **percent-encoded** if it contains URL-special characters
 - `ALLOWED_HOSTS` includes the public hostname HAProxy routes to (e.g. `mnemosyne.ouranos.helu.ca`)
- `LLM_API_SECRETS_ENCRYPTION_KEY` set to a real Fernet key
+- `CSRF_TRUSTED_ORIGINS` includes `https://<same-hostname>`
+- `LLM_API_SECRETS_ENCRYPTION_KEY` set to a real Fernet key (generated once per environment)
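
For the last item, a Fernet key is 32 random bytes encoded as URL-safe base64 (44 characters including padding). The sketch below generates one with only the standard library and sanity-checks an existing value; `generate_fernet_key` and `looks_like_fernet_key` are hypothetical names, and the check only validates the shape of the key, not whether it matches the key that encrypted existing data.

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return a new key: 32 random bytes, URL-safe base64 encoded."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


def looks_like_fernet_key(value: str) -> bool:
    """True if value decodes as URL-safe base64 to exactly 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(value.encode("ascii"))
    except ValueError:  # covers binascii.Error and UnicodeEncodeError
        return False
    return len(raw) == 32


if __name__ == "__main__":
    print(generate_fernet_key())
```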
+
+### Verifying the environment reached a container
+
+If a service misbehaves on startup — typically the worker with an `AccessRefused` from RabbitMQ, or the app with a DB auth error — the fastest diagnostic is to print what Django actually parsed, since that removes every layer of env-file / interpolation / URL-encoding ambiguity:
+
+```bash
+# What broker URL did the worker actually receive?
+docker compose run --rm --no-deps worker \
+    python -c "from django.conf import settings; print(repr(settings.CELERY_BROKER_URL))"
+
+# What DB host/user?
+docker compose run --rm --no-deps app \
+    python -c "from django.conf import settings; print(settings.DATABASES['default'])"
+```
+
+The `repr(...)` form surfaces CRLF, trailing whitespace, stray quotes, or characters that should have been percent-encoded.
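
To make the same inspection mechanical, the checks a reader performs on the `repr(...)` output can be sketched as a small function. This is an assumption-laden illustration: `env_value_problems` is a hypothetical helper, and it covers only the three symptoms that can be detected without knowing the intended value (it cannot tell whether a character should have been percent-encoded).

```python
def env_value_problems(value: str) -> list[str]:
    """List formatting problems commonly introduced by env files."""
    problems: list[str] = []
    if "\r" in value or "\n" in value:
        problems.append("embedded CR/LF")
    if value != value.strip():
        problems.append("leading or trailing whitespace")
    stripped = value.strip()
    # Quotes that survived into the value, e.g. KEY="value" read literally.
    if len(stripped) >= 2 and stripped[0] == stripped[-1] and stripped[0] in "'\"":
        problems.append("wrapped in literal quotes")
    return problems


if __name__ == "__main__":
    # Placeholder URL with a stray carriage return, as left by a CRLF file.
    sample = "amqp://user:secret@broker.example:5672/vhost\r"
    print(repr(sample), env_value_problems(sample))
```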

 ### Health probes