feat(alloy): add journal relabeling and kottos integration on puck
Introduce structured journal relabel rules on puck to tag Pallas-managed
units with {service, project, component} labels matching the Mnemosyne
and Daedalus schema. Add kottos release variable and vault secrets
example entries for the new Pallas FastAgent runtime.
Remove the defunct mnemosyne syslog listener now that Mnemosyne ships
JSON logs via the docker-socket pipeline.
@@ -163,6 +163,96 @@ The registry includes model capabilities on each agent entry:
}
```

## Deployment

Kottos runs two ways:

1. **Locally on caliban**, hand-started for iteration (`kottos` from the repo root). This is the flow documented above in *Quickstart*.
2. **In Ouranos / Virgo / Taurus via Ansible**, as a `systemd`-managed `pallas` process on the puck.incus container. This is the pipeline that feeds the Puck Services dashboard in Grafana.

### Ansible role

Lives in `ouranos/ansible/kottos/`:

| File | Purpose |
|---|---|
| `deploy.yml` | Main playbook — user/group, venv, systemd unit, config templating, registry probe. |
| `stage.yml` | Clones `git.helu.ca/r/kottos` at `{{ kottos_rel }}` and creates the release tarball. |
| `kottos.service.j2` | systemd unit. `SyslogIdentifier=kottos`, `StandardOutput=journal`, `PALLAS_LOG_STDOUT=1` via the env file. |
| `.env.j2` | Runtime environment for `pallas` — logging config, `PALLAS_AGENTS_CONFIG`. |
| `agents.yaml.j2` | Deployment topology with host/ports pulled from inventory. |
| `fastagent.config.yaml.j2` | LLM provider + MCP server URLs, parametric per environment. |
| `fastagent.secrets.yaml.j2` | API keys and auth tokens, rendered from Ansible Vault. |

### Inventory

Host variables live in `inventory/host_vars/puck.incus.yml` under **Kottos Configuration**:

```yaml
kottos_user: kottos
kottos_group: kottos
kottos_directory: /srv/kottos
kottos_host: "puck.incus"
kottos_registry_port: 24100
kottos_harper_port: 24101
kottos_scotty_port: 24102
kottos_research_port: 24150
kottos_tech_research_port: 24151
pallas_log_level: INFO
kottos_default_model: "openai.Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf"
kottos_openai_base_url: "http://nyx.helu.ca:22079/v1"
# ...plus one entry per downstream MCP URL so each environment overrides freely
```

Every host variable is parametric — Virgo's `puck.virgo.yml` (or wherever the Pallas host lives) can override any value without touching the templates.

### Vault

Four vault keys are required, all documented in `inventory/group_vars/all/vault.yml.example`:

| Key | Used for |
|---|---|
| `vault_kottos_openai_api_key` | OpenAI-compatible LLM endpoint (nyx Qwen in Ouranos). |
| `vault_kottos_github_pat` | `GITHUB_PERSONAL_ACCESS_TOKEN` for the local GitHub MCP Docker container. |
| `vault_kottos_angelia_bearer` | Bearer token accepted by the Angelia MCP server. |
| `vault_kottos_mnemosyne_jwt` | Long-lived team JWT from Daedalus admin UI — Mnemosyne validates it on every `search_memory` call and scopes results to this team's workspaces. |

### Deploying

Wired into `site.yml`:

```bash
cd ansible
ansible-playbook kottos/stage.yml   # clone repo + build tarball (local)
ansible-playbook kottos/deploy.yml  # deploy + template + start
```

Or run the full site (`ansible-playbook site.yml`) — kottos's stage + deploy steps are the last block in the sequence.

### Logs

The journal identifier is `kottos`, so on the host:

```bash
sudo journalctl -u kottos -f --output=cat | jq .
```

Alloy's journal source on puck relabels `__journal_syslog_identifier=kottos` to `{service="pallas", project="kottos"}` and forwards the stream to Loki. Everything shows up in Grafana's *Puck Services — Logs & Health* dashboard under the **Pallas** row, with per-agent colouring driven by the `component` JSON field (`harper`, `scotty`, `research`, `tech_research`).

For per-agent follow-along:

```logql
{service="pallas", project="kottos", component="harper"} | json
```

For the opaque-MCP-transport-failure trace stream (see Pallas's bearer-forwarding incident history):

```logql
{service="pallas", project="kottos"} |= "pallas.forward.trace" | json
```

See [logging.md](logging.md) for the full label schema + level policy + add-a-new-service guide.

## Downstream MCP Servers

| Server | Host | URL |
173
docs/logging.md
Normal file
@@ -0,0 +1,173 @@
# Unified Logging — Mnemosyne, Pallas, Daedalus

PPLG is the single destination for every service's logs. This document describes the label schema every service emits, the two transports Alloy uses to collect logs, and the level policy that keeps INFO output actionable.

The three in-scope services today are **Mnemosyne**, **Pallas** (running as Kottos/Mentor/Iolaus), and **Daedalus**. The same patterns generalise to any future service that deploys on a `docker`-enabled host or under `systemd+journald`.

## Label schema

Every Loki log stream carries these labels, and nothing else:

| Label | Example values | Source |
|---|---|---|
| `service` | `mnemosyne`, `pallas`, `daedalus`, `athena`, `kairos`, `angelia` | Docker compose project name (container logs) **or** explicit systemd relabel rule (journal logs) |
| `component` | `app`, `mcp`, `worker`, `nginx`, `harper`, `scotty`, `research`, `tech_research` | Docker compose service name **or** per-agent `ContextVar` (Pallas) |
| `project` | `kottos` (Pallas only) | `agents.yaml` `name:` field read by `pallas.log.set_project()` |
| `hostname` | `puck.incus`, `caliban.incus` | Alloy's `inventory_hostname` template var |
| `environment` | `ouranos`, `virgo`, `taurus` | `deployment_environment` from Ansible group_vars |

**Everything else is a JSON field in the log body**, not a label. That includes `level`, `logger`, `funcName`, `lineno`, `message`, `request_id`, `workspace_id`, `agent`, `tool`, `duration_ms`, and any `extra={...}` kwargs the application passed in. LogQL's `| json` pipeline parses these on-query — keeping them out of the label index is what keeps Loki fast.

## Level policy

Same rules for every service. Health-check `200 OK`s live in DEBUG, never in INFO.

| Level | Meaning |
|---|---|
| `ERROR` | Broken; requires human attention. |
| `WARNING` | Degraded but self-recovering — retries, skipped items, missing optional config. |
| `INFO` | Lifecycle events and failures. Start, ready, shutdown, preflight, LLM provider validation. 200 OKs on health endpoints are **not** INFO. |
| `DEBUG` | Per-request detail, successful health probes, verbose traces. Enable on demand when troubleshooting. |

Mnemosyne enforces this with `mnemosyne.log_filters.SuppressHealthAccessFilter` on Django/gunicorn access loggers; Pallas with `_HealthAccessFilter` on `uvicorn.access`; Daedalus with the equivalent filter in `daedalus.logging`.
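
As an illustration of what such a filter can look like, here is a minimal sketch using only the standard library. It is not the code from any of the three services: the class name `HealthAccessFilter`, the `/health` and `/healthz` paths, and the uvicorn-style access-line shape are all assumptions made for the example.

```python
import logging
import re

# Assumption: access lines look like uvicorn's default format, e.g.
#   10.0.0.5:41234 - "GET /health HTTP/1.1" 200
# and the health endpoint is /health or /healthz (hypothetical paths).
_HEALTH_OK = re.compile(r'"(?:GET|HEAD) /health(?:z)?(?:\?\S*)? HTTP/[\d.]+" 200\b')


class HealthAccessFilter(logging.Filter):
    """Drop successful health-probe access lines; pass everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record on this logger.
        return _HEALTH_OK.search(record.getMessage()) is None


def install(logger_name: str = "uvicorn.access") -> None:
    """Attach the filter to the named access logger."""
    logging.getLogger(logger_name).addFilter(HealthAccessFilter())
```

Failed probes (for example a `503` on `/health`) still pass through, so a broken health endpoint stays visible at INFO while the steady stream of `200`s does not.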

## Two transports, one Alloy

Alloy on each host uses exactly two sources for application logs. Pick whichever matches the service's runtime model — **don't** invent a third.

### 1. Docker socket (for compose projects)

`discovery.docker` enumerates every running container, and `loki.source.docker` tails their stdout via the `json-file` driver. Compose project → `service` label, compose service → `component` label. One block covers every compose project on the host, current and future.

**Requirements on the service side:**

- Emit JSON lines to **stdout**, one per log record. Mnemosyne uses `python-json-logger`; Daedalus uses `structlog`; any Python service can do the same.
- Pin the logging driver to `json-file` with bounded rotation in `docker-compose.yaml`:

  ```yaml
  x-logging: &default-logging
    driver: json-file
    options:
      tag: "{{.Name}}"
      max-size: "10m"
      max-file: "5"

  services:
    app:
      # ...
      logging: *default-logging
  ```

  `json-file` is Docker's default, but pinning it defensively guarantees Alloy sees the same driver on every host.

- On the Alloy host, the `alloy` user must be in the `docker` group to read `/var/run/docker.sock`. The `ouranos/ansible/alloy/` role handles this.
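
The first requirement above (one JSON object per line on stdout, with `service` and `component` as static fields) can be sketched with the standard library alone. This is an illustrative stand-in, not the `python-json-logger` or `structlog` configuration the services actually use; `JsonLineFormatter` and `configure_json_logging` are hypothetical names.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each record as a single JSON line with static service fields."""

    def __init__(self, service: str, component: str):
        super().__init__()
        self.static = {"service": service, "component": component}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "funcName": record.funcName,
            "lineno": record.lineno,
            "message": record.getMessage(),
            **self.static,
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        # default=str keeps non-serialisable values from raising.
        return json.dumps(payload, default=str)


def configure_json_logging(service, component, level=logging.INFO, stream=None):
    """Route the root logger to stdout (or `stream`) as JSON lines."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonLineFormatter(service, component))
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

Calling `configure_json_logging("mnemosyne", "app")` once at startup is enough for every `logging.getLogger(...)` call in the process to emit lines that `| json` can parse on the Loki side.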

### 2. Systemd journal (for systemd-managed units)

`loki.source.journal` tails journald. A `loki.relabel "journal_<host>"` block translates `__journal_syslog_identifier` → `service` / `project` labels so Pallas-managed agents land alongside Docker-based services with the same schema.

**Requirements on the service side:**

- Emit JSON to **stdout** (journald captures it with `PRIORITY=6` INFO by default).
- The systemd unit must set a distinctive `SyslogIdentifier=` — the Alloy relabel block keys off this.
- Under Pallas, set `PALLAS_LOG_STDOUT=1` in the unit's `EnvironmentFile`. Also set `PALLAS_LOG_FILE=/dev/null` to disable the rotating file sink (journald is already durable).

Example, from `ouranos/ansible/kottos/kottos.service.j2`:

```ini
[Service]
...
EnvironmentFile=/srv/kottos/.env
ExecStart=/srv/kottos/.venv/bin/pallas
StandardOutput=journal
StandardError=journal
SyslogIdentifier=kottos
```

And the matching Alloy relabel rule on puck:

```alloy
loki.relabel "journal_puck" {
  forward_to = []
  rule {
    source_labels = ["__journal_syslog_identifier"]
    regex = "kottos"
    target_label = "service"
    replacement = "pallas"
  }
  rule {
    source_labels = ["__journal_syslog_identifier"]
    regex = "kottos"
    target_label = "project"
    replacement = "kottos"
  }
  // ...
}
```

## Per-service reference

### Mnemosyne (Docker compose on puck)

- Logging config: `mnemosyne/mnemosyne/mnemosyne/settings.py` → `LOGGING` dict using `pythonjsonlogger.json.JsonFormatter`.
- Component attribution: `MNEMOSYNE_COMPONENT` env var set per docker-compose service (`init`, `app`, `mcp`, `worker`). The settings module reads it into `static_fields.component`.
- Health-filter: `mnemosyne.log_filters.SuppressHealthAccessFilter` on the `access` handler.
- Metrics: `/metrics` on the nginx container (port 23181) — served by django-prometheus on the app container plus `mcp_server.metrics` (shared `prometheus_client` registry).
- Scrape job: `mnemosyne` (see `ouranos/ansible/pplg/prometheus.yml.j2`).
- Alerts: `mnemosyne_alerts` group in `ouranos/ansible/pplg/alert_rules.yml.j2`.

### Pallas — Kottos (systemd on puck via Ansible role `ouranos/ansible/kottos/`)

- Logging config: `pallas/pallas/log.py` → `setup_logging()` with `PALLAS_LOG_STDOUT=1`.
- Component attribution: `pallas.log.set_agent_component(name)` is called by `_start_agent()` inside each agent's asyncio task, setting a `contextvars.ContextVar` that the `_StaticFieldsFilter` reads per record. Each agent (harper, scotty, research, tech_research) carries its own value without leaking across tasks.
- Project attribution: `pallas.log.set_project(deploy_name)` is called once in `main()` from `agents.yaml`'s `name:`. For Kottos this renders as `project="kottos"` on every record.
- Deployed by: `ansible-playbook kottos/deploy.yml` (wired into `site.yml`).
- Metrics: none today — Pallas is observed through logs only. Future phase will add a `prometheus_client` endpoint on the registry port for `pallas_agent_requests_total{agent=…}`, `pallas_downstream_mcp_errors_total{server=…}`.

### Daedalus (Docker compose on puck)

- Logging config: `daedalus/backend/daedalus/logging.py` — `structlog` JSON processor chain, already production-ready.
- Component attribution: `structlog.contextvars.bind_contextvars(service="daedalus", component="api")` at app startup.
- Health-filter: `_SuppressHealthAccessFilter` on uvicorn's access logger.
- Metrics: `/metrics` on the api container (port 22181).
- Scrape job: `daedalus`.
- Alerts: `daedalus_alerts` group.

## Useful LogQL queries

Once the pipeline is live, the "troubleshooting is a nightmare" problem becomes three-click queries in Grafana Explore:

```logql
# All Mnemosyne errors in the last 15m
{service="mnemosyne"} | json | level="ERROR"

# Everything Harper did in the last hour
{service="pallas", project="kottos", component="harper"} | json

# The infamous pallas.forward.trace stream (MCP transport failures)
{service="pallas", project="kottos"} |= "pallas.forward.trace"

# Cross-service trace of a single request (requires X-Request-Id propagation
# — not yet implemented; Phase 1.5 nice-to-have)
{environment="ouranos"} | json | request_id="<paste-id>"

# 5xx spike in Daedalus by path (filters on level="ERROR", not on status code)
sum by (path) (rate({service="daedalus"} | json | level="ERROR" [5m]))
```

The **Puck Services — Logs & Health** dashboard in Grafana (`/etc/grafana/provisioning/dashboards/puck.yaml` → `/var/lib/grafana/dashboards/puck_services.json`) has these pre-wired as panels per service row.

## Adding a new service

If you're adding a service to puck (or any Ouranos/Virgo host with this stack):

1. **Emit JSON to stdout** with `service`/`component` as static fields. Copy Mnemosyne's settings pattern or Pallas's `_StaticFieldsFilter`.
2. **Pick a transport:**
   - Docker compose → add the `x-logging: &default-logging` anchor + `logging: *default-logging` on each service. Done. No Alloy changes needed.
   - systemd → set `SyslogIdentifier=<name>` on the unit and add two relabel rules (one for `service`, one for `project`) to the host's `loki.relabel "journal_<host>"` block.
3. **Expose `/metrics`** if the service is in Python — `prometheus_client` plus either `django-prometheus` or `prometheus_fastapi_instrumentator`.
4. **Add a scrape job** in `ouranos/ansible/pplg/prometheus.yml.j2` (parametrise the target — `{{ <service>_metrics_host }}:{{ <service>_metrics_port }}`) and wire the defaults into the host's `host_vars`.
5. **Add alerts** in `ouranos/ansible/pplg/alert_rules.yml.j2`. At minimum: `Down`, `HighErrorRate`. Use the metric names the service actually exposes — no dead rules.
6. **Optional**: add panels to the Puck Services dashboard JSON.

No new transport. No per-service Alloy block. No custom log format.