Introduce structured journal relabel rules on puck to tag Pallas-managed
units with {service, project, component} labels matching the Mnemosyne
and Daedalus schema. Add kottos release variable and vault secrets
example entries for the new Pallas FastAgent runtime.
Remove the defunct mnemosyne syslog listener now that Mnemosyne ships
JSON logs via the docker-socket pipeline.
Unified Logging — Mnemosyne, Pallas, Daedalus
PPLG is the single destination for every service's logs. This document describes the label schema every service emits, the two transports Alloy uses to collect logs, and the level policy that keeps INFO output actionable.
The three in-scope services today are Mnemosyne, Pallas (running as Kottos/Mentor/Iolaus), and Daedalus. The same patterns generalise to any future service that deploys on a docker-enabled host or under systemd+journald.
Label schema
Every Loki log stream carries these labels, and nothing else:
| Label | Example values | Source |
|---|---|---|
| service | mnemosyne, pallas, daedalus, athena, kairos, angelia | Docker compose project name (container logs) or explicit systemd relabel rule (journal logs) |
| component | app, mcp, worker, nginx, harper, scotty, research, tech_research | Docker compose service name or per-agent ContextVar (Pallas) |
| project | kottos (Pallas only) | agents.yaml name: field read by pallas.log.set_project() |
| hostname | puck.incus, caliban.incus | Alloy's inventory_hostname template var |
| environment | ouranos, virgo, taurus | deployment_environment from Ansible group_vars |
Everything else is a JSON field in the log body, not a label. That includes level, logger, funcName, lineno, message, request_id, workspace_id, agent, tool, duration_ms, and any extra={...} kwargs the application passed in. LogQL's | json pipeline parses these on-query — keeping them out of the label index is what keeps Loki fast.
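As a minimal sketch of the label/body split, the formatter below uses only the standard library to render one JSON line per record, copying any extra={...} kwargs into the body. The class name JsonBodyFormatter, the helper render_example, and the sample values are hypothetical; Mnemosyne itself uses python-json-logger, which may order or name fields differently.

```python
import json
import logging

# Attributes present on every LogRecord; anything else arrived via extra={...}.
_STANDARD_ATTRS = set(vars(logging.makeLogRecord({}))) | {"message", "asctime"}


class JsonBodyFormatter(logging.Formatter):
    """Render each record as one JSON line (hypothetical stand-in formatter)."""

    def format(self, record: logging.LogRecord) -> str:
        body = {
            "level": record.levelname,
            "logger": record.name,
            "funcName": record.funcName,
            "lineno": record.lineno,
            "message": record.getMessage(),
        }
        # Copy extra={...} kwargs (request_id, duration_ms, ...) into the body.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                body[key] = value
        return json.dumps(body, default=str)


def render_example() -> dict:
    """Format one illustrative record and parse it back, as `| json` would."""
    record = logging.makeLogRecord({
        "name": "mnemosyne.api",
        "levelno": logging.INFO,
        "levelname": "INFO",
        "msg": "search finished in %d ms",
        "args": (42,),
        "request_id": "req-123",  # illustrative value
        "duration_ms": 42,
    })
    return json.loads(JsonBodyFormatter().format(record))
```

None of these fields becomes a Loki label; they are only recoverable at query time through the json stage.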
Level policy
Same rules for every service. Health-check 200 OKs live in DEBUG, never in INFO.
| Level | Meaning |
|---|---|
| ERROR | Broken; requires human attention. |
| WARNING | Degraded but self-recovering: retries, skipped items, missing optional config. |
| INFO | Lifecycle events and failures. Start, ready, shutdown, preflight, LLM provider validation. 200 OKs on health endpoints are not INFO. |
| DEBUG | Per-request detail, successful health probes, verbose traces. Enable on demand when troubleshooting. |
Mnemosyne enforces this with mnemosyne.log_filters.SuppressHealthAccessFilter on Django/gunicorn access loggers; Pallas with _HealthAccessFilter on uvicorn.access; Daedalus with the equivalent filter in daedalus.logging.
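The following is a rough sketch of what such a health-access filter could look like, not the code of any of the three services. It assumes the access line contains a quoted request followed by the status code, and that the health endpoints are /health, /healthz, or /ready; both the class name HealthAccessFilter and those paths are assumptions.

```python
import logging
import re

# Assumed access-line shape and health paths; adjust to the real endpoints.
_HEALTH_OK = re.compile(
    r'"GET /(?:health|healthz|ready)/?(?:\?\S*)? HTTP/[\d.]+" 200\b'
)


class HealthAccessFilter(logging.Filter):
    """Demote successful health probes to DEBUG and drop them unless DEBUG is on."""

    def filter(self, record: logging.LogRecord) -> bool:
        if _HEALTH_OK.search(record.getMessage()):
            # Relabel the record so it never appears as INFO.
            record.levelno = logging.DEBUG
            record.levelname = "DEBUG"
            # Keep it only when the originating logger has DEBUG enabled.
            return logging.getLogger(record.name).isEnabledFor(logging.DEBUG)
        # Failed probes and ordinary requests pass through untouched.
        return True
```

Attached with addFilter() to the access logger's handler, this keeps a failing probe (for example a 503) visible at its original level while successful probes disappear from INFO output.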
Two transports, one Alloy
Alloy on each host uses exactly two sources for application logs. Pick whichever matches the service's runtime model — don't invent a third.
1. Docker socket (for compose projects)
discovery.docker enumerates every running container, and loki.source.docker tails their stdout via the json-file driver. Compose project → service label, compose service → component label. One block covers every compose project on the host, current and future.
Requirements on the service side:
- Emit JSON lines to stdout, one per log record. Mnemosyne uses python-json-logger; Daedalus uses structlog; any Python service can do the same.
- Pin the logging driver to json-file with bounded rotation in docker-compose.yaml:

  ```yaml
  x-logging: &default-logging
    driver: json-file
    options:
      tag: "{{.Name}}"
      max-size: "10m"
      max-file: "5"

  services:
    app:
      # ...
      logging: *default-logging
  ```

  json-file is Docker's default, but pinning it defensively guarantees Alloy sees the same driver on every host.
- On the Alloy host, the alloy user must be in the docker group to read /var/run/docker.sock. The ouranos/ansible/alloy/ role handles this.
2. Systemd journal (for systemd-managed units)
loki.source.journal tails journald. A loki.relabel "journal_<host>" block translates __journal_syslog_identifier → service / project labels so Pallas-managed agents land alongside Docker-based services with the same schema.
Requirements on the service side:
- Emit JSON to stdout (journald captures it with PRIORITY=6, INFO, by default).
- The systemd unit must set a distinctive SyslogIdentifier=; the Alloy relabel block keys off this.
- Under Pallas, set PALLAS_LOG_STDOUT=1 in the unit's EnvironmentFile. Also set PALLAS_LOG_FILE=/dev/null to disable the rotating file sink (journald is already durable).
Example, from ouranos/ansible/kottos/kottos.service.j2:
```ini
[Service]
...
EnvironmentFile=/srv/kottos/.env
ExecStart=/srv/kottos/.venv/bin/pallas
StandardOutput=journal
StandardError=journal
SyslogIdentifier=kottos
```
And the matching Alloy relabel rule on puck:
```
loki.relabel "journal_puck" {
  forward_to = []

  rule {
    source_labels = ["__journal_syslog_identifier"]
    regex         = "kottos"
    target_label  = "service"
    replacement   = "pallas"
  }

  rule {
    source_labels = ["__journal_syslog_identifier"]
    regex         = "kottos"
    target_label  = "project"
    replacement   = "kottos"
  }

  // ...
}
```
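To make the effect of the two rules concrete, here is a small Python model of replace-style relabelling. It is an illustration only, not Alloy's implementation: RelabelRule and apply_rules are hypothetical names, capture-group replacements such as $1 are not modelled, and the sketch assumes (as in Prometheus-style relabelling) that the regex must match the whole source value and that labels starting with `__` are dropped before the stream is written.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RelabelRule:
    source_label: str
    regex: str
    target_label: str
    replacement: str


# Mirrors the two kottos rules from the relabel block.
JOURNAL_PUCK_RULES = [
    RelabelRule("__journal_syslog_identifier", "kottos", "service", "pallas"),
    RelabelRule("__journal_syslog_identifier", "kottos", "project", "kottos"),
]


def apply_rules(labels: dict[str, str], rules: list[RelabelRule]) -> dict[str, str]:
    """Apply replace-style rules in order; the regex must match the whole value."""
    out = dict(labels)
    for rule in rules:
        value = out.get(rule.source_label, "")
        if re.fullmatch(rule.regex, value):
            out[rule.target_label] = rule.replacement
    # Assumed: internal "__" labels do not survive into the final label set.
    return {k: v for k, v in out.items() if not k.startswith("__")}
```

Under these assumptions, an entry whose identifier is exactly kottos ends up with service="pallas" and project="kottos", while a unit with any other identifier (for example kottos-helper) gets neither label.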
Per-service reference
Mnemosyne (Docker compose on puck)
- Logging config: mnemosyne/mnemosyne/mnemosyne/settings.py → LOGGING dict using pythonjsonlogger.json.JsonFormatter.
- Component attribution: MNEMOSYNE_COMPONENT env var set per docker-compose service (init, app, mcp, worker). The settings module reads it into static_fields.component.
- Health-filter: mnemosyne.log_filters.SuppressHealthAccessFilter on the access handler.
- Metrics: /metrics on the nginx container (port 23181), served by django-prometheus on the app container plus mcp_server.metrics (shared prometheus_client registry).
- Scrape job: mnemosyne (see ouranos/ansible/pplg/prometheus.yml.j2).
- Alerts: mnemosyne_alerts group in ouranos/ansible/pplg/alert_rules.yml.j2.
Pallas — Kottos (systemd on puck via Ansible role ouranos/ansible/kottos/)
- Logging config: pallas/pallas/log.py → setup_logging() with PALLAS_LOG_STDOUT=1.
- Component attribution: pallas.log.set_agent_component(name) is called by _start_agent() inside each agent's asyncio task, setting a contextvars.ContextVar that the _StaticFieldsFilter reads per record. Each agent (harper, scotty, research, tech_research) carries its own value without leaking across tasks.
- Project attribution: pallas.log.set_project(deploy_name) is called once in main() from agents.yaml's name:. For Kottos this renders as project="kottos" on every record.
- Deployed by: ansible-playbook kottos/deploy.yml (wired into site.yml).
- Metrics: none today; Pallas is observed through logs only. A future phase will add a prometheus_client endpoint on the registry port for pallas_agent_requests_total{agent=…} and pallas_downstream_mcp_errors_total{server=…}.
Daedalus (Docker compose on puck)
- Logging config: daedalus/backend/daedalus/logging.py, a structlog JSON processor chain, already production-ready.
- Component attribution: structlog.contextvars.bind_contextvars(service="daedalus", component="api") at app startup.
- Health-filter: _SuppressHealthAccessFilter on uvicorn's access logger.
- Metrics: /metrics on the api container (port 22181).
- Scrape job: daedalus.
- Alerts: daedalus_alerts group.
Useful LogQL queries
Once the pipeline is live, the "troubleshooting is a nightmare" problem becomes three-click queries in Grafana Explore:
```logql
# All Mnemosyne errors in the last 15m
{service="mnemosyne"} | json | level="ERROR"

# Everything Harper did in the last hour
{service="pallas", project="kottos", component="harper"} | json

# The infamous pallas.forward.trace stream (MCP transport failures)
{service="pallas", project="kottos"} |= "pallas.forward.trace"

# Cross-service trace of a single request (requires X-Request-Id propagation;
# not yet implemented, Phase 1.5 nice-to-have)
{environment="ouranos"} | json | request_id="<paste-id>"

# 5xx spike in Daedalus by path
sum by (path) (rate({service="daedalus"} | json | level="ERROR" [5m]))
```
The Puck Services — Logs & Health dashboard in Grafana (/etc/grafana/provisioning/dashboards/puck.yaml → /var/lib/grafana/dashboards/puck_services.json) has these pre-wired as panels per service row.
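Outside Grafana, the same queries can be sent to Loki's HTTP query_range endpoint. The sketch below only builds the request URL, so it runs without network access; the function name loki_query_range_url and the base URL in the usage note are hypothetical, and the Loki address for this deployment is not stated in this document.

```python
import time
from typing import Optional
from urllib.parse import urlencode


def loki_query_range_url(
    base_url: str,
    logql: str,
    minutes: int,
    limit: int = 100,
    now_ns: Optional[int] = None,
) -> str:
    """Build a /loki/api/v1/query_range URL covering the last `minutes` minutes."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - minutes * 60 * 1_000_000_000
    # Loki accepts start/end as nanosecond Unix timestamps.
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"
```

For example, `loki_query_range_url("http://loki.example:3100", '{service="mnemosyne"} | json | level="ERROR"', 15)` yields a URL for the first query above over a 15-minute window; the time range that Grafana's picker supplies has to be passed explicitly here.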
Adding a new service
If you're adding a service to puck (or any Ouranos/Virgo host with this stack):
- Emit JSON to stdout with service/component as static fields. Copy Mnemosyne's settings pattern or Pallas's _StaticFieldsFilter.
- Pick a transport:
  - Docker compose → add the x-logging: &default-logging anchor + logging: *default-logging on each service. Done. No Alloy changes needed.
  - systemd → set SyslogIdentifier=<name> on the unit and add two relabel rules to the host's loki.relabel "journal_<host>" block.
- Expose /metrics if the service is in Python: prometheus_client plus either django-prometheus or prometheus_fastapi_instrumentator.
- Add a scrape job in ouranos/ansible/pplg/prometheus.yml.j2 (parametrise the target: {{ <service>_metrics_host }}:{{ <service>_metrics_port }}) and wire the defaults into the host's host_vars.
- Add alerts in ouranos/ansible/pplg/alert_rules.yml.j2. At minimum: Down, HighErrorRate. Use the metric names the service actually exposes; no dead rules.
- Optional: add panels to the Puck Services dashboard JSON.
No new transport. No per-service Alloy block. No custom log format.