ouranos/docs/logging.md
Robert Helewka 8c95173705 feat(alloy): add journal relabeling and kottos integration on puck
Introduce structured journal relabel rules on puck to tag Pallas-managed
units with {service, project, component} labels matching the Mnemosyne
and Daedalus schema. Add kottos release variable and vault secrets
example entries for the new Pallas FastAgent runtime.

Remove the defunct mnemosyne syslog listener now that Mnemosyne ships
JSON logs via the docker-socket pipeline.
2026-05-11 13:54:14 -04:00

Unified Logging — Mnemosyne, Pallas, Daedalus

PPLG is the single destination for every service's logs. This document describes the label schema every service emits, the two transports Alloy uses to collect logs, and the level policy that keeps INFO output actionable.

The three in-scope services today are Mnemosyne, Pallas (running as Kottos/Mentor/Iolaus), and Daedalus. The same patterns generalise to any future service that deploys on a docker-enabled host or under systemd+journald.

Label schema

Every Loki log stream carries these labels, and nothing else:

| Label | Example values | Source |
|---|---|---|
| service | mnemosyne, pallas, daedalus, athena, kairos, angelia | Docker compose project name (container logs) or explicit systemd relabel rule (journal logs) |
| component | app, mcp, worker, nginx, harper, scotty, research, tech_research | Docker compose service name or per-agent ContextVar (Pallas) |
| project | kottos (Pallas only) | agents.yaml name: field read by pallas.log.set_project() |
| hostname | puck.incus, caliban.incus | Alloy's inventory_hostname template var |
| environment | ouranos, virgo, taurus | deployment_environment from Ansible group_vars |

Everything else is a JSON field in the log body, not a label. That includes level, logger, funcName, lineno, message, request_id, workspace_id, agent, tool, duration_ms, and any extra={...} kwargs the application passed in. LogQL's | json pipeline parses these fields at query time; keeping them out of the label index is what keeps Loki fast.
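To make the label/body split concrete, here is a minimal stdlib-only sketch of a formatter that writes one JSON object per record, with service and component as static fields and any extra={...} kwargs copied into the body. The class name JsonLineFormatter and its static_fields argument are hypothetical and only illustrate the idea; the services themselves use python-json-logger, structlog, or Pallas's own setup, not this code.

```python
import json
import logging

# Attribute names every LogRecord carries; anything else came from extra={...}.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))
_STANDARD_ATTRS |= {"message", "asctime"}


class JsonLineFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per record, static fields included."""

    def __init__(self, static_fields):
        super().__init__()
        self.static_fields = dict(static_fields)

    def format(self, record):
        body = {
            "level": record.levelname,
            "logger": record.name,
            "funcName": record.funcName,
            "lineno": record.lineno,
            "message": record.getMessage(),
        }
        body.update(self.static_fields)
        # Copy extra={...} kwargs (request_id, duration_ms, ...) into the body.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS and key not in body:
                body[key] = value
        return json.dumps(body, default=str)
```

Attached to a StreamHandler on stdout, a formatter like this produces lines that | json can unpack, while Alloy alone decides which values become stream labels.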

Level policy

Same rules for every service. Health-check 200 OKs live in DEBUG, never in INFO.

| Level | Meaning |
|---|---|
| ERROR | Broken; requires human attention. |
| WARNING | Degraded but self-recovering: retries, skipped items, missing optional config. |
| INFO | Lifecycle events and failures. Start, ready, shutdown, preflight, LLM provider validation. 200 OKs on health endpoints are not INFO. |
| DEBUG | Per-request detail, successful health probes, verbose traces. Enable on demand when troubleshooting. |

Mnemosyne enforces this with mnemosyne.log_filters.SuppressHealthAccessFilter on Django/gunicorn access loggers; Pallas with _HealthAccessFilter on uvicorn.access; Daedalus with the equivalent filter in daedalus.logging.
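The filters named above are not reproduced in this document. As a rough sketch of the technique, a logging.Filter can drop access-log records for successful health probes while letting failed probes and ordinary requests through. The class name HealthAccessFilter, the default health paths, and the assumed access-line shape ("METHOD path HTTP/x" status) are illustrative assumptions, not the services' actual implementations.

```python
import logging
import re

# Matches the request part of a typical access line: "GET /health HTTP/1.1" 200
_ACCESS_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')


class HealthAccessFilter(logging.Filter):
    """Hypothetical filter: suppress 200 responses on health endpoints."""

    def __init__(self, health_paths=("/health", "/healthz", "/ready")):
        super().__init__()
        self.health_paths = tuple(health_paths)

    def filter(self, record):
        match = _ACCESS_LINE.search(record.getMessage())
        if match is None:
            return True  # not an access line; keep it
        path = match.group(1).split("?", 1)[0]
        status = match.group(2)
        # Drop only successful probes; a failing health check stays visible.
        return not (path in self.health_paths and status == "200")
```

It would be attached with something like logging.getLogger("uvicorn.access").addFilter(HealthAccessFilter()).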

Two transports, one Alloy

Alloy on each host uses exactly two sources for application logs. Pick whichever matches the service's runtime model — don't invent a third.

1. Docker socket (for compose projects)

discovery.docker enumerates every running container, and loki.source.docker tails their stdout via the json-file driver. Compose project → service label, compose service → component label. One block covers every compose project on the host, current and future.
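The mapping itself is small. The sketch below models it in Python, assuming the standard labels Docker Compose sets on its containers (com.docker.compose.project and com.docker.compose.service); the function name is hypothetical and the real translation happens in Alloy's relabel rules, not in application code.

```python
def compose_labels_to_loki(container_labels):
    """Map Docker Compose container labels onto the service/component schema.

    Returns None for containers that were not started by Compose.
    """
    project = container_labels.get("com.docker.compose.project")
    service = container_labels.get("com.docker.compose.service")
    if project is None or service is None:
        return None
    return {"service": project, "component": service}


# Example: the worker container of the mnemosyne compose project.
labels = compose_labels_to_loki({
    "com.docker.compose.project": "mnemosyne",
    "com.docker.compose.service": "worker",
})
```

Because the mapping depends only on Compose metadata, a new compose project picks up its labels without any change on the Alloy side.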

Requirements on the service side:

  • Emit JSON lines to stdout, one per log record. Mnemosyne uses python-json-logger; Daedalus uses structlog; any Python service can do the same.

  • Pin the logging driver to json-file with bounded rotation in docker-compose.yaml:

    x-logging: &default-logging
      driver: json-file
      options:
        tag: "{{.Name}}"
        max-size: "10m"
        max-file: "5"
    
    services:
      app:
        # ...
        logging: *default-logging
    

    json-file is Docker's default, but pinning it defensively guarantees Alloy sees the same driver on every host.

  • On the Alloy host, the alloy user must be in the docker group to read /var/run/docker.sock. The ouranos/ansible/alloy/ role handles this.

2. Systemd journal (for systemd-managed units)

loki.source.journal tails journald. A loki.relabel "journal_<host>" block translates __journal_syslog_identifier → service / project labels so Pallas-managed agents land alongside Docker-based services with the same schema.

Requirements on the service side:

  • Emit JSON to stdout (journald captures it with PRIORITY=6 INFO by default).
  • The systemd unit must set a distinctive SyslogIdentifier= — the Alloy relabel block keys off this.
  • Under Pallas, set PALLAS_LOG_STDOUT=1 in the unit's EnvironmentFile. Also set PALLAS_LOG_FILE=/dev/null to disable the rotating file sink (journald is already durable).

Example, from ouranos/ansible/kottos/kottos.service.j2:

[Service]
...
EnvironmentFile=/srv/kottos/.env
ExecStart=/srv/kottos/.venv/bin/pallas
StandardOutput=journal
StandardError=journal
SyslogIdentifier=kottos

And the matching Alloy relabel rule on puck:

loki.relabel "journal_puck" {
  forward_to = []
  rule {
    source_labels = ["__journal_syslog_identifier"]
    regex         = "kottos"
    target_label  = "service"
    replacement   = "pallas"
  }
  rule {
    source_labels = ["__journal_syslog_identifier"]
    regex         = "kottos"
    target_label  = "project"
    replacement   = "kottos"
  }
  // ...
}
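To see what those two rules do to a journal entry, here is a simplified Python model of Prometheus-style relabeling: the source label values are joined with ";", the regex must match the whole string, and on a match the target label is set. It is only a sketch for reasoning about the rules. It ignores capture-group substitution and every action other than the default replace, and the function and variable names are hypothetical.

```python
import re

# The two rules from the journal_puck block, expressed as plain dicts.
KOTTOS_RULES = [
    {"source_labels": ["__journal_syslog_identifier"], "regex": "kottos",
     "target_label": "service", "replacement": "pallas"},
    {"source_labels": ["__journal_syslog_identifier"], "regex": "kottos",
     "target_label": "project", "replacement": "kottos"},
]


def apply_relabel_rules(labels, rules):
    """Apply replace-style relabel rules in order and return the new label set."""
    out = dict(labels)
    for rule in rules:
        source = ";".join(out.get(name, "") for name in rule["source_labels"])
        # The regex is anchored: it has to match the entire source value.
        if re.fullmatch(rule["regex"], source):
            out[rule["target_label"]] = rule["replacement"]
    return out


relabeled = apply_relabel_rules({"__journal_syslog_identifier": "kottos"}, KOTTOS_RULES)
```

Units with a different SyslogIdentifier pass through untouched, which is why each new systemd-managed service needs its own pair of rules.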

Per-service reference

Mnemosyne (Docker compose on puck)

  • Logging config: mnemosyne/mnemosyne/mnemosyne/settings.py → LOGGING dict using pythonjsonlogger.json.JsonFormatter.
  • Component attribution: MNEMOSYNE_COMPONENT env var set per docker-compose service (init, app, mcp, worker). The settings module reads it into static_fields.component.
  • Health-filter: mnemosyne.log_filters.SuppressHealthAccessFilter on the access handler.
  • Metrics: /metrics on the nginx container (port 23181) — served by django-prometheus on the app container plus mcp_server.metrics (shared prometheus_client registry).
  • Scrape job: mnemosyne (see ouranos/ansible/pplg/prometheus.yml.j2).
  • Alerts: mnemosyne_alerts group in ouranos/ansible/pplg/alert_rules.yml.j2.

Pallas — Kottos (systemd on puck via Ansible role ouranos/ansible/kottos/)

  • Logging config: pallas/pallas/log.py → setup_logging() with PALLAS_LOG_STDOUT=1.
  • Component attribution: pallas.log.set_agent_component(name) is called by _start_agent() inside each agent's asyncio task, setting a contextvars.ContextVar that the _StaticFieldsFilter reads per record. Each agent (harper, scotty, research, tech_research) carries its own value without leaking across tasks.
  • Project attribution: pallas.log.set_project(deploy_name) is called once in main() from agents.yaml's name:. For Kottos this renders as project="kottos" on every record.
  • Deployed by: ansible-playbook kottos/deploy.yml (wired into site.yml).
  • Metrics: none today — Pallas is observed through logs only. Future phase will add a prometheus_client endpoint on the registry port for pallas_agent_requests_total{agent=…}, pallas_downstream_mcp_errors_total{server=…}.

Daedalus (Docker compose on puck)

  • Logging config: daedalus/backend/daedalus/logging.py → structlog JSON processor chain, already production-ready.
  • Component attribution: structlog.contextvars.bind_contextvars(service="daedalus", component="api") at app startup.
  • Health-filter: _SuppressHealthAccessFilter on uvicorn's access logger.
  • Metrics: /metrics on the api container (port 22181).
  • Scrape job: daedalus.
  • Alerts: daedalus_alerts group.

Useful LogQL queries

Once the pipeline is live, the "troubleshooting is a nightmare" problem becomes three-click queries in Grafana Explore:

# All Mnemosyne errors in the last 15m
{service="mnemosyne"} | json | level="ERROR"

# Everything Harper did in the last hour
{service="pallas", project="kottos", component="harper"} | json

# The infamous pallas.forward.trace stream (MCP transport failures)
{service="pallas", project="kottos"} |= "pallas.forward.trace"

# Cross-service trace of a single request (requires X-Request-Id propagation
# — not yet implemented; Phase 1.5 nice-to-have)
{environment="ouranos"} | json | request_id="<paste-id>"

# ERROR-level log rate in Daedalus by path (a proxy for 5xx spikes)
sum by (path) (rate({service="daedalus"} | json | level="ERROR" [5m]))

The Puck Services — Logs & Health dashboard in Grafana (/etc/grafana/provisioning/dashboards/puck.yaml → /var/lib/grafana/dashboards/puck_services.json) has these pre-wired as panels per service row.
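When queries are generated from scripts rather than typed into Explore, it can help to build the stream selector from the five schema labels and reject anything else, since fields such as level or request_id live in the JSON body and belong after | json. The helper below is a hypothetical sketch of that idea; it assumes label values contain no double quotes or backslashes and does no escaping.

```python
SCHEMA_LABELS = ("service", "component", "project", "hostname", "environment")


def stream_selector(**labels):
    """Build a LogQL stream selector from schema labels only."""
    unknown = set(labels) - set(SCHEMA_LABELS)
    if unknown:
        # Body fields (level, request_id, ...) must be filtered after | json.
        raise ValueError(f"not a stream label: {sorted(unknown)}")
    pairs = [f'{name}="{labels[name]}"' for name in SCHEMA_LABELS if name in labels]
    return "{" + ", ".join(pairs) + "}"


harper_query = stream_selector(service="pallas", project="kottos", component="harper") + " | json"
```

Label order inside the braces does not affect which streams match, so the helper simply emits them in schema order.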

Adding a new service

If you're adding a service to puck (or any Ouranos/Virgo host with this stack):

  1. Emit JSON to stdout with service/component as static fields. Copy Mnemosyne's settings pattern or Pallas's _StaticFieldsFilter.
  2. Pick a transport:
    • Docker compose → add the x-logging: &default-logging anchor + logging: *default-logging on each service. Done. No Alloy changes needed.
    • systemd → set SyslogIdentifier=<name> on the unit and add two relabel rules (one for service, one for project) to the host's loki.relabel "journal_<host>" block.
  3. Expose /metrics if the service is in Python — prometheus_client plus either django-prometheus or prometheus_fastapi_instrumentator.
  4. Add a scrape job in ouranos/ansible/pplg/prometheus.yml.j2 (parametrise the target — {{ <service>_metrics_host }}:{{ <service>_metrics_port }}) and wire the defaults into the host's host_vars.
  5. Add alerts in ouranos/ansible/pplg/alert_rules.yml.j2. At minimum: Down, HighErrorRate. Use the metric names the service actually exposes — no dead rules.
  6. Optional: add panels to the Puck Services dashboard JSON.

No new transport. No per-service Alloy block. No custom log format.