Robert Helewka ea37ab38c1 feat: add loop guard to halt repeated-identical tool call loops

Introduces `pallas.loop_guard` module that detects and halts agentic loops
where the same `(tool, args) → result` repeats consecutively, preventing
wasted LLM turns when upstream MCP servers return contradictory data.

- Add per-request `ToolRunnerHooks` tracking rolling tool-call signatures
- Halt loop after `loop_repeat_threshold` consecutive repeats (default 3)
- Collapse `max_iterations` on halt to terminate without further LLM call
- Append user-facing explanation to the turn with `stop_reason=endTurn`
- Expose `pallas_agent_loop_aborted_total{agent,reason}` counter
- Add per-agent `max_iterations` and `loop_repeat_threshold` config
- Document guard behavior, metric, and alerting query

2026-06-16 08:27:07 -04:00

Pallas — Technical Reference

Pallas is the generic runtime that turns fast-agent agent definitions into StreamableHTTP MCP servers. It is completely deployment-agnostic: all environment-specific values (agent names, ports, hosts, model) live in the calling project's configuration files, not in Pallas itself.

Solution Architecture

Pallas occupies the middle tier of a three-layer MCP architecture. It bridges a web-facing client (Daedalus) and a constellation of specialised downstream MCP servers.

┌──────────────────────────────────┐
│  Daedalus                        │  Web UI / FastAPI / MCP client
│  Workspace management, chat,     │  Discovers agents via registry
│  health monitoring, progress     │  Calls agent tools via MCP
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Pallas (FastAgent MCP Bridge)   │  Python runtime
│                                  │
│  ┌─ Registry  (port N)          │  GET /.well-known/mcp/server.json
│  ├─ Agent: Research  (port N+1) │  Chains, routers, sub-agents
│  ├─ Agent: Engineering (port N+2)│  Orchestrators, tool pipelines
│  └─ Agent: Orchestrator (N+3)   │  Delegates across agents
│                                  │
│  Each agent exposes:             │
│    • send_message tool           │
│    • get_health tool             │
│    • {agent}_history prompt      │
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Downstream MCP Servers          │
│                                  │
│  Argos        — web search       │
│  Neo4j        — knowledge graph  │
│  Mnemosyne    — content library  │
│  Kernos       — shell execution  │
│  Gitea        — repository mgmt  │
│  Grafana      — monitoring       │
│  Rommie       — system management│
└──────────────────────────────────┘

Daedalus → Pallas

Interaction	Mechanism
Agent discovery	`GET {registry}/.well-known/mcp/server.json` — plain HTTP, returns all agents with MCP endpoint URLs
Agent communication	MCP `tools/call` on `send_message` — text + optional images
Health monitoring	MCP `tools/call` on `get_health` — programmatic, no LLM invocation
Progress feedback	MCP `notifications/progress` — streamed over SSE during long-running tool calls
Conversation history	MCP `prompts/get` on `{agent}_history` — retrieves stored message history

Pallas → Downstream

Pallas agents call downstream MCP servers via standard MCP tool calls. Each agent declares its servers in its fast-agent definition (servers=["argos", "neo4j_cypher", ...]). The server URLs and auth headers are configured in the consuming project's fastagent.config.yaml.

Mnemosyne's Role

Mnemosyne provides a content-type-aware knowledge graph with hybrid search (vector + full-text + graph). Agents with mnemosyne in their servers list gain access to tools for searching documents, browsing libraries and collections, retrieving items, and traversing the concept graph. It complements Neo4j (graph topology and relationships) with content-focused retrieval and re-ranking.

Why MCP End-to-End

Pallas is the protocol boundary — MCP above (from Daedalus) and MCP below (to downstream servers). This eliminates any MCP→REST→MCP translation layer. A single fast.start_server(transport="http") call exposes a complete agent as a StreamableHTTP MCP endpoint, giving Daedalus:

Tool discovery via session.list_tools()
Native streaming via MCP Streamable HTTP / SSE
Health checks as ordinary tool calls — no separate API surface
Progress notifications built into the protocol

Pallas Internal Architecture

Pallas is built around four core modules, composed at startup (supporting modules are listed in the Module Reference):

server.py main()
  │
  ├─ _load_deployment_config()         parse agents.yaml
  ├─ _build_agents_table()             {name: (module, port)}
  ├─ _build_agent_deps()               dependency graph
  │
  ├─ _start_all()  or  _run_single()
  │    │
  │    ├─ _preflight()
  │    │    ├─ _register_unknown_models()   model registration
  │    │    └─ validate_llm_providers()     LLM API key + model checks
  │    │
  │    ├─ start subagents (depends_on)
  │    ├─ wait for subagent readiness
  │    ├─ start top-level agents
  │    │    │
  │    │    └─ _start_agent(name)
  │    │         ├─ import agent module
  │    │         ├─ MultimodalAgentMCPServer(...)
  │    │         ├─ _resolve_downstream_servers()
  │    │         ├─ _preflight_mcp_servers()     warn on missing auth
  │    │         ├─ register_health_tool()
  │    │         └─ server.run_async()
  │    │
  │    └─ run_registry()               Starlette app on registry port
  │
  └─ asyncio.run(...)

Module	Purpose
`pallas.server`	CLI entry point, configuration loading, agent lifecycle orchestration, model registration
`pallas.registry`	Starlette app serving `GET /.well-known/mcp/server.json` — builds the agent catalogue from `agents.yaml` + `fastagent.config.yaml`
`pallas.multimodal_server`	`MultimodalAgentMCPServer` — `AgentMCPServer` subclass adding image attachment support and conversation history prompts
`pallas.health`	Two-layer health: startup LLM preflight validation + runtime `get_health` MCP tool with downstream server probing

Installation

pip install git+ssh://git@git.helu.ca:22022/r/pallas.git

Or as a project dependency:

dependencies = [
    "pallas-mcp @ git+ssh://git@git.helu.ca:22022/r/pallas.git",
]

Requires Python ≥ 3.13. Key dependencies: fast-agent-mcp, httpx, pyyaml, starlette, uvicorn.

Project Layout

Pallas reads configuration from the working directory at runtime. A consuming project looks like:

my-project/
├── agents/
│   ├── __init__.py
│   └── jarvis.py              # FastAgent definitions
├── agents.yaml                # Deployment topology
├── fastagent.config.yaml      # FastAgent + model config
├── fastagent.secrets.yaml     # API keys (gitignored)
└── .env                       # Secret values (gitignored)

Pallas itself contains no agent definitions, model names, ports, or hostnames. Everything is injected by the consuming project.

Configuration Reference

`agents.yaml`

Single source of truth for deployment topology.

name: my-project               # log prefixes and registry names
version: "1.0.0"               # published in registry entries
host: my-host.example.com      # hostname for registry URLs
namespace: com.example.project  # reverse-domain prefix for registry names
registry_port: 8200             # port for the registry server

agents:
  jarvis:
    module: agents.jarvis       # importable Python module path
    port: 8201                  # StreamableHTTP port for this agent
    title: Jarvis               # human-readable name (registry)
    description: "My assistant agent" # one-line description (registry)
    depends_on: [research]      # optional: start these agents first

  research:
    module: agents.research
    port: 8250
    title: Research Agent
    description: "Web search and knowledge graph"

Field	Required	Description
`name`	yes	Project name — used in log prefixes (`[my-project]`) and CLI help
`version`	no	Semver string published in registry entries. Default: `"1.0.0"`
`host`	no	Hostname used in registry `remotes[].url`. Default: `"localhost"`
`namespace`	no	Reverse-domain prefix for registry `server.name` (e.g. `com.example.project/jarvis`)
`registry_port`	no	Port for the registry server. Default: `24200`
`agents.<name>.module`	yes	Importable Python module path containing a `fast` instance
`agents.<name>.port`	yes	Port for this agent's StreamableHTTP MCP server
`agents.<name>.title`	no	Display name in registry. Default: `name.title()`
`agents.<name>.description`	no	Description in registry
`agents.<name>.depends_on`	no	List of agent names that must start and become ready before this agent
`agents.<name>.max_iterations`	no	Hard cap on agentic-loop turns per `send_message`. Default: `15`. fast-agent returns a partial answer once exceeded
`agents.<name>.loop_repeat_threshold`	no	Halt the loop after this many consecutive identical `(tool, args) → result` rounds. Default: `3`. `0` disables the guard
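The defaults in the table can be summarised as a small resolution step. The sketch below is illustrative only: `apply_defaults` and the two constant names are hypothetical and are not taken from `pallas.server`. It assumes `agents.yaml` has already been parsed into a plain dict.

```python
from copy import deepcopy

# Defaults as documented in the field table above.
TOP_LEVEL_DEFAULTS = {"version": "1.0.0", "host": "localhost", "registry_port": 24200}
AGENT_DEFAULTS = {"max_iterations": 15, "loop_repeat_threshold": 3}


def apply_defaults(config: dict) -> dict:
    """Return a copy of a parsed agents.yaml mapping with documented defaults filled in."""
    if "name" not in config:
        raise ValueError("agents.yaml: 'name' is required")
    resolved = {**TOP_LEVEL_DEFAULTS, **deepcopy(config)}
    for agent_name, agent in resolved.get("agents", {}).items():
        for required in ("module", "port"):
            if required not in agent:
                raise ValueError(f"agents.{agent_name}.{required} is required")
        # The title falls back to the agent key, title-cased.
        agent.setdefault("title", agent_name.title())
        for key, value in AGENT_DEFAULTS.items():
            agent.setdefault(key, value)
    return resolved


parsed = {
    "name": "my-project",
    "agents": {"research": {"module": "agents.research", "port": 8250}},
}
resolved = apply_defaults(parsed)
print(resolved["registry_port"], resolved["agents"]["research"]["title"])
```

Working on a deep copy keeps the parsed input untouched, so the raw file contents can still be logged or compared against the resolved values.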

`fastagent.config.yaml` Extensions

Pallas reads two keys beyond the standard fast-agent config:

default_model: openai.my-model-name

model_capabilities:
  vision: false
  context_window: 200000
  max_output_tokens: 32000

Key	Description
`default_model`	`provider.model-name` format. The provider prefix (`anthropic` or `openai`) determines which LLM provider is active for health checks.
`model_capabilities.vision`	`true` registers the model with multimodal tokenization; `false` registers as text-only. Default: `false`
`model_capabilities.context_window`	Context window size in tokens. Default: `131072`
`model_capabilities.max_output_tokens`	Max output token limit. Default: `16384`

Capabilities are declared explicitly rather than inferred from the model name: naming conventions vary across model families, which makes regex heuristics brittle. These values are used both to register unknown models with fast-agent's ModelDatabase and to populate the registry response.
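As a rough illustration of how the two keys combine, the sketch below splits `default_model` into provider and model name and fills in the documented capability defaults. `ModelCapabilities` and `resolve_capabilities` are hypothetical names, and splitting on the first dot only is an assumption about how the prefix is separated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    provider: str
    model: str
    vision: bool = False            # documented default
    context_window: int = 131072    # documented default
    max_output_tokens: int = 16384  # documented default


def resolve_capabilities(config: dict) -> ModelCapabilities:
    """Combine default_model and model_capabilities from a parsed config mapping."""
    # Assumption: split on the first dot only, since model names may contain dots.
    provider, _, model = config["default_model"].partition(".")
    if not model:
        raise ValueError("default_model must use the provider.model-name format")
    declared = config.get("model_capabilities") or {}
    allowed = {"vision", "context_window", "max_output_tokens"}
    overrides = {k: v for k, v in declared.items() if k in allowed}
    return ModelCapabilities(provider=provider, model=model, **overrides)


caps = resolve_capabilities(
    {
        "default_model": "openai.my-model-name",
        "model_capabilities": {"vision": False, "context_window": 200000},
    }
)
print(caps)
```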

Sampling parameters (temperature, top_p, top_k)

Sampling parameters are configured per-agent in the Python decorator, not in agents.yaml or fastagent.config.yaml. Pallas itself does no sampling-param handling — this is pure fast-agent decorator-side configuration.

from fast_agent import FastAgent
from fast_agent.types import RequestParams

fast = FastAgent("Jeffrey", parse_cli_args=False)

@fast.agent(
    name="jeffrey",
    instruction="...",
    servers=[...],
    request_params=RequestParams(temperature=0.6, top_p=0.9),
)
async def _jeffrey():
    pass

Provider support varies:

Provider	temperature	top_p	top_k
OpenAI (native, Responses API)	yes	yes	no
HuggingFace, OpenResponses (OpenAI-compatible)	yes	yes	yes (via `extra_body`)
Google Gemini	yes	yes	yes
Bedrock	yes	yes (most models)	varies
Anthropic Claude Opus 4.7	no	no	no

Anthropic's 4.7 design moves away from low-level numeric dials toward adaptive control — fast-agent's Anthropic provider explicitly strips temperature/top_p/top_k for Opus 4.7 with a warning (see fast_agent/llm/provider/anthropic/llm_anthropic.py:1776-1786). On Opus 4.7, use output_config.effort (verbosity, including the new xhigh level between high and max) instead.

Setting request_params on an Anthropic-Opus-4.7 agent is a safe no-op — the params apply automatically the moment the agent is routed to a non-Anthropic model.

`fastagent.secrets.yaml`

anthropic:
  api_key: "${ANTHROPIC_API_KEY}"
openai:
  api_key: "${OPENAI_API_KEY}"
  base_url: "${OPENAI_BASE_URL}"

${ENV_VAR} placeholders are expanded at runtime from environment variables.
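The expansion itself happens inside the configuration loader, so nothing needs to be written by hand. For readers who want to see the idea, here is a minimal sketch. `expand_placeholders` is a hypothetical helper, and leaving an unset variable's placeholder untouched is an assumption, not documented behaviour.

```python
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env):
    """Recursively replace ${NAME} in strings found in nested dicts and lists."""
    if isinstance(value, str):
        # Assumption: an unset variable leaves the placeholder as written.
        return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {k: expand_placeholders(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_placeholders(v, env) for v in value]
    return value


secrets = {
    "anthropic": {"api_key": "${ANTHROPIC_API_KEY}"},
    "openai": {"api_key": "${OPENAI_API_KEY}", "base_url": "${OPENAI_BASE_URL}"},
}
env = {"ANTHROPIC_API_KEY": "sk-ant-example", "OPENAI_API_KEY": "sk-example"}
expanded = expand_placeholders(secrets, env)
print(expanded["openai"])
```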

`.env`

Pallas loads .env from the working directory into os.environ without overwriting existing variables. This supports both local development and systemd deployments:

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=http://my-llm-server:8080/v1

OPENAI_BASE_URL defaults to https://api.openai.com/v1 if unset. For local llama-cpp, vLLM, or other OpenAI-compatible servers, set it to their endpoint.
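The "without overwriting" rule is what lets a systemd unit's environment take priority over a checked-out `.env`. A minimal sketch of that rule follows. `load_dotenv` here is a hypothetical stand-in rather than Pallas's loader, and the parsing details (comments, quoting) are assumptions.

```python
import os
import tempfile
from pathlib import Path


def load_dotenv(path: Path, environ=os.environ) -> list:
    """Load KEY=VALUE lines into environ, keeping any value that is already set."""
    loaded = []
    if not path.is_file():
        return loaded
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        if key and key not in environ:  # existing variables win
            environ[key] = value
            loaded.append(key)
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text(
        "OPENAI_API_KEY=from-file\nOPENAI_BASE_URL=http://my-llm-server:8080/v1\n",
        encoding="utf-8",
    )
    fake_environ = {"OPENAI_API_KEY": "from-systemd"}  # value already set by the service
    loaded_keys = load_dotenv(env_file, fake_environ)

print(loaded_keys, fake_environ["OPENAI_API_KEY"])
```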

Environment Variables

Variable	Default	Purpose
`PALLAS_AGENTS_CONFIG`	`agents.yaml`	Override path to deployment config

Running Pallas

CLI

pallas                     # start all agents + registry
pallas --agent jarvis      # start a single agent (no registry)
python -m pallas.server    # equivalent to `pallas`

Startup Sequence

All agents mode (pallas):

Load agents.yaml, build agents table and dependency graph
Preflight — register unknown models with ModelDatabase, validate LLM provider API keys and model availability
Start the registry server on registry_port
Start subagents (agents listed in other agents' depends_on)
Wait for each subagent to become ready (HTTP probe on /mcp, 60s timeout)
Start top-level agents (everything not a subagent)
All servers run concurrently via asyncio.gather
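The subagent / top-level split in steps 4 to 6 can be derived from `depends_on` alone. The sketch below shows one way to do that. `partition_agents` is a hypothetical name, and it deliberately ignores nested dependencies (a subagent that itself depends on another agent), which the real lifecycle code may treat differently.

```python
def partition_agents(agents: dict) -> tuple:
    """Split agent names into (subagents, top_level) using depends_on."""
    subagents = set()
    for name, spec in agents.items():
        for dep in spec.get("depends_on", []):
            if dep not in agents:
                raise ValueError(f"{name} depends on unknown agent {dep!r}")
            subagents.add(dep)
    # Keep the declaration order from agents.yaml within each group.
    first = [n for n in agents if n in subagents]
    second = [n for n in agents if n not in subagents]
    return first, second


agents = {
    "jarvis": {"module": "agents.jarvis", "port": 8201, "depends_on": ["research"]},
    "research": {"module": "agents.research", "port": 8250},
}
subagents, top_level = partition_agents(agents)
print(subagents, top_level)
```

With the example topology from `agents.yaml`, `research` lands in the first group and `jarvis` in the second, so `jarvis` only starts once `research` has passed its readiness probe.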

Single agent mode (pallas --agent <name>):

Load agents.yaml
Preflight
Start the named agent (no registry, no dependency resolution)

Per-Agent Startup

For each agent:

Import the agent module (agents.<name>) and obtain its fast instance
Enter fast.run() context — initialises the fast-agent runtime
Create a MultimodalAgentMCPServer wrapping the primary agent instance
Resolve downstream MCP server configs from the fast-agent configuration
Warn if any downstream auth headers reference unset environment variables
Register the get_health MCP tool with downstream server info
Bind to 0.0.0.0:<port> and serve StreamableHTTP

Daedalus Integration

This section describes the contract from Pallas's perspective. The full client-side specification is in docs/pallas_integration.md.

Registration Flow

Daedalus stores a registry URL (e.g. http://puck.incus:23030)
Fetches GET {url}/.well-known/mcp/server.json
Discovers all agents with their MCP endpoint URLs, titles, and descriptions
Creates connections to each agent
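On the client side, the discovery step amounts to reading the registry document and picking out each agent's Streamable HTTP URL. The sketch below parses a response body only; fetching it over HTTP is left out. `agents_from_registry` is a hypothetical helper, not Daedalus code.

```python
import json


def agents_from_registry(document: str) -> dict:
    """Map registry server names to title, description and first streamable-http URL."""
    agents = {}
    for entry in json.loads(document).get("servers", []):
        server = entry["server"]
        urls = [
            remote["url"]
            for remote in server.get("remotes", [])
            if remote.get("type") == "streamable-http"
        ]
        if not urls:
            continue  # nothing a Streamable HTTP client could connect to
        agents[server["name"]] = {
            "title": server.get("title", ""),
            "description": server.get("description", ""),
            "url": urls[0],
        }
    return agents


sample = json.dumps(
    {
        "servers": [
            {
                "server": {
                    "name": "com.example.project/jarvis",
                    "title": "Jarvis",
                    "description": "My assistant",
                    "remotes": [
                        {"type": "streamable-http", "url": "http://my-host.example.com:8201/mcp"}
                    ],
                }
            }
        ]
    }
)
agents = agents_from_registry(sample)
print(agents)
```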

Health Polling

Daedalus calls get_health on each connected agent at a configurable interval (default 60s). The response maps to UI indicators:

`status`	Daedalus behaviour
`ok`	Green badge, normal operation
`degraded`	Yellow badge + warning banner showing `message`. Chat allowed.
`error`	Red badge. Chat disabled.

Progress Notifications

Long-running agent tool calls (agentic loops, sub-agent delegation) emit MCP notifications/progress on the SSE stream. Daedalus must include a progressToken in the _meta of tools/call requests to opt in:

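from uuid import uuid4

# `session` is an already-initialised MCP client session for the agent endpoint.
# The progressToken in _meta opts this call in to notifications/progress.
result = await session.call_tool(
    "jarvis",
    arguments={"message": user_input},
    request_params={"_meta": {"progressToken": str(uuid4())}},
)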

Progress notification fields:

Field	Description
`progressToken`	Matches the token sent in the request
`progress`	Monotonically increasing step counter
`total`	`null` = indeterminate (loop in progress), `1.0` = sub-task finished
`message`	Status text: `{server}/{tool}: started|completed|failed` or `{agent} step N (llm|tool)`

Without a progressToken, Pallas skips all progress notifications and the client receives nothing until the final result.

Chat Blocking

If the target agent's cached health is error, Daedalus returns HTTP 503 and disables the message input. degraded shows a warning but allows chat.

Registry Server

Endpoint

GET http://{host}:{registry_port}/.well-known/mcp/server.json

Plain HTTP — not MCP. No authentication. Returns application/json.

Response Structure

Built dynamically from agents.yaml + fastagent.config.yaml:

{
  "servers": [
    {
      "server": {
        "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
        "name": "com.example.project/jarvis",
        "title": "Jarvis",
        "description": "My assistant agent",
        "version": "1.0.0",
        "remotes": [
          { "type": "streamable-http", "url": "http://my-host.example.com:8201/mcp" }
        ],
        "capabilities": {
          "model": "my-model-name",
          "vision": false,
          "context_window": 200000,
          "max_output_tokens": 32000
        }
      },
      "_meta": {
        "io.modelcontextprotocol.registry/official": {
          "status": "active",
          "updatedAt": "2026-01-01T00:00:00Z",
          "isLatest": true
        }
      }
    }
  ]
}

Registry Name Construction

{namespace}/{slug} — where slug is the agent key with underscores replaced by hyphens. Example: namespace com.example.project + agent key tech_research → com.example.project/tech-research.
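The naming rule is simple enough to express in two lines. The helpers below are an illustrative sketch with hypothetical names; the URL shape is copied from the `remotes[].url` value in the response example above.

```python
def registry_name(namespace: str, agent_key: str) -> str:
    """Build {namespace}/{slug}, where slug is the agent key with '_' replaced by '-'."""
    return f"{namespace}/{agent_key.replace('_', '-')}"


def remote_url(host: str, port: int) -> str:
    """Streamable HTTP endpoint in the form shown in the registry response example."""
    return f"http://{host}:{port}/mcp"


name = registry_name("com.example.project", "tech_research")
url = remote_url("my-host.example.com", 8201)
print(name, url)
```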

Capabilities

If model_capabilities is defined in fastagent.config.yaml, each registry entry includes a capabilities object with model name, vision support, context window, and max output tokens. This allows clients to make informed decisions about what an agent can handle.

Multimodal Support

MultimodalAgentMCPServer extends fast-agent's AgentMCPServer with image attachment support.

`send_message` Tool

Each agent's MCP tool accepts:

Parameter	Type	Required	Description
`message`	`str`	yes	Text message to the agent
`images`	`list[dict]`	no	Base64-encoded images: `[{"data": "...", "mime_type": "image/png"}]`

When images is provided, the message is sent as a PromptMessageExtended containing both TextContent and ImageContent parts — the agent's underlying model must support vision.
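A client has to base64-encode each image before placing it in `images`. The sketch below builds the tool arguments in the shape given in the parameter table. `build_send_message_args` is a hypothetical helper, and the eight-byte PNG signature stands in for real image data.

```python
import base64
from typing import Optional


def build_send_message_args(message: str, images: Optional[list] = None) -> dict:
    """Build tool arguments; images is a list of (raw_bytes, mime_type) pairs."""
    args = {"message": message}
    if images:
        args["images"] = [
            {"data": base64.b64encode(raw).decode("ascii"), "mime_type": mime}
            for raw, mime in images
        ]
    return args


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a complete image
args = build_send_message_args("What is in this picture?", [(PNG_SIGNATURE, "image/png")])
print(args["images"][0]["mime_type"])
```

Omitting the `images` key entirely for text-only messages keeps the payload identical to what a non-multimodal client would send.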

Conversation History Prompt

For agents with instance_scope != "request", a {agent}_history prompt is registered that returns the full conversation history as FastMCP Message objects. This allows clients to retrieve the stored context.

Bearer Token Propagation

The server captures the authenticated bearer token from the incoming MCP request's Authorization: Bearer … header via fastmcp.server.dependencies.get_http_request() (FastMCP's get_access_token() returns None because Pallas runs without the auth middleware). Two consumers read it:

LLM-provider passthrough — the token is also pushed into the request_bearer_token ContextVar for the agent's LLM provider key manager to pick up automatically (used by HuggingFace and any other token-passthrough providers). The ContextVar works here because the LLM call runs in a child task of the request handler.
Downstream MCP servers (opt-in) — outgoing MCP calls inherit the same bearer when the downstream server is marked forward_inbound_auth: true in fastagent.config.yaml. Without that flag, the inbound bearer is not forwarded to MCP transport calls — server_config.headers is the only header source.

The forwarding is per-server so a FastAgent attached to both a credentialed downstream (e.g. Mnemosyne) and an unrelated public server doesn't leak the bearer to the latter.

Why a simple ContextVar forward isn't enough

fast-agent's MCPConnectionManager runs each downstream transport inside a long-lived anyio.TaskGroup created at manager startup. TaskGroup.start_soon snapshots the owner's contextvars.Context at spawn time — the request-handler's context is invisible to the transport task. A straight request_bearer_token.get() inside _prepare_headers_and_auth therefore always resolves to None even when the inbound handler has set the token a few frames up. The persistent connection is additionally reused across requests, so the first-call context (often empty) would be cached forever.
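The effect is easy to reproduce in isolation. The sketch below uses plain `asyncio` tasks rather than anyio or fast-agent, with a hypothetical `transport_worker` standing in for the persistent transport task; only the `request_bearer_token` name comes from the text above. A task created before the request arrives keeps its own copy of the context, so a later `set()` in the handler never reaches it.

```python
import asyncio
import contextvars

request_bearer_token = contextvars.ContextVar("request_bearer_token", default=None)


async def transport_worker(inbox: asyncio.Queue, seen: list) -> None:
    """Long-lived task standing in for a persistent downstream transport."""
    while True:
        job = await inbox.get()
        if job is None:
            return
        # Reads this task's own context copy, taken when the task was created.
        seen.append((job, request_bearer_token.get()))


async def handle_request(inbox: asyncio.Queue, token: str) -> None:
    request_bearer_token.set(token)  # visible in the handler's context only
    await inbox.put(f"call-with-{token}")


async def main() -> list:
    inbox, seen = asyncio.Queue(), []
    worker = asyncio.create_task(transport_worker(inbox, seen))  # spawned at "startup"
    await asyncio.create_task(handle_request(inbox, "token-A"))
    await inbox.put(None)  # shut the worker down
    await worker
    return seen


seen = asyncio.run(main())
print(seen)
```

The worker records `None` for the token even though the handler set it before queueing the job.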

Pallas works around this in pallas._fastagent_patch by maintaining a process-wide _pending_bearers registry keyed by id(server_config). multimodal_server.send_message calls publish_bearer(cfg, token) for every opted-in downstream the agent is allowed to reach; the patched _prepare_headers_and_auth looks it up there (with the ContextVar as a fallback for non-persistent probe paths); and the request handler's finally block calls revoke_bearer(cfg) to clear the entry. Per-request bearers therefore survive the task-group boundary without any mutation of shared config.

Example:

mcp:
  servers:
    mnemosyne:
      transport: http
      url: "https://mnemosyne.example/mcp/"
      forward_inbound_auth: true   # inbound bearer rides outbound
    weather:
      transport: http
      url: "https://weather.example/mcp/"
      # no flag → outbound calls go unauthenticated

When the agent receives a request with Authorization: Bearer X, mnemosyne sees Authorization: Bearer X on the outbound call and weather sees no Authorization header. If mnemosyne.headers.Authorization is set explicitly, the explicit header wins and the inbound bearer does not replace it.
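The precedence can be written down as a small pure function. This is a sketch of the rule as described here, not the patched `_prepare_headers_and_auth`: `outbound_authorization` is a hypothetical name, server blocks are modelled as plain dicts, and the `pinned` server is invented to show the explicit-header case.

```python
from typing import Optional


def outbound_authorization(server: dict, inbound_bearer: Optional[str]) -> Optional[str]:
    """Pick the Authorization header value for one downstream server block."""
    explicit = (server.get("headers") or {}).get("Authorization")
    if explicit:
        return explicit  # an explicitly configured header always wins
    if server.get("forward_inbound_auth") and inbound_bearer:
        return f"Bearer {inbound_bearer}"
    return None  # no flag and no header: the call goes out unauthenticated


servers = {
    "mnemosyne": {"url": "https://mnemosyne.example/mcp/", "forward_inbound_auth": True},
    "weather": {"url": "https://weather.example/mcp/"},
    # Hypothetical server with both the flag and an explicit header.
    "pinned": {"forward_inbound_auth": True, "headers": {"Authorization": "Bearer static"}},
}
resolved = {name: outbound_authorization(cfg, "X") for name, cfg in servers.items()}
print(resolved)
```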

Health System

Two-layer health checking: startup preflight validates LLM providers before agents launch, and a runtime get_health tool reports ongoing status.

Startup Preflight

Runs once before any agents start. Validates all LLM providers that have API keys configured.

Provider	Active (default_model matches)	Key set, not active
Anthropic	`GET /v1/models/{model}` — confirms model exists and key is valid	`GET /v1/models/claude-sonnet-4-5` — verifies API access
OpenAI	`GET {base_url}/models` — lists models, confirms configured model is present	`GET {base_url}/models` — lists available models

Warn-only — never blocks startup. Agents start regardless.
5-second timeout per provider API call.
Loads .env before checking.

Runtime `get_health` Tool

Registered on each agent's MCP server. Checks:

Downstream MCP servers — sends an MCP initialize handshake to each server URL. Uses initialize because it is the only MCP method that works without a pre-established session. After success, sends DELETE with the returned Mcp-Session-Id to tear down the session cleanly. 3-second timeout.
Active LLM provider — includes the preflight result for the provider that default_model points to. Only the active provider affects health status.

Response Format

{ "status": "ok", "timestamp": "2026-01-01T00:00:00Z" }

{
  "status": "degraded",
  "timestamp": "2026-01-01T00:00:00Z",
  "message": "Unreachable: neo4j_cypher; LLM: openai: model 'bad-model' not found"
}

Status	Meaning
`ok`	All downstream servers reachable and active LLM provider healthy
`degraded`	One or more downstream servers unreachable, or active LLM provider failed
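The two documented statuses follow from a simple aggregation over probe results. The sketch below reproduces the response shape shown above. `aggregate_health` is a hypothetical name, and joining several unreachable servers with a comma is an assumption, since the example only shows one.

```python
from datetime import datetime, timezone
from typing import Optional


def aggregate_health(downstream: dict, llm_error: Optional[str] = None) -> dict:
    """Fold per-server probe results and the active LLM check into one report."""
    problems = []
    unreachable = sorted(name for name, ok in downstream.items() if not ok)
    if unreachable:
        problems.append("Unreachable: " + ", ".join(unreachable))
    if llm_error:
        problems.append("LLM: " + llm_error)
    report = {
        "status": "degraded" if problems else "ok",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if problems:
        report["message"] = "; ".join(problems)
    return report


healthy = aggregate_health({"argos": True, "neo4j_cypher": True})
degraded = aggregate_health(
    {"argos": True, "neo4j_cypher": False},
    llm_error="openai: model 'bad-model' not found",
)
print(degraded["message"])
```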

Loop Guard

A small model occasionally gets stuck emitting the identical tool call every iteration, usually because an upstream MCP server returned a contradictory or malformed result that it keeps trying to reconcile. Left alone, the loop burns LLM turns and context until the client times out and the user sees empty_response.

pallas.loop_guard installs per-request ToolRunnerHooks (composed on top of the assistant-stream hooks) that track a rolling signature of (tool, normalized_args) → result_hash. When the same signature repeats loop_repeat_threshold times consecutively (default 3), the loop is halted immediately — the runtime does not ask the model to troubleshoot, because the fault is almost always upstream and self-recovery is slow, unpredictable, and token-hungry. On halt it:

collapses the request's max_iterations to the current iteration, so fast-agent's own _iteration > max_iterations check terminates the turn after the current tool result with no further LLM call;
appends an honest, user-facing explanation to the returned turn (and sets stop_reason = endTurn) so the client gets a real message instead of an empty/truncated one;
logs the offending tool, arguments, and result at WARNING (event=loop_halt in pallas.loop_guard) so the upstream bug can be fixed durably; and
increments pallas_agent_loop_aborted_total{reason="repeat"}.

This fires well before the max_iterations cap (a 3-round repeat halts within ~3 turns regardless of the configured ceiling), which is the point: the cap is a backstop, the guard is the fast path. Set loop_repeat_threshold: 0 on an agent to disable it.
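The detection step can be sketched independently of fast-agent's hook machinery. In the example below, `RepeatDetector` is a hypothetical class; canonical JSON as the argument normalisation and SHA-256 as the hash are assumptions, not a description of what `pallas.loop_guard` does internally.

```python
import hashlib
import json


def _digest(value) -> str:
    # sort_keys makes argument order irrelevant; default=str tolerates odd types.
    canonical = json.dumps(value, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RepeatDetector:
    """Counts consecutive identical (tool, args) -> result rounds."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._last = None
        self._streak = 0

    def observe(self, tool: str, args: dict, result) -> bool:
        """Record one round; return True when the loop should be halted."""
        if self.threshold <= 0:  # 0 disables the guard
            return False
        signature = (tool, _digest(args), _digest(result))
        self._streak = self._streak + 1 if signature == self._last else 1
        self._last = signature
        return self._streak >= self.threshold


detector = RepeatDetector(threshold=3)
rounds = [
    ("search", {"q": "x", "limit": 5}, "no rows"),
    ("search", {"limit": 5, "q": "x"}, "no rows"),  # same args, different key order
    ("search", {"q": "x", "limit": 5}, "no rows"),
]
verdicts = [detector.observe(*r) for r in rounds]
print(verdicts)
```

Because the result hash is part of the signature, a tool that is called repeatedly with the same arguments but returns different data each time (polling, for instance) does not trip the guard.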

Metrics

Pallas exposes Prometheus metrics for scraping and alerting. One scrape target per Pallas deployment is sufficient — all agents run as coroutines in a single process under asyncio.gather, so metrics are process-global.

Endpoint

GET http://{host}:{registry_port}/metrics

Plain HTTP, unauthenticated, served by the same Starlette app that hosts the registry. Returns Prometheus text exposition format (text/plain; version=0.0.4).

The same metrics snapshot is also available on each agent's own port at {host}:{agent_port}/metrics. Scraping the registry endpoint is the recommended default; the per-agent endpoints exist for cases where a load balancer terminates per-backend.

Scrape Config

scrape_configs:
  - job_name: pallas
    static_configs:
      - targets: ['my-host.example.com:8200']    # registry_port
        labels:
          deployment: my-project

Metrics Reference

Metric	Type	Labels	Description
`pallas_up`	gauge	—	`1` while the Pallas process is running
`pallas_agent_info`	gauge	`agent`, `port`	`1` per configured agent — useful as a label join source
`pallas_send_message_total`	counter	`agent`, `outcome`	`send_message` MCP calls. `outcome` ∈ `ok`/`error`
`pallas_send_message_duration_seconds`	histogram	`agent`	End-to-end MCP `send_message` wall-clock duration
`pallas_llm_turns_total`	counter	`agent`, `model`	LLM provider round-trips per agent/model
`pallas_llm_tokens_total`	counter	`agent`, `model`, `kind`	Tokens consumed. `kind` ∈ `input`/`output`/`cache_read`/`cache_write`/`cache_hit`/`reasoning`
`pallas_tool_calls_total`	counter	`agent`, `server`, `operation`, `outcome`	Downstream MCP operations dispatched by fast-agent's aggregator. `operation` is the fast-agent operation type (`tool`, `prompt`, `resource`, …); `outcome` ∈ `ok`/`error`
`pallas_tool_call_duration_seconds`	histogram	`agent`, `server`, `operation`	Downstream MCP operation duration
`pallas_downstream_up`	gauge	`agent`, `server`	`1` when the named downstream MCP server passed the last `get_health` probe
`pallas_llm_provider_up`	gauge	`provider`	`1` when the active LLM provider passed its last preflight or runtime re-probe
`pallas_agent_health_status`	gauge	`agent`	Aggregate from the last `get_health`: `1`=ok, `0.5`=degraded, `0`=error
`pallas_agent_loop_aborted_total`	counter	`agent`, `reason`	Agentic loops force-stopped by a runtime guard. `reason` ∈ `repeat` (identical-tool-call loop detected)

Standard process metrics (RSS, CPU, GC, open FDs) are emitted by prometheus-client's default collectors on the same endpoint.

Where the Numbers Come From

send_message metrics — captured around the MCP send_message handler in pallas.multimodal_server. The duration spans the full agentic loop, including all sub-agent and tool-call latency.
LLM token metrics — read from fast-agent's UsageAccumulator on the request-scoped agent instance before disposal. Each request's accumulator is fresh, so every recorded turn is genuinely new — no double-counting across requests.
Downstream tool call metrics — recorded in the pallas._fastagent_patch wrapper around MCPAggregator._execute_on_server. This catches every dispatch (tools, prompts, resources) and is independent of which downstream server it lands on. Failures still surface in the counter as outcome="error" and full tracebacks remain in pallas.forward.trace log records.
Health gauges — updated as a side effect of every get_health MCP call. Daedalus's polling cadence (default 60 s) therefore drives gauge freshness. The LLM gauge is also set at startup preflight and on the TTL re-probe inside get_health.

Useful Queries

# Error rate per agent
sum by (agent) (rate(pallas_send_message_total{outcome="error"}[5m]))
  / sum by (agent) (rate(pallas_send_message_total[5m]))

# p95 send_message latency per agent
histogram_quantile(0.95,
  sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[5m]))
)

# Token rate per model and kind (tokens/s, averaged over 1h)
sum by (model, kind) (rate(pallas_llm_tokens_total[1h]))

# Cache hit ratio (Anthropic)
sum(rate(pallas_llm_tokens_total{kind="cache_read"}[5m]))
  / sum(rate(pallas_llm_tokens_total{kind=~"input|cache_read|cache_write"}[5m]))

# Any downstream MCP server unreachable
min by (server) (pallas_downstream_up) == 0

# Active LLM provider down
pallas_llm_provider_up == 0

Suggested Alerts

Alert	Expression	Notes
Pallas process down	`up{job="pallas"} == 0` for 1m	Scrape failure
Active LLM unreachable	`pallas_llm_provider_up == 0` for 5m	Preflight or TTL re-probe failing
Downstream MCP unreachable	`pallas_downstream_up == 0` for 10m	Per-server; gauge updates on each `get_health`
Agent error rate elevated	`sum by (agent) (rate(pallas_send_message_total{outcome="error"}[10m])) / sum by (agent) (rate(pallas_send_message_total[10m])) > 0.1`	>10% errors over 10 min
Latency regression	`histogram_quantile(0.95, sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[10m]))) > 60`	p95 over 60 s
Token burn	`sum(rate(pallas_llm_tokens_total{kind="output"}[1h])) > N`	Set N to your budget
Agent loop halted	`increase(pallas_agent_loop_aborted_total[15m]) > 0`	A repeated-tool-call loop was force-stopped — investigate the upstream tool/data

Model Registration

Pallas registers models not in fast-agent's built-in ModelDatabase at startup, using the explicit capability declarations from fastagent.config.yaml.

The process:

Read default_model and model_capabilities from config
Extract the model name (portion after the provider prefix dot)
Check if ModelDatabase already knows this model — if so, skip
Register with ModelDatabase.register_runtime_model_params():
- vision: true → multimodal tokenization (QWEN_MULTIMODAL)
- vision: false → text-only tokenization (TEXT_ONLY)
- context_window and max_output_tokens from config (with sensible defaults)

This avoids the brittle pattern of inferring capabilities from model name substrings, which breaks for custom or fine-tuned models with non-standard names.

Module Reference

Module	File	Purpose
`pallas.server`	`server.py`	CLI entry point (`pallas` command), configuration loading, agent lifecycle orchestration, dependency ordering, model registration
`pallas.registry`	`registry.py`	Starlette app serving `GET /.well-known/mcp/server.json` — agent catalogue built from config
`pallas.multimodal_server`	`multimodal_server.py`	`MultimodalAgentMCPServer` — extends `AgentMCPServer` with image support, conversation history prompts, bearer token propagation
`pallas.health`	`health.py`	LLM provider preflight validation, downstream MCP server probing, `get_health` tool registration
`pallas.loop_guard`	`loop_guard.py`	Per-request `ToolRunnerHooks` that halt the agentic loop on repeated-identical tool calls
`pallas.log`	`log.py`	JSON log configuration, third-party traceback capture, Rich-TUI-safe handler attachment
`pallas._fastagent_patch`	`_fastagent_patch.py`	Monkey-patches fast-agent at import time: per-request bearer forwarding via `httpx.Auth`, diagnostic trace-capture wrappers around `send_request` / `session.call_tool` / `_execute_on_server`

Incidents & Lessons Learned

The Pallas↔Mnemosyne bearer-forwarding rollout surfaced a chain of bugs that ranged from "obvious in hindsight" to "you have to go read the fast-agent source to see why". None of the individual symptoms pointed at the true cause — each had a plausible scapegoat — which is why the actual fix was to install structured diagnostics first and work the problem end-to-end. This section captures the findings so the next person to touch this code (likely future me) does not have to re-derive them.

1. Per-request bearer across an `anyio.TaskGroup` boundary

Symptom. Per-turn JWTs minted by Daedalus and sent as Authorization: Bearer … to Pallas never reached Mnemosyne; Mnemosyne saw either no Authorization header at all, or — worse, intermittently — a bearer from a previous turn against an unrelated workspace.

Cause. fast-agent's MCPConnectionManager runs each downstream transport inside a long-lived anyio.TaskGroup created at manager startup. TaskGroup.start_soon snapshots the owner's contextvars.Context at spawn time, so any request_bearer_token.set(…) done in the request handler a few frames up is invisible to the transport task. The persistent connection additionally caches its handshake context — so the bearer observed on the first call (often empty during a health-probe-triggered warm-up) gets reused forever.

Why the first attempt didn't help. We initially set the bearer via a contextvars.ContextVar and tried to have _prepare_headers_and_auth read it. It almost works — until any reconnect, retry, or persistent stream, at which point the cached snapshot wins.

Fix (pallas._fastagent_patch). Maintain a process-wide _pending_bearers: dict[int, str] keyed by id(server_config), guarded by a threading.Lock. multimodal_server.send_message calls publish_bearer(cfg, token) for every opted-in downstream before spawning any tool call; the patched _prepare_headers_and_auth pulls the token from the registry (ContextVar used as a fallback for non-persistent probe paths); a finally in the request handler calls revoke_bearer(cfg) to clear the entry. Per-request bearers therefore survive the task-group boundary without mutating any shared config object.

Bonus gotcha. The opt-in was originally keyed off a custom forward_inbound_auth: true field on the server block, read via fast-agent's pydantic config model. Pydantic's nested-model validation silently dropped unknown keys, so the flag never appeared on the parsed config. Workaround: scan fastagent.config.yaml directly for the flag at module import time (pallas._fastagent_patch._FORWARD_SERVERS) rather than rely on the parsed config object.

Bonus gotcha 2. httpx caches auth handshake headers on persistent connections. A plain mutation of `server_config.headers["Authorization"]` in the request handler only affects new connections. The forwarding patch therefore does not mutate headers; it supplies a custom `httpx.Auth` subclass (`_DynamicBearerAuth`) that looks up the bearer on every request. The override is `auth_flow`, the generic generator that httpx drives from both its sync and async clients, rather than `async_auth_flow`.

2. `install()` idempotency shadowing newly-added patches

Symptom. After adding two new diagnostic monkey-patches (_patch_session_call_tool, _patch_execute_on_server) and reinstalling pallas-mcp into the Kottos venv, the trace-capture records refused to appear in pallas.log. Four repro cycles, five log rotations, no evidence that the new code was running.

Cause. install() had a single top-level guard on _prepare_headers_and_auth._pallas_forward_patched. Once the bearer-forwarding patch was applied on first import, every subsequent install() call returned early — skipping the three later _patch_*() helpers entirely. The patches were present in the installed file; they were never executed.

Lesson. A shared idempotency guard at the top of an install()-style function is a liability as soon as the function grows past one patch. The fix (commit 082b611) moves each patch's guard to a per-target sentinel attribute on the target method (target._pallas_trace_patched = True), checked inside each helper. install() now calls every helper unconditionally; duplicate installs are cheap and harmless.
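The per-target guard pattern can be sketched like this. Only the `_pallas_trace_patched` sentinel name comes from the text above; `FakeSession`, `_patch_with_trace`, and the `calls` list are hypothetical stand-ins for the real targets and trace sink.

```python
import functools


class FakeSession:
    """Hypothetical stand-in for the fast-agent classes being patched."""

    def call_tool(self, name):
        return f"called {name}"

    def send_request(self, payload):
        return f"sent {payload}"


def _patch_with_trace(owner, method_name, calls):
    """Wrap owner.method_name once; the sentinel lives on the installed wrapper."""
    target = getattr(owner, method_name)
    if getattr(target, "_pallas_trace_patched", False):
        return  # this particular target is already wrapped

    @functools.wraps(target)
    def wrapper(*args, **kwargs):
        calls.append(method_name)  # stand-in for real trace capture
        return target(*args, **kwargs)

    wrapper._pallas_trace_patched = True
    setattr(owner, method_name, wrapper)


def install(calls):
    """No top-level guard: every helper runs and checks its own target."""
    _patch_with_trace(FakeSession, "call_tool", calls)
    _patch_with_trace(FakeSession, "send_request", calls)
```

Calling `install()` twice leaves each method wrapped exactly once, and a helper added later is never skipped because an older patch has already been applied.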

Bonus gotcha. install() runs at module-import time, which in Pallas happens before pallas.log.setup_logging() attaches the file handler. Any logger.info("patch installed") inside install() is emitted into the default handler and lost. "No 'patch installed' line in the log" is not evidence that the patch didn't install — only the runtime firing of the wrapper (e.g. forward.applied …) is a reliable presence marker.
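The ordering problem is easy to demonstrate with the standard library alone. In this sketch `ListHandler` and the logger name `pallas_demo.install` are hypothetical; the handler stands in for the file sink that `setup_logging()` attaches later.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical sink that records messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("pallas_demo.install")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo independent of root configuration

# Import time: no handler is attached yet, so this INFO record goes nowhere.
logger.info("patch installed")

# Later: the sink is attached (what a setup_logging()-style call would do).
sink = ListHandler()
logger.addHandler(sink)

# Runtime: the wrapper fires after the handler exists, so it is captured.
logger.info("forward.applied")
```

Only the second message reaches the sink, which is why a runtime marker is the reliable signal and an install-time line is not.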

3. FastMCP `on_call_tool` context shape: `message.name`, not `message.params.name`

Symptom. Once bearer forwarding worked, Harper's Mnemosyne tool calls came back to fast-agent as the literal string "object NoneType can't be used in 'await' expression". The tool result was visible in the OpenAI request payload of the next turn as {"role":"tool", "content":"object NoneType can't be used in 'await' expression"}. No traceback anywhere in Pallas or Mnemosyne.

Cause. mnemosyne/mcp_server/auth.py:MCPAuthMiddleware._extract_tool_name read context.message.params.name, but inside an on_call_tool hook FastMCP's MiddlewareContext[CallToolRequestParams] exposes .name and .arguments directly on context.message — the type parameter is already the params object. The extractor always returned None, which:

- silently skipped the `_PUBLIC_TOOLS = {"get_health"}` bypass so even the public health probe went through JWT validation; and
- made the per-tool ACL `token.can_use_tool(None)` short-circuit.

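The NoneType await error string itself came from somewhere downstream of the middleware, which still unconditionally awaited `call_next(context)`. That message is what Python raises when `await` is applied to an expression that evaluates to None, so the most likely path is a FastMCP dispatch step keyed on the None tool name that produced None where an awaitable was expected. (Calling None directly, as in `await self._tools.get(None)(...)`, would instead fail with "'NoneType' object is not callable".)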

Fix (mnemosyne commit e0fa825). Read context.message.name directly; fall back to message.params.name only as a legacy safety net. Verified against fastmcp's own Middleware.on_call_tool signature (MiddlewareContext[mt.CallToolRequestParams]) and four independent docs examples.
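A minimal sketch of the corrected lookup order, assuming only the attribute shapes described above. `extract_tool_name` is an illustrative free function, not the actual `_extract_tool_name` method, and the `SimpleNamespace` objects below merely imitate the two context shapes.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_tool_name(context: Any) -> Optional[str]:
    """Prefer message.name; fall back to message.params.name as a legacy path."""
    message = getattr(context, "message", None)
    name = getattr(message, "name", None)
    if name is None:
        # Legacy safety net for a message that still nests the params object.
        params = getattr(message, "params", None)
        name = getattr(params, "name", None)
    return name


# Shape seen inside on_call_tool: the message is already the params object.
current_shape = SimpleNamespace(
    message=SimpleNamespace(name="get_health", arguments={})
)
# Shape the old extractor assumed.
legacy_shape = SimpleNamespace(
    message=SimpleNamespace(params=SimpleNamespace(name="search"))
)
```

With this order, `extract_tool_name(current_shape)` yields the tool name directly, and the old nested shape still resolves through the fallback.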

Diagnostic helper. The commit also added _call_next_with_trace around await call_next(context) so any future exception inside FastMCP dispatch is captured with a full logger.exception traceback before propagating — and so the success path logs the result type, which doubles as a canary for "the middleware actually ran".

4. Rich-TUI corruption by DEBUG-level third-party loggers

Symptom. fast-agent go in an interactive session was unusable: massive blobs of plain-text DEBUG:openai._base_client:Sending HTTP Request: … and DEBUG:sse_starlette.sse:chunk: … lines splattered over the Rich chat UI on every redraw.

Cause. Two layers stacked up:

- Pallas's original `setup_logging()` set the root logger to whatever `logger.level` was configured. With `logger.level: debug` in `kottos/fastagent.config.yaml` (set intentionally for Pallas diagnostics), every third-party library inherited DEBUG and started emitting.
- Pallas attached a `StreamHandler(stream=sys.__stderr__)` to both root and `pallas` loggers so DEBUG records would "survive Rich's console takeover". This did solve the Rich-swallowing problem, but swapped it for a worse one: every library's DEBUG record now bypassed the Rich Live display and leaked through every TUI repaint.

Fix (commits dde7d4f + 89870f4).

- `PALLAS_LOG_STDERR` env var gates the stderr handler. Off by default. Interactive users get a clean TUI + rotating file sink; systemd/journal deployments set `PALLAS_LOG_STDERR=1`.
- Root-logger level is decoupled from Pallas's own level. Default: `max(configured_level, INFO)`. Pallas's `pallas.*` loggers still honour `logger.level: debug`, but third-party libraries stay at INFO unless `PALLAS_ROOT_LOG_LEVEL=DEBUG` is set explicitly.
- The `openai`, `openai._base_client`, `anthropic`, `anthropic._base_client`, `sse_starlette`, `sse_starlette.sse`, `mcp`, `mcp.client`, `mcp.server`, `httpx`, and `httpcore` loggers are each pinned at WARNING individually, as belt-and-braces against any future re-enablement of root DEBUG.
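The level-resolution rules above can be sketched as follows. The helper names (`_level`, `resolve_levels`, `apply_levels`) are hypothetical and the real `setup_logging()` may be organised differently; only the env var names, the pinned logger list, and the `max(..., INFO)` default come from the text.

```python
import logging
from typing import Mapping, Tuple

NOISY_LOGGERS = (
    "openai", "openai._base_client", "anthropic", "anthropic._base_client",
    "sse_starlette", "sse_starlette.sse", "mcp", "mcp.client", "mcp.server",
    "httpx", "httpcore",
)


def _level(name: str) -> int:
    """Translate a level name such as 'debug' into its numeric value."""
    value = getattr(logging, name.upper(), None)
    if not isinstance(value, int):
        raise ValueError(f"unknown log level: {name!r}")
    return value


def resolve_levels(configured_level: str, environ: Mapping[str, str]) -> Tuple[int, int]:
    """Return (pallas_level, root_level) from config plus environment."""
    pallas_level = _level(environ.get("PALLAS_LOG_LEVEL", configured_level))
    override = environ.get("PALLAS_ROOT_LOG_LEVEL")
    # Without an explicit override the root logger never drops below INFO.
    root_level = _level(override) if override else max(pallas_level, logging.INFO)
    return pallas_level, root_level


def apply_levels(pallas_level: int, root_level: int) -> None:
    """Apply the resolved levels and pin the chatty third-party loggers."""
    logging.getLogger().setLevel(root_level)
    logging.getLogger("pallas").setLevel(pallas_level)
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)
```

With `logger.level: debug` and no environment overrides, this yields DEBUG for `pallas.*` and INFO for the root logger, which is the behaviour the fix describes.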

5. Logging configuration knobs (current state)

| Env var / config | Default | Effect |
| --- | --- | --- |
| `PALLAS_LOG_LEVEL` | `INFO` | Level for the `pallas.*` logger tree and the rotating file sink |
| `fastagent.config.yaml` `logger.level` | fallback for `PALLAS_LOG_LEVEL` | Unified knob: flipping fast-agent's level also flips Pallas's diagnostic level |
| `PALLAS_ROOT_LOG_LEVEL` | `max(pallas_level, INFO)` | Level for the root logger (controls third-party library output). Rarely needs to be changed. |
| `PALLAS_LOG_STDERR` | unset (off) | Attach a JSON `StreamHandler` to `sys.__stderr__`. Enable for systemd/journal; leave off in Rich TUI sessions. |
| `PALLAS_LOG_FILE` | `~/.local/state/pallas/pallas.log` | Rotating JSON log file. 10 MB × 5 backups. |

The rotating file sink is always on. It's what catches tracebacks from fast-agent, fastmcp, the MCP SDK, and our own trace wrappers regardless of how Rich is interacting with the terminal. Tail with jq for structured access:

tail -n 100 -f ~/.local/state/pallas/pallas.log | jq -r '"\(.time) \(.level) \(.logger) \(.message)"'

When diagnosing a downstream-MCP issue, grep pallas.forward.trace in that file: any uncaught exception inside send_request, session.call_tool, or _execute_on_server appears there with full traceback, even when fast-agent's aggregator turns it into a terse CallToolResult(isError=True) by the time the agent loop sees it.
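For scripted triage, the same filter can be written in Python. This sketch assumes each log line is a JSON object with the `time`, `level`, `logger`, and `message` fields used in the jq command, and that `pallas.forward.trace` appears as the logger name; `trace_records` is a hypothetical helper, not part of Pallas.

```python
import json
from typing import Dict, Iterable, Iterator


def trace_records(
    lines: Iterable[str], logger_prefix: str = "pallas.forward.trace"
) -> Iterator[Dict]:
    """Yield parsed records whose logger name starts with logger_prefix."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines, e.g. mid-rotation
        if isinstance(record, dict) and str(record.get("logger", "")).startswith(
            logger_prefix
        ):
            yield record
```

It accepts any iterable of lines, so an open file handle on the log can be passed directly.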
