# Pallas — Technical Reference

Pallas is the generic runtime that turns [fast-agent](https://github.com/evalstate/fast-agent) agent definitions into StreamableHTTP MCP servers. It is **completely deployment-agnostic**: all environment-specific values (agent names, ports, hosts, model) live in the calling project's configuration files, not in Pallas itself.

---

## Solution Architecture

Pallas occupies the middle tier of a three-layer MCP architecture. It bridges a web-facing client (Daedalus) and a constellation of specialised downstream MCP servers.

```
┌──────────────────────────────────┐
│  Daedalus                        │  Web UI / FastAPI / MCP client
│  Workspace management, chat,     │  Discovers agents via registry
│  health monitoring, progress     │  Calls agent tools via MCP
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Pallas (FastAgent MCP Bridge)   │  Python runtime
│                                  │
│  ┌─ Registry  (port N)          │  GET /.well-known/mcp/server.json
│  ├─ Agent: Research  (port N+1) │  Chains, routers, sub-agents
│  ├─ Agent: Engineering (port N+2)│  Orchestrators, tool pipelines
│  └─ Agent: Orchestrator (N+3)   │  Delegates across agents
│                                  │
│  Each agent exposes:             │
│    • send_message tool           │
│    • get_health tool             │
│    • {agent}_history prompt      │
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Downstream MCP Servers          │
│                                  │
│  Argos        — web search       │
│  Neo4j        — knowledge graph  │
│  Mnemosyne    — content library  │
│  Kernos       — shell execution  │
│  Gitea        — repository mgmt  │
│  Grafana      — monitoring       │
│  Rommie       — system management│
└──────────────────────────────────┘
```

### Daedalus → Pallas

| Interaction | Mechanism |
|---|---|
| Agent discovery | `GET {registry}/.well-known/mcp/server.json` — plain HTTP, returns all agents with MCP endpoint URLs |
| Agent communication | MCP `tools/call` on `send_message` — text + optional images |
| Health monitoring | MCP `tools/call` on `get_health` — programmatic, no LLM invocation |
| Progress feedback | MCP `notifications/progress` — streamed over SSE during long-running tool calls |
| Conversation history | MCP `prompts/get` on `{agent}_history` — retrieves stored message history |

### Pallas → Downstream

Pallas agents call downstream MCP servers via standard MCP tool calls. Each agent declares its servers in its fast-agent definition (`servers=["argos", "neo4j_cypher", ...]`). The server URLs and auth headers are configured in the consuming project's `fastagent.config.yaml`.

### Mnemosyne's Role

Mnemosyne provides a content-type-aware knowledge graph with hybrid search (vector + full-text + graph). Agents with `mnemosyne` in their `servers` list gain access to tools for searching documents, browsing libraries and collections, retrieving items, and traversing the concept graph. It complements Neo4j (graph topology and relationships) with content-focused retrieval and re-ranking.

### Why MCP End-to-End

Pallas is the protocol boundary — MCP above (from Daedalus) and MCP below (to downstream servers). This eliminates any MCP→REST→MCP translation layer. A single `fast.start_server(transport="http")` call exposes a complete agent as a StreamableHTTP MCP endpoint, giving Daedalus:

- **Tool discovery** via `session.list_tools()`
- **Native streaming** via MCP Streamable HTTP / SSE
- **Health checks** as ordinary tool calls — no separate API surface
- **Progress notifications** built into the protocol

---

## Pallas Internal Architecture

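Pallas is built around four core modules, composed at startup (supporting modules such as `pallas.loop_guard` and `pallas.log` are listed in the Module Reference):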

```
server.py main()
  │
  ├─ _load_deployment_config()         parse agents.yaml
  ├─ _build_agents_table()             {name: (module, port)}
  ├─ _build_agent_deps()               dependency graph
  │
  ├─ _start_all()  or  _run_single()
  │    │
  │    ├─ _preflight()
  │    │    ├─ _register_unknown_models()   model registration
  │    │    └─ validate_llm_providers()     LLM API key + model checks
  │    │
  │    ├─ start subagents (depends_on)
  │    ├─ wait for subagent readiness
  │    ├─ start top-level agents
  │    │    │
  │    │    └─ _start_agent(name)
  │    │         ├─ import agent module
  │    │         ├─ MultimodalAgentMCPServer(...)
  │    │         ├─ _resolve_downstream_servers()
  │    │         ├─ _preflight_mcp_servers()     warn on missing auth
  │    │         ├─ register_health_tool()
  │    │         └─ server.run_async()
  │    │
  │    └─ run_registry()               Starlette app on registry port
  │
  └─ asyncio.run(...)
```

| Module | Purpose |
|---|---|
| `pallas.server` | CLI entry point, configuration loading, agent lifecycle orchestration, model registration |
| `pallas.registry` | Starlette app serving `GET /.well-known/mcp/server.json` — builds the agent catalogue from `agents.yaml` + `fastagent.config.yaml` |
| `pallas.multimodal_server` | `MultimodalAgentMCPServer` — `AgentMCPServer` subclass adding image attachment support and conversation history prompts |
| `pallas.health` | Two-layer health: startup LLM preflight validation + runtime `get_health` MCP tool with downstream server probing |

---

## Installation

```bash
pip install git+ssh://git@git.helu.ca:22022/r/pallas.git
```

Or as a project dependency:

```toml
dependencies = [
    "pallas-mcp @ git+ssh://git@git.helu.ca:22022/r/pallas.git",
]
```

Requires Python ≥ 3.13. Key dependencies: `fast-agent-mcp`, `httpx`, `pyyaml`, `starlette`, `uvicorn`.

---

## Project Layout

Pallas reads configuration from the **working directory** at runtime. A consuming project looks like:

```
my-project/
├── agents/
│   ├── __init__.py
│   └── jarvis.py              # FastAgent definitions
├── agents.yaml                # Deployment topology
├── fastagent.config.yaml      # FastAgent + model config
├── fastagent.secrets.yaml     # API keys (gitignored)
└── .env                       # Secret values (gitignored)
```

Pallas itself contains no agent definitions, model names, ports, or hostnames. Everything is injected by the consuming project.

---

## Configuration Reference

### `agents.yaml`

Single source of truth for deployment topology.

```yaml
name: my-project               # log prefixes and registry names
version: "1.0.0"               # published in registry entries
host: my-host.example.com      # hostname for registry URLs
namespace: com.example.project  # reverse-domain prefix for registry names
registry_port: 8200             # port for the registry server

agents:
  jarvis:
    module: agents.jarvis       # importable Python module path
    port: 8201                  # StreamableHTTP port for this agent
    title: Jarvis               # human-readable name (registry)
    description: "My assistant" # one-line description (registry)
    depends_on: [research]      # optional: start these agents first

  research:
    module: agents.research
    port: 8250
    title: Research Agent
    description: "Web search and knowledge graph"
```

| Field | Required | Description |
|---|---|---|
| `name` | yes | Project name — used in log prefixes (`[my-project]`) and CLI help |
| `version` | no | Semver string published in registry entries. Default: `"1.0.0"` |
| `host` | no | Hostname used in registry `remotes[].url`. Default: `"localhost"` |
| `namespace` | no | Reverse-domain prefix for registry `server.name` (e.g. `com.example.project/jarvis`) |
| `registry_port` | no | Port for the registry server. Default: `24200` |
| `agents.<name>.module` | yes | Importable Python module path containing a `fast` instance |
| `agents.<name>.port` | yes | Port for this agent's StreamableHTTP MCP server |
| `agents.<name>.title` | no | Display name in registry. Default: `name.title()` |
| `agents.<name>.description` | no | Description in registry |
| `agents.<name>.depends_on` | no | List of agent names that must start and become ready before this agent |
| `agents.<name>.max_iterations` | no | Hard cap on agentic-loop turns per `send_message`. Default: `15`. fast-agent returns a partial answer once exceeded |
| `agents.<name>.loop_repeat_threshold` | no | Halt the loop after this many consecutive identical `(tool, args) → result` rounds. Default: `3`. `0` disables the guard |

### `fastagent.config.yaml` Extensions

Pallas reads two keys beyond the standard fast-agent config:

```yaml
default_model: openai.my-model-name

model_capabilities:
  vision: false
  context_window: 200000
  max_output_tokens: 32000
```

| Key | Description |
|---|---|
| `default_model` | `provider.model-name` format. The provider prefix (`anthropic` or `openai`) determines which LLM provider is active for health checks. |
| `model_capabilities.vision` | `true` registers the model with multimodal tokenization; `false` registers as text-only. Default: `false` |
| `model_capabilities.context_window` | Context window size in tokens. Default: `131072` |
| `model_capabilities.max_output_tokens` | Max output token limit. Default: `16384` |

Capabilities are declared explicitly rather than inferred from the model name, because naming conventions vary across model families and make regex heuristics brittle. The values are used both to register unknown models with fast-agent's `ModelDatabase` and to populate the registry response.

### Sampling parameters (temperature, top_p, top_k)

Sampling parameters are configured per-agent in the Python decorator, **not** in `agents.yaml` or `fastagent.config.yaml`. Pallas itself does no sampling-param handling — this is pure fast-agent decorator-side configuration.

```python
from fast_agent import FastAgent
from fast_agent.types import RequestParams

fast = FastAgent("Jeffrey", parse_cli_args=False)

@fast.agent(
    name="jeffrey",
    instruction="...",
    servers=[...],
    request_params=RequestParams(temperature=0.6, top_p=0.9),
)
async def _jeffrey():
    pass
```

Provider support varies:

| Provider | temperature | top_p | top_k |
|---|---|---|---|
| OpenAI (native, Responses API) | yes | yes | no |
| HuggingFace, OpenResponses (OpenAI-compatible) | yes | yes | yes (via `extra_body`) |
| Google Gemini | yes | yes | yes |
| Bedrock | yes | yes (most models) | varies |
| **Anthropic Claude Opus 4.7** | **no** | **no** | **no** |

Anthropic's 4.7 design moves away from low-level numeric dials toward adaptive control — fast-agent's Anthropic provider explicitly strips temperature/top_p/top_k for Opus 4.7 with a warning (see `fast_agent/llm/provider/anthropic/llm_anthropic.py:1776-1786`). On Opus 4.7, use `output_config.effort` (verbosity, including the new `xhigh` level between `high` and `max`) instead.

Setting `request_params` on an Anthropic-Opus-4.7 agent is a safe no-op — the params apply automatically the moment the agent is routed to a non-Anthropic model.

### `fastagent.secrets.yaml`

```yaml
anthropic:
  api_key: "${ANTHROPIC_API_KEY}"
openai:
  api_key: "${OPENAI_API_KEY}"
  base_url: "${OPENAI_BASE_URL}"
```

`${ENV_VAR}` placeholders are expanded at runtime from environment variables.

### `.env`

Pallas loads `.env` from the working directory into `os.environ` without overwriting existing variables. This supports both local development and systemd deployments:

```dotenv
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=http://my-llm-server:8080/v1
```

`OPENAI_BASE_URL` defaults to `https://api.openai.com/v1` if unset. For local llama-cpp, vLLM, or other OpenAI-compatible servers, set it to their endpoint.
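The two behaviours described above (loading `.env` without clobbering existing variables, and expanding `${ENV_VAR}` placeholders) can be sketched as follows. This is an illustrative approximation, not Pallas's actual loader: the function names `load_dotenv_text` and `expand_placeholders` are hypothetical, and leaving unknown placeholders untouched is an assumption.

```python
import re


def load_dotenv_text(text: str, environ: dict) -> None:
    """Copy KEY=VALUE pairs into environ, keeping values that are already set."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault leaves a pre-existing variable (e.g. from systemd) untouched
        environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value: str, environ: dict) -> str:
    """Replace ${NAME} with environ[NAME]; unknown names are left as-is (assumption)."""
    return _PLACEHOLDER.sub(lambda m: environ.get(m.group(1), m.group(0)), value)


# A variable already present in the environment wins over the .env file.
env = {"OPENAI_API_KEY": "from-systemd"}
load_dotenv_text(
    "OPENAI_API_KEY=sk-local\nOPENAI_BASE_URL=http://my-llm-server:8080/v1\n", env
)
base_url = expand_placeholders("${OPENAI_BASE_URL}", env)
```

In a real process the `environ` argument would be `os.environ`; a plain dict is used here so the example has no side effects.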

### Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| `PALLAS_AGENTS_CONFIG` | `agents.yaml` | Override path to deployment config |

---

## Running Pallas

### CLI

```bash
pallas                     # start all agents + registry
pallas --agent jarvis      # start a single agent (no registry)
python -m pallas.server    # equivalent to `pallas`
```

### Startup Sequence

**All agents mode** (`pallas`):

1. Load `agents.yaml`, build agents table and dependency graph
2. **Preflight** — register unknown models with `ModelDatabase`, validate LLM provider API keys and model availability
3. Start the registry server on `registry_port`
4. Start **subagents** (agents listed in other agents' `depends_on`)
5. Wait for each subagent to become ready (HTTP probe on `/mcp`, 60s timeout)
6. Start **top-level agents** (everything not a subagent)
7. All servers run concurrently via `asyncio.gather`

**Single agent mode** (`pallas --agent <name>`):

1. Load `agents.yaml`
2. Preflight
3. Start the named agent (no registry, no dependency resolution)
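The subagent / top-level split used in all-agents mode can be sketched from the `depends_on` entries alone. The helper below is a hypothetical illustration (Pallas's own `_build_agent_deps()` may differ), and rejecting unknown dependency names is an assumption added for clarity.

```python
def split_by_dependency(agents: dict) -> tuple:
    """Return (subagents, top_level) agent names, each in config order."""
    # An agent is a "subagent" if any other agent lists it in depends_on.
    depended_on = {
        dep for spec in agents.values() for dep in spec.get("depends_on", [])
    }
    unknown = depended_on - set(agents)
    if unknown:
        # Assumption: a dangling reference is treated as a config error.
        raise ValueError(f"depends_on references unknown agents: {sorted(unknown)}")
    subagents = [name for name in agents if name in depended_on]
    top_level = [name for name in agents if name not in depended_on]
    return subagents, top_level


# Mirrors the agents.yaml example: jarvis depends on research.
agents = {
    "jarvis": {"module": "agents.jarvis", "port": 8201, "depends_on": ["research"]},
    "research": {"module": "agents.research", "port": 8250},
}
subagents, top_level = split_by_dependency(agents)
```

With this input, `research` is started and probed for readiness first, then `jarvis`.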

### Per-Agent Startup

For each agent:

1. Import the agent module (the `module` path from `agents.yaml`, e.g. `agents.jarvis`) and obtain its `fast` instance
2. Enter `fast.run()` context — initialises the fast-agent runtime
3. Create a `MultimodalAgentMCPServer` wrapping the primary agent instance
4. Resolve downstream MCP server configs from the fast-agent configuration
5. Warn if any downstream auth headers reference unset environment variables
6. Register the `get_health` MCP tool with downstream server info
7. Bind to `0.0.0.0:<port>` and serve StreamableHTTP

---

## Daedalus Integration

This section describes the contract from Pallas's perspective. The full client-side specification is in `docs/pallas_integration.md`.

### Registration Flow

1. Daedalus stores a registry URL (e.g. `http://puck.incus:23030`)
2. Fetches `GET {url}/.well-known/mcp/server.json`
3. Discovers all agents with their MCP endpoint URLs, titles, and descriptions
4. Creates connections to each agent

### Health Polling

Daedalus calls `get_health` on each connected agent at a configurable interval (default 60s). The response maps to UI indicators:

| `status` | Daedalus behaviour |
|---|---|
| `ok` | Green badge, normal operation |
| `degraded` | Yellow badge + warning banner showing `message`. Chat allowed. |
| `error` | Red badge. Chat disabled. |

### Progress Notifications

Long-running agent tool calls (agentic loops, sub-agent delegation) emit MCP `notifications/progress` on the SSE stream. Daedalus must include a `progressToken` in the `_meta` of `tools/call` requests to opt in:

```python
from uuid import uuid4

result = await session.call_tool(
    "jarvis",
    arguments={"message": user_input},
    request_params={"_meta": {"progressToken": str(uuid4())}},
)
```

Progress notification fields:

| Field | Description |
|---|---|
| `progressToken` | Matches the token sent in the request |
| `progress` | Monotonically increasing step counter |
| `total` | `null` = indeterminate (loop in progress), `1.0` = sub-task finished |
| `message` | Status text: `{server}/{tool}: started\|completed\|failed` or `{agent} step N (llm\|tool)` |

Without a `progressToken`, Pallas skips all progress notifications and the client receives nothing until the final result.

### Chat Blocking

If the target agent's cached health is `error`, Daedalus returns HTTP 503 and disables the message input. `degraded` shows a warning but allows chat.

---

## Registry Server

### Endpoint

```
GET http://{host}:{registry_port}/.well-known/mcp/server.json
```

Plain HTTP — not MCP. No authentication. Returns `application/json`.

### Response Structure

Built dynamically from `agents.yaml` + `fastagent.config.yaml`:

```json
{
  "servers": [
    {
      "server": {
        "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
        "name": "com.example.project/jarvis",
        "title": "Jarvis",
        "description": "My assistant",
        "version": "1.0.0",
        "remotes": [
          { "type": "streamable-http", "url": "http://my-host.example.com:8201/mcp" }
        ],
        "capabilities": {
          "model": "my-model-name",
          "vision": false,
          "context_window": 200000,
          "max_output_tokens": 32000
        }
      },
      "_meta": {
        "io.modelcontextprotocol.registry/official": {
          "status": "active",
          "updatedAt": "2026-01-01T00:00:00Z",
          "isLatest": true
        }
      }
    }
  ]
}
```
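On the client side, the response can be reduced to a name-to-URL map before opening MCP connections. The sketch below is an assumption about how a client such as Daedalus might consume the document; `agent_endpoints` is a hypothetical helper, not part of Pallas.

```python
import json


def agent_endpoints(document: dict) -> dict:
    """Map each registry server name to its first streamable-http URL."""
    endpoints = {}
    for entry in document.get("servers", []):
        server = entry.get("server", {})
        for remote in server.get("remotes", []):
            if remote.get("type") == "streamable-http":
                endpoints[server["name"]] = remote["url"]
                break  # one endpoint per agent is enough for this sketch
    return endpoints


# Trimmed version of the response structure shown above.
sample = json.loads("""
{
  "servers": [
    {
      "server": {
        "name": "com.example.project/jarvis",
        "title": "Jarvis",
        "remotes": [
          {"type": "streamable-http", "url": "http://my-host.example.com:8201/mcp"}
        ]
      }
    }
  ]
}
""")
endpoints = agent_endpoints(sample)
```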

### Registry Name Construction

`{namespace}/{slug}` — where `slug` is the agent key with underscores replaced by hyphens. Example: namespace `com.example.project` + agent key `tech_research` → `com.example.project/tech-research`.
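The naming rule is small enough to express directly. The function name `registry_name` is illustrative only.

```python
def registry_name(namespace: str, agent_key: str) -> str:
    """Build the registry server name: namespace, slash, hyphenated agent key."""
    slug = agent_key.replace("_", "-")
    return f"{namespace}/{slug}"


name = registry_name("com.example.project", "tech_research")
```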

### Capabilities

If `model_capabilities` is defined in `fastagent.config.yaml`, each registry entry includes a `capabilities` object with model name, vision support, context window, and max output tokens. This allows clients to make informed decisions about what an agent can handle.

---

## Multimodal Support

`MultimodalAgentMCPServer` extends fast-agent's `AgentMCPServer` with image attachment support.

### `send_message` Tool

Each agent's MCP tool accepts:

| Parameter | Type | Required | Description |
|---|---|---|---|
| `message` | `str` | yes | Text message to the agent |
| `images` | `list[dict]` | no | Base64-encoded images: `[{"data": "...", "mime_type": "image/png"}]` |

When `images` is provided, the message is sent as a `PromptMessageExtended` containing both `TextContent` and `ImageContent` parts — the agent's underlying model must support vision.
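A caller can assemble the `images` argument from raw bytes as sketched below. The helper name `image_argument` is hypothetical; only the `data` / `mime_type` shape comes from the table above.

```python
import base64


def image_argument(data: bytes, mime_type: str = "image/png") -> dict:
    """Encode raw image bytes into the dict shape expected in `images`."""
    return {
        "data": base64.b64encode(data).decode("ascii"),
        "mime_type": mime_type,
    }


# Only the PNG signature is used here; a real call would pass the whole file.
png_signature = b"\x89PNG\r\n\x1a\n"
arguments = {
    "message": "What is in this screenshot?",
    "images": [image_argument(png_signature, "image/png")],
}
```

The resulting `arguments` dict is what would be passed as the tool-call arguments.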

### Conversation History Prompt

For agents with `instance_scope != "request"`, a `{agent}_history` prompt is registered that returns the full conversation history as FastMCP `Message` objects. This allows clients to retrieve the stored context.

### Bearer Token Propagation

The server captures the authenticated bearer token from the incoming MCP request's `Authorization: Bearer …` header via `fastmcp.server.dependencies.get_http_request()` (FastMCP's `get_access_token()` returns `None` because Pallas runs without the auth middleware). Two consumers read it:

- **LLM-provider passthrough** — the token is also pushed into the `request_bearer_token` ContextVar for the agent's LLM provider key manager to pick up automatically (used by HuggingFace and any other token-passthrough providers). The ContextVar works here because the LLM call runs in a child task of the request handler.
- **Downstream MCP servers (opt-in)** — outgoing MCP calls inherit the same bearer when the downstream server is marked `forward_inbound_auth: true` in `fastagent.config.yaml`. Without that flag, the inbound bearer is **not** forwarded to MCP transport calls — `server_config.headers` is the only header source.

The forwarding is per-server so a FastAgent attached to both a credentialed downstream (e.g. Mnemosyne) and an unrelated public server doesn't leak the bearer to the latter.

#### Why a simple ContextVar forward isn't enough

fast-agent's `MCPConnectionManager` runs each downstream transport inside a long-lived `anyio.TaskGroup` created at manager startup. `TaskGroup.start_soon` snapshots the owner's `contextvars.Context` at spawn time — the request-handler's context is invisible to the transport task. A straight `request_bearer_token.get()` inside `_prepare_headers_and_auth` therefore always resolves to `None` even when the inbound handler has `set` the token a few frames up. The persistent connection is additionally reused across requests, so the first-call context (often empty) would be cached forever.
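The context-snapshot behaviour can be reproduced in isolation. The sketch below uses plain `asyncio` as a stand-in for the `anyio` task group (an assumption made to keep the example dependency-free); the worker and queue are hypothetical and only model a long-lived transport task that was spawned before the request arrived.

```python
import asyncio
import contextvars

request_bearer_token = contextvars.ContextVar("request_bearer_token", default=None)


async def transport_worker(requests: asyncio.Queue, seen: list) -> None:
    """Long-lived task: its Context was copied when the task was created."""
    while True:
        item = await requests.get()
        if item is None:
            return
        # Reads the snapshot taken at spawn time, not the handler's value.
        seen.append(request_bearer_token.get())


async def demo() -> list:
    requests: asyncio.Queue = asyncio.Queue()
    seen: list = []
    # Spawned at "manager startup", before any request exists.
    worker = asyncio.create_task(transport_worker(requests, seen))
    # Set later, inside the "request handler".
    request_bearer_token.set("per-request-jwt")
    await requests.put("call")
    await requests.put(None)  # sentinel: stop the worker
    await worker
    return seen


seen_tokens = asyncio.run(demo())
print(seen_tokens)  # [None]
```

The worker never observes `"per-request-jwt"`, which is the same failure mode described above for `_prepare_headers_and_auth`.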

Pallas works around this in `pallas._fastagent_patch` by maintaining a process-wide `_pending_bearers` registry keyed by `id(server_config)`. `multimodal_server.send_message` calls `publish_bearer(cfg, token)` for every opted-in downstream the agent is allowed to reach; the patched `_prepare_headers_and_auth` looks it up there (with the ContextVar as a fallback for non-persistent probe paths); and the request handler's `finally` block calls `revoke_bearer(cfg)` to clear the entry. Per-request bearers therefore survive the task-group boundary without any mutation of shared config.


Example (`fastagent.config.yaml`), with one server opted in and one left without forwarding:

```yaml
mcp:
  servers:
    mnemosyne:
      transport: http
      url: "https://mnemosyne.example/mcp/"
      forward_inbound_auth: true   # inbound bearer rides outbound
    weather:
      transport: http
      url: "https://weather.example/mcp/"
      # no flag → outbound calls go unauthenticated
```

When the agent receives a request with `Authorization: Bearer X`, `mnemosyne` sees `Authorization: Bearer X` on the outbound call and `weather` sees no `Authorization` header. If `mnemosyne.headers.Authorization` is set explicitly, the explicit header wins: the inbound bearer never replaces it.

---

## Health System

Two-layer health checking: **startup preflight** validates LLM providers before agents launch, and a **runtime `get_health` tool** reports ongoing status.

### Startup Preflight

Runs once before any agents start. Validates all LLM providers that have API keys configured.

| Provider | Active (default_model matches) | Key set, not active |
|---|---|---|
| **Anthropic** | `GET /v1/models/{model}` — confirms model exists and key is valid | `GET /v1/models/claude-sonnet-4-5` — verifies API access |
| **OpenAI** | `GET {base_url}/models` — lists models, confirms configured model is present | `GET {base_url}/models` — lists available models |

- **Warn-only** — never blocks startup. Agents start regardless.
- **5-second timeout** per provider API call.
- Loads `.env` before checking.

### Runtime `get_health` Tool

Registered on each agent's MCP server. Checks:

1. **Downstream MCP servers** — sends an MCP `initialize` handshake to each server URL. Uses `initialize` because it is the only MCP method that works without a pre-established session. After success, sends `DELETE` with the returned `Mcp-Session-Id` to tear down the session cleanly. 3-second timeout.

2. **Active LLM provider** — includes the preflight result for the provider that `default_model` points to. Only the active provider affects health status.

### Response Format

```json
{ "status": "ok", "timestamp": "2026-01-01T00:00:00Z" }
```

```json
{
  "status": "degraded",
  "timestamp": "2026-01-01T00:00:00Z",
  "message": "Unreachable: neo4j_cypher; LLM: openai: model 'bad-model' not found"
}
```

| Status | Meaning |
|---|---|
| `ok` | All downstream servers reachable and active LLM provider healthy |
| `degraded` | One or more downstream servers unreachable, or active LLM provider failed |
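How probe results fold into the response can be sketched as a pure function. This is an illustrative reconstruction from the response examples above; `aggregate_health` and its parameters are hypothetical names, and the probing itself is omitted.

```python
def aggregate_health(downstream_ok: dict, llm_error, timestamp: str) -> dict:
    """Combine downstream probe results and the active LLM check into one response."""
    problems = []
    unreachable = [name for name, ok in downstream_ok.items() if not ok]
    if unreachable:
        problems.append("Unreachable: " + ", ".join(unreachable))
    if llm_error:
        problems.append("LLM: " + llm_error)

    response = {"status": "degraded" if problems else "ok", "timestamp": timestamp}
    if problems:
        # `message` is only present when something is wrong.
        response["message"] = "; ".join(problems)
    return response


healthy = aggregate_health({"argos": True}, None, "2026-01-01T00:00:00Z")
degraded = aggregate_health(
    {"argos": True, "neo4j_cypher": False},
    "openai: model 'bad-model' not found",
    "2026-01-01T00:00:00Z",
)
```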

---

## Loop Guard

A small model occasionally gets stuck emitting the *identical* tool call every
iteration — usually because an upstream MCP server returned a contradictory or
malformed result it keeps trying to reconcile. Left alone the loop burns LLM
turns and context until the client times out and the user sees
`empty_response`.

`pallas.loop_guard` installs per-request `ToolRunnerHooks` (composed on top of
the assistant-stream hooks) that track a rolling signature of
`(tool, normalized_args) → result_hash`. When the same signature repeats
`loop_repeat_threshold` times consecutively (default **3**), the loop is
**halted immediately** — the runtime does *not* ask the model to troubleshoot,
because the fault is almost always upstream and self-recovery is slow,
unpredictable, and token-hungry. On halt it:

- collapses the request's `max_iterations` to the current iteration, so
  fast-agent's own `_iteration > max_iterations` check terminates the turn
  after the current tool result with **no further LLM call**;
- appends an honest, user-facing explanation to the returned turn (and sets
  `stop_reason = endTurn`) so the client gets a real message instead of an
  empty/truncated one;
- logs the offending tool, arguments, and result at WARNING (`event=loop_halt`
  in `pallas.loop_guard`) so the upstream bug can be fixed durably; and
- increments `pallas_agent_loop_aborted_total{reason="repeat"}`.

This fires well before the `max_iterations` cap (a 3-round repeat halts within
~3 turns regardless of the configured ceiling), which is the point: the cap is
a backstop, the guard is the fast path. Set `loop_repeat_threshold: 0` on an
agent to disable it.
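The repeat detection itself can be sketched as a small counter over consecutive signatures. This is a minimal illustration, assuming JSON-serialisable arguments and string results; `RepeatDetector` is a hypothetical class and does not reproduce the `ToolRunnerHooks` wiring, the `max_iterations` collapse, or the metrics.

```python
import hashlib
import json


class RepeatDetector:
    """Counts consecutive identical (tool, args) -> result rounds."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._last = None
        self._count = 0

    def observe(self, tool: str, args: dict, result: str) -> bool:
        """Record one tool round; return True when the loop should halt."""
        if self.threshold <= 0:
            return False  # threshold 0 disables the guard
        # Sorting keys makes argument order irrelevant to the signature.
        normalized = json.dumps(args, sort_keys=True, separators=(",", ":"))
        result_hash = hashlib.sha256(result.encode("utf-8")).hexdigest()
        signature = (tool, normalized, result_hash)
        if signature == self._last:
            self._count += 1
        else:
            self._last, self._count = signature, 1
        return self._count >= self.threshold


guard = RepeatDetector(threshold=3)
halts = [
    guard.observe("search", {"q": "x", "limit": 5}, "malformed"),
    guard.observe("search", {"limit": 5, "q": "x"}, "malformed"),
    guard.observe("search", {"q": "x", "limit": 5}, "malformed"),
]
```

Any differing tool, argument, or result resets the count, so normal multi-step tool use is unaffected.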

---

## Metrics

Pallas exposes Prometheus metrics for scraping and alerting. One scrape target per Pallas deployment is sufficient — all agents run as coroutines in a single process under `asyncio.gather`, so metrics are process-global.

### Endpoint

```
GET http://{host}:{registry_port}/metrics
```

Plain HTTP, unauthenticated, served by the same Starlette app that hosts the registry. Returns Prometheus text exposition format (`text/plain; version=0.0.4`).

The same metrics snapshot is also available on each agent's own port at `{host}:{agent_port}/metrics`. Scraping the registry endpoint is the recommended default; the per-agent endpoints exist for cases where a load balancer terminates per-backend.

### Scrape Config

```yaml
scrape_configs:
  - job_name: pallas
    static_configs:
      - targets: ['my-host.example.com:8200']    # registry_port
        labels:
          deployment: my-project
```

### Metrics Reference

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pallas_up` | gauge | — | `1` while the Pallas process is running |
| `pallas_agent_info` | gauge | `agent`, `port` | `1` per configured agent — useful as a label join source |
| `pallas_send_message_total` | counter | `agent`, `outcome` | `send_message` MCP calls. `outcome` ∈ `ok`/`error` |
| `pallas_send_message_duration_seconds` | histogram | `agent` | End-to-end MCP `send_message` wall-clock duration |
| `pallas_llm_turns_total` | counter | `agent`, `model` | LLM provider round-trips per agent/model |
| `pallas_llm_tokens_total` | counter | `agent`, `model`, `kind` | Tokens consumed. `kind` ∈ `input`/`output`/`cache_read`/`cache_write`/`cache_hit`/`reasoning` |
| `pallas_tool_calls_total` | counter | `agent`, `server`, `operation`, `outcome` | Downstream MCP operations dispatched by fast-agent's aggregator. `operation` is the fast-agent operation type (`tool`, `prompt`, `resource`, …); `outcome` ∈ `ok`/`error` |
| `pallas_tool_call_duration_seconds` | histogram | `agent`, `server`, `operation` | Downstream MCP operation duration |
| `pallas_downstream_up` | gauge | `agent`, `server` | `1` when the named downstream MCP server passed the last `get_health` probe |
| `pallas_llm_provider_up` | gauge | `provider` | `1` when the active LLM provider passed its last preflight or runtime re-probe |
| `pallas_agent_health_status` | gauge | `agent` | Aggregate from the last `get_health`: `1`=ok, `0.5`=degraded, `0`=error |
| `pallas_agent_loop_aborted_total` | counter | `agent`, `reason` | Agentic loops force-stopped by a runtime guard. `reason` ∈ `repeat` (identical-tool-call loop detected) |

Standard process metrics (RSS, CPU, GC, open FDs) are emitted by `prometheus-client`'s default collectors on the same endpoint.

### Where the Numbers Come From

- **send_message metrics** — captured around the MCP `send_message` handler in `pallas.multimodal_server`. The duration spans the full agentic loop, including all sub-agent and tool-call latency.
- **LLM token metrics** — read from fast-agent's `UsageAccumulator` on the request-scoped agent instance *before disposal*. Each request's accumulator is fresh, so every recorded turn is genuinely new — no double-counting across requests.
- **Downstream tool call metrics** — recorded in the `pallas._fastagent_patch` wrapper around `MCPAggregator._execute_on_server`. This catches every dispatch (tools, prompts, resources) and is independent of which downstream server it lands on. Failures still surface in the counter as `outcome="error"` and full tracebacks remain in `pallas.forward.trace` log records.
- **Health gauges** — updated as a side effect of every `get_health` MCP call. Daedalus's polling cadence (default 60 s) therefore drives gauge freshness. The LLM gauge is also set at startup preflight and on the TTL re-probe inside `get_health`.

### Useful Queries

```promql
# Error rate per agent
sum by (agent) (rate(pallas_send_message_total{outcome="error"}[5m]))
  / sum by (agent) (rate(pallas_send_message_total[5m]))

# p95 send_message latency per agent
histogram_quantile(0.95,
  sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[5m]))
)

# Token spend per model (1h)
sum by (model, kind) (rate(pallas_llm_tokens_total[1h]))

# Cache hit ratio (Anthropic)
sum(rate(pallas_llm_tokens_total{kind="cache_read"}[5m]))
  / sum(rate(pallas_llm_tokens_total{kind=~"input|cache_read|cache_write"}[5m]))

# Any downstream MCP server unreachable
min by (server) (pallas_downstream_up) == 0

# Active LLM provider down
pallas_llm_provider_up == 0
```

### Suggested Alerts

| Alert | Expression | Notes |
|---|---|---|
| Pallas process down | `up{job="pallas"} == 0` for 1m | Scrape failure |
| Active LLM unreachable | `pallas_llm_provider_up == 0` for 5m | Preflight or TTL re-probe failing |
| Downstream MCP unreachable | `pallas_downstream_up == 0` for 10m | Per-server; gauge updates on each `get_health` |
| Agent error rate elevated | `sum by (agent) (rate(pallas_send_message_total{outcome="error"}[10m])) / sum by (agent) (rate(pallas_send_message_total[10m])) > 0.1` | >10% errors over 10 min |
| Latency regression | `histogram_quantile(0.95, sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[10m]))) > 60` | p95 over 60 s |
| Token burn | `sum(rate(pallas_llm_tokens_total{kind="output"}[1h])) > N` | Set N to your budget |
| Agent loop halted | `increase(pallas_agent_loop_aborted_total[15m]) > 0` | A repeated-tool-call loop was force-stopped — investigate the upstream tool/data |

---

## Model Registration

Pallas registers models not in fast-agent's built-in `ModelDatabase` at startup, using the explicit capability declarations from `fastagent.config.yaml`.

The process:

1. Read `default_model` and `model_capabilities` from config
2. Extract the model name (portion after the provider prefix dot)
3. Check if `ModelDatabase` already knows this model — if so, skip
4. Register with `ModelDatabase.register_runtime_model_params()`:
   - `vision: true` → multimodal tokenization (`QWEN_MULTIMODAL`)
   - `vision: false` → text-only tokenization (`TEXT_ONLY`)
   - `context_window` and `max_output_tokens` from config (with sensible defaults)

This avoids the brittle pattern of inferring capabilities from model name substrings, which breaks for custom or fine-tuned models with non-standard names.
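Steps 1, 2, and 4 amount to a small resolution function, sketched below under stated assumptions. `resolve_registration` is a hypothetical helper that only computes the values; it does not call `ModelDatabase`, and the tokenization names are carried as plain strings.

```python
# Defaults as documented in the Configuration Reference.
DEFAULTS = {"vision": False, "context_window": 131072, "max_output_tokens": 16384}


def resolve_registration(config: dict) -> dict:
    """Derive provider, model name, and capability values from the config."""
    # Split on the first dot only, so model names containing dots survive.
    provider, _, model = config["default_model"].partition(".")
    caps = {**DEFAULTS, **(config.get("model_capabilities") or {})}
    return {
        "provider": provider,
        "model": model,
        "tokenizes": "QWEN_MULTIMODAL" if caps["vision"] else "TEXT_ONLY",
        "context_window": caps["context_window"],
        "max_output_tokens": caps["max_output_tokens"],
    }


explicit = resolve_registration(
    {
        "default_model": "openai.my-model-name",
        "model_capabilities": {
            "vision": False,
            "context_window": 200000,
            "max_output_tokens": 32000,
        },
    }
)
defaulted = resolve_registration({"default_model": "openai.custom-v1.5"})
```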

---

## Module Reference

| Module | File | Purpose |
|---|---|---|
| `pallas.server` | `server.py` | CLI entry point (`pallas` command), configuration loading, agent lifecycle orchestration, dependency ordering, model registration |
| `pallas.registry` | `registry.py` | Starlette app serving `GET /.well-known/mcp/server.json` — agent catalogue built from config |
| `pallas.multimodal_server` | `multimodal_server.py` | `MultimodalAgentMCPServer` — extends `AgentMCPServer` with image support, conversation history prompts, bearer token propagation |
| `pallas.health` | `health.py` | LLM provider preflight validation, downstream MCP server probing, `get_health` tool registration |
| `pallas.loop_guard` | `loop_guard.py` | Per-request `ToolRunnerHooks` that halt the agentic loop on repeated-identical tool calls |
| `pallas.log` | `log.py` | JSON log configuration, third-party traceback capture, Rich-TUI-safe handler attachment |
| `pallas._fastagent_patch` | `_fastagent_patch.py` | Monkey-patches fast-agent at import time: per-request bearer forwarding via `httpx.Auth`, diagnostic trace-capture wrappers around `send_request` / `session.call_tool` / `_execute_on_server` |

---

## Incidents & Lessons Learned

The Pallas↔Mnemosyne bearer-forwarding rollout surfaced a chain of bugs that ranged from "obvious in hindsight" to "you have to go read the fast-agent source to see why". None of the individual symptoms pointed at the true cause — each had a plausible scapegoat — which is why the actual fix was to install structured diagnostics first and work the problem end-to-end. This section captures the findings so the next person to touch this code (likely future me) does not have to re-derive them.

### 1. Per-request bearer across an `anyio.TaskGroup` boundary

**Symptom.** Per-turn JWTs minted by Daedalus and sent as `Authorization: Bearer …` to Pallas never reached Mnemosyne; Mnemosyne saw either no `Authorization` header at all, or — worse, intermittently — a bearer from a *previous* turn against an unrelated workspace.

**Cause.** fast-agent's `MCPConnectionManager` runs each downstream transport inside a long-lived `anyio.TaskGroup` created at manager startup. `TaskGroup.start_soon` snapshots the owner's `contextvars.Context` at spawn time, so any `request_bearer_token.set(…)` done in the request handler a few frames up is **invisible** to the transport task. The persistent connection additionally caches its handshake context — so the bearer observed on the *first* call (often empty during a health-probe-triggered warm-up) gets reused forever.
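The snapshot behaviour can be reproduced with the standard library alone. The sketch below is an illustration, not fast-agent's code: it uses `asyncio` instead of `anyio`, and `transport_worker` is a hypothetical stand-in for the long-lived transport task. Only the `request_bearer_token` name is taken from the description above.

```python
import asyncio
import contextvars

request_bearer_token = contextvars.ContextVar("request_bearer_token", default=None)


async def transport_worker(requests: asyncio.Queue, seen: list) -> None:
    # Long-lived task: its context was copied when the task was created,
    # so later ContextVar.set() calls made by other tasks are not visible here.
    while True:
        item = await requests.get()
        if item is None:
            break
        seen.append(request_bearer_token.get())


async def main() -> list:
    requests: asyncio.Queue = asyncio.Queue()
    seen: list = []
    # "Manager startup": the worker's context snapshot is taken here.
    worker = asyncio.create_task(transport_worker(requests, seen))

    # "Request handler": sets the bearer after the worker already exists.
    request_bearer_token.set("jwt-for-turn-1")
    await requests.put("call_tool")
    await requests.put(None)
    await worker
    return seen


seen = asyncio.run(main())
print(seen)  # the worker observed the default, not the per-turn bearer
```

The handler and the worker hold different copies of the context, which is why a value set per request never reaches a task spawned at startup.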

**Why the first attempt didn't help.** We initially set the bearer via a `contextvars.ContextVar` and tried to have `_prepare_headers_and_auth` read it. It almost works — until any reconnect, retry, or persistent stream, at which point the cached snapshot wins.

**Fix (`pallas._fastagent_patch`).** Maintain a process-wide `_pending_bearers: dict[int, str]` keyed by `id(server_config)`, guarded by a `threading.Lock`. `multimodal_server.send_message` calls `publish_bearer(cfg, token)` for every opted-in downstream *before* spawning any tool call; the patched `_prepare_headers_and_auth` pulls the token from the registry (ContextVar used as a fallback for non-persistent probe paths); a `finally` in the request handler calls `revoke_bearer(cfg)` to clear the entry. Per-request bearers therefore survive the task-group boundary without mutating any shared config object.

**Bonus gotcha.** The opt-in was originally keyed off a custom `forward_inbound_auth: true` field on the server block, read via fast-agent's pydantic config model. Pydantic's nested-model validation silently **dropped unknown keys**, so the flag never appeared on the parsed config. Workaround: scan `fastagent.config.yaml` directly for the flag at module import time (`pallas._fastagent_patch._FORWARD_SERVERS`) rather than rely on the parsed config object.
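The direct scan might look like the sketch below. It works on the already-loaded YAML mapping (for example the result of a YAML loader), and it assumes server blocks live under `mcp.servers`; that layout, the `find_forward_servers` name and the sample values are assumptions for illustration only.

```python
from typing import Any, Mapping


def find_forward_servers(raw_config: Mapping[str, Any]) -> frozenset:
    """Collect server names whose raw block sets forward_inbound_auth: true.

    Operates on the raw mapping, so keys that a typed config model would
    discard are still visible.
    """
    servers = (raw_config.get("mcp") or {}).get("servers") or {}
    return frozenset(
        name
        for name, block in servers.items()
        if isinstance(block, Mapping) and block.get("forward_inbound_auth") is True
    )


# Hypothetical sample config; the server names come from this document,
# the transport values are placeholders.
raw = {
    "mcp": {
        "servers": {
            "mnemosyne": {"transport": "http", "forward_inbound_auth": True},
            "argos": {"transport": "http"},
        }
    }
}
forward_servers = find_forward_servers(raw)
```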

**Bonus gotcha 2.** `httpx` caches auth handshake headers on persistent connections. A plain mutation of `server_config.headers["Authorization"]` in the request handler only affects *new* connections. The forwarding patch therefore does not mutate headers; it provides a custom `httpx.Auth` subclass (`_DynamicBearerAuth`) that looks up the bearer on every request. The override is `auth_flow`, the generic generator that httpx drives from both sync and async clients, rather than `async_auth_flow`.

### 2. `install()` idempotency shadowing newly-added patches

**Symptom.** After adding two new diagnostic monkey-patches (`_patch_session_call_tool`, `_patch_execute_on_server`) and reinstalling `pallas-mcp` into the Kottos venv, the trace-capture records refused to appear in `pallas.log`. Four repro cycles, five log rotations, no evidence that the new code was running.

**Cause.** `install()` had a single top-level guard on `_prepare_headers_and_auth._pallas_forward_patched`. Once the bearer-forwarding patch was applied on first import, every subsequent `install()` call returned early — skipping the *three* later `_patch_*()` helpers entirely. The patches were *present* in the installed file; they were never *executed*.

**Lesson.** A shared idempotency guard at the top of an `install()`-style function is a liability as soon as the function grows past one patch. The fix (commit `082b611`) moves each patch's guard to a per-target sentinel attribute on the target method (`target._pallas_trace_patched = True`), checked inside each helper. `install()` now calls every helper unconditionally; duplicate installs are cheap and harmless.
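The per-target guard pattern can be sketched as follows. `Session` is a hypothetical stand-in for the fast-agent classes being patched and `_wrap_with_trace` is an illustrative helper; only the `_pallas_trace_patched` sentinel and the `_patch_*` naming come from the text above.

```python
import functools

calls = []  # records wrapper invocations, standing in for trace log records


class Session:
    """Hypothetical stand-in for a patched fast-agent class."""

    def send_request(self, payload):
        return f"sent {payload}"

    def call_tool(self, name):
        return f"called {name}"


def _wrap_with_trace(owner, attr) -> bool:
    """Wrap owner.attr once; return True if a new wrapper was installed."""
    target = getattr(owner, attr)
    if getattr(target, "_pallas_trace_patched", False):
        return False  # this specific target is already wrapped

    @functools.wraps(target)
    def wrapper(self, *args, **kwargs):
        calls.append(attr)
        return target(self, *args, **kwargs)

    wrapper._pallas_trace_patched = True  # sentinel lives on the installed method
    setattr(owner, attr, wrapper)
    return True


def _patch_send_request() -> bool:
    return _wrap_with_trace(Session, "send_request")


def _patch_session_call_tool() -> bool:
    return _wrap_with_trace(Session, "call_tool")


def install() -> list:
    # No top-level guard: every helper runs and decides for itself.
    return [_patch_send_request(), _patch_session_call_tool()]
```

With this shape, adding a new `_patch_*` helper later cannot be shadowed by an earlier patch having already been applied.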

**Bonus gotcha.** `install()` runs at module-import time, which in Pallas happens *before* `pallas.log.setup_logging()` attaches the file handler. Any `logger.info("patch installed")` inside `install()` is emitted into the default handler and lost. "No 'patch installed' line in the log" is **not** evidence that the patch didn't install — only the runtime firing of the wrapper (e.g. `forward.applied …`) is a reliable presence marker.

### 3. FastMCP `on_call_tool` context shape: `message.name`, not `message.params.name`

**Symptom.** Once bearer forwarding worked, Harper's Mnemosyne tool calls came back to fast-agent as the literal string `"object NoneType can't be used in 'await' expression"`. The tool result was visible in the OpenAI request payload of the *next* turn as `{"role":"tool", "content":"object NoneType can't be used in 'await' expression"}`. No traceback anywhere in Pallas or Mnemosyne.

**Cause.** `mnemosyne/mcp_server/auth.py:MCPAuthMiddleware._extract_tool_name` read `context.message.params.name`, but inside an `on_call_tool` hook FastMCP's `MiddlewareContext[CallToolRequestParams]` exposes `.name` and `.arguments` **directly on `context.message`** — the type parameter is already the params object. The extractor always returned `None`, which:

- silently skipped the `_PUBLIC_TOOLS = {"get_health"}` bypass so even the public health probe went through JWT validation; and
- made the per-tool ACL `token.can_use_tool(None)` short-circuit.

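The `NoneType await` error string itself came from somewhere downstream of the middleware, which still unconditionally `await`ed `call_next(context)`. The precise expression was not identified. The message means that some call returned `None` and that result was then awaited, so the most likely path is a step in the FastMCP dispatch that handled the `None` tool name and awaited a non-awaitable result. It cannot literally be `await self._tools.get(None)(...)`: calling `None(...)` raises `'NoneType' object is not callable` before the `await` is reached.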

**Fix (mnemosyne commit `e0fa825`).** Read `context.message.name` directly; fall back to `message.params.name` only as a legacy safety net. Verified against fastmcp's own `Middleware.on_call_tool` signature (`MiddlewareContext[mt.CallToolRequestParams]`) and four independent docs examples.
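The corrected lookup order can be sketched with plain stand-in objects. This is not the Mnemosyne code: `extract_tool_name` is an illustrative free function, and `SimpleNamespace` instances stand in for FastMCP's `MiddlewareContext` so that the sketch runs without fastmcp installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_tool_name(context: Any) -> Optional[str]:
    """Prefer context.message.name; fall back to message.params.name."""
    message = getattr(context, "message", None)
    name = getattr(message, "name", None)
    if name is None:
        # Legacy safety net for a context that nests the params object.
        params = getattr(message, "params", None)
        name = getattr(params, "name", None)
    return name


# Shape described above for on_call_tool: name/arguments directly on message.
current = SimpleNamespace(message=SimpleNamespace(name="get_health", arguments={}))
# Legacy shape that the original extractor assumed.
legacy = SimpleNamespace(message=SimpleNamespace(params=SimpleNamespace(name="search")))

current_name = extract_tool_name(current)
legacy_name = extract_tool_name(legacy)
```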

**Diagnostic helper.** The commit also added `_call_next_with_trace` around `await call_next(context)` so any future exception inside FastMCP dispatch is captured with a full `logger.exception` traceback before propagating — and so the *success* path logs the result type, which doubles as a canary for "the middleware actually ran".
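A wrapper of this kind might look roughly like the sketch below. The logger name and log messages are assumptions, and the function is written as a free function rather than a middleware method to keep it self-contained.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("mnemosyne.mcp_server.auth")  # hypothetical logger name


async def call_next_with_trace(
    call_next: Callable[[Any], Awaitable[Any]], context: Any
) -> Any:
    """Await call_next, logging a full traceback on failure and the result type on success."""
    try:
        result = await call_next(context)
    except Exception:
        # logger.exception attaches the active traceback to the record.
        logger.exception("exception inside call_next dispatch")
        raise  # propagate unchanged so callers see the original error
    logger.info("call_next returned %s", type(result).__name__)
    return result
```

Re-raising after logging keeps the middleware's behaviour unchanged for callers while guaranteeing that the traceback is written somewhere.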

### 4. Rich-TUI corruption by DEBUG-level third-party loggers

**Symptom.** `fast-agent go` in an interactive session was unusable: massive blobs of plain-text `DEBUG:openai._base_client:Sending HTTP Request: …` and `DEBUG:sse_starlette.sse:chunk: …` lines splattered over the Rich chat UI on every redraw.

**Cause.** Two layers stacked up:

- Pallas's original `setup_logging()` set the **root logger** to whatever `logger.level` was configured. With `logger.level: debug` in `kottos/fastagent.config.yaml` (set intentionally for Pallas diagnostics), every third-party library inherited DEBUG and started emitting.
- Pallas attached a `StreamHandler(stream=sys.__stderr__)` to both root and `pallas` loggers so DEBUG records would "survive Rich's console takeover". This did solve the Rich-swallowing problem, but swapped it for a worse one: every library's DEBUG record now bypassed the Rich Live display and leaked through every TUI repaint.

**Fix (commits `dde7d4f` + `89870f4`).**
- `PALLAS_LOG_STDERR` env var gates the stderr handler. Off by default. Interactive users get a clean TUI + rotating file sink; systemd/journal deployments set `PALLAS_LOG_STDERR=1`.
- Root-logger level is decoupled from Pallas's own level. Default: `max(configured_level, INFO)`. Pallas's `pallas.*` loggers still honour `logger.level: debug`, but third-party libraries stay at INFO unless `PALLAS_ROOT_LOG_LEVEL=DEBUG` is set explicitly.
- `openai`, `openai._base_client`, `anthropic`, `anthropic._base_client`, `sse_starlette`, `sse_starlette.sse`, `mcp`, `mcp.client`, `mcp.server`, `httpx`, `httpcore` are each pinned at WARNING individually, as belt-and-braces against any future re-enablement of root DEBUG.
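The three rules above can be sketched with the standard `logging` module. `configure_levels` is a hypothetical function, not the actual `setup_logging()`; in particular, treating only the literal value `1` as "on" for `PALLAS_LOG_STDERR` is an assumption.

```python
import logging
import os
import sys
from typing import Mapping, Optional

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}

# Third-party loggers listed above, pinned regardless of the root level.
PINNED = (
    "openai", "openai._base_client", "anthropic", "anthropic._base_client",
    "sse_starlette", "sse_starlette.sse", "mcp", "mcp.client", "mcp.server",
    "httpx", "httpcore",
)


def configure_levels(configured_level: int, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Apply the level rules; return True if a stderr handler was attached."""
    env = os.environ if environ is None else environ

    # Root level is decoupled: never below INFO unless explicitly overridden.
    override = _LEVELS.get(env.get("PALLAS_ROOT_LOG_LEVEL", "").upper())
    root_level = override if override is not None else max(configured_level, logging.INFO)
    logging.getLogger().setLevel(root_level)

    # Pallas's own logger tree still honours the configured level.
    logging.getLogger("pallas").setLevel(configured_level)

    for name in PINNED:
        logging.getLogger(name).setLevel(logging.WARNING)

    # The stderr handler is opt-in so it cannot leak into a Rich TUI by default.
    if env.get("PALLAS_LOG_STDERR") == "1":
        logging.getLogger().addHandler(logging.StreamHandler(stream=sys.__stderr__))
        return True
    return False
```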

### 5. Logging configuration knobs (current state)

| Env var / config | Default | Effect |
|---|---|---|
| `PALLAS_LOG_LEVEL` | `INFO` | Level for the `pallas.*` logger tree and the rotating file sink |
| `fastagent.config.yaml` `logger.level` | fallback for `PALLAS_LOG_LEVEL` | Unified knob — flipping fast-agent's level also flips Pallas's diagnostic level |
| `PALLAS_ROOT_LOG_LEVEL` | `max(pallas_level, INFO)` | Level for the root logger (controls third-party library output). Rarely needs to be changed. |
| `PALLAS_LOG_STDERR` | unset (off) | Attach a JSON `StreamHandler` to `sys.__stderr__`. Enable for systemd/journal; leave off in Rich TUI sessions. |
| `PALLAS_LOG_FILE` | `~/.local/state/pallas/pallas.log` | Rotating JSON log file. 10 MB × 5 backups. |

The **rotating file sink is always on**. It's what catches tracebacks from fast-agent, fastmcp, the MCP SDK, and our own trace wrappers regardless of how Rich is interacting with the terminal. Tail with `jq` for structured access:

```bash
tail -n 100 -f ~/.local/state/pallas/pallas.log | jq -r '"\(.time) \(.level) \(.logger) \(.message)"'
```
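For reference, a file sink that produces records readable by the `jq` filter above could be configured roughly as follows. This is a sketch, not the code in `pallas.log`: `JsonFormatter` and `build_file_sink` are hypothetical names, the `traceback` key is an assumption, and only the `time` / `level` / `logger` / `message` keys and the 10 MB × 5 rotation come from this section.

```python
import json
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record with the keys the jq filter reads."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Hypothetical key name for the captured traceback.
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_file_sink(path: str) -> RotatingFileHandler:
    """Create the rotating JSON file handler: 10 MB per file, 5 backups."""
    log_path = Path(path).expanduser()
    log_path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_path, maxBytes=10 * 1024 * 1024, backupCount=5, encoding="utf-8"
    )
    handler.setFormatter(JsonFormatter())
    return handler
```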

When diagnosing a downstream-MCP issue, `grep pallas.forward.trace` in that file: any uncaught exception inside `send_request`, `session.call_tool`, or `_execute_on_server` appears there with full traceback, even when fast-agent's aggregator turns it into a terse `CallToolResult(isError=True)` by the time the agent loop sees it.