# Pallas — Technical Reference

Pallas is the generic runtime that turns [fast-agent](https://github.com/evalstate/fast-agent) agent definitions into StreamableHTTP MCP servers. It is **completely deployment-agnostic**: all environment-specific values (agent names, ports, hosts, model) live in the calling project's configuration files, not in Pallas itself.

---

## Solution Architecture

Pallas occupies the middle tier of a three-layer MCP architecture. It bridges a web-facing client (Daedalus) and a constellation of specialised downstream MCP servers.

```
┌──────────────────────────────────┐
│            Daedalus              │  Web UI / FastAPI / MCP client
│  Workspace management, chat,     │  Discovers agents via registry
│  health monitoring, progress     │  Calls agent tools via MCP
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Pallas (FastAgent MCP Bridge)   │  Python runtime
│                                  │
│  ┌─ Registry (port N)            │  GET /.well-known/mcp/server.json
│  ├─ Agent: Research (port N+1)   │  Chains, routers, sub-agents
│  ├─ Agent: Engineering (port N+2)│  Orchestrators, tool pipelines
│  └─ Agent: Orchestrator (N+3)    │  Delegates across agents
│                                  │
│  Each agent exposes:             │
│    • send_message tool           │
│    • get_health tool             │
│    • {agent}_history prompt      │
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Downstream MCP Servers          │
│                                  │
│  Argos     - web search          │
│  Neo4j     - knowledge graph     │
│  Mnemosyne - content library     │
│  Kernos    - shell execution     │
│  Gitea     - repository mgmt     │
│  Grafana   - monitoring          │
│  Rommie    - system management   │
└──────────────────────────────────┘
```

### Daedalus → Pallas

| Interaction | Mechanism |
|---|---|
| Agent discovery | `GET {registry}/.well-known/mcp/server.json` (plain HTTP; returns all agents with MCP endpoint URLs) |
| Agent communication | MCP `tools/call` on `send_message` (text + optional images) |
| Health monitoring | MCP `tools/call` on `get_health` (programmatic, no LLM invocation) |
| Progress feedback | MCP `notifications/progress` (streamed over SSE during long-running tool calls) |
| Conversation history | MCP `prompts/get` on `{agent}_history` (retrieves stored message history) |

### Pallas → Downstream

Pallas agents call downstream MCP servers via standard MCP tool calls. Each agent declares its servers in its fast-agent definition (`servers=["argos", "neo4j_cypher", ...]`). The server URLs and auth headers are configured in the consuming project's `fastagent.config.yaml`.

### Mnemosyne's Role

Mnemosyne provides a content-type-aware knowledge graph with hybrid search (vector + full-text + graph). Agents with `mnemosyne` in their `servers` list gain access to tools for searching documents, browsing libraries and collections, retrieving items, and traversing the concept graph. It complements Neo4j (graph topology and relationships) with content-focused retrieval and re-ranking.

### Why MCP End-to-End

Pallas is the protocol boundary: MCP above (from Daedalus) and MCP below (to downstream servers). This eliminates any MCP→REST→MCP translation layer. A single `fast.start_server(transport="http")` call exposes a complete agent as a StreamableHTTP MCP endpoint, giving Daedalus:

- **Tool discovery** via `session.list_tools()`
- **Native streaming** via MCP Streamable HTTP / SSE
- **Health checks** as ordinary tool calls, with no separate API surface
- **Progress notifications** built into the protocol

---

## Pallas Internal Architecture

Pallas is built around four core modules, composed at startup (supporting modules are listed in the Module Reference):

```
server.py main()
  │
  ├─ _load_deployment_config()         parse agents.yaml
  ├─ _build_agents_table()             {name: (module, port)}
  ├─ _build_agent_deps()               dependency graph
  │
  ├─ _start_all()  or  _run_single()
  │    │
  │    ├─ _preflight()
  │    │    ├─ _register_unknown_models()   model registration
  │    │    └─ validate_llm_providers()     LLM API key + model checks
  │    │
  │    ├─ start subagents (depends_on)
  │    ├─ wait for subagent readiness
  │    ├─ start top-level agents
  │    │    │
  │    │    └─ _start_agent(name)
  │    │         ├─ import agent module
  │    │         ├─ MultimodalAgentMCPServer(...)
  │    │         ├─ _resolve_downstream_servers()
  │    │         ├─ _preflight_mcp_servers()     warn on missing auth
  │    │         ├─ register_health_tool()
  │    │         └─ server.run_async()
  │    │
  │    └─ run_registry()               Starlette app on registry port
  │
  └─ asyncio.run(...)
```

| Module | Purpose |
|---|---|
| `pallas.server` | CLI entry point, configuration loading, agent lifecycle orchestration, model registration |
| `pallas.registry` | Starlette app serving `GET /.well-known/mcp/server.json`; builds the agent catalogue from `agents.yaml` + `fastagent.config.yaml` |
| `pallas.multimodal_server` | `MultimodalAgentMCPServer`, an `AgentMCPServer` subclass adding image attachment support and conversation history prompts |
| `pallas.health` | Two-layer health: startup LLM preflight validation + runtime `get_health` MCP tool with downstream server probing |

---

## Installation

```bash
pip install git+ssh://git@git.helu.ca:22022/r/pallas.git
```

Or as a project dependency:

```toml
dependencies = [
    "pallas-mcp @ git+ssh://git@git.helu.ca:22022/r/pallas.git",
]
```

Requires Python ≥ 3.13. Key dependencies: `fast-agent-mcp`, `httpx`, `pyyaml`, `starlette`, `uvicorn`.

---

## Project Layout

Pallas reads configuration from the **working directory** at runtime. A consuming project looks like:

```
my-project/
├── agents/
│   ├── __init__.py
│   └── jarvis.py             # FastAgent definitions
├── agents.yaml               # Deployment topology
├── fastagent.config.yaml     # FastAgent + model config
├── fastagent.secrets.yaml    # API keys (gitignored)
└── .env                      # Secret values (gitignored)
```

Pallas itself contains no agent definitions, model names, ports, or hostnames. Everything is injected by the consuming project.

---

## Configuration Reference

### `agents.yaml`

Single source of truth for deployment topology.

```yaml
name: my-project                 # log prefixes and registry names
version: "1.0.0"                 # published in registry entries
host: my-host.example.com        # hostname for registry URLs
namespace: com.example.project   # reverse-domain prefix for registry names
registry_port: 8200              # port for the registry server

agents:
  jarvis:
    module: agents.jarvis        # importable Python module path
    port: 8201                   # StreamableHTTP port for this agent
    title: Jarvis                # human-readable name (registry)
    description: "My assistant"  # one-line description (registry)
    depends_on: [research]       # optional: start these agents first

  research:
    module: agents.research
    port: 8250
    title: Research Agent
    description: "Web search and knowledge graph"
```

| Field | Required | Description |
|---|---|---|
| `name` | yes | Project name, used in log prefixes (`[my-project]`) and CLI help |
| `version` | no | Semver string published in registry entries. Default: `"1.0.0"` |
| `host` | no | Hostname used in registry `remotes[].url`. Default: `"localhost"` |
| `namespace` | no | Reverse-domain prefix for registry `server.name` (e.g. `com.example.project` yields `com.example.project/jarvis`) |
| `registry_port` | no | Port for the registry server. Default: `24200` |
| `agents.<name>.module` | yes | Importable Python module path containing a `fast` instance |
| `agents.<name>.port` | yes | Port for this agent's StreamableHTTP MCP server |
| `agents.<name>.title` | no | Display name in registry. Default: `name.title()` |
| `agents.<name>.description` | no | Description in registry |
| `agents.<name>.depends_on` | no | List of agent names that must start and become ready before this agent |

### `fastagent.config.yaml` Extensions

Pallas reads two keys beyond the standard fast-agent config:

```yaml
default_model: openai.my-model-name

model_capabilities:
  vision: false
  context_window: 200000
  max_output_tokens: 32000
```

| Key | Description |
|---|---|
| `default_model` | `provider.model-name` format. The provider prefix (`anthropic` or `openai`) determines which LLM provider is active for health checks. |
| `model_capabilities.vision` | `true` registers the model with multimodal tokenization; `false` registers as text-only. Default: `false` |
| `model_capabilities.context_window` | Context window size in tokens. Default: `131072` |
| `model_capabilities.max_output_tokens` | Max output token limit. Default: `16384` |

Capabilities are declared explicitly rather than inferred from the model name: naming conventions vary across model families, making regex heuristics brittle. These values are used both to register unknown models with fast-agent's `ModelDatabase` and to populate the registry response.

### Sampling parameters (temperature, top_p, top_k)

Sampling parameters are configured per-agent in the Python decorator, **not** in `agents.yaml` or `fastagent.config.yaml`. Pallas itself does no sampling-param handling; this is purely fast-agent decorator-side configuration.

```python
from fast_agent import FastAgent
from fast_agent.types import RequestParams

fast = FastAgent("Jeffrey", parse_cli_args=False)

@fast.agent(
    name="jeffrey",
    instruction="...",
    servers=[...],
    request_params=RequestParams(temperature=0.6, top_p=0.9),
)
async def _jeffrey():
    pass
```

Provider support varies:

| Provider | temperature | top_p | top_k |
|---|---|---|---|
| OpenAI (native, Responses API) | yes | yes | no |
| HuggingFace, OpenResponses (OpenAI-compatible) | yes | yes | yes (via `extra_body`) |
| Google Gemini | yes | yes | yes |
| Bedrock | yes | yes (most models) | varies |
| **Anthropic Claude Opus 4.7** | **no** | **no** | **no** |

Anthropic's 4.7 design moves away from low-level numeric dials toward adaptive control. fast-agent's Anthropic provider explicitly strips temperature/top_p/top_k for Opus 4.7 with a warning (see `fast_agent/llm/provider/anthropic/llm_anthropic.py:1776-1786`). On Opus 4.7, use `output_config.effort` (verbosity, including the new `xhigh` level between `high` and `max`) instead.

Setting `request_params` on an Anthropic-Opus-4.7 agent is a safe no-op: the params apply automatically the moment the agent is routed to a non-Anthropic model.

### `fastagent.secrets.yaml`

```yaml
anthropic:
  api_key: "${ANTHROPIC_API_KEY}"
openai:
  api_key: "${OPENAI_API_KEY}"
  base_url: "${OPENAI_BASE_URL}"
```

`${ENV_VAR}` placeholders are expanded at runtime from environment variables.

### `.env`

Pallas loads `.env` from the working directory into `os.environ` without overwriting existing variables. This supports both local development and systemd deployments:

```dotenv
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=http://my-llm-server:8080/v1
```

`OPENAI_BASE_URL` defaults to `https://api.openai.com/v1` if unset. For local llama-cpp, vLLM, or other OpenAI-compatible servers, set it to their endpoint.
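
The "load without overwriting" rule can be pictured with a minimal sketch. This is not Pallas's actual loader: `load_dotenv_text` is a hypothetical helper, and its parsing rules (skip blank lines and `#` comments, split on the first `=`, strip surrounding quotes) are assumptions made for illustration.

```python
import os


def load_dotenv_text(text: str, environ=None) -> dict:
    """Apply KEY=VALUE lines to `environ` without overwriting existing keys.

    Hypothetical helper for illustration; returns only the keys it added.
    """
    environ = os.environ if environ is None else environ
    added = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")  # split on the first '=' only
        key, value = key.strip(), value.strip().strip("'\"")
        if key and key not in environ:
            environ[key] = value  # variables that already exist always win
            added[key] = value
    return added
```

Under this rule, a value exported by a systemd unit takes precedence over the same key in `.env`, while keys missing from the environment are filled in from the file.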

### Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| `PALLAS_AGENTS_CONFIG` | `agents.yaml` | Override path to deployment config |

---

## Running Pallas

### CLI

```bash
pallas                      # start all agents + registry
pallas --agent jarvis       # start a single agent (no registry)
python -m pallas.server     # equivalent to `pallas`
```

### Startup Sequence

**All agents mode** (`pallas`):

1. Load `agents.yaml`, build agents table and dependency graph
2. **Preflight**: register unknown models with `ModelDatabase`, validate LLM provider API keys and model availability
3. Start the registry server on `registry_port`
4. Start **subagents** (agents listed in other agents' `depends_on`)
5. Wait for each subagent to become ready (HTTP probe on `/mcp`, 60s timeout)
6. Start **top-level agents** (everything not a subagent)
7. All servers run concurrently via `asyncio.gather`
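
Steps 4 and 6 split the agents into two groups based on `depends_on`. A minimal sketch of that split, assuming the `agents` mapping has the shape shown in `agents.yaml`; `partition_agents` is a hypothetical name, and rejecting unknown dependency names is an assumption rather than documented Pallas behaviour:

```python
def partition_agents(agents: dict) -> tuple[list[str], list[str]]:
    """Split agent names into (subagents, top_level) using `depends_on`.

    A subagent is any agent named in another agent's `depends_on` list;
    everything else is top-level. Declaration order is preserved.
    """
    referenced = set()
    for spec in agents.values():
        referenced.update(spec.get("depends_on") or [])
    unknown = referenced - set(agents)
    if unknown:
        # Assumption: fail fast on typos instead of waiting on a missing agent.
        raise ValueError(f"depends_on references unknown agents: {sorted(unknown)}")
    subagents = [name for name in agents if name in referenced]
    top_level = [name for name in agents if name not in referenced]
    return subagents, top_level
```

With the example `agents.yaml`, `research` is started and probed first, and `jarvis` is started once it is ready.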

**Single agent mode** (`pallas --agent <name>`):

1. Load `agents.yaml`
2. Preflight
3. Start the named agent (no registry, no dependency resolution)

### Per-Agent Startup

For each agent:

1. Import the agent module (`agents.<name>`) and obtain its `fast` instance
2. Enter the `fast.run()` context, which initialises the fast-agent runtime
3. Create a `MultimodalAgentMCPServer` wrapping the primary agent instance
4. Resolve downstream MCP server configs from the fast-agent configuration
5. Warn if any downstream auth headers reference unset environment variables
6. Register the `get_health` MCP tool with downstream server info
7. Bind to `0.0.0.0:<port>` and serve StreamableHTTP

---

## Daedalus Integration

This section describes the contract from Pallas's perspective. The full client-side specification is in `docs/pallas_integration.md`.

### Registration Flow

1. Daedalus stores a registry URL (e.g. `http://puck.incus:23030`)
2. Fetches `GET {url}/.well-known/mcp/server.json`
3. Discovers all agents with their MCP endpoint URLs, titles, and descriptions
4. Creates connections to each agent

### Health Polling

Daedalus calls `get_health` on each connected agent at a configurable interval (default 60s). The response maps to UI indicators:

| `status` | Daedalus behaviour |
|---|---|
| `ok` | Green badge, normal operation |
| `degraded` | Yellow badge + warning banner showing `message`. Chat allowed. |
| `error` | Red badge. Chat disabled. |

### Progress Notifications

Long-running agent tool calls (agentic loops, sub-agent delegation) emit MCP `notifications/progress` on the SSE stream. Daedalus must include a `progressToken` in the `_meta` of `tools/call` requests to opt in:

```python
from uuid import uuid4

# `session` is an already-initialised MCP client session for the agent.
result = await session.call_tool(
    "jarvis",
    arguments={"message": user_input},
    request_params={"_meta": {"progressToken": str(uuid4())}},
)
```

Progress notification fields:

| Field | Description |
|---|---|
| `progressToken` | Matches the token sent in the request |
| `progress` | Monotonically increasing step counter |
| `total` | `null` = indeterminate (loop in progress), `1.0` = sub-task finished |
| `message` | Status text: `{server}/{tool}: started\|completed\|failed` or `{agent} step N (llm\|tool)` |

Without a `progressToken`, Pallas skips all progress notifications and the client receives nothing until the final result.

### Chat Blocking

If the target agent's cached health is `error`, Daedalus returns HTTP 503 and disables the message input. `degraded` shows a warning but allows chat.

---

## Registry Server

### Endpoint

```
GET {host}:{registry_port}/.well-known/mcp/server.json
```

Plain HTTP, not MCP. No authentication. Returns `application/json`.

### Response Structure

Built dynamically from `agents.yaml` + `fastagent.config.yaml`:

```json
{
  "servers": [
    {
      "server": {
        "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
        "name": "com.example.project/jarvis",
        "title": "Jarvis",
        "description": "My assistant agent",
        "version": "1.0.0",
        "remotes": [
          { "type": "streamable-http", "url": "http://my-host.example.com:8201/mcp" }
        ],
        "capabilities": {
          "model": "my-model-name",
          "vision": false,
          "context_window": 200000,
          "max_output_tokens": 32000
        }
      },
      "_meta": {
        "io.modelcontextprotocol.registry/official": {
          "status": "active",
          "updatedAt": "2026-01-01T00:00:00Z",
          "isLatest": true
        }
      }
    }
  ]
}
```

### Registry Name Construction

`{namespace}/{slug}`, where `slug` is the agent key with underscores replaced by hyphens. Example: namespace `com.example.project` + agent key `tech_research` → `com.example.project/tech-research`.
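
The naming rule is small enough to show directly. The sketch below is illustrative only: `registry_name` and `remote_url` are hypothetical helper names, and the `http` scheme and `/mcp` path are taken from the example response above rather than from the Pallas source.

```python
def registry_name(namespace: str, agent_key: str) -> str:
    """Build `{namespace}/{slug}`; the slug swaps underscores for hyphens."""
    slug = agent_key.replace("_", "-")
    return f"{namespace}/{slug}"


def remote_url(host: str, port: int) -> str:
    """Build the agent's MCP endpoint URL in the shape shown in `remotes[].url`."""
    return f"http://{host}:{port}/mcp"
```

For the example deployment, `registry_name("com.example.project", "jarvis")` and `remote_url("my-host.example.com", 8201)` reproduce the `name` and `url` fields of the sample entry.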

### Capabilities

If `model_capabilities` is defined in `fastagent.config.yaml`, each registry entry includes a `capabilities` object with model name, vision support, context window, and max output tokens. This allows clients to make informed decisions about what an agent can handle.

---

## Multimodal Support

`MultimodalAgentMCPServer` extends fast-agent's `AgentMCPServer` with image attachment support.

### `send_message` Tool

Each agent's MCP tool accepts:

| Parameter | Type | Required | Description |
|---|---|---|---|
| `message` | `str` | yes | Text message to the agent |
| `images` | `list[dict]` | no | Base64-encoded images: `[{"data": "...", "mime_type": "image/png"}]` |

When `images` is provided, the message is sent as a `PromptMessageExtended` containing both `TextContent` and `ImageContent` parts. The agent's underlying model must support vision.

### Conversation History Prompt

For agents with `instance_scope != "request"`, a `{agent}_history` prompt is registered that returns the full conversation history as FastMCP `Message` objects. This allows clients to retrieve the stored context.

### Bearer Token Propagation

The server captures the authenticated bearer token from the incoming MCP request's `Authorization: Bearer …` header via `fastmcp.server.dependencies.get_http_request()` (FastMCP's `get_access_token()` returns `None` because Pallas runs without the auth middleware). Two consumers read it:

- **LLM-provider passthrough**: the token is also pushed into the `request_bearer_token` ContextVar for the agent's LLM provider key manager to pick up automatically (used by HuggingFace and any other token-passthrough providers). The ContextVar works here because the LLM call runs in a child task of the request handler.
- **Downstream MCP servers (opt-in)**: outgoing MCP calls inherit the same bearer when the downstream server is marked `forward_inbound_auth: true` in `fastagent.config.yaml`. Without that flag, the inbound bearer is **not** forwarded to MCP transport calls, and `server_config.headers` is the only header source.

The forwarding is per-server so a FastAgent attached to both a credentialed downstream (e.g. Mnemosyne) and an unrelated public server doesn't leak the bearer to the latter.

#### Why a simple ContextVar forward isn't enough

fast-agent's `MCPConnectionManager` runs each downstream transport inside a long-lived `anyio.TaskGroup` created at manager startup. `TaskGroup.start_soon` snapshots the owner's `contextvars.Context` at spawn time, so the request handler's context is invisible to the transport task. A straight `request_bearer_token.get()` inside `_prepare_headers_and_auth` therefore always resolves to `None` even when the inbound handler has `set` the token a few frames up. The persistent connection is additionally reused across requests, so the first-call context (often empty) would be cached forever.
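
The snapshot behaviour can be reproduced in a few lines. The sketch below uses plain `asyncio` tasks rather than `anyio` (an assumption made to keep it dependency-free; task creation copies the current context in the same way), and `transport_worker` is a hypothetical stand-in for the persistent transport task.

```python
import asyncio
import contextvars

request_bearer_token = contextvars.ContextVar("request_bearer_token", default=None)


async def transport_worker(jobs: asyncio.Queue, seen: list) -> None:
    """Long-lived task standing in for a persistent downstream transport."""
    while True:
        job = await jobs.get()
        if job is None:
            return
        # Reads the context that was copied when this task was created.
        seen.append(request_bearer_token.get())


async def demo() -> list:
    jobs, seen = asyncio.Queue(), []
    # The worker is spawned at "manager startup", before any request exists.
    worker = asyncio.create_task(transport_worker(jobs, seen))

    async def handle_request(token: str) -> None:
        request_bearer_token.set(token)  # visible only in this task's context
        await jobs.put("call downstream")

    await asyncio.create_task(handle_request("token-A"))
    await jobs.put(None)  # stop the worker
    await worker
    return seen
```

Running `asyncio.run(demo())` shows that the worker never observes `token-A`: it only ever sees the default value from the context it captured at spawn time.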

Pallas works around this in `pallas._fastagent_patch` by maintaining a process-wide `_pending_bearers` registry keyed by `id(server_config)`. `multimodal_server.send_message` calls `publish_bearer(cfg, token)` for every opted-in downstream the agent is allowed to reach; the patched `_prepare_headers_and_auth` looks it up there (with the ContextVar as a fallback for non-persistent probe paths); and the request handler's `finally` block calls `revoke_bearer(cfg)` to clear the entry. Per-request bearers therefore survive the task-group boundary without any mutation of shared config.

Example `fastagent.config.yaml` fragment with per-server opt-in:

```yaml
mcp:
  servers:
    mnemosyne:
      transport: http
      url: "https://mnemosyne.example/mcp/"
      forward_inbound_auth: true   # inbound bearer rides outbound
    weather:
      transport: http
      url: "https://weather.example/mcp/"
      # no flag → outbound calls go unauthenticated
```

When the agent receives a request with `Authorization: Bearer X`, `mnemosyne` will see `Authorization: Bearer X` on the outbound call; `weather` will see no `Authorization` header. If `mnemosyne.headers.Authorization` is set explicitly, that wins: the inbound bearer never overwrites an explicit header.

---

## Health System

Two-layer health checking: **startup preflight** validates LLM providers before agents launch, and a **runtime `get_health` tool** reports ongoing status.

### Startup Preflight

Runs once before any agents start. Validates all LLM providers that have API keys configured.

| Provider | Active (default_model matches) | Key set, not active |
|---|---|---|
| **Anthropic** | `GET /v1/models/{model}` (confirms model exists and key is valid) | `GET /v1/models/claude-sonnet-4-5` (verifies API access) |
| **OpenAI** | `GET {base_url}/models` (lists models, confirms configured model is present) | `GET {base_url}/models` (lists available models) |

- **Warn-only**: never blocks startup. Agents start regardless.
- **5-second timeout** per provider API call.
- Loads `.env` before checking.

### Runtime `get_health` Tool

Registered on each agent's MCP server. Checks:

1. **Downstream MCP servers**: sends an MCP `initialize` handshake to each server URL. Uses `initialize` because it is the only MCP method that works without a pre-established session. After success, sends `DELETE` with the returned `Mcp-Session-Id` to tear down the session cleanly. 3-second timeout.

2. **Active LLM provider**: includes the preflight result for the provider that `default_model` points to. Only the active provider affects health status.

### Response Format

```json
{ "status": "ok", "timestamp": "2026-01-01T00:00:00Z" }
```

```json
{
  "status": "degraded",
  "timestamp": "2026-01-01T00:00:00Z",
  "message": "Unreachable: neo4j_cypher; LLM: openai: model 'bad-model' not found"
}
```

| Status | Meaning |
|---|---|
| `ok` | All downstream servers reachable and active LLM provider healthy |
| `degraded` | One or more downstream servers unreachable, or active LLM provider failed |

---

## Metrics

Pallas exposes Prometheus metrics for scraping and alerting. One scrape target per Pallas deployment is sufficient: all agents run as coroutines in a single process under `asyncio.gather`, so metrics are process-global.

### Endpoint

```
GET {host}:{registry_port}/metrics
```

Plain HTTP, unauthenticated, served by the same Starlette app that hosts the registry. Returns Prometheus text exposition format (`text/plain; version=0.0.4`).

The same metrics snapshot is also available on each agent's own port at `{host}:{agent_port}/metrics`. Scraping the registry endpoint is the recommended default; the per-agent endpoints exist for cases where a load balancer terminates per-backend.

### Scrape Config

```yaml
scrape_configs:
  - job_name: pallas
    static_configs:
      - targets: ['my-host.example.com:8200']   # registry_port
        labels:
          deployment: my-project
```

### Metrics Reference

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pallas_up` | gauge | (none) | `1` while the Pallas process is running |
| `pallas_agent_info` | gauge | `agent`, `port` | `1` per configured agent; useful as a label join source |
| `pallas_send_message_total` | counter | `agent`, `outcome` | `send_message` MCP calls. `outcome` ∈ `ok`/`error` |
| `pallas_send_message_duration_seconds` | histogram | `agent` | End-to-end MCP `send_message` wall-clock duration |
| `pallas_llm_turns_total` | counter | `agent`, `model` | LLM provider round-trips per agent/model |
| `pallas_llm_tokens_total` | counter | `agent`, `model`, `kind` | Tokens consumed. `kind` ∈ `input`/`output`/`cache_read`/`cache_write`/`cache_hit`/`reasoning` |
| `pallas_tool_calls_total` | counter | `agent`, `server`, `operation`, `outcome` | Downstream MCP operations dispatched by fast-agent's aggregator. `operation` is the fast-agent operation type (`tool`, `prompt`, `resource`, …); `outcome` ∈ `ok`/`error` |
| `pallas_tool_call_duration_seconds` | histogram | `agent`, `server`, `operation` | Downstream MCP operation duration |
| `pallas_downstream_up` | gauge | `agent`, `server` | `1` when the named downstream MCP server passed the last `get_health` probe |
| `pallas_llm_provider_up` | gauge | `provider` | `1` when the active LLM provider passed its last preflight or runtime re-probe |
| `pallas_agent_health_status` | gauge | `agent` | Aggregate from the last `get_health`: `1`=ok, `0.5`=degraded, `0`=error |

Standard process metrics (RSS, CPU, GC, open FDs) are emitted by `prometheus-client`'s default collectors on the same endpoint.

### Where the Numbers Come From

- **send_message metrics**: captured around the MCP `send_message` handler in `pallas.multimodal_server`. The duration spans the full agentic loop, including all sub-agent and tool-call latency.
- **LLM token metrics**: read from fast-agent's `UsageAccumulator` on the request-scoped agent instance *before disposal*. Each request's accumulator is fresh, so every recorded turn is genuinely new and nothing is double-counted across requests.
- **Downstream tool call metrics**: recorded in the `pallas._fastagent_patch` wrapper around `MCPAggregator._execute_on_server`. This catches every dispatch (tools, prompts, resources) and is independent of which downstream server it lands on. Failures still surface in the counter as `outcome="error"` and full tracebacks remain in `pallas.forward.trace` log records.
- **Health gauges**: updated as a side effect of every `get_health` MCP call. Daedalus's polling cadence (default 60 s) therefore drives gauge freshness. The LLM gauge is also set at startup preflight and on the TTL re-probe inside `get_health`.

### Useful Queries

```promql
# Error ratio per agent (fraction of send_message calls that failed)
  sum by (agent) (rate(pallas_send_message_total{outcome="error"}[5m]))
/ sum by (agent) (rate(pallas_send_message_total[5m]))

# p95 send_message latency per agent
histogram_quantile(0.95,
  sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[5m]))
)

# Token throughput per model and kind (tokens/s, averaged over 1h)
sum by (model, kind) (rate(pallas_llm_tokens_total[1h]))

# Cache hit ratio (Anthropic)
  sum(rate(pallas_llm_tokens_total{kind="cache_read"}[5m]))
/ sum(rate(pallas_llm_tokens_total{kind=~"input|cache_read|cache_write"}[5m]))

# Any downstream MCP server unreachable
min by (server) (pallas_downstream_up) == 0

# Active LLM provider down
pallas_llm_provider_up == 0
```

### Suggested Alerts

| Alert | Expression | Notes |
|---|---|---|
| Pallas process down | `up{job="pallas"} == 0` for 1m | Scrape failure |
| Active LLM unreachable | `pallas_llm_provider_up == 0` for 5m | Preflight or TTL re-probe failing |
| Downstream MCP unreachable | `pallas_downstream_up == 0` for 10m | Per-server; gauge updates on each `get_health` |
| Agent error rate elevated | `sum by (agent) (rate(pallas_send_message_total{outcome="error"}[10m])) / sum by (agent) (rate(pallas_send_message_total[10m])) > 0.1` | >10% errors over 10 min. A bare `rate(...) > 0.1` would instead mean more than 0.1 errors per second |
| Latency regression | `histogram_quantile(0.95, sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[10m]))) > 60` | p95 over 60 s |
| Token burn | `sum(rate(pallas_llm_tokens_total{kind="output"}[1h])) > N` | Set N to your budget, expressed in output tokens per second |

---

## Model Registration

Pallas registers models not in fast-agent's built-in `ModelDatabase` at startup, using the explicit capability declarations from `fastagent.config.yaml`.

The process:

1. Read `default_model` and `model_capabilities` from config
2. Extract the model name (portion after the provider prefix dot)
3. Check if `ModelDatabase` already knows this model; if so, skip
4. Register with `ModelDatabase.register_runtime_model_params()`:
   - `vision: true` → multimodal tokenization (`QWEN_MULTIMODAL`)
   - `vision: false` → text-only tokenization (`TEXT_ONLY`)
   - `context_window` and `max_output_tokens` from config (with sensible defaults)

This avoids the brittle pattern of inferring capabilities from model name substrings, which breaks for custom or fine-tuned models with non-standard names.
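
Steps 1 and 2, together with the documented defaults, can be sketched as follows. This is an illustration only: `resolve_model` is a hypothetical helper, it stops short of calling `ModelDatabase`, and splitting on the first dot is an assumption that keeps model names containing dots intact.

```python
def resolve_model(config: dict) -> dict:
    """Derive provider, model name, and capabilities from a parsed config dict."""
    # Split on the first dot only: "openai.gpt-4.1" -> ("openai", "gpt-4.1").
    provider, _, model = config["default_model"].partition(".")
    caps = config.get("model_capabilities") or {}
    return {
        "provider": provider,
        "model": model,
        "vision": bool(caps.get("vision", False)),
        "context_window": int(caps.get("context_window", 131072)),
        "max_output_tokens": int(caps.get("max_output_tokens", 16384)),
    }
```

The default values match the `model_capabilities` table in the Configuration Reference, so a config that only sets `default_model` still produces a complete capability record.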

---

## Module Reference

| Module | File | Purpose |
|---|---|---|
| `pallas.server` | `server.py` | CLI entry point (`pallas` command), configuration loading, agent lifecycle orchestration, dependency ordering, model registration |
| `pallas.registry` | `registry.py` | Starlette app serving `GET /.well-known/mcp/server.json`; agent catalogue built from config |
| `pallas.multimodal_server` | `multimodal_server.py` | `MultimodalAgentMCPServer`, which extends `AgentMCPServer` with image support, conversation history prompts, bearer token propagation |
| `pallas.health` | `health.py` | LLM provider preflight validation, downstream MCP server probing, `get_health` tool registration |
| `pallas.log` | `log.py` | JSON log configuration, third-party traceback capture, Rich-TUI-safe handler attachment |
| `pallas._fastagent_patch` | `_fastagent_patch.py` | Monkey-patches fast-agent at import time: per-request bearer forwarding via `httpx.Auth`, diagnostic trace-capture wrappers around `send_request` / `session.call_tool` / `_execute_on_server` |

---
|
||
|
||
## Incidents & Lessons Learned

The Pallas↔Mnemosyne bearer-forwarding rollout surfaced a chain of bugs that ranged from "obvious in hindsight" to "you have to go read the fast-agent source to see why". None of the individual symptoms pointed at the true cause (each had a plausible scapegoat), which is why the actual fix was to install structured diagnostics first and work the problem end-to-end. This section captures the findings so the next person to touch this code (likely future me) does not have to re-derive them.

### 1. Per-request bearer across an `anyio.TaskGroup` boundary

**Symptom.** Per-turn JWTs minted by Daedalus and sent as `Authorization: Bearer …` to Pallas never reached Mnemosyne; Mnemosyne saw either no `Authorization` header at all or, worse and only intermittently, a bearer from a *previous* turn against an unrelated workspace.

**Cause.** fast-agent's `MCPConnectionManager` runs each downstream transport inside a long-lived `anyio.TaskGroup` created at manager startup. `TaskGroup.start_soon` snapshots the owner's `contextvars.Context` at spawn time, so any `request_bearer_token.set(…)` done in the request handler a few frames up is **invisible** to the transport task. The persistent connection additionally caches its handshake context, so the bearer observed on the *first* call (often empty during a health-probe-triggered warm-up) gets reused forever.

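The snapshot behaviour described in the cause above can be reproduced in a few lines. This is a minimal sketch that uses plain `asyncio` as a stand-in for `anyio` (both copy the current context when a task is spawned); the worker, the queue, and the token strings are hypothetical and are not Pallas or fast-agent code.

```python
import asyncio
import contextvars

# Same name as the ContextVar mentioned above; the default models "no bearer".
request_bearer_token = contextvars.ContextVar("request_bearer_token", default="")


async def transport_worker(queue: asyncio.Queue, seen: list) -> None:
    """Long-lived task standing in for a persistent downstream transport."""
    while True:
        item = await queue.get()
        if item is None:  # sentinel: shut the worker down
            break
        # Reads the context captured when the task was created,
        # not the context of whoever enqueued the call.
        seen.append(request_bearer_token.get())


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    seen: list = []
    # The task copies the current context here, before any bearer is set.
    worker = asyncio.create_task(transport_worker(queue, seen))

    request_bearer_token.set("turn-1-jwt")  # "request handler" for turn 1
    await queue.put("call_tool")
    request_bearer_token.set("turn-2-jwt")  # "request handler" for turn 2
    await queue.put("call_tool")

    await queue.put(None)
    await worker
    return seen


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Neither `set()` call is visible inside the worker, which only ever sees the default value captured at spawn time.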
**Why the first attempt didn't help.** We initially set the bearer via a `contextvars.ContextVar` and tried to have `_prepare_headers_and_auth` read it. It almost works, until any reconnect, retry, or persistent stream, at which point the cached snapshot wins.

**Fix (`pallas._fastagent_patch`).** Maintain a process-wide `_pending_bearers: dict[int, str]` keyed by `id(server_config)`, guarded by a `threading.Lock`. `multimodal_server.send_message` calls `publish_bearer(cfg, token)` for every opted-in downstream *before* spawning any tool call; the patched `_prepare_headers_and_auth` pulls the token from the registry (ContextVar used as a fallback for non-persistent probe paths); a `finally` in the request handler calls `revoke_bearer(cfg)` to clear the entry. Per-request bearers therefore survive the task-group boundary without mutating any shared config object.

**Bonus gotcha.** The opt-in was originally keyed off a custom `forward_inbound_auth: true` field on the server block, read via fast-agent's pydantic config model. Pydantic's nested-model validation silently **dropped unknown keys**, so the flag never appeared on the parsed config. Workaround: scan `fastagent.config.yaml` directly for the flag at module import time (`pallas._fastagent_patch._FORWARD_SERVERS`) rather than rely on the parsed config object.

**Bonus gotcha 2.** `httpx` caches auth handshake headers on persistent connections. A plain mutation of `server_config.headers["Authorization"]` in the request handler only affects *new* connections. The forwarding patch works by providing a custom `httpx.Auth` subclass (`_DynamicBearerAuth`) that looks up the bearer on every request, not by mutating headers. This is why the override is `auth_flow` (the generic non-async flow), not `async_auth_flow`.

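The publish/lookup/revoke cycle from the fix above can be sketched roughly as follows. Only `_pending_bearers`, `publish_bearer`, and `revoke_bearer` are names taken from the text; `lookup_bearer` and `handle_request` are hypothetical helpers added for illustration, and the real module may differ in shape.

```python
import threading
from typing import Callable, Dict, Optional

# Process-wide registry keyed by id(server_config), guarded by a lock.
_pending_bearers: Dict[int, str] = {}
_lock = threading.Lock()


def publish_bearer(server_config: object, token: str) -> None:
    """Make the current request's bearer visible to the transport task."""
    with _lock:
        _pending_bearers[id(server_config)] = token


def revoke_bearer(server_config: object) -> None:
    """Clear the entry so a later request cannot pick up a stale bearer."""
    with _lock:
        _pending_bearers.pop(id(server_config), None)


def lookup_bearer(server_config: object) -> Optional[str]:
    """Hypothetical read side, called where headers are prepared."""
    with _lock:
        return _pending_bearers.get(id(server_config))


def handle_request(server_config: object, token: str,
                   call_downstream: Callable[[], object]) -> object:
    """Hypothetical request handler: publish first, always revoke after."""
    publish_bearer(server_config, token)
    try:
        return call_downstream()
    finally:
        revoke_bearer(server_config)
```

Because the lookup goes through a plain dictionary rather than a `ContextVar`, it does not matter which task or context performs the read.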
### 2. `install()` idempotency shadowing newly-added patches

**Symptom.** After adding two new diagnostic monkey-patches (`_patch_session_call_tool`, `_patch_execute_on_server`) and reinstalling `pallas-mcp` into the Kottos venv, the trace-capture records refused to appear in `pallas.log`. Four repro cycles, five log rotations, no evidence that the new code was running.

**Cause.** `install()` had a single top-level guard on `_prepare_headers_and_auth._pallas_forward_patched`. Once the bearer-forwarding patch was applied on first import, every subsequent `install()` call returned early, skipping the *three* later `_patch_*()` helpers entirely. The patches were *present* in the installed file; they were never *executed*.

**Lesson.** A shared idempotency guard at the top of an `install()`-style function is a liability as soon as the function grows past one patch. The fix (commit `082b611`) moves each patch's guard to a per-target sentinel attribute on the target method (`target._pallas_trace_patched = True`), checked inside each helper. `install()` now calls every helper unconditionally; duplicate installs are cheap and harmless.

**Bonus gotcha.** `install()` runs at module-import time, which in Pallas happens *before* `pallas.log.setup_logging()` attaches the file handler. Any `logger.info("patch installed")` inside `install()` is emitted into the default handler and lost. "No 'patch installed' line in the log" is **not** evidence that the patch didn't install; only the runtime firing of the wrapper (e.g. `forward.applied …`) is a reliable presence marker.

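A per-target sentinel guard of the kind described in the lesson above might look like the following sketch. The helper name `_patch_with_trace` and the synchronous wrapper are assumptions made for illustration (the real targets are fast-agent methods, likely async); only the `_pallas_trace_patched` attribute name comes from the text.

```python
import functools
import logging

logger = logging.getLogger("pallas.forward.trace")


def _patch_with_trace(owner, method_name, sentinel="_pallas_trace_patched"):
    """Wrap owner.method_name once; return True only if a patch was applied."""
    original = getattr(owner, method_name)
    if getattr(original, sentinel, False):
        return False  # this target is already wrapped, nothing to do

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        try:
            return original(*args, **kwargs)
        except Exception:
            # Capture the traceback before the caller flattens the error.
            logger.exception("trace-capture: %s raised", method_name)
            raise

    setattr(wrapper, sentinel, True)  # the guard lives on the target itself
    setattr(owner, method_name, wrapper)
    return True


def install(targets):
    """Call every helper unconditionally; each one guards itself."""
    return [_patch_with_trace(owner, name) for owner, name in targets]
```

With the guard stored on each wrapped method, adding a new target to the list cannot be shadowed by an earlier patch having already been applied.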
### 3. FastMCP `on_call_tool` context shape: `message.name`, not `message.params.name`

**Symptom.** Once bearer forwarding worked, Harper's Mnemosyne tool calls came back to fast-agent as the literal string `"object NoneType can't be used in 'await' expression"`. The tool result was visible in the OpenAI request payload of the *next* turn as `{"role":"tool", "content":"object NoneType can't be used in 'await' expression"}`. No traceback anywhere in Pallas or Mnemosyne.

**Cause.** `mnemosyne/mcp_server/auth.py:MCPAuthMiddleware._extract_tool_name` read `context.message.params.name`, but inside an `on_call_tool` hook FastMCP's `MiddlewareContext[CallToolRequestParams]` exposes `.name` and `.arguments` **directly on `context.message`**, because the type parameter is already the params object. The extractor always returned `None`, which:

- silently skipped the `_PUBLIC_TOOLS = {"get_health"}` bypass so even the public health probe went through JWT validation; and
- made the per-tool ACL `token.can_use_tool(None)` short-circuit.

The `NoneType await` error string itself came from somewhere downstream of the middleware, since the middleware still unconditionally `await`ed `call_next(context)`. The exact path was never pinned down. The most likely candidate was a `None` tool lookup in the FastMCP dispatch (something like `self._tools.get(None)`) whose `None` result ended up being awaited; `await None` is what raises this exact TypeError, whereas calling `None(...)` would have raised "'NoneType' object is not callable" instead.

**Fix (mnemosyne commit `e0fa825`).** Read `context.message.name` directly; fall back to `message.params.name` only as a legacy safety net. Verified against fastmcp's own `Middleware.on_call_tool` signature (`MiddlewareContext[mt.CallToolRequestParams]`) and four independent docs examples.

**Diagnostic helper.** The commit also added `_call_next_with_trace` around `await call_next(context)` so any future exception inside FastMCP dispatch is captured with a full `logger.exception` traceback before propagating, and so the *success* path logs the result type, which doubles as a canary for "the middleware actually ran".

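The corrected lookup order from the fix above can be sketched without importing FastMCP at all. The function below is a hypothetical stand-in for `_extract_tool_name`, and the two `SimpleNamespace` objects only imitate the two context shapes discussed; they are not FastMCP types.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_tool_name(context: Any) -> Optional[str]:
    """Prefer context.message.name; fall back to message.params.name."""
    message = getattr(context, "message", None)
    name = getattr(message, "name", None)
    if name is None:
        # Legacy safety net for a shape where the params are nested.
        params = getattr(message, "params", None)
        name = getattr(params, "name", None)
    return name


# Shape seen inside on_call_tool: the message already is the params object.
current_shape = SimpleNamespace(
    message=SimpleNamespace(name="get_health", arguments={})
)

# Older assumption: the tool name nested one level deeper under params.
legacy_shape = SimpleNamespace(
    message=SimpleNamespace(params=SimpleNamespace(name="search_memory"))
)
```

Using `getattr` with a default at every step means an unexpected shape yields `None` rather than an `AttributeError`, so the caller still has to treat `None` as "unknown tool" and reject it explicitly.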
### 4. Rich-TUI corruption by DEBUG-level third-party loggers

**Symptom.** `fast-agent go` in an interactive session was unusable: massive blobs of plain-text `DEBUG:openai._base_client:Sending HTTP Request: …` and `DEBUG:sse_starlette.sse:chunk: …` lines splattered over the Rich chat UI on every redraw.

**Cause.** Two layers stacked up:

- Pallas's original `setup_logging()` set the **root logger** to whatever `logger.level` was configured. With `logger.level: debug` in `kottos/fastagent.config.yaml` (set intentionally for Pallas diagnostics), every third-party library inherited DEBUG and started emitting.
- Pallas attached a `StreamHandler(stream=sys.__stderr__)` to both root and `pallas` loggers so DEBUG records would "survive Rich's console takeover". This did solve the Rich-swallowing problem, but swapped it for a worse one: every library's DEBUG record now bypassed the Rich Live display and leaked through every TUI repaint.

**Fix (commits `dde7d4f` + `89870f4`).**

- `PALLAS_LOG_STDERR` env var gates the stderr handler. Off by default. Interactive users get a clean TUI + rotating file sink; systemd/journal deployments set `PALLAS_LOG_STDERR=1`.
- Root-logger level is decoupled from Pallas's own level. Default: `max(configured_level, INFO)`. Pallas's `pallas.*` loggers still honour `logger.level: debug`, but third-party libraries stay at INFO unless `PALLAS_ROOT_LOG_LEVEL=DEBUG` is set explicitly.
- `openai`, `openai._base_client`, `anthropic`, `anthropic._base_client`, `sse_starlette`, `sse_starlette.sse`, `mcp`, `mcp.client`, `mcp.server`, `httpx`, `httpcore` pinned at WARNING individually, as belt-and-braces against any future re-enablement of root DEBUG.

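The level decoupling in the fix above boils down to a few `setLevel` calls. The sketch below is an approximation using only the standard `logging` module; `resolve_levels` and `apply_levels` are hypothetical names, and the override is passed as a parameter here rather than read from `PALLAS_ROOT_LOG_LEVEL`.

```python
import logging
from typing import Optional, Tuple

# Third-party loggers pinned individually, as listed above.
NOISY_LOGGERS = (
    "openai", "openai._base_client", "anthropic", "anthropic._base_client",
    "sse_starlette", "sse_starlette.sse", "mcp", "mcp.client", "mcp.server",
    "httpx", "httpcore",
)


def resolve_levels(configured: str,
                   root_override: Optional[str] = None) -> Tuple[int, int]:
    """Return (pallas_level, root_level) as numeric logging levels."""
    pallas_level = getattr(logging, configured.upper())
    if root_override:
        root_level = getattr(logging, root_override.upper())
    else:
        # Root never drops below INFO unless explicitly overridden.
        root_level = max(pallas_level, logging.INFO)
    return pallas_level, root_level


def apply_levels(configured: str, root_override: Optional[str] = None) -> None:
    pallas_level, root_level = resolve_levels(configured, root_override)
    logging.getLogger().setLevel(root_level)            # third-party default
    logging.getLogger("pallas").setLevel(pallas_level)  # pallas.* tree
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)
```

A call such as `apply_levels("debug")` leaves `pallas.*` at DEBUG, the root logger at INFO, and the pinned libraries at WARNING.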
### 5. Logging configuration knobs (current state)

| Env var / config | Default | Effect |
|---|---|---|
| `PALLAS_LOG_LEVEL` | `INFO` | Level for the `pallas.*` logger tree and the rotating file sink |
| `fastagent.config.yaml` `logger.level` | fallback for `PALLAS_LOG_LEVEL` | Unified knob: flipping fast-agent's level also flips Pallas's diagnostic level |
| `PALLAS_ROOT_LOG_LEVEL` | `max(pallas_level, INFO)` | Level for the root logger (controls third-party library output). Rarely needs to be changed. |
| `PALLAS_LOG_STDERR` | unset (off) | Attach a JSON `StreamHandler` to `sys.__stderr__`. Enable for systemd/journal; leave off in Rich TUI sessions. |
| `PALLAS_LOG_FILE` | `~/.local/state/pallas/pallas.log` | Rotating JSON log file. 10 MB × 5 backups. |

The **rotating file sink is always on**. It's what catches tracebacks from fast-agent, fastmcp, the MCP SDK, and our own trace wrappers regardless of how Rich is interacting with the terminal. Tail with `jq` for structured access:

```bash
tail -n 100 -f ~/.local/state/pallas/pallas.log | jq -r '"\(.time) \(.level) \(.logger) \(.message)"'
```

When diagnosing a downstream-MCP issue, `grep pallas.forward.trace` in that file: any uncaught exception inside `send_request`, `session.call_tool`, or `_execute_on_server` appears there with full traceback, even when fast-agent's aggregator turns it into a terse `CallToolResult(isError=True)` by the time the agent loop sees it.