Robert Helewka be71709608 feat(pallas): add opt-in bearer token forwarding to downstream MCP servers

Introduce per-server `forward_inbound_auth` flag that controls whether the
inbound MCP bearer token is propagated to outbound MCP transport calls.
Implemented as a fast-agent monkey-patch auto-installed on package import,
preventing accidental credential leakage to unrelated downstream servers.

Update docs to describe the two bearer token consumers (LLM provider
passthrough and opt-in downstream MCP forwarding) with a config example.

2026-05-03 17:17:50 -04:00

Pallas — Technical Reference

Pallas is the generic runtime that turns fast-agent agent definitions into StreamableHTTP MCP servers. It is completely deployment-agnostic: all environment-specific values (agent names, ports, hosts, model) live in the calling project's configuration files, not in Pallas itself.

Solution Architecture

Pallas occupies the middle tier of a three-layer MCP architecture. It bridges a web-facing client (Daedalus) and a constellation of specialised downstream MCP servers.

┌──────────────────────────────────┐
│  Daedalus                        │  Web UI / FastAPI / MCP client
│  Workspace management, chat,     │  Discovers agents via registry
│  health monitoring, progress     │  Calls agent tools via MCP
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Pallas (FastAgent MCP Bridge)   │  Python runtime
│                                  │
│  ┌─ Registry  (port N)          │  GET /.well-known/mcp/server.json
│  ├─ Agent: Research  (port N+1) │  Chains, routers, sub-agents
│  ├─ Agent: Engineering (port N+2)│  Orchestrators, tool pipelines
│  └─ Agent: Orchestrator (N+3)   │  Delegates across agents
│                                  │
│  Each agent exposes:             │
│    • send_message tool           │
│    • get_health tool             │
│    • {agent}_history prompt      │
└──────────┬───────────────────────┘
           │ MCP over Streamable HTTP
           ▼
┌──────────────────────────────────┐
│  Downstream MCP Servers          │
│                                  │
│  Argos        — web search       │
│  Neo4j        — knowledge graph  │
│  Mnemosyne    — content library  │
│  Kernos       — shell execution  │
│  Gitea        — repository mgmt  │
│  Grafana      — monitoring       │
│  Rommie       — system management│
└──────────────────────────────────┘

Daedalus → Pallas

Interaction	Mechanism
Agent discovery	`GET {registry}/.well-known/mcp/server.json` — plain HTTP, returns all agents with MCP endpoint URLs
Agent communication	MCP `tools/call` on `send_message` — text + optional images
Health monitoring	MCP `tools/call` on `get_health` — programmatic, no LLM invocation
Progress feedback	MCP `notifications/progress` — streamed over SSE during long-running tool calls
Conversation history	MCP `prompts/get` on `{agent}_history` — retrieves stored message history

Pallas → Downstream

Pallas agents call downstream MCP servers via standard MCP tool calls. Each agent declares its servers in its fast-agent definition (servers=["argos", "neo4j_cypher", ...]). The server URLs and auth headers are configured in the consuming project's fastagent.config.yaml.

Mnemosyne's Role

Mnemosyne provides a content-type-aware knowledge graph with hybrid search (vector + full-text + graph). Agents with mnemosyne in their servers list gain access to tools for searching documents, browsing libraries and collections, retrieving items, and traversing the concept graph. It complements Neo4j (graph topology and relationships) with content-focused retrieval and re-ranking.

Why MCP End-to-End

Pallas is the protocol boundary — MCP above (from Daedalus) and MCP below (to downstream servers). This eliminates any MCP→REST→MCP translation layer. A single fast.start_server(transport="http") call exposes a complete agent as a StreamableHTTP MCP endpoint, giving Daedalus:

Tool discovery via session.list_tools()
Native streaming via MCP Streamable HTTP / SSE
Health checks as ordinary tool calls — no separate API surface
Progress notifications built into the protocol

Pallas Internal Architecture

Pallas consists of four core modules, composed at startup:

server.py main()
  │
  ├─ _load_deployment_config()         parse agents.yaml
  ├─ _build_agents_table()             {name: (module, port)}
  ├─ _build_agent_deps()               dependency graph
  │
  ├─ _start_all()  or  _run_single()
  │    │
  │    ├─ _preflight()
  │    │    ├─ _register_unknown_models()   model registration
  │    │    └─ validate_llm_providers()     LLM API key + model checks
  │    │
  │    ├─ start subagents (depends_on)
  │    ├─ wait for subagent readiness
  │    ├─ start top-level agents
  │    │    │
  │    │    └─ _start_agent(name)
  │    │         ├─ import agent module
  │    │         ├─ MultimodalAgentMCPServer(...)
  │    │         ├─ _resolve_downstream_servers()
  │    │         ├─ _preflight_mcp_servers()     warn on missing auth
  │    │         ├─ register_health_tool()
  │    │         └─ server.run_async()
  │    │
  │    └─ run_registry()               Starlette app on registry port
  │
  └─ asyncio.run(...)

Module	Purpose
`pallas.server`	CLI entry point, configuration loading, agent lifecycle orchestration, model registration
`pallas.registry`	Starlette app serving `GET /.well-known/mcp/server.json` — builds the agent catalogue from `agents.yaml` + `fastagent.config.yaml`
`pallas.multimodal_server`	`MultimodalAgentMCPServer` — `AgentMCPServer` subclass adding image attachment support and conversation history prompts
`pallas.health`	Two-layer health: startup LLM preflight validation + runtime `get_health` MCP tool with downstream server probing

Installation

pip install git+ssh://git@git.helu.ca:22022/r/pallas.git

Or as a project dependency:

dependencies = [
    "pallas-mcp @ git+ssh://git@git.helu.ca:22022/r/pallas.git",
]

Requires Python ≥ 3.13. Key dependencies: fast-agent-mcp, httpx, pyyaml, starlette, uvicorn.

Project Layout

Pallas reads configuration from the working directory at runtime. A consuming project looks like:

my-project/
├── agents/
│   ├── __init__.py
│   └── jarvis.py              # FastAgent definitions
├── agents.yaml                # Deployment topology
├── fastagent.config.yaml      # FastAgent + model config
├── fastagent.secrets.yaml     # API keys (gitignored)
└── .env                       # Secret values (gitignored)

Pallas itself contains no agent definitions, model names, ports, or hostnames. Everything is injected by the consuming project.

Configuration Reference

`agents.yaml`

Single source of truth for deployment topology.

name: my-project               # log prefixes and registry names
version: "1.0.0"               # published in registry entries
host: my-host.example.com      # hostname for registry URLs
namespace: com.example.project  # reverse-domain prefix for registry names
registry_port: 8200             # port for the registry server

agents:
  jarvis:
    module: agents.jarvis       # importable Python module path
    port: 8201                  # StreamableHTTP port for this agent
    title: Jarvis               # human-readable name (registry)
    description: "My assistant" # one-line description (registry)
    depends_on: [research]      # optional: start these agents first

  research:
    module: agents.research
    port: 8250
    title: Research Agent
    description: "Web search and knowledge graph"

Field	Required	Description
`name`	yes	Project name — used in log prefixes (`[my-project]`) and CLI help
`version`	no	Semver string published in registry entries. Default: `"1.0.0"`
`host`	no	Hostname used in registry `remotes[].url`. Default: `"localhost"`
`namespace`	no	Reverse-domain prefix for registry `server.name` (e.g. `com.example.project/jarvis`)
`registry_port`	no	Port for the registry server. Default: `24200`
`agents.<name>.module`	yes	Importable Python module path containing a `fast` instance
`agents.<name>.port`	yes	Port for this agent's StreamableHTTP MCP server
`agents.<name>.title`	no	Display name in registry. Default: `name.title()`
`agents.<name>.description`	no	Description in registry
`agents.<name>.depends_on`	no	List of agent names that must start and become ready before this agent
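
The defaults in the table above can be applied in a few lines. This is a minimal sketch, not Pallas's actual loader: `apply_defaults` is a hypothetical helper that works on an already-parsed mapping (for example the result of `yaml.safe_load`), and the empty-string default for `description` is an assumption because the table does not state one.

```python
from typing import Any

REQUIRED_AGENT_FIELDS = ("module", "port")


def apply_defaults(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a parsed agents.yaml mapping with documented defaults filled in."""
    if "name" not in raw:
        raise ValueError("agents.yaml: 'name' is required")

    config: dict[str, Any] = {
        "name": raw["name"],
        "version": raw.get("version", "1.0.0"),
        "host": raw.get("host", "localhost"),
        "namespace": raw.get("namespace"),  # no default is documented
        "registry_port": raw.get("registry_port", 24200),
        "agents": {},
    }
    for agent_name, agent in raw.get("agents", {}).items():
        missing = [field for field in REQUIRED_AGENT_FIELDS if field not in agent]
        if missing:
            raise ValueError(f"agents.{agent_name}: missing {', '.join(missing)}")
        config["agents"][agent_name] = {
            "module": agent["module"],
            "port": agent["port"],
            "title": agent.get("title", agent_name.title()),
            "description": agent.get("description", ""),  # assumed default
            "depends_on": list(agent.get("depends_on", [])),
        }
    return config
```

Failing fast on a missing `module` or `port` is a design choice for the sketch; it surfaces a broken topology before any agent process is started.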

`fastagent.config.yaml` Extensions

Pallas reads two keys beyond the standard fast-agent config:

default_model: openai.my-model-name

model_capabilities:
  vision: false
  context_window: 200000
  max_output_tokens: 32000

Key	Description
`default_model`	`provider.model-name` format. The provider prefix (`anthropic` or `openai`) determines which LLM provider is active for health checks.
`model_capabilities.vision`	`true` registers the model with multimodal tokenization; `false` registers as text-only. Default: `false`
`model_capabilities.context_window`	Context window size in tokens. Default: `131072`
`model_capabilities.max_output_tokens`	Max output token limit. Default: `16384`

Capabilities are declared explicitly rather than inferred from the model name, because naming conventions vary across model families and make regex heuristics brittle. These values are used both to register unknown models with fast-agent's ModelDatabase and to populate the registry response.

`fastagent.secrets.yaml`

anthropic:
  api_key: "${ANTHROPIC_API_KEY}"
openai:
  api_key: "${OPENAI_API_KEY}"
  base_url: "${OPENAI_BASE_URL}"

${ENV_VAR} placeholders are expanded at runtime from environment variables.

`.env`

Pallas loads .env from the working directory into os.environ without overwriting existing variables. This supports both local development and systemd deployments:

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=http://my-llm-server:8080/v1

OPENAI_BASE_URL defaults to https://api.openai.com/v1 if unset. For local llama-cpp, vLLM, or other OpenAI-compatible servers, set it to their endpoint.

Environment Variables

Variable	Default	Purpose
`PALLAS_AGENTS_CONFIG`	`agents.yaml`	Override path to deployment config

Running Pallas

CLI

pallas                     # start all agents + registry
pallas --agent jarvis      # start a single agent (no registry)
python -m pallas.server    # equivalent to `pallas`

Startup Sequence

All agents mode (pallas):

1. Load agents.yaml, then build the agents table and dependency graph.
2. Run preflight: register unknown models with ModelDatabase, then validate LLM provider API keys and model availability.
3. Start the registry server on registry_port.
4. Start subagents (agents listed in other agents' depends_on).
5. Wait for each subagent to become ready (HTTP probe on /mcp, 60s timeout).
6. Start top-level agents (everything that is not a subagent).
7. All servers run concurrently via asyncio.gather.
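
The split between subagents and top-level agents follows directly from depends_on. The sketch below shows one way to compute it from the parsed agents mapping; `split_startup_order` is a hypothetical name, and raising on an unknown dependency is an assumption rather than documented behaviour.

```python
def split_startup_order(agents: dict[str, dict]) -> tuple[list[str], list[str]]:
    """Split agents into (subagents, top_level).

    A subagent is any agent named in another agent's depends_on list;
    everything else is top-level. Declaration order is preserved.
    """
    depended_on = {dep for spec in agents.values() for dep in spec.get("depends_on", [])}
    unknown = depended_on - agents.keys()
    if unknown:
        raise ValueError(f"depends_on references unknown agents: {sorted(unknown)}")
    subagents = [name for name in agents if name in depended_on]
    top_level = [name for name in agents if name not in depended_on]
    return subagents, top_level


agents = {
    "jarvis": {"module": "agents.jarvis", "port": 8201, "depends_on": ["research"]},
    "research": {"module": "agents.research", "port": 8250},
}
print(split_startup_order(agents))  # (['research'], ['jarvis'])
```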

Single agent mode (pallas --agent <name>):

1. Load agents.yaml.
2. Run preflight.
3. Start the named agent (no registry, no dependency resolution).

Per-Agent Startup

For each agent:

1. Import the agent module (agents.<name>) and obtain its fast instance.
2. Enter the fast.run() context, which initialises the fast-agent runtime.
3. Create a MultimodalAgentMCPServer wrapping the primary agent instance.
4. Resolve downstream MCP server configs from the fast-agent configuration.
5. Warn if any downstream auth headers reference unset environment variables.
6. Register the get_health MCP tool with downstream server info.
7. Bind to 0.0.0.0:<port> and serve StreamableHTTP.

Daedalus Integration

This section describes the contract from Pallas's perspective. The full client-side specification is in docs/pallas_integration.md.

Registration Flow

1. Daedalus stores a registry URL (e.g. http://puck.incus:23030).
2. It fetches GET {url}/.well-known/mcp/server.json.
3. It discovers all agents with their MCP endpoint URLs, titles, and descriptions.
4. It creates a connection to each agent.

Health Polling

Daedalus calls get_health on each connected agent at a configurable interval (default 60s). The response maps to UI indicators:

`status`	Daedalus behaviour
`ok`	Green badge, normal operation
`degraded`	Yellow badge + warning banner showing `message`. Chat allowed.
`error`	Red badge. Chat disabled.

Progress Notifications

Long-running agent tool calls (agentic loops, sub-agent delegation) emit MCP notifications/progress on the SSE stream. Daedalus must include a progressToken in the _meta of tools/call requests to opt in:

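from uuid import uuid4

# Opt in to progress notifications by sending a progressToken in _meta.
result = await session.call_tool(
    "send_message",
    arguments={"message": user_input},
    request_params={"_meta": {"progressToken": str(uuid4())}},
)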

Progress notification fields:

Field	Description
`progressToken`	Matches the token sent in the request
`progress`	Monotonically increasing step counter
`total`	`null` = indeterminate (loop in progress), `1.0` = sub-task finished
`message`	Status text: `{server}/{tool}: started|completed|failed` or `{agent} step N (llm|tool)`

Without a progressToken, Pallas skips all progress notifications and the client receives nothing until the final result.
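
On the client side, the fields above are enough to drive a simple status line. The following is an illustrative sketch only: `describe_progress` is a hypothetical helper, and it treats any non-null `total` as "finished", which matches the two values documented in the table.

```python
def describe_progress(params: dict) -> str:
    """Render the params of one notifications/progress message as a status line."""
    step = params["progress"]
    total = params.get("total")
    message = params.get("message") or ""
    # total is null while a loop is still running and 1.0 once a sub-task finishes.
    state = "running" if total is None else "done"
    return f"[{step:g}] {state}: {message}".rstrip()


print(describe_progress({"progressToken": "t", "progress": 3, "total": None,
                         "message": "jarvis step 3 (llm)"}))
```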

Chat Blocking

If the target agent's cached health status is error, Daedalus returns HTTP 503 and disables the message input. A degraded status shows a warning but still allows chat.
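
The status-to-behaviour mapping can be expressed as a small lookup. This sketch is not Daedalus code: `HEALTH_UI` and `chat_response_code` are hypothetical names, the 200 success code is an assumption, and so is the choice to treat an unrecognised status like error.

```python
HEALTH_UI = {
    "ok": {"badge": "green", "chat_enabled": True},
    "degraded": {"badge": "yellow", "chat_enabled": True},
    "error": {"badge": "red", "chat_enabled": False},
}


def chat_response_code(cached_status: str) -> int:
    """Return the HTTP status a chat request would receive for a cached health status."""
    # Assumption: an unknown status is handled conservatively, like "error".
    ui = HEALTH_UI.get(cached_status, HEALTH_UI["error"])
    return 200 if ui["chat_enabled"] else 503
```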

Registry Server

Endpoint

GET http://{host}:{registry_port}/.well-known/mcp/server.json

Plain HTTP — not MCP. No authentication. Returns application/json.

Response Structure

Built dynamically from agents.yaml + fastagent.config.yaml:

{
  "servers": [
    {
      "server": {
        "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
        "name": "com.example.project/jarvis",
        "title": "Jarvis",
        "description": "My assistant",
        "version": "1.0.0",
        "remotes": [
          { "type": "streamable-http", "url": "http://my-host.example.com:8201/mcp" }
        ],
        "capabilities": {
          "model": "my-model-name",
          "vision": false,
          "context_window": 200000,
          "max_output_tokens": 32000
        }
      },
      "_meta": {
        "io.modelcontextprotocol.registry/official": {
          "status": "active",
          "updatedAt": "2026-01-01T00:00:00Z",
          "isLatest": true
        }
      }
    }
  ]
}
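
A client can walk this structure to find each agent's MCP endpoint. The sketch below assumes the catalogue has already been fetched and decoded into a dictionary; `agent_endpoints` is a hypothetical helper, and the sample is a trimmed copy of the response shown above.

```python
def agent_endpoints(catalogue: dict) -> dict[str, str]:
    """Map each registry server name to its first streamable-http URL."""
    endpoints: dict[str, str] = {}
    for entry in catalogue.get("servers", []):
        server = entry["server"]
        for remote in server.get("remotes", []):
            if remote.get("type") == "streamable-http":
                endpoints[server["name"]] = remote["url"]
                break
    return endpoints


sample = {
    "servers": [
        {
            "server": {
                "name": "com.example.project/jarvis",
                "title": "Jarvis",
                "remotes": [
                    {"type": "streamable-http", "url": "http://my-host.example.com:8201/mcp"}
                ],
            }
        }
    ]
}
print(agent_endpoints(sample))
```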

Registry Name Construction

{namespace}/{slug} — where slug is the agent key with underscores replaced by hyphens. Example: namespace com.example.project + agent key tech_research → com.example.project/tech-research.
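
The naming rule is short enough to show in full. `registry_name` is a hypothetical function name used for illustration.

```python
def registry_name(namespace: str, agent_key: str) -> str:
    """Build the registry server name: namespace, a slash, and the hyphenated agent key."""
    slug = agent_key.replace("_", "-")
    return f"{namespace}/{slug}"


print(registry_name("com.example.project", "tech_research"))  # com.example.project/tech-research
```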

Capabilities

If model_capabilities is defined in fastagent.config.yaml, each registry entry includes a capabilities object with model name, vision support, context window, and max output tokens. This allows clients to make informed decisions about what an agent can handle.

Multimodal Support

MultimodalAgentMCPServer extends fast-agent's AgentMCPServer with image attachment support.

`send_message` Tool

Each agent's MCP tool accepts:

Parameter	Type	Required	Description
`message`	`str`	yes	Text message to the agent
`images`	`list[dict]`	no	Base64-encoded images: `[{"data": "...", "mime_type": "image/png"}]`

When images is provided, the message is sent as a PromptMessageExtended containing both TextContent and ImageContent parts — the agent's underlying model must support vision.

Conversation History Prompt

For agents with instance_scope != "request", a {agent}_history prompt is registered that returns the full conversation history as FastMCP Message objects. This allows clients to retrieve the stored context.

Bearer Token Propagation

The server captures the authenticated bearer token from the incoming MCP request into the request_bearer_token context variable. Two consumers read it:

LLM-provider passthrough — the agent's LLM provider key manager picks it up automatically (used by HuggingFace and any other token-passthrough providers).
Downstream MCP servers (opt-in) — outgoing MCP calls inherit the same bearer when the downstream server is marked forward_inbound_auth: true in fastagent.config.yaml. Without that flag, request_bearer_token is not forwarded to MCP transport calls — server_config.headers is the only header source. This is implemented as a fast-agent monkey-patch in pallas._fastagent_patch and is per-server so a FastAgent attached to both a credentialed downstream (e.g. Mnemosyne) and an unrelated public server doesn't leak the bearer to the latter.

Example:

mcp:
  servers:
    mnemosyne:
      transport: http
      url: "https://mnemosyne.example/mcp/"
      forward_inbound_auth: true   # inbound bearer rides outbound
    weather:
      transport: http
      url: "https://weather.example/mcp/"
      # no flag → outbound calls go unauthenticated

When the agent receives a request with Authorization: Bearer X, mnemosyne sees Authorization: Bearer X on the outbound call, while weather sees no Authorization header. If mnemosyne.headers.Authorization is set explicitly, the explicit header wins: the inbound bearer never overwrites it.

Health System

Two-layer health checking: startup preflight validates LLM providers before agents launch, and a runtime get_health tool reports ongoing status.

Startup Preflight

Runs once before any agents start. Validates all LLM providers that have API keys configured.

Provider	Active (default_model matches)	Key set, not active
Anthropic	`GET /v1/models/{model}` — confirms model exists and key is valid	`GET /v1/models/claude-sonnet-4-5` — verifies API access
OpenAI	`GET {base_url}/models` — lists models, confirms configured model is present	`GET {base_url}/models` — lists available models

Warn-only — never blocks startup. Agents start regardless.
5-second timeout per provider API call.
Loads .env before checking.

Runtime `get_health` Tool

Registered on each agent's MCP server. Checks:

Downstream MCP servers — sends an MCP initialize handshake to each server URL. Uses initialize because it is the only MCP method that works without a pre-established session. After success, sends DELETE with the returned Mcp-Session-Id to tear down the session cleanly. 3-second timeout.
Active LLM provider — includes the preflight result for the provider that default_model points to. Only the active provider affects health status.
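
The initialize-then-DELETE probe can be sketched with the standard library alone. This is an illustration, not the code in pallas.health (Pallas lists httpx among its dependencies). The function names, the client name, and the protocol version string are all assumptions, and the probe sends no auth headers.

```python
import json
import urllib.request

PROTOCOL_VERSION = "2025-03-26"  # assumption: a version the downstream server accepts


def build_initialize_request(url: str) -> urllib.request.Request:
    """Build the JSON-RPC initialize POST for a Streamable HTTP MCP endpoint."""
    body = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": "health-probe", "version": "0.0.0"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers initialize; tear down any session it opened."""
    try:
        with urllib.request.urlopen(build_initialize_request(url), timeout=timeout) as resp:
            session_id = resp.headers.get("Mcp-Session-Id")
    except OSError:  # covers URLError, HTTPError, refused connections, and timeouts
        return False
    if session_id:
        teardown = urllib.request.Request(
            url, method="DELETE", headers={"Mcp-Session-Id": session_id}
        )
        try:
            urllib.request.urlopen(teardown, timeout=timeout).close()
        except OSError:
            pass  # best-effort cleanup; the probe itself already succeeded
    return True
```

Treating the DELETE as best-effort means a server that declines explicit session termination is still reported as reachable.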

Response Format

{ "status": "ok", "timestamp": "2026-01-01T00:00:00Z" }

{
  "status": "degraded",
  "timestamp": "2026-01-01T00:00:00Z",
  "message": "Unreachable: neo4j_cypher; LLM: openai: model 'bad-model' not found"
}

Status	Meaning
`ok`	All downstream servers reachable and active LLM provider healthy
`degraded`	One or more downstream servers unreachable, or active LLM provider failed

Model Registration

Pallas registers models not in fast-agent's built-in ModelDatabase at startup, using the explicit capability declarations from fastagent.config.yaml.

The process:

1. Read default_model and model_capabilities from config.
2. Extract the model name (the portion after the provider prefix and its dot).
3. Check whether ModelDatabase already knows this model; if so, skip registration.
4. Register with ModelDatabase.register_runtime_model_params():
   - vision: true → multimodal tokenization (QWEN_MULTIMODAL)
   - vision: false → text-only tokenization (TEXT_ONLY)
   - context_window and max_output_tokens from config (defaults: 131072 and 16384)

This avoids the brittle pattern of inferring capabilities from model name substrings, which breaks for custom or fine-tuned models with non-standard names.
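
The first two steps, plus the documented defaults, can be sketched without touching fast-agent at all. Both function names below are hypothetical, and the actual call to ModelDatabase is deliberately left out because its signature is not shown in this document.

```python
def split_default_model(default_model: str) -> tuple[str, str]:
    """Split 'provider.model-name' on the first dot only, so model names may contain dots."""
    provider, sep, model = default_model.partition(".")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider.model-name', got {default_model!r}")
    return provider, model


def registration_params(config: dict) -> dict:
    """Collect the values a runtime model registration would need, applying defaults."""
    caps = config.get("model_capabilities") or {}
    provider, model = split_default_model(config["default_model"])
    return {
        "provider": provider,
        "model": model,
        "vision": bool(caps.get("vision", False)),
        "context_window": int(caps.get("context_window", 131072)),
        "max_output_tokens": int(caps.get("max_output_tokens", 16384)),
    }


print(registration_params({"default_model": "openai.my-model-name"}))
```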

Module Reference

Module	File	Purpose
`pallas.server`	`server.py`	CLI entry point (`pallas` command), configuration loading, agent lifecycle orchestration, dependency ordering, model registration
`pallas.registry`	`registry.py`	Starlette app serving `GET /.well-known/mcp/server.json` — agent catalogue built from config
`pallas.multimodal_server`	`multimodal_server.py`	`MultimodalAgentMCPServer` — extends `AgentMCPServer` with image support, conversation history prompts, bearer token propagation
`pallas.health`	`health.py`	LLM provider preflight validation, downstream MCP server probing, `get_health` tool registration
