feat: add /healthz and /metrics endpoints, replace print with logging

- Add /healthz endpoint returning LLM provider validation status
- Add /metrics endpoint serving Prometheus metrics via prometheus_client
- Replace all print() calls in health.py with proper logging module
- Remove _PREFIX variable in favor of structured logger context
2026-04-10 11:22:26 +00:00
parent 9092afb532
commit 0cea5ece3a
10 changed files with 1433 additions and 36 deletions
--- a/docs/pallas.md
+++ b/docs/pallas.md
@@ -0,0 +1,495 @@
+# Pallas — Technical Reference
+
+Pallas is the generic runtime that turns [fast-agent](https://github.com/evalstate/fast-agent) agent definitions into StreamableHTTP MCP servers. It is **completely deployment-agnostic**: all environment-specific values (agent names, ports, hosts, model) live in the calling project's configuration files, not in Pallas itself.
+
+---
+
+## Solution Architecture
+
+Pallas occupies the middle tier of a three-layer MCP architecture. It bridges a web-facing client (Daedalus) and a constellation of specialised downstream MCP servers.
+
+```
+┌──────────────────────────────────┐
+│  Daedalus                        │  Web UI / FastAPI / MCP client
+│  Workspace management, chat,     │  Discovers agents via registry
+│  health monitoring, progress     │  Calls agent tools via MCP
+└──────────┬───────────────────────┘
+           │ MCP over Streamable HTTP
+           ▼
+┌──────────────────────────────────┐
+│  Pallas (FastAgent MCP Bridge)   │  Python runtime
+│                                  │
+│  ┌─ Registry  (port N)          │  GET /.well-known/mcp/server.json
+│  ├─ Agent: Research  (port N+1) │  Chains, routers, sub-agents
+│  ├─ Agent: Engineering (port N+2)│  Orchestrators, tool pipelines
+│  └─ Agent: Orchestrator (N+3)   │  Delegates across agents
+│                                  │
+│  Each agent exposes:             │
+│    • send_message tool           │
+│    • get_health tool             │
+│    • {agent}_history prompt      │
+└──────────┬───────────────────────┘
+           │ MCP over Streamable HTTP
+           ▼
+┌──────────────────────────────────┐
+│  Downstream MCP Servers          │
+│                                  │
+│  Argos        — web search       │
+│  Neo4j        — knowledge graph  │
+│  Mnemosyne    — content library  │
+│  Kernos       — shell execution  │
+│  Gitea        — repository mgmt  │
+│  Grafana      — monitoring       │
+│  Rommie       — system management│
+└──────────────────────────────────┘
+```
+
+### Daedalus → Pallas
+
+| Interaction | Mechanism |
+|---|---|
+| Agent discovery | `GET {registry}/.well-known/mcp/server.json` — plain HTTP, returns all agents with MCP endpoint URLs |
+| Agent communication | MCP `tools/call` on `send_message` — text + optional images |
+| Health monitoring | MCP `tools/call` on `get_health` — programmatic, no LLM invocation |
+| Progress feedback | MCP `notifications/progress` — streamed over SSE during long-running tool calls |
+| Conversation history | MCP `prompts/get` on `{agent}_history` — retrieves stored message history |
+
+### Pallas → Downstream
+
+Pallas agents call downstream MCP servers via standard MCP tool calls. Each agent declares its servers in its fast-agent definition (`servers=["argos", "neo4j_cypher", ...]`). The server URLs and auth headers are configured in the consuming project's `fastagent.config.yaml`.
+
+### Mnemosyne's Role
+
+Mnemosyne provides a content-type-aware knowledge graph with hybrid search (vector + full-text + graph). Agents with `mnemosyne` in their `servers` list gain access to tools for searching documents, browsing libraries and collections, retrieving items, and traversing the concept graph. It complements Neo4j (graph topology and relationships) with content-focused retrieval and re-ranking.
+
+### Why MCP End-to-End
+
+Pallas is the protocol boundary — MCP above (from Daedalus) and MCP below (to downstream servers). This eliminates any MCP→REST→MCP translation layer. A single `fast.start_server(transport="http")` call exposes a complete agent as a StreamableHTTP MCP endpoint, giving Daedalus:
+
+- **Tool discovery** via `session.list_tools()`
+- **Native streaming** via MCP Streamable HTTP / SSE
+- **Health checks** as ordinary tool calls — no separate API surface
+- **Progress notifications** built into the protocol
+
+---
+
+## Pallas Internal Architecture
+
+Pallas is four modules, composed at startup:
+
+```
+server.py main()
+  │
+  ├─ _load_deployment_config()         parse agents.yaml
+  ├─ _build_agents_table()             {name: (module, port)}
+  ├─ _build_agent_deps()               dependency graph
+  │
+  ├─ _start_all()  or  _run_single()
+  │    │
+  │    ├─ _preflight()
+  │    │    ├─ _register_unknown_models()   model registration
+  │    │    └─ validate_llm_providers()     LLM API key + model checks
+  │    │
+  │    ├─ start subagents (depends_on)
+  │    ├─ wait for subagent readiness
+  │    ├─ start top-level agents
+  │    │    │
+  │    │    └─ _start_agent(name)
+  │    │         ├─ import agent module
+  │    │         ├─ MultimodalAgentMCPServer(...)
+  │    │         ├─ _resolve_downstream_servers()
+  │    │         ├─ _preflight_mcp_servers()     warn on missing auth
+  │    │         ├─ register_health_tool()
+  │    │         └─ server.run_async()
+  │    │
+  │    └─ run_registry()               Starlette app on registry port
+  │
+  └─ asyncio.run(...)
+```
+
+| Module | Purpose |
+|---|---|
+| `pallas.server` | CLI entry point, configuration loading, agent lifecycle orchestration, model registration |
+| `pallas.registry` | Starlette app serving `GET /.well-known/mcp/server.json` — builds the agent catalogue from `agents.yaml` + `fastagent.config.yaml` |
+| `pallas.multimodal_server` | `MultimodalAgentMCPServer` — `AgentMCPServer` subclass adding image attachment support and conversation history prompts |
+| `pallas.health` | Two-layer health: startup LLM preflight validation + runtime `get_health` MCP tool with downstream server probing |
+
+---
+
+## Installation
+
+```bash
+pip install git+ssh://git@git.helu.ca:22022/r/pallas.git
+```
+
+Or as a project dependency:
+
+```toml
+dependencies = [
+    "pallas-mcp @ git+ssh://git@git.helu.ca:22022/r/pallas.git",
+]
+```
+
+Requires Python ≥ 3.13. Key dependencies: `fast-agent-mcp`, `httpx`, `pyyaml`, `starlette`, `uvicorn`.
+
+---
+
+## Project Layout
+
+Pallas reads configuration from the **working directory** at runtime. A consuming project looks like:
+
+```
+my-project/
+├── agents/
+│   ├── __init__.py
+│   └── jarvis.py              # FastAgent definitions
+├── agents.yaml                # Deployment topology
+├── fastagent.config.yaml      # FastAgent + model config
+├── fastagent.secrets.yaml     # API keys (gitignored)
+└── .env                       # Secret values (gitignored)
+```
+
+Pallas itself contains no agent definitions, model names, ports, or hostnames. Everything is injected by the consuming project.
+
+---
+
+## Configuration Reference
+
+### `agents.yaml`
+
+Single source of truth for deployment topology.
+
+```yaml
+name: my-project               # log prefixes and registry names
+version: "1.0.0"               # published in registry entries
+host: my-host.example.com      # hostname for registry URLs
+namespace: com.example.project  # reverse-domain prefix for registry names
+registry_port: 8200             # port for the registry server
+
+agents:
+  jarvis:
+    module: agents.jarvis       # importable Python module path
+    port: 8201                  # StreamableHTTP port for this agent
+    title: Jarvis               # human-readable name (registry)
+    description: "My assistant" # one-line description (registry)
+    depends_on: [research]      # optional: start these agents first
+
+  research:
+    module: agents.research
+    port: 8250
+    title: Research Agent
+    description: "Web search and knowledge graph"
+```
+
+| Field | Required | Description |
+|---|---|---|
+| `name` | yes | Project name — used in log prefixes (`[my-project]`) and CLI help |
+| `version` | no | Semver string published in registry entries. Default: `"1.0.0"` |
+| `host` | no | Hostname used in registry `remotes[].url`. Default: `"localhost"` |
+| `namespace` | no | Reverse-domain prefix for registry `server.name` (e.g. `com.example.project/jarvis`) |
+| `registry_port` | no | Port for the registry server. Default: `24200` |
+| `agents.<name>.module` | yes | Importable Python module path containing a `fast` instance |
+| `agents.<name>.port` | yes | Port for this agent's StreamableHTTP MCP server |
+| `agents.<name>.title` | no | Display name in registry. Default: `name.title()` |
+| `agents.<name>.description` | no | Description in registry |
+| `agents.<name>.depends_on` | no | List of agent names that must start and become ready before this agent |
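+
+The required fields and defaults in the table above can be sketched as a small normalisation step. This is a minimal illustration, not Pallas's actual loader: `normalise_deployment` and `DEFAULTS` are hypothetical names, the input is assumed to be the dict produced by parsing `agents.yaml`, and the empty-string default for `description` is an assumption.
+
+```python
+from typing import Any
+
+# Defaults taken from the field table above.
+DEFAULTS = {"version": "1.0.0", "host": "localhost", "registry_port": 24200}
+
+
+def normalise_deployment(raw: dict[str, Any]) -> dict[str, Any]:
+    """Validate required fields and fill in the documented defaults."""
+    if "name" not in raw:
+        raise ValueError("agents.yaml: 'name' is required")
+    config = {**DEFAULTS, **raw}
+    agents = {}
+    for key, spec in raw.get("agents", {}).items():
+        for field in ("module", "port"):
+            if field not in spec:
+                raise ValueError(f"agents.{key}.{field} is required")
+        agents[key] = {
+            "module": spec["module"],
+            "port": int(spec["port"]),
+            "title": spec.get("title", key.title()),
+            "description": spec.get("description", ""),  # assumed default
+            "depends_on": list(spec.get("depends_on", [])),
+        }
+    config["agents"] = agents
+    return config
+
+
+# Input shaped like the parsed example configuration above.
+example = normalise_deployment({
+    "name": "my-project",
+    "agents": {
+        "jarvis": {"module": "agents.jarvis", "port": 8201, "depends_on": ["research"]},
+        "research": {"module": "agents.research", "port": 8250},
+    },
+})
+```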
+
+### `fastagent.config.yaml` Extensions
+
+Pallas reads two keys beyond the standard fast-agent config:
+
+```yaml
+default_model: openai.my-model-name
+
+model_capabilities:
+  vision: false
+  context_window: 200000
+  max_output_tokens: 32000
+```
+
+| Key | Description |
+|---|---|
+| `default_model` | `provider.model-name` format. The provider prefix (`anthropic` or `openai`) determines which LLM provider is active for health checks. |
+| `model_capabilities.vision` | `true` registers the model with multimodal tokenization; `false` registers as text-only. Default: `false` |
+| `model_capabilities.context_window` | Context window size in tokens. Default: `131072` |
+| `model_capabilities.max_output_tokens` | Max output token limit. Default: `16384` |
+
+Capabilities are declared explicitly rather than inferred from the model name: naming conventions vary across model families, which makes regex heuristics brittle. These values are used both to register unknown models with fast-agent's `ModelDatabase` and to populate the registry response.
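+
+As a rough illustration of how these two keys could be interpreted, the sketch below splits `default_model` on its first dot and merges the documented capability defaults. `resolve_model` and `CAPABILITY_DEFAULTS` are hypothetical names; the real parsing in Pallas may differ.
+
+```python
+from typing import Any
+
+# Defaults as documented in the table above.
+CAPABILITY_DEFAULTS = {"vision": False, "context_window": 131072, "max_output_tokens": 16384}
+
+
+def resolve_model(config: dict[str, Any]) -> tuple[str, str, dict[str, Any]]:
+    """Split default_model into (provider, model) and merge capability defaults."""
+    # Split on the first dot only, so model names containing dots survive intact.
+    provider, sep, model = config["default_model"].partition(".")
+    if not sep or not model:
+        raise ValueError("default_model must look like 'provider.model-name'")
+    capabilities = {**CAPABILITY_DEFAULTS, **(config.get("model_capabilities") or {})}
+    return provider, model, capabilities
+
+
+provider, model, caps = resolve_model({
+    "default_model": "openai.my-model-name",
+    "model_capabilities": {"context_window": 200000},
+})
+```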
+
+### `fastagent.secrets.yaml`
+
+```yaml
+anthropic:
+  api_key: "${ANTHROPIC_API_KEY}"
+openai:
+  api_key: "${OPENAI_API_KEY}"
+  base_url: "${OPENAI_BASE_URL}"
+```
+
+`${ENV_VAR}` placeholders are expanded at runtime from environment variables.
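+
+Placeholder expansion of this kind can be approximated with a single regular expression. The sketch below is an assumption about the behaviour, not the implementation used by fast-agent or Pallas: `expand_placeholders` is a hypothetical helper, and leaving unset variables untouched is a choice made for the example.
+
+```python
+import re
+from collections.abc import Mapping
+
+_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
+
+
+def expand_placeholders(value: str, env: Mapping[str, str]) -> str:
+    """Replace each ${NAME} with env[NAME]; unset names are left as-is (assumption)."""
+    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)
+
+
+env = {"OPENAI_API_KEY": "sk-test"}
+expanded = expand_placeholders("${OPENAI_API_KEY}", env)
+unresolved = expand_placeholders("${OPENAI_BASE_URL}", env)
+```
+
+In real use `env` would typically be `os.environ`; a plain dict is used here so the example has no side effects.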
+
+### `.env`
+
+Pallas loads `.env` from the working directory into `os.environ` without overwriting existing variables. This supports both local development and systemd deployments:
+
+```dotenv
+ANTHROPIC_API_KEY=sk-ant-...
+OPENAI_API_KEY=sk-...
+OPENAI_BASE_URL=http://my-llm-server:8080/v1
+```
+
+`OPENAI_BASE_URL` defaults to `https://api.openai.com/v1` if unset. For local llama-cpp, vLLM, or other OpenAI-compatible servers, set it to their endpoint.
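+
+The "load without overwriting" behaviour described in this section can be sketched with `os.environ.setdefault`. This is an illustrative helper (`load_dotenv` here is a hypothetical local function, not the `python-dotenv` package), and the handling of comments and quotes is an assumption.
+
+```python
+import os
+from pathlib import Path
+
+
+def load_dotenv(path: str | Path = ".env") -> None:
+    """Load KEY=VALUE lines into os.environ, keeping variables that are already set."""
+    env_file = Path(path)
+    if not env_file.is_file():
+        return
+    for line in env_file.read_text(encoding="utf-8").splitlines():
+        line = line.strip()
+        if not line or line.startswith("#") or "=" not in line:
+            continue
+        key, _, value = line.partition("=")
+        # setdefault leaves an existing value (for example one set by systemd) in place.
+        os.environ.setdefault(key.strip(), value.strip().strip("'\""))
+```
+
+Because existing variables win, a systemd unit can set secrets through its own environment while the same `.env` file keeps working for local development.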
+
+### Environment Variables
+
+| Variable | Default | Purpose |
+|---|---|---|
+| `PALLAS_AGENTS_CONFIG` | `agents.yaml` | Override path to deployment config |
+
+---
+
+## Running Pallas
+
+### CLI
+
+```bash
+pallas                     # start all agents + registry
+pallas --agent jarvis      # start a single agent (no registry)
+python -m pallas.server    # equivalent to `pallas`
+```
+
+### Startup Sequence
+
+**All agents mode** (`pallas`):
+
+1. Load `agents.yaml`, build agents table and dependency graph
+2. **Preflight** — register unknown models with `ModelDatabase`, validate LLM provider API keys and model availability
+3. Start the registry server on `registry_port`
+4. Start **subagents** (agents listed in other agents' `depends_on`)
+5. Wait for each subagent to become ready (HTTP probe on `/mcp`, 60s timeout)
+6. Start **top-level agents** (everything not a subagent)
+7. All servers run concurrently via `asyncio.gather`
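+
+Steps 4 to 6 above hinge on two ideas: splitting agents into subagents and top-level agents, and polling until a subagent is ready. The sketch below shows one way to express both. The function names, the injected `probe` callable, and the 0.5s polling interval are assumptions; only the 60s timeout comes from the sequence above.
+
+```python
+import asyncio
+from collections.abc import Awaitable, Callable
+
+
+def partition_agents(agents: dict[str, dict]) -> tuple[list[str], list[str]]:
+    """Return (subagents, top_level): subagents appear in some agent's depends_on."""
+    needed = {dep for spec in agents.values() for dep in spec.get("depends_on", [])}
+    subagents = [name for name in agents if name in needed]
+    top_level = [name for name in agents if name not in needed]
+    return subagents, top_level
+
+
+async def wait_until_ready(
+    probe: Callable[[], Awaitable[bool]], timeout: float = 60.0, interval: float = 0.5
+) -> bool:
+    """Poll probe() until it returns True or the timeout elapses."""
+    loop = asyncio.get_running_loop()
+    deadline = loop.time() + timeout
+    while loop.time() < deadline:
+        if await probe():
+            return True
+        await asyncio.sleep(interval)
+    return False
+
+
+subagents, top_level = partition_agents({
+    "jarvis": {"depends_on": ["research"]},
+    "research": {},
+})
+```
+
+In a real deployment `probe` would issue the HTTP request against the subagent's `/mcp` endpoint; injecting it keeps the waiting logic independent of any HTTP client.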
+
+**Single agent mode** (`pallas --agent <name>`):
+
+1. Load `agents.yaml`
+2. Preflight
+3. Start the named agent (no registry, no dependency resolution)
+
+### Per-Agent Startup
+
+For each agent:
+
+1. Import the agent module (the `module` path from `agents.yaml`, e.g. `agents.jarvis`) and obtain its `fast` instance
+2. Enter `fast.run()` context — initialises the fast-agent runtime
+3. Create a `MultimodalAgentMCPServer` wrapping the primary agent instance
+4. Resolve downstream MCP server configs from the fast-agent configuration
+5. Warn if any downstream auth headers reference unset environment variables
+6. Register the `get_health` MCP tool with downstream server info
+7. Bind to `0.0.0.0:<port>` and serve StreamableHTTP
+
+---
+
+## Daedalus Integration
+
+This section describes the contract from Pallas's perspective. The full client-side specification is in `docs/pallas_integration.md`.
+
+### Registration Flow
+
+1. Daedalus stores a registry URL (e.g. `http://puck.incus:23030`)
+2. Fetches `GET {url}/.well-known/mcp/server.json`
+3. Discovers all agents with their MCP endpoint URLs, titles, and descriptions
+4. Creates connections to each agent
+
+### Health Polling
+
+Daedalus calls `get_health` on each connected agent at a configurable interval (default 60s). The response maps to UI indicators:
+
+| `status` | Daedalus behaviour |
+|---|---|
+| `ok` | Green badge, normal operation |
+| `degraded` | Yellow badge + warning banner showing `message`. Chat allowed. |
+| `error` | Red badge. Chat disabled. |
+
+### Progress Notifications
+
+Long-running agent tool calls (agentic loops, sub-agent delegation) emit MCP `notifications/progress` on the SSE stream. Daedalus must include a `progressToken` in the `_meta` of `tools/call` requests to opt in:
+
+```python
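+from uuid import uuid4
+
+# `session` is assumed to be an initialised MCP client session for the target agent.
+result = await session.call_tool(
+    "send_message",
+    arguments={"message": user_input},
+    request_params={"_meta": {"progressToken": str(uuid4())}},
+)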
+```
+
+Progress notification fields:
+
+| Field | Description |
+|---|---|
+| `progressToken` | Matches the token sent in the request |
+| `progress` | Monotonically increasing step counter |
+| `total` | `null` = indeterminate (loop in progress), `1.0` = sub-task finished |
+| `message` | Status text: `{server}/{tool}: started\|completed\|failed` or `{agent} step N (llm\|tool)` |
+
+Without a `progressToken`, Pallas skips all progress notifications and the client receives nothing until the final result.
+
+### Chat Blocking
+
+If the target agent's cached health is `error`, Daedalus returns HTTP 503 and disables the message input. `degraded` shows a warning but allows chat.
+
+---
+
+## Registry Server
+
+### Endpoint
+
+```
+GET http://{host}:{registry_port}/.well-known/mcp/server.json
+```
+
+Plain HTTP — not MCP. No authentication. Returns `application/json`.
+
+### Response Structure
+
+Built dynamically from `agents.yaml` + `fastagent.config.yaml`:
+
+```json
+{
+  "servers": [
+    {
+      "server": {
+        "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
+        "name": "com.example.project/jarvis",
+        "title": "Jarvis",
+        "description": "My assistant",
+        "version": "1.0.0",
+        "remotes": [
+          { "type": "streamable-http", "url": "http://my-host.example.com:8201/mcp" }
+        ],
+        "capabilities": {
+          "model": "my-model-name",
+          "vision": false,
+          "context_window": 200000,
+          "max_output_tokens": 32000
+        }
+      },
+      "_meta": {
+        "io.modelcontextprotocol.registry/official": {
+          "status": "active",
+          "updatedAt": "2026-01-01T00:00:00Z",
+          "isLatest": true
+        }
+      }
+    }
+  ]
+}
+```
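+
+A client can reduce this response to a simple name-to-URL map. The sketch below is one possible way to do that, assuming the structure shown above; `list_agent_endpoints` is a hypothetical helper and the sample payload is trimmed to the fields it reads.
+
+```python
+import json
+
+
+def list_agent_endpoints(payload: dict) -> dict[str, str]:
+    """Map each registry server name to its first streamable-http URL."""
+    endpoints = {}
+    for entry in payload.get("servers", []):
+        server = entry["server"]
+        for remote in server.get("remotes", []):
+            if remote.get("type") == "streamable-http":
+                endpoints[server["name"]] = remote["url"]
+                break
+    return endpoints
+
+
+sample = json.loads("""
+{"servers": [{"server": {
+    "name": "com.example.project/jarvis",
+    "remotes": [{"type": "streamable-http", "url": "http://my-host.example.com:8201/mcp"}]
+}}]}
+""")
+endpoints = list_agent_endpoints(sample)
+```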
+
+### Registry Name Construction
+
+`{namespace}/{slug}` — where `slug` is the agent key with underscores replaced by hyphens. Example: namespace `com.example.project` + agent key `tech_research` → `com.example.project/tech-research`.
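+
+The naming rule is small enough to express directly. `registry_name` is a hypothetical helper written for illustration; it applies only the underscore-to-hyphen substitution described above.
+
+```python
+def registry_name(namespace: str, agent_key: str) -> str:
+    """Build '{namespace}/{slug}', with underscores in the key replaced by hyphens."""
+    return f"{namespace}/{agent_key.replace('_', '-')}"
+
+
+name = registry_name("com.example.project", "tech_research")
+```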
+
+### Capabilities
+
+If `model_capabilities` is defined in `fastagent.config.yaml`, each registry entry includes a `capabilities` object with model name, vision support, context window, and max output tokens. This allows clients to make informed decisions about what an agent can handle.
+
+---
+
+## Multimodal Support
+
+`MultimodalAgentMCPServer` extends fast-agent's `AgentMCPServer` with image attachment support.
+
+### `send_message` Tool
+
+Each agent's MCP tool accepts:
+
+| Parameter | Type | Required | Description |
+|---|---|---|---|
+| `message` | `str` | yes | Text message to the agent |
+| `images` | `list[dict]` | no | Base64-encoded images: `[{"data": "...", "mime_type": "image/png"}]` |
+
+When `images` is provided, the message is sent as a `PromptMessageExtended` containing both `TextContent` and `ImageContent` parts — the agent's underlying model must support vision.
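+
+On the client side, the `images` parameter can be assembled from raw bytes as sketched below. `build_send_message_args` is a hypothetical helper; it only builds the `arguments` dict in the shape given in the table and does not perform the MCP call.
+
+```python
+import base64
+
+
+def build_send_message_args(message: str, images: list[tuple[bytes, str]] | None = None) -> dict:
+    """Build the `arguments` dict for a send_message tool call."""
+    arguments: dict = {"message": message}
+    if images:
+        arguments["images"] = [
+            {"data": base64.b64encode(raw).decode("ascii"), "mime_type": mime}
+            for raw, mime in images
+        ]
+    return arguments
+
+
+args = build_send_message_args("What is in this picture?", [(b"\x89PNG", "image/png")])
+```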
+
+### Conversation History Prompt
+
+For agents with `instance_scope != "request"`, a `{agent}_history` prompt is registered that returns the full conversation history as FastMCP `Message` objects. This allows clients to retrieve the stored context.
+
+### Bearer Token Propagation
+
+The server captures the authenticated bearer token from the incoming MCP request and propagates it via `request_bearer_token` context variable to downstream calls.
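+
+The general pattern of carrying a per-request token in a context variable looks roughly like the sketch below. It reuses the `request_bearer_token` name from the text, but the default value, the `Authorization` header shape, and the helper functions are assumptions rather than Pallas's actual code.
+
+```python
+from contextvars import ContextVar
+
+# Same name as the context variable mentioned above; the None default is an assumption.
+request_bearer_token: ContextVar[str | None] = ContextVar("request_bearer_token", default=None)
+
+
+def downstream_headers() -> dict[str, str]:
+    """Build auth headers for a downstream call from the current request's token."""
+    token = request_bearer_token.get()
+    return {"Authorization": f"Bearer {token}"} if token else {}
+
+
+def handle_request(incoming_token: str | None) -> dict[str, str]:
+    """Set the token for the duration of one request, then restore the old value."""
+    reset = request_bearer_token.set(incoming_token)
+    try:
+        return downstream_headers()
+    finally:
+        request_bearer_token.reset(reset)
+```
+
+A context variable is a natural fit here because each asyncio task sees its own value, so concurrent requests do not leak tokens into one another.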
+
+---
+
+## Health System
+
+Two-layer health checking: **startup preflight** validates LLM providers before agents launch, and a **runtime `get_health` tool** reports ongoing status.
+
+### Startup Preflight
+
+Runs once before any agents start. Validates all LLM providers that have API keys configured.
+
+| Provider | Active (default_model matches) | Key set, not active |
+|---|---|---|
+| **Anthropic** | `GET /v1/models/{model}` — confirms model exists and key is valid | `GET /v1/models/claude-sonnet-4-5` — verifies API access |
+| **OpenAI** | `GET {base_url}/models` — lists models, confirms configured model is present | `GET {base_url}/models` — lists available models |
+
+- **Warn-only** — never blocks startup. Agents start regardless.
+- **5-second timeout** per provider API call.
+- Loads `.env` before checking.
+
+### Runtime `get_health` Tool
+
+Registered on each agent's MCP server. Checks:
+
+1. **Downstream MCP servers** — sends an MCP `initialize` handshake to each server URL. Uses `initialize` because it is the only MCP method that works without a pre-established session. After success, sends `DELETE` with the returned `Mcp-Session-Id` to tear down the session cleanly. 3-second timeout.
+
+2. **Active LLM provider** — includes the preflight result for the provider that `default_model` points to. Only the active provider affects health status.
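+
+The probe in step 1 amounts to one JSON-RPC `initialize` request followed by a `DELETE`. The sketch below only builds those two requests with the standard library and does not send them. The protocol revision, client name, and URL are illustrative assumptions, and the real probe in `pallas.health` may be structured differently.
+
+```python
+import json
+import urllib.request
+
+
+def build_initialize_probe(url: str) -> urllib.request.Request:
+    """Build (but do not send) an MCP initialize request for a health probe."""
+    body = {
+        "jsonrpc": "2.0",
+        "id": 1,
+        "method": "initialize",
+        "params": {
+            "protocolVersion": "2025-03-26",  # assumed protocol revision
+            "capabilities": {},
+            "clientInfo": {"name": "health-probe", "version": "0.0.0"},  # hypothetical
+        },
+    }
+    return urllib.request.Request(
+        url,
+        data=json.dumps(body).encode("utf-8"),
+        method="POST",
+        headers={
+            "Content-Type": "application/json",
+            "Accept": "application/json, text/event-stream",
+        },
+    )
+
+
+def build_session_teardown(url: str, session_id: str) -> urllib.request.Request:
+    """Build the DELETE that ends the session opened by the probe."""
+    return urllib.request.Request(url, method="DELETE", headers={"Mcp-Session-Id": session_id})
+
+
+probe = build_initialize_probe("http://localhost:8000/mcp")  # hypothetical URL
+```
+
+Sending the request with `urllib.request.urlopen(probe, timeout=3)` would apply the 3-second limit described in step 1.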
+
+### Response Format
+
+```json
+{ "status": "ok", "timestamp": "2026-01-01T00:00:00Z" }
+```
+
+```json
+{
+  "status": "degraded",
+  "timestamp": "2026-01-01T00:00:00Z",
+  "message": "Unreachable: neo4j_cypher; LLM: openai: model 'bad-model' not found"
+}
+```
+
+| Status | Meaning |
+|---|---|
+| `ok` | All downstream servers reachable and active LLM provider healthy |
+| `degraded` | One or more downstream servers unreachable, or active LLM provider failed |
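+
+The aggregation implied by the table and the sample responses can be sketched as follows. `build_health_response` is a hypothetical function; the message layout mirrors the `degraded` example above, and joining several unreachable servers with a comma is an assumption.
+
+```python
+from datetime import datetime, timezone
+
+
+def build_health_response(unreachable: list[str], llm_error: str | None) -> dict[str, str]:
+    """Combine downstream probe results and the active LLM preflight result."""
+    problems = []
+    if unreachable:
+        problems.append("Unreachable: " + ", ".join(unreachable))
+    if llm_error:
+        problems.append("LLM: " + llm_error)
+    response = {
+        "status": "degraded" if problems else "ok",
+        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+    }
+    if problems:
+        response["message"] = "; ".join(problems)
+    return response
+
+
+healthy = build_health_response([], None)
+degraded = build_health_response(["neo4j_cypher"], "openai: model 'bad-model' not found")
+```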
+
+---
+
+## Model Registration
+
+At startup, Pallas registers any model that is missing from fast-agent's built-in `ModelDatabase`, using the explicit capability declarations from `fastagent.config.yaml`.
+
+The process:
+
+1. Read `default_model` and `model_capabilities` from config
+2. Extract the model name (portion after the provider prefix dot)
+3. Check if `ModelDatabase` already knows this model — if so, skip
+4. Register with `ModelDatabase.register_runtime_model_params()`:
+   - `vision: true` → multimodal tokenization (`QWEN_MULTIMODAL`)
+   - `vision: false` → text-only tokenization (`TEXT_ONLY`)
+   - `context_window` and `max_output_tokens` from config (with sensible defaults)
+
+This avoids the brittle pattern of inferring capabilities from model name substrings, which breaks for custom or fine-tuned models with non-standard names.
+
+---
+
+## Module Reference
+
+| Module | File | Purpose |
+|---|---|---|
+| `pallas.server` | `server.py` | CLI entry point (`pallas` command), configuration loading, agent lifecycle orchestration, dependency ordering, model registration |
+| `pallas.registry` | `registry.py` | Starlette app serving `GET /.well-known/mcp/server.json` — agent catalogue built from config |
+| `pallas.multimodal_server` | `multimodal_server.py` | `MultimodalAgentMCPServer` — extends `AgentMCPServer` with image support, conversation history prompts, bearer token propagation |
+| `pallas.health` | `health.py` | LLM provider preflight validation, downstream MCP server probing, `get_health` tool registration |