diff --git a/docs/pallas.md b/docs/pallas.md index 0033dde..d543a02 100644 --- a/docs/pallas.md +++ b/docs/pallas.md @@ -522,3 +522,82 @@ This avoids the brittle pattern of inferring capabilities from model name substr | `pallas.registry` | `registry.py` | Starlette app serving `GET /.well-known/mcp/server.json` — agent catalogue built from config | | `pallas.multimodal_server` | `multimodal_server.py` | `MultimodalAgentMCPServer` — extends `AgentMCPServer` with image support, conversation history prompts, bearer token propagation | | `pallas.health` | `health.py` | LLM provider preflight validation, downstream MCP server probing, `get_health` tool registration | +| `pallas.log` | `log.py` | JSON log configuration, third-party traceback capture, Rich-TUI-safe handler attachment | +| `pallas._fastagent_patch` | `_fastagent_patch.py` | Monkey-patches fast-agent at import time: per-request bearer forwarding via `httpx.Auth`, diagnostic trace-capture wrappers around `send_request` / `session.call_tool` / `_execute_on_server` | + +--- + +## Incidents & Lessons Learned + +The Pallas↔Mnemosyne bearer-forwarding rollout surfaced a chain of bugs that ranged from "obvious in hindsight" to "you have to go read the fast-agent source to see why". None of the individual symptoms pointed at the true cause — each had a plausible scapegoat — which is why the actual fix was to install structured diagnostics first and work the problem end-to-end. This section captures the findings so the next person to touch this code (likely future me) does not have to re-derive them. + +### 1. Per-request bearer across an `anyio.TaskGroup` boundary + +**Symptom.** Per-turn JWTs minted by Daedalus and sent as `Authorization: Bearer …` to Pallas never reached Mnemosyne; Mnemosyne saw either no `Authorization` header at all, or — worse, intermittently — a bearer from a *previous* turn against an unrelated workspace. + +**Cause.** fast-agent's `MCPConnectionManager` runs each downstream transport inside a long-lived `anyio.TaskGroup` created at manager startup. `TaskGroup.start_soon` snapshots the owner's `contextvars.Context` at spawn time, so any `request_bearer_token.set(…)` done in the request handler a few frames up is **invisible** to the transport task. The persistent connection additionally caches its handshake context — so the bearer observed on the *first* call (often empty during a health-probe-triggered warm-up) gets reused forever. + +**Why the first attempt didn't help.** We initially set the bearer via a `contextvars.ContextVar` and tried to have `_prepare_headers_and_auth` read it. It almost works — until any reconnect, retry, or persistent stream, at which point the cached snapshot wins. + +**Fix (`pallas._fastagent_patch`).** Maintain a process-wide `_pending_bearers: dict[int, str]` keyed by `id(server_config)`, guarded by a `threading.Lock`. `multimodal_server.send_message` calls `publish_bearer(cfg, token)` for every opted-in downstream *before* spawning any tool call; the patched `_prepare_headers_and_auth` pulls the token from the registry (ContextVar used as a fallback for non-persistent probe paths); a `finally` in the request handler calls `revoke_bearer(cfg)` to clear the entry. Per-request bearers therefore survive the task-group boundary without mutating any shared config object. + +**Bonus gotcha.** The opt-in was originally keyed off a custom `forward_inbound_auth: true` field on the server block, read via fast-agent's pydantic config model. Pydantic's nested-model validation silently **dropped unknown keys**, so the flag never appeared on the parsed config. Workaround: scan `fastagent.config.yaml` directly for the flag at module import time (`pallas._fastagent_patch._FORWARD_SERVERS`) rather than rely on the parsed config object. + +**Bonus gotcha 2.** `httpx` caches auth handshake headers on persistent connections. A plain mutation of `server_config.headers["Authorization"]` in the request handler only affects *new* connections. The forwarding patch works by providing a custom `httpx.Auth` subclass (`_DynamicBearerAuth`) that looks up the bearer on every request, not by mutating headers — this is why the override is `auth_flow` (the generic non-async flow), not `async_auth_flow`. + +### 2. `install()` idempotency shadowing newly-added patches + +**Symptom.** After adding two new diagnostic monkey-patches (`_patch_session_call_tool`, `_patch_execute_on_server`) and reinstalling `pallas-mcp` into the Kottos venv, the trace-capture records refused to appear in `pallas.log`. Four repro cycles, five log rotations, no evidence that the new code was running. + +**Cause.** `install()` had a single top-level guard on `_prepare_headers_and_auth._pallas_forward_patched`. Once the bearer-forwarding patch was applied on first import, every subsequent `install()` call returned early — skipping the *three* later `_patch_*()` helpers entirely. The patches were *present* in the installed file; they were never *executed*. + +**Lesson.** A shared idempotency guard at the top of an `install()`-style function is a liability as soon as the function grows past one patch. The fix (commit `082b611`) moves each patch's guard to a per-target sentinel attribute on the target method (`target._pallas_trace_patched = True`), checked inside each helper. `install()` now calls every helper unconditionally; duplicate installs are cheap and harmless. + +**Bonus gotcha.** `install()` runs at module-import time, which in Pallas happens *before* `pallas.log.setup_logging()` attaches the file handler. Any `logger.info("patch installed")` inside `install()` is emitted into the default handler and lost. "No 'patch installed' line in the log" is **not** evidence that the patch didn't install — only the runtime firing of the wrapper (e.g. `forward.applied …`) is a reliable presence marker. + +### 3. FastMCP `on_call_tool` context shape: `message.name`, not `message.params.name` + +**Symptom.** Once bearer forwarding worked, Harper's Mnemosyne tool calls came back to fast-agent as the literal string `"object NoneType can't be used in 'await' expression"`. The tool result was visible in the OpenAI request payload of the *next* turn as `{"role":"tool", "content":"object NoneType can't be used in 'await' expression"}`. No traceback anywhere in Pallas or Mnemosyne. + +**Cause.** `mnemosyne/mcp_server/auth.py:MCPAuthMiddleware._extract_tool_name` read `context.message.params.name`, but inside an `on_call_tool` hook FastMCP's `MiddlewareContext[CallToolRequestParams]` exposes `.name` and `.arguments` **directly on `context.message`** — the type parameter is already the params object. The extractor always returned `None`, which: + +- silently skipped the `_PUBLIC_TOOLS = {"get_health"}` bypass so even the public health probe went through JWT validation; and +- made the per-tool ACL `token.can_use_tool(None)` short-circuit. + +The `NoneType await` error string itself came from somewhere downstream of the middleware — the middleware still unconditionally `await`ed `call_next(context)`. The most likely path was `await self._tools.get(None)(...)` in the FastMCP dispatch (`None` lookup returns `None`, then `await None(...)` raises the TypeError). + +**Fix (mnemosyne commit `e0fa825`).** Read `context.message.name` directly; fall back to `message.params.name` only as a legacy safety net. Verified against fastmcp's own `Middleware.on_call_tool` signature (`MiddlewareContext[mt.CallToolRequestParams]`) and four independent docs examples. + +**Diagnostic helper.** The commit also added `_call_next_with_trace` around `await call_next(context)` so any future exception inside FastMCP dispatch is captured with a full `logger.exception` traceback before propagating — and so the *success* path logs the result type, which doubles as a canary for "the middleware actually ran". + +### 4. Rich-TUI corruption by DEBUG-level third-party loggers + +**Symptom.** `fast-agent go` in an interactive session was unusable: massive blobs of plain-text `DEBUG:openai._base_client:Sending HTTP Request: …` and `DEBUG:sse_starlette.sse:chunk: …` lines splattered over the Rich chat UI on every redraw. + +**Cause.** Two layers stacked up: + +- Pallas's original `setup_logging()` set the **root logger** to whatever `logger.level` was configured. With `logger.level: debug` in `kottos/fastagent.config.yaml` (set intentionally for Pallas diagnostics), every third-party library inherited DEBUG and started emitting. +- Pallas attached a `StreamHandler(stream=sys.__stderr__)` to both root and `pallas` loggers so DEBUG records would "survive Rich's console takeover". This did solve the Rich-swallowing problem, but swapped it for a worse one: every library's DEBUG record now bypassed the Rich Live display and leaked through every TUI repaint. + +**Fix (commits `dde7d4f` + `89870f4`).** +- `PALLAS_LOG_STDERR` env var gates the stderr handler. Off by default. Interactive users get a clean TUI + rotating file sink; systemd/journal deployments set `PALLAS_LOG_STDERR=1`. +- Root-logger level is decoupled from Pallas's own level. Default: `max(configured_level, INFO)`. Pallas's `pallas.*` loggers still honour `logger.level: debug`, but third-party libraries stay at INFO unless `PALLAS_ROOT_LOG_LEVEL=DEBUG` is set explicitly. +- `openai`, `openai._base_client`, `anthropic`, `anthropic._base_client`, `sse_starlette`, `sse_starlette.sse`, `mcp`, `mcp.client`, `mcp.server`, `httpx`, `httpcore` pinned at WARNING individually — belt-and-braces against any future re-enablement of root DEBUG. + +### 5. Logging configuration knobs (current state) + +| Env var / config | Default | Effect | +|---|---|---| +| `PALLAS_LOG_LEVEL` | `INFO` | Level for the `pallas.*` logger tree and the rotating file sink | +| `fastagent.config.yaml` `logger.level` | fallback for `PALLAS_LOG_LEVEL` | Unified knob — flipping fast-agent's level also flips Pallas's diagnostic level | +| `PALLAS_ROOT_LOG_LEVEL` | `max(pallas_level, INFO)` | Level for the root logger (controls third-party library output). Rarely needs to be changed. | +| `PALLAS_LOG_STDERR` | unset (off) | Attach a JSON `StreamHandler` to `sys.__stderr__`. Enable for systemd/journal; leave off in Rich TUI sessions. | +| `PALLAS_LOG_FILE` | `~/.local/state/pallas/pallas.log` | Rotating JSON log file. 10 MB × 5 backups. | + +The **rotating file sink is always on**. It's what catches tracebacks from fast-agent, fastmcp, the MCP SDK, and our own trace wrappers regardless of how Rich is interacting with the terminal. Tail with `jq` for structured access: + +```bash +tail -n 100 -f ~/.local/state/pallas/pallas.log | jq -r '"\(.time) \(.level) \(.logger) \(.message)"' +``` + +When diagnosing a downstream-MCP issue, `grep pallas.forward.trace` in that file: any uncaught exception inside `send_request`, `session.call_tool`, or `_execute_on_server` appears there with full traceback, even when fast-agent's aggregator turns it into a terse `CallToolResult(isError=True)` by the time the agent loop sees it.