docs(pallas): document sampling parameters and Prometheus metrics

Add two new sections to the Pallas documentation:

- Sampling parameters: explain that temperature/top_p/top_k are configured via the fast-agent decorator's `request_params`, with a provider support matrix and a note on Claude Opus 4.7 stripping these params in favor of `output_config.effort`.
- Metrics: document the Prometheus `/metrics` endpoint exposed on the registry port, including scrape config, full metrics reference table, and notes on where each metric is captured.
2026-05-23 07:49:21 -04:00
parent 6fcdb509df
commit ca7d714a31
8 changed files with 545 additions and 39 deletions
--- a/docs/pallas.md
+++ b/docs/pallas.md
@@ -216,6 +216,40 @@ model_capabilities:

 Capabilities are declared explicitly rather than inferred from model name — naming conventions vary across model families, making regex heuristics brittle. These values are both used to register unknown models with fast-agent's `ModelDatabase` and published in the registry response.

+### Sampling parameters (temperature, top_p, top_k)
+
+Sampling parameters are configured per-agent in the Python decorator, **not** in `agents.yaml` or `fastagent.config.yaml`. Pallas itself does no sampling-param handling — this is pure fast-agent decorator-side configuration.
+
+```python
+from fast_agent import FastAgent
+from fast_agent.types import RequestParams
+
+fast = FastAgent("Jeffrey", parse_cli_args=False)
+
+@fast.agent(
+    name="jeffrey",
+    instruction="...",
+    servers=[...],
+    request_params=RequestParams(temperature=0.6, top_p=0.9),
+)
+async def _jeffrey():
+    pass
+```
+
+Provider support varies:
+
+| Provider | temperature | top_p | top_k |
+|---|---|---|---|
+| OpenAI (native, Responses API) | yes | yes | no |
+| HuggingFace, OpenResponses (OpenAI-compatible) | yes | yes | yes (via `extra_body`) |
+| Google Gemini | yes | yes | yes |
+| Bedrock | yes | yes (most models) | varies |
+| **Anthropic Claude Opus 4.7** | **no** | **no** | **no** |
+
+Anthropic's 4.7 design moves away from low-level numeric dials toward adaptive control — fast-agent's Anthropic provider explicitly strips temperature/top_p/top_k for Opus 4.7 with a warning (see `fast_agent/llm/provider/anthropic/llm_anthropic.py:1776-1786`). On Opus 4.7, use `output_config.effort` (verbosity, including the new `xhigh` level between `high` and `max`) instead.
+
+Setting `request_params` on an Anthropic-Opus-4.7 agent is a safe no-op: the params are stripped there and take effect automatically once the agent is routed to a model that accepts them.
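
As a rough illustration of the support matrix above, the following sketch drops sampling parameters that a provider does not accept. It is not fast-agent's implementation: `SUPPORTED_SAMPLING` and `filter_sampling_params` are hypothetical names, the provider keys are labels made up for this example, and Bedrock is omitted because its support varies by model.

```python
from typing import Dict, FrozenSet

# Hypothetical lookup mirroring the table above. The keys are illustrative
# labels, not fast-agent provider identifiers.
SUPPORTED_SAMPLING: Dict[str, FrozenSet[str]] = {
    "openai": frozenset({"temperature", "top_p"}),
    "openai_compatible": frozenset({"temperature", "top_p", "top_k"}),
    "google": frozenset({"temperature", "top_p", "top_k"}),
    "anthropic_opus_4_7": frozenset(),  # all three are stripped
}


def filter_sampling_params(provider: str, params: Dict[str, float]) -> Dict[str, float]:
    """Return only the sampling params that the given provider accepts."""
    allowed = SUPPORTED_SAMPLING.get(provider, frozenset())
    return {name: value for name, value in params.items() if name in allowed}


requested = {"temperature": 0.6, "top_p": 0.9, "top_k": 40}
print(filter_sampling_params("openai", requested))
print(filter_sampling_params("anthropic_opus_4_7", requested))
```

The same `requested` dictionary can be attached to every agent; only the subset a provider understands survives the filter, which is the behaviour the no-op note describes.
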
+
 ### `fastagent.secrets.yaml`

 ```yaml
@@ -496,6 +530,95 @@ Registered on each agent's MCP server. Checks:

 ---

+## Metrics
+
+Pallas exposes Prometheus metrics for scraping and alerting. One scrape target per Pallas deployment is sufficient — all agents run as coroutines in a single process under `asyncio.gather`, so metrics are process-global.
+
+### Endpoint
+
+```
+GET {host}:{registry_port}/metrics
+```
+
+Plain HTTP, unauthenticated, served by the same Starlette app that hosts the registry. Returns Prometheus text exposition format (`text/plain; version=0.0.4`).
+
+The same metrics snapshot is also available on each agent's own port at `{host}:{agent_port}/metrics`. Scraping the registry endpoint is the recommended default; the per-agent endpoints exist for cases where a load balancer terminates per-backend.
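
To make the response format concrete, here is a minimal sketch that parses a few lines of Prometheus text exposition output with the standard library only. The sample payload is invented for illustration (the values are not real Pallas output), and `parse_exposition` is a hypothetical helper that handles only simple `name{labels} value` lines without timestamps or escaped characters in label values. Real consumers should rely on a proper client library.

```python
import re
from typing import Dict, List, Tuple

# Invented sample in text exposition format; values are illustrative only.
SAMPLE = """\
# HELP pallas_up 1 while the Pallas process is running
# TYPE pallas_up gauge
pallas_up 1.0
pallas_send_message_total{agent="jeffrey",outcome="ok"} 42.0
pallas_send_message_total{agent="jeffrey",outcome="error"} 3.0
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> List[Sample]:
    """Parse sample lines into (metric name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_exposition(SAMPLE):
    print(name, labels, value)
```
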
+
+### Scrape Config
+
+```yaml
+scrape_configs:
+  - job_name: pallas
+    static_configs:
+      - targets: ['my-host.example.com:8200']    # registry_port
+        labels:
+          deployment: my-project
+```
+
+### Metrics Reference
+
+| Metric | Type | Labels | Description |
+|---|---|---|---|
+| `pallas_up` | gauge | — | `1` while the Pallas process is running |
+| `pallas_agent_info` | gauge | `agent`, `port` | `1` per configured agent — useful as a label join source |
+| `pallas_send_message_total` | counter | `agent`, `outcome` | `send_message` MCP calls. `outcome` ∈ `ok`/`error` |
+| `pallas_send_message_duration_seconds` | histogram | `agent` | End-to-end MCP `send_message` wall-clock duration |
+| `pallas_llm_turns_total` | counter | `agent`, `model` | LLM provider round-trips per agent/model |
+| `pallas_llm_tokens_total` | counter | `agent`, `model`, `kind` | Tokens consumed. `kind` ∈ `input`/`output`/`cache_read`/`cache_write`/`cache_hit`/`reasoning` |
+| `pallas_tool_calls_total` | counter | `agent`, `server`, `operation`, `outcome` | Downstream MCP operations dispatched by fast-agent's aggregator. `operation` is the fast-agent operation type (`tool`, `prompt`, `resource`, …); `outcome` ∈ `ok`/`error` |
+| `pallas_tool_call_duration_seconds` | histogram | `agent`, `server`, `operation` | Downstream MCP operation duration |
+| `pallas_downstream_up` | gauge | `agent`, `server` | `1` when the named downstream MCP server passed the last `get_health` probe |
+| `pallas_llm_provider_up` | gauge | `provider` | `1` when the active LLM provider passed its last preflight or runtime re-probe |
+| `pallas_agent_health_status` | gauge | `agent` | Aggregate from the last `get_health`: `1`=ok, `0.5`=degraded, `0`=error |
+
+Standard process metrics (RSS, CPU, GC, open FDs) are emitted by `prometheus-client`'s default collectors on the same endpoint.
+
+### Where the Numbers Come From
+
+- **send_message metrics** — captured around the MCP `send_message` handler in `pallas.multimodal_server`. The duration spans the full agentic loop, including all sub-agent and tool-call latency.
+- **LLM token metrics** — read from fast-agent's `UsageAccumulator` on the request-scoped agent instance *before disposal*. Each request's accumulator is fresh, so every recorded turn is genuinely new — no double-counting across requests.
+- **Downstream tool call metrics** — recorded in the `pallas._fastagent_patch` wrapper around `MCPAggregator._execute_on_server`. This catches every dispatch (tools, prompts, resources) and is independent of which downstream server it lands on. Failures still surface in the counter as `outcome="error"` and full tracebacks remain in `pallas.forward.trace` log records.
+- **Health gauges** — updated as a side effect of every `get_health` MCP call. Daedalus's polling cadence (default 60 s) therefore drives gauge freshness. The LLM gauge is also set at startup preflight and on the TTL re-probe inside `get_health`.
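
The `send_message` bullet above describes a common pattern: time a handler and record an outcome label whether it succeeds or raises. A minimal sketch of that pattern follows, using plain in-memory dictionaries instead of `prometheus-client` so that it runs without dependencies. `instrumented`, `CALLS`, and `DURATIONS` are hypothetical names, and this is not the code in `pallas.multimodal_server`.

```python
import asyncio
import time
from collections import defaultdict
from functools import wraps

# Stand-ins for a labelled counter and a duration histogram.
CALLS = defaultdict(int)       # (agent, outcome) -> number of calls
DURATIONS = defaultdict(list)  # agent -> observed durations in seconds


def instrumented(agent: str):
    """Wrap an async handler so every call records outcome and duration."""
    def decorator(handler):
        @wraps(handler)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "error"  # assume failure until the handler returns
            try:
                result = await handler(*args, **kwargs)
                outcome = "ok"
                return result
            finally:
                # Runs on success and on exception, so errors are counted too.
                CALLS[(agent, outcome)] += 1
                DURATIONS[agent].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("jeffrey")
async def send_message(text: str) -> str:
    """Toy handler: fails on empty input, otherwise echoes in upper case."""
    if not text:
        raise ValueError("empty message")
    return text.upper()


async def main() -> None:
    print(await send_message("hello"))
    try:
        await send_message("")
    except ValueError:
        pass  # the exception still propagates; the wrapper only observes it
    print(dict(CALLS))


asyncio.run(main())
```

Recording in a `finally` block keeps the measurement on the same code path for both outcomes, so a failing call still contributes a duration sample.
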
+
+### Useful Queries
+
+```promql
+# Error rate per agent
+sum by (agent) (rate(pallas_send_message_total{outcome="error"}[5m]))
+  / sum by (agent) (rate(pallas_send_message_total[5m]))
+
+# p95 send_message latency per agent
+histogram_quantile(0.95,
+  sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[5m]))
+)
+
+# Token throughput per model and kind (per-second rate, averaged over 1h)
+sum by (model, kind) (rate(pallas_llm_tokens_total[1h]))
+
+# Cache hit ratio (Anthropic)
+sum(rate(pallas_llm_tokens_total{kind="cache_read"}[5m]))
+  / sum(rate(pallas_llm_tokens_total{kind=~"input|cache_read|cache_write"}[5m]))
+
+# Any downstream MCP server unreachable
+min by (server) (pallas_downstream_up) == 0
+
+# Active LLM provider down
+pallas_llm_provider_up == 0
+```
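
For readers unfamiliar with how `histogram_quantile` turns bucket counters into a p95, here is a simplified sketch of the estimation for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. The bucket boundaries and counts are invented, `estimate_quantile` is a hypothetical helper, and it omits edge cases that Prometheus itself handles (for example native histograms and NaN inputs).

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket, whose count is the total
    number of observations, as in Prometheus `_bucket` series.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # report the last known finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts for upper bounds of 1 s, 5 s, 10 s and +Inf.
example = [(1.0, 50.0), (5.0, 90.0), (10.0, 98.0), (math.inf, 100.0)]
print(estimate_quantile(0.95, example))
print(estimate_quantile(0.50, example))
```

Because the estimate is interpolated within a bucket, its precision depends on how closely the bucket boundaries bracket the latencies of interest.
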
+
+### Suggested Alerts
+
+| Alert | Expression | Notes |
+|---|---|---|
+| Pallas process down | `up{job="pallas"} == 0` for 1m | Scrape failure |
+| Active LLM unreachable | `pallas_llm_provider_up == 0` for 5m | Preflight or TTL re-probe failing |
+| Downstream MCP unreachable | `pallas_downstream_up == 0` for 10m | Per-server; gauge updates on each `get_health` |
+| Agent error rate elevated | `sum by (agent) (rate(pallas_send_message_total{outcome="error"}[10m])) / sum by (agent) (rate(pallas_send_message_total[10m])) > 0.1` | >10% errors over 10 min |
+| Latency regression | `histogram_quantile(0.95, sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[10m]))) > 60` | p95 over 60 s |
+| Token burn | `sum(rate(pallas_llm_tokens_total{kind="output"}[1h])) > N` | Set N to your budget |
+
+---
+
 ## Model Registration

 Pallas registers models not in fast-agent's built-in `ModelDatabase` at startup, using the explicit capability declarations from `fastagent.config.yaml`.