docs(pallas): document sampling parameters and Prometheus metrics

Add two new sections to the Pallas documentation:

- Sampling parameters: explain that temperature/top_p/top_k are
  configured via the fast-agent decorator's `request_params`, with a
  provider support matrix and a note on Claude Opus 4.7 stripping these
  params in favor of `output_config.effort`.
- Metrics: document the Prometheus `/metrics` endpoint exposed on the
  registry port, including scrape config, full metrics reference table,
  and notes on where each metric is captured.
2026-05-23 07:49:21 -04:00
parent 6fcdb509df
commit ca7d714a31
8 changed files with 545 additions and 39 deletions

@@ -216,6 +216,40 @@ model_capabilities:
Capabilities are declared explicitly rather than inferred from model name — naming conventions vary across model families, making regex heuristics brittle. These values are both used to register unknown models with fast-agent's `ModelDatabase` and published in the registry response.
### Sampling parameters (temperature, top_p, top_k)
Sampling parameters are configured per-agent in the Python decorator, **not** in `agents.yaml` or `fastagent.config.yaml`. Pallas itself does no sampling-param handling — this is pure fast-agent decorator-side configuration.
```python
from fast_agent import FastAgent
from fast_agent.types import RequestParams

fast = FastAgent("Jeffrey", parse_cli_args=False)


@fast.agent(
    name="jeffrey",
    instruction="...",
    servers=[...],
    # Sampling parameters live here, not in agents.yaml.
    request_params=RequestParams(temperature=0.6, top_p=0.9),
)
async def _jeffrey():
    pass
```
Provider support varies:
| Provider | temperature | top_p | top_k |
|---|---|---|---|
| OpenAI (native, Responses API) | yes | yes | no |
| HuggingFace, OpenResponses (OpenAI-compatible) | yes | yes | yes (via `extra_body`) |
| Google Gemini | yes | yes | yes |
| Bedrock | yes | yes (most models) | varies |
| **Anthropic Claude Opus 4.7** | **no** | **no** | **no** |
Anthropic's 4.7 design moves away from low-level numeric dials toward adaptive control — fast-agent's Anthropic provider explicitly strips temperature/top_p/top_k for Opus 4.7 with a warning (see `fast_agent/llm/provider/anthropic/llm_anthropic.py:1776-1786`). On Opus 4.7, use `output_config.effort` (verbosity, including the new `xhigh` level between `high` and `max`) instead.
Setting `request_params` on an Opus 4.7 agent is a safe no-op: fast-agent drops the values with a warning, and they take effect automatically once the agent is routed to a model that accepts them.
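The support matrix above can be mirrored as a small lookup to preview which sampling fields a given provider would keep. This is a hypothetical sketch: the `SUPPORT` table, its provider keys, and `effective_params` are illustrative names, not part of fast-agent or Pallas, and the real filtering happens inside fast-agent's provider code. The Bedrock row is left out because its support varies by model.

```python
# Hypothetical lookup mirroring the table above; the keys are illustrative labels.
SUPPORT = {
    "openai": {"temperature", "top_p"},
    "openai-compatible": {"temperature", "top_p", "top_k"},
    "google": {"temperature", "top_p", "top_k"},
    "anthropic-opus-4.7": set(),
}


def effective_params(provider: str, params: dict) -> dict:
    """Return only the sampling params the provider accepts per the table."""
    allowed = SUPPORT[provider]
    return {k: v for k, v in params.items() if k in allowed and v is not None}


requested = {"temperature": 0.6, "top_p": 0.9, "top_k": 40}
print(effective_params("openai", requested))
print(effective_params("anthropic-opus-4.7", requested))
```

A lookup like this is only a documentation aid; the decorator's `request_params` stays the single place where the values are set.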
### `fastagent.secrets.yaml`
```yaml
@@ -496,6 +530,95 @@ Registered on each agent's MCP server. Checks:
---
## Metrics
Pallas exposes Prometheus metrics for scraping and alerting. One scrape target per Pallas deployment is sufficient — all agents run as coroutines in a single process under `asyncio.gather`, so metrics are process-global.
### Endpoint
```
GET {host}:{registry_port}/metrics
```
Plain HTTP, unauthenticated, served by the same Starlette app that hosts the registry. Returns Prometheus text exposition format (`text/plain; version=0.0.4`).
The same metrics snapshot is also available on each agent's own port at `{host}:{agent_port}/metrics`. Scraping the registry endpoint is the recommended default; the per-agent endpoints exist for cases where a load balancer terminates per-backend.
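For a quick check without a Prometheus server, the endpoint can be read with the standard library. The following is a minimal sketch, not a full parser for the exposition format: `parse_metrics` and `fetch_metrics` are hypothetical helper names, escaped characters in label values are left as-is, and the `port="8201"` value in the sample text is made up for illustration.

```python
import re
import urllib.request

# Matches `name{labels} value` or `name value`; a trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(_LABEL.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url: str) -> list[tuple[str, dict, float]]:
    """Fetch and parse a /metrics endpoint over plain HTTP."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative payload; a live call would be fetch_metrics("http://host:port/metrics").
sample = '''# TYPE pallas_up gauge
pallas_up 1.0
pallas_agent_info{agent="jeffrey",port="8201"} 1.0
'''
print(parse_metrics(sample))
```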
### Scrape Config
```yaml
scrape_configs:
  - job_name: pallas
    static_configs:
      - targets: ['my-host.example.com:8200']  # registry_port
        labels:
          deployment: my-project
```
### Metrics Reference
| Metric | Type | Labels | Description |
|---|---|---|---|
| `pallas_up` | gauge | — | `1` while the Pallas process is running |
| `pallas_agent_info` | gauge | `agent`, `port` | `1` per configured agent — useful as a label join source |
| `pallas_send_message_total` | counter | `agent`, `outcome` | `send_message` MCP calls. `outcome` is `ok` or `error` |
| `pallas_send_message_duration_seconds` | histogram | `agent` | End-to-end MCP `send_message` wall-clock duration |
| `pallas_llm_turns_total` | counter | `agent`, `model` | LLM provider round-trips per agent/model |
| `pallas_llm_tokens_total` | counter | `agent`, `model`, `kind` | Tokens consumed. `kind` is one of `input`, `output`, `cache_read`, `cache_write`, `cache_hit`, `reasoning` |
| `pallas_tool_calls_total` | counter | `agent`, `server`, `operation`, `outcome` | Downstream MCP operations dispatched by fast-agent's aggregator. `operation` is the fast-agent operation type (`tool`, `prompt`, `resource`, …); `outcome` is `ok` or `error` |
| `pallas_tool_call_duration_seconds` | histogram | `agent`, `server`, `operation` | Downstream MCP operation duration |
| `pallas_downstream_up` | gauge | `agent`, `server` | `1` when the named downstream MCP server passed the last `get_health` probe |
| `pallas_llm_provider_up` | gauge | `provider` | `1` when the active LLM provider passed its last preflight or runtime re-probe |
| `pallas_agent_health_status` | gauge | `agent` | Aggregate from the last `get_health`: `1`=ok, `0.5`=degraded, `0`=error |
Standard process metrics (RSS, CPU, GC, open FDs) are emitted by `prometheus-client`'s default collectors on the same endpoint.
### Where the Numbers Come From
- **send_message metrics** — captured around the MCP `send_message` handler in `pallas.multimodal_server`. The duration spans the full agentic loop, including all sub-agent and tool-call latency.
- **LLM token metrics** — read from fast-agent's `UsageAccumulator` on the request-scoped agent instance *before disposal*. Each request's accumulator is fresh, so every recorded turn is genuinely new — no double-counting across requests.
- **Downstream tool call metrics** — recorded in the `pallas._fastagent_patch` wrapper around `MCPAggregator._execute_on_server`. This catches every dispatch (tools, prompts, resources) and is independent of which downstream server it lands on. Failures still surface in the counter as `outcome="error"` and full tracebacks remain in `pallas.forward.trace` log records.
- **Health gauges** — updated as a side effect of every `get_health` MCP call. Daedalus's polling cadence (default 60 s) therefore drives gauge freshness. The LLM gauge is also set at startup preflight and on the TTL re-probe inside `get_health`.
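The capture pattern described for `send_message` and downstream dispatches (count every call by outcome, time the full call, and let failures propagate) can be sketched as a decorator. This is an assumption-laden illustration, not the code in `pallas.multimodal_server` or `pallas._fastagent_patch`: `instrumented`, `calls_total`, and `durations` are hypothetical names, and plain in-memory containers stand in for the `prometheus-client` counter and histogram.

```python
import asyncio
import functools
import time
from collections import Counter, defaultdict

# In-memory stand-ins for a labelled counter and a duration histogram.
calls_total = Counter()
durations = defaultdict(list)


def instrumented(agent: str):
    """Wrap an async handler, recording outcome and wall-clock duration."""
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "ok"
            try:
                return await handler(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise  # the failure is counted, then propagates unchanged
            finally:
                calls_total[(agent, outcome)] += 1
                durations[agent].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("jeffrey")
async def send_message(text: str) -> str:
    # Toy handler standing in for the real agentic loop.
    if not text:
        raise ValueError("empty message")
    return text.upper()


async def main() -> None:
    print(await send_message("hello"))
    try:
        await send_message("")
    except ValueError:
        pass
    print(dict(calls_total))


asyncio.run(main())
```

Recording in `finally` means the duration covers both successful and failed calls, which matches the end-to-end framing of the duration metrics.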
### Useful Queries
```promql
# Error rate per agent
sum by (agent) (rate(pallas_send_message_total{outcome="error"}[5m]))
/ sum by (agent) (rate(pallas_send_message_total[5m]))
# p95 send_message latency per agent
histogram_quantile(0.95,
sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[5m]))
)
# Token consumption rate per model and kind (tokens/s, averaged over 1h)
sum by (model, kind) (rate(pallas_llm_tokens_total[1h]))
# Cache hit ratio (Anthropic)
sum(rate(pallas_llm_tokens_total{kind="cache_read"}[5m]))
/ sum(rate(pallas_llm_tokens_total{kind=~"input|cache_read|cache_write"}[5m]))
# Any downstream MCP server unreachable
min by (server) (pallas_downstream_up) == 0
# Active LLM provider down
pallas_llm_provider_up == 0
```
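To make the p95 query less opaque, here is a simplified Python sketch of the linear interpolation that `histogram_quantile` performs over cumulative buckets. It assumes sorted buckets with positive bounds that end in `+Inf`, and it omits edge cases the real function handles; the bucket boundaries and counts below are invented for illustration and are not Pallas's actual bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Past the last finite bucket, or nothing to interpolate over.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Illustrative cumulative counts for buckets le=1, 5, 30, +Inf (seconds).
buckets = [(1.0, 50.0), (5.0, 90.0), (30.0, 98.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, buckets))
print(histogram_quantile(0.5, buckets))
```

Because the estimate is interpolated inside a bucket, its precision depends on how finely the buckets are spaced around the latencies of interest.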
### Suggested Alerts
| Alert | Expression | Notes |
|---|---|---|
| Pallas process down | `up{job="pallas"} == 0` for 1m | Scrape failure |
| Active LLM unreachable | `pallas_llm_provider_up == 0` for 5m | Preflight or TTL re-probe failing |
| Downstream MCP unreachable | `pallas_downstream_up == 0` for 10m | Per-server; gauge updates on each `get_health` |
| Agent error rate elevated | `sum by (agent) (rate(pallas_send_message_total{outcome="error"}[10m])) / sum by (agent) (rate(pallas_send_message_total[10m])) > 0.1` | >10% errors over 10 min |
| Latency regression | `histogram_quantile(0.95, sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[10m]))) > 60` | p95 over 60 s |
| Token burn | `sum(rate(pallas_llm_tokens_total{kind="output"}[1h])) > N` | N is in output tokens per second; derive it from your budget |
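A percentage threshold such as ">10% errors" is a ratio of two rates, which is not the same as a threshold on the error rate alone. The arithmetic below is a small worked example with made-up traffic numbers; `error_ratio` is a hypothetical helper, and unlike PromQL (where dividing by zero yields `NaN`) it returns `0.0` when there is no traffic.

```python
def error_ratio(errors_per_s: float, total_per_s: float) -> float:
    """Fraction of calls that failed; 0.0 when there was no traffic."""
    return errors_per_s / total_per_s if total_per_s > 0 else 0.0


# Quiet agent: half of all calls fail, yet the raw error rate is tiny.
quiet = error_ratio(errors_per_s=0.01, total_per_s=0.02)

# Busy agent: few calls fail, yet the raw error rate exceeds 0.1/s.
busy = error_ratio(errors_per_s=0.2, total_per_s=5.0)

print(f"quiet agent: ratio={quiet:.2f}, raw rate=0.01/s")
print(f"busy agent:  ratio={busy:.2f}, raw rate=0.20/s")
```

A raw-rate threshold of `0.1` would fire for the busy agent and stay silent for the quiet one, which is the opposite of what a 10% error budget intends.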
---
## Model Registration
Pallas registers models not in fast-agent's built-in `ModelDatabase` at startup, using the explicit capability declarations from `fastagent.config.yaml`.