docs(pallas): document sampling parameters and Prometheus metrics
Add two new sections to the Pallas documentation:

- Sampling parameters: explain that temperature/top_p/top_k are configured via the fast-agent decorator's `request_params`, with a provider support matrix and a note on Claude Opus 4.7 stripping these params in favor of `output_config.effort`.
- Metrics: document the Prometheus `/metrics` endpoint exposed on the registry port, including scrape config, full metrics reference table, and notes on where each metric is captured.
123
docs/pallas.md
@@ -216,6 +216,40 @@ model_capabilities:

Capabilities are declared explicitly rather than inferred from the model name, because naming conventions vary across model families and make regex heuristics brittle. These values are used both to register unknown models with fast-agent's `ModelDatabase` and to populate the registry response.

### Sampling parameters (temperature, top_p, top_k)

Sampling parameters are configured per agent in the Python decorator, **not** in `agents.yaml` or `fastagent.config.yaml`. Pallas itself does no sampling-parameter handling; this is purely fast-agent decorator-side configuration.

```python
from fast_agent import FastAgent
from fast_agent.types import RequestParams

fast = FastAgent("Jeffrey", parse_cli_args=False)

@fast.agent(
    name="jeffrey",
    instruction="...",
    servers=[...],
    request_params=RequestParams(temperature=0.6, top_p=0.9),
)
async def _jeffrey():
    pass
```

Provider support varies:

| Provider | temperature | top_p | top_k |
|---|---|---|---|
| OpenAI (native, Responses API) | yes | yes | no |
| HuggingFace, OpenResponses (OpenAI-compatible) | yes | yes | yes (via `extra_body`) |
| Google Gemini | yes | yes | yes |
| Bedrock | yes | yes (most models) | varies |
| **Anthropic Claude Opus 4.7** | **no** | **no** | **no** |

Anthropic's 4.7 design moves away from low-level numeric dials toward adaptive control: fast-agent's Anthropic provider explicitly strips temperature/top_p/top_k for Opus 4.7 and emits a warning (see `fast_agent/llm/provider/anthropic/llm_anthropic.py:1776-1786`). On Opus 4.7, use `output_config.effort` (verbosity, including the new `xhigh` level between `high` and `max`) instead.

Setting `request_params` on an agent running on Claude Opus 4.7 is a safe no-op: the parameters are stripped there and apply automatically the moment the agent is routed to a non-Anthropic model.

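The support table above can be read as a simple lookup. The following is a minimal sketch of that idea, not fast-agent's actual implementation: the provider keys, `SUPPORTED_PARAMS`, and `effective_params` are hypothetical names chosen for illustration, and Bedrock is left out because its support varies by model.

```python
# Hypothetical lookup mirroring the provider support table; not fast-agent code.
SUPPORTED_PARAMS = {
    "openai": {"temperature", "top_p"},
    "openai_compatible": {"temperature", "top_p", "top_k"},
    "gemini": {"temperature", "top_p", "top_k"},
    "anthropic_opus_4_7": set(),  # all three parameters are stripped
}


def effective_params(provider: str, requested: dict) -> dict:
    """Return only the sampling parameters the given provider would accept."""
    allowed = SUPPORTED_PARAMS.get(provider, set())
    return {key: value for key, value in requested.items() if key in allowed}


requested = {"temperature": 0.6, "top_p": 0.9, "top_k": 40}
print(effective_params("anthropic_opus_4_7", requested))  # {}
print(effective_params("openai", requested))  # {'temperature': 0.6, 'top_p': 0.9}
```

The same `requested` dictionary yields different effective settings per provider, which is why leaving `request_params` on an Opus 4.7 agent is harmless.
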
### `fastagent.secrets.yaml`
```yaml
@@ -496,6 +530,95 @@ Registered on each agent's MCP server. Checks:

---

## Metrics

Pallas exposes Prometheus metrics for scraping and alerting. One scrape target per Pallas deployment is sufficient: all agents run as coroutines in a single process under `asyncio.gather`, so metrics are process-global.

### Endpoint

```
GET {host}:{registry_port}/metrics
```

Plain HTTP, unauthenticated, served by the same Starlette app that hosts the registry. Returns Prometheus text exposition format (`text/plain; version=0.0.4`).

The same metrics snapshot is also available on each agent's own port at `{host}:{agent_port}/metrics`. Scraping the registry endpoint is the recommended default; the per-agent endpoints exist for cases where a load balancer terminates per-backend.

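To make the response format concrete, here is a minimal sketch that parses a hand-written sample of the text exposition format using only the standard library. The sample values and the `jeffrey` label are made up, and `parse_metrics` is a hypothetical helper: it ignores optional timestamps and escaped quotes in label values, so it is a simplification and not a full parser.

```python
import re

# Made-up sample of what a /metrics response body might look like.
SAMPLE = '''\
# HELP pallas_up 1 while the Pallas process is running
# TYPE pallas_up gauge
pallas_up 1.0
pallas_send_message_total{agent="jeffrey",outcome="ok"} 42.0
pallas_send_message_total{agent="jeffrey",outcome="error"} 3.0
'''

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. a line with a trailing timestamp; not handled here
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Against a live deployment, the text would come from an HTTP GET on the endpoint shown above instead of the `SAMPLE` string.
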
### Scrape Config

```yaml
scrape_configs:
  - job_name: pallas
    static_configs:
      - targets: ['my-host.example.com:8200']  # registry_port
        labels:
          deployment: my-project
```

### Metrics Reference

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pallas_up` | gauge | (none) | `1` while the Pallas process is running |
| `pallas_agent_info` | gauge | `agent`, `port` | `1` per configured agent; useful as a label join source |
| `pallas_send_message_total` | counter | `agent`, `outcome` | `send_message` MCP calls. `outcome` ∈ `ok`/`error` |
| `pallas_send_message_duration_seconds` | histogram | `agent` | End-to-end MCP `send_message` wall-clock duration |
| `pallas_llm_turns_total` | counter | `agent`, `model` | LLM provider round-trips per agent/model |
| `pallas_llm_tokens_total` | counter | `agent`, `model`, `kind` | Tokens consumed. `kind` ∈ `input`/`output`/`cache_read`/`cache_write`/`cache_hit`/`reasoning` |
| `pallas_tool_calls_total` | counter | `agent`, `server`, `operation`, `outcome` | Downstream MCP operations dispatched by fast-agent's aggregator. `operation` is the fast-agent operation type (`tool`, `prompt`, `resource`, …); `outcome` ∈ `ok`/`error` |
| `pallas_tool_call_duration_seconds` | histogram | `agent`, `server`, `operation` | Downstream MCP operation duration |
| `pallas_downstream_up` | gauge | `agent`, `server` | `1` when the named downstream MCP server passed the last `get_health` probe |
| `pallas_llm_provider_up` | gauge | `provider` | `1` when the active LLM provider passed its last preflight or runtime re-probe |
| `pallas_agent_health_status` | gauge | `agent` | Aggregate from the last `get_health`: `1`=ok, `0.5`=degraded, `0`=error |

Standard process metrics (RSS, CPU, GC, open FDs) are emitted by `prometheus-client`'s default collectors on the same endpoint.

### Where the Numbers Come From

- **send_message metrics**: captured around the MCP `send_message` handler in `pallas.multimodal_server`. The duration spans the full agentic loop, including all sub-agent and tool-call latency.
- **LLM token metrics**: read from fast-agent's `UsageAccumulator` on the request-scoped agent instance *before disposal*. Each request's accumulator is fresh, so every recorded turn is genuinely new and nothing is double-counted across requests.
- **Downstream tool call metrics**: recorded in the `pallas._fastagent_patch` wrapper around `MCPAggregator._execute_on_server`. This catches every dispatch (tools, prompts, resources) and is independent of which downstream server it lands on. Failures still surface in the counter as `outcome="error"`, and full tracebacks remain in `pallas.forward.trace` log records.
- **Health gauges**: updated as a side effect of every `get_health` MCP call. Daedalus's polling cadence (default 60 s) therefore drives gauge freshness. The LLM gauge is also set at startup preflight and on the TTL re-probe inside `get_health`.

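The wrapper pattern described for downstream tool calls can be sketched as follows. This is an illustration of the general technique only, not the code in `pallas._fastagent_patch`: `instrument`, `execute_on_server`, `CALLS`, and `DURATIONS` are hypothetical names, and plain dictionaries stand in for a real counter and histogram.

```python
import asyncio
import functools
import time
from collections import defaultdict

# Hypothetical in-memory stand-ins for a counter and a duration histogram.
CALLS = defaultdict(int)       # (server, operation, outcome) -> count
DURATIONS = defaultdict(list)  # (server, operation) -> [seconds, ...]


def instrument(func):
    """Wrap an async dispatch function to record outcome and wall-clock duration."""
    @functools.wraps(func)
    async def wrapper(server, operation, *args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return await func(server, operation, *args, **kwargs)
        except Exception:
            outcome = "error"
            raise  # the failure still propagates to the caller
        finally:
            DURATIONS[(server, operation)].append(time.perf_counter() - start)
            CALLS[(server, operation, outcome)] += 1
    return wrapper


@instrument
async def execute_on_server(server, operation, fail=False):
    # Placeholder for a real downstream dispatch.
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError("downstream failure")
    return "result"


async def main():
    await execute_on_server("search", "tool")
    try:
        await execute_on_server("search", "tool", fail=True)
    except RuntimeError:
        pass


asyncio.run(main())
print(dict(CALLS))
```

Recording in the `finally` branch means that both successful and failing dispatches contribute a duration sample, while the exception itself is re-raised unchanged.
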
### Useful Queries

```promql
# Error rate per agent
sum by (agent) (rate(pallas_send_message_total{outcome="error"}[5m]))
  / sum by (agent) (rate(pallas_send_message_total[5m]))

# p95 send_message latency per agent
histogram_quantile(0.95,
  sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[5m]))
)

# Token consumption rate per model and kind (tokens/s, averaged over 1h)
sum by (model, kind) (rate(pallas_llm_tokens_total[1h]))

# Cache hit ratio (Anthropic)
sum(rate(pallas_llm_tokens_total{kind="cache_read"}[5m]))
  / sum(rate(pallas_llm_tokens_total{kind=~"input|cache_read|cache_write"}[5m]))

# Any downstream MCP server unreachable
min by (server) (pallas_downstream_up) == 0

# Active LLM provider down
pallas_llm_provider_up == 0
```

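The cache hit ratio query divides cached reads by all input-side tokens. As a worked example of the same arithmetic, with made-up token counts and a hypothetical helper name:

```python
def cache_hit_ratio(input_tokens: float, cache_read: float, cache_write: float) -> float:
    """Share of input-side tokens that were served from the prompt cache."""
    total = input_tokens + cache_read + cache_write
    return cache_read / total if total else 0.0


# 7,500 cached reads out of 10,000 input-side tokens in the window.
print(cache_hit_ratio(input_tokens=2_000, cache_read=7_500, cache_write=500))  # 0.75
```
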
### Suggested Alerts

| Alert | Expression | Notes |
|---|---|---|
| Pallas process down | `up{job="pallas"} == 0` for 1m | Scrape failure |
| Active LLM unreachable | `pallas_llm_provider_up == 0` for 5m | Preflight or TTL re-probe failing |
| Downstream MCP unreachable | `pallas_downstream_up == 0` for 10m | Per-server; gauge updates on each `get_health` |
| Agent error rate elevated | `sum by (agent) (rate(pallas_send_message_total{outcome="error"}[10m])) / sum by (agent) (rate(pallas_send_message_total[10m])) > 0.1` | >10% errors over 10 min |
| Latency regression | `histogram_quantile(0.95, sum by (agent, le) (rate(pallas_send_message_duration_seconds_bucket[10m]))) > 60` | p95 over 60 s |
| Token burn | `sum(rate(pallas_llm_tokens_total{kind="output"}[1h])) > N` | Set N to your budget in output tokens per second |

---

## Model Registration

Pallas registers models not in fast-agent's built-in `ModelDatabase` at startup, using the explicit capability declarations from `fastagent.config.yaml`.