URGENT: Detect and break repeated-identical-tool-call loops in the agentic loop #1


r · 2026-06-15T15:30:25Z


Priority: URGENT

A small-model agent (Shawn, Iolaus, proteus.helu.ca) entered an infinite tool-call loop that ran until the Daedalus MCP send_message call timed out and returned empty_response to the web UI. There is currently no guard in the Pallas/fast-agent agentic loop to detect or break a model that repeats the same tool call indefinitely. This is the single most expensive failure mode we have: the loop consumes LLM turns and context (cost + latency) for the entire client timeout window and produces nothing.

Evidence (Loki, `{hostname="proteus.helu.ca"}`, 2026-06-15)

Reading one episode chronologically, the same cycle repeats ~1/sec:

shawn tool call - kairos__update_task
shawn tool result - text only 92 chars
Streaming complete - Model: Qwen3.6-27B-Q5_K_M, Input tokens: 41216, Output tokens: 33
shawn tool call - kairos__update_task        <-- identical call again
... Input tokens: 41295 ... 41374 ... 41453 ... 41532 ... 41611 ... 41690 ... 41769 ... 41848 ... 41927

- Output is a fixed 33 tokens every turn -> the model emits the identical kairos__update_task call each iteration.
- Input grows ~79 tokens/turn -> each loop appends the same ~92-char tool result, inflating context until the call hits its wall-clock timeout.
- The Daedalus UI recorded the call at step 31 as kairos__update_task (task_id=494) ... empty_response.
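
To make the growth figure concrete, here is a small check over the ten input-token counts quoted in the log excerpt above. The cumulative total is only an upper bound on prompt tokens processed, under the assumption (not confirmed by the logs) that the server re-reads the full context each turn with no prompt caching.

```python
# Input-token counts copied from the Loki excerpt above (ten consecutive turns).
input_tokens = [41216, 41295, 41374, 41453, 41532, 41611, 41690, 41769, 41848, 41927]

# Per-turn growth: difference between consecutive turns.
deltas = [later - earlier for earlier, later in zip(input_tokens, input_tokens[1:])]
print(deltas)             # [79, 79, 79, 79, 79, 79, 79, 79, 79]

# Total prompt tokens sent across these ten turns (assumes no prompt caching).
total_prompt_tokens = sum(input_tokens)
print(total_prompt_tokens)  # 415715
```

Ten turns that each produce only 33 output tokens still resend more than 400k prompt tokens in total, which is why the loop is expensive even though each individual turn looks cheap.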

The immediate trigger was a data inconsistency in Kairos task 494 ("Mnemosyne Deploy": status=COMPLETED but percent_complete=0), which the model tried to reconcile forever. That data bug is being fixed separately — but the loop itself must be broken by the runtime regardless of what triggers it. Any bad/contradictory tool result can induce this.

Proposed fix

Add loop-breaking to the agentic loop (pallas.multimodal_server.send_message / wherever the fast-agent loop is driven):

1. Repeated-identical-tool-call detection (primary). Track a rolling signature of (server, tool, normalized_arguments) -> result_hash per request. If the same signature produces the same result hash N times consecutively (suggest N=3), stop the loop and return a partial result with an injected system message explaining the tool call is not converging, so the model can change strategy or summarize instead of spinning.
2. Hard max-turns cap per send_message (defense in depth). A configurable ceiling (e.g. PALLAS_MAX_AGENT_TURNS, default ~40) that terminates the loop with a partial result rather than letting it run to the client timeout. Even with guard 1 in place, this bounds pathological non-identical loops.
3. Emit a metric when either guard fires (e.g. pallas_agent_loop_aborted_total{agent, reason="repeat"|"max_turns"}) so we can alert on it. Note: pallas_* metrics do not currently appear in the Taurus Prometheus (only daedalus_pallas_instances_total does), so the Pallas scrape target may also need to be wired up; see separate follow-up.
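
The two guards above could be sketched roughly as follows. This is a minimal illustration, not the Pallas or fast-agent API: `LoopGuard`, `observe`, and `_digest` are hypothetical names, the tool result is a placeholder string, and how the loop driver calls the guard and builds the partial result is left open. The returned reason string is meant to line up with the `reason` label of the proposed counter.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Optional


def _digest(value: Any) -> str:
    """Stable hash of a value; keys are sorted so argument order does not matter."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class LoopGuard:
    """Hypothetical per-request guard: create one instance per send_message call."""

    max_repeats: int = 3   # N from the proposal
    max_turns: int = 40    # e.g. read from PALLAS_MAX_AGENT_TURNS
    turns: int = 0
    _last: Optional[tuple] = field(default=None, init=False, repr=False)
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, server: str, tool: str, arguments: dict, result: Any) -> Optional[str]:
        """Record one completed tool call; return an abort reason or None to continue."""
        self.turns += 1
        signature = (server, tool, _digest(arguments), _digest(result))
        if signature == self._last:
            self._streak += 1          # same call, same result, back to back
        else:
            self._last, self._streak = signature, 1
        if self._streak >= self.max_repeats:
            return "repeat"
        if self.turns >= self.max_turns:
            return "max_turns"
        return None


# Replaying the failing pattern: the same call with the same (placeholder) result.
guard = LoopGuard()
reasons = [
    guard.observe("kairos", "update_task", {"task_id": 494}, "unchanged")
    for _ in range(3)
]
print(reasons)  # [None, None, 'repeat']
```

Hashing the result as well as the arguments means a tool that is legitimately polled with identical arguments but returns changing output would not trip the repeat guard; only the turn cap would bound it.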

Acceptance

- A model repeating the same tool call with the same result is stopped within N iterations and returns a partial answer instead of empty_response.
- The hard turn cap prevents any single send_message from running to the client timeout.
- Loop aborts are observable via logs and a counter metric.

Found while investigating Shawn timeouts on Taurus production (proteus.helu.ca). Filed by Claude Code on behalf of @r.


r referenced this issue from a commit

2026-06-15 15:56:29 +00:00

shawn: parameterize Neo4j Cypher examples and add correctness rules


Reference: r/pallas#1
