URGENT: Detect and break repeated-identical-tool-call loops in the agentic loop #1
Priority: URGENT
A small-model agent (Shawn, Iolaus, `proteus.helu.ca`) entered an infinite tool-call loop that ran until the Daedalus MCP `send_message` call timed out and returned `empty_response` to the web UI. There is currently no guard in the Pallas/fast-agent agentic loop to detect or break a model that repeats the same tool call indefinitely. This is the single most expensive failure mode we have: the loop consumes LLM turns and context (cost + latency) for the entire client timeout window and produces nothing.

Evidence (Loki, `{hostname="proteus.helu.ca"}`, 2026-06-15)

Reading one episode chronologically, the same cycle repeats ~1/sec:
- ... `kairos__update_task` call each iteration.
- `kairos__update_task` (task_id=494) ... `empty_response`.

The immediate trigger was a data inconsistency in Kairos task 494 ("Mnemosyne Deploy": `status=COMPLETED` but `percent_complete=0`), which the model tried to reconcile forever. That data bug is being fixed separately, but the loop itself must be broken by the runtime regardless of what triggers it. Any bad/contradictory tool result can induce this.

Proposed fix
Add loop-breaking to the agentic loop (`pallas.multimodal_server.send_message` / wherever the fast-agent loop is driven):

1. Track `(server, tool, normalized_arguments)` -> `result_hash` per request. If the same `(tool, args)` produces the same result hash N times consecutively (suggest N=3), stop the loop and return a partial result with an injected system message explaining the tool call is not converging, so the model can change strategy or summarize instead of spinning.
2. Cap the number of turns in `send_message` (defense in depth). A configurable ceiling (e.g. `PALLAS_MAX_AGENT_TURNS`, default ~40) that terminates the loop with a partial result rather than letting it run to the client timeout. Even with item 1 in place, this bounds pathological non-identical loops.
3. Emit a metric (`pallas_agent_loop_aborted_total{agent, reason="repeat"|"max_turns"}`) so we can alert on it. (Note: `pallas_*` metrics do not currently appear in the Taurus Prometheus, only `daedalus_pallas_instances_total`, so the Pallas scrape target may also need to be wired up; see separate follow-up.)

Acceptance
- ... `empty_response`.
- ... `send_message` from running to the client timeout.

Found while investigating Shawn timeouts on Taurus production (`proteus.helu.ca`). Filed by Claude Code on behalf of @r.
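
Sketch of the guard described under Proposed fix (illustrative only, not existing Pallas or fast-agent code). It shows one possible shape for the repeat-call detection and the turn ceiling: a per-request object that hashes each `(server, tool, normalized_arguments)` call together with its result and returns an abort reason matching the proposed metric labels. The `LoopGuard` class, the `observe` method, and the `_digest` helper are hypothetical names. The sketch also assumes that arguments and results are JSON-serializable and that one turn equals one tool call.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Optional, Tuple


def _digest(value: Any) -> str:
    """Stable hash of a JSON-like value; sorting keys makes key order irrelevant."""
    blob = json.dumps(value, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


@dataclass
class LoopGuard:
    """Hypothetical per-request guard; create one per send_message request."""

    max_repeats: int = 3  # N consecutive identical call+result pairs
    max_turns: int = 40   # could be read from PALLAS_MAX_AGENT_TURNS
    turns: int = 0
    _last: Optional[Tuple[str, str, str, str]] = field(default=None, init=False)
    _streak: int = field(default=0, init=False)

    def observe(self, server: str, tool: str, arguments: dict, result: Any) -> Optional[str]:
        """Record one tool call; return "repeat", "max_turns", or None to continue."""
        self.turns += 1
        key = (server, tool, _digest(arguments), _digest(result))
        if key == self._last:
            self._streak += 1
        else:
            # A different call or a different result resets the streak.
            self._last, self._streak = key, 1
        if self._streak >= self.max_repeats:
            return "repeat"
        if self.turns >= self.max_turns:
            return "max_turns"
        return None
```

The driving loop would call `observe` after each tool result and, on a non-`None` return, stop iterating, return the partial result with the injected system message, and increment the abort counter with that reason. Comparing only consecutive calls keeps the state to a single key per request; catching loops that alternate between two calls would need a small history window instead, and the turn ceiling covers that case in the meantime.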