koios/docs/tools/grafana.md
Robert Helewka 703b3402d4 docs(readme): update assistant roster, prompt layers, repo structure
2026-05-20 22:50:22 -04:00


Grafana

Metrics, logs, and dashboards.

  • MCP server name: grafana (Grafana MCP server; talks to the Grafana instance, which hosts Prometheus metrics, Loki logs, and dashboards)
  • Prompt snippet: prompts/tools/grafana.md

What It Is

Grafana is Scotty's observability tool. Through the MCP server, agents can query Prometheus metrics (PromQL), Loki logs (LogQL), and read dashboard configuration — all the things you'd otherwise click through the Grafana web UI to see.

This is the primary tool for "what changed?" and "what's wrong right now?" Without it, Scotty is guessing from fragments. With it, Scotty can see actual system state across time.

What It's Good For

  • Pulling logs during an incident — service logs, application logs, system logs (Loki)
  • Querying metrics — CPU, memory, request rates, error rates, latency percentiles (Prometheus)
  • Checking historical state — "how did this look an hour ago, before the deploy?"
  • Confirming a fix worked — was the metric actually restored after the intervention?
  • Capacity planning conversations — read trends, not guesses
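
Checking historical state and confirming a fix both come down to running the same query over two explicit windows. A minimal sketch, assuming parameters shaped like the range-query endpoint of the Prometheus HTTP API (query, start, end, step). The helper names, the metric http_requests_total, and the deploy timestamp are hypothetical, and how the Grafana MCP server exposes these parameters is not specified in this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple


def range_params(query: str, end: datetime, lookback: timedelta, step_s: int = 60) -> Dict:
    """Build explicit start/end/step parameters for a range query."""
    start = end - lookback
    return {
        "query": query,
        "start": start.timestamp(),  # Unix seconds
        "end": end.timestamp(),
        "step": f"{step_s}s",
    }


def before_and_after(query: str, event: datetime, window: timedelta) -> Tuple[Dict, Dict]:
    """Two equal windows: one ending at the event, one starting at it."""
    before = range_params(query, end=event, lookback=window)
    after = range_params(query, end=event + window, lookback=window)
    return before, after


# Hypothetical example: 5xx error rate in the hour before and after a deploy.
ERROR_RATE = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
deploy = datetime(2026, 5, 20, 14, 0, tzinfo=timezone.utc)
before, after = before_and_after(ERROR_RATE, deploy, timedelta(hours=1))
```

Running both parameter sets and comparing the resulting series gives a like-for-like answer to "was the metric actually restored?" instead of an impression from a single snapshot.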

What It's Not Good For

  • Mutating system state — Grafana reads; Kernos acts
  • Realtime tail-the-log-and-watch — Grafana is request/response; for live tailing, shell into the host via Kernos and use journalctl -f
  • Code-level debugging — Grafana shows symptoms; the cause may be in source, where this tool can't help

Known Gotchas

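  • Time ranges matter. A query with no explicit time window tends to return either nothing useful or far more history than you need. Always scope: give range queries a start and end, and choose range-selector windows such as [5m] deliberately.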
  • Loki label cardinality. Some labels have huge cardinality; querying without filters can be expensive and slow. Prefer filtering by service / level / host.
  • Partial-log overconfidence. Reading a fragment of a log and forming a hypothesis is one of Scotty's documented failure modes. Pull enough context (surrounding lines, related services) before concluding.
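  • PromQL is not SQL. Aggregation operators behave differently: sum, avg, and similar operators collapse all labels unless you name the ones to keep with by (...) or the ones to drop with without (...). If a result looks wrong, sanity-check the query on a known-good metric first.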
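
Two of the gotchas above can be sketched in a few lines: filtering Loki by labels before scanning, and stating the grouping labels explicitly in a PromQL aggregation. The label names service and level come from the list above. The helper names, the label values, and the metric http_requests_total are hypothetical, and values are assumed to contain no double quotes.

```python
from typing import Dict, List, Optional


def logql_selector(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL query: label matchers first, optional line filter after."""
    if not labels:
        # An empty selector would scan every stream; fail early instead.
        raise ValueError("refusing to build an unfiltered Loki query")
    matchers = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    query = "{" + matchers + "}"
    if contains:
        query += f' |= "{contains}"'
    return query


def promql_sum_rate(metric: str, by: List[str], window: str = "5m") -> str:
    """sum() keeps only the labels listed in by (...); name them explicitly."""
    grouping = ", ".join(by)
    return f"sum by ({grouping}) (rate({metric}[{window}]))"


print(logql_selector({"service": "api", "level": "error"}, contains="timeout"))
print(promql_sum_rate("http_requests_total", by=["service"]))
```

Refusing an empty label set is a design choice for this sketch: it turns the cardinality gotcha into an immediate error rather than a slow query.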