docs(readme): update assistant roster, prompt layers, repo structure

- Update assistant lists (added Shawn, Watson, David, CASE, AWS SA; modified Scotty/Harper roles)
- Reflect new architecture layers: Tool Prompt Snippets and Shared Context
- Align repository structure diagram with current filesystem layout
2026-05-20 22:50:22 -04:00
parent c1cc6e26c5
commit 703b3402d4
39 changed files with 1181 additions and 158 deletions
--- a/docs/tools/grafana.md
+++ b/docs/tools/grafana.md
@@ -0,0 +1,33 @@
+# Grafana
+
+> Metrics, logs, and dashboards.
+
+- **MCP server name:** `grafana` (Grafana MCP server; talks to the Grafana instance which hosts Prometheus metrics, Loki logs, and dashboards)
+- **Prompt snippet:** [prompts/tools/grafana.md](../../prompts/tools/grafana.md)
+
+## What It Is
+
+Grafana is Scotty's observability tool. Through the MCP server, agents can query Prometheus metrics (PromQL), Loki logs (LogQL), and read dashboard configuration — all the things you'd otherwise click through the Grafana web UI to see.
+
+This is the primary tool for **"what changed?"** and **"what's wrong right now?"** Without it, Scotty is guessing from fragments. With it, Scotty can see actual system state across time.
+
+## What It's Good For
+
+- Pulling logs during an incident — service logs, application logs, system logs (Loki)
+- Querying metrics — CPU, memory, request rates, error rates, latency percentiles (Prometheus)
+- Checking historical state — "how did this look an hour ago, before the deploy?"
+- Confirming a fix worked — was the metric actually restored after the intervention?
+- Capacity planning conversations — read trends, not guesses
+
+## What It's Not Good For
+
+- Mutating system state — Grafana reads; Kernos acts
+- Realtime tail-the-log-and-watch — Grafana is request/response; for live tailing, shell into the host via Kernos and use `journalctl -f`
+- Code-level debugging — Grafana shows symptoms; the cause may be in source, where this tool can't help
+
+## Known Gotchas
+
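+- **Time ranges matter.** A PromQL query without a sensible time window either misses the event (an instant query evaluates at a single point in time) or pulls far more history than needed. Always scope queries to the window around the incident.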
+- **Loki label cardinality.** Some labels have huge cardinality; querying without filters can be expensive and slow. Prefer filtering by service / level / host.
+- **Partial-log overconfidence.** Reading a fragment of a log and forming a hypothesis is one of Scotty's documented failure modes. Pull enough context (surrounding lines, related services) before concluding.
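+- **PromQL is not SQL.** Aggregation operators behave differently: `sum by (label)` groups time series by label rather than rows, and `rate()` needs a range vector such as `[5m]`. If a result looks wrong, sanity-check the query against a known-good metric first.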
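
The time-range and label-cardinality gotchas above can be sketched as request parameters. This is a minimal illustration, not the Grafana MCP server's interface: the doc does not show the MCP tool names, so the sketch targets the upstream HTTP query parameters instead (`query`, `start`, `end`, `step` for Prometheus `query_range`; `query`, `start`, `end`, `limit`, `direction` for Loki `query_range`). The helper names, the `http_requests_total` metric, and the `service` / `level` labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def prom_range_params(query, end, window, step_seconds=60):
    """Build Prometheus query_range parameters scoped to [end - window, end]."""
    start = end - window
    return {
        "query": query,
        "start": str(int(start.timestamp())),  # Unix seconds
        "end": str(int(end.timestamp())),
        "step": f"{step_seconds}s",
    }


def loki_selector(**labels):
    """Build a LogQL stream selector from exact-match labels.

    Refuses an empty selector so a query is never sent without filters.
    """
    if not labels:
        raise ValueError("at least one label filter is required")
    pairs = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        pairs.append(f'{name}="{escaped}"')
    return "{" + ", ".join(pairs) + "}"


def loki_range_params(selector, end, window, limit=200):
    """Build Loki query_range parameters scoped to [end - window, end]."""
    start = end - window
    return {
        "query": selector,
        "start": str(int(start.timestamp()) * 1_000_000_000),  # Unix nanoseconds
        "end": str(int(end.timestamp()) * 1_000_000_000),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical metric and labels, scoped to the last hour.
    metrics = prom_range_params(
        'sum(rate(http_requests_total{service="api"}[5m]))',
        end=now,
        window=timedelta(hours=1),
    )
    logs = loki_range_params(
        loki_selector(service="api", level="error"),
        end=now,
        window=timedelta(minutes=15),
    )
    print(urlencode(metrics))
    print(urlencode(logs))
```

Both builders take an explicit `end` and `window`, so no query can be sent without a bounded time range, and `loki_selector` rejects an empty label set rather than scanning every stream.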