docs(readme): update assistant roster, prompt layers, repo structure

- Update assistant lists (added Shawn, Watson, David, CASE, AWS SA; modified Scotty/Harper roles)
- Reflect new architecture layers: Tool Prompt Snippets and Shared Context
- Align repository structure diagram with current filesystem layout
2026-05-20 22:50:22 -04:00
parent c1cc6e26c5
commit 703b3402d4
39 changed files with 1181 additions and 158 deletions

33
docs/tools/grafana.md Normal file

@@ -0,0 +1,33 @@
# Grafana
> Metrics, logs, and dashboards.
- **MCP server name:** `grafana` (Grafana MCP server; talks to the Grafana instance that exposes Prometheus metrics, Loki logs, and dashboards)
- **Prompt snippet:** [prompts/tools/grafana.md](../../prompts/tools/grafana.md)
## What It Is
Grafana is Scotty's observability tool. Through the MCP server, agents can query Prometheus metrics (PromQL), Loki logs (LogQL), and read dashboard configuration — all the things you'd otherwise click through the Grafana web UI to see.
This is the primary tool for **"what changed?"** and **"what's wrong right now?"** Without it, Scotty is guessing from fragments. With it, Scotty can see actual system state across time.
## What It's Good For
- Pulling logs during an incident — service logs, application logs, system logs (Loki)
- Querying metrics — CPU, memory, request rates, error rates, latency percentiles (Prometheus)
- Checking historical state — "how did this look an hour ago, before the deploy?"
- Confirming a fix worked — was the metric actually restored after the intervention?
- Capacity planning conversations — read trends, not guesses
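
The "before the deploy" and "confirming a fix" checks above both reduce to comparing one metric across two time windows. Below is a minimal Python sketch, assuming the samples have already been fetched as `(timestamp, value)` pairs. The function names, the 10% tolerance, and the sample layout are hypothetical illustrations, not part of the Grafana MCP server.

```python
from statistics import mean
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)
Window = Tuple[float, float]  # (start, end), end exclusive


def window_mean(samples: Sequence[Sample], start: float, end: float) -> float:
    """Average the values whose timestamp falls in [start, end)."""
    values = [v for t, v in samples if start <= t < end]
    if not values:
        # An empty window usually means the time range was scoped wrongly.
        raise ValueError("no samples in window; check the time range")
    return mean(values)


def is_restored(
    samples: Sequence[Sample],
    baseline: Window,
    after_fix: Window,
    tolerance: float = 0.10,  # hypothetical threshold: within 10% of baseline
) -> bool:
    """True if the post-fix mean is within `tolerance` of the baseline mean."""
    base = window_mean(samples, *baseline)
    post = window_mean(samples, *after_fix)
    if base == 0:
        return post == 0
    return abs(post - base) / abs(base) <= tolerance
```

Comparing window averages rather than single points keeps one noisy sample from deciding whether the fix "worked"; the right tolerance depends on the metric.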
## What It's Not Good For
- Mutating system state — Grafana reads; Kernos acts
- Real-time log tailing: queries through the MCP server are request/response, so for live tailing, shell into the host via Kernos and use `journalctl -f`
- Code-level debugging — Grafana shows symptoms; the cause may be in source, where this tool can't help
## Known Gotchas
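- **Time ranges matter.** A PromQL instant query evaluates at a single point in time, and a range query needs an explicit start, end, and step. Too narrow a window can return nothing; too wide a window is slow and can hit server-side limits. Always scope.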
- **Loki label cardinality.** Some labels have very high cardinality, and broad selectors over them are expensive and slow. Filter by service, level, or host before widening the query.
- **Partial-log overconfidence.** Reading a fragment of a log and forming a hypothesis is one of Scotty's documented failure modes. Pull enough context (surrounding lines, related services) before concluding.
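- **PromQL is not SQL.** Aggregation operators such as `sum` and `avg` combine time series across labels at each evaluation step, not rows in a table, and they drop every label not named in a `by` clause. If a query looks weird, sanity-check it on a known-good metric first.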
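
The scoping advice in the gotchas above can be made mechanical. The sketch below builds bounded query inputs in Python without making any network call. The `query`, `start`, `end`, and `step` names mirror the Prometheus HTTP API's `query_range` parameters; whether the Grafana MCP server accepts them under the same names is an assumption. The helper names and the `service`, `level`, and `host` label names are also hypothetical and should be replaced with the labels actually present in Loki.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def promql_range_params(
    query: str,
    minutes: int,
    step_seconds: int = 60,
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Build bounded range-query parameters: query, start, end, step."""
    if minutes <= 0 or step_seconds <= 0:
        raise ValueError("minutes and step_seconds must be positive")
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        "query": query,
        "start": str(int(start.timestamp())),  # Unix seconds
        "end": str(int(end.timestamp())),
        "step": f"{step_seconds}s",
    }


def _quote(value: str) -> str:
    """Escape backslashes and double quotes for a LogQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def logql_selector(
    service: str,
    level: Optional[str] = None,
    host: Optional[str] = None,
) -> str:
    """Build a LogQL stream selector that always filters by service."""
    if not service:
        raise ValueError("service is required so the query is never unfiltered")
    matchers = [f'service="{_quote(service)}"']
    if level:
        matchers.append(f'level="{_quote(level)}"')
    if host:
        matchers.append(f'host="{_quote(host)}"')
    return "{" + ", ".join(matchers) + "}"
```

Requiring a window length and a service name at the call site means an unscoped query fails fast in the helper instead of running slowly against the backend.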