docs(readme): update assistant roster, prompt layers, repo structure
- Update assistant lists (added Shawn, Watson, David, CASE, AWS SA; modified Scotty/Harper roles)
- Reflect new architecture layers: Tool Prompt Snippets and Shared Context
- Align repository structure diagram with current filesystem layout
33
docs/tools/grafana.md
Normal file
@@ -0,0 +1,33 @@
# Grafana
> Metrics, logs, and dashboards.
- **MCP server name:** `grafana` (Grafana MCP server; talks to the Grafana instance which hosts Prometheus metrics, Loki logs, and dashboards)
- **Prompt snippet:** [prompts/tools/grafana.md](../../prompts/tools/grafana.md)
## What It Is
Grafana is Scotty's observability tool. Through the MCP server, agents can query Prometheus metrics (PromQL) and Loki logs (LogQL), and can read dashboard configuration: everything you'd otherwise click through the Grafana web UI to see.

This is the primary tool for **"what changed?"** and **"what's wrong right now?"** Without it, Scotty is guessing from fragments. With it, Scotty can see actual system state across time.
## What It's Good For
- Pulling logs during an incident — service logs, application logs, system logs (Loki)
- Querying metrics — CPU, memory, request rates, error rates, latency percentiles (Prometheus)
- Checking historical state — "how did this look an hour ago, before the deploy?"
- Confirming a fix worked — was the metric actually restored after the intervention?
- Capacity planning conversations — read trends, not guesses
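To make "confirming a fix worked" concrete, here is a minimal Python sketch that compares a metric's mean before and after an intervention. It assumes samples in the `[[unix_ts, "value"], ...]` shape that Prometheus range queries return. The function name `split_mean` and the sample data are hypothetical, and the exact shape in which the `grafana` MCP server hands back results is not specified here.

```python
from statistics import mean


def split_mean(samples, intervention_ts):
    """Return (mean_before, mean_after) for samples around intervention_ts.

    `samples` is a list of [unix_ts, "value"] pairs; values are strings,
    as in Prometheus range-query results, so they are converted to float.
    """
    before = [float(v) for ts, v in samples if ts < intervention_ts]
    after = [float(v) for ts, v in samples if ts >= intervention_ts]
    if not before or not after:
        raise ValueError("need samples on both sides of the intervention")
    return mean(before), mean(after)


# Hypothetical error-rate samples: two before a fix at t=1000, two after.
samples = [[940, "0.20"], [970, "0.22"], [1000, "0.02"], [1030, "0.04"]]
before, after = split_mean(samples, intervention_ts=1000)
print(f"before={before:.2f} after={after:.2f}")  # before=0.21 after=0.03
```

Raising when either side is empty is deliberate: a comparison with no "before" data would otherwise look like a successful fix.
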
## What It's Not Good For
- Mutating system state — Grafana reads; Kernos acts
- Realtime tail-the-log-and-watch — Grafana is request/response; for live tailing, shell into the host via Kernos and use `journalctl -f`
- Code-level debugging — Grafana shows symptoms; the cause may be in source, where this tool can't help
## Known Gotchas
- **Time ranges matter.** A query whose window is too narrow or misplaced can return nothing, and one whose window is too wide pulls far more history than the question needs. Always set an explicit start and end.
- **Loki label cardinality.** Some labels have huge cardinality; querying without filters can be expensive and slow. Prefer filtering by service / level / host.
- **Partial-log overconfidence.** Reading a fragment of a log and forming a hypothesis is one of Scotty's documented failure modes. Pull enough context (surrounding lines, related services) before concluding.
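- **PromQL is not SQL.** Aggregation operators such as `sum` and `avg` work on labelled time series, not rows: without a `by` or `without` clause they collapse every label into a single series. If a result looks wrong, sanity-check the same query shape on a known-good metric first.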
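
The first two gotchas can be turned into guard rails. The sketch below is an illustration only. `prom_range_params` uses the parameter names of the Prometheus HTTP API's `query_range` endpoint (`query`, `start`, `end`, `step`), and `logql_selector` builds a LogQL stream selector. Both helper names are hypothetical, the label names `service` and `level` come from the advice above and may not match the labels in a given deployment, and the arguments the `grafana` MCP server expects may differ.

```python
from datetime import datetime, timedelta, timezone


def prom_range_params(query, end, window, step_seconds):
    """Build range-query parameters with an explicit, bounded window."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    start = end - window
    return {
        "query": query,
        "start": start.timestamp(),  # unix seconds
        "end": end.timestamp(),
        "step": step_seconds,
    }


def logql_selector(**labels):
    """Build a LogQL stream selector, refusing an unfiltered query.

    Label values are inserted as-is; this sketch does not escape quotes.
    """
    if not labels:
        raise ValueError("filter by at least one label, e.g. service or level")
    matchers = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "{" + matchers + "}"


# One hour ending at a fixed, hypothetical point in time.
end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
params = prom_range_params("up", end, timedelta(hours=1), step_seconds=60)
selector = logql_selector(service="api", level="error")

print(params["end"] - params["start"])  # 3600.0
print(selector)  # {level="error", service="api"}
```
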
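
To show why PromQL aggregation can surprise someone thinking in SQL, the following sketch emulates `sum by (...)` in plain Python. It is a toy model, not how Prometheus is implemented: `sum_by` and the sample series are hypothetical. It shows only that every label not listed in the `by` clause is dropped, and that an empty `by` list collapses everything into one series.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Emulate PromQL `sum by (<keep>) (...)` over (labels, value) pairs.

    Labels not named in `keep` are discarded; a missing label is treated
    as the empty string.
    """
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical instant values for one metric across three series.
series = [
    ({"service": "api", "host": "a"}, 3.0),
    ({"service": "api", "host": "b"}, 4.0),
    ({"service": "db", "host": "a"}, 1.0),
]

by_service = sum_by(series, ["service"])  # like: sum by (service) (metric)
overall = sum_by(series, [])              # like: sum(metric)

print(by_service)  # {(('service', 'api'),): 7.0, (('service', 'db'),): 1.0}
print(overall)     # {(): 8.0}
```
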