docs: rewrite README with structured overview and quick start guide

Replaces the minimal project description with a comprehensive README including a component overview table, quick start instructions, common Ansible operations, and links to detailed documentation. Aligns with Red Panda Approval™ standards.
2026-03-03 12:49:06 +00:00
parent c7be03a743
commit b4d60f2f38
219 changed files with 34586 additions and 2 deletions
--- a/docs/red_panda_standards.md
+++ b/docs/red_panda_standards.md
@@ -0,0 +1,148 @@
+# Red Panda Approval™ Standards
+
+Quality and observability standards for the Ouranos Lab. All infrastructure code, application code, and LLM-generated code deployed into this environment must meet these standards.
+
+---
+
+## 🐾 Red Panda Approval™
+
+All implementations must meet the 5 Sacred Criteria:
+
+1. **Fresh Environment Test** — Runs cleanly on new systems without drift. No leftover state, no manual steps.
+2. **Elegant Simplicity** — Modular, reusable, no copy-paste sprawl. One playbook per concern.
+3. **Observable & Auditable** — Clear task names, proper logging, check mode compatible. You can see what happened.
+4. **Idempotent Patterns** — Run multiple times with consistent results. No side effects on re-runs.
+5. **Actually Provisions & Configures** — Resources work, dependencies resolve, services integrate. It does the thing.
+
+---
+
+## Vault Security
+
+All sensitive information is encrypted using Ansible Vault with AES256 encryption.
+
+**Encrypted secrets:**
+- Database passwords (PostgreSQL, Neo4j)
+- API keys (OpenAI, Anthropic, Mistral, Groq)
+- Application secrets (Grafana, SearXNG, Arke)
+- Monitoring alerts (Pushover integration)
+
+**Security rules:**
+- AES256 encryption with `ansible-vault`
+- Use a password file for automation, but never pass `--vault-password-file` inline in scripts
+- Vault variables use the `vault_` prefix; map to friendly names in `group_vars/all/vars.yml`
+- No secrets in plain text files, ever
+
+---
+
+## Log Level Standards
+
+All services in the Ouranos Lab MUST follow these log level conventions. These rules apply to application code, infrastructure services, and any LLM-generated code deployed into this environment. Log output flows through Alloy → Loki → Grafana, so disciplined leveling is not cosmetic — it directly determines alert quality, dashboard usefulness, and on-call signal-to-noise ratio.
+
+### Level Definitions
+
+| Level | When to Use | What MUST Be Included | Loki / Grafana Role |
+|-------|-------------|----------------------|---------------------|
+| **ERROR** | Something is broken and requires human intervention. The service cannot fulfil the current request or operation. | Exception class, message, stack trace, and relevant context (request ID, user, resource identifier). Never a bare `"something failed"`. | AlertManager rules fire on `level=~"error\|fatal\|critical"`. These trigger Pushover notifications. |
+| **WARNING** | Degraded but self-recovering: retries succeeding, fallback paths taken, thresholds approaching, deprecated features invoked. | What degraded, what recovery action was taken, current metric value vs. threshold. | Grafana dashboard panels. Rate-based alerting (e.g., >N warnings/min). |
+| **INFO** | Significant lifecycle and business events: service start/stop, configuration loaded, deployment markers, user authentication, job completion, schema migrations. | The event and its outcome. This level tells the *story* of what the system did. | Default production visibility. The go-to level for post-incident timelines. |
+| **DEBUG** | Diagnostic detail for active troubleshooting: request/response payloads, SQL queries, internal state, variable values. | **Actionable context is mandatory.** A DEBUG line with no detail is worse than no line at all. Include variable values, object states, or decision paths. | Never enabled in production by default. Used on-demand via per-service level override. |
+
+### Anti-Patterns
+
+These are explicit violations of Ouranos logging standards:
+
+| ❌ Anti-Pattern | Why It's Wrong | ✅ Correct Approach |
+|----------------|---------------|-------------------|
+| Health checks logged at INFO (`GET /health → 200 OK`) | Routine HAProxy/Prometheus probes flood syslog with thousands of identical lines per hour, burying real events. | Suppress health endpoints from access logs entirely, or demote to DEBUG. |
+| DEBUG with no context (`logger.debug("error occurred")`) | Provides zero diagnostic value. If DEBUG is noisy *and* useless, nobody will ever enable it. | `logger.debug("PaymentService.process failed: order_id=%s, provider=%s, response=%r", oid, provider, resp)` |
+| ERROR without exception details (`logger.error("task failed")`) | Cannot be triaged without reproduction steps. Wastes on-call time. | `logger.error("Celery task invoice_gen failed: order_id=%s", oid, exc_info=True)` |
+| Logging sensitive data at any level | Passwords, tokens, API keys, and PII in Loki are a security incident. | Mask or redact: `api_key=sk-...a3f2`, `password=*****`. |
+| Inconsistent level casing | Breaks LogQL filters and Grafana label selectors. | **Python / Django**: UPPERCASE (`INFO`, `WARNING`, `ERROR`, `DEBUG`). **Go / infrastructure** (HAProxy, Alloy, Gitea): lowercase (`info`, `warn`, `error`, `debug`). |
+| Logging expected conditions as ERROR | A user entering a wrong password is not an error — it is normal business logic. | Use WARNING or INFO for expected-but-notable conditions. Reserve ERROR for things that are actually broken. |
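+
+A minimal sketch of the mask-or-redact approach from the table above, using only the Python standard library. The names `mask_secret`, `redact`, and `RedactingFilter`, and the `key=value` pattern, are illustrative assumptions rather than existing Ouranos code; the key list would need to match the fields a given service actually logs.
+
+```python
+import logging
+import re
+
+# Hypothetical key list: extend it to match the fields your service logs.
+_SECRET_PATTERN = re.compile(r"\b(password|api_key|token)=(\S+)")
+
+
+def mask_secret(value: str, keep_prefix: int = 3, keep_suffix: int = 4) -> str:
+    """Keep a short prefix and suffix for correlation; fully mask short values."""
+    if len(value) <= keep_prefix + keep_suffix:
+        return "*****"
+    return f"{value[:keep_prefix]}...{value[-keep_suffix:]}"
+
+
+def redact(message: str) -> str:
+    """Replace secret values in key=value pairs with masked forms."""
+    def _mask(match: re.Match) -> str:
+        key, value = match.group(1), match.group(2)
+        # Passwords are always fully masked; other secrets keep a short hint.
+        masked = "*****" if key == "password" else mask_secret(value)
+        return f"{key}={masked}"
+    return _SECRET_PATTERN.sub(_mask, message)
+
+
+class RedactingFilter(logging.Filter):
+    """Redact the fully formatted message before it is emitted."""
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        record.msg = redact(record.getMessage())
+        record.args = ()  # the message is already formatted
+        return True
+```
+
+Attaching the filter to a handler (`handler.addFilter(RedactingFilter())`) rather than to a single logger applies it to every record that handler emits, including records propagated from child loggers.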
+
+### Health Check Rule
+
+> All services exposed through HAProxy MUST suppress or demote health check endpoints (`/health`, `/healthz`, `/api/health`, `/metrics`, `/ping`) to DEBUG or below. Health check success is the *absence* of errors, not the presence of 200s. If your syslog shows a successful health probe, your log level is wrong.
+
+**Implementation guidance:**
+- **Django / Gunicorn**: Filter health paths in the access log handler or use middleware that skips logging for probe user-agents.
+- **Docker services**: Configure the application's internal logging to exclude health routes — the syslog driver forwards everything it receives.
+- **HAProxy**: HAProxy's own health check logs (`option httpchk`) should remain at the HAProxy level for connection debugging, but backend application responses to those probes must not surface at INFO.
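+
+One way to implement the Django / Gunicorn guidance above is a standard-library `logging.Filter` that drops access-log records for the health paths. This is a sketch under assumptions: `HealthCheckFilter` is a hypothetical name, and the regular expression assumes the access log line contains a quoted request line such as `"GET /health HTTP/1.1"`.
+
+```python
+import logging
+import re
+
+# Paths taken from the Health Check Rule above.
+HEALTH_PATHS = ("/health", "/healthz", "/api/health", "/metrics", "/ping")
+
+# Assumes the access log line embeds the request line, e.g. "GET /health HTTP/1.1".
+_REQUEST_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/')
+
+
+class HealthCheckFilter(logging.Filter):
+    """Drop access-log records whose request path is a health endpoint."""
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        match = _REQUEST_LINE.search(record.getMessage())
+        if match is None:
+            return True  # not an access-log line we recognise; keep it
+        path = match.group(1).split("?", 1)[0]  # ignore any query string
+        return path not in HEALTH_PATHS
+```
+
+Under Gunicorn, the filter could be attached with `logging.getLogger("gunicorn.access").addFilter(HealthCheckFilter())`, assuming the deployment uses Gunicorn's standard access logger.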
+
+### Background Worker & Queue Monitoring
+
+> **The most dangerous failure is the one that produces no logs.**
+
+When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runner, cron job) fails to start or crashes on startup, it generates no ongoing log output. Error-rate dashboards stay green because there is no process running to produce errors. Meanwhile, queues grow unbounded and work silently stops being processed.
+
+**Required practices:**
+
+1. **Heartbeat logging** — Every long-running background worker MUST emit a periodic INFO-level heartbeat (e.g., `"worker alive, processed N jobs in last 5m, queue depth: M"`). The *absence* of this heartbeat is the alertable condition.
+
+2. **Startup and shutdown at INFO** — Worker start, ready, graceful shutdown, and crash-exit are significant lifecycle events. These MUST log at INFO.
+
+3. **Queue depth as a metric** — RabbitMQ queue depths and any application-level task queues MUST be exposed as Prometheus metrics. A growing queue with zero consumer activity is an **ERROR**-level alert, not a warning.
+
+4. **Grafana "last seen" alerts** — For every background worker, configure a Grafana alert using `absent_over_time()` or equivalent staleness detection: *"Worker X has not logged a heartbeat in >10 minutes"* → ERROR severity → Pushover notification.
+
+5. **Crash-on-start is ERROR** — If a worker exits within seconds of starting (missing config, failed DB connection, import error), the exit MUST be captured at ERROR level by the service manager (`systemd OnFailure=`, Docker restart policy logs). Do not rely on the crashing application to log its own death — it may never get the chance.
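+
+The heartbeat practice above can be sketched as a small helper that a worker loop calls on every iteration. `Heartbeat`, `job_done`, and `maybe_beat` are hypothetical names, and the 300-second default interval is an assumption chosen to match the "last 5m" example message; the clock is injectable so the behaviour can be tested without sleeping.
+
+```python
+import logging
+import time
+from typing import Callable
+
+
+class Heartbeat:
+    """Emit a periodic INFO heartbeat with throughput and queue depth."""
+
+    def __init__(
+        self,
+        logger: logging.Logger,
+        interval_s: float = 300.0,
+        clock: Callable[[], float] = time.monotonic,
+    ) -> None:
+        self._logger = logger
+        self._interval_s = interval_s
+        self._clock = clock
+        self._last_beat = clock()
+        self._processed = 0
+
+    def job_done(self) -> None:
+        """Record one completed job since the last heartbeat."""
+        self._processed += 1
+
+    def maybe_beat(self, queue_depth: int) -> bool:
+        """Log a heartbeat if the interval has elapsed; return True if logged."""
+        now = self._clock()
+        elapsed = now - self._last_beat
+        if elapsed < self._interval_s:
+            return False
+        self._logger.info(
+            "worker alive, processed %d jobs in last %ds, queue depth: %d",
+            self._processed, int(elapsed), queue_depth,
+        )
+        self._last_beat = now
+        self._processed = 0
+        return True
+```
+
+Because the worker's own logger level must allow INFO for this line to reach Loki, a service running at a `WARNING` default would need a per-logger override for the heartbeat logger.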
+
+### Production Defaults
+
+| Service Category | Default Level | Rationale |
+|-----------------|---------------|-----------|
+| Django apps (Angelia, Athena, Kairos, Icarlos, Spelunker, Peitho, MCP Switchboard) | `WARNING` | Business logic — only degraded or broken conditions surface. Lifecycle events (start/stop/deploy) still log at INFO via Gunicorn and systemd. |
+| Gunicorn access logs | Suppress 2xx/3xx health probes | Routine request logging deferred to HAProxy access logs in Loki. |
+| Infrastructure agents (Alloy, Prometheus, Node Exporter) | `warn` | Stable — do not change without cause. |
+| HAProxy (Titania) | `warning` | Connection-level logging handled by HAProxy's own log format → Alloy → Loki. |
+| Databases (PostgreSQL, Neo4j) | `warning` | Query-level logging only enabled for active troubleshooting. |
+| Docker services (Gitea, LobeChat, Nextcloud, AnythingLLM, SearXNG) | `warn` / `warning` | Per-service default. Tune individually if needed. |
+| LLM Proxy (Arke) | `info` | Token usage tracking and provider routing decisions justify INFO. Review periodically for noise. |
+| Observability stack (Grafana, Loki, AlertManager) | `warn` | Should be quiet unless something is wrong with observability itself. |
+
+### Loki & Grafana Alignment
+
+**Label normalization**: Alloy pipelines (syslog listeners and journal relabeling) MUST extract and forward a `level` label on every log line. Without a `level` label, the log entry is invisible to level-based dashboard filters and alert rules.
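+
+The mapping a normalization stage needs to perform can be illustrated in Python. This is not Alloy configuration: `normalize_level`, the alias table, and the `unknown` fallback are assumptions made for illustration, with canonical values chosen to match the lowercase names used in the LogQL filters in this document.
+
+```python
+# Hypothetical alias table: spellings seen across Python and Go/infrastructure services.
+_ALIASES = {"warning": "warn", "err": "error", "crit": "critical"}
+_CANONICAL = {"debug", "info", "warn", "error", "critical", "fatal"}
+
+
+def normalize_level(raw: str) -> str:
+    """Map a raw level string to one lowercase canonical label value."""
+    level = raw.strip().lower()
+    level = _ALIASES.get(level, level)
+    # Unrecognised values are labelled explicitly rather than dropped.
+    return level if level in _CANONICAL else "unknown"
+```
+
+With a mapping like this, `WARNING` from a Django app and `warn` from a Go service end up under the same `level="warn"` label, so a single filter matches both.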
+
+**LogQL conventions for dashboards:**
+```logql
+# Production error monitoring (default dashboard view)
+{job="syslog", hostname="puck"} | json | level=~"error|fatal|critical"
+
+# Warning-and-above for a specific service
+{service_name="haproxy"} | logfmt | level=~"warn|error|fatal|critical"
+
+# Debug-level troubleshooting (temporary, never permanent dashboards)
+{container="angelia"} | json | level="debug"
+```
+
+**Alerting rules** — Grafana alert rules MUST key off the normalized `level` label:
+- `level=~"error|fatal|critical"` → Immediate Pushover notification via AlertManager
+- `absent_over_time({service_name="celery_worker"}[10m])` → Worker heartbeat staleness → ERROR severity
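+- Rate-based: `rate({service_name="arke"} | json | level="error" [5m]) > 0.1` → Sustained error rate (LogQL `rate` is per second, so this fires above 0.1 error lines/s, i.e. more than 30 error lines in 5 minutes)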
+
+**Retention alignment**: Loki retention policies should preserve ERROR and WARNING logs longer than DEBUG. DEBUG-level logs generated during troubleshooting sessions should have a short TTL or be explicitly cleaned up.
+
+---
+
+## Documentation Standards
+
+Place documentation in the `/docs/` directory of the repository.
+
+### HTML Documents
+
+HTML documents must follow [docs/documentation_style_guide.html](documentation_style_guide.html).
+
+- Use Bootstrap CDN with Bootswatch theme **Flatly**
+- Include a dark mode toggle button in the navbar
+- Use Bootstrap Icons for icons
+- Use Bootstrap CSS for styles — avoid custom CSS
+- Use **Mermaid** for diagrams
+
+### Markdown Documents
+
+Only these status symbols are approved:
+- ✔ Success/Complete
+- ❌ Error/Failed
+- ⚠️ Warning/Caution
+- ℹ️ Information/Note