# Red Panda Approval™ Standards

Quality and observability standards for the Ouranos Lab. All infrastructure code, application code, and LLM-generated code deployed into this environment must meet these standards.

---

## 🐾 Red Panda Approval™

All implementations must meet the 5 Sacred Criteria:

1. **Fresh Environment Test**: Clean runs on new systems without drift. No leftover state, no manual steps.
2. **Elegant Simplicity**: Modular, reusable, no copy-paste sprawl. One playbook per concern.
3. **Observable & Auditable**: Clear task names, proper logging, check mode compatible. You can see what happened.
4. **Idempotent Patterns**: Run multiple times with consistent results. No side effects on re-runs.
5. **Actually Provisions & Configures**: Resources work, dependencies resolve, services integrate. It does the thing.

---

## Vault Security

All sensitive information is encrypted using Ansible Vault with AES256 encryption.

**Encrypted secrets:**

- Database passwords (PostgreSQL, Neo4j)
- API keys (OpenAI, Anthropic, Mistral, Groq)
- Application secrets (Grafana, SearXNG, Arke)
- Monitoring alerts (Pushover integration)

**Security rules:**

- AES256 encryption with `ansible-vault`
- Password file for automation; never pass `--vault-password-file` inline in scripts
- Vault variables use the `vault_` prefix; map to friendly names in `group_vars/all/vars.yml`
- No secrets in plain text files, ever

---

## Log Level Standards

All services in the Ouranos Lab MUST follow these log level conventions. These rules apply to application code, infrastructure services, and any LLM-generated code deployed into this environment. Log output flows through Alloy → Loki → Grafana, so disciplined leveling is not cosmetic: it directly determines alert quality, dashboard usefulness, and on-call signal-to-noise ratio.

### Level Definitions

| Level | When to Use | What MUST Be Included | Loki / Grafana Role |
|-------|-------------|----------------------|---------------------|
| **ERROR** | Something is broken and requires human intervention. The service cannot fulfil the current request or operation. | Exception class, message, stack trace, and relevant context (request ID, user, resource identifier). Never a bare `"something failed"`. | AlertManager rules fire on `level=~"error\|fatal\|critical"`. These trigger Pushover notifications. |
| **WARNING** | Degraded but self-recovering: retries succeeding, fallback paths taken, thresholds approaching, deprecated features invoked. | What degraded, what recovery action was taken, current metric value vs. threshold. | Grafana dashboard panels. Rate-based alerting (e.g., >N warnings/min). |
| **INFO** | Significant lifecycle and business events: service start/stop, configuration loaded, deployment markers, user authentication, job completion, schema migrations. | The event and its outcome. This level tells the *story* of what the system did. | Default production visibility. The go-to level for post-incident timelines. |
| **DEBUG** | Diagnostic detail for active troubleshooting: request/response payloads, SQL queries, internal state, variable values. | **Actionable context is mandatory.** A DEBUG line with no detail is worse than no line at all. Include variable values, object states, or decision paths. | Never enabled in production by default. Used on-demand via per-service level override. |

### Anti-Patterns

These are explicit violations of Ouranos logging standards:

| ❌ Anti-Pattern | Why It's Wrong | ✅ Correct Approach |
|----------------|---------------|-------------------|
| Health checks logged at INFO (`GET /health → 200 OK`) | Routine HAProxy/Prometheus probes flood syslog with thousands of identical lines per hour, burying real events. | Suppress health endpoints from access logs entirely, or demote to DEBUG. |
| DEBUG with no context (`logger.debug("error occurred")`) | Provides zero diagnostic value. If DEBUG is noisy *and* useless, nobody will ever enable it. | `logger.debug("PaymentService.process failed: order_id=%s, provider=%s, response=%r", oid, provider, resp)` |
| ERROR without exception details (`logger.error("task failed")`) | Cannot be triaged without reproduction steps. Wastes on-call time. | `logger.error("Celery task invoice_gen failed: order_id=%s", oid, exc_info=True)` |
| Logging sensitive data at any level | Passwords, tokens, API keys, and PII in Loki are a security incident. | Mask or redact: `api_key=sk-...a3f2`, `password=*****`. |
| Inconsistent level casing | Breaks LogQL filters and Grafana label selectors. | **Python / Django**: UPPERCASE (`INFO`, `WARNING`, `ERROR`, `DEBUG`). **Go / infrastructure** (HAProxy, Alloy, Gitea): lowercase (`info`, `warn`, `error`, `debug`). |
| Logging expected conditions as ERROR | A user entering a wrong password is not an error; it is normal business logic. | Use WARNING or INFO for expected-but-notable conditions. Reserve ERROR for things that are actually broken. |

### Health Check Rule

> All services exposed through HAProxy MUST suppress or demote health check endpoints (`/live`, `/ready`, `/health`, `/healthz`, `/api/health`, `/metrics`, `/ping`) to DEBUG or below. Health check success is the *absence* of errors, not the presence of 200s. If your syslog shows a successful health probe, your log level is wrong.

**Implementation guidance:**

- **Django / Gunicorn**: Filter health paths in the access log handler or use middleware that skips logging for probe user-agents.
- **FastAPI / Uvicorn**: Add a `logging.Filter` on the `uvicorn.access` logger that matches health paths in the access log message. Uvicorn's access log format includes the full request line in quotes (e.g., `"GET /live HTTP/1.1"`), so filter regexes must account for that. See also the structured logging notes below.
- **Docker services**: Configure the application's internal logging to exclude health routes, because the syslog driver forwards everything it receives.
- **HAProxy**: HAProxy's own health check logs (`option httpchk`) should remain at the HAProxy level for connection debugging, but backend application responses to those probes must not surface at INFO.

### Background Worker & Queue Monitoring

> **The most dangerous failure is the one that produces no logs.**

When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runner, cron job) fails to start or crashes on startup, it generates no ongoing log output. Error-rate dashboards stay green because there is no process running to produce errors. Meanwhile, queues grow unbounded and work silently stops being processed.

**Required practices:**

1. **Heartbeat logging**: Every long-running background worker MUST emit a periodic INFO-level heartbeat (e.g., `"worker alive, processed N jobs in last 5m, queue depth: M"`). The *absence* of this heartbeat is the alertable condition.
2. **Startup and shutdown at INFO**: Worker start, ready, graceful shutdown, and crash-exit are significant lifecycle events. These MUST log at INFO.
3. **Queue depth as a metric**: RabbitMQ queue depths and any application-level task queues MUST be exposed as Prometheus metrics. A growing queue with zero consumer activity is an **ERROR**-level alert, not a warning.
4. **Grafana "last seen" alerts**: For every background worker, configure a Grafana alert using `absent_over_time()` or equivalent staleness detection: *"Worker X has not logged a heartbeat in >10 minutes"* → ERROR severity → Pushover notification.
5. **Crash-on-start is ERROR**: If a worker exits within seconds of starting (missing config, failed DB connection, import error), the exit MUST be captured at ERROR level by the service manager (`systemd OnFailure=`, Docker restart policy logs).
   Do not rely on the crashing application to log its own death; it may never get the chance.

### Production Defaults

| Service Category | Default Level | Rationale |
|-----------------|---------------|-----------|
| Django apps (Angelia, Athena, Kairos, Icarlos, Spelunker, Peitho, MCP Switchboard) | `WARNING` | Business logic: only degraded or broken conditions surface. Lifecycle events (start/stop/deploy) still log at INFO via Gunicorn and systemd. |
| FastAPI apps (Periplus) | `WARNING` | Same rationale as Django. Uvicorn lifecycle events (start/stop) are pinned to INFO via the `uvicorn.error` logger regardless of app log level. |
| Gunicorn access logs | Suppress 2xx/3xx health probes | Routine request logging deferred to HAProxy access logs in Loki. |
| Infrastructure agents (Alloy, Prometheus, Node Exporter) | `warn` | Stable; do not change without cause. |
| HAProxy (Titania) | `warning` | Connection-level logging handled by HAProxy's own log format → Alloy → Loki. |
| Databases (PostgreSQL, Neo4j) | `warning` | Query-level logging only enabled for active troubleshooting. |
| Docker services (Gitea, LobeChat, Nextcloud, AnythingLLM, SearXNG) | `warn` / `warning` | Per-service default. Tune individually if needed. |
| LLM Proxy (Arke) | `info` | Token usage tracking and provider routing decisions justify INFO. Review periodically for noise. |
| Observability stack (Grafana, Loki, AlertManager) | `warn` | Should be quiet unless something is wrong with observability itself. |

### Structured Logging: FastAPI / Uvicorn

FastAPI apps using uvicorn require special handling to achieve JSON-structured log output for the Alloy → Loki pipeline. Uvicorn manages its own loggers aggressively, and naive approaches will fail silently.

**Required practices:**

1. **Override uvicorn's handlers, don't just add to root**: Uvicorn's `config.load()` creates its own `StreamHandler` instances on `uvicorn`, `uvicorn.error`, and `uvicorn.access`.
   You must remove these handlers and set `propagate = True` so log records flow to the root logger where your JSON formatter lives.
2. **Re-apply logging config in the lifespan**: Configuring logging at module import time is not sufficient. Uvicorn's `config.load()` runs *after* your module is imported but *before* the ASGI lifespan starts. Call your logging configuration function again inside the FastAPI `lifespan` context manager to recapture control.
3. **Remap uvicorn logger names**: Uvicorn uses `uvicorn.error` for all lifecycle messages (startup, shutdown, errors) despite the misleading name. Remap it to `uvicorn` in your JSON formatter's output for clarity in Loki queries.
4. **Use `pydantic-settings` with `extra = "ignore"`**: When loading config from `.env` files that contain variables for other services (e.g., oauth2-proxy), pydantic-settings will reject unknown fields by default. Always set `extra = "ignore"` in the model config.

### Loki & Grafana Alignment

**Label normalization**: Alloy pipelines (syslog listeners and journal relabeling) MUST extract and forward a `level` label on every log line. Without a `level` label, the log entry is invisible to level-based dashboard filters and alert rules.

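
On the application side, a `level` label can only be extracted if every line carries a level field. The following is a minimal sketch, using only the Python standard library, of a JSON formatter that emits one and applies the uvicorn handler takeover and logger-name remap described in the structured logging practices. The names `JsonFormatter`, `configure_logging`, and `LOGGER_NAME_REMAP`, as well as the JSON field names, are hypothetical and not taken from any Ouranos codebase; the sketch also assumes the Alloy pipeline parses JSON and reads the `level` field.

```python
import json
import logging
import sys

# Hypothetical remap table: uvicorn.error carries lifecycle messages,
# so it is reported as plain "uvicorn" in the JSON output.
LOGGER_NAME_REMAP = {"uvicorn.error": "uvicorn"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object that includes a level field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            # Python's uppercase level names, per the casing rule for Python services.
            "level": record.levelname,
            "logger": LOGGER_NAME_REMAP.get(record.name, record.name),
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.WARNING, stream=None) -> None:
    """Install one JSON handler on the root logger and route uvicorn through it."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())

    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)

    # Drop any handlers attached to the uvicorn loggers and let their
    # records propagate up to the root logger's JSON handler.
    for name in ("uvicorn", "uvicorn.error", "uvicorn.access"):
        uvicorn_logger = logging.getLogger(name)
        uvicorn_logger.handlers.clear()
        uvicorn_logger.propagate = True
```

In a FastAPI app, `configure_logging()` would be called once at import time and again at the top of the `lifespan` context manager, so that handlers installed by uvicorn in between are removed again.
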
**LogQL conventions for dashboards:**

```logql
# Production error monitoring (default dashboard view)
{job="syslog", hostname="puck"} | json | level=~"error|fatal|critical"

# Warning-and-above for a specific service
{service_name="haproxy"} | logfmt | level=~"warn|error|fatal"

# Debug-level troubleshooting (temporary, never permanent dashboards)
{container="angelia"} | json | level="debug"
```

**Alerting rules**: Grafana alert rules MUST key off the normalized `level` label:

- `level=~"error|fatal|critical"` → Immediate Pushover notification via AlertManager
- `absent_over_time({service_name="celery_worker"}[10m])` → Worker heartbeat staleness → ERROR severity
- Rate-based: `rate({service_name="arke"} | json | level="error" [5m]) > 0.1` → Sustained error rate

**Retention alignment**: Loki retention policies should preserve ERROR and WARNING logs longer than DEBUG. DEBUG-level logs generated during troubleshooting sessions should have a short TTL or be explicitly cleaned up.

---

## Health Check Endpoints

All services MUST expose Kubernetes-style health endpoints at these paths:

| Endpoint | Purpose | Auth |
|----------|---------|------|
| `GET /live` | **Liveness**: process is running and accepting connections | None |
| `GET /ready` | **Readiness**: process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
| `GET /metrics` | Prometheus metrics | IP-restricted (no JWT) |

- HAProxy uses `health_path: /ready/` for backend health checks; return HTTP 200 when ready
- Health endpoints MUST NOT require authentication
- Third-party services use their native paths (`/api/health`, `/api/healthz`, `/-/healthy`, etc.)

### Docker Compose Healthchecks

Use `curl -f` (install curl in images if needed). Do not use `wget --spider`.

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
```

---

## Endpoint Protection

| Protected (require valid JWT) | Unprotected |
|-------------------------------|-------------|
| All `/api/v1/*` routes (except `POST /api/v1/telemetry`) | `GET /live` |
| | `GET /ready` |
| | `GET /metrics` (IP-restricted to internal networks) |
| | `GET /api/auth/login-url` |
| | `POST /api/auth/token` |
| | `POST /api/v1/telemetry` (sendBeacon cannot set headers) |

> **Why `/api/v1/telemetry` is unprotected**: The browser `sendBeacon` API cannot set `Authorization` headers. The telemetry endpoint must be open to receive client-side error reports and performance data, or browser errors will be silently lost.

---

## Prometheus Metrics

All services SHOULD expose `GET /metrics` in Prometheus exposition format, scraped by Prospero's Prometheus at 15s intervals.

- **IP-restricted** to internal networks: `10.10.0.0/24`, `172.16.0.0/12`, `127.0.0.0/8`
- No JWT required, because HAProxy and Prometheus scrapers cannot authenticate
- Useful metrics to expose: request totals and durations, error rates, active connections, queue depths, dependency health

---

## Browser Telemetry

Frontend/browser code MUST report errors and performance data back to the server.

- Send to `POST /api/v1/telemetry` (unprotected endpoint)
- Capture: JavaScript exceptions, promise rejections, resource load failures, performance metrics
- The server MUST log client-side exceptions at **WARNING** level (they indicate user-facing problems but are not server failures)
- Include enough context to reproduce: URL, user agent, error message, stack trace (if available)

---

## Environment Variable Naming

All environment variables for an application MUST use a consistent prefix matching the service name (e.g., `PERIPLUS_`, `ARKE_`, `ANGELIA_`). This applies to every variable in the `.env` file, including those consumed by sidecar services like oauth2-proxy.

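
The prefix convention lends itself to an automated check. The sketch below, written against the Python standard library only, scans the text of a `.env` file and reports variable names that lack the service prefix. The helper `unprefixed_vars` and the variable `PERIPLUS_DATABASE_URL` are hypothetical illustrations, not part of any existing Ouranos tooling.

```python
import re

# Matches "NAME=" or "export NAME=" at the start of a line and captures NAME.
ASSIGNMENT = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=")


def unprefixed_vars(env_text: str, service: str) -> list[str]:
    """Return the variable names in env_text that lack the SERVICE_ prefix."""
    prefix = service.upper() + "_"
    offenders = []
    for line in env_text.splitlines():
        match = ASSIGNMENT.match(line)  # comments and blank lines do not match
        if match and not match.group(1).startswith(prefix):
            offenders.append(match.group(1))
    return offenders


# Hypothetical sample .env content for the Periplus service.
sample = """\
# Periplus settings
PERIPLUS_DATABASE_URL=postgresql://localhost/periplus
PERIPLUS_CASDOOR_CLIENT_ID=changeme
OAUTH2_PROXY_CLIENT_ID=changeme
"""

print(unprefixed_vars(sample, "periplus"))  # ['OAUTH2_PROXY_CLIENT_ID']
```

A check like this could run in CI against `.env.example`, where the sidecar's own variable name would be flagged and replaced by a prefixed variable.
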
**Rules:**

- All vars in `.env` use the `SERVICENAME_` prefix, with no exceptions
- `compose.yaml` maps prefixed vars to the sidecar's expected names (e.g., `OAUTH2_PROXY_CLIENT_ID: ${PERIPLUS_CASDOOR_CLIENT_ID}`)
- The application's Settings model SHOULD declare all prefixed vars, even those only consumed by sidecars, so the full configuration is documented in one place
- Every repo MUST include a `.env.example` with placeholder values for all required variables. Add `!.env.example` to `.gitignore` if a broad `.env.*` pattern would otherwise exclude it
- `.env` files with real secrets are ALWAYS gitignored, with no exceptions

---

## Docker Networking

- Use the **default Docker bridge network** for simple deployments
- Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
- Do not define custom networks for single-service Docker Compose stacks

---

## Documentation Standards

Place documentation in the `/docs/` directory of the repository.

### HTML Documents

HTML documents must follow [docs/documentation_style_guide.html](documentation_style_guide.html).

- Use Bootstrap CDN with Bootswatch theme **Flatly**
- Include a dark mode toggle button in the navbar
- Use Bootstrap Icons for icons
- Use Bootstrap CSS for styles; avoid custom CSS
- Use **Mermaid** for diagrams