Red Panda Approval™ Standards
Quality and observability standards for the Ouranos Lab. All infrastructure code, application code, and LLM-generated code deployed into this environment must meet these standards.
Owner: Robert Helewka <r@helu.ca> Version: 1.00 Last reviewed: 2026-04-18
🐾 Red Panda Approval™
All implementations must meet the 5 Sacred Criteria:
- Fresh Environment Test — Clean runs on new systems without drift. No leftover state, no manual steps.
- Elegant Simplicity — Modular, reusable, no copy-paste sprawl. One playbook per concern.
- Observable & Auditable — Clear task names, proper logging, check mode compatible. You can see what happened.
- Idempotent Patterns — Run multiple times with consistent results. No side effects on re-runs.
- Actually Provisions & Configures — Resources work, dependencies resolve, services integrate. It does the thing.
Vault Security
All sensitive information is encrypted using Ansible Vault with AES256 encryption.
Encrypted secrets:
- Database passwords (PostgreSQL, Neo4j)
- API keys (OpenAI, Anthropic, Mistral, Groq)
- Application secrets (Grafana, SearXNG, Arke)
- Monitoring alerts (AlertManager email integration)
Security rules:
- AES256 encryption with `ansible-vault`
- Password file for automation; never pass `--vault-password-file` inline in scripts
- Vault variables use the `vault_` prefix; map to friendly names in `group_vars/all/vars.yml`
- No secrets in plain text files, ever
Log Level Standards
All services in the Ouranos Lab MUST follow these log level conventions. These rules apply to application code, infrastructure services, and any LLM-generated code deployed into this environment. Log output flows through Alloy → Loki → Grafana, so disciplined leveling is not cosmetic — it directly determines alert quality, dashboard usefulness, and on-call signal-to-noise ratio.
Level Definitions
| Level | When to Use | What MUST Be Included | Loki / Grafana Role |
|---|---|---|---|
| ERROR | Something is broken and requires human intervention. The service cannot fulfil the current request or operation. | Exception class, message, stack trace, and relevant context (request ID, user, resource identifier). Never a bare "something failed". | AlertManager rules fire on `level=~"error\|fatal\|critical"`. These trigger email notifications. |
| WARNING | Degraded but self-recovering: retries succeeding, fallback paths taken, thresholds approaching, deprecated features invoked. | What degraded, what recovery action was taken, current metric value vs. threshold. | Grafana dashboard panels. Rate-based alerting (e.g., >N warnings/min). |
| INFO | Significant lifecycle and business events: service start/stop, configuration loaded, deployment markers, user authentication, job completion, schema migrations. | The event and its outcome. This level tells the story of what the system did. | Default production visibility. The go-to level for post-incident timelines. |
| DEBUG | Diagnostic detail for active troubleshooting: request/response payloads, SQL queries, internal state, variable values. | Actionable context is mandatory. A DEBUG line with no detail is worse than no line at all. Include variable values, object states, or decision paths. | Never enabled in production by default. Used on-demand via per-service level override. |
Anti-Patterns
These are explicit violations of Ouranos logging standards:
| ❌ Anti-Pattern | Why It's Wrong | ✅ Correct Approach |
|---|---|---|
| Health/metrics checks logged at INFO (`GET /live → 200 OK`, `GET /metrics → 200 OK`) | Routine HAProxy/Prometheus probes flood syslog with thousands of identical lines per hour, burying real events. | Suppress successful probes to /live, /ready, /metrics, /health*, /ping from access logs entirely. Non-2xx responses MUST still log. |
| DEBUG with no context (`logger.debug("error occurred")`) | Provides zero diagnostic value. If DEBUG is noisy and useless, nobody will ever enable it. | `logger.debug("PaymentService.process failed: order_id=%s, provider=%s, response=%r", oid, provider, resp)` |
| ERROR without exception details (`logger.error("task failed")`) | Cannot be triaged without reproduction steps. Wastes on-call time. | `logger.error("Celery task invoice_gen failed: order_id=%s", oid, exc_info=True)` |
| Logging sensitive data at any level | Passwords, tokens, API keys, and PII in Loki are a security incident. | Mask or redact: api_key=sk-...a3f2, password=*****. |
| Inconsistent level casing | Breaks LogQL filters and Grafana label selectors. | Python / Django: UPPERCASE (INFO, WARNING, ERROR, DEBUG). Go / infrastructure (HAProxy, Alloy, Gitea): lowercase (info, warn, error, debug). |
| Logging expected conditions as ERROR | A user entering a wrong password is not an error — it is normal business logic. | Use WARNING or INFO for expected-but-notable conditions. Reserve ERROR for things that are actually broken. |
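Two of the rows above (ERROR with exception details, and masking of secrets) can be sketched together in Python. This is a minimal illustration, not an Ouranos library: `mask_secret`, `generate_invoice`, and the logger name `invoice` are hypothetical names, and the raised `ConnectionError` stands in for a real provider call.

```python
import logging

logger = logging.getLogger("invoice")  # hypothetical logger name


def mask_secret(value: str, prefix_len: int = 3, suffix_len: int = 4) -> str:
    """Return a redacted form such as 'sk-...a3f2'; short values are fully masked."""
    if len(value) <= prefix_len + suffix_len:
        return "*****"
    return f"{value[:prefix_len]}...{value[-suffix_len:]}"


def generate_invoice(order_id: str, api_key: str) -> None:
    """Hypothetical task body showing the ERROR pattern from the table."""
    try:
        # Stand-in for a real provider call that fails.
        raise ConnectionError("provider timed out")
    except ConnectionError:
        # exc_info=True attaches exception class, message, and stack trace.
        logger.error(
            "invoice_gen failed: order_id=%s, api_key=%s",
            order_id,
            mask_secret(api_key),
            exc_info=True,
        )
```

Passing values as logging arguments rather than pre-formatting the string keeps the message template stable, which tends to make LogQL pattern matching easier.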
Health Check & Monitoring Endpoint Rule
All services MUST suppress successful (2xx/3xx) access log entries for health and monitoring endpoints: `/live`, `/ready`, `/health`, `/healthz`, `/api/health`, `/metrics`, `/ping`. Health check success is the absence of errors, not the presence of 200s. If your syslog shows a successful probe of one of these endpoints, your access log filtering is wrong.

Failing (4xx/5xx) responses to these paths MUST still be logged: a failing `/ready` is a real signal.
Implementation guidance:
- Django / Gunicorn: Filter health paths in the access log handler or use middleware that skips logging for probe user-agents.
- FastAPI / Uvicorn: Add a `logging.Filter` on the `uvicorn.access` logger that matches health paths in the access log message. Uvicorn's access log format includes the full request line in quotes (e.g., `"GET /live HTTP/1.1"`), so filter regexes must account for that. See also the structured logging notes below.
- nginx containers: nginx does not log through Python loggers, so app-level filters do not apply. Suppress probe access lines at the nginx config level using `map` on `$request_uri` and `$status`:

      # 1 when the request targets a health/monitoring path
      map $request_uri $is_probe {
          ~^/(live|ready|metrics|health|healthz|ping)(/|$|\?) 1;
          default 0;
      }
      # Suppress only successful (2xx/3xx) probe responses; failing probes stay in the access log
      map "$is_probe:$status" $loggable {
          ~^1:[23] 0;
          default 1;
      }
      server {
          access_log /var/log/nginx/access.log combined if=$loggable;
      }

  Applies to every nginx-based container (static frontends, reverse proxies, sidecars).
- Docker services: Configure the application's internal logging to exclude health routes, because the syslog driver forwards everything it receives.
- HAProxy: HAProxy's own health check logs (`option httpchk`) should remain at the HAProxy level for connection debugging, but backend application responses to those probes must not surface at INFO.
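The FastAPI / Uvicorn guidance above can be sketched as a `logging.Filter`. This is a minimal sketch that assumes the access message contains the quoted request line followed by the status code, as described above; `ProbeAccessFilter` and `install_probe_filter` are hypothetical names, and the path list is copied from the rule in this section.

```python
import logging
import re

# Matches the quoted request line for a probe path (with or without a trailing
# slash or query string) and captures the status code that follows it.
_PROBE_RE = re.compile(
    r'"(?:GET|HEAD) /(?:live|ready|health|healthz|api/health|metrics|ping)'
    r'/?(?:\?\S*)? HTTP/[\d.]+" (\d{3})'
)


class ProbeAccessFilter(logging.Filter):
    """Drop successful (2xx/3xx) probe access lines; keep everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        match = _PROBE_RE.search(record.getMessage())
        if match is None:
            return True  # not a probe request
        status = match.group(1)
        return not status.startswith(("2", "3"))  # failing probes still log


def install_probe_filter() -> None:
    """Attach the filter to uvicorn's access logger."""
    logging.getLogger("uvicorn.access").addFilter(ProbeAccessFilter())
```

Because the regex requires a slash, query string, or space right after the path, a route such as `/livestock` is not treated as a probe.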
Background Worker & Queue Monitoring
The most dangerous failure is the one that produces no logs.
When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runner, cron job) fails to start or crashes on startup, it generates no ongoing log output. Error-rate dashboards stay green because there is no process running to produce errors. Meanwhile, queues grow unbounded and work silently stops being processed.
Required practices:
- Heartbeat logging: Every long-running background worker MUST emit a periodic INFO-level heartbeat (e.g., `"worker alive, processed N jobs in last 5m, queue depth: M"`). Cadence: every 60 seconds. The staleness alert fires after 10 minutes of silence (= 10 consecutive missed heartbeats), which gives enough margin to absorb transient Loki ingestion lag without flapping. The absence of this heartbeat is the alertable condition.
- Startup and shutdown at INFO: Worker start, ready, graceful shutdown, and crash-exit are significant lifecycle events. These MUST log at INFO.
- Queue depth as a metric: RabbitMQ queue depths and any application-level task queues MUST be exposed as Prometheus metrics. A growing queue with zero consumer activity is an ERROR-level alert, not a warning.
- Grafana "last seen" alerts: For every background worker, configure a Grafana alert using `absent_over_time()` or equivalent staleness detection: "Worker X has not logged a heartbeat in >10 minutes" → ERROR severity → email notification via AlertManager.
- Crash-on-start is ERROR: If a worker exits within seconds of starting (missing config, failed DB connection, import error), the exit MUST be captured at ERROR level by the service manager (`systemd OnFailure=`, Docker restart policy logs). Do not rely on the crashing application to log its own death; it may never get the chance.
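The heartbeat practice above could look roughly like this in a Python worker. It is a sketch under assumptions: the `Heartbeat` class, the logger name `worker.heartbeat`, and the `queue_depth` callable are hypothetical, and a real worker would call `record_job()` from its own processing loop.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("worker.heartbeat")  # hypothetical logger name


class Heartbeat:
    """Emit an INFO heartbeat every interval_s seconds from a daemon thread."""

    def __init__(self, queue_depth: Callable[[], int], interval_s: float = 60.0) -> None:
        self._queue_depth = queue_depth
        self._interval_s = interval_s
        self._processed = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="heartbeat", daemon=True)

    def record_job(self) -> None:
        """Call once per completed job."""
        with self._lock:
            self._processed += 1

    def emit(self) -> None:
        """Log one heartbeat and reset the per-interval job counter."""
        with self._lock:
            processed, self._processed = self._processed, 0
        logger.info(
            "worker alive, processed %d jobs in last %ds, queue depth: %d",
            processed, int(self._interval_s), self._queue_depth(),
        )

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self.emit()

    def start(self) -> None:
        logger.info("worker started")
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
        logger.info("worker stopped")
```

The heartbeat logger has to be enabled at INFO even when the application default is WARNING, otherwise the staleness alert would fire on a healthy worker.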
Production Defaults
| Service Category | Default Level | Rationale |
|---|---|---|
| Django apps (Angelia, Athena, Kairos, Icarlos, Spelunker, Peitho, MCP Switchboard) | WARNING | Business logic: only degraded or broken conditions surface. Lifecycle events (start/stop/deploy) still log at INFO via Gunicorn and systemd. |
| FastAPI apps (Periplus) | WARNING | Same rationale as Django. Uvicorn lifecycle events (start/stop) are pinned to INFO via the uvicorn.error logger regardless of app log level. |
| Gunicorn / Uvicorn / nginx access logs | Suppress successful probes to /live, /ready, /metrics, /health*, /ping | Routine request logging deferred to HAProxy access logs in Loki. |
| Infrastructure agents (Alloy, Prometheus, Node Exporter) | warn | Stable; do not change without cause. |
| HAProxy (Titania) | warning | Connection-level logging handled by HAProxy's own log format → Alloy → Loki. |
| Databases (PostgreSQL, Neo4j) | warning | Query-level logging only enabled for active troubleshooting. |
| Docker services (Gitea, LobeChat, Nextcloud, AnythingLLM, SearXNG) | warn / warning | Per-service default. Tune individually if needed. |
| LLM Proxy (Arke) | info | Token usage tracking and provider routing decisions justify INFO. Review periodically for noise. |
| Observability stack (Grafana, Loki, AlertManager) | warn | Should be quiet unless something is wrong with observability itself. |
Structured Logging — FastAPI / Uvicorn
FastAPI apps using uvicorn require special handling to achieve JSON-structured log output for the Alloy → Loki pipeline. Uvicorn manages its own loggers aggressively, and naive approaches will fail silently.
Required practices:
- Override uvicorn's handlers, don't just add to root: Uvicorn's `config.load()` creates its own `StreamHandler` instances on `uvicorn`, `uvicorn.error`, and `uvicorn.access`. You must remove these handlers and set `propagate = True` so log records flow to the root logger where your JSON formatter lives.
- Re-apply logging config in the lifespan: Configuring logging at module import time is not sufficient. Uvicorn's `config.load()` runs after your module is imported but before the ASGI lifespan starts. Call your logging configuration function again inside the FastAPI `lifespan` context manager to recapture control.
- Remap uvicorn logger names: Uvicorn uses `uvicorn.error` for all lifecycle messages (startup, shutdown, errors) despite the misleading name. Remap it to `uvicorn` in your JSON formatter's output for clarity in Loki queries.
- Use `pydantic-settings` with `extra = "ignore"`: When loading config from `.env` files that contain variables for other services (e.g., oauth2-proxy), pydantic-settings will reject unknown fields by default. Always set `extra = "ignore"` in the model config.
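The first three practices above can be sketched with the standard library alone. This is an illustrative sketch, not the lab's actual module: `JsonFormatter`, `configure_logging`, and the JSON field names (`level`, `logger`, `message`, `exception`) are assumptions.

```python
import json
import logging
import sys
from typing import Optional, TextIO

UVICORN_LOGGERS = ("uvicorn", "uvicorn.error", "uvicorn.access")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        # uvicorn.error carries lifecycle messages; report it as plain "uvicorn".
        name = "uvicorn" if record.name == "uvicorn.error" else record.name
        payload = {
            "level": record.levelname,  # UPPERCASE, per the Python casing rule
            "logger": name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.WARNING, stream: Optional[TextIO] = None) -> None:
    """Install the JSON handler on root and strip uvicorn's own handlers."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
    for name in UVICORN_LOGGERS:
        uv_logger = logging.getLogger(name)
        uv_logger.handlers.clear()   # remove uvicorn's StreamHandlers
        uv_logger.propagate = True   # let records reach the root JSON handler
    # Keep lifecycle events visible even when the app default is WARNING.
    logging.getLogger("uvicorn.error").setLevel(logging.INFO)
```

The function is safe to call more than once, so it can run at import time and again at the start of the FastAPI `lifespan` context manager, as the second practice requires.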
Loki & Grafana Alignment
Label normalization: Alloy pipelines (syslog listeners and journal relabeling) MUST extract and forward a level label on every log line. Without a level label, the log entry is invisible to level-based dashboard filters and alert rules.
LogQL conventions for dashboards:
# Production error monitoring (default dashboard view)
{job="syslog", hostname="puck"} | json | level=~"error|fatal|critical"
# Warning-and-above for a specific service
{service_name="haproxy"} | logfmt | level=~"warn|error|fatal"
# Debug-level troubleshooting (temporary, never permanent dashboards)
{container="angelia"} | json | level="debug"
Alerting rules — Grafana alert rules MUST key off the normalized level label:
- `level=~"error|fatal|critical"` → Immediate email notification via AlertManager
- `absent_over_time({service_name="celery_worker"}[10m])` → Worker heartbeat staleness → ERROR severity
- Rate-based: `rate({service_name="arke"} | json | level="error" [5m]) > 0.1` → Sustained error rate
Retention alignment: Loki retention policies MUST preserve higher-severity logs at least as long as lower-severity ones. Target retention:
| Level | Retention | Rationale |
|---|---|---|
| DEBUG | 7 days | Troubleshooting context only — stale debug data is noise. |
| INFO | 30 days | Post-incident timelines and lifecycle review. |
| WARNING | 90 days | Degradation trend analysis across release cycles. |
| ERROR / FATAL / CRITICAL | 90 days | Incident review, root-cause investigation, compliance. |
DEBUG-level logs generated during troubleshooting sessions should be explicitly cleaned up if they would blow past the 7-day budget.
Health Check Endpoints
All services MUST expose Kubernetes-style health endpoints at these paths:
| Endpoint | Purpose | Auth |
|---|---|---|
| `GET /live` | Liveness: process is running and accepting connections | None |
| `GET /ready` | Readiness: process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
| `GET /metrics` | Prometheus metrics | IP-restricted (no JWT) |
- HAProxy uses `health_path: /ready/` (trailing slash) for backend health checks; return HTTP 200 when ready
- Health endpoints MUST NOT require authentication
- Third-party services use their native paths (`/api/health`, `/api/healthz`, `/-/healthy`, etc.)
Trailing slash: The standard path is /ready/ with a trailing slash. Django's APPEND_SLASH handling, FastAPI route declarations, and nginx location blocks all differ in how they treat the slash. Services that cannot comply (framework redirects, third-party apps) MUST be recorded in the Exceptions section below. Access-log suppression filters MUST match both /ready and /ready/ forms.
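The readiness contract in the endpoint table (process up AND all dependencies healthy) can be sketched as a framework-neutral helper that a `/ready/` view would call. The function name `readiness`, the response body shape, and the use of HTTP 503 for "not ready" are assumptions; this document only specifies HTTP 200 when ready.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency check and return (http_status, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as an unhealthy dependency.
            results[name] = False
    ready = all(results.values())
    status = 200 if ready else 503  # 503 is an assumption, see lead-in
    return status, {"status": "ready" if ready else "not_ready", "checks": results}
```

A liveness handler, by contrast, would return 200 unconditionally, since `/live` only asserts that the process is running and accepting connections.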
Docker Compose Healthchecks
Use curl -f (install curl in images if needed). Do not use wget --spider.
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
Endpoint Protection
| Protected (require valid JWT) | Unprotected |
|---|---|
| All `/api/v1/*` routes | `GET /live` |
| | `GET /ready` |
| | `GET /metrics` (IP-restricted to internal networks) |
| | `GET /api/auth/login-url` |
| | `POST /api/auth/token` |
| | `POST /api/v1/telemetry` (sendBeacon cannot set headers) |
Why `/api/v1/telemetry` is unprotected: The browser `sendBeacon` API cannot set `Authorization` headers. The telemetry endpoint must be open to receive client-side error reports and performance data, or browser errors will be silently lost.
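The protection table above can be expressed as a small decision function, for example inside authentication middleware. This is a sketch only: `requires_jwt` and `UNPROTECTED` are hypothetical names, and routes outside `/api/v1/*` that the table does not list are treated as not requiring a JWT, which is an assumption.

```python
UNPROTECTED = frozenset({
    ("GET", "/live"),
    ("GET", "/ready"),
    ("GET", "/metrics"),            # still IP-restricted elsewhere
    ("GET", "/api/auth/login-url"),
    ("POST", "/api/auth/token"),
    ("POST", "/api/v1/telemetry"),  # sendBeacon cannot set headers
})


def requires_jwt(method: str, path: str) -> bool:
    """Return True when the request must carry a valid JWT."""
    normalized = path.rstrip("/") or "/"  # treat /ready and /ready/ alike
    if (method.upper(), normalized) in UNPROTECTED:
        return False
    return normalized == "/api/v1" or normalized.startswith("/api/v1/")
```

Only `POST` is exempt for the telemetry path, so a `GET /api/v1/telemetry` still falls under the `/api/v1/*` rule.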
Prometheus Metrics
All services MUST expose GET /metrics in Prometheus exposition format, scraped by Prospero's Prometheus at 15s intervals.
- IP-restricted to internal networks: `10.10.0.0/24`, `172.16.0.0/12`, `127.0.0.0/8`
- No JWT required; HAProxy and Prometheus scrapers cannot authenticate
- Useful metrics to expose: request totals and durations, error rates, active connections, queue depths, dependency health
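The IP restriction listed above can be checked with Python's standard `ipaddress` module. This is a minimal sketch: `is_internal` is a hypothetical helper, and how the client address is obtained (for example behind HAProxy) is outside its scope.

```python
import ipaddress

# Networks copied from the rule above.
INTERNAL_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("10.10.0.0/24", "172.16.0.0/12", "127.0.0.0/8")
)


def is_internal(client_ip: str) -> bool:
    """Return True when client_ip falls inside an allowed internal network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable addresses are rejected
    return any(addr in network for network in INTERNAL_NETWORKS)
```

A `/metrics` view would typically return 403 when `is_internal` is false; that status code is a suggestion, not something this document mandates.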
Browser Telemetry
Frontend/browser code MUST report errors and performance data back to the server.
- Send to `POST /api/v1/telemetry` (unprotected endpoint)
- Capture: JavaScript exceptions, promise rejections, resource load failures, performance metrics
- The server MUST log client-side exceptions at WARNING level (they indicate user-facing problems but are not server failures)
- Include enough context to reproduce: URL, user agent, error message, stack trace (if available)
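On the server side, the rules above amount to logging each client report at WARNING with its context. The sketch below assumes a JSON payload with `url`, `user_agent`, `message`, and `stack` keys; those key names, the logger name `telemetry`, and the length cap are all assumptions rather than a defined schema.

```python
import logging
from typing import Mapping

logger = logging.getLogger("telemetry")  # hypothetical logger name

# Assumption: cap each field because the endpoint accepts unauthenticated input.
MAX_FIELD_LEN = 2000


def log_client_error(payload: Mapping[str, object]) -> None:
    """Log a browser-reported exception at WARNING with reproduction context."""

    def field(name: str) -> str:
        return str(payload.get(name, "-"))[:MAX_FIELD_LEN]

    logger.warning(
        "client-side exception: url=%s, user_agent=%s, message=%s, stack=%s",
        field("url"), field("user_agent"), field("message"), field("stack"),
    )
```

Missing fields are logged as `-` instead of raising, so a partial report from an old browser still produces a usable log line.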
Environment Variable Naming
All environment variables for an application MUST use a consistent prefix matching the service name (e.g., PERIPLUS_, ARKE_, ANGELIA_). This applies to every variable in the .env file, including those consumed by sidecar services like oauth2-proxy.
Rules:
- All vars in `.env` use the `SERVICENAME_` prefix, no exceptions
- `compose.yaml` maps prefixed vars to the sidecar's expected names (e.g., `OAUTH2_PROXY_CLIENT_ID: ${PERIPLUS_CASDOOR_CLIENT_ID}`)
- The application's Settings model SHOULD declare all prefixed vars, even those only consumed by sidecars, so the full configuration is documented in one place
- Every repo MUST include a `.env.example` with placeholder values for all required variables. Add `!.env.example` to `.gitignore` if a broad `.env.*` pattern would otherwise exclude it
- `.env` files with real secrets are ALWAYS gitignored, no exceptions
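The prefix rule above lends itself to an automated check, for example in CI against `.env.example`. The helper below is a hypothetical sketch that handles only simple `NAME=value` lines, comments, and an optional `export` keyword; it is not a full dotenv parser.

```python
from typing import List


def unprefixed_vars(env_text: str, prefix: str) -> List[str]:
    """Return variable names in env_text that do not start with prefix."""
    offenders = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without an assignment
        name = line.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        if not name.startswith(prefix):
            offenders.append(name)
    return offenders
```

For a Periplus repo, `unprefixed_vars(text, "PERIPLUS_")` would flag a raw `OAUTH2_PROXY_CLIENT_ID` entry, which should instead be a prefixed variable mapped in `compose.yaml`.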
Docker Networking
- Use the default Docker bridge network for simple deployments
- Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
- Do not define custom networks for single-service Docker Compose stacks
Documentation Standards
Place documentation in the /docs/ directory of the repository.
HTML Documents
HTML documents must follow docs/documentation_style_guide.html.
- Include a dark mode that follows the system automatically and include a toggle button in the navbar
- Avoid custom CSS
- Use Mermaid for diagrams
Exceptions
Third-party services and vendor containers frequently cannot comply with every standard in this document (health endpoint paths, access-log filtering, log level semantics, env var prefixes). Rather than force non-compliance into a binary pass/fail, record deviations here so the gap is visible and intentional.
Rules for exceptions:
- Every exception MUST name the service, the standard being waived, and the reason (vendor constraint, upstream bug, deliberate trade-off).
- Exceptions MUST be reviewed on the doc's `Last reviewed` date. If the underlying reason has gone away (vendor fixed it, we forked, we replaced the service), remove the exception.
- A missing exception for a known-non-compliant service is itself a Red Panda violation; the point is transparency.
| Service | Standard waived | Reason | Reviewed |
|---|---|---|---|
| (example) Gitea | `/live`, `/ready` paths (uses `/api/healthz`) | Upstream does not expose K8s-style endpoints | 2026-04-18 |
| (example) Nextcloud | Env var prefix `NEXTCLOUD_` (uses vendor-defined `NC_*` and unprefixed vars) | Vendor container ignores renamed vars | 2026-04-18 |
| (add real exceptions as they are discovered) |
Health path trailing-slash exceptions — services that serve /ready without the trailing slash (framework default, cannot be reconfigured without breaking routing):
| Service | Actual path | Reason |
|---|---|---|
| (add as discovered) |