From bd31dfd8d58d916751b129c228f46ff020da8a92 Mon Sep 17 00:00:00 2001
From: Robert Helewka <r@helu.ca>
Date: Fri, 10 Apr 2026 11:29:56 +0000
Subject: [PATCH] docs: add application conventions for health checks, logging,
 and endpoints

Establish standardized conventions across all Ouranos services:
- Kubernetes-style health endpoints (/live, /ready, /metrics)
- Logging level guidelines (health checks at DEBUG only)
- Protected vs unprotected endpoint definitions
- Prometheus metrics, browser telemetry, and Docker networking standards
- Update daedalus HAProxy health_path from /api/health to /ready/
---
 ansible/inventory/host_vars/titania.incus.yml |  2 +-
 docs/ouranos.md                               | 79 ++++++++++++++++++
 docs/red_panda_standards.md                   | 81 +++++++++++++++++--
 3 files changed, 153 insertions(+), 9 deletions(-)

diff --git a/ansible/inventory/host_vars/titania.incus.yml b/ansible/inventory/host_vars/titania.incus.yml
index cc71120..9b5ddae 100644
--- a/ansible/inventory/host_vars/titania.incus.yml
+++ b/ansible/inventory/host_vars/titania.incus.yml
@@ -118,7 +118,7 @@ haproxy_backends:
   - subdomain: "daedalus"
     backend_host: "puck.incus"
     backend_port: 20080
-    health_path: "/api/health"
+    health_path: "/ready/"
     timeout_server: 120s
 
   - subdomain: "lobechat"
diff --git a/docs/ouranos.md b/docs/ouranos.md
index 99b7cfb..349a5ef 100644
--- a/docs/ouranos.md
+++ b/docs/ouranos.md
@@ -153,6 +153,85 @@ Z Instance: The running instance of this app on the same host, starting at 1.  M
 
 514ZZ is the syslog port.  Docker containers send their syslog to an Alloy syslog collector port.  ZZ is the application instance, they just need to be different on the same host and increment from 01.
 
+---
+
+## Application Conventions
+
+Standards that all services deployed in Ouranos MUST follow. For full logging standards and anti-patterns, see [red_panda_standards.md](red_panda_standards.md).
+
+### Health Check Endpoints
+
+All services MUST expose Kubernetes-style health endpoints:
+
+| Endpoint | Purpose | Auth |
+|----------|---------|------|
+| `GET /live` | **Liveness** — process is running and accepting connections | None |
+| `GET /ready` | **Readiness** — process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
+| `GET /metrics` | Prometheus metrics (SHOULD rather than MUST; see Prometheus Metrics below) | IP-restricted |
+
+- HAProxy checks `health_path` (typically `/ready/`) for backend health — return HTTP 200 when healthy
+- Health endpoints MUST NOT require authentication (no JWT, no session)
+- Third-party services use their native health paths (e.g., `/api/health`, `/api/healthz`, `/-/healthy`)
+
+### Health Checks in Docker Compose
+
+Use `curl -f` for Docker Compose healthchecks. Install curl in images if needed.
+
+```yaml
+healthcheck:
+  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
+  interval: 30s
+  timeout: 10s
+  retries: 3
+  start_period: 40s
+```
+
+### Logging Conventions
+
+Log output flows through: **App → syslog (RFC 3164) → Alloy → Loki → Grafana**
+
+| Level | Usage |
+|-------|-------|
+| **ERROR** | Broken state requiring human action — always include `exc_info=True`, error type, and context |
+| **WARNING** | Degraded but recovering — client disconnects, performance outliers, client-side exceptions, leaked markup |
+| **INFO** | Lifecycle events — service start/stop, connections, requests completed, jobs finished |
+| **DEBUG** | Diagnostic detail — SSE events, keepalive pings, health check 200 responses, negotiation steps |
+
+**Health check responses MUST be logged at DEBUG only.** HAProxy and Prometheus probe endpoints every 15 to 30 seconds, so each prober generates 120 to 240 requests per service per hour. Logging these at INFO floods syslog with identical `200 OK` lines, burying real events.
+
+### Protected vs Unprotected Endpoints
+
+| Protected (require valid JWT) | Unprotected |
+|-------------------------------|-------------|
+| All `/api/v1/*` routes except `POST /api/v1/telemetry` | `GET /live` |
+| | `GET /ready` |
+| | `GET /metrics` (IP-restricted to internal networks) |
+| | `GET /api/auth/login-url` |
+| | `POST /api/auth/token` |
+| | `POST /api/v1/telemetry` (`sendBeacon` cannot set an `Authorization` header) |
+
+### Prometheus Metrics
+
+All services SHOULD expose `GET /metrics` in Prometheus exposition format, scraped by Prospero's Prometheus (default 15s interval).
+
+- **IP-restricted** to internal networks only (`10.10.0.0/24`, `172.16.0.0/12`, `127.0.0.0/8`)
+- Consider exposing: request counts/durations, error rates, active connections, queue depths, dependency health
+
+### Browser Telemetry
+
+Frontend/browser code MUST send telemetry data and errors back to the application's telemetry API:
+
+- `POST /api/v1/telemetry` — unprotected (browser `sendBeacon` cannot set Authorization headers)
+- Capture and report: JavaScript exceptions, performance metrics, user-facing errors
+- Client-side exceptions MUST be logged at **WARNING** on the server (they indicate a problem but not a server-side failure)
+
+### Docker Networking
+
+- Use the **default Docker bridge network** for simple deployments
+- Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
+- Do not create custom network definitions for single-service Docker Compose stacks
+
+---
 
 ## External Access via HAProxy
 
diff --git a/docs/red_panda_standards.md b/docs/red_panda_standards.md
index e6ede19..405b695 100644
--- a/docs/red_panda_standards.md
+++ b/docs/red_panda_standards.md
@@ -125,6 +125,79 @@ When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runne
 
 ---
 
+## Health Check Endpoints
+
+All services MUST expose Kubernetes-style health endpoints at these paths:
+
+| Endpoint | Purpose | Auth |
+|----------|---------|------|
+| `GET /live` | **Liveness** — process is running and accepting connections | None |
+| `GET /ready` | **Readiness** — process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
+| `GET /metrics` | Prometheus metrics (SHOULD rather than MUST; see Prometheus Metrics) | IP-restricted (no JWT) |
+
+- HAProxy uses `health_path: /ready/` for backend health checks — return HTTP 200 when ready
+- Health endpoints MUST NOT require authentication
+- Third-party services use their native paths (`/api/health`, `/api/healthz`, `/-/healthy`, etc.)
+
+### Docker Compose Healthchecks
+
+Use `curl -f` (install curl in images if needed). Do not use `wget --spider`.
+
+```yaml
+healthcheck:
+  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
+  interval: 30s
+  timeout: 10s
+  retries: 3
+  start_period: 40s
+```
+
+---
+
+## Endpoint Protection
+
+| Protected (require valid JWT) | Unprotected |
+|-------------------------------|-------------|
+| All `/api/v1/*` routes except `POST /api/v1/telemetry` | `GET /live` |
+| | `GET /ready` |
+| | `GET /metrics` (IP-restricted to internal networks) |
+| | `GET /api/auth/login-url` |
+| | `POST /api/auth/token` |
+| | `POST /api/v1/telemetry` (`sendBeacon` cannot set an `Authorization` header) |
+
+> **Why `/api/v1/telemetry` is unprotected**: The browser `sendBeacon` API cannot set `Authorization` headers. The telemetry endpoint must be open to receive client-side error reports and performance data, or browser errors will be silently lost.
+
+---
+
+## Prometheus Metrics
+
+All services SHOULD expose `GET /metrics` in Prometheus exposition format, scraped by Prospero's Prometheus at 15s intervals.
+
+- **IP-restricted** to internal networks: `10.10.0.0/24`, `172.16.0.0/12`, `127.0.0.0/8`
+- No JWT required: access is controlled by source IP instead, so the Prometheus scraper needs no application credentials
+- Useful metrics to expose: request totals and durations, error rates, active connections, queue depths, dependency health
+
+---
+
+## Browser Telemetry
+
+Frontend/browser code MUST report errors and performance data back to the server.
+
+- Send to `POST /api/v1/telemetry` — unprotected endpoint
+- Capture: JavaScript exceptions, promise rejections, resource load failures, performance metrics
+- The server MUST log client-side exceptions at **WARNING** level (they indicate user-facing problems but are not server failures)
+- Include enough context to reproduce: URL, user agent, error message, stack trace (if available)
+
+---
+
+## Docker Networking
+
+- Use the **default Docker bridge network** for simple deployments
+- Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
+- Do not define custom networks for single-service Docker Compose stacks
+
+---
+
 ## Documentation Standards
 
 Place documentation in the `/docs/` directory of the repository.
@@ -138,11 +211,3 @@ HTML documents must follow [docs/documentation_style_guide.html](documentation_s
 - Use Bootstrap Icons for icons
 - Use Bootstrap CSS for styles — avoid custom CSS
 - Use **Mermaid** for diagrams
-
-### Markdown Documents
-
-Only these status symbols are approved:
-- ✔ Success/Complete
-- ❌ Error/Failed
-- ⚠️ Warning/Caution
-- ℹ️ Information/Note
\ No newline at end of file