docs: add application conventions for health checks, logging, and endpoints

Establish standardized conventions across all Ouranos services:
- Kubernetes-style health endpoints (/live, /ready, /metrics)
- Logging level guidelines (health checks at DEBUG only)
- Protected vs unprotected endpoint definitions
- Prometheus metrics, browser telemetry, and Docker networking standards
- Update daedalus HAProxy health_path from /api/health to /ready/
2026-04-10 11:29:56 +00:00
parent 257e743d9a
commit bd31dfd8d5
3 changed files with 153 additions and 9 deletions


@@ -153,6 +153,85 @@ Z Instance: The running instance of this app on the same host, starting at 1. M
514ZZ is the syslog port. Docker containers send their syslog to an Alloy syslog collector port. ZZ is the application instance number: values must be unique on a given host and increment from 01.
---
## Application Conventions
Standards that all services deployed in Ouranos MUST follow. For full logging standards and anti-patterns, see [red_panda_standards.md](red_panda_standards.md).
### Health Check Endpoints
All services MUST expose Kubernetes-style health endpoints:
| Endpoint | Purpose | Auth |
|----------|---------|------|
| `GET /live` | **Liveness** — process is running and accepting connections | None |
| `GET /ready` | **Readiness** — process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
| `GET /metrics` | Prometheus metrics (see below) | IP-restricted |
- HAProxy checks `health_path` (typically `/ready/`) for backend health — return HTTP 200 when healthy
- Health endpoints MUST NOT require authentication (no JWT, no session)
- Third-party services use their native health paths (e.g., `/api/health`, `/api/healthz`, `/-/healthy`)
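
The split between liveness and readiness can be sketched as a small, framework-independent function. This is a minimal illustration, not a prescribed implementation: the `health_response` name, the `checks` mapping of dependency probes, the response bodies, and the choice of HTTP 503 for an unready service are all assumptions (the convention above only specifies HTTP 200 when healthy).

```python
from typing import Callable, Dict, Tuple

# Hypothetical dependency probes: each returns True when the dependency is healthy.
Check = Callable[[], bool]


def health_response(path: str, checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Return (HTTP status, JSON-serializable body) for a health path."""
    # Accept both `/ready` and `/ready/`, since the HAProxy health_path may carry a slash.
    path = path.rstrip("/") or "/"
    if path == "/live":
        # Liveness: the process can answer; no dependencies are inspected.
        return 200, {"status": "alive"}
    if path == "/ready":
        results = {}
        for name, check in checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                # A probe that raises counts as an unhealthy dependency.
                results[name] = False
        healthy = all(results.values())
        # Assumption: 503 signals "not ready"; the source only mandates 200 when healthy.
        status = 200 if healthy else 503
        return status, {"status": "ready" if healthy else "not ready", "checks": results}
    return 404, {"status": "unknown path"}
```

A web framework route would call `health_response(request_path, {"db": ping_db, "cache": ping_cache})`, where `ping_db` and `ping_cache` are placeholder probe functions, and serialize the returned body as JSON.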
### Health Checks in Docker Compose
Use `curl -f` for Docker Compose healthchecks. Install curl in images if needed.
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
```
### Logging Conventions
Log output flows through: **App → syslog (RFC3164) → Alloy → Loki → Grafana**
| Level | Usage |
|-------|-------|
| **ERROR** | Broken state requiring human action — always include `exc_info=True`, error type, and context |
| **WARNING** | Degraded but recovering — client disconnects, performance outliers, client-side exceptions, leaked markup |
| **INFO** | Lifecycle events — service start/stop, connections, requests completed, jobs finished |
| **DEBUG** | Diagnostic detail — SSE events, keepalive pings, health check 200 responses, negotiation steps |
**Health check responses MUST be logged at DEBUG only.** HAProxy and Prometheus probe endpoints every 15-30 seconds, which is 120 to 240 requests per hour from each prober for every service. Logging these at INFO floods syslog with identical `200 OK` lines, burying real events.
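
One way to apply the DEBUG-only rule is to pick the log level per request before writing the access-log line. The sketch below assumes a hypothetical `log_request` helper called by the service's access logging. Treating `/metrics` scrapes like health probes, and leaving non-200 probe responses at INFO, are assumptions rather than part of the convention.

```python
import logging

# Paths probed by HAProxy and Prometheus (including /metrics here is an assumption).
HEALTH_PATHS = {"/live", "/ready", "/metrics"}


def access_log_level(path: str, status: int) -> int:
    """Return DEBUG for successful health probes, INFO for every other request."""
    # Drop any query string and trailing slash before comparing.
    normalized = path.split("?", 1)[0].rstrip("/")
    if normalized in HEALTH_PATHS and status == 200:
        return logging.DEBUG
    return logging.INFO


def log_request(logger: logging.Logger, method: str, path: str, status: int) -> None:
    """Write one access-log line at the level chosen by access_log_level."""
    logger.log(access_log_level(path, status), "%s %s -> %d", method, path, status)
```

With the logger or its handlers set to INFO in production, the probe lines are dropped, and lowering the level to DEBUG makes them visible again.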
### Protected vs Unprotected Endpoints
| Protected (require valid JWT) | Unprotected |
|-------------------------------|-------------|
| All `/api/v1/*` routes (except `POST /api/v1/telemetry`) | `GET /live` |
| | `GET /ready` |
| | `GET /metrics` (IP-restricted to internal networks) |
| | `GET /api/auth/login-url` |
| | `POST /api/auth/token` |
| | `POST /api/v1/telemetry` (sendBeacon cannot set headers) |
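
The table can be expressed as an allowlist lookup placed in front of JWT validation. This is a sketch under assumptions: `requires_jwt` and `UNPROTECTED` are illustrative names, and paths that appear in neither column are treated as not requiring a JWT, which a real service may prefer to invert to deny-by-default.

```python
# (method, path) pairs taken from the "Unprotected" column of the table.
UNPROTECTED = {
    ("GET", "/live"),
    ("GET", "/ready"),
    ("GET", "/metrics"),
    ("GET", "/api/auth/login-url"),
    ("POST", "/api/auth/token"),
    ("POST", "/api/v1/telemetry"),
}


def requires_jwt(method: str, path: str) -> bool:
    """Return True when the request must carry a valid JWT."""
    # Ignore the query string and a trailing slash when matching.
    normalized = path.split("?", 1)[0].rstrip("/") or "/"
    if (method.upper(), normalized) in UNPROTECTED:
        return False
    # Everything else under /api/v1 is protected.
    return normalized == "/api/v1" or normalized.startswith("/api/v1/")
```

Matching on the method as well as the path keeps `GET /api/v1/telemetry` protected even though `POST /api/v1/telemetry` is open.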
### Prometheus Metrics
All services SHOULD expose `GET /metrics` in Prometheus exposition format, scraped by Prospero's Prometheus (default 15s interval).
- **IP-restricted** to internal networks only (`10.10.0.0/24`, `172.16.0.0/12`, `127.0.0.0/8`)
- Consider exposing: request counts/durations, error rates, active connections, queue depths, dependency health
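
The IP restriction can be checked with Python's standard `ipaddress` module. The sketch below uses the three networks listed above; the `metrics_allowed` name is hypothetical, and how the client address is obtained (socket peer versus a proxy-supplied header) is left to the service.

```python
from ipaddress import ip_address, ip_network

# Internal networks permitted to scrape /metrics, as listed in the convention.
INTERNAL_NETWORKS = [
    ip_network(cidr) for cidr in ("10.10.0.0/24", "172.16.0.0/12", "127.0.0.0/8")
]


def metrics_allowed(client_ip: str) -> bool:
    """Return True when client_ip belongs to one of the internal networks."""
    try:
        addr = ip_address(client_ip)
    except ValueError:
        # Unparseable addresses are rejected rather than trusted.
        return False
    return any(addr in net for net in INTERNAL_NETWORKS)
```

A `/metrics` handler would return HTTP 403 (an assumed status code, not one the convention names) when `metrics_allowed` is false.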
### Browser Telemetry
Frontend/browser code MUST send telemetry data and errors back to the application's telemetry API:
- `POST /api/v1/telemetry` — unprotected (browser `sendBeacon` cannot set Authorization headers)
- Capture and report: JavaScript exceptions, performance metrics, user-facing errors
- Client-side exceptions should log as **WARNING** on the server (they indicate a problem but not a server-side failure)
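
On the server side, the telemetry endpoint mainly needs to map each browser event to a log level. The following sketch assumes a hypothetical JSON payload with `type`, `message`, and `url` fields, since the convention does not define the payload shape. Logging non-error events at INFO is likewise an assumption.

```python
import logging

logger = logging.getLogger("telemetry")

# Hypothetical event types that represent client-side failures.
CLIENT_ERROR_TYPES = {"exception", "error"}


def record_client_event(payload: dict) -> int:
    """Log one browser telemetry event and return the level that was used."""
    kind = payload.get("type", "unknown")
    # Client-side exceptions indicate a problem, but not a server-side failure.
    level = logging.WARNING if kind in CLIENT_ERROR_TYPES else logging.INFO
    logger.log(
        level,
        "client %s: %s (url=%s)",
        kind,
        payload.get("message", ""),
        payload.get("url", ""),
    )
    return level
```

The handler for `POST /api/v1/telemetry` would parse the request body and pass the resulting dictionary to `record_client_event`.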
### Docker Networking
- Use the **default Docker bridge network** for simple deployments
- Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
- Do not create custom network definitions for single-service Docker Compose stacks
---
## External Access via HAProxy