Robert Helewka 0cea5ece3a feat: add /healthz and /metrics endpoints, replace print with logging

- Add /healthz endpoint returning LLM provider validation status
- Add /metrics endpoint serving Prometheus metrics via prometheus_client
- Replace all print() calls in health.py with proper logging module
- Remove _PREFIX variable in favor of structured logger context

2026-04-10 11:22:26 +00:00

Red Panda Approval™ Standards

Quality and observability standards for the Ouranos Lab. All infrastructure code, application code, and LLM-generated code deployed into this environment must meet these standards.

🐾 Red Panda Approval™

All implementations must meet the 5 Sacred Criteria:

Fresh Environment Test — Clean runs on new systems without drift. No leftover state, no manual steps.
Elegant Simplicity — Modular, reusable, no copy-paste sprawl. One playbook per concern.
Observable & Auditable — Clear task names, proper logging, check mode compatible. You can see what happened.
Idempotent Patterns — Run multiple times with consistent results. No side effects on re-runs.
Actually Provisions & Configures — Resources work, dependencies resolve, services integrate. It does the thing.

Vault Security

All sensitive information is encrypted using Ansible Vault with AES256 encryption.

Encrypted secrets:

Database passwords (PostgreSQL, Neo4j)
API keys (OpenAI, Anthropic, Mistral, Groq)
Application secrets (Grafana, SearXNG, Arke)
Monitoring alerts (Pushover integration)

Security rules:

AES256 encryption with ansible-vault
Use a password file for automation; never pass --vault-password-file inline in scripts
Vault variables use the vault_ prefix; map to friendly names in group_vars/all/vars.yml
No secrets in plain text files, ever

Log Level Standards

All services in the Ouranos Lab MUST follow these log level conventions. These rules apply to application code, infrastructure services, and any LLM-generated code deployed into this environment. Log output flows through Alloy → Loki → Grafana, so disciplined leveling is not cosmetic — it directly determines alert quality, dashboard usefulness, and on-call signal-to-noise ratio.

Level Definitions

| Level | When to Use | What MUST Be Included | Loki / Grafana Role |
| --- | --- | --- | --- |
| ERROR | Something is broken and requires human intervention. The service cannot fulfil the current request or operation. | Exception class, message, stack trace, and relevant context (request ID, user, resource identifier). Never a bare `"something failed"`. | AlertManager rules fire on `level=~"error\|fatal\|critical"`. These trigger Pushover notifications. |
| WARNING | Degraded but self-recovering: retries succeeding, fallback paths taken, thresholds approaching, deprecated features invoked. | What degraded, what recovery action was taken, current metric value vs. threshold. | Grafana dashboard panels. Rate-based alerting (e.g., >N warnings/min). |
| INFO | Significant lifecycle and business events: service start/stop, configuration loaded, deployment markers, user authentication, job completion, schema migrations. | The event and its outcome. This level tells the story of what the system did. | Default production visibility. The go-to level for post-incident timelines. |
| DEBUG | Diagnostic detail for active troubleshooting: request/response payloads, SQL queries, internal state, variable values. | Actionable context is mandatory. A DEBUG line with no detail is worse than no line at all. Include variable values, object states, or decision paths. | Never enabled in production by default. Used on-demand via per-service level override. |

Anti-Patterns

These are explicit violations of Ouranos logging standards:

| ❌ Anti-Pattern | Why It's Wrong | ✅ Correct Approach |
| --- | --- | --- |
| Health checks logged at INFO (`GET /health → 200 OK`) | Routine HAProxy/Prometheus probes flood syslog with thousands of identical lines per hour, burying real events. | Suppress health endpoints from access logs entirely, or demote to DEBUG. |
| DEBUG with no context (`logger.debug("error occurred")`) | Provides zero diagnostic value. If DEBUG is noisy and useless, nobody will ever enable it. | `logger.debug("PaymentService.process failed: order_id=%s, provider=%s, response=%r", oid, provider, resp)` |
| ERROR without exception details (`logger.error("task failed")`) | Cannot be triaged without reproduction steps. Wastes on-call time. | `logger.error("Celery task invoice_gen failed: order_id=%s", oid, exc_info=True)` |
| Logging sensitive data at any level | Passwords, tokens, API keys, and PII in Loki are a security incident. | Mask or redact: `api_key=sk-...a3f2`, `password=*****`. |
| Inconsistent level casing | Breaks LogQL filters and Grafana label selectors. | Python / Django: UPPERCASE (`INFO`, `WARNING`, `ERROR`, `DEBUG`). Go / infrastructure (HAProxy, Alloy, Gitea): lowercase (`info`, `warn`, `error`, `debug`). |
| Logging expected conditions as ERROR | A user entering a wrong password is not an error; it is normal business logic. | Use WARNING or INFO for expected-but-notable conditions. Reserve ERROR for things that are actually broken. |
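
The corrected approaches for ERROR details and secret masking can be combined in one short Python sketch. The names `mask_secret`, `generate_invoice`, and the logger name are hypothetical and exist only for illustration; the `logging` calls are standard library.

```python
import logging

logger = logging.getLogger("ouranos.example")  # hypothetical logger name


def mask_secret(value: str, prefix: int = 3, suffix: int = 4) -> str:
    """Redact a secret, keeping only a short prefix and suffix for correlation."""
    if len(value) <= prefix + suffix:
        return "*****"  # too short to reveal any part safely
    return f"{value[:prefix]}...{value[-suffix:]}"


def generate_invoice(order_id: str, api_key: str) -> bool:
    """Hypothetical task body showing an ERROR line with full exception details."""
    try:
        raise ConnectionError("provider timed out")  # stand-in for a real failure
    except ConnectionError:
        # exc_info=True attaches the exception class, message, and stack trace.
        logger.error(
            "invoice_gen failed: order_id=%s, api_key=%s",
            order_id,
            mask_secret(api_key),
            exc_info=True,
        )
        return False
```

Masking at the call site keeps the raw key out of the log record entirely, so no downstream handler or Loki pipeline ever sees it.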

Health Check Rule

All services exposed through HAProxy MUST suppress or demote health check endpoints (/live, /ready, /health, /healthz, /api/health, /metrics, /ping) to DEBUG or below. Health check success is the absence of errors, not the presence of 200s. If your syslog shows a successful health probe, your log level is wrong.

Implementation guidance:

Django / Gunicorn: Filter health paths in the access log handler or use middleware that skips logging for probe user-agents.
Docker services: Configure the application's internal logging to exclude health routes — the syslog driver forwards everything it receives.
HAProxy: HAProxy's own health check logs (option httpchk) should remain at the HAProxy level for connection debugging, but backend application responses to those probes must not surface at INFO.
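
For the Django / Gunicorn case, one possible shape is a standard `logging.Filter` that drops access-log records for probe requests. This is a minimal sketch, not the lab's actual implementation: the class name, the regular expression, and the assumption that the request line (e.g. `"GET /healthz HTTP/1.1"`) appears in the formatted message are all illustrative.

```python
import logging
import re

# Probe paths named in this document (Health Check Rule and Health Check Endpoints).
HEALTH_PATHS = ("/live", "/ready", "/health", "/healthz", "/api/health", "/metrics", "/ping")

# Matches e.g. 'GET /healthz HTTP/1.1' or 'GET /ready/?probe=1 HTTP/1.1',
# but not 'GET /healthcheck-report HTTP/1.1'.
_PROBE_REQUEST = re.compile(
    r"\b(?:GET|HEAD) (?:" + "|".join(re.escape(p) for p in HEALTH_PATHS) + r")/?(?:\?\S*)? "
)


class HealthCheckFilter(logging.Filter):
    """Drop access-log records whose request line targets a health path."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False tells the logging framework to discard the record.
        return _PROBE_REQUEST.search(record.getMessage()) is None
```

Assuming the access log is emitted through Gunicorn's `gunicorn.access` logger, the filter could be attached with `logging.getLogger("gunicorn.access").addFilter(HealthCheckFilter())`. Suppressing rather than demoting is the simpler of the two options the rule allows.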

Background Worker & Queue Monitoring

The most dangerous failure is the one that produces no logs.

When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runner, cron job) fails to start or crashes on startup, it generates no ongoing log output. Error-rate dashboards stay green because there is no process running to produce errors. Meanwhile, queues grow unbounded and work silently stops being processed.

Required practices:

Heartbeat logging — Every long-running background worker MUST emit a periodic INFO-level heartbeat (e.g., "worker alive, processed N jobs in last 5m, queue depth: M"). The absence of this heartbeat is the alertable condition.
Startup and shutdown at INFO — Worker start, ready, graceful shutdown, and crash-exit are significant lifecycle events. These MUST log at INFO.
Queue depth as a metric — RabbitMQ queue depths and any application-level task queues MUST be exposed as Prometheus metrics. A growing queue with zero consumer activity is an ERROR-level alert, not a warning.
Grafana "last seen" alerts — For every background worker, configure a Grafana alert using absent_over_time() or equivalent staleness detection: "Worker X has not logged a heartbeat in >10 minutes" → ERROR severity → Pushover notification.
Crash-on-start is ERROR — If a worker exits within seconds of starting (missing config, failed DB connection, import error), the exit MUST be captured at ERROR level by the service manager (systemd OnFailure=, Docker restart policy logs). Do not rely on the crashing application to log its own death — it may never get the chance.
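
The heartbeat practice could be sketched in Python as follows. The class name, logger name, field names, and the 300-second default are assumptions chosen to mirror the example message above; the worker loop is expected to call `job_done()` after each job and `maybe_emit()` on every iteration.

```python
import logging
import time

logger = logging.getLogger("worker.heartbeat")  # hypothetical logger name


class Heartbeat:
    """Count processed jobs and emit a periodic INFO heartbeat line."""

    def __init__(self, worker: str, interval_s: float = 300.0, clock=time.monotonic):
        self.worker = worker
        self.interval_s = interval_s
        self._clock = clock  # injectable so the interval logic can be tested
        self._processed = 0
        self._last_emit = clock()

    def job_done(self) -> None:
        self._processed += 1

    def maybe_emit(self, queue_depth: int) -> bool:
        """Log a heartbeat if the interval has elapsed; return True when one was logged."""
        now = self._clock()
        if now - self._last_emit < self.interval_s:
            return False
        logger.info(
            "worker alive: worker=%s processed=%d window_s=%d queue_depth=%d",
            self.worker, self._processed, int(now - self._last_emit), queue_depth,
        )
        self._processed = 0  # the count is per window, not cumulative
        self._last_emit = now
        return True
```

Because the heartbeat is logged at INFO, a service running at a WARNING default would need a per-logger override for it to reach Loki; otherwise the staleness alert would fire on a healthy worker.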

Production Defaults

| Service Category | Default Level | Rationale |
| --- | --- | --- |
| Django apps (Angelia, Athena, Kairos, Icarlos, Spelunker, Peitho, MCP Switchboard) | `WARNING` | Business logic: only degraded or broken conditions surface. Lifecycle events (start/stop/deploy) still log at INFO via Gunicorn and systemd. |
| Gunicorn access logs | Suppress 2xx/3xx health probes | Routine request logging deferred to HAProxy access logs in Loki. |
| Infrastructure agents (Alloy, Prometheus, Node Exporter) | `warn` | Stable; do not change without cause. |
| HAProxy (Titania) | `warning` | Connection-level logging handled by HAProxy's own log format → Alloy → Loki. |
| Databases (PostgreSQL, Neo4j) | `warning` | Query-level logging only enabled for active troubleshooting. |
| Docker services (Gitea, LobeChat, Nextcloud, AnythingLLM, SearXNG) | `warn` / `warning` | Per-service default. Tune individually if needed. |
| LLM Proxy (Arke) | `info` | Token usage tracking and provider routing decisions justify INFO. Review periodically for noise. |
| Observability stack (Grafana, Loki, AlertManager) | `warn` | Should be quiet unless something is wrong with observability itself. |

Loki & Grafana Alignment

Label normalization: Alloy pipelines (syslog listeners and journal relabeling) MUST extract and forward a level label on every log line. Without a level label, the log entry is invisible to level-based dashboard filters and alert rules.
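
On the application side, one way to make level extraction trivial is to emit each log line as JSON with an explicit `level` field. The formatter below is a sketch under that assumption; the class name and field names are hypothetical, and mapping Python's UPPERCASE level names to the lowercase values used in the LogQL filters is assumed to happen in the Alloy pipeline rather than here.

```python
import json
import logging


class JsonLevelFormatter(logging.Formatter):
    """Render each record as one JSON object that always carries a `level` field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,  # UPPERCASE, per the Python casing rule
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the stack trace inside the same JSON object as the message.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```

A handler would pick this up with `handler.setFormatter(JsonLevelFormatter())`.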

LogQL conventions for dashboards:

# Production error monitoring (default dashboard view)
{job="syslog", hostname="puck"} | json | level=~"error|fatal|critical"

# Warning-and-above for a specific service
{service_name="haproxy"} | logfmt | level=~"warn|error|fatal"

# Debug-level troubleshooting (temporary, never permanent dashboards)
{container="angelia"} | json | level="debug"

Alerting rules — Grafana alert rules MUST key off the normalized level label:

level=~"error|fatal|critical" → Immediate Pushover notification via AlertManager
absent_over_time({service_name="celery_worker"}[10m]) → Worker heartbeat staleness → ERROR severity
Rate-based: rate({service_name="arke"} | json | level="error" [5m]) > 0.1 → Sustained error rate (more than 0.1 error lines per second, i.e. over 30 lines across the 5-minute window)

Retention alignment: Loki retention policies should preserve ERROR and WARNING logs longer than DEBUG. DEBUG-level logs generated during troubleshooting sessions should have a short TTL or be explicitly cleaned up.

Health Check Endpoints

All services MUST expose Kubernetes-style health endpoints at these paths:

| Endpoint | Purpose | Auth |
| --- | --- | --- |
| `GET /live` | Liveness: process is running and accepting connections | None |
| `GET /ready` | Readiness: process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
| `GET /metrics` | Prometheus metrics | IP-restricted (no JWT) |

HAProxy uses health_path: /ready/ for backend health checks — return HTTP 200 when ready
Health endpoints MUST NOT require authentication
Third-party services use their native paths (/api/health, /api/healthz, /-/healthy, etc.)
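
The difference between liveness and readiness can be sketched as two framework-agnostic Python functions that return a status code and a JSON-serializable body. The function names, the body layout, and the use of HTTP 503 for "not ready" are assumptions; this document only specifies that a ready service returns HTTP 200.

```python
from typing import Callable, Dict, Tuple


def live() -> Tuple[int, dict]:
    """Liveness: if this code runs, the process is up and serving requests."""
    return 200, {"status": "alive"}


def ready(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Readiness: 200 only when every dependency check passes."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises counts as unhealthy
    status = 200 if all(results.values()) else 503  # 503 is an assumed convention
    return status, {"ready": status == 200, "checks": results}
```

A view for `/ready` would pass in one callable per dependency, for example `ready({"db": ping_db, "cache": ping_cache})`, where `ping_db` and `ping_cache` are hypothetical helpers that return `True` when the dependency responds.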

Docker Compose Healthchecks

Use curl -f (install curl in images if needed). Do not use wget --spider.

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

Endpoint Protection

| Protected (require valid JWT) | Unprotected |
| --- | --- |
| All `/api/v1/*` routes | `GET /live` |
| | `GET /ready` |
| | `GET /metrics` (IP-restricted to internal networks) |
| | `GET /api/auth/login-url` |
| | `POST /api/auth/token` |
| | `POST /api/v1/telemetry` (sendBeacon cannot set headers) |

Why /api/v1/telemetry is unprotected: The browser sendBeacon API cannot set Authorization headers. The telemetry endpoint must be open to receive client-side error reports and performance data, or browser errors will be silently lost.
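
The protection table can be read as an allowlist of unprotected method and path pairs. The sketch below encodes it that way in Python; the function name is hypothetical, and treating every unlisted route as protected (deny by default) is an assumption, since the table only names `/api/v1/*` explicitly.

```python
# (method, path) pairs listed as unprotected in the table above.
UNPROTECTED = {
    ("GET", "/live"),
    ("GET", "/ready"),
    ("GET", "/metrics"),  # still IP-restricted; that check is separate
    ("GET", "/api/auth/login-url"),
    ("POST", "/api/auth/token"),
    ("POST", "/api/v1/telemetry"),  # sendBeacon cannot set Authorization headers
}


def requires_jwt(method: str, path: str) -> bool:
    """Return True when the request must carry a valid JWT."""
    # Normalize so that "/ready/" and "/ready" are treated the same.
    key = (method.upper(), path.rstrip("/") or "/")
    if key in UNPROTECTED:
        return False
    # Assumption: anything not explicitly listed is protected (deny by default).
    return True
```

Matching on the method as well as the path means only `POST /api/v1/telemetry` is open; a `GET` to the same path still requires a token.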

Prometheus Metrics

All services SHOULD expose GET /metrics in Prometheus exposition format, scraped by Prospero's Prometheus at 15s intervals.

IP-restricted to internal networks: 10.10.0.0/24, 172.16.0.0/12, 127.0.0.0/8
No JWT required — HAProxy and Prometheus scrapers cannot authenticate
Useful metrics to expose: request totals and durations, error rates, active connections, queue depths, dependency health
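
Where the IP restriction is enforced in application code rather than at the proxy, the check could look like this sketch built on the standard `ipaddress` module. The function name is hypothetical; the three networks are the ones listed above.

```python
import ipaddress

# Internal networks allowed to scrape /metrics, as listed above.
INTERNAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.10.0.0/24", "172.16.0.0/12", "127.0.0.0/8")
]


def is_internal(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the internal networks."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed addresses are treated as external
    return any(addr in net for net in INTERNAL_NETWORKS)
```

Note that `10.10.0.0/24` covers only `10.10.0.0` through `10.10.0.255`, so an address such as `10.10.1.5` is rejected. If the service sits behind HAProxy, the address to test is the original client address, which the application must obtain in a way it can trust.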

Browser Telemetry

Frontend/browser code MUST report errors and performance data back to the server.

Send to POST /api/v1/telemetry — unprotected endpoint
Capture: JavaScript exceptions, promise rejections, resource load failures, performance metrics
The server MUST log client-side exceptions at WARNING level (they indicate user-facing problems but are not server failures)
Include enough context to reproduce: URL, user agent, error message, stack trace (if available)
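
On the server side, the telemetry handler might hand each report to a small function like the one below. This is a sketch only: the function name, logger name, payload keys (`url`, `message`, `stack`), and the stack-trace length cap are assumptions, not a defined schema.

```python
import logging

logger = logging.getLogger("telemetry.browser")  # hypothetical logger name

_MAX_STACK_CHARS = 4000  # assumed cap so a single report cannot flood the log


def log_client_error(payload: dict, user_agent: str) -> None:
    """Log one browser error report at WARNING with reproduction context."""
    stack = str(payload.get("stack", ""))[:_MAX_STACK_CHARS]
    # %r escapes newlines so a multi-line stack trace stays on one log line.
    logger.warning(
        "client error: url=%s user_agent=%r message=%r stack=%r",
        payload.get("url", "unknown"),
        user_agent,
        payload.get("message", ""),
        stack,
    )
```

Because the endpoint is unauthenticated, the payload should be treated as untrusted input: cap its size and never log it at ERROR, where it would page on-call.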

Docker Networking

Use the default Docker bridge network for simple deployments
Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
Do not define custom networks for single-service Docker Compose stacks

Documentation Standards

Place documentation in the /docs/ directory of the repository.

HTML Documents

HTML documents must follow docs/documentation_style_guide.html.

Use Bootstrap CDN with Bootswatch theme Flatly
Include a dark mode toggle button in the navbar
Use Bootstrap Icons for icons
Use Bootstrap CSS for styles — avoid custom CSS
Use Mermaid for diagrams
