Docs: Red Panda Standards Update regarding logging

2026-04-18 06:58:59 -04:00
parent 072291929f
commit 5251288975
1 changed files with 158 additions and 0 deletions
--- a/Standards_Django_V1-01.md
+++ b/Standards_Django_V1-01.md
@@ -222,6 +222,140 @@ ANGELIA_DB_PORT=5432
 - RabbitMQ is the Message Broker
 - Flower Monitoring: Use for debugging failed tasks

+### Celery Observability (per main standard)
+
+Celery workers are "long-running background workers" under the main standard and MUST comply with its Background Worker & Queue Monitoring section:
+
+- **Heartbeat**: every 60 seconds at INFO level, e.g. `logger.info("celery worker alive, processed %d tasks in last 5m, queue depth: %d", n, depth)`. Implement as a Celery beat task or a dedicated heartbeat thread.
+- **Startup / shutdown / crash-exit** logged at INFO — hook `worker_ready`, `worker_shutdown`, `worker_process_init` signals.
+- **Queue depth** exposed as a Prometheus metric (via `celery-exporter` or equivalent) so a growing-queue-with-no-consumers alert can fire at ERROR severity.
+- **Grafana staleness alert**: `absent_over_time({service_name="celery_worker_<app>"}[10m])` → ERROR → email via AlertManager.
+- **Crash-on-start**: rely on the systemd unit or Docker restart policy to log the exit — do not assume the crashing Celery worker will log its own death.
+
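The heartbeat requirement above can be sketched with a dedicated thread. This is a minimal sketch under stated assumptions, not the main standard's implementation: `HeartbeatThread`, `get_processed`, and `get_queue_depth` are hypothetical names, and how the two counts are obtained is left to the service. In a real worker the thread would typically be started from a `worker_ready` handler and stopped from a `worker_shutdown` handler.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)

HEARTBEAT_MESSAGE = "celery worker alive, processed %d tasks in last 5m, queue depth: %d"


class HeartbeatThread(threading.Thread):
    """Logs one INFO heartbeat line per interval until stop() is called."""

    def __init__(
        self,
        get_processed: Callable[[], int],    # hypothetical: tasks finished in the last 5 minutes
        get_queue_depth: Callable[[], int],  # hypothetical: current broker queue depth
        interval: float = 60.0,              # seconds between heartbeats
    ) -> None:
        super().__init__(name="celery-heartbeat", daemon=True)
        self._get_processed = get_processed
        self._get_queue_depth = get_queue_depth
        self._interval = interval
        self._stop_event = threading.Event()

    def beat(self) -> None:
        """Emit a single heartbeat line."""
        logger.info(HEARTBEAT_MESSAGE, self._get_processed(), self._get_queue_depth())

    def run(self) -> None:
        # Event.wait() returns False on timeout and True once stop() has been called.
        while not self._stop_event.wait(self._interval):
            self.beat()

    def stop(self) -> None:
        self._stop_event.set()
```

Using an `Event` for the sleep means `stop()` takes effect immediately instead of waiting out the remaining interval, which keeps worker shutdown fast.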
+## Logging (per main standard)
+
+Django apps follow the main standard's [Log Level Standards](Red_Panda_Standards_V1-00.md#log-level-standards). Django-specific implementation notes:
+
+- **Default level: `WARNING`** for app loggers in production. Business logic only surfaces when degraded or broken.
+- **Level casing: UPPERCASE** (`INFO`, `WARNING`, `ERROR`, `DEBUG`) — Python/Django convention.
+- **Never use `print()`** — always `logger = logging.getLogger(__name__)`.
+- **Client telemetry** received at `POST /api/v1/telemetry` MUST be logged at `WARNING` level (browser-side errors are user-facing problems, not server failures).
+- **Access log filtering**: Gunicorn AND the upstream reverse proxy (nginx) must not emit 2xx/3xx entries for `/live`, `/ready`, `/metrics`, `/nginx_status`, `/health*`, `/ping`, or service-specific probes like `/mcp/health`. Filter these in the access-log handler. Both trailing-slash and non-trailing-slash forms MUST be matched. Implementation recipes are in the Gunicorn and nginx subsections under Health Check Endpoints below.
+- **Structured output**: log to stdout in a format Alloy can parse (JSON preferred). Every log line MUST carry a `level` label downstream.
+- **Expected conditions are not ERROR**: failed logins, form validation errors, 404s on user-supplied slugs → WARNING or INFO. Reserve ERROR for things that are actually broken.
+
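The structured-output rule above could be met with a small JSON formatter from the standard library. This is a sketch only: `JsonFormatter` and `configure_logging` are hypothetical names, and every field name other than `level` is an assumption, since the source only requires that each line carry a `level`.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),    # assumed field name
            "level": record.levelname,          # UPPERCASE, e.g. "WARNING"
            "logger": record.name,              # assumed field name
            "message": record.getMessage(),     # assumed field name
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: str = "WARNING") -> None:
    """Send all logging to stdout as JSON at the given default level."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

In a Django project the formatter would normally be referenced from the `LOGGING` setting rather than installed by hand; the direct setup here only keeps the sketch self-contained.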
+## Health Check Endpoints (per main standard)
+
+Every Django service MUST expose:
+
+| Endpoint | Purpose | Auth |
+|----------|---------|------|
+| `GET /live/` | Liveness — process is running | None |
+| `GET /ready/` | Readiness — DB, cache, upstream deps all healthy | None |
+| `GET /metrics` | Prometheus metrics | IP-restricted, no JWT |
+
+- **Trailing slash**: standard is `/live/` and `/ready/`. Django's `APPEND_SLASH` redirects un-slashed requests to the canonical slashed form — document as an exception only if you disable that behavior.
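+- **Readiness logic** MUST actually probe dependencies: `connection.ensure_connection()` for the DB, a Memcached round-trip (Memcached has no literal `ping` command; a `version` call or a `set`/`get` of a throwaway key serves the same purpose), a minimal RabbitMQ connection check. A bare `return HttpResponse(status=200)` fails the main standard.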
+- **Do NOT require authentication** on health endpoints — HAProxy and Prometheus scrapers cannot authenticate.
+- **`/metrics`** is exposed via `django-prometheus` (preferred) and IP-restricted to internal networks per the main standard.
+
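The readiness rule above amounts to running each dependency probe and failing the endpoint if any probe fails. The following is a framework-free sketch: `run_readiness_checks` is a hypothetical helper, the 503 status for a failed check and the ERROR log level are assumptions, and the individual checks (for example a wrapper around `connection.ensure_connection()`) are supplied by the service.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger(__name__)


def run_readiness_checks(
    checks: Dict[str, Callable[[], None]],
) -> Tuple[int, Dict[str, str]]:
    """Run every check; a check signals failure by raising an exception.

    Returns (http_status, per-check results). 503 on any failure is an
    assumption, not something the standard text spells out.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency as down
            logger.error("readiness check %s failed: %s", name, exc)
            results[name] = "failed"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

A Django view would call this with its real probes and wrap the result in a response, e.g. `JsonResponse(results, status=status)`. Every check runs even after one fails, so a single response shows all broken dependencies at once.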
+### Internal-network allowlist (nginx)
+
+Any endpoint restricted to "internal networks only" (`/metrics`, `/nginx_status`, `nginx-prometheus-exporter` scrape targets, etc.) MUST use the full RFC1918 + loopback allowlist — **all four ranges**, in this order:
+
+```nginx
+allow 127.0.0.0/8;     # loopback
+allow 10.0.0.0/8;      # RFC1918 — primary internal range
+allow 172.16.0.0/12;   # RFC1918 — Docker default bridge range
+allow 192.168.0.0/16;  # RFC1918
+deny all;
+```
+
+Omitting `10.0.0.0/8` is the most common mistake and will silently break Prometheus scrapes from hosts on that network. Do not copy a shorter allowlist from older configs.
+
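The effect of a shortened allowlist can be checked offline with Python's `ipaddress` module before touching nginx. This is an illustrative sketch; `INTERNAL_NETWORKS`, `SHORT_LIST`, and `is_internal` are hypothetical names, and the sample addresses are made up.

```python
import ipaddress

# The four ranges from the nginx allowlist above, in the same order.
INTERNAL_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
)


def is_internal(address: str, networks=INTERNAL_NETWORKS) -> bool:
    """Return True if address falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


# A shortened list that forgets 10.0.0.0/8, as in the mistake described above.
SHORT_LIST = tuple(n for n in INTERNAL_NETWORKS if str(n) != "10.0.0.0/8")

print(is_internal("10.20.30.40"))              # allowed by the full list
print(is_internal("10.20.30.40", SHORT_LIST))  # denied by the shortened list
print(is_internal("203.0.113.9"))              # public address, denied
```

A check like this fits naturally into a unit test that guards against someone trimming the list later.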
+### Gunicorn configuration
+
+Gunicorn MUST:
+
+- Log access AND error output to **stdout/stderr** — never a file inside the container. The Docker logging driver (syslog → Alloy in our stack) is the single collection point.
+- Use a `gunicorn.conf.py` referenced via `--config` so configuration lives in version control rather than a growing CMD string.
+- Filter probe paths out of the access log via a `logging.Filter` attached to the `gunicorn.access` logger in BOTH `on_starting` (master) AND `post_worker_init` (workers — Gunicorn re-applies logger config per worker, so a master-only filter is silently stripped).
+
+Canonical launch command:
+
+```dockerfile
+CMD ["gunicorn", \
+     "--config", "/srv/<app>/gunicorn.conf.py", \
+     "--bind", ":8080", \
+     "--workers", "3", \
+     "--timeout", "120", \
+     "--keep-alive", "5", \
+     "--access-logfile", "-", \
+     "--error-logfile", "-", \
+     "<app>.wsgi:application"]
+```
+
+Canonical `gunicorn.conf.py` probe filter:
+
+```python
+import logging
+import re
+
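+_PROBE_PATH = re.compile(
+    r"^(?:/live|/ready|/metrics|/nginx_status|/health[^ ]*|/ping|/mcp/health)/?(?:\?|$)"
+)
+# Extracts the request target from 'GET /live/ HTTP/1.1', on its own or
+# embedded in a fully formatted access-log line.
+_REQUEST_TARGET = re.compile(r'(?:^|")[A-Z]+ (\S+) HTTP/')
+
+
+class _ProbePathFilter(logging.Filter):
+    """Drop access-log records whose request path is a probe endpoint."""
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        atoms = record.args if isinstance(record.args, dict) else {}
+        # Gunicorn access log atoms: 'U' = URL path, 'r' = full request line
+        path = atoms.get("U")
+        if not path:
+            # _PROBE_PATH is anchored, so pull the bare target out of the
+            # request line (or the formatted message) before matching.
+            match = _REQUEST_TARGET.search(atoms.get("r") or record.getMessage())
+            path = match.group(1) if match else ""
+        return not _PROBE_PATH.search(path)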
+
+
+_filter = _ProbePathFilter()
+
+
+def on_starting(server):
+    logging.getLogger("gunicorn.access").addFilter(_filter)
+
+
+def post_worker_init(worker):
+    logging.getLogger("gunicorn.access").addFilter(_filter)
+```
+
+Update the probe-path regex if the service exposes additional health endpoints (e.g. sidecar servers). Do NOT special-case by status code: the filter drops every probe-path entry, failures included. A 500 on `/ready/` is noise in Gunicorn's access log because it is already surfaced via the failing readiness probe and the error log.
+
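Before changing the probe-path regex it helps to confirm what the current pattern does and does not match. The sketch below reuses the pattern from the canonical filter above; `is_probe` and the sample paths are hypothetical. Note that the pattern is anchored at the start, so it expects a bare path (such as the `U` atom) rather than a full request line.

```python
import re

# Same pattern as the canonical gunicorn.conf.py filter above.
PROBE_PATH = re.compile(
    r"^(?:/live|/ready|/metrics|/nginx_status|/health[^ ]*|/ping|/mcp/health)/?(?:\?|$)"
)


def is_probe(path: str) -> bool:
    """Return True if the bare request path is a probe endpoint."""
    return PROBE_PATH.search(path) is not None


SAMPLES = [
    "/live",              # un-slashed form
    "/live/",             # slashed form
    "/ready/?verbose=1",  # query string after the path
    "/healthz",           # covered by the /health* wildcard
    "/mcp/health",        # service-specific probe
    "/api/v1/live",       # not a probe: pattern is anchored at the start
    "/livestream",        # not a probe: no path boundary after /live
]

for sample in SAMPLES:
    print(f"{sample!r}: {'filtered' if is_probe(sample) else 'logged'}")
```

Keeping a table of sample paths like this next to the regex makes it easy to add a new endpoint and see at a glance that ordinary application routes are still logged.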
+### Nginx access-log filtering
+
+The reverse proxy sees the same probe traffic and will log it unless filtered. Use a `map` + conditional `access_log`:
+
+```nginx
+http {
+    map $request_uri $loggable {
+        default                   1;
+        ~^/live(/|\?|$)           0;
+        ~^/ready(/|\?|$)          0;
+        ~^/metrics(/|\?|$)        0;
+        ~^/nginx_status(/|\?|$)   0;
+        ~^/health                 0;
+        ~^/ping(/|\?|$)           0;
+        ~^/mcp/health(/|\?|$)     0;
+    }
+
+    access_log /var/log/nginx/access.log combined if=$loggable;
+    # ...
+}
+```
+
+This is an nginx-wide switch — do not duplicate per `location` block. Error logging is unaffected; genuine 4xx/5xx on probe paths still surface via the error log and the probe itself failing.
+
+See [Red_Panda_Standards_V1-00.md §Health Check Endpoints](Red_Panda_Standards_V1-00.md#health-check-endpoints) for the full definition.
+
 ## Testing
 - Framework: Django TestCase (not pytest)
 - Separate test files per module: test_models.py, test_views.py, test_forms.py
@@ -266,6 +400,10 @@ ANGELIA_DB_PORT=5432
 ### Caching
 - pymemcache — Memcached backend

+### Observability
+- django-prometheus — `/metrics` endpoint in Prometheus exposition format
+- celery-exporter (or equivalent) — queue depth metrics for Celery workers
+
 ### Database
 - psycopg[binary] — PostgreSQL adapter
 - shortuuid — Short UUIDs for public URLs
@@ -339,3 +477,23 @@ ANGELIA_DB_PORT=5432
 - Don't pass model instances to tasks (pass IDs and re-fetch)
 - Don't assume tasks run immediately
 - Don't forget retry logic for external service calls
+- Don't run a Celery worker without a heartbeat (see Celery Observability)
+
+### Logging
+- Don't use `print()` — always use `logging.getLogger(__name__)`
+- Don't log at ERROR for expected conditions (failed logins, 404s, validation errors)
+- Don't log at INFO for successful probes of `/live`, `/ready`, `/metrics`
+- Don't log passwords, tokens, API keys, session cookies, or PII at any level
+- Don't use lowercase level names in Python code (UPPERCASE for Django/Python)
+
+---
+
+## Exceptions
+
+Per the main standard, deviations from Red Panda requirements MUST be recorded rather than hidden. Third-party Django packages, framework defaults, or deliberate trade-offs all go here.
+
+| Service | Standard waived | Reason | Reviewed |
+|---------|-----------------|--------|----------|
+| _(add as discovered)_ | | | |
+
+Exceptions MUST be re-reviewed each time the doc's `Last reviewed` date is updated. Remove entries whose underlying reason has gone away.