From 525128897561f7a331b3c5b28e8210edd344db23 Mon Sep 17 00:00:00 2001
From: Robert Helewka
Date: Sat, 18 Apr 2026 06:58:59 -0400
Subject: [PATCH] Docs: Red Panda Standards Update regarding logging

---
 docs/Red Panda Standards_Django_V1-01.md | 158 +++++++++++++++++++++++
 1 file changed, 158 insertions(+)

diff --git a/docs/Red Panda Standards_Django_V1-01.md b/docs/Red Panda Standards_Django_V1-01.md
index 9645409..54393b3 100644
--- a/docs/Red Panda Standards_Django_V1-01.md
+++ b/docs/Red Panda Standards_Django_V1-01.md
@@ -222,6 +222,140 @@ ANGELIA_DB_PORT=5432
 - RabbitMQ is the Message Broker
 - Flower Monitoring: Use for debugging failed tasks
 
+### Celery Observability (per main standard)
+
+Celery workers are "long-running background workers" under the main standard and MUST comply with its Background Worker & Queue Monitoring section:
+
+- **Heartbeat**: every 60 seconds at INFO level, e.g. `logger.info("celery worker alive, processed %d tasks in last 5m, queue depth: %d", n, depth)`. Implement as a Celery beat task or a dedicated heartbeat thread.
+- **Startup / shutdown / crash-exit** logged at INFO — hook `worker_ready`, `worker_shutdown`, `worker_process_init` signals.
+- **Queue depth** exposed as a Prometheus metric (via `celery-exporter` or equivalent) so a growing-queue-with-no-consumers alert can fire at ERROR severity.
+- **Grafana staleness alert**: `absent_over_time({service_name="celery_worker_"}[10m])` → ERROR → email via AlertManager.
+- **Crash-on-start**: rely on the systemd unit or Docker restart policy to log the exit — do not assume the crashing Celery worker will log its own death.
+
+## Logging (per main standard)
+
+Django apps follow the main standard's [Log Level Standards](Red_Panda_Standards_V1-00.md#log-level-standards). Django-specific implementation notes:
+
+- **Default level: `WARNING`** for app loggers in production. Business logic only surfaces when degraded or broken.
+- **Level casing: UPPERCASE** (`INFO`, `WARNING`, `ERROR`, `DEBUG`) — Python/Django convention.
+- **Never use `print()`** — always `logger = logging.getLogger(__name__)`.
+- **Client telemetry** received at `POST /api/v1/telemetry` MUST be logged at `WARNING` level (browser-side errors are user-facing problems, not server failures).
+- **Access log filtering**: Gunicorn AND the upstream reverse proxy (nginx) must not emit 2xx/3xx entries for `/live`, `/ready`, `/metrics`, `/nginx_status`, `/health*`, `/ping`, or service-specific probes like `/mcp/health`. Filter these in the access-log handler. Both trailing-slash and non-trailing-slash forms MUST be matched. Implementation recipes are in the Gunicorn and nginx subsections under Health Check Endpoints below.
+- **Structured output**: log to stdout in a format Alloy can parse (JSON preferred). Every log line MUST carry a `level` label downstream.
+- **Expected conditions are not ERROR**: failed logins, form validation errors, 404s on user-supplied slugs → WARNING or INFO. Reserve ERROR for things that are actually broken.
+
+## Health Check Endpoints (per main standard)
+
+Every Django service MUST expose:
+
+| Endpoint | Purpose | Auth |
+|----------|---------|------|
+| `GET /live/` | Liveness — process is running | None |
+| `GET /ready/` | Readiness — DB, cache, upstream deps all healthy | None |
+| `GET /metrics` | Prometheus metrics | IP-restricted, no JWT |
+
+- **Trailing slash**: standard is `/live/` and `/ready/`. Django's `APPEND_SLASH` redirects un-slashed requests to the canonical slashed form — document as an exception only if you disable that behavior.
+- **Readiness logic** MUST actually probe dependencies: `connection.ensure_connection()` for the DB, a Memcached `ping`, a minimal RabbitMQ connection check. A bare `return HttpResponse(status=200)` fails the main standard.
+- **Do NOT require authentication** on health endpoints — HAProxy and Prometheus scrapers cannot authenticate.
+- **`/metrics`** is exposed via `django-prometheus` (preferred) and IP-restricted to internal networks per the main standard.
+
+### Internal-network allowlist (nginx)
+
+Any endpoint restricted to "internal networks only" (`/metrics`, `/nginx_status`, `nginx-prometheus-exporter` scrape targets, etc.) MUST use the full RFC1918 + loopback allowlist — **all four ranges**, in this order:
+
+```nginx
+allow 127.0.0.0/8;       # loopback
+allow 10.0.0.0/8;        # RFC1918 — primary internal range
+allow 172.16.0.0/12;     # RFC1918 — Docker default bridge range
+allow 192.168.0.0/16;    # RFC1918
+deny all;
+```
+
+Omitting `10.0.0.0/8` is the most common mistake and will silently break Prometheus scrapes from hosts on that network. Do not copy a shorter allowlist from older configs.
+
+### Gunicorn configuration
+
+Gunicorn MUST:
+
+- Log access AND error output to **stdout/stderr** — never a file inside the container. The Docker logging driver (syslog → Alloy in our stack) is the single collection point.
+- Use a `gunicorn.conf.py` referenced via `--config` so configuration lives in version control rather than a growing CMD string.
+- Filter probe paths out of the access log via a `logging.Filter` attached to the `gunicorn.access` logger in BOTH `on_starting` (master) AND `post_worker_init` (workers — Gunicorn re-applies logger config per worker, so a master-only filter is silently stripped).
+
+Canonical launch command:
+
+```dockerfile
+CMD ["gunicorn", \
+     "--config", "/srv//gunicorn.conf.py", \
+     "--bind", ":8080", \
+     "--workers", "3", \
+     "--timeout", "120", \
+     "--keep-alive", "5", \
+     "--access-logfile", "-", \
+     "--error-logfile", "-", \
+     ".wsgi:application"]
+```
+
+Canonical `gunicorn.conf.py` probe filter:
+
+```python
+import logging
+import re
+
+_PROBE_PATH = re.compile(
+    r"(?:^|\s)(?:/live|/ready|/metrics|/nginx_status|/health[^ ]*|/ping|/mcp/health)/?(?:\?|\s|$)"
+)
+
+
+class _ProbePathFilter(logging.Filter):
+    def filter(self, record: logging.LogRecord) -> bool:
+        request = getattr(record, "args", None)
+        if isinstance(request, dict):
+            # Gunicorn access log atoms: 'U' = URL path, 'r' = full request line (the regex matches either form)
+            path = request.get("U") or request.get("r", "")
+        else:
+            path = record.getMessage()
+        return not _PROBE_PATH.search(path)
+
+
+_filter = _ProbePathFilter()
+
+
+def on_starting(server):
+    logging.getLogger("gunicorn.access").addFilter(_filter)
+
+
+def post_worker_init(worker):
+    logging.getLogger("gunicorn.access").addFilter(_filter)
+```
+
+Update the probe-path regex if the service exposes additional health endpoints (e.g. sidecar servers). Do NOT special-case by status code — a 500 on `/ready/` is noise in Gunicorn's access log but is already surfaced via the readiness probe failing and the error log.
+
+### Nginx access-log filtering
+
+The reverse proxy sees the same probe traffic and will log it unless filtered. Use a `map` + conditional `access_log`:
+
+```nginx
+http {
+    map $request_uri $loggable {
+        default                  1;
+        ~^/live(/|\?|$)          0;
+        ~^/ready(/|\?|$)         0;
+        ~^/metrics(/|\?|$)       0;
+        ~^/nginx_status(/|\?|$)  0;
+        ~^/health                0;
+        ~^/ping(/|\?|$)          0;
+        ~^/mcp/health(/|\?|$)    0;
+    }
+
+    access_log /var/log/nginx/access.log combined if=$loggable;
+    # ...
+}
+```
+
+This is an nginx-wide switch — do not duplicate per `location` block. Error logging is unaffected; genuine 4xx/5xx on probe paths still surface via the error log and the probe itself failing.
+
+See [Red_Panda_Standards_V1-00.md §Health Check Endpoints](Red_Panda_Standards_V1-00.md#health-check-endpoints) for the full definition.
+
 ## Testing
 - Framework: Django TestCase (not pytest)
 - Separate test files per module: test_models.py, test_views.py, test_forms.py
@@ -266,6 +400,10 @@ ANGELIA_DB_PORT=5432
 ### Caching
 - pymemcache — Memcached backend
 
+### Observability
+- django-prometheus — `/metrics` endpoint in Prometheus exposition format
+- celery-exporter (or equivalent) — queue depth metrics for Celery workers
+
 ### Database
 - psycopg[binary] — PostgreSQL adapter
 - shortuuid — Short UUIDs for public URLs
@@ -339,3 +477,23 @@ ANGELIA_DB_PORT=5432
 - Don't pass model instances to tasks (pass IDs and re-fetch)
 - Don't assume tasks run immediately
 - Don't forget retry logic for external service calls
+- Don't run a Celery worker without a heartbeat (see Celery Observability)
+
+### Logging
+- Don't use `print()` — always use `logging.getLogger(__name__)`
+- Don't log at ERROR for expected conditions (failed logins, 404s, validation errors)
+- Don't log at INFO for successful probes of `/live`, `/ready`, `/metrics`
+- Don't log passwords, tokens, API keys, session cookies, or PII at any level
+- Don't use lowercase level names in Python code (UPPERCASE for Django/Python)
+
+---
+
+## Exceptions
+
+Per the main standard, deviations from Red Panda requirements MUST be recorded rather than hidden. Third-party Django packages, framework defaults, or deliberate trade-offs all go here.
+
+| Service | Standard waived | Reason | Reviewed |
+|---------|-----------------|--------|----------|
+| _(add as discovered)_ | | | |
+
+Exceptions MUST be re-reviewed on the doc's `Last reviewed` date. Remove entries whose underlying reason has gone away.
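
The structured-output rule in the Logging section above (JSON to stdout, every line carrying a `level`) could be met along the following lines. This is a minimal sketch using only the Python standard library. The class name `JsonFormatter`, the helper `build_stdout_handler`, and every field name other than `level` (`ts`, `logger`, `message`, `exc_info`) are assumptions chosen for illustration; the exact schema Alloy expects is not specified in this document.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log line with an UPPERCASE level."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # `level` is the field the standard requires; the other keys are assumptions.
            "level": record.levelname,
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON object so the line stays parseable.
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_stdout_handler() -> logging.Handler:
    """Return a handler that writes JSON lines to stdout (illustrative helper)."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    return handler
```

In a Django project, a formatter like this would typically be referenced from the `LOGGING` setting through the `"()"` factory key that `logging.config.dictConfig` supports, for example `{"()": "myproject.logging.JsonFormatter"}`, where `myproject.logging` is a hypothetical module path.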