Docs: Red Panda Standards Update regarding logging

2026-04-18 06:58:59 -04:00
parent 072291929f
commit 5251288975

@@ -222,6 +222,140 @@ ANGELIA_DB_PORT=5432
- RabbitMQ is the Message Broker
- Flower Monitoring: Use for debugging failed tasks
### Celery Observability (per main standard)
Celery workers are "long-running background workers" under the main standard and MUST comply with its Background Worker & Queue Monitoring section:
- **Heartbeat**: every 60 seconds at INFO level, e.g. `logger.info("celery worker alive, processed %d tasks in last 5m, queue depth: %d", n, depth)`. Implement as a Celery beat task or a dedicated heartbeat thread.
- **Startup / shutdown / crash-exit** logged at INFO — hook `worker_ready`, `worker_shutdown`, `worker_process_init` signals.
- **Queue depth** exposed as a Prometheus metric (via `celery-exporter` or equivalent) so a growing-queue-with-no-consumers alert can fire at ERROR severity.
- **Grafana staleness alert**: `absent_over_time({service_name="celery_worker_<app>"}[10m])` → ERROR → email via AlertManager.
- **Crash-on-start**: rely on the systemd unit or Docker restart policy to log the exit — do not assume the crashing Celery worker will log its own death.
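The dedicated heartbeat thread mentioned above could be sketched roughly as follows. This is a minimal, standard-library-only illustration: `WorkerHeartbeat`, `task_done`, and the `queue_depth` callable are hypothetical names, not part of Celery or the main standard, and the wiring to Celery signals is deliberately left out.

```python
import logging
import threading

logger = logging.getLogger(__name__)


class WorkerHeartbeat:
    """Logs a liveness line at a fixed interval from a daemon thread (hypothetical helper)."""

    def __init__(self, interval_seconds=60.0, queue_depth=lambda: -1):
        self.interval_seconds = interval_seconds
        # Assumption: a callable returning the current queue depth; -1 means "unknown".
        self.queue_depth = queue_depth
        self._processed = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="heartbeat", daemon=True)

    def task_done(self):
        """Call once per finished task so the heartbeat can report throughput."""
        with self._lock:
            self._processed += 1

    def beat(self):
        """Emit one heartbeat line and reset the counter; returns the count reported."""
        with self._lock:
            processed, self._processed = self._processed, 0
        logger.info(
            "celery worker alive, processed %d tasks since last heartbeat, queue depth: %d",
            processed,
            self.queue_depth(),
        )
        return processed

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self.interval_seconds):
            self.beat()

    def start(self):
        logger.info("celery worker starting")
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=5)
        logger.info("celery worker shutting down")
```

In a real worker, `start()` would plausibly be called from a `worker_ready` handler and `stop()` from a `worker_shutdown` handler. The 60-second default mirrors the heartbeat interval required above.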
## Logging (per main standard)
Django apps follow the main standard's [Log Level Standards](Red_Panda_Standards_V1-00.md#log-level-standards). Django-specific implementation notes:
- **Default level: `WARNING`** for app loggers in production. Business logic only surfaces when degraded or broken.
- **Level casing: UPPERCASE** (`INFO`, `WARNING`, `ERROR`, `DEBUG`) — Python/Django convention.
- **Never use `print()`** — always `logger = logging.getLogger(__name__)`.
- **Client telemetry** received at `POST /api/v1/telemetry` MUST be logged at `WARNING` level (browser-side errors are user-facing problems, not server failures).
- **Access log filtering**: Gunicorn AND the upstream reverse proxy (nginx) must not emit 2xx/3xx entries for `/live`, `/ready`, `/metrics`, `/nginx_status`, `/health*`, `/ping`, or service-specific probes like `/mcp/health`. Filter these in the access-log handler. Both trailing-slash and non-trailing-slash forms MUST be matched. Implementation recipes are in the Gunicorn and nginx subsections under Health Check Endpoints below.
- **Structured output**: log to stdout in a format Alloy can parse (JSON preferred). Every log line MUST carry a `level` label downstream.
- **Expected conditions are not ERROR**: failed logins, form validation errors, 404s on user-supplied slugs → WARNING or INFO. Reserve ERROR for things that are actually broken.
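One way the structured-output and default-level notes above might translate into a Django `LOGGING` setting is sketched below. `JsonFormatter` and the field names other than `level` are assumptions for illustration; the main standard may prescribe a different schema.

```python
import json
import logging
import logging.config


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per line, always carrying `level`."""

    def format(self, record):
        payload = {
            "level": record.levelname,  # UPPERCASE, e.g. "WARNING"
            "logger": record.name,
            "message": record.getMessage(),
            "time": self.formatTime(record),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": JsonFormatter}},
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "json",
        }
    },
    # WARNING by default, so business logic only surfaces when degraded or broken.
    "root": {"handlers": ["stdout"], "level": "WARNING"},
}

# Django applies settings.LOGGING itself; this call only makes the sketch runnable standalone.
logging.config.dictConfig(LOGGING)

logger = logging.getLogger(__name__)
# An expected condition such as a failed login goes out at WARNING, not ERROR.
logger.warning("login failed for user_id=%s", 42)
```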
## Health Check Endpoints (per main standard)
Every Django service MUST expose:
| Endpoint | Purpose | Auth |
|----------|---------|------|
| `GET /live/` | Liveness — process is running | None |
| `GET /ready/` | Readiness — DB, cache, upstream deps all healthy | None |
| `GET /metrics` | Prometheus metrics | IP-restricted, no JWT |
- **Trailing slash**: standard is `/live/` and `/ready/`. Django's `APPEND_SLASH` redirects un-slashed requests to the canonical slashed form — document as an exception only if you disable that behavior.
- **Readiness logic** MUST actually probe dependencies: `connection.ensure_connection()` for the DB, a Memcached round-trip (for example a cache `set`/`get`, since Memcached has no dedicated `ping` command), a minimal RabbitMQ connection check. A bare `return HttpResponse(status=200)` fails the main standard.
- **Do NOT require authentication** on health endpoints — HAProxy and Prometheus scrapers cannot authenticate.
- **`/metrics`** is exposed via `django-prometheus` (preferred) and IP-restricted to internal networks per the main standard.
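The readiness requirement above can be illustrated with a small, framework-agnostic sketch. `run_readiness_checks` is a hypothetical helper, and the choice to log probe failures at ERROR is an assumption. The actual Django probes are passed in as callables so the sketch runs without a configured project.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger(__name__)


def run_readiness_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, Dict[str, str]]:
    """Run every probe; return (HTTP status, per-dependency result).

    A probe signals failure by raising. Any failure yields 503 so the
    load balancer stops routing traffic to this instance.
    """
    results: Dict[str, str] = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # a readiness probe must never crash the view
            logger.error("readiness probe %s failed: %s", name, exc)
            results[name] = "failed"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

In a Django view this might be called with something like `{"db": connection.ensure_connection, ...}` and the result returned through `JsonResponse(payload, status=status)`. The cache and RabbitMQ probes depend on the service's client libraries and are not shown.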
### Internal-network allowlist (nginx)
Any endpoint restricted to "internal networks only" (`/metrics`, `/nginx_status`, `nginx-prometheus-exporter` scrape targets, etc.) MUST use the full RFC1918 + loopback allowlist — **all four ranges**, in this order:
```nginx
allow 127.0.0.0/8; # loopback
allow 10.0.0.0/8; # RFC1918 — primary internal range
allow 172.16.0.0/12; # RFC1918 — Docker default bridge range
allow 192.168.0.0/16; # RFC1918
deny all;
```
Omitting `10.0.0.0/8` is the most common mistake and will silently break Prometheus scrapes from hosts on that network. Do not copy a shorter allowlist from older configs.
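To see why a shortened allowlist breaks scrapes, the four ranges can be checked against a scraper address with Python's `ipaddress` module. This is only an illustrative sketch: `is_internal` is a hypothetical helper and the sample addresses are made up. nginx evaluates the `allow`/`deny` directives itself.

```python
import ipaddress

# The same four ranges as the nginx allowlist above, in the same order.
ALLOWLIST = [
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(addr, networks=ALLOWLIST):
    """Return True if addr falls inside any of the allowed networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in networks)


# An older, shorter allowlist that forgot 10.0.0.0/8 (hypothetical).
SHORT_ALLOWLIST = [net for net in ALLOWLIST if str(net) != "10.0.0.0/8"]

if __name__ == "__main__":
    scraper = "10.4.2.9"  # made-up Prometheus host address
    print(is_internal(scraper), is_internal(scraper, SHORT_ALLOWLIST))
```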
### Gunicorn configuration
Gunicorn MUST:
- Log access AND error output to **stdout/stderr** — never a file inside the container. The Docker logging driver (syslog → Alloy in our stack) is the single collection point.
- Use a `gunicorn.conf.py` referenced via `--config` so configuration lives in version control rather than a growing CMD string.
- Filter probe paths out of the access log via a `logging.Filter` attached to the `gunicorn.access` logger in BOTH `on_starting` (master) AND `post_worker_init` (workers — Gunicorn re-applies logger config per worker, so a master-only filter is silently stripped).
Canonical launch command:
```dockerfile
CMD ["gunicorn", \
     "--config", "/srv/<app>/gunicorn.conf.py", \
     "--bind", ":8080", \
     "--workers", "3", \
     "--timeout", "120", \
     "--keep-alive", "5", \
     "--access-logfile", "-", \
     "--error-logfile", "-", \
     "<app>.wsgi:application"]
```
Canonical `gunicorn.conf.py` probe filter:
```python
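import logging
import re

# Matches probe paths with or without a trailing slash or query string.
_PROBE_PATH = re.compile(
    r"^(?:/live|/ready|/metrics|/nginx_status|/health[^ ]*|/ping|/mcp/health)/?(?:\?|$)"
)


class _ProbePathFilter(logging.Filter):
    """Drop access-log records whose request path is a probe endpoint."""

    def filter(self, record: logging.LogRecord) -> bool:
        request = getattr(record, "args", None)
        if isinstance(request, dict):
            # Gunicorn access log atoms: 'U' = URL path, 'r' = full request line
            path = request.get("U") or ""
            if not path:
                # 'r' looks like "GET /live/ HTTP/1.1"; the anchored regex
                # needs the path alone, which is the second token.
                parts = str(request.get("r") or "").split()
                path = parts[1] if len(parts) > 1 else ""
        else:
            path = record.getMessage()
        return not _PROBE_PATH.search(path)


_filter = _ProbePathFilter()


def on_starting(server):
    # Master process
    logging.getLogger("gunicorn.access").addFilter(_filter)


def post_worker_init(worker):
    # Each worker process
    logging.getLogger("gunicorn.access").addFilter(_filter)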
```
Update the probe-path regex if the service exposes additional health endpoints (e.g. sidecar servers). Do NOT special-case by status code: a 500 on `/ready/` would only add noise to Gunicorn's access log, because it is already surfaced by the failing readiness probe and by the error log.
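When extending the probe-path regex, it helps to check the result against a few sample paths before deploying. The sketch below rebuilds the same pattern from a list so extra endpoints can be appended. `build_probe_regex`, `is_probe`, and the `/sidecar/health` path are hypothetical and only for illustration.

```python
import re

# Same alternatives as the canonical gunicorn.conf.py pattern.
BASE_PROBE_PATHS = [
    "/live",
    "/ready",
    "/metrics",
    "/nginx_status",
    "/health[^ ]*",
    "/ping",
    "/mcp/health",
]


def build_probe_regex(extra_paths=()):
    """Return the probe regex, extended with literal extra paths."""
    # Extra paths are escaped so they are matched literally, not as regex syntax.
    alternatives = BASE_PROBE_PATHS + [re.escape(path) for path in extra_paths]
    return re.compile(r"^(?:" + "|".join(alternatives) + r")/?(?:\?|$)")


def is_probe(path, pattern):
    """True if the path would be dropped from the access log."""
    return pattern.search(path) is not None


if __name__ == "__main__":
    pattern = build_probe_regex(["/sidecar/health"])
    for sample in ("/live", "/ready/?verbose=1", "/livestream", "/sidecar/health/"):
        print(sample, is_probe(sample, pattern))
```

The trailing `/?(?:\?|$)` is what keeps application paths that merely start with a probe name, such as `/livestream`, in the access log.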
### Nginx access-log filtering
The reverse proxy sees the same probe traffic and will log it unless filtered. Use a `map` + conditional `access_log`:
```nginx
http {
    map $request_uri $loggable {
        default                   1;
        ~^/live(/|\?|$)           0;
        ~^/ready(/|\?|$)          0;
        ~^/metrics(/|\?|$)        0;
        ~^/nginx_status(/|\?|$)   0;
        ~^/health                 0;
        ~^/ping(/|\?|$)           0;
        ~^/mcp/health(/|\?|$)     0;
    }
    access_log /var/log/nginx/access.log combined if=$loggable;
    # ...
}
```
This is an nginx-wide switch — do not duplicate per `location` block. Error logging is unaffected; genuine 4xx/5xx on probe paths still surface via the error log and the probe itself failing.
See [Red_Panda_Standards_V1-00.md §Health Check Endpoints](Red_Panda_Standards_V1-00.md#health-check-endpoints) for the full definition.
## Testing
- Framework: Django TestCase (not pytest)
- Separate test files per module: test_models.py, test_views.py, test_forms.py
@@ -266,6 +400,10 @@ ANGELIA_DB_PORT=5432
### Caching
- pymemcache — Memcached backend
### Observability
- django-prometheus — `/metrics` endpoint in Prometheus exposition format
- celery-exporter (or equivalent) — queue depth metrics for Celery workers
### Database
- psycopg[binary] — PostgreSQL adapter
- shortuuid — Short UUIDs for public URLs
@@ -339,3 +477,23 @@ ANGELIA_DB_PORT=5432
- Don't pass model instances to tasks (pass IDs and re-fetch)
- Don't assume tasks run immediately
- Don't forget retry logic for external service calls
- Don't run a Celery worker without a heartbeat (see Celery Observability)
### Logging
- Don't use `print()` — always use `logging.getLogger(__name__)`
- Don't log at ERROR for expected conditions (failed logins, 404s, validation errors)
- Don't log at INFO for successful probes of `/live`, `/ready`, `/metrics`
- Don't log passwords, tokens, API keys, session cookies, or PII at any level
- Don't use lowercase level names in Python code (UPPERCASE for Django/Python)
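As a safety net for the rule against logging secrets, a `logging.Filter` can redact obvious `key=value` leaks before they reach the handler. The sketch below is an assumption-laden illustration: `RedactSecretsFilter` and the key names in the pattern are hypothetical, and it does not detect PII. The primary control is still not logging these values at all.

```python
import logging
import re

# Hypothetical list of sensitive keys; extend per service.
_SENSITIVE = re.compile(
    r"\b(password|token|api_key|sessionid)=([^\s&;,]+)",
    re.IGNORECASE,
)


class RedactSecretsFilter(logging.Filter):
    """Replace values of known sensitive keys with [REDACTED]."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        redacted = _SENSITIVE.sub(r"\1=[REDACTED]", message)
        if redacted != message:
            # Store the already-formatted, redacted text and drop the args
            # so the handler does not re-interpolate the original values.
            record.msg = redacted
            record.args = None
        return True  # never drop the record, only rewrite it
```

A filter like this could be attached to the stdout handler in the `LOGGING` setting so that it applies to every logger that writes through that handler.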
---
## Exceptions
Per the main standard, deviations from Red Panda requirements MUST be recorded rather than hidden. Third-party Django packages, framework defaults, or deliberate trade-offs all go here.
| Service | Standard waived | Reason | Reviewed |
|---------|-----------------|--------|----------|
| _(add as discovered)_ | | | |
Exceptions MUST be re-reviewed on the doc's `Last reviewed` date. Remove entries whose underlying reason has gone away.