Docs: Red Panda Standards Update regarding logging
@@ -222,6 +222,140 @@ ANGELIA_DB_PORT=5432
- RabbitMQ is the Message Broker
- Flower Monitoring: Use for debugging failed tasks

### Celery Observability (per main standard)

Celery workers are "long-running background workers" under the main standard and MUST comply with its Background Worker & Queue Monitoring section:

- **Heartbeat**: every 60 seconds at INFO level, e.g. `logger.info("celery worker alive, processed %d tasks in last 5m, queue depth: %d", n, depth)`. Implement as a Celery beat task or a dedicated heartbeat thread.
- **Startup / shutdown / crash-exit** logged at INFO — hook `worker_ready`, `worker_shutdown`, `worker_process_init` signals.
- **Queue depth** exposed as a Prometheus metric (via `celery-exporter` or equivalent) so a growing-queue-with-no-consumers alert can fire at ERROR severity.
- **Grafana staleness alert**: `absent_over_time({service_name="celery_worker_<app>"}[10m])` → ERROR → email via AlertManager.
- **Crash-on-start**: rely on the systemd unit or Docker restart policy to log the exit — do not assume the crashing Celery worker will log its own death.

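The "dedicated heartbeat thread" option from the list above could look roughly like the following. This is a minimal, stdlib-only sketch: the `WorkerHeartbeat` class, its method names, and the `queue_depth` callable are hypothetical, and how task completions reach `task_done()` depends on the worker pool and is left out.

```python
import logging
import threading
import time
from typing import Callable, List

logger = logging.getLogger(__name__)

_WINDOW_SECONDS = 300.0  # "last 5m" in the heartbeat message


class WorkerHeartbeat:
    """Log an INFO heartbeat on a fixed interval from a daemon thread."""

    def __init__(self, queue_depth: Callable[[], int], interval: float = 60.0) -> None:
        self._queue_depth = queue_depth       # hypothetical: returns current queue depth
        self._interval = interval
        self._finished: List[float] = []      # completion timestamps (monotonic clock)
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="heartbeat", daemon=True)

    def task_done(self) -> None:
        """Record one completed task."""
        with self._lock:
            self._finished.append(time.monotonic())

    def recent_count(self) -> int:
        """Number of tasks completed inside the 5-minute window."""
        cutoff = time.monotonic() - _WINDOW_SECONDS
        with self._lock:
            self._finished = [t for t in self._finished if t >= cutoff]
            return len(self._finished)

    def beat(self) -> None:
        logger.info(
            "celery worker alive, processed %d tasks in last 5m, queue depth: %d",
            self.recent_count(),
            self._queue_depth(),
        )

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval):
            self.beat()

    def start(self) -> None:
        logger.info("celery worker starting")
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        logger.info("celery worker shutting down")


# Assumed wiring, not shown here: call start() from a `worker_ready` handler
# and stop() from a `worker_shutdown` handler.
```

A thread keeps the heartbeat independent of the task queue itself, so a worker that is alive but starved of tasks still logs. A beat task would instead stop producing heartbeats when the queue is blocked.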
## Logging (per main standard)

Django apps follow the main standard's [Log Level Standards](Red_Panda_Standards_V1-00.md#log-level-standards). Django-specific implementation notes:

- **Default level: `WARNING`** for app loggers in production. Business logic only surfaces when degraded or broken.
- **Level casing: UPPERCASE** (`INFO`, `WARNING`, `ERROR`, `DEBUG`) — Python/Django convention.
- **Never use `print()`** — always `logger = logging.getLogger(__name__)`.
- **Client telemetry** received at `POST /api/v1/telemetry` MUST be logged at `WARNING` level (browser-side errors are user-facing problems, not server failures).
- **Access log filtering**: Gunicorn AND the upstream reverse proxy (nginx) must not emit 2xx/3xx entries for `/live`, `/ready`, `/metrics`, `/nginx_status`, `/health*`, `/ping`, or service-specific probes like `/mcp/health`. Filter these in the access-log handler. Both trailing-slash and non-trailing-slash forms MUST be matched. Implementation recipes are in the Gunicorn and nginx subsections under Health Check Endpoints below.
- **Structured output**: log to stdout in a format Alloy can parse (JSON preferred). Every log line MUST carry a `level` label downstream.
- **Expected conditions are not ERROR**: failed logins, form validation errors, 404s on user-supplied slugs → WARNING or INFO. Reserve ERROR for things that are actually broken.

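As an illustration of the structured-output and default-level notes above, a logging configuration might be sketched as follows. The `JsonFormatter` class and the JSON field names other than `level` are assumptions, not part of the standard. In a Django project the `LOGGING` dict would normally live in settings, where Django applies it itself and the `"()"` factory is usually given as a dotted path. Here `dictConfig` is called directly so the snippet runs on its own.

```python
import json
import logging
import logging.config


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with an explicit level."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,  # UPPERCASE: DEBUG, INFO, WARNING, ERROR
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": JsonFormatter}},
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # stdout only, never a file in the container
            "formatter": "json",
        },
    },
    # WARNING default: INFO and DEBUG records are dropped in production.
    "root": {"handlers": ["stdout"], "level": "WARNING"},
}

logging.config.dictConfig(LOGGING)

logger = logging.getLogger(__name__)
logger.info("suppressed at the WARNING default")
logger.warning("failed login for user_id=%s", 42)  # expected condition, so not ERROR
```

Emitting `level` as a top-level JSON key is one way to let the collector promote it to a label. The exact extraction rule on the Alloy side is outside this sketch.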
## Health Check Endpoints (per main standard)

Every Django service MUST expose:

| Endpoint | Purpose | Auth |
|----------|---------|------|
| `GET /live/` | Liveness — process is running | None |
| `GET /ready/` | Readiness — DB, cache, upstream deps all healthy | None |
| `GET /metrics` | Prometheus metrics | IP-restricted, no JWT |

- **Trailing slash**: standard is `/live/` and `/ready/`. Django's `APPEND_SLASH` redirects un-slashed requests to the canonical slashed form — document as an exception only if you disable that behavior.
- **Readiness logic** MUST actually probe dependencies: `connection.ensure_connection()` for the DB, a Memcached `ping`, a minimal RabbitMQ connection check. A bare `return HttpResponse(status=200)` fails the main standard.
- **Do NOT require authentication** on health endpoints — HAProxy and Prometheus scrapers cannot authenticate.
- **`/metrics`** is exposed via `django-prometheus` (preferred) and IP-restricted to internal networks per the main standard.

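The readiness requirement above boils down to "run every dependency check and fail if any of them fails". The sketch below shows that aggregation step without depending on Django. The helper name `run_readiness_checks`, the result format, and the 503 status on failure are assumptions, and the individual checks are stand-ins for the real DB, cache, and broker probes.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], None]  # a check raises on failure and returns nothing on success


def run_readiness_checks(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-dependency result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure means "not ready"
            results[name] = f"failed: {exc.__class__.__name__}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


# In a Django view the wiring might look roughly like this (assumed names):
#   from django.db import connection
#   from django.http import JsonResponse
#   status, body = run_readiness_checks({"db": connection.ensure_connection, ...})
#   return JsonResponse(body, status=status)


def _broken_cache() -> None:
    """Stand-in for a cache probe that cannot reach its backend."""
    raise ConnectionError("memcached unreachable")


status, body = run_readiness_checks({"db": lambda: None, "cache": _broken_cache})
print(status, body)
```

Running all checks instead of stopping at the first failure means the response body names every broken dependency in one probe.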
### Internal-network allowlist (nginx)

Any endpoint restricted to "internal networks only" (`/metrics`, `/nginx_status`, `nginx-prometheus-exporter` scrape targets, etc.) MUST use the full RFC1918 + loopback allowlist — **all four ranges**, in this order:

```nginx
allow 127.0.0.0/8;      # loopback
allow 10.0.0.0/8;       # RFC1918 — primary internal range
allow 172.16.0.0/12;    # RFC1918 — Docker default bridge range
allow 192.168.0.0/16;   # RFC1918
deny all;
```

Omitting `10.0.0.0/8` is the most common mistake and will silently break Prometheus scrapes from hosts on that network. Do not copy a shorter allowlist from older configs.

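To see why a shortened allowlist breaks scrapes, the effect of the rules can be modelled with Python's `ipaddress` module. This is only an illustration of the matching logic (a run of `allow` rules followed by `deny all` reduces to a membership test). The `is_allowed` helper and the sample addresses are hypothetical.

```python
import ipaddress

ALLOWLIST = [
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("10.0.0.0/8"),      # RFC1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC1918, covers 172.16.x.x to 172.31.x.x
    ipaddress.ip_network("192.168.0.0/16"),  # RFC1918
]


def is_allowed(client_ip: str, networks=ALLOWLIST) -> bool:
    """Return True if the client address falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


# An older, shorter allowlist that forgot 10.0.0.0/8.
short_list = [net for net in ALLOWLIST if str(net) != "10.0.0.0/8"]

print(is_allowed("10.20.30.40"))              # scraper on the 10.x network: allowed
print(is_allowed("10.20.30.40", short_list))  # same scraper, short list: denied
print(is_allowed("172.17.0.2"))               # typical Docker bridge address: allowed
print(is_allowed("8.8.8.8"))                  # public address: denied
```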
### Gunicorn configuration

Gunicorn MUST:

- Log access AND error output to **stdout/stderr** — never a file inside the container. The Docker logging driver (syslog → Alloy in our stack) is the single collection point.
- Use a `gunicorn.conf.py` referenced via `--config` so configuration lives in version control rather than a growing CMD string.
- Filter probe paths out of the access log via a `logging.Filter` attached to the `gunicorn.access` logger in BOTH `on_starting` (master) AND `post_worker_init` (workers — Gunicorn re-applies logger config per worker, so a master-only filter is silently stripped).

Canonical launch command:

```dockerfile
CMD ["gunicorn", \
     "--config", "/srv/<app>/gunicorn.conf.py", \
     "--bind", ":8080", \
     "--workers", "3", \
     "--timeout", "120", \
     "--keep-alive", "5", \
     "--access-logfile", "-", \
     "--error-logfile", "-", \
     "<app>.wsgi:application"]
```

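Because `gunicorn.conf.py` is plain Python, the same values can also be expressed as settings in that file. The sketch below mirrors the flags from the launch command. It is offered as an illustration of the "configuration lives in version control" point, not as a replacement for the canonical command, and flags passed on the command line still take precedence over the file.

```python
# gunicorn.conf.py (sketch): settings equivalent to the launch-command flags
bind = ":8080"
workers = 3
timeout = 120    # seconds before an unresponsive worker is killed and restarted
keepalive = 5    # seconds to hold an idle keep-alive connection open
accesslog = "-"  # "-" sends the access log to stdout
errorlog = "-"   # "-" sends the error log to stderr
```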
Canonical `gunicorn.conf.py` probe filter:

```python
import logging
import re

_PROBE_PATH = re.compile(
    r"^(?:/live|/ready|/metrics|/nginx_status|/health[^ ]*|/ping|/mcp/health)/?(?:\?|$)"
)


class _ProbePathFilter(logging.Filter):
    """Drop access-log records whose request path is a health or metrics probe."""

    def filter(self, record: logging.LogRecord) -> bool:
        request = getattr(record, "args", None)
        if isinstance(request, dict):
            # Gunicorn access log atoms: 'U' = URL path, 'r' = full request line
            path = request.get("U") or ""
            if not path:
                # 'r' looks like "GET /live/ HTTP/1.1"; keep only the request target
                # so the anchored pattern can match it
                parts = str(request.get("r", "")).split()
                path = parts[1] if len(parts) > 1 else ""
        else:
            # No atom dict available; fall back to the formatted message
            path = record.getMessage()
        return not _PROBE_PATH.search(path)


_filter = _ProbePathFilter()


def on_starting(server):
    # Master process
    logging.getLogger("gunicorn.access").addFilter(_filter)


def post_worker_init(worker):
    # Each worker process
    logging.getLogger("gunicorn.access").addFilter(_filter)
```

Update the probe-path regex if the service exposes additional health endpoints (e.g. sidecar servers). Do NOT special-case by status code — a 500 on `/ready/` is noise in Gunicorn's access log but is already surfaced via the readiness probe failing and the error log.

### Nginx access-log filtering

The reverse proxy sees the same probe traffic and will log it unless filtered. Use a `map` + conditional `access_log`:

```nginx
http {
    map $request_uri $loggable {
        default                     1;
        ~^/live(/|\?|$)             0;
        ~^/ready(/|\?|$)            0;
        ~^/metrics(/|\?|$)          0;
        ~^/nginx_status(/|\?|$)     0;
        ~^/health                   0;
        ~^/ping(/|\?|$)             0;
        ~^/mcp/health(/|\?|$)       0;
    }

    access_log /var/log/nginx/access.log combined if=$loggable;
    # ...
}
```

This is an nginx-wide switch — do not duplicate per `location` block. Error logging is unaffected; genuine 4xx/5xx on probe paths still surface via the error log and the probe itself failing.

See [Red_Panda_Standards_V1-00.md §Health Check Endpoints](Red_Panda_Standards_V1-00.md#health-check-endpoints) for the full definition.

## Testing
- Framework: Django TestCase (not pytest)
- Separate test files per module: test_models.py, test_views.py, test_forms.py
@@ -266,6 +400,10 @@ ANGELIA_DB_PORT=5432
### Caching
- pymemcache — Memcached backend

### Observability
- django-prometheus — `/metrics` endpoint in Prometheus exposition format
- celery-exporter (or equivalent) — queue depth metrics for Celery workers

### Database
- psycopg[binary] — PostgreSQL adapter
- shortuuid — Short UUIDs for public URLs
@@ -339,3 +477,23 @@ ANGELIA_DB_PORT=5432
- Don't pass model instances to tasks (pass IDs and re-fetch)
- Don't assume tasks run immediately
- Don't forget retry logic for external service calls
- Don't run a Celery worker without a heartbeat (see Celery Observability)

### Logging
- Don't use `print()` — always use `logging.getLogger(__name__)`
- Don't log at ERROR for expected conditions (failed logins, 404s, validation errors)
- Don't log at INFO for successful probes of `/live`, `/ready`, `/metrics`
- Don't log passwords, tokens, API keys, session cookies, or PII at any level
- Don't use lowercase level names in Python code (UPPERCASE for Django/Python)

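The rule against logging secrets is best enforced at the call site, but a redacting filter can act as a backstop. The following is a minimal sketch under stated assumptions: the `RedactSecretsFilter` name and the list of key names are illustrative, and a pattern like this only catches `key=value` or `key: value` shapes, so it does not replace reviewing what gets logged.

```python
import logging
import re

# Illustrative key names only; extend to match what the service actually handles.
_SENSITIVE = re.compile(
    r"(?i)\b(password|token|api[_-]?key|sessionid)\b(\s*[=:]\s*)(\S+)"
)


class RedactSecretsFilter(logging.Filter):
    """Mask values that follow a sensitive key name in the formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        redacted = _SENSITIVE.sub(r"\1\2[REDACTED]", message)
        if redacted != message:
            # Replace the template and drop the args so the secret is not re-applied.
            record.msg, record.args = redacted, ()
        return True  # never drop the record, only rewrite it


logger = logging.getLogger(__name__)
logger.addFilter(RedactSecretsFilter())
logger.warning("login failed password=%s", "hunter2")
```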
---

## Exceptions

Per the main standard, deviations from Red Panda requirements MUST be recorded rather than hidden. Third-party Django packages, framework defaults, or deliberate trade-offs all go here.

| Service | Standard waived | Reason | Reviewed |
|---------|-----------------|--------|----------|
| _(add as discovered)_ | | | |

Exceptions MUST be re-reviewed on the doc's `Last reviewed` date. Remove entries whose underlying reason has gone away.