Docs: Red Panda Standards Update
341
docs/Red Panda Standards_Django_V1-01.md
Normal file
@@ -0,0 +1,341 @@
|
||||
# Red Panda Approval™ — Django Addendum
|
||||
|
||||
**Owner:** Robert Helewka <r@helu.ca>
|
||||
**Version:** 1.01
|
||||
**Last reviewed:** 2026-04-18
|
||||
**Parent document:** [Red_Panda_Standards_V1-00.md](Red_Panda_Standards_V1-00.md)
|
||||
|
||||
This document extends the main Red Panda Standards with Django-specific conventions. Where the two documents overlap, the **main standard governs**; this addendum only adds Django-specific detail or explicitly noted exceptions.
|
||||
|
||||
## 🐾 Red Panda Approval™
|
||||
|
||||
This project follows Red Panda Approval standards — our gold standard for Django application quality. Code must be elegant, reliable, and maintainable to earn the approval of our adorable red panda judges.
|
||||
|
||||
### The 5 Sacred Django Criteria
|
||||
1. **Fresh Migration Test** — Clean migrations from empty database
|
||||
2. **Elegant Simplicity** — No unnecessary complexity
|
||||
3. **Observable & Debuggable** — Proper logging and error handling
|
||||
4. **Consistent Patterns** — Follow Django conventions
|
||||
5. **Actually Works** — Passes all checks and serves real user needs
|
||||
|
||||
## Environment Standards
|
||||
- Virtual environment: ~/env/PROJECT/bin/activate
|
||||
- Use pyproject.toml for project configuration (no setup.py, no requirements.txt)
|
||||
- Python version: specified in pyproject.toml
|
||||
- Dependencies: floor-pinned with ceiling (e.g. `Django>=5.2,<6.0`)
|
||||
|
||||
### Dependency Pinning
|
||||
|
||||
```toml
# Correct: floor pin with ceiling
dependencies = [
    "Django>=5.2,<6.0",
    "djangorestframework>=3.14,<4.0",
    "cryptography>=41.0,<45.0",
]

# Wrong: exact pins in library packages
dependencies = [
    "Django==5.2.7",  # too strict, breaks downstream
]
```
|
||||
|
||||
Exact pins (`==`) are only appropriate in application-level lock files, not in reusable library packages.
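
The pinning rule can be checked mechanically. The sketch below is illustrative only: `check_pin` is a hypothetical helper (not part of any packaging tool) that uses plain string checks rather than a full PEP 508 parser.

```python
def check_pin(requirement: str) -> list[str]:
    """Return the problems found in one dependency string (empty if it complies)."""
    problems = []
    # Ignore environment markers such as '; python_version < "3.12"'.
    spec = requirement.split(";", 1)[0]
    if "==" in spec:
        problems.append("exact pin (==) belongs in a lock file")
    if ">=" not in spec:
        problems.append("missing floor (>=)")
    if "<" not in spec:
        problems.append("missing ceiling (<)")
    return problems


if __name__ == "__main__":
    for req in ("Django>=5.2,<6.0", "Django==5.2.7", "cryptography>=41.0"):
        print(req, "->", check_pin(req) or "ok")
```

A check like this could run in CI against the `dependencies` list read from `pyproject.toml`; that wiring is left out here.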
|
||||
|
||||
## Directory Structure
|
||||
myproject/                          # Git repository root
├── .gitignore
├── README.md
├── pyproject.toml                  # Project configuration (moved to repo root)
├── docker-compose.yml
├── .env                            # Docker Compose environment
│                                   #   ANGELIA_DB_ENGINE=postgresql
│                                   #   ANGELIA_DB_NAME=angelia2
│                                   #   ANGELIA_DB_USER=angelia
│                                   #   ANGELIA_DB_PASSWORD=changeme
│                                   #   ANGELIA_DB_HOST=db
│                                   #   ANGELIA_DB_PORT=5432
├── .env.example
│
├── project/                        # Django project root (manage.py lives here)
│   ├── manage.py
│   ├── Dockerfile
│   ├── .env                        # Local development environment
│   │                               #   ANGELIA_DB_ENGINE=sqlite
│   ├── .env.example
│   │
│   ├── config/                     # Django configuration module
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   ├── wsgi.py
│   │   └── asgi.py
│   │
│   ├── accounts/                   # Django app
│   │   ├── __init__.py
│   │   ├── models.py
│   │   ├── views.py
│   │   └── urls.py
│   │
│   ├── blog/                       # Django app
│   │   ├── __init__.py
│   │   ├── models.py
│   │   ├── views.py
│   │   └── urls.py
│   │
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   │
│   └── templates/
│       └── base.html
│
├── web/                            # Nginx configuration
│   └── nginx.conf
│
├── db/                             # PostgreSQL configuration
│   └── postgresql.conf
│
└── docs/                           # Project documentation
    └── index.md
|
||||
|
||||
## Settings Structure
|
||||
- Use a single settings.py file
|
||||
- Use django-environ or python-dotenv for environment variables
|
||||
- Never commit .env files to version control
|
||||
- Provide .env.example with all required variables documented
|
||||
- Create a `.gitignore` file
|
||||
- Create a .dockerignore file
|
||||
|
||||
## Environment Variables
|
||||
|
||||
All env vars in `.env` MUST use the `SERVICENAME_` prefix (per main standard). The examples below use `ANGELIA_` — substitute the actual service name for your app.
|
||||
|
||||
### PostgreSQL settings (only if `SERVICENAME_DB_ENGINE=postgresql`)
|
||||
```
ANGELIA_DB_NAME=angelia2
ANGELIA_DB_USER=angelia
ANGELIA_DB_PASSWORD=changeme
ANGELIA_DB_HOST=db
ANGELIA_DB_PORT=5432
```
|
||||
|
||||
### Rules
|
||||
- Never use `DATABASE_URL` or `dj-database-url` — always individual vars
|
||||
- Never use unprefixed or generically prefixed names (`DB_HOST`, `APP_DB_NAME`); always use the service prefix
|
||||
- The Django settings module (`settings.py`) reads each prefixed var explicitly so the full config is documented in one place
|
||||
- `.env` is gitignored; `.env.example` with placeholder values is committed
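
One way these rules might translate into `settings.py` is sketched below. `build_databases` is a hypothetical helper; the SQLite default and the `db.sqlite3` file name are assumptions rather than part of the standard. The sketch reads only individual prefixed variables and never a `DATABASE_URL`.

```python
import os
from typing import Mapping, Optional


def build_databases(env: Optional[Mapping[str, str]] = None, prefix: str = "ANGELIA_") -> dict:
    """Build a Django DATABASES dict from individual service-prefixed variables."""
    env = os.environ if env is None else env
    engine = env.get(f"{prefix}DB_ENGINE", "sqlite")  # assumed default: SQLite for development
    if engine == "sqlite":
        return {"default": {"ENGINE": "django.db.backends.sqlite3", "NAME": "db.sqlite3"}}
    if engine != "postgresql":
        raise ValueError(f"unsupported {prefix}DB_ENGINE: {engine!r}")
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            # Indexing (not .get) makes a missing required variable fail loudly at startup.
            "NAME": env[f"{prefix}DB_NAME"],
            "USER": env[f"{prefix}DB_USER"],
            "PASSWORD": env[f"{prefix}DB_PASSWORD"],
            "HOST": env[f"{prefix}DB_HOST"],
            "PORT": env.get(f"{prefix}DB_PORT", "5432"),
        }
    }
```

In a settings module this would be used as `DATABASES = build_databases()`, after the `.env` file has been loaded by whichever loader the project uses.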
|
||||
|
||||
## Code Organization
|
||||
- Imports: PEP 8 ordering (stdlib, third-party, local)
|
||||
- Type hints on function parameters
|
||||
- CSS: External .css files only (no inline styles, no embedded `<style>` tags)
|
||||
- JS: External .js files only (no inline handlers, no embedded `<script>` blocks)
|
||||
- Maximum file length: 1000 lines
|
||||
- If a file exceeds 500 lines, consider splitting by domain concept
|
||||
|
||||
## Database Conventions
|
||||
- Migrations run cleanly from empty database
|
||||
- Never edit deployed migrations
|
||||
- Use meaningful migration names: `python manage.py makemigrations --name add_email_to_profile`
|
||||
- One logical change per migration when possible
|
||||
- Test migrations both forward and backward
|
||||
|
||||
### Development vs Production
|
||||
- Development: SQLite
|
||||
- Production: PostgreSQL
|
||||
|
||||
## Caching
|
||||
- Expensive queries are cached
|
||||
- Cache keys follow naming convention
|
||||
- TTLs are appropriate (not infinite)
|
||||
- Invalidation is documented
|
||||
- Key naming pattern: `{app}:{model}:{identifier}:{field}`
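
The key pattern above can be enforced with a small builder so keys are never assembled by hand. This is a sketch; `cache_key` is a hypothetical helper, and the validation rules are assumptions based on Memcached's key limits (no whitespace, at most 250 bytes).

```python
def cache_key(app: str, model: str, identifier: object, field: str) -> str:
    """Build a key of the form {app}:{model}:{identifier}:{field}."""
    parts = [app, model, str(identifier), field]
    for part in parts:
        # Reject empty segments and characters that would break the pattern or Memcached.
        if not part or ":" in part or any(ch.isspace() for ch in part):
            raise ValueError(f"invalid cache key segment: {part!r}")
    key = ":".join(parts)
    if len(key.encode("utf-8")) > 250:
        raise ValueError("cache key exceeds the 250-byte Memcached limit")
    return key


if __name__ == "__main__":
    print(cache_key("blog", "post", "a1b2c3", "comment_count"))
```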
|
||||
|
||||
## Model Naming
|
||||
- Model names: singular PascalCase (User, BlogPost, OrderItem)
|
||||
- Correct English pluralization on related names
|
||||
- All models have created_at and updated_at
|
||||
- All models define `__str__` and `get_absolute_url`
|
||||
- TextChoices used for status fields
|
||||
- related_name defined on ForeignKey fields
|
||||
- Related names: plural snake_case with proper English pluralization
|
||||
|
||||
## Forms
|
||||
- Use `ModelForm` with an explicit `fields` list (never `__all__`)
|
||||
|
||||
## Field Naming
|
||||
- Foreign keys: singular without _id suffix (author, category, parent)
|
||||
- Boolean fields: use prefixes (is_active, has_permission, can_edit)
|
||||
- Date fields: use suffixes (created_at, updated_at, published_on)
|
||||
- Avoid abbreviations (use description, not desc)
|
||||
|
||||
## Required Model Fields
|
||||
- All models should include:
  - `created_at = models.DateTimeField(auto_now_add=True)`
  - `updated_at = models.DateTimeField(auto_now=True)`
- Consider adding:
  - `id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)` for public-facing models
  - `is_active = models.BooleanField(default=True)` for soft deletes
|
||||
|
||||
## Indexing
|
||||
- Add db_index=True to frequently queried fields
|
||||
- Use Meta.indexes for composite indexes
|
||||
- Document why each index exists
|
||||
|
||||
## Queries
|
||||
- Use select_related() for foreign keys
|
||||
- Use prefetch_related() for reverse relations and M2M
|
||||
- Avoid queries in loops (N+1 problem)
|
||||
- Use .only() and .defer() for large models
|
||||
- Add comments explaining complex querysets
|
||||
|
||||
## Docstrings
|
||||
- Use Sphinx style docstrings
|
||||
- Document all public functions, classes, and modules
|
||||
- Skip docstrings for obvious one-liners and standard Django overrides
|
||||
|
||||
## Views
|
||||
- Use Function-Based Views (FBVs) exclusively
|
||||
- Explicit logic is preferred over implicit inheritance
|
||||
- Extract shared logic into utility functions
|
||||
|
||||
## URLs & Identifiers
|
||||
|
||||
- Public URLs use short UUIDs (12 characters) via `shortuuid`
|
||||
- Never expose sequential IDs in URLs (security/enumeration risk)
|
||||
- Internal references may use standard UUIDs or PKs
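
To illustrate what a 12-character non-sequential public identifier looks like, here is a standard-library stand-in. It is not the `shortuuid` package that the standard calls for; `new_public_id` and `ALPHABET` are hypothetical names, and the alphabet simply omits look-alike characters (0, 1, I, O, l).

```python
import secrets

# 57 unambiguous characters: digits 2-9, uppercase without I/O, lowercase without l.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def new_public_id(length: int = 12) -> str:
    """Return a random, non-sequential identifier suitable for public URLs."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(new_public_id())
```

Because the value is random rather than derived from the primary key, it reveals nothing about row counts or creation order. The model field storing it should still carry a unique constraint.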
|
||||
|
||||
## URL Patterns
|
||||
- Resource-based URLs (RESTful style)
|
||||
- Namespaced URL names per app
|
||||
- Trailing slashes (Django default)
|
||||
- Flat structure preferred over deep nesting
|
||||
|
||||
## Background Tasks
|
||||
- All tasks are run synchronously unless the design specifies background tasks are needed for long operations
|
||||
- Long operations use Celery tasks
|
||||
- Store task progress in Memcached under the key pattern `{app}:task:{task_id}:progress`
|
||||
- Tasks are idempotent
|
||||
- Tasks include retry logic
|
||||
- Tasks live in app/tasks.py
|
||||
- RabbitMQ is the Message Broker
|
||||
- Use Flower to monitor workers and debug failed tasks
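
The progress key pattern can be sketched with two small helpers. Both names are hypothetical, and the payload shape (`current`, `total`, `percent`) is an assumption; the standard only fixes the key format.

```python
def task_progress_key(app: str, task_id: str) -> str:
    """Build the cache key under which a task publishes its progress."""
    return f"{app}:task:{task_id}:progress"


def progress_payload(current: int, total: int) -> dict:
    """Build the value stored under the progress key; percent is clamped to 0-100."""
    percent = 0 if total <= 0 else max(0, min(100, round(100 * current / total)))
    return {"current": current, "total": total, "percent": percent}


if __name__ == "__main__":
    print(task_progress_key("blog", "3f9c2a"), progress_payload(5, 20))
```

A task would write this payload to the cache with a finite TTL as it works, and a polling view would read it back using the same key.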
|
||||
|
||||
## Testing
|
||||
- Framework: Django TestCase (not pytest)
|
||||
- Separate test files per module: test_models.py, test_views.py, test_forms.py
|
||||
|
||||
## Frontend Standards
|
||||
|
||||
### New Projects (DaisyUI + Tailwind)
|
||||
- DaisyUI 4 via CDN for component classes
|
||||
- Tailwind CSS via CDN for utility classes
|
||||
- Theme management via Themis (DaisyUI `data-theme` attribute)
|
||||
- All apps extend `themis/base.html` for consistent navigation
|
||||
- No inline styles or scripts
|
||||
|
||||
### Existing Projects (Bootstrap 5)
|
||||
- Bootstrap 5 via CDN
|
||||
- Bootstrap Icons via CDN
|
||||
- Bootswatch for theme variants (if applicable)
|
||||
- django-bootstrap5 and crispy-bootstrap5 for form rendering
|
||||
|
||||
## Preferred Packages
|
||||
|
||||
### Core Django
|
||||
- django>=5.2,<6.0
|
||||
- django-environ — Environment variables
|
||||
|
||||
### Authentication & Security
|
||||
- django-allauth — User management
|
||||
- django-allauth-2fa — Two-factor authentication
|
||||
|
||||
### API Development
|
||||
- djangorestframework>=3.14,<4.0 — REST APIs
|
||||
- drf-spectacular — OpenAPI/Swagger documentation
|
||||
|
||||
### Encryption
|
||||
- cryptography — Fernet encryption for secrets/API keys
|
||||
|
||||
### Background Tasks
|
||||
- celery — Async task queue
|
||||
- django-celery-progress — Progress bars
|
||||
- flower — Celery monitoring
|
||||
|
||||
### Caching
|
||||
- pymemcache — Memcached backend
|
||||
|
||||
### Database
|
||||
- psycopg[binary] — PostgreSQL adapter
|
||||
- shortuuid — Short UUIDs for public URLs
|
||||
|
||||
### Production
|
||||
- gunicorn — WSGI server
|
||||
|
||||
### Shared Apps
|
||||
- django-heluca-themis — User preferences, themes, key management, navigation
|
||||
|
||||
### Deprecated / Removed
|
||||
- ~~pytz~~ — Use stdlib `zoneinfo` (Python 3.9+, Django 4+)
|
||||
- ~~Pillow~~ — Only add if your app needs ImageField
|
||||
- ~~django-heluca-core~~ — Replaced by Themis
|
||||
- ~~dj-database-url~~ — Use individual Django DB env vars instead
|
||||
|
||||
## Anti-Patterns to Avoid
|
||||
|
||||
### Models
|
||||
- Don't use `Model.objects.get()` without handling `DoesNotExist`
|
||||
- Don't use `null=True` on `CharField` or `TextField` (use `blank=True, default=""`)
|
||||
- Don't use `related_name='+'` unless you have a specific reason
|
||||
- Don't override `save()` for business logic (use signals or service functions)
|
||||
- Don't use `auto_now=True` on fields you might need to manually set
|
||||
- Don't use `ForeignKey` without specifying `on_delete` explicitly
|
||||
- Don't use `Meta.ordering` on large tables (specify ordering in queries)
|
||||
|
||||
### Queries
|
||||
- Don't query inside loops (N+1 problem)
|
||||
- Don't use `.all()` when you need a subset
|
||||
- Don't use raw SQL unless absolutely necessary
|
||||
- Don't forget `select_related()` and `prefetch_related()`
|
||||
|
||||
### Views
|
||||
- Don't put business logic in views
|
||||
- Don't use `request.POST.get()` without validation (use forms)
|
||||
- Don't return sensitive data in error messages
|
||||
- Don't forget `login_required` decorator on protected views
|
||||
|
||||
### Forms
|
||||
- Don't use `fields = '__all__'` in ModelForm
|
||||
- Don't trust client-side validation alone
|
||||
- Don't use `exclude` in ModelForm (use explicit `fields`)
|
||||
|
||||
### Templates
|
||||
- Don't use `{{ variable }}` for URLs (use `{% url %}` tag)
|
||||
- Don't put logic in templates
|
||||
- Don't use inline CSS or JavaScript (external files only)
|
||||
- Don't forget `{% csrf_token %}` in forms
|
||||
|
||||
### Security
|
||||
- Don't store secrets in `settings.py` (use environment variables)
|
||||
- Don't commit `.env` files to version control
|
||||
- Don't use `DEBUG=True` in production
|
||||
- Don't expose sequential IDs in public URLs
|
||||
- Don't use `mark_safe()` on user-supplied content
|
||||
- Don't disable CSRF protection
|
||||
|
||||
### Imports & Code Style
|
||||
- Don't use `from module import *`
|
||||
- Don't use mutable default arguments
|
||||
- Don't use bare `except:` clauses
|
||||
- Don't ignore linter warnings without documented reason
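
Two of these anti-patterns are easiest to see next to their fixes. The functions below are hypothetical examples, not code from any project covered by this standard.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def add_tag(tag: str, tags: Optional[List[str]] = None) -> List[str]:
    """Append a tag, using a None sentinel instead of a mutable default."""
    # With `tags=[]` as the default, every call would share one list object.
    tags = [] if tags is None else tags
    tags.append(tag)
    return tags


def parse_port(raw: str, default: int = 5432) -> int:
    """Parse a port number, catching only the exceptions int() can raise."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # A bare `except:` here would also swallow KeyboardInterrupt and SystemExit.
        logger.warning("invalid port %r, falling back to %d", raw, default)
        return default
```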
|
||||
|
||||
### Migrations
|
||||
- Don't edit migrations that have been deployed
|
||||
- Don't use `RunPython` without a reverse function
|
||||
- Don't add non-nullable fields without a default value
|
||||
|
||||
### Celery Tasks
|
||||
- Don't pass model instances to tasks (pass IDs and re-fetch)
|
||||
- Don't assume tasks run immediately
|
||||
- Don't forget retry logic for external service calls
|
||||
292
docs/Red_Panda_Standards_V1-00.md
Normal file
@@ -0,0 +1,292 @@
|
||||
# Red Panda Approval™ Standards
|
||||
|
||||
Quality and observability standards for the Ouranos Lab. All infrastructure code, application code, and LLM-generated code deployed into this environment must meet these standards.
|
||||
|
||||
**Owner:** Robert Helewka <r@helu.ca>
|
||||
**Version:** 1.00
|
||||
**Last reviewed:** 2026-04-18
|
||||
|
||||
---
|
||||
|
||||
## 🐾 Red Panda Approval™
|
||||
|
||||
All implementations must meet the 5 Sacred Criteria:
|
||||
|
||||
1. **Fresh Environment Test** — Clean runs on new systems without drift. No leftover state, no manual steps.
|
||||
2. **Elegant Simplicity** — Modular, reusable, no copy-paste sprawl. One playbook per concern.
|
||||
3. **Observable & Auditable** — Clear task names, proper logging, check mode compatible. You can see what happened.
|
||||
4. **Idempotent Patterns** — Run multiple times with consistent results. No side effects on re-runs.
|
||||
5. **Actually Provisions & Configures** — Resources work, dependencies resolve, services integrate. It does the thing.
|
||||
|
||||
---
|
||||
|
||||
## Vault Security
|
||||
|
||||
All sensitive information is encrypted using Ansible Vault with AES256 encryption.
|
||||
|
||||
**Encrypted secrets:**
|
||||
- Database passwords (PostgreSQL, Neo4j)
|
||||
- API keys (OpenAI, Anthropic, Mistral, Groq)
|
||||
- Application secrets (Grafana, SearXNG, Arke)
|
||||
- Monitoring alerts (AlertManager email integration)
|
||||
|
||||
**Security rules:**
|
||||
- AES256 encryption with `ansible-vault`
|
||||
- Password file for automation — never pass `--vault-password-file` inline in scripts
|
||||
- Vault variables use the `vault_` prefix; map to friendly names in `group_vars/all/vars.yml`
|
||||
- No secrets in plain text files, ever
|
||||
|
||||
---
|
||||
|
||||
## Log Level Standards
|
||||
|
||||
All services in the Ouranos Lab MUST follow these log level conventions. These rules apply to application code, infrastructure services, and any LLM-generated code deployed into this environment. Log output flows through Alloy → Loki → Grafana, so disciplined leveling is not cosmetic — it directly determines alert quality, dashboard usefulness, and on-call signal-to-noise ratio.
|
||||
|
||||
### Level Definitions
|
||||
|
||||
| Level | When to Use | What MUST Be Included | Loki / Grafana Role |
|-------|-------------|----------------------|---------------------|
| **ERROR** | Something is broken and requires human intervention. The service cannot fulfil the current request or operation. | Exception class, message, stack trace, and relevant context (request ID, user, resource identifier). Never a bare `"something failed"`. | AlertManager rules fire on `level=~"error\|fatal\|critical"`. These trigger email notifications. |
| **WARNING** | Degraded but self-recovering: retries succeeding, fallback paths taken, thresholds approaching, deprecated features invoked. | What degraded, what recovery action was taken, current metric value vs. threshold. | Grafana dashboard panels. Rate-based alerting (e.g., >N warnings/min). |
| **INFO** | Significant lifecycle and business events: service start/stop, configuration loaded, deployment markers, user authentication, job completion, schema migrations. | The event and its outcome. This level tells the *story* of what the system did. | Default production visibility. The go-to level for post-incident timelines. |
| **DEBUG** | Diagnostic detail for active troubleshooting: request/response payloads, SQL queries, internal state, variable values. | **Actionable context is mandatory.** A DEBUG line with no detail is worse than no line at all. Include variable values, object states, or decision paths. | Never enabled in production by default. Used on-demand via per-service level override. |
|
||||
|
||||
### Anti-Patterns
|
||||
|
||||
These are explicit violations of Ouranos logging standards:
|
||||
|
||||
| ❌ Anti-Pattern | Why It's Wrong | ✅ Correct Approach |
|----------------|---------------|-------------------|
| Health/metrics checks logged at INFO (`GET /live → 200 OK`, `GET /metrics → 200 OK`) | Routine HAProxy/Prometheus probes flood syslog with thousands of identical lines per hour, burying real events. | Suppress successful probes to `/live`, `/ready`, `/metrics`, `/health*`, `/ping` from access logs entirely. Non-2xx responses MUST still log. |
| DEBUG with no context (`logger.debug("error occurred")`) | Provides zero diagnostic value. If DEBUG is noisy *and* useless, nobody will ever enable it. | `logger.debug("PaymentService.process failed: order_id=%s, provider=%s, response=%r", oid, provider, resp)` |
| ERROR without exception details (`logger.error("task failed")`) | Cannot be triaged without reproduction steps. Wastes on-call time. | `logger.error("Celery task invoice_gen failed: order_id=%s", oid, exc_info=True)` |
| Logging sensitive data at any level | Passwords, tokens, API keys, and PII in Loki are a security incident. | Mask or redact: `api_key=sk-...a3f2`, `password=*****`. |
| Inconsistent level casing | Breaks LogQL filters and Grafana label selectors. | **Python / Django**: UPPERCASE (`INFO`, `WARNING`, `ERROR`, `DEBUG`). **Go / infrastructure** (HAProxy, Alloy, Gitea): lowercase (`info`, `warn`, `error`, `debug`). |
| Logging expected conditions as ERROR | A user entering a wrong password is not an error; it is normal business logic. | Use WARNING or INFO for expected-but-notable conditions. Reserve ERROR for things that are actually broken. |
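
The "correct approach" column can be sketched in a few lines of standard-library Python. `mask_secret` and `charge` are hypothetical names, and the masking rule (keep a short prefix up to the first `-` plus the last four characters) is an assumption modelled on the `sk-...a3f2` example above.

```python
import logging

logger = logging.getLogger("payments")


def mask_secret(value: str, keep: int = 4) -> str:
    """Redact a secret for logging, e.g. 'sk-abcdef123456a3f2' -> 'sk-...a3f2'."""
    if len(value) <= keep * 2:
        return "*****"  # too short to show any part safely
    prefix = value.split("-", 1)[0] + "-" if "-" in value[:8] else ""
    return f"{prefix}...{value[-keep:]}"


def charge(order_id: str, api_key: str) -> bool:
    """Log a failure at ERROR with context and a stack trace, never the raw key."""
    try:
        raise ConnectionError("provider timeout")  # stand-in for a real provider call
    except ConnectionError:
        logger.error(
            "PaymentService.charge failed: order_id=%s api_key=%s",
            order_id,
            mask_secret(api_key),
            exc_info=True,  # attaches exception class, message, and traceback
        )
        return False
```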
|
||||
|
||||
### Health Check & Monitoring Endpoint Rule
|
||||
|
||||
> All services MUST suppress successful (2xx/3xx) access log entries for health and monitoring endpoints: `/live`, `/ready`, `/health`, `/healthz`, `/api/health`, `/metrics`, `/ping`. Health check success is the *absence* of errors, not the presence of 200s. If your syslog shows a successful probe of one of these endpoints, your log level is wrong.
|
||||
>
|
||||
> Non-2xx responses to these paths MUST still be logged — a failing `/ready` is a real signal.
|
||||
|
||||
**Implementation guidance:**
|
||||
- **Django / Gunicorn**: Filter health paths in the access log handler or use middleware that skips logging for probe user-agents.
|
||||
- **FastAPI / Uvicorn**: Add a `logging.Filter` on the `uvicorn.access` logger that matches health paths in the access log message. Uvicorn's access log format includes the full request line in quotes (e.g., `"GET /live HTTP/1.1"`), so filter regexes must account for that. See also the structured logging notes below.
|
||||
- **nginx containers**: nginx does not log through Python loggers, so app-level filters do not apply. Suppress probe access lines at the nginx config level using `map` on `$request_uri` or `$status`:
|
||||
  ```nginx
  # 1 when the request targets a health or monitoring path
  map $request_uri $is_probe {
      ~^/(live|ready|metrics|health|healthz|api/health|ping)(/|$|\?) 1;
      default 0;
  }

  # Suppress only successful (2xx/3xx) probe responses; everything else is logged
  map "$is_probe:$status" $loggable {
      ~^1:[23] 0;
      default 1;
  }

  server {
      access_log /var/log/nginx/access.log combined if=$loggable;
  }
  ```
|
||||
Applies to every nginx-based container (static frontends, reverse proxies, sidecars).
|
||||
- **Docker services**: Configure the application's internal logging to exclude health routes — the syslog driver forwards everything it receives.
|
||||
- **HAProxy**: HAProxy's own health check logs (`option httpchk`) should remain at the HAProxy level for connection debugging, but backend application responses to those probes must not surface at INFO.
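
For Python services, the suppression rule can be sketched as a `logging.Filter`. This is a minimal sketch that assumes the access log message contains the quoted request line followed by the status code (for example `"GET /live HTTP/1.1" 200`); `ProbeFilter` is a hypothetical name.

```python
import logging
import re

# Matches a probe request line plus its status code; tolerates a trailing slash or query string.
_PROBE = re.compile(
    r'"(?:GET|HEAD) /(?:live|ready|health|healthz|api/health|metrics|ping)/?'
    r'(?:\?\S*)? HTTP/[\d.]+" (\d{3})'
)


class ProbeFilter(logging.Filter):
    """Drop successful (2xx/3xx) health and monitoring probes from access logs."""

    def filter(self, record: logging.LogRecord) -> bool:
        match = _PROBE.search(record.getMessage())
        if match is None:
            return True  # not a probe request: keep the line
        # Keep failing probes: a 503 from /ready is a real signal.
        return not match.group(1).startswith(("2", "3"))
```

It would be attached with `logging.getLogger("uvicorn.access").addFilter(ProbeFilter())`, or to the equivalent Gunicorn access logger. The pattern covers both `/ready` and `/ready/`.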
|
||||
|
||||
### Background Worker & Queue Monitoring
|
||||
|
||||
> **The most dangerous failure is the one that produces no logs.**
|
||||
|
||||
When a background worker (Celery task consumer, RabbitMQ subscriber, Gitea Runner, cron job) fails to start or crashes on startup, it generates no ongoing log output. Error-rate dashboards stay green because there is no process running to produce errors. Meanwhile, queues grow unbounded and work silently stops being processed.
|
||||
|
||||
**Required practices:**
|
||||
|
||||
1. **Heartbeat logging** — Every long-running background worker MUST emit a periodic INFO-level heartbeat (e.g., `"worker alive, processed N jobs in last 5m, queue depth: M"`). **Cadence: every 60 seconds.** The staleness alert fires after 10 minutes of silence (= 10 consecutive missed heartbeats), which gives enough margin to absorb transient Loki ingestion lag without flapping. The *absence* of this heartbeat is the alertable condition.
|
||||
|
||||
2. **Startup and shutdown at INFO** — Worker start, ready, graceful shutdown, and crash-exit are significant lifecycle events. These MUST log at INFO.
|
||||
|
||||
3. **Queue depth as a metric** — RabbitMQ queue depths and any application-level task queues MUST be exposed as Prometheus metrics. A growing queue with zero consumer activity is an **ERROR**-level alert, not a warning.
|
||||
|
||||
4. **Grafana "last seen" alerts** — For every background worker, configure a Grafana alert using `absent_over_time()` or equivalent staleness detection: *"Worker X has not logged a heartbeat in >10 minutes"* → ERROR severity → email notification via AlertManager.
|
||||
|
||||
5. **Crash-on-start is ERROR** — If a worker exits within seconds of starting (missing config, failed DB connection, import error), the exit MUST be captured at ERROR level by the service manager (`systemd OnFailure=`, Docker restart policy logs). Do not rely on the crashing application to log its own death — it may never get the chance.
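
The heartbeat practice can be sketched as a small daemon thread. This is illustrative only: `Heartbeat` is a hypothetical class, `get_queue_depth` is a caller-supplied callable, and the message wording follows the example in item 1.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("worker.heartbeat")


class Heartbeat(threading.Thread):
    """Emit an INFO heartbeat every `interval` seconds until stopped."""

    def __init__(self, get_queue_depth: Callable[[], int], interval: float = 60.0):
        super().__init__(daemon=True, name="heartbeat")
        self.get_queue_depth = get_queue_depth
        self.interval = interval
        self.stopped = threading.Event()
        self.count_lock = threading.Lock()
        self.processed = 0

    def job_done(self) -> None:
        """Call once per finished job so the next heartbeat can report throughput."""
        with self.count_lock:
            self.processed += 1

    def beat(self) -> str:
        """Log one heartbeat line and reset the per-interval job counter."""
        with self.count_lock:
            processed, self.processed = self.processed, 0
        message = (
            f"worker alive, processed {processed} jobs in last "
            f"{self.interval:.0f}s, queue depth: {self.get_queue_depth()}"
        )
        logger.info(message)
        return message

    def run(self) -> None:
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self.stopped.wait(self.interval):
            self.beat()

    def stop(self) -> None:
        self.stopped.set()
```

Because the service default level is `WARNING`, the heartbeat logger would need its level set to `INFO` explicitly, otherwise the staleness alert would fire on a healthy worker.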
|
||||
|
||||
### Production Defaults
|
||||
|
||||
| Service Category | Default Level | Rationale |
|-----------------|---------------|-----------|
| Django apps (Angelia, Athena, Kairos, Icarlos, Spelunker, Peitho, MCP Switchboard) | `WARNING` | Business logic: only degraded or broken conditions surface. Lifecycle events (start/stop/deploy) still log at INFO via Gunicorn and systemd. |
| FastAPI apps (Periplus) | `WARNING` | Same rationale as Django. Uvicorn lifecycle events (start/stop) are pinned to INFO via the `uvicorn.error` logger regardless of app log level. |
| Gunicorn / Uvicorn / nginx access logs | Suppress successful probes to `/live`, `/ready`, `/metrics`, `/health*`, `/ping` | Routine request logging deferred to HAProxy access logs in Loki. |
| Infrastructure agents (Alloy, Prometheus, Node Exporter) | `warn` | Stable; do not change without cause. |
| HAProxy (Titania) | `warning` | Connection-level logging handled by HAProxy's own log format → Alloy → Loki. |
| Databases (PostgreSQL, Neo4j) | `warning` | Query-level logging only enabled for active troubleshooting. |
| Docker services (Gitea, LobeChat, Nextcloud, AnythingLLM, SearXNG) | `warn` / `warning` | Per-service default. Tune individually if needed. |
| LLM Proxy (Arke) | `info` | Token usage tracking and provider routing decisions justify INFO. Review periodically for noise. |
| Observability stack (Grafana, Loki, AlertManager) | `warn` | Should be quiet unless something is wrong with observability itself. |
|
||||
|
||||
### Structured Logging — FastAPI / Uvicorn
|
||||
|
||||
FastAPI apps using uvicorn require special handling to achieve JSON-structured log output for the Alloy → Loki pipeline. Uvicorn manages its own loggers aggressively, and naive approaches will fail silently.
|
||||
|
||||
**Required practices:**
|
||||
|
||||
1. **Override uvicorn's handlers, don't just add to root** — Uvicorn's `config.load()` creates its own `StreamHandler` instances on `uvicorn`, `uvicorn.error`, and `uvicorn.access`. You must remove these handlers and set `propagate = True` so log records flow to the root logger where your JSON formatter lives.
|
||||
|
||||
2. **Re-apply logging config in the lifespan** — Configuring logging at module import time is not sufficient. Uvicorn's `config.load()` runs *after* your module is imported but *before* the ASGI lifespan starts. Call your logging configuration function again inside the FastAPI `lifespan` context manager to recapture control.
|
||||
|
||||
3. **Remap uvicorn logger names** — Uvicorn uses `uvicorn.error` for all lifecycle messages (startup, shutdown, errors) despite the misleading name. Remap it to `uvicorn` in your JSON formatter's output for clarity in Loki queries.
|
||||
|
||||
4. **Use `pydantic-settings` with `extra = "ignore"`** — When loading config from `.env` files that contain variables for other services (e.g., oauth2-proxy), pydantic-settings will reject unknown fields by default. Always set `extra = "ignore"` in the model config.
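
Practices 1 to 3 can be sketched with the standard `logging` module alone. The JSON field names (`time`, `level`, `logger`, `message`) and the function names are assumptions; the uvicorn logger names are the ones listed above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    RENAMES = {"uvicorn.error": "uvicorn"}  # lifecycle logger, despite its name

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": self.RENAMES.get(record.name, record.name),
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.WARNING) -> None:
    """Install the JSON handler on the root logger and take over uvicorn's loggers."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
    for name in ("uvicorn", "uvicorn.error", "uvicorn.access"):
        uv_logger = logging.getLogger(name)
        uv_logger.handlers.clear()   # drop the handlers uvicorn installed
        uv_logger.propagate = True   # let records reach the root JSON handler
```

Following practice 2, `configure_logging()` would be called once at import time and again as the first statement inside the FastAPI `lifespan` context manager.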
|
||||
|
||||
### Loki & Grafana Alignment
|
||||
|
||||
**Label normalization**: Alloy pipelines (syslog listeners and journal relabeling) MUST extract and forward a `level` label on every log line. Without a `level` label, the log entry is invisible to level-based dashboard filters and alert rules.
|
||||
|
||||
**LogQL conventions for dashboards:**
|
||||
```logql
# Production error monitoring (default dashboard view)
{job="syslog", hostname="puck"} | json | level=~"error|fatal|critical"

# Warning-and-above for a specific service
{service_name="haproxy"} | logfmt | level=~"warn|error|fatal"

# Debug-level troubleshooting (temporary, never permanent dashboards)
{container="angelia"} | json | level="debug"
```
|
||||
|
||||
**Alerting rules** — Grafana alert rules MUST key off the normalized `level` label:
|
||||
- `level=~"error|fatal|critical"` → Immediate email notification via AlertManager
|
||||
- `absent_over_time({service_name="celery_worker"}[10m])` → Worker heartbeat staleness → ERROR severity
|
||||
- Rate-based: `rate({service_name="arke"} | json | level="error" [5m]) > 0.1` → Sustained error rate
|
||||
|
||||
**Retention alignment**: Loki retention policies MUST preserve higher-severity logs at least as long as lower-severity ones. Target retention:
|
||||
|
||||
| Level | Retention | Rationale |
|-------|-----------|-----------|
| DEBUG | 7 days | Troubleshooting context only; stale debug data is noise. |
| INFO | 30 days | Post-incident timelines and lifecycle review. |
| WARNING | 90 days | Degradation trend analysis across release cycles. |
| ERROR / FATAL / CRITICAL | 90 days | Incident review, root-cause investigation, compliance. |
|
||||
|
||||
DEBUG-level logs generated during troubleshooting sessions should be explicitly cleaned up if they would blow past the 7-day budget.
|
||||
|
||||
---
|
||||
|
||||
## Health Check Endpoints
|
||||
|
||||
All services MUST expose Kubernetes-style health endpoints at these paths:
|
||||
|
||||
| Endpoint | Purpose | Auth |
|----------|---------|------|
| `GET /live` | **Liveness**: process is running and accepting connections | None |
| `GET /ready` | **Readiness**: process is running AND all dependencies (DB, cache, upstream APIs) are healthy | None |
| `GET /metrics` | Prometheus metrics | IP-restricted (no JWT) |
|
||||
|
||||
- HAProxy uses `health_path: /ready/` (trailing slash) for backend health checks — return HTTP 200 when ready
|
||||
- Health endpoints MUST NOT require authentication
|
||||
- Third-party services use their native paths (`/api/health`, `/api/healthz`, `/-/healthy`, etc.)
|
||||
|
||||
**Trailing slash**: The standard path is `/ready/` with a trailing slash. Django's `APPEND_SLASH` handling, FastAPI route declarations, and nginx `location` blocks all differ in how they treat the slash. Services that cannot comply (framework redirects, third-party apps) MUST be recorded in the Exceptions section below. Access-log suppression filters MUST match both `/ready` and `/ready/` forms.
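
The readiness contract (200 only when every dependency is healthy) can be sketched independently of any web framework. `readiness` is a hypothetical helper; returning 503 when not ready and the shape of the response body are assumptions, since the standard only specifies the 200 case.

```python
from typing import Callable, Mapping


def readiness(checks: Mapping[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run each dependency check and return (HTTP status, JSON-serialisable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a check that raises counts as not ready
            results[name] = f"error: {exc.__class__.__name__}"
    ready = all(value == "ok" for value in results.values())
    return (200 if ready else 503), {"ready": ready, "checks": results}
```

A `/ready/` view would call this with one callable per dependency (database ping, cache round trip, upstream API probe) and return the status and body unchanged. `/live` needs no checks at all.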
|
||||
|
||||
### Docker Compose Healthchecks
|
||||
|
||||
Use `curl -f` (install curl in images if needed). Do not use `wget --spider`.
|
||||
|
||||
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/live"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
```
|
||||
|
||||
---
|
||||
|
||||
## Endpoint Protection
|
||||
|
||||
| Protected (require valid JWT) | Unprotected |
|-------------------------------|-------------|
| All `/api/v1/*` routes (except `/api/v1/telemetry`) | `GET /live` |
| | `GET /ready` |
| | `GET /metrics` (IP-restricted to internal networks) |
| | `GET /api/auth/login-url` |
| | `POST /api/auth/token` |
| | `POST /api/v1/telemetry` (sendBeacon cannot set headers) |
|
||||
|
||||
> **Why `/api/v1/telemetry` is unprotected**: The browser `sendBeacon` API cannot set `Authorization` headers. The telemetry endpoint must be open to receive client-side error reports and performance data, or browser errors will be silently lost.
|
||||
|
||||
---
|
||||
|
||||
## Prometheus Metrics
|
||||
|
||||
All services MUST expose `GET /metrics` in Prometheus exposition format, scraped by Prospero's Prometheus at 15s intervals.
|
||||
|
||||
- **IP-restricted** to internal networks: `10.10.0.0/24`, `172.16.0.0/12`, `127.0.0.0/8`
|
||||
- No JWT required — HAProxy and Prometheus scrapers cannot authenticate
|
||||
- Useful metrics to expose: request totals and durations, error rates, active connections, queue depths, dependency health
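
The IP restriction can be expressed with the standard `ipaddress` module. This is a sketch: `metrics_allowed` is a hypothetical helper, and how the client address is obtained is left to the framework.

```python
import ipaddress

# Internal networks permitted to scrape /metrics, as listed above.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.10.0.0/24", "172.16.0.0/12", "127.0.0.0/8")
]


def metrics_allowed(remote_addr: str) -> bool:
    """Return True if the client address falls inside an allowed internal network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: deny
    return any(address in network for network in ALLOWED_NETWORKS)
```

When the service sits behind a reverse proxy, the address checked must be the real client address taken from a trusted source. A client-supplied forwarding header must not be accepted blindly.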
|
||||
|
||||
---
|
||||
|
||||
## Browser Telemetry
|
||||
|
||||
Frontend/browser code MUST report errors and performance data back to the server.
|
||||
|
||||
- Send to `POST /api/v1/telemetry` — unprotected endpoint
|
||||
- Capture: JavaScript exceptions, promise rejections, resource load failures, performance metrics
|
||||
- The server MUST log client-side exceptions at **WARNING** level (they indicate user-facing problems but are not server failures)
|
||||
- Include enough context to reproduce: URL, user agent, error message, stack trace (if available)
|
||||
|
||||
---
|
||||
|
||||
## Environment Variable Naming
|
||||
|
||||
All environment variables for an application MUST use a consistent prefix matching the service name (e.g., `PERIPLUS_`, `ARKE_`, `ANGELIA_`). This applies to every variable in the `.env` file, including those consumed by sidecar services like oauth2-proxy.
|
||||
|
||||
**Rules:**
|
||||
- All vars in `.env` use the `SERVICENAME_` prefix — no exceptions
|
||||
- `compose.yaml` maps prefixed vars to the sidecar's expected names (e.g., `OAUTH2_PROXY_CLIENT_ID: ${PERIPLUS_CASDOOR_CLIENT_ID}`)
|
||||
- The application's Settings model SHOULD declare all prefixed vars, even those only consumed by sidecars, so the full configuration is documented in one place
|
||||
- Every repo MUST include a `.env.example` with placeholder values for all required variables. Add `!.env.example` to `.gitignore` if a broad `.env.*` pattern would otherwise exclude it
|
||||
- `.env` files with real secrets are ALWAYS gitignored — no exceptions
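
The prefix rule lends itself to an automated check. The sketch below is a simplified `.env` scanner, not a full dotenv parser; `unprefixed_vars` is a hypothetical helper.

```python
def unprefixed_vars(env_text: str, prefix: str) -> list[str]:
    """Return the variable names in a .env file that do not start with `prefix`."""
    offenders = []
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without an assignment
        name = line.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        if not name.startswith(prefix):
            offenders.append(name)
    return offenders


if __name__ == "__main__":
    sample = "PERIPLUS_DB_HOST=db\nOAUTH2_PROXY_CLIENT_ID=abc\n"
    print(unprefixed_vars(sample, "PERIPLUS_"))
```

Run against `.env.example` in CI, a non-empty result would flag a variable that should be renamed and mapped to the sidecar's expected name in `compose.yaml`.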
|
||||
|
||||
---
|
||||
|
||||
## Docker Networking
|
||||
|
||||
- Use the **default Docker bridge network** for simple deployments
|
||||
- Add additional named networks only when required (e.g., isolating database traffic) or explicitly requested
|
||||
- Do not define custom networks for single-service Docker Compose stacks
|
||||
|
||||
---
|
||||
|
||||
## Documentation Standards
|
||||
|
||||
Place documentation in the `/docs/` directory of the repository.
|
||||
|
||||
### HTML Documents
|
||||
|
||||
HTML documents must follow [docs/documentation_style_guide.html](documentation_style_guide.html).
|
||||
|
||||
- Include a dark mode that follows the system preference automatically, plus a toggle button in the navbar
- Avoid custom CSS
|
||||
- Use **Mermaid** for diagrams
|
||||
|
||||
---
|
||||
|
||||
## Exceptions
|
||||
|
||||
Third-party services and vendor containers frequently cannot comply with every standard in this document (health endpoint paths, access-log filtering, log level semantics, env var prefixes). Rather than force non-compliance into a binary pass/fail, record deviations here so the gap is visible and intentional.
|
||||
|
||||
**Rules for exceptions:**
|
||||
- Every exception MUST name the service, the standard being waived, and the reason (vendor constraint, upstream bug, deliberate trade-off).
|
||||
- Exceptions MUST be re-reviewed each time the document's `Last reviewed` date is updated. If the underlying reason has gone away (vendor fixed it, we forked, we replaced the service), remove the exception.
|
||||
- A missing exception for a known-non-compliant service is itself a Red Panda violation — the point is transparency.
|
||||
|
||||
| Service | Standard waived | Reason | Reviewed |
|---------|-----------------|--------|----------|
| _(example)_ Gitea | `/live`, `/ready` paths (uses `/api/healthz`) | Upstream does not expose K8s-style endpoints | 2026-04-18 |
| _(example)_ Nextcloud | Env var prefix `NEXTCLOUD_` (uses vendor-defined `NC_*` and unprefixed vars) | Vendor container ignores renamed vars | 2026-04-18 |
| _(add real exceptions as they are discovered)_ | | | |
|
||||
|
||||
**Health path trailing-slash exceptions** — services that serve `/ready` without the trailing slash (framework default, cannot be reconfigured without breaking routing):
|
||||
|
||||
| Service | Actual path | Reason |
|---------|-------------|--------|
| _(add as discovered)_ | | |
|
||||