ouranos/docs/pplg.md
Robert Helewka 0f21380fd0 refactor: remove HAProxy from Prospero, centralize TLS on Titania
Move TLS termination and reverse proxying entirely to Titania's
HAProxy, eliminating the redundant HAProxy instance on Prospero.
Backends now communicate over plain HTTP within the internal network.

- Remove HAProxy container, config, certs, and syslog from Prospero
- Remove ssl_backend flags from Titania backend definitions
- Replace pplg_haproxy_* vars with single pplg_domain variable
- Remove HAProxy syslog source from Alloy config
- Update OAuth2-Proxy to listen on all interfaces for Titania access
2026-04-08 17:57:09 +00:00

# PPLG - Consolidated Observability & Admin Stack
## Overview
PPLG is the consolidated observability and administration stack running on **Prospero**. It bundles PgAdmin, Prometheus, Loki, and Grafana with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication. TLS termination is handled by Titania's HAProxy, which routes directly to each service on Prospero.
**Host:** prospero.incus
**Role:** Observability
**External Access:** Via Titania HAProxy → `prospero.incus` (direct to service ports)
| Subdomain | Service | Auth Method |
|-----------|---------|-------------|
| `grafana.ouranos.helu.ca` | Grafana | Native Casdoor OAuth |
| `pgadmin.ouranos.helu.ca` | PgAdmin | Native Casdoor OAuth |
| `prometheus.ouranos.helu.ca` | Prometheus | OAuth2-Proxy sidecar |
| `loki.ouranos.helu.ca` | Loki | None (machine-to-machine) |
| `alertmanager.ouranos.helu.ca` | Alertmanager | None (internal) |
## Architecture
```
┌──────────┐      ┌────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│  Prospero (PPLG)                                │
│          │      │ (Titania)  │      │                                                 │
└──────────┘      │ :443 TLS   │      │  Grafana (:3000) - Casdoor OAuth                │
                  │ termination│      │  PgAdmin (:5050) - Casdoor OAuth                │
┌──────────┐      └────────────┘      │  OAuth2-Proxy (:9091) → Prometheus (:9090)      │
│  Alloy   │─────────────────────────▶│  Loki (:3100) - no auth                         │
│ (agents) │                          │  Alertmanager (:9093) - no auth                 │
└──────────┘                          └─────────────────────────────────────────────────┘
```
### Traffic Flow
| Source | Destination | Path | Auth |
|--------|-------------|------|------|
| Browser → Grafana | Titania :443 → Prospero :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :9091 (OAuth2-Proxy) → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | Titania :443 → Prospero :3100 | Subdomain ACL | None |
| Alloy → Prometheus | Titania :443 → Prospero :9091 → :9090 | `skip_auth_routes` | None |
## Deployment
### Prerequisites
1. **Terraform**: Prospero container must have updated port mappings (`terraform apply`)
2. **Certbot**: Wildcard cert must exist on Titania (`ansible-playbook certbot/deploy.yml`)
3. **Vault Secrets**: All vault variables must be set (see [Required Vault Secrets](#required-vault-secrets))
4. **Casdoor Applications**: Register PgAdmin and Prometheus apps in Casdoor (see [Casdoor SSO](#casdoor-sso))
### Playbook
```bash
cd ansible
ansible-playbook pplg/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `pplg/deploy.yml` | Main consolidated deployment playbook |
| `pplg/prometheus.yml.j2` | Prometheus scrape configuration |
| `pplg/alert_rules.yml.j2` | Prometheus alerting rules |
| `pplg/alertmanager.yml.j2` | Alertmanager routing and Pushover notifications |
| `pplg/config.yml.j2` | Loki server configuration |
| `pplg/grafana.ini.j2` | Grafana main config with Casdoor OAuth |
| `pplg/datasource.yml.j2` | Grafana provisioned datasources |
| `pplg/users.yml.j2` | Grafana provisioned users |
| `pplg/config_local.py.j2` | PgAdmin config with Casdoor OAuth |
| `pplg/pgadmin.service.j2` | PgAdmin gunicorn systemd unit |
| `pplg/oauth2-proxy-prometheus.cfg.j2` | OAuth2-Proxy config for Prometheus UI |
| `pplg/oauth2-proxy-prometheus.service.j2` | OAuth2-Proxy systemd unit |
### Deployment Steps
1. **APT Repositories**: Add Grafana and PgAdmin repos
2. **Install Packages**: prometheus, loki, grafana, pgadmin4-web
3. **Prometheus**: Config, alert rules, systemd override for remote write receiver
4. **Alertmanager**: Install, config with Pushover integration
5. **Loki**: Create user/dirs, template config
6. **Grafana**: Provisioning (datasources, users, dashboards), OAuth config
7. **PgAdmin**: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
8. **OAuth2-Proxy**: Download binary (v7.6.0), config for Prometheus sidecar
### Deployment Order
PPLG must be deployed **before** services that push metrics/logs:
```
apt_update → alloy → node_exporter → pplg → postgresql → ...
```
This order is enforced in `site.yml`.
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
⚠️ **All vault variables below must be set before running the playbook.** Missing variables will cause template failures like:
```
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```
### Prometheus Scrape Credentials
These are used in `prometheus.yml.j2` to scrape metrics from Casdoor and Gitea.
#### 1. Casdoor Prometheus Access Key
```yaml
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
```
#### 2. Casdoor Prometheus Access Secret
```yaml
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
```
**Requirements (both):**
- **Source**: API key pair from the `built-in/admin` Casdoor user
- **Used by**: `prometheus.yml.j2` Casdoor scrape job (`accessKey` / `accessSecret` query params)
- **How to obtain**: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
```bash
# 1. Login to get session cookie
curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
  -H "Content-Type: application/json" \
  -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
# 2. Generate API keys for built-in/admin
curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
  -H "Content-Type: application/json" \
  -d '{"owner":"built-in","name":"admin"}'
# 3. Retrieve the generated keys
curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"
# 4. Cleanup
rm /tmp/casdoor-cookie.txt
```
⚠️ The `built-in/admin` user is used (not a `heluca` user) because Casdoor's `/api/metrics` endpoint requires an admin user and serves global platform metrics.
#### 3. Gitea Metrics Token
```yaml
vault_gitea_metrics_token: "YourGiteaMetricsToken"
```
**Requirements:**
- **Length**: 32+ characters
- **Source**: Must match the token configured in Gitea's `app.ini`
- **Generation**: `openssl rand -hex 32`
- **Used by**: `prometheus.yml.j2` Gitea scrape job (Bearer token auth)
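The length requirement can be checked before the token goes into the vault. This is a minimal sketch, not part of the repository: `is_valid_metrics_token` is a hypothetical helper name, and it only checks length, not whether the token matches Gitea's `app.ini`. Note that `openssl rand -hex 32` prints 64 hex characters (32 random bytes), so it satisfies the 32+ character rule.

```bash
# Hypothetical helper: succeeds when the token is at least 32 characters long.
is_valid_metrics_token() {
    # ${#1} expands to the character length of the first argument.
    [ "${#1}" -ge 32 ]
}

# Example (commented out so sourcing this file has no side effects):
# token=$(openssl rand -hex 32)
# is_valid_metrics_token "$token" && echo "token length OK"
```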
### Grafana Credentials
#### 4. Grafana Admin User
```yaml
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
```
#### 5. Grafana Viewer User
```yaml
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
```
#### 6. Grafana OAuth (Casdoor SSO)
```yaml
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-grafana`
- **Redirect URI**: `https://grafana.ouranos.helu.ca/login/generic_oauth`
### PgAdmin
#### 7. PgAdmin Setup
Run the PgAdmin database setup manually on Prospero:
`/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db`
**Requirements:**
- **Purpose**: Initial local admin account (fallback when OAuth is unavailable)
#### 8. PgAdmin OAuth (Casdoor SSO)
```yaml
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-pgadmin`
- **Redirect URI**: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
### Prometheus OAuth2-Proxy
#### 9. Prometheus OAuth2-Proxy (Casdoor SSO)
```yaml
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
```
**Requirements:**
- Client ID/Secret must match the Casdoor application `app-prometheus`
- **Redirect URI**: `https://prometheus.ouranos.helu.ca/oauth2/callback`
- **Cookie secret generation**:
```bash
python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
```
### Alertmanager (Pushover)
#### 10. Pushover Notification Credentials
```yaml
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
```
**Requirements:**
- **Source**: [pushover.net](https://pushover.net/) account
- **User Key**: Found on Pushover dashboard
- **API Token**: Create an application in Pushover
### Quick Reference
| Vault Variable | Used By | Source |
|---------------|---------|--------|
| `vault_casdoor_prometheus_access_key` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_casdoor_prometheus_access_secret` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_gitea_metrics_token` | prometheus.yml.j2 | Gitea app.ini |
| `vault_grafana_admin_name` | users.yml.j2 | Choose any |
| `vault_grafana_admin_login` | users.yml.j2 | Choose any |
| `vault_grafana_admin_password` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_name` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_login` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_password` | users.yml.j2 | Choose any |
| `vault_grafana_oauth_client_id` | grafana.ini.j2 | Casdoor app |
| `vault_grafana_oauth_client_secret` | grafana.ini.j2 | Casdoor app |
| `vault_pgadmin_email` | config_local.py.j2 | Choose any |
| `vault_pgadmin_password` | config_local.py.j2 | Choose any |
| `vault_pgadmin_oauth_client_id` | config_local.py.j2 | Casdoor app |
| `vault_pgadmin_oauth_client_secret` | config_local.py.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_id` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_secret` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_cookie_secret` | oauth2-proxy-prometheus.cfg.j2 | Generate |
| `vault_pushover_user_key` | alertmanager.yml.j2 | Pushover account |
| `vault_pushover_api_token` | alertmanager.yml.j2 | Pushover account |
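A pre-flight check against the table above can catch missing variables before the playbook fails mid-run. The sketch below is an illustration only: `missing_vault_vars` is a hypothetical helper, and it assumes the decrypted vault is flat YAML with one top-level `name: value` pair per line. It reads the decrypted text on stdin and prints each requested name that is absent.

```bash
# Hypothetical pre-flight helper (not part of the repository).
# Reads decrypted vault YAML on stdin and prints every name given as an
# argument that has no top-level "name:" line.
missing_vault_vars() {
    vault_text=$(cat)
    for name in "$@"; do
        # Match the key at the start of a line, followed by a colon.
        if ! printf '%s\n' "$vault_text" | grep -Eq "^${name}:"; then
            printf '%s\n' "$name"
        fi
    done
}
```

One possible invocation, run from the `ansible` directory, is `ansible-vault view inventory/group_vars/all/vault.yml | missing_vault_vars vault_gitea_metrics_token vault_pushover_user_key`. Empty output means every listed variable was found.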
## Casdoor SSO
Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.
### Applications to Register
Register in Casdoor Admin UI (`https://id.ouranos.helu.ca`) or add to `ansible/casdoor/init_data.json.j2`:
| Application | Client ID | Redirect URI | Grant Types |
|-------------|-----------|-------------|-------------|
| `app-grafana` | `vault_grafana_oauth_client_id` | `https://grafana.ouranos.helu.ca/login/generic_oauth` | `authorization_code`, `refresh_token` |
| `app-pgadmin` | `vault_pgadmin_oauth_client_id` | `https://pgadmin.ouranos.helu.ca/oauth2/redirect` | `authorization_code`, `refresh_token` |
| `app-prometheus` | `vault_prometheus_oauth2_client_id` | `https://prometheus.ouranos.helu.ca/oauth2/callback` | `authorization_code`, `refresh_token` |
### URL Strategy
| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/login/oauth/authorize` | User's browser (external) |
| **Token URL** | `https://id.ouranos.helu.ca/api/login/oauth/access_token` | Server-to-server |
| **Userinfo URL** | `https://id.ouranos.helu.ca/api/userinfo` | Server-to-server |
| **OIDC Discovery** | `https://id.ouranos.helu.ca/.well-known/openid-configuration` | OAuth2-Proxy |
### Auth Methods per Service
| Service | Auth Method | Details |
|---------|-------------|---------|
| **Grafana** | Native `[auth.generic_oauth]` | Built-in OAuth support in `grafana.ini` |
| **PgAdmin** | Native `OAUTH2_CONFIG` | Built-in OAuth support in `config_local.py` |
| **Prometheus** | OAuth2-Proxy sidecar | Binary on `:9091` proxying to `:9090` |
| **Loki** | None | Machine-to-machine (Alloy agents push logs) |
| **Alertmanager** | None | Internal only |
## OAuth2-Proxy skip_auth_routes
The Prometheus write API (`/api/v1/write`) and health check (`/ping`) are accessed by Alloy agents for machine-to-machine metric pushes. OAuth2-Proxy's `skip_auth_routes` config bypasses authentication for these paths:
```toml
skip_auth_routes = [
  "^/ping$",
  "^/api/v1/write$"
]
```
This allows `https://prometheus.ouranos.helu.ca/api/v1/write` to reach Prometheus without OAuth, while all other Prometheus traffic requires Casdoor SSO authentication.
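The effect of the anchored patterns can be illustrated with a small sketch. It approximates the matching with `grep -E` rather than OAuth2-Proxy's own regular expression engine, which is an assumption that holds for patterns this simple. `route_auth_decision` is a hypothetical name used only for this example.

```bash
# Sketch only: approximates skip_auth_routes matching with grep -E.
# Prints "skip" when the path matches one of the two patterns,
# otherwise "auth".
route_auth_decision() {
    path=$1
    for pattern in '^/ping$' '^/api/v1/write$'; do
        if printf '%s\n' "$path" | grep -Eq "$pattern"; then
            echo skip
            return 0
        fi
    done
    echo auth
}
```

Because both patterns are anchored with `^` and `$`, only the exact paths bypass authentication. A path such as `/api/v1/write/extra` or `/api/v1/query` still goes through Casdoor SSO.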
## Host Variables
**File:** `ansible/inventory/host_vars/prospero.incus.yml`
Services list:
```yaml
services:
- alloy
- pplg
```
Key variable groups defined in `prospero.incus.yml`:
- PPLG domain (`ouranos.helu.ca`)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)
## Titania Backend Routing
Titania's HAProxy routes external subdomains directly to Prospero service ports:
```yaml
# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
  backend_host: "prospero.incus"
  backend_port: 3000
  health_path: "/api/health"
- subdomain: "pgadmin"
  backend_host: "prospero.incus"
  backend_port: 5050
  health_path: "/misc/ping"
- subdomain: "prometheus"
  backend_host: "prospero.incus"
  backend_port: 9091  # OAuth2-Proxy sidecar
  health_path: "/ping"
- subdomain: "loki"
  backend_host: "prospero.incus"
  backend_port: 3100
  health_path: "/ready"
- subdomain: "alertmanager"
  backend_host: "prospero.incus"
  backend_port: 9093
  health_path: "/-/healthy"
```
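Each backend entry can be probed by hand from Titania using the same host, port, and health path. The helpers below are a hedged sketch: `backend_health_url` and `check_backend` are hypothetical names, and the sketch assumes `health_path` is the path HAProxy's HTTP check requests and that backends speak plain HTTP inside the internal network.

```bash
# Hypothetical helper: build the plain-HTTP health URL for one backend.
backend_health_url() {
    # $1 = backend_host, $2 = backend_port, $3 = health_path
    printf 'http://%s:%s%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: probe one backend and report OK or FAIL.
# curl -f turns HTTP errors (status >= 400) into a non-zero exit status.
check_backend() {
    url=$(backend_health_url "$1" "$2" "$3")
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
        echo "OK   $url"
    else
        echo "FAIL $url"
    fi
}
```

For example, `check_backend prospero.incus 3000 /api/health` probes Grafana, and `check_backend prospero.incus 9091 /ping` probes the OAuth2-Proxy sidecar.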
## Monitoring
### Alloy Configuration
**File:** `ansible/alloy/prospero/config.alloy.j2`
- **Journal Labels**: Dedicated job labels for `grafana-server`, `prometheus`, `loki`, `alertmanager`, `pgadmin`, `oauth2-proxy-prometheus`
- **System Logs**: `/var/log/syslog`, `/var/log/auth.log` → Loki
- **Metrics**: Node exporter + process exporter → Prometheus remote write
### Prometheus Scrape Targets
| Job | Target | Auth |
|-----|--------|------|
| `prometheus` | `localhost:9090` | None |
| `node-exporter` | All Uranian hosts `:9100` | None |
| `alertmanager` | `prospero.incus:9093` | None |
| `haproxy` | `titania.incus:8404` | None |
| `gitea` | `oberon.incus:22084` | Bearer token |
| `casdoor` | `titania.incus:22081` | Access key/secret params |
### Alert Rules
Groups defined in `alert_rules.yml.j2`:
| Group | Alerts | Scope |
|-------|--------|-------|
| `node_alerts` | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| `puck_process_alerts` | HighCPU/Memory per process, CrashLoop | puck.incus |
| `puck_container_alerts` | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| `service_alerts` | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| `loki_alerts` | HighLogVolume | Loki |
### Alertmanager Routing
Alerts are routed to Pushover with severity-based priority:
| Severity | Pushover Priority | Emoji |
|----------|-------------------|-------|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | — |
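The mapping in the table can be written as a small lookup. This is an illustration of the table only, not the Alertmanager configuration itself. `pushover_priority` is a hypothetical name, and the sketch assumes the `severity` label values are lowercase (`critical`, `warning`, `info`).

```bash
# Illustration of the severity table above; not the real alertmanager.yml.
# Prints the Pushover priority for a severity label, or fails for
# an unrecognized value.
pushover_priority() {
    case "$1" in
        critical) echo 2 ;;   # Emergency
        warning)  echo 1 ;;   # High
        info)     echo 0 ;;   # Normal
        *)        echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```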
## Grafana MCP Server
Grafana has an associated **MCP (Model Context Protocol) server** that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on **Miranda** and connects back to Grafana on Prospero via the internal network (`prospero.incus:3000`) using a service account token.
| Property | Value |
|----------|-------|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | `http://miranda.incus:25530/grafana` |
| Auth | Grafana service account token (`vault_grafana_service_account_token`) |
The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: `pplg → grafana_mcp → mcpo`.
For full details (deployment, configuration, available tools, troubleshooting), see **[Grafana MCP Server](grafana_mcp.md)**.
## Access After Deployment
| Service | URL | Login |
|---------|-----|-------|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |
## Troubleshooting
### Service Status
```bash
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
```
### View Logs
```bash
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f
```
### Test Endpoints (from Prospero)
```bash
# Grafana
curl -s http://127.0.0.1:3000/api/health
# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping
# Prometheus
curl -s http://127.0.0.1:9090/-/healthy
# Loki
curl -s http://127.0.0.1:3100/ready
# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy
```
### Test External Access (from any host)
```bash
# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
curl -s https://pgadmin.ouranos.helu.ca/misc/ping
curl -s https://prometheus.ouranos.helu.ca/ping
curl -s https://loki.ouranos.helu.ca/ready
curl -s https://alertmanager.ouranos.helu.ca/-/healthy
```
### Common Errors
#### `vault_casdoor_prometheus_access_key` is undefined
```
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```
**Cause**: The Casdoor metrics scrape job in `prometheus.yml.j2` requires access credentials.
**Fix**: Generate API keys for the `built-in/admin` Casdoor user (see [Casdoor Prometheus Access Key](#1-casdoor-prometheus-access-key) for the full procedure), then add to vault:
```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```
```yaml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
```
#### Certificate fetch fails
**Cause**: Titania is not running, or certbot has not provisioned the certificate yet.
**Fix**: Ensure Titania is up and certbot has run:
```bash
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
```
The playbook falls back to a self-signed certificate if Titania is unavailable.
#### OAuth2 redirect loops
**Cause**: Casdoor application redirect URI doesn't match the service URL.
**Fix**: Verify redirect URIs match exactly:
- Grafana: `https://grafana.ouranos.helu.ca/login/generic_oauth`
- PgAdmin: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
- Prometheus: `https://prometheus.ouranos.helu.ca/oauth2/callback`
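Since the comparison is exact, a trailing slash or a different scheme is enough to cause a mismatch. The sketch below shows a plain string comparison. `check_redirect_uri` is a hypothetical helper, and the sketch assumes the URI registered in Casdoor has already been copied out of the Casdoor Admin UI by hand.

```bash
# Hypothetical helper: compare the URI registered in Casdoor with the
# URI the service sends. The comparison is a byte-for-byte string match.
check_redirect_uri() {
    # $1 = URI registered in Casdoor, $2 = URI expected by the service
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: registered '$1' but service expects '$2'"
    fi
}
```

For example, registering `https://grafana.ouranos.helu.ca/login/generic_oauth/` (with a trailing slash) reports a mismatch against the expected Grafana URI listed above.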
## Migration Notes
PPLG replaces the following standalone playbooks (kept as reference):
| Original Playbook | Replaced By |
|-------------------|-------------|
| `prometheus/deploy.yml` | `pplg/deploy.yml` |
| `prometheus/alertmanager_deploy.yml` | `pplg/deploy.yml` |
| `loki/deploy.yml` | `pplg/deploy.yml` |
| `grafana/deploy.yml` | `pplg/deploy.yml` |
| `pgadmin/deploy.yml` | `pplg/deploy.yml` |
PgAdmin was previously hosted on **Portia** (port 25555). It now runs on **Prospero** via gunicorn (no Apache).