# PPLG - Consolidated Observability & Admin Stack
## Overview
PPLG is the consolidated observability and administration stack running on **Prospero**. It bundles PgAdmin, Prometheus (with Alertmanager), Loki, and Grafana. User-facing services authenticate through Casdoor SSO, and an OAuth2-Proxy sidecar protects the Prometheus UI. TLS termination is handled by Titania's HAProxy, which routes directly to each service on Prospero.
**Host:** prospero.incus
**Role:** Observability
**External Access:** Via Titania HAProxy → `prospero.incus` (direct to service ports)
| Subdomain | Service | Auth Method |
|-----------|---------|-------------|
| `grafana.ouranos.helu.ca` | Grafana | Native Casdoor OAuth |
| `pgadmin.ouranos.helu.ca` | PgAdmin | Native Casdoor OAuth |
| `prometheus.ouranos.helu.ca` | Prometheus | OAuth2-Proxy sidecar |
| `loki.ouranos.helu.ca` | Loki | None (machine-to-machine) |
| `alertmanager.ouranos.helu.ca` | Alertmanager | None (internal) |
## Architecture
```
┌──────────┐      ┌────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│ Prospero (PPLG)                                 │
│          │      │ (Titania)  │      │                                                 │
└──────────┘      │  :443 TLS  │      │ Grafana (:3000) - Casdoor OAuth                 │
                  │ termination│      │ PgAdmin (:5050) - Casdoor OAuth                 │
┌──────────┐      └────────────┘      │ OAuth2-Proxy (:9091) → Prometheus (:9090)       │
│  Alloy   │─────────────────────────▶│ Loki (:3100) - no auth                          │
│ (agents) │                          │ Alertmanager (:9093) - no auth                  │
└──────────┘                          └─────────────────────────────────────────────────┘
```
### Traffic Flow
| Flow | Route | Rule | Auth |
|--------|-------------|------|------|
| Browser → Grafana | Titania :443 → Prospero :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :9091 (OAuth2-Proxy) → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | Titania :443 → Prospero :3100 | Subdomain ACL | None |
| Alloy → Prometheus | Titania :443 → Prospero :9091 → :9090 | `skip_auth_routes` | None |
## Deployment
### Prerequisites
1. **Terraform**: Prospero container must have updated port mappings (`terraform apply`)
2. **Certbot**: Wildcard cert must exist on Titania (`ansible-playbook certbot/deploy.yml`)
3. **Vault Secrets**: All vault variables must be set (see [Required Vault Secrets](#required-vault-secrets))
4. **Casdoor Applications**: Register PgAdmin and Prometheus apps in Casdoor (see [Casdoor SSO](#casdoor-sso))
### Playbook
```bash
cd ansible
ansible-playbook pplg/deploy.yml
```
### Files
| File | Purpose |
|------|---------|
| `pplg/deploy.yml` | Main consolidated deployment playbook |
| `pplg/prometheus.yml.j2` | Prometheus scrape configuration |
| `pplg/alert_rules.yml.j2` | Prometheus alerting rules |
| `pplg/alertmanager.yml.j2` | Alertmanager routing and Pushover notifications |
| `pplg/config.yml.j2` | Loki server configuration |
| `pplg/grafana.ini.j2` | Grafana main config with Casdoor OAuth |
| `pplg/datasource.yml.j2` | Grafana provisioned datasources |
| `pplg/users.yml.j2` | Grafana provisioned users |
| `pplg/config_local.py.j2` | PgAdmin config with Casdoor OAuth |
| `pplg/pgadmin.service.j2` | PgAdmin gunicorn systemd unit |
| `pplg/oauth2-proxy-prometheus.cfg.j2` | OAuth2-Proxy config for Prometheus UI |
| `pplg/oauth2-proxy-prometheus.service.j2` | OAuth2-Proxy systemd unit |
### Deployment Steps
1. **APT Repositories**: Add Grafana and PgAdmin repos
2. **Install Packages**: prometheus, loki, grafana, pgadmin4-web
3. **Prometheus**: Config, alert rules, systemd override for remote write receiver
4. **Alertmanager**: Install, config with Pushover integration
5. **Loki**: Create user/dirs, template config
6. **Grafana**: Provisioning (datasources, users, dashboards), OAuth config
7. **PgAdmin**: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
8. **OAuth2-Proxy**: Download binary (v7.6.0), config for Prometheus sidecar
### Deployment Order
PPLG must be deployed **before** services that push metrics/logs:
```
apt_update → alloy → node_exporter → pplg → postgresql → ...
```
This order is enforced in `site.yml`.
## Required Vault Secrets
Add to `ansible/inventory/group_vars/all/vault.yml`:
⚠️ **All vault variables below must be set before running the playbook.** Missing variables will cause template failures like:
```
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```
### Prometheus Scrape Credentials
These are used in `prometheus.yml.j2` to scrape metrics from Casdoor and Gitea.
#### 1. Casdoor Prometheus Access Key
```yaml
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
```
#### 2. Casdoor Prometheus Access Secret
```yaml
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
```
**Requirements (both):**
- **Source**: API key pair from the `built-in/admin` Casdoor user
- **Used by**: `prometheus.yml.j2` Casdoor scrape job (`accessKey` / `accessSecret` query params)
- **How to obtain**: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
```bash
# 1. Login to get session cookie
curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
-H "Content-Type: application/json" \
-d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
# 2. Generate API keys for built-in/admin
curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
-H "Content-Type: application/json" \
-d '{"owner":"built-in","name":"admin"}'
# 3. Retrieve the generated keys
curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"
# 4. Cleanup
rm /tmp/casdoor-cookie.txt
```
⚠️ The `built-in/admin` user is used (not a `heluca` user) because Casdoor's `/api/metrics` endpoint requires an admin user and serves global platform metrics.
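To illustrate how the key pair ends up in the scrape request, here is a minimal Python sketch that builds the metrics URL with `accessKey` and `accessSecret` as query parameters. The function name `casdoor_metrics_url` is hypothetical, and the base URL (plain HTTP to the `titania.incus:22081` scrape target) is an assumption; the real request is defined in `prometheus.yml.j2`.

```python
from urllib.parse import urlencode


def casdoor_metrics_url(access_key: str, access_secret: str,
                        base: str = "http://titania.incus:22081") -> str:
    """Build the Casdoor metrics URL with credentials as query parameters.

    The base URL (scheme, host, port) is an assumption for illustration.
    """
    # urlencode escapes special characters so the secret survives intact.
    query = urlencode({"accessKey": access_key, "accessSecret": access_secret})
    return f"{base}/api/metrics?{query}"


url = casdoor_metrics_url("YourCasdoorAccessKey", "YourCasdoorAccessSecret")
print(url)
```

Because the credentials travel in the query string, they may appear in access logs along the path; treat those logs as sensitive.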
#### 3. Gitea Metrics Token
```yaml
vault_gitea_metrics_token: "YourGiteaMetricsToken"
```
**Requirements:**
- **Length**: 32+ characters
- **Source**: Must match the token configured in Gitea's `app.ini`
- **Generation**: `openssl rand -hex 32`
- **Used by**: `prometheus.yml.j2` Gitea scrape job (Bearer token auth)
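If `openssl` is not at hand, an equivalent token can be generated with Python's standard library. The sketch below produces 32 random bytes as 64 hex characters and shows the shape of the Bearer header; the helper names `new_metrics_token` and `bearer_header` are hypothetical.

```python
import secrets


def new_metrics_token(n_bytes: int = 32) -> str:
    """Return a random hex token; 32 bytes yields 64 hex characters."""
    return secrets.token_hex(n_bytes)


def bearer_header(token: str) -> dict:
    """Build the Authorization header used for Bearer token auth."""
    return {"Authorization": f"Bearer {token}"}


token = new_metrics_token()
print(len(token))
print(bearer_header("YourGiteaMetricsToken"))
```

The same value has to be placed in both the vault and Gitea's `app.ini`, so generate it once and copy it to both places.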
### Grafana Credentials
#### 4. Grafana Admin User
```yaml
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
```
#### 5. Grafana Viewer User
```yaml
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
```
#### 6. Grafana OAuth (Casdoor SSO)
```yaml
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-grafana`
- **Redirect URI**: `https://grafana.ouranos.helu.ca/login/generic_oauth`
### PgAdmin
#### 7. PgAdmin Setup
Run the PgAdmin database setup manually on Prospero as the `pgadmin` user:
`sudo -u pgadmin /usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db`
**Requirements:**
- **Purpose**: Initial local admin account (fallback when OAuth is unavailable)
#### 8. PgAdmin OAuth (Casdoor SSO)
```yaml
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-pgadmin`
- **Redirect URI**: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
### Prometheus OAuth2-Proxy
#### 9. Prometheus OAuth2-Proxy (Casdoor SSO)
```yaml
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
```
**Requirements:**
- Client ID/Secret must match the Casdoor application `app-prometheus`
- **Redirect URI**: `https://prometheus.ouranos.helu.ca/oauth2/callback`
- **Cookie secret generation**:
```bash
python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
```
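OAuth2-Proxy requires a cookie secret of 16, 24, or 32 bytes. As a sanity check, the following sketch decodes a generated secret and reports its byte length. It assumes the secret is URL-safe base64 with padding stripped, which is what `secrets.token_urlsafe` produces; the helper name `cookie_secret_length` is hypothetical.

```python
import base64
import secrets


def cookie_secret_length(secret: str) -> int:
    """Return the decoded byte length of a URL-safe base64 secret."""
    # token_urlsafe strips '=' padding; restore it before decoding.
    padded = secret + "=" * (-len(secret) % 4)
    return len(base64.urlsafe_b64decode(padded))


secret = secrets.token_urlsafe(32)
assert cookie_secret_length(secret) in (16, 24, 32)
print(cookie_secret_length(secret))
```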
### Alertmanager (Pushover)
#### 10. Pushover Notification Credentials
```yaml
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
```
**Requirements:**
- **Source**: [pushover.net](https://pushover.net/) account
- **User Key**: Found on Pushover dashboard
- **API Token**: Create an application in Pushover
### Quick Reference
| Vault Variable | Used By | Source |
|---------------|---------|--------|
| `vault_casdoor_prometheus_access_key` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_casdoor_prometheus_access_secret` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_gitea_metrics_token` | prometheus.yml.j2 | Gitea app.ini |
| `vault_grafana_admin_name` | users.yml.j2 | Choose any |
| `vault_grafana_admin_login` | users.yml.j2 | Choose any |
| `vault_grafana_admin_password` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_name` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_login` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_password` | users.yml.j2 | Choose any |
| `vault_grafana_oauth_client_id` | grafana.ini.j2 | Casdoor app |
| `vault_grafana_oauth_client_secret` | grafana.ini.j2 | Casdoor app |
| `vault_pgadmin_email` | config_local.py.j2 | Choose any |
| `vault_pgadmin_password` | config_local.py.j2 | Choose any |
| `vault_pgadmin_oauth_client_id` | config_local.py.j2 | Casdoor app |
| `vault_pgadmin_oauth_client_secret` | config_local.py.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_id` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_secret` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_cookie_secret` | oauth2-proxy-prometheus.cfg.j2 | Generate |
| `vault_pushover_user_key` | alertmanager.yml.j2 | Pushover account |
| `vault_pushover_api_token` | alertmanager.yml.j2 | Pushover account |
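A preflight check along the following lines can catch missing variables before the playbook fails mid-run. This is a sketch only: it assumes the vault has already been decrypted and parsed into a Python dict (for example from `ansible-vault view` output fed to a YAML parser, not shown here), and the names `REQUIRED_VAULT_VARS` and `missing_vault_vars` are hypothetical.

```python
# Variable names copied from the quick reference table above.
REQUIRED_VAULT_VARS = [
    "vault_casdoor_prometheus_access_key",
    "vault_casdoor_prometheus_access_secret",
    "vault_gitea_metrics_token",
    "vault_grafana_admin_name",
    "vault_grafana_admin_login",
    "vault_grafana_admin_password",
    "vault_grafana_viewer_name",
    "vault_grafana_viewer_login",
    "vault_grafana_viewer_password",
    "vault_grafana_oauth_client_id",
    "vault_grafana_oauth_client_secret",
    "vault_pgadmin_email",
    "vault_pgadmin_password",
    "vault_pgadmin_oauth_client_id",
    "vault_pgadmin_oauth_client_secret",
    "vault_prometheus_oauth2_client_id",
    "vault_prometheus_oauth2_client_secret",
    "vault_prometheus_oauth2_cookie_secret",
    "vault_pushover_user_key",
    "vault_pushover_api_token",
]


def missing_vault_vars(vault: dict) -> list:
    """Return required variable names that are absent or empty."""
    return [name for name in REQUIRED_VAULT_VARS if not vault.get(name)]


# Example: a vault with placeholder values and one variable removed.
example = {name: "placeholder" for name in REQUIRED_VAULT_VARS}
del example["vault_pushover_api_token"]
print(missing_vault_vars(example))
```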
## Casdoor SSO
Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.
### Applications to Register
Register in Casdoor Admin UI (`https://id.ouranos.helu.ca`) or add to `ansible/casdoor/init_data.json.j2`:
| Application | Client ID | Redirect URI | Grant Types |
|-------------|-----------|-------------|-------------|
| `app-grafana` | `vault_grafana_oauth_client_id` | `https://grafana.ouranos.helu.ca/login/generic_oauth` | `authorization_code`, `refresh_token` |
| `app-pgadmin` | `vault_pgadmin_oauth_client_id` | `https://pgadmin.ouranos.helu.ca/oauth2/redirect` | `authorization_code`, `refresh_token` |
| `app-prometheus` | `vault_prometheus_oauth2_client_id` | `https://prometheus.ouranos.helu.ca/oauth2/callback` | `authorization_code`, `refresh_token` |
### URL Strategy
| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/login/oauth/authorize` | User's browser (external) |
| **Token URL** | `https://id.ouranos.helu.ca/api/login/oauth/access_token` | Server-to-server |
| **Userinfo URL** | `https://id.ouranos.helu.ca/api/userinfo` | Server-to-server |
| **OIDC Discovery** | `https://id.ouranos.helu.ca/.well-known/openid-configuration` | OAuth2-Proxy |
### Auth Methods per Service
| Service | Auth Method | Details |
|---------|-------------|---------|
| **Grafana** | Native `[auth.generic_oauth]` | Built-in OAuth support in `grafana.ini` |
| **PgAdmin** | Native `OAUTH2_CONFIG` | Built-in OAuth support in `config_local.py` |
| **Prometheus** | OAuth2-Proxy sidecar | Binary on `:9091` proxying to `:9090` |
| **Loki** | None | Machine-to-machine (Alloy agents push logs) |
| **Alertmanager** | None | Internal only |
## OAuth2-Proxy skip_auth_routes
Alloy agents push metrics machine-to-machine to the Prometheus remote-write API (`/api/v1/write`), and `/ping` is the health path that Titania's HAProxy probes on the sidecar. OAuth2-Proxy's `skip_auth_routes` setting bypasses authentication for these two paths:
```toml
skip_auth_routes = [
"^/ping$",
"^/api/v1/write$"
]
```
This allows `https://prometheus.ouranos.helu.ca/api/v1/write` to reach Prometheus without OAuth, while all other Prometheus traffic requires Casdoor SSO authentication.
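The effect of the anchored patterns can be checked with a short Python sketch. OAuth2-Proxy evaluates these with Go's regular expression engine, not Python's; for simple anchored literals like these the two behave the same, but treat this as an approximation. The helper name `skips_auth` is hypothetical.

```python
import re

# Patterns copied from the skip_auth_routes setting above.
SKIP_AUTH_ROUTES = [r"^/ping$", r"^/api/v1/write$"]


def skips_auth(path: str) -> bool:
    """True if the request path matches any skip_auth_routes pattern."""
    return any(re.search(pattern, path) for pattern in SKIP_AUTH_ROUTES)


for path in ["/ping", "/api/v1/write", "/api/v1/query", "/api/v1/write/extra"]:
    print(path, skips_auth(path))
```

The `^` and `$` anchors matter: without them, any path merely containing `/api/v1/write` would bypass authentication.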
## Host Variables
**File:** `ansible/inventory/host_vars/prospero.incus.yml`
Services list:
```yaml
services:
- alloy
- pplg
```
Key variable groups defined in `prospero.incus.yml`:
- PPLG domain (`ouranos.helu.ca`)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)
## Titania Backend Routing
Titania's HAProxy routes external subdomains directly to Prospero service ports:
```yaml
# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
backend_host: "prospero.incus"
backend_port: 3000
health_path: "/api/health"
- subdomain: "pgadmin"
backend_host: "prospero.incus"
backend_port: 5050
health_path: "/misc/ping"
- subdomain: "prometheus"
backend_host: "prospero.incus"
backend_port: 9091 # OAuth2-Proxy sidecar
health_path: "/ping"
- subdomain: "loki"
backend_host: "prospero.incus"
backend_port: 3100
health_path: "/ready"
- subdomain: "alertmanager"
backend_host: "prospero.incus"
backend_port: 9093
health_path: "/-/healthy"
```
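To make the mapping concrete, the sketch below derives the public and direct health-check URLs from the same backend entries. The list mirrors the YAML above; the helper names `external_url` and `internal_url` are hypothetical, and plain HTTP between Titania and Prospero is an assumption.

```python
DOMAIN = "ouranos.helu.ca"

# Mirrors the haproxy_backends entries shown above.
HAPROXY_BACKENDS = [
    {"subdomain": "grafana", "backend_host": "prospero.incus",
     "backend_port": 3000, "health_path": "/api/health"},
    {"subdomain": "pgadmin", "backend_host": "prospero.incus",
     "backend_port": 5050, "health_path": "/misc/ping"},
    {"subdomain": "prometheus", "backend_host": "prospero.incus",
     "backend_port": 9091, "health_path": "/ping"},
    {"subdomain": "loki", "backend_host": "prospero.incus",
     "backend_port": 3100, "health_path": "/ready"},
    {"subdomain": "alertmanager", "backend_host": "prospero.incus",
     "backend_port": 9093, "health_path": "/-/healthy"},
]


def external_url(backend: dict) -> str:
    """Public health URL as seen through Titania (TLS on :443)."""
    return f"https://{backend['subdomain']}.{DOMAIN}{backend['health_path']}"


def internal_url(backend: dict) -> str:
    """Direct health URL on the backend host (plain HTTP is an assumption)."""
    return (f"http://{backend['backend_host']}:{backend['backend_port']}"
            f"{backend['health_path']}")


for backend in HAPROXY_BACKENDS:
    print(f"{external_url(backend)} -> {internal_url(backend)}")
```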
## Monitoring
### Alloy Configuration
**File:** `ansible/alloy/prospero/config.alloy.j2`
- **Journal Labels**: Dedicated job labels for `grafana-server`, `prometheus`, `loki`, `alertmanager`, `pgadmin`, `oauth2-proxy-prometheus`
- **System Logs**: `/var/log/syslog`, `/var/log/auth.log` → Loki
- **Metrics**: Node exporter + process exporter → Prometheus remote write
### Prometheus Scrape Targets
| Job | Target | Auth |
|-----|--------|------|
| `prometheus` | `localhost:9090` | None |
| `node-exporter` | All Uranian hosts `:9100` | None |
| `alertmanager` | `prospero.incus:9093` | None |
| `haproxy` | `titania.incus:8404` | None |
| `gitea` | `oberon.incus:22084` | Bearer token |
| `casdoor` | `titania.incus:22081` | Access key/secret params |
### Alert Rules
Groups defined in `alert_rules.yml.j2`:
| Group | Alerts | Scope |
|-------|--------|-------|
| `node_alerts` | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| `puck_process_alerts` | HighCPU/Memory per process, CrashLoop | puck.incus |
| `puck_container_alerts` | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| `service_alerts` | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| `loki_alerts` | HighLogVolume | Loki |
### Alertmanager Routing
Alerts are routed to Pushover with severity-based priority:
| Severity | Pushover Priority | Emoji |
|----------|-------------------|-------|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | — |
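The severity table amounts to a simple lookup, sketched here in Python. The actual mapping lives in `alertmanager.yml.j2`, which is not reproduced in this document; the function name `pushover_priority` and the fallback to priority 0 for unknown severities are assumptions.

```python
# Severity-to-priority mapping from the table above.
PUSHOVER_PRIORITY = {"critical": 2, "warning": 1, "info": 0}


def pushover_priority(severity: str) -> int:
    """Map an alert severity label to a Pushover priority.

    Unknown severities fall back to 0 (Normal); that default is an assumption.
    """
    return PUSHOVER_PRIORITY.get(severity.lower(), 0)


for severity in ["critical", "warning", "info"]:
    print(severity, pushover_priority(severity))
```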
## Grafana MCP Server
Grafana has an associated **MCP (Model Context Protocol) server** that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on **Miranda** and connects back to Grafana on Prospero via the internal network (`prospero.incus:3000`) using a service account token.
| Property | Value |
|----------|-------|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | `http://miranda.incus:25530/grafana` |
| Auth | Grafana service account token (`vault_grafana_service_account_token`) |
The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: `pplg → grafana_mcp → mcpo`.
For full details — deployment, configuration, available tools, troubleshooting — see **[Grafana MCP Server](grafana_mcp.md)**.
## Access After Deployment
| Service | URL | Login |
|---------|-----|-------|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |
## Troubleshooting
### Service Status
```bash
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
```
### View Logs
```bash
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f
```
### Test Endpoints (from Prospero)
```bash
# Grafana
curl -s http://127.0.0.1:3000/api/health
# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping
# Prometheus
curl -s http://127.0.0.1:9090/-/healthy
# Loki
curl -s http://127.0.0.1:3100/ready
# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy
```
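The same local checks can be scripted. The following Python sketch probes each endpoint listed above and reports whether it answered with HTTP 200. The names `probe` and `probe_all` are hypothetical, and the `opener` parameter exists only so the function can be exercised without a live service.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Local health endpoints, as used in the curl commands above.
LOCAL_ENDPOINTS = {
    "grafana": "http://127.0.0.1:3000/api/health",
    "pgadmin": "http://127.0.0.1:5050/misc/ping",
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "loki": "http://127.0.0.1:3100/ready",
    "alertmanager": "http://127.0.0.1:9093/-/healthy",
}


def probe(url: str, opener=urlopen, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def probe_all(opener=urlopen) -> dict:
    """Probe every local endpoint and return {service: healthy}."""
    return {name: probe(url, opener) for name, url in LOCAL_ENDPOINTS.items()}

# On Prospero: print(probe_all())
```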
### Test External Access (from any host)
```bash
# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
curl -s https://pgadmin.ouranos.helu.ca/misc/ping
curl -s https://prometheus.ouranos.helu.ca/ping
curl -s https://loki.ouranos.helu.ca/ready
curl -s https://alertmanager.ouranos.helu.ca/-/healthy
```
### Common Errors
#### `vault_casdoor_prometheus_access_key` is undefined
```
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```
**Cause**: The Casdoor metrics scrape job in `prometheus.yml.j2` requires access credentials.
**Fix**: Generate API keys for the `built-in/admin` Casdoor user (see [Casdoor Prometheus Access Key](#1-casdoor-prometheus-access-key) for the full procedure), then add to vault:
```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```
```yaml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
```
#### Certificate fetch fails
**Cause**: Titania is not running, or certbot has not provisioned the certificate yet.
**Fix**: Ensure Titania is up and certbot has run:
```bash
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
```
The playbook falls back to a self-signed certificate if Titania is unavailable.
#### OAuth2 redirect loops
**Cause**: Casdoor application redirect URI doesn't match the service URL.
**Fix**: Verify redirect URIs match exactly:
- Grafana: `https://grafana.ouranos.helu.ca/login/generic_oauth`
- PgAdmin: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
- Prometheus: `https://prometheus.ouranos.helu.ca/oauth2/callback`
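Because the match must be exact, a small comparison helper can rule out near-misses such as a trailing slash. In this sketch the `configured` dict is filled in by hand from the Casdoor admin UI; the application names come from the Casdoor SSO section, and the function name `redirect_mismatches` is hypothetical.

```python
# Expected redirect URIs, as listed above.
EXPECTED_REDIRECTS = {
    "app-grafana": "https://grafana.ouranos.helu.ca/login/generic_oauth",
    "app-pgadmin": "https://pgadmin.ouranos.helu.ca/oauth2/redirect",
    "app-prometheus": "https://prometheus.ouranos.helu.ca/oauth2/callback",
}


def redirect_mismatches(configured: dict) -> dict:
    """Return {app: (expected, configured)} for URIs that differ.

    Comparison is exact: a trailing slash or an http:// scheme is a mismatch.
    """
    return {
        app: (expected, configured.get(app))
        for app, expected in EXPECTED_REDIRECTS.items()
        if configured.get(app) != expected
    }


# Example: Grafana's URI was entered with a trailing slash.
configured = dict(EXPECTED_REDIRECTS)
configured["app-grafana"] += "/"
print(redirect_mismatches(configured))
```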
## Migration Notes
PPLG replaces the following standalone playbooks (kept for reference):
| Original Playbook | Replaced By |
|-------------------|-------------|
| `prometheus/deploy.yml` | `pplg/deploy.yml` |
| `prometheus/alertmanager_deploy.yml` | `pplg/deploy.yml` |
| `loki/deploy.yml` | `pplg/deploy.yml` |
| `grafana/deploy.yml` | `pplg/deploy.yml` |
| `pgadmin/deploy.yml` | `pplg/deploy.yml` |
PgAdmin was previously hosted on **Portia** (port 25555). It now runs on **Prospero** via gunicorn (no Apache).