# PPLG - Consolidated Observability & Admin Stack

## Overview

PPLG is the consolidated observability and administration stack running on **Prospero**. It bundles PgAdmin, Prometheus, Loki, and Grafana behind an internal HAProxy that terminates TLS. Casdoor provides SSO for the user-facing services, and an OAuth2-Proxy sidecar authenticates access to the Prometheus UI.

**Host:** prospero.incus

**Role:** Observability

**Incus Ports:** 25510 → 443 (HTTPS), 25511 → 80 (HTTP redirect)

**External Access:** Via Titania HAProxy → `prospero.incus:443`

| Subdomain | Service | Auth Method |
|-----------|---------|-------------|
| `grafana.ouranos.helu.ca` | Grafana | Native Casdoor OAuth |
| `pgadmin.ouranos.helu.ca` | PgAdmin | Native Casdoor OAuth |
| `prometheus.ouranos.helu.ca` | Prometheus | OAuth2-Proxy sidecar |
| `loki.ouranos.helu.ca` | Loki | None (machine-to-machine) |
| `alertmanager.ouranos.helu.ca` | Alertmanager | None (internal) |

## Architecture

```
┌──────────┐      ┌─────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│   HAProxy   │─────▶│                 Prospero (PPLG)                 │
│          │      │  (Titania)  │      │                                                 │
└──────────┘      │ :443 → :443 │      │  ┌──────────────────────────────────────────┐   │
                  └─────────────┘      │  │ HAProxy (systemd, :443/:80)              │   │
                                       │  │ TLS termination + subdomain routing      │   │
┌──────────┐                           │  └───┬──────┬──────┬──────┬──────┬──────────┘   │
│  Alloy   │──push────────────────────▶│      │      │      │      │      │              │
│ (agents) │ loki.ouranos.helu.ca      │      │      │      │      │      │              │
│          │ prometheus.ouranos.helu.ca│      │      │      │      │      │              │
└──────────┘                           │      ▼      ▼      ▼      ▼      ▼              │
                                       │  Grafana PgAdmin OAuth2  Loki  Alertmanager     │
                                       │   :3000   :5050  Proxy   :3100   :9093          │
                                       │                  :9091                          │
                                       │                    │                            │
                                       │                    ▼                            │
                                       │               Prometheus                        │
                                       │                  :9090                          │
                                       └─────────────────────────────────────────────────┘
```

### Traffic Flow

| Flow | Path | Routing | Auth |
|------|------|---------|------|
| Browser → Grafana | Titania :443 → Prospero :443 → HAProxy → :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :443 → HAProxy → :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :443 → HAProxy → OAuth2-Proxy :9091 → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | `https://loki.ouranos.helu.ca` → HAProxy :443 → :3100 | Subdomain ACL | None |
| Alloy → Prometheus | `https://prometheus.ouranos.helu.ca/api/v1/write` → HAProxy :443 → :9090 | `skip_auth_route` | None |

## Deployment

### Prerequisites

1. **Terraform**: Prospero container must have updated port mappings (`terraform apply`)
2. **Certbot**: Wildcard cert must exist on Titania (`ansible-playbook certbot/deploy.yml`)
3. **Vault Secrets**: All vault variables must be set (see [Required Vault Secrets](#required-vault-secrets))
4. **Casdoor Applications**: Register PgAdmin and Prometheus apps in Casdoor (see [Casdoor SSO](#casdoor-sso))

### Playbook

```bash
cd ansible
ansible-playbook pplg/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `pplg/deploy.yml` | Main consolidated deployment playbook |
| `pplg/pplg-haproxy.cfg.j2` | HAProxy TLS termination config (5 backends) |
| `pplg/prometheus.yml.j2` | Prometheus scrape configuration |
| `pplg/alert_rules.yml.j2` | Prometheus alerting rules |
| `pplg/alertmanager.yml.j2` | Alertmanager routing and Pushover notifications |
| `pplg/config.yml.j2` | Loki server configuration |
| `pplg/grafana.ini.j2` | Grafana main config with Casdoor OAuth |
| `pplg/datasource.yml.j2` | Grafana provisioned datasources |
| `pplg/users.yml.j2` | Grafana provisioned users |
| `pplg/config_local.py.j2` | PgAdmin config with Casdoor OAuth |
| `pplg/pgadmin.service.j2` | PgAdmin gunicorn systemd unit |
| `pplg/oauth2-proxy-prometheus.cfg.j2` | OAuth2-Proxy config for Prometheus UI |
| `pplg/oauth2-proxy-prometheus.service.j2` | OAuth2-Proxy systemd unit |

### Deployment Steps

1. **APT Repositories**: Add Grafana and PgAdmin repos
2. **Install Packages**: haproxy, prometheus, loki, grafana, pgadmin4-web, gunicorn
3. **Prometheus**: Config, alert rules, systemd override for remote write receiver
4. **Alertmanager**: Install, config with Pushover integration
5. **Loki**: Create user/dirs, template config
6. **Grafana**: Provisioning (datasources, users, dashboards), OAuth config
7. **PgAdmin**: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
8. **OAuth2-Proxy**: Download binary (v7.6.0), config for Prometheus sidecar
9. **SSL Certificate**: Fetch Let's Encrypt wildcard cert from Titania (self-signed fallback)
10. **HAProxy**: Template config, enable and start systemd service

### Deployment Order

PPLG must be deployed **before** services that push metrics/logs:

```
apt_update → alloy → node_exporter → pplg → postgresql → ...
```

This order is enforced in `site.yml`.

## Required Vault Secrets

Add the variables in this section to `ansible/inventory/group_vars/all/vault.yml`.

⚠️ **All vault variables below must be set before running the playbook.** Missing variables will cause template failures like:

```
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```

### Prometheus Scrape Credentials

These are used in `prometheus.yml.j2` to scrape metrics from Casdoor and Gitea.

#### 1. Casdoor Prometheus Access Key

```yaml
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
```

#### 2. Casdoor Prometheus Access Secret

```yaml
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
```

**Requirements (both):**
- **Source**: API key pair from the `built-in/admin` Casdoor user
- **Used by**: `prometheus.yml.j2` Casdoor scrape job (`accessKey` / `accessSecret` query params)
- **How to obtain**: Generate via the Casdoor API (the "API key" account item is not exposed in the UI by default):

```bash
# 1. Login to get session cookie
curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
  -H "Content-Type: application/json" \
  -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'

# 2. Generate API keys for built-in/admin
curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
  -H "Content-Type: application/json" \
  -d '{"owner":"built-in","name":"admin"}'

# 3. Retrieve the generated keys
curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"

# 4. Cleanup
rm /tmp/casdoor-cookie.txt
```

⚠️ The `built-in/admin` user is used (not a `heluca` user) because Casdoor's `/api/metrics` endpoint requires an admin user and serves global platform metrics.

#### 3. Gitea Metrics Token

```yaml
vault_gitea_metrics_token: "YourGiteaMetricsToken"
```

**Requirements:**
- **Length**: 32+ characters
- **Source**: Must match the token configured in Gitea's `app.ini`
- **Generation**: `openssl rand -hex 32`
- **Used by**: `prometheus.yml.j2` Gitea scrape job (Bearer token auth)

### Grafana Credentials

#### 4. Grafana Admin User

```yaml
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
```

#### 5. Grafana Viewer User

```yaml
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
```

#### 6. Grafana OAuth (Casdoor SSO)

```yaml
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
```

**Requirements:**
- **Source**: Must match the Casdoor application `app-grafana`
- **Redirect URI**: `https://grafana.ouranos.helu.ca/login/generic_oauth`

### PgAdmin

#### 7. PgAdmin Setup

Initialize the PgAdmin configuration database manually on Prospero:

`/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db`

**Requirements:**
- **Purpose**: Initial local admin account (fallback when OAuth is unavailable)

#### 8. PgAdmin OAuth (Casdoor SSO)

```yaml
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
```

**Requirements:**
- **Source**: Must match the Casdoor application `app-pgadmin`
- **Redirect URI**: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`

### Prometheus OAuth2-Proxy

#### 9. Prometheus OAuth2-Proxy (Casdoor SSO)

```yaml
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
```

**Requirements:**
- **Source**: Client ID/Secret must match the Casdoor application `app-prometheus`
- **Redirect URI**: `https://prometheus.ouranos.helu.ca/oauth2/callback`
- **Cookie secret generation**:

```bash
python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
```

### Alertmanager (Pushover)

#### 10. Pushover Notification Credentials

```yaml
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
```

**Requirements:**
- **Source**: [pushover.net](https://pushover.net/) account
- **User Key**: Found on Pushover dashboard
- **API Token**: Create an application in Pushover

### Quick Reference

| Vault Variable | Used By | Source |
|---------------|---------|--------|
| `vault_casdoor_prometheus_access_key` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_casdoor_prometheus_access_secret` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_gitea_metrics_token` | prometheus.yml.j2 | Gitea app.ini |
| `vault_grafana_admin_name` | users.yml.j2 | Choose any |
| `vault_grafana_admin_login` | users.yml.j2 | Choose any |
| `vault_grafana_admin_password` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_name` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_login` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_password` | users.yml.j2 | Choose any |
| `vault_grafana_oauth_client_id` | grafana.ini.j2 | Casdoor app |
| `vault_grafana_oauth_client_secret` | grafana.ini.j2 | Casdoor app |
| `vault_pgadmin_email` | config_local.py.j2 | Choose any |
| `vault_pgadmin_password` | config_local.py.j2 | Choose any |
| `vault_pgadmin_oauth_client_id` | config_local.py.j2 | Casdoor app |
| `vault_pgadmin_oauth_client_secret` | config_local.py.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_id` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_secret` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_cookie_secret` | oauth2-proxy-prometheus.cfg.j2 | Generate |
| `vault_pushover_user_key` | alertmanager.yml.j2 | Pushover account |
| `vault_pushover_api_token` | alertmanager.yml.j2 | Pushover account |

## Casdoor SSO

Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.
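Each of the three applications is tied to a client ID and secret stored in the vault. As a minimal sketch (not part of the playbook), the helper below checks that decrypted vault text defines those six OAuth variables. The function name `missing_vault_vars` and the regex-based parsing are assumptions made for illustration; it only looks for top-level `name:` keys and does not validate the values.

```python
import re

# OAuth-related vault variables named in this document,
# one client ID/secret pair per Casdoor application.
REQUIRED_OAUTH_VARS = [
    "vault_grafana_oauth_client_id",
    "vault_grafana_oauth_client_secret",
    "vault_pgadmin_oauth_client_id",
    "vault_pgadmin_oauth_client_secret",
    "vault_prometheus_oauth2_client_id",
    "vault_prometheus_oauth2_client_secret",
]


def missing_vault_vars(vault_text: str, required: list[str]) -> list[str]:
    """Return the required names that have no top-level `name:` entry.

    Hypothetical helper: expects already-decrypted YAML text and matches
    keys that start at column 0, so nested keys are ignored.
    """
    defined = set(
        re.findall(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", vault_text, flags=re.MULTILINE)
    )
    return [name for name in required if name not in defined]


# Example with a deliberately incomplete sample (placeholder values only).
sample = (
    'vault_grafana_oauth_client_id: "grafana-oauth-client"\n'
    'vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"\n'
)
for name in missing_vault_vars(sample, REQUIRED_OAUTH_VARS):
    print(f"missing: {name}")
```

In practice the text could come from the output of `ansible-vault view inventory/group_vars/all/vault.yml`, saved or piped into a script that calls this function.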
### Applications to Register

Register in the Casdoor Admin UI (`https://id.ouranos.helu.ca`) or add to `ansible/casdoor/init_data.json.j2`:

| Application | Client ID | Redirect URI | Grant Types |
|-------------|-----------|-------------|-------------|
| `app-grafana` | `vault_grafana_oauth_client_id` | `https://grafana.ouranos.helu.ca/login/generic_oauth` | `authorization_code`, `refresh_token` |
| `app-pgadmin` | `vault_pgadmin_oauth_client_id` | `https://pgadmin.ouranos.helu.ca/oauth2/redirect` | `authorization_code`, `refresh_token` |
| `app-prometheus` | `vault_prometheus_oauth2_client_id` | `https://prometheus.ouranos.helu.ca/oauth2/callback` | `authorization_code`, `refresh_token` |

### URL Strategy

| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/login/oauth/authorize` | User's browser (external) |
| **Token URL** | `https://id.ouranos.helu.ca/api/login/oauth/access_token` | Server-to-server |
| **Userinfo URL** | `https://id.ouranos.helu.ca/api/userinfo` | Server-to-server |
| **OIDC Discovery** | `https://id.ouranos.helu.ca/.well-known/openid-configuration` | OAuth2-Proxy |

### Auth Methods per Service

| Service | Auth Method | Details |
|---------|-------------|---------|
| **Grafana** | Native `[auth.generic_oauth]` | Built-in OAuth support in `grafana.ini` |
| **PgAdmin** | Native `OAUTH2_CONFIG` | Built-in OAuth support in `config_local.py` |
| **Prometheus** | OAuth2-Proxy sidecar | Binary on `:9091` proxying to `:9090` |
| **Loki** | None | Machine-to-machine (Alloy agents push logs) |
| **Alertmanager** | None | Internal only |

## HAProxy Configuration

### Backends

| Backend | Upstream | Health Check | Auth |
|---------|----------|-------------|------|
| `backend_grafana` | `127.0.0.1:3000` | `GET /api/health` | Grafana OAuth |
| `backend_pgadmin` | `127.0.0.1:5050` | `GET /misc/ping` | PgAdmin OAuth |
| `backend_prometheus` | `127.0.0.1:9091` (OAuth2-Proxy) | `GET /ping` | OAuth2-Proxy |
| `backend_prometheus_direct` | `127.0.0.1:9090` | None | None (write API) |
| `backend_loki` | `127.0.0.1:3100` | `GET /ready` | None |
| `backend_alertmanager` | `127.0.0.1:9093` | `GET /-/healthy` | None |

### skip_auth_route Pattern

The Prometheus write API (`/api/v1/write`) is accessed by Alloy agents for machine-to-machine metric pushes. HAProxy uses an ACL to bypass OAuth2-Proxy:

```
acl is_prometheus_write path_beg /api/v1/write
use_backend backend_prometheus_direct if host_prometheus is_prometheus_write
```

This routes `https://prometheus.ouranos.helu.ca/api/v1/write` directly to Prometheus on `:9090`, while all other Prometheus traffic goes through OAuth2-Proxy on `:9091`.

### SSL Certificate

- **Primary**: Let's Encrypt wildcard cert (`*.ouranos.helu.ca`) fetched from Titania
- **Fallback**: Self-signed cert generated on Prospero (if Titania unavailable)
- **Path**: `/etc/haproxy/certs/ouranos.pem`

## Host Variables

**File:** `ansible/inventory/host_vars/prospero.incus.yml`

Services list:

```yaml
services:
  - alloy
  - pplg
```

Key variable groups defined in `prospero.incus.yml`:
- PPLG HAProxy (user, group, uid/gid 800, syslog port)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)

## Terraform

### Prospero Port Mapping

```hcl
devices = [
  {
    name = "https_internal"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25510"
      connect = "tcp:127.0.0.1:443"
    }
  },
  {
    name = "http_redirect"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25511"
      connect = "tcp:127.0.0.1:80"
    }
  }
]
```

Run `terraform apply` before deploying if port mappings changed.
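After `terraform apply`, it can be useful to confirm that the two proxy listen ports actually accept TCP connections. The following is a minimal sketch using only the Python standard library; the function names and the `incus_host` address are hypothetical and should be replaced with the address of the Incus host that publishes the ports. It checks reachability only, not TLS or HTTP behaviour.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_proxy_ports(host: str, ports: dict[str, int]) -> dict[str, bool]:
    """Map each named proxy device to whether its listen port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    # Hypothetical address: substitute the Incus host that exposes the proxy devices.
    incus_host = "127.0.0.1"
    # Listen ports taken from the proxy devices in the port mapping.
    results = check_proxy_ports(
        incus_host, {"https_internal": 25510, "http_redirect": 25511}
    )
    for name, is_open in results.items():
        print(f"{name}: {'open' if is_open else 'closed'}")
```

A closed port after a successful apply would suggest the proxy device was not created or that nothing is listening on the container side yet (HAProxy is only started at the end of the playbook).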
### Titania Backend Routing

Titania's HAProxy routes external subdomains to Prospero's HTTPS port:

```yaml
# In titania.incus.yml
haproxy_backends:
  - subdomain: "grafana"
    backend_host: "prospero.incus"
    backend_port: 443
    health_path: "/api/health"
    ssl_backend: true
  - subdomain: "pgadmin"
    backend_host: "prospero.incus"
    backend_port: 443
    health_path: "/misc/ping"
    ssl_backend: true
  - subdomain: "prometheus"
    backend_host: "prospero.incus"
    backend_port: 443
    health_path: "/ping"
    ssl_backend: true
```

## Monitoring

### Alloy Configuration

**File:** `ansible/alloy/prospero/config.alloy.j2`

- **HAProxy Syslog**: `loki.source.syslog` on `127.0.0.1:51405` (TCP) receives syslog from the local HAProxy service
- **Journal Labels**: Dedicated job labels for `grafana-server`, `prometheus`, `loki`, `alertmanager`, `pgadmin`, `oauth2-proxy-prometheus`
- **System Logs**: `/var/log/syslog`, `/var/log/auth.log` → Loki
- **Metrics**: Node exporter + process exporter → Prometheus remote write

### Prometheus Scrape Targets

| Job | Target | Auth |
|-----|--------|------|
| `prometheus` | `localhost:9090` | None |
| `node-exporter` | All Uranian hosts `:9100` | None |
| `alertmanager` | `prospero.incus:9093` | None |
| `haproxy` | `titania.incus:8404` | None |
| `gitea` | `oberon.incus:22084` | Bearer token |
| `casdoor` | `titania.incus:22081` | Access key/secret params |

### Alert Rules

Groups defined in `alert_rules.yml.j2`:

| Group | Alerts | Scope |
|-------|--------|-------|
| `node_alerts` | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| `puck_process_alerts` | HighCPU/Memory per process, CrashLoop | puck.incus |
| `puck_container_alerts` | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| `service_alerts` | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| `loki_alerts` | HighLogVolume | Loki |

### Alertmanager Routing

Alerts are routed to Pushover with severity-based priority:

| Severity | Pushover Priority | Emoji |
|----------|-------------------|-------|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | None |

## Grafana MCP Server

Grafana has an associated **MCP (Model Context Protocol) server** that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on **Miranda** and connects back to Grafana on Prospero via the internal network (`prospero.incus:3000`) using a service account token.

| Property | Value |
|----------|-------|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | `http://miranda.incus:25530/grafana` |
| Auth | Grafana service account token (`vault_grafana_service_account_token`) |

The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: `pplg → grafana_mcp → mcpo`.

For full details (deployment, configuration, available tools, troubleshooting), see **[Grafana MCP Server](grafana_mcp.md)**.

## Access After Deployment

| Service | URL | Login |
|---------|-----|-------|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |

## Troubleshooting

### Service Status

```bash
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
```

### HAProxy Service

```bash
ssh prospero.incus
sudo systemctl status haproxy
sudo journalctl -u haproxy -f
```

### View Logs

```bash
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f

# HAProxy logs (shipped via syslog to Alloy → Loki)
# Query in Grafana: {job="pplg-haproxy"}
```

### Test Endpoints (from Prospero)

```bash
# Grafana
curl -s http://127.0.0.1:3000/api/health

# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping

# Prometheus
curl -s http://127.0.0.1:9090/-/healthy

# Loki
curl -s http://127.0.0.1:3100/ready

# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy

# HAProxy stats
curl -s http://127.0.0.1:8404/metrics | head
```

### Test TLS (from any host)

```bash
# Direct to Prospero container
curl -sk https://prospero.incus/api/health

# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
```

### Common Errors

#### `vault_casdoor_prometheus_access_key` is undefined

```
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```

**Cause**: The Casdoor metrics scrape job in `prometheus.yml.j2` requires access credentials.

**Fix**: Generate API keys for the `built-in/admin` Casdoor user (see [Casdoor Prometheus Access Key](#1-casdoor-prometheus-access-key) for the full procedure), then add them to the vault:

```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```

```yaml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
```

#### Certificate fetch fails

**Cause**: Titania is not running, or certbot has not provisioned the cert yet.

**Fix**: Ensure Titania is up and certbot has run:

```bash
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
```

The playbook falls back to a self-signed certificate if Titania is unavailable.

#### OAuth2 redirect loops

**Cause**: The redirect URI registered in the Casdoor application doesn't match the service URL.

**Fix**: Verify redirect URIs match exactly:
- Grafana: `https://grafana.ouranos.helu.ca/login/generic_oauth`
- PgAdmin: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
- Prometheus: `https://prometheus.ouranos.helu.ca/oauth2/callback`

## Migration Notes

PPLG replaces the following standalone playbooks (kept as reference):

| Original Playbook | Replaced By |
|-------------------|-------------|
| `prometheus/deploy.yml` | `pplg/deploy.yml` |
| `prometheus/alertmanager_deploy.yml` | `pplg/deploy.yml` |
| `loki/deploy.yml` | `pplg/deploy.yml` |
| `grafana/deploy.yml` | `pplg/deploy.yml` |
| `pgadmin/deploy.yml` | `pplg/deploy.yml` |

PgAdmin was previously hosted on **Portia** (port 25555). It now runs on **Prospero** via gunicorn (no Apache).