PPLG - Consolidated Observability & Admin Stack
Overview
PPLG is the consolidated observability and administration stack running on Prospero. It bundles PgAdmin, Prometheus, Loki, and Grafana with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication. TLS termination is handled by Titania's HAProxy, which routes directly to each service on Prospero.
Host: prospero.incus
Role: Observability
External Access: Via Titania HAProxy → prospero.incus (direct to service ports)
| Subdomain | Service | Auth Method |
|---|---|---|
| grafana.ouranos.helu.ca | Grafana | Native Casdoor OAuth |
| pgadmin.ouranos.helu.ca | PgAdmin | Native Casdoor OAuth |
| prometheus.ouranos.helu.ca | Prometheus | OAuth2-Proxy sidecar |
| loki.ouranos.helu.ca | Loki | None (machine-to-machine) |
| alertmanager.ouranos.helu.ca | Alertmanager | None (internal) |
Architecture
┌──────────┐      ┌────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│  Prospero (PPLG)                                │
│          │      │ (Titania)  │      │                                                 │
└──────────┘      │ :443 TLS   │      │  Grafana (:3000) - Casdoor OAuth                │
                  │ termination│      │  PgAdmin (:5050) - Casdoor OAuth                │
┌──────────┐      └────────────┘      │  OAuth2-Proxy (:9091) → Prometheus (:9090)      │
│  Alloy   │─────────────────────────▶│  Loki (:3100) - no auth                         │
│ (agents) │                          │  Alertmanager (:9093) - no auth                 │
└──────────┘                          └─────────────────────────────────────────────────┘
Traffic Flow
| Source | Destination | Path | Auth |
|---|---|---|---|
| Browser → Grafana | Titania :443 → Prospero :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :9091 (OAuth2-Proxy) → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | Titania :443 → Prospero :3100 | Subdomain ACL | None |
| Alloy → Prometheus | Titania :443 → Prospero :9091 → :9090 | skip_auth_routes | None |
Deployment
Prerequisites
- Terraform: Prospero container must have updated port mappings (terraform apply)
- Certbot: Wildcard cert must exist on Titania (ansible-playbook certbot/deploy.yml)
- Vault Secrets: All vault variables must be set (see Required Vault Secrets)
- Casdoor Applications: Register PgAdmin and Prometheus apps in Casdoor (see Casdoor SSO)
Playbook
cd ansible
ansible-playbook pplg/deploy.yml
Files
| File | Purpose |
|---|---|
| pplg/deploy.yml | Main consolidated deployment playbook |
| pplg/prometheus.yml.j2 | Prometheus scrape configuration |
| pplg/alert_rules.yml.j2 | Prometheus alerting rules |
| pplg/alertmanager.yml.j2 | Alertmanager routing and Pushover notifications |
| pplg/config.yml.j2 | Loki server configuration |
| pplg/grafana.ini.j2 | Grafana main config with Casdoor OAuth |
| pplg/datasource.yml.j2 | Grafana provisioned datasources |
| pplg/users.yml.j2 | Grafana provisioned users |
| pplg/config_local.py.j2 | PgAdmin config with Casdoor OAuth |
| pplg/pgadmin.service.j2 | PgAdmin gunicorn systemd unit |
| pplg/oauth2-proxy-prometheus.cfg.j2 | OAuth2-Proxy config for Prometheus UI |
| pplg/oauth2-proxy-prometheus.service.j2 | OAuth2-Proxy systemd unit |
Deployment Steps
- APT Repositories: Add Grafana and PgAdmin repos
- Install Packages: prometheus, loki, grafana, pgadmin4-web
- Prometheus: Config, alert rules, systemd override for remote write receiver
- Alertmanager: Install, config with Pushover integration
- Loki: Create user/dirs, template config
- Grafana: Provisioning (datasources, users, dashboards), OAuth config
- PgAdmin: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
- OAuth2-Proxy: Download binary (v7.6.0), config for Prometheus sidecar
Deployment Order
PPLG must be deployed before services that push metrics/logs:
apt_update → alloy → node_exporter → pplg → postgresql → ...
This order is enforced in site.yml.
Required Vault Secrets
Add to ansible/inventory/group_vars/all/vault.yml:
⚠️ All vault variables below must be set before running the playbook. Missing variables will cause template failures like:
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
Prometheus Scrape Credentials
These are used in prometheus.yml.j2 to scrape metrics from Casdoor and Gitea.
1. Casdoor Prometheus Access Key
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
2. Casdoor Prometheus Access Secret
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
Requirements (both):
- Source: API key pair from the built-in/admin Casdoor user
- Used by: prometheus.yml.j2 Casdoor scrape job (accessKey/accessSecret query params)
- How to obtain: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
# 1. Login to get session cookie
curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
  -H "Content-Type: application/json" \
  -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
# 2. Generate API keys for built-in/admin
curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
  -H "Content-Type: application/json" \
  -d '{"owner":"built-in","name":"admin"}'
# 3. Retrieve the generated keys
curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"
# 4. Cleanup
rm /tmp/casdoor-cookie.txt
⚠️ The built-in/admin user is used (not a heluca user) because Casdoor's /api/metrics endpoint requires an admin user and serves global platform metrics.
3. Gitea Metrics Token
vault_gitea_metrics_token: "YourGiteaMetricsToken"
Requirements:
- Length: 32+ characters
- Source: Must match the token configured in Gitea's app.ini
- Generation: openssl rand -hex 32
- Used by: prometheus.yml.j2 Gitea scrape job (Bearer token auth)
Grafana Credentials
4. Grafana Admin User
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
5. Grafana Viewer User
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
6. Grafana OAuth (Casdoor SSO)
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
Requirements:
- Source: Must match the Casdoor application app-grafana
- Redirect URI: https://grafana.ouranos.helu.ca/login/generic_oauth
PgAdmin
7. PgAdmin Setup
Run the PgAdmin database setup manually:
/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db
Requirements:
- Purpose: Initial local admin account (fallback when OAuth is unavailable)
8. PgAdmin OAuth (Casdoor SSO)
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
Requirements:
- Source: Must match the Casdoor application app-pgadmin
- Redirect URI: https://pgadmin.ouranos.helu.ca/oauth2/redirect
Prometheus OAuth2-Proxy
9. Prometheus OAuth2-Proxy (Casdoor SSO)
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
Requirements:
- Client ID/Secret must match the Casdoor application app-prometheus
- Redirect URI: https://prometheus.ouranos.helu.ca/oauth2/callback
- Cookie secret generation: python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
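A generated cookie secret can be sanity-checked before it goes into the vault. The sketch below assumes that OAuth2-Proxy expects the secret to decode to 16, 24, or 32 bytes (this is my understanding of its AES key requirement, not something stated in this document). The helper names are hypothetical.

```python
import base64
import secrets

# Assumption: OAuth2-Proxy accepts cookie secrets that decode to an AES key length.
VALID_AES_KEY_LENGTHS = (16, 24, 32)


def generate_cookie_secret() -> str:
    """Same generation method as the one-liner above."""
    return secrets.token_urlsafe(32)


def decoded_length(secret: str) -> int:
    """Byte length after URL-safe base64 decoding, with padding restored."""
    padded = secret + "=" * (-len(secret) % 4)
    return len(base64.urlsafe_b64decode(padded))


secret = generate_cookie_secret()
# token_urlsafe(32) encodes 32 random bytes as 43 unpadded base64 characters.
print(len(secret), decoded_length(secret))
print(decoded_length(secret) in VALID_AES_KEY_LENGTHS)
```

If the check prints False for a secret produced some other way, regenerating it with the command above is the simplest fix.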
Alertmanager (Pushover)
10. Pushover Notification Credentials
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
Requirements:
- Source: pushover.net account
- User Key: Found on Pushover dashboard
- API Token: Create an application in Pushover
Quick Reference
| Vault Variable | Used By | Source |
|---|---|---|
| vault_casdoor_prometheus_access_key | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_casdoor_prometheus_access_secret | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_gitea_metrics_token | prometheus.yml.j2 | Gitea app.ini |
| vault_grafana_admin_name | users.yml.j2 | Choose any |
| vault_grafana_admin_login | users.yml.j2 | Choose any |
| vault_grafana_admin_password | users.yml.j2 | Choose any |
| vault_grafana_viewer_name | users.yml.j2 | Choose any |
| vault_grafana_viewer_login | users.yml.j2 | Choose any |
| vault_grafana_viewer_password | users.yml.j2 | Choose any |
| vault_grafana_oauth_client_id | grafana.ini.j2 | Casdoor app |
| vault_grafana_oauth_client_secret | grafana.ini.j2 | Casdoor app |
| vault_pgadmin_email | config_local.py.j2 | Choose any |
| vault_pgadmin_password | config_local.py.j2 | Choose any |
| vault_pgadmin_oauth_client_id | config_local.py.j2 | Casdoor app |
| vault_pgadmin_oauth_client_secret | config_local.py.j2 | Casdoor app |
| vault_prometheus_oauth2_client_id | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_client_secret | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_cookie_secret | oauth2-proxy-prometheus.cfg.j2 | Generate |
| vault_pushover_user_key | alertmanager.yml.j2 | Pushover account |
| vault_pushover_api_token | alertmanager.yml.j2 | Pushover account |
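Since a missing variable only surfaces as a template failure partway through the run, a pre-flight check may save time. The following is a minimal sketch, not part of the playbook. It assumes the decrypted vault content is available as text (for example, piped from ansible-vault view) and that each variable is a top-level YAML key. The function name is hypothetical.

```python
import re
from typing import List

# Variable names copied from the Quick Reference table.
REQUIRED_VAULT_VARS = [
    "vault_casdoor_prometheus_access_key",
    "vault_casdoor_prometheus_access_secret",
    "vault_gitea_metrics_token",
    "vault_grafana_admin_name",
    "vault_grafana_admin_login",
    "vault_grafana_admin_password",
    "vault_grafana_viewer_name",
    "vault_grafana_viewer_login",
    "vault_grafana_viewer_password",
    "vault_grafana_oauth_client_id",
    "vault_grafana_oauth_client_secret",
    "vault_pgadmin_email",
    "vault_pgadmin_password",
    "vault_pgadmin_oauth_client_id",
    "vault_pgadmin_oauth_client_secret",
    "vault_prometheus_oauth2_client_id",
    "vault_prometheus_oauth2_client_secret",
    "vault_prometheus_oauth2_cookie_secret",
    "vault_pushover_user_key",
    "vault_pushover_api_token",
]


def missing_vault_vars(vault_yaml: str) -> List[str]:
    """Return required names that do not appear as top-level keys in the text."""
    defined = set(
        re.findall(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", vault_yaml, flags=re.MULTILINE)
    )
    return [name for name in REQUIRED_VAULT_VARS if name not in defined]


# Example with a deliberately incomplete vault: only two variables are defined.
sample = 'vault_gitea_metrics_token: "x"\nvault_pushover_user_key: "y"\n'
for name in missing_vault_vars(sample):
    print(f"missing: {name}")
```

The regex only recognises unindented keys, which keeps the sketch free of a YAML dependency. A nested vault layout would need a real YAML parser instead.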
Casdoor SSO
Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.
Applications to Register
Register in Casdoor Admin UI (https://id.ouranos.helu.ca) or add to ansible/casdoor/init_data.json.j2:
| Application | Client ID | Redirect URI | Grant Types |
|---|---|---|---|
| app-grafana | vault_grafana_oauth_client_id | https://grafana.ouranos.helu.ca/login/generic_oauth | authorization_code, refresh_token |
| app-pgadmin | vault_pgadmin_oauth_client_id | https://pgadmin.ouranos.helu.ca/oauth2/redirect | authorization_code, refresh_token |
| app-prometheus | vault_prometheus_oauth2_client_id | https://prometheus.ouranos.helu.ca/oauth2/callback | authorization_code, refresh_token |
URL Strategy
| URL Type | Address | Used By |
|---|---|---|
| Auth URL | https://id.ouranos.helu.ca/login/oauth/authorize | User's browser (external) |
| Token URL | https://id.ouranos.helu.ca/api/login/oauth/access_token | Server-to-server |
| Userinfo URL | https://id.ouranos.helu.ca/api/userinfo | Server-to-server |
| OIDC Discovery | https://id.ouranos.helu.ca/.well-known/openid-configuration | OAuth2-Proxy |
Auth Methods per Service
| Service | Auth Method | Details |
|---|---|---|
| Grafana | Native [auth.generic_oauth] | Built-in OAuth support in grafana.ini |
| PgAdmin | Native OAUTH2_CONFIG | Built-in OAuth support in config_local.py |
| Prometheus | OAuth2-Proxy sidecar | Binary on :9091 proxying to :9090 |
| Loki | None | Machine-to-machine (Alloy agents push logs) |
| Alertmanager | None | Internal only |
OAuth2-Proxy skip_auth_routes
The Prometheus write API (/api/v1/write) and health check (/ping) are accessed by Alloy agents for machine-to-machine metric pushes. OAuth2-Proxy's skip_auth_routes config bypasses authentication for these paths:
skip_auth_routes = [
"^/ping$",
"^/api/v1/write$"
]
This allows https://prometheus.ouranos.helu.ca/api/v1/write to reach Prometheus without OAuth, while all other Prometheus traffic requires Casdoor SSO authentication.
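To see which request paths the two patterns let through, they can be evaluated directly. This sketch uses Python's re module as a stand-in for the regex engine OAuth2-Proxy uses. For anchored literal patterns like these, the result should be the same, but treat that as an assumption. The function name is hypothetical.

```python
import re

# Patterns copied from the skip_auth_routes setting above.
SKIP_AUTH_ROUTES = [r"^/ping$", r"^/api/v1/write$"]


def bypasses_auth(path: str) -> bool:
    """Return True if the request path matches any skip_auth_routes pattern."""
    return any(re.search(pattern, path) for pattern in SKIP_AUTH_ROUTES)


for path in ("/ping", "/api/v1/write", "/api/v1/query", "/api/v1/write/extra", "/graph"):
    label = "no auth" if bypasses_auth(path) else "Casdoor SSO"
    print(f"{path:22} {label}")
```

Because both patterns are anchored with ^ and $, only the exact paths are exempt. A path such as /api/v1/write/extra still requires authentication.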
Host Variables
File: ansible/inventory/host_vars/prospero.incus.yml
Services list:
services:
- alloy
- pplg
Key variable groups defined in prospero.incus.yml:
- PPLG domain (ouranos.helu.ca)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)
Titania Backend Routing
Titania's HAProxy routes external subdomains directly to Prospero service ports:
# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
backend_host: "prospero.incus"
backend_port: 3000
health_path: "/api/health"
- subdomain: "pgadmin"
backend_host: "prospero.incus"
backend_port: 5050
health_path: "/misc/ping"
- subdomain: "prometheus"
backend_host: "prospero.incus"
backend_port: 9091 # OAuth2-Proxy sidecar
health_path: "/ping"
- subdomain: "loki"
backend_host: "prospero.incus"
backend_port: 3100
health_path: "/ready"
- subdomain: "alertmanager"
backend_host: "prospero.incus"
backend_port: 9093
health_path: "/-/healthy"
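Each backend entry implies two URLs: the external one that clients use through Titania, and the internal plain-HTTP one that HAProxy forwards to. The sketch below derives both from the values listed above. It assumes the external hostname is simply subdomain plus the PPLG domain. The Backend class and its method names are illustrative and not part of the HAProxy role.

```python
from dataclasses import dataclass

DOMAIN = "ouranos.helu.ca"  # PPLG domain from the host variables


@dataclass(frozen=True)
class Backend:
    subdomain: str
    backend_host: str
    backend_port: int
    health_path: str

    def external_url(self) -> str:
        """URL a client uses; TLS terminates on Titania."""
        return f"https://{self.subdomain}.{DOMAIN}{self.health_path}"

    def internal_url(self) -> str:
        """Plain-HTTP URL on the internal network that HAProxy checks."""
        return f"http://{self.backend_host}:{self.backend_port}{self.health_path}"


# Values copied from the haproxy_backends entries above.
BACKENDS = [
    Backend("grafana", "prospero.incus", 3000, "/api/health"),
    Backend("pgadmin", "prospero.incus", 5050, "/misc/ping"),
    Backend("prometheus", "prospero.incus", 9091, "/ping"),
    Backend("loki", "prospero.incus", 3100, "/ready"),
    Backend("alertmanager", "prospero.incus", 9093, "/-/healthy"),
]

for backend in BACKENDS:
    print(f"{backend.external_url()} -> {backend.internal_url()}")
```

Note that the prometheus entry points at port 9091, so its health check exercises the OAuth2-Proxy sidecar rather than Prometheus itself.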
Monitoring
Alloy Configuration
File: ansible/alloy/prospero/config.alloy.j2
- Journal Labels: Dedicated job labels for grafana-server, prometheus, loki, alertmanager, pgadmin, oauth2-proxy-prometheus
- System Logs: /var/log/syslog, /var/log/auth.log → Loki
- Metrics: Node exporter + process exporter → Prometheus remote write
Prometheus Scrape Targets
| Job | Target | Auth |
|---|---|---|
| prometheus | localhost:9090 | None |
| node-exporter | All Uranian hosts :9100 | None |
| alertmanager | prospero.incus:9093 | None |
| haproxy | titania.incus:8404 | None |
| gitea | oberon.incus:22084 | Bearer token |
| casdoor | titania.incus:22081 | Access key/secret params |
Alert Rules
Groups defined in alert_rules.yml.j2:
| Group | Alerts | Scope |
|---|---|---|
| node_alerts | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| puck_process_alerts | HighCPU/Memory per process, CrashLoop | puck.incus |
| puck_container_alerts | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| service_alerts | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| loki_alerts | HighLogVolume | Loki |
Alertmanager Routing
Alerts are routed to Pushover with severity-based priority:
| Severity | Pushover Priority | Emoji |
|---|---|---|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | — |
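The severity mapping can be read as a simple lookup. This is an illustrative sketch only: the real routing lives in alertmanager.yml.j2, the function names are hypothetical, and falling back to priority 0 for an unknown severity is an assumption.

```python
# Severity -> (Pushover priority, emoji prefix), taken from the table above.
SEVERITY_ROUTING = {
    "critical": (2, "🚨"),
    "warning": (1, "⚠️"),
    "info": (0, ""),
}


def pushover_priority(severity: str) -> int:
    """Look up the Pushover priority; unknown severities fall back to 0 (assumption)."""
    priority, _emoji = SEVERITY_ROUTING.get(severity.lower(), (0, ""))
    return priority


def notification_title(severity: str, alertname: str) -> str:
    """Prefix the alert name with the severity emoji, if there is one."""
    _priority, emoji = SEVERITY_ROUTING.get(severity.lower(), (0, ""))
    return f"{emoji} {alertname}".strip()


print(pushover_priority("critical"), notification_title("warning", "HighCPU"))
```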
Grafana MCP Server
Grafana has an associated MCP (Model Context Protocol) server that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on Miranda and connects back to Grafana on Prospero via the internal network (prospero.incus:3000) using a service account token.
| Property | Value |
|---|---|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | http://miranda.incus:25530/grafana |
| Auth | Grafana service account token (vault_grafana_service_account_token) |
The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: pplg → grafana_mcp → mcpo.
For full details — deployment, configuration, available tools, troubleshooting — see Grafana MCP Server.
Access After Deployment
| Service | URL | Login |
|---|---|---|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |
Troubleshooting
Service Status
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
View Logs
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f
Test Endpoints (from Prospero)
# Grafana
curl -s http://127.0.0.1:3000/api/health
# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping
# Prometheus
curl -s http://127.0.0.1:9090/-/healthy
# Loki
curl -s http://127.0.0.1:3100/ready
# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy
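The same checks can be run in one pass from Python when a summary is more convenient than five curl calls. This is a sketch using only the standard library. The endpoint URLs are those listed above, while the function names and the 3-second timeout are assumptions.

```python
import urllib.error
import urllib.request

# Local health endpoints, copied from the curl commands above.
ENDPOINTS = {
    "grafana": "http://127.0.0.1:3000/api/health",
    "pgadmin": "http://127.0.0.1:5050/misc/ping",
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "loki": "http://127.0.0.1:3100/ready",
    "alertmanager": "http://127.0.0.1:9093/-/healthy",
}


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def report() -> dict:
    """Probe every endpoint and return a name -> healthy mapping."""
    return {name: is_healthy(url) for name, url in ENDPOINTS.items()}
```

Calling report() on Prospero returns a dictionary with one boolean per service. Any False entry is a cue to check that unit with systemctl status or journalctl as shown earlier.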
Test External Access (from any host)
# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
curl -s https://pgadmin.ouranos.helu.ca/misc/ping
curl -s https://prometheus.ouranos.helu.ca/ping
curl -s https://loki.ouranos.helu.ca/ready
curl -s https://alertmanager.ouranos.helu.ca/-/healthy
Common Errors
vault_casdoor_prometheus_access_key is undefined
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
Cause: The Casdoor metrics scrape job in prometheus.yml.j2 requires access credentials.
Fix: Generate API keys for the built-in/admin Casdoor user (see Casdoor Prometheus Access Key for the full procedure), then add to vault:
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
Certificate fetch fails
Cause: Titania is not running, or certbot has not provisioned the cert yet.
Fix: Ensure Titania is up and certbot has run:
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
The playbook falls back to a self-signed certificate if Titania is unavailable.
OAuth2 redirect loops
Cause: Casdoor application redirect URI doesn't match the service URL.
Fix: Verify redirect URIs match exactly:
- Grafana: https://grafana.ouranos.helu.ca/login/generic_oauth
- PgAdmin: https://pgadmin.ouranos.helu.ca/oauth2/redirect
- Prometheus: https://prometheus.ouranos.helu.ca/oauth2/callback
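Near-miss URIs (a trailing slash, http instead of https) are easy to overlook by eye. The sketch below compares the URIs configured for an application against the expected value and names the likely mismatch. The configured list is assumed to be copied by hand from the Casdoor application settings, and the function name is hypothetical.

```python
from typing import List

# Expected redirect URIs, as listed above.
EXPECTED_REDIRECT_URIS = {
    "app-grafana": "https://grafana.ouranos.helu.ca/login/generic_oauth",
    "app-pgadmin": "https://pgadmin.ouranos.helu.ca/oauth2/redirect",
    "app-prometheus": "https://prometheus.ouranos.helu.ca/oauth2/callback",
}


def check_redirect_uri(app: str, configured: List[str]) -> str:
    """Return 'ok' on an exact match, otherwise a short description of the problem."""
    expected = EXPECTED_REDIRECT_URIS[app]
    if expected in configured:
        return "ok"
    for uri in configured:
        if uri.rstrip("/") == expected.rstrip("/"):
            return f"trailing slash mismatch: {uri!r}"
        if uri.replace("http://", "https://", 1) == expected:
            return f"scheme mismatch: {uri!r}"
    return f"missing: expected {expected!r}"


print(check_redirect_uri("app-pgadmin", ["https://pgadmin.ouranos.helu.ca/oauth2/redirect/"]))
```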
Migration Notes
PPLG replaces the following standalone playbooks (kept as reference):
| Original Playbook | Replaced By |
|---|---|
| prometheus/deploy.yml | pplg/deploy.yml |
| prometheus/alertmanager_deploy.yml | pplg/deploy.yml |
| loki/deploy.yml | pplg/deploy.yml |
| grafana/deploy.yml | pplg/deploy.yml |
| pgadmin/deploy.yml | pplg/deploy.yml |
PgAdmin was previously hosted on Portia (port 25555). It now runs on Prospero via gunicorn (no Apache).