ouranos/docs/pplg.md
Robert Helewka 0f21380fd0 refactor: remove HAProxy from Prospero, centralize TLS on Titania
Move TLS termination and reverse proxying entirely to Titania's
HAProxy, eliminating the redundant HAProxy instance on Prospero.
Backends now communicate over plain HTTP within the internal network.

- Remove HAProxy container, config, certs, and syslog from Prospero
- Remove ssl_backend flags from Titania backend definitions
- Replace pplg_haproxy_* vars with single pplg_domain variable
- Remove HAProxy syslog source from Alloy config
- Update OAuth2-Proxy to listen on all interfaces for Titania access
2026-04-08 17:57:09 +00:00

PPLG - Consolidated Observability & Admin Stack

Overview

PPLG is the consolidated observability and administration stack running on Prospero. It bundles PgAdmin, Prometheus, Loki, and Grafana with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication. TLS termination is handled by Titania's HAProxy, which routes directly to each service on Prospero.

Host: prospero.incus
Role: Observability
External Access: Via Titania HAProxy → prospero.incus (direct to service ports)

| Subdomain | Service | Auth Method |
|---|---|---|
| grafana.ouranos.helu.ca | Grafana | Native Casdoor OAuth |
| pgadmin.ouranos.helu.ca | PgAdmin | Native Casdoor OAuth |
| prometheus.ouranos.helu.ca | Prometheus | OAuth2-Proxy sidecar |
| loki.ouranos.helu.ca | Loki | None (machine-to-machine) |
| alertmanager.ouranos.helu.ca | Alertmanager | None (internal) |

Architecture

┌──────────┐      ┌────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│  Prospero (PPLG)                                │
│          │      │ (Titania)  │      │                                                  │
└──────────┘      │ :443 TLS   │      │  Grafana (:3000)     — Casdoor OAuth             │
                  │ termination│      │  PgAdmin (:5050)     — Casdoor OAuth             │
┌──────────┐      └────────────┘      │  OAuth2-Proxy (:9091) → Prometheus (:9090)       │
│  Alloy   │─────────────────────────▶│  Loki (:3100)        — no auth                   │
│ (agents) │                          │  Alertmanager (:9093) — no auth                   │
└──────────┘                          └─────────────────────────────────────────────────┘

Traffic Flow

| Flow | Network Path | Routing | Auth |
|---|---|---|---|
| Browser → Grafana | Titania :443 → Prospero :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :9091 (OAuth2-Proxy) → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | Titania :443 → Prospero :3100 | Subdomain ACL | None |
| Alloy → Prometheus | Titania :443 → Prospero :9091 → :9090 | skip_auth_routes | None |

Deployment

Prerequisites

  1. Terraform: Prospero container must have updated port mappings (terraform apply)
  2. Certbot: Wildcard cert must exist on Titania (ansible-playbook certbot/deploy.yml)
  3. Vault Secrets: All vault variables must be set (see Required Vault Secrets)
  4. Casdoor Applications: Register PgAdmin and Prometheus apps in Casdoor (see Casdoor SSO)

Playbook

cd ansible
ansible-playbook pplg/deploy.yml

Files

| File | Purpose |
|---|---|
| pplg/deploy.yml | Main consolidated deployment playbook |
| pplg/prometheus.yml.j2 | Prometheus scrape configuration |
| pplg/alert_rules.yml.j2 | Prometheus alerting rules |
| pplg/alertmanager.yml.j2 | Alertmanager routing and Pushover notifications |
| pplg/config.yml.j2 | Loki server configuration |
| pplg/grafana.ini.j2 | Grafana main config with Casdoor OAuth |
| pplg/datasource.yml.j2 | Grafana provisioned datasources |
| pplg/users.yml.j2 | Grafana provisioned users |
| pplg/config_local.py.j2 | PgAdmin config with Casdoor OAuth |
| pplg/pgadmin.service.j2 | PgAdmin gunicorn systemd unit |
| pplg/oauth2-proxy-prometheus.cfg.j2 | OAuth2-Proxy config for Prometheus UI |
| pplg/oauth2-proxy-prometheus.service.j2 | OAuth2-Proxy systemd unit |

Deployment Steps

  1. APT Repositories: Add Grafana and PgAdmin repos
  2. Install Packages: prometheus, loki, grafana, pgadmin4-web
  3. Prometheus: Config, alert rules, systemd override for remote write receiver
  4. Alertmanager: Install, config with Pushover integration
  5. Loki: Create user/dirs, template config
  6. Grafana: Provisioning (datasources, users, dashboards), OAuth config
  7. PgAdmin: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
  8. OAuth2-Proxy: Download binary (v7.6.0), config for Prometheus sidecar

Deployment Order

PPLG must be deployed before services that push metrics/logs:

apt_update → alloy → node_exporter → pplg → postgresql → ...

This order is enforced in site.yml.

Required Vault Secrets

Add to ansible/inventory/group_vars/all/vault.yml:

⚠️ All vault variables below must be set before running the playbook. Missing variables will cause template failures like:

TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined

Prometheus Scrape Credentials

These are used in prometheus.yml.j2 to scrape metrics from Casdoor and Gitea.

1. Casdoor Prometheus Access Key

vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"

2. Casdoor Prometheus Access Secret

vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"

Requirements (both):

  • Source: API key pair from the built-in/admin Casdoor user
  • Used by: prometheus.yml.j2 Casdoor scrape job (accessKey / accessSecret query params)
  • How to obtain: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
    # 1. Login to get session cookie
    curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
      -H "Content-Type: application/json" \
      -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
    
    # 2. Generate API keys for built-in/admin
    curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
      -H "Content-Type: application/json" \
      -d '{"owner":"built-in","name":"admin"}'
    
    # 3. Retrieve the generated keys
    curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
      python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')" 
    
    # 4. Cleanup
    rm /tmp/casdoor-cookie.txt
    

⚠️ The built-in/admin user is used (not a heluca user) because Casdoor's /api/metrics endpoint requires an admin user and serves global platform metrics.
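
The scrape job passes the key pair as accessKey / accessSecret query parameters. A minimal sketch of how such a URL is formed, assuming plain HTTP and the /api/metrics path on the casdoor target listed under Prometheus Scrape Targets. `build_casdoor_metrics_url` is a hypothetical helper for illustration and is not part of the playbook:

```python
from urllib.parse import urlencode


def build_casdoor_metrics_url(host: str, access_key: str, access_secret: str) -> str:
    """Return a Casdoor metrics URL carrying the API key pair as query parameters."""
    # urlencode percent-escapes characters that are unsafe in a query string.
    query = urlencode({"accessKey": access_key, "accessSecret": access_secret})
    # Assumption: plain HTTP on the internal network; change the scheme if needed.
    return f"http://{host}/api/metrics?{query}"
```

Calling it with `"titania.incus:22081"` and the two vault values gives a URL that can be tried with curl before templating prometheus.yml.j2.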

3. Gitea Metrics Token

vault_gitea_metrics_token: "YourGiteaMetricsToken"

Requirements:

  • Length: 32+ characters
  • Source: Must match the token configured in Gitea's app.ini
  • Generation: openssl rand -hex 32
  • Used by: prometheus.yml.j2 Gitea scrape job (Bearer token auth)
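
For hosts without openssl, an equivalent token can be produced with Python's standard library. This is a sketch under the requirements listed above; the function names are illustrative only:

```python
import secrets


def generate_metrics_token(num_bytes: int = 32) -> str:
    """Return a random hex token; 32 random bytes yield 64 hex characters."""
    return secrets.token_hex(num_bytes)


def is_valid_metrics_token(token: str, min_length: int = 32) -> bool:
    """Check the 32+ character length requirement stated above."""
    return len(token) >= min_length
```

The generated value still has to be written to both the vault and Gitea's app.ini so the two sides match.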

Grafana Credentials

4. Grafana Admin User

vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"

5. Grafana Viewer User

vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"

6. Grafana OAuth (Casdoor SSO)

vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"

Requirements:

  • Source: Must match the Casdoor application app-grafana
  • Redirect URI: https://grafana.ouranos.helu.ca/login/generic_oauth

PgAdmin

7. PgAdmin Setup

This step is not automated. Run the PgAdmin database setup manually on Prospero: /usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db

Requirements:

  • Purpose: Initial local admin account (fallback when OAuth is unavailable)

8. PgAdmin OAuth (Casdoor SSO)

vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"

Requirements:

  • Source: Must match the Casdoor application app-pgadmin
  • Redirect URI: https://pgadmin.ouranos.helu.ca/oauth2/redirect

Prometheus OAuth2-Proxy

9. Prometheus OAuth2-Proxy (Casdoor SSO)

vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"

Requirements:

  • Client ID/Secret must match the Casdoor application app-prometheus
  • Redirect URI: https://prometheus.ouranos.helu.ca/oauth2/callback
  • Cookie secret generation:
    python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
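
OAuth2-Proxy is documented as expecting a cookie secret of 16, 24, or 32 bytes. The sketch below checks how many raw bytes a URL-safe base64 secret decodes to, so a hand-edited value can be sanity-checked before deployment. The helper name is hypothetical:

```python
import base64
import secrets


def decoded_cookie_secret_length(secret: str) -> int:
    """Return the number of raw bytes encoded by a URL-safe base64 secret."""
    # token_urlsafe() strips '=' padding, so restore it before decoding.
    padded = secret + "=" * (-len(secret) % 4)
    return len(base64.urlsafe_b64decode(padded))
```

A value from `secrets.token_urlsafe(32)` decodes to 32 bytes, which falls within the accepted sizes.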
    

Alertmanager (Pushover)

10. Pushover Notification Credentials

vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"

Requirements:

  • Source: pushover.net account
  • User Key: Found on Pushover dashboard
  • API Token: Create an application in Pushover

Quick Reference

| Vault Variable | Used By | Source |
|---|---|---|
| vault_casdoor_prometheus_access_key | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_casdoor_prometheus_access_secret | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_gitea_metrics_token | prometheus.yml.j2 | Gitea app.ini |
| vault_grafana_admin_name | users.yml.j2 | Choose any |
| vault_grafana_admin_login | users.yml.j2 | Choose any |
| vault_grafana_admin_password | users.yml.j2 | Choose any |
| vault_grafana_viewer_name | users.yml.j2 | Choose any |
| vault_grafana_viewer_login | users.yml.j2 | Choose any |
| vault_grafana_viewer_password | users.yml.j2 | Choose any |
| vault_grafana_oauth_client_id | grafana.ini.j2 | Casdoor app |
| vault_grafana_oauth_client_secret | grafana.ini.j2 | Casdoor app |
| vault_pgadmin_email | config_local.py.j2 | Choose any |
| vault_pgadmin_password | config_local.py.j2 | Choose any |
| vault_pgadmin_oauth_client_id | config_local.py.j2 | Casdoor app |
| vault_pgadmin_oauth_client_secret | config_local.py.j2 | Casdoor app |
| vault_prometheus_oauth2_client_id | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_client_secret | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_cookie_secret | oauth2-proxy-prometheus.cfg.j2 | Generate |
| vault_pushover_user_key | alertmanager.yml.j2 | Pushover account |
| vault_pushover_api_token | alertmanager.yml.j2 | Pushover account |
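
Because a single missing variable aborts templating, it can help to check the list up front. The following is a rough sketch, assuming the decrypted vault has already been loaded into a Python dict (loading and decrypting are left out); `missing_vault_vars` is a hypothetical helper, not something shipped in the repository:

```python
# Names taken from the quick reference table for the PPLG stack.
REQUIRED_VAULT_VARS = [
    "vault_casdoor_prometheus_access_key",
    "vault_casdoor_prometheus_access_secret",
    "vault_gitea_metrics_token",
    "vault_grafana_admin_name",
    "vault_grafana_admin_login",
    "vault_grafana_admin_password",
    "vault_grafana_viewer_name",
    "vault_grafana_viewer_login",
    "vault_grafana_viewer_password",
    "vault_grafana_oauth_client_id",
    "vault_grafana_oauth_client_secret",
    "vault_pgadmin_email",
    "vault_pgadmin_password",
    "vault_pgadmin_oauth_client_id",
    "vault_pgadmin_oauth_client_secret",
    "vault_prometheus_oauth2_client_id",
    "vault_prometheus_oauth2_client_secret",
    "vault_prometheus_oauth2_cookie_secret",
    "vault_pushover_user_key",
    "vault_pushover_api_token",
]


def missing_vault_vars(vault: dict) -> list:
    """Return required variable names that are absent or empty in `vault`."""
    return [name for name in REQUIRED_VAULT_VARS if not vault.get(name)]
```

An empty result means every variable in the table has a non-empty value; anything returned should be added to vault.yml before running pplg/deploy.yml.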

Casdoor SSO

Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.

Applications to Register

Register in Casdoor Admin UI (https://id.ouranos.helu.ca) or add to ansible/casdoor/init_data.json.j2:

| Application | Client ID | Redirect URI | Grant Types |
|---|---|---|---|
| app-grafana | vault_grafana_oauth_client_id | https://grafana.ouranos.helu.ca/login/generic_oauth | authorization_code, refresh_token |
| app-pgadmin | vault_pgadmin_oauth_client_id | https://pgadmin.ouranos.helu.ca/oauth2/redirect | authorization_code, refresh_token |
| app-prometheus | vault_prometheus_oauth2_client_id | https://prometheus.ouranos.helu.ca/oauth2/callback | authorization_code, refresh_token |

URL Strategy

| URL Type | Address | Used By |
|---|---|---|
| Auth URL | https://id.ouranos.helu.ca/login/oauth/authorize | User's browser (external) |
| Token URL | https://id.ouranos.helu.ca/api/login/oauth/access_token | Server-to-server |
| Userinfo URL | https://id.ouranos.helu.ca/api/userinfo | Server-to-server |
| OIDC Discovery | https://id.ouranos.helu.ca/.well-known/openid-configuration | OAuth2-Proxy |

Auth Methods per Service

| Service | Auth Method | Details |
|---|---|---|
| Grafana | Native [auth.generic_oauth] | Built-in OAuth support in grafana.ini |
| PgAdmin | Native OAUTH2_CONFIG | Built-in OAuth support in config_local.py |
| Prometheus | OAuth2-Proxy sidecar | Binary on :9091 proxying to :9090 |
| Loki | None | Machine-to-machine (Alloy agents push logs) |
| Alertmanager | None | Internal only |

OAuth2-Proxy skip_auth_routes

The Prometheus write API (/api/v1/write) and health check (/ping) are accessed by Alloy agents for machine-to-machine metric pushes. OAuth2-Proxy's skip_auth_routes config bypasses authentication for these paths:

skip_auth_routes = [
    "^/ping$",
    "^/api/v1/write$"
]

This allows https://prometheus.ouranos.helu.ca/api/v1/write to reach Prometheus without OAuth, while all other Prometheus traffic requires Casdoor SSO authentication.
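
To see which request paths the two patterns let through, they can be exercised directly. OAuth2-Proxy evaluates them with Go's regexp engine; for anchored literals like these, Python's `re` module behaves the same way, so the sketch below is only an approximation of the proxy's matching and `bypasses_auth` is an illustrative name:

```python
import re

# Same patterns as in the skip_auth_routes setting above.
SKIP_AUTH_ROUTES = [r"^/ping$", r"^/api/v1/write$"]


def bypasses_auth(path: str) -> bool:
    """Return True if the request path matches any skip_auth_routes pattern."""
    return any(re.search(pattern, path) for pattern in SKIP_AUTH_ROUTES)
```

Because both patterns are anchored with `^` and `$`, `/api/v1/write` passes through unauthenticated while `/api/v1/query` or `/api/v1/write/extra` still require a Casdoor session.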

Host Variables

File: ansible/inventory/host_vars/prospero.incus.yml

Services list:

services:
  - alloy
  - pplg

Key variable groups defined in prospero.incus.yml:

  • PPLG domain (ouranos.helu.ca)
  • Grafana (datasources, users, OAuth config)
  • Prometheus (scrape targets, OAuth2-Proxy sidecar config)
  • Alertmanager (Pushover integration)
  • Loki (user, data/config directories)
  • PgAdmin (user, data/log directories, OAuth config)
  • Casdoor Metrics (access key/secret for Prometheus scraping)

Titania Backend Routing

Titania's HAProxy routes external subdomains directly to Prospero service ports:

# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
  backend_host: "prospero.incus"
  backend_port: 3000
  health_path: "/api/health"

- subdomain: "pgadmin"
  backend_host: "prospero.incus"
  backend_port: 5050
  health_path: "/misc/ping"

- subdomain: "prometheus"
  backend_host: "prospero.incus"
  backend_port: 9091  # OAuth2-Proxy sidecar
  health_path: "/ping"

- subdomain: "loki"
  backend_host: "prospero.incus"
  backend_port: 3100
  health_path: "/ready"

- subdomain: "alertmanager"
  backend_host: "prospero.incus"
  backend_port: 9093
  health_path: "/-/healthy"
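
The effect of this list can be pictured as a lookup from the request's Host header to a backend address. The sketch below mirrors the entries above in Python purely for illustration; it is not how the HAProxy template is implemented, and `resolve_backend` is a hypothetical name:

```python
PPLG_DOMAIN = "ouranos.helu.ca"

# Mirrors the haproxy_backends entries shown above.
HAPROXY_BACKENDS = [
    {"subdomain": "grafana", "backend_host": "prospero.incus", "backend_port": 3000},
    {"subdomain": "pgadmin", "backend_host": "prospero.incus", "backend_port": 5050},
    {"subdomain": "prometheus", "backend_host": "prospero.incus", "backend_port": 9091},
    {"subdomain": "loki", "backend_host": "prospero.incus", "backend_port": 3100},
    {"subdomain": "alertmanager", "backend_host": "prospero.incus", "backend_port": 9093},
]


def resolve_backend(host_header: str):
    """Map a Host header to (backend_host, backend_port), or None if unrouted."""
    # Drop an optional ":port" suffix and normalise case.
    hostname = host_header.split(":", 1)[0].lower()
    suffix = "." + PPLG_DOMAIN
    if not hostname.endswith(suffix):
        return None
    subdomain = hostname[: -len(suffix)]
    for backend in HAPROXY_BACKENDS:
        if backend["subdomain"] == subdomain:
            return backend["backend_host"], backend["backend_port"]
    return None
```

Note that the prometheus subdomain resolves to port 9091 (the OAuth2-Proxy sidecar), not to Prometheus itself on 9090.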

Monitoring

Alloy Configuration

File: ansible/alloy/prospero/config.alloy.j2

  • Journal Labels: Dedicated job labels for grafana-server, prometheus, loki, alertmanager, pgadmin, oauth2-proxy-prometheus
  • System Logs: /var/log/syslog, /var/log/auth.log → Loki
  • Metrics: Node exporter + process exporter → Prometheus remote write

Prometheus Scrape Targets

| Job | Target | Auth |
|---|---|---|
| prometheus | localhost:9090 | None |
| node-exporter | All Uranian hosts :9100 | None |
| alertmanager | prospero.incus:9093 | None |
| haproxy | titania.incus:8404 | None |
| gitea | oberon.incus:22084 | Bearer token |
| casdoor | titania.incus:22081 | Access key/secret params |

Alert Rules

Groups defined in alert_rules.yml.j2:

| Group | Alerts | Scope |
|---|---|---|
| node_alerts | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| puck_process_alerts | HighCPU/Memory per process, CrashLoop | puck.incus |
| puck_container_alerts | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| service_alerts | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| loki_alerts | HighLogVolume | Loki |

Alertmanager Routing

Alerts are routed to Pushover with severity-based priority:

| Severity | Pushover Priority | Emoji |
|---|---|---|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | |
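
The mapping amounts to a small lookup. A sketch of it in Python follows; the real routing lives in alertmanager.yml.j2, and the fallback to priority 0 for unknown severities is an assumption made here, not something the template is known to do:

```python
# Severity label -> Pushover priority, as listed in the routing table.
PUSHOVER_PRIORITY = {"critical": 2, "warning": 1, "info": 0}


def pushover_priority(severity: str) -> int:
    """Return the Pushover priority for an alert severity label."""
    # Assumption: unrecognised severities are treated as normal priority.
    return PUSHOVER_PRIORITY.get(severity.lower(), 0)
```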

Grafana MCP Server

Grafana has an associated MCP (Model Context Protocol) server that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on Miranda and connects back to Grafana on Prospero via the internal network (prospero.incus:3000) using a service account token.

| Property | Value |
|---|---|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | http://miranda.incus:25530/grafana |
| Auth | Grafana service account token (vault_grafana_service_account_token) |

The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: pplg → grafana_mcp → mcpo.

For full details — deployment, configuration, available tools, troubleshooting — see Grafana MCP Server.

Access After Deployment

| Service | URL | Login |
|---|---|---|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |

Troubleshooting

Service Status

ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus

View Logs

# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f

Test Endpoints (from Prospero)

# Grafana
curl -s http://127.0.0.1:3000/api/health

# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping

# Prometheus
curl -s http://127.0.0.1:9090/-/healthy

# Loki
curl -s http://127.0.0.1:3100/ready

# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy
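
The same checks can be scripted when a single pass/fail summary is more convenient than reading five curl outputs. This is a sketch using only the standard library; it assumes a healthy service answers with HTTP 200, and the function names are illustrative:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Local health endpoints, as used in the curl commands above.
HEALTH_ENDPOINTS = {
    "grafana": "http://127.0.0.1:3000/api/health",
    "pgadmin": "http://127.0.0.1:5050/misc/ping",
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "loki": "http://127.0.0.1:3100/ready",
    "alertmanager": "http://127.0.0.1:9093/-/healthy",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code, or 0 if the service is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def check_endpoints(fetch=http_status) -> dict:
    """Return {service: True/False} where True means the endpoint returned 200."""
    return {name: fetch(url) == 200 for name, url in HEALTH_ENDPOINTS.items()}
```

Calling `check_endpoints()` on Prospero issues the real requests; passing a stub for `fetch` allows the logic to be exercised offline.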

Test External Access (from any host)

# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
curl -s https://pgadmin.ouranos.helu.ca/misc/ping
curl -s https://prometheus.ouranos.helu.ca/ping
curl -s https://loki.ouranos.helu.ca/ready
curl -s https://alertmanager.ouranos.helu.ca/-/healthy

Common Errors

vault_casdoor_prometheus_access_key is undefined

TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined

Cause: The Casdoor metrics scrape job in prometheus.yml.j2 requires access credentials.

Fix: Generate API keys for the built-in/admin Casdoor user (see Casdoor Prometheus Access Key for the full procedure), then add to vault:

cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml

# Add inside vault.yml:
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"

Certificate fetch fails

Cause: Titania not running or certbot hasn't provisioned the cert yet.

Fix: Ensure Titania is up and certbot has run:

ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml

The playbook falls back to a self-signed certificate if Titania is unavailable.

OAuth2 redirect loops

Cause: Casdoor application redirect URI doesn't match the service URL.

Fix: Verify redirect URIs match exactly:

  • Grafana: https://grafana.ouranos.helu.ca/login/generic_oauth
  • PgAdmin: https://pgadmin.ouranos.helu.ca/oauth2/redirect
  • Prometheus: https://prometheus.ouranos.helu.ca/oauth2/callback
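
Since the comparison is an exact string match, even a trailing slash counts as a mismatch. A small sketch of that check is shown here, assuming the registered URIs have been copied out of Casdoor into a dict by hand; `mismatched_redirect_uris` is a hypothetical helper:

```python
# Expected redirect URIs for the three Casdoor applications.
EXPECTED_REDIRECT_URIS = {
    "app-grafana": "https://grafana.ouranos.helu.ca/login/generic_oauth",
    "app-pgadmin": "https://pgadmin.ouranos.helu.ca/oauth2/redirect",
    "app-prometheus": "https://prometheus.ouranos.helu.ca/oauth2/callback",
}


def mismatched_redirect_uris(registered: dict) -> dict:
    """Return {app: (expected, actual)} for apps whose registered URI differs."""
    return {
        app: (expected, registered.get(app))
        for app, expected in EXPECTED_REDIRECT_URIS.items()
        if registered.get(app) != expected
    }
```

An empty dict means all three applications are registered with the expected URI; a missing application shows up with `None` as its actual value.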

Migration Notes

PPLG replaces the following standalone playbooks (kept as reference):

| Original Playbook | Replaced By |
|---|---|
| prometheus/deploy.yml | pplg/deploy.yml |
| prometheus/alertmanager_deploy.yml | pplg/deploy.yml |
| loki/deploy.yml | pplg/deploy.yml |
| grafana/deploy.yml | pplg/deploy.yml |
| pgadmin/deploy.yml | pplg/deploy.yml |

PgAdmin was previously hosted on Portia (port 25555). It now runs on Prospero via gunicorn (no Apache).