PPLG - Consolidated Observability & Admin Stack

Overview

PPLG is the consolidated observability and administration stack running on Prospero. It bundles PgAdmin, Prometheus, Loki, and Grafana with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication. TLS termination is handled by Titania's HAProxy, which routes directly to each service on Prospero.

Host: prospero.incus
Role: Observability
External Access: Via Titania HAProxy → prospero.incus (direct to service ports)

| Subdomain | Service | Auth Method |
| --- | --- | --- |
| grafana.ouranos.helu.ca | Grafana | Native Casdoor OAuth |
| pgadmin.ouranos.helu.ca | PgAdmin | Native Casdoor OAuth |
| prometheus.ouranos.helu.ca | Prometheus | OAuth2-Proxy sidecar |
| loki.ouranos.helu.ca | Loki | None (machine-to-machine) |
| alertmanager.ouranos.helu.ca | Alertmanager | None (internal) |

Architecture

┌──────────┐      ┌────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│  Prospero (PPLG)                                │
│          │      │ (Titania)  │      │                                                  │
└──────────┘      │ :443 TLS   │      │  Grafana (:3000)     — Casdoor OAuth             │
                  │ termination│      │  PgAdmin (:5050)     — Casdoor OAuth             │
┌──────────┐      └────────────┘      │  OAuth2-Proxy (:9091) → Prometheus (:9090)       │
│  Alloy   │─────────────────────────▶│  Loki (:3100)        — no auth                   │
│ (agents) │                          │  Alertmanager (:9093) — no auth                   │
└──────────┘                          └─────────────────────────────────────────────────┘

Traffic Flow

| Source | Destination | Path | Auth |
| --- | --- | --- | --- |
| Browser → Grafana | Titania :443 → Prospero :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :9091 (OAuth2-Proxy) → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | Titania :443 → Prospero :3100 | Subdomain ACL | None |
| Alloy → Prometheus | Titania :443 → Prospero :9091 → :9090 | skip_auth_routes | None |

Deployment

Prerequisites

  1. Terraform: Prospero container must have updated port mappings (terraform apply)
  2. Certbot: Wildcard cert must exist on Titania (ansible-playbook certbot/deploy.yml)
  3. Vault Secrets: All vault variables must be set (see Required Vault Secrets)
  4. Casdoor Applications: Register PgAdmin and Prometheus apps in Casdoor (see Casdoor SSO)

Playbook

cd ansible
ansible-playbook pplg/deploy.yml

Files

| File | Purpose |
| --- | --- |
| pplg/deploy.yml | Main consolidated deployment playbook |
| pplg/prometheus.yml.j2 | Prometheus scrape configuration |
| pplg/alert_rules.yml.j2 | Prometheus alerting rules |
| pplg/alertmanager.yml.j2 | Alertmanager routing and Pushover notifications |
| pplg/config.yml.j2 | Loki server configuration |
| pplg/grafana.ini.j2 | Grafana main config with Casdoor OAuth |
| pplg/datasource.yml.j2 | Grafana provisioned datasources |
| pplg/users.yml.j2 | Grafana provisioned users |
| pplg/config_local.py.j2 | PgAdmin config with Casdoor OAuth |
| pplg/pgadmin.service.j2 | PgAdmin gunicorn systemd unit |
| pplg/oauth2-proxy-prometheus.cfg.j2 | OAuth2-Proxy config for Prometheus UI |
| pplg/oauth2-proxy-prometheus.service.j2 | OAuth2-Proxy systemd unit |

Deployment Steps

  1. APT Repositories: Add Grafana and PgAdmin repos
  2. Install Packages: prometheus, loki, grafana, pgadmin4-web
  3. Prometheus: Config, alert rules, systemd override for remote write receiver
  4. Alertmanager: Install, config with Pushover integration
  5. Loki: Create user/dirs, template config
  6. Grafana: Provisioning (datasources, users, dashboards), OAuth config
  7. PgAdmin: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
  8. OAuth2-Proxy: Download binary (v7.6.0), config for Prometheus sidecar

Deployment Order

PPLG must be deployed before services that push metrics/logs:

apt_update → alloy → node_exporter → pplg → postgresql → ...

This order is enforced in site.yml.

Required Vault Secrets

Add to ansible/inventory/group_vars/all/vault.yml:

⚠️ All vault variables below must be set before running the playbook. Missing variables will cause template failures like:

TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined

Prometheus Scrape Credentials

These are used in prometheus.yml.j2 to scrape metrics from Casdoor and Gitea.

1. Casdoor Prometheus Access Key

vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"

2. Casdoor Prometheus Access Secret

vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"

Requirements (both):

  • Source: API key pair from the built-in/admin Casdoor user
  • Used by: prometheus.yml.j2 Casdoor scrape job (accessKey / accessSecret query params)
  • How to obtain: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
    # 1. Login to get session cookie
    curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
      -H "Content-Type: application/json" \
      -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
    
    # 2. Generate API keys for built-in/admin
    curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
      -H "Content-Type: application/json" \
      -d '{"owner":"built-in","name":"admin"}'
    
    # 3. Retrieve the generated keys
    curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
      python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')" 
    
    # 4. Cleanup
    rm /tmp/casdoor-cookie.txt
    

⚠️ The built-in/admin user is used (not a heluca user) because Casdoor's /api/metrics endpoint requires an admin user and serves global platform metrics.
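To make the role of the two credentials concrete, the sketch below assembles the kind of URL the Casdoor scrape job would end up requesting, with accessKey and accessSecret passed as query parameters. The function name is hypothetical, the target is taken from the scrape targets listed later in this document, and the http scheme is an assumption rather than something this document states.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def casdoor_metrics_url(target: str, access_key: str, access_secret: str,
                        scheme: str = "http") -> str:
    """Build a Casdoor /api/metrics URL carrying the key pair as query params.

    The scheme default is an assumption; adjust it to match the real scrape job.
    """
    query = urlencode({"accessKey": access_key, "accessSecret": access_secret})
    return f"{scheme}://{target}/api/metrics?{query}"


url = casdoor_metrics_url(
    "titania.incus:22081",
    "YourCasdoorAccessKey",
    "YourCasdoorAccessSecret",
)

# Round-trip the query string to confirm both parameters are present.
params = parse_qs(urlsplit(url).query)
print(sorted(params))
```

Because the secret travels in the query string, it will appear in any access log that records full request URLs, which is worth keeping in mind when sharing logs.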

3. Gitea Metrics Token

vault_gitea_metrics_token: "YourGiteaMetricsToken"

Requirements:

  • Length: 32+ characters
  • Source: Must match the token configured in Gitea's app.ini
  • Generation: openssl rand -hex 32
  • Used by: prometheus.yml.j2 Gitea scrape job (Bearer token auth)
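Where openssl is not available, a token of the same shape can be produced with Python's standard library. This is a minimal sketch: the function name is hypothetical, and the header dictionary only illustrates the standard Bearer form, not how the scrape job is actually templated.

```python
import secrets


def generate_metrics_token(num_bytes: int = 32) -> str:
    """Return a random hex token, the same shape as `openssl rand -hex 32`."""
    return secrets.token_hex(num_bytes)


token = generate_metrics_token()

# 32 random bytes encode to 64 hex characters, which meets the 32+ rule.
print(len(token))

# Illustrative Bearer header built from the token.
auth_header = {"Authorization": f"Bearer {token}"}
```

The same value has to be placed in both the vault and Gitea's app.ini, so generate it once and copy it to both places.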

Grafana Credentials

4. Grafana Admin User

vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"

5. Grafana Viewer User

vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"

6. Grafana OAuth (Casdoor SSO)

vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"

Requirements:

  • Source: Must match the Casdoor application app-grafana
  • Redirect URI: https://grafana.ouranos.helu.ca/login/generic_oauth

PgAdmin

7. PgAdmin Setup

This step is not automated by the playbook. Run the PgAdmin setup manually on Prospero:

sudo -u pgadmin /usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db

Requirements:

  • Purpose: Initial local admin account (fallback when OAuth is unavailable)

8. PgAdmin OAuth (Casdoor SSO)

vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"

Requirements:

  • Source: Must match the Casdoor application app-pgadmin
  • Redirect URI: https://pgadmin.ouranos.helu.ca/oauth2/redirect

Prometheus OAuth2-Proxy

9. Prometheus OAuth2-Proxy (Casdoor SSO)

vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"

Requirements:

  • Client ID/Secret must match the Casdoor application app-prometheus
  • Redirect URI: https://prometheus.ouranos.helu.ca/oauth2/callback
  • Cookie secret generation:
    python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
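OAuth2-Proxy's documentation requires the cookie secret to be 16, 24, or 32 bytes so that it can be used as an AES key. The sketch below generates a secret the same way as the one-liner above and checks the decoded length before it goes into the vault. Both function names are hypothetical.

```python
import base64
import secrets


def generate_cookie_secret(num_bytes: int = 32) -> str:
    """Generate a URL-safe base64 secret from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


def decoded_length(secret: str) -> int:
    """Return the byte length of a URL-safe base64 string.

    token_urlsafe() strips '=' padding, so restore it before decoding.
    """
    padded = secret + "=" * (-len(secret) % 4)
    return len(base64.urlsafe_b64decode(padded))


secret = generate_cookie_secret()
if decoded_length(secret) not in (16, 24, 32):
    raise ValueError("cookie secret must decode to 16, 24, or 32 bytes")
print(decoded_length(secret))
```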
    

Alertmanager (Pushover)

10. Pushover Notification Credentials

vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"

Requirements:

  • Source: pushover.net account
  • User Key: Found on Pushover dashboard
  • API Token: Create an application in Pushover

Quick Reference

| Vault Variable | Used By | Source |
| --- | --- | --- |
| vault_casdoor_prometheus_access_key | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_casdoor_prometheus_access_secret | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_gitea_metrics_token | prometheus.yml.j2 | Gitea app.ini |
| vault_grafana_admin_name | users.yml.j2 | Choose any |
| vault_grafana_admin_login | users.yml.j2 | Choose any |
| vault_grafana_admin_password | users.yml.j2 | Choose any |
| vault_grafana_viewer_name | users.yml.j2 | Choose any |
| vault_grafana_viewer_login | users.yml.j2 | Choose any |
| vault_grafana_viewer_password | users.yml.j2 | Choose any |
| vault_grafana_oauth_client_id | grafana.ini.j2 | Casdoor app |
| vault_grafana_oauth_client_secret | grafana.ini.j2 | Casdoor app |
| vault_pgadmin_email | config_local.py.j2 | Choose any |
| vault_pgadmin_password | config_local.py.j2 | Choose any |
| vault_pgadmin_oauth_client_id | config_local.py.j2 | Casdoor app |
| vault_pgadmin_oauth_client_secret | config_local.py.j2 | Casdoor app |
| vault_prometheus_oauth2_client_id | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_client_secret | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_cookie_secret | oauth2-proxy-prometheus.cfg.j2 | Generate |
| vault_pushover_user_key | alertmanager.yml.j2 | Pushover account |
| vault_pushover_api_token | alertmanager.yml.j2 | Pushover account |
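Since a single missing variable aborts templating partway through the play, a pre-flight check can save a failed run. The following is a minimal sketch, not part of the playbook: the function name is hypothetical, and it assumes the decrypted vault has already been loaded into a dictionary by the caller.

```python
# Variable names copied from the quick reference above.
REQUIRED_VAULT_VARS = [
    "vault_casdoor_prometheus_access_key",
    "vault_casdoor_prometheus_access_secret",
    "vault_gitea_metrics_token",
    "vault_grafana_admin_name",
    "vault_grafana_admin_login",
    "vault_grafana_admin_password",
    "vault_grafana_viewer_name",
    "vault_grafana_viewer_login",
    "vault_grafana_viewer_password",
    "vault_grafana_oauth_client_id",
    "vault_grafana_oauth_client_secret",
    "vault_pgadmin_email",
    "vault_pgadmin_password",
    "vault_pgadmin_oauth_client_id",
    "vault_pgadmin_oauth_client_secret",
    "vault_prometheus_oauth2_client_id",
    "vault_prometheus_oauth2_client_secret",
    "vault_prometheus_oauth2_cookie_secret",
    "vault_pushover_user_key",
    "vault_pushover_api_token",
]


def missing_vault_vars(defined: dict) -> list:
    """Return required variables that are absent or empty in `defined`."""
    return [name for name in REQUIRED_VAULT_VARS if not defined.get(name)]


# Example: a vault that only defines the Pushover credentials.
partial_vault = {
    "vault_pushover_user_key": "YourPushoverUserKey",
    "vault_pushover_api_token": "YourPushoverAPIToken",
}
for name in missing_vault_vars(partial_vault):
    print(f"missing: {name}")
```

Loading the dictionary is left out on purpose: one option is to parse the output of ansible-vault view with a YAML library, which needs a third-party package.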

Casdoor SSO

Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.

Applications to Register

Register in Casdoor Admin UI (https://id.ouranos.helu.ca) or add to ansible/casdoor/init_data.json.j2:

| Application | Client ID | Redirect URI | Grant Types |
| --- | --- | --- | --- |
| app-grafana | vault_grafana_oauth_client_id | https://grafana.ouranos.helu.ca/login/generic_oauth | authorization_code, refresh_token |
| app-pgadmin | vault_pgadmin_oauth_client_id | https://pgadmin.ouranos.helu.ca/oauth2/redirect | authorization_code, refresh_token |
| app-prometheus | vault_prometheus_oauth2_client_id | https://prometheus.ouranos.helu.ca/oauth2/callback | authorization_code, refresh_token |

URL Strategy

| URL Type | Address | Used By |
| --- | --- | --- |
| Auth URL | https://id.ouranos.helu.ca/login/oauth/authorize | User's browser (external) |
| Token URL | https://id.ouranos.helu.ca/api/login/oauth/access_token | Server-to-server |
| Userinfo URL | https://id.ouranos.helu.ca/api/userinfo | Server-to-server |
| OIDC Discovery | https://id.ouranos.helu.ca/.well-known/openid-configuration | OAuth2-Proxy |

Auth Methods per Service

| Service | Auth Method | Details |
| --- | --- | --- |
| Grafana | Native [auth.generic_oauth] | Built-in OAuth support in grafana.ini |
| PgAdmin | Native OAUTH2_CONFIG | Built-in OAuth support in config_local.py |
| Prometheus | OAuth2-Proxy sidecar | Binary on :9091 proxying to :9090 |
| Loki | None | Machine-to-machine (Alloy agents push logs) |
| Alertmanager | None | Internal only |

OAuth2-Proxy skip_auth_routes

The Prometheus write API (/api/v1/write) and health check (/ping) are accessed by Alloy agents for machine-to-machine metric pushes. OAuth2-Proxy's skip_auth_routes config bypasses authentication for these paths:

skip_auth_routes = [
    "^/ping$",
    "^/api/v1/write$"
]

This allows https://prometheus.ouranos.helu.ca/api/v1/write to reach Prometheus without OAuth, while all other Prometheus traffic requires Casdoor SSO authentication.
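The effect of the two patterns can be checked offline. The sketch below applies them to a few request paths using Python's re module. OAuth2-Proxy itself is written in Go and uses Go's regexp package, so this is only an approximation, though for fully anchored literal patterns like these the two engines should agree. The function name is hypothetical.

```python
import re

# Patterns copied from the skip_auth_routes setting above.
SKIP_AUTH_ROUTES = [r"^/ping$", r"^/api/v1/write$"]


def bypasses_auth(path: str) -> bool:
    """Return True if the request path matches any skip_auth_routes pattern."""
    return any(re.search(pattern, path) for pattern in SKIP_AUTH_ROUTES)


for path in ["/ping", "/api/v1/write", "/api/v1/query", "/graph"]:
    status = "no auth" if bypasses_auth(path) else "Casdoor SSO"
    print(f"{path}: {status}")
```

The anchors matter: without the trailing $, a path such as /api/v1/write/anything would also skip authentication.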

Host Variables

File: ansible/inventory/host_vars/prospero.incus.yml

Services list:

services:
  - alloy
  - pplg

Key variable groups defined in prospero.incus.yml:

  • PPLG domain (ouranos.helu.ca)
  • Grafana (datasources, users, OAuth config)
  • Prometheus (scrape targets, OAuth2-Proxy sidecar config)
  • Alertmanager (Pushover integration)
  • Loki (user, data/config directories)
  • PgAdmin (user, data/log directories, OAuth config)
  • Casdoor Metrics (access key/secret for Prometheus scraping)

Titania Backend Routing

Titania's HAProxy routes external subdomains directly to Prospero service ports:

# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
  backend_host: "prospero.incus"
  backend_port: 3000
  health_path: "/api/health"

- subdomain: "pgadmin"
  backend_host: "prospero.incus"
  backend_port: 5050
  health_path: "/misc/ping"

- subdomain: "prometheus"
  backend_host: "prospero.incus"
  backend_port: 9091  # OAuth2-Proxy sidecar
  health_path: "/ping"

- subdomain: "loki"
  backend_host: "prospero.incus"
  backend_port: 3100
  health_path: "/ready"

- subdomain: "alertmanager"
  backend_host: "prospero.incus"
  backend_port: 9093
  health_path: "/-/healthy"
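The routing above amounts to a lookup from the subdomain in the Host header to a backend host and port. The sketch below models that lookup in Python purely for illustration. It is not how HAProxy is configured, and the names DOMAIN, BACKENDS, and resolve_backend are hypothetical.

```python
from typing import Optional, Tuple

DOMAIN = "ouranos.helu.ca"

# subdomain -> (backend_host, backend_port, health_path), from haproxy_backends.
BACKENDS = {
    "grafana": ("prospero.incus", 3000, "/api/health"),
    "pgadmin": ("prospero.incus", 5050, "/misc/ping"),
    "prometheus": ("prospero.incus", 9091, "/ping"),  # OAuth2-Proxy sidecar
    "loki": ("prospero.incus", 3100, "/ready"),
    "alertmanager": ("prospero.incus", 9093, "/-/healthy"),
}


def resolve_backend(host_header: str) -> Optional[Tuple[str, int, str]]:
    """Map a Host header to its backend, or None if no subdomain matches."""
    host = host_header.split(":")[0].lower()  # drop an optional :port
    suffix = "." + DOMAIN
    if not host.endswith(suffix):
        return None
    return BACKENDS.get(host[: -len(suffix)])


print(resolve_backend("prometheus.ouranos.helu.ca"))
```

Note that the prometheus entry points at the sidecar on 9091, not at Prometheus on 9090, so every external request passes through OAuth2-Proxy first.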

Monitoring

Alloy Configuration

File: ansible/alloy/prospero/config.alloy.j2

  • Journal Labels: Dedicated job labels for grafana-server, prometheus, loki, alertmanager, pgadmin, oauth2-proxy-prometheus
  • System Logs: /var/log/syslog, /var/log/auth.log → Loki
  • Metrics: Node exporter + process exporter → Prometheus remote write

Prometheus Scrape Targets

| Job | Target | Auth |
| --- | --- | --- |
| prometheus | localhost:9090 | None |
| node-exporter | All Uranian hosts :9100 | None |
| alertmanager | prospero.incus:9093 | None |
| haproxy | titania.incus:8404 | None |
| gitea | oberon.incus:22084 | Bearer token |
| casdoor | titania.incus:22081 | Access key/secret params |

Alert Rules

Groups defined in alert_rules.yml.j2:

| Group | Alerts | Scope |
| --- | --- | --- |
| node_alerts | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| puck_process_alerts | HighCPU/Memory per process, CrashLoop | puck.incus |
| puck_container_alerts | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| service_alerts | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| loki_alerts | HighLogVolume | Loki |

Alertmanager Routing

Alerts are routed to Pushover with severity-based priority:

| Severity | Pushover Priority | Emoji |
| --- | --- | --- |
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | |
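The severity mapping can be expressed as a small lookup. This sketch only restates the priorities listed above in Python. The real routing lives in alertmanager.yml.j2, and the function name here is hypothetical.

```python
# Severity label -> Pushover priority, as listed above.
PUSHOVER_PRIORITY = {
    "critical": 2,  # Emergency
    "warning": 1,   # High
    "info": 0,      # Normal
}


def pushover_priority(severity: str) -> int:
    """Return the Pushover priority for a severity label (case-insensitive).

    Raises KeyError for labels outside the three listed severities.
    """
    return PUSHOVER_PRIORITY[severity.lower()]


print(pushover_priority("Critical"))
```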

Grafana MCP Server

Grafana has an associated MCP (Model Context Protocol) server that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on Miranda and connects back to Grafana on Prospero via the internal network (prospero.incus:3000) using a service account token.

| Property | Value |
| --- | --- |
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | http://miranda.incus:25530/grafana |
| Auth | Grafana service account token (vault_grafana_service_account_token) |

The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: pplg → grafana_mcp → mcpo.

For full details — deployment, configuration, available tools, troubleshooting — see Grafana MCP Server.

Access After Deployment

| Service | URL | Login |
| --- | --- | --- |
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |

Troubleshooting

Service Status

ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus

View Logs

# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f

Test Endpoints (from Prospero)

# Grafana
curl -s http://127.0.0.1:3000/api/health

# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping

# Prometheus
curl -s http://127.0.0.1:9090/-/healthy

# Loki
curl -s http://127.0.0.1:3100/ready

# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy
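When the checks need to run together, the same endpoints can be polled from a short script. This is a minimal sketch using only the standard library: the names are hypothetical, and treating any 2xx response as healthy is an assumption about what counts as "up" for each service.

```python
import urllib.error
import urllib.request

# Local health endpoints, copied from the curl commands above.
HEALTH_ENDPOINTS = {
    "Grafana": "http://127.0.0.1:3000/api/health",
    "PgAdmin": "http://127.0.0.1:5050/misc/ping",
    "Prometheus": "http://127.0.0.1:9090/-/healthy",
    "Loki": "http://127.0.0.1:3100/ready",
    "Alertmanager": "http://127.0.0.1:9093/-/healthy",
}


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False


def check_all(checker=is_healthy) -> dict:
    """Run the checker against every endpoint and collect the results."""
    return {name: checker(url) for name, url in HEALTH_ENDPOINTS.items()}
```

Calling check_all() on Prospero returns a dictionary of service name to True or False. The checker parameter exists so the loop can be exercised without a network.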

Test External Access (from any host)

# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
curl -s https://pgadmin.ouranos.helu.ca/misc/ping
curl -s https://prometheus.ouranos.helu.ca/ping
curl -s https://loki.ouranos.helu.ca/ready
curl -s https://alertmanager.ouranos.helu.ca/-/healthy

Common Errors

vault_casdoor_prometheus_access_key is undefined

TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined

Cause: The Casdoor metrics scrape job in prometheus.yml.j2 requires access credentials.

Fix: Generate API keys for the built-in/admin Casdoor user (see Casdoor Prometheus Access Key for the full procedure), then add to vault:

cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml

Then add the following entries in the editor:

vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"

Certificate fetch fails

Cause: Titania is not running, or certbot has not yet provisioned the certificate.

Fix: Ensure Titania is up and certbot has run:

ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml

The playbook falls back to a self-signed certificate if Titania is unavailable.

OAuth2 redirect loops

Cause: Casdoor application redirect URI doesn't match the service URL.

Fix: Verify redirect URIs match exactly:

  • Grafana: https://grafana.ouranos.helu.ca/login/generic_oauth
  • PgAdmin: https://pgadmin.ouranos.helu.ca/oauth2/redirect
  • Prometheus: https://prometheus.ouranos.helu.ca/oauth2/callback
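"Exactly" here means a plain string comparison, so a trailing slash or a different scheme is enough to break the flow. The sketch below compares a configured value against the expected URIs listed above. The function name is hypothetical, and the strict comparison is a deliberately conservative reading, not a statement of how Casdoor matches URIs internally.

```python
# Expected redirect URIs, copied from the list above.
EXPECTED_REDIRECT_URIS = {
    "grafana": "https://grafana.ouranos.helu.ca/login/generic_oauth",
    "pgadmin": "https://pgadmin.ouranos.helu.ca/oauth2/redirect",
    "prometheus": "https://prometheus.ouranos.helu.ca/oauth2/callback",
}


def redirect_uri_matches(service: str, configured: str) -> bool:
    """Return True only if the configured URI equals the expected one exactly."""
    return configured == EXPECTED_REDIRECT_URIS[service]


# A trailing slash is a different string, so this prints False.
print(redirect_uri_matches(
    "prometheus", "https://prometheus.ouranos.helu.ca/oauth2/callback/"))
```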

Migration Notes

PPLG replaces the following standalone playbooks (kept as reference):

| Original Playbook | Replaced By |
| --- | --- |
| prometheus/deploy.yml | pplg/deploy.yml |
| prometheus/alertmanager_deploy.yml | pplg/deploy.yml |
| loki/deploy.yml | pplg/deploy.yml |
| grafana/deploy.yml | pplg/deploy.yml |
| pgadmin/deploy.yml | pplg/deploy.yml |

PgAdmin was previously hosted on Portia (port 25555). It now runs on Prospero via gunicorn (no Apache).