Robert Helewka b4d60f2f38 docs: rewrite README with structured overview and quick start guide
Replaces the minimal project description with a comprehensive README
including a component overview table, quick start instructions, common
Ansible operations, and links to detailed documentation. Aligns with
Red Panda Approval™ standards.
2026-03-03 12:49:06 +00:00

PPLG - Consolidated Observability & Admin Stack

Overview

PPLG is the consolidated observability and administration stack running on Prospero. It bundles PgAdmin, Prometheus, Loki, and Grafana behind an internal HAProxy for TLS termination, with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication.

Host: prospero.incus
Role: Observability
Incus Ports: 25510 → 443 (HTTPS), 25511 → 80 (HTTP redirect)
External Access: Via Titania HAProxy → prospero.incus:443

| Subdomain | Service | Auth Method |
|-----------|---------|-------------|
| grafana.ouranos.helu.ca | Grafana | Native Casdoor OAuth |
| pgadmin.ouranos.helu.ca | PgAdmin | Native Casdoor OAuth |
| prometheus.ouranos.helu.ca | Prometheus | OAuth2-Proxy sidecar |
| loki.ouranos.helu.ca | Loki | None (machine-to-machine) |
| alertmanager.ouranos.helu.ca | Alertmanager | None (internal) |

Architecture

┌──────────┐      ┌────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│  HAProxy   │─────▶│  Prospero (PPLG)                                │
│          │      │ (Titania)  │      │                                                  │
└──────────┘      │ :443 → :443       │  ┌──────────────────────────────────────────┐    │
                  └────────────┘      │  │  HAProxy (systemd, :443/:80)             │    │
                                      │  │  TLS termination + subdomain routing     │    │
┌──────────┐                          │  └───┬──────┬──────┬──────┬──────┬──────────┘    │
│  Alloy   │──push──────────────────────────▶│      │      │      │                      │
│ (agents) │  loki.ouranos.helu.ca    │      │      │      │      │                      │
│          │  prometheus.ouranos.helu.ca      │      │      │      │                      │
└──────────┘                          │      ▼      ▼      ▼      ▼      ▼               │
                                      │  Grafana PgAdmin OAuth2  Loki  Alertmanager      │
                                      │  :3000   :5050  Proxy   :3100  :9093             │
                                      │                 :9091                             │
                                      │                   │                               │
                                      │                   ▼                               │
                                      │              Prometheus                           │
                                      │              :9090                                │
                                      └─────────────────────────────────────────────────┘

Traffic Flow

| Source | Destination | Path | Auth |
|--------|-------------|------|------|
| Browser → Grafana | Titania :443 → Prospero :443 → HAProxy → :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :443 → HAProxy → :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :443 → HAProxy → OAuth2-Proxy :9091 → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | https://loki.ouranos.helu.ca → HAProxy :443 → :3100 | Subdomain ACL | None |
| Alloy → Prometheus | https://prometheus.ouranos.helu.ca/api/v1/write → HAProxy :443 → :9090 | skip_auth_route | None |

Deployment

Prerequisites

  1. Terraform: Prospero container must have updated port mappings (terraform apply)
  2. Certbot: Wildcard cert must exist on Titania (ansible-playbook certbot/deploy.yml)
  3. Vault Secrets: All vault variables must be set (see Required Vault Secrets)
  4. Casdoor Applications: Register PgAdmin and Prometheus apps in Casdoor (see Casdoor SSO)

Playbook

cd ansible
ansible-playbook pplg/deploy.yml

Files

| File | Purpose |
|------|---------|
| pplg/deploy.yml | Main consolidated deployment playbook |
| pplg/pplg-haproxy.cfg.j2 | HAProxy TLS termination config (5 backends) |
| pplg/prometheus.yml.j2 | Prometheus scrape configuration |
| pplg/alert_rules.yml.j2 | Prometheus alerting rules |
| pplg/alertmanager.yml.j2 | Alertmanager routing and Pushover notifications |
| pplg/config.yml.j2 | Loki server configuration |
| pplg/grafana.ini.j2 | Grafana main config with Casdoor OAuth |
| pplg/datasource.yml.j2 | Grafana provisioned datasources |
| pplg/users.yml.j2 | Grafana provisioned users |
| pplg/config_local.py.j2 | PgAdmin config with Casdoor OAuth |
| pplg/pgadmin.service.j2 | PgAdmin gunicorn systemd unit |
| pplg/oauth2-proxy-prometheus.cfg.j2 | OAuth2-Proxy config for Prometheus UI |
| pplg/oauth2-proxy-prometheus.service.j2 | OAuth2-Proxy systemd unit |

Deployment Steps

  1. APT Repositories: Add Grafana and PgAdmin repos
  2. Install Packages: haproxy, prometheus, loki, grafana, pgadmin4-web, gunicorn
  3. Prometheus: Config, alert rules, systemd override for remote write receiver
  4. Alertmanager: Install, config with Pushover integration
  5. Loki: Create user/dirs, template config
  6. Grafana: Provisioning (datasources, users, dashboards), OAuth config
  7. PgAdmin: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
  8. OAuth2-Proxy: Download binary (v7.6.0), config for Prometheus sidecar
  9. SSL Certificate: Fetch Let's Encrypt wildcard cert from Titania (self-signed fallback)
  10. HAProxy: Template config, enable and start systemd service

Deployment Order

PPLG must be deployed before services that push metrics/logs:

apt_update → alloy → node_exporter → pplg → postgresql → ...

This order is enforced in site.yml.

Required Vault Secrets

Add to ansible/inventory/group_vars/all/vault.yml:

⚠️ All vault variables below must be set before running the playbook. Missing variables will cause template failures like:

TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined

Prometheus Scrape Credentials

These are used in prometheus.yml.j2 to scrape metrics from Casdoor and Gitea.

1. Casdoor Prometheus Access Key

vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"

2. Casdoor Prometheus Access Secret

vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"

Requirements (both):

  • Source: API key pair from the built-in/admin Casdoor user
  • Used by: prometheus.yml.j2 Casdoor scrape job (accessKey / accessSecret query params)
  • How to obtain: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
    # 1. Login to get session cookie
    curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
      -H "Content-Type: application/json" \
      -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
    
    # 2. Generate API keys for built-in/admin
    curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
      -H "Content-Type: application/json" \
      -d '{"owner":"built-in","name":"admin"}'
    
    # 3. Retrieve the generated keys
    curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
      python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')" 
    
    # 4. Cleanup
    rm /tmp/casdoor-cookie.txt
    

⚠️ The built-in/admin user is used (not a heluca user) because Casdoor's /api/metrics endpoint requires an admin user and serves global platform metrics.

3. Gitea Metrics Token

vault_gitea_metrics_token: "YourGiteaMetricsToken"

Requirements:

  • Length: 32+ characters
  • Source: Must match the token configured in Gitea's app.ini
  • Generation: openssl rand -hex 32
  • Used by: prometheus.yml.j2 Gitea scrape job (Bearer token auth)
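
Where openssl is not available, the same kind of token can be produced with Python's standard library. This is a minimal sketch; the function name is illustrative and not part of the playbook.

```python
import secrets


def generate_metrics_token(num_bytes: int = 32) -> str:
    """Return a random hex token.

    32 random bytes encode to 64 hex characters, which satisfies the
    32+ character requirement and mirrors `openssl rand -hex 32`.
    """
    return secrets.token_hex(num_bytes)
```

The value must still be set both in the vault and in Gitea's app.ini, since the two have to match.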

Grafana Credentials

4. Grafana Admin User

vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"

5. Grafana Viewer User

vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"

6. Grafana OAuth (Casdoor SSO)

vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"

Requirements:

  • Source: Must match the Casdoor application app-grafana
  • Redirect URI: https://grafana.ouranos.helu.ca/login/generic_oauth

PgAdmin

7. PgAdmin Setup

Run the initial PgAdmin database setup manually on Prospero:

/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db

Requirements:

  • Purpose: Initial local admin account (fallback when OAuth is unavailable)

8. PgAdmin OAuth (Casdoor SSO)

vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"

Requirements:

  • Source: Must match the Casdoor application app-pgadmin
  • Redirect URI: https://pgadmin.ouranos.helu.ca/oauth2/redirect

Prometheus OAuth2-Proxy

9. Prometheus OAuth2-Proxy (Casdoor SSO)

vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"

Requirements:

  • Client ID/Secret must match the Casdoor application app-prometheus
  • Redirect URI: https://prometheus.ouranos.helu.ca/oauth2/callback
  • Cookie secret generation:
    python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
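
A generated cookie secret can be sanity-checked before it goes into the vault. The sketch below assumes OAuth2-Proxy's documented requirement that the cookie secret be 16, 24, or 32 bytes, either raw or base64url-encoded. The helper names are hypothetical, and the decoding logic is an approximation of that rule rather than OAuth2-Proxy's own code.

```python
import base64
import binascii

# AES key sizes that OAuth2-Proxy is documented to accept (assumption).
VALID_AES_KEY_LENGTHS = (16, 24, 32)


def cookie_secret_length(secret: str) -> int:
    """Return the effective byte length of a cookie secret.

    If the secret decodes as base64url to a valid AES key length, that
    decoded length is used; otherwise the raw string length is used.
    """
    stripped = secret.rstrip("=")
    padded = stripped + "=" * (-len(stripped) % 4)
    try:
        decoded = base64.urlsafe_b64decode(padded)
    except (binascii.Error, ValueError):
        decoded = b""
    if len(decoded) in VALID_AES_KEY_LENGTHS:
        return len(decoded)
    return len(secret.encode("utf-8"))


def is_valid_cookie_secret(secret: str) -> bool:
    """True if the secret resolves to 16, 24, or 32 bytes."""
    return cookie_secret_length(secret) in VALID_AES_KEY_LENGTHS
```

A value from `secrets.token_urlsafe(32)` decodes to 32 bytes and passes this check.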
    

Alertmanager (Pushover)

10. Pushover Notification Credentials

vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"

Requirements:

  • Source: pushover.net account
  • User Key: Found on Pushover dashboard
  • API Token: Create an application in Pushover

Quick Reference

| Vault Variable | Used By | Source |
|----------------|---------|--------|
| vault_casdoor_prometheus_access_key | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_casdoor_prometheus_access_secret | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_gitea_metrics_token | prometheus.yml.j2 | Gitea app.ini |
| vault_grafana_admin_name | users.yml.j2 | Choose any |
| vault_grafana_admin_login | users.yml.j2 | Choose any |
| vault_grafana_admin_password | users.yml.j2 | Choose any |
| vault_grafana_viewer_name | users.yml.j2 | Choose any |
| vault_grafana_viewer_login | users.yml.j2 | Choose any |
| vault_grafana_viewer_password | users.yml.j2 | Choose any |
| vault_grafana_oauth_client_id | grafana.ini.j2 | Casdoor app |
| vault_grafana_oauth_client_secret | grafana.ini.j2 | Casdoor app |
| vault_pgadmin_email | config_local.py.j2 | Choose any |
| vault_pgadmin_password | config_local.py.j2 | Choose any |
| vault_pgadmin_oauth_client_id | config_local.py.j2 | Casdoor app |
| vault_pgadmin_oauth_client_secret | config_local.py.j2 | Casdoor app |
| vault_prometheus_oauth2_client_id | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_client_secret | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_cookie_secret | oauth2-proxy-prometheus.cfg.j2 | Generate |
| vault_pushover_user_key | alertmanager.yml.j2 | Pushover account |
| vault_pushover_api_token | alertmanager.yml.j2 | Pushover account |
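
Because a missing variable only surfaces as a template failure partway through the playbook, a preflight check can catch it earlier. The following is a minimal sketch. It takes the already-decrypted vault contents as a plain dict (for example, the output of `ansible-vault view` loaded with a YAML parser, which is outside this sketch), and the function name is hypothetical.

```python
# Variable names taken from the Quick Reference list above.
REQUIRED_VAULT_VARS = [
    "vault_casdoor_prometheus_access_key",
    "vault_casdoor_prometheus_access_secret",
    "vault_gitea_metrics_token",
    "vault_grafana_admin_name",
    "vault_grafana_admin_login",
    "vault_grafana_admin_password",
    "vault_grafana_viewer_name",
    "vault_grafana_viewer_login",
    "vault_grafana_viewer_password",
    "vault_grafana_oauth_client_id",
    "vault_grafana_oauth_client_secret",
    "vault_pgadmin_email",
    "vault_pgadmin_password",
    "vault_pgadmin_oauth_client_id",
    "vault_pgadmin_oauth_client_secret",
    "vault_prometheus_oauth2_client_id",
    "vault_prometheus_oauth2_client_secret",
    "vault_prometheus_oauth2_cookie_secret",
    "vault_pushover_user_key",
    "vault_pushover_api_token",
]


def missing_vault_vars(defined: dict) -> list:
    """Return required variable names that are absent or empty in `defined`."""
    return [name for name in REQUIRED_VAULT_VARS if not defined.get(name)]
```

An empty result means every variable in the list has a non-empty value. It does not verify that the values are correct.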

Casdoor SSO

Three Casdoor applications are required. The Grafana application should already exist; the PgAdmin and Prometheus applications need to be created.

Applications to Register

Register in Casdoor Admin UI (https://id.ouranos.helu.ca) or add to ansible/casdoor/init_data.json.j2:

| Application | Client ID | Redirect URI | Grant Types |
|-------------|-----------|--------------|-------------|
| app-grafana | vault_grafana_oauth_client_id | https://grafana.ouranos.helu.ca/login/generic_oauth | authorization_code, refresh_token |
| app-pgadmin | vault_pgadmin_oauth_client_id | https://pgadmin.ouranos.helu.ca/oauth2/redirect | authorization_code, refresh_token |
| app-prometheus | vault_prometheus_oauth2_client_id | https://prometheus.ouranos.helu.ca/oauth2/callback | authorization_code, refresh_token |

URL Strategy

| URL Type | Address | Used By |
|----------|---------|---------|
| Auth URL | https://id.ouranos.helu.ca/login/oauth/authorize | User's browser (external) |
| Token URL | https://id.ouranos.helu.ca/api/login/oauth/access_token | Server-to-server |
| Userinfo URL | https://id.ouranos.helu.ca/api/userinfo | Server-to-server |
| OIDC Discovery | https://id.ouranos.helu.ca/.well-known/openid-configuration | OAuth2-Proxy |

Auth Methods per Service

| Service | Auth Method | Details |
|---------|-------------|---------|
| Grafana | Native [auth.generic_oauth] | Built-in OAuth support in grafana.ini |
| PgAdmin | Native OAUTH2_CONFIG | Built-in OAuth support in config_local.py |
| Prometheus | OAuth2-Proxy sidecar | Binary on :9091 proxying to :9090 |
| Loki | None | Machine-to-machine (Alloy agents push logs) |
| Alertmanager | None | Internal only |

HAProxy Configuration

Backends

| Backend | Upstream | Health Check | Auth |
|---------|----------|--------------|------|
| backend_grafana | 127.0.0.1:3000 | GET /api/health | Grafana OAuth |
| backend_pgadmin | 127.0.0.1:5050 | GET /misc/ping | PgAdmin OAuth |
| backend_prometheus | 127.0.0.1:9091 (OAuth2-Proxy) | GET /ping | OAuth2-Proxy |
| backend_prometheus_direct | 127.0.0.1:9090 | | None (write API) |
| backend_loki | 127.0.0.1:3100 | GET /ready | None |
| backend_alertmanager | 127.0.0.1:9093 | GET /-/healthy | None |

skip_auth_route Pattern

The Prometheus write API (/api/v1/write) is accessed by Alloy agents for machine-to-machine metric pushes. HAProxy uses an ACL to bypass OAuth2-Proxy:

acl is_prometheus_write path_beg /api/v1/write
use_backend backend_prometheus_direct if host_prometheus is_prometheus_write

This routes https://prometheus.ouranos.helu.ca/api/v1/write directly to Prometheus on :9090, while all other Prometheus traffic goes through OAuth2-Proxy on :9091.
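
The routing decision can be modelled in a few lines of Python, which helps when reasoning about which backend a given request reaches. This is an illustrative sketch, not the HAProxy configuration itself. It assumes the host ACLs (such as host_prometheus, whose definition is not shown here) match the Host header exactly against each subdomain, and it returns None for unknown hosts because the default backend is not documented in this section.

```python
# Subdomain-to-backend mapping, using the backend names from the table above.
SUBDOMAIN_BACKENDS = {
    "grafana.ouranos.helu.ca": "backend_grafana",
    "pgadmin.ouranos.helu.ca": "backend_pgadmin",
    "prometheus.ouranos.helu.ca": "backend_prometheus",
    "loki.ouranos.helu.ca": "backend_loki",
    "alertmanager.ouranos.helu.ca": "backend_alertmanager",
}

PROMETHEUS_HOST = "prometheus.ouranos.helu.ca"
WRITE_PATH_PREFIX = "/api/v1/write"


def select_backend(host: str, path: str):
    """Return the backend name for a request, or None for an unknown host."""
    hostname = host.lower().split(":")[0]  # drop an optional :port suffix
    # Equivalent of `path_beg /api/v1/write` combined with the host ACL:
    # remote-write traffic bypasses OAuth2-Proxy.
    if hostname == PROMETHEUS_HOST and path.startswith(WRITE_PATH_PREFIX):
        return "backend_prometheus_direct"
    return SUBDOMAIN_BACKENDS.get(hostname)
```

Because the bypass is scoped to the Prometheus host, a request for /api/v1/write on any other subdomain still follows the normal subdomain routing.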

SSL Certificate

  • Primary: Let's Encrypt wildcard cert (*.ouranos.helu.ca) fetched from Titania
  • Fallback: Self-signed cert generated on Prospero (if Titania unavailable)
  • Path: /etc/haproxy/certs/ouranos.pem

Host Variables

File: ansible/inventory/host_vars/prospero.incus.yml

Services list:

services:
  - alloy
  - pplg

Key variable groups defined in prospero.incus.yml:

  • PPLG HAProxy (user, group, uid/gid 800, syslog port)
  • Grafana (datasources, users, OAuth config)
  • Prometheus (scrape targets, OAuth2-Proxy sidecar config)
  • Alertmanager (Pushover integration)
  • Loki (user, data/config directories)
  • PgAdmin (user, data/log directories, OAuth config)
  • Casdoor Metrics (access key/secret for Prometheus scraping)

Terraform

Prospero Port Mapping

devices = [
  {
    name = "https_internal"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25510"
      connect = "tcp:127.0.0.1:443"
    }
  },
  {
    name = "http_redirect"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25511"
      connect = "tcp:127.0.0.1:80"
    }
  }
]

Run terraform apply before deploying if port mappings changed.

Titania Backend Routing

Titania's HAProxy routes external subdomains to Prospero's HTTPS port:

# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/api/health"
  ssl_backend: true

- subdomain: "pgadmin"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/misc/ping"
  ssl_backend: true

- subdomain: "prometheus"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/ping"
  ssl_backend: true

Monitoring

Alloy Configuration

File: ansible/alloy/prospero/config.alloy.j2

  • HAProxy Syslog: loki.source.syslog on 127.0.0.1:51405 (TCP) receives syslog from the HAProxy systemd service
  • Journal Labels: Dedicated job labels for grafana-server, prometheus, loki, alertmanager, pgadmin, oauth2-proxy-prometheus
  • System Logs: /var/log/syslog, /var/log/auth.log → Loki
  • Metrics: Node exporter + process exporter → Prometheus remote write

Prometheus Scrape Targets

| Job | Target | Auth |
|-----|--------|------|
| prometheus | localhost:9090 | None |
| node-exporter | All Uranian hosts :9100 | None |
| alertmanager | prospero.incus:9093 | None |
| haproxy | titania.incus:8404 | None |
| gitea | oberon.incus:22084 | Bearer token |
| casdoor | titania.incus:22081 | Access key/secret params |

Alert Rules

Groups defined in alert_rules.yml.j2:

| Group | Alerts | Scope |
|-------|--------|-------|
| node_alerts | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| puck_process_alerts | HighCPU/Memory per process, CrashLoop | puck.incus |
| puck_container_alerts | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| service_alerts | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| loki_alerts | HighLogVolume | Loki |

Alertmanager Routing

Alerts are routed to Pushover with severity-based priority:

| Severity | Pushover Priority | Emoji |
|----------|-------------------|-------|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | |
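
The severity-to-priority mapping above can be expressed as a small lookup, for example when testing alert labels outside Alertmanager. This is a sketch with a hypothetical function name. Rejecting unknown severities is an assumption made here, since the routing for other labels is not described in this document.

```python
# Priority values taken from the table above.
PUSHOVER_PRIORITY = {
    "critical": 2,  # Emergency
    "warning": 1,   # High
    "info": 0,      # Normal
}


def pushover_priority(severity: str) -> int:
    """Return the Pushover priority for an alert severity label."""
    key = severity.strip().lower()
    if key not in PUSHOVER_PRIORITY:
        raise ValueError(f"unknown severity: {severity!r}")
    return PUSHOVER_PRIORITY[key]
```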

Grafana MCP Server

Grafana has an associated MCP (Model Context Protocol) server that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on Miranda and connects back to Grafana on Prospero via the internal network (prospero.incus:3000) using a service account token.

| Property | Value |
|----------|-------|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | http://miranda.incus:25530/grafana |
| Auth | Grafana service account token (vault_grafana_service_account_token) |

The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: pplg → grafana_mcp → mcpo.

For full details (deployment, configuration, available tools, troubleshooting), see Grafana MCP Server.

Access After Deployment

| Service | URL | Login |
|---------|-----|-------|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |

Troubleshooting

Service Status

ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus

HAProxy Service

ssh prospero.incus
sudo systemctl status haproxy
sudo journalctl -u haproxy -f

View Logs

# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f

# HAProxy logs (shipped via syslog to Alloy → Loki)
# Query in Grafana: {job="pplg-haproxy"}

Test Endpoints (from Prospero)

# Grafana
curl -s http://127.0.0.1:3000/api/health

# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping

# Prometheus
curl -s http://127.0.0.1:9090/-/healthy

# Loki
curl -s http://127.0.0.1:3100/ready

# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy

# HAProxy stats
curl -s http://127.0.0.1:8404/metrics | head
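
The per-service curl checks can be combined into one pass. The sketch below uses only the Python standard library and the health endpoints listed above. The function names are hypothetical, and treating only HTTP 200 as healthy is an assumption. The HAProxy metrics endpoint is left out because it is not a health check.

```python
import urllib.error
import urllib.request

# Local health endpoints, as used in the curl commands above.
HEALTH_ENDPOINTS = {
    "grafana": "http://127.0.0.1:3000/api/health",
    "pgadmin": "http://127.0.0.1:5050/misc/ping",
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "loki": "http://127.0.0.1:3100/ready",
    "alertmanager": "http://127.0.0.1:9093/-/healthy",
}


def http_status(url: str, timeout: float = 3.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def check_services(fetch=http_status) -> dict:
    """Map each service name to True if its endpoint returned HTTP 200."""
    return {name: fetch(url) == 200 for name, url in HEALTH_ENDPOINTS.items()}
```

Run on Prospero, `check_services()` returns a dict such as service name to True/False. The `fetch` parameter exists so the logic can be exercised without network access.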

Test TLS (from any host)

# Direct to Prospero container
curl -sk https://prospero.incus/api/health
# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health

Common Errors

vault_casdoor_prometheus_access_key is undefined

TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined

Cause: The Casdoor metrics scrape job in prometheus.yml.j2 requires access credentials.

Fix: Generate API keys for the built-in/admin Casdoor user (see Casdoor Prometheus Access Key for the full procedure), then add to vault:

cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml

Add both entries in the editor:

vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"

Certificate fetch fails

Cause: Titania not running or certbot hasn't provisioned the cert yet.

Fix: Ensure Titania is up and certbot has run:

ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml

The playbook falls back to a self-signed certificate if Titania is unavailable.

OAuth2 redirect loops

Cause: Casdoor application redirect URI doesn't match the service URL.

Fix: Verify redirect URIs match exactly:

  • Grafana: https://grafana.ouranos.helu.ca/login/generic_oauth
  • PgAdmin: https://pgadmin.ouranos.helu.ca/oauth2/redirect
  • Prometheus: https://prometheus.ouranos.helu.ca/oauth2/callback
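
Since the match must be exact, a trailing slash or a different scheme is enough to cause a loop. A minimal sketch of the comparison follows. It takes the redirect URIs registered in Casdoor as a plain dict that you fill in yourself (how they are exported is not covered here), and the function name is hypothetical.

```python
# Expected redirect URIs, as listed above, keyed by Casdoor application name.
EXPECTED_REDIRECT_URIS = {
    "app-grafana": "https://grafana.ouranos.helu.ca/login/generic_oauth",
    "app-pgadmin": "https://pgadmin.ouranos.helu.ca/oauth2/redirect",
    "app-prometheus": "https://prometheus.ouranos.helu.ca/oauth2/callback",
}


def mismatched_redirect_uris(configured: dict) -> dict:
    """Return apps whose registered URIs lack the exact expected value.

    `configured` maps an application name to the list of redirect URIs
    registered for it. The result maps each problem app to what was found.
    """
    problems = {}
    for app, expected in EXPECTED_REDIRECT_URIS.items():
        registered = list(configured.get(app, []))
        if expected not in registered:  # exact string comparison
            problems[app] = registered
    return problems
```

An empty result means all three applications carry the expected URI. Any entry in the result shows the URIs that were registered instead.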

Migration Notes

PPLG replaces the following standalone playbooks (kept as reference):

| Original Playbook | Replaced By |
|-------------------|-------------|
| prometheus/deploy.yml | pplg/deploy.yml |
| prometheus/alertmanager_deploy.yml | pplg/deploy.yml |
| loki/deploy.yml | pplg/deploy.yml |
| grafana/deploy.yml | pplg/deploy.yml |
| pgadmin/deploy.yml | pplg/deploy.yml |

PgAdmin was previously hosted on Portia (port 25555). It now runs on Prospero via gunicorn (no Apache).