PPLG - Consolidated Observability & Admin Stack
Overview
PPLG is the consolidated observability and administration stack running on Prospero. It bundles PgAdmin, Prometheus, Loki, and Grafana behind an internal HAProxy for TLS termination, with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication.
Host: prospero.incus
Role: Observability
Incus Ports: 25510 → 443 (HTTPS), 25511 → 80 (HTTP redirect)
External Access: Via Titania HAProxy → prospero.incus:443
| Subdomain | Service | Auth Method |
|---|---|---|
| grafana.ouranos.helu.ca | Grafana | Native Casdoor OAuth |
| pgadmin.ouranos.helu.ca | PgAdmin | Native Casdoor OAuth |
| prometheus.ouranos.helu.ca | Prometheus | OAuth2-Proxy sidecar |
| loki.ouranos.helu.ca | Loki | None (machine-to-machine) |
| alertmanager.ouranos.helu.ca | Alertmanager | None (internal) |
Architecture
┌──────────┐      ┌─────────────┐      ┌─────────────────────────────────────────────────┐
│  Client  │─────▶│   HAProxy   │─────▶│ Prospero (PPLG)                                 │
│          │      │  (Titania)  │      │                                                 │
└──────────┘      │ :443 → :443 │      │ ┌──────────────────────────────────────────┐    │
                  └─────────────┘      │ │ HAProxy (systemd, :443/:80)              │    │
                                       │ │ TLS termination + subdomain routing      │    │
┌──────────┐                           │ └───┬──────┬──────┬──────┬──────┬──────────┘    │
│  Alloy   │──push────────────────────▶│     │      │      │      │      │               │
│ (agents) │ loki.ouranos.helu.ca      │     │      │      │      │      │               │
│          │ prometheus.ouranos.helu.ca│     │      │      │      │      │               │
└──────────┘                           │     ▼      ▼      ▼      ▼      ▼               │
                                       │ Grafana PgAdmin OAuth2  Loki Alertmanager       │
                                       │  :3000   :5050  Proxy   :3100   :9093           │
                                       │                 :9091                           │
                                       │                   │                             │
                                       │                   ▼                             │
                                       │               Prometheus                        │
                                       │                 :9090                           │
                                       └─────────────────────────────────────────────────┘
Traffic Flow
| Source | Destination | Path | Auth |
|---|---|---|---|
| Browser → Grafana | Titania :443 → Prospero :443 → HAProxy → :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :443 → HAProxy → :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :443 → HAProxy → OAuth2-Proxy :9091 → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | https://loki.ouranos.helu.ca → HAProxy :443 → :3100 | Subdomain ACL | None |
| Alloy → Prometheus | https://prometheus.ouranos.helu.ca/api/v1/write → HAProxy :443 → :9090 | skip_auth_route | None |
Deployment
Prerequisites
- Terraform: Prospero container must have updated port mappings (terraform apply)
- Certbot: Wildcard cert must exist on Titania (ansible-playbook certbot/deploy.yml)
- Vault Secrets: All vault variables must be set (see Required Vault Secrets)
- Casdoor Applications: Register PgAdmin and Prometheus apps in Casdoor (see Casdoor SSO)
Playbook
cd ansible
ansible-playbook pplg/deploy.yml
Files
| File | Purpose |
|---|---|
| pplg/deploy.yml | Main consolidated deployment playbook |
| pplg/pplg-haproxy.cfg.j2 | HAProxy TLS termination config (5 backends) |
| pplg/prometheus.yml.j2 | Prometheus scrape configuration |
| pplg/alert_rules.yml.j2 | Prometheus alerting rules |
| pplg/alertmanager.yml.j2 | Alertmanager routing and Pushover notifications |
| pplg/config.yml.j2 | Loki server configuration |
| pplg/grafana.ini.j2 | Grafana main config with Casdoor OAuth |
| pplg/datasource.yml.j2 | Grafana provisioned datasources |
| pplg/users.yml.j2 | Grafana provisioned users |
| pplg/config_local.py.j2 | PgAdmin config with Casdoor OAuth |
| pplg/pgadmin.service.j2 | PgAdmin gunicorn systemd unit |
| pplg/oauth2-proxy-prometheus.cfg.j2 | OAuth2-Proxy config for Prometheus UI |
| pplg/oauth2-proxy-prometheus.service.j2 | OAuth2-Proxy systemd unit |
Deployment Steps
- APT Repositories: Add Grafana and PgAdmin repos
- Install Packages: haproxy, prometheus, loki, grafana, pgadmin4-web, gunicorn
- Prometheus: Config, alert rules, systemd override for remote write receiver
- Alertmanager: Install, config with Pushover integration
- Loki: Create user/dirs, template config
- Grafana: Provisioning (datasources, users, dashboards), OAuth config
- PgAdmin: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
- OAuth2-Proxy: Download binary (v7.6.0), config for Prometheus sidecar
- SSL Certificate: Fetch Let's Encrypt wildcard cert from Titania (self-signed fallback)
- HAProxy: Template config, enable and start systemd service
Deployment Order
PPLG must be deployed before services that push metrics/logs:
apt_update → alloy → node_exporter → pplg → postgresql → ...
This order is enforced in site.yml.
Required Vault Secrets
Add to ansible/inventory/group_vars/all/vault.yml:
⚠️ All vault variables below must be set before running the playbook. Missing variables will cause template failures like:
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
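A pre-flight check can catch an undefined variable before any template task runs. The sketch below is one possible approach and is not part of the playbook: `missing_vault_vars` is a hypothetical helper that reads decrypted vault YAML on stdin and prints every name from the Quick Reference table that has no entry. It assumes each variable is a plain top-level key that starts at the beginning of a line.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; names are copied from the Quick Reference table.
REQUIRED_VAULT_VARS="
vault_casdoor_prometheus_access_key
vault_casdoor_prometheus_access_secret
vault_gitea_metrics_token
vault_grafana_admin_name
vault_grafana_admin_login
vault_grafana_admin_password
vault_grafana_viewer_name
vault_grafana_viewer_login
vault_grafana_viewer_password
vault_grafana_oauth_client_id
vault_grafana_oauth_client_secret
vault_pgadmin_email
vault_pgadmin_password
vault_pgadmin_oauth_client_id
vault_pgadmin_oauth_client_secret
vault_prometheus_oauth2_client_id
vault_prometheus_oauth2_client_secret
vault_prometheus_oauth2_cookie_secret
vault_pushover_user_key
vault_pushover_api_token
"

# Reads decrypted vault YAML on stdin and prints each required name that has
# no top-level "name:" entry. Returns 1 if anything is missing, 0 otherwise.
missing_vault_vars() {
  local yaml name missing=0
  yaml=$(cat)
  for name in $REQUIRED_VAULT_VARS; do
    if ! printf '%s\n' "$yaml" | grep "^${name}:" >/dev/null; then
      printf '%s\n' "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

From the ansible directory, it could be used as `ansible-vault view inventory/group_vars/all/vault.yml | missing_vault_vars`. An empty output and a zero exit status mean every listed variable is present.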
Prometheus Scrape Credentials
These are used in prometheus.yml.j2 to scrape metrics from Casdoor and Gitea.
1. Casdoor Prometheus Access Key
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
2. Casdoor Prometheus Access Secret
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
Requirements (both):
- Source: API key pair from the built-in/admin Casdoor user
- Used by: prometheus.yml.j2 Casdoor scrape job (accessKey/accessSecret query params)
- How to obtain: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):
# 1. Login to get session cookie
curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
  -H "Content-Type: application/json" \
  -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'
# 2. Generate API keys for built-in/admin
curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
  -H "Content-Type: application/json" \
  -d '{"owner":"built-in","name":"admin"}'
# 3. Retrieve the generated keys
curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"
# 4. Cleanup
rm /tmp/casdoor-cookie.txt
⚠️ The built-in/admin user is used (not a heluca user) because Casdoor's /api/metrics endpoint requires an admin user and serves global platform metrics.
3. Gitea Metrics Token
vault_gitea_metrics_token: "YourGiteaMetricsToken"
Requirements:
- Length: 32+ characters
- Source: Must match the token configured in Gitea's app.ini
- Generation: openssl rand -hex 32
- Used by: prometheus.yml.j2 Gitea scrape job (Bearer token auth)
Grafana Credentials
4. Grafana Admin User
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
5. Grafana Viewer User
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
6. Grafana OAuth (Casdoor SSO)
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
Requirements:
- Source: Must match the Casdoor application app-grafana
- Redirect URI: https://grafana.ouranos.helu.ca/login/generic_oauth
PgAdmin
7. PgAdmin Setup
Run the PgAdmin database setup manually on Prospero:
/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db
Requirements:
- Purpose: Initial local admin account (fallback when OAuth is unavailable)
8. PgAdmin OAuth (Casdoor SSO)
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
Requirements:
- Source: Must match the Casdoor application app-pgadmin
- Redirect URI: https://pgadmin.ouranos.helu.ca/oauth2/redirect
Prometheus OAuth2-Proxy
9. Prometheus OAuth2-Proxy (Casdoor SSO)
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
Requirements:
- Client ID/Secret must match the Casdoor application app-prometheus
- Redirect URI: https://prometheus.ouranos.helu.ca/oauth2/callback
- Cookie secret generation: python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
Alertmanager (Pushover)
10. Pushover Notification Credentials
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
Requirements:
- Source: pushover.net account
- User Key: Found on Pushover dashboard
- API Token: Create an application in Pushover
Quick Reference
| Vault Variable | Used By | Source |
|---|---|---|
| vault_casdoor_prometheus_access_key | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_casdoor_prometheus_access_secret | prometheus.yml.j2 | Casdoor built-in/admin API key |
| vault_gitea_metrics_token | prometheus.yml.j2 | Gitea app.ini |
| vault_grafana_admin_name | users.yml.j2 | Choose any |
| vault_grafana_admin_login | users.yml.j2 | Choose any |
| vault_grafana_admin_password | users.yml.j2 | Choose any |
| vault_grafana_viewer_name | users.yml.j2 | Choose any |
| vault_grafana_viewer_login | users.yml.j2 | Choose any |
| vault_grafana_viewer_password | users.yml.j2 | Choose any |
| vault_grafana_oauth_client_id | grafana.ini.j2 | Casdoor app |
| vault_grafana_oauth_client_secret | grafana.ini.j2 | Casdoor app |
| vault_pgadmin_email | config_local.py.j2 | Choose any |
| vault_pgadmin_password | config_local.py.j2 | Choose any |
| vault_pgadmin_oauth_client_id | config_local.py.j2 | Casdoor app |
| vault_pgadmin_oauth_client_secret | config_local.py.j2 | Casdoor app |
| vault_prometheus_oauth2_client_id | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_client_secret | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| vault_prometheus_oauth2_cookie_secret | oauth2-proxy-prometheus.cfg.j2 | Generate |
| vault_pushover_user_key | alertmanager.yml.j2 | Pushover account |
| vault_pushover_api_token | alertmanager.yml.j2 | Pushover account |
Casdoor SSO
Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.
Applications to Register
Register in Casdoor Admin UI (https://id.ouranos.helu.ca) or add to ansible/casdoor/init_data.json.j2:
| Application | Client ID | Redirect URI | Grant Types |
|---|---|---|---|
| app-grafana | vault_grafana_oauth_client_id | https://grafana.ouranos.helu.ca/login/generic_oauth | authorization_code, refresh_token |
| app-pgadmin | vault_pgadmin_oauth_client_id | https://pgadmin.ouranos.helu.ca/oauth2/redirect | authorization_code, refresh_token |
| app-prometheus | vault_prometheus_oauth2_client_id | https://prometheus.ouranos.helu.ca/oauth2/callback | authorization_code, refresh_token |
URL Strategy
| URL Type | Address | Used By |
|---|---|---|
| Auth URL | https://id.ouranos.helu.ca/login/oauth/authorize | User's browser (external) |
| Token URL | https://id.ouranos.helu.ca/api/login/oauth/access_token | Server-to-server |
| Userinfo URL | https://id.ouranos.helu.ca/api/userinfo | Server-to-server |
| OIDC Discovery | https://id.ouranos.helu.ca/.well-known/openid-configuration | OAuth2-Proxy |
Auth Methods per Service
| Service | Auth Method | Details |
|---|---|---|
| Grafana | Native [auth.generic_oauth] | Built-in OAuth support in grafana.ini |
| PgAdmin | Native OAUTH2_CONFIG | Built-in OAuth support in config_local.py |
| Prometheus | OAuth2-Proxy sidecar | Binary on :9091 proxying to :9090 |
| Loki | None | Machine-to-machine (Alloy agents push logs) |
| Alertmanager | None | Internal only |
HAProxy Configuration
Backends
| Backend | Upstream | Health Check | Auth |
|---|---|---|---|
| backend_grafana | 127.0.0.1:3000 | GET /api/health | Grafana OAuth |
| backend_pgadmin | 127.0.0.1:5050 | GET /misc/ping | PgAdmin OAuth |
| backend_prometheus | 127.0.0.1:9091 (OAuth2-Proxy) | GET /ping | OAuth2-Proxy |
| backend_prometheus_direct | 127.0.0.1:9090 | None | None (write API) |
| backend_loki | 127.0.0.1:3100 | GET /ready | None |
| backend_alertmanager | 127.0.0.1:9093 | GET /-/healthy | None |
skip_auth_route Pattern
The Prometheus write API (/api/v1/write) is accessed by Alloy agents for machine-to-machine metric pushes. HAProxy uses an ACL to bypass OAuth2-Proxy:
acl is_prometheus_write path_beg /api/v1/write
use_backend backend_prometheus_direct if host_prometheus is_prometheus_write
This routes https://prometheus.ouranos.helu.ca/api/v1/write directly to Prometheus on :9090, while all other Prometheus traffic goes through OAuth2-Proxy on :9091.
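The routing decision can be modeled as a small lookup on the Host header and request path. The function below is only an illustration of that logic, not HAProxy's implementation: `route_backend` is a hypothetical name, the host-to-backend pairs come from the subdomain and backend tables in this README, and the handling of unknown hosts is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative model of the subdomain and path routing; not HAProxy's own code.
# Usage: route_backend HOST PATH
route_backend() {
  local host="$1" path="$2"
  case "$host" in
    grafana.ouranos.helu.ca)      echo backend_grafana ;;
    pgadmin.ouranos.helu.ca)      echo backend_pgadmin ;;
    loki.ouranos.helu.ca)         echo backend_loki ;;
    alertmanager.ouranos.helu.ca) echo backend_alertmanager ;;
    prometheus.ouranos.helu.ca)
      case "$path" in
        # Mirrors "path_beg /api/v1/write": a prefix match that skips OAuth2-Proxy.
        /api/v1/write*) echo backend_prometheus_direct ;;
        *)              echo backend_prometheus ;;
      esac
      ;;
    # Assumption: requests for any other host are not routed.
    *) echo no_backend; return 1 ;;
  esac
}
```

For example, `route_backend prometheus.ouranos.helu.ca /api/v1/write` selects the direct backend, while `route_backend prometheus.ouranos.helu.ca /graph` selects the OAuth2-Proxy backend. Because the match is on a path prefix, only the Prometheus subdomain is affected; the same path on another subdomain still goes to that subdomain's backend.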
SSL Certificate
- Primary: Let's Encrypt wildcard cert (*.ouranos.helu.ca) fetched from Titania
- Fallback: Self-signed cert generated on Prospero (if Titania unavailable)
- Path: /etc/haproxy/certs/ouranos.pem
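To tell which of the two certificates is currently installed, one heuristic is to compare the certificate's subject and issuer: they are identical on a self-signed certificate and differ on a CA-issued one. The sketch below assumes the first certificate in the PEM file is the leaf; `cert_kind` and `check_cert` are hypothetical helper names, not part of the playbook.

```shell
#!/usr/bin/env bash
# Heuristic: a certificate whose subject equals its issuer is self-signed,
# which here would suggest the fallback certificate is in use.
# Usage: cert_kind SUBJECT ISSUER
cert_kind() {
  if [ "$1" = "$2" ]; then
    echo self-signed
  else
    echo ca-issued
  fi
}

# Reads subject and issuer from a PEM file with openssl, classifies the
# certificate, then prints its expiry date.
# Usage: check_cert [PEM_PATH]
check_cert() {
  local pem="${1:-/etc/haproxy/certs/ouranos.pem}" subject issuer
  subject=$(openssl x509 -in "$pem" -noout -subject | sed 's/^subject= *//')
  issuer=$(openssl x509 -in "$pem" -noout -issuer | sed 's/^issuer= *//')
  cert_kind "$subject" "$issuer"
  openssl x509 -in "$pem" -noout -enddate
}
```

Running `sudo bash -c '. ./check_cert.sh; check_cert'` on Prospero (the script filename is an assumption) would print the classification followed by the notAfter date.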
Host Variables
File: ansible/inventory/host_vars/prospero.incus.yml
Services list:
services:
- alloy
- pplg
Key variable groups defined in prospero.incus.yml:
- PPLG HAProxy (user, group, uid/gid 800, syslog port)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)
Terraform
Prospero Port Mapping
devices = [
{
name = "https_internal"
type = "proxy"
properties = {
listen = "tcp:0.0.0.0:25510"
connect = "tcp:127.0.0.1:443"
}
},
{
name = "http_redirect"
type = "proxy"
properties = {
listen = "tcp:0.0.0.0:25511"
connect = "tcp:127.0.0.1:80"
}
}
]
Run terraform apply before deploying if port mappings changed.
Titania Backend Routing
Titania's HAProxy routes external subdomains to Prospero's HTTPS port:
# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
backend_host: "prospero.incus"
backend_port: 443
health_path: "/api/health"
ssl_backend: true
- subdomain: "pgadmin"
backend_host: "prospero.incus"
backend_port: 443
health_path: "/misc/ping"
ssl_backend: true
- subdomain: "prometheus"
backend_host: "prospero.incus"
backend_port: 443
health_path: "/ping"
ssl_backend: true
Monitoring
Alloy Configuration
File: ansible/alloy/prospero/config.alloy.j2
- HAProxy Syslog: loki.source.syslog on 127.0.0.1:51405 (TCP) receives Docker syslog from HAProxy container
- Journal Labels: Dedicated job labels for grafana-server, prometheus, loki, alertmanager, pgadmin, oauth2-proxy-prometheus
- System Logs: /var/log/syslog, /var/log/auth.log → Loki
- Metrics: Node exporter + process exporter → Prometheus remote write
Prometheus Scrape Targets
| Job | Target | Auth |
|---|---|---|
| prometheus | localhost:9090 | None |
| node-exporter | All Uranian hosts :9100 | None |
| alertmanager | prospero.incus:9093 | None |
| haproxy | titania.incus:8404 | None |
| gitea | oberon.incus:22084 | Bearer token |
| casdoor | titania.incus:22081 | Access key/secret params |
Alert Rules
Groups defined in alert_rules.yml.j2:
| Group | Alerts | Scope |
|---|---|---|
| node_alerts | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| puck_process_alerts | HighCPU/Memory per process, CrashLoop | puck.incus |
| puck_container_alerts | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| service_alerts | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| loki_alerts | HighLogVolume | Loki |
Alertmanager Routing
Alerts are routed to Pushover with severity-based priority:
| Severity | Pushover Priority | Emoji |
|---|---|---|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | — |
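One way to exercise this routing without waiting for a real incident is to post a synthetic alert to Alertmanager's v2 API. The sketch below assumes alerts carry a lowercase severity label with the values critical, warning, or info; that label name, the alert name PPLGTestAlert, and both helper function names are assumptions rather than values taken from alertmanager.yml.j2.

```shell
#!/usr/bin/env bash
# Builds a one-alert JSON payload for Alertmanager's v2 API.
# Assumption: routing keys on a lowercase "severity" label.
# Usage: test_alert_payload SEVERITY
test_alert_payload() {
  local severity="$1"
  case "$severity" in
    critical|warning|info) ;;
    *) echo "unknown severity: $severity" >&2; return 1 ;;
  esac
  printf '[{"labels":{"alertname":"PPLGTestAlert","severity":"%s"},"annotations":{"summary":"Manual routing test"}}]\n' "$severity"
}

# Posts the payload to the local Alertmanager; intended to run on Prospero.
# Usage: send_test_alert SEVERITY
send_test_alert() {
  local payload
  payload=$(test_alert_payload "$1") || return 1
  curl -sf -X POST -H 'Content-Type: application/json' \
    -d "$payload" http://127.0.0.1:9093/api/v2/alerts
}
```

Starting with `send_test_alert info` is the gentler choice, since a critical test alert would be delivered at Emergency priority according to the table above.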
Grafana MCP Server
Grafana has an associated MCP (Model Context Protocol) server that provides AI/LLM access to dashboards, datasources, and alerting APIs. The Grafana MCP server runs as a Docker container on Miranda and connects back to Grafana on Prospero via the internal network (prospero.incus:3000) using a service account token.
| Property | Value |
|---|---|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | http://miranda.incus:25530/grafana |
| Auth | Grafana service account token (vault_grafana_service_account_token) |
The Grafana MCP server is deployed separately from PPLG but requires Grafana to be running first. Deploy order: pplg → grafana_mcp → mcpo.
For full details (deployment, configuration, available tools, troubleshooting), see Grafana MCP Server.
Access After Deployment
| Service | URL | Login |
|---|---|---|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |
Troubleshooting
Service Status
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
HAProxy Service
ssh prospero.incus
sudo systemctl status haproxy
sudo journalctl -u haproxy -f
View Logs
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f
# HAProxy logs (shipped via syslog to Alloy → Loki)
# Query in Grafana: {job="pplg-haproxy"}
Test Endpoints (from Prospero)
# Grafana
curl -s http://127.0.0.1:3000/api/health
# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping
# Prometheus
curl -s http://127.0.0.1:9090/-/healthy
# Loki
curl -s http://127.0.0.1:3100/ready
# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy
# HAProxy stats
curl -s http://127.0.0.1:8404/metrics | head
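The individual checks above could also be wrapped in a single loop that reports one line per service. This is a minimal sketch using the same URLs as the commands in this section; `health_endpoints` and `check_pplg_health` are hypothetical names, and the 5-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Name/URL pairs taken from the individual curl checks in this section.
health_endpoints() {
  cat <<'EOF'
grafana http://127.0.0.1:3000/api/health
pgadmin http://127.0.0.1:5050/misc/ping
prometheus http://127.0.0.1:9090/-/healthy
loki http://127.0.0.1:3100/ready
alertmanager http://127.0.0.1:9093/-/healthy
haproxy http://127.0.0.1:8404/metrics
EOF
}

# Probes every endpoint and prints OK or FAIL per service.
# Returns 1 if any probe fails. Intended to run on Prospero.
check_pplg_health() {
  local name url failed=0
  while read -r name url; do
    # -f makes curl exit non-zero on HTTP error responses.
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)"
      failed=1
    fi
  done <<EOF
$(health_endpoints)
EOF
  return "$failed"
}
```

Calling `check_pplg_health` after a deploy gives a quick pass/fail summary; the non-zero exit status makes it usable in scripts as well.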
Test TLS (from any host)
# Direct to Prospero container
curl -sk https://prospero.incus/api/health
# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
Common Errors
vault_casdoor_prometheus_access_key is undefined
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
Cause: The Casdoor metrics scrape job in prometheus.yml.j2 requires access credentials.
Fix: Generate API keys for the built-in/admin Casdoor user (see Casdoor Prometheus Access Key for the full procedure), then add to vault:
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
Certificate fetch fails
Cause: Titania not running or certbot hasn't provisioned the cert yet.
Fix: Ensure Titania is up and certbot has run:
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
The playbook falls back to a self-signed certificate if Titania is unavailable.
OAuth2 redirect loops
Cause: Casdoor application redirect URI doesn't match the service URL.
Fix: Verify redirect URIs match exactly:
- Grafana: https://grafana.ouranos.helu.ca/login/generic_oauth
- PgAdmin: https://pgadmin.ouranos.helu.ca/oauth2/redirect
- Prometheus: https://prometheus.ouranos.helu.ca/oauth2/callback
Migration Notes
PPLG replaces the following standalone playbooks (kept as reference):
| Original Playbook | Replaced By |
|---|---|
| prometheus/deploy.yml | pplg/deploy.yml |
| prometheus/alertmanager_deploy.yml | pplg/deploy.yml |
| loki/deploy.yml | pplg/deploy.yml |
| grafana/deploy.yml | pplg/deploy.yml |
| pgadmin/deploy.yml | pplg/deploy.yml |
PgAdmin was previously hosted on Portia (port 25555). It now runs on Prospero via gunicorn (no Apache).