# PPLG - Consolidated Observability & Admin Stack

## Overview

PPLG is the consolidated observability and administration stack running on **Prospero**. It bundles PgAdmin, Prometheus, Loki, and Grafana behind an internal HAProxy for TLS termination, with Casdoor SSO for user-facing services and OAuth2-Proxy as a sidecar for Prometheus UI authentication.

- **Host:** prospero.incus
- **Role:** Observability
- **Incus Ports:** 25510 → 443 (HTTPS), 25511 → 80 (HTTP redirect)
- **External Access:** Via Titania HAProxy → `prospero.incus:443`

| Subdomain | Service | Auth Method |
|-----------|---------|-------------|
| `grafana.ouranos.helu.ca` | Grafana | Native Casdoor OAuth |
| `pgadmin.ouranos.helu.ca` | PgAdmin | Native Casdoor OAuth |
| `prometheus.ouranos.helu.ca` | Prometheus | OAuth2-Proxy sidecar |
| `loki.ouranos.helu.ca` | Loki | None (machine-to-machine) |
| `alertmanager.ouranos.helu.ca` | Alertmanager | None (internal) |

## Architecture

```
┌──────────┐      ┌─────────────┐       ┌────────────────────────────────────────────────┐
│  Client  │─────▶│  HAProxy    │──────▶│  Prospero (PPLG)                               │
│          │      │  (Titania)  │       │                                                │
└──────────┘      │ :443 → :443 │       │  ┌──────────────────────────────────────────┐  │
                  └─────────────┘       │  │ HAProxy (systemd, :443/:80)              │  │
                                        │  │ TLS termination + subdomain routing      │  │
┌──────────┐                            │  └───┬──────┬──────┬──────┬──────┬──────────┘  │
│  Alloy   │──push─────────────────────▶│      │      │      │      │      │             │
│ (agents) │ loki.ouranos.helu.ca       │      │      │      │      │      │             │
│          │ prometheus.ouranos.helu.ca │      │      │      │      │      │             │
└──────────┘                            │      ▼      ▼      ▼      ▼      ▼             │
                                        │  Grafana PgAdmin OAuth2 Loki Alertmanager      │
                                        │    :3000  :5050  Proxy  :3100  :9093           │
                                        │                  :9091                         │
                                        │                    │                           │
                                        │                    ▼                           │
                                        │               Prometheus                       │
                                        │                  :9090                         │
                                        └────────────────────────────────────────────────┘
```

### Traffic Flow

| Flow | Path | Routing | Auth |
|------|------|---------|------|
| Browser → Grafana | Titania :443 → Prospero :443 → HAProxy → :3000 | Subdomain ACL | Casdoor OAuth |
| Browser → PgAdmin | Titania :443 → Prospero :443 → HAProxy → :5050 | Subdomain ACL | Casdoor OAuth |
| Browser → Prometheus | Titania :443 → Prospero :443 → HAProxy → OAuth2-Proxy :9091 → :9090 | Subdomain ACL | OAuth2-Proxy → Casdoor |
| Alloy → Loki | `https://loki.ouranos.helu.ca` → HAProxy :443 → :3100 | Subdomain ACL | None |
| Alloy → Prometheus | `https://prometheus.ouranos.helu.ca/api/v1/write` → HAProxy :443 → :9090 | `skip_auth_route` | None |

## Deployment

### Prerequisites

1. **Terraform**: Prospero container must have updated port mappings (`terraform apply`)
2. **Certbot**: Wildcard cert must exist on Titania (`ansible-playbook certbot/deploy.yml`)
3. **Vault Secrets**: All vault variables must be set (see [Required Vault Secrets](#required-vault-secrets))
4. **Casdoor Applications**: Register PgAdmin and Prometheus apps in Casdoor (see [Casdoor SSO](#casdoor-sso))

### Playbook

```bash
cd ansible
ansible-playbook pplg/deploy.yml
```

### Files

| File | Purpose |
|------|---------|
| `pplg/deploy.yml` | Main consolidated deployment playbook |
| `pplg/pplg-haproxy.cfg.j2` | HAProxy TLS termination config (6 backends) |
| `pplg/prometheus.yml.j2` | Prometheus scrape configuration |
| `pplg/alert_rules.yml.j2` | Prometheus alerting rules |
| `pplg/alertmanager.yml.j2` | Alertmanager routing and Pushover notifications |
| `pplg/config.yml.j2` | Loki server configuration |
| `pplg/grafana.ini.j2` | Grafana main config with Casdoor OAuth |
| `pplg/datasource.yml.j2` | Grafana provisioned datasources |
| `pplg/users.yml.j2` | Grafana provisioned users |
| `pplg/config_local.py.j2` | PgAdmin config with Casdoor OAuth |
| `pplg/pgadmin.service.j2` | PgAdmin gunicorn systemd unit |
| `pplg/oauth2-proxy-prometheus.cfg.j2` | OAuth2-Proxy config for Prometheus UI |
| `pplg/oauth2-proxy-prometheus.service.j2` | OAuth2-Proxy systemd unit |

### Deployment Steps

1. **APT Repositories**: Add Grafana and PgAdmin repos
2. **Install Packages**: haproxy, prometheus, loki, grafana, pgadmin4-web, gunicorn
3. **Prometheus**: Config, alert rules, systemd override for remote write receiver
4. **Alertmanager**: Install, config with Pushover integration
5. **Loki**: Create user/dirs, template config
6. **Grafana**: Provisioning (datasources, users, dashboards), OAuth config
7. **PgAdmin**: Create user/dirs, gunicorn systemd service, Casdoor OAuth config
8. **OAuth2-Proxy**: Download binary (v7.6.0), config for Prometheus sidecar
9. **SSL Certificate**: Fetch Let's Encrypt wildcard cert from Titania (self-signed fallback)
10. **HAProxy**: Template config, enable and start systemd service

### Deployment Order

PPLG must be deployed **before** services that push metrics/logs:

```
apt_update → alloy → node_exporter → pplg → postgresql → ...
```

This order is enforced in `site.yml`.

## Required Vault Secrets

Add to `ansible/inventory/group_vars/all/vault.yml`:

⚠️ **All vault variables below must be set before running the playbook.** Missing variables will cause template failures like:

```
TASK [Template prometheus.yml] ****
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```

### Prometheus Scrape Credentials

These are used in `prometheus.yml.j2` to scrape metrics from Casdoor and Gitea.

#### 1. Casdoor Prometheus Access Key
```yaml
vault_casdoor_prometheus_access_key: "YourCasdoorAccessKey"
```

#### 2. Casdoor Prometheus Access Secret
```yaml
vault_casdoor_prometheus_access_secret: "YourCasdoorAccessSecret"
```

**Requirements (both):**
- **Source**: API key pair from the `built-in/admin` Casdoor user
- **Used by**: `prometheus.yml.j2` Casdoor scrape job (`accessKey` / `accessSecret` query params)
- **How to obtain**: Generate via Casdoor API (the "API key" account item is not exposed in the UI by default):

  ```bash
  # 1. Login to get session cookie
  curl -sk -c /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/login" \
    -H "Content-Type: application/json" \
    -d '{"application":"app-built-in","organization":"built-in","username":"admin","password":"YOUR_PASSWORD","type":"login"}'

  # 2. Generate API keys for built-in/admin
  curl -sk -b /tmp/casdoor-cookie.txt -X POST "https://id.ouranos.helu.ca/api/add-user-keys" \
    -H "Content-Type: application/json" \
    -d '{"owner":"built-in","name":"admin"}'

  # 3. Retrieve the generated keys
  curl -sk -b /tmp/casdoor-cookie.txt "https://id.ouranos.helu.ca/api/get-user?id=built-in/admin" | \
    python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(f'accessKey: {d[\"accessKey\"]}\naccessSecret: {d[\"accessSecret\"]}')"

  # 4. Cleanup
  rm /tmp/casdoor-cookie.txt
  ```

⚠️ The `built-in/admin` user is used (not a `heluca` user) because Casdoor's `/api/metrics` endpoint requires an admin user and serves global platform metrics.

#### 3. Gitea Metrics Token
```yaml
vault_gitea_metrics_token: "YourGiteaMetricsToken"
```
**Requirements:**
- **Length**: 32+ characters
- **Source**: Must match the token configured in Gitea's `app.ini`
- **Generation**: `openssl rand -hex 32`
- **Used by**: `prometheus.yml.j2` Gitea scrape job (Bearer token auth)

### Grafana Credentials

#### 4. Grafana Admin User
```yaml
vault_grafana_admin_name: "Admin"
vault_grafana_admin_login: "admin"
vault_grafana_admin_password: "YourSecureAdminPassword"
```

#### 5. Grafana Viewer User
```yaml
vault_grafana_viewer_name: "Viewer"
vault_grafana_viewer_login: "viewer"
vault_grafana_viewer_password: "YourSecureViewerPassword"
```

#### 6. Grafana OAuth (Casdoor SSO)
```yaml
vault_grafana_oauth_client_id: "grafana-oauth-client"
vault_grafana_oauth_client_secret: "YourGrafanaOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-grafana`
- **Redirect URI**: `https://grafana.ouranos.helu.ca/login/generic_oauth`

### PgAdmin

#### 7. PgAdmin Setup

The PgAdmin configuration database is initialized by hand. Run the setup command on the PgAdmin host:

`/usr/pgadmin4/venv/bin/python3 /usr/pgadmin4/web/setup.py setup-db`

**Requirements:**
- **Purpose**: Initial local admin account (fallback when OAuth is unavailable)

#### 8. PgAdmin OAuth (Casdoor SSO)
```yaml
vault_pgadmin_oauth_client_id: "pgadmin-oauth-client"
vault_pgadmin_oauth_client_secret: "YourPgAdminOAuthSecret"
```
**Requirements:**
- **Source**: Must match the Casdoor application `app-pgadmin`
- **Redirect URI**: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`

### Prometheus OAuth2-Proxy

#### 9. Prometheus OAuth2-Proxy (Casdoor SSO)
```yaml
vault_prometheus_oauth2_client_id: "prometheus-oauth-client"
vault_prometheus_oauth2_client_secret: "YourPrometheusOAuthSecret"
vault_prometheus_oauth2_cookie_secret: "GeneratedCookieSecret"
```
**Requirements:**
- Client ID/Secret must match the Casdoor application `app-prometheus`
- **Redirect URI**: `https://prometheus.ouranos.helu.ca/oauth2/callback`
- **Cookie secret generation**:
  ```bash
  python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
  ```

### Alertmanager (Pushover)

#### 10. Pushover Notification Credentials
```yaml
vault_pushover_user_key: "YourPushoverUserKey"
vault_pushover_api_token: "YourPushoverAPIToken"
```
**Requirements:**
- **Source**: [pushover.net](https://pushover.net/) account
- **User Key**: Found on Pushover dashboard
- **API Token**: Create an application in Pushover

### Quick Reference

| Vault Variable | Used By | Source |
|---------------|---------|--------|
| `vault_casdoor_prometheus_access_key` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_casdoor_prometheus_access_secret` | prometheus.yml.j2 | Casdoor `built-in/admin` API key |
| `vault_gitea_metrics_token` | prometheus.yml.j2 | Gitea app.ini |
| `vault_grafana_admin_name` | users.yml.j2 | Choose any |
| `vault_grafana_admin_login` | users.yml.j2 | Choose any |
| `vault_grafana_admin_password` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_name` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_login` | users.yml.j2 | Choose any |
| `vault_grafana_viewer_password` | users.yml.j2 | Choose any |
| `vault_grafana_oauth_client_id` | grafana.ini.j2 | Casdoor app |
| `vault_grafana_oauth_client_secret` | grafana.ini.j2 | Casdoor app |
| `vault_pgadmin_email` | config_local.py.j2 | Choose any |
| `vault_pgadmin_password` | config_local.py.j2 | Choose any |
| `vault_pgadmin_oauth_client_id` | config_local.py.j2 | Casdoor app |
| `vault_pgadmin_oauth_client_secret` | config_local.py.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_id` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_client_secret` | oauth2-proxy-prometheus.cfg.j2 | Casdoor app |
| `vault_prometheus_oauth2_cookie_secret` | oauth2-proxy-prometheus.cfg.j2 | Generate |
| `vault_pushover_user_key` | alertmanager.yml.j2 | Pushover account |
| `vault_pushover_api_token` | alertmanager.yml.j2 | Pushover account |

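
A pre-flight check can catch missing variables before the playbook reaches the templating tasks. The following is a minimal sketch and not part of the playbook: `missing_vault_vars` and `REQUIRED_VAULT_VARS` are hypothetical names, the variable list is abbreviated, and the check assumes each variable appears as a top-level key in the decrypted vault passed on stdin.

```bash
#!/usr/bin/env bash
# Sketch: report required vault variables that are absent from a decrypted vault.
# The helper name and the abbreviated variable list are illustrative assumptions.

REQUIRED_VAULT_VARS="
vault_casdoor_prometheus_access_key
vault_casdoor_prometheus_access_secret
vault_gitea_metrics_token
vault_grafana_oauth_client_id
vault_grafana_oauth_client_secret
vault_prometheus_oauth2_cookie_secret
vault_pushover_user_key
vault_pushover_api_token
"

# Read decrypted vault YAML from stdin, print one missing variable per line,
# and return 1 if any variable is missing.
missing_vault_vars() {
  local content var missing=0
  content="$(cat)"   # read stdin once so every variable is checked against it
  for var in $REQUIRED_VAULT_VARS; do
    # A variable counts as present when it appears as a top-level YAML key.
    if ! grep -q "^${var}:" <<<"$content"; then
      echo "$var"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible invocation, run from the `ansible` directory, is `ansible-vault view inventory/group_vars/all/vault.yml | missing_vault_vars`. Piping the decrypted content avoids writing secrets to disk.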
## Casdoor SSO

Three Casdoor applications are required. Grafana's should already exist; PgAdmin and Prometheus need to be created.

### Applications to Register

Register in Casdoor Admin UI (`https://id.ouranos.helu.ca`) or add to `ansible/casdoor/init_data.json.j2`:

| Application | Client ID | Redirect URI | Grant Types |
|-------------|-----------|-------------|-------------|
| `app-grafana` | `vault_grafana_oauth_client_id` | `https://grafana.ouranos.helu.ca/login/generic_oauth` | `authorization_code`, `refresh_token` |
| `app-pgadmin` | `vault_pgadmin_oauth_client_id` | `https://pgadmin.ouranos.helu.ca/oauth2/redirect` | `authorization_code`, `refresh_token` |
| `app-prometheus` | `vault_prometheus_oauth2_client_id` | `https://prometheus.ouranos.helu.ca/oauth2/callback` | `authorization_code`, `refresh_token` |

### URL Strategy

| URL Type | Address | Used By |
|----------|---------|---------|
| **Auth URL** | `https://id.ouranos.helu.ca/login/oauth/authorize` | User's browser (external) |
| **Token URL** | `https://id.ouranos.helu.ca/api/login/oauth/access_token` | Server-to-server |
| **Userinfo URL** | `https://id.ouranos.helu.ca/api/userinfo` | Server-to-server |
| **OIDC Discovery** | `https://id.ouranos.helu.ca/.well-known/openid-configuration` | OAuth2-Proxy |

### Auth Methods per Service

| Service | Auth Method | Details |
|---------|-------------|---------|
| **Grafana** | Native `[auth.generic_oauth]` | Built-in OAuth support in `grafana.ini` |
| **PgAdmin** | Native `OAUTH2_CONFIG` | Built-in OAuth support in `config_local.py` |
| **Prometheus** | OAuth2-Proxy sidecar | Binary on `:9091` proxying to `:9090` |
| **Loki** | None | Machine-to-machine (Alloy agents push logs) |
| **Alertmanager** | None | Internal only |

## HAProxy Configuration

### Backends

| Backend | Upstream | Health Check | Auth |
|---------|----------|-------------|------|
| `backend_grafana` | `127.0.0.1:3000` | `GET /api/health` | Grafana OAuth |
| `backend_pgadmin` | `127.0.0.1:5050` | `GET /misc/ping` | PgAdmin OAuth |
| `backend_prometheus` | `127.0.0.1:9091` (OAuth2-Proxy) | `GET /ping` | OAuth2-Proxy |
| `backend_prometheus_direct` | `127.0.0.1:9090` | None | None (write API) |
| `backend_loki` | `127.0.0.1:3100` | `GET /ready` | None |
| `backend_alertmanager` | `127.0.0.1:9093` | `GET /-/healthy` | None |

### skip_auth_route Pattern

The Prometheus write API (`/api/v1/write`) is accessed by Alloy agents for machine-to-machine metric pushes. HAProxy uses an ACL to bypass OAuth2-Proxy:

```
acl is_prometheus_write path_beg /api/v1/write
use_backend backend_prometheus_direct if host_prometheus is_prometheus_write
```

This routes `https://prometheus.ouranos.helu.ca/api/v1/write` directly to Prometheus on `:9090`, while all other Prometheus traffic goes through OAuth2-Proxy on `:9091`.

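
The routing decision can be restated in plain shell to make the two cases explicit. This is an illustrative sketch, not HAProxy configuration: `prometheus_backend` is a hypothetical helper, and it assumes the `host_prometheus` ACL (not shown here) matches the `prometheus.ouranos.helu.ca` Host header.

```bash
#!/usr/bin/env bash
# Sketch of the HAProxy routing decision for Prometheus traffic.
# Assumption: host_prometheus matches the Host header prometheus.ouranos.helu.ca.

# Print the backend that would serve a request for the given host and path.
prometheus_backend() {
  local host="$1" path="$2"
  if [ "$host" != "prometheus.ouranos.helu.ca" ]; then
    echo "not-prometheus"   # other subdomains are handled by their own ACLs
    return 1
  fi
  case "$path" in
    /api/v1/write*) echo "backend_prometheus_direct" ;;  # same prefix match as path_beg
    *)              echo "backend_prometheus" ;;         # everything else goes via OAuth2-Proxy
  esac
}
```

Because `path_beg` is a prefix match, any path that starts with `/api/v1/write` bypasses OAuth2-Proxy, so `prometheus_backend prometheus.ouranos.helu.ca /api/v1/query` still reports the authenticated backend while the write path does not.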
### SSL Certificate

- **Primary**: Let's Encrypt wildcard cert (`*.ouranos.helu.ca`) fetched from Titania
- **Fallback**: Self-signed cert generated on Prospero (if Titania unavailable)
- **Path**: `/etc/haproxy/certs/ouranos.pem`

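
Since the fallback happens silently, it can be useful to check which certificate ended up in place. The sketch below is one way to do that under a stated heuristic: a certificate whose issuer equals its subject is treated as self-signed. `cert_is_self_signed` is a hypothetical helper, not something the playbook provides.

```bash
#!/usr/bin/env bash
# Sketch: detect whether a PEM file holds a self-signed certificate.
# Heuristic (an assumption of this example): issuer == subject means self-signed.

# Return 0 if the first certificate in the file is self-signed, 1 if it is not,
# and 2 if the file cannot be read as a certificate.
cert_is_self_signed() {
  local pem="$1" subject issuer
  subject="$(openssl x509 -in "$pem" -noout -subject)" || return 2
  issuer="$(openssl x509 -in "$pem" -noout -issuer)" || return 2
  # Strip the "subject=" / "issuer=" prefixes before comparing.
  [ "${subject#*=}" = "${issuer#*=}" ]
}
```

On Prospero this could be run as `cert_is_self_signed /etc/haproxy/certs/ouranos.pem && echo "self-signed fallback in use"`.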
## Host Variables

**File:** `ansible/inventory/host_vars/prospero.incus.yml`

Services list:
```yaml
services:
  - alloy
  - pplg
```

Key variable groups defined in `prospero.incus.yml`:
- PPLG HAProxy (user, group, uid/gid 800, syslog port)
- Grafana (datasources, users, OAuth config)
- Prometheus (scrape targets, OAuth2-Proxy sidecar config)
- Alertmanager (Pushover integration)
- Loki (user, data/config directories)
- PgAdmin (user, data/log directories, OAuth config)
- Casdoor Metrics (access key/secret for Prometheus scraping)

## Terraform

### Prospero Port Mapping

```hcl
devices = [
  {
    name = "https_internal"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25510"
      connect = "tcp:127.0.0.1:443"
    }
  },
  {
    name = "http_redirect"
    type = "proxy"
    properties = {
      listen  = "tcp:0.0.0.0:25511"
      connect = "tcp:127.0.0.1:80"
    }
  }
]
```

Run `terraform apply` before deploying if port mappings changed.

### Titania Backend Routing

Titania's HAProxy routes external subdomains to Prospero's HTTPS port:

```yaml
# In titania.incus.yml haproxy_backends
- subdomain: "grafana"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/api/health"
  ssl_backend: true

- subdomain: "pgadmin"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/misc/ping"
  ssl_backend: true

- subdomain: "prometheus"
  backend_host: "prospero.incus"
  backend_port: 443
  health_path: "/ping"
  ssl_backend: true
```

## Monitoring

### Alloy Configuration

**File:** `ansible/alloy/prospero/config.alloy.j2`

- **HAProxy Syslog**: `loki.source.syslog` on `127.0.0.1:51405` (TCP) receives syslog from the HAProxy service
- **Journal Labels**: Dedicated job labels for `grafana-server`, `prometheus`, `loki`, `alertmanager`, `pgadmin`, `oauth2-proxy-prometheus`
- **System Logs**: `/var/log/syslog`, `/var/log/auth.log` → Loki
- **Metrics**: Node exporter + process exporter → Prometheus remote write

### Prometheus Scrape Targets

| Job | Target | Auth |
|-----|--------|------|
| `prometheus` | `localhost:9090` | None |
| `node-exporter` | All Uranian hosts `:9100` | None |
| `alertmanager` | `prospero.incus:9093` | None |
| `haproxy` | `titania.incus:8404` | None |
| `gitea` | `oberon.incus:22084` | Bearer token |
| `casdoor` | `titania.incus:22081` | Access key/secret params |

### Alert Rules

Groups defined in `alert_rules.yml.j2`:

| Group | Alerts | Scope |
|-------|--------|-------|
| `node_alerts` | InstanceDown, HighCPU, HighMemory, DiskSpace, LoadAverage | All hosts |
| `puck_process_alerts` | HighCPU/Memory per process, CrashLoop | puck.incus |
| `puck_container_alerts` | HighContainerCount, Duplicates, Orphans, OOM | puck.incus |
| `service_alerts` | TargetMissing, JobMissing, AlertmanagerDown | Infrastructure |
| `loki_alerts` | HighLogVolume | Loki |

### Alertmanager Routing

Alerts are routed to Pushover with severity-based priority:

| Severity | Pushover Priority | Emoji |
|----------|-------------------|-------|
| Critical | 2 (Emergency) | 🚨 |
| Warning | 1 (High) | ⚠️ |
| Info | 0 (Normal) | None |

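
The mapping in the table can be written as a small lookup, which is handy when testing notifications by hand. This is only a sketch: `pushover_priority` is a hypothetical helper, and it assumes severity labels are lowercase (`critical`, `warning`, `info`); the actual routing lives in `alertmanager.yml.j2`.

```bash
#!/usr/bin/env bash
# Sketch of the severity-to-priority mapping from the routing table.
# Assumption: alert severity labels are lowercase.

# Print the Pushover priority for a severity label; return 1 for unknown labels.
pushover_priority() {
  case "$1" in
    critical) echo 2 ;;  # Emergency
    warning)  echo 1 ;;  # High
    info)     echo 0 ;;  # Normal
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Note that Pushover requires `retry` and `expire` parameters for priority 2 (Emergency) messages, so critical alerts need those values set wherever the notification is sent from.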
## Grafana MCP Server

Grafana has an associated **MCP (Model Context Protocol) server** that gives AI/LLM clients access to dashboards, datasources, and alerting APIs. It runs as a Docker container on **Miranda** and connects back to Grafana on Prospero over the internal network (`prospero.incus:3000`) using a service account token.

| Property | Value |
|----------|-------|
| MCP Host | miranda.incus |
| MCP Port | 25533 |
| MCPO Proxy | `http://miranda.incus:25530/grafana` |
| Auth | Grafana service account token (`vault_grafana_service_account_token`) |

The Grafana MCP server is deployed separately from PPLG, but Grafana must already be running when it starts. Deploy order: `pplg → grafana_mcp → mcpo`.

For full details (deployment, configuration, available tools, troubleshooting), see **[Grafana MCP Server](grafana_mcp.md)**.

## Access After Deployment

| Service | URL | Login |
|---------|-----|-------|
| Grafana | https://grafana.ouranos.helu.ca | Casdoor SSO or local admin |
| PgAdmin | https://pgadmin.ouranos.helu.ca | Casdoor SSO or local admin |
| Prometheus | https://prometheus.ouranos.helu.ca | Casdoor SSO |
| Alertmanager | https://alertmanager.ouranos.helu.ca | No auth (internal) |

## Troubleshooting

### Service Status

```bash
ssh prospero.incus
sudo systemctl status prometheus grafana-server loki prometheus-alertmanager pgadmin oauth2-proxy-prometheus
```

### HAProxy Service

```bash
ssh prospero.incus
sudo systemctl status haproxy
sudo journalctl -u haproxy -f
```

### View Logs

```bash
# All PPLG services via journal
sudo journalctl -u prometheus -u grafana-server -u loki -u prometheus-alertmanager -u pgadmin -u oauth2-proxy-prometheus -f

# HAProxy logs (shipped via syslog to Alloy → Loki)
# Query in Grafana: {job="pplg-haproxy"}
```

### Test Endpoints (from Prospero)

```bash
# Grafana
curl -s http://127.0.0.1:3000/api/health

# PgAdmin
curl -s http://127.0.0.1:5050/misc/ping

# Prometheus
curl -s http://127.0.0.1:9090/-/healthy

# Loki
curl -s http://127.0.0.1:3100/ready

# Alertmanager
curl -s http://127.0.0.1:9093/-/healthy

# HAProxy stats
curl -s http://127.0.0.1:8404/metrics | head
```

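
The individual checks above can be wrapped in a loop that prints one summary line per service. The following is a sketch under assumptions: `probe` and `check_pplg_endpoints` are hypothetical names, and a service is considered healthy when `curl` gets a response without an HTTP error within five seconds.

```bash
#!/usr/bin/env bash
# Sketch: probe each local PPLG health endpoint and summarize the results.
# probe() and check_pplg_endpoints() are illustrative names, not part of the playbook.

# Return 0 when the URL responds without an HTTP error (status below 400) within 5 seconds.
probe() {
  curl -sf -o /dev/null --max-time 5 "$1"
}

# Print "OK   <name>" or "FAIL <name>" per service; return 1 if any probe failed.
check_pplg_endpoints() {
  local failed=0 name url
  while read -r name url; do
    if probe "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done <<'EOF'
grafana      http://127.0.0.1:3000/api/health
pgadmin      http://127.0.0.1:5050/misc/ping
prometheus   http://127.0.0.1:9090/-/healthy
loki         http://127.0.0.1:3100/ready
alertmanager http://127.0.0.1:9093/-/healthy
EOF
  return "$failed"
}
```

Calling `check_pplg_endpoints` on Prospero prints the status of all five services, and the non-zero return code on failure makes it usable in scripts.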
### Test TLS (from any host)

```bash
# Direct to Prospero container
curl -sk https://prospero.incus/api/health
# Via Titania HAProxy
curl -s https://grafana.ouranos.helu.ca/api/health
```

### Common Errors

#### `vault_casdoor_prometheus_access_key` is undefined

```
TASK [Template prometheus.yml]
[ERROR]: 'vault_casdoor_prometheus_access_key' is undefined
```

**Cause**: The Casdoor metrics scrape job in `prometheus.yml.j2` requires access credentials.

**Fix**: Generate API keys for the `built-in/admin` Casdoor user (see [Casdoor Prometheus Access Key](#1-casdoor-prometheus-access-key) for the full procedure), then add to vault:
```bash
cd ansible
ansible-vault edit inventory/group_vars/all/vault.yml
```
```yaml
vault_casdoor_prometheus_access_key: "your-casdoor-access-key"
vault_casdoor_prometheus_access_secret: "your-casdoor-access-secret"
```

#### Certificate fetch fails

**Cause**: Titania is not running, or certbot has not provisioned the cert yet.

**Fix**: Ensure Titania is up and certbot has run:
```bash
ansible-playbook sandbox_up.yml
ansible-playbook certbot/deploy.yml
```

The playbook falls back to a self-signed certificate if Titania is unavailable.

#### OAuth2 redirect loops

**Cause**: Casdoor application redirect URI doesn't match the service URL.

**Fix**: Verify redirect URIs match exactly:
- Grafana: `https://grafana.ouranos.helu.ca/login/generic_oauth`
- PgAdmin: `https://pgadmin.ouranos.helu.ca/oauth2/redirect`
- Prometheus: `https://prometheus.ouranos.helu.ca/oauth2/callback`

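
Because the match must be exact, a string comparison against the expected values is a quick way to spot a stray trailing slash or wrong path. The helpers below are a hypothetical sketch (`expected_redirect_uri` and `redirect_uri_matches` are not part of the playbook); the configured value is whatever is copied out of the Casdoor application settings.

```bash
#!/usr/bin/env bash
# Sketch: compare a configured redirect URI against the value each service expects.
# Both helper names are illustrative assumptions.

# Print the expected redirect URI for a service; return 1 for unknown services.
expected_redirect_uri() {
  case "$1" in
    grafana)    echo "https://grafana.ouranos.helu.ca/login/generic_oauth" ;;
    pgadmin)    echo "https://pgadmin.ouranos.helu.ca/oauth2/redirect" ;;
    prometheus) echo "https://prometheus.ouranos.helu.ca/oauth2/callback" ;;
    *)          return 1 ;;
  esac
}

# Return 0 on an exact match, 1 on a mismatch, 2 for an unknown service.
redirect_uri_matches() {
  local service="$1" configured="$2" expected
  expected="$(expected_redirect_uri "$service")" || return 2
  if [ "$configured" = "$expected" ]; then
    return 0
  fi
  echo "mismatch for $service: expected $expected, got $configured" >&2
  return 1
}
```

For example, `redirect_uri_matches pgadmin "https://pgadmin.ouranos.helu.ca/oauth2/redirect/"` reports a mismatch because of the trailing slash.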
## Migration Notes

PPLG replaces the following standalone playbooks (kept as reference):

| Original Playbook | Replaced By |
|-------------------|-------------|
| `prometheus/deploy.yml` | `pplg/deploy.yml` |
| `prometheus/alertmanager_deploy.yml` | `pplg/deploy.yml` |
| `loki/deploy.yml` | `pplg/deploy.yml` |
| `grafana/deploy.yml` | `pplg/deploy.yml` |
| `pgadmin/deploy.yml` | `pplg/deploy.yml` |

PgAdmin was previously hosted on **Portia** (port 25555). It now runs on **Prospero** via gunicorn (no Apache).