# SearXNG

## Overview

SearXNG is a privacy-respecting metasearch engine that aggregates results from
multiple upstream search providers and re-ranks them. The Ouranos deployment runs
as a single Docker container behind an authenticating OAuth2-Proxy sidecar (see
[`searxng-auth.md`](./searxng-auth.md) for the auth design).

**Host:** `rosalind.incus`
**Container port:** 22089 (host) → 8080 (container)
**Public URL:** `https://searxng.ouranos.helu.ca/` (via HAProxy → OAuth2-Proxy → SearXNG)
**Internal URL:** `http://rosalind.incus:22089/` (used by LobeChat, Argos, etc.)

## Ansible Deployment

### Layout

```
ansible/searxng/
├── deploy.yml                        # Main deployment playbook
├── deploy_oauth2.yml                 # OAuth2-Proxy sidecar playbook
├── docker-compose.yml.j2             # Docker Compose template
├── searxng-settings.yml.j2           # SearXNG settings.yml template
├── oauth2-proxy-searxng.cfg.j2       # OAuth2-Proxy config (see searxng-auth.md)
└── oauth2-proxy-searxng.service.j2   # Systemd unit for the sidecar
```

### Run

```bash
cd ansible
ansible-playbook searxng/deploy.yml --limit rosalind.incus
ansible-playbook searxng/deploy_oauth2.yml --limit rosalind.incus
```

`deploy.yml`:

1. Skips hosts that don't list `searxng` in their `services` list.
2. Creates the `searxng` system user and `/srv/searxng` directory.
3. Templates `docker-compose.yml` and `searxng-settings.yml` into `/srv/searxng/`.
4. Brings up the container with `community.docker.docker_compose_v2` (`pull: always`).

The container mounts `searxng-settings.yml` read-only at
`/etc/searxng/settings.yml`. There is no persistent volume: the cache lives in
the container's `/tmp` and is rebuilt on restart.

### Variables

#### Host Variables (`inventory/host_vars/rosalind.incus.yml`)

| Variable                | Value                          | Purpose                         |
|-------------------------|--------------------------------|---------------------------------|
| `searxng_port`          | `22089`                        | Host-side container port        |
| `searxng_base_url`      | `http://rosalind.incus:22089/` | Used by SearXNG to build URLs   |
| `searxng_instance_name` | `Ouranos Search`               | Shown in the UI header          |
| `searxng_directory`     | `/srv/searxng`                 | Compose project dir on the host |
| `searxng_user`/`group`  | `searxng`                      | Owns templated config files     |
| `searxng_syslog_port`   | `51403`                        | Alloy syslog receiver port      |

#### Vault Variables (`group_vars/all/vault.yml`)

| Variable                      | Purpose                                              |
|-------------------------------|------------------------------------------------------|
| `vault_searxng_secret_key`    | `server.secret_key`; also used as cache DB password  |
| `vault_searxng_brave_api_key` | Brave Search API subscription token (see below)      |
| `vault_searxng_oauth_*`       | OAuth2-Proxy sidecar; see `searxng-auth.md`          |

> ⚠️ **Changing `vault_searxng_secret_key` truncates the cache.** SearXNG hashes
> cache keys with the secret key; when the key changes, it truncates every cache
> table on the next startup. Harmless, but be aware that engines like `wikidata`
> and `radio_browser` will need to re-fetch their on-disk indexes.

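The note above can be illustrated with a toy model. This is a sketch only, not SearXNG's cache code: the table layout and the helpers `secret_hash`, `open_cache`, `cache_set`, and `cache_get` are hypothetical names chosen for illustration. It assumes the cache stores a fingerprint of the secret and empties itself when the fingerprint no longer matches.

```python
import hashlib
import hmac
import sqlite3


def secret_hash(secret_key: str, name: str) -> str:
    """Hash a cache key name with the secret key (HMAC-SHA256)."""
    return hmac.new(secret_key.encode(), name.encode(), hashlib.sha256).hexdigest()


def open_cache(db: sqlite3.Connection, secret_key: str) -> bool:
    """Prepare the cache; return True if existing entries were truncated."""
    db.execute("CREATE TABLE IF NOT EXISTS properties (name TEXT PRIMARY KEY, value TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")
    fingerprint = hashlib.sha256(secret_key.encode()).hexdigest()
    row = db.execute("SELECT value FROM properties WHERE name = 'password'").fetchone()
    truncated = row is not None and row[0] != fingerprint
    if truncated:
        # Entries keyed with the old secret can no longer be looked up.
        db.execute("DELETE FROM cache")
    db.execute("INSERT OR REPLACE INTO properties VALUES ('password', ?)", (fingerprint,))
    return truncated


def cache_set(db: sqlite3.Connection, secret_key: str, name: str, value: str) -> None:
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)",
               (secret_hash(secret_key, name), value))


def cache_get(db: sqlite3.Connection, secret_key: str, name: str):
    row = db.execute("SELECT value FROM cache WHERE key = ?",
                     (secret_hash(secret_key, name),)).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
open_cache(db, "old-secret")
cache_set(db, "old-secret", "wikidata.languages", "en,fr,de")
hit_before = cache_get(db, "old-secret", "wikidata.languages")

rotated = open_cache(db, "new-secret")  # simulates a restart after key rotation
hit_after = cache_get(db, "new-secret", "wikidata.languages")
print(hit_before, rotated, hit_after)
```

In this model the old entries would be unreachable even without the explicit delete, because their hashed keys depend on the old secret; truncating simply reclaims the space.
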
## Search Engine Configuration

The engine list is templated in `searxng-settings.yml.j2` and merges with the
upstream defaults via `use_default_settings: true`. The merge is keyed by engine
`name` and is shallow: **only fields you explicitly set override the
defaults**; everything else (including hidden ones like `inactive`) is inherited.

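The merge behaviour described above can be modelled in a few lines. This is a minimal sketch, not SearXNG's settings loader: `merge_engines` and the `DEFAULT_ENGINES` list are hypothetical, and the `example` engine exists only for illustration.

```python
from copy import deepcopy

# Hypothetical stand-in for the upstream defaults; only the fields
# discussed in this document are shown.
DEFAULT_ENGINES = [
    {"name": "braveapi", "engine": "braveapi", "api_key": "", "inactive": True},
    {"name": "example", "engine": "example", "disabled": True},
]


def merge_engines(defaults, overrides):
    """Merge override entries into defaults, matching on `name`.

    Fields present in an override replace the default value; fields
    the override does not mention are inherited unchanged.
    """
    merged = {e["name"]: deepcopy(e) for e in defaults}
    for override in overrides:
        merged.setdefault(override["name"], {}).update(override)
    return list(merged.values())


# An override that forgets `inactive: false` ...
incomplete = merge_engines(DEFAULT_ENGINES, [{"name": "braveapi", "disabled": False}])
# ... versus one that sets it explicitly.
complete = merge_engines(
    DEFAULT_ENGINES,
    [{"name": "braveapi", "api_key": "example-token", "inactive": False}],
)

print(incomplete[0]["inactive"])  # inherited from the defaults
print(complete[0]["inactive"])    # overridden
```

The first merge keeps `inactive: True` because the override never mentions that field, which is the situation the `braveapi` section below warns about.
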
### Enabled engines

| Engine       | Notes                                          |
|--------------|------------------------------------------------|
| `duckduckgo` | General web                                    |
| `startpage`  | General web                                    |
| `mojeek`     | General web                                    |
| `braveapi`   | Brave Search via official REST API (see below) |

### Disabled engines

| Engine                               | Reason                                                     |
|--------------------------------------|------------------------------------------------------------|
| `google`                             | Aggressive bot detection / unstable scraping results       |
| `bing news`                          | Frequent parsing errors                                    |
| `brave` (HTML scraper)               | Replaced by `braveapi`; keeping both would duplicate results |
| `brave.images` / `.videos` / `.news` | Scraping endpoints return 451 / access-denied              |
| `duckduckgo images`                  | Suspended / access-denied responses                        |
| `pexels`, `vimeo`                    | Same: suspended / access-denied                            |

> ℹ️ **Why disable Google and Bing News?** Google's HTML scraper is
> blocked aggressively and produces low-quality / inconsistent results. Bing's
> news scraper hits parser failures often enough to be more noise than signal.
> The remaining four engines (Brave API, DuckDuckGo, Startpage, Mojeek) cover
> general web search with stable results and no API rate-limit surprises.

### Brave Search API (`braveapi`)

`braveapi` is the official REST API engine, distinct from the `brave` engine,
which scrapes the public Brave Search HTML. The API engine is more reliable, has
proper rate limiting, and supports paging and time-range filters.

#### Configuration

```yaml
- name: braveapi
  engine: braveapi
  api_key: "{{ searxng_brave_api_key }}"
  results_per_page: 20
  inactive: false
  disabled: false
```

#### `inactive: false` is required

The upstream SearXNG `settings.yml` ships `braveapi` with `inactive: true` and
an empty API key. Because `use_default_settings` does a shallow merge, an
override that only sets `disabled: false` leaves the inherited `inactive: true`
in place, and `inactive` engines are dropped at load time and never registered.
The result is a silent disable: no error appears in the logs, and the engine
never shows up in `/config`.

`disabled` and `inactive` are different gates:

- **`disabled`**: engine still loads; user can toggle it on/off via Preferences.
- **`inactive`**: engine is filtered out at load time; the UI never sees it.

You need both `inactive: false` and `disabled: false` (or omit `disabled` and
let the default `false` apply).

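As a rough illustration of the two gates, the sketch below classifies a merged engine entry. `engine_visibility` is a hypothetical helper, not part of SearXNG, and it assumes that a missing field means `false`.

```python
def engine_visibility(engine: dict) -> str:
    """Classify a merged engine entry by the two gates described above.

    Missing fields fall back to False (not inactive, not disabled).
    """
    if engine.get("inactive", False):
        return "not loaded"  # never reaches /config or Preferences
    if engine.get("disabled", False):
        return "loaded, off by default"  # user can switch it on in Preferences
    return "loaded, on by default"


examples = {
    "inherited inactive": {"name": "braveapi", "inactive": True, "disabled": False},
    "both gates open": {"name": "braveapi", "inactive": False, "disabled": False},
    "disabled only": {"name": "google", "disabled": True},
}
for label, entry in examples.items():
    print(f"{label}: {engine_visibility(entry)}")
```

The first example is the silent-disable case: `disabled: false` is set, yet the engine is never loaded because `inactive` is checked first.
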
#### Endpoint and result handling

The engine implementation (`searx/engines/braveapi.py`) hits a single endpoint:

```
https://api.search.brave.com/res/v1/web/search
```

with the `X-Subscription-Token` header. Although the Brave API can return
multiple result sections (`web`, `news`, `videos`, `discussions`, `infobox`,
`locations`, etc.), the SearXNG engine **only consumes `data["web"]["results"]`**.
Other sections in the response are silently discarded.

This means `braveapi` cannot be split into `braveapi.images` / `braveapi.news`
/ `braveapi.videos` engines the way the HTML-scraper `brave` engine is. To
surface those result types from Brave you'd need to patch the upstream engine
module. For now, with the `brave.*` scrapers disabled, other category-specific
engines fill that role.

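To make the "web section only" behaviour concrete, here is a small sketch of that kind of response handling. It is not the code in `braveapi.py`: `extract_web_results` is a hypothetical helper, the sample payload is made up, and the per-result field names (`title`, `url`, `description`) are assumptions about the response shape.

```python
import json

# Made-up response body shaped like the sections named above; real
# responses carry many more fields.
SAMPLE_RESPONSE = json.dumps({
    "web": {"results": [
        {"title": "Welcome to Python.org", "url": "https://www.python.org/",
         "description": "The official home of the Python language."},
    ]},
    "news": {"results": [{"title": "ignored", "url": "https://example.com/news"}]},
    "videos": {"results": [{"title": "ignored", "url": "https://example.com/video"}]},
})


def extract_web_results(body: str) -> list[dict]:
    """Return simplified results from the `web` section only."""
    data = json.loads(body)
    web_results = data.get("web", {}).get("results", [])
    return [
        {"title": r.get("title", ""), "url": r["url"], "content": r.get("description", "")}
        for r in web_results
    ]


results = extract_web_results(SAMPLE_RESPONSE)
print(len(results), results[0]["url"])
```

The `news` and `videos` entries in the sample never appear in the output, which mirrors why a `braveapi.news` or `braveapi.videos` variant would need changes to the engine module itself.
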
#### Categories

`braveapi` declares `categories = ["general", "web"]` at module level. You don't
need to override this in the YAML.

### Verifying the engine is live

After running `ansible-playbook searxng/deploy.yml` and restarting the container:

```bash
# 1. Engine is loaded and registered
curl -s 'http://rosalind.incus:22089/config' \
  | jq '.engines[] | select(.name=="braveapi")'

# 2. Direct query: bypasses any UI/category filtering.
#    The parentheses matter: without them jq applies .unresponsive_engines
#    to the .results array and fails.
curl -s 'http://rosalind.incus:22089/search?q=python&format=json&engines=braveapi' \
  | jq '(.results | length), .unresponsive_engines'

# 3. Container logs: look for braveapi-specific errors
docker logs searxng 2>&1 | grep -i braveapi
```

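If `jq` is not available, the first check can be reproduced in Python. This is a sketch under assumptions: `find_engine` is a hypothetical helper, and the sample payload is a trimmed, made-up stand-in for the `/config` document (it assumes only a top-level `engines` list whose entries carry a `name`).

```python
import json


def find_engine(config: dict, name: str):
    """Return the entry for `name` from a parsed /config document, or None."""
    return next((e for e in config.get("engines", []) if e.get("name") == name), None)


# Trimmed, made-up /config payload for illustration.
sample_config = json.loads("""
{"engines": [
  {"name": "duckduckgo", "categories": ["general", "web"]},
  {"name": "braveapi", "categories": ["general", "web"]}
]}
""")

entry = find_engine(sample_config, "braveapi")
print("registered" if entry else "missing: check `inactive: false` and redeploy")
print(find_engine(sample_config, "google"))
```

Against the live instance, the same function could be fed the parsed body of `http://rosalind.incus:22089/config`, fetched for example with `urllib.request.urlopen`.
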
## Authentication

SearXNG itself does not authenticate users. All public access goes through an
OAuth2-Proxy sidecar that talks to Casdoor for OIDC. Internal callers
(LobeChat, Argos, etc.) hit `http://rosalind.incus:22089/` directly and bypass
auth.

See [`searxng-auth.md`](./searxng-auth.md) for the full design and Casdoor
application setup.

## Monitoring

### Logs

The container is configured to ship its stdout/stderr to Alloy's syslog
receiver:

```yaml
logging:
  driver: syslog
  options:
    syslog-address: "tcp://127.0.0.1:51403"
    syslog-format: "{{syslog_format}}"
    tag: "searxng"
```

Alloy on `rosalind.incus` forwards these to Loki. Query in Grafana with:

```
{job="searxng", host="rosalind.incus"}
```

### Health check

```bash
curl -fsS http://rosalind.incus:22089/healthz
```

## Operations

### Restart

```bash
ssh rosalind.incus
cd /srv/searxng
docker compose restart
```

### Force pull a newer image

```bash
ssh rosalind.incus
cd /srv/searxng
docker compose pull
docker compose up -d
```

Or just re-run the playbook, since `pull: always` is set on the deploy task.

### Inspect rendered settings inside the container

```bash
ssh rosalind.incus
docker exec searxng cat /etc/searxng/settings.yml | grep -A6 -B1 braveapi
```

## Troubleshooting

### "Brave doesn't work"

1. Confirm the engine is registered: `/config` JSON should include a `braveapi`
   entry. If absent, `inactive: false` is missing or the template didn't deploy.
2. Confirm the API key is non-empty inside the container (see "Inspect rendered
   settings inside the container" above).
3. Hit the engine directly with `&engines=braveapi`. If `unresponsive_engines`
   contains it with a reason, that's your real error (auth, rate limit, network).

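Steps 1 and 3 can be folded into one triage function over already-parsed JSON. This is a sketch, not a supported tool: `diagnose` is a hypothetical helper, and it assumes `unresponsive_engines` is a list of `[engine_name, reason]` pairs. Step 2 (the API key) still has to be checked inside the container.

```python
def diagnose(config: dict, search_response: dict, engine: str = "braveapi") -> str:
    """Walk the registration and responsiveness checks for one engine."""
    names = {e.get("name") for e in config.get("engines", [])}
    if engine not in names:
        return "not registered: check `inactive: false` and that the template deployed"
    # Assumed shape: a list of [engine_name, reason] pairs.
    for name, reason in search_response.get("unresponsive_engines", []):
        if name == engine:
            return f"unresponsive: {reason}"
    if not search_response.get("results"):
        return "registered and responsive, but returned no results"
    return "ok"


# Made-up inputs standing in for /config and /search?format=json output.
config = {"engines": [{"name": "braveapi"}]}
print(diagnose({"engines": []}, {}))
print(diagnose(config, {"results": [], "unresponsive_engines": [["braveapi", "timeout"]]}))
print(diagnose(config, {"results": [{"url": "https://www.python.org/"}],
                        "unresponsive_engines": []}))
```

The ordering matters: an unregistered engine cannot appear in `unresponsive_engines`, so checking registration first avoids misreading an empty result list as an upstream failure.
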
### `radio_browser` / `wikidata` init errors at startup

These are unrelated to your engine config:

- **`radio_browser`**: known cache init-order bug in recent
  `searxng/searxng:latest` images. The SQLite `properties` table isn't created
  before `radio_browser.init()` calls `CACHE.get(...)`. The engine simply stays
  unregistered; other engines work normally. Pinning to an older image tag
  works around it.
- **`wikidata`**: transient. `query.wikidata.org` returned a truncated SPARQL
  response during the startup language-fetch. Restart the container; if it
  persists, Wikidata is probably rate-limiting the source IP.

### Cache appears stale after rotating `vault_searxng_secret_key`

Expected. The secret key is hashed and used as the cache password; on mismatch
SearXNG truncates every cache table at startup. Nothing is lost permanently:
search still works, and the engines rebuild their indexes lazily.

## References

- Upstream docs: <https://docs.searxng.org/>
- Brave Search API engine: <https://docs.searxng.org/dev/engines/online/brave.html>
- Brave Search API reference: [`brave_search_api.md`](./brave_search_api.md)
- SearXNG authentication design: [`searxng-auth.md`](./searxng-auth.md)
- [Ansible Practices](./ansible.md)