# SearXNG
## Overview
SearXNG is a privacy-respecting metasearch engine that aggregates results from
multiple upstream search providers and re-ranks them. The Ouranos deployment runs
as a single Docker container behind an authenticating OAuth2-Proxy sidecar (see
[`searxng-auth.md`](./searxng-auth.md) for the auth design).
- **Host:** `rosalind.incus`
- **Port mapping:** 22089 (host) → 8080 (container)
- **Public URL:** `https://searxng.ouranos.helu.ca/` (via HAProxy → OAuth2-Proxy → SearXNG)
- **Internal URL:** `http://rosalind.incus:22089/` (used by LobeChat, Argos, etc.)
## Ansible Deployment
### Layout
```
ansible/searxng/
├── deploy.yml                       # Main deployment playbook
├── deploy_oauth2.yml                # OAuth2-Proxy sidecar playbook
├── docker-compose.yml.j2            # Docker Compose template
├── searxng-settings.yml.j2          # SearXNG settings.yml template
├── oauth2-proxy-searxng.cfg.j2      # OAuth2-Proxy config (see searxng-auth.md)
└── oauth2-proxy-searxng.service.j2  # Systemd unit for the sidecar
```
### Run
```bash
cd ansible
ansible-playbook searxng/deploy.yml --limit rosalind.incus
ansible-playbook searxng/deploy_oauth2.yml --limit rosalind.incus
```
`deploy.yml`:
1. Skips hosts that don't list `searxng` in their `services` list.
2. Creates the `searxng` system user and `/srv/searxng` directory.
3. Templates `docker-compose.yml` and `searxng-settings.yml` into `/srv/searxng/`.
4. Brings up the container with `community.docker.docker_compose_v2` (`pull: always`).

The container mounts `searxng-settings.yml` read-only at
`/etc/searxng/settings.yml`. There is no persistent volume: the cache lives in
the container's `/tmp` and is rebuilt on restart.
### Variables
#### Host Variables (`inventory/host_vars/rosalind.incus.yml`)
| Variable | Value | Purpose |
|--------------------------|----------------------------------|----------------------------------|
| `searxng_port` | `22089` | Host-side container port |
| `searxng_base_url` | `http://rosalind.incus:22089/` | Used by SearXNG to build URLs |
| `searxng_instance_name` | `Ouranos Search` | Shown in the UI header |
| `searxng_directory` | `/srv/searxng` | Compose project dir on the host |
| `searxng_user`/`group` | `searxng` | Owns templated config files |
| `searxng_syslog_port` | `51403` | Alloy syslog receiver port |
#### Vault Variables (`group_vars/all/vault.yml`)
| Variable | Purpose |
|--------------------------------|------------------------------------------------------------|
| `vault_searxng_secret_key` | `server.secret_key` — also used as cache DB password |
| `vault_searxng_brave_api_key` | Brave Search API subscription token (see below) |
| `vault_searxng_oauth_*` | OAuth2-Proxy sidecar — see `searxng-auth.md` |
> ⚠️ **Changing `vault_searxng_secret_key` truncates the cache.** SearXNG hashes
> cache keys with the secret key; on mismatch it drops every cache table on next
> startup. Harmless, but be aware that engines like `wikidata` and
> `radio_browser` will need to re-fetch their on-disk indexes.
## Search Engine Configuration
The engine list is templated in `searxng-settings.yml.j2` and merges with the
upstream defaults via `use_default_settings: true`. The merge is keyed by engine
`name` and is shallow — **only fields you explicitly set override the
defaults**, everything else (including hidden ones like `inactive`) is inherited.
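To make the merge semantics concrete, here is a minimal Python sketch of a shallow, name-keyed merge. It illustrates the behaviour described above and is not SearXNG's actual loader; `merge_engines`, `example_engine`, and the sample field values are hypothetical.

```python
def merge_engines(defaults, overrides):
    """Shallow-merge engine overrides into defaults, keyed by engine name."""
    merged = {engine["name"]: dict(engine) for engine in defaults}
    for override in overrides:
        # Only keys present in the override replace the inherited values;
        # an unknown name simply becomes a new entry.
        merged.setdefault(override["name"], {}).update(override)
    return list(merged.values())


# Hypothetical upstream default with a "hidden" field set.
defaults = [{"name": "example_engine", "engine": "example", "inactive": True, "timeout": 3.0}]

# The override only touches `timeout`, so `inactive: True` is inherited.
partial = merge_engines(defaults, [{"name": "example_engine", "timeout": 5.0}])
print(partial[0]["inactive"], partial[0]["timeout"])  # True 5.0

# Setting the field explicitly is the only way to change it.
full = merge_engines(defaults, [{"name": "example_engine", "inactive": False}])
print(full[0]["inactive"], full[0]["timeout"])  # False 3.0
```

The practical consequence is that an override has to name every inherited field it wants to change, including ones that never appear in the local template.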
### Enabled engines
| Engine | Notes |
|--------------|----------------------------------------------------|
| `duckduckgo` | General web |
| `startpage` | General web |
| `mojeek` | General web |
| `braveapi` | Brave Search via official REST API (see below) |
### Disabled engines
| Engine | Reason |
|--------------------------------|------------------------------------------------------------|
| `google` | Aggressive bot detection / unstable scraping results |
| `bing news` | Frequent parsing errors |
| `brave` (HTML scraper) | Replaced by `braveapi` — keeping both duplicates results |
| `brave.images` / `.videos` / `.news` | Scraping endpoints return 451 / access-denied |
| `duckduckgo images` | Suspended / access-denied responses |
| `pexels`, `vimeo` | Same — suspended / access-denied |
> **Why disable Google and Bing News?** Google's HTML scraper is
> blocked aggressively and produces low-quality, inconsistent results. Bing's
> news scraper hits parser failures often enough to be more noise than signal.
> The remaining four engines (Brave API, DuckDuckGo, Startpage, Mojeek) cover
> general web search with stable results and no API rate-limit surprises.
### Brave Search API (`braveapi`)
`braveapi` is the official REST API engine — distinct from the `brave` engine,
which scrapes the public Brave Search HTML. The API engine is more reliable, has
proper rate limiting, and supports paging and time-range filters.
#### Configuration
```yaml
- name: braveapi
  engine: braveapi
  api_key: "{{ searxng_brave_api_key }}"
  results_per_page: 20
  inactive: false
  disabled: false
```
#### `inactive: false` is required
The upstream SearXNG `settings.yml` ships `braveapi` with `inactive: true` and
an empty API key. Because `use_default_settings` does a shallow merge, an
override that only sets `disabled: false` leaves the inherited `inactive: true`
in place — and `inactive` engines are filtered out before `load_engine()` runs.
The result is a silent disable: no error appears in the logs, and the engine
never shows up in `/config`.

`disabled` and `inactive` are different gates:

- **`disabled`**: the engine still loads; the user can toggle it on/off via Preferences.
- **`inactive`**: the engine is filtered out before loading; the UI never sees it.

You need both `inactive: false` and `disabled: false` (or omit `disabled` and
let the default `false` apply).
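The two gates can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not SearXNG's code: `loadable_engines` and `enabled_by_default` are hypothetical helpers, the two extra engine names are invented, and treating a missing `inactive` as `false` is an assumption.

```python
def loadable_engines(engine_settings):
    """First gate: inactive engines are dropped before anything is loaded."""
    return [e for e in engine_settings if not e.get("inactive", False)]


def enabled_by_default(engine_settings):
    """Second gate: among loaded engines, those switched on without user action."""
    return [
        e["name"]
        for e in loadable_engines(engine_settings)
        if not e.get("disabled", False)
    ]


settings = [
    {"name": "braveapi", "inactive": False, "disabled": False},
    {"name": "loaded_but_off", "disabled": True},  # hypothetical: loads, user can enable it
    {"name": "never_loaded", "inactive": True},    # hypothetical: filtered out entirely
]

print([e["name"] for e in loadable_engines(settings)])  # ['braveapi', 'loaded_but_off']
print(enabled_by_default(settings))                     # ['braveapi']
```

Only the engines returned by the first function would ever be visible in `/config` or in Preferences.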
#### Endpoint and result handling
The engine implementation (`searx/engines/braveapi.py`) hits a single endpoint:
```
https://api.search.brave.com/res/v1/web/search
```
with the `X-Subscription-Token` header. Although the Brave API can return
multiple result sections (`web`, `news`, `videos`, `discussions`, `infobox`,
`locations`, etc.), the SearXNG engine **only consumes `data["web"]["results"]`**.
Other sections in the response are silently discarded.

This means `braveapi` cannot be split into `braveapi.images` / `braveapi.news`
/ `braveapi.videos` engines the way the HTML-scraper `brave` engine is. To
surface those result types from Brave you'd need to patch the upstream engine
module. For now, with the `brave.*` scrapers disabled, other category-specific
engines fill that role.
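The effect of consuming only `data["web"]["results"]` can be shown with a small Python sketch. It is not the upstream `braveapi.py` code: `extract_web_results` is a hypothetical helper, the sample response is heavily trimmed, and the `url` / `title` / `description` field names are assumptions about the response shape.

```python
def extract_web_results(data):
    """Return simplified results from the `web` section only."""
    return [
        {
            "url": item.get("url"),
            "title": item.get("title"),
            "content": item.get("description"),
        }
        for item in data.get("web", {}).get("results", [])
    ]


# Hypothetical response containing two sections.
response = {
    "web": {
        "results": [
            {"url": "https://example.org/", "title": "Example", "description": "A page"}
        ]
    },
    "news": {"results": [{"url": "https://example.org/news", "title": "Headline"}]},
}

results = extract_web_results(response)
print(len(results))       # 1, because the news section is never read
print(results[0]["url"])  # https://example.org/
```

Anything outside the `web` key never reaches the result list, which is why category-specific variants cannot be configured from YAML alone.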
#### Categories
`braveapi` declares `categories = ["general", "web"]` at module level. You don't
need to override this in the YAML.
### Verifying the engine is live
After `ansible-playbook searxng/deploy.yml` and a container restart:
```bash
# 1. Engine is loaded and registered
curl -s 'http://rosalind.incus:22089/config' \
  | jq '.engines[] | select(.name=="braveapi")'

# 2. Direct query: bypasses any UI/category filtering
curl -s 'http://rosalind.incus:22089/search?q=python&format=json&engines=braveapi' \
  | jq '.results | length, .unresponsive_engines'

# 3. Container logs: look for braveapi-specific errors
docker logs searxng 2>&1 | grep -i braveapi
```
## Authentication
SearXNG itself does not authenticate users. All public access goes through an
OAuth2-Proxy sidecar that talks to Casdoor for OIDC. Internal callers
(LobeChat, Argos, etc.) hit `http://rosalind.incus:22089/` directly and bypass
auth.

See [`searxng-auth.md`](./searxng-auth.md) for the full design and Casdoor
application setup.
## Monitoring
### Logs
The container is configured to ship its stdout/stderr to Alloy's syslog
receiver:
```yaml
logging:
  driver: syslog
  options:
    syslog-address: "tcp://127.0.0.1:51403"
    syslog-format: "{{syslog_format}}"
    tag: "searxng"
```
Alloy on `rosalind.incus` forwards these to Loki. Query in Grafana with:
```
{job="searxng", host="rosalind.incus"}
```
### Health check
```bash
curl -fsS http://rosalind.incus:22089/healthz
```
## Operations
### Restart
```bash
ssh rosalind.incus
cd /srv/searxng
docker compose restart
```
### Force pull a newer image
```bash
ssh rosalind.incus
cd /srv/searxng
docker compose pull
docker compose up -d
```
Or just re-run the playbook — `pull: always` is set on the deploy task.
### Inspect rendered settings inside the container
```bash
ssh rosalind.incus
docker exec searxng cat /etc/searxng/settings.yml | grep -A6 -B1 braveapi
```
## Troubleshooting
### "Brave doesn't work"
1. Confirm the engine is registered: `/config` JSON should include a `braveapi`
   entry. If absent, `inactive: false` is missing or the template didn't deploy.
2. Confirm the API key is non-empty inside the container; see "Inspect rendered
   settings" above.
3. Hit the engine directly with `&engines=braveapi`. If `unresponsive_engines`
   contains it with a reason, that's your real error (auth, rate limit, network).
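Steps 1 and 3 above can be scripted. The following Python sketch is one possible triage helper, not part of the deployment: `fetch_json` and `diagnose` are hypothetical names, the `"HTTP error"` reason is made-up sample data, and the code assumes `unresponsive_engines` is a list of `[engine_name, reason]` pairs. Step 2 (the API key) still has to be checked inside the container.

```python
import json
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode its JSON body (not called in the demo below)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def diagnose(config, search_response, engine="braveapi"):
    """Classify the engine state from /config and /search JSON payloads."""
    names = {e.get("name") for e in config.get("engines", [])}
    if engine not in names:
        return "not registered: check `inactive: false` and that the template deployed"
    for entry in search_response.get("unresponsive_engines", []):
        # Assumed shape: [engine_name, reason].
        if entry and entry[0] == engine:
            reason = entry[1] if len(entry) > 1 else "unknown"
            return f"unresponsive: {reason}"
    count = len(search_response.get("results", []))
    return f"ok: {count} results"


# Offline demo with hand-written payloads.
config = {"engines": [{"name": "duckduckgo"}, {"name": "braveapi"}]}
missing = diagnose({"engines": [{"name": "duckduckgo"}]}, {})
failing = diagnose(config, {"results": [], "unresponsive_engines": [["braveapi", "HTTP error"]]})
healthy = diagnose(config, {"results": [{}, {}], "unresponsive_engines": []})
print(missing, failing, healthy, sep="\n")
```

Against a live instance, the two payloads would come from `fetch_json("http://rosalind.incus:22089/config")` and `fetch_json("http://rosalind.incus:22089/search?q=python&format=json&engines=braveapi")`.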
### `radio_browser` / `wikidata` init errors at startup
These are unrelated to your engine config:
- **`radio_browser`**: known cache init-order bug in recent
  `searxng/searxng:latest` images. The SQLite `properties` table isn't created
  before `radio_browser.init()` calls `CACHE.get(...)`. The engine simply stays
  unregistered; other engines work normally. Pinning to an older image tag
  works around it.
- **`wikidata`**: transient. `query.wikidata.org` returned a truncated SPARQL
  response during the startup language-fetch. Restart the container; if it
  persists, Wikidata is rate-limiting the source IP.
### Cache appears stale after rotating `vault_searxng_secret_key`
Expected. The secret key is hashed and used as the cache password; on mismatch
SearXNG truncates every cache table at startup. No data loss — search still
works, the engines just rebuild their indexes lazily.
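The truncate-on-mismatch behaviour can be modelled with a short Python sketch. This is an illustrative model only, not SearXNG's cache implementation: `open_cache`, the `cache` table, the `key_hash` property name, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import sqlite3


def open_cache(conn, secret_key):
    """Empty the cache when the stored key hash differs from the current one."""
    current = hashlib.sha256(secret_key.encode()).hexdigest()
    conn.execute("CREATE TABLE IF NOT EXISTS properties (name TEXT PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")
    row = conn.execute("SELECT value FROM properties WHERE name = 'key_hash'").fetchone()
    truncated = row is not None and row[0] != current
    if truncated:
        # Entries written under the old key are discarded.
        conn.execute("DELETE FROM cache")
    conn.execute(
        "INSERT OR REPLACE INTO properties (name, value) VALUES ('key_hash', ?)",
        (current,),
    )
    conn.commit()
    return truncated


conn = sqlite3.connect(":memory:")
open_cache(conn, "old-secret")
conn.execute("INSERT INTO cache VALUES ('wikidata:langs', 'cached')")

same_key = open_cache(conn, "old-secret")     # restart with the same key
rotated_key = open_cache(conn, "new-secret")  # restart after rotating the key
remaining = conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
print(same_key, rotated_key, remaining)  # False True 0
```

In this model a restart with an unchanged key keeps the cached rows, while a rotated key empties the table and the engines repopulate it on demand.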
## References
- Upstream docs: <https://docs.searxng.org/>
- Brave Search API engine: <https://docs.searxng.org/dev/engines/online/brave.html>
- Brave Search API reference: [`brave_search_api.md`](./brave_search_api.md)
- SearXNG authentication design: [`searxng-auth.md`](./searxng-auth.md)
- [Ansible Practices](./ansible.md)