Multimodal embedding and reranking service powered by Qwen3-VL-2B. Supports text, image, and mixed-modal inputs over a simple REST API.
Generate dense vector representations for text, images, or both. Vectors are suitable for semantic search, retrieval, clustering, and classification.
POST /v1/embeddings
Given a query and a list of candidate documents, score and sort them by relevance. Use after an initial retrieval step to improve precision.
POST /v1/rerank
Convenience endpoint to compute cosine similarity between two inputs without managing vectors yourself.
POST /v1/similarity
Full request/response schemas, try-it-out functionality, and auto-generated curl examples are available at http://<host>:8400/docs (Swagger UI). Use it to experiment with every endpoint interactively.
All endpoints are served from a single base URL. Configure this in your consuming application:
http://<synesis-host>:8400
Default port is 8400. No authentication is required (secure via network policy / firewall).
Synesis loads two Qwen3-VL-2B models into GPU memory at startup: one for embeddings and one for reranking. Both share the same NVIDIA 3090 (24 GB VRAM).
Generate dense vector embeddings for one or more inputs. Each input can be text, an image, or both (multimodal).
| Field | Type | Required | Description |
|---|---|---|---|
| `inputs` | array | Yes | List of items to embed (1 to `max_batch_size`). |
| `inputs[].text` | string | * | Text content. At least one of `text` or `image` is required. |
| `inputs[].image` | string | * | Image file path or URL. At least one of `text` or `image` is required. |
| `inputs[].instruction` | string | No | Optional task instruction to guide embedding (e.g. "Represent this document for retrieval"). |
| `dimension` | int | No | Output vector dimension (64–2048). Default: 2048. See Dimensions. |
| `normalize` | bool | No | L2-normalize output vectors. Default: `true`. |
| Field | Type | Description |
|---|---|---|
| `embeddings[]` | array | One embedding per input, in order. |
| `embeddings[].index` | int | Position in the input array. |
| `embeddings[].embedding` | float[] | The dense vector (length = `dimension`). |
| `usage.input_count` | int | Number of inputs processed. |
| `usage.dimension` | int | Dimension of returned vectors. |
| `usage.elapsed_ms` | float | Server-side processing time in milliseconds. |
{
"inputs": [
{"text": "quantum computing basics"},
{"text": "machine learning tutorial"}
]
}
{
"inputs": [
{"image": "/data/photos/cat.jpg"},
{"image": "https://example.com/dog.png"}
]
}
{
"inputs": [
{
"text": "product photo",
"image": "/data/products/shoe.jpg"
}
]
}
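An illustrative (abbreviated) response to the first text-only request above. Vector values and timings are made up; the `...` marks truncation — each `embedding` actually has `dimension` entries:

```json
{
  "embeddings": [
    {"index": 0, "embedding": [0.0123, -0.0456, ...]},
    {"index": 1, "embedding": [0.0789, 0.0011, ...]}
  ],
  "usage": {"input_count": 2, "dimension": 2048, "elapsed_ms": 41.7}
}
```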
Score and rank a list of candidate documents against a query. Returns documents sorted by relevance (highest score first).
| Field | Type | Required | Description |
|---|---|---|---|
| `query` | object | Yes | The query to rank against. Must contain `text`, `image`, or both. |
| `query.text` | string | * | Query text. At least one of `text` or `image` required. |
| `query.image` | string | * | Query image path or URL. |
| `documents` | array | Yes | Candidate documents to rerank (1 to `max_batch_size`). |
| `documents[].text` | string | * | Document text. At least one of `text` or `image` required per document. |
| `documents[].image` | string | * | Document image path or URL. |
| `instruction` | string | No | Task instruction (e.g. "Retrieve images relevant to the query."). |
| `top_n` | int | No | Return only the top N results. Default: return all. |
| Field | Type | Description |
|---|---|---|
| `results[]` | array | Documents sorted by relevance score (descending). |
| `results[].index` | int | Original position of this document in the input array. |
| `results[].score` | float | Relevance score (higher = more relevant). |
| `results[].document` | object | The document that was ranked (echoed back). |
| `usage.query_count` | int | Always 1. |
| `usage.document_count` | int | Total documents scored. |
| `usage.returned_count` | int | Number of results returned (respects `top_n`). |
| `usage.elapsed_ms` | float | Server-side processing time in milliseconds. |
{
"query": {"text": "How do neural networks learn?"},
"documents": [
{"text": "Neural networks adjust weights through backpropagation..."},
{"text": "The stock market experienced a downturn in Q3..."},
{"text": "Deep learning uses gradient descent to minimize loss..."},
{"text": "Photosynthesis converts sunlight into chemical energy..."}
],
"top_n": 2
}
{
"query": {"text": "melancholy album artwork"},
"documents": [
{"image": "/data/covers/cover1.jpg"},
{"image": "/data/covers/cover2.jpg"},
{"text": "dark moody painting", "image": "/data/covers/cover3.jpg"}
],
"instruction": "Retrieve images relevant to the query.",
"top_n": 2
}
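An illustrative response to the first (text-only) rerank example above. Scores and timings are made up; note that `results[].index` refers to positions in the original `documents` array, and `returned_count` reflects `top_n`:

```json
{
  "results": [
    {"index": 0, "score": 0.87, "document": {"text": "Neural networks adjust weights through backpropagation..."}},
    {"index": 2, "score": 0.81, "document": {"text": "Deep learning uses gradient descent to minimize loss..."}}
  ],
  "usage": {"query_count": 1, "document_count": 4, "returned_count": 2, "elapsed_ms": 120.3}
}
```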
Synesis uses Matryoshka Representation Learning (MRL). The model always computes full 2048-dimensional vectors internally, then truncates to your requested dimension. This means you can choose a dimension that balances quality vs. storage/speed.
| Dimension | Vector Size | Quality | Use Case |
|---|---|---|---|
| 2048 (default) | 8 KB / vector (float32) | Maximum | Highest-accuracy retrieval, small collections |
| 1024 | 4 KB / vector | Very high | Good balance for most production systems |
| 512 | 2 KB / vector | High | Large-scale search with reasonable quality |
| 256 | 1 KB / vector | Good | Very large collections, cost-sensitive |
| 128 | 512 B / vector | Moderate | Rough filtering, pre-screening |
| 64 | 256 B / vector | Basic | Coarse clustering, topic grouping |
All vectors in the same index/collection must use the same dimension. Choose a dimension at index creation time and use it consistently for both indexing and querying. You cannot mix 512-d and 1024-d vectors in the same vector database index.
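Because MRL embeddings are truncation-compatible, a client that already holds full 2048-d normalized vectors can approximate a lower-dimensional comparison by truncating and re-normalizing locally. A minimal sketch of that operation (illustrative only — for indexing, request the target `dimension` from the server so stored and query vectors match exactly):

```python
import numpy as np

def truncate_mrl(vec, dim):
    """Truncate a full-length MRL embedding to `dim` entries and L2-renormalize."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# A random unit vector stands in for a 2048-d server embedding.
rng = np.random.default_rng(0)
full = rng.normal(size=2048).astype(np.float32)
full /= np.linalg.norm(full)

short = truncate_mrl(full, 512)
print(len(short))  # 512
```

The truncated vector is again unit-length, so dot products remain valid cosine similarities at the smaller dimension.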
The max_batch_size setting (default: 32) controls the maximum number of inputs per API call. This is tuned for the 3090's 24 GB VRAM.
When processing large datasets (thousands of documents), do not send all items in a single request. Instead, implement client-side microbatching: split the dataset into chunks of at most `max_batch_size` items and send each chunk as a separate `/v1/embeddings` request.

The reranker also respects `max_batch_size` for the number of candidate documents. If you have more than 32 candidates, either pre-filter with embeddings first (recommended) or split into multiple rerank calls and merge the results.
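The microbatching described above can be sketched as follows. `embed_in_batches` is a hypothetical client helper, not part of Synesis; the `post` callable is injected so the sketch stays self-contained — in real use it would wrap `requests.post(f"{base_url}/v1/embeddings", json=payload).json()`:

```python
def embed_in_batches(inputs, post, max_batch_size=32, dimension=1024):
    """Embed `inputs` in chunks of <= max_batch_size via the injected `post`.

    `post(payload)` should send the payload to /v1/embeddings and return the
    parsed JSON response. Vectors are concatenated in original input order.
    """
    vectors = []
    for start in range(0, len(inputs), max_batch_size):
        chunk = inputs[start:start + max_batch_size]
        data = post({"inputs": chunk, "dimension": dimension})
        vectors.extend(e["embedding"] for e in data["embeddings"])
    return vectors

# Stub transport for illustration: returns a zero vector per input.
def fake_post(payload):
    return {"embeddings": [
        {"index": i, "embedding": [0.0] * payload["dimension"]}
        for i in range(len(payload["inputs"]))
    ]}

docs = [{"text": f"doc {i}"} for i in range(70)]
vecs = embed_in_batches(docs, fake_post, max_batch_size=32)
print(len(vecs))  # 70 (sent as chunks of 32, 32, 6)
```

The same chunk-and-merge pattern applies to oversized rerank calls, except that per-chunk scores must be merged and re-sorted before applying `top_n`.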
To integrate Synesis into another system, configure these settings:
| Setting | Value | Notes |
|---|---|---|
| Embedding API URL | `http://<host>:8400/v1/embeddings` | POST, JSON body |
| Rerank API URL | `http://<host>:8400/v1/rerank` | POST, JSON body |
| Health check URL | `http://<host>:8400/ready/` | GET, 200 = ready |
| Embedding dimension | 2048 (or your chosen value) | Must match vector DB index config |
| Authentication | None | Secure via network policy |
| Content-Type | `application/json` | All endpoints |
| Timeout | 30–60 seconds | Image inputs take longer; adjust for batch size |
import requests
SYNESIS_URL = "http://synesis-host:8400"
# --- Generate embeddings ---
resp = requests.post(f"{SYNESIS_URL}/v1/embeddings", json={
"inputs": [
{"text": "How to train a neural network"},
{"text": "Best practices for deep learning"},
],
"dimension": 1024,
})
data = resp.json()
vectors = [e["embedding"] for e in data["embeddings"]]
# vectors[0] is a list of 1024 floats
# --- Rerank candidates ---
resp = requests.post(f"{SYNESIS_URL}/v1/rerank", json={
"query": {"text": "neural network training"},
"documents": [
{"text": "Backpropagation adjusts weights using gradients..."},
{"text": "The weather forecast for tomorrow is sunny..."},
{"text": "Stochastic gradient descent is an optimization method..."},
],
"top_n": 2,
})
ranked = resp.json()
for result in ranked["results"]:
print(f" #{result['index']} score={result['score']:.4f}")
print(f" {result['document']['text'][:80]}")
A typical retrieval pipeline combines both endpoints:

1. Embed your corpus with `/v1/embeddings` and store the vectors in your vector database (e.g. pgvector, Qdrant, Milvus, Weaviate).
2. At query time, embed the query with `/v1/embeddings`, then perform approximate nearest neighbour (ANN) search in the vector DB to retrieve the top 20–50 candidates.
3. Pass the query and candidates to `/v1/rerank` to get precise relevance scores. Return the top 5–10 to the user or LLM context.

Compute cosine similarity between exactly two inputs. A convenience wrapper: it embeds both, normalizes, and returns the dot product.
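Per the description above, `/v1/similarity` is equivalent to embedding both inputs and taking the dot product of the L2-normalized vectors. A client-side sketch of that computation, assuming you already hold the two embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the vectors' L2 norms.
    For already-normalized vectors this reduces to the plain dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

The endpoint is convenient for one-off comparisons; for repeated comparisons against the same item, embed once via `/v1/embeddings` and reuse the vector.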
| Field | Type | Required | Description |
|---|---|---|---|
| `a` | object | Yes | First input (`text`, `image`, or both). |
| `b` | object | Yes | Second input (`text`, `image`, or both). |
| `dimension` | int | No | Embedding dimension for comparison (64–2048). Default: 2048. |
| Field | Type | Description |
|---|---|---|
| `score` | float | Cosine similarity (−1.0 to 1.0). Higher = more similar. |
| `dimension` | int | Dimension used for the comparison. |
| Endpoint | Method | Purpose |
|---|---|---|
| `/ready/` | GET | Readiness probe. Returns 200 when both models are loaded and the GPU is available; 503 otherwise. Use for load balancer health checks. |
| `/live/` | GET | Liveness probe. Returns 200 if the process is alive. Use for container restart decisions. |
| `/health` | GET | Detailed status: model paths, loaded state, GPU device name, VRAM usage. |
| `/models/v1/models` | GET | List available models (OpenAI-compatible). Returns model IDs, capabilities, and metadata. Used by OpenAI SDK clients for model discovery. |
| `/metrics` | GET | Prometheus metrics (request counts, latency histograms, GPU memory, model status). |
Key custom metrics exposed:
- `embedding_model_loaded` — Gauge (1 = loaded)
- `reranker_model_loaded` — Gauge (1 = loaded)
- `embedding_gpu_memory_bytes` — Gauge (current GPU allocation)
- `embedding_inference_requests_total{endpoint}` — Counter per endpoint (embeddings, similarity, rerank)
- `embedding_inference_duration_seconds{endpoint}` — Histogram of inference latency

Standard HTTP request metrics are provided by `prometheus-fastapi-instrumentator`.

All settings use the `EMBEDDING_` prefix and can be overridden via environment variables or `/etc/default/synesis`:
| Variable | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL_PATH` | `./models/Qwen3-VL-Embedding-2B` | Path to embedding model weights |
| `EMBEDDING_RERANKER_MODEL_PATH` | `./models/Qwen3-VL-Reranker-2B` | Path to reranker model weights |
| `EMBEDDING_TORCH_DTYPE` | `float16` | Model precision (`float16` or `bfloat16`) |
| `EMBEDDING_USE_FLASH_ATTENTION` | `true` | Enable Flash Attention 2 |
| `EMBEDDING_DEFAULT_DIMENSION` | `2048` | Default embedding dimension when not specified per request |
| `EMBEDDING_MAX_BATCH_SIZE` | `32` | Maximum inputs per request (both embeddings and rerank) |
| `EMBEDDING_HOST` | `0.0.0.0` | Bind address |
| `EMBEDDING_PORT` | `8400` | Listen port |
| Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process the response. |
| 422 | Validation error | Check your request body. Batch size may exceed `max_batch_size`, or required fields are missing. |
| 500 | Inference error | Model failed during processing. Check server logs. May indicate OOM with large image batches. |
| 503 | Model not loaded | Service is starting up or a model failed to load. Retry after checking `/ready/`. |
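The table above suggests a simple client policy: retry on 503 (the service is still loading), treat 422 as a caller bug, and surface 500 to operators. A hypothetical helper encoding that policy (the function and its labels are ours, not part of Synesis):

```python
RETRYABLE = {503}      # model still loading: poll /ready/, then retry
CALLER_ERRORS = {422}  # fix the request body; retrying will not help

def classify(status_code):
    """Map a Synesis HTTP status code to a client-side action."""
    if status_code == 200:
        return "ok"
    if status_code in RETRYABLE:
        return "retry"
    if status_code in CALLER_ERRORS:
        return "fix-request"
    return "report"  # e.g. 500: check server logs / possible OOM

print(classify(503))  # retry
print(classify(422))  # fix-request
```

Pair the "retry" branch with a bounded backoff (e.g. a few attempts a few seconds apart) rather than retrying indefinitely.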
Synesis v0.2.0 — Qwen3-VL Embedding & Reranking Service. For interactive API exploration, visit /docs on the running service.