Synesis — API Usage Guide

Multimodal embedding and reranking service powered by Qwen3-VL-2B. Supports text, image, and mixed-modal inputs over a simple REST API.

Overview

Embeddings

Generate dense vector representations for text, images, or both. Vectors are suitable for semantic search, retrieval, clustering, and classification.

POST /v1/embeddings

Reranking

Given a query and a list of candidate documents, score and sort them by relevance. Use after an initial retrieval step to improve precision.

POST /v1/rerank

Similarity

Convenience endpoint to compute cosine similarity between two inputs without managing vectors yourself.

POST /v1/similarity

Interactive API Explorer

Full request/response schemas, try-it-out functionality, and auto-generated curl examples are available at http://<host>:8400/docs (Swagger UI). Use it to experiment with every endpoint interactively.

Base URL

All endpoints are served from a single base URL. Configure this in your consuming application:

http://<synesis-host>:8400

Default port is 8400. No authentication is required (secure via network policy / firewall).

Architecture

Service Architecture

Synesis loads two Qwen3-VL-2B models into GPU memory at startup: one for embeddings and one for reranking. Both share the same NVIDIA 3090 (24 GB VRAM).

Request Flow

graph LR
    Client["Client Application"] -->|HTTP POST| FastAPI["FastAPI :8400"]
    FastAPI -->|/v1/embeddings| Embedder["Qwen3-VL Embedder 2B"]
    FastAPI -->|/v1/rerank| Reranker["Qwen3-VL Reranker 2B"]
    FastAPI -->|/v1/similarity| Embedder
    Embedder --> GPU["NVIDIA 3090 24 GB VRAM"]
    Reranker --> GPU
    FastAPI -->|/metrics| Prometheus["Prometheus"]

Typical RAG Integration

sequenceDiagram
    participant App as Your Application
    participant Synesis as Synesis API
    participant VDB as Vector Database
    Note over App: Indexing Phase
    App->>Synesis: POST /v1/embeddings (documents)
    Synesis-->>App: embedding vectors
    App->>VDB: Store vectors + metadata
    Note over App: Query Phase
    App->>Synesis: POST /v1/embeddings (query)
    Synesis-->>App: query vector
    App->>VDB: ANN search (top 50)
    VDB-->>App: candidate documents
    App->>Synesis: POST /v1/rerank (query + candidates)
    Synesis-->>App: ranked results with scores
    App->>App: Use top 5-10 results

Embeddings API

POST /v1/embeddings

Generate dense vector embeddings for one or more inputs. Each input can be text, an image, or both (multimodal).

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| inputs | array | Yes | List of items to embed (1 to max_batch_size). |
| inputs[].text | string | * | Text content. At least one of text or image is required. |
| inputs[].image | string | * | Image file path or URL. At least one of text or image is required. |
| inputs[].instruction | string | No | Optional task instruction to guide embedding (e.g. "Represent this document for retrieval"). |
| dimension | int | No | Output vector dimension (64–2048). Default: 2048. See Dimensions. |
| normalize | bool | No | L2-normalize output vectors. Default: true. |

Response Body

| Field | Type | Description |
|---|---|---|
| embeddings[] | array | One embedding per input, in order. |
| embeddings[].index | int | Position in the input array. |
| embeddings[].embedding | float[] | The dense vector (length = dimension). |
| usage.input_count | int | Number of inputs processed. |
| usage.dimension | int | Dimension of returned vectors. |
| usage.elapsed_ms | float | Server-side processing time in milliseconds. |

Input Modalities

Text Only

{
  "inputs": [
    {"text": "quantum computing basics"},
    {"text": "machine learning tutorial"}
  ]
}

Image Only

{
  "inputs": [
    {"image": "/data/photos/cat.jpg"},
    {"image": "https://example.com/dog.png"}
  ]
}

Multimodal

{
  "inputs": [
    {
      "text": "product photo",
      "image": "/data/products/shoe.jpg"
    }
  ]
}
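The "at least one of text or image" rule from the request schema can be enforced client-side before sending a batch. A minimal sketch; `validate_input` is a hypothetical helper, not part of the API:

```python
def validate_input(item: dict) -> None:
    """Enforce the request-schema rule: each inputs[] item must carry
    at least one of 'text' or 'image' (both together is multimodal)."""
    if not item.get("text") and not item.get("image"):
        raise ValueError("each input needs 'text', 'image', or both")

# All three modality shapes from the examples above pass:
validate_input({"text": "quantum computing basics"})                           # text only
validate_input({"image": "/data/photos/cat.jpg"})                              # image only
validate_input({"text": "product photo", "image": "/data/products/shoe.jpg"})  # multimodal
```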

Reranking API

POST /v1/rerank

Score and rank a list of candidate documents against a query. Returns documents sorted by relevance (highest score first).

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| query | object | Yes | The query to rank against. Must contain text, image, or both. |
| query.text | string | * | Query text. At least one of text or image required. |
| query.image | string | * | Query image path or URL. |
| documents | array | Yes | Candidate documents to rerank (1 to max_batch_size). |
| documents[].text | string | * | Document text. At least one of text or image required per document. |
| documents[].image | string | * | Document image path or URL. |
| instruction | string | No | Task instruction (e.g. "Retrieve images relevant to the query."). |
| top_n | int | No | Return only the top N results. Default: return all. |

Response Body

| Field | Type | Description |
|---|---|---|
| results[] | array | Documents sorted by relevance score (descending). |
| results[].index | int | Original position of this document in the input array. |
| results[].score | float | Relevance score (higher = more relevant). |
| results[].document | object | The document that was ranked (echoed back). |
| usage.query_count | int | Always 1. |
| usage.document_count | int | Total documents scored. |
| usage.returned_count | int | Number of results returned (respects top_n). |
| usage.elapsed_ms | float | Server-side processing time in milliseconds. |

Example: Text Query → Text Documents

{
  "query": {"text": "How do neural networks learn?"},
  "documents": [
    {"text": "Neural networks adjust weights through backpropagation..."},
    {"text": "The stock market experienced a downturn in Q3..."},
    {"text": "Deep learning uses gradient descent to minimize loss..."},
    {"text": "Photosynthesis converts sunlight into chemical energy..."}
  ],
  "top_n": 2
}

Example: Text Query → Image Documents

{
  "query": {"text": "melancholy album artwork"},
  "documents": [
    {"image": "/data/covers/cover1.jpg"},
    {"image": "/data/covers/cover2.jpg"},
    {"text": "dark moody painting", "image": "/data/covers/cover3.jpg"}
  ],
  "instruction": "Retrieve images relevant to the query.",
  "top_n": 2
}

Dimensions, Batches & Performance

Matryoshka Dimension Truncation

Synesis uses Matryoshka Representation Learning (MRL). The model always computes full 2048-dimensional vectors internally, then truncates to your requested dimension. This means you can choose a dimension that balances quality vs. storage/speed.

| Dimension | Vector Size | Quality | Use Case |
|---|---|---|---|
| 2048 (default) | 8 KB / vector (float32) | Maximum | Highest accuracy retrieval, small collections |
| 1024 | 4 KB / vector | Very high | Good balance for most production systems |
| 512 | 2 KB / vector | High | Large-scale search with reasonable quality |
| 256 | 1 KB / vector | Good | Very large collections, cost-sensitive |
| 128 | 512 B / vector | Moderate | Rough filtering, pre-screening |
| 64 | 256 B / vector | Basic | Coarse clustering, topic grouping |
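The "Vector Size" column is simply dimension × 4 bytes for float32 storage. A small sketch (helper names are illustrative) to project raw index size for a collection before choosing a dimension:

```python
def vector_bytes(dimension: int, dtype_bytes: int = 4) -> int:
    """Raw size of one float32 vector at the given Matryoshka dimension."""
    return dimension * dtype_bytes

def collection_bytes(n_vectors: int, dimension: int) -> int:
    """Raw storage for a whole collection (excludes vector-DB index overhead)."""
    return n_vectors * vector_bytes(dimension)

print(vector_bytes(2048))                        # 8192 bytes = 8 KB, matching the table
print(collection_bytes(1_000_000, 1024) / 1e9)   # ~4.1 GB for 1M vectors at 1024-d
```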

Important: Consistency

All vectors in the same index/collection must use the same dimension. Choose a dimension at index creation time and use it consistently for both indexing and querying. You cannot mix 512-d and 1024-d vectors in the same vector database index.

Batch Size & Microbatching

The max_batch_size setting (default: 32) controls the maximum number of inputs per API call. This is tuned for the 3090's 24 GB VRAM.

  • Text-only inputs: Batch sizes up to 32 are safe.
  • Image inputs: Images consume significantly more VRAM. Reduce batch sizes to 8–16 when embedding images, depending on resolution.
  • Mixed-modal inputs: Treat as image batches for sizing purposes.

Microbatching Strategy

When processing large datasets (thousands of documents), do not send all items in a single request. Instead, implement client-side microbatching:

  1. Split your dataset into chunks of 16–32 items.
  2. Send each chunk as a separate /v1/embeddings request.
  3. Collect and concatenate the resulting vectors.
  4. For images, use smaller chunk sizes (8–16) to avoid OOM errors.
  5. Add a small delay between requests if processing thousands of items to avoid GPU thermal throttling.
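The steps above can be sketched as a client-side loop. `post_embeddings` is a placeholder for whatever sends one HTTP request (e.g. a `requests.post` wrapper); the chunk sizes mirror the guidance above:

```python
import time

def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_all(post_embeddings, inputs, chunk_size=32, delay_s=0.0):
    """Microbatch a large dataset through /v1/embeddings.

    `post_embeddings` is any callable that sends one request body and
    returns the parsed response (hypothetical; wire it to your HTTP client).
    Use chunk_size=8..16 and a small delay_s for image-heavy inputs.
    """
    vectors = []
    for chunk in chunked(inputs, chunk_size):
        body = post_embeddings({"inputs": chunk})
        # Responses preserve input order, so concatenation keeps alignment.
        vectors.extend(e["embedding"] for e in body["embeddings"])
        if delay_s:
            time.sleep(delay_s)
    return vectors
```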

Reranking Batch Limits

The reranker also respects max_batch_size for the number of candidate documents. If you have more than 32 candidates, either pre-filter with embeddings first (recommended) or split into multiple rerank calls and merge results.
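The split-and-merge path can be sketched as below. `post_rerank` is a hypothetical callable that sends one request and returns the parsed body; merging by raw score assumes scores are comparable across calls, which holds when each query-document pair is scored independently:

```python
def rerank_large(post_rerank, query, documents, batch=32, top_n=10):
    """Rerank more than max_batch_size candidates by splitting them into
    multiple /v1/rerank calls and merging the scored results."""
    merged = []
    for start in range(0, len(documents), batch):
        chunk = documents[start:start + batch]
        body = post_rerank({"query": query, "documents": chunk})
        for r in body["results"]:
            # Remap each chunk-local index back to the full candidate list.
            merged.append({"index": start + r["index"],
                           "score": r["score"],
                           "document": r["document"]})
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged[:top_n]
```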

Integration Guide

Configuring a Consuming Application

To integrate Synesis into another system, configure these settings:

| Setting | Value | Notes |
|---|---|---|
| Embedding API URL | http://<host>:8400/v1/embeddings | POST, JSON body |
| Rerank API URL | http://<host>:8400/v1/rerank | POST, JSON body |
| Health check URL | http://<host>:8400/ready/ | GET, 200 = ready |
| Embedding dimension | 2048 (or your chosen value) | Must match vector DB index config |
| Authentication | None | Secure via network policy |
| Content-Type | application/json | All endpoints |
| Timeout | 30–60 seconds | Image inputs take longer; adjust for batch size |
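These settings can be bundled into one place in the consuming application. A minimal sketch; the class and field names are illustrative, not part of any SDK:

```python
from dataclasses import dataclass

@dataclass
class SynesisConfig:
    """Settings bundle for a consuming application (illustrative)."""
    host: str = "synesis-host"
    port: int = 8400
    dimension: int = 2048     # must match your vector DB index config
    timeout_s: float = 60.0   # image batches need the higher end

    @property
    def base_url(self) -> str:
        return f"http://{self.host}:{self.port}"

    @property
    def embeddings_url(self) -> str:
        return f"{self.base_url}/v1/embeddings"

    @property
    def rerank_url(self) -> str:
        return f"{self.base_url}/v1/rerank"

    @property
    def ready_url(self) -> str:
        return f"{self.base_url}/ready/"
```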

Python Integration Example

import requests

SYNESIS_URL = "http://synesis-host:8400"

# --- Generate embeddings ---
resp = requests.post(f"{SYNESIS_URL}/v1/embeddings", json={
    "inputs": [
        {"text": "How to train a neural network"},
        {"text": "Best practices for deep learning"},
    ],
    "dimension": 1024,
})
data = resp.json()
vectors = [e["embedding"] for e in data["embeddings"]]
# vectors[0] is a list of 1024 floats

# --- Rerank candidates ---
resp = requests.post(f"{SYNESIS_URL}/v1/rerank", json={
    "query": {"text": "neural network training"},
    "documents": [
        {"text": "Backpropagation adjusts weights using gradients..."},
        {"text": "The weather forecast for tomorrow is sunny..."},
        {"text": "Stochastic gradient descent is an optimization method..."},
    ],
    "top_n": 2,
})
ranked = resp.json()
for result in ranked["results"]:
    print(f"  #{result['index']} score={result['score']:.4f}")
    print(f"    {result['document']['text'][:80]}")

Typical Two-Stage Retrieval Pipeline

  1. Index time: Embed all documents via /v1/embeddings and store vectors in your vector database (e.g. pgvector, Qdrant, Milvus, Weaviate).
  2. Query time — Stage 1 (Recall): Embed the query via /v1/embeddings, perform approximate nearest neighbour (ANN) search in the vector DB to retrieve top 20–50 candidates.
  3. Query time — Stage 2 (Precision): Pass the query and candidates to /v1/rerank to get precise relevance scores. Return the top 5–10 to the user or LLM context.

Similarity API

POST /v1/similarity

Compute cosine similarity between exactly two inputs. A convenience wrapper that embeds both inputs, normalizes them, and returns the dot product.

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| a | object | Yes | First input (text, image, or both). |
| b | object | Yes | Second input (text, image, or both). |
| dimension | int | No | Embedding dimension for comparison (64–2048). Default: 2048. |
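A minimal request body following the schema above, comparing a caption against an image (the image path reuses an example from the Embeddings section):

```json
{
  "a": {"text": "a photo of a cat"},
  "b": {"image": "/data/photos/cat.jpg"},
  "dimension": 1024
}
```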

Response Body

| Field | Type | Description |
|---|---|---|
| score | float | Cosine similarity (−1.0 to 1.0). Higher = more similar. |
| dimension | int | Dimension used for the comparison. |
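The score is the same quantity you would compute client-side from two /v1/embeddings vectors. A sketch of that equivalence (with normalize=true, the default, the norms are 1 and cosine reduces to the plain dot product):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors, the quantity
    /v1/similarity returns (embed both, normalize, dot product)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine([1.0, 0.0], [1.0, 0.0]))   # 1.0 (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
```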

Operations & Monitoring

Health & Readiness Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| /ready/ | GET | Readiness probe. Returns 200 when both models are loaded and GPU is available. 503 otherwise. Use for load balancer health checks. |
| /live/ | GET | Liveness probe. Returns 200 if the process is alive. Use for container restart decisions. |
| /health | GET | Detailed status: model paths, loaded state, GPU device name, VRAM usage. |
| /models, /v1/models | GET | List available models (OpenAI-compatible). Returns model IDs, capabilities, and metadata. Used by OpenAI SDK clients for model discovery. |
| /metrics | GET | Prometheus metrics (request counts, latency histograms, GPU memory, model status). |

Prometheus Metrics

Key custom metrics exposed:

  • embedding_model_loaded — Gauge (1 = loaded)
  • reranker_model_loaded — Gauge (1 = loaded)
  • embedding_gpu_memory_bytes — Gauge (current GPU allocation)
  • embedding_inference_requests_total{endpoint} — Counter per endpoint (embeddings, similarity, rerank)
  • embedding_inference_duration_seconds{endpoint} — Histogram of inference latency
  • Plus standard HTTP metrics from prometheus-fastapi-instrumentator

Environment Configuration

All settings use the EMBEDDING_ prefix and can be overridden via environment variables or /etc/default/synesis:

| Variable | Default | Description |
|---|---|---|
| EMBEDDING_MODEL_PATH | ./models/Qwen3-VL-Embedding-2B | Path to embedding model weights |
| EMBEDDING_RERANKER_MODEL_PATH | ./models/Qwen3-VL-Reranker-2B | Path to reranker model weights |
| EMBEDDING_TORCH_DTYPE | float16 | Model precision (float16 or bfloat16) |
| EMBEDDING_USE_FLASH_ATTENTION | true | Enable Flash Attention 2 |
| EMBEDDING_DEFAULT_DIMENSION | 2048 | Default embedding dimension when not specified per request |
| EMBEDDING_MAX_BATCH_SIZE | 32 | Maximum inputs per request (both embeddings and rerank) |
| EMBEDDING_HOST | 0.0.0.0 | Bind address |
| EMBEDDING_PORT | 8400 | Listen port |
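An illustrative /etc/default/synesis overriding a few of the defaults above (the paths and values here are examples, not requirements):

```shell
# /etc/default/synesis — example overrides (illustrative values)
EMBEDDING_MODEL_PATH=/opt/models/Qwen3-VL-Embedding-2B
EMBEDDING_RERANKER_MODEL_PATH=/opt/models/Qwen3-VL-Reranker-2B
EMBEDDING_TORCH_DTYPE=bfloat16
EMBEDDING_MAX_BATCH_SIZE=16   # smaller batches for image-heavy workloads
EMBEDDING_PORT=8400
```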

Error Handling

HTTP Status Codes

| Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process the response. |
| 422 | Validation error | Check your request body. Batch size may exceed max_batch_size, or required fields are missing. |
| 500 | Inference error | Model failed during processing. Check server logs. May indicate OOM with large image batches. |
| 503 | Model not loaded | Service is starting up or a model failed to load. Retry after checking /ready/. |
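The table above suggests a simple retry policy: only 503 is transient, so only 503 warrants retrying the same payload. A sketch with hypothetical helper names (`send` stands in for your HTTP call returning a status and parsed body):

```python
import time

def is_retryable(status: int) -> bool:
    """503 (model not loaded) is transient; 422 needs a changed request
    and 500 needs operator attention, so resending the same payload
    won't help for those."""
    return status == 503

def post_with_retry(send, payload, attempts=4, backoff_s=1.0):
    """Call `send(payload)` and retry transient failures with
    exponential backoff between attempts."""
    for attempt in range(attempts):
        status, body = send(payload)
        if status == 200:
            return body
        if not is_retryable(status) or attempt == attempts - 1:
            raise RuntimeError(f"request failed with HTTP {status}")
        time.sleep(backoff_s * (2 ** attempt))
```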

Synesis v0.2.0 — Qwen3-VL Embedding & Reranking Service. For interactive API exploration, visit /docs on the running service.