Multimodal embedding and reranking service powered by Qwen3-VL-2B. Supports text, image, and mixed-modal inputs over a simple REST API.
Generate dense vector representations for text, images, or both. Vectors are suitable for semantic search, retrieval, clustering, and classification.
POST /v1/embeddings
Given a query and a list of candidate documents, score and sort them by relevance. Use after an initial retrieval step to improve precision.
POST /v1/rerank
Convenience endpoint to compute cosine similarity between two inputs without managing vectors yourself.
POST /v1/similarity
Full request/response schemas, try-it-out functionality, and auto-generated curl examples are available at http://<host>:8400/docs (Swagger UI). Use it to experiment with every endpoint interactively.
All endpoints are served from a single base URL. Configure this in your consuming application:
http://<synesis-host>:8400
Default port is 8400. No authentication is required (secure via network policy / firewall).
Synesis loads two Qwen3-VL-2B models into GPU memory at startup: one for embeddings and one for reranking. Both share the same NVIDIA 3090 (24 GB VRAM).
Generate dense vector embeddings for one or more inputs. Each input can be text, an image, or both (multimodal).
| Field | Type | Required | Description |
|---|---|---|---|
| `inputs` | array | Yes | List of items to embed (1 to `max_batch_size`). |
| `inputs[].text` | string | * | Text content. At least one of `text` or `image` is required. |
| `inputs[].image` | string | * | Image file path or URL. At least one of `text` or `image` is required. |
| `inputs[].instruction` | string | No | Optional task instruction to guide embedding (e.g. "Represent this document for retrieval"). |
| `dimension` | int | No | Output vector dimension (64–2048). Default: 2048. See Dimensions. |
| `normalize` | bool | No | L2-normalize output vectors. Default: `true`. |
| Field | Type | Description |
|---|---|---|
| `embeddings[]` | array | One embedding per input, in order. |
| `embeddings[].index` | int | Position in the input array. |
| `embeddings[].embedding` | float[] | The dense vector (length = `dimension`). |
| `usage.input_count` | int | Number of inputs processed. |
| `usage.dimension` | int | Dimension of returned vectors. |
| `usage.elapsed_ms` | float | Server-side processing time in milliseconds. |
{
"inputs": [
{"text": "quantum computing basics"},
{"text": "machine learning tutorial"}
]
}
{
"inputs": [
{"image": "/data/photos/cat.jpg"},
{"image": "https://example.com/dog.png"}
]
}
{
"inputs": [
{
"text": "product photo",
"image": "/data/products/shoe.jpg"
}
]
}
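An illustrative (abbreviated) response to the first text-only request above. Vector values and timings are made up; the `...` marks truncation — each `embedding` actually has `dimension` entries:

```json
{
  "embeddings": [
    {"index": 0, "embedding": [0.0123, -0.0456, ...]},
    {"index": 1, "embedding": [0.0789, 0.0011, ...]}
  ],
  "usage": {"input_count": 2, "dimension": 2048, "elapsed_ms": 41.7}
}
```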
Score and rank a list of candidate documents against a query. Returns documents sorted by relevance (highest score first).
| Field | Type | Required | Description |
|---|---|---|---|
| `query` | object | Yes | The query to rank against. Must contain `text`, `image`, or both. |
| `query.text` | string | * | Query text. At least one of `text` or `image` required. |
| `query.image` | string | * | Query image path or URL. |
| `documents` | array | Yes | Candidate documents to rerank (1 to `max_batch_size`). |
| `documents[].text` | string | * | Document text. At least one of `text` or `image` required per document. |
| `documents[].image` | string | * | Document image path or URL. |
| `instruction` | string | No | Task instruction (e.g. "Retrieve images relevant to the query."). |
| `top_n` | int | No | Return only the top N results. Default: return all. |
| Field | Type | Description |
|---|---|---|
| `results[]` | array | Documents sorted by relevance score (descending). |
| `results[].index` | int | Original position of this document in the input array. |
| `results[].score` | float | Relevance score (higher = more relevant). |
| `results[].document` | object | The document that was ranked (echoed back). |
| `usage.query_count` | int | Always 1. |
| `usage.document_count` | int | Total documents scored. |
| `usage.returned_count` | int | Number of results returned (respects `top_n`). |
| `usage.elapsed_ms` | float | Server-side processing time in milliseconds. |
{
"query": {"text": "How do neural networks learn?"},
"documents": [
{"text": "Neural networks adjust weights through backpropagation..."},
{"text": "The stock market experienced a downturn in Q3..."},
{"text": "Deep learning uses gradient descent to minimize loss..."},
{"text": "Photosynthesis converts sunlight into chemical energy..."}
],
"top_n": 2
}
{
"query": {"text": "melancholy album artwork"},
"documents": [
{"image": "/data/covers/cover1.jpg"},
{"image": "/data/covers/cover2.jpg"},
{"text": "dark moody painting", "image": "/data/covers/cover3.jpg"}
],
"instruction": "Retrieve images relevant to the query.",
"top_n": 2
}
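An illustrative response to the first (text-only) rerank example above. Scores and timings are made up; note that `results[].index` refers to positions in the original `documents` array, and `returned_count` reflects `top_n`:

```json
{
  "results": [
    {"index": 0, "score": 0.87, "document": {"text": "Neural networks adjust weights through backpropagation..."}},
    {"index": 2, "score": 0.81, "document": {"text": "Deep learning uses gradient descent to minimize loss..."}}
  ],
  "usage": {"query_count": 1, "document_count": 4, "returned_count": 2, "elapsed_ms": 120.3}
}
```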
Synesis uses Matryoshka Representation Learning (MRL). The model always computes full 2048-dimensional vectors internally, then truncates to your requested dimension. This means you can choose a dimension that balances quality vs. storage/speed.
| Dimension | Vector Size | Quality | Use Case |
|---|---|---|---|
| 2048 (default) | 8 KB / vector (float32) | Maximum | Highest-accuracy retrieval, small collections |
| 1024 | 4 KB / vector | Very high | Good balance for most production systems |
| 512 | 2 KB / vector | High | Large-scale search with reasonable quality |
| 256 | 1 KB / vector | Good | Very large collections, cost-sensitive |
| 128 | 512 B / vector | Moderate | Rough filtering, pre-screening |
| 64 | 256 B / vector | Basic | Coarse clustering, topic grouping |
All vectors in the same index/collection must use the same dimension. Choose a dimension at index creation time and use it consistently for both indexing and querying. You cannot mix 512-d and 1024-d vectors in the same vector database index.
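Because MRL embeddings are truncation-compatible, a client that already holds full 2048-d normalized vectors can approximate a lower-dimensional comparison by truncating and re-normalizing locally. A minimal sketch of that operation (illustrative only — for indexing, request the target `dimension` from the server so stored and query vectors match exactly):

```python
import numpy as np

def truncate_mrl(vec, dim):
    """Truncate a full-length MRL embedding to `dim` entries and L2-renormalize."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# A random unit vector stands in for a 2048-d server embedding.
rng = np.random.default_rng(0)
full = rng.normal(size=2048).astype(np.float32)
full /= np.linalg.norm(full)

short = truncate_mrl(full, 512)
print(len(short))  # 512
```

The truncated vector is again unit-length, so dot products remain valid cosine similarities at the smaller dimension.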
The max_batch_size setting (default: 32) controls the maximum number of inputs per API call. This is tuned for the 3090's 24 GB VRAM.
When processing large datasets (thousands of documents), do not send all items in a single request. Instead, implement client-side microbatching: split the dataset into chunks of at most `max_batch_size` items and send each chunk as a separate `/v1/embeddings` request.

The reranker also respects `max_batch_size` for the number of candidate documents. If you have more than 32 candidates, either pre-filter with embeddings first (recommended) or split into multiple rerank calls and merge the results.
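The microbatching described above can be sketched as follows. `embed_in_batches` is a hypothetical client helper, not part of Synesis; the `post` callable is injected so the sketch stays self-contained — in real use it would wrap `requests.post(f"{base_url}/v1/embeddings", json=payload).json()`:

```python
def embed_in_batches(inputs, post, max_batch_size=32, dimension=1024):
    """Embed `inputs` in chunks of <= max_batch_size via the injected `post`.

    `post(payload)` should send the payload to /v1/embeddings and return the
    parsed JSON response. Vectors are concatenated in original input order.
    """
    vectors = []
    for start in range(0, len(inputs), max_batch_size):
        chunk = inputs[start:start + max_batch_size]
        data = post({"inputs": chunk, "dimension": dimension})
        vectors.extend(e["embedding"] for e in data["embeddings"])
    return vectors

# Stub transport for illustration: returns a zero vector per input.
def fake_post(payload):
    return {"embeddings": [
        {"index": i, "embedding": [0.0] * payload["dimension"]}
        for i in range(len(payload["inputs"]))
    ]}

docs = [{"text": f"doc {i}"} for i in range(70)]
vecs = embed_in_batches(docs, fake_post, max_batch_size=32)
print(len(vecs))  # 70 (sent as chunks of 32, 32, 6)
```

The same chunk-and-merge pattern applies to oversized rerank calls, except that per-chunk scores must be merged and re-sorted before applying `top_n`.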
To integrate Synesis into another system, configure these settings:
| Setting | Value | Notes |
|---|---|---|
| Embedding API URL | `http://<host>:8400/v1/embeddings` | POST, JSON body |
| Rerank API URL | `http://<host>:8400/v1/rerank` | POST, JSON body |
| Health check URL | `http://<host>:8400/ready/` | GET, 200 = ready |
| Embedding dimension | 2048 (or your chosen value) | Must match vector DB index config |
| Authentication | None | Secure via network policy |
| Content-Type | `application/json` | All endpoints |
| Timeout | 30–60 seconds | Image inputs take longer; adjust for batch size |
import requests
SYNESIS_URL = "http://synesis-host:8400"
# --- Generate embeddings ---
resp = requests.post(f"{SYNESIS_URL}/v1/embeddings", json={
"inputs": [
{"text": "How to train a neural network"},
{"text": "Best practices for deep learning"},
],
"dimension": 1024,
})
data = resp.json()
vectors = [e["embedding"] for e in data["embeddings"]]
# vectors[0] is a list of 1024 floats
# --- Rerank candidates ---
resp = requests.post(f"{SYNESIS_URL}/v1/rerank", json={
"query": {"text": "neural network training"},
"documents": [
{"text": "Backpropagation adjusts weights using gradients..."},
{"text": "The weather forecast for tomorrow is sunny..."},
{"text": "Stochastic gradient descent is an optimization method..."},
],
"top_n": 2,
})
ranked = resp.json()
for result in ranked["results"]:
print(f" #{result['index']} score={result['score']:.4f}")
print(f" {result['document']['text'][:80]}")
A typical retrieval pipeline combines both endpoints:

1. Embed your corpus with `/v1/embeddings` and store the vectors in your vector database (e.g. pgvector, Qdrant, Milvus, Weaviate).
2. At query time, embed the query with `/v1/embeddings`, then perform approximate nearest neighbour (ANN) search in the vector DB to retrieve the top 20–50 candidates.
3. Pass the query and candidates to `/v1/rerank` to get precise relevance scores. Return the top 5–10 to the user or LLM context.

Compute cosine similarity between exactly two inputs. A convenience wrapper: it embeds both, normalizes, and returns the dot product.
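Per the description above, `/v1/similarity` is equivalent to embedding both inputs and taking the dot product of the L2-normalized vectors. A client-side sketch of that computation, assuming you already hold the two embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the vectors' L2 norms.
    For already-normalized vectors this reduces to the plain dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

The endpoint is convenient for one-off comparisons; for repeated comparisons against the same item, embed once via `/v1/embeddings` and reuse the vector.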
| Field | Type | Required | Description |
|---|---|---|---|
| `a` | object | Yes | First input (`text`, `image`, or both). |
| `b` | object | Yes | Second input (`text`, `image`, or both). |
| `dimension` | int | No | Embedding dimension for comparison (64–2048). Default: 2048. |
| Field | Type | Description |
|---|---|---|
| `score` | float | Cosine similarity (−1.0 to 1.0). Higher = more similar. |
| `dimension` | int | Dimension used for the comparison. |
| Endpoint | Method | Purpose |
|---|---|---|
| `/ready/` | GET | Readiness probe. Returns 200 when both models are loaded and the GPU is available; 503 otherwise. Use for load balancer health checks. |
| `/live/` | GET | Liveness probe. Returns 200 if the process is alive. Use for container restart decisions. |
| `/health` | GET | Detailed status: model paths, loaded state, GPU device name, VRAM usage. |
| `/models/v1/models` | GET | List available models (OpenAI-compatible). Returns model IDs, capabilities, and metadata. Used by OpenAI SDK clients for model discovery. |
| `/metrics` | GET | Prometheus metrics (request counts, latency histograms, GPU memory, model status). |
Key custom metrics exposed:
- `embedding_model_loaded` — Gauge (1 = loaded)
- `reranker_model_loaded` — Gauge (1 = loaded)
- `embedding_gpu_memory_bytes` — Gauge (current GPU allocation)
- `embedding_inference_requests_total{endpoint}` — Counter per endpoint (embeddings, similarity, rerank)
- `embedding_inference_duration_seconds{endpoint}` — Histogram of inference latency

Standard HTTP request metrics are provided by `prometheus-fastapi-instrumentator`.

All settings use the `EMBEDDING_` prefix and can be overridden via environment variables or `/etc/default/synesis`:
| Variable | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL_PATH` | `./models/Qwen3-VL-Embedding-2B` | Path to embedding model weights |
| `EMBEDDING_RERANKER_MODEL_PATH` | `./models/Qwen3-VL-Reranker-2B` | Path to reranker model weights |
| `EMBEDDING_TORCH_DTYPE` | `float16` | Model precision (`float16` or `bfloat16`) |
| `EMBEDDING_USE_FLASH_ATTENTION` | `true` | Enable Flash Attention 2 |
| `EMBEDDING_DEFAULT_DIMENSION` | `2048` | Default embedding dimension when not specified per request |
| `EMBEDDING_MAX_BATCH_SIZE` | `32` | Maximum inputs per request (both embeddings and rerank) |
| `EMBEDDING_HOST` | `0.0.0.0` | Bind address |
| `EMBEDDING_PORT` | `8400` | Listen port |
| Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process the response. |
| 422 | Validation error | Check your request body. Batch size may exceed `max_batch_size`, or required fields are missing. |
| 500 | Inference error | Model failed during processing. Check server logs. May indicate OOM with large image batches. |
| 503 | Model not loaded | Service is starting up or a model failed to load. Retry after checking `/ready/`. |
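The table above suggests a simple client policy: retry on 503 (the service is still loading), treat 422 as a caller bug, and surface 500 to operators. A hypothetical helper encoding that policy (the function and its labels are ours, not part of Synesis):

```python
RETRYABLE = {503}      # model still loading: poll /ready/, then retry
CALLER_ERRORS = {422}  # fix the request body; retrying will not help

def classify(status_code):
    """Map a Synesis HTTP status code to a client-side action."""
    if status_code == 200:
        return "ok"
    if status_code in RETRYABLE:
        return "retry"
    if status_code in CALLER_ERRORS:
        return "fix-request"
    return "report"  # e.g. 500: check server logs / possible OOM

print(classify(503))  # retry
print(classify(422))  # fix-request
```

Pair the "retry" branch with a bounded backoff (e.g. a few attempts a few seconds apart) rather than retrying indefinitely.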
Synesis v0.2.0 — Qwen3-VL Embedding & Reranking Service. For interactive API exploration, visit /docs on the running service.