feat: add Phase 3 hybrid search with Synesis reranking

Implement hybrid search pipeline combining vector, fulltext, and graph search across Neo4j, with cross-attention reranking via Synesis (Qwen3-VL-Reranker-2B) `/v1/rerank` endpoint.

- Add SearchService with vector, fulltext, and graph search strategies
- Add SynesisRerankerClient for multimodal reranking via HTTP API
- Add search API endpoint (POST /search/) with filtering by library, collection, and library_type
- Add SearchRequest/Response serializers and image search results
- Add "nonfiction" to library_type choices
- Consolidate reranker stack from two models to single Synesis service
- Handle image analysis_status as "skipped" when analysis is unavailable
- Add comprehensive tests for search pipeline and reranker client
384
docs/PHASE_3_SEARCH_AND_RERANKING.md
Normal file
@@ -0,0 +1,384 @@
# Phase 3: Search & Re-ranking

## Objective

Build the complete hybrid search pipeline: accept a query → embed it → search Neo4j (vector + full-text + graph traversal) → fuse candidates → re-rank via Synesis → return ranked results with content-type context. At the end of this phase, content is discoverable through multiple search modalities, ranked by cross-attention relevance, and ready for Phase 4's RAG generation.

## Heritage

The hybrid search architecture adapts patterns from [Spelunker](https://git.helu.ca/r/spelunker)'s two-stage retrieval pipeline (vector recall + cross-attention re-ranking), enhanced with knowledge graph traversal, multimodal search, and content-type-aware re-ranking instructions.

## Architecture Overview

```
User Query (text, optional image, optional filters)
    │
    ├─→ Vector Search (Neo4j vector index: Chunk.embedding)
    │       → Top-K nearest neighbors by cosine similarity
    │
    ├─→ Full-Text Search (Neo4j fulltext index: Chunk.text_preview, Concept.name)
    │       → BM25-scored matches
    │
    ├─→ Graph Search (Cypher traversal)
    │       → Concept-linked chunks via MENTIONS/REFERENCES/DEPICTS edges
    │
    └─→ Image Search (Neo4j vector index: ImageEmbedding.embedding)
            → Multimodal similarity (text-to-image in unified vector space)
    │
    └─→ Candidate Fusion (Reciprocal Rank Fusion)
            → Deduplicated, scored candidate list
    │
    └─→ Re-ranking (Synesis /v1/rerank)
            → Content-type-aware instruction injection
            → Cross-attention precision scoring
    │
    └─→ Final ranked results with metadata
```

## Synesis Integration

[Synesis](docs/synesis_api_usage_guide.html) is a custom FastAPI service built around Qwen3-VL-2B, providing both embedding and re-ranking over a clean REST API. It runs on `pan.helu.ca:8400`.

**Embedding** (Phase 2, already working): Synesis's `/v1/embeddings` endpoint is OpenAI-compatible, so the existing `EmbeddingClient` handles it with `api_type="openai"`.

**Re-ranking** (Phase 3, new): Synesis's `/v1/rerank` endpoint provides:

- Native `instruction` parameter, which maps directly to `reranker_instruction` from content types
- `top_n` for server-side truncation
- Multimodal support: both query and documents can include images
- Relevance scores for each candidate

```python
# Synesis rerank request
POST http://pan.helu.ca:8400/v1/rerank
{
    "query": {"text": "How do I configure a 3-phase motor?"},
    "documents": [
        {"text": "The motor controller requires..."},
        {"text": "3-phase power is distributed..."}
    ],
    "instruction": "Re-rank passages from technical documentation based on procedural relevance.",
    "top_n": 10
}
```
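
The request above can be assembled and its response mapped back to the input documents with two small helpers. This is a minimal sketch, assuming the response shape described in the Synesis API guide (`results[].index`, `results[].score`). `build_rerank_payload` and `apply_rerank_scores` are hypothetical names, only the text-only case is covered, and no HTTP call is made.

```python
from __future__ import annotations

from typing import Any


def build_rerank_payload(
    query: str,
    documents: list[str],
    instruction: str = "",
    top_n: int | None = None,
) -> dict[str, Any]:
    """Build a JSON body for POST /v1/rerank (text-only case)."""
    payload: dict[str, Any] = {
        "query": {"text": query},
        "documents": [{"text": doc} for doc in documents],
    }
    # instruction and top_n are optional, so only send them when set.
    if instruction:
        payload["instruction"] = instruction
    if top_n is not None:
        payload["top_n"] = top_n
    return payload


def apply_rerank_scores(
    documents: list[str], response: dict[str, Any]
) -> list[tuple[str, float]]:
    """Map results[].index back to the original documents, best first."""
    ranked = [
        (documents[result["index"]], float(result["score"]))
        for result in response["results"]
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = ["The motor controller requires...", "3-phase power is distributed..."]
payload = build_rerank_payload("How do I configure a 3-phase motor?", docs, top_n=10)

# Hand-written stand-in for a Synesis response, not real model output.
fake_response = {"results": [{"index": 1, "score": 0.9}, {"index": 0, "score": 0.2}]}
ranked = apply_rerank_scores(docs, fake_response)
```

Keeping the payload builder free of I/O makes it easy to unit-test separately from the HTTP layer.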

## Deliverables

### 1. Search Service (`library/services/search.py`)

The core search orchestrator. Accepts a `SearchRequest`, dispatches to individual search backends, fuses results, and optionally re-ranks.

#### SearchRequest

```python
from dataclasses import dataclass, field


@dataclass
class SearchRequest:
    query: str                              # Natural language query text
    query_image: bytes | None = None        # Optional image for multimodal search
    library_uid: str | None = None          # Scope to specific library
    library_type: str | None = None         # Scope to library type
    collection_uid: str | None = None       # Scope to specific collection
    # Needs a default because earlier fields have defaults; default_factory
    # gives each instance its own list.
    search_types: list[str] = field(
        default_factory=lambda: ["vector", "fulltext", "graph"]
    )
    limit: int = 20                         # Max results after fusion
    vector_top_k: int = 50                  # Candidates from vector search
    fulltext_top_k: int = 30                # Candidates from fulltext search
    graph_max_depth: int = 2                # Graph traversal depth
    rerank: bool = True                     # Apply re-ranking
    include_images: bool = True             # Include image results
```
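
The dispatch step driven by `search_types` might look like the following. This is a minimal sketch, not the actual `SearchService`: the `Backend` signature and the `dispatch_searches` name are hypothetical, and the backends here are stubs returning chunk UIDs.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical backend signature: takes the query text and returns
# chunk UIDs ordered best-first.
Backend = Callable[[str], list[str]]


def dispatch_searches(
    query: str,
    search_types: list[str],
    backends: dict[str, Backend],
) -> dict[str, list[str]]:
    """Run each requested backend and collect its ranked list by search type."""
    results: dict[str, list[str]] = {}
    for search_type in search_types:
        backend = backends.get(search_type)
        if backend is None:
            raise ValueError(f"Unknown search type: {search_type!r}")
        results[search_type] = backend(query)
    return results


# Stub backends standing in for the Neo4j-backed searches.
backends: dict[str, Backend] = {
    "vector": lambda q: ["c1", "c2"],
    "fulltext": lambda q: ["c2", "c3"],
    "graph": lambda q: ["c4"],
}
per_type = dispatch_searches("3-phase motor", ["vector", "fulltext"], backends)
```

The per-type lists returned here are the natural input to the fusion step, which only needs each backend's ranking.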

#### SearchResponse

```python
@dataclass
class SearchCandidate:
    chunk_uid: str
    item_uid: str
    item_title: str
    library_type: str
    text_preview: str
    chunk_s3_key: str
    chunk_index: int
    score: float                        # Final score (post-fusion or post-rerank)
    source: str                         # "vector", "fulltext", "graph"
    metadata: dict                      # Page, section, nearby images, etc.

@dataclass
class ImageSearchResult:
    image_uid: str
    item_uid: str
    item_title: str
    image_type: str
    description: str
    s3_key: str
    score: float
    source: str                         # "vector", "graph"

@dataclass
class SearchResponse:
    query: str
    candidates: list[SearchCandidate]   # Ranked text results
    images: list[ImageSearchResult]     # Ranked image results
    total_candidates: int               # Pre-fusion candidate count
    search_time_ms: float
    reranker_used: bool
    reranker_model: str | None
    search_types_used: list[str]
```

### 2. Vector Search

Uses Neo4j's `db.index.vector.queryNodes()` against `chunk_embedding_index`.

- Embed query text using system embedding model (via existing `EmbeddingClient`)
- Prepend library's `embedding_instruction` when scoped to a specific library
- Query Neo4j vector index for top-K Chunk nodes by cosine similarity
- Filter by library/collection via graph pattern matching

```cypher
CALL db.index.vector.queryNodes('chunk_embedding_index', $top_k, $query_vector)
YIELD node AS chunk, score
MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
// Filter in a separate WITH ... WHERE: a WHERE attached directly to
// OPTIONAL MATCH only nulls out lib/col and never drops the row.
WITH chunk, item, lib, col, score
WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
  AND ($library_type IS NULL OR lib.library_type = $library_type)
  AND ($collection_uid IS NULL OR col.uid = $collection_uid)
RETURN chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
       chunk.chunk_s3_key AS chunk_s3_key, chunk.chunk_index AS chunk_index,
       item.uid AS item_uid, item.title AS item_title,
       lib.library_type AS library_type, score
ORDER BY score DESC
LIMIT $top_k
```

### 3. Full-Text Search

Uses Neo4j fulltext indexes created by `setup_neo4j_indexes`.

- Query `chunk_text_fulltext` for Chunk matches (BM25)
- Query `concept_name_fulltext` for Concept matches → traverse to connected Chunks
- Query `item_title_fulltext` for Item title matches → get their Chunks
- Normalize BM25 scores to 0-1 range for fusion compatibility

```cypher
// Chunk full-text search
CALL db.index.fulltext.queryNodes('chunk_text_fulltext', $query)
YIELD node AS chunk, score
MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
// Filter after the OPTIONAL MATCH so non-matching rows are actually dropped.
WITH chunk, item, lib, score
WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
RETURN chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
       item.uid AS item_uid, item.title AS item_title,
       lib.library_type AS library_type, score
ORDER BY score DESC
LIMIT $top_k

// Concept-to-Chunk traversal
CALL db.index.fulltext.queryNodes('concept_name_fulltext', $query)
YIELD node AS concept, score AS concept_score
MATCH (chunk:Chunk)-[:MENTIONS]->(concept)
MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
RETURN chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
       item.uid AS item_uid, item.title AS item_title,
       concept_score * 0.8 AS score
```
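
The score normalization step listed above is not specified further. One common option is min-max scaling, sketched here under that assumption; `normalize_scores` is a hypothetical helper name.

```python
from __future__ import annotations


def normalize_scores(scores: list[float]) -> list[float]:
    """Min-max scale scores into [0, 1]; all-equal inputs map to 1.0."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if high == low:
        # Avoid dividing by zero when every score is identical.
        return [1.0 for _ in scores]
    return [(score - low) / (high - low) for score in scores]


normalized = normalize_scores([7.5, 2.5, 5.0])
```

Rank-based fusion does not depend on the score scale, so this normalization matters mainly where raw scores from different backends are displayed or compared directly.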

### 4. Graph Search

Knowledge-graph-powered discovery, the differentiator from standard RAG.

- Match query terms against Concept names via fulltext index
- Traverse `Concept ←[MENTIONS]- Chunk ←[HAS_CHUNK]- Item`
- Expand via `Concept -[RELATED_TO]- Concept` for secondary connections
- Score based on relationship weight and traversal depth

```cypher
// Concept graph traversal
CALL db.index.fulltext.queryNodes('concept_name_fulltext', $query)
YIELD node AS concept, score
MATCH path = (concept)<-[:MENTIONS|REFERENCES*1..2]-(connected)
WHERE connected:Chunk OR connected:Item
WITH concept, connected, score, length(path) AS depth
MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
WHERE chunk = connected OR item = connected
RETURN DISTINCT chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
       item.uid AS item_uid, item.title AS item_title,
       score / (depth * 0.5 + 1) AS score
```
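
The depth penalty in the `RETURN` clause is easier to reason about with concrete numbers. The sketch below mirrors the expression `score / (depth * 0.5 + 1)` in Python; `depth_decayed_score` is a hypothetical helper, and the input score of 3.0 is an arbitrary illustration.

```python
def depth_decayed_score(score: float, depth: int) -> float:
    """Mirror the Cypher expression score / (depth * 0.5 + 1)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return score / (depth * 0.5 + 1)


one_hop = depth_decayed_score(3.0, 1)  # divisor is 1.5
two_hop = depth_decayed_score(3.0, 2)  # divisor is 2.0
```

With this formula a one-hop match keeps two thirds of the fulltext score and a two-hop match keeps half, so closer connections rank higher for the same concept match.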

### 5. Image Search

Multimodal vector search against `image_embedding_index`.

- Embed query text (or image) using system embedding model
- Search `ImageEmbedding` vectors in unified multimodal space
- Return with Image descriptions, OCR text, and Item associations from Phase 2B
- Also include images found via concept graph DEPICTS relationships

### 6. Candidate Fusion (`library/services/fusion.py`)

Reciprocal Rank Fusion (RRF): parameter-light, proven in Spelunker.

```python
def reciprocal_rank_fusion(
    result_lists: list[list[SearchCandidate]],
    k: int = 60,
) -> list[SearchCandidate]:
    """
    RRF score = Σ 1 / (k + rank_i) for each list containing the candidate.
    Candidates in multiple lists get boosted.
    """
```

- Deduplicates candidates by `chunk_uid`
- Candidates appearing in multiple search types get naturally boosted
- Sort by fused score descending, trim to `limit`
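
The stub and bullets above can be fleshed out as follows. This is a minimal sketch, not the project's `fusion.py`: `Candidate` is a stand-in for `SearchCandidate` with only the fields fusion needs, ranks are assumed to start at 1, and the `limit` parameter is an assumption added for the trimming step.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass
class Candidate:
    """Stand-in for SearchCandidate with only the fields fusion needs."""

    chunk_uid: str
    score: float = 0.0


def reciprocal_rank_fusion(
    result_lists: list[list[Candidate]],
    k: int = 60,
    limit: int | None = None,
) -> list[Candidate]:
    """Fuse ranked lists: score = sum of 1 / (k + rank) over lists."""
    fused: dict[str, float] = {}
    first_seen: dict[str, Candidate] = {}
    for results in result_lists:
        seen_in_list: set[str] = set()
        for rank, candidate in enumerate(results, start=1):
            uid = candidate.chunk_uid
            if uid in seen_in_list:
                continue  # count each chunk at most once per list
            seen_in_list.add(uid)
            fused[uid] = fused.get(uid, 0.0) + 1.0 / (k + rank)
            first_seen.setdefault(uid, candidate)
    # Deduplicate by chunk_uid and attach the fused score to a copy.
    merged = [replace(first_seen[uid], score=score) for uid, score in fused.items()]
    merged.sort(key=lambda c: c.score, reverse=True)
    return merged if limit is None else merged[:limit]


vector_hits = [Candidate("a"), Candidate("b")]
fulltext_hits = [Candidate("b"), Candidate("c")]
fused_results = reciprocal_rank_fusion([vector_hits, fulltext_hits])
```

Here chunk `b` appears in both lists, so its two reciprocal-rank terms add up and it outranks `a`, which is first in only one list.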

### 7. Re-ranking Client (`library/services/reranker.py`)

Targets Synesis's `POST /v1/rerank` endpoint. Wraps the system reranker model's API configuration.

#### Synesis Backend

```python
class RerankerClient:
    def rerank(
        self,
        query: str,
        candidates: list[SearchCandidate],
        instruction: str = "",
        top_n: int | None = None,
        query_image: bytes | None = None,
    ) -> list[SearchCandidate]:
        """
        Re-rank candidates via Synesis /v1/rerank.

        Injects content-type reranker_instruction as the instruction parameter.
        """
```

Features:

- Uses `text_preview` (500 chars) for document text, which avoids S3 round-trips
- Passes the library's `reranker_instruction` as the `instruction` parameter
- Supports multimodal queries (text + image)
- Falls back gracefully when no reranker model is configured
- Tracks usage via `LLMUsage` with `purpose="reranking"`
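
The graceful fallback can be sketched independently of HTTP. The example below is an illustration under assumptions, not the actual client: `rerank_or_passthrough` and the `Scorer` signature are hypothetical, falling back on a scorer exception is an assumption beyond the "no reranker configured" case, and capping at `max_candidates` mirrors the `RERANKER_MAX_CANDIDATES` setting with the overflow appended in its original order.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical scorer: given the query and texts, returns one score per text.
Scorer = Callable[[str, list[str]], list[float]]


def rerank_or_passthrough(
    query: str,
    texts: list[str],
    scorer: Optional[Scorer],
    max_candidates: int = 32,
) -> tuple[list[str], bool]:
    """Return (ordered texts, reranker_used)."""
    if scorer is None:
        # No reranker configured: keep the fused order.
        return texts, False
    head, tail = texts[:max_candidates], texts[max_candidates:]
    try:
        scores = scorer(query, head)
    except Exception:
        # Assumption: a failing reranker should not fail the whole search.
        return texts, False
    order = sorted(range(len(head)), key=lambda i: scores[i], reverse=True)
    return [head[i] for i in order] + tail, True


texts = ["alpha", "beta", "gamma"]
ordered, used = rerank_or_passthrough("q", texts, lambda q, docs: [0.1, 0.9, 0.5])
skipped, skipped_used = rerank_or_passthrough("q", texts, None)
```

Returning the `reranker_used` flag alongside the results lines up with the `reranker_used` field on `SearchResponse`.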

### 8. Search API Endpoints

New endpoints in `library/api/`:

| Method | Route | Purpose |
|--------|-------|---------|
| `POST` | `/api/v1/library/search/` | Full hybrid search + re-rank |
| `POST` | `/api/v1/library/search/vector/` | Vector-only search (debugging) |
| `POST` | `/api/v1/library/search/fulltext/` | Full-text-only search (debugging) |
| `GET` | `/api/v1/library/concepts/` | List/search concepts |
| `GET` | `/api/v1/library/concepts/<uid>/graph/` | Concept neighborhood graph |

### 9. Search UI Views

| URL | View | Purpose |
|-----|------|---------|
| `/library/search/` | `search` | Search page with query input + filters |
| `/library/concepts/` | `concept_list` | Browse concepts with search |
| `/library/concepts/<uid>/` | `concept_detail` | Single concept with connections |

### 10. Prometheus Metrics

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| `mnemosyne_search_requests_total` | Counter | search_type, library_type | Search throughput |
| `mnemosyne_search_duration_seconds` | Histogram | search_type | Per-search-type latency |
| `mnemosyne_search_candidates_total` | Histogram | search_type | Candidates per search type |
| `mnemosyne_fusion_duration_seconds` | Histogram | - | Fusion latency |
| `mnemosyne_rerank_requests_total` | Counter | model_name, status | Re-rank throughput |
| `mnemosyne_rerank_duration_seconds` | Histogram | model_name | Re-rank latency |
| `mnemosyne_rerank_candidates` | Histogram | - | Candidates sent to reranker |
| `mnemosyne_search_total_duration_seconds` | Histogram | - | End-to-end search latency |

### 11. Management Commands

| Command | Purpose |
|---------|---------|
| `search <query> [--library-uid] [--limit] [--no-rerank]` | CLI search for testing |
| `search_stats` | Search index statistics |

### 12. Settings

```python
# Search configuration
SEARCH_VECTOR_TOP_K = env.int("SEARCH_VECTOR_TOP_K", default=50)
SEARCH_FULLTEXT_TOP_K = env.int("SEARCH_FULLTEXT_TOP_K", default=30)
SEARCH_GRAPH_MAX_DEPTH = env.int("SEARCH_GRAPH_MAX_DEPTH", default=2)
SEARCH_RRF_K = env.int("SEARCH_RRF_K", default=60)
SEARCH_DEFAULT_LIMIT = env.int("SEARCH_DEFAULT_LIMIT", default=20)
RERANKER_MAX_CANDIDATES = env.int("RERANKER_MAX_CANDIDATES", default=32)
RERANKER_TIMEOUT = env.int("RERANKER_TIMEOUT", default=30)
```

## File Structure

```
mnemosyne/library/
├── services/
│   ├── search.py               # NEW: SearchService orchestrator
│   ├── fusion.py               # NEW: Reciprocal Rank Fusion
│   ├── reranker.py             # NEW: Synesis re-ranking client
│   └── ...                     # Existing services unchanged
├── metrics.py                  # Modified: add search/rerank metrics
├── views.py                    # Modified: add search UI views
├── urls.py                     # Modified: add search routes
├── api/
│   ├── views.py                # Modified: add search API endpoints
│   ├── serializers.py          # Modified: add search serializers
│   └── urls.py                 # Modified: add search API routes
├── management/commands/
│   ├── search.py               # NEW: CLI search command
│   └── search_stats.py         # NEW: Index statistics
├── templates/library/
│   ├── search.html             # NEW: Search page
│   ├── concept_list.html       # NEW: Concept browser
│   └── concept_detail.html     # NEW: Concept detail
└── tests/
    ├── test_search.py          # NEW: Search service tests
    ├── test_fusion.py          # NEW: RRF fusion tests
    ├── test_reranker.py        # NEW: Re-ranking client tests
    └── test_search_api.py      # NEW: Search API endpoint tests
```

## Dependencies

No new Python dependencies required. Phase 3 uses:

- `neomodel` + raw Cypher (Neo4j search)
- `requests` (Synesis reranker HTTP)
- `EmbeddingClient` from Phase 2 (query embedding)
- `prometheus_client` (metrics)

## Testing Strategy

All tests use Django `TestCase`. External services mocked.

| Test File | Scope |
|-----------|-------|
| `test_search.py` | SearchService orchestration, individual search methods, library/collection scoping |
| `test_fusion.py` | RRF correctness, deduplication, score calculation, edge cases |
| `test_reranker.py` | Synesis backend (mocked HTTP), instruction injection, graceful fallback |
| `test_search_api.py` | API endpoints, request validation, response format |

## Success Criteria

- [ ] Vector search returns Chunk nodes ranked by cosine similarity from Neo4j
- [ ] Full-text search returns matches from Neo4j fulltext indexes
- [ ] Graph search traverses Concept relationships to discover related content
- [ ] Image search returns images via multimodal vector similarity
- [ ] Reciprocal Rank Fusion correctly merges and deduplicates across search types
- [ ] Re-ranking via Synesis `/v1/rerank` re-scores candidates with cross-attention
- [ ] Content-type `reranker_instruction` injected per library type
- [ ] Search scoping works (by library, library type, collection)
- [ ] Search gracefully degrades: no reranker → skip; no embedding model → clear error
- [ ] Search API endpoints return structured results with scores and metadata
- [ ] Search UI allows querying with filters and displays ranked results
- [ ] Concept explorer allows browsing the knowledge graph
- [ ] Prometheus metrics track search throughput, latency, and candidate counts
- [ ] CLI search command works for testing
- [ ] All tests pass with mocked external services
908
docs/synesis_api_usage_guide.html
Normal file
@@ -0,0 +1,908 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Synesis - API Usage Guide</title>
<!-- Bootstrap CSS -->
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet">
<!-- Mermaid -->
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
</head>
<body>
<div class="container-fluid">

<!-- Navigation -->
<nav class="navbar navbar-dark bg-dark rounded mb-4">
<div class="container-fluid">
<a class="navbar-brand" href="api_usage_guide.html">Synesis API Guide</a>
<div class="navbar-nav d-flex flex-row">
<a class="nav-link me-3" href="#overview">Overview</a>
<a class="nav-link me-3" href="#architecture">Architecture</a>
<a class="nav-link me-3" href="#embeddings">Embeddings</a>
<a class="nav-link me-3" href="#reranking">Reranking</a>
<a class="nav-link me-3" href="#integration">Integration</a>
<a class="nav-link" href="#operations">Operations</a>
</div>
</div>
</nav>

<nav aria-label="breadcrumb">
<ol class="breadcrumb">
<li class="breadcrumb-item"><a href="api_usage_guide.html">Synesis</a></li>
<li class="breadcrumb-item active">API Usage Guide</li>
</ol>
</nav>

<!-- Title -->
<div class="row mb-4">
<div class="col-12">
<h1 class="display-4 mb-2">Synesis - API Usage Guide</h1>
<p class="lead">Multimodal embedding and reranking service powered by Qwen3-VL-2B. Supports text, image, and mixed-modal inputs over a simple REST API.</p>
</div>
</div>

<!-- ============================================================ -->
<!-- OVERVIEW -->
<!-- ============================================================ -->
<section id="overview" class="mb-5">
<h2 class="h2 mb-4">Overview</h2>

<div class="row g-4 mb-4">
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">Embeddings</h3>
<p>Generate dense vector representations for text, images, or both. Vectors are suitable for semantic search, retrieval, clustering, and classification.</p>
<code>POST /v1/embeddings</code>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">Reranking</h3>
<p>Given a query and a list of candidate documents, score and sort them by relevance. Use after an initial retrieval step to improve precision.</p>
<code>POST /v1/rerank</code>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h3 class="card-title text-primary">Similarity</h3>
<p>Convenience endpoint to compute cosine similarity between two inputs without managing vectors yourself.</p>
<code>POST /v1/similarity</code>
</div>
</div>
</div>
</div>

<div class="alert alert-info border-start border-4 border-info">
<h3>Interactive API Explorer</h3>
<p class="mb-0">Full request/response schemas, try-it-out functionality, and auto-generated curl examples are available at <strong><code>http://&lt;host&gt;:8400/docs</code></strong> (Swagger UI). Use it to experiment with every endpoint interactively.</p>
</div>

<div class="alert alert-secondary border-start border-4 border-secondary">
<h3>Base URL</h3>
<p>All endpoints are served from a single base URL. Configure this in your consuming application:</p>
<pre class="mb-0">http://&lt;synesis-host&gt;:8400</pre>
<p class="mt-2 mb-0">Default port is <code>8400</code>. No authentication is required (secure via network policy / firewall).</p>
</div>
</section>

<!-- ============================================================ -->
<!-- ARCHITECTURE -->
<!-- ============================================================ -->
<section id="architecture" class="mb-5">
<h2 class="h2 mb-4">Architecture</h2>

<div class="alert alert-info border-start border-4 border-info">
<h3>Service Architecture</h3>
<p>Synesis loads two Qwen3-VL-2B models into GPU memory at startup: one for embeddings and one for reranking. Both share the same NVIDIA 3090 (24 GB VRAM).</p>
</div>

<div class="card my-4">
<div class="card-body">
<h3 class="card-title text-primary">Request Flow</h3>
<div class="mermaid">
graph LR
Client["Client Application"] -->|HTTP POST| FastAPI["FastAPI<br/>:8400"]
FastAPI -->|/v1/embeddings| Embedder["Qwen3-VL<br/>Embedder 2B"]
FastAPI -->|/v1/rerank| Reranker["Qwen3-VL<br/>Reranker 2B"]
FastAPI -->|/v1/similarity| Embedder
Embedder --> GPU["NVIDIA 3090<br/>24 GB VRAM"]
Reranker --> GPU
FastAPI -->|/metrics| Prometheus["Prometheus"]
</div>
</div>
</div>

<div class="card my-4">
<div class="card-body">
<h3 class="card-title text-primary">Typical RAG Integration</h3>
<div class="mermaid">
sequenceDiagram
participant App as Your Application
participant Synesis as Synesis API
participant VDB as Vector Database

Note over App: Indexing Phase
App->>Synesis: POST /v1/embeddings (documents)
Synesis-->>App: embedding vectors
App->>VDB: Store vectors + metadata

Note over App: Query Phase
App->>Synesis: POST /v1/embeddings (query)
Synesis-->>App: query vector
App->>VDB: ANN search (top 50)
VDB-->>App: candidate documents
App->>Synesis: POST /v1/rerank (query + candidates)
Synesis-->>App: ranked results with scores
App->>App: Use top 5-10 results
</div>
</div>
</div>
</section>

<!-- ============================================================ -->
<!-- EMBEDDINGS -->
<!-- ============================================================ -->
<section id="embeddings" class="mb-5">
<h2 class="h2 mb-4">Embeddings API</h2>

<div class="alert alert-primary border-start border-4 border-primary">
<h3>POST /v1/embeddings</h3>
<p class="mb-0">Generate dense vector embeddings for one or more inputs. Each input can be text, an image, or both (multimodal).</p>
</div>

<!-- Request Schema -->
<h3 class="mt-4">Request Body</h3>
<table class="table table-bordered">
<thead class="table-dark">
<tr>
<th>Field</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>inputs</code></td>
<td>array</td>
<td>Yes</td>
<td>List of items to embed (1 to <code>max_batch_size</code>).</td>
</tr>
<tr>
<td><code>inputs[].text</code></td>
<td>string</td>
<td>*</td>
<td>Text content. At least one of <code>text</code> or <code>image</code> is required.</td>
</tr>
<tr>
<td><code>inputs[].image</code></td>
<td>string</td>
<td>*</td>
<td>Image file path or URL. At least one of <code>text</code> or <code>image</code> is required.</td>
</tr>
<tr>
<td><code>inputs[].instruction</code></td>
<td>string</td>
<td>No</td>
<td>Optional task instruction to guide embedding (e.g. "Represent this document for retrieval").</td>
</tr>
<tr>
<td><code>dimension</code></td>
<td>int</td>
<td>No</td>
<td>Output vector dimension (64–2048). Default: 2048. See <a href="#dimensions">Dimensions</a>.</td>
</tr>
<tr>
<td><code>normalize</code></td>
<td>bool</td>
<td>No</td>
<td>L2-normalize output vectors. Default: <code>true</code>.</td>
</tr>
</tbody>
</table>

<!-- Response Schema -->
<h3 class="mt-4">Response Body</h3>
<table class="table table-bordered">
<thead class="table-dark">
<tr>
<th>Field</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>embeddings[]</code></td>
<td>array</td>
<td>One embedding per input, in order.</td>
</tr>
<tr>
<td><code>embeddings[].index</code></td>
<td>int</td>
<td>Position in the input array.</td>
</tr>
<tr>
<td><code>embeddings[].embedding</code></td>
<td>float[]</td>
<td>The dense vector (length = <code>dimension</code>).</td>
</tr>
<tr>
<td><code>usage.input_count</code></td>
<td>int</td>
<td>Number of inputs processed.</td>
</tr>
<tr>
<td><code>usage.dimension</code></td>
<td>int</td>
<td>Dimension of returned vectors.</td>
</tr>
<tr>
<td><code>usage.elapsed_ms</code></td>
<td>float</td>
<td>Server-side processing time in milliseconds.</td>
</tr>
</tbody>
</table>

<!-- Input Types -->
<h3 class="mt-4">Input Modalities</h3>
<div class="row g-4">
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h4 class="card-title text-primary">Text Only</h4>
<pre class="mb-0">{
  "inputs": [
    {"text": "quantum computing basics"},
    {"text": "machine learning tutorial"}
  ]
}</pre>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h4 class="card-title text-primary">Image Only</h4>
<pre class="mb-0">{
  "inputs": [
    {"image": "/data/photos/cat.jpg"},
    {"image": "https://example.com/dog.png"}
  ]
}</pre>
</div>
</div>
</div>
<div class="col-lg-4">
<div class="card h-100">
<div class="card-body">
<h4 class="card-title text-primary">Multimodal</h4>
<pre class="mb-0">{
  "inputs": [
    {
      "text": "product photo",
      "image": "/data/products/shoe.jpg"
    }
  ]
}</pre>
</div>
</div>
</div>
</div>
</section>

<!-- ============================================================ -->
<!-- RERANKING -->
<!-- ============================================================ -->
<section id="reranking" class="mb-5">
<h2 class="h2 mb-4">Reranking API</h2>

<div class="alert alert-primary border-start border-4 border-primary">
<h3>POST /v1/rerank</h3>
<p class="mb-0">Score and rank a list of candidate documents against a query. Returns documents sorted by relevance (highest score first).</p>
</div>

<!-- Request Schema -->
<h3 class="mt-4">Request Body</h3>
<table class="table table-bordered">
<thead class="table-dark">
<tr>
<th>Field</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>query</code></td>
<td>object</td>
<td>Yes</td>
<td>The query to rank against. Must contain <code>text</code>, <code>image</code>, or both.</td>
</tr>
<tr>
<td><code>query.text</code></td>
<td>string</td>
<td>*</td>
<td>Query text. At least one of <code>text</code> or <code>image</code> required.</td>
</tr>
<tr>
<td><code>query.image</code></td>
<td>string</td>
<td>*</td>
<td>Query image path or URL.</td>
</tr>
<tr>
<td><code>documents</code></td>
<td>array</td>
<td>Yes</td>
<td>Candidate documents to rerank (1 to <code>max_batch_size</code>).</td>
</tr>
<tr>
<td><code>documents[].text</code></td>
<td>string</td>
<td>*</td>
<td>Document text. At least one of <code>text</code> or <code>image</code> required per document.</td>
</tr>
<tr>
<td><code>documents[].image</code></td>
<td>string</td>
<td>*</td>
<td>Document image path or URL.</td>
</tr>
<tr>
<td><code>instruction</code></td>
<td>string</td>
<td>No</td>
<td>Task instruction (e.g. "Retrieve images relevant to the query.").</td>
</tr>
<tr>
<td><code>top_n</code></td>
<td>int</td>
<td>No</td>
<td>Return only the top N results. Default: return all.</td>
</tr>
</tbody>
</table>
|
||||
|
||||
<!-- Response Schema -->
|
||||
<h3 class="mt-4">Response Body</h3>
|
||||
<table class="table table-bordered">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Field</th>
|
||||
<th>Type</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>results[]</code></td>
|
||||
<td>array</td>
|
||||
<td>Documents sorted by relevance score (descending).</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>results[].index</code></td>
|
||||
<td>int</td>
|
||||
<td>Original position of this document in the input array.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>results[].score</code></td>
|
||||
<td>float</td>
|
||||
<td>Relevance score (higher = more relevant).</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>results[].document</code></td>
|
||||
<td>object</td>
|
||||
<td>The document that was ranked (echoed back).</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>usage.query_count</code></td>
|
||||
<td>int</td>
|
||||
<td>Always 1.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>usage.document_count</code></td>
|
||||
<td>int</td>
|
||||
<td>Total documents scored.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>usage.returned_count</code></td>
|
||||
<td>int</td>
|
||||
<td>Number of results returned (respects <code>top_n</code>).</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>usage.elapsed_ms</code></td>
|
||||
<td>float</td>
|
||||
<td>Server-side processing time in milliseconds.</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
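<p class="mt-3">Because <code>results[].index</code> refers to the position in the submitted <code>documents</code> array, a client can use it to map scores back to its own candidate records (IDs, metadata) rather than relying on the echoed document. A minimal sketch, assuming a response shaped as in the table above; the function name <code>attach_scores</code> is hypothetical.</p>

```python
def attach_scores(candidates: list, response: dict) -> list:
    """Return (candidate, score) pairs in the order the reranker returned them."""
    ranked = []
    for result in response["results"]:
        # results[].index is the position in the documents array that was sent,
        # so it also indexes the caller's parallel list of candidates.
        ranked.append((candidates[result["index"]], result["score"]))
    return ranked
```

<p>This only works if <code>candidates</code> is in the same order as the <code>documents</code> array in the request.</p>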
|
||||
|
||||
<!-- Rerank Examples -->
|
||||
<h3 class="mt-4">Example: Text Query → Text Documents</h3>
|
||||
<div class="card my-3">
|
||||
<div class="card-body">
|
||||
<pre class="mb-0">{
  "query": {"text": "How do neural networks learn?"},
  "documents": [
    {"text": "Neural networks adjust weights through backpropagation..."},
    {"text": "The stock market experienced a downturn in Q3..."},
    {"text": "Deep learning uses gradient descent to minimize loss..."},
    {"text": "Photosynthesis converts sunlight into chemical energy..."}
  ],
  "top_n": 2
}</pre>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3 class="mt-4">Example: Text Query → Image Documents</h3>
|
||||
<div class="card my-3">
|
||||
<div class="card-body">
|
||||
<pre class="mb-0">{
  "query": {"text": "melancholy album artwork"},
  "documents": [
    {"image": "/data/covers/cover1.jpg"},
    {"image": "/data/covers/cover2.jpg"},
    {"text": "dark moody painting", "image": "/data/covers/cover3.jpg"}
  ],
  "instruction": "Retrieve images relevant to the query.",
  "top_n": 2
}</pre>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============================================================ -->
|
||||
<!-- DIMENSIONS, BATCHES, PERFORMANCE -->
|
||||
<!-- ============================================================ -->
|
||||
<section id="dimensions" class="mb-5">
|
||||
<h2 class="h2 mb-4">Dimensions, Batches & Performance</h2>
|
||||
|
||||
<div class="alert alert-danger border-start border-4 border-danger">
|
||||
<h3>Matryoshka Dimension Truncation</h3>
|
||||
<p>Synesis uses <strong>Matryoshka Representation Learning (MRL)</strong>. The model always computes full 2048-dimensional vectors internally, then truncates to your requested dimension. This means you can choose a dimension that balances <strong>quality vs. storage/speed</strong>.</p>
|
||||
<table class="table table-bordered mt-3 mb-0">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Dimension</th>
|
||||
<th>Vector Size</th>
|
||||
<th>Quality</th>
|
||||
<th>Use Case</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>2048</code> (default)</td>
|
||||
<td>8 KB / vector (float32)</td>
|
||||
<td>Maximum</td>
|
||||
<td>Highest accuracy retrieval, small collections</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>1024</code></td>
|
||||
<td>4 KB / vector</td>
|
||||
<td>Very high</td>
|
||||
<td>Good balance for most production systems</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>512</code></td>
|
||||
<td>2 KB / vector</td>
|
||||
<td>High</td>
|
||||
<td>Large-scale search with reasonable quality</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>256</code></td>
|
||||
<td>1 KB / vector</td>
|
||||
<td>Good</td>
|
||||
<td>Very large collections, cost-sensitive</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>128</code></td>
|
||||
<td>512 B / vector</td>
|
||||
<td>Moderate</td>
|
||||
<td>Rough filtering, pre-screening</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>64</code></td>
|
||||
<td>256 B / vector</td>
|
||||
<td>Basic</td>
|
||||
<td>Coarse clustering, topic grouping</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
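<p class="mt-3">The storage figures in the table follow from four bytes per float32 component, and MRL-style truncation amounts to keeping the leading components of the full vector. The sketch below illustrates both. It is an assumption that client-side truncation of stored 2048-d vectors matches what the server produces: the source does not say whether Synesis re-normalizes after truncating, so the helper re-normalizes explicitly. Both function names are hypothetical.</p>

```python
import math


def truncate_and_normalize(vector: list, dimension: int) -> list:
    """Keep the first `dimension` components and rescale to unit length."""
    if not 1 <= dimension <= len(vector):
        raise ValueError("dimension must be between 1 and len(vector)")
    head = vector[:dimension]
    norm = math.sqrt(sum(x * x for x in head))
    # A zero vector cannot be normalized; return it unchanged.
    return [x / norm for x in head] if norm > 0 else head


def storage_bytes(dimension: int, count: int = 1, bytes_per_value: int = 4) -> int:
    """Raw float32 storage for `count` vectors, excluding any index overhead."""
    return dimension * bytes_per_value * count
```

<p>For example, <code>storage_bytes(1024, count=1_000_000)</code> gives the raw footprint of one million 1024-d vectors before any vector index overhead is added.</p>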
|
||||
|
||||
<div class="alert alert-warning border-start border-4 border-warning">
|
||||
<h3>Important: Consistency</h3>
|
||||
<p class="mb-0">All vectors in the same index/collection <strong>must use the same dimension</strong>. Choose a dimension at index creation time and use it consistently for both indexing and querying. You cannot mix 512-d and 1024-d vectors in the same vector database index.</p>
|
||||
</div>
|
||||
|
||||
<div class="alert alert-info border-start border-4 border-info">
|
||||
<h3>Batch Size & Microbatching</h3>
|
||||
<p>The <code>max_batch_size</code> setting (default: <strong>32</strong>) controls the maximum number of inputs per API call. The default is tuned for the 24 GB of VRAM on an NVIDIA RTX 3090.</p>
|
||||
<ul>
|
||||
<li><strong>Text-only inputs:</strong> Batch sizes up to 32 are safe.</li>
|
||||
<li><strong>Image inputs:</strong> Images consume significantly more VRAM. Reduce batch sizes to 8–16 when embedding images, depending on resolution.</li>
|
||||
<li><strong>Mixed-modal inputs:</strong> Treat as image batches for sizing purposes.</li>
|
||||
</ul>
|
||||
<h4>Microbatching Strategy</h4>
|
||||
<p>When processing large datasets (thousands of documents), <strong>do not send all items in a single request</strong>. Instead, implement client-side microbatching:</p>
|
||||
<ol class="mb-0">
|
||||
<li>Split your dataset into chunks of 16–32 items.</li>
|
||||
<li>Send each chunk as a separate <code>/v1/embeddings</code> request.</li>
|
||||
<li>Collect and concatenate the resulting vectors.</li>
|
||||
<li>For images, use smaller chunk sizes (8–16) to avoid OOM errors.</li>
|
||||
<li>Add a small delay between requests if processing thousands of items to avoid GPU thermal throttling.</li>
|
||||
</ol>
|
||||
</div>
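<p class="mt-3">The microbatching steps above can be sketched as a small client-side loop. This is an illustrative outline rather than a supplied client: <code>embed_batch</code> is a hypothetical callable that sends one chunk to <code>/v1/embeddings</code> and returns one vector per input, and the default chunk size and delay are placeholders to tune.</p>

```python
import time
from typing import Callable, Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(inputs: list, embed_batch: Callable[[list], list],
              chunk_size: int = 16, delay_s: float = 0.0) -> list:
    """Embed `inputs` chunk by chunk and concatenate the vectors in order."""
    vectors = []
    for chunk in chunked(inputs, chunk_size):
        batch_vectors = embed_batch(chunk)  # one request per chunk
        if len(batch_vectors) != len(chunk):
            raise RuntimeError("embedding count does not match chunk size")
        vectors.extend(batch_vectors)
        if delay_s:
            time.sleep(delay_s)  # optional pause between requests
    return vectors
```

<p>Passing the request function in as a parameter keeps the batching logic testable without a GPU; use a smaller <code>chunk_size</code> (8 to 16) when the inputs contain images.</p>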
|
||||
|
||||
<div class="alert alert-secondary border-start border-4 border-secondary">
|
||||
<h3>Reranking Batch Limits</h3>
|
||||
<p class="mb-0">The reranker also respects <code>max_batch_size</code> for the number of candidate documents. If you have more than 32 candidates, either pre-filter with embeddings first (recommended) or split into multiple rerank calls and merge results.</p>
|
||||
</div>
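<p class="mt-3">If pre-filtering is not an option, the split-and-merge approach might look like the sketch below. The <code>rerank</code> parameter is a hypothetical callable that posts one batch to <code>/v1/rerank</code> and returns its <code>results</code> list. Merging by score assumes that scores from separate calls are directly comparable, which the text above implies but does not guarantee.</p>

```python
from typing import Callable, Optional


def rerank_in_batches(query: dict, documents: list,
                      rerank: Callable[[dict, list], list],
                      batch_size: int = 32,
                      top_n: Optional[int] = None) -> list:
    """Rerank more than `batch_size` documents using several calls."""
    merged = []
    for offset in range(0, len(documents), batch_size):
        batch = documents[offset:offset + batch_size]
        for result in rerank(query, batch):
            # Each call numbers its results from 0, so shift the index
            # back to the document's position in the full list.
            merged.append({**result, "index": result["index"] + offset})
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged if top_n is None else merged[:top_n]
```

<p>Apply <code>top_n</code> only after merging; truncating inside each call could discard documents that would rank highly overall.</p>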
|
||||
</section>
|
||||
|
||||
<!-- ============================================================ -->
|
||||
<!-- INTEGRATION GUIDE -->
|
||||
<!-- ============================================================ -->
|
||||
<section id="integration" class="mb-5">
|
||||
<h2 class="h2 mb-4">Integration Guide</h2>
|
||||
|
||||
<div class="alert alert-primary border-start border-4 border-primary">
|
||||
<h3>Configuring a Consuming Application</h3>
|
||||
<p>To integrate Synesis into another system, configure these settings:</p>
|
||||
<table class="table table-bordered mt-3 mb-0">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Setting</th>
|
||||
<th>Value</th>
|
||||
<th>Notes</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>Embedding API URL</td>
|
||||
<td><code>http://&lt;host&gt;:8400/v1/embeddings</code></td>
|
||||
<td>POST, JSON body</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Rerank API URL</td>
|
||||
<td><code>http://&lt;host&gt;:8400/v1/rerank</code></td>
|
||||
<td>POST, JSON body</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Health check URL</td>
|
||||
<td><code>http://&lt;host&gt;:8400/ready/</code></td>
|
||||
<td>GET, 200 = ready</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Embedding dimension</td>
|
||||
<td><code>2048</code> (or your chosen value)</td>
|
||||
<td>Must match vector DB index config</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Authentication</td>
|
||||
<td>None</td>
|
||||
<td>Secure via network policy</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Content-Type</td>
|
||||
<td><code>application/json</code></td>
|
||||
<td>All endpoints</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Timeout</td>
|
||||
<td>30–60 seconds</td>
|
||||
<td>Image inputs take longer; adjust for batch size</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
<h3 class="mt-4">Python Integration Example</h3>
|
||||
<div class="card my-3">
|
||||
<div class="card-body">
|
||||
<pre class="mb-0">import requests

SYNESIS_URL = "http://synesis-host:8400"
TIMEOUT_S = 60  # image inputs and large batches take longer

# --- Generate embeddings ---
resp = requests.post(f"{SYNESIS_URL}/v1/embeddings", json={
    "inputs": [
        {"text": "How to train a neural network"},
        {"text": "Best practices for deep learning"},
    ],
    "dimension": 1024,
}, timeout=TIMEOUT_S)
resp.raise_for_status()  # fail early on 422/500/503 instead of parsing an error body
data = resp.json()
vectors = [e["embedding"] for e in data["embeddings"]]
# vectors[0] is a list of 1024 floats

# --- Rerank candidates ---
resp = requests.post(f"{SYNESIS_URL}/v1/rerank", json={
    "query": {"text": "neural network training"},
    "documents": [
        {"text": "Backpropagation adjusts weights using gradients..."},
        {"text": "The weather forecast for tomorrow is sunny..."},
        {"text": "Stochastic gradient descent is an optimization method..."},
    ],
    "top_n": 2,
}, timeout=TIMEOUT_S)
resp.raise_for_status()
ranked = resp.json()
# Results arrive sorted by score; "index" points back into "documents".
for result in ranked["results"]:
    print(f"  #{result['index']}  score={result['score']:.4f}")
    print(f"    {result['document']['text'][:80]}")</pre>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3 class="mt-4">Typical Two-Stage Retrieval Pipeline</h3>
|
||||
<div class="alert alert-info border-start border-4 border-info">
|
||||
<ol class="mb-0">
|
||||
<li><strong>Index time:</strong> Embed all documents via <code>/v1/embeddings</code> and store vectors in your vector database (e.g. pgvector, Qdrant, Milvus, Weaviate).</li>
|
||||
<li><strong>Query time, Stage 1 (Recall):</strong> Embed the query via <code>/v1/embeddings</code>, then run an approximate nearest neighbour (ANN) search in the vector DB to retrieve the top 20–50 candidates. If that exceeds <code>max_batch_size</code> (default 32), trim the list or split it across several rerank calls.</li>
<li><strong>Query time, Stage 2 (Precision):</strong> Pass the query and candidates to <code>/v1/rerank</code> to get precise relevance scores. Return the top 5–10 to the user or LLM context.</li>
|
||||
</ol>
|
||||
</div>
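<p class="mt-3">The query-time half of this pipeline can be outlined as one function. This is a sketch under stated assumptions, not the service's own client: <code>embed</code>, <code>ann_search</code>, and <code>rerank</code> are hypothetical callables wrapping <code>/v1/embeddings</code>, your vector database, and <code>/v1/rerank</code> respectively, and <code>rerank</code> is assumed to return the <code>results</code> list already sorted by score.</p>

```python
from typing import Callable


def two_stage_search(query_text: str,
                     embed: Callable[[list], list],
                     ann_search: Callable[[list, int], list],
                     rerank: Callable[[str, list], list],
                     recall_k: int = 30,
                     final_k: int = 5) -> list:
    """Recall candidates by vector similarity, then rerank for precision."""
    # Stage 1 (recall): embed the query and fetch nearest neighbours.
    query_vector = embed([query_text])[0]
    candidates = ann_search(query_vector, recall_k)  # list of document texts
    if not candidates:
        return []
    # Stage 2 (precision): score each query/document pair with the reranker.
    results = rerank(query_text, candidates)
    # Map each result back to its candidate via the returned index.
    return [(candidates[r["index"]], r["score"]) for r in results[:final_k]]
```

<p>The default <code>recall_k</code> of 30 is chosen here only so that a single rerank call stays within the default batch limit of 32.</p>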
|
||||
</section>
|
||||
|
||||
<!-- ============================================================ -->
|
||||
<!-- SIMILARITY -->
|
||||
<!-- ============================================================ -->
|
||||
<section id="similarity" class="mb-5">
|
||||
<h2 class="h2 mb-4">Similarity API</h2>
|
||||
|
||||
<div class="alert alert-primary border-start border-4 border-primary">
|
||||
<h3>POST /v1/similarity</h3>
|
||||
<p class="mb-0">Compute cosine similarity between exactly two inputs. This is a convenience wrapper: it embeds both inputs, normalizes the resulting vectors, and returns their dot product.</p>
|
||||
</div>
|
||||
|
||||
<h3 class="mt-4">Request Body</h3>
|
||||
<table class="table table-bordered">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Field</th>
|
||||
<th>Type</th>
|
||||
<th>Required</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>a</code></td>
|
||||
<td>object</td>
|
||||
<td>Yes</td>
|
||||
<td>First input (<code>text</code>, <code>image</code>, or both).</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>b</code></td>
|
||||
<td>object</td>
|
||||
<td>Yes</td>
|
||||
<td>Second input (<code>text</code>, <code>image</code>, or both).</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>dimension</code></td>
|
||||
<td>int</td>
|
||||
<td>No</td>
|
||||
<td>Embedding dimension for comparison (64–2048). Default: 2048.</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<h3 class="mt-4">Response Body</h3>
|
||||
<table class="table table-bordered">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Field</th>
|
||||
<th>Type</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>score</code></td>
|
||||
<td>float</td>
|
||||
<td>Cosine similarity (−1.0 to 1.0). Higher = more similar.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>dimension</code></td>
|
||||
<td>int</td>
|
||||
<td>Dimension used for the comparison.</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
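<p class="mt-3">When both vectors are already stored, the same score can be computed locally without calling the endpoint. The sketch below is a plain-Python illustration of cosine similarity; the function name is hypothetical, and it is an assumption that stored vectors were produced at the same dimension as the one you would pass to <code>/v1/similarity</code>.</p>

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

<p>Dividing by both norms is equivalent to normalizing each vector first and then taking the dot product, which is the procedure the endpoint description gives.</p>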
|
||||
</section>
|
||||
|
||||
<!-- ============================================================ -->
|
||||
<!-- OPERATIONS -->
|
||||
<!-- ============================================================ -->
|
||||
<section id="operations" class="mb-5">
|
||||
<h2 class="h2 mb-4">Operations & Monitoring</h2>
|
||||
|
||||
<div class="alert alert-info border-start border-4 border-info">
|
||||
<h3>Health & Readiness Endpoints</h3>
|
||||
<table class="table table-bordered mt-3 mb-0">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Endpoint</th>
|
||||
<th>Method</th>
|
||||
<th>Purpose</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>/ready/</code></td>
|
||||
<td>GET</td>
|
||||
<td>Readiness probe. Returns 200 when both models are loaded and the GPU is available, and 503 otherwise. Use for load balancer health checks.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>/live/</code></td>
|
||||
<td>GET</td>
|
||||
<td>Liveness probe. Returns 200 if the process is alive. Use for container restart decisions.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>/health</code></td>
|
||||
<td>GET</td>
|
||||
<td>Detailed status: model paths, loaded state, GPU device name, VRAM usage.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>/models</code><br><code>/v1/models</code></td>
|
||||
<td>GET</td>
|
||||
<td>List available models (OpenAI-compatible). Returns model IDs, capabilities, and metadata. Used by OpenAI SDK clients for model discovery.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>/metrics</code></td>
|
||||
<td>GET</td>
|
||||
<td>Prometheus metrics (request counts, latency histograms, GPU memory, model status).</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
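<p class="mt-3">A consuming application can poll <code>/ready/</code> at startup before sending work. The following is a minimal sketch using only the standard library; the function names, timeout, and polling interval are illustrative assumptions rather than recommended values.</p>

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> Callable[[], int]:
    """Return a callable that GETs `url` and reports the HTTP status code."""
    def probe() -> int:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # e.g. 503 while models are still loading
    return probe


def wait_until_ready(probe: Callable[[], int], timeout_s: float = 120.0,
                     interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the probe returns 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # connection refused: the process may not be listening yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

<p>A typical call would be <code>wait_until_ready(http_probe("http://synesis-host:8400/ready/"))</code>, reusing the placeholder host name from the integration example.</p>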
|
||||
|
||||
<div class="alert alert-warning border-start border-4 border-warning">
|
||||
<h3>Prometheus Metrics</h3>
|
||||
<p>Key custom metrics exposed:</p>
|
||||
<ul class="mb-0">
|
||||
<li><code>embedding_model_loaded</code> — Gauge (1 = loaded)</li>
|
||||
<li><code>reranker_model_loaded</code> — Gauge (1 = loaded)</li>
|
||||
<li><code>embedding_gpu_memory_bytes</code> — Gauge (current GPU allocation)</li>
|
||||
<li><code>embedding_inference_requests_total{endpoint}</code> — Counter per endpoint (embeddings, similarity, rerank)</li>
|
||||
<li><code>embedding_inference_duration_seconds{endpoint}</code> — Histogram of inference latency</li>
|
||||
<li>Plus standard HTTP metrics from <code>prometheus-fastapi-instrumentator</code></li>
|
||||
</ul>
|
||||
</div>
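<p class="mt-3">For a quick check outside a full Prometheus setup, the model gauges can be read straight from the <code>/metrics</code> text. This sketch assumes the gauges are exposed as unlabelled samples in the standard text exposition format (<code>name value</code>); if the service attaches labels, the parsing would need to change. Function names are hypothetical.</p>

```python
def read_gauges(metrics_text: str, names: set) -> dict:
    """Extract unlabelled samples whose metric name is in `names`."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        # Unlabelled sample lines look like: metric_name value [timestamp]
        if len(parts) >= 2 and parts[0] in names:
            values[parts[0]] = float(parts[1])
    return values


def models_loaded(metrics_text: str) -> bool:
    """True only if both model gauges are present and equal to 1."""
    wanted = {"embedding_model_loaded", "reranker_model_loaded"}
    gauges = read_gauges(metrics_text, wanted)
    return all(gauges.get(name) == 1.0 for name in wanted)
```

<p>For routine health checks the <code>/ready/</code> endpoint remains the simpler signal; parsing metrics is mainly useful when you also want values such as <code>embedding_gpu_memory_bytes</code>.</p>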
|
||||
|
||||
<div class="alert alert-secondary border-start border-4 border-secondary">
|
||||
<h3>Environment Configuration</h3>
|
||||
<p>All settings use the <code>EMBEDDING_</code> prefix and can be overridden via environment variables or <code>/etc/default/synesis</code>:</p>
|
||||
<table class="table table-bordered mt-3 mb-0">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Variable</th>
|
||||
<th>Default</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_MODEL_PATH</code></td>
|
||||
<td><code>./models/Qwen3-VL-Embedding-2B</code></td>
|
||||
<td>Path to embedding model weights</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_RERANKER_MODEL_PATH</code></td>
|
||||
<td><code>./models/Qwen3-VL-Reranker-2B</code></td>
|
||||
<td>Path to reranker model weights</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_TORCH_DTYPE</code></td>
|
||||
<td><code>float16</code></td>
|
||||
<td>Model precision (<code>float16</code> or <code>bfloat16</code>)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_USE_FLASH_ATTENTION</code></td>
|
||||
<td><code>true</code></td>
|
||||
<td>Enable Flash Attention 2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_DEFAULT_DIMENSION</code></td>
|
||||
<td><code>2048</code></td>
|
||||
<td>Default embedding dimension when not specified per request</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_MAX_BATCH_SIZE</code></td>
|
||||
<td><code>32</code></td>
|
||||
<td>Maximum inputs per request (both embeddings and rerank)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_HOST</code></td>
|
||||
<td><code>0.0.0.0</code></td>
|
||||
<td>Bind address</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>EMBEDDING_PORT</code></td>
|
||||
<td><code>8400</code></td>
|
||||
<td>Listen port</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============================================================ -->
|
||||
<!-- ERROR HANDLING -->
|
||||
<!-- ============================================================ -->
|
||||
<section id="errors" class="mb-5">
|
||||
<h2 class="h2 mb-4">Error Handling</h2>
|
||||
|
||||
<div class="alert alert-danger border-start border-4 border-danger">
|
||||
<h3>HTTP Status Codes</h3>
|
||||
<table class="table table-bordered mt-3 mb-0">
|
||||
<thead class="table-dark">
|
||||
<tr>
|
||||
<th>Code</th>
|
||||
<th>Meaning</th>
|
||||
<th>Action</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>200</code></td>
|
||||
<td>Success</td>
|
||||
<td>Process the response.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>422</code></td>
|
||||
<td>Validation error</td>
|
||||
<td>Check your request body. Batch size may exceed <code>max_batch_size</code>, or required fields are missing.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>500</code></td>
|
||||
<td>Inference error</td>
|
||||
<td>Model failed during processing. Check server logs. May indicate OOM with large image batches.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>503</code></td>
|
||||
<td>Model not loaded</td>
|
||||
<td>Service is starting up or a model failed to load. Retry after checking <code>/ready/</code>.</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
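<p class="mt-3">The actions in the table can be turned into a small retry policy: retry on 503, fail fast on 422, and surface anything else. This is a sketch of one possible client policy, not behaviour defined by Synesis; <code>send</code> is a hypothetical callable returning <code>(status_code, body)</code>, and the attempt count and backoff values are placeholders.</p>

```python
import time
from typing import Callable


def call_with_retry(send: Callable[[], tuple], attempts: int = 5,
                    base_delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep):
    """Call send() until it succeeds, retrying only on HTTP 503."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status == 200:
            return body
        if status == 422:
            # The request itself is invalid; retrying cannot help.
            raise ValueError(f"request rejected by validation: {body}")
        if status == 503 and attempt + 1 < attempts:
            sleep(base_delay_s * 2 ** attempt)  # exponential backoff
            continue
        # 500, an exhausted 503 budget, or any unexpected status.
        raise RuntimeError(f"Synesis request failed with HTTP {status}")
```

<p>A 500 is deliberately not retried here: if it stems from an out-of-memory condition, resending the same batch is likely to fail again, and reducing the batch size is the more useful response.</p>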
|
||||
</section>
|
||||
|
||||
<!-- Footer -->
|
||||
<div class="alert alert-secondary border-start border-4 border-secondary">
|
||||
<p class="mb-0"><strong>Synesis v0.2.0</strong> — Qwen3-VL Embedding & Reranking Service. For interactive API exploration, visit <code>/docs</code> on the running service.</p>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- Bootstrap JS -->
|
||||
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
|
||||
|
||||
<!-- Mermaid init -->
|
||||
<script>
|
||||
mermaid.initialize({
|
||||
startOnLoad: true,
|
||||
theme: window.matchMedia('(prefers-color-scheme: dark)').matches ? 'dark' : 'default'
|
||||
});
|
||||
</script>
|
||||
|
||||
<!-- Dark mode support -->
|
||||
<script>
|
||||
if (window.matchMedia('(prefers-color-scheme: dark)').matches) {
|
||||
document.documentElement.setAttribute('data-bs-theme', 'dark');
|
||||
}
|
||||
window.matchMedia('(prefers-color-scheme: dark)').addEventListener('change', function(e) {
|
||||
document.documentElement.setAttribute('data-bs-theme', e.matches ? 'dark' : 'light');
|
||||
});
|
||||
</script>
|
||||
</body>
|
||||
</html>