Add vision analysis capabilities to the embedding pipeline

- Introduced a new vision analysis service to classify, describe, and extract text from images.
- Enhanced the Image model with fields for OCR text, vision model name, and analysis status.
- Added a new "nonfiction" library type with specific chunking and embedding configurations.
- Updated content types to include vision prompts for various library types.
- Integrated vision analysis into the embedding pipeline, allowing for image analysis during document processing.
- Implemented metrics to track vision analysis performance and usage.
- Updated UI components to display vision analysis results and statuses in item details and the embedding dashboard.
- Added migration for new vision model fields and usage tracking.
2026-03-22 15:14:34 +00:00
parent 6585beed20
commit 90db904959
11 changed files with 914 additions and 19 deletions

View File

@@ -0,0 +1,244 @@
# Phase 2B: Vision Analysis Pipeline
## Objective
Add vision-based image understanding to the embedding pipeline: when documents are processed and images extracted, use a vision-capable LLM to classify, describe, extract text from, and identify concepts within each image — connecting images into the knowledge graph as first-class participants alongside text.
## Heritage
Extends Phase 2's image extraction (PyMuPDF) and multimodal embedding (Qwen3-VL) with structured understanding. Previously, images were stored and optionally embedded into vector space, but had no description, no classification beyond a hardcoded default, and no concept graph integration. Phase 2B makes images *understood*.
## Architecture Overview
```
Image Extracted (Phase 2)
    → Stored in S3 + Image node in Neo4j
    → Vision Analysis (NEW - Phase 2B)
        → Classify image type (diagram, photo, chart, etc.)
        → Generate natural language description
        → Extract visible text (OCR)
        → Identify concepts depicted
        → Create DEPICTS relationships to Concept nodes
        → Connect Item to image-derived concepts via REFERENCES
    → Multimodal Embedding (Phase 2, now enhanced)
```
## How It Fits the Graph
Vision analysis enriches images so they participate in the knowledge graph the same way text chunks do:
```
Item ─[HAS_IMAGE]──→ Image
        ├── description: "Wiring diagram showing 3-phase motor connection"
        ├── image_type: "diagram" (auto-classified)
        ├── ocr_text: "L1 L2 L3 ..."
        ├── vision_model_name: "Qwen3-VL-72B"
        ├──[DEPICTS]──→ Concept("3-phase motor")
        ├──[DEPICTS]──→ Concept("wiring diagram")
        └──[DEPICTS]──→ Concept("electrical connection")
                          └──[RELATED_TO]──→ Concept("motor control")
Chunk text also ──[MENTIONS]──┘
```
Three relationship types connect content to concepts:
- `Chunk ─[MENTIONS]─→ Concept` — text discusses this concept
- `Item ─[REFERENCES]─→ Concept` — item is about this concept
- `Image ─[DEPICTS]─→ Concept` — image visually shows this concept
Concepts extracted from images merge with the **same Concept nodes** extracted from text via deduplication by name. This means graph traversal discovers cross-modal connections automatically.
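A minimal sketch of that name-based deduplication (the helper names here are illustrative, not the service's actual API; the real service lowercases names during validation):

```python
def normalize_concept_name(name: str) -> str:
    # Lowercase and collapse whitespace so "3-Phase  Motor" and
    # "3-phase motor" resolve to the same Concept node key.
    return " ".join(name.lower().split())


def cross_modal_concepts(text_concepts: list, image_concepts: list) -> list:
    # Concepts reachable from both a text chunk (MENTIONS) and an
    # image (DEPICTS); these are the cross-modal bridge nodes.
    text_keys = {normalize_concept_name(n) for n in text_concepts}
    image_keys = {normalize_concept_name(n) for n in image_concepts}
    return sorted(text_keys & image_keys)
```

A graph traversal that starts at an image's DEPICTS targets and follows MENTIONS edges back to chunks lands exactly on these shared nodes.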
## Deliverables
### 1. System Vision Model (`llm_manager/models.py`)
New `is_system_vision_model` boolean field on `LLMModel`, following the same pattern as the existing system embedding, chat, and reranker models.
```python
is_system_vision_model = models.BooleanField(
    default=False,
    help_text="Mark as the system-wide vision model for image analysis."
)

@classmethod
def get_system_vision_model(cls):
    return cls.objects.filter(
        is_system_vision_model=True,
        is_active=True,
        model_type__in=["vision", "chat"],  # Vision-capable chat models work
    ).first()
```
Added `"vision_analysis"` to `LLMUsage.purpose` choices for cost tracking.
### 2. Image Model Enhancements (`library/models.py`)
New fields on the `Image` node:
| Field | Type | Purpose |
|-------|------|---------|
| `ocr_text` | StringProperty | Visible text extracted by vision model |
| `vision_model_name` | StringProperty | Which model analyzed this image |
| `analysis_status` | StringProperty | pending / completed / failed / skipped |
Expanded `image_type` choices: cover, diagram, chart, table, screenshot, illustration, map, portrait, artwork, still, photo.
New relationship: `Image ─[DEPICTS]─→ Concept`
### 3. Content-Type Vision Prompts (`library/content_types.py`)
Each library type now includes a `vision_prompt` that shapes what the vision model looks for:
| Library Type | Vision Focus |
|---|---|
| **Fiction** | Illustrations, cover art, characters, scenes, artistic style |
| **Non-Fiction** | Photographs, maps, charts, people, places, historical context |
| **Technical** | Diagrams, schematics, charts, tables, labels, processes |
| **Music** | Album covers, band photos, liner notes, era/aesthetic |
| **Film** | Stills, posters, storyboards, cinematographic elements |
| **Art** | Medium, style, subject, composition, artistic period |
| **Journal** | Photos, sketches, documents, dates, context clues |
### 4. Vision Analysis Service (`library/services/vision.py`)
New service: `VisionAnalyzer` — analyzes images via the system vision model.
#### API Call Format
Uses OpenAI-compatible multimodal chat format:
```python
{
"model": "qwen3-vl-72b",
"messages": [
{"role": "system", "content": "<structured JSON output instructions>"},
{"role": "user", "content": [
{"type": "text", "text": "<content-type-aware vision prompt>"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]}
],
"temperature": 0.1,
"max_tokens": 800
}
```
#### Response Structure
The vision model returns structured JSON:
```json
{
"image_type": "diagram",
"description": "A wiring diagram showing a 3-phase motor connection with L1, L2, L3 inputs",
"ocr_text": "L1 L2 L3 GND PE 400V",
"concepts": [
{"name": "3-phase motor", "type": "topic"},
{"name": "wiring diagram", "type": "technique"}
]
}
```
#### Processing Flow
For each image:
1. Read image bytes from S3
2. Base64-encode and send to vision model with content-type-aware prompt
3. Parse structured JSON response
4. Validate and normalize (image_type must be valid, concepts capped at 20)
5. Update Image node (description, ocr_text, image_type, vision_model_name, analysis_status)
6. Create/connect Concept nodes via DEPICTS relationship
7. Also connect Item → Concept via REFERENCES (weight 0.8)
8. Log usage to LLMUsage
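Steps 1-2 boil down to encoding the image bytes as a data URL for the multimodal message. A minimal sketch (the helper name is illustrative; the service inlines this logic):

```python
import base64


def to_data_url(image_bytes: bytes, ext: str = "png") -> str:
    # "jpg" is not a valid MIME subtype; other extensions map directly
    mime = "image/jpeg" if ext in ("jpg", "jpeg") else f"image/{ext}"
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

The returned string drops straight into the `image_url` content part of the chat request shown above.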
### 5. Pipeline Integration (`library/services/pipeline.py`)
Vision analysis is Stage 5.5 in the pipeline:
```
Stage 5: Store images in S3 + Neo4j (existing)
Stage 5.5: Vision analysis (NEW)
Stage 6: Embed images multimodally (existing)
Stage 7: Concept extraction from text (existing)
```
Behavior:
- If system vision model configured → analyze all images
- If no vision model → mark images as `analysis_status="skipped"`, continue pipeline
- Vision analysis failures are per-image (don't fail the whole pipeline)
### 6. Non-Fiction Library Type
New `"nonfiction"` library type added alongside the existing six types:
| Setting | Value |
|---------|-------|
| Strategy | `section_aware` |
| Chunk Size | 768 |
| Chunk Overlap | 96 |
| Boundaries | chapter, section, paragraph |
| Focus | Factual claims, historical events, people, places, arguments, evidence |
### 7. Prometheus Metrics (`library/metrics.py`)
| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| `mnemosyne_vision_analyses_total` | Counter | status | Images analyzed |
| `mnemosyne_vision_analysis_duration_seconds` | Histogram | — | Per-image analysis latency |
| `mnemosyne_vision_concepts_extracted_total` | Counter | concept_type | Concepts from images |
### 8. Dashboard & UI Updates
- Embedding dashboard shows system vision model status alongside embedding, chat, and reranker models
- Item detail page shows enriched image cards with:
- Auto-classified image type badge
- Vision-generated description
- Collapsible OCR text section
- Analysis status indicator
- Vision model name reference
## File Structure
```
mnemosyne/library/
├── services/
│   ├── vision.py                    # NEW — VisionAnalyzer service
│   ├── pipeline.py                  # Modified — Stage 5.5 integration
│   └── ...
├── models.py                        # Modified — Image fields, DEPICTS rel, nonfiction type
├── content_types.py                 # Modified — vision_prompt for all 7 types
├── metrics.py                       # Modified — vision analysis metrics
├── views.py                         # Modified — vision model in dashboard context
└── templates/library/
    ├── item_detail.html             # Modified — enriched image display
    └── embedding_dashboard.html     # Modified — vision model row

mnemosyne/llm_manager/
├── models.py                        # Modified — is_system_vision_model, get_system_vision_model()
└── migrations/
    └── 0003_add_vision_model_and_usage.py  # NEW
```
## Performance Considerations
- Each image = one vision model inference (~2-5 seconds on local GPU)
- A document with 20 images = ~40-100 seconds of extra processing
- Runs in Celery async tasks — does not block web requests
- Uses the same GPU infrastructure already serving embedding and reranking
- Zero API cost when running locally
- Per-image failure isolation — one bad image doesn't fail the pipeline
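The latency math above is simple enough to keep as a planning helper (a sketch; the 2-5 s per-image range is the local-GPU estimate quoted above, not a measured constant):

```python
def vision_overhead_seconds(n_images: int, per_image_s=(2.0, 5.0)) -> tuple:
    # Images are analyzed sequentially, one inference each, so the added
    # wall-clock time scales linearly with the image count.
    low, high = per_image_s
    return (n_images * low, n_images * high)
```

For the 20-image document above, this gives the quoted 40-100 second range.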
## Success Criteria
- [ ] System vision model configurable via Django admin (same pattern as other system models)
- [ ] Images auto-classified with correct image_type (not hardcoded "diagram")
- [ ] Vision-generated descriptions visible on item detail page
- [ ] OCR text extracted from images with visible text
- [ ] Concepts extracted from images connected to Concept nodes via DEPICTS
- [ ] Shared concepts bridge text chunks and images in the graph
- [ ] Pipeline gracefully skips vision analysis when no vision model configured
- [ ] Non-fiction library type available for history, biography, essays, etc.
- [ ] Prometheus metrics track vision analysis throughput and latency
- [ ] Dashboard shows vision model status

View File

@@ -2,7 +2,7 @@
Content-type system configuration for Mnemosyne library types.
Each library type has a default configuration that governs chunking,
embedding, re-ranking, and LLM context injection.
embedding, re-ranking, LLM context injection, and vision analysis prompts.
"""
# Default configurations per library type.
@@ -30,6 +30,43 @@ LIBRARY_TYPE_DEFAULTS = {
"characters, settings, and events are fictional. Cite specific passages "
"when answering."
),
"vision_prompt": (
"Analyze this image from a work of fiction. Identify:\n"
"1) Image type (illustration, cover, map, portrait, scene).\n"
"2) What it depicts — characters, scenes, settings.\n"
"3) Any visible text, titles, or captions.\n"
"4) The artistic style and mood."
),
},
"nonfiction": {
"chunking_config": {
"strategy": "section_aware",
"chunk_size": 768,
"chunk_overlap": 96,
"respect_boundaries": ["chapter", "section", "paragraph"],
},
"embedding_instruction": (
"Represent this passage from a non-fiction work for retrieval. "
"Focus on factual claims, historical events, people, places, "
"arguments, and supporting evidence."
),
"reranker_instruction": (
"Re-rank passages from non-fiction based on factual relevance. "
"Prioritize historical events, arguments, evidence, and key figures."
),
"llm_context_prompt": (
"The following excerpts are from non-fiction (history, biography, essays, "
"science writing, philosophy, etc.). Treat claims as the author's perspective "
"— note when interpretations may vary. Cite specific passages and distinguish "
"fact from analysis."
),
"vision_prompt": (
"Analyze this image from a non-fiction work. Identify:\n"
"1) Image type (photograph, map, chart, illustration, portrait, document).\n"
"2) What it depicts — people, places, events, data.\n"
"3) Any visible text, labels, captions, or dates.\n"
"4) Historical or factual context."
),
},
"technical": {
"chunking_config": {
@@ -52,6 +89,13 @@ LIBRARY_TYPE_DEFAULTS = {
"reference material). Provide precise, actionable answers. Include code "
"examples and exact configurations when available. Cite source sections."
),
"vision_prompt": (
"Analyze this image from technical documentation. Identify:\n"
"1) Image type (diagram, chart, table, screenshot, schematic, flowchart).\n"
"2) What it depicts — components, processes, architecture, data.\n"
"3) Any visible text, labels, values, or annotations.\n"
"4) The technical concept or procedure being illustrated."
),
},
"music": {
"chunking_config": {
@@ -73,6 +117,13 @@ LIBRARY_TYPE_DEFAULTS = {
"Consider the artistic and cultural context. Reference specific "
"songs, albums, and artists when answering."
),
"vision_prompt": (
"Analyze this image from a music collection. Identify:\n"
"1) Image type (album cover, liner notes, band photo, concert photo, logo).\n"
"2) What it depicts — artwork style, imagery, people.\n"
"3) Any visible text, band names, album titles, or track listings.\n"
"4) The era, aesthetic, and genre associations."
),
},
"film": {
"chunking_config": {
@@ -94,6 +145,13 @@ LIBRARY_TYPE_DEFAULTS = {
"reviews). Consider the cinematic context — visual storytelling, "
"direction, and performance. Cite specific scenes and films."
),
"vision_prompt": (
"Analyze this image from a film or film-related content. Identify:\n"
"1) Image type (movie still, poster, storyboard, behind-the-scenes, screenshot).\n"
"2) What it depicts — scene, characters, setting, action.\n"
"3) Any visible text, titles, or credits.\n"
"4) Cinematographic elements — composition, lighting, mood."
),
},
"art": {
"chunking_config": {
@@ -115,6 +173,13 @@ LIBRARY_TYPE_DEFAULTS = {
"Consider visual elements, artistic technique, historical context, "
"and the artist's intent. Reference specific works and movements."
),
"vision_prompt": (
"Describe this artwork in detail. Identify:\n"
"1) Image type (painting, sculpture, photograph, print, drawing, mixed media).\n"
"2) Subject matter — what the artwork depicts.\n"
"3) Style, medium, and technique.\n"
"4) Composition, color palette, mood, and artistic period or movement."
),
},
"journal": {
"chunking_config": {
@@ -137,6 +202,13 @@ LIBRARY_TYPE_DEFAULTS = {
"This is private, reflective content. Respect the personal nature — "
"answer with sensitivity. Note dates and temporal context when relevant."
),
"vision_prompt": (
"Analyze this image from a personal journal. Identify:\n"
"1) Image type (photograph, sketch, document, receipt, ticket, screenshot).\n"
"2) What it depicts — people, places, events, objects.\n"
"3) Any visible text, dates, or handwriting.\n"
"4) Context clues about when and where this was taken or created."
),
},
}
@@ -146,11 +218,12 @@ def get_library_type_config(library_type):
Get the default configuration for a library type.
Args:
library_type: One of 'fiction', 'technical', 'music', 'film', 'art', 'journal'
library_type: One of 'fiction', 'nonfiction', 'technical', 'music',
'film', 'art', 'journal'
Returns:
dict with keys: chunking_config, embedding_instruction,
reranker_instruction, llm_context_prompt
reranker_instruction, llm_context_prompt, vision_prompt
Raises:
ValueError: If library_type is not recognized

View File

@@ -88,6 +88,24 @@ CONCEPTS_EXTRACTED_TOTAL = Counter(
["concept_type"],
)
# --- Vision Analysis (Phase 2B) ---
VISION_ANALYSES_TOTAL = Counter(
"mnemosyne_vision_analyses_total",
"Total images analyzed by vision model",
["status"],
)
VISION_ANALYSIS_DURATION = Histogram(
"mnemosyne_vision_analysis_duration_seconds",
"Time to analyze a single image with the vision model",
buckets=[0.5, 1, 2, 5, 10, 20, 30, 60],
)
VISION_CONCEPTS_EXTRACTED_TOTAL = Counter(
"mnemosyne_vision_concepts_extracted_total",
"Concepts extracted from images by vision analysis",
["concept_type"],
)
# --- System State ---
EMBEDDING_QUEUE_SIZE = Gauge(

View File

@@ -60,6 +60,7 @@ class Library(StructuredNode):
required=True,
choices={
"fiction": "Fiction",
"nonfiction": "Non-Fiction",
"technical": "Technical",
"music": "Music",
"film": "Film",
@@ -219,6 +220,12 @@ class Image(StructuredNode):
choices={
"cover": "Cover",
"diagram": "Diagram",
"chart": "Chart",
"table": "Table",
"screenshot": "Screenshot",
"illustration": "Illustration",
"map": "Map",
"portrait": "Portrait",
"artwork": "Artwork",
"still": "Still",
"photo": "Photo",
@@ -227,10 +234,24 @@ class Image(StructuredNode):
description = StringProperty(default="")
metadata = JSONProperty(default={})
# Vision analysis fields (Phase 2B)
ocr_text = StringProperty(default="") # Visible text extracted by vision model
vision_model_name = StringProperty(default="") # Which vision model analyzed this
analysis_status = StringProperty(
default="pending",
choices={
"pending": "Pending",
"completed": "Completed",
"failed": "Failed",
"skipped": "Skipped",
},
)
created_at = DateTimeProperty(default_now=True)
# Relationships
embeddings = RelationshipTo("ImageEmbedding", "HAS_EMBEDDING")
concepts = RelationshipTo("Concept", "DEPICTS")
def __str__(self):
return f"Image {self.image_type} ({self.uid})"

View File

@@ -154,6 +154,7 @@ class EmbeddingPipeline:
"chunks_created": 0,
"chunks_embedded": 0,
"images_stored": 0,
"images_analyzed": 0,
"images_embedded": 0,
"concepts_extracted": 0,
"model_name": "",
@@ -238,7 +239,29 @@ class EmbeddingPipeline:
)
if progress_callback:
progress_callback(80, "Embedding images")
progress_callback(75, "Analyzing images with vision model")
# --- Stage 5.5: Vision analysis (Phase 2B) ---
vision_model = LLMModel.get_system_vision_model()
if vision_model and image_nodes:
from .vision import VisionAnalyzer
vision_prompt = self._get_vision_prompt(library)
analyzer = VisionAnalyzer(vision_model, user=self.user)
images_analyzed = analyzer.analyze_images(
image_nodes,
vision_prompt=vision_prompt,
item=item,
)
result["images_analyzed"] = images_analyzed
elif image_nodes:
# No vision model — mark images as skipped
for img_node in image_nodes:
img_node.analysis_status = "skipped"
img_node.save()
if progress_callback:
progress_callback(85, "Embedding images")
# --- Stage 6: Embed images (multimodal) ---
if image_nodes and embedding_model.supports_multimodal:
@@ -547,6 +570,24 @@ class EmbeddingPipeline:
return embedded_count
def _get_vision_prompt(self, library) -> str:
"""
Get the content-type-aware vision prompt for a library.
:param library: Library node, or None.
:returns: Vision prompt string, or empty string.
"""
if not library:
return ""
try:
from library.content_types import get_library_type_config
config = get_library_type_config(library.library_type)
return config.get("vision_prompt", "")
except Exception:
return ""
def _check_dimension_compatibility(self, model_dimensions: int):
"""
Check if the model's vector dimensions match the Neo4j index.

View File

@@ -0,0 +1,386 @@
"""
Vision analysis service for image understanding (Phase 2B).
Uses the system vision model to analyze extracted images: classify type,
generate descriptions, extract visible text (OCR), and identify concepts
that connect images into the knowledge graph via DEPICTS relationships.
"""
import base64
import json
import logging
import re
import time
from typing import Optional
import requests
from library.metrics import (
VISION_ANALYSES_TOTAL,
VISION_ANALYSIS_DURATION,
VISION_CONCEPTS_EXTRACTED_TOTAL,
)
logger = logging.getLogger(__name__)
# Valid image_type values from the Image model
VALID_IMAGE_TYPES = {
"cover", "diagram", "chart", "table", "screenshot",
"illustration", "map", "portrait", "artwork", "still", "photo",
}
# System prompt for structured vision analysis
VISION_SYSTEM_PROMPT = (
"You are an image analysis assistant. Analyze the image and return a JSON object "
"with the following fields:\n"
'- "image_type": one of: cover, diagram, chart, table, screenshot, '
"illustration, map, portrait, artwork, still, photo\n"
'- "description": a concise 1-3 sentence description of what the image shows\n'
'- "ocr_text": any visible text in the image (empty string if none)\n'
'- "concepts": array of objects with "name" (lowercase) and "type" '
'(one of: person, place, topic, technique, theme)\n\n'
"Return ONLY the JSON object, no other text."
)
class VisionAnalyzer:
"""
Analyzes images using a vision-capable LLM to extract structured metadata.
For each image, produces:
- image_type classification
- natural language description
- OCR text extraction
- concept entities for knowledge graph integration
Requires a system vision model configured in LLM Manager.
"""
def __init__(self, vision_model, user=None):
"""
:param vision_model: LLMModel instance for the vision model.
:param user: Optional Django user for usage tracking.
"""
self.vision_model = vision_model
self.api = vision_model.api
self.user = user
self.base_url = self.api.base_url.rstrip("/")
self.model_name = self.vision_model.name
self.timeout = self.api.timeout_seconds or 120
logger.info(
"VisionAnalyzer initialized model=%s api=%s",
self.model_name,
self.api.name,
)
def analyze_images(
self,
image_nodes: list,
vision_prompt: str = "",
item=None,
) -> int:
"""
Analyze a list of Image nodes and update them with vision results.
:param image_nodes: List of Image node instances (with s3_key populated).
:param vision_prompt: Content-type-aware prompt from the Library config.
:param item: Optional Item node to connect image concepts via REFERENCES.
:returns: Number of images successfully analyzed.
"""
from django.core.files.storage import default_storage
analyzed_count = 0
for img_node in image_nodes:
try:
# Read image data from S3
img_data = default_storage.open(img_node.s3_key, "rb").read()
ext = img_node.s3_key.rsplit(".", 1)[-1] if "." in img_node.s3_key else "png"
result = self._analyze_single_image(img_data, ext, vision_prompt)
if result:
self._apply_result(img_node, result, item)
analyzed_count += 1
VISION_ANALYSES_TOTAL.labels(status="success").inc()
else:
img_node.analysis_status = "failed"
img_node.save()
VISION_ANALYSES_TOTAL.labels(status="failed").inc()
except Exception as exc:
logger.warning(
"Vision analysis failed for image s3_key=%s: %s",
img_node.s3_key,
exc,
)
img_node.analysis_status = "failed"
img_node.save()
VISION_ANALYSES_TOTAL.labels(status="failed").inc()
if analyzed_count:
logger.info("Vision analysis completed: %d/%d images", analyzed_count, len(image_nodes))
return analyzed_count
def _analyze_single_image(
self,
image_data: bytes,
image_ext: str,
vision_prompt: str,
) -> Optional[dict]:
"""
Send a single image to the vision model for analysis.
:param image_data: Raw image bytes.
:param image_ext: Image format extension (png, jpg, etc.).
:param vision_prompt: Content-type-aware analysis prompt.
:returns: Parsed result dict, or None on failure.
"""
b64 = base64.b64encode(image_data).decode("utf-8")
mime_type = f"image/{image_ext}" if image_ext != "jpg" else "image/jpeg"
# Build the user prompt with content-type context
user_prompt = vision_prompt if vision_prompt else "Analyze this image."
start_time = time.time()
try:
response_text = self._call_vision_model(b64, mime_type, user_prompt)
elapsed = time.time() - start_time
VISION_ANALYSIS_DURATION.observe(elapsed)
result = self._parse_vision_response(response_text)
logger.debug(
"Vision analysis completed in %.2fs image_type=%s description_len=%d",
elapsed,
result.get("image_type", "unknown") if result else "failed",
len(result.get("description", "")) if result else 0,
)
return result
except Exception as exc:
elapsed = time.time() - start_time
VISION_ANALYSIS_DURATION.observe(elapsed)
logger.warning("Vision model call failed: %s", exc)
return None
def _call_vision_model(
self,
b64_image: str,
mime_type: str,
user_prompt: str,
) -> str:
"""
Make a chat completion request with an image to the vision model.
Uses OpenAI-compatible multimodal chat format.
:param b64_image: Base64-encoded image data.
:param mime_type: MIME type of the image.
:param user_prompt: Text prompt to accompany the image.
:returns: Response text from the model.
"""
url = f"{self.base_url}/chat/completions"
headers = {"Content-Type": "application/json"}
if self.api.api_key:
headers["Authorization"] = f"Bearer {self.api.api_key}"
body = {
"model": self.model_name,
"messages": [
{"role": "system", "content": VISION_SYSTEM_PROMPT},
{
"role": "user",
"content": [
{"type": "text", "text": user_prompt},
{
"type": "image_url",
"image_url": {
"url": f"data:{mime_type};base64,{b64_image}",
},
},
],
},
],
"temperature": 0.1,
"max_tokens": 800,
}
resp = requests.post(url, json=body, headers=headers, timeout=self.timeout)
if resp.status_code != 200:
logger.error(
"Vision model request failed status=%d body=%s",
resp.status_code,
resp.text[:500],
)
resp.raise_for_status()
data = resp.json()
# Parse response — OpenAI-compatible format
if "choices" in data:
return data["choices"][0]["message"]["content"]
if "output" in data:
# Bedrock Converse format
return data["output"]["message"]["content"][0]["text"]
raise ValueError(f"Unexpected vision response format: {list(data.keys())}")
def _parse_vision_response(self, response_text: str) -> Optional[dict]:
"""
Parse the vision model's response into structured data.
:param response_text: Raw response text (expected JSON).
:returns: Parsed dict with image_type, description, ocr_text, concepts.
"""
text = response_text.strip()
# Handle markdown code blocks
if text.startswith("```"):
lines = text.split("\n")
# Remove first line (```json) and last line (```)
text = "\n".join(lines[1:-1]) if len(lines) > 2 else text
try:
result = json.loads(text)
if isinstance(result, dict):
return self._validate_result(result)
except json.JSONDecodeError:
pass
# Try to find JSON object in the response
match = re.search(r"\{.*\}", text, re.DOTALL)
if match:
try:
result = json.loads(match.group())
if isinstance(result, dict):
return self._validate_result(result)
except json.JSONDecodeError:
pass
logger.debug("Could not parse vision response: %s", text[:200])
return None
def _validate_result(self, result: dict) -> dict:
"""
Validate and normalize the parsed vision result.
:param result: Raw parsed dict from the model.
:returns: Normalized result dict.
"""
# Normalize image_type
image_type = result.get("image_type", "").lower().strip()
if image_type not in VALID_IMAGE_TYPES:
image_type = "photo" # Safe default
# Normalize concepts
concepts = result.get("concepts", [])
valid_concepts = []
for c in concepts:
if isinstance(c, dict) and "name" in c:
name = c["name"].strip().lower()
if name and len(name) >= 2:
valid_concepts.append({
"name": name,
"type": c.get("type", "topic"),
})
return {
"image_type": image_type,
"description": str(result.get("description", "")).strip()[:2000],
"ocr_text": str(result.get("ocr_text", "")).strip()[:5000],
"concepts": valid_concepts[:20], # Cap at 20 concepts per image
}
def _apply_result(self, img_node, result: dict, item=None):
"""
Apply vision analysis results to an Image node and create graph relationships.
:param img_node: Image node to update.
:param result: Validated result dict from vision analysis.
:param item: Optional Item node for REFERENCES relationships.
"""
from library.models import Concept
# Update Image node fields
img_node.image_type = result["image_type"]
img_node.description = result["description"]
img_node.ocr_text = result["ocr_text"]
img_node.vision_model_name = self.model_name
img_node.analysis_status = "completed"
img_node.save()
# Create concept relationships
for concept_data in result["concepts"]:
name = concept_data["name"]
concept_type = concept_data["type"]
concept_node = self._get_or_create_concept(name, concept_type)
if concept_node:
# Image DEPICTS Concept
try:
img_node.concepts.connect(concept_node)
except Exception:
pass # Already connected
# Also connect Item REFERENCES Concept (if item provided)
if item:
try:
item.concepts.connect(concept_node, {"weight": 0.8})
except Exception:
pass # Already connected
VISION_CONCEPTS_EXTRACTED_TOTAL.labels(concept_type=concept_type).inc()
logger.debug(
"Applied vision result to image uid=%s type=%s concepts=%d",
img_node.uid,
result["image_type"],
len(result["concepts"]),
)
# Log usage
self._log_usage()
def _get_or_create_concept(self, name: str, concept_type: str):
"""
Get or create a Concept node by name (shared with text concept extraction).
:param name: Concept name (lowercase).
:param concept_type: Concept type (person, place, topic, etc.).
:returns: Concept node, or None on failure.
"""
from library.models import Concept
try:
existing = Concept.nodes.filter(name=name)
if existing:
return existing[0]
concept = Concept(name=name, concept_type=concept_type)
concept.save()
return concept
except Exception as exc:
logger.debug("Failed to get/create concept '%s': %s", name, exc)
return None
def _log_usage(self):
"""Log vision analysis usage to LLMUsage model."""
try:
from llm_manager.models import LLMUsage
LLMUsage.objects.create(
model=self.vision_model,
user=self.user,
input_tokens=500, # Approximate — vision tokens are hard to estimate
output_tokens=200,
cached_tokens=0,
total_cost=0,
purpose="vision_analysis",
)
except Exception as exc:
logger.warning("Failed to log vision usage: %s", exc)

View File

@@ -56,6 +56,19 @@
{% endif %}
</td>
</tr>
<tr>
<th>Vision Model</th>
<td>
{% if system_vision_model %}
<span class="font-semibold">{{ system_vision_model.api.name }}: {{ system_vision_model.name }}</span>
{% if system_vision_model.supports_vision %}
<span class="badge badge-accent badge-sm ml-1">Vision</span>
{% endif %}
{% else %}
<span class="text-sm opacity-60">Not configured — image analysis disabled</span>
{% endif %}
</td>
</tr>
</tbody>
</table>
</div>

View File

@@ -97,24 +97,51 @@
{% if images %}
<div class="mb-6">
<h2 class="text-xl font-bold mb-3">Images ({{ images|length }})</h2>
<div class="grid grid-cols-2 md:grid-cols-4 gap-3">
<div class="grid grid-cols-1 md:grid-cols-2 gap-4">
{% for img in images %}
<a href="{% url 'library:image-serve' uid=img.uid %}" target="_blank" rel="noopener" class="card bg-base-200 hover:shadow-lg hover:bg-base-300 transition-all cursor-pointer">
{% if img.s3_key %}
<figure class="px-3 pt-3">
<img src="{% url 'library:image-serve' uid=img.uid %}" alt="{{ img.description|default:'Image' }}" class="rounded-lg object-cover w-full h-32" loading="lazy" />
</figure>
{% endif %}
<div class="card-body p-3">
<span class="badge badge-sm">{{ img.image_type|default:"image" }}</span>
{% if img.description %}
<p class="text-xs opacity-60 mt-1">{{ img.description|truncatewords:10 }}</p>
<div class="card bg-base-200">
<div class="flex flex-col md:flex-row">
{% if img.s3_key %}
<a href="{% url 'library:image-serve' uid=img.uid %}" target="_blank" rel="noopener" class="flex-shrink-0">
<img src="{% url 'library:image-serve' uid=img.uid %}" alt="{{ img.description|default:'Image' }}" class="rounded-lg object-cover w-full md:w-48 h-40" loading="lazy" />
</a>
{% endif %}
<p class="text-xs opacity-40 mt-1 flex items-center gap-1">
<span>🔍</span> Click to view
</p>
<div class="card-body p-3 flex-1">
<div class="flex items-center gap-2 flex-wrap">
<span class="badge badge-primary badge-sm">{{ img.image_type|default:"image" }}</span>
{% if img.analysis_status == "completed" %}
<span class="badge badge-success badge-xs">analyzed</span>
{% elif img.analysis_status == "failed" %}
<span class="badge badge-error badge-xs">analysis failed</span>
{% elif img.analysis_status == "skipped" %}
<span class="badge badge-ghost badge-xs">no vision model</span>
{% endif %}
{% if img.vision_model_name %}
<span class="text-xs opacity-40">{{ img.vision_model_name }}</span>
{% endif %}
</div>
{% if img.description %}
<p class="text-sm mt-2">{{ img.description }}</p>
{% endif %}
{% if img.ocr_text %}
<div class="collapse collapse-arrow bg-base-300 mt-2">
<input type="checkbox" />
<div class="collapse-title text-xs font-medium py-1 min-h-0">
Extracted Text (OCR)
</div>
<div class="collapse-content">
<p class="text-xs whitespace-pre-wrap opacity-70">{{ img.ocr_text }}</p>
</div>
</div>
{% endif %}
{% if not img.description and not img.ocr_text %}
<a href="{% url 'library:image-serve' uid=img.uid %}" target="_blank" rel="noopener" class="text-xs opacity-40 mt-1 flex items-center gap-1">
<span>🔍</span> Click to view full image
</a>
{% endif %}
</div>
</div>
</a>
</div>
{% endfor %}
</div>
</div>

View File

@@ -550,6 +550,7 @@ def embedding_dashboard(request):
"system_embedding_model": None,
"system_chat_model": None,
"system_reranker_model": None,
"system_vision_model": None,
"status_counts": {},
"node_counts": {},
"total_items": 0,
@@ -565,6 +566,7 @@ def embedding_dashboard(request):
context["system_embedding_model"] = LLMModel.get_system_embedding_model()
context["system_chat_model"] = LLMModel.get_system_chat_model()
context["system_reranker_model"] = LLMModel.get_system_reranker_model()
context["system_vision_model"] = LLMModel.get_system_vision_model()
except Exception as exc:
logger.warning("Could not load system models: %s", exc)

View File

@@ -0,0 +1,52 @@
"""
Add is_system_vision_model to LLMModel and vision_analysis purpose to LLMUsage.
"""
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
("llm_manager", "0002_add_bedrock_api_type"),
]
operations = [
migrations.AddField(
model_name="llmmodel",
name="is_system_vision_model",
field=models.BooleanField(
default=False,
help_text=(
"Mark this as the system-wide vision model for image analysis. "
"Only ONE vision model should have this set to True."
),
),
),
migrations.AddIndex(
model_name="llmmodel",
index=models.Index(
fields=["is_system_vision_model", "model_type"],
name="llm_manager__is_syst_b2f4e7_idx",
),
),
migrations.AlterField(
model_name="llmusage",
name="purpose",
field=models.CharField(
choices=[
("responder", "RAG Responder"),
("reviewer", "RAG Reviewer"),
("embeddings", "Document Embeddings"),
("search", "Vector Search"),
("reranking", "Re-ranking"),
("multimodal_embed", "Multimodal Embedding"),
("vision_analysis", "Vision Analysis"),
("other", "Other"),
],
db_index=True,
default="other",
max_length=50,
),
),
]

View File

@@ -179,6 +179,13 @@ class LLMModel(models.Model):
"Only ONE reranker model should have this set to True."
),
)
is_system_vision_model = models.BooleanField(
default=False,
help_text=(
"Mark this as the system-wide vision model for image analysis. "
"Only ONE vision model should have this set to True."
),
)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
@@ -191,6 +198,7 @@ class LLMModel(models.Model):
models.Index(fields=["is_system_embedding_model", "model_type"]),
models.Index(fields=["is_system_chat_model", "model_type"]),
models.Index(fields=["is_system_reranker_model", "model_type"]),
models.Index(fields=["is_system_vision_model", "model_type"]),
]
def __str__(self):
@@ -223,6 +231,15 @@ class LLMModel(models.Model):
model_type="reranker",
).first()
@classmethod
def get_system_vision_model(cls):
"""Get the system-wide vision model for image analysis."""
return cls.objects.filter(
is_system_vision_model=True,
is_active=True,
model_type__in=["vision", "chat"],
).first()
class LLMUsage(models.Model):
"""
@@ -259,6 +276,7 @@ class LLMUsage(models.Model):
("search", "Vector Search"),
("reranking", "Re-ranking"),
("multimodal_embed", "Multimodal Embedding"),
("vision_analysis", "Vision Analysis"),
("other", "Other"),
],
default="other",