Add vision analysis capabilities to the embedding pipeline
- Introduced a new vision analysis service to classify, describe, and extract text from images.
- Enhanced the Image model with fields for OCR text, vision model name, and analysis status.
- Added a new "nonfiction" library type with specific chunking and embedding configurations.
- Updated content types to include vision prompts for various library types.
- Integrated vision analysis into the embedding pipeline, allowing image analysis during document processing.
- Implemented metrics to track vision analysis performance and usage.
- Updated UI components to display vision analysis results and statuses in item details and the embedding dashboard.
- Added a migration for new vision model fields and usage tracking.
New file: docs/PHASE_2B_VISION_PIPELINE.md (244 lines)
# Phase 2B: Vision Analysis Pipeline

## Objective

Add vision-based image understanding to the embedding pipeline: when documents are processed and images extracted, use a vision-capable LLM to classify, describe, extract text from, and identify concepts within each image — connecting images into the knowledge graph as first-class participants alongside text.

## Heritage

Extends Phase 2's image extraction (PyMuPDF) and multimodal embedding (Qwen3-VL) with structured understanding. Previously, images were stored and optionally embedded into vector space, but had no description, no classification beyond a hardcoded default, and no concept graph integration. Phase 2B makes images *understood*.

## Architecture Overview
```
Image Extracted (Phase 2)
  → Stored in S3 + Image node in Neo4j
  → Vision Analysis (NEW - Phase 2B)
      → Classify image type (diagram, photo, chart, etc.)
      → Generate natural language description
      → Extract visible text (OCR)
      → Identify concepts depicted
      → Create DEPICTS relationships to Concept nodes
      → Connect Item to image-derived concepts via REFERENCES
  → Multimodal Embedding (Phase 2, now enhanced)
```

## How It Fits the Graph

Vision analysis enriches images so they participate in the knowledge graph the same way text chunks do:

```
Item ─[HAS_IMAGE]──→ Image
                       │
                       ├── description: "Wiring diagram showing 3-phase motor connection"
                       ├── image_type: "diagram" (auto-classified)
                       ├── ocr_text: "L1 L2 L3 ..."
                       ├── vision_model_name: "Qwen3-VL-72B"
                       │
                       ├──[DEPICTS]──→ Concept("3-phase motor")
                       ├──[DEPICTS]──→ Concept("wiring diagram")
                       └──[DEPICTS]──→ Concept("electrical connection")
                                         │
                                         └──[RELATED_TO]──→ Concept("motor control")
                                                                        ↑
                                         Chunk text also ──[MENTIONS]───┘
```

Three relationship types connect content to concepts:

- `Chunk ─[MENTIONS]─→ Concept` — text discusses this concept
- `Item ─[REFERENCES]─→ Concept` — item is about this concept
- `Image ─[DEPICTS]─→ Concept` — image visually shows this concept

Concepts extracted from images merge with the **same Concept nodes** extracted from text via deduplication by name. This means graph traversal discovers cross-modal connections automatically.
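
As an illustrative stand-in for the graph-side merge (in the real pipeline this is a Neo4j MERGE on the Concept node, not a Python dict), a minimal sketch of name-keyed deduplication; the normalization rule here is an assumption:

```python
# Stand-in for the Neo4j MERGE: dedupe concepts by normalized name so that
# concepts seen in text chunks and in images land on the same node.
def normalize_name(name: str) -> str:
    """Collapse case and whitespace so "Wiring  Diagram" and "wiring diagram" merge."""
    return " ".join(name.lower().split())

def merge_concepts(registry: dict, extracted: list) -> dict:
    """Upsert extracted concepts (from text or images) into one shared registry."""
    for concept in extracted:
        key = normalize_name(concept["name"])
        registry.setdefault(key, {"name": key, "sources": set()})
        registry[key]["sources"].add(concept.get("source", "text"))
    return registry
```

A concept seen in both a chunk and an image ends up as one entry, which is what lets traversal cross modalities.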

## Deliverables

### 1. System Vision Model (`llm_manager/models.py`)

New `is_system_vision_model` boolean field on `LLMModel`, following the same pattern as the existing system embedding, chat, and reranker models.
```python
is_system_vision_model = models.BooleanField(
    default=False,
    help_text="Mark as the system-wide vision model for image analysis.",
)

@classmethod
def get_system_vision_model(cls):
    return cls.objects.filter(
        is_system_vision_model=True,
        is_active=True,
        model_type__in=["vision", "chat"],  # Vision-capable chat models work
    ).first()
```

Added `"vision_analysis"` to `LLMUsage.purpose` choices for cost tracking.

### 2. Image Model Enhancements (`library/models.py`)

New fields on the `Image` node:

| Field | Type | Purpose |
|-------|------|---------|
| `ocr_text` | StringProperty | Visible text extracted by vision model |
| `vision_model_name` | StringProperty | Which model analyzed this image |
| `analysis_status` | StringProperty | pending / completed / failed / skipped |

Expanded `image_type` choices: cover, diagram, chart, table, screenshot, illustration, map, portrait, artwork, still, photo.

New relationship: `Image ─[DEPICTS]─→ Concept`

### 3. Content-Type Vision Prompts (`library/content_types.py`)

Each library type now includes a `vision_prompt` that shapes what the vision model looks for:

| Library Type | Vision Focus |
|---|---|
| **Fiction** | Illustrations, cover art, characters, scenes, artistic style |
| **Non-Fiction** | Photographs, maps, charts, people, places, historical context |
| **Technical** | Diagrams, schematics, charts, tables, labels, processes |
| **Music** | Album covers, band photos, liner notes, era/aesthetic |
| **Film** | Stills, posters, storyboards, cinematographic elements |
| **Art** | Medium, style, subject, composition, artistic period |
| **Journal** | Photos, sketches, documents, dates, context clues |

### 4. Vision Analysis Service (`library/services/vision.py`)

New service: `VisionAnalyzer` — analyzes images via the system vision model.

#### API Call Format

Uses the OpenAI-compatible multimodal chat format:

```python
{
    "model": "qwen3-vl-72b",
    "messages": [
        {"role": "system", "content": "<structured JSON output instructions>"},
        {"role": "user", "content": [
            {"type": "text", "text": "<content-type-aware vision prompt>"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ]},
    ],
    "temperature": 0.1,
    "max_tokens": 800,
}
```
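
For illustration, a minimal helper that assembles this payload from raw image bytes; the function name and the system prompt text are assumptions, not the actual `VisionAnalyzer` code:

```python
# Hypothetical helper sketching how the multimodal request is assembled.
import base64

def build_vision_payload(model_name, prompt, image_bytes, mime="image/png"):
    """Wrap an image and a content-type-aware prompt in the OpenAI-style chat format."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model_name,
        "messages": [
            {"role": "system", "content": "Reply with structured JSON only."},
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ]},
        ],
        "temperature": 0.1,   # low temperature for stable structured output
        "max_tokens": 800,
    }
```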

#### Response Structure

The vision model returns structured JSON:

```json
{
  "image_type": "diagram",
  "description": "A wiring diagram showing a 3-phase motor connection with L1, L2, L3 inputs",
  "ocr_text": "L1 L2 L3 GND PE 400V",
  "concepts": [
    {"name": "3-phase motor", "type": "topic"},
    {"name": "wiring diagram", "type": "technique"}
  ]
}
```
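
The parse-and-normalize step could look like this minimal sketch; the function name and the fallback `image_type` are assumptions, not the service's actual behavior:

```python
# Hypothetical sketch of validating the model's JSON reply.
import json

VALID_IMAGE_TYPES = {"cover", "diagram", "chart", "table", "screenshot",
                     "illustration", "map", "portrait", "artwork", "still", "photo"}
MAX_CONCEPTS = 20

def parse_vision_response(raw: str) -> dict:
    """Parse the reply, normalizing invalid fields instead of failing outright."""
    data = json.loads(raw)
    if data.get("image_type") not in VALID_IMAGE_TYPES:
        data["image_type"] = "illustration"  # assumed fallback type
    data["ocr_text"] = data.get("ocr_text") or ""
    # Cap concepts at 20 and drop entries without a name.
    concepts = [c for c in data.get("concepts", []) if c.get("name")]
    data["concepts"] = concepts[:MAX_CONCEPTS]
    return data
```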

#### Processing Flow

For each image:

1. Read image bytes from S3
2. Base64-encode and send to vision model with content-type-aware prompt
3. Parse structured JSON response
4. Validate and normalize (image_type must be valid, concepts capped at 20)
5. Update Image node (description, ocr_text, image_type, vision_model_name, analysis_status)
6. Create/connect Concept nodes via DEPICTS relationships
7. Also connect Item → Concept via REFERENCES (weight 0.8)
8. Log usage to LLMUsage
### 5. Pipeline Integration (`library/services/pipeline.py`)

Vision analysis is Stage 5.5 in the pipeline:

```
Stage 5:   Store images in S3 + Neo4j   (existing)
Stage 5.5: Vision analysis              (NEW)
Stage 6:   Embed images multimodally    (existing)
Stage 7:   Concept extraction from text (existing)
```

Behavior:

- If a system vision model is configured → analyze all images
- If no vision model → mark images as `analysis_status="skipped"`, continue pipeline
- Vision analysis failures are per-image (they don't fail the whole pipeline)
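
A sketch of that guard and the per-image isolation, using plain dicts and a callback in place of the real Neo4j-backed pipeline; the function and field names are assumptions:

```python
# Hypothetical Stage 5.5 guard: skip cleanly without a model, isolate failures.
import logging

logger = logging.getLogger(__name__)

def run_vision_stage(images, vision_model, analyze):
    """Mark images skipped when no model is configured; one bad image never
    fails the run, it is just recorded as failed."""
    if vision_model is None:
        for img in images:
            img["analysis_status"] = "skipped"
        return
    for img in images:
        try:
            analyze(img, vision_model)
            img["analysis_status"] = "completed"
        except Exception:
            logger.exception("vision analysis failed for image %s", img.get("id"))
            img["analysis_status"] = "failed"
```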

### 6. Non-Fiction Library Type

New `"nonfiction"` library type added alongside the existing six types:

| Setting | Value |
|---------|-------|
| Strategy | `section_aware` |
| Chunk Size | 768 |
| Chunk Overlap | 96 |
| Boundaries | chapter, section, paragraph |
| Focus | Factual claims, historical events, people, places, arguments, evidence |
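
Hypothetically, the corresponding entry in `content_types.py` might look like the dict below; the key names and prompt wording are assumptions based on the settings above, not the actual module:

```python
# Assumed shape of the nonfiction content-type configuration.
NONFICTION_CONFIG = {
    "chunking_strategy": "section_aware",
    "chunk_size": 768,
    "chunk_overlap": 96,
    "boundaries": ["chapter", "section", "paragraph"],
    "vision_prompt": (
        "Identify photographs, maps, charts, people, places, "
        "and historical context in this image."
    ),
}
```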

### 7. Prometheus Metrics (`library/metrics.py`)

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| `mnemosyne_vision_analyses_total` | Counter | status | Images analyzed |
| `mnemosyne_vision_analysis_duration_seconds` | Histogram | — | Per-image analysis latency |
| `mnemosyne_vision_concepts_extracted_total` | Counter | concept_type | Concepts from images |

### 8. Dashboard & UI Updates

- Embedding dashboard shows system vision model status alongside embedding, chat, and reranker models
- Item detail page shows enriched image cards with:
  - Auto-classified image type badge
  - Vision-generated description
  - Collapsible OCR text section
  - Analysis status indicator
  - Vision model name reference

## File Structure

```
mnemosyne/library/
├── services/
│   ├── vision.py                 # NEW — VisionAnalyzer service
│   ├── pipeline.py               # Modified — Stage 5.5 integration
│   └── ...
├── models.py                     # Modified — Image fields, DEPICTS rel, nonfiction type
├── content_types.py              # Modified — vision_prompt for all 7 types
├── metrics.py                    # Modified — vision analysis metrics
├── views.py                      # Modified — vision model in dashboard context
└── templates/library/
    ├── item_detail.html          # Modified — enriched image display
    └── embedding_dashboard.html  # Modified — vision model row

mnemosyne/llm_manager/
├── models.py                     # Modified — is_system_vision_model, get_system_vision_model()
└── migrations/
    └── 0003_add_vision_model_and_usage.py  # NEW
```

## Performance Considerations

- Each image = one vision model inference (~2-5 seconds on a local GPU)
- A document with 20 images = ~40-100 seconds of extra processing
- Runs in Celery async tasks — does not block web requests
- Uses the same GPU infrastructure already serving embedding and reranking
- Zero API cost when running locally
- Per-image failure isolation — one bad image doesn't fail the pipeline

## Success Criteria

- [ ] System vision model configurable via Django admin (same pattern as other system models)
- [ ] Images auto-classified with correct image_type (not hardcoded "diagram")
- [ ] Vision-generated descriptions visible on item detail page
- [ ] OCR text extracted from images with visible text
- [ ] Concepts extracted from images connected to Concept nodes via DEPICTS
- [ ] Shared concepts bridge text chunks and images in the graph
- [ ] Pipeline gracefully skips vision analysis when no vision model configured
- [ ] Non-fiction library type available for history, biography, essays, etc.
- [ ] Prometheus metrics track vision analysis throughput and latency
- [ ] Dashboard shows vision model status