# Phase 2B: Vision Analysis Pipeline

## Objective

Add vision-based image understanding to the embedding pipeline: when documents are processed and images extracted, use a vision-capable LLM to classify, describe, extract text from, and identify concepts within each image, connecting images into the knowledge graph as first-class participants alongside text.

## Heritage

Extends Phase 2's image extraction (PyMuPDF) and multimodal embedding (Qwen3-VL) with structured understanding. Previously, images were stored and optionally embedded into vector space, but had no description, no classification beyond a hardcoded default, and no concept graph integration. Phase 2B makes images *understood*.

## Architecture Overview

```
Image Extracted (Phase 2)
  → Stored in S3 + Image node in Neo4j
  → Vision Analysis (NEW - Phase 2B)
      → Classify image type (diagram, photo, chart, etc.)
      → Generate natural language description
      → Extract visible text (OCR)
      → Identify concepts depicted
      → Create DEPICTS relationships to Concept nodes
      → Connect Item to image-derived concepts via REFERENCES
  → Multimodal Embedding (Phase 2, now enhanced)
```

## How It Fits the Graph

Vision analysis enriches images so they participate in the knowledge graph the same way text chunks do:

```
Item ─[HAS_IMAGE]──→ Image
                       │
                       ├── description: "Wiring diagram showing 3-phase motor connection"
                       ├── image_type: "diagram" (auto-classified)
                       ├── ocr_text: "L1 L2 L3 ..."
                       ├── vision_model_name: "Qwen3-VL-72B"
                       │
                       ├──[DEPICTS]──→ Concept("3-phase motor")
                       ├──[DEPICTS]──→ Concept("wiring diagram")
                       └──[DEPICTS]──→ Concept("electrical connection")
                                          │
                                          └──[RELATED_TO]──→ Concept("motor control")
                                                                  ↑
                                            Chunk text also ──[MENTIONS]──┘
```

Three relationship types connect content to concepts:

- `Chunk ─[MENTIONS]─→ Concept`: text discusses this concept
- `Item ─[REFERENCES]─→ Concept`: item is about this concept
- `Image ─[DEPICTS]─→ Concept`: image visually shows this concept

Concepts extracted from images merge with the **same Concept nodes** extracted from text via deduplication by name. This means graph traversal discovers cross-modal connections automatically.

## Deliverables

### 1. System Vision Model (`llm_manager/models.py`)

New `is_system_vision_model` boolean field on `LLMModel`, following the same pattern as the existing system embedding, chat, and reranker models.

```python
is_system_vision_model = models.BooleanField(
    default=False,
    help_text="Mark as the system-wide vision model for image analysis."
)

@classmethod
def get_system_vision_model(cls):
    return cls.objects.filter(
        is_system_vision_model=True,
        is_active=True,
        model_type__in=["vision", "chat"],  # Vision-capable chat models work
    ).first()
```

Added `"vision_analysis"` to `LLMUsage.purpose` choices for cost tracking.

### 2. Image Model Enhancements (`library/models.py`)

New fields on the `Image` node:

| Field | Type | Purpose |
|-------|------|---------|
| `ocr_text` | StringProperty | Visible text extracted by vision model |
| `vision_model_name` | StringProperty | Which model analyzed this image |
| `analysis_status` | StringProperty | pending / completed / failed / skipped |

Expanded `image_type` choices: cover, diagram, chart, table, screenshot, illustration, map, portrait, artwork, still, photo.

New relationship: `Image ─[DEPICTS]─→ Concept`

### 3. Content-Type Vision Prompts (`library/content_types.py`)

Each library type now includes a `vision_prompt` that shapes what the vision model looks for:

| Library Type | Vision Focus |
|---|---|
| **Fiction** | Illustrations, cover art, characters, scenes, artistic style |
| **Non-Fiction** | Photographs, maps, charts, people, places, historical context |
| **Technical** | Diagrams, schematics, charts, tables, labels, processes |
| **Music** | Album covers, band photos, liner notes, era/aesthetic |
| **Film** | Stills, posters, storyboards, cinematographic elements |
| **Art** | Medium, style, subject, composition, artistic period |
| **Journal** | Photos, sketches, documents, dates, context clues |

### 4. Vision Analysis Service (`library/services/vision.py`)

New service: `VisionAnalyzer`, which analyzes images via the system vision model.

#### API Call Format

Uses OpenAI-compatible multimodal chat format:

```python
{
    "model": "qwen3-vl-72b",
    "messages": [
        {"role": "system", "content": ""},
        {"role": "user", "content": [
            {"type": "text", "text": ""},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]}
    ],
    "temperature": 0.1,
    "max_tokens": 800
}
```

#### Response Structure

The vision model returns structured JSON:

```json
{
    "image_type": "diagram",
    "description": "A wiring diagram showing a 3-phase motor connection with L1, L2, L3 inputs",
    "ocr_text": "L1 L2 L3 GND PE 400V",
    "concepts": [
        {"name": "3-phase motor", "type": "topic"},
        {"name": "wiring diagram", "type": "technique"}
    ]
}
```

#### Processing Flow

For each image:

1. Read image bytes from S3
2. Base64-encode and send to vision model with content-type-aware prompt
3. Parse structured JSON response
4. Validate and normalize (image_type must be valid, concepts capped at 20)
5. Update Image node (description, ocr_text, image_type, vision_model_name, analysis_status)
6. Create/connect Concept nodes via DEPICTS relationship
7. Also connect Item → Concept via REFERENCES (weight 0.8)
8. Log usage to LLMUsage

### 5. Pipeline Integration (`library/services/pipeline.py`)

Vision analysis is Stage 5.5 in the pipeline:

```
Stage 5:   Store images in S3 + Neo4j    (existing)
Stage 5.5: Vision analysis               (NEW)
Stage 6:   Embed images multimodally     (existing)
Stage 7:   Concept extraction from text  (existing)
```

Behavior:

- If system vision model configured → analyze all images
- If no vision model → mark images as `analysis_status="skipped"`, continue pipeline
- Vision analysis failures are per-image (don't fail the whole pipeline)

### 6. Non-Fiction Library Type

New `"nonfiction"` library type added alongside the existing six types:

| Setting | Value |
|---------|-------|
| Strategy | `section_aware` |
| Chunk Size | 768 |
| Chunk Overlap | 96 |
| Boundaries | chapter, section, paragraph |
| Focus | Factual claims, historical events, people, places, arguments, evidence |

### 7. Prometheus Metrics (`library/metrics.py`)

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| `mnemosyne_vision_analyses_total` | Counter | status | Images analyzed |
| `mnemosyne_vision_analysis_duration_seconds` | Histogram | none | Per-image analysis latency |
| `mnemosyne_vision_concepts_extracted_total` | Counter | concept_type | Concepts from images |

### 8. Dashboard & UI Updates

- Embedding dashboard shows system vision model status alongside embedding, chat, and reranker models
- Item detail page shows enriched image cards with:
  - Auto-classified image type badge
  - Vision-generated description
  - Collapsible OCR text section
  - Analysis status indicator
  - Vision model name reference

## File Structure

```
mnemosyne/library/
├── services/
│   ├── vision.py                 # NEW: VisionAnalyzer service
│   ├── pipeline.py               # Modified: Stage 5.5 integration
│   └── ...
├── models.py                     # Modified: Image fields, DEPICTS rel, nonfiction type
├── content_types.py              # Modified: vision_prompt for all 7 types
├── metrics.py                    # Modified: vision analysis metrics
├── views.py                      # Modified: vision model in dashboard context
└── templates/library/
    ├── item_detail.html          # Modified: enriched image display
    └── embedding_dashboard.html  # Modified: vision model row

mnemosyne/llm_manager/
├── models.py                     # Modified: is_system_vision_model, get_system_vision_model()
└── migrations/
    └── 0003_add_vision_model_and_usage.py  # NEW
```

## Performance Considerations

- Each image = one vision model inference (~2-5 seconds on local GPU)
- A document with 20 images = ~40-100 seconds of extra processing
- Runs in Celery async tasks, so it does not block web requests
- Uses the same GPU infrastructure already serving embedding and reranking
- Zero API cost when running locally
- Per-image failure isolation: one bad image doesn't fail the pipeline

## Success Criteria

- [ ] System vision model configurable via Django admin (same pattern as other system models)
- [ ] Images auto-classified with correct image_type (not hardcoded "diagram")
- [ ] Vision-generated descriptions visible on item detail page
- [ ] OCR text extracted from images with visible text
- [ ] Concepts extracted from images connected to Concept nodes via DEPICTS
- [ ] Shared concepts bridge text chunks and images in the graph
- [ ] Pipeline gracefully skips vision analysis when no vision model configured
- [ ] Non-fiction library type available for history, biography, essays, etc.
- [ ] Prometheus metrics track vision analysis throughput and latency
- [ ] Dashboard shows vision model status
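
## Illustrative Sketch: Request Building and Response Normalization

Steps 2 to 4 of the Processing Flow in Deliverable 4 (base64-encode and send, parse the JSON reply, validate and normalize) could look roughly like the sketch below. It is not the actual `VisionAnalyzer` code. The helper names `build_vision_payload` and `normalize_analysis`, the fallback image type, the default concept type, and the case-insensitive de-duplication of concept names are all assumptions made for illustration. The list of valid image types and the cap of 20 concepts come from this document.

```python
import base64
import json

# Image types listed under "Image Model Enhancements".
VALID_IMAGE_TYPES = {
    "cover", "diagram", "chart", "table", "screenshot", "illustration",
    "map", "portrait", "artwork", "still", "photo",
}
MAX_CONCEPTS = 20                # cap stated in the Processing Flow
FALLBACK_IMAGE_TYPE = "photo"    # assumption: the document names no fallback
DEFAULT_CONCEPT_TYPE = "topic"   # assumption: used when the model omits "type"


def build_vision_payload(image_bytes, system_prompt, user_prompt,
                         model="qwen3-vl-72b", mime="image/png"):
    """Build an OpenAI-compatible multimodal chat request body."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ]},
        ],
        "temperature": 0.1,
        "max_tokens": 800,
    }


def normalize_analysis(raw_json):
    """Parse the model's JSON reply and coerce it into a predictable shape."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("vision response must be a JSON object")

    # Unknown or missing image types fall back instead of failing the image.
    image_type = str(data.get("image_type") or "").strip().lower()
    if image_type not in VALID_IMAGE_TYPES:
        image_type = FALLBACK_IMAGE_TYPE

    concepts, seen = [], set()
    for item in data.get("concepts") or []:
        if not isinstance(item, dict):
            continue  # skip malformed entries
        name = str(item.get("name") or "").strip()
        if not name or name.lower() in seen:
            continue  # skip empty names and case-insensitive duplicates
        seen.add(name.lower())
        concepts.append({
            "name": name,
            "type": str(item.get("type") or DEFAULT_CONCEPT_TYPE),
        })
        if len(concepts) >= MAX_CONCEPTS:
            break

    return {
        "image_type": image_type,
        "description": str(data.get("description") or "").strip(),
        "ocr_text": str(data.get("ocr_text") or "").strip(),
        "concepts": concepts,
    }
```

Normalizing before touching the graph keeps a malformed reply from writing an invalid `image_type` or an unbounded number of `DEPICTS` edges. A reply that is not valid JSON raises an exception, which the caller can catch to mark that single image as `analysis_status="failed"` without stopping the pipeline.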