mnemosyne/docs/PHASE_2B_VISION_PIPELINE.md
Robert Helewka 90db904959 Add vision analysis capabilities to the embedding pipeline
- Introduced a new vision analysis service to classify, describe, and extract text from images.
- Enhanced the Image model with fields for OCR text, vision model name, and analysis status.
- Added a new "nonfiction" library type with specific chunking and embedding configurations.
- Updated content types to include vision prompts for various library types.
- Integrated vision analysis into the embedding pipeline, allowing for image analysis during document processing.
- Implemented metrics to track vision analysis performance and usage.
- Updated UI components to display vision analysis results and statuses in item details and the embedding dashboard.
- Added migration for new vision model fields and usage tracking.
2026-03-22 15:14:34 +00:00

Phase 2B: Vision Analysis Pipeline

Objective

Add vision-based image understanding to the embedding pipeline: when documents are processed and images extracted, use a vision-capable LLM to classify, describe, extract text from, and identify concepts within each image — connecting images into the knowledge graph as first-class participants alongside text.

Heritage

Extends Phase 2's image extraction (PyMuPDF) and multimodal embedding (Qwen3-VL) with structured understanding. Previously, images were stored and optionally embedded into vector space, but had no description, no classification beyond a hardcoded default, and no concept graph integration. Phase 2B makes images understood.

Architecture Overview

Image Extracted (Phase 2)
  → Stored in S3 + Image node in Neo4j
    → Vision Analysis (NEW - Phase 2B)
      → Classify image type (diagram, photo, chart, etc.)
      → Generate natural language description
      → Extract visible text (OCR)
      → Identify concepts depicted
        → Create DEPICTS relationships to Concept nodes
        → Connect Item to image-derived concepts via REFERENCES
    → Multimodal Embedding (Phase 2, now enhanced)

How It Fits the Graph

Vision analysis enriches images so they participate in the knowledge graph the same way text chunks do:

Item ─[HAS_IMAGE]──→ Image
                        │
                        ├── description: "Wiring diagram showing 3-phase motor connection"
                        ├── image_type: "diagram" (auto-classified)
                        ├── ocr_text: "L1 L2 L3 ..."
                        ├── vision_model_name: "Qwen3-VL-72B"
                        │
                        ├──[DEPICTS]──→ Concept("3-phase motor")
                        ├──[DEPICTS]──→ Concept("wiring diagram")
                        └──[DEPICTS]──→ Concept("electrical connection")
                                            │
                                            └──[RELATED_TO]──→ Concept("motor control")
                                                                  ↑
                                            Chunk text also ──[MENTIONS]──┘

Three relationship types connect content to concepts:

  • Chunk ─[MENTIONS]─→ Concept — text discusses this concept
  • Item ─[REFERENCES]─→ Concept — item is about this concept
  • Image ─[DEPICTS]─→ Concept — image visually shows this concept

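Concepts extracted from images are deduplicated by name against the Concept nodes extracted from text, so both modalities attach to the same nodes. As a result, graph traversal discovers cross-modal connections automatically: a chunk that MENTIONS a concept and an image that DEPICTS it are one hop apart.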
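
A minimal in-memory sketch of how name-based deduplication lets one Concept node serve both modalities. The `ConceptIndex` class, the `normalize` helper, and the normalization rule (lowercase, collapsed whitespace) are assumptions for illustration; the source only states that concepts are deduplicated by name.

```python
from dataclasses import dataclass, field


def normalize(name: str) -> str:
    """Assumed normalization: case-insensitive, whitespace-collapsed."""
    return " ".join(name.lower().split())


@dataclass
class ConceptIndex:
    """In-memory stand-in for Concept nodes keyed by normalized name."""
    # normalized concept name -> set of (relationship, source id) pairs
    links: dict[str, set[tuple[str, str]]] = field(default_factory=dict)

    def connect(self, rel: str, source_id: str, concept_name: str) -> str:
        """Attach a source to a concept, creating the concept if needed."""
        key = normalize(concept_name)
        self.links.setdefault(key, set()).add((rel, source_id))
        return key

    def cross_modal(self) -> list[str]:
        """Concepts reached by both a MENTIONS and a DEPICTS edge."""
        return sorted(
            key for key, edges in self.links.items()
            if {"MENTIONS", "DEPICTS"} <= {rel for rel, _ in edges}
        )


index = ConceptIndex()
index.connect("MENTIONS", "chunk-1", "3-Phase Motor")
index.connect("DEPICTS", "image-1", "3-phase motor")
index.connect("DEPICTS", "image-1", "wiring diagram")
print(index.cross_modal())  # ['3-phase motor']
```

Because both spellings normalize to the same key, the text chunk and the image end up on a single concept, which is what makes the cross-modal hop possible.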

Deliverables

1. System Vision Model (llm_manager/models.py)

New is_system_vision_model boolean field on LLMModel, following the same pattern as the existing system embedding, chat, and reranker models.

is_system_vision_model = models.BooleanField(
    default=False,
    help_text="Mark as the system-wide vision model for image analysis."
)

@classmethod
def get_system_vision_model(cls):
    return cls.objects.filter(
        is_system_vision_model=True,
        is_active=True,
        model_type__in=["vision", "chat"],  # Vision-capable chat models work
    ).first()

Added "vision_analysis" to LLMUsage.purpose choices for cost tracking.

2. Image Model Enhancements (library/models.py)

New fields on the Image node:

| Field | Type | Purpose |
| --- | --- | --- |
| ocr_text | StringProperty | Visible text extracted by vision model |
| vision_model_name | StringProperty | Which model analyzed this image |
| analysis_status | StringProperty | pending / completed / failed / skipped |

Expanded image_type choices: cover, diagram, chart, table, screenshot, illustration, map, portrait, artwork, still, photo.

New relationship: Image ─[DEPICTS]─→ Concept

3. Content-Type Vision Prompts (library/content_types.py)

Each library type now includes a vision_prompt that shapes what the vision model looks for:

| Library Type | Vision Focus |
| --- | --- |
| Fiction | Illustrations, cover art, characters, scenes, artistic style |
| Non-Fiction | Photographs, maps, charts, people, places, historical context |
| Technical | Diagrams, schematics, charts, tables, labels, processes |
| Music | Album covers, band photos, liner notes, era/aesthetic |
| Film | Stills, posters, storyboards, cinematographic elements |
| Art | Medium, style, subject, composition, artistic period |
| Journal | Photos, sketches, documents, dates, context clues |
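
One way the per-type focus could be turned into a prompt is a plain lookup with a fallback. This is a sketch only: the focus lists are copied from the table above, but the dictionary name, the lowercase keys other than "nonfiction", the prompt wording, and the fallback text are all hypothetical and not taken from content_types.py.

```python
# Focus lists copied from the table above. The names, keys, prompt wording,
# and fallback below are hypothetical, not the project's actual prompts.
VISION_FOCUS = {
    "fiction": "illustrations, cover art, characters, scenes, artistic style",
    "nonfiction": "photographs, maps, charts, people, places, historical context",
    "technical": "diagrams, schematics, charts, tables, labels, processes",
    "music": "album covers, band photos, liner notes, era/aesthetic",
    "film": "stills, posters, storyboards, cinematographic elements",
    "art": "medium, style, subject, composition, artistic period",
    "journal": "photos, sketches, documents, dates, context clues",
}

# Used when a library type has no entry (assumed behavior).
DEFAULT_FOCUS = "the main subject, any visible text, and notable details"


def build_vision_prompt(library_type: str) -> str:
    """Return a content-type-aware prompt for the given library type."""
    focus = VISION_FOCUS.get(library_type.lower(), DEFAULT_FOCUS)
    return f"Analyze this image. Pay particular attention to: {focus}."


print(build_vision_prompt("technical"))
```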

4. Vision Analysis Service (library/services/vision.py)

New service: VisionAnalyzer — analyzes images via the system vision model.

API Call Format

Uses OpenAI-compatible multimodal chat format:

{
    "model": "qwen3-vl-72b",
    "messages": [
        {"role": "system", "content": "<structured JSON output instructions>"},
        {"role": "user", "content": [
            {"type": "text", "text": "<content-type-aware vision prompt>"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]}
    ],
    "temperature": 0.1,
    "max_tokens": 800
}
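
The request body above can be assembled from raw image bytes with the standard library alone. The sketch below assumes a hypothetical helper name, `build_vision_request`, and takes the MIME type as a parameter; the temperature and token limit mirror the example payload. Sending the payload over HTTP is left out.

```python
import base64
import json


def build_vision_request(image_bytes: bytes, mime_type: str, model: str,
                         system_prompt: str, vision_prompt: str) -> dict:
    """Build an OpenAI-compatible multimodal chat payload (hypothetical helper)."""
    # Images travel inline as a base64 data URL.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": vision_prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ]},
        ],
        "temperature": 0.1,
        "max_tokens": 800,
    }


# Placeholder bytes (the PNG signature), not a complete image.
payload = build_vision_request(
    b"\x89PNG\r\n\x1a\n", "image/png", "qwen3-vl-72b",
    "Respond with JSON only.", "Describe this image.",
)
print(json.dumps(payload, indent=2))
```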

Response Structure

The vision model returns structured JSON:

{
    "image_type": "diagram",
    "description": "A wiring diagram showing a 3-phase motor connection with L1, L2, L3 inputs",
    "ocr_text": "L1 L2 L3 GND PE 400V",
    "concepts": [
        {"name": "3-phase motor", "type": "topic"},
        {"name": "wiring diagram", "type": "technique"}
    ]
}
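
Vision models do not always return bare JSON; the object is sometimes surrounded by extra text. A tolerant parser might look like the sketch below. The function name `parse_vision_response` and the "first brace to last brace" extraction rule are assumptions, not the behavior of the actual VisionAnalyzer.

```python
import json


def parse_vision_response(raw: str) -> dict:
    """Extract the JSON object from model output (hypothetical helper).

    Takes everything from the first '{' to the last '}' so that leading or
    trailing chatter around the object is ignored.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a ValueError subclass, so callers can catch
    # ValueError for both "missing" and "malformed" output.
    return json.loads(raw[start:end + 1])


raw_output = 'Here is the analysis:\n{"image_type": "diagram", "concepts": []}\nDone.'
print(parse_vision_response(raw_output))
```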

Processing Flow

For each image:

  1. Read image bytes from S3
  2. Base64-encode and send to vision model with content-type-aware prompt
  3. Parse structured JSON response
  4. Validate and normalize (image_type must be valid, concepts capped at 20)
  5. Update Image node (description, ocr_text, image_type, vision_model_name, analysis_status)
  6. Create/connect Concept nodes via DEPICTS relationship
  7. Also connect Item → Concept via REFERENCES (weight 0.8)
  8. Log usage to LLMUsage
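
Step 4 (validate and normalize) could be sketched as follows. The valid image types and the cap of 20 concepts come from this document; the fallback type for unrecognized values, the default concept type, and the case-insensitive duplicate check are assumptions made for illustration.

```python
VALID_IMAGE_TYPES = {
    "cover", "diagram", "chart", "table", "screenshot", "illustration",
    "map", "portrait", "artwork", "still", "photo",
}
MAX_CONCEPTS = 20
FALLBACK_IMAGE_TYPE = "photo"    # assumption: the source names no fallback
DEFAULT_CONCEPT_TYPE = "topic"   # assumption


def normalize_analysis(data: dict) -> dict:
    """Coerce a parsed vision response into a predictable shape."""
    image_type = str(data.get("image_type") or "").strip().lower()
    if image_type not in VALID_IMAGE_TYPES:
        image_type = FALLBACK_IMAGE_TYPE

    concepts, seen = [], set()
    for entry in data.get("concepts") or []:
        if not isinstance(entry, dict):
            continue
        name = str(entry.get("name") or "").strip()
        if not name or name.lower() in seen:
            continue  # drop unnamed and duplicate concepts
        seen.add(name.lower())
        concepts.append({
            "name": name,
            "type": str(entry.get("type") or DEFAULT_CONCEPT_TYPE),
        })
        if len(concepts) == MAX_CONCEPTS:
            break  # cap the number of concepts per image

    return {
        "image_type": image_type,
        "description": str(data.get("description") or "").strip(),
        "ocr_text": str(data.get("ocr_text") or "").strip(),
        "concepts": concepts,
    }


print(normalize_analysis({"image_type": " Diagram ", "concepts": [{"name": "3-phase motor"}]}))
```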

5. Pipeline Integration (library/services/pipeline.py)

Vision analysis is Stage 5.5 in the pipeline:

Stage 5:   Store images in S3 + Neo4j          (existing)
Stage 5.5: Vision analysis                      (NEW)
Stage 6:   Embed images multimodally            (existing)
Stage 7:   Concept extraction from text         (existing)

Behavior:

  • If system vision model configured → analyze all images
  • If no vision model → mark images as analysis_status="skipped", continue pipeline
  • Vision analysis failures are per-image (don't fail the whole pipeline)
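
The three behaviors above amount to a loop with a per-image try/except. The sketch below uses plain dictionaries in place of Image nodes; `run_vision_stage` and `fake_analyze` are hypothetical names, and passing `None` stands in for "no system vision model configured".

```python
from typing import Callable, Optional


def run_vision_stage(images: list[dict],
                     analyze: Optional[Callable[[dict], dict]]) -> dict[str, int]:
    """Analyze each image independently and record its analysis_status."""
    counts = {"completed": 0, "failed": 0, "skipped": 0}
    for image in images:
        if analyze is None:
            # No vision model configured: mark and move on.
            image["analysis_status"] = "skipped"
        else:
            try:
                image.update(analyze(image))
                image["analysis_status"] = "completed"
            except Exception:
                # One bad image must not abort the remaining images.
                image["analysis_status"] = "failed"
        counts[image["analysis_status"]] += 1
    return counts


def fake_analyze(image: dict) -> dict:
    """Stand-in analyzer that fails for one specific image."""
    if image["id"] == "bad":
        raise RuntimeError("model timeout")
    return {"description": "ok"}


images = [{"id": "a"}, {"id": "bad"}, {"id": "c"}]
print(run_vision_stage(images, fake_analyze))
# {'completed': 2, 'failed': 1, 'skipped': 0}
```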

6. Non-Fiction Library Type

New "nonfiction" library type added alongside the existing six types:

| Setting | Value |
| --- | --- |
| Strategy | section_aware |
| Chunk Size | 768 |
| Chunk Overlap | 96 |
| Boundaries | chapter, section, paragraph |
| Focus | Factual claims, historical events, people, places, arguments, evidence |
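
To show how the chunk size and overlap interact, the sketch below computes plain sliding-window spans. It is a simplification, not the section_aware strategy: it ignores chapter, section, and paragraph boundaries, and it assumes size and overlap are counted in the same unit (the source does not say whether that is tokens or characters). The function name `window_spans` is hypothetical.

```python
def window_spans(n_units: int, size: int = 768, overlap: int = 96) -> list[tuple[int, int]]:
    """Return (start, end) spans of fixed-size windows with overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # 768 - 96 = 672 new units per chunk
    spans = []
    start = 0
    while start < n_units:
        end = min(start + size, n_units)
        spans.append((start, end))
        if end == n_units:
            break  # the last window reached the end of the text
        start += stride
    return spans


print(window_spans(2000))
# [(0, 768), (672, 1440), (1344, 2000)]
```

Each chunk after the first repeats the last 96 units of its predecessor, so a 2000-unit section yields three chunks rather than the two and a bit that non-overlapping windows would give.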

7. Prometheus Metrics (library/metrics.py)

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| mnemosyne_vision_analyses_total | Counter | status | Images analyzed |
| mnemosyne_vision_analysis_duration_seconds | Histogram | | Per-image analysis latency |
| mnemosyne_vision_concepts_extracted_total | Counter | concept_type | Concepts from images |
8. Dashboard & UI Updates

  • Embedding dashboard shows system vision model status alongside embedding, chat, and reranker models
  • Item detail page shows enriched image cards with:
    • Auto-classified image type badge
    • Vision-generated description
    • Collapsible OCR text section
    • Analysis status indicator
    • Vision model name reference

File Structure

mnemosyne/library/
├── services/
│   ├── vision.py              # NEW — VisionAnalyzer service
│   ├── pipeline.py            # Modified — Stage 5.5 integration
│   └── ...
├── models.py                  # Modified — Image fields, DEPICTS rel, nonfiction type
├── content_types.py           # Modified — vision_prompt for all 7 types
├── metrics.py                 # Modified — vision analysis metrics
├── views.py                   # Modified — vision model in dashboard context
└── templates/library/
    ├── item_detail.html        # Modified — enriched image display
    └── embedding_dashboard.html # Modified — vision model row

mnemosyne/llm_manager/
├── models.py                  # Modified — is_system_vision_model, get_system_vision_model()
└── migrations/
    └── 0003_add_vision_model_and_usage.py  # NEW

Performance Considerations

  • Each image = one vision model inference (~2-5 seconds on local GPU)
  • A document with 20 images = ~40-100 seconds of extra processing
  • Runs in Celery async tasks — does not block web requests
  • Uses the same GPU infrastructure already serving embedding and reranking
  • Zero API cost when running locally
  • Per-image failure isolation — one bad image doesn't fail the pipeline

Success Criteria

  • System vision model configurable via Django admin (same pattern as other system models)
  • Images auto-classified with correct image_type (not hardcoded "diagram")
  • Vision-generated descriptions visible on item detail page
  • OCR text extracted from images with visible text
  • Concepts extracted from images connected to Concept nodes via DEPICTS
  • Shared concepts bridge text chunks and images in the graph
  • Pipeline gracefully skips vision analysis when no vision model configured
  • Non-fiction library type available for history, biography, essays, etc.
  • Prometheus metrics track vision analysis throughput and latency
  • Dashboard shows vision model status