Robert Helewka 90db904959 Add vision analysis capabilities to the embedding pipeline

- Introduced a new vision analysis service to classify, describe, and extract text from images.
- Enhanced the Image model with fields for OCR text, vision model name, and analysis status.
- Added a new "nonfiction" library type with specific chunking and embedding configurations.
- Updated content types to include vision prompts for various library types.
- Integrated vision analysis into the embedding pipeline, allowing for image analysis during document processing.
- Implemented metrics to track vision analysis performance and usage.
- Updated UI components to display vision analysis results and statuses in item details and the embedding dashboard.
- Added migration for new vision model fields and usage tracking.

2026-03-22 15:14:34 +00:00

Phase 2B: Vision Analysis Pipeline

Objective

Add vision-based image understanding to the embedding pipeline: when documents are processed and images extracted, use a vision-capable LLM to classify, describe, extract text from, and identify concepts within each image — connecting images into the knowledge graph as first-class participants alongside text.

Heritage

Extends Phase 2's image extraction (PyMuPDF) and multimodal embedding (Qwen3-VL) with structured understanding. Previously, images were stored and optionally embedded into vector space, but had no description, no classification beyond a hardcoded default, and no concept graph integration. Phase 2B makes images understood.

Architecture Overview

Image Extracted (Phase 2)
  → Stored in S3 + Image node in Neo4j
    → Vision Analysis (NEW - Phase 2B)
      → Classify image type (diagram, photo, chart, etc.)
      → Generate natural language description
      → Extract visible text (OCR)
      → Identify concepts depicted
        → Create DEPICTS relationships to Concept nodes
        → Connect Item to image-derived concepts via REFERENCES
    → Multimodal Embedding (Phase 2, now enhanced)

How It Fits the Graph

Vision analysis enriches images so they participate in the knowledge graph the same way text chunks do:

Item ─[HAS_IMAGE]──→ Image
                        │
                        ├── description: "Wiring diagram showing 3-phase motor connection"
                        ├── image_type: "diagram" (auto-classified)
                        ├── ocr_text: "L1 L2 L3 ..."
                        ├── vision_model_name: "Qwen3-VL-72B"
                        │
                        ├──[DEPICTS]──→ Concept("3-phase motor")
                        ├──[DEPICTS]──→ Concept("wiring diagram")
                        └──[DEPICTS]──→ Concept("electrical connection")
                                            │
                                            └──[RELATED_TO]──→ Concept("motor control")
                                                                  ↑
                                            Chunk text also ──[MENTIONS]──┘

Three relationship types connect content to concepts:

Chunk ─[MENTIONS]─→ Concept — text discusses this concept
Item ─[REFERENCES]─→ Concept — item is about this concept
Image ─[DEPICTS]─→ Concept — image visually shows this concept

Concepts extracted from images merge with the same Concept nodes extracted from text via deduplication by name. This means graph traversal discovers cross-modal connections automatically.
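As a minimal sketch of how merging by name yields cross-modal links, the following uses an in-memory index in place of the graph. The `ConceptIndex` class and the case/whitespace normalization in `normalize_name` are assumptions made for illustration; the document only states that deduplication is by name.

```python
from dataclasses import dataclass, field


def normalize_name(name: str) -> str:
    """Collapse case and whitespace so 'Motor  Control' matches 'motor control'.

    Assumption: the real deduplication rule may differ.
    """
    return " ".join(name.lower().split())


@dataclass
class ConceptIndex:
    """Hypothetical in-memory stand-in for Concept nodes in the graph."""

    # normalized concept name -> set of (source_kind, source_id, relationship)
    links: dict = field(default_factory=dict)

    def connect(self, source_kind: str, source_id: str, rel: str, concept: str) -> str:
        """Attach a source to a concept, reusing the entry if the name already exists."""
        key = normalize_name(concept)
        self.links.setdefault(key, set()).add((source_kind, source_id, rel))
        return key

    def cross_modal(self) -> list:
        """Return concepts reached from both a Chunk and an Image."""
        shared = []
        for key, links in self.links.items():
            kinds = {kind for kind, _, _ in links}
            if {"Chunk", "Image"} <= kinds:
                shared.append(key)
        return sorted(shared)


index = ConceptIndex()
index.connect("Chunk", "chunk-1", "MENTIONS", "Motor Control")
index.connect("Image", "image-1", "DEPICTS", "motor  control")
index.connect("Image", "image-1", "DEPICTS", "wiring diagram")
shared_concepts = index.cross_modal()
```

Because both sources resolve to the same key, the chunk and the image end up attached to one concept, which is what lets a traversal move from text to image without extra linking logic.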

Deliverables

1. System Vision Model (`llm_manager/models.py`)

New is_system_vision_model boolean field on LLMModel, following the same pattern as the existing system embedding, chat, and reranker models.

is_system_vision_model = models.BooleanField(
    default=False,
    help_text="Mark as the system-wide vision model for image analysis."
)

@classmethod
def get_system_vision_model(cls):
    return cls.objects.filter(
        is_system_vision_model=True,
        is_active=True,
        model_type__in=["vision", "chat"],  # Vision-capable chat models work
    ).first()

Added "vision_analysis" to LLMUsage.purpose choices for cost tracking.

2. Image Model Enhancements (`library/models.py`)

New fields on the Image node:

| Field | Type | Purpose |
| --- | --- | --- |
| `ocr_text` | StringProperty | Visible text extracted by vision model |
| `vision_model_name` | StringProperty | Which model analyzed this image |
| `analysis_status` | StringProperty | pending / completed / failed / skipped |

Expanded image_type choices: cover, diagram, chart, table, screenshot, illustration, map, portrait, artwork, still, photo.

New relationship: Image ─[DEPICTS]─→ Concept

3. Content-Type Vision Prompts (`library/content_types.py`)

Each library type now includes a vision_prompt that shapes what the vision model looks for:

| Library Type | Vision Focus |
| --- | --- |
| Fiction | Illustrations, cover art, characters, scenes, artistic style |
| Non-Fiction | Photographs, maps, charts, people, places, historical context |
| Technical | Diagrams, schematics, charts, tables, labels, processes |
| Music | Album covers, band photos, liner notes, era/aesthetic |
| Film | Stills, posters, storyboards, cinematographic elements |
| Art | Medium, style, subject, composition, artistic period |
| Journal | Photos, sketches, documents, dates, context clues |
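A per-type prompt lookup could be sketched as below. The dictionary keys (other than `"nonfiction"`), the prompt wording, and the `build_vision_prompt` helper are all hypothetical; only the focus lists are taken from the table.

```python
# Focus text copied from the table above; the keys are assumed identifiers.
VISION_FOCUS = {
    "fiction": "illustrations, cover art, characters, scenes, artistic style",
    "nonfiction": "photographs, maps, charts, people, places, historical context",
    "technical": "diagrams, schematics, charts, tables, labels, processes",
    "music": "album covers, band photos, liner notes, era/aesthetic",
    "film": "stills, posters, storyboards, cinematographic elements",
    "art": "medium, style, subject, composition, artistic period",
    "journal": "photos, sketches, documents, dates, context clues",
}

# Assumed generic fallback for library types without a specific focus.
DEFAULT_PROMPT = "Describe this image."


def build_vision_prompt(library_type: str) -> str:
    """Return a prompt tailored to the library type, or the generic fallback."""
    focus = VISION_FOCUS.get(library_type)
    if focus is None:
        return DEFAULT_PROMPT
    return f"Analyze this image. Pay particular attention to: {focus}."
```

Keeping the focus text in one mapping means a new library type only needs one new entry rather than a new code path.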

4. Vision Analysis Service (`library/services/vision.py`)

New service: VisionAnalyzer — analyzes images via the system vision model.

API Call Format

Uses OpenAI-compatible multimodal chat format:

{
    "model": "qwen3-vl-72b",
    "messages": [
        {"role": "system", "content": "<structured JSON output instructions>"},
        {"role": "user", "content": [
            {"type": "text", "text": "<content-type-aware vision prompt>"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]}
    ],
    "temperature": 0.1,
    "max_tokens": 800
}
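A request body of this shape can be assembled with the standard library alone. The sketch below is illustrative: `build_vision_request` and its parameter names are assumptions, and it stops at building the payload, leaving the HTTP call to whatever client the service actually uses.

```python
import base64


def build_vision_request(
    image_bytes: bytes,
    mime_type: str,
    model: str,
    system_prompt: str,
    vision_prompt: str,
) -> dict:
    """Build an OpenAI-compatible multimodal chat payload for one image."""
    # Inline the image as a data URL so no separate upload step is needed.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": vision_prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            },
        ],
        # Values taken from the example request above.
        "temperature": 0.1,
        "max_tokens": 800,
    }
```

Base64 inflates the image by roughly a third, so very large images may be worth downscaling before encoding.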

Response Structure

The vision model returns structured JSON:

{
    "image_type": "diagram",
    "description": "A wiring diagram showing a 3-phase motor connection with L1, L2, L3 inputs",
    "ocr_text": "L1 L2 L3 GND PE 400V",
    "concepts": [
        {"name": "3-phase motor", "type": "topic"},
        {"name": "wiring diagram", "type": "technique"}
    ]
}
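Model replies do not always arrive as bare JSON; some models wrap the object in extra prose or formatting. One tolerant way to recover the object is sketched below. `parse_vision_reply` is a hypothetical helper, not necessarily how `VisionAnalyzer` parses its replies.

```python
import json
from typing import Optional


def parse_vision_reply(raw: str) -> Optional[dict]:
    """Extract the outermost JSON object from a model reply, or return None."""
    # Take the span from the first '{' to the last '}' and ignore anything around it.
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Only a JSON object matches the expected response structure.
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` rather than raising lets the caller decide whether an unparseable reply marks the image as failed.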

Processing Flow

For each image:

1. Read image bytes from S3
2. Base64-encode and send to vision model with content-type-aware prompt
3. Parse structured JSON response
4. Validate and normalize (image_type must be valid, concepts capped at 20)
5. Update Image node (description, ocr_text, image_type, vision_model_name, analysis_status)
6. Create/connect Concept nodes via DEPICTS relationship
7. Also connect Item → Concept via REFERENCES (weight 0.8)
8. Log usage to LLMUsage
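The validate-and-normalize step might look like the following sketch. The valid types come from the expanded image_type choices and the cap of 20 from the flow above. The `"photo"` fallback, the default concept type `"topic"`, and the case-insensitive duplicate check are assumptions.

```python
VALID_IMAGE_TYPES = {
    "cover", "diagram", "chart", "table", "screenshot", "illustration",
    "map", "portrait", "artwork", "still", "photo",
}
MAX_CONCEPTS = 20


def normalize_analysis(parsed: dict, fallback_type: str = "photo") -> dict:
    """Coerce a parsed vision reply into a safe, bounded structure."""
    image_type = str(parsed.get("image_type", "")).strip().lower()
    if image_type not in VALID_IMAGE_TYPES:
        image_type = fallback_type  # assumed fallback for unknown types

    raw_concepts = parsed.get("concepts")
    if not isinstance(raw_concepts, list):
        raw_concepts = []

    concepts = []
    seen = set()
    for entry in raw_concepts:
        if not isinstance(entry, dict):
            continue
        name = str(entry.get("name", "")).strip()
        # Skip empty names and case-insensitive duplicates.
        if not name or name.lower() in seen:
            continue
        seen.add(name.lower())
        concepts.append({"name": name, "type": str(entry.get("type", "topic"))})
        if len(concepts) == MAX_CONCEPTS:
            break

    return {
        "image_type": image_type,
        "description": str(parsed.get("description") or "").strip(),
        "ocr_text": str(parsed.get("ocr_text") or "").strip(),
        "concepts": concepts,
    }
```

Normalizing before any graph write keeps malformed model output from creating invalid image types or an unbounded number of Concept nodes.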

5. Pipeline Integration (`library/services/pipeline.py`)

Vision analysis is Stage 5.5 in the pipeline:

Stage 5:   Store images in S3 + Neo4j          (existing)
Stage 5.5: Vision analysis                      (NEW)
Stage 6:   Embed images multimodally            (existing)
Stage 7:   Concept extraction from text         (existing)

Behavior:

- If system vision model configured → analyze all images
- If no vision model → mark images as analysis_status="skipped", continue pipeline
- Vision analysis failures are per-image (don't fail the whole pipeline)
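The three behaviors above can be sketched as a single loop. Here images are plain dicts and `analyze` is any callable, both stand-ins for the real Image nodes and the vision service; `run_vision_stage` is a hypothetical name.

```python
from typing import Callable, Iterable, Optional


def run_vision_stage(
    images: Iterable[dict],
    analyze: Optional[Callable[[dict], dict]],
) -> dict:
    """Analyze each image independently and return a count per status."""
    counts = {"completed": 0, "failed": 0, "skipped": 0}
    for image in images:
        if analyze is None:
            # No vision model configured: skip, but let the pipeline continue.
            image["analysis_status"] = "skipped"
        else:
            try:
                image.update(analyze(image))
                image["analysis_status"] = "completed"
            except Exception:
                # One bad image must not abort the remaining images.
                image["analysis_status"] = "failed"
        counts[image["analysis_status"]] += 1
    return counts
```

Passing `analyze=None` models the case where no system vision model is configured, so later stages still run on every image.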

6. Non-Fiction Library Type

New "nonfiction" library type added alongside the existing six types:

| Setting | Value |
| --- | --- |
| Strategy | `section_aware` |
| Chunk Size | 768 |
| Chunk Overlap | 96 |
| Boundaries | chapter, section, paragraph |
| Focus | Factual claims, historical events, people, places, arguments, evidence |
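To show how a chunk size of 768 and an overlap of 96 interact, here is a plain sliding-window sketch. It deliberately ignores the chapter, section, and paragraph boundaries that the `section_aware` strategy respects, and the unit (tokens or characters) is not stated in this document. `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(length: int, chunk_size: int = 768, overlap: int = 96) -> list:
    """Return (start, end) offsets of overlapping windows covering `length` units."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 768 - 96 = 672 new units per window
    windows = []
    start = 0
    while start < length:
        end = min(start + chunk_size, length)
        windows.append((start, end))
        if end == length:
            break
        start += stride
    return windows


# A 2000-unit section yields three windows, each sharing 96 units with the previous one.
print(chunk_windows(2000))  # [(0, 768), (672, 1440), (1344, 2000)]
```

Each window advances by 672 units, so the 96-unit overlap gives text near a chunk edge some context in both neighboring chunks.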

7. Prometheus Metrics (`library/metrics.py`)

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| `mnemosyne_vision_analyses_total` | Counter | status | Images analyzed |
| `mnemosyne_vision_analysis_duration_seconds` | Histogram | none | Per-image analysis latency |
| `mnemosyne_vision_concepts_extracted_total` | Counter | concept_type | Concepts from images |

8. Dashboard & UI Updates

- Embedding dashboard shows system vision model status alongside embedding, chat, and reranker models
- Item detail page shows enriched image cards with:
  - Auto-classified image type badge
  - Vision-generated description
  - Collapsible OCR text section
  - Analysis status indicator
  - Vision model name reference

File Structure

mnemosyne/library/
├── services/
│   ├── vision.py              # NEW — VisionAnalyzer service
│   ├── pipeline.py            # Modified — Stage 5.5 integration
│   └── ...
├── models.py                  # Modified — Image fields, DEPICTS rel, nonfiction type
├── content_types.py           # Modified — vision_prompt for all 7 types
├── metrics.py                 # Modified — vision analysis metrics
├── views.py                   # Modified — vision model in dashboard context
└── templates/library/
    ├── item_detail.html        # Modified — enriched image display
    └── embedding_dashboard.html # Modified — vision model row

mnemosyne/llm_manager/
├── models.py                  # Modified — is_system_vision_model, get_system_vision_model()
└── migrations/
    └── 0003_add_vision_model_and_usage.py  # NEW

Performance Considerations

- Each image = one vision model inference (~2-5 seconds on local GPU)
- A document with 20 images = ~40-100 seconds of extra processing
- Runs in Celery async tasks, so it does not block web requests
- Uses the same GPU infrastructure already serving embedding and reranking
- Zero API cost when running locally
- Per-image failure isolation: one bad image doesn't fail the pipeline
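The per-document estimate is a straight multiplication, assuming images are analyzed one after another. The helper below is a hypothetical sketch that reproduces the 20-image figure.

```python
def extra_processing_seconds(image_count: int, per_image_range=(2.0, 5.0)) -> tuple:
    """Return the (low, high) added seconds, assuming sequential inference."""
    low, high = per_image_range
    return image_count * low, image_count * high


print(extra_processing_seconds(20))  # (40.0, 100.0)
```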

Success Criteria

- System vision model configurable via Django admin (same pattern as other system models)
- Images auto-classified with correct image_type (not hardcoded "diagram")
- Vision-generated descriptions visible on item detail page
- OCR text extracted from images with visible text
- Concepts extracted from images connected to Concept nodes via DEPICTS
- Shared concepts bridge text chunks and images in the graph
- Pipeline gracefully skips vision analysis when no vision model configured
- Non-fiction library type available for history, biography, essays, etc.
- Prometheus metrics track vision analysis throughput and latency
- Dashboard shows vision model status
