Add vision analysis capabilities to the embedding pipeline
- Introduced a new vision analysis service to classify, describe, and extract text from images.
- Enhanced the Image model with fields for OCR text, vision model name, and analysis status.
- Added a new "nonfiction" library type with specific chunking and embedding configurations.
- Updated content types to include vision prompts for various library types.
- Integrated vision analysis into the embedding pipeline, allowing image analysis during document processing.
- Implemented metrics to track vision analysis performance and usage.
- Updated UI components to display vision analysis results and statuses in item details and the embedding dashboard.
- Added a migration for new vision model fields and usage tracking.
New file: docs/PHASE_2B_VISION_PIPELINE.md (244 lines)
# Phase 2B: Vision Analysis Pipeline

## Objective

Add vision-based image understanding to the embedding pipeline: when documents are processed and images extracted, use a vision-capable LLM to classify, describe, extract text from, and identify concepts within each image — connecting images into the knowledge graph as first-class participants alongside text.

## Heritage

Extends Phase 2's image extraction (PyMuPDF) and multimodal embedding (Qwen3-VL) with structured understanding. Previously, images were stored and optionally embedded into vector space, but had no description, no classification beyond a hardcoded default, and no concept graph integration. Phase 2B makes images *understood*.

## Architecture Overview
```
Image Extracted (Phase 2)
  → Stored in S3 + Image node in Neo4j
  → Vision Analysis (NEW - Phase 2B)
      → Classify image type (diagram, photo, chart, etc.)
      → Generate natural language description
      → Extract visible text (OCR)
      → Identify concepts depicted
      → Create DEPICTS relationships to Concept nodes
      → Connect Item to image-derived concepts via REFERENCES
  → Multimodal Embedding (Phase 2, now enhanced)
```

## How It Fits the Graph

Vision analysis enriches images so they participate in the knowledge graph the same way text chunks do:

```
Item ─[HAS_IMAGE]──→ Image
                       │
                       ├── description: "Wiring diagram showing 3-phase motor connection"
                       ├── image_type: "diagram" (auto-classified)
                       ├── ocr_text: "L1 L2 L3 ..."
                       ├── vision_model_name: "Qwen3-VL-72B"
                       │
                       ├──[DEPICTS]──→ Concept("3-phase motor")
                       ├──[DEPICTS]──→ Concept("wiring diagram")
                       └──[DEPICTS]──→ Concept("electrical connection")
                                         │
                                         └──[RELATED_TO]──→ Concept("motor control")
                                                                        ↑
                                         Chunk text also ──[MENTIONS]───┘
```

Three relationship types connect content to concepts:

- `Chunk ─[MENTIONS]─→ Concept` — text discusses this concept
- `Item ─[REFERENCES]─→ Concept` — item is about this concept
- `Image ─[DEPICTS]─→ Concept` — image visually shows this concept

Concepts extracted from images merge with the **same Concept nodes** extracted from text via deduplication by name. This means graph traversal discovers cross-modal connections automatically.
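
As an illustrative stand-in for the graph-side merge (in the real pipeline this is a Neo4j MERGE on the Concept node, not a Python dict), a minimal sketch of name-keyed deduplication; the normalization rule here is an assumption:

```python
# Stand-in for the Neo4j MERGE: dedupe concepts by normalized name so that
# concepts seen in text chunks and in images land on the same node.
def normalize_name(name: str) -> str:
    """Collapse case and whitespace so "Wiring  Diagram" and "wiring diagram" merge."""
    return " ".join(name.lower().split())

def merge_concepts(registry: dict, extracted: list) -> dict:
    """Upsert extracted concepts (from text or images) into one shared registry."""
    for concept in extracted:
        key = normalize_name(concept["name"])
        registry.setdefault(key, {"name": key, "sources": set()})
        registry[key]["sources"].add(concept.get("source", "text"))
    return registry
```

A concept seen in both a chunk and an image ends up as one entry, which is what lets traversal cross modalities.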

## Deliverables

### 1. System Vision Model (`llm_manager/models.py`)

New `is_system_vision_model` boolean field on `LLMModel`, following the same pattern as the existing system embedding, chat, and reranker models.
```python
is_system_vision_model = models.BooleanField(
    default=False,
    help_text="Mark as the system-wide vision model for image analysis.",
)

@classmethod
def get_system_vision_model(cls):
    return cls.objects.filter(
        is_system_vision_model=True,
        is_active=True,
        model_type__in=["vision", "chat"],  # Vision-capable chat models work
    ).first()
```

Added `"vision_analysis"` to `LLMUsage.purpose` choices for cost tracking.

### 2. Image Model Enhancements (`library/models.py`)

New fields on the `Image` node:

| Field | Type | Purpose |
|-------|------|---------|
| `ocr_text` | StringProperty | Visible text extracted by vision model |
| `vision_model_name` | StringProperty | Which model analyzed this image |
| `analysis_status` | StringProperty | pending / completed / failed / skipped |

Expanded `image_type` choices: cover, diagram, chart, table, screenshot, illustration, map, portrait, artwork, still, photo.

New relationship: `Image ─[DEPICTS]─→ Concept`

### 3. Content-Type Vision Prompts (`library/content_types.py`)

Each library type now includes a `vision_prompt` that shapes what the vision model looks for:

| Library Type | Vision Focus |
|---|---|
| **Fiction** | Illustrations, cover art, characters, scenes, artistic style |
| **Non-Fiction** | Photographs, maps, charts, people, places, historical context |
| **Technical** | Diagrams, schematics, charts, tables, labels, processes |
| **Music** | Album covers, band photos, liner notes, era/aesthetic |
| **Film** | Stills, posters, storyboards, cinematographic elements |
| **Art** | Medium, style, subject, composition, artistic period |
| **Journal** | Photos, sketches, documents, dates, context clues |

### 4. Vision Analysis Service (`library/services/vision.py`)

New service: `VisionAnalyzer` — analyzes images via the system vision model.

#### API Call Format

Uses the OpenAI-compatible multimodal chat format:

```python
{
    "model": "qwen3-vl-72b",
    "messages": [
        {"role": "system", "content": "<structured JSON output instructions>"},
        {"role": "user", "content": [
            {"type": "text", "text": "<content-type-aware vision prompt>"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ]},
    ],
    "temperature": 0.1,
    "max_tokens": 800,
}
```
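
For illustration, a minimal helper that assembles this payload from raw image bytes; the function name and the system prompt text are assumptions, not the actual `VisionAnalyzer` code:

```python
# Hypothetical helper sketching how the multimodal request is assembled.
import base64

def build_vision_payload(model_name, prompt, image_bytes, mime="image/png"):
    """Wrap an image and a content-type-aware prompt in the OpenAI-style chat format."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model_name,
        "messages": [
            {"role": "system", "content": "Reply with structured JSON only."},
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ]},
        ],
        "temperature": 0.1,   # low temperature for stable structured output
        "max_tokens": 800,
    }
```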

#### Response Structure

The vision model returns structured JSON:

```json
{
  "image_type": "diagram",
  "description": "A wiring diagram showing a 3-phase motor connection with L1, L2, L3 inputs",
  "ocr_text": "L1 L2 L3 GND PE 400V",
  "concepts": [
    {"name": "3-phase motor", "type": "topic"},
    {"name": "wiring diagram", "type": "technique"}
  ]
}
```
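
The parse-and-normalize step could look like this minimal sketch; the function name and the fallback `image_type` are assumptions, not the service's actual behavior:

```python
# Hypothetical sketch of validating the model's JSON reply.
import json

VALID_IMAGE_TYPES = {"cover", "diagram", "chart", "table", "screenshot",
                     "illustration", "map", "portrait", "artwork", "still", "photo"}
MAX_CONCEPTS = 20

def parse_vision_response(raw: str) -> dict:
    """Parse the reply, normalizing invalid fields instead of failing outright."""
    data = json.loads(raw)
    if data.get("image_type") not in VALID_IMAGE_TYPES:
        data["image_type"] = "illustration"  # assumed fallback type
    data["ocr_text"] = data.get("ocr_text") or ""
    # Cap concepts at 20 and drop entries without a name.
    concepts = [c for c in data.get("concepts", []) if c.get("name")]
    data["concepts"] = concepts[:MAX_CONCEPTS]
    return data
```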

#### Processing Flow

For each image:

1. Read image bytes from S3
2. Base64-encode and send to vision model with content-type-aware prompt
3. Parse structured JSON response
4. Validate and normalize (image_type must be valid, concepts capped at 20)
5. Update Image node (description, ocr_text, image_type, vision_model_name, analysis_status)
6. Create/connect Concept nodes via DEPICTS relationships
7. Also connect Item → Concept via REFERENCES (weight 0.8)
8. Log usage to LLMUsage
### 5. Pipeline Integration (`library/services/pipeline.py`)

Vision analysis is Stage 5.5 in the pipeline:

```
Stage 5:   Store images in S3 + Neo4j   (existing)
Stage 5.5: Vision analysis              (NEW)
Stage 6:   Embed images multimodally    (existing)
Stage 7:   Concept extraction from text (existing)
```

Behavior:

- If a system vision model is configured → analyze all images
- If no vision model → mark images as `analysis_status="skipped"`, continue pipeline
- Vision analysis failures are per-image (they don't fail the whole pipeline)
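
A sketch of that guard and the per-image isolation, using plain dicts and a callback in place of the real Neo4j-backed pipeline; the function and field names are assumptions:

```python
# Hypothetical Stage 5.5 guard: skip cleanly without a model, isolate failures.
import logging

logger = logging.getLogger(__name__)

def run_vision_stage(images, vision_model, analyze):
    """Mark images skipped when no model is configured; one bad image never
    fails the run, it is just recorded as failed."""
    if vision_model is None:
        for img in images:
            img["analysis_status"] = "skipped"
        return
    for img in images:
        try:
            analyze(img, vision_model)
            img["analysis_status"] = "completed"
        except Exception:
            logger.exception("vision analysis failed for image %s", img.get("id"))
            img["analysis_status"] = "failed"
```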

### 6. Non-Fiction Library Type

New `"nonfiction"` library type added alongside the existing six types:

| Setting | Value |
|---------|-------|
| Strategy | `section_aware` |
| Chunk Size | 768 |
| Chunk Overlap | 96 |
| Boundaries | chapter, section, paragraph |
| Focus | Factual claims, historical events, people, places, arguments, evidence |
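
Hypothetically, the corresponding entry in `content_types.py` might look like the dict below; the key names and prompt wording are assumptions based on the settings above, not the actual module:

```python
# Assumed shape of the nonfiction content-type configuration.
NONFICTION_CONFIG = {
    "chunking_strategy": "section_aware",
    "chunk_size": 768,
    "chunk_overlap": 96,
    "boundaries": ["chapter", "section", "paragraph"],
    "vision_prompt": (
        "Identify photographs, maps, charts, people, places, "
        "and historical context in this image."
    ),
}
```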

### 7. Prometheus Metrics (`library/metrics.py`)

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| `mnemosyne_vision_analyses_total` | Counter | status | Images analyzed |
| `mnemosyne_vision_analysis_duration_seconds` | Histogram | — | Per-image analysis latency |
| `mnemosyne_vision_concepts_extracted_total` | Counter | concept_type | Concepts from images |

### 8. Dashboard & UI Updates

- Embedding dashboard shows system vision model status alongside embedding, chat, and reranker models
- Item detail page shows enriched image cards with:
  - Auto-classified image type badge
  - Vision-generated description
  - Collapsible OCR text section
  - Analysis status indicator
  - Vision model name reference

## File Structure

```
mnemosyne/library/
├── services/
│   ├── vision.py                 # NEW — VisionAnalyzer service
│   ├── pipeline.py               # Modified — Stage 5.5 integration
│   └── ...
├── models.py                     # Modified — Image fields, DEPICTS rel, nonfiction type
├── content_types.py              # Modified — vision_prompt for all 7 types
├── metrics.py                    # Modified — vision analysis metrics
├── views.py                      # Modified — vision model in dashboard context
└── templates/library/
    ├── item_detail.html          # Modified — enriched image display
    └── embedding_dashboard.html  # Modified — vision model row

mnemosyne/llm_manager/
├── models.py                     # Modified — is_system_vision_model, get_system_vision_model()
└── migrations/
    └── 0003_add_vision_model_and_usage.py  # NEW
```

## Performance Considerations

- Each image = one vision model inference (~2-5 seconds on a local GPU)
- A document with 20 images = ~40-100 seconds of extra processing
- Runs in Celery async tasks — does not block web requests
- Uses the same GPU infrastructure already serving embedding and reranking
- Zero API cost when running locally
- Per-image failure isolation — one bad image doesn't fail the pipeline

## Success Criteria

- [ ] System vision model configurable via Django admin (same pattern as other system models)
- [ ] Images auto-classified with correct image_type (not hardcoded "diagram")
- [ ] Vision-generated descriptions visible on item detail page
- [ ] OCR text extracted from images with visible text
- [ ] Concepts extracted from images connected to Concept nodes via DEPICTS
- [ ] Shared concepts bridge text chunks and images in the graph
- [ ] Pipeline gracefully skips vision analysis when no vision model configured
- [ ] Non-fiction library type available for history, biography, essays, etc.
- [ ] Prometheus metrics track vision analysis throughput and latency
- [ ] Dashboard shows vision model status