Mnemosyne Architecture

"The electric light did not come from the continuous improvement of candles." — Oren Harari

Mnemosyne is a content-type-aware, multimodal personal knowledge management system built on Neo4j knowledge graphs and Qwen3-VL multimodal AI. Named after the Titan goddess of memory, it understands what kind of knowledge it holds and makes it searchable through text, images, and natural language.

Overview

Purpose

Mnemosyne is a personal knowledge management system that treats content type as a first-class concept. Unlike generic knowledge bases that treat all documents identically, Mnemosyne understands the difference between a novel, a technical manual, album artwork, and a journal entry — and adjusts its chunking, embedding, search, and LLM prompting accordingly.

Knowledge Graph

  • Neo4j stores relationships between content, not just vectors
  • Author → Book → Character → Theme traversals
  • Artist → Album → Track → Genre connections
  • No vector dimension limits (full 4096d Qwen3-VL)
  • Graph + vector + full-text search in one database

Multimodal AI

  • Qwen3-VL-Embedding: text + images + video in one vector space
  • Qwen3-VL-Reranker: cross-attention scoring across modalities
  • Album art, diagrams, screenshots become searchable
  • Local GPU inference (5090 + 3090) — zero API costs
  • llama.cpp text fallback via existing Ansible/systemd infra

Content-Type Awareness

  • Library types define chunking, embedding, and prompt behavior
  • Fiction: narrative-aware chunking, character extraction
  • Technical: section-aware, code block preservation
  • Music: lyrics as primary, metadata-heavy (genre, mood)
  • Each type injects context into the LLM prompt

Key Differentiators

  • Content-type-aware pipeline — chunking, embedding instructions, re-ranking instructions, and LLM context all adapt per library type
  • Neo4j knowledge graph — traversable relationships, not just flat vector similarity
  • Full multimodal — Qwen3-VL processes images, diagrams, album art alongside text in a unified vector space
  • No dimension limits — Neo4j handles 4096d vectors natively (pgvector caps at 2000)
  • MCP-first interface — designed for LLM integration from day one
  • Proven RAG architecture — two-stage responder/reviewer pattern inherited from Spelunker
  • Local GPU inference — zero ongoing API costs via vLLM + llama.cpp on RTX 5090/3090

Heritage

Mnemosyne's RAG pipeline architecture is inspired by Spelunker, an enterprise RFP response platform built on Django, PostgreSQL/pgvector, and LangChain. The proven patterns — hybrid search, two-stage RAG, citation-based retrieval, async document processing, and SME-approved knowledge bases — are carried forward and enhanced with multimodal capabilities and knowledge graph relationships. In turn, patterns that Mnemosyne proves out will be backported to Spelunker.

System Architecture

High-Level Architecture

graph TB
  subgraph Clients["Client Layer"]
    MCP["MCP Clients (Claude, Copilot, etc.)"]
    UI["Django Web UI"]
    API["REST API (DRF)"]
  end
  subgraph App["Application Layer — Django"]
    Core["core/ — Users, Auth"]
    Library["library/ — Libraries, Collections, Items"]
    Engine["engine/ — Embedding, Search, Reranker, RAG"]
    MCPServer["mcp_server/ — MCP Tool Interface"]
    Importers["importers/ — File, Calibre, Web"]
  end
  subgraph Data["Data Layer"]
    Neo4j["Neo4j 5.x — Knowledge Graph + Vectors"]
    PG["PostgreSQL — Auth, Config, Analytics"]
    S3["S3/MinIO — Content + Chunks"]
    RMQ["RabbitMQ — Task Queue"]
  end
  subgraph GPU["GPU Services"]
    vLLM_E["vLLM — Qwen3-VL-Embedding-8B (Multimodal Embed)"]
    vLLM_R["vLLM — Qwen3-VL-Reranker-8B (Multimodal Rerank)"]
    LCPP["llama.cpp — Qwen3-Reranker-0.6B (Text Fallback)"]
    LCPP_C["llama.cpp — Qwen3 Chat (RAG Responder)"]
  end
  MCP --> MCPServer
  UI --> Core
  API --> Library
  API --> Engine
  MCPServer --> Engine
  MCPServer --> Library
  Library --> Neo4j
  Engine --> Neo4j
  Engine --> S3
  Core --> PG
  Engine --> vLLM_E
  Engine --> vLLM_R
  Engine --> LCPP
  Engine --> LCPP_C
  Library --> RMQ

Django Apps

  • core/ — Users, authentication, profiles, permissions
  • library/ — Libraries, Collections, Items, Chunks, Concepts (Neo4j models)
  • engine/ — Embedding, search, reranker, RAG pipeline services
  • mcp_server/ — MCP tool definitions and server interface
  • importers/ — Content acquisition (file upload, Calibre, web scrape)
  • llm_manager/ — LLM API/model config, usage tracking (from Spelunker)

Technology Stack

  • Django 5.x, Python ≥3.12, Django REST Framework
  • Neo4j 5.x + django-neomodel — knowledge graph + vector index
  • PostgreSQL — Django auth, config, analytics only
  • S3/MinIO — all content and chunk storage
  • Celery + RabbitMQ — async embedding and graph construction
  • vLLM ≥0.14 — Qwen3-VL multimodal serving
  • llama.cpp — text model serving (existing Ansible infra)
  • MCP SDK — Model Context Protocol server

Project Structure

mnemosyne/
├── mnemosyne/          # Django settings, URLs, WSGI/ASGI
├── core/               # Users, auth, profiles
├── library/            # Neo4j models (Library, Collection, Item, Chunk, Concept)
├── engine/             # RAG pipeline services
│   ├── embeddings.py   # Qwen3-VL embedding client
│   ├── reranker.py     # Qwen3-VL reranker client
│   ├── search.py       # Hybrid search (vector + graph + full-text)
│   ├── pipeline.py     # Two-stage RAG (responder + reviewer)
│   ├── llm_client.py   # OpenAI-compatible LLM client
│   └── content_types.py # Library type definitions
├── mcp_server/         # MCP tool definitions
├── importers/          # Content import tools
├── llm_manager/        # LLM API/model config (ported from Spelunker)
├── static/
├── templates/
├── docker-compose.yml
├── pyproject.toml
└── manage.py

Data Model — Neo4j Knowledge Graph

Dual Database Strategy

Neo4j stores all content knowledge: libraries, collections, items, chunks, concepts, and their relationships + vector embeddings. PostgreSQL stores only Django operational data: users, auth, LLM configurations, analytics, and Celery results. Content never lives in PostgreSQL.
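
A minimal sketch of how this split might look in Django settings. The database names, hosts, and credentials are illustrative; django-neomodel is configured through a bolt URL setting rather than through `DATABASES`:

```python
# Sketch of the dual-database wiring (values illustrative).
# PostgreSQL is the only relational database Django knows about;
# Neo4j is reached via django-neomodel's bolt URL, not via DATABASES.

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mnemosyne",
        "HOST": "postgres",
        "PORT": 5432,
    }
}

# django-neomodel reads this setting to configure the neomodel driver.
NEOMODEL_NEO4J_BOLT_URL = "bolt://neo4j:password@neo4j:7687"
```

With this split, neomodel classes (Library, Item, Chunk, …) persist to Neo4j while ordinary Django ORM models (users, LLM configs) persist to PostgreSQL.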

Graph Schema

graph LR
  L["Library (fiction, technical, music, art, journal)"]
  Col["Collection (genre, author, artist, project)"]
  I["Item (book, manual, album, film, entry)"]
  Ch["Chunk (text + optional image + 4096d vector)"]
  Con["Concept (person, topic, technique, theme)"]
  Img["Image (cover, diagram, artwork, still)"]
  ImgE["ImageEmbedding (4096d multimodal vector)"]
  L -->|CONTAINS| Col
  Col -->|CONTAINS| I
  I -->|HAS_CHUNK| Ch
  I -->|REFERENCES| Con
  I -->|RELATED_TO| I
  Con -->|RELATED_TO| Con
  Ch -->|MENTIONS| Con
  I -->|HAS_IMAGE| Img
  Img -->|HAS_EMBEDDING| ImgE

Core Nodes

Node            Key Properties                                                                  Vector?
Library         name, library_type, chunking_config, embedding_instruction, llm_context_prompt  No
Collection      name, description, metadata                                                     No
Item            title, item_type, s3_key, content_hash, metadata, created_at                    No
Chunk           chunk_index, chunk_s3_key, chunk_size, embedding (4096d)                        Yes
Concept         name, concept_type, embedding (4096d)                                           Yes
Image           s3_key, image_type, description, metadata                                       No
ImageEmbedding  embedding (4096d multimodal)                                                    Yes

Relationships

Relationship   From → To               Properties
CONTAINS       Library → Collection    —
CONTAINS       Collection → Item       position
HAS_CHUNK      Item → Chunk            —
HAS_IMAGE      Item → Image            image_role
HAS_EMBEDDING  Image → ImageEmbedding  —
REFERENCES     Item → Concept          relevance
MENTIONS       Chunk → Concept         —
RELATED_TO     Item → Item             relationship_type, weight
RELATED_TO     Concept → Concept       relationship_type

Neo4j Vector Indexes

// Chunk text+image embeddings (4096 dimensions, no pgvector limits!)
CREATE VECTOR INDEX chunk_embedding FOR (c:Chunk)
ON (c.embedding) OPTIONS {indexConfig: {
  `vector.dimensions`: 4096,
  `vector.similarity_function`: 'cosine'
}}

// Concept embeddings for semantic concept search
CREATE VECTOR INDEX concept_embedding FOR (con:Concept)
ON (con.embedding) OPTIONS {indexConfig: {
  `vector.dimensions`: 4096,
  `vector.similarity_function`: 'cosine'
}}

// Image multimodal embeddings
CREATE VECTOR INDEX image_embedding FOR (ie:ImageEmbedding)
ON (ie.embedding) OPTIONS {indexConfig: {
  `vector.dimensions`: 4096,
  `vector.similarity_function`: 'cosine'
}}

// Full-text index for keyword/BM25-style search
CREATE FULLTEXT INDEX chunk_fulltext FOR (c:Chunk) ON EACH [c.text_preview]

Content Type System

The Core Innovation

Each Library has a library_type that defines how content is chunked, what embedding instructions are sent to Qwen3-VL, what re-ranking instructions are used, and what context prompt is injected when the LLM generates answers. This is configured per library in the database — not hardcoded.
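The per-type configuration can be sketched as a small registry. The dataclass and field names here are illustrative, not Mnemosyne's actual code; the instruction strings are taken from the Fiction and Technical profiles below:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LibraryTypeConfig:
    chunking: str                 # chunking strategy identifier
    embedding_instruction: str    # sent to Qwen3-VL-Embedding
    reranker_instruction: str     # sent to Qwen3-VL-Reranker
    llm_context_prompt: str       # injected before LLM generation

LIBRARY_TYPES = {
    "fiction": LibraryTypeConfig(
        chunking="chapter_aware",
        embedding_instruction=(
            "Represent the narrative passage for literary retrieval, "
            "capturing themes, characters, and plot elements"
        ),
        reranker_instruction=(
            "Score relevance of this fiction excerpt to the query, "
            "considering narrative themes and character arcs"
        ),
        llm_context_prompt=(
            "The following excerpts are from fiction. Interpret as narrative — "
            "consider themes, symbolism, character development."
        ),
    ),
    "technical": LibraryTypeConfig(
        chunking="section_aware",
        embedding_instruction=(
            "Represent the technical documentation for precise procedural retrieval"
        ),
        reranker_instruction=(
            "Score relevance of this technical documentation to the query, "
            "prioritizing procedural accuracy"
        ),
        llm_context_prompt=(
            "The following excerpts are from technical documentation. "
            "Provide precise, actionable instructions."
        ),
    ),
}

def config_for(library_type: str) -> LibraryTypeConfig:
    """Look up the pipeline configuration for a library type."""
    return LIBRARY_TYPES[library_type]
```

Because the config travels with the Library node, adding a new content type is a data change, not a code change.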

Fiction

Chunking: Chapter-aware, preserve dialogue blocks, narrative flow

Embedding Instruction: "Represent the narrative passage for literary retrieval, capturing themes, characters, and plot elements"

Reranker Instruction: "Score relevance of this fiction excerpt to the query, considering narrative themes and character arcs"

LLM Context: "The following excerpts are from fiction. Interpret as narrative — consider themes, symbolism, character development."

Multimodal: Cover art, illustrations

Graph: Author → Book → Character → Theme

Technical

Chunking: Section/heading-aware, preserve code blocks and tables as atomic units

Embedding Instruction: "Represent the technical documentation for precise procedural retrieval"

Reranker Instruction: "Score relevance of this technical documentation to the query, prioritizing procedural accuracy"

LLM Context: "The following excerpts are from technical documentation. Provide precise, actionable instructions."

Multimodal: Diagrams, screenshots, wiring diagrams

Graph: Product → Manual → Section → Procedure → Tool

Music

Chunking: Song-level (lyrics as one chunk), verse/chorus segmentation

Embedding Instruction: "Represent the song lyrics and album context for music discovery and thematic analysis"

Reranker Instruction: "Score relevance considering lyrical themes, musical context, and artist style"

LLM Context: "The following excerpts are song lyrics and music metadata. Interpret in musical and cultural context."

Multimodal: Album artwork, liner note images

Graph: Artist → Album → Track → Genre; Track → SAMPLES → Track

Film

Chunking: Scene-level for scripts, paragraph-level for synopses

Embedding Instruction: "Represent the film content for cinematic retrieval, capturing visual and narrative elements"

Multimodal: Movie stills, posters, screenshots

Graph: Director → Film → Scene → Actor; Film → BASED_ON → Book

Art

Chunking: Description-level, catalog entry as unit

Embedding Instruction: "Represent the artwork and its description for visual and stylistic retrieval"

Multimodal: The artwork itself — primary content is visual

Graph: Artist → Piece → Style → Movement; Piece → INSPIRED_BY → Piece

Journals

Chunking: Entry-level (one entry = one chunk), paragraph split for long entries

Embedding Instruction: "Represent the personal journal entry for temporal and reflective retrieval"

Multimodal: Photos, sketches attached to entries

Graph: Date → Entry → Topic; Entry → MENTIONS → Person/Place
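
The chunking strategies above can be sketched as a dispatch table keyed by library type. The chunker implementations here are naive stand-ins for illustration, not Mnemosyne's actual chunkers:

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Greedy paragraph packer: split on blank lines, pack up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def chunk_journal_entries(text: str) -> list[str]:
    """Journal libraries: one entry per chunk (entries assumed '---'-separated)."""
    return [entry.strip() for entry in text.split("---") if entry.strip()]

CHUNKERS = {
    "journal": chunk_journal_entries,
    "fiction": chunk_paragraphs,    # stand-in for chapter-aware chunking
    "technical": chunk_paragraphs,  # stand-in for section-aware chunking
}

def chunk_for_library(library_type: str, text: str) -> list[str]:
    """Pick the chunker for a library type, falling back to paragraph packing."""
    return CHUNKERS.get(library_type, chunk_paragraphs)(text)
```

The real chunkers would additionally respect chapter boundaries, code blocks, and verse/chorus structure as described per type above.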

Multimodal Embedding & Re-ranking Pipeline

Two-Stage Multimodal Pipeline

Stage 1 — Embedding (Qwen3-VL-Embedding-8B): Generates 4096-dimensional vectors from text, images, screenshots, and video in a unified semantic space. Accepts content-type-specific instructions for optimized representations.

Stage 2 — Re-ranking (Qwen3-VL-Reranker-8B): Takes (query, document) pairs — where both can be multimodal — and outputs precise relevance scores via cross-attention. Dramatically sharpens retrieval accuracy.

Embedding & Ingestion Flow

flowchart TD
  A["New Content (file upload, import)"] --> B{"Content Type?"}
  B -->|"Text (PDF, DOCX, MD)"| C["Parse Text + Extract Images"]
  B -->|"Image (art, photo)"| D["Image Only"]
  B -->|"Mixed (manual + diagrams)"| E["Parse Text + Keep Page Images"]
  C --> F["Chunk Text (content-type-aware)"]
  D --> G["Image to S3"]
  E --> F
  E --> G
  F --> H["Store Chunks in S3"]
  H --> I["Qwen3-VL-Embedding (text + instruction)"]
  G --> J["Qwen3-VL-Embedding (image + instruction)"]
  I --> K["4096d Vector"]
  J --> K
  K --> L["Store in Neo4j Chunk/ImageEmbedding Node"]
  L --> M["Extract Concepts (LLM entity extraction)"]
  M --> N["Create Concept Nodes + REFERENCES/MENTIONS edges"]

Qwen3-VL-Embedding-8B

  • Dimensions: 4096 (full), or MRL truncation to 3072/2048/1536/1024
  • Input: Text, images, screenshots, video, or any mix
  • Instruction-aware: Content-type instruction improves quality 1–5%
  • Quantization: Int8 (~8GB VRAM), Int4 (~4GB VRAM)
  • Serving: vLLM with --runner pooling
  • Languages: 30+ languages supported
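
A sketch of an embedding client against the vLLM OpenAI-compatible endpoint, plus MRL truncation (keep a prefix of the vector, then L2-renormalize). The URL path, model name, and "Instruct:/Query:" prompt format are assumptions, not confirmed API details:

```python
import json
import math
import urllib.request

EMBED_URL = "http://localhost:8002/v1/embeddings"  # port from the GPU services section

def embed(text: str, instruction: str) -> list[float]:
    """Call the vLLM OpenAI-compatible embeddings endpoint (shapes assumed)."""
    payload = json.dumps({
        "model": "Qwen3-VL-Embedding-8B",
        "input": f"Instruct: {instruction}\nQuery: {text}",  # format assumed
    }).encode()
    req = urllib.request.Request(
        EMBED_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]

def mrl_truncate(vec: list[float], dims: int) -> list[float]:
    """MRL-style truncation: keep the first `dims` components, renormalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

`mrl_truncate` is what would let a backport target like pgvector consume a 1536d prefix of the same 4096d embedding.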

Qwen3-VL-Reranker-8B

  • Architecture: Single-tower cross-attention (deep query↔document interaction)
  • Input: (query, document) pairs — both can be multimodal
  • Output: Relevance score (sigmoid of yes/no token probabilities)
  • Instruction-aware: Custom re-ranking instructions per content type
  • Serving: vLLM with --runner pooling + score endpoint
  • Fallback: Qwen3-Reranker-0.6B via llama.cpp (text-only)
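
A sketch of a reranker client. The scoring endpoint's request and response shapes are assumptions modeled on vLLM's score API, and the sigmoid helper mirrors the yes/no scoring described above:

```python
import json
import math
import urllib.request

RERANK_URL = "http://localhost:8001/score"  # port from the GPU services section; path assumed

def rerank(query: str, documents: list[str], instruction: str) -> list[tuple[int, float]]:
    """Score (query, document) pairs via the serving endpoint (shapes assumed);
    returns (document index, score) pairs sorted by descending relevance."""
    payload = json.dumps({
        "model": "Qwen3-VL-Reranker-8B",
        "text_1": f"{instruction}\n{query}",
        "text_2": documents,
    }).encode()
    req = urllib.request.Request(
        RERANK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        scores = [d["score"] for d in json.load(resp)["data"]]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

def yes_no_to_score(yes_logit: float, no_logit: float) -> float:
    """Relevance score as the sigmoid of the yes/no logit margin."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))
```

Swapping `RERANK_URL` to the llama.cpp fallback service is the intended failover path for text-only workloads.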

Why Multimodal Matters

Traditional RAG systems OCR images and diagrams, producing garbled text. Multimodal embedding understands the visual content directly:

  • Technical diagrams: Wiring diagrams, network topologies, architecture diagrams — searchable by visual content, not OCR garbage
  • Album artwork: "psychedelic album covers from the 70s" finds matching art via visual similarity
  • Art: The actual painting/sculpture becomes the searchable content, not just its text description
  • PDF pages: Image-only PDF pages with charts and tables are embedded as images, not skipped

Search Pipeline — GraphRAG + Vector + Re-rank

Search Flow

flowchart TD
  Q["User Query"] --> E["Embed Query (Qwen3-VL-Embedding)"]
  E --> VS["1. Vector Search (Neo4j vector index) — Top-K × 3 oversample"]
  E --> GT["2. Graph Traversal (Cypher queries) — Concept + relationship walks"]
  Q --> FT["3. Full-Text Search (Neo4j fulltext index) — Keyword matching"]
  VS --> F["Candidate Fusion + Deduplication"]
  GT --> F
  FT --> F
  F --> RR["4. Re-Rank (Qwen3-VL-Reranker) — Cross-attention scoring"]
  RR --> TK["Top-K Results"]
  TK --> CTX["Inject Content-Type Context Prompt"]
  CTX --> LLM["5. LLM Responder (Two-stage RAG)"]
  LLM --> REV["6. LLM Reviewer (Quality + citation check)"]
  REV --> ANS["Final Answer with Citations"]

1. Vector Search

Cosine similarity via Neo4j vector index on Chunk and ImageEmbedding nodes.

CALL db.index.vector.queryNodes(
  'chunk_embedding', 30,
  $query_vector
) YIELD node, score
WHERE score > $threshold
RETURN node, score

2. Graph Traversal

Walk relationships to find contextually related content that vector search alone would miss.

MATCH (c:Chunk)-[:HAS_CHUNK]-(i:Item)
  -[:REFERENCES]->(con:Concept)
  -[:RELATED_TO]-(con2:Concept)
  <-[:REFERENCES]-(i2:Item)
  -[:HAS_CHUNK]->(c2:Chunk)
RETURN c2, i2

3. Full-Text Search

Neo4j native full-text index for keyword matching (BM25-equivalent).

CALL db.index.fulltext.queryNodes(
  'chunk_fulltext',
  $query_text
) YIELD node, score
RETURN node, score
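
The three candidate lists are then fused and deduplicated before re-ranking. Reciprocal rank fusion (RRF) is one common way to do this, sketched here over ranked lists of chunk IDs (the constant k = 60 is the conventional default, not a Mnemosyne-specific choice):

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked candidate lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per candidate; duplicates
    accumulate score, so items found by multiple searches rise.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused, deduplicated list is what gets handed to the Qwen3-VL-Reranker for precise cross-attention scoring.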

MCP Server Interface

MCP-First Design

Mnemosyne exposes its capabilities as MCP tools, making the entire knowledge base accessible to Claude, Copilot, and any MCP-compatible LLM client. The MCP server is a primary interface, not an afterthought.

Search & Retrieval Tools

  • search_library — Semantic + graph + full-text search with re-ranking. Filters by library, collection, content type.
  • ask_about — Full RAG pipeline: search, re-rank, content-type context injection, LLM response with citations.
  • find_similar — Find items similar to a given item using vector similarity. Optionally search across libraries.
  • search_by_image — Multimodal search: find content matching an uploaded image.
  • explore_connections — Traverse the knowledge graph from an item to find related concepts, authors, themes.

Management & Navigation Tools

  • browse_libraries — List all libraries with their content types and item counts.
  • browse_collections — List collections within a library.
  • get_item — Get detailed info about a specific item, including metadata and graph connections.
  • add_content — Add new content to a library; triggers async embedding + graph construction.
  • get_concepts — List extracted concepts for an item or across a library.

GPU Services

RTX 5090 (32GB VRAM)

  • Model: Qwen3-VL-Reranker-8B
  • VRAM (bf16): ~18GB
  • Serving: vLLM --runner pooling
  • Port: 8001
  • Role: Multimodal re-ranking
  • Headroom: ~14GB for chat model

RTX 3090 (24GB VRAM)

  • Model: Qwen3-VL-Embedding-8B
  • VRAM (bf16): ~18GB
  • Serving: vLLM --runner pooling
  • Port: 8002
  • Role: Multimodal embedding
  • Headroom: ~6GB

Fallback: llama.cpp (Existing Ansible Infra)

Text-only Qwen3-Reranker-0.6B GGUF served via llama-server on existing systemd/Ansible infrastructure. Managed by the same playbooks, monitored by the same Grafana dashboards. Used when vLLM services are down or for text-only workloads.

Deployment

Core Services

  • web: Django app (Gunicorn)
  • postgres: PostgreSQL (auth/config only)
  • neo4j: Neo4j 5.x (knowledge graph + vectors)
  • rabbitmq: Celery broker

Async Processing

  • celery-worker: Embedding, graph construction
  • celery-beat: Scheduled re-sync tasks

Storage & Proxy

  • minio: S3-compatible content storage
  • nginx: Static/proxy
  • mcp-server: MCP interface process

Shared Infrastructure with Spelunker

Mnemosyne and Spelunker share: GPU model services (llama.cpp + vLLM), MinIO/S3 (separate buckets), Neo4j (separate databases), RabbitMQ (separate vhosts), and Grafana monitoring. Each is its own Docker Compose stack but points to shared infra.

Backport Strategy to Spelunker

Build Forward, Backport Back

Mnemosyne proves the architecture with no legacy constraints. Once validated, proven components flow back to Spelunker to enhance its RFP workflow with multimodal understanding and re-ranking precision.

  • RerankerService — prove: Qwen3-VL multimodal + llama.cpp text; backport: drop into rag/services/reranker.py
  • Multimodal Embedding — prove: Qwen3-VL-Embedding via vLLM; backport: add alongside OpenAI embeddings, MRL@1536d for pgvector compat
  • Diagram Understanding — prove: image pages embedded multimodally; backport: PDF diagrams in RFP docs become searchable
  • MCP Server — prove: primary interface from day one; backport: add as secondary interface to Spelunker
  • Neo4j (optional) — prove: primary vector + graph store; backport: could replace pgvector, or run alongside
  • Content-Type Config — prove: library type definitions; backport: adapt as document classification in Spelunker

Documentation Complete

This document describes the target architecture for Mnemosyne. Phase implementation documents provide detailed build plans.