# Phase 2: Embedding Pipeline
## Objective
Build the complete document ingestion and embedding pipeline: upload content → parse (text + images) → chunk (content-type-aware) → embed via configurable model → store vectors in Neo4j → extract concepts for knowledge graph.
## Heritage
The embedding pipeline adapts proven patterns from [Spelunker](https://git.helu.ca/r/spelunker)'s `rag/services/embeddings.py` — semantic chunking, batch embedding, S3 chunk storage, and progress tracking — enhanced with multimodal capabilities, knowledge graph relationships, and content-type awareness.
## Architecture Overview
```
Upload (API/Admin)
→ S3 Storage (original file)
→ Document Parsing (PyMuPDF — text + images)
→ Content-Type-Aware Chunking (semantic-text-splitter)
→ Text Embedding (system embedding model via LLM Manager)
→ Image Embedding (multimodal model, if available)
→ Neo4j Graph Storage (Chunk nodes, Image nodes, vectors)
→ Concept Extraction (system chat model)
→ Knowledge Graph (Concept nodes, MENTIONS/REFERENCES edges)
```
## Deliverables
### 1. Document Parsing Service (`library/services/parsers.py`)
**Primary parser: PyMuPDF** — a single library handling all document formats with unified text + image extraction.
#### Supported Formats
| Format | Extensions | Text Extraction | Image Extraction |
|--------|-----------|----------------|-----------------|
| PDF | `.pdf` | Layout-preserving text | Embedded images, diagrams |
| EPUB | `.epub` | Chapter-structured HTML | Cover art, illustrations |
| DOCX | `.docx` | Via HTML conversion | Inline images, diagrams |
| PPTX | `.pptx` | Via HTML conversion | Slide images, charts |
| XLSX | `.xlsx` | Via HTML conversion | Embedded charts |
| XPS | `.xps` | Native | Native |
| MOBI | `.mobi` | Native | Native |
| FB2 | `.fb2` | Native | Native |
| CBZ | `.cbz` | Native | Native (comic pages) |
| Plain text | `.txt`, `.md` | Direct read | N/A |
| HTML | `.html`, `.htm` | PyMuPDF or direct | Inline images |
| Images | `.jpg`, `.png`, etc. | N/A (OCR future) | The image itself |
#### Text Sanitization
Ported from Spelunker's `text_utils.py`:
- Remove null bytes and control characters
- Remove zero-width characters
- Normalize Unicode to NFC
- Replace invalid UTF-8 sequences
- Clean PDF ligatures and artifacts
- Normalize whitespace
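The steps above can be sketched as a single pass over the text. This is an illustrative reconstruction, not Spelunker's actual `text_utils.py`; the function name and regexes are assumptions:

```python
import re
import unicodedata

_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")      # null bytes + control chars
_ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")          # zero-width chars
_LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",    # common PDF ligatures
              "\ufb03": "ffi", "\ufb04": "ffl"}

def sanitize_text(text: str) -> str:
    """Apply the cleanup steps in order: strip control and zero-width
    characters, expand PDF ligatures, normalize Unicode and whitespace."""
    text = _CONTROL.sub("", text)
    text = _ZERO_WIDTH.sub("", text)
    for ligature, replacement in _LIGATURES.items():
        text = text.replace(ligature, replacement)
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()
```

Note that NFC alone does not expand ligatures (that would require NFKC), which is why the replacement table is explicit.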
#### Image Extraction
For each document page/section, extract embedded images via `page.get_images()` and `doc.extract_image(xref)`:
- Raw image bytes (PNG/JPEG)
- Dimensions (width × height)
- Source page/position for chunk-image association
- Store in S3: `images/{item_uid}/{image_index}.{ext}`
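A minimal sketch of the extraction loop, assuming PyMuPDF's `page.get_images(full=True)` / `doc.extract_image(xref)` API; `extract_page_images` and `image_s3_key` are illustrative helper names, not existing functions:

```python
def image_s3_key(item_uid: str, image_index: int, ext: str) -> str:
    """Build the S3 object key for an extracted image, per the layout above."""
    return f"images/{item_uid}/{image_index}.{ext}"

def extract_page_images(doc, page) -> list[dict]:
    """Collect raw bytes and dimensions for every image embedded on a page."""
    images = []
    for index, info in enumerate(page.get_images(full=True)):
        xref = info[0]                 # first tuple element is the image xref
        img = doc.extract_image(xref)  # dict with "image" (bytes), "ext", "width", "height"
        images.append({
            "data": img["image"],
            "ext": img["ext"],
            "width": img["width"],
            "height": img["height"],
            "source_page": page.number,
            "source_index": index,
        })
    return images
```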
#### Parse Result Structure
```python
@dataclass
class TextBlock:
text: str
page: int
metadata: dict # {heading_level, section_name, etc.}
@dataclass
class ExtractedImage:
data: bytes
ext: str # png, jpg, etc.
width: int
height: int
source_page: int
source_index: int
@dataclass
class ParseResult:
text_blocks: list[TextBlock]
images: list[ExtractedImage]
metadata: dict # {page_count, title, author, etc.}
file_type: str
```
### 2. Content-Type-Aware Chunking Service (`library/services/chunker.py`)
Uses `semantic-text-splitter` with a HuggingFace tokenizer (a pairing proven in Spelunker).
#### Strategy Dispatch
Based on `Library.chunking_config`:
| Strategy | Library Type | Boundary Markers | Chunk Size | Overlap |
|----------|-------------|-----------------|-----------|---------|
| `chapter_aware` | Fiction | chapter, scene, paragraph | 1024 | 128 |
| `section_aware` | Technical | section, subsection, code_block, list | 512 | 64 |
| `song_level` | Music | song, verse, chorus | 512 | 32 |
| `scene_level` | Film | scene, act, sequence | 768 | 64 |
| `description_level` | Art | artwork, description, analysis | 512 | 32 |
| `entry_level` | Journal | entry, date, paragraph | 512 | 32 |
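The dispatch table above can be expressed as plain data; this is an illustrative sketch (names like `ChunkStrategy` and `resolve_strategy` are assumptions), with the real chunker handing `chunk_size`/`overlap` to a `semantic-text-splitter` instance:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkStrategy:
    boundary_markers: tuple[str, ...]
    chunk_size: int
    overlap: int

STRATEGIES = {
    "chapter_aware": ChunkStrategy(("chapter", "scene", "paragraph"), 1024, 128),
    "section_aware": ChunkStrategy(("section", "subsection", "code_block", "list"), 512, 64),
    "song_level": ChunkStrategy(("song", "verse", "chorus"), 512, 32),
    "scene_level": ChunkStrategy(("scene", "act", "sequence"), 768, 64),
    "description_level": ChunkStrategy(("artwork", "description", "analysis"), 512, 32),
    "entry_level": ChunkStrategy(("entry", "date", "paragraph"), 512, 32),
}

def resolve_strategy(chunking_config: dict) -> ChunkStrategy:
    """Look up the strategy from Library.chunking_config; fall back to section_aware."""
    return STRATEGIES.get(chunking_config.get("strategy"), STRATEGIES["section_aware"])
```

The `section_aware` fallback is an assumption; the real dispatcher may prefer to raise on an unknown strategy instead.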
#### Chunk-Image Association
Track which images appeared near which text chunks:
- PDF: image bounding boxes on specific pages
- DOCX/PPTX: images associated with slides/sections
- EPUB: images referenced from specific chapters
Creates `Chunk -[HAS_NEARBY_IMAGE]-> Image` relationships with proximity metadata.
#### Chunk Storage
- Chunk text stored in S3: `chunks/{item_uid}/chunk_{index}.txt`
- `text_preview` (first 500 chars) stored on Chunk node for full-text indexing
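The storage convention above, as two trivial helpers (names are illustrative, not an existing API):

```python
def chunk_s3_key(item_uid: str, index: int) -> str:
    """S3 object key holding a chunk's full text."""
    return f"chunks/{item_uid}/chunk_{index}.txt"

def text_preview(chunk_text: str, limit: int = 500) -> str:
    """First `limit` characters, stored on the Chunk node for full-text indexing."""
    return chunk_text[:limit]
```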
### 3. Embedding Client (`library/services/embedding_client.py`)
Multi-backend embedding client dispatching by `LLMApi.api_type`.
#### Backend Support
| API Type | Protocol | Auth | Batch Support |
|----------|---------|------|---------------|
| `openai` | HTTP POST `/embeddings` | API key header | Native batch |
| `vllm` | HTTP POST `/embeddings` | API key header | Native batch |
| `llama-cpp` | HTTP POST `/embeddings` | API key header | Native batch |
| `ollama` | HTTP POST `/embeddings` | None | Native batch |
| `bedrock` | HTTP POST `/model/{id}/invoke` | Bearer token | Client-side loop |
#### Bedrock Integration
Uses Amazon Bedrock API keys (Bearer token auth) — no boto3 SDK required:
```
POST https://bedrock-runtime.{region}.amazonaws.com/model/{model_id}/invoke
Authorization: Bearer {bedrock_api_key}
Content-Type: application/json
{"inputText": "text to embed", "dimensions": 1024, "normalize": true}
→ {"embedding": [float, ...], "inputTextTokenCount": 42}
```
**LLMApi setup for Bedrock embeddings:**
- `api_type`: `"bedrock"`
- `base_url`: `https://bedrock-runtime.us-east-1.amazonaws.com`
- `api_key`: Bedrock API key (encrypted)
**LLMApi setup for Bedrock chat (Claude, etc.):**
- `api_type`: `"openai"` (Mantle endpoint is OpenAI-compatible)
- `base_url`: `https://bedrock-mantle.us-east-1.api.aws/v1`
- `api_key`: Same Bedrock API key
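The invoke call above can be assembled as plain data and sent with any HTTP client; `build_bedrock_embed_request` is an illustrative helper, and the Titan-style request body (`inputText`, `dimensions`, `normalize`) follows the example shown earlier:

```python
def build_bedrock_embed_request(base_url: str, model_id: str, api_key: str,
                                text: str, dimensions: int = 1024) -> tuple[str, dict, dict]:
    """Return (url, headers, json_payload) for a Bedrock embedding invoke."""
    url = f"{base_url}/model/{model_id}/invoke"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"inputText": text, "dimensions": dimensions, "normalize": True}
    return url, headers, payload
```

Because Bedrock has no native batch endpoint (see the backend table), the client would call this once per chunk in a client-side loop.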
#### Embedding Instruction Prefix
Before embedding, prepend the library's `embedding_instruction` to each chunk:
```
"{embedding_instruction}\n\n{chunk_text}"
```
#### Image Embedding
For multimodal models (`model.supports_multimodal`):
- Send base64-encoded image to the embedding endpoint
- Create `ImageEmbedding` node with the resulting vector
- If no multimodal model is available, skip (images are stored but not embedded)
#### Model Matching
Track embedded model by **name** (not UUID). Multiple APIs can serve the same model — matching by name allows provider switching without re-embedding.
### 4. Pipeline Orchestrator (`library/services/pipeline.py`)
Coordinates the full flow: parse → chunk → embed → store → graph.
#### Pipeline Stages
1. **Parse**: Extract text blocks + images from document
2. **Chunk**: Split text using content-type-aware strategy
3. **Store chunks**: S3 + Chunk nodes in Neo4j
4. **Embed text**: Generate vectors for all chunks
5. **Store images**: S3 + Image nodes in Neo4j
6. **Embed images**: Multimodal vectors (if available)
7. **Extract concepts**: Named entities from chunk text (via system chat model)
8. **Build graph**: Create Concept nodes, MENTIONS/REFERENCES edges
#### Idempotency
- Check `Item.content_hash` — skip if already processed with same hash
- Re-embedding deletes existing Chunk/Image nodes before re-processing
#### Dimension Compatibility
- Validate that the system embedding model's `vector_dimensions` matches the Neo4j vector index dimensions
- Warn at embed time if mismatch detected
### 5. Concept Extraction (`library/services/concepts.py`)
Uses the system chat model for LLM-based named entity recognition.
- Extract: people, places, topics, techniques, themes
- Create/update `Concept` nodes (deduplicated by name via unique_index)
- Connect: `Chunk -[MENTIONS]-> Concept`, `Item -[REFERENCES]-> Concept`
- Embed concept names for vector search
- If no system chat model is configured, concept extraction is skipped
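Deduplication by name implies normalizing extracted names before upserting `Concept` nodes, so variants like "Neural Networks" and "neural networks " collapse to one node under the unique index. A hypothetical normalizer (the function name and exact rules are assumptions):

```python
import re
import unicodedata

def normalize_concept_name(name: str) -> str:
    """Lowercase, NFC-normalize, and collapse whitespace for a stable dedup key."""
    name = unicodedata.normalize("NFC", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name.lower()
```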
### 6. Celery Tasks (`library/tasks.py`)
All tasks pass IDs (not model instances) per Red Panda Standards.
| Task | Queue | Purpose |
|------|-------|---------|
| `embed_item(item_uid)` | `embedding` | Full pipeline for single item |
| `embed_collection(collection_uid)` | `batch` | All items in a collection |
| `embed_library(library_uid)` | `batch` | All items in a library |
| `batch_embed_items(item_uids)` | `batch` | Specific items |
| `reembed_item(item_uid)` | `embedding` | Delete + re-embed |
Tasks are idempotent, include retry logic, and track progress via Memcached: `library:task:{task_id}:progress`.
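The Memcached progress convention can be captured in two small helpers; the key layout comes from the text above, while the helper names and percent rounding are illustrative:

```python
def progress_key(task_id: str) -> str:
    """Cache key under which a task's progress blob is stored."""
    return f"library:task:{task_id}:progress"

def progress_payload(done: int, total: int, message: str) -> dict:
    """JSON-serializable progress blob, e.g. {"percent": 44, "message": "..."}."""
    percent = int(done * 100 / total) if total else 0
    return {"percent": percent, "message": message}
```

In a task body this would pair with `cache.set(progress_key(self.request.id), progress_payload(...))` alongside Celery's native `update_state`.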
### 7. Prometheus Metrics (`library/metrics.py`)
Custom metrics for pipeline observability:
| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| `mnemosyne_documents_parsed_total` | Counter | file_type, status | Parse throughput |
| `mnemosyne_document_parse_duration_seconds` | Histogram | file_type | Parse latency |
| `mnemosyne_images_extracted_total` | Counter | file_type | Image extraction volume |
| `mnemosyne_chunks_created_total` | Counter | library_type, strategy | Chunk throughput |
| `mnemosyne_chunk_size_tokens` | Histogram | — | Chunk size distribution |
| `mnemosyne_embeddings_generated_total` | Counter | model_name, api_type, content_type | Embedding throughput |
| `mnemosyne_embedding_batch_duration_seconds` | Histogram | model_name, api_type | API latency |
| `mnemosyne_embedding_api_errors_total` | Counter | model_name, api_type, error_type | API failures |
| `mnemosyne_embedding_tokens_total` | Counter | model_name | Token consumption |
| `mnemosyne_pipeline_items_total` | Counter | status | Pipeline throughput |
| `mnemosyne_pipeline_item_duration_seconds` | Histogram | — | End-to-end latency |
| `mnemosyne_pipeline_items_in_progress` | Gauge | — | Concurrent processing |
| `mnemosyne_concepts_extracted_total` | Counter | concept_type | Concept extraction volume |
### 8. Model Changes
#### Item Node — New Fields
| Field | Type | Purpose |
|-------|------|---------|
| `embedding_status` | StringProperty | pending / processing / completed / failed |
| `embedding_model_name` | StringProperty | Name of model that generated embeddings |
| `chunk_count` | IntegerProperty | Number of chunks created |
| `image_count` | IntegerProperty | Number of images extracted |
| `error_message` | StringProperty | Last error message (if failed) |
#### New Relationship Model
```python
class NearbyImageRel(StructuredRel):
proximity = StringProperty(default="same_page") # same_page, inline, same_slide, same_chapter
```
#### Chunk Node — New Relationship
```python
nearby_images = RelationshipTo('Image', 'HAS_NEARBY_IMAGE', model=NearbyImageRel)
```
#### LLMApi Model — New API Type
Add `("bedrock", "Amazon Bedrock")` to `api_type` choices.
### 9. API Enhancements
- `POST /api/v1/library/items/` — File upload with auto-trigger of `embed_item` task
- `POST /api/v1/library/items/<uid>/reembed/` — Re-embed endpoint
- `GET /api/v1/library/items/<uid>/status/` — Embedding status check
- Admin views: File upload field on item create, embedding status display
### 10. Management Commands
| Command | Purpose |
|---------|---------|
| `embed_item <uid>` | CLI embedding for testing |
| `embed_collection <uid>` | CLI batch embedding |
| `embedding_status` | Show embedding progress/statistics |
### 11. Dynamic Vector Index Dimensions
Update `setup_neo4j_indexes` to read dimensions from `LLMModel.get_system_embedding_model().vector_dimensions` instead of hardcoding 4096.
## Celery Workers & Scheduler
### Prerequisites
- RabbitMQ running on `oberon.incus:5672` with `mnemosyne` vhost and user
- `.env` configured with `CELERY_BROKER_URL=amqp://mnemosyne:password@oberon.incus:5672/mnemosyne`
- Virtual environment activated: `source ~/env/mnemosyne/bin/activate`
### Queues
Mnemosyne uses three Celery queues with task routing configured in `settings.py`:
| Queue | Tasks | Purpose | Recommended Concurrency |
|-------|-------|---------|------------------------|
| `celery` (default) | `llm_manager.validate_all_llm_apis`, `llm_manager.validate_single_api` | LLM API validation & model discovery | 2 |
| `embedding` | `library.tasks.embed_item`, `library.tasks.reembed_item` | Single-item embedding pipeline (GPU-bound) | 1 |
| `batch` | `library.tasks.embed_collection`, `library.tasks.embed_library`, `library.tasks.batch_embed_items` | Batch orchestration (dispatches to embedding queue) | 2 |
Task routing (`settings.py`):
```python
CELERY_TASK_ROUTES = {
"library.tasks.embed_*": {"queue": "embedding"},
"library.tasks.batch_*": {"queue": "batch"},
}
```
### Starting Workers
All commands run from the Django project root (`mnemosyne/`):
**Development — single worker, all queues:**
```bash
cd mnemosyne
celery -A mnemosyne worker -l info -Q celery,embedding,batch
```
**Development — eager mode (no worker needed):**
Set `CELERY_TASK_ALWAYS_EAGER=True` in `.env`. All tasks execute synchronously in the web process. Useful for debugging but does not test async behavior.
**Production — separate workers per queue:**
```bash
# Embedding worker (single concurrency — GPU is sequential)
celery -A mnemosyne worker \
-l info \
-Q embedding \
-c 1 \
-n embedding@%h \
--max-tasks-per-child=100
# Batch orchestration worker
celery -A mnemosyne worker \
-l info \
-Q batch \
-c 2 \
-n batch@%h
# Default queue worker (LLM API validation, etc.)
celery -A mnemosyne worker \
-l info \
-Q celery \
-c 2 \
-n default@%h
```
### Celery Beat (Periodic Scheduler)
Celery Beat runs scheduled tasks (e.g., periodic LLM API validation):
```bash
# File-based scheduler (simple, stores schedule in celerybeat-schedule file)
celery -A mnemosyne beat -l info
# Or with Django database scheduler (if django-celery-beat is installed)
celery -A mnemosyne beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
```
Example periodic task schedule (add to `settings.py` if needed):
```python
from celery.schedules import crontab
CELERY_BEAT_SCHEDULE = {
"validate-llm-apis-daily": {
"task": "llm_manager.validate_all_llm_apis",
"schedule": crontab(hour=6, minute=0), # Daily at 6 AM
},
}
```
### Flower (Task Monitoring)
[Flower](https://flower.readthedocs.io/) provides a real-time web UI for monitoring Celery workers and tasks:
```bash
celery -A mnemosyne flower --port=5555
```
Access at `http://localhost:5555`. Shows:
- Active/completed/failed tasks
- Worker status and resource usage
- Task execution times and retry counts
- Queue depths
### Reliability Configuration
The following settings are already configured in `settings.py`:
| Setting | Value | Purpose |
|---------|-------|---------|
| `CELERY_TASK_ACKS_LATE` | `True` | Acknowledge tasks after execution (not on receipt) — prevents task loss on worker crash |
| `CELERY_WORKER_PREFETCH_MULTIPLIER` | `1` | Workers fetch one task at a time — ensures fair distribution across workers |
| `CELERY_ACCEPT_CONTENT` | `["json"]` | Only accept JSON-serialized tasks |
| `CELERY_TASK_SERIALIZER` | `"json"` | Serialize task arguments as JSON |
### Task Progress Tracking
Embedding tasks report progress via Memcached using the key pattern:
```
library:task:{task_id}:progress → {"percent": 45, "message": "Embedded 12/27 chunks"}
```
Tasks also update Celery's native state:
```python
# Query task progress from Python
from celery.result import AsyncResult
result = AsyncResult(task_id)
result.state # "PROGRESS", "SUCCESS", "FAILURE"
result.info # {"percent": 45, "message": "..."}
```
## Dependencies
```toml
# New additions to pyproject.toml
"PyMuPDF>=1.24,<2.0",
"pymupdf4llm>=0.0.17,<1.0",
"semantic-text-splitter>=0.20,<1.0",
"tokenizers>=0.20,<1.0",
"Pillow>=10.0,<12.0",
"django-prometheus>=2.3,<3.0",
```
### License Note
PyMuPDF is AGPL-3.0 licensed. Acceptable for self-hosted personal use. Commercial distribution would require Artifex's commercial license.
## File Structure
```
mnemosyne/library/
├── services/
│ ├── __init__.py
│ ├── parsers.py # PyMuPDF universal document parsing
│ ├── text_utils.py # Text sanitization (from Spelunker)
│ ├── chunker.py # Content-type-aware chunking
│ ├── embedding_client.py # Multi-backend embedding API client
│ ├── pipeline.py # Orchestration: parse → chunk → embed → graph
│ └── concepts.py # LLM-based concept extraction
├── metrics.py # Prometheus metrics definitions
├── tasks.py # Celery tasks for async embedding
├── management/commands/
│ ├── embed_item.py
│ ├── embed_collection.py
│ └── embedding_status.py
└── tests/
├── test_parsers.py
├── test_text_utils.py
├── test_chunker.py
├── test_embedding_client.py
├── test_pipeline.py
├── test_concepts.py
└── test_tasks.py
```
## Testing Strategy
All tests use Django `TestCase`. External services (LLM APIs, Neo4j) are mocked.
| Test File | Scope |
|-----------|-------|
| `test_parsers.py` | PyMuPDF parsing for each file type, image extraction, text sanitization |
| `test_text_utils.py` | Sanitization functions, PDF artifact cleaning, Unicode normalization |
| `test_chunker.py` | Content-type strategies, boundary detection, chunk-image association |
| `test_embedding_client.py` | OpenAI-compat + Bedrock backends (mocked HTTP), batch processing, usage tracking |
| `test_pipeline.py` | Full pipeline integration (mocked), S3 storage, idempotency |
| `test_concepts.py` | Concept extraction, deduplication, graph relationships |
| `test_tasks.py` | Celery tasks (eager mode), retry logic, error handling |
## Success Criteria
- [ ] Upload a document (PDF, EPUB, DOCX, PPTX, TXT) via API or admin → file stored in S3
- [ ] Images extracted from documents and stored as Image nodes in Neo4j
- [ ] Document automatically chunked using content-type-aware strategy
- [ ] Chunks embedded via system embedding model and vectors stored in Neo4j Chunk nodes
- [ ] Images embedded multimodally into ImageEmbedding nodes (when multimodal model available)
- [ ] Chunk-image proximity relationships established in graph
- [ ] Concepts extracted and graph populated with MENTIONS/REFERENCES relationships
- [ ] Neo4j vector indexes usable for similarity queries on stored embeddings
- [ ] Celery tasks handle async embedding with progress tracking
- [ ] Re-embedding works (delete old chunks, re-process)
- [ ] Content hash prevents redundant re-embedding
- [ ] Prometheus metrics exposed at `/metrics` for pipeline monitoring
- [ ] All tests pass with mocked LLM/embedding APIs
- [ ] Bedrock embedding works via Bearer token HTTP (no boto3)