feat: replace server-side RAG with MCP retrieval primitives

- Remove Phase 4 RAG pipeline in favor of retrieval-only architecture
- Add FastMCP server exposing search, get_chunk, list_libraries tools
- Mount MCP endpoints (streamable HTTP + SSE) via Starlette in ASGI config
- Update README to clarify Mnemosyne is a retrieval engine, not RAG
- Let calling LLMs drive synthesis and iterative retrieval themselves
2026-04-26 15:34:26 -04:00
parent 388b37e471
commit 2df22941d2
30 changed files with 1180 additions and 126 deletions
--- a/README.md
+++ b/README.md
@@ -47,8 +47,8 @@ This **content-type awareness** flows through every layer: chunking strategy, em

 ```
 Query → Vector Search (Neo4j) + Graph Traversal (Cypher) + Full-Text Search
-  → Candidate Fusion → Qwen3-VL Re-ranking → Content-Type Context Injection
-    → LLM Response with Citations
+  → Candidate Fusion → Qwen3-VL Re-ranking → Ranked Chunks + Metadata
+    → MCP tool result (the calling LLM does its own synthesis)
 ```

 ## Heritage
@@ -82,14 +82,21 @@ celery -A mnemosyne flower --port=5555      # Web monitoring UI

 See [Phase 2: Celery Workers & Scheduler](docs/PHASE_2_EMBEDDING_PIPELINE.md#celery-workers--scheduler) for full details on queues, reliability settings, and task progress tracking.

+## Architecture Note: Retrieval, Not Synthesis
+
+Mnemosyne is a **retrieval engine**, not a RAG pipeline. It stores, embeds, and ranks — it does not synthesize answers.
+
+The earlier roadmap had a server-side RAG layer that took a query and returned a written answer with citations. That layer has been removed. Calling LLMs (Claude via MCP, principally) are perfectly capable of driving iterative retrieval themselves when given the right primitives, and a server-side synthesis hop adds latency, cost, and a place where errors are harder to debug. Letting the calling LLM see chunks directly — and follow citations, pivot mid-search, or call `get_chunk` for full text — beats pre-digesting them.
+
+If a "knowledge subagent" is ever wanted (a wrapper that takes a question and returns a written answer), it lives **outside** Mnemosyne as a thin client over the MCP tools, with its own system prompt. No coupling, no extra inference hop inside the server, and the subagent's behavior can iterate independently.
+
 ## Documentation

 - **[Architecture Documentation](docs/mnemosyne.html)** — Full system architecture with diagrams
 - **[Phase 1: Foundation](docs/PHASE_1_FOUNDATION.md)** — Project skeleton, Neo4j data model, content-type system
 - **[Phase 2: Embedding Pipeline](docs/PHASE_2_EMBEDDING_PIPELINE.md)** — Qwen3-VL multimodal embedding
 - **[Phase 3: Search & Re-ranking](docs/PHASE_3_SEARCH_AND_RERANKING.md)** — Hybrid search + re-ranker
- **[Phase 4: RAG Pipeline](docs/PHASE_4_RAG_PIPELINE.md)** — Content-type-aware generation
- **[Phase 5: MCP Server](docs/PHASE_5_MCP_SERVER.md)** — LLM integration interface
+- **[Phase 5: MCP Server](docs/PHASE_5_MCP_SERVER.md)** — Retrieval primitives for LLMs (`search`, `get_chunk`, `list_libraries`, …)
 - **[Phase 6: Backport to Spelunker](docs/PHASE_6_BACKPORT_TO_SPELUNKER.md)** — Proven patterns flowing back
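
Reviewer note: the retrieval-only contract described in the Architecture Note can be sketched roughly as below. This is not the code from this commit. It is a standard-library stand-in, assuming an in-memory corpus in place of Neo4j and naive term overlap in place of hybrid search plus Qwen3-VL re-ranking. The `Chunk` fields, the `library` and `limit` parameters, and the result keys (`chunk_id`, `score`, `snippet`) are all hypothetical. In the real server these three functions would presumably be registered as MCP tools rather than called directly.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Chunk:
    # Hypothetical chunk shape; the real Neo4j model is not shown in this diff.
    chunk_id: str
    library: str
    text: str


# Hypothetical in-memory corpus standing in for the Neo4j store.
_CHUNKS = [
    Chunk("c1", "cookbooks", "Simmer the stock for two hours before straining."),
    Chunk("c2", "manuals", "Reset the router by holding the button for ten seconds."),
    Chunk("c3", "cookbooks", "Strain the stock through a fine sieve."),
]


def list_libraries() -> List[str]:
    """Return the distinct library names, sorted."""
    return sorted({c.library for c in _CHUNKS})


def search(query: str, library: Optional[str] = None, limit: int = 5) -> List[Dict]:
    """Return ranked chunk references plus metadata, never a written answer."""
    terms = set(query.lower().split())
    hits = []
    for c in _CHUNKS:
        if library is not None and c.library != library:
            continue
        words = set(c.text.lower().replace(".", "").split())
        score = len(terms & words)  # naive overlap, a placeholder for real ranking
        if score:
            hits.append({
                "chunk_id": c.chunk_id,
                "library": c.library,
                "score": score,
                "snippet": c.text[:40],
            })
    # Highest score first; ties broken by chunk_id for a stable order.
    hits.sort(key=lambda h: (-h["score"], h["chunk_id"]))
    return hits[:limit]


def get_chunk(chunk_id: str) -> Dict:
    """Return the full text and metadata for one chunk."""
    for c in _CHUNKS:
        if c.chunk_id == chunk_id:
            return asdict(c)
    raise KeyError(chunk_id)


if __name__ == "__main__":
    # The caller, not the server, decides what to fetch next.
    hits = search("strain the stock", library="cookbooks")
    print([h["chunk_id"] for h in hits])
    print(get_chunk(hits[0]["chunk_id"])["text"])
```

The point of the sketch is the shape of the contract: `search` returns references and scores, `get_chunk` returns full text on demand, and any synthesis happens in the caller. A "knowledge subagent" of the kind the README mentions would be a loop over these calls living outside the server.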