fix(search): require library match and preserve raw scores for RRF

Replace OPTIONAL MATCH with MATCH for Library-Collection-Item paths to ensure results are properly scoped to libraries, and remove per-query score normalization since RRF fuses results by rank rather than score magnitude.
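Why dropping normalization is safe can be shown with a small sketch of Reciprocal Rank Fusion. This is an illustration only, not the project's fusion code: `rrf_fuse`, `rank_by_score`, the sample scores, and the constant `k=60` (a commonly used default) are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of IDs with Reciprocal Rank Fusion.

    Only the position of each ID in each list is used; raw scores are never read.
    """
    fused: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, uid in enumerate(ranking, start=1):
            fused[uid] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


def rank_by_score(scores: dict[str, float]) -> list[str]:
    """Order IDs from highest to lowest score (hypothetical helper)."""
    return sorted(scores, key=lambda uid: scores[uid], reverse=True)


# Hypothetical per-query results: raw BM25 scores and vector similarities.
raw_bm25 = {"a": 12.4, "b": 7.1, "c": 0.9}
vector = {"b": 0.91, "d": 0.40, "a": 0.38}

# Max-normalization rescales the scores but does not change their order.
max_score = max(raw_bm25.values())
normalized_bm25 = {uid: score / max_score for uid, score in raw_bm25.items()}

fused_raw = rrf_fuse([rank_by_score(raw_bm25), rank_by_score(vector)])
fused_norm = rrf_fuse([rank_by_score(normalized_bm25), rank_by_score(vector)])
```

Both calls produce the same fused ranking, because dividing every score by a positive constant leaves the per-list order unchanged and RRF only consumes that order.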
2026-04-26 06:35:11 -04:00
parent 4a35aa126f
commit 388b37e471
3 changed files with 55 additions and 360 deletions
--- a/Standards_Django_V1-00.md
+++ b/Standards_Django_V1-00.md
@@ -1,306 +0,0 @@
 ## 🐾 Red Panda Approval™
 This project follows Red Panda Approval standards — our gold standard for Django application quality. Code must be elegant, reliable, and maintainable to earn the approval of our adorable red panda judges.
 ### The 5 Sacred Django Criteria
 1. **Fresh Migration Test** — Clean migrations from empty database
 2. **Elegant Simplicity** — No unnecessary complexity
 3. **Observable & Debuggable** — Proper logging and error handling
 4. **Consistent Patterns** — Follow Django conventions
 5. **Actually Works** — Passes all checks and serves real user needs
 ## Environment Standards
 - Virtual environment: ~/env/PROJECT/bin/activate
 - Use pyproject.toml for project configuration (no setup.py, no requirements.txt)
 - Python version: specified in pyproject.toml
 - Dependencies: floor-pinned with ceiling (e.g. `Django>=5.2,<6.0`)
 ### Dependency Pinning
 ```toml
 # Correct — floor pin with ceiling
 dependencies = [
    "Django>=5.2,<6.0",
    "djangorestframework>=3.14,<4.0",
    "cryptography>=41.0,<45.0",
 ]
 # Wrong — exact pins in library packages
 dependencies = [
    "Django==5.2.7",  # too strict, breaks downstream
 ]
 ```
 Exact pins (`==`) are only appropriate in application-level lock files, not in reusable library packages.
 ## Directory Structure
 myproject/                     # Git repository root
 ├── .gitignore
 ├── README.md
 ├── pyproject.toml             # Project configuration (moved to repo root)
 ├── docker-compose.yml
 ├── .env                       # Docker Compose environment (DATABASE_URL=postgres://...)
 ├── .env.example
 │
 ├── project/                   # Django project root (manage.py lives here)
 │   ├── manage.py
 │   ├── Dockerfile
 │   ├── .env                   # Local development environment (DATABASE_URL=sqlite:///...)
 │   ├── .env.example
 │   │
 │   ├── config/                # Django configuration module
 │   │   ├── __init__.py
 │   │   ├── settings.py
 │   │   ├── urls.py
 │   │   ├── wsgi.py
 │   │   └── asgi.py
 │   │
 │   ├── accounts/              # Django app
 │   │   ├── __init__.py
 │   │   ├── models.py
 │   │   ├── views.py
 │   │   └── urls.py
 │   │
 │   ├── blog/                  # Django app
 │   │   ├── __init__.py
 │   │   ├── models.py
 │   │   ├── views.py
 │   │   └── urls.py
 │   │
 │   ├── static/
 │   │   ├── css/
 │   │   └── js/
 │   │
 │   └── templates/
 │       └── base.html
 │
 ├── web/                       # Nginx configuration
 │   └── nginx.conf
 │
 ├── db/                        # PostgreSQL configuration
 │   └── postgresql.conf
 │
 └── docs/                      # Project documentation
    └── index.md
 ## Settings Structure
 - Use a single settings.py file
 - Use django-environ or python-dotenv for environment variables
 - Never commit .env files to version control
 - Provide .env.example with all required variables documented
 - Create .gitignore file
 - Create a .dockerignore file
 ## Code Organization
 - Imports: PEP 8 ordering (stdlib, third-party, local)
 - Type hints on function parameters
 - CSS: External .css files only (no inline styles, no embedded `<style>` tags)
 - JS: External .js files only (no inline handlers, no embedded `<script>` blocks)
 - Maximum file length: 1000 lines
 - If a file exceeds 500 lines, consider splitting by domain concept
 ## Database Conventions
 - Migrations run cleanly from empty database
 - Never edit deployed migrations
 - Use meaningful migration names: --name add_email_to_profile
 - One logical change per migration when possible
 - Test migrations both forward and backward
 ### Development vs Production
 - Development: SQLite
 - Production: PostgreSQL
 ## Caching
 - Expensive queries are cached
 - Cache keys follow naming convention
 - TTLs are appropriate (not infinite)
 - Invalidation is documented
 - Key Naming Pattern: {app}:{model}:{identifier}:{field}
 ## Model Naming
 - Model names: singular PascalCase (User, BlogPost, OrderItem)
 - Correct English pluralization on related names
 - All models have created_at and updated_at
 - All models define __str__ and get_absolute_url
 - TextChoices used for status fields
 - related_name defined on ForeignKey fields
 - Related names: plural snake_case with proper English pluralization
 ## Forms
 - Use ModelForm with explicit fields list (never __all__)
 ## Field Naming
 - Foreign keys: singular without _id suffix (author, category, parent)
 - Boolean fields: use prefixes (is_active, has_permission, can_edit)
 - Date fields: use suffixes (created_at, updated_at, published_on)
 - Avoid abbreviations (use description, not desc)
 ## Required Model Fields
 - All models should include:
  - created_at = models.DateTimeField(auto_now_add=True)
  - updated_at = models.DateTimeField(auto_now=True)
 - Consider adding:
  - id = models.UUIDField(primary_key=True) for public-facing models
  - is_active = models.BooleanField(default=True) for soft deletes
 ## Indexing
 - Add db_index=True to frequently queried fields
 - Use Meta.indexes for composite indexes
 - Document why each index exists
 ## Queries
 - Use select_related() for foreign keys
 - Use prefetch_related() for reverse relations and M2M
 - Avoid queries in loops (N+1 problem)
 - Use .only() and .defer() for large models
 - Add comments explaining complex querysets
 ## Docstrings
 - Use Sphinx style docstrings
 - Document all public functions, classes, and modules
 - Skip docstrings for obvious one-liners and standard Django overrides
 ## Views
 - Use Function-Based Views (FBVs) exclusively
 - Explicit logic is preferred over implicit inheritance
 - Extract shared logic into utility functions
 ## URLs & Identifiers
 - Public URLs use short UUIDs (12 characters) via `shortuuid`
 - Never expose sequential IDs in URLs (security/enumeration risk)
 - Internal references may use standard UUIDs or PKs
 ## URL Patterns
 - Resource-based URLs (RESTful style)
 - Namespaced URL names per app
 - Trailing slashes (Django default)
 - Flat structure preferred over deep nesting
 ## Background Tasks
 - All tasks are run synchronously unless the design specifies background tasks are needed for long operations
 - Long operations use Celery tasks
 - Use Memcached, task progress pattern: {app}:task:{task_id}:progress
 - Tasks are idempotent
 - Tasks include retry logic
 - Tasks live in app/tasks.py
 - RabbitMQ is the Message Broker
 - Flower Monitoring: Use for debugging failed tasks
 ## Testing
 - Framework: Django TestCase (not pytest)
 - Separate test files per module: test_models.py, test_views.py, test_forms.py
 ## Frontend Standards
 ### New Projects (DaisyUI + Tailwind)
 - DaisyUI 4 via CDN for component classes
 - Tailwind CSS via CDN for utility classes
 - Theme management via Themis (DaisyUI `data-theme` attribute)
 - All apps extend `themis/base.html` for consistent navigation
 - No inline styles or scripts
 ### Existing Projects (Bootstrap 5)
 - Bootstrap 5 via CDN
 - Bootstrap Icons via CDN
 - Bootswatch for theme variants (if applicable)
 - django-bootstrap5 and crispy-bootstrap5 for form rendering
 ## Preferred Packages
 ### Core Django
 - django>=5.2,<6.0
 - django-environ — Environment variables
 ### Authentication & Security
 - django-allauth — User management
 - django-allauth-2fa — Two-factor authentication
 ### API Development
 - djangorestframework>=3.14,<4.0 — REST APIs
 - drf-spectacular — OpenAPI/Swagger documentation
 ### Encryption
 - cryptography — Fernet encryption for secrets/API keys
 ### Background Tasks
 - celery — Async task queue
 - django-celery-progress — Progress bars
 - flower — Celery monitoring
 ### Caching
 - pymemcache — Memcached backend
 ### Database
 - dj-database-url — Database URL configuration
 - psycopg[binary] — PostgreSQL adapter
 - shortuuid — Short UUIDs for public URLs
 ### Production
 - gunicorn — WSGI server
 ### Shared Apps
 - django-heluca-themis — User preferences, themes, key management, navigation
 ### Deprecated / Removed
 - ~~pytz~~ — Use stdlib `zoneinfo` (Python 3.9+, Django 4+)
 - ~~Pillow~~ — Only add if your app needs ImageField
 - ~~django-heluca-core~~ — Replaced by Themis
 ## Anti-Patterns to Avoid
 ### Models
 - Don't use `Model.objects.get()` without handling `DoesNotExist`
 - Don't use `null=True` on `CharField` or `TextField` (use `blank=True, default=""`)
 - Don't use `related_name='+'` unless you have a specific reason
 - Don't override `save()` for business logic (use signals or service functions)
 - Don't use `auto_now=True` on fields you might need to manually set
 - Don't use `ForeignKey` without specifying `on_delete` explicitly
 - Don't use `Meta.ordering` on large tables (specify ordering in queries)
 ### Queries
 - Don't query inside loops (N+1 problem)
 - Don't use `.all()` when you need a subset
 - Don't use raw SQL unless absolutely necessary
 - Don't forget `select_related()` and `prefetch_related()`
 ### Views
 - Don't put business logic in views
 - Don't use `request.POST.get()` without validation (use forms)
 - Don't return sensitive data in error messages
 - Don't forget `login_required` decorator on protected views
 ### Forms
 - Don't use `fields = '__all__'` in ModelForm
 - Don't trust client-side validation alone
 - Don't use `exclude` in ModelForm (use explicit `fields`)
 ### Templates
 - Don't use `{{ variable }}` for URLs (use `{% url %}` tag)
 - Don't put logic in templates
 - Don't use inline CSS or JavaScript (external files only)
 - Don't forget `{% csrf_token %}` in forms
 ### Security
 - Don't store secrets in `settings.py` (use environment variables)
 - Don't commit `.env` files to version control
 - Don't use `DEBUG=True` in production
 - Don't expose sequential IDs in public URLs
 - Don't use `mark_safe()` on user-supplied content
 - Don't disable CSRF protection
 ### Imports & Code Style
 - Don't use `from module import *`
 - Don't use mutable default arguments
 - Don't use bare `except:` clauses
 - Don't ignore linter warnings without documented reason
 ### Migrations
 - Don't edit migrations that have been deployed
 - Don't use `RunPython` without a reverse function
 - Don't add non-nullable fields without a default value
 ### Celery Tasks
 - Don't pass model instances to tasks (pass IDs and re-fetch)
 - Don't assume tasks run immediately
 - Don't forget retry logic for external service calls
--- a/mnemosyne/library/services/search.py
+++ b/mnemosyne/library/services/search.py
@@ -247,7 +247,7 @@ class SearchService:
            CALL db.index.vector.queryNodes('chunk_embedding_index', $top_k, $query_vector)
            YIELD node AS chunk, score
            MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-            OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+            MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
            WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
              AND ($library_type IS NULL OR lib.library_type = $library_type)
              AND ($collection_uid IS NULL OR col.uid = $collection_uid)
@@ -352,7 +352,7 @@ class SearchService:
            CALL db.index.fulltext.queryNodes('chunk_text_fulltext', $query)
            YIELD node AS chunk, score
            MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-            OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+            MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
            WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
              AND ($library_type IS NULL OR lib.library_type = $library_type)
              AND ($collection_uid IS NULL OR col.uid = $collection_uid)
@@ -374,15 +374,13 @@ class SearchService:
        try:
            results, _ = db.cypher_query(cypher, params)
-            # Normalize BM25 scores to 0-1 range
+            # Keep raw BM25 scores — RRF fuses by rank, not by score magnitude.
-            max_score = max((float(r[7]) for r in results if r[7]), default=1.0)
            for row in results:
                uid = row[0]
                if not uid:
                    continue
                raw_score = float(row[7]) if row[7] else 0.0
-                normalized = raw_score / max_score if max_score > 0 else 0.0
-                if uid not in candidates or normalized > candidates[uid].score:
+                if uid not in candidates or raw_score > candidates[uid].score:
                    candidates[uid] = SearchCandidate(
                        chunk_uid=uid,
                        text_preview=row[1] or "",
@@ -391,7 +389,7 @@ class SearchService:
                        item_uid=row[4] or "",
                        item_title=row[5] or "",
                        library_type=row[6] or "",
-                        score=normalized,
+                        score=raw_score,
                        source="fulltext",
                    )
        except Exception as exc:
@@ -409,7 +407,7 @@ class SearchService:
            YIELD node AS concept, score AS concept_score
            MATCH (chunk:Chunk)-[:MENTIONS]->(concept)
            MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-            OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+            MATCH (lib:Library)-[:CONTAINS]->(:Collection)-[:CONTAINS]->(item)
            WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
              AND ($library_type IS NULL OR lib.library_type = $library_type)
            RETURN chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
@@ -430,14 +428,13 @@ class SearchService:
        try:
            results, _ = db.cypher_query(cypher, params)
-            max_score = max((float(r[7]) for r in results if r[7]), default=1.0)
+            # Raw scores already include the 0.8 concept downweight from Cypher.
            for row in results:
                uid = row[0]
                if not uid:
                    continue
                raw_score = float(row[7]) if row[7] else 0.0
-                normalized = raw_score / max_score if max_score > 0 else 0.0
-                if uid not in candidates or normalized > candidates[uid].score:
+                if uid not in candidates or raw_score > candidates[uid].score:
                    candidates[uid] = SearchCandidate(
                        chunk_uid=uid,
                        text_preview=row[1] or "",
@@ -446,7 +443,7 @@ class SearchService:
                        item_uid=row[4] or "",
                        item_title=row[5] or "",
                        library_type=row[6] or "",
-                        score=normalized,
+                        score=raw_score,
                        source="fulltext",
                    )
        except Exception as exc:
@@ -476,17 +473,17 @@ class SearchService:
            LIMIT 10
            MATCH (chunk:Chunk)-[:MENTIONS]->(concept)
            MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-            OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+            MATCH (lib:Library)-[:CONTAINS]->(:Collection)-[:CONTAINS]->(item)
            WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
              AND ($library_type IS NULL OR lib.library_type = $library_type)
-            WITH chunk, item, lib, concept, concept_score,
-                 count(DISTINCT concept) AS concept_count
-            RETURN DISTINCT chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
+            WITH chunk, item, lib,
+                 max(concept_score) AS score,
+                 collect(DISTINCT concept.name)[..5] AS concept_names
+            RETURN chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
                   chunk.chunk_s3_key AS chunk_s3_key, chunk.chunk_index AS chunk_index,
                   item.uid AS item_uid, item.title AS item_title,
                   lib.library_type AS library_type,
-                   concept_score AS score,
-                   collect(concept.name)[..5] AS concept_names
+                   score, concept_names
            ORDER BY score DESC
            LIMIT $limit
        """
@@ -504,16 +501,12 @@ class SearchService:
            logger.error("Graph search failed: %s", exc)
            return []
-        # Normalize scores
-        max_score = max((float(r[7]) for r in results if r[7]), default=1.0)
        candidates = []
        for row in results:
            uid = row[0]
            if not uid:
                continue
            raw_score = float(row[7]) if row[7] else 0.0
-            normalized = raw_score / max_score if max_score > 0 else 0.0
            concept_names = row[8] if len(row) > 8 else []
            candidates.append(
@@ -525,7 +518,7 @@ class SearchService:
                    item_uid=row[4] or "",
                    item_title=row[5] or "",
                    library_type=row[6] or "",
-                    score=normalized,
+                    score=raw_score,
                    source="graph",
                    metadata={"concepts": concept_names},
                )
@@ -562,7 +555,7 @@ class SearchService:
            YIELD node AS emb_node, score
            MATCH (img:Image)-[:HAS_EMBEDDING]->(emb_node)
            MATCH (item:Item)-[:HAS_IMAGE]->(img)
-            OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+            MATCH (lib:Library)-[:CONTAINS]->(:Collection)-[:CONTAINS]->(item)
            WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
              AND ($library_type IS NULL OR lib.library_type = $library_type)
            RETURN img.uid AS image_uid, img.image_type AS image_type,
@@ -642,11 +635,13 @@ class SearchService:
        try:
            client = RerankerClient(reranker_model, user=self.user)
            # Don't pass top_n — let the reranker score every candidate so
            # cross-attention can promote items the RRF stage ranked low.
            # Final trimming to request.limit happens in search().
            reranked = client.rerank(
                query=request.query,
                candidates=candidates_to_rerank,
                instruction=instruction,
-                top_n=request.limit,
                query_image=request.query_image,
            )
            return reranked, reranker_model.name
@@ -660,22 +655,27 @@ class SearchService:
    # Helpers
    # ------------------------------------------------------------------
+    GENERIC_RERANKER_INSTRUCTION = (
+        "Re-rank these passages by relevance to the query."
+    )
    def _get_reranker_instruction(
        self, request: SearchRequest, candidates: list[SearchCandidate]
    ) -> str:
        """
        Get the content-type-aware reranker instruction.
-        If scoped to a library or library type, use that type's instruction.
+        Scoped queries (by library or library type) use that type's
-        If mixed types, use a generic instruction.
+        instruction. Unscoped queries — even when results happen to
        come mostly from one type — use a generic instruction so the
        reranker is not biased toward the majority type.
        :param request: SearchRequest.
-        :param candidates: Candidates (used to detect dominant library type).
+        :param candidates: Candidates (unused; kept for API stability).
        :returns: Reranker instruction string.
        """
        from library.content_types import get_library_type_config
        # Use explicit library type from request
        if request.library_type:
            try:
                config = get_library_type_config(request.library_type)
@@ -683,25 +683,12 @@ class SearchService:
            except ValueError:
                pass
        # Use library UID to look up type
        if request.library_uid:
-            return self._get_library_reranker_instruction(request.library_uid)
+            instruction = self._get_library_reranker_instruction(request.library_uid)
+            if instruction:
+                return instruction
-        # Detect dominant type from candidates
-        type_counts: dict[str, int] = {}
-        for c in candidates:
-            if c.library_type:
-                type_counts[c.library_type] = type_counts.get(c.library_type, 0) + 1
-        if type_counts:
-            dominant_type = max(type_counts, key=type_counts.get)
-            try:
-                config = get_library_type_config(dominant_type)
-                return config.get("reranker_instruction", "")
-            except ValueError:
-                pass
-        return ""
+        return self.GENERIC_RERANKER_INSTRUCTION
    def _get_library_reranker_instruction(self, library_uid: str) -> str:
        """Get reranker_instruction from a Library node."""
@@ -710,7 +697,12 @@ class SearchService:
            lib = Library.nodes.get(uid=library_uid)
            return lib.reranker_instruction or ""
-        except Exception:
+        except Exception as exc:
+            logger.warning(
+                "Failed to load reranker_instruction for library_uid=%s: %s",
+                library_uid,
+                exc,
+            )
            return ""
    def _get_embedding_instruction(self, library_uid: str) -> str:
@@ -720,7 +712,12 @@ class SearchService:
            lib = Library.nodes.get(uid=library_uid)
            return lib.embedding_instruction or ""
-        except Exception:
+        except Exception as exc:
+            logger.warning(
+                "Failed to load embedding_instruction for library_uid=%s: %s",
+                library_uid,
+                exc,
+            )
            return ""
    def _get_type_embedding_instruction(self, library_type: str) -> str:
--- a/mnemosyne/library/tests/test_search.py
+++ b/mnemosyne/library/tests/test_search.py
@@ -225,8 +225,12 @@ class SearchServiceHelperTest(TestCase):
        instruction = service._get_reranker_instruction(request, [])
        self.assertIn("fiction", instruction.lower())
-    def test_get_reranker_instruction_from_candidates(self):
+    def test_get_reranker_instruction_generic_for_unscoped(self):
-        """Detects dominant library type from candidate list."""
+        """
        Unscoped queries get the generic instruction even when candidates
        all share a library_type — type-specific instructions could bias
        the reranker against minority-type results.
        """
        service = SearchService()
        request = SearchRequest(query="test")
        candidates = [
@@ -240,10 +244,10 @@ class SearchServiceHelperTest(TestCase):
        ]
        instruction = service._get_reranker_instruction(request, candidates)
-        self.assertIn("technical", instruction.lower())
+        self.assertEqual(instruction, SearchService.GENERIC_RERANKER_INSTRUCTION)
-    def test_get_reranker_instruction_empty_when_no_context(self):
+    def test_get_reranker_instruction_generic_when_no_context(self):
-        """Returns empty when no library type context available."""
+        """Returns the generic instruction when no library scope is set."""
        service = SearchService()
        request = SearchRequest(query="test")
        candidates = [
@@ -256,4 +260,4 @@ class SearchServiceHelperTest(TestCase):
        ]
        instruction = service._get_reranker_instruction(request, candidates)
-        self.assertEqual(instruction, "")
+        self.assertEqual(instruction, SearchService.GENERIC_RERANKER_INSTRUCTION)