fix(search): require library match and preserve raw scores for RRF
Replace OPTIONAL MATCH with MATCH for Library-Collection-Item paths to ensure results are properly scoped to libraries, and remove per-query score normalization since RRF fuses results by rank rather than score magnitude.
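
To illustrate why per-query normalization is unnecessary, here is a minimal sketch of reciprocal rank fusion. It is not the project's implementation: the function names, the sample scores, and the constant `k = 60` (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids into one ordering.

    Each list contributes 1 / (k + rank) per id, with rank starting at 1.
    Only the position of an id in each list matters, never its raw score.
    """
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, uid in enumerate(ranking, start=1):
            fused[uid] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by uid for determinism.
    return sorted(fused, key=lambda uid: (-fused[uid], uid))


def rank_by_score(scored):
    """Order (uid, score) pairs by descending score and return the uids."""
    return [uid for uid, _ in sorted(scored, key=lambda pair: -pair[1])]


# Hypothetical raw scores on very different scales.
fulltext = [("c1", 12.4), ("c2", 7.9), ("c3", 0.3)]  # BM25-like magnitudes
vector = [("c2", 0.91), ("c3", 0.88), ("c1", 0.10)]  # cosine-like magnitudes

fused_raw = reciprocal_rank_fusion(
    [rank_by_score(fulltext), rank_by_score(vector)]
)

# Dividing one list by its own maximum keeps its internal order,
# so the fused ranking is the same as with raw scores.
max_fulltext = max(score for _, score in fulltext)
normalized_fulltext = [(uid, score / max_fulltext) for uid, score in fulltext]
fused_normalized = reciprocal_rank_fusion(
    [rank_by_score(normalized_fulltext), rank_by_score(vector)]
)
```

Since only the order within each list enters the sum, rescaling a list by a positive constant cannot change the fused ranking.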
@@ -1,306 +0,0 @@
## 🐾 Red Panda Approval™

This project follows Red Panda Approval standards — our gold standard for Django application quality. Code must be elegant, reliable, and maintainable to earn the approval of our adorable red panda judges.

### The 5 Sacred Django Criteria
1. **Fresh Migration Test** — Clean migrations from empty database
2. **Elegant Simplicity** — No unnecessary complexity
3. **Observable & Debuggable** — Proper logging and error handling
4. **Consistent Patterns** — Follow Django conventions
5. **Actually Works** — Passes all checks and serves real user needs

## Environment Standards
- Virtual environment: ~/env/PROJECT/bin/activate
- Use pyproject.toml for project configuration (no setup.py, no requirements.txt)
- Python version: specified in pyproject.toml
- Dependencies: floor-pinned with ceiling (e.g. `Django>=5.2,<6.0`)

### Dependency Pinning

```toml
# Correct — floor pin with ceiling
dependencies = [
    "Django>=5.2,<6.0",
    "djangorestframework>=3.14,<4.0",
    "cryptography>=41.0,<45.0",
]

# Wrong — exact pins in library packages
dependencies = [
    "Django==5.2.7", # too strict, breaks downstream
]
```

Exact pins (`==`) are only appropriate in application-level lock files, not in reusable library packages.

## Directory Structure
myproject/ # Git repository root
├── .gitignore
├── README.md
├── pyproject.toml # Project configuration (moved to repo root)
├── docker-compose.yml
├── .env # Docker Compose environment (DATABASE_URL=postgres://...)
├── .env.example
│
├── project/ # Django project root (manage.py lives here)
│   ├── manage.py
│   ├── Dockerfile
│   ├── .env # Local development environment (DATABASE_URL=sqlite:///...)
│   ├── .env.example
│   │
│   ├── config/ # Django configuration module
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   ├── wsgi.py
│   │   └── asgi.py
│   │
│   ├── accounts/ # Django app
│   │   ├── __init__.py
│   │   ├── models.py
│   │   ├── views.py
│   │   └── urls.py
│   │
│   ├── blog/ # Django app
│   │   ├── __init__.py
│   │   ├── models.py
│   │   ├── views.py
│   │   └── urls.py
│   │
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   │
│   └── templates/
│       └── base.html
│
├── web/ # Nginx configuration
│   └── nginx.conf
│
├── db/ # PostgreSQL configuration
│   └── postgresql.conf
│
└── docs/ # Project documentation
    └── index.md

## Settings Structure
- Use a single settings.py file
- Use django-environ or python-dotenv for environment variables
- Never commit .env files to version control
- Provide .env.example with all required variables documented
- Create .gitignore file
- Create a .dockerignore file

## Code Organization
- Imports: PEP 8 ordering (stdlib, third-party, local)
- Type hints on function parameters
- CSS: External .css files only (no inline styles, no embedded `<style>` tags)
- JS: External .js files only (no inline handlers, no embedded `<script>` blocks)
- Maximum file length: 1000 lines
- If a file exceeds 500 lines, consider splitting by domain concept

## Database Conventions
- Migrations run cleanly from empty database
- Never edit deployed migrations
- Use meaningful migration names: --name add_email_to_profile
- One logical change per migration when possible
- Test migrations both forward and backward

### Development vs Production
- Development: SQLite
- Production: PostgreSQL

## Caching
- Expensive queries are cached
- Cache keys follow naming convention
- TTLs are appropriate (not infinite)
- Invalidation is documented
- Key Naming Pattern: {app}:{model}:{identifier}:{field}
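
The key naming pattern above can be sketched as a small helper. This is a hypothetical illustration, not code from the project: `make_cache_key` and its validation rules are assumptions. It rejects whitespace because Memcached does not accept whitespace in keys, and rejects `:` inside a part so the four segments stay unambiguous.

```python
def make_cache_key(app: str, model: str, identifier: object, field: str) -> str:
    """Build a key of the form {app}:{model}:{identifier}:{field}."""
    parts = [app, model, str(identifier), field]
    for part in parts:
        # Empty parts, embedded separators, and whitespace would make the
        # key ambiguous or unusable, so fail early instead of caching.
        if not part or ":" in part or any(ch.isspace() for ch in part):
            raise ValueError(f"invalid cache key part: {part!r}")
    return ":".join(parts)


# Hypothetical app, model, and field names.
key = make_cache_key("blog", "post", 42, "comment_count")
```

Centralizing key construction in one function keeps every call site on the same convention and makes invalidation easier to document.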

## Model Naming
- Model names: singular PascalCase (User, BlogPost, OrderItem)
- Correct English pluralization on related names
- All models have created_at and updated_at
- All models define __str__ and get_absolute_url
- TextChoices used for status fields
- related_name defined on ForeignKey fields
- Related names: plural snake_case with proper English pluralization

## Forms
- Use ModelForm with explicit fields list (never __all__)

## Field Naming
- Foreign keys: singular without _id suffix (author, category, parent)
- Boolean fields: use prefixes (is_active, has_permission, can_edit)
- Date fields: use suffixes (created_at, updated_at, published_on)
- Avoid abbreviations (use description, not desc)

## Required Model Fields
- All models should include:
  - created_at = models.DateTimeField(auto_now_add=True)
  - updated_at = models.DateTimeField(auto_now=True)
- Consider adding:
  - id = models.UUIDField(primary_key=True) for public-facing models
  - is_active = models.BooleanField(default=True) for soft deletes

## Indexing
- Add db_index=True to frequently queried fields
- Use Meta.indexes for composite indexes
- Document why each index exists

## Queries
- Use select_related() for foreign keys
- Use prefetch_related() for reverse relations and M2M
- Avoid queries in loops (N+1 problem)
- Use .only() and .defer() for large models
- Add comments explaining complex querysets

## Docstrings
- Use Sphinx style docstrings
- Document all public functions, classes, and modules
- Skip docstrings for obvious one-liners and standard Django overrides

## Views
- Use Function-Based Views (FBVs) exclusively
- Explicit logic is preferred over implicit inheritance
- Extract shared logic into utility functions

## URLs & Identifiers

- Public URLs use short UUIDs (12 characters) via `shortuuid`
- Never expose sequential IDs in URLs (security/enumeration risk)
- Internal references may use standard UUIDs or PKs
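
As a rough illustration of the idea behind 12-character public identifiers, the sketch below generates one with the standard library only. It is a stand-in, not the `shortuuid` package the guideline names: `make_public_id` and the hand-written alphabet are assumptions for this example.

```python
import secrets

# 57 characters with easily confused ones (0, O, 1, I, l) left out.
# Defined by hand here purely for illustration.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def make_public_id(length: int = 12) -> str:
    """Return a random, non-sequential identifier for use in public URLs."""
    # secrets.choice draws from a cryptographically strong source, so
    # neighbouring records do not get guessable identifiers.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


public_id = make_public_id()
```

Unlike an auto-incrementing primary key, an identifier like this reveals nothing about how many records exist or which ones are adjacent.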

## URL Patterns
- Resource-based URLs (RESTful style)
- Namespaced URL names per app
- Trailing slashes (Django default)
- Flat structure preferred over deep nesting

## Background Tasks
- All tasks are run synchronously unless the design specifies background tasks are needed for long operations
- Long operations use Celery tasks
- Use Memcached, task progress pattern: {app}:task:{task_id}:progress
- Tasks are idempotent
- Tasks include retry logic
- Tasks live in app/tasks.py
- RabbitMQ is the Message Broker
- Flower Monitoring: Use for debugging failed tasks

## Testing
- Framework: Django TestCase (not pytest)
- Separate test files per module: test_models.py, test_views.py, test_forms.py

## Frontend Standards

### New Projects (DaisyUI + Tailwind)
- DaisyUI 4 via CDN for component classes
- Tailwind CSS via CDN for utility classes
- Theme management via Themis (DaisyUI `data-theme` attribute)
- All apps extend `themis/base.html` for consistent navigation
- No inline styles or scripts

### Existing Projects (Bootstrap 5)
- Bootstrap 5 via CDN
- Bootstrap Icons via CDN
- Bootswatch for theme variants (if applicable)
- django-bootstrap5 and crispy-bootstrap5 for form rendering

## Preferred Packages

### Core Django
- django>=5.2,<6.0
- django-environ — Environment variables

### Authentication & Security
- django-allauth — User management
- django-allauth-2fa — Two-factor authentication

### API Development
- djangorestframework>=3.14,<4.0 — REST APIs
- drf-spectacular — OpenAPI/Swagger documentation

### Encryption
- cryptography — Fernet encryption for secrets/API keys

### Background Tasks
- celery — Async task queue
- django-celery-progress — Progress bars
- flower — Celery monitoring

### Caching
- pymemcache — Memcached backend

### Database
- dj-database-url — Database URL configuration
- psycopg[binary] — PostgreSQL adapter
- shortuuid — Short UUIDs for public URLs

### Production
- gunicorn — WSGI server

### Shared Apps
- django-heluca-themis — User preferences, themes, key management, navigation

### Deprecated / Removed
- ~~pytz~~ — Use stdlib `zoneinfo` (Python 3.9+, Django 4+)
- ~~Pillow~~ — Only add if your app needs ImageField
- ~~django-heluca-core~~ — Replaced by Themis

## Anti-Patterns to Avoid

### Models
- Don't use `Model.objects.get()` without handling `DoesNotExist`
- Don't use `null=True` on `CharField` or `TextField` (use `blank=True, default=""`)
- Don't use `related_name='+'` unless you have a specific reason
- Don't override `save()` for business logic (use signals or service functions)
- Don't use `auto_now=True` on fields you might need to manually set
- Don't use `ForeignKey` without specifying `on_delete` explicitly
- Don't use `Meta.ordering` on large tables (specify ordering in queries)

### Queries
- Don't query inside loops (N+1 problem)
- Don't use `.all()` when you need a subset
- Don't use raw SQL unless absolutely necessary
- Don't forget `select_related()` and `prefetch_related()`

### Views
- Don't put business logic in views
- Don't use `request.POST.get()` without validation (use forms)
- Don't return sensitive data in error messages
- Don't forget `login_required` decorator on protected views

### Forms
- Don't use `fields = '__all__'` in ModelForm
- Don't trust client-side validation alone
- Don't use `exclude` in ModelForm (use explicit `fields`)

### Templates
- Don't use `{{ variable }}` for URLs (use `{% url %}` tag)
- Don't put logic in templates
- Don't use inline CSS or JavaScript (external files only)
- Don't forget `{% csrf_token %}` in forms

### Security
- Don't store secrets in `settings.py` (use environment variables)
- Don't commit `.env` files to version control
- Don't use `DEBUG=True` in production
- Don't expose sequential IDs in public URLs
- Don't use `mark_safe()` on user-supplied content
- Don't disable CSRF protection

### Imports & Code Style
- Don't use `from module import *`
- Don't use mutable default arguments
- Don't use bare `except:` clauses
- Don't ignore linter warnings without documented reason
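
Two of the items above, mutable default arguments and bare `except:` clauses, are easiest to see in a short example. The function names below are made up for illustration and are not taken from the project.

```python
def add_tag_buggy(tag, tags=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits `tags`.
    tags.append(tag)
    return tags


def add_tag(tag, tags=None):
    # Using None as the sentinel gives each call its own fresh list.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


def parse_int(value, default=0):
    # Catch only the exceptions int() is expected to raise. A bare
    # `except:` would also swallow KeyboardInterrupt and SystemExit.
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


first_buggy = add_tag_buggy("a")
second_buggy = add_tag_buggy("b")  # same list object as first_buggy

first = add_tag("a")
second = add_tag("b")  # independent list
```

After these calls, `first_buggy` and `second_buggy` refer to one shared list that contains both tags, while `first` and `second` each hold only their own tag.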

### Migrations
- Don't edit migrations that have been deployed
- Don't use `RunPython` without a reverse function
- Don't add non-nullable fields without a default value

### Celery Tasks
- Don't pass model instances to tasks (pass IDs and re-fetch)
- Don't assume tasks run immediately
- Don't forget retry logic for external service calls
@@ -247,7 +247,7 @@ class SearchService:
 CALL db.index.vector.queryNodes('chunk_embedding_index', $top_k, $query_vector)
 YIELD node AS chunk, score
 MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
 WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
 AND ($library_type IS NULL OR lib.library_type = $library_type)
 AND ($collection_uid IS NULL OR col.uid = $collection_uid)
@@ -352,7 +352,7 @@ class SearchService:
 CALL db.index.fulltext.queryNodes('chunk_text_fulltext', $query)
 YIELD node AS chunk, score
 MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
 WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
 AND ($library_type IS NULL OR lib.library_type = $library_type)
 AND ($collection_uid IS NULL OR col.uid = $collection_uid)
@@ -374,15 +374,13 @@ class SearchService:

         try:
             results, _ = db.cypher_query(cypher, params)
-            # Normalize BM25 scores to 0-1 range
-            max_score = max((float(r[7]) for r in results if r[7]), default=1.0)
+            # Keep raw BM25 scores — RRF fuses by rank, not by score magnitude.
             for row in results:
                 uid = row[0]
                 if not uid:
                     continue
                 raw_score = float(row[7]) if row[7] else 0.0
-                normalized = raw_score / max_score if max_score > 0 else 0.0
-                if uid not in candidates or normalized > candidates[uid].score:
+                if uid not in candidates or raw_score > candidates[uid].score:
                     candidates[uid] = SearchCandidate(
                         chunk_uid=uid,
                         text_preview=row[1] or "",
@@ -391,7 +389,7 @@ class SearchService:
                         item_uid=row[4] or "",
                         item_title=row[5] or "",
                         library_type=row[6] or "",
-                        score=normalized,
+                        score=raw_score,
                         source="fulltext",
                     )
         except Exception as exc:
@@ -409,7 +407,7 @@ class SearchService:
 YIELD node AS concept, score AS concept_score
 MATCH (chunk:Chunk)-[:MENTIONS]->(concept)
 MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+MATCH (lib:Library)-[:CONTAINS]->(:Collection)-[:CONTAINS]->(item)
 WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
 AND ($library_type IS NULL OR lib.library_type = $library_type)
 RETURN chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
@@ -430,14 +428,13 @@ class SearchService:

         try:
             results, _ = db.cypher_query(cypher, params)
-            max_score = max((float(r[7]) for r in results if r[7]), default=1.0)
+            # Raw scores already include the 0.8 concept downweight from Cypher.
             for row in results:
                 uid = row[0]
                 if not uid:
                     continue
                 raw_score = float(row[7]) if row[7] else 0.0
-                normalized = raw_score / max_score if max_score > 0 else 0.0
-                if uid not in candidates or normalized > candidates[uid].score:
+                if uid not in candidates or raw_score > candidates[uid].score:
                     candidates[uid] = SearchCandidate(
                         chunk_uid=uid,
                         text_preview=row[1] or "",
@@ -446,7 +443,7 @@ class SearchService:
                         item_uid=row[4] or "",
                         item_title=row[5] or "",
                         library_type=row[6] or "",
-                        score=normalized,
+                        score=raw_score,
                         source="fulltext",
                     )
         except Exception as exc:
@@ -476,17 +473,17 @@ class SearchService:
 LIMIT 10
 MATCH (chunk:Chunk)-[:MENTIONS]->(concept)
 MATCH (item:Item)-[:HAS_CHUNK]->(chunk)
-OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+MATCH (lib:Library)-[:CONTAINS]->(:Collection)-[:CONTAINS]->(item)
 WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
 AND ($library_type IS NULL OR lib.library_type = $library_type)
-WITH chunk, item, lib, concept, concept_score,
-count(DISTINCT concept) AS concept_count
-RETURN DISTINCT chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
+WITH chunk, item, lib,
+max(concept_score) AS score,
+collect(DISTINCT concept.name)[..5] AS concept_names
+RETURN chunk.uid AS chunk_uid, chunk.text_preview AS text_preview,
 chunk.chunk_s3_key AS chunk_s3_key, chunk.chunk_index AS chunk_index,
 item.uid AS item_uid, item.title AS item_title,
 lib.library_type AS library_type,
-concept_score AS score,
-collect(concept.name)[..5] AS concept_names
+score, concept_names
 ORDER BY score DESC
 LIMIT $limit
 """
@@ -504,16 +501,12 @@ class SearchService:
             logger.error("Graph search failed: %s", exc)
             return []

-        # Normalize scores
-        max_score = max((float(r[7]) for r in results if r[7]), default=1.0)
-
         candidates = []
         for row in results:
             uid = row[0]
             if not uid:
                 continue
             raw_score = float(row[7]) if row[7] else 0.0
-            normalized = raw_score / max_score if max_score > 0 else 0.0
             concept_names = row[8] if len(row) > 8 else []

             candidates.append(
@@ -525,7 +518,7 @@ class SearchService:
                     item_uid=row[4] or "",
                     item_title=row[5] or "",
                     library_type=row[6] or "",
-                    score=normalized,
+                    score=raw_score,
                     source="graph",
                     metadata={"concepts": concept_names},
                 )
@@ -562,7 +555,7 @@ class SearchService:
 YIELD node AS emb_node, score
 MATCH (img:Image)-[:HAS_EMBEDDING]->(emb_node)
 MATCH (item:Item)-[:HAS_IMAGE]->(img)
-OPTIONAL MATCH (lib:Library)-[:CONTAINS]->(col:Collection)-[:CONTAINS]->(item)
+MATCH (lib:Library)-[:CONTAINS]->(:Collection)-[:CONTAINS]->(item)
 WHERE ($library_uid IS NULL OR lib.uid = $library_uid)
 AND ($library_type IS NULL OR lib.library_type = $library_type)
 RETURN img.uid AS image_uid, img.image_type AS image_type,
@@ -642,11 +635,13 @@ class SearchService:

         try:
             client = RerankerClient(reranker_model, user=self.user)
+            # Don't pass top_n — let the reranker score every candidate so
+            # cross-attention can promote items the RRF stage ranked low.
+            # Final trimming to request.limit happens in search().
             reranked = client.rerank(
                 query=request.query,
                 candidates=candidates_to_rerank,
                 instruction=instruction,
-                top_n=request.limit,
                 query_image=request.query_image,
             )
             return reranked, reranker_model.name
@@ -660,22 +655,27 @@ class SearchService:
     # Helpers
     # ------------------------------------------------------------------

+    GENERIC_RERANKER_INSTRUCTION = (
+        "Re-rank these passages by relevance to the query."
+    )
+
     def _get_reranker_instruction(
         self, request: SearchRequest, candidates: list[SearchCandidate]
     ) -> str:
         """
         Get the content-type-aware reranker instruction.

-        If scoped to a library or library type, use that type's instruction.
-        If mixed types, use a generic instruction.
+        Scoped queries (by library or library type) use that type's
+        instruction. Unscoped queries — even when results happen to
+        come mostly from one type — use a generic instruction so the
+        reranker is not biased toward the majority type.

         :param request: SearchRequest.
-        :param candidates: Candidates (used to detect dominant library type).
+        :param candidates: Candidates (unused; kept for API stability).
         :returns: Reranker instruction string.
         """
         from library.content_types import get_library_type_config

-        # Use explicit library type from request
         if request.library_type:
             try:
                 config = get_library_type_config(request.library_type)
@@ -683,25 +683,12 @@ class SearchService:
             except ValueError:
                 pass

-        # Use library UID to look up type
         if request.library_uid:
-            return self._get_library_reranker_instruction(request.library_uid)
+            instruction = self._get_library_reranker_instruction(request.library_uid)
+            if instruction:
+                return instruction

-        # Detect dominant type from candidates
-        type_counts: dict[str, int] = {}
-        for c in candidates:
-            if c.library_type:
-                type_counts[c.library_type] = type_counts.get(c.library_type, 0) + 1
-
-        if type_counts:
-            dominant_type = max(type_counts, key=type_counts.get)
-            try:
-                config = get_library_type_config(dominant_type)
-                return config.get("reranker_instruction", "")
-            except ValueError:
-                pass
-
-        return ""
+        return self.GENERIC_RERANKER_INSTRUCTION

     def _get_library_reranker_instruction(self, library_uid: str) -> str:
         """Get reranker_instruction from a Library node."""
@@ -710,7 +697,12 @@ class SearchService:

             lib = Library.nodes.get(uid=library_uid)
             return lib.reranker_instruction or ""
-        except Exception:
+        except Exception as exc:
+            logger.warning(
+                "Failed to load reranker_instruction for library_uid=%s: %s",
+                library_uid,
+                exc,
+            )
             return ""

     def _get_embedding_instruction(self, library_uid: str) -> str:
@@ -720,7 +712,12 @@ class SearchService:

             lib = Library.nodes.get(uid=library_uid)
             return lib.embedding_instruction or ""
-        except Exception:
+        except Exception as exc:
+            logger.warning(
+                "Failed to load embedding_instruction for library_uid=%s: %s",
+                library_uid,
+                exc,
+            )
             return ""

     def _get_type_embedding_instruction(self, library_type: str) -> str:
@@ -225,8 +225,12 @@ class SearchServiceHelperTest(TestCase):
         instruction = service._get_reranker_instruction(request, [])
         self.assertIn("fiction", instruction.lower())

-    def test_get_reranker_instruction_from_candidates(self):
-        """Detects dominant library type from candidate list."""
+    def test_get_reranker_instruction_generic_for_unscoped(self):
+        """
+        Unscoped queries get the generic instruction even when candidates
+        all share a library_type — type-specific instructions could bias
+        the reranker against minority-type results.
+        """
         service = SearchService()
         request = SearchRequest(query="test")
         candidates = [
@@ -240,10 +244,10 @@ class SearchServiceHelperTest(TestCase):
         ]

         instruction = service._get_reranker_instruction(request, candidates)
-        self.assertIn("technical", instruction.lower())
+        self.assertEqual(instruction, SearchService.GENERIC_RERANKER_INSTRUCTION)

-    def test_get_reranker_instruction_empty_when_no_context(self):
-        """Returns empty when no library type context available."""
+    def test_get_reranker_instruction_generic_when_no_context(self):
+        """Returns the generic instruction when no library scope is set."""
         service = SearchService()
         request = SearchRequest(query="test")
         candidates = [
@@ -256,4 +260,4 @@ class SearchServiceHelperTest(TestCase):
         ]

         instruction = service._get_reranker_instruction(request, candidates)
-        self.assertEqual(instruction, "")
+        self.assertEqual(instruction, SearchService.GENERIC_RERANKER_INSTRUCTION)