Add Themis application with custom widgets, views, and utilities

- Implemented custom form widgets for date, time, and datetime fields with DaisyUI styling.
- Created utility functions for formatting dates, times, and numbers according to user preferences.
- Developed views for profile settings, API key management, and notifications, including health check endpoints.
- Added URL configurations for Themis tests and main application routes.
- Established test cases for custom widgets to ensure proper functionality and integration.
- Defined project metadata and dependencies in pyproject.toml for package management.
2026-03-21 02:00:18 +00:00
parent e99346d014
commit 99bdb4ac92
351 changed files with 65123 additions and 2 deletions

# S3/MinIO File Storage Pattern v1.0.0
Standardizes how Django apps in Spelunker store, read, and reference files in S3/MinIO, covering upload paths, model metadata fields, storage-agnostic I/O, and test isolation.
## 🐾 Red Panda Approval™
This pattern follows Red Panda Approval standards.
---
## Why a Pattern, Not a Shared Implementation
Each Django app stores files for a different domain purpose with different path conventions, processing workflows, and downstream consumers, making a single shared model impractical.
- The **rfp_manager** app needs files scoped under an RFP ID (info docs, question spreadsheets, generated exports), with no embedding — only LLM summarization
- The **solution_library** app needs files tied to vendor/solution hierarchies, with full text embedding and chunk storage, plus scraped documents that have no Django `FileField` at all
- The **rag** app needs to programmatically write chunk texts to S3 during embedding and read them back for search context
- The **core** app needs a simple image upload for organization logos without any processing pipeline
Instead, this pattern defines:
- **Required fields** — the minimum every file-backed model must have
- **Recommended fields** — metadata most implementations should track
- **Standard path conventions** — bucket key prefixes each domain owns
- **Storage-agnostic I/O** — how to read and write files so tests work without a real S3 bucket
---
## Required Fields
Every model that stores a file in S3/MinIO must have at minimum:
```python
from django.core.validators import FileExtensionValidator
from django.db import models


def my_domain_upload_path(instance, filename):
    """Return a scoped S3 key for this domain."""
    return f'my_domain/{instance.parent_id}/{filename}'


class MyDocument(models.Model):
    file = models.FileField(
        upload_to=my_domain_upload_path,  # or a string prefix
        validators=[FileExtensionValidator(allowed_extensions=[...])],
    )
    file_type = models.CharField(max_length=100, blank=True)  # extension without dot
    file_size = models.PositiveIntegerField(null=True, blank=True)  # bytes
```
---
## Standard Path Conventions
Use these exact key prefixes so buckets stay organized and IAM policies can target prefixes.
| App / Purpose | S3 Key Prefix |
|--------------------------------|--------------------------------------------|
| Solution library documents | `documents/` |
| Scraped documentation sources  | `scraped/{source_id}/{filename}`             |
| Embedding chunk texts          | `chunks/{document_id}/chunk_{index}.txt`     |
| RFP information documents      | `rfp_info_documents/{rfp_id}/{filename}`     |
| RFP question spreadsheets      | `rfp_question_documents/{rfp_id}/{filename}` |
| RFP generated exports          | `rfp_exports/{rfp_id}/{filename}`            |
| Organization logos | `orgs/logos/` |
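
As a sketch only, the prefix conventions can be centralized in small key-builder helpers so call sites never hand-assemble keys. These helper names are hypothetical, not part of any app:

```python
# Hypothetical key-builder helpers (illustration only, not app code)
# mirroring the prefix table above; centralizing them keeps call
# sites from hand-assembling S3 keys.

def chunk_key(document_id: int, chunk_index: int) -> str:
    """Key for one embedding chunk text."""
    return f'chunks/{document_id}/chunk_{chunk_index}.txt'


def scraped_key(source_id: int, filename: str) -> str:
    """Key for a scraped documentation file."""
    return f'scraped/{source_id}/{filename}'


def rfp_export_key(rfp_id: int, filename: str) -> str:
    """Key for a generated RFP export."""
    return f'rfp_exports/{rfp_id}/{filename}'
```

For example, `chunk_key(42, 7)` yields `'chunks/42/chunk_7.txt'`, matching the table row exactly.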
---
## Recommended Fields and Behaviors
Most file-backed models should also include these and populate them automatically.
```python
class MyDocument(models.Model):
    # ... required fields above ...

    # Recommended: explicit S3 key for programmatic access and admin visibility
    s3_key = models.CharField(max_length=500, blank=True)

    def save(self, *args, **kwargs):
        """Auto-populate file metadata on every save."""
        if self.file:
            self.s3_key = self.file.name
            if hasattr(self.file, 'size'):
                self.file_size = self.file.size
            if self.file.name and '.' in self.file.name:
                self.file_type = self.file.name.rsplit('.', 1)[-1].lower()
        super().save(*args, **kwargs)
```
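
The metadata extraction inside `save()` is plain string handling, so it can be lifted into a standalone helper for reuse and direct unit testing. This refactor is an illustration, not part of the pattern:

```python
def extract_file_type(name: str) -> str:
    """Lowercased extension without the dot, or '' when there is none.

    Mirrors the file_type logic in the save() override above; for a
    multi-part extension, only the last segment is kept.
    """
    if name and '.' in name:
        return name.rsplit('.', 1)[-1].lower()
    return ''
```

For instance, `extract_file_type('Report.PDF')` returns `'pdf'` and `extract_file_type('archive.tar.gz')` returns `'gz'`.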
---
## Pattern Variant 1: FileField Upload (User-Initiated Upload)
Used by `rfp_manager.RFPInformationDocument`, `rfp_manager.RFPQuestionDocument`, `rfp_manager.RFPExport`, `solution_library.Document`, and `core.Organization`.
The user (or Celery task generating an export) provides a file. Django's `FileField` handles the upload to S3 automatically via the configured storage backend.
```python
import os

from django.core.validators import FileExtensionValidator
from django.db import models


def rfp_info_document_path(instance, filename):
    """Scope uploads under the parent RFP's ID to keep the bucket organized."""
    return f'rfp_info_documents/{instance.rfp.id}/{filename}'


class RFPInformationDocument(models.Model):
    file = models.FileField(
        upload_to=rfp_info_document_path,
        validators=[FileExtensionValidator(
            allowed_extensions=['pdf', 'doc', 'docx', 'txt', 'md']
        )],
    )
    title = models.CharField(max_length=500)
    file_type = models.CharField(max_length=100, blank=True)
    file_size = models.PositiveIntegerField(null=True, blank=True)

    def save(self, *args, **kwargs):
        if self.file:
            if hasattr(self.file, 'size'):
                self.file_size = self.file.size
            if self.file.name:
                self.file_type = os.path.splitext(self.file.name)[1].lstrip('.')
        super().save(*args, **kwargs)
```
---
## Pattern Variant 2: Programmatic Write (Code-Generated Content)
Used by `rag.services.embeddings` (chunk texts) and `solution_library.services.sync` (scraped documents).
Content is generated or fetched in code and written directly to S3 using `default_storage.save()` with a `ContentFile`. The model records the resulting S3 key for later retrieval.
```python
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage


def store_chunk(document_id: int, chunk_index: int, text: str) -> str:
    """
    Store an embedding chunk in S3 and return the saved key.

    Returns:
        The actual S3 key (may differ from requested if file_overwrite=False)
    """
    s3_key = f'chunks/{document_id}/chunk_{chunk_index}.txt'
    saved_key = default_storage.save(s3_key, ContentFile(text.encode('utf-8')))
    return saved_key


def store_scraped_document(source_id: int, filename: str, content: str) -> str:
    """Store scraped document content in S3 and return the saved key."""
    s3_key = f'scraped/{source_id}/{filename}'
    return default_storage.save(s3_key, ContentFile(content.encode('utf-8')))
```
When creating the model record after a programmatic write, use `s3_key` rather than a `FileField`:
```python
Document.objects.create(
    title=filename,
    s3_key=saved_key,
    file_size=len(content),
    file_type='md',
    # Note: `file` field is intentionally empty — this is a scraped document
)
```
---
## Pattern Variant 3: Storage-Agnostic Read
Used by `rfp_manager.services.excel_processor`, `rag.services.embeddings._read_document_content`, and `solution_library.models.DocumentEmbedding.get_chunk_text`.
Always read via `default_storage.open()` so the same code works against S3 in production and `FileSystemStorage` in tests. Never construct a filesystem path from `settings.MEDIA_ROOT`.
```python
from io import BytesIO

from django.core.files.storage import default_storage


def load_binary_from_storage(file_path: str) -> BytesIO:
    """
    Read a binary file from storage into a BytesIO buffer.

    Works against S3/MinIO in production and FileSystemStorage in tests.
    """
    with default_storage.open(file_path, 'rb') as f:
        return BytesIO(f.read())


def read_text_from_storage(s3_key: str) -> str:
    """Read a text file from storage."""
    with default_storage.open(s3_key, 'r') as f:
        return f.read()
```
When a model has both a `file` field (user upload) and a bare `s3_key` (scraped/programmatic), check which path applies:
```python
def _read_document_content(self, document) -> str:
    if document.s3_key and not document.file:
        # Scraped document: no FileField, read by key
        with default_storage.open(document.s3_key, 'r') as f:
            return f.read()
    # Uploaded document: use the FileField
    with document.file.open('r') as f:
        return f.read()
```
---
## Pattern Variant 4: S3 Connectivity Validation
Used by `solution_library.models.Document.clean()` and `solution_library.services.sync.sync_documentation_source`.
Validate that the bucket is reachable before attempting an upload or sync. This surfaces credential errors with a user-friendly message rather than a cryptic 500.
```python
from botocore.exceptions import ClientError, NoCredentialsError
from django.core.exceptions import ValidationError
from django.core.files.storage import default_storage


def validate_s3_connectivity():
    """
    Raise ValidationError if S3/MinIO bucket is not accessible.

    Only call on new uploads or at the start of a background sync.
    """
    if not hasattr(default_storage, 'bucket'):
        return  # Not an S3 backend (e.g., tests), skip validation
    try:
        default_storage.bucket.meta.client.head_bucket(
            Bucket=default_storage.bucket_name
        )
    except ClientError as e:
        code = e.response.get('Error', {}).get('Code', '')
        if code == '403':
            raise ValidationError(
                "S3/MinIO credentials are invalid or permissions are insufficient."
            )
        elif code == '404':
            raise ValidationError(
                f"Bucket '{default_storage.bucket_name}' does not exist."
            )
        raise ValidationError(f"S3/MinIO error ({code}): {e}")
    except NoCredentialsError:
        raise ValidationError("S3/MinIO credentials are not configured.")
```
In a model's `clean()`, guard with `not self.pk` to avoid checking on every update:
```python
def clean(self):
    super().clean()
    if self.file and not self.pk:  # New uploads only
        validate_s3_connectivity()
```
---
## Domain Extension Examples
### rfp_manager App
RFP documents are scoped under the RFP ID for isolation and easy cleanup. The app uses three document types (info, question, export), each with its own callable path function to keep the bucket navigation clear.
```python
def rfp_export_path(instance, filename):
    return f'rfp_exports/{instance.rfp.id}/{filename}'


class RFPExport(models.Model):
    export_file = models.FileField(upload_to=rfp_export_path)
    version = models.CharField(max_length=50)
    file_size = models.PositiveIntegerField(null=True, blank=True)
    question_count = models.IntegerField()
    answered_count = models.IntegerField()
    # No s3_key field - export files are always accessed via FileField
```
### solution_library App
Solution library documents track an explicit `s3_key` because the app supports two document origins: user uploads (with `FileField`) and scraped documents (programmatic write only, no `FileField`). For embedding, chunk texts are stored separately in S3 and referenced from `DocumentEmbedding` via `chunk_s3_key`.
```python
class Document(models.Model):
    file = models.FileField(upload_to='documents/', blank=True)  # blank=True: scraped docs
    s3_key = models.CharField(max_length=500, blank=True)  # always populated
    content_hash = models.CharField(max_length=64, blank=True, db_index=True)


class DocumentEmbedding(models.Model):
    document = models.ForeignKey(Document, on_delete=models.CASCADE, related_name='embeddings')
    chunk_s3_key = models.CharField(max_length=500)  # e.g. chunks/42/chunk_7.txt
    chunk_index = models.IntegerField()
    chunk_size = models.PositiveIntegerField()
    embedding = VectorField(null=True, blank=True)  # pgvector column

    def get_chunk_text(self) -> str:
        from django.core.files.storage import default_storage

        with default_storage.open(self.chunk_s3_key, 'r') as f:
            return f.read()
```
---
## Anti-Patterns
- ❌ Don't build filesystem paths with `os.path.join(settings.MEDIA_ROOT, ...)` — always read through `default_storage.open()`
- ❌ Don't store file content as a `TextField` or `BinaryField` in the database
- ❌ Don't use `default_acl='public-read'` — all Spelunker buckets use `private` ACL with `querystring_auth=True` (pre-signed URLs)
- ❌ Don't skip `FileExtensionValidator` on upload fields — it is the first line of defence against unexpected file types
- ❌ Don't call `document.file.storage.size()` or `.exists()` in hot paths — these make network round-trips; use the `s3_key` and metadata fields for display purposes
- ❌ Don't make S3 API calls in tests without first overriding `STORAGES` in `test_settings.py`
- ❌ Don't use `file_overwrite=True` — the global setting `file_overwrite=False` ensures Django auto-appends a unique suffix rather than silently overwriting existing objects
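
To reinforce the hot-path rule above: display code should format the stored `file_size` field rather than asking storage. A hypothetical template helper (not from the codebase) that never touches the network:

```python
def human_file_size(num_bytes):
    """Render a stored file_size (bytes) for display.

    Reads the model's metadata field instead of calling
    document.file.storage.size(), which would cost one network
    round-trip per row.
    """
    if num_bytes is None:
        return 'unknown'
    size = float(num_bytes)
    for unit in ('B', 'KB', 'MB', 'GB', 'TB'):
        if size < 1024:
            break
        size /= 1024
    return f'{size:.1f} {unit}'
```

For example, `human_file_size(1536)` returns `'1.5 KB'` and `human_file_size(None)` returns `'unknown'` for rows whose metadata was never populated.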
---
## Settings
```python
# spelunker/settings.py
STORAGES = {
    "default": {
        "BACKEND": "storages.backends.s3boto3.S3Boto3Storage",
        "OPTIONS": {
            "access_key": env('S3_ACCESS_KEY'),
            "secret_key": env('S3_SECRET_KEY'),
            "bucket_name": env('S3_BUCKET_NAME'),
            "endpoint_url": env('S3_ENDPOINT_URL'),  # Use for MinIO or non-AWS S3
            "use_ssl": env('S3_USE_SSL'),
            "default_acl": env('S3_DEFAULT_ACL'),  # Must be 'private'
            "region_name": env('S3_REGION_NAME'),
            "file_overwrite": False,  # Prevent silent overwrites
            "querystring_auth": True,  # Pre-signed URLs for all access
            "verify": env.bool('S3_VERIFY_SSL', default=True),
        },
    },
    "staticfiles": {
        # Static files are served locally (nginx), never from S3
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}
```
Environment variables (see `.env.example`):
```bash
S3_ACCESS_KEY=
S3_SECRET_KEY=
S3_BUCKET_NAME=spelunker-documents
S3_ENDPOINT_URL=http://localhost:9000 # MinIO local dev
S3_USE_SSL=False
S3_VERIFY_SSL=False
S3_DEFAULT_ACL=private
S3_REGION_NAME=us-east-1
```
Test override (disables all S3 calls):
```python
# spelunker/test_settings.py
STORAGES = {
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
        "OPTIONS": {"location": "/tmp/test_media/"},
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}
```
---
## Testing
Standard test cases every file-backed implementation should cover.
```python
import tempfile

from django.core.files.uploadedfile import SimpleUploadedFile
from django.test import TestCase, override_settings


@override_settings(
    STORAGES={
        "default": {
            "BACKEND": "django.core.files.storage.FileSystemStorage",
            "OPTIONS": {"location": tempfile.mkdtemp()},
        },
        "staticfiles": {
            "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
        },
    }
)
class MyDocumentStorageTest(TestCase):
    def test_file_metadata_populated_on_save(self):
        """file_type and file_size are auto-populated from the uploaded file."""
        uploaded = SimpleUploadedFile("report.pdf", b"%PDF-1.4 content", content_type="application/pdf")
        doc = MyDocument.objects.create(file=uploaded, title="Test")
        self.assertEqual(doc.file_type, "pdf")
        self.assertGreater(doc.file_size, 0)

    def test_upload_path_includes_parent_id(self):
        """upload_to callable scopes the key under the parent ID."""
        uploaded = SimpleUploadedFile("q.xlsx", b"PK content")
        doc = MyDocument.objects.create(file=uploaded, title="Questions", rfp=self.rfp)
        self.assertIn(str(self.rfp.id), doc.file.name)

    def test_rejected_extension(self):
        """FileExtensionValidator rejects disallowed file types."""
        from django.core.exceptions import ValidationError

        uploaded = SimpleUploadedFile("hack.exe", b"MZ")
        doc = MyDocument(file=uploaded, title="Bad")
        with self.assertRaises(ValidationError):
            doc.full_clean()

    def test_storage_agnostic_read(self):
        """Reading via default_storage.open() works against FileSystemStorage."""
        from django.core.files.base import ContentFile
        from django.core.files.storage import default_storage

        key = default_storage.save("test/hello.txt", ContentFile(b"hello world"))
        with default_storage.open(key, 'r') as f:
            content = f.read()
        self.assertEqual(content, "hello world")
        default_storage.delete(key)
```