diff --git a/docs/superpowers/plans/2026-05-19-workflow-runner-refactor.md b/docs/superpowers/plans/2026-05-19-workflow-runner-refactor.md new file mode 100644 index 000000000..16eca2d78 --- /dev/null +++ b/docs/superpowers/plans/2026-05-19-workflow-runner-refactor.md @@ -0,0 +1,1218 @@ +# Workflow Runner Refactor Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Replace the `use_overrides` dual-mode branching in `run_workflows` with a polymorphic `WorkflowRunContext`, and make the workflow-execution → file-rename sequence deterministic via a `ContextVar` guard. + +**Architecture:** A new `documents/workflows/context.py` module defines a `WorkflowRunContext` `Protocol` and two implementations — `ConsumptionContext` (wraps `ConsumableDocument` + `DocumentMetadataOverrides`) and `PersistedContext` (wraps a real `Document`). `run_workflows` becomes a flat match-and-dispatch loop with no mode flag. A module-level `ContextVar` guard suppresses `update_filename_and_move_files` for the duration of a workflow run; the file rename is invoked once, explicitly, after the run. + +**Tech Stack:** Python 3.12+, Django, Celery, pytest (pytest-xdist), `uv`, `ruff`. + +**Reference spec:** `docs/superpowers/specs/2026-05-19-workflow-runner-refactor-design.md` + +--- + +## Conventions for every task + +- All Python commands run from `src/`. Tests: `uv run pytest`. Lint: `ruff check --fix` then `ruff format` (global `ruff`, **not** `uv run ruff`). +- New test code: group tests under classes, put `@pytest.mark.django_db` per class where DB is needed, fully type-annotate every fixture parameter, fixture return type, and test signature. +- Commit after each task with the message shown in its final step. + +## File structure + +| File | Responsibility | Change | +| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ----------------------- | +| `src/documents/workflows/context.py` | `ContextVar` guard, `WorkflowRunContext` Protocol, `ConsumptionContext`, `PersistedContext`, `build_workflow_context` factory | **Create** | +| `src/documents/signals/handlers.py` | `run_workflows` rewritten as flat loop; `update_filename_and_move_files` split into receiver + `move_files_for_document` | Modify | +| `src/documents/workflows/actions.py` | Drop `build_workflow_action_context`; `execute_email_action` / `execute_webhook_action` take `source_file` instead of `original_file` | Modify | +| `src/documents/tests/test_workflow_context.py` | Unit tests for the guard and the two contexts | **Create** | +| `src/documents/tests/test_workflows.py` | Existing regression suite (must stay green) + new guard/race regression tests | Modify (add tests only) | + +--- + +## Task 1: ContextVar guard + +**Files:** + +- Create: `src/documents/workflows/context.py` +- Test: `src/documents/tests/test_workflow_context.py` + +- [ ] **Step 1: Write the failing test** + +Create `src/documents/tests/test_workflow_context.py`: + +```python +from documents.workflows.context import workflow_guard +from documents.workflows.context import workflow_run_in_progress + + +class TestWorkflowGuard: + def test_guard_not_set_by_default(self) -> None: + assert workflow_run_in_progress() is False + + def test_guard_set_inside_context_manager(self) -> None: + with workflow_guard(): + assert workflow_run_in_progress() is True + assert workflow_run_in_progress() is False + + def test_guard_is_reentrant(self) -> None: + with workflow_guard(): + with workflow_guard(): + assert workflow_run_in_progress() is True + assert workflow_run_in_progress() is True + assert workflow_run_in_progress() is False + + def test_guard_resets_on_exception(self) -> None: + try: + with workflow_guard(): + raise RuntimeError("boom") + except RuntimeError: + pass + assert workflow_run_in_progress() is False +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `uv run pytest documents/tests/test_workflow_context.py -v` +Expected: FAIL with `ModuleNotFoundError: No module named 'documents.workflows.context'` + +- [ ] **Step 3: Create the module with the guard** + +Create `src/documents/workflows/context.py` with only the guard for now: + +```python +import logging +from collections.abc import Iterator +from contextlib import contextmanager +from contextvars import ContextVar + +logger = logging.getLogger("paperless.workflows") + +_workflow_in_progress: ContextVar[bool] = ContextVar( + "workflow_in_progress", + default=False, +) + + +def workflow_run_in_progress() -> bool: + """ + True while run_workflows is executing on the current thread/context. + + update_filename_and_move_files checks this and early-returns, so the file + rename does not race the workflow's own metadata mutations. + """ + return _workflow_in_progress.get() + + +@contextmanager +def workflow_guard() -> Iterator[None]: + """ + Suppress update_filename_and_move_files for the duration of a workflow run. + + Token-based reset keeps this reentrancy-safe: nested workflow_guard() blocks + (e.g. a workflow whose action triggers another save) restore the previous + value rather than unconditionally clearing the flag. + """ + token = _workflow_in_progress.set(True) + try: + yield + finally: + _workflow_in_progress.reset(token) +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `uv run pytest documents/tests/test_workflow_context.py -v` +Expected: PASS (4 tests) + +- [ ] **Step 5: Lint and commit** + +```bash +ruff check --fix src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +ruff format src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +git add src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +git commit -m "Add ContextVar guard for workflow runner" +``` + +--- + +## Task 2: `WorkflowRunContext` Protocol + `ConsumptionContext` + +**Files:** + +- Modify: `src/documents/workflows/context.py` +- Test: `src/documents/tests/test_workflow_context.py` + +The `ConsumptionContext` absorbs the `use_overrides=True` branches currently in +`run_workflows` (`handlers.py:854-1010`) and `build_workflow_action_context` +(`actions.py:33,53-83`). + +- [ ] **Step 1: Write the failing test** + +Append to `src/documents/tests/test_workflow_context.py`: + +```python +import pytest + +from documents.data_models import ConsumableDocument +from documents.data_models import DocumentMetadataOverrides +from documents.data_models import DocumentSource +from documents.models import Correspondent +from documents.models import WorkflowTrigger +from documents.workflows.context import ConsumptionContext + + +@pytest.mark.django_db +class TestConsumptionContext: + @pytest.fixture + def consumable(self, tmp_path) -> ConsumableDocument: + staged = tmp_path / "staged.pdf" + staged.write_bytes(b"%PDF-1.4") + return ConsumableDocument( + source=DocumentSource.ConsumeFolder, + original_file=staged, + ) + + def test_source_and_staged_file_are_the_staged_path( + self, + consumable: ConsumableDocument, + ) -> None: + overrides = DocumentMetadataOverrides() + ctx = ConsumptionContext( + WorkflowTrigger.WorkflowTriggerType.CONSUMPTION, + consumable, + overrides, + ) + assert ctx.source_file == consumable.original_file + assert ctx.staged_file == consumable.original_file + assert ctx.target is consumable + + def test_assignment_mutates_overrides_not_db( + self, + consumable: ConsumableDocument, + ) -> None: + correspondent = Correspondent.objects.create(name="ACME") + overrides = DocumentMetadataOverrides() + ctx = ConsumptionContext( + WorkflowTrigger.WorkflowTriggerType.CONSUMPTION, + consumable, + overrides, + ) + + class FakeAction: + assign_correspondent = correspondent + assign_document_type = None + assign_storage_path = None + assign_owner = None + assign_title = None + has_assign_tags = False + has_assign_view_users = False + has_assign_view_groups = False + has_assign_change_users = False + has_assign_change_groups = False + has_assign_custom_fields = False + + ctx.apply_assignment(FakeAction(), logging_group=None) + assert overrides.correspondent_id == correspondent.pk + + def test_log_action_collects_messages( + self, + consumable: ConsumableDocument, + ) -> None: + ctx = ConsumptionContext( + WorkflowTrigger.WorkflowTriggerType.CONSUMPTION, + consumable, + DocumentMetadataOverrides(), + ) + ctx.log_action("hello", logging_group=None) + assert ctx.messages == ["hello"] +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `uv run pytest documents/tests/test_workflow_context.py::TestConsumptionContext -v` +Expected: FAIL with `ImportError: cannot import name 'ConsumptionContext'` + +- [ ] **Step 3: Add the Protocol and `ConsumptionContext`** + +Add to `src/documents/workflows/context.py` (after the guard). Add these imports at the top of the file: + +```python +import uuid +from pathlib import Path +from typing import Protocol + +from django.contrib.auth.models import User +from django.utils import timezone + +from documents.data_models import ConsumableDocument +from documents.data_models import DocumentMetadataOverrides +from documents.models import Correspondent +from documents.models import Document +from documents.models import DocumentType +from documents.models import Workflow +from documents.models import WorkflowAction +from documents.models import WorkflowRun +from documents.models import WorkflowTrigger +from documents.workflows.mutations import apply_assignment_to_overrides +from documents.workflows.mutations import apply_removal_to_overrides +``` + +Then append: + +```python +class WorkflowRunContext(Protocol): + """ + Uniform surface a workflow action operates on, independent of whether the + run targets a freshly-consumed document (overrides) or a persisted one. + """ + + trigger_type: WorkflowTrigger.WorkflowTriggerType + + @property + def target(self) -> Document | ConsumableDocument: + """The object passed to matching and to email/webhook/trash actions.""" + ... + + @property + def source_file(self) -> Path: + """Best-effort current on-disk location of the document file.""" + ... + + @property + def staged_file(self) -> Path | None: + """ + The staged (pre-final-move) path, when the file is not yet at its + final storage location; otherwise None. Used by password removal. + """ + ... + + def refresh(self, logging_group: uuid.UUID | None) -> bool: + """Reload state before matching. Return False to stop the run.""" + ... + + def build_placeholder_context(self) -> dict: + """Context dict for email/webhook placeholder parsing.""" + ... + + def apply_assignment( + self, + action: WorkflowAction, + logging_group: uuid.UUID | None, + ) -> None: ... + + def apply_removal( + self, + action: WorkflowAction, + logging_group: uuid.UUID | None, + ) -> None: ... + + def log_action(self, message: str, logging_group: uuid.UUID | None) -> None: + """Record an 'applying action' message.""" + ... + + def persist(self) -> None: + """Commit accumulated mutations (no-op for the consumption flow).""" + ... + + def record_run(self, workflow: Workflow) -> None: + """Create the WorkflowRun audit row.""" + ... + + def finalize_file_location(self) -> None: + """Run the one-time file rename after the workflow run completes.""" + ... + + +class ConsumptionContext: + """Workflow run against a document still being consumed.""" + + def __init__( + self, + trigger_type: WorkflowTrigger.WorkflowTriggerType, + consumable: ConsumableDocument, + overrides: DocumentMetadataOverrides, + ) -> None: + self.trigger_type = trigger_type + self.consumable = consumable + self.overrides = overrides + self.messages: list[str] = [] + + @property + def target(self) -> Document | ConsumableDocument: + return self.consumable + + @property + def source_file(self) -> Path: + return self.consumable.original_file + + @property + def staged_file(self) -> Path | None: + return self.consumable.original_file + + def refresh(self, logging_group: uuid.UUID | None) -> bool: + return True + + def build_placeholder_context(self) -> dict: + overrides = self.overrides + correspondent_obj = ( + Correspondent.objects.filter(pk=overrides.correspondent_id).first() + if overrides.correspondent_id + else None + ) + document_type_obj = ( + DocumentType.objects.filter(pk=overrides.document_type_id).first() + if overrides.document_type_id + else None + ) + owner_obj = ( + User.objects.filter(pk=overrides.owner_id).first() + if overrides.owner_id + else None + ) + filename = ( + self.consumable.original_file if self.consumable.original_file else "" + ) + return { + "title": overrides.title + if overrides.title + else str(self.consumable.original_file), + "doc_url": "", + "correspondent": correspondent_obj.name if correspondent_obj else "", + "document_type": document_type_obj.name if document_type_obj else "", + "owner_username": owner_obj.username if owner_obj else "", + "filename": filename, + "current_filename": filename, + "added": timezone.localtime(timezone.now()), + "created": overrides.created, + "id": "", + } + + def apply_assignment( + self, + action: WorkflowAction, + logging_group: uuid.UUID | None, + ) -> None: + apply_assignment_to_overrides(action, self.overrides) + + def apply_removal( + self, + action: WorkflowAction, + logging_group: uuid.UUID | None, + ) -> None: + apply_removal_to_overrides(action, self.overrides) + + def log_action(self, message: str, logging_group: uuid.UUID | None) -> None: + self.messages.append(message) + + def persist(self) -> None: + return None + + def record_run(self, workflow: Workflow) -> None: + WorkflowRun.objects.create( + workflow=workflow, + type=self.trigger_type, + document=None, + ) + + def finalize_file_location(self) -> None: + return None +``` + +> Note: the `build_placeholder_context` body is moved verbatim from the +> `use_overrides` branch of `build_workflow_action_context` (`actions.py:53-83`). + +- [ ] **Step 4: Run test to verify it passes** + +Run: `uv run pytest documents/tests/test_workflow_context.py::TestConsumptionContext -v` +Expected: PASS (3 tests) + +- [ ] **Step 5: Lint and commit** + +```bash +ruff check --fix src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +ruff format src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +git add src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +git commit -m "Add WorkflowRunContext protocol and ConsumptionContext" +``` + +--- + +## Task 3: `PersistedContext` + `build_workflow_context` factory + +**Files:** + +- Modify: `src/documents/workflows/context.py` +- Test: `src/documents/tests/test_workflow_context.py` + +`PersistedContext` absorbs the `use_overrides=False` branches of `run_workflows` +(`handlers.py:896-1005`) and `build_workflow_action_context` (`actions.py:35-51`). + +- [ ] **Step 1: Write the failing test** + +Append to `src/documents/tests/test_workflow_context.py`: + +```python +from documents.workflows.context import PersistedContext +from documents.workflows.context import build_workflow_context + + +@pytest.mark.django_db +class TestPersistedContext: + @pytest.fixture + def document(self) -> Document: + return Document.objects.create( + title="doc", + mime_type="application/pdf", + checksum="abc123", + ) + + def test_source_file_is_document_source_path_without_staged( + self, + document: Document, + ) -> None: + ctx = PersistedContext( + WorkflowTrigger.WorkflowTriggerType.DOCUMENT_UPDATED, + document, + staged_file=None, + ) + assert ctx.source_file == document.source_path + assert ctx.staged_file is None + assert ctx.target is document + + def test_source_file_prefers_staged_path( + self, + document: Document, + tmp_path, + ) -> None: + staged = tmp_path / "staged.pdf" + staged.write_bytes(b"%PDF-1.4") + ctx = PersistedContext( + WorkflowTrigger.WorkflowTriggerType.DOCUMENT_ADDED, + document, + staged_file=staged, + ) + assert ctx.source_file == staged + assert ctx.staged_file == staged + + def test_refresh_returns_false_when_document_deleted( + self, + document: Document, + ) -> None: + ctx = PersistedContext( + WorkflowTrigger.WorkflowTriggerType.DOCUMENT_UPDATED, + document, + staged_file=None, + ) + Document.objects.filter(pk=document.pk).delete() + assert ctx.refresh(logging_group=None) is False + + +@pytest.mark.django_db +class TestBuildWorkflowContext: + def test_overrides_present_builds_consumption_context(self, tmp_path) -> None: + from documents.data_models import DocumentSource + + consumable = ConsumableDocument( + source=DocumentSource.ConsumeFolder, + original_file=tmp_path / "f.pdf", + ) + ctx = build_workflow_context( + trigger_type=WorkflowTrigger.WorkflowTriggerType.CONSUMPTION, + document=consumable, + overrides=DocumentMetadataOverrides(), + staged_file=None, + ) + assert isinstance(ctx, ConsumptionContext) + + def test_no_overrides_builds_persisted_context(self) -> None: + document = Document.objects.create( + title="d", + mime_type="application/pdf", + checksum="zzz", + ) + ctx = build_workflow_context( + trigger_type=WorkflowTrigger.WorkflowTriggerType.DOCUMENT_UPDATED, + document=document, + overrides=None, + staged_file=None, + ) + assert isinstance(ctx, PersistedContext) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `uv run pytest documents/tests/test_workflow_context.py::TestPersistedContext -v` +Expected: FAIL with `ImportError: cannot import name 'PersistedContext'` + +- [ ] **Step 3: Add `PersistedContext` and the factory** + +Add these imports to `src/documents/workflows/context.py`: + +```python +from documents.workflows.mutations import apply_assignment_to_document +from documents.workflows.mutations import apply_removal_to_document +``` + +Append to `src/documents/workflows/context.py`: + +```python +# Fields a workflow ASSIGNMENT/REMOVAL action can set directly on a Document. +# Deliberately excludes filename / archive_filename: those are managed only by +# move_files_for_document. A workflow never sets them, and writing a stale +# in-memory value here would revert a concurrent (cross-process) file move. +# modified has auto_now=True but is not auto-added when update_fields is given. +_WORKFLOW_SAVE_FIELDS = [ + "title", + "correspondent", + "document_type", + "storage_path", + "owner", + "modified", +] + + +class PersistedContext: + """Workflow run against a document already in the database.""" + + def __init__( + self, + trigger_type: WorkflowTrigger.WorkflowTriggerType, + document: Document, + staged_file: Path | None, + ) -> None: + self.trigger_type = trigger_type + self.document = document + self._staged_file = staged_file + + @property + def target(self) -> Document | ConsumableDocument: + return self.document + + @property + def source_file(self) -> Path: + return self._staged_file if self._staged_file else self.document.source_path + + @property + def staged_file(self) -> Path | None: + return self._staged_file + + def refresh(self, logging_group: uuid.UUID | None) -> bool: + try: + # run_workflows may be called repeatedly from bulk_update_documents; + # refresh so matching data is current and we do not overwrite work + # done by another process. + self.document.refresh_from_db() + except Document.DoesNotExist: + logger.info( + "Document no longer exists, skipping remaining workflows", + extra={"group": logging_group}, + ) + return False + if self.document.is_deleted: + logger.info( + "Document was moved to trash, skipping remaining workflows", + extra={"group": logging_group}, + ) + return False + return True + + def build_placeholder_context(self) -> dict: + from django.conf import settings + + document = self.document + return { + "title": document.title, + "doc_url": f"{settings.PAPERLESS_URL}{settings.BASE_URL}documents/{document.pk}/", + "correspondent": document.correspondent.name + if document.correspondent + else "", + "document_type": document.document_type.name + if document.document_type + else "", + "owner_username": document.owner.username if document.owner else "", + "filename": document.original_filename or "", + "current_filename": document.filename or "", + "added": timezone.localtime(document.added), + "created": document.created, + "id": document.pk, + } + + def apply_assignment( + self, + action: WorkflowAction, + logging_group: uuid.UUID | None, + ) -> None: + apply_assignment_to_document(action, self.document, logging_group) + + def apply_removal( + self, + action: WorkflowAction, + logging_group: uuid.UUID | None, + ) -> None: + apply_removal_to_document(action, self.document) + + def log_action(self, message: str, logging_group: uuid.UUID | None) -> None: + logger.info(message, extra={"group": logging_group}) + + def persist(self) -> None: + # limit title to 128 characters + self.document.title = self.document.title[:128] + self.document.save(update_fields=_WORKFLOW_SAVE_FIELDS) + + def record_run(self, workflow: Workflow) -> None: + WorkflowRun.objects.create( + workflow=workflow, + type=self.trigger_type, + document=self.document, + ) + + def finalize_file_location(self) -> None: + # Imported here to avoid a circular import (handlers imports this module). + from documents.signals.handlers import move_files_for_document + + try: + self.document.refresh_from_db() + except Document.DoesNotExist: + return + if self.document.is_deleted: + return + move_files_for_document(self.document) + + +def build_workflow_context( + *, + trigger_type: WorkflowTrigger.WorkflowTriggerType, + document: Document | ConsumableDocument, + overrides: DocumentMetadataOverrides | None, + staged_file: Path | None, +) -> WorkflowRunContext: + """ + Pick the context implementation from the call shape: a non-None `overrides` + means the consumption flow (ConsumableDocument); otherwise a persisted + Document. + """ + if overrides is not None: + assert isinstance(document, ConsumableDocument) + return ConsumptionContext(trigger_type, document, overrides) + assert isinstance(document, Document) + return PersistedContext(trigger_type, document, staged_file) +``` + +> Note: `build_placeholder_context` here is moved verbatim from the +> non-`use_overrides` branch of `build_workflow_action_context` +> (`actions.py:35-51`). `_WORKFLOW_SAVE_FIELDS` and `persist()` reproduce the +> save at `handlers.py:987-996` — keep the field list and the comment. + +- [ ] **Step 4: Run test to verify it passes** + +Run: `uv run pytest documents/tests/test_workflow_context.py -v` +Expected: PASS (all tests, ~13) + +- [ ] **Step 5: Lint and commit** + +```bash +ruff check --fix src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +ruff format src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +git add src/documents/workflows/context.py src/documents/tests/test_workflow_context.py +git commit -m "Add PersistedContext and build_workflow_context factory" +``` + +--- + +## Task 4: Extract `move_files_for_document` from the rename receiver + +**Files:** + +- Modify: `src/documents/signals/handlers.py:431-667` + +This task is a pure extraction — no behavior change yet. The guard check is +added in Task 6. + +- [ ] **Step 1: Confirm the regression baseline is green** + +Run: `uv run pytest documents/tests/test_file_handling.py -q` +Expected: PASS (records the pre-change baseline for the rename logic) + +- [ ] **Step 2: Split the function** + +In `src/documents/signals/handlers.py`, replace the `update_filename_and_move_files` +definition (currently `def update_filename_and_move_files(...)` at line 434, +through the end of its body at line 667) with two definitions: + +```python +@receiver(models.signals.post_save, sender=CustomFieldInstance, weak=False) +@receiver(models.signals.m2m_changed, sender=Document.tags.through, weak=False) +@receiver(models.signals.post_save, sender=Document, weak=False) +def update_filename_and_move_files( + sender, + instance: Document | CustomFieldInstance, + **kwargs, +) -> None: + if isinstance(instance, CustomFieldInstance): + if not _filename_template_uses_custom_fields(instance.document): + return + instance = instance.document + + move_files_for_document(instance) + + +def move_files_for_document(instance: Document) -> None: + def validate_move(instance, old_path: Path, new_path: Path, root: Path) -> None: + ... # body unchanged from current lines 444-463 +``` + +Move the entire body currently between line 444 (`def validate_move`) and line +667 into `move_files_for_document`, **unchanged**, with one edit: the recursive +call at the end (currently `handlers.py:660-667`) must call the new function: + +```python + # Keep version files in sync with root + if instance.root_document_id is None: + for version_doc in Document.objects.filter(root_document_id=instance.pk).only( + "pk", + ): + move_files_for_document(version_doc) +``` + +- [ ] **Step 3: Run the rename tests** + +Run: `uv run pytest documents/tests/test_file_handling.py -q` +Expected: PASS (identical to Step 1 — extraction is behavior-preserving) + +- [ ] **Step 4: Run the workflow suite** + +Run: `uv run pytest documents/tests/test_workflows.py -q` +Expected: PASS (unchanged) + +- [ ] **Step 5: Lint and commit** + +```bash +ruff check --fix src/documents/signals/handlers.py +ruff format src/documents/signals/handlers.py +git add src/documents/signals/handlers.py +git commit -m "Extract move_files_for_document from rename receiver" +``` + +--- + +## Task 5: Rewrite `run_workflows` to use contexts + +**Files:** + +- Modify: `src/documents/signals/handlers.py` (`run_workflows`, `run_workflows_added`) +- Modify: `src/documents/workflows/actions.py` (`execute_email_action`, `execute_webhook_action`, delete `build_workflow_action_context`) + +This removes the `use_overrides` flag, the `caller_supplied_original_file` +flag, and the `original_file` threading into `execute_*` helpers. It does **not** +yet add the guard (Task 6). + +- [ ] **Step 1: Confirm baseline** + +Run: `uv run pytest documents/tests/test_workflows.py -q` +Expected: PASS (baseline before the rewrite) + +- [ ] **Step 2: Update `actions.py` — `execute_email_action` signature** + +In `src/documents/workflows/actions.py`, change `execute_email_action` so the +parameter `original_file: Path` is renamed to `source_file: Path`, and replace +every use of `original_file` in its body (`actions.py:158-167`) with +`source_file`. The logic is otherwise unchanged. + +- [ ] **Step 3: Update `actions.py` — `execute_webhook_action` signature** + +In `execute_webhook_action`, rename the parameter `original_file: Path` to +`source_file: Path` and replace the two uses (`actions.py:245,250`) with +`source_file`. + +- [ ] **Step 4: Delete `build_workflow_action_context`** + +Delete the entire `build_workflow_action_context` function from +`src/documents/workflows/actions.py` (`actions.py:26-83`) — its two branches now +live in `ConsumptionContext.build_placeholder_context` and +`PersistedContext.build_placeholder_context`. + +- [ ] **Step 5: Rewrite `run_workflows` in `handlers.py`** + +In `src/documents/signals/handlers.py`, add to the imports: + +```python +from documents.workflows.context import WorkflowRunContext +from documents.workflows.context import build_workflow_context +``` + +and remove the now-unused `build_workflow_action_context` import. + +Replace the entire `run_workflows` function (`handlers.py:854-1010`) with: + +```python +def run_workflows( + trigger_type: WorkflowTrigger.WorkflowTriggerType, + document: Document | ConsumableDocument, + workflow_to_run: Workflow | None = None, + logging_group: uuid.UUID | None = None, + overrides: DocumentMetadataOverrides | None = None, + original_file: Path | None = None, +) -> tuple[DocumentMetadataOverrides, str] | None: + """ + Execute workflows matching a document for the given trigger. + + For the consumption flow (`overrides` provided) actions mutate the overrides + and the function returns `(overrides, messages)`. Otherwise actions mutate + the persisted Document and the function returns None. + + `original_file`, when given, is the staged path of a freshly-consumed file + that has not yet been moved into its final storage location. + """ + if isinstance(document, Document) and document.root_document_id is not None: + logger.debug( + "Skipping workflow execution for version document %s", + document.pk, + ) + return None + + context: WorkflowRunContext = build_workflow_context( + trigger_type=trigger_type, + document=document, + overrides=overrides, + staged_file=original_file, + ) + + for workflow in get_workflows_for_trigger(trigger_type, workflow_to_run): + if not context.refresh(logging_group): + break + + if not matching.document_matches_workflow( + context.target, + workflow, + trigger_type, + ): + continue + + action: WorkflowAction + has_move_to_trash_action = False + for action in workflow.actions.order_by("order", "pk"): + context.log_action(f"Applying {action} from {workflow}", logging_group) + + if action.type == WorkflowAction.WorkflowActionType.ASSIGNMENT: + context.apply_assignment(action, logging_group) + elif action.type == WorkflowAction.WorkflowActionType.REMOVAL: + context.apply_removal(action, logging_group) + elif action.type == WorkflowAction.WorkflowActionType.EMAIL: + execute_email_action( + action, + context.target, + context.build_placeholder_context(), + logging_group, + context.source_file, + trigger_type, + ) + elif action.type == WorkflowAction.WorkflowActionType.WEBHOOK: + execute_webhook_action( + action, + context.target, + context.build_placeholder_context(), + logging_group, + context.source_file, + ) + elif action.type == WorkflowAction.WorkflowActionType.PASSWORD_REMOVAL: + execute_password_removal_action( + action, + context.target, + logging_group, + source_file=context.staged_file, + ) + elif action.type == WorkflowAction.WorkflowActionType.MOVE_TO_TRASH: + has_move_to_trash_action = True + + context.persist() + context.record_run(workflow) + + if has_move_to_trash_action: + execute_move_to_trash_action(action, context.target, logging_group) + + if isinstance(context, ConsumptionContext): + return context.overrides, "\n".join(context.messages) + return None +``` + +Add the import `from documents.workflows.context import ConsumptionContext` to +`handlers.py` (used for the `isinstance` at the end). + +> Note on `staged_file` for password removal: previously the action received +> `original_file if caller_supplied_original_file else None`. `PersistedContext. +staged_file` is exactly the constructor `staged_file` arg, which is +> `original_file` — so DOCUMENT_ADDED gets the staged path and DOCUMENT_UPDATED +> / SCHEDULED get `None`, preserving the old behavior. For the consumption flow, +> `execute_password_removal_action` takes the `ConsumableDocument` branch and +> ignores `source_file` entirely. + +- [ ] **Step 6: Update `run_workflows_added`** + +`run_workflows_added` (`handlers.py:803-816`) already forwards `original_file` +to `run_workflows`. Leave it as-is — it still receives `original_file` from the +`document_consumption_finished` signal and passes it through; `run_workflows` +now routes it into `PersistedContext` construction. + +- [ ] **Step 7: Run the full workflow suite** + +Run: `uv run pytest documents/tests/test_workflows.py -q` +Expected: PASS — identical pass count to Step 1. If any test needs editing, STOP: +that is a behavior change, not a refactor outcome — investigate before continuing. + +- [ ] **Step 8: Run consumer + matching suites** + +Run: `uv run pytest documents/tests/test_consumer.py documents/tests/test_matchables.py -q` +Expected: PASS + +- [ ] **Step 9: Lint and commit** + +```bash +ruff check --fix src/documents/signals/handlers.py src/documents/workflows/actions.py +ruff format src/documents/signals/handlers.py src/documents/workflows/actions.py +git add src/documents/signals/handlers.py src/documents/workflows/actions.py +git commit -m "Refactor run_workflows around WorkflowRunContext, drop use_overrides" +``` + +--- + +## Task 6: Wire the guard into `run_workflows` and the rename receiver + +**Files:** + +- Modify: `src/documents/signals/handlers.py` (`run_workflows`, `update_filename_and_move_files`, `PersistedContext` use via `finalize_file_location`) +- Test: `src/documents/tests/test_workflows.py` + +- [ ] **Step 1: Write the failing test** + +Append a new test class to `src/documents/tests/test_workflows.py`. Use the +existing helpers/fixtures already in that file for creating a `Document`, a +`Workflow` with a `DOCUMENT_UPDATED` trigger, and an ASSIGNMENT action that +assigns a correspondent. Model it on the existing `DOCUMENT_UPDATED` tests in +that file (they call `run_workflows(...)` directly). The new test: + +```python +class TestWorkflowRenameSequencing(TestCase): + @pytest.mark.django_db + def test_rename_suppressed_during_run_then_runs_once(self) -> None: + """ + update_filename_and_move_files must not run while a workflow is + executing; the move runs exactly once afterwards. + """ + from unittest import mock + + from documents.signals import handlers + + # ... set up a Document, a storage path whose template depends on + # correspondent, and a DOCUMENT_UPDATED workflow with an ASSIGNMENT + # action assigning that correspondent (reuse this file's helpers) ... + + with mock.patch.object( + handlers, + "move_files_for_document", + wraps=handlers.move_files_for_document, + ) as move_spy: + run_workflows( + trigger_type=WorkflowTrigger.WorkflowTriggerType.DOCUMENT_UPDATED, + document=document, + ) + + # The m2m_changed / post_save receiver must NOT have driven the move + # mid-run; only the single explicit finalize call remains. + assert move_spy.call_count == 1 +``` + +> The implementer should flesh out the document/workflow setup using the +> patterns already present in `test_workflows.py`. The assertion that matters: +> `move_files_for_document` is called exactly once, via the explicit +> `finalize_file_location()`, not once-per-`save()` from signals. + +- [ ] **Step 2: Run test to verify it fails** + +Run: `uv run pytest documents/tests/test_workflows.py::TestWorkflowRenameSequencing -v` +Expected: FAIL — without the guard, `move_files_for_document` is invoked +multiple times (via `post_save` / `m2m_changed`), so `call_count != 1`. + +- [ ] **Step 3: Add the guard check to the receiver** + +In `src/documents/signals/handlers.py`, add the early-return to the +`update_filename_and_move_files` receiver. Add the import: + +```python +from documents.workflows.context import workflow_run_in_progress +``` + +and make the receiver: + +```python +def update_filename_and_move_files( + sender, + instance: Document | CustomFieldInstance, + **kwargs, +) -> None: + if workflow_run_in_progress(): + # A workflow run is mutating this document; the rename is deferred to a + # single explicit call in run_workflows after the run completes. + return + + if isinstance(instance, CustomFieldInstance): + if not _filename_template_uses_custom_fields(instance.document): + return + instance = instance.document + + move_files_for_document(instance) +``` + +- [ ] **Step 4: Wrap the run and add the explicit rename in `run_workflows`** + +In `run_workflows`, add the import: + +```python +from documents.workflows.context import workflow_guard +``` + +Wrap the `for workflow in ...` loop in `with workflow_guard():`, and call +`context.finalize_file_location()` after the loop (outside the `with` block): + +```python + with workflow_guard(): + for workflow in get_workflows_for_trigger(trigger_type, workflow_to_run): + ... # loop body unchanged + if has_move_to_trash_action: + execute_move_to_trash_action(action, context.target, logging_group) + + context.finalize_file_location() + + if isinstance(context, ConsumptionContext): + return context.overrides, "\n".join(context.messages) + return None +``` + +`ConsumptionContext.finalize_file_location()` is a no-op; `PersistedContext`'s +runs `move_files_for_document` once against the refreshed, non-deleted document. + +- [ ] **Step 5: Run the new test** + +Run: `uv run pytest documents/tests/test_workflows.py::TestWorkflowRenameSequencing -v` +Expected: PASS + +- [ ] **Step 6: Run the full workflow + file-handling suites** + +Run: `uv run pytest documents/tests/test_workflows.py documents/tests/test_file_handling.py -q` +Expected: PASS — no pre-existing test changed. + +- [ ] **Step 7: Lint and commit** + +```bash +ruff check --fix src/documents/signals/handlers.py src/documents/tests/test_workflows.py +ruff format src/documents/signals/handlers.py src/documents/tests/test_workflows.py +git add src/documents/signals/handlers.py src/documents/tests/test_workflows.py +git commit -m "Defer workflow file rename via ContextVar guard" +``` + +--- + +## Task 7: Regression test for the metadata-vs-rename race (#12386) + +**Files:** + +- Test: `src/documents/tests/test_workflows.py` + +- [ ] **Step 1: Write the regression test** + +Append to `src/documents/tests/test_workflows.py` a test that reproduces the +original bug: a workflow that assigns **both** tags (firing `m2m_changed`) and a +correspondent, where the storage path template depends on the correspondent. +Before the fix the `m2m_changed`-triggered rename ran with a stale (empty) +correspondent and moved the file to the wrong path. + +```python +class TestWorkflowMetadataRenameRace(TestCase): + @pytest.mark.django_db + def test_tag_and_correspondent_assignment_lands_file_at_final_path( + self, + ) -> None: + """ + Regression for #12386: assigning tags (m2m_changed) plus a correspondent + used by the storage-path template must not move the file using stale + metadata. After the run the DB filename and the on-disk file agree. + """ + # ... reuse this file's helpers to: + # - create a StoragePath whose path template references {correspondent} + # - create a Document assigned to that storage path with a real file + # on disk at document.source_path + # - create a DOCUMENT_UPDATED Workflow with one ASSIGNMENT action that + # assigns BOTH a tag and a correspondent + # - call run_workflows(DOCUMENT_UPDATED, document=document) + # Assert: + document.refresh_from_db() + assert document.correspondent is not None + assert Path(document.source_path).is_file() + # the path reflects the assigned correspondent, not an empty value + assert document.correspondent.name in str(document.filename) +``` + +- [ ] **Step 2: Run the test** + +Run: `uv run pytest documents/tests/test_workflows.py::TestWorkflowMetadataRenameRace -v` +Expected: PASS (the guard from Task 6 fixes the race; this test locks it in) + +- [ ] **Step 3: Lint and commit** + +```bash +ruff check --fix src/documents/tests/test_workflows.py +ruff format src/documents/tests/test_workflows.py +git add src/documents/tests/test_workflows.py +git commit -m "Add regression test for workflow metadata vs rename race" +``` + +--- + +## Task 8: Full verification + +**Files:** none (verification only) + +- [ ] **Step 1: Run the full backend suite** + +Run: `uv run pytest -m "not live" -q` +Expected: PASS. Investigate any failure before claiming completion. + +- [ ] **Step 2: Lint the whole change** + +Run: `ruff check src/documents/` then `ruff format --check src/documents/` +Expected: clean. + +- [ ] **Step 3: Type-check** + +Run: `uv run mypy documents/workflows/context.py documents/signals/handlers.py documents/workflows/actions.py` +Expected: no **new** errors beyond `.mypy-baseline.txt`. Per CLAUDE.md, do not +attempt to clear the baseline — only ensure you introduced nothing new. + +- [ ] **Step 4: Confirm no stray references** + +Run: `git grep -n "build_workflow_action_context\|caller_supplied_original_file\|use_overrides" src/` +Expected: no matches (all three are fully removed). + +- [ ] **Step 5: Final commit (if Step 2-4 required fixes)** + +```bash +git add -A +git commit -m "Clean up after workflow runner refactor" +``` + +--- + +## Self-review notes (already applied) + +- **Spec coverage:** §1 Protocol → Tasks 2-3; §2 branch-free `run_workflows` → + Task 5; §3 `source_file` / staged-path relocation → Tasks 3, 5; §4 guard + + sequencing → Tasks 4, 6; deferred password-removal hook left as-is (Task 5 + note); testing → Tasks 1-3, 6-8. +- **`update_fields` exclusion kept** (`_WORKFLOW_SAVE_FIELDS` in Task 3) — per + the design decision that it guards a cross-process hazard the guard does not + cover. No task removes it. +- **Type consistency:** `WorkflowRunContext` surface (`target`, `source_file`, + `staged_file`, `refresh`, `build_placeholder_context`, `apply_assignment`, + `apply_removal`, `log_action`, `persist`, `record_run`, + `finalize_file_location`) is identical across the Protocol, both + implementations, and all call sites in the rewritten `run_workflows`. diff --git a/docs/superpowers/specs/2026-04-23-pluggable-document-storage-design.md b/docs/superpowers/specs/2026-04-23-pluggable-document-storage-design.md new file mode 100644 index 000000000..fe4660025 --- /dev/null +++ b/docs/superpowers/specs/2026-04-23-pluggable-document-storage-design.md @@ -0,0 +1,167 @@ +# Pluggable Document Storage Design + +**Date:** 2026-04-23 +**Status:** Approved + +## Overview + +Replace the hardcoded local filesystem storage in paperless-ngx with a pluggable `DocumentStorage` Protocol. Ship two built-in backends — `LocalFilesystemBackend` (default, zero config change) and `S3CompatibleBackend` (supports AWS S3 and any S3-compatible endpoint). Third parties can implement the Protocol to provide their own backends. + +## Scope + +- **In scope:** original documents, PDF/A archives +- **Out of scope:** thumbnails (stay on local filesystem, regenerable), consumption directory (stays local) +- **Frontend impact:** none — S3 is invisible; Django proxies all file access + +## Protocol + +Defined in `src/paperless/storage.py`: + +```python +class DocumentStorage(Protocol): + def __enter__(self) -> Self: ... + def __exit__(self, exc_type, exc_val, exc_tb) -> None: ... + def open(self, name: str) -> IO[bytes]: ... + def save(self, name: str, content: IO[bytes]) -> str: ... # returns actual name used + def delete(self, name: str) -> None: ... + def exists(self, name: str) -> bool: ... + def move(self, old_name: str, new_name: str) -> None: ... + def list_files(self, prefix: str = "") -> Iterable[str]: ... + def size(self, name: str) -> int: ... +``` + +`name` is always the relative key as stored in the DB (e.g. `2024/my-invoice.pdf`). All operations including `open()` must be called within a `with storage:` block — the context manager handles connection lifecycle and backend-specific cleanup. + +## Storage Instances + +Two module-level singletons in `src/paperless/storage.py`, each an instance of the configured backend class: + +```python +original_storage: DocumentStorage = _build("originals") +archive_storage: DocumentStorage = _build("archive") +``` + +`_build(prefix)` reads `PAPERLESS_DOCUMENT_STORAGE_BACKEND` and `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` from settings, instantiates the backend class with the configured options plus the paperless-controlled prefix. The prefix distinguishes originals from archives within the same bucket or directory root — it is not stored in the DB key. + +## Configuration + +Two new settings, using the existing key-value dict mechanism: + +| Setting | Default | Description | +| ------------------------------------ | ------------------------------------------ | ------------------------------------------------------------ | +| `PAPERLESS_DOCUMENT_STORAGE_BACKEND` | `paperless.storage.LocalFilesystemBackend` | Dotted Python path to any class satisfying `DocumentStorage` | +| `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` | `{}` | Dict of kwargs passed to the backend constructor | + +**Example — S3-compatible:** + +``` +PAPERLESS_DOCUMENT_STORAGE_BACKEND=paperless.storage.S3CompatibleBackend +PAPERLESS_DOCUMENT_STORAGE_OPTIONS={"bucket_name": "my-docs", "endpoint_url": "https://s3.wasabi.com", "region_name": "us-east-1", "access_key": "...", "secret_key": "..."} +``` + +Existing users set nothing — `LocalFilesystemBackend` with no options is the default. + +## Built-in Backends + +### `LocalFilesystemBackend` + +- `__enter__`: initialises tracking of directories affected during the context +- `__exit__`: calls `delete_empty_directories()` for all tracked dirs; no-op on exception +- `open/save/delete/exists/move`: direct `Path` + `shutil` operations rooted at `settings.ORIGINALS_DIR` / `settings.ARCHIVE_DIR` (via the prefix passed by `_build`) +- `move()`: `shutil.move()` — atomic on same filesystem +- `list_files()`: `Path.rglob("*")` + +### `S3CompatibleBackend` + +- Wraps `django-storages` S3 backend (`storages.backends.s3boto3.S3Boto3Storage`) for `open`, `save`, `delete`, `exists` +- `__enter__`: initialises boto3 client/session +- `__exit__`: no cleanup required (no empty directory concept on S3) +- `move()`: boto3 `copy_object` (server-side, no data transfer) + `delete_object` +- `open()`: returns streaming S3 response body; caller's `with` block closes the HTTP connection +- `list_files()`: S3 `list_objects_v2` with prefix +- Works with any S3-compatible endpoint via `endpoint_url` option + +## Data Migration + +One Django migration strips the stored prefix from existing rows: + +- `document.filename`: `documents/originals/2024/invoice.pdf` → `2024/invoice.pdf` +- `document.archive_filename`: `documents/archive/2024/invoice.pdf` → `2024/invoice.pdf` + +The prefix is now owned by the storage instance, not the DB key. + +## `migrate_storage` Management Command + +``` +manage.py migrate_storage [--dry-run] [--no-delete] + [--source-backend=] [--source-options=] +``` + +Transfers all document files from one storage backend to another. The user updates `PAPERLESS_DOCUMENT_STORAGE_BACKEND` in their config first, then runs this command to move existing files. + +The destination is always the currently configured backend (from settings). The source is specified via `--source-backend` / `--source-options`, defaulting to `LocalFilesystemBackend` with no options if omitted (covering the most common migration path: local → S3). + +**Flow:** + +1. Instantiate source backend (from CLI args or default) and destination backend (from current settings) +2. Iterate `Document.objects.only("filename", "archive_filename")` +3. For each file (original + archive): + - Skip with warning if missing from source + - Skip silently if already present on destination (idempotent — safe to re-run) + - Copy: `destination.save(name, source.open(name))` + - Unless `--no-delete`: `source.delete(name)` +4. Report counts: moved / skipped / failed +5. `--dry-run`: prints actions without touching files + +Individual failures are logged and counted but do not abort the run. Bidirectional: local → S3, S3 → local, S3 → S3. + +## Files to Create + +| File | Purpose | +| ------------------------------------------------------- | ------------------------------------------------------------------------------ | +| `src/paperless/storage.py` | Protocol, built-in backends, `original_storage` / `archive_storage` singletons | +| `src/documents/management/commands/migrate_storage.py` | Migration command | +| `src/documents/migrations/XXXX_strip_storage_prefix.py` | Strip prefix from existing filename rows | + +## Files to Modify + +| File | Change | +| -------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | +| `src/paperless/settings/__init__.py` | Add `PAPERLESS_DOCUMENT_STORAGE_BACKEND`, `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` | +| `src/documents/models.py` | `source_file`, `archive_file` use storage instances; `source_path` returns temp file for subprocess callers | +| `src/documents/consumer.py` | `_write()` → `storage.save()`; remove `mkdir` calls | +| `src/documents/signals/handlers.py` | `shutil.move()` → `storage.move()`; remove `create_source_path_directory` / `delete_empty_directories` callsites | +| `src/documents/tasks.py` | Same as signals | +| `src/documents/file_handling.py` | `exists()` checks and directory references use storage API | +| `src/documents/views/` | File-serving views use `storage.open()` within context; wrap for `FileResponse` lifecycle | +| `src/documents/management/commands/document_importer.py` | Replace `Path.glob()` and direct copies with storage API | +| `src/documents/management/commands/document_exporter.py` | Replace direct file copies and `FileLock`-guarded writes with storage API | + +## Locking & Concurrency + +The codebase serialises all document file write/move operations with `FileLock(settings.MEDIA_LOCK)`, where `MEDIA_LOCK = MEDIA_ROOT / "media.lock"`. This is used in `consumer.py`, `signals/handlers.py`, `tasks.py`, `mail.py`, `document_importer.py`, and `document_exporter.py`. + +**The lock file stays on the local filesystem regardless of backend.** `MEDIA_LOCK` lives under `MEDIA_ROOT`, which is the local path even when documents are stored on S3. This means: + +- **Single-host deployments** (the common case — Docker Compose, single server): the `FileLock` continues to work correctly. All Celery workers and the Django process share the same lock file. No change required. +- **Multi-host deployments**: the `FileLock` is already broken for these today — each host has its own lock file. This is a pre-existing limitation and is out of scope for this feature. + +**Callsite structure** — the storage context manager nests inside the existing lock, preserving current behaviour: + +```python +with FileLock(settings.MEDIA_LOCK): + with original_storage as storage: + storage.move(old_name, new_name) +``` + +**`generate_unique_filename` race:** this function checks `storage.exists()` then saves, which is not atomic on S3. The `FileLock` already serialises this on a single host. For multi-host this is a pre-existing gap — not introduced by this feature. + +**Future path for multi-host:** replace `FileLock` with a database-level advisory lock or Redis lock. Out of scope here. + +## Key Invariants + +- The context manager is required for all storage operations, including reads +- `name` is always the relative key — never an absolute path or URL +- The backend prefix (`originals` / `archive`) is paperless-controlled and never stored in the DB +- `LocalFilesystemBackend` is the default — existing deployments require no config change +- The migrate command is idempotent and can be re-run after partial failure diff --git a/docs/superpowers/specs/2026-05-19-workflow-runner-refactor-design.md b/docs/superpowers/specs/2026-05-19-workflow-runner-refactor-design.md new file mode 100644 index 000000000..142cc17ce --- /dev/null +++ b/docs/superpowers/specs/2026-05-19-workflow-runner-refactor-design.md @@ -0,0 +1,253 @@ +# Workflow Runner Refactor — Design + +**Date:** 2026-05-19 +**Branch base:** `dev` +**Status:** Approved design, pending implementation plan + +## Problem + +Workflow execution and the Django signal layer have repeatedly produced fragile, +hard-to-fix bugs (see the revert/refix history around password removal: #12803, +#12814, #12716, and the filename race #12386). Three structural causes: + +1. **`run_workflows` is dual-mode.** A single function handles both consumption + (mutating a `DocumentMetadataOverrides`) and post-save (mutating a real + `Document`), branching on a `use_overrides` flag. The branching is + concentrated in two places — the action dispatch inside `run_workflows` + (`handlers.py:931-1001`) and `build_workflow_action_context` + (`actions.py:33-83`), each with two full code paths. The `apply_*` helpers in + `workflows/mutations.py` are _already_ split by target type + (`apply_assignment_to_document` vs `apply_assignment_to_overrides`, etc.); the + refactor unifies their callers, not the helpers themselves. + +2. **File location is an implicit, timing-dependent side channel.** The + `DOCUMENT_ADDED` workflow fires from `run_workflows_added`, which runs while + the consumer is still inside its transaction — _before_ the consumed file is + copied to `document.source_path` (`document_consumption_finished` is sent at + `consumer.py:658`; the file copy happens after, at `consumer.py:670+`). The + staged path is therefore threaded through as `original_file` / + `caller_supplied_original_file` parameters. Actions that read the file + (password removal, email attachments) depend on this plumbing being correct. + +3. **The workflow run races the filename rename.** `update_filename_and_move_files` + is a raw `post_save` receiver on `Document`. When a workflow persists its + changes via `document.save(update_fields=[...])`, that save fires `post_save` + and runs the rename _while the workflow is still executing_. Under concurrent + Celery/UI updates the interleaved `refresh_from_db()` calls corrupt state. The + comment at `handlers.py:980-984` — deliberately excluding `filename` / + `archive_filename` from the workflow save — is a load-bearing workaround for + exactly this. + +Note: `run_workflows_added` / `run_workflows_updated` are connected to the +_custom_ signals `document_consumption_finished` / `document_updated`, fired +explicitly by paperless code in a handful of known sites — not to raw Django +`post_save`. Only `update_filename_and_move_files` is a raw `post_save` receiver. +This refactor does not change where workflows are triggered from. + +## Scope + +In scope: + +- Refactor `run_workflows` and its action helpers around an execution-context + abstraction. +- Delete the `original_file` side-channel plumbing. +- Make the workflow-execution → persist → rename sequence explicit and + deterministic. + +Out of scope: + +- Changing where/when workflows are triggered (custom signal call sites unchanged). +- Reworking the matching logic (`matching.document_matches_workflow`). +- Any change to workflow models, serializers, or the REST API. + +## Design + +### 1. `WorkflowRunContext` protocol + +New module `documents/workflows/context.py` defining a `typing.Protocol`: + +``` +WorkflowRunContext (Protocol) + source_file: Path # where the file actually is, now + build_placeholder_context() -> dict + apply_assignment(action) -> None + apply_removal(action) -> None + persist() -> None # commit accumulated mutations + record_run(workflow, trigger_type) -> None +``` + +Two concrete implementations (which need not import the Protocol — structural +typing): + +- **`ConsumptionContext`** — wraps `ConsumableDocument` + `DocumentMetadataOverrides`. + `source_file` returns the staged file path. Mutations land on the overrides. + `persist()` is a no-op (the overrides object is returned to the caller). +- **`PersistedContext`** — wraps a real `Document`. Mutations land on the + in-memory `Document`. `persist()` performs a single save. + +**Context selection** — `run_workflows` picks the context from the call shape: + +- CONSUMPTION trigger (`ConsumableDocument` + non-`None` `overrides`) → + `ConsumptionContext`. +- DOCUMENT_ADDED / DOCUMENT_UPDATED / SCHEDULED (a real `Document`, + `overrides=None`) → `PersistedContext`. + +**`source_file` for `PersistedContext`.** It cannot unconditionally return +`document.source_path`: for the `DOCUMENT_ADDED` trigger the file has not yet +been moved there. The staged path is therefore passed into the `PersistedContext` +_at construction time_ by `run_workflows_added` (which still receives it from the +`document_consumption_finished` signal). `source_file` returns that staged path +when supplied, otherwise `document.source_path`. This relocates the staged-path +information from a chain of function parameters into a single piece of +construction state — the `original_file` / `caller_supplied_original_file` +_parameter plumbing_ through `run_workflows` and the action helpers is what gets +deleted, not the staged path itself. + +`WorkflowRunContext` is a plain `Protocol`, not `@runtime_checkable` — the runner +constructs the context itself, so no `isinstance` check is needed. Genuinely +shared logic goes into module-level helper functions, not a base class. + +### 2. `run_workflows` becomes branch-free + +`run_workflows` keeps its current public signature so all call sites are +unchanged. Its body: + +1. Construct the appropriate context once, from the argument types. +2. Run a single flat match-and-dispatch loop over matching workflows/actions, + delegating every action to context methods. + +No `use_overrides` flag anywhere. The branching currently scattered across +`run_workflows`, `build_workflow_action_context`, and the `apply_*` helpers +collapses into the two context classes. + +### 3. File staging via `source_file` + +`source_file` is a property of the context, fixed at construction. The +`original_file` and `caller_supplied_original_file` parameters threaded through +`run_workflows` and the `execute_*` helpers are deleted; each context resolves +the path itself (see "Context selection" above). + +**Deferred password removal.** `execute_password_removal_action`, when given a +`ConsumableDocument`, currently installs a one-shot handler on +`document_consumption_finished` that picks up `original_file` from `kwargs` +later (`actions.py:295-308`). This deferred hook lives outside the context +abstraction. The refactor must explicitly decide its fate: either keep it as-is +(the context still constructs correctly around it) or fold the deferral into +`ConsumptionContext`. This is called out as an open implementation decision, not +silently absorbed. + +### 4. Explicit workflow → persist → rename sequencing + +What must be deferred is the **file rename**, not the DB save. `run_workflows` +keeps its per-workflow `document.refresh_from_db()` at the top of each iteration +— that is deliberate concurrency protection against `bulk_update_documents` +running simultaneously. Deferring all saves to a single final `persist()` would +let one workflow's refresh wipe a prior workflow's in-memory changes. So: + +1. `run_workflows` refreshes and applies actions per workflow, and + `PersistedContext.persist()` saves after each matching workflow, as today. +2. The save deliberately **continues to exclude** `filename` / + `archive_filename` from `update_fields`. This is not duct tape: it guards a + _cross-process_ hazard — another Celery task may have moved the file and + written `filename` to the DB, and a stale in-memory `filename` in our save + would revert it. The `ContextVar` guard (below) only addresses _intra-process_ + ordering, so this exclusion stays. +3. The rename is suppressed for the whole run and invoked **exactly once, + afterward**, against final committed state. + +The actual race being fixed: `apply_assignment_to_document` assigns tags via +`document.add_nested_tags(...)`, which fires `m2m_changed` on +`Document.tags.through` _before_ the workflow's `document.save()`. The +`m2m_changed` receiver `update_filename_and_move_files` then calls +`refresh_from_db()`, wiping the workflow's in-memory correspondent/type, and +moves the file to a path computed from stale metadata. The guard prevents this. + +To stop the rename from firing mid-workflow, a **`ContextVar` guard** is +introduced (e.g. `documents/workflows/context.py` module-level +`_workflow_in_progress: ContextVar[bool]`). `update_filename_and_move_files` +checks the guard and early-returns when set. `run_workflows` wraps its **entire** +persisted-path execution — not just the `persist()` call — in a context manager +that sets the guard via `set()`/`reset(token)`. Token-based reset is +reentrancy-safe for nested saves or nested workflow runs. + +The guard must span the whole execution, not just `persist()`, because +`update_filename_and_move_files` is _also_ registered to `m2m_changed` on +`Document.tags.through` and to `post_save` on `CustomFieldInstance` +(`handlers.py:431-432`). A workflow action that assigns tags or custom fields +would otherwise trigger a rename mid-workflow through those signals. + +After execution completes, `run_workflows` calls `persist()` once and then +explicitly invokes the move logic once. The `ContextVar` is set/reset in the +same thread that runs these receivers synchronously, so they always observe the +value. (Celery `prefork` workers run each task in its own process; greenlet +pools are also `contextvars`-aware — non-issues, noted for completeness.) + +The move body of `update_filename_and_move_files` is extracted into a plain +callable that the runner invokes directly. The function is already invoked +directly (as a plain call, bypassing the decorator) for version documents at +`handlers.py:664-667`, so this extraction has precedent. The thin `post_save` +receiver remains as a guard-checking wrapper. + +The two `post_save` receivers on `Document` are `update_filename_and_move_files` +(`handlers.py:433`) and `update_llm_suggestions_cache` (`handlers.py:740`). The +`ContextVar` guard suppresses **only** the former — `update_llm_suggestions_cache` +keeps running normally, as do `document_consumption_finished` receivers such as +`add_or_update_document_in_llm_index` (which is _not_ a `post_save` receiver). +This is why the guard is preferred over persisting with `.update()`, which would +silently suppress _all_ `post_save` receivers including +`update_llm_suggestions_cache`. + +`WorkflowRun.objects.create(...)` is created per matching workflow as today +(`handlers.py:998-1002`); it is a separate model and is not deferred. + +The comment at `handlers.py:980-984` is updated to describe the new flow +(per-workflow save under the guard; single explicit rename afterward) but the +`filename` / `archive_filename` exclusion it documents is kept — see point 2 +above. + +## Testing + +- **Runner loop** — exercised against a fake context implementing the + `WorkflowRunContext` surface that records `apply_assignment` / `apply_removal` + / `persist` calls. No DB document, no staged files, no signals. +- **Concrete contexts** — `ConsumptionContext` and `PersistedContext` each get + focused tests: given an action, assert the mutation lands on the overrides vs. + the document, and that `source_file` resolves to the staged vs. final path. +- **ContextVar guard** — assert `update_filename_and_move_files` early-returns + while the guard is set, and that the rename runs exactly once after + `persist()`. +- **Regression: the racy case** — a workflow that reassigns metadata while the + document is subject to a filename template; assert final DB filename and file + location are consistent (the #12386 scenario). +- **Regression safety net** — the existing `test_workflows.py` suite (~100 + tests; ~19 `document_consumption_finished.send` sites plus many direct + `run_workflows(...)` calls for the `DOCUMENT_UPDATED` path) must stay green + **unchanged**. A test that needs editing signals a behavior change to flag + explicitly, not a silent refactor outcome. + +Per project conventions: tests grouped under classes, fixtures and test +signatures fully type-annotated. + +## Implementation sequence + +Each step is independently reviewable and keeps the test suite green: + +1. Introduce the `Protocol` + the two contexts; `run_workflows` delegates to + them. Pure refactor, no behavior change. +2. Move the staged path into `PersistedContext` construction (passed by + `run_workflows_added`); delete the `original_file` / + `caller_supplied_original_file` parameter plumbing through `run_workflows` + and the `execute_*` helpers. +3. Extract the move body from `update_filename_and_move_files` into a callable; + add the `ContextVar` guard; `run_workflows` invokes the move once after the + run completes. The `filename` / `archive_filename` exclusion in the + per-workflow save is kept; only the comment at `handlers.py:980-984` is + updated to describe the new flow. + +## Pain points addressed + +- **Dual-mode** → eliminated by the `Protocol` + two contexts; no `use_overrides`. +- **File staging** → `source_file` is a context property; side-channel args deleted. +- **Rename race** → per-workflow save under a `ContextVar` guard that suppresses + the mid-workflow rename; a single explicit rename runs once at the end against + final state. diff --git a/docs/superpowers/specs/2026-05-20-ai-taxonomy-hints-design.md b/docs/superpowers/specs/2026-05-20-ai-taxonomy-hints-design.md new file mode 100644 index 000000000..88dd7124d --- /dev/null +++ b/docs/superpowers/specs/2026-05-20-ai-taxonomy-hints-design.md @@ -0,0 +1,215 @@ +# AI Suggestions: Inject existing taxonomy as candidates + +**Status:** Design (v2 — frequency-only) +**Date:** 2026-05-20 +**Related:** [Discussion #12787](https://github.com/paperless-ngx/paperless-ngx/discussions/12787) +**Branch target:** `dev` + +## Problem + +AI Suggestions currently asks the LLM for free-form tag/document-type/correspondent/storage-path names, then reconciles via `difflib` fuzzy matching (cutoff 0.8) in `paperless_ai/matching.py`. This works for typos but not for semantic equivalents: + +- `blood test` does not fuzzy-match `Bloodwork` +- `IRS` does not fuzzy-match `Taxes` +- `doctor visit` does not fuzzy-match `Medical` + +Result: the LLM invents new metadata names that duplicate existing taxonomy entries. + +## Goal + +Tell the LLM what already exists, so it can prefer existing names verbatim. Fuzzy matching becomes the fallback for typos and for legitimately novel suggestions, not the primary semantic-equivalence mechanism. + +Non-goals: changing the LLM client, embedding model selection, or RAG retrieval. Replacing fuzzy matching entirely. Custom-field option values. Embedding-based shortlisting (deferred to a v2 if frequency proves insufficient). + +## Approach + +For each of Tags, DocumentTypes, Correspondents, StoragePaths: + +1. Take the user-visible queryset (owner-aware, matching `matching.py`). +2. Annotate by document-usage count and take the top `X` names by frequency. `X` is configurable per category cap (single setting, applied to all four categories). +3. Inject those names into the LLM prompt as "Available " blocks, with the instruction to prefer them verbatim. +4. When the LLM responds, tell `matching.py` which names were hinted so an exact normalized match short-circuits past fuzzy. Names not in the hint list keep today's fuzzy fallback. + +No FAISS index, no signals, no Celery tasks, no locks. Pure DB-side queries on each suggestion request. + +## Components + +### `paperless_ai/taxonomy.py` (new) + +```python +class TaxonomyHints(TypedDict): + tags: list[str] + document_types: list[str] + correspondents: list[str] + storage_paths: list[str] + +def build_taxonomy_hints(document: Document, user: User | None) -> TaxonomyHints: ... +def format_hints_for_prompt(hints: TaxonomyHints) -> str: ... +``` + +Internals: + +- `_visible_queryset(model_cls, perm: str, user)` — wraps `get_objects_for_user_owner_aware` exactly as `matching.py` does. If `user` is `None`, returns the unfiltered manager queryset (parity with how `matching.py` behaves today). +- `_shortlist_by_frequency(queryset, max_per_category)` — DB-side: + ```python + return list( + queryset + .annotate(usage=Count("documents")) + .order_by("-usage", "name") + .values_list("name", flat=True)[:max_per_category] + ) + ``` + Confirmed reverse relation name is `documents` for all four models (`documents/models.py:164,173,184,211`). Secondary order by `name` keeps results stable when usage ties (common with 0-usage tails). `StoragePath` uses the human `name` field, not the `path` template. + +`format_hints_for_prompt` emits one `Available :` block per non-empty category. Empty categories produce no block (avoid prompting the LLM with "Available tags: (none)"). A single instruction line follows: + +``` +Prefer existing names from these lists verbatim. Only propose a new value +if none of the existing names fits. +``` + +### `paperless_ai/ai_classifier.py` (modify) + +Required signature change (the v1 spec missed this — flagged by code review): + +- `build_prompt_without_rag(document, user: User | None = None)` — currently takes only `document`; add `user` with `None` default to keep call sites simple. +- `build_prompt_with_rag(document, user: User | None = None)` — already takes `user`; its existing call to `build_prompt_without_rag(document)` at `ai_classifier.py:39` is updated to pass `user` through. + +Both prompt builders accept an optional `hints: TaxonomyHints | None = None` parameter. When non-`None`, `format_hints_for_prompt(hints)` is spliced in before the "Analyze the following document" instruction. When `None` (default), the prompt is built as today. + +`get_ai_document_classification(document, user, hints: TaxonomyHints | None = None)` accepts the same optional `hints` and forwards it to the prompt builder. Return shape is **unchanged** (`dict`). The view layer owns hint construction so the same `TaxonomyHints` object can be used both for the prompt and for `hinted_names` in matching — no need to thread it back out of the classifier. Callers in tests pass `hints=None` (or omit) to preserve existing behavior. + +### `paperless_ai/matching.py` (modify) + +- `_match_names_to_queryset(names, queryset, attr, hinted_names: set[str] | None = None)`: + - Normalization unchanged. + - Exact-match-on-full-queryset behavior unchanged (always tried first). + - When `hinted_names` is provided and the LLM-returned name (normalized) matches a hinted name (normalized) → treated as exact-only; fuzzy is skipped for that name. + - When `hinted_names` is `None` or the name isn't in it → existing 0.8 fuzzy fallback runs. +- `match_tags_by_name(names, user, hinted_names=None)` etc. — optional kwarg, backward compatible. + +### `documents/views.py` (modify) + +The suggestion endpoint (around line 1482) is the single production caller of `get_ai_document_classification` and the call site for `match_*_by_name`. Update it to: + +1. Build hints once: `hints = build_taxonomy_hints(document, request.user)` (when `AIConfig().taxonomy_hints_enabled` and `max_per_category > 0`; otherwise `hints = None`). +2. Pass `hints` into the classifier: `parsed = get_ai_document_classification(document, request.user, hints=hints)`. +3. Pass `hinted_names=set(hints["tags"])` (etc., one per category, or `None` when `hints` is `None`) into each `match_*_by_name` call. + +**Cache interaction:** the AI suggestion path is wrapped by `cached_llm_suggestions` / `refresh_suggestions_cache` (views.py:1477). A cached response bypasses the LLM call entirely — so changes to hints config don't take effect until the cache entry is invalidated. Acceptable for v1 (cache is short-lived). If experience shows users change the toggle and expect immediate effect, follow up by including a hash of the hint-relevant config (`taxonomy_hints_enabled`, `_max`) in the cache key. + +### `paperless/config.py` (`AIConfig`) + DB model + settings + +`AIConfig.__post_init__` reads values from the `ApplicationConfiguration` DB row **and** falls back to `settings.*` constants (pattern at `paperless/config.py:207` for `ai_enabled`). Both layers are needed. + +Two new fields, threaded through three places: + +1. **`paperless/settings/*.py`** — add module-level constants read from env: + - `AI_TAXONOMY_HINTS: bool = __get_boolean("PAPERLESS_AI_TAXONOMY_HINTS", "yes")` (default on) + - `AI_TAXONOMY_HINTS_MAX: int = int(os.getenv("PAPERLESS_AI_TAXONOMY_HINTS_MAX", "30"))` + +2. **`paperless/models.py` (`ApplicationConfiguration`)** — add two nullable columns: + - `taxonomy_hints_enabled = models.BooleanField(null=True)` + - `taxonomy_hints_max_per_category = models.PositiveSmallIntegerField(null=True)` (range 0–32767; `PositiveSmallIntegerField` is sufficient) + - One Django migration. + +3. **`paperless/config.py` (`AIConfig`)** — read with **explicit None check, not `or`** (because `0` and `False` are legitimate user values that would otherwise silently fall back to the settings default): + ```python + self.taxonomy_hints_enabled = ( + app_config.taxonomy_hints_enabled + if app_config.taxonomy_hints_enabled is not None + else settings.AI_TAXONOMY_HINTS + ) + self.taxonomy_hints_max_per_category = ( + app_config.taxonomy_hints_max_per_category + if app_config.taxonomy_hints_max_per_category is not None + else settings.AI_TAXONOMY_HINTS_MAX + ) + ``` + (Other fields in this file use `or`; we deliberately diverge here to support `0` and `False`. A short comment in code records why.) + +**Frontend** (`src-ui/src/app/data/paperless-config.ts`): add two entries to the `PaperlessConfigOptions` declarative list (one `Boolean`, one `Number`, `category: ConfigCategory.AI`) plus two fields on the `PaperlessConfig` interface. No component changes; the form is generated from this list. + +`paperless.conf.example` and the configuration docs page get entries. + +## Data flow + +Suggestion request: + +1. View checks `AIConfig().taxonomy_hints_enabled`; if enabled, calls `hints = build_taxonomy_hints(document, user)`; otherwise `hints = None`. +2. View calls `parsed = get_ai_document_classification(document, user, hints=hints)`. +3. Classifier splices `format_hints_for_prompt(hints)` into the prompt (when non-`None`), calls LLM, returns parsed dict. +4. View calls `match_*_by_name(names, user, hinted_names=set(hints[]) if hints else None)` per category. Exact-on-hint short-circuit; fuzzy fallback unchanged for misses. + +No background processing. No persisted state. Each suggestion request runs four lightweight `Count("documents")` queries (could be combined into a single query per model via `.annotate().order_by().values_list()`, no joins beyond the existing reverse relation). + +## Error handling + +- **Empty visible queryset for a category:** omit that category's block from the prompt. +- **`taxonomy_hints_enabled = False` or `max_per_category = 0`:** `build_taxonomy_hints` returns an empty `TaxonomyHints`; prompt is identical to today; matching is called without `hinted_names`; behavior identical to today. +- **LLM returns a name not in hints but exactly matching an existing visible name:** still treated as exact match. `_match_names_to_queryset` always tries exact-on-full-queryset before fuzzy; `hinted_names` only governs whether fuzzy is consulted for that specific name. +- **DB query failure during shortlist build:** propagate. Suggestion failures already surface as 5xx; adding silent fallbacks here would mask real problems. + +## Testing + +All new and modified tests use pytest style — functions/classes, no `unittest.TestCase` subclasses; `pytest-django` with per-class `@pytest.mark.django_db`; `pytest-mock`'s `mocker` fixture for patching; every fixture parameter, fixture return, and test signature type-annotated. Tests grouped under classes (`class TestBuildTaxonomyHints:`), not flat free functions. Shared fixtures live in `paperless_ai/tests/conftest.py`. Format with `ruff` directly (not `uv run ruff`). + +### `paperless_ai/tests/test_taxonomy.py` (new) + +- `class TestBuildTaxonomyHints:` + - Returns a `TaxonomyHints` with all four keys. + - Top-K limit respected (`max_per_category` honored from `AIConfig`). + - Frequency ordering: tag used on 5 docs ranks above tag used on 2 docs. + - Tie-break by name (alphabetical) for stable output. + - Owner-aware: user lacking `view_tag` perm gets `tags=[]`; `view_documenttype` likewise per category. + - Empty queryset for a category → empty list; `format_hints_for_prompt` omits the block. + - `taxonomy_hints_enabled=False` returns zero-filled `TaxonomyHints` and runs no taxonomy DB queries (`django_assert_num_queries`). + - `max_per_category=0` same behavior as disabled. + - `StoragePath` shortlist uses the `name` field, not `path` template (asserted on returned values). + +- `class TestFormatHintsForPrompt:` + - All four blocks present when all categories non-empty. + - Empty category produces no block. + - All-empty hints produces empty string (no stray instruction line). + - Instruction line appears exactly once when at least one block is rendered. + +### `paperless_ai/tests/test_ai_classifier.py` (extend) + +- `class TestBuildPrompt:` + - `build_prompt_without_rag(doc, user)` now accepts `user`; produces a prompt containing the hints block when hints are non-empty. + - `build_prompt_with_rag(doc, user)` includes both the RAG context block (unchanged) and the hints block. + - `taxonomy_hints_enabled=False`: prompt matches today's baseline (string equality against a fixture). + - `get_ai_document_classification(doc, user, hints=...)` forwards hints into the prompt; return shape unchanged (still `dict`). + +### `paperless_ai/tests/test_matching.py` (extend) + +- `class TestHintedMatching:` + - LLM returns `"Bloodwork"` verbatim, `hinted_names={"Bloodwork", ...}` → exact match returned; `difflib.get_close_matches` not called (`mocker.spy` on `difflib.get_close_matches`). + - LLM returns `"blood test"` not in `hinted_names`, no existing exact → fuzzy fallback runs; behavior unchanged from today (regression guard). + - LLM returns `"Bloodwork "` (whitespace) with hinted_names containing `"Bloodwork"` → normalized exact match wins, fuzzy not consulted. + - Backward compatibility: `match_tags_by_name(names, user)` without the kwarg behaves identically to today (snapshot of an existing test, parameterized). + +Markers: no `live` marker needed. + +## Migration / rollout + +- One Django migration adding two columns to `ApplicationConfiguration` (`taxonomy_hints_enabled BooleanField`, `taxonomy_hints_max_per_category PositiveSmallIntegerField`). Both nullable with sensible defaults so existing rows aren't broken. +- Feature defaults to on for new and existing installs. Set `PAPERLESS_AI_TAXONOMY_HINTS=false` (or via the Application Configuration UI) to restore today's behavior. +- Frontend admin form updated to expose the two fields under the existing AI section. + +## Open questions deferred to implementation + +- `paperless_ai/tests/conftest.py` already exists — verify fixture-naming conventions match before adding new fixtures. +- Confirm `parse_ai_response` doesn't need to know about hints (it's a pure parser; hints flow alongside, not through it). +- The view layer applying `hinted_names` needs to read the same `AIConfig` instance the classifier used; pass the `TaxonomyHints` through the response tuple (chosen) rather than re-deriving in the view. + +## Interplay with `extract_unmatched_names` + +`extract_unmatched_names` (used downstream of matching) surfaces LLM-returned names that didn't match any existing taxonomy entry — the UI uses these to offer "create new tag?" affordances. With hints in place, fewer names will be unmatched, which is the desired outcome. No behavior change is required: a hinted name that the LLM repeats verbatim will exact-match and not appear in the unmatched list; a name the LLM invents anyway (despite the hint instruction) still flows through fuzzy and, if no match, surfaces as "new" exactly as today. Out of scope: filtering unmatched results based on what was in the hint set. + +## Out of scope (potential v2) + +- Embedding-based shortlisting (for users with very large taxonomies where frequency misses the right tag). Would re-introduce a FAISS-shaped subsystem with signals, debounce, and locks. Defer until evidence frequency is insufficient. +- Tag hierarchy awareness — hinting `Medical/Bloodwork` vs `Bloodwork` when tags are nested. +- Custom field option values. +- `StoragePath` template-expression hinting (vs raw `name`). diff --git a/docs/superpowers/specs/telemetry-spec.md b/docs/superpowers/specs/telemetry-spec.md new file mode 100644 index 000000000..131fb07be --- /dev/null +++ b/docs/superpowers/specs/telemetry-spec.md @@ -0,0 +1,308 @@ +# Usage Reporting — Technical Spec + +Voluntary, opt-in usage reporting for paperless-ngx. The goal is to +understand how many instances are running a given release (especially +beta), which platforms and architectures are in use, and what features +are being deployed — without collecting any personal data or document +content. + +--- + +## Guiding principles + +- **Explicitly opt-in.** Nothing is sent automatically. The user runs + the command and confirms before any network call is made. +- **Transparent.** The exact payload is shown before sending. +- **Anonymous.** The UUID is a random identifier with no link to + identity, IP address, or hostname. +- **Graceful.** Network failures produce a friendly message, never a + stack trace. + +--- + +## Client — management command + +### Name + +``` +manage.py send_usage_report +``` + +### Flags + +| Flag | Behaviour | +| ----------- | --------------------------------------------------------- | +| _(none)_ | Show payload, prompt for confirmation, send on `y`/`yes` | +| `--dry-run` | Show payload, skip confirmation and network call entirely | + +### UUID storage + +A random UUID4 is generated on the first run and written to +`PAPERLESS_DATA_DIR/usage_uuid` (plain text, one line). Subsequent +runs reuse the same file. If the file is missing it is regenerated +(counts as a new install — acceptable). + +### Confirmation flow + +``` +The following information will be sent to paperless-ngx to help +improve the project: + + Installation ID : a1b2c3d4-e5f6-7890-abcd-ef1234567890 + Version : 2.15.0 + Channel : beta + Commit : bd86dca57 (built 2026-05-18T12:00:00Z) + Install type : docker + Architecture : x86_64 + Python : 3.12.3 + Database : postgresql + Documents : 1000–9999 + Multi-user : yes + Mail enabled : yes + AI enabled : no + +No personal data, document content, or IP address is stored. +More information: https://docs.paperless-ngx.com/usage-reporting/ + +Send this report? [y/N]: +``` + +Default answer is **N**. Anything other than `y`/`yes` aborts with +no network call and prints `Nothing sent.` + +`--dry-run` skips the prompt entirely and prints `Dry run — nothing sent.` + +### Network error handling + +- Timeout: 10 seconds +- On any failure (timeout, DNS, HTTP error): print a single friendly + line, exit 0 (not an error from the user's perspective) + +``` +Could not reach the reporting endpoint. Nothing was sent. +``` + +### Duplicate submission handling + +The server returns `429` if the UUID was seen within the last 7 days, +with a JSON body: + +```json +{ + "error": "already_submitted", + "last_sent": "2026-05-15T10:00:00Z", + "retry_after_days": 4 +} +``` + +The command prints: + +``` +Already submitted 3 days ago. Nothing sent. +You can send again after 2026-05-19. +``` + +--- + +## Payload schema + +All fields are strings unless noted. Fields marked _omit if absent_ +are left out of the JSON entirely when the value is unavailable — +never sent as `null`. + +| Field | Source | Notes | +| -------------- | --------------------------------------------------------- | ------------------------------------------------ | +| `uuid` | `PAPERLESS_DATA_DIR/usage_uuid` | UUID4, random | +| `version` | `paperless/version.py` — `__full_version_str__` | e.g. `"2.15.0"` | +| `channel` | `paperless/version.py` — `__channel__` | `"stable"` \| `"beta"` \| `"dev"` | +| `commit` | `paperless/build_info.py` — `SOURCE_COMMIT` | Short SHA — _omit if absent_ | +| `build_date` | `paperless/build_info.py` — `BUILD_DATE` | ISO 8601 — _omit if absent_ | +| `install_type` | Detected at runtime (see below) | | +| `arch` | `platform.machine()` | e.g. `"x86_64"`, `"aarch64"` | +| `python` | `platform.python_version()` | e.g. `"3.12.3"` | +| `database` | Last segment of `settings.DATABASES["default"]["ENGINE"]` | e.g. `"postgresql"`, `"sqlite3"` | +| `doc_bucket` | Bucketed document count (see below) | | +| `multi_user` | boolean | `true` if more than one real user account exists | +| `feature_mail` | boolean | `true` if any mail account is configured | +| `feature_ai` | boolean | `true` if AI features are enabled in settings | + +### Document count buckets + +| Range | Value | +| ------------- | --------------- | +| 0–99 | `"0-99"` | +| 100–999 | `"100-999"` | +| 1 000–9 999 | `"1000-9999"` | +| 10 000–49 999 | `"10000-49999"` | +| 50 000+ | `"50000+"` | + +### Install type detection + +Evaluated in order; first match wins. + +| Value | Detection | +| -------------- | ----------------------------------------------------------- | +| `"kubernetes"` | `KUBERNETES_SERVICE_HOST` env var is set | +| `"podman"` | `container` env var equals `"podman"` | +| `"docker"` | `Path("/.dockerenv").exists()` | +| `"nixos"` | `"/nix/store/"` in `sys.executable` | +| `"snap"` | `SNAP` env var is set | +| `"flatpak"` | `FLATPAK_ID` env var is set | +| `"distro"` | `paperless/distro_info.py` exists (set by distro packagers) | +| `"release"` | `paperless/build_info.py` exists (none of the above) | +| `"source"` | Fallback — dev checkout | + +Distro packagers (Debian, NixOS community, Unraid, etc.) can opt in +by shipping a `src/paperless/distro_info.py` containing: + +```python +DISTRO = "debian" # or "rpm", "homebrew", "unraid", etc. +``` + +When present the install type is reported as the `DISTRO` value rather +than `"distro"`. + +### `version.py` additions + +Add `__channel__` alongside the existing version fields: + +```python +__channel__: Final[str] = "beta" # "stable" | "beta" | "dev" +``` + +This is the canonical place to set the channel when preparing a +release. `"dev"` is the default for unreleased branches. + +### `build_info.py` + +Generated at build time, never committed (add to `.gitignore`). + +```python +SOURCE_COMMIT = "bd86dca57" +BUILD_DATE = "2026-05-18T12:00:00Z" +``` + +--- + +## Server — Cloudflare Worker + +Managed in a separate repository under the paperless-ngx GitHub org +(e.g. `paperless-ngx/telemetry`). Deployed via Wrangler. + +### Endpoint + +``` +POST /report +Content-Type: application/json +``` + +Returns `204` on success. No response body. + +### Timestamp + +`received` is always set server-side. Any client-supplied timestamp +field is ignored. + +### Validation + +Reject with `400` if any of the following fail: + +- `uuid` does not match UUID4 format +- `version` does not match `\d+\.\d+\.\d+` +- `channel` is not one of `stable`, `beta`, `dev` +- `install_type` is not in the known set +- `arch` is absent +- Payload is not valid JSON or exceeds 4 KB + +Unknown extra fields are silently ignored (forward compatibility). + +### Deduplication + +Before inserting, query for the most recent submission from this UUID: + +```sql +SELECT received FROM reports +WHERE uuid = ? +ORDER BY received DESC +LIMIT 1 +``` + +If the result is within 7 days of now, return: + +``` +HTTP 429 +{ "error": "already_submitted", "last_sent": "", "retry_after_days": } +``` + +Otherwise insert and return `204`. + +### D1 schema + +```sql +CREATE TABLE reports ( + id INTEGER PRIMARY KEY, + received TEXT NOT NULL, -- ISO 8601, server-side + uuid TEXT NOT NULL, + version TEXT, + channel TEXT, + commit TEXT, + build_date TEXT, + install_type TEXT, + arch TEXT, + python TEXT, + database TEXT, + doc_bucket TEXT, + multi_user INTEGER, -- 0 / 1 + feature_mail INTEGER, -- 0 / 1 + feature_ai INTEGER -- 0 / 1 +); + +CREATE INDEX idx_reports_uuid ON reports(uuid); +CREATE INDEX idx_reports_channel ON reports(channel); +CREATE INDEX idx_reports_version ON reports(version); +``` + +--- + +## Useful queries + +```sql +-- Distinct beta installs +SELECT COUNT(DISTINCT uuid) +FROM reports +WHERE channel = 'beta'; + +-- Installs by commit (beta only) +SELECT commit, COUNT(DISTINCT uuid) AS installs +FROM reports +WHERE channel = 'beta' +GROUP BY commit +ORDER BY installs DESC; + +-- Architecture breakdown +SELECT arch, COUNT(DISTINCT uuid) AS installs +FROM reports +GROUP BY arch +ORDER BY installs DESC; + +-- Install type split +SELECT install_type, COUNT(DISTINCT uuid) AS installs +FROM reports +GROUP BY install_type +ORDER BY installs DESC; + +-- Database backend split +SELECT database, COUNT(DISTINCT uuid) AS installs +FROM reports +GROUP BY database +ORDER BY installs DESC; +``` + +--- + +## Out of scope (for now) + +- Automatic or scheduled reporting +- Any opt-out settings flag +- Server-side dashboard (raw SQL is sufficient) +- Locale, timezone, or OS version fields