Just for later ideas, store some brainstorming sessions with Claude

2026-06-06 13:49:44 +00:00 · 2026-06-03 10:49:01 -07:00
parent 6ede72cc44
commit 6fa3e5fac7
5 changed files with 2161 additions and 0 deletions
@@ -0,0 +1,167 @@
+# Pluggable Document Storage Design
+
+**Date:** 2026-04-23
+**Status:** Approved
+
+## Overview
+
+Replace the hardcoded local filesystem storage in paperless-ngx with a pluggable `DocumentStorage` Protocol. Ship two built-in backends — `LocalFilesystemBackend` (default, zero config change) and `S3CompatibleBackend` (supports AWS S3 and any S3-compatible endpoint). Third parties can implement the Protocol to provide their own backends.
+
+## Scope
+
+- **In scope:** original documents, PDF/A archives
+- **Out of scope:** thumbnails (stay on local filesystem, regenerable), consumption directory (stays local)
+- **Frontend impact:** none — S3 is invisible; Django proxies all file access
+
+## Protocol
+
+Defined in `src/paperless/storage.py`:
+
+```python
+class DocumentStorage(Protocol):
+    def __enter__(self) -> Self: ...
+    def __exit__(self, exc_type, exc_val, exc_tb) -> None: ...
+    def open(self, name: str) -> IO[bytes]: ...
+    def save(self, name: str, content: IO[bytes]) -> str: ...  # returns actual name used
+    def delete(self, name: str) -> None: ...
+    def exists(self, name: str) -> bool: ...
+    def move(self, old_name: str, new_name: str) -> None: ...
+    def list_files(self, prefix: str = "") -> Iterable[str]: ...
+    def size(self, name: str) -> int: ...
+```
+
+`name` is always the relative key as stored in the DB (e.g. `2024/my-invoice.pdf`). All operations including `open()` must be called within a `with storage:` block — the context manager handles connection lifecycle and backend-specific cleanup.
+
+## Storage Instances
+
+Two module-level singletons in `src/paperless/storage.py`, each an instance of the configured backend class:
+
+```python
+original_storage: DocumentStorage = _build("originals")
+archive_storage: DocumentStorage  = _build("archive")
+```
+
+`_build(prefix)` reads `PAPERLESS_DOCUMENT_STORAGE_BACKEND` and `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` from settings, instantiates the backend class with the configured options plus the paperless-controlled prefix. The prefix distinguishes originals from archives within the same bucket or directory root — it is not stored in the DB key.
+
+## Configuration
+
+Two new settings, using the existing key-value dict mechanism:
+
+| Setting                              | Default                                    | Description                                                  |
+| ------------------------------------ | ------------------------------------------ | ------------------------------------------------------------ |
+| `PAPERLESS_DOCUMENT_STORAGE_BACKEND` | `paperless.storage.LocalFilesystemBackend` | Dotted Python path to any class satisfying `DocumentStorage` |
+| `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` | `{}`                                       | Dict of kwargs passed to the backend constructor             |
+
+**Example — S3-compatible:**
+
+```
+PAPERLESS_DOCUMENT_STORAGE_BACKEND=paperless.storage.S3CompatibleBackend
+PAPERLESS_DOCUMENT_STORAGE_OPTIONS={"bucket_name": "my-docs", "endpoint_url": "https://s3.wasabi.com", "region_name": "us-east-1", "access_key": "...", "secret_key": "..."}
+```
+
+Existing users set nothing — `LocalFilesystemBackend` with no options is the default.
+
+## Built-in Backends
+
+### `LocalFilesystemBackend`
+
+- `__enter__`: initialises tracking of directories affected during the context
+- `__exit__`: calls `delete_empty_directories()` for all tracked dirs; no-op on exception
+- `open/save/delete/exists/move`: direct `Path` + `shutil` operations rooted at `settings.ORIGINALS_DIR` / `settings.ARCHIVE_DIR` (via the prefix passed by `_build`)
+- `move()`: `shutil.move()` — atomic on same filesystem
+- `list_files()`: `Path.rglob("*")`
+
+### `S3CompatibleBackend`
+
+- Wraps `django-storages` S3 backend (`storages.backends.s3boto3.S3Boto3Storage`) for `open`, `save`, `delete`, `exists`
+- `__enter__`: initialises boto3 client/session
+- `__exit__`: no cleanup required (no empty directory concept on S3)
+- `move()`: boto3 `copy_object` (server-side, no data transfer) + `delete_object`
+- `open()`: returns streaming S3 response body; caller's `with` block closes the HTTP connection
+- `list_files()`: S3 `list_objects_v2` with prefix
+- Works with any S3-compatible endpoint via `endpoint_url` option
+
+## Data Migration
+
+One Django migration strips the stored prefix from existing rows:
+
+- `document.filename`: `documents/originals/2024/invoice.pdf` → `2024/invoice.pdf`
+- `document.archive_filename`: `documents/archive/2024/invoice.pdf` → `2024/invoice.pdf`
+
+The prefix is now owned by the storage instance, not the DB key.
+
+## `migrate_storage` Management Command
+
+```
+manage.py migrate_storage [--dry-run] [--no-delete]
+    [--source-backend=<dotted.path>] [--source-options=<json>]
+```
+
+Transfers all document files from one storage backend to another. The user updates `PAPERLESS_DOCUMENT_STORAGE_BACKEND` in their config first, then runs this command to move existing files.
+
+The destination is always the currently configured backend (from settings). The source is specified via `--source-backend` / `--source-options`, defaulting to `LocalFilesystemBackend` with no options if omitted (covering the most common migration path: local → S3).
+
+**Flow:**
+
+1. Instantiate source backend (from CLI args or default) and destination backend (from current settings)
+2. Iterate `Document.objects.only("filename", "archive_filename")`
+3. For each file (original + archive):
+   - Skip with warning if missing from source
+   - Skip silently if already present on destination (idempotent — safe to re-run)
+   - Copy: `destination.save(name, source.open(name))`
+   - Unless `--no-delete`: `source.delete(name)`
+4. Report counts: moved / skipped / failed
+5. `--dry-run`: prints actions without touching files
+
+Individual failures are logged and counted but do not abort the run. Bidirectional: local → S3, S3 → local, S3 → S3.
+
+## Files to Create
+
+| File                                                    | Purpose                                                                        |
+| ------------------------------------------------------- | ------------------------------------------------------------------------------ |
+| `src/paperless/storage.py`                              | Protocol, built-in backends, `original_storage` / `archive_storage` singletons |
+| `src/documents/management/commands/migrate_storage.py`  | Migration command                                                              |
+| `src/documents/migrations/XXXX_strip_storage_prefix.py` | Strip prefix from existing filename rows                                       |
+
+## Files to Modify
+
+| File                                                     | Change                                                                                                           |
+| -------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
+| `src/paperless/settings/__init__.py`                     | Add `PAPERLESS_DOCUMENT_STORAGE_BACKEND`, `PAPERLESS_DOCUMENT_STORAGE_OPTIONS`                                   |
+| `src/documents/models.py`                                | `source_file`, `archive_file` use storage instances; `source_path` returns temp file for subprocess callers      |
+| `src/documents/consumer.py`                              | `_write()` → `storage.save()`; remove `mkdir` calls                                                              |
+| `src/documents/signals/handlers.py`                      | `shutil.move()` → `storage.move()`; remove `create_source_path_directory` / `delete_empty_directories` callsites |
+| `src/documents/tasks.py`                                 | Same as signals                                                                                                  |
+| `src/documents/file_handling.py`                         | `exists()` checks and directory references use storage API                                                       |
+| `src/documents/views/`                                   | File-serving views use `storage.open()` within context; wrap for `FileResponse` lifecycle                        |
+| `src/documents/management/commands/document_importer.py` | Replace `Path.glob()` and direct copies with storage API                                                         |
+| `src/documents/management/commands/document_exporter.py` | Replace direct file copies and `FileLock`-guarded writes with storage API                                        |
+
+## Locking & Concurrency
+
+The codebase serialises all document file write/move operations with `FileLock(settings.MEDIA_LOCK)`, where `MEDIA_LOCK = MEDIA_ROOT / "media.lock"`. This is used in `consumer.py`, `signals/handlers.py`, `tasks.py`, `mail.py`, `document_importer.py`, and `document_exporter.py`.
+
+**The lock file stays on the local filesystem regardless of backend.** `MEDIA_LOCK` lives under `MEDIA_ROOT`, which is the local path even when documents are stored on S3. This means:
+
+- **Single-host deployments** (the common case — Docker Compose, single server): the `FileLock` continues to work correctly. All Celery workers and the Django process share the same lock file. No change required.
+- **Multi-host deployments**: the `FileLock` is already broken for these today — each host has its own lock file. This is a pre-existing limitation and is out of scope for this feature.
+
+**Callsite structure** — the storage context manager nests inside the existing lock, preserving current behaviour:
+
+```python
+with FileLock(settings.MEDIA_LOCK):
+    with original_storage as storage:
+        storage.move(old_name, new_name)
+```
+
+**`generate_unique_filename` race:** this function checks `storage.exists()` then saves, which is not atomic on S3. The `FileLock` already serialises this on a single host. For multi-host this is a pre-existing gap — not introduced by this feature.
+
+**Future path for multi-host:** replace `FileLock` with a database-level advisory lock or Redis lock. Out of scope here.
+
+## Key Invariants
+
+- The context manager is required for all storage operations, including reads
+- `name` is always the relative key — never an absolute path or URL
+- The backend prefix (`originals` / `archive`) is paperless-controlled and never stored in the DB
+- `LocalFilesystemBackend` is the default — existing deployments require no config change
+- The migrate command is idempotent and can be re-run after partial failure
@@ -0,0 +1,253 @@
+# Workflow Runner Refactor — Design
+
+**Date:** 2026-05-19
+**Branch base:** `dev`
+**Status:** Approved design, pending implementation plan
+
+## Problem
+
+Workflow execution and the Django signal layer have repeatedly produced fragile,
+hard-to-fix bugs (see the revert/refix history around password removal: #12803,
+#12814, #12716, and the filename race #12386). Three structural causes:
+
+1. **`run_workflows` is dual-mode.** A single function handles both consumption
+   (mutating a `DocumentMetadataOverrides`) and post-save (mutating a real
+   `Document`), branching on a `use_overrides` flag. The branching is
+   concentrated in two places — the action dispatch inside `run_workflows`
+   (`handlers.py:931-1001`) and `build_workflow_action_context`
+   (`actions.py:33-83`), each with two full code paths. The `apply_*` helpers in
+   `workflows/mutations.py` are _already_ split by target type
+   (`apply_assignment_to_document` vs `apply_assignment_to_overrides`, etc.); the
+   refactor unifies their callers, not the helpers themselves.
+
+2. **File location is an implicit, timing-dependent side channel.** The
+   `DOCUMENT_ADDED` workflow fires from `run_workflows_added`, which runs while
+   the consumer is still inside its transaction — _before_ the consumed file is
+   copied to `document.source_path` (`document_consumption_finished` is sent at
+   `consumer.py:658`; the file copy happens after, at `consumer.py:670+`). The
+   staged path is therefore threaded through as `original_file` /
+   `caller_supplied_original_file` parameters. Actions that read the file
+   (password removal, email attachments) depend on this plumbing being correct.
+
+3. **The workflow run races the filename rename.** `update_filename_and_move_files`
+   is a raw `post_save` receiver on `Document`. When a workflow persists its
+   changes via `document.save(update_fields=[...])`, that save fires `post_save`
+   and runs the rename _while the workflow is still executing_. Under concurrent
+   Celery/UI updates the interleaved `refresh_from_db()` calls corrupt state. The
+   comment at `handlers.py:980-984` — deliberately excluding `filename` /
+   `archive_filename` from the workflow save — is a load-bearing workaround for
+   exactly this.
+
+Note: `run_workflows_added` / `run_workflows_updated` are connected to the
+_custom_ signals `document_consumption_finished` / `document_updated`, fired
+explicitly by paperless code in a handful of known sites — not to raw Django
+`post_save`. Only `update_filename_and_move_files` is a raw `post_save` receiver.
+This refactor does not change where workflows are triggered from.
+
+## Scope
+
+In scope:
+
+- Refactor `run_workflows` and its action helpers around an execution-context
+  abstraction.
+- Delete the `original_file` side-channel plumbing.
+- Make the workflow-execution → persist → rename sequence explicit and
+  deterministic.
+
+Out of scope:
+
+- Changing where/when workflows are triggered (custom signal call sites unchanged).
+- Reworking the matching logic (`matching.document_matches_workflow`).
+- Any change to workflow models, serializers, or the REST API.
+
+## Design
+
+### 1. `WorkflowRunContext` protocol
+
+New module `documents/workflows/context.py` defining a `typing.Protocol`:
+
+```
+WorkflowRunContext (Protocol)
+  source_file: Path                       # where the file actually is, now
+  build_placeholder_context() -> dict
+  apply_assignment(action) -> None
+  apply_removal(action) -> None
+  persist() -> None                        # commit accumulated mutations
+  record_run(workflow, trigger_type) -> None
+```
+
+Two concrete implementations (which need not import the Protocol — structural
+typing):
+
+- **`ConsumptionContext`** — wraps `ConsumableDocument` + `DocumentMetadataOverrides`.
+  `source_file` returns the staged file path. Mutations land on the overrides.
+  `persist()` is a no-op (the overrides object is returned to the caller).
+- **`PersistedContext`** — wraps a real `Document`. Mutations land on the
+  in-memory `Document`. `persist()` performs a single save.
+
+**Context selection** — `run_workflows` picks the context from the call shape:
+
+- CONSUMPTION trigger (`ConsumableDocument` + non-`None` `overrides`) →
+  `ConsumptionContext`.
+- DOCUMENT_ADDED / DOCUMENT_UPDATED / SCHEDULED (a real `Document`,
+  `overrides=None`) → `PersistedContext`.
+
+**`source_file` for `PersistedContext`.** It cannot unconditionally return
+`document.source_path`: for the `DOCUMENT_ADDED` trigger the file has not yet
+been moved there. The staged path is therefore passed into the `PersistedContext`
+_at construction time_ by `run_workflows_added` (which still receives it from the
+`document_consumption_finished` signal). `source_file` returns that staged path
+when supplied, otherwise `document.source_path`. This relocates the staged-path
+information from a chain of function parameters into a single piece of
+construction state — the `original_file` / `caller_supplied_original_file`
+_parameter plumbing_ through `run_workflows` and the action helpers is what gets
+deleted, not the staged path itself.
+
+`WorkflowRunContext` is a plain `Protocol`, not `@runtime_checkable` — the runner
+constructs the context itself, so no `isinstance` check is needed. Genuinely
+shared logic goes into module-level helper functions, not a base class.
+
+### 2. `run_workflows` becomes branch-free
+
+`run_workflows` keeps its current public signature so all call sites are
+unchanged. Its body:
+
+1. Construct the appropriate context once, from the argument types.
+2. Run a single flat match-and-dispatch loop over matching workflows/actions,
+   delegating every action to context methods.
+
+No `use_overrides` flag anywhere. The branching currently scattered across
+`run_workflows`, `build_workflow_action_context`, and the `apply_*` helpers
+collapses into the two context classes.
+
+### 3. File staging via `source_file`
+
+`source_file` is a property of the context, fixed at construction. The
+`original_file` and `caller_supplied_original_file` parameters threaded through
+`run_workflows` and the `execute_*` helpers are deleted; each context resolves
+the path itself (see "Context selection" above).
+
+**Deferred password removal.** `execute_password_removal_action`, when given a
+`ConsumableDocument`, currently installs a one-shot handler on
+`document_consumption_finished` that picks up `original_file` from `kwargs`
+later (`actions.py:295-308`). This deferred hook lives outside the context
+abstraction. The refactor must explicitly decide its fate: either keep it as-is
+(the context still constructs correctly around it) or fold the deferral into
+`ConsumptionContext`. This is called out as an open implementation decision, not
+silently absorbed.
+
+### 4. Explicit workflow → persist → rename sequencing
+
+What must be deferred is the **file rename**, not the DB save. `run_workflows`
+keeps its per-workflow `document.refresh_from_db()` at the top of each iteration
+— that is deliberate concurrency protection against `bulk_update_documents`
+running simultaneously. Deferring all saves to a single final `persist()` would
+let one workflow's refresh wipe a prior workflow's in-memory changes. So:
+
+1. `run_workflows` refreshes and applies actions per workflow, and
+   `PersistedContext.persist()` saves after each matching workflow, as today.
+2. The save deliberately **continues to exclude** `filename` /
+   `archive_filename` from `update_fields`. This is not duct tape: it guards a
+   _cross-process_ hazard — another Celery task may have moved the file and
+   written `filename` to the DB, and a stale in-memory `filename` in our save
+   would revert it. The `ContextVar` guard (below) only addresses _intra-process_
+   ordering, so this exclusion stays.
+3. The rename is suppressed for the whole run and invoked **exactly once,
+   afterward**, against final committed state.
+
+The actual race being fixed: `apply_assignment_to_document` assigns tags via
+`document.add_nested_tags(...)`, which fires `m2m_changed` on
+`Document.tags.through` _before_ the workflow's `document.save()`. The
+`m2m_changed` receiver `update_filename_and_move_files` then calls
+`refresh_from_db()`, wiping the workflow's in-memory correspondent/type, and
+moves the file to a path computed from stale metadata. The guard prevents this.
+
+To stop the rename from firing mid-workflow, a **`ContextVar` guard** is
+introduced (e.g. `documents/workflows/context.py` module-level
+`_workflow_in_progress: ContextVar[bool]`). `update_filename_and_move_files`
+checks the guard and early-returns when set. `run_workflows` wraps its **entire**
+persisted-path execution — not just the `persist()` call — in a context manager
+that sets the guard via `set()`/`reset(token)`. Token-based reset is
+reentrancy-safe for nested saves or nested workflow runs.
+
+The guard must span the whole execution, not just `persist()`, because
+`update_filename_and_move_files` is _also_ registered to `m2m_changed` on
+`Document.tags.through` and to `post_save` on `CustomFieldInstance`
+(`handlers.py:431-432`). A workflow action that assigns tags or custom fields
+would otherwise trigger a rename mid-workflow through those signals.
+
+After execution completes, `run_workflows` calls `persist()` once and then
+explicitly invokes the move logic once. The `ContextVar` is set/reset in the
+same thread that runs these receivers synchronously, so they always observe the
+value. (Celery `prefork` workers run each task in its own process; greenlet
+pools are also `contextvars`-aware — non-issues, noted for completeness.)
+
+The move body of `update_filename_and_move_files` is extracted into a plain
+callable that the runner invokes directly. The function is already invoked
+directly (as a plain call, bypassing the decorator) for version documents at
+`handlers.py:664-667`, so this extraction has precedent. The thin `post_save`
+receiver remains as a guard-checking wrapper.
+
+The two `post_save` receivers on `Document` are `update_filename_and_move_files`
+(`handlers.py:433`) and `update_llm_suggestions_cache` (`handlers.py:740`). The
+`ContextVar` guard suppresses **only** the former — `update_llm_suggestions_cache`
+keeps running normally, as do `document_consumption_finished` receivers such as
+`add_or_update_document_in_llm_index` (which is _not_ a `post_save` receiver).
+This is why the guard is preferred over persisting with `.update()`, which would
+silently suppress _all_ `post_save` receivers including
+`update_llm_suggestions_cache`.
+
+`WorkflowRun.objects.create(...)` is created per matching workflow as today
+(`handlers.py:998-1002`); it is a separate model and is not deferred.
+
+The comment at `handlers.py:980-984` is updated to describe the new flow
+(per-workflow save under the guard; single explicit rename afterward) but the
+`filename` / `archive_filename` exclusion it documents is kept — see point 2
+above.
+
+## Testing
+
+- **Runner loop** — exercised against a fake context implementing the
+  `WorkflowRunContext` surface that records `apply_assignment` / `apply_removal`
+  / `persist` calls. No DB document, no staged files, no signals.
+- **Concrete contexts** — `ConsumptionContext` and `PersistedContext` each get
+  focused tests: given an action, assert the mutation lands on the overrides vs.
+  the document, and that `source_file` resolves to the staged vs. final path.
+- **ContextVar guard** — assert `update_filename_and_move_files` early-returns
+  while the guard is set, and that the rename runs exactly once after
+  `persist()`.
+- **Regression: the racy case** — a workflow that reassigns metadata while the
+  document is subject to a filename template; assert final DB filename and file
+  location are consistent (the #12386 scenario).
+- **Regression safety net** — the existing `test_workflows.py` suite (~100
+  tests; ~19 `document_consumption_finished.send` sites plus many direct
+  `run_workflows(...)` calls for the `DOCUMENT_UPDATED` path) must stay green
+  **unchanged**. A test that needs editing signals a behavior change to flag
+  explicitly, not a silent refactor outcome.
+
+Per project conventions: tests grouped under classes, fixtures and test
+signatures fully type-annotated.
+
+## Implementation sequence
+
+Each step is independently reviewable and keeps the test suite green:
+
+1. Introduce the `Protocol` + the two contexts; `run_workflows` delegates to
+   them. Pure refactor, no behavior change.
+2. Move the staged path into `PersistedContext` construction (passed by
+   `run_workflows_added`); delete the `original_file` /
+   `caller_supplied_original_file` parameter plumbing through `run_workflows`
+   and the `execute_*` helpers.
+3. Extract the move body from `update_filename_and_move_files` into a callable;
+   add the `ContextVar` guard; `run_workflows` invokes the move once after the
+   run completes. The `filename` / `archive_filename` exclusion in the
+   per-workflow save is kept; only the comment at `handlers.py:980-984` is
+   updated to describe the new flow.
+
+## Pain points addressed
+
+- **Dual-mode** → eliminated by the `Protocol` + two contexts; no `use_overrides`.
+- **File staging** → `source_file` is a context property; side-channel args deleted.
+- **Rename race** → per-workflow save under a `ContextVar` guard that suppresses
+  the mid-workflow rename; a single explicit rename runs once at the end against
+  final state.
@@ -0,0 +1,215 @@
+# AI Suggestions: Inject existing taxonomy as candidates
+
+**Status:** Design (v2 — frequency-only)
+**Date:** 2026-05-20
+**Related:** [Discussion #12787](https://github.com/paperless-ngx/paperless-ngx/discussions/12787)
+**Branch target:** `dev`
+
+## Problem
+
+AI Suggestions currently asks the LLM for free-form tag/document-type/correspondent/storage-path names, then reconciles via `difflib` fuzzy matching (cutoff 0.8) in `paperless_ai/matching.py`. This works for typos but not for semantic equivalents:
+
+- `blood test` does not fuzzy-match `Bloodwork`
+- `IRS` does not fuzzy-match `Taxes`
+- `doctor visit` does not fuzzy-match `Medical`
+
+Result: the LLM invents new metadata names that duplicate existing taxonomy entries.
+
+## Goal
+
+Tell the LLM what already exists, so it can prefer existing names verbatim. Fuzzy matching becomes the fallback for typos and for legitimately novel suggestions, not the primary semantic-equivalence mechanism.
+
+Non-goals: changing the LLM client, embedding model selection, or RAG retrieval. Replacing fuzzy matching entirely. Custom-field option values. Embedding-based shortlisting (deferred to a v2 if frequency proves insufficient).
+
+## Approach
+
+For each of Tags, DocumentTypes, Correspondents, StoragePaths:
+
+1. Take the user-visible queryset (owner-aware, matching `matching.py`).
+2. Annotate by document-usage count and take the top `X` names by frequency. `X` is configurable per category cap (single setting, applied to all four categories).
+3. Inject those names into the LLM prompt as "Available <category>" blocks, with the instruction to prefer them verbatim.
+4. When the LLM responds, tell `matching.py` which names were hinted so an exact normalized match short-circuits past fuzzy. Names not in the hint list keep today's fuzzy fallback.
+
+No FAISS index, no signals, no Celery tasks, no locks. Pure DB-side queries on each suggestion request.
+
+## Components
+
+### `paperless_ai/taxonomy.py` (new)
+
+```python
+class TaxonomyHints(TypedDict):
+    tags: list[str]
+    document_types: list[str]
+    correspondents: list[str]
+    storage_paths: list[str]
+
+def build_taxonomy_hints(document: Document, user: User | None) -> TaxonomyHints: ...
+def format_hints_for_prompt(hints: TaxonomyHints) -> str: ...
+```
+
+Internals:
+
+- `_visible_queryset(model_cls, perm: str, user)` — wraps `get_objects_for_user_owner_aware` exactly as `matching.py` does. If `user` is `None`, returns the unfiltered manager queryset (parity with how `matching.py` behaves today).
+- `_shortlist_by_frequency(queryset, max_per_category)` — DB-side:
+  ```python
+  return list(
+      queryset
+      .annotate(usage=Count("documents"))
+      .order_by("-usage", "name")
+      .values_list("name", flat=True)[:max_per_category]
+  )
+  ```
+  Confirmed reverse relation name is `documents` for all four models (`documents/models.py:164,173,184,211`). Secondary order by `name` keeps results stable when usage ties (common with 0-usage tails). `StoragePath` uses the human `name` field, not the `path` template.
+
+`format_hints_for_prompt` emits one `Available <category>:` block per non-empty category. Empty categories produce no block (avoid prompting the LLM with "Available tags: (none)"). A single instruction line follows:
+
+```
+Prefer existing names from these lists verbatim. Only propose a new value
+if none of the existing names fits.
+```
+
+### `paperless_ai/ai_classifier.py` (modify)
+
+Required signature change (the v1 spec missed this — flagged by code review):
+
+- `build_prompt_without_rag(document, user: User | None = None)` — currently takes only `document`; add `user` with `None` default to keep call sites simple.
+- `build_prompt_with_rag(document, user: User | None = None)` — already takes `user`; its existing call to `build_prompt_without_rag(document)` at `ai_classifier.py:39` is updated to pass `user` through.
+
+Both prompt builders accept an optional `hints: TaxonomyHints | None = None` parameter. When non-`None`, `format_hints_for_prompt(hints)` is spliced in before the "Analyze the following document" instruction. When `None` (default), the prompt is built as today.
+
+`get_ai_document_classification(document, user, hints: TaxonomyHints | None = None)` accepts the same optional `hints` and forwards it to the prompt builder. Return shape is **unchanged** (`dict`). The view layer owns hint construction so the same `TaxonomyHints` object can be used both for the prompt and for `hinted_names` in matching — no need to thread it back out of the classifier. Callers in tests pass `hints=None` (or omit) to preserve existing behavior.
+
+### `paperless_ai/matching.py` (modify)
+
+- `_match_names_to_queryset(names, queryset, attr, hinted_names: set[str] | None = None)`:
+  - Normalization unchanged.
+  - Exact-match-on-full-queryset behavior unchanged (always tried first).
+  - When `hinted_names` is provided and the LLM-returned name (normalized) matches a hinted name (normalized) → treated as exact-only; fuzzy is skipped for that name.
+  - When `hinted_names` is `None` or the name isn't in it → existing 0.8 fuzzy fallback runs.
+- `match_tags_by_name(names, user, hinted_names=None)` etc. — optional kwarg, backward compatible.
+
+### `documents/views.py` (modify)
+
+The suggestion endpoint (around line 1482) is the single production caller of `get_ai_document_classification` and the call site for `match_*_by_name`. Update it to:
+
+1. Build hints once: `hints = build_taxonomy_hints(document, request.user)` (when `AIConfig().taxonomy_hints_enabled` and `max_per_category > 0`; otherwise `hints = None`).
+2. Pass `hints` into the classifier: `parsed = get_ai_document_classification(document, request.user, hints=hints)`.
+3. Pass `hinted_names=set(hints["tags"])` (etc., one per category, or `None` when `hints` is `None`) into each `match_*_by_name` call.
+
+**Cache interaction:** the AI suggestion path is wrapped by `cached_llm_suggestions` / `refresh_suggestions_cache` (views.py:1477). A cached response bypasses the LLM call entirely — so changes to hints config don't take effect until the cache entry is invalidated. Acceptable for v1 (cache is short-lived). If experience shows users change the toggle and expect immediate effect, follow up by including a hash of the hint-relevant config (`taxonomy_hints_enabled`, `_max`) in the cache key.
+
+### `paperless/config.py` (`AIConfig`) + DB model + settings
+
+`AIConfig.__post_init__` reads values from the `ApplicationConfiguration` DB row **and** falls back to `settings.*` constants (pattern at `paperless/config.py:207` for `ai_enabled`). Both layers are needed.
+
+Two new fields, threaded through three places:
+
+1. **`paperless/settings/*.py`** — add module-level constants read from env:
+   - `AI_TAXONOMY_HINTS: bool = __get_boolean("PAPERLESS_AI_TAXONOMY_HINTS", "yes")` (default on)
+   - `AI_TAXONOMY_HINTS_MAX: int = int(os.getenv("PAPERLESS_AI_TAXONOMY_HINTS_MAX", "30"))`
+
+2. **`paperless/models.py` (`ApplicationConfiguration`)** — add two nullable columns:
+   - `taxonomy_hints_enabled = models.BooleanField(null=True)`
+   - `taxonomy_hints_max_per_category = models.PositiveSmallIntegerField(null=True)` (range 0–32767; `PositiveSmallIntegerField` is sufficient)
+   - One Django migration.
+
+3. **`paperless/config.py` (`AIConfig`)** — read with **explicit None check, not `or`** (because `0` and `False` are legitimate user values that would otherwise silently fall back to the settings default):
+   ```python
+   self.taxonomy_hints_enabled = (
+       app_config.taxonomy_hints_enabled
+       if app_config.taxonomy_hints_enabled is not None
+       else settings.AI_TAXONOMY_HINTS
+   )
+   self.taxonomy_hints_max_per_category = (
+       app_config.taxonomy_hints_max_per_category
+       if app_config.taxonomy_hints_max_per_category is not None
+       else settings.AI_TAXONOMY_HINTS_MAX
+   )
+   ```
+   (Other fields in this file use `or`; we deliberately diverge here to support `0` and `False`. A short comment in code records why.)
+
+**Frontend** (`src-ui/src/app/data/paperless-config.ts`): add two entries to the `PaperlessConfigOptions` declarative list (one `Boolean`, one `Number`, `category: ConfigCategory.AI`) plus two fields on the `PaperlessConfig` interface. No component changes; the form is generated from this list.
+
+`paperless.conf.example` and the configuration docs page get entries.
+
+## Data flow
+
+Suggestion request:
+
+1. View checks `AIConfig().taxonomy_hints_enabled`; if enabled, calls `hints = build_taxonomy_hints(document, user)`; otherwise `hints = None`.
+2. View calls `parsed = get_ai_document_classification(document, user, hints=hints)`.
+3. Classifier splices `format_hints_for_prompt(hints)` into the prompt (when non-`None`), calls LLM, returns parsed dict.
+4. View calls `match_*_by_name(names, user, hinted_names=set(hints[<category>]) if hints else None)` per category. Exact-on-hint short-circuit; fuzzy fallback unchanged for misses.
+
+No background processing. No persisted state. Each suggestion request runs four lightweight `Count("documents")` queries (could be combined into a single query per model via `.annotate().order_by().values_list()`, no joins beyond the existing reverse relation).
+
+## Error handling
+
+- **Empty visible queryset for a category:** omit that category's block from the prompt.
+- **`taxonomy_hints_enabled = False` or `max_per_category = 0`:** `build_taxonomy_hints` returns an empty `TaxonomyHints`; prompt is identical to today; matching is called without `hinted_names`; behavior identical to today.
+- **LLM returns a name not in hints but exactly matching an existing visible name:** still treated as exact match. `_match_names_to_queryset` always tries exact-on-full-queryset before fuzzy; `hinted_names` only governs whether fuzzy is consulted for that specific name.
+- **DB query failure during shortlist build:** propagate. Suggestion failures already surface as 5xx; adding silent fallbacks here would mask real problems.
+
+## Testing
+
+All new and modified tests use pytest style — functions/classes, no `unittest.TestCase` subclasses; `pytest-django` with per-class `@pytest.mark.django_db`; `pytest-mock`'s `mocker` fixture for patching; every fixture parameter, fixture return, and test signature type-annotated. Tests grouped under classes (`class TestBuildTaxonomyHints:`), not flat free functions. Shared fixtures live in `paperless_ai/tests/conftest.py`. Format with `ruff` directly (not `uv run ruff`).
+
+### `paperless_ai/tests/test_taxonomy.py` (new)
+
+- `class TestBuildTaxonomyHints:`
+  - Returns a `TaxonomyHints` with all four keys.
+  - Top-K limit respected (`max_per_category` honored from `AIConfig`).
+  - Frequency ordering: tag used on 5 docs ranks above tag used on 2 docs.
+  - Tie-break by name (alphabetical) for stable output.
+  - Owner-aware: user lacking `view_tag` perm gets `tags=[]`; `view_documenttype` likewise per category.
+  - Empty queryset for a category → empty list; `format_hints_for_prompt` omits the block.
+  - `taxonomy_hints_enabled=False` returns zero-filled `TaxonomyHints` and runs no taxonomy DB queries (`django_assert_num_queries`).
+  - `max_per_category=0` same behavior as disabled.
+  - `StoragePath` shortlist uses the `name` field, not `path` template (asserted on returned values).
+
+- `class TestFormatHintsForPrompt:`
+  - All four blocks present when all categories non-empty.
+  - Empty category produces no block.
+  - All-empty hints produces empty string (no stray instruction line).
+  - Instruction line appears exactly once when at least one block is rendered.
+
+### `paperless_ai/tests/test_ai_classifier.py` (extend)
+
+- `class TestBuildPrompt:`
+  - `build_prompt_without_rag(doc, user)` now accepts `user`; produces a prompt containing the hints block when hints are non-empty.
+  - `build_prompt_with_rag(doc, user)` includes both the RAG context block (unchanged) and the hints block.
+  - `taxonomy_hints_enabled=False`: prompt matches today's baseline (string equality against a fixture).
+  - `get_ai_document_classification(doc, user, hints=...)` forwards hints into the prompt; return shape unchanged (still `dict`).
+
+### `paperless_ai/tests/test_matching.py` (extend)
+
+- `class TestHintedMatching:`
+  - LLM returns `"Bloodwork"` verbatim, `hinted_names={"Bloodwork", ...}` → exact match returned; `difflib.get_close_matches` not called (`mocker.spy` on `difflib.get_close_matches`).
+  - LLM returns `"blood test"` not in `hinted_names`, no existing exact → fuzzy fallback runs; behavior unchanged from today (regression guard).
+  - LLM returns `"Bloodwork "` (whitespace) with hinted_names containing `"Bloodwork"` → normalized exact match wins, fuzzy not consulted.
+  - Backward compatibility: `match_tags_by_name(names, user)` without the kwarg behaves identically to today (snapshot of an existing test, parameterized).
+
+Markers: no `live` marker needed.
+
+## Migration / rollout
+
+- One Django migration adding two columns to `ApplicationConfiguration` (`taxonomy_hints_enabled BooleanField`, `taxonomy_hints_max_per_category PositiveSmallIntegerField`). Both nullable with sensible defaults so existing rows aren't broken.
+- Feature defaults to on for new and existing installs. Set `PAPERLESS_AI_TAXONOMY_HINTS=false` (or via the Application Configuration UI) to restore today's behavior.
+- Frontend admin form updated to expose the two fields under the existing AI section.
+
+## Open questions deferred to implementation
+
+- `paperless_ai/tests/conftest.py` already exists — verify fixture-naming conventions match before adding new fixtures.
+- Confirm `parse_ai_response` doesn't need to know about hints (it's a pure parser; hints flow alongside, not through it).
+- The view layer applying `hinted_names` needs to read the same `AIConfig` instance the classifier used; pass the `TaxonomyHints` through the response tuple (chosen) rather than re-deriving in the view.
+
+## Interplay with `extract_unmatched_names`
+
+`extract_unmatched_names` (used downstream of matching) surfaces LLM-returned names that didn't match any existing taxonomy entry — the UI uses these to offer "create new tag?" affordances. With hints in place, fewer names will be unmatched, which is the desired outcome. No behavior change is required: a hinted name that the LLM repeats verbatim will exact-match and not appear in the unmatched list; a name the LLM invents anyway (despite the hint instruction) still flows through fuzzy and, if no match, surfaces as "new" exactly as today. Out of scope: filtering unmatched results based on what was in the hint set.
+
+## Out of scope (potential v2)
+
+- Embedding-based shortlisting (for users with very large taxonomies where frequency misses the right tag). Would re-introduce a FAISS-shaped subsystem with signals, debounce, and locks. Defer until evidence frequency is insufficient.
+- Tag hierarchy awareness — hinting `Medical/Bloodwork` vs `Bloodwork` when tags are nested.
+- Custom field option values.
+- `StoragePath` template-expression hinting (vs raw `name`).
@@ -0,0 +1,308 @@
+# Usage Reporting — Technical Spec
+
+Voluntary, opt-in usage reporting for paperless-ngx. The goal is to
+understand how many instances are running a given release (especially
+beta), which platforms and architectures are in use, and what features
+are being deployed — without collecting any personal data or document
+content.
+
+---
+
+## Guiding principles
+
+- **Explicitly opt-in.** Nothing is sent automatically. The user runs
+  the command and confirms before any network call is made.
+- **Transparent.** The exact payload is shown before sending.
+- **Anonymous.** The UUID is a random identifier with no link to
+  identity, IP address, or hostname.
+- **Graceful.** Network failures produce a friendly message, never a
+  stack trace.
+
+---
+
+## Client — management command
+
+### Name
+
+```
+manage.py send_usage_report
+```
+
+### Flags
+
+| Flag        | Behaviour                                                 |
+| ----------- | --------------------------------------------------------- |
+| _(none)_    | Show payload, prompt for confirmation, send on `y`/`yes`  |
+| `--dry-run` | Show payload, skip confirmation and network call entirely |
+
+### UUID storage
+
+A random UUID4 is generated on the first run and written to
+`PAPERLESS_DATA_DIR/usage_uuid` (plain text, one line). Subsequent
+runs reuse the same file. If the file is missing it is regenerated
+(counts as a new install — acceptable).
+
+### Confirmation flow
+
+```
+The following information will be sent to paperless-ngx to help
+improve the project:
+
+  Installation ID : a1b2c3d4-e5f6-7890-abcd-ef1234567890
+  Version         : 2.15.0
+  Channel         : beta
+  Commit          : bd86dca57  (built 2026-05-18T12:00:00Z)
+  Install type    : docker
+  Architecture    : x86_64
+  Python          : 3.12.3
+  Database        : postgresql
+  Documents       : 1000–9999
+  Multi-user      : yes
+  Mail enabled    : yes
+  AI enabled      : no
+
+No personal data, document content, or IP address is stored.
+More information: https://docs.paperless-ngx.com/usage-reporting/
+
+Send this report? [y/N]:
+```
+
+Default answer is **N**. Anything other than `y`/`yes` aborts with
+no network call and prints `Nothing sent.`
+
+`--dry-run` skips the prompt entirely and prints `Dry run — nothing sent.`
+
+### Network error handling
+
+- Timeout: 10 seconds
+- On any failure (timeout, DNS, HTTP error): print a single friendly
+  line, exit 0 (not an error from the user's perspective)
+
+```
+Could not reach the reporting endpoint. Nothing was sent.
+```
+
+### Duplicate submission handling
+
+The server returns `429` if the UUID was seen within the last 7 days,
+with a JSON body:
+
+```json
+{
+  "error": "already_submitted",
+  "last_sent": "2026-05-15T10:00:00Z",
+  "retry_after_days": 4
+}
+```
+
+The command prints:
+
+```
+Already submitted 3 days ago. Nothing sent.
+You can send again after 2026-05-19.
+```
+
+---
+
+## Payload schema
+
+All fields are strings unless noted. Fields marked _omit if absent_
+are left out of the JSON entirely when the value is unavailable —
+never sent as `null`.
+
+| Field          | Source                                                    | Notes                                            |
+| -------------- | --------------------------------------------------------- | ------------------------------------------------ |
+| `uuid`         | `PAPERLESS_DATA_DIR/usage_uuid`                           | UUID4, random                                    |
+| `version`      | `paperless/version.py` — `__full_version_str__`           | e.g. `"2.15.0"`                                  |
+| `channel`      | `paperless/version.py` — `__channel__`                    | `"stable"` \| `"beta"` \| `"dev"`                |
+| `commit`       | `paperless/build_info.py` — `SOURCE_COMMIT`               | Short SHA — _omit if absent_                     |
+| `build_date`   | `paperless/build_info.py` — `BUILD_DATE`                  | ISO 8601 — _omit if absent_                      |
+| `install_type` | Detected at runtime (see below)                           |                                                  |
+| `arch`         | `platform.machine()`                                      | e.g. `"x86_64"`, `"aarch64"`                     |
+| `python`       | `platform.python_version()`                               | e.g. `"3.12.3"`                                  |
+| `database`     | Last segment of `settings.DATABASES["default"]["ENGINE"]` | e.g. `"postgresql"`, `"sqlite3"`                 |
+| `doc_bucket`   | Bucketed document count (see below)                       |                                                  |
+| `multi_user`   | boolean                                                   | `true` if more than one real user account exists |
+| `feature_mail` | boolean                                                   | `true` if any mail account is configured         |
+| `feature_ai`   | boolean                                                   | `true` if AI features are enabled in settings    |
+
+### Document count buckets
+
+| Range         | Value           |
+| ------------- | --------------- |
+| 0–99          | `"0-99"`        |
+| 100–999       | `"100-999"`     |
+| 1 000–9 999   | `"1000-9999"`   |
+| 10 000–49 999 | `"10000-49999"` |
+| 50 000+       | `"50000+"`      |
+
+### Install type detection
+
+Evaluated in order; first match wins.
+
+| Value          | Detection                                                   |
+| -------------- | ----------------------------------------------------------- |
+| `"kubernetes"` | `KUBERNETES_SERVICE_HOST` env var is set                    |
+| `"podman"`     | `container` env var equals `"podman"`                       |
+| `"docker"`     | `Path("/.dockerenv").exists()`                              |
+| `"nixos"`      | `"/nix/store/"` in `sys.executable`                         |
+| `"snap"`       | `SNAP` env var is set                                       |
+| `"flatpak"`    | `FLATPAK_ID` env var is set                                 |
+| `"distro"`     | `paperless/distro_info.py` exists (set by distro packagers) |
+| `"release"`    | `paperless/build_info.py` exists (none of the above)        |
+| `"source"`     | Fallback — dev checkout                                     |
+
+Distro packagers (Debian, NixOS community, Unraid, etc.) can opt in
+by shipping a `src/paperless/distro_info.py` containing:
+
+```python
+DISTRO = "debian"   # or "rpm", "homebrew", "unraid", etc.
+```
+
+When present the install type is reported as the `DISTRO` value rather
+than `"distro"`.
+
+### `version.py` additions
+
+Add `__channel__` alongside the existing version fields:
+
+```python
+__channel__: Final[str] = "beta"   # "stable" | "beta" | "dev"
+```
+
+This is the canonical place to set the channel when preparing a
+release. `"dev"` is the default for unreleased branches.
+
+### `build_info.py`
+
+Generated at build time, never committed (add to `.gitignore`).
+
+```python
+SOURCE_COMMIT = "bd86dca57"
+BUILD_DATE = "2026-05-18T12:00:00Z"
+```
+
+---
+
+## Server — Cloudflare Worker
+
+Managed in a separate repository under the paperless-ngx GitHub org
+(e.g. `paperless-ngx/telemetry`). Deployed via Wrangler.
+
+### Endpoint
+
+```
+POST /report
+Content-Type: application/json
+```
+
+Returns `204` on success. No response body.
+
+### Timestamp
+
+`received` is always set server-side. Any client-supplied timestamp
+field is ignored.
+
+### Validation
+
+Reject with `400` if any of the following fail:
+
+- `uuid` does not match UUID4 format
+- `version` does not match `\d+\.\d+\.\d+`
+- `channel` is not one of `stable`, `beta`, `dev`
+- `install_type` is not in the known set
+- `arch` is absent
+- Payload is not valid JSON or exceeds 4 KB
+
+Unknown extra fields are silently ignored (forward compatibility).
+
+### Deduplication
+
+Before inserting, query for the most recent submission from this UUID:
+
+```sql
+SELECT received FROM reports
+WHERE uuid = ?
+ORDER BY received DESC
+LIMIT 1
+```
+
+If the result is within 7 days of now, return:
+
+```
+HTTP 429
+{ "error": "already_submitted", "last_sent": "<iso>", "retry_after_days": <n> }
+```
+
+Otherwise insert and return `204`.
+
+### D1 schema
+
+```sql
+CREATE TABLE reports (
+  id            INTEGER PRIMARY KEY,
+  received      TEXT    NOT NULL,   -- ISO 8601, server-side
+  uuid          TEXT    NOT NULL,
+  version       TEXT,
+  channel       TEXT,
+  commit        TEXT,
+  build_date    TEXT,
+  install_type  TEXT,
+  arch          TEXT,
+  python        TEXT,
+  database      TEXT,
+  doc_bucket    TEXT,
+  multi_user    INTEGER,            -- 0 / 1
+  feature_mail  INTEGER,            -- 0 / 1
+  feature_ai    INTEGER             -- 0 / 1
+);
+
+CREATE INDEX idx_reports_uuid    ON reports(uuid);
+CREATE INDEX idx_reports_channel ON reports(channel);
+CREATE INDEX idx_reports_version ON reports(version);
+```
+
+---
+
+## Useful queries
+
+```sql
+-- Distinct beta installs
+SELECT COUNT(DISTINCT uuid)
+FROM reports
+WHERE channel = 'beta';
+
+-- Installs by commit (beta only)
+SELECT commit, COUNT(DISTINCT uuid) AS installs
+FROM reports
+WHERE channel = 'beta'
+GROUP BY commit
+ORDER BY installs DESC;
+
+-- Architecture breakdown
+SELECT arch, COUNT(DISTINCT uuid) AS installs
+FROM reports
+GROUP BY arch
+ORDER BY installs DESC;
+
+-- Install type split
+SELECT install_type, COUNT(DISTINCT uuid) AS installs
+FROM reports
+GROUP BY install_type
+ORDER BY installs DESC;
+
+-- Database backend split
+SELECT database, COUNT(DISTINCT uuid) AS installs
+FROM reports
+GROUP BY database
+ORDER BY installs DESC;
+```
+
+---
+
+## Out of scope (for now)
+
+- Automatic or scheduled reporting
+- Any opt-out settings flag
+- Server-side dashboard (raw SQL is sufficient)
+- Locale, timezone, or OS version fields