Just for later ideas, store some brainstorming sessions with Claude

stumpylog
2026-06-03 10:49:01 -07:00
parent 6ede72cc44
commit 6fa3e5fac7
5 changed files with 2161 additions and 0 deletions
@@ -0,0 +1,167 @@
# Pluggable Document Storage Design
**Date:** 2026-04-23
**Status:** Approved
## Overview
Replace the hardcoded local filesystem storage in paperless-ngx with a pluggable `DocumentStorage` Protocol. Ship two built-in backends — `LocalFilesystemBackend` (default, zero config change) and `S3CompatibleBackend` (supports AWS S3 and any S3-compatible endpoint). Third parties can implement the Protocol to provide their own backends.
## Scope
- **In scope:** original documents, PDF/A archives
- **Out of scope:** thumbnails (stay on local filesystem, regenerable), consumption directory (stays local)
- **Frontend impact:** none — S3 is invisible; Django proxies all file access
## Protocol
Defined in `src/paperless/storage.py`:
```python
from typing import IO, Iterable, Protocol, Self  # Self needs Python 3.11+ (else typing_extensions)


class DocumentStorage(Protocol):
    def __enter__(self) -> Self: ...
    def __exit__(self, exc_type, exc_val, exc_tb) -> None: ...
    def open(self, name: str) -> IO[bytes]: ...
    def save(self, name: str, content: IO[bytes]) -> str: ...  # returns actual name used
    def delete(self, name: str) -> None: ...
    def exists(self, name: str) -> bool: ...
    def move(self, old_name: str, new_name: str) -> None: ...
    def list_files(self, prefix: str = "") -> Iterable[str]: ...
    def size(self, name: str) -> int: ...
```
`name` is always the relative key as stored in the DB (e.g. `2024/my-invoice.pdf`). All operations including `open()` must be called within a `with storage:` block — the context manager handles connection lifecycle and backend-specific cleanup.
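To make the contract concrete, here is a minimal sketch of a third-party backend. `InMemoryBackend` is a hypothetical class invented for illustration (it is not part of the design), and the `RuntimeError` raised outside a `with` block is an assumed way of enforcing the context-manager rule.

```python
import io
from typing import IO, Iterable


class InMemoryBackend:
    """Hypothetical dict-backed backend; structurally satisfies DocumentStorage."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        self._active = False

    def __enter__(self) -> "InMemoryBackend":
        self._active = True
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self._active = False

    def _require_context(self) -> None:
        # Assumed enforcement of the "always inside a with block" rule.
        if not self._active:
            raise RuntimeError("storage used outside a 'with storage:' block")

    def open(self, name: str) -> IO[bytes]:
        self._require_context()
        return io.BytesIO(self._blobs[name])

    def save(self, name: str, content: IO[bytes]) -> str:
        self._require_context()
        self._blobs[name] = content.read()
        return name

    def delete(self, name: str) -> None:
        self._require_context()
        self._blobs.pop(name, None)

    def exists(self, name: str) -> bool:
        self._require_context()
        return name in self._blobs

    def move(self, old_name: str, new_name: str) -> None:
        self._require_context()
        self._blobs[new_name] = self._blobs.pop(old_name)

    def list_files(self, prefix: str = "") -> Iterable[str]:
        self._require_context()
        return sorted(n for n in self._blobs if n.startswith(prefix))

    def size(self, name: str) -> int:
        self._require_context()
        return len(self._blobs[name])


storage = InMemoryBackend()
with storage:
    storage.save("2024/my-invoice.pdf", io.BytesIO(b"%PDF-1.7"))
    storage.move("2024/my-invoice.pdf", "2024/invoice.pdf")
    names = list(storage.list_files("2024/"))
    size = storage.size("2024/invoice.pdf")
```

The class does not inherit from the Protocol; structural typing is enough for a type checker to accept it wherever a `DocumentStorage` is expected.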
## Storage Instances
Two module-level singletons in `src/paperless/storage.py`, each an instance of the configured backend class:
```python
original_storage: DocumentStorage = _build("originals")
archive_storage: DocumentStorage = _build("archive")
```
`_build(prefix)` reads `PAPERLESS_DOCUMENT_STORAGE_BACKEND` and `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` from settings, instantiates the backend class with the configured options plus the paperless-controlled prefix. The prefix distinguishes originals from archives within the same bucket or directory root — it is not stored in the DB key.
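One possible shape for the loading step is sketched below. `build_storage` is a hypothetical helper that takes the settings values as arguments instead of reading them from Django settings, and passing the prefix as a `prefix` keyword argument is an assumption.

```python
from importlib import import_module
from typing import Any


def build_storage(dotted_path: str, options: dict[str, Any], prefix: str) -> Any:
    """Import ``dotted_path`` and instantiate it with the options plus the prefix."""
    module_path, _, class_name = dotted_path.rpartition(".")
    backend_cls = getattr(import_module(module_path), class_name)
    # Assumption: the paperless-controlled prefix arrives as a keyword argument.
    return backend_cls(**options, prefix=prefix)


# ``types.SimpleNamespace`` stands in for a real backend class here,
# only because it accepts arbitrary keyword arguments.
demo = build_storage("types.SimpleNamespace", {"bucket_name": "my-docs"}, "originals")
```

Django's `django.utils.module_loading.import_string` performs the same dotted-path import and would be a natural fit inside the real `_build`.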
## Configuration
Two new settings, using the existing key-value dict mechanism:
| Setting | Default | Description |
| ------------------------------------ | ------------------------------------------ | ------------------------------------------------------------ |
| `PAPERLESS_DOCUMENT_STORAGE_BACKEND` | `paperless.storage.LocalFilesystemBackend` | Dotted Python path to any class satisfying `DocumentStorage` |
| `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` | `{}` | Dict of kwargs passed to the backend constructor |
**Example — S3-compatible:**
```
PAPERLESS_DOCUMENT_STORAGE_BACKEND=paperless.storage.S3CompatibleBackend
PAPERLESS_DOCUMENT_STORAGE_OPTIONS={"bucket_name": "my-docs", "endpoint_url": "https://s3.wasabi.com", "region_name": "us-east-1", "access_key": "...", "secret_key": "..."}
```
Existing users set nothing — `LocalFilesystemBackend` with no options is the default.
## Built-in Backends
### `LocalFilesystemBackend`
- `__enter__`: initialises tracking of directories affected during the context
- `__exit__`: calls `delete_empty_directories()` for all tracked dirs; no-op on exception
- `open/save/delete/exists/move`: direct `Path` + `shutil` operations rooted at `settings.ORIGINALS_DIR` / `settings.ARCHIVE_DIR` (via the prefix passed by `_build`)
- `move()`: `shutil.move()` — atomic on same filesystem
- `list_files()`: `Path.rglob("*")`
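The directory tracking described for `__enter__` / `__exit__` could look roughly like this. `DirTrackingStorage` is a hypothetical, trimmed-down class that implements only `delete()` and the cleanup, and its upward walk is an assumed stand-in for what `delete_empty_directories()` does.

```python
import tempfile
from pathlib import Path


class DirTrackingStorage:
    """Hypothetical local backend fragment: delete() plus empty-dir cleanup."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._touched: set[Path] = set()

    def __enter__(self) -> "DirTrackingStorage":
        self._touched = set()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        if exc_type is not None:
            return  # no cleanup when the block raised
        for directory in self._touched:
            self._remove_empty_parents(directory)

    def delete(self, name: str) -> None:
        path = self.root / name
        path.unlink()
        self._touched.add(path.parent)

    def _remove_empty_parents(self, directory: Path) -> None:
        # Walk upward removing empty directories, but never the root itself.
        while (
            directory != self.root
            and directory.is_dir()
            and not any(directory.iterdir())
        ):
            directory.rmdir()
            directory = directory.parent


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2024" / "01").mkdir(parents=True)
    (root / "2024" / "01" / "invoice.pdf").write_bytes(b"data")
    with DirTrackingStorage(root) as storage:
        storage.delete("2024/01/invoice.pdf")
    year_dir_removed = not (root / "2024").exists()
    root_kept = root.exists()
```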
### `S3CompatibleBackend`
- Wraps `django-storages` S3 backend (`storages.backends.s3boto3.S3Boto3Storage`) for `open`, `save`, `delete`, `exists`
- `__enter__`: initialises boto3 client/session
- `__exit__`: no cleanup required (no empty directory concept on S3)
- `move()`: boto3 `copy_object` (server-side, no data transfer) + `delete_object`
- `open()`: returns streaming S3 response body; caller's `with` block closes the HTTP connection
- `list_files()`: S3 `list_objects_v2` with prefix
- Works with any S3-compatible endpoint via `endpoint_url` option
## Data Migration
One Django migration strips the stored prefix from existing rows:
- `document.filename`: `documents/originals/2024/invoice.pdf` → `2024/invoice.pdf`
- `document.archive_filename`: `documents/archive/2024/invoice.pdf` → `2024/invoice.pdf`
The prefix is now owned by the storage instance, not the DB key.
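The per-row transformation is small enough to sketch as a pure function. `strip_storage_prefix` is a hypothetical helper name; the real migration would presumably call something like it from a `RunPython` step, which is not shown here.

```python
from typing import Optional


def strip_storage_prefix(name: Optional[str], prefix: str) -> Optional[str]:
    """Return ``name`` without a leading ``prefix``; leave other values alone."""
    if name is None:
        return None  # archive_filename may be unset
    return name.removeprefix(prefix)


original = strip_storage_prefix(
    "documents/originals/2024/invoice.pdf", "documents/originals/"
)
archive = strip_storage_prefix(
    "documents/archive/2024/invoice.pdf", "documents/archive/"
)
# A row that was already stripped is returned unchanged, so a re-run is harmless.
already_clean = strip_storage_prefix("2024/invoice.pdf", "documents/originals/")
```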
## `migrate_storage` Management Command
```
manage.py migrate_storage [--dry-run] [--no-delete]
[--source-backend=<dotted.path>] [--source-options=<json>]
```
Transfers all document files from one storage backend to another. The user updates `PAPERLESS_DOCUMENT_STORAGE_BACKEND` in their config first, then runs this command to move existing files.
The destination is always the currently configured backend (from settings). The source is specified via `--source-backend` / `--source-options`, defaulting to `LocalFilesystemBackend` with no options if omitted (covering the most common migration path: local → S3).
**Flow:**
1. Instantiate source backend (from CLI args or default) and destination backend (from current settings)
2. Iterate `Document.objects.only("filename", "archive_filename")`
3. For each file (original + archive):
- Skip with warning if missing from source
- Skip silently if already present on destination (idempotent — safe to re-run)
- Copy: `destination.save(name, source.open(name))`
- Unless `--no-delete`: `source.delete(name)`
4. Report counts: moved / skipped / failed
5. `--dry-run`: prints actions without touching files
Individual failures are logged and counted but do not abort the run. Bidirectional: local → S3, S3 → local, S3 → S3.
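The flow above can be sketched against two in-memory stand-ins. `DictStorage` and `migrate` are hypothetical names; the sketch leaves out the `with storage:` blocks and logging for brevity, and counting dry-run candidates as "moved" is an assumption.

```python
import io
from dataclasses import dataclass, field
from typing import IO


@dataclass
class DictStorage:
    """Hypothetical stand-in backend holding file bodies in a dict."""

    blobs: dict[str, bytes] = field(default_factory=dict)

    def exists(self, name: str) -> bool:
        return name in self.blobs

    def open(self, name: str) -> IO[bytes]:
        return io.BytesIO(self.blobs[name])

    def save(self, name: str, content: IO[bytes]) -> str:
        self.blobs[name] = content.read()
        return name

    def delete(self, name: str) -> None:
        del self.blobs[name]


def migrate(names, source, destination, *, dry_run=False, delete=True):
    """Copy each name from source to destination and return the counts."""
    counts = {"moved": 0, "skipped": 0, "failed": 0}
    for name in names:
        if not source.exists(name) or destination.exists(name):
            counts["skipped"] += 1  # missing at source, or already migrated
            continue
        if dry_run:
            counts["moved"] += 1  # would be moved; nothing is touched
            continue
        try:
            destination.save(name, source.open(name))
            if delete:
                source.delete(name)
            counts["moved"] += 1
        except Exception:
            counts["failed"] += 1  # a real command would log and keep going
    return counts


all_names = ["2024/a.pdf", "2024/b.pdf", "2024/missing.pdf"]
src = DictStorage({"2024/a.pdf": b"a", "2024/b.pdf": b"b"})
dst = DictStorage({"2024/b.pdf": b"b"})
first = migrate(all_names, src, dst)
second = migrate(all_names, src, dst)  # re-run after the first pass
```

The second call moves nothing, which is the idempotence property the command relies on after a partial failure.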
## Files to Create
| File | Purpose |
| ------------------------------------------------------- | ------------------------------------------------------------------------------ |
| `src/paperless/storage.py` | Protocol, built-in backends, `original_storage` / `archive_storage` singletons |
| `src/documents/management/commands/migrate_storage.py` | Migration command |
| `src/documents/migrations/XXXX_strip_storage_prefix.py` | Strip prefix from existing filename rows |
## Files to Modify
| File | Change |
| -------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `src/paperless/settings/__init__.py` | Add `PAPERLESS_DOCUMENT_STORAGE_BACKEND`, `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` |
| `src/documents/models.py` | `source_file`, `archive_file` use storage instances; `source_path` returns temp file for subprocess callers |
| `src/documents/consumer.py` | `_write()` → `storage.save()`; remove `mkdir` calls |
| `src/documents/signals/handlers.py` | `shutil.move()` → `storage.move()`; remove `create_source_path_directory` / `delete_empty_directories` callsites |
| `src/documents/tasks.py` | Same as signals |
| `src/documents/file_handling.py` | `exists()` checks and directory references use storage API |
| `src/documents/views/` | File-serving views use `storage.open()` within context; wrap for `FileResponse` lifecycle |
| `src/documents/management/commands/document_importer.py` | Replace `Path.glob()` and direct copies with storage API |
| `src/documents/management/commands/document_exporter.py` | Replace direct file copies and `FileLock`-guarded writes with storage API |
## Locking & Concurrency
The codebase serialises all document file write/move operations with `FileLock(settings.MEDIA_LOCK)`, where `MEDIA_LOCK = MEDIA_ROOT / "media.lock"`. This is used in `consumer.py`, `signals/handlers.py`, `tasks.py`, `mail.py`, `document_importer.py`, and `document_exporter.py`.
**The lock file stays on the local filesystem regardless of backend.** `MEDIA_LOCK` lives under `MEDIA_ROOT`, which is the local path even when documents are stored on S3. This means:
- **Single-host deployments** (the common case — Docker Compose, single server): the `FileLock` continues to work correctly. All Celery workers and the Django process share the same lock file. No change required.
- **Multi-host deployments**: the `FileLock` is already broken for these today — each host has its own lock file. This is a pre-existing limitation and is out of scope for this feature.
**Callsite structure** — the storage context manager nests inside the existing lock, preserving current behaviour:
```python
with FileLock(settings.MEDIA_LOCK):
    with original_storage as storage:
        storage.move(old_name, new_name)
```
**`generate_unique_filename` race:** this function checks `storage.exists()` then saves, which is not atomic on S3. The `FileLock` already serialises this on a single host. For multi-host this is a pre-existing gap — not introduced by this feature.
**Future path for multi-host:** replace `FileLock` with a database-level advisory lock or Redis lock. Out of scope here.
## Key Invariants
- The context manager is required for all storage operations, including reads
- `name` is always the relative key — never an absolute path or URL
- The backend prefix (`originals` / `archive`) is paperless-controlled and never stored in the DB
- `LocalFilesystemBackend` is the default — existing deployments require no config change
- The migrate command is idempotent and can be re-run after partial failure
@@ -0,0 +1,253 @@
# Workflow Runner Refactor — Design
**Date:** 2026-05-19
**Branch base:** `dev`
**Status:** Approved design, pending implementation plan
## Problem
Workflow execution and the Django signal layer have repeatedly produced fragile,
hard-to-fix bugs (see the revert/refix history around password removal: #12803,
#12814, #12716, and the filename race #12386). Three structural causes:
1. **`run_workflows` is dual-mode.** A single function handles both consumption
(mutating a `DocumentMetadataOverrides`) and post-save (mutating a real
`Document`), branching on a `use_overrides` flag. The branching is
concentrated in two places — the action dispatch inside `run_workflows`
(`handlers.py:931-1001`) and `build_workflow_action_context`
(`actions.py:33-83`), each with two full code paths. The `apply_*` helpers in
`workflows/mutations.py` are _already_ split by target type
(`apply_assignment_to_document` vs `apply_assignment_to_overrides`, etc.); the
refactor unifies their callers, not the helpers themselves.
2. **File location is an implicit, timing-dependent side channel.** The
`DOCUMENT_ADDED` workflow fires from `run_workflows_added`, which runs while
the consumer is still inside its transaction — _before_ the consumed file is
copied to `document.source_path` (`document_consumption_finished` is sent at
`consumer.py:658`; the file copy happens after, at `consumer.py:670+`). The
staged path is therefore threaded through as `original_file` /
`caller_supplied_original_file` parameters. Actions that read the file
(password removal, email attachments) depend on this plumbing being correct.
3. **The workflow run races the filename rename.** `update_filename_and_move_files`
is a raw `post_save` receiver on `Document`. When a workflow persists its
changes via `document.save(update_fields=[...])`, that save fires `post_save`
and runs the rename _while the workflow is still executing_. Under concurrent
Celery/UI updates the interleaved `refresh_from_db()` calls corrupt state. The
comment at `handlers.py:980-984` — deliberately excluding `filename` /
`archive_filename` from the workflow save — is a load-bearing workaround for
exactly this.
Note: `run_workflows_added` / `run_workflows_updated` are connected to the
_custom_ signals `document_consumption_finished` / `document_updated`, fired
explicitly by paperless code in a handful of known sites — not to raw Django
`post_save`. Only `update_filename_and_move_files` is a raw `post_save` receiver.
This refactor does not change where workflows are triggered from.
## Scope
In scope:
- Refactor `run_workflows` and its action helpers around an execution-context
abstraction.
- Delete the `original_file` side-channel plumbing.
- Make the workflow-execution → persist → rename sequence explicit and
deterministic.
Out of scope:
- Changing where/when workflows are triggered (custom signal call sites unchanged).
- Reworking the matching logic (`matching.document_matches_workflow`).
- Any change to workflow models, serializers, or the REST API.
## Design
### 1. `WorkflowRunContext` protocol
New module `documents/workflows/context.py` defining a `typing.Protocol`:
```
WorkflowRunContext (Protocol)
    source_file: Path                      # where the file actually is, now
    build_placeholder_context() -> dict
    apply_assignment(action) -> None
    apply_removal(action) -> None
    persist() -> None                      # commit accumulated mutations
    record_run(workflow, trigger_type) -> None
```
Two concrete implementations (which need not import the Protocol — structural
typing):
- **`ConsumptionContext`** — wraps `ConsumableDocument` + `DocumentMetadataOverrides`.
`source_file` returns the staged file path. Mutations land on the overrides.
`persist()` is a no-op (the overrides object is returned to the caller).
- **`PersistedContext`** — wraps a real `Document`. Mutations land on the
in-memory `Document`. `persist()` performs a single save.
**Context selection:** `run_workflows` picks the context from the call shape:
- CONSUMPTION trigger (`ConsumableDocument` + non-`None` `overrides`) →
`ConsumptionContext`.
- DOCUMENT_ADDED / DOCUMENT_UPDATED / SCHEDULED (a real `Document`,
`overrides=None`) → `PersistedContext`.
**`source_file` for `PersistedContext`.** It cannot unconditionally return
`document.source_path`: for the `DOCUMENT_ADDED` trigger the file has not yet
been moved there. The staged path is therefore passed into the `PersistedContext`
_at construction time_ by `run_workflows_added` (which still receives it from the
`document_consumption_finished` signal). `source_file` returns that staged path
when supplied, otherwise `document.source_path`. This relocates the staged-path
information from a chain of function parameters into a single piece of
construction state — the `original_file` / `caller_supplied_original_file`
_parameter plumbing_ through `run_workflows` and the action helpers is what gets
deleted, not the staged path itself.
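A minimal sketch of that fallback, assuming the staged path arrives as an optional `staged_path` constructor argument (the argument name is hypothetical) and using `FakeDocument` as a stand-in for the real model:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FakeDocument:
    """Hypothetical stand-in for the Document model."""

    source_path: Path


@dataclass
class PersistedContext:
    document: FakeDocument
    staged_path: Optional[Path] = None  # assumed name for the construction argument

    @property
    def source_file(self) -> Path:
        # Prefer the staged path while the file has not been moved yet.
        if self.staged_path is not None:
            return self.staged_path
        return self.document.source_path


doc = FakeDocument(source_path=Path("/media/originals/2024/invoice.pdf"))
added = PersistedContext(doc, staged_path=Path("/tmp/consume/invoice.pdf"))
updated = PersistedContext(doc)
```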
`WorkflowRunContext` is a plain `Protocol`, not `@runtime_checkable` — the runner
constructs the context itself, so no `isinstance` check is needed. Genuinely
shared logic goes into module-level helper functions, not a base class.
### 2. `run_workflows` becomes branch-free
`run_workflows` keeps its current public signature so all call sites are
unchanged. Its body:
1. Construct the appropriate context once, from the argument types.
2. Run a single flat match-and-dispatch loop over matching workflows/actions,
delegating every action to context methods.
No `use_overrides` flag anywhere. The branching currently scattered across
`run_workflows`, `build_workflow_action_context`, and the `apply_*` helpers
collapses into the two context classes.
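The flat loop might look like the sketch below. The Protocol is trimmed to the methods the loop touches, actions are plain dicts with hypothetical `type` / `id` keys, and `run_actions` and `RecordingContext` are illustrative names, not the real runner or its test double.

```python
from pathlib import Path
from typing import Protocol


class WorkflowRunContext(Protocol):
    source_file: Path

    def apply_assignment(self, action: dict) -> None: ...
    def apply_removal(self, action: dict) -> None: ...
    def persist(self) -> None: ...


class RecordingContext:
    """Fake context that records calls; it satisfies the Protocol structurally."""

    def __init__(self) -> None:
        self.source_file = Path("staged.pdf")
        self.calls: list[str] = []

    def apply_assignment(self, action: dict) -> None:
        self.calls.append(f"assign:{action['id']}")

    def apply_removal(self, action: dict) -> None:
        self.calls.append(f"remove:{action['id']}")

    def persist(self) -> None:
        self.calls.append("persist")


def run_actions(context: WorkflowRunContext, workflows: list[list[dict]]) -> None:
    """Flat dispatch: the loop never asks which kind of context it holds."""
    dispatch = {
        "assignment": context.apply_assignment,
        "removal": context.apply_removal,
    }
    for actions in workflows:
        for action in actions:
            dispatch[action["type"]](action)
        context.persist()  # assumed: one persist per matching workflow


ctx = RecordingContext()
run_actions(
    ctx,
    [
        [{"type": "assignment", "id": 1}, {"type": "removal", "id": 2}],
        [{"type": "assignment", "id": 3}],
    ],
)
```

A recording fake like this is also the shape the runner-loop tests would need: no database document, no staged files, no signals.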
### 3. File staging via `source_file`
`source_file` is a property of the context, fixed at construction. The
`original_file` and `caller_supplied_original_file` parameters threaded through
`run_workflows` and the `execute_*` helpers are deleted; each context resolves
the path itself (see "Context selection" above).
**Deferred password removal.** `execute_password_removal_action`, when given a
`ConsumableDocument`, currently installs a one-shot handler on
`document_consumption_finished` that picks up `original_file` from `kwargs`
later (`actions.py:295-308`). This deferred hook lives outside the context
abstraction. The refactor must explicitly decide its fate: either keep it as-is
(the context still constructs correctly around it) or fold the deferral into
`ConsumptionContext`. This is called out as an open implementation decision, not
silently absorbed.
### 4. Explicit workflow → persist → rename sequencing
What must be deferred is the **file rename**, not the DB save. `run_workflows`
keeps its per-workflow `document.refresh_from_db()` at the top of each iteration
— that is deliberate concurrency protection against `bulk_update_documents`
running simultaneously. Deferring all saves to a single final `persist()` would
let one workflow's refresh wipe a prior workflow's in-memory changes. So:
1. `run_workflows` refreshes and applies actions per workflow, and
`PersistedContext.persist()` saves after each matching workflow, as today.
2. The save deliberately **continues to exclude** `filename` /
`archive_filename` from `update_fields`. This is not duct tape: it guards a
_cross-process_ hazard — another Celery task may have moved the file and
written `filename` to the DB, and a stale in-memory `filename` in our save
would revert it. The `ContextVar` guard (below) only addresses _intra-process_
ordering, so this exclusion stays.
3. The rename is suppressed for the whole run and invoked **exactly once,
afterward**, against final committed state.
The actual race being fixed: `apply_assignment_to_document` assigns tags via
`document.add_nested_tags(...)`, which fires `m2m_changed` on
`Document.tags.through` _before_ the workflow's `document.save()`. The
`m2m_changed` receiver `update_filename_and_move_files` then calls
`refresh_from_db()`, wiping the workflow's in-memory correspondent/type, and
moves the file to a path computed from stale metadata. The guard prevents this.
To stop the rename from firing mid-workflow, a **`ContextVar` guard** is
introduced (e.g. `documents/workflows/context.py` module-level
`_workflow_in_progress: ContextVar[bool]`). `update_filename_and_move_files`
checks the guard and early-returns when set. `run_workflows` wraps its **entire**
persisted-path execution — not just the `persist()` call — in a context manager
that sets the guard via `set()`/`reset(token)`. Token-based reset is
reentrancy-safe for nested saves or nested workflow runs.
The guard must span the whole execution, not just `persist()`, because
`update_filename_and_move_files` is _also_ registered to `m2m_changed` on
`Document.tags.through` and to `post_save` on `CustomFieldInstance`
(`handlers.py:431-432`). A workflow action that assigns tags or custom fields
would otherwise trigger a rename mid-workflow through those signals.
After execution completes, `run_workflows` calls `persist()` once and then
explicitly invokes the move logic once. The `ContextVar` is set/reset in the
same thread that runs these receivers synchronously, so they always observe the
value. (Celery `prefork` workers run each task in its own process; greenlet
pools are also `contextvars`-aware — non-issues, noted for completeness.)
The move body of `update_filename_and_move_files` is extracted into a plain
callable that the runner invokes directly. The function is already invoked
directly (as a plain call, bypassing the decorator) for version documents at
`handlers.py:664-667`, so this extraction has precedent. The thin `post_save`
receiver remains as a guard-checking wrapper.
The two `post_save` receivers on `Document` are `update_filename_and_move_files`
(`handlers.py:433`) and `update_llm_suggestions_cache` (`handlers.py:740`). The
`ContextVar` guard suppresses **only** the former — `update_llm_suggestions_cache`
keeps running normally, as do `document_consumption_finished` receivers such as
`add_or_update_document_in_llm_index` (which is _not_ a `post_save` receiver).
This is why the guard is preferred over persisting with `.update()`, which would
silently suppress _all_ `post_save` receivers including
`update_llm_suggestions_cache`.
`WorkflowRun.objects.create(...)` is created per matching workflow as today
(`handlers.py:998-1002`); it is a separate model and is not deferred.
The comment at `handlers.py:980-984` is updated to describe the new flow
(per-workflow save under the guard; single explicit rename afterward) but the
`filename` / `archive_filename` exclusion it documents is kept — see point 2
above.
## Testing
- **Runner loop** — exercised against a fake context implementing the
`WorkflowRunContext` surface that records `apply_assignment` / `apply_removal`
/ `persist` calls. No DB document, no staged files, no signals.
- **Concrete contexts** — `ConsumptionContext` and `PersistedContext` each get
focused tests: given an action, assert the mutation lands on the overrides vs.
the document, and that `source_file` resolves to the staged vs. final path.
- **ContextVar guard** — assert `update_filename_and_move_files` early-returns
while the guard is set, and that the rename runs exactly once after
`persist()`.
- **Regression: the racy case** — a workflow that reassigns metadata while the
document is subject to a filename template; assert final DB filename and file
location are consistent (the #12386 scenario).
- **Regression safety net** — the existing `test_workflows.py` suite (~100
tests; ~19 `document_consumption_finished.send` sites plus many direct
`run_workflows(...)` calls for the `DOCUMENT_UPDATED` path) must stay green
**unchanged**. A test that needs editing signals a behavior change to flag
explicitly, not a silent refactor outcome.
Per project conventions: tests grouped under classes, fixtures and test
signatures fully type-annotated.
## Implementation sequence
Each step is independently reviewable and keeps the test suite green:
1. Introduce the `Protocol` + the two contexts; `run_workflows` delegates to
them. Pure refactor, no behavior change.
2. Move the staged path into `PersistedContext` construction (passed by
`run_workflows_added`); delete the `original_file` /
`caller_supplied_original_file` parameter plumbing through `run_workflows`
and the `execute_*` helpers.
3. Extract the move body from `update_filename_and_move_files` into a callable;
add the `ContextVar` guard; `run_workflows` invokes the move once after the
run completes. The `filename` / `archive_filename` exclusion in the
per-workflow save is kept; only the comment at `handlers.py:980-984` is
updated to describe the new flow.
## Pain points addressed
- **Dual-mode** → eliminated by the `Protocol` + two contexts; no `use_overrides`.
- **File staging** → `source_file` is a context property; side-channel args deleted.
- **Rename race** → per-workflow save under a `ContextVar` guard that suppresses
the mid-workflow rename; a single explicit rename runs once at the end against
final state.
@@ -0,0 +1,215 @@
# AI Suggestions: Inject existing taxonomy as candidates
**Status:** Design (v2 — frequency-only)
**Date:** 2026-05-20
**Related:** [Discussion #12787](https://github.com/paperless-ngx/paperless-ngx/discussions/12787)
**Branch target:** `dev`
## Problem
AI Suggestions currently asks the LLM for free-form tag/document-type/correspondent/storage-path names, then reconciles via `difflib` fuzzy matching (cutoff 0.8) in `paperless_ai/matching.py`. This works for typos but not for semantic equivalents:
- `blood test` does not fuzzy-match `Bloodwork`
- `IRS` does not fuzzy-match `Taxes`
- `doctor visit` does not fuzzy-match `Medical`
Result: the LLM invents new metadata names that duplicate existing taxonomy entries.
## Goal
Tell the LLM what already exists, so it can prefer existing names verbatim. Fuzzy matching becomes the fallback for typos and for legitimately novel suggestions, not the primary semantic-equivalence mechanism.
Non-goals: changing the LLM client, embedding model selection, or RAG retrieval. Replacing fuzzy matching entirely. Custom-field option values. Embedding-based shortlisting (deferred to a v2 if frequency proves insufficient).
## Approach
For each of Tags, DocumentTypes, Correspondents, StoragePaths:
1. Take the user-visible queryset (owner-aware, matching `matching.py`).
2. Annotate by document-usage count and take the top `X` names by frequency. `X` is a configurable per-category cap (a single setting, applied to all four categories).
3. Inject those names into the LLM prompt as "Available <category>" blocks, with the instruction to prefer them verbatim.
4. When the LLM responds, tell `matching.py` which names were hinted so an exact normalized match short-circuits past fuzzy. Names not in the hint list keep today's fuzzy fallback.
No FAISS index, no signals, no Celery tasks, no locks. Pure DB-side queries on each suggestion request.
## Components
### `paperless_ai/taxonomy.py` (new)
```python
class TaxonomyHints(TypedDict):
    tags: list[str]
    document_types: list[str]
    correspondents: list[str]
    storage_paths: list[str]


def build_taxonomy_hints(document: Document, user: User | None) -> TaxonomyHints: ...
def format_hints_for_prompt(hints: TaxonomyHints) -> str: ...
```
Internals:
- `_visible_queryset(model_cls, perm: str, user)` — wraps `get_objects_for_user_owner_aware` exactly as `matching.py` does. If `user` is `None`, returns the unfiltered manager queryset (parity with how `matching.py` behaves today).
- `_shortlist_by_frequency(queryset, max_per_category)` — DB-side:
```python
return list(
    queryset
    .annotate(usage=Count("documents"))
    .order_by("-usage", "name")
    .values_list("name", flat=True)[:max_per_category]
)
```
Confirmed reverse relation name is `documents` for all four models (`documents/models.py:164,173,184,211`). Secondary order by `name` keeps results stable when usage ties (common with 0-usage tails). `StoragePath` uses the human `name` field, not the `path` template.
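For intuition, the ordering the query asks for can be reproduced in plain Python. This is an in-memory analogue, not the ORM call: `shortlist_by_frequency` here takes hypothetical lists instead of a queryset, and a database collation may order tied names differently from Python's string comparison.

```python
from collections import Counter


def shortlist_by_frequency(names, usages, max_per_category):
    """Order names by descending usage, then by name, and keep the top N."""
    counts = Counter(usages)
    ranked = sorted(names, key=lambda name: (-counts[name], name))
    return ranked[:max_per_category]


all_tags = ["Taxes", "Medical", "Bloodwork", "Receipts", "Archive"]
# One entry per (document, tag) pair: Taxes x3, Medical x2, Bloodwork x1.
tag_per_document = ["Taxes", "Medical", "Taxes", "Bloodwork", "Medical", "Taxes"]
top_three = shortlist_by_frequency(all_tags, tag_per_document, 3)
everything = shortlist_by_frequency(all_tags, tag_per_document, 5)
```

The two unused tags tie at zero usage, so the secondary sort on name is what keeps their order stable between requests.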
`format_hints_for_prompt` emits one `Available <category>:` block per non-empty category. Empty categories produce no block (avoid prompting the LLM with "Available tags: (none)"). A single instruction line follows:
```
Prefer existing names from these lists verbatim. Only propose a new value
if none of the existing names fits.
```
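One way `format_hints_for_prompt` could satisfy those rules is sketched here. The human-readable labels and the bulleted layout inside each block are assumptions; the spec only fixes the `Available <category>:` shape and the instruction text.

```python
from typing import TypedDict


class TaxonomyHints(TypedDict):
    tags: list[str]
    document_types: list[str]
    correspondents: list[str]
    storage_paths: list[str]


# Assumed block labels, keyed by the TaxonomyHints field names.
_LABELS = {
    "tags": "tags",
    "document_types": "document types",
    "correspondents": "correspondents",
    "storage_paths": "storage paths",
}
_INSTRUCTION = (
    "Prefer existing names from these lists verbatim. Only propose a new value\n"
    "if none of the existing names fits."
)


def format_hints_for_prompt(hints: TaxonomyHints) -> str:
    blocks = [
        f"Available {label}:\n" + "\n".join(f"- {name}" for name in hints[key])
        for key, label in _LABELS.items()
        if hints[key]  # empty categories produce no block
    ]
    if not blocks:
        return ""  # no stray instruction line
    return "\n\n".join(blocks + [_INSTRUCTION])


prompt = format_hints_for_prompt(
    {
        "tags": ["Bloodwork", "Taxes"],
        "document_types": [],
        "correspondents": ["IRS"],
        "storage_paths": [],
    }
)
empty = format_hints_for_prompt(
    {"tags": [], "document_types": [], "correspondents": [], "storage_paths": []}
)
```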
### `paperless_ai/ai_classifier.py` (modify)
Required signature change (the v1 spec missed this — flagged by code review):
- `build_prompt_without_rag(document, user: User | None = None)` — currently takes only `document`; add `user` with `None` default to keep call sites simple.
- `build_prompt_with_rag(document, user: User | None = None)` — already takes `user`; its existing call to `build_prompt_without_rag(document)` at `ai_classifier.py:39` is updated to pass `user` through.
Both prompt builders accept an optional `hints: TaxonomyHints | None = None` parameter. When non-`None`, `format_hints_for_prompt(hints)` is spliced in before the "Analyze the following document" instruction. When `None` (default), the prompt is built as today.
`get_ai_document_classification(document, user, hints: TaxonomyHints | None = None)` accepts the same optional `hints` and forwards it to the prompt builder. Return shape is **unchanged** (`dict`). The view layer owns hint construction so the same `TaxonomyHints` object can be used both for the prompt and for `hinted_names` in matching — no need to thread it back out of the classifier. Callers in tests pass `hints=None` (or omit) to preserve existing behavior.
### `paperless_ai/matching.py` (modify)
- `_match_names_to_queryset(names, queryset, attr, hinted_names: set[str] | None = None)`:
- Normalization unchanged.
- Exact-match-on-full-queryset behavior unchanged (always tried first).
- When `hinted_names` is provided and the LLM-returned name (normalized) matches a hinted name (normalized) → treated as exact-only; fuzzy is skipped for that name.
- When `hinted_names` is `None` or the name isn't in it → existing 0.8 fuzzy fallback runs.
- `match_tags_by_name(names, user, hinted_names=None)` etc. — optional kwarg, backward compatible.
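The matching order described above can be sketched over plain strings. `match_names` is a hypothetical simplification that works on a list of names instead of a queryset, and the strip-and-lowercase normalization is an assumption about what `matching.py` does.

```python
import difflib
from typing import Optional


def _normalize(name: str) -> str:
    return name.strip().lower()  # assumed normalization


def match_names(
    names: list[str],
    existing: list[str],
    hinted_names: Optional[set[str]] = None,
    cutoff: float = 0.8,
) -> list[str]:
    """Return the existing names matched by the LLM-returned names."""
    by_normalized = {_normalize(e): e for e in existing}
    hinted = {_normalize(h) for h in (hinted_names or set())}
    matched = []
    for name in names:
        key = _normalize(name)
        if key in by_normalized:  # exact match on the full set, always first
            matched.append(by_normalized[key])
        elif key in hinted:  # hinted name: exact-only, so fuzzy is skipped
            continue
        else:
            close = difflib.get_close_matches(
                key, list(by_normalized), n=1, cutoff=cutoff
            )
            if close:
                matched.append(by_normalized[close[0]])
    return matched


existing = ["Bloodwork", "Taxes", "Medical"]
with_hints = match_names(["bloodwork ", "Medcal"], existing, hinted_names={"Bloodwork"})
semantic_miss = match_names(["blood test"], existing)
```

The last call shows the gap the hints are meant to close: `blood test` falls below the 0.8 cutoff against every existing name, so without hints in the prompt it would surface as a new suggestion.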
### `documents/views.py` (modify)
The suggestion endpoint (around line 1482) is the single production caller of `get_ai_document_classification` and the call site for `match_*_by_name`. Update it to:
1. Build hints once: `hints = build_taxonomy_hints(document, request.user)` (when `AIConfig().taxonomy_hints_enabled` and `max_per_category > 0`; otherwise `hints = None`).
2. Pass `hints` into the classifier: `parsed = get_ai_document_classification(document, request.user, hints=hints)`.
3. Pass `hinted_names=set(hints["tags"])` (etc., one per category, or `None` when `hints` is `None`) into each `match_*_by_name` call.
**Cache interaction:** the AI suggestion path is wrapped by `cached_llm_suggestions` / `refresh_suggestions_cache` (views.py:1477). A cached response bypasses the LLM call entirely — so changes to hints config don't take effect until the cache entry is invalidated. Acceptable for v1 (cache is short-lived). If experience shows users change the toggle and expect immediate effect, follow up by including a hash of the hint-relevant config (`taxonomy_hints_enabled`, `_max`) in the cache key.
### `paperless/config.py` (`AIConfig`) + DB model + settings
`AIConfig.__post_init__` reads values from the `ApplicationConfiguration` DB row **and** falls back to `settings.*` constants (pattern at `paperless/config.py:207` for `ai_enabled`). Both layers are needed.
Two new fields, threaded through three places:
1. **`paperless/settings/*.py`** — add module-level constants read from env:
- `AI_TAXONOMY_HINTS: bool = __get_boolean("PAPERLESS_AI_TAXONOMY_HINTS", "yes")` (default on)
- `AI_TAXONOMY_HINTS_MAX: int = int(os.getenv("PAPERLESS_AI_TAXONOMY_HINTS_MAX", "30"))`
2. **`paperless/models.py` (`ApplicationConfiguration`)** — add two nullable columns:
- `taxonomy_hints_enabled = models.BooleanField(null=True)`
- `taxonomy_hints_max_per_category = models.PositiveSmallIntegerField(null=True)` (range 0 to 32767; `PositiveSmallIntegerField` is sufficient)
- One Django migration.
3. **`paperless/config.py` (`AIConfig`)** — read with **explicit None check, not `or`** (because `0` and `False` are legitimate user values that would otherwise silently fall back to the settings default):
```python
self.taxonomy_hints_enabled = (
    app_config.taxonomy_hints_enabled
    if app_config.taxonomy_hints_enabled is not None
    else settings.AI_TAXONOMY_HINTS
)
self.taxonomy_hints_max_per_category = (
    app_config.taxonomy_hints_max_per_category
    if app_config.taxonomy_hints_max_per_category is not None
    else settings.AI_TAXONOMY_HINTS_MAX
)
```
(Other fields in this file use `or`; we deliberately diverge here to support `0` and `False`. A short comment in code records why.)
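The difference between the two fallback styles is easy to demonstrate in isolation. Both helper names below are hypothetical and exist only for this comparison.

```python
def resolve_with_or(db_value, default):
    return db_value or default  # treats 0 and False as "unset"


def resolve_with_none_check(db_value, default):
    return db_value if db_value is not None else default


# A user who sets the cap to 0 in the DB row, with a settings default of 30:
cap_with_or = resolve_with_or(0, 30)
cap_with_none_check = resolve_with_none_check(0, 30)

# A user who disables hints in the DB row, with a settings default of True:
enabled_with_or = resolve_with_or(False, True)
enabled_with_none_check = resolve_with_none_check(False, True)

# An unset column still falls back to the settings default:
unset_cap = resolve_with_none_check(None, 30)
```

With `or`, both explicit user choices are silently replaced by the settings default; the `is not None` form keeps them and falls back only when the column is actually unset.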
**Frontend** (`src-ui/src/app/data/paperless-config.ts`): add two entries to the `PaperlessConfigOptions` declarative list (one `Boolean`, one `Number`, `category: ConfigCategory.AI`) plus two fields on the `PaperlessConfig` interface. No component changes; the form is generated from this list.
`paperless.conf.example` and the configuration docs page get entries.
## Data flow
Suggestion request:
1. View checks `AIConfig().taxonomy_hints_enabled`; if enabled, calls `hints = build_taxonomy_hints(document, user)`; otherwise `hints = None`.
2. View calls `parsed = get_ai_document_classification(document, user, hints=hints)`.
3. Classifier splices `format_hints_for_prompt(hints)` into the prompt (when non-`None`), calls LLM, returns parsed dict.
4. View calls `match_*_by_name(names, user, hinted_names=set(hints[<category>]) if hints else None)` per category. Exact-on-hint short-circuit; fuzzy fallback unchanged for misses.
No background processing. No persisted state. Each suggestion request runs four lightweight `Count("documents")` queries, one per model; each is a single `.annotate().order_by().values_list()` chain with no joins beyond the existing reverse relation.
## Error handling
- **Empty visible queryset for a category:** omit that category's block from the prompt.
- **`taxonomy_hints_enabled = False` or `max_per_category = 0`:** `build_taxonomy_hints` returns an empty `TaxonomyHints`; prompt is identical to today; matching is called without `hinted_names`; behavior identical to today.
- **LLM returns a name not in hints but exactly matching an existing visible name:** still treated as exact match. `_match_names_to_queryset` always tries exact-on-full-queryset before fuzzy; `hinted_names` only governs whether fuzzy is consulted for that specific name.
- **DB query failure during shortlist build:** propagate. Suggestion failures already surface as 5xx; adding silent fallbacks here would mask real problems.
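One possible reading of the matching rules above, as a standalone sketch. The function name, the case-insensitive normalization, and the `cutoff` value are assumptions for illustration; the real `_match_names_to_queryset` operates on querysets, not lists:

```python
import difflib
from typing import Optional


def match_name(
    name: str,
    existing_names: list[str],
    hinted_names: Optional[set[str]] = None,
    cutoff: float = 0.8,  # hypothetical threshold
) -> Optional[str]:
    """Return the existing name that `name` refers to, or None."""
    normalized = name.strip().lower()
    by_normalized = {n.strip().lower(): n for n in existing_names}

    # 1. Exact match against every visible name, hinted or not.
    if normalized in by_normalized:
        return by_normalized[normalized]

    # 2. A hinted name that failed the exact check is not passed to fuzzy.
    if hinted_names and normalized in {h.strip().lower() for h in hinted_names}:
        return None

    # 3. Fuzzy fallback for everything else.
    close = difflib.get_close_matches(normalized, list(by_normalized), n=1, cutoff=cutoff)
    return by_normalized[close[0]] if close else None
```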
## Testing
All new and modified tests use pytest style — functions/classes, no `unittest.TestCase` subclasses; `pytest-django` with per-class `@pytest.mark.django_db`; `pytest-mock`'s `mocker` fixture for patching; every fixture parameter, fixture return, and test signature type-annotated. Tests grouped under classes (`class TestBuildTaxonomyHints:`), not flat free functions. Shared fixtures live in `paperless_ai/tests/conftest.py`. Format with `ruff` directly (not `uv run ruff`).
### `paperless_ai/tests/test_taxonomy.py` (new)
- `class TestBuildTaxonomyHints:`
- Returns a `TaxonomyHints` with all four keys.
- Top-K limit respected (`max_per_category` honored from `AIConfig`).
- Frequency ordering: tag used on 5 docs ranks above tag used on 2 docs.
- Tie-break by name (alphabetical) for stable output.
- Owner-aware: user lacking `view_tag` perm gets `tags=[]`; `view_documenttype` likewise per category.
- Empty queryset for a category → empty list; `format_hints_for_prompt` omits the block.
- `taxonomy_hints_enabled=False` returns zero-filled `TaxonomyHints` and runs no taxonomy DB queries (`django_assert_num_queries`).
- `max_per_category=0` same behavior as disabled.
- `StoragePath` shortlist uses the `name` field, not `path` template (asserted on returned values).
- `class TestFormatHintsForPrompt:`
- All four blocks present when all categories non-empty.
- Empty category produces no block.
- All-empty hints produces empty string (no stray instruction line).
- Instruction line appears exactly once when at least one block is rendered.
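The `TestFormatHintsForPrompt` expectations can be read as the following behavior. This is a sketch under assumptions: the instruction wording, the category keys, and the use of a plain `dict` in place of `TaxonomyHints` are all hypothetical.

```python
INSTRUCTION = "Prefer these existing names when they fit:"  # hypothetical wording

# Hypothetical category keys and labels.
LABELS = {
    "tags": "Tags",
    "document_types": "Document types",
    "correspondents": "Correspondents",
    "storage_paths": "Storage paths",
}


def format_hints_for_prompt(hints: dict[str, list[str]]) -> str:
    # One block per non-empty category; empty categories are skipped.
    blocks = [
        f"{label}: {', '.join(hints[key])}"
        for key, label in LABELS.items()
        if hints.get(key)
    ]
    if not blocks:
        return ""  # no stray instruction line when nothing is hinted
    return "\n".join([INSTRUCTION, *blocks])
```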
### `paperless_ai/tests/test_ai_classifier.py` (extend)
- `class TestBuildPrompt:`
- `build_prompt_without_rag(doc, user)` now accepts `user`; produces a prompt containing the hints block when hints are non-empty.
- `build_prompt_with_rag(doc, user)` includes both the RAG context block (unchanged) and the hints block.
- `taxonomy_hints_enabled=False`: prompt matches today's baseline (string equality against a fixture).
- `get_ai_document_classification(doc, user, hints=...)` forwards hints into the prompt; return shape unchanged (still `dict`).
### `paperless_ai/tests/test_matching.py` (extend)
- `class TestHintedMatching:`
- LLM returns `"Bloodwork"` verbatim, `hinted_names={"Bloodwork", ...}` → exact match returned; `difflib.get_close_matches` not called (`mocker.spy` on `difflib.get_close_matches`).
- LLM returns `"blood test"` not in `hinted_names`, no existing exact → fuzzy fallback runs; behavior unchanged from today (regression guard).
- LLM returns `"Bloodwork "` (whitespace) with hinted_names containing `"Bloodwork"` → normalized exact match wins, fuzzy not consulted.
- Backward compatibility: `match_tags_by_name(names, user)` without the kwarg behaves identically to today (snapshot of an existing test, parameterized).
Markers: no `live` marker needed.
## Migration / rollout
- One Django migration adding two columns to `ApplicationConfiguration` (`taxonomy_hints_enabled BooleanField`, `taxonomy_hints_max_per_category PositiveSmallIntegerField`). Both nullable with sensible defaults so existing rows aren't broken.
- Feature defaults to on for new and existing installs. Set `PAPERLESS_AI_TAXONOMY_HINTS=false` (or via the Application Configuration UI) to restore today's behavior.
- Frontend admin form updated to expose the two fields under the existing AI section.
## Open questions deferred to implementation
- `paperless_ai/tests/conftest.py` already exists — verify fixture-naming conventions match before adding new fixtures.
- Confirm `parse_ai_response` doesn't need to know about hints (it's a pure parser; hints flow alongside, not through it).
- The view layer applying `hinted_names` needs to read the same `AIConfig` instance the classifier used; pass the `TaxonomyHints` through the response tuple (chosen) rather than re-deriving in the view.
## Interplay with `extract_unmatched_names`
`extract_unmatched_names` (used downstream of matching) surfaces LLM-returned names that didn't match any existing taxonomy entry — the UI uses these to offer "create new tag?" affordances. With hints in place, fewer names will be unmatched, which is the desired outcome. No behavior change is required: a hinted name that the LLM repeats verbatim will exact-match and not appear in the unmatched list; a name the LLM invents anyway (despite the hint instruction) still flows through fuzzy and, if no match, surfaces as "new" exactly as today. Out of scope: filtering unmatched results based on what was in the hint set.
## Out of scope (potential v2)
- Embedding-based shortlisting (for users with very large taxonomies where frequency misses the right tag). Would re-introduce a FAISS-shaped subsystem with signals, debounce, and locks. Defer until there is evidence that frequency-based shortlisting is insufficient.
- Tag hierarchy awareness — hinting `Medical/Bloodwork` vs `Bloodwork` when tags are nested.
- Custom field option values.
- `StoragePath` template-expression hinting (vs raw `name`).
@@ -0,0 +1,308 @@
# Usage Reporting — Technical Spec
Voluntary, opt-in usage reporting for paperless-ngx. The goal is to
understand how many instances are running a given release (especially
beta), which platforms and architectures are in use, and what features
are being deployed — without collecting any personal data or document
content.
---
## Guiding principles
- **Explicitly opt-in.** Nothing is sent automatically. The user runs
the command and confirms before any network call is made.
- **Transparent.** The exact payload is shown before sending.
- **Anonymous.** The UUID is a random identifier with no link to
identity, IP address, or hostname.
- **Graceful.** Network failures produce a friendly message, never a
stack trace.
---
## Client — management command
### Name
```
manage.py send_usage_report
```
### Flags
| Flag | Behaviour |
| ----------- | --------------------------------------------------------- |
| _(none)_ | Show payload, prompt for confirmation, send on `y`/`yes` |
| `--dry-run` | Show payload, skip confirmation and network call entirely |
### UUID storage
A random UUID4 is generated on the first run and written to
`PAPERLESS_DATA_DIR/usage_uuid` (plain text, one line). Subsequent
runs reuse the same file. If the file is missing it is regenerated
(counts as a new install — acceptable).
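A minimal sketch of that load-or-create behavior. The function name is hypothetical and `data_dir` stands in for `PAPERLESS_DATA_DIR`:

```python
import uuid
from pathlib import Path


def load_or_create_uuid(data_dir: Path) -> str:
    """Reuse the stored installation ID, or generate and persist a new one."""
    uuid_file = data_dir / "usage_uuid"
    if uuid_file.is_file():
        return uuid_file.read_text(encoding="utf-8").strip()
    new_id = str(uuid.uuid4())
    uuid_file.write_text(new_id + "\n", encoding="utf-8")  # plain text, one line
    return new_id
```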
### Confirmation flow
```
The following information will be sent to paperless-ngx to help
improve the project:
Installation ID : a1b2c3d4-e5f6-7890-abcd-ef1234567890
Version : 2.15.0
Channel : beta
Commit : bd86dca57 (built 2026-05-18T12:00:00Z)
Install type : docker
Architecture : x86_64
Python : 3.12.3
Database : postgresql
Documents : 1000-9999
Multi-user : yes
Mail enabled : yes
AI enabled : no
No personal data, document content, or IP address is stored.
More information: https://docs.paperless-ngx.com/usage-reporting/
Send this report? [y/N]:
```
Default answer is **N**. Anything other than `y`/`yes` aborts with
no network call and prints `Nothing sent.`
`--dry-run` skips the prompt entirely and prints `Dry run — nothing sent.`
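The default-to-no rule reduces to a small predicate. In this sketch the helper name is hypothetical, and accepting upper-case or padded input is an assumption the spec does not state:

```python
def is_confirmed(answer: str) -> bool:
    """Only an explicit y/yes counts as consent; the empty default means no."""
    return answer.strip().lower() in {"y", "yes"}
```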
### Network error handling
- Timeout: 10 seconds
- On any failure (timeout, DNS, HTTP error): print a single friendly
line, exit 0 (not an error from the user's perspective)
```
Could not reach the reporting endpoint. Nothing was sent.
```
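A sketch of the failure handling, assuming the standard-library `urllib` purely to keep the example self-contained (the real command may use a different HTTP client). `send_report` and its parameters are hypothetical, and the `429` case described in the next section is not handled here:

```python
import http.client
import json
import urllib.request


def send_report(payload: dict, url: str, timeout: float = 10.0) -> bool:
    """POST the payload as JSON; return True on success, False on any failure."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses, as are socket timeouts.
        print("Could not reach the reporting endpoint. Nothing was sent.")
        return False
```

The caller exits 0 regardless of the return value, so a failed send never looks like a command error.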
### Duplicate submission handling
The server returns `429` if the UUID was seen within the last 7 days,
with a JSON body:
```json
{
"error": "already_submitted",
"last_sent": "2026-05-15T10:00:00Z",
"retry_after_days": 4
}
```
The command prints:
```
Already submitted 3 days ago. Nothing sent.
You can send again after 2026-05-22.
```
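A sketch of how the two lines could be derived from the `429` body. The function name is hypothetical, and computing the next allowed date as the current date plus `retry_after_days` is an assumption:

```python
from datetime import datetime, timedelta


def duplicate_message(body: dict, now: datetime) -> str:
    """Format the friendly message for an already_submitted response."""
    # Accept the trailing "Z" on Python versions whose fromisoformat rejects it.
    last_sent = datetime.fromisoformat(body["last_sent"].replace("Z", "+00:00"))
    days_ago = (now - last_sent).days
    next_allowed = (now + timedelta(days=body["retry_after_days"])).date()
    return (
        f"Already submitted {days_ago} days ago. Nothing sent.\n"
        f"You can send again after {next_allowed.isoformat()}."
    )
```

`now` must be timezone-aware (UTC) so it can be compared with the server timestamp.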
---
## Payload schema
All fields are strings unless noted. Fields marked _omit if absent_
are left out of the JSON entirely when the value is unavailable —
never sent as `null`.
| Field | Source | Notes |
| -------------- | --------------------------------------------------------- | ------------------------------------------------ |
| `uuid` | `PAPERLESS_DATA_DIR/usage_uuid` | UUID4, random |
| `version` | `paperless/version.py` → `__full_version_str__` | e.g. `"2.15.0"` |
| `channel` | `paperless/version.py` → `__channel__` | `"stable"` \| `"beta"` \| `"dev"` |
| `commit` | `paperless/build_info.py` → `SOURCE_COMMIT` | Short SHA, _omit if absent_ |
| `build_date` | `paperless/build_info.py` → `BUILD_DATE` | ISO 8601, _omit if absent_ |
| `install_type` | Detected at runtime (see below) | |
| `arch` | `platform.machine()` | e.g. `"x86_64"`, `"aarch64"` |
| `python` | `platform.python_version()` | e.g. `"3.12.3"` |
| `database` | Last segment of `settings.DATABASES["default"]["ENGINE"]` | e.g. `"postgresql"`, `"sqlite3"` |
| `doc_bucket` | Bucketed document count (see below) | |
| `multi_user` | boolean | `true` if more than one real user account exists |
| `feature_mail` | boolean | `true` if any mail account is configured |
| `feature_ai` | boolean | `true` if AI features are enabled in settings |
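A sketch of the omit-if-absent rule and the `database` derivation. `build_payload` and its signature are hypothetical; only a few fields are spelled out and the rest pass through `**extra`:

```python
from typing import Any, Optional


def build_payload(
    *,
    uuid: str,
    version: str,
    channel: str,
    engine: str,
    commit: Optional[str] = None,
    build_date: Optional[str] = None,
    **extra: Any,
) -> dict[str, Any]:
    payload = {
        "uuid": uuid,
        "version": version,
        "channel": channel,
        "commit": commit,
        "build_date": build_date,
        # "django.db.backends.postgresql" -> "postgresql"
        "database": engine.rsplit(".", 1)[-1],
        **extra,
    }
    # Omit absent fields entirely rather than sending null.
    return {key: value for key, value in payload.items() if value is not None}
```

Filtering on `is not None` rather than truthiness keeps boolean fields such as `multi_user` in the payload when they are `False`.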
### Document count buckets
| Range | Value |
| ------------- | --------------- |
| 0–99 | `"0-99"` |
| 100–999 | `"100-999"` |
| 1 000–9 999 | `"1000-9999"` |
| 10 000–49 999 | `"10000-49999"` |
| 50 000+ | `"50000+"` |
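The bucketing is a simple threshold ladder. A sketch, with a hypothetical function name:

```python
def doc_bucket(count: int) -> str:
    """Map an exact document count to its coarse reporting bucket."""
    if count < 100:
        return "0-99"
    if count < 1_000:
        return "100-999"
    if count < 10_000:
        return "1000-9999"
    if count < 50_000:
        return "10000-49999"
    return "50000+"
```

Only the bucket string leaves the machine; the exact count is never part of the payload.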
### Install type detection
Evaluated in order; first match wins.
| Value | Detection |
| -------------- | ----------------------------------------------------------- |
| `"kubernetes"` | `KUBERNETES_SERVICE_HOST` env var is set |
| `"podman"` | `container` env var equals `"podman"` |
| `"docker"` | `Path("/.dockerenv").exists()` |
| `"nixos"` | `"/nix/store/"` in `sys.executable` |
| `"snap"` | `SNAP` env var is set |
| `"flatpak"` | `FLATPAK_ID` env var is set |
| `"distro"` | `paperless/distro_info.py` exists (set by distro packagers) |
| `"release"` | `paperless/build_info.py` exists (none of the above) |
| `"source"` | Fallback — dev checkout |
Distro packagers (Debian, NixOS community, Unraid, etc.) can opt in
by shipping a `src/paperless/distro_info.py` containing:
```python
DISTRO = "debian" # or "rpm", "homebrew", "unraid", etc.
```
When present the install type is reported as the `DISTRO` value rather
than `"distro"`.
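The first-match-wins order can be sketched as a pure function. The signature is hypothetical: the environment, interpreter path, and file checks are passed in as parameters so the logic can be tested in isolation, and `distro` is the `DISTRO` value when `distro_info.py` is present, otherwise `None`.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def detect_install_type(
    env: Mapping[str, str] = os.environ,
    executable: str = sys.executable,
    dockerenv: Path = Path("/.dockerenv"),
    distro: Optional[str] = None,
    has_build_info: bool = False,
) -> str:
    # Checks run in the table's order; the first hit is returned.
    if "KUBERNETES_SERVICE_HOST" in env:
        return "kubernetes"
    if env.get("container") == "podman":
        return "podman"
    if dockerenv.exists():
        return "docker"
    if "/nix/store/" in executable:
        return "nixos"
    if "SNAP" in env:
        return "snap"
    if "FLATPAK_ID" in env:
        return "flatpak"
    if distro is not None:
        return distro  # reported as the DISTRO value itself
    if has_build_info:
        return "release"
    return "source"
```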
### `version.py` additions
Add `__channel__` alongside the existing version fields:
```python
__channel__: Final[str] = "beta" # "stable" | "beta" | "dev"
```
This is the canonical place to set the channel when preparing a
release. `"dev"` is the default for unreleased branches.
### `build_info.py`
Generated at build time, never committed (add to `.gitignore`).
```python
SOURCE_COMMIT = "bd86dca57"
BUILD_DATE = "2026-05-18T12:00:00Z"
```
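Because the module only exists in built releases, the client has to tolerate its absence. A sketch of one way to read it, with a hypothetical helper name:

```python
import importlib


def load_build_info(module_name: str = "paperless.build_info") -> dict[str, str]:
    """Return the optional payload fields, or an empty dict on a dev checkout."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return {}
    fields = {"commit": "SOURCE_COMMIT", "build_date": "BUILD_DATE"}
    # Skip attributes that are missing or empty so they are omitted, not null.
    return {
        key: getattr(module, attr)
        for key, attr in fields.items()
        if getattr(module, attr, None)
    }
```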
---
## Server — Cloudflare Worker
Managed in a separate repository under the paperless-ngx GitHub org
(e.g. `paperless-ngx/telemetry`). Deployed via Wrangler.
### Endpoint
```
POST /report
Content-Type: application/json
```
Returns `204` on success. No response body.
### Timestamp
`received` is always set server-side. Any client-supplied timestamp
field is ignored.
### Validation
Reject with `400` if any of the following fail:
- `uuid` does not match UUID4 format
- `version` does not match `\d+\.\d+\.\d+`
- `channel` is not one of `stable`, `beta`, `dev`
- `install_type` is not in the known set
- `arch` is absent
- Payload is not valid JSON or exceeds 4 KB
Unknown extra fields are silently ignored (forward compatibility).
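The rules are language-agnostic; they are written in Python here only for consistency with the client, not as the Worker's actual code. The `INSTALL_TYPES` set is a hypothetical "known set" (a real one would also need the distro values), and requiring the whole `version` string to match the pattern is an assumption:

```python
import json
import re

UUID4_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
    re.IGNORECASE,
)
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")
CHANNELS = {"stable", "beta", "dev"}
INSTALL_TYPES = {  # hypothetical known set
    "kubernetes", "podman", "docker", "nixos", "snap",
    "flatpak", "distro", "release", "source",
}
MAX_BYTES = 4 * 1024


def validate(raw: bytes) -> bool:
    """Return True if the request body passes every check (else respond 400)."""
    if len(raw) > MAX_BYTES:
        return False
    try:
        data = json.loads(raw)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return False
    if not isinstance(data, dict):
        return False

    def text(key: str) -> str:
        # Non-string values are treated as missing.
        value = data.get(key)
        return value if isinstance(value, str) else ""

    return bool(
        UUID4_RE.fullmatch(text("uuid"))
        and VERSION_RE.fullmatch(text("version"))
        and text("channel") in CHANNELS
        and text("install_type") in INSTALL_TYPES
        and "arch" in data
    )
```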
### Deduplication
Before inserting, query for the most recent submission from this UUID:
```sql
SELECT received FROM reports
WHERE uuid = ?
ORDER BY received DESC
LIMIT 1
```
If the result is within 7 days of now, return:
```
HTTP 429
{ "error": "already_submitted", "last_sent": "<iso>", "retry_after_days": <n> }
```
Otherwise insert and return `204`.
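Since D1 is SQLite-based, the check-then-insert flow can be tried locally with Python's `sqlite3`. This is a sketch with hypothetical helper names and a cut-down table; treating exactly 7 days as outside the window is an assumption:

```python
import sqlite3
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the D1 database, with only the columns used here."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE reports ("
        "id INTEGER PRIMARY KEY, received TEXT NOT NULL, uuid TEXT NOT NULL)"
    )
    return db


def check_and_insert(db: sqlite3.Connection, uuid: str, now: datetime) -> int:
    """Return the HTTP status the endpoint would send: 204 or 429."""
    row = db.execute(
        "SELECT received FROM reports WHERE uuid = ? ORDER BY received DESC LIMIT 1",
        (uuid,),
    ).fetchone()
    if row is not None and now - datetime.fromisoformat(row[0]) < WINDOW:
        return 429
    # The timestamp is always generated server-side, never taken from the client.
    db.execute(
        "INSERT INTO reports (received, uuid) VALUES (?, ?)",
        (now.isoformat(), uuid),
    )
    return 204
```

Storing `received` as ISO 8601 text in a single timezone keeps `ORDER BY received DESC` chronologically correct.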
### D1 schema
```sql
CREATE TABLE reports (
id INTEGER PRIMARY KEY,
received TEXT NOT NULL, -- ISO 8601, server-side
uuid TEXT NOT NULL,
version TEXT,
channel TEXT,
"commit" TEXT, -- quoted because COMMIT is an SQL keyword
build_date TEXT,
install_type TEXT,
arch TEXT,
python TEXT,
database TEXT,
doc_bucket TEXT,
multi_user INTEGER, -- 0 / 1
feature_mail INTEGER, -- 0 / 1
feature_ai INTEGER -- 0 / 1
);
CREATE INDEX idx_reports_uuid ON reports(uuid);
CREATE INDEX idx_reports_channel ON reports(channel);
CREATE INDEX idx_reports_version ON reports(version);
```
---
## Useful queries
```sql
-- Distinct beta installs
SELECT COUNT(DISTINCT uuid)
FROM reports
WHERE channel = 'beta';
-- Installs by commit (beta only)
SELECT "commit", COUNT(DISTINCT uuid) AS installs
FROM reports
WHERE channel = 'beta'
GROUP BY "commit"
ORDER BY installs DESC;
-- Architecture breakdown
SELECT arch, COUNT(DISTINCT uuid) AS installs
FROM reports
GROUP BY arch
ORDER BY installs DESC;
-- Install type split
SELECT install_type, COUNT(DISTINCT uuid) AS installs
FROM reports
GROUP BY install_type
ORDER BY installs DESC;
-- Database backend split
SELECT database, COUNT(DISTINCT uuid) AS installs
FROM reports
GROUP BY database
ORDER BY installs DESC;
```
---
## Out of scope (for now)
- Automatic or scheduled reporting
- Any opt-out settings flag
- Server-side dashboard (raw SQL is sufficient)
- Locale, timezone, or OS version fields