mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-06-06 13:49:44 +00:00
Just for later ideas, store some brainstorming sessions with Claude
@@ -0,0 +1,167 @@
|
||||
# Pluggable Document Storage Design
|
||||
|
||||
**Date:** 2026-04-23
|
||||
**Status:** Approved
|
||||
|
||||
## Overview
|
||||
|
||||
Replace the hardcoded local filesystem storage in paperless-ngx with a pluggable `DocumentStorage` Protocol. Ship two built-in backends — `LocalFilesystemBackend` (default, zero config change) and `S3CompatibleBackend` (supports AWS S3 and any S3-compatible endpoint). Third parties can implement the Protocol to provide their own backends.
|
||||
|
||||
## Scope
|
||||
|
||||
- **In scope:** original documents, PDF/A archives
|
||||
- **Out of scope:** thumbnails (stay on local filesystem, regenerable), consumption directory (stays local)
|
||||
- **Frontend impact:** none — S3 is invisible; Django proxies all file access
|
||||
|
||||
## Protocol
|
||||
|
||||
Defined in `src/paperless/storage.py`:
|
||||
|
||||
```python
from collections.abc import Iterable
from typing import IO, Protocol, Self  # Self requires Python 3.11+


class DocumentStorage(Protocol):
    def __enter__(self) -> Self: ...
    def __exit__(self, exc_type, exc_val, exc_tb) -> None: ...
    def open(self, name: str) -> IO[bytes]: ...
    def save(self, name: str, content: IO[bytes]) -> str: ...  # returns actual name used
    def delete(self, name: str) -> None: ...
    def exists(self, name: str) -> bool: ...
    def move(self, old_name: str, new_name: str) -> None: ...
    def list_files(self, prefix: str = "") -> Iterable[str]: ...
    def size(self, name: str) -> int: ...
```
|
||||
|
||||
`name` is always the relative key as stored in the DB (e.g. `2024/my-invoice.pdf`). All operations including `open()` must be called within a `with storage:` block — the context manager handles connection lifecycle and backend-specific cleanup.
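
As a rough illustration of this contract, the sketch below shows a hypothetical in-memory backend that satisfies the Protocol structurally. `InMemoryBackend` and its `_require_context` check are assumptions made for illustration; they are not part of the design or of the built-in backends.

```python
import io
from collections.abc import Iterable
from typing import IO


class InMemoryBackend:
    """Hypothetical backend for illustration; keys map to bytes in a dict."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}
        self._active = False

    def __enter__(self) -> "InMemoryBackend":
        self._active = True
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self._active = False

    def _require_context(self) -> None:
        # Enforces the rule that every operation runs inside `with storage:`.
        if not self._active:
            raise RuntimeError("storage used outside of a `with` block")

    def open(self, name: str) -> IO[bytes]:
        self._require_context()
        return io.BytesIO(self._files[name])

    def save(self, name: str, content: IO[bytes]) -> str:
        self._require_context()
        self._files[name] = content.read()
        return name  # the name actually used

    def delete(self, name: str) -> None:
        self._require_context()
        del self._files[name]

    def exists(self, name: str) -> bool:
        self._require_context()
        return name in self._files

    def move(self, old_name: str, new_name: str) -> None:
        self._require_context()
        self._files[new_name] = self._files.pop(old_name)

    def list_files(self, prefix: str = "") -> Iterable[str]:
        self._require_context()
        return sorted(k for k in self._files if k.startswith(prefix))

    def size(self, name: str) -> int:
        self._require_context()
        return len(self._files[name])
```

Because `DocumentStorage` is a `Protocol`, a class like this needs no inheritance; it only has to provide the same methods.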
|
||||
|
||||
## Storage Instances
|
||||
|
||||
Two module-level singletons in `src/paperless/storage.py`, each an instance of the configured backend class:
|
||||
|
||||
```python
original_storage: DocumentStorage = _build("originals")
archive_storage: DocumentStorage = _build("archive")
```
|
||||
|
||||
`_build(prefix)` reads `PAPERLESS_DOCUMENT_STORAGE_BACKEND` and `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` from settings, instantiates the backend class with the configured options plus the paperless-controlled prefix. The prefix distinguishes originals from archives within the same bucket or directory root — it is not stored in the DB key.
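
One way the dotted-path lookup in `_build` might work is sketched below. This is a minimal sketch under assumptions: the helper names `load_backend_class` and `build_storage` are hypothetical, and so is the idea that the backend constructor receives the prefix as a keyword argument named `prefix` (the design does not name it).

```python
import importlib
from typing import Any


def load_backend_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"not a dotted path: {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def build_storage(prefix: str, backend_path: str, options: dict[str, Any]) -> Any:
    # Assumption: the constructor takes the paperless-controlled prefix as
    # `prefix=` alongside the user-supplied options.
    backend_cls = load_backend_class(backend_path)
    return backend_cls(prefix=prefix, **options)
```

Inside Django code, `django.utils.module_loading.import_string` performs the same dotted-path resolution and could replace `load_backend_class`.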
|
||||
|
||||
## Configuration
|
||||
|
||||
Two new settings, using the existing key-value dict mechanism:
|
||||
|
||||
| Setting                              | Default                                    | Description                                                  |
| ------------------------------------ | ------------------------------------------ | ------------------------------------------------------------ |
| `PAPERLESS_DOCUMENT_STORAGE_BACKEND` | `paperless.storage.LocalFilesystemBackend` | Dotted Python path to any class satisfying `DocumentStorage` |
| `PAPERLESS_DOCUMENT_STORAGE_OPTIONS` | `{}`                                       | Dict of kwargs passed to the backend constructor             |
|
||||
|
||||
**Example — S3-compatible:**
|
||||
|
||||
```
PAPERLESS_DOCUMENT_STORAGE_BACKEND=paperless.storage.S3CompatibleBackend
PAPERLESS_DOCUMENT_STORAGE_OPTIONS={"bucket_name": "my-docs", "endpoint_url": "https://s3.wasabi.com", "region_name": "us-east-1", "access_key": "...", "secret_key": "..."}
```
|
||||
|
||||
Existing users set nothing — `LocalFilesystemBackend` with no options is the default.
|
||||
|
||||
## Built-in Backends
|
||||
|
||||
### `LocalFilesystemBackend`
|
||||
|
||||
- `__enter__`: initialises tracking of directories affected during the context
|
||||
- `__exit__`: calls `delete_empty_directories()` for all tracked dirs; no-op on exception
|
||||
- `open/save/delete/exists/move`: direct `Path` + `shutil` operations rooted at `settings.ORIGINALS_DIR` / `settings.ARCHIVE_DIR` (via the prefix passed by `_build`)
|
||||
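- `move()`: `shutil.move()`, which renames in place (atomic on POSIX) when source and destination are on the same filesystem and falls back to copy-then-delete otherwise
- `list_files()`: `Path.rglob("*")`, keeping regular files only (`rglob` also yields directories)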
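
The directory-tracking behaviour of `__enter__` / `__exit__` could look roughly like the sketch below. `TrackingLocalBackend` and `_prune_empty` are hypothetical names standing in for the real class and for `delete_empty_directories()`; only `move()` is shown.

```python
import shutil
import tempfile
from pathlib import Path


class TrackingLocalBackend:
    """Illustrative only; not the real LocalFilesystemBackend."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._touched: set[Path] = set()

    def __enter__(self) -> "TrackingLocalBackend":
        self._touched = set()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        if exc_type is not None:
            return  # no cleanup when the block raised
        for directory in self._touched:
            self._prune_empty(directory)

    def _prune_empty(self, directory: Path) -> None:
        # Walk upward removing empty directories, but never the root itself.
        current = directory
        while current != self.root and current.is_dir() and not any(current.iterdir()):
            current.rmdir()
            current = current.parent

    def move(self, old_name: str, new_name: str) -> None:
        src, dst = self.root / old_name, self.root / new_name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(src, dst)
        self._touched.add(src.parent)  # candidate for cleanup on exit


# Usage: moving the only file out of 2024/a leaves no empty directories behind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2024" / "a").mkdir(parents=True)
    (root / "2024" / "a" / "doc.pdf").write_bytes(b"x")
    with TrackingLocalBackend(root) as storage:
        storage.move("2024/a/doc.pdf", "2025/doc.pdf")
    leftover_exists = (root / "2024").exists()
    moved_bytes = (root / "2025" / "doc.pdf").read_bytes()
```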
|
||||
|
||||
### `S3CompatibleBackend`
|
||||
|
||||
- Wraps `django-storages` S3 backend (`storages.backends.s3boto3.S3Boto3Storage`) for `open`, `save`, `delete`, `exists`
|
||||
- `__enter__`: initialises boto3 client/session
|
||||
- `__exit__`: no cleanup required (no empty directory concept on S3)
|
||||
- `move()`: boto3 `copy_object` (server-side, no data transfer) + `delete_object`
|
||||
- `open()`: returns streaming S3 response body; caller's `with` block closes the HTTP connection
|
||||
- `list_files()`: S3 `list_objects_v2` with prefix
|
||||
- Works with any S3-compatible endpoint via `endpoint_url` option
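
A sketch of the `move()` and `list_files()` bullets, written as free functions over an already constructed boto3 S3 client. The function names `s3_move` and `s3_list_files` are hypothetical; the client calls (`copy_object`, `delete_object`, and the `list_objects_v2` paginator) are the standard boto3 ones.

```python
from collections.abc import Iterator
from typing import Any


def s3_move(client: Any, bucket: str, old_key: str, new_key: str) -> None:
    """Server-side copy followed by delete; `client` is a boto3 S3 client."""
    client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # Reached only if the copy call returned without raising.
    client.delete_object(Bucket=bucket, Key=old_key)


def s3_list_files(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every key under `prefix`, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty listing carry no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Note that a single S3 `CopyObject` request is limited to objects of up to 5 GB; larger objects would need a multipart copy, which this sketch does not attempt.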
|
||||
|
||||
## Data Migration
|
||||
|
||||
One Django migration strips the stored prefix from existing rows:
|
||||
|
||||
- `document.filename`: `documents/originals/2024/invoice.pdf` → `2024/invoice.pdf`
|
||||
- `document.archive_filename`: `documents/archive/2024/invoice.pdf` → `2024/invoice.pdf`
|
||||
|
||||
The prefix is now owned by the storage instance, not the DB key.
|
||||
|
||||
## `migrate_storage` Management Command
|
||||
|
||||
```
manage.py migrate_storage [--dry-run] [--no-delete]
                          [--source-backend=<dotted.path>] [--source-options=<json>]
```
|
||||
|
||||
Transfers all document files from one storage backend to another. The user updates `PAPERLESS_DOCUMENT_STORAGE_BACKEND` in their config first, then runs this command to move existing files.
|
||||
|
||||
The destination is always the currently configured backend (from settings). The source is specified via `--source-backend` / `--source-options`, defaulting to `LocalFilesystemBackend` with no options if omitted (covering the most common migration path: local → S3).
|
||||
|
||||
**Flow:**
|
||||
|
||||
1. Instantiate source backend (from CLI args or default) and destination backend (from current settings)
2. Iterate `Document.objects.only("filename", "archive_filename")`
3. For each file (original + archive):
   - Skip with warning if missing from source
   - Skip silently if already present on destination (idempotent, safe to re-run)
   - Copy: `destination.save(name, source.open(name))`
   - Unless `--no-delete`: `source.delete(name)`
4. Report counts: moved / skipped / failed
5. `--dry-run`: prints actions without touching files
|
||||
|
||||
Individual failures are logged and counted but do not abort the run. Bidirectional: local → S3, S3 → local, S3 → S3.
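
The flow above can be sketched as a plain loop. Everything here is illustrative: `DictStore` is a hypothetical stand-in for a storage backend, `migrate` and `Counts` are invented names, and the sketch closes the source handle after copying, which the one-line copy in the flow does not spell out.

```python
import io
import logging
from dataclasses import dataclass

logger = logging.getLogger("migrate_storage_sketch")


@dataclass
class Counts:
    moved: int = 0
    skipped: int = 0
    failed: int = 0


class DictStore:
    """Hypothetical stand-in for a backend; keeps files in a dict."""

    def __init__(self, files=None):
        self.files = dict(files or {})

    def exists(self, name):
        return name in self.files

    def open(self, name):
        return io.BytesIO(self.files[name])

    def save(self, name, content):
        self.files[name] = content.read()
        return name

    def delete(self, name):
        del self.files[name]


def migrate(names, source, destination, *, dry_run=False, delete=True) -> Counts:
    counts = Counts()
    for name in names:
        try:
            if not source.exists(name):
                logger.warning("missing from source: %s", name)
                counts.skipped += 1
            elif destination.exists(name):
                counts.skipped += 1  # already migrated; safe to re-run
            elif dry_run:
                logger.info("would move: %s", name)
                counts.moved += 1  # counted as a would-be move
            else:
                with source.open(name) as handle:
                    destination.save(name, handle)
                if delete:
                    source.delete(name)
                counts.moved += 1
        except Exception:
            # One bad file must not abort the run.
            logger.exception("failed: %s", name)
            counts.failed += 1
    return counts
```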
|
||||
|
||||
## Files to Create
|
||||
|
||||
| File                                                    | Purpose                                                                        |
| ------------------------------------------------------- | ------------------------------------------------------------------------------ |
| `src/paperless/storage.py`                              | Protocol, built-in backends, `original_storage` / `archive_storage` singletons |
| `src/documents/management/commands/migrate_storage.py`  | Migration command                                                              |
| `src/documents/migrations/XXXX_strip_storage_prefix.py` | Strip prefix from existing filename rows                                       |
|
||||
|
||||
## Files to Modify
|
||||
|
||||
| File                                                     | Change                                                                                                           |
| -------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `src/paperless/settings/__init__.py`                     | Add `PAPERLESS_DOCUMENT_STORAGE_BACKEND`, `PAPERLESS_DOCUMENT_STORAGE_OPTIONS`                                   |
| `src/documents/models.py`                                | `source_file`, `archive_file` use storage instances; `source_path` returns temp file for subprocess callers      |
| `src/documents/consumer.py`                              | `_write()` → `storage.save()`; remove `mkdir` calls                                                              |
| `src/documents/signals/handlers.py`                      | `shutil.move()` → `storage.move()`; remove `create_source_path_directory` / `delete_empty_directories` callsites |
| `src/documents/tasks.py`                                 | Same as signals                                                                                                  |
| `src/documents/file_handling.py`                         | `exists()` checks and directory references use storage API                                                       |
| `src/documents/views/`                                   | File-serving views use `storage.open()` within context; wrap for `FileResponse` lifecycle                        |
| `src/documents/management/commands/document_importer.py` | Replace `Path.glob()` and direct copies with storage API                                                         |
| `src/documents/management/commands/document_exporter.py` | Replace direct file copies and `FileLock`-guarded writes with storage API                                        |
|
||||
|
||||
## Locking & Concurrency
|
||||
|
||||
The codebase serialises all document file write/move operations with `FileLock(settings.MEDIA_LOCK)`, where `MEDIA_LOCK = MEDIA_ROOT / "media.lock"`. This is used in `consumer.py`, `signals/handlers.py`, `tasks.py`, `mail.py`, `document_importer.py`, and `document_exporter.py`.
|
||||
|
||||
**The lock file stays on the local filesystem regardless of backend.** `MEDIA_LOCK` lives under `MEDIA_ROOT`, which is the local path even when documents are stored on S3. This means:
|
||||
|
||||
- **Single-host deployments** (the common case — Docker Compose, single server): the `FileLock` continues to work correctly. All Celery workers and the Django process share the same lock file. No change required.
|
||||
- **Multi-host deployments**: the `FileLock` is already broken for these today — each host has its own lock file. This is a pre-existing limitation and is out of scope for this feature.
|
||||
|
||||
**Callsite structure** — the storage context manager nests inside the existing lock, preserving current behaviour:
|
||||
|
||||
```python
with FileLock(settings.MEDIA_LOCK):
    with original_storage as storage:
        storage.move(old_name, new_name)
```
|
||||
|
||||
**`generate_unique_filename` race:** this function checks `storage.exists()` then saves, which is not atomic on S3. The `FileLock` already serialises this on a single host. For multi-host this is a pre-existing gap — not introduced by this feature.
|
||||
|
||||
**Future path for multi-host:** replace `FileLock` with a database-level advisory lock or Redis lock. Out of scope here.
|
||||
|
||||
## Key Invariants
|
||||
|
||||
- The context manager is required for all storage operations, including reads
|
||||
- `name` is always the relative key — never an absolute path or URL
|
||||
- The backend prefix (`originals` / `archive`) is paperless-controlled and never stored in the DB
|
||||
- `LocalFilesystemBackend` is the default — existing deployments require no config change
|
||||
- The migrate command is idempotent and can be re-run after partial failure
|
||||
@@ -0,0 +1,253 @@
|
||||
# Workflow Runner Refactor — Design
|
||||
|
||||
**Date:** 2026-05-19
|
||||
**Branch base:** `dev`
|
||||
**Status:** Approved design, pending implementation plan
|
||||
|
||||
## Problem
|
||||
|
||||
Workflow execution and the Django signal layer have repeatedly produced fragile,
|
||||
hard-to-fix bugs (see the revert/refix history around password removal: #12803,
|
||||
#12814, #12716, and the filename race #12386). Three structural causes:
|
||||
|
||||
1. **`run_workflows` is dual-mode.** A single function handles both consumption
|
||||
(mutating a `DocumentMetadataOverrides`) and post-save (mutating a real
|
||||
`Document`), branching on a `use_overrides` flag. The branching is
|
||||
concentrated in two places — the action dispatch inside `run_workflows`
|
||||
(`handlers.py:931-1001`) and `build_workflow_action_context`
|
||||
(`actions.py:33-83`), each with two full code paths. The `apply_*` helpers in
|
||||
`workflows/mutations.py` are _already_ split by target type
|
||||
(`apply_assignment_to_document` vs `apply_assignment_to_overrides`, etc.); the
|
||||
refactor unifies their callers, not the helpers themselves.
|
||||
|
||||
2. **File location is an implicit, timing-dependent side channel.** The
|
||||
`DOCUMENT_ADDED` workflow fires from `run_workflows_added`, which runs while
|
||||
the consumer is still inside its transaction — _before_ the consumed file is
|
||||
copied to `document.source_path` (`document_consumption_finished` is sent at
|
||||
`consumer.py:658`; the file copy happens after, at `consumer.py:670+`). The
|
||||
staged path is therefore threaded through as `original_file` /
|
||||
`caller_supplied_original_file` parameters. Actions that read the file
|
||||
(password removal, email attachments) depend on this plumbing being correct.
|
||||
|
||||
3. **The workflow run races the filename rename.** `update_filename_and_move_files`
|
||||
is a raw `post_save` receiver on `Document`. When a workflow persists its
|
||||
changes via `document.save(update_fields=[...])`, that save fires `post_save`
|
||||
and runs the rename _while the workflow is still executing_. Under concurrent
|
||||
Celery/UI updates the interleaved `refresh_from_db()` calls corrupt state. The
|
||||
comment at `handlers.py:980-984` — deliberately excluding `filename` /
|
||||
`archive_filename` from the workflow save — is a load-bearing workaround for
|
||||
exactly this.
|
||||
|
||||
Note: `run_workflows_added` / `run_workflows_updated` are connected to the
|
||||
_custom_ signals `document_consumption_finished` / `document_updated`, fired
|
||||
explicitly by paperless code in a handful of known sites — not to raw Django
|
||||
`post_save`. Only `update_filename_and_move_files` is a raw `post_save` receiver.
|
||||
This refactor does not change where workflows are triggered from.
|
||||
|
||||
## Scope
|
||||
|
||||
In scope:
|
||||
|
||||
- Refactor `run_workflows` and its action helpers around an execution-context
|
||||
abstraction.
|
||||
- Delete the `original_file` side-channel plumbing.
|
||||
- Make the workflow-execution → persist → rename sequence explicit and
|
||||
deterministic.
|
||||
|
||||
Out of scope:
|
||||
|
||||
- Changing where/when workflows are triggered (custom signal call sites unchanged).
|
||||
- Reworking the matching logic (`matching.document_matches_workflow`).
|
||||
- Any change to workflow models, serializers, or the REST API.
|
||||
|
||||
## Design
|
||||
|
||||
### 1. `WorkflowRunContext` protocol
|
||||
|
||||
New module `documents/workflows/context.py` defining a `typing.Protocol`:
|
||||
|
||||
```
WorkflowRunContext (Protocol)
    source_file: Path                       # where the file actually is, now
    build_placeholder_context() -> dict
    apply_assignment(action) -> None
    apply_removal(action) -> None
    persist() -> None                       # commit accumulated mutations
    record_run(workflow, trigger_type) -> None
```
|
||||
|
||||
Two concrete implementations (which need not import the Protocol — structural
|
||||
typing):
|
||||
|
||||
- **`ConsumptionContext`** — wraps `ConsumableDocument` + `DocumentMetadataOverrides`.
|
||||
`source_file` returns the staged file path. Mutations land on the overrides.
|
||||
`persist()` is a no-op (the overrides object is returned to the caller).
|
||||
- **`PersistedContext`** — wraps a real `Document`. Mutations land on the
|
||||
in-memory `Document`. `persist()` performs a single save.
|
||||
|
||||
**Context selection** — `run_workflows` picks the context from the call shape:
|
||||
|
||||
- CONSUMPTION trigger (`ConsumableDocument` + non-`None` `overrides`) →
|
||||
`ConsumptionContext`.
|
||||
- DOCUMENT_ADDED / DOCUMENT_UPDATED / SCHEDULED (a real `Document`,
|
||||
`overrides=None`) → `PersistedContext`.
|
||||
|
||||
**`source_file` for `PersistedContext`.** It cannot unconditionally return
|
||||
`document.source_path`: for the `DOCUMENT_ADDED` trigger the file has not yet
|
||||
been moved there. The staged path is therefore passed into the `PersistedContext`
|
||||
_at construction time_ by `run_workflows_added` (which still receives it from the
|
||||
`document_consumption_finished` signal). `source_file` returns that staged path
|
||||
when supplied, otherwise `document.source_path`. This relocates the staged-path
|
||||
information from a chain of function parameters into a single piece of
|
||||
construction state — the `original_file` / `caller_supplied_original_file`
|
||||
_parameter plumbing_ through `run_workflows` and the action helpers is what gets
|
||||
deleted, not the staged path itself.
|
||||
|
||||
`WorkflowRunContext` is a plain `Protocol`, not `@runtime_checkable` — the runner
|
||||
constructs the context itself, so no `isinstance` check is needed. Genuinely
|
||||
shared logic goes into module-level helper functions, not a base class.
|
||||
|
||||
### 2. `run_workflows` becomes branch-free
|
||||
|
||||
`run_workflows` keeps its current public signature so all call sites are
|
||||
unchanged. Its body:
|
||||
|
||||
1. Construct the appropriate context once, from the argument types.
|
||||
2. Run a single flat match-and-dispatch loop over matching workflows/actions,
|
||||
delegating every action to context methods.
|
||||
|
||||
No `use_overrides` flag anywhere. The branching currently concentrated in
`run_workflows` and `build_workflow_action_context` collapses into the two
context classes, which call the already-split `apply_*` helpers.
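
A sketch of what the flat loop could look like when driven by a recording fake context. All names here are hypothetical (`run_matching_workflows`, `RecordingContext`, the `kind` attribute on actions, and the `matches` callback); the per-workflow `refresh_from_db()` is left out. Dispatch by action type remains; what disappears is any branching on the target (overrides vs. document).

```python
from types import SimpleNamespace


class RecordingContext:
    """Fake context that records calls instead of mutating anything."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def apply_assignment(self, action) -> None:
        self.calls.append(("assign", action.name))

    def apply_removal(self, action) -> None:
        self.calls.append(("remove", action.name))

    def persist(self) -> None:
        self.calls.append(("persist", ""))

    def record_run(self, workflow, trigger_type) -> None:
        self.calls.append(("record", workflow.name))


def run_matching_workflows(context, workflows, trigger_type, matches) -> None:
    for workflow in workflows:
        if not matches(workflow):
            continue
        for action in workflow.actions:
            # The context decides where the mutation lands.
            if action.kind == "assignment":
                context.apply_assignment(action)
            elif action.kind == "removal":
                context.apply_removal(action)
        context.persist()
        context.record_run(workflow, trigger_type)


# Usage with stand-in workflow objects.
wf_on = SimpleNamespace(
    name="w1",
    enabled=True,
    actions=[
        SimpleNamespace(kind="assignment", name="a1"),
        SimpleNamespace(kind="removal", name="r1"),
    ],
)
wf_off = SimpleNamespace(name="w2", enabled=False, actions=[])
ctx = RecordingContext()
run_matching_workflows(ctx, [wf_on, wf_off], "document_updated", lambda w: w.enabled)
```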
|
||||
|
||||
### 3. File staging via `source_file`
|
||||
|
||||
`source_file` is a property of the context, fixed at construction. The
|
||||
`original_file` and `caller_supplied_original_file` parameters threaded through
|
||||
`run_workflows` and the `execute_*` helpers are deleted; each context resolves
|
||||
the path itself (see "Context selection" above).
|
||||
|
||||
**Deferred password removal.** `execute_password_removal_action`, when given a
|
||||
`ConsumableDocument`, currently installs a one-shot handler on
|
||||
`document_consumption_finished` that picks up `original_file` from `kwargs`
|
||||
later (`actions.py:295-308`). This deferred hook lives outside the context
|
||||
abstraction. The refactor must explicitly decide its fate: either keep it as-is
|
||||
(the context still constructs correctly around it) or fold the deferral into
|
||||
`ConsumptionContext`. This is called out as an open implementation decision, not
|
||||
silently absorbed.
|
||||
|
||||
### 4. Explicit workflow → persist → rename sequencing
|
||||
|
||||
What must be deferred is the **file rename**, not the DB save. `run_workflows`
|
||||
keeps its per-workflow `document.refresh_from_db()` at the top of each iteration
|
||||
— that is deliberate concurrency protection against `bulk_update_documents`
|
||||
running simultaneously. Deferring all saves to a single final `persist()` would
|
||||
let one workflow's refresh wipe a prior workflow's in-memory changes. So:
|
||||
|
||||
1. `run_workflows` refreshes and applies actions per workflow, and
|
||||
`PersistedContext.persist()` saves after each matching workflow, as today.
|
||||
2. The save deliberately **continues to exclude** `filename` /
|
||||
`archive_filename` from `update_fields`. This is not duct tape: it guards a
|
||||
_cross-process_ hazard — another Celery task may have moved the file and
|
||||
written `filename` to the DB, and a stale in-memory `filename` in our save
|
||||
would revert it. The `ContextVar` guard (below) only addresses _intra-process_
|
||||
ordering, so this exclusion stays.
|
||||
3. The rename is suppressed for the whole run and invoked **exactly once,
|
||||
afterward**, against final committed state.
|
||||
|
||||
The actual race being fixed: `apply_assignment_to_document` assigns tags via
|
||||
`document.add_nested_tags(...)`, which fires `m2m_changed` on
|
||||
`Document.tags.through` _before_ the workflow's `document.save()`. The
|
||||
`m2m_changed` receiver `update_filename_and_move_files` then calls
|
||||
`refresh_from_db()`, wiping the workflow's in-memory correspondent/type, and
|
||||
moves the file to a path computed from stale metadata. The guard prevents this.
|
||||
|
||||
To stop the rename from firing mid-workflow, a **`ContextVar` guard** is
|
||||
introduced (e.g. `documents/workflows/context.py` module-level
|
||||
`_workflow_in_progress: ContextVar[bool]`). `update_filename_and_move_files`
|
||||
checks the guard and early-returns when set. `run_workflows` wraps its **entire**
|
||||
persisted-path execution — not just the `persist()` call — in a context manager
|
||||
that sets the guard via `set()`/`reset(token)`. Token-based reset is
|
||||
reentrancy-safe for nested saves or nested workflow runs.
|
||||
|
||||
The guard must span the whole execution, not just `persist()`, because
|
||||
`update_filename_and_move_files` is _also_ registered to `m2m_changed` on
|
||||
`Document.tags.through` and to `post_save` on `CustomFieldInstance`
|
||||
(`handlers.py:431-432`). A workflow action that assigns tags or custom fields
|
||||
would otherwise trigger a rename mid-workflow through those signals.
|
||||
|
||||
After execution completes, `run_workflows` explicitly invokes the move logic
once; by then `persist()` has already run after each matching workflow (see
point 1 above). The `ContextVar` is set/reset in the same thread that runs these
receivers synchronously, so they always observe the value. (A Celery `prefork`
worker process runs one task at a time, and greenlet pools are also
`contextvars`-aware: non-issues, noted for completeness.)
|
||||
|
||||
The move body of `update_filename_and_move_files` is extracted into a plain
|
||||
callable that the runner invokes directly. The function is already invoked
|
||||
directly (as a plain call, bypassing the decorator) for version documents at
|
||||
`handlers.py:664-667`, so this extraction has precedent. The thin `post_save`
|
||||
receiver remains as a guard-checking wrapper.
|
||||
|
||||
The two `post_save` receivers on `Document` are `update_filename_and_move_files`
|
||||
(`handlers.py:433`) and `update_llm_suggestions_cache` (`handlers.py:740`). The
|
||||
`ContextVar` guard suppresses **only** the former — `update_llm_suggestions_cache`
|
||||
keeps running normally, as do `document_consumption_finished` receivers such as
|
||||
`add_or_update_document_in_llm_index` (which is _not_ a `post_save` receiver).
|
||||
This is why the guard is preferred over persisting with `.update()`, which would
|
||||
silently suppress _all_ `post_save` receivers including
|
||||
`update_llm_suggestions_cache`.
|
||||
|
||||
`WorkflowRun.objects.create(...)` is created per matching workflow as today
|
||||
(`handlers.py:998-1002`); it is a separate model and is not deferred.
|
||||
|
||||
The comment at `handlers.py:980-984` is updated to describe the new flow
|
||||
(per-workflow save under the guard; single explicit rename afterward) but the
|
||||
`filename` / `archive_filename` exclusion it documents is kept — see point 2
|
||||
above.
|
||||
|
||||
## Testing
|
||||
|
||||
- **Runner loop** — exercised against a fake context implementing the
|
||||
`WorkflowRunContext` surface that records `apply_assignment` / `apply_removal`
|
||||
/ `persist` calls. No DB document, no staged files, no signals.
|
||||
- **Concrete contexts** — `ConsumptionContext` and `PersistedContext` each get
|
||||
focused tests: given an action, assert the mutation lands on the overrides vs.
|
||||
the document, and that `source_file` resolves to the staged vs. final path.
|
||||
- **ContextVar guard** — assert `update_filename_and_move_files` early-returns
|
||||
while the guard is set, and that the rename runs exactly once after
|
||||
`persist()`.
|
||||
- **Regression: the racy case** — a workflow that reassigns metadata while the
|
||||
document is subject to a filename template; assert final DB filename and file
|
||||
location are consistent (the #12386 scenario).
|
||||
- **Regression safety net** — the existing `test_workflows.py` suite (~100
|
||||
tests; ~19 `document_consumption_finished.send` sites plus many direct
|
||||
`run_workflows(...)` calls for the `DOCUMENT_UPDATED` path) must stay green
|
||||
**unchanged**. A test that needs editing signals a behavior change to flag
|
||||
explicitly, not a silent refactor outcome.
|
||||
|
||||
Per project conventions: tests grouped under classes, fixtures and test
|
||||
signatures fully type-annotated.
|
||||
|
||||
## Implementation sequence
|
||||
|
||||
Each step is independently reviewable and keeps the test suite green:
|
||||
|
||||
1. Introduce the `Protocol` + the two contexts; `run_workflows` delegates to
|
||||
them. Pure refactor, no behavior change.
|
||||
2. Move the staged path into `PersistedContext` construction (passed by
|
||||
`run_workflows_added`); delete the `original_file` /
|
||||
`caller_supplied_original_file` parameter plumbing through `run_workflows`
|
||||
and the `execute_*` helpers.
|
||||
3. Extract the move body from `update_filename_and_move_files` into a callable;
|
||||
add the `ContextVar` guard; `run_workflows` invokes the move once after the
|
||||
run completes. The `filename` / `archive_filename` exclusion in the
|
||||
per-workflow save is kept; only the comment at `handlers.py:980-984` is
|
||||
updated to describe the new flow.
|
||||
|
||||
## Pain points addressed
|
||||
|
||||
- **Dual-mode** → eliminated by the `Protocol` + two contexts; no `use_overrides`.
|
||||
- **File staging** → `source_file` is a context property; side-channel args deleted.
|
||||
- **Rename race** → per-workflow save under a `ContextVar` guard that suppresses
|
||||
the mid-workflow rename; a single explicit rename runs once at the end against
|
||||
final state.
|
||||
@@ -0,0 +1,215 @@
|
||||
# AI Suggestions: Inject existing taxonomy as candidates
|
||||
|
||||
**Status:** Design (v2 — frequency-only)
|
||||
**Date:** 2026-05-20
|
||||
**Related:** [Discussion #12787](https://github.com/paperless-ngx/paperless-ngx/discussions/12787)
|
||||
**Branch target:** `dev`
|
||||
|
||||
## Problem
|
||||
|
||||
AI Suggestions currently asks the LLM for free-form tag/document-type/correspondent/storage-path names, then reconciles via `difflib` fuzzy matching (cutoff 0.8) in `paperless_ai/matching.py`. This works for typos but not for semantic equivalents:
|
||||
|
||||
- `blood test` does not fuzzy-match `Bloodwork`
|
||||
- `IRS` does not fuzzy-match `Taxes`
|
||||
- `doctor visit` does not fuzzy-match `Medical`
|
||||
|
||||
Result: the LLM invents new metadata names that duplicate existing taxonomy entries.
|
||||
|
||||
## Goal
|
||||
|
||||
Tell the LLM what already exists, so it can prefer existing names verbatim. Fuzzy matching becomes the fallback for typos and for legitimately novel suggestions, not the primary semantic-equivalence mechanism.
|
||||
|
||||
Non-goals: changing the LLM client, embedding model selection, or RAG retrieval; replacing fuzzy matching entirely; custom-field option values; embedding-based shortlisting (deferred to a later revision if frequency proves insufficient).
|
||||
|
||||
## Approach
|
||||
|
||||
For each of Tags, DocumentTypes, Correspondents, StoragePaths:
|
||||
|
||||
1. Take the user-visible queryset (owner-aware, matching `matching.py`).
|
||||
2. Annotate by document-usage count and take the top `X` names by frequency. `X` is a configurable per-category cap (a single setting, applied to all four categories).
|
||||
3. Inject those names into the LLM prompt as "Available <category>" blocks, with the instruction to prefer them verbatim.
|
||||
4. When the LLM responds, tell `matching.py` which names were hinted so an exact normalized match short-circuits past fuzzy. Names not in the hint list keep today's fuzzy fallback.
|
||||
|
||||
No FAISS index, no signals, no Celery tasks, no locks. Pure DB-side queries on each suggestion request.
|
||||
|
||||
## Components
|
||||
|
||||
### `paperless_ai/taxonomy.py` (new)
|
||||
|
||||
```python
class TaxonomyHints(TypedDict):
    tags: list[str]
    document_types: list[str]
    correspondents: list[str]
    storage_paths: list[str]


def build_taxonomy_hints(document: Document, user: User | None) -> TaxonomyHints: ...
def format_hints_for_prompt(hints: TaxonomyHints) -> str: ...
```
|
||||
|
||||
Internals:
|
||||
|
||||
- `_visible_queryset(model_cls, perm: str, user)` — wraps `get_objects_for_user_owner_aware` exactly as `matching.py` does. If `user` is `None`, returns the unfiltered manager queryset (parity with how `matching.py` behaves today).
|
||||
- `_shortlist_by_frequency(queryset, max_per_category)`, computed DB-side:

  ```python
  return list(
      queryset
      .annotate(usage=Count("documents"))
      .order_by("-usage", "name")
      .values_list("name", flat=True)[:max_per_category]
  )
  ```

  Confirmed reverse relation name is `documents` for all four models (`documents/models.py:164,173,184,211`). Secondary order by `name` keeps results stable when usage ties (common with 0-usage tails). `StoragePath` uses the human `name` field, not the `path` template.
|
||||
|
||||
`format_hints_for_prompt` emits one `Available <category>:` block per non-empty category. Empty categories produce no block (avoid prompting the LLM with "Available tags: (none)"). A single instruction line follows:
|
||||
|
||||
```
Prefer existing names from these lists verbatim. Only propose a new value
if none of the existing names fits.
```
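A possible shape for `format_hints_for_prompt`, as a sketch only. The block layout (one `- name` line per entry), the human-readable category labels, and returning an empty string when every category is empty are assumptions; the design fixes only the `Available <category>:` heading, the omission of empty categories, and the instruction text.

```python
from typing import TypedDict


class TaxonomyHints(TypedDict):
    tags: list[str]
    document_types: list[str]
    correspondents: list[str]
    storage_paths: list[str]


# Assumed labels for the "Available <category>:" headings.
_LABELS = {
    "tags": "tags",
    "document_types": "document types",
    "correspondents": "correspondents",
    "storage_paths": "storage paths",
}

_INSTRUCTION = (
    "Prefer existing names from these lists verbatim. Only propose a new value\n"
    "if none of the existing names fits."
)


def format_hints_for_prompt(hints: TaxonomyHints) -> str:
    blocks: list[str] = []
    for key, label in _LABELS.items():
        names = hints[key]
        if not names:
            continue  # empty categories produce no block
        lines = "\n".join(f"- {name}" for name in names)
        blocks.append(f"Available {label}:\n{lines}")
    if not blocks:
        return ""  # assumption: nothing to hint, so no instruction either
    return "\n\n".join(blocks) + "\n\n" + _INSTRUCTION
```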
|
||||
|
||||
### `paperless_ai/ai_classifier.py` (modify)
|
||||
|
||||
Required signature change (the v1 spec missed this — flagged by code review):
|
||||
|
||||
- `build_prompt_without_rag(document, user: User | None = None)` — currently takes only `document`; add `user` with `None` default to keep call sites simple.
|
||||
- `build_prompt_with_rag(document, user: User | None = None)` — already takes `user`; its existing call to `build_prompt_without_rag(document)` at `ai_classifier.py:39` is updated to pass `user` through.
|
||||
|
||||
Both prompt builders accept an optional `hints: TaxonomyHints | None = None` parameter. When non-`None`, `format_hints_for_prompt(hints)` is spliced in before the "Analyze the following document" instruction. When `None` (default), the prompt is built as today.
|
||||
|
||||
`get_ai_document_classification(document, user, hints: TaxonomyHints | None = None)` accepts the same optional `hints` and forwards it to the prompt builder. Return shape is **unchanged** (`dict`). The view layer owns hint construction so the same `TaxonomyHints` object can be used both for the prompt and for `hinted_names` in matching — no need to thread it back out of the classifier. Callers in tests pass `hints=None` (or omit) to preserve existing behavior.
|
||||
|
||||
### `paperless_ai/matching.py` (modify)
|
||||
|
||||
- `_match_names_to_queryset(names, queryset, attr, hinted_names: set[str] | None = None)`:
|
||||
- Normalization unchanged.
|
||||
- Exact-match-on-full-queryset behavior unchanged (always tried first).
|
||||
- When `hinted_names` is provided and the LLM-returned name (normalized) matches a hinted name (normalized) → treated as exact-only; fuzzy is skipped for that name.
|
||||
- When `hinted_names` is `None` or the name isn't in it → existing 0.8 fuzzy fallback runs.
|
||||
- `match_tags_by_name(names, user, hinted_names=None)` etc. — optional kwarg, backward compatible.
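
The matching order described above can be sketched over plain strings rather than a queryset. `match_names` and `_normalize` are hypothetical; the real normalization in `matching.py` may differ, and the sketch uses `difflib.get_close_matches` with the 0.8 cutoff mentioned in the Problem section.

```python
import difflib


def _normalize(name: str) -> str:
    # Assumption: lowercase and collapse whitespace.
    return " ".join(name.lower().split())


def match_names(names, existing, hinted_names=None, cutoff=0.8):
    by_norm = {_normalize(e): e for e in existing}
    hinted = {_normalize(h) for h in hinted_names} if hinted_names is not None else set()
    matched = []
    for name in names:
        norm = _normalize(name)
        if norm in by_norm:
            matched.append(by_norm[norm])  # exact match always wins
            continue
        if norm in hinted:
            continue  # hinted names are exact-only; fuzzy is skipped
        close = difflib.get_close_matches(norm, list(by_norm), n=1, cutoff=cutoff)
        if close:
            matched.append(by_norm[close[0]])  # fuzzy fallback
    return matched
```

With no hints, a typo such as `Medcal` still resolves to `Medical` through the fuzzy fallback, while `blood test` stays unmatched against `Bloodwork`, which is the gap the hints are meant to close on the prompt side.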
|
||||
|
||||
### `documents/views.py` (modify)
|
||||
|
||||
The suggestion endpoint (around line 1482) is the single production caller of `get_ai_document_classification` and the call site for `match_*_by_name`. Update it to:
|
||||
|
||||
1. Build hints once: `hints = build_taxonomy_hints(document, request.user)` (when `AIConfig().taxonomy_hints_enabled` is true and `AIConfig().taxonomy_hints_max_per_category > 0`; otherwise `hints = None`).
|
||||
2. Pass `hints` into the classifier: `parsed = get_ai_document_classification(document, request.user, hints=hints)`.
|
||||
3. Pass `hinted_names=set(hints["tags"])` (etc., one per category, or `None` when `hints` is `None`) into each `match_*_by_name` call.
|
||||
|
||||
**Cache interaction:** the AI suggestion path is wrapped by `cached_llm_suggestions` / `refresh_suggestions_cache` (`views.py:1477`). A cached response bypasses the LLM call entirely, so changes to the hints config don't take effect until the cache entry is invalidated. Acceptable for the initial release (cache is short-lived). If experience shows users change the toggle and expect immediate effect, follow up by including a hash of the hint-relevant config (`taxonomy_hints_enabled`, `taxonomy_hints_max_per_category`) in the cache key.
|
||||
|
||||
### `paperless/config.py` (`AIConfig`) + DB model + settings
|
||||
|
||||
`AIConfig.__post_init__` reads values from the `ApplicationConfiguration` DB row **and** falls back to `settings.*` constants (pattern at `paperless/config.py:207` for `ai_enabled`). Both layers are needed.
|
||||
|
||||
Two new fields, threaded through three places:
|
||||
|
||||
1. **`paperless/settings/*.py`** — add module-level constants read from env:
|
||||
- `AI_TAXONOMY_HINTS: bool = __get_boolean("PAPERLESS_AI_TAXONOMY_HINTS", "yes")` (default on)
|
||||
- `AI_TAXONOMY_HINTS_MAX: int = int(os.getenv("PAPERLESS_AI_TAXONOMY_HINTS_MAX", "30"))`
|
||||
|
||||
2. **`paperless/models.py` (`ApplicationConfiguration`)** — add two nullable columns:
|
||||
- `taxonomy_hints_enabled = models.BooleanField(null=True)`
|
||||
- `taxonomy_hints_max_per_category = models.PositiveSmallIntegerField(null=True)` (range 0–32767; `PositiveSmallIntegerField` is sufficient)
|
||||
- One Django migration.
|
||||
|
||||
3. **`paperless/config.py` (`AIConfig`)** — read with **explicit None check, not `or`** (because `0` and `False` are legitimate user values that would otherwise silently fall back to the settings default):
|
||||
```python
self.taxonomy_hints_enabled = (
    app_config.taxonomy_hints_enabled
    if app_config.taxonomy_hints_enabled is not None
    else settings.AI_TAXONOMY_HINTS
)
self.taxonomy_hints_max_per_category = (
    app_config.taxonomy_hints_max_per_category
    if app_config.taxonomy_hints_max_per_category is not None
    else settings.AI_TAXONOMY_HINTS_MAX
)
```
|
||||
(Other fields in this file use `or`; we deliberately diverge here to support `0` and `False`. A short comment in code records why.)
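The difference between the two fallback styles is easy to demonstrate in isolation. The helper name `db_or_default` below is hypothetical and exists only for this illustration.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def db_or_default(db_value: Optional[T], default: T) -> T:
    """Prefer the DB value whenever it is set, even if it is falsy."""
    return db_value if db_value is not None else default


# With `or`, a user-chosen 0 or False is silently replaced by the default.
assert (0 or 30) == 30
assert (False or True) is True

# With the explicit None check, falsy user values survive.
assert db_or_default(0, 30) == 0
assert db_or_default(False, True) is False
assert db_or_default(None, 30) == 30
```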
|
||||
|
||||
**Frontend** (`src-ui/src/app/data/paperless-config.ts`): add two entries to the `PaperlessConfigOptions` declarative list (one `Boolean`, one `Number`, `category: ConfigCategory.AI`) plus two fields on the `PaperlessConfig` interface. No component changes; the form is generated from this list.
|
||||
|
||||
`paperless.conf.example` and the configuration docs page get entries.
|
||||
|
||||
## Data flow
|
||||
|
||||
Suggestion request:
|
||||
|
||||
1. View checks `AIConfig().taxonomy_hints_enabled`; if enabled, calls `hints = build_taxonomy_hints(document, user)`; otherwise `hints = None`.
|
||||
2. View calls `parsed = get_ai_document_classification(document, user, hints=hints)`.
|
||||
3. Classifier splices `format_hints_for_prompt(hints)` into the prompt (when non-`None`), calls LLM, returns parsed dict.
|
||||
4. View calls `match_*_by_name(names, user, hinted_names=set(hints[<category>]) if hints else None)` per category. Exact-on-hint short-circuit; fuzzy fallback unchanged for misses.
|
||||
|
||||
No background processing. No persisted state. Each suggestion request runs four lightweight `Count("documents")` queries, one per taxonomy model; each is a single `.annotate().order_by().values_list()` query with no joins beyond the existing reverse relation.
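The ranking each of those queries is meant to produce can be shown with a pure-Python stand-in. This is not the ORM code: `top_k_by_usage` is a hypothetical name, `usages` stands in for the document relation (one entry per document using the name), and the alphabetical tie-break follows the ordering the tests in this spec expect.

```python
from collections import Counter


def top_k_by_usage(usages: list[str], all_names: list[str], k: int) -> list[str]:
    """Rank names by document count (descending), then alphabetically."""
    if k <= 0:
        # max_per_category = 0 behaves like the feature being disabled.
        return []
    counts = Counter(usages)
    ranked = sorted(all_names, key=lambda name: (-counts[name], name))
    return ranked[:k]
```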
|
||||
|
||||
## Error handling
|
||||
|
||||
- **Empty visible queryset for a category:** omit that category's block from the prompt.
|
||||
- **`taxonomy_hints_enabled = False` or `max_per_category = 0`:** `build_taxonomy_hints` returns an empty `TaxonomyHints`; prompt is identical to today; matching is called without `hinted_names`; behavior identical to today.
|
||||
- **LLM returns a name not in hints but exactly matching an existing visible name:** still treated as exact match. `_match_names_to_queryset` always tries exact-on-full-queryset before fuzzy; `hinted_names` only governs whether fuzzy is consulted for that specific name.
|
||||
- **DB query failure during shortlist build:** propagate. Suggestion failures already surface as 5xx; adding silent fallbacks here would mask real problems.
|
||||
|
||||
## Testing
|
||||
|
||||
All new and modified tests use pytest style — functions/classes, no `unittest.TestCase` subclasses; `pytest-django` with per-class `@pytest.mark.django_db`; `pytest-mock`'s `mocker` fixture for patching; every fixture parameter, fixture return, and test signature type-annotated. Tests grouped under classes (`class TestBuildTaxonomyHints:`), not flat free functions. Shared fixtures live in `paperless_ai/tests/conftest.py`. Format with `ruff` directly (not `uv run ruff`).
|
||||
|
||||
### `paperless_ai/tests/test_taxonomy.py` (new)
|
||||
|
||||
- `class TestBuildTaxonomyHints:`
|
||||
- Returns a `TaxonomyHints` with all four keys.
|
||||
- Top-K limit respected (`max_per_category` honored from `AIConfig`).
|
||||
- Frequency ordering: tag used on 5 docs ranks above tag used on 2 docs.
|
||||
- Tie-break by name (alphabetical) for stable output.
|
||||
- Owner-aware: user lacking `view_tag` perm gets `tags=[]`; `view_documenttype` likewise per category.
|
||||
- Empty queryset for a category → empty list; `format_hints_for_prompt` omits the block.
|
||||
- `taxonomy_hints_enabled=False` returns zero-filled `TaxonomyHints` and runs no taxonomy DB queries (`django_assert_num_queries`).
|
||||
- `max_per_category=0` same behavior as disabled.
|
||||
- `StoragePath` shortlist uses the `name` field, not `path` template (asserted on returned values).
|
||||
|
||||
- `class TestFormatHintsForPrompt:`
|
||||
- All four blocks present when all categories non-empty.
|
||||
- Empty category produces no block.
|
||||
- All-empty hints produces empty string (no stray instruction line).
|
||||
- Instruction line appears exactly once when at least one block is rendered.
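The formatting rules these tests pin down can be sketched as follows. Everything concrete here is an assumption made for illustration: the key names other than `tags`, the block labels, and the instruction wording are hypothetical, and the real `TaxonomyHints` and `format_hints_for_prompt` are defined elsewhere in this spec.

```python
from typing import TypedDict


class TaxonomyHints(TypedDict):
    # Key names other than "tags" are assumptions for this sketch.
    tags: list[str]
    document_types: list[str]
    correspondents: list[str]
    storage_paths: list[str]


# Hypothetical wording; the real instruction line is defined elsewhere.
INSTRUCTION = "Prefer these existing names when they fit the document."

LABELS = {
    "tags": "Tags",
    "document_types": "Document types",
    "correspondents": "Correspondents",
    "storage_paths": "Storage paths",
}


def format_hints_for_prompt(hints: TaxonomyHints) -> str:
    """Render one block per non-empty category; empty string if none."""
    blocks = [
        f"{label}: {', '.join(hints[key])}"
        for key, label in LABELS.items()
        if hints[key]
    ]
    if not blocks:
        # No stray instruction line when there is nothing to hint.
        return ""
    return "\n".join([INSTRUCTION, *blocks])
```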
|
||||
|
||||
### `paperless_ai/tests/test_ai_classifier.py` (extend)
|
||||
|
||||
- `class TestBuildPrompt:`
|
||||
- `build_prompt_without_rag(doc, user)` now accepts `user`; produces a prompt containing the hints block when hints are non-empty.
|
||||
- `build_prompt_with_rag(doc, user)` includes both the RAG context block (unchanged) and the hints block.
|
||||
- `taxonomy_hints_enabled=False`: prompt matches today's baseline (string equality against a fixture).
|
||||
- `get_ai_document_classification(doc, user, hints=...)` forwards hints into the prompt; return shape unchanged (still `dict`).
|
||||
|
||||
### `paperless_ai/tests/test_matching.py` (extend)
|
||||
|
||||
- `class TestHintedMatching:`
|
||||
- LLM returns `"Bloodwork"` verbatim, `hinted_names={"Bloodwork", ...}` → exact match returned; `difflib.get_close_matches` not called (`mocker.spy` on `difflib.get_close_matches`).
|
||||
- LLM returns `"blood test"` not in `hinted_names`, no existing exact → fuzzy fallback runs; behavior unchanged from today (regression guard).
|
||||
- LLM returns `"Bloodwork "` (whitespace) with hinted_names containing `"Bloodwork"` → normalized exact match wins, fuzzy not consulted.
|
||||
- Backward compatibility: `match_tags_by_name(names, user)` without the kwarg behaves identically to today (snapshot of an existing test, parameterized).
|
||||
|
||||
Markers: no `live` marker needed.
|
||||
|
||||
## Migration / rollout
|
||||
|
||||
- One Django migration adding two columns to `ApplicationConfiguration` (`taxonomy_hints_enabled BooleanField`, `taxonomy_hints_max_per_category PositiveSmallIntegerField`). Both are nullable with no column default: a `NULL` value falls back to the settings constant, so existing rows aren't broken.
|
||||
- Feature defaults to on for new and existing installs. Set `PAPERLESS_AI_TAXONOMY_HINTS=false` (or via the Application Configuration UI) to restore today's behavior.
|
||||
- Frontend admin form updated to expose the two fields under the existing AI section.
|
||||
|
||||
## Open questions deferred to implementation
|
||||
|
||||
- `paperless_ai/tests/conftest.py` already exists — verify fixture-naming conventions match before adding new fixtures.
|
||||
- Confirm `parse_ai_response` doesn't need to know about hints (it's a pure parser; hints flow alongside, not through it).
|
||||
- The view layer applying `hinted_names` must use the same hints the classifier saw. Resolved: the view builds the `TaxonomyHints` once and passes the same object to both the classifier and the matchers, so the classifier's return shape stays a `dict` and nothing is re-derived.
|
||||
|
||||
## Interplay with `extract_unmatched_names`
|
||||
|
||||
`extract_unmatched_names` (used downstream of matching) surfaces LLM-returned names that didn't match any existing taxonomy entry — the UI uses these to offer "create new tag?" affordances. With hints in place, fewer names will be unmatched, which is the desired outcome. No behavior change is required: a hinted name that the LLM repeats verbatim will exact-match and not appear in the unmatched list; a name the LLM invents anyway (despite the hint instruction) still flows through fuzzy and, if no match, surfaces as "new" exactly as today. Out of scope: filtering unmatched results based on what was in the hint set.
|
||||
|
||||
## Out of scope (potential v2)
|
||||
|
||||
- Embedding-based shortlisting (for users with very large taxonomies where frequency misses the right tag). Would re-introduce a FAISS-shaped subsystem with signals, debounce, and locks. Defer until evidence frequency is insufficient.
|
||||
- Tag hierarchy awareness — hinting `Medical/Bloodwork` vs `Bloodwork` when tags are nested.
|
||||
- Custom field option values.
|
||||
- `StoragePath` template-expression hinting (vs raw `name`).
|
||||
@@ -0,0 +1,308 @@
|
||||
# Usage Reporting — Technical Spec
|
||||
|
||||
Voluntary, opt-in usage reporting for paperless-ngx. The goal is to
|
||||
understand how many instances are running a given release (especially
|
||||
beta), which platforms and architectures are in use, and what features
|
||||
are being deployed — without collecting any personal data or document
|
||||
content.
|
||||
|
||||
---
|
||||
|
||||
## Guiding principles
|
||||
|
||||
- **Explicitly opt-in.** Nothing is sent automatically. The user runs
|
||||
the command and confirms before any network call is made.
|
||||
- **Transparent.** The exact payload is shown before sending.
|
||||
- **Anonymous.** The UUID is a random identifier with no link to
|
||||
identity, IP address, or hostname.
|
||||
- **Graceful.** Network failures produce a friendly message, never a
|
||||
stack trace.
|
||||
|
||||
---
|
||||
|
||||
## Client — management command
|
||||
|
||||
### Name
|
||||
|
||||
```
manage.py send_usage_report
```
|
||||
|
||||
### Flags
|
||||
|
||||
| Flag        | Behaviour                                                 |
| ----------- | --------------------------------------------------------- |
| _(none)_    | Show payload, prompt for confirmation, send on `y`/`yes`  |
| `--dry-run` | Show payload, skip confirmation and network call entirely |
|
||||
|
||||
### UUID storage
|
||||
|
||||
A random UUID4 is generated on the first run and written to
|
||||
`PAPERLESS_DATA_DIR/usage_uuid` (plain text, one line). Subsequent
|
||||
runs reuse the same file. If the file is missing it is regenerated
|
||||
(counts as a new install — acceptable).
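A minimal sketch of that behaviour, assuming a hypothetical helper name `load_or_create_usage_uuid` and a `data_dir` argument standing in for `PAPERLESS_DATA_DIR`:

```python
import uuid
from pathlib import Path


def load_or_create_usage_uuid(data_dir: Path) -> str:
    """Return the stored installation ID, generating it on first use."""
    uuid_file = data_dir / "usage_uuid"
    if uuid_file.is_file():
        stored = uuid_file.read_text(encoding="utf-8").strip()
        if stored:
            return stored
    # Missing (or empty) file: treat as a new install.
    new_id = str(uuid.uuid4())
    uuid_file.write_text(new_id + "\n", encoding="utf-8")
    return new_id
```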
|
||||
|
||||
### Confirmation flow
|
||||
|
||||
```
The following information will be sent to paperless-ngx to help
improve the project:

Installation ID : a1b2c3d4-e5f6-4890-abcd-ef1234567890
Version : 2.15.0
Channel : beta
Commit : bd86dca57 (built 2026-05-18T12:00:00Z)
Install type : docker
Architecture : x86_64
Python : 3.12.3
Database : postgresql
Documents : 1000–9999
Multi-user : yes
Mail enabled : yes
AI enabled : no

No personal data, document content, or IP address is stored.
More information: https://docs.paperless-ngx.com/usage-reporting/

Send this report? [y/N]:
```
|
||||
|
||||
Default answer is **N**. Anything other than `y`/`yes` aborts with
|
||||
no network call and prints `Nothing sent.`
|
||||
|
||||
`--dry-run` skips the prompt entirely and prints `Dry run — nothing sent.`
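The decision reduces to a small predicate. `should_send` is a hypothetical name, and treating the answer case-insensitively is an assumption the spec does not state.

```python
def should_send(answer: str, dry_run: bool = False) -> bool:
    """True only for an explicit yes on a non-dry run."""
    if dry_run:
        # --dry-run never prompts and never sends.
        return False
    # Default is N: an empty or unrecognised answer aborts.
    return answer.strip().lower() in {"y", "yes"}
```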
|
||||
|
||||
### Network error handling
|
||||
|
||||
- Timeout: 10 seconds
|
||||
- On any failure (timeout, DNS, HTTP error): print a single friendly
|
||||
line, exit 0 (not an error from the user's perspective)
|
||||
|
||||
```
Could not reach the reporting endpoint. Nothing was sent.
```
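One way to get this behaviour with only the standard library is sketched below. The endpoint URL is a placeholder, `send_report` and `_post_json` are hypothetical names, and the success message is invented for the example. A real command would also inspect an HTTP `429` response before falling back to the generic line.

```python
import json
import urllib.request
from typing import Callable

ENDPOINT = "https://example.invalid/report"  # placeholder, not the real URL
TIMEOUT_SECONDS = 10.0


def _post_json(url: str, payload: dict, timeout: float) -> int:
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def send_report(
    payload: dict,
    post: Callable[[str, dict, float], int] = _post_json,
) -> str:
    """Send the report and return the single line to print."""
    try:
        post(ENDPOINT, payload, TIMEOUT_SECONDS)
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return "Could not reach the reporting endpoint. Nothing was sent."
    # Hypothetical success message; the spec does not define one.
    return "Report sent. Thank you!"
```

Passing the transport in as `post` keeps the function testable without a network; the command would still exit 0 in both branches.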
|
||||
|
||||
### Duplicate submission handling
|
||||
|
||||
The server returns `429` if the UUID was seen within the last 7 days,
|
||||
with a JSON body:
|
||||
|
||||
```json
{
  "error": "already_submitted",
  "last_sent": "2026-05-15T10:00:00Z",
  "retry_after_days": 4
}
```
|
||||
|
||||
The command prints:
|
||||
|
||||
```
Already submitted 3 days ago. Nothing sent.
You can send again after 2026-05-22.
```
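The message can be derived from the `429` body alone. This sketch assumes the retry date is `last_sent` plus the 7-day window; `already_submitted_message` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=7)


def already_submitted_message(body: dict, now: datetime) -> str:
    """Build the two-line notice from the server's 429 JSON body."""
    # Replace the trailing "Z" so older Python versions can parse the value.
    last_sent = datetime.fromisoformat(body["last_sent"].replace("Z", "+00:00"))
    days_ago = (now - last_sent).days
    next_allowed = (last_sent + WINDOW).date().isoformat()
    return (
        f"Already submitted {days_ago} days ago. Nothing sent.\n"
        f"You can send again after {next_allowed}."
    )
```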
|
||||
|
||||
---
|
||||
|
||||
## Payload schema
|
||||
|
||||
All fields are strings unless noted. Fields marked _omit if absent_
|
||||
are left out of the JSON entirely when the value is unavailable —
|
||||
never sent as `null`.
|
||||
|
||||
| Field          | Source                                                    | Notes                                            |
| -------------- | --------------------------------------------------------- | ------------------------------------------------ |
| `uuid`         | `PAPERLESS_DATA_DIR/usage_uuid`                           | UUID4, random                                    |
| `version`      | `paperless/version.py` — `__full_version_str__`           | e.g. `"2.15.0"`                                  |
| `channel`      | `paperless/version.py` — `__channel__`                    | `"stable"` \| `"beta"` \| `"dev"`                |
| `commit`       | `paperless/build_info.py` — `SOURCE_COMMIT`               | Short SHA — _omit if absent_                     |
| `build_date`   | `paperless/build_info.py` — `BUILD_DATE`                  | ISO 8601 — _omit if absent_                      |
| `install_type` | Detected at runtime (see below)                           |                                                  |
| `arch`         | `platform.machine()`                                      | e.g. `"x86_64"`, `"aarch64"`                     |
| `python`       | `platform.python_version()`                               | e.g. `"3.12.3"`                                  |
| `database`     | Last segment of `settings.DATABASES["default"]["ENGINE"]` | e.g. `"postgresql"`, `"sqlite3"`                 |
| `doc_bucket`   | Bucketed document count (see below)                       |                                                  |
| `multi_user`   | boolean                                                   | `true` if more than one real user account exists |
| `feature_mail` | boolean                                                   | `true` if any mail account is configured         |
| `feature_ai`   | boolean                                                   | `true` if AI features are enabled in settings    |
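Two of the rules above are mechanical enough to sketch: taking the last segment of the database engine path, and dropping absent optional fields instead of sending `null`. The helper names `database_name` and `build_payload` are hypothetical.

```python
from typing import Any, Optional


def database_name(engine: str) -> str:
    """Last dotted segment of a Django ENGINE string."""
    return engine.rsplit(".", 1)[-1]


def build_payload(
    required: dict[str, Any],
    optional: dict[str, Optional[str]],
) -> dict[str, Any]:
    """Merge fields, leaving out optional ones whose value is None."""
    payload = dict(required)
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

For a source checkout without `build_info.py`, `build_payload(fields, {"commit": None, "build_date": None})` yields a payload with neither key present.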
|
||||
|
||||
### Document count buckets
|
||||
|
||||
| Range         | Value           |
| ------------- | --------------- |
| 0–99          | `"0-99"`        |
| 100–999       | `"100-999"`     |
| 1 000–9 999   | `"1000-9999"`   |
| 10 000–49 999 | `"10000-49999"` |
| 50 000+       | `"50000+"`      |
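The bucketing is a straightforward threshold lookup. A minimal sketch, with `doc_bucket` as a hypothetical function name:

```python
def doc_bucket(count: int) -> str:
    """Map an exact document count to its coarse reporting bucket."""
    if count < 100:
        return "0-99"
    if count < 1_000:
        return "100-999"
    if count < 10_000:
        return "1000-9999"
    if count < 50_000:
        return "10000-49999"
    return "50000+"
```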
|
||||
|
||||
### Install type detection
|
||||
|
||||
Evaluated in order; first match wins.
|
||||
|
||||
| Value          | Detection                                                   |
| -------------- | ----------------------------------------------------------- |
| `"kubernetes"` | `KUBERNETES_SERVICE_HOST` env var is set                    |
| `"podman"`     | `container` env var equals `"podman"`                       |
| `"docker"`     | `Path("/.dockerenv").exists()`                              |
| `"nixos"`      | `"/nix/store/"` in `sys.executable`                         |
| `"snap"`       | `SNAP` env var is set                                       |
| `"flatpak"`    | `FLATPAK_ID` env var is set                                 |
| `"distro"`     | `paperless/distro_info.py` exists (set by distro packagers) |
| `"release"`    | `paperless/build_info.py` exists (none of the above)        |
| `"source"`     | Fallback — dev checkout                                     |
|
||||
|
||||
Distro packagers (Debian, NixOS community, Unraid, etc.) can opt in
|
||||
by shipping a `src/paperless/distro_info.py` containing:
|
||||
|
||||
```python
DISTRO = "debian"  # or "rpm", "homebrew", "unraid", etc.
```
|
||||
|
||||
When present the install type is reported as the `DISTRO` value rather
|
||||
than `"distro"`.
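The ordered checks can be sketched as a pure function. `detect_install_type` and its parameters are hypothetical: the environment, interpreter path, and filesystem check are passed in so the order is easy to test, and `distro` / `has_build_info` stand in for whether `distro_info.py` and `build_info.py` could be imported.

```python
from typing import Callable, Mapping, Optional


def detect_install_type(
    env: Mapping[str, str],
    executable: str,
    path_exists: Callable[[str], bool],
    distro: Optional[str] = None,
    has_build_info: bool = False,
) -> str:
    """Evaluate the detection rules in order; first match wins."""
    if "KUBERNETES_SERVICE_HOST" in env:
        return "kubernetes"
    if env.get("container") == "podman":
        return "podman"
    if path_exists("/.dockerenv"):
        return "docker"
    if "/nix/store/" in executable:
        return "nixos"
    if "SNAP" in env:
        return "snap"
    if "FLATPAK_ID" in env:
        return "flatpak"
    if distro is not None:
        # Report the packager-supplied DISTRO value itself.
        return distro
    if has_build_info:
        return "release"
    return "source"
```

In the command this would be called with `os.environ`, `sys.executable`, and `os.path.exists`.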
|
||||
|
||||
### `version.py` additions
|
||||
|
||||
Add `__channel__` alongside the existing version fields:
|
||||
|
||||
```python
__channel__: Final[str] = "beta"  # "stable" | "beta" | "dev"
```
|
||||
|
||||
This is the canonical place to set the channel when preparing a
|
||||
release. `"dev"` is the default for unreleased branches.
|
||||
|
||||
### `build_info.py`
|
||||
|
||||
Generated at build time, never committed (add to `.gitignore`).
|
||||
|
||||
```python
SOURCE_COMMIT = "bd86dca57"
BUILD_DATE = "2026-05-18T12:00:00Z"
```
|
||||
|
||||
---
|
||||
|
||||
## Server — Cloudflare Worker
|
||||
|
||||
Managed in a separate repository under the paperless-ngx GitHub org
|
||||
(e.g. `paperless-ngx/telemetry`). Deployed via Wrangler.
|
||||
|
||||
### Endpoint
|
||||
|
||||
```
POST /report
Content-Type: application/json
```
|
||||
|
||||
Returns `204` on success. No response body.
|
||||
|
||||
### Timestamp
|
||||
|
||||
`received` is always set server-side. Any client-supplied timestamp
|
||||
field is ignored.
|
||||
|
||||
### Validation
|
||||
|
||||
Reject with `400` if any of the following fail:
|
||||
|
||||
- `uuid` does not match UUID4 format
|
||||
- `version` does not match `\d+\.\d+\.\d+`
|
||||
- `channel` is not one of `stable`, `beta`, `dev`
|
||||
- `install_type` is not in the known set
|
||||
- `arch` is absent
|
||||
- Payload is not valid JSON or exceeds 4 KB
|
||||
|
||||
Unknown extra fields are silently ignored (forward compatibility).
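The Worker itself lives in a separate repository and is not written in Python; the sketch below only illustrates the validation rules in the language used elsewhere in this document. `validate` is a hypothetical name, the `INSTALL_TYPES` set is an assumption built from the detection table (distro-specific values would need adding), and the version pattern is applied from the start of the string.

```python
import json
import re
from typing import Optional

UUID4_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")
CHANNELS = {"stable", "beta", "dev"}
# Assumed "known set"; distro-specific DISTRO values are not listed here.
INSTALL_TYPES = {
    "kubernetes", "podman", "docker", "nixos", "snap",
    "flatpak", "distro", "release", "source",
}
MAX_BYTES = 4 * 1024


def validate(raw: bytes) -> Optional[str]:
    """Return None if acceptable, else a short reason for the 400."""
    if len(raw) > MAX_BYTES:
        return "payload too large"
    try:
        data = json.loads(raw)
    except ValueError:
        return "invalid JSON"
    if not isinstance(data, dict):
        return "invalid JSON"

    def text(key: str) -> str:
        value = data.get(key)
        return value if isinstance(value, str) else ""

    if not UUID4_RE.match(text("uuid")):
        return "bad uuid"
    if not VERSION_RE.match(text("version")):
        return "bad version"
    if text("channel") not in CHANNELS:
        return "bad channel"
    if text("install_type") not in INSTALL_TYPES:
        return "bad install_type"
    if not text("arch"):
        return "missing arch"
    # Unknown extra fields are ignored on purpose.
    return None
```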
|
||||
|
||||
### Deduplication
|
||||
|
||||
Before inserting, query for the most recent submission from this UUID:
|
||||
|
||||
```sql
SELECT received FROM reports
WHERE uuid = ?
ORDER BY received DESC
LIMIT 1
```
|
||||
|
||||
If the result is within 7 days of now, return:
|
||||
|
||||
```
HTTP 429
{ "error": "already_submitted", "last_sent": "<iso>", "retry_after_days": <n> }
```
|
||||
|
||||
Otherwise insert and return `204`.
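The check-then-insert flow can be exercised locally against an in-memory SQLite database, since D1 is SQLite-based. This is an illustration only: `handle_report` is a hypothetical name and the table is reduced to the columns the check needs.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=7)


def handle_report(conn: sqlite3.Connection, uuid: str, now: datetime) -> int:
    """Return the HTTP status the server would send: 429 or 204."""
    row = conn.execute(
        "SELECT received FROM reports WHERE uuid = ? ORDER BY received DESC LIMIT 1",
        (uuid,),
    ).fetchone()
    if row is not None and now - datetime.fromisoformat(row[0]) < WINDOW:
        return 429
    # `received` is always set server-side, never taken from the client.
    conn.execute(
        "INSERT INTO reports (received, uuid) VALUES (?, ?)",
        (now.isoformat(), uuid),
    )
    return 204


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports ("
    "id INTEGER PRIMARY KEY, received TEXT NOT NULL, uuid TEXT NOT NULL)"
)
```

Sorting `received` as text works here because every timestamp is written in the same ISO 8601 form with the same UTC offset.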
|
||||
|
||||
### D1 schema
|
||||
|
||||
```sql
CREATE TABLE reports (
  id           INTEGER PRIMARY KEY,
  received     TEXT NOT NULL,  -- ISO 8601, server-side
  uuid         TEXT NOT NULL,
  version      TEXT,
  channel      TEXT,
  "commit"     TEXT,           -- quoted: COMMIT is a reserved word in SQLite
  build_date   TEXT,
  install_type TEXT,
  arch         TEXT,
  python       TEXT,
  database     TEXT,
  doc_bucket   TEXT,
  multi_user   INTEGER,        -- 0 / 1
  feature_mail INTEGER,        -- 0 / 1
  feature_ai   INTEGER         -- 0 / 1
);

CREATE INDEX idx_reports_uuid ON reports(uuid);
CREATE INDEX idx_reports_channel ON reports(channel);
CREATE INDEX idx_reports_version ON reports(version);
```
|
||||
|
||||
---
|
||||
|
||||
## Useful queries
|
||||
|
||||
```sql
-- Distinct beta installs
SELECT COUNT(DISTINCT uuid)
FROM reports
WHERE channel = 'beta';

-- Installs by commit (beta only); "commit" is quoted because it is a reserved word
SELECT "commit", COUNT(DISTINCT uuid) AS installs
FROM reports
WHERE channel = 'beta'
GROUP BY "commit"
ORDER BY installs DESC;

-- Architecture breakdown
SELECT arch, COUNT(DISTINCT uuid) AS installs
FROM reports
GROUP BY arch
ORDER BY installs DESC;

-- Install type split
SELECT install_type, COUNT(DISTINCT uuid) AS installs
FROM reports
GROUP BY install_type
ORDER BY installs DESC;

-- Database backend split
SELECT database, COUNT(DISTINCT uuid) AS installs
FROM reports
GROUP BY database
ORDER BY installs DESC;
```
|
||||
|
||||
---
|
||||
|
||||
## Out of scope (for now)
|
||||
|
||||
- Automatic or scheduled reporting
|
||||
- Any opt-out settings flag
|
||||
- Server-side dashboard (raw SQL is sufficient)
|
||||
- Locale, timezone, or OS version fields
|
||||