Documentation (beta): Updates documentation for new v3 features (#13033)

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-06-20 20:34:20 +00:00 · 2026-06-18 16:20:31 -07:00
parent a009ea1f04
commit bb5d7438b1
10 changed files with 187 additions and 52 deletions
@@ -65,6 +65,11 @@ copies you created in the steps above.

    Please review the [migration instructions](migration-v3.md) before upgrading Paperless-ngx to v3.0. The release includes breaking changes that require manual intervention before upgrading.

+!!! note
+
+    Upgrading to v3 clears the existing task history; previously completed, failed, or
+    acknowledged tasks will no longer appear in the task list afterward. No action is required.
+
 ### Docker Route {#docker-updating}

 If a new release of paperless-ngx is available, upgrading depends on how
@@ -500,6 +505,33 @@ task scheduler.
    python3 manage.py document_index reindex --if-needed
    ```

+### Managing the LLM (AI) index {#llm-index}
+
+When the [AI features](advanced_usage.md#ai-features) are enabled with an embedding
+backend, Paperless-ngx maintains a vector index of your documents used for
+Retrieval-Augmented Generation (RAG), similar-document retrieval, and document chat. The
+index is updated automatically on the schedule set by
+[`PAPERLESS_LLM_INDEX_TASK_CRON`](configuration.md#PAPERLESS_LLM_INDEX_TASK_CRON), but you
+can manage it manually:
+
+```
+document_llmindex {rebuild,update,compact}
+```
+
+Specify `rebuild` to build the index from scratch from all documents in the database. Use
+this the first time you enable the feature, or after changing the embedding backend or
+model.
+
+Specify `update` to incrementally index new and changed documents. This is what the
+scheduled task runs.
+
+Specify `compact` to reclaim space and optimize the on-disk vector store.
+
+!!! note
+
+    These commands have no effect unless AI is enabled and an embedding backend is
+    configured.
+
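As a rough illustration, the three actions above could be driven from a script. This sketch assumes the command is invoked as `python3 manage.py document_llmindex <action>`, mirroring the other management commands in this guide. The helper names `llmindex_command` and `run_llmindex` are hypothetical and not part of Paperless-ngx.

```python
import subprocess

# The three actions listed in the usage line above.
VALID_ACTIONS = ("rebuild", "update", "compact")


def llmindex_command(action: str) -> list[str]:
    """Build the argv for one document_llmindex action (hypothetical helper)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {VALID_ACTIONS}")
    # Assumption: invoked like the other management commands in this guide.
    return ["python3", "manage.py", "document_llmindex", action]


def run_llmindex(action: str, cwd: str) -> int:
    """Run the command inside `cwd` and return its exit code."""
    return subprocess.run(llmindex_command(action), cwd=cwd, check=False).returncode
```

For example, `run_llmindex("rebuild", cwd=...)` would be the call to make after switching the embedding model, with `cwd` set to wherever `manage.py` lives in your installation.
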
 ### Clearing the database read cache

 If the database read cache is enabled, **you must run this command** after making any changes to the database outside the application context.
@@ -97,6 +97,85 @@ when using this feature:
  of these correspondents to ANY new document, if both are set to
  automatic matching.

+## AI features {#ai-features}
+
+Paperless-ngx includes a set of optional features backed by a large language model
+(LLM): AI-assisted suggestions, similar-document retrieval, and a document chat. They
+are **off by default** and never replace the built-in, non-LLM
+[matching and suggestions](#matching).
+
+!!! warning
+
+    Enabling these features sends document content (and metadata) to the LLM backend you
+    configure. If that backend is a remote/hosted provider, your documents leave your
+    server and may incur usage charges. Consider the privacy implications before enabling,
+    and prefer a local backend (Ollama, or a self-hosted OpenAI-compatible gateway) if that
+    matters to you.
+
+All AI settings can be supplied as `PAPERLESS_AI_*` environment variables (see
+[configuration](configuration.md#ai)) or set in the admin under
+**Settings → Application Configuration**; the database value takes precedence over the
+environment.
+
+### Enabling the AI features
+
+At a minimum you need to enable AI and choose an LLM backend:
+
+- [`PAPERLESS_AI_ENABLED`](configuration.md#PAPERLESS_AI_ENABLED) — master switch.
+- [`PAPERLESS_AI_LLM_BACKEND`](configuration.md#PAPERLESS_AI_LLM_BACKEND) — `ollama`
+  (runs locally) or `openai-like` (OpenAI itself or any OpenAI-compatible API).
+- [`PAPERLESS_AI_LLM_MODEL`](configuration.md#PAPERLESS_AI_LLM_MODEL) selects the model. For
+  `openai-like` you usually also need [`PAPERLESS_AI_LLM_API_KEY`](configuration.md#PAPERLESS_AI_LLM_API_KEY)
+  and/or [`PAPERLESS_AI_LLM_ENDPOINT`](configuration.md#PAPERLESS_AI_LLM_ENDPOINT). Ollama
+  requires `PAPERLESS_AI_LLM_ENDPOINT` to point at your Ollama server.
+
+### AI-assisted suggestions
+
+With AI enabled, Paperless-ngx can suggest a title, tags, correspondent, document type,
+storage path and dates by sending the document to the LLM. This is **opt-in per request**
+and surfaces through the "Suggest" control on the document detail page, alongside the
+classic classifier-based suggestions — it does not disable them. Suggestion output
+language can be steered with
+[`PAPERLESS_AI_LLM_OUTPUT_LANGUAGE`](configuration.md#PAPERLESS_AI_LLM_OUTPUT_LANGUAGE)
+(otherwise it follows the user's UI language).
+
+### The LLM index (RAG) and similar documents
+
+Setting an embedding backend turns on the **LLM index**, a vector index of your documents
+that enables Retrieval-Augmented Generation (RAG). When enabled, suggestions are grounded
+in similar existing documents, and the document chat can retrieve relevant context.
+
+Enable it by setting
+[`PAPERLESS_AI_LLM_EMBEDDING_BACKEND`](configuration.md#PAPERLESS_AI_LLM_EMBEDDING_BACKEND)
+(`huggingface` for fully-local embeddings, or `ollama` / `openai-like`). The index is only
+built when AI is enabled **and** an embedding backend is set.
+
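The two-part condition above can be sketched as a small pre-flight check. It only inspects environment variables, so it cannot see values stored in the database, which take precedence according to this page. The set of strings treated as true and the function name `llm_index_enabled` are assumptions for illustration, not Paperless-ngx's own logic.

```python
import os
from collections.abc import Mapping

# Assumption: strings treated as "true" for a boolean environment variable.
TRUTHY = {"1", "true", "yes", "y", "t", "on"}


def llm_index_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when AI is enabled and an embedding backend is set."""
    ai_enabled = env.get("PAPERLESS_AI_ENABLED", "").strip().lower() in TRUTHY
    backend = env.get("PAPERLESS_AI_LLM_EMBEDDING_BACKEND", "").strip()
    return ai_enabled and bool(backend)
```
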
+The index is updated automatically on a schedule controlled by
+[`PAPERLESS_LLM_INDEX_TASK_CRON`](configuration.md#PAPERLESS_LLM_INDEX_TASK_CRON) (daily by
+default), and can be rebuilt or compacted manually — see
+[Managing the LLM index](administration.md#llm-index).
+
+!!! note
+
+    Local embeddings via `huggingface` download the embedding model on first use into the
+    Paperless data directory. The first run therefore needs network access and some disk
+    space.
+
+### Document chat
+
+When the LLM index is enabled, the chat control in the top app toolbar answers questions
+about your documents. It operates over a single document or across multiple documents
+depending on the current view, and its answers include links to the source documents it
+drew from.
+
+### AI security notes
+
+- Document content is passed to the LLM as **untrusted data**.
+- By default Paperless-ngx allows AI endpoints that resolve to private/loopback addresses
+  (for local backends). Set
+  [`PAPERLESS_AI_LLM_ALLOW_INTERNAL_ENDPOINTS`](configuration.md#PAPERLESS_AI_LLM_ALLOW_INTERNAL_ENDPOINTS)
+  to `false` to block them.
+
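To make "private/loopback" concrete, the sketch below classifies an already-resolved IP address with Python's standard `ipaddress` module. It is only an illustration of the address classes involved. How Paperless-ngx itself resolves and checks endpoints is not described here, and `is_internal_address` is a hypothetical name.

```python
import ipaddress


def is_internal_address(address: str) -> bool:
    """Return True for loopback or private IPv4/IPv6 addresses."""
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip.is_private
```

Under this classification, a local Ollama server on `127.0.0.1` or a gateway on `192.168.1.10` counts as internal, while a public address such as `8.8.8.8` does not.
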
 ## Hooking into the consumption process {#consume-hooks}

 Sometimes you may want to do something arbitrary whenever a document is
@@ -846,7 +925,7 @@ Paperless is able to utilize barcodes for automatically performing some tasks. B

 At this time, the library utilized for detection of barcodes supports the following types:

-- AN-13/UPC-A
+- EAN-13/UPC-A
 - UPC-E
 - EAN-8
 - Code 128
@@ -855,7 +934,9 @@ At this time, the library utilized for detection of barcodes supports the follow
 - Codabar
 - Interleaved 2 of 5
 - QR Code
-- SQ Code
+- Data Matrix
+- Aztec
+- PDF417

 For usage in Paperless, the type of barcode does not matter, only the contents of it.

@@ -227,6 +227,7 @@ Version-aware endpoints:
 - `PATCH /api/documents/{id}/`: content updates target the selected version (`?version={version_id}`) or latest version by default; non-content metadata updates target the root document.
 - `GET /api/documents/{id}/download/`, `GET /api/documents/{id}/preview/`, `GET /api/documents/{id}/thumb/`, `GET /api/documents/{id}/metadata/`: accept `?version={version_id}`.
 - `POST /api/documents/{id}/update_version/`: uploads a new version using multipart form field `document` and optional `version_label`.
+- `PATCH /api/documents/{id}/versions/{version_id}/`: updates the `version_label` of a specific version.
 - `DELETE /api/documents/{root_id}/versions/{version_id}/`: deletes a non-root version.

 ## Permissions
@@ -445,3 +446,9 @@ Initial API version.
  large lists of object IDs for operations affecting many objects.
 - The legacy `title_content` document search parameter is deprecated and will be removed in a future version.
  Clients should use `text` for simple title-and-content search and `title_search` for title-only search.
+- The task tracking system was redesigned. The tasks list (`/api/tasks/`) is now paginated, and the
+  task object exposes `task_type` (formerly `task_name`) and `trigger_source` (formerly `type`). New
+  read-only endpoints `/api/tasks/summary/`, `/api/tasks/status_counts/`, and `/api/tasks/active/`
+  provide aggregate views, and `POST /api/tasks/run/` lets privileged users dispatch supported tasks.
+  API v9 continues to serve the unpaginated list with the legacy field names until support for v9 is
+  dropped.
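
A client that has to cope with both the paginated list and the legacy unpaginated list might look roughly like the sketch below. It assumes the paginated response is a JSON object with a `results` list and a `next` URL (or null), which this page does not spell out. `fetch_json` stands in for whatever authenticated HTTP call the client already uses, and both helper names are hypothetical.

```python
from collections.abc import Callable, Iterator
from typing import Any

# Legacy (API v9) field name -> redesigned field name, per the note above.
RENAMED_FIELDS = {"task_name": "task_type", "type": "trigger_source"}


def normalize_task(task: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a task object that uses the new field names."""
    return {RENAMED_FIELDS.get(key, key): value for key, value in task.items()}


def iter_tasks(
    fetch_json: Callable[[str], Any],
    first_url: str = "/api/tasks/",
) -> Iterator[dict[str, Any]]:
    """Yield every task, following pages when the response is paginated."""
    url: str | None = first_url
    while url:
        payload = fetch_json(url)
        if isinstance(payload, list):
            # Unpaginated legacy response: a bare list of tasks.
            tasks, url = payload, None
        else:
            # Assumption: page object with "results" and a "next" URL or null.
            tasks, url = payload.get("results", []), payload.get("next")
        for task in tasks:
            yield normalize_task(task)
```

Passing the fetch function in keeps the paging logic independent of any particular HTTP library and easy to exercise with canned responses.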
@@ -62,14 +62,14 @@ and the relevant connection variables.
 #### [`PAPERLESS_DBENGINE=<engine>`](#PAPERLESS_DBENGINE) {#PAPERLESS_DBENGINE}

 : Specifies the database engine to use. Accepted values are `sqlite`, `postgresql`,
-and `mariadb`.
-
-    Defaults to `sqlite` if not set.
+and `mariadb`. PostgreSQL and MariaDB users must set this explicitly.

    PostgreSQL and MariaDB both require [`PAPERLESS_DBHOST`](#PAPERLESS_DBHOST) to be
    set. SQLite does not use any other connection variables; the database file is always
    located at `<PAPERLESS_DATA_DIR>/db.sqlite3`.

+    Defaults to `sqlite`.
+
    !!! warning
        Using MariaDB comes with some caveats.
        See [MySQL Caveats](advanced_usage.md#mysql-caveats).
@@ -892,7 +892,7 @@ modes are available:

    The default is `auto`.

-    For the `skip`, `redo`, and `force` modes, read more about OCR
+    For the `redo` and `force` modes, read more about OCR
    behaviour in the [OCRmyPDF
    documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).

@@ -2131,7 +2131,7 @@ used with the OpenAI-compatible backend to target a custom provider or local gat

    Defaults to true, which allows internal endpoints.

-#### [`PAPERLESS_AI_LLM_INDEX_TASK_CRON=<cron expression>`](#PAPERLESS_AI_LLM_INDEX_TASK_CRON) {#PAPERLESS_AI_LLM_INDEX_TASK_CRON}
+#### [`PAPERLESS_LLM_INDEX_TASK_CRON=<cron expression>`](#PAPERLESS_LLM_INDEX_TASK_CRON) {#PAPERLESS_LLM_INDEX_TASK_CRON}

 : Configures the schedule to update the AI embeddings of text content and metadata for all documents. Only performed if
 AI is enabled and the LLM embedding backend is set.
@@ -132,7 +132,7 @@ uv run manage.py runserver & \
 ```

 You might need the front end to test your back end code.
-This assumes that you have AngularJS installed on your system.
+This assumes that you have Angular installed on your system.
 Go to the [Front end development](#front-end-development) section for further details.
 To build the front end once use this command:

@@ -174,7 +174,7 @@ To add a new development package `uv add --dev <package>`

 ## Front end development

-The front end is built using AngularJS. In order to get started, you need Node.js (version 24+) and
+The front end is built using Angular. In order to get started, you need Node.js (version 24+) and
 `pnpm`.

 !!! note
@@ -248,12 +248,12 @@ that authentication is working.
 ## Localization

 Paperless-ngx is available in many different languages. Since Paperless-ngx
-consists both of a Django application and an AngularJS front end, both
+consists both of a Django application and an Angular front end, both
 these parts have to be translated separately.

 ### Front end localization

-- The AngularJS front end does localization according to the [Angular
+- The Angular front end does localization according to the [Angular
  documentation](https://angular.io/guide/i18n).
 - The source language of the project is "en_US".
 - The source strings end up in the file `src-ui/messages.xlf`.
@@ -495,7 +495,7 @@ class MyCustomParser:
        self._tempdir = Path(
            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR)
        )
-        self._text: str | None = None
+        self._text: str = ""
        self._archive_path: Path | None = None

    def __enter__(self) -> Self:
@@ -553,7 +553,8 @@ def parse(
 **Result accessors**

 ```python
-def get_text(self) -> str | None:
+def get_text(self) -> str:
+    # Return the extracted text, or an empty string if none was found.
    return self._text

 def get_date(self) -> "datetime.datetime | None":
@@ -684,7 +685,7 @@ class XmlDocumentParser:
    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self._tempdir = Path(tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR))
-        self._text: str | None = None
+        self._text: str = ""

    def __enter__(self) -> Self:
        return self
@@ -702,7 +703,7 @@ class XmlDocumentParser:
        except ET.ParseError as e:
            raise ParseError(f"XML parse error: {e}") from e

-    def get_text(self) -> str | None:
+    def get_text(self) -> str:
        return self._text

    def get_date(self):
@@ -70,7 +70,16 @@ elsewhere. Here are a couple notes about that.
 Paperless-ngx determines the type of a file by inspecting its content
 rather than its file extensions. However, files processed via the
 consumption directory will be rejected if they have a file extension that
-not supported by any of the available parsers.
+is not supported by any of the available parsers.
+
+## _Are duplicate documents rejected?_
+
+**A:** Not by default. As of v3, a file whose contents match an existing document is still
+consumed, and the duplicate is flagged in the UI — open the document and check the
+**Duplicates** tab to review documents that share the same content. If you prefer the old
+behavior of rejecting duplicates during consumption, set
+[`PAPERLESS_CONSUMER_DELETE_DUPLICATES`](configuration.md#PAPERLESS_CONSUMER_DELETE_DUPLICATES)
+to `true`.

 ## _Will paperless-ngx run on Raspberry Pi?_

@@ -118,6 +127,16 @@ able to run paperless, you're a bit on your own. If you can't run the
 docker image, the documentation has instructions for bare metal
 installs.

+## _Does Paperless-ngx use AI, and is my data private?_
+
+**A:** Paperless-ngx includes optional AI features — LLM-based suggestions, document chat,
+and similar-document retrieval — that are **disabled by default**. They only run when you
+enable them and configure an LLM backend. The built-in tag/correspondent suggestions use a
+local, non-LLM machine-learning model and do not send your data anywhere. If you enable the
+LLM features, document content is sent to whichever backend you configure — this can be a
+fully local backend (e.g. Ollama) or a remote provider. See
+[AI features](advanced_usage.md#ai-features) for details.
+
 ## _Which message broker should I use?_

 Paperless-ngx talks to a Redis-compatible message broker, so any broker that
@@ -35,9 +35,10 @@ physical documents into a searchable online archive so you can keep, well, _less
  - _New!_ Supports remote OCR with Azure AI (opt-in).
 - Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
 - Uses machine-learning to automatically add tags, correspondents and document types to your documents.
-- **New**: Paperless-ngx can now leverage AI (Large Language Models or LLMs) for document suggestions. This is an optional feature that can be enabled (and is disabled by default).
+- **New**: Paperless-ngx can optionally leverage AI (Large Language Models or LLMs) for document suggestions, chatting with your documents, and similar-document retrieval. These features are opt-in and disabled by default.
 - Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.
 - Paperless stores your documents plain on disk. Filenames and folders are managed by paperless and their format can be configured freely with different configurations assigned to different documents.
+- Keep multiple **versions** of a document's file under a single entry, sharing one set of metadata.
 - **Beautiful, modern web application** that features:
  - Customizable dashboard with statistics.
  - Filtering by tags, correspondents, types, and more.
@@ -178,7 +178,7 @@ to enable polling and disable inotify. See [here](configuration.md#polling).
    - `fonts-liberation` for generating thumbnails for plain text
      files
    - `imagemagick` >= 6 for PDF conversion
-    - `gnupg` for handling encrypted documents
+    - `gnupg` for decrypting GPG-encrypted email
    - `libpq-dev` for PostgreSQL
    - `libmagic-dev` for mime type detection
    - `mariadb-client` for MariaDB compile time
@@ -271,8 +271,8 @@ to enable polling and disable inotify. See [here](configuration.md#polling).
    needs. Required settings for getting Paperless-ngx running are:
    - [`PAPERLESS_REDIS`](configuration.md#PAPERLESS_REDIS) should point to your broker, such as
      `redis://localhost:6379`.
-    - [`PAPERLESS_DBENGINE`](configuration.md#PAPERLESS_DBENGINE) is optional, and should be one of `postgres`,
-      `mariadb`, or `sqlite`
+    - [`PAPERLESS_DBENGINE`](configuration.md#PAPERLESS_DBENGINE) should be one of `postgresql`,
+      `mariadb`, or `sqlite`. PostgreSQL and MariaDB users must set this explicitly.
    - [`PAPERLESS_DBHOST`](configuration.md#PAPERLESS_DBHOST) should be the hostname on which your
      PostgreSQL server is running. Do not configure this to use
      SQLite instead. Also configure port, database name, user and
@@ -450,6 +450,12 @@ development documentation.
 You can migrate to Paperless-ngx from Paperless-ng or from the original
 Paperless project.

+!!! note
+
+    Upgrading an existing Paperless-ngx installation from v2 to v3 has its own
+    breaking changes and required steps. See the [v3 migration guide](migration-v3.md)
+    before upgrading.
+
 <h3 id="migration_ng">Migrating from Paperless-ng</h3>

 Paperless-ngx is meant to be a drop-in replacement for Paperless-ng, and
@@ -149,37 +149,6 @@ operating system, if these are different from `1000`. See [Docker setup](setup.m
 Also ensure that you are able to read and write to the consumption
 directory on the host.

-## OSError: \[Errno 19\] No such device when consuming files
-
-If you experience errors such as:
-
-```shell-session
-File "/usr/local/lib/python3.7/site-packages/whoosh/codec/base.py", line 570, in open_compound_file
-return CompoundStorage(dbfile, use_mmap=storage.supports_mmap)
-File "/usr/local/lib/python3.7/site-packages/whoosh/filedb/compound.py", line 75, in __init__
-self._source = mmap.mmap(fileno, 0, access=mmap.ACCESS_READ)
-OSError: [Errno 19] No such device
-
-During handling of the above exception, another exception occurred:
-
-Traceback (most recent call last):
-File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
-res = f(*task["args"], **task["kwargs"])
-File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
-override_tag_ids=override_tag_ids)
-File "/usr/src/paperless/src/documents/consumer.py", line 271, in try_consume_file
-raise ConsumerError(e)
-```
-
-Paperless uses a search index to provide better and faster full text
-searching. This search index is stored inside the `data` folder. The
-search index uses memory-mapped files (mmap). The above error indicates
-that paperless was unable to create and open these files.
-
-This happens when you're trying to store the data directory on certain
-file systems (mostly network shares) that don't support memory-mapped
-files.
-
 ## Web-UI stuck at "Loading\..."

 This might have multiple reasons.
@@ -292,6 +292,23 @@ Once setup, navigating to the email settings page in Paperless-ngx will allow yo
 You can also submit a document using the REST API; see [POSTing documents](api.md#file-uploads)
 for details.

+### Duplicate documents
+
+By default, Paperless-ngx **does not reject duplicates**. If you consume a file whose
+contents exactly match an existing document (same checksum), the new copy is still
+consumed and a warning is logged. The task entry for the upload also flags that a
+duplicate was detected and links to the existing document(s).
+
+To review duplicates, open a document and switch to the **Duplicates** tab on the
+document detail page. It lists other documents that share the same content, including any
+that are in the trash (shown with a badge), and links to each so you can decide which to
+keep.
+
+If you would rather reject duplicates at consumption time (the pre-v3 behavior), set
+[`PAPERLESS_CONSUMER_DELETE_DUPLICATES`](configuration.md#PAPERLESS_CONSUMER_DELETE_DUPLICATES)
+to `true`. The duplicate file is then deleted instead of consumed, and the task fails with
+a "document already exists" message.
+
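The "same checksum" test described above can be illustrated with a short sketch. The hash algorithm is an assumption (this page only says "checksum"), and `known` is a hypothetical mapping from checksum to document IDs rather than anything Paperless-ngx exposes.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file's bytes in chunks so large files are not loaded at once."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(candidate: Path, known: dict[str, list[int]]) -> list[int]:
    """Return IDs of known documents whose checksum matches the candidate."""
    return known.get(file_checksum(candidate), [])
```

Because the comparison is on file bytes, two scans of the same page that differ by even one byte would not be reported as duplicates by a check of this kind.
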
 ## Document Suggestions

 Paperless-ngx can suggest tags, correspondents, document types and storage paths for documents based on the content of the document. This is done using a (non-LLM) machine learning model that is trained on the documents in your database. The suggestions are shown in the document detail page and can be accepted or rejected by the user.
@@ -306,7 +323,9 @@ Paperless-ngx includes several features that use AI to enhance the document mana
    so consider the privacy implications of using these features, especially if using a remote
    model or API provider instead of the default local model.

-The AI features work by creating an embedding of the text content and metadata of documents, which is then used for various tasks such as similarity search and question answering. This uses the FAISS vector store.
+The AI features work by creating an embedding of the text content and metadata of documents, which is then used for various tasks such as similarity search and question answering.
+
+See [AI features](advanced_usage.md#ai-features) for how to enable and configure these features, including choosing an LLM backend and setting up the LLM index for RAG.

 ### AI-Enhanced Suggestions