From 84ab36ba7004e8de5dcd2b593420b4524f58cc13 Mon Sep 17 00:00:00 2001
From: Trenton H <797416+stumpylog@users.noreply.github.com>
Date: Fri, 27 Mar 2026 08:04:07 -0700
Subject: [PATCH] Try to further clarify some interactions

---
 docs/configuration.md | 27 +++++++++++++++++++++++----
 docs/migration-v3.md  |  7 +++++++
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 1250f7109..cc2e0183c 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -853,12 +853,31 @@ for display in the web interface.
     stored. Saves disk space but the web viewer will display the
     original file directly.
 
+    **Behaviour by file type and mode** (`auto` column shows the default):
+
+    | Document type              | `never` | `auto` (default)           | `always` |
+    | -------------------------- | ------- | -------------------------- | -------- |
+    | Scanned image (TIFF, JPEG) | No      | **Yes**                    | Yes      |
+    | Image-based PDF            | No      | **Yes** (short/no text)    | Yes      |
+    | Born-digital PDF           | No      | No (has embedded text)     | Yes      |
+    | Plain text, email, HTML    | No      | No                         | No       |
+    | DOCX / ODT (via Tika)      | Yes\*   | Yes\*                      | Yes\*    |
+
+    \* Tika always produces a PDF rendition for display; this counts as
+    the archive regardless of the setting.
+
     !!! note
 
-        This setting only applies to parsers that can produce archives
-        (e.g. the Tesseract/OCR parser). Parsers that must convert
-        documents to PDF for display (e.g. DOCX, ODT via Tika) will
-        always produce a PDF regardless of this setting.
+        This setting applies to the built-in Tesseract parser. Parsers
+        that must always convert documents to PDF for display (e.g. DOCX,
+        ODT via Tika) will produce a PDF regardless of this setting.
+
+    !!! note
+
+        The **remote OCR parser** (Azure AI) always produces a searchable
+        PDF and stores it as the archive copy, regardless of this setting.
+        `ARCHIVE_FILE_GENERATION=never` has no effect when the remote
+        parser handles a document.
 
 #### [`PAPERLESS_OCR_CLEAN=`](#PAPERLESS_OCR_CLEAN) {#PAPERLESS_OCR_CLEAN}
 
diff --git a/docs/migration-v3.md b/docs/migration-v3.md
index 014b229ba..67bbaa90c 100644
--- a/docs/migration-v3.md
+++ b/docs/migration-v3.md
@@ -156,6 +156,13 @@ PAPERLESS_ARCHIVE_FILE_GENERATION=auto
 
 Paperless will emit a startup warning if the old environment variables are still set.
 
+### Remote OCR parser
+
+If you use the **remote OCR parser** (Azure AI), note that it always produces a
+searchable PDF and stores it as the archive copy. `ARCHIVE_FILE_GENERATION=never`
+has no effect for documents handled by the remote parser — the archive is produced
+unconditionally by the remote engine.
+
 ## OpenID Connect Token Endpoint Authentication
 
 Some existing OpenID Connect setups may require an explicit token endpoint authentication method after upgrading to v3.
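
The behaviour table and the two notes added to `docs/configuration.md` in this patch can be read as one small decision function. The following is a minimal sketch of that reading in Python, not the project's actual implementation: `Mode`, `DocKind`, `produces_archive`, and the `handled_by_remote_parser` flag are hypothetical names introduced only for illustration, and the sketch assumes the document kind (for example, whether a PDF already has embedded text) has been determined beforehand.

```python
from enum import Enum


class Mode(Enum):
    """Values of the archive file generation setting described in the patch."""
    NEVER = "never"
    AUTO = "auto"      # documented default
    ALWAYS = "always"


class DocKind(Enum):
    """Hypothetical labels for the rows of the behaviour table."""
    SCANNED_IMAGE = "scanned_image"        # TIFF, JPEG
    IMAGE_PDF = "image_pdf"                # short or no embedded text
    BORN_DIGITAL_PDF = "born_digital_pdf"  # has embedded text
    TEXT_LIKE = "text_like"                # plain text, email, HTML
    TIKA_OFFICE = "tika_office"            # DOCX / ODT via Tika


def produces_archive(
    kind: DocKind,
    mode: Mode = Mode.AUTO,
    handled_by_remote_parser: bool = False,
) -> bool:
    """Return True if an archive copy is expected for this document (illustrative only)."""
    # Per the second note: the remote OCR parser ignores the setting.
    if handled_by_remote_parser:
        return True
    # Per the table footnote: the Tika PDF rendition counts as the archive.
    if kind is DocKind.TIKA_OFFICE:
        return True
    # Plain text, email and HTML show "No" in every column.
    if kind is DocKind.TEXT_LIKE:
        return False
    # Remaining rows are images and PDFs, where the setting applies.
    if mode is Mode.ALWAYS:
        return True
    if mode is Mode.NEVER:
        return False
    # Mode.AUTO: archive only when the document lacks usable embedded text.
    return kind in (DocKind.SCANNED_IMAGE, DocKind.IMAGE_PDF)


if __name__ == "__main__":
    for kind in DocKind:
        row = {mode.value: produces_archive(kind, mode) for mode in Mode}
        print(f"{kind.value:18} {row}")
```

Running the module prints one line per table row, which makes it easy to compare the sketch against the documented table. The early returns mirror the order of precedence implied by the notes: parser-level behaviour (remote OCR, Tika) comes first, and the setting is consulted only afterwards.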