docs: update OCR and archive settings docs for v3

- configuration.md: replace PAPERLESS_OCR_SKIP_ARCHIVE_FILE section with PAPERLESS_ARCHIVE_FILE_GENERATION; update OCR_MODE docs to reflect auto as default and document new 'off' mode
- setup.md: update resource-constrained device tip to use new setting names
- migration-v3.md: add OCR and archive settings section documenting all removed settings, their replacements, and migration examples

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 17:45:24 +00:00 · 2026-03-26 16:20:36 -07:00
parent 68322376f2
commit de97eea3e2
3 changed files with 87 additions and 18 deletions
@@ -801,11 +801,14 @@ parsing documents.

 #### [`PAPERLESS_OCR_MODE=<mode>`](#PAPERLESS_OCR_MODE) {#PAPERLESS_OCR_MODE}

-: Tell paperless when and how to perform ocr on your documents. Three
+: Tell paperless when and how to perform OCR on your documents. Four
 modes are available:

-    -   `skip`: Paperless skips all pages and will perform ocr only on
-        pages where no text is present. This is the safest option.
+    -   `auto` (default): Paperless detects whether a document already
+        has embedded text via pdftotext. If sufficient text is found,
+        OCR is skipped for that document (`--skip-text`). If no text is
+        present, OCR runs normally. This is the safest option for mixed
+        document collections.

    -   `redo`: Paperless will OCR all pages of your documents and
        attempt to replace any existing text layers with new text. This
@@ -823,24 +826,39 @@ modes are available:
        significantly larger and text won't appear as sharp when zoomed
        in.

-    The default is `skip`, which only performs OCR when necessary and
-    always creates archived documents.
+    -   `off`: Paperless never invokes the OCR engine. For PDFs, text
+        is extracted via pdftotext only. For image documents, text will
+        be empty. Archive file generation still works via format
+        conversion (no Tesseract or Ghostscript required).
+
+    The default is `auto`.

    Read more about this in the [OCRmyPDF
    documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).

-#### [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE=<mode>`](#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) {#PAPERLESS_OCR_SKIP_ARCHIVE_FILE}
+#### [`PAPERLESS_ARCHIVE_FILE_GENERATION=<mode>`](#PAPERLESS_ARCHIVE_FILE_GENERATION) {#PAPERLESS_ARCHIVE_FILE_GENERATION}

-: Specify when you would like paperless to skip creating an archived
-version of your documents. This is useful if you don't want to have two
-almost-identical versions of your documents in the media folder.
+: Controls when paperless creates a PDF/A archive version of your
+documents. Archive files are stored alongside the original and are used
+for display in the web interface.

-    -   `never`: Never skip creating an archived version.
-    -   `with_text`: Skip creating an archived version for documents
-    that already have embedded text.
-    -   `always`: Always skip creating an archived version.
+    -   `auto` (default): Produce archives for scanned or image-based
+        documents. Skip archive generation for born-digital PDFs that
+        already contain embedded text. This is the recommended setting
+        for mixed document collections.
+    -   `always`: Always produce a PDF/A archive when the parser
+        supports it, regardless of whether the document already has
+        text.
+    -   `never`: Never produce an archive. Only the original file is
+        stored. Saves disk space but the web viewer will display the
+        original file directly.

-    The default is `never`.
+    !!! note
+
+        This setting only applies to parsers that can produce archives
+        (e.g. the Tesseract/OCR parser). Parsers that must convert
+        documents to PDF for display (e.g. DOCX, ODT via Tika) will
+        always produce a PDF regardless of this setting.

 #### [`PAPERLESS_OCR_CLEAN=<mode>`](#PAPERLESS_OCR_CLEAN) {#PAPERLESS_OCR_CLEAN}

@@ -104,6 +104,58 @@ Multiple options are combined in a single value:
 PAPERLESS_DB_OPTIONS="sslmode=require;sslrootcert=/certs/ca.pem;pool.max_size=10"
 ```

+## OCR and Archive File Generation Settings
+
+The settings that control OCR behaviour and archive file generation have been redesigned. The old settings that coupled these two concerns are **removed**, and there are no migration shims.
+
+### Removed settings
+
+| Removed Setting                             | Replacement                                                           |
+| ------------------------------------------- | --------------------------------------------------------------------- |
+| `PAPERLESS_OCR_MODE=skip`                   | `PAPERLESS_OCR_MODE=auto` (new default)                               |
+| `PAPERLESS_OCR_MODE=skip_noarchive`         | `PAPERLESS_OCR_MODE=auto` + `PAPERLESS_ARCHIVE_FILE_GENERATION=never` |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never`     | `PAPERLESS_ARCHIVE_FILE_GENERATION=always`                            |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text` | `PAPERLESS_ARCHIVE_FILE_GENERATION=auto` (new default)                |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always`    | `PAPERLESS_ARCHIVE_FILE_GENERATION=never`                             |
+
+### What changed and why
+
+Previously, `PAPERLESS_OCR_MODE` conflated two independent concerns: whether to run OCR and whether to produce an archive. `skip` meant "skip OCR if text exists, but always produce an archive". `skip_noarchive` meant "skip OCR if text exists, and also skip the archive". This made it impossible to, for example, disable OCR entirely while still producing archives.
+
+The new settings are independent:
+
+- [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) controls OCR: `auto` (default), `force`, `redo`, `off`.
+- [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) controls archive production: `auto` (default), `always`, `never`.
+
+### Action Required
+
+Remove any `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` variable from your environment. If you relied on `PAPERLESS_OCR_MODE=skip` or `PAPERLESS_OCR_MODE=skip_noarchive`, update accordingly:
+
+```bash
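+# v2: skip OCR when text present, always archive
+PAPERLESS_OCR_MODE=skip
+# v3: auto is the new OCR default, so PAPERLESS_OCR_MODE can be left unset.
+# The archive default (auto) skips born-digital PDFs; to keep archiving
+# every document, also set:
+PAPERLESS_ARCHIVE_FILE_GENERATION=always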
+
+# v2: skip OCR when text present, skip archive too
+PAPERLESS_OCR_MODE=skip_noarchive
+# v3: equivalent
+PAPERLESS_OCR_MODE=auto
+PAPERLESS_ARCHIVE_FILE_GENERATION=never
+
+# v2: always skip archive
+PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always
+# v3: equivalent
+PAPERLESS_ARCHIVE_FILE_GENERATION=never
+
+# v2: skip archive only for born-digital docs
+PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text
+# v3: equivalent (auto is the new default)
+PAPERLESS_ARCHIVE_FILE_GENERATION=auto
+```
+
+Paperless will emit a startup warning if the old environment variables are still set.
+
 ## OpenID Connect Token Endpoint Authentication

 Some existing OpenID Connect setups may require an explicit token endpoint authentication method after upgrading to v3.
@@ -633,12 +633,11 @@ hardware, but a few settings can improve performance:
  consumption, so you might want to lower these settings (example: 2
  workers and 1 thread to always have some computing power left for
  other tasks).
-- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `skip` and consider
+- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `auto` and consider
  OCRing your documents before feeding them into Paperless. Some
  scanners are able to do this!
-- Set [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE`](configuration.md#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) to `with_text` to skip archive
-  file generation for already OCRed documents, or `always` to skip it
-  for all documents.
+- Set [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) to `never` to skip archive
+  file generation entirely, saving disk space at the cost of in-browser PDF/A viewing.
 - If you want to perform OCR on the device, consider using
  `PAPERLESS_OCR_CLEAN=none`. This will speed up OCR times and use
  less memory at the expense of slightly worse OCR results.
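
The replacement table added to migration-v3.md in this commit can be read as a small translation over environment variables. The sketch below is a hypothetical helper, not part of Paperless: the function name `migrate_ocr_settings` and the two lookup dictionaries are made up for illustration, and only the variable names and values come from the table. It also assumes that `skip_noarchive` should win when both old settings are present, which the migration notes do not specify.

```python
# Hypothetical helper (not part of Paperless): applies the "Removed settings"
# table from migration-v3.md to a plain dict of environment variables.

# Old PAPERLESS_OCR_MODE values and their v3 replacement.
OCR_MODE_REPLACEMENTS = {"skip": "auto", "skip_noarchive": "auto"}

# Old PAPERLESS_OCR_SKIP_ARCHIVE_FILE values and the matching
# PAPERLESS_ARCHIVE_FILE_GENERATION value.
SKIP_ARCHIVE_REPLACEMENTS = {"never": "always", "with_text": "auto", "always": "never"}


def migrate_ocr_settings(env):
    """Return a copy of env with the removed v2 settings translated to v3."""
    migrated = dict(env)

    # The old variable is always dropped; unrecognised values are discarded.
    old_skip = migrated.pop("PAPERLESS_OCR_SKIP_ARCHIVE_FILE", None)
    if old_skip in SKIP_ARCHIVE_REPLACEMENTS:
        migrated["PAPERLESS_ARCHIVE_FILE_GENERATION"] = SKIP_ARCHIVE_REPLACEMENTS[old_skip]

    old_mode = migrated.get("PAPERLESS_OCR_MODE")
    if old_mode in OCR_MODE_REPLACEMENTS:
        migrated["PAPERLESS_OCR_MODE"] = OCR_MODE_REPLACEMENTS[old_mode]
        if old_mode == "skip_noarchive":
            # Assumption: skip_noarchive takes precedence over any value
            # derived from PAPERLESS_OCR_SKIP_ARCHIVE_FILE.
            migrated["PAPERLESS_ARCHIVE_FILE_GENERATION"] = "never"

    return migrated
```

Calling `migrate_ocr_settings(dict(os.environ))` (after `import os`) would return the translated settings without modifying the running environment. Modes that v3 keeps, such as `redo` and `force`, pass through unchanged.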