docs: update OCR and archive settings docs for v3

- configuration.md: replace the PAPERLESS_OCR_SKIP_ARCHIVE_FILE section with PAPERLESS_ARCHIVE_FILE_GENERATION; update the OCR_MODE docs to reflect auto as the default and document the new 'off' mode
- setup.md: update the resource-constrained device tip to use the new setting names
- migration-v3.md: add an OCR and archive settings section documenting all removed settings, their replacements, and migration examples

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test: use pytest-django settings fixture and pytest.param in new tests
2026-03-27 03:12:45 +00:00 · 2026-03-26 16:38:04 -07:00 · 2026-03-26 16:27:37 -07:00 · 2026-03-26 15:40:02 -07:00 · 2026-03-26 14:33:12 -07:00 · 2026-03-26 14:27:42 -07:00
26 changed files with 1304 additions and 178 deletions
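
For reviewers checking the migration table added to migration-v3.md, the v2 to v3 replacements can be sketched as a small translation function. This is only an illustration under stated assumptions: `translate_v2_ocr_settings` is a hypothetical helper that is not part of this change, and it simply mirrors the replacement table (a `skip_noarchive` mode takes precedence over any `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` value, which is an assumption made for the sketch).

```python
from __future__ import annotations

# Hypothetical helper, not part of this PR: mirrors the replacement table
# documented in docs/migration-v3.md.
_SKIP_ARCHIVE_TO_GENERATION = {
    "never": "always",
    "with_text": "auto",
    "always": "never",
}


def translate_v2_ocr_settings(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of *env* with v2 OCR variables replaced by v3 ones."""
    # The old variable is removed in v3, so never carry it over.
    result = {
        key: value
        for key, value in env.items()
        if key != "PAPERLESS_OCR_SKIP_ARCHIVE_FILE"
    }

    old_skip = env.get("PAPERLESS_OCR_SKIP_ARCHIVE_FILE")
    if old_skip in _SKIP_ARCHIVE_TO_GENERATION:
        result["PAPERLESS_ARCHIVE_FILE_GENERATION"] = _SKIP_ARCHIVE_TO_GENERATION[
            old_skip
        ]

    mode = env.get("PAPERLESS_OCR_MODE")
    if mode == "skip":
        result["PAPERLESS_OCR_MODE"] = "auto"
    elif mode == "skip_noarchive":
        # Assumption for this sketch: skip_noarchive wins over the old
        # PAPERLESS_OCR_SKIP_ARCHIVE_FILE value.
        result["PAPERLESS_OCR_MODE"] = "auto"
        result["PAPERLESS_ARCHIVE_FILE_GENERATION"] = "never"

    return result
```

Values that are already valid in v3, such as `redo` or `force`, pass through unchanged.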
--- a/.gitignore
+++ b/.gitignore
@@ -111,3 +111,4 @@ celerybeat-schedule*

 # ignore pnpm package store folder created when setting up the devcontainer
 .pnpm-store/
+.worktrees
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -801,11 +801,14 @@ parsing documents.

 #### [`PAPERLESS_OCR_MODE=<mode>`](#PAPERLESS_OCR_MODE) {#PAPERLESS_OCR_MODE}

-: Tell paperless when and how to perform ocr on your documents. Three
+: Tell paperless when and how to perform OCR on your documents. Four
 modes are available:

-    -   `skip`: Paperless skips all pages and will perform ocr only on
-        pages where no text is present. This is the safest option.
+    -   `auto` (default): Paperless uses pdftotext to check whether a
+        document already has embedded text. If sufficient text is
+        found, OCR is skipped for that document (`--skip-text`).
+        Otherwise, OCR runs normally. This is the safest option for
+        mixed document collections.

    -   `redo`: Paperless will OCR all pages of your documents and
        attempt to replace any existing text layers with new text. This
@@ -823,24 +826,39 @@ modes are available:
        significantly larger and text won't appear as sharp when zoomed
        in.

-    The default is `skip`, which only performs OCR when necessary and
-    always creates archived documents.
+    -   `off`: Paperless never invokes the OCR engine. For PDFs, text
+        is extracted via pdftotext only. For image documents, text will
+        be empty. Archive file generation still works via format
+        conversion (no Tesseract or Ghostscript required).
+
+    The default is `auto`.

    Read more about this in the [OCRmyPDF
    documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).

-#### [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE=<mode>`](#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) {#PAPERLESS_OCR_SKIP_ARCHIVE_FILE}
+#### [`PAPERLESS_ARCHIVE_FILE_GENERATION=<mode>`](#PAPERLESS_ARCHIVE_FILE_GENERATION) {#PAPERLESS_ARCHIVE_FILE_GENERATION}

-: Specify when you would like paperless to skip creating an archived
-version of your documents. This is useful if you don't want to have two
-almost-identical versions of your documents in the media folder.
+: Controls when paperless creates a PDF/A archive version of your
+documents. Archive files are stored alongside the original and are used
+for display in the web interface.

-    -   `never`: Never skip creating an archived version.
-    -   `with_text`: Skip creating an archived version for documents
-    that already have embedded text.
-    -   `always`: Always skip creating an archived version.
+    -   `auto` (default): Produce archives for scanned or image-based
+        documents. Skip archive generation for born-digital PDFs that
+        already contain embedded text. This is the recommended setting
+        for mixed document collections.
+    -   `always`: Always produce a PDF/A archive when the parser
+        supports it, regardless of whether the document already has
+        text.
+    -   `never`: Never produce an archive. Only the original file is
+        stored. Saves disk space but the web viewer will display the
+        original file directly.

-    The default is `never`.
+    !!! note
+
+        This setting only applies to parsers that can produce archives
+        (e.g. the Tesseract/OCR parser). Parsers that must convert
+        documents to PDF for display (e.g. DOCX, ODT via Tika) will
+        always produce a PDF regardless of this setting.

 #### [`PAPERLESS_OCR_CLEAN=<mode>`](#PAPERLESS_OCR_CLEAN) {#PAPERLESS_OCR_CLEAN}

--- a/docs/migration-v3.md
+++ b/docs/migration-v3.md
@@ -104,6 +104,58 @@ Multiple options are combined in a single value:
 PAPERLESS_DB_OPTIONS="sslmode=require;sslrootcert=/certs/ca.pem;pool.max_size=10"
 ```

+## OCR and Archive File Generation Settings
+
+The settings that control OCR behaviour and archive file generation have been redesigned. The old settings that coupled these two concerns together are **removed** — there are no migration shims.
+
+### Removed settings
+
+| Removed Setting                             | Replacement                                                           |
+| ------------------------------------------- | --------------------------------------------------------------------- |
+| `PAPERLESS_OCR_MODE=skip`                   | `PAPERLESS_OCR_MODE=auto` (new default)                               |
+| `PAPERLESS_OCR_MODE=skip_noarchive`         | `PAPERLESS_OCR_MODE=auto` + `PAPERLESS_ARCHIVE_FILE_GENERATION=never` |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never`     | `PAPERLESS_ARCHIVE_FILE_GENERATION=always`                            |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text` | `PAPERLESS_ARCHIVE_FILE_GENERATION=auto` (new default)                |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always`    | `PAPERLESS_ARCHIVE_FILE_GENERATION=never`                             |
+
+### What changed and why
+
+Previously, `OCR_MODE` conflated two independent concerns: whether to run OCR and whether to produce an archive. `skip` meant "skip OCR if text exists, but always produce an archive". `skip_noarchive` meant "skip OCR if text exists, and also skip the archive". This made it impossible to, for example, disable OCR entirely while still producing archives.
+
+The new settings are independent:
+
+- [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) controls OCR: `auto` (default), `force`, `redo`, `off`.
+- [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) controls archive production: `auto` (default), `always`, `never`.
+
+### Action Required
+
+Remove any `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` variable from your environment. If you relied on `OCR_MODE=skip` or `OCR_MODE=skip_noarchive`, update accordingly:
+
+```bash
+# v2: skip OCR when text present, always archive
+PAPERLESS_OCR_MODE=skip
+# v3: auto is the default OCR mode; to keep archiving every document, also set
+PAPERLESS_ARCHIVE_FILE_GENERATION=always
+
+# v2: skip OCR when text present, skip archive too
+PAPERLESS_OCR_MODE=skip_noarchive
+# v3: equivalent
+PAPERLESS_OCR_MODE=auto
+PAPERLESS_ARCHIVE_FILE_GENERATION=never
+
+# v2: always skip archive
+PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always
+# v3: equivalent
+PAPERLESS_ARCHIVE_FILE_GENERATION=never
+
+# v2: skip archive only for born-digital docs
+PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text
+# v3: equivalent (auto is the new default)
+PAPERLESS_ARCHIVE_FILE_GENERATION=auto
+```
+
+Paperless emits a startup warning if `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` is still set, or if `PAPERLESS_OCR_MODE` is still set to `skip` or `skip_noarchive`.
+
 ## OpenID Connect Token Endpoint Authentication

 Some existing OpenID Connect setups may require an explicit token endpoint authentication method after upgrading to v3.
--- a/docs/setup.md
+++ b/docs/setup.md
@@ -633,12 +633,11 @@ hardware, but a few settings can improve performance:
  consumption, so you might want to lower these settings (example: 2
  workers and 1 thread to always have some computing power left for
  other tasks).
-- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `skip` and consider
+- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `auto` and consider
  OCRing your documents before feeding them into Paperless. Some
  scanners are able to do this!
-- Set [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE`](configuration.md#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) to `with_text` to skip archive
-  file generation for already OCRed documents, or `always` to skip it
-  for all documents.
+- Set [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) to `never` to skip archive
+  file generation entirely, saving disk space at the cost of in-browser PDF/A viewing.
 - If you want to perform OCR on the device, consider using
  `PAPERLESS_OCR_CLEAN=none`. This will speed up OCR times and use
  less memory at the expense of slightly worse OCR results.
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -50,9 +50,13 @@ from documents.templating.workflows import parse_w_workflow_placeholders
 from documents.utils import copy_basic_file_stats
 from documents.utils import copy_file_with_basic_stats
 from documents.utils import run_subprocess
+from paperless.config import OcrConfig
+from paperless.models import ArchiveFileGenerationChoices
 from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.registry import get_parser_registry
+from paperless.parsers.utils import PDF_TEXT_MIN_LENGTH
+from paperless.parsers.utils import extract_pdf_text

 LOGGING_NAME: Final[str] = "paperless.consumer"

@@ -105,6 +109,42 @@ class ConsumerStatusShortMessage(StrEnum):
    FAILED = "failed"


+def should_produce_archive(
+    parser: "ParserProtocol",
+    mime_type: str,
+    document_path: Path,
+) -> bool:
+    """Return True if a PDF/A archive should be produced for this document.
+
+    IMPORTANT: *parser* must be an instantiated parser, not the class.
+    ``requires_pdf_rendition`` and ``can_produce_archive`` are instance
+    ``@property`` methods — accessing them on the class returns the descriptor
+    (always truthy).
+    """
+    # Must produce a PDF so the frontend can display the original format at all.
+    if parser.requires_pdf_rendition:
+        return True
+
+    # Parser cannot produce an archive (e.g. TextDocumentParser).
+    if not parser.can_produce_archive:
+        return False
+
+    generation = OcrConfig().archive_file_generation
+
+    if generation == ArchiveFileGenerationChoices.ALWAYS:
+        return True
+    if generation == ArchiveFileGenerationChoices.NEVER:
+        return False
+
+    # auto: produce archives for scanned/image documents; skip for born-digital PDFs.
+    if mime_type.startswith("image/"):
+        return True
+    if mime_type == "application/pdf":
+        text = extract_pdf_text(document_path)
+        return text is None or len(text) <= PDF_TEXT_MIN_LENGTH
+    return False
+
+
 class ConsumerPluginMixin:
    if TYPE_CHECKING:
        from logging import Logger
@@ -440,7 +480,16 @@ class ConsumerPlugin(
                    )
                    self.log.debug(f"Parsing {self.filename}...")

-                    document_parser.parse(self.working_copy, mime_type)
+                    produce_archive = should_produce_archive(
+                        document_parser,
+                        mime_type,
+                        self.working_copy,
+                    )
+                    document_parser.parse(
+                        self.working_copy,
+                        mime_type,
+                        produce_archive=produce_archive,
+                    )

                    self.log.debug(f"Generating thumbnail for {self.filename}...")
                    self._send_progress(
--- a/src/documents/tasks.py
+++ b/src/documents/tasks.py
@@ -35,6 +35,7 @@ from documents.consumer import AsnCheckPlugin
 from documents.consumer import ConsumerPlugin
 from documents.consumer import ConsumerPreflightPlugin
 from documents.consumer import WorkflowTriggerPlugin
+from documents.consumer import should_produce_archive
 from documents.data_models import ConsumableDocument
 from documents.data_models import DocumentMetadataOverrides
 from documents.double_sided import CollatePlugin
@@ -321,7 +322,16 @@ def update_document_content_maybe_archive_file(document_id) -> None:
        parser.configure(ParserContext())

        try:
-            parser.parse(document.source_path, mime_type)
+            produce_archive = should_produce_archive(
+                parser,
+                mime_type,
+                document.source_path,
+            )
+            parser.parse(
+                document.source_path,
+                mime_type,
+                produce_archive=produce_archive,
+            )

            thumbnail = parser.get_thumbnail(document.source_path, mime_type)

--- a/src/documents/tests/test_api_app_config.py
+++ b/src/documents/tests/test_api_app_config.py
@@ -46,7 +46,7 @@ class TestApiAppConfig(DirectoriesMixin, APITestCase):
                "pages": None,
                "language": None,
                "mode": None,
-                "skip_archive_file": None,
+                "archive_file_generation": None,
                "image_dpi": None,
                "unpaper_clean": None,
                "deskew": None,
--- a/src/documents/tests/test_barcodes.py
+++ b/src/documents/tests/test_barcodes.py
@@ -1020,7 +1020,7 @@ class TestTagBarcode(DirectoriesMixin, SampleDirMixin, GetReaderPluginMixin, Tes
        CONSUMER_TAG_BARCODE_SPLIT=True,
        CONSUMER_TAG_BARCODE_MAPPING={"TAG:(.*)": "\\g<1>"},
        CELERY_TASK_ALWAYS_EAGER=True,
-        OCR_MODE="skip",
+        OCR_MODE="auto",
    )
    def test_consume_barcode_file_tag_split_and_assignment(self) -> None:
        """
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -1126,6 +1126,7 @@ class TestConsumer(
            mock_mail_parser_parse.assert_called_once_with(
                consumer.working_copy,
                "message/rfc822",
+                produce_archive=True,
            )


@@ -1273,7 +1274,14 @@ class PreConsumeTestCase(DirectoriesMixin, GetConsumerMixin, TestCase):
    def test_no_pre_consume_script(self, m) -> None:
        with self.get_consumer(self.test_file) as c:
            c.run()
-            m.assert_not_called()
+            # Verify no pre-consume script subprocess was invoked
+            # (run_subprocess may still be called for the pdftotext archive check)
+            script_calls = [
+                call
+                for call in m.call_args_list
+                if call.args and call.args[0] and call.args[0][0] not in ("pdftotext",)
+            ]
+            self.assertEqual(script_calls, [])

    @mock.patch("documents.consumer.run_subprocess")
    @override_settings(PRE_CONSUME_SCRIPT="does-not-exist")
@@ -1289,9 +1297,16 @@ class PreConsumeTestCase(DirectoriesMixin, GetConsumerMixin, TestCase):
                with self.get_consumer(self.test_file) as c:
                    c.run()

-                    m.assert_called_once()
+                    self.assertTrue(m.called)

-                    args, _ = m.call_args
+                    # Find the call that invoked the pre-consume script
+                    # (run_subprocess may also be called for the pdftotext archive check)
+                    script_call = next(
+                        call
+                        for call in m.call_args_list
+                        if call.args and call.args[0] and call.args[0][0] == script.name
+                    )
+                    args, _ = script_call

                    command = args[0]
                    environment = args[1]
--- a/src/documents/tests/test_consumer_archive.py
+++ b/src/documents/tests/test_consumer_archive.py
@@ -0,0 +1,157 @@
+"""Tests for should_produce_archive()."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import MagicMock
+from unittest.mock import patch
+
+import pytest
+
+from documents.consumer import should_produce_archive
+
+
+def _parser_instance(
+    *,
+    can_produce: bool = True,
+    requires_rendition: bool = False,
+) -> MagicMock:
+    """Return a mock parser instance with the given capability flags."""
+    instance = MagicMock()
+    instance.can_produce_archive = can_produce
+    instance.requires_pdf_rendition = requires_rendition
+    return instance
+
+
+@pytest.fixture()
+def null_app_config(mocker) -> MagicMock:
+    """Mock ApplicationConfiguration with all fields None → falls back to Django settings."""
+    return mocker.MagicMock(
+        output_type=None,
+        pages=None,
+        language=None,
+        mode=None,
+        archive_file_generation=None,
+        image_dpi=None,
+        unpaper_clean=None,
+        deskew=None,
+        rotate_pages=None,
+        rotate_pages_threshold=None,
+        max_image_pixels=None,
+        color_conversion_strategy=None,
+        user_args=None,
+    )
+
+
+@pytest.fixture(autouse=True)
+def patch_app_config(mocker, null_app_config):
+    """Patch BaseConfig._get_config_instance for all tests in this module."""
+    mocker.patch(
+        "paperless.config.BaseConfig._get_config_instance",
+        return_value=null_app_config,
+    )
+
+
+class TestShouldProduceArchive:
+    @pytest.mark.parametrize(
+        ("generation", "can_produce", "requires_rendition", "mime", "expected"),
+        [
+            pytest.param(
+                "never",
+                True,
+                False,
+                "application/pdf",
+                False,
+                id="never-returns-false",
+            ),
+            pytest.param(
+                "always",
+                True,
+                False,
+                "application/pdf",
+                True,
+                id="always-returns-true",
+            ),
+            pytest.param(
+                "never",
+                True,
+                True,
+                "application/pdf",
+                True,
+                id="requires-rendition-overrides-never",
+            ),
+            pytest.param(
+                "always",
+                False,
+                False,
+                "text/plain",
+                False,
+                id="cannot-produce-overrides-always",
+            ),
+            pytest.param(
+                "always",
+                False,
+                True,
+                "application/pdf",
+                True,
+                id="requires-rendition-wins-even-if-cannot-produce",
+            ),
+            pytest.param(
+                "auto",
+                True,
+                False,
+                "image/tiff",
+                True,
+                id="auto-image-returns-true",
+            ),
+            pytest.param(
+                "auto",
+                True,
+                False,
+                "message/rfc822",
+                False,
+                id="auto-non-pdf-non-image-returns-false",
+            ),
+        ],
+    )
+    def test_generation_setting(
+        self,
+        settings,
+        generation: str,
+        can_produce: bool,  # noqa: FBT001
+        requires_rendition: bool,  # noqa: FBT001
+        mime: str,
+        expected: bool,  # noqa: FBT001
+    ) -> None:
+        settings.ARCHIVE_FILE_GENERATION = generation
+        parser = _parser_instance(
+            can_produce=can_produce,
+            requires_rendition=requires_rendition,
+        )
+        assert should_produce_archive(parser, mime, Path("/tmp/doc")) is expected
+
+    @pytest.mark.parametrize(
+        ("extracted_text", "expected"),
+        [
+            pytest.param(
+                "This is a born-digital PDF with lots of text content. " * 10,
+                False,
+                id="born-digital-long-text-skips-archive",
+            ),
+            pytest.param(None, True, id="no-text-scanned-produces-archive"),
+            pytest.param("tiny", True, id="short-text-treated-as-scanned"),
+        ],
+    )
+    def test_auto_pdf_archive_decision(
+        self,
+        settings,
+        extracted_text: str | None,
+        expected: bool,  # noqa: FBT001
+    ) -> None:
+        settings.ARCHIVE_FILE_GENERATION = "auto"
+        parser = _parser_instance(can_produce=True, requires_rendition=False)
+        with patch("documents.consumer.extract_pdf_text", return_value=extracted_text):
+            assert (
+                should_produce_archive(parser, "application/pdf", Path("/tmp/doc.pdf"))
+                is expected
+            )
--- a/src/documents/tests/test_management.py
+++ b/src/documents/tests/test_management.py
@@ -27,7 +27,10 @@ sample_file: Path = Path(__file__).parent / "samples" / "simple.pdf"


 @pytest.mark.management
-@override_settings(FILENAME_FORMAT="{correspondent}/{title}")
+@override_settings(
+    FILENAME_FORMAT="{correspondent}/{title}",
+    ARCHIVE_FILE_GENERATION="always",
+)
 class TestArchiver(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
    def make_models(self):
        return Document.objects.create(
--- a/src/documents/tests/test_tasks.py
+++ b/src/documents/tests/test_tasks.py
@@ -232,6 +232,7 @@ class TestEmptyTrashTask(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
        self.assertEqual(Document.global_objects.count(), 0)


+@override_settings(ARCHIVE_FILE_GENERATION="always")
 class TestUpdateContent(DirectoriesMixin, TestCase):
    def test_update_content_maybe_archive_file(self) -> None:
        """
--- a/src/paperless/checks.py
+++ b/src/paperless/checks.py
@@ -132,23 +132,14 @@ def settings_values_check(app_configs, **kwargs):
                Error(f'OCR output type "{settings.OCR_OUTPUT_TYPE}" is not valid'),
            )

-        if settings.OCR_MODE not in {"force", "skip", "redo", "skip_noarchive"}:
+        if settings.OCR_MODE not in {"auto", "force", "redo", "off"}:
            msgs.append(Error(f'OCR output mode "{settings.OCR_MODE}" is not valid'))

-        if settings.OCR_MODE == "skip_noarchive":
-            msgs.append(
-                Warning(
-                    'OCR output mode "skip_noarchive" is deprecated and will be '
-                    "removed in a future version. Please use "
-                    "PAPERLESS_OCR_SKIP_ARCHIVE_FILE instead.",
-                ),
-            )
-
-        if settings.OCR_SKIP_ARCHIVE_FILE not in {"never", "with_text", "always"}:
+        if settings.ARCHIVE_FILE_GENERATION not in {"auto", "always", "never"}:
            msgs.append(
                Error(
-                    "OCR_SKIP_ARCHIVE_FILE setting "
-                    f'"{settings.OCR_SKIP_ARCHIVE_FILE}" is not valid',
+                    "PAPERLESS_ARCHIVE_FILE_GENERATION setting "
+                    f'"{settings.ARCHIVE_FILE_GENERATION}" is not valid',
                ),
            )

@@ -302,6 +293,41 @@ def check_deprecated_db_settings(
    return warnings


+@register()
+def check_deprecated_v2_ocr_env_vars(
+    app_configs: object,
+    **kwargs: object,
+) -> list[Warning]:
+    """Warn when deprecated v2 OCR environment variables are set.
+
+    Users upgrading from v2 may still have these in their environment or
+    config files, where they are now silently ignored.
+    """
+    warnings: list[Warning] = []
+
+    if os.environ.get("PAPERLESS_OCR_SKIP_ARCHIVE_FILE"):
+        warnings.append(
+            Warning(
+                "PAPERLESS_OCR_SKIP_ARCHIVE_FILE is set but has no effect. "
+                "Use PAPERLESS_ARCHIVE_FILE_GENERATION=never/always/auto instead.",
+                id="paperless.W002",
+            ),
+        )
+
+    ocr_mode = os.environ.get("PAPERLESS_OCR_MODE", "")
+    if ocr_mode in {"skip", "skip_noarchive"}:
+        warnings.append(
+            Warning(
+                f"PAPERLESS_OCR_MODE={ocr_mode!r} is not a valid value. "
+                "Use PAPERLESS_OCR_MODE=auto (and PAPERLESS_ARCHIVE_FILE_GENERATION=never "
+                "if you used skip_noarchive) instead.",
+                id="paperless.W003",
+            ),
+        )
+
+    return warnings
+
+
 @register()
 def check_remote_parser_configured(app_configs, **kwargs) -> list[Error]:
    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
--- a/src/paperless/config.py
+++ b/src/paperless/config.py
@@ -46,7 +46,7 @@ class OcrConfig(OutputTypeConfig):
    pages: int | None = dataclasses.field(init=False)
    language: str = dataclasses.field(init=False)
    mode: str = dataclasses.field(init=False)
-    skip_archive_file: str = dataclasses.field(init=False)
+    archive_file_generation: str = dataclasses.field(init=False)
    image_dpi: int | None = dataclasses.field(init=False)
    clean: str = dataclasses.field(init=False)
    deskew: bool = dataclasses.field(init=False)
@@ -64,8 +64,8 @@ class OcrConfig(OutputTypeConfig):
        self.pages = app_config.pages or settings.OCR_PAGES
        self.language = app_config.language or settings.OCR_LANGUAGE
        self.mode = app_config.mode or settings.OCR_MODE
-        self.skip_archive_file = (
-            app_config.skip_archive_file or settings.OCR_SKIP_ARCHIVE_FILE
+        self.archive_file_generation = (
+            app_config.archive_file_generation or settings.ARCHIVE_FILE_GENERATION
        )
        self.image_dpi = app_config.image_dpi or settings.OCR_IMAGE_DPI
        self.clean = app_config.unpaper_clean or settings.OCR_CLEAN
--- a/src/paperless/migrations/0008_replace_skip_archive_file.py
+++ b/src/paperless/migrations/0008_replace_skip_archive_file.py
@@ -0,0 +1,44 @@
+# Generated by Django 5.2.12 on 2026-03-26 20:31
+
+from django.db import migrations
+from django.db import models
+
+
+class Migration(migrations.Migration):
+    dependencies = [
+        ("paperless", "0007_optimize_integer_field_sizes"),
+    ]
+
+    operations = [
+        migrations.RemoveField(
+            model_name="applicationconfiguration",
+            name="skip_archive_file",
+        ),
+        migrations.AddField(
+            model_name="applicationconfiguration",
+            name="archive_file_generation",
+            field=models.CharField(
+                blank=True,
+                choices=[("auto", "auto"), ("always", "always"), ("never", "never")],
+                max_length=8,
+                null=True,
+                verbose_name="Controls archive file generation",
+            ),
+        ),
+        migrations.AlterField(
+            model_name="applicationconfiguration",
+            name="mode",
+            field=models.CharField(
+                blank=True,
+                choices=[
+                    ("auto", "auto"),
+                    ("force", "force"),
+                    ("redo", "redo"),
+                    ("off", "off"),
+                ],
+                max_length=16,
+                null=True,
+                verbose_name="Sets the OCR mode",
+            ),
+        ),
+    ]
--- a/src/paperless/models.py
+++ b/src/paperless/models.py
@@ -36,20 +36,20 @@ class ModeChoices(models.TextChoices):
    and our own custom setting
    """

-    SKIP = ("skip", _("skip"))
-    REDO = ("redo", _("redo"))
+    AUTO = ("auto", _("auto"))
    FORCE = ("force", _("force"))
-    SKIP_NO_ARCHIVE = ("skip_noarchive", _("skip_noarchive"))
+    REDO = ("redo", _("redo"))
+    OFF = ("off", _("off"))


-class ArchiveFileChoices(models.TextChoices):
+class ArchiveFileGenerationChoices(models.TextChoices):
    """
    Settings to control creation of an archive PDF file
    """

-    NEVER = ("never", _("never"))
-    WITH_TEXT = ("with_text", _("with_text"))
+    AUTO = ("auto", _("auto"))
    ALWAYS = ("always", _("always"))
+    NEVER = ("never", _("never"))


 class CleanChoices(models.TextChoices):
@@ -126,12 +126,12 @@ class ApplicationConfiguration(AbstractSingletonModel):
        choices=ModeChoices.choices,
    )

-    skip_archive_file = models.CharField(
-        verbose_name=_("Controls the generation of an archive file"),
+    archive_file_generation = models.CharField(
+        verbose_name=_("Controls archive file generation"),
        null=True,
        blank=True,
-        max_length=16,
-        choices=ArchiveFileChoices.choices,
+        max_length=8,
+        choices=ArchiveFileGenerationChoices.choices,
    )

    image_dpi = models.PositiveSmallIntegerField(
--- a/src/paperless/parsers/tesseract.py
+++ b/src/paperless/parsers/tesseract.py
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import importlib.resources
 import logging
 import os
 import re
@@ -18,9 +19,10 @@ from documents.parsers import make_thumbnail_from_pdf
 from documents.utils import maybe_override_pixel_limit
 from documents.utils import run_subprocess
 from paperless.config import OcrConfig
-from paperless.models import ArchiveFileChoices
 from paperless.models import CleanChoices
 from paperless.models import ModeChoices
+from paperless.parsers.utils import PDF_TEXT_MIN_LENGTH
+from paperless.parsers.utils import extract_pdf_text
 from paperless.parsers.utils import read_file_handle_unicode_errors
 from paperless.version import __full_version_str__

@@ -250,36 +252,7 @@ class RasterisedDocumentParser:
        if not Path(pdf_file).is_file():
            return None

-        try:
-            text = None
-            with tempfile.NamedTemporaryFile(
-                mode="w+",
-                dir=self.tempdir,
-            ) as tmp:
-                run_subprocess(
-                    [
-                        "pdftotext",
-                        "-q",
-                        "-layout",
-                        "-enc",
-                        "UTF-8",
-                        str(pdf_file),
-                        tmp.name,
-                    ],
-                    logger=self.log,
-                )
-                text = read_file_handle_unicode_errors(Path(tmp.name))
-
-            return post_process_text(text)
-
-        except Exception:
-            #  If pdftotext fails, fall back to OCR.
-            self.log.warning(
-                "Error while getting text from PDF document with pdftotext",
-                exc_info=True,
-            )
-            # probably not a PDF file.
-            return None
+        return post_process_text(extract_pdf_text(Path(pdf_file), log=self.log))

    def construct_ocrmypdf_parameters(
        self,
@@ -289,6 +262,7 @@ class RasterisedDocumentParser:
        sidecar_file: Path,
        *,
        safe_fallback: bool = False,
+        skip_text: bool = False,
    ) -> dict[str, Any]:
        ocrmypdf_args: dict[str, Any] = {
            "input_file_or_options": input_file,
@@ -307,15 +281,14 @@ class RasterisedDocumentParser:
                self.settings.color_conversion_strategy
            )

-        if self.settings.mode == ModeChoices.FORCE or safe_fallback:
+        if safe_fallback or self.settings.mode == ModeChoices.FORCE:
            ocrmypdf_args["force_ocr"] = True
-        elif self.settings.mode in {
-            ModeChoices.SKIP,
-            ModeChoices.SKIP_NO_ARCHIVE,
-        }:
-            ocrmypdf_args["skip_text"] = True
        elif self.settings.mode == ModeChoices.REDO:
            ocrmypdf_args["redo_ocr"] = True
+        elif skip_text or self.settings.mode == ModeChoices.OFF:
+            ocrmypdf_args["skip_text"] = True
+        elif self.settings.mode == ModeChoices.AUTO:
+            pass  # no extra flag: normal OCR (text not found case)
        else:  # pragma: no cover
            raise ParseError(f"Invalid ocr mode: {self.settings.mode}")

@@ -400,6 +373,62 @@ class RasterisedDocumentParser:

        return ocrmypdf_args

+    def _convert_image_to_pdfa(self, document_path: Path, mime_type: str) -> Path:
+        """Convert an image to a PDF/A-2b file without invoking the OCR engine.
+
+        Uses img2pdf for the initial image->PDF wrapping, then pikepdf to stamp
+        PDF/A-2b conformance metadata.
+
+        No Tesseract and no Ghostscript are invoked.
+        """
+        import img2pdf
+        import pikepdf
+
+        plain_pdf_path = Path(self.tempdir) / "image_plain.pdf"
+        try:
+            layout_fun = None
+            if self.settings.image_dpi is not None:
+                layout_fun = img2pdf.get_fixed_dpi_layout_fun(
+                    (self.settings.image_dpi, self.settings.image_dpi),
+                )
+            plain_pdf_path.write_bytes(
+                img2pdf.convert(str(document_path), layout_fun=layout_fun),
+            )
+        except Exception as e:
+            raise ParseError(
+                f"img2pdf conversion failed for {document_path}: {e!s}",
+            ) from e
+
+        icc_data = (
+            importlib.resources.files("ocrmypdf.data").joinpath("sRGB.icc").read_bytes()
+        )
+
+        pdfa_path = Path(self.tempdir) / "archive.pdf"
+        try:
+            with pikepdf.open(plain_pdf_path) as pdf:
+                cs = pdf.make_stream(icc_data)
+                cs["/N"] = 3
+                output_intent = pikepdf.Dictionary(
+                    Type=pikepdf.Name("/OutputIntent"),
+                    S=pikepdf.Name("/GTS_PDFA1"),
+                    OutputConditionIdentifier=pikepdf.String("sRGB"),
+                    DestOutputProfile=cs,
+                )
+                pdf.Root["/OutputIntents"] = pdf.make_indirect(
+                    pikepdf.Array([output_intent]),
+                )
+                meta = pdf.open_metadata(set_pikepdf_as_editor=False)
+                meta["pdfaid:part"] = "2"
+                meta["pdfaid:conformance"] = "B"
+                pdf.save(pdfa_path)
+        except Exception as e:
+            self.log.warning(
+                f"PDF/A metadata stamping failed ({e!s}); falling back to plain PDF.",
+            )
+            pdfa_path.write_bytes(plain_pdf_path.read_bytes())
+
+        return pdfa_path
+
    def parse(
        self,
        document_path: Path,
@@ -409,57 +438,106 @@ class RasterisedDocumentParser:
    ) -> None:
        # This forces tesseract to use one core per page.
        os.environ["OMP_THREAD_LIMIT"] = "1"
-        VALID_TEXT_LENGTH = 50

        if mime_type == "application/pdf":
            text_original = self.extract_text(None, document_path)
            original_has_text = (
-                text_original is not None and len(text_original) > VALID_TEXT_LENGTH
+                text_original is not None and len(text_original) > PDF_TEXT_MIN_LENGTH
            )
        else:
            text_original = None
            original_has_text = False

-        # If the original has text, and the user doesn't want an archive,
-        # we're done here
-        skip_archive_for_text = (
-            self.settings.mode == ModeChoices.SKIP_NO_ARCHIVE
-            or self.settings.skip_archive_file
-            in {
-                ArchiveFileChoices.WITH_TEXT,
-                ArchiveFileChoices.ALWAYS,
-            }
-        )
-        if skip_archive_for_text and original_has_text:
-            self.log.debug("Document has text, skipping OCRmyPDF entirely.")
+        # --- OCR_MODE=off: never invoke OCR engine ---
+        if self.settings.mode == ModeChoices.OFF:
+            if not produce_archive:
+                self.text = text_original or ""
+                return
+            if self.is_image(mime_type):
+                try:
+                    self.archive_path = self._convert_image_to_pdfa(
+                        document_path,
+                        mime_type,
+                    )
+                    self.text = ""
+                except Exception as e:
+                    raise ParseError(
+                        f"Image to PDF/A conversion failed: {e!s}",
+                    ) from e
+                return
+            # PDFs in off mode: PDF/A conversion only via skip_text
+            import ocrmypdf
+            from ocrmypdf import SubprocessOutputError
+
+            archive_path = Path(self.tempdir) / "archive.pdf"
+            sidecar_file = Path(self.tempdir) / "sidecar.txt"
+            args = self.construct_ocrmypdf_parameters(
+                document_path,
+                mime_type,
+                archive_path,
+                sidecar_file,
+                skip_text=True,
+            )
+            try:
+                self.log.debug(
+                    f"Calling OCRmyPDF (off mode, PDF/A conversion only): {args}",
+                )
+                ocrmypdf.ocr(**args)
+                self.archive_path = archive_path
+                self.text = self.extract_text(None, archive_path) or text_original or ""
+            except SubprocessOutputError as e:
+                if "Ghostscript PDF/A rendering" in str(e):
+                    self.log.warning(
+                        "Ghostscript PDF/A rendering failed, consider setting "
+                        "PAPERLESS_OCR_USER_ARGS: "
+                        "'{\"continue_on_soft_render_error\": true}'",
+                    )
+                raise ParseError(
+                    f"SubprocessOutputError: {e!s}. See logs for more information.",
+                ) from e
+            except Exception as e:
+                raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
+            return
+
+        # --- OCR_MODE=auto: skip ocrmypdf entirely if text exists and no archive needed ---
+        if (
+            self.settings.mode == ModeChoices.AUTO
+            and original_has_text
+            and not produce_archive
+        ):
+            self.log.debug(
+                "Document has text and no archive requested; skipping OCRmyPDF entirely.",
+            )
            self.text = text_original
            return

-        # Either no text was in the original or there should be an archive
-        # file created, so OCR the file and create an archive with any
-        # text located via OCR
-
+        # --- All other paths: run ocrmypdf ---
        import ocrmypdf
        from ocrmypdf import EncryptedPdfError
        from ocrmypdf import InputFileError
        from ocrmypdf import SubprocessOutputError
        from ocrmypdf.exceptions import DigitalSignatureError
+        from ocrmypdf.exceptions import PriorOcrFoundError

        archive_path = Path(self.tempdir) / "archive.pdf"
        sidecar_file = Path(self.tempdir) / "sidecar.txt"

+        # auto mode with existing text: PDF/A conversion only (no OCR).
+        skip_text = self.settings.mode == ModeChoices.AUTO and original_has_text
+
        args = self.construct_ocrmypdf_parameters(
            document_path,
            mime_type,
            archive_path,
            sidecar_file,
+            skip_text=skip_text,
        )

        try:
            self.log.debug(f"Calling OCRmyPDF with args: {args}")
            ocrmypdf.ocr(**args)

-            if self.settings.skip_archive_file != ArchiveFileChoices.ALWAYS:
+            if produce_archive:
                self.archive_path = archive_path

            self.text = self.extract_text(sidecar_file, archive_path)
@@ -479,11 +557,10 @@ class RasterisedDocumentParser:
                    "Ghostscript PDF/A rendering failed, consider setting "
                    "PAPERLESS_OCR_USER_ARGS: '{\"continue_on_soft_render_error\": true}'",
                )
-
            raise ParseError(
                f"SubprocessOutputError: {e!s}. See logs for more information.",
            ) from e
-        except (NoTextFoundException, InputFileError) as e:
+        except (NoTextFoundException, InputFileError, PriorOcrFoundError) as e:
            self.log.warning(
                f"Encountered an error while running OCR: {e!s}. "
                f"Attempting force OCR to get the text.",
@@ -492,8 +569,6 @@ class RasterisedDocumentParser:
            archive_path_fallback = Path(self.tempdir) / "archive-fallback.pdf"
            sidecar_file_fallback = Path(self.tempdir) / "sidecar-fallback.txt"

-            # Attempt to run OCR with safe settings.
-
            args = self.construct_ocrmypdf_parameters(
                document_path,
                mime_type,
@@ -505,25 +580,18 @@ class RasterisedDocumentParser:
            try:
                self.log.debug(f"Fallback: Calling OCRmyPDF with args: {args}")
                ocrmypdf.ocr(**args)
-
-                # Don't return the archived file here, since this file
-                # is bigger and blurry due to --force-ocr.
-
                self.text = self.extract_text(
                    sidecar_file_fallback,
                    archive_path_fallback,
                )
-
+                if produce_archive:
+                    self.archive_path = archive_path_fallback
            except Exception as e:
-                # If this fails, we have a serious issue at hand.
                raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

        except Exception as e:
-            # Anything else is probably serious.
            raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

-        # As a last resort, if we still don't have any text for any reason,
-        # try to extract the text from the original document.
        if not self.text:
            if original_has_text:
                self.text = text_original
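
Review note: the flag precedence that `construct_ocrmypdf_parameters` ends up with after this change can be sketched as a small standalone function. This is only an illustration. `Mode` and `select_ocr_flag` are hypothetical names that do not exist in the codebase, and `Mode` merely stands in for `ModeChoices`, using the four values listed in the settings diff.

```python
from __future__ import annotations

from enum import Enum


class Mode(str, Enum):
    """Hypothetical stand-in for ModeChoices (not the project's class)."""

    AUTO = "auto"
    FORCE = "force"
    REDO = "redo"
    OFF = "off"


def select_ocr_flag(
    mode: Mode,
    *,
    safe_fallback: bool = False,
    skip_text: bool = False,
) -> dict[str, bool]:
    """Return the single OCR flag chosen for the given inputs."""
    # The fallback retry and FORCE mode both win over everything else.
    if safe_fallback or mode == Mode.FORCE:
        return {"force_ocr": True}
    # REDO is checked before skip_text, so REDO ignores the skip_text hint.
    if mode == Mode.REDO:
        return {"redo_ocr": True}
    # OFF mode, or a caller that already detected text (AUTO with text).
    if skip_text or mode == Mode.OFF:
        return {"skip_text": True}
    # AUTO without detected text: no extra flag, normal OCR.
    if mode == Mode.AUTO:
        return {}
    raise ValueError(f"Invalid ocr mode: {mode}")
```

One consequence worth noting: because the REDO branch comes before the `skip_text` branch, passing `skip_text=True` has no effect in REDO mode.
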
--- a/src/paperless/parsers/utils.py
+++ b/src/paperless/parsers/utils.py
@@ -10,15 +10,65 @@ from __future__ import annotations

 import logging
 import re
+import tempfile
+from pathlib import Path
 from typing import TYPE_CHECKING

 if TYPE_CHECKING:
-    from pathlib import Path
-
    from paperless.parsers import MetadataEntry

 logger = logging.getLogger("paperless.parsers.utils")

+# Minimum character count for a PDF to be considered "born-digital" (has real text).
+# Used by both the consumer (archive decision) and the tesseract parser (skip-OCR decision).
+PDF_TEXT_MIN_LENGTH = 50
+
+
+def extract_pdf_text(
+    path: Path,
+    log: logging.Logger | None = None,
+) -> str | None:
+    """Run pdftotext on *path* and return the extracted text, or None on failure.
+
+    Parameters
+    ----------
+    path:
+        Absolute path to the PDF file.
+    log:
+        Logger for warnings.  Falls back to the module-level logger when omitted.
+
+    Returns
+    -------
+    str | None
+        Extracted text, or ``None`` on failure, non-PDF input, or empty output.
+    """
+    from documents.utils import run_subprocess
+
+    _log = log or logger
+    try:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            out_path = Path(tmpdir) / "text.txt"
+            run_subprocess(
+                [
+                    "pdftotext",
+                    "-q",
+                    "-layout",
+                    "-enc",
+                    "UTF-8",
+                    str(path),
+                    str(out_path),
+                ],
+                logger=_log,
+            )
+            text = read_file_handle_unicode_errors(out_path, log=_log)
+            return text or None
+    except Exception:
+        _log.warning(
+            "Error while getting text from PDF document with pdftotext",
+            exc_info=True,
+        )
+        return None
+

 def read_file_handle_unicode_errors(
    filepath: Path,
--- a/src/paperless/settings/init.py
+++ b/src/paperless/settings/init.py
@@ -21,6 +21,7 @@ from paperless.settings.custom import parse_hosting_settings
 from paperless.settings.custom import parse_ignore_dates
 from paperless.settings.custom import parse_redis_url
 from paperless.settings.parsers import get_bool_from_env
+from paperless.settings.parsers import get_choice_from_env
 from paperless.settings.parsers import get_float_from_env
 from paperless.settings.parsers import get_int_from_env
 from paperless.settings.parsers import get_list_from_env
@@ -874,10 +875,17 @@ OCR_LANGUAGE = os.getenv("PAPERLESS_OCR_LANGUAGE", "eng")
 # OCRmyPDF --output-type options are available.
 OCR_OUTPUT_TYPE = os.getenv("PAPERLESS_OCR_OUTPUT_TYPE", "pdfa")

-# skip. redo, force
-OCR_MODE = os.getenv("PAPERLESS_OCR_MODE", "skip")
+OCR_MODE = get_choice_from_env(
+    "PAPERLESS_OCR_MODE",
+    {"auto", "force", "redo", "off"},
+    default="auto",
+)

-OCR_SKIP_ARCHIVE_FILE = os.getenv("PAPERLESS_OCR_SKIP_ARCHIVE_FILE", "never")
+ARCHIVE_FILE_GENERATION = get_choice_from_env(
+    "PAPERLESS_ARCHIVE_FILE_GENERATION",
+    {"auto", "always", "never"},
+    default="auto",
+)

 OCR_IMAGE_DPI = get_int_from_env("PAPERLESS_OCR_IMAGE_DPI")

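Review note: the diff does not show how `get_choice_from_env` is implemented. Assuming it validates the variable against the allowed set and rejects anything else, its effect on the removed values (`skip`, `skip_noarchive`) can be sketched as below. `choice_from_env_sketch` and its `environ` parameter are hypothetical and exist only to make the example testable; the real helper may differ, for example in its error type or in how it handles case.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def choice_from_env_sketch(
    key: str,
    choices: set[str],
    *,
    default: str,
    environ: Mapping[str, str] | None = None,
) -> str:
    """Read *key* from the environment and require it to be one of *choices*."""
    # Allow a plain dict to be injected so the sketch can run without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    value = env.get(key, default)
    if value not in choices:
        # Assumption: an unknown value is a hard error rather than a fallback.
        raise ValueError(
            f"{key} has invalid value {value!r}; expected one of {sorted(choices)}",
        )
    return value
```

Under that assumption, an installation that still sets `PAPERLESS_OCR_MODE=skip` would fail at startup instead of silently running with a different mode.
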
--- a/src/paperless/tests/parsers/conftest.py
+++ b/src/paperless/tests/parsers/conftest.py
@@ -708,7 +708,7 @@ def null_app_config(mocker: MockerFixture) -> MagicMock:
        pages=None,
        language=None,
        mode=None,
-        skip_archive_file=None,
+        archive_file_generation=None,
        image_dpi=None,
        unpaper_clean=None,
        deskew=None,
--- a/src/paperless/tests/parsers/test_parse_modes.py
+++ b/src/paperless/tests/parsers/test_parse_modes.py
@@ -0,0 +1,436 @@
+"""
+Focused tests for RasterisedDocumentParser.parse() mode behaviour.
+
+These tests mock ``ocrmypdf.ocr`` so they run without a real Tesseract/OCRmyPDF
+installation and execute quickly.  The intent is to verify the *control flow*
+introduced by the ``produce_archive`` flag and the ``OCR_MODE=auto/off`` logic,
+not to test OCRmyPDF itself.
+
+Fixtures are pulled from conftest.py in this package.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import TYPE_CHECKING
+
+import pytest
+
+if TYPE_CHECKING:
+    from pytest_mock import MockerFixture
+
+    from paperless.parsers.tesseract import RasterisedDocumentParser
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+_LONG_TEXT = "This is a test document with enough text. " * 5  # >50 chars
+_SHORT_TEXT = "Hi."  # <50 chars
+
+
+def _make_extract_text(text: str | None):
+    """Return a side_effect function for ``extract_text`` that returns *text*."""
+
+    def _extract(sidecar_file, pdf_file):
+        return text
+
+    return _extract
+
+
+# ---------------------------------------------------------------------------
+# AUTO mode — PDF with sufficient text layer
+# ---------------------------------------------------------------------------
+
+
+class TestAutoModeWithText:
+    """AUTO mode, original PDF has detectable text (>50 chars)."""
+
+    def test_auto_text_no_archive_skips_ocrmypdf(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_digital_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - AUTO mode, produce_archive=False
+            - PDF with text > PDF_TEXT_MIN_LENGTH
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf.ocr is NOT called (early return path)
+            - archive_path remains None
+            - text is set from the original
+        """
+        # Patch extract_text to return long text (simulating detectable text layer)
+        mocker.patch.object(
+            tesseract_parser,
+            "extract_text",
+            return_value=_LONG_TEXT,
+        )
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.parse(
+            simple_digital_pdf_file,
+            "application/pdf",
+            produce_archive=False,
+        )
+
+        mock_ocr.assert_not_called()
+        assert tesseract_parser.archive_path is None
+        assert tesseract_parser.get_text() == _LONG_TEXT
+
+    def test_auto_text_with_archive_calls_ocrmypdf_skip_text(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_digital_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - AUTO mode, produce_archive=True
+            - PDF with text > PDF_TEXT_MIN_LENGTH
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf.ocr IS called with skip_text=True
+            - archive_path is set
+        """
+        mocker.patch.object(
+            tesseract_parser,
+            "extract_text",
+            return_value=_LONG_TEXT,
+        )
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.parse(
+            simple_digital_pdf_file,
+            "application/pdf",
+            produce_archive=True,
+        )
+
+        mock_ocr.assert_called_once()
+        call_kwargs = mock_ocr.call_args.kwargs
+        assert call_kwargs.get("skip_text") is True
+        assert "force_ocr" not in call_kwargs
+        assert "redo_ocr" not in call_kwargs
+        assert tesseract_parser.archive_path is not None
+
+
+# ---------------------------------------------------------------------------
+# AUTO mode — PDF without text layer (or too short)
+# ---------------------------------------------------------------------------
+
+
+class TestAutoModeNoText:
+    """AUTO mode, original PDF has no detectable text (<= 50 chars)."""
+
+    def test_auto_no_text_with_archive_calls_ocrmypdf_no_extra_flag(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        multi_page_images_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - AUTO mode, produce_archive=True
+            - PDF with no text (or text <= PDF_TEXT_MIN_LENGTH)
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf.ocr IS called WITHOUT skip_text/force_ocr/redo_ocr
+            - archive_path is set (since produce_archive=True)
+        """
+        # Return "no text" for the original; return real text for archive
+        extract_call_count = 0
+
+        def _extract_side(sidecar_file, pdf_file):
+            nonlocal extract_call_count
+            extract_call_count += 1
+            if extract_call_count == 1:
+                return None  # original has no text
+            return _LONG_TEXT  # text from archive after OCR
+
+        mocker.patch.object(tesseract_parser, "extract_text", side_effect=_extract_side)
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.parse(
+            multi_page_images_pdf_file,
+            "application/pdf",
+            produce_archive=True,
+        )
+
+        mock_ocr.assert_called_once()
+        call_kwargs = mock_ocr.call_args.kwargs
+        assert "skip_text" not in call_kwargs
+        assert "force_ocr" not in call_kwargs
+        assert "redo_ocr" not in call_kwargs
+        assert tesseract_parser.archive_path is not None
+
+    def test_auto_no_text_no_archive_calls_ocrmypdf(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        multi_page_images_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - AUTO mode, produce_archive=False
+            - PDF with no text
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf.ocr IS called (no early return since no text detected)
+            - archive_path is NOT set (produce_archive=False)
+        """
+        extract_call_count = 0
+
+        def _extract_side(sidecar_file, pdf_file):
+            nonlocal extract_call_count
+            extract_call_count += 1
+            if extract_call_count == 1:
+                return None
+            return _LONG_TEXT
+
+        mocker.patch.object(tesseract_parser, "extract_text", side_effect=_extract_side)
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.parse(
+            multi_page_images_pdf_file,
+            "application/pdf",
+            produce_archive=False,
+        )
+
+        mock_ocr.assert_called_once()
+        assert tesseract_parser.archive_path is None
+
+
+# ---------------------------------------------------------------------------
+# OFF mode — PDF
+# ---------------------------------------------------------------------------
+
+
+class TestOffModePdf:
+    """OCR_MODE=off, document is a PDF."""
+
+    def test_off_no_archive_returns_pdftotext(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_digital_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - OFF mode, produce_archive=False
+            - PDF with text
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf.ocr is NOT called
+            - archive_path is None
+            - text comes from pdftotext (extract_text)
+        """
+        mocker.patch.object(
+            tesseract_parser,
+            "extract_text",
+            return_value=_LONG_TEXT,
+        )
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(
+            simple_digital_pdf_file,
+            "application/pdf",
+            produce_archive=False,
+        )
+
+        mock_ocr.assert_not_called()
+        assert tesseract_parser.archive_path is None
+        assert tesseract_parser.get_text() == _LONG_TEXT
+
+    def test_off_with_archive_calls_ocrmypdf_skip_text(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_digital_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - OFF mode, produce_archive=True
+            - PDF document
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf.ocr IS called with skip_text=True (PDF/A conversion only)
+            - archive_path is set
+        """
+        mocker.patch.object(
+            tesseract_parser,
+            "extract_text",
+            return_value=_LONG_TEXT,
+        )
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(
+            simple_digital_pdf_file,
+            "application/pdf",
+            produce_archive=True,
+        )
+
+        mock_ocr.assert_called_once()
+        call_kwargs = mock_ocr.call_args.kwargs
+        assert call_kwargs.get("skip_text") is True
+        assert "force_ocr" not in call_kwargs
+        assert "redo_ocr" not in call_kwargs
+        assert tesseract_parser.archive_path is not None
+
+
+# ---------------------------------------------------------------------------
+# OFF mode — image
+# ---------------------------------------------------------------------------
+
+
+class TestOffModeImage:
+    """OCR_MODE=off, document is an image (PNG)."""
+
+    def test_off_image_no_archive_no_ocrmypdf(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_png_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - OFF mode, produce_archive=False
+            - Image document (PNG)
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf.ocr is NOT called
+            - archive_path is None
+            - text is empty string (images have no text layer)
+        """
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(simple_png_file, "image/png", produce_archive=False)
+
+        mock_ocr.assert_not_called()
+        assert tesseract_parser.archive_path is None
+        assert tesseract_parser.get_text() == ""
+
+    def test_off_image_with_archive_uses_img2pdf_path(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_png_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - OFF mode, produce_archive=True
+            - Image document (PNG)
+        WHEN:
+            - parse() is called
+        THEN:
+            - _convert_image_to_pdfa() is called instead of ocrmypdf.ocr
+            - archive_path is set to the returned path
+            - text is empty string
+        """
+        fake_archive = Path("/tmp/fake-archive.pdf")
+        mock_convert = mocker.patch.object(
+            tesseract_parser,
+            "_convert_image_to_pdfa",
+            return_value=fake_archive,
+        )
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(simple_png_file, "image/png", produce_archive=True)
+
+        mock_convert.assert_called_once_with(simple_png_file, "image/png")
+        mock_ocr.assert_not_called()
+        assert tesseract_parser.archive_path == fake_archive
+        assert tesseract_parser.get_text() == ""
+
+
+# ---------------------------------------------------------------------------
+# produce_archive=False never sets archive_path for FORCE / REDO / AUTO modes
+# ---------------------------------------------------------------------------
+
+
+class TestProduceArchiveFalse:
+    """Verify produce_archive=False never results in an archive regardless of mode."""
+
+    @pytest.mark.parametrize("mode", ["force", "redo"])
+    def test_produce_archive_false_force_redo_modes(
+        self,
+        mode: str,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        multi_page_images_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - FORCE or REDO mode, produce_archive=False
+            - Any PDF
+        WHEN:
+            - parse() is called (ocrmypdf mocked to succeed)
+        THEN:
+            - archive_path is NOT set even though ocrmypdf ran
+        """
+        mocker.patch.object(
+            tesseract_parser,
+            "extract_text",
+            return_value=_LONG_TEXT,
+        )
+        mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = mode
+        tesseract_parser.parse(
+            multi_page_images_pdf_file,
+            "application/pdf",
+            produce_archive=False,
+        )
+
+        assert tesseract_parser.archive_path is None
+        assert tesseract_parser.get_text() is not None
+
+    def test_produce_archive_false_auto_with_text(
+        self,
+        mocker: MockerFixture,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_digital_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - AUTO mode, produce_archive=False
+            - PDF with text > PDF_TEXT_MIN_LENGTH
+        WHEN:
+            - parse() is called
+        THEN:
+            - ocrmypdf is skipped entirely (early return)
+            - archive_path is None
+        """
+        mocker.patch.object(
+            tesseract_parser,
+            "extract_text",
+            return_value=_LONG_TEXT,
+        )
+        mock_ocr = mocker.patch("ocrmypdf.ocr")
+
+        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.parse(
+            simple_digital_pdf_file,
+            "application/pdf",
+            produce_archive=False,
+        )
+
+        mock_ocr.assert_not_called()
+        assert tesseract_parser.archive_path is None
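
Review note: the paths these tests exercise can be summarised as a pure decision function. This is a minimal sketch of the main branches of `parse()` as they appear in the diff, not the parser itself. `plan_parse`, `has_sufficient_text`, and the returned labels are hypothetical, and the force-OCR fallback taken after an OCRmyPDF error is deliberately left out.

```python
from __future__ import annotations

# Threshold copied from the diff; a PDF needs strictly more characters than this.
PDF_TEXT_MIN_LENGTH = 50


def has_sufficient_text(text: str | None) -> bool:
    """Mirror the original_has_text check used by parse()."""
    return text is not None and len(text) > PDF_TEXT_MIN_LENGTH


def plan_parse(
    mode: str,
    original_text: str | None,
    *,
    produce_archive: bool,
    is_image: bool = False,
) -> str:
    """Return a label for the path parse() would take (labels are made up)."""
    # Non-PDF inputs are never treated as having embedded text.
    has_text = (not is_image) and has_sufficient_text(original_text)

    if mode == "off":
        if not produce_archive:
            return "text-only"  # keep pdftotext output, no OCRmyPDF call
        if is_image:
            return "img2pdf"  # image wrapped into a PDF without OCR
        return "ocrmypdf:skip_text"  # PDF/A conversion only

    if mode == "auto" and has_text and not produce_archive:
        return "text-only"  # early return, OCRmyPDF skipped entirely

    if mode == "force":
        return "ocrmypdf:force_ocr"
    if mode == "redo":
        return "ocrmypdf:redo_ocr"
    if mode == "auto":
        return "ocrmypdf:skip_text" if has_text else "ocrmypdf:plain"
    raise ValueError(f"Invalid ocr mode: {mode}")
```

Note the strict comparison: a PDF with exactly 50 extracted characters is still treated as having no usable text and goes through normal OCR in auto mode.
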
--- a/src/paperless/tests/parsers/test_tesseract_custom_settings.py
+++ b/src/paperless/tests/parsers/test_tesseract_custom_settings.py
@@ -89,15 +89,35 @@ class TestParserSettingsFromDb(DirectoriesMixin, FileSystemAssertsMixin, TestCas
        WHEN:
            - OCR parameters are constructed
        THEN:
-            - Configuration from database is utilized
+            - Configuration from database is utilized (AUTO mode with skip_text=True
+              triggers skip_text; AUTO mode alone does not add any extra flag)
        """
+        # AUTO mode with skip_text=True explicitly passed: skip_text is set
        with override_settings(OCR_MODE="redo"):
            instance = ApplicationConfiguration.objects.all().first()
-            instance.mode = ModeChoices.SKIP
+            instance.mode = ModeChoices.AUTO
+            instance.save()
+
+            params = RasterisedDocumentParser(None).construct_ocrmypdf_parameters(
+                input_file="input.pdf",
+                output_file="output.pdf",
+                sidecar_file="sidecar.txt",
+                mime_type="application/pdf",
+                safe_fallback=False,
+                skip_text=True,
+            )
+        self.assertTrue(params["skip_text"])
+        self.assertNotIn("redo_ocr", params)
+        self.assertNotIn("force_ocr", params)
+
+        # AUTO mode alone (no skip_text): no extra OCR flag is set
+        with override_settings(OCR_MODE="redo"):
+            instance = ApplicationConfiguration.objects.all().first()
+            instance.mode = ModeChoices.AUTO
            instance.save()

            params = self.get_params()
-        self.assertTrue(params["skip_text"])
+        self.assertNotIn("skip_text", params)
        self.assertNotIn("redo_ocr", params)
        self.assertNotIn("force_ocr", params)

--- a/src/paperless/tests/parsers/test_tesseract_parser.py
+++ b/src/paperless/tests/parsers/test_tesseract_parser.py
@@ -370,15 +370,26 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
+        """
+        GIVEN:
+            - Multi-page digital PDF with sufficient text layer
+            - Default settings (mode=auto, produce_archive=True)
+        WHEN:
+            - Document is parsed
+        THEN:
+            - Archive is created (AUTO mode + text present + produce_archive=True
+              → PDF/A conversion via skip_text)
+            - Text is extracted
+        """
        tesseract_parser.parse(
-            tesseract_samples_dir / "simple-digital.pdf",
+            tesseract_samples_dir / "multi-page-digital.pdf",
            "application/pdf",
        )
        assert tesseract_parser.archive_path is not None
        assert tesseract_parser.archive_path.is_file()
        assert_ordered_substrings(
-            tesseract_parser.get_text(),
-            ["This is a test document."],
+            tesseract_parser.get_text().lower(),
+            ["page 1", "page 2", "page 3"],
        )

    def test_with_form_default(
@@ -397,7 +408,7 @@ class TestParsePdf:
            ["Please enter your name in here:", "This is a PDF document with a form."],
        )

-    def test_with_form_redo_produces_no_archive(
+    def test_with_form_redo_no_archive_when_not_requested(
        self,
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
@@ -406,6 +417,7 @@ class TestParsePdf:
        tesseract_parser.parse(
            tesseract_samples_dir / "with-form.pdf",
            "application/pdf",
+            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -433,7 +445,7 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(tesseract_samples_dir / "signed.pdf", "application/pdf")
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -449,7 +461,7 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "encrypted.pdf",
            "application/pdf",
@@ -559,7 +571,7 @@ class TestParseMultiPage:
    @pytest.mark.parametrize(
        "mode",
        [
-            pytest.param("skip", id="skip"),
+            pytest.param("auto", id="auto"),
            pytest.param("redo", id="redo"),
            pytest.param("force", id="force"),
        ],
@@ -587,7 +599,7 @@ class TestParseMultiPage:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-images.pdf",
            "application/pdf",
@@ -735,16 +747,18 @@ class TestSkipArchive:
        """
        GIVEN:
            - File with existing text layer
-            - Mode: skip_noarchive
+            - Mode: auto, produce_archive=False
        WHEN:
            - Document is parsed
        THEN:
-            - Text extracted; no archive created
+            - Text extracted from original; no archive created (text exists +
+              produce_archive=False skips OCRmyPDF entirely)
        """
-        tesseract_parser.settings.mode = "skip_noarchive"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-digital.pdf",
            "application/pdf",
+            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -760,13 +774,13 @@ class TestSkipArchive:
        """
        GIVEN:
            - File with image-only pages (no text layer)
-            - Mode: skip_noarchive
+            - Mode: auto, archive_file_generation: auto
        WHEN:
            - Document is parsed
        THEN:
-            - Text extracted; archive created (OCR needed)
+            - Text extracted; archive created (OCR needed, no existing text)
        """
-        tesseract_parser.settings.mode = "skip_noarchive"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-images.pdf",
            "application/pdf",
@@ -778,41 +792,58 @@ class TestSkipArchive:
        )

    @pytest.mark.parametrize(
-        ("skip_archive_file", "filename", "expect_archive"),
+        ("produce_archive", "filename", "expect_archive"),
        [
-            pytest.param("never", "multi-page-digital.pdf", True, id="never-with-text"),
-            pytest.param("never", "multi-page-images.pdf", True, id="never-no-text"),
            pytest.param(
-                "with_text",
+                True,
                "multi-page-digital.pdf",
-                False,
-                id="with-text-layer",
+                True,
+                id="produce-archive-with-text",
            ),
            pytest.param(
-                "with_text",
+                True,
                "multi-page-images.pdf",
                True,
-                id="with-text-no-layer",
+                id="produce-archive-no-text",
            ),
            pytest.param(
-                "always",
+                False,
                "multi-page-digital.pdf",
                False,
-                id="always-with-text",
+                id="no-archive-with-text-layer",
+            ),
+            pytest.param(
+                False,
+                "multi-page-images.pdf",
+                False,
+                id="no-archive-no-text-layer",
            ),
-            pytest.param("always", "multi-page-images.pdf", False, id="always-no-text"),
        ],
    )
-    def test_skip_archive_file_setting(
+    def test_produce_archive_flag(
        self,
-        skip_archive_file: str,
+        produce_archive: bool,  # noqa: FBT001
        filename: str,
-        expect_archive: str,
+        expect_archive: bool,  # noqa: FBT001
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.skip_archive_file = skip_archive_file
-        tesseract_parser.parse(tesseract_samples_dir / filename, "application/pdf")
+        """
+        GIVEN:
+            - Various PDFs (with and without text layers)
+            - produce_archive flag set to True or False
+        WHEN:
+            - Document is parsed
+        THEN:
+            - archive_path is set if and only if produce_archive=True
+            - Text is always extracted
+        """
+        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.parse(
+            tesseract_samples_dir / filename,
+            "application/pdf",
+            produce_archive=produce_archive,
+        )
        text = tesseract_parser.get_text().lower()
        assert_ordered_substrings(text, ["page 1", "page 2", "page 3"])
        if expect_archive:
@@ -835,13 +866,13 @@ class TestParseMixed:
        """
        GIVEN:
            - File with text in some pages (image) and some pages (digital)
-            - Mode: skip
+            - Mode: auto (skip_text), archive_file_generation: always
        WHEN:
            - Document is parsed
        THEN:
            - All pages extracted; archive created; sidecar notes skipped pages
        """
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-mixed.pdf",
            "application/pdf",
@@ -898,17 +929,18 @@ class TestParseMixed:
    ) -> None:
        """
        GIVEN:
-            - File with mixed pages
-            - Mode: skip_noarchive
+            - File with mixed pages (some with text, some image-only)
+            - Mode: auto, produce_archive=False
        WHEN:
            - Document is parsed
        THEN:
-            - No archive created (file has text layer); later-page text present
+            - No archive created (produce_archive=False); text from text layer present
        """
-        tesseract_parser.settings.mode = "skip_noarchive"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-mixed.pdf",
            "application/pdf",
+            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -923,12 +955,12 @@ class TestParseMixed:


 class TestParseRotate:
-    def test_rotate_skip_mode(
+    def test_rotate_auto_mode(
        self,
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.settings.rotate = True
        tesseract_parser.parse(tesseract_samples_dir / "rotated.pdf", "application/pdf")
        assert_ordered_substrings(
@@ -955,12 +987,19 @@ class TestParseRtl:
    ) -> None:
        """
        GIVEN:
-            - PDF with RTL Arabic text
+            - PDF with RTL Arabic text in its text layer (short: 18 chars)
+            - mode=off, produce_archive=True: PDF/A conversion via skip_text, no OCR engine
        WHEN:
            - Document is parsed
        THEN:
-            - Arabic content is extracted (normalised for bidi)
+            - Arabic content is extracted from the PDF text layer (normalised for bidi)
+
+        Note: The RTL PDF has a short text layer (< VALID_TEXT_LENGTH=50) so AUTO mode
+        would attempt full OCR, which fails due to PriorOcrFoundError and falls back to
+        force-ocr with English Tesseract (producing garbage).  Using mode="off" forces
+        skip_text=True so the Arabic text layer is preserved through PDF/A conversion.
        """
+        tesseract_parser.settings.mode = "off"
        tesseract_parser.parse(
            tesseract_samples_dir / "rtl-test.pdf",
            "application/pdf",
@@ -1023,11 +1062,11 @@ class TestOcrmypdfParameters:
        assert ("clean" in params) == expected_clean
        assert ("clean_final" in params) == expected_clean_final

-    def test_clean_final_skip_mode(
+    def test_clean_final_auto_mode(
        self,
        make_tesseract_parser: MakeTesseractParser,
    ) -> None:
-        with make_tesseract_parser(OCR_CLEAN="clean-final", OCR_MODE="skip") as parser:
+        with make_tesseract_parser(OCR_CLEAN="clean-final", OCR_MODE="auto") as parser:
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
        assert params["clean_final"] is True
        assert "clean" not in params
@@ -1044,9 +1083,9 @@ class TestOcrmypdfParameters:
    @pytest.mark.parametrize(
        ("ocr_mode", "ocr_deskew", "expect_deskew"),
        [
-            pytest.param("skip", True, True, id="skip-deskew-on"),
+            pytest.param("auto", True, True, id="auto-deskew-on"),
            pytest.param("redo", True, False, id="redo-deskew-off"),
-            pytest.param("skip", False, False, id="skip-no-deskew"),
+            pytest.param("auto", False, False, id="auto-no-deskew"),
        ],
    )
    def test_deskew_option(
--- a/src/paperless/tests/test_checks.py
+++ b/src/paperless/tests/test_checks.py
@@ -132,13 +132,13 @@ class TestOcrSettingsChecks:
            pytest.param(
                "OCR_MODE",
                "skip_noarchive",
-                "deprecated",
-                id="deprecated-mode",
+                'OCR output mode "skip_noarchive"',
+                id="deprecated-mode-now-invalid",
            ),
            pytest.param(
-                "OCR_SKIP_ARCHIVE_FILE",
+                "ARCHIVE_FILE_GENERATION",
                "invalid",
-                'OCR_SKIP_ARCHIVE_FILE setting "invalid"',
+                'PAPERLESS_ARCHIVE_FILE_GENERATION setting "invalid"',
                id="invalid-skip-archive-file",
            ),
            pytest.param(
--- a/src/paperless/tests/test_checks_v3.py
+++ b/src/paperless/tests/test_checks_v3.py
@@ -0,0 +1,64 @@
+"""Tests for v3 system checks: deprecated v2 OCR env var warnings."""
+
+from __future__ import annotations
+
+import os
+from typing import TYPE_CHECKING
+
+import pytest
+
+from paperless.checks import check_deprecated_v2_ocr_env_vars
+
+if TYPE_CHECKING:
+    from pytest_mock import MockerFixture
+
+
+class TestDeprecatedV2OcrEnvVarWarnings:
+    def test_no_deprecated_vars_returns_empty(self, mocker: MockerFixture) -> None:
+        """No warnings when neither deprecated variable is set."""
+        mocker.patch.dict(os.environ, {"PAPERLESS_OCR_MODE": "auto"}, clear=True)
+        result = check_deprecated_v2_ocr_env_vars(None)
+        assert result == []
+
+    @pytest.mark.parametrize(
+        ("env_var", "env_value", "expected_id", "expected_fragment"),
+        [
+            pytest.param(
+                "PAPERLESS_OCR_SKIP_ARCHIVE_FILE",
+                "always",
+                "paperless.W002",
+                "PAPERLESS_OCR_SKIP_ARCHIVE_FILE",
+                id="skip-archive-file-warns",
+            ),
+            pytest.param(
+                "PAPERLESS_OCR_MODE",
+                "skip",
+                "paperless.W003",
+                "skip",
+                id="ocr-mode-skip-warns",
+            ),
+            pytest.param(
+                "PAPERLESS_OCR_MODE",
+                "skip_noarchive",
+                "paperless.W003",
+                "skip_noarchive",
+                id="ocr-mode-skip-noarchive-warns",
+            ),
+        ],
+    )
+    def test_deprecated_var_produces_one_warning(
+        self,
+        mocker: MockerFixture,
+        env_var: str,
+        env_value: str,
+        expected_id: str,
+        expected_fragment: str,
+    ) -> None:
+        """Each deprecated setting in isolation produces exactly one warning."""
+        mocker.patch.dict(os.environ, {env_var: env_value}, clear=True)
+        result = check_deprecated_v2_ocr_env_vars(None)
+
+        assert len(result) == 1
+        warning = result[0]
+        assert warning.id == expected_id
+        assert expected_fragment in warning.msg
--- a/src/paperless/tests/test_ocr_config.py
+++ b/src/paperless/tests/test_ocr_config.py
@@ -0,0 +1,66 @@
+"""Tests for OcrConfig archive_file_generation field behavior."""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+import pytest
+from django.test import override_settings
+
+from paperless.config import OcrConfig
+
+if TYPE_CHECKING:
+    from unittest.mock import MagicMock
+
+
+@pytest.fixture()
+def null_app_config(mocker) -> MagicMock:
+    """Mock ApplicationConfiguration with all fields None → falls back to Django settings."""
+    return mocker.MagicMock(
+        output_type=None,
+        pages=None,
+        language=None,
+        mode=None,
+        archive_file_generation=None,
+        image_dpi=None,
+        unpaper_clean=None,
+        deskew=None,
+        rotate_pages=None,
+        rotate_pages_threshold=None,
+        max_image_pixels=None,
+        color_conversion_strategy=None,
+        user_args=None,
+    )
+
+
+@pytest.fixture()
+def make_ocr_config(mocker, null_app_config):
+    mocker.patch(
+        "paperless.config.BaseConfig._get_config_instance",
+        return_value=null_app_config,
+    )
+
+    def _make(**django_settings_overrides):
+        with override_settings(**django_settings_overrides):
+            return OcrConfig()
+
+    return _make
+
+
+class TestOcrConfigArchiveFileGeneration:
+    def test_auto_from_settings(self, make_ocr_config) -> None:
+        cfg = make_ocr_config(OCR_MODE="auto", ARCHIVE_FILE_GENERATION="auto")
+        assert cfg.archive_file_generation == "auto"
+
+    def test_always_from_settings(self, make_ocr_config) -> None:
+        cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="always")
+        assert cfg.archive_file_generation == "always"
+
+    def test_never_from_settings(self, make_ocr_config) -> None:
+        cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="never")
+        assert cfg.archive_file_generation == "never"
+
+    def test_db_value_overrides_setting(self, make_ocr_config, null_app_config) -> None:
+        null_app_config.archive_file_generation = "never"
+        cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="always")
+        assert cfg.archive_file_generation == "never"
Author	SHA1	Message	Date
Trenton H	e3c7003b02	docs: update OCR and archive settings docs for v3 - configuration.md: replace PAPERLESS_OCR_SKIP_ARCHIVE_FILE section with PAPERLESS_ARCHIVE_FILE_GENERATION; update OCR_MODE docs to reflect auto as default and document new 'off' mode - setup.md: update resource-constrained device tip to use new setting names - migration-v3.md: add OCR and archive settings section documenting all removed settings, their replacements, and migration examples Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 16:38:04 -07:00
Trenton H	790394dc24	test: use pytest-django settings fixture and pytest.param in new tests - TestShouldProduceArchive: replace @override_settings decorators with settings fixture; consolidate 10 individual tests into 2 parametrized tests (test_generation_setting, test_auto_pdf_archive_decision) - TestDeprecatedV2OcrEnvVarWarnings: call check_deprecated_v2_ocr_env_vars() directly instead of django_checks.run_checks(); use mocker.patch.dict for env isolation; consolidate warn cases into one parametrized test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 16:27:37 -07:00
Trenton H	e45f81db39	refactor: consolidate pdftotext utility and archive-decision logic - Add extract_pdf_text() and PDF_TEXT_MIN_LENGTH to paperless/parsers/utils.py, eliminating duplicate pdftotext call sites in consumer.py and tesseract.py - Rename _should_produce_archive → should_produce_archive (now public, imported by both consumer.py and tasks.py) - update_document_content_maybe_archive_file now calls should_produce_archive, honouring ARCHIVE_FILE_GENERATION the same as the consumer pipeline - Fallback OCR path sets archive_path when produce_archive=True; update test_with_form_redo_produces_no_archive to use produce_archive=False Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 15:40:02 -07:00
Trenton H	516611878d	feat!: add deprecated v2 OCR env var warnings to system checks	2026-03-26 14:33:12 -07:00
Trenton H	dc74e2176f	feat: compute produce_archive from ARCHIVE_FILE_GENERATION, pass to parser Add _extract_text_for_archive_check() and _should_produce_archive() helper functions to documents/consumer.py. These compute whether the parser should produce a PDF/A archive based on the ARCHIVE_FILE_GENERATION setting (always/never/auto), parser capabilities (can_produce_archive, requires_pdf_rendition), MIME type, and pdftotext-based born-digital detection for auto mode. Update the parse() call site to compute and pass produce_archive=... kwarg. Add 10 unit tests in test_consumer_archive.py; update two existing consumer tests that asserted run_subprocess call counts now that pdftotext is invoked during auto-mode archive detection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 14:27:42 -07:00
Trenton H	6aedcf9026	chore: remove dead archive_file_generation assignments from tests	2026-03-26 14:18:40 -07:00
Trenton H	55c51cb89e	feat!: restructure parse() for OCR_MODE=auto/off and produce_archive flag Implement the new decoupled archive/OCR control in RasterisedDocumentParser: - construct_ocrmypdf_parameters(): add skip_text parameter; fix AUTO mode dispatch so skip_text is only added when explicitly requested (text-present + produce_archive case) rather than unconditionally; add OFF mode support. - parse(): remove archive_file_generation checks; control archive creation exclusively via the produce_archive bool passed by the consumer. - OFF + no archive: return pdftotext text, skip OCRmyPDF entirely. - OFF + image + archive: use new _convert_image_to_pdfa() helper. - OFF + PDF + archive: run OCRmyPDF with skip_text=True (PDF/A only). - AUTO + text + no archive: skip OCRmyPDF entirely (fast path). - AUTO + text + archive: run OCRmyPDF with skip_text=True. - AUTO + no text: run normal OCR regardless of produce_archive. - FORCE/REDO: always run OCRmyPDF; set archive_path only when produce_archive. - Add _convert_image_to_pdfa(): img2pdf wrapping + pikepdf PDF/A-2b stamping without invoking Tesseract or Ghostscript. - Add PriorOcrFoundError to the fallback exception list (same treatment as InputFileError: retry with force_ocr). - Update existing tests to use produce_archive instead of archive_file_generation: TestSkipArchive rewritten; RTL test uses mode=off to preserve Arabic text layer; AUTO mode tests clarified. - Add test_parse_modes.py: 11 focused unit tests with mocked ocrmypdf.ocr verifying control flow for all mode/produce_archive combinations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 14:10:02 -07:00
Trenton H	6658d94f77	feat!: drop skip_archive_file field, add archive_file_generation to ApplicationConfiguration Replace the old skip_archive_file DB field with the correctly-named archive_file_generation field on ApplicationConfiguration. Remove the temporary getattr fallback in OcrConfig now that the migration exists. Update all test fixtures and API response assertions to use the new field name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 13:39:49 -07:00
Trenton H	a53144ef8d	feat!: update OcrConfig to read archive_file_generation from DB field Switches OcrConfig.__post_init__ from reading the old skip_archive_file attribute to the new archive_file_generation attribute, with a getattr fallback to skip_archive_file for compatibility until Task 4 renames the DB model field. Updates null_app_config fixtures in both the parser conftest and the new test_ocr_config.py to explicitly set both attributes to None so MagicMock doesn't return truthy auto-generated attributes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 13:28:04 -07:00
Trenton H	e9e60ea395	feat!: rename OCR_SKIP_ARCHIVE_FILE to ARCHIVE_FILE_GENERATION Rename the Django setting OCR_SKIP_ARCHIVE_FILE to ARCHIVE_FILE_GENERATION and the env var PAPERLESS_OCR_SKIP_ARCHIVE_FILE to PAPERLESS_ARCHIVE_FILE_GENERATION. Rename the OcrConfig attribute skip_archive_file to archive_file_generation. Update checks.py error messages and all tests accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 13:14:29 -07:00
Trenton H	877648be07	chore: remove pointless enum-existence tests	2026-03-26 13:05:56 -07:00
Trenton H	3d88567219	test: add test_new_settings.py for v3 enum values	2026-03-26 13:04:39 -07:00
Trenton H	cd653959d6	feat!: replace ModeChoices and ArchiveFileChoices with new v3 enums - Replace ModeChoices (SKIP/SKIP_NO_ARCHIVE/REDO/FORCE) with new values: AUTO, FORCE, REDO, OFF - Remove ArchiveFileChoices entirely; add ArchiveFileGenerationChoices with AUTO, ALWAYS, NEVER values - Update checks.py valid sets and default settings to use new enum values - Update tesseract parser to use new enum comparisons; AUTO mode maps to skip_text behavior; FORCE/REDO bypass archive-skip early-exit - Update all affected tests to use new valid mode/archive string values Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 12:50:43 -07:00
Trenton H	338cadf284	chore: add .worktrees to .gitignore	2026-03-26 11:50:07 -07:00
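
The mode/produce_archive combinations listed in commit 55c51cb89e can be read as a small decision table. The sketch below is only an illustration of that table, not the parser's actual code: `Action`, `plan_parse`, and the `is_image`/`has_text` parameters are hypothetical names introduced here, and the real `RasterisedDocumentParser.parse()` may structure the dispatch differently.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the processing paths named in the commit message."""

    TEXT_ONLY = "text_only"  # return pdftotext text, OCRmyPDF not invoked
    IMAGE_TO_PDFA = "image_to_pdfa"  # image wrapped as PDF/A without OCR
    OCRMYPDF_SKIP_TEXT = "ocrmypdf_skip_text"  # PDF/A conversion, existing text kept
    OCRMYPDF_OCR = "ocrmypdf_ocr"  # OCRmyPDF run with OCR


def plan_parse(
    mode: str,
    *,
    is_image: bool,
    has_text: bool,
    produce_archive: bool,
) -> Action:
    """Map (mode, input kind, text presence, archive flag) to a processing path."""
    if mode == "off":
        if not produce_archive:
            return Action.TEXT_ONLY
        # Archive requested without OCR: images and PDFs take different routes.
        return Action.IMAGE_TO_PDFA if is_image else Action.OCRMYPDF_SKIP_TEXT
    if mode == "auto":
        if not has_text:
            # No usable text layer: OCR runs regardless of produce_archive.
            return Action.OCRMYPDF_OCR
        return Action.OCRMYPDF_SKIP_TEXT if produce_archive else Action.TEXT_ONLY
    if mode in ("force", "redo"):
        # OCRmyPDF always runs in these modes.
        return Action.OCRMYPDF_OCR
    raise ValueError(f"unknown OCR mode: {mode!r}")


if __name__ == "__main__":
    print(plan_parse("auto", is_image=False, has_text=True, produce_archive=False))
```

The sketch only chooses the processing path. Per the `test_produce_archive_flag` docstring, `archive_path` is set if and only if `produce_archive=True`, even on paths where OCR runs.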