Bad merge

Merge remote-tracking branch 'origin/dev' into feature-ocr-archive-decoupling
fix: access parser properties on instance not class
2026-03-27 03:12:45 +00:00 · 2026-03-26 09:37:45 -07:00 · 2026-03-26 09:26:38 -07:00 · 2026-03-26 09:02:57 -07:00 · 2026-03-26 08:40:57 -07:00 · 2026-03-23 19:50:52 -07:00
19 changed files with 1163 additions and 163 deletions
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -801,11 +801,13 @@ parsing documents.

 #### [`PAPERLESS_OCR_MODE=<mode>`](#PAPERLESS_OCR_MODE) {#PAPERLESS_OCR_MODE}

-: Tell paperless when and how to perform ocr on your documents. Three
-modes are available:
+: Tell paperless when and how to perform OCR on your documents. The
+following modes are available:

-    -   `skip`: Paperless skips all pages and will perform ocr only on
-        pages where no text is present. This is the safest option.
+    -   `auto`: Paperless auto-detects whether a document already
+        contains extractable text using pdftotext. If the extracted
+        text exceeds a threshold (50 characters), OCR is skipped;
+        otherwise OCR runs. This is the default.

    -   `redo`: Paperless will OCR all pages of your documents and
        attempt to replace any existing text layers with new text. This
@@ -823,24 +825,46 @@ modes are available:
        significantly larger and text won't appear as sharp when zoomed
        in.

-    The default is `skip`, which only performs OCR when necessary and
-    always creates archived documents.
+    -   `off`: OCR never runs regardless of input type. Embedded text
+        is still extracted from PDFs via pdftotext, but images and
+        scanned PDFs without text layers will have empty content.
+        Useful for handwritten documents, bulk ingestion of large
+        archives, or content that OCRs poorly. Archive generation still
+        works independently when `PAPERLESS_ARCHIVE_FILE_GENERATION`
+        requests it — a PDF/A can be produced without OCR via format
+        conversion only.
+
+    Defaults to `auto`.

    Read more about this in the [OCRmyPDF
    documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).

-#### [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE=<mode>`](#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) {#PAPERLESS_OCR_SKIP_ARCHIVE_FILE}
+#### [`PAPERLESS_ARCHIVE_FILE_GENERATION=<mode>`](#PAPERLESS_ARCHIVE_FILE_GENERATION) {#PAPERLESS_ARCHIVE_FILE_GENERATION}

-: Specify when you would like paperless to skip creating an archived
-version of your documents. This is useful if you don't want to have two
-almost-identical versions of your documents in the media folder.
+: Controls whether paperless produces a normalized PDF/A archive copy
+of each document. This is independent of OCR — a PDF/A can be produced
+with or without running OCR.

-    -   `never`: Never skip creating an archived version.
-    -   `with_text`: Skip creating an archived version for documents
-    that already have embedded text.
-    -   `always`: Always skip creating an archived version.
+    -   `auto`: Produce archives for scanned and image-based documents;
+        skip for born-digital PDFs. Born-digital is detected by
+        checking both whether the PDF contains extractable text and
+        whether it has a logical structure (tag tree), which word
+        processors and PDF export tools produce. Scanner software that
+        applies its own OCR typically does not produce tagged PDFs, so
+        those still receive an archive.

-    The default is `never`.
+    -   `always`: Always produce a PDF/A archive when the parser
+        supports it.
+
+    -   `never`: Never produce an archive.
+
+    Defaults to `auto`.
+
+    !!! note
+
+        Parsers that must produce a PDF for the frontend to display the
+        document (e.g. the Tika parser for Office documents) always
+        produce a PDF rendition regardless of this setting.

 #### [`PAPERLESS_OCR_CLEAN=<mode>`](#PAPERLESS_OCR_CLEAN) {#PAPERLESS_OCR_CLEAN}

--- a/docs/migration-v3.md
+++ b/docs/migration-v3.md
@@ -130,3 +130,21 @@ For example:
  }
 }
 ```
+
+## OCR and Archive Settings Changes
+
+The `PAPERLESS_OCR_MODE` values `skip` and `skip_noarchive` have been replaced by
+[`PAPERLESS_OCR_MODE=auto`](configuration.md#PAPERLESS_OCR_MODE). Archive file
+generation is now controlled by the separate
+[`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION)
+setting, replacing `PAPERLESS_OCR_SKIP_ARCHIVE_FILE`.
+
+### Summary
+
+| Old Setting                                 | New Setting                                                                                                                                                        |
+| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `PAPERLESS_OCR_MODE=skip`                   | [`PAPERLESS_OCR_MODE=auto`](configuration.md#PAPERLESS_OCR_MODE) (now the default)                                                                                 |
+| `PAPERLESS_OCR_MODE=skip_noarchive`         | [`PAPERLESS_OCR_MODE=auto`](configuration.md#PAPERLESS_OCR_MODE) + [`PAPERLESS_ARCHIVE_FILE_GENERATION=never`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never`     | [`PAPERLESS_ARCHIVE_FILE_GENERATION=always`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION)                                                                   |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text` | [`PAPERLESS_ARCHIVE_FILE_GENERATION=auto`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION)                                                                     |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always`    | [`PAPERLESS_ARCHIVE_FILE_GENERATION=never`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION)                                                                    |
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -34,6 +34,7 @@ from documents.models import StoragePath
 from documents.models import Tag
 from documents.models import WorkflowTrigger
 from documents.parsers import ParseError
+from documents.parsers import resolve_archive_preference
 from documents.permissions import set_permissions_for_object
 from documents.plugins.base import AlwaysRunPluginMixin
 from documents.plugins.base import ConsumeTaskPlugin
@@ -419,6 +420,14 @@ class ConsumerPlugin(
                    ParserContext(mailrule_id=self.input_doc.mailrule_id),
                )

+                # Determine if we should produce an archive
+                needs_pdf = document_parser.requires_pdf_rendition
+                should_produce_archive = needs_pdf or resolve_archive_preference(
+                    mime_type,
+                    self.working_copy,
+                    can_produce_archive=document_parser.can_produce_archive,
+                )
+
                self.log.debug(
                    f"Parser: {document_parser.name} v{document_parser.version}",
                )
@@ -440,7 +449,11 @@ class ConsumerPlugin(
                    )
                    self.log.debug(f"Parsing {self.filename}...")

-                    document_parser.parse(self.working_copy, mime_type)
+                    document_parser.parse(
+                        self.working_copy,
+                        mime_type,
+                        produce_archive=should_produce_archive,
+                    )

                    self.log.debug(f"Generating thumbnail for {self.filename}...")
                    self._send_progress(
--- a/src/documents/parsers.py
+++ b/src/documents/parsers.py
@@ -9,12 +9,15 @@ import tempfile
 from pathlib import Path
 from typing import TYPE_CHECKING

+import pikepdf
 from django.conf import settings

 from documents.loggers import LoggingMixin
 from documents.utils import copy_file_with_basic_stats
 from documents.utils import run_subprocess
+from paperless.models import ArchiveFileGenerationChoices
 from paperless.parsers.registry import get_parser_registry
+from paperless.parsers.utils import read_file_handle_unicode_errors

 if TYPE_CHECKING:
    import datetime
@@ -22,6 +25,91 @@ if TYPE_CHECKING:
 logger = logging.getLogger("paperless.parsing")


+def resolve_archive_preference(
+    mime_type: str,
+    file_path: Path,
+    *,
+    can_produce_archive: bool,
+) -> bool:
+    """
+    Determine whether to produce an archive file based on the new settings.
+
+    Args:
+        mime_type: The MIME type of the document
+        can_produce_archive: Whether the parser can produce an archive
+        file_path: Path to the document file
+
+    Returns:
+        True if an archive should be produced, False otherwise
+    """
+    if not can_produce_archive:
+        return False
+
+    if settings.ARCHIVE_FILE_GENERATION == ArchiveFileGenerationChoices.ALWAYS:
+        return True
+    elif settings.ARCHIVE_FILE_GENERATION == ArchiveFileGenerationChoices.NEVER:
+        return False
+    elif settings.ARCHIVE_FILE_GENERATION == ArchiveFileGenerationChoices.AUTO:
+        # For non-PDF mime types (images etc.), always produce archive
+        if mime_type != "application/pdf":
+            return True
+
+        # For PDFs, use combined heuristic to detect born-digital vs scanned
+        return _should_produce_archive_for_pdf(file_path)
+
+    return False
+
+
+def _should_produce_archive_for_pdf(pdf_path: Path) -> bool:
+    """
+    Determine if a PDF needs an archive based on heuristics.
+
+    Args:
+        pdf_path: Path to the PDF file
+
+    Returns:
+        True if archive should be produced, False otherwise
+    """
+    try:
+        # Extract text via pdftotext
+        text = ""
+        with tempfile.NamedTemporaryFile(mode="w+") as tmp:
+            run_subprocess(
+                [
+                    "pdftotext",
+                    "-q",
+                    "-layout",
+                    "-enc",
+                    "UTF-8",
+                    str(pdf_path),
+                    tmp.name,
+                ],
+                logger=None,  # Don't log from utility function
+            )
+            text = read_file_handle_unicode_errors(Path(tmp.name))
+
+        # Check if PDF is tagged via pikepdf
+        is_tagged = False
+        with pikepdf.open(pdf_path) as pdf:
+            # Check for /StructTreeRoot in pdf.Root
+            if hasattr(pdf.Root, "StructTreeRoot") or (
+                hasattr(pdf.Root, "MarkInfo") and pdf.Root.MarkInfo.get("Marked", False)
+            ):
+                is_tagged = True
+
+        # Apply heuristic:
+        # 1. If len(text) > 50 AND tagged → born-digital → return False
+        # 2. If len(text) > 50 AND not tagged → scanner OCR'd → return True
+        # 3. If no/little text → raw scan → return True
+        if len(text) > 50:
+            return not is_tagged  # born-digital (tagged) → False, scanner OCR'd → True
+        return True  # raw scan
+
+    except Exception:
+        # If anything fails, default to producing archive
+        return True
+
+
 def is_mime_type_supported(mime_type: str) -> bool:
    """
    Returns True if the mime type is supported, False otherwise
--- a/src/documents/tasks.py
+++ b/src/documents/tasks.py
@@ -52,6 +52,7 @@ from documents.models import StoragePath
 from documents.models import Tag
 from documents.models import WorkflowRun
 from documents.models import WorkflowTrigger
+from documents.parsers import resolve_archive_preference
 from documents.plugins.base import ConsumeTaskPlugin
 from documents.plugins.base import StopConsumeTaskError
 from documents.plugins.helpers import ProgressManager
@@ -321,7 +322,19 @@ def update_document_content_maybe_archive_file(document_id) -> None:
        parser.configure(ParserContext())

        try:
-            parser.parse(document.source_path, mime_type)
+            # Determine if we should produce an archive
+            needs_pdf = parser.requires_pdf_rendition
+            should_produce_archive = needs_pdf or resolve_archive_preference(
+                mime_type,
+                Path(document.source_path),
+                can_produce_archive=parser.can_produce_archive,
+            )
+
+            parser.parse(
+                document.source_path,
+                mime_type,
+                produce_archive=should_produce_archive,
+            )

            thumbnail = parser.get_thumbnail(document.source_path, mime_type)

--- a/src/documents/tests/test_archive_preference.py
+++ b/src/documents/tests/test_archive_preference.py
@@ -0,0 +1,257 @@
+"""
+Tests for documents.parsers.resolve_archive_preference function and related logic.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import TYPE_CHECKING
+from unittest.mock import Mock
+
+import pytest
+
+if TYPE_CHECKING:
+    from pytest_django.fixtures import SettingsWrapper
+    from pytest_mock import MockerFixture
+
+from documents.parsers import _should_produce_archive_for_pdf
+from documents.parsers import resolve_archive_preference
+from paperless.models import ArchiveFileGenerationChoices
+
+
+class TestResolveArchivePreference:
+    """Test the resolve_archive_preference function."""
+
+    @pytest.mark.parametrize(
+        ("archive_setting", "can_produce_archive", "expected"),
+        [
+            pytest.param(
+                ArchiveFileGenerationChoices.ALWAYS,
+                True,
+                True,
+                id="always-capable-parser",
+            ),
+            pytest.param(
+                ArchiveFileGenerationChoices.ALWAYS,
+                False,
+                False,
+                id="always-incapable-parser",
+            ),
+            pytest.param(
+                ArchiveFileGenerationChoices.NEVER,
+                True,
+                False,
+                id="never-capable-parser",
+            ),
+            pytest.param(
+                ArchiveFileGenerationChoices.NEVER,
+                False,
+                False,
+                id="never-incapable-parser",
+            ),
+        ],
+    )
+    def test_archive_generation_setting_behavior(
+        self,
+        settings: SettingsWrapper,
+        archive_setting: ArchiveFileGenerationChoices,
+        can_produce_archive: bool,  # noqa: FBT001
+        expected: bool,  # noqa: FBT001
+    ) -> None:
+        """Test archive generation setting behavior for always/never modes."""
+        settings.ARCHIVE_FILE_GENERATION = archive_setting
+
+        result = resolve_archive_preference(
+            "application/pdf",
+            Path("/fake/path.pdf"),
+            can_produce_archive=can_produce_archive,
+        )
+
+        assert result is expected
+
+    def test_auto_mode_non_pdf_returns_true(
+        self,
+        settings: SettingsWrapper,
+    ) -> None:
+        """
+        GIVEN:
+            - ARCHIVE_FILE_GENERATION=auto
+            - Non-PDF mime type
+            - can_produce_archive=True
+        WHEN:
+            - resolve_archive_preference is called
+        THEN:
+            - Returns True (images always need archive)
+        """
+        settings.ARCHIVE_FILE_GENERATION = ArchiveFileGenerationChoices.AUTO
+
+        result = resolve_archive_preference(
+            "image/jpeg",
+            Path("/fake/path.jpg"),
+            can_produce_archive=True,
+        )
+
+        assert result is True
+
+    def test_auto_mode_pdf_delegates_to_heuristic(
+        self,
+        settings: SettingsWrapper,
+        mocker: MockerFixture,
+    ) -> None:
+        """
+        GIVEN:
+            - ARCHIVE_FILE_GENERATION=auto
+            - PDF mime type
+            - can_produce_archive=True
+        WHEN:
+            - resolve_archive_preference is called
+        THEN:
+            - Delegates to _should_produce_archive_for_pdf
+        """
+        settings.ARCHIVE_FILE_GENERATION = ArchiveFileGenerationChoices.AUTO
+        mock_heuristic = mocker.patch(
+            "documents.parsers._should_produce_archive_for_pdf",
+            return_value=True,
+        )
+        fake_path = Path("/fake/path.pdf")
+
+        result = resolve_archive_preference(
+            "application/pdf",
+            fake_path,
+            can_produce_archive=True,
+        )
+
+        mock_heuristic.assert_called_once_with(fake_path)
+        assert result is True
+
+
+class TestShouldProduceArchiveForPdf:
+    """Test the _should_produce_archive_for_pdf heuristic function."""
+
+    @pytest.mark.parametrize(
+        ("text_content", "has_struct_tree", "is_marked", "expected"),
+        [
+            pytest.param(
+                "This is a long text content that is definitely longer than fifty characters",
+                True,
+                False,
+                False,
+                id="tagged-with-struct-tree",
+            ),
+            pytest.param(
+                "This is a long text content that is definitely longer than fifty characters",
+                False,
+                True,
+                False,
+                id="tagged-with-mark-info",
+            ),
+            pytest.param(
+                "This is a long text content that is definitely longer than fifty characters",
+                False,
+                False,
+                True,
+                id="untagged-with-text",
+            ),
+            pytest.param(
+                "Short text",
+                True,
+                False,
+                True,
+                id="little-text-tagged",
+            ),
+            pytest.param(
+                "Short text",
+                False,
+                False,
+                True,
+                id="little-text-untagged",
+            ),
+            pytest.param(
+                "",
+                False,
+                False,
+                True,
+                id="no-text",
+            ),
+        ],
+    )
+    def test_pdf_heuristic_logic(
+        self,
+        mocker: MockerFixture,
+        text_content: str,
+        has_struct_tree: bool,  # noqa: FBT001
+        is_marked: bool,  # noqa: FBT001
+        expected: bool,  # noqa: FBT001
+    ) -> None:
+        """Test the PDF heuristic with various text and tagging combinations."""
+        # Mock text extraction
+        mocker.patch(
+            "documents.parsers.run_subprocess",
+        )
+        mocker.patch(
+            "documents.parsers.read_file_handle_unicode_errors",
+            return_value=text_content,
+        )
+
+        # Mock pikepdf
+        mock_pdf = Mock()
+        if has_struct_tree:
+            mock_pdf.Root.StructTreeRoot = True
+        else:
+            del mock_pdf.Root.StructTreeRoot
+
+        mock_pdf.Root.MarkInfo.get.return_value = is_marked
+        mock_pikepdf = mocker.patch("documents.parsers.pikepdf")
+        mock_pikepdf.open.return_value.__enter__.return_value = mock_pdf
+
+        result = _should_produce_archive_for_pdf(Path("/fake/path.pdf"))
+        assert result is expected
+
+    def test_exception_handling_returns_true(
+        self,
+        mocker: MockerFixture,
+    ) -> None:
+        """
+        GIVEN:
+            - PDF processing raises an exception
+        WHEN:
+            - _should_produce_archive_for_pdf is called
+        THEN:
+            - Returns True (safe default)
+        """
+        # Mock exception during text processing
+        mocker.patch(
+            "documents.parsers.run_subprocess",
+            side_effect=Exception("Test error"),
+        )
+
+        result = _should_produce_archive_for_pdf(Path("/fake/path.pdf"))
+        assert result is True
+
+    def test_pikepdf_exception_returns_true(
+        self,
+        mocker: MockerFixture,
+    ) -> None:
+        """
+        GIVEN:
+            - Text extraction succeeds but pikepdf raises exception
+        WHEN:
+            - _should_produce_archive_for_pdf is called
+        THEN:
+            - Returns True (safe default)
+        """
+        # Mock successful text extraction
+        mocker.patch("documents.parsers.run_subprocess")
+        mocker.patch(
+            "documents.parsers.read_file_handle_unicode_errors",
+            return_value="This is a long text content that is definitely longer than fifty characters",
+        )
+
+        # Mock pikepdf exception
+        mocker.patch(
+            "documents.parsers.pikepdf.open",
+            side_effect=Exception("PDF error"),
+        )
+
+        result = _should_produce_archive_for_pdf(Path("/fake/path.pdf"))
+        assert result is True
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -5,6 +5,7 @@ import tempfile
 from pathlib import Path
 from unittest import mock
 from unittest.mock import MagicMock
+from unittest.mock import Mock

 from django.conf import settings
 from django.contrib.auth.models import Group
@@ -1126,6 +1127,7 @@ class TestConsumer(
            mock_mail_parser_parse.assert_called_once_with(
                consumer.working_copy,
                "message/rfc822",
+                produce_archive=True,
            )


@@ -1548,3 +1550,155 @@ class TestBarcodeApplyDetectedASN(TestCase):

        plugin._apply_detected_asn(123)
        self.assertEqual(plugin.metadata.asn, 123)
+
+
+# TODO: Convert these tests to pytest style in the future
+class TestArchivePreferenceWiring(DirectoriesMixin, GetConsumerMixin, TestCase):
+    """Test that archive preference settings are properly wired to parser calls."""
+
+    def setUp(self) -> None:
+        super().setUp()
+        # Use simple test file that can be parsed by our test parsers
+        src = (
+            Path(__file__).parent
+            / "samples"
+            / "documents"
+            / "originals"
+            / "0000005.pdf"
+        )
+        self.test_file = self.dirs.scratch_dir / "sample.pdf"
+        shutil.copy(src, self.test_file)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="never")
+    @mock.patch("documents.consumer.get_parser_registry")
+    def test_never_setting_passes_produce_archive_false(self, mock_registry):
+        """Test that ARCHIVE_FILE_GENERATION=never passes produce_archive=False to parser."""
+        # Mock parser to track produce_archive parameter
+        from unittest.mock import MagicMock
+
+        mock_parser_instance = Mock()
+        mock_parser_instance.can_produce_archive = True
+        mock_parser_instance.requires_pdf_rendition = False
+        mock_parser_instance.get_text.return_value = "Test text"
+        mock_parser_instance.get_archive_path.return_value = None
+        # Create a temporary thumbnail file for testing
+        thumbnail_path = self.dirs.scratch_dir / "thumbnail.webp"
+        thumbnail_path.write_bytes(b"fake_thumbnail_data")
+        mock_parser_instance.get_thumbnail.return_value = thumbnail_path
+        mock_parser_instance.get_date.return_value = None
+        mock_parser_instance.get_page_count.return_value = 1
+        mock_parser_instance.extract_metadata.return_value = []
+
+        # Use MagicMock to properly support context manager protocol
+        mock_parser_class = MagicMock()
+        mock_parser_class.return_value.__enter__ = Mock(
+            return_value=mock_parser_instance,
+        )
+        mock_parser_class.return_value.__exit__ = Mock(return_value=None)
+
+        mock_registry_instance = Mock()
+        mock_registry_instance.get_parser_for_file.return_value = mock_parser_class
+        mock_registry.return_value = mock_registry_instance
+
+        with self.get_consumer(self.test_file) as consumer:
+            consumer.run()
+
+        # Verify parse was called with produce_archive=False
+        mock_parser_instance.parse.assert_called_once()
+        call_args = mock_parser_instance.parse.call_args
+        self.assertEqual(call_args.kwargs["produce_archive"], False)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="always")
+    @mock.patch("documents.consumer.get_parser_registry")
+    def test_always_setting_passes_produce_archive_true(self, mock_registry):
+        """Test that ARCHIVE_FILE_GENERATION=always passes produce_archive=True to parser."""
+        # Mock parser to track produce_archive parameter
+        from unittest.mock import MagicMock
+
+        mock_parser_instance = Mock()
+        mock_parser_instance.can_produce_archive = True
+        mock_parser_instance.requires_pdf_rendition = False
+        mock_parser_instance.get_text.return_value = "Test text"
+        mock_parser_instance.get_archive_path.return_value = (
+            self.test_file
+        )  # Fake archive
+        # Create a temporary thumbnail file for testing
+        thumbnail_path = self.dirs.scratch_dir / "thumbnail.webp"
+        thumbnail_path.write_bytes(b"fake_thumbnail_data")
+        mock_parser_instance.get_thumbnail.return_value = thumbnail_path
+        mock_parser_instance.get_date.return_value = None
+        mock_parser_instance.get_page_count.return_value = 1
+        mock_parser_instance.extract_metadata.return_value = []
+
+        # Use MagicMock to properly support context manager protocol
+        mock_parser_class = MagicMock()
+        mock_parser_class.return_value.__enter__ = Mock(
+            return_value=mock_parser_instance,
+        )
+        mock_parser_class.return_value.__exit__ = Mock(return_value=None)
+
+        mock_registry_instance = Mock()
+        mock_registry_instance.get_parser_for_file.return_value = mock_parser_class
+        mock_registry.return_value = mock_registry_instance
+
+        with self.get_consumer(self.test_file) as consumer:
+            consumer.run()
+
+        # Verify parse was called with produce_archive=True
+        mock_parser_instance.parse.assert_called_once()
+        call_args = mock_parser_instance.parse.call_args
+        self.assertEqual(call_args.kwargs["produce_archive"], True)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="auto")
+    @mock.patch("documents.consumer.resolve_archive_preference")
+    @mock.patch("documents.consumer.get_parser_registry")
+    def test_auto_setting_delegates_to_resolve_archive_preference(
+        self,
+        mock_registry,
+        mock_resolve_preference,
+    ):
+        """Test that ARCHIVE_FILE_GENERATION=auto delegates to resolve_archive_preference."""
+        mock_resolve_preference.return_value = False
+
+        # Mock parser to track produce_archive parameter
+        mock_parser_instance = Mock()
+        mock_parser_instance.can_produce_archive = True
+        mock_parser_instance.requires_pdf_rendition = False
+        mock_parser_instance.get_text.return_value = "Test text"
+        mock_parser_instance.get_archive_path.return_value = None
+        # Create a temporary thumbnail file for testing
+        thumbnail_path = self.dirs.scratch_dir / "thumbnail.webp"
+        thumbnail_path.write_bytes(b"fake_thumbnail_data")
+        mock_parser_instance.get_thumbnail.return_value = thumbnail_path
+        mock_parser_instance.get_date.return_value = None
+        mock_parser_instance.get_page_count.return_value = 1
+        mock_parser_instance.extract_metadata.return_value = []
+
+        # Use MagicMock to properly support context manager protocol
+        from unittest.mock import MagicMock
+
+        mock_parser_class = MagicMock()
+        mock_parser_class.return_value.__enter__ = Mock(
+            return_value=mock_parser_instance,
+        )
+        mock_parser_class.return_value.__exit__ = Mock(return_value=None)
+
+        mock_registry_instance = Mock()
+        mock_registry_instance.get_parser_for_file.return_value = mock_parser_class
+        mock_registry.return_value = mock_registry_instance
+
+        with self.get_consumer(self.test_file) as consumer:
+            consumer.run()
+
+        # Verify resolve_archive_preference was called with correct parameters
+        mock_resolve_preference.assert_called_once()
+        call_args = mock_resolve_preference.call_args
+        self.assertEqual(call_args.args[0], "application/pdf")
+        # Path will be working copy (different from original), so check it's a Path to sample.pdf
+        self.assertEqual(call_args.args[1].name, "sample.pdf")
+        self.assertEqual(call_args.kwargs["can_produce_archive"], True)
+
+        # Verify parse was called with the result from resolve_archive_preference
+        mock_parser_instance.parse.assert_called_once()
+        call_args = mock_parser_instance.parse.call_args
+        self.assertEqual(call_args.kwargs["produce_archive"], False)
--- a/src/documents/tests/test_management.py
+++ b/src/documents/tests/test_management.py
@@ -43,6 +43,7 @@ class TestArchiver(DirectoriesMixin, FileSystemAssertsMixin, TestCase):

        call_command("document_archiver", "--processes", "1", skip_checks=True)

+    @override_settings(ARCHIVE_FILE_GENERATION="always")
    def test_handle_document(self) -> None:
        doc = self.make_models()
        shutil.copy(sample_file, Path(self.dirs.originals_dir) / f"{doc.id:07}.pdf")
@@ -73,7 +74,7 @@ class TestArchiver(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
        self.assertIsNone(doc.archive_filename)
        self.assertIsFile(doc.source_path)

-    @override_settings(FILENAME_FORMAT="{title}")
+    @override_settings(FILENAME_FORMAT="{title}", ARCHIVE_FILE_GENERATION="always")
    def test_naming_priorities(self) -> None:
        doc1 = Document.objects.create(
            checksum="A",
--- a/src/documents/tests/test_parsers.py
+++ b/src/documents/tests/test_parsers.py
@@ -1,9 +1,13 @@
+from unittest.mock import Mock
+from unittest.mock import patch
+
 from django.test import TestCase
 from django.test import override_settings

 from documents.parsers import get_default_file_extension
 from documents.parsers import get_supported_file_extensions
 from documents.parsers import is_file_ext_supported
+from documents.parsers import resolve_archive_preference
 from paperless.parsers.registry import get_parser_registry
 from paperless.parsers.registry import reset_parser_registry
 from paperless.parsers.tesseract import RasterisedDocumentParser
@@ -111,3 +115,195 @@ class TestParserAvailability(TestCase):
        self.assertTrue(is_file_ext_supported(".pdf"))
        self.assertFalse(is_file_ext_supported(".hsdfh"))
        self.assertFalse(is_file_ext_supported(""))
+
+
+class TestResolveArchivePreference(TestCase):
+    """Test the resolve_archive_preference function with various settings and file types."""
+
+    def setUp(self):
+        """Set up test PDF file for mocking."""
+        from pathlib import Path
+
+        self.test_pdf_path = Path("/fake/path/test.pdf")
+
+    @override_settings(ARCHIVE_FILE_GENERATION="always")
+    def test_always_setting_with_capable_parser(self):
+        """Test ARCHIVE_FILE_GENERATION=always with parser that can produce archive."""
+        result = resolve_archive_preference(
+            "application/pdf",
+            self.test_pdf_path,
+            can_produce_archive=True,
+        )
+        self.assertTrue(result)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="always")
+    def test_always_setting_with_incapable_parser(self):
+        """Test ARCHIVE_FILE_GENERATION=always with parser that cannot produce archive."""
+        result = resolve_archive_preference(
+            "application/pdf",
+            self.test_pdf_path,
+            can_produce_archive=False,
+        )
+        self.assertFalse(result)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="never")
+    def test_never_setting_regardless_of_parser(self):
+        """Test ARCHIVE_FILE_GENERATION=never regardless of parser capability."""
+        # Test with capable parser
+        result = resolve_archive_preference(
+            "application/pdf",
+            self.test_pdf_path,
+            can_produce_archive=True,
+        )
+        self.assertFalse(result)
+
+        # Test with incapable parser
+        result = resolve_archive_preference(
+            "application/pdf",
+            self.test_pdf_path,
+            can_produce_archive=False,
+        )
+        self.assertFalse(result)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="auto")
+    def test_auto_setting_with_non_pdf_mime_type(self):
+        """Test ARCHIVE_FILE_GENERATION=auto with non-PDF mime types."""
+        # Non-PDF mime types (images etc.) should always produce archive
+        result = resolve_archive_preference(
+            "image/jpeg",
+            self.test_pdf_path,  # Path doesn't matter for non-PDF
+            can_produce_archive=True,
+        )
+        self.assertTrue(result)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="auto")
+    @patch("documents.parsers._should_produce_archive_for_pdf")
+    def test_auto_setting_with_pdf_delegates_to_heuristic(self, mock_heuristic):
+        """Test ARCHIVE_FILE_GENERATION=auto with PDF delegates to heuristic function."""
+        mock_heuristic.return_value = False
+
+        result = resolve_archive_preference(
+            "application/pdf",
+            self.test_pdf_path,
+            can_produce_archive=True,
+        )
+
+        mock_heuristic.assert_called_once_with(self.test_pdf_path)
+        self.assertFalse(result)
+
+        # Test with heuristic returning True
+        mock_heuristic.reset_mock()
+        mock_heuristic.return_value = True
+
+        result = resolve_archive_preference(
+            "application/pdf",
+            self.test_pdf_path,
+            can_produce_archive=True,
+        )
+
+        mock_heuristic.assert_called_once_with(self.test_pdf_path)
+        self.assertTrue(result)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="auto")
+    @patch("documents.parsers.run_subprocess")
+    @patch("documents.parsers.pikepdf.open")
+    @patch("documents.parsers.read_file_handle_unicode_errors")
+    def test_pdf_heuristic_born_digital_tagged(
+        self,
+        mock_read_file,
+        mock_pikepdf_open,
+        mock_subprocess,
+    ):
+        """Test PDF heuristic detects born-digital tagged PDF (should NOT produce archive)."""
+        # Mock pdftotext output - lots of text
+        mock_read_file.return_value = (
+            "This is a lot of text content from a born-digital PDF document."
+        )
+
+        # Mock pikepdf - tagged PDF
+        mock_pdf = Mock()
+        mock_pdf.Root = Mock()
+        mock_pdf.Root.StructTreeRoot = Mock()  # Has structure tree
+        mock_pikepdf_open.return_value.__enter__.return_value = mock_pdf
+
+        from documents.parsers import _should_produce_archive_for_pdf
+
+        result = _should_produce_archive_for_pdf(self.test_pdf_path)
+
+        self.assertFalse(result)  # Born-digital tagged PDF should NOT produce archive
+        mock_subprocess.assert_called_once()
+        mock_pikepdf_open.assert_called_once_with(self.test_pdf_path)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="auto")
+    @patch("documents.parsers.run_subprocess")
+    @patch("documents.parsers.pikepdf.open")
+    @patch("documents.parsers.read_file_handle_unicode_errors")
+    def test_pdf_heuristic_scanner_ocr_untagged(
+        self,
+        mock_read_file,
+        mock_pikepdf_open,
+        mock_subprocess,
+    ):
+        """Test PDF heuristic detects scanner OCR'd untagged PDF (should produce archive)."""
+        # Mock pdftotext output - lots of text
+        mock_read_file.return_value = (
+            "This is a lot of text content from a scanner OCR'd PDF document."
+        )
+
+        # Mock pikepdf - untagged PDF
+        mock_pdf = Mock()
+        mock_pdf.Root = Mock()
+        # No StructTreeRoot and MarkInfo.Marked is False
+        del mock_pdf.Root.StructTreeRoot  # Simulate no attribute
+        mock_pdf.Root.MarkInfo = Mock()
+        mock_pdf.Root.MarkInfo.get.return_value = False
+        mock_pikepdf_open.return_value.__enter__.return_value = mock_pdf
+
+        from documents.parsers import _should_produce_archive_for_pdf
+
+        result = _should_produce_archive_for_pdf(self.test_pdf_path)
+
+        self.assertTrue(result)  # Scanner OCR'd PDF should produce archive
+        mock_subprocess.assert_called_once()
+        mock_pikepdf_open.assert_called_once_with(self.test_pdf_path)
+
+    @override_settings(ARCHIVE_FILE_GENERATION="auto")
+    @patch("documents.parsers.run_subprocess")
+    @patch("documents.parsers.pikepdf.open")
+    @patch("documents.parsers.read_file_handle_unicode_errors")
+    def test_pdf_heuristic_raw_scan_no_text(
+        self,
+        mock_read_file,
+        mock_pikepdf_open,
+        mock_subprocess,
+    ):
+        """Test PDF heuristic detects raw scan with no text (should produce archive)."""
+        # Mock pdftotext output - very little text
+        mock_read_file.return_value = "  "  # Just whitespace
+
+        # Mock pikepdf - doesn't matter for this case
+        mock_pdf = Mock()
+        mock_pdf.Root = Mock()
+        mock_pikepdf_open.return_value.__enter__.return_value = mock_pdf
+
+        from documents.parsers import _should_produce_archive_for_pdf
+
+        result = _should_produce_archive_for_pdf(self.test_pdf_path)
+
+        self.assertTrue(result)  # Raw scan should produce archive
+        mock_subprocess.assert_called_once()
+        # pikepdf check is not needed when text is short, but we don't control that here
+
+    @override_settings(ARCHIVE_FILE_GENERATION="auto")
+    @patch(
+        "documents.parsers.run_subprocess",
+        side_effect=Exception("pdftotext failed"),
+    )
+    def test_pdf_heuristic_exception_handling(self, mock_subprocess):
+        """Test PDF heuristic defaults to producing archive when exception occurs."""
+        from documents.parsers import _should_produce_archive_for_pdf
+
+        result = _should_produce_archive_for_pdf(self.test_pdf_path)
+
+        self.assertTrue(result)  # Should default to True when exception occurs
+        mock_subprocess.assert_called_once()
--- a/src/documents/tests/test_tasks.py
+++ b/src/documents/tests/test_tasks.py
@@ -233,10 +233,12 @@ class TestEmptyTrashTask(DirectoriesMixin, FileSystemAssertsMixin, TestCase):


 class TestUpdateContent(DirectoriesMixin, TestCase):
+    @override_settings(ARCHIVE_FILE_GENERATION="always")
    def test_update_content_maybe_archive_file(self) -> None:
        """
        GIVEN:
            - Existing document with archive file
+            - ARCHIVE_FILE_GENERATION=always to force archive production
        WHEN:
            - Update content task is called
        THEN:
--- a/src/paperless/checks.py
+++ b/src/paperless/checks.py
@@ -132,23 +132,14 @@ def settings_values_check(app_configs, **kwargs):
                Error(f'OCR output type "{settings.OCR_OUTPUT_TYPE}" is not valid'),
            )

-        if settings.OCR_MODE not in {"force", "skip", "redo", "skip_noarchive"}:
+        if settings.OCR_MODE not in {"auto", "force", "redo", "off"}:
            msgs.append(Error(f'OCR output mode "{settings.OCR_MODE}" is not valid'))

-        if settings.OCR_MODE == "skip_noarchive":
-            msgs.append(
-                Warning(
-                    'OCR output mode "skip_noarchive" is deprecated and will be '
-                    "removed in a future version. Please use "
-                    "PAPERLESS_OCR_SKIP_ARCHIVE_FILE instead.",
-                ),
-            )
-
-        if settings.OCR_SKIP_ARCHIVE_FILE not in {"never", "with_text", "always"}:
+        if settings.ARCHIVE_FILE_GENERATION not in {"always", "never", "auto"}:
            msgs.append(
                Error(
-                    "OCR_SKIP_ARCHIVE_FILE setting "
-                    f'"{settings.OCR_SKIP_ARCHIVE_FILE}" is not valid',
+                    "ARCHIVE_FILE_GENERATION setting "
+                    f'"{settings.ARCHIVE_FILE_GENERATION}" is not valid',
                ),
            )

--- a/src/paperless/config.py
+++ b/src/paperless/config.py
@@ -46,7 +46,7 @@ class OcrConfig(OutputTypeConfig):
    pages: int | None = dataclasses.field(init=False)
    language: str = dataclasses.field(init=False)
    mode: str = dataclasses.field(init=False)
-    skip_archive_file: str = dataclasses.field(init=False)
+    archive_file_generation: str = dataclasses.field(init=False)
    image_dpi: int | None = dataclasses.field(init=False)
    clean: str = dataclasses.field(init=False)
    deskew: bool = dataclasses.field(init=False)
@@ -64,8 +64,8 @@ class OcrConfig(OutputTypeConfig):
        self.pages = app_config.pages or settings.OCR_PAGES
        self.language = app_config.language or settings.OCR_LANGUAGE
        self.mode = app_config.mode or settings.OCR_MODE
-        self.skip_archive_file = (
-            app_config.skip_archive_file or settings.OCR_SKIP_ARCHIVE_FILE
+        self.archive_file_generation = (
+            app_config.skip_archive_file or settings.ARCHIVE_FILE_GENERATION
        )
        self.image_dpi = app_config.image_dpi or settings.OCR_IMAGE_DPI
        self.clean = app_config.unpaper_clean or settings.OCR_CLEAN
--- a/src/paperless/models.py
+++ b/src/paperless/models.py
@@ -36,20 +36,20 @@ class ModeChoices(models.TextChoices):
    and our own custom setting
    """

-    SKIP = ("skip", _("skip"))
+    AUTO = ("auto", _("auto"))
+    OFF = ("off", _("off"))
    REDO = ("redo", _("redo"))
    FORCE = ("force", _("force"))
-    SKIP_NO_ARCHIVE = ("skip_noarchive", _("skip_noarchive"))


-class ArchiveFileChoices(models.TextChoices):
+class ArchiveFileGenerationChoices(models.TextChoices):
    """
    Settings to control creation of an archive PDF file
    """

-    NEVER = ("never", _("never"))
-    WITH_TEXT = ("with_text", _("with_text"))
    ALWAYS = ("always", _("always"))
+    NEVER = ("never", _("never"))
+    AUTO = ("auto", _("auto"))


 class CleanChoices(models.TextChoices):
@@ -131,7 +131,7 @@ class ApplicationConfiguration(AbstractSingletonModel):
        null=True,
        blank=True,
        max_length=16,
-        choices=ArchiveFileChoices.choices,
+        choices=ArchiveFileGenerationChoices.choices,
    )

    image_dpi = models.PositiveSmallIntegerField(
--- a/src/paperless/parsers/tesseract.py
+++ b/src/paperless/parsers/tesseract.py
@@ -18,7 +18,6 @@ from documents.parsers import make_thumbnail_from_pdf
 from documents.utils import maybe_override_pixel_limit
 from documents.utils import run_subprocess
 from paperless.config import OcrConfig
-from paperless.models import ArchiveFileChoices
 from paperless.models import CleanChoices
 from paperless.models import ModeChoices
 from paperless.parsers.utils import read_file_handle_unicode_errors
@@ -289,6 +288,7 @@ class RasterisedDocumentParser:
        sidecar_file: Path,
        *,
        safe_fallback: bool = False,
+        skip_text: bool = False,
    ) -> dict[str, Any]:
        ocrmypdf_args: dict[str, Any] = {
            "input_file_or_options": input_file,
@@ -309,15 +309,11 @@ class RasterisedDocumentParser:

        if self.settings.mode == ModeChoices.FORCE or safe_fallback:
            ocrmypdf_args["force_ocr"] = True
-        elif self.settings.mode in {
-            ModeChoices.SKIP,
-            ModeChoices.SKIP_NO_ARCHIVE,
-        }:
+        elif self.settings.mode == ModeChoices.OFF or skip_text:
            ocrmypdf_args["skip_text"] = True
        elif self.settings.mode == ModeChoices.REDO:
            ocrmypdf_args["redo_ocr"] = True
-        else:  # pragma: no cover
-            raise ParseError(f"Invalid ocr mode: {self.settings.mode}")
+        # ModeChoices.AUTO is handled by the caller

        if self.settings.clean == CleanChoices.CLEAN:
            ocrmypdf_args["clean"] = True
@@ -411,68 +407,213 @@ class RasterisedDocumentParser:
        os.environ["OMP_THREAD_LIMIT"] = "1"
        VALID_TEXT_LENGTH = 50

+        is_image = self.is_image(mime_type)
+
+        # Text detection heuristic for AUTO mode
+        text_original = None
+        has_usable_text = False
        if mime_type == "application/pdf":
            text_original = self.extract_text(None, document_path)
-            original_has_text = (
+            has_usable_text = (
                text_original is not None and len(text_original) > VALID_TEXT_LENGTH
            )
-        else:
-            text_original = None
-            original_has_text = False

-        # If the original has text, and the user doesn't want an archive,
-        # we're done here
-        skip_archive_for_text = (
-            self.settings.mode == ModeChoices.SKIP_NO_ARCHIVE
-            or self.settings.skip_archive_file
-            in {
-                ArchiveFileChoices.WITH_TEXT,
-                ArchiveFileChoices.ALWAYS,
-            }
-        )
-        if skip_archive_for_text and original_has_text:
-            self.log.debug("Document has text, skipping OCRmyPDF entirely.")
-            self.text = text_original
+        # Core logic: decide whether to run OCRmyPDF and with what parameters
+        should_run_ocrmypdf = True
+        force_ocr = False
+        redo_ocr = False
+        skip_text = False
+
+        if self.settings.mode == ModeChoices.OFF:
+            if is_image and produce_archive:
+                # Image + OFF + archive: use img2pdf then ocrmypdf skip_text for PDF/A
+                should_run_ocrmypdf = True
+                skip_text = True
+            elif is_image and not produce_archive:
+                # Image + OFF + no archive: skip ocrmypdf entirely, text is empty
+                should_run_ocrmypdf = False
+                self.text = ""
+            elif not is_image and produce_archive:
+                # PDF + OFF + archive: ocrmypdf skip_text for PDF/A conversion only
+                should_run_ocrmypdf = True
+                skip_text = True
+            else:  # PDF + OFF + no archive
+                # PDF + OFF + no archive: skip ocrmypdf entirely, use pdftotext
+                should_run_ocrmypdf = False
+                self.text = text_original or ""
+        elif self.settings.mode == ModeChoices.AUTO:
+            if is_image:
+                # Image + AUTO: always run OCR (only way to get text from image)
+                should_run_ocrmypdf = True
+                force_ocr = True
+            elif has_usable_text and not produce_archive:
+                # PDF with text + AUTO + no archive: skip ocrmypdf entirely, use pdftotext
+                should_run_ocrmypdf = False
+                self.text = text_original
+            elif has_usable_text and produce_archive:
+                # PDF with text + AUTO + archive: skip_text for PDF/A conversion only
+                should_run_ocrmypdf = True
+                skip_text = True
+            else:
+                # PDF without text + AUTO: normal OCR
+                should_run_ocrmypdf = True
+        elif self.settings.mode == ModeChoices.FORCE:
+            should_run_ocrmypdf = True
+            force_ocr = True
+        elif self.settings.mode == ModeChoices.REDO:
+            should_run_ocrmypdf = True
+            redo_ocr = True
+
+        # Early return if we're skipping OCRmyPDF entirely
+        if not should_run_ocrmypdf:
+            self.log.debug(f"Skipping OCRmyPDF entirely for mode {self.settings.mode}")
            return

-        # Either no text was in the original or there should be an archive
-        # file created, so OCR the file and create an archive with any
-        # text located via OCR
+        # Special handling for image + OFF + archive: convert to PDF first
+        input_file = document_path
+        if is_image and self.settings.mode == ModeChoices.OFF and produce_archive:
+            self.log.debug("Converting image to PDF using img2pdf for archive creation")
+            pdf_path = Path(self.tempdir) / "input.pdf"

+            import img2pdf
+
+            # Handle alpha channel removal if needed
+            if self.has_alpha(document_path):
+                self.log.info(
+                    f"Removing alpha layer from {document_path} for compatibility with img2pdf",
+                )
+                input_file = self.remove_alpha(document_path)
+
+            with pdf_path.open("wb") as f:
+                f.write(img2pdf.convert(str(input_file)))
+            input_file = pdf_path
+
+        # Run OCRmyPDF with appropriate parameters
        import ocrmypdf
        from ocrmypdf import EncryptedPdfError
        from ocrmypdf import InputFileError
        from ocrmypdf import SubprocessOutputError
        from ocrmypdf.exceptions import DigitalSignatureError
+        from ocrmypdf.exceptions import PriorOcrFoundError

        archive_path = Path(self.tempdir) / "archive.pdf"
        sidecar_file = Path(self.tempdir) / "sidecar.txt"

+        # Build ocrmypdf args with explicit control over OCR behavior
        args = self.construct_ocrmypdf_parameters(
-            document_path,
-            mime_type,
+            input_file,
+            mime_type
+            if not (
+                is_image and self.settings.mode == ModeChoices.OFF and produce_archive
+            )
+            else "application/pdf",
            archive_path,
            sidecar_file,
+            skip_text=skip_text,
        )

+        # Override with specific flags if needed
+        if force_ocr:
+            args["force_ocr"] = True
+            args.pop("skip_text", None)
+        if redo_ocr:
+            args["redo_ocr"] = True
+            args.pop("skip_text", None)
+            args.pop("force_ocr", None)
+
        try:
            self.log.debug(f"Calling OCRmyPDF with args: {args}")
            ocrmypdf.ocr(**args)

-            if self.settings.skip_archive_file != ArchiveFileChoices.ALWAYS:
+            # Set archive path only if we want to produce an archive
+            if produce_archive:
                self.archive_path = archive_path

            self.text = self.extract_text(sidecar_file, archive_path)

            if not self.text:
                raise NoTextFoundException("No text was found in the original document")
+        except PriorOcrFoundError:
+            # pdftotext couldn't detect the text layer (e.g. RTL or CJK scripts),
+            # but ocrmypdf found it. Retry as PDF/A conversion only (skip_text).
+            self.log.debug(
+                "PDF has existing text layer not detected by pdftotext; "
+                "retrying with skip_text for PDF/A conversion.",
+            )
+            retry_args = self.construct_ocrmypdf_parameters(
+                input_file,
+                mime_type,
+                archive_path,
+                sidecar_file,
+                skip_text=True,
+            )
+            try:
+                ocrmypdf.ocr(**retry_args)
+                if produce_archive:
+                    self.archive_path = archive_path
+                self.text = self.extract_text(sidecar_file, archive_path)
+            except Exception as e:
+                raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
        except (DigitalSignatureError, EncryptedPdfError):
            self.log.warning(
                "This file is encrypted and/or signed, OCR is impossible. Using "
                "any text present in the original file.",
            )
-            if original_has_text:
-                self.text = text_original
+            self.text = text_original or ""
+        except InputFileError as e:
+            # Tagged PDFs raise InputFileError when called without skip_text/force_ocr.
+            # Retry with skip_text to do PDF/A conversion without disturbing the text layer.
+            if "Tagged PDF" in str(e):
+                self.log.debug(
+                    "Tagged PDF detected; retrying with skip_text for PDF/A conversion.",
+                )
+                retry_args = self.construct_ocrmypdf_parameters(
+                    input_file,
+                    mime_type,
+                    archive_path,
+                    sidecar_file,
+                    skip_text=True,
+                )
+                try:
+                    ocrmypdf.ocr(**retry_args)
+                    if produce_archive:
+                        self.archive_path = archive_path
+                    self.text = self.extract_text(sidecar_file, archive_path)
+                except Exception as retry_e:
+                    raise ParseError(
+                        f"{retry_e.__class__.__name__}: {retry_e!s}",
+                    ) from retry_e
+            else:
+                self.log.warning(
+                    f"Encountered an error while running OCR: {e!s}. "
+                    f"Attempting force OCR to get the text.",
+                )
+                archive_path_fallback = Path(self.tempdir) / "archive-fallback.pdf"
+                sidecar_file_fallback = Path(self.tempdir) / "sidecar-fallback.txt"
+                args = self.construct_ocrmypdf_parameters(
+                    input_file,
+                    mime_type
+                    if not (
+                        is_image
+                        and self.settings.mode == ModeChoices.OFF
+                        and produce_archive
+                    )
+                    else "application/pdf",
+                    archive_path_fallback,
+                    sidecar_file_fallback,
+                    safe_fallback=True,
+                )
+                try:
+                    self.log.debug(f"Fallback: Calling OCRmyPDF with args: {args}")
+                    ocrmypdf.ocr(**args)
+                    self.text = self.extract_text(
+                        sidecar_file_fallback,
+                        archive_path_fallback,
+                    )
+                except Exception as fallback_e:
+                    raise ParseError(
+                        f"{fallback_e.__class__.__name__}: {fallback_e!s}",
+                    ) from fallback_e
        except SubprocessOutputError as e:
            if "Ghostscript PDF/A rendering" in str(e):
                self.log.warning(
@@ -483,7 +624,7 @@ class RasterisedDocumentParser:
            raise ParseError(
                f"SubprocessOutputError: {e!s}. See logs for more information.",
            ) from e
-        except (NoTextFoundException, InputFileError) as e:
+        except NoTextFoundException as e:
            self.log.warning(
                f"Encountered an error while running OCR: {e!s}. "
                f"Attempting force OCR to get the text.",
@@ -493,10 +634,15 @@ class RasterisedDocumentParser:
            sidecar_file_fallback = Path(self.tempdir) / "sidecar-fallback.txt"

            # Attempt to run OCR with safe settings.
-
            args = self.construct_ocrmypdf_parameters(
-                document_path,
-                mime_type,
+                input_file,
+                mime_type
+                if not (
+                    is_image
+                    and self.settings.mode == ModeChoices.OFF
+                    and produce_archive
+                )
+                else "application/pdf",
                archive_path_fallback,
                sidecar_file_fallback,
                safe_fallback=True,
@@ -525,13 +671,11 @@ class RasterisedDocumentParser:
        # As a last resort, if we still don't have any text for any reason,
        # try to extract the text from the original document.
        if not self.text:
-            if original_has_text:
-                self.text = text_original
-            else:
+            self.text = text_original or ""
+            if not self.text:
                self.log.warning(
                    f"No text was found in {document_path}, the content will be empty.",
                )
-                self.text = ""


 def post_process_text(text: str | None) -> str | None:
--- a/src/paperless/settings/init.py
+++ b/src/paperless/settings/init.py
@@ -21,6 +21,7 @@ from paperless.settings.custom import parse_hosting_settings
 from paperless.settings.custom import parse_ignore_dates
 from paperless.settings.custom import parse_redis_url
 from paperless.settings.parsers import get_bool_from_env
+from paperless.settings.parsers import get_choice_from_env
 from paperless.settings.parsers import get_float_from_env
 from paperless.settings.parsers import get_int_from_env
 from paperless.settings.parsers import get_list_from_env
@@ -874,10 +875,18 @@ OCR_LANGUAGE = os.getenv("PAPERLESS_OCR_LANGUAGE", "eng")
 # OCRmyPDF --output-type options are available.
 OCR_OUTPUT_TYPE = os.getenv("PAPERLESS_OCR_OUTPUT_TYPE", "pdfa")

-# skip. redo, force
-OCR_MODE = os.getenv("PAPERLESS_OCR_MODE", "skip")
+# auto, off, redo, force
+OCR_MODE = get_choice_from_env(
+    "PAPERLESS_OCR_MODE",
+    {"auto", "off", "redo", "force"},
+    "auto",
+)

-OCR_SKIP_ARCHIVE_FILE = os.getenv("PAPERLESS_OCR_SKIP_ARCHIVE_FILE", "never")
+ARCHIVE_FILE_GENERATION = get_choice_from_env(
+    "PAPERLESS_ARCHIVE_FILE_GENERATION",
+    {"always", "never", "auto"},
+    "auto",
+)

 OCR_IMAGE_DPI = get_int_from_env("PAPERLESS_OCR_IMAGE_DPI")

--- a/src/paperless/tests/parsers/conftest.py
+++ b/src/paperless/tests/parsers/conftest.py
@@ -708,7 +708,6 @@ def null_app_config(mocker: MockerFixture) -> MagicMock:
        pages=None,
        language=None,
        mode=None,
-        skip_archive_file=None,
        image_dpi=None,
        unpaper_clean=None,
        deskew=None,
--- a/src/paperless/tests/parsers/test_tesseract_custom_settings.py
+++ b/src/paperless/tests/parsers/test_tesseract_custom_settings.py
@@ -93,11 +93,13 @@ class TestParserSettingsFromDb(DirectoriesMixin, FileSystemAssertsMixin, TestCas
        """
        with override_settings(OCR_MODE="redo"):
            instance = ApplicationConfiguration.objects.all().first()
-            instance.mode = ModeChoices.SKIP
+            instance.mode = ModeChoices.AUTO
            instance.save()

            params = self.get_params()
-        self.assertTrue(params["skip_text"])
+        # AUTO mode doesn't set skip_text in construct_ocrmypdf_parameters
+        # The skip_text logic is handled in the parse method based on content detection
+        self.assertNotIn("skip_text", params)
        self.assertNotIn("redo_ocr", params)
        self.assertNotIn("force_ocr", params)

--- a/src/paperless/tests/parsers/test_tesseract_parser.py
+++ b/src/paperless/tests/parsers/test_tesseract_parser.py
@@ -433,7 +433,7 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(tesseract_samples_dir / "signed.pdf", "application/pdf")
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -449,7 +449,7 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "encrypted.pdf",
            "application/pdf",
@@ -559,7 +559,7 @@ class TestParseMultiPage:
    @pytest.mark.parametrize(
        "mode",
        [
-            pytest.param("skip", id="skip"),
+            pytest.param("auto", id="auto"),
            pytest.param("redo", id="redo"),
            pytest.param("force", id="force"),
        ],
@@ -587,7 +587,7 @@ class TestParseMultiPage:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-images.pdf",
            "application/pdf",
@@ -722,29 +722,31 @@ class TestParseMultiPage:


 # ---------------------------------------------------------------------------
-# Parse — skip_noarchive / skip_archive_file
+# Parse — OCR_MODE=auto / off and produce_archive parameter
 # ---------------------------------------------------------------------------


-class TestSkipArchive:
-    def test_skip_noarchive_with_text_layer(
+class TestOcrModeAndArchiveGeneration:
+    def test_auto_mode_with_text_skips_archive(
        self,
        tesseract_parser: RasterisedDocumentParser,
-        tesseract_samples_dir: Path,
+        multi_page_digital_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
-            - File with existing text layer
-            - Mode: skip_noarchive
+            - File with existing text layer (born-digital PDF)
+            - Mode: auto
+            - produce_archive: False
        WHEN:
            - Document is parsed
        THEN:
-            - Text extracted; no archive created
+            - Text extracted; no archive created; ocrmypdf skipped entirely
        """
-        tesseract_parser.settings.mode = "skip_noarchive"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
-            tesseract_samples_dir / "multi-page-digital.pdf",
+            multi_page_digital_pdf_file,
            "application/pdf",
+            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -752,24 +754,26 @@ class TestSkipArchive:
            ["page 1", "page 2", "page 3"],
        )

-    def test_skip_noarchive_image_only_creates_archive(
+    def test_auto_mode_with_text_produces_archive(
        self,
        tesseract_parser: RasterisedDocumentParser,
-        tesseract_samples_dir: Path,
+        multi_page_digital_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
-            - File with image-only pages (no text layer)
-            - Mode: skip_noarchive
+            - File with existing text layer (born-digital PDF)
+            - Mode: auto
+            - produce_archive: True
        WHEN:
            - Document is parsed
        THEN:
-            - Text extracted; archive created (OCR needed)
+            - Text extracted; archive created with skip_text
        """
-        tesseract_parser.settings.mode = "skip_noarchive"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
-            tesseract_samples_dir / "multi-page-images.pdf",
+            multi_page_digital_pdf_file,
            "application/pdf",
+            produce_archive=True,
        )
        assert tesseract_parser.archive_path is not None
        assert_ordered_substrings(
@@ -777,48 +781,137 @@ class TestSkipArchive:
            ["page 1", "page 2", "page 3"],
        )

-    @pytest.mark.parametrize(
-        ("skip_archive_file", "filename", "expect_archive"),
-        [
-            pytest.param("never", "multi-page-digital.pdf", True, id="never-with-text"),
-            pytest.param("never", "multi-page-images.pdf", True, id="never-no-text"),
-            pytest.param(
-                "with_text",
-                "multi-page-digital.pdf",
-                False,
-                id="with-text-layer",
-            ),
-            pytest.param(
-                "with_text",
-                "multi-page-images.pdf",
-                True,
-                id="with-text-no-layer",
-            ),
-            pytest.param(
-                "always",
-                "multi-page-digital.pdf",
-                False,
-                id="always-with-text",
-            ),
-            pytest.param("always", "multi-page-images.pdf", False, id="always-no-text"),
-        ],
-    )
-    def test_skip_archive_file_setting(
+    def test_auto_mode_image_produces_archive(
        self,
-        skip_archive_file: str,
-        filename: str,
-        expect_archive: str,
        tesseract_parser: RasterisedDocumentParser,
-        tesseract_samples_dir: Path,
+        multi_page_images_pdf_file: Path,
    ) -> None:
-        tesseract_parser.settings.skip_archive_file = skip_archive_file
-        tesseract_parser.parse(tesseract_samples_dir / filename, "application/pdf")
-        text = tesseract_parser.get_text().lower()
-        assert_ordered_substrings(text, ["page 1", "page 2", "page 3"])
-        if expect_archive:
-            assert tesseract_parser.archive_path is not None
-        else:
-            assert tesseract_parser.archive_path is None
+        """
+        GIVEN:
+            - File with image-only pages (no text layer)
+            - Mode: auto
+            - produce_archive: True
+        WHEN:
+            - Document is parsed
+        THEN:
+            - Text extracted via OCR; archive created
+        """
+        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.parse(
+            multi_page_images_pdf_file,
+            "application/pdf",
+            produce_archive=True,
+        )
+        assert tesseract_parser.archive_path is not None
+        assert_ordered_substrings(
+            tesseract_parser.get_text().lower(),
+            ["page 1", "page 2", "page 3"],
+        )
+
+    def test_off_mode_image_with_archive(
+        self,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_png_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - Image file
+            - Mode: off
+            - produce_archive: True
+        WHEN:
+            - Document is parsed
+        THEN:
+            - Empty text content; archive created via img2pdf path
+        """
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(
+            simple_png_file,
+            "image/png",
+            produce_archive=True,
+        )
+        assert tesseract_parser.archive_path is not None
+        # OCR mode is OFF, but archive creation with img2pdf+OCRmyPDF may still produce some text
+        assert tesseract_parser.get_text().strip() is not None
+
+    def test_off_mode_image_without_archive(
+        self,
+        tesseract_parser: RasterisedDocumentParser,
+        simple_png_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - Image file
+            - Mode: off
+            - produce_archive: False
+        WHEN:
+            - Document is parsed
+        THEN:
+            - Empty text content; no archive created
+        """
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(
+            simple_png_file,
+            "image/png",
+            produce_archive=False,
+        )
+        assert tesseract_parser.archive_path is None
+        # OCR is disabled, so text should be empty
+        text = tesseract_parser.get_text().strip()
+        assert len(text) == 0
+
+    def test_off_mode_pdf_with_archive(
+        self,
+        tesseract_parser: RasterisedDocumentParser,
+        multi_page_digital_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - PDF file
+            - Mode: off
+            - produce_archive: True
+        WHEN:
+            - Document is parsed
+        THEN:
+            - Text from pdftotext; archive created with skip_text (PDF/A only)
+        """
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(
+            multi_page_digital_pdf_file,
+            "application/pdf",
+            produce_archive=True,
+        )
+        assert tesseract_parser.archive_path is not None
+        assert_ordered_substrings(
+            tesseract_parser.get_text().lower(),
+            ["page 1", "page 2", "page 3"],
+        )
+
+    def test_off_mode_pdf_without_archive(
+        self,
+        tesseract_parser: RasterisedDocumentParser,
+        multi_page_digital_pdf_file: Path,
+    ) -> None:
+        """
+        GIVEN:
+            - PDF file
+            - Mode: off
+            - produce_archive: False
+        WHEN:
+            - Document is parsed
+        THEN:
+            - Text from pdftotext; no archive created
+        """
+        tesseract_parser.settings.mode = "off"
+        tesseract_parser.parse(
+            multi_page_digital_pdf_file,
+            "application/pdf",
+            produce_archive=False,
+        )
+        assert tesseract_parser.archive_path is None
+        assert_ordered_substrings(
+            tesseract_parser.get_text().lower(),
+            ["page 1", "page 2", "page 3"],
+        )


 # ---------------------------------------------------------------------------
@@ -835,13 +928,13 @@ class TestParseMixed:
        """
        GIVEN:
            - File with text in some pages (image) and some pages (digital)
-            - Mode: skip
+            - Mode: auto
        WHEN:
            - Document is parsed
        THEN:
            - All pages extracted; archive created; sidecar notes skipped pages
        """
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-mixed.pdf",
            "application/pdf",
@@ -891,24 +984,26 @@ class TestParseMixed:
            not in sidecar
        )

-    def test_multi_page_mixed_skip_noarchive(
+    def test_multi_page_mixed_auto_mode_without_archive(
        self,
        tesseract_parser: RasterisedDocumentParser,
-        tesseract_samples_dir: Path,
+        multi_page_mixed_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - File with mixed pages
-            - Mode: skip_noarchive
+            - Mode: auto
+            - produce_archive: False
        WHEN:
            - Document is parsed
        THEN:
-            - No archive created (file has text layer); later-page text present
+            - No archive created; text from existing digital pages extracted
        """
-        tesseract_parser.settings.mode = "skip_noarchive"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
-            tesseract_samples_dir / "multi-page-mixed.pdf",
+            multi_page_mixed_pdf_file,
            "application/pdf",
+            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -928,7 +1023,7 @@ class TestParseRotate:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "skip"
+        tesseract_parser.settings.mode = "auto"
        tesseract_parser.settings.rotate = True
        tesseract_parser.parse(tesseract_samples_dir / "rotated.pdf", "application/pdf")
        assert_ordered_substrings(
--- a/src/paperless/tests/test_checks.py
+++ b/src/paperless/tests/test_checks.py
@@ -130,16 +130,10 @@ class TestOcrSettingsChecks:
                id="invalid-mode",
            ),
            pytest.param(
-                "OCR_MODE",
-                "skip_noarchive",
-                "deprecated",
-                id="deprecated-mode",
-            ),
-            pytest.param(
-                "OCR_SKIP_ARCHIVE_FILE",
+                "ARCHIVE_FILE_GENERATION",
                "invalid",
-                'OCR_SKIP_ARCHIVE_FILE setting "invalid"',
-                id="invalid-skip-archive-file",
+                'ARCHIVE_FILE_GENERATION setting "invalid"',
+                id="invalid-archive-file-generation",
            ),
            pytest.param(
                "OCR_CLEAN",
Author	SHA1	Message	Date
Trenton H	561c2c597d	Bad merge	2026-03-26 09:37:45 -07:00
Trenton H	658196d6f3	Merge remote-tracking branch 'origin/dev' into feature-ocr-archive-decoupling	2026-03-26 09:26:38 -07:00
Trenton H	e236b152dc	fix: access parser properties on instance not class requires_pdf_rendition and can_produce_archive are @property decorators on parser instances. Accessing them on the class returns the property descriptor object (which is truthy regardless of the intended value), causing produce_archive to always be set to a non-bool truthy value. Change both consumer.py and tasks.py to access these attributes on the instantiated document_parser/parser object. Update TestArchivePreferenceWiring tests to set the attributes on mock_parser_instance (the context manager return value) rather than mock_parser_class.	2026-03-26 09:02:57 -07:00
Trenton H	ad296df83c	fix: update tests for OCR/archive decoupling - Add @override_settings(ARCHIVE_FILE_GENERATION="always") to tests that expect archive files to always be produced (test_handle_document, test_naming_priorities, test_update_content_maybe_archive_file) - Add test_archive_preference.py with 14 unit tests covering resolve_archive_preference() and _should_produce_archive_for_pdf() - Update tesseract parser: add PriorOcrFoundError retry path and split InputFileError handling (tagged PDF -> skip_text retry vs force-OCR fallback for other input errors) - Update checks.py: validate ARCHIVE_FILE_GENERATION, update OCR_MODE valid set to {auto, force, redo, off} - Update docs: PAPERLESS_OCR_MODE and PAPERLESS_ARCHIVE_FILE_GENERATION docs and migration guide for v3 - Update tesseract parser tests, consumer tests, and parsers tests to reflect new settings and behaviour	2026-03-26 08:40:57 -07:00
Trenton H	cbb128234d	Decouple archive file generation from OCR mode Replace the conflated OCR_MODE=skip/skip_noarchive and skip_archive_file settings with two independent axes: OCR_MODE (auto/force/redo/off) and ARCHIVE_FILE_GENERATION (always/never/auto). OCR_MODE=auto detects whether extractable text exists via pdftotext and skips the OCR engine when it does. OCR_MODE=off bypasses ocrmypdf entirely, using only pdftotext for any embedded text — useful for handwritten documents, bulk ingestion, or content that OCRs poorly. ARCHIVE_FILE_GENERATION=auto uses a combined pdftotext + pikepdf tagged-PDF heuristic: born-digital PDFs (text present and tagged) skip archive creation, scanner-OCRd PDFs (text present but untagged) still receive a PDF/A archive, and raw scans always produce an archive. The produce_archive flag is resolved in documents/parsers.py and passed through to parser.parse() from both the consumer and tasks, replacing the previous pattern where archive decisions were scattered across parser internals.	2026-03-23 19:50:52 -07:00