Merge branch 'dev' into chore/plugin-documentation

Moves the date parsing plugin section under the extending section
Adds a section about how the 2 install types can add external plugins
2026-03-29 20:32:44 +00:00 · 2026-03-27 10:18:22 -07:00 · 2026-03-23 12:56:43 -07:00 · 2026-03-23 12:56:38 -07:00 · 2026-03-23 12:56:34 -07:00
31 changed files with 665 additions and 1640 deletions
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -21,6 +21,7 @@ body:
        - [The installation instructions](https://docs.paperless-ngx.com/setup/#installation).
        - [Existing issues and discussions](https://github.com/paperless-ngx/paperless-ngx/search?q=&type=issues).
        - Disable any custom container initialization scripts, if using
        - Remove any third-party parser plugins — issues caused by or requiring changes to a third-party plugin will be closed without investigation.
        If you encounter issues while installing or configuring Paperless-ngx, please post in the ["Support" section of the discussions](https://github.com/paperless-ngx/paperless-ngx/discussions/new?category=support).
  - type: textarea
@@ -120,5 +121,7 @@ body:
          required: true
        - label: I have already searched for relevant existing issues and discussions before opening this report.
          required: true
        - label: I have reproduced this issue with all third-party parser plugins removed. I understand that issues caused by third-party plugins will be closed without investigation.
          required: true
        - label: I have updated the title field above with a concise description.
          required: true
--- a/.gitignore
+++ b/.gitignore
@@ -111,4 +111,3 @@ celerybeat-schedule*
 # ignore pnpm package store folder created when setting up the devcontainer
 .pnpm-store/
 .worktrees
--- a/docs/advanced_usage.md
+++ b/docs/advanced_usage.md
@@ -723,6 +723,81 @@ services:
 1. Note the `:ro` tag means the folder will be mounted as read only. This is for extra security against changes
 ## Installing third-party parser plugins {#parser-plugins}
 Third-party parser plugins extend Paperless-ngx to support additional file
 formats. A plugin is a Python package that advertises itself under the
 `paperless_ngx.parsers` entry point group. Refer to the
 [developer documentation](development.md#making-custom-parsers) for how to
 create one.
 !!! warning "Third-party plugins are not officially supported"
    The Paperless-ngx maintainers do not provide support for third-party
    plugins. Issues caused by or requiring changes to a third-party plugin
    will be closed without further investigation. Always reproduce problems
    with all plugins removed before filing a bug report.
 ### Docker
 Use a [custom container initialization script](#custom-container-initialization)
 to install the package before the webserver starts. Create a shell script and
 mount it into `/custom-cont-init.d`:
 ```bash
 #!/bin/bash
 # /path/to/my/scripts/install-parsers.sh
 pip install my-paperless-parser-package
 ```
 Mount it in your `docker-compose.yml`:
 ```yaml
 services:
  webserver:
    # ...
    volumes:
      - /path/to/my/scripts:/custom-cont-init.d:ro
 ```
 The script runs as `root` before the webserver starts, so the package will be
 available when Paperless-ngx discovers plugins at startup.
 ### Bare metal
 Install the package into the same Python environment that runs Paperless-ngx.
 If you followed the standard bare-metal install guide, that is the `paperless`
 user's environment:
 ```bash
 sudo -Hu paperless pip3 install my-paperless-parser-package
 ```
 If you are using `uv` or a virtual environment, activate it first and then run:
 ```bash
 uv pip install my-paperless-parser-package
 # or
 pip install my-paperless-parser-package
 ```
 Restart all Paperless-ngx services after installation so the new plugin is
 discovered.
 ### Verifying installation
 On the next startup, check the application logs for a line confirming
 discovery:
 ```
 Loaded third-party parser 'My Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
 ```
 If this line does not appear, verify that the package is installed in the
 correct environment and that its `pyproject.toml` declares the
 `paperless_ngx.parsers` entry point.
 ## MySQL Caveats {#mysql-caveats}
 ### Case Sensitivity
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -801,14 +801,11 @@ parsing documents.
 #### [`PAPERLESS_OCR_MODE=<mode>`](#PAPERLESS_OCR_MODE) {#PAPERLESS_OCR_MODE}
-: Tell paperless when and how to perform ocr on your documents. Four
+: Tell paperless when and how to perform ocr on your documents. Three
 modes are available:
-    -   `auto` (default): Paperless detects whether a document already
+    -   `skip`: Paperless skips all pages and will perform ocr only on
-        has embedded text via pdftotext. If sufficient text is found,
+        pages where no text is present. This is the safest option.
        OCR is skipped for that document (`--skip-text`). If no text is
        present, OCR runs normally. This is the safest option for mixed
        document collections.
    -   `redo`: Paperless will OCR all pages of your documents and
        attempt to replace any existing text layers with new text. This
@@ -826,59 +823,24 @@ modes are available:
        significantly larger and text won't appear as sharp when zoomed
        in.
-    -   `off`: Paperless never invokes the OCR engine. For PDFs, text
+    The default is `skip`, which only performs OCR when necessary and
-        is extracted via pdftotext only. For image documents, text will
+    always creates archived documents.
        be empty. Archive file generation still works via format
        conversion (no Tesseract or Ghostscript required).
-    The default is `auto`.
+    Read more about this in the [OCRmyPDF
    For the `skip`, `redo`, and `force` modes, read more about OCR
    behaviour in the [OCRmyPDF
    documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).
-#### [`PAPERLESS_ARCHIVE_FILE_GENERATION=<mode>`](#PAPERLESS_ARCHIVE_FILE_GENERATION) {#PAPERLESS_ARCHIVE_FILE_GENERATION}
+#### [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE=<mode>`](#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) {#PAPERLESS_OCR_SKIP_ARCHIVE_FILE}
-: Controls when paperless creates a PDF/A archive version of your
+: Specify when you would like paperless to skip creating an archived
-documents. Archive files are stored alongside the original and are used
+version of your documents. This is useful if you don't want to have two
-for display in the web interface.
+almost-identical versions of your documents in the media folder.
-    -   `auto` (default): Produce archives for scanned or image-based
+    -   `never`: Never skip creating an archived version.
-        documents. Skip archive generation for born-digital PDFs that
+    -   `with_text`: Skip creating an archived version for documents
-        already contain embedded text. This is the recommended setting
+    that already have embedded text.
-        for mixed document collections.
+    -   `always`: Always skip creating an archived version.
    -   `always`: Always produce a PDF/A archive when the parser
        supports it, regardless of whether the document already has
        text.
    -   `never`: Never produce an archive. Only the original file is
        stored. Saves disk space but the web viewer will display the
        original file directly.
-    **Behaviour by file type and mode** (`auto` column shows the default):
+    The default is `never`.
    | Document type              | `never` | `auto` (default)           | `always` |
    | -------------------------- | ------- | -------------------------- | -------- |
    | Scanned image (TIFF, JPEG) | No      | **Yes**                    | Yes      |
    | Image-based PDF            | No      | **Yes** (short/no text, untagged) | Yes |
    | Born-digital PDF           | No      | No (tagged or has embedded text)  | Yes |
    | Plain text, email, HTML    | No      | No                         | No       |
    | DOCX / ODT (via Tika)      | Yes\*   | Yes\*                      | Yes\*    |
    \* Tika always produces a PDF rendition for display; this counts as
    the archive regardless of the setting.
    !!! note
        This setting applies to the built-in Tesseract parser. Parsers
        that must always convert documents to PDF for display (e.g. DOCX,
        ODT via Tika) will produce a PDF regardless of this setting.
    !!! note
        The **remote OCR parser** (Azure AI) always produces a searchable
        PDF and stores it as the archive copy, regardless of this setting.
        `ARCHIVE_FILE_GENERATION=never` has no effect when the remote
        parser handles a document.
 #### [`PAPERLESS_OCR_CLEAN=<mode>`](#PAPERLESS_OCR_CLEAN) {#PAPERLESS_OCR_CLEAN}
--- a/docs/development.md
+++ b/docs/development.md
@@ -370,121 +370,363 @@ docker build --file Dockerfile --tag paperless:local .
 ## Extending Paperless-ngx
-Paperless-ngx does not have any fancy plugin systems and will probably never
+Paperless-ngx supports third-party document parsers via a Python entry point
-have. However, some parts of the application have been designed to allow
+plugin system. Plugins are distributed as ordinary Python packages and
-easy integration of additional features without any modification to the
+discovered automatically at startup — no changes to the Paperless-ngx source
-base code.
+are required.
 !!! warning "Third-party plugins are not officially supported"
    The Paperless-ngx maintainers do not provide support for third-party
    plugins. Issues that are caused by or require changes to a third-party
    plugin will be closed without further investigation. If you believe you
    have found a bug in Paperless-ngx itself (not in a plugin), please
    reproduce it with all third-party plugins removed before filing an issue.
 ### Making custom parsers
-Paperless-ngx uses parsers to add documents. A parser is
+Paperless-ngx uses parsers to add documents. A parser is responsible for:
 responsible for:
- Retrieving the content from the original
+- Extracting plain-text content from the document
- Creating a thumbnail
+- Generating a thumbnail image
- _optional:_ Retrieving a created date from the original
+- _optional:_ Detecting the document's creation date
- _optional:_ Creating an archived document from the original
+- _optional:_ Producing a searchable PDF archive copy
-Custom parsers can be added to Paperless-ngx to support more file types. In
+Custom parsers are distributed as ordinary Python packages and registered
-order to do that, you need to write the parser itself and announce its
+via a [setuptools entry point](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
-existence to Paperless-ngx.
+No changes to the Paperless-ngx source are required.
-The parser itself must extend `documents.parsers.DocumentParser` and
+#### 1. Implementing the parser class
-must implement the methods `parse` and `get_thumbnail`. You can provide
+
-your own implementation to `get_date` if you don't want to rely on
+Your parser must satisfy the `ParserProtocol` structural interface defined in
-Paperless-ngx' default date guessing mechanisms.
+`paperless.parsers`. The simplest approach is to write a plain class — no base
 class is required, only the right attributes and methods.
 **Class-level identity attributes**
 The registry reads these before instantiating the parser, so they must be
 plain class attributes (not instance attributes or properties):
 ```python
-class MyCustomParser(DocumentParser):
+class MyCustomParser:
-
+    name    = "My Format Parser"   # human-readable name shown in logs
-    def parse(self, document_path, mime_type):
+    version = "1.0.0"              # semantic version string
-        # This method does not return anything. Rather, you should assign
+    author  = "Acme Corp"          # author / organisation
-        # whatever you got from the document to the following fields:
+    url     = "https://example.com/my-parser"  # docs or issue tracker
        # The content of the document.
        self.text = "content"
        # Optional: path to a PDF document that you created from the original.
        self.archive_path = os.path.join(self.tempdir, "archived.pdf")
        # Optional: "created" date of the document.
        self.date = get_created_from_metadata(document_path)
    def get_thumbnail(self, document_path, mime_type):
        # This should return the path to a thumbnail you created for this
        # document.
        return os.path.join(self.tempdir, "thumb.webp")
 ```
-If you encounter any issues during parsing, raise a
+**Declaring supported MIME types**
 `documents.parsers.ParseError`.
-The `self.tempdir` directory is a temporary directory that is guaranteed
+Return a `dict` mapping MIME type strings to preferred file extensions
-to be empty and removed after consumption finished. You can use that
+(including the leading dot). Paperless-ngx uses the extension when storing
-directory to store any intermediate files and also use it to store the
+archive copies and serving files for download.
 thumbnail / archived document.
 After that, you need to announce your parser to Paperless-ngx. You need to
 connect a handler to the `document_consumer_declaration` signal. Have a
 look in the file `src/paperless_tesseract/apps.py` on how that's done.
 The handler is a method that returns information about your parser:
 ```python
-def myparser_consumer_declaration(sender, **kwargs):
+@classmethod
 def supported_mime_types(cls) -> dict[str, str]:
    return {
-        "parser": MyCustomParser,
+        "application/x-my-format": ".myf",
-        "weight": 0,
+        "application/x-my-format-alt": ".myf",
        "mime_types": {
            "application/pdf": ".pdf",
            "image/jpeg": ".jpg",
        }
    }
 ```
- `parser` is a reference to a class that extends `DocumentParser`.
+**Scoring**
 - `weight` is used whenever two or more parsers are able to parse a
  file: The parser with the higher weight wins. This can be used to
  override the parsers provided by Paperless-ngx.
 - `mime_types` is a dictionary. The keys are the mime types your
  parser supports and the value is the default file extension that
  Paperless-ngx should use when storing files and serving them for
  download. We could guess that from the file extensions, but some
  mime types have many extensions associated with them and the Python
  methods responsible for guessing the extension do not always return
  the same value.
-## Using Visual Studio Code devcontainer
+When more than one parser can handle a file, the registry calls `score()` on
 each candidate and picks the one with the highest result. Return `None` to
 decline handling a file even though the MIME type is listed as supported (for
 example, when a required external service is not configured).
-Another easy way to get started with development is to use Visual Studio
+| Score  | Meaning                                           |
-Code devcontainers. This approach will create a preconfigured development
+| ------ | ------------------------------------------------- |
-environment with all of the required tools and dependencies.
+| `None` | Decline — do not handle this file                 |
-[Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
+| `10`   | Default priority used by all built-in parsers     |
-The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
+| `> 10` | Override a built-in parser for the same MIME type |
 contain more information about the specific tasks and launch configurations (see the
 non-standard "description" field).
-To get started:
+```python
@classmethod
 def score(
    cls,
    mime_type: str,
    filename: str,
    path: "Path | None" = None,
 ) -> int | None:
    # Inspect filename or file bytes here if needed.
    return 10
 ```
-1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
+**Archive and rendition flags**
-2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
+```python
@property
 def can_produce_archive(self) -> bool:
    """True if parse() can produce a searchable PDF archive copy."""
    return True   # or False if your parser doesn't produce PDFs
-3. In case your host operating system is Windows:
+@property
-   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
+def requires_pdf_rendition(self) -> bool:
-   - Git might have detecteded modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the containers terminal to fix this issue.
+    """True if the original format cannot be displayed by a browser
    (e.g. DOCX, ODT) and the PDF output must always be kept."""
    return False
 ```
-4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
+**Context manager — temp directory lifecycle**
   will initialize the database tables and create a superuser. Then you can compile the front end
   for production or run the frontend in debug mode.
-5. The project is ready for debugging, start either run the fullstack debug or individual debug
+Paperless-ngx always uses parsers as context managers. Create a temporary
-   processes. Yo spin up the project without debugging run the task **Project Start: Run all Services**
+working directory in `__enter__` (or `__init__`) and remove it in `__exit__`
 regardless of whether an exception occurred. Store intermediate files,
 thumbnails, and archive PDFs inside this directory.
-## Developing Date Parser Plugins
+```python
 import shutil
 import tempfile
 from pathlib import Path
 from typing import Self
 from types import TracebackType
 from django.conf import settings
 class MyCustomParser:
    ...
    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self._tempdir = Path(
            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR)
        )
        self._text: str | None = None
        self._archive_path: Path | None = None
    def __enter__(self) -> Self:
        return self
    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_val: BaseException | None,
        exc_tb: TracebackType | None,
    ) -> None:
        shutil.rmtree(self._tempdir, ignore_errors=True)
 ```
 **Optional context — `configure()`**
 The consumer calls `configure()` with a `ParserContext` after instantiation
 and before `parse()`. If your parser doesn't need context, a no-op
 implementation is fine:
 ```python
 from paperless.parsers import ParserContext
 def configure(self, context: ParserContext) -> None:
    pass   # override if you need context.mailrule_id, etc.
 ```
 **Parsing**
 `parse()` is the core method. It must not return a value; instead, store
 results in instance attributes and expose them via the accessor methods below.
 Raise `documents.parsers.ParseError` on any unrecoverable failure.
 ```python
 from documents.parsers import ParseError
 def parse(
    self,
    document_path: Path,
    mime_type: str,
    *,
    produce_archive: bool = True,
 ) -> None:
    try:
        self._text = extract_text_from_my_format(document_path)
    except Exception as e:
        raise ParseError(f"Failed to parse {document_path}: {e}") from e
    if produce_archive and self.can_produce_archive:
        archive = self._tempdir / "archived.pdf"
        convert_to_pdf(document_path, archive)
        self._archive_path = archive
 ```
 **Result accessors**
 ```python
 def get_text(self) -> str | None:
    return self._text
 def get_date(self) -> "datetime.datetime | None":
    # Return a datetime extracted from the document, or None to let
    # Paperless-ngx use its default date-guessing logic.
    return None
 def get_archive_path(self) -> Path | None:
    return self._archive_path
 ```
 **Thumbnail**
 `get_thumbnail()` may be called independently of `parse()`. Return the path
 to a WebP image inside `self._tempdir`. The image should be roughly 500 × 700
 pixels.
 ```python
 def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
    thumb = self._tempdir / "thumb.webp"
    render_thumbnail(document_path, thumb)
    return thumb
 ```
 **Optional methods**
 These are called by the API on demand, not during the consumption pipeline.
 Implement them if your format supports the information; otherwise return
 `None` / `[]`.
 ```python
 def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
    return count_pages(document_path)
 def extract_metadata(
    self,
    document_path: Path,
    mime_type: str,
 ) -> "list[MetadataEntry]":
    # Must never raise. Return [] if metadata cannot be read.
    from paperless.parsers import MetadataEntry
    return [
        MetadataEntry(
            namespace="https://example.com/ns/",
            prefix="ex",
            key="Author",
            value="Alice",
        )
    ]
 ```
 #### 2. Registering via entry point
 Add the following to your package's `pyproject.toml`. The key (left of `=`)
 is an arbitrary name used only in log output; the value is the
 `module:ClassName` import path.
 ```toml
 [project.entry-points."paperless_ngx.parsers"]
 my_parser = "my_package.parsers:MyCustomParser"
 ```
 Install your package into the same Python environment as Paperless-ngx (or
 add it to the Docker image), and the parser will be discovered automatically
 on the next startup. No configuration changes are needed.
 To verify discovery, check the application logs at startup for a line like:
 ```
 Loaded third-party parser 'My Format Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
 ```
 #### 3. Utilities
 `paperless.parsers.utils` provides helpers you can import directly:
 | Function                                | Description                                                      |
 | --------------------------------------- | ---------------------------------------------------------------- |
 | `read_file_handle_unicode_errors(path)` | Read a file as UTF-8, replacing invalid bytes instead of raising |
 | `get_page_count_for_pdf(path)`          | Count pages in a PDF using pikepdf                               |
 | `extract_pdf_metadata(path)`            | Extract XMP metadata from a PDF as a `list[MetadataEntry]`       |
 #### Minimal example
 A complete, working parser for a hypothetical plain-XML format:
 ```python
 from __future__ import annotations
 import shutil
 import tempfile
 from pathlib import Path
 from typing import Self
 from types import TracebackType
 import xml.etree.ElementTree as ET
 from django.conf import settings
 from documents.parsers import ParseError
 from paperless.parsers import ParserContext
 class XmlDocumentParser:
    name    = "XML Parser"
    version = "1.0.0"
    author  = "Acme Corp"
    url     = "https://example.com/xml-parser"
    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        return {"application/xml": ".xml", "text/xml": ".xml"}
    @classmethod
    def score(cls, mime_type: str, filename: str, path: Path | None = None) -> int | None:
        return 10
    @property
    def can_produce_archive(self) -> bool:
        return False
    @property
    def requires_pdf_rendition(self) -> bool:
        return False
    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self._tempdir = Path(tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR))
        self._text: str | None = None
    def __enter__(self) -> Self:
        return self
    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        shutil.rmtree(self._tempdir, ignore_errors=True)
    def configure(self, context: ParserContext) -> None:
        pass
    def parse(self, document_path: Path, mime_type: str, *, produce_archive: bool = True) -> None:
        try:
            tree = ET.parse(document_path)
            self._text = " ".join(tree.getroot().itertext())
        except ET.ParseError as e:
            raise ParseError(f"XML parse error: {e}") from e
    def get_text(self) -> str | None:
        return self._text
    def get_date(self):
        return None
    def get_archive_path(self) -> Path | None:
        return None
    def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
        from PIL import Image, ImageDraw
        img = Image.new("RGB", (500, 700), color="white")
        ImageDraw.Draw(img).text((10, 10), "XML Document", fill="black")
        out = self._tempdir / "thumb.webp"
        img.save(out, format="WEBP")
        return out
    def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
        return None
    def extract_metadata(self, document_path: Path, mime_type: str) -> list:
        return []
 ```
 ### Developing date parser plugins
 Paperless-ngx uses a plugin system for date parsing, allowing you to extend or replace the default date parsing behavior. Plugins are discovered using [Python entry points](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
-### Creating a Date Parser Plugin
+#### Creating a Date Parser Plugin
 To create a custom date parser plugin, you need to:
@@ -492,7 +734,7 @@ To create a custom date parser plugin, you need to:
 2. Implement the required abstract method
 3. Register your plugin via an entry point
-#### 1. Implementing the Parser Class
+##### 1. Implementing the Parser Class
 Your parser must extend `documents.plugins.date_parsing.DateParserPluginBase` and implement the `parse` method:
@@ -532,7 +774,7 @@ class MyDateParserPlugin(DateParserPluginBase):
        yield another_datetime
 ```
-#### 2. Configuration and Helper Methods
+##### 2. Configuration and Helper Methods
 Your parser instance is initialized with a `DateParserConfig` object accessible via `self.config`. This provides:
@@ -565,11 +807,11 @@ def _filter_date(
    """
 ```
-#### 3. Resource Management (Optional)
+##### 3. Resource Management (Optional)
 If your plugin needs to acquire or release resources (database connections, API clients, etc.), override the context manager methods. Paperless-ngx will always use plugins as context managers, ensuring resources can be released even in the event of errors.
-#### 4. Registering Your Plugin
+##### 4. Registering Your Plugin
 Register your plugin using a setuptools entry point in your package's `pyproject.toml`:
@@ -580,7 +822,7 @@ my_parser = "my_package.parsers:MyDateParserPlugin"
 The entry point name (e.g., `"my_parser"`) is used for sorting when multiple plugins are found. Paperless-ngx will use the first plugin alphabetically by name if multiple plugins are discovered.
-### Plugin Discovery
+#### Plugin Discovery
 Paperless-ngx automatically discovers and loads date parser plugins at runtime. The discovery process:
@@ -591,7 +833,7 @@ Paperless-ngx automatically discovers and loads date parser plugins at runtime.
 If multiple plugins are installed, a warning is logged indicating which plugin was selected.
-### Example: Simple Date Parser
+#### Example: Simple Date Parser
 Here's a minimal example that only looks for ISO 8601 dates:
@@ -623,3 +865,30 @@ class ISODateParserPlugin(DateParserPluginBase):
            if filtered_date is not None:
                yield filtered_date
 ```
 ## Using Visual Studio Code devcontainer
 Another easy way to get started with development is to use Visual Studio
 Code devcontainers. This approach will create a preconfigured development
 environment with all of the required tools and dependencies.
 [Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
 The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
 contain more information about the specific tasks and launch configurations (see the
 non-standard "description" field).
 To get started:
 1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
 2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
 3. In case your host operating system is Windows:
   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
   - Git might have detecteded modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the containers terminal to fix this issue.
 4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
   will initialize the database tables and create a superuser. Then you can compile the front end
   for production or run the frontend in debug mode.
 5. The project is ready for debugging, start either run the fullstack debug or individual debug
   processes. Yo spin up the project without debugging run the task **Project Start: Run all Services**
--- a/docs/migration-v3.md
+++ b/docs/migration-v3.md
@@ -104,63 +104,6 @@ Multiple options are combined in a single value:
 PAPERLESS_DB_OPTIONS="sslmode=require;sslrootcert=/certs/ca.pem;pool.max_size=10"
 ```
 ## OCR and Archive File Generation Settings
 The settings that control OCR behaviour and archive file generation have been redesigned. The old settings that coupled these two concerns together are **removed** — old values are not silently honoured; a startup warning is logged if any removed variable is still set in your environment.
 ### Removed settings
 | Removed Setting                             | Replacement                                                           |
 | ------------------------------------------- | --------------------------------------------------------------------- |
 | `PAPERLESS_OCR_MODE=skip`                   | `PAPERLESS_OCR_MODE=auto` (new default)                               |
 | `PAPERLESS_OCR_MODE=skip_noarchive`         | `PAPERLESS_OCR_MODE=auto` + `PAPERLESS_ARCHIVE_FILE_GENERATION=never` |
 | `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never`     | `PAPERLESS_ARCHIVE_FILE_GENERATION=always`                            |
 | `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text` | `PAPERLESS_ARCHIVE_FILE_GENERATION=auto` (new default)                |
 | `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always`    | `PAPERLESS_ARCHIVE_FILE_GENERATION=never`                             |
 ### What changed and why
 Previously, `OCR_MODE` conflated two independent concerns: whether to run OCR and whether to produce an archive. `skip` meant "skip OCR if text exists, but always produce an archive". `skip_noarchive` meant "skip OCR if text exists, and also skip the archive". This made it impossible to, for example, disable OCR entirely while still producing archives.
 The new settings are independent:
 - [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) controls OCR: `auto` (default), `force`, `redo`, `off`.
 - [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) controls archive production: `auto` (default), `always`, `never`.
 ### Action Required
 Remove any `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` variable from your environment. If you relied on `OCR_MODE=skip` or `OCR_MODE=skip_noarchive`, update accordingly:
 ```bash
 # v2: skip OCR when text present, always archive
 PAPERLESS_OCR_MODE=skip
 # v3: equivalent (auto is the new default)
 # No change needed — auto is the default
 # v2: skip OCR when text present, skip archive too
 PAPERLESS_OCR_MODE=skip_noarchive
 # v3: equivalent
 PAPERLESS_OCR_MODE=auto
 PAPERLESS_ARCHIVE_FILE_GENERATION=never
 # v2: always skip archive
 PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always
 # v3: equivalent
 PAPERLESS_ARCHIVE_FILE_GENERATION=never
 # v2: skip archive only for born-digital docs
 PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text
 # v3: equivalent (auto is the new default)
 PAPERLESS_ARCHIVE_FILE_GENERATION=auto
 ```
 ### Remote OCR parser
 If you use the **remote OCR parser** (Azure AI), note that it always produces a
 searchable PDF and stores it as the archive copy. `ARCHIVE_FILE_GENERATION=never`
 has no effect for documents handled by the remote parser — the archive is produced
 unconditionally by the remote engine.
 ## OpenID Connect Token Endpoint Authentication
 Some existing OpenID Connect setups may require an explicit token endpoint authentication method after upgrading to v3.
--- a/docs/setup.md
+++ b/docs/setup.md
@@ -633,11 +633,12 @@ hardware, but a few settings can improve performance:
  consumption, so you might want to lower these settings (example: 2
  workers and 1 thread to always have some computing power left for
  other tasks).
- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `auto` and consider
+- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `skip` and consider
  OCRing your documents before feeding them into Paperless. Some
  scanners are able to do this!
- Set [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) to `never` to skip archive
+- Set [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE`](configuration.md#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) to `with_text` to skip archive
-  file generation entirely, saving disk space at the cost of in-browser PDF/A viewing.
+  file generation for already OCRed documents, or `always` to skip it
  for all documents.
 - If you want to perform OCR on the device, consider using
  `PAPERLESS_OCR_CLEAN=none`. This will speed up OCR times and use
  less memory at the expense of slightly worse OCR results.
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -134,9 +134,9 @@ following operations on your documents:
 !!! tip
    This process can be configured to fit your needs. If you don't want
-    paperless to create archived versions for born-digital documents, set
+    paperless to create archived versions for digital documents, you can
-    [`PAPERLESS_ARCHIVE_FILE_GENERATION=auto`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION)
+    configure that by configuring
-    (the default). To skip archives entirely, use `never`. Please read the
+    `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text`. Please read the
    [relevant section in the documentation](configuration.md#ocr).
 !!! note
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -50,14 +50,9 @@ from documents.utils import compute_checksum
 from documents.utils import copy_basic_file_stats
 from documents.utils import copy_file_with_basic_stats
 from documents.utils import run_subprocess
 from paperless.config import OcrConfig
 from paperless.models import ArchiveFileGenerationChoices
 from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.registry import get_parser_registry
 from paperless.parsers.utils import PDF_TEXT_MIN_LENGTH
 from paperless.parsers.utils import extract_pdf_text
 from paperless.parsers.utils import is_tagged_pdf
 LOGGING_NAME: Final[str] = "paperless.consumer"
@@ -110,44 +105,6 @@ class ConsumerStatusShortMessage(StrEnum):
    FAILED = "failed"
 def should_produce_archive(
    parser: "ParserProtocol",
    mime_type: str,
    document_path: Path,
 ) -> bool:
    """Return True if a PDF/A archive should be produced for this document.
    IMPORTANT: *parser* must be an instantiated parser, not the class.
    ``requires_pdf_rendition`` and ``can_produce_archive`` are instance
    ``@property`` methods — accessing them on the class returns the descriptor
    (always truthy).
    """
    # Must produce a PDF so the frontend can display the original format at all.
    if parser.requires_pdf_rendition:
        return True
    # Parser cannot produce an archive (e.g. TextDocumentParser).
    if not parser.can_produce_archive:
        return False
    generation = OcrConfig().archive_file_generation
    if generation == ArchiveFileGenerationChoices.ALWAYS:
        return True
    if generation == ArchiveFileGenerationChoices.NEVER:
        return False
    # auto: produce archives for scanned/image documents; skip for born-digital PDFs.
    if mime_type.startswith("image/"):
        return True
    if mime_type == "application/pdf":
        if is_tagged_pdf(document_path):
            return False
        text = extract_pdf_text(document_path)
        return text is None or len(text) <= PDF_TEXT_MIN_LENGTH
    return False
 class ConsumerPluginMixin:
    if TYPE_CHECKING:
        from logging import Logger
@@ -481,16 +438,7 @@ class ConsumerPlugin(
                    )
                    self.log.debug(f"Parsing {self.filename}...")
-                    produce_archive = should_produce_archive(
+                    document_parser.parse(self.working_copy, mime_type)
                        document_parser,
                        mime_type,
                        self.working_copy,
                    )
                    document_parser.parse(
                        self.working_copy,
                        mime_type,
                        produce_archive=produce_archive,
                    )
                    self.log.debug(f"Generating thumbnail for {self.filename}...")
                    self._send_progress(
@@ -839,7 +787,7 @@ class ConsumerPlugin(
        return document
-    def apply_overrides(self, document: Document) -> None:
+    def apply_overrides(self, document) -> None:
        if self.metadata.correspondent_id:
            document.correspondent = Correspondent.objects.get(
                pk=self.metadata.correspondent_id,
--- a/src/documents/tasks.py
+++ b/src/documents/tasks.py
@@ -34,7 +34,6 @@ from documents.consumer import AsnCheckPlugin
 from documents.consumer import ConsumerPlugin
 from documents.consumer import ConsumerPreflightPlugin
 from documents.consumer import WorkflowTriggerPlugin
 from documents.consumer import should_produce_archive
 from documents.data_models import ConsumableDocument
 from documents.data_models import DocumentMetadataOverrides
 from documents.double_sided import CollatePlugin
@@ -322,16 +321,7 @@ def update_document_content_maybe_archive_file(document_id) -> None:
        parser.configure(ParserContext())
        try:
-            produce_archive = should_produce_archive(
+            parser.parse(document.source_path, mime_type)
                parser,
                mime_type,
                document.source_path,
            )
            parser.parse(
                document.source_path,
                mime_type,
                produce_archive=produce_archive,
            )
            thumbnail = parser.get_thumbnail(document.source_path, mime_type)
--- a/src/documents/tests/test_api_app_config.py
+++ b/src/documents/tests/test_api_app_config.py
@@ -46,7 +46,7 @@ class TestApiAppConfig(DirectoriesMixin, APITestCase):
                "pages": None,
                "language": None,
                "mode": None,
-                "archive_file_generation": None,
+                "skip_archive_file": None,
                "image_dpi": None,
                "unpaper_clean": None,
                "deskew": None,
--- a/src/documents/tests/test_barcodes.py
+++ b/src/documents/tests/test_barcodes.py
@@ -1020,7 +1020,7 @@ class TestTagBarcode(DirectoriesMixin, SampleDirMixin, GetReaderPluginMixin, Tes
        CONSUMER_TAG_BARCODE_SPLIT=True,
        CONSUMER_TAG_BARCODE_MAPPING={"TAG:(.*)": "\\g<1>"},
        CELERY_TASK_ALWAYS_EAGER=True,
-        OCR_MODE="auto",
+        OCR_MODE="skip",
    )
    def test_consume_barcode_file_tag_split_and_assignment(self) -> None:
        """
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -230,11 +230,7 @@ class TestConsumer(
        shutil.copy(src, dst)
        return dst
-    @override_settings(
+    @override_settings(FILENAME_FORMAT=None, TIME_ZONE="America/Chicago")
        FILENAME_FORMAT=None,
        TIME_ZONE="America/Chicago",
        ARCHIVE_FILE_GENERATION="always",
    )
    def testNormalOperation(self) -> None:
        filename = self.get_test_file()
@@ -633,10 +629,7 @@ class TestConsumer(
        # Database empty
        self.assertEqual(Document.objects.all().count(), 0)
-    @override_settings(
+    @override_settings(FILENAME_FORMAT="{correspondent}/{title}")
        FILENAME_FORMAT="{correspondent}/{title}",
        ARCHIVE_FILE_GENERATION="always",
    )
    def testFilenameHandling(self) -> None:
        with self.get_consumer(
            self.get_test_file(),
@@ -653,7 +646,7 @@ class TestConsumer(
        self._assert_first_last_send_progress()
    @mock.patch("documents.consumer.generate_unique_filename")
-    @override_settings(FILENAME_FORMAT="{pk}", ARCHIVE_FILE_GENERATION="always")
+    @override_settings(FILENAME_FORMAT="{pk}")
    def testFilenameHandlingFallsBackWhenGeneratedPathExceedsDbLimit(self, m):
        m.side_effect = lambda doc, archive_filename=False: Path(
            ("a" * 1100 + ".pdf") if not archive_filename else ("b" * 1100 + ".pdf"),
@@ -680,10 +673,7 @@ class TestConsumer(
        self._assert_first_last_send_progress()
-    @override_settings(
+    @override_settings(FILENAME_FORMAT="{correspondent}/{title}")
        FILENAME_FORMAT="{correspondent}/{title}",
        ARCHIVE_FILE_GENERATION="always",
    )
    @mock.patch("documents.signals.handlers.generate_unique_filename")
    def testFilenameHandlingUnstableFormat(self, m) -> None:
        filenames = ["this", "that", "now this", "i cannot decide"]
@@ -1031,7 +1021,7 @@ class TestConsumer(
        self.assertEqual(Document.objects.count(), 2)
        self._assert_first_last_send_progress()
-    @override_settings(FILENAME_FORMAT="{title}", ARCHIVE_FILE_GENERATION="always")
+    @override_settings(FILENAME_FORMAT="{title}")
    @mock.patch("documents.consumer.get_parser_registry")
    def test_similar_filenames(self, m) -> None:
        shutil.copy(
@@ -1142,7 +1132,6 @@ class TestConsumer(
            mock_mail_parser_parse.assert_called_once_with(
                consumer.working_copy,
                "message/rfc822",
                produce_archive=True,
            )
@@ -1290,14 +1279,7 @@ class PreConsumeTestCase(DirectoriesMixin, GetConsumerMixin, TestCase):
    def test_no_pre_consume_script(self, m) -> None:
        with self.get_consumer(self.test_file) as c:
            c.run()
-            # Verify no pre-consume script subprocess was invoked
+            m.assert_not_called()
            # (run_subprocess may still be called by _extract_text_for_archive_check)
            script_calls = [
                call
                for call in m.call_args_list
                if call.args and call.args[0] and call.args[0][0] not in ("pdftotext",)
            ]
            self.assertEqual(script_calls, [])
    @mock.patch("documents.consumer.run_subprocess")
    @override_settings(PRE_CONSUME_SCRIPT="does-not-exist")
@@ -1313,16 +1295,9 @@ class PreConsumeTestCase(DirectoriesMixin, GetConsumerMixin, TestCase):
                with self.get_consumer(self.test_file) as c:
                    c.run()
-                    self.assertTrue(m.called)
+                    m.assert_called_once()
-                    # Find the call that invoked the pre-consume script
+                    args, _ = m.call_args
                    # (run_subprocess may also be called by _extract_text_for_archive_check)
                    script_call = next(
                        call
                        for call in m.call_args_list
                        if call.args and call.args[0] and call.args[0][0] == script.name
                    )
                    args, _ = script_call
                    command = args[0]
                    environment = args[1]
--- a/src/documents/tests/test_consumer_archive.py
+++ b/src/documents/tests/test_consumer_archive.py
@@ -1,189 +0,0 @@
 """Tests for should_produce_archive()."""
 from __future__ import annotations
 from pathlib import Path
 from typing import TYPE_CHECKING
 from unittest.mock import MagicMock
 import pytest
 from documents.consumer import should_produce_archive
 if TYPE_CHECKING:
    from pytest_mock import MockerFixture
 def _parser_instance(
    *,
    can_produce: bool = True,
    requires_rendition: bool = False,
 ) -> MagicMock:
    """Return a mock parser instance with the given capability flags."""
    instance = MagicMock()
    instance.can_produce_archive = can_produce
    instance.requires_pdf_rendition = requires_rendition
    return instance
@pytest.fixture()
 def null_app_config(mocker) -> MagicMock:
    """Mock ApplicationConfiguration with all fields None → falls back to Django settings."""
    return mocker.MagicMock(
        output_type=None,
        pages=None,
        language=None,
        mode=None,
        archive_file_generation=None,
        image_dpi=None,
        unpaper_clean=None,
        deskew=None,
        rotate_pages=None,
        rotate_pages_threshold=None,
        max_image_pixels=None,
        color_conversion_strategy=None,
        user_args=None,
    )
@pytest.fixture(autouse=True)
 def patch_app_config(mocker, null_app_config):
    """Patch BaseConfig._get_config_instance for all tests in this module."""
    mocker.patch(
        "paperless.config.BaseConfig._get_config_instance",
        return_value=null_app_config,
    )
 class TestShouldProduceArchive:
    @pytest.mark.parametrize(
        ("generation", "can_produce", "requires_rendition", "mime", "expected"),
        [
            pytest.param(
                "never",
                True,
                False,
                "application/pdf",
                False,
                id="never-returns-false",
            ),
            pytest.param(
                "always",
                True,
                False,
                "application/pdf",
                True,
                id="always-returns-true",
            ),
            pytest.param(
                "never",
                True,
                True,
                "application/pdf",
                True,
                id="requires-rendition-overrides-never",
            ),
            pytest.param(
                "always",
                False,
                False,
                "text/plain",
                False,
                id="cannot-produce-overrides-always",
            ),
            pytest.param(
                "always",
                False,
                True,
                "application/pdf",
                True,
                id="requires-rendition-wins-even-if-cannot-produce",
            ),
            pytest.param(
                "auto",
                True,
                False,
                "image/tiff",
                True,
                id="auto-image-returns-true",
            ),
            pytest.param(
                "auto",
                True,
                False,
                "message/rfc822",
                False,
                id="auto-non-pdf-non-image-returns-false",
            ),
        ],
    )
    def test_generation_setting(
        self,
        settings,
        generation: str,
        can_produce: bool,  # noqa: FBT001
        requires_rendition: bool,  # noqa: FBT001
        mime: str,
        expected: bool,  # noqa: FBT001
    ) -> None:
        settings.ARCHIVE_FILE_GENERATION = generation
        parser = _parser_instance(
            can_produce=can_produce,
            requires_rendition=requires_rendition,
        )
        assert should_produce_archive(parser, mime, Path("/tmp/doc")) is expected
    @pytest.mark.parametrize(
        ("extracted_text", "expected"),
        [
            pytest.param(
                "This is a born-digital PDF with lots of text content. " * 10,
                False,
                id="born-digital-long-text-skips-archive",
            ),
            pytest.param(None, True, id="no-text-scanned-produces-archive"),
            pytest.param("tiny", True, id="short-text-treated-as-scanned"),
        ],
    )
    def test_auto_pdf_archive_decision(
        self,
        mocker: MockerFixture,
        settings,
        extracted_text: str | None,
        expected: bool,  # noqa: FBT001
    ) -> None:
        settings.ARCHIVE_FILE_GENERATION = "auto"
        mocker.patch("documents.consumer.is_tagged_pdf", return_value=False)
        mocker.patch("documents.consumer.extract_pdf_text", return_value=extracted_text)
        parser = _parser_instance(can_produce=True, requires_rendition=False)
        assert (
            should_produce_archive(parser, "application/pdf", Path("/tmp/doc.pdf"))
            is expected
        )
    def test_tagged_pdf_skips_archive_in_auto_mode(
        self,
        mocker: MockerFixture,
        settings,
    ) -> None:
        """Tagged PDFs (e.g. Word exports) are treated as born-digital regardless of text length."""
        settings.ARCHIVE_FILE_GENERATION = "auto"
        mocker.patch("documents.consumer.is_tagged_pdf", return_value=True)
        parser = _parser_instance(can_produce=True, requires_rendition=False)
        assert (
            should_produce_archive(parser, "application/pdf", Path("/tmp/doc.pdf"))
            is False
        )
    def test_tagged_pdf_does_not_call_pdftotext(
        self,
        mocker: MockerFixture,
        settings,
    ) -> None:
        """When a PDF is tagged, pdftotext is not invoked (fast path)."""
        settings.ARCHIVE_FILE_GENERATION = "auto"
        mocker.patch("documents.consumer.is_tagged_pdf", return_value=True)
        mock_extract = mocker.patch("documents.consumer.extract_pdf_text")
        parser = _parser_instance(can_produce=True, requires_rendition=False)
        should_produce_archive(parser, "application/pdf", Path("/tmp/doc.pdf"))
        mock_extract.assert_not_called()
--- a/src/documents/tests/test_management.py
+++ b/src/documents/tests/test_management.py
@@ -27,10 +27,7 @@ sample_file: Path = Path(__file__).parent / "samples" / "simple.pdf"
@pytest.mark.management
-@override_settings(
+@override_settings(FILENAME_FORMAT="{correspondent}/{title}")
    FILENAME_FORMAT="{correspondent}/{title}",
    ARCHIVE_FILE_GENERATION="always",
 )
 class TestArchiver(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
    def make_models(self):
        return Document.objects.create(
--- a/src/documents/tests/test_tasks.py
+++ b/src/documents/tests/test_tasks.py
@@ -232,7 +232,6 @@ class TestEmptyTrashTask(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
        self.assertEqual(Document.global_objects.count(), 0)
@override_settings(ARCHIVE_FILE_GENERATION="always")
 class TestUpdateContent(DirectoriesMixin, TestCase):
    def test_update_content_maybe_archive_file(self) -> None:
        """
--- a/src/paperless/checks.py
+++ b/src/paperless/checks.py
@@ -5,7 +5,6 @@ import shutil
 import stat
 import subprocess
 from pathlib import Path
 from typing import Any
 from django.conf import settings
 from django.core.checks import Error
@@ -23,7 +22,7 @@ writeable_hint = (
 )
-def path_check(var: str, directory: Path) -> list[Error]:
+def path_check(var, directory: Path) -> list[Error]:
    messages: list[Error] = []
    if directory:
        if not directory.is_dir():
@@ -60,7 +59,7 @@ def path_check(var: str, directory: Path) -> list[Error]:
@register()
-def paths_check(app_configs: Any, **kwargs: Any) -> list[Error]:
+def paths_check(app_configs, **kwargs) -> list[Error]:
    """
    Check the various paths for existence, readability and writeability
    """
@@ -74,7 +73,7 @@ def paths_check(app_configs: Any, **kwargs: Any) -> list[Error]:
@register()
-def binaries_check(app_configs: Any, **kwargs: Any) -> list[Error]:
+def binaries_check(app_configs, **kwargs):
    """
    Paperless requires the existence of a few binaries, so we do some checks
    for those here.
@@ -94,7 +93,7 @@ def binaries_check(app_configs: Any, **kwargs: Any) -> list[Error]:
@register()
-def debug_mode_check(app_configs: Any, **kwargs: Any) -> list[Warning]:
+def debug_mode_check(app_configs, **kwargs):
    if settings.DEBUG:
        return [
            Warning(
@@ -110,7 +109,7 @@ def debug_mode_check(app_configs: Any, **kwargs: Any) -> list[Warning]:
@register()
-def settings_values_check(app_configs: Any, **kwargs: Any) -> list[Error | Warning]:
+def settings_values_check(app_configs, **kwargs):
    """
    Validates at least some of the user provided settings
    """
@@ -133,14 +132,23 @@ def settings_values_check(app_configs: Any, **kwargs: Any) -> list[Error | Warni
                Error(f'OCR output type "{settings.OCR_OUTPUT_TYPE}" is not valid'),
            )
-        if settings.OCR_MODE not in {"auto", "force", "redo", "off"}:
+        if settings.OCR_MODE not in {"force", "skip", "redo", "skip_noarchive"}:
            msgs.append(Error(f'OCR output mode "{settings.OCR_MODE}" is not valid'))
-        if settings.ARCHIVE_FILE_GENERATION not in {"auto", "always", "never"}:
+        if settings.OCR_MODE == "skip_noarchive":
            msgs.append(
                Warning(
                    'OCR output mode "skip_noarchive" is deprecated and will be '
                    "removed in a future version. Please use "
                    "PAPERLESS_OCR_SKIP_ARCHIVE_FILE instead.",
                ),
            )
        if settings.OCR_SKIP_ARCHIVE_FILE not in {"never", "with_text", "always"}:
            msgs.append(
                Error(
-                    "PAPERLESS_ARCHIVE_FILE_GENERATION setting "
+                    "OCR_SKIP_ARCHIVE_FILE setting "
-                    f'"{settings.ARCHIVE_FILE_GENERATION}" is not valid',
+                    f'"{settings.OCR_SKIP_ARCHIVE_FILE}" is not valid',
                ),
            )
@@ -183,7 +191,7 @@ def settings_values_check(app_configs: Any, **kwargs: Any) -> list[Error | Warni
@register()
-def audit_log_check(app_configs: Any, **kwargs: Any) -> list[Error]:
+def audit_log_check(app_configs, **kwargs):
    db_conn = connections["default"]
    all_tables = db_conn.introspection.table_names()
    result = []
@@ -295,42 +303,7 @@ def check_deprecated_db_settings(
@register()
-def check_deprecated_v2_ocr_env_vars(
+def check_remote_parser_configured(app_configs, **kwargs) -> list[Error]:
    app_configs: object,
    **kwargs: object,
 ) -> list[Warning]:
    """Warn when deprecated v2 OCR environment variables are set.
    Users upgrading from v2 may still have these in their environment or
    config files, where they are now silently ignored.
    """
    warnings: list[Warning] = []
    if os.environ.get("PAPERLESS_OCR_SKIP_ARCHIVE_FILE"):
        warnings.append(
            Warning(
                "PAPERLESS_OCR_SKIP_ARCHIVE_FILE is set but has no effect. "
                "Use PAPERLESS_ARCHIVE_FILE_GENERATION=never/always/auto instead.",
                id="paperless.W002",
            ),
        )
    ocr_mode = os.environ.get("PAPERLESS_OCR_MODE", "")
    if ocr_mode in {"skip", "skip_noarchive"}:
        warnings.append(
            Warning(
                f"PAPERLESS_OCR_MODE={ocr_mode!r} is not a valid value. "
                f"Use PAPERLESS_OCR_MODE=auto (and PAPERLESS_ARCHIVE_FILE_GENERATION=never "
                f"if you used skip_noarchive) instead.",
                id="paperless.W003",
            ),
        )
    return warnings
@register()
 def check_remote_parser_configured(app_configs: Any, **kwargs: Any) -> list[Error]:
    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
        settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
    ):
@@ -356,7 +329,7 @@ def get_tesseract_langs():
@register()
-def check_default_language_available(app_configs: Any, **kwargs: Any) -> list[Error]:
+def check_default_language_available(app_configs, **kwargs):
    errs = []
    if not settings.OCR_LANGUAGE:
--- a/src/paperless/config.py
+++ b/src/paperless/config.py
@@ -4,11 +4,6 @@ import json
 from django.conf import settings
 from paperless.models import ApplicationConfiguration
 from paperless.models import ArchiveFileGenerationChoices
 from paperless.models import CleanChoices
 from paperless.models import ColorConvertChoices
 from paperless.models import ModeChoices
 from paperless.models import OutputTypeChoices
@dataclasses.dataclass
@@ -33,7 +28,7 @@ class OutputTypeConfig(BaseConfig):
    Almost all parsers care about the chosen PDF output format
    """
-    output_type: OutputTypeChoices = dataclasses.field(init=False)
+    output_type: str = dataclasses.field(init=False)
    def __post_init__(self) -> None:
        app_config = self._get_config_instance()
@@ -50,17 +45,15 @@ class OcrConfig(OutputTypeConfig):
    pages: int | None = dataclasses.field(init=False)
    language: str = dataclasses.field(init=False)
-    mode: ModeChoices = dataclasses.field(init=False)
+    mode: str = dataclasses.field(init=False)
-    archive_file_generation: ArchiveFileGenerationChoices = dataclasses.field(
+    skip_archive_file: str = dataclasses.field(init=False)
        init=False,
    )
    image_dpi: int | None = dataclasses.field(init=False)
-    clean: CleanChoices = dataclasses.field(init=False)
+    clean: str = dataclasses.field(init=False)
    deskew: bool = dataclasses.field(init=False)
    rotate: bool = dataclasses.field(init=False)
    rotate_threshold: float = dataclasses.field(init=False)
    max_image_pixel: float | None = dataclasses.field(init=False)
-    color_conversion_strategy: ColorConvertChoices = dataclasses.field(init=False)
+    color_conversion_strategy: str = dataclasses.field(init=False)
    user_args: dict[str, str] | None = dataclasses.field(init=False)
    def __post_init__(self) -> None:
@@ -71,8 +64,8 @@ class OcrConfig(OutputTypeConfig):
        self.pages = app_config.pages or settings.OCR_PAGES
        self.language = app_config.language or settings.OCR_LANGUAGE
        self.mode = app_config.mode or settings.OCR_MODE
-        self.archive_file_generation = (
+        self.skip_archive_file = (
-            app_config.archive_file_generation or settings.ARCHIVE_FILE_GENERATION
+            app_config.skip_archive_file or settings.OCR_SKIP_ARCHIVE_FILE
        )
        self.image_dpi = app_config.image_dpi or settings.OCR_IMAGE_DPI
        self.clean = app_config.unpaper_clean or settings.OCR_CLEAN
--- a/src/paperless/migrations/0008_replace_skip_archive_file.py
+++ b/src/paperless/migrations/0008_replace_skip_archive_file.py
@@ -1,44 +0,0 @@
 # Generated by Django 5.2.12 on 2026-03-26 20:31
 from django.db import migrations
 from django.db import models
 class Migration(migrations.Migration):
    dependencies = [
        ("paperless", "0007_optimize_integer_field_sizes"),
    ]
    operations = [
        migrations.RemoveField(
            model_name="applicationconfiguration",
            name="skip_archive_file",
        ),
        migrations.AddField(
            model_name="applicationconfiguration",
            name="archive_file_generation",
            field=models.CharField(
                blank=True,
                choices=[("auto", "auto"), ("always", "always"), ("never", "never")],
                max_length=8,
                null=True,
                verbose_name="Controls archive file generation",
            ),
        ),
        migrations.AlterField(
            model_name="applicationconfiguration",
            name="mode",
            field=models.CharField(
                blank=True,
                choices=[
                    ("auto", "auto"),
                    ("force", "force"),
                    ("redo", "redo"),
                    ("off", "off"),
                ],
                max_length=16,
                null=True,
                verbose_name="Sets the OCR mode",
            ),
        ),
    ]
--- a/src/paperless/models.py
+++ b/src/paperless/models.py
@@ -36,20 +36,20 @@ class ModeChoices(models.TextChoices):
    and our own custom setting
    """
-    AUTO = ("auto", _("auto"))
+    SKIP = ("skip", _("skip"))
    FORCE = ("force", _("force"))
    REDO = ("redo", _("redo"))
-    OFF = ("off", _("off"))
+    FORCE = ("force", _("force"))
    SKIP_NO_ARCHIVE = ("skip_noarchive", _("skip_noarchive"))
-class ArchiveFileGenerationChoices(models.TextChoices):
+class ArchiveFileChoices(models.TextChoices):
    """
    Settings to control creation of an archive PDF file
    """
    AUTO = ("auto", _("auto"))
    ALWAYS = ("always", _("always"))
    NEVER = ("never", _("never"))
    WITH_TEXT = ("with_text", _("with_text"))
    ALWAYS = ("always", _("always"))
 class CleanChoices(models.TextChoices):
@@ -126,12 +126,12 @@ class ApplicationConfiguration(AbstractSingletonModel):
        choices=ModeChoices.choices,
    )
-    archive_file_generation = models.CharField(
+    skip_archive_file = models.CharField(
-        verbose_name=_("Controls archive file generation"),
+        verbose_name=_("Controls the generation of an archive file"),
        null=True,
        blank=True,
-        max_length=8,
+        max_length=16,
-        choices=ArchiveFileGenerationChoices.choices,
+        choices=ArchiveFileChoices.choices,
    )
    image_dpi = models.PositiveSmallIntegerField(
--- a/src/paperless/parsers/tesseract.py
+++ b/src/paperless/parsers/tesseract.py
@@ -1,6 +1,5 @@
 from __future__ import annotations
 import importlib.resources
 import logging
 import os
 import re
@@ -9,8 +8,6 @@ import tempfile
 from pathlib import Path
 from typing import TYPE_CHECKING
 from typing import Any
 from typing import Final
 from typing import NoReturn
 from typing import Self
 from django.conf import settings
@@ -21,11 +18,9 @@ from documents.parsers import make_thumbnail_from_pdf
 from documents.utils import maybe_override_pixel_limit
 from documents.utils import run_subprocess
 from paperless.config import OcrConfig
 from paperless.models import ArchiveFileChoices
 from paperless.models import CleanChoices
 from paperless.models import ModeChoices
 from paperless.parsers.utils import PDF_TEXT_MIN_LENGTH
 from paperless.parsers.utils import extract_pdf_text
 from paperless.parsers.utils import is_tagged_pdf
 from paperless.parsers.utils import read_file_handle_unicode_errors
 from paperless.version import __full_version_str__
@@ -38,11 +33,7 @@ if TYPE_CHECKING:
 logger = logging.getLogger("paperless.parsing.tesseract")
-_SRGB_ICC_DATA: Final[bytes] = (
+_SUPPORTED_MIME_TYPES: dict[str, str] = {
    importlib.resources.files("ocrmypdf.data").joinpath("sRGB.icc").read_bytes()
 )
 _SUPPORTED_MIME_TYPES: Final[dict[str, str]] = {
    "application/pdf": ".pdf",
    "image/jpeg": ".jpg",
    "image/png": ".png",
@@ -108,7 +99,7 @@ class RasterisedDocumentParser:
    # Lifecycle
    # ------------------------------------------------------------------
-    def __init__(self, logging_group: object | None = None) -> None:
+    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self.tempdir = Path(
            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR),
@@ -242,7 +233,7 @@ class RasterisedDocumentParser:
        if (
            sidecar_file is not None
            and sidecar_file.is_file()
-            and self.settings.mode != ModeChoices.REDO
+            and self.settings.mode != "redo"
        ):
            text = read_file_handle_unicode_errors(sidecar_file)
@@ -259,7 +250,36 @@ class RasterisedDocumentParser:
        if not Path(pdf_file).is_file():
            return None
-        return post_process_text(extract_pdf_text(Path(pdf_file), log=self.log))
+        try:
            text = None
            with tempfile.NamedTemporaryFile(
                mode="w+",
                dir=self.tempdir,
            ) as tmp:
                run_subprocess(
                    [
                        "pdftotext",
                        "-q",
                        "-layout",
                        "-enc",
                        "UTF-8",
                        str(pdf_file),
                        tmp.name,
                    ],
                    logger=self.log,
                )
                text = read_file_handle_unicode_errors(Path(tmp.name))
            return post_process_text(text)
        except Exception:
            #  If pdftotext fails, fall back to OCR.
            self.log.warning(
                "Error while getting text from PDF document with pdftotext",
                exc_info=True,
            )
            # probably not a PDF file.
            return None
    def construct_ocrmypdf_parameters(
        self,
@@ -269,7 +289,6 @@ class RasterisedDocumentParser:
        sidecar_file: Path,
        *,
        safe_fallback: bool = False,
        skip_text: bool = False,
    ) -> dict[str, Any]:
        ocrmypdf_args: dict[str, Any] = {
            "input_file_or_options": input_file,
@@ -288,14 +307,15 @@ class RasterisedDocumentParser:
                self.settings.color_conversion_strategy
            )
-        if safe_fallback or self.settings.mode == ModeChoices.FORCE:
+        if self.settings.mode == ModeChoices.FORCE or safe_fallback:
            ocrmypdf_args["force_ocr"] = True
        elif self.settings.mode in {
            ModeChoices.SKIP,
            ModeChoices.SKIP_NO_ARCHIVE,
        }:
            ocrmypdf_args["skip_text"] = True
        elif self.settings.mode == ModeChoices.REDO:
            ocrmypdf_args["redo_ocr"] = True
        elif skip_text or self.settings.mode == ModeChoices.OFF:
            ocrmypdf_args["skip_text"] = True
        elif self.settings.mode == ModeChoices.AUTO:
            pass  # no extra flag: normal OCR (text not found case)
        else:  # pragma: no cover
            raise ParseError(f"Invalid ocr mode: {self.settings.mode}")
@@ -380,74 +400,6 @@ class RasterisedDocumentParser:
        return ocrmypdf_args
    def _convert_image_to_pdfa(self, document_path: Path) -> Path:
        """Convert an image to a PDF/A-2b file without invoking the OCR engine.
        Uses img2pdf for the initial image->PDF wrapping, then pikepdf to stamp
        PDF/A-2b conformance metadata.
        No Tesseract and no Ghostscript are invoked.
        """
        import img2pdf
        import pikepdf
        plain_pdf_path = Path(self.tempdir) / "image_plain.pdf"
        try:
            layout_fun = None
            if self.settings.image_dpi is not None:
                layout_fun = img2pdf.get_fixed_dpi_layout_fun(
                    (self.settings.image_dpi, self.settings.image_dpi),
                )
            plain_pdf_path.write_bytes(
                img2pdf.convert(str(document_path), layout_fun=layout_fun),
            )
        except Exception as e:
            raise ParseError(
                f"img2pdf conversion failed for {document_path}: {e!s}",
            ) from e
        pdfa_path = Path(self.tempdir) / "archive.pdf"
        try:
            with pikepdf.open(plain_pdf_path) as pdf:
                cs = pdf.make_stream(_SRGB_ICC_DATA)
                cs["/N"] = 3
                output_intent = pikepdf.Dictionary(
                    Type=pikepdf.Name("/OutputIntent"),
                    S=pikepdf.Name("/GTS_PDFA1"),
                    OutputConditionIdentifier=pikepdf.String("sRGB"),
                    DestOutputProfile=cs,
                )
                pdf.Root["/OutputIntents"] = pdf.make_indirect(
                    pikepdf.Array([output_intent]),
                )
                meta = pdf.open_metadata(set_pikepdf_as_editor=False)
                meta["pdfaid:part"] = "2"
                meta["pdfaid:conformance"] = "B"
                pdf.save(pdfa_path)
        except Exception as e:
            self.log.warning(
                f"PDF/A metadata stamping failed ({e!s}); falling back to plain PDF.",
            )
            pdfa_path.write_bytes(plain_pdf_path.read_bytes())
        return pdfa_path
    def _handle_subprocess_output_error(self, e: Exception) -> NoReturn:
        """Log context for Ghostscript failures and raise ParseError.
        Called from the SubprocessOutputError handlers in parse() to avoid
        duplicating the Ghostscript hint and re-raise logic.
        """
        if "Ghostscript PDF/A rendering" in str(e):
            self.log.warning(
                "Ghostscript PDF/A rendering failed, consider setting "
                "PAPERLESS_OCR_USER_ARGS: "
                "'{\"continue_on_soft_render_error\": true}'",
            )
        raise ParseError(
            f"SubprocessOutputError: {e!s}. See logs for more information.",
        ) from e
    def parse(
        self,
        document_path: Path,
@@ -457,94 +409,57 @@ class RasterisedDocumentParser:
    ) -> None:
        # This forces tesseract to use one core per page.
        os.environ["OMP_THREAD_LIMIT"] = "1"
        VALID_TEXT_LENGTH = 50
        if mime_type == "application/pdf":
            text_original = self.extract_text(None, document_path)
            original_has_text = (
                text_original is not None and len(text_original) > VALID_TEXT_LENGTH
            )
        else:
            text_original = None
            original_has_text = False
        # If the original has text, and the user doesn't want an archive,
        # we're done here
        skip_archive_for_text = (
            self.settings.mode == ModeChoices.SKIP_NO_ARCHIVE
            or self.settings.skip_archive_file
            in {
                ArchiveFileChoices.WITH_TEXT,
                ArchiveFileChoices.ALWAYS,
            }
        )
        if skip_archive_for_text and original_has_text:
            self.log.debug("Document has text, skipping OCRmyPDF entirely.")
            self.text = text_original
            return
        # Either no text was in the original or there should be an archive
        # file created, so OCR the file and create an archive with any
        # text located via OCR
        import ocrmypdf
        from ocrmypdf import EncryptedPdfError
        from ocrmypdf import InputFileError
        from ocrmypdf import SubprocessOutputError
        from ocrmypdf.exceptions import DigitalSignatureError
        from ocrmypdf.exceptions import PriorOcrFoundError
        if mime_type == "application/pdf":
            text_original = self.extract_text(None, document_path)
            original_has_text = is_tagged_pdf(document_path, log=self.log) or (
                text_original is not None and len(text_original) > PDF_TEXT_MIN_LENGTH
            )
        else:
            text_original = None
            original_has_text = False
        # --- OCR_MODE=off: never invoke OCR engine ---
        if self.settings.mode == ModeChoices.OFF:
            if not produce_archive:
                self.text = text_original or ""
                return
            if self.is_image(mime_type):
                try:
                    self.archive_path = self._convert_image_to_pdfa(
                        document_path,
                    )
                    self.text = ""
                except Exception as e:
                    raise ParseError(
                        f"Image to PDF/A conversion failed: {e!s}",
                    ) from e
                return
            # PDFs in off mode: PDF/A conversion only via skip_text
        archive_path = Path(self.tempdir) / "archive.pdf"
        sidecar_file = Path(self.tempdir) / "sidecar.txt"
            args = self.construct_ocrmypdf_parameters(
                document_path,
                mime_type,
                archive_path,
                sidecar_file,
                skip_text=True,
            )
            try:
                self.log.debug(
                    f"Calling OCRmyPDF (off mode, PDF/A conversion only): {args}",
                )
                ocrmypdf.ocr(**args)
                self.archive_path = archive_path
                self.text = self.extract_text(None, archive_path) or text_original or ""
            except SubprocessOutputError as e:
                self._handle_subprocess_output_error(e)
            except Exception as e:
                raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
            return
        # --- OCR_MODE=auto: skip ocrmypdf entirely if text exists and no archive needed ---
        if (
            self.settings.mode == ModeChoices.AUTO
            and original_has_text
            and not produce_archive
        ):
            self.log.debug(
                "Document has text and no archive requested; skipping OCRmyPDF entirely.",
            )
            self.text = text_original
            return
        # --- All other paths: run ocrmypdf ---
        archive_path = Path(self.tempdir) / "archive.pdf"
        sidecar_file = Path(self.tempdir) / "sidecar.txt"
        # auto mode with existing text: PDF/A conversion only (no OCR).
        skip_text = self.settings.mode == ModeChoices.AUTO and original_has_text
        args = self.construct_ocrmypdf_parameters(
            document_path,
            mime_type,
            archive_path,
            sidecar_file,
            skip_text=skip_text,
        )
        try:
            self.log.debug(f"Calling OCRmyPDF with args: {args}")
            ocrmypdf.ocr(**args)
-            if produce_archive:
+            if self.settings.skip_archive_file != ArchiveFileChoices.ALWAYS:
                self.archive_path = archive_path
            self.text = self.extract_text(sidecar_file, archive_path)
@@ -559,8 +474,16 @@ class RasterisedDocumentParser:
            if original_has_text:
                self.text = text_original
        except SubprocessOutputError as e:
-            self._handle_subprocess_output_error(e)
+            if "Ghostscript PDF/A rendering" in str(e):
-        except (NoTextFoundException, InputFileError, PriorOcrFoundError) as e:
+                self.log.warning(
                    "Ghostscript PDF/A rendering failed, consider setting "
                    "PAPERLESS_OCR_USER_ARGS: '{\"continue_on_soft_render_error\": true}'",
                )
            raise ParseError(
                f"SubprocessOutputError: {e!s}. See logs for more information.",
            ) from e
        except (NoTextFoundException, InputFileError) as e:
            self.log.warning(
                f"Encountered an error while running OCR: {e!s}. "
                f"Attempting force OCR to get the text.",
@@ -569,6 +492,8 @@ class RasterisedDocumentParser:
            archive_path_fallback = Path(self.tempdir) / "archive-fallback.pdf"
            sidecar_file_fallback = Path(self.tempdir) / "sidecar-fallback.txt"
            # Attempt to run OCR with safe settings.
            args = self.construct_ocrmypdf_parameters(
                document_path,
                mime_type,
@@ -580,18 +505,25 @@ class RasterisedDocumentParser:
            try:
                self.log.debug(f"Fallback: Calling OCRmyPDF with args: {args}")
                ocrmypdf.ocr(**args)
                # Don't return the archived file here, since this file
                # is bigger and blurry due to --force-ocr.
                self.text = self.extract_text(
                    sidecar_file_fallback,
                    archive_path_fallback,
                )
-                if produce_archive:
+
                    self.archive_path = archive_path_fallback
            except Exception as e:
                # If this fails, we have a serious issue at hand.
                raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
        except Exception as e:
            # Anything else is probably serious.
            raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
        # As a last resort, if we still don't have any text for any reason,
        # try to extract the text from the original document.
        if not self.text:
            if original_has_text:
                self.text = text_original
--- a/src/paperless/parsers/utils.py
+++ b/src/paperless/parsers/utils.py
@@ -10,105 +10,15 @@ from __future__ import annotations
 import logging
 import re
 import tempfile
 from pathlib import Path
 from typing import TYPE_CHECKING
 from typing import Final
 if TYPE_CHECKING:
    from pathlib import Path
    from paperless.parsers import MetadataEntry
 logger = logging.getLogger("paperless.parsers.utils")
 # Minimum character count for a PDF to be considered "born-digital" (has real text).
 # Used by both the consumer (archive decision) and the tesseract parser (skip-OCR decision).
 PDF_TEXT_MIN_LENGTH: Final[int] = 50
 def is_tagged_pdf(
    path: Path,
    log: logging.Logger | None = None,
 ) -> bool:
    """Return True if the PDF declares itself as tagged (born-digital indicator).
    Tagged PDFs (e.g. exported from Word or LibreOffice) have ``/MarkInfo``
    with ``/Marked true`` in the document root.  This is a reliable signal
    that the document has a logical structure and embedded text — running OCR
    on it is unnecessary and archive generation can be skipped.
    https://github.com/ocrmypdf/OCRmyPDF/blob/4e974ebd465a5921b2e79004f098f5d203010282/src/ocrmypdf/pdfinfo/info.py#L449
    Parameters
    ----------
    path:
        Absolute path to the PDF file.
    log:
        Logger for warnings.  Falls back to the module-level logger when omitted.
    Returns
    -------
    bool
        ``True`` when the PDF is tagged, ``False`` otherwise or on any error.
    """
    import pikepdf
    _log = log or logger
    try:
        with pikepdf.open(path) as pdf:
            mark_info = pdf.Root.get("/MarkInfo")
            if mark_info is None:
                return False
            return bool(mark_info.get("/Marked", False))
    except Exception:
        _log.warning("Could not check PDF tag status for %s", path, exc_info=True)
        return False
 def extract_pdf_text(
    path: Path,
    log: logging.Logger | None = None,
 ) -> str | None:
    """Run pdftotext on *path* and return the extracted text, or None on failure.
    Parameters
    ----------
    path:
        Absolute path to the PDF file.
    log:
        Logger for warnings.  Falls back to the module-level logger when omitted.
    Returns
    -------
    str | None
        Extracted text, or ``None`` if pdftotext fails or the file is not a PDF.
    """
    from documents.utils import run_subprocess
    _log = log or logger
    try:
        with tempfile.TemporaryDirectory() as tmpdir:
            out_path = Path(tmpdir) / "text.txt"
            run_subprocess(
                [
                    "pdftotext",
                    "-q",
                    "-layout",
                    "-enc",
                    "UTF-8",
                    str(path),
                    str(out_path),
                ],
                logger=_log,
            )
            text = read_file_handle_unicode_errors(out_path, log=_log)
            return text or None
    except Exception:
        _log.warning(
            "Error while getting text from PDF document with pdftotext",
            exc_info=True,
        )
        return None
 def read_file_handle_unicode_errors(
    filepath: Path,
--- a/src/paperless/settings/init.py
+++ b/src/paperless/settings/init.py
@@ -21,7 +21,6 @@ from paperless.settings.custom import parse_hosting_settings
 from paperless.settings.custom import parse_ignore_dates
 from paperless.settings.custom import parse_redis_url
 from paperless.settings.parsers import get_bool_from_env
 from paperless.settings.parsers import get_choice_from_env
 from paperless.settings.parsers import get_float_from_env
 from paperless.settings.parsers import get_int_from_env
 from paperless.settings.parsers import get_list_from_env
@@ -875,17 +874,10 @@ OCR_LANGUAGE = os.getenv("PAPERLESS_OCR_LANGUAGE", "eng")
 # OCRmyPDF --output-type options are available.
 OCR_OUTPUT_TYPE = os.getenv("PAPERLESS_OCR_OUTPUT_TYPE", "pdfa")
-OCR_MODE = get_choice_from_env(
+# skip. redo, force
-    "PAPERLESS_OCR_MODE",
+OCR_MODE = os.getenv("PAPERLESS_OCR_MODE", "skip")
    {"auto", "force", "redo", "off"},
    default="auto",
 )
-ARCHIVE_FILE_GENERATION = get_choice_from_env(
+OCR_SKIP_ARCHIVE_FILE = os.getenv("PAPERLESS_OCR_SKIP_ARCHIVE_FILE", "never")
    "PAPERLESS_ARCHIVE_FILE_GENERATION",
    {"auto", "always", "never"},
    default="auto",
 )
 OCR_IMAGE_DPI = get_int_from_env("PAPERLESS_OCR_IMAGE_DPI")
--- a/src/paperless/tests/parsers/conftest.py
+++ b/src/paperless/tests/parsers/conftest.py
@@ -708,7 +708,7 @@ def null_app_config(mocker: MockerFixture) -> MagicMock:
        pages=None,
        language=None,
        mode=None,
-        archive_file_generation=None,
+        skip_archive_file=None,
        image_dpi=None,
        unpaper_clean=None,
        deskew=None,
--- a/src/paperless/tests/parsers/test_parse_modes.py
+++ b/src/paperless/tests/parsers/test_parse_modes.py
@@ -1,436 +0,0 @@
 """
 Focused tests for RasterisedDocumentParser.parse() mode behaviour.
 These tests mock ``ocrmypdf.ocr`` so they run without a real Tesseract/OCRmyPDF
 installation and execute quickly.  The intent is to verify the *control flow*
 introduced by the ``produce_archive`` flag and the ``OCR_MODE=auto/off`` logic,
 not to test OCRmyPDF itself.
 Fixtures are pulled from conftest.py in this package.
 """
 from __future__ import annotations
 from pathlib import Path
 from typing import TYPE_CHECKING
 import pytest
 if TYPE_CHECKING:
    from pytest_mock import MockerFixture
    from paperless.parsers.tesseract import RasterisedDocumentParser
 # ---------------------------------------------------------------------------
 # Helpers
 # ---------------------------------------------------------------------------
 _LONG_TEXT = "This is a test document with enough text. " * 5  # >50 chars
 _SHORT_TEXT = "Hi."  # <50 chars
 def _make_extract_text(text: str | None):
    """Return a side_effect function for ``extract_text`` that returns *text*."""
    def _extract(sidecar_file, pdf_file):
        return text
    return _extract
 # ---------------------------------------------------------------------------
 # AUTO mode — PDF with sufficient text layer
 # ---------------------------------------------------------------------------
 class TestAutoModeWithText:
    """AUTO mode, original PDF has detectable text (>50 chars)."""
    def test_auto_text_no_archive_skips_ocrmypdf(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        simple_digital_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - AUTO mode, produce_archive=False
            - PDF with text > VALID_TEXT_LENGTH
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf.ocr is NOT called (early return path)
            - archive_path remains None
            - text is set from the original
        """
        # Patch extract_text to return long text (simulating detectable text layer)
        mocker.patch.object(
            tesseract_parser,
            "extract_text",
            return_value=_LONG_TEXT,
        )
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            simple_digital_pdf_file,
            "application/pdf",
            produce_archive=False,
        )
        mock_ocr.assert_not_called()
        assert tesseract_parser.archive_path is None
        assert tesseract_parser.get_text() == _LONG_TEXT
    def test_auto_text_with_archive_calls_ocrmypdf_skip_text(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        simple_digital_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - AUTO mode, produce_archive=True
            - PDF with text > VALID_TEXT_LENGTH
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf.ocr IS called with skip_text=True
            - archive_path is set
        """
        mocker.patch.object(
            tesseract_parser,
            "extract_text",
            return_value=_LONG_TEXT,
        )
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            simple_digital_pdf_file,
            "application/pdf",
            produce_archive=True,
        )
        mock_ocr.assert_called_once()
        call_kwargs = mock_ocr.call_args.kwargs
        assert call_kwargs.get("skip_text") is True
        assert "force_ocr" not in call_kwargs
        assert "redo_ocr" not in call_kwargs
        assert tesseract_parser.archive_path is not None
 # ---------------------------------------------------------------------------
 # AUTO mode — PDF without text layer (or too short)
 # ---------------------------------------------------------------------------
 class TestAutoModeNoText:
    """AUTO mode, original PDF has no detectable text (<= 50 chars)."""
    def test_auto_no_text_with_archive_calls_ocrmypdf_no_extra_flag(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        multi_page_images_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - AUTO mode, produce_archive=True
            - PDF with no text (or text <= VALID_TEXT_LENGTH)
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf.ocr IS called WITHOUT skip_text/force_ocr/redo_ocr
            - archive_path is set (since produce_archive=True)
        """
        # Return "no text" for the original; return real text for archive
        extract_call_count = 0
        def _extract_side(sidecar_file, pdf_file):
            nonlocal extract_call_count
            extract_call_count += 1
            if extract_call_count == 1:
                return None  # original has no text
            return _LONG_TEXT  # text from archive after OCR
        mocker.patch.object(tesseract_parser, "extract_text", side_effect=_extract_side)
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            multi_page_images_pdf_file,
            "application/pdf",
            produce_archive=True,
        )
        mock_ocr.assert_called_once()
        call_kwargs = mock_ocr.call_args.kwargs
        assert "skip_text" not in call_kwargs
        assert "force_ocr" not in call_kwargs
        assert "redo_ocr" not in call_kwargs
        assert tesseract_parser.archive_path is not None
    def test_auto_no_text_no_archive_calls_ocrmypdf(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        multi_page_images_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - AUTO mode, produce_archive=False
            - PDF with no text
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf.ocr IS called (no early return since no text detected)
            - archive_path is NOT set (produce_archive=False)
        """
        extract_call_count = 0
        def _extract_side(sidecar_file, pdf_file):
            nonlocal extract_call_count
            extract_call_count += 1
            if extract_call_count == 1:
                return None
            return _LONG_TEXT
        mocker.patch.object(tesseract_parser, "extract_text", side_effect=_extract_side)
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            multi_page_images_pdf_file,
            "application/pdf",
            produce_archive=False,
        )
        mock_ocr.assert_called_once()
        assert tesseract_parser.archive_path is None
 # ---------------------------------------------------------------------------
 # OFF mode — PDF
 # ---------------------------------------------------------------------------
 class TestOffModePdf:
    """OCR_MODE=off, document is a PDF."""
    def test_off_no_archive_returns_pdftotext(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        simple_digital_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - OFF mode, produce_archive=False
            - PDF with text
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf.ocr is NOT called
            - archive_path is None
            - text comes from pdftotext (extract_text)
        """
        mocker.patch.object(
            tesseract_parser,
            "extract_text",
            return_value=_LONG_TEXT,
        )
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "off"
        tesseract_parser.parse(
            simple_digital_pdf_file,
            "application/pdf",
            produce_archive=False,
        )
        mock_ocr.assert_not_called()
        assert tesseract_parser.archive_path is None
        assert tesseract_parser.get_text() == _LONG_TEXT
    def test_off_with_archive_calls_ocrmypdf_skip_text(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        simple_digital_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - OFF mode, produce_archive=True
            - PDF document
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf.ocr IS called with skip_text=True (PDF/A conversion only)
            - archive_path is set
        """
        mocker.patch.object(
            tesseract_parser,
            "extract_text",
            return_value=_LONG_TEXT,
        )
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "off"
        tesseract_parser.parse(
            simple_digital_pdf_file,
            "application/pdf",
            produce_archive=True,
        )
        mock_ocr.assert_called_once()
        call_kwargs = mock_ocr.call_args.kwargs
        assert call_kwargs.get("skip_text") is True
        assert "force_ocr" not in call_kwargs
        assert "redo_ocr" not in call_kwargs
        assert tesseract_parser.archive_path is not None
 # ---------------------------------------------------------------------------
 # OFF mode — image
 # ---------------------------------------------------------------------------
 class TestOffModeImage:
    """OCR_MODE=off, document is an image (PNG)."""
    def test_off_image_no_archive_no_ocrmypdf(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        simple_png_file: Path,
    ) -> None:
        """
        GIVEN:
            - OFF mode, produce_archive=False
            - Image document (PNG)
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf.ocr is NOT called
            - archive_path is None
            - text is empty string (images have no text layer)
        """
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "off"
        tesseract_parser.parse(simple_png_file, "image/png", produce_archive=False)
        mock_ocr.assert_not_called()
        assert tesseract_parser.archive_path is None
        assert tesseract_parser.get_text() == ""
    def test_off_image_with_archive_uses_img2pdf_path(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        simple_png_file: Path,
    ) -> None:
        """
        GIVEN:
            - OFF mode, produce_archive=True
            - Image document (PNG)
        WHEN:
            - parse() is called
        THEN:
            - _convert_image_to_pdfa() is called instead of ocrmypdf.ocr
            - archive_path is set to the returned path
            - text is empty string
        """
        fake_archive = Path("/tmp/fake-archive.pdf")
        mock_convert = mocker.patch.object(
            tesseract_parser,
            "_convert_image_to_pdfa",
            return_value=fake_archive,
        )
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "off"
        tesseract_parser.parse(simple_png_file, "image/png", produce_archive=True)
        mock_convert.assert_called_once_with(simple_png_file)
        mock_ocr.assert_not_called()
        assert tesseract_parser.archive_path == fake_archive
        assert tesseract_parser.get_text() == ""
 # ---------------------------------------------------------------------------
 # produce_archive=False never sets archive_path for FORCE / REDO / AUTO modes
 # ---------------------------------------------------------------------------
 class TestProduceArchiveFalse:
    """Verify produce_archive=False never results in an archive regardless of mode."""
    @pytest.mark.parametrize("mode", ["force", "redo"])
    def test_produce_archive_false_force_redo_modes(
        self,
        mode: str,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        multi_page_images_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - FORCE or REDO mode, produce_archive=False
            - Any PDF
        WHEN:
            - parse() is called (ocrmypdf mocked to succeed)
        THEN:
            - archive_path is NOT set even though ocrmypdf ran
        """
        mocker.patch.object(
            tesseract_parser,
            "extract_text",
            return_value=_LONG_TEXT,
        )
        mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = mode
        tesseract_parser.parse(
            multi_page_images_pdf_file,
            "application/pdf",
            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert tesseract_parser.get_text() is not None
    def test_produce_archive_false_auto_with_text(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        simple_digital_pdf_file: Path,
    ) -> None:
        """
        GIVEN:
            - AUTO mode, produce_archive=False
            - PDF with text > VALID_TEXT_LENGTH
        WHEN:
            - parse() is called
        THEN:
            - ocrmypdf is skipped entirely (early return)
            - archive_path is None
        """
        mocker.patch.object(
            tesseract_parser,
            "extract_text",
            return_value=_LONG_TEXT,
        )
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            simple_digital_pdf_file,
            "application/pdf",
            produce_archive=False,
        )
        mock_ocr.assert_not_called()
        assert tesseract_parser.archive_path is None
--- a/src/paperless/tests/parsers/test_tesseract_custom_settings.py
+++ b/src/paperless/tests/parsers/test_tesseract_custom_settings.py
@@ -89,35 +89,15 @@ class TestParserSettingsFromDb(DirectoriesMixin, FileSystemAssertsMixin, TestCas
        WHEN:
            - OCR parameters are constructed
        THEN:
-            - Configuration from database is utilized (AUTO mode with skip_text=True
+            - Configuration from database is utilized
              triggers skip_text; AUTO mode alone does not add any extra flag)
        """
        # AUTO mode with skip_text=True explicitly passed: skip_text is set
        with override_settings(OCR_MODE="redo"):
            instance = ApplicationConfiguration.objects.all().first()
-            instance.mode = ModeChoices.AUTO
+            instance.mode = ModeChoices.SKIP
            instance.save()
            params = RasterisedDocumentParser(None).construct_ocrmypdf_parameters(
                input_file="input.pdf",
                output_file="output.pdf",
                sidecar_file="sidecar.txt",
                mime_type="application/pdf",
                safe_fallback=False,
                skip_text=True,
            )
        self.assertTrue(params["skip_text"])
        self.assertNotIn("redo_ocr", params)
        self.assertNotIn("force_ocr", params)
        # AUTO mode alone (no skip_text): no extra OCR flag is set
        with override_settings(OCR_MODE="redo"):
            instance = ApplicationConfiguration.objects.all().first()
            instance.mode = ModeChoices.AUTO
            instance.save()
            params = self.get_params()
-        self.assertNotIn("skip_text", params)
+        self.assertTrue(params["skip_text"])
        self.assertNotIn("redo_ocr", params)
        self.assertNotIn("force_ocr", params)
--- a/src/paperless/tests/parsers/test_tesseract_parser.py
+++ b/src/paperless/tests/parsers/test_tesseract_parser.py
@@ -370,26 +370,15 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
        """
        GIVEN:
            - Multi-page digital PDF with sufficient text layer
            - Default settings (mode=auto, produce_archive=True)
        WHEN:
            - Document is parsed
        THEN:
            - Archive is created (AUTO mode + text present + produce_archive=True
              → PDF/A conversion via skip_text)
            - Text is extracted
        """
        tesseract_parser.parse(
-            tesseract_samples_dir / "multi-page-digital.pdf",
+            tesseract_samples_dir / "simple-digital.pdf",
            "application/pdf",
        )
        assert tesseract_parser.archive_path is not None
        assert tesseract_parser.archive_path.is_file()
        assert_ordered_substrings(
-            tesseract_parser.get_text().lower(),
+            tesseract_parser.get_text(),
-            ["page 1", "page 2", "page 3"],
+            ["This is a test document."],
        )
    def test_with_form_default(
@@ -408,7 +397,7 @@ class TestParsePdf:
            ["Please enter your name in here:", "This is a PDF document with a form."],
        )
-    def test_with_form_redo_no_archive_when_not_requested(
+    def test_with_form_redo_produces_no_archive(
        self,
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
@@ -417,7 +406,6 @@ class TestParsePdf:
        tesseract_parser.parse(
            tesseract_samples_dir / "with-form.pdf",
            "application/pdf",
            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -445,7 +433,7 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip"
        tesseract_parser.parse(tesseract_samples_dir / "signed.pdf", "application/pdf")
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -461,7 +449,7 @@ class TestParsePdf:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip"
        tesseract_parser.parse(
            tesseract_samples_dir / "encrypted.pdf",
            "application/pdf",
@@ -571,7 +559,7 @@ class TestParseMultiPage:
    @pytest.mark.parametrize(
        "mode",
        [
-            pytest.param("auto", id="auto"),
+            pytest.param("skip", id="skip"),
            pytest.param("redo", id="redo"),
            pytest.param("force", id="force"),
        ],
@@ -599,7 +587,7 @@ class TestParseMultiPage:
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-images.pdf",
            "application/pdf",
@@ -747,18 +735,16 @@ class TestSkipArchive:
        """
        GIVEN:
            - File with existing text layer
-            - Mode: auto, produce_archive=False
+            - Mode: skip_noarchive
        WHEN:
            - Document is parsed
        THEN:
-            - Text extracted from original; no archive created (text exists +
+            - Text extracted; no archive created
              produce_archive=False skips OCRmyPDF entirely)
        """
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip_noarchive"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-digital.pdf",
            "application/pdf",
            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -774,13 +760,13 @@ class TestSkipArchive:
        """
        GIVEN:
            - File with image-only pages (no text layer)
-            - Mode: auto, skip_archive_file: auto
+            - Mode: skip_noarchive
        WHEN:
            - Document is parsed
        THEN:
-            - Text extracted; archive created (OCR needed, no existing text)
+            - Text extracted; archive created (OCR needed)
        """
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip_noarchive"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-images.pdf",
            "application/pdf",
@@ -792,58 +778,41 @@ class TestSkipArchive:
        )
    @pytest.mark.parametrize(
-        ("produce_archive", "filename", "expect_archive"),
+        ("skip_archive_file", "filename", "expect_archive"),
        [
            pytest.param("never", "multi-page-digital.pdf", True, id="never-with-text"),
            pytest.param("never", "multi-page-images.pdf", True, id="never-no-text"),
            pytest.param(
-                True,
+                "with_text",
                "multi-page-digital.pdf",
                True,
                id="produce-archive-with-text",
            ),
            pytest.param(
                True,
                "multi-page-images.pdf",
                True,
                id="produce-archive-no-text",
            ),
            pytest.param(
                False,
                "multi-page-digital.pdf",
                False,
-                id="no-archive-with-text-layer",
+                id="with-text-layer",
            ),
            pytest.param(
-                False,
+                "with_text",
                "multi-page-images.pdf",
-                False,
+                True,
-                id="no-archive-no-text-layer",
+                id="with-text-no-layer",
            ),
            pytest.param(
                "always",
                "multi-page-digital.pdf",
                False,
                id="always-with-text",
            ),
            pytest.param("always", "multi-page-images.pdf", False, id="always-no-text"),
        ],
    )
-    def test_produce_archive_flag(
+    def test_skip_archive_file_setting(
        self,
-        produce_archive: bool,  # noqa: FBT001
+        skip_archive_file: str,
        filename: str,
-        expect_archive: bool,  # noqa: FBT001
+        expect_archive: str,
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        """
+        tesseract_parser.settings.skip_archive_file = skip_archive_file
-        GIVEN:
+        tesseract_parser.parse(tesseract_samples_dir / filename, "application/pdf")
            - Various PDFs (with and without text layers)
            - produce_archive flag set to True or False
        WHEN:
            - Document is parsed
        THEN:
            - archive_path is set if and only if produce_archive=True
            - Text is always extracted
        """
        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / filename,
            "application/pdf",
            produce_archive=produce_archive,
        )
        text = tesseract_parser.get_text().lower()
        assert_ordered_substrings(text, ["page 1", "page 2", "page 3"])
        if expect_archive:
@@ -851,59 +820,6 @@ class TestSkipArchive:
        else:
            assert tesseract_parser.archive_path is None
    def test_tagged_pdf_skips_ocr_in_auto_mode(
        self,
        mocker: MockerFixture,
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
        """
        GIVEN:
            - A tagged PDF (e.g. exported from Word, /MarkInfo /Marked true)
            - Mode: auto, produce_archive=False
        WHEN:
            - Document is parsed
        THEN:
            - OCRmyPDF is not invoked (tagged ⇒ original_has_text=True)
            - Text is extracted from the original via pdftotext
            - No archive is produced
        """
        tesseract_parser.settings.mode = "auto"
        mock_ocr = mocker.patch("ocrmypdf.ocr")
        tesseract_parser.parse(
            tesseract_samples_dir / "simple-digital.pdf",
            "application/pdf",
            produce_archive=False,
        )
        mock_ocr.assert_not_called()
        assert tesseract_parser.archive_path is None
        assert tesseract_parser.get_text()
    def test_tagged_pdf_produces_pdfa_archive_without_ocr(
        self,
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
        """
        GIVEN:
            - A tagged PDF (e.g. exported from Word, /MarkInfo /Marked true)
            - Mode: auto, produce_archive=True
        WHEN:
            - Document is parsed
        THEN:
            - OCRmyPDF runs with skip_text (PDF/A conversion only, no OCR)
            - Archive is produced
            - Text is preserved from the original
        """
        tesseract_parser.settings.mode = "auto"
        tesseract_parser.parse(
            tesseract_samples_dir / "simple-digital.pdf",
            "application/pdf",
            produce_archive=True,
        )
        assert tesseract_parser.archive_path is not None
        assert tesseract_parser.get_text()
 # ---------------------------------------------------------------------------
 # Parse — mixed pages / sidecar
@@ -919,13 +835,13 @@ class TestParseMixed:
        """
        GIVEN:
            - File with text in some pages (image) and some pages (digital)
-            - Mode: auto (skip_text), skip_archive_file: always
+            - Mode: skip
        WHEN:
            - Document is parsed
        THEN:
            - All pages extracted; archive created; sidecar notes skipped pages
        """
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-mixed.pdf",
            "application/pdf",
@@ -982,18 +898,17 @@ class TestParseMixed:
    ) -> None:
        """
        GIVEN:
-            - File with mixed pages (some with text, some image-only)
+            - File with mixed pages
-            - Mode: auto, produce_archive=False
+            - Mode: skip_noarchive
        WHEN:
            - Document is parsed
        THEN:
-            - No archive created (produce_archive=False); text from text layer present
+            - No archive created (file has text layer); later-page text present
        """
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip_noarchive"
        tesseract_parser.parse(
            tesseract_samples_dir / "multi-page-mixed.pdf",
            "application/pdf",
            produce_archive=False,
        )
        assert tesseract_parser.archive_path is None
        assert_ordered_substrings(
@@ -1008,12 +923,12 @@ class TestParseMixed:
 class TestParseRotate:
-    def test_rotate_auto_mode(
+    def test_rotate_skip_mode(
        self,
        tesseract_parser: RasterisedDocumentParser,
        tesseract_samples_dir: Path,
    ) -> None:
-        tesseract_parser.settings.mode = "auto"
+        tesseract_parser.settings.mode = "skip"
        tesseract_parser.settings.rotate = True
        tesseract_parser.parse(tesseract_samples_dir / "rotated.pdf", "application/pdf")
        assert_ordered_substrings(
@@ -1040,19 +955,12 @@ class TestParseRtl:
    ) -> None:
        """
        GIVEN:
-            - PDF with RTL Arabic text in its text layer (short: 18 chars)
+            - PDF with RTL Arabic text
            - mode=off, produce_archive=True: PDF/A conversion via skip_text, no OCR engine
        WHEN:
            - Document is parsed
        THEN:
-            - Arabic content is extracted from the PDF text layer (normalised for bidi)
+            - Arabic content is extracted (normalised for bidi)
        Note: The RTL PDF has a short text layer (< VALID_TEXT_LENGTH=50) so AUTO mode
        would attempt full OCR, which fails due to PriorOcrFoundError and falls back to
        force-ocr with English Tesseract (producing garbage).  Using mode="off" forces
        skip_text=True so the Arabic text layer is preserved through PDF/A conversion.
        """
        tesseract_parser.settings.mode = "off"
        tesseract_parser.parse(
            tesseract_samples_dir / "rtl-test.pdf",
            "application/pdf",
@@ -1115,11 +1023,11 @@ class TestOcrmypdfParameters:
        assert ("clean" in params) == expected_clean
        assert ("clean_final" in params) == expected_clean_final
-    def test_clean_final_auto_mode(
+    def test_clean_final_skip_mode(
        self,
        make_tesseract_parser: MakeTesseractParser,
    ) -> None:
-        with make_tesseract_parser(OCR_CLEAN="clean-final", OCR_MODE="auto") as parser:
+        with make_tesseract_parser(OCR_CLEAN="clean-final", OCR_MODE="skip") as parser:
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
        assert params["clean_final"] is True
        assert "clean" not in params
@@ -1136,9 +1044,9 @@ class TestOcrmypdfParameters:
    @pytest.mark.parametrize(
        ("ocr_mode", "ocr_deskew", "expect_deskew"),
        [
-            pytest.param("auto", True, True, id="auto-deskew-on"),
+            pytest.param("skip", True, True, id="skip-deskew-on"),
            pytest.param("redo", True, False, id="redo-deskew-off"),
-            pytest.param("auto", False, False, id="auto-no-deskew"),
+            pytest.param("skip", False, False, id="skip-no-deskew"),
        ],
    )
    def test_deskew_option(
--- a/src/paperless/tests/test_checks.py
+++ b/src/paperless/tests/test_checks.py
@@ -132,13 +132,13 @@ class TestOcrSettingsChecks:
            pytest.param(
                "OCR_MODE",
                "skip_noarchive",
-                'OCR output mode "skip_noarchive"',
+                "deprecated",
-                id="deprecated-mode-now-invalid",
+                id="deprecated-mode",
            ),
            pytest.param(
-                "ARCHIVE_FILE_GENERATION",
+                "OCR_SKIP_ARCHIVE_FILE",
                "invalid",
-                'PAPERLESS_ARCHIVE_FILE_GENERATION setting "invalid"',
+                'OCR_SKIP_ARCHIVE_FILE setting "invalid"',
                id="invalid-skip-archive-file",
            ),
            pytest.param(
--- a/src/paperless/tests/test_checks_v3.py
+++ b/src/paperless/tests/test_checks_v3.py
@@ -1,64 +0,0 @@
 """Tests for v3 system checks: deprecated v2 OCR env var warnings."""
 from __future__ import annotations
 import os
 from typing import TYPE_CHECKING
 import pytest
 from paperless.checks import check_deprecated_v2_ocr_env_vars
 if TYPE_CHECKING:
    from pytest_mock import MockerFixture
 class TestDeprecatedV2OcrEnvVarWarnings:
    def test_no_deprecated_vars_returns_empty(self, mocker: MockerFixture) -> None:
        """No warnings when neither deprecated variable is set."""
        mocker.patch.dict(os.environ, {"PAPERLESS_OCR_MODE": "auto"}, clear=True)
        result = check_deprecated_v2_ocr_env_vars(None)
        assert result == []
    @pytest.mark.parametrize(
        ("env_var", "env_value", "expected_id", "expected_fragment"),
        [
            pytest.param(
                "PAPERLESS_OCR_SKIP_ARCHIVE_FILE",
                "always",
                "paperless.W002",
                "PAPERLESS_OCR_SKIP_ARCHIVE_FILE",
                id="skip-archive-file-warns",
            ),
            pytest.param(
                "PAPERLESS_OCR_MODE",
                "skip",
                "paperless.W003",
                "skip",
                id="ocr-mode-skip-warns",
            ),
            pytest.param(
                "PAPERLESS_OCR_MODE",
                "skip_noarchive",
                "paperless.W003",
                "skip_noarchive",
                id="ocr-mode-skip-noarchive-warns",
            ),
        ],
    )
    def test_deprecated_var_produces_one_warning(
        self,
        mocker: MockerFixture,
        env_var: str,
        env_value: str,
        expected_id: str,
        expected_fragment: str,
    ) -> None:
        """Each deprecated setting in isolation produces exactly one warning."""
        mocker.patch.dict(os.environ, {env_var: env_value}, clear=True)
        result = check_deprecated_v2_ocr_env_vars(None)
        assert len(result) == 1
        warning = result[0]
        assert warning.id == expected_id
        assert expected_fragment in warning.msg
--- a/src/paperless/tests/test_ocr_config.py
+++ b/src/paperless/tests/test_ocr_config.py
@@ -1,66 +0,0 @@
 """Tests for OcrConfig archive_file_generation field behavior."""
 from __future__ import annotations
 from typing import TYPE_CHECKING
 import pytest
 from django.test import override_settings
 from paperless.config import OcrConfig
 if TYPE_CHECKING:
    from unittest.mock import MagicMock
@pytest.fixture()
 def null_app_config(mocker) -> MagicMock:
    """Mock ApplicationConfiguration with all fields None → falls back to Django settings."""
    return mocker.MagicMock(
        output_type=None,
        pages=None,
        language=None,
        mode=None,
        archive_file_generation=None,
        image_dpi=None,
        unpaper_clean=None,
        deskew=None,
        rotate_pages=None,
        rotate_pages_threshold=None,
        max_image_pixels=None,
        color_conversion_strategy=None,
        user_args=None,
    )
@pytest.fixture()
 def make_ocr_config(mocker, null_app_config):
    mocker.patch(
        "paperless.config.BaseConfig._get_config_instance",
        return_value=null_app_config,
    )
    def _make(**django_settings_overrides):
        with override_settings(**django_settings_overrides):
            return OcrConfig()
    return _make
 class TestOcrConfigArchiveFileGeneration:
    def test_auto_from_settings(self, make_ocr_config) -> None:
        cfg = make_ocr_config(OCR_MODE="auto", ARCHIVE_FILE_GENERATION="auto")
        assert cfg.archive_file_generation == "auto"
    def test_always_from_settings(self, make_ocr_config) -> None:
        cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="always")
        assert cfg.archive_file_generation == "always"
    def test_never_from_settings(self, make_ocr_config) -> None:
        cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="never")
        assert cfg.archive_file_generation == "never"
    def test_db_value_overrides_setting(self, make_ocr_config, null_app_config) -> None:
        null_app_config.archive_file_generation = "never"
        cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="always")
        assert cfg.archive_file_generation == "never"
--- a/src/paperless/tests/test_parser_utils.py
+++ b/src/paperless/tests/test_parser_utils.py
@@ -1,25 +0,0 @@
 """Tests for paperless.parsers.utils helpers."""
 from __future__ import annotations
 from pathlib import Path
 from paperless.parsers.utils import is_tagged_pdf
 SAMPLES = Path(__file__).parent / "samples" / "tesseract"
 class TestIsTaggedPdf:
    def test_tagged_pdf_returns_true(self) -> None:
        assert is_tagged_pdf(SAMPLES / "simple-digital.pdf") is True
    def test_untagged_pdf_returns_false(self) -> None:
        assert is_tagged_pdf(SAMPLES / "multi-page-images.pdf") is False
    def test_nonexistent_path_returns_false(self) -> None:
        assert is_tagged_pdf(Path("/nonexistent/file.pdf")) is False
    def test_corrupt_pdf_returns_false(self, tmp_path: Path) -> None:
        bad = tmp_path / "bad.pdf"
        bad.write_bytes(b"not a pdf")
        assert is_tagged_pdf(bad) is False
Author	SHA1	Message	Date
Trenton H	3d0d243057	Merge branch 'dev' into chore/plugin-documentation	2026-03-27 10:18:22 -07:00
Trenton H	7a192d021f	Moves the date parsing plugin section under the extending section	2026-03-23 12:56:43 -07:00
Trenton H	1e30490a46	Adds a section about how the 2 install types can add external plugins	2026-03-23 12:56:38 -07:00
Trenton H	bd9e529a63	Inital documentation updates for developing a plugin	2026-03-23 12:56:34 -07:00