Moves the date parsing plugin section under the extending section

Adds a section about how the two install types can add external plugins
Initial documentation updates for developing a plugin
2026-03-21 00:15:57 +00:00 · 2026-03-20 15:05:13 -07:00 · 2026-03-20 14:55:17 -07:00 · 2026-03-20 14:48:45 -07:00 · 2026-03-20 14:23:30 -07:00 · 2026-03-20 13:54:09 -07:00
97 changed files with 4119 additions and 2888 deletions
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -21,6 +21,7 @@ body:
        - [The installation instructions](https://docs.paperless-ngx.com/setup/#installation).
        - [Existing issues and discussions](https://github.com/paperless-ngx/paperless-ngx/search?q=&type=issues).
        - Disable any custom container initialization scripts, if using
+        - Remove any third-party parser plugins — issues caused by or requiring changes to a third-party plugin will be closed without investigation.

        If you encounter issues while installing or configuring Paperless-ngx, please post in the ["Support" section of the discussions](https://github.com/paperless-ngx/paperless-ngx/discussions/new?category=support).
  - type: textarea
@@ -120,5 +121,7 @@ body:
          required: true
        - label: I have already searched for relevant existing issues and discussions before opening this report.
          required: true
+        - label: I have reproduced this issue with all third-party parser plugins removed. I understand that issues caused by third-party plugins will be closed without investigation.
+          required: true
        - label: I have updated the title field above with a concise description.
          required: true
--- a/.mypy-baseline.txt
+++ b/.mypy-baseline.txt
@@ -2437,17 +2437,3 @@ src/paperless_tesseract/tests/test_parser_custom_settings.py:0: error: Item "Non
 src/paperless_tesseract/tests/test_parser_custom_settings.py:0: error: Item "None" of "ApplicationConfiguration | None" has no attribute "unpaper_clean"  [union-attr]
 src/paperless_tesseract/tests/test_parser_custom_settings.py:0: error: Item "None" of "ApplicationConfiguration | None" has no attribute "unpaper_clean"  [union-attr]
 src/paperless_tesseract/tests/test_parser_custom_settings.py:0: error: Item "None" of "ApplicationConfiguration | None" has no attribute "user_args"  [union-attr]
-src/paperless_text/parsers.py:0: error: Function is missing a type annotation for one or more arguments  [no-untyped-def]
-src/paperless_text/parsers.py:0: error: Function is missing a type annotation for one or more arguments  [no-untyped-def]
-src/paperless_text/parsers.py:0: error: Incompatible types in assignment (expression has type "str", variable has type "None")  [assignment]
-src/paperless_text/signals.py:0: error: Function is missing a type annotation  [no-untyped-def]
-src/paperless_text/signals.py:0: error: Function is missing a type annotation  [no-untyped-def]
-src/paperless_tika/parsers.py:0: error: Argument 1 to "make_thumbnail_from_pdf" has incompatible type "None"; expected "Path"  [arg-type]
-src/paperless_tika/parsers.py:0: error: Function is missing a return type annotation  [no-untyped-def]
-src/paperless_tika/parsers.py:0: error: Function is missing a type annotation  [no-untyped-def]
-src/paperless_tika/parsers.py:0: error: Function is missing a type annotation  [no-untyped-def]
-src/paperless_tika/parsers.py:0: error: Function is missing a type annotation for one or more arguments  [no-untyped-def]
-src/paperless_tika/parsers.py:0: error: Function is missing a type annotation for one or more arguments  [no-untyped-def]
-src/paperless_tika/parsers.py:0: error: Incompatible types in assignment (expression has type "str | None", variable has type "None")  [assignment]
-src/paperless_tika/signals.py:0: error: Function is missing a type annotation  [no-untyped-def]
-src/paperless_tika/signals.py:0: error: Function is missing a type annotation  [no-untyped-def]
--- a/docs/advanced_usage.md
+++ b/docs/advanced_usage.md
@@ -723,6 +723,81 @@ services:

 1. Note the `:ro` tag means the folder will be mounted as read only. This is for extra security against changes

+## Installing third-party parser plugins {#parser-plugins}
+
+Third-party parser plugins extend Paperless-ngx to support additional file
+formats. A plugin is a Python package that advertises itself under the
+`paperless_ngx.parsers` entry point group. Refer to the
+[developer documentation](development.md#making-custom-parsers) for how to
+create one.
+
+!!! warning "Third-party plugins are not officially supported"
+
+    The Paperless-ngx maintainers do not provide support for third-party
+    plugins. Issues caused by or requiring changes to a third-party plugin
+    will be closed without further investigation. Always reproduce problems
+    with all plugins removed before filing a bug report.
+
+### Docker
+
+Use a [custom container initialization script](#custom-container-initialization)
+to install the package before the webserver starts. Create a shell script and
+mount it into `/custom-cont-init.d`:
+
+```bash
+#!/bin/bash
+# /path/to/my/scripts/install-parsers.sh
+
+pip install my-paperless-parser-package
+```
+
+Mount it in your `docker-compose.yml`:
+
+```yaml
+services:
+  webserver:
+    # ...
+    volumes:
+      - /path/to/my/scripts:/custom-cont-init.d:ro
+```
+
+The script runs as `root` before the webserver starts, so the package will be
+available when Paperless-ngx discovers plugins at startup.
+
+### Bare metal
+
+Install the package into the same Python environment that runs Paperless-ngx.
+If you followed the standard bare-metal install guide, that is the `paperless`
+user's environment:
+
+```bash
+sudo -Hu paperless pip3 install my-paperless-parser-package
+```
+
+If you are using `uv` or a virtual environment, activate it first and then run:
+
+```bash
+uv pip install my-paperless-parser-package
+# or
+pip install my-paperless-parser-package
+```
+
+Restart all Paperless-ngx services after installation so the new plugin is
+discovered.
+
+### Verifying installation
+
+On the next startup, check the application logs for a line confirming
+discovery:
+
+```
+Loaded third-party parser 'My Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
+```
+
+If this line does not appear, verify that the package is installed in the
+correct environment and that its `pyproject.toml` declares the
+`paperless_ngx.parsers` entry point.
+
 ## MySQL Caveats {#mysql-caveats}

 ### Case Sensitivity
--- a/docs/development.md
+++ b/docs/development.md
@@ -370,121 +370,363 @@ docker build --file Dockerfile --tag paperless:local .

 ## Extending Paperless-ngx

-Paperless-ngx does not have any fancy plugin systems and will probably never
-have. However, some parts of the application have been designed to allow
-easy integration of additional features without any modification to the
-base code.
+Paperless-ngx supports third-party document parsers via a Python entry point
+plugin system. Plugins are distributed as ordinary Python packages and
+discovered automatically at startup — no changes to the Paperless-ngx source
+are required.
+
+!!! warning "Third-party plugins are not officially supported"
+
+    The Paperless-ngx maintainers do not provide support for third-party
+    plugins. Issues that are caused by or require changes to a third-party
+    plugin will be closed without further investigation. If you believe you
+    have found a bug in Paperless-ngx itself (not in a plugin), please
+    reproduce it with all third-party plugins removed before filing an issue.

 ### Making custom parsers

-Paperless-ngx uses parsers to add documents. A parser is
-responsible for:
+Paperless-ngx uses parsers to add documents. A parser is responsible for:

- Retrieving the content from the original
- Creating a thumbnail
- _optional:_ Retrieving a created date from the original
- _optional:_ Creating an archived document from the original
+- Extracting plain-text content from the document
+- Generating a thumbnail image
+- _optional:_ Detecting the document's creation date
+- _optional:_ Producing a searchable PDF archive copy

-Custom parsers can be added to Paperless-ngx to support more file types. In
-order to do that, you need to write the parser itself and announce its
-existence to Paperless-ngx.
+Custom parsers are distributed as ordinary Python packages and registered
+via a [setuptools entry point](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
+No changes to the Paperless-ngx source are required.

-The parser itself must extend `documents.parsers.DocumentParser` and
-must implement the methods `parse` and `get_thumbnail`. You can provide
-your own implementation to `get_date` if you don't want to rely on
-Paperless-ngx' default date guessing mechanisms.
+#### 1. Implementing the parser class
+
+Your parser must satisfy the `ParserProtocol` structural interface defined in
+`paperless.parsers`. The simplest approach is to write a plain class — no base
+class is required, only the right attributes and methods.
+
+**Class-level identity attributes**
+
+The registry reads these before instantiating the parser, so they must be
+plain class attributes (not instance attributes or properties):

 ```python
-class MyCustomParser(DocumentParser):
-
-    def parse(self, document_path, mime_type):
-        # This method does not return anything. Rather, you should assign
-        # whatever you got from the document to the following fields:
-
-        # The content of the document.
-        self.text = "content"
-
-        # Optional: path to a PDF document that you created from the original.
-        self.archive_path = os.path.join(self.tempdir, "archived.pdf")
-
-        # Optional: "created" date of the document.
-        self.date = get_created_from_metadata(document_path)
-
-    def get_thumbnail(self, document_path, mime_type):
-        # This should return the path to a thumbnail you created for this
-        # document.
-        return os.path.join(self.tempdir, "thumb.webp")
+class MyCustomParser:
+    name    = "My Format Parser"   # human-readable name shown in logs
+    version = "1.0.0"              # semantic version string
+    author  = "Acme Corp"          # author / organisation
+    url     = "https://example.com/my-parser"  # docs or issue tracker
 ```

-If you encounter any issues during parsing, raise a
-`documents.parsers.ParseError`.
+**Declaring supported MIME types**

-The `self.tempdir` directory is a temporary directory that is guaranteed
-to be empty and removed after consumption finished. You can use that
-directory to store any intermediate files and also use it to store the
-thumbnail / archived document.
-
-After that, you need to announce your parser to Paperless-ngx. You need to
-connect a handler to the `document_consumer_declaration` signal. Have a
-look in the file `src/paperless_tesseract/apps.py` on how that's done.
-The handler is a method that returns information about your parser:
+Return a `dict` mapping MIME type strings to preferred file extensions
+(including the leading dot). Paperless-ngx uses the extension when storing
+archive copies and serving files for download.

 ```python
-def myparser_consumer_declaration(sender, **kwargs):
+@classmethod
+def supported_mime_types(cls) -> dict[str, str]:
    return {
-        "parser": MyCustomParser,
-        "weight": 0,
-        "mime_types": {
-            "application/pdf": ".pdf",
-            "image/jpeg": ".jpg",
-        }
+        "application/x-my-format": ".myf",
+        "application/x-my-format-alt": ".myf",
    }
 ```

- `parser` is a reference to a class that extends `DocumentParser`.
- `weight` is used whenever two or more parsers are able to parse a
-  file: The parser with the higher weight wins. This can be used to
-  override the parsers provided by Paperless-ngx.
- `mime_types` is a dictionary. The keys are the mime types your
-  parser supports and the value is the default file extension that
-  Paperless-ngx should use when storing files and serving them for
-  download. We could guess that from the file extensions, but some
-  mime types have many extensions associated with them and the Python
-  methods responsible for guessing the extension do not always return
-  the same value.
+**Scoring**

-## Using Visual Studio Code devcontainer
+When more than one parser can handle a file, the registry calls `score()` on
+each candidate and picks the one with the highest result. Return `None` to
+decline handling a file even though the MIME type is listed as supported (for
+example, when a required external service is not configured).

-Another easy way to get started with development is to use Visual Studio
-Code devcontainers. This approach will create a preconfigured development
-environment with all of the required tools and dependencies.
-[Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
-The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
-contain more information about the specific tasks and launch configurations (see the
-non-standard "description" field).
+| Score  | Meaning                                           |
+| ------ | ------------------------------------------------- |
+| `None` | Decline — do not handle this file                 |
+| `10`   | Default priority used by all built-in parsers     |
+| `> 10` | Override a built-in parser for the same MIME type |

-To get started:
+```python
+@classmethod
+def score(
+    cls,
+    mime_type: str,
+    filename: str,
+    path: "Path | None" = None,
+) -> int | None:
+    # Inspect filename or file bytes here if needed.
+    return 10
+```

-1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
+**Archive and rendition flags**

-2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
+```python
+@property
+def can_produce_archive(self) -> bool:
+    """True if parse() can produce a searchable PDF archive copy."""
+    return True   # or False if your parser doesn't produce PDFs

-3. In case your host operating system is Windows:
-   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
-   - Git might have detecteded modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the containers terminal to fix this issue.
+@property
+def requires_pdf_rendition(self) -> bool:
+    """True if the original format cannot be displayed by a browser
+    (e.g. DOCX, ODT) and the PDF output must always be kept."""
+    return False
+```

-4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
-   will initialize the database tables and create a superuser. Then you can compile the front end
-   for production or run the frontend in debug mode.
+**Context manager — temp directory lifecycle**

-5. The project is ready for debugging, start either run the fullstack debug or individual debug
-   processes. Yo spin up the project without debugging run the task **Project Start: Run all Services**
+Paperless-ngx always uses parsers as context managers. Create a temporary
+working directory in `__enter__` (or `__init__`) and remove it in `__exit__`
+regardless of whether an exception occurred. Store intermediate files,
+thumbnails, and archive PDFs inside this directory.

-## Developing Date Parser Plugins
+```python
+import shutil
+import tempfile
+from pathlib import Path
+from typing import Self
+from types import TracebackType
+
+from django.conf import settings
+
+class MyCustomParser:
+    ...
+
+    def __init__(self, logging_group: object = None) -> None:
+        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
+        self._tempdir = Path(
+            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR)
+        )
+        self._text: str | None = None
+        self._archive_path: Path | None = None
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_val: BaseException | None,
+        exc_tb: TracebackType | None,
+    ) -> None:
+        shutil.rmtree(self._tempdir, ignore_errors=True)
+```
+
+**Optional context — `configure()`**
+
+The consumer calls `configure()` with a `ParserContext` after instantiation
+and before `parse()`. If your parser doesn't need context, a no-op
+implementation is fine:
+
+```python
+from paperless.parsers import ParserContext
+
+def configure(self, context: ParserContext) -> None:
+    pass   # override if you need context.mailrule_id, etc.
+```
+
+**Parsing**
+
+`parse()` is the core method. It must not return a value; instead, store
+results in instance attributes and expose them via the accessor methods below.
+Raise `documents.parsers.ParseError` on any unrecoverable failure.
+
+```python
+from documents.parsers import ParseError
+
+def parse(
+    self,
+    document_path: Path,
+    mime_type: str,
+    *,
+    produce_archive: bool = True,
+) -> None:
+    try:
+        self._text = extract_text_from_my_format(document_path)
+    except Exception as e:
+        raise ParseError(f"Failed to parse {document_path}: {e}") from e
+
+    if produce_archive and self.can_produce_archive:
+        archive = self._tempdir / "archived.pdf"
+        convert_to_pdf(document_path, archive)
+        self._archive_path = archive
+```
+
+**Result accessors**
+
+```python
+def get_text(self) -> str | None:
+    return self._text
+
+def get_date(self) -> "datetime.datetime | None":
+    # Return a datetime extracted from the document, or None to let
+    # Paperless-ngx use its default date-guessing logic.
+    return None
+
+def get_archive_path(self) -> Path | None:
+    return self._archive_path
+```
+
+**Thumbnail**
+
+`get_thumbnail()` may be called independently of `parse()`. Return the path
+to a WebP image inside `self._tempdir`. The image should be roughly 500 × 700
+pixels.
+
+```python
+def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
+    thumb = self._tempdir / "thumb.webp"
+    render_thumbnail(document_path, thumb)
+    return thumb
+```
+
+**Optional methods**
+
+These are called by the API on demand, not during the consumption pipeline.
+Implement them if your format supports the information; otherwise return
+`None` / `[]`.
+
+```python
+def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
+    return count_pages(document_path)
+
+def extract_metadata(
+    self,
+    document_path: Path,
+    mime_type: str,
+) -> "list[MetadataEntry]":
+    # Must never raise. Return [] if metadata cannot be read.
+    from paperless.parsers import MetadataEntry
+    return [
+        MetadataEntry(
+            namespace="https://example.com/ns/",
+            prefix="ex",
+            key="Author",
+            value="Alice",
+        )
+    ]
+```
+
+#### 2. Registering via entry point
+
+Add the following to your package's `pyproject.toml`. The key (left of `=`)
+is an arbitrary name used only in log output; the value is the
+`module:ClassName` import path.
+
+```toml
+[project.entry-points."paperless_ngx.parsers"]
+my_parser = "my_package.parsers:MyCustomParser"
+```
+
+Install your package into the same Python environment as Paperless-ngx (or
+add it to the Docker image), and the parser will be discovered automatically
+on the next startup. No configuration changes are needed.
+
+To verify discovery, check the application logs at startup for a line like:
+
+```
+Loaded third-party parser 'My Format Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
+```
+
+#### 3. Utilities
+
+`paperless.parsers.utils` provides helpers you can import directly:
+
+| Function                                | Description                                                      |
+| --------------------------------------- | ---------------------------------------------------------------- |
+| `read_file_handle_unicode_errors(path)` | Read a file as UTF-8, replacing invalid bytes instead of raising |
+| `get_page_count_for_pdf(path)`          | Count pages in a PDF using pikepdf                               |
+| `extract_pdf_metadata(path)`            | Extract XMP metadata from a PDF as a `list[MetadataEntry]`       |
+
+#### Minimal example
+
+A complete, working parser for a hypothetical plain-XML format:
+
+```python
+from __future__ import annotations
+
+import shutil
+import tempfile
+from pathlib import Path
+from typing import Self
+from types import TracebackType
+import xml.etree.ElementTree as ET
+
+from django.conf import settings
+
+from documents.parsers import ParseError
+from paperless.parsers import ParserContext
+
+
+class XmlDocumentParser:
+    name    = "XML Parser"
+    version = "1.0.0"
+    author  = "Acme Corp"
+    url     = "https://example.com/xml-parser"
+
+    @classmethod
+    def supported_mime_types(cls) -> dict[str, str]:
+        return {"application/xml": ".xml", "text/xml": ".xml"}
+
+    @classmethod
+    def score(cls, mime_type: str, filename: str, path: Path | None = None) -> int | None:
+        return 10
+
+    @property
+    def can_produce_archive(self) -> bool:
+        return False
+
+    @property
+    def requires_pdf_rendition(self) -> bool:
+        return False
+
+    def __init__(self, logging_group: object = None) -> None:
+        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
+        self._tempdir = Path(tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR))
+        self._text: str | None = None
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_val: BaseException | None,
+        exc_tb: TracebackType | None,
+    ) -> None:
+        shutil.rmtree(self._tempdir, ignore_errors=True)
+
+    def configure(self, context: ParserContext) -> None:
+        pass
+
+    def parse(self, document_path: Path, mime_type: str, *, produce_archive: bool = True) -> None:
+        try:
+            tree = ET.parse(document_path)
+            self._text = " ".join(tree.getroot().itertext())
+        except ET.ParseError as e:
+            raise ParseError(f"XML parse error: {e}") from e
+
+    def get_text(self) -> str | None:
+        return self._text
+
+    def get_date(self):
+        return None
+
+    def get_archive_path(self) -> Path | None:
+        return None
+
+    def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
+        from PIL import Image, ImageDraw
+        img = Image.new("RGB", (500, 700), color="white")
+        ImageDraw.Draw(img).text((10, 10), "XML Document", fill="black")
+        out = self._tempdir / "thumb.webp"
+        img.save(out, format="WEBP")
+        return out
+
+    def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
+        return None
+
+    def extract_metadata(self, document_path: Path, mime_type: str) -> list:
+        return []
+```
+
+### Developing date parser plugins

 Paperless-ngx uses a plugin system for date parsing, allowing you to extend or replace the default date parsing behavior. Plugins are discovered using [Python entry points](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).

-### Creating a Date Parser Plugin
+#### Creating a Date Parser Plugin

 To create a custom date parser plugin, you need to:

@@ -492,7 +734,7 @@ To create a custom date parser plugin, you need to:
 2. Implement the required abstract method
 3. Register your plugin via an entry point

-#### 1. Implementing the Parser Class
+##### 1. Implementing the Parser Class

 Your parser must extend `documents.plugins.date_parsing.DateParserPluginBase` and implement the `parse` method:

@@ -532,7 +774,7 @@ class MyDateParserPlugin(DateParserPluginBase):
        yield another_datetime
 ```

-#### 2. Configuration and Helper Methods
+##### 2. Configuration and Helper Methods

 Your parser instance is initialized with a `DateParserConfig` object accessible via `self.config`. This provides:

@@ -565,11 +807,11 @@ def _filter_date(
    """
 ```

-#### 3. Resource Management (Optional)
+##### 3. Resource Management (Optional)

 If your plugin needs to acquire or release resources (database connections, API clients, etc.), override the context manager methods. Paperless-ngx will always use plugins as context managers, ensuring resources can be released even in the event of errors.

-#### 4. Registering Your Plugin
+##### 4. Registering Your Plugin

 Register your plugin using a setuptools entry point in your package's `pyproject.toml`:

@@ -580,7 +822,7 @@ my_parser = "my_package.parsers:MyDateParserPlugin"

 The entry point name (e.g., `"my_parser"`) is used for sorting when multiple plugins are found. Paperless-ngx will use the first plugin alphabetically by name if multiple plugins are discovered.

-### Plugin Discovery
+#### Plugin Discovery

 Paperless-ngx automatically discovers and loads date parser plugins at runtime. The discovery process:

@@ -591,7 +833,7 @@ Paperless-ngx automatically discovers and loads date parser plugins at runtime.

 If multiple plugins are installed, a warning is logged indicating which plugin was selected.

-### Example: Simple Date Parser
+#### Example: Simple Date Parser

 Here's a minimal example that only looks for ISO 8601 dates:

@@ -623,3 +865,30 @@ class ISODateParserPlugin(DateParserPluginBase):
            if filtered_date is not None:
                yield filtered_date
 ```
+
+## Using Visual Studio Code devcontainer
+
+Another easy way to get started with development is to use Visual Studio
+Code devcontainers. This approach will create a preconfigured development
+environment with all of the required tools and dependencies.
+[Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
+The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
+contain more information about the specific tasks and launch configurations (see the
+non-standard "description" field).
+
+To get started:
+
+1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
+
+2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
+
+3. In case your host operating system is Windows:
+   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
+   - Git might have detected modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the container's terminal to fix this issue.
+
+4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
+   will initialize the database tables and create a superuser. Then you can compile the front end
+   for production or run the frontend in debug mode.
+
+5. The project is now ready for debugging: either run the fullstack debug configuration or start the individual debug
+   processes. To spin up the project without debugging, run the task **Project Start: Run all Services**.
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -248,15 +248,13 @@ lint.per-file-ignores."docker/wait-for-redis.py" = [
 lint.per-file-ignores."src/documents/models.py" = [
  "SIM115",
 ]
-lint.per-file-ignores."src/paperless_tesseract/tests/test_parser.py" = [
-  "RUF001",
-]
+
 lint.isort.force-single-line = true

 [tool.codespell]
 write-changes = true
 ignore-words-list = "criterias,afterall,valeu,ureue,equest,ure,assertIn,Oktober,commitish"
-skip = "src-ui/src/locale/*,src-ui/pnpm-lock.yaml,src-ui/e2e/*,src/paperless_mail/tests/samples/*,src/documents/tests/samples/*,*.po,*.json"
+skip = "src-ui/src/locale/*,src-ui/pnpm-lock.yaml,src-ui/e2e/*,src/paperless_mail/tests/samples/*,src/paperless/tests/samples/mail/*,src/documents/tests/samples/*,*.po,*.json"

 [tool.pytest]
 minversion = "9.0"
@@ -271,10 +269,6 @@ testpaths = [
  "src/documents/tests/",
  "src/paperless/tests/",
  "src/paperless_mail/tests/",
-  "src/paperless_tesseract/tests/",
-  "src/paperless_tika/tests",
-  "src/paperless_text/tests/",
-  "src/paperless_remote/tests/",
  "src/paperless_ai/tests",
 ]

--- a/src/documents/checks.py
+++ b/src/documents/checks.py
@@ -3,25 +3,20 @@ from django.core.checks import Error
 from django.core.checks import Warning
 from django.core.checks import register

-from documents.signals import document_consumer_declaration
 from documents.templating.utils import convert_format_str_to_template_format
+from paperless.parsers.registry import get_parser_registry


@register()
 def parser_check(app_configs, **kwargs):
-    parsers = []
-    for response in document_consumer_declaration.send(None):
-        parsers.append(response[1])
-
-    if len(parsers) == 0:
+    if not get_parser_registry().all_parsers():
        return [
            Error(
                "No parsers found. This is a bug. The consumer won't be "
                "able to consume any documents without parsers.",
            ),
        ]
-    else:
-        return []
+    return []


@register()
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -32,9 +32,7 @@ from documents.models import DocumentType
 from documents.models import StoragePath
 from documents.models import Tag
 from documents.models import WorkflowTrigger
-from documents.parsers import DocumentParser
 from documents.parsers import ParseError
-from documents.parsers import get_parser_class_for_mime_type
 from documents.permissions import set_permissions_for_object
 from documents.plugins.base import AlwaysRunPluginMixin
 from documents.plugins.base import ConsumeTaskPlugin
@@ -51,33 +49,13 @@ from documents.templating.workflows import parse_w_workflow_placeholders
 from documents.utils import copy_basic_file_stats
 from documents.utils import copy_file_with_basic_stats
 from documents.utils import run_subprocess
-from paperless.parsers.remote import RemoteDocumentParser
-from paperless.parsers.text import TextDocumentParser
-from paperless.parsers.tika import TikaDocumentParser
-from paperless_mail.parsers import MailDocumentParser
+from paperless.parsers import ParserContext
+from paperless.parsers import ParserProtocol
+from paperless.parsers.registry import get_parser_registry

 LOGGING_NAME: Final[str] = "paperless.consumer"


-def _parser_cleanup(parser: DocumentParser) -> None:
-    """
-    Call cleanup on a parser, handling the new-style context-manager parsers.
-
-    New-style parsers (e.g. TextDocumentParser) use __exit__ for teardown
-    instead of a cleanup() method.  This shim will be removed once all existing parsers
-    have switched to the new style and this consumer is updated to use it
-
-    TODO(stumpylog): Remove me in the future
-    """
-    if isinstance(
-        parser,
-        (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
-    ):
-        parser.__exit__(None, None, None)
-    else:
-        parser.cleanup()
-
-
 class WorkflowTriggerPlugin(
    NoCleanupPluginMixin,
    NoSetupPluginMixin,
@@ -414,8 +392,12 @@ class ConsumerPlugin(
                    self.log.error(f"Error attempting to clean PDF: {e}")

            # Based on the mime type, get the parser for that type
-            parser_class: type[DocumentParser] | None = get_parser_class_for_mime_type(
-                mime_type,
+            parser_class: type[ParserProtocol] | None = (
+                get_parser_registry().get_parser_for_file(
+                    mime_type,
+                    self.filename,
+                    self.working_copy,
+                )
            )
            if not parser_class:
                tempdir.cleanup()
@@ -438,316 +420,275 @@ class ConsumerPlugin(
                tempdir.cleanup()
            raise

-        def progress_callback(
-            current_progress,
-            max_progress,
-        ) -> None:  # pragma: no cover
-            # recalculate progress to be within 20 and 80
-            p = int((current_progress / max_progress) * 50 + 20)
-            self._send_progress(p, 100, ProgressStatusOptions.WORKING)
-
        # This doesn't parse the document yet, but gives us a parser.
-
-        document_parser: DocumentParser = parser_class(
-            self.logging_group,
-            progress_callback=progress_callback,
-        )
-
-        # New-style parsers use __enter__/__exit__ for resource management.
-        # _parser_cleanup (below) handles __exit__; call __enter__ here.
-        # TODO(stumpylog): Remove me in the future
-        if isinstance(
-            document_parser,
-            (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
-        ):
-            document_parser.__enter__()
-
-        self.log.debug(f"Parser: {type(document_parser).__name__}")
-
-        # Parse the document. This may take some time.
-
-        text = None
-        date = None
-        thumbnail = None
-        archive_path = None
-        page_count = None
-
-        try:
-            self._send_progress(
-                20,
-                100,
-                ProgressStatusOptions.WORKING,
-                ConsumerStatusShortMessage.PARSING_DOCUMENT,
+        with parser_class() as document_parser:
+            document_parser.configure(
+                ParserContext(mailrule_id=self.input_doc.mailrule_id),
            )
-            self.log.debug(f"Parsing {self.filename}...")
-            if (
-                isinstance(document_parser, MailDocumentParser)
-                and self.input_doc.mailrule_id
-            ):
-                document_parser.parse(
-                    self.working_copy,
-                    mime_type,
-                    self.filename,
-                    self.input_doc.mailrule_id,
-                )
-            elif isinstance(
-                document_parser,
-                (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
-            ):
-                # TODO(stumpylog): Remove me in the future
-                document_parser.parse(self.working_copy, mime_type)
-            else:
-                document_parser.parse(self.working_copy, mime_type, self.filename)

-            self.log.debug(f"Generating thumbnail for {self.filename}...")
-            self._send_progress(
-                70,
-                100,
-                ProgressStatusOptions.WORKING,
-                ConsumerStatusShortMessage.GENERATING_THUMBNAIL,
-            )
-            if isinstance(
-                document_parser,
-                (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
-            ):
-                # TODO(stumpylog): Remove me in the future
-                thumbnail = document_parser.get_thumbnail(self.working_copy, mime_type)
-            else:
-                thumbnail = document_parser.get_thumbnail(
-                    self.working_copy,
-                    mime_type,
-                    self.filename,
-                )
+            self.log.debug(f"Parser: {document_parser.name} v{document_parser.version}")

-            text = document_parser.get_text()
-            date = document_parser.get_date()
-            if date is None:
+            # Parse the document. This may take some time.
+
+            text = None
+            date = None
+            thumbnail = None
+            archive_path = None
+            page_count = None
+
+            try:
                self._send_progress(
-                    90,
+                    20,
                    100,
                    ProgressStatusOptions.WORKING,
-                    ConsumerStatusShortMessage.PARSE_DATE,
+                    ConsumerStatusShortMessage.PARSING_DOCUMENT,
                )
-                with get_date_parser() as date_parser:
-                    date = next(date_parser.parse(self.filename, text), None)
-            archive_path = document_parser.get_archive_path()
-            page_count = document_parser.get_page_count(self.working_copy, mime_type)
+                self.log.debug(f"Parsing {self.filename}...")

-        except ParseError as e:
-            _parser_cleanup(document_parser)
-            if tempdir:
-                tempdir.cleanup()
-            self._fail(
-                str(e),
-                f"Error occurred while consuming document {self.filename}: {e}",
-                exc_info=True,
-                exception=e,
-            )
-        except Exception as e:
-            _parser_cleanup(document_parser)
-            if tempdir:
-                tempdir.cleanup()
-            self._fail(
-                str(e),
-                f"Unexpected error while consuming document {self.filename}: {e}",
-                exc_info=True,
-                exception=e,
-            )
+                document_parser.parse(self.working_copy, mime_type)

-        # Prepare the document classifier.
+                self.log.debug(f"Generating thumbnail for {self.filename}...")
+                self._send_progress(
+                    70,
+                    100,
+                    ProgressStatusOptions.WORKING,
+                    ConsumerStatusShortMessage.GENERATING_THUMBNAIL,
+                )
+                thumbnail = document_parser.get_thumbnail(self.working_copy, mime_type)

-        # TODO: I don't really like to do this here, but this way we avoid
-        #   reloading the classifier multiple times, since there are multiple
-        #   post-consume hooks that all require the classifier.
-
-        classifier = load_classifier()
-
-        self._send_progress(
-            95,
-            100,
-            ProgressStatusOptions.WORKING,
-            ConsumerStatusShortMessage.SAVE_DOCUMENT,
-        )
-        # now that everything is done, we can start to store the document
-        # in the system. This will be a transaction and reasonably fast.
-        try:
-            with transaction.atomic():
-                # store the document.
-                if self.input_doc.root_document_id:
-                    # If this is a new version of an existing document, we need
-                    # to make sure we're not creating a new document, but updating
-                    # the existing one.
-                    root_doc = Document.objects.get(
-                        pk=self.input_doc.root_document_id,
+                text = document_parser.get_text()
+                date = document_parser.get_date()
+                if date is None:
+                    self._send_progress(
+                        90,
+                        100,
+                        ProgressStatusOptions.WORKING,
+                        ConsumerStatusShortMessage.PARSE_DATE,
                    )
-                    original_document = self._create_version_from_root(
-                        root_doc,
-                        text=text,
-                        page_count=page_count,
-                        mime_type=mime_type,
-                    )
-                    actor = None
+                    with get_date_parser() as date_parser:
+                        date = next(date_parser.parse(self.filename, text), None)
+                archive_path = document_parser.get_archive_path()
+                page_count = document_parser.get_page_count(
+                    self.working_copy,
+                    mime_type,
+                )

-                    # Save the new version, potentially creating an audit log entry for the version addition if enabled.
-                    if (
-                        settings.AUDIT_LOG_ENABLED
-                        and self.metadata.actor_id is not None
-                    ):
-                        actor = User.objects.filter(pk=self.metadata.actor_id).first()
-                        if actor is not None:
-                            from auditlog.context import (  # type: ignore[import-untyped]
-                                set_actor,
-                            )
+            except ParseError as e:
+                if tempdir:
+                    tempdir.cleanup()
+                self._fail(
+                    str(e),
+                    f"Error occurred while consuming document {self.filename}: {e}",
+                    exc_info=True,
+                    exception=e,
+                )
+            except Exception as e:
+                if tempdir:
+                    tempdir.cleanup()
+                self._fail(
+                    str(e),
+                    f"Unexpected error while consuming document {self.filename}: {e}",
+                    exc_info=True,
+                    exception=e,
+                )

-                            with set_actor(actor):
+            # Prepare the document classifier.
+
+            # TODO: I don't really like to do this here, but this way we avoid
+            #   reloading the classifier multiple times, since there are multiple
+            #   post-consume hooks that all require the classifier.
+
+            classifier = load_classifier()
+
+            self._send_progress(
+                95,
+                100,
+                ProgressStatusOptions.WORKING,
+                ConsumerStatusShortMessage.SAVE_DOCUMENT,
+            )
+            # now that everything is done, we can start to store the document
+            # in the system. This will be a transaction and reasonably fast.
+            try:
+                with transaction.atomic():
+                    # store the document.
+                    if self.input_doc.root_document_id:
+                        # If this is a new version of an existing document, we need
+                        # to make sure we're not creating a new document, but updating
+                        # the existing one.
+                        root_doc = Document.objects.get(
+                            pk=self.input_doc.root_document_id,
+                        )
+                        original_document = self._create_version_from_root(
+                            root_doc,
+                            text=text,
+                            page_count=page_count,
+                            mime_type=mime_type,
+                        )
+                        actor = None
+
+                        # Save the new version, potentially creating an audit log entry for the version addition if enabled.
+                        if (
+                            settings.AUDIT_LOG_ENABLED
+                            and self.metadata.actor_id is not None
+                        ):
+                            actor = User.objects.filter(
+                                pk=self.metadata.actor_id,
+                            ).first()
+                            if actor is not None:
+                                from auditlog.context import (  # type: ignore[import-untyped]
+                                    set_actor,
+                                )
+
+                                with set_actor(actor):
+                                    original_document.save()
+                            else:
                                original_document.save()
                        else:
                            original_document.save()
+
+                        # Create a log entry for the version addition, if enabled
+                        if settings.AUDIT_LOG_ENABLED:
+                            from auditlog.models import (  # type: ignore[import-untyped]
+                                LogEntry,
+                            )
+
+                            LogEntry.objects.log_create(
+                                instance=root_doc,
+                                changes={
+                                    "Version Added": ["None", original_document.id],
+                                },
+                                action=LogEntry.Action.UPDATE,
+                                actor=actor,
+                                additional_data={
+                                    "reason": "Version added",
+                                    "version_id": original_document.id,
+                                },
+                            )
+                        document = original_document
                    else:
-                        original_document.save()
-
-                    # Create a log entry for the version addition, if enabled
-                    if settings.AUDIT_LOG_ENABLED:
-                        from auditlog.models import (  # type: ignore[import-untyped]
-                            LogEntry,
+                        document = self._store(
+                            text=text,
+                            date=date,
+                            page_count=page_count,
+                            mime_type=mime_type,
                        )

-                        LogEntry.objects.log_create(
-                            instance=root_doc,
-                            changes={
-                                "Version Added": ["None", original_document.id],
-                            },
-                            action=LogEntry.Action.UPDATE,
-                            actor=actor,
-                            additional_data={
-                                "reason": "Version added",
-                                "version_id": original_document.id,
-                            },
-                        )
-                    document = original_document
-                else:
-                    document = self._store(
-                        text=text,
-                        date=date,
-                        page_count=page_count,
-                        mime_type=mime_type,
-                    )
+                    # If we get here, it was successful. Proceed with post-consume
+                    # hooks. If they fail, nothing will get changed.

-                # If we get here, it was successful. Proceed with post-consume
-                # hooks. If they fail, nothing will get changed.
-
-                document_consumption_finished.send(
-                    sender=self.__class__,
-                    document=document,
-                    logging_group=self.logging_group,
-                    classifier=classifier,
-                    original_file=self.unmodified_original
-                    if self.unmodified_original
-                    else self.working_copy,
-                )
-
-                # After everything is in the database, copy the files into
-                # place. If this fails, we'll also rollback the transaction.
-                with FileLock(settings.MEDIA_LOCK):
-                    generated_filename = generate_unique_filename(document)
-                    if (
-                        len(str(generated_filename))
-                        > Document.MAX_STORED_FILENAME_LENGTH
-                    ):
-                        self.log.warning(
-                            "Generated source filename exceeds db path limit, falling back to default naming",
-                        )
-                        generated_filename = generate_filename(
-                            document,
-                            use_format=False,
-                        )
-                    document.filename = generated_filename
-                    create_source_path_directory(document.source_path)
-
-                    self._write(
-                        self.unmodified_original
-                        if self.unmodified_original is not None
+                    document_consumption_finished.send(
+                        sender=self.__class__,
+                        document=document,
+                        logging_group=self.logging_group,
+                        classifier=classifier,
+                        original_file=self.unmodified_original
+                        if self.unmodified_original
                        else self.working_copy,
-                        document.source_path,
                    )

-                    self._write(
-                        thumbnail,
-                        document.thumbnail_path,
-                    )
-
-                    if archive_path and Path(archive_path).is_file():
-                        generated_archive_filename = generate_unique_filename(
-                            document,
-                            archive_filename=True,
-                        )
+                    # After everything is in the database, copy the files into
+                    # place. If this fails, the transaction is rolled back as well.
+                    with FileLock(settings.MEDIA_LOCK):
+                        generated_filename = generate_unique_filename(document)
                        if (
-                            len(str(generated_archive_filename))
+                            len(str(generated_filename))
                            > Document.MAX_STORED_FILENAME_LENGTH
                        ):
                            self.log.warning(
-                                "Generated archive filename exceeds db path limit, falling back to default naming",
+                                "Generated source filename exceeds db path limit, falling back to default naming",
                            )
-                            generated_archive_filename = generate_filename(
+                            generated_filename = generate_filename(
                                document,
-                                archive_filename=True,
                                use_format=False,
                            )
-                        document.archive_filename = generated_archive_filename
-                        create_source_path_directory(document.archive_path)
+                        document.filename = generated_filename
+                        create_source_path_directory(document.source_path)
+
                        self._write(
-                            archive_path,
-                            document.archive_path,
+                            self.unmodified_original
+                            if self.unmodified_original is not None
+                            else self.working_copy,
+                            document.source_path,
                        )

-                        with Path(archive_path).open("rb") as f:
-                            document.archive_checksum = hashlib.md5(
-                                f.read(),
-                            ).hexdigest()
+                        self._write(
+                            thumbnail,
+                            document.thumbnail_path,
+                        )

-                # Don't save with the lock active. Saving will cause the file
-                # renaming logic to acquire the lock as well.
-                # This triggers things like file renaming
-                document.save()
+                        if archive_path and Path(archive_path).is_file():
+                            generated_archive_filename = generate_unique_filename(
+                                document,
+                                archive_filename=True,
+                            )
+                            if (
+                                len(str(generated_archive_filename))
+                                > Document.MAX_STORED_FILENAME_LENGTH
+                            ):
+                                self.log.warning(
+                                    "Generated archive filename exceeds db path limit, falling back to default naming",
+                                )
+                                generated_archive_filename = generate_filename(
+                                    document,
+                                    archive_filename=True,
+                                    use_format=False,
+                                )
+                            document.archive_filename = generated_archive_filename
+                            create_source_path_directory(document.archive_path)
+                            self._write(
+                                archive_path,
+                                document.archive_path,
+                            )

-                if document.root_document_id:
-                    document_updated.send(
-                        sender=self.__class__,
-                        document=document.root_document,
-                    )
+                            with Path(archive_path).open("rb") as f:
+                                document.archive_checksum = hashlib.md5(
+                                    f.read(),
+                                ).hexdigest()

-                # Delete the file only if it was successfully consumed
-                self.log.debug(f"Deleting original file {self.input_doc.original_file}")
-                self.input_doc.original_file.unlink()
-                self.log.debug(f"Deleting working copy {self.working_copy}")
-                self.working_copy.unlink()
-                if self.unmodified_original is not None:  # pragma: no cover
+                    # Don't save with the lock active. Saving will cause the file
+                    # renaming logic to acquire the lock as well.
+                    # This triggers things like file renaming
+                    document.save()
+
+                    if document.root_document_id:
+                        document_updated.send(
+                            sender=self.__class__,
+                            document=document.root_document,
+                        )
+
+                    # Delete the file only if it was successfully consumed
                    self.log.debug(
-                        f"Deleting unmodified original file {self.unmodified_original}",
+                        f"Deleting original file {self.input_doc.original_file}",
                    )
-                    self.unmodified_original.unlink()
+                    self.input_doc.original_file.unlink()
+                    self.log.debug(f"Deleting working copy {self.working_copy}")
+                    self.working_copy.unlink()
+                    if self.unmodified_original is not None:  # pragma: no cover
+                        self.log.debug(
+                            f"Deleting unmodified original file {self.unmodified_original}",
+                        )
+                        self.unmodified_original.unlink()

-                # https://github.com/jonaswinkler/paperless-ng/discussions/1037
-                shadow_file = (
-                    Path(self.input_doc.original_file).parent
-                    / f"._{Path(self.input_doc.original_file).name}"
+                    # https://github.com/jonaswinkler/paperless-ng/discussions/1037
+                    shadow_file = (
+                        Path(self.input_doc.original_file).parent
+                        / f"._{Path(self.input_doc.original_file).name}"
+                    )
+
+                    if Path(shadow_file).is_file():
+                        self.log.debug(f"Deleting shadow file {shadow_file}")
+                        Path(shadow_file).unlink()
+
+            except Exception as e:
+                self._fail(
+                    str(e),
+                    f"The following error occurred while storing document "
+                    f"{self.filename} after parsing: {e}",
+                    exc_info=True,
+                    exception=e,
                )
-
-                if Path(shadow_file).is_file():
-                    self.log.debug(f"Deleting shadow file {shadow_file}")
-                    Path(shadow_file).unlink()
-
-        except Exception as e:
-            self._fail(
-                str(e),
-                f"The following error occurred while storing document "
-                f"{self.filename} after parsing: {e}",
-                exc_info=True,
-                exception=e,
-            )
-        finally:
-            _parser_cleanup(document_parser)
-            tempdir.cleanup()
+            finally:
+                tempdir.cleanup()

        self.run_post_consume_script(document)

--- a/src/documents/management/commands/document_thumbnails.py
+++ b/src/documents/management/commands/document_thumbnails.py
@@ -3,14 +3,18 @@ import shutil

 from documents.management.commands.base import PaperlessCommand
 from documents.models import Document
-from documents.parsers import get_parser_class_for_mime_type
+from paperless.parsers.registry import get_parser_registry

 logger = logging.getLogger("paperless.management.thumbnails")


 def _process_document(doc_id: int) -> None:
    document: Document = Document.objects.get(id=doc_id)
-    parser_class = get_parser_class_for_mime_type(document.mime_type)
+    parser_class = get_parser_registry().get_parser_for_file(
+        document.mime_type,
+        document.original_filename or "",
+        document.source_path,
+    )

    if parser_class is None:
        logger.warning(
@@ -20,18 +24,9 @@ def _process_document(doc_id: int) -> None:
        )
        return

-    parser = parser_class(logging_group=None)
-
-    try:
-        thumb = parser.get_thumbnail(
-            document.source_path,
-            document.mime_type,
-            document.get_public_filename(),
-        )
+    with parser_class() as parser:
+        thumb = parser.get_thumbnail(document.source_path, document.mime_type)
        shutil.move(thumb, document.thumbnail_path)
-    finally:
-        # TODO(stumpylog): Cleanup once all parsers are handled
-        parser.cleanup()


 class Command(PaperlessCommand):
--- a/src/documents/parsers.py
+++ b/src/documents/parsers.py
@@ -3,84 +3,47 @@ from __future__ import annotations
 import logging
 import mimetypes
 import os
-import re
 import shutil
 import subprocess
 import tempfile
-from functools import lru_cache
 from pathlib import Path
 from typing import TYPE_CHECKING

 from django.conf import settings

 from documents.loggers import LoggingMixin
-from documents.signals import document_consumer_declaration
 from documents.utils import copy_file_with_basic_stats
 from documents.utils import run_subprocess
+from paperless.parsers.registry import get_parser_registry

 if TYPE_CHECKING:
    import datetime

-# This regular expression will try to find dates in the document at
-# hand and will match the following formats:
-# - XX.YY.ZZZZ with XX + YY being 1 or 2 and ZZZZ being 2 or 4 digits
-# - XX/YY/ZZZZ with XX + YY being 1 or 2 and ZZZZ being 2 or 4 digits
-# - XX-YY-ZZZZ with XX + YY being 1 or 2 and ZZZZ being 2 or 4 digits
-# - ZZZZ.XX.YY with XX + YY being 1 or 2 and ZZZZ being 2 or 4 digits
-# - ZZZZ/XX/YY with XX + YY being 1 or 2 and ZZZZ being 2 or 4 digits
-# - ZZZZ-XX-YY with XX + YY being 1 or 2 and ZZZZ being 2 or 4 digits
-# - XX. MONTH ZZZZ with XX being 1 or 2 and ZZZZ being 2 or 4 digits
-# - MONTH ZZZZ, with ZZZZ being 4 digits
-# - MONTH XX, ZZZZ with XX being 1 or 2 and ZZZZ being 4 digits
-# - XX MON ZZZZ with XX being 1 or 2 and ZZZZ being 4 digits. MONTH is 3 letters
-# - XXPP MONTH ZZZZ with XX being 1 or 2 and PP being 2 letters and ZZZZ being 4 digits
-
-# TODO: isn't there a date parsing library for this?
-
-DATE_REGEX = re.compile(
-    r"(\b|(?!=([_-])))(\d{1,2})[\.\/-](\d{1,2})[\.\/-](\d{4}|\d{2})(\b|(?=([_-])))|"
-    r"(\b|(?!=([_-])))(\d{4}|\d{2})[\.\/-](\d{1,2})[\.\/-](\d{1,2})(\b|(?=([_-])))|"
-    r"(\b|(?!=([_-])))(\d{1,2}[\. ]+[a-zéûäëčžúřěáíóńźçŞğü]{3,9} \d{4}|[a-zéûäëčžúřěáíóńźçŞğü]{3,9} \d{1,2}, \d{4})(\b|(?=([_-])))|"
-    r"(\b|(?!=([_-])))([^\W\d_]{3,9} \d{1,2}, (\d{4}))(\b|(?=([_-])))|"
-    r"(\b|(?!=([_-])))([^\W\d_]{3,9} \d{4})(\b|(?=([_-])))|"
-    r"(\b|(?!=([_-])))(\d{1,2}[^ 0-9]{2}[\. ]+[^ ]{3,9}[ \.\/-]\d{4})(\b|(?=([_-])))|"
-    r"(\b|(?!=([_-])))(\b\d{1,2}[ \.\/-][a-zéûäëčžúřěáíóńźçŞğü]{3}[ \.\/-]\d{4})(\b|(?=([_-])))",
-    re.IGNORECASE,
-)
-
-
 logger = logging.getLogger("paperless.parsing")


-@lru_cache(maxsize=8)
 def is_mime_type_supported(mime_type: str) -> bool:
    """
    Returns True if the mime type is supported, False otherwise
    """
-    return get_parser_class_for_mime_type(mime_type) is not None
+    return get_parser_registry().get_parser_for_file(mime_type, "") is not None


-@lru_cache(maxsize=8)
 def get_default_file_extension(mime_type: str) -> str:
    """
    Returns the default file extension for a mimetype, or
    an empty string if it could not be determined
    """
-    for response in document_consumer_declaration.send(None):
-        parser_declaration = response[1]
-        supported_mime_types = parser_declaration["mime_types"]
-
-        if mime_type in supported_mime_types:
-            return supported_mime_types[mime_type]
+    parser_class = get_parser_registry().get_parser_for_file(mime_type, "")
+    if parser_class is not None:
+        supported = parser_class.supported_mime_types()
+        if mime_type in supported:
+            return supported[mime_type]

    ext = mimetypes.guess_extension(mime_type)
-    if ext:
-        return ext
-    else:
-        return ""
+    return ext or ""


-@lru_cache(maxsize=8)
 def is_file_ext_supported(ext: str) -> bool:
    """
    Returns True if the file extension is supported, False otherwise
@@ -94,44 +57,17 @@ def is_file_ext_supported(ext: str) -> bool:

 def get_supported_file_extensions() -> set[str]:
    extensions = set()
-    for response in document_consumer_declaration.send(None):
-        parser_declaration = response[1]
-        supported_mime_types = parser_declaration["mime_types"]
-
-        for mime_type in supported_mime_types:
+    for parser_class in get_parser_registry().all_parsers():
+        for mime_type, ext in parser_class.supported_mime_types().items():
            extensions.update(mimetypes.guess_all_extensions(mime_type))
            # Python's stdlib might be behind, so also add what the parser
            # says is the default extension
            # This makes image/webp supported on Python < 3.11
-            extensions.add(supported_mime_types[mime_type])
+            extensions.add(ext)

    return extensions


-def get_parser_class_for_mime_type(mime_type: str) -> type[DocumentParser] | None:
-    """
-    Returns the best parser (by weight) for the given mimetype or
-    None if no parser exists
-    """
-
-    options = []
-
-    for response in document_consumer_declaration.send(None):
-        parser_declaration = response[1]
-        supported_mime_types = parser_declaration["mime_types"]
-
-        if mime_type in supported_mime_types:
-            options.append(parser_declaration)
-
-    if not options:
-        return None
-
-    best_parser = sorted(options, key=lambda _: _["weight"], reverse=True)[0]
-
-    # Return the parser with the highest weight.
-    return best_parser["parser"]
-
-
 def run_convert(
    input_file,
    output_file,
--- a/src/documents/signals/init.py
+++ b/src/documents/signals/init.py
@@ -2,5 +2,4 @@ from django.dispatch import Signal

 document_consumption_started = Signal()
 document_consumption_finished = Signal()
-document_consumer_declaration = Signal()
 document_updated = Signal()
--- a/src/documents/tasks.py
+++ b/src/documents/tasks.py
@@ -52,8 +52,6 @@ from documents.models import StoragePath
 from documents.models import Tag
 from documents.models import WorkflowRun
 from documents.models import WorkflowTrigger
-from documents.parsers import DocumentParser
-from documents.parsers import get_parser_class_for_mime_type
 from documents.plugins.base import ConsumeTaskPlugin
 from documents.plugins.base import ProgressManager
 from documents.plugins.base import StopConsumeTaskError
@@ -65,6 +63,8 @@ from documents.signals.handlers import run_workflows
 from documents.signals.handlers import send_websocket_document_updated
 from documents.workflows.utils import get_workflows_for_trigger
 from paperless.config import AIConfig
+from paperless.parsers import ParserContext
+from paperless.parsers.registry import get_parser_registry
 from paperless_ai.indexing import llm_index_add_or_update_document
 from paperless_ai.indexing import llm_index_remove_document
 from paperless_ai.indexing import update_llm_index
@@ -304,7 +304,11 @@ def update_document_content_maybe_archive_file(document_id) -> None:

    mime_type = document.mime_type

-    parser_class: type[DocumentParser] = get_parser_class_for_mime_type(mime_type)
+    parser_class = get_parser_registry().get_parser_for_file(
+        mime_type,
+        document.original_filename or "",
+        document.source_path,
+    )

    if not parser_class:
        logger.error(
@@ -313,98 +317,92 @@ def update_document_content_maybe_archive_file(document_id) -> None:
        )
        return

-    parser: DocumentParser = parser_class(logging_group=uuid.uuid4())
+    with parser_class() as parser:
+        parser.configure(ParserContext())

-    try:
-        parser.parse(document.source_path, mime_type, document.get_public_filename())
+        try:
+            parser.parse(document.source_path, mime_type)

-        thumbnail = parser.get_thumbnail(
-            document.source_path,
-            mime_type,
-            document.get_public_filename(),
-        )
+            thumbnail = parser.get_thumbnail(document.source_path, mime_type)

-        with transaction.atomic():
-            oldDocument = Document.objects.get(pk=document.pk)
-            if parser.get_archive_path():
-                with Path(parser.get_archive_path()).open("rb") as f:
-                    checksum = hashlib.md5(f.read()).hexdigest()
-                # I'm going to save first so that in case the file move
-                # fails, the database is rolled back.
-                # We also don't use save() since that triggers the filehandling
-                # logic, and we don't want that yet (file not yet in place)
-                document.archive_filename = generate_unique_filename(
-                    document,
-                    archive_filename=True,
-                )
-                Document.objects.filter(pk=document.pk).update(
-                    archive_checksum=checksum,
-                    content=parser.get_text(),
-                    archive_filename=document.archive_filename,
-                )
-                newDocument = Document.objects.get(pk=document.pk)
-                if settings.AUDIT_LOG_ENABLED:
-                    LogEntry.objects.log_create(
-                        instance=oldDocument,
-                        changes={
-                            "content": [oldDocument.content, newDocument.content],
-                            "archive_checksum": [
-                                oldDocument.archive_checksum,
-                                newDocument.archive_checksum,
-                            ],
-                            "archive_filename": [
-                                oldDocument.archive_filename,
-                                newDocument.archive_filename,
-                            ],
-                        },
-                        additional_data={
-                            "reason": "Update document content",
-                        },
-                        action=LogEntry.Action.UPDATE,
-                    )
-            else:
-                Document.objects.filter(pk=document.pk).update(
-                    content=parser.get_text(),
-                )
-
-                if settings.AUDIT_LOG_ENABLED:
-                    LogEntry.objects.log_create(
-                        instance=oldDocument,
-                        changes={
-                            "content": [oldDocument.content, parser.get_text()],
-                        },
-                        additional_data={
-                            "reason": "Update document content",
-                        },
-                        action=LogEntry.Action.UPDATE,
-                    )
-
-            with FileLock(settings.MEDIA_LOCK):
+            with transaction.atomic():
+                oldDocument = Document.objects.get(pk=document.pk)
                if parser.get_archive_path():
-                    create_source_path_directory(document.archive_path)
-                    shutil.move(parser.get_archive_path(), document.archive_path)
-                shutil.move(thumbnail, document.thumbnail_path)
+                    with Path(parser.get_archive_path()).open("rb") as f:
+                        checksum = hashlib.md5(f.read()).hexdigest()
+                    # I'm going to save first so that in case the file move
+                    # fails, the database is rolled back.
+                    # We also don't use save() since that triggers the filehandling
+                    # logic, and we don't want that yet (file not yet in place)
+                    document.archive_filename = generate_unique_filename(
+                        document,
+                        archive_filename=True,
+                    )
+                    Document.objects.filter(pk=document.pk).update(
+                        archive_checksum=checksum,
+                        content=parser.get_text(),
+                        archive_filename=document.archive_filename,
+                    )
+                    newDocument = Document.objects.get(pk=document.pk)
+                    if settings.AUDIT_LOG_ENABLED:
+                        LogEntry.objects.log_create(
+                            instance=oldDocument,
+                            changes={
+                                "content": [oldDocument.content, newDocument.content],
+                                "archive_checksum": [
+                                    oldDocument.archive_checksum,
+                                    newDocument.archive_checksum,
+                                ],
+                                "archive_filename": [
+                                    oldDocument.archive_filename,
+                                    newDocument.archive_filename,
+                                ],
+                            },
+                            additional_data={
+                                "reason": "Update document content",
+                            },
+                            action=LogEntry.Action.UPDATE,
+                        )
+                else:
+                    Document.objects.filter(pk=document.pk).update(
+                        content=parser.get_text(),
+                    )

-        document.refresh_from_db()
-        logger.info(
-            f"Updating index for document {document_id} ({document.archive_checksum})",
-        )
-        with index.open_index_writer() as writer:
-            index.update_document(writer, document)
+                    if settings.AUDIT_LOG_ENABLED:
+                        LogEntry.objects.log_create(
+                            instance=oldDocument,
+                            changes={
+                                "content": [oldDocument.content, parser.get_text()],
+                            },
+                            additional_data={
+                                "reason": "Update document content",
+                            },
+                            action=LogEntry.Action.UPDATE,
+                        )

-        ai_config = AIConfig()
-        if ai_config.llm_index_enabled:
-            llm_index_add_or_update_document(document)
+                with FileLock(settings.MEDIA_LOCK):
+                    if parser.get_archive_path():
+                        create_source_path_directory(document.archive_path)
+                        shutil.move(parser.get_archive_path(), document.archive_path)
+                    shutil.move(thumbnail, document.thumbnail_path)

-        clear_document_caches(document.pk)
+            document.refresh_from_db()
+            logger.info(
+                f"Updating index for document {document_id} ({document.archive_checksum})",
+            )
+            with index.open_index_writer() as writer:
+                index.update_document(writer, document)

-    except Exception:
-        logger.exception(
-            f"Error while parsing document {document} (ID: {document_id})",
-        )
-    finally:
-        # TODO(stumpylog): Cleanup once all parsers are handled
-        parser.cleanup()
+            ai_config = AIConfig()
+            if ai_config.llm_index_enabled:
+                llm_index_add_or_update_document(document)
+
+            clear_document_caches(document.pk)
+
+        except Exception:
+            logger.exception(
+                f"Error while parsing document {document} (ID: {document_id})",
+            )


@shared_task
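Reviewer note: the hunk above drops the explicit `parser.cleanup()` in a `finally` block in favour of `with parser_class() as parser:`. The sketch below illustrates why that can be sufficient, assuming the parser removes its scratch directory in `__exit__` (the test parsers in this diff do so). `ScratchParser` and `run_once` are hypothetical stand-ins, not names from the codebase.

```python
import shutil
import tempfile
from pathlib import Path


class ScratchParser:
    """Hypothetical parser that owns a temporary working directory."""

    def __init__(self) -> None:
        self.tmpdir = None

    def __enter__(self):
        # Create the scratch directory only when the parser is actually in use.
        self.tmpdir = Path(tempfile.mkdtemp(prefix="sketch-parser-"))
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        # Runs on normal exit and on exceptions alike.
        if self.tmpdir is not None and self.tmpdir.exists():
            shutil.rmtree(self.tmpdir, ignore_errors=True)

    def parse(self, document_path, mime_type) -> None:
        raise ValueError("simulated parse failure")


def run_once():
    """Mirror the task's shape: try/except nested inside the with block."""
    with ScratchParser() as parser:
        scratch = parser.tmpdir
        existed_during_parse = scratch.exists()
        try:
            parser.parse("doc.pdf", "application/pdf")
        except Exception:
            pass  # the real task logs the exception here
    return scratch, existed_during_parse


scratch_dir, existed = run_once()
```

Because the `except` sits inside the `with`, the error is handled first and the context manager still cleans up afterwards, so no separate `finally` is needed in this arrangement.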
--- a/src/documents/tests/test_checks.py
+++ b/src/documents/tests/test_checks.py
@@ -13,8 +13,10 @@ class TestDocumentChecks(TestCase):
    def test_parser_check(self) -> None:
        self.assertEqual(parser_check(None), [])

-        with mock.patch("documents.checks.document_consumer_declaration.send") as m:
-            m.return_value = []
+        with mock.patch("documents.checks.get_parser_registry") as mock_registry_fn:
+            mock_registry = mock.MagicMock()
+            mock_registry.all_parsers.return_value = []
+            mock_registry_fn.return_value = mock_registry

            self.assertEqual(
                parser_check(None),
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -27,7 +27,6 @@ from documents.models import Document
 from documents.models import DocumentType
 from documents.models import StoragePath
 from documents.models import Tag
-from documents.parsers import DocumentParser
 from documents.parsers import ParseError
 from documents.plugins.helpers import ProgressStatusOptions
 from documents.tasks import sanity_check
@@ -36,65 +35,108 @@ from documents.tests.utils import DummyProgressManager
 from documents.tests.utils import FileSystemAssertsMixin
 from documents.tests.utils import GetConsumerMixin
 from paperless_mail.models import MailRule
-from paperless_mail.parsers import MailDocumentParser


-class _BaseTestParser(DocumentParser):
-    def get_settings(self) -> None:
+class _BaseNewStyleParser:
+    """Minimal ParserProtocol implementation for use in consumer tests."""
+
+    name: str = "test-parser"
+    version: str = "0.1"
+    author: str = "test"
+    url: str = "test"
+
+    @classmethod
+    def supported_mime_types(cls) -> dict:
+        return {
+            "application/pdf": ".pdf",
+            "image/png": ".png",
+            "message/rfc822": ".eml",
+        }
+
+    @classmethod
+    def score(cls, mime_type: str, filename: str, path=None):
+        return 0 if mime_type in cls.supported_mime_types() else None
+
+    @property
+    def can_produce_archive(self) -> bool:
+        return True
+
+    @property
+    def requires_pdf_rendition(self) -> bool:
+        return False
+
+    def __init__(self) -> None:
+        self._tmpdir: Path | None = None
+        self._text: str | None = None
+        self._archive: Path | None = None
+        self._thumb: Path | None = None
+
+    def __enter__(self):
+        self._tmpdir = Path(
+            tempfile.mkdtemp(prefix="paperless-test-", dir=settings.SCRATCH_DIR),
+        )
+        _, thumb = tempfile.mkstemp(suffix=".webp", dir=self._tmpdir)
+        self._thumb = Path(thumb)
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
+        if self._tmpdir and self._tmpdir.exists():
+            shutil.rmtree(self._tmpdir, ignore_errors=True)
+
+    def configure(self, context) -> None:
        """
-        This parser does not implement additional settings yet
+        Test parser doesn't do anything with context
        """
+
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
+        raise NotImplementedError
+
+    def get_text(self) -> str | None:
+        return self._text
+
+    def get_date(self):
        return None

+    def get_archive_path(self):
+        return self._archive

-class DummyParser(_BaseTestParser):
-    def __init__(self, logging_group, scratch_dir, archive_path) -> None:
-        super().__init__(logging_group, None)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=scratch_dir)
-        self.archive_path = archive_path
+    def get_thumbnail(self, document_path, mime_type) -> Path:
+        return self._thumb

-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb
+    def get_page_count(self, document_path, mime_type):
+        return None

-    def parse(self, document_path, mime_type, file_name=None) -> None:
-        self.text = "The Text"
+    def extract_metadata(self, document_path, mime_type) -> list:
+        return []


-class CopyParser(_BaseTestParser):
-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb
+class DummyParser(_BaseNewStyleParser):
+    _ARCHIVE_SRC = (
+        Path(__file__).parent / "samples" / "documents" / "archive" / "0000001.pdf"
+    )

-    def __init__(self, logging_group, progress_callback=None) -> None:
-        super().__init__(logging_group, progress_callback)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=self.tempdir)
-
-    def parse(self, document_path, mime_type, file_name=None) -> None:
-        self.text = "The text"
-        self.archive_path = Path(self.tempdir / "archive.pdf")
-        shutil.copy(document_path, self.archive_path)
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
+        self._text = "The Text"
+        if produce_archive and self._tmpdir:
+            self._archive = self._tmpdir / "archive.pdf"
+            shutil.copy(self._ARCHIVE_SRC, self._archive)


-class FaultyParser(_BaseTestParser):
-    def __init__(self, logging_group, scratch_dir) -> None:
-        super().__init__(logging_group)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=scratch_dir)
+class CopyParser(_BaseNewStyleParser):
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
+        self._text = "The text"
+        if produce_archive and self._tmpdir:
+            self._archive = self._tmpdir / "archive.pdf"
+            shutil.copy(document_path, self._archive)

-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb

-    def parse(self, document_path, mime_type, file_name=None):
+class FaultyParser(_BaseNewStyleParser):
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
        raise ParseError("Does not compute.")


-class FaultyGenericExceptionParser(_BaseTestParser):
-    def __init__(self, logging_group, scratch_dir) -> None:
-        super().__init__(logging_group)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=scratch_dir)
-
-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb
-
-    def parse(self, document_path, mime_type, file_name=None):
+class FaultyGenericExceptionParser(_BaseNewStyleParser):
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
        raise Exception("Generic exception.")


@@ -148,38 +190,12 @@ class TestConsumer(
        self.assertEqual(payload["data"]["max_progress"], last_progress_max)
        self.assertEqual(payload["data"]["status"], last_status)

-    def make_dummy_parser(self, logging_group, progress_callback=None):
-        return DummyParser(
-            logging_group,
-            self.dirs.scratch_dir,
-            self.get_test_archive_file(),
-        )
-
-    def make_faulty_parser(self, logging_group, progress_callback=None):
-        return FaultyParser(logging_group, self.dirs.scratch_dir)
-
-    def make_faulty_generic_exception_parser(
-        self,
-        logging_group,
-        progress_callback=None,
-    ):
-        return FaultyGenericExceptionParser(logging_group, self.dirs.scratch_dir)
-
    def setUp(self) -> None:
        super().setUp()

-        patcher = mock.patch("documents.parsers.document_consumer_declaration.send")
-        m = patcher.start()
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": self.make_dummy_parser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        patcher = mock.patch("documents.consumer.get_parser_registry")
+        mock_registry = patcher.start()
+        mock_registry.return_value.get_parser_for_file.return_value = DummyParser
        self.addCleanup(patcher.stop)

    def get_test_file(self):
@@ -548,9 +564,9 @@ class TestConsumer(
            ) as consumer:
                consumer.run()

-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def testNoParsers(self, m) -> None:
-        m.return_value = []
+        m.return_value.get_parser_for_file.return_value = None

        with self.assertRaisesMessage(
            ConsumerError,
@@ -561,18 +577,9 @@ class TestConsumer(

        self._assert_first_last_send_progress(last_status="FAILED")

-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def testFaultyParser(self, m) -> None:
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": self.make_faulty_parser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        m.return_value.get_parser_for_file.return_value = FaultyParser

        with self.get_consumer(self.get_test_file()) as consumer:
            with self.assertRaisesMessage(
@@ -583,18 +590,9 @@ class TestConsumer(

        self._assert_first_last_send_progress(last_status="FAILED")

-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def testGenericParserException(self, m) -> None:
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": self.make_faulty_generic_exception_parser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        m.return_value.get_parser_for_file.return_value = FaultyGenericExceptionParser

        with self.get_consumer(self.get_test_file()) as consumer:
            with self.assertRaisesMessage(
@@ -1018,7 +1016,7 @@ class TestConsumer(
        self._assert_first_last_send_progress()

    @override_settings(FILENAME_FORMAT="{title}")
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def test_similar_filenames(self, m) -> None:
        shutil.copy(
            Path(__file__).parent / "samples" / "simple.pdf",
@@ -1032,16 +1030,7 @@ class TestConsumer(
            Path(__file__).parent / "samples" / "simple-noalpha.png",
            settings.CONSUMPTION_DIR / "simple.png.pdf",
        )
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": CopyParser,
-                    "mime_types": {"application/pdf": ".pdf", "image/png": ".png"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        m.return_value.get_parser_for_file.return_value = CopyParser

        with self.get_consumer(settings.CONSUMPTION_DIR / "simple.png") as consumer:
            consumer.run()
@@ -1069,8 +1058,10 @@ class TestConsumer(

        sanity_check()

+    @mock.patch("documents.consumer.get_parser_registry")
    @mock.patch("documents.consumer.run_subprocess")
-    def test_try_to_clean_invalid_pdf(self, m) -> None:
+    def test_try_to_clean_invalid_pdf(self, m, mock_registry) -> None:
+        mock_registry.return_value.get_parser_for_file.return_value = None
        shutil.copy(
            Path(__file__).parent / "samples" / "invalid_pdf.pdf",
            settings.CONSUMPTION_DIR / "invalid_pdf.pdf",
@@ -1091,11 +1082,11 @@ class TestConsumer(
            self.assertEqual(command[1], "--replace-input")

    @mock.patch("paperless_mail.models.MailRule.objects.get")
-    @mock.patch("paperless_mail.parsers.MailDocumentParser.parse")
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("paperless.parsers.mail.MailDocumentParser.parse")
+    @mock.patch("documents.consumer.get_parser_registry")
    def test_mail_parser_receives_mailrule(
        self,
-        mock_consumer_declaration_send: mock.Mock,
+        mock_get_parser_registry: mock.Mock,
        mock_mail_parser_parse: mock.Mock,
        mock_mailrule_get: mock.Mock,
    ) -> None:
@@ -1107,25 +1098,21 @@ class TestConsumer(
        THEN:
            - The mail parser should receive the mail rule
        """
-        mock_consumer_declaration_send.return_value = [
-            (
-                None,
-                {
-                    "parser": MailDocumentParser,
-                    "mime_types": {"message/rfc822": ".eml"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        from paperless.parsers.mail import MailDocumentParser
+
+        mock_get_parser_registry.return_value.get_parser_for_file.return_value = (
+            MailDocumentParser
+        )
        mock_mailrule_get.return_value = mock.Mock(
            pdf_layout=MailRule.PdfLayout.HTML_ONLY,
        )
        with self.get_consumer(
            filepath=(
                Path(__file__).parent.parent.parent
-                / Path("paperless_mail")
+                / Path("paperless")
                / Path("tests")
                / Path("samples")
+                / Path("mail")
            ).resolve()
            / "html.eml",
            source=DocumentSource.MailFetch,
@@ -1136,12 +1123,10 @@ class TestConsumer(
                ConsumerError,
            ):
                consumer.run()
-                mock_mail_parser_parse.assert_called_once_with(
-                    consumer.working_copy,
-                    "message/rfc822",
-                    file_name="sample.pdf",
-                    mailrule=mock_mailrule_get.return_value,
-                )
+            mock_mail_parser_parse.assert_called_once_with(
+                consumer.working_copy,
+                "message/rfc822",
+            )


@mock.patch("documents.consumer.magic.from_file", fake_magic_from_file)
--- a/src/documents/tests/test_parsers.py
+++ b/src/documents/tests/test_parsers.py
@@ -1,130 +1,14 @@
-from tempfile import TemporaryDirectory
-from unittest import mock
-
-from django.apps import apps
 from django.test import TestCase
 from django.test import override_settings

 from documents.parsers import get_default_file_extension
-from documents.parsers import get_parser_class_for_mime_type
 from documents.parsers import get_supported_file_extensions
 from documents.parsers import is_file_ext_supported
+from paperless.parsers.registry import get_parser_registry
+from paperless.parsers.registry import reset_parser_registry
+from paperless.parsers.tesseract import RasterisedDocumentParser
 from paperless.parsers.text import TextDocumentParser
 from paperless.parsers.tika import TikaDocumentParser
-from paperless_tesseract.parsers import RasterisedDocumentParser
-
-
-class TestParserDiscovery(TestCase):
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_1_parser(self, m, *args) -> None:
-        """
-        GIVEN:
-            - Parser declared for a given mimetype
-        WHEN:
-            - Attempt to get parser for the mimetype
-        THEN:
-            - Declared parser class is returned
-        """
-
-        class DummyParser:
-            pass
-
-        m.return_value = (
-            (
-                None,
-                {
-                    "weight": 0,
-                    "parser": DummyParser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-        )
-
-        self.assertEqual(get_parser_class_for_mime_type("application/pdf"), DummyParser)
-
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_n_parsers(self, m, *args) -> None:
-        """
-        GIVEN:
-            - Two parsers declared for a given mimetype
-            - Second parser has a higher weight
-        WHEN:
-            - Attempt to get parser for the mimetype
-        THEN:
-            - Second parser class is returned
-        """
-
-        class DummyParser1:
-            pass
-
-        class DummyParser2:
-            pass
-
-        m.return_value = (
-            (
-                None,
-                {
-                    "weight": 0,
-                    "parser": DummyParser1,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-            (
-                None,
-                {
-                    "weight": 1,
-                    "parser": DummyParser2,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-        )
-
-        self.assertEqual(
-            get_parser_class_for_mime_type("application/pdf"),
-            DummyParser2,
-        )
-
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_0_parsers(self, m, *args) -> None:
-        """
-        GIVEN:
-            - No parsers are declared
-        WHEN:
-            - Attempt to get parser for the mimetype
-        THEN:
-            - No parser class is returned
-        """
-        m.return_value = []
-        with TemporaryDirectory():
-            self.assertIsNone(get_parser_class_for_mime_type("application/pdf"))
-
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_no_valid_parser(self, m, *args) -> None:
-        """
-        GIVEN:
-            - No parser declared for a given mimetype
-            - Parser declared for a different mimetype
-        WHEN:
-            - Attempt to get parser for the given mimetype
-        THEN:
-            - No parser class is returned
-        """
-
-        class DummyParser:
-            pass
-
-        m.return_value = (
-            (
-                None,
-                {
-                    "weight": 0,
-                    "parser": DummyParser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-        )
-
-        self.assertIsNone(get_parser_class_for_mime_type("image/tiff"))


 class TestParserAvailability(TestCase):
@@ -151,7 +35,7 @@ class TestParserAvailability(TestCase):
            self.assertIn(ext, supported_exts)
            self.assertEqual(get_default_file_extension(mime_type), ext)
            self.assertIsInstance(
-                get_parser_class_for_mime_type(mime_type)(logging_group=None),
+                get_parser_registry().get_parser_for_file(mime_type, "")(),
                RasterisedDocumentParser,
            )

@@ -175,7 +59,7 @@ class TestParserAvailability(TestCase):
            self.assertIn(ext, supported_exts)
            self.assertEqual(get_default_file_extension(mime_type), ext)
            self.assertIsInstance(
-                get_parser_class_for_mime_type(mime_type)(logging_group=None),
+                get_parser_registry().get_parser_for_file(mime_type, "")(),
                TextDocumentParser,
            )

@@ -198,22 +82,23 @@ class TestParserAvailability(TestCase):
            ),
        ]

-        # Force the app ready to notice the settings override
-        with override_settings(TIKA_ENABLED=True, INSTALLED_APPS=["paperless_tika"]):
-            app = apps.get_app_config("paperless_tika")
-            app.ready()
+        self.addCleanup(reset_parser_registry)
+
+        # Reset and rebuild the registry with Tika enabled.
+        with override_settings(TIKA_ENABLED=True):
+            reset_parser_registry()
            supported_exts = get_supported_file_extensions()

-        for mime_type, ext in supported_mimes_and_exts:
-            self.assertIn(ext, supported_exts)
-            self.assertEqual(get_default_file_extension(mime_type), ext)
-            self.assertIsInstance(
-                get_parser_class_for_mime_type(mime_type)(logging_group=None),
-                TikaDocumentParser,
-            )
+            for mime_type, ext in supported_mimes_and_exts:
+                self.assertIn(ext, supported_exts)
+                self.assertEqual(get_default_file_extension(mime_type), ext)
+                self.assertIsInstance(
+                    get_parser_registry().get_parser_for_file(mime_type, "")(),
+                    TikaDocumentParser,
+                )

    def test_no_parser_for_mime(self) -> None:
-        self.assertIsNone(get_parser_class_for_mime_type("text/sdgsdf"))
+        self.assertIsNone(get_parser_registry().get_parser_for_file("text/sdgsdf", ""))

    def test_default_extension(self) -> None:
        # Test no parser declared still returns an extension
--- a/src/documents/views.py
+++ b/src/documents/views.py
@@ -7,7 +7,6 @@ import tempfile
 import zipfile
 from collections import defaultdict
 from collections import deque
-from contextlib import nullcontext
 from datetime import datetime
 from pathlib import Path
 from time import mktime
@@ -158,7 +157,6 @@ from documents.models import UiSettings
 from documents.models import Workflow
 from documents.models import WorkflowAction
 from documents.models import WorkflowTrigger
-from documents.parsers import get_parser_class_for_mime_type
 from documents.permissions import AcknowledgeTasksPermissions
 from documents.permissions import PaperlessAdminPermissions
 from documents.permissions import PaperlessNotePermissions
@@ -226,7 +224,7 @@ from paperless.celery import app as celery_app
 from paperless.config import AIConfig
 from paperless.config import GeneralConfig
 from paperless.models import ApplicationConfiguration
-from paperless.parsers import ParserProtocol
+from paperless.parsers.registry import get_parser_registry
 from paperless.serialisers import GroupSerializer
 from paperless.serialisers import UserSerializer
 from paperless.views import StandardPagination
@@ -1083,17 +1081,17 @@ class DocumentViewSet(
        if not Path(file).is_file():
            return None

-        parser_class = get_parser_class_for_mime_type(mime_type)
+        parser_class = get_parser_registry().get_parser_for_file(
+            mime_type,
+            Path(file).name,
+            Path(file),
+        )
        if parser_class:
-            parser = parser_class(progress_callback=None, logging_group=None)
-            cm = parser if isinstance(parser, ParserProtocol) else nullcontext(parser)
-
            try:
-                with cm:
+                with parser_class() as parser:
                    return parser.extract_metadata(file, mime_type)
            except Exception:  # pragma: no cover
                logger.exception(f"Issue getting metadata for {file}")
-                # TODO: cover GPG errors, remove later.
                return []
        else:  # pragma: no cover
            logger.warning(f"No parser for {mime_type}")
--- a/src/paperless/checks.py
+++ b/src/paperless/checks.py
@@ -3,6 +3,7 @@ import os
 import pwd
 import shutil
 import stat
+import subprocess
 from pathlib import Path

 from django.conf import settings
@@ -299,3 +300,62 @@ def check_deprecated_db_settings(
        )

    return warnings
+
+
+@register()
+def check_remote_parser_configured(app_configs, **kwargs) -> list[Error]:
+    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
+        settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
+    ):
+        return [
+            Error(
+                "Azure AI remote parser requires endpoint and API key to be configured.",
+            ),
+        ]
+
+    return []
+
+
+def get_tesseract_langs():
+    proc = subprocess.run(
+        [shutil.which("tesseract"), "--list-langs"],
+        capture_output=True,
+    )
+
+    # Decode bytes to string, split on newlines, trim out the header
+    proc_lines = proc.stdout.decode("utf8", errors="ignore").strip().split("\n")[1:]
+
+    return [x.strip() for x in proc_lines]
+
+
+@register()
+def check_default_language_available(app_configs, **kwargs):
+    errs = []
+
+    if not settings.OCR_LANGUAGE:
+        errs.append(
+            Warning(
+                "No OCR language has been specified with PAPERLESS_OCR_LANGUAGE. "
+                "This means that Tesseract will fall back to English.",
+            ),
+        )
+        return errs
+
+    # binaries_check in paperless will check and report if this doesn't exist
+    # So skip trying to do anything here and let that handle missing binaries
+    if shutil.which("tesseract") is not None:
+        installed_langs = get_tesseract_langs()
+
+        specified_langs = [x.strip() for x in settings.OCR_LANGUAGE.split("+")]
+
+        for lang in specified_langs:
+            if lang not in installed_langs:
+                errs.append(
+                    Error(
+                        f"The selected OCR language {lang} is "
+                        f"not installed. Paperless cannot OCR your documents "
+                        f"without it. Please fix PAPERLESS_OCR_LANGUAGE.",
+                    ),
+                )
+
+    return errs
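
Review note on the language check above: the header-stripping and `+`-splitting logic can be exercised without a Tesseract binary if the parsing is separated from the subprocess call. The following is a minimal sketch under that assumption, not part of the change itself. The helper names `parse_list_langs_output` and `missing_languages` and the sample output are hypothetical.

```python
def parse_list_langs_output(raw: bytes) -> list[str]:
    """Decode `tesseract --list-langs` output and drop the header line."""
    lines = raw.decode("utf8", errors="ignore").strip().split("\n")[1:]
    return [line.strip() for line in lines]


def missing_languages(ocr_language: str, installed: list[str]) -> list[str]:
    """Return the requested languages (joined with '+') that are not installed."""
    requested = [lang.strip() for lang in ocr_language.split("+")]
    return [lang for lang in requested if lang not in installed]


# Hypothetical sample output; the real header text depends on the Tesseract build.
sample = b'List of available languages in "/usr/share/tessdata/" (3):\ndeu\neng\nosd\n'
installed = parse_list_langs_output(sample)
print(installed)
print(missing_languages("deu+fra", installed))
```

Keeping the parsing in a pure function like this would let a test feed canned output instead of mocking `subprocess.run`.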
--- a/src/paperless/parsers/__init__.py
+++ b/src/paperless/parsers/__init__.py
@@ -35,6 +35,7 @@ Usage example (third-party parser)::

 from __future__ import annotations

+from dataclasses import dataclass
 from typing import TYPE_CHECKING
 from typing import Protocol
 from typing import Self
@@ -48,6 +49,7 @@ if TYPE_CHECKING:

 __all__ = [
    "MetadataEntry",
+    "ParserContext",
    "ParserProtocol",
 ]

@@ -73,6 +75,44 @@ class MetadataEntry(TypedDict):
    """String representation of the field value."""


+@dataclass(frozen=True, slots=True)
+class ParserContext:
+    """Immutable context passed to a parser before parse().
+
+    The consumer assembles this from the ingestion event and Django
+    settings, then calls ``parser.configure(context)`` before
+    ``parser.parse()``.  Parsers read only the fields relevant to them;
+    unneeded fields are ignored.
+
+    ``frozen=True`` prevents accidental mutation after the consumer
+    hands the context off.  ``slots=True`` keeps instances lightweight.
+
+    Fields
+    ------
+    mailrule_id : int | None
+        Primary key of the ``MailRule`` that triggered this ingestion,
+        or ``None`` when the document did not arrive via a mail rule.
+        Used by ``MailDocumentParser`` to select the PDF layout.
+
+    Notes
+    -----
+    Future fields (not yet implemented):
+
+    * ``output_type`` — PDF/A variant for archive generation
+      (replaces ``settings.OCR_OUTPUT_TYPE`` reads inside parsers).
+    * ``ocr_mode`` — skip-text, redo, force, etc.
+      (replaces ``settings.OCR_MODE`` reads inside parsers).
+    * ``ocr_language`` — Tesseract language string.
+      (replaces ``settings.OCR_LANGUAGE`` reads inside parsers).
+
+    When those fields are added the consumer will read from Django
+    settings once and populate them here, decoupling parsers from
+    ``settings.*`` entirely.
+    """
+
+    mailrule_id: int | None = None
+
+
 @runtime_checkable
 class ParserProtocol(Protocol):
    """Structural contract for all Paperless-ngx document parsers.
@@ -191,6 +231,21 @@ class ParserProtocol(Protocol):
    # Core parsing interface
    # ------------------------------------------------------------------

+    def configure(self, context: ParserContext) -> None:
+        """Apply source context before parse().
+
+        Called by the consumer after instantiation and before parse().
+        The default implementation is a no-op; parsers that need context
+        override this method and read only the fields they use.
+
+        Parameters
+        ----------
+        context:
+            Immutable context assembled by the consumer for this
+            specific ingestion event.
+        """
+        ...
+
    def parse(
        self,
        document_path: Path,
--- a/src/paperless/parsers/mail.py
+++ b/src/paperless/parsers/mail.py
@@ -0,0 +1,834 @@
+"""
+Built-in mail document parser.
+
+Handles message/rfc822 (EML) MIME type by:
+- Parsing the email using imap_tools
+- Generating a PDF via Gotenberg (for display and archive)
+- Extracting text via Tika for HTML content
+- Extracting metadata from email headers
+
+The parser always produces a PDF because EML files cannot be rendered
+natively in a browser (requires_pdf_rendition=True).
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+import shutil
+import tempfile
+from html import escape
+from pathlib import Path
+from typing import TYPE_CHECKING
+from typing import Self
+
+from bleach import clean
+from bleach import linkify
+from django.conf import settings
+from django.utils import timezone
+from django.utils.timezone import is_naive
+from django.utils.timezone import make_aware
+from gotenberg_client import GotenbergClient
+from gotenberg_client.constants import A4
+from gotenberg_client.options import Measurement
+from gotenberg_client.options import MeasurementUnitType
+from gotenberg_client.options import PageMarginsType
+from gotenberg_client.options import PdfAFormat
+from humanize import naturalsize
+from imap_tools import MailAttachment
+from imap_tools import MailMessage
+from tika_client import TikaClient
+
+from documents.parsers import ParseError
+from documents.parsers import make_thumbnail_from_pdf
+from paperless.models import OutputTypeChoices
+from paperless.version import __full_version_str__
+from paperless_mail.models import MailRule
+
+if TYPE_CHECKING:
+    import datetime
+    from types import TracebackType
+
+    from paperless.parsers import MetadataEntry
+    from paperless.parsers import ParserContext
+
+logger = logging.getLogger("paperless.parsing.mail")
+
+_SUPPORTED_MIME_TYPES: dict[str, str] = {
+    "message/rfc822": ".eml",
+}
+
+
+class MailDocumentParser:
+    """Parse .eml email files for Paperless-ngx.
+
+    Uses imap_tools to parse .eml files, generates a PDF using Gotenberg,
+    and sends the HTML part to a Tika server for text extraction.  Because
+    EML files cannot be rendered natively in a browser, the parser always
+    produces a PDF rendition (requires_pdf_rendition=True).
+
+    Pass a ``ParserContext`` to ``configure()`` before ``parse()`` to
+    apply mail-rule-specific PDF layout options:
+
+        parser.configure(ParserContext(mailrule_id=rule.pk))
+        parser.parse(path, mime_type)
+
+    Class attributes
+    ----------------
+    name : str
+        Human-readable parser name.
+    version : str
+        Semantic version string, kept in sync with Paperless-ngx releases.
+    author : str
+        Maintainer name.
+    url : str
+        Issue tracker / source URL.
+    """
+
+    name: str = "Paperless-ngx Mail Parser"
+    version: str = __full_version_str__
+    author: str = "Paperless-ngx Contributors"
+    url: str = "https://github.com/paperless-ngx/paperless-ngx"
+
+    # ------------------------------------------------------------------
+    # Class methods
+    # ------------------------------------------------------------------
+
+    @classmethod
+    def supported_mime_types(cls) -> dict[str, str]:
+        """Return the MIME types this parser handles.
+
+        Returns
+        -------
+        dict[str, str]
+            Mapping of MIME type to preferred file extension.
+        """
+        return _SUPPORTED_MIME_TYPES
+
+    @classmethod
+    def score(
+        cls,
+        mime_type: str,
+        filename: str,
+        path: Path | None = None,
+    ) -> int | None:
+        """Return the priority score for handling this file.
+
+        Parameters
+        ----------
+        mime_type:
+            Detected MIME type of the file.
+        filename:
+            Original filename including extension.
+        path:
+            Optional filesystem path. Not inspected by this parser.
+
+        Returns
+        -------
+        int | None
+            10 if the MIME type is supported, otherwise None.
+        """
+        if mime_type in _SUPPORTED_MIME_TYPES:
+            return 10
+        return None
+
+    # ------------------------------------------------------------------
+    # Properties
+    # ------------------------------------------------------------------
+
+    @property
+    def can_produce_archive(self) -> bool:
+        """Whether this parser can produce a searchable PDF archive copy.
+
+        Returns
+        -------
+        bool
+            Always False — the mail parser produces a display PDF
+            (requires_pdf_rendition=True), not an optional OCR archive.
+        """
+        return False
+
+    @property
+    def requires_pdf_rendition(self) -> bool:
+        """Whether the parser must produce a PDF for the frontend to display.
+
+        Returns
+        -------
+        bool
+            Always True — EML files cannot be rendered natively in a browser,
+            so a PDF conversion is always required for display.
+        """
+        return True
+
+    # ------------------------------------------------------------------
+    # Lifecycle
+    # ------------------------------------------------------------------
+
+    def __init__(self, logging_group: object = None) -> None:
+        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
+        self._tempdir = Path(
+            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR),
+        )
+        self._text: str | None = None
+        self._date: datetime.datetime | None = None
+        self._archive_path: Path | None = None
+        self._mailrule_id: int | None = None
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_val: BaseException | None,
+        exc_tb: TracebackType | None,
+    ) -> None:
+        logger.debug("Cleaning up temporary directory %s", self._tempdir)
+        shutil.rmtree(self._tempdir, ignore_errors=True)
+
+    # ------------------------------------------------------------------
+    # Core parsing interface
+    # ------------------------------------------------------------------
+
+    def configure(self, context: ParserContext) -> None:
+        self._mailrule_id = context.mailrule_id
+
+    def parse(
+        self,
+        document_path: Path,
+        mime_type: str,
+        *,
+        produce_archive: bool = True,
+    ) -> None:
+        """Parse the given .eml into formatted text and a PDF archive.
+
+        Call ``configure(ParserContext(mailrule_id=...))`` before this method
+        to apply mail-rule-specific PDF layout options.  The ``produce_archive``
+        flag is accepted for protocol compatibility but has no effect:
+        the mail parser always produces a PDF since EML files cannot be
+        displayed natively.
+
+        Parameters
+        ----------
+        document_path:
+            Absolute path to the .eml file.
+        mime_type:
+            Detected MIME type of the document (should be "message/rfc822").
+        produce_archive:
+            Accepted for protocol compatibility. The PDF rendition is always
+            produced since EML files cannot be displayed natively in a browser.
+
+        Raises
+        ------
+        documents.parsers.ParseError
+            If the file cannot be parsed or PDF generation fails.
+        """
+
+        def strip_text(text: str) -> str:
+            """Reduces the spacing of the given text string."""
+            text = re.sub(r"\s+", " ", text)
+            text = re.sub(r"(\n *)+", "\n", text)
+            return text.strip()
+
+        def build_formatted_text(mail_message: MailMessage) -> str:
+            """Constructs a formatted string based on the given email."""
+            fmt_text = f"Subject: {mail_message.subject}\n\n"
+            fmt_text += f"From: {mail_message.from_values.full if mail_message.from_values else ''}\n\n"
+            to_list = [address.full for address in mail_message.to_values]
+            fmt_text += f"To: {', '.join(to_list)}\n\n"
+            if mail_message.cc_values:
+                fmt_text += (
+                    f"CC: {', '.join(address.full for address in mail_message.cc_values)}\n\n"
+                )
+            if mail_message.bcc_values:
+                fmt_text += (
+                    f"BCC: {', '.join(address.full for address in mail_message.bcc_values)}\n\n"
+                )
+            if mail_message.attachments:
+                att = []
+                for a in mail_message.attachments:
+                    attachment_size = naturalsize(a.size, binary=True, format="%.2f")
+                    att.append(
+                        f"{a.filename} ({attachment_size})",
+                    )
+                fmt_text += f"Attachments: {', '.join(att)}\n\n"
+
+            if mail_message.html:
+                fmt_text += "HTML content: " + strip_text(self.tika_parse(mail_message.html))
+
+            fmt_text += f"\n\n{strip_text(mail_message.text)}"
+
+            return fmt_text
+
+        logger.debug("Parsing file %s into an email", document_path.name)
+        mail = self.parse_file_to_message(document_path)
+
+        logger.debug("Building formatted text from email")
+        self._text = build_formatted_text(mail)
+
+        if is_naive(mail.date):
+            self._date = make_aware(mail.date)
+        else:
+            self._date = mail.date
+
+        logger.debug("Creating a PDF from the email")
+        if self._mailrule_id:
+            rule = MailRule.objects.get(pk=self._mailrule_id)
+            self._archive_path = self.generate_pdf(
+                mail,
+                MailRule.PdfLayout(rule.pdf_layout),
+            )
+        else:
+            self._archive_path = self.generate_pdf(mail)
+
+    # ------------------------------------------------------------------
+    # Result accessors
+    # ------------------------------------------------------------------
+
+    def get_text(self) -> str | None:
+        """Return the plain-text content extracted during parse.
+
+        Returns
+        -------
+        str | None
+            Extracted text, or None if parse has not been called yet.
+        """
+        return self._text
+
+    def get_date(self) -> datetime.datetime | None:
+        """Return the document date detected during parse.
+
+        Returns
+        -------
+        datetime.datetime | None
+            Date from the email headers, or None if not detected.
+        """
+        return self._date
+
+    def get_archive_path(self) -> Path | None:
+        """Return the path to the generated archive PDF, or None.
+
+        Returns
+        -------
+        Path | None
+            Path to the PDF produced by Gotenberg, or None if parse has not
+            been called yet.
+        """
+        return self._archive_path
+
+    # ------------------------------------------------------------------
+    # Thumbnail and metadata
+    # ------------------------------------------------------------------
+
+    def get_thumbnail(
+        self,
+        document_path: Path,
+        mime_type: str,
+        file_name: str | None = None,
+    ) -> Path:
+        """Generate a thumbnail from the PDF rendition of the email.
+
+        Converts the document to PDF first if not already done.
+
+        Parameters
+        ----------
+        document_path:
+            Absolute path to the source document.
+        mime_type:
+            Detected MIME type of the document.
+        file_name:
+            Kept for backward compatibility; not used.
+
+        Returns
+        -------
+        Path
+            Path to the generated WebP thumbnail inside the temporary directory.
+        """
+        if not self._archive_path:
+            self._archive_path = self.generate_pdf(
+                self.parse_file_to_message(document_path),
+            )
+
+        return make_thumbnail_from_pdf(
+            self._archive_path,
+            self._tempdir,
+        )
+
+    def get_page_count(
+        self,
+        document_path: Path,
+        mime_type: str,
+    ) -> int | None:
+        """Return the number of pages in the document.
+
+        Counts pages in the archive PDF produced by a preceding parse()
+        call.  Returns ``None`` if parse() has not been called yet or if
+        no archive was produced.
+
+        Returns
+        -------
+        int | None
+            Page count of the archive PDF, or ``None``.
+        """
+        if self._archive_path is not None:
+            from paperless.parsers.utils import get_page_count_for_pdf
+
+            return get_page_count_for_pdf(self._archive_path, log=logger)
+        return None
+
+    def extract_metadata(
+        self,
+        document_path: Path,
+        mime_type: str,
+    ) -> list[MetadataEntry]:
+        """Extract metadata from the email headers.
+
+        Returns email headers as metadata entries with prefix "header",
+        plus summary entries for attachments and date.
+
+        Returns
+        -------
+        list[MetadataEntry]
+            Sorted list of metadata entries, or ``[]`` on parse failure.
+        """
+        result: list[MetadataEntry] = []
+
+        try:
+            mail = self.parse_file_to_message(document_path)
+        except ParseError as e:
+            logger.warning(
+                "Error while fetching document metadata for %s: %s",
+                document_path,
+                e,
+            )
+            return result
+
+        for key, header_values in mail.headers.items():
+            value = ", ".join(header_values)
+            try:
+                value.encode("utf-8")
+            except UnicodeEncodeError as e:  # pragma: no cover
+                logger.debug("Skipping header %s: %s", key, e)
+                continue
+
+            result.append(
+                {
+                    "namespace": "",
+                    "prefix": "header",
+                    "key": key,
+                    "value": value,
+                },
+            )
+
+        result.append(
+            {
+                "namespace": "",
+                "prefix": "",
+                "key": "attachments",
+                "value": ", ".join(
+                    f"{attachment.filename}"
+                    f"({naturalsize(attachment.size, binary=True, format='%.2f')})"
+                    for attachment in mail.attachments
+                ),
+            },
+        )
+
+        result.append(
+            {
+                "namespace": "",
+                "prefix": "",
+                "key": "date",
+                "value": mail.date.strftime("%Y-%m-%d %H:%M:%S %Z"),
+            },
+        )
+
+        result.sort(key=lambda item: (item["prefix"], item["key"]))
+        return result
+
+    # ------------------------------------------------------------------
+    # Email-specific methods
+    # ------------------------------------------------------------------
+
+    def _settings_to_gotenberg_pdfa(self) -> PdfAFormat | None:
+        """Convert the OCR output type setting to a Gotenberg PdfAFormat."""
+        if settings.OCR_OUTPUT_TYPE in {
+            OutputTypeChoices.PDF_A,
+            OutputTypeChoices.PDF_A2,
+        }:
+            return PdfAFormat.A2b
+        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A1:  # pragma: no cover
+            logger.warning(
+                "Gotenberg does not support PDF/A-1a, choosing PDF/A-2b instead",
+            )
+            return PdfAFormat.A2b
+        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A3:  # pragma: no cover
+            return PdfAFormat.A3b
+        return None
+
+    @staticmethod
+    def parse_file_to_message(filepath: Path) -> MailMessage:
+        """Parse the given .eml file into a MailMessage object.
+
+        Parameters
+        ----------
+        filepath:
+            Path to the .eml file.
+
+        Returns
+        -------
+        MailMessage
+            Parsed mail message.
+
+        Raises
+        ------
+        documents.parsers.ParseError
+            If the file cannot be parsed or is missing required fields.
+        """
+        try:
+            with filepath.open("rb") as eml:
+                parsed = MailMessage.from_bytes(eml.read())
+                if parsed.from_values is None:
+                    raise ParseError(
+                        f"Could not parse {filepath}: Missing 'from'",
+                    )
+        except Exception as err:
+            raise ParseError(
+                f"Could not parse {filepath}: {err}",
+            ) from err
+
+        return parsed
+
+    def tika_parse(self, html: str) -> str:
+        """Send HTML content to the Tika server for text extraction.
+
+        Parameters
+        ----------
+        html:
+            HTML string to parse.
+
+        Returns
+        -------
+        str
+            Extracted plain text.
+
+        Raises
+        ------
+        documents.parsers.ParseError
+            If the Tika server cannot be reached or returns an error.
+        """
+        logger.info("Sending content to Tika server")
+
+        try:
+            with TikaClient(tika_url=settings.TIKA_ENDPOINT) as client:
+                parsed = client.tika.as_text.from_buffer(html, "text/html")
+
+                if parsed.content is not None:
+                    return parsed.content.strip()
+                return ""
+        except Exception as err:
+            raise ParseError(
+                f"Could not parse content with tika server at "
+                f"{settings.TIKA_ENDPOINT}: {err}",
+            ) from err
+
+    def generate_pdf(
+        self,
+        mail_message: MailMessage,
+        pdf_layout: MailRule.PdfLayout | None = None,
+    ) -> Path:
+        """Generate a PDF from the email message.
+
+        Creates separate PDFs for the email body and HTML content, then
+        merges them according to the requested layout.
+
+        Parameters
+        ----------
+        mail_message:
+            Parsed email message.
+        pdf_layout:
+            Layout option for the PDF. Falls back to the
+            EMAIL_PARSE_DEFAULT_LAYOUT setting if not provided.
+
+        Returns
+        -------
+        Path
+            Path to the generated PDF inside the temporary directory.
+        """
+        archive_path = Path(self._tempdir) / "merged.pdf"
+
+        mail_pdf_file = self.generate_pdf_from_mail(mail_message)
+
+        if pdf_layout is None:
+            pdf_layout = MailRule.PdfLayout(settings.EMAIL_PARSE_DEFAULT_LAYOUT)
+
+        # If no HTML content, create the PDF from the message.
+        # Otherwise, create 2 PDFs and merge them with Gotenberg.
+        if not mail_message.html:
+            archive_path.write_bytes(mail_pdf_file.read_bytes())
+        else:
+            pdf_of_html_content = self.generate_pdf_from_html(
+                mail_message.html,
+                mail_message.attachments,
+            )
+
+            logger.debug("Merging email text and HTML content into single PDF")
+
+            with (
+                GotenbergClient(
+                    host=settings.TIKA_GOTENBERG_ENDPOINT,
+                    timeout=settings.CELERY_TASK_TIME_LIMIT,
+                ) as client,
+                client.merge.merge() as route,
+            ):
+                # Configure requested PDF/A formatting, if any
+                pdf_a_format = self._settings_to_gotenberg_pdfa()
+                if pdf_a_format is not None:
+                    route.pdf_format(pdf_a_format)
+
+                match pdf_layout:
+                    case MailRule.PdfLayout.HTML_TEXT:
+                        route.merge([pdf_of_html_content, mail_pdf_file])
+                    case MailRule.PdfLayout.HTML_ONLY:
+                        route.merge([pdf_of_html_content])
+                    case MailRule.PdfLayout.TEXT_ONLY:
+                        route.merge([mail_pdf_file])
+                    case MailRule.PdfLayout.TEXT_HTML | _:
+                        route.merge([mail_pdf_file, pdf_of_html_content])
+
+                try:
+                    response = route.run()
+                    archive_path.write_bytes(response.content)
+                except Exception as err:
+                    raise ParseError(
+                        f"Error while merging email HTML into PDF: {err}",
+                    ) from err
+
+        return archive_path
+
+    def mail_to_html(self, mail: MailMessage) -> Path:
+        """Convert the given email into an HTML file using a template.
+
+        Parameters
+        ----------
+        mail:
+            Parsed mail message.
+
+        Returns
+        -------
+        Path
+            Path to the rendered HTML file inside the temporary directory.
+        """
+
+        def clean_html(text: str) -> str:
+            """Attempt to clean, escape, and linkify the given HTML string."""
+            if isinstance(text, list):
+                text = "\n".join([str(e) for e in text])
+            if not isinstance(text, str):
+                text = str(text)
+            text = escape(text)
+            text = clean(text)
+            text = linkify(text, parse_email=True)
+            text = text.replace("\n", "<br>")
+            return text
+
+        data = {}
+
+        data["subject"] = clean_html(mail.subject)
+        if data["subject"]:
+            data["subject_label"] = "Subject"
+        data["from"] = clean_html(mail.from_values.full if mail.from_values else "")
+        if data["from"]:
+            data["from_label"] = "From"
+        data["to"] = clean_html(", ".join(address.full for address in mail.to_values))
+        if data["to"]:
+            data["to_label"] = "To"
+        data["cc"] = clean_html(", ".join(address.full for address in mail.cc_values))
+        if data["cc"]:
+            data["cc_label"] = "CC"
+        data["bcc"] = clean_html(", ".join(address.full for address in mail.bcc_values))
+        if data["bcc"]:
+            data["bcc_label"] = "BCC"
+
+        att = []
+        for a in mail.attachments:
+            att.append(
+                f"{a.filename} ({naturalsize(a.size, binary=True, format='%.2f')})",
+            )
+        data["attachments"] = clean_html(", ".join(att))
+        if data["attachments"]:
+            data["attachments_label"] = "Attachments"
+
+        data["date"] = clean_html(
+            timezone.localtime(mail.date).strftime("%Y-%m-%d %H:%M"),
+        )
+        data["content"] = clean_html(mail.text.strip())
+
+        from django.template.loader import render_to_string
+
+        html_file = Path(self._tempdir) / "email_as_html.html"
+        html_file.write_text(render_to_string("email_msg_template.html", context=data))
+
+        return html_file
+
+    def generate_pdf_from_mail(self, mail: MailMessage) -> Path:
+        """Create a PDF from the email body using an HTML template and Gotenberg.
+
+        Parameters
+        ----------
+        mail:
+            Parsed mail message.
+
+        Returns
+        -------
+        Path
+            Path to the generated PDF inside the temporary directory.
+
+        Raises
+        ------
+        documents.parsers.ParseError
+            If Gotenberg returns an error.
+        """
+        logger.info("Converting mail to PDF")
+
+        css_file = (
+            Path(__file__).parent.parent.parent
+            / "paperless_mail"
+            / "templates"
+            / "output.css"
+        )
+        email_html_file = self.mail_to_html(mail)
+
+        with (
+            GotenbergClient(
+                host=settings.TIKA_GOTENBERG_ENDPOINT,
+                timeout=settings.CELERY_TASK_TIME_LIMIT,
+            ) as client,
+            client.chromium.html_to_pdf() as route,
+        ):
+            # Configure requested PDF/A formatting, if any
+            pdf_a_format = self._settings_to_gotenberg_pdfa()
+            if pdf_a_format is not None:
+                route.pdf_format(pdf_a_format)
+
+            try:
+                response = (
+                    route.index(email_html_file)
+                    .resource(css_file)
+                    .margins(
+                        PageMarginsType(
+                            top=Measurement(0.1, MeasurementUnitType.Inches),
+                            bottom=Measurement(0.1, MeasurementUnitType.Inches),
+                            left=Measurement(0.1, MeasurementUnitType.Inches),
+                            right=Measurement(0.1, MeasurementUnitType.Inches),
+                        ),
+                    )
+                    .size(A4)
+                    .scale(1.0)
+                    .run()
+                )
+            except Exception as err:
+                raise ParseError(
+                    f"Error while converting email to PDF: {err}",
+                ) from err
+
+        email_as_pdf_file = Path(self._tempdir) / "email_as_pdf.pdf"
+        email_as_pdf_file.write_bytes(response.content)
+
+        return email_as_pdf_file
+
+    def generate_pdf_from_html(
+        self,
+        orig_html: str,
+        attachments: list[MailAttachment],
+    ) -> Path:
+        """Generate a PDF from the HTML content of the email.
+
+        Parameters
+        ----------
+        orig_html:
+            Raw HTML string from the email body.
+        attachments:
+            List of email attachments (used as inline resources).
+
+        Returns
+        -------
+        Path
+            Path to the generated PDF inside the temporary directory.
+
+        Raises
+        ------
+        documents.parsers.ParseError
+            If Gotenberg returns an error.
+        """
+
+        def clean_html_script(text: str) -> str:
+            compiled_open = re.compile(re.escape("<script"), re.IGNORECASE)
+            text = compiled_open.sub("<div hidden ", text)
+
+            compiled_close = re.compile(re.escape("</script"), re.IGNORECASE)
+            text = compiled_close.sub("</div", text)
+            return text
+
+        logger.info("Converting message html to PDF")
+
+        tempdir = Path(self._tempdir)
+
+        html_clean = clean_html_script(orig_html)
+        html_clean_file = tempdir / "index.html"
+        html_clean_file.write_text(html_clean)
+
+        with (
+            GotenbergClient(
+                host=settings.TIKA_GOTENBERG_ENDPOINT,
+                timeout=settings.CELERY_TASK_TIME_LIMIT,
+            ) as client,
+            client.chromium.html_to_pdf() as route,
+        ):
+            # Configure requested PDF/A formatting, if any
+            pdf_a_format = self._settings_to_gotenberg_pdfa()
+            if pdf_a_format is not None:
+                route.pdf_format(pdf_a_format)
+
+            # Add attachments as resources, cleaning the filename and replacing
+            # it in the index file for inclusion
+            for attachment in attachments:
+                # Clean the attachment name to be valid
+                name_cid = f"cid:{attachment.content_id}"
+                name_clean = "".join(e for e in name_cid if e.isalnum())
+
+                # Write attachment payload to a temp file
+                temp_file = tempdir / name_clean
+                temp_file.write_bytes(attachment.payload)
+
+                route.resource(temp_file)
+
+                # Replace any reference to the cid name in the HTML with the clean name
+                html_clean = html_clean.replace(name_cid, name_clean)
+
+            # Now store the cleaned up HTML version
+            html_clean_file = tempdir / "index.html"
+            html_clean_file.write_text(html_clean)
+            # This is our index file, the main page basically
+            route.index(html_clean_file)
+
+            # Set page margins, size, and scale
+            route.margins(
+                PageMarginsType(
+                    top=Measurement(0.1, MeasurementUnitType.Inches),
+                    bottom=Measurement(0.1, MeasurementUnitType.Inches),
+                    left=Measurement(0.1, MeasurementUnitType.Inches),
+                    right=Measurement(0.1, MeasurementUnitType.Inches),
+                ),
+            ).size(A4).scale(1.0)
+
+            try:
+                response = route.run()
+
+            except Exception as err:
+                raise ParseError(
+                    f"Error while converting document to PDF: {err}",
+                ) from err
+
+        html_pdf = tempdir / "html.pdf"
+        html_pdf.write_bytes(response.content)
+        return html_pdf
--- a/src/paperless/parsers/registry.py
+++ b/src/paperless/parsers/registry.py
@@ -33,6 +33,7 @@ name, version, author, url, supported_mime_types (callable), score (callable).
 from __future__ import annotations

 import logging
+import threading
 from importlib.metadata import entry_points
 from typing import TYPE_CHECKING

@@ -49,6 +50,7 @@ logger = logging.getLogger("paperless.parsers.registry")

 _registry: ParserRegistry | None = None
 _discovery_complete: bool = False
+_lock = threading.Lock()

 # Attribute names that every registered external parser class must expose.
 _REQUIRED_ATTRS: tuple[str, ...] = (
@@ -74,7 +76,6 @@ def get_parser_registry() -> ParserRegistry:
    1. Creates a new ParserRegistry.
    2. Calls register_defaults to install built-in parsers.
    3. Calls discover to load third-party plugins via importlib.metadata entrypoints.
-    4. Calls log_summary to emit a startup summary.

    Subsequent calls return the same instance immediately.

@@ -85,14 +86,15 @@ def get_parser_registry() -> ParserRegistry:
    """
    global _registry, _discovery_complete

-    if _registry is None:
-        _registry = ParserRegistry()
-        _registry.register_defaults()
+    with _lock:
+        if _registry is None:
+            r = ParserRegistry()
+            r.register_defaults()
+            _registry = r

-    if not _discovery_complete:
-        _registry.discover()
-        _registry.log_summary()
-        _discovery_complete = True
+        if not _discovery_complete:
+            _registry.discover()
+            _discovery_complete = True

    return _registry

@@ -113,9 +115,11 @@ def init_builtin_parsers() -> None:
    """
    global _registry

-    if _registry is None:
-        _registry = ParserRegistry()
-        _registry.register_defaults()
+    with _lock:
+        if _registry is None:
+            r = ParserRegistry()
+            r.register_defaults()
+            _registry = r


 def reset_parser_registry() -> None:
@@ -193,13 +197,17 @@ class ParserRegistry:
        that log output is predictable; scoring determines which parser wins
        at runtime regardless of registration order.
        """
+        from paperless.parsers.mail import MailDocumentParser
        from paperless.parsers.remote import RemoteDocumentParser
+        from paperless.parsers.tesseract import RasterisedDocumentParser
        from paperless.parsers.text import TextDocumentParser
        from paperless.parsers.tika import TikaDocumentParser

        self.register_builtin(TextDocumentParser)
        self.register_builtin(RemoteDocumentParser)
        self.register_builtin(TikaDocumentParser)
+        self.register_builtin(MailDocumentParser)
+        self.register_builtin(RasterisedDocumentParser)

    # ------------------------------------------------------------------
    # Discovery
@@ -300,6 +308,23 @@ class ParserRegistry:
                getattr(cls, "url", "unknown"),
            )

+    # ------------------------------------------------------------------
+    # Inspection helpers
+    # ------------------------------------------------------------------
+
+    def all_parsers(self) -> list[type[ParserProtocol]]:
+        """Return all registered parser classes (external first, then builtins).
+
+        Used by compatibility wrappers that need to iterate every parser to
+        compute the full set of supported MIME types and file extensions.
+
+        Returns
+        -------
+        list[type[ParserProtocol]]
+            External parsers followed by built-in parsers.
+        """
+        return [*self._external, *self._builtins]
+
    # ------------------------------------------------------------------
    # Parser resolution
    # ------------------------------------------------------------------
@@ -330,7 +355,7 @@ class ParserRegistry:
        mime_type:
            The detected MIME type of the file.
        filename:
-            The original filename, including extension.
+            The original filename, including extension.  May be empty in some cases.
        path:
            Optional filesystem path to the file. Forwarded to each
            parser's score method.
--- a/src/paperless/parsers/remote.py
+++ b/src/paperless/parsers/remote.py
@@ -28,6 +28,7 @@ if TYPE_CHECKING:
    from types import TracebackType

    from paperless.parsers import MetadataEntry
+    from paperless.parsers import ParserContext

 logger = logging.getLogger("paperless.parsing.remote")

@@ -204,6 +205,9 @@ class RemoteDocumentParser:
    # Core parsing interface
    # ------------------------------------------------------------------

+    def configure(self, context: ParserContext) -> None:
+        pass
+
    def parse(
        self,
        document_path: Path,
--- a/src/paperless/parsers/tesseract.py
+++ b/src/paperless/parsers/tesseract.py
@@ -1,13 +1,18 @@
+from __future__ import annotations
+
+import logging
 import os
 import re
+import shutil
 import tempfile
 from pathlib import Path
 from typing import TYPE_CHECKING
+from typing import Any
+from typing import Self

 from django.conf import settings
 from PIL import Image

-from documents.parsers import DocumentParser
 from documents.parsers import ParseError
 from documents.parsers import make_thumbnail_from_pdf
 from documents.utils import maybe_override_pixel_limit
@@ -16,6 +21,28 @@ from paperless.config import OcrConfig
 from paperless.models import ArchiveFileChoices
 from paperless.models import CleanChoices
 from paperless.models import ModeChoices
+from paperless.parsers.utils import read_file_handle_unicode_errors
+from paperless.version import __full_version_str__
+
+if TYPE_CHECKING:
+    import datetime
+    from types import TracebackType
+
+    from paperless.parsers import MetadataEntry
+    from paperless.parsers import ParserContext
+
+logger = logging.getLogger("paperless.parsing.tesseract")
+
+_SUPPORTED_MIME_TYPES: dict[str, str] = {
+    "application/pdf": ".pdf",
+    "image/jpeg": ".jpg",
+    "image/png": ".png",
+    "image/tiff": ".tif",
+    "image/gif": ".gif",
+    "image/bmp": ".bmp",
+    "image/webp": ".webp",
+    "image/heic": ".heic",
+}


 class NoTextFoundException(Exception):
@@ -26,81 +53,125 @@ class RtlLanguageException(Exception):
    pass


-class RasterisedDocumentParser(DocumentParser):
+class RasterisedDocumentParser:
    """
    This parser uses Tesseract to try and get some text out of a rasterised
    image, whether it's a PDF, or other graphical format (JPEG, TIFF, etc.)
    """

-    logging_name = "paperless.parsing.tesseract"
+    name: str = "Paperless-ngx Tesseract OCR Parser"
+    version: str = __full_version_str__
+    author: str = "Paperless-ngx Contributors"
+    url: str = "https://github.com/paperless-ngx/paperless-ngx"

-    def get_settings(self) -> OcrConfig:
-        """
-        This parser uses the OCR configuration settings to parse documents
-        """
-        return OcrConfig()
+    # ------------------------------------------------------------------
+    # Class methods
+    # ------------------------------------------------------------------

-    def get_page_count(self, document_path, mime_type):
-        page_count = None
-        if mime_type == "application/pdf":
-            try:
-                import pikepdf
+    @classmethod
+    def supported_mime_types(cls) -> dict[str, str]:
+        return _SUPPORTED_MIME_TYPES

-                with pikepdf.Pdf.open(document_path) as pdf:
-                    page_count = len(pdf.pages)
-            except Exception as e:
-                self.log.warning(
-                    f"Unable to determine PDF page count {document_path}: {e}",
-                )
-        return page_count
+    @classmethod
+    def score(
+        cls,
+        mime_type: str,
+        filename: str,
+        path: Path | None = None,
+    ) -> int | None:
+        if mime_type in _SUPPORTED_MIME_TYPES:
+            return 10
+        return None

-    def extract_metadata(self, document_path, mime_type):
-        result = []
-        if mime_type == "application/pdf":
-            import pikepdf
+    # ------------------------------------------------------------------
+    # Properties
+    # ------------------------------------------------------------------

-            namespace_pattern = re.compile(r"\{(.*)\}(.*)")
+    @property
+    def can_produce_archive(self) -> bool:
+        return True

-            pdf = pikepdf.open(document_path)
-            meta = pdf.open_metadata()
-            for key, value in meta.items():
-                if isinstance(value, list):
-                    value = " ".join([str(e) for e in value])
-                value = str(value)
-                try:
-                    m = namespace_pattern.match(key)
-                    if m is None:  # pragma: no cover
-                        continue
-                    namespace = m.group(1)
-                    key_value = m.group(2)
-                    try:
-                        namespace.encode("utf-8")
-                        key_value.encode("utf-8")
-                    except UnicodeEncodeError as e:  # pragma: no cover
-                        self.log.debug(f"Skipping metadata key {key}: {e}")
-                        continue
-                    result.append(
-                        {
-                            "namespace": namespace,
-                            "prefix": meta.REVERSE_NS[namespace],
-                            "key": key_value,
-                            "value": value,
-                        },
-                    )
-                except Exception as e:
-                    self.log.warning(
-                        f"Error while reading metadata {key}: {value}. Error: {e}",
-                    )
-        return result
+    @property
+    def requires_pdf_rendition(self) -> bool:
+        return False

-    def get_thumbnail(self, document_path, mime_type, file_name=None):
+    # ------------------------------------------------------------------
+    # Lifecycle
+    # ------------------------------------------------------------------
+
+    def __init__(self, logging_group: object = None) -> None:
+        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
+        self.tempdir = Path(
+            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR),
+        )
+        self.settings = OcrConfig()
+        self.archive_path: Path | None = None
+        self.text: str | None = None
+        self.date: datetime.datetime | None = None
+        self.log = logger
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_val: BaseException | None,
+        exc_tb: TracebackType | None,
+    ) -> None:
+        logger.debug("Cleaning up temporary directory %s", self.tempdir)
+        shutil.rmtree(self.tempdir, ignore_errors=True)
+
+    # ------------------------------------------------------------------
+    # Core parsing interface
+    # ------------------------------------------------------------------
+
+    def configure(self, context: ParserContext) -> None:
+        pass
+
+    # ------------------------------------------------------------------
+    # Result accessors
+    # ------------------------------------------------------------------
+
+    def get_text(self) -> str | None:
+        return self.text
+
+    def get_date(self) -> datetime.datetime | None:
+        return self.date
+
+    def get_archive_path(self) -> Path | None:
+        return self.archive_path
+
+    # ------------------------------------------------------------------
+    # Thumbnail, page count, and metadata
+    # ------------------------------------------------------------------
+
+    def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
        return make_thumbnail_from_pdf(
-            self.archive_path or document_path,
+            self.archive_path or Path(document_path),
            self.tempdir,
-            self.logging_group,
        )

-    def is_image(self, mime_type) -> bool:
+    def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
+        if mime_type == "application/pdf":
+            from paperless.parsers.utils import get_page_count_for_pdf
+
+            return get_page_count_for_pdf(Path(document_path), log=self.log)
+        return None
+
+    def extract_metadata(
+        self,
+        document_path: Path,
+        mime_type: str,
+    ) -> list[MetadataEntry]:
+        if mime_type != "application/pdf":
+            return []
+
+        from paperless.parsers.utils import extract_pdf_metadata
+
+        return extract_pdf_metadata(Path(document_path), log=self.log)
+
+    def is_image(self, mime_type: str) -> bool:
        return mime_type in [
            "image/png",
            "image/jpeg",
@@ -111,25 +182,25 @@ class RasterisedDocumentParser(DocumentParser):
            "image/heic",
        ]

-    def has_alpha(self, image) -> bool:
+    def has_alpha(self, image: Path) -> bool:
        with Image.open(image) as im:
            return im.mode in ("RGBA", "LA")

-    def remove_alpha(self, image_path: str) -> Path:
+    def remove_alpha(self, image_path: Path) -> Path:
        no_alpha_image = Path(self.tempdir) / "image-no-alpha"
        run_subprocess(
            [
                settings.CONVERT_BINARY,
                "-alpha",
                "off",
-                image_path,
-                no_alpha_image,
+                str(image_path),
+                str(no_alpha_image),
            ],
            logger=self.log,
        )
        return no_alpha_image

-    def get_dpi(self, image) -> int | None:
+    def get_dpi(self, image: Path) -> int | None:
        try:
            with Image.open(image) as im:
                x, _ = im.info["dpi"]
@@ -138,7 +209,7 @@ class RasterisedDocumentParser(DocumentParser):
            self.log.warning(f"Error while getting DPI from image {image}: {e}")
            return None

-    def calculate_a4_dpi(self, image) -> int | None:
+    def calculate_a4_dpi(self, image: Path) -> int | None:
        try:
            with Image.open(image) as im:
                width, _ = im.size
@@ -156,6 +227,7 @@ class RasterisedDocumentParser(DocumentParser):
        sidecar_file: Path | None,
        pdf_file: Path,
    ) -> str | None:
+        text: str | None = None
        # When re-doing OCR, the sidecar contains ONLY the new text, not
        # the whole text, so do not utilize it in that case
        if (
@@ -163,7 +235,7 @@ class RasterisedDocumentParser(DocumentParser):
            and sidecar_file.is_file()
            and self.settings.mode != "redo"
        ):
-            text = self.read_file_handle_unicode_errors(sidecar_file)
+            text = read_file_handle_unicode_errors(sidecar_file)

            if "[OCR skipped on page" not in text:
                # This happens when there's already text in the input file.
@@ -191,12 +263,12 @@ class RasterisedDocumentParser(DocumentParser):
                        "-layout",
                        "-enc",
                        "UTF-8",
-                        pdf_file,
+                        str(pdf_file),
                        tmp.name,
                    ],
                    logger=self.log,
                )
-                text = self.read_file_handle_unicode_errors(Path(tmp.name))
+                text = read_file_handle_unicode_errors(Path(tmp.name))

            return post_process_text(text)

@@ -211,16 +283,14 @@ class RasterisedDocumentParser(DocumentParser):

    def construct_ocrmypdf_parameters(
        self,
-        input_file,
-        mime_type,
-        output_file,
-        sidecar_file,
+        input_file: Path,
+        mime_type: str,
+        output_file: Path,
+        sidecar_file: Path,
        *,
-        safe_fallback=False,
-    ):
-        if TYPE_CHECKING:
-            assert isinstance(self.settings, OcrConfig)
-        ocrmypdf_args = {
+        safe_fallback: bool = False,
+    ) -> dict[str, Any]:
+        ocrmypdf_args: dict[str, Any] = {
            "input_file_or_options": input_file,
            "output_file": output_file,
            # need to use threads, since this will be run in daemonized
@@ -330,7 +400,13 @@ class RasterisedDocumentParser(DocumentParser):

        return ocrmypdf_args

-    def parse(self, document_path: Path, mime_type, file_name=None) -> None:
+    def parse(
+        self,
+        document_path: Path,
+        mime_type: str,
+        *,
+        produce_archive: bool = True,
+    ) -> None:
        # This forces tesseract to use one core per page.
        os.environ["OMP_THREAD_LIMIT"] = "1"
        VALID_TEXT_LENGTH = 50
@@ -458,7 +534,7 @@ class RasterisedDocumentParser(DocumentParser):
                self.text = ""


-def post_process_text(text):
+def post_process_text(text: str | None) -> str | None:
    if not text:
        return None

--- a/src/paperless/parsers/text.py
+++ b/src/paperless/parsers/text.py
@@ -27,6 +27,7 @@ if TYPE_CHECKING:
    from types import TracebackType

    from paperless.parsers import MetadataEntry
+    from paperless.parsers import ParserContext

 logger = logging.getLogger("paperless.parsing.text")

@@ -156,6 +157,9 @@ class TextDocumentParser:
    # Core parsing interface
    # ------------------------------------------------------------------

+    def configure(self, context: ParserContext) -> None:
+        pass
+
    def parse(
        self,
        document_path: Path,
--- a/src/paperless/parsers/tika.py
+++ b/src/paperless/parsers/tika.py
@@ -35,6 +35,7 @@ if TYPE_CHECKING:
    from types import TracebackType

    from paperless.parsers import MetadataEntry
+    from paperless.parsers import ParserContext

 logger = logging.getLogger("paperless.parsing.tika")

@@ -205,6 +206,9 @@ class TikaDocumentParser:
    # Core parsing interface
    # ------------------------------------------------------------------

+    def configure(self, context: ParserContext) -> None:
+        pass
+
    def parse(
        self,
        document_path: Path,
@@ -340,11 +344,19 @@ class TikaDocumentParser:
    ) -> int | None:
        """Return the number of pages in the document.

+        Counts pages in the archive PDF produced by a preceding parse()
+        call.  Returns ``None`` if parse() has not been called yet or if
+        no archive was produced.
+
        Returns
        -------
        int | None
-            Always None — page count is not available from Tika.
+            Page count of the archive PDF, or ``None``.
        """
+        if self._archive_path is not None:
+            from paperless.parsers.utils import get_page_count_for_pdf
+
+            return get_page_count_for_pdf(self._archive_path, log=logger)
        return None

    def extract_metadata(
--- a/src/paperless/parsers/utils.py
+++ b/src/paperless/parsers/utils.py
@@ -20,6 +20,34 @@ if TYPE_CHECKING:
 logger = logging.getLogger("paperless.parsers.utils")


+def read_file_handle_unicode_errors(
+    filepath: Path,
+    log: logging.Logger | None = None,
+) -> str:
+    """Read a file as UTF-8 text, replacing invalid bytes rather than raising.
+
+    Parameters
+    ----------
+    filepath:
+        Absolute path to the file to read.
+    log:
+        Logger to use for warnings.  Falls back to the module-level logger
+        when omitted.
+
+    Returns
+    -------
+    str
+        File content as a string, with any invalid UTF-8 sequences replaced
+        by the Unicode replacement character.
+    """
+    _log = log or logger
+    try:
+        return filepath.read_text(encoding="utf-8")
+    except UnicodeDecodeError as e:
+        _log.warning("Unicode error during text reading, continuing: %s", e)
+        return filepath.read_bytes().decode("utf-8", errors="replace")
+
+
 def get_page_count_for_pdf(
    document_path: Path,
    log: logging.Logger | None = None,
@@ -107,7 +135,7 @@ def extract_pdf_metadata(
            try:
                namespace.encode("utf-8")
                key_value.encode("utf-8")
-            except UnicodeEncodeError as enc_err:
+            except UnicodeEncodeError as enc_err:  # pragma: no cover
                _log.debug("Skipping metadata key %s: %s", key, enc_err)
                continue

--- a/src/paperless/settings/init.py
+++ b/src/paperless/settings/init.py
@@ -121,10 +121,7 @@ INSTALLED_APPS = [
    "django_extensions",
    "paperless",
    "documents.apps.DocumentsConfig",
-    "paperless_tesseract.apps.PaperlessTesseractConfig",
-    "paperless_text.apps.PaperlessTextConfig",
    "paperless_mail.apps.PaperlessMailConfig",
-    "paperless_remote.apps.PaperlessRemoteParserConfig",
    "django.contrib.admin",
    "rest_framework",
    "rest_framework.authtoken",
@@ -974,8 +971,8 @@ TIKA_GOTENBERG_ENDPOINT = os.getenv(
    "http://localhost:3000",
 )

-if TIKA_ENABLED:
-    INSTALLED_APPS.append("paperless_tika.apps.PaperlessTikaConfig")
+# The Tika parser is now registered through the main parser registry,
+# so no separate Django app is needed.

 AUDIT_LOG_ENABLED = get_bool_from_env("PAPERLESS_AUDIT_LOG_ENABLED", "true")
 if AUDIT_LOG_ENABLED:
--- a/src/paperless/tests/parsers/conftest.py
+++ b/src/paperless/tests/parsers/conftest.py
@@ -6,19 +6,29 @@ so it is easy to see which files belong to which test module.

 from __future__ import annotations

+from contextlib import contextmanager
 from typing import TYPE_CHECKING

 import pytest
+from django.test import override_settings

+from paperless.parsers.mail import MailDocumentParser
 from paperless.parsers.remote import RemoteDocumentParser
+from paperless.parsers.tesseract import RasterisedDocumentParser
 from paperless.parsers.text import TextDocumentParser
 from paperless.parsers.tika import TikaDocumentParser

 if TYPE_CHECKING:
+    from collections.abc import Callable
    from collections.abc import Generator
    from pathlib import Path
+    from unittest.mock import MagicMock

    from pytest_django.fixtures import SettingsWrapper
+    from pytest_mock import MockerFixture
+
+    #: Type for the ``make_tesseract_parser`` fixture factory.
+    MakeTesseractParser = Callable[..., Generator[RasterisedDocumentParser, None, None]]


 # ------------------------------------------------------------------
@@ -80,35 +90,6 @@ def text_parser() -> Generator[TextDocumentParser, None, None]:
        yield parser


-# ------------------------------------------------------------------
-# Remote parser sample files
-# ------------------------------------------------------------------
-
-
-@pytest.fixture(scope="session")
-def remote_samples_dir(samples_dir: Path) -> Path:
-    """Absolute path to the remote parser sample files directory.
-
-    Returns
-    -------
-    Path
-        ``<samples_dir>/remote/``
-    """
-    return samples_dir / "remote"
-
-
-@pytest.fixture(scope="session")
-def sample_pdf_file(remote_samples_dir: Path) -> Path:
-    """Path to a simple digital PDF sample file.
-
-    Returns
-    -------
-    Path
-        Absolute path to ``remote/simple-digital.pdf``.
-    """
-    return remote_samples_dir / "simple-digital.pdf"
-
-
 # ------------------------------------------------------------------
 # Remote parser instance
 # ------------------------------------------------------------------
@@ -247,3 +228,544 @@ def tika_parser() -> Generator[TikaDocumentParser, None, None]:
    """
    with TikaDocumentParser() as parser:
        yield parser
+
+
+# ------------------------------------------------------------------
+# Mail parser sample files
+# ------------------------------------------------------------------
+
+
+@pytest.fixture(scope="session")
+def mail_samples_dir(samples_dir: Path) -> Path:
+    """Absolute path to the mail parser sample files directory.
+
+    Returns
+    -------
+    Path
+        ``<samples_dir>/mail/``
+    """
+    return samples_dir / "mail"
+
+
+@pytest.fixture(scope="session")
+def broken_email_file(mail_samples_dir: Path) -> Path:
+    """Path to a broken/malformed EML sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/broken.eml``.
+    """
+    return mail_samples_dir / "broken.eml"
+
+
+@pytest.fixture(scope="session")
+def simple_txt_email_file(mail_samples_dir: Path) -> Path:
+    """Path to a plain-text email sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/simple_text.eml``.
+    """
+    return mail_samples_dir / "simple_text.eml"
+
+
+@pytest.fixture(scope="session")
+def simple_txt_email_pdf_file(mail_samples_dir: Path) -> Path:
+    """Path to the expected PDF rendition of the plain-text email.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/simple_text.eml.pdf``.
+    """
+    return mail_samples_dir / "simple_text.eml.pdf"
+
+
+@pytest.fixture(scope="session")
+def simple_txt_email_thumbnail_file(mail_samples_dir: Path) -> Path:
+    """Path to the expected thumbnail for the plain-text email.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/simple_text.eml.pdf.webp``.
+    """
+    return mail_samples_dir / "simple_text.eml.pdf.webp"
+
+
+@pytest.fixture(scope="session")
+def html_email_file(mail_samples_dir: Path) -> Path:
+    """Path to an HTML email sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/html.eml``.
+    """
+    return mail_samples_dir / "html.eml"
+
+
+@pytest.fixture(scope="session")
+def html_email_pdf_file(mail_samples_dir: Path) -> Path:
+    """Path to the expected PDF rendition of the HTML email.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/html.eml.pdf``.
+    """
+    return mail_samples_dir / "html.eml.pdf"
+
+
+@pytest.fixture(scope="session")
+def html_email_thumbnail_file(mail_samples_dir: Path) -> Path:
+    """Path to the expected thumbnail for the HTML email.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/html.eml.pdf.webp``.
+    """
+    return mail_samples_dir / "html.eml.pdf.webp"
+
+
+@pytest.fixture(scope="session")
+def html_email_html_file(mail_samples_dir: Path) -> Path:
+    """Path to the HTML body of the HTML email sample.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/html.eml.html``.
+    """
+    return mail_samples_dir / "html.eml.html"
+
+
+@pytest.fixture(scope="session")
+def merged_pdf_first(mail_samples_dir: Path) -> Path:
+    """Path to the first PDF used in PDF-merge tests.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/first.pdf``.
+    """
+    return mail_samples_dir / "first.pdf"
+
+
+@pytest.fixture(scope="session")
+def merged_pdf_second(mail_samples_dir: Path) -> Path:
+    """Path to the second PDF used in PDF-merge tests.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``mail/second.pdf``.
+    """
+    return mail_samples_dir / "second.pdf"
+
+
+# ------------------------------------------------------------------
+# Mail parser instance
+# ------------------------------------------------------------------
+
+
+@pytest.fixture()
+def mail_parser() -> Generator[MailDocumentParser, None, None]:
+    """Yield a MailDocumentParser and clean up its temporary directory afterwards.
+
+    Yields
+    ------
+    MailDocumentParser
+        A ready-to-use parser instance.
+    """
+    with MailDocumentParser() as parser:
+        yield parser
+
+
+@pytest.fixture(scope="session")
+def nginx_base_url() -> Generator[str, None, None]:
+    """
+    The base URL of the nginx HTTP server we expect to be running.
+    """
+    yield "http://localhost:8080"
+
+
+# ------------------------------------------------------------------
+# Tesseract parser sample files
+# ------------------------------------------------------------------
+
+
+@pytest.fixture(scope="session")
+def tesseract_samples_dir(samples_dir: Path) -> Path:
+    """Absolute path to the tesseract parser sample files directory.
+
+    Returns
+    -------
+    Path
+        ``<samples_dir>/tesseract/``
+    """
+    return samples_dir / "tesseract"
+
+
+@pytest.fixture(scope="session")
+def document_webp_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a WebP document sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/document.webp``.
+    """
+    return tesseract_samples_dir / "document.webp"
+
+
+@pytest.fixture(scope="session")
+def encrypted_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to an encrypted PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/encrypted.pdf``.
+    """
+    return tesseract_samples_dir / "encrypted.pdf"
+
+
+@pytest.fixture(scope="session")
+def multi_page_digital_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a multi-page digital PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/multi-page-digital.pdf``.
+    """
+    return tesseract_samples_dir / "multi-page-digital.pdf"
+
+
+@pytest.fixture(scope="session")
+def multi_page_images_alpha_rgb_tiff_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a multi-page TIFF with alpha channel in RGB.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/multi-page-images-alpha-rgb.tiff``.
+    """
+    return tesseract_samples_dir / "multi-page-images-alpha-rgb.tiff"
+
+
+@pytest.fixture(scope="session")
+def multi_page_images_alpha_tiff_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a multi-page TIFF with alpha channel.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/multi-page-images-alpha.tiff``.
+    """
+    return tesseract_samples_dir / "multi-page-images-alpha.tiff"
+
+
+@pytest.fixture(scope="session")
+def multi_page_images_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a multi-page PDF with images.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/multi-page-images.pdf``.
+    """
+    return tesseract_samples_dir / "multi-page-images.pdf"
+
+
+@pytest.fixture(scope="session")
+def multi_page_images_tiff_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a multi-page TIFF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/multi-page-images.tiff``.
+    """
+    return tesseract_samples_dir / "multi-page-images.tiff"
+
+
+@pytest.fixture(scope="session")
+def multi_page_mixed_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a multi-page mixed PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/multi-page-mixed.pdf``.
+    """
+    return tesseract_samples_dir / "multi-page-mixed.pdf"
+
+
+@pytest.fixture(scope="session")
+def no_text_alpha_png_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a PNG with alpha channel and no text.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/no-text-alpha.png``.
+    """
+    return tesseract_samples_dir / "no-text-alpha.png"
+
+
+@pytest.fixture(scope="session")
+def rotated_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a rotated PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/rotated.pdf``.
+    """
+    return tesseract_samples_dir / "rotated.pdf"
+
+
+@pytest.fixture(scope="session")
+def rtl_test_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to an RTL test PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/rtl-test.pdf``.
+    """
+    return tesseract_samples_dir / "rtl-test.pdf"
+
+
+@pytest.fixture(scope="session")
+def signed_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a signed PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/signed.pdf``.
+    """
+    return tesseract_samples_dir / "signed.pdf"
+
+
+@pytest.fixture(scope="session")
+def simple_alpha_png_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple PNG with alpha channel.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple-alpha.png``.
+    """
+    return tesseract_samples_dir / "simple-alpha.png"
+
+
+@pytest.fixture(scope="session")
+def simple_digital_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple digital PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple-digital.pdf``.
+    """
+    return tesseract_samples_dir / "simple-digital.pdf"
+
+
+@pytest.fixture(scope="session")
+def simple_no_dpi_png_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple PNG without DPI information.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple-no-dpi.png``.
+    """
+    return tesseract_samples_dir / "simple-no-dpi.png"
+
+
+@pytest.fixture(scope="session")
+def simple_bmp_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple BMP sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple.bmp``.
+    """
+    return tesseract_samples_dir / "simple.bmp"
+
+
+@pytest.fixture(scope="session")
+def simple_gif_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple GIF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple.gif``.
+    """
+    return tesseract_samples_dir / "simple.gif"
+
+
+@pytest.fixture(scope="session")
+def simple_heic_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple HEIC sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple.heic``.
+    """
+    return tesseract_samples_dir / "simple.heic"
+
+
+@pytest.fixture(scope="session")
+def simple_jpg_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple JPG sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple.jpg``.
+    """
+    return tesseract_samples_dir / "simple.jpg"
+
+
+@pytest.fixture(scope="session")
+def simple_png_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple PNG sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple.png``.
+    """
+    return tesseract_samples_dir / "simple.png"
+
+
+@pytest.fixture(scope="session")
+def simple_tif_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a simple TIF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/simple.tif``.
+    """
+    return tesseract_samples_dir / "simple.tif"
+
+
+@pytest.fixture(scope="session")
+def single_page_mixed_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a single-page mixed PDF sample file.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/single-page-mixed.pdf``.
+    """
+    return tesseract_samples_dir / "single-page-mixed.pdf"
+
+
+@pytest.fixture(scope="session")
+def with_form_pdf_file(tesseract_samples_dir: Path) -> Path:
+    """Path to a sample PDF file that contains a form.
+
+    Returns
+    -------
+    Path
+        Absolute path to ``tesseract/with-form.pdf``.
+    """
+    return tesseract_samples_dir / "with-form.pdf"
+
+
+# ------------------------------------------------------------------
+# Tesseract parser instance and settings helpers
+# ------------------------------------------------------------------
+
+
+@pytest.fixture()
+def null_app_config(mocker: MockerFixture) -> MagicMock:
+    """Return a MagicMock with all OcrConfig fields set to None.
+
+    This allows the parser to fall back to Django settings instead of
+    hitting the database.
+
+    Returns
+    -------
+    MagicMock
+        Mock config with all fields set to None.
+    """
+    return mocker.MagicMock(
+        output_type=None,
+        pages=None,
+        language=None,
+        mode=None,
+        skip_archive_file=None,
+        image_dpi=None,
+        unpaper_clean=None,
+        deskew=None,
+        rotate_pages=None,
+        rotate_pages_threshold=None,
+        max_image_pixels=None,
+        color_conversion_strategy=None,
+        user_args=None,
+    )
+
+
+@pytest.fixture()
+def tesseract_parser(
+    mocker: MockerFixture,
+    null_app_config: MagicMock,
+) -> Generator[RasterisedDocumentParser, None, None]:
+    """Yield a RasterisedDocumentParser and clean up its temporary directory afterwards.
+
+    Patches the config system to avoid database access.
+
+    Yields
+    ------
+    RasterisedDocumentParser
+        A ready-to-use parser instance.
+    """
+    mocker.patch(
+        "paperless.config.BaseConfig._get_config_instance",
+        return_value=null_app_config,
+    )
+    with RasterisedDocumentParser() as parser:
+        yield parser
+
+
+@pytest.fixture()
+def make_tesseract_parser(
+    mocker: MockerFixture,
+    null_app_config: MagicMock,
+) -> MakeTesseractParser:
+    """Return a factory for creating RasterisedDocumentParser with Django settings overrides.
+
+    This fixture is useful for tests that need to create parsers with different
+    settings configurations.
+
+    Returns
+    -------
+    Callable[..., AbstractContextManager[RasterisedDocumentParser]]
+        A context manager factory that accepts Django settings overrides.
+    """
+    mocker.patch(
+        "paperless.config.BaseConfig._get_config_instance",
+        return_value=null_app_config,
+    )
+
+    @contextmanager
+    def _make_parser(**django_settings_overrides):
+        with override_settings(**django_settings_overrides):
+            with RasterisedDocumentParser() as parser:
+                yield parser
+
+    return _make_parser
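The `make_tesseract_parser` fixture above hands tests a context-manager factory rather than a parser instance, so each test can choose its own settings and still get cleanup. A minimal standalone sketch of that pattern follows, using only the standard library. `SETTINGS`, `override`, `FakeParser`, and `make_parser` are hypothetical stand-ins for Django settings, `override_settings`, `RasterisedDocumentParser`, and the fixture's inner function; none of them are the project's code.

```python
import shutil
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path

# Hypothetical stand-in for Django settings.
SETTINGS: dict[str, object] = {"OCR_LANGUAGE": "eng"}


@contextmanager
def override(**overrides: object) -> Iterator[None]:
    """Temporarily replace entries in SETTINGS, restoring them on exit."""
    saved = dict(SETTINGS)
    SETTINGS.update(overrides)
    try:
        yield
    finally:
        SETTINGS.clear()
        SETTINGS.update(saved)


class FakeParser:
    """Hypothetical parser that owns a temporary directory."""

    def __init__(self) -> None:
        self.tempdir = Path(tempfile.mkdtemp())
        # Read the setting at construction time, while the override is active.
        self.language = SETTINGS["OCR_LANGUAGE"]

    def __enter__(self) -> "FakeParser":
        return self

    def __exit__(self, *exc_info: object) -> None:
        shutil.rmtree(self.tempdir, ignore_errors=True)


@contextmanager
def make_parser(**overrides: object) -> Iterator[FakeParser]:
    """Build a parser under the given overrides and clean it up afterwards."""
    with override(**overrides), FakeParser() as parser:
        yield parser


# Usage: the override and the temporary directory both end with the block.
with make_parser(OCR_LANGUAGE="deu") as parser:
    used_dir = parser.tempdir
    seen_language = parser.language
```

Because the factory is itself a context manager, a test that needs two differently configured parsers can call it twice without sharing state between them.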
--- a/src/paperless/tests/parsers/test_mail_parser.py
+++ b/src/paperless/tests/parsers/test_mail_parser.py
@@ -12,7 +12,64 @@ from pytest_httpx import HTTPXMock
 from pytest_mock import MockerFixture

 from documents.parsers import ParseError
-from paperless_mail.parsers import MailDocumentParser
+from paperless.parsers import ParserContext
+from paperless.parsers import ParserProtocol
+from paperless.parsers.mail import MailDocumentParser
+
+
+class TestMailParserProtocol:
+    """Verify that MailDocumentParser satisfies the ParserProtocol contract."""
+
+    def test_isinstance_satisfies_protocol(
+        self,
+        mail_parser: MailDocumentParser,
+    ) -> None:
+        assert isinstance(mail_parser, ParserProtocol)
+
+    def test_supported_mime_types(self) -> None:
+        mime_types = MailDocumentParser.supported_mime_types()
+        assert isinstance(mime_types, dict)
+        assert "message/rfc822" in mime_types
+
+    @pytest.mark.parametrize(
+        ("mime_type", "expected"),
+        [
+            ("message/rfc822", 10),
+            ("application/pdf", None),
+            ("text/plain", None),
+        ],
+    )
+    def test_score(self, mime_type: str, expected: int | None) -> None:
+        assert MailDocumentParser.score(mime_type, "email.eml") == expected
+
+    def test_can_produce_archive_is_false(
+        self,
+        mail_parser: MailDocumentParser,
+    ) -> None:
+        assert mail_parser.can_produce_archive is False
+
+    def test_requires_pdf_rendition_is_true(
+        self,
+        mail_parser: MailDocumentParser,
+    ) -> None:
+        assert mail_parser.requires_pdf_rendition is True
+
+    def test_get_page_count_returns_none_without_archive(
+        self,
+        mail_parser: MailDocumentParser,
+        html_email_file: Path,
+    ) -> None:
+        assert mail_parser.get_page_count(html_email_file, "message/rfc822") is None
+
+    def test_get_page_count_returns_int_with_pdf_archive(
+        self,
+        mail_parser: MailDocumentParser,
+        simple_txt_email_pdf_file: Path,
+    ) -> None:
+        mail_parser._archive_path = simple_txt_email_pdf_file
+        count = mail_parser.get_page_count(simple_txt_email_pdf_file, "message/rfc822")
+        assert isinstance(count, int)
+        assert count > 0


 class TestEmailFileParsing:
@@ -24,7 +81,7 @@ class TestEmailFileParsing:
    def test_parse_error_missing_file(
        self,
        mail_parser: MailDocumentParser,
-        sample_dir: Path,
+        mail_samples_dir: Path,
    ) -> None:
        """
        GIVEN:
@@ -35,7 +92,7 @@ class TestEmailFileParsing:
            - An Exception is thrown
        """
        # Check if exception is raised when parsing fails.
-        test_file = sample_dir / "doesntexist.eml"
+        test_file = mail_samples_dir / "doesntexist.eml"

        assert not test_file.exists()

@@ -246,12 +303,12 @@ class TestEmailThumbnailGenerate:
        """
        mocked_return = "Passing the return value through.."
        mock_make_thumbnail_from_pdf = mocker.patch(
-            "paperless_mail.parsers.make_thumbnail_from_pdf",
+            "paperless.parsers.mail.make_thumbnail_from_pdf",
        )
        mock_make_thumbnail_from_pdf.return_value = mocked_return

        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )
        mock_generate_pdf.return_value = "Mocked return value.."

@@ -260,8 +317,7 @@ class TestEmailThumbnailGenerate:
        mock_generate_pdf.assert_called_once()
        mock_make_thumbnail_from_pdf.assert_called_once_with(
            "Mocked return value..",
-            mail_parser.tempdir,
-            None,
+            mail_parser._tempdir,
        )

        assert mocked_return == thumb
@@ -373,7 +429,7 @@ class TestParser:
        """
        # Validate parsing returns the expected results
        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )

        mail_parser.parse(simple_txt_email_file, "message/rfc822")
@@ -385,7 +441,7 @@ class TestParser:
            "BCC: fdf@fvf.de\n\n"
            "\n\nThis is just a simple Text Mail."
        )
-        assert text_expected == mail_parser.text
+        assert text_expected == mail_parser.get_text()
        assert (
            datetime.datetime(
                2022,
@@ -396,7 +452,7 @@ class TestParser:
                43,
                tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)),
            )
-            == mail_parser.date
+            == mail_parser.get_date()
        )

        # Just check if tried to generate archive, the unittest for generate_pdf() goes deeper.
@@ -419,7 +475,7 @@ class TestParser:
        """

        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )

        # Validate parsing returns the expected results
@@ -443,7 +499,7 @@ class TestParser:
        mail_parser.parse(html_email_file, "message/rfc822")

        mock_generate_pdf.assert_called_once()
-        assert text_expected == mail_parser.text
+        assert text_expected == mail_parser.get_text()
        assert (
            datetime.datetime(
                2022,
@@ -454,7 +510,7 @@ class TestParser:
                19,
                tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)),
            )
-            == mail_parser.date
+            == mail_parser.get_date()
        )

    def test_generate_pdf_parse_error(
@@ -501,7 +557,7 @@ class TestParser:

        mail_parser.parse(simple_txt_email_file, "message/rfc822")

-        assert mail_parser.archive_path is not None
+        assert mail_parser.get_archive_path() is not None

    @pytest.mark.httpx_mock(can_send_already_matched_responses=True)
    def test_generate_pdf_html_email(
@@ -542,7 +598,7 @@ class TestParser:
        )
        mail_parser.parse(html_email_file, "message/rfc822")

-        assert mail_parser.archive_path is not None
+        assert mail_parser.get_archive_path() is not None

    def test_generate_pdf_html_email_html_to_pdf_failure(
        self,
@@ -712,10 +768,10 @@ class TestParser:

        def test_layout_option(layout_option, expected_calls, expected_pdf_names):
            mock_mailrule_get.return_value = mock.Mock(pdf_layout=layout_option)
+            mail_parser.configure(ParserContext(mailrule_id=1))
            mail_parser.parse(
                document_path=html_email_file,
                mime_type="message/rfc822",
-                mailrule_id=1,
            )
            args, _ = mock_merge_route.call_args
            assert len(args[0]) == expected_calls
--- a/src/paperless/tests/parsers/test_mail_parser_live.py
+++ b/src/paperless/tests/parsers/test_mail_parser_live.py
@@ -11,7 +11,7 @@ from PIL import Image
 from pytest_mock import MockerFixture

 from documents.tests.utils import util_call_with_backoff
-from paperless_mail.parsers import MailDocumentParser
+from paperless.parsers.mail import MailDocumentParser


 def extract_text(pdf_path: Path) -> str:
@@ -159,7 +159,7 @@ class TestParserLive:
            - The returned thumbnail image file shall match the expected hash
        """
        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )
        mock_generate_pdf.return_value = simple_txt_email_pdf_file

@@ -216,10 +216,10 @@ class TestParserLive:
            - The merged PDF shall contain text from both source PDFs
        """
        mock_generate_pdf_from_html = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf_from_html",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf_from_html",
        )
        mock_generate_pdf_from_mail = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf_from_mail",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf_from_mail",
        )
        mock_generate_pdf_from_mail.return_value = merged_pdf_first
        mock_generate_pdf_from_html.return_value = merged_pdf_second
--- a/src/paperless/tests/parsers/test_remote_parser.py
+++ b/src/paperless/tests/parsers/test_remote_parser.py
@@ -20,6 +20,7 @@ from unittest.mock import Mock

 import pytest

+from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.remote import RemoteDocumentParser

@@ -276,20 +277,20 @@ class TestRemoteParserParse:
    def test_parse_returns_text_from_azure(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
        azure_client: Mock,
    ) -> None:
-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        assert remote_parser.get_text() == _DEFAULT_TEXT

    def test_parse_sets_archive_path(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
        azure_client: Mock,
    ) -> None:
-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        archive = remote_parser.get_archive_path()
        assert archive is not None
@@ -299,10 +300,11 @@ class TestRemoteParserParse:
    def test_parse_closes_client_on_success(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
        azure_client: Mock,
    ) -> None:
-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.configure(ParserContext())
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        azure_client.close.assert_called_once()

@@ -310,9 +312,9 @@ class TestRemoteParserParse:
    def test_parse_sets_empty_text_when_not_configured(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
    ) -> None:
-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        assert remote_parser.get_text() == ""
        assert remote_parser.get_archive_path() is None
@@ -326,10 +328,10 @@ class TestRemoteParserParse:
    def test_get_date_always_none(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
        azure_client: Mock,
    ) -> None:
-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        assert remote_parser.get_date() is None

@@ -343,33 +345,33 @@ class TestRemoteParserParseError:
    def test_parse_returns_none_on_azure_error(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
        failing_azure_client: Mock,
    ) -> None:
-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        assert remote_parser.get_text() is None

    def test_parse_closes_client_on_error(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
        failing_azure_client: Mock,
    ) -> None:
-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        failing_azure_client.close.assert_called_once()

    def test_parse_logs_error_on_azure_failure(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
        failing_azure_client: Mock,
        mocker: MockerFixture,
    ) -> None:
        mock_log = mocker.patch("paperless.parsers.remote.logger")

-        remote_parser.parse(sample_pdf_file, "application/pdf")
+        remote_parser.parse(simple_digital_pdf_file, "application/pdf")

        mock_log.error.assert_called_once()
        assert "Azure AI Vision parsing failed" in mock_log.error.call_args[0][0]
@@ -384,18 +386,18 @@ class TestRemoteParserPageCount:
    def test_page_count_for_pdf(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
    ) -> None:
-        count = remote_parser.get_page_count(sample_pdf_file, "application/pdf")
+        count = remote_parser.get_page_count(simple_digital_pdf_file, "application/pdf")
        assert isinstance(count, int)
        assert count >= 1

    def test_page_count_returns_none_for_image_mime(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
    ) -> None:
-        count = remote_parser.get_page_count(sample_pdf_file, "image/png")
+        count = remote_parser.get_page_count(simple_digital_pdf_file, "image/png")
        assert count is None

    def test_page_count_returns_none_for_invalid_pdf(
@@ -418,25 +420,31 @@ class TestRemoteParserMetadata:
    def test_extract_metadata_non_pdf_returns_empty(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
    ) -> None:
-        result = remote_parser.extract_metadata(sample_pdf_file, "image/png")
+        result = remote_parser.extract_metadata(simple_digital_pdf_file, "image/png")
        assert result == []

    def test_extract_metadata_pdf_returns_list(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
    ) -> None:
-        result = remote_parser.extract_metadata(sample_pdf_file, "application/pdf")
+        result = remote_parser.extract_metadata(
+            simple_digital_pdf_file,
+            "application/pdf",
+        )
        assert isinstance(result, list)

    def test_extract_metadata_pdf_entries_have_required_keys(
        self,
        remote_parser: RemoteDocumentParser,
-        sample_pdf_file: Path,
+        simple_digital_pdf_file: Path,
    ) -> None:
-        result = remote_parser.extract_metadata(sample_pdf_file, "application/pdf")
+        result = remote_parser.extract_metadata(
+            simple_digital_pdf_file,
+            "application/pdf",
+        )
        for entry in result:
            assert "namespace" in entry
            assert "prefix" in entry
@@ -479,12 +487,17 @@ class TestRemoteParserRegistry:
        assert parser_cls is RemoteDocumentParser

    @pytest.mark.usefixtures("no_engine_settings")
-    def test_get_parser_returns_none_for_pdf_when_not_configured(self) -> None:
-        """With no tesseract parser registered yet, PDF has no handler if remote is off."""
+    def test_get_parser_returns_none_for_unsupported_type_when_not_configured(
+        self,
+    ) -> None:
+        """With remote off and a truly unsupported MIME type, registry returns None."""
        from paperless.parsers.registry import ParserRegistry

        registry = ParserRegistry()
        registry.register_defaults()
-        parser_cls = registry.get_parser_for_file("application/pdf", "doc.pdf")
+        parser_cls = registry.get_parser_for_file(
+            "application/x-unknown-format",
+            "doc.xyz",
+        )

        assert parser_cls is None
--- a/src/paperless/tests/parsers/test_tesseract_custom_settings.py
+++ b/src/paperless/tests/parsers/test_tesseract_custom_settings.py
@@ -10,7 +10,7 @@ from paperless.models import CleanChoices
 from paperless.models import ColorConvertChoices
 from paperless.models import ModeChoices
 from paperless.models import OutputTypeChoices
-from paperless_tesseract.parsers import RasterisedDocumentParser
+from paperless.parsers.tesseract import RasterisedDocumentParser


 class TestParserSettingsFromDb(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
--- a/src/paperless/tests/parsers/test_tesseract_parser.py
+++ b/src/paperless/tests/parsers/test_tesseract_parser.py
--- a/src/paperless/tests/parsers/test_text_parser.py
+++ b/src/paperless/tests/parsers/test_text_parser.py
@@ -12,6 +12,7 @@ from pathlib import Path

 import pytest

+from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.text import TextDocumentParser

@@ -93,6 +94,7 @@ class TestTextParserParse:
        text_parser: TextDocumentParser,
        sample_txt_file: Path,
    ) -> None:
+        text_parser.configure(ParserContext())
        text_parser.parse(sample_txt_file, "text/plain")

        assert text_parser.get_text() == "This is a test file.\n"
@@ -102,6 +104,7 @@ class TestTextParserParse:
        text_parser: TextDocumentParser,
        sample_txt_file: Path,
    ) -> None:
+        text_parser.configure(ParserContext())
        text_parser.parse(sample_txt_file, "text/plain")

        assert text_parser.get_archive_path() is None
@@ -111,6 +114,7 @@ class TestTextParserParse:
        text_parser: TextDocumentParser,
        sample_txt_file: Path,
    ) -> None:
+        text_parser.configure(ParserContext())
        text_parser.parse(sample_txt_file, "text/plain")

        assert text_parser.get_date() is None
@@ -129,6 +133,7 @@ class TestTextParserParse:
            - Parsing succeeds
            - Invalid bytes are replaced with the Unicode replacement character
        """
+        text_parser.configure(ParserContext())
        text_parser.parse(malformed_txt_file, "text/plain")

        assert text_parser.get_text() == "Pantothens\ufffdure\n"
@@ -251,6 +256,9 @@ class TestTextParserRegistry:
        from paperless.parsers.registry import get_parser_registry

        registry = get_parser_registry()
-        parser_cls = registry.get_parser_for_file("application/pdf", "doc.pdf")
+        parser_cls = registry.get_parser_for_file(
+            "application/x-unknown-format",
+            "doc.xyz",
+        )

        assert parser_cls is None
--- a/src/paperless/tests/parsers/test_tika_parser.py
+++ b/src/paperless/tests/parsers/test_tika_parser.py
@@ -9,6 +9,7 @@ from pytest_django.fixtures import SettingsWrapper
 from pytest_httpx import HTTPXMock

 from documents.parsers import ParseError
+from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.tika import TikaDocumentParser

@@ -60,6 +61,29 @@ class TestTikaParserRegistryInterface:
    def test_requires_pdf_rendition_is_true(self) -> None:
        assert TikaDocumentParser().requires_pdf_rendition is True

+    def test_get_page_count_returns_none_without_archive(
+        self,
+        tika_parser: TikaDocumentParser,
+        sample_odt_file: Path,
+    ) -> None:
+        assert (
+            tika_parser.get_page_count(
+                sample_odt_file,
+                "application/vnd.oasis.opendocument.text",
+            )
+            is None
+        )
+
+    def test_get_page_count_returns_int_with_pdf_archive(
+        self,
+        tika_parser: TikaDocumentParser,
+        simple_digital_pdf_file: Path,
+    ) -> None:
+        tika_parser._archive_path = simple_digital_pdf_file
+        count = tika_parser.get_page_count(simple_digital_pdf_file, "application/pdf")
+        assert isinstance(count, int)
+        assert count > 0
+

 @pytest.mark.django_db()
 class TestTikaParser:
@@ -83,6 +107,7 @@ class TestTikaParser:
        # Pretend convert to PDF response
        httpx_mock.add_response(content=b"PDF document")

+        tika_parser.configure(ParserContext())
        tika_parser.parse(sample_odt_file, "application/vnd.oasis.opendocument.text")

        assert tika_parser.get_text() == "the content"
--- a/src/paperless/tests/samples/mail/broken.eml
+++ b/src/paperless/tests/samples/mail/broken.eml
--- a/src/paperless/tests/samples/mail/first.pdf
+++ b/src/paperless/tests/samples/mail/first.pdf
--- a/src/paperless/tests/samples/mail/html.eml
+++ b/src/paperless/tests/samples/mail/html.eml
--- a/src/paperless/tests/samples/mail/html.eml.html
+++ b/src/paperless/tests/samples/mail/html.eml.html
--- a/src/paperless/tests/samples/mail/html.eml.pdf
+++ b/src/paperless/tests/samples/mail/html.eml.pdf
--- a/src/paperless/tests/samples/mail/html.eml.pdf.webp
+++ b/src/paperless/tests/samples/mail/html.eml.pdf.webp
--- a/src/paperless/tests/samples/mail/sample.html
+++ b/src/paperless/tests/samples/mail/sample.html
--- a/src/paperless/tests/samples/mail/sample.html.pdf
+++ b/src/paperless/tests/samples/mail/sample.html.pdf
--- a/src/paperless/tests/samples/mail/sample.html.pdf.webp
+++ b/src/paperless/tests/samples/mail/sample.html.pdf.webp
--- a/src/paperless/tests/samples/mail/sample.png
+++ b/src/paperless/tests/samples/mail/sample.png
--- a/src/paperless/tests/samples/mail/second.pdf
+++ b/src/paperless/tests/samples/mail/second.pdf
--- a/src/paperless/tests/samples/mail/simple_text.eml
+++ b/src/paperless/tests/samples/mail/simple_text.eml
--- a/src/paperless/tests/samples/mail/simple_text.eml.pdf
+++ b/src/paperless/tests/samples/mail/simple_text.eml.pdf
--- a/src/paperless/tests/samples/mail/simple_text.eml.pdf.webp
+++ b/src/paperless/tests/samples/mail/simple_text.eml.pdf.webp
--- a/src/paperless/tests/samples/tesseract/document.webp
+++ b/src/paperless/tests/samples/tesseract/document.webp
--- a/src/paperless/tests/samples/tesseract/encrypted.pdf
+++ b/src/paperless/tests/samples/tesseract/encrypted.pdf
--- a/src/paperless/tests/samples/tesseract/multi-page-digital.pdf
+++ b/src/paperless/tests/samples/tesseract/multi-page-digital.pdf
--- a/src/paperless/tests/samples/tesseract/multi-page-images-alpha-rgb.tiff
+++ b/src/paperless/tests/samples/tesseract/multi-page-images-alpha-rgb.tiff
--- a/src/paperless/tests/samples/tesseract/multi-page-images-alpha.tiff
+++ b/src/paperless/tests/samples/tesseract/multi-page-images-alpha.tiff
--- a/src/paperless/tests/samples/tesseract/multi-page-images.pdf
+++ b/src/paperless/tests/samples/tesseract/multi-page-images.pdf
--- a/src/paperless/tests/samples/tesseract/multi-page-images.tiff
+++ b/src/paperless/tests/samples/tesseract/multi-page-images.tiff
--- a/src/paperless/tests/samples/tesseract/multi-page-mixed.pdf
+++ b/src/paperless/tests/samples/tesseract/multi-page-mixed.pdf
--- a/src/paperless/tests/samples/tesseract/no-text-alpha.png
+++ b/src/paperless/tests/samples/tesseract/no-text-alpha.png
--- a/src/paperless/tests/samples/tesseract/rotated.pdf
+++ b/src/paperless/tests/samples/tesseract/rotated.pdf
--- a/src/paperless/tests/samples/tesseract/rtl-test.pdf
+++ b/src/paperless/tests/samples/tesseract/rtl-test.pdf
--- a/src/paperless/tests/samples/tesseract/signed.pdf
+++ b/src/paperless/tests/samples/tesseract/signed.pdf
--- a/src/paperless/tests/samples/tesseract/simple-alpha.png
+++ b/src/paperless/tests/samples/tesseract/simple-alpha.png
--- a/src/paperless/tests/samples/tesseract/simple-digital.pdf
+++ b/src/paperless/tests/samples/tesseract/simple-digital.pdf
--- a/src/paperless/tests/samples/tesseract/simple-no-dpi.png
+++ b/src/paperless/tests/samples/tesseract/simple-no-dpi.png
--- a/src/paperless/tests/samples/tesseract/simple.bmp
+++ b/src/paperless/tests/samples/tesseract/simple.bmp
--- a/src/paperless/tests/samples/tesseract/simple.gif
+++ b/src/paperless/tests/samples/tesseract/simple.gif
--- a/src/paperless/tests/samples/tesseract/simple.heic
+++ b/src/paperless/tests/samples/tesseract/simple.heic
--- a/src/paperless/tests/samples/tesseract/simple.jpg
+++ b/src/paperless/tests/samples/tesseract/simple.jpg
--- a/src/paperless/tests/samples/tesseract/simple.png
+++ b/src/paperless/tests/samples/tesseract/simple.png
--- a/src/paperless/tests/samples/tesseract/simple.tif
+++ b/src/paperless/tests/samples/tesseract/simple.tif
--- a/src/paperless/tests/samples/tesseract/single-page-mixed.pdf
+++ b/src/paperless/tests/samples/tesseract/single-page-mixed.pdf
--- a/src/paperless/tests/samples/tesseract/with-form.pdf
+++ b/src/paperless/tests/samples/tesseract/with-form.pdf
--- a/src/paperless/tests/test_checks.py
+++ b/src/paperless/tests/test_checks.py
@@ -5,6 +5,7 @@ from pathlib import Path
 from unittest import mock

 import pytest
+from django.core.checks import ERROR
 from django.core.checks import Error
 from django.core.checks import Warning
 from pytest_django.fixtures import SettingsWrapper
@@ -12,7 +13,9 @@ from pytest_mock import MockerFixture

 from paperless.checks import audit_log_check
 from paperless.checks import binaries_check
+from paperless.checks import check_default_language_available
 from paperless.checks import check_deprecated_db_settings
+from paperless.checks import check_remote_parser_configured
 from paperless.checks import check_v3_minimum_upgrade_version
 from paperless.checks import debug_mode_check
 from paperless.checks import paths_check
@@ -626,3 +629,116 @@ class TestV3MinimumUpgradeVersionCheck:
        conn.introspection.table_names.side_effect = OperationalError("DB unavailable")
        mocker.patch.dict("paperless.checks.connections", {"default": conn})
        assert check_v3_minimum_upgrade_version(None) == []
+
+
+class TestRemoteParserChecks:
+    def test_no_engine(self, settings: SettingsWrapper) -> None:
+        settings.REMOTE_OCR_ENGINE = None
+        msgs = check_remote_parser_configured(None)
+
+        assert len(msgs) == 0
+
+    def test_azure_no_endpoint(self, settings: SettingsWrapper) -> None:
+
+        settings.REMOTE_OCR_ENGINE = "azureai"
+        settings.REMOTE_OCR_API_KEY = "somekey"
+        settings.REMOTE_OCR_ENDPOINT = None
+
+        msgs = check_remote_parser_configured(None)
+
+        assert len(msgs) == 1
+
+        msg = msgs[0]
+
+        assert (
+            "Azure AI remote parser requires endpoint and API key to be configured."
+            in msg.msg
+        )
+
+
+class TestTesseractChecks:
+    def test_default_language(self) -> None:
+        check_default_language_available(None)
+
+    def test_no_language(self, settings: SettingsWrapper) -> None:
+
+        settings.OCR_LANGUAGE = ""
+
+        msgs = check_default_language_available(None)
+
+        assert len(msgs) == 1
+        msg = msgs[0]
+
+        assert (
+            "No OCR language has been specified with PAPERLESS_OCR_LANGUAGE" in msg.msg
+        )
+
+    def test_invalid_language(
+        self,
+        settings: SettingsWrapper,
+        mocker: MockerFixture,
+    ) -> None:
+
+        settings.OCR_LANGUAGE = "ita"
+
+        tesser_lang_mock = mocker.patch("paperless.checks.get_tesseract_langs")
+        tesser_lang_mock.return_value = ["deu", "eng"]
+
+        msgs = check_default_language_available(None)
+
+        assert len(msgs) == 1
+        msg = msgs[0]
+
+        assert msg.level == ERROR
+        assert "The selected ocr language ita is not installed" in msg.msg
+
+    def test_multi_part_language(
+        self,
+        settings: SettingsWrapper,
+        mocker: MockerFixture,
+    ) -> None:
+        """
+        GIVEN:
+            - An OCR language which is multi part (i.e. chi_sim)
+            - The language is correctly formatted
+        WHEN:
+            - Installed packages are checked
+        THEN:
+            - No errors are reported
+        """
+
+        settings.OCR_LANGUAGE = "chi_sim"
+
+        tesser_lang_mock = mocker.patch("paperless.checks.get_tesseract_langs")
+        tesser_lang_mock.return_value = ["chi_sim", "eng"]
+
+        msgs = check_default_language_available(None)
+
+        assert len(msgs) == 0
+
+    def test_multi_part_language_bad_format(
+        self,
+        settings: SettingsWrapper,
+        mocker: MockerFixture,
+    ) -> None:
+        """
+        GIVEN:
+            - An OCR language which is multi part (ie chi-sim)
+            - The language is NOT correctly formatted
+        WHEN:
+            - Installed packages are checked
+        THEN:
+            - An error is reported for the unavailable language
+        """
+        settings.OCR_LANGUAGE = "chi-sim"
+
+        tesser_lang_mock = mocker.patch("paperless.checks.get_tesseract_langs")
+        tesser_lang_mock.return_value = ["chi_sim", "eng"]
+
+        msgs = check_default_language_available(None)
+
+        assert len(msgs) == 1
+        msg = msgs[0]
+
+        assert msg.level == ERROR
+        assert "The selected ocr language chi-sim is not installed" in msg.msg
--- a/src/paperless/tests/test_registry.py
+++ b/src/paperless/tests/test_registry.py
@@ -18,6 +18,7 @@ from unittest.mock import patch

 import pytest

+from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.registry import ParserRegistry
 from paperless.parsers.registry import get_parser_registry
@@ -103,6 +104,11 @@ def dummy_parser_cls() -> type:
        ) -> list:
            return []

+        def configure(self, context: ParserContext) -> None:
+            """
+            Required to exist, but doesn't need to do anything
+            """
+
        def __enter__(self) -> Self:
            return self

@@ -144,6 +150,7 @@ class TestParserProtocol:
    @pytest.mark.parametrize(
        "missing_method",
        [
+            pytest.param("configure", id="missing-configure"),
            pytest.param("parse", id="missing-parse"),
            pytest.param("get_text", id="missing-get_text"),
            pytest.param("get_thumbnail", id="missing-get_thumbnail"),
--- a/src/paperless_mail/apps.py
+++ b/src/paperless_mail/apps.py
@@ -1,18 +1,8 @@
 from django.apps import AppConfig
-from django.conf import settings
 from django.utils.translation import gettext_lazy as _

-from paperless_mail.signals import mail_consumer_declaration
-

 class PaperlessMailConfig(AppConfig):
    name = "paperless_mail"

    verbose_name = _("Paperless mail")
-
-    def ready(self) -> None:
-        from documents.signals import document_consumer_declaration
-
-        if settings.TIKA_ENABLED:
-            document_consumer_declaration.connect(mail_consumer_declaration)
-        AppConfig.ready(self)
--- a/src/paperless_mail/parsers.py
+++ b/src/paperless_mail/parsers.py
@@ -1,481 +0,0 @@
-import re
-from html import escape
-from pathlib import Path
-
-from bleach import clean
-from bleach import linkify
-from django.conf import settings
-from django.utils import timezone
-from django.utils.timezone import is_naive
-from django.utils.timezone import make_aware
-from gotenberg_client import GotenbergClient
-from gotenberg_client.constants import A4
-from gotenberg_client.options import Measurement
-from gotenberg_client.options import MeasurementUnitType
-from gotenberg_client.options import PageMarginsType
-from gotenberg_client.options import PdfAFormat
-from humanize import naturalsize
-from imap_tools import MailAttachment
-from imap_tools import MailMessage
-from tika_client import TikaClient
-
-from documents.parsers import DocumentParser
-from documents.parsers import ParseError
-from documents.parsers import make_thumbnail_from_pdf
-from paperless.models import OutputTypeChoices
-from paperless_mail.models import MailRule
-
-
-class MailDocumentParser(DocumentParser):
-    """
-    This parser uses imap_tools to parse .eml files, generates pdf using
-    Gotenberg and sends the html part to a Tika server for text extraction.
-    """
-
-    logging_name = "paperless.parsing.mail"
-
-    def _settings_to_gotenberg_pdfa(self) -> PdfAFormat | None:
-        """
-        Converts our requested PDF/A output into the Gotenberg API
-        format
-        """
-        if settings.OCR_OUTPUT_TYPE in {
-            OutputTypeChoices.PDF_A,
-            OutputTypeChoices.PDF_A2,
-        }:
-            return PdfAFormat.A2b
-        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A1:  # pragma: no cover
-            self.log.warning(
-                "Gotenberg does not support PDF/A-1a, choosing PDF/A-2b instead",
-            )
-            return PdfAFormat.A2b
-        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A3:  # pragma: no cover
-            return PdfAFormat.A3b
-        return None
-
-    def get_thumbnail(
-        self,
-        document_path: Path,
-        mime_type: str,
-        file_name=None,
-    ) -> Path:
-        if not self.archive_path:
-            self.archive_path = self.generate_pdf(
-                self.parse_file_to_message(document_path),
-            )
-
-        return make_thumbnail_from_pdf(
-            self.archive_path,
-            self.tempdir,
-            self.logging_group,
-        )
-
-    def extract_metadata(self, document_path: Path, mime_type: str):
-        result = []
-
-        try:
-            mail = self.parse_file_to_message(document_path)
-        except ParseError as e:
-            self.log.warning(
-                f"Error while fetching document metadata for {document_path}: {e}",
-            )
-            return result
-
-        for key, value in mail.headers.items():
-            value = ", ".join(i for i in value)
-            try:
-                value.encode("utf-8")
-            except UnicodeEncodeError as e:  # pragma: no cover
-                self.log.debug(f"Skipping header {key}: {e}")
-                continue
-
-            result.append(
-                {
-                    "namespace": "",
-                    "prefix": "header",
-                    "key": key,
-                    "value": value,
-                },
-            )
-
-        result.append(
-            {
-                "namespace": "",
-                "prefix": "",
-                "key": "attachments",
-                "value": ", ".join(
-                    f"{attachment.filename}"
-                    f"({naturalsize(attachment.size, binary=True, format='%.2f')})"
-                    for attachment in mail.attachments
-                ),
-            },
-        )
-
-        result.append(
-            {
-                "namespace": "",
-                "prefix": "",
-                "key": "date",
-                "value": mail.date.strftime("%Y-%m-%d %H:%M:%S %Z"),
-            },
-        )
-
-        result.sort(key=lambda item: (item["prefix"], item["key"]))
-        return result
-
-    def parse(
-        self,
-        document_path: Path,
-        mime_type: str,
-        file_name=None,
-        mailrule_id: int | None = None,
-    ) -> None:
-        """
-        Parses the given .eml into formatted text, based on the decoded email.
-
-        """
-
-        def strip_text(text: str):
-            """
-            Reduces the spacing of the given text string
-            """
-            text = re.sub(r"\s+", " ", text)
-            text = re.sub(r"(\n *)+", "\n", text)
-            return text.strip()
-
-        def build_formatted_text(mail_message: MailMessage) -> str:
-            """
-            Constructs a formatted string, based on the given email.  Basically tries
-            to get most of the email content, included front matter, into a nice string
-            """
-            fmt_text = f"Subject: {mail_message.subject}\n\n"
-            fmt_text += f"From: {mail_message.from_values.full}\n\n"
-            to_list = [address.full for address in mail_message.to_values]
-            fmt_text += f"To: {', '.join(to_list)}\n\n"
-            if mail_message.cc_values:
-                fmt_text += (
-                    f"CC: {', '.join(address.full for address in mail.cc_values)}\n\n"
-                )
-            if mail_message.bcc_values:
-                fmt_text += (
-                    f"BCC: {', '.join(address.full for address in mail.bcc_values)}\n\n"
-                )
-            if mail_message.attachments:
-                att = []
-                for a in mail.attachments:
-                    attachment_size = naturalsize(a.size, binary=True, format="%.2f")
-                    att.append(
-                        f"{a.filename} ({attachment_size})",
-                    )
-                fmt_text += f"Attachments: {', '.join(att)}\n\n"
-
-            if mail.html:
-                fmt_text += "HTML content: " + strip_text(self.tika_parse(mail.html))
-
-            fmt_text += f"\n\n{strip_text(mail.text)}"
-
-            return fmt_text
-
-        self.log.debug(f"Parsing file {document_path.name} into an email")
-        mail = self.parse_file_to_message(document_path)
-
-        self.log.debug("Building formatted text from email")
-        self.text = build_formatted_text(mail)
-
-        if is_naive(mail.date):
-            self.date = make_aware(mail.date)
-        else:
-            self.date = mail.date
-
-        self.log.debug("Creating a PDF from the email")
-        if mailrule_id:
-            rule = MailRule.objects.get(pk=mailrule_id)
-            self.archive_path = self.generate_pdf(mail, rule.pdf_layout)
-        else:
-            self.archive_path = self.generate_pdf(mail)
-
-    @staticmethod
-    def parse_file_to_message(filepath: Path) -> MailMessage:
-        """
-        Parses the given .eml file into a MailMessage object
-        """
-        try:
-            with filepath.open("rb") as eml:
-                parsed = MailMessage.from_bytes(eml.read())
-                if parsed.from_values is None:
-                    raise ParseError(
-                        f"Could not parse {filepath}: Missing 'from'",
-                    )
-        except Exception as err:
-            raise ParseError(
-                f"Could not parse {filepath}: {err}",
-            ) from err
-
-        return parsed
-
-    def tika_parse(self, html: str):
-        self.log.info("Sending content to Tika server")
-
-        try:
-            with TikaClient(tika_url=settings.TIKA_ENDPOINT) as client:
-                parsed = client.tika.as_text.from_buffer(html, "text/html")
-
-                if parsed.content is not None:
-                    return parsed.content.strip()
-                return ""
-        except Exception as err:
-            raise ParseError(
-                f"Could not parse content with tika server at "
-                f"{settings.TIKA_ENDPOINT}: {err}",
-            ) from err
-
-    def generate_pdf(
-        self,
-        mail_message: MailMessage,
-        pdf_layout: MailRule.PdfLayout | None = None,
-    ) -> Path:
-        archive_path = Path(self.tempdir) / "merged.pdf"
-
-        mail_pdf_file = self.generate_pdf_from_mail(mail_message)
-
-        pdf_layout = (
-            pdf_layout or settings.EMAIL_PARSE_DEFAULT_LAYOUT
-        )  # EMAIL_PARSE_DEFAULT_LAYOUT is a MailRule.PdfLayout
-
-        # If no HTML content, create the PDF from the message
-        # Otherwise, create 2 PDFs and merge them with Gotenberg
-        if not mail_message.html:
-            archive_path.write_bytes(mail_pdf_file.read_bytes())
-        else:
-            pdf_of_html_content = self.generate_pdf_from_html(
-                mail_message.html,
-                mail_message.attachments,
-            )
-
-            self.log.debug("Merging email text and HTML content into single PDF")
-
-            with (
-                GotenbergClient(
-                    host=settings.TIKA_GOTENBERG_ENDPOINT,
-                    timeout=settings.CELERY_TASK_TIME_LIMIT,
-                ) as client,
-                client.merge.merge() as route,
-            ):
-                # Configure requested PDF/A formatting, if any
-                pdf_a_format = self._settings_to_gotenberg_pdfa()
-                if pdf_a_format is not None:
-                    route.pdf_format(pdf_a_format)
-
-                match pdf_layout:
-                    case MailRule.PdfLayout.HTML_TEXT:
-                        route.merge([pdf_of_html_content, mail_pdf_file])
-                    case MailRule.PdfLayout.HTML_ONLY:
-                        route.merge([pdf_of_html_content])
-                    case MailRule.PdfLayout.TEXT_ONLY:
-                        route.merge([mail_pdf_file])
-                    case MailRule.PdfLayout.TEXT_HTML | _:
-                        route.merge([mail_pdf_file, pdf_of_html_content])
-
-                try:
-                    response = route.run()
-                    archive_path.write_bytes(response.content)
-                except Exception as err:
-                    raise ParseError(
-                        f"Error while merging email HTML into PDF: {err}",
-                    ) from err
-
-        return archive_path
-
-    def mail_to_html(self, mail: MailMessage) -> Path:
-        """
-        Converts the given email into an HTML file, formatted
-        based on the given template
-        """
-
-        def clean_html(text: str) -> str:
-            """
-            Attempts to clean, escape and linkify the given HTML string
-            """
-            if isinstance(text, list):
-                text = "\n".join([str(e) for e in text])
-            if not isinstance(text, str):
-                text = str(text)
-            text = escape(text)
-            text = clean(text)
-            text = linkify(text, parse_email=True)
-            text = text.replace("\n", "<br>")
-            return text
-
-        data = {}
-
-        data["subject"] = clean_html(mail.subject)
-        if data["subject"]:
-            data["subject_label"] = "Subject"
-        data["from"] = clean_html(mail.from_values.full)
-        if data["from"]:
-            data["from_label"] = "From"
-        data["to"] = clean_html(", ".join(address.full for address in mail.to_values))
-        if data["to"]:
-            data["to_label"] = "To"
-        data["cc"] = clean_html(", ".join(address.full for address in mail.cc_values))
-        if data["cc"]:
-            data["cc_label"] = "CC"
-        data["bcc"] = clean_html(", ".join(address.full for address in mail.bcc_values))
-        if data["bcc"]:
-            data["bcc_label"] = "BCC"
-
-        att = []
-        for a in mail.attachments:
-            att.append(
-                f"{a.filename} ({naturalsize(a.size, binary=True, format='%.2f')})",
-            )
-        data["attachments"] = clean_html(", ".join(att))
-        if data["attachments"]:
-            data["attachments_label"] = "Attachments"
-
-        data["date"] = clean_html(
-            timezone.localtime(mail.date).strftime("%Y-%m-%d %H:%M"),
-        )
-        data["content"] = clean_html(mail.text.strip())
-
-        from django.template.loader import render_to_string
-
-        html_file = Path(self.tempdir) / "email_as_html.html"
-        html_file.write_text(render_to_string("email_msg_template.html", context=data))
-
-        return html_file
-
-    def generate_pdf_from_mail(self, mail: MailMessage) -> Path:
-        """
-        Creates a PDF based on the given email, using the email's values in a
-        an HTML template
-        """
-        self.log.info("Converting mail to PDF")
-
-        css_file = Path(__file__).parent / "templates" / "output.css"
-        email_html_file = self.mail_to_html(mail)
-
-        with (
-            GotenbergClient(
-                host=settings.TIKA_GOTENBERG_ENDPOINT,
-                timeout=settings.CELERY_TASK_TIME_LIMIT,
-            ) as client,
-            client.chromium.html_to_pdf() as route,
-        ):
-            # Configure requested PDF/A formatting, if any
-            pdf_a_format = self._settings_to_gotenberg_pdfa()
-            if pdf_a_format is not None:
-                route.pdf_format(pdf_a_format)
-
-            try:
-                response = (
-                    route.index(email_html_file)
-                    .resource(css_file)
-                    .margins(
-                        PageMarginsType(
-                            top=Measurement(0.1, MeasurementUnitType.Inches),
-                            bottom=Measurement(0.1, MeasurementUnitType.Inches),
-                            left=Measurement(0.1, MeasurementUnitType.Inches),
-                            right=Measurement(0.1, MeasurementUnitType.Inches),
-                        ),
-                    )
-                    .size(A4)
-                    .scale(1.0)
-                    .run()
-                )
-            except Exception as err:
-                raise ParseError(
-                    f"Error while converting email to PDF: {err}",
-                ) from err
-
-        email_as_pdf_file = Path(self.tempdir) / "email_as_pdf.pdf"
-        email_as_pdf_file.write_bytes(response.content)
-
-        return email_as_pdf_file
-
-    def generate_pdf_from_html(
-        self,
-        orig_html: str,
-        attachments: list[MailAttachment],
-    ) -> Path:
-        """
-        Generates a PDF file based on the HTML and attachments of the email
-        """
-
-        def clean_html_script(text: str):
-            compiled_open = re.compile(re.escape("<script"), re.IGNORECASE)
-            text = compiled_open.sub("<div hidden ", text)
-
-            compiled_close = re.compile(re.escape("</script"), re.IGNORECASE)
-            text = compiled_close.sub("</div", text)
-            return text
-
-        self.log.info("Converting message html to PDF")
-
-        tempdir = Path(self.tempdir)
-
-        html_clean = clean_html_script(orig_html)
-        html_clean_file = tempdir / "index.html"
-        html_clean_file.write_text(html_clean)
-
-        with (
-            GotenbergClient(
-                host=settings.TIKA_GOTENBERG_ENDPOINT,
-                timeout=settings.CELERY_TASK_TIME_LIMIT,
-            ) as client,
-            client.chromium.html_to_pdf() as route,
-        ):
-            # Configure requested PDF/A formatting, if any
-            pdf_a_format = self._settings_to_gotenberg_pdfa()
-            if pdf_a_format is not None:
-                route.pdf_format(pdf_a_format)
-
-            # Add attachments as resources, cleaning the filename and replacing
-            # it in the index file for inclusion
-            for attachment in attachments:
-                # Clean the attachment name to be valid
-                name_cid = f"cid:{attachment.content_id}"
-                name_clean = "".join(e for e in name_cid if e.isalnum())
-
-                # Write attachment payload to a temp file
-                temp_file = tempdir / name_clean
-                temp_file.write_bytes(attachment.payload)
-
-                route.resource(temp_file)
-
-                # Replace as needed the name with the clean name
-                html_clean = html_clean.replace(name_cid, name_clean)
-
-            # Now store the cleaned up HTML version
-            html_clean_file = tempdir / "index.html"
-            html_clean_file.write_text(html_clean)
-            # This is our index file, the main page basically
-            route.index(html_clean_file)
-
-            # Set page size, margins
-            route.margins(
-                PageMarginsType(
-                    top=Measurement(0.1, MeasurementUnitType.Inches),
-                    bottom=Measurement(0.1, MeasurementUnitType.Inches),
-                    left=Measurement(0.1, MeasurementUnitType.Inches),
-                    right=Measurement(0.1, MeasurementUnitType.Inches),
-                ),
-            ).size(A4).scale(1.0)
-
-            try:
-                response = route.run()
-
-            except Exception as err:
-                raise ParseError(
-                    f"Error while converting document to PDF: {err}",
-                ) from err
-
-        html_pdf = tempdir / "html.pdf"
-        html_pdf.write_bytes(response.content)
-        return html_pdf
-
-    def get_settings(self) -> None:
-        """
-        This parser does not implement additional settings yet
-        """
-        return None
--- a/src/paperless_mail/signals.py
+++ b/src/paperless_mail/signals.py
@@ -1,14 +0,0 @@
-def get_parser(*args, **kwargs):
-    from paperless_mail.parsers import MailDocumentParser
-
-    return MailDocumentParser(*args, **kwargs)
-
-
-def mail_consumer_declaration(sender, **kwargs):
-    return {
-        "parser": get_parser,
-        "weight": 20,
-        "mime_types": {
-            "message/rfc822": ".eml",
-        },
-    }
--- a/src/paperless_mail/tests/conftest.py
+++ b/src/paperless_mail/tests/conftest.py
@@ -1,71 +1,9 @@
 from collections.abc import Generator
-from pathlib import Path

 import pytest

 from paperless_mail.mail import MailAccountHandler
 from paperless_mail.models import MailAccount
-from paperless_mail.parsers import MailDocumentParser
-
-
-@pytest.fixture(scope="session")
-def sample_dir() -> Path:
-    return (Path(__file__).parent / Path("samples")).resolve()
-
-
-@pytest.fixture(scope="session")
-def broken_email_file(sample_dir: Path) -> Path:
-    return sample_dir / "broken.eml"
-
-
-@pytest.fixture(scope="session")
-def simple_txt_email_file(sample_dir: Path) -> Path:
-    return sample_dir / "simple_text.eml"
-
-
-@pytest.fixture(scope="session")
-def simple_txt_email_pdf_file(sample_dir: Path) -> Path:
-    return sample_dir / "simple_text.eml.pdf"
-
-
-@pytest.fixture(scope="session")
-def simple_txt_email_thumbnail_file(sample_dir: Path) -> Path:
-    return sample_dir / "simple_text.eml.pdf.webp"
-
-
-@pytest.fixture(scope="session")
-def html_email_file(sample_dir: Path) -> Path:
-    return sample_dir / "html.eml"
-
-
-@pytest.fixture(scope="session")
-def html_email_pdf_file(sample_dir: Path) -> Path:
-    return sample_dir / "html.eml.pdf"
-
-
-@pytest.fixture(scope="session")
-def html_email_thumbnail_file(sample_dir: Path) -> Path:
-    return sample_dir / "html.eml.pdf.webp"
-
-
-@pytest.fixture(scope="session")
-def html_email_html_file(sample_dir: Path) -> Path:
-    return sample_dir / "html.eml.html"
-
-
-@pytest.fixture(scope="session")
-def merged_pdf_first(sample_dir: Path) -> Path:
-    return sample_dir / "first.pdf"
-
-
-@pytest.fixture(scope="session")
-def merged_pdf_second(sample_dir: Path) -> Path:
-    return sample_dir / "second.pdf"
-
-
-@pytest.fixture()
-def mail_parser() -> MailDocumentParser:
-    return MailDocumentParser(logging_group=None)


@pytest.fixture()
@@ -89,11 +27,3 @@ def greenmail_mail_account(db: None) -> Generator[MailAccount, None, None]:
@pytest.fixture()
 def mail_account_handler() -> MailAccountHandler:
    return MailAccountHandler()
-
-
-@pytest.fixture(scope="session")
-def nginx_base_url() -> Generator[str, None, None]:
-    """
-    The base URL for the nginx HTTP server we expect to be alive
-    """
-    yield "http://localhost:8080"
--- a/src/paperless_remote/init.py
+++ b/src/paperless_remote/init.py
@@ -1,4 +0,0 @@
-# this is here so that django finds the checks.
-from paperless_remote.checks import check_remote_parser_configured
-
-__all__ = ["check_remote_parser_configured"]
--- a/src/paperless_remote/apps.py
+++ b/src/paperless_remote/apps.py
@@ -1,14 +0,0 @@
-from django.apps import AppConfig
-
-from paperless_remote.signals import remote_consumer_declaration
-
-
-class PaperlessRemoteParserConfig(AppConfig):
-    name = "paperless_remote"
-
-    def ready(self) -> None:
-        from documents.signals import document_consumer_declaration
-
-        document_consumer_declaration.connect(remote_consumer_declaration)
-
-        AppConfig.ready(self)
--- a/src/paperless_remote/checks.py
+++ b/src/paperless_remote/checks.py
@@ -1,17 +0,0 @@
-from django.conf import settings
-from django.core.checks import Error
-from django.core.checks import register
-
-
-@register()
-def check_remote_parser_configured(app_configs, **kwargs):
-    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
-        settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
-    ):
-        return [
-            Error(
-                "Azure AI remote parser requires endpoint and API key to be configured.",
-            ),
-        ]
-
-    return []
--- a/src/paperless_remote/signals.py
+++ b/src/paperless_remote/signals.py
@@ -1,38 +0,0 @@
-from __future__ import annotations
-
-from typing import Any
-
-
-def get_parser(*args: Any, **kwargs: Any) -> Any:
-    from paperless.parsers.remote import RemoteDocumentParser
-
-    # The new RemoteDocumentParser does not accept the progress_callback
-    # kwarg injected by the old signal-based consumer.  logging_group is
-    # forwarded as a positional arg.
-    # Phase 4 will replace this signal path with the new ParserRegistry.
-    kwargs.pop("progress_callback", None)
-    return RemoteDocumentParser(*args, **kwargs)
-
-
-def get_supported_mime_types() -> dict[str, str]:
-    from django.conf import settings
-
-    from paperless.parsers.remote import RemoteDocumentParser
-    from paperless.parsers.remote import RemoteEngineConfig
-
-    config = RemoteEngineConfig(
-        engine=settings.REMOTE_OCR_ENGINE,
-        api_key=settings.REMOTE_OCR_API_KEY,
-        endpoint=settings.REMOTE_OCR_ENDPOINT,
-    )
-    if not config.engine_is_valid():
-        return {}
-    return RemoteDocumentParser.supported_mime_types()
-
-
-def remote_consumer_declaration(sender: Any, **kwargs: Any) -> dict[str, Any]:
-    return {
-        "parser": get_parser,
-        "weight": 5,
-        "mime_types": get_supported_mime_types(),
-    }
--- a/src/paperless_remote/tests/init.py
+++ b/src/paperless_remote/tests/init.py
--- a/src/paperless_remote/tests/test_checks.py
+++ b/src/paperless_remote/tests/test_checks.py
@@ -1,24 +0,0 @@
-from unittest import TestCase
-
-from django.test import override_settings
-
-from paperless_remote import check_remote_parser_configured
-
-
-class TestChecks(TestCase):
-    @override_settings(REMOTE_OCR_ENGINE=None)
-    def test_no_engine(self) -> None:
-        msgs = check_remote_parser_configured(None)
-        self.assertEqual(len(msgs), 0)
-
-    @override_settings(REMOTE_OCR_ENGINE="azureai")
-    @override_settings(REMOTE_OCR_API_KEY="somekey")
-    @override_settings(REMOTE_OCR_ENDPOINT=None)
-    def test_azure_no_endpoint(self) -> None:
-        msgs = check_remote_parser_configured(None)
-        self.assertEqual(len(msgs), 1)
-        self.assertTrue(
-            msgs[0].msg.startswith(
-                "Azure AI remote parser requires endpoint and API key to be configured.",
-            ),
-        )
--- a/src/paperless_tesseract/init.py
+++ b/src/paperless_tesseract/init.py
@@ -1,5 +0,0 @@
-# this is here so that django finds the checks.
-from paperless_tesseract.checks import check_default_language_available
-from paperless_tesseract.checks import get_tesseract_langs
-
-__all__ = ["check_default_language_available", "get_tesseract_langs"]
--- a/src/paperless_tesseract/apps.py
+++ b/src/paperless_tesseract/apps.py
@@ -1,14 +0,0 @@
-from django.apps import AppConfig
-
-from paperless_tesseract.signals import tesseract_consumer_declaration
-
-
-class PaperlessTesseractConfig(AppConfig):
-    name = "paperless_tesseract"
-
-    def ready(self) -> None:
-        from documents.signals import document_consumer_declaration
-
-        document_consumer_declaration.connect(tesseract_consumer_declaration)
-
-        AppConfig.ready(self)
--- a/src/paperless_tesseract/checks.py
+++ b/src/paperless_tesseract/checks.py
@@ -1,52 +0,0 @@
-import shutil
-import subprocess
-
-from django.conf import settings
-from django.core.checks import Error
-from django.core.checks import Warning
-from django.core.checks import register
-
-
-def get_tesseract_langs():
-    proc = subprocess.run(
-        [shutil.which("tesseract"), "--list-langs"],
-        capture_output=True,
-    )
-
-    # Decode bytes to string, split on newlines, trim out the header
-    proc_lines = proc.stdout.decode("utf8", errors="ignore").strip().split("\n")[1:]
-
-    return [x.strip() for x in proc_lines]
-
-
-@register()
-def check_default_language_available(app_configs, **kwargs):
-    errs = []
-
-    if not settings.OCR_LANGUAGE:
-        errs.append(
-            Warning(
-                "No OCR language has been specified with PAPERLESS_OCR_LANGUAGE. "
-                "This means that tesseract will fallback to english.",
-            ),
-        )
-        return errs
-
-    # binaries_check in paperless will check and report if this doesn't exist
-    # So skip trying to do anything here and let that handle missing binaries
-    if shutil.which("tesseract") is not None:
-        installed_langs = get_tesseract_langs()
-
-        specified_langs = [x.strip() for x in settings.OCR_LANGUAGE.split("+")]
-
-        for lang in specified_langs:
-            if lang not in installed_langs:
-                errs.append(
-                    Error(
-                        f"The selected ocr language {lang} is "
-                        f"not installed. Paperless cannot OCR your documents "
-                        f"without it. Please fix PAPERLESS_OCR_LANGUAGE.",
-                    ),
-                )
-
-    return errs
--- a/src/paperless_tesseract/signals.py
+++ b/src/paperless_tesseract/signals.py
@@ -1,21 +0,0 @@
-def get_parser(*args, **kwargs):
-    from paperless_tesseract.parsers import RasterisedDocumentParser
-
-    return RasterisedDocumentParser(*args, **kwargs)
-
-
-def tesseract_consumer_declaration(sender, **kwargs):
-    return {
-        "parser": get_parser,
-        "weight": 0,
-        "mime_types": {
-            "application/pdf": ".pdf",
-            "image/jpeg": ".jpg",
-            "image/png": ".png",
-            "image/tiff": ".tif",
-            "image/gif": ".gif",
-            "image/bmp": ".bmp",
-            "image/webp": ".webp",
-            "image/heic": ".heic",
-        },
-    }
--- a/src/paperless_tesseract/tests/__init__.py
+++ b/src/paperless_tesseract/tests/__init__.py
--- a/src/paperless_tesseract/tests/samples/simple-digital.pdf
+++ b/src/paperless_tesseract/tests/samples/simple-digital.pdf
--- a/src/paperless_tesseract/tests/test_checks.py
+++ b/src/paperless_tesseract/tests/test_checks.py
@@ -1,67 +0,0 @@
-from unittest import mock
-
-from django.core.checks import ERROR
-from django.test import TestCase
-from django.test import override_settings
-
-from paperless_tesseract import check_default_language_available
-
-
-class TestChecks(TestCase):
-    def test_default_language(self) -> None:
-        check_default_language_available(None)
-
-    @override_settings(OCR_LANGUAGE="")
-    def test_no_language(self) -> None:
-        msgs = check_default_language_available(None)
-        self.assertEqual(len(msgs), 1)
-        self.assertTrue(
-            msgs[0].msg.startswith(
-                "No OCR language has been specified with PAPERLESS_OCR_LANGUAGE",
-            ),
-        )
-
-    @override_settings(OCR_LANGUAGE="ita")
-    @mock.patch("paperless_tesseract.checks.get_tesseract_langs")
-    def test_invalid_language(self, m) -> None:
-        m.return_value = ["deu", "eng"]
-        msgs = check_default_language_available(None)
-        self.assertEqual(len(msgs), 1)
-        self.assertEqual(msgs[0].level, ERROR)
-
-    @override_settings(OCR_LANGUAGE="chi_sim")
-    @mock.patch("paperless_tesseract.checks.get_tesseract_langs")
-    def test_multi_part_language(self, m) -> None:
-        """
-        GIVEN:
-            - An OCR language which is multi part (i.e. chi_sim)
-            - The language is correctly formatted
-        WHEN:
-            - Installed packages are checked
-        THEN:
-            - No errors are reported
-        """
-        m.return_value = ["chi_sim", "eng"]
-
-        msgs = check_default_language_available(None)
-
-        self.assertEqual(len(msgs), 0)
-
-    @override_settings(OCR_LANGUAGE="chi-sim")
-    @mock.patch("paperless_tesseract.checks.get_tesseract_langs")
-    def test_multi_part_language_bad_format(self, m) -> None:
-        """
-        GIVEN:
-            - An OCR language which is multi part (i.e. chi-sim)
-            - The language is NOT correctly formatted
-        WHEN:
-            - Installed packages are checked
-        THEN:
-            - An error is reported
-        """
-        m.return_value = ["chi_sim", "eng"]
-
-        msgs = check_default_language_available(None)
-
-        self.assertEqual(len(msgs), 1)
-        self.assertEqual(msgs[0].level, ERROR)
--- a/src/paperless_tesseract/tests/test_parser.py
+++ b/src/paperless_tesseract/tests/test_parser.py
@@ -1,924 +0,0 @@
-import shutil
-import tempfile
-import unicodedata
-import uuid
-from pathlib import Path
-from unittest import mock
-
-from django.test import TestCase
-from django.test import override_settings
-from ocrmypdf import SubprocessOutputError
-
-from documents.parsers import ParseError
-from documents.parsers import run_convert
-from documents.tests.utils import DirectoriesMixin
-from documents.tests.utils import FileSystemAssertsMixin
-from paperless_tesseract.parsers import RasterisedDocumentParser
-from paperless_tesseract.parsers import post_process_text
-
-
-class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
-    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
-
-    def assertContainsStrings(self, content, strings) -> None:
-        # Asserts that all strings appear in content, in the given order.
-        indices = []
-        for s in strings:
-            if s in content:
-                indices.append(content.index(s))
-            else:
-                self.fail(f"'{s}' is not in '{content}'")
-        self.assertListEqual(indices, sorted(indices))
-
-    def test_post_process_text(self) -> None:
-        text_cases = [
-            ("simple     string", "simple string"),
-            ("simple    newline\n   testing string", "simple newline\ntesting string"),
-            (
-                "utf-8   строка с пробелами в конце  ",
-                "utf-8 строка с пробелами в конце",
-            ),
-        ]
-
-        for source, result in text_cases:
-            actual_result = post_process_text(source)
-            self.assertEqual(
-                result,
-                actual_result,
-                f"post_process_text({source}) != '{result}', but '{actual_result}'",
-            )
-
-    def test_get_text_from_pdf(self) -> None:
-        parser = RasterisedDocumentParser(uuid.uuid4())
-        text = parser.extract_text(
-            None,
-            self.SAMPLE_FILES / "simple-digital.pdf",
-        )
-
-        self.assertContainsStrings(text.strip(), ["This is a test document."])
-
-    def test_get_page_count(self) -> None:
-        """
-        GIVEN:
-            - PDF file with a single page
-            - PDF file with multiple pages
-        WHEN:
-            - The number of pages is requested
-        THEN:
-            - The method returns 1 as the expected number of pages
-            - The method returns the correct number of pages (6)
-        """
-        parser = RasterisedDocumentParser(uuid.uuid4())
-        page_count = parser.get_page_count(
-            str(self.SAMPLE_FILES / "simple-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertEqual(page_count, 1)
-
-        page_count = parser.get_page_count(
-            str(self.SAMPLE_FILES / "multi-page-mixed.pdf"),
-            "application/pdf",
-        )
-        self.assertEqual(page_count, 6)
-
-    def test_get_page_count_password_protected(self) -> None:
-        """
-        GIVEN:
-            - Password protected PDF file
-        WHEN:
-            - The number of pages is requested
-        THEN:
-            - The method returns None
-        """
-        parser = RasterisedDocumentParser(uuid.uuid4())
-        with self.assertLogs("paperless.parsing.tesseract", level="WARNING") as cm:
-            page_count = parser.get_page_count(
-                str(self.SAMPLE_FILES / "password-protected.pdf"),
-                "application/pdf",
-            )
-            self.assertEqual(page_count, None)
-            self.assertIn("Unable to determine PDF page count", cm.output[0])
-
-    def test_thumbnail(self) -> None:
-        parser = RasterisedDocumentParser(uuid.uuid4())
-        thumb = parser.get_thumbnail(
-            str(self.SAMPLE_FILES / "simple-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(thumb)
-
-    @mock.patch("documents.parsers.run_convert")
-    def test_thumbnail_fallback(self, m) -> None:
-        def call_convert(input_file, output_file, **kwargs) -> None:
-            if ".pdf" in str(input_file):
-                raise ParseError("Does not compute.")
-            else:
-                run_convert(input_file=input_file, output_file=output_file, **kwargs)
-
-        m.side_effect = call_convert
-
-        parser = RasterisedDocumentParser(uuid.uuid4())
-        thumb = parser.get_thumbnail(
-            str(self.SAMPLE_FILES / "simple-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(thumb)
-
-    def test_thumbnail_encrypted(self) -> None:
-        parser = RasterisedDocumentParser(uuid.uuid4())
-        thumb = parser.get_thumbnail(
-            str(self.SAMPLE_FILES / "encrypted.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(thumb)
-
-    def test_get_dpi(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        dpi = parser.get_dpi(str(self.SAMPLE_FILES / "simple-no-dpi.png"))
-        self.assertEqual(dpi, None)
-
-        dpi = parser.get_dpi(str(self.SAMPLE_FILES / "simple.png"))
-        self.assertEqual(dpi, 72)
-
-    def test_simple_digital(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(
-            str(self.SAMPLE_FILES / "simple-digital.pdf"),
-            "application/pdf",
-        )
-
-        self.assertIsFile(parser.archive_path)
-
-        self.assertContainsStrings(parser.get_text(), ["This is a test document."])
-
-    def test_with_form(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(
-            str(self.SAMPLE_FILES / "with-form.pdf"),
-            "application/pdf",
-        )
-
-        self.assertIsFile(parser.archive_path)
-
-        self.assertContainsStrings(
-            parser.get_text(),
-            ["Please enter your name in here:", "This is a PDF document with a form."],
-        )
-
-    @override_settings(OCR_MODE="redo")
-    def test_with_form_error(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(
-            str(self.SAMPLE_FILES / "with-form.pdf"),
-            "application/pdf",
-        )
-
-        self.assertIsNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text(),
-            ["Please enter your name in here:", "This is a PDF document with a form."],
-        )
-
-    @override_settings(OCR_MODE="skip")
-    def test_signed(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(str(self.SAMPLE_FILES / "signed.pdf"), "application/pdf")
-
-        self.assertIsNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text(),
-            [
-                "This is a digitally signed PDF, created with Acrobat Pro for the Paperless project to enable",
-                "automated testing of signed/encrypted PDFs",
-            ],
-        )
-
-    @override_settings(OCR_MODE="skip")
-    def test_encrypted(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(
-            str(self.SAMPLE_FILES / "encrypted.pdf"),
-            "application/pdf",
-        )
-
-        self.assertIsNone(parser.archive_path)
-        self.assertEqual(parser.get_text(), "")
-
-    @override_settings(OCR_MODE="redo")
-    def test_with_form_error_notext(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "with-form.pdf"),
-            "application/pdf",
-        )
-
-        self.assertContainsStrings(
-            parser.get_text(),
-            ["Please enter your name in here:", "This is a PDF document with a form."],
-        )
-
-    @override_settings(OCR_MODE="force")
-    def test_with_form_force(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(
-            str(self.SAMPLE_FILES / "with-form.pdf"),
-            "application/pdf",
-        )
-
-        self.assertContainsStrings(
-            parser.get_text(),
-            ["Please enter your name in here:", "This is a PDF document with a form."],
-        )
-
-    def test_image_simple(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(str(self.SAMPLE_FILES / "simple.png"), "image/png")
-
-        self.assertIsFile(parser.archive_path)
-
-        self.assertContainsStrings(parser.get_text(), ["This is a test document."])
-
-    def test_image_simple_alpha(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        with tempfile.TemporaryDirectory() as tempdir:
-            # Copy the sample file to a temp directory, as parsing changes the file
-            # and Git would otherwise see the sample as modified
-            sample_file = self.SAMPLE_FILES / "simple-alpha.png"
-            dest_file = Path(tempdir) / "simple-alpha.png"
-            shutil.copy(sample_file, dest_file)
-
-            parser.parse(str(dest_file), "image/png")
-
-            self.assertIsFile(parser.archive_path)
-
-            self.assertContainsStrings(parser.get_text(), ["This is a test document."])
-
-    def test_image_calc_a4_dpi(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        dpi = parser.calculate_a4_dpi(
-            str(self.SAMPLE_FILES / "simple-no-dpi.png"),
-        )
-
-        self.assertEqual(dpi, 62)
-
-    @mock.patch("paperless_tesseract.parsers.RasterisedDocumentParser.calculate_a4_dpi")
-    def test_image_dpi_fail(self, m) -> None:
-        m.return_value = None
-        parser = RasterisedDocumentParser(None)
-
-        def f() -> None:
-            parser.parse(
-                str(self.SAMPLE_FILES / "simple-no-dpi.png"),
-                "image/png",
-            )
-
-        self.assertRaises(ParseError, f)
-
-    @override_settings(OCR_IMAGE_DPI=72, MAX_IMAGE_PIXELS=0)
-    def test_image_no_dpi_default(self) -> None:
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(str(self.SAMPLE_FILES / "simple-no-dpi.png"), "image/png")
-
-        self.assertIsFile(parser.archive_path)
-
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["this is a test document."],
-        )
-
-    def test_multi_page(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_PAGES=2, OCR_MODE="skip")
-    def test_multi_page_pages_skip(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_PAGES=2, OCR_MODE="redo")
-    def test_multi_page_pages_redo(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_PAGES=2, OCR_MODE="force")
-    def test_multi_page_pages_force(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_MODE="skip")
-    def test_multi_page_analog_pages_skip(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_PAGES=2, OCR_MODE="redo")
-    def test_multi_page_analog_pages_redo(self) -> None:
-        """
-        GIVEN:
-            - File with text contained in images but no text layer
-            - OCR of only pages 1 and 2 requested
-            - OCR mode set to redo
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text of page 1 and 2 extracted
-            - An archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(parser.get_text().lower(), ["page 1", "page 2"])
-        self.assertNotIn("page 3", parser.get_text().lower())
-
-    @override_settings(OCR_PAGES=1, OCR_MODE="force")
-    def test_multi_page_analog_pages_force(self) -> None:
-        """
-        GIVEN:
-            - File with text contained in images but no text layer
-            - OCR of only page 1 requested
-            - OCR mode set to force
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Only text of page 1 is extracted
-            - An archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
-            "application/pdf",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(parser.get_text().lower(), ["page 1"])
-        self.assertNotIn("page 2", parser.get_text().lower())
-        self.assertNotIn("page 3", parser.get_text().lower())
-
-    @override_settings(OCR_MODE="skip_noarchive")
-    def test_skip_noarchive_withtext(self) -> None:
-        """
-        GIVEN:
-            - File with existing text layer
-            - OCR mode set to skip_noarchive
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from text layer is extracted
-            - No archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_MODE="skip_noarchive")
-    def test_skip_noarchive_notext(self) -> None:
-        """
-        GIVEN:
-            - File with text contained in images but no text layer
-            - OCR mode set to skip_noarchive
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from images is extracted
-            - An archive file is created with the OCRd text
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
-            "application/pdf",
-        )
-
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-        self.assertIsNotNone(parser.archive_path)
-
-    @override_settings(OCR_SKIP_ARCHIVE_FILE="never")
-    def test_skip_archive_never_withtext(self) -> None:
-        """
-        GIVEN:
-            - File with existing text layer
-            - OCR_SKIP_ARCHIVE_FILE set to never
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from text layer is extracted
-            - Archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNotNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_SKIP_ARCHIVE_FILE="never")
-    def test_skip_archive_never_withimages(self) -> None:
-        """
-        GIVEN:
-            - File with text contained in images but no text layer
-            - OCR_SKIP_ARCHIVE_FILE set to never
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from images is extracted
-            - Archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNotNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_SKIP_ARCHIVE_FILE="with_text")
-    def test_skip_archive_withtext_withtext(self) -> None:
-        """
-        GIVEN:
-            - File with existing text layer
-            - OCR_SKIP_ARCHIVE_FILE set to with_text
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from text layer is extracted
-            - No archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_SKIP_ARCHIVE_FILE="with_text")
-    def test_skip_archive_withtext_withimages(self) -> None:
-        """
-        GIVEN:
-            - File with text contained in images but no text layer
-            - OCR_SKIP_ARCHIVE_FILE set to with_text
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from images is extracted
-            - Archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNotNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_SKIP_ARCHIVE_FILE="always")
-    def test_skip_archive_always_withtext(self) -> None:
-        """
-        GIVEN:
-            - File with existing text layer
-            - OCR_SKIP_ARCHIVE_FILE set to always
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from text layer is extracted
-            - No archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_SKIP_ARCHIVE_FILE="always")
-    def test_skip_archive_always_withimages(self) -> None:
-        """
-        GIVEN:
-            - File with text contained in images but no text layer
-            - OCR_SKIP_ARCHIVE_FILE set to always
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from images is extracted
-            - No archive file is created
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    @override_settings(OCR_MODE="skip")
-    def test_multi_page_mixed(self) -> None:
-        """
-        GIVEN:
-            - File with some text contained in images and some in text layer
-            - OCR mode set to skip
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from images is extracted
-            - An archive file is created with the OCRd text and the original text
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-mixed.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNotNone(parser.archive_path)
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3", "page 4", "page 5", "page 6"],
-        )
-
-        with (parser.tempdir / "sidecar.txt").open() as f:
-            sidecar = f.read()
-
-        self.assertIn("[OCR skipped on page(s) 4-6]", sidecar)
-
-    @override_settings(OCR_MODE="redo")
-    def test_single_page_mixed(self) -> None:
-        """
-        GIVEN:
-            - File with some text contained in images and some in text layer
-            - Text and images are mixed on the same page
-            - OCR mode set to redo
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from images is extracted
-            - Full content of the file is parsed (not just the image text)
-            - An archive file is created with the OCRd text and the original text
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "single-page-mixed.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNotNone(parser.archive_path)
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            [
-                "this is some normal text, present on page 1 of the document.",
-                "this is some text, but in an image, also on page 1.",
-                "this is further text on page 1.",
-            ],
-        )
-
-        with (parser.tempdir / "sidecar.txt").open() as f:
-            sidecar = f.read().lower()
-
-        self.assertIn("this is some text, but in an image, also on page 1.", sidecar)
-        self.assertNotIn(
-            "this is some normal text, present on page 1 of the document.",
-            sidecar,
-        )
-
-    @override_settings(OCR_MODE="skip_noarchive")
-    def test_multi_page_mixed_no_archive(self) -> None:
-        """
-        GIVEN:
-            - File with some text contained in images and some in text layer
-            - OCR mode set to skip_noarchive
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from images is extracted
-            - No archive file is created as original file contains text
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-mixed.pdf"),
-            "application/pdf",
-        )
-        self.assertIsNone(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 4", "page 5", "page 6"],
-        )
-
-    @override_settings(OCR_MODE="skip", OCR_ROTATE_PAGES=True)
-    def test_rotate(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(str(self.SAMPLE_FILES / "rotated.pdf"), "application/pdf")
-        self.assertContainsStrings(
-            parser.get_text(),
-            [
-                "This is the text that appears on the first page. It’s a lot of text.",
-                "Even if the pages are rotated, OCRmyPDF still gets the job done.",
-                "This is a really weird file with lots of nonsense text.",
-                "If you read this, it’s your own fault. Also check your screen orientation.",
-            ],
-        )
-
-    def test_multi_page_tiff(self) -> None:
-        """
-        GIVEN:
-            - Multi-page TIFF image
-        WHEN:
-            - Image is parsed
-        THEN:
-            - Text from all pages extracted
-        """
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "multi-page-images.tiff"),
-            "image/tiff",
-        )
-        self.assertIsFile(parser.archive_path)
-        self.assertContainsStrings(
-            parser.get_text().lower(),
-            ["page 1", "page 2", "page 3"],
-        )
-
-    def test_multi_page_tiff_alpha(self) -> None:
-        """
-        GIVEN:
-            - Multi-page TIFF image
-            - Image includes an alpha channel
-        WHEN:
-            - Image is parsed
-        THEN:
-            - Text from all pages extracted
-        """
-        parser = RasterisedDocumentParser(None)
-        sample_file = self.SAMPLE_FILES / "multi-page-images-alpha.tiff"
-        with tempfile.NamedTemporaryFile() as tmp_file:
-            shutil.copy(sample_file, tmp_file.name)
-            parser.parse(
-                tmp_file.name,
-                "image/tiff",
-            )
-            self.assertIsFile(parser.archive_path)
-            self.assertContainsStrings(
-                parser.get_text().lower(),
-                ["page 1", "page 2", "page 3"],
-            )
-
-    def test_multi_page_tiff_alpha_srgb(self) -> None:
-        """
-        GIVEN:
-            - Multi-page TIFF image
-            - Image includes an alpha channel
-            - Image is in the sRGB colorspace
-        WHEN:
-            - Image is parsed
-        THEN:
-            - Text from all pages extracted
-        """
-        parser = RasterisedDocumentParser(None)
-        sample_file = str(
-            self.SAMPLE_FILES / "multi-page-images-alpha-rgb.tiff",
-        )
-        with tempfile.NamedTemporaryFile() as tmp_file:
-            shutil.copy(sample_file, tmp_file.name)
-            parser.parse(
-                tmp_file.name,
-                "image/tiff",
-            )
-            self.assertIsFile(parser.archive_path)
-            self.assertContainsStrings(
-                parser.get_text().lower(),
-                ["page 1", "page 2", "page 3"],
-            )
-
-    def test_ocrmypdf_parameters(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        params = parser.construct_ocrmypdf_parameters(
-            input_file="input.pdf",
-            output_file="output.pdf",
-            sidecar_file="sidecar.txt",
-            mime_type="application/pdf",
-            safe_fallback=False,
-        )
-
-        self.assertEqual(params["input_file_or_options"], "input.pdf")
-        self.assertEqual(params["output_file"], "output.pdf")
-        self.assertEqual(params["sidecar"], "sidecar.txt")
-
-        with override_settings(OCR_CLEAN="none"):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertNotIn("clean", params)
-            self.assertNotIn("clean_final", params)
-
-        with override_settings(OCR_CLEAN="clean"):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertTrue(params["clean"])
-            self.assertNotIn("clean_final", params)
-
-        with override_settings(OCR_CLEAN="clean-final", OCR_MODE="skip"):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertTrue(params["clean_final"])
-            self.assertNotIn("clean", params)
-
-        with override_settings(OCR_CLEAN="clean-final", OCR_MODE="redo"):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertTrue(params["clean"])
-            self.assertNotIn("clean_final", params)
-
-        with override_settings(OCR_DESKEW=True, OCR_MODE="skip"):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertTrue(params["deskew"])
-
-        with override_settings(OCR_DESKEW=True, OCR_MODE="redo"):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertNotIn("deskew", params)
-
-        with override_settings(OCR_DESKEW=False, OCR_MODE="skip"):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertNotIn("deskew", params)
-
-        with override_settings(OCR_MAX_IMAGE_PIXELS=1_000_001.0):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertIn("max_image_mpixels", params)
-            self.assertAlmostEqual(params["max_image_mpixels"], 1, places=4)
-
-        with override_settings(OCR_MAX_IMAGE_PIXELS=-1_000_001.0):
-            parser = RasterisedDocumentParser(None)
-            params = parser.construct_ocrmypdf_parameters("", "", "", "")
-            self.assertNotIn("max_image_mpixels", params)
-
-    def test_rtl_language_detection(self) -> None:
-        """
-        GIVEN:
-            - File with text in an RTL language
-        WHEN:
-            - Document is parsed
-        THEN:
-            - Text from the document is extracted
-        """
-        parser = RasterisedDocumentParser(None)
-
-        parser.parse(
-            str(self.SAMPLE_FILES / "rtl-test.pdf"),
-            "application/pdf",
-        )
-
-        # OCR output for RTL text varies across platforms/versions due to
-        # bidi controls and presentation forms; normalize before assertion.
-        normalized_text = "".join(
-            char
-            for char in unicodedata.normalize("NFKC", parser.get_text())
-            if unicodedata.category(char) != "Cf" and not char.isspace()
-        )
-
-        self.assertIn("ةرازو", normalized_text)
-        self.assertTrue(
-            any(token in normalized_text for token in ("ةیلخادلا", "الاخليد")),
-        )
-
-    @mock.patch("ocrmypdf.ocr")
-    def test_gs_rendering_error(self, m) -> None:
-        m.side_effect = SubprocessOutputError("Ghostscript PDF/A rendering failed")
-        parser = RasterisedDocumentParser(None)
-
-        self.assertRaises(
-            ParseError,
-            parser.parse,
-            str(self.SAMPLE_FILES / "simple-digital.pdf"),
-            "application/pdf",
-        )
-
-
-class TestParserFileTypes(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
-    SAMPLE_FILES = Path(__file__).parent / "samples"
-
-    def test_bmp(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(str(self.SAMPLE_FILES / "simple.bmp"), "image/bmp")
-        self.assertIsFile(parser.archive_path)
-        self.assertIn("this is a test document", parser.get_text().lower())
-
-    def test_jpg(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(str(self.SAMPLE_FILES / "simple.jpg"), "image/jpeg")
-        self.assertIsFile(parser.archive_path)
-        self.assertIn("this is a test document", parser.get_text().lower())
-
-    def test_heic(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(str(self.SAMPLE_FILES / "simple.heic"), "image/heic")
-        self.assertIsFile(parser.archive_path)
-        self.assertIn("pizza", parser.get_text().lower())
-
-    @override_settings(OCR_IMAGE_DPI=200)
-    def test_gif(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(str(self.SAMPLE_FILES / "simple.gif"), "image/gif")
-        self.assertIsFile(parser.archive_path)
-        self.assertIn("this is a test document", parser.get_text().lower())
-
-    def test_tiff(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(str(self.SAMPLE_FILES / "simple.tif"), "image/tiff")
-        self.assertIsFile(parser.archive_path)
-        self.assertIn("this is a test document", parser.get_text().lower())
-
-    @override_settings(OCR_IMAGE_DPI=72)
-    def test_webp(self) -> None:
-        parser = RasterisedDocumentParser(None)
-        parser.parse(
-            str(self.SAMPLE_FILES / "document.webp"),
-            "image/webp",
-        )
-        self.assertIsFile(parser.archive_path)
-        # Older tesseracts consistently mangle the space between "a webp",
-        # tesseract 5.3.0 seems to do a better job, so we're accepting both
-        self.assertRegex(
-            parser.get_text().lower(),
-            r"this is a ?webp document, created 11/14/2022.",
-        )
--- a/src/paperless_text/__init__.py
+++ b/src/paperless_text/__init__.py
--- a/src/paperless_text/apps.py
+++ b/src/paperless_text/apps.py
@@ -1,14 +0,0 @@
-from django.apps import AppConfig
-
-from paperless_text.signals import text_consumer_declaration
-
-
-class PaperlessTextConfig(AppConfig):
-    name = "paperless_text"
-
-    def ready(self) -> None:
-        from documents.signals import document_consumer_declaration
-
-        document_consumer_declaration.connect(text_consumer_declaration)
-
-        AppConfig.ready(self)
--- a/src/paperless_text/signals.py
+++ b/src/paperless_text/signals.py
@@ -1,29 +0,0 @@
-from __future__ import annotations
-
-from typing import Any
-
-
-def get_parser(*args: Any, **kwargs: Any) -> Any:
-    from paperless.parsers.text import TextDocumentParser
-
-    # TextDocumentParser accepts logging_group for constructor compatibility but
-    # does not store or use it (no legacy DocumentParser base class).
-    # progress_callback is also not used.  Both may arrive as a positional arg
-    # (consumer) or a keyword arg (views); *args absorbs the positional form,
-    # kwargs.pop handles the keyword form.  Phase 4 will replace this signal
-    # path with the new ParserRegistry so the shim can be removed at that point.
-    kwargs.pop("logging_group", None)
-    kwargs.pop("progress_callback", None)
-    return TextDocumentParser(*args, **kwargs)
-
-
-def text_consumer_declaration(sender: Any, **kwargs: Any) -> dict[str, Any]:
-    return {
-        "parser": get_parser,
-        "weight": 10,
-        "mime_types": {
-            "text/plain": ".txt",
-            "text/csv": ".csv",
-            "application/csv": ".csv",
-        },
-    }
--- a/src/paperless_text/tests/__init__.py
+++ b/src/paperless_text/tests/__init__.py
--- a/src/paperless_tika/__init__.py
+++ b/src/paperless_tika/__init__.py
--- a/src/paperless_tika/apps.py
+++ b/src/paperless_tika/apps.py
@@ -1,15 +0,0 @@
-from django.apps import AppConfig
-from django.conf import settings
-
-from paperless_tika.signals import tika_consumer_declaration
-
-
-class PaperlessTikaConfig(AppConfig):
-    name = "paperless_tika"
-
-    def ready(self) -> None:
-        from documents.signals import document_consumer_declaration
-
-        if settings.TIKA_ENABLED:
-            document_consumer_declaration.connect(tika_consumer_declaration)
-        AppConfig.ready(self)
--- a/src/paperless_tika/signals.py
+++ b/src/paperless_tika/signals.py
@@ -1,33 +0,0 @@
-def get_parser(*args, **kwargs):
-    from paperless.parsers.tika import TikaDocumentParser
-
-    # TikaDocumentParser accepts logging_group for constructor compatibility but
-    # does not store or use it (no legacy DocumentParser base class).
-    # progress_callback is also not used.  Both may arrive as a positional arg
-    # (consumer) or a keyword arg (views); *args absorbs the positional form,
-    # kwargs.pop handles the keyword form.  Phase 4 will replace this signal
-    # path with the new ParserRegistry so the shim can be removed at that point.
-    kwargs.pop("logging_group", None)
-    kwargs.pop("progress_callback", None)
-    return TikaDocumentParser()
-
-
-def tika_consumer_declaration(sender, **kwargs):
-    return {
-        "parser": get_parser,
-        "weight": 10,
-        "mime_types": {
-            "application/msword": ".doc",
-            "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
-            "application/vnd.ms-excel": ".xls",
-            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
-            "application/vnd.ms-powerpoint": ".ppt",
-            "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
-            "application/vnd.openxmlformats-officedocument.presentationml.slideshow": ".ppsx",
-            "application/vnd.oasis.opendocument.presentation": ".odp",
-            "application/vnd.oasis.opendocument.spreadsheet": ".ods",
-            "application/vnd.oasis.opendocument.text": ".odt",
-            "application/vnd.oasis.opendocument.graphics": ".odg",
-            "text/rtf": ".rtf",
-        },
-    }
Author	SHA1	Message	Date
Trenton H	f1fecfc2aa	Moves the date parsing plugin section under the extending section	2026-03-20 15:05:13 -07:00
Trenton H	dd01f5b263	Adds a section about how the 2 install types can add external plugins	2026-03-20 14:55:17 -07:00
Trenton H	4fd6963d27	Initial documentation updates for developing a plugin	2026-03-20 14:48:45 -07:00
Trenton H	3c60003635	Fixes a race condition where webserver threads could race to populate the registry	2026-03-20 14:23:30 -07:00
Trenton H	854406c118	Cleans up the duplicate test file/fixture	2026-03-20 13:54:09 -07:00
Trenton H	eb3401725c	refactor: remove automatic log_summary() call from get_parser_registry() The summary was logged once per process, causing it to appear repeatedly during Docker startup (management commands, web server, each Celery worker subprocess). External parsers are already announced individually at INFO when discovered; the full summary is redundant noise. log_summary() is retained on ParserRegistry for manual/debug use. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 13:17:02 -07:00
Trenton H	36ce9218ec	Adds a comment to satisfy Sonar	2026-03-20 12:58:17 -07:00
Trenton H	a806280c1b	Moves the checks and tests to the main application and removes the old applications	2026-03-20 12:47:08 -07:00
Trenton H	2c1690c891	refactor: remove empty paperless_text and paperless_tika Django apps After parser classes were moved to paperless/parsers/ in the plugin refactor, these Django apps contained only empty AppConfig classes with no models, views, tasks, migrations, or other functionality. - Remove paperless_text and paperless_tika from INSTALLED_APPS - Delete empty app directories entirely - Update pyproject.toml test exclusions - Clean stale mypy baseline entries for moved parser files paperless_remote app is retained as it contains meaningful system checks for Azure AI configuration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 12:47:06 -07:00
Trenton H	6640968064	refactor: remove document_consumer_declaration signal infrastructure Remove the document_consumer_declaration signal that was previously used for parser registration. Each parser app no longer connects to this signal, and the signal declaration itself has been removed from documents/signals. Changes: - Remove document_consumer_declaration from documents/signals/__init__.py - Remove ready() methods and signal imports from all parser app configs - Delete signal shim files (signals.py) from all parser apps: - paperless_tesseract/signals.py - paperless_text/signals.py - paperless_tika/signals.py - paperless_mail/signals.py - paperless_remote/signals.py Parser discovery now happens exclusively through the ParserRegistry system introduced in the previous refactor phases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 12:47:03 -07:00
Trenton H	f86ddcf221	refactor: drop get_parser_class_for_mime_type; callers use registry directly All callers now call get_parser_registry().get_parser_for_file() with the actual filename and path, enabling score() to use file extension hints. The MIME-only helper is removed. - consumer.py: passes self.filename + self.working_copy - tasks.py: passes document.original_filename + document.source_path - document_thumbnails.py: same pattern - views.py: passes Path(file).name + Path(file) - parsers.py: internal helpers inline the registry call with filename="" - test_parsers.py: drop TestParserDiscovery (was testing mock behavior); TestParserAvailability uses registry directly - test_consumer.py: mocks switch to documents.consumer.get_parser_registry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 12:46:59 -07:00
Trenton H	c094e91567	refactor: switch consumer and callers to ParserRegistry (Phase 4) Replace all Django signal-based parser discovery with direct registry calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all old-style isinstance checks. All parser instantiation now uses the `with parser_class() as parser:` context manager pattern. - documents/parsers.py: delegate to get_parser_registry(); drop lru_cache - documents/consumer.py: use registry + context manager; remove shims - documents/tasks.py: same pattern - documents/management/commands/document_thumbnails.py: same pattern - documents/views.py: get_metadata uses context manager - documents/checks.py: use get_parser_registry().all_parsers() - paperless/parsers/registry.py: add all_parsers() public method - tests: update mocks to target documents.consumer.get_parser_class_for_mime_type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 12:46:55 -07:00
Trenton H	a9756f9462	Chore: Convert Tesseract parser to plugin style (#12403 ) * Move tesseract parser, tests, and samples to paperless.parsers Relocates files in preparation for the Phase 3 Protocol-based parser refactor, preserving full git history via rename. - src/paperless_tesseract/parsers.py -> src/paperless/parsers/tesseract.py - src/paperless_tesseract/tests/test_parser.py -> src/paperless/tests/parsers/test_tesseract_parser.py - src/paperless_tesseract/tests/test_parser_custom_settings.py -> src/paperless/tests/parsers/test_tesseract_custom_settings.py - src/paperless_tesseract/tests/samples/* -> src/paperless/tests/samples/tesseract/ - Moves RUF001 suppression from broad per-file pyproject.toml ignore to inline noqa comments on the two affected lines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Refactor RasterisedDocumentParser to ParserProtocol interface - Add RasterisedDocumentParser to registry.register_defaults() - Update parser class: remove DocumentParser inheritance, add Protocol class attrs/classmethods/properties, context-manager lifecycle - Add read_file_handle_unicode_errors() to shared parsers/utils.py - Replace inline unicode-error-handling with shared utility call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Update tesseract signals.py to import from new parser location RasterisedDocumentParser moved to paperless.parsers.tesseract; update the lazy import in signals.get_parser so the signal-based consumer declaration continues to work during the registry transition. Pop logging_group and progress_callback kwargs for constructor compatibility. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * tests: rewrite test_tesseract_parser to pytest style with typed fixtures - Converts all tests from Django TestCase to pytest-style classes - Adds tesseract_samples_dir, null_app_config, tesseract_parser, and make_tesseract_parser fixtures in conftest.py; all DB-free except TestOcrmypdfParameters which uses @pytest.mark.django_db - Defines MakeTesseractParser type alias in conftest.py for autocomplete - Fixes FBT001 (boolean positional args) by making bool params keyword-only with * separator in parametrize test signatures - Adds type annotations to all fixture parameters for IDE support - Uses pytest.param(..., id="...") throughout; pytest-mock for patching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(types): fully annotate paperless/parsers/tesseract.py Fixes all mypy and pyrefly errors in the new parser file: - Add missing type annotations to is_image, has_alpha, get_dpi, calculate_a4_dpi, construct_ocrmypdf_parameters, post_process_text - Narrow Path-only (no str) for image helper args; convert to str when building list[str] args for run_subprocess - Annotate ocrmypdf_args as dict[str, Any] so operator expressions on its values type-check and ocrmypdf.ocr(*args) resolves cleanly - Declare text: str \| None = None at top of extract_text to unify all assignments to the same type across both branches - Import Any from typing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Fixes isort * fix: add RasterisedDocumentParser to new-style parser shim checks The new RasterisedDocumentParser uses __enter__/__exit__ for resource management instead of cleanup(). 
Update all existing new-style shims to include it in the isinstance checks: - documents/consumer.py: _parser_cleanup(), parser_is_new_style - documents/tasks.py: parser_is_new_style, finally cleanup branch (also adds RemoteDocumentParser which was missing from the latter) - documents/management/commands/document_thumbnails.py: adds new-style handling from scratch (enter/exit + 2-arg get_thumbnail signature) Fix stale import paths in three test files that were still importing from paperless_tesseract.parsers instead of paperless.parsers.tesseract. Fix two registry tests that used application/pdf as a proxy for "no handler" — now that RasterisedDocumentParser is registered, PDF always has a handler, so switch to a truly unsupported MIME type. Signal infrastructure and shims remain intact; this is plumbing only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * One missed import (cherry pick?) * Adds a no cover for a special case of handling unicode errors in PDF metadata --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 12:46:07 -07:00
Trenton H	c2b8b22fb4	Chore: Convert mail parser to plugin style (#12397 ) * Refactor(mail): rename paperless_mail/parsers.py → paperless/parsers/mail.py Preserve git history for MailDocumentParser by committing the rename separately before editing, following the project convention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Refactor(mail): move mail parser tests to paperless/tests/parsers/ Move test_parsers.py → test_mail_parser.py and test_parsers_live.py → test_mail_parser_live.py alongside the other built-in parser tests, preserving git history before editing. Update MailDocumentParser import to the new canonical location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore: move mail parser sample files to paperless/tests/samples/mail/ Relocate all mail test fixtures from src/paperless_mail/tests/samples/ to src/paperless/tests/samples/mail/ ahead of the parser plugin refactor. Add the new path to the codespell skip list to prevent false-positive spell corrections in binary/fixture email files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Feat(tests): add mail parser fixtures to paperless/tests/parsers/conftest.py Add mail_samples_dir, per-file sample fixtures, and mail_parser (context-manager style) to mirror the old paperless_mail conftest but rooted at the new samples/mail/ location. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Feat(parsers): migrate MailDocumentParser to ParserProtocol Move the mail parser from paperless_mail/parsers.py to paperless/parsers/mail.py and refactor it to implement ParserProtocol: - Class-level name/version/author/url attributes - supported_mime_types() and score() classmethods (score=20) - can_produce_archive=False, requires_pdf_rendition=True - Context manager lifecycle (__enter__/__exit__) - New parse() signature without mailrule_id kwarg; consumer sets parser.mailrule_id before calling parse() instead - get_text()/get_date()/get_archive_path() accessor methods - extract_metadata() returning email headers and attachment info Register MailDocumentParser in the ParserRegistry alongside Text and Tika parsers. Update consumer, signals, and all import sites to use the new location. Update tests to use the new accessor API, patch paths, and context-manager fixture. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix(parsers): pop legacy constructor args in mail signal wrapper MailDocumentParser.__init__ takes no constructor args in the new protocol. Update the get_parser() signal wrapper to pop logging_group and progress_callback (passed by the legacy consumer dispatch path) before instantiating — the same pattern used by TextDocumentParser. Also update test_mail_parser_receives_mailrule to use the real signal wrapper (mail_get_parser) instead of MailDocumentParser directly, so the test exercises the actual dispatch path and matches the new parse() call signature (no mailrule kwarg). 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Bumps this so we can run * Fixes location of the fixture * Removes fixtures which were duplicated * Feat(parsers): add ParserContext and configure() to ParserProtocol Replace the ad-hoc mailrule_id attribute assignment with a typed, immutable ParserContext dataclass and a configure() method on the Protocol: - ParserContext(frozen=True, slots=True) lives in paperless/parsers/ alongside ParserProtocol and MetadataEntry; currently carries only mailrule_id but is designed to grow with output_type, ocr_mode, and ocr_language in a future phase (decoupling parsers from settings.) - ParserProtocol.configure(context: ParserContext) -> None is the extension point; no-op by default - MailDocumentParser.configure() reads mailrule_id into _mailrule_id - TextDocumentParser and TikaDocumentParser implement a no-op configure() - Consumer calls document_parser.configure(ParserContext(...)) before parse(), replacing the isinstance(parser, MailDocumentParser) guard and the direct attribute mutation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Feat(parsers): call configure(ParserContext()) in update_document task Apply the same new-style parser shim pattern as the consumer to update_document_content_maybe_archive_file: - Call __enter__ for Text/Tika parsers after instantiation - Call configure(ParserContext()) before parse() for all new-style parsers (mailrule_id is not available here — this is a re-process of an existing document, so the default empty context is correct) - Call parse(path, mime_type) with 2 args for new-style parsers - Call get_thumbnail(path, mime_type) with 2 args for new-style parsers - Call __exit__ instead of cleanup() in the finally block Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix(tests): add configure() to DummyParser and missing-method parametrize ParserProtocol now requires configure(context: ParserContext) -> None. 
Update DummyParser in test_registry.py to implement it, and add 'missing-configure' to the test_partial_compliant_fails_isinstance parametrize list so the new method is covered by the negative test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Cleans up the reprocess task and generally reduces duplicate of classes * Corrects the score return * Updates so we can report a page count for these parsers, assuming we do have an archive produced when called * Increases test coverage * One more coverage * Updates typing * Updates typing --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 09:22:18 -07:00