fix: add RasterisedDocumentParser to new-style parser shim checks

The new RasterisedDocumentParser uses __enter__/__exit__ for resource
management instead of cleanup(). Update all existing new-style shims to
include it in the isinstance checks:

- documents/consumer.py: _parser_cleanup(), parser_is_new_style
- documents/tasks.py: parser_is_new_style, finally cleanup branch
  (also adds RemoteDocumentParser which was missing from the latter)
- documents/management/commands/document_thumbnails.py: adds new-style
  handling from scratch (enter/exit + 2-arg get_thumbnail signature)

Fix stale import paths in three test files that were still importing
from paperless_tesseract.parsers instead of paperless.parsers.tesseract.

Fix two registry tests that used application/pdf as a proxy for "no
handler": now that RasterisedDocumentParser is registered, PDF always
has a handler, so switch to a truly unsupported MIME type.

Signal infrastructure and shims remain intact; this is plumbing only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(types): fully annotate paperless/parsers/tesseract.py
2026-03-20 16:05:56 +00:00 · 2026-03-19 14:54:34 -07:00 · 2026-03-19 14:19:22 -07:00 · 2026-03-19 13:51:34 -07:00 · 2026-03-19 13:04:53 -07:00 · 2026-03-19 13:02:43 -07:00
65 changed files with 3107 additions and 1641 deletions
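The shim pattern described in the commit message can be sketched in isolation. This is a minimal illustration, not the project's code: `LegacyParser`, `NewStyleParser`, `NEW_STYLE_PARSERS` and `run_parser` are hypothetical stand-ins, and the context is passed as a plain dict rather than a real `ParserContext`.

```python
from __future__ import annotations

from typing import Any


class LegacyParser:
    """Hypothetical old-style parser: 3-argument parse() and cleanup()."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def parse(self, path: str, mime_type: str, file_name: str | None = None) -> None:
        self.events.append("parse")

    def cleanup(self) -> None:
        self.events.append("cleanup")


class NewStyleParser:
    """Hypothetical new-style parser: context manager plus configure()."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def __enter__(self) -> NewStyleParser:
        self.events.append("enter")
        return self

    def __exit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
        self.events.append("exit")

    def configure(self, context: dict[str, Any]) -> None:
        self.events.append("configure")

    def parse(self, path: str, mime_type: str) -> None:
        self.events.append("parse")


# Stand-in for the tuple of migrated parser classes used in the isinstance checks.
NEW_STYLE_PARSERS = (NewStyleParser,)


def run_parser(parser: Any, path: str, mime_type: str, file_name: str) -> list[str]:
    """Drive either parser flavour and always release its resources."""
    is_new_style = isinstance(parser, NEW_STYLE_PARSERS)
    if is_new_style:
        parser.__enter__()
    try:
        if is_new_style:
            # New protocol: configure first, then the 2-argument parse().
            parser.configure({"mailrule_id": None})
            parser.parse(path, mime_type)
        else:
            # Legacy protocol: 3-argument parse().
            parser.parse(path, mime_type, file_name)
    finally:
        # Mirror the enter branch so cleanup matches the parser flavour.
        if is_new_style:
            parser.__exit__(None, None, None)
        else:
            parser.cleanup()
    return parser.events
```

Computing the flag once and reusing it in the `finally` block keeps the enter and exit branches in step, which appears to be why the diff introduces `parser_is_new_style`.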
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -248,15 +248,13 @@ lint.per-file-ignores."docker/wait-for-redis.py" = [
 lint.per-file-ignores."src/documents/models.py" = [
  "SIM115",
 ]
-lint.per-file-ignores."src/paperless_tesseract/tests/test_parser.py" = [
-  "RUF001",
-]
+
 lint.isort.force-single-line = true
 [tool.codespell]
 write-changes = true
 ignore-words-list = "criterias,afterall,valeu,ureue,equest,ure,assertIn,Oktober,commitish"
-skip = "src-ui/src/locale/*,src-ui/pnpm-lock.yaml,src-ui/e2e/*,src/paperless_mail/tests/samples/*,src/documents/tests/samples/*,*.po,*.json"
+skip = "src-ui/src/locale/*,src-ui/pnpm-lock.yaml,src-ui/e2e/*,src/paperless_mail/tests/samples/*,src/paperless/tests/samples/mail/*,src/documents/tests/samples/*,*.po,*.json"
 [tool.pytest]
 minversion = "9.0"
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -51,10 +51,12 @@ from documents.templating.workflows import parse_w_workflow_placeholders
 from documents.utils import copy_basic_file_stats
 from documents.utils import copy_file_with_basic_stats
 from documents.utils import run_subprocess
+from paperless.parsers import ParserContext
+from paperless.parsers.mail import MailDocumentParser
 from paperless.parsers.remote import RemoteDocumentParser
+from paperless.parsers.tesseract import RasterisedDocumentParser
 from paperless.parsers.text import TextDocumentParser
 from paperless.parsers.tika import TikaDocumentParser
-from paperless_mail.parsers import MailDocumentParser
 LOGGING_NAME: Final[str] = "paperless.consumer"
@@ -71,7 +73,13 @@ def _parser_cleanup(parser: DocumentParser) -> None:
    """
    if isinstance(
        parser,
-        (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
+        (
+            MailDocumentParser,
+            RasterisedDocumentParser,
+            RemoteDocumentParser,
+            TextDocumentParser,
+            TikaDocumentParser,
+        ),
    ):
        parser.__exit__(None, None, None)
    else:
@@ -453,13 +461,21 @@ class ConsumerPlugin(
            progress_callback=progress_callback,
        )
+        parser_is_new_style = isinstance(
+            document_parser,
+            (
+                MailDocumentParser,
+                RasterisedDocumentParser,
+                RemoteDocumentParser,
+                TextDocumentParser,
+                TikaDocumentParser,
+            ),
+        )
        # New-style parsers use __enter__/__exit__ for resource management.
        # _parser_cleanup (below) handles __exit__; call __enter__ here.
        # TODO(stumpylog): Remove me in the future
-        if isinstance(
-            document_parser,
-            (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
-        ):
+        if parser_is_new_style:
            document_parser.__enter__()
        self.log.debug(f"Parser: {type(document_parser).__name__}")
@@ -480,20 +496,12 @@ class ConsumerPlugin(
                ConsumerStatusShortMessage.PARSING_DOCUMENT,
            )
            self.log.debug(f"Parsing {self.filename}...")
-            if (
-                isinstance(document_parser, MailDocumentParser)
-                and self.input_doc.mailrule_id
-            ):
-                document_parser.parse(
-                    self.working_copy,
-                    mime_type,
-                    self.filename,
-                    self.input_doc.mailrule_id,
-                )
-            elif isinstance(
-                document_parser,
-                (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
-            ):
+
+            # TODO(stumpylog): Remove me in the future when all parsers use new protocol
+            if parser_is_new_style:
+                document_parser.configure(
+                    ParserContext(mailrule_id=self.input_doc.mailrule_id),
+                )
                # TODO(stumpylog): Remove me in the future
                document_parser.parse(self.working_copy, mime_type)
            else:
@@ -506,11 +514,8 @@ class ConsumerPlugin(
                ProgressStatusOptions.WORKING,
                ConsumerStatusShortMessage.GENERATING_THUMBNAIL,
            )
-            if isinstance(
-                document_parser,
-                (TextDocumentParser, RemoteDocumentParser, TikaDocumentParser),
-            ):
-                # TODO(stumpylog): Remove me in the future
+            # TODO(stumpylog): Remove me in the future when all parsers use new protocol
+            if parser_is_new_style:
                thumbnail = document_parser.get_thumbnail(self.working_copy, mime_type)
            else:
                thumbnail = document_parser.get_thumbnail(
--- a/src/documents/management/commands/document_thumbnails.py
+++ b/src/documents/management/commands/document_thumbnails.py
@@ -4,6 +4,11 @@ import shutil
 from documents.management.commands.base import PaperlessCommand
 from documents.models import Document
 from documents.parsers import get_parser_class_for_mime_type
+from paperless.parsers.mail import MailDocumentParser
+from paperless.parsers.remote import RemoteDocumentParser
+from paperless.parsers.tesseract import RasterisedDocumentParser
+from paperless.parsers.text import TextDocumentParser
+from paperless.parsers.tika import TikaDocumentParser
 logger = logging.getLogger("paperless.management.thumbnails")
@@ -22,16 +27,38 @@ def _process_document(doc_id: int) -> None:
    parser = parser_class(logging_group=None)
+    parser_is_new_style = isinstance(
+        parser,
+        (
+            MailDocumentParser,
+            RasterisedDocumentParser,
+            RemoteDocumentParser,
+            TextDocumentParser,
+            TikaDocumentParser,
+        ),
+    )
+    # TODO(stumpylog): Remove branch in the future when all parsers use new protocol
+    if parser_is_new_style:
+        parser.__enter__()
    try:
-        thumb = parser.get_thumbnail(
-            document.source_path,
-            document.mime_type,
-            document.get_public_filename(),
-        )
+        # TODO(stumpylog): Remove branch in the future when all parsers use new protocol
+        if parser_is_new_style:
+            thumb = parser.get_thumbnail(document.source_path, document.mime_type)
+        else:
+            thumb = parser.get_thumbnail(
+                document.source_path,
+                document.mime_type,
+                document.get_public_filename(),
+            )
        shutil.move(thumb, document.thumbnail_path)
    finally:
        # TODO(stumpylog): Cleanup once all parsers are handled
-        parser.cleanup()
+        if parser_is_new_style:
+            parser.__exit__(None, None, None)
+        else:
+            parser.cleanup()
 class Command(PaperlessCommand):
--- a/src/documents/tasks.py
+++ b/src/documents/tasks.py
@@ -65,6 +65,12 @@ from documents.signals.handlers import run_workflows
 from documents.signals.handlers import send_websocket_document_updated
 from documents.workflows.utils import get_workflows_for_trigger
 from paperless.config import AIConfig
+from paperless.parsers import ParserContext
+from paperless.parsers.mail import MailDocumentParser
+from paperless.parsers.remote import RemoteDocumentParser
+from paperless.parsers.tesseract import RasterisedDocumentParser
+from paperless.parsers.text import TextDocumentParser
+from paperless.parsers.tika import TikaDocumentParser
 from paperless_ai.indexing import llm_index_add_or_update_document
 from paperless_ai.indexing import llm_index_remove_document
 from paperless_ai.indexing import update_llm_index
@@ -304,7 +310,9 @@ def update_document_content_maybe_archive_file(document_id) -> None:
    mime_type = document.mime_type
-    parser_class: type[DocumentParser] = get_parser_class_for_mime_type(mime_type)
+    parser_class: type[DocumentParser] | None = get_parser_class_for_mime_type(
+        mime_type,
+    )
    if not parser_class:
        logger.error(
@@ -315,14 +323,42 @@ def update_document_content_maybe_archive_file(document_id) -> None:
    parser: DocumentParser = parser_class(logging_group=uuid.uuid4())
-    try:
-        parser.parse(document.source_path, mime_type, document.get_public_filename())
-        thumbnail = parser.get_thumbnail(
-            document.source_path,
-            mime_type,
-            document.get_public_filename(),
-        )
+    parser_is_new_style = isinstance(
+        parser,
+        (
+            MailDocumentParser,
+            RasterisedDocumentParser,
+            RemoteDocumentParser,
+            TextDocumentParser,
+            TikaDocumentParser,
+        ),
+    )
+    # TODO(stumpylog): Remove branch in the future when all parsers use new protocol
+    if parser_is_new_style:
+        parser.__enter__()
+
+    try:
+        # TODO(stumpylog): Remove branch in the future when all parsers use new protocol
+        if parser_is_new_style:
+            parser.configure(ParserContext())
+            parser.parse(document.source_path, mime_type)
+        else:
+            parser.parse(
+                document.source_path,
+                mime_type,
+                document.get_public_filename(),
+            )
+        # TODO(stumpylog): Remove branch in the future when all parsers use new protocol
+        if parser_is_new_style:
+            thumbnail = parser.get_thumbnail(document.source_path, mime_type)
+        else:
+            thumbnail = parser.get_thumbnail(
+                document.source_path,
+                mime_type,
+                document.get_public_filename(),
+            )
        with transaction.atomic():
            oldDocument = Document.objects.get(pk=document.pk)
@@ -403,8 +439,20 @@ def update_document_content_maybe_archive_file(document_id) -> None:
            f"Error while parsing document {document} (ID: {document_id})",
        )
    finally:
-        # TODO(stumpylog): Cleanup once all parsers are handled
-        parser.cleanup()
+        # TODO(stumpylog): Remove branch in the future when all parsers use new protocol
+        if isinstance(
+            parser,
+            (
+                MailDocumentParser,
+                RasterisedDocumentParser,
+                RemoteDocumentParser,
+                TextDocumentParser,
+                TikaDocumentParser,
+            ),
+        ):
+            parser.__exit__(None, None, None)
+        else:
+            parser.cleanup()
@shared_task
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -36,7 +36,6 @@ from documents.tests.utils import DummyProgressManager
 from documents.tests.utils import FileSystemAssertsMixin
 from documents.tests.utils import GetConsumerMixin
 from paperless_mail.models import MailRule
-from paperless_mail.parsers import MailDocumentParser
 class _BaseTestParser(DocumentParser):
@@ -1091,7 +1090,7 @@ class TestConsumer(
            self.assertEqual(command[1], "--replace-input")
    @mock.patch("paperless_mail.models.MailRule.objects.get")
-    @mock.patch("paperless_mail.parsers.MailDocumentParser.parse")
+    @mock.patch("paperless.parsers.mail.MailDocumentParser.parse")
    @mock.patch("documents.parsers.document_consumer_declaration.send")
    def test_mail_parser_receives_mailrule(
        self,
@@ -1107,11 +1106,13 @@ class TestConsumer(
        THEN:
            - The mail parser should receive the mail rule
        """
+        from paperless_mail.signals import get_parser as mail_get_parser
        mock_consumer_declaration_send.return_value = [
            (
                None,
                {
-                    "parser": MailDocumentParser,
+                    "parser": mail_get_parser,
                    "mime_types": {"message/rfc822": ".eml"},
                    "weight": 0,
                },
@@ -1123,9 +1124,10 @@ class TestConsumer(
        with self.get_consumer(
            filepath=(
                Path(__file__).parent.parent.parent
-                / Path("paperless_mail")
+                / Path("paperless")
                / Path("tests")
                / Path("samples")
+                / Path("mail")
            ).resolve()
            / "html.eml",
            source=DocumentSource.MailFetch,
@@ -1136,12 +1138,10 @@ class TestConsumer(
                ConsumerError,
            ):
                consumer.run()
-                mock_mail_parser_parse.assert_called_once_with(
-                    consumer.working_copy,
-                    "message/rfc822",
-                    file_name="sample.pdf",
-                    mailrule=mock_mailrule_get.return_value,
-                )
+            mock_mail_parser_parse.assert_called_once_with(
+                consumer.working_copy,
+                "message/rfc822",
+            )
@mock.patch("documents.consumer.magic.from_file", fake_magic_from_file)
--- a/src/documents/tests/test_parsers.py
+++ b/src/documents/tests/test_parsers.py
@@ -9,9 +9,9 @@ from documents.parsers import get_default_file_extension
 from documents.parsers import get_parser_class_for_mime_type
 from documents.parsers import get_supported_file_extensions
 from documents.parsers import is_file_ext_supported
+from paperless.parsers.tesseract import RasterisedDocumentParser
 from paperless.parsers.text import TextDocumentParser
 from paperless.parsers.tika import TikaDocumentParser
-from paperless_tesseract.parsers import RasterisedDocumentParser
 class TestParserDiscovery(TestCase):
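The commit message notes that `application/pdf` can no longer stand in for "no handler" once a PDF parser is registered. A minimal sketch of why, assuming a simplified lookup: `PdfParser`, `MailParser` and `pick_parser` are hypothetical, and only the `score()` signature is borrowed from the mail parser later in this diff.

```python
from __future__ import annotations

from pathlib import Path


class PdfParser:
    """Hypothetical stand-in for a registered PDF-capable parser."""

    @classmethod
    def score(cls, mime_type: str, filename: str, path: Path | None = None) -> int | None:
        return 10 if mime_type == "application/pdf" else None


class MailParser:
    """Hypothetical stand-in for a registered mail parser."""

    @classmethod
    def score(cls, mime_type: str, filename: str, path: Path | None = None) -> int | None:
        return 10 if mime_type == "message/rfc822" else None


def pick_parser(parsers: list[type], mime_type: str, filename: str) -> type | None:
    """Return the highest-scoring parser class, or None if no parser claims the type."""
    best: type | None = None
    best_score: int | None = None
    for candidate in parsers:
        score = candidate.score(mime_type, filename)
        if score is not None and (best_score is None or score > best_score):
            best, best_score = candidate, score
    return best
```

Under these assumptions a test for the "no handler" path has to use a MIME type that no registered parser scores, which matches the switch described in the commit message.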
--- a/src/paperless/parsers/__init__.py
+++ b/src/paperless/parsers/__init__.py
@@ -35,6 +35,7 @@ Usage example (third-party parser)::
 from __future__ import annotations
 from dataclasses import dataclass
 from typing import TYPE_CHECKING
 from typing import Protocol
 from typing import Self
@@ -48,6 +49,7 @@ if TYPE_CHECKING:
 __all__ = [
    "MetadataEntry",
    "ParserContext",
    "ParserProtocol",
 ]
@@ -73,6 +75,44 @@ class MetadataEntry(TypedDict):
    """String representation of the field value."""
@dataclass(frozen=True, slots=True)
 class ParserContext:
    """Immutable context passed to a parser before parse().
    The consumer assembles this from the ingestion event and Django
    settings, then calls ``parser.configure(context)`` before
    ``parser.parse()``.  Parsers read only the fields relevant to them;
    unneeded fields are ignored.
    ``frozen=True`` prevents accidental mutation after the consumer
    hands the context off.  ``slots=True`` keeps instances lightweight.
    Fields
    ------
    mailrule_id : int | None
        Primary key of the ``MailRule`` that triggered this ingestion,
        or ``None`` when the document did not arrive via a mail rule.
        Used by ``MailDocumentParser`` to select the PDF layout.
    Notes
    -----
    Future fields (not yet implemented):
    * ``output_type`` — PDF/A variant for archive generation
      (replaces ``settings.OCR_OUTPUT_TYPE`` reads inside parsers).
    * ``ocr_mode`` — skip-text, redo, force, etc.
      (replaces ``settings.OCR_MODE`` reads inside parsers).
    * ``ocr_language`` — Tesseract language string.
      (replaces ``settings.OCR_LANGUAGE`` reads inside parsers).
    When those fields are added the consumer will read from Django
    settings once and populate them here, decoupling parsers from
    ``settings.*`` entirely.
    """
    mailrule_id: int | None = None
@runtime_checkable
 class ParserProtocol(Protocol):
    """Structural contract for all Paperless-ngx document parsers.
@@ -191,6 +231,21 @@ class ParserProtocol(Protocol):
    # Core parsing interface
    # ------------------------------------------------------------------
    def configure(self, context: ParserContext) -> None:
        """Apply source context before parse().
        Called by the consumer after instantiation and before parse().
        The default implementation is a no-op; parsers override only the
        fields they need.
        Parameters
        ----------
        context:
            Immutable context assembled by the consumer for this
            specific ingestion event.
        """
        ...
    def parse(
        self,
        document_path: Path,
--- a/src/paperless/parsers/mail.py
+++ b/src/paperless/parsers/mail.py
@@ -0,0 +1,834 @@
 """
 Built-in mail document parser.
 Handles message/rfc822 (EML) MIME type by:
 - Parsing the email using imap_tools
 - Generating a PDF via Gotenberg (for display and archive)
 - Extracting text via Tika for HTML content
 - Extracting metadata from email headers
 The parser always produces a PDF because EML files cannot be rendered
 natively in a browser (requires_pdf_rendition=True).
 """
 from __future__ import annotations
 import logging
 import re
 import shutil
 import tempfile
 from html import escape
 from pathlib import Path
 from typing import TYPE_CHECKING
 from typing import Self
 from bleach import clean
 from bleach import linkify
 from django.conf import settings
 from django.utils import timezone
 from django.utils.timezone import is_naive
 from django.utils.timezone import make_aware
 from gotenberg_client import GotenbergClient
 from gotenberg_client.constants import A4
 from gotenberg_client.options import Measurement
 from gotenberg_client.options import MeasurementUnitType
 from gotenberg_client.options import PageMarginsType
 from gotenberg_client.options import PdfAFormat
 from humanize import naturalsize
 from imap_tools import MailAttachment
 from imap_tools import MailMessage
 from tika_client import TikaClient
 from documents.parsers import ParseError
 from documents.parsers import make_thumbnail_from_pdf
 from paperless.models import OutputTypeChoices
 from paperless.version import __full_version_str__
 from paperless_mail.models import MailRule
 if TYPE_CHECKING:
    import datetime
    from types import TracebackType
    from paperless.parsers import MetadataEntry
    from paperless.parsers import ParserContext
 logger = logging.getLogger("paperless.parsing.mail")
 _SUPPORTED_MIME_TYPES: dict[str, str] = {
    "message/rfc822": ".eml",
 }
 class MailDocumentParser:
    """Parse .eml email files for Paperless-ngx.
    Uses imap_tools to parse .eml files, generates a PDF using Gotenberg,
    and sends the HTML part to a Tika server for text extraction.  Because
    EML files cannot be rendered natively in a browser, the parser always
    produces a PDF rendition (requires_pdf_rendition=True).
    Pass a ``ParserContext`` to ``configure()`` before ``parse()`` to
    apply mail-rule-specific PDF layout options:
        parser.configure(ParserContext(mailrule_id=rule.pk))
        parser.parse(path, mime_type)
    Class attributes
    ----------------
    name : str
        Human-readable parser name.
    version : str
        Semantic version string, kept in sync with Paperless-ngx releases.
    author : str
        Maintainer name.
    url : str
        Issue tracker / source URL.
    """
    name: str = "Paperless-ngx Mail Parser"
    version: str = __full_version_str__
    author: str = "Paperless-ngx Contributors"
    url: str = "https://github.com/paperless-ngx/paperless-ngx"
    # ------------------------------------------------------------------
    # Class methods
    # ------------------------------------------------------------------
    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        """Return the MIME types this parser handles.
        Returns
        -------
        dict[str, str]
            Mapping of MIME type to preferred file extension.
        """
        return _SUPPORTED_MIME_TYPES
    @classmethod
    def score(
        cls,
        mime_type: str,
        filename: str,
        path: Path | None = None,
    ) -> int | None:
        """Return the priority score for handling this file.
        Parameters
        ----------
        mime_type:
            Detected MIME type of the file.
        filename:
            Original filename including extension.
        path:
            Optional filesystem path. Not inspected by this parser.
        Returns
        -------
        int | None
            10 if the MIME type is supported, otherwise None.
        """
        if mime_type in _SUPPORTED_MIME_TYPES:
            return 10
        return None
    # ------------------------------------------------------------------
    # Properties
    # ------------------------------------------------------------------
    @property
    def can_produce_archive(self) -> bool:
        """Whether this parser can produce a searchable PDF archive copy.
        Returns
        -------
        bool
            Always False — the mail parser produces a display PDF
            (requires_pdf_rendition=True), not an optional OCR archive.
        """
        return False
    @property
    def requires_pdf_rendition(self) -> bool:
        """Whether the parser must produce a PDF for the frontend to display.
        Returns
        -------
        bool
            Always True — EML files cannot be rendered natively in a browser,
            so a PDF conversion is always required for display.
        """
        return True
    # ------------------------------------------------------------------
    # Lifecycle
    # ------------------------------------------------------------------
    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self._tempdir = Path(
            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR),
        )
        self._text: str | None = None
        self._date: datetime.datetime | None = None
        self._archive_path: Path | None = None
        self._mailrule_id: int | None = None
    def __enter__(self) -> Self:
        return self
    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_val: BaseException | None,
        exc_tb: TracebackType | None,
    ) -> None:
        logger.debug("Cleaning up temporary directory %s", self._tempdir)
        shutil.rmtree(self._tempdir, ignore_errors=True)
    # ------------------------------------------------------------------
    # Core parsing interface
    # ------------------------------------------------------------------
    def configure(self, context: ParserContext) -> None:
        self._mailrule_id = context.mailrule_id
    def parse(
        self,
        document_path: Path,
        mime_type: str,
        *,
        produce_archive: bool = True,
    ) -> None:
        """Parse the given .eml into formatted text and a PDF archive.
        Call ``configure(ParserContext(mailrule_id=...))`` before this method
        to apply mail-rule-specific PDF layout options.  The ``produce_archive``
        flag is accepted for protocol compatibility but has no effect:
        the mail parser always produces a PDF since EML files cannot be
        displayed natively.
        Parameters
        ----------
        document_path:
            Absolute path to the .eml file.
        mime_type:
            Detected MIME type of the document (should be "message/rfc822").
        produce_archive:
            Accepted for protocol compatibility. The PDF rendition is always
            produced since EML files cannot be displayed natively in a browser.
        Raises
        ------
        documents.parsers.ParseError
            If the file cannot be parsed or PDF generation fails.
        """
        def strip_text(text: str) -> str:
            """Reduces the spacing of the given text string."""
            text = re.sub(r"\s+", " ", text)
            text = re.sub(r"(\n *)+", "\n", text)
            return text.strip()
        def build_formatted_text(mail_message: MailMessage) -> str:
            """Constructs a formatted string based on the given email."""
            fmt_text = f"Subject: {mail_message.subject}\n\n"
            fmt_text += f"From: {mail_message.from_values.full if mail_message.from_values else ''}\n\n"
            to_list = [address.full for address in mail_message.to_values]
            fmt_text += f"To: {', '.join(to_list)}\n\n"
            if mail_message.cc_values:
                fmt_text += (
                    f"CC: {', '.join(address.full for address in mail.cc_values)}\n\n"
                )
            if mail_message.bcc_values:
                fmt_text += (
                    f"BCC: {', '.join(address.full for address in mail.bcc_values)}\n\n"
                )
            if mail_message.attachments:
                att = []
                for a in mail.attachments:
                    attachment_size = naturalsize(a.size, binary=True, format="%.2f")
                    att.append(
                        f"{a.filename} ({attachment_size})",
                    )
                fmt_text += f"Attachments: {', '.join(att)}\n\n"
            if mail.html:
                fmt_text += "HTML content: " + strip_text(self.tika_parse(mail.html))
            fmt_text += f"\n\n{strip_text(mail.text)}"
            return fmt_text
        logger.debug("Parsing file %s into an email", document_path.name)
        mail = self.parse_file_to_message(document_path)
        logger.debug("Building formatted text from email")
        self._text = build_formatted_text(mail)
        if is_naive(mail.date):
            self._date = make_aware(mail.date)
        else:
            self._date = mail.date
        logger.debug("Creating a PDF from the email")
        if self._mailrule_id:
            rule = MailRule.objects.get(pk=self._mailrule_id)
            self._archive_path = self.generate_pdf(
                mail,
                MailRule.PdfLayout(rule.pdf_layout),
            )
        else:
            self._archive_path = self.generate_pdf(mail)
    # ------------------------------------------------------------------
    # Result accessors
    # ------------------------------------------------------------------
    def get_text(self) -> str | None:
        """Return the plain-text content extracted during parse.
        Returns
        -------
        str | None
            Extracted text, or None if parse has not been called yet.
        """
        return self._text
    def get_date(self) -> datetime.datetime | None:
        """Return the document date detected during parse.
        Returns
        -------
        datetime.datetime | None
            Date from the email headers, or None if not detected.
        """
        return self._date
    def get_archive_path(self) -> Path | None:
        """Return the path to the generated archive PDF, or None.
        Returns
        -------
        Path | None
            Path to the PDF produced by Gotenberg, or None if parse has not
            been called yet.
        """
        return self._archive_path
    # ------------------------------------------------------------------
    # Thumbnail and metadata
    # ------------------------------------------------------------------
    def get_thumbnail(
        self,
        document_path: Path,
        mime_type: str,
        file_name: str | None = None,
    ) -> Path:
        """Generate a thumbnail from the PDF rendition of the email.
        Converts the document to PDF first if not already done.
        Parameters
        ----------
        document_path:
            Absolute path to the source document.
        mime_type:
            Detected MIME type of the document.
        file_name:
            Kept for backward compatibility; not used.
        Returns
        -------
        Path
            Path to the generated WebP thumbnail inside the temporary directory.
        """
        if not self._archive_path:
            self._archive_path = self.generate_pdf(
                self.parse_file_to_message(document_path),
            )
        return make_thumbnail_from_pdf(
            self._archive_path,
            self._tempdir,
        )
    def get_page_count(
        self,
        document_path: Path,
        mime_type: str,
    ) -> int | None:
        """Return the number of pages in the document.
        Counts pages in the archive PDF produced by a preceding parse()
        call.  Returns ``None`` if parse() has not been called yet or if
        no archive was produced.
        Returns
        -------
        int | None
            Page count of the archive PDF, or ``None``.
        """
        if self._archive_path is not None:
            from paperless.parsers.utils import get_page_count_for_pdf
            return get_page_count_for_pdf(self._archive_path, log=logger)
        return None
    def extract_metadata(
        self,
        document_path: Path,
        mime_type: str,
    ) -> list[MetadataEntry]:
        """Extract metadata from the email headers.
        Returns email headers as metadata entries with prefix "header",
        plus summary entries for attachments and date.
        Returns
        -------
        list[MetadataEntry]
            Sorted list of metadata entries, or ``[]`` on parse failure.
        """
        result: list[MetadataEntry] = []
        try:
            mail = self.parse_file_to_message(document_path)
        except ParseError as e:
            logger.warning(
                "Error while fetching document metadata for %s: %s",
                document_path,
                e,
            )
            return result
        for key, header_values in mail.headers.items():
            value = ", ".join(header_values)
            try:
                value.encode("utf-8")
            except UnicodeEncodeError as e:  # pragma: no cover
                logger.debug("Skipping header %s: %s", key, e)
                continue
            result.append(
                {
                    "namespace": "",
                    "prefix": "header",
                    "key": key,
                    "value": value,
                },
            )
        result.append(
            {
                "namespace": "",
                "prefix": "",
                "key": "attachments",
                "value": ", ".join(
                    f"{attachment.filename}"
                    f"({naturalsize(attachment.size, binary=True, format='%.2f')})"
                    for attachment in mail.attachments
                ),
            },
        )
        result.append(
            {
                "namespace": "",
                "prefix": "",
                "key": "date",
                "value": mail.date.strftime("%Y-%m-%d %H:%M:%S %Z"),
            },
        )
        result.sort(key=lambda item: (item["prefix"], item["key"]))
        return result
    # ------------------------------------------------------------------
    # Email-specific methods
    # ------------------------------------------------------------------
    def _settings_to_gotenberg_pdfa(self) -> PdfAFormat | None:
        """Convert the OCR output type setting to a Gotenberg PdfAFormat."""
        if settings.OCR_OUTPUT_TYPE in {
            OutputTypeChoices.PDF_A,
            OutputTypeChoices.PDF_A2,
        }:
            return PdfAFormat.A2b
        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A1:  # pragma: no cover
            logger.warning(
                "Gotenberg does not support PDF/A-1a, choosing PDF/A-2b instead",
            )
            return PdfAFormat.A2b
        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A3:  # pragma: no cover
            return PdfAFormat.A3b
        return None
    @staticmethod
    def parse_file_to_message(filepath: Path) -> MailMessage:
        """Parse the given .eml file into a MailMessage object.
        Parameters
        ----------
        filepath:
            Path to the .eml file.
        Returns
        -------
        MailMessage
            Parsed mail message.
        Raises
        ------
        documents.parsers.ParseError
            If the file cannot be parsed or is missing required fields.
        """
        try:
            with filepath.open("rb") as eml:
                parsed = MailMessage.from_bytes(eml.read())
                if parsed.from_values is None:
                    raise ParseError(
                        f"Could not parse {filepath}: Missing 'from'",
                    )
        except Exception as err:
            raise ParseError(
                f"Could not parse {filepath}: {err}",
            ) from err
        return parsed
    def tika_parse(self, html: str) -> str:
        """Send HTML content to the Tika server for text extraction.
        Parameters
        ----------
        html:
            HTML string to parse.
        Returns
        -------
        str
            Extracted plain text.
        Raises
        ------
        documents.parsers.ParseError
            If the Tika server cannot be reached or returns an error.
        """
        logger.info("Sending content to Tika server")
        try:
            with TikaClient(tika_url=settings.TIKA_ENDPOINT) as client:
                parsed = client.tika.as_text.from_buffer(html, "text/html")
                if parsed.content is not None:
                    return parsed.content.strip()
                return ""
        except Exception as err:
            raise ParseError(
                f"Could not parse content with tika server at "
                f"{settings.TIKA_ENDPOINT}: {err}",
            ) from err
    def generate_pdf(
        self,
        mail_message: MailMessage,
        pdf_layout: MailRule.PdfLayout | None = None,
    ) -> Path:
        """Generate a PDF from the email message.
        Creates separate PDFs for the email body and HTML content, then
        merges them according to the requested layout.
        Parameters
        ----------
        mail_message:
            Parsed email message.
        pdf_layout:
            Layout option for the PDF. Falls back to the
            EMAIL_PARSE_DEFAULT_LAYOUT setting if not provided.
        Returns
        -------
        Path
            Path to the generated PDF inside the temporary directory.
        """
        archive_path = Path(self._tempdir) / "merged.pdf"
        mail_pdf_file = self.generate_pdf_from_mail(mail_message)
        if pdf_layout is None:
            pdf_layout = MailRule.PdfLayout(settings.EMAIL_PARSE_DEFAULT_LAYOUT)
        # If no HTML content, create the PDF from the message.
        # Otherwise, create 2 PDFs and merge them with Gotenberg.
        if not mail_message.html:
            archive_path.write_bytes(mail_pdf_file.read_bytes())
        else:
            pdf_of_html_content = self.generate_pdf_from_html(
                mail_message.html,
                mail_message.attachments,
            )
            logger.debug("Merging email text and HTML content into single PDF")
            with (
                GotenbergClient(
                    host=settings.TIKA_GOTENBERG_ENDPOINT,
                    timeout=settings.CELERY_TASK_TIME_LIMIT,
                ) as client,
                client.merge.merge() as route,
            ):
                # Configure requested PDF/A formatting, if any
                pdf_a_format = self._settings_to_gotenberg_pdfa()
                if pdf_a_format is not None:
                    route.pdf_format(pdf_a_format)
                match pdf_layout:
                    case MailRule.PdfLayout.HTML_TEXT:
                        route.merge([pdf_of_html_content, mail_pdf_file])
                    case MailRule.PdfLayout.HTML_ONLY:
                        route.merge([pdf_of_html_content])
                    case MailRule.PdfLayout.TEXT_ONLY:
                        route.merge([mail_pdf_file])
                    case MailRule.PdfLayout.TEXT_HTML | _:
                        route.merge([mail_pdf_file, pdf_of_html_content])
                try:
                    response = route.run()
                    archive_path.write_bytes(response.content)
                except Exception as err:
                    raise ParseError(
                        f"Error while merging email HTML into PDF: {err}",
                    ) from err
        return archive_path
    def mail_to_html(self, mail: MailMessage) -> Path:
        """Convert the given email into an HTML file using a template.
        Parameters
        ----------
        mail:
            Parsed mail message.
        Returns
        -------
        Path
            Path to the rendered HTML file inside the temporary directory.
        """
        def clean_html(text: str) -> str:
            """Attempt to clean, escape, and linkify the given HTML string."""
            if isinstance(text, list):
                text = "\n".join([str(e) for e in text])
            if not isinstance(text, str):
                text = str(text)
            text = escape(text)
            text = clean(text)
            text = linkify(text, parse_email=True)
            text = text.replace("\n", "<br>")
            return text
        data = {}
        data["subject"] = clean_html(mail.subject)
        if data["subject"]:
            data["subject_label"] = "Subject"
        data["from"] = clean_html(mail.from_values.full if mail.from_values else "")
        if data["from"]:
            data["from_label"] = "From"
        data["to"] = clean_html(", ".join(address.full for address in mail.to_values))
        if data["to"]:
            data["to_label"] = "To"
        data["cc"] = clean_html(", ".join(address.full for address in mail.cc_values))
        if data["cc"]:
            data["cc_label"] = "CC"
        data["bcc"] = clean_html(", ".join(address.full for address in mail.bcc_values))
        if data["bcc"]:
            data["bcc_label"] = "BCC"
        att = []
        for a in mail.attachments:
            att.append(
                f"{a.filename} ({naturalsize(a.size, binary=True, format='%.2f')})",
            )
        data["attachments"] = clean_html(", ".join(att))
        if data["attachments"]:
            data["attachments_label"] = "Attachments"
        data["date"] = clean_html(
            timezone.localtime(mail.date).strftime("%Y-%m-%d %H:%M"),
        )
        data["content"] = clean_html(mail.text.strip())
        from django.template.loader import render_to_string
        html_file = Path(self._tempdir) / "email_as_html.html"
        html_file.write_text(render_to_string("email_msg_template.html", context=data))
        return html_file
    def generate_pdf_from_mail(self, mail: MailMessage) -> Path:
        """Create a PDF from the email body using an HTML template and Gotenberg.
        Parameters
        ----------
        mail:
            Parsed mail message.
        Returns
        -------
        Path
            Path to the generated PDF inside the temporary directory.
        Raises
        ------
        documents.parsers.ParseError
            If Gotenberg returns an error.
        """
        logger.info("Converting mail to PDF")
        css_file = (
            Path(__file__).parent.parent.parent
            / "paperless_mail"
            / "templates"
            / "output.css"
        )
        email_html_file = self.mail_to_html(mail)
        with (
            GotenbergClient(
                host=settings.TIKA_GOTENBERG_ENDPOINT,
                timeout=settings.CELERY_TASK_TIME_LIMIT,
            ) as client,
            client.chromium.html_to_pdf() as route,
        ):
            # Configure requested PDF/A formatting, if any
            pdf_a_format = self._settings_to_gotenberg_pdfa()
            if pdf_a_format is not None:
                route.pdf_format(pdf_a_format)
            try:
                response = (
                    route.index(email_html_file)
                    .resource(css_file)
                    .margins(
                        PageMarginsType(
                            top=Measurement(0.1, MeasurementUnitType.Inches),
                            bottom=Measurement(0.1, MeasurementUnitType.Inches),
                            left=Measurement(0.1, MeasurementUnitType.Inches),
                            right=Measurement(0.1, MeasurementUnitType.Inches),
                        ),
                    )
                    .size(A4)
                    .scale(1.0)
                    .run()
                )
            except Exception as err:
                raise ParseError(
                    f"Error while converting email to PDF: {err}",
                ) from err
        email_as_pdf_file = Path(self._tempdir) / "email_as_pdf.pdf"
        email_as_pdf_file.write_bytes(response.content)
        return email_as_pdf_file
    def generate_pdf_from_html(
        self,
        orig_html: str,
        attachments: list[MailAttachment],
    ) -> Path:
        """Generate a PDF from the HTML content of the email.
        Parameters
        ----------
        orig_html:
            Raw HTML string from the email body.
        attachments:
            List of email attachments (used as inline resources).
        Returns
        -------
        Path
            Path to the generated PDF inside the temporary directory.
        Raises
        ------
        documents.parsers.ParseError
            If Gotenberg returns an error.
        """
        def clean_html_script(text: str) -> str:
            compiled_open = re.compile(re.escape("<script"), re.IGNORECASE)
            text = compiled_open.sub("<div hidden ", text)
            compiled_close = re.compile(re.escape("</script"), re.IGNORECASE)
            text = compiled_close.sub("</div", text)
            return text
        logger.info("Converting message html to PDF")
        tempdir = Path(self._tempdir)
        html_clean = clean_html_script(orig_html)
        html_clean_file = tempdir / "index.html"
        html_clean_file.write_text(html_clean)
        with (
            GotenbergClient(
                host=settings.TIKA_GOTENBERG_ENDPOINT,
                timeout=settings.CELERY_TASK_TIME_LIMIT,
            ) as client,
            client.chromium.html_to_pdf() as route,
        ):
            # Configure requested PDF/A formatting, if any
            pdf_a_format = self._settings_to_gotenberg_pdfa()
            if pdf_a_format is not None:
                route.pdf_format(pdf_a_format)
            # Add attachments as resources, cleaning the filename and replacing
            # it in the index file for inclusion
            for attachment in attachments:
                # Clean the attachment name to be valid
                name_cid = f"cid:{attachment.content_id}"
                name_clean = "".join(e for e in name_cid if e.isalnum())
                # Write attachment payload to a temp file
                temp_file = tempdir / name_clean
                temp_file.write_bytes(attachment.payload)
                route.resource(temp_file)
                # Replace the cid reference in the HTML with the cleaned name
                html_clean = html_clean.replace(name_cid, name_clean)
            # Now store the cleaned up HTML version
            html_clean_file = tempdir / "index.html"
            html_clean_file.write_text(html_clean)
            # This is our index file, the main page basically
            route.index(html_clean_file)
            # Set page size, margins
            route.margins(
                PageMarginsType(
                    top=Measurement(0.1, MeasurementUnitType.Inches),
                    bottom=Measurement(0.1, MeasurementUnitType.Inches),
                    left=Measurement(0.1, MeasurementUnitType.Inches),
                    right=Measurement(0.1, MeasurementUnitType.Inches),
                ),
            ).size(A4).scale(1.0)
            try:
                response = route.run()
            except Exception as err:
                raise ParseError(
                    f"Error while converting document to PDF: {err}",
                ) from err
        html_pdf = tempdir / "html.pdf"
        html_pdf.write_bytes(response.content)
        return html_pdf
--- a/src/paperless/parsers/registry.py
+++ b/src/paperless/parsers/registry.py
@@ -193,13 +193,17 @@ class ParserRegistry:
        that log output is predictable; scoring determines which parser wins
        at runtime regardless of registration order.
        """
        from paperless.parsers.mail import MailDocumentParser
        from paperless.parsers.remote import RemoteDocumentParser
        from paperless.parsers.tesseract import RasterisedDocumentParser
        from paperless.parsers.text import TextDocumentParser
        from paperless.parsers.tika import TikaDocumentParser
        self.register_builtin(TextDocumentParser)
        self.register_builtin(RemoteDocumentParser)
        self.register_builtin(TikaDocumentParser)
        self.register_builtin(MailDocumentParser)
        self.register_builtin(RasterisedDocumentParser)
    # ------------------------------------------------------------------
    # Discovery
--- a/src/paperless/parsers/remote.py
+++ b/src/paperless/parsers/remote.py
@@ -28,6 +28,7 @@ if TYPE_CHECKING:
    from types import TracebackType
    from paperless.parsers import MetadataEntry
    from paperless.parsers import ParserContext
 logger = logging.getLogger("paperless.parsing.remote")
@@ -204,6 +205,9 @@ class RemoteDocumentParser:
    # Core parsing interface
    # ------------------------------------------------------------------
    def configure(self, context: ParserContext) -> None:
        pass
    def parse(
        self,
        document_path: Path,
--- a/src/paperless/parsers/tesseract.py
+++ b/src/paperless/parsers/tesseract.py
@@ -1,13 +1,18 @@
 from __future__ import annotations
 import logging
 import os
 import re
 import shutil
 import tempfile
 from pathlib import Path
 from typing import TYPE_CHECKING
 from typing import Any
 from typing import Self
 from django.conf import settings
 from PIL import Image
 from documents.parsers import DocumentParser
 from documents.parsers import ParseError
 from documents.parsers import make_thumbnail_from_pdf
 from documents.utils import maybe_override_pixel_limit
@@ -16,6 +21,28 @@ from paperless.config import OcrConfig
 from paperless.models import ArchiveFileChoices
 from paperless.models import CleanChoices
 from paperless.models import ModeChoices
 from paperless.parsers.utils import read_file_handle_unicode_errors
 from paperless.version import __full_version_str__
 if TYPE_CHECKING:
    import datetime
    from types import TracebackType
    from paperless.parsers import MetadataEntry
    from paperless.parsers import ParserContext
 logger = logging.getLogger("paperless.parsing.tesseract")
 _SUPPORTED_MIME_TYPES: dict[str, str] = {
    "application/pdf": ".pdf",
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/tiff": ".tif",
    "image/gif": ".gif",
    "image/bmp": ".bmp",
    "image/webp": ".webp",
    "image/heic": ".heic",
 }
 class NoTextFoundException(Exception):
@@ -26,81 +53,125 @@ class RtlLanguageException(Exception):
    pass
-class RasterisedDocumentParser(DocumentParser):
+class RasterisedDocumentParser:
    """
    This parser uses Tesseract to try and get some text out of a rasterised
    image, whether it's a PDF, or other graphical format (JPEG, TIFF, etc.)
    """
-    logging_name = "paperless.parsing.tesseract"
-    def get_settings(self) -> OcrConfig:
-        """
-        This parser uses the OCR configuration settings to parse documents
-        """
-        return OcrConfig()
-    def get_page_count(self, document_path, mime_type):
-        page_count = None
-        if mime_type == "application/pdf":
-            try:
-                import pikepdf
-                with pikepdf.Pdf.open(document_path) as pdf:
-                    page_count = len(pdf.pages)
-            except Exception as e:
-                self.log.warning(
-                    f"Unable to determine PDF page count {document_path}: {e}",
-                )
-        return page_count
-    def extract_metadata(self, document_path, mime_type):
-        result = []
-        if mime_type == "application/pdf":
-            import pikepdf
-            namespace_pattern = re.compile(r"\{(.*)\}(.*)")
-            pdf = pikepdf.open(document_path)
-            meta = pdf.open_metadata()
-            for key, value in meta.items():
-                if isinstance(value, list):
-                    value = " ".join([str(e) for e in value])
-                value = str(value)
-                try:
-                    m = namespace_pattern.match(key)
-                    if m is None:  # pragma: no cover
-                        continue
-                    namespace = m.group(1)
-                    key_value = m.group(2)
-                    try:
-                        namespace.encode("utf-8")
-                        key_value.encode("utf-8")
-                    except UnicodeEncodeError as e:  # pragma: no cover
-                        self.log.debug(f"Skipping metadata key {key}: {e}")
-                        continue
-                    result.append(
-                        {
-                            "namespace": namespace,
-                            "prefix": meta.REVERSE_NS[namespace],
-                            "key": key_value,
-                            "value": value,
-                        },
-                    )
-                except Exception as e:
-                    self.log.warning(
-                        f"Error while reading metadata {key}: {value}. Error: {e}",
-                    )
-        return result
-    def get_thumbnail(self, document_path, mime_type, file_name=None):
+    name: str = "Paperless-ngx Tesseract OCR Parser"
+    version: str = __full_version_str__
+    author: str = "Paperless-ngx Contributors"
+    url: str = "https://github.com/paperless-ngx/paperless-ngx"
+    # ------------------------------------------------------------------
+    # Class methods
+    # ------------------------------------------------------------------
+    @classmethod
+    def supported_mime_types(cls) -> dict[str, str]:
+        return _SUPPORTED_MIME_TYPES
+    @classmethod
+    def score(
+        cls,
+        mime_type: str,
+        filename: str,
+        path: Path | None = None,
+    ) -> int | None:
+        if mime_type in _SUPPORTED_MIME_TYPES:
+            return 10
+        return None
+    # ------------------------------------------------------------------
+    # Properties
+    # ------------------------------------------------------------------
+    @property
+    def can_produce_archive(self) -> bool:
+        return True
+    @property
+    def requires_pdf_rendition(self) -> bool:
+        return False
+    # ------------------------------------------------------------------
    # Lifecycle
    # ------------------------------------------------------------------
    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self.tempdir = Path(
            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR),
        )
        self.settings = OcrConfig()
        self.archive_path: Path | None = None
        self.text: str | None = None
        self.date: datetime.datetime | None = None
        self.log = logger
    def __enter__(self) -> Self:
        return self
    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_val: BaseException | None,
        exc_tb: TracebackType | None,
    ) -> None:
        logger.debug("Cleaning up temporary directory %s", self.tempdir)
        shutil.rmtree(self.tempdir, ignore_errors=True)
    # ------------------------------------------------------------------
    # Core parsing interface
    # ------------------------------------------------------------------
    def configure(self, context: ParserContext) -> None:
        pass
    # ------------------------------------------------------------------
    # Result accessors
    # ------------------------------------------------------------------
    def get_text(self) -> str | None:
        return self.text
    def get_date(self) -> datetime.datetime | None:
        return self.date
    def get_archive_path(self) -> Path | None:
        return self.archive_path
    # ------------------------------------------------------------------
    # Thumbnail, page count, and metadata
    # ------------------------------------------------------------------
    def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
        return make_thumbnail_from_pdf(
-            self.archive_path or document_path,
+            self.archive_path or Path(document_path),
            self.tempdir,
            self.logging_group,
        )
+    def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
+        if mime_type == "application/pdf":
+            from paperless.parsers.utils import get_page_count_for_pdf
+            return get_page_count_for_pdf(Path(document_path), log=self.log)
+        return None
+    def extract_metadata(
+        self,
+        document_path: Path,
+        mime_type: str,
+    ) -> list[MetadataEntry]:
+        if mime_type != "application/pdf":
+            return []
+        from paperless.parsers.utils import extract_pdf_metadata
+        return extract_pdf_metadata(Path(document_path), log=self.log)
-    def is_image(self, mime_type) -> bool:
+    def is_image(self, mime_type: str) -> bool:
        return mime_type in [
            "image/png",
            "image/jpeg",
@@ -111,25 +182,25 @@ class RasterisedDocumentParser(DocumentParser):
            "image/heic",
        ]
-    def has_alpha(self, image) -> bool:
+    def has_alpha(self, image: Path) -> bool:
        with Image.open(image) as im:
            return im.mode in ("RGBA", "LA")
-    def remove_alpha(self, image_path: str) -> Path:
+    def remove_alpha(self, image_path: Path) -> Path:
        no_alpha_image = Path(self.tempdir) / "image-no-alpha"
        run_subprocess(
            [
                settings.CONVERT_BINARY,
                "-alpha",
                "off",
-                image_path,
+                str(image_path),
-                no_alpha_image,
+                str(no_alpha_image),
            ],
            logger=self.log,
        )
        return no_alpha_image
-    def get_dpi(self, image) -> int | None:
+    def get_dpi(self, image: Path) -> int | None:
        try:
            with Image.open(image) as im:
                x, _ = im.info["dpi"]
@@ -138,7 +209,7 @@ class RasterisedDocumentParser(DocumentParser):
            self.log.warning(f"Error while getting DPI from image {image}: {e}")
            return None
-    def calculate_a4_dpi(self, image) -> int | None:
+    def calculate_a4_dpi(self, image: Path) -> int | None:
        try:
            with Image.open(image) as im:
                width, _ = im.size
@@ -156,6 +227,7 @@ class RasterisedDocumentParser(DocumentParser):
        sidecar_file: Path | None,
        pdf_file: Path,
    ) -> str | None:
        text: str | None = None
        # When re-doing OCR, the sidecar contains ONLY the new text, not
        # the whole text, so do not utilize it in that case
        if (
@@ -163,7 +235,7 @@ class RasterisedDocumentParser(DocumentParser):
            and sidecar_file.is_file()
            and self.settings.mode != "redo"
        ):
-            text = self.read_file_handle_unicode_errors(sidecar_file)
+            text = read_file_handle_unicode_errors(sidecar_file)
            if "[OCR skipped on page" not in text:
                # This happens when there's already text in the input file.
@@ -191,12 +263,12 @@ class RasterisedDocumentParser(DocumentParser):
                        "-layout",
                        "-enc",
                        "UTF-8",
-                        pdf_file,
+                        str(pdf_file),
                        tmp.name,
                    ],
                    logger=self.log,
                )
-                text = self.read_file_handle_unicode_errors(Path(tmp.name))
+                text = read_file_handle_unicode_errors(Path(tmp.name))
            return post_process_text(text)
@@ -211,16 +283,14 @@ class RasterisedDocumentParser(DocumentParser):
    def construct_ocrmypdf_parameters(
        self,
-        input_file,
+        input_file: Path,
-        mime_type,
+        mime_type: str,
-        output_file,
+        output_file: Path,
-        sidecar_file,
+        sidecar_file: Path,
        *,
-        safe_fallback=False,
+        safe_fallback: bool = False,
-    ):
+    ) -> dict[str, Any]:
-        if TYPE_CHECKING:
-            assert isinstance(self.settings, OcrConfig)
-        ocrmypdf_args = {
+        ocrmypdf_args: dict[str, Any] = {
            "input_file_or_options": input_file,
            "output_file": output_file,
            # need to use threads, since this will be run in daemonized
@@ -330,7 +400,13 @@ class RasterisedDocumentParser(DocumentParser):
        return ocrmypdf_args
-    def parse(self, document_path: Path, mime_type, file_name=None) -> None:
+    def parse(
        self,
        document_path: Path,
        mime_type: str,
        *,
        produce_archive: bool = True,
    ) -> None:
        # This forces tesseract to use one core per page.
        os.environ["OMP_THREAD_LIMIT"] = "1"
        VALID_TEXT_LENGTH = 50
@@ -458,7 +534,7 @@ class RasterisedDocumentParser(DocumentParser):
                self.text = ""
-def post_process_text(text):
+def post_process_text(text: str | None) -> str | None:
    if not text:
        return None
--- a/src/paperless/parsers/text.py
+++ b/src/paperless/parsers/text.py
@@ -27,6 +27,7 @@ if TYPE_CHECKING:
    from types import TracebackType
    from paperless.parsers import MetadataEntry
    from paperless.parsers import ParserContext
 logger = logging.getLogger("paperless.parsing.text")
@@ -156,6 +157,9 @@ class TextDocumentParser:
    # Core parsing interface
    # ------------------------------------------------------------------
    def configure(self, context: ParserContext) -> None:
        pass
    def parse(
        self,
        document_path: Path,
--- a/src/paperless/parsers/tika.py
+++ b/src/paperless/parsers/tika.py
@@ -35,6 +35,7 @@ if TYPE_CHECKING:
    from types import TracebackType
    from paperless.parsers import MetadataEntry
    from paperless.parsers import ParserContext
 logger = logging.getLogger("paperless.parsing.tika")
@@ -205,6 +206,9 @@ class TikaDocumentParser:
    # Core parsing interface
    # ------------------------------------------------------------------
    def configure(self, context: ParserContext) -> None:
        pass
    def parse(
        self,
        document_path: Path,
@@ -340,11 +344,19 @@ class TikaDocumentParser:
    ) -> int | None:
        """Return the number of pages in the document.
        Counts pages in the archive PDF produced by a preceding parse()
        call.  Returns ``None`` if parse() has not been called yet or if
        no archive was produced.
        Returns
        -------
        int | None
-            Always None — page count is not available from Tika.
+            Page count of the archive PDF, or ``None``.
        """
        if self._archive_path is not None:
            from paperless.parsers.utils import get_page_count_for_pdf
            return get_page_count_for_pdf(self._archive_path, log=logger)
        return None
    def extract_metadata(
--- a/src/paperless/parsers/utils.py
+++ b/src/paperless/parsers/utils.py
@@ -20,6 +20,34 @@ if TYPE_CHECKING:
 logger = logging.getLogger("paperless.parsers.utils")
 def read_file_handle_unicode_errors(
    filepath: Path,
    log: logging.Logger | None = None,
 ) -> str:
    """Read a file as UTF-8 text, replacing invalid bytes rather than raising.
    Parameters
    ----------
    filepath:
        Absolute path to the file to read.
    log:
        Logger to use for warnings.  Falls back to the module-level logger
        when omitted.
    Returns
    -------
    str
        File content as a string, with any invalid UTF-8 sequences replaced
        by the Unicode replacement character.
    """
    _log = log or logger
    try:
        return filepath.read_text(encoding="utf-8")
    except UnicodeDecodeError as e:
        _log.warning("Unicode error during text reading, continuing: %s", e)
        return filepath.read_bytes().decode("utf-8", errors="replace")
 def get_page_count_for_pdf(
    document_path: Path,
    log: logging.Logger | None = None,
--- a/src/paperless/tests/parsers/conftest.py
+++ b/src/paperless/tests/parsers/conftest.py
@@ -6,19 +6,29 @@ so it is easy to see which files belong to which test module.
 from __future__ import annotations
 from contextlib import contextmanager
 from typing import TYPE_CHECKING
 import pytest
 from django.test import override_settings
 from paperless.parsers.mail import MailDocumentParser
 from paperless.parsers.remote import RemoteDocumentParser
 from paperless.parsers.tesseract import RasterisedDocumentParser
 from paperless.parsers.text import TextDocumentParser
 from paperless.parsers.tika import TikaDocumentParser
 if TYPE_CHECKING:
    from collections.abc import Callable
    from collections.abc import Generator
    from pathlib import Path
    from unittest.mock import MagicMock
    from pytest_django.fixtures import SettingsWrapper
    from pytest_mock import MockerFixture
    #: Type for the ``make_tesseract_parser`` fixture factory.
    MakeTesseractParser = Callable[..., Generator[RasterisedDocumentParser, None, None]]
 # ------------------------------------------------------------------
@@ -247,3 +257,544 @@ def tika_parser() -> Generator[TikaDocumentParser, None, None]:
    """
    with TikaDocumentParser() as parser:
        yield parser
 # ------------------------------------------------------------------
 # Mail parser sample files
 # ------------------------------------------------------------------
@pytest.fixture(scope="session")
 def mail_samples_dir(samples_dir: Path) -> Path:
    """Absolute path to the mail parser sample files directory.
    Returns
    -------
    Path
        ``<samples_dir>/mail/``
    """
    return samples_dir / "mail"
@pytest.fixture(scope="session")
 def broken_email_file(mail_samples_dir: Path) -> Path:
    """Path to a broken/malformed EML sample file.
    Returns
    -------
    Path
        Absolute path to ``mail/broken.eml``.
    """
    return mail_samples_dir / "broken.eml"
@pytest.fixture(scope="session")
 def simple_txt_email_file(mail_samples_dir: Path) -> Path:
    """Path to a plain-text email sample file.
    Returns
    -------
    Path
        Absolute path to ``mail/simple_text.eml``.
    """
    return mail_samples_dir / "simple_text.eml"
@pytest.fixture(scope="session")
 def simple_txt_email_pdf_file(mail_samples_dir: Path) -> Path:
    """Path to the expected PDF rendition of the plain-text email.
    Returns
    -------
    Path
        Absolute path to ``mail/simple_text.eml.pdf``.
    """
    return mail_samples_dir / "simple_text.eml.pdf"
@pytest.fixture(scope="session")
 def simple_txt_email_thumbnail_file(mail_samples_dir: Path) -> Path:
    """Path to the expected thumbnail for the plain-text email.
    Returns
    -------
    Path
        Absolute path to ``mail/simple_text.eml.pdf.webp``.
    """
    return mail_samples_dir / "simple_text.eml.pdf.webp"
@pytest.fixture(scope="session")
 def html_email_file(mail_samples_dir: Path) -> Path:
    """Path to an HTML email sample file.
    Returns
    -------
    Path
        Absolute path to ``mail/html.eml``.
    """
    return mail_samples_dir / "html.eml"
@pytest.fixture(scope="session")
 def html_email_pdf_file(mail_samples_dir: Path) -> Path:
    """Path to the expected PDF rendition of the HTML email.
    Returns
    -------
    Path
        Absolute path to ``mail/html.eml.pdf``.
    """
    return mail_samples_dir / "html.eml.pdf"
@pytest.fixture(scope="session")
 def html_email_thumbnail_file(mail_samples_dir: Path) -> Path:
    """Path to the expected thumbnail for the HTML email.
    Returns
    -------
    Path
        Absolute path to ``mail/html.eml.pdf.webp``.
    """
    return mail_samples_dir / "html.eml.pdf.webp"
@pytest.fixture(scope="session")
 def html_email_html_file(mail_samples_dir: Path) -> Path:
    """Path to the HTML body of the HTML email sample.
    Returns
    -------
    Path
        Absolute path to ``mail/html.eml.html``.
    """
    return mail_samples_dir / "html.eml.html"
@pytest.fixture(scope="session")
 def merged_pdf_first(mail_samples_dir: Path) -> Path:
    """Path to the first PDF used in PDF-merge tests.
    Returns
    -------
    Path
        Absolute path to ``mail/first.pdf``.
    """
    return mail_samples_dir / "first.pdf"
@pytest.fixture(scope="session")
 def merged_pdf_second(mail_samples_dir: Path) -> Path:
    """Path to the second PDF used in PDF-merge tests.
    Returns
    -------
    Path
        Absolute path to ``mail/second.pdf``.
    """
    return mail_samples_dir / "second.pdf"
 # ------------------------------------------------------------------
 # Mail parser instance
 # ------------------------------------------------------------------
@pytest.fixture()
 def mail_parser() -> Generator[MailDocumentParser, None, None]:
    """Yield a MailDocumentParser and clean up its temporary directory afterwards.
    Yields
    ------
    MailDocumentParser
        A ready-to-use parser instance.
    """
    with MailDocumentParser() as parser:
        yield parser
@pytest.fixture(scope="session")
 def nginx_base_url() -> Generator[str, None, None]:
    """
    The base URL for the nginx HTTP server we expect to be alive
    """
    yield "http://localhost:8080"
 # ------------------------------------------------------------------
 # Tesseract parser sample files
 # ------------------------------------------------------------------
@pytest.fixture(scope="session")
 def tesseract_samples_dir(samples_dir: Path) -> Path:
    """Absolute path to the tesseract parser sample files directory.
    Returns
    -------
    Path
        ``<samples_dir>/tesseract/``
    """
    return samples_dir / "tesseract"
@pytest.fixture(scope="session")
 def document_webp_file(tesseract_samples_dir: Path) -> Path:
    """Path to a WebP document sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/document.webp``.
    """
    return tesseract_samples_dir / "document.webp"
@pytest.fixture(scope="session")
 def encrypted_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to an encrypted PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/encrypted.pdf``.
    """
    return tesseract_samples_dir / "encrypted.pdf"
@pytest.fixture(scope="session")
 def multi_page_digital_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a multi-page digital PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/multi-page-digital.pdf``.
    """
    return tesseract_samples_dir / "multi-page-digital.pdf"
@pytest.fixture(scope="session")
 def multi_page_images_alpha_rgb_tiff_file(tesseract_samples_dir: Path) -> Path:
    """Path to a multi-page TIFF with alpha channel in RGB.
    Returns
    -------
    Path
        Absolute path to ``tesseract/multi-page-images-alpha-rgb.tiff``.
    """
    return tesseract_samples_dir / "multi-page-images-alpha-rgb.tiff"
@pytest.fixture(scope="session")
 def multi_page_images_alpha_tiff_file(tesseract_samples_dir: Path) -> Path:
    """Path to a multi-page TIFF with alpha channel.
    Returns
    -------
    Path
        Absolute path to ``tesseract/multi-page-images-alpha.tiff``.
    """
    return tesseract_samples_dir / "multi-page-images-alpha.tiff"
@pytest.fixture(scope="session")
 def multi_page_images_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a multi-page PDF with images.
    Returns
    -------
    Path
        Absolute path to ``tesseract/multi-page-images.pdf``.
    """
    return tesseract_samples_dir / "multi-page-images.pdf"
@pytest.fixture(scope="session")
 def multi_page_images_tiff_file(tesseract_samples_dir: Path) -> Path:
    """Path to a multi-page TIFF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/multi-page-images.tiff``.
    """
    return tesseract_samples_dir / "multi-page-images.tiff"
@pytest.fixture(scope="session")
 def multi_page_mixed_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a multi-page mixed PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/multi-page-mixed.pdf``.
    """
    return tesseract_samples_dir / "multi-page-mixed.pdf"
@pytest.fixture(scope="session")
 def no_text_alpha_png_file(tesseract_samples_dir: Path) -> Path:
    """Path to a PNG with alpha channel and no text.
    Returns
    -------
    Path
        Absolute path to ``tesseract/no-text-alpha.png``.
    """
    return tesseract_samples_dir / "no-text-alpha.png"
@pytest.fixture(scope="session")
 def rotated_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a rotated PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/rotated.pdf``.
    """
    return tesseract_samples_dir / "rotated.pdf"
@pytest.fixture(scope="session")
 def rtl_test_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to an RTL test PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/rtl-test.pdf``.
    """
    return tesseract_samples_dir / "rtl-test.pdf"
@pytest.fixture(scope="session")
 def signed_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a signed PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/signed.pdf``.
    """
    return tesseract_samples_dir / "signed.pdf"
@pytest.fixture(scope="session")
 def simple_alpha_png_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple PNG with alpha channel.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple-alpha.png``.
    """
    return tesseract_samples_dir / "simple-alpha.png"
@pytest.fixture(scope="session")
 def simple_digital_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple digital PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple-digital.pdf``.
    """
    return tesseract_samples_dir / "simple-digital.pdf"
@pytest.fixture(scope="session")
 def simple_no_dpi_png_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple PNG without DPI information.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple-no-dpi.png``.
    """
    return tesseract_samples_dir / "simple-no-dpi.png"
@pytest.fixture(scope="session")
 def simple_bmp_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple BMP sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple.bmp``.
    """
    return tesseract_samples_dir / "simple.bmp"
@pytest.fixture(scope="session")
 def simple_gif_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple GIF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple.gif``.
    """
    return tesseract_samples_dir / "simple.gif"
@pytest.fixture(scope="session")
 def simple_heic_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple HEIC sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple.heic``.
    """
    return tesseract_samples_dir / "simple.heic"
@pytest.fixture(scope="session")
 def simple_jpg_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple JPG sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple.jpg``.
    """
    return tesseract_samples_dir / "simple.jpg"
@pytest.fixture(scope="session")
 def simple_png_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple PNG sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple.png``.
    """
    return tesseract_samples_dir / "simple.png"
@pytest.fixture(scope="session")
 def simple_tif_file(tesseract_samples_dir: Path) -> Path:
    """Path to a simple TIF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/simple.tif``.
    """
    return tesseract_samples_dir / "simple.tif"
@pytest.fixture(scope="session")
 def single_page_mixed_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a single-page mixed PDF sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/single-page-mixed.pdf``.
    """
    return tesseract_samples_dir / "single-page-mixed.pdf"
@pytest.fixture(scope="session")
 def with_form_pdf_file(tesseract_samples_dir: Path) -> Path:
    """Path to a PDF with form sample file.
    Returns
    -------
    Path
        Absolute path to ``tesseract/with-form.pdf``.
    """
    return tesseract_samples_dir / "with-form.pdf"
 # ------------------------------------------------------------------
 # Tesseract parser instance and settings helpers
 # ------------------------------------------------------------------
@pytest.fixture()
 def null_app_config(mocker: MockerFixture) -> MagicMock:
    """Return a MagicMock with all OcrConfig fields set to None.
    This allows the parser to fall back to Django settings instead of
    hitting the database.
    Returns
    -------
    MagicMock
        Mock config with all fields as None
    """
    return mocker.MagicMock(
        output_type=None,
        pages=None,
        language=None,
        mode=None,
        skip_archive_file=None,
        image_dpi=None,
        unpaper_clean=None,
        deskew=None,
        rotate_pages=None,
        rotate_pages_threshold=None,
        max_image_pixels=None,
        color_conversion_strategy=None,
        user_args=None,
    )
@pytest.fixture()
 def tesseract_parser(
    mocker: MockerFixture,
    null_app_config: MagicMock,
 ) -> Generator[RasterisedDocumentParser, None, None]:
    """Yield a RasterisedDocumentParser and clean up its temporary directory afterwards.
    Patches the config system to avoid database access.
    Yields
    ------
    RasterisedDocumentParser
        A ready-to-use parser instance.
    """
    mocker.patch(
        "paperless.config.BaseConfig._get_config_instance",
        return_value=null_app_config,
    )
    with RasterisedDocumentParser() as parser:
        yield parser
@pytest.fixture()
 def make_tesseract_parser(
    mocker: MockerFixture,
    null_app_config: MagicMock,
 ) -> MakeTesseractParser:
    """Return a factory for creating RasterisedDocumentParser with Django settings overrides.
    This fixture is useful for tests that need to create parsers with different
    settings configurations.
    Returns
    -------
    Callable[..., AbstractContextManager[RasterisedDocumentParser]]
        A context manager factory that accepts Django settings overrides
    """
    mocker.patch(
        "paperless.config.BaseConfig._get_config_instance",
        return_value=null_app_config,
    )
    @contextmanager
    def _make_parser(**django_settings_overrides):
        with override_settings(**django_settings_overrides):
            with RasterisedDocumentParser() as parser:
                yield parser
    return _make_parser
--- a/src/paperless/tests/parsers/test_mail_parser.py
+++ b/src/paperless/tests/parsers/test_mail_parser.py
@@ -12,7 +12,64 @@ from pytest_httpx import HTTPXMock
 from pytest_mock import MockerFixture
 from documents.parsers import ParseError
-from paperless_mail.parsers import MailDocumentParser
+from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.mail import MailDocumentParser
 class TestMailParserProtocol:
    """Verify that MailDocumentParser satisfies the ParserProtocol contract."""
    def test_isinstance_satisfies_protocol(
        self,
        mail_parser: MailDocumentParser,
    ) -> None:
        assert isinstance(mail_parser, ParserProtocol)
    def test_supported_mime_types(self) -> None:
        mime_types = MailDocumentParser.supported_mime_types()
        assert isinstance(mime_types, dict)
        assert "message/rfc822" in mime_types
    @pytest.mark.parametrize(
        ("mime_type", "expected"),
        [
            ("message/rfc822", 10),
            ("application/pdf", None),
            ("text/plain", None),
        ],
    )
    def test_score(self, mime_type: str, expected: int | None) -> None:
        assert MailDocumentParser.score(mime_type, "email.eml") == expected
    def test_can_produce_archive_is_false(
        self,
        mail_parser: MailDocumentParser,
    ) -> None:
        assert mail_parser.can_produce_archive is False
    def test_requires_pdf_rendition_is_true(
        self,
        mail_parser: MailDocumentParser,
    ) -> None:
        assert mail_parser.requires_pdf_rendition is True
    def test_get_page_count_returns_none_without_archive(
        self,
        mail_parser: MailDocumentParser,
        html_email_file: Path,
    ) -> None:
        assert mail_parser.get_page_count(html_email_file, "message/rfc822") is None
    def test_get_page_count_returns_int_with_pdf_archive(
        self,
        mail_parser: MailDocumentParser,
        simple_txt_email_pdf_file: Path,
    ) -> None:
        mail_parser._archive_path = simple_txt_email_pdf_file
        count = mail_parser.get_page_count(simple_txt_email_pdf_file, "message/rfc822")
        assert isinstance(count, int)
        assert count > 0
 class TestEmailFileParsing:
@@ -24,7 +81,7 @@ class TestEmailFileParsing:
    def test_parse_error_missing_file(
        self,
        mail_parser: MailDocumentParser,
-        sample_dir: Path,
+        mail_samples_dir: Path,
    ) -> None:
        """
        GIVEN:
@@ -35,7 +92,7 @@ class TestEmailFileParsing:
            - An Exception is thrown
        """
        # Check if exception is raised when parsing fails.
-        test_file = sample_dir / "doesntexist.eml"
+        test_file = mail_samples_dir / "doesntexist.eml"
        assert not test_file.exists()
@@ -246,12 +303,12 @@ class TestEmailThumbnailGenerate:
        """
        mocked_return = "Passing the return value through.."
        mock_make_thumbnail_from_pdf = mocker.patch(
-            "paperless_mail.parsers.make_thumbnail_from_pdf",
+            "paperless.parsers.mail.make_thumbnail_from_pdf",
        )
        mock_make_thumbnail_from_pdf.return_value = mocked_return
        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )
        mock_generate_pdf.return_value = "Mocked return value.."
@@ -260,8 +317,7 @@ class TestEmailThumbnailGenerate:
        mock_generate_pdf.assert_called_once()
        mock_make_thumbnail_from_pdf.assert_called_once_with(
            "Mocked return value..",
-            mail_parser.tempdir,
+            mail_parser._tempdir,
            None,
        )
        assert mocked_return == thumb
@@ -373,7 +429,7 @@ class TestParser:
        """
        # Validate parsing returns the expected results
        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )
        mail_parser.parse(simple_txt_email_file, "message/rfc822")
@@ -385,7 +441,7 @@ class TestParser:
            "BCC: fdf@fvf.de\n\n"
            "\n\nThis is just a simple Text Mail."
        )
-        assert text_expected == mail_parser.text
+        assert text_expected == mail_parser.get_text()
        assert (
            datetime.datetime(
                2022,
@@ -396,7 +452,7 @@ class TestParser:
                43,
                tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)),
            )
-            == mail_parser.date
+            == mail_parser.get_date()
        )
        # Just check if tried to generate archive, the unittest for generate_pdf() goes deeper.
@@ -419,7 +475,7 @@ class TestParser:
        """
        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )
        # Validate parsing returns the expected results
@@ -443,7 +499,7 @@ class TestParser:
        mail_parser.parse(html_email_file, "message/rfc822")
        mock_generate_pdf.assert_called_once()
-        assert text_expected == mail_parser.text
+        assert text_expected == mail_parser.get_text()
        assert (
            datetime.datetime(
                2022,
@@ -454,7 +510,7 @@ class TestParser:
                19,
                tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)),
            )
-            == mail_parser.date
+            == mail_parser.get_date()
        )
    def test_generate_pdf_parse_error(
@@ -501,7 +557,7 @@ class TestParser:
        mail_parser.parse(simple_txt_email_file, "message/rfc822")
-        assert mail_parser.archive_path is not None
+        assert mail_parser.get_archive_path() is not None
    @pytest.mark.httpx_mock(can_send_already_matched_responses=True)
    def test_generate_pdf_html_email(
@@ -542,7 +598,7 @@ class TestParser:
        )
        mail_parser.parse(html_email_file, "message/rfc822")
-        assert mail_parser.archive_path is not None
+        assert mail_parser.get_archive_path() is not None
    def test_generate_pdf_html_email_html_to_pdf_failure(
        self,
@@ -712,10 +768,10 @@ class TestParser:
        def test_layout_option(layout_option, expected_calls, expected_pdf_names):
            mock_mailrule_get.return_value = mock.Mock(pdf_layout=layout_option)
+            mail_parser.configure(ParserContext(mailrule_id=1))
            mail_parser.parse(
                document_path=html_email_file,
                mime_type="message/rfc822",
-                mailrule_id=1,
            )
            args, _ = mock_merge_route.call_args
            assert len(args[0]) == expected_calls
--- a/src/paperless/tests/parsers/test_mail_parser_live.py
+++ b/src/paperless/tests/parsers/test_mail_parser_live.py
@@ -11,7 +11,7 @@ from PIL import Image
 from pytest_mock import MockerFixture
 from documents.tests.utils import util_call_with_backoff
-from paperless_mail.parsers import MailDocumentParser
+from paperless.parsers.mail import MailDocumentParser
 def extract_text(pdf_path: Path) -> str:
@@ -159,7 +159,7 @@ class TestParserLive:
            - The returned thumbnail image file shall match the expected hash
        """
        mock_generate_pdf = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf",
        )
        mock_generate_pdf.return_value = simple_txt_email_pdf_file
@@ -216,10 +216,10 @@ class TestParserLive:
            - The merged PDF shall contain text from both source PDFs
        """
        mock_generate_pdf_from_html = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf_from_html",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf_from_html",
        )
        mock_generate_pdf_from_mail = mocker.patch(
-            "paperless_mail.parsers.MailDocumentParser.generate_pdf_from_mail",
+            "paperless.parsers.mail.MailDocumentParser.generate_pdf_from_mail",
        )
        mock_generate_pdf_from_mail.return_value = merged_pdf_first
        mock_generate_pdf_from_html.return_value = merged_pdf_second
--- a/src/paperless/tests/parsers/test_remote_parser.py
+++ b/src/paperless/tests/parsers/test_remote_parser.py
@@ -20,6 +20,7 @@ from unittest.mock import Mock
 import pytest
 from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.remote import RemoteDocumentParser
@@ -302,6 +303,7 @@ class TestRemoteParserParse:
        sample_pdf_file: Path,
        azure_client: Mock,
    ) -> None:
        remote_parser.configure(ParserContext())
        remote_parser.parse(sample_pdf_file, "application/pdf")
        azure_client.close.assert_called_once()
@@ -479,12 +481,17 @@ class TestRemoteParserRegistry:
        assert parser_cls is RemoteDocumentParser
    @pytest.mark.usefixtures("no_engine_settings")
-    def test_get_parser_returns_none_for_pdf_when_not_configured(self) -> None:
-        """With no tesseract parser registered yet, PDF has no handler if remote is off."""
+    def test_get_parser_returns_none_for_unsupported_type_when_not_configured(
+        self,
    ) -> None:
        """With remote off and a truly unsupported MIME type, registry returns None."""
        from paperless.parsers.registry import ParserRegistry
        registry = ParserRegistry()
        registry.register_defaults()
-        parser_cls = registry.get_parser_for_file("application/pdf", "doc.pdf")
+        parser_cls = registry.get_parser_for_file(
            "application/x-unknown-format",
            "doc.xyz",
        )
        assert parser_cls is None
--- a/src/paperless/tests/parsers/test_tesseract_custom_settings.py
+++ b/src/paperless/tests/parsers/test_tesseract_custom_settings.py
@@ -10,7 +10,7 @@ from paperless.models import CleanChoices
 from paperless.models import ColorConvertChoices
 from paperless.models import ModeChoices
 from paperless.models import OutputTypeChoices
-from paperless_tesseract.parsers import RasterisedDocumentParser
+from paperless.parsers.tesseract import RasterisedDocumentParser
 class TestParserSettingsFromDb(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
--- a/src/paperless/tests/parsers/test_tesseract_parser.py
+++ b/src/paperless/tests/parsers/test_tesseract_parser.py
--- a/src/paperless/tests/parsers/test_text_parser.py
+++ b/src/paperless/tests/parsers/test_text_parser.py
@@ -12,6 +12,7 @@ from pathlib import Path
 import pytest
 from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.text import TextDocumentParser
@@ -93,6 +94,7 @@ class TestTextParserParse:
        text_parser: TextDocumentParser,
        sample_txt_file: Path,
    ) -> None:
        text_parser.configure(ParserContext())
        text_parser.parse(sample_txt_file, "text/plain")
        assert text_parser.get_text() == "This is a test file.\n"
@@ -102,6 +104,7 @@ class TestTextParserParse:
        text_parser: TextDocumentParser,
        sample_txt_file: Path,
    ) -> None:
        text_parser.configure(ParserContext())
        text_parser.parse(sample_txt_file, "text/plain")
        assert text_parser.get_archive_path() is None
@@ -111,6 +114,7 @@ class TestTextParserParse:
        text_parser: TextDocumentParser,
        sample_txt_file: Path,
    ) -> None:
        text_parser.configure(ParserContext())
        text_parser.parse(sample_txt_file, "text/plain")
        assert text_parser.get_date() is None
@@ -129,6 +133,7 @@ class TestTextParserParse:
            - Parsing succeeds
            - Invalid bytes are replaced with the Unicode replacement character
        """
        text_parser.configure(ParserContext())
        text_parser.parse(malformed_txt_file, "text/plain")
        assert text_parser.get_text() == "Pantothens\ufffdure\n"
@@ -251,6 +256,9 @@ class TestTextParserRegistry:
        from paperless.parsers.registry import get_parser_registry
        registry = get_parser_registry()
-        parser_cls = registry.get_parser_for_file("application/pdf", "doc.pdf")
+        parser_cls = registry.get_parser_for_file(
            "application/x-unknown-format",
            "doc.xyz",
        )
        assert parser_cls is None
--- a/src/paperless/tests/parsers/test_tika_parser.py
+++ b/src/paperless/tests/parsers/test_tika_parser.py
@@ -9,6 +9,7 @@ from pytest_django.fixtures import SettingsWrapper
 from pytest_httpx import HTTPXMock
 from documents.parsers import ParseError
 from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.tika import TikaDocumentParser
@@ -60,6 +61,29 @@ class TestTikaParserRegistryInterface:
    def test_requires_pdf_rendition_is_true(self) -> None:
        assert TikaDocumentParser().requires_pdf_rendition is True
    def test_get_page_count_returns_none_without_archive(
        self,
        tika_parser: TikaDocumentParser,
        sample_odt_file: Path,
    ) -> None:
        assert (
            tika_parser.get_page_count(
                sample_odt_file,
                "application/vnd.oasis.opendocument.text",
            )
            is None
        )
    def test_get_page_count_returns_int_with_pdf_archive(
        self,
        tika_parser: TikaDocumentParser,
        sample_pdf_file: Path,
    ) -> None:
        tika_parser._archive_path = sample_pdf_file
        count = tika_parser.get_page_count(sample_pdf_file, "application/pdf")
        assert isinstance(count, int)
        assert count > 0
@pytest.mark.django_db()
 class TestTikaParser:
@@ -83,6 +107,7 @@ class TestTikaParser:
        # Pretend convert to PDF response
        httpx_mock.add_response(content=b"PDF document")
        tika_parser.configure(ParserContext())
        tika_parser.parse(sample_odt_file, "application/vnd.oasis.opendocument.text")
        assert tika_parser.get_text() == "the content"
--- a/src/paperless/tests/samples/mail/broken.eml
+++ b/src/paperless/tests/samples/mail/broken.eml
--- a/src/paperless/tests/samples/mail/first.pdf
+++ b/src/paperless/tests/samples/mail/first.pdf
--- a/src/paperless/tests/samples/mail/html.eml
+++ b/src/paperless/tests/samples/mail/html.eml
--- a/src/paperless/tests/samples/mail/html.eml.html
+++ b/src/paperless/tests/samples/mail/html.eml.html
--- a/src/paperless/tests/samples/mail/html.eml.pdf
+++ b/src/paperless/tests/samples/mail/html.eml.pdf
--- a/src/paperless/tests/samples/mail/html.eml.pdf.webp
+++ b/src/paperless/tests/samples/mail/html.eml.pdf.webp
--- a/src/paperless/tests/samples/mail/sample.html
+++ b/src/paperless/tests/samples/mail/sample.html
--- a/src/paperless/tests/samples/mail/sample.html.pdf
+++ b/src/paperless/tests/samples/mail/sample.html.pdf
--- a/src/paperless/tests/samples/mail/sample.html.pdf.webp
+++ b/src/paperless/tests/samples/mail/sample.html.pdf.webp
--- a/src/paperless/tests/samples/mail/sample.png
+++ b/src/paperless/tests/samples/mail/sample.png
--- a/src/paperless/tests/samples/mail/second.pdf
+++ b/src/paperless/tests/samples/mail/second.pdf
--- a/src/paperless/tests/samples/mail/simple_text.eml
+++ b/src/paperless/tests/samples/mail/simple_text.eml
--- a/src/paperless/tests/samples/mail/simple_text.eml.pdf
+++ b/src/paperless/tests/samples/mail/simple_text.eml.pdf
--- a/src/paperless/tests/samples/mail/simple_text.eml.pdf.webp
+++ b/src/paperless/tests/samples/mail/simple_text.eml.pdf.webp
--- a/src/paperless/tests/samples/tesseract/document.webp
+++ b/src/paperless/tests/samples/tesseract/document.webp
--- a/src/paperless/tests/samples/tesseract/encrypted.pdf
+++ b/src/paperless/tests/samples/tesseract/encrypted.pdf
--- a/src/paperless/tests/samples/tesseract/multi-page-digital.pdf
+++ b/src/paperless/tests/samples/tesseract/multi-page-digital.pdf
--- a/src/paperless/tests/samples/tesseract/multi-page-images-alpha-rgb.tiff
+++ b/src/paperless/tests/samples/tesseract/multi-page-images-alpha-rgb.tiff
--- a/src/paperless/tests/samples/tesseract/multi-page-images-alpha.tiff
+++ b/src/paperless/tests/samples/tesseract/multi-page-images-alpha.tiff
--- a/src/paperless/tests/samples/tesseract/multi-page-images.pdf
+++ b/src/paperless/tests/samples/tesseract/multi-page-images.pdf
--- a/src/paperless/tests/samples/tesseract/multi-page-images.tiff
+++ b/src/paperless/tests/samples/tesseract/multi-page-images.tiff
--- a/src/paperless/tests/samples/tesseract/multi-page-mixed.pdf
+++ b/src/paperless/tests/samples/tesseract/multi-page-mixed.pdf
--- a/src/paperless/tests/samples/tesseract/no-text-alpha.png
+++ b/src/paperless/tests/samples/tesseract/no-text-alpha.png
--- a/src/paperless/tests/samples/tesseract/rotated.pdf
+++ b/src/paperless/tests/samples/tesseract/rotated.pdf
--- a/src/paperless/tests/samples/tesseract/rtl-test.pdf
+++ b/src/paperless/tests/samples/tesseract/rtl-test.pdf
--- a/src/paperless/tests/samples/tesseract/signed.pdf
+++ b/src/paperless/tests/samples/tesseract/signed.pdf
--- a/src/paperless/tests/samples/tesseract/simple-alpha.png
+++ b/src/paperless/tests/samples/tesseract/simple-alpha.png
--- a/src/paperless/tests/samples/tesseract/simple-digital.pdf
+++ b/src/paperless/tests/samples/tesseract/simple-digital.pdf
--- a/src/paperless/tests/samples/tesseract/simple-no-dpi.png
+++ b/src/paperless/tests/samples/tesseract/simple-no-dpi.png
--- a/src/paperless/tests/samples/tesseract/simple.bmp
+++ b/src/paperless/tests/samples/tesseract/simple.bmp
--- a/src/paperless/tests/samples/tesseract/simple.gif
+++ b/src/paperless/tests/samples/tesseract/simple.gif
--- a/src/paperless/tests/samples/tesseract/simple.heic
+++ b/src/paperless/tests/samples/tesseract/simple.heic
--- a/src/paperless/tests/samples/tesseract/simple.jpg
+++ b/src/paperless/tests/samples/tesseract/simple.jpg
--- a/src/paperless/tests/samples/tesseract/simple.png
+++ b/src/paperless/tests/samples/tesseract/simple.png
--- a/src/paperless/tests/samples/tesseract/simple.tif
+++ b/src/paperless/tests/samples/tesseract/simple.tif
--- a/src/paperless/tests/samples/tesseract/single-page-mixed.pdf
+++ b/src/paperless/tests/samples/tesseract/single-page-mixed.pdf
--- a/src/paperless/tests/samples/tesseract/with-form.pdf
+++ b/src/paperless/tests/samples/tesseract/with-form.pdf
--- a/src/paperless/tests/test_registry.py
+++ b/src/paperless/tests/test_registry.py
@@ -18,6 +18,7 @@ from unittest.mock import patch
 import pytest
 from paperless.parsers import ParserContext
 from paperless.parsers import ParserProtocol
 from paperless.parsers.registry import ParserRegistry
 from paperless.parsers.registry import get_parser_registry
@@ -103,6 +104,11 @@ def dummy_parser_cls() -> type:
        ) -> list:
            return []
        def configure(self, context: ParserContext) -> None:
            """
            Required to exist, but doesn't need to do anything
            """
        def __enter__(self) -> Self:
            return self
@@ -144,6 +150,7 @@ class TestParserProtocol:
    @pytest.mark.parametrize(
        "missing_method",
        [
            pytest.param("configure", id="missing-configure"),
            pytest.param("parse", id="missing-parse"),
            pytest.param("get_text", id="missing-get_text"),
            pytest.param("get_thumbnail", id="missing-get_thumbnail"),
--- a/src/paperless_mail/parsers.py
+++ b/src/paperless_mail/parsers.py
@@ -1,481 +0,0 @@
 import re
 from html import escape
 from pathlib import Path
 from bleach import clean
 from bleach import linkify
 from django.conf import settings
 from django.utils import timezone
 from django.utils.timezone import is_naive
 from django.utils.timezone import make_aware
 from gotenberg_client import GotenbergClient
 from gotenberg_client.constants import A4
 from gotenberg_client.options import Measurement
 from gotenberg_client.options import MeasurementUnitType
 from gotenberg_client.options import PageMarginsType
 from gotenberg_client.options import PdfAFormat
 from humanize import naturalsize
 from imap_tools import MailAttachment
 from imap_tools import MailMessage
 from tika_client import TikaClient
 from documents.parsers import DocumentParser
 from documents.parsers import ParseError
 from documents.parsers import make_thumbnail_from_pdf
 from paperless.models import OutputTypeChoices
 from paperless_mail.models import MailRule
 class MailDocumentParser(DocumentParser):
    """
    This parser uses imap_tools to parse .eml files, generates pdf using
    Gotenberg and sends the html part to a Tika server for text extraction.
    """
    logging_name = "paperless.parsing.mail"
    def _settings_to_gotenberg_pdfa(self) -> PdfAFormat | None:
        """
        Converts our requested PDF/A output into the Gotenberg API
        format
        """
        if settings.OCR_OUTPUT_TYPE in {
            OutputTypeChoices.PDF_A,
            OutputTypeChoices.PDF_A2,
        }:
            return PdfAFormat.A2b
        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A1:  # pragma: no cover
            self.log.warning(
                "Gotenberg does not support PDF/A-1a, choosing PDF/A-2b instead",
            )
            return PdfAFormat.A2b
        elif settings.OCR_OUTPUT_TYPE == OutputTypeChoices.PDF_A3:  # pragma: no cover
            return PdfAFormat.A3b
        return None
    def get_thumbnail(
        self,
        document_path: Path,
        mime_type: str,
        file_name=None,
    ) -> Path:
        if not self.archive_path:
            self.archive_path = self.generate_pdf(
                self.parse_file_to_message(document_path),
            )
        return make_thumbnail_from_pdf(
            self.archive_path,
            self.tempdir,
            self.logging_group,
        )
    def extract_metadata(self, document_path: Path, mime_type: str):
        result = []
        try:
            mail = self.parse_file_to_message(document_path)
        except ParseError as e:
            self.log.warning(
                f"Error while fetching document metadata for {document_path}: {e}",
            )
            return result
        for key, value in mail.headers.items():
            value = ", ".join(i for i in value)
            try:
                value.encode("utf-8")
            except UnicodeEncodeError as e:  # pragma: no cover
                self.log.debug(f"Skipping header {key}: {e}")
                continue
            result.append(
                {
                    "namespace": "",
                    "prefix": "header",
                    "key": key,
                    "value": value,
                },
            )
        result.append(
            {
                "namespace": "",
                "prefix": "",
                "key": "attachments",
                "value": ", ".join(
                    f"{attachment.filename}"
                    f"({naturalsize(attachment.size, binary=True, format='%.2f')})"
                    for attachment in mail.attachments
                ),
            },
        )
        result.append(
            {
                "namespace": "",
                "prefix": "",
                "key": "date",
                "value": mail.date.strftime("%Y-%m-%d %H:%M:%S %Z"),
            },
        )
        result.sort(key=lambda item: (item["prefix"], item["key"]))
        return result
    def parse(
        self,
        document_path: Path,
        mime_type: str,
        file_name=None,
        mailrule_id: int | None = None,
    ) -> None:
        """
        Parses the given .eml into formatted text, based on the decoded email.
        """
        def strip_text(text: str):
            """
            Reduces the spacing of the given text string
            """
            text = re.sub(r"\s+", " ", text)
            text = re.sub(r"(\n *)+", "\n", text)
            return text.strip()
        def build_formatted_text(mail_message: MailMessage) -> str:
            """
            Constructs a formatted string based on the given email. Basically tries
            to get most of the email content, including front matter, into a nice string
            """
            fmt_text = f"Subject: {mail_message.subject}\n\n"
            fmt_text += f"From: {mail_message.from_values.full}\n\n"
            to_list = [address.full for address in mail_message.to_values]
            fmt_text += f"To: {', '.join(to_list)}\n\n"
            if mail_message.cc_values:
                fmt_text += (
                    f"CC: {', '.join(address.full for address in mail_message.cc_values)}\n\n"
                )
            if mail_message.bcc_values:
                fmt_text += (
                    f"BCC: {', '.join(address.full for address in mail_message.bcc_values)}\n\n"
                )
            if mail_message.attachments:
                att = []
                for a in mail_message.attachments:
                    attachment_size = naturalsize(a.size, binary=True, format="%.2f")
                    att.append(
                        f"{a.filename} ({attachment_size})",
                    )
                fmt_text += f"Attachments: {', '.join(att)}\n\n"
            if mail_message.html:
                fmt_text += "HTML content: " + strip_text(
                    self.tika_parse(mail_message.html),
                )
            fmt_text += f"\n\n{strip_text(mail_message.text)}"
            return fmt_text
        self.log.debug(f"Parsing file {document_path.name} into an email")
        mail = self.parse_file_to_message(document_path)
        self.log.debug("Building formatted text from email")
        self.text = build_formatted_text(mail)
        if is_naive(mail.date):
            self.date = make_aware(mail.date)
        else:
            self.date = mail.date
        self.log.debug("Creating a PDF from the email")
        if mailrule_id:
            rule = MailRule.objects.get(pk=mailrule_id)
            self.archive_path = self.generate_pdf(mail, rule.pdf_layout)
        else:
            self.archive_path = self.generate_pdf(mail)
    @staticmethod
    def parse_file_to_message(filepath: Path) -> MailMessage:
        """
        Parses the given .eml file into a MailMessage object
        """
        try:
            with filepath.open("rb") as eml:
                parsed = MailMessage.from_bytes(eml.read())
                if parsed.from_values is None:
                    raise ParseError(
                        f"Could not parse {filepath}: Missing 'from'",
                    )
        except Exception as err:
            raise ParseError(
                f"Could not parse {filepath}: {err}",
            ) from err
        return parsed
    def tika_parse(self, html: str):
        self.log.info("Sending content to Tika server")
        try:
            with TikaClient(tika_url=settings.TIKA_ENDPOINT) as client:
                parsed = client.tika.as_text.from_buffer(html, "text/html")
                if parsed.content is not None:
                    return parsed.content.strip()
                return ""
        except Exception as err:
            raise ParseError(
                f"Could not parse content with tika server at "
                f"{settings.TIKA_ENDPOINT}: {err}",
            ) from err
    def generate_pdf(
        self,
        mail_message: MailMessage,
        pdf_layout: MailRule.PdfLayout | None = None,
    ) -> Path:
        archive_path = Path(self.tempdir) / "merged.pdf"
        mail_pdf_file = self.generate_pdf_from_mail(mail_message)
        pdf_layout = (
            pdf_layout or settings.EMAIL_PARSE_DEFAULT_LAYOUT
        )  # EMAIL_PARSE_DEFAULT_LAYOUT is a MailRule.PdfLayout
        # If no HTML content, create the PDF from the message
        # Otherwise, create 2 PDFs and merge them with Gotenberg
        if not mail_message.html:
            archive_path.write_bytes(mail_pdf_file.read_bytes())
        else:
            pdf_of_html_content = self.generate_pdf_from_html(
                mail_message.html,
                mail_message.attachments,
            )
            self.log.debug("Merging email text and HTML content into single PDF")
            with (
                GotenbergClient(
                    host=settings.TIKA_GOTENBERG_ENDPOINT,
                    timeout=settings.CELERY_TASK_TIME_LIMIT,
                ) as client,
                client.merge.merge() as route,
            ):
                # Configure requested PDF/A formatting, if any
                pdf_a_format = self._settings_to_gotenberg_pdfa()
                if pdf_a_format is not None:
                    route.pdf_format(pdf_a_format)
                match pdf_layout:
                    case MailRule.PdfLayout.HTML_TEXT:
                        route.merge([pdf_of_html_content, mail_pdf_file])
                    case MailRule.PdfLayout.HTML_ONLY:
                        route.merge([pdf_of_html_content])
                    case MailRule.PdfLayout.TEXT_ONLY:
                        route.merge([mail_pdf_file])
                    case MailRule.PdfLayout.TEXT_HTML | _:
                        route.merge([mail_pdf_file, pdf_of_html_content])
                try:
                    response = route.run()
                    archive_path.write_bytes(response.content)
                except Exception as err:
                    raise ParseError(
                        f"Error while merging email HTML into PDF: {err}",
                    ) from err
        return archive_path
    def mail_to_html(self, mail: MailMessage) -> Path:
        """
        Converts the given email into an HTML file, formatted
        based on the given template
        """
        def clean_html(text: str) -> str:
            """
            Attempts to clean, escape and linkify the given HTML string
            """
            if isinstance(text, list):
                text = "\n".join([str(e) for e in text])
            if not isinstance(text, str):
                text = str(text)
            text = escape(text)
            text = clean(text)
            text = linkify(text, parse_email=True)
            text = text.replace("\n", "<br>")
            return text
        data = {}
        data["subject"] = clean_html(mail.subject)
        if data["subject"]:
            data["subject_label"] = "Subject"
        data["from"] = clean_html(mail.from_values.full)
        if data["from"]:
            data["from_label"] = "From"
        data["to"] = clean_html(", ".join(address.full for address in mail.to_values))
        if data["to"]:
            data["to_label"] = "To"
        data["cc"] = clean_html(", ".join(address.full for address in mail.cc_values))
        if data["cc"]:
            data["cc_label"] = "CC"
        data["bcc"] = clean_html(", ".join(address.full for address in mail.bcc_values))
        if data["bcc"]:
            data["bcc_label"] = "BCC"
        att = []
        for a in mail.attachments:
            att.append(
                f"{a.filename} ({naturalsize(a.size, binary=True, format='%.2f')})",
            )
        data["attachments"] = clean_html(", ".join(att))
        if data["attachments"]:
            data["attachments_label"] = "Attachments"
        data["date"] = clean_html(
            timezone.localtime(mail.date).strftime("%Y-%m-%d %H:%M"),
        )
        data["content"] = clean_html(mail.text.strip())
        from django.template.loader import render_to_string
        html_file = Path(self.tempdir) / "email_as_html.html"
        html_file.write_text(render_to_string("email_msg_template.html", context=data))
        return html_file
    def generate_pdf_from_mail(self, mail: MailMessage) -> Path:
        """
        Creates a PDF based on the given email, using the email's values in
        an HTML template
        """
        self.log.info("Converting mail to PDF")
        css_file = Path(__file__).parent / "templates" / "output.css"
        email_html_file = self.mail_to_html(mail)
        with (
            GotenbergClient(
                host=settings.TIKA_GOTENBERG_ENDPOINT,
                timeout=settings.CELERY_TASK_TIME_LIMIT,
            ) as client,
            client.chromium.html_to_pdf() as route,
        ):
            # Configure requested PDF/A formatting, if any
            pdf_a_format = self._settings_to_gotenberg_pdfa()
            if pdf_a_format is not None:
                route.pdf_format(pdf_a_format)
            try:
                response = (
                    route.index(email_html_file)
                    .resource(css_file)
                    .margins(
                        PageMarginsType(
                            top=Measurement(0.1, MeasurementUnitType.Inches),
                            bottom=Measurement(0.1, MeasurementUnitType.Inches),
                            left=Measurement(0.1, MeasurementUnitType.Inches),
                            right=Measurement(0.1, MeasurementUnitType.Inches),
                        ),
                    )
                    .size(A4)
                    .scale(1.0)
                    .run()
                )
            except Exception as err:
                raise ParseError(
                    f"Error while converting email to PDF: {err}",
                ) from err
        email_as_pdf_file = Path(self.tempdir) / "email_as_pdf.pdf"
        email_as_pdf_file.write_bytes(response.content)
        return email_as_pdf_file
    def generate_pdf_from_html(
        self,
        orig_html: str,
        attachments: list[MailAttachment],
    ) -> Path:
        """
        Generates a PDF file based on the HTML and attachments of the email
        """
        def clean_html_script(text: str):
            compiled_open = re.compile(re.escape("<script"), re.IGNORECASE)
            text = compiled_open.sub("<div hidden ", text)
            compiled_close = re.compile(re.escape("</script"), re.IGNORECASE)
            text = compiled_close.sub("</div", text)
            return text
        self.log.info("Converting message html to PDF")
        tempdir = Path(self.tempdir)
        html_clean = clean_html_script(orig_html)
        html_clean_file = tempdir / "index.html"
        html_clean_file.write_text(html_clean)
        with (
            GotenbergClient(
                host=settings.TIKA_GOTENBERG_ENDPOINT,
                timeout=settings.CELERY_TASK_TIME_LIMIT,
            ) as client,
            client.chromium.html_to_pdf() as route,
        ):
            # Configure requested PDF/A formatting, if any
            pdf_a_format = self._settings_to_gotenberg_pdfa()
            if pdf_a_format is not None:
                route.pdf_format(pdf_a_format)
            # Add attachments as resources, cleaning the filename and replacing
            # it in the index file for inclusion
            for attachment in attachments:
                # Clean the attachment name to be valid
                name_cid = f"cid:{attachment.content_id}"
                name_clean = "".join(e for e in name_cid if e.isalnum())
                # Write attachment payload to a temp file
                temp_file = tempdir / name_clean
                temp_file.write_bytes(attachment.payload)
                route.resource(temp_file)
                # Replace as needed the name with the clean name
                html_clean = html_clean.replace(name_cid, name_clean)
            # Now store the cleaned up HTML version
            html_clean_file = tempdir / "index.html"
            html_clean_file.write_text(html_clean)
            # This is our index file, the main page basically
            route.index(html_clean_file)
            # Set page size, margins
            route.margins(
                PageMarginsType(
                    top=Measurement(0.1, MeasurementUnitType.Inches),
                    bottom=Measurement(0.1, MeasurementUnitType.Inches),
                    left=Measurement(0.1, MeasurementUnitType.Inches),
                    right=Measurement(0.1, MeasurementUnitType.Inches),
                ),
            ).size(A4).scale(1.0)
            try:
                response = route.run()
            except Exception as err:
                raise ParseError(
                    f"Error while converting document to PDF: {err}",
                ) from err
        html_pdf = tempdir / "html.pdf"
        html_pdf.write_bytes(response.content)
        return html_pdf
    def get_settings(self) -> None:
        """
        This parser does not implement additional settings yet
        """
        return None
--- a/src/paperless_mail/signals.py
+++ b/src/paperless_mail/signals.py
@@ -1,7 +1,12 @@
 def get_parser(*args, **kwargs):
-    from paperless_mail.parsers import MailDocumentParser
+    from paperless.parsers.mail import MailDocumentParser
-    return MailDocumentParser(*args, **kwargs)
+    # MailDocumentParser accepts no constructor args in the new-style protocol.
+    # Pop legacy args that arrive from the signal-based consumer path.
+    # Phase 4 will replace this signal path with the ParserRegistry.
+    kwargs.pop("logging_group", None)
+    kwargs.pop("progress_callback", None)
+    return MailDocumentParser()
 def mail_consumer_declaration(sender, **kwargs):
--- a/src/paperless_mail/tests/conftest.py
+++ b/src/paperless_mail/tests/conftest.py
@@ -1,71 +1,9 @@
 from collections.abc import Generator
 from pathlib import Path
 import pytest
 from paperless_mail.mail import MailAccountHandler
 from paperless_mail.models import MailAccount
 from paperless_mail.parsers import MailDocumentParser
@pytest.fixture(scope="session")
 def sample_dir() -> Path:
    return (Path(__file__).parent / Path("samples")).resolve()
@pytest.fixture(scope="session")
 def broken_email_file(sample_dir: Path) -> Path:
    return sample_dir / "broken.eml"
@pytest.fixture(scope="session")
 def simple_txt_email_file(sample_dir: Path) -> Path:
    return sample_dir / "simple_text.eml"
@pytest.fixture(scope="session")
 def simple_txt_email_pdf_file(sample_dir: Path) -> Path:
    return sample_dir / "simple_text.eml.pdf"
@pytest.fixture(scope="session")
 def simple_txt_email_thumbnail_file(sample_dir: Path) -> Path:
    return sample_dir / "simple_text.eml.pdf.webp"
@pytest.fixture(scope="session")
 def html_email_file(sample_dir: Path) -> Path:
    return sample_dir / "html.eml"
@pytest.fixture(scope="session")
 def html_email_pdf_file(sample_dir: Path) -> Path:
    return sample_dir / "html.eml.pdf"
@pytest.fixture(scope="session")
 def html_email_thumbnail_file(sample_dir: Path) -> Path:
    return sample_dir / "html.eml.pdf.webp"
@pytest.fixture(scope="session")
 def html_email_html_file(sample_dir: Path) -> Path:
    return sample_dir / "html.eml.html"
@pytest.fixture(scope="session")
 def merged_pdf_first(sample_dir: Path) -> Path:
    return sample_dir / "first.pdf"
@pytest.fixture(scope="session")
 def merged_pdf_second(sample_dir: Path) -> Path:
    return sample_dir / "second.pdf"
@pytest.fixture()
 def mail_parser() -> MailDocumentParser:
    return MailDocumentParser(logging_group=None)
@pytest.fixture()
@@ -89,11 +27,3 @@ def greenmail_mail_account(db: None) -> Generator[MailAccount, None, None]:
@pytest.fixture()
 def mail_account_handler() -> MailAccountHandler:
    return MailAccountHandler()
@pytest.fixture(scope="session")
 def nginx_base_url() -> Generator[str, None, None]:
    """
    The base URL for the nginx HTTP server we expect to be alive
    """
    yield "http://localhost:8080"
--- a/src/paperless_tesseract/signals.py
+++ b/src/paperless_tesseract/signals.py
@@ -1,10 +1,23 @@
-def get_parser(*args, **kwargs):
+from __future__ import annotations
-    from paperless_tesseract.parsers import RasterisedDocumentParser
+from typing import Any
+def get_parser(*args: Any, **kwargs: Any) -> Any:
+    from paperless.parsers.tesseract import RasterisedDocumentParser
+    # RasterisedDocumentParser accepts logging_group for constructor compatibility but
+    # does not store or use it (no legacy DocumentParser base class).
+    # progress_callback is also not used.  Both may arrive as a positional arg
+    # (consumer) or a keyword arg (views); *args absorbs the positional form,
+    # kwargs.pop handles the keyword form.  Phase 4 will replace this signal
+    # path with the new ParserRegistry so the shim can be removed at that point.
+    kwargs.pop("logging_group", None)
+    kwargs.pop("progress_callback", None)
    return RasterisedDocumentParser(*args, **kwargs)
-def tesseract_consumer_declaration(sender, **kwargs):
+def tesseract_consumer_declaration(sender: Any, **kwargs: Any) -> dict[str, Any]:
    return {
        "parser": get_parser,
        "weight": 0,
--- a/src/paperless_tesseract/tests/test_parser.py
+++ b/src/paperless_tesseract/tests/test_parser.py
@@ -1,924 +0,0 @@
 import shutil
 import tempfile
 import unicodedata
 import uuid
 from pathlib import Path
 from unittest import mock
 from django.test import TestCase
 from django.test import override_settings
 from ocrmypdf import SubprocessOutputError
 from documents.parsers import ParseError
 from documents.parsers import run_convert
 from documents.tests.utils import DirectoriesMixin
 from documents.tests.utils import FileSystemAssertsMixin
 from paperless_tesseract.parsers import RasterisedDocumentParser
 from paperless_tesseract.parsers import post_process_text
 class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
    def assertContainsStrings(self, content, strings) -> None:
        # Asserts that all strings appear in content, in the given order.
        indices = []
        for s in strings:
            if s in content:
                indices.append(content.index(s))
            else:
                self.fail(f"'{s}' is not in '{content}'")
        self.assertListEqual(indices, sorted(indices))
    def test_post_process_text(self) -> None:
        text_cases = [
            ("simple     string", "simple string"),
            ("simple    newline\n   testing string", "simple newline\ntesting string"),
            (
                "utf-8   строка с пробелами в конце  ",
                "utf-8 строка с пробелами в конце",
            ),
        ]
        for source, result in text_cases:
            actual_result = post_process_text(source)
            self.assertEqual(
                result,
                actual_result,
                f"post_process_text({source}) != '{result}', but '{actual_result}'",
            )
    def test_get_text_from_pdf(self) -> None:
        parser = RasterisedDocumentParser(uuid.uuid4())
        text = parser.extract_text(
            None,
            self.SAMPLE_FILES / "simple-digital.pdf",
        )
        self.assertContainsStrings(text.strip(), ["This is a test document."])
    def test_get_page_count(self) -> None:
        """
        GIVEN:
            - PDF file with a single page
            - PDF file with multiple pages
        WHEN:
            - The number of pages is requested
        THEN:
            - The method returns 1 as the expected number of pages
            - The method returns the correct number of pages (6)
        """
        parser = RasterisedDocumentParser(uuid.uuid4())
        page_count = parser.get_page_count(
            str(self.SAMPLE_FILES / "simple-digital.pdf"),
            "application/pdf",
        )
        self.assertEqual(page_count, 1)
        page_count = parser.get_page_count(
            str(self.SAMPLE_FILES / "multi-page-mixed.pdf"),
            "application/pdf",
        )
        self.assertEqual(page_count, 6)
    def test_get_page_count_password_protected(self) -> None:
        """
        GIVEN:
            - Password protected PDF file
        WHEN:
            - The number of pages is requested
        THEN:
            - The method returns None
        """
        parser = RasterisedDocumentParser(uuid.uuid4())
        with self.assertLogs("paperless.parsing.tesseract", level="WARNING") as cm:
            page_count = parser.get_page_count(
                str(self.SAMPLE_FILES / "password-protected.pdf"),
                "application/pdf",
            )
            self.assertEqual(page_count, None)
            self.assertIn("Unable to determine PDF page count", cm.output[0])
    def test_thumbnail(self) -> None:
        parser = RasterisedDocumentParser(uuid.uuid4())
        thumb = parser.get_thumbnail(
            str(self.SAMPLE_FILES / "simple-digital.pdf"),
            "application/pdf",
        )
        self.assertIsFile(thumb)
    @mock.patch("documents.parsers.run_convert")
    def test_thumbnail_fallback(self, m) -> None:
        def call_convert(input_file, output_file, **kwargs) -> None:
            if ".pdf" in str(input_file):
                raise ParseError("Does not compute.")
            else:
                run_convert(input_file=input_file, output_file=output_file, **kwargs)
        m.side_effect = call_convert
        parser = RasterisedDocumentParser(uuid.uuid4())
        thumb = parser.get_thumbnail(
            str(self.SAMPLE_FILES / "simple-digital.pdf"),
            "application/pdf",
        )
        self.assertIsFile(thumb)
    def test_thumbnail_encrypted(self) -> None:
        parser = RasterisedDocumentParser(uuid.uuid4())
        thumb = parser.get_thumbnail(
            str(self.SAMPLE_FILES / "encrypted.pdf"),
            "application/pdf",
        )
        self.assertIsFile(thumb)
    def test_get_dpi(self) -> None:
        parser = RasterisedDocumentParser(None)
        dpi = parser.get_dpi(str(self.SAMPLE_FILES / "simple-no-dpi.png"))
        self.assertEqual(dpi, None)
        dpi = parser.get_dpi(str(self.SAMPLE_FILES / "simple.png"))
        self.assertEqual(dpi, 72)
    def test_simple_digital(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "simple-digital.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(parser.get_text(), ["This is a test document."])
    def test_with_form(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "with-form.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text(),
            ["Please enter your name in here:", "This is a PDF document with a form."],
        )
    @override_settings(OCR_MODE="redo")
    def test_with_form_error(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "with-form.pdf"),
            "application/pdf",
        )
        self.assertIsNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text(),
            ["Please enter your name in here:", "This is a PDF document with a form."],
        )
    @override_settings(OCR_MODE="skip")
    def test_signed(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "signed.pdf"), "application/pdf")
        self.assertIsNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text(),
            [
                "This is a digitally signed PDF, created with Acrobat Pro for the Paperless project to enable",
                "automated testing of signed/encrypted PDFs",
            ],
        )
    @override_settings(OCR_MODE="skip")
    def test_encrypted(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "encrypted.pdf"),
            "application/pdf",
        )
        self.assertIsNone(parser.archive_path)
        self.assertEqual(parser.get_text(), "")
    @override_settings(OCR_MODE="redo")
    def test_with_form_error_notext(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "with-form.pdf"),
            "application/pdf",
        )
        self.assertContainsStrings(
            parser.get_text(),
            ["Please enter your name in here:", "This is a PDF document with a form."],
        )
    @override_settings(OCR_MODE="force")
    def test_with_form_force(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "with-form.pdf"),
            "application/pdf",
        )
        self.assertContainsStrings(
            parser.get_text(),
            ["Please enter your name in here:", "This is a PDF document with a form."],
        )
    def test_image_simple(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "simple.png"), "image/png")
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(parser.get_text(), ["This is a test document."])
    def test_image_simple_alpha(self) -> None:
        parser = RasterisedDocumentParser(None)
        with tempfile.TemporaryDirectory() as tempdir:
            # Copy sample file to temp directory, as the parsing changes the file
            # and this makes it modified to Git
            sample_file = self.SAMPLE_FILES / "simple-alpha.png"
            dest_file = Path(tempdir) / "simple-alpha.png"
            shutil.copy(sample_file, dest_file)
            parser.parse(str(dest_file), "image/png")
            self.assertIsFile(parser.archive_path)
            self.assertContainsStrings(parser.get_text(), ["This is a test document."])
    def test_image_calc_a4_dpi(self) -> None:
        parser = RasterisedDocumentParser(None)
        dpi = parser.calculate_a4_dpi(
            str(self.SAMPLE_FILES / "simple-no-dpi.png"),
        )
        self.assertEqual(dpi, 62)
    @mock.patch("paperless_tesseract.parsers.RasterisedDocumentParser.calculate_a4_dpi")
    def test_image_dpi_fail(self, m) -> None:
        m.return_value = None
        parser = RasterisedDocumentParser(None)
        def f() -> None:
            parser.parse(
                str(self.SAMPLE_FILES / "simple-no-dpi.png"),
                "image/png",
            )
        self.assertRaises(ParseError, f)
    @override_settings(OCR_IMAGE_DPI=72, MAX_IMAGE_PIXELS=0)
    def test_image_no_dpi_default(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "simple-no-dpi.png"), "image/png")
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["this is a test document."],
        )
    def test_multi_page(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_PAGES=2, OCR_MODE="skip")
    def test_multi_page_pages_skip(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_PAGES=2, OCR_MODE="redo")
    def test_multi_page_pages_redo(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_PAGES=2, OCR_MODE="force")
    def test_multi_page_pages_force(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_MODE="skip")
    def test_multi_page_analog_pages_skip(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_PAGES=2, OCR_MODE="redo")
    def test_multi_page_analog_pages_redo(self) -> None:
        """
        GIVEN:
            - File with text contained in images but no text layer
            - OCR of only pages 1 and 2 requested
            - OCR mode set to redo
        WHEN:
            - Document is parsed
        THEN:
            - Text of pages 1 and 2 is extracted
            - An archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(parser.get_text().lower(), ["page 1", "page 2"])
        self.assertNotIn("page 3", parser.get_text().lower())
    @override_settings(OCR_PAGES=1, OCR_MODE="force")
    def test_multi_page_analog_pages_force(self) -> None:
        """
        GIVEN:
            - File with text contained in images but no text layer
            - OCR of only page 1 requested
            - OCR mode set to force
        WHEN:
            - Document is parsed
        THEN:
            - Only text of page 1 is extracted
            - An archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
            "application/pdf",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(parser.get_text().lower(), ["page 1"])
        self.assertNotIn("page 2", parser.get_text().lower())
        self.assertNotIn("page 3", parser.get_text().lower())
    @override_settings(OCR_MODE="skip_noarchive")
    def test_skip_noarchive_withtext(self) -> None:
        """
        GIVEN:
            - File with existing text layer
            - OCR mode set to skip_noarchive
        WHEN:
            - Document is parsed
        THEN:
            - Text from text layer is extracted
            - No archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_MODE="skip_noarchive")
    def test_skip_noarchive_notext(self) -> None:
        """
        GIVEN:
            - File with text contained in images but no text layer
            - OCR mode set to skip_noarchive
        WHEN:
            - Document is parsed
        THEN:
            - Text from images is extracted
            - An archive file is created with the OCRd text
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
            "application/pdf",
        )
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
        self.assertIsNotNone(parser.archive_path)
    @override_settings(OCR_SKIP_ARCHIVE_FILE="never")
    def test_skip_archive_never_withtext(self) -> None:
        """
        GIVEN:
            - File with existing text layer
            - OCR_SKIP_ARCHIVE_FILE set to never
        WHEN:
            - Document is parsed
        THEN:
            - Text from text layer is extracted
            - Archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsNotNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_SKIP_ARCHIVE_FILE="never")
    def test_skip_archive_never_withimages(self) -> None:
        """
        GIVEN:
            - File with text contained in images but no text layer
            - OCR_SKIP_ARCHIVE_FILE set to never
        WHEN:
            - Document is parsed
        THEN:
            - Text from images is extracted
            - Archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
            "application/pdf",
        )
        self.assertIsNotNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_SKIP_ARCHIVE_FILE="with_text")
    def test_skip_archive_withtext_withtext(self) -> None:
        """
        GIVEN:
            - File with existing text layer
            - OCR_SKIP_ARCHIVE_FILE set to with_text
        WHEN:
            - Document is parsed
        THEN:
            - Text from text layer is extracted
            - No archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_SKIP_ARCHIVE_FILE="with_text")
    def test_skip_archive_withtext_withimages(self) -> None:
        """
        GIVEN:
            - File with text contained in images but no text layer
            - OCR_SKIP_ARCHIVE_FILE set to with_text
        WHEN:
            - Document is parsed
        THEN:
            - Text from images is extracted
            - Archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
            "application/pdf",
        )
        self.assertIsNotNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_SKIP_ARCHIVE_FILE="always")
    def test_skip_archive_always_withtext(self) -> None:
        """
        GIVEN:
            - File with existing text layer
            - OCR_SKIP_ARCHIVE_FILE set to always
        WHEN:
            - Document is parsed
        THEN:
            - Text from text layer is extracted
            - No archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-digital.pdf"),
            "application/pdf",
        )
        self.assertIsNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_SKIP_ARCHIVE_FILE="always")
    def test_skip_archive_always_withimages(self) -> None:
        """
        GIVEN:
            - File with text contained in images but no text layer
            - OCR_SKIP_ARCHIVE_FILE set to always
        WHEN:
            - Document is parsed
        THEN:
            - Text from images is extracted
            - No archive file is created
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.pdf"),
            "application/pdf",
        )
        self.assertIsNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    @override_settings(OCR_MODE="skip")
    def test_multi_page_mixed(self) -> None:
        """
        GIVEN:
            - File with some text contained in images and some in text layer
            - OCR mode set to skip
        WHEN:
            - Document is parsed
        THEN:
            - Text from images is extracted
            - An archive file is created with the OCRd text and the original text
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-mixed.pdf"),
            "application/pdf",
        )
        self.assertIsNotNone(parser.archive_path)
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3", "page 4", "page 5", "page 6"],
        )
        with (parser.tempdir / "sidecar.txt").open() as f:
            sidecar = f.read()
        self.assertIn("[OCR skipped on page(s) 4-6]", sidecar)
    @override_settings(OCR_MODE="redo")
    def test_single_page_mixed(self) -> None:
        """
        GIVEN:
            - File with some text contained in images and some in text layer
            - Text and images are mixed on the same page
            - OCR mode set to redo
        WHEN:
            - Document is parsed
        THEN:
            - Text from images is extracted
            - Full content of the file is parsed (not just the image text)
            - An archive file is created with the OCRd text and the original text
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "single-page-mixed.pdf"),
            "application/pdf",
        )
        self.assertIsNotNone(parser.archive_path)
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            [
                "this is some normal text, present on page 1 of the document.",
                "this is some text, but in an image, also on page 1.",
                "this is further text on page 1.",
            ],
        )
        with (parser.tempdir / "sidecar.txt").open() as f:
            sidecar = f.read().lower()
        self.assertIn("this is some text, but in an image, also on page 1.", sidecar)
        self.assertNotIn(
            "this is some normal text, present on page 1 of the document.",
            sidecar,
        )
    @override_settings(OCR_MODE="skip_noarchive")
    def test_multi_page_mixed_no_archive(self) -> None:
        """
        GIVEN:
            - File with some text contained in images and some in text layer
            - OCR mode set to skip_noarchive
        WHEN:
            - Document is parsed
        THEN:
            - Text from images is extracted
            - No archive file is created as original file contains text
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-mixed.pdf"),
            "application/pdf",
        )
        self.assertIsNone(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 4", "page 5", "page 6"],
        )
    @override_settings(OCR_MODE="skip", OCR_ROTATE_PAGES=True)
    def test_rotate(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "rotated.pdf"), "application/pdf")
        self.assertContainsStrings(
            parser.get_text(),
            [
                "This is the text that appears on the first page. It’s a lot of text.",
                "Even if the pages are rotated, OCRmyPDF still gets the job done.",
                "This is a really weird file with lots of nonsense text.",
                "If you read this, it’s your own fault. Also check your screen orientation.",
            ],
        )
    def test_multi_page_tiff(self) -> None:
        """
        GIVEN:
            - Multi-page TIFF image
        WHEN:
            - Image is parsed
        THEN:
            - Text from all pages extracted
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "multi-page-images.tiff"),
            "image/tiff",
        )
        self.assertIsFile(parser.archive_path)
        self.assertContainsStrings(
            parser.get_text().lower(),
            ["page 1", "page 2", "page 3"],
        )
    def test_multi_page_tiff_alpha(self) -> None:
        """
        GIVEN:
            - Multi-page TIFF image
            - Image includes an alpha channel
        WHEN:
            - Image is parsed
        THEN:
            - Text from all pages extracted
        """
        parser = RasterisedDocumentParser(None)
        sample_file = self.SAMPLE_FILES / "multi-page-images-alpha.tiff"
        with tempfile.NamedTemporaryFile() as tmp_file:
            shutil.copy(sample_file, tmp_file.name)
            parser.parse(
                tmp_file.name,
                "image/tiff",
            )
            self.assertIsFile(parser.archive_path)
            self.assertContainsStrings(
                parser.get_text().lower(),
                ["page 1", "page 2", "page 3"],
            )
    def test_multi_page_tiff_alpha_srgb(self) -> None:
        """
        GIVEN:
            - Multi-page TIFF image
            - Image includes an alpha channel
            - Image is in the sRGB colorspace
        WHEN:
            - Image is parsed
        THEN:
            - Text from all pages extracted
        """
        parser = RasterisedDocumentParser(None)
        sample_file = str(
            self.SAMPLE_FILES / "multi-page-images-alpha-rgb.tiff",
        )
        with tempfile.NamedTemporaryFile() as tmp_file:
            shutil.copy(sample_file, tmp_file.name)
            parser.parse(
                tmp_file.name,
                "image/tiff",
            )
            self.assertIsFile(parser.archive_path)
            self.assertContainsStrings(
                parser.get_text().lower(),
                ["page 1", "page 2", "page 3"],
            )
    def test_ocrmypdf_parameters(self) -> None:
        parser = RasterisedDocumentParser(None)
        params = parser.construct_ocrmypdf_parameters(
            input_file="input.pdf",
            output_file="output.pdf",
            sidecar_file="sidecar.txt",
            mime_type="application/pdf",
            safe_fallback=False,
        )
        self.assertEqual(params["input_file_or_options"], "input.pdf")
        self.assertEqual(params["output_file"], "output.pdf")
        self.assertEqual(params["sidecar"], "sidecar.txt")
        with override_settings(OCR_CLEAN="none"):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertNotIn("clean", params)
            self.assertNotIn("clean_final", params)
        with override_settings(OCR_CLEAN="clean"):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertTrue(params["clean"])
            self.assertNotIn("clean_final", params)
        with override_settings(OCR_CLEAN="clean-final", OCR_MODE="skip"):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertTrue(params["clean_final"])
            self.assertNotIn("clean", params)
        with override_settings(OCR_CLEAN="clean-final", OCR_MODE="redo"):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertTrue(params["clean"])
            self.assertNotIn("clean_final", params)
        with override_settings(OCR_DESKEW=True, OCR_MODE="skip"):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertTrue(params["deskew"])
        with override_settings(OCR_DESKEW=True, OCR_MODE="redo"):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertNotIn("deskew", params)
        with override_settings(OCR_DESKEW=False, OCR_MODE="skip"):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertNotIn("deskew", params)
        with override_settings(OCR_MAX_IMAGE_PIXELS=1_000_001.0):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertIn("max_image_mpixels", params)
            self.assertAlmostEqual(params["max_image_mpixels"], 1, places=4)
        with override_settings(OCR_MAX_IMAGE_PIXELS=-1_000_001.0):
            parser = RasterisedDocumentParser(None)
            params = parser.construct_ocrmypdf_parameters("", "", "", "")
            self.assertNotIn("max_image_mpixels", params)
    def test_rtl_language_detection(self) -> None:
        """
        GIVEN:
            - File with text in an RTL language
        WHEN:
            - Document is parsed
        THEN:
            - Text from the document is extracted
        """
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "rtl-test.pdf"),
            "application/pdf",
        )
        # OCR output for RTL text varies across platforms/versions due to
        # bidi controls and presentation forms; normalize before assertion.
        normalized_text = "".join(
            char
            for char in unicodedata.normalize("NFKC", parser.get_text())
            if unicodedata.category(char) != "Cf" and not char.isspace()
        )
        self.assertIn("ةرازو", normalized_text)
        self.assertTrue(
            any(token in normalized_text for token in ("ةیلخادلا", "الاخليد")),
        )
    @mock.patch("ocrmypdf.ocr")
    def test_gs_rendering_error(self, m) -> None:
        m.side_effect = SubprocessOutputError("Ghostscript PDF/A rendering failed")
        parser = RasterisedDocumentParser(None)
        self.assertRaises(
            ParseError,
            parser.parse,
            str(self.SAMPLE_FILES / "simple-digital.pdf"),
            "application/pdf",
        )
class TestParserFileTypes(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
    SAMPLE_FILES = Path(__file__).parent / "samples"
    def test_bmp(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "simple.bmp"), "image/bmp")
        self.assertIsFile(parser.archive_path)
        self.assertIn("this is a test document", parser.get_text().lower())
    def test_jpg(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "simple.jpg"), "image/jpeg")
        self.assertIsFile(parser.archive_path)
        self.assertIn("this is a test document", parser.get_text().lower())
    def test_heic(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "simple.heic"), "image/heic")
        self.assertIsFile(parser.archive_path)
        self.assertIn("pizza", parser.get_text().lower())
    @override_settings(OCR_IMAGE_DPI=200)
    def test_gif(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "simple.gif"), "image/gif")
        self.assertIsFile(parser.archive_path)
        self.assertIn("this is a test document", parser.get_text().lower())
    def test_tiff(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(str(self.SAMPLE_FILES / "simple.tif"), "image/tiff")
        self.assertIsFile(parser.archive_path)
        self.assertIn("this is a test document", parser.get_text().lower())
    @override_settings(OCR_IMAGE_DPI=72)
    def test_webp(self) -> None:
        parser = RasterisedDocumentParser(None)
        parser.parse(
            str(self.SAMPLE_FILES / "document.webp"),
            "image/webp",
        )
        self.assertIsFile(parser.archive_path)
        # Older tesseracts consistently mangle the space between "a webp",
        # tesseract 5.3.0 seems to do a better job, so we're accepting both
        self.assertRegex(
            parser.get_text().lower(),
            r"this is a ?webp document, created 11/14/2022.",
        )
Author	SHA1	Message	Date
Trenton H	e24a2d8214	fix: add RasterisedDocumentParser to new-style parser shim checks The new RasterisedDocumentParser uses __enter__/__exit__ for resource management instead of cleanup(). Update all existing new-style shims to include it in the isinstance checks: - documents/consumer.py: _parser_cleanup(), parser_is_new_style - documents/tasks.py: parser_is_new_style, finally cleanup branch (also adds RemoteDocumentParser which was missing from the latter) - documents/management/commands/document_thumbnails.py: adds new-style handling from scratch (enter/exit + 2-arg get_thumbnail signature) Fix stale import paths in three test files that were still importing from paperless_tesseract.parsers instead of paperless.parsers.tesseract. Fix two registry tests that used application/pdf as a proxy for "no handler" — now that RasterisedDocumentParser is registered, PDF always has a handler, so switch to a truly unsupported MIME type. Signal infrastructure and shims remain intact; this is plumbing only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 14:54:34 -07:00
Trenton H	8e3dfcb4ee	fix(types): fully annotate paperless/parsers/tesseract.py Fixes all mypy and pyrefly errors in the new parser file: - Add missing type annotations to is_image, has_alpha, get_dpi, calculate_a4_dpi, construct_ocrmypdf_parameters, post_process_text - Narrow Path-only (no str) for image helper args; convert to str when building list[str] args for run_subprocess - Annotate ocrmypdf_args as dict[str, Any] so operator expressions on its values type-check and ocrmypdf.ocr(**args) resolves cleanly - Declare text: str | None = None at top of extract_text to unify all assignments to the same type across both branches - Import Any from typing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 14:19:22 -07:00
Trenton H	1b45e4d029	tests: rewrite test_tesseract_parser to pytest style with typed fixtures - Converts all tests from Django TestCase to pytest-style classes - Adds tesseract_samples_dir, null_app_config, tesseract_parser, and make_tesseract_parser fixtures in conftest.py; all DB-free except TestOcrmypdfParameters which uses @pytest.mark.django_db - Defines MakeTesseractParser type alias in conftest.py for autocomplete - Fixes FBT001 (boolean positional args) by making bool params keyword-only with * separator in parametrize test signatures - Adds type annotations to all fixture parameters for IDE support - Uses pytest.param(..., id="...") throughout; pytest-mock for patching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 13:51:34 -07:00
Trenton H	6b279e9368	Update tesseract signals.py to import from new parser location RasterisedDocumentParser moved to paperless.parsers.tesseract; update the lazy import in signals.get_parser so the signal-based consumer declaration continues to work during the registry transition. Pop logging_group and progress_callback kwargs for constructor compatibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 13:04:53 -07:00
Trenton H	97bc53ccdc	Refactor RasterisedDocumentParser to ParserProtocol interface - Add RasterisedDocumentParser to registry.register_defaults() - Update parser class: remove DocumentParser inheritance, add Protocol class attrs/classmethods/properties, context-manager lifecycle - Add read_file_handle_unicode_errors() to shared parsers/utils.py - Replace inline unicode-error-handling with shared utility call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 13:02:43 -07:00
Trenton H	80fa4f6f12	Move tesseract parser, tests, and samples to paperless.parsers Relocates files in preparation for the Phase 3 Protocol-based parser refactor, preserving full git history via rename. - src/paperless_tesseract/parsers.py -> src/paperless/parsers/tesseract.py - src/paperless_tesseract/tests/test_parser.py -> src/paperless/tests/parsers/test_tesseract_parser.py - src/paperless_tesseract/tests/test_parser_custom_settings.py -> src/paperless/tests/parsers/test_tesseract_custom_settings.py - src/paperless_tesseract/tests/samples/* -> src/paperless/tests/samples/tesseract/ - Moves RUF001 suppression from broad per-file pyproject.toml ignore to inline noqa comments on the two affected lines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 12:50:58 -07:00
Trenton H	c5b006e666	Updates typing	2026-03-19 12:33:43 -07:00
Trenton H	ad1654d89b	Updates typing	2026-03-19 12:22:29 -07:00
Trenton H	466a402715	Merge branch 'dev' into feature-mail-parser-plugin	2026-03-19 12:02:32 -07:00
Trenton H	b2e3048083	One more coverage	2026-03-19 12:00:11 -07:00
Trenton H	fe1e35b9ac	Increases test coverage	2026-03-19 11:43:12 -07:00
Trenton H	d01513a869	Updates so we can report a page count for these parsers, assuming we do have an archive produced when called	2026-03-19 11:42:38 -07:00
Trenton H	9e3c93f72d	Corrects the score return	2026-03-19 11:23:30 -07:00
Trenton H	16e73f611d	Cleans up the reprocess task and generally reduces duplicate of classes	2026-03-19 09:57:08 -07:00
Trenton H	b66cfb1867	Merge remote-tracking branch 'origin/dev' into feature-mail-parser-plugin	2026-03-19 09:24:44 -07:00
Trenton H	49e1ebb620	Fix(tests): add configure() to DummyParser and missing-method parametrize ParserProtocol now requires configure(context: ParserContext) -> None. Update DummyParser in test_registry.py to implement it, and add 'missing-configure' to the test_partial_compliant_fails_isinstance parametrize list so the new method is covered by the negative test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 08:34:59 -07:00
Trenton H	8148f2ced2	Feat(parsers): call configure(ParserContext()) in update_document task Apply the same new-style parser shim pattern as the consumer to update_document_content_maybe_archive_file: - Call __enter__ for Text/Tika parsers after instantiation - Call configure(ParserContext()) before parse() for all new-style parsers (mailrule_id is not available here — this is a re-process of an existing document, so the default empty context is correct) - Call parse(path, mime_type) with 2 args for new-style parsers - Call get_thumbnail(path, mime_type) with 2 args for new-style parsers - Call __exit__ instead of cleanup() in the finally block Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 08:28:17 -07:00
Trenton H	a36b6ecbef	Feat(parsers): add ParserContext and configure() to ParserProtocol Replace the ad-hoc mailrule_id attribute assignment with a typed, immutable ParserContext dataclass and a configure() method on the Protocol: - ParserContext(frozen=True, slots=True) lives in paperless/parsers/ alongside ParserProtocol and MetadataEntry; currently carries only mailrule_id but is designed to grow with output_type, ocr_mode, and ocr_language in a future phase (decoupling parsers from settings.*) - ParserProtocol.configure(context: ParserContext) -> None is the extension point; no-op by default - MailDocumentParser.configure() reads mailrule_id into _mailrule_id - TextDocumentParser and TikaDocumentParser implement a no-op configure() - Consumer calls document_parser.configure(ParserContext(...)) before parse(), replacing the isinstance(parser, MailDocumentParser) guard and the direct attribute mutation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-19 08:19:17 -07:00
Trenton H	07237bde6a	Removes fixtures which were duplicated	2026-03-18 15:13:35 -07:00
Trenton H	b80702acb8	Fixes location of the fixture	2026-03-18 15:05:04 -07:00
Trenton H	7428bbb8dc	Bumps this so we can run	2026-03-18 14:55:36 -07:00
Trenton H	9a709abb7d	Fix(parsers): pop legacy constructor args in mail signal wrapper MailDocumentParser.__init__ takes no constructor args in the new protocol. Update the get_parser() signal wrapper to pop logging_group and progress_callback (passed by the legacy consumer dispatch path) before instantiating — the same pattern used by TextDocumentParser. Also update test_mail_parser_receives_mailrule to use the real signal wrapper (mail_get_parser) instead of MailDocumentParser directly, so the test exercises the actual dispatch path and matches the new parse() call signature (no mailrule kwarg). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 14:48:10 -07:00
Trenton H	3236bbd0c5	Feat(parsers): migrate MailDocumentParser to ParserProtocol Move the mail parser from paperless_mail/parsers.py to paperless/parsers/mail.py and refactor it to implement ParserProtocol: - Class-level name/version/author/url attributes - supported_mime_types() and score() classmethods (score=20) - can_produce_archive=False, requires_pdf_rendition=True - Context manager lifecycle (__enter__/__exit__) - New parse() signature without mailrule_id kwarg; consumer sets parser.mailrule_id before calling parse() instead - get_text()/get_date()/get_archive_path() accessor methods - extract_metadata() returning email headers and attachment info Register MailDocumentParser in the ParserRegistry alongside Text and Tika parsers. Update consumer, signals, and all import sites to use the new location. Update tests to use the new accessor API, patch paths, and context-manager fixture. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 14:41:26 -07:00
Trenton H	d107c8c531	Feat(tests): add mail parser fixtures to paperless/tests/parsers/conftest.py Add mail_samples_dir, per-file sample fixtures, and mail_parser (context-manager style) to mirror the old paperless_mail conftest but rooted at the new samples/mail/ location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 14:29:14 -07:00
Trenton H	8c671514ab	Chore: move mail parser sample files to paperless/tests/samples/mail/ Relocate all mail test fixtures from src/paperless_mail/tests/samples/ to src/paperless/tests/samples/mail/ ahead of the parser plugin refactor. Add the new path to the codespell skip list to prevent false-positive spell corrections in binary/fixture email files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 14:26:58 -07:00
Trenton H	f2c16a7d98	Refactor(mail): move mail parser tests to paperless/tests/parsers/ Move test_parsers.py → test_mail_parser.py and test_parsers_live.py → test_mail_parser_live.py alongside the other built-in parser tests, preserving git history before editing. Update MailDocumentParser import to the new canonical location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 14:16:23 -07:00
Trenton H	7c76e65950	Refactor(mail): rename paperless_mail/parsers.py → paperless/parsers/mail.py Preserve git history for MailDocumentParser by committing the rename separately before editing, following the project convention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 14:06:17 -07:00