Chore: Drop old signal and unneeded apps, transition to parser registry instead (#12405)

* refactor: switch consumer and callers to ParserRegistry (Phase 4)

Replace all Django signal-based parser discovery with direct registry calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all old-style isinstance checks. All parser instantiation now uses the `with parser_class() as parser:` context manager pattern.

- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: drop get_parser_class_for_mime_type; callers use registry directly

All callers now call get_parser_registry().get_parser_for_file() with the actual filename and path, enabling score() to use file extension hints. The MIME-only helper is removed.

- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior); TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove document_consumer_declaration signal infrastructure

Remove the document_consumer_declaration signal that was previously used for parser registration. Each parser app no longer connects to this signal, and the signal declaration itself has been removed from documents/signals.

Changes:

- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
  - paperless_tesseract/signals.py
  - paperless_text/signals.py
  - paperless_tika/signals.py
  - paperless_mail/signals.py
  - paperless_remote/signals.py

Parser discovery now happens exclusively through the ParserRegistry system introduced in the previous refactor phases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove empty paperless_text and paperless_tika Django apps

After parser classes were moved to paperless/parsers/ in the plugin refactor, these Django apps contained only empty AppConfig classes with no models, views, tasks, migrations, or other functionality.

- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files

paperless_remote app is retained as it contains meaningful system checks for Azure AI configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Moves the checks and tests to the main application and removes the old applications

* Adds a comment to satisfy Sonar

* refactor: remove automatic log_summary() call from get_parser_registry()

The summary was logged once per process, causing it to appear repeatedly during Docker startup (management commands, web server, each Celery worker subprocess). External parsers are already announced individually at INFO when discovered; the full summary is redundant noise. log_summary() is retained on ParserRegistry for manual/debug use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Cleans up the duplicate test file/fixture

* Fixes a race condition where webserver threads could race to populate the registry

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
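
For readers unfamiliar with the pattern the commit message describes, the following is a minimal sketch of score-based parser selection, a lazily built registry guarded by a lock, and the `with parser_class() as parser:` usage. `SketchRegistry`, `TextParser`, and `get_sketch_registry` are hypothetical names, not the classes in paperless/parsers/registry.py. The rule that the highest non-None score wins, the extension bonus, and the double-checked locking are all assumptions, since this commit does not show the registry's internals.

```python
from __future__ import annotations

import threading
from pathlib import Path


class TextParser:
    """Hypothetical parser exposing only what the lookup needs."""

    def __init__(self) -> None:
        self._text: str | None = None

    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        return {"text/plain": ".txt"}

    @classmethod
    def score(cls, mime_type: str, filename: str, path: Path | None = None) -> int | None:
        # None means "cannot handle this file".
        if mime_type not in cls.supported_mime_types():
            return None
        # Assumed tie-breaker: a matching extension earns a bonus.
        return 10 if filename.lower().endswith(".txt") else 0

    def __enter__(self) -> TextParser:
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        # A real parser would remove its temporary files here.
        self._text = None

    def parse(self, document_path: Path, mime_type: str) -> None:
        self._text = Path(document_path).read_text()

    def get_text(self) -> str | None:
        return self._text


class SketchRegistry:
    """Illustrative registry: picks the parser class with the best score."""

    def __init__(self, parsers) -> None:
        self._parsers = list(parsers)

    def all_parsers(self) -> list:
        return list(self._parsers)

    def get_parser_for_file(self, mime_type: str, filename: str, path: Path | None = None):
        best, best_score = None, None
        for parser_class in self._parsers:
            candidate = parser_class.score(mime_type, filename, path)
            # Assumption: higher score wins; None is skipped entirely.
            if candidate is not None and (best_score is None or candidate > best_score):
                best, best_score = parser_class, candidate
        return best


_registry: SketchRegistry | None = None
_registry_lock = threading.Lock()


def get_sketch_registry() -> SketchRegistry:
    """Build the registry once, even if several threads ask at the same time."""
    global _registry
    if _registry is None:
        with _registry_lock:
            # Re-check inside the lock: another thread may have won the race.
            if _registry is None:
                _registry = SketchRegistry([TextParser])
    return _registry
```

A caller would look up the class, check for `None`, and only then enter the context manager, so cleanup runs in `__exit__` whether or not `parse()` raises.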
2026-07-02 02:04:19 +00:00 · 2026-03-22 06:53:32 -07:00
parent 07f54bfdab
commit 701735f6e5
41 changed files with 713 additions and 1295 deletions
@@ -13,8 +13,10 @@ class TestDocumentChecks(TestCase):
    def test_parser_check(self) -> None:
        self.assertEqual(parser_check(None), [])

-        with mock.patch("documents.checks.document_consumer_declaration.send") as m:
-            m.return_value = []
+        with mock.patch("documents.checks.get_parser_registry") as mock_registry_fn:
+            mock_registry = mock.MagicMock()
+            mock_registry.all_parsers.return_value = []
+            mock_registry_fn.return_value = mock_registry

            self.assertEqual(
                parser_check(None),
@@ -27,7 +27,6 @@ from documents.models import Document
 from documents.models import DocumentType
 from documents.models import StoragePath
 from documents.models import Tag
-from documents.parsers import DocumentParser
 from documents.parsers import ParseError
 from documents.plugins.helpers import ProgressStatusOptions
 from documents.tasks import sanity_check
@@ -38,62 +37,106 @@ from documents.tests.utils import GetConsumerMixin
 from paperless_mail.models import MailRule


-class _BaseTestParser(DocumentParser):
-    def get_settings(self) -> None:
+class _BaseNewStyleParser:
+    """Minimal ParserProtocol implementation for use in consumer tests."""
+
+    name: str = "test-parser"
+    version: str = "0.1"
+    author: str = "test"
+    url: str = "test"
+
+    @classmethod
+    def supported_mime_types(cls) -> dict:
+        return {
+            "application/pdf": ".pdf",
+            "image/png": ".png",
+            "message/rfc822": ".eml",
+        }
+
+    @classmethod
+    def score(cls, mime_type: str, filename: str, path=None):
+        return 0 if mime_type in cls.supported_mime_types() else None
+
+    @property
+    def can_produce_archive(self) -> bool:
+        return True
+
+    @property
+    def requires_pdf_rendition(self) -> bool:
+        return False
+
+    def __init__(self) -> None:
+        self._tmpdir: Path | None = None
+        self._text: str | None = None
+        self._archive: Path | None = None
+        self._thumb: Path | None = None
+
+    def __enter__(self):
+        self._tmpdir = Path(
+            tempfile.mkdtemp(prefix="paperless-test-", dir=settings.SCRATCH_DIR),
+        )
+        _, thumb = tempfile.mkstemp(suffix=".webp", dir=self._tmpdir)
+        self._thumb = Path(thumb)
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
+        if self._tmpdir and self._tmpdir.exists():
+            shutil.rmtree(self._tmpdir, ignore_errors=True)
+
+    def configure(self, context) -> None:
        """
-        This parser does not implement additional settings yet
+        Test parser doesn't do anything with context
        """
+
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
+        raise NotImplementedError
+
+    def get_text(self) -> str | None:
+        return self._text
+
+    def get_date(self):
        return None

+    def get_archive_path(self):
+        return self._archive

-class DummyParser(_BaseTestParser):
-    def __init__(self, logging_group, scratch_dir, archive_path) -> None:
-        super().__init__(logging_group, None)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=scratch_dir)
-        self.archive_path = archive_path
+    def get_thumbnail(self, document_path, mime_type) -> Path:
+        return self._thumb

-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb
+    def get_page_count(self, document_path, mime_type):
+        return None

-    def parse(self, document_path, mime_type, file_name=None) -> None:
-        self.text = "The Text"
+    def extract_metadata(self, document_path, mime_type) -> list:
+        return []


-class CopyParser(_BaseTestParser):
-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb
+class DummyParser(_BaseNewStyleParser):
+    _ARCHIVE_SRC = (
+        Path(__file__).parent / "samples" / "documents" / "archive" / "0000001.pdf"
+    )

-    def __init__(self, logging_group, progress_callback=None) -> None:
-        super().__init__(logging_group, progress_callback)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=self.tempdir)
-
-    def parse(self, document_path, mime_type, file_name=None) -> None:
-        self.text = "The text"
-        self.archive_path = Path(self.tempdir / "archive.pdf")
-        shutil.copy(document_path, self.archive_path)
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
+        self._text = "The Text"
+        if produce_archive and self._tmpdir:
+            self._archive = self._tmpdir / "archive.pdf"
+            shutil.copy(self._ARCHIVE_SRC, self._archive)


-class FaultyParser(_BaseTestParser):
-    def __init__(self, logging_group, scratch_dir) -> None:
-        super().__init__(logging_group)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=scratch_dir)
+class CopyParser(_BaseNewStyleParser):
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
+        self._text = "The text"
+        if produce_archive and self._tmpdir:
+            self._archive = self._tmpdir / "archive.pdf"
+            shutil.copy(document_path, self._archive)

-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb

-    def parse(self, document_path, mime_type, file_name=None):
+class FaultyParser(_BaseNewStyleParser):
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
        raise ParseError("Does not compute.")


-class FaultyGenericExceptionParser(_BaseTestParser):
-    def __init__(self, logging_group, scratch_dir) -> None:
-        super().__init__(logging_group)
-        _, self.fake_thumb = tempfile.mkstemp(suffix=".webp", dir=scratch_dir)
-
-    def get_thumbnail(self, document_path, mime_type, file_name=None):
-        return self.fake_thumb
-
-    def parse(self, document_path, mime_type, file_name=None):
+class FaultyGenericExceptionParser(_BaseNewStyleParser):
+    def parse(self, document_path, mime_type, *, produce_archive: bool = True) -> None:
        raise Exception("Generic exception.")


@@ -147,38 +190,12 @@ class TestConsumer(
        self.assertEqual(payload["data"]["max_progress"], last_progress_max)
        self.assertEqual(payload["data"]["status"], last_status)

-    def make_dummy_parser(self, logging_group, progress_callback=None):
-        return DummyParser(
-            logging_group,
-            self.dirs.scratch_dir,
-            self.get_test_archive_file(),
-        )
-
-    def make_faulty_parser(self, logging_group, progress_callback=None):
-        return FaultyParser(logging_group, self.dirs.scratch_dir)
-
-    def make_faulty_generic_exception_parser(
-        self,
-        logging_group,
-        progress_callback=None,
-    ):
-        return FaultyGenericExceptionParser(logging_group, self.dirs.scratch_dir)
-
    def setUp(self) -> None:
        super().setUp()

-        patcher = mock.patch("documents.parsers.document_consumer_declaration.send")
-        m = patcher.start()
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": self.make_dummy_parser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        patcher = mock.patch("documents.consumer.get_parser_registry")
+        mock_registry = patcher.start()
+        mock_registry.return_value.get_parser_for_file.return_value = DummyParser
        self.addCleanup(patcher.stop)

    def get_test_file(self):
@@ -547,9 +564,9 @@ class TestConsumer(
            ) as consumer:
                consumer.run()

-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def testNoParsers(self, m) -> None:
-        m.return_value = []
+        m.return_value.get_parser_for_file.return_value = None

        with self.assertRaisesMessage(
            ConsumerError,
@@ -560,18 +577,9 @@ class TestConsumer(

        self._assert_first_last_send_progress(last_status="FAILED")

-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def testFaultyParser(self, m) -> None:
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": self.make_faulty_parser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        m.return_value.get_parser_for_file.return_value = FaultyParser

        with self.get_consumer(self.get_test_file()) as consumer:
            with self.assertRaisesMessage(
@@ -582,18 +590,9 @@ class TestConsumer(

        self._assert_first_last_send_progress(last_status="FAILED")

-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def testGenericParserException(self, m) -> None:
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": self.make_faulty_generic_exception_parser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        m.return_value.get_parser_for_file.return_value = FaultyGenericExceptionParser

        with self.get_consumer(self.get_test_file()) as consumer:
            with self.assertRaisesMessage(
@@ -1017,7 +1016,7 @@ class TestConsumer(
        self._assert_first_last_send_progress()

    @override_settings(FILENAME_FORMAT="{title}")
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def test_similar_filenames(self, m) -> None:
        shutil.copy(
            Path(__file__).parent / "samples" / "simple.pdf",
@@ -1031,16 +1030,7 @@ class TestConsumer(
            Path(__file__).parent / "samples" / "simple-noalpha.png",
            settings.CONSUMPTION_DIR / "simple.png.pdf",
        )
-        m.return_value = [
-            (
-                None,
-                {
-                    "parser": CopyParser,
-                    "mime_types": {"application/pdf": ".pdf", "image/png": ".png"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        m.return_value.get_parser_for_file.return_value = CopyParser

        with self.get_consumer(settings.CONSUMPTION_DIR / "simple.png") as consumer:
            consumer.run()
@@ -1068,8 +1058,10 @@ class TestConsumer(

        sanity_check()

+    @mock.patch("documents.consumer.get_parser_registry")
    @mock.patch("documents.consumer.run_subprocess")
-    def test_try_to_clean_invalid_pdf(self, m) -> None:
+    def test_try_to_clean_invalid_pdf(self, m, mock_registry) -> None:
+        mock_registry.return_value.get_parser_for_file.return_value = None
        shutil.copy(
            Path(__file__).parent / "samples" / "invalid_pdf.pdf",
            settings.CONSUMPTION_DIR / "invalid_pdf.pdf",
@@ -1091,10 +1083,10 @@ class TestConsumer(

    @mock.patch("paperless_mail.models.MailRule.objects.get")
    @mock.patch("paperless.parsers.mail.MailDocumentParser.parse")
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
+    @mock.patch("documents.consumer.get_parser_registry")
    def test_mail_parser_receives_mailrule(
        self,
-        mock_consumer_declaration_send: mock.Mock,
+        mock_get_parser_registry: mock.Mock,
        mock_mail_parser_parse: mock.Mock,
        mock_mailrule_get: mock.Mock,
    ) -> None:
@@ -1106,18 +1098,11 @@ class TestConsumer(
        THEN:
            - The mail parser should receive the mail rule
        """
-        from paperless_mail.signals import get_parser as mail_get_parser
+        from paperless.parsers.mail import MailDocumentParser

-        mock_consumer_declaration_send.return_value = [
-            (
-                None,
-                {
-                    "parser": mail_get_parser,
-                    "mime_types": {"message/rfc822": ".eml"},
-                    "weight": 0,
-                },
-            ),
-        ]
+        mock_get_parser_registry.return_value.get_parser_for_file.return_value = (
+            MailDocumentParser
+        )
        mock_mailrule_get.return_value = mock.Mock(
            pdf_layout=MailRule.PdfLayout.HTML_ONLY,
        )
@@ -1,132 +1,16 @@
-from tempfile import TemporaryDirectory
-from unittest import mock
-
-from django.apps import apps
 from django.test import TestCase
 from django.test import override_settings

 from documents.parsers import get_default_file_extension
-from documents.parsers import get_parser_class_for_mime_type
 from documents.parsers import get_supported_file_extensions
 from documents.parsers import is_file_ext_supported
+from paperless.parsers.registry import get_parser_registry
+from paperless.parsers.registry import reset_parser_registry
 from paperless.parsers.tesseract import RasterisedDocumentParser
 from paperless.parsers.text import TextDocumentParser
 from paperless.parsers.tika import TikaDocumentParser


-class TestParserDiscovery(TestCase):
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_1_parser(self, m, *args) -> None:
-        """
-        GIVEN:
-            - Parser declared for a given mimetype
-        WHEN:
-            - Attempt to get parser for the mimetype
-        THEN:
-            - Declared parser class is returned
-        """
-
-        class DummyParser:
-            pass
-
-        m.return_value = (
-            (
-                None,
-                {
-                    "weight": 0,
-                    "parser": DummyParser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-        )
-
-        self.assertEqual(get_parser_class_for_mime_type("application/pdf"), DummyParser)
-
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_n_parsers(self, m, *args) -> None:
-        """
-        GIVEN:
-            - Two parsers declared for a given mimetype
-            - Second parser has a higher weight
-        WHEN:
-            - Attempt to get parser for the mimetype
-        THEN:
-            - Second parser class is returned
-        """
-
-        class DummyParser1:
-            pass
-
-        class DummyParser2:
-            pass
-
-        m.return_value = (
-            (
-                None,
-                {
-                    "weight": 0,
-                    "parser": DummyParser1,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-            (
-                None,
-                {
-                    "weight": 1,
-                    "parser": DummyParser2,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-        )
-
-        self.assertEqual(
-            get_parser_class_for_mime_type("application/pdf"),
-            DummyParser2,
-        )
-
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_0_parsers(self, m, *args) -> None:
-        """
-        GIVEN:
-            - No parsers are declared
-        WHEN:
-            - Attempt to get parser for the mimetype
-        THEN:
-            - No parser class is returned
-        """
-        m.return_value = []
-        with TemporaryDirectory():
-            self.assertIsNone(get_parser_class_for_mime_type("application/pdf"))
-
-    @mock.patch("documents.parsers.document_consumer_declaration.send")
-    def test_get_parser_class_no_valid_parser(self, m, *args) -> None:
-        """
-        GIVEN:
-            - No parser declared for a given mimetype
-            - Parser declared for a different mimetype
-        WHEN:
-            - Attempt to get parser for the given mimetype
-        THEN:
-            - No parser class is returned
-        """
-
-        class DummyParser:
-            pass
-
-        m.return_value = (
-            (
-                None,
-                {
-                    "weight": 0,
-                    "parser": DummyParser,
-                    "mime_types": {"application/pdf": ".pdf"},
-                },
-            ),
-        )
-
-        self.assertIsNone(get_parser_class_for_mime_type("image/tiff"))
-
-
 class TestParserAvailability(TestCase):
    def test_tesseract_parser(self) -> None:
        """
@@ -151,7 +35,7 @@ class TestParserAvailability(TestCase):
            self.assertIn(ext, supported_exts)
            self.assertEqual(get_default_file_extension(mime_type), ext)
            self.assertIsInstance(
-                get_parser_class_for_mime_type(mime_type)(logging_group=None),
+                get_parser_registry().get_parser_for_file(mime_type, "")(),
                RasterisedDocumentParser,
            )

@@ -175,7 +59,7 @@ class TestParserAvailability(TestCase):
            self.assertIn(ext, supported_exts)
            self.assertEqual(get_default_file_extension(mime_type), ext)
            self.assertIsInstance(
-                get_parser_class_for_mime_type(mime_type)(logging_group=None),
+                get_parser_registry().get_parser_for_file(mime_type, "")(),
                TextDocumentParser,
            )

@@ -198,22 +82,23 @@ class TestParserAvailability(TestCase):
            ),
        ]

-        # Force the app ready to notice the settings override
-        with override_settings(TIKA_ENABLED=True, INSTALLED_APPS=["paperless_tika"]):
-            app = apps.get_app_config("paperless_tika")
-            app.ready()
+        self.addCleanup(reset_parser_registry)
+
+        # Reset and rebuild the registry with Tika enabled.
+        with override_settings(TIKA_ENABLED=True):
+            reset_parser_registry()
            supported_exts = get_supported_file_extensions()

-        for mime_type, ext in supported_mimes_and_exts:
-            self.assertIn(ext, supported_exts)
-            self.assertEqual(get_default_file_extension(mime_type), ext)
-            self.assertIsInstance(
-                get_parser_class_for_mime_type(mime_type)(logging_group=None),
-                TikaDocumentParser,
-            )
+            for mime_type, ext in supported_mimes_and_exts:
+                self.assertIn(ext, supported_exts)
+                self.assertEqual(get_default_file_extension(mime_type), ext)
+                self.assertIsInstance(
+                    get_parser_registry().get_parser_for_file(mime_type, "")(),
+                    TikaDocumentParser,
+                )

    def test_no_parser_for_mime(self) -> None:
-        self.assertIsNone(get_parser_class_for_mime_type("text/sdgsdf"))
+        self.assertIsNone(get_parser_registry().get_parser_for_file("text/sdgsdf", ""))

    def test_default_extension(self) -> None:
        # Test no parser declared still returns an extension