Chore: Drop old signal and unneeded apps, transition to parser registry instead (#12405)

* refactor: switch consumer and callers to ParserRegistry (Phase 4) Replace all Django signal-based parser discovery with direct registry calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all old-style isinstance checks. All parser instantiation now uses the `with parser_class() as parser:` context manager pattern. - documents/parsers.py: delegate to get_parser_registry(); drop lru_cache - documents/consumer.py: use registry + context manager; remove shims - documents/tasks.py: same pattern - documents/management/commands/document_thumbnails.py: same pattern - documents/views.py: get_metadata uses context manager - documents/checks.py: use get_parser_registry().all_parsers() - paperless/parsers/registry.py: add all_parsers() public method - tests: update mocks to target documents.consumer.get_parser_class_for_mime_type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: drop get_parser_class_for_mime_type; callers use registry directly All callers now call get_parser_registry().get_parser_for_file() with the actual filename and path, enabling score() to use file extension hints. The MIME-only helper is removed. - consumer.py: passes self.filename + self.working_copy - tasks.py: passes document.original_filename + document.source_path - document_thumbnails.py: same pattern - views.py: passes Path(file).name + Path(file) - parsers.py: internal helpers inline the registry call with filename="" - test_parsers.py: drop TestParserDiscovery (was testing mock behavior); TestParserAvailability uses registry directly - test_consumer.py: mocks switch to documents.consumer.get_parser_registry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: remove document_consumer_declaration signal infrastructure Remove the document_consumer_declaration signal that was previously used for parser registration. Each parser app no longer connects to this signal, and the signal declaration itself has been removed from documents/signals. Changes: - Remove document_consumer_declaration from documents/signals/__init__.py - Remove ready() methods and signal imports from all parser app configs - Delete signal shim files (signals.py) from all parser apps: - paperless_tesseract/signals.py - paperless_text/signals.py - paperless_tika/signals.py - paperless_mail/signals.py - paperless_remote/signals.py Parser discovery now happens exclusively through the ParserRegistry system introduced in the previous refactor phases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: remove empty paperless_text and paperless_tika Django apps After parser classes were moved to paperless/parsers/ in the plugin refactor, these Django apps contained only empty AppConfig classes with no models, views, tasks, migrations, or other functionality. - Remove paperless_text and paperless_tika from INSTALLED_APPS - Delete empty app directories entirely - Update pyproject.toml test exclusions - Clean stale mypy baseline entries for moved parser files paperless_remote app is retained as it contains meaningful system checks for Azure AI configuration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Moves the checks and tests to the main application and removes the old applications * Adds a comment to satisy Sonar * refactor: remove automatic log_summary() call from get_parser_registry() The summary was logged once per process, causing it to appear repeatedly during Docker startup (management commands, web server, each Celery worker subprocess). External parsers are already announced individually at INFO when discovered; the full summary is redundant noise. log_summary() is retained on ParserRegistry for manual/debug use. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Cleans up the duplicate test file/fixture * Fixes a race condition where webserver threads could race to populate the registry --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-23 06:55:23 +00:00 · 2026-03-22 06:53:32 -07:00
parent 07f54bfdab
commit 701735f6e5
41 changed files with 713 additions and 1295 deletions
@@ -32,9 +32,7 @@ from documents.models import DocumentType
 from documents.models import StoragePath
 from documents.models import Tag
 from documents.models import WorkflowTrigger
-from documents.parsers import DocumentParser
 from documents.parsers import ParseError
-from documents.parsers import get_parser_class_for_mime_type
 from documents.permissions import set_permissions_for_object
 from documents.plugins.base import AlwaysRunPluginMixin
 from documents.plugins.base import ConsumeTaskPlugin
@@ -52,40 +50,12 @@ from documents.utils import copy_basic_file_stats
 from documents.utils import copy_file_with_basic_stats
 from documents.utils import run_subprocess
 from paperless.parsers import ParserContext
-from paperless.parsers.mail import MailDocumentParser
-from paperless.parsers.remote import RemoteDocumentParser
-from paperless.parsers.tesseract import RasterisedDocumentParser
-from paperless.parsers.text import TextDocumentParser
-from paperless.parsers.tika import TikaDocumentParser
+from paperless.parsers import ParserProtocol
+from paperless.parsers.registry import get_parser_registry

 LOGGING_NAME: Final[str] = "paperless.consumer"


-def _parser_cleanup(parser: DocumentParser) -> None:
-    """
-    Call cleanup on a parser, handling the new-style context-manager parsers.
-
-    New-style parsers (e.g. TextDocumentParser) use __exit__ for teardown
-    instead of a cleanup() method.  This shim will be removed once all existing parsers
-    have switched to the new style and this consumer is updated to use it
-
-    TODO(stumpylog): Remove me in the future
-    """
-    if isinstance(
-        parser,
-        (
-            MailDocumentParser,
-            RasterisedDocumentParser,
-            RemoteDocumentParser,
-            TextDocumentParser,
-            TikaDocumentParser,
-        ),
-    ):
-        parser.__exit__(None, None, None)
-    else:
-        parser.cleanup()
-
-
 class WorkflowTriggerPlugin(
    NoCleanupPluginMixin,
    NoSetupPluginMixin,
@@ -422,8 +392,12 @@ class ConsumerPlugin(
                    self.log.error(f"Error attempting to clean PDF: {e}")

            # Based on the mime type, get the parser for that type
-            parser_class: type[DocumentParser] | None = get_parser_class_for_mime_type(
-                mime_type,
+            parser_class: type[ParserProtocol] | None = (
+                get_parser_registry().get_parser_for_file(
+                    mime_type,
+                    self.filename,
+                    self.working_copy,
+                )
            )
            if not parser_class:
                tempdir.cleanup()
@@ -446,313 +420,275 @@ class ConsumerPlugin(
                tempdir.cleanup()
            raise

-        def progress_callback(
-            current_progress,
-            max_progress,
-        ) -> None:  # pragma: no cover
-            # recalculate progress to be within 20 and 80
-            p = int((current_progress / max_progress) * 50 + 20)
-            self._send_progress(p, 100, ProgressStatusOptions.WORKING)
-
        # This doesn't parse the document yet, but gives us a parser.
-
-        document_parser: DocumentParser = parser_class(
-            self.logging_group,
-            progress_callback=progress_callback,
-        )
-
-        parser_is_new_style = isinstance(
-            document_parser,
-            (
-                MailDocumentParser,
-                RasterisedDocumentParser,
-                RemoteDocumentParser,
-                TextDocumentParser,
-                TikaDocumentParser,
-            ),
-        )
-
-        # New-style parsers use __enter__/__exit__ for resource management.
-        # _parser_cleanup (below) handles __exit__; call __enter__ here.
-        # TODO(stumpylog): Remove me in the future
-        if parser_is_new_style:
-            document_parser.__enter__()
-
-        self.log.debug(f"Parser: {type(document_parser).__name__}")
-
-        # Parse the document. This may take some time.
-
-        text = None
-        date = None
-        thumbnail = None
-        archive_path = None
-        page_count = None
-
-        try:
-            self._send_progress(
-                20,
-                100,
-                ProgressStatusOptions.WORKING,
-                ConsumerStatusShortMessage.PARSING_DOCUMENT,
+        with parser_class() as document_parser:
+            document_parser.configure(
+                ParserContext(mailrule_id=self.input_doc.mailrule_id),
            )
-            self.log.debug(f"Parsing {self.filename}...")

-            # TODO(stumpylog): Remove me in the future when all parsers use new protocol
-            if parser_is_new_style:
-                document_parser.configure(
-                    ParserContext(mailrule_id=self.input_doc.mailrule_id),
-                )
-                # TODO(stumpylog): Remove me in the future
-                document_parser.parse(self.working_copy, mime_type)
-            else:
-                document_parser.parse(self.working_copy, mime_type, self.filename)
+            self.log.debug(f"Parser: {document_parser.name} v{document_parser.version}")

-            self.log.debug(f"Generating thumbnail for {self.filename}...")
-            self._send_progress(
-                70,
-                100,
-                ProgressStatusOptions.WORKING,
-                ConsumerStatusShortMessage.GENERATING_THUMBNAIL,
-            )
-            # TODO(stumpylog): Remove me in the future when all parsers use new protocol
-            if parser_is_new_style:
-                thumbnail = document_parser.get_thumbnail(self.working_copy, mime_type)
-            else:
-                thumbnail = document_parser.get_thumbnail(
-                    self.working_copy,
-                    mime_type,
-                    self.filename,
-                )
+            # Parse the document. This may take some time.

-            text = document_parser.get_text()
-            date = document_parser.get_date()
-            if date is None:
+            text = None
+            date = None
+            thumbnail = None
+            archive_path = None
+            page_count = None
+
+            try:
                self._send_progress(
-                    90,
+                    20,
                    100,
                    ProgressStatusOptions.WORKING,
-                    ConsumerStatusShortMessage.PARSE_DATE,
+                    ConsumerStatusShortMessage.PARSING_DOCUMENT,
                )
-                with get_date_parser() as date_parser:
-                    date = next(date_parser.parse(self.filename, text), None)
-            archive_path = document_parser.get_archive_path()
-            page_count = document_parser.get_page_count(self.working_copy, mime_type)
+                self.log.debug(f"Parsing {self.filename}...")

-        except ParseError as e:
-            _parser_cleanup(document_parser)
-            if tempdir:
-                tempdir.cleanup()
-            self._fail(
-                str(e),
-                f"Error occurred while consuming document {self.filename}: {e}",
-                exc_info=True,
-                exception=e,
-            )
-        except Exception as e:
-            _parser_cleanup(document_parser)
-            if tempdir:
-                tempdir.cleanup()
-            self._fail(
-                str(e),
-                f"Unexpected error while consuming document {self.filename}: {e}",
-                exc_info=True,
-                exception=e,
-            )
+                document_parser.parse(self.working_copy, mime_type)

-        # Prepare the document classifier.
+                self.log.debug(f"Generating thumbnail for {self.filename}...")
+                self._send_progress(
+                    70,
+                    100,
+                    ProgressStatusOptions.WORKING,
+                    ConsumerStatusShortMessage.GENERATING_THUMBNAIL,
+                )
+                thumbnail = document_parser.get_thumbnail(self.working_copy, mime_type)

-        # TODO: I don't really like to do this here, but this way we avoid
-        #   reloading the classifier multiple times, since there are multiple
-        #   post-consume hooks that all require the classifier.
-
-        classifier = load_classifier()
-
-        self._send_progress(
-            95,
-            100,
-            ProgressStatusOptions.WORKING,
-            ConsumerStatusShortMessage.SAVE_DOCUMENT,
-        )
-        # now that everything is done, we can start to store the document
-        # in the system. This will be a transaction and reasonably fast.
-        try:
-            with transaction.atomic():
-                # store the document.
-                if self.input_doc.root_document_id:
-                    # If this is a new version of an existing document, we need
-                    # to make sure we're not creating a new document, but updating
-                    # the existing one.
-                    root_doc = Document.objects.get(
-                        pk=self.input_doc.root_document_id,
+                text = document_parser.get_text()
+                date = document_parser.get_date()
+                if date is None:
+                    self._send_progress(
+                        90,
+                        100,
+                        ProgressStatusOptions.WORKING,
+                        ConsumerStatusShortMessage.PARSE_DATE,
                    )
-                    original_document = self._create_version_from_root(
-                        root_doc,
-                        text=text,
-                        page_count=page_count,
-                        mime_type=mime_type,
-                    )
-                    actor = None
+                    with get_date_parser() as date_parser:
+                        date = next(date_parser.parse(self.filename, text), None)
+                archive_path = document_parser.get_archive_path()
+                page_count = document_parser.get_page_count(
+                    self.working_copy,
+                    mime_type,
+                )

-                    # Save the new version, potentially creating an audit log entry for the version addition if enabled.
-                    if (
-                        settings.AUDIT_LOG_ENABLED
-                        and self.metadata.actor_id is not None
-                    ):
-                        actor = User.objects.filter(pk=self.metadata.actor_id).first()
-                        if actor is not None:
-                            from auditlog.context import (  # type: ignore[import-untyped]
-                                set_actor,
-                            )
+            except ParseError as e:
+                if tempdir:
+                    tempdir.cleanup()
+                self._fail(
+                    str(e),
+                    f"Error occurred while consuming document {self.filename}: {e}",
+                    exc_info=True,
+                    exception=e,
+                )
+            except Exception as e:
+                if tempdir:
+                    tempdir.cleanup()
+                self._fail(
+                    str(e),
+                    f"Unexpected error while consuming document {self.filename}: {e}",
+                    exc_info=True,
+                    exception=e,
+                )

-                            with set_actor(actor):
+            # Prepare the document classifier.
+
+            # TODO: I don't really like to do this here, but this way we avoid
+            #   reloading the classifier multiple times, since there are multiple
+            #   post-consume hooks that all require the classifier.
+
+            classifier = load_classifier()
+
+            self._send_progress(
+                95,
+                100,
+                ProgressStatusOptions.WORKING,
+                ConsumerStatusShortMessage.SAVE_DOCUMENT,
+            )
+            # now that everything is done, we can start to store the document
+            # in the system. This will be a transaction and reasonably fast.
+            try:
+                with transaction.atomic():
+                    # store the document.
+                    if self.input_doc.root_document_id:
+                        # If this is a new version of an existing document, we need
+                        # to make sure we're not creating a new document, but updating
+                        # the existing one.
+                        root_doc = Document.objects.get(
+                            pk=self.input_doc.root_document_id,
+                        )
+                        original_document = self._create_version_from_root(
+                            root_doc,
+                            text=text,
+                            page_count=page_count,
+                            mime_type=mime_type,
+                        )
+                        actor = None
+
+                        # Save the new version, potentially creating an audit log entry for the version addition if enabled.
+                        if (
+                            settings.AUDIT_LOG_ENABLED
+                            and self.metadata.actor_id is not None
+                        ):
+                            actor = User.objects.filter(
+                                pk=self.metadata.actor_id,
+                            ).first()
+                            if actor is not None:
+                                from auditlog.context import (  # type: ignore[import-untyped]
+                                    set_actor,
+                                )
+
+                                with set_actor(actor):
+                                    original_document.save()
+                            else:
                                original_document.save()
                        else:
                            original_document.save()
+
+                        # Create a log entry for the version addition, if enabled
+                        if settings.AUDIT_LOG_ENABLED:
+                            from auditlog.models import (  # type: ignore[import-untyped]
+                                LogEntry,
+                            )
+
+                            LogEntry.objects.log_create(
+                                instance=root_doc,
+                                changes={
+                                    "Version Added": ["None", original_document.id],
+                                },
+                                action=LogEntry.Action.UPDATE,
+                                actor=actor,
+                                additional_data={
+                                    "reason": "Version added",
+                                    "version_id": original_document.id,
+                                },
+                            )
+                        document = original_document
                    else:
-                        original_document.save()
-
-                    # Create a log entry for the version addition, if enabled
-                    if settings.AUDIT_LOG_ENABLED:
-                        from auditlog.models import (  # type: ignore[import-untyped]
-                            LogEntry,
+                        document = self._store(
+                            text=text,
+                            date=date,
+                            page_count=page_count,
+                            mime_type=mime_type,
                        )

-                        LogEntry.objects.log_create(
-                            instance=root_doc,
-                            changes={
-                                "Version Added": ["None", original_document.id],
-                            },
-                            action=LogEntry.Action.UPDATE,
-                            actor=actor,
-                            additional_data={
-                                "reason": "Version added",
-                                "version_id": original_document.id,
-                            },
-                        )
-                    document = original_document
-                else:
-                    document = self._store(
-                        text=text,
-                        date=date,
-                        page_count=page_count,
-                        mime_type=mime_type,
-                    )
+                    # If we get here, it was successful. Proceed with post-consume
+                    # hooks. If they fail, nothing will get changed.

-                # If we get here, it was successful. Proceed with post-consume
-                # hooks. If they fail, nothing will get changed.
-
-                document_consumption_finished.send(
-                    sender=self.__class__,
-                    document=document,
-                    logging_group=self.logging_group,
-                    classifier=classifier,
-                    original_file=self.unmodified_original
-                    if self.unmodified_original
-                    else self.working_copy,
-                )
-
-                # After everything is in the database, copy the files into
-                # place. If this fails, we'll also rollback the transaction.
-                with FileLock(settings.MEDIA_LOCK):
-                    generated_filename = generate_unique_filename(document)
-                    if (
-                        len(str(generated_filename))
-                        > Document.MAX_STORED_FILENAME_LENGTH
-                    ):
-                        self.log.warning(
-                            "Generated source filename exceeds db path limit, falling back to default naming",
-                        )
-                        generated_filename = generate_filename(
-                            document,
-                            use_format=False,
-                        )
-                    document.filename = generated_filename
-                    create_source_path_directory(document.source_path)
-
-                    self._write(
-                        self.unmodified_original
-                        if self.unmodified_original is not None
+                    document_consumption_finished.send(
+                        sender=self.__class__,
+                        document=document,
+                        logging_group=self.logging_group,
+                        classifier=classifier,
+                        original_file=self.unmodified_original
+                        if self.unmodified_original
                        else self.working_copy,
-                        document.source_path,
                    )

-                    self._write(
-                        thumbnail,
-                        document.thumbnail_path,
-                    )
-
-                    if archive_path and Path(archive_path).is_file():
-                        generated_archive_filename = generate_unique_filename(
-                            document,
-                            archive_filename=True,
-                        )
+                    # After everything is in the database, copy the files into
+                    # place. If this fails, we'll also rollback the transaction.
+                    with FileLock(settings.MEDIA_LOCK):
+                        generated_filename = generate_unique_filename(document)
                        if (
-                            len(str(generated_archive_filename))
+                            len(str(generated_filename))
                            > Document.MAX_STORED_FILENAME_LENGTH
                        ):
                            self.log.warning(
-                                "Generated archive filename exceeds db path limit, falling back to default naming",
+                                "Generated source filename exceeds db path limit, falling back to default naming",
                            )
-                            generated_archive_filename = generate_filename(
+                            generated_filename = generate_filename(
                                document,
-                                archive_filename=True,
                                use_format=False,
                            )
-                        document.archive_filename = generated_archive_filename
-                        create_source_path_directory(document.archive_path)
+                        document.filename = generated_filename
+                        create_source_path_directory(document.source_path)
+
                        self._write(
-                            archive_path,
-                            document.archive_path,
+                            self.unmodified_original
+                            if self.unmodified_original is not None
+                            else self.working_copy,
+                            document.source_path,
                        )

-                        with Path(archive_path).open("rb") as f:
-                            document.archive_checksum = hashlib.md5(
-                                f.read(),
-                            ).hexdigest()
+                        self._write(
+                            thumbnail,
+                            document.thumbnail_path,
+                        )

-                # Don't save with the lock active. Saving will cause the file
-                # renaming logic to acquire the lock as well.
-                # This triggers things like file renaming
-                document.save()
+                        if archive_path and Path(archive_path).is_file():
+                            generated_archive_filename = generate_unique_filename(
+                                document,
+                                archive_filename=True,
+                            )
+                            if (
+                                len(str(generated_archive_filename))
+                                > Document.MAX_STORED_FILENAME_LENGTH
+                            ):
+                                self.log.warning(
+                                    "Generated archive filename exceeds db path limit, falling back to default naming",
+                                )
+                                generated_archive_filename = generate_filename(
+                                    document,
+                                    archive_filename=True,
+                                    use_format=False,
+                                )
+                            document.archive_filename = generated_archive_filename
+                            create_source_path_directory(document.archive_path)
+                            self._write(
+                                archive_path,
+                                document.archive_path,
+                            )

-                if document.root_document_id:
-                    document_updated.send(
-                        sender=self.__class__,
-                        document=document.root_document,
-                    )
+                            with Path(archive_path).open("rb") as f:
+                                document.archive_checksum = hashlib.md5(
+                                    f.read(),
+                                ).hexdigest()

-                # Delete the file only if it was successfully consumed
-                self.log.debug(f"Deleting original file {self.input_doc.original_file}")
-                self.input_doc.original_file.unlink()
-                self.log.debug(f"Deleting working copy {self.working_copy}")
-                self.working_copy.unlink()
-                if self.unmodified_original is not None:  # pragma: no cover
+                    # Don't save with the lock active. Saving will cause the file
+                    # renaming logic to acquire the lock as well.
+                    # This triggers things like file renaming
+                    document.save()
+
+                    if document.root_document_id:
+                        document_updated.send(
+                            sender=self.__class__,
+                            document=document.root_document,
+                        )
+
+                    # Delete the file only if it was successfully consumed
                    self.log.debug(
-                        f"Deleting unmodified original file {self.unmodified_original}",
+                        f"Deleting original file {self.input_doc.original_file}",
                    )
-                    self.unmodified_original.unlink()
+                    self.input_doc.original_file.unlink()
+                    self.log.debug(f"Deleting working copy {self.working_copy}")
+                    self.working_copy.unlink()
+                    if self.unmodified_original is not None:  # pragma: no cover
+                        self.log.debug(
+                            f"Deleting unmodified original file {self.unmodified_original}",
+                        )
+                        self.unmodified_original.unlink()

-                # https://github.com/jonaswinkler/paperless-ng/discussions/1037
-                shadow_file = (
-                    Path(self.input_doc.original_file).parent
-                    / f"._{Path(self.input_doc.original_file).name}"
+                    # https://github.com/jonaswinkler/paperless-ng/discussions/1037
+                    shadow_file = (
+                        Path(self.input_doc.original_file).parent
+                        / f"._{Path(self.input_doc.original_file).name}"
+                    )
+
+                    if Path(shadow_file).is_file():
+                        self.log.debug(f"Deleting shadow file {shadow_file}")
+                        Path(shadow_file).unlink()
+
+            except Exception as e:
+                self._fail(
+                    str(e),
+                    f"The following error occurred while storing document "
+                    f"{self.filename} after parsing: {e}",
+                    exc_info=True,
+                    exception=e,
                )
-
-                if Path(shadow_file).is_file():
-                    self.log.debug(f"Deleting shadow file {shadow_file}")
-                    Path(shadow_file).unlink()
-
-        except Exception as e:
-            self._fail(
-                str(e),
-                f"The following error occurred while storing document "
-                f"{self.filename} after parsing: {e}",
-                exc_info=True,
-                exception=e,
-            )
-        finally:
-            _parser_cleanup(document_parser)
-            tempdir.cleanup()
+            finally:
+                tempdir.cleanup()

        self.run_post_consume_script(document)