Chore: Drop old signal and unneeded apps, transition to parser registry instead (#12405)

* refactor: switch consumer and callers to ParserRegistry (Phase 4)

Replace all Django signal-based parser discovery with direct registry
calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all
old-style isinstance checks. All parser instantiation now uses the
`with parser_class() as parser:` context manager pattern.

- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type
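
The context manager pattern named above can be sketched as follows. This is a minimal illustration, not the project's code: `SketchParser` and `consume` are hypothetical names, and the only thing taken from the commit message is the `with parser_class() as parser:` shape.

```python
from pathlib import Path
from typing import Optional, Type


class SketchParser:
    """Hypothetical parser; not one of the real paperless parser classes."""

    def __init__(self) -> None:
        self.text: Optional[str] = None
        self.closed = False

    def __enter__(self) -> "SketchParser":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on success and on error, so callers need no separate cleanup shim.
        self.closed = True

    def parse(self, path: Path, mime_type: str) -> None:
        # A real parser would read the file; the sketch only records its name.
        self.text = f"parsed {path.name} as {mime_type}"

    def get_text(self) -> Optional[str]:
        return self.text


def consume(parser_class: Type[SketchParser], path: Path, mime_type: str) -> Optional[str]:
    # The pattern from the commit message: instantiate inside `with`.
    with parser_class() as parser:
        parser.parse(path, mime_type)
        return parser.get_text()
```

Because cleanup lives in `__exit__`, every call site gets the same teardown behavior without an explicit cleanup call.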

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: drop get_parser_class_for_mime_type; callers use registry directly

All callers now call get_parser_registry().get_parser_for_file() with
the actual filename and path, enabling score() to use file extension
hints. The MIME-only helper is removed.

- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior);
  TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry
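
Why passing the filename matters can be sketched with a toy registry. Everything here is an assumption made for illustration: the `Sketch*` classes, the argument order of `get_parser_for_file()`, and the numeric scores are hypothetical and do not reproduce the real `ParserRegistry`.

```python
from pathlib import Path
from typing import Optional, Sequence


class SketchPdfParser:
    @classmethod
    def score(cls, mime_type: str, filename: str, path: Optional[Path] = None) -> Optional[int]:
        if mime_type == "application/pdf":
            return 10
        # Extension hint: a generic MIME type can still be claimed by filename.
        if mime_type == "application/octet-stream" and filename.lower().endswith(".pdf"):
            return 5
        return None


class SketchPlainTextParser:
    @classmethod
    def score(cls, mime_type: str, filename: str, path: Optional[Path] = None) -> Optional[int]:
        return 10 if mime_type == "text/plain" else None


class SketchRegistry:
    def __init__(self, parsers: Sequence[type]) -> None:
        self._parsers = list(parsers)

    def get_parser_for_file(
        self,
        mime_type: str,
        filename: str,
        path: Optional[Path] = None,
    ) -> Optional[type]:
        # Pick the parser with the highest score; None means "cannot handle".
        best, best_score = None, None
        for parser in self._parsers:
            candidate = parser.score(mime_type, filename, path)
            if candidate is not None and (best_score is None or candidate > best_score):
                best, best_score = parser, candidate
        return best
```

With `filename=""`, as the internal helpers pass, the extension hint cannot fire and only the MIME type decides.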

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove document_consumer_declaration signal infrastructure

Remove the document_consumer_declaration signal that was previously used
for parser registration. Each parser app no longer connects to this signal,
and the signal declaration itself has been removed from documents/signals.

Changes:
- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
  - paperless_tesseract/signals.py
  - paperless_text/signals.py
  - paperless_tika/signals.py
  - paperless_mail/signals.py
  - paperless_remote/signals.py

Parser discovery now happens exclusively through the ParserRegistry
system introduced in the previous refactor phases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove empty paperless_text and paperless_tika Django apps

After parser classes were moved to paperless/parsers/ in the plugin
refactor, these Django apps contained only empty AppConfig classes
with no models, views, tasks, migrations, or other functionality.

- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files

paperless_remote app is retained as it contains meaningful system
checks for Azure AI configuration.
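
The kind of configuration check referred to here can be sketched without Django. The setting names, the `"azureai"` engine value, and the message text are taken from the tests added in this commit; `SketchSettings` and `check_remote_parser_sketch` are hypothetical, and the real check returns Django check messages rather than plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical stand-in for the relevant Django settings."""

    REMOTE_OCR_ENGINE: Optional[str] = None
    REMOTE_OCR_API_KEY: Optional[str] = None
    REMOTE_OCR_ENDPOINT: Optional[str] = None


def check_remote_parser_sketch(settings: SketchSettings) -> List[str]:
    # Only complain when the Azure engine is selected but incompletely configured.
    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
        settings.REMOTE_OCR_API_KEY and settings.REMOTE_OCR_ENDPOINT
    ):
        return [
            "Azure AI remote parser requires endpoint and API key to be configured.",
        ]
    return []
```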

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Moves the checks and tests to the main application and removes the old applications

* Adds a comment to satisfy Sonar

* refactor: remove automatic log_summary() call from get_parser_registry()

The summary was logged once per process, causing it to appear repeatedly
during Docker startup (management commands, web server, each Celery
worker subprocess). External parsers are already announced individually
at INFO when discovered; the full summary is redundant noise.
log_summary() is retained on ParserRegistry for manual/debug use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Cleans up the duplicate test file/fixture

* Fixes a race condition where webserver threads could race to populate the registry
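
One common way to close this kind of race is double-checked locking around the lazy initialization. The sketch below is an assumption about the general technique, not the fix actually applied in this commit; `LazyRegistry`, `get_registry_sketch`, and `discover` are hypothetical names.

```python
import threading
from typing import Optional


class LazyRegistry:
    """Hypothetical registry that counts how often discovery runs."""

    def __init__(self) -> None:
        self.discover_calls = 0

    def discover(self) -> None:
        self.discover_calls += 1


_registry: Optional[LazyRegistry] = None
_registry_lock = threading.Lock()


def get_registry_sketch() -> LazyRegistry:
    global _registry
    if _registry is None:  # fast path once the registry exists
        with _registry_lock:
            if _registry is None:  # re-check: another thread may have won
                registry = LazyRegistry()
                registry.discover()
                # Publish only after the registry is fully populated.
                _registry = registry
    return _registry
```

Assigning the module-level name last means no thread can observe a registry that is still being populated.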

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Trenton H
2026-03-22 06:53:32 -07:00
committed by GitHub
parent 07f54bfdab
commit 701735f6e5
41 changed files with 713 additions and 1295 deletions
-29
@@ -90,35 +90,6 @@ def text_parser() -> Generator[TextDocumentParser, None, None]:
yield parser
-# ------------------------------------------------------------------
-# Remote parser sample files
-# ------------------------------------------------------------------
-@pytest.fixture(scope="session")
-def remote_samples_dir(samples_dir: Path) -> Path:
-"""Absolute path to the remote parser sample files directory.
-Returns
--------
-Path
-``<samples_dir>/remote/``
-"""
-return samples_dir / "remote"
-@pytest.fixture(scope="session")
-def sample_pdf_file(remote_samples_dir: Path) -> Path:
-"""Path to a simple digital PDF sample file.
-Returns
--------
-Path
-Absolute path to ``remote/simple-digital.pdf``.
-"""
-return remote_samples_dir / "simple-digital.pdf"
# ------------------------------------------------------------------
# Remote parser instance
# ------------------------------------------------------------------
@@ -277,20 +277,20 @@ class TestRemoteParserParse:
def test_parse_returns_text_from_azure(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
azure_client: Mock,
) -> None:
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
assert remote_parser.get_text() == _DEFAULT_TEXT
def test_parse_sets_archive_path(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
azure_client: Mock,
) -> None:
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
archive = remote_parser.get_archive_path()
assert archive is not None
@@ -300,11 +300,11 @@ class TestRemoteParserParse:
def test_parse_closes_client_on_success(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
azure_client: Mock,
) -> None:
remote_parser.configure(ParserContext())
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
azure_client.close.assert_called_once()
@@ -312,9 +312,9 @@ class TestRemoteParserParse:
def test_parse_sets_empty_text_when_not_configured(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
) -> None:
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
assert remote_parser.get_text() == ""
assert remote_parser.get_archive_path() is None
@@ -328,10 +328,10 @@ class TestRemoteParserParse:
def test_get_date_always_none(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
azure_client: Mock,
) -> None:
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
assert remote_parser.get_date() is None
@@ -345,33 +345,33 @@ class TestRemoteParserParseError:
def test_parse_returns_none_on_azure_error(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
failing_azure_client: Mock,
) -> None:
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
assert remote_parser.get_text() is None
def test_parse_closes_client_on_error(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
failing_azure_client: Mock,
) -> None:
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
failing_azure_client.close.assert_called_once()
def test_parse_logs_error_on_azure_failure(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
failing_azure_client: Mock,
mocker: MockerFixture,
) -> None:
mock_log = mocker.patch("paperless.parsers.remote.logger")
-remote_parser.parse(sample_pdf_file, "application/pdf")
+remote_parser.parse(simple_digital_pdf_file, "application/pdf")
mock_log.error.assert_called_once()
assert "Azure AI Vision parsing failed" in mock_log.error.call_args[0][0]
@@ -386,18 +386,18 @@ class TestRemoteParserPageCount:
def test_page_count_for_pdf(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
) -> None:
-count = remote_parser.get_page_count(sample_pdf_file, "application/pdf")
+count = remote_parser.get_page_count(simple_digital_pdf_file, "application/pdf")
assert isinstance(count, int)
assert count >= 1
def test_page_count_returns_none_for_image_mime(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
) -> None:
-count = remote_parser.get_page_count(sample_pdf_file, "image/png")
+count = remote_parser.get_page_count(simple_digital_pdf_file, "image/png")
assert count is None
def test_page_count_returns_none_for_invalid_pdf(
@@ -420,25 +420,31 @@ class TestRemoteParserMetadata:
def test_extract_metadata_non_pdf_returns_empty(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
) -> None:
-result = remote_parser.extract_metadata(sample_pdf_file, "image/png")
+result = remote_parser.extract_metadata(simple_digital_pdf_file, "image/png")
assert result == []
def test_extract_metadata_pdf_returns_list(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
) -> None:
-result = remote_parser.extract_metadata(sample_pdf_file, "application/pdf")
+result = remote_parser.extract_metadata(
+simple_digital_pdf_file,
+"application/pdf",
+)
assert isinstance(result, list)
def test_extract_metadata_pdf_entries_have_required_keys(
self,
remote_parser: RemoteDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
) -> None:
-result = remote_parser.extract_metadata(sample_pdf_file, "application/pdf")
+result = remote_parser.extract_metadata(
+simple_digital_pdf_file,
+"application/pdf",
+)
for entry in result:
assert "namespace" in entry
assert "prefix" in entry
@@ -77,10 +77,10 @@ class TestTikaParserRegistryInterface:
def test_get_page_count_returns_int_with_pdf_archive(
self,
tika_parser: TikaDocumentParser,
-sample_pdf_file: Path,
+simple_digital_pdf_file: Path,
) -> None:
-tika_parser._archive_path = sample_pdf_file
-count = tika_parser.get_page_count(sample_pdf_file, "application/pdf")
+tika_parser._archive_path = simple_digital_pdf_file
+count = tika_parser.get_page_count(simple_digital_pdf_file, "application/pdf")
assert isinstance(count, int)
assert count > 0
+116
@@ -5,6 +5,7 @@ from pathlib import Path
from unittest import mock
import pytest
+from django.core.checks import ERROR
from django.core.checks import Error
from django.core.checks import Warning
from pytest_django.fixtures import SettingsWrapper
@@ -12,7 +13,9 @@ from pytest_mock import MockerFixture
from paperless.checks import audit_log_check
from paperless.checks import binaries_check
+from paperless.checks import check_default_language_available
from paperless.checks import check_deprecated_db_settings
+from paperless.checks import check_remote_parser_configured
from paperless.checks import check_v3_minimum_upgrade_version
from paperless.checks import debug_mode_check
from paperless.checks import paths_check
@@ -626,3 +629,116 @@ class TestV3MinimumUpgradeVersionCheck:
conn.introspection.table_names.side_effect = OperationalError("DB unavailable")
mocker.patch.dict("paperless.checks.connections", {"default": conn})
assert check_v3_minimum_upgrade_version(None) == []
+class TestRemoteParserChecks:
+def test_no_engine(self, settings: SettingsWrapper) -> None:
+settings.REMOTE_OCR_ENGINE = None
+msgs = check_remote_parser_configured(None)
+assert len(msgs) == 0
+def test_azure_no_endpoint(self, settings: SettingsWrapper) -> None:
+settings.REMOTE_OCR_ENGINE = "azureai"
+settings.REMOTE_OCR_API_KEY = "somekey"
+settings.REMOTE_OCR_ENDPOINT = None
+msgs = check_remote_parser_configured(None)
+assert len(msgs) == 1
+msg = msgs[0]
+assert (
+"Azure AI remote parser requires endpoint and API key to be configured."
+in msg.msg
+)
+class TestTesseractChecks:
+def test_default_language(self) -> None:
+check_default_language_available(None)
+def test_no_language(self, settings: SettingsWrapper) -> None:
+settings.OCR_LANGUAGE = ""
+msgs = check_default_language_available(None)
+assert len(msgs) == 1
+msg = msgs[0]
+assert (
+"No OCR language has been specified with PAPERLESS_OCR_LANGUAGE" in msg.msg
+)
+def test_invalid_language(
+self,
+settings: SettingsWrapper,
+mocker: MockerFixture,
+) -> None:
+settings.OCR_LANGUAGE = "ita"
+tesser_lang_mock = mocker.patch("paperless.checks.get_tesseract_langs")
+tesser_lang_mock.return_value = ["deu", "eng"]
+msgs = check_default_language_available(None)
+assert len(msgs) == 1
+msg = msgs[0]
+assert msg.level == ERROR
+assert "The selected ocr language ita is not installed" in msg.msg
+def test_multi_part_language(
+self,
+settings: SettingsWrapper,
+mocker: MockerFixture,
+) -> None:
+"""
+GIVEN:
+- An OCR language which is multi part (i.e. chi_sim)
+- The language is correctly formatted
+WHEN:
+- Installed packages are checked
+THEN:
+- No errors are reported
+"""
+settings.OCR_LANGUAGE = "chi_sim"
+tesser_lang_mock = mocker.patch("paperless.checks.get_tesseract_langs")
+tesser_lang_mock.return_value = ["chi_sim", "eng"]
+msgs = check_default_language_available(None)
+assert len(msgs) == 0
+def test_multi_part_language_bad_format(
+self,
+settings: SettingsWrapper,
+mocker: MockerFixture,
+) -> None:
+"""
+GIVEN:
+- An OCR language which is multi part (i.e. chi-sim)
+- The language is NOT correctly formatted
+WHEN:
+- Installed packages are checked
+THEN:
+- An error is reported
+"""
+settings.OCR_LANGUAGE = "chi-sim"
+tesser_lang_mock = mocker.patch("paperless.checks.get_tesseract_langs")
+tesser_lang_mock.return_value = ["chi_sim", "eng"]
+msgs = check_default_language_available(None)
+assert len(msgs) == 1
+msg = msgs[0]
+assert msg.level == ERROR
+assert "The selected ocr language chi-sim is not installed" in msg.msg
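
The behavior these Tesseract tests pin down can be sketched as a plain function. This is an assumption-laden illustration: `check_language_sketch` is a hypothetical name, the message strings are only the substrings asserted in the tests above, and splitting on `+` follows Tesseract's convention for combining languages rather than anything shown in this diff.

```python
from typing import List


def check_language_sketch(ocr_language: str, installed: List[str]) -> List[str]:
    """Return one message per problem found; an empty list means the setting is fine."""
    if not ocr_language:
        return ["No OCR language has been specified with PAPERLESS_OCR_LANGUAGE"]
    # Assumed: several languages may be combined as "deu+eng".
    requested = ocr_language.split("+")
    return [
        f"The selected ocr language {lang} is not installed"
        for lang in requested
        if lang not in installed
    ]
```

An exact string comparison is what makes `chi-sim` fail while `chi_sim` passes: the hyphenated form simply does not appear in the installed list.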