paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-07-01 17:54:25 +00:00

Author	SHA1	Message	Date
Trenton H	92c016ce47	Fix: Handle the UTF 16 and BOM text files better (#12994)	2026-06-13 05:35:38 -07:00
Trenton H	c232d443fa	Breaking: Decouple OCR control from archive file control (#12448) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>	2026-04-06 15:50:21 -07:00
Trenton H	a9756f9462	Chore: Convert Tesseract parser to plugin style (#12403)
* Move tesseract parser, tests, and samples to paperless.parsers
  Relocates files in preparation for the Phase 3 Protocol-based parser refactor, preserving full git history via rename.
  - src/paperless_tesseract/parsers.py -> src/paperless/parsers/tesseract.py
  - src/paperless_tesseract/tests/test_parser.py -> src/paperless/tests/parsers/test_tesseract_parser.py
  - src/paperless_tesseract/tests/test_parser_custom_settings.py -> src/paperless/tests/parsers/test_tesseract_custom_settings.py
  - src/paperless_tesseract/tests/samples/* -> src/paperless/tests/samples/tesseract/
  - Moves RUF001 suppression from broad per-file pyproject.toml ignore to inline noqa comments on the two affected lines
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor RasterisedDocumentParser to ParserProtocol interface
  - Add RasterisedDocumentParser to registry.register_defaults()
  - Update parser class: remove DocumentParser inheritance, add Protocol class attrs/classmethods/properties, context-manager lifecycle
  - Add read_file_handle_unicode_errors() to shared parsers/utils.py
  - Replace inline unicode-error-handling with shared utility call
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update tesseract signals.py to import from new parser location
  RasterisedDocumentParser moved to paperless.parsers.tesseract; update the lazy import in signals.get_parser so the signal-based consumer declaration continues to work during the registry transition. Pop logging_group and progress_callback kwargs for constructor compatibility.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* tests: rewrite test_tesseract_parser to pytest style with typed fixtures
  - Converts all tests from Django TestCase to pytest-style classes
  - Adds tesseract_samples_dir, null_app_config, tesseract_parser, and make_tesseract_parser fixtures in conftest.py; all DB-free except TestOcrmypdfParameters which uses @pytest.mark.django_db
  - Defines MakeTesseractParser type alias in conftest.py for autocomplete
  - Fixes FBT001 (boolean positional args) by making bool params keyword-only with * separator in parametrize test signatures
  - Adds type annotations to all fixture parameters for IDE support
  - Uses pytest.param(..., id="...") throughout; pytest-mock for patching
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(types): fully annotate paperless/parsers/tesseract.py
  Fixes all mypy and pyrefly errors in the new parser file:
  - Add missing type annotations to is_image, has_alpha, get_dpi, calculate_a4_dpi, construct_ocrmypdf_parameters, post_process_text
  - Narrow Path-only (no str) for image helper args; convert to str when building list[str] args for run_subprocess
  - Annotate ocrmypdf_args as dict[str, Any] so operator expressions on its values type-check and ocrmypdf.ocr(*args) resolves cleanly
  - Declare text: str | None = None at top of extract_text to unify all assignments to the same type across both branches
  - Import Any from typing
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  Fixes isort
* fix: add RasterisedDocumentParser to new-style parser shim checks
  The new RasterisedDocumentParser uses __enter__/__exit__ for resource management instead of cleanup(). Update all existing new-style shims to include it in the isinstance checks:
  - documents/consumer.py: _parser_cleanup(), parser_is_new_style
  - documents/tasks.py: parser_is_new_style, finally cleanup branch (also adds RemoteDocumentParser which was missing from the latter)
  - documents/management/commands/document_thumbnails.py: adds new-style handling from scratch (enter/exit + 2-arg get_thumbnail signature)
  Fix stale import paths in three test files that were still importing from paperless_tesseract.parsers instead of paperless.parsers.tesseract.
  Fix two registry tests that used application/pdf as a proxy for "no handler" — now that RasterisedDocumentParser is registered, PDF always has a handler, so switch to a truly unsupported MIME type.
  Signal infrastructure and shims remain intact; this is plumbing only.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* One missed import (cherry pick?)
* Adds a no cover for a special case of handling unicode errors in PDF metadata
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 12:46:07 -07:00
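Note on a9756f9462: the commit above describes moving the parser from an explicit cleanup() call to an __enter__/__exit__ lifecycle, with shims that use isinstance checks to pick the right teardown. The following is a minimal sketch of that pattern only, not the project's code. The class names LegacyParser and NewStyleParser, the function parser_cleanup, and the temp-directory resource are all hypothetical stand-ins.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


class LegacyParser:
    """Hypothetical old-style parser: the caller must remember cleanup()."""

    def __init__(self) -> None:
        self.tempdir = Path(tempfile.mkdtemp(prefix="legacy-"))

    def cleanup(self) -> None:
        shutil.rmtree(self.tempdir, ignore_errors=True)


class NewStyleParser:
    """Hypothetical new-style parser: resources live only inside a with-block."""

    def __init__(self) -> None:
        self.tempdir: Path | None = None

    def __enter__(self) -> NewStyleParser:
        self.tempdir = Path(tempfile.mkdtemp(prefix="new-"))
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions alike.
        if self.tempdir is not None:
            shutil.rmtree(self.tempdir, ignore_errors=True)
            self.tempdir = None


def parser_cleanup(parser: object) -> None:
    """Illustrative shim: choose the teardown call by parser style."""
    if isinstance(parser, NewStyleParser):
        parser.__exit__(None, None, None)
    else:
        parser.cleanup()
```

In this sketch a caller that already owns the parser would normally write `with NewStyleParser() as parser: ...`; the shim is only for call sites that still hold a parser instance outside a with-block during a transition.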
Trenton H	2cbe6ae892	Feature: Convert remote AI parser to plugin system (#12334)
* Refactor: move remote parser, test, and sample to paperless.parsers
  Relocates three files to their new homes in the parser plugin system:
  - src/paperless_remote/parsers.py → src/paperless/parsers/remote.py
  - src/paperless_remote/tests/test_parser.py → src/paperless/tests/parsers/test_remote_parser.py
  - src/paperless_remote/tests/samples/simple-digital.pdf → src/paperless/tests/samples/remote/simple-digital.pdf
  Content and imports will be updated in the follow-up commit that rewrites the parser to the new ParserProtocol interface.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Feature: migrate RemoteDocumentParser to ParserProtocol interface
  Rewrites the remote OCR parser to the new plugin system contract:
  - `supported_mime_types()` is now a classmethod that always returns the full set of 7 MIME types; the old instance-method hack (returning {} when unconfigured) is removed
  - `score()` classmethod returns None when no remote engine is configured (making the parser invisible to the registry), and 20 when active — higher than the tesseract default of 10 so the remote engine takes priority when both are available
  - No longer inherits from RasterisedDocumentParser; inherits no parser class at all — just implements the protocol directly
  - `can_produce_archive = True`; `requires_pdf_rendition = False`
  - `_azure_ai_vision_parse()` takes explicit config arg; API client created and closed within the method
  - `get_page_count()` returns the PDF page count for application/pdf, delegating to the new `get_page_count_for_pdf()` utility
  - `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs; returns [] for all other MIME types
  New files:
  - `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and `get_page_count_for_pdf()` utilities (pikepdf-based); both the remote and tesseract parsers will use these going forward
  - `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style tests using pytest-django `settings` and pytest-mock `mocker` fixtures
  - `src/paperless/tests/parsers/conftest.py` — remote parser instance, sample-file, and settings-helper fixtures
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: use fixture factory and usefixtures in remote parser tests
  - `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture in conftest.py; tests call `make_azure_mock()` or `make_azure_mock("custom text")` instead of a module-level function
  - `azure_settings` and `no_engine_settings` applied via `@pytest.mark.usefixtures` wherever their value is not referenced inside the test body; `TestRemoteParserParseError` marked at the class level since all three tests need the same setting
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: improve remote parser test fixture structure
  - make_azure_mock moved from conftest.py back into test_remote_parser.py; it is specific to that module and does not belong in shared fixtures
  - azure_client fixture composes azure_settings + make_azure_mock + patch in one step; tests no longer repeat the mocker.patch call or carry an unused azure_settings parameter
  - failing_azure_client fixture similarly composes azure_settings + patch with a RuntimeError side effect; TestRemoteParserParseError now only receives the mock it actually uses
  - All @pytest.mark.parametrize calls use pytest.param with explicit ids (pdf, png, jpeg, ...) for readable test output
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: wire RemoteDocumentParser into consumer and fix signals
  - paperless_remote/signals.py: import from paperless.parsers.remote (new location after git mv). supported_mime_types() is now a classmethod that always returns the full set, so get_supported_mime_types() in the signal layer explicitly checks RemoteEngineConfig validity and returns {} when unconfigured — preserving the old behaviour where an unconfigured remote parser does not register for any MIME types.
  - documents/consumer.py: extend the _parser_cleanup() shim, parse() dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser alongside TextDocumentParser. Both new-style parsers use __exit__ for cleanup and take (document_path, mime_type) without a file_name argument.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: fix type errors in remote parser and signals
  - remote.py: add `if TYPE_CHECKING: assert` guards before the Azure client construction to narrow config.endpoint and config.api_key from str|None to str. The narrowing is safe: engine_is_valid() guarantees both are non-None when it returns True (api_key explicitly; endpoint via `not (engine=="azureai" and endpoint is None)` for the only valid engine). Asserts are wrapped in TYPE_CHECKING so they carry zero runtime cost.
  - signals.py: add full type annotations — return types, Any-typed sender parameter, and explicit logging_group argument replacing args. Add `from __future__ import annotations` for consistent annotation style.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  Fix: get_parser factory forwards logging_group, drops progress_callback
  consumer.py calls parser_class(logging_group, progress_callback=...). RemoteDocumentParser.__init__ accepts logging_group but not progress_callback, so only the latter is dropped — matching the pattern established by the TextDocumentParser signals shim.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: text parser get_parser forwards logging_group, drops progress_callback
  TextDocumentParser.__init__ accepts logging_group: object = None, same as RemoteDocumentParser. The old shim incorrectly dropped it; fix to forward it as a positional arg and only drop progress_callback. Add type annotations and from __future__ import annotations for consistency with the remote parser signals shim.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 16:19:46 -07:00
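Note on 2cbe6ae892: the commit above says score() returns None when no remote engine is configured, which hides the parser from the registry, and returns 20 when active, which beats the tesseract default of 10. A rough sketch of how such score-based selection could work is shown here. Only the values None, 10, and 20 come from the commit message. The classes TesseractLike and RemoteLike, the configured flag, the pick_parser function, and the simplified one-argument score() signature are assumptions made for illustration, not the project's actual registry API.

```python
from __future__ import annotations


class TesseractLike:
    """Hypothetical stand-in for a local OCR parser with a fixed score."""

    @classmethod
    def supported_mime_types(cls) -> set[str]:
        return {"application/pdf", "image/png"}

    @classmethod
    def score(cls, mime_type: str) -> int | None:
        return 10  # default priority named in the commit message


class RemoteLike:
    """Hypothetical stand-in for a remote parser that may be unconfigured."""

    configured = False  # assumption: stands in for a real config check

    @classmethod
    def supported_mime_types(cls) -> set[str]:
        # Always advertises the full set, configured or not.
        return {"application/pdf", "image/png"}

    @classmethod
    def score(cls, mime_type: str) -> int | None:
        # None means "do not consider me"; 20 outranks the local parser.
        return 20 if cls.configured else None


def pick_parser(candidates: list[type], mime_type: str) -> type | None:
    """Return the highest-scoring candidate, skipping None scores."""
    best: type | None = None
    best_score: int | None = None
    for candidate in candidates:
        if mime_type not in candidate.supported_mime_types():
            continue
        score = candidate.score(mime_type)
        if score is None:
            continue
        if best_score is None or score > best_score:
            best, best_score = candidate, score
    return best
```

With both classes registered, `pick_parser([TesseractLike, RemoteLike], "application/pdf")` returns TesseractLike while RemoteLike.configured is False, and RemoteLike once it is set to True. Keeping supported_mime_types() constant and moving the availability decision into score() is the design choice the commit message describes.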

4 Commits