Commit Graph

152 Commits

Author SHA1 Message Date
Trenton H
c232d443fa Breaking: Decouple OCR control from archive file control (#12448)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-04-06 15:50:21 -07:00
Trenton H
f32ad98d8e Feature: Update consumer logging to include task ID for log correlation (#12510) 2026-04-03 13:31:40 -07:00
Trenton H
aed9abe48c Feature: Replace Whoosh with tantivy search backend (#12471)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
2026-04-02 12:38:22 -07:00
Trenton H
9383471fa0 Feature: Transition all checksums to use SHA256 (#12432) 2026-03-26 11:28:02 -07:00
Trenton H
8efb01010c fix: Don't silently drop the change_groups and switch to a couple of slightly more efficient implementations (#12431) 2026-03-26 14:15:42 +00:00
Trenton H
701735f6e5 Chore: Drop old signal and unneeded apps, transition to parser registry instead (#12405)
* refactor: switch consumer and callers to ParserRegistry (Phase 4)

Replace all Django signal-based parser discovery with direct registry
calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all
old-style isinstance checks. All parser instantiation now uses the
`with parser_class() as parser:` context manager pattern.

- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
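
A minimal sketch of the `with parser_class() as parser:` pattern named above. `SketchParser`, `extract_text`, and the scratch-directory behaviour are hypothetical stand-ins for illustration, not the project's actual parser classes or consumer code.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class SketchParser:
    """Hypothetical parser that owns a scratch directory while it is in use."""

    def __init__(self) -> None:
        self.tempdir: Optional[Path] = None
        self.text: Optional[str] = None

    def __enter__(self) -> "SketchParser":
        # Acquire resources on entry rather than in the constructor.
        self.tempdir = Path(tempfile.mkdtemp(prefix="sketch-parser-"))
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even when parse() raises, so no separate cleanup() call is needed.
        if self.tempdir is not None:
            shutil.rmtree(self.tempdir, ignore_errors=True)
            self.tempdir = None

    def parse(self, document_path: Path, mime_type: str) -> None:
        # Stand-in for real parsing: read the file as text.
        self.text = document_path.read_text(encoding="utf-8", errors="replace")


def extract_text(parser_class: type, path: Path, mime_type: str) -> Optional[str]:
    # The caller never calls cleanup; leaving the block releases the scratch dir.
    with parser_class() as parser:
        parser.parse(path, mime_type)
        return parser.text
```

Because `__exit__` returns `None`, an exception raised inside the block still propagates to the caller after the scratch directory is removed.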

* refactor: drop get_parser_class_for_mime_type; callers use registry directly

All callers now call get_parser_registry().get_parser_for_file() with
the actual filename and path, enabling score() to use file extension
hints. The MIME-only helper is removed.

- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior);
  TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
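
A rough sketch of why passing the filename matters for scoring. The commit only states that `get_parser_for_file()` receives the filename and path so `score()` can use extension hints; the signatures, the "None declines, highest score wins" convention, and both parser classes below are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


class PlainTextSketch:
    """Hypothetical parser that accepts any text/plain file."""

    @classmethod
    def score(cls, mime_type: str, filename: str, path: Optional[Path] = None) -> Optional[int]:
        # Returning None means "this parser declines the file".
        return 10 if mime_type == "text/plain" else None


class CsvSketch:
    """Hypothetical parser that relies on the file extension as a hint."""

    @classmethod
    def score(cls, mime_type: str, filename: str, path: Optional[Path] = None) -> Optional[int]:
        # CSV files are frequently detected as text/plain, so MIME alone is not enough.
        if mime_type in {"text/plain", "text/csv"} and filename.lower().endswith(".csv"):
            return 20
        return None


class SketchRegistry:
    def __init__(self, parsers) -> None:
        self._parsers = list(parsers)

    def get_parser_for_file(self, mime_type: str, filename: str, path: Optional[Path] = None):
        # Pick the parser with the highest non-None score, or None if all decline.
        best, best_score = None, None
        for parser in self._parsers:
            current = parser.score(mime_type, filename, path)
            if current is not None and (best_score is None or current > best_score):
                best, best_score = parser, current
        return best
```

With an empty filename, as in the internal helpers that pass `filename=""`, the extension hint simply never fires and the MIME-only match is used.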

* refactor: remove document_consumer_declaration signal infrastructure

Remove the document_consumer_declaration signal that was previously used
for parser registration. Each parser app no longer connects to this signal,
and the signal declaration itself has been removed from documents/signals.

Changes:
- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
  - paperless_tesseract/signals.py
  - paperless_text/signals.py
  - paperless_tika/signals.py
  - paperless_mail/signals.py
  - paperless_remote/signals.py

Parser discovery now happens exclusively through the ParserRegistry
system introduced in the previous refactor phases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove empty paperless_text and paperless_tika Django apps

After parser classes were moved to paperless/parsers/ in the plugin
refactor, these Django apps contained only empty AppConfig classes
with no models, views, tasks, migrations, or other functionality.

- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files

paperless_remote app is retained as it contains meaningful system
checks for Azure AI configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Moves the checks and tests to the main application and removes the old applications

* Adds a comment to satisfy Sonar

* refactor: remove automatic log_summary() call from get_parser_registry()

The summary was logged once per process, causing it to appear repeatedly
during Docker startup (management commands, web server, each Celery
worker subprocess). External parsers are already announced individually
at INFO when discovered; the full summary is redundant noise.
log_summary() is retained on ParserRegistry for manual/debug use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Cleans up the duplicate test file/fixture

* Fixes a race condition where webserver threads could race to populate the registry

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22 06:53:32 -07:00
Trenton H
a9756f9462 Chore: Convert Tesseract parser to plugin style (#12403)
* Move tesseract parser, tests, and samples to paperless.parsers

Relocates files in preparation for the Phase 3 Protocol-based parser
refactor, preserving full git history via rename.

- src/paperless_tesseract/parsers.py -> src/paperless/parsers/tesseract.py
- src/paperless_tesseract/tests/test_parser.py -> src/paperless/tests/parsers/test_tesseract_parser.py
- src/paperless_tesseract/tests/test_parser_custom_settings.py -> src/paperless/tests/parsers/test_tesseract_custom_settings.py
- src/paperless_tesseract/tests/samples/* -> src/paperless/tests/samples/tesseract/
- Moves RUF001 suppression from broad per-file pyproject.toml ignore to inline noqa comments on the two affected lines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor RasterisedDocumentParser to ParserProtocol interface

- Add RasterisedDocumentParser to registry.register_defaults()
- Update parser class: remove DocumentParser inheritance, add Protocol
  class attrs/classmethods/properties, context-manager lifecycle
- Add read_file_handle_unicode_errors() to shared parsers/utils.py
- Replace inline unicode-error-handling with shared utility call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update tesseract signals.py to import from new parser location

RasterisedDocumentParser moved to paperless.parsers.tesseract; update
the lazy import in signals.get_parser so the signal-based consumer
declaration continues to work during the registry transition. Pop
logging_group and progress_callback kwargs for constructor compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests: rewrite test_tesseract_parser to pytest style with typed fixtures

- Converts all tests from Django TestCase to pytest-style classes
- Adds tesseract_samples_dir, null_app_config, tesseract_parser, and
  make_tesseract_parser fixtures in conftest.py; all DB-free except
  TestOcrmypdfParameters which uses @pytest.mark.django_db
- Defines MakeTesseractParser type alias in conftest.py for autocomplete
- Fixes FBT001 (boolean positional args) by making bool params
  keyword-only with * separator in parametrize test signatures
- Adds type annotations to all fixture parameters for IDE support
- Uses pytest.param(..., id="...") throughout; pytest-mock for patching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
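
For context on the FBT001 fix: the rule flags boolean parameters that can be passed positionally, and a bare `*` in the signature makes everything after it keyword-only. The helper below is a hypothetical illustration of that signature change, not code from the test suite.

```python
def build_args(path: str, *, force_ocr: bool, skip_archive: bool) -> list:
    """Hypothetical helper; the parameter and flag names are illustrative only."""
    args = [path]
    # Booleans after the bare * must be named at the call site,
    # so a call like build_args("a.pdf", True, False) is rejected.
    if force_ocr:
        args.append("--force-ocr")
    if skip_archive:
        args.append("--skip-archive")
    return args
```

A call then reads `build_args("a.pdf", force_ocr=True, skip_archive=False)`, which keeps the meaning of each flag visible where it is used.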

* fix(types): fully annotate paperless/parsers/tesseract.py

Fixes all mypy and pyrefly errors in the new parser file:

- Add missing type annotations to is_image, has_alpha, get_dpi,
  calculate_a4_dpi, construct_ocrmypdf_parameters, post_process_text
- Narrow Path-only (no str) for image helper args; convert to str when
  building list[str] args for run_subprocess
- Annotate ocrmypdf_args as dict[str, Any] so operator expressions on
  its values type-check and ocrmypdf.ocr(**args) resolves cleanly
- Declare text: str | None = None at top of extract_text to unify
  all assignments to the same type across both branches
- Import Any from typing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fixes isort

* fix: add RasterisedDocumentParser to new-style parser shim checks

The new RasterisedDocumentParser uses __enter__/__exit__ for resource
management instead of cleanup(). Update all existing new-style shims to
include it in the isinstance checks:

- documents/consumer.py: _parser_cleanup(), parser_is_new_style
- documents/tasks.py: parser_is_new_style, finally cleanup branch
  (also adds RemoteDocumentParser which was missing from the latter)
- documents/management/commands/document_thumbnails.py: adds new-style
  handling from scratch (enter/exit + 2-arg get_thumbnail signature)

Fix stale import paths in three test files that were still importing
from paperless_tesseract.parsers instead of paperless.parsers.tesseract.

Fix two registry tests that used application/pdf as a proxy for "no
handler" — now that RasterisedDocumentParser is registered, PDF always
has a handler, so switch to a truly unsupported MIME type.

Signal infrastructure and shims remain intact; this is plumbing only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* One missed import (cherry pick?)

* Adds a no cover for a special case of handling unicode errors in PDF metadata

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 12:46:07 -07:00
Trenton H
c2b8b22fb4 Chore: Convert mail parser to plugin style (#12397)
* Refactor(mail): rename paperless_mail/parsers.py → paperless/parsers/mail.py

Preserve git history for MailDocumentParser by committing the rename
separately before editing, following the project convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor(mail): move mail parser tests to paperless/tests/parsers/

Move test_parsers.py → test_mail_parser.py and test_parsers_live.py →
test_mail_parser_live.py alongside the other built-in parser tests,
preserving git history before editing. Update MailDocumentParser import
to the new canonical location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore: move mail parser sample files to paperless/tests/samples/mail/

Relocate all mail test fixtures from src/paperless_mail/tests/samples/ to
src/paperless/tests/samples/mail/ ahead of the parser plugin refactor.
Add the new path to the codespell skip list to prevent false-positive
spell corrections in binary/fixture email files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feat(tests): add mail parser fixtures to paperless/tests/parsers/conftest.py

Add mail_samples_dir, per-file sample fixtures, and mail_parser
(context-manager style) to mirror the old paperless_mail conftest
but rooted at the new samples/mail/ location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feat(parsers): migrate MailDocumentParser to ParserProtocol

Move the mail parser from paperless_mail/parsers.py to
paperless/parsers/mail.py and refactor it to implement ParserProtocol:

- Class-level name/version/author/url attributes
- supported_mime_types() and score() classmethods (score=20)
- can_produce_archive=False, requires_pdf_rendition=True
- Context manager lifecycle (__enter__/__exit__)
- New parse() signature without mailrule_id kwarg; consumer sets
  parser.mailrule_id before calling parse() instead
- get_text()/get_date()/get_archive_path() accessor methods
- extract_metadata() returning email headers and attachment info

Register MailDocumentParser in the ParserRegistry alongside Text and
Tika parsers. Update consumer, signals, and all import sites to use
the new location. Update tests to use the new accessor API, patch
paths, and context-manager fixture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix(parsers): pop legacy constructor args in mail signal wrapper

MailDocumentParser.__init__ takes no constructor args in the new
protocol. Update the get_parser() signal wrapper to pop logging_group
and progress_callback (passed by the legacy consumer dispatch path)
before instantiating — the same pattern used by TextDocumentParser.

Also update test_mail_parser_receives_mailrule to use the real signal
wrapper (mail_get_parser) instead of MailDocumentParser directly, so
the test exercises the actual dispatch path and matches the new
parse() call signature (no mailrule kwarg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Bumps this so we can run

* Fixes location of the fixture

* Removes fixtures which were duplicated

* Feat(parsers): add ParserContext and configure() to ParserProtocol

Replace the ad-hoc mailrule_id attribute assignment with a typed,
immutable ParserContext dataclass and a configure() method on the
Protocol:

- ParserContext(frozen=True, slots=True) lives in paperless/parsers/
  alongside ParserProtocol and MetadataEntry; currently carries only
  mailrule_id but is designed to grow with output_type, ocr_mode, and
  ocr_language in a future phase (decoupling parsers from settings.*)
- ParserProtocol.configure(context: ParserContext) -> None is the
  extension point; no-op by default
- MailDocumentParser.configure() reads mailrule_id into _mailrule_id
- TextDocumentParser and TikaDocumentParser implement a no-op configure()
- Consumer calls document_parser.configure(ParserContext(...)) before
  parse(), replacing the isinstance(parser, MailDocumentParser) guard
  and the direct attribute mutation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feat(parsers): call configure(ParserContext()) in update_document task

Apply the same new-style parser shim pattern as the consumer to
update_document_content_maybe_archive_file:

- Call __enter__ for Text/Tika parsers after instantiation
- Call configure(ParserContext()) before parse() for all new-style parsers
  (mailrule_id is not available here — this is a re-process of an
  existing document, so the default empty context is correct)
- Call parse(path, mime_type) with 2 args for new-style parsers
- Call get_thumbnail(path, mime_type) with 2 args for new-style parsers
- Call __exit__ instead of cleanup() in the finally block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix(tests): add configure() to DummyParser and missing-method parametrize

ParserProtocol now requires configure(context: ParserContext) -> None.
Update DummyParser in test_registry.py to implement it, and add
'missing-configure' to the test_partial_compliant_fails_isinstance
parametrize list so the new method is covered by the negative test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
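
The negative test named above relies on `isinstance()` failing when a required method is absent. A sketch of that behaviour, assuming the Protocol is decorated with `typing.runtime_checkable`; the Protocol and class names here are hypothetical and much smaller than the real interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SketchProtocol(Protocol):
    def configure(self, context: object) -> None: ...
    def parse(self, path: str, mime_type: str) -> None: ...


class Complete:
    def configure(self, context: object) -> None:
        return None

    def parse(self, path: str, mime_type: str) -> None:
        return None


class MissingConfigure:
    # Deliberately omits configure() to mimic a partially compliant parser.
    def parse(self, path: str, mime_type: str) -> None:
        return None


def is_compliant(candidate: object) -> bool:
    # The runtime check only verifies that the members exist, not their signatures.
    return isinstance(candidate, SketchProtocol)
```

This is why adding a method to the Protocol also means updating every dummy class used in the tests, as the commit does for `DummyParser`.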

* Cleans up the reprocess task and generally reduces duplication of classes

* Corrects the score return

* Updates so we can report a page count for these parsers, assuming we do have an archive produced when called

* Increases test coverage

* One more coverage

* Updates typing

* Updates typing

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 09:22:18 -07:00
Trenton H
9d69705e26 Feature: Add progress information to the classifier training for a better ux (#12331) 2026-03-14 19:53:52 +00:00
Trenton H
d86cfdb088 Feature: Initial document parser plugin framework (#12294) 2026-03-12 21:53:17 +00:00
shamoon
299dac21ee Enhancement: “live” document updates (#12141) 2026-03-04 00:27:07 +00:00
Trenton H
e58a35d40c Feature: Transition sanity check to rich and improve output (#12182) 2026-03-02 10:53:39 -08:00
Trenton H
20a9cd40e8 Feature: Switch all indexing to use rich (#12193) 2026-03-02 10:41:09 -08:00
shamoon
96ac7b2336 Tweak: Ignore version docs for workflows (#12217) 2026-03-02 08:21:14 -08:00
shamoon
ceee769e26 Feature: document file versions (#12061) 2026-02-26 16:46:54 +00:00
shamoon
00ef0837d2 Fix: re-run ASN check after barcode detection (#11681) 2026-02-02 23:23:37 +00:00
Sebastian Steinbeißer
3b5ffbf9fa Chore(mypy): Annotate None returns for typing improvements (#11213) 2026-02-02 08:44:12 -08:00
shamoon
1f074390e4 Feature: sharelink bundles (#11682) 2026-01-27 18:54:51 +00:00
shamoon
e940764fe0 Feature: Paperless AI (#10319) 2026-01-13 16:24:42 +00:00
shamoon
f3e3ba49d1 Fix: recurring workflow to respect latest run time (#11735) 2026-01-08 09:52:53 -08:00
shamoon
66d363bdc5 Chore: refactor workflows code (#11563) 2025-12-11 12:13:10 -08:00
shamoon
4cff907ba0 Feature: Nested Tags (#10833)
---------

Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
2025-09-17 21:41:39 +00:00
shamoon
1cd21d0f38 Fix check scheduled workflows docstring 2025-07-03 00:11:12 -07:00
shamoon
3b069ac034 Fix: restore expected pre-2.16 scheduled workflow offset behavior (#10218) 2025-06-19 14:47:54 +00:00
shamoon
422bffe1a6 Performance: pre-filter document list in scheduled workflow checks (#10031) 2025-06-03 21:47:29 +00:00
shamoon
e97cfb9b5e Chore: refactor consumer plugin checks to a pre-flight plugin (#9994) 2025-06-03 19:28:49 +00:00
shamoon
f39463ff4e Add a more helpful docstring to schedule logic, scheduled test 2025-05-27 13:05:42 -07:00
shamoon
344cc70cd5 Enhancement: support negative offset in scheduled workflows (#9746) 2025-05-11 20:04:46 +00:00
shamoon
924a13f724 Fix: always update classifier task result (#9817) 2025-04-29 11:46:18 -07:00
shamoon
358db10fe3 Fix: ensure only matched scheduled workflows are applied (#9580) 2025-04-08 08:55:03 -07:00
Sebastian Steinbeißer
76d363f22d Chore: switch from os.path to pathlib.Path (#9060) 2025-03-05 21:06:01 +00:00
shamoon
2d52226732 Enhancement: system status report sanity check, simpler classifier check, styling updates (#9106) 2025-02-26 22:12:20 +00:00
shamoon
ceffcd6360 Fix: correct logged number of deleted documents on trash (#9148) 2025-02-17 18:41:47 -08:00
Sebastian Steinbeißer
e560fa3be0 Chore: Enable ruff FBT (#8645) 2025-02-07 09:12:03 -08:00
shamoon
f836c5ce3e Chore: some logging to trash emptying 2024-12-07 08:04:10 -08:00
shamoon
7d182ab894 Enhancement: prune audit logs and management command (#8416) 2024-12-03 19:28:27 +00:00
shamoon
c6fdf4409b Fix: include distinct on workflow objects filter 2024-11-24 12:22:48 -08:00
shamoon
2b29233a1e Feature: scheduled workflow trigger (#8036) 2024-11-24 18:22:31 +00:00
shamoon
9c1561adfb Change: change update content to handle archive disabled (#8315) 2024-11-20 20:01:13 +00:00
Trenton H
e6f59472e4 Chore: Drop Python 3.9 support (#7774) 2024-09-26 12:22:24 -07:00
shamoon
a771d2afd9 Fix: use JSON for update archive file auditlog entries (#7503) 2024-08-19 23:29:24 -07:00
shamoon
0f9710dc8f Fix: index fresh document data after update archive file (#7057) 2024-06-21 18:33:01 +00:00
shamoon
a796e58a94 Feature: documents trash aka soft delete (#6944) 2024-06-17 08:07:08 -07:00
Trenton H
622f624132 Chore: Change the code formatter to Ruff (#6756)
* Changing the formatting to ruff-format

* Replaces references to black to ruff or ruff format, removes black from dependencies
2024-05-18 02:26:50 +00:00
Trenton H
b720aa3cd1 Chore: Convert the consumer to a plugin (#6361) 2024-04-18 02:59:14 +00:00
shamoon
4af8070450 Feature: PDF actions - merge, split & rotate (#6094) 2024-03-25 18:41:24 -07:00
Trenton H
2da5e46386 Refactor file consumption task to allow beginnings of a plugin system (#5367) 2024-01-13 16:11:14 +00:00
Trenton H
bd35030c59 Fix: Crash in barcode ASN reading when the file type isn't supported (#5261)
* Fixes a random crash in the barcode ASN reading so it doesn't try to access a not created temp dir

* Don't parse the barcodes twice, store the result instead
2024-01-06 05:08:24 +00:00
shamoon
3b6ce16f1c Feature: Workflows (#5121) 2024-01-03 08:19:19 +00:00
Trenton H
122e4141b0 Fix: Document metadata is lost during barcode splitting (#4982)
* Fixes barcode splitting dropping metadata that might be needed for the round 2
2023-12-15 09:17:25 -08:00