Commit Graph

275 Commits

Author SHA1 Message Date
Trenton H 16e73f611d Cleans up the reprocess task and generally reduces duplicate of classes 2026-03-19 09:57:08 -07:00
Trenton H b66cfb1867 Merge remote-tracking branch 'origin/dev' into feature-mail-parser-plugin 2026-03-19 09:24:44 -07:00
Trenton H a36b6ecbef Feat(parsers): add ParserContext and configure() to ParserProtocol
Replace the ad-hoc mailrule_id attribute assignment with a typed,
immutable ParserContext dataclass and a configure() method on the
Protocol:

- ParserContext(frozen=True, slots=True) lives in paperless/parsers/
  alongside ParserProtocol and MetadataEntry; currently carries only
  mailrule_id but is designed to grow with output_type, ocr_mode, and
  ocr_language in a future phase (decoupling parsers from settings.*)
- ParserProtocol.configure(context: ParserContext) -> None is the
  extension point; no-op by default
- MailDocumentParser.configure() reads mailrule_id into _mailrule_id
- TextDocumentParser and TikaDocumentParser implement a no-op configure()
- Consumer calls document_parser.configure(ParserContext(...)) before
  parse(), replacing the isinstance(parser, MailDocumentParser) guard
  and the direct attribute mutation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 08:19:17 -07:00
Trenton H 2cbe6ae892 Feature: Convert remote AI parser to plugin system (#12334)
* Refactor: move remote parser, test, and sample to paperless.parsers

Relocates three files to their new homes in the parser plugin system:

- src/paperless_remote/parsers.py
    → src/paperless/parsers/remote.py
- src/paperless_remote/tests/test_parser.py
    → src/paperless/tests/parsers/test_remote_parser.py
- src/paperless_remote/tests/samples/simple-digital.pdf
    → src/paperless/tests/samples/remote/simple-digital.pdf

Content and imports will be updated in the follow-up commit that
rewrites the parser to the new ParserProtocol interface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feature: migrate RemoteDocumentParser to ParserProtocol interface

Rewrites the remote OCR parser to the new plugin system contract:

- `supported_mime_types()` is now a classmethod that always returns the
  full set of 7 MIME types; the old instance-method hack (returning {}
  when unconfigured) is removed
- `score()` classmethod returns None when no remote engine is configured
  (making the parser invisible to the registry), and 20 when active —
  higher than the tesseract default of 10 so the remote engine takes
  priority when both are available
- No longer inherits from RasterisedDocumentParser; inherits no parser
  class at all — just implements the protocol directly
- `can_produce_archive = True`; `requires_pdf_rendition = False`
- `_azure_ai_vision_parse()` takes explicit config arg; API client
  created and closed within the method
- `get_page_count()` returns the PDF page count for application/pdf,
  delegating to the new `get_page_count_for_pdf()` utility
- `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs;
  returns [] for all other MIME types

New files:
- `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and
  `get_page_count_for_pdf()` utilities (pikepdf-based); both the remote
  and tesseract parsers will use these going forward
- `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style
  tests using pytest-django `settings` and pytest-mock `mocker` fixtures
- `src/paperless/tests/parsers/conftest.py` — remote parser instance,
  sample-file, and settings-helper fixtures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: use fixture factory and usefixtures in remote parser tests

- `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture
  in conftest.py; tests call `make_azure_mock()` or
  `make_azure_mock("custom text")` instead of a module-level function
- `azure_settings` and `no_engine_settings` applied via
  `@pytest.mark.usefixtures` wherever their value is not referenced
  inside the test body; `TestRemoteParserParseError` marked at the class
  level since all three tests need the same setting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: improve remote parser test fixture structure

- make_azure_mock moved from conftest.py back into test_remote_parser.py;
  it is specific to that module and does not belong in shared fixtures
- azure_client fixture composes azure_settings + make_azure_mock + patch
  in one step; tests no longer repeat the mocker.patch call or carry an
  unused azure_settings parameter
- failing_azure_client fixture similarly composes azure_settings + patch
  with a RuntimeError side effect; TestRemoteParserParseError now only
  receives the mock it actually uses
- All @pytest.mark.parametrize calls use pytest.param with explicit ids
  (pdf, png, jpeg, ...) for readable test output

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: wire RemoteDocumentParser into consumer and fix signals

- paperless_remote/signals.py: import from paperless.parsers.remote
  (new location after git mv). supported_mime_types() is now a
  classmethod that always returns the full set, so get_supported_mime_types()
  in the signal layer explicitly checks RemoteEngineConfig validity and
  returns {} when unconfigured — preserving the old behaviour where an
  unconfigured remote parser does not register for any MIME types.

- documents/consumer.py: extend the _parser_cleanup() shim, parse()
  dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
  alongside TextDocumentParser. Both new-style parsers use __exit__
  for cleanup and take (document_path, mime_type) without a file_name
  argument.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: fix type errors in remote parser and signals

- remote.py: add `if TYPE_CHECKING: assert` guards before the Azure
  client construction to narrow config.endpoint and config.api_key from
  str|None to str. The narrowing is safe: engine_is_valid() guarantees
  both are non-None when it returns True (api_key explicitly; endpoint
  via `not (engine=="azureai" and endpoint is None)` for the only valid
  engine). Asserts are wrapped in TYPE_CHECKING so they carry zero
  runtime cost.

- signals.py: add full type annotations — return types, Any-typed
  sender parameter, and explicit logging_group argument replacing *args.
  Add `from __future__ import annotations` for consistent annotation style.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: get_parser factory forwards logging_group, drops progress_callback

consumer.py calls parser_class(logging_group, progress_callback=...).
RemoteDocumentParser.__init__ accepts logging_group but not
progress_callback, so only the latter is dropped — matching the pattern
established by the TextDocumentParser signals shim.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: text parser get_parser forwards logging_group, drops progress_callback

TextDocumentParser.__init__ accepts logging_group: object = None, same
as RemoteDocumentParser. The old shim incorrectly dropped it; fix to
forward it as a positional arg and only drop progress_callback.
Add type annotations and from __future__ import annotations for
consistency with the remote parser signals shim.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 16:19:46 -07:00
Trenton H 3236bbd0c5 Feat(parsers): migrate MailDocumentParser to ParserProtocol
Move the mail parser from paperless_mail/parsers.py to
paperless/parsers/mail.py and refactor it to implement ParserProtocol:

- Class-level name/version/author/url attributes
- supported_mime_types() and score() classmethods (score=20)
- can_produce_archive=False, requires_pdf_rendition=True
- Context manager lifecycle (__enter__/__exit__)
- New parse() signature without mailrule_id kwarg; consumer sets
  parser.mailrule_id before calling parse() instead
- get_text()/get_date()/get_archive_path() accessor methods
- extract_metadata() returning email headers and attachment info

Register MailDocumentParser in the ParserRegistry alongside Text and
Tika parsers. Update consumer, signals, and all import sites to use
the new location. Update tests to use the new accessor API, patch
paths, and context-manager fixture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 14:41:26 -07:00
Trenton H aea2927a02 Feature: Convert Tika parser to the plugin system (#12333)
* Chore: move Tika parser and tests to paperless/

Move TikaDocumentParser and its tests to the canonical parser package
location, matching the pattern established for TextDocumentParser:

- src/paperless_tika/parsers.py → src/paperless/parsers/tika.py
- src/paperless_tika/tests/test_tika_parser.py → src/paperless/tests/parsers/test_tika_parser.py
- src/paperless_tika/tests/samples/ → src/paperless/tests/samples/tika/

Merge tika fixtures (tika_parser, sample_odt_file, sample_docx_file,
sample_doc_file, sample_broken_odt) into the shared parsers conftest.
Remove the now-empty src/paperless_tika/tests/conftest.py.

Content is unchanged — this commit is rename-only so git history is
preserved on the moved files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feature: Phase 3 — migrate TikaDocumentParser to ParserProtocol

Refactor TikaDocumentParser to satisfy ParserProtocol without subclassing
the legacy DocumentParser ABC:

- Add ClassVars: name, version, author, url
- Add supported_mime_types() classmethod (12 Office/ODF/RTF MIME types)
- Add score() classmethod — returns None when TIKA_ENABLED is False, 10 otherwise
- can_produce_archive = False (PDF is for display, not an OCR archive)
- requires_pdf_rendition = True (Office formats need PDF for browser display)
- __enter__/__exit__ via ExitStack: TikaClient opened once per parser
  lifetime and shared across parse() and extract_metadata() calls
- extract_metadata() falls back to a short-lived TikaClient when called
  outside a context manager (legacy view-layer metadata path)
- _convert_to_pdf() uses OutputTypeConfig() to honour the database-stored
  ApplicationConfiguration before falling back to the env-var setting
- Rename convert_to_pdf → _convert_to_pdf (private helper)

Update paperless_tika/signals.py shim to import from the new module path
and drop the legacy logging_group/progress_callback kwargs.

Update documents/consumer.py to extend the existing TextDocumentParser
special cases to also cover TikaDocumentParser (parse/get_thumbnail
signatures, __exit__ cleanup).

Add TestTikaParserRegistryInterface (7 tests) covering score(), properties,
and ParserProtocol isinstance check.  Update existing tests to use the new
accessor API (get_text, get_date, get_archive_path, _convert_to_pdf).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: update remaining imports and move live Tika tests after parser migration

- src/documents/tests/test_parsers.py: import TikaDocumentParser from
  paperless.parsers.tika (old paperless_tika.parsers no longer exists)
- git mv paperless_tika/tests/test_live_tika.py →
  paperless/tests/parsers/test_live_tika.py to co-locate all Tika tests
  with the parser; update import and replace old attribute API
  (tika_parser.text/.archive_path) with accessor methods
  (get_text/get_archive_path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: satisfy mypy and pyrefly for TikaDocumentParser

Use a TYPE_CHECKING-guarded assert to narrow self._tika_client from
TikaClient | None to TikaClient at the point of use in parse().  The
assert is visible to type checkers (TYPE_CHECKING=True) so both mypy
and pyrefly accept the subsequent attribute accesses without error;
at runtime TYPE_CHECKING is False so the assert never executes and no
ruff S101 suppression is required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: require context manager for TikaDocumentParser; clean up client lifecycle

- consumer.py: call __enter__ for new-style parsers so _tika_client and
  _gotenberg_client are set before parse() is invoked
- views.py: use `with parser` (via nullcontext for old-style parsers) in
  get_metadata so extract_metadata always runs inside a context manager
- tika.py: GotenbergClient added to ExitStack alongside TikaClient;
  inline client creation removed from extract_metadata and _convert_to_pdf;
  __exit__ uses ExitStack.close() instead of __exit__ pass-through

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 15:43:28 -07:00
Trenton H d86cfdb088 Feature: Initial document parser plugin framework (#12294) 2026-03-12 21:53:17 +00:00
shamoon 24a2cfd957 Change: use explicit doc creation instead of clone for versions (#12226) 2026-03-04 15:57:44 -08:00
shamoon df03207eef Fix: correct doc version filename handling (#12223) 2026-03-04 23:28:07 +00:00
Trenton H 1e21bcd26e Breaking: Drop support for Python 3.10 (#12234) 2026-03-04 15:03:33 -08:00
shamoon d51a118aac Merge branch 'main' into dev 2026-03-04 13:31:20 -08:00
shamoon 8b8307571a Fix: enforce path limit for db filename fields (#12235) 2026-03-03 13:19:56 -08:00
shamoon 96ac7b2336 Tweak: Ignore version docs for workflows (#12217) 2026-03-02 08:21:14 -08:00
shamoon ceee769e26 Feature: document file versions (#12061) 2026-02-26 16:46:54 +00:00
shamoon 6192915be7 Fixhancement: improve ASN handling with PDF operations (#11689) 2026-02-06 21:14:02 +00:00
Trenton H 2ec8ec96c8 Feature: Enable users to customize date parsing via plugins (#11931) 2026-02-03 20:09:13 +00:00
shamoon 00ef0837d2 Fix: re-run ASN check after barcode detection (#11681) 2026-02-02 23:23:37 +00:00
Sebastian Steinbeißer 3b5ffbf9fa Chore(mypy): Annotate None returns for typing improvements (#11213) 2026-02-02 08:44:12 -08:00
shamoon 4428354150 Feature: allow duplicates with warnings, UI for discovery (#11815) 2026-01-26 18:55:08 +00:00
Trenton H d0032c18be Breaking: Remove support for document and thumbnail encryption (#11850) 2026-01-24 19:29:54 -08:00
shamoon 7604a0b583 Fix: prevent ASN collisions for merge operations (#11634) 2025-12-19 20:05:34 -08:00
shamoon 4cff907ba0 Feature: Nested Tags (#10833)
---------

Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
2025-09-17 21:41:39 +00:00
shamoon dfad3c4d8e Chore: clarify file deletion logging 2025-06-27 13:34:44 -07:00
shamoon e97cfb9b5e Chore: refactor consumer plugin checks to a pre-flight plugin (#9994) 2025-06-03 19:28:49 +00:00
matthesrieke e9746aa0e3 Enhancement: include DOCUMENT_TYPE to post consume scripts (#9977)
* expose DOCUMENT_TYPE to post consume scripts

* Apply suggestions from code review

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>

---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-05-28 23:32:59 +00:00
shamoon a3b85c64ca Fixhancement: check more permissions for status consumer messages (#9804) 2025-04-26 23:31:04 -07:00
shamoon edc7181843 Enhancement: support assigning custom field values in workflows (#9272) 2025-03-05 12:30:19 -08:00
Trenton H f205c4d0e2 Removes undocumented FileInfo (#9298) 2025-03-04 13:49:47 -08:00
Silvia Bigler 71472a6a82 Enhancement: add layout options for email conversion (#8907)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-02-07 18:32:35 +00:00
Sebastian Steinbeißer fce7b03324 Chore: Switch from os.path to pathlib.Path (#8644) 2025-01-29 10:58:53 -08:00
shamoon d97e4a9a95 Fix: fix email/wh actions on consume started (#8750) 2025-01-15 15:48:10 +00:00
shamoon d61b2bbfc6 Fix: pass working file to workflows, pickle file bytes (#8741) 2025-01-14 23:03:40 -08:00
shamoon 86788f1445 Fix: use unmodified original for checksum if exists (#8693) 2025-01-13 21:02:10 +00:00
lufi 0406fca59b Enhancement: include current filename placeholder in workflows (#8319)
Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2024-12-03 03:09:27 +00:00
shamoon 1d65628132 Feature: email, webhook workflow actions (#8108) 2024-12-03 00:12:40 +00:00
shamoon 37dc791301 Fix: fix auto-clean PDFs, create parent dir for storing unmodified original (#8157) 2024-11-02 20:54:28 -07:00
shamoon dcc8d4046a Chore: Unify workflow logic (#7880) 2024-10-10 20:28:44 +00:00
Trenton H cf3645c296 Fixes the ASN checking to allow an ASN of 0 (#7878) 2024-10-08 12:47:37 -07:00
Trenton H e6f59472e4 Chore: Drop Python 3.9 support (#7774) 2024-09-26 12:22:24 -07:00
shamoon 5e687d9a93 Feature: auto-clean some invalid pdfs (#7651) 2024-09-25 15:57:20 +00:00
s0llvan c92c3e224a Feature: page count (#7750)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2024-09-25 08:22:12 -07:00
shamoon 0ee85aae21 Enhancement: log when pre-check fails for documents in trash (#7355) 2024-08-05 17:01:01 -07:00
Freddy0 8e3ca37b05 Enhancement: include owner username in post-consumption variables (#7270) 2024-07-16 15:23:29 -07:00
shamoon 73d33ff25a Fix: include trashed docs in existing doc check (#7229) 2024-07-12 16:45:35 -07:00
shamoon ada283441c Fix: include documents in trash for existing asn check (#7189) 2024-07-08 16:28:40 +00:00
Trenton H 6d2ae3df1f Resolves test issues with Python 3.12 (#6902) 2024-06-03 12:33:46 -07:00
Trenton H b720aa3cd1 Chore: Convert the consumer to a plugin (#6361) 2024-04-18 02:59:14 +00:00
Trenton H 2c43b06910 Chore: Standardize subprocess running and logging (#6275) 2024-04-04 13:11:43 -07:00
Elias Probst 41fc11efff Enhancement: add ASN to consume rejection message (#6217) 2024-03-28 19:38:29 -07:00
shamoon f07441a408 Feature: workflow removal action (#5928)
---------

Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
2024-03-04 17:37:42 +00:00