Commit Graph

272 Commits

Author SHA1 Message Date
Trenton H 5c339f7f60 Merge remote-tracking branch 'origin/dev' into feature-remote-parser-protocol 2026-03-18 09:56:14 -07:00
Trenton H aea2927a02 Feature: Convert Tika parser to the plugin system (#12333)
* Chore: move Tika parser and tests to paperless/

Move TikaDocumentParser and its tests to the canonical parser package
location, matching the pattern established for TextDocumentParser:

- src/paperless_tika/parsers.py → src/paperless/parsers/tika.py
- src/paperless_tika/tests/test_tika_parser.py → src/paperless/tests/parsers/test_tika_parser.py
- src/paperless_tika/tests/samples/ → src/paperless/tests/samples/tika/

Merge tika fixtures (tika_parser, sample_odt_file, sample_docx_file,
sample_doc_file, sample_broken_odt) into the shared parsers conftest.
Remove the now-empty src/paperless_tika/tests/conftest.py.

Content is unchanged — this commit is rename-only so git history is
preserved on the moved files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feature: Phase 3 — migrate TikaDocumentParser to ParserProtocol

Refactor TikaDocumentParser to satisfy ParserProtocol without subclassing
the legacy DocumentParser ABC:

- Add ClassVars: name, version, author, url
- Add supported_mime_types() classmethod (12 Office/ODF/RTF MIME types)
- Add score() classmethod — returns None when TIKA_ENABLED is False, 10 otherwise
- can_produce_archive = False (PDF is for display, not an OCR archive)
- requires_pdf_rendition = True (Office formats need PDF for browser display)
- __enter__/__exit__ via ExitStack: TikaClient opened once per parser
  lifetime and shared across parse() and extract_metadata() calls
- extract_metadata() falls back to a short-lived TikaClient when called
  outside a context manager (legacy view-layer metadata path)
- _convert_to_pdf() uses OutputTypeConfig() to honour the database-stored
  ApplicationConfiguration before falling back to the env-var setting
- Rename convert_to_pdf → _convert_to_pdf (private helper)
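
The shape described in the list above can be sketched as follows. This is a minimal illustration, not the project's code: ParserProtocolSketch, SketchParser, FakeClient, the module-level TIKA_ENABLED constant, the score() signature, and the single MIME type returned are all assumptions made for the example.

```python
from __future__ import annotations

from contextlib import ExitStack
from typing import ClassVar, Protocol, runtime_checkable

TIKA_ENABLED = True  # hypothetical stand-in for the real setting


@runtime_checkable
class ParserProtocolSketch(Protocol):
    """Hypothetical, trimmed-down stand-in for ParserProtocol."""

    @classmethod
    def supported_mime_types(cls) -> dict[str, str]: ...

    @classmethod
    def score(cls, mime_type: str) -> int | None: ...


class FakeClient:
    """Hypothetical stand-in for TikaClient; only tracks open/closed state."""

    def __init__(self) -> None:
        self.open = False

    def __enter__(self) -> FakeClient:
        self.open = True
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.open = False


class SketchParser:
    """Satisfies the protocol structurally, without subclassing an ABC."""

    name: ClassVar[str] = "Sketch parser"
    version: ClassVar[str] = "0.0.0"
    can_produce_archive = False
    requires_pdf_rendition = True

    def __init__(self) -> None:
        self._stack = ExitStack()
        self._client: FakeClient | None = None

    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        # One illustrative entry; the commit mentions 12 MIME types.
        return {"application/rtf": ".rtf"}

    @classmethod
    def score(cls, mime_type: str) -> int | None:
        # None means "do not offer this parser"; 10 otherwise.
        return 10 if TIKA_ENABLED else None

    def __enter__(self) -> SketchParser:
        # The client is opened once and shared for the parser's lifetime.
        self._client = self._stack.enter_context(FakeClient())
        return self

    def __exit__(self, *exc_info: object) -> None:
        # Closing the stack exits every client registered on it.
        self._stack.close()
```

Because the protocol is runtime-checkable, isinstance(SketchParser(), ParserProtocolSketch) holds even though SketchParser has no base class.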

Update paperless_tika/signals.py shim to import from the new module path
and drop the legacy logging_group/progress_callback kwargs.

Update documents/consumer.py to extend the existing TextDocumentParser
special cases to also cover TikaDocumentParser (parse/get_thumbnail
signatures, __exit__ cleanup).

Add TestTikaParserRegistryInterface (7 tests) covering score(), properties,
and ParserProtocol isinstance check. Update existing tests to use the new
accessor API (get_text, get_date, get_archive_path, _convert_to_pdf).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: update remaining imports and move live Tika tests after parser migration

- src/documents/tests/test_parsers.py: import TikaDocumentParser from
  paperless.parsers.tika (old paperless_tika.parsers no longer exists)
- git mv paperless_tika/tests/test_live_tika.py →
  paperless/tests/parsers/test_live_tika.py to co-locate all Tika tests
  with the parser; update import and replace old attribute API
  (tika_parser.text/.archive_path) with accessor methods
  (get_text/get_archive_path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: satisfy mypy and pyrefly for TikaDocumentParser

Use a TYPE_CHECKING-guarded assert to narrow self._tika_client from
TikaClient | None to TikaClient at the point of use in parse(). The
assert is visible to type checkers (TYPE_CHECKING=True) so both mypy
and pyrefly accept the subsequent attribute accesses without error;
at runtime TYPE_CHECKING is False so the assert never executes and no
ruff S101 suppression is required.
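
A minimal sketch of the narrowing pattern described above, with hypothetical names (ClientSketch, NarrowingParser) that do not come from the commit:

```python
from __future__ import annotations

from typing import TYPE_CHECKING


class ClientSketch:
    """Hypothetical stand-in for TikaClient."""

    def ping(self) -> str:
        return "pong"


class NarrowingParser:
    def __init__(self) -> None:
        self._client: ClientSketch | None = None

    def parse(self) -> str:
        if TYPE_CHECKING:
            # Type checkers treat TYPE_CHECKING as True, so they see this
            # assert and narrow the attribute to ClientSketch from here on.
            assert self._client is not None
        # At runtime TYPE_CHECKING is False and the assert is skipped, so a
        # missing client surfaces as an AttributeError on the next line.
        return self._client.ping()
```

One consequence worth noting: since nothing is checked at runtime, calling parse() before the client is set fails with an AttributeError rather than an AssertionError.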

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: require context manager for TikaDocumentParser; clean up client lifecycle

- consumer.py: call __enter__ for new-style parsers so _tika_client and
  _gotenberg_client are set before parse() is invoked
- views.py: use `with parser` (via nullcontext for old-style parsers) in
  get_metadata so extract_metadata always runs inside a context manager
- tika.py: GotenbergClient added to ExitStack alongside TikaClient;
  inline client creation removed from extract_metadata and _convert_to_pdf;
  __exit__ uses ExitStack.close() instead of __exit__ pass-through
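
The views.py change can be illustrated roughly as below. The parser classes, the get_metadata helper, and the hasattr check used to tell old-style from new-style parsers are assumptions for this sketch; the commit only states that nullcontext is used for old-style parsers.

```python
from contextlib import nullcontext


class OldStyleParserSketch:
    """Hypothetical legacy parser: not a context manager."""

    def extract_metadata(self) -> list[str]:
        return ["old"]


class NewStyleParserSketch:
    """Hypothetical new-style parser: clients exist only inside `with`."""

    def __init__(self) -> None:
        self.entered = False

    def __enter__(self) -> "NewStyleParserSketch":
        self.entered = True
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.entered = False

    def extract_metadata(self) -> list[str]:
        if not self.entered:
            raise RuntimeError("parser must be used as a context manager")
        return ["new"]


def get_metadata(parser) -> list[str]:
    # Wrap old-style parsers in nullcontext so both kinds share one code path.
    ctx = parser if hasattr(parser, "__enter__") else nullcontext(parser)
    with ctx:
        return parser.extract_metadata()
```

With this shape the caller never branches on parser type after the first line, and new-style parsers are always entered and exited around the metadata call.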

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 15:43:28 -07:00
Trenton H e9e1d4ccca Refactor: wire RemoteDocumentParser into consumer and fix signals
- paperless_remote/signals.py: import from paperless.parsers.remote
  (new location after git mv). supported_mime_types() is now a
  classmethod that always returns the full set, so get_supported_mime_types()
  in the signal layer explicitly checks RemoteEngineConfig validity and
  returns {} when unconfigured — preserving the old behaviour where an
  unconfigured remote parser does not register for any MIME types.

- documents/consumer.py: extend the _parser_cleanup() shim, parse()
  dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
  alongside TextDocumentParser. Both new-style parsers use __exit__
  for cleanup and take (document_path, mime_type) without a file_name
  argument.
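
The signal-layer behaviour in the first bullet might look roughly like this. EngineConfigSketch, its fields and is_valid() method, RemoteParserSketch, and the example MIME type are all hypothetical; only the idea (the classmethod always returns the full set, the signal layer returns {} when unconfigured) comes from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EngineConfigSketch:
    """Hypothetical stand-in for RemoteEngineConfig."""

    engine: str | None = None
    api_key: str | None = None

    def is_valid(self) -> bool:
        return bool(self.engine) and bool(self.api_key)


class RemoteParserSketch:
    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        # Always the full set, regardless of configuration.
        return {"application/pdf": ".pdf"}


def get_supported_mime_types(config: EngineConfigSketch) -> dict[str, str]:
    # An unconfigured remote parser registers for no MIME types at all.
    if not config.is_valid():
        return {}
    return RemoteParserSketch.supported_mime_types()
```

Keeping the configuration check in the signal layer leaves the classmethod free of settings lookups.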

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-13 12:09:33 -07:00
Trenton H d86cfdb088 Feature: Initial document parser plugin framework (#12294) 2026-03-12 21:53:17 +00:00
shamoon 24a2cfd957 Change: use explicit doc creation instead of clone for versions (#12226) 2026-03-04 15:57:44 -08:00
shamoon df03207eef Fix: correct doc version filename handling (#12223) 2026-03-04 23:28:07 +00:00
Trenton H 1e21bcd26e Breaking: Drop support for Python 3.10 (#12234) 2026-03-04 15:03:33 -08:00
shamoon d51a118aac Merge branch 'main' into dev 2026-03-04 13:31:20 -08:00
shamoon 8b8307571a Fix: enforce path limit for db filename fields (#12235) 2026-03-03 13:19:56 -08:00
shamoon 96ac7b2336 Tweak: Ignore version docs for workflows (#12217) 2026-03-02 08:21:14 -08:00
shamoon ceee769e26 Feature: document file versions (#12061) 2026-02-26 16:46:54 +00:00
shamoon 6192915be7 Fixhancement: improve ASN handling with PDF operations (#11689) 2026-02-06 21:14:02 +00:00
Trenton H 2ec8ec96c8 Feature: Enable users to customize date parsing via plugins (#11931) 2026-02-03 20:09:13 +00:00
shamoon 00ef0837d2 Fix: re-run ASN check after barcode detection (#11681) 2026-02-02 23:23:37 +00:00
Sebastian Steinbeißer 3b5ffbf9fa Chore(mypy): Annotate None returns for typing improvements (#11213) 2026-02-02 08:44:12 -08:00
shamoon 4428354150 Feature: allow duplicates with warnings, UI for discovery (#11815) 2026-01-26 18:55:08 +00:00
Trenton H d0032c18be Breaking: Remove support for document and thumbnail encryption (#11850) 2026-01-24 19:29:54 -08:00
shamoon 7604a0b583 Fix: prevent ASN collisions for merge operations (#11634) 2025-12-19 20:05:34 -08:00
shamoon 4cff907ba0 Feature: Nested Tags (#10833)
---------

Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
2025-09-17 21:41:39 +00:00
shamoon dfad3c4d8e Chore: clarify file deletion logging 2025-06-27 13:34:44 -07:00
shamoon e97cfb9b5e Chore: refactor consumer plugin checks to a pre-flight plugin (#9994) 2025-06-03 19:28:49 +00:00
matthesrieke e9746aa0e3 Enhancement: include DOCUMENT_TYPE to post consume scripts (#9977)
* expose DOCUMENT_TYPE to post consume scripts

* Apply suggestions from code review

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>

---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-05-28 23:32:59 +00:00
shamoon a3b85c64ca Fixhancement: check more permissions for status consumer messages (#9804) 2025-04-26 23:31:04 -07:00
shamoon edc7181843 Enhancement: support assigning custom field values in workflows (#9272) 2025-03-05 12:30:19 -08:00
Trenton H f205c4d0e2 Removes undocumented FileInfo (#9298) 2025-03-04 13:49:47 -08:00
Silvia Bigler 71472a6a82 Enhancement: add layout options for email conversion (#8907)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-02-07 18:32:35 +00:00
Sebastian Steinbeißer fce7b03324 Chore: Switch from os.path to pathlib.Path (#8644) 2025-01-29 10:58:53 -08:00
shamoon d97e4a9a95 Fix: fix email/wh actions on consume started (#8750) 2025-01-15 15:48:10 +00:00
shamoon d61b2bbfc6 Fix: pass working file to workflows, pickle file bytes (#8741) 2025-01-14 23:03:40 -08:00
shamoon 86788f1445 Fix: use unmodified original for checksum if exists (#8693) 2025-01-13 21:02:10 +00:00
lufi 0406fca59b Enhancement: include current filename placeholder in workflows (#8319)
Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2024-12-03 03:09:27 +00:00
shamoon 1d65628132 Feature: email, webhook workflow actions (#8108) 2024-12-03 00:12:40 +00:00
shamoon 37dc791301 Fix: fix auto-clean PDFs, create parent dir for storing unmodified original (#8157) 2024-11-02 20:54:28 -07:00
shamoon dcc8d4046a Chore: Unify workflow logic (#7880) 2024-10-10 20:28:44 +00:00
Trenton H cf3645c296 Fixes the ASN checking to allow an ASN of 0 (#7878) 2024-10-08 12:47:37 -07:00
Trenton H e6f59472e4 Chore: Drop Python 3.9 support (#7774) 2024-09-26 12:22:24 -07:00
shamoon 5e687d9a93 Feature: auto-clean some invalid pdfs (#7651) 2024-09-25 15:57:20 +00:00
s0llvan c92c3e224a Feature: page count (#7750)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2024-09-25 08:22:12 -07:00
shamoon 0ee85aae21 Enhancement: log when pre-check fails for documents in trash (#7355) 2024-08-05 17:01:01 -07:00
Freddy0 8e3ca37b05 Enhancement: include owner username in post-consumption variables (#7270) 2024-07-16 15:23:29 -07:00
shamoon 73d33ff25a Fix: include trashed docs in existing doc check (#7229) 2024-07-12 16:45:35 -07:00
shamoon ada283441c Fix: include documents in trash for existing asn check (#7189) 2024-07-08 16:28:40 +00:00
Trenton H 6d2ae3df1f Resolves test issues with Python 3.12 (#6902) 2024-06-03 12:33:46 -07:00
Trenton H b720aa3cd1 Chore: Convert the consumer to a plugin (#6361) 2024-04-18 02:59:14 +00:00
Trenton H 2c43b06910 Chore: Standardize subprocess running and logging (#6275) 2024-04-04 13:11:43 -07:00
Elias Probst 41fc11efff Enhancement: add ASN to consume rejection message (#6217) 2024-03-28 19:38:29 -07:00
shamoon f07441a408 Feature: workflow removal action (#5928)
---------

Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
2024-03-04 17:37:42 +00:00
shamoon 754627681c Fix: ensure document title always limited to 128 chars (#5934) 2024-02-28 07:05:19 -08:00
Trenton H 13201dbfff Ensure all creations of directories create the parents too (#5711) 2024-02-10 11:02:40 -08:00
Trenton H 2da5e46386 Refactor file consumption task to allow beginnings of a plugin system (#5367) 2024-01-13 16:11:14 +00:00