Commit Graph

444 Commits

Author SHA1 Message Date
shamoon
2cbbbaf170 Add a couple deprecation notes 2026-04-01 11:08:01 -07:00
shamoon
71889e6e90 Use tantivy for global search too 2026-04-01 10:58:39 -07:00
shamoon
9139507bd6 Wire the simple searches to view 2026-04-01 10:18:26 -07:00
shamoon
631074e4ed Add simple text search mode and API param 2026-04-01 10:08:27 -07:00
Trenton H
eac4a6ca05 Merge remote-tracking branch 'origin/dev' into feature-tantivy-search-backend
Hopefully the conflicts are good
2026-03-31 11:57:10 -07:00
shamoon
d2328b776a Performance: support bulk edit without id lists (#12355) 2026-03-31 18:23:28 +00:00
shamoon
245514ad10 Performance: deprecate and remove usage of all in API results (#12309) 2026-03-31 07:55:59 -07:00
Trenton H
eaa23751de Merge: resolve conflict with include_selection_data from #12300
The origin/dev branch added include_selection_data support to the search
list() view (PR #12300). Our Tantivy list() had replaced the Whoosh
implementation entirely, causing a conflict.

Resolution: keep the Tantivy implementation and incorporate the
include_selection_data feature. When requested, selection_data is computed
over all matching document IDs from ordered_hits (the full Tantivy result
set, not just the current page).

Also update test_search_with_include_selection_data from #12300 to use
the Tantivy indexing API (get_backend().add_or_update) instead of the
removed Whoosh AsyncWriter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 17:08:39 -07:00
Trenton H
50f6b2d4c3 feat(search): wire Tantivy backend into all callsites; remove Whoosh
- Replace all `from documents import index` + Whoosh writer usage across
  admin.py, bulk_edit.py, tasks.py, views.py, signals/handlers.py with
  `get_backend().add_or_update/remove/batch_update`
- Add `effective_content` param to `_build_tantivy_doc` / `add_or_update`
  (used by signal handler to re-index root doc with version's OCR text)
- Add `wipe_index()` (renamed from `_wipe_index`) to public API; use from
  `document_index --recreate` flag
- `index_optimize()` replaced with deprecation log message; Tantivy
  manages segment merging automatically
- `index_reindex()` now calls `get_backend().rebuild()` + `reset_backend()`
  with select_related/prefetch_related for efficiency
- `document_index` management command: add `--recreate` flag
- Status view: use `get_backend()` + dir mtime scan instead of Whoosh
  `ix.last_modified()`
- Delete `documents/index.py`, `test_index.py`, `test_delayedquery.py`
- Update all tests: patch `documents.search.get_backend` (lazy imports);
  `DirectoriesMixin` calls `reset_backend()` in setUp/tearDown;
  `TestDocumentConsumptionFinishedSignal` likewise
- `test_api_search.py`: fix order-independent assertions for date-range
  queries; fix `_rewrite_8digit_date` to be field-aware and
  timezone-correct for DateTimeField vs DateField

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 10:43:30 -07:00
shamoon
f715533770 Performance: support passing selection data with filtered document requests (#12300) 2026-03-30 16:38:52 +00:00
shamoon
129da3ade7 Tweakhancement: show file extension in StoragePath test (#12452) 2026-03-28 13:58:33 -07:00
shamoon
ae0474450f Chore: logger, response and template sanitization cleanup (#12439) 2026-03-26 07:36:02 -07:00
Trenton H
701735f6e5 Chore: Drop old signal and unneeded apps, transition to parser registry instead (#12405)
* refactor: switch consumer and callers to ParserRegistry (Phase 4)

Replace all Django signal-based parser discovery with direct registry
calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all
old-style isinstance checks. All parser instantiation now uses the
`with parser_class() as parser:` context manager pattern.

- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: drop get_parser_class_for_mime_type; callers use registry directly

All callers now call get_parser_registry().get_parser_for_file() with
the actual filename and path, enabling score() to use file extension
hints. The MIME-only helper is removed.

- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior);
  TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove document_consumer_declaration signal infrastructure

Remove the document_consumer_declaration signal that was previously used
for parser registration. Each parser app no longer connects to this signal,
and the signal declaration itself has been removed from documents/signals.

Changes:
- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
  - paperless_tesseract/signals.py
  - paperless_text/signals.py
  - paperless_tika/signals.py
  - paperless_mail/signals.py
  - paperless_remote/signals.py

Parser discovery now happens exclusively through the ParserRegistry
system introduced in the previous refactor phases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove empty paperless_text and paperless_tika Django apps

After parser classes were moved to paperless/parsers/ in the plugin
refactor, these Django apps contained only empty AppConfig classes
with no models, views, tasks, migrations, or other functionality.

- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files

paperless_remote app is retained as it contains meaningful system
checks for Azure AI configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Moves the checks and tests to the main application and removes the old applications

* Adds a comment to satisy Sonar

* refactor: remove automatic log_summary() call from get_parser_registry()

The summary was logged once per process, causing it to appear repeatedly
during Docker startup (management commands, web server, each Celery
worker subprocess). External parsers are already announced individually
at INFO when discovered; the full summary is redundant noise.
log_summary() is retained on ParserRegistry for manual/debug use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Cleans up the duplicate test file/fixture

* Fixes a race condition where webserver threads could race to populate the registry

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22 06:53:32 -07:00
shamoon
0f84af27d0 Merge branch 'main' into dev
# Conflicts:
#	docs/setup.md
#	src-ui/src/main.ts
#	src/documents/tests/test_api_bulk_edit.py
#	src/documents/tests/test_api_custom_fields.py
#	src/documents/tests/test_api_search.py
#	src/documents/tests/test_api_status.py
#	src/documents/tests/test_workflows.py
#	src/paperless_mail/tests/test_api.py
2026-03-21 02:12:19 -07:00
shamoon
3cbdf5d0b7 Fix: require view permission for more-like search 2026-03-21 01:20:59 -07:00
shamoon
9e9fc6213c Resolve GHSA-96jx-fj7m-qh6x 2026-03-20 15:39:15 -07:00
shamoon
87ebd13abc Fix: remove pagination from document notes api spec (#12388) 2026-03-18 06:48:05 -07:00
Trenton H
aea2927a02 Feature: Convert Tika parser to the plugin system (#12333)
* Chore: move Tika parser and tests to paperless/

Move TikaDocumentParser and its tests to the canonical parser package
location, matching the pattern established for TextDocumentParser:

- src/paperless_tika/parsers.py → src/paperless/parsers/tika.py
- src/paperless_tika/tests/test_tika_parser.py → src/paperless/tests/parsers/test_tika_parser.py
- src/paperless_tika/tests/samples/ → src/paperless/tests/samples/tika/

Merge tika fixtures (tika_parser, sample_odt_file, sample_docx_file,
sample_doc_file, sample_broken_odt) into the shared parsers conftest.
Remove the now-empty src/paperless_tika/tests/conftest.py.

Content is unchanged — this commit is rename-only so git history is
preserved on the moved files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feature: Phase 3 — migrate TikaDocumentParser to ParserProtocol

Refactor TikaDocumentParser to satisfy ParserProtocol without subclassing
the legacy DocumentParser ABC:

- Add ClassVars: name, version, author, url
- Add supported_mime_types() classmethod (12 Office/ODF/RTF MIME types)
- Add score() classmethod — returns None when TIKA_ENABLED is False, 10 otherwise
- can_produce_archive = False (PDF is for display, not an OCR archive)
- requires_pdf_rendition = True (Office formats need PDF for browser display)
- __enter__/__exit__ via ExitStack: TikaClient opened once per parser
  lifetime and shared across parse() and extract_metadata() calls
- extract_metadata() falls back to a short-lived TikaClient when called
  outside a context manager (legacy view-layer metadata path)
- _convert_to_pdf() uses OutputTypeConfig() to honour the database-stored
  ApplicationConfiguration before falling back to the env-var setting
- Rename convert_to_pdf → _convert_to_pdf (private helper)

Update paperless_tika/signals.py shim to import from the new module path
and drop the legacy logging_group/progress_callback kwargs.

Update documents/consumer.py to extend the existing TextDocumentParser
special cases to also cover TikaDocumentParser (parse/get_thumbnail
signatures, __exit__ cleanup).

Add TestTikaParserRegistryInterface (7 tests) covering score(), properties,
and ParserProtocol isinstance check.  Update existing tests to use the new
accessor API (get_text, get_date, get_archive_path, _convert_to_pdf).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: update remaining imports and move live Tika tests after parser migration

- src/documents/tests/test_parsers.py: import TikaDocumentParser from
  paperless.parsers.tika (old paperless_tika.parsers no longer exists)
- git mv paperless_tika/tests/test_live_tika.py →
  paperless/tests/parsers/test_live_tika.py to co-locate all Tika tests
  with the parser; update import and replace old attribute API
  (tika_parser.text/.archive_path) with accessor methods
  (get_text/get_archive_path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: satisfy mypy and pyrefly for TikaDocumentParser

Use a TYPE_CHECKING-guarded assert to narrow self._tika_client from
TikaClient | None to TikaClient at the point of use in parse().  The
assert is visible to type checkers (TYPE_CHECKING=True) so both mypy
and pyrefly accept the subsequent attribute accesses without error;
at runtime TYPE_CHECKING is False so the assert never executes and no
ruff S101 suppression is required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: require context manager for TikaDocumentParser; clean up client lifecycle

- consumer.py: call __enter__ for new-style parsers so _tika_client and
  _gotenberg_client are set before parse() is invoked
- views.py: use `with parser` (via nullcontext for old-style parsers) in
  get_metadata so extract_metadata always runs inside a context manager
- tika.py: GotenbergClient added to ExitStack alongside TikaClient;
  inline client creation removed from extract_metadata and _convert_to_pdf;
  __exit__ uses ExitStack.close() instead of __exit__ pass-through

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 15:43:28 -07:00
shamoon
2bb4af2be6 Change: sort custom fields alphabetically by default (#12358) 2026-03-15 22:52:02 -07:00
shamoon
45b363659e Chore: mark document detail email action as deprecated (#12308) 2026-03-12 15:42:14 +00:00
shamoon
86573fc1a0 Chore: separate actions from bulk edit endpoint (#12286) 2026-03-10 18:55:36 +00:00
shamoon
85a18e5911 Enhancement: saved view sharing (#12142) 2026-03-04 14:15:43 -08:00
shamoon
d51a118aac Merge branch 'main' into dev 2026-03-04 13:31:20 -08:00
shamoon
5b809122b5 Fix: apply ordering after annotating tag document count (#12238) 2026-03-04 00:33:13 -08:00
shamoon
96ac7b2336 Tweak: Ignore version docs for workflows (#12217) 2026-03-02 08:21:14 -08:00
shamoon
f65807b906 Merge branch 'main' into dev
# Conflicts:
#	docs/setup.md
#	src-ui/src/app/components/manage/document-attributes/management-list/management-list.component.ts
#	src/documents/tests/test_api_documents.py
2026-02-28 02:31:20 -08:00
shamoon
c7f83212a3 Enforce on selection_data too 2026-02-28 01:27:40 -08:00
Jan Kleine
c86ebc0260 Enhancment: Formatted filename for single document downloads (#12095)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-02-26 18:06:47 +00:00
shamoon
ceee769e26 Feature: document file versions (#12061) 2026-02-26 16:46:54 +00:00
Trenton H
53ac338946 Breaking: Removes API v1 and the related serializer (#12166) 2026-02-26 04:06:43 +00:00
shamoon
426c0a8974 Chore: typing fixes 2026-02-16 09:54:06 -08:00
shamoon
4884b67714 Fix more typing failures 2026-02-16 09:37:33 -08:00
shamoon
02896a15fd Fix: only pass user to SerializerWithPerms serializers 2026-02-16 09:31:33 -08:00
shamoon
d8e07b8d84 Fix typing issue 2026-02-16 09:17:20 -08:00
shamoon
be4e29a19c Merge branch 'main' into dev 2026-02-16 09:01:19 -08:00
shamoon
afaf39e43a Fix/GHSA-x395-6h48-wr8v 2026-02-16 00:02:15 -08:00
shamoon
5b45b89d35 Performance fix: use subqueries to improve object retrieval in large installs (#11950) 2026-02-05 08:46:32 -08:00
Trenton H
2ec8ec96c8 Feature: Enable users to customize date parsing via plugins (#11931) 2026-02-03 20:09:13 +00:00
Sebastian Steinbeißer
3b5ffbf9fa Chore(mypy): Annotate None returns for typing improvements (#11213) 2026-02-02 08:44:12 -08:00
shamoon
c3b036e0d3 Merge branch 'main' into dev 2026-01-31 09:10:33 -08:00
shamoon
c8c4c7c749 Security: enforce permissions for post_document 2026-01-30 12:14:18 -08:00
shamoon
e4b861d76f Fix: prevent note deletion outside doc 2026-01-29 13:35:01 -08:00
shamoon
1f074390e4 Feature: sharelink bundles (#11682) 2026-01-27 18:54:51 +00:00
Antoine Mérino
df07b8a03e Performance: faster statistics panel on dashboard (#11760) 2026-01-26 12:10:57 -08:00
shamoon
857aaca493 Merge branch 'release/v2.20.x' into dev 2026-01-26 09:25:58 -08:00
shamoon
891f4a2faf Fix: correctly extract all ids for nested tags (#11888) 2026-01-26 09:12:03 -08:00
shamoon
2312314aa7 Performance: improve treenode inefficiencies (#11606) 2026-01-25 21:47:08 -08:00
Trenton H
d0032c18be Breaking: Remove support for document and thumbnail encryption (#11850) 2026-01-24 19:29:54 -08:00
shamoon
00bb92e3e1 Fix: support ordering by storage path name (#11661) 2026-01-13 09:36:14 -08:00
shamoon
e940764fe0 Feature: Paperless AI (#10319) 2026-01-13 16:24:42 +00:00