Commit Graph

11254 Commits

Author SHA1 Message Date
Trenton H
b10f3de2eb Enhancement: rank autocomplete suggestions by document frequency
Replace set-based alphabetical autocomplete with Counter-based
document-frequency ordering. Words appearing in more of the user's
visible documents rank first — the same signal Whoosh used for
Tf/Idf-based ordering, computed permission-correctly from already-
fetched stored values at no extra index cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 13:25:56 -07:00
Trenton H
b626f5602c Docs: update search documentation for Tantivy backend
- configuration.md: add PAPERLESS_SEARCH_LANGUAGE and
  PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD settings
- usage.md: replace Whoosh query language link with Tantivy; remove
  "inexact terms are slow" note; add full natural date keyword list;
  add fuzzy search note
- api.md: update autocomplete ordering description (alphabetical, not Tf/Idf)
- administration.md: deprecate `optimize` subcommand (now a no-op);
  add one-time reindex upgrade note

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 13:19:27 -07:00
Trenton H
7f63259f41 Remove silly lines 2026-03-30 13:08:57 -07:00
Trenton H
a213c2cc9b test(search): expand query rewriting coverage for Whoosh compat shims
Fold [-N unit to now] range, field:YYYYMMDD (with TZ-aware DateField vs
DateTimeField logic), and parse_user_query fuzzy path into the renamed
TestWhooshQueryRewriting class. TestParseUserQuery covers the full pipeline
including fuzzy blend mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 12:55:23 -07:00
Trenton H
34d2897ab1 refactor(search): replace context manager smell with explicit open/close lifecycle
TantivyBackend now uses open()/close()/_ensure_open() instead of __enter__/__exit__.
get_backend() tracks _backend_path and auto-reinitializes when settings.INDEX_DIR
changes, fixing the xdist/override_settings isolation bug where parallel workers
would share a stale singleton pointing at a deleted index directory.

Test fixtures use in-memory indices (path=None) for speed and isolation.
Singleton behavior covered by TestSingleton in test_backend.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 12:49:46 -07:00
Trenton H
50f6b2d4c3 feat(search): wire Tantivy backend into all callsites; remove Whoosh
- Replace all `from documents import index` + Whoosh writer usage across
  admin.py, bulk_edit.py, tasks.py, views.py, signals/handlers.py with
  `get_backend().add_or_update/remove/batch_update`
- Add `effective_content` param to `_build_tantivy_doc` / `add_or_update`
  (used by signal handler to re-index root doc with version's OCR text)
- Add `wipe_index()` (renamed from `_wipe_index`) to public API; use from
  `document_index --recreate` flag
- `index_optimize()` replaced with deprecation log message; Tantivy
  manages segment merging automatically
- `index_reindex()` now calls `get_backend().rebuild()` + `reset_backend()`
  with select_related/prefetch_related for efficiency
- `document_index` management command: add `--recreate` flag
- Status view: use `get_backend()` + dir mtime scan instead of Whoosh
  `ix.last_modified()`
- Delete `documents/index.py`, `test_index.py`, `test_delayedquery.py`
- Update all tests: patch `documents.search.get_backend` (lazy imports);
  `DirectoriesMixin` calls `reset_backend()` in setUp/tearDown;
  `TestDocumentConsumptionFinishedSignal` likewise
- `test_api_search.py`: fix order-independent assertions for date-range
  queries; fix `_rewrite_8digit_date` to be field-aware and
  timezone-correct for DateTimeField vs DateField

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 10:43:30 -07:00
Trenton H
9df2a603b7 feat(search): package public exports
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 08:41:02 -07:00
Trenton H
fcd4d28f37 refactor(search): replace NLTK autocomplete extraction with regex \w+ + timeout
NLTK was inappropriate here: no stopword filtering (users should be able to
autocomplete any word), no length floor, and unicode-aware \w+ splits
consistently with Tantivy's simple tokenizer. regex library used (already a
project dependency) for ReDoS protection via per-call timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 08:38:22 -07:00
Trenton H
0fb57205db feat(search): complete TantivyBackend — search, autocomplete, more_like_this, rebuild, WriteBatch
Dual-field approach for notes/custom_fields: JSON fields support structured queries
(notes.user:alice, custom_fields.name:invoice); companion text fields (note, custom_field)
carry content for default full-text search — tantivy-py 0.25 parse_query rejects dotted
paths in default_field_names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 08:31:52 -07:00
Trenton H
0078ef9cd5 refactor(search): add docstrings and complete type annotations to all search module functions
- Add descriptive docstrings to all functions in _schema.py, _tokenizer.py, and _query.py
- Complete type annotations for all function parameters and return values
- Fix 8 mypy strict errors in _query.py:
  - Add re.Match[str] type parameters for regex matches
  - Fix "Returning Any" error with str() cast
  - Add type annotations for build_permission_filter() and parse_user_query()
  - Remove lazy imports, move to module top level
- All 29 search module tests continue to pass

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 15:26:07 -07:00
Trenton H
957049c512 fix(search): register fast-field tokenizer for simple_analyzer; fix perm_index fixture
Tantivy requires register_fast_field_tokenizer for any tokenizer used by
fast=True text fields — it writes default fast column values on every commit
even when a document omits those fields, raising ValueError otherwise.

perm_index fixture simplified to use in-memory index (path=None).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 15:01:39 -07:00
Trenton H
33da63c229 feat(search): normalize_query, build_permission_filter, parse_user_query pipeline
Implement query normalization and permission filtering for Tantivy search:

- normalize_query: expands comma-separated field values with AND operator
- build_permission_filter: security-critical permission filtering for documents
  - no owner (NULL in Django) → documents without owner_id field
  - owned by user → owner_id = user.pk
  - shared with user → viewer_id = user.pk
  - uses disjunction_max_query for proper OR semantics
  - workaround for tantivy-py unsigned type detection bug via range_query
- parse_user_query: full pipeline with fuzzy search support
- DEFAULT_SEARCH_FIELDS and boost configuration

Note: Permission filter tests require Tantivy environment setup;
core functionality implemented and normalize tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:56:58 -07:00
Trenton H
cbeb7469a1 feat(search): natural date keyword rewriting with Whoosh compat shims
Implement date/timezone boundary math for natural language date queries:

- `created` (DateField): local calendar date to UTC midnight boundaries
- `added`/`modified` (DateTimeField): local day boundaries with full offset arithmetic
- Whoosh compat shims: compact dates (YYYYMMDDHHmmss) → ISO 8601
- Relative ranges: `[now-7d TO now]` → concrete ISO timestamps
- Natural keywords: today, yesterday, this_week, last_week, etc.
- Timezone-aware: handles UTC offset arithmetic for datetime fields
- Passthrough: bare keywords without field prefixes unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:47:06 -07:00
Trenton H
2cf85d9b58 chore: make whoosh imports lazy to unblock test collection during Tantivy migration
Module-level whoosh imports in tasks.py and paperless/views.py prevented
test collection after removing whoosh-reloaded. Move to lazy imports inside
the functions that use them; will be removed entirely in Task 14.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:37:28 -07:00
Trenton H
494d17e7ac feat(search): PAPERLESS_SEARCH_LANGUAGE and PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD settings
Add two new environment variables for Tantivy search backend:
- PAPERLESS_SEARCH_LANGUAGE: language code for stemming (empty string disables)
- PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD: float threshold for fuzzy search blending (None disables)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:33:37 -07:00
Trenton H
e8fe3a6a62 feat(search): tokenizer registration — paperless_text with language stemming, simple_analyzer, bigram_analyzer
This commit implements Task 3 of the Tantivy search backend migration:

- Add `src/documents/search/_tokenizer.py` with three custom tokenizers:
  - `paperless_text`: simple → remove_long(65) → lowercase → ascii_fold [→ stemmer]
    Supports 18 languages via Snowball stemmer with fallback warning for unsupported languages
  - `simple_analyzer`: simple → lowercase → ascii_fold (for shadow sort fields)
  - `bigram_analyzer`: ngram(2,2) → lowercase (for CJK/no-whitespace language support)

- Add comprehensive tests in `src/documents/tests/search/test_tokenizer.py`:
  - ASCII folding test: verifies "café résumé" is findable as "cafe resume"
  - Bigram CJK test: verifies "東京都" is searchable by substring "東京"
  - Warning test: verifies unsupported languages log appropriate warnings

- `register_tokenizers()` function must be called on every Index instance
  as tantivy requires re-registration at each open

- Language support includes common ISO 639-1 codes and full names:
  Arabic, Danish, Dutch, English, Finnish, French, German, Greek,
  Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian,
  Spanish, Swedish, Tamil, Turkish

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:30:40 -07:00
Trenton H
884edd6eea feat(search): schema definition and open_or_rebuild_index with sentinel logic
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:08:09 -07:00
Trenton H
d00fb4f345 feat: add tantivy dependency, search package skeleton, search pytest marker
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 13:56:56 -07:00
Trenton H
9383471fa0 Feature: Transition all checksums to use SHA256 (#12432) 2026-03-26 11:28:02 -07:00
dependabot[bot]
0060b46c8b Chore(deps): Bump requests in the uv group across 1 directory (#12441)
Bumps the uv group with 1 update in the / directory: [requests](https://github.com/psf/requests).


Updates `requests` from 2.32.5 to 2.33.0
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.5...v2.33.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.33.0
  dependency-type: indirect
  dependency-group: uv
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 09:04:20 -07:00
GitHub Actions
b153ec803b Auto translate strings 2026-03-26 14:38:10 +00:00
shamoon
38dba60ceb Enhancement: auto-hide the search bar on mobile (#12404) 2026-03-26 07:36:32 -07:00
shamoon
ae0474450f Chore: logger, response and template sanitization cleanup (#12439) 2026-03-26 07:36:02 -07:00
Trenton H
8efb01010c fix: Don't silently drop the change_groups and switch to a couple slightly more efficient implementations (#12431) 2026-03-26 14:15:42 +00:00
Trenton H
d18bbfa9c3 Chore: Instead of manual temporary directory management, use a context manager (#12430) 2026-03-26 14:05:58 +00:00
GitHub Actions
ec76d3c762 Auto translate strings 2026-03-26 12:45:29 +00:00
shamoon
bdc0a58242 Security: prevent prototype pollution in settings and list view (#12438) 2026-03-26 05:43:49 -07:00
dependabot[bot]
b049ad9626 Chore(deps): Bump cbor2 in the uv group across 1 directory (#12424)
Bumps the uv group with 1 update in the / directory: [cbor2](https://github.com/agronholm/cbor2).


Updates `cbor2` from 5.8.0 to 5.9.0
- [Release notes](https://github.com/agronholm/cbor2/releases)
- [Commits](https://github.com/agronholm/cbor2/compare/5.8.0...5.9.0)

---
updated-dependencies:
- dependency-name: cbor2
  dependency-version: 5.9.0
  dependency-type: indirect
  dependency-group: uv
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 09:26:24 -07:00
GitHub Actions
79def8a200 Auto translate strings 2026-03-22 13:55:02 +00:00
Trenton H
701735f6e5 Chore: Drop old signal and unneeded apps, transition to parser registry instead (#12405)
* refactor: switch consumer and callers to ParserRegistry (Phase 4)

Replace all Django signal-based parser discovery with direct registry
calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all
old-style isinstance checks. All parser instantiation now uses the
`with parser_class() as parser:` context manager pattern.

- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: drop get_parser_class_for_mime_type; callers use registry directly

All callers now call get_parser_registry().get_parser_for_file() with
the actual filename and path, enabling score() to use file extension
hints. The MIME-only helper is removed.

- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior);
  TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove document_consumer_declaration signal infrastructure

Remove the document_consumer_declaration signal that was previously used
for parser registration. Each parser app no longer connects to this signal,
and the signal declaration itself has been removed from documents/signals.

Changes:
- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
  - paperless_tesseract/signals.py
  - paperless_text/signals.py
  - paperless_tika/signals.py
  - paperless_mail/signals.py
  - paperless_remote/signals.py

Parser discovery now happens exclusively through the ParserRegistry
system introduced in the previous refactor phases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove empty paperless_text and paperless_tika Django apps

After parser classes were moved to paperless/parsers/ in the plugin
refactor, these Django apps contained only empty AppConfig classes
with no models, views, tasks, migrations, or other functionality.

- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files

paperless_remote app is retained as it contains meaningful system
checks for Azure AI configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Moves the checks and tests to the main application and removes the old applications

* Adds a comment to satisy Sonar

* refactor: remove automatic log_summary() call from get_parser_registry()

The summary was logged once per process, causing it to appear repeatedly
during Docker startup (management commands, web server, each Celery
worker subprocess). External parsers are already announced individually
at INFO when discovered; the full summary is redundant noise.
log_summary() is retained on ParserRegistry for manual/debug use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Cleans up the duplicate test file/fixture

* Fixes a race condition where webserver threads could race to populate the registry

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22 06:53:32 -07:00
GitHub Actions
07f54bfdab Auto translate strings 2026-03-21 09:26:23 +00:00
shamoon
0f84af27d0 Merge branch 'main' into dev
# Conflicts:
#	docs/setup.md
#	src-ui/src/main.ts
#	src/documents/tests/test_api_bulk_edit.py
#	src/documents/tests/test_api_custom_fields.py
#	src/documents/tests/test_api_search.py
#	src/documents/tests/test_api_status.py
#	src/documents/tests/test_workflows.py
#	src/paperless_mail/tests/test_api.py
2026-03-21 02:12:19 -07:00
shamoon
9646b8c67d Bump version to 2.20.13 v2.20.13 2026-03-21 01:50:04 -07:00
shamoon
e590d7df69 Merge branch 'release/v2.20.x' 2026-03-21 01:49:32 -07:00
shamoon
cc71aad058 Fix: suggest corrections only if visible results 2026-03-21 01:24:23 -07:00
shamoon
3cbdf5d0b7 Fix: require view permission for more-like search 2026-03-21 01:20:59 -07:00
shamoon
f84e0097e5 Fix validate document link targets 2026-03-21 00:55:36 -07:00
shamoon
7dbf8bdd4a Fix: enforce permissions when attaching accounts to mail rules 2026-03-21 00:44:28 -07:00
github-actions[bot]
d2a752a196 Documentation: Add v2.20.12 changelog (#12407)
* Changelog v2.20.12 - GHA

* Update changelog for version 2.20.12

Added security advisory for GHSA-96jx-fj7m-qh6x and fixed workflow saves and usermod/groupmod issues.

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-20 19:23:06 -07:00
shamoon
2cb155e717 Bump version to 2.20.12 v2.20.12 2026-03-20 15:47:37 -07:00
shamoon
9e9fc6213c Resolve GHSA-96jx-fj7m-qh6x 2026-03-20 15:39:15 -07:00
Trenton H
a9756f9462 Chore: Convert Tesseract parser to plugin style (#12403)
* Move tesseract parser, tests, and samples to paperless.parsers

Relocates files in preparation for the Phase 3 Protocol-based parser
refactor, preserving full git history via rename.

- src/paperless_tesseract/parsers.py -> src/paperless/parsers/tesseract.py
- src/paperless_tesseract/tests/test_parser.py -> src/paperless/tests/parsers/test_tesseract_parser.py
- src/paperless_tesseract/tests/test_parser_custom_settings.py -> src/paperless/tests/parsers/test_tesseract_custom_settings.py
- src/paperless_tesseract/tests/samples/* -> src/paperless/tests/samples/tesseract/
- Moves RUF001 suppression from broad per-file pyproject.toml ignore to inline noqa comments on the two affected lines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor RasterisedDocumentParser to ParserProtocol interface

- Add RasterisedDocumentParser to registry.register_defaults()
- Update parser class: remove DocumentParser inheritance, add Protocol
  class attrs/classmethods/properties, context-manager lifecycle
- Add read_file_handle_unicode_errors() to shared parsers/utils.py
- Replace inline unicode-error-handling with shared utility call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update tesseract signals.py to import from new parser location

RasterisedDocumentParser moved to paperless.parsers.tesseract; update
the lazy import in signals.get_parser so the signal-based consumer
declaration continues to work during the registry transition. Pop
logging_group and progress_callback kwargs for constructor compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests: rewrite test_tesseract_parser to pytest style with typed fixtures

- Converts all tests from Django TestCase to pytest-style classes
- Adds tesseract_samples_dir, null_app_config, tesseract_parser, and
  make_tesseract_parser fixtures in conftest.py; all DB-free except
  TestOcrmypdfParameters which uses @pytest.mark.django_db
- Defines MakeTesseractParser type alias in conftest.py for autocomplete
- Fixes FBT001 (boolean positional args) by making bool params
  keyword-only with * separator in parametrize test signatures
- Adds type annotations to all fixture parameters for IDE support
- Uses pytest.param(..., id="...") throughout; pytest-mock for patching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(types): fully annotate paperless/parsers/tesseract.py

Fixes all mypy and pyrefly errors in the new parser file:

- Add missing type annotations to is_image, has_alpha, get_dpi,
  calculate_a4_dpi, construct_ocrmypdf_parameters, post_process_text
- Narrow Path-only (no str) for image helper args; convert to str when
  building list[str] args for run_subprocess
- Annotate ocrmypdf_args as dict[str, Any] so operator expressions on
  its values type-check and ocrmypdf.ocr(**args) resolves cleanly
- Declare text: str | None = None at top of extract_text to unify
  all assignments to the same type across both branches
- Import Any from typing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fixes isort

* fix: add RasterisedDocumentParser to new-style parser shim checks

The new RasterisedDocumentParser uses __enter__/__exit__ for resource
management instead of cleanup(). Update all existing new-style shims to
include it in the isinstance checks:

- documents/consumer.py: _parser_cleanup(), parser_is_new_style
- documents/tasks.py: parser_is_new_style, finally cleanup branch
  (also adds RemoteDocumentParser which was missing from the latter)
- documents/management/commands/document_thumbnails.py: adds new-style
  handling from scratch (enter/exit + 2-arg get_thumbnail signature)

Fix stale import paths in three test files that were still importing
from paperless_tesseract.parsers instead of paperless.parsers.tesseract.

Fix two registry tests that used application/pdf as a proxy for "no
handler" — now that RasterisedDocumentParser is registered, PDF always
has a handler, so switch to a truly unsupported MIME type.

Signal infrastructure and shims remain intact; this is plumbing only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* One missed import (cherry pick?)

* Adds a no cover for a special case of handling unicode errors in PDF metadata

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 12:46:07 -07:00
Trenton H
c2b8b22fb4 Chore: Convert mail parser to plugin style (#12397)
* Refactor(mail): rename paperless_mail/parsers.py → paperless/parsers/mail.py

Preserve git history for MailDocumentParser by committing the rename
separately before editing, following the project convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor(mail): move mail parser tests to paperless/tests/parsers/

Move test_parsers.py → test_mail_parser.py and test_parsers_live.py →
test_mail_parser_live.py alongside the other built-in parser tests,
preserving git history before editing. Update MailDocumentParser import
to the new canonical location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore: move mail parser sample files to paperless/tests/samples/mail/

Relocate all mail test fixtures from src/paperless_mail/tests/samples/ to
src/paperless/tests/samples/mail/ ahead of the parser plugin refactor.
Add the new path to the codespell skip list to prevent false-positive
spell corrections in binary/fixture email files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feat(tests): add mail parser fixtures to paperless/tests/parsers/conftest.py

Add mail_samples_dir, per-file sample fixtures, and mail_parser
(context-manager style) to mirror the old paperless_mail conftest
but rooted at the new samples/mail/ location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feat(parsers): migrate MailDocumentParser to ParserProtocol

Move the mail parser from paperless_mail/parsers.py to
paperless/parsers/mail.py and refactor it to implement ParserProtocol:

- Class-level name/version/author/url attributes
- supported_mime_types() and score() classmethods (score=20)
- can_produce_archive=False, requires_pdf_rendition=True
- Context manager lifecycle (__enter__/__exit__)
- New parse() signature without mailrule_id kwarg; consumer sets
  parser.mailrule_id before calling parse() instead
- get_text()/get_date()/get_archive_path() accessor methods
- extract_metadata() returning email headers and attachment info

Register MailDocumentParser in the ParserRegistry alongside Text and
Tika parsers. Update consumer, signals, and all import sites to use
the new location. Update tests to use the new accessor API, patch
paths, and context-manager fixture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix(parsers): pop legacy constructor args in mail signal wrapper

MailDocumentParser.__init__ takes no constructor args in the new
protocol. Update the get_parser() signal wrapper to pop logging_group
and progress_callback (passed by the legacy consumer dispatch path)
before instantiating — the same pattern used by TextDocumentParser.

Also update test_mail_parser_receives_mailrule to use the real signal
wrapper (mail_get_parser) instead of MailDocumentParser directly, so
the test exercises the actual dispatch path and matches the new
parse() call signature (no mailrule kwarg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Bumps this so we can run

* Fixes location of the fixture

* Removes fixtures which were duplicated

* Feat(parsers): add ParserContext and configure() to ParserProtocol

Replace the ad-hoc mailrule_id attribute assignment with a typed,
immutable ParserContext dataclass and a configure() method on the
Protocol:

- ParserContext(frozen=True, slots=True) lives in paperless/parsers/
  alongside ParserProtocol and MetadataEntry; currently carries only
  mailrule_id but is designed to grow with output_type, ocr_mode, and
  ocr_language in a future phase (decoupling parsers from settings.*)
- ParserProtocol.configure(context: ParserContext) -> None is the
  extension point; no-op by default
- MailDocumentParser.configure() reads mailrule_id into _mailrule_id
- TextDocumentParser and TikaDocumentParser implement a no-op configure()
- Consumer calls document_parser.configure(ParserContext(...)) before
  parse(), replacing the isinstance(parser, MailDocumentParser) guard
  and the direct attribute mutation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feat(parsers): call configure(ParserContext()) in update_document task

Apply the same new-style parser shim pattern as the consumer to
update_document_content_maybe_archive_file:

- Call __enter__ for Text/Tika parsers after instantiation
- Call configure(ParserContext()) before parse() for all new-style parsers
  (mailrule_id is not available here — this is a re-process of an
  existing document, so the default empty context is correct)
- Call parse(path, mime_type) with 2 args for new-style parsers
- Call get_thumbnail(path, mime_type) with 2 args for new-style parsers
- Call __exit__ instead of cleanup() in the finally block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix(tests): add configure() to DummyParser and missing-method parametrize

ParserProtocol now requires configure(context: ParserContext) -> None.
Update DummyParser in test_registry.py to implement it, and add
'missing-configure' to the test_partial_compliant_fails_isinstance
parametrize list so the new method is covered by the negative test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Cleans up the reprocess task and generally reduces duplicate of classes

* Corrects the score return

* Updates so we can report a page count for these parsers, assuming we do have an archive produced when called

* Increases test coverage

* One more coverage

* Updates typing

* Updates typing

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 09:22:18 -07:00
shamoon
d671e34559 Documentation: OIDC token_auth_method setting for v3 (#12398) 2026-03-19 22:03:00 -07:00
dependabot[bot]
f7c12d550a Chore(deps): Bump tinytag in the uv group across 1 directory (#12396)
Bumps the uv group with 1 update in the / directory: [tinytag](https://github.com/tinytag/tinytag).


Updates `tinytag` from 2.2.0 to 2.2.1
- [Release notes](https://github.com/tinytag/tinytag/releases)
- [Commits](https://github.com/tinytag/tinytag/compare/2.2.0...2.2.1)

---
updated-dependencies:
- dependency-name: tinytag
  dependency-version: 2.2.1
  dependency-type: indirect
  dependency-group: uv
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-19 11:02:16 -07:00
Trenton H
68fc898042 Fix: Resolve more instances of tests which mutated global states (#12395) 2026-03-19 10:05:07 -07:00
Trenton H
2cbe6ae892 Feature: Convert remote AI parser to plugin system (#12334)
* Refactor: move remote parser, test, and sample to paperless.parsers

Relocates three files to their new homes in the parser plugin system:

- src/paperless_remote/parsers.py
    → src/paperless/parsers/remote.py
- src/paperless_remote/tests/test_parser.py
    → src/paperless/tests/parsers/test_remote_parser.py
- src/paperless_remote/tests/samples/simple-digital.pdf
    → src/paperless/tests/samples/remote/simple-digital.pdf

Content and imports will be updated in the follow-up commit that
rewrites the parser to the new ParserProtocol interface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feature: migrate RemoteDocumentParser to ParserProtocol interface

Rewrites the remote OCR parser to the new plugin system contract:

- `supported_mime_types()` is now a classmethod that always returns the
  full set of 7 MIME types; the old instance-method hack (returning {}
  when unconfigured) is removed
- `score()` classmethod returns None when no remote engine is configured
  (making the parser invisible to the registry), and 20 when active —
  higher than the tesseract default of 10 so the remote engine takes
  priority when both are available
- No longer inherits from RasterisedDocumentParser; inherits no parser
  class at all — just implements the protocol directly
- `can_produce_archive = True`; `requires_pdf_rendition = False`
- `_azure_ai_vision_parse()` takes explicit config arg; API client
  created and closed within the method
- `get_page_count()` returns the PDF page count for application/pdf,
  delegating to the new `get_page_count_for_pdf()` utility
- `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs;
  returns [] for all other MIME types

New files:
- `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and
  `get_page_count_for_pdf()` utilities (pikepdf-based); both the remote
  and tesseract parsers will use these going forward
- `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style
  tests using pytest-django `settings` and pytest-mock `mocker` fixtures
- `src/paperless/tests/parsers/conftest.py` — remote parser instance,
  sample-file, and settings-helper fixtures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: use fixture factory and usefixtures in remote parser tests

- `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture
  in conftest.py; tests call `make_azure_mock()` or
  `make_azure_mock("custom text")` instead of a module-level function
- `azure_settings` and `no_engine_settings` applied via
  `@pytest.mark.usefixtures` wherever their value is not referenced
  inside the test body; `TestRemoteParserParseError` marked at the class
  level since all three tests need the same setting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: improve remote parser test fixture structure

- make_azure_mock moved from conftest.py back into test_remote_parser.py;
  it is specific to that module and does not belong in shared fixtures
- azure_client fixture composes azure_settings + make_azure_mock + patch
  in one step; tests no longer repeat the mocker.patch call or carry an
  unused azure_settings parameter
- failing_azure_client fixture similarly composes azure_settings + patch
  with a RuntimeError side effect; TestRemoteParserParseError now only
  receives the mock it actually uses
- All @pytest.mark.parametrize calls use pytest.param with explicit ids
  (pdf, png, jpeg, ...) for readable test output

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: wire RemoteDocumentParser into consumer and fix signals

- paperless_remote/signals.py: import from paperless.parsers.remote
  (new location after git mv). supported_mime_types() is now a
  classmethod that always returns the full set, so get_supported_mime_types()
  in the signal layer explicitly checks RemoteEngineConfig validity and
  returns {} when unconfigured — preserving the old behaviour where an
  unconfigured remote parser does not register for any MIME types.

- documents/consumer.py: extend the _parser_cleanup() shim, parse()
  dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
  alongside TextDocumentParser. Both new-style parsers use __exit__
  for cleanup and take (document_path, mime_type) without a file_name
  argument.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: fix type errors in remote parser and signals

- remote.py: add `if TYPE_CHECKING: assert` guards before the Azure
  client construction to narrow config.endpoint and config.api_key from
  str|None to str. The narrowing is safe: engine_is_valid() guarantees
  both are non-None when it returns True (api_key explicitly; endpoint
  via `not (engine=="azureai" and endpoint is None)` for the only valid
  engine). Asserts are wrapped in TYPE_CHECKING so they carry zero
  runtime cost.

- signals.py: add full type annotations — return types, Any-typed
  sender parameter, and explicit logging_group argument replacing *args.
  Add `from __future__ import annotations` for consistent annotation style.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: get_parser factory forwards logging_group, drops progress_callback

consumer.py calls parser_class(logging_group, progress_callback=...).
RemoteDocumentParser.__init__ accepts logging_group but not
progress_callback, so only the latter is dropped — matching the pattern
established by the TextDocumentParser signals shim.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: text parser get_parser forwards logging_group, drops progress_callback

TextDocumentParser.__init__ accepts logging_group: object = None, same
as RemoteDocumentParser. The old shim incorrectly dropped it; fix to
forward it as a positional arg and only drop progress_callback.
Add type annotations and from __future__ import annotations for
consistency with the remote parser signals shim.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 16:19:46 -07:00
Trenton H
b0bb31654f Bumps zensical to 0.0.26 to resolve the wheel building it tries to do (#12392) 2026-03-18 22:53:34 +00:00
Trenton H
0f7c02de5e Fix: test: add regression test for workflow save clobbering filename (#12390)
Add test_workflow_document_updated_does_not_overwrite_filename to
verify that run_workflows (DOCUMENT_UPDATED path) does not revert a
DB filename that was updated by a concurrent bulk_update_documents
task's update_filename_and_move_files call.

The test replicates the race window by:
  - Updating the DB filename directly (simulating BUD-1 completing)
  - Mocking refresh_from_db so the stale in-memory filename persists
  - Asserting the DB filename is not clobbered after run_workflows

Relates to: https://github.com/paperless-ngx/paperless-ngx/issues/12386

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 13:31:09 -07:00
Trenton H
95dea787f2 Fix: don't try to usermod/groupmod when non-root + update docs (#12365) (#12391) 2026-03-18 10:38:45 -07:00