The origin/dev branch added include_selection_data support to the search
list() view (PR #12300). Our Tantivy list() had replaced the Whoosh
implementation entirely, causing a conflict.
Resolution: keep the Tantivy implementation and incorporate the
include_selection_data feature. When requested, selection_data is computed
over all matching document IDs from ordered_hits (the full Tantivy result
set, not just the current page).
Also update test_search_with_include_selection_data from #12300 to use
the Tantivy indexing API (get_backend().add_or_update) instead of the
removed Whoosh AsyncWriter.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When open_or_rebuild_index is called and the index directory does not exist,
return a fresh in-memory Tantivy index instead of creating the directory as
a side effect. This prevents workspace contamination during test runs where
INDEX_DIR has not been redirected to a temp directory.
In production the data directory is always created during setup, so disk-
based indexes continue to work normally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move the duplicated `IterWrapper` type alias and `identity` function from
tasks.py, _backend.py, sanity_checker.py, and paperless_ai/indexing.py into
a single location in documents/utils.py. All four callers now import from
there.
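
A minimal sketch of what such a shared helper could look like. The commit only names `IterWrapper` and `identity`; the exact alias shape and the `process` caller below are assumptions for illustration.

```python
from collections.abc import Callable, Iterable
from typing import TypeVar

T = TypeVar("T")

# Assumed shape: a callable that wraps an iterable (for example a
# progress-bar wrapper) and hands back an iterable over the same items.
IterWrapper = Callable[[Iterable[T]], Iterable[T]]


def identity(iterable: Iterable[T]) -> Iterable[T]:
    """Default wrapper: return the iterable unchanged."""
    return iterable


def process(items: Iterable[int], iter_wrapper: IterWrapper = identity) -> int:
    """Hypothetical caller that accepts an optional wrapper."""
    return sum(iter_wrapper(items))
```

Callers that do not need progress reporting can rely on the `identity` default, so no `None` check is needed inside the loop.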
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `open_or_rebuild_index` now calls `index_dir.mkdir(parents=True, exist_ok=True)`
so a missing index directory is created on demand rather than crashing on
`iterdir()` inside `wipe_index`
- `TestTagHierarchy.setUp` calls `super().setUp()` so `DirectoriesMixin` runs
and `self.dirs` is set before teardown tries to clean up
- `test_search_more_like` d4 content changed to words with no overlap with d2/d3
to avoid spurious MLT hits from shared stop words at `min_doc_frequency=1`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add comprehensive docstrings to all public methods and classes in the search package
- Clarify purpose, parameters, return values, and implementation notes
- Document thread safety, error handling, and usage patterns
- Explain Tantivy-specific workarounds and design decisions
- Improve test quality and pytest compliance
- Add descriptive comments explaining what each test verifies
- Convert TestIndexOptimize to pytest style with @pytest.mark.django_db
- Ensure all test docstrings focus on behavior verification rather than implementation
- Maintain existing functionality while improving code documentation
- No changes to production logic or test coverage
- All tests continue to pass with enhanced clarity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests that create or consume documents trigger the search index signal handler,
which calls get_backend().add_or_update() against settings.INDEX_DIR. This
class only inherited SampleDirMixin, leaving INDEX_DIR pointing at the default
non-existent path and causing FileNotFoundError in CI.
Added _search_index fixture to documents/tests/conftest.py: creates a temp
index directory, overrides INDEX_DIR, and resets the backend singleton.
Applied via @pytest.mark.usefixtures on the class.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
document.notes.count() bypasses the prefetch cache and hits the DB on every
document during rebuild. Counting in the existing loop eliminates the query
entirely.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove duplicated list comprehension in search sort branches
- Simplify WriteBatch.__exit__ by removing redundant else/pass block
- Fix rebuild() to swap index once before loop instead of per-document
- Add error recovery in rebuild() to restore old index on failure
- Remove redundant re-import of register_tokenizers in rebuild()
- Use tuple unpacking in autocomplete hit iteration
- Collect tag names in single pass for autocomplete text sources
- Use lazy % formatting in logger.debug instead of f-string
- Remove redundant score list variable in normalization
- Fix stale "NLTK stopword filtering" comment (NLTK was removed)
- Remove obvious inline comments that restate the code
- Align index_optimize task message with management command wording
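
The two rebuild() items above (swap once before the loop, restore the old index on failure) can be sketched as follows. This is an illustration under assumptions, not the actual backend code: `SketchBackend` is a hypothetical stand-in and a plain list plays the role of the index.

```python
class SketchBackend:
    """Hypothetical stand-in for the search backend; a list plays the index."""

    def __init__(self) -> None:
        self.index: list = []

    def rebuild(self, documents: list) -> None:
        old_index = self.index
        # Swap to a fresh index once, before the loop, not once per document.
        self.index = []
        try:
            for doc in documents:
                if not doc:
                    raise ValueError("cannot index an empty document")
                self.index.append(doc)
        except Exception:
            # Restore the previous index so a failed rebuild loses nothing.
            self.index = old_index
            raise
```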
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rename _needs_rebuild -> needs_rebuild and export from documents.search
- document_index command imports directly from documents.search, constructs
the queryset and calls get_backend().rebuild() inline — no tasks.py indirection
- Optimize subcommand logs deprecation directly; no longer calls index_optimize
- Remove index_reindex from tasks.py
- Convert TestMakeIndex to pytest class (no TestCase); use mocker fixtures
- Simplify TestIndexReindex -> TestIndexOptimize (wrapper test removed)
Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds notes on word order, accent insensitivity, and separator-agnostic
matching to the intro, then new subsections covering the
custom_fields.name/value query syntax (with tokenization examples and a
limitation note for custom date fields) and notes.user/notes.note.
Also prefetch document versions during index_reindex.
Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Equal-frequency words were non-deterministically ordered; sort key is
now (-count, word) so ties resolve alphabetically.
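
A small sketch of the tie-breaking sort key, assuming the word counts live in a `Counter`; the function name `rank_words` is hypothetical.

```python
from collections import Counter


def rank_words(words):
    """Order words by descending count, then alphabetically on ties."""
    counts = Counter(words)
    return sorted(counts, key=lambda word: (-counts[word], word))
```

With the input `["tax", "invoice", "tax", "bank", "invoice"]`, both `tax` and `invoice` occur twice, so the alphabetical tie-break puts `invoice` first on every run.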
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add --if-needed flag to `document_index reindex`: checks _needs_rebuild()
(schema version + language sentinels) and skips if index is up to date.
Safe to run on every upgrade or startup.
Simplify Docker init-search-index script to unconditionally call
`reindex --if-needed` — schema/language change detection is now fully
delegated to Python. Removes the bash index_version and language file
tracking entirely; Tantivy's own sentinels are the source of truth.
Update docs: bare metal upgrade step uses --if-needed; Docker note
updated to describe the new always-runs-but-skips-if-current behaviour.
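
The `--if-needed` check can be sketched roughly as below. The sentinel file names, the `SCHEMA_VERSION` value, and the `reindex` helper are assumptions made for illustration; the real sentinels are managed by the search package.

```python
from pathlib import Path
from typing import Optional

SCHEMA_VERSION = 1  # hypothetical value


def needs_rebuild(index_dir: Path, language: Optional[str]) -> bool:
    """Return True when a sentinel is missing or does not match."""
    version_file = index_dir / ".schema_version"    # hypothetical name
    language_file = index_dir / ".schema_language"  # hypothetical name
    if not version_file.exists() or not language_file.exists():
        return True
    if version_file.read_text().strip() != str(SCHEMA_VERSION):
        return True
    # None (no stemming) is compared as an empty string on disk.
    return language_file.read_text().strip() != (language or "")


def reindex(index_dir: Path, language: Optional[str], *, if_needed: bool = False) -> bool:
    """Return True if a rebuild ran, False if it was skipped."""
    if if_needed and not needs_rebuild(index_dir, language):
        return False
    index_dir.mkdir(parents=True, exist_ok=True)
    # ... the actual index rebuild would happen here ...
    (index_dir / ".schema_version").write_text(str(SCHEMA_VERSION))
    (index_dir / ".schema_language").write_text(language or "")
    return True
```

Because the skip decision depends only on the sentinels, calling this on every startup costs two small file reads when nothing has changed.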
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Reset index_version to 1 — Tantivy is a full format change so
versioning restarts from scratch; all existing v9 installs trigger
an automatic reindex on next container start
- Add PAPERLESS_SEARCH_LANGUAGE change detection: track raw env var in
.index_language so changing the language setting auto-reindexes;
raw env var (not resolved language) avoids false positives from
OCR_LANGUAGE inference
- docs/administration.md: clarify that Docker handles the post-upgrade
reindex automatically; bare metal users need to run
document_index reindex manually; add that as step 4 in the
bare metal upgrade guide
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SEARCH_LANGUAGE is now str | None (None = no stemming, not "")
- When PAPERLESS_SEARCH_LANGUAGE is set, validate it against
SUPPORTED_LANGUAGES via get_choice_from_env (startup error on bad value)
- When not set, infer from OCR_LANGUAGE's primary Tesseract code
(eng→en, deu→de, fra→fr, etc.) covering all 18 Tantivy-supported languages
- _schema.py sentinel normalises None → "" for on-disk comparison
- _tokenizer.py type annotations updated to str | None
- docs: recommend ISO 639-1 two-letter codes; note that capitalized
Tantivy enum names are not valid; link to Tantivy Language enum
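
A rough sketch of the resolution order described above. The mapping shown is a partial, illustrative subset, `resolve_search_language` is a hypothetical name, and a plain `ValueError` stands in for the validation that `get_choice_from_env` performs at startup. Taking the first `+`-separated entry of OCR_LANGUAGE as the primary code is also an assumption.

```python
from typing import Optional

# Partial, illustrative subset of the Tesseract-to-ISO 639-1 mapping.
TESSERACT_TO_ISO = {"eng": "en", "deu": "de", "fra": "fr", "spa": "es", "ita": "it"}
SUPPORTED_LANGUAGES = set(TESSERACT_TO_ISO.values())


def resolve_search_language(search_language: Optional[str], ocr_language: str) -> Optional[str]:
    """Explicit setting wins; otherwise infer from the primary OCR language."""
    if search_language is not None:
        if search_language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported search language: {search_language!r}")
        return search_language
    # Assumed: the primary code is the first entry of e.g. "deu+eng".
    primary = ocr_language.split("+")[0].strip().lower()
    # None means no stemming.
    return TESSERACT_TO_ISO.get(primary)
```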
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace set-based alphabetical autocomplete with Counter-based
document-frequency ordering. Words appearing in more of the user's
visible documents rank first — the same signal Whoosh used for
Tf/Idf-based ordering, computed permission-correctly from already-
fetched stored values at no extra index cost.
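
One way to picture document-frequency ordering, as a sketch only: `visible_texts` stands for the stored values of documents the user may see, the function name and the `\w+` splitting are assumptions, and ties are broken alphabetically to keep the output deterministic.

```python
import re
from collections import Counter


def autocomplete(prefix, visible_texts, limit=10):
    """Rank words matching the prefix by how many documents contain them."""
    counts = Counter()
    for text in visible_texts:
        # A set per document, so each word counts once per document.
        words = set(re.findall(r"\w+", text.lower()))
        counts.update(w for w in words if w.startswith(prefix))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked][:limit]
```

Since only already-permitted documents are passed in, the ranking never leaks words from documents the user cannot view.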
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fold the tests for the `[-N unit to now]` range, `field:YYYYMMDD` (with
timezone-aware DateField vs DateTimeField logic), and the parse_user_query
fuzzy path into the renamed TestWhooshQueryRewriting class.
TestParseUserQuery covers the full pipeline, including fuzzy blend mode.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TantivyBackend now uses open()/close()/_ensure_open() instead of __enter__/__exit__.
get_backend() tracks _backend_path and auto-reinitializes when settings.INDEX_DIR
changes, fixing the xdist/override_settings isolation bug where parallel workers
would share a stale singleton pointing at a deleted index directory.
Test fixtures use in-memory indices (path=None) for speed and isolation.
Singleton behavior covered by TestSingleton in test_backend.py.
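
The path-tracking singleton can be sketched as below. `StubBackend` and the `settings` namespace are hypothetical stand-ins for TantivyBackend and Django settings; only the `_backend_path` comparison reflects the behaviour described above.

```python
from types import SimpleNamespace

# Hypothetical stand-in for django.conf.settings.
settings = SimpleNamespace(INDEX_DIR="/tmp/index-a")


class StubBackend:
    """Hypothetical stand-in exposing the open()/close() lifecycle."""

    def __init__(self, path):
        self.path = path
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False


_backend = None
_backend_path = None


def get_backend():
    """Return the singleton, re-creating it when INDEX_DIR has changed."""
    global _backend, _backend_path
    current = settings.INDEX_DIR
    if _backend is None or _backend_path != current:
        if _backend is not None:
            _backend.close()
        _backend = StubBackend(current)
        _backend.open()
        _backend_path = current
    return _backend
```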
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace all `from documents import index` + Whoosh writer usage across
admin.py, bulk_edit.py, tasks.py, views.py, signals/handlers.py with
`get_backend().add_or_update/remove/batch_update`
- Add `effective_content` param to `_build_tantivy_doc` / `add_or_update`
(used by signal handler to re-index root doc with version's OCR text)
- Add `wipe_index()` (renamed from `_wipe_index`) to public API; use from
`document_index --recreate` flag
- `index_optimize()` replaced with deprecation log message; Tantivy
manages segment merging automatically
- `index_reindex()` now calls `get_backend().rebuild()` + `reset_backend()`
with select_related/prefetch_related for efficiency
- `document_index` management command: add `--recreate` flag
- Status view: use `get_backend()` + dir mtime scan instead of Whoosh
`ix.last_modified()`
- Delete `documents/index.py`, `test_index.py`, `test_delayedquery.py`
- Update all tests: patch `documents.search.get_backend` (lazy imports);
`DirectoriesMixin` calls `reset_backend()` in setUp/tearDown;
`TestDocumentConsumptionFinishedSignal` likewise
- `test_api_search.py`: fix order-independent assertions for date-range
queries; fix `_rewrite_8digit_date` to be field-aware and
timezone-correct for DateTimeField vs DateField
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NLTK was inappropriate here. Autocomplete needs no stopword filtering (users
should be able to autocomplete any word) and no length floor, and a
Unicode-aware \w+ pattern splits consistently with Tantivy's simple tokenizer.
The regex library (already a project dependency) is used for ReDoS protection
via its per-call timeout.
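
A minimal sketch of `\w+` word extraction. Note the swap: it uses the standard-library `re` module so it runs without extra dependencies, which means it has no per-call timeout; the commit itself relies on the third-party regex library for that. The function name is hypothetical.

```python
import re

# For str patterns in Python 3, \w is Unicode-aware by default.
_WORD_RE = re.compile(r"\w+")


def extract_words(text):
    """Split text into lowercase words with no stopword or length filtering."""
    return [word.lower() for word in _WORD_RE.findall(text)]
```

Short words and stop words such as "a" and "the" survive, which is the point: any word in a document remains a valid autocomplete candidate.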
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add descriptive docstrings to all functions in _schema.py, _tokenizer.py, and _query.py
- Complete type annotations for all function parameters and return values
- Fix 8 mypy strict errors in _query.py:
  - Add re.Match[str] type parameters for regex matches
  - Fix "Returning Any" error with str() cast
  - Add type annotations for build_permission_filter() and parse_user_query()
- Remove lazy imports, move to module top level
- All 29 search module tests continue to pass
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tantivy requires register_fast_field_tokenizer for any tokenizer used by
fast=True text fields — it writes default fast column values on every commit
even when a document omits those fields, raising ValueError otherwise.
perm_index fixture simplified to use in-memory index (path=None).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement query normalization and permission filtering for Tantivy search:
- normalize_query: expands comma-separated field values with AND operator
- build_permission_filter: security-critical permission filtering for documents
  - no owner (NULL in Django) → documents without owner_id field
  - owned by user → owner_id = user.pk
  - shared with user → viewer_id = user.pk
  - uses disjunction_max_query for proper OR semantics
  - workaround for tantivy-py unsigned type detection bug via range_query
- parse_user_query: full pipeline with fuzzy search support
- DEFAULT_SEARCH_FIELDS and boost configuration
Note: the permission filter tests require a Tantivy environment setup;
the core functionality is implemented and the normalize tests pass.
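
The comma expansion done by normalize_query might look like the sketch below. The regular expression and the wrapping parentheses are assumptions for illustration; the commit only states that comma-separated field values are expanded with the AND operator.

```python
import re

# field:value1,value2[,...] where values contain no whitespace or commas.
_FIELD_LIST_RE = re.compile(r"([\w.]+):([^\s,]+(?:,[^\s,]+)+)")


def normalize_query(query):
    """Expand field:a,b into (field:a AND field:b)."""

    def expand(match):
        field = match.group(1)
        values = match.group(2).split(",")
        # Parentheses (an assumption) keep the group separate from other terms.
        return "(" + " AND ".join(f"{field}:{value}" for value in values) + ")"

    return _FIELD_LIST_RE.sub(expand, query)
```

Queries without a comma-separated value list pass through unchanged.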
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement date/timezone boundary math for natural language date queries:
- `created` (DateField): local calendar date to UTC midnight boundaries
- `added`/`modified` (DateTimeField): local day boundaries with full offset arithmetic
- Whoosh compat shims: compact dates (YYYYMMDDHHmmss) → ISO 8601
- Relative ranges: `[now-7d TO now]` → concrete ISO timestamps
- Natural keywords: today, yesterday, this_week, last_week, etc.
- Timezone-aware: handles UTC offset arithmetic for datetime fields
- Passthrough: bare keywords without field prefixes unchanged
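
The difference between the two field types can be sketched as follows. The function names and the half-open `[start, end)` convention are assumptions; a fixed UTC offset is used in the example so the sketch does not depend on a timezone database.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def date_field_bounds(day: date):
    """DateField: map a calendar date to UTC midnight boundaries."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def datetime_field_bounds(day: date, tz: tzinfo):
    """DateTimeField: take local-day boundaries, then convert them to UTC."""
    local_start = datetime.combine(day, time.min, tzinfo=tz)
    local_end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz)
    return local_start.astimezone(timezone.utc), local_end.astimezone(timezone.utc)
```

For a user at UTC+2, 15 March 2024 as a DateField covers 00:00 to 24:00 UTC on that date, while as a DateTimeField it covers 22:00 UTC on 14 March to 22:00 UTC on 15 March.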
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Module-level whoosh imports in tasks.py and paperless/views.py prevented
test collection after removing whoosh-reloaded. Move to lazy imports inside
the functions that use them; will be removed entirely in Task 14.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add two new environment variables for Tantivy search backend:
- PAPERLESS_SEARCH_LANGUAGE: language code for stemming (empty string disables)
- PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD: float threshold for fuzzy search blending (None disables)
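
Reading the two variables could look like this sketch. The helper name and the behaviour when a variable is unset are assumptions; the commit only states that an empty language disables stemming and that None disables fuzzy blending.

```python
import os


def read_search_settings(env=os.environ):
    """Return (language, fuzzy_threshold) from the environment mapping."""
    # Assumed default: an unset language behaves like the empty string.
    language = env.get("PAPERLESS_SEARCH_LANGUAGE", "")
    raw = env.get("PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD")
    # Assumed: unset or empty means None, which disables fuzzy blending.
    threshold = float(raw) if raw not in (None, "") else None
    return language, threshold
```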
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit implements Task 3 of the Tantivy search backend migration:
- Add `src/documents/search/_tokenizer.py` with three custom tokenizers:
  - `paperless_text`: simple → remove_long(65) → lowercase → ascii_fold [→ stemmer]
    Supports 18 languages via Snowball stemmer with fallback warning for unsupported languages
  - `simple_analyzer`: simple → lowercase → ascii_fold (for shadow sort fields)
  - `bigram_analyzer`: ngram(2,2) → lowercase (for CJK/no-whitespace language support)
- Add comprehensive tests in `src/documents/tests/search/test_tokenizer.py`:
  - ASCII folding test: verifies "café résumé" is findable as "cafe resume"
  - Bigram CJK test: verifies "東京都" is searchable by substring "東京"
  - Warning test: verifies unsupported languages log appropriate warnings
- `register_tokenizers()` function must be called on every Index instance
as tantivy requires re-registration at each open
- Language support includes common ISO 639-1 codes and full names:
Arabic, Danish, Dutch, English, Finnish, French, German, Greek,
Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian,
Spanish, Swedish, Tamil, Turkish
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Tests: add regression test for redis URL with empty username and password
Covers the unix://:SECRET@/path.sock format (empty username, password only),
which was missing from the existing test cases for PR #12239.
* Update src/paperless/tests/settings/test_custom_parsers.py
---------
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>