The origin/dev branch added include_selection_data support to the search
list() view (PR #12300). Our Tantivy list() had replaced the Whoosh
implementation entirely, causing a conflict.
Resolution: keep the Tantivy implementation and incorporate the
include_selection_data feature. When requested, selection_data is computed
over all matching document IDs from ordered_hits (the full Tantivy result
set, not just the current page).
Also update test_search_with_include_selection_data from #12300 to use
the Tantivy indexing API (get_backend().add_or_update) instead of the
removed Whoosh AsyncWriter.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When open_or_rebuild_index is called and the index directory does not exist,
return a fresh in-memory Tantivy index instead of creating the directory as
a side effect. This prevents workspace contamination during test runs where
INDEX_DIR has not been redirected to a temp directory.
In production the data directory is always created during setup, so
disk-based indexes continue to work normally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move the duplicated `IterWrapper` type alias and `identity` function from
tasks.py, _backend.py, sanity_checker.py, and paperless_ai/indexing.py into
a single location in documents/utils.py. All four callers now import from
there.
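As a rough illustration of what the consolidated helpers might look like, here is a minimal sketch. The exact definition of `IterWrapper` is an assumption (a callable that takes an iterable and returns an iterable, such as a progress-bar wrapper), and `process` is a hypothetical caller invented for the example.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

# Assumed shape of the alias: any callable that wraps an iterable and
# yields the same items (for example, a progress bar).
IterWrapper = Callable[[Iterable[T]], Iterable[T]]


def identity(iterable: Iterable[T]) -> Iterable[T]:
    """Default wrapper: hand the iterable back unchanged."""
    return iterable


def process(items: Iterable[int], iter_wrapper: IterWrapper = identity) -> list[int]:
    """Hypothetical caller: doubles each item, iterating through the wrapper."""
    return [item * 2 for item in iter_wrapper(items)]
```

Callers that do not need progress reporting can rely on the `identity` default, while interactive commands can pass their own wrapper.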
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `open_or_rebuild_index` now calls `index_dir.mkdir(parents=True, exist_ok=True)`
so a missing index directory is created on demand rather than crashing on
`iterdir()` inside `wipe_index`
- `TestTagHierarchy.setUp` calls `super().setUp()` so `DirectoriesMixin` runs
and `self.dirs` is set before teardown tries to clean up
- `test_search_more_like` d4 content changed to words with no overlap with d2/d3
to avoid spurious MLT hits from shared stop words at `min_doc_frequency=1`
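The first bullet relies on standard `pathlib` behaviour: `iterdir()` raises `FileNotFoundError` on a missing directory, whereas `mkdir(parents=True, exist_ok=True)` is safe to call repeatedly. A minimal sketch, where `ensure_and_list` is a hypothetical helper and not the project's function:

```python
from pathlib import Path


def ensure_and_list(index_dir: Path) -> list[Path]:
    """Hypothetical helper: create the directory on demand, then list it."""
    # Creates intermediate directories and does nothing if it already exists.
    index_dir.mkdir(parents=True, exist_ok=True)
    # Safe now: iterdir() would raise FileNotFoundError on a missing path.
    return sorted(index_dir.iterdir())
```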
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add comprehensive docstrings to all public methods and classes in the search package
  - Clarify purpose, parameters, return values, and implementation notes
  - Document thread safety, error handling, and usage patterns
  - Explain Tantivy-specific workarounds and design decisions
- Improve test quality and pytest compliance
  - Add descriptive comments explaining what each test verifies
  - Convert TestIndexOptimize to pytest style with @pytest.mark.django_db
  - Ensure all test docstrings focus on behavior verification rather than implementation
- Maintain existing functionality while improving code documentation
  - No changes to production logic or test coverage
  - All tests continue to pass with enhanced clarity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests that create or consume documents trigger the search index signal handler,
which calls get_backend().add_or_update() against settings.INDEX_DIR. This
class only inherited SampleDirMixin, leaving INDEX_DIR pointing at the default
non-existent path and causing FileNotFoundError in CI.
Added _search_index fixture to documents/tests/conftest.py: creates a temp
index directory, overrides INDEX_DIR, and resets the backend singleton.
Applied via @pytest.mark.usefixtures on the class.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
document.notes.count() bypasses the prefetch cache and hits the DB on every
document during rebuild. Counting in the existing loop eliminates the query
entirely.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove duplicated list comprehension in search sort branches
- Simplify WriteBatch.__exit__ by removing redundant else/pass block
- Fix rebuild() to swap index once before loop instead of per-document
- Add error recovery in rebuild() to restore old index on failure
- Remove redundant re-import of register_tokenizers in rebuild()
- Use tuple unpacking in autocomplete hit iteration
- Collect tag names in single pass for autocomplete text sources
- Use lazy % formatting in logger.debug instead of f-string
- Remove redundant score list variable in normalization
- Fix stale "NLTK stopword filtering" comment (NLTK was removed)
- Remove obvious inline comments that restate the code
- Align index_optimize task message with management command wording
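The rebuild-related bullets (swap once before the loop, restore the old index on failure, lazy `%` formatting in `logger.debug`) can be sketched roughly as follows. `SketchBackend` and its dict-based index are hypothetical stand-ins for illustration only; the real backend wraps a Tantivy index and its details are not shown here.

```python
import logging

logger = logging.getLogger("search.sketch")


class SketchBackend:
    """Hypothetical stand-in for the search backend, using a dict as the index."""

    def __init__(self) -> None:
        self.index: dict[int, str] = {}

    def rebuild(self, documents) -> None:
        old_index = self.index
        new_index: dict[int, str] = {}
        # Swap once, before the loop, rather than once per document.
        self.index = new_index
        try:
            for doc_id, text in documents:
                new_index[doc_id] = text
                # Lazy formatting: the message is only built if DEBUG is enabled.
                logger.debug("Indexed document %s", doc_id)
        except Exception:
            # Error recovery: put the previous index back, then re-raise.
            self.index = old_index
            raise
```

If iteration fails part-way through, callers keep seeing the previous index instead of a half-populated one.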
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rename _needs_rebuild -> needs_rebuild and export from documents.search
- document_index command imports directly from documents.search, constructs
the queryset and calls get_backend().rebuild() inline — no tasks.py indirection
- Optimize subcommand logs deprecation directly; no longer calls index_optimize
- Remove index_reindex from tasks.py
- Convert TestMakeIndex to pytest class (no TestCase); use mocker fixtures
- Simplify TestIndexReindex -> TestIndexOptimize (wrapper test removed)
Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds word-order, accent-insensitivity, and separator-agnostic notes to the
intro, then new subsections covering custom_fields.name/value query syntax
with tokenization examples and a limitation note for custom date fields, plus
a notes.user/notes.note subsection.
Also prefetch document versions during index_reindex.
Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Equal-frequency words were non-deterministically ordered; sort key is
now (-count, word) so ties resolve alphabetically.
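A minimal sketch of the tie-breaking sort, assuming the frequencies are held in a `collections.Counter`; `rank_words` is a hypothetical name for illustration:

```python
from collections import Counter


def rank_words(words: list[str]) -> list[str]:
    """Order words by descending frequency, breaking ties alphabetically."""
    counts = Counter(words)
    # Negating the count sorts high-to-low; the word itself breaks ties.
    return sorted(counts, key=lambda word: (-counts[word], word))
```

With this key, the ranking no longer depends on insertion order, so repeated runs over the same input produce the same result.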
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>