11316 Commits

Author SHA1 Message Date
shamoon
3539f3f66a Switch simple substring search to simple_search analyzer 2026-04-01 13:37:54 -07:00
shamoon
7c98d29de2 Fix e2e 2026-04-01 11:51:42 -07:00
shamoon
0b9f67fe68 Just moving these comments 2026-04-01 11:47:12 -07:00
shamoon
6a08244c52 Fix this one failing test 2026-04-01 11:41:10 -07:00
shamoon
935d75d457 Update all these uses of FILTER_TITLE_CONTENT 2026-04-01 11:24:24 -07:00
shamoon
af671397f5 Update the filter editor too 2026-04-01 11:23:24 -07:00
shamoon
66e4409242 Bring in the new filter type to frontend 2026-04-01 11:23:04 -07:00
shamoon
8756054778 Ok make it a proper filter type 2026-04-01 11:21:04 -07:00
shamoon
2cbbbaf170 Add a couple deprecation notes 2026-04-01 11:08:01 -07:00
shamoon
4b4f656fbb Drop the custom fields text query option, but don't break existing views 2026-04-01 11:01:54 -07:00
shamoon
54d2da2375 Backend tests 2026-04-01 10:59:11 -07:00
shamoon
71889e6e90 Use tantivy for global search too 2026-04-01 10:58:39 -07:00
shamoon
02fab43df9 Handle simple searches with frontend query param parsing 2026-04-01 10:38:08 -07:00
shamoon
9139507bd6 Wire the simple searches to view 2026-04-01 10:18:26 -07:00
shamoon
3d77e45c14 Add a simple title query 2026-04-01 10:13:13 -07:00
shamoon
631074e4ed Add simple text search mode and API param 2026-04-01 10:08:27 -07:00
Trenton H
64fe8546ca Custom field indexing wouldn't have matched exactly, also, index the select field label, not its ID (might break, don't want the VM) 2026-03-31 14:29:31 -07:00
Trenton H
edcadfcdc7 Merge branch 'dev' into feature-tantivy-search-backend 2026-03-31 12:07:39 -07:00
Trenton H
eac4a6ca05 Merge remote-tracking branch 'origin/dev' into feature-tantivy-search-backend
Hopefully the conflicts are good
2026-03-31 11:57:10 -07:00
GitHub Actions
2aa0c9f0b4 Auto translate strings 2026-03-31 18:25:03 +00:00
shamoon
d2328b776a Performance: support bulk edit without id lists (#12355) 2026-03-31 18:23:28 +00:00
Trenton H
c981fb26f7 Adds no cover on some defensive error handling, cover a few other cases more directly 2026-03-31 11:14:26 -07:00
Trenton H
9003bfdeea Fine, I'll spin up the VM here. Good this got tested 2026-03-31 09:19:57 -07:00
Trenton H
2bb7c7ae17 Chore: Document the parser plugin system (#12423)
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-31 09:16:43 -07:00
Trenton H
32111b00f5 Further search coverage which maybe works, we'll find out 2026-03-31 09:02:40 -07:00
Trenton H
4ddb27afc7 better typing through fixtures + comment 2026-03-31 08:23:33 -07:00
Trenton H
65b9d69ee6 Quick testing to cover most (everything?) in needs_rebuild 2026-03-31 08:11:11 -07:00
Trenton H
8272b98f4e Adds more coverage for the Whoosh re-writing to ISO format, covering standard and some odd cases like wrapping 2026-03-31 08:08:29 -07:00
Trenton H
25e905395c And the filelock one, this is defensive stuff I don't see value in 2026-03-31 08:07:15 -07:00
Trenton H
977d41f3aa Also no cover this defensive TimeoutError 2026-03-31 08:06:38 -07:00
Trenton H
1000c47d86 TimeoutError is builtin, not exported from regex 2026-03-31 08:02:58 -07:00
GitHub Actions
e1da2a1efe Auto translate strings 2026-03-31 14:57:34 +00:00
shamoon
245514ad10 Performance: deprecate and remove usage of all in API results (#12309) 2026-03-31 07:55:59 -07:00
Trenton H
1e1bba1a15 Improves typing a touch 2026-03-31 07:52:21 -07:00
Trenton H
97034e8ff6 Covers the reindex --recreate path and documents the flag 2026-03-31 07:50:44 -07:00
Trenton H
0b032fffeb Probably covers that branch, we'll find out 2026-03-31 07:33:53 -07:00
Trenton H
dd4bd8dd7e Updates fixture for sonar 2026-03-31 07:33:04 -07:00
Trenton H
881196183c iterdir doesn't need a list 2026-03-31 07:32:47 -07:00
Trenton H
f36ea803d1 Uses regex matching with a timeout for all Whoosh query re-writing 2026-03-31 07:18:14 -07:00
Trenton H
eaa23751de Merge: resolve conflict with include_selection_data from #12300
The origin/dev branch added include_selection_data support to the search
list() view (PR #12300). Our Tantivy list() had replaced the Whoosh
implementation entirely, causing a conflict.

Resolution: keep the Tantivy implementation and incorporate the
include_selection_data feature. When requested, selection_data is computed
over all matching document IDs from ordered_hits (the full Tantivy result
set, not just the current page).

Also update test_search_with_include_selection_data from #12300 to use
the Tantivy indexing API (get_backend().add_or_update) instead of the
removed Whoosh AsyncWriter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 17:08:39 -07:00
Trenton H
3cc78fe994 Fix: fall back to in-memory index when INDEX_DIR does not exist
When open_or_rebuild_index is called and the index directory does not exist,
return a fresh in-memory Tantivy index instead of creating the directory as
a side effect. This prevents workspace contamination during test runs where
INDEX_DIR has not been redirected to a temp directory.

In production the data directory is always created during setup, so disk-
based indexes continue to work normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 16:27:07 -07:00
Trenton H
9e8b5ddf08 Refactor: consolidate IterWrapper/identity into documents.utils
Move the duplicated `IterWrapper` type alias and `identity` function from
tasks.py, _backend.py, sanity_checker.py, and paperless_ai/indexing.py into
a single location in documents/utils.py. All four callers now import from
there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 16:26:49 -07:00
Trenton H
e4b63d61b9 Fix: ensure index dir exists before open, fix test isolation gaps
- `open_or_rebuild_index` now calls `index_dir.mkdir(parents=True, exist_ok=True)`
  so a missing index directory is created on demand rather than crashing on
  `iterdir()` inside `wipe_index`
- `TestTagHierarchy.setUp` calls `super().setUp()` so `DirectoriesMixin` runs
  and `self.dirs` is set before teardown tries to clean up
- `test_search_more_like` d4 content changed to words with no overlap with d2/d3
  to avoid spurious MLT hits from shared stop words at `min_doc_frequency=1`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 16:06:09 -07:00
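A minimal sketch of the first bullet in e4b63d61b9, assuming a plain directory of files stands in for the index. The function name `prepare_index_dir` and the file-deletion loop are hypothetical illustrations, not the project's `open_or_rebuild_index` or `wipe_index`; only the `mkdir(parents=True, exist_ok=True)` call is taken from the commit message.

```python
import tempfile
from pathlib import Path


def prepare_index_dir(index_dir: Path) -> int:
    """Hypothetical helper: ensure index_dir exists, then delete its files.

    Returns the number of files removed.
    """
    # Without this call, iterating a missing directory raises FileNotFoundError.
    index_dir.mkdir(parents=True, exist_ok=True)

    removed = 0
    for entry in index_dir.iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "data" / "index"  # neither level exists yet
    print(prepare_index_dir(missing))       # nothing to remove on first call
    (missing / "segment.bin").write_text("x")
    print(prepare_index_dir(missing))       # removes the one file just written
```

Because `exist_ok=True` makes the call idempotent, it is safe to run on every open rather than only during setup.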
Trenton H
e7f68c2082 docs: Enhance docstrings and test quality for Tantivy search backend
- Add comprehensive docstrings to all public methods and classes in the search package
  - Clarify purpose, parameters, return values, and implementation notes
  - Document thread safety, error handling, and usage patterns
  - Explain Tantivy-specific workarounds and design decisions

- Improve test quality and pytest compliance
  - Add descriptive comments explaining what each test verifies
  - Convert TestIndexOptimize to pytest style with @pytest.mark.django_db
  - Ensure all test docstrings focus on behavior verification rather than implementation

- Maintain existing functionality while improving code documentation
  - No changes to production logic or test coverage
  - All tests continue to pass with enhanced clarity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 15:54:18 -07:00
Trenton H
12eb9b9abf Fix: add _search_index fixture to TestDateWorkflowLocalization
Tests that create or consume documents trigger the search index signal handler,
which calls get_backend().add_or_update() against settings.INDEX_DIR. This
class only inherited SampleDirMixin, leaving INDEX_DIR pointing at the default
non-existent path and causing FileNotFoundError in CI.

Added _search_index fixture to documents/tests/conftest.py: creates a temp
index directory, overrides INDEX_DIR, and resets the backend singleton.
Applied via @pytest.mark.usefixtures on the class.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 15:36:03 -07:00
Trenton H
ac03a3d609 Fix: count notes during iteration instead of issuing extra COUNT(*) query
document.notes.count() bypasses the prefetch cache and hits the DB on every
document during rebuild. Counting in the existing loop eliminates the query
entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 15:14:15 -07:00
Trenton H
7d1af2e215 Refactor: simplify Tantivy search backend for clarity and consistency
- Remove duplicated list comprehension in search sort branches
- Simplify WriteBatch.__exit__ by removing redundant else/pass block
- Fix rebuild() to swap index once before loop instead of per-document
- Add error recovery in rebuild() to restore old index on failure
- Remove redundant re-import of register_tokenizers in rebuild()
- Use tuple unpacking in autocomplete hit iteration
- Collect tag names in single pass for autocomplete text sources
- Use lazy % formatting in logger.debug instead of f-string
- Remove redundant score list variable in normalization
- Fix stale "NLTK stopword filtering" comment (NLTK was removed)
- Remove obvious inline comments that restate the code
- Align index_optimize task message with management command wording

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 15:03:55 -07:00
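The "lazy % formatting" bullet in 7d1af2e215 refers to standard `logging` behavior: arguments passed separately are only formatted if the record is actually emitted, while an f-string is formatted before the call. A small sketch, where the `Expensive` class and the logger name are hypothetical and exist only to count formatting calls:

```python
import logging

logger = logging.getLogger("search_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)                # DEBUG records are discarded


class Expensive:
    """Hypothetical object that counts how often it is rendered to text."""

    def __init__(self) -> None:
        self.calls = 0

    def __str__(self) -> str:
        self.calls += 1
        return "expensive-repr"


lazy = Expensive()
eager = Expensive()

# Lazy: the logger checks the level first and never formats the argument.
logger.debug("query=%s", lazy)

# Eager: the f-string is built before logger.debug is even called.
logger.debug(f"query={eager}")

print(lazy.calls, eager.calls)
```

With DEBUG disabled, only the f-string variant pays the formatting cost.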
Trenton H
061099b064 Refactor: inline index_reindex into management command; promote needs_rebuild to public API
- Rename _needs_rebuild -> needs_rebuild and export from documents.search
- document_index command imports directly from documents.search, constructs
  the queryset and calls get_backend().rebuild() inline — no tasks.py indirection
- Optimize subcommand logs deprecation directly; no longer calls index_optimize
- Remove index_reindex from tasks.py
- Convert TestMakeIndex to pytest class (no TestCase); use mocker fixtures
- Simplify TestIndexReindex -> TestIndexOptimize (wrapper test removed)

Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 14:41:25 -07:00
Trenton H
6699679c29 Docs: expand search section with custom field, notes, and tokenization examples
Adds word-order, accent-insensitivity, and separator-agnostic notes to the
intro, then new subsections covering custom_fields.name/value query syntax
with tokenization examples and a limitation note for custom date fields, plus
a notes.user/notes.note subsection.

Also prefetch document versions during index_reindex.

Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 14:28:46 -07:00
Trenton H
8107b7d209 Fix: break autocomplete frequency ties alphabetically for stable output
Equal-frequency words were non-deterministically ordered; sort key is
now (-count, word) so ties resolve alphabetically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 14:13:10 -07:00
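The `(-count, word)` sort key from 8107b7d209 can be illustrated with a short sketch. The function name `rank_words` and the sample words are hypothetical; the real autocomplete code is not shown in this log.

```python
from collections import Counter


def rank_words(words: list[str], limit: int = 10) -> list[str]:
    """Return the most frequent words, breaking ties alphabetically."""
    counts = Counter(words)
    # Negating the count sorts by descending frequency; the word itself
    # is the secondary key, so equal counts resolve in alphabetical order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


print(rank_words(["invoice", "bank", "invoice", "bank", "tax"]))
# → ['bank', 'invoice', 'tax']
```

Here "bank" and "invoice" both occur twice, so the alphabetical tie-break puts "bank" first regardless of input order.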