paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-04-01 13:52:43 +00:00

Author	SHA1	Message	Date
Trenton H	0b032fffeb	Probably covers that branch, we'll find out	2026-03-31 07:33:53 -07:00
Trenton H	dd4bd8dd7e	Updates fixture for sonar	2026-03-31 07:33:04 -07:00
Trenton H	881196183c	iterdir doesn't need a list	2026-03-31 07:32:47 -07:00
Trenton H	f36ea803d1	Uses regex matching with a timeout for all Whoosh query re-writing	2026-03-31 07:18:14 -07:00
Trenton H	eaa23751de	Merge: resolve conflict with include_selection_data from #12300 The origin/dev branch added include_selection_data support to the search list() view (PR #12300). Our Tantivy list() had replaced the Whoosh implementation entirely, causing a conflict. Resolution: keep the Tantivy implementation and incorporate the include_selection_data feature. When requested, selection_data is computed over all matching document IDs from ordered_hits (the full Tantivy result set, not just the current page). Also update test_search_with_include_selection_data from #12300 to use the Tantivy indexing API (get_backend().add_or_update) instead of the removed Whoosh AsyncWriter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 17:08:39 -07:00
Trenton H	3cc78fe994	Fix: fall back to in-memory index when INDEX_DIR does not exist When open_or_rebuild_index is called and the index directory does not exist, return a fresh in-memory Tantivy index instead of creating the directory as a side effect. This prevents workspace contamination during test runs where INDEX_DIR has not been redirected to a temp directory. In production the data directory is always created during setup, so disk-based indexes continue to work normally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 16:27:07 -07:00
Trenton H	9e8b5ddf08	Refactor: consolidate IterWrapper/identity into documents.utils Move the duplicated `IterWrapper` type alias and `identity` function from tasks.py, _backend.py, sanity_checker.py, and paperless_ai/indexing.py into a single location in documents/utils.py. All four callers now import from there. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 16:26:49 -07:00
Trenton H	e4b63d61b9	Fix: ensure index dir exists before open, fix test isolation gaps - `open_or_rebuild_index` now calls `index_dir.mkdir(parents=True, exist_ok=True)` so a missing index directory is created on demand rather than crashing on `iterdir()` inside `wipe_index` - `TestTagHierarchy.setUp` calls `super().setUp()` so `DirectoriesMixin` runs and `self.dirs` is set before teardown tries to clean up - `test_search_more_like` d4 content changed to words with no overlap with d2/d3 to avoid spurious MLT hits from shared stop words at `min_doc_frequency=1` Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 16:06:09 -07:00
Trenton H	e7f68c2082	docs: Enhance docstrings and test quality for Tantivy search backend - Add comprehensive docstrings to all public methods and classes in the search package - Clarify purpose, parameters, return values, and implementation notes - Document thread safety, error handling, and usage patterns - Explain Tantivy-specific workarounds and design decisions - Improve test quality and pytest compliance - Add descriptive comments explaining what each test verifies - Convert TestIndexOptimize to pytest style with @pytest.mark.django_db - Ensure all test docstrings focus on behavior verification rather than implementation - Maintain existing functionality while improving code documentation - No changes to production logic or test coverage - All tests continue to pass with enhanced clarity Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 15:54:18 -07:00
Trenton H	12eb9b9abf	Fix: add _search_index fixture to TestDateWorkflowLocalization Tests that create or consume documents trigger the search index signal handler, which calls get_backend().add_or_update() against settings.INDEX_DIR. This class only inherited SampleDirMixin, leaving INDEX_DIR pointing at the default non-existent path and causing FileNotFoundError in CI. Added _search_index fixture to documents/tests/conftest.py: creates a temp index directory, overrides INDEX_DIR, and resets the backend singleton. Applied via @pytest.mark.usefixtures on the class. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 15:36:03 -07:00
Trenton H	ac03a3d609	Fix: count notes during iteration instead of issuing extra COUNT(*) query document.notes.count() bypasses the prefetch cache and hits the DB on every document during rebuild. Counting in the existing loop eliminates the query entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 15:14:15 -07:00
Trenton H	7d1af2e215	Refactor: simplify Tantivy search backend for clarity and consistency - Remove duplicated list comprehension in search sort branches - Simplify WriteBatch.__exit__ by removing redundant else/pass block - Fix rebuild() to swap index once before loop instead of per-document - Add error recovery in rebuild() to restore old index on failure - Remove redundant re-import of register_tokenizers in rebuild() - Use tuple unpacking in autocomplete hit iteration - Collect tag names in single pass for autocomplete text sources - Use lazy % formatting in logger.debug instead of f-string - Remove redundant score list variable in normalization - Fix stale "NLTK stopword filtering" comment (NLTK was removed) - Remove obvious inline comments that restate the code - Align index_optimize task message with management command wording Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 15:03:55 -07:00
Trenton H	061099b064	Refactor: inline index_reindex into management command; promote needs_rebuild to public API - Rename _needs_rebuild -> needs_rebuild and export from documents.search - document_index command imports directly from documents.search, constructs the queryset and calls get_backend().rebuild() inline — no tasks.py indirection - Optimize subcommand logs deprecation directly; no longer calls index_optimize - Remove index_reindex from tasks.py - Convert TestMakeIndex to pytest class (no TestCase); use mocker fixtures - Simplify TestIndexReindex -> TestIndexOptimize (wrapper test removed) Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 14:41:25 -07:00
Trenton H	6699679c29	Docs: expand search section with custom field, notes, and tokenization examples Adds word-order, accent-insensitivity, and separator-agnostic notes to the intro, then new subsections covering custom_fields.name/value query syntax with tokenization examples and a limitation note for custom date fields, plus a notes.user/notes.note subsection. Also prefetch document versions during index_reindex. Co-Authored-By: Antoine Mérino <3023499+Merinorus@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 14:28:46 -07:00
Trenton H	8107b7d209	Fix: break autocomplete frequency ties alphabetically for stable output Equal-frequency words were non-deterministically ordered; sort key is now (-count, word) so ties resolve alphabetically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 14:13:10 -07:00
Trenton H	897f7d2199	Tests: cover document_index reindex --if-needed flag Two cases: skips when _needs_rebuild returns False; runs when True. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 14:11:07 -07:00
Trenton H	da3ff7865e	Enhancement: document_index reindex --if-needed; simplify Docker startup Add --if-needed flag to `document_index reindex`: checks _needs_rebuild() (schema version + language sentinels) and skips if index is up to date. Safe to run on every upgrade or startup. Simplify Docker init-search-index script to unconditionally call `reindex --if-needed` — schema/language change detection is now fully delegated to Python. Removes the bash index_version and language file tracking entirely; Tantivy's own sentinels are the source of truth. Update docs: bare metal upgrade step uses --if-needed; Docker note updated to describe the new always-runs-but-skips-if-current behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 14:08:49 -07:00
Trenton H	ae494d4b6a	Chore: bump Docker index version to 1 (Tantivy); add language change detection - Reset index_version to 1 — Tantivy is a full format change so versioning restarts from scratch; all existing v9 installs trigger an automatic reindex on next container start - Add PAPERLESS_SEARCH_LANGUAGE change detection: track raw env var in .index_language so changing the language setting auto-reindexes; raw env var (not resolved language) avoids false positives from OCR_LANGUAGE inference - docs/administration.md: clarify that Docker handles the post-upgrade reindex automatically; bare metal users need to run document_index reindex manually; add that as step 4 in the bare metal upgrade guide Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 13:55:20 -07:00
Trenton H	fdf08bdc43	Enhancement: infer SEARCH_LANGUAGE from OCR_LANGUAGE; validate if explicit - SEARCH_LANGUAGE is now str | None (None = no stemming, not "") - When PAPERLESS_SEARCH_LANGUAGE is set, validate it against SUPPORTED_LANGUAGES via get_choice_from_env (startup error on bad value) - When not set, infer from OCR_LANGUAGE's primary Tesseract code (eng→en, deu→de, fra→fr, etc.) covering all 18 Tantivy-supported languages - _schema.py sentinel normalises None → "" for on-disk comparison - _tokenizer.py type annotations updated to str | None - docs: recommend ISO 639-1 two-letter codes; note that capitalized Tantivy enum names are not valid; link to Tantivy Language enum Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 13:37:34 -07:00
Trenton H	b10f3de2eb	Enhancement: rank autocomplete suggestions by document frequency Replace set-based alphabetical autocomplete with Counter-based document-frequency ordering. Words appearing in more of the user's visible documents rank first — the same signal Whoosh used for Tf/Idf-based ordering, computed permission-correctly from already-fetched stored values at no extra index cost. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 13:25:56 -07:00
Trenton H	b626f5602c	Docs: update search documentation for Tantivy backend - configuration.md: add PAPERLESS_SEARCH_LANGUAGE and PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD settings - usage.md: replace Whoosh query language link with Tantivy; remove "inexact terms are slow" note; add full natural date keyword list; add fuzzy search note - api.md: update autocomplete ordering description (alphabetical, not Tf/Idf) - administration.md: deprecate `optimize` subcommand (now a no-op); add one-time reindex upgrade note Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 13:19:27 -07:00
Trenton H	7f63259f41	Remove silly lines	2026-03-30 13:08:57 -07:00
Trenton H	a213c2cc9b	test(search): expand query rewriting coverage for Whoosh compat shims Fold [-N unit to now] range, field:YYYYMMDD (with TZ-aware DateField vs DateTimeField logic), and parse_user_query fuzzy path into the renamed TestWhooshQueryRewriting class. TestParseUserQuery covers the full pipeline including fuzzy blend mode. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 12:55:23 -07:00
Trenton H	34d2897ab1	refactor(search): replace context manager smell with explicit open/close lifecycle TantivyBackend now uses open()/close()/_ensure_open() instead of __enter__/__exit__. get_backend() tracks _backend_path and auto-reinitializes when settings.INDEX_DIR changes, fixing the xdist/override_settings isolation bug where parallel workers would share a stale singleton pointing at a deleted index directory. Test fixtures use in-memory indices (path=None) for speed and isolation. Singleton behavior covered by TestSingleton in test_backend.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 12:49:46 -07:00
Trenton H	50f6b2d4c3	feat(search): wire Tantivy backend into all callsites; remove Whoosh - Replace all `from documents import index` + Whoosh writer usage across admin.py, bulk_edit.py, tasks.py, views.py, signals/handlers.py with `get_backend().add_or_update/remove/batch_update` - Add `effective_content` param to `_build_tantivy_doc` / `add_or_update` (used by signal handler to re-index root doc with version's OCR text) - Add `wipe_index()` (renamed from `_wipe_index`) to public API; use from `document_index --recreate` flag - `index_optimize()` replaced with deprecation log message; Tantivy manages segment merging automatically - `index_reindex()` now calls `get_backend().rebuild()` + `reset_backend()` with select_related/prefetch_related for efficiency - `document_index` management command: add `--recreate` flag - Status view: use `get_backend()` + dir mtime scan instead of Whoosh `ix.last_modified()` - Delete `documents/index.py`, `test_index.py`, `test_delayedquery.py` - Update all tests: patch `documents.search.get_backend` (lazy imports); `DirectoriesMixin` calls `reset_backend()` in setUp/tearDown; `TestDocumentConsumptionFinishedSignal` likewise - `test_api_search.py`: fix order-independent assertions for date-range queries; fix `_rewrite_8digit_date` to be field-aware and timezone-correct for DateTimeField vs DateField Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 10:43:30 -07:00
GitHub Actions	020057e1a4	Auto translate strings	2026-03-30 16:40:47 +00:00
shamoon	f715533770	Performance: support passing selection data with filtered document requests (#12300)	2026-03-30 16:38:52 +00:00
Jan Kleine	0292edbee7	Fixhancement: include trashed documents in document exporter/importer (#12425)	2026-03-30 16:30:22 +00:00
dependabot[bot]	5b755528da	Chore(deps): Bump cryptography in the uv group across 1 directory (#12458) Bumps the uv group with 1 update in the / directory: [cryptography](https://github.com/pyca/cryptography). Updates `cryptography` from 46.0.5 to 46.0.6 - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/46.0.5...46.0.6) --- updated-dependencies: - dependency-name: cryptography dependency-version: 46.0.6 dependency-type: indirect dependency-group: uv ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-30 08:51:24 -07:00
Trenton H	9df2a603b7	feat(search): package public exports Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 08:41:02 -07:00
Trenton H	fcd4d28f37	refactor(search): replace NLTK autocomplete extraction with regex \w+ + timeout NLTK was inappropriate here: no stopword filtering (users should be able to autocomplete any word), no length floor, and unicode-aware \w+ splits consistently with Tantivy's simple tokenizer. regex library used (already a project dependency) for ReDoS protection via per-call timeout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 08:38:22 -07:00
Trenton H	0fb57205db	feat(search): complete TantivyBackend — search, autocomplete, more_like_this, rebuild, WriteBatch Dual-field approach for notes/custom_fields: JSON fields support structured queries (notes.user:alice, custom_fields.name:invoice); companion text fields (note, custom_field) carry content for default full-text search — tantivy-py 0.25 parse_query rejects dotted paths in default_field_names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 08:31:52 -07:00
shamoon	3d4353dc2b	Security: pin GitHub Actions to specific SHAs (#12465)	2026-03-29 17:16:44 -07:00
Trenton H	0078ef9cd5	refactor(search): add docstrings and complete type annotations to all search module functions - Add descriptive docstrings to all functions in _schema.py, _tokenizer.py, and _query.py - Complete type annotations for all function parameters and return values - Fix 8 mypy strict errors in _query.py: - Add re.Match[str] type parameters for regex matches - Fix "Returning Any" error with str() cast - Add type annotations for build_permission_filter() and parse_user_query() - Remove lazy imports, move to module top level - All 29 search module tests continue to pass Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 15:26:07 -07:00
Trenton H	957049c512	fix(search): register fast-field tokenizer for simple_analyzer; fix perm_index fixture Tantivy requires register_fast_field_tokenizer for any tokenizer used by fast=True text fields — it writes default fast column values on every commit even when a document omits those fields, raising ValueError otherwise. perm_index fixture simplified to use in-memory index (path=None). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 15:01:39 -07:00
Trenton H	33da63c229	feat(search): normalize_query, build_permission_filter, parse_user_query pipeline Implement query normalization and permission filtering for Tantivy search: - normalize_query: expands comma-separated field values with AND operator - build_permission_filter: security-critical permission filtering for documents - no owner (NULL in Django) → documents without owner_id field - owned by user → owner_id = user.pk - shared with user → viewer_id = user.pk - uses disjunction_max_query for proper OR semantics - workaround for tantivy-py unsigned type detection bug via range_query - parse_user_query: full pipeline with fuzzy search support - DEFAULT_SEARCH_FIELDS and boost configuration Note: Permission filter tests require Tantivy environment setup; core functionality implemented and normalize tests passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 14:56:58 -07:00
Trenton H	cbeb7469a1	feat(search): natural date keyword rewriting with Whoosh compat shims Implement date/timezone boundary math for natural language date queries: - `created` (DateField): local calendar date to UTC midnight boundaries - `added`/`modified` (DateTimeField): local day boundaries with full offset arithmetic - Whoosh compat shims: compact dates (YYYYMMDDHHmmss) → ISO 8601 - Relative ranges: `[now-7d TO now]` → concrete ISO timestamps - Natural keywords: today, yesterday, this_week, last_week, etc. - Timezone-aware: handles UTC offset arithmetic for datetime fields - Passthrough: bare keywords without field prefixes unchanged Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 14:47:06 -07:00
Trenton H	2cf85d9b58	chore: make whoosh imports lazy to unblock test collection during Tantivy migration Module-level whoosh imports in tasks.py and paperless/views.py prevented test collection after removing whoosh-reloaded. Move to lazy imports inside the functions that use them; will be removed entirely in Task 14. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 14:37:28 -07:00
Trenton H	494d17e7ac	feat(search): PAPERLESS_SEARCH_LANGUAGE and PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD settings Add two new environment variables for Tantivy search backend: - PAPERLESS_SEARCH_LANGUAGE: language code for stemming (empty string disables) - PAPERLESS_ADVANCED_FUZZY_SEARCH_THRESHOLD: float threshold for fuzzy search blending (None disables) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 14:33:37 -07:00
Trenton H	e8fe3a6a62	feat(search): tokenizer registration — paperless_text with language stemming, simple_analyzer, bigram_analyzer This commit implements Task 3 of the Tantivy search backend migration: - Add `src/documents/search/_tokenizer.py` with three custom tokenizers: - `paperless_text`: simple → remove_long(65) → lowercase → ascii_fold [→ stemmer] Supports 18 languages via Snowball stemmer with fallback warning for unsupported languages - `simple_analyzer`: simple → lowercase → ascii_fold (for shadow sort fields) - `bigram_analyzer`: ngram(2,2) → lowercase (for CJK/no-whitespace language support) - Add comprehensive tests in `src/documents/tests/search/test_tokenizer.py`: - ASCII folding test: verifies "café résumé" is findable as "cafe resume" - Bigram CJK test: verifies "東京都" is searchable by substring "東京" - Warning test: verifies unsupported languages log appropriate warnings - `register_tokenizers()` function must be called on every Index instance as tantivy requires re-registration at each open - Language support includes common ISO 639-1 codes and full names: Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 14:30:40 -07:00
Trenton H	884edd6eea	feat(search): schema definition and open_or_rebuild_index with sentinel logic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 14:08:09 -07:00
Trenton H	d00fb4f345	feat: add tantivy dependency, search package skeleton, search pytest marker Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 13:56:56 -07:00
Andreas Schneider	85e0d1842a	Tests: add regression test for redis URL with empty username (#12460) * Tests: add regression test for redis URL with empty username and password Covers the unix://:SECRET@/path.sock format (empty username, password only), which was missing from the existing test cases for PR #12239. * Update src/paperless/tests/settings/test_custom_parsers.py --------- Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>	2026-03-29 06:31:18 -07:00
GitHub Actions	62f79c088e	Auto translate strings	2026-03-28 21:00:05 +00:00
shamoon	129da3ade7	Tweakhancement: show file extension in StoragePath test (#12452)	2026-03-28 13:58:33 -07:00
Trenton H	9383471fa0	Feature: Transition all checksums to use SHA256 (#12432)	2026-03-26 11:28:02 -07:00
dependabot[bot]	0060b46c8b	Chore(deps): Bump requests in the uv group across 1 directory (#12441) Bumps the uv group with 1 update in the / directory: [requests](https://github.com/psf/requests). Updates `requests` from 2.32.5 to 2.33.0 - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](https://github.com/psf/requests/compare/v2.32.5...v2.33.0) --- updated-dependencies: - dependency-name: requests dependency-version: 2.33.0 dependency-type: indirect dependency-group: uv ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-26 09:04:20 -07:00
GitHub Actions	b153ec803b	Auto translate strings	2026-03-26 14:38:10 +00:00
shamoon	38dba60ceb	Enhancement: auto-hide the search bar on mobile (#12404)	2026-03-26 07:36:32 -07:00
shamoon	ae0474450f	Chore: logger, response and template sanitization cleanup (#12439)	2026-03-26 07:36:02 -07:00


11281 Commits
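
Commits b10f3de2eb and 8107b7d209 above describe ranking autocomplete suggestions by document frequency, with ties broken alphabetically via a sort key of (-count, word). The following is only a minimal sketch of that ordering, not the project's code: the function name `rank_suggestions`, its parameters, and the use of the standard-library `re` module (the commits mention the third-party `regex` library with a timeout) are assumptions made for illustration.

```python
import re
from collections import Counter


def rank_suggestions(documents: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Hypothetical helper: rank words starting with `prefix` by document frequency."""
    prefix = prefix.lower()
    counts: Counter[str] = Counter()
    for text in documents:
        # A set ensures each word counts at most once per document.
        words = {w.lower() for w in re.findall(r"\w+", text)}
        counts.update(w for w in words if w.startswith(prefix))
    # Sort key (-count, word): most frequent first, ties resolved alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


docs = ["invoice inbox", "invoice income", "insurance invoice inbox"]
print(rank_suggestions(docs, "in"))
```

In this sketch "income" and "insurance" each appear in one document, so the alphabetical tie-break is what keeps their relative order stable between runs.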