Commit Graph

3947 Commits

Author SHA1 Message Date
Trenton H 67da965d21 Merge branch 'dev' into feature-search-pagination-improvements 2026-04-08 17:37:30 -07:00
Trenton H acdee63197 Call to_html on snippets. JSON fields don't support snippets, so store a 'notes_text' to highlight instead. Use the Tantivy score when sorting for that, instead of discarding it 2026-04-08 15:05:04 -07:00
GitHub Actions ec6969e326 Auto translate strings 2026-04-08 15:42:05 +00:00
shamoon 4629bbf83e Enhancement: add view_global_statistics and view_system_status permissions (#12530) 2026-04-08 15:39:47 +00:00
Trenton H 759717404e Adds notes on where we can improve if fixes, features, or a new release drop in from Tantivy 2026-04-07 14:45:27 -07:00
Trenton H 0bdaff203c Fixes issues found by Copilot, tries to tune the filtering as suggested 2026-04-07 13:30:35 -07:00
Trenton H 689f5964fc Merge branch 'dev' into feature-search-pagination-improvements 2026-04-07 08:06:35 -07:00
GitHub Actions 51c59746a7 Auto translate strings 2026-04-06 22:51:57 +00:00
Trenton H c232d443fa Breaking: Decouple OCR control from archive file control (#12448)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-04-06 15:50:21 -07:00
Trenton Holmes 51624840f2 docs: note autocomplete as candidate for Redis caching
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 14:33:45 -07:00
Trenton Holmes 48309938c6 perf: use prefix query in autocomplete to avoid full-index scan
Previously autocomplete scanned every visible document to extract
words, then filtered by prefix in Python. Now builds a regex query
on autocomplete_word so Tantivy only returns docs containing matching
words. At 5k docs: rare prefixes go from 335ms to <1ms, common
prefixes from 342ms to 199ms with 58-99% less peak memory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 14:26:26 -07:00
Trenton Holmes b4cfc27876 docs: note potential large IN clause in selection_data query
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 13:52:41 -07:00
Trenton Holmes 86ac3ba9f1 fix: limit global search to 9 IDs and fix more_like_this_ids off-by-one
Global search only displays 3 results but was fetching all matching IDs
and hydrating them via in_bulk. Now passes limit=9 to search_ids().

more_like_this_ids could return limit-1 results when the original doc
appeared in the result set. Now fetches limit+1 and slices after
filtering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 13:49:23 -07:00
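
The off-by-one fix described in 86ac3ba9f1 can be illustrated with a minimal sketch. It is not the project's code: `similar_ids_sketch` and `fake_search` are hypothetical names, and the backend is reduced to a callable that returns ranked IDs.

```python
from typing import Callable, List


def similar_ids_sketch(
    search: Callable[[int], List[int]],
    original_id: int,
    limit: int,
) -> List[int]:
    """Return up to `limit` similar-document IDs, excluding the original."""
    # Request one extra hit so that dropping the original still leaves `limit`.
    candidates = search(limit + 1)
    filtered = [doc_id for doc_id in candidates if doc_id != original_id]
    # Slice after filtering, in case the original was not in the result set.
    return filtered[:limit]


def fake_search(n: int) -> List[int]:
    # Hypothetical backend: ranked IDs, with the original document (7) first.
    ranked = [7, 3, 9, 4, 1]
    return ranked[:n]
```

Fetching only `limit` hits and then removing the original would return `limit - 1` IDs whenever the original appears in the results, which is the behaviour the commit message describes.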
Trenton Holmes 67261287d2 refactor: extract nested helpers in UnifiedSearchViewSet.list()
Break the monolithic list() method into typed sub-functions for
readability: parse_search_params, intersect_and_order, run_text_search,
run_more_like_this. Also defer get_backend() until after param
validation so invalid requests fail fast.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 13:31:36 -07:00
Trenton Holmes ca077ba1e3 fix: reuse notes snippet generator across docs in highlight_hits()
The notes SnippetGenerator was being recreated per document instead of
lazily initialized once like the content generator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 13:23:14 -07:00
Trenton Holmes e3076b8d62 refactor: remove dead search() method and SearchResults from TantivyBackend
All production callers now use search_ids() + highlight_hits(). Migrated
10 tests to search_ids(), removed 5 that tested search()-specific
features (score normalization, highlight windowing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 13:21:01 -07:00
Trenton Holmes cb851fc217 refactor: switch global search from backend.search() to search_ids()
The global search endpoint only needs document IDs (takes top 3), not
highlights or scores. Using search_ids() avoids building SearchHit dicts
and removes the last production caller of backend.search().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 13:14:21 -07:00
Trenton Holmes 534fcfde6b refactor: remove dead more_like_this() method from TantivyBackend
The method is no longer called anywhere in production code — all callers
were migrated to more_like_this_ids() during the search pagination work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 13:10:58 -07:00
Trenton Holmes 0b5b6fdad5 refactor: extract _parse_query and _apply_permission_filter helpers
Deduplicates query parsing (3 call sites) and permission filter
wrapping (4 call sites) into private helper methods on TantivyBackend.
Also documents the N-lookup limitation of highlight_hits().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 13:31:37 -07:00
Trenton Holmes d98dbd50f4 fix: address code review findings (int keys, docstring, empty ordering)
- TantivyRelevanceList.__getitem__ now handles int keys, not just slices
- search_ids() docstring corrected ("no highlights or scores")
- Empty ordering param now correctly becomes None instead of ""

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 13:26:10 -07:00
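
A rough sketch of the int-key handling mentioned in d98dbd50f4, assuming a list-like wrapper over ordered document IDs. `RelevanceListSketch` is a hypothetical stand-in, not the actual TantivyRelevanceList.

```python
from typing import List, Union


class RelevanceListSketch:
    """Hypothetical list-like wrapper around relevance-ordered document IDs."""

    def __init__(self, ordered_ids: List[int]) -> None:
        self._ids = ordered_ids

    def __len__(self) -> int:
        return len(self._ids)

    def __getitem__(self, key: Union[int, slice]) -> Union[int, List[int]]:
        if isinstance(key, slice):
            # Pagination typically slices, e.g. items[0:25].
            return self._ids[key]
        if isinstance(key, int):
            # Single-item access, e.g. items[0] or items[-1].
            return self._ids[key]
        raise TypeError(f"unsupported index type: {type(key).__name__}")
```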
Trenton Holmes 7649e4a6b1 Merge remote-tracking branch 'origin/dev' into feature-search-pagination-improvements 2026-04-05 13:18:43 -07:00
Trenton Holmes 610ba27891 feat: replace 10000 overfetch with search_ids + page-only highlights
Use search_ids() for the full set of matching IDs (lightweight ints,
no arbitrary cap) and highlight_hits() for just the displayed page.
TantivyRelevanceList now holds ordered IDs for count/selection_data
and a small page of rich SearchHit dicts for serialization.

Removes the hardcoded 10000 limit that silently truncated results
for large collections. Memory usage down ~10% on sorted/paginated
search paths at 200 docs, with larger gains expected at scale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 12:54:47 -07:00
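
The approach in 610ba27891 (full ID list for counting, highlights only for the displayed page) might look roughly like the sketch below. The function name and the `highlight` callable are assumptions made for illustration; the real code uses search_ids() and highlight_hits() on the backend.

```python
from typing import Callable, Dict, List, Tuple


def page_with_highlights_sketch(
    all_ids: List[int],
    highlight: Callable[[int], Dict],
    page: int,
    page_size: int,
) -> Tuple[int, List[Dict]]:
    """Return (total match count, rich hits for the requested page only)."""
    start = (page - 1) * page_size
    page_ids = all_ids[start:start + page_size]
    # The expensive highlight step runs only for IDs on the displayed page.
    hits = [highlight(doc_id) for doc_id in page_ids]
    return len(all_ids), hits
```

The total count still reflects every matching ID, so no arbitrary cap is needed, while the cost of snippet generation scales with the page size.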
Trenton H 5f5fb263c9 Fix: Don't create a new note highlight generator per note in the loop (#12512) 2026-04-03 17:34:15 -07:00
Trenton Holmes 7c50e0077c chore: remove temporary profiling infrastructure
Profiling tests and helper served their purpose during the search
performance optimization work. Baseline and post-implementation
data captured in docs/superpowers/plans/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 15:53:55 -07:00
Trenton Holmes 288740ea62 refactor: promote sort_field_map to class-level constant on TantivyBackend
Single source of truth for sort field mapping. The viewset now references
TantivyBackend.SORTABLE_FIELDS instead of maintaining a duplicate set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 15:53:49 -07:00
Trenton Holmes d998d3fbaf feat: delegate sorting to Tantivy and use page-only highlights in viewset
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 15:35:14 -07:00
Trenton Holmes 6cf01dd383 feat: add search_ids() and more_like_this_ids() lightweight methods
search_ids() returns only document IDs matching a query — no highlights,
no SearchHit objects. more_like_this_ids() does the same for MLT queries.
These provide lightweight paths when only IDs are needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 15:21:52 -07:00
Trenton Holmes 0d915c58a4 feat: add highlight_page/highlight_page_size params to search()
Gate expensive snippet/highlight generation to only the requested
slice of hits, allowing the viewset to avoid generating highlights
for all 10k results when only 25 are displayed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 15:10:00 -07:00
Trenton Holmes 46008d2da7 test: add baseline profiling tests for search performance
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 14:58:11 -07:00
shamoon b807b107ad Enhancement: include sharelinks + bundles in export/import (#12479) 2026-04-03 21:51:57 +00:00
Trenton H c2f02851da Chore: Better typed status manager messages (#12509) 2026-04-03 21:18:01 +00:00
GitHub Actions d0f8a98a9a Auto translate strings 2026-04-03 20:55:14 +00:00
shamoon 566afdffca Enhancement: unify text search to use tantivy (#12485) 2026-04-03 13:53:45 -07:00
Trenton H f32ad98d8e Feature: Update consumer logging to include task ID for log correlation (#12510) 2026-04-03 13:31:40 -07:00
Trenton H d365f19962 Security: Registers a custom serializer which signs the task payload (#12504) 2026-04-03 03:49:54 +00:00
GitHub Actions 2703c12f1a Auto translate strings 2026-04-03 03:25:57 +00:00
shamoon e7c7978d67 Enhancement: allow opt-in blocking internal mail hosts (#12502) 2026-04-03 03:24:28 +00:00
GitHub Actions 83501757df Auto translate strings 2026-04-02 22:36:32 +00:00
Trenton H dda05a7c00 Security: Improve overall security in a few ways (#12501)
- Make sure we're always using regex with timeouts for user-controlled data
- Adds rate limiting to the token endpoint (configurable)
- Signs the classifier pickle file with the SECRET_KEY and refuses to load one which doesn't verify.
- Require the user to set a secret key, instead of falling back to our old hard-coded one
2026-04-02 15:30:26 -07:00
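
Signing a pickle with a secret key and refusing unverified files, as described in dda05a7c00, can be sketched with the standard library. This is an illustration under stated assumptions: HMAC-SHA256 and the signature-prefix layout are choices made here, and `dump_signed` / `load_signed` are hypothetical names, not the project's functions.

```python
import hashlib
import hmac
import pickle

# Length in bytes of a SHA-256 digest (32).
DIGEST_SIZE = hashlib.sha256().digest_size


def _sign(payload: bytes, secret_key: str) -> bytes:
    return hmac.new(secret_key.encode("utf-8"), payload, hashlib.sha256).digest()


def dump_signed(obj, secret_key: str) -> bytes:
    """Pickle `obj` and prepend an HMAC signature over the pickled bytes."""
    payload = pickle.dumps(obj)
    return _sign(payload, secret_key) + payload


def load_signed(blob: bytes, secret_key: str):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    # compare_digest avoids leaking timing information during comparison.
    if not hmac.compare_digest(signature, _sign(payload, secret_key)):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)
```

Verification happens before `pickle.loads` is ever called, since unpickling untrusted bytes can execute arbitrary code.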
Trenton H 376af81b9c Fix: Resolve another TC assuming an object has been created somewhere (#12503) 2026-04-02 14:58:28 -07:00
GitHub Actions 05c9e21fac Auto translate strings 2026-04-02 19:40:05 +00:00
Trenton H aed9abe48c Feature: Replace Whoosh with tantivy search backend (#12471)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
2026-04-02 12:38:22 -07:00
GitHub Actions 2aa0c9f0b4 Auto translate strings 2026-03-31 18:25:03 +00:00
shamoon d2328b776a Performance: support bulk edit without id lists (#12355) 2026-03-31 18:23:28 +00:00
GitHub Actions e1da2a1efe Auto translate strings 2026-03-31 14:57:34 +00:00
shamoon 245514ad10 Performance: deprecate and remove usage of all in API results (#12309) 2026-03-31 07:55:59 -07:00
GitHub Actions 020057e1a4 Auto translate strings 2026-03-30 16:40:47 +00:00
shamoon f715533770 Performance: support passing selection data with filtered document requests (#12300) 2026-03-30 16:38:52 +00:00
Jan Kleine 0292edbee7 Fixhancement: include trashed documents in document exporter/importer (#12425) 2026-03-30 16:30:22 +00:00
Andreas Schneider 85e0d1842a Tests: add regression test for redis URL with empty username (#12460)
* Tests: add regression test for redis URL with empty username and password

Covers the unix://:SECRET@/path.sock format (empty username, password only),
which was missing from the existing test cases for PR #12239.

* Update src/paperless/tests/settings/test_custom_parsers.py

---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-29 06:31:18 -07:00