Deduplicates query parsing (3 call sites) and permission filter
wrapping (4 call sites) into private helper methods on TantivyBackend.
Also documents the N-lookup limitation of highlight_hits().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- TantivyRelevanceList.__getitem__ now handles int keys, not just slices
- search_ids() docstring corrected ("no highlights or scores")
- Empty ordering param now correctly becomes None instead of ""
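
For illustration only, the int/slice handling and the empty-ordering fix could look roughly like the sketch below. `RelevanceListSketch` and `normalize_ordering` are hypothetical names, not the actual TantivyRelevanceList or viewset code.

```python
class RelevanceListSketch:
    """Hypothetical stand-in for a list-like wrapper over ordered document IDs."""

    def __init__(self, ordered_ids):
        self._ordered_ids = list(ordered_ids)

    def __len__(self):
        return len(self._ordered_ids)

    def __getitem__(self, key):
        # An int key yields a single ID; a slice yields a list of IDs.
        if isinstance(key, (int, slice)):
            return self._ordered_ids[key]
        raise TypeError(f"indices must be int or slice, not {type(key).__name__}")


def normalize_ordering(raw):
    """Map a missing or empty ordering parameter to None instead of ''."""
    return raw or None
```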
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use search_ids() for the full set of matching IDs (lightweight ints,
no arbitrary cap) and highlight_hits() for just the displayed page.
TantivyRelevanceList now holds ordered IDs for count/selection_data
and a small page of rich SearchHit dicts for serialization.
Removes the hardcoded 10000 limit that silently truncated results
for large collections. Memory usage down ~10% on sorted/paginated
search paths at 200 docs, with larger gains expected at scale.
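
A minimal sketch of the two-step flow described above, assuming search_ids() takes a query and returns an ordered list of IDs and highlight_hits() takes a query plus the IDs to enrich. Both signatures are assumptions, so the sketch receives them as callables rather than calling TantivyBackend directly; `paginate_with_highlights` is a hypothetical name.

```python
def paginate_with_highlights(search_ids, highlight_hits, query, page, page_size):
    """Return (total match count, rich hits for the requested page only)."""
    # Cheap step: every matching ID, with no arbitrary cap.
    all_ids = search_ids(query)

    # Expensive step: build highlights only for the displayed page.
    start = (page - 1) * page_size
    page_ids = all_ids[start:start + page_size]
    hits = highlight_hits(query, page_ids) if page_ids else []

    return len(all_ids), hits
```

The count and any selection data come from the full ID list, while serialization only touches the small page of hits.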
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Profiling tests and helper served their purpose during the search
performance optimization work. Baseline and post-implementation
data captured in docs/superpowers/plans/.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Single source of truth for sort field mapping. The viewset now references
TantivyBackend.SORTABLE_FIELDS instead of maintaining a duplicate set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
search_ids() returns only the document IDs matching a query, with no
highlights and no SearchHit objects. more_like_this_ids() does the same
for MLT queries.
These provide lightweight paths when only IDs are needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gate expensive snippet/highlight generation to only the requested
slice of hits, allowing the viewset to avoid generating highlights
for all 10k results when only 25 are displayed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Always use regexes with timeouts for user-controlled data
- Add rate limiting to the token endpoint (configurable)
- Sign the classifier pickle file with the SECRET_KEY and refuse to load one that fails verification
- Require the user to set a secret key instead of falling back to the old hard-coded one
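
As a rough illustration of the pickle-signing idea, the sketch below prepends an HMAC-SHA256 signature to the pickled payload and verifies it before unpickling. The digest choice, the byte layout, and the names `dump_signed` and `load_signed` are assumptions for the example, not the classifier's actual on-disk format.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dump_signed(obj, secret_key: str) -> bytes:
    """Pickle obj and prepend an HMAC of the payload (hypothetical layout)."""
    payload = pickle.dumps(obj)
    signature = hmac.new(secret_key.encode("utf-8"), payload, hashlib.sha256).digest()
    return signature + payload


def load_signed(blob: bytes, secret_key: str):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(secret_key.encode("utf-8"), payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)
```

Verifying before calling pickle.loads() matters because unpickling untrusted bytes can execute arbitrary code.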
* Tests: add regression test for redis URL with empty username and password
Covers the unix://:SECRET@/path.sock format (empty username, password only),
which was missing from the existing test cases for PR #12239.
* Update src/paperless/tests/settings/test_custom_parsers.py
---------
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
* refactor: switch consumer and callers to ParserRegistry (Phase 4)
Replace all Django signal-based parser discovery with direct registry
calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all
old-style isinstance checks. All parser instantiation now uses the
`with parser_class() as parser:` context manager pattern.
- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type
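
A minimal sketch of the `with parser_class() as parser:` pattern. `ExampleParser` and `extract_text` are hypothetical stand-ins; real parsers have a different interface, and only the context manager shape is taken from this change.

```python
class ExampleParser:
    """Hypothetical parser that releases its resources on exit."""

    def __init__(self):
        self.cleaned_up = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Cleanup runs whether or not parsing raised.
        self.cleaned_up = True
        return False  # do not swallow exceptions

    def parse(self, raw: str) -> str:
        return raw.strip()


def extract_text(parser_class, raw: str) -> str:
    # The caller never needs an explicit cleanup call or shim.
    with parser_class() as parser:
        return parser.parse(raw)
```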
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: drop get_parser_class_for_mime_type; callers use registry directly
All callers now call get_parser_registry().get_parser_for_file() with
the actual filename and path, enabling score() to use file extension
hints. The MIME-only helper is removed.
- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior);
TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: remove document_consumer_declaration signal infrastructure
Remove the document_consumer_declaration signal that was previously used
for parser registration. Each parser app no longer connects to this signal,
and the signal declaration itself has been removed from documents/signals.
Changes:
- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
  - paperless_tesseract/signals.py
  - paperless_text/signals.py
  - paperless_tika/signals.py
  - paperless_mail/signals.py
  - paperless_remote/signals.py
Parser discovery now happens exclusively through the ParserRegistry
system introduced in the previous refactor phases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: remove empty paperless_text and paperless_tika Django apps
After parser classes were moved to paperless/parsers/ in the plugin
refactor, these Django apps contained only empty AppConfig classes
with no models, views, tasks, migrations, or other functionality.
- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files
paperless_remote app is retained as it contains meaningful system
checks for Azure AI configuration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Moves the checks and tests to the main application and removes the old applications
* Adds a comment to satisfy Sonar
* refactor: remove automatic log_summary() call from get_parser_registry()
The summary was logged once per process, causing it to appear repeatedly
during Docker startup (management commands, web server, each Celery
worker subprocess). External parsers are already announced individually
at INFO when discovered; the full summary is redundant noise.
log_summary() is retained on ParserRegistry for manual/debug use.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Cleans up the duplicate test file/fixture
* Fixes a race condition where webserver threads could race to populate the registry
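
This message does not say how the race was fixed. One common approach, sketched here purely as an assumption, is double-checked locking around lazy initialization. `_build_registry` and `get_registry` are hypothetical names, not the real get_parser_registry() implementation.

```python
import threading

_registry = None
_registry_lock = threading.Lock()


def _build_registry():
    # Placeholder for the expensive discovery work.
    return {"parsers": []}


def get_registry():
    """Return the shared registry, building it at most once across threads."""
    global _registry
    if _registry is None:  # fast path: no lock once populated
        with _registry_lock:
            if _registry is None:  # re-check: another thread may have won
                _registry = _build_registry()
    return _registry
```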
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>