* Chore: move Tika parser and tests to paperless/
Move TikaDocumentParser and its tests to the canonical parser package
location, matching the pattern established for TextDocumentParser:
- src/paperless_tika/parsers.py → src/paperless/parsers/tika.py
- src/paperless_tika/tests/test_tika_parser.py → src/paperless/tests/parsers/test_tika_parser.py
- src/paperless_tika/tests/samples/ → src/paperless/tests/samples/tika/
Merge tika fixtures (tika_parser, sample_odt_file, sample_docx_file,
sample_doc_file, sample_broken_odt) into the shared parsers conftest.
Remove the now-empty src/paperless_tika/tests/conftest.py.
Content is unchanged — this commit is rename-only so git history is
preserved on the moved files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Feature: Phase 3 — migrate TikaDocumentParser to ParserProtocol
Refactor TikaDocumentParser to satisfy ParserProtocol without subclassing
the legacy DocumentParser ABC:
- Add ClassVars: name, version, author, url
- Add supported_mime_types() classmethod (12 Office/ODF/RTF MIME types)
- Add score() classmethod — returns None when TIKA_ENABLED is False, 10 otherwise
- can_produce_archive = False (PDF is for display, not an OCR archive)
- requires_pdf_rendition = True (Office formats need PDF for browser display)
- __enter__/__exit__ via ExitStack: TikaClient opened once per parser
lifetime and shared across parse() and extract_metadata() calls
- extract_metadata() falls back to a short-lived TikaClient when called
outside a context manager (legacy view-layer metadata path)
- _convert_to_pdf() uses OutputTypeConfig() to honour the database-stored
ApplicationConfiguration before falling back to the env-var setting
- Rename convert_to_pdf → _convert_to_pdf (private helper)
Update paperless_tika/signals.py shim to import from the new module path
and drop the legacy logging_group/progress_callback kwargs.
Update documents/consumer.py to extend the existing TextDocumentParser
special cases to also cover TikaDocumentParser (parse/get_thumbnail
signatures, __exit__ cleanup).
Add TestTikaParserRegistryInterface (7 tests) covering score(), properties,
and ParserProtocol isinstance check. Update existing tests to use the new
accessor API (get_text, get_date, get_archive_path, _convert_to_pdf).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: update remaining imports and move live Tika tests after parser migration
- src/documents/tests/test_parsers.py: import TikaDocumentParser from
paperless.parsers.tika (old paperless_tika.parsers no longer exists)
- git mv paperless_tika/tests/test_live_tika.py →
paperless/tests/parsers/test_live_tika.py to co-locate all Tika tests
with the parser; update import and replace old attribute API
(tika_parser.text/.archive_path) with accessor methods
(get_text/get_archive_path)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: satisfy mypy and pyrefly for TikaDocumentParser
Use a TYPE_CHECKING-guarded assert to narrow self._tika_client from
TikaClient | None to TikaClient at the point of use in parse(). The
assert is visible to type checkers (TYPE_CHECKING=True) so both mypy
and pyrefly accept the subsequent attribute accesses without error;
at runtime TYPE_CHECKING is False so the assert never executes and no
ruff S101 suppression is required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: require context manager for TikaDocumentParser; clean up client lifecycle
- consumer.py: call __enter__ for new-style parsers so _tika_client and
_gotenberg_client are set before parse() is invoked
- views.py: use `with parser` (via nullcontext for old-style parsers) in
get_metadata so extract_metadata always runs inside a context manager
- tika.py: GotenbergClient added to ExitStack alongside TikaClient;
inline client creation removed from extract_metadata and _convert_to_pdf;
__exit__ uses ExitStack.close() instead of __exit__ pass-through
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This alters the retry/backoff logic in the init-wait-for-db script to be more
optimistic about database availability. During regular deployment and
operations of paperless-ngx, it's common to restart the application server with
the database instance already running, so we should optimize for this case.
Instead of unconditionally delaying 5 seconds between each connection attempt,
start with a minimum delay of 1 second and increase the delay linearly with
each attempt, maxing out at 10 seconds. This makes the retry count-based
failure mode less practical, so instead we just use a timeout-based approach.*
*NOTE: the original implementation would have an effective timeout of 25s. This
alters the behavior to 60s.
Additionally, this removes an unnecessary 5s delay that was injected in the
postgres case. The script uses a more comprehensive connection check for
postgres than it does mariadb, so if anything this 5s delay after getting an
"ok" response from the DB was extra unnecessary in the postgres case.
* Explicitly set the HOME environment for the migrations to fix issue with certificates
* Defines the HOME globally when we're running as root for startup