mirror of https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-03-17 22:45:58 +00:00
* Chore: move Tika parser and tests to paperless/

  Move TikaDocumentParser and its tests to the canonical parser package
  location, matching the pattern established for TextDocumentParser:

  - src/paperless_tika/parsers.py → src/paperless/parsers/tika.py
  - src/paperless_tika/tests/test_tika_parser.py → src/paperless/tests/parsers/test_tika_parser.py
  - src/paperless_tika/tests/samples/ → src/paperless/tests/samples/tika/

  Merge tika fixtures (tika_parser, sample_odt_file, sample_docx_file,
  sample_doc_file, sample_broken_odt) into the shared parsers conftest.
  Remove the now-empty src/paperless_tika/tests/conftest.py.

  Content is unchanged; this commit is rename-only so git history is
  preserved on the moved files.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feature: Phase 3 - migrate TikaDocumentParser to ParserProtocol

  Refactor TikaDocumentParser to satisfy ParserProtocol without subclassing
  the legacy DocumentParser ABC:

  - Add ClassVars: name, version, author, url
  - Add supported_mime_types() classmethod (12 Office/ODF/RTF MIME types)
  - Add score() classmethod: returns None when TIKA_ENABLED is False, 10 otherwise
  - can_produce_archive = False (PDF is for display, not an OCR archive)
  - requires_pdf_rendition = True (Office formats need PDF for browser display)
  - __enter__/__exit__ via ExitStack: TikaClient opened once per parser
    lifetime and shared across parse() and extract_metadata() calls
  - extract_metadata() falls back to a short-lived TikaClient when called
    outside a context manager (legacy view-layer metadata path)
  - _convert_to_pdf() uses OutputTypeConfig() to honour the database-stored
    ApplicationConfiguration before falling back to the env-var setting
  - Rename convert_to_pdf → _convert_to_pdf (private helper)

  Update paperless_tika/signals.py shim to import from the new module path
  and drop the legacy logging_group/progress_callback kwargs.

  Update documents/consumer.py to extend the existing TextDocumentParser
  special cases to also cover TikaDocumentParser (parse/get_thumbnail
  signatures, __exit__ cleanup).

  Add TestTikaParserRegistryInterface (7 tests) covering score(), properties,
  and ParserProtocol isinstance check. Update existing tests to use the new
  accessor API (get_text, get_date, get_archive_path, _convert_to_pdf).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: update remaining imports and move live Tika tests after parser migration

  - src/documents/tests/test_parsers.py: import TikaDocumentParser from
    paperless.parsers.tika (old paperless_tika.parsers no longer exists)
  - git mv paperless_tika/tests/test_live_tika.py →
    paperless/tests/parsers/test_live_tika.py to co-locate all Tika tests
    with the parser; update import and replace old attribute API
    (tika_parser.text/.archive_path) with accessor methods
    (get_text/get_archive_path)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: satisfy mypy and pyrefly for TikaDocumentParser

  Use a TYPE_CHECKING-guarded assert to narrow self._tika_client from
  TikaClient | None to TikaClient at the point of use in parse(). The assert
  is visible to type checkers (TYPE_CHECKING=True) so both mypy and pyrefly
  accept the subsequent attribute accesses without error; at runtime
  TYPE_CHECKING is False so the assert never executes and no ruff S101
  suppression is required.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: require context manager for TikaDocumentParser; clean up client lifecycle

  - consumer.py: call __enter__ for new-style parsers so _tika_client and
    _gotenberg_client are set before parse() is invoked
  - views.py: use `with parser` (via nullcontext for old-style parsers) in
    get_metadata so extract_metadata always runs inside a context manager
  - tika.py: GotenbergClient added to ExitStack alongside TikaClient; inline
    client creation removed from extract_metadata and _convert_to_pdf;
    __exit__ uses ExitStack.close() instead of __exit__ pass-through

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
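The client lifecycle described in the commits above can be sketched roughly as follows. This is a minimal illustration and not the actual paperless-ngx code: `FakeClient`, `SketchParser`, `LegacyParser`, and `run_parser` are hypothetical names standing in for TikaClient/GotenbergClient, TikaDocumentParser, an old-style parser, and the view-layer call site. It shows clients registered on an `ExitStack` in `__enter__`, a `TYPE_CHECKING`-guarded assert used only for type narrowing, and `nullcontext` wrapping a parser that is not a context manager.

```python
from __future__ import annotations

from contextlib import ExitStack, nullcontext
from typing import TYPE_CHECKING


class FakeClient:
    """Hypothetical stand-in for TikaClient / GotenbergClient."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.is_open = False

    def __enter__(self) -> FakeClient:
        self.is_open = True
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.is_open = False


class SketchParser:
    """Hypothetical new-style parser that owns its clients via an ExitStack."""

    def __init__(self) -> None:
        self._stack = ExitStack()
        self._tika_client: FakeClient | None = None
        self._gotenberg_client: FakeClient | None = None

    def __enter__(self) -> SketchParser:
        # Both clients are opened once and closed together by the stack.
        self._tika_client = self._stack.enter_context(FakeClient("tika"))
        self._gotenberg_client = self._stack.enter_context(FakeClient("gotenberg"))
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._stack.close()

    def parse(self) -> str:
        if TYPE_CHECKING:
            # Seen only by type checkers; narrows Optional to FakeClient.
            assert self._tika_client is not None
        return f"parsed via {self._tika_client.name}"


class LegacyParser:
    """Hypothetical old-style parser with no context-manager support."""

    def parse(self) -> str:
        return "parsed via legacy path"


def run_parser(parser: SketchParser | LegacyParser) -> str:
    # New-style parsers are entered directly; old-style ones get nullcontext.
    cm = parser if hasattr(parser, "__enter__") else nullcontext(parser)
    with cm as active:
        return active.parse()
```

One consequence of this design is that calling `parse()` outside a `with` block fails at runtime with an `AttributeError` on `None`, which matches the commit's intent of requiring the context manager rather than silently creating a short-lived client.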
47 lines
1.6 KiB
YAML
# Docker Compose file for running paperless testing with actual Gotenberg
# and Tika containers for a more end to end test of the Tika related functionality
# Can be used locally or by the CI to start the necessary containers with the
# correct networking for the tests
services:
  gotenberg:
    image: docker.io/gotenberg/gotenberg:8.27
    hostname: gotenberg
    container_name: gotenberg
    network_mode: host
    restart: unless-stopped
    # The gotenberg chromium route is used to convert .eml files. We do not
    # want to allow external content like tracking pixels or even javascript.
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
      - "--log-level=warn"
      - "--log-format=text"
  tika:
    image: docker.io/apache/tika:3.2.3.0
    hostname: tika
    container_name: tika
    network_mode: host
    restart: unless-stopped
  greenmail:
    image: docker.io/greenmail/standalone:2.1.8
    hostname: greenmail
    container_name: greenmail
    environment:
      # Enable only IMAP for now (SMTP available via 3025 if needed later)
      GREENMAIL_OPTS: >-
        -Dgreenmail.setup.test.imap -Dgreenmail.users=test@localhost:test -Dgreenmail.users.login=test@localhost -Dgreenmail.verbose
    ports:
      - "3143:3143" # IMAP
    restart: unless-stopped
  nginx:
    image: docker.io/nginx:1.29.5-alpine
    hostname: nginx
    container_name: nginx
    ports:
      - "8080:8080"
    restart: unless-stopped
    volumes:
      - ../../docs/assets:/usr/share/nginx/html/assets:ro
      - ./test-nginx.conf:/etc/nginx/conf.d/default.conf:ro
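Before running the live tests against these containers, it can help to confirm that each service is accepting connections. The sketch below is one possible way to do that from Python and is not part of the repository. The GreenMail and nginx ports come from the `ports` mappings in the file. The Gotenberg (3000) and Tika (9998) ports are assumptions based on those projects' default listening ports, since both services use `network_mode: host` and the file does not state a port. The names `ENDPOINTS`, `is_port_open`, and `unreachable_services` are hypothetical.

```python
import socket

# Assumed endpoints for the services defined in the compose file.
ENDPOINTS = {
    "gotenberg": ("localhost", 3000),       # assumption: Gotenberg default port
    "tika": ("localhost", 9998),            # assumption: Tika server default port
    "greenmail-imap": ("localhost", 3143),  # from the "3143:3143" mapping
    "nginx": ("localhost", 8080),           # from the "8080:8080" mapping
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_services(
    endpoints: dict[str, tuple[str, int]] = ENDPOINTS,
    timeout: float = 1.0,
) -> list[str]:
    """Return the names of services that did not accept a TCP connection."""
    return [
        name
        for name, (host, port) in endpoints.items()
        if not is_port_open(host, port, timeout)
    ]
```

A CI step could call `unreachable_services()` in a short retry loop and fail early if the list is still non-empty. A successful TCP connection only shows that the port is listening, not that the service has finished starting up.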