- Always apply a timeout when running regexes against user-controlled data
- Add configurable rate limiting to the token endpoint
- Sign the classifier pickle file with SECRET_KEY and refuse to load a file whose signature does not verify
- Require the user to set a secret key instead of falling back to the old hard-coded one
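For reviewers, the pickle-signing item can be pictured roughly as below. This is a minimal sketch and not the code in this change: the helper names `dumps_signed` and `loads_signed`, the signature-prefix layout, and the choice of HMAC-SHA256 are all assumptions made for illustration.

```python
import hashlib
import hmac
import pickle

# Hypothetical stand-in for Django's settings.SECRET_KEY (as bytes).
SECRET_KEY = b"change-me"
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key=SECRET_KEY):
    """Pickle obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return signature + payload


def loads_signed(blob, key=SECRET_KEY):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest runs in constant time with respect to the contents.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("classifier file signature does not verify")
    return pickle.loads(payload)
```

The point of checking before `pickle.loads` is that unpickling untrusted bytes can execute arbitrary code, so a file that fails verification is never deserialized.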
* Tests: add regression test for redis URL with empty username and non-empty password
Covers the unix://:SECRET@/path.sock format (empty username, password only),
which was missing from the existing test cases for PR #12239.
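The URL shape in question can be checked with the standard library alone. The snippet below is only an illustration of how such a URL splits; the helper name `split_credentials` is hypothetical and this is not the project's parser.

```python
from urllib.parse import urlsplit


def split_credentials(url):
    """Return (username, password, path) as urlsplit sees them."""
    parts = urlsplit(url)
    return parts.username, parts.password, parts.path


# Empty username, password only, followed by a socket path.
username, password, path = split_credentials("unix://:SECRET@/path.sock")
```

Here the username comes back as an empty string rather than `None`, which is the case a parser has to treat as "no username" instead of rejecting the URL.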
* Update src/paperless/tests/settings/test_custom_parsers.py
---------
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
- TestShouldProduceArchive: replace @override_settings decorators with
settings fixture; consolidate 10 individual tests into 2 parametrized
tests (test_generation_setting, test_auto_pdf_archive_decision)
- TestDeprecatedV2OcrEnvVarWarnings: call check_deprecated_v2_ocr_env_vars()
directly instead of django_checks.run_checks(); use mocker.patch.dict for
env isolation; consolidate warn cases into one parametrized test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add extract_pdf_text() and PDF_TEXT_MIN_LENGTH to paperless/parsers/utils.py,
eliminating duplicate pdftotext call sites in consumer.py and tesseract.py
- Rename _should_produce_archive → should_produce_archive (now public, imported
by both consumer.py and tasks.py)
- update_document_content_maybe_archive_file now calls should_produce_archive,
honouring ARCHIVE_FILE_GENERATION the same as the consumer pipeline
- Fallback OCR path sets archive_path when produce_archive=True; update
test_with_form_redo_produces_no_archive to use produce_archive=False
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _extract_text_for_archive_check() and _should_produce_archive() helper
functions to documents/consumer.py. These compute whether the parser should
produce a PDF/A archive based on the ARCHIVE_FILE_GENERATION setting (always/
never/auto), parser capabilities (can_produce_archive, requires_pdf_rendition),
MIME type, and pdftotext-based born-digital detection for auto mode.
Update the parse() call site to compute and pass produce_archive=... kwarg.
Add 10 unit tests in test_consumer_archive.py; update two existing consumer
tests that asserted run_subprocess call counts now that pdftotext is invoked
during auto-mode archive detection.
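One way to read the decision described above is sketched here. The function name `decide_archive`, the precedence of the checks, the lowercase setting strings, and the threshold value are assumptions for illustration; the actual helper in documents/consumer.py may order or weight these inputs differently.

```python
PDF_MIME = "application/pdf"
# Hypothetical threshold: text shorter than this is treated as "no text layer".
MIN_TEXT_LENGTH = 50


def decide_archive(
    generation,
    *,
    can_produce_archive,
    requires_pdf_rendition,
    mime_type,
    extracted_text=None,
):
    """Return True if the parser should be asked for a PDF/A archive."""
    if not can_produce_archive:
        return False  # the parser cannot produce one at all
    if requires_pdf_rendition:
        return True  # assumed: a rendition is needed regardless of setting
    if generation == "always":
        return True
    if generation == "never":
        return False
    # "auto": non-PDF inputs get an archive; PDFs only when they look scanned.
    if mime_type != PDF_MIME:
        return True
    text = (extracted_text or "").strip()
    return len(text) < MIN_TEXT_LENGTH
```

In this reading, a born-digital PDF (one with a substantial text layer) skips archive generation in auto mode, while a scanned PDF or an image still gets one.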
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement the new decoupled archive/OCR control in RasterisedDocumentParser:
- construct_ocrmypdf_parameters(): add skip_text parameter; fix AUTO mode
dispatch so skip_text is only added when explicitly requested (text-present
+ produce_archive case) rather than unconditionally; add OFF mode support.
- parse(): remove archive_file_generation checks; control archive creation
exclusively via the produce_archive bool passed by the consumer.
- OFF + no archive: return pdftotext text, skip OCRmyPDF entirely.
- OFF + image + archive: use new _convert_image_to_pdfa() helper.
- OFF + PDF + archive: run OCRmyPDF with skip_text=True (PDF/A only).
- AUTO + text + no archive: skip OCRmyPDF entirely (fast path).
- AUTO + text + archive: run OCRmyPDF with skip_text=True.
- AUTO + no text: run normal OCR regardless of produce_archive.
- FORCE/REDO: always run OCRmyPDF; set archive_path only when produce_archive.
- Add _convert_image_to_pdfa(): img2pdf wrapping + pikepdf PDF/A-2b stamping
without invoking Tesseract or Ghostscript.
- Add PriorOcrFoundError to the fallback exception list (same treatment as
InputFileError: retry with force_ocr).
- Update existing tests to use produce_archive instead of archive_file_generation:
TestSkipArchive rewritten; RTL test uses mode=off to preserve Arabic text
layer; AUTO mode tests clarified.
- Add test_parse_modes.py: 11 focused unit tests with mocked ocrmypdf.ocr
verifying control flow for all mode/produce_archive combinations.
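The mode/produce_archive combinations listed above can be condensed into a small lookup. This is a sketch of the control flow only: `plan_parse` and the returned action labels are hypothetical names, and the lowercase mode strings are assumed.

```python
def plan_parse(mode, *, has_text, is_image, produce_archive):
    """Map a mode and document state to the action the parser would take."""
    if mode == "off":
        if not produce_archive:
            return "pdftotext_only"  # OCRmyPDF skipped entirely
        # An archive is wanted, but no OCR is performed.
        return "image_to_pdfa" if is_image else "ocrmypdf_skip_text"
    if mode == "auto":
        if not has_text:
            return "ocrmypdf"  # normal OCR regardless of produce_archive
        return "ocrmypdf_skip_text" if produce_archive else "pdftotext_only"
    if mode in ("force", "redo"):
        # OCRmyPDF always runs; archive_path is set only when requested.
        return "ocrmypdf"
    raise ValueError(f"unknown OCR mode: {mode!r}")


def sets_archive_path(produce_archive):
    """The archive path depends only on the flag passed by the consumer."""
    return bool(produce_archive)
```

Keeping the archive decision in a single boolean means the parser never consults the archive setting itself, which is the decoupling this commit is after.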
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the old skip_archive_file DB field with the correctly-named
archive_file_generation field on ApplicationConfiguration. Remove the
temporary getattr fallback in OcrConfig now that the migration exists.
Update all test fixtures and API response assertions to use the new field name.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switches OcrConfig.__post_init__ from reading the old skip_archive_file
attribute to the new archive_file_generation attribute, with a getattr
fallback to skip_archive_file for compatibility until Task 4 renames
the DB model field. Updates null_app_config fixtures in both the parser
conftest and the new test_ocr_config.py to explicitly set both attributes
to None so MagicMock doesn't return truthy auto-generated attributes.
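The MagicMock pitfall behind the fixture change is easy to reproduce. The sketch below assumes a simplified resolver, `resolve_generation`, with a hypothetical `"auto"` default; it is not the body of OcrConfig.__post_init__.

```python
from unittest.mock import MagicMock


def resolve_generation(app_config, default="auto"):
    """Prefer the new attribute, fall back to the old one, then a default."""
    value = getattr(app_config, "archive_file_generation", None)
    if value is None:
        value = getattr(app_config, "skip_archive_file", None)
    return value if value is not None else default


# A bare MagicMock invents any attribute on access, so the getattr default
# is never used and a truthy mock object leaks out as the "setting".
bare = MagicMock()
leaked = resolve_generation(bare)

# Pinning both attributes to None makes the mock behave like an unset row.
pinned = MagicMock()
pinned.archive_file_generation = None
pinned.skip_archive_file = None
resolved = resolve_generation(pinned)
```

With both attributes pinned, the resolver falls through to its default instead of returning a mock.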
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename the Django setting OCR_SKIP_ARCHIVE_FILE to ARCHIVE_FILE_GENERATION
and the env var PAPERLESS_OCR_SKIP_ARCHIVE_FILE to PAPERLESS_ARCHIVE_FILE_GENERATION.
Rename the OcrConfig attribute skip_archive_file to archive_file_generation.
Update checks.py error messages and all tests accordingly.
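A rename like this is commonly paired with a check that flags the old variable if it is still set. The helper below is a hypothetical sketch of such a check, not the code in checks.py; only the two variable names come from this commit.

```python
# Old name -> new name, as renamed in this commit.
RENAMED_ENV_VARS = {
    "PAPERLESS_OCR_SKIP_ARCHIVE_FILE": "PAPERLESS_ARCHIVE_FILE_GENERATION",
}


def deprecated_env_warnings(environ):
    """Return one warning string per old variable still present in environ."""
    warnings = []
    for old, new in RENAMED_ENV_VARS.items():
        if old in environ:
            warnings.append(f"{old} is no longer used; set {new} instead.")
    return warnings
```

Passing the environment in as a mapping, rather than reading `os.environ` directly, keeps the check easy to exercise in tests with a plain dict.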
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace ModeChoices (SKIP/SKIP_NO_ARCHIVE/REDO/FORCE) with new values:
AUTO, FORCE, REDO, OFF
- Remove ArchiveFileChoices entirely; add ArchiveFileGenerationChoices
with AUTO, ALWAYS, NEVER values
- Update checks.py valid sets and default settings to use new enum values
- Update tesseract parser to use new enum comparisons; AUTO mode maps to
skip_text behavior; FORCE/REDO bypass archive-skip early-exit
- Update all affected tests to use new valid mode/archive string values
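For orientation, the two new value sets look roughly like this. The standard-library `Enum` is used only to keep the sketch self-contained, and the lowercase string values are an assumption; the project may define these differently (for example as Django choices classes).

```python
from enum import Enum


class ModeChoices(str, Enum):
    """OCR mode: whether and how OCR runs."""

    AUTO = "auto"
    FORCE = "force"
    REDO = "redo"
    OFF = "off"


class ArchiveFileGenerationChoices(str, Enum):
    """Archive generation: whether a PDF/A archive file is produced."""

    AUTO = "auto"
    ALWAYS = "always"
    NEVER = "never"


def parse_mode(value):
    """Validate a raw setting string; raises ValueError for removed values."""
    return ModeChoices(value)
```

Splitting the old combined values into two independent enums is what lets OCR behavior and archive generation be configured separately.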
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>