3941 Commits

Author SHA1 Message Date
Trenton Holmes
a557d2210b Actually, the system check wouldn't see the 'wrong' value like that 2026-04-05 15:16:27 -07:00
Trenton Holmes
7454ce5a20 Try to automatically migrate user's DB settings values 2026-04-05 14:51:23 -07:00
Trenton Holmes
7f01b3a6f9 If the selected OCR mode is not a valid choice, warn and default to auto instead 2026-04-05 13:38:25 -07:00
Trenton Holmes
4138319832 Merge remote-tracking branch 'origin/dev' into feature-archive-ocr-decoupling 2026-04-05 13:32:00 -07:00
Trenton H
5f5fb263c9 Fix: Don't create a new note highlight generator per note in the loop (#12512) 2026-04-03 17:34:15 -07:00
shamoon
b807b107ad Enhancement: include sharelinks + bundles in export/import (#12479) 2026-04-03 21:51:57 +00:00
Trenton H
c2f02851da Chore: Better typed status manager messages (#12509) 2026-04-03 21:18:01 +00:00
GitHub Actions
d0f8a98a9a Auto translate strings 2026-04-03 20:55:14 +00:00
shamoon
566afdffca Enhancement: unify text search to use tantivy (#12485) 2026-04-03 13:53:45 -07:00
Trenton H
f32ad98d8e Feature: Update consumer logging to include task ID for log correlation (#12510) 2026-04-03 13:31:40 -07:00
Trenton H
91c77c42f0 Add debug level logging for why an archive is made and why we decided OCR or not 2026-04-03 09:16:00 -07:00
Trenton H
8115332cc9 Tests and fix a bug with the img2pdf functionality 2026-04-03 09:05:21 -07:00
Trenton H
c3be765761 Merge branch 'dev' into feature-archive-ocr-decoupling 2026-04-03 08:17:09 -07:00
Trenton H
d365f19962 Security: Registers a custom serializer which signs the task payload (#12504) 2026-04-03 03:49:54 +00:00
GitHub Actions
2703c12f1a Auto translate strings 2026-04-03 03:25:57 +00:00
shamoon
e7c7978d67 Enhancement: allow opt-in blocking internal mail hosts (#12502) 2026-04-03 03:24:28 +00:00
GitHub Actions
83501757df Auto translate strings 2026-04-02 22:36:32 +00:00
Trenton H
dda05a7c00 Security: Improve overall security in a few ways (#12501)
- Make sure we're always using regex with timeouts for user controlled data
- Adds rate limiting to the token endpoint (configurable)
- Signs the classifier pickle file with the SECRET_KEY and refuses to load one which doesn't verify.
- Require the user to set a secret key, instead of falling back to our old hard coded one
2026-04-02 15:30:26 -07:00
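
The pickle-signing idea in dda05a7c00 can be illustrated with a minimal sketch. This is not the project's implementation: the function names `dump_signed` and `load_signed`, the choice of HMAC-SHA256, and the "signature prefix" file layout are all assumptions made for illustration.

```python
import hashlib
import hmac
import pickle

# Length of an HMAC-SHA256 digest in bytes (32).
SIG_LEN = hashlib.sha256().digest_size


def dump_signed(obj, secret_key: str) -> bytes:
    """Hypothetical helper: pickle obj and prepend an HMAC of the payload."""
    payload = pickle.dumps(obj)
    sig = hmac.new(secret_key.encode(), payload, hashlib.sha256).digest()
    return sig + payload


def load_signed(blob: bytes, secret_key: str):
    """Hypothetical helper: verify the HMAC first, unpickle only if it matches."""
    sig, payload = blob[:SIG_LEN], blob[SIG_LEN:]
    expected = hmac.new(secret_key.encode(), payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(sig, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)
```

The point of verifying before `pickle.loads` is that unpickling untrusted bytes can execute arbitrary code, so a file that was not produced with the current secret key is never deserialized.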
Trenton H
33c41dd2e7 Merge remote-tracking branch 'origin/dev' into feature-archive-ocr-decoupling 2026-04-02 15:27:08 -07:00
Trenton H
376af81b9c Fix: Resolve another TC assuming an object has been created somewhere (#12503) 2026-04-02 14:58:28 -07:00
GitHub Actions
05c9e21fac Auto translate strings 2026-04-02 19:40:05 +00:00
Trenton H
aed9abe48c Feature: Replace Whoosh with tantivy search backend (#12471)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
2026-04-02 12:38:22 -07:00
GitHub Actions
2aa0c9f0b4 Auto translate strings 2026-03-31 18:25:03 +00:00
shamoon
d2328b776a Performance: support bulk edit without id lists (#12355) 2026-03-31 18:23:28 +00:00
GitHub Actions
e1da2a1efe Auto translate strings 2026-03-31 14:57:34 +00:00
shamoon
245514ad10 Performance: deprecate and remove usage of all in API results (#12309) 2026-03-31 07:55:59 -07:00
GitHub Actions
020057e1a4 Auto translate strings 2026-03-30 16:40:47 +00:00
shamoon
f715533770 Performance: support passing selection data with filtered document requests (#12300) 2026-03-30 16:38:52 +00:00
Jan Kleine
0292edbee7 Fixhancement: include trashed documents in document exporter/importer (#12425) 2026-03-30 16:30:22 +00:00
Andreas Schneider
85e0d1842a Tests: add regression test for redis URL with empty username (#12460)
* Tests: add regression test for redis URL with empty username and password

Covers the unix://:SECRET@/path.sock format (empty username, password only),
which was missing from the existing test cases for PR #12239.

* Update src/paperless/tests/settings/test_custom_parsers.py

---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-29 06:31:18 -07:00
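
The URL shape covered by the regression test in 85e0d1842a (empty username, password only) can be inspected with the standard library. This sketch uses `urllib.parse.urlsplit` purely for illustration; it is not the project's custom parser, and `split_credentials` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def split_credentials(url: str):
    """Return (username, password, path) as the stdlib sees them."""
    parts = urlsplit(url)
    return parts.username, parts.password, parts.path


# The format named in the commit message: nothing before the colon,
# a password after it, and a socket path instead of a host.
username, password, path = split_credentials("unix://:SECRET@/path.sock")
```

Here the username comes back as an empty string rather than `None`, which is the kind of edge case a parser has to treat as "no username" without discarding the password.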
GitHub Actions
62f79c088e Auto translate strings 2026-03-28 21:00:05 +00:00
shamoon
129da3ade7 Tweakhancement: show file extension in StoragePath test (#12452) 2026-03-28 13:58:33 -07:00
Trenton H
5cbbe0be89 Improvements for typing purposes mostly + some reuse 2026-03-28 13:21:52 -07:00
Trenton H
d5248838ca Whoops, the tagged PDF check catches our fixture sample files, which broke these 2026-03-27 13:47:52 -07:00
Trenton H
6eb6e352da Adds a tagged PDF check as well, for an even better decision to skip OCR in auto mode 2026-03-27 08:45:34 -07:00
Trenton H
d89a86643d Merge branch 'dev' into feature-archive-ocr-decoupling 2026-03-27 08:35:25 -07:00
Trenton H
68322376f2 test: use pytest-django settings fixture and pytest.param in new tests
- TestShouldProduceArchive: replace @override_settings decorators with
  settings fixture; consolidate 10 individual tests into 2 parametrized
  tests (test_generation_setting, test_auto_pdf_archive_decision)
- TestDeprecatedV2OcrEnvVarWarnings: call check_deprecated_v2_ocr_env_vars()
  directly instead of django_checks.run_checks(); use mocker.patch.dict for
  env isolation; consolidate warn cases into one parametrized test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
Trenton H
2729b0d3dc refactor: consolidate pdftotext utility and archive-decision logic
- Add extract_pdf_text() and PDF_TEXT_MIN_LENGTH to paperless/parsers/utils.py,
  eliminating duplicate pdftotext call sites in consumer.py and tesseract.py
- Rename _should_produce_archive → should_produce_archive (now public, imported
  by both consumer.py and tasks.py)
- update_document_content_maybe_archive_file now calls should_produce_archive,
  honouring ARCHIVE_FILE_GENERATION the same as the consumer pipeline
- Fallback OCR path sets archive_path when produce_archive=True; update
  test_with_form_redo_produces_no_archive to use produce_archive=False

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
Trenton H
07c0ed5e26 feat!: add deprecated v2 OCR env var warnings to system checks 2026-03-27 07:51:24 -07:00
Trenton H
d684452588 feat: compute produce_archive from ARCHIVE_FILE_GENERATION, pass to parser
Add _extract_text_for_archive_check() and _should_produce_archive() helper
functions to documents/consumer.py. These compute whether the parser should
produce a PDF/A archive based on the ARCHIVE_FILE_GENERATION setting (always/
never/auto), parser capabilities (can_produce_archive, requires_pdf_rendition),
MIME type, and pdftotext-based born-digital detection for auto mode.

Update the parse() call site to compute and pass produce_archive=... kwarg.
Add 10 unit tests in test_consumer_archive.py; update two existing consumer
tests that asserted run_subprocess call counts now that pdftotext is invoked
during auto-mode archive detection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
Trenton H
e00658375b chore: remove dead archive_file_generation assignments from tests 2026-03-27 07:51:24 -07:00
Trenton H
a0cf673f1b feat!: restructure parse() for OCR_MODE=auto/off and produce_archive flag
Implement the new decoupled archive/OCR control in RasterisedDocumentParser:

- construct_ocrmypdf_parameters(): add skip_text parameter; fix AUTO mode
  dispatch so skip_text is only added when explicitly requested (text-present
  + produce_archive case) rather than unconditionally; add OFF mode support.

- parse(): remove archive_file_generation checks; control archive creation
  exclusively via the produce_archive bool passed by the consumer.
  - OFF + no archive: return pdftotext text, skip OCRmyPDF entirely.
  - OFF + image + archive: use new _convert_image_to_pdfa() helper.
  - OFF + PDF + archive: run OCRmyPDF with skip_text=True (PDF/A only).
  - AUTO + text + no archive: skip OCRmyPDF entirely (fast path).
  - AUTO + text + archive: run OCRmyPDF with skip_text=True.
  - AUTO + no text: run normal OCR regardless of produce_archive.
  - FORCE/REDO: always run OCRmyPDF; set archive_path only when produce_archive.

- Add _convert_image_to_pdfa(): img2pdf wrapping + pikepdf PDF/A-2b stamping
  without invoking Tesseract or Ghostscript.

- Add PriorOcrFoundError to the fallback exception list (same treatment as
  InputFileError: retry with force_ocr).

- Update existing tests to use produce_archive instead of archive_file_generation:
  TestSkipArchive rewritten; RTL test uses mode=off to preserve Arabic text
  layer; AUTO mode tests clarified.

- Add test_parse_modes.py: 11 focused unit tests with mocked ocrmypdf.ocr
  verifying control flow for all mode/produce_archive combinations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
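
The mode/produce_archive matrix listed in a0cf673f1b can be written down as a small lookup function. This is a sketch of the control flow as the commit message states it, not the parser's code: the function name `plan_parse` and the returned step labels are hypothetical, and mode values are assumed to be the lowercase strings "auto", "off", "force", and "redo".

```python
def plan_parse(mode: str, produce_archive: bool, is_image: bool, has_text: bool) -> str:
    """Return a label for the processing step the commit message describes."""
    if mode == "off":
        if not produce_archive:
            return "pdftotext_only"  # skip OCRmyPDF entirely
        if is_image:
            return "convert_image_to_pdfa"  # img2pdf + PDF/A stamping, no OCR
        return "ocrmypdf_skip_text"  # PDF/A conversion only
    if mode == "auto":
        if has_text:
            if produce_archive:
                return "ocrmypdf_skip_text"
            return "pdftotext_only"  # fast path
        return "ocrmypdf_normal"  # no text: OCR regardless of produce_archive
    if mode in ("force", "redo"):
        return "ocrmypdf_" + mode  # always run OCRmyPDF
    raise ValueError(f"unknown OCR mode: {mode!r}")
```

Per the commit message, the FORCE and REDO rows always run OCRmyPDF and differ only in whether `archive_path` is set afterwards, which depends on `produce_archive`; the sketch leaves that bookkeeping out.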
Trenton H
300432ae05 feat!: drop skip_archive_file field, add archive_file_generation to ApplicationConfiguration
Replace the old skip_archive_file DB field with the correctly-named
archive_file_generation field on ApplicationConfiguration. Remove the
temporary getattr fallback in OcrConfig now that the migration exists.
Update all test fixtures and API response assertions to use the new field name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
Trenton H
6ba1b726be feat!: update OcrConfig to read archive_file_generation from DB field
Switches OcrConfig.__post_init__ from reading the old skip_archive_file
attribute to the new archive_file_generation attribute, with a getattr
fallback to skip_archive_file for compatibility until Task 4 renames
the DB model field. Updates null_app_config fixtures in both the parser
conftest and the new test_ocr_config.py to explicitly set both attributes
to None so MagicMock doesn't return truthy auto-generated attributes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
Trenton H
38d2abb982 feat!: rename OCR_SKIP_ARCHIVE_FILE to ARCHIVE_FILE_GENERATION
Rename the Django setting OCR_SKIP_ARCHIVE_FILE to ARCHIVE_FILE_GENERATION
and the env var PAPERLESS_OCR_SKIP_ARCHIVE_FILE to PAPERLESS_ARCHIVE_FILE_GENERATION.
Rename the OcrConfig attribute skip_archive_file to archive_file_generation.
Update checks.py error messages and all tests accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
Trenton H
cd653959d6 feat!: replace ModeChoices and ArchiveFileChoices with new v3 enums
- Replace ModeChoices (SKIP/SKIP_NO_ARCHIVE/REDO/FORCE) with new values:
  AUTO, FORCE, REDO, OFF
- Remove ArchiveFileChoices entirely; add ArchiveFileGenerationChoices
  with AUTO, ALWAYS, NEVER values
- Update checks.py valid sets and default settings to use new enum values
- Update tesseract parser to use new enum comparisons; AUTO mode maps to
  skip_text behavior; FORCE/REDO bypass archive-skip early-exit
- Update all affected tests to use new valid mode/archive string values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 12:50:43 -07:00
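
The two enums introduced in cd653959d6, together with the "warn and default to auto" behaviour added later in 7f01b3a6f9, might look roughly like the sketch below. The member names come from the commit message. The lowercase string values, the use of the standard-library `Enum` rather than a Django choices class, and the helper `coerce_mode` are assumptions.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class ModeChoices(str, Enum):
    AUTO = "auto"
    FORCE = "force"
    REDO = "redo"
    OFF = "off"


class ArchiveFileGenerationChoices(str, Enum):
    AUTO = "auto"
    ALWAYS = "always"
    NEVER = "never"


def coerce_mode(raw: str) -> ModeChoices:
    """Hypothetical helper: map a raw setting to a mode, defaulting to AUTO."""
    try:
        return ModeChoices(raw.strip().lower())
    except ValueError:
        # Old values such as "skip" are no longer members of the enum.
        logger.warning("Invalid OCR mode %r, defaulting to auto", raw)
        return ModeChoices.AUTO
```

With this sketch, `coerce_mode("off")` returns `ModeChoices.OFF`, while a removed value such as `"skip_noarchive"` logs a warning and falls back to `ModeChoices.AUTO`.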
Trenton H
9383471fa0 Feature: Transition all checksums to use SHA256 (#12432) 2026-03-26 11:28:02 -07:00
GitHub Actions
b153ec803b Auto translate strings 2026-03-26 14:38:10 +00:00
shamoon
ae0474450f Chore: logger, response and template sanitization cleanup (#12439) 2026-03-26 07:36:02 -07:00
Trenton H
8efb01010c fix: Don't silently drop the change_groups and switch to a couple slightly more efficient implementations (#12431) 2026-03-26 14:15:42 +00:00