Commit Graph

359 Commits

Author SHA1 Message Date
Trenton H 5202dc0748 Fix: Clear ContentType/guardian caches at import and test cases (#12758) 2026-05-08 20:48:47 +00:00
Trenton H e822e72964 Feature: Further reduce document importer memory usage (#12707)
* Replaces loaddata with streaming bulk_create

Replaces call_command('loaddata') with a streaming implementation that
reads manifest records one at a time via ijson, accumulates per-model
batches up to --batch-size, and flushes via bulk_create.  This reduces
peak memory and no longer scales directly with the size of the import.

* fix(importer): avoid guardian lru_cache poisoning; include M2M through tables in check_constraints

clear_cache() inside the import transaction emptied Django's ContentType
manager cache while fixture PKs were live, causing downstream ContentType
lookups to repopulate guardian's separate @lru_cache(None) with
fixture-PK objects. After the TestCase transaction rolled back to
original PKs, guardian's lru_cache held stale fixture ContentType
objects, causing MixedContentTypeError in unrelated subsequent tests.

Remove clear_cache() since it was defending against a theoretical
stale-cache scenario that doesn't occur in a proper same-install restore.

Fix check_constraints() to explicitly include auto-created M2M through
tables (populated by .set() after bulk_create) alongside the model tables,
addressing the gap where join-table FK violations would have gone
undetected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Excludes the consumer and AnonymousUser from any models which might have a FK relation to it.  This prevents orphan things like UI setting, which have a relation to no existing user

* Splits into more sub functions for Sonar

* Improvements to the typing of the new functions

* Coverage for some error cases, and removes handling for pk only models.  No need to support these

* Final coverage gaps

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 16:36:05 -07:00
Trenton H 14fe520319 Chore: Update typing and baselines again (#12641)
a
2026-04-28 09:28:05 -07:00
Trenton H fbf4e32646 Chore: Converts all call sites and test asserts to use apply_async and headers (#12591) 2026-04-20 11:40:04 -07:00
Trenton H 8e67828bd7 Feature: Redesign the task system (#12584)
* feat(tasks): replace PaperlessTask model with structured redesign

Drop the old string-based PaperlessTask table and recreate it with
Status/TaskType/TriggerSource enums, JSONField result storage, and
duration tracking fields. Update all call sites to use the new API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): rewrite signal handlers to track all task types

Replace the old consume_file-only handler with a full rewrite that tracks
6 task types (consume_file, train_classifier, sanity_check, index_optimize,
llm_index, mail_fetch) with proper trigger source detection, input data
extraction, legacy result string parsing, duration/wait time recording,
and structured error capture on failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): add traceback and revoked state coverage to signal tests

* refactor(tasks): remove manual PaperlessTask creation and scheduled/auto params

All task records are now created exclusively via Celery signals (Task 2).
Removed PaperlessTask creation/update from train_classifier, sanity_check,
llmindex_index, and check_sanity. Removed scheduled= and auto= parameters
from all 7 call sites. Updated apply_async callers to use trigger_source
headers instead. Exceptions now propagate naturally from task functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): auto-inject trigger_source=scheduled header for all beat tasks

Inject `headers: {"trigger_source": "scheduled"}` into every Celery beat
schedule entry so signal handlers can identify scheduler-originated tasks
without per-task instrumentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): update serializer, filter, and viewset with v9 backwards compat

- Replace TasksViewSerializer/RunTaskViewSerializer with TaskSerializerV10
  (new field names), TaskSerializerV9 (v9 compat), TaskSummarySerializer,
  and RunTaskSerializer
- Add AcknowledgeTasksViewSerializer unchanged (kept existing validation)
- Expand PaperlessTaskFilterSet with MultipleChoiceFilter for task_type,
  trigger_source, status; add is_complete, date_created_after/before filters
- Replace TasksViewSet.get_serializer_class() to branch on request.version
- Add get_queryset() v9 compat for task_name/type query params
- Add acknowledge_all, summary, active actions to TasksViewSet
- Rewrite run action to use apply_async with trigger_source header
- Add timedelta import to views.py; add MultipleChoiceFilter/DateTimeFilter
  to filters.py imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): add read_only_fields to TaskSerializerV9, enforce admin via permission_classes on run action

* test(tasks): rewrite API task tests for redesigned model and v9 compat

Replaces the old Django TestCase-based tests with pytest-style classes using
PaperlessTaskFactory. Covers v10 field names, v9 backwards-compat field
mapping, filtering, ordering, acknowledge, acknowledge_all, summary, active,
and run endpoints. Also adds PaperlessTaskFactory to factories.py and fixes
a redundant source= kwarg in TaskSerializerV10.related_document_ids.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): fix two spec gaps in task API test suite

Move test_list_is_owner_aware to TestGetTasksV10 (it tests GET /api/tasks/,
not acknowledge). Add test_related_document_ids_includes_duplicate_of to
cover the duplicate_of path in the related_document_ids property.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): address code quality review findings

Remove trivial field-existence tests per project conventions. Fix
potentially flaky ordering test to use explicit date_created values.
Add is_complete=false filter test, v9 type filter input direction test,
and tighten TestActive second test to target REVOKED specifically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): update TaskAdmin for redesigned model

Add date_created, duration_seconds to list_display; add trigger_source
to list_filter; add input_data, duration_seconds, wait_time_seconds to
readonly_fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): update Angular types and service for task redesign

Replace PaperlessTaskName/PaperlessTaskType/PaperlessTaskStatus enums
with new PaperlessTaskType, PaperlessTaskTriggerSource, PaperlessTaskStatus
enums. Update PaperlessTask interface to new field names (task_type,
trigger_source, input_data, result_message, related_document_ids).
Update TasksService to filter by task_type instead of task_name.
Update tasks component and system-status-dialog to use new field names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(tasks): remove django-celery-results

PaperlessTask now tracks all task results via Celery signals. The
django-celery-results DB backend was write-only -- nothing reads
from it. Drop the package and add a migration to clean up the
orphaned tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: fix remaining tests broken by task system redesign

Update all tests that created PaperlessTask objects with old field names
to use PaperlessTaskFactory and new field names (task_type, trigger_source,
status, result_message). Use apply_async instead of delay where mocked.
Drop TestCheckSanityTaskRecording — tests PaperlessTask creation that was
intentionally removed from check_sanity().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): improve test_api_tasks.py structure and add api marker

- Move admin_client, v9_client, user_client fixtures to conftest.py so
  they can be reused by other API tests; all three now build on the
  rest_api_client fixture instead of creating APIClient() directly
- Move regular_user fixture to conftest.py (was already done, now also
  used by the new client fixtures)
- Add docstrings to every test method describing the behaviour under test
- Move timedelta/timezone imports to module level
- Register 'api' pytest marker in pyproject.toml and apply pytestmark to
  the entire file so all 40 tests are selectable via -m api

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(tasks): simplify task tracking code after redesign

- Extract COMPLETE_STATUSES as a class constant on PaperlessTask,
  eliminating the repeated status tuple across models.py, views.py (3×),
  and filters.py
- Extract _CELERY_STATE_TO_STATUS as a module-level constant instead of
  rebuilding the dict on every task_postrun
- Extract _V9_TYPE_TO_TRIGGER_SOURCE and _RUNNABLE_TASKS as class
  constants on TasksViewSet instead of rebuilding on every request
- Extract _TRIGGER_SOURCE_TO_V9_TYPE as a class constant on
  TaskSerializerV9 instead of rebuilding per serialized object
- Extract _get_consume_args helper to deduplicate identical arg
  extraction logic in _extract_input_data, _determine_trigger_source,
  and _extract_owner_id
- Move inline imports (re, traceback) and Avg to module level
- Fix _DOCUMENT_SOURCE_TO_TRIGGER type annotation key type to
  DocumentSource instead of Any
- Remove redundant truthiness checks in SystemStatusView branches
  already guarded by an is-None check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(tasks): add docstrings and rename _parse_legacy_result

- Add docstrings to _extract_input_data, _determine_trigger_source,
  _extract_owner_id explaining what each helper does and why
- Rename _parse_legacy_result -> _parse_consume_result: the function
  parses current consume_file string outputs (consumer.py returns
  "New document id N created" and "It is a duplicate of X (#N)"),
  not legacy data; the old name was misleading

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): extend and harden the task system redesign

- TaskType: add EMPTY_TRASH, CHECK_WORKFLOWS, CLEANUP_SHARE_LINKS;
  remove INDEX_REBUILD (no backing task — beat schedule uses index_optimize)
- TRACKED_TASKS: wire up all nine task types including the three new ones
  and llmindex_index / process_mail_accounts
- Add task_revoked_handler so cancelled/expired tasks are marked REVOKED
- Fix double-write: task_postrun_handler no longer overwrites result_data
  when status is already FAILURE (task_failure_handler owns that write)
- v9 serialiser: map EMAIL_CONSUME and FOLDER_CONSUME to AUTO_TASK
- views: scope task list to owner for regular users, admins see all;
  validate ?days= query param and return 400 on bad input
- tests: add test_list_admin_sees_all_tasks; rename/fix
  test_parses_duplicate_string (duplicates produce SUCCESS, not FAILURE);
  use PaperlessTaskFactory in modified tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): fix MAIL_FETCH null input_data and postrun double-query

- _extract_input_data: return {} instead of {"account_ids": None} when
  process_mail_accounts is called without an explicit account list (the
  normal beat-scheduled path); add test to cover this path
- task_postrun_handler: replace filter().first() + filter().update() with
  get() + save(update_fields=[...]) — single fetch, single write,
  consistent with task_prerun_handler

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): add queryset stub to satisfy drf-spectacular schema generation

TasksViewSet.get_queryset() accesses request.user, which drf-spectacular
cannot provide during static schema generation.  Adding a class-level
queryset = PaperlessTask.objects.none() gives spectacular a model to
introspect without invoking get_queryset(), eliminating both warnings
and the test_valid_schema failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): fill coverage gaps in task system

- test_task_signals: add TestTaskRevokedHandler (marks REVOKED, ignores
  None request, ignores unknown id); switch existing direct
  PaperlessTask.objects.create calls to PaperlessTaskFactory; import
  pytest_mock and use MockerFixture typing on mocker params
- test_api_tasks: add test_rejects_invalid_days_param to TestSummary
- tasks.service.spec: add dismissAllTasks test (POST acknowledge_all +
  reload)
- models: add pragma: no cover to __str__, is_complete, and
  related_document_ids (trivial delegates, covered indirectly)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Well, that was a bad push.

* Fixes v9 API compatability with testing coverage

* fix(tasks): restore INDEX_OPTIMIZE enum and remove no-op run button

INDEX_OPTIMIZE was dropped from the TaskType enum but still referenced
in _RUNNABLE_TASKS (views.py) and the frontend system-status-dialog,
causing an AttributeError at import time. Restore the enum value in the
model and migration so the serializer accepts it, but remove it from
_RUNNABLE_TASKS since index_optimize is a Tantivy no-op. Remove the
frontend "Run Task" button for index optimization accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(tasks): v9 type filter now matches all equivalent trigger sources

The v9 ?type= query param mapped each value to a single TriggerSource,
but the serializer maps multiple sources to the same v9 type value.
A task serialized as "auto_task" would not appear when filtering by
?type=auto_task if its trigger_source was email_consume or
folder_consume. Same issue for "manual_task" missing web_ui and
api_upload sources. Changed to trigger_source__in with the full set
of sources for each v9 type value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(tasks): give task_failure_handler full ownership of FAILURE path

task_postrun_handler now early-returns for FAILURE states instead of
redundantly writing status and date_done. task_failure_handler now
computes duration_seconds and wait_time_seconds so failed tasks get
complete timing data. This eliminates a wasted .get() + .save() round
trip on every failed task and gives each handler a clean, non-overlapping
responsibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(tasks): resolve trigger_source header via TriggerSource enum lookup

Replace two hardcoded string comparisons ("scheduled", "system") with a
single TriggerSource(header_source) lookup so the enum values are the
single source of truth. Any valid TriggerSource DB value passed in the
header is accepted; invalid values fall through to the document-source /
MANUAL logic. Update tests to pass enum values in headers rather than raw
strings, and add a test for the invalid-header fallback path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): use TriggerSource enum values at all apply_async call sites

Replace raw strings ("system", "manual") with PaperlessTask.TriggerSource
enum values in the three callers that can import models. The settings
file remains a raw string (models cannot be imported at settings load
time) with a comment pointing to the enum value it must match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): parametrize repetitive test cases in task test files

test_api_tasks.py:
- Collapse six trigger_source->v9-type tests into one parametrized test,
  adding the previously untested API_UPLOAD case
- Collapse three task_name mapping tests (two remaps + pass-through)
  into one parametrized test
- Collapse two acknowledge_all status tests into one parametrized test
- Collapse two run-endpoint 400 tests into one parametrized test
- Update run/ assertions to use TriggerSource enum values

test_task_signals.py:
- Collapse three trigger_source header tests into one parametrized test
- Collapse two DocumentSource->TriggerSource mapping tests into one
  parametrized test
- Collapse two prerun ignore-invalid-id tests into one parametrized test

All parametrize cases use pytest.param with descriptive ids.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Handle JSON serialization for datetime and Path.  Further restrist the v9 permissions as Copilot suggests

* That should fix the generated schema/browser

* Use XSerializer for the schema

* A few more basic cases I see no value in covering

* Drops the migration related stuff too.  Just in case we want it again or it confuses people

* fix: annotate tasks_summary_retrieve as array of TaskSummarySerializer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: annotate tasks_active_retrieve as array of TaskSerializerV10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Restore task running to superuser only

* Removes the acknowledge/dismiss all stuff

* Aligns v10 and v9 task permissions with each other

* Short blurb just to warn users about the tasks being cleared

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 09:28:41 -07:00
Trenton H 2fd1a1cf3a Feature: Document fuzzy match improvements (#12579)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:59:43 +00:00
shamoon b807b107ad Enhancement: include sharelinks + bundles in export/import (#12479) 2026-04-03 21:51:57 +00:00
Trenton H aed9abe48c Feature: Replace Whoosh with tantivy search backend (#12471)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Antoine Mérino <3023499+Merinorus@users.noreply.github.com>
2026-04-02 12:38:22 -07:00
Jan Kleine 0292edbee7 Fixhancement: include trashed documents in document exporter/importer (#12425) 2026-03-30 16:30:22 +00:00
Trenton H 9383471fa0 Feature: Transition all checksums to use SHA256 (#12432) 2026-03-26 11:28:02 -07:00
Trenton H 701735f6e5 Chore: Drop old signal and unneeded apps, transition to parser registry instead (#12405)
* refactor: switch consumer and callers to ParserRegistry (Phase 4)

Replace all Django signal-based parser discovery with direct registry
calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all
old-style isinstance checks. All parser instantiation now uses the
`with parser_class() as parser:` context manager pattern.

- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: drop get_parser_class_for_mime_type; callers use registry directly

All callers now call get_parser_registry().get_parser_for_file() with
the actual filename and path, enabling score() to use file extension
hints. The MIME-only helper is removed.

- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior);
  TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove document_consumer_declaration signal infrastructure

Remove the document_consumer_declaration signal that was previously used
for parser registration. Each parser app no longer connects to this signal,
and the signal declaration itself has been removed from documents/signals.

Changes:
- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
  - paperless_tesseract/signals.py
  - paperless_text/signals.py
  - paperless_tika/signals.py
  - paperless_mail/signals.py
  - paperless_remote/signals.py

Parser discovery now happens exclusively through the ParserRegistry
system introduced in the previous refactor phases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove empty paperless_text and paperless_tika Django apps

After parser classes were moved to paperless/parsers/ in the plugin
refactor, these Django apps contained only empty AppConfig classes
with no models, views, tasks, migrations, or other functionality.

- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files

paperless_remote app is retained as it contains meaningful system
checks for Azure AI configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Moves the checks and tests to the main application and removes the old applications

* Adds a comment to satisy Sonar

* refactor: remove automatic log_summary() call from get_parser_registry()

The summary was logged once per process, causing it to appear repeatedly
during Docker startup (management commands, web server, each Celery
worker subprocess). External parsers are already announced individually
at INFO when discovered; the full summary is redundant noise.
log_summary() is retained on ParserRegistry for manual/debug use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Cleans up the duplicate test file/fixture

* Fixes a race condition where webserver threads could race to populate the registry

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22 06:53:32 -07:00
Trenton H a9756f9462 Chore: Convert Tesseract parser to plugin style (#12403)
* Move tesseract parser, tests, and samples to paperless.parsers

Relocates files in preparation for the Phase 3 Protocol-based parser
refactor, preserving full git history via rename.

- src/paperless_tesseract/parsers.py -> src/paperless/parsers/tesseract.py
- src/paperless_tesseract/tests/test_parser.py -> src/paperless/tests/parsers/test_tesseract_parser.py
- src/paperless_tesseract/tests/test_parser_custom_settings.py -> src/paperless/tests/parsers/test_tesseract_custom_settings.py
- src/paperless_tesseract/tests/samples/* -> src/paperless/tests/samples/tesseract/
- Moves RUF001 suppression from broad per-file pyproject.toml ignore to inline noqa comments on the two affected lines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor RasterisedDocumentParser to ParserProtocol interface

- Add RasterisedDocumentParser to registry.register_defaults()
- Update parser class: remove DocumentParser inheritance, add Protocol
  class attrs/classmethods/properties, context-manager lifecycle
- Add read_file_handle_unicode_errors() to shared parsers/utils.py
- Replace inline unicode-error-handling with shared utility call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update tesseract signals.py to import from new parser location

RasterisedDocumentParser moved to paperless.parsers.tesseract; update
the lazy import in signals.get_parser so the signal-based consumer
declaration continues to work during the registry transition. Pop
logging_group and progress_callback kwargs for constructor compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests: rewrite test_tesseract_parser to pytest style with typed fixtures

- Converts all tests from Django TestCase to pytest-style classes
- Adds tesseract_samples_dir, null_app_config, tesseract_parser, and
  make_tesseract_parser fixtures in conftest.py; all DB-free except
  TestOcrmypdfParameters which uses @pytest.mark.django_db
- Defines MakeTesseractParser type alias in conftest.py for autocomplete
- Fixes FBT001 (boolean positional args) by making bool params
  keyword-only with * separator in parametrize test signatures
- Adds type annotations to all fixture parameters for IDE support
- Uses pytest.param(..., id="...") throughout; pytest-mock for patching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(types): fully annotate paperless/parsers/tesseract.py

Fixes all mypy and pyrefly errors in the new parser file:

- Add missing type annotations to is_image, has_alpha, get_dpi,
  calculate_a4_dpi, construct_ocrmypdf_parameters, post_process_text
- Narrow Path-only (no str) for image helper args; convert to str when
  building list[str] args for run_subprocess
- Annotate ocrmypdf_args as dict[str, Any] so operator expressions on
  its values type-check and ocrmypdf.ocr(**args) resolves cleanly
- Declare text: str | None = None at top of extract_text to unify
  all assignments to the same type across both branches
- Import Any from typing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fixes isort

* fix: add RasterisedDocumentParser to new-style parser shim checks

The new RasterisedDocumentParser uses __enter__/__exit__ for resource
management instead of cleanup(). Update all existing new-style shims to
include it in the isinstance checks:

- documents/consumer.py: _parser_cleanup(), parser_is_new_style
- documents/tasks.py: parser_is_new_style, finally cleanup branch
  (also adds RemoteDocumentParser which was missing from the latter)
- documents/management/commands/document_thumbnails.py: adds new-style
  handling from scratch (enter/exit + 2-arg get_thumbnail signature)

Fix stale import paths in three test files that were still importing
from paperless_tesseract.parsers instead of paperless.parsers.tesseract.

Fix two registry tests that used application/pdf as a proxy for "no
handler" — now that RasterisedDocumentParser is registered, PDF always
has a handler, so switch to a truly unsupported MIME type.

Signal infrastructure and shims remain intact; this is plumbing only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* One missed import (cherry pick?)

* Adds a no cover for a special case of handling unicode errors in PDF metadata

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 12:46:07 -07:00
Trenton H 1caa3eb8aa Chore: Disables the system checks for management commands in tests and when unnecessary (#12332) 2026-03-16 15:10:35 +00:00
Trenton H 9d69705e26 Feature: Add progress information to the classifier training for a better ux (#12331) 2026-03-14 19:53:52 +00:00
Trenton H d86cfdb088 Feature: Initial document parser plugin framework (#12294) 2026-03-12 21:53:17 +00:00
Trenton H 63cb75564e Chore: Remove some further old items (encryption passphrase and PNG handling) (#12290) 2026-03-09 22:04:51 +00:00
Trenton H bcc2f11152 Performance: Stream JSON during import for memory improvements (#12276)
* Perf: stream manifest parsing with ijson in document_importer

Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.

- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
  temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
  fraction of manifest) for the tqdm progress bar

Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Perf: slim dict in _import_files_from_manifest, discard fields

When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).

Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
2026-03-09 10:20:48 -07:00
Trenton H e30676f889 Feature: Migrate import/export to rich progress (#12260)
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()

Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses

Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 08:59:17 -07:00
Trenton H 2cdb1424ef Performance: Further export memory improvements (#12273)
* Perf: streaming manifest writer for document exporter (Phase 3)

Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.

Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
  compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
  bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
  accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml

Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption

Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 14:24:50 -08:00
Trenton H a9cb89c633 Enhancement: Improve exporter memory efficiency (#12236)
Phase 1 -- Eliminate JSON round-trip in document exporter

Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.

Phase 2 -- Batched QuerySet serialization in document exporter

Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
2026-03-04 14:54:20 -08:00
Trenton H 43406f44f2 Feature: Improve the retagger output using rich (#12194) 2026-03-03 07:14:59 -08:00
Trenton H e58a35d40c Feature: Transition sanity check to rich and improve output (#12182) 2026-03-02 10:53:39 -08:00
Trenton H 20a9cd40e8 Feature: Switch all indexing to use rich (#12193) 2026-03-02 10:41:09 -08:00
Trenton H c30ee1ec03 Feature: Switch progress bar library to rich (#12169) 2026-02-26 12:20:21 -08:00
Sebastian Steinbeißer 3b5ffbf9fa Chore(mypy): Annotate None returns for typing improvements (#11213) 2026-02-02 08:44:12 -08:00
Trenton H 6859e7e3c2 Chore: Resolve more flaky tests (#11920) 2026-01-28 16:13:27 +00:00
Trenton H d0032c18be Breaking: Remove support for document and thumbnail encryption (#11850) 2026-01-24 19:29:54 -08:00
Trenton H 51b466a86b Feature: Simplify and improve the consumer (#11753) 2026-01-21 14:37:48 -08:00
shamoon e940764fe0 Feature: Paperless AI (#10319) 2026-01-13 16:24:42 +00:00
shamoon b3d6359afc Chore: set signal receivers with weak=False 2025-11-17 10:02:32 -08:00
shamoon 6119c215e7 Fix: skip fuzzy matching for empty document content (#10914) 2025-09-22 23:30:24 -07:00
shamoon 0e35acaef5 Fix: add extra error handling to _consume for file checks (#10897) 2025-09-21 13:21:40 -07:00
Sebastian Steinbeißer d2064a2535 Chore: switch from os.path to pathlib.Path (#10539) 2025-09-03 08:12:41 -07:00
dependabot[bot] edb8c06e2a Chore(deps): Bump the django group across 1 directory with 9 updates (#10538)
* Chore(deps): Bump the django group across 1 directory with 9 updates

Bumps the django group with 9 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [django](https://github.com/django/django) | `5.1.8` | `5.2.5` |
| [django-auditlog](https://github.com/jazzband/django-auditlog) | `3.1.2` | `3.2.1` |
| [django-guardian](https://github.com/django-guardian/django-guardian) | `2.4.0` | `3.0.3` |
| [django-multiselectfield](https://github.com/goinnn/django-multiselectfield) | `0.1.13` | `1.0.1` |
| [django-soft-delete](https://github.com/san4ezy/django_softdelete) | `1.0.18` | `1.0.19` |
| [djangorestframework](https://github.com/encode/django-rest-framework) | `3.16.0` | `3.16.1` |
| [djangorestframework-guardian](https://github.com/rpkilby/django-rest-framework-guardian) | `0.3.0` | `0.4.0` |
| [drf-spectacular-sidecar](https://github.com/tfranzel/drf-spectacular-sidecar) | `2025.4.1` | `2025.8.1` |
| [pytest-django](https://github.com/pytest-dev/pytest-django) | `4.10.0` | `4.11.1` |



Updates `django` from 5.1.8 to 5.2.5
- [Commits](https://github.com/django/django/compare/5.1.8...5.2.5)

Updates `django-auditlog` from 3.1.2 to 3.2.1
- [Release notes](https://github.com/jazzband/django-auditlog/releases)
- [Changelog](https://github.com/jazzband/django-auditlog/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jazzband/django-auditlog/compare/v3.1.2...v3.2.1)

Updates `django-guardian` from 2.4.0 to 3.0.3
- [Release notes](https://github.com/django-guardian/django-guardian/releases)
- [Commits](https://github.com/django-guardian/django-guardian/compare/v2.4.0...3.0.3)

Updates `django-multiselectfield` from 0.1.13 to 1.0.1
- [Release notes](https://github.com/goinnn/django-multiselectfield/releases)
- [Changelog](https://github.com/goinnn/django-multiselectfield/blob/master/CHANGES.rst)
- [Commits](https://github.com/goinnn/django-multiselectfield/compare/v0.1.13...v1.0.1)

Updates `django-soft-delete` from 1.0.18 to 1.0.19
- [Changelog](https://github.com/san4ezy/django_softdelete/blob/master/CHANGELOG.md)
- [Commits](https://github.com/san4ezy/django_softdelete/commits)

Updates `djangorestframework` from 3.16.0 to 3.16.1
- [Release notes](https://github.com/encode/django-rest-framework/releases)
- [Commits](https://github.com/encode/django-rest-framework/compare/3.16.0...3.16.1)

Updates `djangorestframework-guardian` from 0.3.0 to 0.4.0
- [Changelog](https://github.com/rpkilby/django-rest-framework-guardian/blob/master/CHANGELOG)
- [Commits](https://github.com/rpkilby/django-rest-framework-guardian/compare/0.3.0...0.4.0)

Updates `drf-spectacular-sidecar` from 2025.4.1 to 2025.8.1
- [Commits](https://github.com/tfranzel/drf-spectacular-sidecar/compare/2025.4.1...2025.8.1)

Updates `pytest-django` from 4.10.0 to 4.11.1
- [Release notes](https://github.com/pytest-dev/pytest-django/releases)
- [Changelog](https://github.com/pytest-dev/pytest-django/blob/main/docs/changelog.rst)
- [Commits](https://github.com/pytest-dev/pytest-django/compare/v4.10.0...v4.11.1)

---
updated-dependencies:
- dependency-name: django
  dependency-version: 5.2.5
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: django
- dependency-name: django-auditlog
  dependency-version: 3.2.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: django
- dependency-name: django-guardian
  dependency-version: 3.0.3
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: django
- dependency-name: django-multiselectfield
  dependency-version: 1.0.1
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: django
- dependency-name: django-soft-delete
  dependency-version: 1.0.19
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: django
- dependency-name: djangorestframework
  dependency-version: 3.16.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: django
- dependency-name: djangorestframework-guardian
  dependency-version: 0.4.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: django
- dependency-name: drf-spectacular-sidecar
  dependency-version: 2025.8.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: django
- dependency-name: pytest-django
  dependency-version: 4.11.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: django
...

Signed-off-by: dependabot[bot] <support@github.com>

* Fix log matches related to newlines, add newlines to stdout.writelines

* Fix disable api remote auth test, Django 5.2 no longer uses process_request

* Remove postgres version check

* Update administration.md

* Handle django-multiselectfield v1.0 changes

* Update administration.md

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-08-11 13:45:14 -07:00
Sebastian Steinbeißer 6dca4daea5 Chore: switch from os.path to pathlib.Path (#10397) 2025-08-06 10:50:42 -07:00
Kilian 246f17c6c8 Enhancement: support import of zipped export (#10073) 2025-06-13 10:06:37 -07:00
shamoon 0b079f57ed Partially Revert "Chore(deps): Bump the small-changes group across 1 directory with 6 updates (#9921)"
This partially reverts commit 1fe166e014.
2025-05-19 12:20:05 -07:00
dependabot[bot] 1fe166e014 Chore(deps): Bump the small-changes group across 1 directory with 6 updates (#9921)
* Chore(deps): Bump the small-changes group across 1 directory with 6 updates

Bumps the small-changes group with 6 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [concurrent-log-handler](https://github.com/Preston-Landers/concurrent-log-handler) | `0.9.25` | `0.9.26` |
| [django](https://github.com/django/django) | `5.1.8` | `5.2.1` |
| [django-auditlog](https://github.com/jazzband/django-auditlog) | `3.0.0` | `3.1.2` |
| [drf-spectacular-sidecar](https://github.com/tfranzel/drf-spectacular-sidecar) | `2025.4.1` | `2025.5.1` |
| [ocrmypdf](https://github.com/ocrmypdf/OCRmyPDF) | `16.10.0` | `16.10.1` |
| [setproctitle](https://github.com/dvarrazzo/py-setproctitle) | `1.3.5` | `1.3.6` |



Updates `concurrent-log-handler` from 0.9.25 to 0.9.26
- [Release notes](https://github.com/Preston-Landers/concurrent-log-handler/releases)
- [Changelog](https://github.com/Preston-Landers/concurrent-log-handler/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Preston-Landers/concurrent-log-handler/compare/0.9.25...0.9.26)

Updates `django` from 5.1.8 to 5.2.1
- [Commits](https://github.com/django/django/compare/5.1.8...5.2.1)

Updates `django-auditlog` from 3.0.0 to 3.1.2
- [Release notes](https://github.com/jazzband/django-auditlog/releases)
- [Changelog](https://github.com/jazzband/django-auditlog/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jazzband/django-auditlog/compare/v3.0.0...v3.1.2)

Updates `drf-spectacular-sidecar` from 2025.4.1 to 2025.5.1
- [Commits](https://github.com/tfranzel/drf-spectacular-sidecar/compare/2025.4.1...2025.5.1)

Updates `ocrmypdf` from 16.10.0 to 16.10.1
- [Release notes](https://github.com/ocrmypdf/OCRmyPDF/releases)
- [Changelog](https://github.com/ocrmypdf/OCRmyPDF/blob/main/docs/release_notes.md)
- [Commits](https://github.com/ocrmypdf/OCRmyPDF/compare/v16.10.0...v16.10.1)

Updates `setproctitle` from 1.3.5 to 1.3.6
- [Changelog](https://github.com/dvarrazzo/py-setproctitle/blob/master/HISTORY.rst)
- [Commits](https://github.com/dvarrazzo/py-setproctitle/compare/version-1.3.5...version-1.3.6)

---
updated-dependencies:
- dependency-name: concurrent-log-handler
  dependency-version: 0.9.26
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: small-changes
- dependency-name: django
  dependency-version: 5.2.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: small-changes
- dependency-name: django-auditlog
  dependency-version: 3.1.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: small-changes
- dependency-name: drf-spectacular-sidecar
  dependency-version: 2025.5.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: small-changes
- dependency-name: ocrmypdf
  dependency-version: 16.10.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: small-changes
- dependency-name: setproctitle
  dependency-version: 1.3.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: small-changes
...

Signed-off-by: dependabot[bot] <support@github.com>

* Fix log matches related to newlines, add newlines to stdout.writelines

* Fix disable api remote auth test, Django 5.2 no longer uses process_request

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-05-19 11:03:02 -07:00
Trenton H f4413e0a08 Fix: Always clean up INotify (#9359) 2025-03-11 10:01:00 -07:00
shamoon 2d52226732 Enhancement: system status report sanity check, simpler classifier check, styling updates (#9106) 2025-02-26 22:12:20 +00:00
Sebastian Steinbeißer e560fa3be0 Chore: Enable ruff FBT (#8645) 2025-02-07 09:12:03 -08:00
dependabot[bot] 20ec8cb57b Chore(deps-dev): Bump the development group with 2 updates (#8841)
* Chore(deps-dev): Bump the development group with 2 updates

Bumps the development group with 2 updates: [ruff](https://github.com/astral-sh/ruff) and [mkdocs-material](https://github.com/squidfunk/mkdocs-material).


Updates `ruff` from 0.8.6 to 0.9.2
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/ruff/compare/0.8.6...0.9.2)

Updates `mkdocs-material` from 9.5.49 to 9.5.50
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.49...9.5.50)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: development
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: development
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update .pre-commit-config.yaml

* Run new ruff format

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-01-21 19:22:25 +00:00
Trenton H 6804c92861 Fix: Include email and webhook objects in the export (#8790) 2025-01-17 13:00:59 -08:00
Sebastian Steinbeißer 935d077836 Chore: Switch from os.path to pathlib.Path (#8325)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2025-01-06 12:12:27 -08:00
shamoon 7d182ab894 Enhancement: prune audit logs and management command (#8416) 2024-12-03 19:28:27 +00:00
shamoon 0fc1860d4c Enhancement: use stable unique IDs for custom field select options (#8299) 2024-12-02 04:15:38 +00:00
dependabot[bot] 00485138f9 Chore(deps-dev): Bump the development group with 4 updates (#8352)
Bumps the development group with 4 updates: [ruff](https://github.com/astral-sh/ruff), [pytest-httpx](https://github.com/Colin-b/pytest_httpx), [pytest-rerunfailures](https://github.com/pytest-dev/pytest-rerunfailures) and [mkdocs-material](https://github.com/squidfunk/mkdocs-material).


Updates `ruff` from 0.7.3 to 0.8.0
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/ruff/compare/0.7.3...0.8.0)

Updates `pytest-httpx` from 0.33.0 to 0.34.0
- [Release notes](https://github.com/Colin-b/pytest_httpx/releases)
- [Changelog](https://github.com/Colin-b/pytest_httpx/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/Colin-b/pytest_httpx/compare/v0.33.0...v0.34.0)

Updates `pytest-rerunfailures` from 14.0 to 15.0
- [Changelog](https://github.com/pytest-dev/pytest-rerunfailures/blob/master/CHANGES.rst)
- [Commits](https://github.com/pytest-dev/pytest-rerunfailures/compare/14.0...15.0)

Updates `mkdocs-material` from 9.5.44 to 9.5.46
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.44...9.5.46)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: development
- dependency-name: pytest-httpx
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: development
- dependency-name: pytest-rerunfailures
  dependency-type: direct:development
  update-type: version-update:semver-major
  dependency-group: development
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-01 18:56:54 -08:00
shamoon 9c1561adfb Change: change update content to handle archive disabled (#8315) 2024-11-20 20:01:13 +00:00
Kevin Doren 827121808a Enhancement: Add --compare-json option to document_exporter to write json files only if changed (#8261) 2024-11-19 07:20:24 -08:00
shamoon e94a92ed59 Feature: two-factor authentication (#8012) 2024-11-18 18:34:46 +00:00