* refactor: switch consumer and callers to ParserRegistry (Phase 4)
Replace all Django signal-based parser discovery with direct registry
calls. Removes `_parser_cleanup`, `parser_is_new_style` shims, and all
old-style isinstance checks. All parser instantiation now uses the
`with parser_class() as parser:` context manager pattern.
- documents/parsers.py: delegate to get_parser_registry(); drop lru_cache
- documents/consumer.py: use registry + context manager; remove shims
- documents/tasks.py: same pattern
- documents/management/commands/document_thumbnails.py: same pattern
- documents/views.py: get_metadata uses context manager
- documents/checks.py: use get_parser_registry().all_parsers()
- paperless/parsers/registry.py: add all_parsers() public method
- tests: update mocks to target documents.consumer.get_parser_class_for_mime_type
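
For illustration, the context-manager pattern described above could look
roughly like this. It is a minimal sketch, not the project's code:
`DemoParser`, `consume()`, and the method bodies are hypothetical
stand-ins; only the `with parser_class() as parser:` shape comes from
this commit.

```python
from __future__ import annotations

from pathlib import Path


class DemoParser:
    """Hypothetical parser standing in for a registry-provided class."""

    def __init__(self) -> None:
        self.closed = False
        self._text: str | None = None

    def __enter__(self) -> DemoParser:
        # Per-parse resources (temp dirs, clients) would be acquired here.
        return self

    def __exit__(self, *exc_info: object) -> None:
        # Runs even when parse() raises, so no separate cleanup shim is needed.
        self.closed = True

    def parse(self, document_path: Path, mime_type: str) -> None:
        self._text = f"parsed {document_path.name} as {mime_type}"

    def get_text(self) -> str | None:
        return self._text


def consume(parser_class: type[DemoParser], path: Path, mime_type: str) -> str | None:
    # Instantiate, parse, and release resources in one block.
    with parser_class() as parser:
        parser.parse(path, mime_type)
        return parser.get_text()
```

Because cleanup lives in `__exit__`, callers need neither an isinstance
check nor a `finally` branch to decide how a parser is torn down.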
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: drop get_parser_class_for_mime_type; callers use registry directly
All callers now call get_parser_registry().get_parser_for_file() with
the actual filename and path, enabling score() to use file extension
hints. The MIME-only helper is removed.
- consumer.py: passes self.filename + self.working_copy
- tasks.py: passes document.original_filename + document.source_path
- document_thumbnails.py: same pattern
- views.py: passes Path(file).name + Path(file)
- parsers.py: internal helpers inline the registry call with filename=""
- test_parsers.py: drop TestParserDiscovery (was testing mock behavior);
TestParserAvailability uses registry directly
- test_consumer.py: mocks switch to documents.consumer.get_parser_registry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: remove document_consumer_declaration signal infrastructure
Remove the document_consumer_declaration signal that was previously used
for parser registration. Each parser app no longer connects to this signal,
and the signal declaration itself has been removed from documents/signals.
Changes:
- Remove document_consumer_declaration from documents/signals/__init__.py
- Remove ready() methods and signal imports from all parser app configs
- Delete signal shim files (signals.py) from all parser apps:
- paperless_tesseract/signals.py
- paperless_text/signals.py
- paperless_tika/signals.py
- paperless_mail/signals.py
- paperless_remote/signals.py
Parser discovery now happens exclusively through the ParserRegistry
system introduced in the previous refactor phases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: remove empty paperless_text and paperless_tika Django apps
After parser classes were moved to paperless/parsers/ in the plugin
refactor, these Django apps contained only empty AppConfig classes
with no models, views, tasks, migrations, or other functionality.
- Remove paperless_text and paperless_tika from INSTALLED_APPS
- Delete empty app directories entirely
- Update pyproject.toml test exclusions
- Clean stale mypy baseline entries for moved parser files
paperless_remote app is retained as it contains meaningful system
checks for Azure AI configuration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Moves the checks and tests to the main application and removes the old applications
* Adds a comment to satisfy Sonar
* refactor: remove automatic log_summary() call from get_parser_registry()
The summary was logged once per process, causing it to appear repeatedly
during Docker startup (management commands, web server, each Celery
worker subprocess). External parsers are already announced individually
at INFO when discovered; the full summary is redundant noise.
log_summary() is retained on ParserRegistry for manual/debug use.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Cleans up the duplicate test file/fixture
* Fixes a race condition where webserver threads could race to populate the registry
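
One common way to close this kind of race is double-checked locking
around the lazy initialiser. The sketch below assumes that approach; the
commit does not say which fix was used, and `DemoRegistry`,
`_registry_lock`, and `get_registry()` are hypothetical names.

```python
from __future__ import annotations

import threading


class DemoRegistry:
    """Hypothetical stand-in for the parser registry."""

    def __init__(self) -> None:
        self.parsers: list[str] = []

    def register_defaults(self) -> None:
        self.parsers.extend(["text", "tika", "mail"])


_registry: DemoRegistry | None = None
_registry_lock = threading.Lock()


def get_registry() -> DemoRegistry:
    global _registry
    if _registry is None:  # fast path: no lock once populated
        with _registry_lock:
            if _registry is None:  # re-check: another thread may have won
                registry = DemoRegistry()
                registry.register_defaults()
                # Publish only after the registry is fully populated.
                _registry = registry
    return _registry
```

Assigning the module-level name last means a concurrent reader sees
either `None` or a complete registry, never a half-populated one.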
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Move tesseract parser, tests, and samples to paperless.parsers
Relocates files in preparation for the Phase 3 Protocol-based parser
refactor, preserving full git history via rename.
- src/paperless_tesseract/parsers.py -> src/paperless/parsers/tesseract.py
- src/paperless_tesseract/tests/test_parser.py -> src/paperless/tests/parsers/test_tesseract_parser.py
- src/paperless_tesseract/tests/test_parser_custom_settings.py -> src/paperless/tests/parsers/test_tesseract_custom_settings.py
- src/paperless_tesseract/tests/samples/* -> src/paperless/tests/samples/tesseract/
- Moves RUF001 suppression from broad per-file pyproject.toml ignore to inline noqa comments on the two affected lines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor RasterisedDocumentParser to ParserProtocol interface
- Add RasterisedDocumentParser to registry.register_defaults()
- Update parser class: remove DocumentParser inheritance, add Protocol
class attrs/classmethods/properties, context-manager lifecycle
- Add read_file_handle_unicode_errors() to shared parsers/utils.py
- Replace inline unicode-error-handling with shared utility call
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update tesseract signals.py to import from new parser location
RasterisedDocumentParser moved to paperless.parsers.tesseract; update
the lazy import in signals.get_parser so the signal-based consumer
declaration continues to work during the registry transition. Pop
logging_group and progress_callback kwargs for constructor compatibility.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* tests: rewrite test_tesseract_parser to pytest style with typed fixtures
- Converts all tests from Django TestCase to pytest-style classes
- Adds tesseract_samples_dir, null_app_config, tesseract_parser, and
make_tesseract_parser fixtures in conftest.py; all DB-free except
TestOcrmypdfParameters which uses @pytest.mark.django_db
- Defines MakeTesseractParser type alias in conftest.py for autocomplete
- Fixes FBT001 (boolean positional args) by making bool params
keyword-only with * separator in parametrize test signatures
- Adds type annotations to all fixture parameters for IDE support
- Uses pytest.param(..., id="...") throughout; pytest-mock for patching
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(types): fully annotate paperless/parsers/tesseract.py
Fixes all mypy and pyrefly errors in the new parser file:
- Add missing type annotations to is_image, has_alpha, get_dpi,
calculate_a4_dpi, construct_ocrmypdf_parameters, post_process_text
- Narrow image helper args to Path only (no str); convert to str when
building list[str] args for run_subprocess
- Annotate ocrmypdf_args as dict[str, Any] so operator expressions on
its values type-check and ocrmypdf.ocr(**args) resolves cleanly
- Declare text: str | None = None at top of extract_text to unify
all assignments to the same type across both branches
- Import Any from typing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fixes isort
* fix: add RasterisedDocumentParser to new-style parser shim checks
The new RasterisedDocumentParser uses __enter__/__exit__ for resource
management instead of cleanup(). Update all existing new-style shims to
include it in the isinstance checks:
- documents/consumer.py: _parser_cleanup(), parser_is_new_style
- documents/tasks.py: parser_is_new_style, finally cleanup branch
(also adds RemoteDocumentParser which was missing from the latter)
- documents/management/commands/document_thumbnails.py: adds new-style
handling from scratch (enter/exit + 2-arg get_thumbnail signature)
Fix stale import paths in three test files that were still importing
from paperless_tesseract.parsers instead of paperless.parsers.tesseract.
Fix two registry tests that used application/pdf as a proxy for "no
handler" — now that RasterisedDocumentParser is registered, PDF always
has a handler, so switch to a truly unsupported MIME type.
Signal infrastructure and shims remain intact; this is plumbing only.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* One missed import (cherry pick?)
* Adds a no cover for a special case of handling unicode errors in PDF metadata
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor(mail): rename paperless_mail/parsers.py → paperless/parsers/mail.py
Preserve git history for MailDocumentParser by committing the rename
separately before editing, following the project convention.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor(mail): move mail parser tests to paperless/tests/parsers/
Move test_parsers.py → test_mail_parser.py and test_parsers_live.py →
test_mail_parser_live.py alongside the other built-in parser tests,
preserving git history before editing. Update MailDocumentParser import
to the new canonical location.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore: move mail parser sample files to paperless/tests/samples/mail/
Relocate all mail test fixtures from src/paperless_mail/tests/samples/ to
src/paperless/tests/samples/mail/ ahead of the parser plugin refactor.
Add the new path to the codespell skip list to prevent false-positive
spell corrections in binary/fixture email files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Feat(tests): add mail parser fixtures to paperless/tests/parsers/conftest.py
Add mail_samples_dir, per-file sample fixtures, and mail_parser
(context-manager style) to mirror the old paperless_mail conftest
but rooted at the new samples/mail/ location.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Feat(parsers): migrate MailDocumentParser to ParserProtocol
Move the mail parser from paperless_mail/parsers.py to
paperless/parsers/mail.py and refactor it to implement ParserProtocol:
- Class-level name/version/author/url attributes
- supported_mime_types() and score() classmethods (score=20)
- can_produce_archive=False, requires_pdf_rendition=True
- Context manager lifecycle (__enter__/__exit__)
- New parse() signature without mailrule_id kwarg; consumer sets
parser.mailrule_id before calling parse() instead
- get_text()/get_date()/get_archive_path() accessor methods
- extract_metadata() returning email headers and attachment info
Register MailDocumentParser in the ParserRegistry alongside Text and
Tika parsers. Update consumer, signals, and all import sites to use
the new location. Update tests to use the new accessor API, patch
paths, and context-manager fixture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix(parsers): pop legacy constructor args in mail signal wrapper
MailDocumentParser.__init__ takes no constructor args in the new
protocol. Update the get_parser() signal wrapper to pop logging_group
and progress_callback (passed by the legacy consumer dispatch path)
before instantiating — the same pattern used by TextDocumentParser.
Also update test_mail_parser_receives_mailrule to use the real signal
wrapper (mail_get_parser) instead of MailDocumentParser directly, so
the test exercises the actual dispatch path and matches the new
parse() call signature (no mailrule kwarg).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Bumps this so we can run
* Fixes location of the fixture
* Removes fixtures which were duplicated
* Feat(parsers): add ParserContext and configure() to ParserProtocol
Replace the ad-hoc mailrule_id attribute assignment with a typed,
immutable ParserContext dataclass and a configure() method on the
Protocol:
- ParserContext(frozen=True, slots=True) lives in paperless/parsers/
alongside ParserProtocol and MetadataEntry; currently carries only
mailrule_id but is designed to grow with output_type, ocr_mode, and
ocr_language in a future phase (decoupling parsers from settings.*)
- ParserProtocol.configure(context: ParserContext) -> None is the
extension point; no-op by default
- MailDocumentParser.configure() reads mailrule_id into _mailrule_id
- TextDocumentParser and TikaDocumentParser implement a no-op configure()
- Consumer calls document_parser.configure(ParserContext(...)) before
parse(), replacing the isinstance(parser, MailDocumentParser) guard
and the direct attribute mutation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Feat(parsers): call configure(ParserContext()) in update_document task
Apply the same new-style parser shim pattern as the consumer to
update_document_content_maybe_archive_file:
- Call __enter__ for Text/Tika parsers after instantiation
- Call configure(ParserContext()) before parse() for all new-style parsers
(mailrule_id is not available here — this is a re-process of an
existing document, so the default empty context is correct)
- Call parse(path, mime_type) with 2 args for new-style parsers
- Call get_thumbnail(path, mime_type) with 2 args for new-style parsers
- Call __exit__ instead of cleanup() in the finally block
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix(tests): add configure() to DummyParser and missing-method parametrize
ParserProtocol now requires configure(context: ParserContext) -> None.
Update DummyParser in test_registry.py to implement it, and add
'missing-configure' to the test_partial_compliant_fails_isinstance
parametrize list so the new method is covered by the negative test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Cleans up the reprocess task and generally reduces duplication of classes
* Corrects the score return
* Updates these parsers to report a page count, assuming an archive has already been produced when called
* Increases test coverage
* One more coverage
* Updates typing
* Updates typing
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: move remote parser, test, and sample to paperless.parsers
Relocates three files to their new homes in the parser plugin system:
- src/paperless_remote/parsers.py
→ src/paperless/parsers/remote.py
- src/paperless_remote/tests/test_parser.py
→ src/paperless/tests/parsers/test_remote_parser.py
- src/paperless_remote/tests/samples/simple-digital.pdf
→ src/paperless/tests/samples/remote/simple-digital.pdf
Content and imports will be updated in the follow-up commit that
rewrites the parser to the new ParserProtocol interface.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Feature: migrate RemoteDocumentParser to ParserProtocol interface
Rewrites the remote OCR parser to the new plugin system contract:
- `supported_mime_types()` is now a classmethod that always returns the
full set of 7 MIME types; the old instance-method hack (returning {}
when unconfigured) is removed
- `score()` classmethod returns None when no remote engine is configured
(making the parser invisible to the registry), and 20 when active —
higher than the tesseract default of 10 so the remote engine takes
priority when both are available
- No longer inherits from RasterisedDocumentParser; inherits no parser
class at all — just implements the protocol directly
- `can_produce_archive = True`; `requires_pdf_rendition = False`
- `_azure_ai_vision_parse()` takes explicit config arg; API client
created and closed within the method
- `get_page_count()` returns the PDF page count for application/pdf,
delegating to the new `get_page_count_for_pdf()` utility
- `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs;
returns [] for all other MIME types
New files:
- `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and
`get_page_count_for_pdf()` utilities (pikepdf-based); both the remote
and tesseract parsers will use these going forward
- `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style
tests using pytest-django `settings` and pytest-mock `mocker` fixtures
- `src/paperless/tests/parsers/conftest.py` — remote parser instance,
sample-file, and settings-helper fixtures
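
To illustrate how a `None` score hides a parser and a higher score wins,
here is a rough sketch. The values 10 and 20 come from this commit; the
`score()` signature, the `engine_configured` flag, and `pick_parser()`
are assumptions made for the example, not the registry's real API.

```python
from __future__ import annotations


class TesseractSketch:
    @classmethod
    def score(cls, mime_type: str) -> int | None:
        return 10  # default priority for the local OCR parser


class RemoteSketch:
    # Hypothetical flag standing in for "a remote engine is configured".
    engine_configured = False

    @classmethod
    def score(cls, mime_type: str) -> int | None:
        # None makes the parser invisible; 20 outranks tesseract's 10.
        return 20 if cls.engine_configured else None


def pick_parser(candidates: list[type], mime_type: str) -> type | None:
    best: type | None = None
    best_score: int | None = None
    for candidate in candidates:
        score = candidate.score(mime_type)
        if score is None:
            continue  # parser opted out for this file
        if best_score is None or score > best_score:
            best, best_score = candidate, score
    return best
```

With the flag unset, `pick_parser()` returns `TesseractSketch` for a
PDF; once it is set, `RemoteSketch` takes priority.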
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: use fixture factory and usefixtures in remote parser tests
- `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture
in conftest.py; tests call `make_azure_mock()` or
`make_azure_mock("custom text")` instead of a module-level function
- `azure_settings` and `no_engine_settings` applied via
`@pytest.mark.usefixtures` wherever their value is not referenced
inside the test body; `TestRemoteParserParseError` marked at the class
level since all three tests need the same setting
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: improve remote parser test fixture structure
- make_azure_mock moved from conftest.py back into test_remote_parser.py;
it is specific to that module and does not belong in shared fixtures
- azure_client fixture composes azure_settings + make_azure_mock + patch
in one step; tests no longer repeat the mocker.patch call or carry an
unused azure_settings parameter
- failing_azure_client fixture similarly composes azure_settings + patch
with a RuntimeError side effect; TestRemoteParserParseError now only
receives the mock it actually uses
- All @pytest.mark.parametrize calls use pytest.param with explicit ids
(pdf, png, jpeg, ...) for readable test output
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: wire RemoteDocumentParser into consumer and fix signals
- paperless_remote/signals.py: import from paperless.parsers.remote
(new location after git mv). supported_mime_types() is now a
classmethod that always returns the full set, so get_supported_mime_types()
in the signal layer explicitly checks RemoteEngineConfig validity and
returns {} when unconfigured — preserving the old behaviour where an
unconfigured remote parser does not register for any MIME types.
- documents/consumer.py: extend the _parser_cleanup() shim, parse()
dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
alongside TextDocumentParser. Both new-style parsers use __exit__
for cleanup and take (document_path, mime_type) without a file_name
argument.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: fix type errors in remote parser and signals
- remote.py: add `if TYPE_CHECKING: assert` guards before the Azure
client construction to narrow config.endpoint and config.api_key from
str|None to str. The narrowing is safe: engine_is_valid() guarantees
both are non-None when it returns True (api_key explicitly; endpoint
via `not (engine=="azureai" and endpoint is None)` for the only valid
engine). Asserts are wrapped in TYPE_CHECKING so they carry zero
runtime cost.
- signals.py: add full type annotations — return types, Any-typed
sender parameter, and explicit logging_group argument replacing *args.
Add `from __future__ import annotations` for consistent annotation style.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: get_parser factory forwards logging_group, drops progress_callback
consumer.py calls parser_class(logging_group, progress_callback=...).
RemoteDocumentParser.__init__ accepts logging_group but not
progress_callback, so only the latter is dropped — matching the pattern
established by the TextDocumentParser signals shim.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: text parser get_parser forwards logging_group, drops progress_callback
TextDocumentParser.__init__ accepts logging_group: object = None, same
as RemoteDocumentParser. The old shim incorrectly dropped it; fix to
forward it as a positional arg and only drop progress_callback.
Add type annotations and from __future__ import annotations for
consistency with the remote parser signals shim.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add test_workflow_document_updated_does_not_overwrite_filename to
verify that run_workflows (DOCUMENT_UPDATED path) does not revert a
DB filename that was updated by a concurrent bulk_update_documents
task's update_filename_and_move_files call.
The test replicates the race window by:
- Updating the DB filename directly (simulating BUD-1 completing)
- Mocking refresh_from_db so the stale in-memory filename persists
- Asserting the DB filename is not clobbered after run_workflows
Relates to: https://github.com/paperless-ngx/paperless-ngx/issues/12386
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore: move Tika parser and tests to paperless/
Move TikaDocumentParser and its tests to the canonical parser package
location, matching the pattern established for TextDocumentParser:
- src/paperless_tika/parsers.py → src/paperless/parsers/tika.py
- src/paperless_tika/tests/test_tika_parser.py → src/paperless/tests/parsers/test_tika_parser.py
- src/paperless_tika/tests/samples/ → src/paperless/tests/samples/tika/
Merge tika fixtures (tika_parser, sample_odt_file, sample_docx_file,
sample_doc_file, sample_broken_odt) into the shared parsers conftest.
Remove the now-empty src/paperless_tika/tests/conftest.py.
Content is unchanged — this commit is rename-only so git history is
preserved on the moved files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Feature: Phase 3 — migrate TikaDocumentParser to ParserProtocol
Refactor TikaDocumentParser to satisfy ParserProtocol without subclassing
the legacy DocumentParser ABC:
- Add ClassVars: name, version, author, url
- Add supported_mime_types() classmethod (12 Office/ODF/RTF MIME types)
- Add score() classmethod — returns None when TIKA_ENABLED is False, 10 otherwise
- can_produce_archive = False (PDF is for display, not an OCR archive)
- requires_pdf_rendition = True (Office formats need PDF for browser display)
- __enter__/__exit__ via ExitStack: TikaClient opened once per parser
lifetime and shared across parse() and extract_metadata() calls
- extract_metadata() falls back to a short-lived TikaClient when called
outside a context manager (legacy view-layer metadata path)
- _convert_to_pdf() uses OutputTypeConfig() to honour the database-stored
ApplicationConfiguration before falling back to the env-var setting
- Rename convert_to_pdf → _convert_to_pdf (private helper)
Update paperless_tika/signals.py shim to import from the new module path
and drop the legacy logging_group/progress_callback kwargs.
Update documents/consumer.py to extend the existing TextDocumentParser
special cases to also cover TikaDocumentParser (parse/get_thumbnail
signatures, __exit__ cleanup).
Add TestTikaParserRegistryInterface (7 tests) covering score(), properties,
and ParserProtocol isinstance check. Update existing tests to use the new
accessor API (get_text, get_date, get_archive_path, _convert_to_pdf).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: update remaining imports and move live Tika tests after parser migration
- src/documents/tests/test_parsers.py: import TikaDocumentParser from
paperless.parsers.tika (old paperless_tika.parsers no longer exists)
- git mv paperless_tika/tests/test_live_tika.py →
paperless/tests/parsers/test_live_tika.py to co-locate all Tika tests
with the parser; update import and replace old attribute API
(tika_parser.text/.archive_path) with accessor methods
(get_text/get_archive_path)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: satisfy mypy and pyrefly for TikaDocumentParser
Use a TYPE_CHECKING-guarded assert to narrow self._tika_client from
TikaClient | None to TikaClient at the point of use in parse(). The
assert is visible to type checkers (TYPE_CHECKING=True) so both mypy
and pyrefly accept the subsequent attribute accesses without error;
at runtime TYPE_CHECKING is False so the assert never executes and no
ruff S101 suppression is required.
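
The narrowing trick reads roughly as follows. This is a sketch with
hypothetical classes (`SketchClient`, `SketchParser`); only the
`if TYPE_CHECKING: assert ...` idiom is taken from this commit.

```python
from __future__ import annotations

from typing import TYPE_CHECKING


class SketchClient:
    def fetch(self, name: str) -> str:
        return f"text of {name}"


class SketchParser:
    def __init__(self) -> None:
        self._client: SketchClient | None = None

    def __enter__(self) -> SketchParser:
        self._client = SketchClient()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._client = None

    def parse(self, name: str) -> str:
        if TYPE_CHECKING:
            # Seen by type checkers only; never executed at runtime.
            assert self._client is not None
        return self._client.fetch(name)
```

One consequence worth noting: calling `parse()` outside a `with` block
fails with `AttributeError` on `None`, not `AssertionError`, because the
assert never runs.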
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: require context manager for TikaDocumentParser; clean up client lifecycle
- consumer.py: call __enter__ for new-style parsers so _tika_client and
_gotenberg_client are set before parse() is invoked
- views.py: use `with parser` (via nullcontext for old-style parsers) in
get_metadata so extract_metadata always runs inside a context manager
- tika.py: GotenbergClient added to ExitStack alongside TikaClient;
inline client creation removed from extract_metadata and _convert_to_pdf;
__exit__ uses ExitStack.close() instead of __exit__ pass-through
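
As a sketch of the lifecycle described above, the following uses
`contextlib.ExitStack` to own two clients for the parser's lifetime.
`LoggedClient` and `TikaParserSketch` are hypothetical; the real parser
wraps TikaClient and GotenbergClient.

```python
from __future__ import annotations

from contextlib import ExitStack


class LoggedClient:
    """Hypothetical client that records when it is opened and closed."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self.log = log

    def __enter__(self) -> LoggedClient:
        self.log.append(f"open {self.name}")
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.log.append(f"close {self.name}")


class TikaParserSketch:
    def __init__(self, log: list[str]) -> None:
        self._log = log
        self._stack = ExitStack()
        self._tika_client: LoggedClient | None = None
        self._gotenberg_client: LoggedClient | None = None

    def __enter__(self) -> TikaParserSketch:
        # Both clients are opened once and shared by later method calls.
        self._tika_client = self._stack.enter_context(LoggedClient("tika", self._log))
        self._gotenberg_client = self._stack.enter_context(
            LoggedClient("gotenberg", self._log),
        )
        return self

    def __exit__(self, *exc_info: object) -> None:
        # close() unwinds registered contexts in reverse order.
        self._stack.close()
        self._tika_client = None
        self._gotenberg_client = None
```

Entering and leaving the parser once yields the log `open tika`,
`open gotenberg`, `close gotenberg`, `close tika`.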
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: stream manifest parsing with ijson in document_importer
Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.
- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
fraction of manifest) for the tqdm progress bar
Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: slim dict in _import_files_from_manifest, discard fields
When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).
Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
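
The idea can be sketched as below. The key names in `FILE_KEYS` and the
record layout are placeholders chosen for the example, not the
importer's real manifest keys.

```python
from __future__ import annotations

# Placeholder names for the three exported filename keys.
FILE_KEYS = ("file_name", "thumbnail_name", "archive_name")


def slim_document_record(record: dict) -> dict:
    """Keep only pk and the filename keys; drop the bulky fields dict."""
    slim = {"pk": record["pk"]}
    for key in FILE_KEYS:
        slim[key] = record.get(key)
    return slim


full = {
    "pk": 1,
    "model": "documents.document",
    "fields": {"content": "x" * 10_000, "checksum": "abc"},
    "file_name": "0000001.pdf",
    "thumbnail_name": "0000001.webp",
    "archive_name": None,
}
slim = slim_document_record(full)
```

Holding `slim` instead of `full` in the list lets the large `content`
string be garbage-collected as soon as the record has been read.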
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()
Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses
Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: streaming manifest writer for document exporter (Phase 3)
Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.
Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml
Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged
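
The write-to-.tmp and BLAKE2b compare steps might look like the sketch
below. It is an assumption about the general technique, not the
StreamingManifestWriter implementation; `file_digest()` and
`write_if_changed()` are hypothetical helpers.

```python
from __future__ import annotations

import hashlib
import os
from pathlib import Path


def file_digest(path: Path) -> bytes:
    """Hash a file in chunks so it is never fully loaded into memory."""
    digest = hashlib.blake2b()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.digest()


def write_if_changed(target: Path, payload: bytes) -> bool:
    """Write payload to a sibling .tmp file, then swap it in if it differs."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(payload)
    if target.exists() and file_digest(target) == file_digest(tmp):
        tmp.unlink()  # identical content: keep the existing file untouched
        return False
    os.replace(tmp, target)  # atomic rename on the same filesystem
    return True
```

Readers of the target never observe a partially written file, and an
unchanged export leaves the existing file's timestamp alone.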
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption
Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.
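
Roughly, the lookup table works as sketched here. The entry layout of
`CRYPT_FIELDS`, the model names, and the `encrypt` callable are
assumptions made for the example.

```python
from __future__ import annotations

from typing import Callable


class CryptMixinSketch:
    # Assumed shape: one entry per model with its secret field names.
    CRYPT_FIELDS = [
        {"model_name": "demo_app.account", "fields": ["password"]},
        {"model_name": "demo_app.token", "fields": ["token", "refresh_token"]},
    ]
    # Built once, when the class body is executed.
    CRYPT_FIELDS_BY_MODEL = {
        entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS
    }


def encrypt_record_inline(record: dict, encrypt: Callable[[str], str]) -> dict:
    # One dict lookup replaces a scan over every CRYPT_FIELDS entry.
    names = CryptMixinSketch.CRYPT_FIELDS_BY_MODEL.get(record["model"], ())
    for name in names:
        value = record["fields"].get(name)
        if value:
            record["fields"][name] = encrypt(value)
    return record
```

Records whose model is absent from the table pass through unchanged,
which covers the common case cheaply.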
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 -- Eliminate JSON round-trip in document exporter
Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.
Phase 2 -- Batched QuerySet serialization in document exporter
Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
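
The batching step can be sketched with `itertools.islice` alone. Here a
plain generator stands in for `QuerySet.iterator()`, and `iter_batches()`
is a hypothetical helper, not the `serialize_queryset_batched()`
implementation.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from itertools import islice


def iter_batches(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    # islice pulls only batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def fake_queryset(count: int) -> Iterator[dict]:
    # Stand-in for QuerySet.iterator(): produces rows lazily.
    for pk in range(count):
        yield {"pk": pk}


sizes = [len(batch) for batch in iter_batches(fake_queryset(5), batch_size=2)]
```

Only one batch is alive at a time, so peak memory tracks `batch_size`
and not the total row count.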