- paperless_remote/signals.py: import from paperless.parsers.remote
(new location after git mv). supported_mime_types() is now a
classmethod that always returns the full set, so get_supported_mime_types()
in the signal layer explicitly checks RemoteEngineConfig validity and
returns {} when unconfigured — preserving the old behaviour where an
unconfigured remote parser does not register for any MIME types.
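The guard described above can be sketched as follows; `RemoteEngineConfig` here is a minimal stand-in with a validity check, and the MIME map is abbreviated:

```python
from dataclasses import dataclass


@dataclass
class RemoteEngineConfig:
    # Stand-in for the real config class; field names are assumptions.
    engine: str = ""
    api_key: str = ""

    def engine_is_valid(self) -> bool:
        return bool(self.engine and self.api_key)


def get_supported_mime_types(config: RemoteEngineConfig) -> dict:
    # An unconfigured remote parser must not register for any MIME types,
    # even though supported_mime_types() now always returns the full set.
    if not config.engine_is_valid():
        return {}
    return {"application/pdf": ".pdf", "image/png": ".png"}  # full set elided
```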
- documents/consumer.py: extend the _parser_cleanup() shim, parse()
dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
alongside TextDocumentParser. Both new-style parsers use __exit__
for cleanup and take (document_path, mime_type) without a file_name
argument.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- make_azure_mock moved from conftest.py back into test_remote_parser.py;
it is specific to that module and does not belong in shared fixtures
- azure_client fixture composes azure_settings + make_azure_mock + patch
in one step; tests no longer repeat the mocker.patch call or carry an
unused azure_settings parameter
- failing_azure_client fixture similarly composes azure_settings + patch
with a RuntimeError side effect; TestRemoteParserParseError now only
receives the mock it actually uses
- All @pytest.mark.parametrize calls use pytest.param with explicit ids
(pdf, png, jpeg, ...) for readable test output
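The mock factory at the heart of these fixtures can be sketched with stdlib `unittest.mock`; the Azure client method shape (`begin_analyze_document` returning a poller whose `.result()` yields the analysis) and the patch target are assumptions for illustration:

```python
from unittest import mock


def make_azure_mock(text="This is a test document."):
    # Build a fake Azure client whose analyze call yields `text`.
    result = mock.Mock()
    result.content = text
    client = mock.Mock()
    client.begin_analyze_document.return_value.result.return_value = result
    return client


# In conftest.py this would be composed into one fixture, roughly:
#
# @pytest.fixture
# def azure_client(azure_settings, make_azure_mock, mocker):
#     client = make_azure_mock()
#     mocker.patch(
#         "paperless.parsers.remote.DocumentAnalysisClient",  # assumed target
#         return_value=client,
#     )
#     return client
```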
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture
in conftest.py; tests call `make_azure_mock()` or
`make_azure_mock("custom text")` instead of a module-level function
- `azure_settings` and `no_engine_settings` applied via
`@pytest.mark.usefixtures` wherever their value is not referenced
inside the test body; `TestRemoteParserParseError` marked at the class
level since all three tests need the same setting
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrites the remote OCR parser to the new plugin system contract:
- `supported_mime_types()` is now a classmethod that always returns the
full set of 7 MIME types; the old instance-method hack (returning {}
when unconfigured) is removed
- `score()` classmethod returns None when no remote engine is configured
(making the parser invisible to the registry), and 20 when active —
higher than the tesseract default of 10 so the remote engine takes
priority when both are available
- No longer inherits from RasterisedDocumentParser; inherits no parser
class at all — just implements the protocol directly
- `can_produce_archive = True`; `requires_pdf_rendition = False`
- `_azure_ai_vision_parse()` takes explicit config arg; API client
created and closed within the method
- `get_page_count()` returns the PDF page count for application/pdf,
delegating to the new `get_page_count_for_pdf()` utility
- `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs;
returns [] for all other MIME types
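A sketch of the contract these bullets describe; the concrete MIME list and the `score()` signature are assumptions (seven types matching common remote-OCR-supported formats):

```python
class RemoteDocumentParser:
    # Protocol attributes from the bullets above.
    can_produce_archive = True
    requires_pdf_rendition = False

    @classmethod
    def supported_mime_types(cls):
        # Always the full set; visibility is controlled by score(), not here.
        return {
            "application/pdf",
            "image/png",
            "image/jpeg",
            "image/tiff",
            "image/bmp",
            "image/gif",
            "image/webp",
        }

    @classmethod
    def score(cls, mime_type, *, engine_configured=False):
        # None hides the parser from the registry; 20 outranks the
        # tesseract default of 10 when a remote engine is configured.
        if not engine_configured:
            return None
        return 20
```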
New files:
- `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and
`get_page_count_for_pdf()` utilities (pikepdf-based); both the remote
and tesseract parsers will use these going forward
- `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style
tests using pytest-django `settings` and pytest-mock `mocker` fixtures
- `src/paperless/tests/parsers/conftest.py` — remote parser instance,
sample-file, and settings-helper fixtures
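A hedged sketch of the shared pikepdf utilities; the real implementations in `src/paperless/parsers/utils.py` may differ in the metadata dict shape:

```python
def get_page_count_for_pdf(path):
    import pikepdf  # lazy import: non-PDF code paths never need it

    with pikepdf.open(path) as pdf:
        return len(pdf.pages)


def extract_pdf_metadata(path):
    import pikepdf

    with pikepdf.open(path) as pdf:
        # Shape of the returned entries is an assumption.
        return [
            {"key": str(key), "value": str(value)}
            for key, value in pdf.docinfo.items()
        ]


def extract_metadata(path, mime_type):
    # Only PDFs carry extractable metadata; every other MIME type gets [].
    if mime_type != "application/pdf":
        return []
    return extract_pdf_metadata(path)
```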
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Relocates three files to their new homes in the parser plugin system:
- src/paperless_remote/parsers.py
→ src/paperless/parsers/remote.py
- src/paperless_remote/tests/test_parser.py
→ src/paperless/tests/parsers/test_remote_parser.py
- src/paperless_remote/tests/samples/simple-digital.pdf
→ src/paperless/tests/samples/remote/simple-digital.pdf
Content and imports will be updated in the follow-up commit that
rewrites the parser to the new ParserProtocol interface.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: stream manifest parsing with ijson in document_importer
Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.
- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
fraction of manifest) for the tqdm progress bar
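The real `iter_manifest_records()` relies on ijson; the streaming idea can be illustrated with a stdlib stand-in built on `json.JSONDecoder.raw_decode` (over an in-memory string here for brevity; ijson additionally streams from the file handle so the text itself is never fully resident):

```python
import json


def iter_manifest_records(manifest_text):
    # Yield one record at a time from a JSON array without ever
    # materializing the full list of parsed records.
    decoder = json.JSONDecoder()
    idx = manifest_text.index("[") + 1
    while True:
        # Skip whitespace and commas between records.
        while idx < len(manifest_text) and manifest_text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(manifest_text) or manifest_text[idx] == "]":
            return
        record, idx = decoder.raw_decode(manifest_text, idx)
        yield record
```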
Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: slim dict in _import_files_from_manifest, discard fields
When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).
Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
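The slimming step amounts to a keep-list projection over each record; the exported-filename key names below are assumptions standing in for the exporter's real constants:

```python
# Keep only what the file-copy loop reads; content, checksum, tags, etc.
# from the full fields dict are dropped.
KEEP_KEYS = (
    "pk",
    "__exported_file_name__",       # assumed exporter key names
    "__exported_thumbnail_name__",
    "__exported_archive_name__",
)


def slim_document_record(record):
    return {key: record[key] for key in KEEP_KEYS if key in record}
```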
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()
Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses
Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: streaming manifest writer for document exporter (Phase 3)
Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.
Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml
Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption
Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.
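The derivation can be sketched as follows; the `CRYPT_FIELDS` entry shape (a list of dicts with `model_name` and `fields`) is an assumption based on the scan the mixin previously performed:

```python
CRYPT_FIELDS = [
    {"model_name": "mailaccount", "fields": ["password", "refresh_token"]},
    {"model_name": "socialtoken", "fields": ["token", "token_secret"]},
]

# Built once at class-definition time: model name -> fields, O(1) per record.
CRYPT_FIELDS_BY_MODEL = {
    entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS
}


def fields_to_encrypt(model_name):
    # Replaces the old linear scan + break over CRYPT_FIELDS.
    return CRYPT_FIELDS_BY_MODEL.get(model_name, [])
```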
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 -- Eliminate JSON round-trip in document exporter
Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.
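DjangoJSONEncoder handles the native types the `"python"` serializer emits; this stdlib subclass shows the equivalent conversions without importing Django:

```python
import datetime
import decimal
import json
import uuid


class NativeTypesEncoder(json.JSONEncoder):
    # Mirrors what DjangoJSONEncoder does for the types named above.
    def default(self, o):
        if isinstance(o, (datetime.datetime, datetime.date)):
            return o.isoformat()
        if isinstance(o, decimal.Decimal):
            return str(o)
        if isinstance(o, uuid.UUID):
            return str(o)
        return super().default(o)
```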
Phase 2 -- Batched QuerySet serialization in document exporter
Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
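The batching core is plain itertools and can be shown without Django; with a QuerySet, the `records` argument would be `qs.iterator()` and `serialize` would wrap `serializers.serialize("python", batch)`:

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    # Pull at most batch_size items at a time; memory is bounded by one batch.
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def serialize_iter_batched(records, batch_size=1000, serialize=list):
    # With Django this would be:
    #   for batch in iter_batches(qs.iterator(), batch_size):
    #       yield from serializers.serialize("python", batch)
    for batch in iter_batches(records, batch_size):
        yield from serialize(batch)
```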