Starting from the moved paperless_text/parsers.py, rewrite TextDocumentParser
to satisfy ParserProtocol without inheriting from the old DocumentParser base:
- Add class-level identity attributes (name, version, author, url)
- Add supported_mime_types() and score() classmethods
- Add can_produce_archive and requires_pdf_rendition properties (both False)
- Replace tempdir / read_file_handle_unicode_errors from the old base class
  with a self-contained __init__, __enter__, __exit__, and _read_text helper
- Drop file_name parameter from parse() and get_thumbnail(); add produce_archive kwarg
- Add extract_metadata() returning [] (plain text has no structured metadata)
- Remove get_settings() (not part of ParserProtocol)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Preserves git history of the original TextDocumentParser implementation.
The file will be edited in the next commit to implement ParserProtocol.
Consumption via the old signal-based system is temporarily broken.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Define MetadataEntry TypedDict (namespace, prefix, key, value) in
paperless.parsers and export it from __all__
- Add extract_metadata(document_path, mime_type) -> list[MetadataEntry]
to ParserProtocol; implementations must not raise
- Implement extract_metadata on TextDocumentParser (returns [])
- Update DummyParser fixture in test_registry to include extract_metadata
and align parse/get_thumbnail signatures with the current Protocol
- Add TestTextParserMetadata tests covering empty-list return and
mime_type-agnostic behaviour
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
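The TypedDict fields are named in the commit; a sketch of the shape and of the never-raise contract on the text parser:

```python
from typing import TypedDict


class MetadataEntry(TypedDict):
    """One structured metadata item extracted from a document."""

    namespace: str
    prefix: str
    key: str
    value: str


def extract_metadata(document_path, mime_type: str) -> list[MetadataEntry]:
    # Plain text carries no embedded metadata; the Protocol contract says
    # implementations must not raise, so an empty list is always safe.
    return []
```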
_builtins, _external, register_builtin, and get_parser_for_file were
typed as plain `type`, giving mypy no way to verify that supported_mime_types
and score exist on the stored classes. Using type[ParserProtocol] throughout
resolves the attr-defined errors and makes the registry's type contract
explicit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returning the concrete class name would give callers the wrong type if
the class is ever subclassed. Self resolves to the actual runtime type,
matching the ParserProtocol declaration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both now use TracebackType | None instead of object. The Protocol's
object annotation was overly broad — Python only ever passes TracebackType
or None as the third argument to __exit__, and the narrower type is
required for pyrefly's contravariant parameter check to pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests that were using `with TextDocumentParser() as parser:` inline now
receive the parser via the text_parser fixture. The two lifecycle tests
that must control instantiation directly (cleanup and exception cleanup)
are intentionally left unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move text sample files into tests/samples/text/ so each parser type
has its own folder as more parsers are migrated
- Move test_text_parser.py into tests/parsers/ sub-package (new __init__.py)
- Split conftest.py: top-level keeps clean_registry + samples_dir; new
parsers/conftest.py holds text_samples_dir, sample_txt_file,
malformed_txt_file, and text_parser fixtures
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove all Sphinx cross-reference markup (:meth:, :class:, :func:,
:attr:, :data:, backtick quoting) from registry.py and __init__.py
docstrings; use plain prose matching the rest of the codebase
- Remove unused file_name parameter from parse() and get_thumbnail()
in ParserProtocol — no existing parser reads it and the path already
carries the filename
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces the foundation of the entrypoint-based parser discovery
system to replace the signal-based document_consumer_declaration approach.
- Add ParserProtocol: runtime_checkable Protocol defining the full
contract for document parsers (supported_mime_types, score, parse,
context manager, result accessors)
- Add ParserRegistry: lazy singleton with entrypoint discovery via
importlib.metadata group 'paperless_ngx.parsers', uniform score-based
selection across external and built-in parsers
- Add get_parser_registry(), init_builtin_parsers(), reset_parser_registry()
module-level helpers
- Wire Celery worker_process_init to call init_builtin_parsers() eagerly
in each worker, deferring third-party discovery to first task use
- Add 28 pytest tests covering Protocol compliance, singleton lifecycle,
scoring logic, entrypoint discovery, and log output
Built-in parsers and consumer migration follow in Phases 3-6.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: stream manifest parsing with ijson in document_importer
Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.
- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
fraction of manifest) for the tqdm progress bar
Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
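The commit uses ijson, which pulls events incrementally from a file handle. As a stdlib-only approximation, json.JSONDecoder.raw_decode can walk a JSON array one record at a time; this sketch shows only the record-at-a-time iteration pattern, not ijson's true incremental file reading:

```python
import json


def iter_json_array(text: str):
    """Yield one object at a time from a JSON array, never building the full list."""
    decoder = json.JSONDecoder()
    idx = text.index("[") + 1
    while True:
        # Skip whitespace and the comma between records
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(text) or text[idx] == "]":
            return
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
```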
* Perf: slim dict in _import_files_from_manifest, discard fields
When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).
Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
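The slimming step amounts to a keyed projection of each record; the filename key names below are placeholders, not the exporter's real constants:

```python
# Hypothetical key names; the real exporter defines its own filename keys
DOCUMENT_KEYS = (
    "pk",
    "__exported_file_name__",
    "__exported_thumbnail_name__",
    "__exported_archive_name__",
)


def slim_document_record(record: dict) -> dict:
    # Keep only what the file-copy loop reads; drop content, checksum, tags, ...
    return {k: record[k] for k in DOCUMENT_KEYS if k in record}
```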
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()
Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses
Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: streaming manifest writer for document exporter (Phase 3)
Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.
Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-document records are streamed during the transaction;
  document records are accumulated and written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml
Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
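A self-contained sketch of the writer's three behaviours named above: incremental record writes to a .tmp file, BLAKE2b comparison to skip unchanged targets, and discard() for the exception path. Method names beyond those are assumptions:

```python
import hashlib
import json
import os
from pathlib import Path


class StreamingManifestWriter:
    """Write JSON records to <target>.tmp one at a time, then rename atomically."""

    def __init__(self, target: Path) -> None:
        self.target = Path(target)
        self.tmp = self.target.parent / (self.target.name + ".tmp")
        self._fh = self.tmp.open("w", encoding="utf-8")
        self._fh.write("[")
        self._first = True

    def write(self, record: dict) -> None:
        if not self._first:
            self._fh.write(",")
        self._first = False
        json.dump(record, self._fh)

    def close(self, compare_existing: bool = False) -> bool:
        """Finish the array; return True if the target was (re)written."""
        self._fh.write("]")
        self._fh.close()
        if compare_existing and self.target.exists():
            old = hashlib.blake2b(self.target.read_bytes()).digest()
            new = hashlib.blake2b(self.tmp.read_bytes()).digest()
            if old == new:
                self.tmp.unlink()  # content unchanged: keep the original file
                return False
        os.replace(self.tmp, self.target)  # atomic on POSIX
        return True

    def discard(self) -> None:
        # On exception: close and remove the partial temp file
        self._fh.close()
        self.tmp.unlink(missing_ok=True)
```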
* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption
Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
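The lookup-table derivation in miniature; the shape of CRYPT_FIELDS and the model/field names are assumptions for illustration:

```python
# Assumed shape: each entry names a model and its sensitive fields
CRYPT_FIELDS = [
    {"model_name": "paperless_mail.mailaccount", "fields": ["password", "refresh_token"]},
    {"model_name": "paperless_mail.mailrule", "fields": []},
]

# Derived once at definition time: one dict lookup replaces the old
# linear scan with its loop-and-break pattern
CRYPT_FIELDS_BY_MODEL = {entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS}


def fields_to_encrypt(record: dict) -> list:
    return CRYPT_FIELDS_BY_MODEL.get(record.get("model", ""), [])
```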
Phase 1 -- Eliminate JSON round-trip in document exporter
Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.
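A stdlib sketch in the spirit of DjangoJSONEncoder, showing why a custom encoder is needed once the "python" serializer hands back native objects instead of pre-stringified JSON:

```python
import json
import uuid
from datetime import date, datetime
from decimal import Decimal


class NativeTypesEncoder(json.JSONEncoder):
    """Handle the native types the 'python' serializer leaves unconverted."""

    def default(self, o):
        if isinstance(o, (datetime, date)):
            return o.isoformat()
        if isinstance(o, Decimal):
            return str(o)  # string form preserves precision
        if isinstance(o, uuid.UUID):
            return str(o)
        return super().default(o)
```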
Phase 2 -- Batched QuerySet serialization in document exporter
Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
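The batching core of such a helper, reduced to pure iterators so it is independent of Django; with a real QuerySet one would presumably feed it qs.iterator(chunk_size=batch_size) so the database cursor streams rows as well:

```python
from itertools import islice
from typing import Iterable, Iterator


def serialize_queryset_batched(records: Iterable, batch_size: int) -> Iterator[list]:
    """Yield records in lists of at most batch_size, one batch resident at a time."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch
```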