3854 Commits

Author SHA1 Message Date
Trenton H
7eb417e796 Feat: refactor TextDocumentParser to ParserProtocol
Starting from the moved paperless_text/parsers.py, rewrite the class to
satisfy ParserProtocol without inheriting from the old DocumentParser base:

- Add class-level identity attributes (name, version, author, url)
- Add supported_mime_types() and score() classmethods
- Add can_produce_archive and requires_pdf_rendition properties (both False)
- Replace tempdir / read_file_handle_unicode_errors from old base class with
  a self-contained __init__, __enter__, __exit__, and _read_text helper
- Drop file_name parameter from parse() and get_thumbnail(); add produce_archive kwarg
- Add extract_metadata() returning [] (plain text has no structured metadata)
- Remove get_settings() (not part of ParserProtocol)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 16:54:52 -07:00
Trenton H
8c08362ebc Chore: move paperless_text/parsers.py to paperless/parsers/text.py
Preserves git history of the original TextDocumentParser implementation.
The file will be edited in the next commit to implement ParserProtocol.
Consumption via the old signal-based system is temporarily broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 16:31:00 -07:00
Trenton H
c37ab946e1 Feat: add MetadataEntry TypedDict and extract_metadata to ParserProtocol
- Define MetadataEntry TypedDict (namespace, prefix, key, value) in
  paperless.parsers and export it from __all__
- Add extract_metadata(document_path, mime_type) -> list[MetadataEntry]
  to ParserProtocol; implementations must not raise
- Implement extract_metadata on TextDocumentParser (returns [])
- Update DummyParser fixture in test_registry to include extract_metadata
  and align parse/get_thumbnail signatures with the current Protocol
- Add TestTextParserMetadata tests covering empty-list return and
  mime_type-agnostic behaviour

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 16:07:10 -07:00
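
Illustrative sketch for c37ab946e1 (not the project's code): the four field names come from the commit message, but the str value types, the free-function form, and the Path argument type are assumptions made here for illustration.

```python
from pathlib import Path
from typing import TypedDict


class MetadataEntry(TypedDict):
    # Field names are taken from the commit message; the str types are assumed.
    namespace: str
    prefix: str
    key: str
    value: str


def extract_metadata(document_path: Path, mime_type: str) -> list[MetadataEntry]:
    # Plain text has no structured metadata, so return an empty list for
    # any MIME type rather than raising.
    return []
```

A parser for a richer format would return one MetadataEntry per key it finds, while keeping the rule that the method must not raise.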
Trenton H
82068303d0 Use the main version as the built-in parser version 2026-03-09 15:40:28 -07:00
Trenton H
cc8e9a7108 Fix: type ParserRegistry lists and methods as type[ParserProtocol]
_builtins, _external, register_builtin, and get_parser_for_file were
typed as plain `type`, giving mypy no way to verify that supported_mime_types
and score exist on the stored classes.  Using type[ParserProtocol] throughout
resolves the attr-defined errors and makes the registry's type contract
explicit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 15:30:09 -07:00
Trenton H
1870f69053 Fix: use Self as __enter__ return type in TextDocumentParser
Returning the concrete class name would give callers the wrong type if
the class is ever subclassed.  Self resolves to the actual runtime type,
matching the ParserProtocol declaration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 15:25:54 -07:00
Trenton H
053d590cb8 Fix: align ParserProtocol.__exit__ exc_tb type with TextDocumentParser
Both now use TracebackType | None instead of object.  The Protocol's
object annotation was overly broad — Python only ever passes TracebackType
or None as the third argument to __exit__, and the narrower type is
required for pyrefly's contravariant parameter check to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 15:24:49 -07:00
Trenton H
987aa363dc Chore: use text_parser fixture instead of direct instantiation in tests
Tests that were using `with TextDocumentParser() as parser:` inline now
receive the parser via the text_parser fixture.  The two lifecycle tests
that must control instantiation directly (cleanup and exception cleanup)
are intentionally left unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 15:23:01 -07:00
Trenton H
b8f63026f7 Chore: reorganise parser tests and samples into sub-directories
- Move text sample files into tests/samples/text/ so each parser type
  has its own folder as more parsers are migrated
- Move test_text_parser.py into tests/parsers/ sub-package (new __init__.py)
- Split conftest.py: top-level keeps clean_registry + samples_dir; new
  parsers/conftest.py holds text_samples_dir, sample_txt_file,
  malformed_txt_file, and text_parser fixtures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 14:38:23 -07:00
Trenton H
3a232f0c8f Feature: Phase 3 — migrate TextDocumentParser to ParserProtocol
- Add paperless/parsers/text.py: standalone TextDocumentParser implementing
  ParserProtocol (no inheritance from old DocumentParser ABC); uses __enter__/
  __exit__ for tempdir lifecycle, score()-based MIME registration
- Register TextDocumentParser in ParserRegistry.register_defaults()
- Add paperless/tests/conftest.py: session-scoped sample_dir, sample_txt_file,
  malformed_txt_file fixtures; function-scoped text_parser fixture using the
  context-manager protocol; autouse clean_registry fixture (moved from
  test_registry.py to avoid duplication)
- Add paperless/tests/test_text_parser.py: 20 tests covering protocol
  compliance, lifecycle/cleanup, parse, thumbnail, and registry integration
- Copy sample files (test.txt, decode_error.txt) to paperless/tests/samples/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 14:34:08 -07:00
Trenton H
404ef6b40d Formatting 2026-03-09 14:25:33 -07:00
Trenton H
8c40491034 Refactor: Clean up ParserProtocol docstrings and drop file_name parameter
- Remove all Sphinx cross-reference markup (:meth:, :class:, :func:,
  :attr:, :data:, backtick quoting) from registry.py and __init__.py
  docstrings; use plain prose matching the rest of the codebase
- Remove unused file_name parameter from parse() and get_thumbnail()
  in ParserProtocol — no existing parser reads it and the path already
  carries the filename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 14:09:32 -07:00
Trenton H
0f6bdaf5de Feature: Add parser plugin registry and ParserProtocol (Phase 1 & 2)
Introduces the foundation of the entrypoint-based parser discovery
system to replace the signal-based document_consumer_declaration approach.

- Add ParserProtocol: runtime_checkable Protocol defining the full
  contract for document parsers (supported_mime_types, score, parse,
  context manager, result accessors)
- Add ParserRegistry: lazy singleton with entrypoint discovery via
  importlib.metadata group 'paperless_ngx.parsers', uniform score-based
  selection across external and built-in parsers
- Add get_parser_registry(), init_builtin_parsers(), reset_parser_registry()
  module-level helpers
- Wire Celery worker_process_init to call init_builtin_parsers() eagerly
  in each worker, deferring third-party discovery to first task use
- Add 28 pytest tests covering Protocol compliance, singleton lifecycle,
  scoring logic, entrypoint discovery, and log output

Built-in parsers and consumer migration follow in Phases 3-6.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 13:54:52 -07:00
Trenton H
bcc2f11152 Performance: Stream JSON during import for memory improvements (#12276)
* Perf: stream manifest parsing with ijson in document_importer

Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.

- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
  temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
  fraction of manifest) for the tqdm progress bar

Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Perf: slim dict in _import_files_from_manifest, discard fields

When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).

Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
2026-03-09 10:20:48 -07:00
Trenton H
e30676f889 Feature: Migrate import/export to rich progress (#12260)
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()

Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses

Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 08:59:17 -07:00
GitHub Actions
4badf0e7c2 Auto translate strings 2026-03-09 01:52:08 +00:00
Paul Gessinger
bc26d94593 Chore: Add saved view compatibility in API version 9 (#12280)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-08 18:50:31 -07:00
Trenton H
2cdb1424ef Performance: Further export memory improvements (#12273)
* Perf: streaming manifest writer for document exporter (Phase 3)

Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.

Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
  compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
  bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
  accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml

Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption

Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 14:24:50 -08:00
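
Illustrative sketch for the CRYPT_FIELDS_BY_MODEL change in 2cdb1424ef (not the project's code): a lookup table derived once from a list, so each record costs one dict lookup instead of a scan. The shape of CRYPT_FIELDS, the model names, the record layout, and the encrypt callable are all hypothetical.

```python
from collections.abc import Callable

# Hypothetical shape: one entry per model with the fields that hold secrets.
CRYPT_FIELDS = [
    {"model_name": "app.account", "fields": ["password"]},
    {"model_name": "app.oauthapp", "fields": ["secret", "key"]},
]

# Built once; maps a model name straight to its secret field names.
CRYPT_FIELDS_BY_MODEL = {
    entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS
}


def encrypt_record_inline(record: dict, encrypt: Callable[[str], str]) -> dict:
    # A single dict lookup replaces a loop-and-break over CRYPT_FIELDS.
    secret_fields = CRYPT_FIELDS_BY_MODEL.get(record["model"])
    if secret_fields is None:
        return record  # this model has nothing to encrypt
    for name in secret_fields:
        value = record["fields"].get(name)
        if value:
            record["fields"][name] = encrypt(value)
    return record
```

The record is modified in place and returned, which suits a writer that handles one record at a time.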
Trenton H
f5c0c21922 Chore: Lazy imports of the heavy AI modules (#12275) 2026-03-07 12:53:22 -08:00
Trenton H
9d5e618de8 Chore: pytest style paperless tests (#12254) 2026-03-06 13:04:23 -08:00
GitHub Actions
7345f2e81c Auto translate strings 2026-03-06 20:01:12 +00:00
shamoon
731448a8f9 Fixhancement: support version-specific edits (#12233) 2026-03-06 11:59:26 -08:00
shamoon
24a2cfd957 Change: use explicit doc creation instead of clone for versions (#12226) 2026-03-04 15:57:44 -08:00
GitHub Actions
7cf2ef6398 Auto translate strings 2026-03-04 23:29:54 +00:00
shamoon
df03207eef Fix: correct doc version filename handling (#12223) 2026-03-04 23:28:07 +00:00
Trenton H
1e21bcd26e Breaking: Drop support for Python 3.10 (#12234) 2026-03-04 15:03:33 -08:00
Trenton H
a9cb89c633 Enhancement: Improve exporter memory efficiency (#12236)
Phase 1 -- Eliminate JSON round-trip in document exporter

Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.

Phase 2 -- Batched QuerySet serialization in document exporter

Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
2026-03-04 14:54:20 -08:00
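
Illustrative sketch for Phase 2 of a9cb89c633 (not the project's code): chunking any iterable with itertools.islice so only one batch is held at a time. The name iter_batches and the default batch size are hypothetical; in the exporter the input would be QuerySet.iterator() and each batch would be serialized before the next is fetched.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int = 500) -> Iterator[list[T]]:
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    # islice pulls at most batch_size items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

For example, list(iter_batches(range(7), 3)) gives [[0, 1, 2], [3, 4, 5], [6]].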
GitHub Actions
a37e24c1ad Auto translate strings 2026-03-04 22:17:32 +00:00
shamoon
85a18e5911 Enhancement: saved view sharing (#12142) 2026-03-04 14:15:43 -08:00
GitHub Actions
ae182c459b Auto translate strings 2026-03-04 21:34:02 +00:00
shamoon
d51a118aac Merge branch 'main' into dev 2026-03-04 13:31:20 -08:00
shamoon
8f311c4b6b Bump version to 2.20.10 2026-03-04 10:38:14 -08:00
shamoon
f25322600d Merge branch 'release/v2.20.x' 2026-03-04 10:09:01 -08:00
shamoon
615f27e6fb Fix: support string coercion in filepath jinja templates (#12244) 2026-03-04 08:32:34 -08:00
Andreas Schneider
190fc70288 Fix: use maxsplit=1 in Redis URL parsing to handle URLs with multiple colons (#12239) 2026-03-04 01:06:51 -08:00
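
Illustrative sketch for 190fc70288 (not the project's code): how maxsplit=1 keeps a two-way unpack safe when the remainder of a URL contains further colons. The helper name split_scheme and the sample URL are hypothetical; the log does not show the actual parsing code.

```python
def split_scheme(url: str) -> tuple[str, str]:
    # Split on the first colon only, so colons later in the URL stay in `rest`.
    # A string with no colon at all still raises ValueError on unpacking.
    scheme, rest = url.split(":", maxsplit=1)
    return scheme, rest
```

Without maxsplit, "unix:///var/run/redis:cache.sock".split(":") yields three parts, and unpacking them into two names raises ValueError.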
shamoon
5b809122b5 Fix: apply ordering after annotating tag document count (#12238) 2026-03-04 00:33:13 -08:00
GitHub Actions
c623234769 Auto translate strings 2026-03-04 00:29:21 +00:00
shamoon
299dac21ee Enhancement: “live” document updates (#12141) 2026-03-04 00:27:07 +00:00
Trenton H
5498503d60 Chore: Improve user migration path (#12232)
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-03 15:51:48 -08:00
shamoon
8b8307571a Fix: enforce path limit for db filename fields (#12235) 2026-03-03 13:19:56 -08:00
Trenton H
43406f44f2 Feature: Improve the retagger output using rich (#12194) 2026-03-03 07:14:59 -08:00
Trenton H
e58a35d40c Feature: Transition sanity check to rich and improve output (#12182) 2026-03-02 10:53:39 -08:00
Trenton H
20a9cd40e8 Feature: Switch all indexing to use rich (#12193) 2026-03-02 10:41:09 -08:00
GitHub Actions
62efb4078f Auto translate strings 2026-03-02 16:23:02 +00:00
shamoon
96ac7b2336 Tweak: Ignore version docs for workflows (#12217) 2026-03-02 08:21:14 -08:00
shamoon
95484db71b Update paperless_mail npm packages 2026-03-02 00:12:52 -08:00
shamoon
8e7084eba7 Bump tailwindcss to v3.4.19 2026-03-02 00:12:52 -08:00
GitHub Actions
dd06627e43 Auto translate strings 2026-02-28 10:34:26 +00:00
shamoon
f65807b906 Merge branch 'main' into dev
# Conflicts:
#	docs/setup.md
#	src-ui/src/app/components/manage/document-attributes/management-list/management-list.component.ts
#	src/documents/tests/test_api_documents.py
2026-02-28 02:31:20 -08:00
shamoon
47f9f642a9 Bump version to 2.20.9 2026-02-28 01:35:26 -08:00