Commit Graph

11139 Commits

Author SHA1 Message Date
Trenton H 8ebc24bcfa Fix: update paperless_text signal shim to import from new parser location
paperless_text/parsers.py was moved to paperless/parsers/text.py as part of
the Phase 3 parser migration. Update the signal-based get_parser() factory
to import from the new location and strip the legacy logging_group /
progress_callback kwargs that the new TextDocumentParser no longer accepts.

This shim keeps document consumption functional until Phase 4 replaces the
signal path with the new ParserRegistry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 07:22:22 -07:00
Trenton H d7052b8dee Linting 2026-03-09 20:56:11 -07:00
Trenton H c96e9f5dc7 Feat: add MetadataEntry TypedDict and extract_metadata to ParserProtocol
- Define MetadataEntry TypedDict (namespace, prefix, key, value) in
  paperless.parsers and export it from __all__
- Add extract_metadata(document_path, mime_type) -> list[MetadataEntry]
  to ParserProtocol; implementations must not raise
- Implement extract_metadata on TextDocumentParser (returns [])
- Update DummyParser fixture in test_registry to include extract_metadata
  and align parse/get_thumbnail signatures with the current Protocol
- Add TestTextParserMetadata tests covering empty-list return and
  mime_type-agnostic behaviour

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 20:54:00 -07:00
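Note on c96e9f5dc7: a minimal sketch of the shape this commit describes. The four field names come from the commit message; the str field types and the class name SketchTextParser are assumptions made only for illustration, not the project's actual code.

```python
from pathlib import Path
from typing import TypedDict


class MetadataEntry(TypedDict):
    # Field names are taken from the commit message; str types are assumed.
    namespace: str
    prefix: str
    key: str
    value: str


class SketchTextParser:
    """Hypothetical stand-in for a parser that has no metadata to report."""

    def extract_metadata(
        self,
        document_path: Path,
        mime_type: str,
    ) -> list[MetadataEntry]:
        # Plain text carries no embedded metadata, so the result is always an
        # empty list, whatever MIME type is passed, and nothing is raised.
        return []
```

Returning an empty list rather than raising matches the "implementations must not raise" rule stated in the commit message.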
Trenton H f7f162424b Feature: Phase 3 — migrate TextDocumentParser to ParserProtocol
Implement ParserProtocol on the moved TextDocumentParser without inheriting
from the old DocumentParser ABC:

- Add class-level identity attributes (name, version, author, url)
- Add supported_mime_types() and score() classmethods
- Add can_produce_archive and requires_pdf_rendition properties (both False)
- Replace tempdir / read_file_handle_unicode_errors from old base class with
  a self-contained __init__, __enter__, __exit__, and _read_text helper
- Drop file_name parameter from parse() and get_thumbnail(); add produce_archive kwarg
- Use Self as __enter__ return type; align __exit__ exc_tb type to TracebackType | None
- Register TextDocumentParser in ParserRegistry.register_defaults()

Tests:
- Rewrite test_text_parser.py with 20 tests covering protocol compliance,
  lifecycle/cleanup, parse, thumbnail, and registry integration
- Update parsers/conftest.py with text_parser fixture and sample file fixtures
- Update top-level tests/conftest.py with shared clean_registry autouse fixture

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 20:53:51 -07:00
Trenton H cdeabaf75d Chore: move paperless_text parser and tests to paperless/
Move TextDocumentParser and its test suite from paperless_text/ into the
new paperless/ package where parsers are being consolidated:

- paperless_text/parsers.py → paperless/parsers/text.py
- paperless_text/tests/test_parser.py → paperless/tests/parsers/test_text_parser.py
- paperless_text/tests/conftest.py → paperless/tests/parsers/conftest.py
- paperless_text/tests/samples/*.txt → paperless/tests/samples/text/

Also add paperless/tests/__init__.py, paperless/tests/parsers/__init__.py,
and a new top-level paperless/tests/conftest.py for shared fixtures.

The parser and test files are unchanged; subsequent commits will update
them to implement ParserProtocol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 20:53:20 -07:00
Trenton H 404ef6b40d Formatting 2026-03-09 14:25:33 -07:00
Trenton H 8c40491034 Refactor: Clean up ParserProtocol docstrings and drop file_name parameter
- Remove all Sphinx cross-reference markup (:meth:, :class:, :func:,
  :attr:, :data:, backtick quoting) from registry.py and __init__.py
  docstrings; use plain prose matching the rest of the codebase
- Remove unused file_name parameter from parse() and get_thumbnail()
  in ParserProtocol — no existing parser reads it and the path already
  carries the filename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 14:09:32 -07:00
Trenton H 0f6bdaf5de Feature: Add parser plugin registry and ParserProtocol (Phase 1 & 2)
Introduces the foundation of the entrypoint-based parser discovery
system to replace the signal-based document_consumer_declaration approach.

- Add ParserProtocol: runtime_checkable Protocol defining the full
  contract for document parsers (supported_mime_types, score, parse,
  context manager, result accessors)
- Add ParserRegistry: lazy singleton with entrypoint discovery via
  importlib.metadata group 'paperless_ngx.parsers', uniform score-based
  selection across external and built-in parsers
- Add get_parser_registry(), init_builtin_parsers(), reset_parser_registry()
  module-level helpers
- Wire Celery worker_process_init to call init_builtin_parsers() eagerly
  in each worker, deferring third-party discovery to first task use
- Add 28 pytest tests covering Protocol compliance, singleton lifecycle,
  scoring logic, entrypoint discovery, and log output

Built-in parsers and consumer migration follow in Phases 3-6.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 13:54:52 -07:00
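Note on 0f6bdaf5de: a rough sketch of score-based selection over a runtime-checkable Protocol, as described above. All names here (SketchParserProtocol, SketchRegistry, get_parser_for, the two example parsers) and the simplified score() signature are assumptions; entrypoint discovery and the singleton helpers are left out.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class SketchParserProtocol(Protocol):
    """Reduced, hypothetical version of the parser contract."""

    @classmethod
    def supported_mime_types(cls) -> dict[str, str]: ...

    @classmethod
    def score(cls, mime_type: str) -> int | None: ...


class SketchRegistry:
    """Keeps parser classes and picks the highest-scoring one per MIME type."""

    def __init__(self) -> None:
        self._parsers: list[type] = []

    def register(self, parser_cls: type) -> None:
        # Structural check: the class only needs the Protocol's methods.
        if not issubclass(parser_cls, SketchParserProtocol):
            raise TypeError(f"{parser_cls.__name__} does not satisfy the protocol")
        self._parsers.append(parser_cls)

    def get_parser_for(self, mime_type: str) -> type | None:
        best: type | None = None
        best_score: int | None = None
        for parser_cls in self._parsers:
            if mime_type not in parser_cls.supported_mime_types():
                continue
            score = parser_cls.score(mime_type)
            if score is None:
                continue  # A parser may decline a file it nominally supports.
            if best_score is None or score > best_score:
                best, best_score = parser_cls, score
        return best


class BuiltinTextParser:
    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        return {"text/plain": ".txt"}

    @classmethod
    def score(cls, mime_type: str) -> int | None:
        return 10


class ExternalTextParser(BuiltinTextParser):
    @classmethod
    def score(cls, mime_type: str) -> int | None:
        return 20  # Outranks the built-in parser in this sketch.


registry = SketchRegistry()
registry.register(BuiltinTextParser)
registry.register(ExternalTextParser)
```

Because both parsers go through the same scoring loop, a third-party parser can outrank a built-in one without any special casing, which is the "uniform score-based selection" the commit mentions.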
Trenton H bcc2f11152 Performance: Stream JSON during import for memory improvements (#12276)
* Perf: stream manifest parsing with ijson in document_importer

Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.

- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
  temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
  fraction of manifest) for the tqdm progress bar

Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Perf: slim dict in _import_files_from_manifest, discard fields

When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).

Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
2026-03-09 10:20:48 -07:00
shamoon e18b1fd99d Chore: use unified "gates" for ci tests and docs checks (#12277) 2026-03-09 17:02:34 +00:00
Trenton H e30676f889 Feature: Migrate import/export to rich progress (#12260)
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()

Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses

Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 08:59:17 -07:00
Martin Kleine 2a28549c5a Documentation: Update development commands and pnpm for Angular build commands (#12283)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-09 07:06:16 -07:00
GitHub Actions 4badf0e7c2 Auto translate strings 2026-03-09 01:52:08 +00:00
Paul Gessinger bc26d94593 Chore: Add saved view compatibility in API version 9 (#12280)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-08 18:50:31 -07:00
shamoon 93cbbf34b7 Merge branch 'main' into dev 2026-03-07 23:30:08 -08:00
shamoon 1e8622494d Documentation: remove broken link 2026-03-07 23:29:42 -08:00
GitHub Actions 0c3298f030 Auto translate strings 2026-03-08 03:06:59 +00:00
Sven-Hendrik Haase 2b288c094d Enhancement: Show correspondent in document merge dialog (#12271)
---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-07 19:05:28 -08:00
Trenton H 2cdb1424ef Performance: Further export memory improvements (#12273)
* Perf: streaming manifest writer for document exporter (Phase 3)

Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.

Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
  compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
  bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
  accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml

Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption

Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 14:24:50 -08:00
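Note on 2cdb1424ef: a simplified sketch of the streaming-writer idea, assuming a JSON array written record by record to a .tmp file that replaces the target only on success. The class name SketchStreamingWriter and the demo() helper are hypothetical; the BLAKE2b comparison and per-record encryption from the commit are omitted.

```python
import json
import os
import tempfile
from pathlib import Path


class SketchStreamingWriter:
    """Writes a JSON array incrementally; hypothetical, not the project's class."""

    def __init__(self, target: Path) -> None:
        self._target = target
        self._tmp = target.with_suffix(target.suffix + ".tmp")
        self._file = None
        self._first = True

    def __enter__(self) -> "SketchStreamingWriter":
        self._file = self._tmp.open("w", encoding="utf-8")
        self._file.write("[")
        return self

    def write_record(self, record: dict) -> None:
        # Only the current record is held in memory while it is written.
        if not self._first:
            self._file.write(",\n")
        json.dump(record, self._file)
        self._first = False

    def __exit__(self, exc_type, exc, tb) -> None:
        if exc_type is not None:
            # Discard the partial file so the old target stays untouched.
            self._file.close()
            self._tmp.unlink(missing_ok=True)
            return
        self._file.write("]")
        self._file.close()
        os.replace(self._tmp, self._target)  # Atomic rename on the same filesystem.


def demo() -> tuple[list[dict], bool]:
    """Write two records, then report the parsed result and any leftover .tmp."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "manifest.json"
        with SketchStreamingWriter(target) as writer:
            writer.write_record({"pk": 1})
            writer.write_record({"pk": 2})
        records = json.loads(target.read_text(encoding="utf-8"))
        leftover_tmp = target.with_suffix(".json.tmp").exists()
        return records, leftover_tmp
```

Writing to a sibling .tmp file and renaming at the end means a failed export never leaves a truncated manifest.json behind.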
Trenton H f5c0c21922 Chore: Lazy imports of the heavy AI modules (#12275) 2026-03-07 12:53:22 -08:00
Trenton H 91ddda9256 Fix: Uploaded digest artifact name for Docker build (#12272) 2026-03-06 13:15:45 -08:00
Trenton H 9d5e618de8 Chore: pytest style paperless tests (#12254) 2026-03-06 13:04:23 -08:00
Trenton H 50ae49c7da Chore: Uploads the digests as just files, no zips (#12264) 2026-03-06 12:56:34 -08:00
shamoon ba023ef332 Chore: Add anti-slop job to PR workflow (#12248) 2026-03-06 20:36:24 +00:00
GitHub Actions 7345f2e81c Auto translate strings 2026-03-06 20:01:12 +00:00
shamoon 731448a8f9 Fixhancement: support version-specific edits (#12233) 2026-03-06 11:59:26 -08:00
shamoon 1c2d5483c2 Chore: set fetch depth for bundle analysis (#12257) 2026-03-05 23:54:05 -08:00
shamoon 815e598218 Chore: update ESLint to v10 (#12256) 2026-03-05 22:59:47 -08:00
dependabot[bot] a5a267fe49 Bump django-allauth from 65.14.0 to 65.14.1 (#12253)
Bumps [django-allauth](https://github.com/sponsors/pennersr) from 65.14.0 to 65.14.1.
- [Commits](https://github.com/sponsors/pennersr/commits)

---
updated-dependencies:
- dependency-name: django-allauth
  dependency-version: 65.14.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-05 14:32:04 -08:00
shamoon 24a2cfd957 Change: use explicit doc creation instead of clone for versions (#12226) 2026-03-04 15:57:44 -08:00
GitHub Actions 7cf2ef6398 Auto translate strings 2026-03-04 23:29:54 +00:00
shamoon df03207eef Fix: correct doc version filename handling (#12223) 2026-03-04 23:28:07 +00:00
dependabot[bot] fa998ecd49 Bump django from 5.2.11 to 5.2.12 (#12249)
Bumps [django](https://github.com/django/django) from 5.2.11 to 5.2.12.
- [Commits](https://github.com/django/django/compare/5.2.11...5.2.12)

---
updated-dependencies:
- dependency-name: django
  dependency-version: 5.2.12
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-04 15:16:25 -08:00
Trenton H 1e21bcd26e Breaking: Drop support for Python 3.10 (#12234) 2026-03-04 15:03:33 -08:00
Trenton H a9cb89c633 Enhancement: Improve exporter memory efficiency (#12236)
Phase 1 -- Eliminate JSON round-trip in document exporter

Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.

Phase 2 -- Batched QuerySet serialization in document exporter

Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
2026-03-04 14:54:20 -08:00
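Note on a9cb89c633: the Phase 2 batching can be illustrated with a small generator built on itertools.islice. The name iter_batches is hypothetical, and a plain iterable of dicts stands in for QuerySet.iterator() so the sketch runs without Django.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def iter_batches(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    # islice pulls at most batch_size items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Stand-in for a lazily evaluated source such as a database cursor.
source = ({"pk": i} for i in range(5))
batches = list(iter_batches(source, batch_size=2))
```

Only one batch is materialized at a time, so peak memory scales with batch_size rather than with the total number of records.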
GitHub Actions a37e24c1ad Auto translate strings 2026-03-04 22:17:32 +00:00
shamoon 85a18e5911 Enhancement: saved view sharing (#12142) 2026-03-04 14:15:43 -08:00
GitHub Actions ae182c459b Auto translate strings 2026-03-04 21:34:02 +00:00
shamoon d51a118aac Merge branch 'main' into dev 2026-03-04 13:31:20 -08:00
github-actions[bot] d6a316b1df Changelog v2.20.10 - GHA (#12247)
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
2026-03-04 11:25:44 -08:00
shamoon 8f311c4b6b Bump version to 2.20.10 (tag: v2.20.10) 2026-03-04 10:38:14 -08:00
shamoon f25322600d Merge branch 'release/v2.20.x' 2026-03-04 10:09:01 -08:00
shamoon 615f27e6fb Fix: support string coercion in filepath jinja templates (#12244) 2026-03-04 08:32:34 -08:00
Andreas Schneider 190fc70288 Fix: use maxsplit=1 in Redis URL parsing to handle URLs with multiple colons (#12239) 2026-03-04 01:06:51 -08:00
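Note on 190fc70288: the project's actual parsing code is not shown in this log, so the following is only an illustration of why maxsplit=1 matters when a URL contains more than one colon. The function name split_scheme and the example URL are assumptions.

```python
def split_scheme(url: str) -> tuple[str, str]:
    """Split a URL at its first colon only."""
    # Without maxsplit=1, every colon produces a new part and the
    # two-name unpacking below would raise ValueError.
    scheme, rest = url.split(":", maxsplit=1)
    return scheme, rest


example_url = "redis://user:pass@host:6379"
naive_parts = example_url.split(":")
scheme, rest = split_scheme(example_url)
```

With maxsplit=1 everything after the first colon stays in one piece, so credentials and port numbers no longer break the unpacking.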
shamoon 5b809122b5 Fix: apply ordering after annotating tag document count (#12238) 2026-03-04 00:33:13 -08:00
GitHub Actions c623234769 Auto translate strings 2026-03-04 00:29:21 +00:00
shamoon 299dac21ee Enhancement: “live” document updates (#12141) 2026-03-04 00:27:07 +00:00
Trenton H 5498503d60 Chore: Improve user migration path (#12232)
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-03-03 15:51:48 -08:00
shamoon 8b8307571a Fix: enforce path limit for db filename fields (#12235) 2026-03-03 13:19:56 -08:00
GitHub Actions 16b58c2de5 Auto translate strings 2026-03-03 19:25:03 +00:00