Starting from the moved paperless_text/parsers.py, rewrite TextDocumentParser
to satisfy ParserProtocol without inheriting from the old DocumentParser base:
- Add class-level identity attributes (name, version, author, url)
- Add supported_mime_types() and score() classmethods
- Add can_produce_archive and requires_pdf_rendition properties (both False)
- Replace tempdir / read_file_handle_unicode_errors from the old base class
  with a self-contained __init__, __enter__, __exit__, and _read_text helper
- Drop file_name parameter from parse() and get_thumbnail(); add produce_archive kwarg
- Add extract_metadata() returning [] (plain text has no structured metadata)
- Remove get_settings() (not part of ParserProtocol)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Preserves git history of the original TextDocumentParser implementation.
The file will be edited in the next commit to implement ParserProtocol.
Consumption via the old signal-based system is temporarily broken.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Define MetadataEntry TypedDict (namespace, prefix, key, value) in
paperless.parsers and export it from __all__
- Add extract_metadata(document_path, mime_type) -> list[MetadataEntry]
to ParserProtocol; implementations must not raise
- Implement extract_metadata on TextDocumentParser (returns [])
- Update DummyParser fixture in test_registry to include extract_metadata
and align parse/get_thumbnail signatures with the current Protocol
- Add TestTextParserMetadata tests covering empty-list return and
mime_type-agnostic behaviour
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
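The TypedDict fields are named in the commit; a sketch of the shape and of the never-raise contract on the text parser:

```python
from typing import TypedDict


class MetadataEntry(TypedDict):
    """One structured metadata item extracted from a document."""

    namespace: str
    prefix: str
    key: str
    value: str


def extract_metadata(document_path, mime_type: str) -> list[MetadataEntry]:
    # Plain text carries no embedded metadata; the Protocol contract says
    # implementations must not raise, so an empty list is always safe.
    return []
```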
_builtins, _external, register_builtin, and get_parser_for_file were
typed as plain `type`, giving mypy no way to verify that supported_mime_types
and score exist on the stored classes. Using type[ParserProtocol] throughout
resolves the attr-defined errors and makes the registry's type contract
explicit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returning the concrete class name would give callers the wrong type if
the class is ever subclassed. Self resolves to the actual runtime type,
matching the ParserProtocol declaration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both now use TracebackType | None instead of object. The Protocol's
object annotation was overly broad — Python only ever passes TracebackType
or None as the third argument to __exit__, and the narrower type is
required for pyrefly's contravariant parameter check to pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests that were using `with TextDocumentParser() as parser:` inline now
receive the parser via the text_parser fixture. The two lifecycle tests
that must control instantiation directly (cleanup and exception cleanup)
are intentionally left unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move text sample files into tests/samples/text/ so each parser type
has its own folder as more parsers are migrated
- Move test_text_parser.py into tests/parsers/ sub-package (new __init__.py)
- Split conftest.py: top-level keeps clean_registry + samples_dir; new
parsers/conftest.py holds text_samples_dir, sample_txt_file,
malformed_txt_file, and text_parser fixtures
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove all Sphinx cross-reference markup (:meth:, :class:, :func:,
:attr:, :data:, backtick quoting) from registry.py and __init__.py
docstrings; use plain prose matching the rest of the codebase
- Remove unused file_name parameter from parse() and get_thumbnail()
in ParserProtocol — no existing parser reads it and the path already
carries the filename
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces the foundation of the entrypoint-based parser discovery
system to replace the signal-based document_consumer_declaration approach.
- Add ParserProtocol: runtime_checkable Protocol defining the full
contract for document parsers (supported_mime_types, score, parse,
context manager, result accessors)
- Add ParserRegistry: lazy singleton with entrypoint discovery via
importlib.metadata group 'paperless_ngx.parsers', uniform score-based
selection across external and built-in parsers
- Add get_parser_registry(), init_builtin_parsers(), reset_parser_registry()
module-level helpers
- Wire Celery worker_process_init to call init_builtin_parsers() eagerly
in each worker, deferring third-party discovery to first task use
- Add 28 pytest tests covering Protocol compliance, singleton lifecycle,
scoring logic, entrypoint discovery, and log output
Built-in parsers and consumer migration follow in Phases 3-6.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: stream manifest parsing with ijson in document_importer
Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.
- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
fraction of manifest) for the tqdm progress bar
Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
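The commit uses ijson, which pulls events incrementally from a file handle. As a stdlib-only approximation, json.JSONDecoder.raw_decode can walk a JSON array one record at a time; this sketch shows only the record-at-a-time iteration pattern, not ijson's true incremental file reading:

```python
import json


def iter_json_array(text: str):
    """Yield one object at a time from a JSON array, never building the full list."""
    decoder = json.JSONDecoder()
    idx = text.index("[") + 1
    while True:
        # Skip whitespace and the comma between records
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(text) or text[idx] == "]":
            return
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
```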
* Perf: slim dict in _import_files_from_manifest, discard fields
When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).
Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
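The slimming step amounts to a keyed projection of each record; the filename key names below are placeholders, not the exporter's real constants:

```python
# Hypothetical key names; the real exporter defines its own filename keys
DOCUMENT_KEYS = (
    "pk",
    "__exported_file_name__",
    "__exported_thumbnail_name__",
    "__exported_archive_name__",
)


def slim_document_record(record: dict) -> dict:
    # Keep only what the file-copy loop reads; drop content, checksum, tags, ...
    return {k: record[k] for k in DOCUMENT_KEYS if k in record}
```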
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()
Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses
Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: streaming manifest writer for document exporter (Phase 3)
Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.
Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-document records are streamed during the transaction;
  document records are accumulated and written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml
Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
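A self-contained sketch of the writer's three behaviours named above: incremental record writes to a .tmp file, BLAKE2b comparison to skip unchanged targets, and discard() for the exception path. Method names beyond those are assumptions:

```python
import hashlib
import json
import os
from pathlib import Path


class StreamingManifestWriter:
    """Write JSON records to <target>.tmp one at a time, then rename atomically."""

    def __init__(self, target: Path) -> None:
        self.target = Path(target)
        self.tmp = self.target.parent / (self.target.name + ".tmp")
        self._fh = self.tmp.open("w", encoding="utf-8")
        self._fh.write("[")
        self._first = True

    def write(self, record: dict) -> None:
        if not self._first:
            self._fh.write(",")
        self._first = False
        json.dump(record, self._fh)

    def close(self, compare_existing: bool = False) -> bool:
        """Finish the array; return True if the target was (re)written."""
        self._fh.write("]")
        self._fh.close()
        if compare_existing and self.target.exists():
            old = hashlib.blake2b(self.target.read_bytes()).digest()
            new = hashlib.blake2b(self.tmp.read_bytes()).digest()
            if old == new:
                self.tmp.unlink()  # content unchanged: keep the original file
                return False
        os.replace(self.tmp, self.target)  # atomic on POSIX
        return True

    def discard(self) -> None:
        # On exception: close and remove the partial temp file
        self._fh.close()
        self.tmp.unlink(missing_ok=True)
```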
* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption
Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
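The lookup-table derivation in miniature; the shape of CRYPT_FIELDS and the model/field names are assumptions for illustration:

```python
# Assumed shape: each entry names a model and its sensitive fields
CRYPT_FIELDS = [
    {"model_name": "paperless_mail.mailaccount", "fields": ["password", "refresh_token"]},
    {"model_name": "paperless_mail.mailrule", "fields": []},
]

# Derived once at definition time: one dict lookup replaces the old
# linear scan with its loop-and-break pattern
CRYPT_FIELDS_BY_MODEL = {entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS}


def fields_to_encrypt(record: dict) -> list:
    return CRYPT_FIELDS_BY_MODEL.get(record.get("model", ""), [])
```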
Phase 1 -- Eliminate JSON round-trip in document exporter
Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.
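A stdlib sketch in the spirit of DjangoJSONEncoder, showing why a custom encoder is needed once the "python" serializer hands back native objects instead of pre-stringified JSON:

```python
import json
import uuid
from datetime import date, datetime
from decimal import Decimal


class NativeTypesEncoder(json.JSONEncoder):
    """Handle the native types the 'python' serializer leaves unconverted."""

    def default(self, o):
        if isinstance(o, (datetime, date)):
            return o.isoformat()
        if isinstance(o, Decimal):
            return str(o)  # string form preserves precision
        if isinstance(o, uuid.UUID):
            return str(o)
        return super().default(o)
```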
Phase 2 -- Batched QuerySet serialization in document exporter
Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
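The batching core of such a helper, reduced to pure iterators so it is independent of Django; with a real QuerySet one would presumably feed it qs.iterator(chunk_size=batch_size) so the database cursor streams rows as well:

```python
from itertools import islice
from typing import Iterable, Iterator


def serialize_queryset_batched(records: Iterable, batch_size: int) -> Iterator[list]:
    """Yield records in lists of at most batch_size, one batch resident at a time."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch
```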