- paperless_remote/signals.py: import from paperless.parsers.remote
(new location after git mv). supported_mime_types() is now a
classmethod that always returns the full set, so get_supported_mime_types()
in the signal layer explicitly checks RemoteEngineConfig validity and
returns {} when unconfigured — preserving the old behaviour where an
unconfigured remote parser does not register for any MIME types.
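The guard described above can be sketched as follows; `RemoteEngineConfig` here is a minimal stand-in with a validity check, and the MIME map is abbreviated:

```python
from dataclasses import dataclass


@dataclass
class RemoteEngineConfig:
    # Stand-in for the real config class; field names are assumptions.
    engine: str = ""
    api_key: str = ""

    def engine_is_valid(self) -> bool:
        return bool(self.engine and self.api_key)


def get_supported_mime_types(config: RemoteEngineConfig) -> dict:
    # An unconfigured remote parser must not register for any MIME types,
    # even though supported_mime_types() now always returns the full set.
    if not config.engine_is_valid():
        return {}
    return {"application/pdf": ".pdf", "image/png": ".png"}  # full set elided
```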
- documents/consumer.py: extend the _parser_cleanup() shim, parse()
dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
alongside TextDocumentParser. Both new-style parsers use __exit__
for cleanup and take (document_path, mime_type) without a file_name
argument.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- make_azure_mock moved from conftest.py back into test_remote_parser.py;
it is specific to that module and does not belong in shared fixtures
- azure_client fixture composes azure_settings + make_azure_mock + patch
in one step; tests no longer repeat the mocker.patch call or carry an
unused azure_settings parameter
- failing_azure_client fixture similarly composes azure_settings + patch
with a RuntimeError side effect; TestRemoteParserParseError now only
receives the mock it actually uses
- All @pytest.mark.parametrize calls use pytest.param with explicit ids
(pdf, png, jpeg, ...) for readable test output
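The mock factory at the heart of these fixtures can be sketched with stdlib `unittest.mock`; the Azure client method shape (`begin_analyze_document` returning a poller whose `.result()` yields the analysis) and the patch target are assumptions for illustration:

```python
from unittest import mock


def make_azure_mock(text="This is a test document."):
    # Build a fake Azure client whose analyze call yields `text`.
    result = mock.Mock()
    result.content = text
    client = mock.Mock()
    client.begin_analyze_document.return_value.result.return_value = result
    return client


# In conftest.py this would be composed into one fixture, roughly:
#
# @pytest.fixture
# def azure_client(azure_settings, make_azure_mock, mocker):
#     client = make_azure_mock()
#     mocker.patch(
#         "paperless.parsers.remote.DocumentAnalysisClient",  # assumed target
#         return_value=client,
#     )
#     return client
```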
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture
in conftest.py; tests call `make_azure_mock()` or
`make_azure_mock("custom text")` instead of a module-level function
- `azure_settings` and `no_engine_settings` applied via
`@pytest.mark.usefixtures` wherever their value is not referenced
inside the test body; `TestRemoteParserParseError` marked at the class
level since all three tests need the same setting
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrites the remote OCR parser to the new plugin system contract:
- `supported_mime_types()` is now a classmethod that always returns the
full set of 7 MIME types; the old instance-method hack (returning {}
when unconfigured) is removed
- `score()` classmethod returns None when no remote engine is configured
(making the parser invisible to the registry), and 20 when active —
higher than the tesseract default of 10 so the remote engine takes
priority when both are available
- No longer inherits from RasterisedDocumentParser; inherits no parser
class at all — just implements the protocol directly
- `can_produce_archive = True`; `requires_pdf_rendition = False`
- `_azure_ai_vision_parse()` takes explicit config arg; API client
created and closed within the method
- `get_page_count()` returns the PDF page count for application/pdf,
delegating to the new `get_page_count_for_pdf()` utility
- `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs;
returns [] for all other MIME types
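A sketch of the contract these bullets describe; the concrete MIME list and the `score()` signature are assumptions (seven types matching common remote-OCR-supported formats):

```python
class RemoteDocumentParser:
    # Protocol attributes from the bullets above.
    can_produce_archive = True
    requires_pdf_rendition = False

    @classmethod
    def supported_mime_types(cls):
        # Always the full set; visibility is controlled by score(), not here.
        return {
            "application/pdf",
            "image/png",
            "image/jpeg",
            "image/tiff",
            "image/bmp",
            "image/gif",
            "image/webp",
        }

    @classmethod
    def score(cls, mime_type, *, engine_configured=False):
        # None hides the parser from the registry; 20 outranks the
        # tesseract default of 10 when a remote engine is configured.
        if not engine_configured:
            return None
        return 20
```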
New files:
- `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and
`get_page_count_for_pdf()` utilities (pikepdf-based); both the remote
and tesseract parsers will use these going forward
- `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style
tests using pytest-django `settings` and pytest-mock `mocker` fixtures
- `src/paperless/tests/parsers/conftest.py` — remote parser instance,
sample-file, and settings-helper fixtures
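A hedged sketch of the shared pikepdf utilities; the real implementations in `src/paperless/parsers/utils.py` may differ in the metadata dict shape:

```python
def get_page_count_for_pdf(path):
    import pikepdf  # lazy import: non-PDF code paths never need it

    with pikepdf.open(path) as pdf:
        return len(pdf.pages)


def extract_pdf_metadata(path):
    import pikepdf

    with pikepdf.open(path) as pdf:
        # Shape of the returned entries is an assumption.
        return [
            {"key": str(key), "value": str(value)}
            for key, value in pdf.docinfo.items()
        ]


def extract_metadata(path, mime_type):
    # Only PDFs carry extractable metadata; every other MIME type gets [].
    if mime_type != "application/pdf":
        return []
    return extract_pdf_metadata(path)
```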
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Relocates three files to their new homes in the parser plugin system:
- src/paperless_remote/parsers.py
→ src/paperless/parsers/remote.py
- src/paperless_remote/tests/test_parser.py
→ src/paperless/tests/parsers/test_remote_parser.py
- src/paperless_remote/tests/samples/simple-digital.pdf
→ src/paperless/tests/samples/remote/simple-digital.pdf
Content and imports will be updated in the follow-up commit that
rewrites the parser to the new ParserProtocol interface.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: stream manifest parsing with ijson in document_importer
Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.
- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
fraction of manifest) for the tqdm progress bar
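The real `iter_manifest_records()` relies on ijson; the streaming idea can be illustrated with a stdlib stand-in built on `json.JSONDecoder.raw_decode` (over an in-memory string here for brevity; ijson additionally streams from the file handle so the text itself is never fully resident):

```python
import json


def iter_manifest_records(manifest_text):
    # Yield one record at a time from a JSON array without ever
    # materializing the full list of parsed records.
    decoder = json.JSONDecoder()
    idx = manifest_text.index("[") + 1
    while True:
        # Skip whitespace and commas between records.
        while idx < len(manifest_text) and manifest_text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(manifest_text) or manifest_text[idx] == "]":
            return
        record, idx = decoder.raw_decode(manifest_text, idx)
        yield record
```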
Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: slim dict in _import_files_from_manifest, discard fields
When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).
Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
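The slimming step amounts to a keep-list projection over each record; the exported-filename key names below are assumptions standing in for the exporter's real constants:

```python
# Keep only what the file-copy loop reads; content, checksum, tags, etc.
# from the full fields dict are dropped.
KEEP_KEYS = (
    "pk",
    "__exported_file_name__",       # assumed exporter key names
    "__exported_thumbnail_name__",
    "__exported_archive_name__",
)


def slim_document_record(record):
    return {key: record[key] for key in KEEP_KEYS if key in record}
```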
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()
Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses
Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: streaming manifest writer for document exporter (Phase 3)
Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.
Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml
Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption
Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.
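The derivation can be sketched as follows; the `CRYPT_FIELDS` entry shape (a list of dicts with `model_name` and `fields`) is an assumption based on the scan the mixin previously performed:

```python
CRYPT_FIELDS = [
    {"model_name": "mailaccount", "fields": ["password", "refresh_token"]},
    {"model_name": "socialtoken", "fields": ["token", "token_secret"]},
]

# Built once at class-definition time: model name -> fields, O(1) per record.
CRYPT_FIELDS_BY_MODEL = {
    entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS
}


def fields_to_encrypt(model_name):
    # Replaces the old linear scan + break over CRYPT_FIELDS.
    return CRYPT_FIELDS_BY_MODEL.get(model_name, [])
```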
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 -- Eliminate JSON round-trip in document exporter
Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.
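DjangoJSONEncoder handles the native types the `"python"` serializer emits; this stdlib subclass shows the equivalent conversions without importing Django:

```python
import datetime
import decimal
import json
import uuid


class NativeTypesEncoder(json.JSONEncoder):
    # Mirrors what DjangoJSONEncoder does for the types named above.
    def default(self, o):
        if isinstance(o, (datetime.datetime, datetime.date)):
            return o.isoformat()
        if isinstance(o, decimal.Decimal):
            return str(o)
        if isinstance(o, uuid.UUID):
            return str(o)
        return super().default(o)
```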
Phase 2 -- Batched QuerySet serialization in document exporter
Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
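The batching core is plain itertools and can be shown without Django; with a QuerySet, the `records` argument would be `qs.iterator()` and `serialize` would wrap `serializers.serialize("python", batch)`:

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    # Pull at most batch_size items at a time; memory is bounded by one batch.
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def serialize_iter_batched(records, batch_size=1000, serialize=list):
    # With Django this would be:
    #   for batch in iter_batches(qs.iterator(), batch_size):
    #       yield from serializers.serialize("python", batch)
    for batch in iter_batches(records, batch_size):
        yield from serialize(batch)
```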