Without explicit directory entries, some zip viewers (simpler tools,
web-based viewers) don't show the folder structure when browsing the
archive. Add a _ensure_zip_dirs() helper that writes directory markers
for all parent paths of each file entry, deduplicating via a set.
Uses ZipFile.mkdir() (available since Python 3.11, the project minimum).
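A minimal sketch of the helper's shape (the `arcname`/`seen` parameter
names are illustrative; the real signature lives in the exporter):

```python
import zipfile
from pathlib import PurePosixPath

def _ensure_zip_dirs(zf: zipfile.ZipFile, arcname: str, seen: set[str]) -> None:
    # "a/b/c.pdf" -> write markers for "a" and "a/b", shallowest first,
    # skipping any parent already recorded in `seen`.
    for parent in reversed(PurePosixPath(arcname).parents):
        name = parent.as_posix()
        if name in (".", "/") or name in seen:
            continue
        seen.add(name)
        zf.mkdir(name)  # explicit directory entry; Python 3.11+
```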
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the temp-dir + shutil.make_archive() workaround with direct
zipfile.ZipFile writes. Document files are added via zf.write() and
JSON manifests via zf.writestr()/StringIO buffering, eliminating the
double-I/O and 2x disk usage of the previous approach.
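In sketch form (file and variable names here are placeholders, not the
exporter's real paths):

```python
import io
import zipfile
from pathlib import Path

target = Path("export.zip")
tmp_path = target.with_name(target.name + ".tmp")

source = Path("doc.pdf")
source.write_bytes(b"%PDF-1.4 stand-in")  # placeholder document file

try:
    with zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Document files go from disk straight into the archive.
        zf.write(source, arcname="documents/doc.pdf")
        # Manifests are buffered in memory, then added in one writestr().
        buf = io.StringIO()
        buf.write('[{"model": "documents.document", "pk": 1}]')
        zf.writestr("manifest.json", buf.getvalue())
except BaseException:
    tmp_path.unlink(missing_ok=True)  # never leave a partial archive behind
    raise
else:
    tmp_path.rename(target)  # atomic swap to the final .zip on success
```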
Key changes:
- Removed tempfile.TemporaryDirectory and shutil.make_archive() from handle()
- ZipFile opened on a .tmp path; renamed to final .zip atomically on success;
.tmp cleaned up on failure
- StreamingManifestWriter: zip mode buffers the manifest in io.StringIO
  and writes it to the zip atomically on close(), since zipfile allows
  only one open write handle at a time (see the sketch after this list)
- check_and_copy(): zip mode calls zf.write(source, arcname=...) directly
- check_and_write_json(): zip mode calls zf.writestr(arcname, ...) directly
- files_in_export_dir scan skipped in zip mode (always fresh write)
- --compare-checksums and --compare-json emit warnings when used with --zip
- --delete in zip mode removes pre-existing files from target dir, skipping
the in-progress .tmp and any prior .zip
- Added tests: atomicity on failure, no SCRATCH_DIR usage
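The zip-mode buffering from the StreamingManifestWriter bullet, reduced
to its core (a hypothetical condensed class, not the real implementation):

```python
import io
import zipfile

class ZipManifestBuffer:
    """Accumulate manifest text in memory; write once on close().

    ZipFile permits only one open write handle at a time, so records
    are buffered in a StringIO and hit the archive in a single
    writestr() call.
    """

    def __init__(self, zf: zipfile.ZipFile, arcname: str):
        self.zf = zf
        self.arcname = arcname
        self.buf = io.StringIO()

    def write(self, chunk: str) -> None:
        self.buf.write(chunk)

    def close(self) -> None:
        self.zf.writestr(self.arcname, self.buf.getvalue())
```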
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()
Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.
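The Rich primitive behind track(); the exact wiring of the
PaperlessCommand helper may differ, but the call sites reduce to
roughly this:

```python
from rich.progress import track

documents = range(3)  # stand-in for the exporter's document list
for document in track(documents, description="Exporting", disable=False):
    pass  # per-document work; disable=True maps to --no-progress-bar
```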
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses
Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.
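Illustrative shape of the declarations (PaperlessCommand stubbed here
for brevity; the real base class differs):

```python
from django.core.management.base import BaseCommand

class PaperlessCommand(BaseCommand):  # stub for illustration only
    supports_progress_bar = False
    supports_multiprocessing = False

class Command(PaperlessCommand):
    # Both flags now declared explicitly, even when matching a default.
    supports_progress_bar = True
    supports_multiprocessing = False
```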
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: streaming manifest writer for document exporter (Phase 3)
Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.
Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
  compare for --compare-json, discard() on exception (sketched below)
- Add _encrypt_record_inline(): per-record encryption replacing the
bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml
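A condensed, hypothetical sketch of the writer's streaming path (the
BLAKE2b --compare-json branch is elided):

```python
import json
from pathlib import Path

class StreamingManifestWriter:
    def __init__(self, target: Path):
        self.target = target
        self.tmp = target.with_name(target.name + ".tmp")
        self.fh = self.tmp.open("w", encoding="utf-8")
        self.fh.write("[")
        self.first = True

    def write_records(self, records: list[dict]) -> None:
        # Only the current batch is resident; earlier batches are
        # already on disk in the .tmp file.
        for record in records:
            if not self.first:
                self.fh.write(",")
            json.dump(record, self.fh)
            self.first = False

    def close(self) -> None:
        self.fh.write("]")
        self.fh.close()
        self.tmp.rename(self.target)  # atomic promotion to manifest.json

    def discard(self) -> None:
        # Called on exception: drop the partial .tmp, keep any old file.
        self.fh.close()
        self.tmp.unlink(missing_ok=True)
```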
Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption
Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.
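In outline (the entries below are illustrative; the real CRYPT_FIELDS
structure lives on CryptMixin):

```python
# Before: per record, loop over CRYPT_FIELDS and break on a match.
CRYPT_FIELDS = [
    {"model_name": "paperless_mail.mailaccount", "fields": ["password"]},
]

# After: built once at class definition time, then O(1) per record.
CRYPT_FIELDS_BY_MODEL = {
    entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS
}

fields_to_encrypt = CRYPT_FIELDS_BY_MODEL.get("paperless_mail.mailaccount", [])
```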
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Phase 1 -- Eliminate JSON round-trip in document exporter
Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.
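Roughly (qs stands for any QuerySet being exported):

```python
import json

from django.core import serializers
from django.core.serializers.json import DjangoJSONEncoder

def export_records(qs):
    # Before: records = json.loads(serializers.serialize("json", qs))
    # After: native Python dicts, no intermediate JSON string to parse.
    records = serializers.serialize("python", qs)
    # DjangoJSONEncoder stringifies the datetime/Decimal/UUID values
    # that the "python" serializer leaves as native objects.
    return json.dumps(records, cls=DjangoJSONEncoder)
```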
* Phase 2 -- Batched QuerySet serialization in document exporter
Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
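A plausible shape for the helper (a reconstruction from this
description; the real implementation may differ):

```python
import itertools

from django.core import serializers

def serialize_queryset_batched(qs, batch_size=1000):
    # iterator() streams rows from the DB cursor; islice caps how many
    # are materialized at once, so peak memory is bounded by roughly
    # batch_size * avg_record_size instead of the full QuerySet.
    rows = qs.iterator(chunk_size=batch_size)
    while batch := list(itertools.islice(rows, batch_size)):
        yield from serializers.serialize("python", batch)
```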
* Fix: improve test portability
* Make settings always consistent
* Make a few more tests deterministic wrt settings
* Don't pollute settings for this one
* Fix timezone issue with mail parser
* Update test_parser.py
* Uh, I guess OCR gives variants for this
---------
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>