diff --git a/docs/superpowers/specs/2026-06-16-export-sink-architecture-design.md b/docs/superpowers/specs/2026-06-16-export-sink-architecture-design.md index 517cd243d..bbca20a7f 100644 --- a/docs/superpowers/specs/2026-06-16-export-sink-architecture-design.md +++ b/docs/superpowers/specs/2026-06-16-export-sink-architecture-design.md @@ -76,10 +76,15 @@ These were settled during brainstorming: 1. **Scope is the `document_exporter` command only.** Design the interface so an S3 sink could be added later; do not refactor `bulk_download` or share bundles. -2. **Incremental options are folder-only, with a hard error on misuse.** - `--compare-checksums`, `--compare-json`, and `--delete` are properties of the - folder sink. Passing any of them together with `--zip` raises a `CommandError` - up front (fail fast, not warn-and-ignore). +2. **`--compare-*` are folder-only (hard error with `--zip`); `--delete` is kept + for both.** `--compare-checksums` / `--compare-json` are genuine no-ops in zip + mode today (the temp dir is always empty, so the compare always copies), so + combining either with `--zip` raises a `CommandError` up front. **`--delete`, + however, is an existing tested feature in zip mode** — it wipes the destination + directory of pre-existing files/dirs before the archive lands + (`test_export_zipped_with_delete`). Its meaning differs by destination: folder + `--delete` prunes stale exported files; zip `--delete` clears the target dir. + Both are preserved — `--delete` is a parameter of _both_ sinks, not an error. 3. **The zip manifest spools to a temp file, not memory.** The sink exposes a streaming-write handle. The zip sink streams the manifest to a single temp file in `SCRATCH_DIR` and adds it as the manifest entry at finalize, keeping @@ -126,8 +131,14 @@ class ExportSink(AbstractContextManager): **Contract / invariants** (the checklist a future sink author honors): -- `arcname` is relative and POSIX-style; the sink maps it to its own namespace - (folder: joined under the target; zip: the entry name). +- `arcname` is relative and **POSIX-style (forward slashes)**; the sink maps it to + its own namespace (folder: joined under the target; zip: the entry name). The + command must build arcnames with `Path(...).as_posix()` — `str(Path(...))` + yields backslashes on Windows, which corrupts zip entry names and makes the + manifest's stored paths non-portable. The same string is used both as the sink + key and as the value stored in the manifest (`EXPORTER_FILE_NAME` etc.), so it + must be POSIX at the point of construction. (The share-link bundle path already + uses `.as_posix()`; the document targets currently do not and must be fixed.) - At most one `stream()` is open at a time. It is the manifest. `add_file` / `add_json` may be interleaved with an open stream — implementations that can't interleave a real stream (zip, S3) must spool the stream to a side buffer and @@ -156,29 +167,44 @@ Owns everything the command currently does for folder mode: The folder sink is inherently in-place/incremental, not atomic — that is its nature and is unchanged. Its safety is the per-file `.tmp`+rename it already does. -### `ZipExportSink(target, zip_name)` +### `ZipExportSink(target, zip_name, *, delete)` -- On open: open a `zipfile.ZipFile` at `/.zip.tmp` - (`ZIP_DEFLATED`, `allowZip64=True`). +- On open: ensure `SCRATCH_DIR` exists (`mkdir(parents=True, exist_ok=True)` — + today's `handle()` does this before using it; the sink must do it now), then + open a `zipfile.ZipFile` at `/.zip.tmp` (`ZIP_DEFLATED`, + `allowZip64=True`). The `.zip.tmp` lives in the same directory as the final + `.zip` so the finalize rename is atomic (same filesystem). - `add_file` / `add_json`: write the entry directly, first emitting directory marker entries for parent paths so every zip viewer shows the folder structure - (today's `_ensure_zip_dirs`). + (today's `_ensure_zip_dirs`). A _flat_ export (no `--use-folder-prefix`, no + nested arcnames) has no parent dirs, so it emits **zero** markers — matching + today's `make_archive` output for flat trees (keeps the `namelist()` count + assertions in `test_export_zipped` valid). Nested/prefixed exports gain marker + entries; any count assertion on those must be audited. - `stream`: yields a handle writing to a single temp file in `SCRATCH_DIR`. - `finalize()` (success only): add the spooled manifest temp file as its entry, - close the zip, then atomically rename `.zip.tmp` → `.zip`. + close the zip, then **if `delete`, wipe the destination directory** of every + pre-existing file/dir except the in-progress `.zip.tmp` and any prior `.zip` + (today's zip `--delete` behavior), then atomically rename `.zip.tmp` → `.zip`. - `abort()` (on exception): close the zip, unlink the `.zip.tmp`, delete the manifest temp file. **A `.zip` therefore exists only after a fully successful - run.** -- Rejects `compare_*` / `delete` — but the command guards this before - constructing the sink (see below), so the sink simply has no such parameters. + run**, and on abort the destination is never wiped. +- Rejects `compare_*` (the command guards this before constructing the sink). It + does **not** reject `delete` — that is a supported zip behavior (see above). ### Command changes (`document_exporter.py`) -- **`handle()`**: validate the target, then _up front_ raise `CommandError` if any - of `--compare-checksums` / `--compare-json` / `--delete` is combined with - `--zip`. Construct the appropriate sink. Run the export as - `with sink: self.dump(sink)`. Delete the temp-dir / `shutil.make_archive` - block entirely. +- **`handle()`**: validate the target, then _up front_ raise `CommandError` if + `--compare-checksums` or `--compare-json` is combined with `--zip` (those are + no-ops in zip mode). `--delete` is **not** rejected — it is passed to whichever + sink is built. Construct the appropriate sink (`delete=` passed to both). Run + the export as `with sink: self.dump(sink)`. Delete the temp-dir / + `shutil.make_archive` block entirely. +- **`--data-only`**: unchanged in meaning — it simply skips every `sink.add_file` + call (no document/thumbnail/archive/bundle files) while the manifest stream and + `metadata.json` are still written. Works identically for both sinks; no sink + code is data-only-aware. (`test_export_data_only` and its zip equivalent stay + green.) - **`dump(sink)`**: destination-agnostic. Builds relative arcnames and calls `sink.add_file(...)`, `sink.add_json(...)`, and `sink.stream("manifest.json")`. `self.files_in_export_dir`, `check_and_copy`, `check_and_write_json`, and the @@ -194,6 +220,18 @@ nature and is unchanged. Its safety is the per-file `.tmp`+rename it already doe behavior moved into each sink's `stream()`. - **Crypto / passphrase** handling stays in the command: it transforms record _contents_ before they reach the sink, which is independent of destination. +- **Progress tracking stays in the command — the sinks know nothing about it.** + `PaperlessCommand.track()` wraps the _document iterable_ in `dump()` and ticks + the Rich bar once per document. That loop stays in the command; each iteration + calls `sink.add_file(...)`, so the per-document progress is preserved + unchanged. The sinks deliberately do **not** depend on `PaperlessCommand`, + `track()`, or Rich — coupling the destination abstraction to the command + framework would defeat the isolation goal and make the sinks impossible to unit + -test without a full command. (A sink is a plain context-managed I/O object; it + is constructed by `handle()` and exercised directly in `test_sinks.py`.) If + finer-grained progress is ever wanted for a single very large file, that is a + future enhancement layered via an optional callback — not a `PaperlessCommand` + dependency, and out of scope here. ### How `--split-manifest` fits (no sink special-casing) @@ -216,8 +254,8 @@ manifest stream is open_ never collide with it. ``` handle(options) - ├─ validate target; reject incremental flags + --zip → CommandError - ├─ sink = DirectoryExportSink(...) | ZipExportSink(...) + ├─ validate target; reject --compare-* + --zip → CommandError (--delete allowed) + ├─ sink = DirectoryExportSink(..., delete=…) | ZipExportSink(..., delete=…) └─ with FileLock(MEDIA_LOCK), sink: dump(sink) ├─ with sink.stream("manifest.json") as mh: @@ -237,12 +275,23 @@ handle(options) ## Error handling & atomicity - Any exception in `dump()` propagates through `with sink:` → `__exit__` → - `abort()`. Zip: the `.zip.tmp` and the manifest temp file are deleted; **no - `.zip` is produced.** Folder: in-flight `.tmp` files are discarded, existing - files are left intact, and the stale-prune does not run. -- `finalize()` runs only on clean exit. For the zip that is the single - `.zip.tmp` → `.zip` rename (atomic on the same filesystem). For the folder it - is the optional stale-delete prune. + `abort()`. Zip: the `.zip.tmp` and the manifest temp file are deleted, and the + destination is **not** wiped; **no `.zip` is produced.** Folder: in-flight + `.tmp` files are discarded, existing files are left intact, and the stale-prune + does not run. +- `finalize()` runs only on clean exit, after all contents are written. For the + zip: optionally wipe the destination (`--delete`), then the single `.zip.tmp` → + `.zip` rename (atomic on the same filesystem). For the folder: the optional + stale-delete prune. +- **Honest limits of the atomicity guarantee.** The guarantee is "no + _complete-looking_ `.zip` after a failed run," not "no leftovers." If the + process is `SIGKILL`ed or the rename itself fails _after_ the zip is closed, a + `.zip.tmp` may be orphaned — that is the safe direction (no false-complete + `.zip`), but stale `.zip.tmp` files are **not** auto-cleaned on a later run + (matching the prior branch). `KeyboardInterrupt` is a `BaseException` but + `__exit__` still runs, so `abort()` fires normally. The rename being atomic and + these runs not racing each other both rely on `FileLock(settings.MEDIA_LOCK)`, + which serializes exports; concurrent same-`--zip-name` runs are out of scope. - The `FileLock(settings.MEDIA_LOCK)` wrapping is unchanged. ## Testing @@ -257,16 +306,25 @@ type annotations; run on the Linux VM): under `compare_json`; `delete` prunes a snapshot file not written this run and removes emptied directories; without `delete`, stale files remain. - **Zip atomicity**: injecting an exception mid-export (via `mocker`) leaves no - `.zip` and no leftover `.zip.tmp`; a clean run yields exactly the `.zip` with - directory marker entries present. + `.zip` and no leftover `.zip.tmp`, and does not wipe the destination even with + `--delete`; a clean run yields exactly the `.zip`. A nested/prefixed export has + directory marker entries; a flat export has none. +- **Zip `--delete`**: a clean `--zip --delete` run wipes pre-existing + files/dirs in the destination and produces the `.zip` (preserves + `test_export_zipped_with_delete`). +- **POSIX arcnames**: nested arcnames are stored with forward slashes in both the + zip entry names and the manifest values, regardless of host OS (guards the + Windows backslash bug). +- **`--data-only`**: both sinks produce only `manifest.json` + `metadata.json`, + no document files. - **Stream contract**: opening a second concurrent `stream()` is rejected; `add_file`/`add_json` while a stream is open succeed. -- **Command guard**: `--zip` with each of `--compare-checksums` / `--compare-json` - / `--delete` raises `CommandError`. +- **Command guard**: `--zip` with `--compare-checksums` or `--compare-json` + raises `CommandError`; `--zip --delete` does **not** error. Existing `test_management_exporter.py` and `test_management_importer.py` stay green unchanged — the export's external behavior (layout, manifest, round-trip -import) is preserved. +import, `--zip --delete`, `--data-only`) is preserved. ## Risks diff --git a/docs/superpowers/specs/2026-06-16-export-zip-compression-design.md b/docs/superpowers/specs/2026-06-16-export-zip-compression-design.md index de54fd08b..e73fc3969 100644 --- a/docs/superpowers/specs/2026-06-16-export-zip-compression-design.md +++ b/docs/superpowers/specs/2026-06-16-export-zip-compression-design.md @@ -38,8 +38,14 @@ In scope: `--zip-compression-level`, valid only with `--zip`. - Validation: method availability, level range per method, and the requires-`--zip` guard. -- Import-side compatibility notes/tests (the importer already decompresses +- Import-side: a pre-extract support check in `document_importer` that turns an + unsupported codec into a clear `CommandError` (the importer otherwise decompresses transparently via `ZipFile.extractall`). +- Docs: add both flags and the zstd-portability caveat to `docs/administration.md` + (the `document_exporter` option list, lines ~257-270 and the `-z`/`-zn` section, + lines ~328-330). New flags are long-form only (`--zip-compression`, + `--zip-compression-level`) — no short aliases, to avoid `-zc`/`-zl` collisions + with the existing `-z`/`-zn`. Out of scope: @@ -52,12 +58,16 @@ Out of scope: ### `ZipExportSink` changes +The base sink's signature is `ZipExportSink(target, zip_name, *, delete)`; this +adds two keyword-only params after `delete`: + ```python def __init__( self, target: Path, zip_name: str, *, + delete: bool = False, compression: int = zipfile.ZIP_DEFLATED, compresslevel: int | None = None, ) -> None: @@ -73,59 +83,110 @@ def __init__( ``` `ZipFile` applies `compression`/`compresslevel` as the default for every -`write`/`writestr`, so `add_file` / `add_json` / the manifest entry need no -changes. Directory marker entries are unaffected (they carry no data). +`write`/`writestr` (verified: a `ZipFile(..., compression=ZIP_BZIP2)` yields +entries with `compress_type == ZIP_BZIP2` without per-call args), so `add_file` / +`add_json` / the manifest entry need no changes. Directory marker entries are +empty so their compressed payload is zero, but they are still _tagged_ with the +chosen `compress_type` — harmless, but tests that read `infolist()` should filter +or account for marker entries (see Testing). ### CLI flags (`document_exporter`) - `--zip-compression {stored,deflated,bzip2,lzma}` — and `zstd` **when the runtime supports it** (see below). Maps to the matching `zipfile.ZIP_*` constant. Default `deflated`. -- `--zip-compression-level N` — integer. Meaningful for `deflated` (0–9) and - `bzip2` (1–9); ignored by `stored` and `lzma`; for `zstd` the accepted range - follows the stdlib (negative … 22). Default: unset → library default. +- `--zip-compression-level N` — integer. Meaningful for `deflated` (0–9; `zlib` + also accepts `-1` meaning "default", which is the same as omitting the flag / + `compresslevel=None`) and `bzip2` (1–9; `0` is invalid for bzip2). For `zstd`, + **do not hardcode a numeric range** — defer to the stdlib and surface whatever + it raises (the accepted zstd range has shifted across libzstd versions; see the + zstd section). Combining the flag with `stored` or `lzma` (which ignore level) + is a `CommandError`, not a silent accept — consistent with the base refactor's + fail-fast posture. Default: unset → library default. Both flags require `--zip`; passing either without `--zip` raises a `CommandError`, matching the incremental-flag rule from the base refactor. +**Why validate up front (not let `zipfile` raise):** an invalid level does _not_ +fail at `ZipFile(...)` construction — it fails at the **first `write`/`writestr` +call**, with an opaque message (`ValueError: Invalid initialization option` for +deflated > 9, or `compresslevel must be between 1 and 9` for bzip2), and inside a +`with ZipFile(...)` block that error is then _masked_ by a secondary +`ValueError: Can't close the ZIP file while there is an open writing handle` on +context exit. Up-front validation turns that into a clean `CommandError`. + ### Validation (in `handle()`, before constructing the sink) 1. **Requires `--zip`.** Either flag without `--zip` → `CommandError`. -2. **Method availability.** `bzip2`/`lzma` need the `bz2`/`lzma` stdlib modules - (normally present, but optional in some Python builds); `zstd` needs - Python 3.14+. If the chosen method's module is unavailable, raise a - `CommandError` naming the missing capability rather than failing deep in - `zipfile`. +2. **Method availability — via a named, patchable seam.** Expose a module-level + helper (e.g. `compression_available(method) -> bool` or a capability map) that + does `try: import bz2 / import lzma except ImportError: return False` — **not** + `importlib.util.find_spec`, which can report a stdlib C-extension as present + when importing it actually fails. For `zstd`, the probe must also confirm the + codec is usable on 3.14+ (constant present _and_ `compression.zstd` + importable), not merely that the constant exists. Making this a named function + is also what lets the test patch "method unavailable" with `mocker`. If the + chosen method is unavailable, raise a `CommandError` naming the missing + capability. 3. **Level range.** Reject an out-of-range `--zip-compression-level` for the - chosen method with a clear message; accept-and-note that it is ignored for - `stored`/`lzma`. + chosen method with a clear `CommandError`; reject the flag entirely for + `stored`/`lzma` (see above). ### zstd (Python 3.14+) -Python 3.14 adds Zstandard support to `zipfile` (`ZIP_ZSTANDARD`, backed by the -new `compression.zstd` module, PEP 784). Gate it at runtime: +> ⚠️ **VERIFY before coding.** The following zstd specifics are stated from the +> assistant's knowledge of the 3.14 reorganization and were **not** verifiable in +> the dev environment (it runs 3.11, where `ZIP_ZSTANDARD` does not exist). +> Confirm each against the real Python 3.14 / PEP 784 docs before implementing: +> the exact constant name (`ZIP_ZSTANDARD` vs `ZIP_ZSTD`), the backing module +> path (`compression.zstd`), that `zipfile` actually routes zstd through it, and +> the accepted `compresslevel` bounds. + +Python 3.14 is expected to add Zstandard support to `zipfile` +(`ZIP_ZSTANDARD`, backed by the new `compression.zstd` module, PEP 784). Gate it +at runtime so nothing zstd-related is imported or referenced on < 3.14 (the +project targets Python ≥ 3.11): ```python _ZSTD = getattr(zipfile, "ZIP_ZSTANDARD", None) # None before 3.14 ``` -Only include `zstd` in the `--zip-compression` choices when `_ZSTD is not None`. -Because the project targets Python ≥3.11, the code must not import or reference -`compression.zstd` unconditionally. +Presence of the _constant_ does not guarantee the _codec_ is usable (same stripped +-build risk as `bz2`/`lzma`), so the availability probe (validation step 2) must +attempt the codec, not just check the constant. Do **not** hardcode a zstd level +range — pass the user's level through and let the stdlib raise, mapping any error +to a `CommandError`. + +Keep `zstd` in the `--zip-compression` `choices` **always** (even on < 3.14), and +reject it in validation with a friendly "zstd requires Python 3.14+" message. If +it were dropped from `choices` on older runtimes, argparse would emit a generic +"invalid choice" that reads as though the option never existed — worse UX. ### Import-side compatibility `document_importer` reads zips with `ZipFile(self.source).extractall(...)` (`document_importer.py:453`), which decompresses each entry transparently using whatever method it was stored with — **provided the matching module exists on the -importing machine.** Consequences to document (and test): +importing machine.** + +The failure mode when it doesn't is unfriendly and must be handled: a zstd (or +otherwise unsupported) entry raises a bare `NotImplementedError` **per-entry, +during `extractall`** — _not_ at `ZipFile(self.source)` open, and `is_zipfile()` +still returns true (a zstd archive is a valid zip container). So the importer +enters the zip branch, creates its temp dir, may partially extract other entries, +then blows up mid-extract with no context. **Mitigation (in scope here):** before +extracting, inspect `ZipFile(self.source).infolist()` compress types and, if any +is unsupported on this runtime, raise a `CommandError` naming the method and the +requirement (e.g. "this archive uses zstd, which needs Python 3.14+") instead of +letting `NotImplementedError` escape. + +Per-method summary (document in help text + `administration.md`): - `deflated`/`stored`: universally importable. - `bzip2`/`lzma`: importable wherever the `bz2`/`lzma` modules are present (essentially always). - `zstd`: importable only on Python 3.14+. An archive compressed with `zstd` is - **not** importable on older runtimes. Call this out in the docs/help text for - the flag so users don't pick an archive format their import target can't read. + **not** importable on older runtimes. ## Testing @@ -135,14 +196,25 @@ New cases in the sink tests and an export→import round-trip - **Round-trip per method.** Parametrize over the available methods (skip `zstd` below 3.14, skip `bzip2`/`lzma` if the module is somehow absent): export a small library, import it back, assert documents/manifest match. -- **Method is applied.** Assert each written entry's `compress_type` equals the - requested method (read back via `ZipFile.infolist()`). -- **Level affects size.** For `deflated`, assert level 9 produces an archive no - larger than level 1 on compressible content (loose `<=` to avoid flakiness). +- **Method is applied.** Assert each written _file_ entry's `compress_type` + equals the requested method (read back via `ZipFile.infolist()`), filtering out + directory marker entries (which are tagged but empty). +- **Level affects size — robustly.** Do **not** compare deflate level 9 vs 1 + (on small or incompressible fixtures level 9 can equal or slightly exceed level + 1, causing flaky CI). Instead assert that a compressing method on a + moderately-compressible fixture yields a total smaller than `stored` + (`ZIP_STORED`), which is a stable invariant. - **Validation.** Each flag without `--zip` → `CommandError`; out-of-range level - → `CommandError`; unavailable method (simulate via `mocker` patching the - capability probe) → `CommandError`. -- **Default unchanged.** Omitting both flags yields entries with + (`--zip-compression-level 99`) → a clean `CommandError` from validation + (asserting we never reach the `writestr` that would raise the masked + `ValueError`); `--zip-compression-level` with `stored`/`lzma` → `CommandError`; + unavailable method (patch the named availability seam with `mocker`) → + `CommandError`; on < 3.14, `--zip-compression zstd` → the friendly + "requires 3.14+" `CommandError`. +- **Import pre-check.** An archive containing an unsupported compress type + produces a `CommandError` from the importer naming the method, not a raw + `NotImplementedError` (simulate by patching the importer's support probe). +- **Default unchanged.** Omitting both flags yields file entries with `compress_type == ZIP_DEFLATED`, identical to pre-feature behavior. ## Risks