Docs: revise export specs per critical review

- Preserve --zip --delete (real tested feature), only --compare-* hard-error
- Mandate .as_posix() arcnames; --data-only, SCRATCH_DIR, atomicity-honesty notes
- Progress tracking stays in command; sinks decoupled from PaperlessCommand
- Compression spec: gate/verify zstd, fix import failure mode, patchable probe seam

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
stumpylog
2026-06-16 08:11:52 -07:00
parent 6030d7069a
commit 9252f073e4
2 changed files with 190 additions and 60 deletions
@@ -76,10 +76,15 @@ These were settled during brainstorming:
1. **Scope is the `document_exporter` command only.** Design the interface so an
S3 sink could be added later; do not refactor `bulk_download` or share bundles.
2. **Incremental options are folder-only, with a hard error on misuse.**
`--compare-checksums`, `--compare-json`, and `--delete` are properties of the
folder sink. Passing any of them together with `--zip` raises a `CommandError`
up front (fail fast, not warn-and-ignore).
2. **`--compare-*` are folder-only (hard error with `--zip`); `--delete` is kept
for both.** `--compare-checksums` / `--compare-json` are genuine no-ops in zip
mode today (the temp dir is always empty, so the compare always copies), so
combining either with `--zip` raises a `CommandError` up front. **`--delete`,
however, is an existing tested feature in zip mode** — it wipes the destination
directory of pre-existing files/dirs before the archive lands
(`test_export_zipped_with_delete`). Its meaning differs by destination: folder
`--delete` prunes stale exported files; zip `--delete` clears the target dir.
Both are preserved — `--delete` is a parameter of _both_ sinks, not an error.
3. **The zip manifest spools to a temp file, not memory.** The sink exposes a
streaming-write handle. The zip sink streams the manifest to a single temp
file in `SCRATCH_DIR` and adds it as the manifest entry at finalize, keeping
@@ -126,8 +131,14 @@ class ExportSink(AbstractContextManager):
**Contract / invariants** (the checklist a future sink author honors):
- `arcname` is relative and POSIX-style; the sink maps it to its own namespace
(folder: joined under the target; zip: the entry name).
- `arcname` is relative and **POSIX-style (forward slashes)**; the sink maps it to
its own namespace (folder: joined under the target; zip: the entry name). The
command must build arcnames with `Path(...).as_posix()``str(Path(...))`
yields backslashes on Windows, which corrupts zip entry names and makes the
manifest's stored paths non-portable. The same string is used both as the sink
key and as the value stored in the manifest (`EXPORTER_FILE_NAME` etc.), so it
must be POSIX at the point of construction. (The share-link bundle path already
uses `.as_posix()`; the document targets currently do not and must be fixed.)
- At most one `stream()` is open at a time. It is the manifest. `add_file` /
`add_json` may be interleaved with an open stream — implementations that can't
interleave a real stream (zip, S3) must spool the stream to a side buffer and
@@ -156,29 +167,44 @@ Owns everything the command currently does for folder mode:
The folder sink is inherently in-place/incremental, not atomic — that is its
nature and is unchanged. Its safety is the per-file `.tmp`+rename it already does.
### `ZipExportSink(target, zip_name)`
### `ZipExportSink(target, zip_name, *, delete)`
- On open: open a `zipfile.ZipFile` at `<target>/<zip_name>.zip.tmp`
(`ZIP_DEFLATED`, `allowZip64=True`).
- On open: ensure `SCRATCH_DIR` exists (`mkdir(parents=True, exist_ok=True)`
today's `handle()` does this before using it; the sink must do it now), then
open a `zipfile.ZipFile` at `<target>/<zip_name>.zip.tmp` (`ZIP_DEFLATED`,
`allowZip64=True`). The `.zip.tmp` lives in the same directory as the final
`.zip` so the finalize rename is atomic (same filesystem).
- `add_file` / `add_json`: write the entry directly, first emitting directory
marker entries for parent paths so every zip viewer shows the folder structure
(today's `_ensure_zip_dirs`).
(today's `_ensure_zip_dirs`). A _flat_ export (no `--use-folder-prefix`, no
nested arcnames) has no parent dirs, so it emits **zero** markers — matching
today's `make_archive` output for flat trees (keeps the `namelist()` count
assertions in `test_export_zipped` valid). Nested/prefixed exports gain marker
entries; any count assertion on those must be audited.
- `stream`: yields a handle writing to a single temp file in `SCRATCH_DIR`.
- `finalize()` (success only): add the spooled manifest temp file as its entry,
close the zip, then atomically rename `.zip.tmp``.zip`.
close the zip, then **if `delete`, wipe the destination directory** of every
pre-existing file/dir except the in-progress `.zip.tmp` and any prior `.zip`
(today's zip `--delete` behavior), then atomically rename `.zip.tmp``.zip`.
- `abort()` (on exception): close the zip, unlink the `.zip.tmp`, delete the
manifest temp file. **A `.zip` therefore exists only after a fully successful
run.**
- Rejects `compare_*` / `delete` — but the command guards this before
constructing the sink (see below), so the sink simply has no such parameters.
run**, and on abort the destination is never wiped.
- Rejects `compare_*` (the command guards this before constructing the sink). It
does **not** reject `delete` — that is a supported zip behavior (see above).
### Command changes (`document_exporter.py`)
- **`handle()`**: validate the target, then _up front_ raise `CommandError` if any
of `--compare-checksums` / `--compare-json` / `--delete` is combined with
`--zip`. Construct the appropriate sink. Run the export as
`with sink: self.dump(sink)`. Delete the temp-dir / `shutil.make_archive`
block entirely.
- **`handle()`**: validate the target, then _up front_ raise `CommandError` if
`--compare-checksums` or `--compare-json` is combined with `--zip` (those are
no-ops in zip mode). `--delete` is **not** rejected — it is passed to whichever
sink is built. Construct the appropriate sink (`delete=` passed to both). Run
the export as `with sink: self.dump(sink)`. Delete the temp-dir /
`shutil.make_archive` block entirely.
- **`--data-only`**: unchanged in meaning — it simply skips every `sink.add_file`
call (no document/thumbnail/archive/bundle files) while the manifest stream and
`metadata.json` are still written. Works identically for both sinks; no sink
code is data-only-aware. (`test_export_data_only` and its zip equivalent stay
green.)
- **`dump(sink)`**: destination-agnostic. Builds relative arcnames and calls
`sink.add_file(...)`, `sink.add_json(...)`, and `sink.stream("manifest.json")`.
`self.files_in_export_dir`, `check_and_copy`, `check_and_write_json`, and the
@@ -194,6 +220,18 @@ nature and is unchanged. Its safety is the per-file `.tmp`+rename it already doe
behavior moved into each sink's `stream()`.
- **Crypto / passphrase** handling stays in the command: it transforms record
_contents_ before they reach the sink, which is independent of destination.
- **Progress tracking stays in the command — the sinks know nothing about it.**
`PaperlessCommand.track()` wraps the _document iterable_ in `dump()` and ticks
the Rich bar once per document. That loop stays in the command; each iteration
calls `sink.add_file(...)`, so the per-document progress is preserved
unchanged. The sinks deliberately do **not** depend on `PaperlessCommand`,
`track()`, or Rich — coupling the destination abstraction to the command
framework would defeat the isolation goal and make the sinks impossible to unit
-test without a full command. (A sink is a plain context-managed I/O object; it
is constructed by `handle()` and exercised directly in `test_sinks.py`.) If
finer-grained progress is ever wanted for a single very large file, that is a
future enhancement layered via an optional callback — not a `PaperlessCommand`
dependency, and out of scope here.
### How `--split-manifest` fits (no sink special-casing)
@@ -216,8 +254,8 @@ manifest stream is open_ never collide with it.
```
handle(options)
├─ validate target; reject incremental flags + --zip → CommandError
├─ sink = DirectoryExportSink(...) | ZipExportSink(...)
├─ validate target; reject --compare-* + --zip → CommandError (--delete allowed)
├─ sink = DirectoryExportSink(..., delete=…) | ZipExportSink(..., delete=…)
└─ with FileLock(MEDIA_LOCK), sink:
dump(sink)
├─ with sink.stream("manifest.json") as mh:
@@ -237,12 +275,23 @@ handle(options)
## Error handling & atomicity
- Any exception in `dump()` propagates through `with sink:``__exit__`
`abort()`. Zip: the `.zip.tmp` and the manifest temp file are deleted; **no
`.zip` is produced.** Folder: in-flight `.tmp` files are discarded, existing
files are left intact, and the stale-prune does not run.
- `finalize()` runs only on clean exit. For the zip that is the single
`.zip.tmp``.zip` rename (atomic on the same filesystem). For the folder it
is the optional stale-delete prune.
`abort()`. Zip: the `.zip.tmp` and the manifest temp file are deleted, and the
destination is **not** wiped; **no `.zip` is produced.** Folder: in-flight
`.tmp` files are discarded, existing files are left intact, and the stale-prune
does not run.
- `finalize()` runs only on clean exit, after all contents are written. For the
zip: optionally wipe the destination (`--delete`), then the single `.zip.tmp`
`.zip` rename (atomic on the same filesystem). For the folder: the optional
stale-delete prune.
- **Honest limits of the atomicity guarantee.** The guarantee is "no
_complete-looking_ `.zip` after a failed run," not "no leftovers." If the
process is `SIGKILL`ed or the rename itself fails _after_ the zip is closed, a
`.zip.tmp` may be orphaned — that is the safe direction (no false-complete
`.zip`), but stale `.zip.tmp` files are **not** auto-cleaned on a later run
(matching the prior branch). `KeyboardInterrupt` is a `BaseException` but
`__exit__` still runs, so `abort()` fires normally. The rename being atomic and
these runs not racing each other both rely on `FileLock(settings.MEDIA_LOCK)`,
which serializes exports; concurrent same-`--zip-name` runs are out of scope.
- The `FileLock(settings.MEDIA_LOCK)` wrapping is unchanged.
## Testing
@@ -257,16 +306,25 @@ type annotations; run on the Linux VM):
under `compare_json`; `delete` prunes a snapshot file not written this run and
removes emptied directories; without `delete`, stale files remain.
- **Zip atomicity**: injecting an exception mid-export (via `mocker`) leaves no
`.zip` and no leftover `.zip.tmp`; a clean run yields exactly the `.zip` with
directory marker entries present.
`.zip` and no leftover `.zip.tmp`, and does not wipe the destination even with
`--delete`; a clean run yields exactly the `.zip`. A nested/prefixed export has
directory marker entries; a flat export has none.
- **Zip `--delete`**: a clean `--zip --delete` run wipes pre-existing
files/dirs in the destination and produces the `.zip` (preserves
`test_export_zipped_with_delete`).
- **POSIX arcnames**: nested arcnames are stored with forward slashes in both the
zip entry names and the manifest values, regardless of host OS (guards the
Windows backslash bug).
- **`--data-only`**: both sinks produce only `manifest.json` + `metadata.json`,
no document files.
- **Stream contract**: opening a second concurrent `stream()` is rejected;
`add_file`/`add_json` while a stream is open succeed.
- **Command guard**: `--zip` with each of `--compare-checksums` / `--compare-json`
/ `--delete` raises `CommandError`.
- **Command guard**: `--zip` with `--compare-checksums` or `--compare-json`
raises `CommandError`; `--zip --delete` does **not** error.
Existing `test_management_exporter.py` and `test_management_importer.py` stay
green unchanged — the export's external behavior (layout, manifest, round-trip
import) is preserved.
import, `--zip --delete`, `--data-only`) is preserved.
## Risks
@@ -38,8 +38,14 @@ In scope:
`--zip-compression-level`, valid only with `--zip`.
- Validation: method availability, level range per method, and the
requires-`--zip` guard.
- Import-side compatibility notes/tests (the importer already decompresses
- Import-side: a pre-extract support check in `document_importer` that turns an
unsupported codec into a clear `CommandError` (the importer otherwise decompresses
transparently via `ZipFile.extractall`).
- Docs: add both flags and the zstd-portability caveat to `docs/administration.md`
(the `document_exporter` option list, lines ~257-270 and the `-z`/`-zn` section,
lines ~328-330). New flags are long-form only (`--zip-compression`,
`--zip-compression-level`) — no short aliases, to avoid `-zc`/`-zl` collisions
with the existing `-z`/`-zn`.
Out of scope:
@@ -52,12 +58,16 @@ Out of scope:
### `ZipExportSink` changes
The base sink's signature is `ZipExportSink(target, zip_name, *, delete)`; this
adds two keyword-only params after `delete`:
```python
def __init__(
self,
target: Path,
zip_name: str,
*,
delete: bool = False,
compression: int = zipfile.ZIP_DEFLATED,
compresslevel: int | None = None,
) -> None:
@@ -73,59 +83,110 @@ def __init__(
```
`ZipFile` applies `compression`/`compresslevel` as the default for every
`write`/`writestr`, so `add_file` / `add_json` / the manifest entry need no
changes. Directory marker entries are unaffected (they carry no data).
`write`/`writestr` (verified: a `ZipFile(..., compression=ZIP_BZIP2)` yields
entries with `compress_type == ZIP_BZIP2` without per-call args), so `add_file` /
`add_json` / the manifest entry need no changes. Directory marker entries are
empty so their compressed payload is zero, but they are still _tagged_ with the
chosen `compress_type` — harmless, but tests that read `infolist()` should filter
or account for marker entries (see Testing).
### CLI flags (`document_exporter`)
- `--zip-compression {stored,deflated,bzip2,lzma}` — and `zstd` **when the
runtime supports it** (see below). Maps to the matching `zipfile.ZIP_*`
constant. Default `deflated`.
- `--zip-compression-level N` — integer. Meaningful for `deflated` (09) and
`bzip2` (19); ignored by `stored` and `lzma`; for `zstd` the accepted range
follows the stdlib (negative … 22). Default: unset → library default.
- `--zip-compression-level N` — integer. Meaningful for `deflated` (09; `zlib`
also accepts `-1` meaning "default", which is the same as omitting the flag /
`compresslevel=None`) and `bzip2` (19; `0` is invalid for bzip2). For `zstd`,
**do not hardcode a numeric range** — defer to the stdlib and surface whatever
it raises (the accepted zstd range has shifted across libzstd versions; see the
zstd section). Combining the flag with `stored` or `lzma` (which ignore level)
is a `CommandError`, not a silent accept — consistent with the base refactor's
fail-fast posture. Default: unset → library default.
Both flags require `--zip`; passing either without `--zip` raises a
`CommandError`, matching the incremental-flag rule from the base refactor.
**Why validate up front (not let `zipfile` raise):** an invalid level does _not_
fail at `ZipFile(...)` construction — it fails at the **first `write`/`writestr`
call**, with an opaque message (`ValueError: Invalid initialization option` for
deflated > 9, or `compresslevel must be between 1 and 9` for bzip2), and inside a
`with ZipFile(...)` block that error is then _masked_ by a secondary
`ValueError: Can't close the ZIP file while there is an open writing handle` on
context exit. Up-front validation turns that into a clean `CommandError`.
### Validation (in `handle()`, before constructing the sink)
1. **Requires `--zip`.** Either flag without `--zip``CommandError`.
2. **Method availability.** `bzip2`/`lzma` need the `bz2`/`lzma` stdlib modules
(normally present, but optional in some Python builds); `zstd` needs
Python 3.14+. If the chosen method's module is unavailable, raise a
`CommandError` naming the missing capability rather than failing deep in
`zipfile`.
2. **Method availability — via a named, patchable seam.** Expose a module-level
helper (e.g. `compression_available(method) -> bool` or a capability map) that
does `try: import bz2 / import lzma except ImportError: return False`**not**
`importlib.util.find_spec`, which can report a stdlib C-extension as present
when importing it actually fails. For `zstd`, the probe must also confirm the
codec is usable on 3.14+ (constant present _and_ `compression.zstd`
importable), not merely that the constant exists. Making this a named function
is also what lets the test patch "method unavailable" with `mocker`. If the
chosen method is unavailable, raise a `CommandError` naming the missing
capability.
3. **Level range.** Reject an out-of-range `--zip-compression-level` for the
chosen method with a clear message; accept-and-note that it is ignored for
`stored`/`lzma`.
chosen method with a clear `CommandError`; reject the flag entirely for
`stored`/`lzma` (see above).
### zstd (Python 3.14+)
Python 3.14 adds Zstandard support to `zipfile` (`ZIP_ZSTANDARD`, backed by the
new `compression.zstd` module, PEP 784). Gate it at runtime:
> ⚠️ **VERIFY before coding.** The following zstd specifics are stated from the
> assistant's knowledge of the 3.14 reorganization and were **not** verifiable in
> the dev environment (it runs 3.11, where `ZIP_ZSTANDARD` does not exist).
> Confirm each against the real Python 3.14 / PEP 784 docs before implementing:
> the exact constant name (`ZIP_ZSTANDARD` vs `ZIP_ZSTD`), the backing module
> path (`compression.zstd`), that `zipfile` actually routes zstd through it, and
> the accepted `compresslevel` bounds.
Python 3.14 is expected to add Zstandard support to `zipfile`
(`ZIP_ZSTANDARD`, backed by the new `compression.zstd` module, PEP 784). Gate it
at runtime so nothing zstd-related is imported or referenced on < 3.14 (the
project targets Python ≥ 3.11):
```python
_ZSTD = getattr(zipfile, "ZIP_ZSTANDARD", None) # None before 3.14
```
Only include `zstd` in the `--zip-compression` choices when `_ZSTD is not None`.
Because the project targets Python ≥3.11, the code must not import or reference
`compression.zstd` unconditionally.
Presence of the _constant_ does not guarantee the _codec_ is usable (same stripped
-build risk as `bz2`/`lzma`), so the availability probe (validation step 2) must
attempt the codec, not just check the constant. Do **not** hardcode a zstd level
range — pass the user's level through and let the stdlib raise, mapping any error
to a `CommandError`.
Keep `zstd` in the `--zip-compression` `choices` **always** (even on < 3.14), and
reject it in validation with a friendly "zstd requires Python 3.14+" message. If
it were dropped from `choices` on older runtimes, argparse would emit a generic
"invalid choice" that reads as though the option never existed — worse UX.
### Import-side compatibility
`document_importer` reads zips with `ZipFile(self.source).extractall(...)`
(`document_importer.py:453`), which decompresses each entry transparently using
whatever method it was stored with — **provided the matching module exists on the
importing machine.** Consequences to document (and test):
importing machine.**
The failure mode when it doesn't is unfriendly and must be handled: a zstd (or
otherwise unsupported) entry raises a bare `NotImplementedError` **per-entry,
during `extractall`** — _not_ at `ZipFile(self.source)` open, and `is_zipfile()`
still returns true (a zstd archive is a valid zip container). So the importer
enters the zip branch, creates its temp dir, may partially extract other entries,
then blows up mid-extract with no context. **Mitigation (in scope here):** before
extracting, inspect `ZipFile(self.source).infolist()` compress types and, if any
is unsupported on this runtime, raise a `CommandError` naming the method and the
requirement (e.g. "this archive uses zstd, which needs Python 3.14+") instead of
letting `NotImplementedError` escape.
Per-method summary (document in help text + `administration.md`):
- `deflated`/`stored`: universally importable.
- `bzip2`/`lzma`: importable wherever the `bz2`/`lzma` modules are present
(essentially always).
- `zstd`: importable only on Python 3.14+. An archive compressed with `zstd` is
**not** importable on older runtimes. Call this out in the docs/help text for
the flag so users don't pick an archive format their import target can't read.
**not** importable on older runtimes.
## Testing
@@ -135,14 +196,25 @@ New cases in the sink tests and an export→import round-trip
- **Round-trip per method.** Parametrize over the available methods (skip `zstd`
below 3.14, skip `bzip2`/`lzma` if the module is somehow absent): export a
small library, import it back, assert documents/manifest match.
- **Method is applied.** Assert each written entry's `compress_type` equals the
requested method (read back via `ZipFile.infolist()`).
- **Level affects size.** For `deflated`, assert level 9 produces an archive no
larger than level 1 on compressible content (loose `<=` to avoid flakiness).
- **Method is applied.** Assert each written _file_ entry's `compress_type`
equals the requested method (read back via `ZipFile.infolist()`), filtering out
directory marker entries (which are tagged but empty).
- **Level affects size — robustly.** Do **not** compare deflate level 9 vs 1
(on small or incompressible fixtures level 9 can equal or slightly exceed level
1, causing flaky CI). Instead assert that a compressing method on a
moderately-compressible fixture yields a total smaller than `stored`
(`ZIP_STORED`), which is a stable invariant.
- **Validation.** Each flag without `--zip``CommandError`; out-of-range level
`CommandError`; unavailable method (simulate via `mocker` patching the
capability probe) → `CommandError`.
- **Default unchanged.** Omitting both flags yields entries with
(`--zip-compression-level 99`) → a clean `CommandError` from validation
(asserting we never reach the `writestr` that would raise the masked
`ValueError`); `--zip-compression-level` with `stored`/`lzma``CommandError`;
unavailable method (patch the named availability seam with `mocker`) →
`CommandError`; on < 3.14, `--zip-compression zstd` → the friendly
"requires 3.14+" `CommandError`.
- **Import pre-check.** An archive containing an unsupported compress type
produces a `CommandError` from the importer naming the method, not a raw
`NotImplementedError` (simulate by patching the importer's support probe).
- **Default unchanged.** Omitting both flags yields file entries with
`compress_type == ZIP_DEFLATED`, identical to pre-feature behavior.
## Risks