Pluggable Document Storage Design
Date: 2026-04-23 Status: Approved
Overview
Replace the hardcoded local filesystem storage in paperless-ngx with a pluggable DocumentStorage Protocol. Ship two built-in backends — LocalFilesystemBackend (default, zero config change) and S3CompatibleBackend (supports AWS S3 and any S3-compatible endpoint). Third parties can implement the Protocol to provide their own backends.
Scope
- In scope: original documents, PDF/A archives
- Out of scope: thumbnails (stay on local filesystem, regenerable), consumption directory (stays local)
- Frontend impact: none — S3 is invisible; Django proxies all file access
Protocol
Defined in src/paperless/storage.py:
from typing import IO, Iterable, Protocol, Self  # Self needs Python 3.11+; otherwise typing_extensions.Self

class DocumentStorage(Protocol):
    def __enter__(self) -> Self: ...
    def __exit__(self, exc_type, exc_val, exc_tb) -> None: ...
    def open(self, name: str) -> IO[bytes]: ...
    def save(self, name: str, content: IO[bytes]) -> str: ...  # returns actual name used
    def delete(self, name: str) -> None: ...
    def exists(self, name: str) -> bool: ...
    def move(self, old_name: str, new_name: str) -> None: ...
    def list_files(self, prefix: str = "") -> Iterable[str]: ...
    def size(self, name: str) -> int: ...
name is always the relative key as stored in the DB (e.g. 2024/my-invoice.pdf). All operations including open() must be called within a with storage: block — the context manager handles connection lifecycle and backend-specific cleanup.
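As an illustration of what a third-party backend could look like, here is a minimal sketch of a class that satisfies the Protocol structurally. InMemoryBackend, its prefix argument and the _check helper are hypothetical names invented for this example; the sketch keeps files in a dict and enforces the "inside a with block" rule with a simple flag.

```python
import io
from typing import IO, Iterable


class InMemoryBackend:
    """Hypothetical backend that keeps files in a dict; for illustration only."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix
        self._files: dict[str, bytes] = {}
        self._active = False

    def __enter__(self) -> "InMemoryBackend":
        self._active = True
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self._active = False

    def _check(self) -> None:
        # Enforce the rule that every operation runs inside `with storage:`.
        if not self._active:
            raise RuntimeError("storage used outside of a with block")

    def open(self, name: str) -> IO[bytes]:
        self._check()
        return io.BytesIO(self._files[name])

    def save(self, name: str, content: IO[bytes]) -> str:
        self._check()
        self._files[name] = content.read()
        return name

    def delete(self, name: str) -> None:
        self._check()
        del self._files[name]

    def exists(self, name: str) -> bool:
        self._check()
        return name in self._files

    def move(self, old_name: str, new_name: str) -> None:
        self._check()
        self._files[new_name] = self._files.pop(old_name)

    def list_files(self, prefix: str = "") -> Iterable[str]:
        self._check()
        return [name for name in self._files if name.startswith(prefix)]

    def size(self, name: str) -> int:
        self._check()
        return len(self._files[name])


storage = InMemoryBackend(prefix="originals")
with storage:
    saved = storage.save("2024/my-invoice.pdf", io.BytesIO(b"%PDF-1.7"))
    storage.move(saved, "2024/renamed.pdf")
    names = sorted(storage.list_files("2024/"))
    renamed_size = storage.size("2024/renamed.pdf")
```

Because DocumentStorage is a Protocol, the class does not need to inherit from it; matching method signatures are enough for a type checker.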
Storage Instances
Two module-level singletons in src/paperless/storage.py, each an instance of the configured backend class:
original_storage: DocumentStorage = _build("originals")
archive_storage: DocumentStorage = _build("archive")
_build(prefix) reads PAPERLESS_DOCUMENT_STORAGE_BACKEND and PAPERLESS_DOCUMENT_STORAGE_OPTIONS from settings, then instantiates the backend class with the configured options plus the paperless-controlled prefix. The prefix distinguishes originals from archives within the same bucket or directory root; it is not stored in the DB key.
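The factory described above could be sketched roughly as follows. This is an assumption-laden illustration, not the actual _build: settings are passed in as a plain mapping instead of Django settings, the constructor keyword prefix is a guess at how the prefix is handed over, and types.SimpleNamespace stands in for a backend class so the example runs without any storage code. In Django code, django.utils.module_loading.import_string could replace the load_class helper.

```python
import importlib
from typing import Any, Mapping


def load_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"not a dotted path: {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def build(prefix: str, settings: Mapping[str, Any]) -> Any:
    """Instantiate the configured backend with user options plus the prefix."""
    backend_cls = load_class(settings["PAPERLESS_DOCUMENT_STORAGE_BACKEND"])
    options = dict(settings.get("PAPERLESS_DOCUMENT_STORAGE_OPTIONS") or {})
    # The prefix is controlled by paperless, so it overrides any user value.
    options["prefix"] = prefix
    return backend_cls(**options)


# Stand-in settings: SimpleNamespace accepts arbitrary keyword arguments,
# so it can play the role of a backend class in this demonstration.
demo_settings = {
    "PAPERLESS_DOCUMENT_STORAGE_BACKEND": "types.SimpleNamespace",
    "PAPERLESS_DOCUMENT_STORAGE_OPTIONS": {"bucket_name": "my-docs"},
}
original_storage = build("originals", demo_settings)
archive_storage = build("archive", demo_settings)
```

Both instances share the same options and differ only in the prefix, which matches the idea that the prefix belongs to the storage instance rather than to the DB key.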
Configuration
Two new settings, using the existing key-value dict mechanism:
| Setting | Default | Description |
|---|---|---|
| PAPERLESS_DOCUMENT_STORAGE_BACKEND | paperless.storage.LocalFilesystemBackend | Dotted Python path to any class satisfying DocumentStorage |
| PAPERLESS_DOCUMENT_STORAGE_OPTIONS | {} | Dict of kwargs passed to the backend constructor |
Example — S3-compatible:
PAPERLESS_DOCUMENT_STORAGE_BACKEND=paperless.storage.S3CompatibleBackend
PAPERLESS_DOCUMENT_STORAGE_OPTIONS={"bucket_name": "my-docs", "endpoint_url": "https://s3.wasabi.com", "region_name": "us-east-1", "access_key": "...", "secret_key": "..."}
Existing users set nothing — LocalFilesystemBackend with no options is the default.
Built-in Backends
LocalFilesystemBackend
- __enter__: initialises tracking of directories affected during the context
- __exit__: calls delete_empty_directories() for all tracked dirs; no-op on exception
- open / save / delete / exists / move: direct Path + shutil operations rooted at settings.ORIGINALS_DIR / settings.ARCHIVE_DIR (via the prefix passed by _build)
- move(): shutil.move(), atomic on the same filesystem
- list_files(): Path.rglob("*")
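The directory-tracking behaviour can be sketched as below. LocalDirTracker and _remove_empty_parents are hypothetical names for this example (the real backend is described as calling delete_empty_directories()), and only move() is shown. The sketch assumes names are relative keys that stay under the root.

```python
import shutil
import tempfile
from pathlib import Path


class LocalDirTracker:
    """Hypothetical slice of a local backend: move() plus empty-dir cleanup."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._touched: set[Path] = set()

    def __enter__(self) -> "LocalDirTracker":
        self._touched = set()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        if exc_type is not None:
            return  # leave directories alone when the block failed
        for directory in self._touched:
            self._remove_empty_parents(directory)

    def _remove_empty_parents(self, directory: Path) -> None:
        # Walk upwards, deleting empty directories, but never the root itself.
        while directory != self.root and directory.is_dir():
            if any(directory.iterdir()):
                break
            directory.rmdir()
            directory = directory.parent

    def move(self, old_name: str, new_name: str) -> None:
        source = self.root / old_name
        target = self.root / new_name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        # Remember where the file came from; it may now be an empty directory.
        self._touched.add(source.parent)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2024").mkdir()
    (root / "2024" / "invoice.pdf").write_bytes(b"%PDF")
    with LocalDirTracker(root) as storage:
        storage.move("2024/invoice.pdf", "2025/invoice.pdf")
    old_dir_removed = not (root / "2024").exists()
    new_file_present = (root / "2025" / "invoice.pdf").is_file()
```

Deferring cleanup to __exit__ means callers no longer need their own mkdir or delete_empty_directories calls around each move.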
S3CompatibleBackend
- Wraps django-storages S3 backend (storages.backends.s3boto3.S3Boto3Storage) for open, save, delete, exists
- __enter__: initialises boto3 client/session
- __exit__: no cleanup required (no empty directory concept on S3)
- move(): boto3 copy_object (server-side, no data transfer) + delete_object
- open(): returns streaming S3 response body; caller's with block closes the HTTP connection
- list_files(): S3 list_objects_v2 with prefix
- Works with any S3-compatible endpoint via endpoint_url option
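A minimal sketch of the copy-then-delete move follows. The function name s3_move and the RecordingClient stand-in are hypothetical; the client argument is assumed to be a boto3 S3 client (for example from boto3.client("s3", endpoint_url=...)), and copy_object / delete_object are called with their standard Bucket, Key and CopySource parameters. The fake client only records calls so the example runs without network access.

```python
from typing import Any


def s3_move(client: Any, bucket: str, old_key: str, new_key: str) -> None:
    """Rename an object by copying it server-side and deleting the original."""
    if old_key == new_key:
        return  # nothing to do
    client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # Only reached if the copy call returned without raising.
    client.delete_object(Bucket=bucket, Key=old_key)


class RecordingClient:
    """Stand-in for a boto3 S3 client that only records the calls it gets."""

    def __init__(self) -> None:
        self.calls: list = []

    def copy_object(self, **kwargs: Any) -> None:
        self.calls.append(("copy_object", kwargs))

    def delete_object(self, **kwargs: Any) -> None:
        self.calls.append(("delete_object", kwargs))


client = RecordingClient()
s3_move(client, "my-docs", "originals/2024/a.pdf", "originals/2024/b.pdf")
```

Two caveats apply to this approach: the pair of calls is not atomic, so a failure between them leaves both keys in place, and on AWS S3 a single copy_object call is limited to objects of up to 5 GB, beyond which a multipart copy is needed.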
Data Migration
One Django migration strips the stored prefix from existing rows:
- document.filename: documents/originals/2024/invoice.pdf → 2024/invoice.pdf
- document.archive_filename: documents/archive/2024/invoice.pdf → 2024/invoice.pdf
The prefix is now owned by the storage instance, not the DB key.
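The row transformation itself is simple string handling. The sketch below shows a pure helper that a RunPython step in the migration might call for each row; strip_storage_prefix and the two prefix constants are hypothetical names, with the prefix values taken from the example rows above. Handling of empty values is an assumption, on the basis that a document may have no archive version.

```python
from typing import Optional

ORIGINALS_PREFIX = "documents/originals/"  # from the example rows above
ARCHIVE_PREFIX = "documents/archive/"


def strip_storage_prefix(name: Optional[str], prefix: str) -> Optional[str]:
    """Return name without the leading prefix; leave other values untouched."""
    if not name:
        return name  # nothing stored, e.g. no archive version (assumption)
    if name.startswith(prefix):
        return name[len(prefix):]
    return name  # already stripped, so applying the helper twice is harmless


new_filename = strip_storage_prefix(
    "documents/originals/2024/invoice.pdf", ORIGINALS_PREFIX
)
new_archive_filename = strip_storage_prefix(
    "documents/archive/2024/invoice.pdf", ARCHIVE_PREFIX
)
```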
migrate_storage Management Command
manage.py migrate_storage [--dry-run] [--no-delete]
[--source-backend=<dotted.path>] [--source-options=<json>]
Transfers all document files from one storage backend to another. The user updates PAPERLESS_DOCUMENT_STORAGE_BACKEND in their config first, then runs this command to move existing files.
The destination is always the currently configured backend (from settings). The source is specified via --source-backend / --source-options, defaulting to LocalFilesystemBackend with no options if omitted (covering the most common migration path: local → S3).
Flow:
- Instantiate source backend (from CLI args or default) and destination backend (from current settings)
- Iterate Document.objects.only("filename", "archive_filename")
- For each file (original + archive):
  - Skip with warning if missing from source
  - Skip silently if already present on destination (idempotent, safe to re-run)
  - Copy: destination.save(name, source.open(name))
  - Unless --no-delete: source.delete(name)
- Report counts: moved / skipped / failed
--dry-run: prints actions without touching files
Individual failures are logged and counted but do not abort the run. Bidirectional: local → S3, S3 → local, S3 → S3.
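The flow above can be sketched as a single loop over names. Everything here is illustrative: DictBackend is a hypothetical in-memory backend exposing only the methods the loop needs, migrate takes a plain list of names instead of querying Document rows, and a dry run counts would-be moves under "moved", which is an assumption about how the report is presented.

```python
import io
import logging
from collections import Counter
from typing import Iterable, Optional

logger = logging.getLogger("migrate_storage_sketch")


class DictBackend:
    """Hypothetical in-memory backend with just the methods the loop uses."""

    def __init__(self, files: Optional[dict] = None) -> None:
        self.files = dict(files or {})

    def __enter__(self) -> "DictBackend":
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        return None

    def exists(self, name: str) -> bool:
        return name in self.files

    def open(self, name: str) -> io.BytesIO:
        return io.BytesIO(self.files[name])

    def save(self, name: str, content) -> str:
        self.files[name] = content.read()
        return name

    def delete(self, name: str) -> None:
        del self.files[name]


def migrate(names: Iterable[str], source, destination, *,
            dry_run: bool = False, delete: bool = True) -> Counter:
    counts = Counter(moved=0, skipped=0, failed=0)
    with source, destination:
        for name in names:
            try:
                if not source.exists(name):
                    logger.warning("missing from source: %s", name)
                    counts["skipped"] += 1
                elif destination.exists(name):
                    counts["skipped"] += 1  # copied by an earlier run
                elif dry_run:
                    print(f"would move {name}")
                    counts["moved"] += 1  # counted as a would-be move
                else:
                    with source.open(name) as handle:
                        destination.save(name, handle)
                    if delete:
                        source.delete(name)
                    counts["moved"] += 1
            except Exception:
                # One bad file is logged and counted; the run continues.
                logger.exception("failed to migrate %s", name)
                counts["failed"] += 1
    return counts


source = DictBackend({"2024/a.pdf": b"A", "2024/b.pdf": b"B"})
destination = DictBackend({"2024/b.pdf": b"B"})
result = migrate(["2024/a.pdf", "2024/b.pdf", "2024/missing.pdf"],
                 source, destination)
```

In this run one file is moved and two are skipped: one because it already exists on the destination, one because it is missing from the source.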
Files to Create
| File | Purpose |
|---|---|
| src/paperless/storage.py | Protocol, built-in backends, original_storage / archive_storage singletons |
| src/documents/management/commands/migrate_storage.py | Migration command |
| src/documents/migrations/XXXX_strip_storage_prefix.py | Strip prefix from existing filename rows |
Files to Modify
| File | Change |
|---|---|
| src/paperless/settings/__init__.py | Add PAPERLESS_DOCUMENT_STORAGE_BACKEND, PAPERLESS_DOCUMENT_STORAGE_OPTIONS |
| src/documents/models.py | source_file, archive_file use storage instances; source_path returns temp file for subprocess callers |
| src/documents/consumer.py | _write() → storage.save(); remove mkdir calls |
| src/documents/signals/handlers.py | shutil.move() → storage.move(); remove create_source_path_directory / delete_empty_directories callsites |
| src/documents/tasks.py | Same as signals |
| src/documents/file_handling.py | exists() checks and directory references use storage API |
| src/documents/views/ | File-serving views use storage.open() within context; wrap for FileResponse lifecycle |
| src/documents/management/commands/document_importer.py | Replace Path.glob() and direct copies with storage API |
| src/documents/management/commands/document_exporter.py | Replace direct file copies and FileLock-guarded writes with storage API |
Locking & Concurrency
The codebase serialises all document file write/move operations with FileLock(settings.MEDIA_LOCK), where MEDIA_LOCK = MEDIA_ROOT / "media.lock". This is used in consumer.py, signals/handlers.py, tasks.py, mail.py, document_importer.py, and document_exporter.py.
The lock file stays on the local filesystem regardless of backend. MEDIA_LOCK lives under MEDIA_ROOT, which is the local path even when documents are stored on S3. This means:
- Single-host deployments (the common case: Docker Compose, single server): the FileLock continues to work correctly. All Celery workers and the Django process share the same lock file. No change required.
- Multi-host deployments: the FileLock is already broken for these today, because each host has its own lock file. This is a pre-existing limitation and is out of scope for this feature.
Callsite structure — the storage context manager nests inside the existing lock, preserving current behaviour:
with FileLock(settings.MEDIA_LOCK):
    with original_storage as storage:
        storage.move(old_name, new_name)
generate_unique_filename race: this function checks storage.exists() then saves, which is not atomic on S3. The FileLock already serialises this on a single host. For multi-host this is a pre-existing gap — not introduced by this feature.
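To make the check-then-save pattern concrete, here is a small sketch under loud assumptions: pick_unique_name and save_unique are hypothetical helpers, the _01 suffix scheme is invented for illustration and is not claimed to match generate_unique_filename, a dict stands in for the storage backend, and threading.Lock stands in for FileLock(settings.MEDIA_LOCK).

```python
import threading
from pathlib import PurePosixPath
from typing import Callable

media_lock = threading.Lock()  # stand-in for FileLock(settings.MEDIA_LOCK)


def pick_unique_name(exists: Callable[[str], bool], name: str) -> str:
    """Return name, or name with a numeric suffix, for which exists() is False."""
    if not exists(name):
        return name
    path = PurePosixPath(name)
    counter = 1
    while True:
        candidate = str(path.with_name(f"{path.stem}_{counter:02}{path.suffix}"))
        if not exists(candidate):
            return candidate
        counter += 1


def save_unique(files: dict, name: str, data: bytes) -> str:
    # The existence check and the write happen under one lock, so two workers
    # sharing this lock cannot both settle on the same free name.
    with media_lock:
        chosen = pick_unique_name(files.__contains__, name)
        files[chosen] = data
    return chosen


files: dict = {}
first = save_unique(files, "2024/invoice.pdf", b"1")
second = save_unique(files, "2024/invoice.pdf", b"2")
```

Without the surrounding lock, two callers could both see the same candidate as free and the later save would overwrite the earlier one, which is the gap that remains for multi-host deployments.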
Future path for multi-host: replace FileLock with a database-level advisory lock or Redis lock. Out of scope here.
Key Invariants
- The context manager is required for all storage operations, including reads
- name is always the relative key, never an absolute path or URL
- The backend prefix (originals / archive) is paperless-controlled and never stored in the DB
- LocalFilesystemBackend is the default; existing deployments require no config change
- The migrate command is idempotent and can be re-run after partial failure