Mirror of https://github.com/paperless-ngx/paperless-ngx.git (synced 2026-06-30 17:24:22 +00:00)
Commit da02f3ef2d (parent 6a610e5f87), committed by stumpylog: Storing more ideas/plans
@@ -0,0 +1,225 @@
|
||||
# Scheduled Backup Design
|
||||
|
||||
**Date**: 2026-05-15
|
||||
**Status**: Approved
|
||||
|
||||
## Overview
|
||||
|
||||
Add a scheduled backup system to paperless-ngx that exports documents as zip files on a user-configurable schedule, retaining the last N backups. The schedule timing is configured via an env var (consistent with all other scheduled tasks), while the backup-specific configuration (output directory, keep count) lives in a new database model editable through the API and UI.
|
||||
|
||||
## Goals
|
||||
|
||||
- Automated periodic exports without manual intervention
|
||||
- Zip-based output for simple, unambiguous rotation
|
||||
- Opt-in: no backup runs unless explicitly configured
|
||||
- Strongly typed export contract usable by both the CLI and the scheduled task
|
||||
- UI-editable backup config, no additional env vars beyond the cron schedule
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Encrypted backups (future enhancement)
|
||||
- Age-based or size-based rotation (count-only for now)
|
||||
- Remote/cloud backup destinations
|
||||
- Import automation
|
||||
|
||||
---
|
||||
|
||||
## Section 1: Data Model and API
|
||||
|
||||
### `BackupConfiguration` model
|
||||
|
||||
New singleton model in `src/paperless/models.py`, following the same `AbstractSingletonModel` pattern as `ApplicationConfiguration`.
|
||||
|
||||
```python
class BackupConfiguration(AbstractSingletonModel):
    output_dir = models.CharField(
        verbose_name=_("Backup output directory"),
        max_length=1024,
        blank=True,
        default="",
    )
    keep_count = models.PositiveIntegerField(
        verbose_name=_("Number of backups to keep"),
        default=5,
        help_text=_("Set to 0 to keep all backups."),
    )

    class Meta:
        verbose_name = _("Backup configuration")
```
|
||||
|
||||
- `output_dir` blank/empty means backup is disabled (the task treats it as a no-op).
|
||||
- `output_dir` must be an absolute path. The serializer validates this via a custom validator; `run_export` also calls `.resolve()` on the path unconditionally.
|
||||
- `keep_count = 0` means keep all backups; no rotation is performed.
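
The `output_dir` rules above could be enforced by a validator along these lines. This is a minimal sketch, not the project's serializer code: the function name `validate_output_dir` is hypothetical, and it raises a plain `ValueError` where a Django REST Framework validator would raise its own validation error.

```python
from pathlib import Path


def validate_output_dir(value: str) -> str:
    """Accept an empty string (backup disabled) or an absolute path."""
    if value == "":
        # Blank means the scheduled backup is disabled, so it is always valid.
        return value
    if not Path(value).is_absolute():
        raise ValueError("Backup output directory must be an absolute path.")
    return value
```

Keeping the blank case valid preserves the opt-in behaviour: a fresh singleton row with the default `""` passes validation and the task stays a no-op.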
|
||||
|
||||
### Migration
|
||||
|
||||
The migration is created in `src/paperless/migrations/` (not `src/documents/migrations/`), since `BackupConfiguration` lives in the `paperless` app.
|
||||
|
||||
### API
|
||||
|
||||
- **Serializer**: `BackupConfigurationSerializer` in `src/paperless/serialisers.py`
|
||||
- **ViewSet**: `BackupConfigurationViewSet` in `src/paperless/views.py` — singleton GET/PATCH, same pattern as `ApplicationConfiguration`
|
||||
- **Route**: `/api/backup_config/` registered in `src/paperless/urls.py`
|
||||
|
||||
---
|
||||
|
||||
## Section 2: Export Module
|
||||
|
||||
New module `src/documents/export.py` contains the export contract and core logic, extracted from `document_exporter`'s `handle()` method.
|
||||
|
||||
### `ExportOptions` dataclass
|
||||
|
||||
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExportOptions:
    target: Path
    compare_checksums: bool = False
    compare_json: bool = False
    delete: bool = False
    use_filename_format: bool = False
    no_archive: bool = False
    no_thumbnail: bool = False
    use_folder_prefix: bool = False
    split_manifest: bool = False
    zip_export: bool = False
    zip_name: str | None = None  # None -> default date-based name
    data_only: bool = False
    passphrase: str | None = None
    batch_size: int = 500
```
|
||||
|
||||
`zip_name = None` means the caller wants the default date-based name. `run_export` resolves `None` internally to `f"export-{timezone.localdate().isoformat()}"` before use — callers never need to supply a default. The scheduled task always passes an explicit timestamped name.
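
The default-name resolution described above can be sketched as a small pure function. The name `resolve_zip_name` is hypothetical, and the date is passed in as a parameter here so the example runs without Django; the spec itself obtains it from `timezone.localdate()` inside `run_export`.

```python
from datetime import date
from typing import Optional


def resolve_zip_name(zip_name: Optional[str], today: date) -> str:
    """Return the caller's name, or the date-based default when None."""
    if zip_name is not None:
        # Explicit names (such as the scheduled task's timestamped one) win.
        return zip_name
    return f"export-{today.isoformat()}"
```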
|
||||
|
||||
### `run_export(options: ExportOptions) -> None`
|
||||
|
||||
The body of the current `Command.handle()` in `document_exporter` moves here, reading from `ExportOptions` instead of parsed CLI options. No behaviour changes.
|
||||
|
||||
### Refactored `document_exporter` management command
|
||||
|
||||
Becomes a thin CLI adapter:
|
||||
|
||||
1. Parse arguments (unchanged)
|
||||
2. Construct `ExportOptions` from parsed args
|
||||
3. Call `run_export(options)`
|
||||
|
||||
---
|
||||
|
||||
## Section 3: Scheduled Task and Rotation
|
||||
|
||||
### `scheduled_backup` task in `src/documents/tasks.py`
|
||||
|
||||
```
1. Load BackupConfiguration (singleton)
2. If output_dir is blank, log a debug message and return (no-op, no PaperlessTask created)
3. Create a PaperlessTask record (TriggerSource.SCHEDULED) to track this run
4. Build zip_name as local-time timestamp: "export-YYYY-MM-DD-HHMMSS"
   using Django's timezone.localtime()
5. Construct ExportOptions(
       target=Path(config.output_dir),
       zip_export=True,
       zip_name=zip_name,
   )
6. Call run_export(options)
7. If keep_count > 0:
       zips = sorted(Path(config.output_dir).glob("export-*.zip"), key=lambda p: p.stat().st_mtime)
       for old_zip in zips[:-keep_count]:
           old_zip.unlink()
8. Mark PaperlessTask as complete (handled by signal handlers)
```
|
||||
|
||||
Key design notes:
|
||||
|
||||
- Rotation uses `export-*.zip` glob, not `*.zip`, to avoid matching zip files in the directory that paperless did not create.
|
||||
- Rotation occurs only after a successful export, so a failed run does not consume a rotation slot.
|
||||
- The timestamp format `YYYY-MM-DD-HHMMSS` in local time ensures multiple runs per day produce distinct filenames without collision.
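
The rotation step can be sketched as a standalone function. This is an illustration of step 7 only, under the assumption that it is factored out; the name `rotate_backups` and the returned list of deleted paths are not part of the spec.

```python
from pathlib import Path


def rotate_backups(output_dir: Path, keep_count: int) -> list[Path]:
    """Delete all but the newest keep_count export zips; return what was deleted."""
    if keep_count <= 0:
        # 0 means keep every backup, so no rotation is performed.
        return []
    # Only files matching export-*.zip are candidates; other zips are left alone.
    zips = sorted(output_dir.glob("export-*.zip"), key=lambda p: p.stat().st_mtime)
    stale = zips[:-keep_count]
    for old_zip in stale:
        old_zip.unlink()
    return stale
```

The `keep_count <= 0` guard matters beyond the "keep all" rule: `zips[:-0]` is `zips[:0]`, an empty slice, so the slice alone would never delete anything, but an explicit guard makes that intent visible.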
|
||||
|
||||
### PaperlessTask integration
|
||||
|
||||
`PaperlessTask` lifecycle is managed entirely by the Celery signal handlers in `src/documents/signals/handlers.py`, not manually inside the task body.
|
||||
|
||||
**Changes to `TRACKED_TASKS` and `PaperlessTask.TaskType`:**
|
||||
|
||||
- Add `PaperlessTask.TaskType.BACKUP` to the `TaskType` enum in `src/documents/models.py`
|
||||
- Add `"documents.tasks.scheduled_backup": PaperlessTask.TaskType.BACKUP` to `TRACKED_TASKS`
|
||||
|
||||
**Conditional tracking — the no-op case:**
|
||||
|
||||
When `BackupConfiguration.output_dir` is blank the task returns immediately, so no record should appear in the Tasks panel. This requires explicit handling in all five signal handlers. Relying on incidental safety (filters that match 0 rows, `DoesNotExist` guards) is fragile and unclear to future maintainers.
|
||||
|
||||
The approach for each handler when the task type is `BACKUP`:
|
||||
|
||||
| Handler | Current behaviour when no record exists | Required change |
| --- | --- | --- |
| `before_task_publish_handler` | Creates the record | Check `BackupConfiguration.get_solo().output_dir`; skip `PaperlessTask.objects.create()` if blank |
| `task_prerun_handler` | `.filter().update()` (silent no-op) | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
| `task_postrun_handler` | `DoesNotExist: return` (incidentally safe) | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
| `task_failure_handler` | `.filter().first()` returns `None`, update skipped (incidentally safe) | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
| `task_revoked_handler` | `.filter().update()` (silent no-op) | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
|
||||
|
||||
Extract a helper `_backup_task_is_tracked(task_id: str) -> bool` that returns `PaperlessTask.objects.filter(task_id=task_id).exists()`. The four downstream handlers call this after the `TRACKED_TASKS` check and return early if it returns `False` for a `BACKUP` task. This makes the intent explicit: "this task was intentionally not tracked for this invocation."
|
||||
|
||||
---
|
||||
|
||||
## Section 4: Beat Schedule
|
||||
|
||||
Add to the task list in `parse_beat_schedule()` in `src/paperless/settings/custom.py`:
|
||||
|
||||
```python
{
    "name": "Scheduled document backup",
    "env_key": "PAPERLESS_EXPORT_TASK_CRON",
    "env_default": "disable",
    "task": "documents.tasks.scheduled_backup",
    "options": {
        "expires": 1.0 * 60.0 * 60.0,  # 1 hour
    },
},
```
|
||||
|
||||
- Default is `"disable"` — the task is not added to the beat schedule unless the env var is explicitly set.
|
||||
- Setting `PAPERLESS_EXPORT_TASK_CRON=disable` (or simply not setting it) produces no scheduled task and no noise.
|
||||
- Typical user value: `"0 2 * * *"` (daily at 02:00 local server time).
|
||||
- `expires` is set to 1 hour: if a scheduled backup has not started within 1 hour of its trigger time (e.g., the Celery worker was down), it is discarded rather than running late. Unlike other tasks whose expiry is tied to a known default interval, this task has a user-defined schedule. 1 hour is a conservative value that prevents stale backup tasks from piling up without being so short that it causes problems on a normally-running worker.
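
The opt-in behaviour of the `"disable"` default can be illustrated with a reduced stand-in for the schedule builder. This is a sketch under assumptions: `enabled_cron_entries` is a hypothetical helper, it returns raw cron strings rather than Celery schedule objects, and the way `parse_beat_schedule()` actually treats `"disable"` is inferred from the bullets above rather than copied from the code.

```python
def enabled_cron_entries(tasks: list[dict], environ: dict) -> dict:
    """Return {task name: cron expression} for every task not set to 'disable'."""
    schedule = {}
    for task in tasks:
        # Fall back to the entry's default when the env var is unset.
        value = environ.get(task["env_key"], task["env_default"])
        if value == "disable":
            # Opt-in: a disabled task never reaches the beat schedule.
            continue
        schedule[task["name"]] = value
    return schedule
```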
|
||||
|
||||
---
|
||||
|
||||
## Section 5: Frontend
|
||||
|
||||
Location to be decided by co-maintainer (dedicated "Backup" page vs. section within Application Settings). The API contract is independent of this decision.
|
||||
|
||||
The UI requires two fields:
|
||||
|
||||
- **Output directory** — text input for `output_dir` (absolute path on the server)
|
||||
- **Keep count** — number input for `keep_count`, with a note that 0 means keep all
|
||||
|
||||
The component performs a GET to `/api/backup_config/` on load and a PATCH on save, identical to how the Application Settings component works.
|
||||
|
||||
---
|
||||
|
||||
## File Change Summary
|
||||
|
||||
| File | Change |
| --- | --- |
| `src/paperless/models.py` | Add `BackupConfiguration` model |
| `src/paperless/serialisers.py` | Add `BackupConfigurationSerializer` |
| `src/paperless/views.py` | Add `BackupConfigurationViewSet` |
| `src/paperless/urls.py` | Register `/api/backup_config/` route |
| `src/paperless/settings/custom.py` | Add `PAPERLESS_EXPORT_TASK_CRON` beat entry |
| `src/documents/export.py` | New module: `ExportOptions`, `run_export()` |
| `src/documents/management/commands/document_exporter.py` | Thin wrapper around `run_export()` |
| `src/documents/models.py` | Add `PaperlessTask.TaskType.BACKUP` |
| `src/documents/signals/handlers.py` | Add `BACKUP` to `TRACKED_TASKS`; add `_backup_task_is_tracked()`; update all 5 signal handlers |
| `src/documents/tasks.py` | Add `scheduled_backup` task |
| `src-ui/` | New or extended settings component (location TBD) |
| `src/paperless/migrations/` | New migration for `BackupConfiguration` |
|
||||
|
||||
---
|
||||
|
||||
## Testing
|
||||
|
||||
- **`src/paperless/tests/test_backup_config.py`** — model, serializer, API (GET/PATCH)
|
||||
- **`src/documents/tests/test_export.py`** — new unit tests for `run_export()` directly; `test_management_exporter.py` retains its existing CLI wiring tests and gains tests for the thin-wrapper behaviour
|
||||
- **`src/documents/tests/test_tasks_backup.py`** — `scheduled_backup` task: no-op when `output_dir` blank, export called with correct options, rotation deletes correct files, rotation skipped when `keep_count=0`
|
||||
- **`src/documents/tests/test_task_signals.py`** — signal handler behaviour for `BACKUP` task type: no record created when `output_dir` blank, all downstream handlers skip cleanly when no record exists, normal lifecycle when `output_dir` is set
|
||||
- Frontend unit tests for the settings component
|
||||
@@ -0,0 +1,81 @@
|
||||
# Interactive Shell Container Environment
|
||||
|
||||
**Date:** 2026-05-26
|
||||
**Branch:** fix-tanvity-index-lock (to be implemented on a new branch)
|
||||
**Status:** Approved
|
||||
|
||||
## Problem
|
||||
|
||||
When paperless-ngx users open an interactive shell in the running container via `docker exec -it <container> bash`, they do not see environment variables resolved from `*_FILE` secret injection.
|
||||
|
||||
The `init-env-file` s6 init script reads `PAPERLESS_*_FILE` variables (e.g. `PAPERLESS_SECRET_KEY_FILE=/run/secrets/key`), reads the referenced file, and writes the resolved value (e.g. `PAPERLESS_SECRET_KEY=abc123`) to `/run/s6/container_environment/`. All s6-managed services and management command wrappers use the `#!/command/with-contenv` shebang, which reads that directory and injects all vars into the process environment before execution.
|
||||
|
||||
`docker exec bash` bypasses s6 entirely. It is a non-login interactive shell launched directly by the Docker daemon, which provides only the original Docker-configured environment (the `*_FILE` paths, not the resolved values). Any manual command a user runs — such as `document_exporter` or `manage.py` calls — will be missing the resolved secrets unless they happen to also be set as plain Docker env vars.
|
||||
|
||||
## Approach
|
||||
|
||||
Source `/run/s6/container_environment/` in every interactive bash shell opened in the container, mirroring what `with-contenv` does for s6 services.
|
||||
|
||||
Two hooks are needed because Debian uses different rc files for different shell types:
|
||||
|
||||
- **Non-login interactive** (`docker exec bash`): sources `/etc/bash.bashrc`
|
||||
- **Login interactive** (`docker exec bash --login`): sources `/etc/profile`, which auto-sources all `/etc/profile.d/*.sh`
|
||||
|
||||
## Changes
|
||||
|
||||
### 1. `docker/rootfs/etc/profile.d/contenv.sh` (new file)
|
||||
|
||||
A POSIX-compatible shell script that exports all files in `/run/s6/container_environment/` as environment variables. Placed here so login shells pick it up automatically.
|
||||
|
||||
```sh
#!/bin/sh
# Source s6 container environment for interactive shells.
# Ensures variables resolved from *_FILE secret injection are visible
# when using 'docker exec bash'. Does not affect s6 services (those
# use with-contenv directly). Has no effect in non-container contexts
# because the directory will not exist.
# Note: sh/dash shells opened via 'docker exec sh' are not covered;
# only bash-based sessions benefit from this file.
_pngx_contenv="/run/s6/container_environment"
if [ -d "${_pngx_contenv}" ]; then
    for _pngx_f in "${_pngx_contenv}"/*; do
        [ -f "${_pngx_f}" ] || continue
        _pngx_name=$(basename "${_pngx_f}")
        _pngx_val=$(cat "${_pngx_f}")
        export "${_pngx_name}=${_pngx_val}"
    done
fi
unset _pngx_contenv _pngx_f _pngx_name _pngx_val
```
|
||||
|
||||
### 2. Dockerfile `main-app` stage (one line added)
|
||||
|
||||
Appends a source line to `/etc/bash.bashrc` so non-login interactive shells also pick up contenv. Added after the runtime package installation block, before the Python dependency installation.
|
||||
|
||||
```dockerfile
|
||||
RUN echo '. /etc/profile.d/contenv.sh' >> /etc/bash.bashrc
|
||||
```
|
||||
|
||||
`/etc/bash.bashrc` is provided by the Debian base image and installed during the apt step, so it exists by the time this `RUN` executes.
|
||||
|
||||
## Coverage
|
||||
|
||||
| How user gets a shell | Gets contenv? | Mechanism |
| --- | --- | --- |
| `docker exec -it container bash` | Yes | `/etc/bash.bashrc` sources `contenv.sh` |
| `docker exec -it container bash --login` | Yes | `/etc/profile.d/contenv.sh` auto-sourced |
| `docker exec -it container sh` | No (known limitation) | `sh` sources neither file |
| Management command wrappers | Already worked | `with-contenv` shebang |
| s6 services | Already worked | `with-contenv` shebang |
|
||||
|
||||
## Edge Cases
|
||||
|
||||
**Shell opened before `init-env-file` completes:** The directory exists but may not yet contain all resolved vars. The script exports what is present; missing vars are simply absent. No error is produced.
|
||||
|
||||
**Variable value contains special characters:** `$(cat file)` strips only trailing newlines (which `init-env-file` already warns about). Other special characters are preserved correctly by the `export "NAME=VALUE"` form.
|
||||
|
||||
**Directory does not exist (non-container use):** The `[ -d ]` guard makes the script a no-op. Safe to include in any Debian-based image.
|
||||
|
||||
## Testing
|
||||
|
||||
No automated test is added. This is container-bootstrap shell plumbing with no Python code path. Manual verification: run the container with a `*_FILE` secret, `docker exec bash`, and confirm the resolved variable is present in the environment.
|
||||
@@ -0,0 +1,138 @@
|
||||
# LLM Index Schema Migrations (second spec)
|
||||
|
||||
Date: 2026-06-10
|
||||
Depends on: `docs/superpowers/specs/2026-06-10-sqlite-vec-vector-store-design.md` and its implementation plan (`docs/superpowers/plans/2026-06-10-sqlite-vec-transition.md`). This spec layers on top of the completed sqlite-vec transition; do not start it before that branch lands.
|
||||
Supersedes: PR #12968 (in-place LanceDB migrations). The machinery design there is carried over nearly verbatim; only the storage backend specifics change. #12968 should be closed with a pointer here once this ships.
|
||||
|
||||
Scope update (user decision, 2026-06-10): the `embedding.py:115` metadata restructure originally drafted as Part 2 of this spec was folded into the transition plan instead (its Task 5). The transition forces a full rebuild anyway, so the embedded-text change rides along with no extra re-embed cost. This spec is now machinery-only: it ships with an EMPTY migration registry, ready for whatever schema change comes next. Part 2 below is retained as the worked example of how a re-embed migration would be registered, since the next one will not have a free rebuild to piggyback on.
|
||||
|
||||
## Part 1: Schema migration machinery (ported from PR #12968)
|
||||
|
||||
### What carries over unchanged
|
||||
|
||||
The PR's design survives the store swap intact and is adopted as-is:
|
||||
|
||||
- `Migration` frozen dataclass: `version: int`, `description: str`, `requires_reembed: bool`, `apply: Callable` (compare/hash-excluded field).
|
||||
- `MIGRATIONS: list[Migration]` ordered registry + `CURRENT_SCHEMA_VERSION: Final[int]` in `vector_store.py`. To add a migration: bump the constant, append an entry.
|
||||
- Store surface: `stored_schema_version() -> int` (0 when unrecorded, so pre-versioning tables treat every migration as pending), `pending_migrations()`, `requires_reembed_migration()`, `apply_structural_migrations() -> list[Migration]`.
|
||||
- The stop-at-first-reembed-boundary rule in `apply_structural_migrations()`: structural migrations are applied in version order only up to the first pending `requires_reembed=True` entry, so the version counter can never jump past a re-embed boundary and silently skip the rebuild. (This was the subtle correctness insight of #12968; preserve the comment.)
|
||||
- The `update_llm_index()` hook, verbatim from the PR:
|
||||
|
||||
```python
with write_store(embed_model_name=model_name) as store:
    if not rebuild and store.table_exists():
        store.apply_structural_migrations()
        if store.requires_reembed_migration():
            logger.warning(
                "Schema migration requires re-embedding; forcing LLM index rebuild.",
            )
            rebuild = True
```
|
||||
|
||||
- Test approach from the PR: mock `MIGRATIONS`/`CURRENT_SCHEMA_VERSION` with `mocker.patch`, spy on `drop_table` to distinguish in-place from rebuild, one test per path (structural applied without rebuild; pending re-embed forces rebuild).
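
The `Migration` dataclass and the stop-at-first-reembed-boundary rule listed above can be sketched as follows. This is an illustration, not the code from PR #12968: `InMemoryStore` is a hypothetical stand-in that tracks only the version number, and it omits the bootstrap stamping described later in this spec.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    # Excluded from comparison and hashing, as the spec describes.
    apply: Callable = field(compare=False, hash=False)


class InMemoryStore:
    """Hypothetical stand-in for the vector store; tracks only the version."""

    def __init__(self, migrations: list, stored_version: int = 0) -> None:
        self.migrations = migrations
        self.version = stored_version

    def pending_migrations(self) -> list:
        return [m for m in self.migrations if m.version > self.version]

    def requires_reembed_migration(self) -> bool:
        return any(m.requires_reembed for m in self.pending_migrations())

    def apply_structural_migrations(self) -> list:
        applied = []
        for migration in sorted(self.pending_migrations(), key=lambda m: m.version):
            if migration.requires_reembed:
                # Stop here so the version never jumps past a re-embed boundary.
                break
            migration.apply(self)
            self.version = migration.version
            applied.append(migration)
        return applied
```

With a registry of structural v2, re-embed v3, and structural v4, only v2 is applied and the re-embed migration stays pending, so the caller still sees `requires_reembed_migration()` return `True` and forces the rebuild.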
|
||||
|
||||
### What changes for sqlite-vec
|
||||
|
||||
**1. Version storage: `index_meta['schema_version']` instead of `schema_version.json`.**
|
||||
The Lance store needed a sidecar JSON file because Lance had no convenient mutable metadata. The sqlite-vec store already has the `index_meta` key/value table, which is transactional with the data itself (a migration and its version bump commit atomically, which the file never could). Concretely:
|
||||
|
||||
- `_create_table(dim)` additionally writes `schema_version = str(CURRENT_SCHEMA_VERSION)` (fresh tables are always current).
|
||||
- `stored_schema_version()` reads the meta key, returns 0 on absence/garbage.
|
||||
- `drop_table()` already does `DELETE FROM index_meta`, which clears the version with it. No sidecar file, no unlink bookkeeping.
|
||||
- `apply_structural_migrations()` writes the new version inside the same transaction as the last applied migration.
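
The version bookkeeping above can be sketched with the standard-library `sqlite3` module. This is a minimal sketch under assumptions: the functions are free-standing rather than store methods, `apply_and_stamp` is a hypothetical name, and the migration is a single SQL statement. It assumes no transaction is already open on the connection.

```python
import sqlite3


def stored_schema_version(conn: sqlite3.Connection) -> int:
    """Return the recorded schema version, or 0 when absent or unparsable."""
    try:
        row = conn.execute(
            "SELECT value FROM index_meta WHERE key = 'schema_version'",
        ).fetchone()
    except sqlite3.OperationalError:
        return 0  # index_meta does not exist yet
    if row is None:
        return 0
    try:
        return int(row[0])
    except (TypeError, ValueError):
        return 0  # garbage value


def apply_and_stamp(conn: sqlite3.Connection, migration_sql: str, version: int) -> None:
    """Run one migration statement and record its version in the same transaction."""
    with conn:  # commits on success, rolls back on error
        # Explicit BEGIN so that DDL is also covered by the transaction.
        conn.execute("BEGIN")
        conn.execute(migration_sql)
        conn.execute(
            "INSERT OR REPLACE INTO index_meta (key, value) VALUES ('schema_version', ?)",
            (str(version),),
        )
```

The explicit `BEGIN` is the point of the example: the migration and its version bump either both commit or both roll back, which a sidecar JSON file could not guarantee.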
|
||||
|
||||
**2. `apply` receives the store, not a table handle.**
|
||||
Lance migrations got the raw table for `add_columns`/`alter_columns`. vec0 virtual tables do not support arbitrary `ALTER TABLE`, so structural migrations are SQL against the store's connection. Signature: `apply: Callable[[PaperlessSqliteVecVectorStore], None]`. The store exposes what migrations need: `.client` (connection), `._table_name`, `.vector_dim()`, and the rebuild helper below.
|
||||
|
||||
**3. Structural migrations are create+copy+rename, sharing the compact() machinery.**
|
||||
The sqlite-vec `compact()` already implements the only structural mutation vec0 supports: build a new table, `INSERT INTO ... SELECT` (vectors copied bit-for-bit, no re-embedding), drop old, rename. Factor it into a shared helper on the store:
|
||||
|
||||
```python
def rebuild_table(
    self,
    *,
    create_sql: str | None = None,
    copy_select: str | None = None,
) -> None:
    """Copy live rows into a freshly created table and swap it in.

    Defaults reproduce the current schema (compaction). Structural
    migrations pass a modified CREATE statement and a matching SELECT
    (e.g. adding a column with a literal default). Runs in one
    transaction; VACUUM afterwards.
    """
```
|
||||
|
||||
`compact()` becomes a thin caller (threshold check + `rebuild_table()`), and a structural migration like "add a `+page_count` aux column" is:
|
||||
|
||||
```python
Migration(
    version=2,
    description="add page_count auxiliary column",
    requires_reembed=False,
    apply=lambda store: store.rebuild_table(
        create_sql=...,  # CREATE VIRTUAL TABLE ... with the new column
        copy_select="SELECT id, document_id, modified, node_content, embedding, '' FROM {old}",
    ),
)
```
|
||||
|
||||
A pleasant consequence: every structural migration is also a compaction (the copy drops dead rows), and the file-format risk surface is one helper with one test suite instead of two code paths.
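
The create+copy+rename pattern itself can be demonstrated on an ordinary SQLite table. This is a sketch under stated assumptions: it is a free function rather than the store method, it uses a plain table instead of a vec0 virtual table (so it does not exercise sqlite-vec), the `{new}` placeholder in `create_sql` is an invented convention mirroring the `{old}` placeholder shown above, and the trailing VACUUM is omitted.

```python
import sqlite3


def rebuild_table(
    conn: sqlite3.Connection,
    table: str,
    create_sql: str,
    copy_select: str,
) -> None:
    """Create a replacement table, copy rows into it, and swap it in atomically."""
    new = f"{table}_new"
    with conn:  # commits on success, rolls back on error
        conn.execute("BEGIN")
        conn.execute(create_sql.format(new=new))
        # Rows are copied as stored; nothing is recomputed.
        conn.execute(f"INSERT INTO {new} {copy_select.format(old=table)}")
        conn.execute(f"DROP TABLE {table}")
        conn.execute(f"ALTER TABLE {new} RENAME TO {table}")
```

Adding a column with a literal default then amounts to passing a `CREATE TABLE {new} (...)` statement with the extra column and a `SELECT ..., '' FROM {old}` that supplies the default for existing rows.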
|
||||
|
||||
**4. Bootstrap version for the sqlite-vec store is 1.**
|
||||
The transition plan ships the new store without machinery; tables it creates carry no `schema_version` key and therefore read as 0. This release lands with `CURRENT_SCHEMA_VERSION = 1` and `MIGRATIONS = []`, so the bootstrap is unconditionally safe: a 0-version table has no pending migrations and `apply_structural_migrations()` simply stamps it to 1. (The metadata restructure having moved into the transition itself is what makes this clean; the registry's first real entry will be v2, written against tables that are all stamped.)
|
||||
|
||||
## Part 2 (worked example, IMPLEMENTED IN THE TRANSITION): the metadata TODO as a re-embed migration
|
||||
|
||||
This section was implemented as Task 5 of the transition plan and ships with the store swap, not with this spec. It is kept as the reference example of how to register the next re-embed migration.
|
||||
|
||||
### The change
|
||||
|
||||
`build_llm_index_text()` currently embeds three short structured values in the body text:
|
||||
|
||||
```python
f"Filename: {doc.filename}",
f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
f"Archive Serial Number: {doc.archive_serial_number or ''}",
```
|
||||
|
||||
Per the TODO, move them to `node.metadata` (excluded from embeddings, visible to the LLM via llama-index's metadata prepend), the same treatment title/tags/correspondent/document_type got in PR #12944. Notes and Custom Fields stay in the body (long free text / dynamic count, as the TODO says).
|
||||
|
||||
1. `embedding.py build_llm_index_text()`: delete the three lines above (the `lines` list keeps Notes, Custom Fields, and Content). Update the TODO comment to describe only what remains intentional (Notes/Custom Fields stay embedded), or delete it.
|
||||
2. `indexing.py build_document_node()` metadata dict gains:
|
||||
|
||||
```python
"filename": document.filename,
"storage_path": document.storage_path.name if document.storage_path else None,
"archive_serial_number": document.archive_serial_number,
```
|
||||
|
||||
(`None`/int values are fine here: this dict lives in the node-content JSON, not in vec0 metadata columns; only `document_id`/`modified` are columns with the NULL restriction. Matches the existing convention of `correspondent: None`.)

3. `excluded_embed_metadata_keys=list(metadata.keys())` already covers the new keys; `excluded_llm_metadata_keys` stays `["document_id"]` so the LLM sees the new fields.
|
||||
|
||||
### Why this class of change needs a migration
|
||||
|
||||
Removing the three lines changes the embedded text of every document, so stored vectors no longer match what the current code would embed. Incremental updates only re-embed documents whose `modified` changed, so without a forced rebuild the index would be a mixed old/new-text population indefinitely. This particular change escaped that fate only because the transition's forced rebuild covers it. The next embedded-text change will not have that luxury and gets registered like this:
|
||||
|
||||
```python
CURRENT_SCHEMA_VERSION: Final[int] = 2

MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="<what changed about the embedded text>",
        requires_reembed=True,
        apply=lambda store: None,
    ),
]
```
|
||||
|
||||
On the first `update_llm_index` after upgrade, the hook sees the pending re-embed migration, logs, and rebuilds.
|
||||
|
||||
### Test plan
|
||||
|
||||
Machinery only (the metadata change is tested in the transition plan's Task 5). Port of the #12968 tests, dedicated file `test_vector_store_migrations.py`: structural migration applies in-place without `drop_table`; pending re-embed forces rebuild; version stamping on create/drop; bootstrap stamping of a pre-machinery 0-version table to 1; stop-at-boundary with a mixed [structural v2, reembed v3, structural v4] registry asserting v4 is NOT applied and the stored version stays at 2; `rebuild_table()` round-trips rows byte-for-byte (shared with compact tests).
|
||||
|
||||
### Open questions
|
||||
|
||||
- PR #12968 disposition: close with a comment pointing at this spec once the machinery lands (the Lance-specific `add_columns` path has no successor; vec0 cannot do in-place column adds).
|
||||
- `created`/`added` fields are also candidates for future structural metadata work, but nothing needs them now (YAGNI; noted only so the next reader does not re-derive it).
|
||||
@@ -0,0 +1,155 @@
|
||||
# sqlite-vec Vector Store Design (replaces PaperlessLanceVectorStore)
|
||||
|
||||
Date: 2026-06-10
|
||||
|
||||
Context: LanceDB wheels SIGILL on non-AVX2 CPUs (#12970); research in `2026-06-10-vector-store-alternatives-research.md` selected sqlite-vec. This is a beta feature, so a one-time re-embed on upgrade is acceptable. Every claim marked [VERIFIED] below was empirically tested against the actual PyPI wheel (0.1.9, and 0.1.10a4 where noted), either in this repo's scratch harness (`/tmp/vstore-avx-test/explore_sqlitevec*.py`) or by the issues-audit agent.
|
||||
|
||||
## Version pin: `sqlite-vec==0.1.9`, and why it is load-bearing
|
||||
|
||||
- The 0.1.9 linux x86_64 wheel is built with **no SIMD flags at all** (`vec_debug()` shows empty build flags) and passed our qemu Westmere (SSE4.2, no AVX) and SandyBridge (AVX, no AVX2) emulation tests [VERIFIED]. This is the entire point of the migration.
|
||||
- The **0.1.10-alpha.4 wheel regresses this**: built with `-mavx -DSQLITE_VEC_ENABLE_AVX` file-wide, no runtime CPU dispatch. It can SIGILL on AVX-less CPUs, including Goldmont Atom/Celeron NAS boxes, exactly the #12970 user base [VERIFIED via vec_debug on the wheel].
|
||||
- Guardrails: pin `==0.1.9` exactly; log `SELECT vec_version(), vec_debug()` at store init as an AVX canary; before ever bumping to 0.1.10+, re-check the wheel flags (and consider raising the runtime-dispatch issue upstream first).
|
||||
- arm64: the 0.1.9 manylinux aarch64 wheel is a proper ELF64 binary with no NEON flags baked in [VERIFIED]. (The broken 32-bit "aarch64" wheel era was 0.1.6, fixed since.)
|
||||
- No sdist on PyPI (asg017/sqlite-vec#211, open) and no musl wheels; fine for our Debian-based image, blocks Alpine bare-metal installs.
|
||||
|
||||
## Schema
|
||||
|
||||
One dedicated SQLite database file in `LLM_INDEX_DIR` (e.g. `llmindex.db`), never the Django DB. Connections set `PRAGMA journal_mode=WAL`, `busy_timeout`, `synchronous=NORMAL`.
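
The connection setup described above can be sketched with the standard-library `sqlite3` module. This is an assumption-laden illustration: `open_index_db` is a hypothetical name, the 5000 ms `busy_timeout` is a placeholder because the spec does not give a value, and loading the sqlite-vec extension is left out so the example runs without it.

```python
import sqlite3
from pathlib import Path


def open_index_db(path: Path, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open the dedicated index database with the PRAGMAs named in the spec."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait for a lock instead of failing immediately (value is an assumption).
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```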
|
||||
|
||||
```sql
CREATE VIRTUAL TABLE nodes USING vec0(
    id TEXT PRIMARY KEY,        -- node_id (uuid)
    document_id TEXT,           -- METADATA column, deliberately NOT a partition key
    modified TEXT,              -- ISO timestamp; never NULL (sentinel "")
    +node_content TEXT,         -- auxiliary column: JSON payload, any size
    embedding float[{dim}] distance_metric=cosine
);

CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT);
-- rows: embed_model, dim, schema_version, created_by_vec_version
```
|
||||
|
||||
Design decisions, each verified on 0.1.9:
|
||||
|
||||
- **`document_id` is a metadata column, not a partition key.** With a partition key, `k` applies per partition: `k=5 AND document_id IN (3 docs)` returns 15 rows (asg017/sqlite-vec#142, open) [VERIFIED]. As a metadata column the same query returns a correct global top-k of exactly 5 [VERIFIED]. `query_similar_documents()` passes permission-scoped `IN` lists, so per-partition semantics would over-fetch k x N(docs). At our scale the partition-pruning speedup is not needed (filtered KNN at 20K x 1024 was _faster_ than unfiltered: 39 ms vs 74 ms).
- **One document column, not two.** The Lance store carried both `doc_id` (ref_doc_id) and `document_id`; in our usage they are always the same value (`str(document.id)`), so the new schema keeps only `document_id`.
- **TEXT primary key works** (insert, UPDATE, DELETE, duplicate rejection) [VERIFIED]. There is no usable rowid mapping with a TEXT pk, which we do not need.
- **Aux column for the payload.** `+node_content` holds the multi-KB JSON; aux columns cannot appear in KNN WHERE clauses (loud error, not silent) [VERIFIED], which we never do, and are selectable in scans and KNN results [VERIFIED].
- **Metadata columns reject NULL** (asg017/sqlite-vec#141, open) [VERIFIED]. `_row()` must keep coercing everything through `str(... or "")` as it already does today.
- **`distance_metric=cosine`**: similarity maps as `1 - distance` (identical vector gives distance 0.0 [VERIFIED]). For unit-norm embeddings the ranking equals today's L2 ranking; for non-normalized models cosine is the safer default, and the beta re-embed makes the behavior change free. (L2 + `1/(1+d)` remains available if exact parity is ever wanted.)
- **Vectors are always bound as float32 BLOBs** (`struct.pack`/`np.tobytes`), never JSON text: bypasses the locale-dependent `strtod` parsing bug (asg017/sqlite-vec#241, open) entirely.
- Limits, all comfortable: dims <= 8192, k <= 4096, chunk_size default 1024 [VERIFIED]. TEXT metadata has no length cap; values > 12 bytes go to a shadow text table with a prefix fast-path, and the one historical bug at that boundary (long-metadata DELETE, #274) is fixed in 0.1.9.
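
Three of the decisions above (NULL coercion, float32 BLOB binding, and the `1 - distance` mapping) fit in a few lines. This is a sketch, not the store's actual `_row()`: `pack_vector`, `row_for_node`, and `similarity_from_cosine_distance` are hypothetical names, and the tuple layout simply follows the column order of the schema.

```python
import struct


def pack_vector(vector):
    """Serialize floats as a float32 BLOB (native byte order), 4 bytes per dim."""
    return struct.pack(f"{len(vector)}f", *vector)


def row_for_node(node_id, document_id, modified, node_content, embedding):
    """Build one insert tuple in schema column order.

    Every text column goes through str(... or "") because vec0 metadata
    columns reject NULL; "" is the sentinel for a missing value.
    """
    return (
        str(node_id or ""),
        str(document_id or ""),
        str(modified or ""),
        str(node_content or ""),
        pack_vector(embedding),
    )


def similarity_from_cosine_distance(distance):
    """Map a cosine distance to a similarity score."""
    return 1.0 - distance


row = row_for_node("n1", 42, None, '{"text": "hi"}', [0.5, -1.0, 2.0])
```

Binding the packed bytes as a parameter means the vector never passes through text parsing, which is what sidesteps the locale issue.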
|
||||
|
||||
## Method mapping (PaperlessLanceVectorStore -> PaperlessSqliteVecVectorStore)
|
||||
|
||||
| Current method | sqlite-vec implementation | Notes |
| --- | --- | --- |
| `__init__(uri, table_name, embed_model_name)` | `sqlite3.connect(path)` + `enable_load_extension` + `sqlite_vec.load()` + PRAGMAs | Same lazy "table may not exist yet" stance |
| `client` property | the `sqlite3.Connection` | |
| `table_exists()` | `SELECT 1 FROM sqlite_master WHERE name='nodes'` | |
| `vector_dim()` | `index_meta['dim']` | Written at table creation; wrong-dim inserts are rejected by vec0 anyway [VERIFIED] |
| `drop_table()` | `DROP TABLE nodes` | Drops all 7 shadow tables with it [VERIFIED]; also clear `index_meta` |
| `stored_model_name()` / `config_mismatch()` | `index_meta['embed_model']` | Same conservative None handling |
| `_schema(dim, model)` | the CREATE statements above | dim from first batch, as today (`_ensure_table`) |
| `_row(node)` | same dict, vector packed to bytes | keep `str(... or "")` coercion (NULL rejection) |
| `add(nodes)` | `executemany(INSERT ...)` inside one transaction | ~3,300 rows/s at 1024 dims measured; batching via transactions |
| `upsert_document(document_id, nodes)` | `BEGIN; DELETE FROM nodes WHERE document_id = ?; executemany(INSERT); COMMIT` | **Not** `INSERT OR REPLACE`: broken on vec0 (asg017/sqlite-vec#259, open). Transaction gives the same no-transient-empty-state guarantee as merge_insert; rollback verified [VERIFIED] |
| `delete(ref_doc_id)` | `DELETE FROM nodes WHERE document_id = ?` | |
| `get_nodes(filters)` | `SELECT id, document_id, node_content, embedding FROM nodes [WHERE ...]` | full scans on vec0 work [VERIFIED]; 45 ms / 20K rows |
| `query(VectorStoreQuery)` | `SELECT id, node_content, embedding, distance FROM nodes WHERE embedding MATCH ? AND k = ? [AND filters]` then Python-slice to `top_k` | `k = ?` is mandatory; `LIMIT` cannot be combined with `k` [VERIFIED]; results arrive distance-sorted [VERIFIED]; similarities = `1 - distance` |
| `_build_where(filters)` | same EQ/IN translation, but emitting `?` placeholders + params list | **Upgrade**: bound parameters replace today's manual `_escape()` string interpolation |
| `get_modified_times()` | `SELECT document_id, modified FROM nodes` + first-seen dedupe in Python | identical logic |
| `ensure_document_id_scalar_index()` | no-op (delete if nothing else needs it) | metadata filters are evaluated in the chunk scan; nothing to create |
| `maybe_create_ann_index()` | no-op on 0.1.9 | ANN (rescore/diskann) is 0.1.10-alpha territory; adopting an ANN index makes the file unreadable by 0.1.9 (one-way door), while flat tables round-trip 0.1.9 <-> 0.1.10a4 cleanly [VERIFIED]. Revisit post-0.1.10-final |
| `compact(retention_seconds)` | **rebuild-based compaction**, see below | replaces Lance MVCC cleanup |
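
The `upsert_document` row is the one with real transactional content, so here is a sketch of the DELETE + INSERT pattern. To stay runnable without the extension it uses an ordinary table with the same column names as a stand-in for the vec0 table (an assumption of this example); the explicit `BEGIN IMMEDIATE` and the autocommit connection are also choices made for the sketch.

```python
import sqlite3

INSERT_SQL = (
    "INSERT INTO nodes (id, document_id, modified, node_content, embedding) "
    "VALUES (?, ?, ?, ?, ?)"
)


def upsert_document(conn, document_id, rows):
    """Replace all rows of one document atomically (DELETE + INSERT in one txn).

    conn must be in autocommit mode (isolation_level=None) so that the
    BEGIN/COMMIT/ROLLBACK statements here are the only transaction control.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("DELETE FROM nodes WHERE document_id = ?", (document_id,))
        conn.executemany(INSERT_SQL, rows)
        conn.execute("COMMIT")
    except Exception:
        # Any failure restores the previous rows for this document.
        conn.execute("ROLLBACK")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)
# Stand-in for the vec0 table: same columns, ordinary storage.
conn.execute(
    "CREATE TABLE nodes (id TEXT PRIMARY KEY, document_id TEXT, "
    "modified TEXT, node_content TEXT, embedding BLOB)"
)
blob = b"\x00" * 4
upsert_document(conn, "7", [("a", "7", "", "{}", blob), ("b", "7", "", "{}", blob)])
upsert_document(conn, "7", [("c", "7", "", "{}", blob)])
ids_after = [r[0] for r in conn.execute("SELECT id FROM nodes ORDER BY id")]

# A failing batch (duplicate primary key) must leave the old rows in place.
try:
    upsert_document(conn, "7", [("d", "7", "", "{}", blob), ("d", "7", "", "{}", blob)])
except sqlite3.IntegrityError:
    pass
ids_after_failure = [r[0] for r in conn.execute("SELECT id FROM nodes ORDER BY id")]
```

A second connection reading during the transaction sees either the old rows or the new ones, never the empty state between the DELETE and the INSERT.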
|
||||
|
||||
Filter constraint surface (loud errors otherwise, [VERIFIED]): only `=, !=, <, <=, >, >=, IN` on metadata columns in KNN queries. We use only EQ/IN. Never use `NOT IN` (the vtab cannot see it; SQLite post-filters and silently under-delivers below k, asg017/sqlite-vec#116).
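
A sketch of the placeholder-emitting filter translation under the EQ/IN-only constraint. The `(column, op, value)` tuple format and the `build_where` name are hypothetical simplifications of the real filter objects; the column allow-list is an extra guard added for the example, since column names cannot be bound as parameters.

```python
ALLOWED_COLUMNS = {"document_id", "modified"}  # metadata columns from the schema


def build_where(filters):
    """Translate (column, op, value) filters into SQL text plus bound params.

    Only "==" and "in" are accepted. Anything else (NOT IN in particular)
    raises instead of producing a query that would silently under-deliver.
    """
    clauses, params = [], []
    for column, op, value in filters:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported filter column: {column}")
        if op == "==":
            clauses.append(f"{column} = ?")
            params.append(value)
        elif op == "in":
            values = list(value)
            if not values:
                raise ValueError("empty IN list")
            placeholders = ", ".join("?" for _ in values)
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(values)
        else:
            raise ValueError(f"unsupported filter operator: {op}")
    return " AND ".join(clauses), params


where_sql, where_params = build_where([("document_id", "in", ["1", "2", "3"])])
```

The returned fragment is appended after `embedding MATCH ? AND k = ?`, and the params list is appended after the vector and `k`.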
|
||||
|
||||
## Compaction: the one real behavioral difference
|
||||
|
||||
vec0 DELETE only flips a validity bit; space is never reclaimed, and VACUUM recovers only about half (asg017/sqlite-vec#54, #220, open; fix PRs #243/#210 unmerged). Measured: 5 delete+reinsert cycles on 2K rows grew the file 3.32 MB -> 6.56 MB; VACUUM got back to 4.94 MB. Paperless's per-document churn (every document edit is a delete+reinsert) hits this directly.
|
||||
|
||||
So `compact()` becomes the maintainer-endorsed rebuild (asg017/sqlite-vec#205):
|
||||
|
||||
```sql
CREATE VIRTUAL TABLE nodes_new USING vec0(...);
INSERT INTO nodes_new SELECT id, document_id, modified, node_content, embedding FROM nodes;
DROP TABLE nodes;
ALTER TABLE nodes_new RENAME TO nodes; -- then VACUUM
```
|
||||
|
||||
This copies vectors without re-embedding, runs under the existing write FileLock, and slots into the existing `document_llmindex compact` command and the scheduled maintenance task. A cheap trigger heuristic: rebuild when the (cumulative) `count(*)` of the `nodes_rowids` shadow table exceeds ~2x the live row count, or just keep the existing scheduled cadence.
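
The rebuild and its trigger can be sketched as below. Assumptions for the example: an ordinary table stands in for vec0 so the code runs without the extension, the four statements are wrapped in a single transaction (the design does not say whether they are), and `should_rebuild` / `rebuild_nodes` are hypothetical names.

```python
import sqlite3

COLUMNS = "id, document_id, modified, node_content, embedding"


def should_rebuild(cumulative_rows, live_rows, factor=2.0):
    """Trigger heuristic: rebuild once cumulative rows exceed ~factor x live rows."""
    return live_rows > 0 and cumulative_rows > factor * live_rows


def rebuild_nodes(conn, create_nodes_new_sql):
    """Copy live rows into a fresh table, swap it in, then VACUUM.

    conn must be in autocommit mode (isolation_level=None).
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(create_nodes_new_sql)
        conn.execute(
            f"INSERT INTO nodes_new ({COLUMNS}) SELECT {COLUMNS} FROM nodes"
        )
        conn.execute("DROP TABLE nodes")
        conn.execute("ALTER TABLE nodes_new RENAME TO nodes")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("VACUUM")  # VACUUM cannot run inside a transaction


PLAIN = (
    "CREATE TABLE {name} (id TEXT PRIMARY KEY, document_id TEXT, "
    "modified TEXT, node_content TEXT, embedding BLOB)"
)
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(PLAIN.format(name="nodes"))
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
    [("a", "1", "", "{}", b"\x01\x02"), ("b", "2", "", "{}", b"\x03\x04")],
)
before = conn.execute(f"SELECT {COLUMNS} FROM nodes ORDER BY id").fetchall()
rebuild_nodes(conn, PLAIN.format(name="nodes_new"))
after = conn.execute(f"SELECT {COLUMNS} FROM nodes ORDER BY id").fetchall()
```

Comparing `before` and `after` is the same byte-for-byte check the test plan asks for.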
|
||||
|
||||
## Concurrency
|
||||
|
||||
vec0 is a plain vtab over ordinary shadow tables, so standard SQLite WAL semantics apply, and the existing architecture is already the textbook arrangement: writers serialized by `settings.LLM_INDEX_LOCK` FileLock, readers concurrent via WAL. Verified across processes: a reader during another process's open write transaction does not block and sees a consistent pre-transaction snapshot; post-commit it sees the new rows [VERIFIED]. No sqlite-vec-specific multi-process corruption, locking, or segfault reports exist in the tracker. The 0.1.10a4 cached-statement fix (#295) is a Firefox/mozStorage `sqlite3_close()` issue; CPython's `sqlite3` is unaffected, no Python-side reports.
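
The snapshot behavior described above is ordinary WAL behavior, so it can be reproduced with two connections and a plain table. This sketch runs both connections in one process for brevity (the verification in this document was cross-process) and uses a stand-in table rather than vec0.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "llmindex.db"

    writer = sqlite3.connect(db_path, isolation_level=None)
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY)")

    reader = sqlite3.connect(db_path, isolation_level=None)

    # Writer opens a transaction and inserts, but does not commit yet.
    writer.execute("BEGIN IMMEDIATE")
    writer.execute("INSERT INTO nodes VALUES ('a')")

    # Reader is not blocked and sees the pre-transaction snapshot.
    seen_during_write = reader.execute("SELECT count(*) FROM nodes").fetchall()[0][0]

    writer.execute("COMMIT")

    # After the commit, a new read sees the inserted row.
    seen_after_commit = reader.execute("SELECT count(*) FROM nodes").fetchall()[0][0]

    reader.close()
    writer.close()
```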
|
||||
|
||||
Same caveat as the main SQLite DB: `LLM_INDEX_DIR` should not be on NFS.
|
||||
|
||||
## Performance expectations (measured on the 0.1.9 no-SIMD wheel)
|
||||
|
||||
- KNN 20K rows x 1024 dims: ~74 ms plain, ~39 ms with a metadata EQ filter.
- 100K x 768: 185 ms/query (vs 497 ms for LanceDB exact search on identical data).
- Extrapolated 500K x 1024-1536: ~0.9-1.8 s/query; 384 dims roughly 4x faster. Acceptable for suggestions/chat at the extreme tail; typical installs (low tens of thousands of chunks) are tens of ms.
- Insert: ~3,300 rows/s at 1024 dims in a single transaction.
- File size: ~raw vector size (~4.3 KB/row at 1024 dims), no compression; plus the bloat behavior above.
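
A back-of-the-envelope helper for sanity-checking figures like these. The linear scaling in rows and dimensions is an assumption of this sketch (reasonable for a brute-force scan, but not necessarily how the extrapolated range above was derived), and both function names are hypothetical.

```python
def raw_vector_bytes(rows, dim):
    """Raw float32 storage: 4 bytes per dimension per row, no overhead."""
    return rows * dim * 4


def scale_latency_ms(measured_ms, measured_rows, measured_dim, target_rows, target_dim):
    """Scale a measured query time linearly in both row count and dimension."""
    return measured_ms * (target_rows / measured_rows) * (target_dim / measured_dim)


# 1024 dims of float32 is 4096 bytes of raw vector data per row.
bytes_per_row = raw_vector_bytes(1, 1024)

# Scaling the 100K x 768 measurement (185 ms) to 500K x 1536.
estimate_ms = scale_latency_ms(185, 100_000, 768, 500_000, 1536)
```

The gap between the 4096 raw bytes and the measured ~4.3 KB/row is metadata, payload, and page overhead.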
|
||||
|
||||
## Migration from the Lance store
|
||||
|
||||
Beta policy: re-embed. On startup/first index task: if `LLM_INDEX_DIR` contains a Lance table but no `llmindex.db`, log and queue a full rebuild, then remove the Lance directory. No cross-store vector copy, no lancedb import anywhere in the path (which is what un-breaks #12970 hosts: they currently crash at import, have no usable index, and get a fresh build).
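
The startup decision can be made from the filesystem alone, which is what keeps `lancedb` out of the import path. In this sketch, `needs_rebuild_from_lance` is a hypothetical name, and detecting a Lance table as a `*.lance` directory is an assumption about the on-disk layout that should be checked against the real index directory.

```python
import tempfile
from pathlib import Path

DB_NAME = "llmindex.db"


def needs_rebuild_from_lance(index_dir: Path) -> bool:
    """True when a Lance table is present but the sqlite-vec DB is not.

    Assumption: a Lance table shows up as a "<name>.lance" directory.
    No lancedb import is involved; this only inspects the directory.
    """
    has_db = (index_dir / DB_NAME).exists()
    has_lance = any(p.is_dir() for p in index_dir.glob("*.lance"))
    return has_lance and not has_db


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    empty_dir_result = needs_rebuild_from_lance(index_dir)

    (index_dir / "documents.lance").mkdir()  # hypothetical table name
    lance_only_result = needs_rebuild_from_lance(index_dir)

    (index_dir / DB_NAME).touch()
    both_present_result = needs_rebuild_from_lance(index_dir)
```

When it returns `True`, the caller would log, queue the full rebuild, and remove the Lance directory.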
|
||||
|
||||
PR #12968's migration machinery maps onto `index_meta['schema_version']`: structural migrations = create-new-table + `INSERT ... SELECT` + rename (vectors copied, no re-embed; same shape as the compaction rebuild); re-embed migrations = drop + full rebuild, jumping straight to the current version.
|
||||
|
||||
## Dependency changes
|
||||
|
||||
- Add: `sqlite-vec==0.1.9` (one ~100 KB platform wheel, zero Python deps).
- Remove: `lancedb~=0.33.0` (and its pylance/lancedb wheels, ~40 MB). `pyarrow` leaves this module; check whether anything else in the AI stack still needs it before dropping from pyproject.
|
||||
|
||||
## Test plan notes
|
||||
|
||||
- pytest-style per project convention; the store tests can run against a tmp_path DB file (or `:memory:` for pure-logic tests; extension loading works on uv-managed CPython [VERIFIED]).
- Port the existing `test_vector_store.py` surface; add dedicated tests for: upsert transactionality (no transient empty state mid-upsert from a second connection), NULL-coercion in `_row()`, k-slice behavior, EQ/IN filter correctness, compaction rebuild preserving rows byte-for-byte, vec_debug canary logging.
- The qemu matrix (`/tmp/vstore-avx-test/`) can be re-run against any future sqlite-vec bump: `qemu-x86_64 -cpu Westmere venv/bin/python candidate_test.py sqlite_vec <dir>`.
|
||||
|
||||
## Benchmark harness
|
||||
|
||||
`src/bench_vector_store.py` -- standalone head-to-head comparison run during the migration window when both `PaperlessLanceVectorStore` and `PaperlessSqliteVecVectorStore` coexist (Task 3 Phase A of the implementation plan). After Phase B replaces `vector_store.py`, the Lance import fails gracefully and only the sqlite-vec half runs (useful for post-migration baseline checks).
|
||||
|
||||
```bash
cd src
uv run python bench_vector_store.py               # auto-generates bench_data.pkl on first run
uv run python bench_vector_store.py --regenerate  # force re-embed
```
|
||||
|
||||
**Phase 1 (data generation, skipped if `bench_data.pkl` exists):** Faker generates `--n-docs` (default 2000) fake documents, each with a title, body, correspondent, and ISO timestamp. Each body is split into `--chunks-per-doc` (default 3) equal-length chunks (~6000 total nodes). A warm-up embed call fires before generation to ensure the model is resident on the GPU. All chunk texts are embedded via Ollama `/api/embed` in batches of 32 and saved to `bench_data.pkl`. The Faker seed is fixed at 42 for reproducibility.
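
The chunking and batching steps of Phase 1 can be sketched without Faker or Ollama. Both helper names are hypothetical, and letting the last chunk absorb the remainder is an assumption about how "equal-length" handles bodies that do not divide evenly.

```python
def split_equal(text, parts):
    """Split text into `parts` chunks of equal length; the last takes any remainder."""
    size = max(1, len(text) // parts)
    chunks = [text[i * size:(i + 1) * size] for i in range(parts - 1)]
    chunks.append(text[(parts - 1) * size:])
    return chunks


def batched(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items (one embed call each)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = split_equal("abcdefghi", 3)
batch_sizes = [len(batch) for batch in batched(list(range(70)))]
```

With the defaults, 2000 documents x 3 chunks gives 6000 texts, so the embed loop would issue 188 requests of up to 32 texts each.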
|
||||
|
||||
**Phase 2 (benchmark):** Each store runs in an isolated `tempfile.TemporaryDirectory()`. Query vectors are drawn reproducibly from the corpus (every 10th node, wrapping).
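
One reading of "every 10th node, wrapping" is a fixed stride with modulo wrap-around, sketched here. The `pick_query_indices` name and the exact wrap rule are assumptions; the harness may select indices differently.

```python
def pick_query_indices(n_nodes, n_queries, stride=10):
    """Deterministic query picks: indices 0, stride, 2*stride, ... modulo n_nodes."""
    return [(i * stride) % n_nodes for i in range(n_queries)]


# With only 25 nodes, the fourth pick wraps around to index 5.
indices = pick_query_indices(25, 4)
```

Because nothing here is random, both stores are queried with exactly the same vectors on every run.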
|
||||
|
||||
| Operation | Reps | Metric |
| --- | --- | --- |
| `add()` bulk insert | 1 | total time |
| `query()` plain | 50 | p50 / p95 |
| `query()` filtered (IN on 20% of doc IDs) | 50 | p50 / p95 |
| `get_modified_times()` | 20 | p50 |
| `upsert_document()` | 50 | p50 / p95 |
| `compact()` | 1 | total time |
| File size | -- | pre- and post-compact |
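
The p50 / p95 metrics in the table could be collected with a loop like the following. The nearest-rank percentile definition is an assumption (the harness may interpolate instead), and both function names are hypothetical.

```python
import time


def time_operation(fn, reps):
    """Run fn() reps times and return the per-call wall time in milliseconds."""
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, computed without floating point.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


timings = time_operation(lambda: sum(range(1000)), reps=50)
p50, p95 = percentile(timings, 50), percentile(timings, 95)
```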
|
||||
|
||||
**CLI flags:** `--n-docs` (2000), `--chunks-per-doc` (3), `--data-file` (`bench_data.pkl`), `--regenerate`, `--ollama-url` (`http://192.168.1.87:11434`), `--embed-model` (`qwen3-embedding:4b`), `--query-iters` (50).
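
The flag list maps directly onto an `argparse` definition. The flag names and defaults are taken from the list above; `build_parser` and the description string are hypothetical, and the real script may structure this differently.

```python
import argparse


def build_parser():
    """Argument parser mirroring the documented flags and their defaults."""
    parser = argparse.ArgumentParser(description="Vector store benchmark (sketch)")
    parser.add_argument("--n-docs", type=int, default=2000)
    parser.add_argument("--chunks-per-doc", type=int, default=3)
    parser.add_argument("--data-file", default="bench_data.pkl")
    parser.add_argument("--regenerate", action="store_true")
    parser.add_argument("--ollama-url", default="http://192.168.1.87:11434")
    parser.add_argument("--embed-model", default="qwen3-embedding:4b")
    parser.add_argument("--query-iters", type=int, default=50)
    return parser


args = build_parser().parse_args(["--n-docs", "500", "--regenerate"])
```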
|
||||
|
||||
**Dependencies:** `faker` and `httpx` must be available (`uv add --dev faker httpx` if not already installed).
|
||||
|
||||
## Risk register (from the 2026-06-10 issues audit)
|
||||
|
||||
| Risk | Ref | State | Disposition |
| --- | --- | --- | --- |
| 0.1.10+ wheels bake AVX, no dispatch | release CI change, verified on 0.1.10a4 | current | Pin 0.1.9; vec_debug canary; upstream ask before any bump |
| DELETE never reclaims space; VACUUM ~50% | #54, #220 | open | Rebuild-based `compact()` above |
| INSERT OR REPLACE broken on vec0 | #259 | open | Use DELETE+INSERT in txn (design already does) |
| NULL metadata rejected | #141 | open | Sentinel `""` coercion (already current behavior) |
| Partition-key IN returns k per partition | #142 | open | Avoided: document_id is a metadata column |
| NOT IN silently under-delivers | #116 | open | Never emit NOT IN |
| Locale strtod breaks JSON vector parsing | #241 | open | Always BLOB-bind vectors |
| Single weekend maintainer; fix PRs languish | #226 | open | Mitigated by Mozilla sponsorship + Firefox vendoring (release-train consumer); pin + vendor-from-source remains the escape hatch (no sdist: #211) |
| ANN index = one-way file format | 0.1.10 alphas | -- | Do not adopt ANN until 0.1.10 final + flag audit |
| Long-TEXT metadata DELETE bug | #274 | fixed in 0.1.9 | Floor requirement `>=0.1.9` already implied by pin |
|
||||