mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-06-30 17:24:22 +00:00
Marks some things as done
@@ -0,0 +1,745 @@
|
||||
# LanceDB Schema Migration Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Add a schema versioning and migration system to the LanceDB vector store so that structural column changes can be applied in-place without re-embedding documents, avoiding token costs for users on paid embedding APIs.
|
||||
|
||||
**Architecture:** A `schema_version.json` file is written alongside the LanceDB data directory and tracks the current applied version. A `Migration` dataclass registry in `vector_store.py` holds ordered, typed migration steps; each migration is classified as `requires_reembed=True/False`. At index update time, structural-only migrations are applied in-place via LanceDB's `add_columns`/`alter_columns`/`drop_columns` APIs; if any pending migration requires re-embedding, the existing model-mismatch rebuild path is reused.
|
||||
|
||||
**Tech Stack:** Python 3.11, lancedb 0.33, pyarrow, pytest, pytest-mock, factory-boy
|
||||
|
||||
---
|
||||
|
||||
## File Map
|
||||
|
||||
| File | Change |
| --- | --- |
| `src/paperless_ai/vector_store.py` | Add `CURRENT_SCHEMA_VERSION`, `Migration` dataclass, version file helpers, migration methods; modify `_ensure_table` and `drop_table` |
| `src/paperless_ai/indexing.py` | Call migration inside `update_llm_index`'s `write_store` block |
| `src/paperless_ai/tests/test_vector_store.py` | New `TestSchemaVersioning` and `TestMigrations` test classes |
| `src/paperless_ai/tests/test_ai_indexing.py` | Two new integration tests for migration path |
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Schema version file helpers
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/vector_store.py`
|
||||
- Test: `src/paperless_ai/tests/test_vector_store.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing tests**
|
||||
|
||||
Add a new class at the bottom of `test_vector_store.py`:
|
||||
|
||||
```python
class TestSchemaVersioning:
    @pytest.fixture
    def uri(self, tmp_path: Path) -> str:
        return str(tmp_path / "idx")

    def test_version_file_written_on_table_creation(self, uri: str) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])

        version_file = Path(uri) / "schema_version.json"
        assert version_file.exists()
        assert json.loads(version_file.read_text())["version"] == CURRENT_SCHEMA_VERSION

    def test_stored_schema_version_returns_current_when_file_missing(
        self, uri: str
    ) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        (Path(uri) / "schema_version.json").unlink()

        reopened = PaperlessLanceVectorStore(uri=uri)
        assert reopened.stored_schema_version() == CURRENT_SCHEMA_VERSION

    def test_stored_schema_version_persists_after_reopen(self, uri: str) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        PaperlessLanceVectorStore(uri=uri).add([_node("1-0", "1", "text", 0.1)])

        reopened = PaperlessLanceVectorStore(uri=uri)
        assert reopened.stored_schema_version() == CURRENT_SCHEMA_VERSION

    def test_drop_table_removes_version_file(self, uri: str) -> None:
        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        assert (Path(uri) / "schema_version.json").exists()

        store.drop_table()
        assert not (Path(uri) / "schema_version.json").exists()

    def test_version_file_written_on_upsert_creation(self, uri: str) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = PaperlessLanceVectorStore(uri=uri)
        store.upsert_document("1", [_node("1-0", "1", "text", 0.1)])

        version_file = Path(uri) / "schema_version.json"
        assert json.loads(version_file.read_text())["version"] == CURRENT_SCHEMA_VERSION
```
|
||||
|
||||
Add `import json` and `import pytest_mock` to the top of `test_vector_store.py`.
|
||||
|
||||
- [ ] **Step 2: Run tests to verify they fail**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestSchemaVersioning -v"
```
|
||||
|
||||
Expected: all 5 tests fail. Four fail with `ImportError` because `CURRENT_SCHEMA_VERSION` doesn't exist yet; `test_drop_table_removes_version_file` fails with `AssertionError` because no version file is written yet.
|
||||
|
||||
- [ ] **Step 3: Implement the schema version helpers in `vector_store.py`**
|
||||
|
||||
After the existing imports and before the `DEFAULT_TABLE_NAME` constant, add:
|
||||
|
||||
```python
import json
from pathlib import Path
```
|
||||
|
||||
After `DEFAULT_TABLE_NAME = "documents"`, add:
|
||||
|
||||
```python
CURRENT_SCHEMA_VERSION: int = 1
```
|
||||
|
||||
No further module-level additions are needed after the `ANN_PQ_SUB_VECTORS` constant; the version helpers are methods on the class.
|
||||
|
||||
Inside `PaperlessLanceVectorStore`, add these methods after `stored_model_name`:
|
||||
|
||||
```python
    @property
    def _schema_version_path(self) -> Path:
        return Path(self._uri) / "schema_version.json"

    def stored_schema_version(self) -> int:
        """Return the schema version recorded on disk, or CURRENT_SCHEMA_VERSION if missing.

        Missing means either the table predates versioning or was just created and the
        write hasn't happened yet; treat conservatively as already current.
        """
        try:
            return int(json.loads(self._schema_version_path.read_text())["version"])
        except (FileNotFoundError, KeyError, TypeError, ValueError):
            # TypeError covers a non-object JSON payload or a null version value;
            # ValueError covers malformed JSON (JSONDecodeError) and non-numeric values.
            return CURRENT_SCHEMA_VERSION

    def _write_schema_version(self, version: int) -> None:
        self._schema_version_path.parent.mkdir(parents=True, exist_ok=True)
        self._schema_version_path.write_text(json.dumps({"version": version}))
```
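The read/write round trip and the fallback behaviour can be tried in isolation. The following is a minimal sketch using only the standard library; `read_version` and `write_version` are hypothetical free-function stand-ins for the methods above, not part of `vector_store.py`.

```python
import json
import tempfile
from pathlib import Path

CURRENT_SCHEMA_VERSION = 1  # mirrors the constant introduced above


def read_version(index_dir: Path) -> int:
    """Hypothetical stand-in for ``stored_schema_version``."""
    try:
        payload = json.loads((index_dir / "schema_version.json").read_text())
        return int(payload["version"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        # Any unreadable or missing file is treated as "already current".
        return CURRENT_SCHEMA_VERSION


def write_version(index_dir: Path, version: int) -> None:
    """Hypothetical stand-in for ``_write_schema_version``."""
    index_dir.mkdir(parents=True, exist_ok=True)
    (index_dir / "schema_version.json").write_text(json.dumps({"version": version}))


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "idx"
    missing = read_version(index_dir)  # directory does not exist yet
    write_version(index_dir, 3)
    stored = read_version(index_dir)  # value written above
    (index_dir / "schema_version.json").write_text("not json")
    corrupt = read_version(index_dir)  # JSONDecodeError is a ValueError
```

Because a missing or unreadable file falls back to the current version, an index that predates versioning is never migrated by this path; that matches the docstring's "treat conservatively" choice.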
|
||||
|
||||
Modify `_ensure_table` to write the version after creating the table. Replace the current method body:
|
||||
|
||||
```python
    def _ensure_table(self, rows: list[dict[str, Any]], dim: int) -> bool:
        if self._table is not None:
            return False
        self._table = self._conn.create_table(
            self._table_name,
            rows,
            schema=self._schema(dim, self._embed_model_name),
        )
        self._write_schema_version(CURRENT_SCHEMA_VERSION)
        return True
```
|
||||
|
||||
Modify `drop_table` to also remove the version file:
|
||||
|
||||
```python
    def drop_table(self) -> None:
        if self.table_exists():
            self._conn.drop_table(self._table_name)
        self._table = None
        self._schema_version_path.unlink(missing_ok=True)
```
|
||||
|
||||
- [ ] **Step 4: Run tests to verify they pass**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestSchemaVersioning -v"
```
|
||||
|
||||
Expected: all 5 tests pass.
|
||||
|
||||
- [ ] **Step 5: Verify no regressions**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
```
|
||||
|
||||
Expected: all existing tests still pass.
|
||||
|
||||
- [ ] **Step 6: Lint**
|
||||
|
||||
```bash
ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
```
|
||||
|
||||
Expected: no errors.
|
||||
|
||||
- [ ] **Step 7: Commit**
|
||||
|
||||
```bash
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
git commit -m "feat(ai): add schema version file tracking to LanceDB vector store"
```
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Migration dataclass and pending migration detection
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/vector_store.py`
|
||||
- Test: `src/paperless_ai/tests/test_vector_store.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing tests**
|
||||
|
||||
Add a new class to `test_vector_store.py`:
|
||||
|
||||
```python
class TestMigrationRegistry:
    @pytest.fixture
    def uri(self, tmp_path: Path) -> str:
        return str(tmp_path / "idx")

    def _store_at_version(self, uri: str, version: int) -> PaperlessLanceVectorStore:
        """Create a store with a table and then fake its on-disk version."""
        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        store._write_schema_version(version)
        return PaperlessLanceVectorStore(uri=uri)  # reopen to pick up written version

    def test_pending_migrations_empty_at_current_version(self, uri: str) -> None:
        # Only the constant is needed here; importing Migration would be unused.
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = self._store_at_version(uri, CURRENT_SCHEMA_VERSION)
        assert store.pending_migrations() == []

    def test_pending_migrations_returns_migrations_above_stored_version(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
        m3 = Migration(version=3, description="reindex", requires_reembed=True, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])

        store = self._store_at_version(uri, 1)
        pending = store.pending_migrations()
        assert pending == [m2, m3]

    def test_pending_migrations_excludes_already_applied(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
        m3 = Migration(version=3, description="reindex", requires_reembed=True, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])

        store = self._store_at_version(uri, 2)
        pending = store.pending_migrations()
        assert pending == [m3]

    def test_pending_migrations_empty_when_no_table(self, uri: str) -> None:
        store = PaperlessLanceVectorStore(uri=uri)
        assert store.pending_migrations() == []

    def test_requires_reembed_migration_false_when_none_pending(self, uri: str) -> None:
        store = self._store_at_version(uri, 1)
        assert store.requires_reembed_migration() is False

    def test_requires_reembed_migration_false_when_only_structural_pending(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])

        store = self._store_at_version(uri, 1)
        assert store.requires_reembed_migration() is False

    def test_requires_reembed_migration_true_when_reembed_migration_pending(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="reindex", requires_reembed=True, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])

        store = self._store_at_version(uri, 1)
        assert store.requires_reembed_migration() is True
```
|
||||
|
||||
- [ ] **Step 2: Run tests to verify they fail**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMigrationRegistry -v"
```
|
||||
|
||||
Expected: all 7 tests fail — `Migration`, `MIGRATIONS`, `pending_migrations`, `requires_reembed_migration` don't exist yet.
|
||||
|
||||
- [ ] **Step 3: Add `Migration` dataclass and registry to `vector_store.py`**
|
||||
|
||||
Add near the top of the file, after the existing imports:
|
||||
|
||||
```python
from dataclasses import dataclass, field
from typing import Callable
```
|
||||
|
||||
After the `CURRENT_SCHEMA_VERSION` constant, add:
|
||||
|
||||
```python
@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    apply: Callable[[Any], None] = field(compare=False, hash=False)
```
|
||||
|
||||
(`compare=False, hash=False` excludes `apply` from `__eq__` and `__hash__`, so equality is driven by `version`, `description`, and `requires_reembed`. This avoids lambda identity issues in tests and makes the API safe for callers that construct `Migration` instances inline.)
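The effect of excluding `apply` from comparison can be checked with the standard library alone. This is a small sketch that repeats the dataclass definition so it runs on its own; the instance names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    # Excluded from __eq__ and __hash__, so two distinct lambdas do not matter.
    apply: Callable[[Any], None] = field(compare=False, hash=False)


a = Migration(2, "add col", False, apply=lambda t: None)
b = Migration(2, "add col", False, apply=lambda t: None)  # different lambda object
c = Migration(2, "other description", False, apply=lambda t: None)

same_despite_lambdas = a == b
same_hash = hash(a) == hash(b)
differs_on_description = a != c
```

Note that `description` and `requires_reembed` still take part in the comparison, so two entries with the same `version` but different descriptions are not equal.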
|
||||
|
||||
Below the dataclass, add the registry:

```python
# Ordered list of schema migrations. Each entry upgrades the table to `version`.
# Structural migrations (requires_reembed=False) are applied in-place via LanceDB's
# add_columns/alter_columns/drop_columns APIs; no re-embedding is needed.
# Migrations with requires_reembed=True cause a full rebuild on next index update,
# exactly like a model-name change does today.
#
# To add a migration:
#   1. Increment CURRENT_SCHEMA_VERSION.
#   2. Append a Migration entry here with the new version number.
#   3. For structural changes, call table.add_columns/alter_columns/drop_columns in apply().
#   4. For embedding-invalidating changes, set requires_reembed=True; apply() can be a no-op.
MIGRATIONS: list[Migration] = []
```

Inside `PaperlessLanceVectorStore`, add `pending_migrations` and `requires_reembed_migration` after `_write_schema_version`:

```python
    def pending_migrations(self) -> list[Migration]:
        """Return migrations not yet applied to this table, in version order."""
        if self._table is None:
            return []
        current = self.stored_schema_version()
        return [m for m in MIGRATIONS if m.version > current]

    def requires_reembed_migration(self) -> bool:
        """True when any pending migration requires a full re-embedding."""
        return any(m.requires_reembed for m in self.pending_migrations())
```
|
||||
|
||||
- [ ] **Step 4: Run tests to verify they pass**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMigrationRegistry -v"
```
|
||||
|
||||
Expected: all 7 tests pass.
|
||||
|
||||
- [ ] **Step 5: Lint**
|
||||
|
||||
```bash
ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
```
|
||||
|
||||
- [ ] **Step 6: Commit**
|
||||
|
||||
```bash
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
git commit -m "feat(ai): add Migration registry and pending migration detection"
```
|
||||
|
||||
---
|
||||
|
||||
## Task 3: Apply structural migrations in-place
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/vector_store.py`
|
||||
- Test: `src/paperless_ai/tests/test_vector_store.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing tests**
|
||||
|
||||
Add a new class to `test_vector_store.py`:
|
||||
|
||||
```python
class TestApplyStructuralMigrations:
    @pytest.fixture
    def uri(self, tmp_path: Path) -> str:
        return str(tmp_path / "idx")

    def _store_at_version(self, uri: str, version: int) -> PaperlessLanceVectorStore:
        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        store._write_schema_version(version)
        return PaperlessLanceVectorStore(uri=uri)

    def test_apply_structural_adds_column_via_lancedb(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        def _add_extra(table: Any) -> None:
            table.add_columns({"extra": "CAST(NULL AS VARCHAR)"})

        m2 = Migration(version=2, description="add extra col", requires_reembed=False, apply=_add_extra)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])

        store = self._store_at_version(uri, 1)
        applied = store.apply_structural_migrations()

        assert len(applied) == 1
        assert applied[0] == m2
        # Column actually present in the table schema.
        reopened = PaperlessLanceVectorStore(uri=uri)
        field_names = [f.name for f in reopened._table.schema]
        assert "extra" in field_names

    def test_apply_structural_updates_version_file(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: t.add_columns({"c": "CAST(NULL AS VARCHAR)"}))
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])

        store = self._store_at_version(uri, 1)
        store.apply_structural_migrations()

        assert store.stored_schema_version() == 2

    def test_apply_structural_skips_reembed_migrations(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        applied_versions: list[int] = []
        m2 = Migration(version=2, description="structural", requires_reembed=False, apply=lambda t: applied_versions.append(2) or t.add_columns({"c": "CAST(NULL AS VARCHAR)"}))
        m3 = Migration(version=3, description="reembed", requires_reembed=True, apply=lambda t: applied_versions.append(3))
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])

        store = self._store_at_version(uri, 1)
        applied = store.apply_structural_migrations()

        assert [m.version for m in applied] == [2]
        assert 3 not in applied_versions
        # Version advances only to the last structural migration applied.
        assert store.stored_schema_version() == 2

    def test_apply_structural_noop_at_current_version(self, uri: str) -> None:
        store = self._store_at_version(uri, 1)
        applied = store.apply_structural_migrations()
        assert applied == []

    def test_apply_structural_noop_when_no_table(self, uri: str) -> None:
        store = PaperlessLanceVectorStore(uri=uri)
        applied = store.apply_structural_migrations()
        assert applied == []

    def test_apply_structural_refreshes_table_reference(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        """After add_columns the in-memory table object must reflect the new schema."""
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: t.add_columns({"extra": "CAST(NULL AS VARCHAR)"}))
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])

        store = self._store_at_version(uri, 1)
        store.apply_structural_migrations()

        # The store's own _table reference (not a re-open) must see the new column.
        field_names = [f.name for f in store._table.schema]
        assert "extra" in field_names
```
|
||||
|
||||
Add `from typing import Any` to the test file imports if not already present.
|
||||
|
||||
- [ ] **Step 2: Run tests to verify they fail**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestApplyStructuralMigrations -v"
```
|
||||
|
||||
Expected: all 6 tests fail — `apply_structural_migrations` doesn't exist yet.
|
||||
|
||||
- [ ] **Step 3: Implement `apply_structural_migrations` in `vector_store.py`**
|
||||
|
||||
Add after `requires_reembed_migration` on the class:
|
||||
|
||||
```python
    def apply_structural_migrations(self) -> list[Migration]:
        """Apply all pending structural (non-reembed) migrations in version order.

        Each applied migration's ``apply`` callable receives the live LanceDB table
        object and should call ``add_columns``, ``alter_columns``, or ``drop_columns``
        as needed. After all structural migrations run, the version file is updated
        to the highest version applied and the in-memory table reference is refreshed.

        Migrations with ``requires_reembed=True`` are skipped; the caller is
        responsible for detecting them via ``requires_reembed_migration()`` and
        triggering a full rebuild.
        """
        if self._table is None:
            return []
        structural = [m for m in self.pending_migrations() if not m.requires_reembed]
        if not structural:
            return []
        for migration in structural:
            logger.info("Applying schema migration v%d: %s", migration.version, migration.description)
            migration.apply(self._table)
        # Refresh the in-memory table so subsequent operations see the new schema.
        self._table = self._conn.open_table(self._table_name)
        self._write_schema_version(structural[-1].version)
        return structural
```
|
||||
|
||||
- [ ] **Step 4: Run tests to verify they pass**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestApplyStructuralMigrations -v"
```
|
||||
|
||||
Expected: all 6 tests pass.
|
||||
|
||||
- [ ] **Step 5: Full test_vector_store regression check**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
```
|
||||
|
||||
Expected: all tests pass.
|
||||
|
||||
- [ ] **Step 6: Lint**
|
||||
|
||||
```bash
ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
```
|
||||
|
||||
- [ ] **Step 7: Commit**
|
||||
|
||||
```bash
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
git commit -m "feat(ai): implement apply_structural_migrations for in-place schema changes"
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4: Wire migrations into `update_llm_index`
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/indexing.py`
|
||||
- Test: `src/paperless_ai/tests/test_ai_indexing.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing tests**
|
||||
|
||||
Add these two tests to `test_ai_indexing.py`, after the existing `test_update_llm_index_rebuilds_on_model_name_change` test:
|
||||
|
||||
```python
@pytest.mark.django_db
def test_update_llm_index_applies_structural_migration_without_rebuild(
    temp_llm_index_dir: Path,
    real_document: Document,
    mock_embed_model: FakeEmbedding,
    mocker: pytest_mock.MockerFixture,
) -> None:
    """Structural migrations are applied in-place; no full rebuild (drop) occurs."""
    from paperless_ai.vector_store import Migration, PaperlessLanceVectorStore

    column_added: list[bool] = []

    def _add_extra(table) -> None:
        table.add_columns({"extra": "CAST(NULL AS VARCHAR)"})
        column_added.append(True)

    # Build the initial index at version 1 (the real CURRENT_SCHEMA_VERSION; no patches needed).
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=True)

    # Simulate a new v2 structural migration being introduced after the initial index was built.
    m2 = Migration(version=2, description="add extra col", requires_reembed=False, apply=_add_extra)
    mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
    mocker.patch("paperless_ai.vector_store.CURRENT_SCHEMA_VERSION", 2)
    drop_spy = mocker.spy(PaperlessLanceVectorStore, "drop_table")

    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=False)

    assert column_added, "Structural migration apply() was not called"
    drop_spy.assert_not_called()


@pytest.mark.django_db
def test_update_llm_index_forces_rebuild_on_reembed_migration(
    temp_llm_index_dir: Path,
    real_document: Document,
    mock_embed_model: FakeEmbedding,
    mocker: pytest_mock.MockerFixture,
) -> None:
    """A pending reembed migration causes a full drop+rebuild on next update."""
    from paperless_ai.vector_store import Migration, PaperlessLanceVectorStore

    # Build the initial index at version 1 (the real CURRENT_SCHEMA_VERSION; no patches needed).
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=True)

    # Simulate a reembed migration at v2 being introduced after the initial index was built.
    m2 = Migration(version=2, description="requires reembed", requires_reembed=True, apply=lambda t: None)
    mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
    mocker.patch("paperless_ai.vector_store.CURRENT_SCHEMA_VERSION", 2)
    drop_spy = mocker.spy(PaperlessLanceVectorStore, "drop_table")

    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=False)

    drop_spy.assert_called()
```
|
||||
|
||||
- [ ] **Step 2: Run tests to verify they fail**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_applies_structural_migration_without_rebuild src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_forces_rebuild_on_reembed_migration -v"
```
|
||||
|
||||
Expected: both tests fail because `update_llm_index` doesn't call migration methods yet.
|
||||
|
||||
- [ ] **Step 3: Add migration check inside `update_llm_index` in `indexing.py`**
|
||||
|
||||
Inside the `with write_store(embed_model_name=model_name) as store:` block in `update_llm_index`, insert the migration check immediately before the `if rebuild or not store.table_exists():` line:
|
||||
|
||||
```python
if not rebuild and store.table_exists():
    store.apply_structural_migrations()
    if store.requires_reembed_migration():
        logger.warning("Schema migration requires re-embedding; forcing LLM index rebuild.")
        rebuild = True
```
|
||||
|
||||
The relevant section of `update_llm_index` should now look like:
|
||||
|
||||
```python
with write_store(embed_model_name=model_name) as store:
    if not rebuild and store.table_exists():
        store.apply_structural_migrations()
        if store.requires_reembed_migration():
            logger.warning("Schema migration requires re-embedding; forcing LLM index rebuild.")
            rebuild = True
    if rebuild or not store.table_exists():
        (settings.LLM_INDEX_DIR / "meta.json").unlink(missing_ok=True)
        logger.info("Rebuilding LLM index.")
        store.drop_table()
        ...
```
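The order of the two calls matters, and it can be reasoned about without LanceDB. The sketch below is a simplified model only: `Step` and `plan_update` are hypothetical names, the table is assumed to exist, and the function mirrors the logic shown above (apply structural steps, record the highest one, then look for a pending reembed step).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Stand-in for Migration with only the fields the decision needs."""

    version: int
    requires_reembed: bool


def plan_update(
    stored: int, migrations: list[Step], rebuild: bool
) -> tuple[list[int], int, bool]:
    """Return (structural versions applied, version on disk, rebuild flag)."""
    applied: list[int] = []
    if not rebuild:
        # Models apply_structural_migrations(): run every pending structural
        # step, then record the highest structural version applied.
        applied = [
            m.version
            for m in migrations
            if m.version > stored and not m.requires_reembed
        ]
        if applied:
            stored = applied[-1]
        # Models requires_reembed_migration(): evaluated against the version
        # that was just written.
        if any(m.requires_reembed for m in migrations if m.version > stored):
            rebuild = True
    return applied, stored, rebuild


structural_only = plan_update(1, [Step(2, False)], rebuild=False)
reembed_last = plan_update(1, [Step(2, False), Step(3, True)], rebuild=False)
reembed_first = plan_update(1, [Step(2, True), Step(3, False)], rebuild=False)
```

In this model, `reembed_first` ends with the version advanced past the reembed step and no rebuild requested. If the real methods behave the same way, it may be worth checking `requires_reembed_migration()` before applying structural migrations, or adding a test for that ordering.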
|
||||
|
||||
- [ ] **Step 4: Run new tests to verify they pass**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_applies_structural_migration_without_rebuild src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_forces_rebuild_on_reembed_migration -v"
```
|
||||
|
||||
Expected: both tests pass.
|
||||
|
||||
- [ ] **Step 5: Full indexing regression check**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
```
|
||||
|
||||
Expected: all existing tests still pass.
|
||||
|
||||
- [ ] **Step 6: Full AI module test run**
|
||||
|
||||
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/ -v"
```
|
||||
|
||||
Expected: all tests pass.
|
||||
|
||||
- [ ] **Step 7: Lint**
|
||||
|
||||
```bash
ruff check src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
ruff format src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
```
|
||||
|
||||
- [ ] **Step 8: Commit**
|
||||
|
||||
```bash
git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
git commit -m "feat(ai): wire schema migrations into update_llm_index; structural changes avoid re-embed"
```
|
||||
|
||||
---
|
||||
|
||||
## How to add a migration (reference for future developers)
|
||||
|
||||
When a future schema change is needed:
|
||||
|
||||
1. Increment `CURRENT_SCHEMA_VERSION` in `vector_store.py`.
2. Append a `Migration` to `MIGRATIONS` with the new version number.
3. If the change is **structural only** (add/rename/drop a column, no embedding content changed):
   - Set `requires_reembed=False`
   - In `apply`, call `table.add_columns({"col": "CAST(NULL AS string)"})`, `table.drop_columns(["col"])`, or `table.alter_columns({"path": "col", "rename": "new_name"})` as appropriate.
4. If the change affects **what text gets embedded** (new fields in `build_llm_index_text`, chunk size change baked into schema, etc.):
   - Set `requires_reembed=True`
   - `apply` can be a no-op (`lambda t: None`); the framework will trigger a full rebuild.
5. Write tests for the migration in `test_vector_store.py` following the `TestApplyStructuralMigrations` patterns.
|
||||
|
||||
Example structural migration adding a `language` column:
|
||||
|
||||
```python
CURRENT_SCHEMA_VERSION: int = 2

MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="Add language column for future locale-aware filtering",
        requires_reembed=False,
        apply=lambda table: table.add_columns({"language": "CAST(NULL AS string)"}),
    ),
]
```
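For comparison, a registry that also carries an embedding-invalidating entry might look like the sketch below. The v3 entry and its description are hypothetical, and the dataclass is repeated only so the snippet runs on its own. The trailing checks are one possible way to catch a registry that is out of order or out of step with `CURRENT_SCHEMA_VERSION`; they are not part of the plan above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Migration:  # copied from Task 2 so this snippet is self-contained
    version: int
    description: str
    requires_reembed: bool
    apply: Callable[[Any], None] = field(compare=False, hash=False)


CURRENT_SCHEMA_VERSION: int = 3

MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="Add language column for future locale-aware filtering",
        requires_reembed=False,
        apply=lambda table: table.add_columns({"language": "CAST(NULL AS string)"}),
    ),
    Migration(
        version=3,
        description="Hypothetical: embedded text changed, rebuild required",
        requires_reembed=True,
        apply=lambda table: None,  # no-op; the rebuild regenerates every row
    ),
]

# Registry sanity checks: strictly increasing versions, ending at the current one.
versions = [m.version for m in MIGRATIONS]
strictly_increasing = all(a < b for a, b in zip(versions, versions[1:]))
ends_at_current = versions[-1] == CURRENT_SCHEMA_VERSION
```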
|
||||
@@ -0,0 +1,446 @@
|
||||
# Node Metadata Enrichment Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Move `filename`, `storage_path`, and `archive_serial_number` from the LanceDB embedding text into `node.metadata`, and register a schema migration that triggers an automatic index rebuild on upgrade.
|
||||
|
||||
**Architecture:** Three small, independent changes to two source files, tested first. The migration is a no-op `apply` (the rebuild regenerates all nodes with correct metadata). All three tests go red first, then each implementation makes them green.
|
||||
|
||||
**Tech Stack:** pytest, pytest-django, pytest-mock, factory_boy, llama_index `MetadataMode`, `feature-lancedb-schema-migrate` branch (must be the base branch for this work).
|
||||
|
||||
**Branch base:** `feature-lancedb-schema-migrate`
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Fail — embedding text no longer contains the three fields
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/tests/test_embedding.py`
|
||||
|
||||
- [ ] **Step 1: Update `mock_document` fixture to set an explicit `storage_path`**
|
||||
|
||||
The fixture currently doesn't set `storage_path`, so the existing code path (`doc.storage_path.name if doc.storage_path else ''`) would call `.name` on a `MagicMock`. Give it an explicit value so assertions are unambiguous.
|
||||
|
||||
Add these two lines to the `mock_document` fixture after `doc.archive_serial_number = "12345"`:
|
||||
|
||||
```python
|
||||
doc.storage_path = MagicMock()
|
||||
doc.storage_path.name = "Finance/Bills"
|
||||
```
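To see why the explicit value matters, here is a minimal sketch using only the standard-library `unittest.mock`. The helper `storage_path_label` and the variable names are illustrative and not taken from the test suite; only the conditional expression comes from the text above. It shows that an unconfigured `MagicMock` attribute is truthy and that `.name` on it is another mock, not a string:

```python
from unittest.mock import MagicMock


def storage_path_label(doc) -> str:
    # Same conditional expression quoted above for the storage path line.
    return f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}"


# Unconfigured mock: storage_path is an auto-created, truthy MagicMock,
# so .name is yet another MagicMock and the label embeds its repr.
bare_doc = MagicMock()
bare_label = storage_path_label(bare_doc)

# Configured mock: the label is deterministic and safe to assert on.
configured_doc = MagicMock()
configured_doc.storage_path = MagicMock()
configured_doc.storage_path.name = "Finance/Bills"
configured_label = storage_path_label(configured_doc)
```

With the bare mock, an assertion such as `"Storage Path: Finance/Bills" not in result` would pass regardless of what the code under test does, which is why the fixture should pin the value.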
|
||||
|
||||
- [ ] **Step 2: Update `test_build_llm_index_text` — flip and add assertions**
|
||||
|
||||
The existing test asserts these fields ARE in the result. Change them to assert they are NOT, and add the two missing ones:
|
||||
|
||||
```python
|
||||
# was: assert "Filename: test_file.pdf" in result
|
||||
assert "Filename: test_file.pdf" not in result
|
||||
assert "Storage Path: Finance/Bills" not in result
|
||||
assert "Archive Serial Number: 12345" not in result
|
||||
```
|
||||
|
||||
The assertions for `Notes`, `Content`, and `Custom Field` lines are unchanged — leave them as-is.
|
||||
|
||||
- [ ] **Step 3: Run the test to confirm it fails**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py::test_build_llm_index_text -v"
|
||||
```
|
||||
|
||||
Expected: `FAILED` — `AssertionError: assert 'Filename: test_file.pdf' not in '...'`
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Pass — remove the three fields from `build_llm_index_text`
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/embedding.py`
|
||||
|
||||
- [ ] **Step 1: Remove the three lines and the TODO comment**
|
||||
|
||||
`build_llm_index_text` currently spans lines 114–133 of `embedding.py`. Replace the whole function with:
|
||||
|
||||
```python
def build_llm_index_text(doc: Document) -> str:
    lines = [
        f"Notes: {','.join([str(c.note) for c in Note.objects.filter(document=doc)])}",
    ]

    for instance in doc.custom_fields.all():
        lines.append(f"Custom Field - {instance.field.name}: {instance}")

    lines.append("\nContent:\n")
    lines.append(doc.content or "")

    return _normalize_llm_index_text("\n".join(lines))
```
|
||||
|
||||
- [ ] **Step 2: Run the test to confirm it passes**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py::test_build_llm_index_text -v"
|
||||
```
|
||||
|
||||
Expected: `PASSED`
|
||||
|
||||
- [ ] **Step 3: Run the full embedding test module to catch regressions**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py -v"
|
||||
```
|
||||
|
||||
Expected: all green.
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
git add src/paperless_ai/embedding.py src/paperless_ai/tests/test_embedding.py
|
||||
git commit -m "refactor(ai): remove filename/storage_path/asn from embedding text"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Fail — `build_document_node` exposes the three fields in metadata
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/tests/test_ai_indexing.py`
|
||||
|
||||
- [ ] **Step 1: Extend `test_build_document_node_structured_fields_in_metadata`**
|
||||
|
||||
This test already checks for `title`, `tags`, etc. Add the three new keys. The `real_document` fixture creates a document with no storage path set, so `storage_path` will be `None` — the key must still be present.
|
||||
|
||||
Replace the existing test body:
|
||||
|
||||
```python
@pytest.mark.django_db
def test_build_document_node_structured_fields_in_metadata(
    real_document: Document,
) -> None:
    """Structured fields must be in node.metadata so the LLM receives them via metadata prepend."""
    nodes = indexing.build_document_node(real_document)
    assert len(nodes) > 0
    for node in nodes:
        assert "title" in node.metadata
        assert "tags" in node.metadata
        assert "correspondent" in node.metadata
        assert "document_type" in node.metadata
        assert "created" in node.metadata
        assert "added" in node.metadata
        assert "modified" in node.metadata
        assert "filename" in node.metadata
        assert "storage_path" in node.metadata  # None is fine; key must exist
        assert "archive_serial_number" in node.metadata
```
|
||||
|
||||
- [ ] **Step 2: Add a test that storage_path carries the name when set**
|
||||
|
||||
Add a new test function after `test_build_document_node_structured_fields_in_metadata`:
|
||||
|
||||
```python
@pytest.mark.django_db
def test_build_document_node_storage_path_name_in_metadata() -> None:
    """storage_path metadata value is the StoragePath name, not None, when set."""
    from documents.tests.factories import DocumentFactory, StoragePathFactory

    sp = StoragePathFactory(name="Finance/Bills")
    doc = DocumentFactory(storage_path=sp)

    nodes = indexing.build_document_node(doc)

    assert len(nodes) > 0
    for node in nodes:
        assert node.metadata["storage_path"] == "Finance/Bills"
```
|
||||
|
||||
- [ ] **Step 3: Add a test that all three new fields are in `excluded_embed_metadata_keys`**
|
||||
|
||||
Add after the previous test:
|
||||
|
||||
```python
@pytest.mark.django_db
def test_build_document_node_new_fields_excluded_from_embedding(
    real_document: Document,
) -> None:
    """filename, storage_path, and archive_serial_number must not appear in embedding text."""
    from llama_index.core.schema import MetadataMode

    nodes = indexing.build_document_node(real_document)
    assert len(nodes) > 0
    for node in nodes:
        assert "filename" in node.excluded_embed_metadata_keys
        assert "storage_path" in node.excluded_embed_metadata_keys
        assert "archive_serial_number" in node.excluded_embed_metadata_keys
        embed_text = node.get_content(metadata_mode=MetadataMode.EMBED)
        assert "filename" not in embed_text
        assert "storage_path" not in embed_text
        assert "archive_serial_number" not in embed_text
```
|
||||
|
||||
- [ ] **Step 4: Run the new tests to confirm they fail**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_structured_fields_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_storage_path_name_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_new_fields_excluded_from_embedding -v"
|
||||
```
|
||||
|
||||
Expected: all `FAILED` — keys not yet in `node.metadata`.
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Pass — add the three fields to `build_document_node`
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/indexing.py`
|
||||
|
||||
- [ ] **Step 1: Update the `metadata` dict in `build_document_node`**
|
||||
|
||||
Current metadata dict starts at line 106. Replace it:
|
||||
|
||||
```python
metadata = {
    "document_id": str(document.id),
    "title": document.title,
    "filename": document.filename or "",
    "storage_path": document.storage_path.name if document.storage_path else None,
    "archive_serial_number": document.archive_serial_number,
    "tags": [t.name for t in document.tags.all()],
    "correspondent": document.correspondent.name
    if document.correspondent
    else None,
    "document_type": document.document_type.name
    if document.document_type
    else None,
    "created": document.created.isoformat() if document.created else None,
    "added": document.added.isoformat() if document.added else None,
    "modified": document.modified.isoformat(),
}
```
|
||||
|
||||
- [ ] **Step 2: Update `excluded_embed_metadata_keys`**
|
||||
|
||||
The `LlamaDocument(...)` call currently has:
|
||||
|
||||
```python
|
||||
excluded_embed_metadata_keys=list(metadata.keys()),
|
||||
```
|
||||
|
||||
This already excludes all keys, so no change needed here — the new keys are automatically included since they're in the dict. Verify `excluded_llm_metadata_keys` still only excludes `"document_id"`:
|
||||
|
||||
```python
|
||||
excluded_llm_metadata_keys=["document_id"],
|
||||
```
|
||||
|
||||
No change needed.
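The key bookkeeping behind this step can be checked in isolation. The sketch below is plain Python and does not call llama_index; the sample values and the `embed_visible`/`llm_visible` names are made up for illustration. It only shows why deriving the embed exclusion list from the dict covers newly added keys, while the LLM exclusion list hides just `document_id`:

```python
# Sample values are illustrative only.
metadata = {
    "document_id": "42",
    "title": "Invoice",
    "filename": "invoice.pdf",
    "storage_path": None,
    "archive_serial_number": 12345,
}

# Derived from the dict, so keys added later are excluded automatically.
excluded_embed_metadata_keys = list(metadata.keys())
# Static list: only the internal identifier is hidden from the LLM.
excluded_llm_metadata_keys = ["document_id"]

embed_visible = [k for k in metadata if k not in excluded_embed_metadata_keys]
llm_visible = [k for k in metadata if k not in excluded_llm_metadata_keys]
```

Here `embed_visible` is empty and `llm_visible` contains every key except `document_id`, which is the split the tests in Task 3 assert on.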
|
||||
|
||||
- [ ] **Step 3: Run the failing tests to confirm they pass**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_structured_fields_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_storage_path_name_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_new_fields_excluded_from_embedding -v"
|
||||
```
|
||||
|
||||
Expected: all `PASSED`.
|
||||
|
||||
- [ ] **Step 4: Run the full indexing test module**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
|
||||
```
|
||||
|
||||
Expected: all green.
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
|
||||
git commit -m "feat(ai): add filename/storage_path/asn to node metadata"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Fail — migration v2 is registered
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/tests/test_vector_store.py`
|
||||
|
||||
These tests use the real (non-mocked) `MIGRATIONS` list, so they go red until the migration is registered in Task 6.
|
||||
|
||||
- [ ] **Step 1: Add a `TestMetadataEnrichmentMigration` class**
|
||||
|
||||
Add this class near the end of `test_vector_store.py`, before the final `TestApplyStructuralMigrations`:
|
||||
|
||||
```python
class TestMetadataEnrichmentMigration:
    def test_current_schema_version_is_2(self) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        assert CURRENT_SCHEMA_VERSION == 2

    def test_migration_v2_registered(self) -> None:
        from paperless_ai.vector_store import MIGRATIONS

        assert len(MIGRATIONS) == 1
        assert MIGRATIONS[0].version == 2
        assert MIGRATIONS[0].requires_reembed is True

    def test_store_at_v1_requires_reembed(self, uri: str) -> None:
        store = _store_at_version(uri, 1)
        assert store.requires_reembed_migration() is True

    def test_store_at_v2_no_pending_migrations(self, uri: str) -> None:
        store = _store_at_version(uri, 2)
        assert store.pending_migrations() == []
        assert store.requires_reembed_migration() is False
```
|
||||
|
||||
- [ ] **Step 2: Run the tests to confirm they fail**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMetadataEnrichmentMigration -v"
|
||||
```
|
||||
|
||||
Expected: all `FAILED` — `CURRENT_SCHEMA_VERSION` is still 1 and `MIGRATIONS` is still empty.
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Pass — register migration v2 in `vector_store.py`
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/vector_store.py`
|
||||
|
||||
- [ ] **Step 1: Add the migration and bump the version constant**
|
||||
|
||||
On the `feature-lancedb-schema-migrate` branch, `vector_store.py` has:
|
||||
|
||||
```python
|
||||
CURRENT_SCHEMA_VERSION: Final[int] = 1
|
||||
...
|
||||
MIGRATIONS: list[Migration] = []
|
||||
```
|
||||
|
||||
Change both:
|
||||
|
||||
```python
CURRENT_SCHEMA_VERSION: Final[int] = 2

MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="move filename/storage_path/asn from embedding text to metadata; rebuild required",
        requires_reembed=True,
        apply=lambda table: None,
    ),
]
```
|
||||
|
||||
- [ ] **Step 2: Run the migration tests to confirm they pass**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMetadataEnrichmentMigration -v"
|
||||
```
|
||||
|
||||
Expected: all `PASSED`.
|
||||
|
||||
- [ ] **Step 3: Run the full vector store test module**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
|
||||
```
|
||||
|
||||
Expected: all green. In particular, `TestSchemaVersioning::test_stored_schema_version_persists_after_reopen` and the `TestMigrationRegistry` tests should still pass — they use `CURRENT_SCHEMA_VERSION` as the baseline.
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Integration — `update_llm_index` rebuilds when schema version is stale
|
||||
|
||||
**Files:**
|
||||
|
||||
- Modify: `src/paperless_ai/tests/test_ai_indexing.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing integration test**
|
||||
|
||||
Add this test near `test_update_llm_index_rebuilds_on_model_name_change`:
|
||||
|
||||
```python
@pytest.mark.django_db
def test_update_llm_index_rebuilds_on_pending_reembed_migration(
    temp_llm_index_dir: Path,
    real_document: Document,
    mock_embed_model: FakeEmbedding,
) -> None:
    """A stale schema version (v1) must trigger a full rebuild on the next index run."""
    from paperless_ai.vector_store import PaperlessLanceVectorStore

    # Build an initial index and then rewind the schema version to 1 to simulate
    # an index created before migration v2 was registered.
    indexing.update_llm_index(rebuild=True)
    store = indexing.get_vector_store()
    store._write_schema_version(1)

    # An incremental run (rebuild=False) must detect the stale version and rebuild.
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=False)

    # After rebuild the schema version must be current.
    reopened = PaperlessLanceVectorStore(uri=str(temp_llm_index_dir))
    assert reopened.stored_schema_version() == 2
```
|
||||
|
||||
- [ ] **Step 2: Run the test to confirm it fails**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_rebuilds_on_pending_reembed_migration -v"
|
||||
```
|
||||
|
||||
Expected: `FAILED` when run before Task 6 (the schema version stays at 1 because migration v2 isn't registered yet). If Task 6 has already been committed, the test passes at this point.
|
||||
|
||||
_(If it passes already because `update_llm_index` detects a different condition, verify the assertion is actually exercising the migration path and not the model-name path.)_
|
||||
|
||||
- [ ] **Step 3: Run the test again now that migration v2 is registered (Task 6)**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_rebuilds_on_pending_reembed_migration -v"
|
||||
```
|
||||
|
||||
Expected: `PASSED`.
|
||||
|
||||
- [ ] **Step 4: Run the full indexing test module**
|
||||
|
||||
```
|
||||
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
|
||||
```
|
||||
|
||||
Expected: all green.
|
||||
|
||||
- [ ] **Step 5: Final commit**
|
||||
|
||||
```bash
|
||||
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py src/paperless_ai/tests/test_ai_indexing.py
|
||||
git commit -m "feat(ai): register schema migration v2; triggers rebuild for metadata enrichment"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review checklist
|
||||
|
||||
**Spec coverage:**
|
||||
|
||||
- ✅ `build_llm_index_text` — three lines removed (Tasks 1–2)
|
||||
- ✅ `build_document_node` — three fields added to metadata + excluded_embed_metadata_keys (Tasks 3–4)
|
||||
- ✅ Migration v2 registered with `requires_reembed=True` and no-op apply (Tasks 5–6)
|
||||
- ✅ `update_llm_index` triggers rebuild on stale schema (Task 7)
|
||||
- ✅ Tests: `test_embedding.py`, `test_ai_indexing.py`, `test_vector_store.py`
|
||||
|
||||
**Placeholder scan:** None found. Every step has exact code or exact commands.
|
||||
|
||||
**Type consistency:**
|
||||
|
||||
- `metadata` dict key names (`"filename"`, `"storage_path"`, `"archive_serial_number"`) used consistently across Tasks 3–4.
|
||||
- `CURRENT_SCHEMA_VERSION = 2` and `MIGRATIONS[0].version == 2` are consistent across Tasks 5–6.
|
||||
- The `_store_at_version` helper referenced in Task 5 is defined, alongside `_node`, in the existing `test_vector_store.py` on the `feature-lancedb-schema-migrate` branch.
|
||||
@@ -0,0 +1,115 @@
|
||||
# LanceDB Node Metadata Enrichment
|
||||
|
||||
**Status:** Design
|
||||
**Date:** 2026-06-09
|
||||
**Branch target:** `dev`
|
||||
**Prerequisite for:** AI taxonomy hints (`2026-05-20-ai-taxonomy-hints-design.md`)
|
||||
**Depends on:** `feature-lancedb-schema-migrate`
|
||||
|
||||
## Problem
|
||||
|
||||
`build_llm_index_text` currently includes three short structured values in the embedding text:
|
||||
|
||||
```python
lines = [
    f"Filename: {doc.filename}",
    f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
    f"Archive Serial Number: {doc.archive_serial_number or ''}",
    ...
]
```
|
||||
|
||||
These don't belong in the embedding. The embedding should capture semantic content — the meaning of the document — not structured identifiers. Including them means vectors are partly "polluted" with filing metadata, making similarity search less accurate. The existing TODO in `embedding.py:115` explicitly calls this out.
|
||||
|
||||
The right home for structured values is `node.metadata` (excluded from the embedding, but surfaced to the LLM when nodes are retrieved as context). `title`, `tags`, `correspondent`, and `document_type` already follow this pattern.
|
||||
|
||||
Notes and custom fields stay in the embedding text — Notes is long free text, custom fields are dynamic and their semantic content belongs in the vector.
|
||||
|
||||
## Changes
|
||||
|
||||
### `paperless_ai/embedding.py` — `build_llm_index_text`
|
||||
|
||||
Remove the three lines and the TODO comment:
|
||||
|
||||
```python
|
||||
# remove:
|
||||
f"Filename: {doc.filename}",
|
||||
f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
|
||||
f"Archive Serial Number: {doc.archive_serial_number or ''}",
|
||||
```
|
||||
|
||||
`Notes` and `Custom Fields` lines remain.
|
||||
|
||||
### `paperless_ai/indexing.py` — `build_document_node`
|
||||
|
||||
Add the three fields to the metadata dict:
|
||||
|
||||
```python
metadata = {
    "document_id": str(document.id),
    "title": document.title,
    "filename": document.filename or "",
    "storage_path": document.storage_path.name if document.storage_path else None,
    "archive_serial_number": document.archive_serial_number,
    "tags": [t.name for t in document.tags.all()],
    "correspondent": document.correspondent.name if document.correspondent else None,
    "document_type": document.document_type.name if document.document_type else None,
    "created": document.created.isoformat() if document.created else None,
    "added": document.added.isoformat() if document.added else None,
    "modified": document.modified.isoformat(),
}
```
|
||||
|
||||
All three new keys must also appear in `excluded_embed_metadata_keys` (consistent with all existing keys — none of the metadata is included in the embedding text).
|
||||
|
||||
### `paperless_ai/vector_store.py` — schema migration
|
||||
|
||||
Register migration version 2 on the `feature-lancedb-schema-migrate` framework. The embedding text changes, so all existing vectors are stale — a full rebuild is required. The migration's `apply` is a no-op; the rebuild handles regenerating all nodes with the correct metadata.
|
||||
|
||||
```python
MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="move filename/storage_path/asn from embedding text to metadata",
        requires_reembed=True,
        apply=lambda table: None,
    ),
]
CURRENT_SCHEMA_VERSION: Final[int] = 2
```
|
||||
|
||||
On the next `update_llm_index` run, `requires_reembed_migration()` returns `True`, triggering a full drop-and-rebuild. All new nodes carry the three metadata fields. No manual intervention is required.
|
||||
|
||||
## Impact
|
||||
|
||||
- Similarity search quality improves slightly — vectors are more purely semantic.
|
||||
- The LLM receives `filename`, `storage_path`, and `archive_serial_number` as structured metadata alongside retrieved chunks, rather than embedded in the chunk text. Same information, cleaner separation.
|
||||
- One forced index rebuild on upgrade (beta: acceptable).
|
||||
- `node.metadata["storage_path"]`, `node.metadata["filename"]`, `node.metadata["archive_serial_number"]` are available on all retrieved nodes after rebuild — unblocks the taxonomy hints feature.
|
||||
|
||||
## Testing
|
||||
|
||||
All tests use pytest style — grouped under classes, `@pytest.mark.django_db` on the class, `pytest-mock`'s `mocker` fixture, every fixture and test signature type-annotated. Format with `ruff` directly.
|
||||
|
||||
### `paperless_ai/tests/test_embedding.py` (modify)
|
||||
|
||||
- `class TestBuildLlmIndexText:`
|
||||
- Assert `"Filename:"` is **not** in the output.
|
||||
- Assert `"Storage Path:"` is **not** in the output.
|
||||
- Assert `"Archive Serial Number:"` is **not** in the output.
|
||||
- Assert Notes and Custom Fields lines are still present (regression guard).
|
||||
|
||||
### `paperless_ai/tests/test_ai_indexing.py` (modify)
|
||||
|
||||
- `class TestBuildDocumentNode:`
|
||||
- `filename` is in `node.metadata` and in `excluded_embed_metadata_keys`.
|
||||
- `storage_path` is in `node.metadata` (name string) and in `excluded_embed_metadata_keys`; `None` when document has no storage path.
|
||||
- `archive_serial_number` is in `node.metadata` and in `excluded_embed_metadata_keys`; `None` when unset.
|
||||
- None of the three appear in the embedding text produced for the node.
|
||||
|
||||
### `paperless_ai/tests/test_vector_store.py` (modify)
|
||||
|
||||
- `class TestSchemaMigrations:`
|
||||
- `pending_migrations()` returns the v2 migration when stored version is 1.
|
||||
- `requires_reembed_migration()` returns `True` when stored version is 1.
|
||||
- `apply_structural_migrations()` stops at the v2 migration: re-embed entries are neither applied nor stepped past, so the stored version stays at 1.
|
||||
@@ -0,0 +1,138 @@
|
||||
# LLM Index Schema Migrations (second spec)
|
||||
|
||||
Date: 2026-06-10
|
||||
Depends on: `docs/superpowers/specs/2026-06-10-sqlite-vec-vector-store-design.md` and its implementation plan (`docs/superpowers/plans/2026-06-10-sqlite-vec-transition.md`). This spec layers on top of the completed sqlite-vec transition; do not start it before that branch lands.
|
||||
Supersedes: PR #12968 (in-place LanceDB migrations). The machinery design there is carried over nearly verbatim; only the storage backend specifics change. #12968 should be closed with a pointer here once this ships.
|
||||
|
||||
Scope update (user decision, 2026-06-10): the `embedding.py:115` metadata restructure originally drafted as Part 2 of this spec was folded into the transition plan instead (its Task 5), because the transition forces a full rebuild anyway, so the embedded-text change rides along with no extra re-embed cost. This spec is now machinery-only: it ships with an EMPTY migration registry, ready for whatever schema change comes next. Part 2 below is retained as the worked example of how a re-embed migration would be registered, since the next one will not have a free rebuild to piggyback on.
|
||||
|
||||
## Part 1: Schema migration machinery (ported from PR #12968)
|
||||
|
||||
### What carries over unchanged
|
||||
|
||||
The PR's design survives the store swap intact and is adopted as-is:
|
||||
|
||||
- `Migration` frozen dataclass: `version: int`, `description: str`, `requires_reembed: bool`, `apply: Callable` (compare/hash-excluded field).
|
||||
- `MIGRATIONS: list[Migration]` ordered registry + `CURRENT_SCHEMA_VERSION: Final[int]` in `vector_store.py`. To add a migration: bump the constant, append an entry.
|
||||
- Store surface: `stored_schema_version() -> int` (0 when unrecorded, so pre-versioning tables treat every migration as pending), `pending_migrations()`, `requires_reembed_migration()`, `apply_structural_migrations() -> list[Migration]`.
|
||||
- The stop-at-first-reembed-boundary rule in `apply_structural_migrations()`: structural migrations are applied in version order only up to the first pending `requires_reembed=True` entry, so the version counter can never jump past a re-embed boundary and silently skip the rebuild. (This was the subtle correctness insight of #12968; preserve the comment.)
|
||||
- The `update_llm_index()` hook, verbatim from the PR:
|
||||
|
||||
```python
with write_store(embed_model_name=model_name) as store:
    if not rebuild and store.table_exists():
        store.apply_structural_migrations()
        if store.requires_reembed_migration():
            logger.warning(
                "Schema migration requires re-embedding; forcing LLM index rebuild.",
            )
            rebuild = True
```
|
||||
|
||||
- Test approach from the PR: mock `MIGRATIONS`/`CURRENT_SCHEMA_VERSION` with `mocker.patch`, spy on `drop_table` to distinguish in-place from rebuild, one test per path (structural applied without rebuild; pending re-embed forces rebuild).
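The registry and the stop-at-boundary rule listed above can be sketched without any storage backend. This is an illustration under stated assumptions, not the PR's code: `InMemoryStore` is a hypothetical stand-in that tracks only a version number, and the method bodies are a guess at the simplest logic that satisfies the described contract.

```python
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    # Excluded from comparison and hashing, as described above.
    apply: Callable = field(compare=False, hash=False)


class InMemoryStore:
    """Hypothetical stand-in for the real store; tracks only the version."""

    def __init__(self, migrations: list[Migration], version: int = 0) -> None:
        self._migrations = migrations
        self._version = version
        self.applied: list[int] = []

    def stored_schema_version(self) -> int:
        return self._version

    def pending_migrations(self) -> list[Migration]:
        return sorted(
            (m for m in self._migrations if m.version > self._version),
            key=lambda m: m.version,
        )

    def requires_reembed_migration(self) -> bool:
        return any(m.requires_reembed for m in self.pending_migrations())

    def apply_structural_migrations(self) -> list[Migration]:
        done: list[Migration] = []
        for migration in self.pending_migrations():
            if migration.requires_reembed:
                # Stop here: stepping past this entry would let the version
                # counter skip the rebuild it demands.
                break
            migration.apply(self)
            self._version = migration.version
            done.append(migration)
        return done


registry = [
    Migration(2, "structural", False, lambda s: s.applied.append(2)),
    Migration(3, "re-embed", True, lambda s: None),
    Migration(4, "structural", False, lambda s: s.applied.append(4)),
]
store = InMemoryStore(registry, version=1)
applied = store.apply_structural_migrations()
```

With this mixed registry only v2 is applied, the stored version stays at 2, and `requires_reembed_migration()` still reports `True`, so the hook would go on to force the rebuild.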
|
||||
|
||||
### What changes for sqlite-vec
|
||||
|
||||
**1. Version storage: `index_meta['schema_version']` instead of `schema_version.json`.**
|
||||
The Lance store needed a sidecar JSON file because Lance had no convenient mutable metadata. The sqlite-vec store already has the `index_meta` key/value table, which is transactional with the data itself (a migration and its version bump commit atomically, which the file never could). Concretely:
|
||||
|
||||
- `_create_table(dim)` additionally writes `schema_version = str(CURRENT_SCHEMA_VERSION)` (fresh tables are always current).
|
||||
- `stored_schema_version()` reads the meta key, returns 0 on absence/garbage.
|
||||
- `drop_table()` already does `DELETE FROM index_meta`, which clears the version with it. No sidecar file, no unlink bookkeeping.
|
||||
- `apply_structural_migrations()` writes the new version inside the same transaction as the last applied migration.
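The atomicity argument can be demonstrated with the standard-library `sqlite3` module alone. The sketch below makes several assumptions that are not in the spec: `index_meta` is modelled as a two-column `key`/`value` table, `chunks` is a made-up ordinary table standing in for the data table, and `migrate_and_stamp` is a hypothetical helper name.

```python
import sqlite3


def stored_schema_version(conn: sqlite3.Connection) -> int:
    """Return the recorded version, or 0 when the key is absent or garbage."""
    row = conn.execute(
        "SELECT value FROM index_meta WHERE key = 'schema_version'",
    ).fetchone()
    try:
        return int(row[0]) if row is not None else 0
    except (TypeError, ValueError):
        return 0


def migrate_and_stamp(
    conn: sqlite3.Connection,
    statements: list[str],
    version: int,
) -> None:
    """Apply structural SQL and the version bump in one explicit transaction."""
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
        conn.execute(
            "INSERT OR REPLACE INTO index_meta (key, value) "
            "VALUES ('schema_version', ?)",
            (str(version),),
        )
        conn.execute("COMMIT")
    except Exception:
        # The schema change and the version bump are undone together.
        conn.execute("ROLLBACK")
        raise


# isolation_level=None disables the module's implicit transaction handling,
# so the BEGIN/COMMIT above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE index_meta (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY)")  # hypothetical table

version_when_unrecorded = stored_schema_version(conn)
migrate_and_stamp(conn, ["ALTER TABLE chunks ADD COLUMN language TEXT"], 2)
version_after_migration = stored_schema_version(conn)
```

The explicit `BEGIN` matters: in its default mode Python's `sqlite3` does not open a transaction before DDL statements, so relying on `with conn:` alone would not make the `ALTER TABLE` and the version write roll back as a unit.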
|
||||
|
||||
**2. `apply` receives the store, not a table handle.**
|
||||
Lance migrations got the raw table for `add_columns`/`alter_columns`. vec0 virtual tables do not support arbitrary `ALTER TABLE`, so structural migrations are SQL against the store's connection. Signature: `apply: Callable[[PaperlessSqliteVecVectorStore], None]`. The store exposes what migrations need: `.client` (connection), `._table_name`, `.vector_dim()`, and the rebuild helper below.
|
||||
|
||||
**3. Structural migrations are create+copy+rename, sharing the compact() machinery.**
|
||||
The sqlite-vec `compact()` already implements the only structural mutation vec0 supports: build a new table, `INSERT INTO ... SELECT` (vectors copied bit-for-bit, no re-embedding), drop old, rename. Factor it into a shared helper on the store:
|
||||
|
||||
```python
def rebuild_table(
    self,
    *,
    create_sql: str | None = None,
    copy_select: str | None = None,
) -> None:
    """Copy live rows into a freshly created table and swap it in.

    Defaults reproduce the current schema (compaction). Structural
    migrations pass a modified CREATE statement and a matching SELECT
    (e.g. adding a column with a literal default). Runs in one
    transaction; VACUUM afterwards.
    """
```
|
||||
|
||||
`compact()` becomes a thin caller (threshold check + `rebuild_table()`), and a structural migration like "add a `+page_count` aux column" is:
|
||||
|
||||
```python
Migration(
    version=2,
    description="add page_count auxiliary column",
    requires_reembed=False,
    apply=lambda store: store.rebuild_table(
        create_sql=...,  # CREATE VIRTUAL TABLE ... with the new column
        copy_select="SELECT id, document_id, modified, node_content, embedding, '' FROM {old}",
    ),
)
```
|
||||
|
||||
A pleasant consequence: every structural migration is also a compaction (the copy drops dead rows), and the file-format risk surface is one helper with one test suite instead of two code paths.
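The create+copy+rename pattern itself can be tried out with the standard-library `sqlite3` module. This sketch is not the store's implementation: it uses an ordinary table in place of a vec0 virtual table (the sqlite-vec extension is not loaded), a free function instead of a store method, and a hypothetical `{new}` placeholder next to the `{old}` placeholder used above. `VACUUM` is omitted because it cannot run inside a transaction.

```python
import sqlite3


def rebuild_table(
    conn: sqlite3.Connection,
    table: str,
    create_sql: str,
    copy_select: str,
) -> None:
    """Create `<table>_new`, copy rows into it, then swap it in for `table`."""
    new = f"{table}_new"
    conn.execute("BEGIN")
    try:
        conn.execute(create_sql.format(new=new))
        conn.execute(f"INSERT INTO {new} {copy_select.format(old=table)}")
        conn.execute(f"DROP TABLE {table}")
        conn.execute(f"ALTER TABLE {new} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # the old table survives untouched
        raise


# Manual transaction control so BEGIN/COMMIT above are the only boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, embedding BLOB)")
conn.execute("INSERT INTO chunks VALUES ('a', ?)", (b"\x00\x01\x02\x03",))

# Structural change: add a page_count column with a literal default.
rebuild_table(
    conn,
    "chunks",
    create_sql=(
        "CREATE TABLE {new} (id TEXT PRIMARY KEY, embedding BLOB, page_count TEXT)"
    ),
    copy_select="SELECT id, embedding, '' FROM {old}",
)
rows = conn.execute("SELECT id, embedding, page_count FROM chunks").fetchall()
```

The embedding bytes come through the copy unchanged, which is the property the `rebuild_table()` round-trip test in the test plan is meant to pin down for the real store.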
|
||||
|
||||
**4. Bootstrap version for the sqlite-vec store is 1.**
|
||||
The transition plan ships the new store without machinery; tables it creates carry no `schema_version` key and therefore read as 0. This release lands with `CURRENT_SCHEMA_VERSION = 1` and `MIGRATIONS = []`, so the bootstrap is unconditionally safe: a 0-version table has no pending migrations and `apply_structural_migrations()` simply stamps it to 1. (The metadata restructure having moved into the transition itself is what makes this clean; the registry's first real entry will be v2, written against tables that are all stamped.)
|
||||
|
||||
## Part 2 (worked example, IMPLEMENTED IN THE TRANSITION): the metadata TODO as a re-embed migration
|
||||
|
||||
This section was implemented as Task 5 of the transition plan and ships with the store swap, not with this spec. It is kept as the reference example of how to register the next re-embed migration.
|
||||
|
||||
### The change
|
||||
|
||||
`build_llm_index_text()` currently embeds three short structured values in the body text:
|
||||
|
||||
```python
|
||||
f"Filename: {doc.filename}",
|
||||
f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
|
||||
f"Archive Serial Number: {doc.archive_serial_number or ''}",
|
||||
```
|
||||
|
||||
Per the TODO, move them to `node.metadata` (excluded from embeddings, visible to the LLM via llama-index's metadata prepend), the same treatment title/tags/correspondent/document_type got in PR #12944. Notes and Custom Fields stay in the body (long free text / dynamic count, as the TODO says).
|
||||
|
||||
1. `embedding.py build_llm_index_text()`: delete the three lines above (the `lines` list keeps Notes, Custom Fields, and Content). Update the TODO comment to describe only what remains intentional (Notes/Custom Fields stay embedded), or delete it.
|
||||
2. `indexing.py build_document_node()` metadata dict gains:
|
||||
|
||||
```python
|
||||
"filename": document.filename,
|
||||
"storage_path": document.storage_path.name if document.storage_path else None,
|
||||
"archive_serial_number": document.archive_serial_number,
|
||||
```
|
||||
|
||||
(`None`/int values are fine here: this dict lives in the node-content JSON, not in vec0 metadata columns; only `document_id`/`modified` are columns with the NULL restriction. Matches the existing convention of `correspondent: None`.)
|
||||
|
||||
3. `excluded_embed_metadata_keys=list(metadata.keys())` already covers the new keys; `excluded_llm_metadata_keys` stays `["document_id"]` so the LLM sees the new fields.
|
||||
|
||||
### Why this class of change needs a migration
|
||||
|
||||
Removing the three lines changes the embedded text of every document, so stored vectors no longer match what the current code would embed. Incremental updates only re-embed documents whose `modified` changed, so without a forced rebuild the index would be a mixed old/new-text population indefinitely. This particular change escaped that fate only because the transition's forced rebuild covers it. The next embedded-text change will not have that luxury and gets registered like this:
|
||||
|
||||
```python
|
||||
CURRENT_SCHEMA_VERSION: Final[int] = 2
|
||||
|
||||
MIGRATIONS: list[Migration] = [
|
||||
Migration(
|
||||
version=2,
|
||||
description="<what changed about the embedded text>",
|
||||
requires_reembed=True,
|
||||
apply=lambda store: None,
|
||||
),
|
||||
]
|
||||
```
|
||||
|
||||
On the first `update_llm_index` after upgrade, the hook sees the pending re-embed migration, logs, and rebuilds.
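
A minimal sketch of what that hook could look like, including the stop-at-boundary rule from the test plan. This is illustrative, not the shipped code: `run_pending` is a hypothetical name, the store is stood in for by a plain dict with a `schema_version` key, and the `Migration` fields are taken from the registry example above.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    apply: Callable[[dict], None]


def run_pending(store: dict, migrations: list[Migration], current: int) -> bool:
    """Apply structural migrations in order; return True if a rebuild is needed.

    `store` is a hypothetical stand-in: a dict holding "schema_version".
    A missing key reads as version 0 (pre-machinery table).
    """
    stored = store.get("schema_version", 0)
    for migration in sorted(migrations, key=lambda m: m.version):
        if migration.version <= stored:
            continue  # already applied
        if migration.requires_reembed:
            # Stop at the boundary: later structural steps are skipped because
            # the rebuild recreates the table at the current version anyway.
            return True
        migration.apply(store)
        stored = migration.version
        store["schema_version"] = stored
    # Nothing left pending: stamp to current (this is also the 0 -> 1 bootstrap).
    store["schema_version"] = current
    return False
```

With a mixed `[structural v2, reembed v3, structural v4]` registry and a table at version 1, this applies v2, returns `True` at v3, and leaves the stored version at 2; the caller is then expected to take the rebuild path.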
|
||||
|
||||
### Test plan
|
||||
|
||||
Machinery only (the metadata change is tested in the transition plan's Task 5). Port of the #12968 tests, dedicated file `test_vector_store_migrations.py`: structural migration applies in-place without `drop_table`; pending re-embed forces rebuild; version stamping on create/drop; bootstrap stamping of a pre-machinery 0-version table to 1; stop-at-boundary with a mixed [structural v2, reembed v3, structural v4] registry asserting v4 is NOT applied and the stored version stays at 2; `rebuild_table()` round-trips rows byte-for-byte (shared with compact tests).
|
||||
|
||||
### Open questions
|
||||
|
||||
- PR #12968 disposition: close with a comment pointing at this spec once the machinery lands (the Lance-specific `add_columns` path has no successor; vec0 cannot do in-place column adds).
|
||||
- `created`/`added` fields are also candidates for future structural metadata work, but nothing needs them now (YAGNI; noted only so the next reader does not re-derive it).
|
||||
@@ -0,0 +1,155 @@
|
||||
# sqlite-vec Vector Store Design (replaces PaperlessLanceVectorStore)
|
||||
|
||||
Date: 2026-06-10
|
||||
|
||||
Context: LanceDB wheels SIGILL on non-AVX2 CPUs (#12970); research in `2026-06-10-vector-store-alternatives-research.md` selected sqlite-vec. This is a beta feature, so a one-time re-embed on upgrade is acceptable. Every claim marked [VERIFIED] below was empirically tested against the actual PyPI wheel (0.1.9, and 0.1.10a4 where noted), either in this repo's scratch harness (`/tmp/vstore-avx-test/explore_sqlitevec*.py`) or by the issues-audit agent.
|
||||
|
||||
## Version pin: `sqlite-vec==0.1.9`, and why it is load-bearing
|
||||
|
||||
- The 0.1.9 linux x86_64 wheel is built with **no SIMD flags at all** (`vec_debug()` shows empty build flags) and passed our qemu Westmere (SSE4.2, no AVX) and SandyBridge (AVX, no AVX2) emulation tests [VERIFIED]. This is the entire point of the migration.
|
||||
- The **0.1.10-alpha.4 wheel regresses this**: built with `-mavx -DSQLITE_VEC_ENABLE_AVX` file-wide, no runtime CPU dispatch. It can SIGILL on AVX-less CPUs, including Goldmont Atom/Celeron NAS boxes, exactly the #12970 user base [VERIFIED via vec_debug on the wheel].
|
||||
- Guardrails: pin `==0.1.9` exactly; log `SELECT vec_version(), vec_debug()` at store init as an AVX canary; before ever bumping to 0.1.10+, re-check the wheel flags (and consider raising the runtime-dispatch issue upstream first).
|
||||
- arm64: 0.1.9 manylinux aarch64 wheel is a proper ELF64 binary, no NEON flags baked [VERIFIED]. (The broken 32-bit "aarch64" wheel era was 0.1.6, fixed since.)
|
||||
- No sdist on PyPI (asg017/sqlite-vec#211, open) and no musl wheels; fine for our Debian-based image, blocks Alpine bare-metal installs.
|
||||
|
||||
## Schema
|
||||
|
||||
One dedicated SQLite database file in `LLM_INDEX_DIR` (e.g. `llmindex.db`), never the Django DB. Connections set `PRAGMA journal_mode=WAL`, `busy_timeout`, `synchronous=NORMAL`.
|
||||
|
||||
```sql
|
||||
CREATE VIRTUAL TABLE nodes USING vec0(
|
||||
id TEXT PRIMARY KEY, -- node_id (uuid)
|
||||
document_id TEXT, -- METADATA column, deliberately NOT a partition key
|
||||
modified TEXT, -- ISO timestamp; never NULL (sentinel "")
|
||||
+node_content TEXT, -- auxiliary column: JSON payload, any size
|
||||
embedding float[{dim}] distance_metric=cosine
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT);
|
||||
-- rows: embed_model, dim, schema_version, created_by_vec_version
|
||||
```
|
||||
|
||||
Design decisions, each verified on 0.1.9:
|
||||
|
||||
- **`document_id` is a metadata column, not a partition key.** With a partition key, `k` applies per partition: `k=5 AND document_id IN (3 docs)` returns 15 rows (asg017/sqlite-vec#142, open) [VERIFIED]. As a metadata column the same query returns a correct global top-k of exactly 5 [VERIFIED]. `query_similar_documents()` passes permission-scoped `IN` lists, so per-partition semantics would over-fetch k x N(docs). At our scale the partition-pruning speedup is not needed (filtered KNN at 20K x 1024 was _faster_ than unfiltered: 39 ms vs 74 ms).
|
||||
- **One document column, not two.** The Lance store carried both `doc_id` (ref_doc_id) and `document_id`; in our usage they are always the same value (`str(document.id)`), so the new schema keeps only `document_id`.
|
||||
- **TEXT primary key works** (insert, UPDATE, DELETE, duplicate rejection) [VERIFIED]. There is no usable rowid mapping with a TEXT pk, which we do not need.
|
||||
- **Aux column for the payload.** `+node_content` holds the multi-KB JSON; aux columns cannot appear in KNN WHERE clauses (loud error, not silent) [VERIFIED], which we never do, and are selectable in scans and KNN results [VERIFIED].
|
||||
- **Metadata columns reject NULL** (asg017/sqlite-vec#141, open) [VERIFIED]. `_row()` must keep coercing everything through `str(... or "")` as it already does today.
|
||||
- **`distance_metric=cosine`**: similarity maps as `1 - distance` (identical vector gives distance 0.0 [VERIFIED]). For unit-norm embeddings the ranking equals today's L2 ranking; for non-normalized models cosine is the safer default, and the beta re-embed makes the behavior change free. (L2 + `1/(1+d)` remains available if exact parity is ever wanted.)
|
||||
- **Vectors are always bound as float32 BLOBs** (`struct.pack`/`np.tobytes`), never JSON text: bypasses the locale-dependent `strtod` parsing bug (asg017/sqlite-vec#241, open) entirely.
|
||||
- Limits, all comfortable: dims <= 8192, k <= 4096, chunk_size default 1024 [VERIFIED]. TEXT metadata has no length cap; values > 12 bytes go to a shadow text table with a prefix fast-path, and the one historical bug at that boundary (long-metadata DELETE, #274) is fixed in 0.1.9.
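
The BLOB-binding and similarity-mapping decisions above can be sketched with the standard library alone. The helper names are hypothetical, and the native-order float32 layout is assumed to match what vec0 expects for a `float[{dim}]` column.

```python
import struct


def pack_vector(values: list[float]) -> bytes:
    """Serialize a vector as packed float32 bytes, suitable for BLOB binding."""
    return struct.pack(f"{len(values)}f", *values)


def unpack_vector(blob: bytes) -> list[float]:
    """Inverse of pack_vector: 4 bytes per float32 component."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def similarity_from_cosine_distance(distance: float) -> float:
    """Map a cosine distance to the similarity score reported to callers."""
    return 1.0 - distance
```

Binding the result of `pack_vector()` as a `?` parameter keeps the vector out of any text parsing path, which is the point of avoiding JSON here.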
|
||||
|
||||
## Method mapping (PaperlessLanceVectorStore -> PaperlessSqliteVecVectorStore)
|
||||
|
||||
| Current method | sqlite-vec implementation | Notes |
|
||||
| --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `__init__(uri, table_name, embed_model_name)` | `sqlite3.connect(path)` + `enable_load_extension` + `sqlite_vec.load()` + PRAGMAs | Same lazy "table may not exist yet" stance |
|
||||
| `client` property | the `sqlite3.Connection` | |
|
||||
| `table_exists()` | `SELECT 1 FROM sqlite_master WHERE name='nodes'` | |
|
||||
| `vector_dim()` | `index_meta['dim']` | Written at table creation; wrong-dim inserts are rejected by vec0 anyway [VERIFIED] |
|
||||
| `drop_table()` | `DROP TABLE nodes` | Drops all 7 shadow tables with it [VERIFIED]; also clear `index_meta` |
|
||||
| `stored_model_name()` / `config_mismatch()` | `index_meta['embed_model']` | Same conservative None handling |
|
||||
| `_schema(dim, model)` | the CREATE statements above | dim from first batch, as today (`_ensure_table`) |
|
||||
| `_row(node)` | same dict, vector packed to bytes | keep `str(... or "")` coercion (NULL rejection) |
|
||||
| `add(nodes)` | `executemany(INSERT ...)` inside one transaction | ~3,300 rows/s at 1024 dims measured; batching via transactions |
|
||||
| `upsert_document(document_id, nodes)` | `BEGIN; DELETE FROM nodes WHERE document_id = ?; executemany(INSERT); COMMIT` | **Not** `INSERT OR REPLACE`: broken on vec0 (asg017/sqlite-vec#259, open). Transaction gives the same no-transient-empty-state guarantee as merge_insert; rollback verified [VERIFIED] |
|
||||
| `delete(ref_doc_id)` | `DELETE FROM nodes WHERE document_id = ?` | |
|
||||
| `get_nodes(filters)` | `SELECT id, document_id, node_content, embedding FROM nodes [WHERE ...]` | full scans on vec0 work [VERIFIED]; 45 ms / 20K rows |
|
||||
| `query(VectorStoreQuery)` | `SELECT id, node_content, embedding, distance FROM nodes WHERE embedding MATCH ? AND k = ? [AND filters]` then Python-slice to `top_k` | `k = ?` is mandatory; `LIMIT` cannot be combined with `k` [VERIFIED]; results arrive distance-sorted [VERIFIED]; similarities = `1 - distance` |
|
||||
| `_build_where(filters)` | same EQ/IN translation, but emitting `?` placeholders + params list | **Upgrade**: bound parameters replace today's manual `_escape()` string interpolation |
|
||||
| `get_modified_times()` | `SELECT document_id, modified FROM nodes` + first-seen dedupe in Python | identical logic |
|
||||
| `ensure_document_id_scalar_index()` | no-op (delete if nothing else needs it) | metadata filters are evaluated in the chunk scan; nothing to create |
|
||||
| `maybe_create_ann_index()` | no-op on 0.1.9 | ANN (rescore/diskann) is 0.1.10-alpha territory; adopting an ANN index makes the file unreadable by 0.1.9 (one-way door), while flat tables round-trip 0.1.9 <-> 0.1.10a4 cleanly [VERIFIED]. Revisit post-0.1.10-final |
|
||||
| `compact(retention_seconds)` | **rebuild-based compaction**, see below | replaces Lance MVCC cleanup |
|
||||
|
||||
Filter constraint surface (loud errors otherwise, [VERIFIED]): only `=, !=, <, <=, >, >=, IN` on metadata columns in KNN queries. We use only EQ/IN. Never use `NOT IN` (the vtab cannot see it; SQLite post-filters and silently under-delivers below k, asg017/sqlite-vec#116).
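
A sketch of the bound-parameter `_build_where` translation, restricted to the EQ/IN surface described above. The `(column, op, value)` tuple shape is a simplification for illustration (the real method takes llama-index filter objects), and rejecting an empty `IN` list is an assumption about the desired behavior.

```python
def build_where(filters: list[tuple[str, str, object]]) -> tuple[str, list[str]]:
    """Translate EQ/IN filters into a WHERE fragment plus bound parameters."""
    allowed_columns = {"document_id", "modified"}  # metadata columns only
    clauses: list[str] = []
    params: list[str] = []
    for column, op, value in filters:
        if column not in allowed_columns:
            raise ValueError(f"cannot filter on column {column!r}")
        if op == "eq":
            clauses.append(f"{column} = ?")
            params.append(str(value))
        elif op == "in":
            values = [str(v) for v in value]
            if not values:
                # Assumption: callers short-circuit empty lists before querying.
                raise ValueError("empty IN list")
            placeholders = ", ".join("?" * len(values))
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(values)
        else:
            # Anything else (notably NOT IN) is refused rather than emitted.
            raise ValueError(f"unsupported operator {op!r}")
    return " AND ".join(clauses), params
```

Column names are checked against an allow-list because they are interpolated into the SQL; only values travel as parameters.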
|
||||
|
||||
## Compaction: the one real behavioral difference
|
||||
|
||||
vec0 DELETE only flips a validity bit; space is never reclaimed, and VACUUM recovers only about half (asg017/sqlite-vec#54, #220, open; fix PRs #243/#210 unmerged). Measured: 5 delete+reinsert cycles on 2K rows grew the file 3.32 MB -> 6.56 MB; VACUUM got back to 4.94 MB. Paperless's per-document churn (every document edit is a delete+reinsert) hits this directly.
|
||||
|
||||
So `compact()` becomes the maintainer-endorsed rebuild (asg017/sqlite-vec#205):
|
||||
|
||||
```sql
|
||||
CREATE VIRTUAL TABLE nodes_new USING vec0(...);
|
||||
INSERT INTO nodes_new SELECT id, document_id, modified, node_content, embedding FROM nodes;
|
||||
DROP TABLE nodes;
|
||||
ALTER TABLE nodes_new RENAME TO nodes; -- then VACUUM
|
||||
```
|
||||
|
||||
This copies vectors without re-embedding, runs under the existing write FileLock, and slots into the existing `document_llmindex compact` command and the scheduled maintenance task. A cheap trigger heuristic: rebuild when the cumulative row count of the `nodes_rowids` shadow table exceeds ~2x the live row count, or just keep the existing scheduled cadence.
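
The same sequence driven from Python, as a rough sketch. `rebuild_table` is a hypothetical helper, the connection is assumed to be opened with `isolation_level=None` so the explicit `BEGIN`/`COMMIT` are honored, and the test below exercises it against a plain table; whether vec0 tolerates this exact transaction wrapping is not verified here.

```python
import sqlite3


def rebuild_table(
    conn: sqlite3.Connection,
    create_new_sql: str,
    columns: list[str],
) -> None:
    """Copy live rows into a fresh `nodes_new`, swap it in, then VACUUM.

    `create_new_sql` must create a table named `nodes_new`.
    """
    cols = ", ".join(columns)
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(create_new_sql)
        conn.execute(f"INSERT INTO nodes_new ({cols}) SELECT {cols} FROM nodes")
        conn.execute("DROP TABLE nodes")
        conn.execute("ALTER TABLE nodes_new RENAME TO nodes")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    # VACUUM cannot run inside a transaction, so it follows the commit.
    conn.execute("VACUUM")
```

Because the helper only needs the new table's CREATE statement and a column list, the same shape could serve structural migrations that add or drop a column.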
|
||||
|
||||
## Concurrency
|
||||
|
||||
vec0 is a plain vtab over ordinary shadow tables, so standard SQLite WAL semantics apply, and the existing architecture is already the textbook arrangement: writers serialized by `settings.LLM_INDEX_LOCK` FileLock, readers concurrent via WAL. Verified across processes: a reader during another process's open write transaction does not block and sees a consistent pre-transaction snapshot; post-commit it sees the new rows [VERIFIED]. No sqlite-vec-specific multi-process corruption, locking, or segfault reports exist in the tracker. The 0.1.10a4 cached-statement fix (#295) is a Firefox/mozStorage `sqlite3_close()` issue; CPython's `sqlite3` is unaffected, no Python-side reports.
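
The reader/writer behavior described here is ordinary SQLite WAL semantics and can be sketched without the extension. In this sketch a plain `nodes` table stands in for the vec0 table, `open_index` and `upsert_document` are hypothetical helpers, and the `busy_timeout` value is an arbitrary assumption.

```python
import sqlite3
from pathlib import Path


def open_index(path: Path) -> sqlite3.Connection:
    """Open the index DB with the PRAGMAs from the Schema section."""
    conn = sqlite3.connect(path, isolation_level=None)  # explicit transactions
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")  # milliseconds; value is an assumption
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


def upsert_document(
    conn: sqlite3.Connection,
    document_id: str,
    rows: list[tuple[str, str, str]],
) -> None:
    """Replace all nodes of one document atomically (DELETE + INSERT)."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("DELETE FROM nodes WHERE document_id = ?", (document_id,))
        conn.executemany(
            "INSERT INTO nodes (id, document_id, node_content) VALUES (?, ?, ?)",
            rows,
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

A second connection that reads while the write transaction is open keeps seeing the pre-transaction rows, so there is no transient empty state for the document being replaced.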
|
||||
|
||||
Same caveat as the main SQLite DB: `LLM_INDEX_DIR` should not be on NFS.
|
||||
|
||||
## Performance expectations (measured on the 0.1.9 no-SIMD wheel)
|
||||
|
||||
- KNN 20K rows x 1024 dims: ~74 ms plain, ~39 ms with a metadata EQ filter.
|
||||
- 100K x 768: 185 ms/query (vs 497 ms for LanceDB exact search on identical data).
|
||||
- Extrapolated 500K x 1024-1536: ~0.9-1.8 s/query; 384 dims roughly 4x faster. Acceptable for suggestions/chat at the extreme tail; typical installs (low tens of thousands of chunks) are tens of ms.
|
||||
- Insert: ~3,300 rows/s at 1024 dims in a single transaction.
|
||||
- File size: ~raw vector size (~4.3 KB/row at 1024 dims), no compression; plus the bloat behavior above.
|
||||
|
||||
## Migration from the Lance store
|
||||
|
||||
Beta policy: re-embed. On startup/first index task: if `LLM_INDEX_DIR` contains a Lance table but no `llmindex.db`, log and queue a full rebuild, then remove the Lance directory. No cross-store vector copy, no lancedb import anywhere in the path (which is what un-breaks #12970 hosts: they currently crash at import, have no usable index, and get a fresh build).
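
A sketch of the detection step, using only `pathlib`/`shutil` so that nothing on this path imports lancedb. The function names are hypothetical, and identifying Lance tables by `*.lance` directories is an assumption about the on-disk layout that should be checked against a real index directory.

```python
import shutil
from pathlib import Path


def find_legacy_lance_tables(index_dir: Path) -> list[Path]:
    """Return Lance table directories (assumed to be named `<table>.lance`)."""
    return sorted(p for p in index_dir.glob("*.lance") if p.is_dir())


def needs_lance_migration(index_dir: Path, db_name: str = "llmindex.db") -> bool:
    """True when a Lance table exists but the sqlite-vec database does not."""
    has_lance = bool(find_legacy_lance_tables(index_dir))
    return has_lance and not (index_dir / db_name).exists()


def remove_legacy_lance_tables(index_dir: Path) -> None:
    """Delete leftover Lance data once the full rebuild has been queued."""
    for table_dir in find_legacy_lance_tables(index_dir):
        shutil.rmtree(table_dir)
```

The caller would log, queue the full rebuild, and only then call `remove_legacy_lance_tables()`.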
|
||||
|
||||
PR #12968's migration machinery maps onto `index_meta['schema_version']`: structural migrations = create-new-table + `INSERT ... SELECT` + rename (vectors copied, no re-embed; same shape as the compaction rebuild); re-embed migrations = drop + full rebuild, jumping straight to the current version.
|
||||
|
||||
## Dependency changes
|
||||
|
||||
- Add: `sqlite-vec==0.1.9` (one ~100 KB platform wheel, zero Python deps).
|
||||
- Remove: `lancedb~=0.33.0` (and its pylance/lancedb wheels, ~40 MB). `pyarrow` leaves this module; check whether anything else in the AI stack still needs it before dropping from pyproject.
|
||||
|
||||
## Test plan notes
|
||||
|
||||
- pytest-style per project convention; the store tests can run against a tmp_path DB file (or `:memory:` for pure-logic tests; extension loading works on uv-managed CPython [VERIFIED]).
|
||||
- Port the existing `test_vector_store.py` surface; add dedicated tests for: upsert transactionality (no transient empty state mid-upsert from a second connection), NULL-coercion in `_row()`, k-slice behavior, EQ/IN filter correctness, compaction rebuild preserving rows byte-for-byte, vec_debug canary logging.
|
||||
- The qemu matrix (`/tmp/vstore-avx-test/`) can be re-run against any future sqlite-vec bump: `qemu-x86_64 -cpu Westmere venv/bin/python candidate_test.py sqlite_vec <dir>`.
|
||||
|
||||
## Benchmark harness
|
||||
|
||||
`src/bench_vector_store.py` -- standalone head-to-head comparison run during the migration window when both `PaperlessLanceVectorStore` and `PaperlessSqliteVecVectorStore` coexist (Task 3 Phase A of the implementation plan). After Phase B replaces `vector_store.py`, the Lance import fails gracefully and only the sqlite-vec half runs (useful for post-migration baseline checks).
|
||||
|
||||
```bash
|
||||
cd src
|
||||
uv run python bench_vector_store.py # auto-generates bench_data.pkl on first run
|
||||
uv run python bench_vector_store.py --regenerate # force re-embed
|
||||
```
|
||||
|
||||
**Phase 1 (data generation, skipped if `bench_data.pkl` exists):** Faker generates `--n-docs` (default 2000) fake documents -- title, body, correspondent, ISO timestamp. Each body is split into `--chunks-per-doc` (default 3) equal-length chunks (~6000 total nodes). A warm-up embed call fires before generation to ensure the model is resident on the GPU. All chunk texts are embedded via Ollama `/api/embed` in batches of 32 and saved to `bench_data.pkl`. The Faker seed is fixed at 42 for reproducibility.
|
||||
|
||||
**Phase 2 (benchmark):** Each store runs in an isolated `tempfile.TemporaryDirectory()`. Query vectors are drawn reproducibly from the corpus (every 10th node, wrapping).
|
||||
|
||||
| Operation | Reps | Metric |
|
||||
| ----------------------------------------- | ---- | --------------------- |
|
||||
| `add()` bulk insert | 1 | total time |
|
||||
| `query()` plain | 50 | p50 / p95 |
|
||||
| `query()` filtered (IN on 20% of doc IDs) | 50 | p50 / p95 |
|
||||
| `get_modified_times()` | 20 | p50 |
|
||||
| `upsert_document()` | 50 | p50 / p95 |
|
||||
| `compact()` | 1 | total time |
|
||||
| File size | -- | pre- and post-compact |
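
For reference, the p50/p95 columns and the "every 10th node, wrapping" query selection could be computed along these lines. This is an illustrative sketch, not the harness code: the helper names are hypothetical and the nearest-rank percentile definition is an assumption.

```python
import statistics
import time
from collections.abc import Callable


def time_repeated(fn: Callable[[], object], reps: int) -> list[float]:
    """Run `fn` `reps` times and return per-call wall time in milliseconds."""
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50_p95(samples: list[float]) -> tuple[float, float]:
    """Median and nearest-rank 95th percentile of the samples."""
    ordered = sorted(samples)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n), integer math
    return statistics.median(ordered), ordered[rank - 1]


def pick_query_indices(n_nodes: int, n_queries: int, stride: int = 10) -> list[int]:
    """Deterministic query selection: every `stride`-th node, wrapping around."""
    return [(i * stride) % n_nodes for i in range(n_queries)]
```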
|
||||
|
||||
**CLI flags:** `--n-docs` (2000), `--chunks-per-doc` (3), `--data-file` (`bench_data.pkl`), `--regenerate`, `--ollama-url` (`http://192.168.1.87:11434`), `--embed-model` (`qwen3-embedding:4b`), `--query-iters` (50).
|
||||
|
||||
**Dependencies:** `faker` and `httpx` must be available (`uv add --dev faker httpx` if not already installed).
|
||||
|
||||
## Risk register (from the 2026-06-10 issues audit)
|
||||
|
||||
| Risk | Ref | State | Disposition |
|
||||
| ------------------------------------------- | --------------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| 0.1.10+ wheels bake AVX, no dispatch | release CI change, verified on 0.1.10a4 | current | Pin 0.1.9; vec_debug canary; upstream ask before any bump |
|
||||
| DELETE never reclaims space; VACUUM ~50% | #54, #220 | open | Rebuild-based `compact()` above |
|
||||
| INSERT OR REPLACE broken on vec0 | #259 | open | Use DELETE+INSERT in txn (design already does) |
|
||||
| NULL metadata rejected | #141 | open | Sentinel `""` coercion (already current behavior) |
|
||||
| Partition-key IN returns k per partition | #142 | open | Avoided: document_id is a metadata column |
|
||||
| NOT IN silently under-delivers | #116 | open | Never emit NOT IN |
|
||||
| Locale strtod breaks JSON vector parsing | #241 | open | Always BLOB-bind vectors |
|
||||
| Single weekend maintainer; fix PRs languish | #226 | open | Mitigated by Mozilla sponsorship + Firefox vendoring (release-train consumer); pin + vendor-from-source remains the escape hatch (no sdist: #211) |
|
||||
| ANN index = one-way file format | 0.1.10 alphas | — | Do not adopt ANN until 0.1.10 final + flag audit |
|
||||
| Long-TEXT metadata DELETE bug | #274 | fixed in 0.1.9 | Floor requirement `>=0.1.9` already implied by pin |
|
||||