Marks some things as done

stumpylog
2026-06-12 11:38:20 -07:00
parent b2151acfd5
commit 85cd9b657b
6 changed files with 0 additions and 0 deletions
@@ -1,745 +0,0 @@
# LanceDB Schema Migration Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a schema versioning and migration system to the LanceDB vector store so that structural column changes can be applied in-place without re-embedding documents, avoiding token costs for users on paid embedding APIs.
**Architecture:** A `schema_version.json` file is written alongside the LanceDB data directory and tracks the current applied version. A `Migration` dataclass registry in `vector_store.py` holds ordered, typed migration steps; each migration is classified as `requires_reembed=True/False`. At index update time, structural-only migrations are applied in-place via LanceDB's `add_columns`/`alter_columns`/`drop_columns` APIs; if any pending migration requires re-embedding, the existing model-mismatch rebuild path is reused.
**Tech Stack:** Python 3.11, lancedb 0.33, pyarrow, pytest, pytest-mock, factory-boy
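The upgrade decision described under **Architecture** can be sketched without LanceDB. This is a minimal illustration, not the plan's implementation: `Step` and `plan_upgrade` are hypothetical names, and the sketch assumes a rebuild recreates the table at the newest schema, so no in-place steps are needed when a rebuild is triggered.

```python
from typing import NamedTuple


class Step(NamedTuple):
    """Hypothetical stand-in for one migration entry."""

    version: int
    requires_reembed: bool


def plan_upgrade(stored_version: int, steps: list[Step]) -> tuple[list[Step], bool]:
    """Return (structural steps to apply in place, whether a rebuild is needed)."""
    pending = [s for s in steps if s.version > stored_version]
    if any(s.requires_reembed for s in pending):
        # Assumption: the rebuild path recreates the table at the newest
        # schema, so nothing has to be applied in place beforehand.
        return [], True
    return pending, False


steps = [Step(2, False), Step(3, True), Step(4, False)]
structural_from_3, rebuild_from_3 = plan_upgrade(3, steps)
structural_from_1, rebuild_from_1 = plan_upgrade(1, steps)
```

Starting from version 3, only the structural step 4 is pending and no rebuild is needed; starting from version 1, the pending re-embed step 3 forces a rebuild.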
---
## File Map
| File | Change |
| --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `src/paperless_ai/vector_store.py` | Add `CURRENT_SCHEMA_VERSION`, `Migration` dataclass, version file helpers, migration methods; modify `_ensure_table` and `drop_table` |
| `src/paperless_ai/indexing.py` | Call migration inside `update_llm_index`'s `write_store` block |
| `src/paperless_ai/tests/test_vector_store.py` | New `TestSchemaVersioning` and `TestMigrations` test classes |
| `src/paperless_ai/tests/test_ai_indexing.py` | Two new integration tests for migration path |
---
## Task 1: Schema version file helpers
**Files:**
- Modify: `src/paperless_ai/vector_store.py`
- Test: `src/paperless_ai/tests/test_vector_store.py`
- [ ] **Step 1: Write the failing tests**
Add a new class at the bottom of `test_vector_store.py`:
```python
class TestSchemaVersioning:
    @pytest.fixture
    def uri(self, tmp_path: Path) -> str:
        return str(tmp_path / "idx")

    def test_version_file_written_on_table_creation(self, uri: str) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        version_file = Path(uri) / "schema_version.json"
        assert version_file.exists()
        assert json.loads(version_file.read_text())["version"] == CURRENT_SCHEMA_VERSION

    def test_stored_schema_version_returns_current_when_file_missing(
        self, uri: str
    ) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        (Path(uri) / "schema_version.json").unlink()
        reopened = PaperlessLanceVectorStore(uri=uri)
        assert reopened.stored_schema_version() == CURRENT_SCHEMA_VERSION

    def test_stored_schema_version_persists_after_reopen(self, uri: str) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        PaperlessLanceVectorStore(uri=uri).add([_node("1-0", "1", "text", 0.1)])
        reopened = PaperlessLanceVectorStore(uri=uri)
        assert reopened.stored_schema_version() == CURRENT_SCHEMA_VERSION

    def test_drop_table_removes_version_file(self, uri: str) -> None:
        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        assert (Path(uri) / "schema_version.json").exists()
        store.drop_table()
        assert not (Path(uri) / "schema_version.json").exists()

    def test_version_file_written_on_upsert_creation(self, uri: str) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = PaperlessLanceVectorStore(uri=uri)
        store.upsert_document("1", [_node("1-0", "1", "text", 0.1)])
        version_file = Path(uri) / "schema_version.json"
        assert json.loads(version_file.read_text())["version"] == CURRENT_SCHEMA_VERSION
```
Add `import json` and `import pytest_mock` to the top of `test_vector_store.py`.
- [ ] **Step 2: Run tests to verify they fail**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestSchemaVersioning -v"
```
Expected: all 5 tests fail with `ImportError` or `AttributeError`, because `CURRENT_SCHEMA_VERSION` and `stored_schema_version` don't exist yet.
- [ ] **Step 3: Implement the schema version helpers in `vector_store.py`**
After the existing imports and before the `DEFAULT_TABLE_NAME` constant, add:
```python
import json
from pathlib import Path
```
After `DEFAULT_TABLE_NAME = "documents"`, add:
```python
CURRENT_SCHEMA_VERSION: int = 1
```
No module-level helpers are needed after the `ANN_PQ_SUB_VECTORS` constant; the version methods live on the class.
Inside `PaperlessLanceVectorStore`, add these methods after `stored_model_name`:
```python
    @property
    def _schema_version_path(self) -> Path:
        return Path(self._uri) / "schema_version.json"

    def stored_schema_version(self) -> int:
        """Return the schema version recorded on disk, or CURRENT_SCHEMA_VERSION if missing.

        Missing means either the table predates versioning or was just created and the
        write hasn't happened yet; treat conservatively as already current.
        """
        try:
            return int(json.loads(self._schema_version_path.read_text())["version"])
        except (FileNotFoundError, KeyError, ValueError):
            return CURRENT_SCHEMA_VERSION

    def _write_schema_version(self, version: int) -> None:
        self._schema_version_path.parent.mkdir(parents=True, exist_ok=True)
        self._schema_version_path.write_text(json.dumps({"version": version}))
```
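The fallback behaviour of `stored_schema_version` can be checked in isolation. The following is a minimal sketch, assuming the same file layout as above; `read_version` is a hypothetical free function standing in for the method, and the version number 7 is arbitrary.

```python
import json
import tempfile
from pathlib import Path

CURRENT_SCHEMA_VERSION = 1  # mirrors the constant added above


def read_version(path: Path) -> int:
    """Hypothetical free-function version of stored_schema_version()."""
    try:
        return int(json.loads(path.read_text())["version"])
    except (FileNotFoundError, KeyError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError, so a corrupt
        # file falls back to the current version as well.
        return CURRENT_SCHEMA_VERSION


with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "schema_version.json"
    missing = read_version(version_file)  # file does not exist yet
    version_file.write_text("not json")
    corrupt = read_version(version_file)  # unparseable content
    version_file.write_text(json.dumps({"version": 7}))
    stored = read_version(version_file)  # well-formed file
```

Both the missing and the corrupt file resolve to `CURRENT_SCHEMA_VERSION`, while a well-formed file returns the recorded value.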
Modify `_ensure_table` to write the version after creating the table. Replace the current method body:
```python
    def _ensure_table(self, rows: list[dict[str, Any]], dim: int) -> bool:
        if self._table is not None:
            return False
        self._table = self._conn.create_table(
            self._table_name,
            rows,
            schema=self._schema(dim, self._embed_model_name),
        )
        self._write_schema_version(CURRENT_SCHEMA_VERSION)
        return True
```
Modify `drop_table` to also remove the version file:
```python
    def drop_table(self) -> None:
        if self.table_exists():
            self._conn.drop_table(self._table_name)
        self._table = None
        self._schema_version_path.unlink(missing_ok=True)
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestSchemaVersioning -v"
```
Expected: all 5 tests pass.
- [ ] **Step 5: Verify no regressions**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
```
Expected: all existing tests still pass.
- [ ] **Step 6: Lint**
```bash
ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
```
Expected: no errors.
- [ ] **Step 7: Commit**
```bash
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
git commit -m "feat(ai): add schema version file tracking to LanceDB vector store"
```
---
## Task 2: Migration dataclass and pending migration detection
**Files:**
- Modify: `src/paperless_ai/vector_store.py`
- Test: `src/paperless_ai/tests/test_vector_store.py`
- [ ] **Step 1: Write the failing tests**
Add a new class to `test_vector_store.py`:
```python
class TestMigrationRegistry:
    @pytest.fixture
    def uri(self, tmp_path: Path) -> str:
        return str(tmp_path / "idx")

    def _store_at_version(self, uri: str, version: int) -> PaperlessLanceVectorStore:
        """Create a store with a table and then fake its on-disk version."""
        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        store._write_schema_version(version)
        return PaperlessLanceVectorStore(uri=uri)  # reopen to pick up written version

    def test_pending_migrations_empty_at_current_version(self, uri: str) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        store = self._store_at_version(uri, CURRENT_SCHEMA_VERSION)
        assert store.pending_migrations() == []

    def test_pending_migrations_returns_migrations_above_stored_version(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
        m3 = Migration(version=3, description="reindex", requires_reembed=True, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])
        store = self._store_at_version(uri, 1)
        pending = store.pending_migrations()
        assert pending == [m2, m3]

    def test_pending_migrations_excludes_already_applied(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
        m3 = Migration(version=3, description="reindex", requires_reembed=True, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])
        store = self._store_at_version(uri, 2)
        pending = store.pending_migrations()
        assert pending == [m3]

    def test_pending_migrations_empty_when_no_table(self, uri: str) -> None:
        store = PaperlessLanceVectorStore(uri=uri)
        assert store.pending_migrations() == []

    def test_requires_reembed_migration_false_when_none_pending(self, uri: str) -> None:
        store = self._store_at_version(uri, 1)
        assert store.requires_reembed_migration() is False

    def test_requires_reembed_migration_false_when_only_structural_pending(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
        store = self._store_at_version(uri, 1)
        assert store.requires_reembed_migration() is False

    def test_requires_reembed_migration_true_when_reembed_migration_pending(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="reindex", requires_reembed=True, apply=lambda t: None)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
        store = self._store_at_version(uri, 1)
        assert store.requires_reembed_migration() is True
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMigrationRegistry -v"
```
Expected: all 7 tests fail — `Migration`, `MIGRATIONS`, `pending_migrations`, `requires_reembed_migration` don't exist yet.
- [ ] **Step 3: Add `Migration` dataclass and registry to `vector_store.py`**
Add near the top of the file, after the existing imports:
```python
from dataclasses import dataclass, field
from typing import Callable
```
After the `CURRENT_SCHEMA_VERSION` constant, add:
```python
@dataclass(frozen=True)
class Migration:
version: int
description: str
requires_reembed: bool
apply: Callable[[Any], None] = field(compare=False, hash=False)
```
(`compare=False, hash=False` excludes `apply` from `__eq__` and `__hash__`, so equality is driven by the remaining fields: `version`, `description`, and `requires_reembed`. This avoids lambda identity issues in tests and makes the API safe for callers that construct `Migration` instances inline.)
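The effect of `field(compare=False, hash=False)` can be demonstrated with the dataclass on its own. This is a small standalone sketch that repeats the `Migration` definition so it runs by itself; the instances and descriptions are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    apply: Callable[[Any], None] = field(compare=False, hash=False)


# Two instances built with distinct lambda objects.
a = Migration(2, "add col", False, apply=lambda t: None)
b = Migration(2, "add col", False, apply=lambda t: None)
# Same version, different description.
c = Migration(2, "different text", False, apply=lambda t: None)

same_despite_distinct_lambdas = a == b
same_hash = hash(a) == hash(b)
differs_on_description = a != c
```

`a` and `b` compare equal even though their `apply` callables are different objects, while `c` differs because `description` still participates in comparison.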
Below the dataclass, add the registry:
```python
# Ordered list of schema migrations. Each entry upgrades the table to `version`.
# Structural migrations (requires_reembed=False) are applied in-place via LanceDB's
# add_columns/alter_columns/drop_columns APIs; no re-embedding is needed.
# Migrations with requires_reembed=True cause a full rebuild on next index update,
# exactly like a model-name change does today.
#
# To add a migration:
# 1. Increment CURRENT_SCHEMA_VERSION.
# 2. Append a Migration entry here with the new version number.
# 3. For structural changes, call table.add_columns/alter_columns/drop_columns in apply().
# 4. For embedding-invalidating changes, set requires_reembed=True; apply() can be a no-op.
MIGRATIONS: list[Migration] = []
```
Inside `PaperlessLanceVectorStore`, add `pending_migrations` and `requires_reembed_migration` after the schema version helpers from Task 1:
```python
    def pending_migrations(self) -> list[Migration]:
        """Return migrations not yet applied to this table, in version order."""
        if self._table is None:
            return []
        current = self.stored_schema_version()
        return [m for m in MIGRATIONS if m.version > current]

    def requires_reembed_migration(self) -> bool:
        """True when any pending migration requires a full re-embedding."""
        return any(m.requires_reembed for m in self.pending_migrations())
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMigrationRegistry -v"
```
Expected: all 7 tests pass.
- [ ] **Step 5: Lint**
```bash
ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
```
- [ ] **Step 6: Commit**
```bash
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
git commit -m "feat(ai): add Migration registry and pending migration detection"
```
---
## Task 3: Apply structural migrations in-place
**Files:**
- Modify: `src/paperless_ai/vector_store.py`
- Test: `src/paperless_ai/tests/test_vector_store.py`
- [ ] **Step 1: Write the failing tests**
Add a new class to `test_vector_store.py`:
```python
class TestApplyStructuralMigrations:
    @pytest.fixture
    def uri(self, tmp_path: Path) -> str:
        return str(tmp_path / "idx")

    def _store_at_version(self, uri: str, version: int) -> PaperlessLanceVectorStore:
        store = PaperlessLanceVectorStore(uri=uri)
        store.add([_node("1-0", "1", "text", 0.1)])
        store._write_schema_version(version)
        return PaperlessLanceVectorStore(uri=uri)

    def test_apply_structural_adds_column_via_lancedb(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        def _add_extra(table: Any) -> None:
            table.add_columns({"extra": "CAST(NULL AS VARCHAR)"})

        m2 = Migration(version=2, description="add extra col", requires_reembed=False, apply=_add_extra)
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
        store = self._store_at_version(uri, 1)
        applied = store.apply_structural_migrations()
        assert len(applied) == 1
        assert applied[0] == m2
        # Column actually present in the table schema.
        reopened = PaperlessLanceVectorStore(uri=uri)
        field_names = [f.name for f in reopened._table.schema]
        assert "extra" in field_names

    def test_apply_structural_updates_version_file(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: t.add_columns({"c": "CAST(NULL AS VARCHAR)"}))
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
        store = self._store_at_version(uri, 1)
        store.apply_structural_migrations()
        assert store.stored_schema_version() == 2

    def test_apply_structural_skips_reembed_migrations(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        from paperless_ai.vector_store import Migration

        applied_versions: list[int] = []
        m2 = Migration(version=2, description="structural", requires_reembed=False, apply=lambda t: applied_versions.append(2) or t.add_columns({"c": "CAST(NULL AS VARCHAR)"}))
        m3 = Migration(version=3, description="reembed", requires_reembed=True, apply=lambda t: applied_versions.append(3))
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])
        store = self._store_at_version(uri, 1)
        applied = store.apply_structural_migrations()
        assert [m.version for m in applied] == [2]
        assert 3 not in applied_versions
        # Version advances only to the last structural migration applied.
        assert store.stored_schema_version() == 2

    def test_apply_structural_noop_at_current_version(self, uri: str) -> None:
        store = self._store_at_version(uri, 1)
        applied = store.apply_structural_migrations()
        assert applied == []

    def test_apply_structural_noop_when_no_table(self, uri: str) -> None:
        store = PaperlessLanceVectorStore(uri=uri)
        applied = store.apply_structural_migrations()
        assert applied == []

    def test_apply_structural_refreshes_table_reference(
        self, uri: str, mocker: pytest_mock.MockerFixture
    ) -> None:
        """After add_columns the in-memory table object must reflect the new schema."""
        from paperless_ai.vector_store import Migration

        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: t.add_columns({"extra": "CAST(NULL AS VARCHAR)"}))
        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
        store = self._store_at_version(uri, 1)
        store.apply_structural_migrations()
        # The store's own _table reference (not a re-open) must see the new column.
        field_names = [f.name for f in store._table.schema]
        assert "extra" in field_names
```
Add `from typing import Any` to the test file imports if not already present.
- [ ] **Step 2: Run tests to verify they fail**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestApplyStructuralMigrations -v"
```
Expected: all 6 tests fail — `apply_structural_migrations` doesn't exist yet.
- [ ] **Step 3: Implement `apply_structural_migrations` in `vector_store.py`**
Add after `requires_reembed_migration` on the class:
```python
    def apply_structural_migrations(self) -> list[Migration]:
        """Apply all pending structural (non-reembed) migrations in version order.

        Each applied migration's ``apply`` callable receives the live LanceDB table
        object and should call ``add_columns``, ``alter_columns``, or ``drop_columns``
        as needed. After all structural migrations run, the version file is updated
        to the highest version applied and the in-memory table reference is refreshed.

        Migrations with ``requires_reembed=True`` are skipped; the caller is
        responsible for detecting them via ``requires_reembed_migration()`` and
        triggering a full rebuild.
        """
        if self._table is None:
            return []
        structural = [m for m in self.pending_migrations() if not m.requires_reembed]
        if not structural:
            return []
        for migration in structural:
            logger.info("Applying schema migration v%d: %s", migration.version, migration.description)
            migration.apply(self._table)
        # Refresh the in-memory table so subsequent operations see the new schema.
        self._table = self._conn.open_table(self._table_name)
        self._write_schema_version(structural[-1].version)
        return structural
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestApplyStructuralMigrations -v"
```
Expected: all 6 tests pass.
- [ ] **Step 5: Full test_vector_store regression check**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
```
Expected: all tests pass.
- [ ] **Step 6: Lint**
```bash
ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
```
- [ ] **Step 7: Commit**
```bash
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
git commit -m "feat(ai): implement apply_structural_migrations for in-place schema changes"
```
---
## Task 4: Wire migrations into `update_llm_index`
**Files:**
- Modify: `src/paperless_ai/indexing.py`
- Test: `src/paperless_ai/tests/test_ai_indexing.py`
- [ ] **Step 1: Write the failing tests**
Add these two tests to `test_ai_indexing.py`, after the existing `test_update_llm_index_rebuilds_on_model_name_change` test:
```python
@pytest.mark.django_db
def test_update_llm_index_applies_structural_migration_without_rebuild(
    temp_llm_index_dir: Path,
    real_document: Document,
    mock_embed_model: FakeEmbedding,
    mocker: pytest_mock.MockerFixture,
) -> None:
    """Structural migrations are applied in-place; no full rebuild (drop) occurs."""
    from paperless_ai.vector_store import Migration, PaperlessLanceVectorStore

    column_added: list[bool] = []

    def _add_extra(table) -> None:
        table.add_columns({"extra": "CAST(NULL AS VARCHAR)"})
        column_added.append(True)

    # Build the initial index at version 1 (the real CURRENT_SCHEMA_VERSION; no patches needed).
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=True)

    # Simulate a new v2 structural migration being introduced after the initial index was built.
    m2 = Migration(version=2, description="add extra col", requires_reembed=False, apply=_add_extra)
    mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
    mocker.patch("paperless_ai.vector_store.CURRENT_SCHEMA_VERSION", 2)
    drop_spy = mocker.spy(PaperlessLanceVectorStore, "drop_table")
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=False)

    assert column_added, "Structural migration apply() was not called"
    drop_spy.assert_not_called()


@pytest.mark.django_db
def test_update_llm_index_forces_rebuild_on_reembed_migration(
    temp_llm_index_dir: Path,
    real_document: Document,
    mock_embed_model: FakeEmbedding,
    mocker: pytest_mock.MockerFixture,
) -> None:
    """A pending reembed migration causes a full drop+rebuild on next update."""
    from paperless_ai.vector_store import Migration, PaperlessLanceVectorStore

    # Build the initial index at version 1 (the real CURRENT_SCHEMA_VERSION; no patches needed).
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=True)

    # Simulate a reembed migration at v2 being introduced after the initial index was built.
    m2 = Migration(version=2, description="requires reembed", requires_reembed=True, apply=lambda t: None)
    mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
    mocker.patch("paperless_ai.vector_store.CURRENT_SCHEMA_VERSION", 2)
    drop_spy = mocker.spy(PaperlessLanceVectorStore, "drop_table")
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=False)

    drop_spy.assert_called()
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_applies_structural_migration_without_rebuild src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_forces_rebuild_on_reembed_migration -v"
```
Expected: both tests fail because `update_llm_index` doesn't call migration methods yet.
- [ ] **Step 3: Add migration check inside `update_llm_index` in `indexing.py`**
Inside the `with write_store(embed_model_name=model_name) as store:` block in `update_llm_index`, insert the migration check immediately before the `if rebuild or not store.table_exists():` line:
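```python
        if not rebuild and store.table_exists():
            # Check for re-embed migrations first: applying structural ones
            # advances the stored version and could otherwise hide an older
            # pending re-embed migration from pending_migrations().
            if store.requires_reembed_migration():
                logger.warning("Schema migration requires re-embedding; forcing LLM index rebuild.")
                rebuild = True
            else:
                store.apply_structural_migrations()
```
The relevant section of `update_llm_index` should now look like:
```python
    with write_store(embed_model_name=model_name) as store:
        if not rebuild and store.table_exists():
            # Check for re-embed migrations first: applying structural ones
            # advances the stored version and could otherwise hide an older
            # pending re-embed migration from pending_migrations().
            if store.requires_reembed_migration():
                logger.warning("Schema migration requires re-embedding; forcing LLM index rebuild.")
                rebuild = True
            else:
                store.apply_structural_migrations()
        if rebuild or not store.table_exists():
            (settings.LLM_INDEX_DIR / "meta.json").unlink(missing_ok=True)
            logger.info("Rebuilding LLM index.")
            store.drop_table()
        ...
```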
- [ ] **Step 4: Run new tests to verify they pass**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_applies_structural_migration_without_rebuild src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_forces_rebuild_on_reembed_migration -v"
```
Expected: both tests pass.
- [ ] **Step 5: Full indexing regression check**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
```
Expected: all existing tests still pass.
- [ ] **Step 6: Full AI module test run**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/ -v"
```
Expected: all tests pass.
- [ ] **Step 7: Lint**
```bash
ruff check src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
ruff format src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
```
- [ ] **Step 8: Commit**
```bash
git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
git commit -m "feat(ai): wire schema migrations into update_llm_index; structural changes avoid re-embed"
```
---
## How to add a migration (reference for future developers)
When a future schema change is needed:
1. Increment `CURRENT_SCHEMA_VERSION` in `vector_store.py`.
2. Append a `Migration` to `MIGRATIONS` with the new version number.
3. If the change is **structural only** (add/rename/drop a column, no embedding content changed):
- Set `requires_reembed=False`
- In `apply`, call `table.add_columns({"col": "CAST(NULL AS string)"})`, `table.drop_columns(["col"])`, or `table.alter_columns({"path": "col", "rename": "new_name"})` as appropriate.
4. If the change affects **what text gets embedded** (new fields in `build_llm_index_text`, chunk size change baked into schema, etc.):
- Set `requires_reembed=True`
- `apply` can be a no-op (`lambda t: None`) — the framework will trigger a full rebuild.
5. Write tests for the migration in `test_vector_store.py` following the `TestApplyStructuralMigrations` patterns.
Example structural migration adding a `language` column:
```python
CURRENT_SCHEMA_VERSION: int = 2

MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="Add language column for future locale-aware filtering",
        requires_reembed=False,
        apply=lambda table: table.add_columns({"language": "CAST(NULL AS string)"}),
    ),
]
```
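For comparison, a re-embedding entry (step 4 above) might look like the following sketch. The version 3 entry and its description are hypothetical, and the `Migration` dataclass is repeated only so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Migration:  # same shape as the dataclass defined in Task 2
    version: int
    description: str
    requires_reembed: bool
    apply: Callable[[Any], None] = field(compare=False, hash=False)


MIGRATIONS = [
    Migration(
        version=2,
        description="Add language column for future locale-aware filtering",
        requires_reembed=False,
        apply=lambda table: table.add_columns({"language": "CAST(NULL AS string)"}),
    ),
    Migration(
        version=3,
        description="Hypothetical: change which fields are embedded",
        requires_reembed=True,
        apply=lambda table: None,  # no-op; the rebuild regenerates every row
    ),
]

# An index stored at version 2 has only the re-embed migration pending.
stored_version = 2
pending = [m for m in MIGRATIONS if m.version > stored_version]
needs_rebuild = any(m.requires_reembed for m in pending)
```

With the index at version 2, only the version 3 entry is pending, and its `requires_reembed=True` flag is what signals the rebuild.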
@@ -1,446 +0,0 @@
# Node Metadata Enrichment Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Move `filename`, `storage_path`, and `archive_serial_number` from the LanceDB embedding text into `node.metadata`, and register a schema migration that triggers an automatic index rebuild on upgrade.
**Architecture:** Three small, independent changes to two source files, tested first. The migration is a no-op `apply` (the rebuild regenerates all nodes with correct metadata). All three tests go red first, then each implementation makes them green.
**Tech Stack:** pytest, pytest-django, pytest-mock, factory_boy, llama_index `MetadataMode`, `feature-lancedb-schema-migrate` branch (must be the base branch for this work).
**Branch base:** `feature-lancedb-schema-migrate`
---
### Task 1: Fail — embedding text no longer contains the three fields
**Files:**
- Modify: `src/paperless_ai/tests/test_embedding.py`
- [ ] **Step 1: Update `mock_document` fixture to set an explicit `storage_path`**
The fixture currently doesn't set `storage_path`, so the existing code path (`doc.storage_path.name if doc.storage_path else ''`) would call `.name` on a `MagicMock`. Give it an explicit value so assertions are unambiguous.
Add these two lines to the `mock_document` fixture after `doc.archive_serial_number = "12345"`:
```python
doc.storage_path = MagicMock()
doc.storage_path.name = "Finance/Bills"
```
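The ambiguity comes from how `MagicMock` handles attribute access. This short sketch uses only `unittest.mock` and shows the difference between the implicit and the explicit value; the variable names are illustrative.

```python
from unittest.mock import MagicMock

doc = MagicMock()
# Attribute access on a MagicMock auto-creates a truthy child mock, so the
# conditional takes the first branch and yields another MagicMock.
implicit = doc.storage_path.name if doc.storage_path else ""

# Setting the attribute explicitly gives the assertions a concrete string.
doc.storage_path = MagicMock()
doc.storage_path.name = "Finance/Bills"
explicit = doc.storage_path.name if doc.storage_path else ""
```

Without the explicit assignment, any text built from `implicit` would contain a mock's repr rather than a storage path name.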
- [ ] **Step 2: Update `test_build_llm_index_text` — flip and add assertions**
The existing test asserts these fields ARE in the result. Change them to assert they are NOT, and add the two missing ones:
```python
# was: assert "Filename: test_file.pdf" in result
assert "Filename: test_file.pdf" not in result
assert "Storage Path: Finance/Bills" not in result
assert "Archive Serial Number: 12345" not in result
```
The assertions for `Notes`, `Content`, and `Custom Field` lines are unchanged — leave them as-is.
- [ ] **Step 3: Run the test to confirm it fails**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py::test_build_llm_index_text -v"
```
Expected: `FAILED` — `AssertionError: assert 'Filename: test_file.pdf' not in '...'`
---
### Task 2: Pass — remove the three fields from `build_llm_index_text`
**Files:**
- Modify: `src/paperless_ai/embedding.py`
- [ ] **Step 1: Remove the three lines and the TODO comment**
Current `build_llm_index_text` (lines 114-133). Replace the function body:
```python
def build_llm_index_text(doc: Document) -> str:
    lines = [
        f"Notes: {','.join([str(c.note) for c in Note.objects.filter(document=doc)])}",
    ]
    for instance in doc.custom_fields.all():
        lines.append(f"Custom Field - {instance.field.name}: {instance}")
    lines.append("\nContent:\n")
    lines.append(doc.content or "")
    return _normalize_llm_index_text("\n".join(lines))
```
- [ ] **Step 2: Run the test to confirm it passes**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py::test_build_llm_index_text -v"
```
Expected: `PASSED`
- [ ] **Step 3: Run the full embedding test module to catch regressions**
```bash
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py -v"
```
Expected: all green.
- [ ] **Step 4: Commit**
```bash
git add src/paperless_ai/embedding.py src/paperless_ai/tests/test_embedding.py
git commit -m "refactor(ai): remove filename/storage_path/asn from embedding text"
```
---
### Task 3: Fail — `build_document_node` exposes the three fields in metadata
**Files:**
- Modify: `src/paperless_ai/tests/test_ai_indexing.py`
- [ ] **Step 1: Extend `test_build_document_node_structured_fields_in_metadata`**
This test already checks for `title`, `tags`, etc. Add the three new keys. The `real_document` fixture creates a document with no storage path set, so `storage_path` will be `None` — the key must still be present.
Replace the existing test body:
```python
@pytest.mark.django_db
def test_build_document_node_structured_fields_in_metadata(
    real_document: Document,
) -> None:
    """Structured fields must be in node.metadata so the LLM receives them via metadata prepend."""
    nodes = indexing.build_document_node(real_document)
    assert len(nodes) > 0
    for node in nodes:
        assert "title" in node.metadata
        assert "tags" in node.metadata
        assert "correspondent" in node.metadata
        assert "document_type" in node.metadata
        assert "created" in node.metadata
        assert "added" in node.metadata
        assert "modified" in node.metadata
        assert "filename" in node.metadata
        assert "storage_path" in node.metadata  # None is fine; key must exist
        assert "archive_serial_number" in node.metadata
```
- [ ] **Step 2: Add a test that storage_path carries the name when set**
Add a new test function after `test_build_document_node_structured_fields_in_metadata`:
```python
@pytest.mark.django_db
def test_build_document_node_storage_path_name_in_metadata() -> None:
    """storage_path metadata value is the StoragePath name, not None, when set."""
    from documents.tests.factories import DocumentFactory, StoragePathFactory

    sp = StoragePathFactory(name="Finance/Bills")
    doc = DocumentFactory(storage_path=sp)
    nodes = indexing.build_document_node(doc)
    assert len(nodes) > 0
    for node in nodes:
        assert node.metadata["storage_path"] == "Finance/Bills"
```
- [ ] **Step 3: Add a test that all three new fields are in `excluded_embed_metadata_keys`**
Add after the previous test:
```python
@pytest.mark.django_db
def test_build_document_node_new_fields_excluded_from_embedding(
    real_document: Document,
) -> None:
    """filename, storage_path, and archive_serial_number must not appear in embedding text."""
    from llama_index.core.schema import MetadataMode

    nodes = indexing.build_document_node(real_document)
    assert len(nodes) > 0
    for node in nodes:
        assert "filename" in node.excluded_embed_metadata_keys
        assert "storage_path" in node.excluded_embed_metadata_keys
        assert "archive_serial_number" in node.excluded_embed_metadata_keys
        embed_text = node.get_content(metadata_mode=MetadataMode.EMBED)
        assert "filename" not in embed_text
        assert "storage_path" not in embed_text
        assert "archive_serial_number" not in embed_text
```
- [ ] **Step 4: Run the new tests to confirm they fail**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_structured_fields_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_storage_path_name_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_new_fields_excluded_from_embedding -v"
```
Expected: all `FAILED` — keys not yet in `node.metadata`.
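The difference between the embedding view and the LLM view that these tests rely on can be illustrated without LlamaIndex. The following is a minimal sketch only: `render` is a hypothetical stand-in for `node.get_content(metadata_mode=...)`, and the `key: value` line format is an assumption, not the library's actual template.

```python
def render(text: str, metadata: dict[str, object], excluded_keys: list[str]) -> str:
    """Prepend a 'key: value' line for every metadata key that is not excluded.

    Hypothetical helper for illustration; the real library formats metadata
    with its own templates.
    """
    lines = [
        f"{key}: {value}" for key, value in metadata.items() if key not in excluded_keys
    ]
    return "\n".join([*lines, text])


# Sample values are made up for the example.
metadata = {
    "document_id": "42",
    "title": "Electricity bill",
    "filename": "bill.pdf",
    "storage_path": None,
    "archive_serial_number": 1001,
}

# Mirrors the plan: every key is excluded from the embedding view,
# while only document_id is excluded from the LLM view.
embed_view = render("body text", metadata, excluded_keys=list(metadata.keys()))
llm_view = render("body text", metadata, excluded_keys=["document_id"])
```

Under these assumptions the embedding view contains only the body text, while the LLM view still carries `filename`, `storage_path`, and `archive_serial_number`.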
---
### Task 4: Pass — add the three fields to `build_document_node`
**Files:**
- Modify: `src/paperless_ai/indexing.py`
- [ ] **Step 1: Update the `metadata` dict in `build_document_node`**
Current metadata dict starts at line 106. Replace it:
```python
    metadata = {
        "document_id": str(document.id),
        "title": document.title,
        "filename": document.filename or "",
        "storage_path": document.storage_path.name if document.storage_path else None,
        "archive_serial_number": document.archive_serial_number,
        "tags": [t.name for t in document.tags.all()],
        "correspondent": document.correspondent.name
        if document.correspondent
        else None,
        "document_type": document.document_type.name
        if document.document_type
        else None,
        "created": document.created.isoformat() if document.created else None,
        "added": document.added.isoformat() if document.added else None,
        "modified": document.modified.isoformat(),
    }
```
- [ ] **Step 2: Verify `excluded_embed_metadata_keys` (no code change expected)**
The `LlamaDocument(...)` call currently has:
```python
excluded_embed_metadata_keys=list(metadata.keys()),
```
This already excludes all keys, so no change needed here — the new keys are automatically included since they're in the dict. Verify `excluded_llm_metadata_keys` still only excludes `"document_id"`:
```python
excluded_llm_metadata_keys=["document_id"],
```
No change needed.
- [ ] **Step 3: Run the failing tests to confirm they pass**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_structured_fields_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_storage_path_name_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_new_fields_excluded_from_embedding -v"
```
Expected: all `PASSED`.
- [ ] **Step 4: Run the full indexing test module**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
```
Expected: all green.
- [ ] **Step 5: Commit**
```bash
git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
git commit -m "feat(ai): add filename/storage_path/asn to node metadata"
```
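The handling of the three new fields, and the reason `list(metadata.keys())` needs no edit, can be sketched in isolation. This is an illustration under stated assumptions: `FakeDocument` and `FakeStoragePath` are hypothetical stand-ins for the Django models, and `new_metadata_fields` is not a function in the codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeStoragePath:
    """Hypothetical stand-in for the StoragePath model."""

    name: str


@dataclass
class FakeDocument:
    """Hypothetical stand-in for the Document model (only the three new fields)."""

    filename: Optional[str] = None
    storage_path: Optional[FakeStoragePath] = None
    archive_serial_number: Optional[int] = None


def new_metadata_fields(document: FakeDocument) -> dict[str, object]:
    """Build the three new metadata entries the same way the plan's dict does."""
    return {
        "filename": document.filename or "",
        "storage_path": document.storage_path.name if document.storage_path else None,
        "archive_serial_number": document.archive_serial_number,
    }


unset = new_metadata_fields(FakeDocument())
filed = new_metadata_fields(
    FakeDocument(
        filename="bill.pdf",
        storage_path=FakeStoragePath("Finance/Bills"),
        archive_serial_number=7,
    ),
)

# Deriving the exclusion list from the dict picks up new keys automatically.
excluded = list(filed.keys())
```

Note that `storage_path` is present with a value of `None` when unset, which is what the Task 3 key-presence assertion expects.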
---
### Task 5: Fail — migration v2 is registered
**Files:**
- Modify: `src/paperless_ai/tests/test_vector_store.py`
These tests use the real (non-mocked) `MIGRATIONS` list, so they go red until the migration is registered in Task 6.
- [ ] **Step 1: Add a `TestMetadataEnrichmentMigration` class**
Add this class near the end of `test_vector_store.py`, before the final `TestApplyStructuralMigrations`:
```python
class TestMetadataEnrichmentMigration:
    def test_current_schema_version_is_2(self) -> None:
        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION

        assert CURRENT_SCHEMA_VERSION == 2

    def test_migration_v2_registered(self) -> None:
        from paperless_ai.vector_store import MIGRATIONS

        assert len(MIGRATIONS) == 1
        assert MIGRATIONS[0].version == 2
        assert MIGRATIONS[0].requires_reembed is True

    def test_store_at_v1_requires_reembed(self, uri: str) -> None:
        store = _store_at_version(uri, 1)
        assert store.requires_reembed_migration() is True

    def test_store_at_v2_no_pending_migrations(self, uri: str) -> None:
        store = _store_at_version(uri, 2)
        assert store.pending_migrations() == []
        assert store.requires_reembed_migration() is False
```
- [ ] **Step 2: Run the tests to confirm they fail**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMetadataEnrichmentMigration -v"
```
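Expected: `FAILED` for the first three tests, since `CURRENT_SCHEMA_VERSION` is still 1 and `MIGRATIONS` is still empty. `test_store_at_v2_no_pending_migrations` only asserts that nothing is pending, so it may already pass at this point.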
---
### Task 6: Pass — register migration v2 in `vector_store.py`
**Files:**
- Modify: `src/paperless_ai/vector_store.py`
- [ ] **Step 1: Add the migration and bump the version constant**
On the `feature-lancedb-schema-migrate` branch, `vector_store.py` has:
```python
CURRENT_SCHEMA_VERSION: Final[int] = 1
...
MIGRATIONS: list[Migration] = []
```
Change both:
```python
CURRENT_SCHEMA_VERSION: Final[int] = 2

MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="move filename/storage_path/asn from embedding text to metadata; rebuild required",
        requires_reembed=True,
        apply=lambda table: None,
    ),
]
```
- [ ] **Step 2: Run the migration tests to confirm they pass**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMetadataEnrichmentMigration -v"
```
Expected: all `PASSED`.
- [ ] **Step 3: Run the full vector store test module**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
```
Expected: all green. In particular, `TestSchemaVersioning::test_stored_schema_version_persists_after_reopen` and the `TestMigrationRegistry` tests should still pass — they use `CURRENT_SCHEMA_VERSION` as the baseline.
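How a registry like this might answer "what is pending?" and "is a rebuild required?" can be sketched as follows. The `Migration` fields match the ones used in this plan, but the free functions below are assumptions for illustration; in the plan they are methods on the store and their real implementation may differ.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    apply: Callable[[Any], None]


MIGRATIONS: list[Migration] = [
    Migration(
        version=2,
        description="metadata enrichment; rebuild required",
        requires_reembed=True,
        apply=lambda table: None,  # no-op: the rebuild does the actual work
    ),
]


def pending_migrations(stored_version: int) -> list[Migration]:
    """Return migrations newer than the stored version, in ascending order."""
    newer = [m for m in MIGRATIONS if m.version > stored_version]
    return sorted(newer, key=lambda m: m.version)


def requires_reembed_migration(stored_version: int) -> bool:
    """True when at least one pending migration needs a full re-embed."""
    return any(m.requires_reembed for m in pending_migrations(stored_version))
```

With this shape, a store at version 1 reports one pending migration that requires re-embedding, and a store at version 2 reports none, which matches what the Task 5 tests assert.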
---
### Task 7: Integration — `update_llm_index` rebuilds when schema version is stale
**Files:**
- Modify: `src/paperless_ai/tests/test_ai_indexing.py`
- [ ] **Step 1: Write the failing integration test**
Add this test near `test_update_llm_index_rebuilds_on_model_name_change`:
```python
@pytest.mark.django_db
def test_update_llm_index_rebuilds_on_pending_reembed_migration(
    temp_llm_index_dir: Path,
    real_document: Document,
    mock_embed_model: FakeEmbedding,
) -> None:
    """A stale schema version (v1) must trigger a full rebuild on the next index run."""
    from paperless_ai.vector_store import PaperlessLanceVectorStore

    # Build an initial index and then rewind the schema version to 1 to simulate
    # an index created before migration v2 was registered.
    indexing.update_llm_index(rebuild=True)
    store = indexing.get_vector_store()
    store._write_schema_version(1)

    # An incremental run (rebuild=False) must detect the stale version and rebuild.
    with patch("documents.models.Document.objects.all") as mock_all:
        mock_queryset = MagicMock()
        mock_queryset.exists.return_value = True
        mock_queryset.__iter__.return_value = iter([real_document])
        mock_all.return_value = mock_queryset
        indexing.update_llm_index(rebuild=False)

    # After rebuild the schema version must be current.
    reopened = PaperlessLanceVectorStore(uri=str(temp_llm_index_dir))
    assert reopened.stored_schema_version() == 2
```
- [ ] **Step 2: Run the test to confirm it fails**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_rebuilds_on_pending_reembed_migration -v"
```
Expected: `FAILED` only if this runs before Task 6, because without migration v2 registered the schema version stays at 1.
_(If Task 6 is already complete the test passes here. In that case, or if it passes because `update_llm_index` detects a different condition, verify the assertion is actually exercising the migration path and not the model-name path.)_
- [ ] **Step 3: Run the test again now that migration v2 is registered (Task 6)**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_rebuilds_on_pending_reembed_migration -v"
```
Expected: `PASSED`.
- [ ] **Step 4: Run the full indexing test module**
```
bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
```
Expected: all green.
- [ ] **Step 5: Final commit**
```bash
git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py src/paperless_ai/tests/test_ai_indexing.py
git commit -m "feat(ai): register schema migration v2; triggers rebuild for metadata enrichment"
```
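The behaviour Task 7 verifies, a stale version file forcing a rebuild on an incremental run, can be sketched end to end with plain files. Everything here is an assumption for illustration: the `{"version": N}` JSON layout, the helper names, and the rule that a missing file also forces a rebuild. It also simplifies by treating any stale version as requiring a rebuild, which holds only because the single registered migration has `requires_reembed=True`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

CURRENT_SCHEMA_VERSION = 2
VERSION_FILE = "schema_version.json"  # written alongside the index data


def write_schema_version(index_dir: Path, version: int) -> None:
    """Persist the applied schema version (assumed JSON layout)."""
    (index_dir / VERSION_FILE).write_text(json.dumps({"version": version}))


def stored_schema_version(index_dir: Path) -> Optional[int]:
    """Return the persisted version, or None when no version file exists."""
    path = index_dir / VERSION_FILE
    if not path.exists():
        return None
    return json.loads(path.read_text())["version"]


def update_index(index_dir: Path, *, rebuild: bool) -> bool:
    """Return True when a full rebuild was performed."""
    stored = stored_schema_version(index_dir)
    needs_rebuild = rebuild or stored is None or stored < CURRENT_SCHEMA_VERSION
    if needs_rebuild:
        # A real rebuild would drop the table and re-embed every document here.
        write_schema_version(index_dir, CURRENT_SCHEMA_VERSION)
    return needs_rebuild


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    update_index(index_dir, rebuild=True)
    write_schema_version(index_dir, 1)  # rewind to simulate a pre-v2 index
    rebuilt = update_index(index_dir, rebuild=False)
    version_after = stored_schema_version(index_dir)
    rebuilt_again = update_index(index_dir, rebuild=False)
```

The second incremental run does not rebuild, since the version file is current again after the first one.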
---
## Self-review checklist
**Spec coverage:**
- ✅ `build_llm_index_text` — three lines removed (Tasks 1-2)
- ✅ `build_document_node` — three fields added to metadata + excluded_embed_metadata_keys (Tasks 3-4)
- ✅ Migration v2 registered with `requires_reembed=True` and no-op apply (Tasks 5-6)
- ✅ `update_llm_index` triggers rebuild on stale schema (Task 7)
- ✅ Tests: `test_embedding.py`, `test_ai_indexing.py`, `test_vector_store.py`
**Placeholder scan:** None found. Every step has exact code or exact commands.
**Type consistency:**
- `metadata` dict key names (`"filename"`, `"storage_path"`, `"archive_serial_number"`) used consistently across Tasks 1-4.
- `CURRENT_SCHEMA_VERSION = 2` and `MIGRATIONS[0].version == 2` are consistent across Tasks 5-6.
- `_store_at_version` and `_node` helpers referenced in Task 5 are defined in the existing `test_vector_store.py` on the `feature-lancedb-schema-migrate` branch.